
Evaluating Algorithm Performance Metrics Tailored for Prognostics

Abhinav Saxena
Research Institute for Advanced Computer Science
650-604-3208, [email protected]

José Celaya
Research Institute for Advanced Computer Science
650-604-4596, [email protected]

Bhaskar Saha
Mission Critical Technologies, Inc.
650-604-4379, [email protected]

Sankalita Saha
Research Institute for Advanced Computer Science
650-604-4593, [email protected]

Kai Goebel
National Aeronautics and Space Administration
650-604-4204, [email protected]

Prognostics Center of Excellence
NASA Ames Research Center, Moffett Field CA 94035

Abstract—Prognostics has taken center stage in Condition Based Maintenance (CBM), where it is desired to estimate the Remaining Useful Life (RUL) of a system so that remedial measures may be taken in advance to avoid catastrophic events or unwanted downtimes. Validation of such predictions is an important but difficult proposition, and a lack of appropriate evaluation methods renders prognostics meaningless. Evaluation methods currently used in the research community are not standardized and in many cases do not sufficiently assess key performance aspects expected of a prognostics algorithm. In this paper we introduce several new evaluation metrics tailored for prognostics and show that they can effectively evaluate various algorithms as compared to other conventional metrics. Specifically, four algorithms are compared: Relevance Vector Machine (RVM), Gaussian Process Regression (GPR), Artificial Neural Network (ANN), and Polynomial Regression (PR). These algorithms vary in complexity and in their ability to manage uncertainty around predicted estimates. Results show that the new metrics rank these algorithms differently, and that depending on the requirements and constraints suitable metrics may be chosen. Beyond these results, these metrics offer ideas about how metrics suitable to prognostics may be designed so that the evaluation procedure can be standardized.

TABLE OF CONTENTS

1. INTRODUCTION
2. MOTIVATION
3. PREVIOUS WORK
4. APPLICATION DOMAIN
5. ALGORITHMS EVALUATED
6. PERFORMANCE METRICS
7. RESULTS & DISCUSSION
8. CONCLUSION
REFERENCES
BIOGRAPHIES

1 U.S. Government work not protected by U.S. copyright.
2 IEEEAC paper #1387, Version 1, Updated November 2, 2008.

1. INTRODUCTION

Prognostics is an emerging concept in condition based maintenance (CBM) of critical systems. Along with developing the fundamentals of being able to confidently predict Remaining Useful Life (RUL), the technology calls for fielded applications as it inches towards maturation. This requires a stringent performance evaluation so that the significance of the concept can be fully exploited. Currently, prognostics concepts lack standard definitions and suffer from ambiguous and inconsistent interpretations. This lack of standards is in part due to the varied end-user requirements for different applications, the wide range of time scales involved, the available domain information, and the domain dynamics, to name a few issues. Instead, the research community has used a variety of metrics based largely on convenience with respect to their respective requirements. Very little attention has been focused on establishing a common ground to compare different efforts.

This paper builds upon previous work that surveyed metrics in use for prognostics in a variety of domains including medicine, nuclear, automotive, aerospace, and electronics [1]. That effort suggested a list of metrics to assess critical aspects of RUL predictions. This paper will show how such metrics can be used to assess the performance of a prognostics algorithm. Furthermore, it will assess whether these metrics capture the performance criteria they were designed for. The paper will focus on metrics that are specifically designed for prognostics, beyond conventional metrics currently being used for diagnostics and other forecasting applications. These metrics in general address the issue of how well the RUL prediction estimates improve


over time as more measurement data become available. A good prognostic algorithm should not only improve in RUL estimation but also ensure a reasonable prediction horizon and confidence levels on the predictions.

Overall, the paper is expected to enhance the general understanding of these metrics so that they can be further refined and accepted by the research community as standard metrics for the performance assessment of prognostics algorithms.

2. MOTIVATION

Prognostics technology is reaching a point where it must be evaluated in real-world environments in a truly integrated fashion. This, however, requires rigorous testing and evaluation on a variety of performance measures before prognostic systems can be certified for critical systems. For end-of-life predictions of critical systems, it becomes imperative to establish a fair amount of faith in prognostic systems before incorporating their predictions into the decision-making process. Furthermore, performance metrics help establish design requirements that must be met. In the absence of standardized metrics it has been difficult to quantify acceptable performance limits and to specify crisp and unambiguous requirements to the designers. Performance evaluation allows different algorithms to be compared and also yields constructive feedback to further improve these algorithms.

Performance evaluation is usually the foremost step once a new technique is developed. In many cases benchmark datasets or models are used to evaluate such techniques on a common ground so they can be fairly compared. Prognostic systems, in most cases, have neither of these options. Different researchers have used different metrics to evaluate their algorithms, which makes it rather difficult to compare various algorithms even if they have been declared successful based on their respective evaluations. It is accepted that prognostics methods must be tailored for specific applications, which makes it difficult to develop a generic algorithm useful for every situation. In such cases customized metrics may be used, but there are characteristics of prognostics applications that remain unchanged, and the corresponding performance evaluation can lay a common ground for comparisons. So far very little has been done to identify a common ground when it comes to testing and comparing different algorithms. In two surveys of methods for prognostics (one of data-driven methods and one of artificial-intelligence-based methods) [2, 3], it can be seen that there is a lack of standardized methodology for performance evaluation and in many cases performance evaluation is not even formally addressed. Even the ISO standard [4] for prognostics in condition monitoring and diagnostics of machines lacks a firm definition of such metrics. Therefore, in this paper we present several new metrics and show how they can be effectively used to compare different algorithms. With these ideas we hope to provide some starting points for future discussions.

3. PREVIOUS WORK

In a recent effort, a thorough survey of various application domains that employ prediction-related tasks was conducted [1]. The central idea was to identify established methods of performance evaluation in domains that can be considered mature and that already have fielded applications. Specifically, domains like medicine, weather, nuclear, finance and economics, automotive, aerospace, and electronics were considered. The survey revealed that although each domain employs a variety of custom metrics, metrics based on accuracy and precision dominated the landscape. However, these metrics were often used in different contexts depending on the type of data available and the kind of information derived from them. This suggests that one must interpret the usage very carefully before borrowing any concepts from other domains. A brief summary of the findings is presented here for reference.

Domains like medicine and finance heavily utilize statistical measures. These domains benefit from the availability of large datasets under different conditions. Predictions in medicine are based on hypothesis testing methodologies, and metrics like accuracy, precision, inter-separability, and resemblance are computed on test outcomes. In finance, statistical measures are computed on errors derived from reference prediction models. Metrics like MSE (mean squared error), MAD (mean absolute deviation), MdAD (median absolute deviation), MAPE (mean absolute percentage error), and their several variations are widely used. These metrics represent different ways of expressing accuracy and precision measures. The domain of weather prediction mainly uses two classes of evaluation methods: error-based statistics and measures of resolution between two outcomes. The related domain of windmill power prediction uses the statistical measures already listed above. Other domains like aerospace, electronics, and nuclear are relatively immature as far as fielded prognostics applications are concerned. In addition to conventional accuracy and precision measures, a significant focus has been on metrics that assess business merits, like ROI (return on investment), TV (technical value), and life cycle cost, alongside reliability-based metrics like MTBF (mean time between failures) or the ratio MTBF/MTBUR (mean time between failures to mean time between unit replacements).

Several classifications of these metrics, derived from the end use of the prognostics information, have been presented in [1]. It has been argued that, depending on end-user requirements, one must choose an appropriate set of these metrics or their variants to evaluate the performance of the algorithms.


4. APPLICATION DOMAIN

In this section we describe the application domain used to show how the new prognostics metrics may be applied and how they can be used to compare different algorithms.

INL Battery Dataset

In 1998 the Office of Vehicle Technologies at the U.S. Department of Energy initiated the Advanced Technology Development (ATD) program in order to find solutions to the barriers limiting commercialization of high-power Lithium-ion batteries for hybrid-electric and plug-in electric vehicles. Under this program, a set of second-generation 18650-size Lithium-ion cells were cycle-life tested at the Idaho National Laboratory (INL).

The cells were aged under different experimental settings such as temperature, State-of-Charge (SOC), and current load. Regular characterization tests were performed to measure behavioral changes from the baseline under different aging conditions. The test matrix consisted of three SOCs (60, 80, and 100%), four temperatures (25, 35, 45, and 55°C), and three different life tests (calendar-life, cycle-life, and accelerated-life) [5]. Electrochemical Impedance Spectroscopy (EIS) measurements were recorded every four weeks to estimate battery health. The EIS measurements were then used to extract internal resistance parameters (RE and RCT, see Figure 1) that have been shown to empirically characterize ageing using a lumped parameter model for the Li-ion batteries [6].

Figure 1 – Internal resistance parameter values (RE + RCT), extracted as features from EIS measurements over 0–72 weeks, are used to characterize battery health.

As shown in Figure 2, battery capacity was also measured in ampere-hours from the time and currents recorded during the discharge cycles of the batteries. For the data used in our study, the cells were aged at 60% SOC and at temperatures of 25°C and 45°C. The 25°C data is used solely for training, while the 45°C data is used for both training and testing.

Different approaches can be taken to predict battery life based on the above measurements. One approach uses the EIS measurements to compute RE+RCT and then applies prediction algorithms to predict the evolution of these parameters. RE+RCT has been shown to be directly connected to battery capacity, and hence its evolution curve can be readily transformed into battery RUL. Another approach directly tracks battery capacity and trends it to arrive at RUL estimates. In the next sections we describe our prediction algorithms and the corresponding approaches that were used to estimate battery life.

Figure 2 – Battery capacity decay profile at 45°C; the capacity curve crosses the failure threshold at EOL = 64.46 weeks.

5. ALGORITHMS EVALUATED

In this effort we chose four data-driven algorithms to show the effectiveness of the various metrics in evaluating performance. These algorithms range from simple polynomial regression to sophisticated Bayesian learning methods. The approaches used here have been described before in [6, 7], but they are repeated for the sake of completeness and readability. Also briefly described is how each of these algorithms was applied to the battery health management dataset.

Polynomial Regression (PR) Approach

We employed a simple data-driven routine to establish a baseline for battery health prediction performance and uncertainty assessment. For this data-driven approach, as the first step, the equivalent damage threshold in RE+RCT (dth = 0.033) is gleaned from the relationship between RE+RCT and the capacity C at the baseline temperature (25°C). Next, via features extracted from the EIS measurements, RE+RCT was tracked at the elevated temperature (here 45°C). Ignoring the first two data points (which behave similarly to what is considered a "wear-in" pattern in other domains), a second-degree polynomial was fit at each prediction point and extrapolated out to the damage threshold. This extrapolation is then used to compute the expected RUL values.
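To make the baseline concrete, the following minimal sketch (Python/NumPy, with hypothetical weekly RE+RCT arrays and an illustrative helper name) fits a second-degree polynomial to the feature history available up to the prediction time and solves for the crossing of the damage threshold; only the polynomial degree and dth = 0.033 are taken from the text above.

```python
import numpy as np

def rul_from_polynomial_fit(weeks, re_rct, t_p, d_th=0.033, degree=2):
    """Fit a polynomial to RE+RCT observed up to t_p, extrapolate to the
    damage threshold d_th, and return the estimated RUL (weeks)."""
    weeks = np.asarray(weeks, dtype=float)
    re_rct = np.asarray(re_rct, dtype=float)
    mask = weeks <= t_p
    # Skip the first two "wear-in" points, as described in the text.
    x, y = weeks[mask][2:], re_rct[mask][2:]

    # Second-degree polynomial fit of the degradation feature vs. time.
    poly = np.poly1d(np.polyfit(x, y, degree))

    # Find where the extrapolated curve crosses the damage threshold.
    roots = (poly - d_th).roots
    crossings = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > t_p]
    if not crossings:
        return None  # the fit never reaches the threshold
    return min(crossings) - t_p  # estimated remaining useful life

# Hypothetical usage with 4-weekly measurements:
# weeks = np.arange(0, 76, 4); re_rct = ...  # measured feature values
# rul_32 = rul_from_polynomial_fit(weeks, re_rct, t_p=32)
```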

Relevance Vector Machines (RVM)

The Relevance Vector Machine (RVM) [8] is a Bayesian framework representing a generalized linear model with the same functional form as the Support Vector Machine (SVM) [9]. Although the SVM is a state-of-the-art technique for classification and regression, it suffers from a number of disadvantages, one of which is the lack of probabilistic outputs, which make more sense in health monitoring applications. The RVM attempts to address these very issues in a Bayesian framework. Besides the probabilistic interpretation of its output, it uses far fewer kernel functions for comparable generalization performance.

This type of supervised machine learning starts with a set of input vectors {x_n}, n = 1, ..., N, and their corresponding targets {t_n}. The aim is to learn a model of the dependency of the targets on the inputs in order to make accurate predictions of t for unseen values of x. Typically, the predictions are based on some function F(x) defined over the input space, and learning is the process of inferring the parameters of this function. The targets are assumed to be samples from the model with additive noise:

$t_n = F(\mathbf{x}_n; \mathbf{w}) + \epsilon_n$  (1)

where the $\epsilon_n$ are independent samples from some noise process (Gaussian with mean 0 and variance $\sigma^2$). Assuming the independence of the $t_n$, the likelihood of the complete data set can be written as:

$p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}\lVert \mathbf{t} - \Phi\mathbf{w} \rVert^2\right\}$  (2)

where $\mathbf{w} = (w_1, w_2, \ldots, w_M)^T$ is the weight vector and $\Phi$ is the $N \times (N+1)$ design matrix with $\Phi = [\phi(\mathbf{x}_1), \phi(\mathbf{x}_2), \ldots, \phi(\mathbf{x}_N)]^T$, in which $\phi(\mathbf{x}_n) = [1, K(\mathbf{x}_n, \mathbf{x}_1), K(\mathbf{x}_n, \mathbf{x}_2), \ldots, K(\mathbf{x}_n, \mathbf{x}_N)]^T$ and $K(\mathbf{x}, \mathbf{x}_i)$ is a kernel function.

To prevent over-fitting, a preference for smoother functions is encoded by choosing a zero-mean Gaussian prior distribution over $\mathbf{w}$, parameterized by the hyperparameter vector $\eta$. To complete the specification of this hierarchical prior, the hyperpriors over $\eta$ and the noise variance $\sigma^2$ are approximated as delta functions at their most probable values $\eta_{MP}$ and $\sigma^2_{MP}$. Predictions for new data are then made according to:

$p(t_* \mid \mathbf{t}) = \int p(t_* \mid \mathbf{w}, \sigma^2_{MP})\, p(\mathbf{w} \mid \mathbf{t}, \eta_{MP}, \sigma^2_{MP})\, d\mathbf{w}$.  (3)
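The sketch below loosely illustrates this family of sparse Bayesian kernel regression. It uses scikit-learn's ARDRegression over an RBF kernel design matrix as a stand-in for the RVM of Eqs. (1)-(3); the kernel width, the toy feature values, and the helper function are assumptions and do not reproduce the implementation used in this study.

```python
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical degradation feature history (RE+RCT vs. week) up to t_p.
x_train = np.array([8, 12, 16, 20, 24, 28, 32], dtype=float).reshape(-1, 1)
t_train = np.array([0.021, 0.022, 0.023, 0.024, 0.025, 0.026, 0.027])

def design_matrix(X, X_basis, gamma=0.01):
    """Build Phi = [1, K(x_n, x_1), ..., K(x_n, x_N)] with an RBF kernel."""
    K = rbf_kernel(X, X_basis, gamma=gamma)
    return np.hstack([np.ones((X.shape[0], 1)), K])

Phi_train = design_matrix(x_train, x_train)
model = ARDRegression()          # hierarchical Gaussian priors over the weights
model.fit(Phi_train, t_train)

# Probabilistic predictions (mean and standard deviation) at future weeks.
x_future = np.arange(36, 76, 4, dtype=float).reshape(-1, 1)
Phi_future = design_matrix(x_future, x_train)
mean, std = model.predict(Phi_future, return_std=True)
```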

Gaussian Process Regression (GPR)

Gaussian Process Regression (GPR) is a probabilistic technique for nonlinear regression that computes posterior degradation estimates by constraining the prior distribution to fit the available training data [10]. A Gaussian Process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. A real GP f(x) is completely specified by its mean function m(x) and covariance function k(x, x') defined as:

$m(x) = \mathrm{E}[f(x)]$,

$k(x, x') = \mathrm{E}[(f(x) - m(x))(f(x') - m(x'))]$, and  (4)

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$.

The index set $X \in \Re$ is the set of possible inputs, which need not necessarily be a time vector. Given prior information about the GP and a set of training points $\{(x_i, f_i) \mid i = 1, \ldots, n\}$, the posterior distribution over functions is derived by imposing a restriction on the prior joint distribution to contain only those functions that agree with the observed data points. These functions can be assumed to be noisy since in real-world situations we have access only to noisy observations rather than exact function values, i.e. $y_i = f(x_i) + \varepsilon$, where $\varepsilon$ is additive IID noise distributed as $\mathcal{N}(0, \sigma_n^2)$. Once we have a posterior distribution, it can be used to assess predictive values for the test data points. The following equations describe the predictive distribution for GPR [11].

Prior:

$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_{test} \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\ \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_{test}) \\ K(X_{test}, X) & K(X_{test}, X_{test}) \end{bmatrix}\right)$  (5)

Posterior:

$\mathbf{f}_{test} \mid X, \mathbf{y}, X_{test} \sim \mathcal{N}(\bar{\mathbf{f}}_{test}, \operatorname{cov}(\mathbf{f}_{test}))$, where

$\bar{\mathbf{f}}_{test} \equiv \mathrm{E}[\mathbf{f}_{test} \mid X, \mathbf{y}, X_{test}] = K(X_{test}, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,\mathbf{y}$,  (6)

$\operatorname{cov}(\mathbf{f}_{test}) = K(X_{test}, X_{test}) - K(X_{test}, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, X_{test})$.

A crucial ingredient in a Gaussian process predictor is the covariance function $K(X, X')$ that encodes the assumptions about the function to be learnt by defining the relationship between data points. GPR requires prior knowledge about the form of the covariance function, which must be derived from the context if possible. Furthermore, covariance functions contain various hyper-parameters that define their properties. Setting the right values of these hyper-parameters is yet another challenge in learning the desired functions. Although the choice of covariance function must be specified by the user, the corresponding hyper-parameters can be learned from the training data using a gradient-based optimizer, for example by maximizing the marginal likelihood of the observed data with respect to the hyper-parameters [12].

We used GPR to regress the evolution of the internal parameters (RE+RCT) of the battery with time at 45°C. The relationship between these parameters and the battery capacity was learned from the experimental data at 25°C [7]. Thus the internal parameters were regressed for the data obtained at 45°C, and the corresponding estimates were translated into estimated battery capacity at 45°C using the relationship learnt at 25°C.
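A minimal sketch of this kind of GP regression is shown below, using scikit-learn's GaussianProcessRegressor with an assumed RBF plus white-noise kernel (the covariance function actually used in this study is not specified here); the data arrays are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

# Hypothetical RE+RCT measurements vs. time (weeks) at 45 degC, up to t_p.
weeks_45 = np.array([8, 12, 16, 20, 24, 28, 32], dtype=float).reshape(-1, 1)
re_rct_45 = np.array([0.021, 0.022, 0.023, 0.024, 0.025, 0.026, 0.027])

# The kernel form is user-specified; its hyper-parameters are then fit by
# maximizing the marginal likelihood (scikit-learn's default behaviour).
kernel = ConstantKernel(1.0) * RBF(length_scale=10.0) + WhiteKernel(noise_level=1e-4)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(weeks_45, re_rct_45)

# Posterior mean and standard deviation at future weeks (extrapolation).
future_weeks = np.arange(36, 76, 4, dtype=float).reshape(-1, 1)
mean, std = gpr.predict(future_weeks, return_std=True)

# The extrapolated RE+RCT trajectory would then be mapped to capacity via the
# relationship learned at 25 degC, and RUL read off at the failure threshold.
```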

Neural Network (NN) Approach

A neural network based approach was considered as an alternative data-driven approach for prognostics. A basic feed-forward neural network with back-propagation training was used; details on this algorithm can be found in [13, 14]. As described earlier for the other approaches, data at 25°C was used to learn the relationship between the internal parameter RE+RCT and the capacity C using the neural network NN1. In addition, the 45°C data was used as a test case. Here, measurements of the internal parameter RE+RCT are only available up to time tP (the time at which the RUL prediction is made). The available RE+RCT measurements are extrapolated after time tP in order to predict future values. This extrapolation is done using neural network NN2, which learns the relationship between RE+RCT and time. Once future values for RE+RCT are computed using NN2, these values are then used as an input to NN1 in order to obtain C.

The structure of NN1 consists of two hidden layers with one and three nodes, respectively. Tan-sigmoid transfer functions were chosen for the hidden layers and log-sigmoid transfer functions for the output layer. Training uses random initial weights, a reduced-memory Levenberg-Marquardt algorithm, 200 training epochs, and mean-squared error as the performance measure.

The structure and training parameters of NN2 remained fixed during the forecasting. The net was trained with the data available up to week 32, and the resulting model was used to extrapolate RE+RCT until tEOP was reached, or until it was clear that it would not be reached because the model did not converge. Once the next measurement point became available at week 36, the net was trained again including the new data point, and the resulting model was used to extrapolate RE+RCT from tP+1 = 36 onwards. A fixed net structure and fixed training settings cannot be expected to perform optimally for all the training instances that arise as measurements become available from week 32 onwards. To make sure the results are acceptable for all training instances, the initial weights were set randomly and the training was repeated 30 times. This allowed the exploration of different initial values in the optimization of the weights and hence of different local minima. The results of the 30 training runs were aggregated on the extrapolated values by computing the median. Cases were observed where the training stopped prematurely, resulting in a net with poor performance; these cases were regarded as outliers, and the use of the median was intended to diminish their impact while aggregating all the training cases. The structure of NN2 consists of one hidden layer with three nodes and tan-sigmoid transfer functions for all layers. Training uses random initial weights, a reduced-memory Levenberg-Marquardt algorithm, 200 training epochs, and mean-squared error as the performance measure.
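The sketch below mimics the NN2 median-aggregation step using scikit-learn's MLPRegressor; since scikit-learn does not provide reduced-memory Levenberg-Marquardt training, an L-BFGS solver is substituted, and the data arrays are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def extrapolate_re_rct(weeks_train, re_rct_train, weeks_future, n_restarts=30):
    """Train 30 nets with random initial weights and aggregate their
    extrapolations with the median to suppress poorly trained outliers."""
    X = np.asarray(weeks_train, dtype=float).reshape(-1, 1)
    y = np.asarray(re_rct_train, dtype=float)
    Xf = np.asarray(weeks_future, dtype=float).reshape(-1, 1)

    predictions = []
    for seed in range(n_restarts):
        # One hidden layer with three tanh nodes, as described for NN2.
        # L-BFGS stands in for reduced-memory Levenberg-Marquardt here.
        net = MLPRegressor(hidden_layer_sizes=(3,), activation='tanh',
                           solver='lbfgs', max_iter=200, random_state=seed)
        net.fit(X, y)
        predictions.append(net.predict(Xf))

    # Median across the 30 runs diminishes the effect of bad local minima.
    return np.median(np.vstack(predictions), axis=0)
```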

6. PERFORMANCE METRICS

In this section nine different performance metrics are described. Four of them are the metrics most widely used in the community, i.e., accuracy, precision, Mean Squared Error (MSE), and Mean Absolute Percentage Error (MAPE). These metrics have been included to illustrate how they are useful but may not encapsulate the time-varying aspects of prognostic estimates. Further, five new metrics are introduced that encapsulate such features of interest. These metrics are first defined briefly and then evaluated based on the results for battery health management presented in the following section.

Terms and Notations

• UUT is the unit under test.

• Δl(i) is the error between the predicted and the true RUL at time index i for the lth UUT.

• r*(i) is the true RUL at time index i.

• EOP (End-of-Prediction) is the earliest time index, i, after which the prediction crosses the failure threshold.

• EOL represents End-of-Life, the time index for the actual end of life defined by the failure threshold.

• P is the time index at which the first prediction is made by the prognostic system.

• rl(i) is the RUL estimate at time ti, given that data is available up to time ti, for the lth UUT.

• ℓ is the cardinality of the set of all time indices at which predictions are made, i.e. ℓ = |{i | P ≤ i ≤ EOP}|.

Average Bias (Accuracy)

Average bias is one of the conventional metrics that has been used in many ways as a measure of accuracy. It averages the errors in predictions made at all subsequent times after prediction starts for the lth UUT. This metric can be extended to average the biases over all UUTs to establish an overall bias.

$B_l = \frac{1}{\ell}\sum_{i=P}^{EOP} \Delta_l(i)$.  (7)

Sample Standard Deviation (Precision)

Sample standard deviation measures the dispersion/spread of the error with respect to the sample mean of the error. This metric is restricted to the assumption of a normal distribution of the error. It is, therefore, recommended to carry out a visual inspection of the error plots to determine the distribution characteristics before interpreting this metric.

$S = \sqrt{\frac{\sum_{i=P}^{EOP} \left(\Delta_l(i) - m\right)^2}{\ell - 1}}$  (8)


where m is the sample mean of the error.

Mean Squared Error (MSE)

The simple average bias metric suffers from the fact that negative and positive errors cancel each other, and a high variance may not be reflected in the metric. Therefore, MSE averages the squared prediction error over all predictions and encapsulates both accuracy and precision. An often used derivative of MSE is the Root Mean Squared Error (RMSE).

$MSE = \frac{1}{\ell}\sum_{i=P}^{EOP} \Delta_l(i)^2$.  (9)

Mean Absolute Percentage Error (MAPE)

For prediction applications it is important to differentiate between errors observed far away from the EOL and those observed close to the EOL, since smaller errors are desirable as EOL approaches. Therefore, MAPE weighs the errors by the corresponding RULs and averages the absolute percentage errors over the multiple predictions. Instead of the mean, the median can be used to compute the Median Absolute Percentage Error (MdAPE) in a similar fashion.

$MAPE = \frac{1}{\ell}\sum_{i=P}^{EOP} \left|\frac{100\,\Delta_l(i)}{r_*(i)}\right|$  (10)
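The four conventional metrics of Eqs. (7)-(10) can be computed directly from the series of predicted and true RULs. The sketch below (Python/NumPy, hypothetical inputs, with the error taken as predicted minus true RUL) shows one way to do so.

```python
import numpy as np

def conventional_metrics(rul_pred, rul_true):
    """Compute average bias (7), sample standard deviation (8), MSE (9),
    and MAPE (10) from predicted and true RULs at the prediction times."""
    rul_pred = np.asarray(rul_pred, dtype=float)
    rul_true = np.asarray(rul_true, dtype=float)
    error = rul_pred - rul_true                        # Delta_l(i)

    bias = error.mean()                                # Eq. (7)
    ssd = error.std(ddof=1)                            # Eq. (8)
    mse = np.mean(error ** 2)                          # Eq. (9)
    mape = np.mean(np.abs(100.0 * error / rul_true))   # Eq. (10)
    return {'bias': bias, 'ssd': ssd, 'mse': mse, 'mape': mape}

# Hypothetical usage: predictions every 4 weeks from week 32 to 60,
# with an actual EOL of 64.46 weeks.
# rul_true = 64.46 - np.arange(32, 61, 4)
# rul_pred = ...  # algorithm output at the same weeks
# print(conventional_metrics(rul_pred, rul_true))
```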

It must be noted that the above metrics are most suitable in cases where either a distribution of RUL predictions is available as the algorithm output or multiple predictions are available from several UUTs to compute the statistics. Whereas these metrics can convey meaningful information in such cases, they are not designed for applications where RULs are continuously updated as more data become available. It is desirable to have metrics that can characterize how the performance of a prognostic algorithm improves as time approaches end-of-life. In this paper we discuss one such application, where the algorithms track battery health, and show how newer metrics can encapsulate such information, which is valuable for successful fielded applications of prognostics. Therefore, we next discuss new metrics tailored for prognostics and show how they are more informative than the ones traditionally used.

Prognostic Horizon (PH)

The Prediction Horizon has been in the literature for quite some time, but no formal definition is available. The notion suggests that the longer the prognostic horizon, the more time is available to act based on a prediction that has some credibility. We define the Prognostic Horizon as the difference between the current time index i and the EOP, utilizing data accumulated up to time index i, provided the prediction meets the desired specifications. The specification may be given in terms of an allowable error bound (α) around the true EOL. This metric ensures that the predicted estimates are within specified limits around the actual EOL and hence that the predictions may be considered trustworthy. It is expected that PHs are determined for an algorithm-application pair offline during the validation phase, and that these numbers are then used as guidelines when the algorithm is deployed in a test application where the actual EOL is not known in advance. When comparing algorithms, an algorithm with a longer prediction horizon would be preferred.

$H = EOP - i$  (11)

where $i = \min\{\, j \mid (P \le j \le EOP) \wedge \left((1-\alpha)\, r_*(j) \le r_l(j) \le (1+\alpha)\, r_*(j)\right) \}$.

For instance, a PH with an error bound of α = 5% identifies when a given algorithm starts predicting estimates that are within 5% of the actual EOL. Other specifications may be used to derive the PH as desired.
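A minimal sketch of the PH computation of Eq. (11) follows, under the reading that the error band is taken around the true RUL at each prediction time; function and argument names are illustrative.

```python
import numpy as np

def prognostic_horizon(pred_times, rul_pred, rul_true, eop, alpha=0.05):
    """Eq. (11): H = EOP - i, where i is the first prediction time at which
    the RUL estimate enters the +/- alpha band around the true RUL."""
    pred_times = np.asarray(pred_times, dtype=float)   # assumed sorted ascending
    rul_pred = np.asarray(rul_pred, dtype=float)
    rul_true = np.asarray(rul_true, dtype=float)

    within = (rul_pred >= (1 - alpha) * rul_true) & \
             (rul_pred <= (1 + alpha) * rul_true)
    if not within.any():
        return 0.0                                     # spec never met (a convention choice)
    i_first = pred_times[within][0]                    # earliest qualifying prediction time
    return eop - i_first
```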

α-λ Accuracy

Another way to quantify prediction quality is through a metric that determines whether the prediction falls within specified accuracy levels at a given time. These time instances may be specified as a percentage of the total remaining life from the point the first prediction is made, or as a given absolute time interval before EOL is reached. In our implementation we define α-λ accuracy as requiring the prediction accuracy to be within α·100% of the actual RUL at a specific time instance tλ, which is expressed as a fraction of the time between the point when an algorithm starts predicting and the actual failure. For example, it determines whether a prediction falls within 20% accuracy (i.e., α = 0.2) halfway to failure from the time the first prediction is made (i.e., λ = 0.5).

$[1 - \alpha] \cdot r_*(t_\lambda) \le r_l(t_\lambda) \le [1 + \alpha] \cdot r_*(t_\lambda)$  (12)

where α is the accuracy modifier, λ is the time window modifier, and $t_\lambda = P + \lambda\,(EOL - P)$.
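The α-λ accuracy test of Eq. (12) reduces to a single band check at tλ. A minimal sketch follows, where the RUL look-ups are hypothetical callables (for example interpolators over the discrete prediction series).

```python
import numpy as np

def alpha_lambda_accuracy(t_p, t_eol, rul_at, rul_true_at, alpha=0.2, lam=0.5):
    """Eq. (12): is the prediction at t_lambda within +/- alpha*100% of the
    true RUL at that time?"""
    t_lambda = t_p + lam * (t_eol - t_p)       # time instance being tested
    r_true = rul_true_at(t_lambda)             # true RUL at t_lambda
    r_pred = rul_at(t_lambda)                  # algorithm's RUL estimate
    return (1 - alpha) * r_true <= r_pred <= (1 + alpha) * r_true

# Hypothetical usage with discrete 4-weekly predictions:
# rul_at = lambda t: np.interp(t, pred_weeks, rul_pred)
# rul_true_at = lambda t: 64.46 - t
# ok = alpha_lambda_accuracy(32, 64.46, rul_at, rul_true_at)
```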

Figure 3 – Schematic depicting α-λ Accuracy: the predicted RUL rl(i) is compared against a cone of α = 20% accuracy around the true RUL r*(i) at λ = 0, 0.5, and 1 between tP and tEOL.


Relative Accuracy (RA)

Relative prediction accuracy is a notion similar to α-λ accuracy, except that instead of determining whether the predictions fall within given accuracy levels at a given time instant, we measure the accuracy level itself. The time instant is again described as a fraction of the actual remaining useful life from the point when the first prediction is made. An algorithm with higher relative accuracy is desirable.

$RA_\lambda = 1 - \frac{\left| r_*(t_\lambda) - r_l(t_\lambda) \right|}{r_*(t_\lambda)}$  (13)

where $t_\lambda = P + \lambda\,(EOL - P)$.

Figure 4 – Schematic showing the Relative Accuracy concept: the predicted RUL rl(tλ) is compared against the true RUL r*(tλ) at the time instant tλ between tP and tEOL.

Cumulative Relative Accuracy (CRA)

Relative accuracy can be evaluated at multiple time instances. To aggregate these accuracy levels, we define Cumulative Relative Accuracy as a normalized weighted sum of the relative prediction accuracies at specific time instances:

$CRA_\lambda = \frac{1}{\ell}\sum_{i=P}^{EOP} w(r_l(i))\, RA_\lambda(i)$  (14)

where w is a weight factor that is a function of the RUL at each time index. In most cases it is desirable to weigh the relative accuracies higher closer to the EOL.
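A minimal sketch of Eqs. (13) and (14) follows; the default weight function that emphasizes later predictions is an assumption, since the text only states that higher weights near EOL are usually desirable.

```python
import numpy as np

def relative_accuracy(t_p, t_eol, rul_at, lam=0.5):
    """Eq. (13): RA = 1 - |r*(t_lambda) - r_l(t_lambda)| / r*(t_lambda)."""
    t_lambda = t_p + lam * (t_eol - t_p)
    r_true = t_eol - t_lambda                   # true RUL at t_lambda
    return 1.0 - abs(r_true - rul_at(t_lambda)) / r_true

def cumulative_relative_accuracy(pred_times, rul_pred, t_p, t_eol, weights=None):
    """Eq. (14): weighted sum of relative accuracies over the prediction
    times, normalized by the number of predictions."""
    pred_times = np.asarray(pred_times, dtype=float)
    rul_pred = np.asarray(rul_pred, dtype=float)
    rul_true = t_eol - pred_times
    ra = 1.0 - np.abs(rul_true - rul_pred) / rul_true
    if weights is None:
        # Assumed weight choice: emphasize predictions made closer to EOL.
        weights = (pred_times - t_p) / (t_eol - t_p)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * ra) / len(pred_times))
```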

Convergence

Convergence is defined to quantify the manner in which any metric, such as accuracy or precision, improves with time to reach its perfect score. As illustrated in Figure 5, different cases can converge at different rates. It can be shown that the distance between the origin and the centroid of the area under the curve of a metric quantifies convergence: the lower the distance, the faster the convergence. Convergence is a useful metric since we expect a prognostics algorithm to converge to the true value as more information accumulates over time. Further, a faster convergence is desired to achieve high confidence while keeping the prediction horizon as large as possible.

Let (x_c, y_c) be the center of mass of the area under the curve M(i). Then the convergence C_M can be represented by the Euclidean distance between the center of mass and (t_P, 0), where

$C_M = \sqrt{(x_c - t_P)^2 + y_c^2}$,

$x_c = \frac{\tfrac{1}{2}\sum_{i=P}^{EOP} \left(t_{i+1}^2 - t_i^2\right) M(i)}{\sum_{i=P}^{EOP} \left(t_{i+1} - t_i\right) M(i)}$, and  (15)

$y_c = \frac{\tfrac{1}{2}\sum_{i=P}^{EOP} \left(t_{i+1} - t_i\right) M(i)^2}{\sum_{i=P}^{EOP} \left(t_{i+1} - t_i\right) M(i)}$.

Here, M(i) is a non-negative prediction error, accuracy, or precision metric.

Figure 5 – Schematic for the convergence of a metric, showing the centroid (xc, yc) of the area under M(i) between tP and tEOP.
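A minimal sketch of the convergence computation of Eq. (15) follows, treating M(i) as piecewise constant over the intervals between consecutive prediction times; the inputs are hypothetical.

```python
import numpy as np

def convergence(pred_times, metric_values, t_p):
    """Eq. (15): Euclidean distance between (t_P, 0) and the centroid of the
    area under a non-negative metric curve M(i); smaller means faster."""
    t = np.asarray(pred_times, dtype=float)       # t_P ... t_EOP
    m = np.asarray(metric_values, dtype=float)    # M(i) >= 0 at those times
    dt = np.diff(t)                               # (t_{i+1} - t_i)
    area = np.sum(dt * m[:-1])

    # Centroid of the piecewise-constant area under the curve.
    x_c = 0.5 * np.sum((t[1:] ** 2 - t[:-1] ** 2) * m[:-1]) / area
    y_c = 0.5 * np.sum(dt * m[:-1] ** 2) / area
    return float(np.hypot(x_c - t_p, y_c))

# Hypothetical usage: M(i) could be |error| at each prediction week.
# c = convergence(pred_weeks, np.abs(rul_pred - rul_true), t_p=32)
```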

7. RESULTS & DISCUSSION

As mentioned earlier, the battery health measurements were taken every four weeks. Therefore, each algorithm was tasked to predict every four weeks after week 32, which gives about eight data points to learn the degradation trend. The algorithms predict RULs until the end-of-prediction is reached, i.e., until the estimates show that the battery capacity has already hit 70% of the full capacity of one ampere-hour. The corresponding predictions are then evaluated using all nine metrics. Algorithms like RVM always predicted conservatively, i.e., they predicted a faster degradation than actually observed, and their estimates were available for all weeks from week 32 through week 64. Other algorithms like NN and PR started predicting at week 32 but could not predict beyond week 60 because their estimates had already crossed the failure threshold before that. GPR, however, required more training data before it could provide any estimates; therefore, predictions for GPR start at week 48 and go until week 60.

Table 1 – Performance evaluation for all four test algorithms with Error Bound = 5%.

                 RVM      GPR      NN       PR
Bias            -7.12     5.96     5.04     1.87
SSD              6.57    15.24     6.81     4.26
MSE             84.81   184.16    59.49    17.35
MAPE            41.36    53.93    37.54    23.05
PH               8.46    12.46    12.46    24.46
RA (λ = 0.5)     0.60     0.86     0.34     0.82
CRA (λ = 0.5)    0.63     0.52     0.55     0.65
Convergence     14.80     8.85    13.36    11.41

Figure 6 – Prediction Horizon (5% error bound): the RUL predictions from RVM, GPR, NN, and PR enter the 95% accuracy zone around the actual RUL at different times before the End of Life (EOL).

In Table 1 the results are aggregated based on all available predictions. These results clearly show that the polynomial fit approach outperforms all other algorithms in almost all cases; even though its convergence properties are not the best, they are comparable to the top numbers. However, using all predictions to compute these metrics results in a wide range of values, which makes it difficult to assess how the other algorithms fare even if they may not necessarily be the best. Most metrics describe how close or far the predictions are from the true value, but the prediction horizon indicates when these predictions enter the specified error bound and therefore may be considered trustworthy (see Figure 6). PR enters the error bound early on, whereas all other algorithms converge slowly as time passes. The convergence metric encapsulates this attribute and shows that algorithms like GPR converge faster to better estimates and may be useful later on. We also learned that the current convergence metric does not take into account cases where algorithms start predicting at different time instances; in such cases algorithms that start predicting early on may be at a disadvantage. Although this metric works well in most cases, a few adjustments may be needed to make it robust towards extreme cases.

It must be pointed out that these metrics summarize all predictions, good or bad, into one aggregate, which may not be fair for algorithms that learn over time and get better later on. Therefore, next, it was decided to evaluate only those predictions that were made within the prediction horizon, so that only the meaningful predictions are evaluated (Table 2). As expected, the results change significantly and all the performance numbers become comparable for all algorithms. This provides a better understanding of how these algorithms compare.

Table 2 – Performance evaluation for all four test algorithms for predictions made within the prediction horizon with Error Bound = 5%.

                 RVM      GPR      NN       PR
Bias            -1.19    -1.78    -1.53     0.22
SSD              1.18     1.33     1.45     3.33
MSE              2.03     3.96     3.27     7.75
MAPE            39.33    30.40    27.44    23.25
PH               8.46    12.46    12.46    24.46
RA (λ = 0.5)     0.77     0.62     0.69     0.95
CRA (λ = 0.5)    0.50     0.31     0.33     0.58
Convergence      3.76     4.44     4.61     7.36

Another aspect of performance evaluation is the requirement specification: as the specifications change, the performance evaluation criteria also change. To illustrate this point, the prediction horizon was redefined with a relaxed error bound of 10%. As expected, the prediction horizons become longer for most of the algorithms and hence more predictions are taken into account while computing the metrics. Table 3 shows the results with the new prediction horizons; now the NN based approach also seems to perform well on several criteria. This means that for applications where more relaxed requirements are acceptable, simpler approaches may be chosen if needed.

Table 3 – Performance evaluation for all four test algorithms for predictions made within the prediction horizon with Error Bound = 10%.

                 RVM      GPR      NN       PR
Bias            -1.83     0.05    -1.53     0.22
SSD              1.73     4.34     1.45     3.33
MSE              5.02    10.6      3.27     7.75
MAPE            37.01    31.20    27.44    23.25
PH              12.46    16.46    12.46    24.46
RA (λ = 0.5)     0.76     0.79     0.69     0.95
CRA (λ = 0.5)    0.57     0.43     0.33     0.58
Convergence      5.49     3.43     4.61     7.36

Figure 7 shows the α-λ Accuracy metric for all four algorithms. Since all algorithms except GPR start predicting from week 32 onward, tλ is determined to be around week 48.3. At that point only PR lies within the 80% accuracy levels. GPR starts predicting from week 44 onward, i.e., its tλ is determined to be around week 54.3, where it appears to meet the requirement. This metric signifies whether a particular algorithm reaches a desired accuracy level halfway to the EOL from the point it starts predicting. Another aspect that may be of interest is whether an algorithm reaches the desired accuracy level some fixed time interval ahead of the EOL. In that case, for example, if tλ is chosen as 48 weeks, then GPR will not meet the requirement. Therefore, this metric may be modified to incorporate cases where not all algorithms are able to start predicting at the same time.

Figure 7 – The α-λ Accuracy metric (α = 0.2, λ = 0.5) determines whether predictions are within the cone of desired accuracy levels (here the 80% accuracy zone around the actual RUL) at a given time instant tλ; the RVM, GPR, NN, and PR RUL curves are evaluated at tλ ≈ week 48.3, and GPR at tλ ≈ week 54.3, ahead of the End of Life (EOL).

Figure 8 – The battery capacity decay profile (capacity vs. time in weeks, with the failure threshold marked) shows several features that are difficult to learn using simple regression techniques.

It can be observed from the results (Figure 7) that most algorithms fail to follow the trend towards the end. Being data-driven, regression-based techniques, these approaches find it difficult to learn the physical phenomenon by which batteries degrade. As shown in Figure 8, initially the battery capacity degrades quite fast, then the degradation rate slows down, and towards the end it slows down even further. These algorithms are not able to learn this characteristic and hence predict an earlier EOL.

Finally, we would like to mention a few key points that are important for performance evaluation and should be considered before choosing the metrics. Time scales observed in prognostic applications are often very different: for instance, in battery health management the time scales are on the order of weeks, whereas in other cases like electronics it may be a matter of hours or seconds. Therefore, the chosen metrics should acknowledge the importance of the prediction horizon and weigh errors close to EOL with higher penalties. Next, these metrics may be modified to address asymmetric preferences on the RUL error; in most applications where a failure may lead to catastrophic outcomes, an early prediction is preferred over a late one. Finally, in the example discussed in this paper the RUL estimates were obtained as single values rather than as a RUL distribution for every prediction. The metrics presented in this paper can be applied to such applications with slight modifications. Similarly, for cases where multiple UUTs are available to provide data, minor adjustments will suffice.

8. CONCLUSION

In this paper we have shown how performance metrics for prognostics can be designed. Four different prediction algorithms were used to show how various metrics convey different kinds of information. No single metric should be expected to cover all performance criteria. Depending on the requirements, a subset of these metrics should be chosen and a decision matrix should be used to rank different algorithms. In this paper we used nine metrics, including the four conventional ones that are most commonly used to evaluate algorithm performance. The new metrics provide additional information that may be useful in comparing prognostic algorithms in particular. Specifically, these metrics track the evolution of prediction performance over time and help determine when these predictions can be considered trustworthy. Notions like convergence and prediction horizon, which have existed in the literature for a long time, have been quantified so that they can be used in an automated fashion. Further, new notions of performance measures at specific time instances have been instantiated using metrics like relative accuracy and α-λ performance. These metrics represent the notion that a prediction is useful only if it allows a certain amount of time to mitigate the predicted contingency.

Whereas these metrics demonstrate several ideas specific to prognostics performance evaluation, we by no means claim this list to be anywhere near perfect. It is anticipated that as new ideas are generated and the metrics themselves are evaluated in different applications, this list will be revised and refined before a standard methodology can be devised for evaluating prognostics. This paper is intended to serve as a start towards developing such metrics that can better encapsulate prognostic algorithm performance.


REFERENCES

[1] A. Saxena, J. Celaya, E. Balaban, K. Goebel, B. Saha, S. Saha, and M. Schwabacher, "Metrics for Evaluating Performance of Prognostics Techniques," in 1st International Conference on Prognostics and Health Management (PHM08), Denver, CO, 2008.

[2] M. Schwabacher, "A Survey of Data-Driven Prognostics," in AIAA Infotech@Aerospace Conference, 2005.

[3] M. Schwabacher and K. Goebel, "A Survey of Artificial Intelligence for Prognostics," in AAAI Fall Symposium, Arlington, VA, 2007.

[4] ISO, "Condition Monitoring and Diagnostics of Machines – Prognostics Part 1: General Guidelines," ISO 13381-1:2004(E), International Organization for Standardization (ISO), p. 14, 2004.

[5] J. P. Christophersen, I. Bloom, E. V. Thomas, K. L. Gering, G. L. Henriksen, V. S. Battaglia, and D. Howell, "Gen 2 Performance Evaluation Final Report," Idaho National Laboratory, Idaho Falls, ID, INL/EXT-05-00913, July 2006.

[6] K. Goebel, B. Saha, A. Saxena, J. Celaya, and J. P. Christophersen, "Prognostics in Battery Health Management," IEEE Instrumentation and Measurement Magazine, vol. 11, pp. 33-40, 2008.

[7] K. Goebel, B. Saha, and A. Saxena, "A Comparison of Three Data-Driven Techniques for Prognostics," in 62nd Meeting of the Society for Machinery Failure Prevention Technology (MFPT), Virginia Beach, VA, pp. 119-131, 2008.

[8] M. E. Tipping, "The Relevance Vector Machine," in Advances in Neural Information Processing Systems, vol. 12, Cambridge, MA: MIT Press, pp. 652-658, 2000.

[9] V. N. Vapnik, The Nature of Statistical Learning Theory. Berlin: Springer, 1995.

[10] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: The MIT Press, 2006.

[11] C. K. I. Williams and C. E. Rasmussen, "Gaussian Processes for Regression," in Advances in Neural Information Processing Systems, vol. 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: The MIT Press, pp. 514-520, 1996.

[12] K. V. Mardia and R. J. Marshall, "Maximum Likelihood Estimation for Models of Residual Covariance in Spatial Regression," Biometrika, vol. 71, pp. 135-146, 1984.

[13] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, Inc., 2000.

[14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., 2003.

BIOGRAPHIES

Abhinav Saxena is a Staff Scientist with the Research Institute for Advanced Computer Science at the Prognostics Center of Excellence, NASA Ames Research Center, Moffett Field, CA. His research focus lies in developing and evaluating prognostic algorithms for engineering systems using soft computing techniques. He holds a PhD in Electrical and Computer Engineering from the Georgia Institute of Technology, Atlanta. He earned his B.Tech in 2001 from the Indian Institute of Technology (IIT) Delhi, and a Master's degree in 2003 from Georgia Tech. Abhinav has been a GM manufacturing scholar and is also a member of IEEE, AAAI, and ASME.

Jose R. Celaya is a visiting scientist with the Research Institute for Advanced Computer Science at the Prognostics Center of Excellence, NASA Ames Research Center. He received a Ph.D. degree in Decision Sciences and Engineering Systems in 2008, an M.E. degree in Operations Research and Statistics in 2008, and an M.S. degree in Electrical Engineering in 2003, all from Rensselaer Polytechnic Institute, Troy, New York; and a B.S. in Cybernetics Engineering in 2001 from CETYS University, Mexico.

Bhaskar Saha is a Research Programmer with Mission Critical Technologies at the Prognostics Center of Excellence, NASA Ames Research Center. His research is focused on applying various classification, regression, and state estimation techniques for predicting the remaining useful life of systems and their components. He has also published a fair number of papers on these topics. Bhaskar completed his PhD in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2008. He received his MS from the same school and his B.Tech. (Bachelor of Technology) degree from the Department of Electrical Engineering, Indian Institute of Technology, Kharagpur.


Sankalita Saha received her B.Tech (Bachelor of Technology) degree in Electronics and Electrical Communication Engineering from the Indian Institute of Technology, Kharagpur, India in 2002 and a Ph.D. degree in Electrical and Computer Engineering from the University of Maryland, College Park in 2007. She is currently a post-doctoral scientist working at RIACS/NASA Ames Research Center, Moffett Field, CA. Her research interests are in prognostics algorithms and architectures, distributed systems, and system synthesis.

Kai Goebel is a senior scientist at NASA Ames Research Center where he leads the Prognostics Center of Excellence (prognostics.arc.nasa.gov). Prior to that, he worked at General Electric's Global Research Center in Niskayuna, NY from 1997 to 2006 as a senior research scientist. He has carried out applied research in the areas of artificial intelligence, soft computing, and information fusion. His research interest lies in advancing these techniques for real-time monitoring, diagnostics, and prognostics. He has fielded numerous applications for aircraft engines, transportation systems, medical systems, and manufacturing systems. He holds half a dozen patents and has published more than 75 papers. He received the degree of Diplom-Ingenieur from the Technische Universität München, Germany in 1990. He received the M.S. and Ph.D. from the University of California at Berkeley in 1993 and 1996, respectively.
