machine learning innovation for high accuracy remaining useful … · 2018. 10. 6. · machine...

Machine Learning Innovation for High Accuracy Remaining Useful Life (RUL) Estimation for Critical Assets in IoT Infrastructures

Kenny C. Gross¹ and DeJun Li¹
¹Oracle Physical Sciences Research Center, Oracle Corporation, San Diego, CA, USA

Abstract - Prognostics Health Management (PHM) for business & mission critical assets in IoT industries comprises a comprehensive methodology for proactively detecting and isolating failures, recommending condition-based maintenance (CBM), and estimating in real time the remaining useful life (RUL) of critical components. This paper introduces a variety of innovative algorithms that leverage time-series telemetry coupled with advanced machine learning (ML) pattern recognition for high accuracy estimation of Remaining Useful Life (RUL) of systems, components, and subsystems in business-critical and mission-critical environments for PHM applications in dense-sensor IoT industries. RUL capability is a key enabler for Condition Based Maintenance (CBM) of customer assets. RUL-based CBM is a structured preventative maintenance framework that significantly reduces operations-and-maintenance (O&M) costs for IoT and Big Data customers in the industrial sectors of utilities, transportation, manufacturing, oil-and-gas, and enterprise data centers.

Keywords: remaining useful life, RUL, MSET, SPRT

1 Introduction

1.1 MSET and SPRT

The Multivariate State Estimation Technique (MSET) [1-3] is a nonlinear, nonparametric Machine Learning (ML) method that was originally developed by Argonne National Laboratory (ANL) for high-sensitivity proactive fault monitoring applications in advanced commercial nuclear power systems, where plant downtime can cost utilities and their customers on the order of $1M a day. MSET is a statistical modeling technique that learns a high-fidelity model of an asset from a sample of its normal operating data. Once built, the software model provides an accurate estimate for each observed signal given new data observations from all signals associated with the asset. Each estimated signal is compared to its actual signal counterpart using a highly sensitive fault detection procedure called the Sequential Probability Ratio Test (SPRT) [4-6] to statistically determine whether the actual signal agrees with the learned model or, alternatively, is indicative of a process anomaly, sensor data quality problem, or equipment problem [7].

The intellectual property for the original Argonne MSET expired in 2016, after which Oracle developed a second-generation comprehensive prognostic health management system, MSET2. MSET2 embodies a suite of Intelligent Data Preprocessing (IDP) innovations that solve widespread sensor and signal challenges that cause poor performance for ML prognostic surveillance of assets in all IoT industries. These challenges include missing values in streaming signals (for which MSET2 employs “optimal missing value imputation”, MVI); asynchrony of signals due to common clock-skew issues in data-acquisition systems for dense-sensor IoT applications; and low-resolution signals that arise from low-bit analog-to-digital (A/D) chips used in modern assets across many industries (for which MSET2 uses Oracle’s “UnQuantize” innovation that turns low-resolution input signals into high-accuracy output signals). This paper describes how output alerts and sensor-operability-validation flags from MSET2 and SPRT are incorporated into a novel approach for high-accuracy RUL estimation for business-critical and mission-critical components, subsystems, and integrated systems. MSET2 uses advanced statistical pattern recognition techniques to measure the similarity or overlap between data signals within a learned operational domain. The learned patterns or relationships among the signals are used to estimate the operating state that most closely corresponds with the current measured set of signals. By quantifying the relationship between the present and learned states, MSET2 in the real-time surveillance mode computes highly accurate estimates of the current expected responses of all system signals.
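The idea of estimating the current state from its similarity to learned states can be illustrated with a minimal sketch. This is not MSET2, whose similarity operator and training procedure are not described in this paper. As a stand-in, it uses a plain Gaussian-kernel weighted average over memorized normal states (often called auto-associative kernel regression). The names `estimate_state`, `residuals`, `memory`, and `bandwidth` are illustrative assumptions.

```python
import math

def estimate_state(memory, observation, bandwidth=1.0):
    """Estimate every signal as a similarity-weighted average of memorized states.

    memory:      list of signal vectors recorded during normal operation
    observation: current signal vector (same length as each memory vector)
    bandwidth:   kernel width (an assumed tuning parameter, not from the paper)
    """
    weights = []
    for state in memory:
        dist2 = sum((o - s) ** 2 for o, s in zip(observation, state))
        weights.append(math.exp(-dist2 / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observation is too far from every memorized state")
    n_signals = len(observation)
    return [sum(w * s[j] for w, s in zip(weights, memory)) / total
            for j in range(n_signals)]

def residuals(memory, observation, bandwidth=1.0):
    """Observed minus estimated value for each signal; these feed the detector."""
    estimate = estimate_state(memory, observation, bandwidth)
    return [o - e for o, e in zip(observation, estimate)]

# Hypothetical training data: the second signal is always twice the first.
memory = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
healthy = residuals(memory, (2.0, 4.0))   # residuals close to zero
drifted = residuals(memory, (2.0, 5.0))   # second signal reads high
```

Because the estimate is built only from states seen during normal operation, a signal that drifts away from its learned relationship with the other signals produces a nonzero residual even when its raw value is still within normal limits.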
For cases where it can be established that sensor failure (and not an equipment malfunction) is responsible for anomalous behavior, MSET's accurate analytical estimate of the signal is used as a temporary substitute, called an “analytical sensor,” for the erroneous signal until a sensor repair can be accomplished [8]. The SPRT is an outstanding “detector” algorithm when combined with prognostic ML algorithms for rapid annunciation of the incipience or onset of anomalous patterns in digitized time series signals under surveillance. The SPRT is optimal in the sense that it gives the fastest mathematically possible annunciation of subtle disturbances in noisy process

Int'l Conf. Internet Computing and Internet of Things | ICOMP'18 | 101

ISBN: 1-60132-482-0, CSREA Press ©

variables, and allows the data scientists setting up prognostics algorithms to independently specify the false- and missed-alarm probabilities (FAPs and MAPs). This is in sharp contrast to conventional prognostic algorithms that are based upon threshold-limit tests [9]. Coupling the MSET2 pattern recognition method with a SPRT provides a superior surveillance tool because it is sensitive not only to disturbances in signal mean, but also to very subtle changes in the statistical moments of the monitored signals and the patterns of correlation among multiple types of signals. MSET2 or similar nonlinear, nonparametric (NLNP) pattern recognition coupled with a SPRT provides the basis for detecting very subtle statistical anomalies in noisy process signals at the earliest mathematically possible time, thereby providing actionable warning-alert information on the type and the exact time of onset of the disturbance. Instead of simple threshold limits that trigger faults when a signal increases beyond some threshold value, the SPRT technique is based on user-specified false-alarm and missed-alarm probabilities, allowing the end user to control the likelihood of missed detections or false alarms. For sudden, gross failures of sensors or system components, the SPRT annunciates the disturbance as fast as a conventional threshold limit check. However, for slow degradation that evolves over a long time period (gradual decalibration bias in a sensor; very subtle voltage drift from the variety of aging mechanisms that cause resistances to change very slowly with age; bearing degradation, lubrication dryout, or buildup of a radial rub in all types of rotating machinery; the gradual appearance of new vibration spectral components in the presence of noisy background signals, etc.), the SPRT raises a warning of the incipience or onset of the disturbance long before it would be apparent to any conventional threshold-based rules [10].
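The role of the user-specified false-alarm and missed-alarm probabilities can be sketched with the textbook Wald SPRT [15] for a positive mean shift. This is a simplified illustration, not the production MSET2 detector. It assumes Gaussian residuals with known standard deviation `sigma`, a hypothesized shift magnitude `shift`, and a test that restarts after every decision. The function name and parameters are hypothetical.

```python
import math

def sprt_mean_shift(residuals, sigma, shift, alpha=0.01, beta=0.01):
    """Repeated Wald SPRT on a residual sequence.

    H0: residual ~ N(0, sigma^2)       (signal agrees with the model)
    H1: residual ~ N(shift, sigma^2)   (mean has shifted)
    Returns one entry per sample: "H1" (alert), "H0" (normal), or None
    (not enough evidence yet).
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1.0 - alpha))   # crossing it accepts H0
    llr = 0.0                                # cumulative log-likelihood ratio
    decisions = []
    for r in residuals:
        # Log-likelihood ratio increment for two Gaussians with equal variance.
        llr += (shift / sigma ** 2) * (r - shift / 2.0)
        if llr >= upper:
            decisions.append("H1")
            llr = 0.0                        # restart the test
        elif llr <= lower:
            decisions.append("H0")
            llr = 0.0
        else:
            decisions.append(None)
    return decisions

# With sigma = shift = 1 and alpha = beta = 0.01, a sustained shift of 1.0
# needs ten samples to cross the upper boundary log(99).
shifted = sprt_mean_shift([1.0] * 10, sigma=1.0, shift=1.0)
```

Unlike a fixed threshold, the decision boundaries here depend only on `alpha` and `beta`, so a small but persistent shift accumulates evidence until the test trips.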

2 Remaining Useful Life

Oracle has developed a variety of innovative algorithms that leverage time series telemetry from dense-sensor IoT assets coupled with advanced ML pattern recognition (MSET2 and SPRT) for high accuracy estimation of Remaining Useful Life (RUL) of systems, components, and subsystems in business-critical and mission-critical environments. RUL capability is a key enabler for Condition Based Maintenance (CBM) of customer assets. RUL-based CBM is a structured preventative maintenance framework that significantly reduces operations-and-maintenance (O&M) costs for Oracle’s IoT and Big Data customers in the industrial sectors of utilities, transportation, manufacturing, and oil-and-gas. Switching to RUL-enabled CBM significantly boosts annualized throughput metrics (for IoT manufacturing customers) and overall availability of critical assets for all IoT customers, versus time-based preventative maintenance (PM) windows. Conventional time-based PM windows lower overall throughput for manufacturing operations, lower availability for revenue-generating critical assets for utilities and oil-and-gas assets, and result in prematurely overhauling, upgrading, or recalibrating machinery that is not yet exhibiting any wear symptoms. Moreover, time-based PM produces additional unscheduled outages because of what we call "maintenance induced failures": when technicians dismantle, overhaul, recalibrate, and reassemble expensive assets, some human-factors errors are inevitable, and assets that were fault-free before the intervention may go down early in the next production cycle.

This paper introduces the underpinning mathematical and pattern recognition details behind Oracle’s RUL methodology, which consumes output alerts from multivariate ML prognostic algorithms and creates a continuously updated optimal estimate of RUL with quantitative confidence factors. Once a parameter that reflects the condition of sensored components inside critical assets has been identified and continuously or periodically measured through direct measurements or through the empirical MSET2 modeling procedure, the remaining useful life prediction is done using a trending technique called multinomial logistic regression. A trending parameter is assumed to follow a known-shape trend with random parameters or to be described by a “birth and death” process (also known as a diffusion process). When telemetry data exists from a batch of deployed components that have been monitored from the point of incipient degradation all the way to failure, the time-series life data are used to estimate the shape of the trend and distribution of its parameters [11-14].

Fig. 1 below shows a typical analysis of time series signals from digitized sensor transducers monitoring a typical mechanical asset from a manufacturing facility. The residuals shown in the first subplots of Fig. 1 (a) and Fig. 1 (b) are reasonably Gaussian and white. The SPRT in this case is set up with a false-alarm and missed-alarm probability (alpha and beta, respectively) of 5%.
This alpha and beta are much larger than we use for production implementations of MSET2 prognostics, and are set this high just to illustrate “false” alerts in the SPRT output results (2nd and 3rd subplots).



Fig. 1. Example signals for MSET2 residuals, SPRT alarms, and cumulative SPRT tripping frequency for (a) a temperature sensor and (b) an RPM sensor.

There is no degradation in the sensors monitored by the SPRT in Fig. 1. The SPRT alerts are normal and expected from the Wald theorem [15]. Only when the frequency of SPRT alerts exceeds the prespecified value of alpha will a real alarm be triggered. For normal Gaussian processes, the SPRT cumulative tripping frequency (3rd subplots), which we call empirical alpha, will always be conservatively lower than alpha [15]. SPRT tripping frequencies are a unitless measure of anomaly severity, which makes them a prime candidate for use as a prognostic measure in remaining useful life analyses. Fig. 2 shows in blue the SPRT tripping frequencies for simulated asset signatures that end in failure. A trend of the SPRT tripping frequencies from systems that failed in operation is shown in black.
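A cumulative tripping frequency can be computed from a stream of SPRT decisions as sketched below. The paper does not give the exact formula, so this sketch assumes the frequency is the fraction of completed SPRT decisions so far that were alerts. The decision encoding (`"H1"` for an alert, `"H0"` for a normal decision, `None` while a test is still undecided) and the function names are also assumptions.

```python
def cumulative_tripping_frequency(decisions):
    """Running fraction of completed SPRT decisions that were alerts.

    decisions: iterable of "H1" (alert), "H0" (normal), or None (undecided).
    Returns a list with the empirical alpha after each sample.
    """
    alerts = 0
    completed = 0
    frequency = []
    for decision in decisions:
        if decision is not None:
            completed += 1
            if decision == "H1":
                alerts += 1
        frequency.append(alerts / completed if completed else 0.0)
    return frequency

def is_real_alarm(tripping_frequency, alpha):
    """A real alarm is raised only when the alert frequency exceeds alpha."""
    return tripping_frequency > alpha

history = [None, "H0", "H1", None, "H0", "H0"]
trend = cumulative_tripping_frequency(history)
```

Because the result is a ratio of counts, it lies between zero and one regardless of the units of the underlying sensor.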

Fig. 2. SPRT tripping frequencies for simulated data of failed and non-failed systems (blue). SPRT tripping frequency trend (black). 95% failure probability points marked (red).

The blue circles in Fig. 3 below represent failure data. A “1” means that a system failed within the next T hours from the current time t, given the system’s SPRT tripping frequency at that time t. A “0” means that a system did not fail within the next T hours. For this use-case example, we use T=1000 hours. Each blue circle represents an individual system/component from the historical telemetry archive.
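The labeling rule above can be sketched as follows. The record layout (snapshot time, tripping frequency at that time, failure time or `None` if the unit never failed) is a hypothetical stand-in for whatever the telemetry archive actually stores, and the sample values are made up for illustration.

```python
def label_snapshot(snapshot_time, failure_time, horizon=1000.0):
    """Return 1 if the unit failed within `horizon` hours after the snapshot."""
    if failure_time is None:
        return 0                      # unit never failed in the archive
    time_to_failure = failure_time - snapshot_time
    return 1 if 0.0 <= time_to_failure <= horizon else 0

def build_training_set(records, horizon=1000.0):
    """Turn (snapshot_time, tripping_frequency, failure_time) records into
    (tripping_frequency, label) pairs, one per unit."""
    return [(freq, label_snapshot(t, failure_time, horizon))
            for t, freq, failure_time in records]

# Hypothetical archive entries (hours, unitless frequency, hours).
records = [
    (500.0, 0.08, 1200.0),   # failed 700 h after the snapshot
    (500.0, 0.01, None),     # never failed
    (100.0, 0.02, 5000.0),   # failed, but 4900 h later
]
training = build_training_set(records)
```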

Fig. 3. Logistic regression applied to simulated data. 95% failure probability (Threshold) marked.

Next a logistic curve is fitted through these data points to obtain a measure of probability of failure with respect to the SPRT tripping frequency. When a sensor’s SPRT tripping frequency exceeds the value at which this curve reaches a 95% probability of failure, the monitored asset is considered to be in imminent peril of failure. The red markers in Fig. 2 show when each of the simulated SPRT tripping frequencies hit 95% probability of failure, based on the probability curve estimated in Fig. 3. By using the probability of failure curve estimated in Fig. 3 and the SPRT tripping frequency trend in Fig. 2, one can approximate a failure probability distribution for any value of SPRT tripping frequency. Fig. 4 shows the approximated failure distribution of a system that is experiencing a tripping frequency of 0.06. Taking a numerical derivative of the distribution function from Fig. 4 will provide the failure probability density function in Fig. 5. One can estimate, on average, the remaining useful life, or how soon a system experiencing a SPRT tripping frequency of 0.06 will likely fail, by identifying the peak in the density function.
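Fitting the logistic curve and locating the 95% threshold can be sketched as below. The paper does not say which fitting procedure was used; this sketch assumes maximum likelihood via Newton's method with a tiny ridge term as a numerical guard. The data are synthetic placeholders, and all function names are illustrative.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_1d(xs, ys, iterations=25, ridge=1e-6):
    """Fit P(fail | x) = sigmoid(b0 + b1 * x) by Newton's method.

    ridge is an assumed small penalty that keeps the 2x2 system solvable
    when the data are (nearly) perfectly separable.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        g0 -= ridge * b0
        g1 -= ridge * b1
        h00 += ridge
        h11 += ridge
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det    # Newton step: solve H * step = g
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

def frequency_at_probability(b0, b1, probability=0.95):
    """Tripping frequency at which the fitted curve reaches `probability`."""
    return (math.log(probability / (1.0 - probability)) - b0) / b1

# Synthetic (tripping frequency, failed-within-T) pairs, not from the paper.
xs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic_1d(xs, ys)
threshold = frequency_at_probability(b0, b1, 0.95)
```

An asset whose tripping frequency rises above `threshold` would be flagged as being in imminent peril under the rule described above.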



Fig. 4. Approximated Cumulative Distribution Function (CDF) of Remaining Useful Life (RUL) for an observed degradation parameter (SPRT tripping frequency) of 0.06. The true model is 1-dimensional.

Fig. 5. Estimated Probability Density Function (PDF) of Remaining Useful Life (RUL) for an observed degradation parameter (SPRT tripping frequency) of 0.06. The true model is 1-dimensional. On average a system with an observed tripping frequency of 0.06 will fail in 800 time units.
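The step from the CDF of Fig. 4 to the PDF of Fig. 5 is a numerical derivative followed by a peak search, which can be sketched as below. The CDF used here is a synthetic logistic curve chosen only so that its peak falls at 800 time units; it is not the paper's data. The function names are illustrative.

```python
import math

def pdf_from_cdf(times, cdf):
    """Differentiate a sampled CDF: central differences in the interior,
    one-sided differences at the two ends."""
    if len(times) != len(cdf) or len(times) < 2:
        raise ValueError("need two or more matching samples")
    last = len(times) - 1
    pdf = []
    for i in range(len(times)):
        lo = max(i - 1, 0)
        hi = min(i + 1, last)
        pdf.append((cdf[hi] - cdf[lo]) / (times[hi] - times[lo]))
    return pdf

def most_likely_rul(times, pdf):
    """Time at which the density peaks."""
    peak = max(range(len(pdf)), key=pdf.__getitem__)
    return times[peak]

# Synthetic CDF centered at 800 time units (an assumption for illustration).
times = [10.0 * k for k in range(161)]
cdf = [1.0 / (1.0 + math.exp(-(t - 800.0) / 100.0)) for t in times]
pdf = pdf_from_cdf(times, cdf)
rul_peak = most_likely_rul(times, pdf)
```

The width of the resulting density around `rul_peak` is what the later sections treat as the uncertainty of the RUL estimate.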

Until now, the underlying failure models have been 1-dimensional, i.e., there is only a single SPRT tripping frequency being monitored. For very many IoT use cases in the industrial sectors of Utilities, Oil-and-Gas, Manufacturing, Transportation, and Data Centers, it is most often the case that two or more prognostic variables start displaying elevated risk indicators simultaneously. For example, in mechanical rotating machinery, a common age-related degradation mode is out-of-roundness for bearings. When mechanical centrifugal pumps, fans, blowers, or motors start to show symptoms of bearing out-of-roundness, the degradation can appear in multiple monitored transducers simultaneously: e.g., localized thermal hot spots, increases in tri-axis accelerometer vibration readings, a decrease in operational RPMs, a decrease in flow rate (GPM or CFM), or an increase in current draw to the motor operating the pump. A challenge for conventional RUL estimation methodology is how to incorporate prognostic parameters that may have different units [e.g., temperatures, RPMs, G(RMS) for vibration metrics, GPM or CFM for flow rates, Amps for motor currents], with drastically different magnitudes and drastically different variance characteristics, into a unified model to achieve multi-dimensional RUL estimation.

The first innovation introduced in this paper is to use a prognostic machine learning (ML) technique to monitor all signals, then to process the residuals from the ML technique with a SPRT to produce a SPRT tripping frequency for whatever signals may show signs of degradation for the monitored system, then to base the RUL estimation on quantitative SPRT tripping frequencies. The result is a unitless metric that is inherently scaled between zero and one, and that has already optimally incorporated all of the statistical moments of the original raw transducer signals (magnitude, variance, skewness, kurtosis) that are empirically learned by the ML algorithm. This consolidated unitless metric can be analyzed with a straightforward multi-dimensional logistic regression model, thereby achieving higher accuracy with higher confidence factors for quantitative RUL estimation (as illustrated below). Moreover, for each additional variable that shows degradation, the uncertainty drops significantly and the confidence factor improves significantly, as we demonstrate in the following section.

Fig. 6 shows the approximated CDF of the RUL for a 2-prognostic-variable model, with the corresponding first derivatives plotted in Fig. 7. According to the two-variable model, the system will fail on average at 1050 hours, whereas the single-variable model indicates that the system will fail on average at 1300 hours (but with a wider estimate uncertainty). Similar to how data points were sampled from the SPRT tripping frequency in Fig. 2 and a logistic probability curve was approximated in Fig. 3, the same sampling procedure is now extended to two SPRT tripping frequencies to generate a failure probability surface.
The failure probability surface from two correlated sensors is shown in Fig. 8, with the 95% failure probability shown in Fig. 9. The effects of using a single prognostic parameter (in this case, the SPRT tripping frequency) versus two prognostic parameters when the true underlying failure model is 2-dimensional are examined by comparing the resulting failure density functions. As shown in Fig. 7, the density function for the model that accounts for 2 parameters is narrower than that for the model that accounts for only a single parameter. The width of this density curve can be viewed as uncertainty in the failure time estimate of the sensor. As such, a slimmer density function is more beneficial because there is less uncertainty in the remaining useful life estimation. The effects of additional parameters in the true underlying failure model can be further examined by straightforward extension to 3- and 4-parameter models. The density functions for the 3-parameter and 4-parameter failure models are shown in Figs. 11 and 13. For both the 3-parameter and 4-parameter cases, the density function distribution is significantly narrower when a larger number of prognostic parameters are available for the IoT prognostic use case at hand.
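Extending the logistic fit from one tripping frequency to several can be sketched as below. The fitting procedure is an assumption (plain batch gradient ascent on the log-likelihood), the data are synthetic placeholders, and the names are illustrative. The fixed step size relies on every input being a tripping frequency in the range zero to one and on there being only a few variables.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def failure_probability(bias, weights, frequencies):
    """Point on the failure probability surface for a vector of frequencies."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, frequencies)))

def fit_logistic(rows, labels, learning_rate=1.0, epochs=10000):
    """Fit a multi-variable logistic model by batch gradient ascent.

    rows:   list of tripping-frequency vectors, each entry in [0, 1]
    labels: 1 if the unit failed within the horizon, else 0
    """
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_b = 0.0
        grad_w = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = y - failure_probability(bias, weights, x)
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias += learning_rate * grad_b / n
        weights = [w + learning_rate * g / n for w, g in zip(weights, grad_w)]
    return bias, weights

# Synthetic two-sensor data (not from the paper); the classes overlap.
rows = [(0.05, 0.10), (0.10, 0.05), (0.20, 0.30), (0.30, 0.20),
        (0.30, 0.50), (0.50, 0.30), (0.40, 0.50), (0.50, 0.40),
        (0.60, 0.70), (0.70, 0.60), (0.80, 0.90), (0.90, 0.80)]
labels = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
bias, weights = fit_logistic(rows, labels)
in_peril = failure_probability(bias, weights, (0.9, 0.9)) >= 0.95
```

Because every input is already a unitless frequency, no per-sensor rescaling is needed before sensors of different physical types are combined in one model.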

Fig. 6. Approximated CDF for 2 variable model

Fig. 7. Approximated PDF for 2 variable model.

Fig. 8. Logistic regression surface fitted to 2-variable data. Fail/Not Fail data are plotted on the 3D surface to illustrate how the logistic regression curve attempts to cover these points.

Fig. 9. Logistic regression surface with a threshold of 95% failure probability (red).

Fig. 10. Approximated CDF for 3-variable model.

Fig. 11. Approximated PDF for 3-variable model. The uncertainty of the remaining useful life (RUL) decreases as we increase the number of variables used. The true underlying model is 3-dimensional.

Table 1 shows the uncertainty range for the 1- through 4-variable models. The uncertainty range on the time of failure narrows from 19 to 12 as the number of variables used to approximate the underlying model increases from one to four.



Fig. 12. Approximated CDF for 4-variable model.

Fig. 13. Approximated PDF for 4-variable model. With the addition of another variable, the uncertainty in the range of the RUL goes down even further.

Table 1. Uncertainty Measure of Different Models

Number of Variables of Model   RUL Estimate   Lower Bound   Upper Bound   Uncertainty Range
1                              4089           4079          4098          19
2                              3820           3811          3829          18
3                              4066           4058          4073          15
4                              4179           4173          4185          12

3 Conclusions

Oracle's MSET2-based prognostic innovations are helping to increase component reliability margins and system availability goals while reducing (through improved root cause analysis) costly sources of "no trouble found" (NTF) events that have become a significant sparing-logistics issue and warranty-cost issue across enterprise computing (and other) industries. MSET2 coupled with SPRT enables high-sensitivity detection of the incipience or onset of anomalies in critical assets, and the SPRT “tripping frequencies” allow Remaining Useful Life (RUL) estimation for fleets of assets for which telemetry signatures have been captured in a data historian as assets progressed all the way to failure (a prerequisite for RUL estimation). Industries that switch to RUL-enabled “condition based maintenance” (CBM) significantly boost annualized throughput metrics (for IoT manufacturing customers) and overall critical-asset availability (for all IoT industrial applications) versus time-based preventative maintenance (PM) windows. Conventional time-based PM windows lower overall throughput for manufacturing operations, lower availability for critical assets, and result in prematurely overhauling, upgrading, recalibrating, and refurbishing machinery that is not yet exhibiting any wear symptoms. MSET2-based CBM lowers overall Operations & Maintenance (O&M) costs while reducing penalizing challenges to critical-asset availability. This paper presents a systematic framework for leveraging output alerts from an MSET2 + SPRT prognostic system to obtain quantitative RUL estimates, with significantly higher prognostic accuracy and quantitative confidence factors when employed with IoT use cases involving multiple prognostic indicators.

4 References

[1] J. P. Herzog, S. W. Wegerich, R. M. Singer, and K. C. Gross, “Theoretical Basis of the Multivariate State Estimation Technique (MSET),” ANL-NT-49, Argonne National Laboratory Technical Report Series (Dec 1997).

[2] R. M. Singer, K. C. Gross, J. P. Herzog, R. W. King, and S. Wegerich, “Model-Based Nuclear Power Plant Monitoring and Fault Detection: Theoretical Foundations,” Proc. 9th Intnl. Conf. On Intelligent Systems Applications to Power Systems, pp. 60-65, Seoul, Korea (July 6-10, 1997).

[3] K. C. Gross, R. M. Singer, S. W. Wegerich, J. P. Herzog, R. VanAlstine, and F. Bockhorst, "Application of a Model-based Fault Detection System to Nuclear Plant Signals," Proc. 9th Intnl. Conf. On Intelligent Systems Applications to Power Systems, pp. 66-70, Seoul, Korea (July 6-10, 1997).

[4] K. C. Gross and K. E. Humenik, “SPRTS for Nuclear Plant Component Surveillance,” J. of Nucl. Technol., 93, 131-137 (Feb 1991).

[5] K. C. Gross and R. Dhanekula, “Multivariate SPRT for Improved Electronic Prognostics of Enterprise Computing Systems,” Proc. 65th Meeting of the Machinery Failure Prevention Technology Society (MFPT2012), Dayton, OH (April 2012).

[6] K. C. Gross and W. Lu, “Early Detection of Signal and Process Anomalies in Enterprise Computing Systems,” Proc. 2002 IEEE Int’l Conf. on Machine Learning and Applications (ICMLA), Las Vegas, NV (June 2002).

[7] K. Whisnant, K. C. Gross and N. Lingurovska, “Proactive Fault Monitoring in Enterprise Servers,” Proc. 2005 IEEE Intn'l Multiconference in Computer Science & Computer Eng., Las Vegas, NV (June 2005).

[8] K. C. Gross, K. Baclawski, E. S. Chan, D. Gawlick, A. Ghoneimy, and Z. H. Liu, “KIDS Supervisory Control Loop with MSET Prognostics for Human-in-the-Loop Decision Support and Control Applications,” 2016 IEEE Intn’l Multi-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA) (Mar 2017).

[9] T. Masoumi and K. C. Gross, “SimSPRT-II: Monte Carlo Simulation of Sequential Probability Ratio Test Algorithms for Optimal Prognostic Performance,” 2016 International Conference on Computational Science and Computational Intelligence.

[10] K. C. Gross, V. Bhardwaj, and R. Bickford, “Proactive Detection of Software Aging Mechanisms in Performance Critical Computers,” 27th Annual NASA Goddard/IEEE Software Engineering Workshop, pp. 17-23, 2002.

[11] K. C. Gross, A. Urmanov, L. G. Votta, S. McMaster, and A. Porter, “Towards Dependability in Everyday Software Using Software Telemetry,” Third IEEE International Workshop on Engineering of Autonomic and Autonomous Systems, pp. 9-18, 2006.

[12] A. Urmanov, “Electronic Prognostics for Computer Servers: From Condition Monitoring to Decision Support,” Annual Reliability and Maintainability Symposium, pp. 65-70, 2007.

[13] J. W. Hines and A. Usynin, “Empirical Model Optimization for Computer Monitoring and Diagnostics/Prognostics,” Technical Report, Dept. of Nuclear Engineering, Univ. of Tennessee, 2006.

[14] D. R. Garvey, J. W. Hines, and K. C. Gross, “Real-Time Remaining Useful Life (RUL) Estimation of Computer Server Power Supplies,” Proc. 61st Meeting of the Machinery Failure Prevention Technology Soc., Virginia Beach, VA, 2007.

[15] A. Wald. Sequential Analysis. John Wiley & Sons, New York, NY, 1947.

[16] A. R. Moore and K. C. Gross, “SimML Framework: Monte Carlo Simulation of Statistical Machine Learning Algorithms for IoT Prognostic Applications,” 2016 International Conference on Computational Science and Computational Intelligence.

[17] J. W. Hines and D. R. Garvey, “Nonparametric Model-Based Prognostics,” 2008 Annual Reliability and Maintainability Symposium, pp. 469-474.


