
Remote Condition Monitoring: Application to Railway Traction Systems

Rafael Mendes Borges


Instituto Superior Tecnico, Universidade de Lisboa, Portugal

November 2018

Abstract

Maintenance is a crucial aspect for any transportation operator. The predictive maintenance strategy improves fleet availability while reducing costs. This approach aims to anticipate equipment failures, allowing maintenance activities to be scheduled accordingly and preventing downtime and the severe consequences of breakdowns during operation. This thesis presents a data-driven approach based on data science tools for failure prediction applied to railway traction systems, using real data collected by a remote condition monitoring system. The first steps include the exploratory data analysis and dimensionality reduction. Then, the engineering problem is represented from a data science point of view, for which the expert knowledge perspective is considered. Lastly, a scheme for failure prediction is designed, and machine learning algorithms together with strategies for imbalanced datasets are used to learn from data. Important results were achieved, as all failures were correctly predicted in the test set, meaning that every fault is anticipated. This allows repairing the system just before its breakdown, avoiding failures during operation and their potentially serious repercussions. However, the false positive rate (0.40) is higher than desired, leading to unnecessary maintenance operations, but these are expected to have a lower impact than not predicting a true failure. From an operational perspective, this solution represents a cost reduction if a repair costs at least 2.22 times more than a maintenance activity, which is clearly below the estimated relative cost (3.7 – 6.1). Future work directions are discussed, from alternative learning methods to solutions which were designed but not performed due to time constraints.

Keywords: failure prediction, predictive maintenance, data-driven approach, railway traction systems, remote condition monitoring

1. Introduction

Predictive maintenance (PdM) tries to predict the time of failure, allowing maintenance to be planned accordingly. This technique requires intelligent systems capable of predicting failures based on condition monitoring [1]. One of the approaches, usually called a model-driven approach, consists of using rule engines based on failure modes. Another type of method, known as a knowledge-based approach, employs methods like fuzzy logic and expert systems [2]. The current paradigm of PdM, generally referred to as a data-driven approach, takes advantage of data science, machine learning and big data tools. These include machine learning algorithms like neural networks, decision trees and support vector machines (SVMs) [3].

This work addresses a real industry problem for which an engineering solution is being sought. Some trains from Portuguese Railways (CP) had their traction converters refurbished. In addition, these trains were equipped with a remote condition monitoring (RCM) system, allowing the extraction of relevant data, in particular from the traction system. Since this is the first time a refurbishment like this was done in this train series, there is an interest in deepening the knowledge of the new system, especially regarding its failures and modern maintenance approaches.

This work presents a PdM approach using data science and machine learning methods. Data collected from two different train systems is used to predict failures in the traction system. The data was collected over time and represents several variables that monitor the train. Events were also recorded, namely those representing failures. Some of these events also have several associated variables (variables of context), which are collected every time a failure is detected. Our goal is to predict four failure events of the traction system. The idea of this work is to use a learning-from-data approach, as opposed to the traditional ones based on failure modes and rule engines [2]. This work follows the usual steps involved in a data science project, adapted to the circumstances that are particular to this case.

PdM approaches using machine learning tools have been discussed in various studies, some of which address problems common to this work.


In [4], a machine learning approach for failure prediction is proposed. The authors start by using principal component analysis for dimensionality reduction of the dataset. Then, SVMs with strategies for imbalanced data are used to solve the classification problem. Since the dataset is imbalanced, appropriate performance metrics are used. The authors of [5] propose an approach to predict train wheel faults by combining multiple classifiers using decision trees and Naïve Bayes algorithms. That work proposes a different evaluation method, joining the common metrics for imbalanced data with a reward function that rewards early alerts and penalizes late ones. A log-based PdM approach for failure prediction in medical devices is proposed in [6]. The data comes from equipment event logs, labeled as normal/fault. The classifier is intended to label both single events and events in an interval.

2. Background

2.1. Train

The data used in this work comes from a passenger train operated in the Lisbon suburban area. Regarding the nomenclature used, train or unit refers to four rail cars; vehicle stands for two rail cars (motor car + trailer car); M1 and M2 designate each vehicle that composes the train. M1 always refers to the same vehicle, and the same is valid for M2. The motor cars are at the ends and the trailers in the middle. The traction motors are placed in the two motor cars. The train has four traction groups: M1 has two and M2 the remaining two. To simplify the notation, the traction group #i of vehicle Mj is referred to as Mj,Gi.

2.2. Sources of Data

The available data is collected by an RCM system installed in the train used for this work. Several files containing the data were provided. These are divided by system, central control unit (CCU) and traction control unit (TCU), and by type of data. The CCU is the core of the train control system (supervision, control and command of the principal train parameters), while the TCU is specially designed for the control and command of a traction converter.

The information can be divided into variables, events and variables of context. The CCU variables are mostly real-valued parameters of the train with an associated timestamp and GPS data, which are recorded whenever the train is working, at a certain sample rate. Events represent failures and, for that reason, they are stored whenever the train detects a failure. Each log has the fault description, the start and end timestamps, and the GPS coordinates of the location where the event started. Both CCU and TCU have specific events. Finally, the variables of context are exclusive to the TCU and are collected every time a TCU event starts. They represent the values of several parameters and also flags indicating internal commands and states of the traction system. Their values correspond to the instant when the associated event starts.

2.3. Problem Formulation

The problem addressed by this thesis was presented from a railway engineering point of view. As is always necessary for real problems, a challenging process is followed to adapt the problem from the expert knowledge viewpoint to the data science viewpoint. Figure 1 summarizes a possible interpretation of the steps involved. Once a suitable formulation is found, the initial problem can be adequately represented as a data science problem without compromising the real engineering context.

Fig. 1: Problem formulation cycle. Several iterations are necessary to find an appropriate representation of the problem, which satisfies the requirements from both the expert knowledge and data science domains. (Diagram blocks: Expert Knowledge (Railway Eng.), Problem Formulation, Data Pre-Processing (Data Science), Final Problem Formulation; loop label: iterate until reaching a suitable formulation.)

2.4. Exploratory Data Analysis (EDA)

The EDA is a way of looking at the data, allowing a better comprehension of its main characteristics. It can be done using multiple techniques, usually making use of graphical methods. The analysis made is based on [7]; the main statistics, histograms and other figures representing the data were used in the analysis.

2.5. Dimensionality Reduction

Principal component analysis (PCA) is probably the most widely used technique for dimension reduction of datasets. This technique analyzes a data table containing observations described by many dependent variables, generally inter-correlated. Its goal is to extract the relevant information from the data table in order to represent it as a set of new orthogonal variables known as principal components. PCA generates an approximation of a numeric data table in terms of the product of two smaller matrices, which capture the fundamental patterns of the data table [8].

The data table is represented by the matrix $A \in \mathbb{R}^{m \times n}$, where $m$ is the number of observations and $n$ the number of features. In general, before performing the analysis, the matrix $A$ is pre-processed by adjusting its columns so that they have zero mean and unit norm. This is done for each variable by subtracting its mean and dividing by its norm. This standardization rescales the data in order to compensate for unequal scaling in different features [9]. PCA finds a matrix $Z \in \mathbb{R}^{m \times n}$ of rank $k < \min(m, n)$ that best approximates $A$ in the least-squares sense. The rank constraint can be encoded implicitly by expressing $Z$ in factored form as $Z = XY$, with $X \in \mathbb{R}^{m \times k}$ and $Y \in \mathbb{R}^{k \times n}$; this factorization is not unique. The problem is reduced to choosing $X$ and $Y$ in order to minimize $\|A - XY\|_F^2$ (the squared Frobenius norm) [10].
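The factorization described above can be sketched with a truncated singular value decomposition, which is one standard way to obtain a minimizer of $\|A - XY\|_F^2$ for a fully observed numeric matrix. The function names `standardize` and `pca_factors` and the random test matrix are assumptions made for illustration; they are not taken from the thesis.

```python
import numpy as np

def standardize(A):
    """Center each column and scale it to unit Euclidean norm."""
    centered = A - A.mean(axis=0)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    return centered / norms

def pca_factors(A, k):
    """Return X (m x k) and Y (k x n) whose product is the best rank-k
    least-squares approximation of A, obtained from the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = U[:, :k] * s[:k]   # scores: one k-dimensional row per observation
    Y = Vt[:k, :]          # loadings: one k-dimensional column per feature
    return X, Y

# Hypothetical data: 50 observations of 6 features.
rng = np.random.default_rng(0)
A = standardize(rng.normal(size=(50, 6)))
X, Y = pca_factors(A, k=2)
error = np.linalg.norm(A - X @ Y, "fro") ** 2  # squared Frobenius norm
```

The non-uniqueness mentioned in the text is visible here: for any invertible $k \times k$ matrix $G$, the pair $XG$ and $G^{-1}Y$ gives the same product.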

The PCA technique can be used for numeric matrices that do not contain missing entries. When only some entries are observed, this method cannot be used. PCA is also not suitable for data tables containing non-numeric features, such as Boolean or categorical data. Generalized low rank models (GLRM) generalize PCA to an arbitrary dataset composed of different data types, e.g. numerical, Boolean and categorical, and allow missing entries [11].

Consider that the matrix $A$ has missing entries, so that only the entries $A_{ij}$ for $(i, j) \in \Omega \subseteq \{1, \dots, m\} \times \{1, \dots, n\}$ are observed, and that the entries of $A$ can have different data types. For each data type, a loss function is given, which describes the approximation error incurred when a feature value $a$ is represented by the number $u \in \mathbb{R}$. In addition, one must provide regularizers, which are used to prevent overfitting to the observations.

The problem can be adapted to allow an approach similar to that commonly used in PCA, namely the standardization of the data. In this technique, instead of standardizing the data, the loss functions are rescaled, which preserves the sparsity of the dataset. Finally, $x_i \in \mathbb{R}^{1 \times k}$ and $y_j \in \mathbb{R}^{k}$ are defined to be the $i$th row of $X$ and the $j$th column of $Y$, respectively. Therefore, $x_i y_j \in \mathbb{R}$ denotes an inner product. These considerations lead to a final formulation of the GLRM problem [10],

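$$\text{minimize} \quad \sum_{(i,j) \in \Omega} L_{ij}\left(x_i y_j + \mu_j, A_{ij}\right) / \sigma_j^2 + \sum_{i=1}^{m} r_i(x_i) + \sum_{j=1}^{n} r_j(y_j), \tag{1}$$

with variables $x_i$, $y_j$ and $\mu_j$, where, for each feature $j$,

$$\mu_j = \operatorname*{argmin}_{\mu} \sum_{i:(i,j) \in \Omega} L_{ij}(\mu, A_{ij}), \qquad \sigma_j^2 = \frac{1}{n_j - 1} \sum_{i:(i,j) \in \Omega} L_{ij}(\mu_j, A_{ij}). \tag{2}$$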
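As an illustration of the offset and scale defined in (2), consider the quadratic loss $L(u, a) = (u - a)^2$ and assume that $n_j$ denotes the number of observed entries in column $j$ (an assumption, since $n_j$ is not defined explicitly in the text). Under that assumption, the two quantities reduce to the sample mean and the sample variance of the observed entries:

```latex
\mu_j = \operatorname*{argmin}_{\mu} \sum_{i:(i,j)\in\Omega} (\mu - A_{ij})^2
      = \frac{1}{n_j} \sum_{i:(i,j)\in\Omega} A_{ij},
\qquad
\sigma_j^2 = \frac{1}{n_j - 1} \sum_{i:(i,j)\in\Omega} (A_{ij} - \mu_j)^2 .
```

For real-valued columns, dividing the loss by $\sigma_j^2$ therefore plays the same role as standardizing the column, without having to fill in the unobserved entries.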

A solution of this problem gives an estimate $\hat{A}_{ij} = x_i y_j$ for the value of the entries $(i, j) \notin \Omega$ that were not observed. Considering a matrix where most entries are missing (the case of interest in this work), this method is used to infer all its entries, given just a few of them [10].

In this work, two different loss functions were used: for real-valued data, the quadratic loss, $L(u, a) = (u - a)^2$, which is a simple and commonly used loss function; for Boolean data, the hinge loss, $L(u, a) = \max(1 - au, 0)$, representing the value "true" with +1 and "false" with -1. The hinge loss is chosen since it punishes misclassifications. The regularizers used were $r(x) = \gamma \|x\|_2^2$ and $r(y) = \tilde{\gamma} \|y\|_2^2$, to prevent overfitting, where $\gamma$ and $\tilde{\gamma}$ are the respective regularization scales. For the case of $r(y)$, a sparsity-inducing regularizer was also tested, $r(y) = \tilde{\gamma} \|y\|_1$, to verify whether inducing sparsity on matrix $Y$ could reduce the problem to only some features while preserving significant information present in the original data.
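The loss functions and regularizers named above are simple enough to write out directly. The following is a minimal sketch in plain Python; the function names are chosen here for illustration and are not the interface of the package used in the thesis.

```python
def quadratic_loss(u, a):
    """Quadratic loss for real-valued entries: (u - a)^2."""
    return (u - a) ** 2

def hinge_loss(u, a):
    """Hinge loss for Boolean entries, with a = +1 for true and a = -1 for false."""
    return max(1.0 - a * u, 0.0)

def l2_regularizer(x, gamma):
    """Quadratic regularizer: gamma * ||x||_2^2."""
    return gamma * sum(v * v for v in x)

def l1_regularizer(y, gamma):
    """Sparsity-inducing regularizer: gamma * ||y||_1."""
    return gamma * sum(abs(v) for v in y)
```

The hinge loss is zero only when the approximation $u$ has the same sign as the Boolean value and magnitude at least one; a value on the wrong side of zero is penalized linearly.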

Note that each row of $X$ represents an example by a vector in $\mathbb{R}^k$. The solution to the problem gives a compressed, real-valued representation that may be used to efficiently store the information present in the original dataset [11].

The authors of [10] developed a package to solve this problem. The approach uses alternating minimization, which iteratively keeps one of $X$, $Y$ fixed and optimizes over the other, then switches and repeats. The overall problem is non-convex due to bilinearity, which is why in general it cannot be solved globally. Even so, it can be solved locally by alternating minimization, where each subproblem is convex and can be solved efficiently. Despite the wide usage of bilinear representations and alternating minimization, global optimality guarantees for such methods are still lacking [11, 12].
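To make the alternating scheme concrete, here is a minimal sketch for the special case of quadratic loss with quadratic regularization and missing entries, where each subproblem is a small ridge regression with a closed-form solution. It is not the package from [10]: the names `fit_low_rank` and `objective`, the default scales, and the omission of the offset $\mu_j$ and scale $\sigma_j^2$ are all simplifying assumptions.

```python
import numpy as np

def objective(A, mask, X, Y, gamma_x, gamma_y):
    """Quadratic loss over observed entries plus both quadratic regularizers."""
    residual = np.where(mask, X @ Y - np.nan_to_num(A), 0.0)
    return (residual ** 2).sum() + gamma_x * (X ** 2).sum() + gamma_y * (Y ** 2).sum()

def fit_low_rank(A, mask, k, gamma_x=0.1, gamma_y=0.1, iters=50, seed=0):
    """Alternating minimization; mask[i, j] is True where A[i, j] is observed."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, k))
    Y = rng.normal(size=(k, n))
    I = np.eye(k)
    for _ in range(iters):
        # Keep Y fixed: each row of X solves an independent ridge regression.
        for i in range(m):
            obs = mask[i]
            Yo = Y[:, obs]
            X[i] = np.linalg.solve(Yo @ Yo.T + gamma_x * I, Yo @ A[i, obs])
        # Keep X fixed: each column of Y solves an independent ridge regression.
        for j in range(n):
            obs = mask[:, j]
            Xo = X[obs]
            Y[:, j] = np.linalg.solve(Xo.T @ Xo + gamma_y * I, Xo.T @ A[obs, j])
    return X, Y
```

Because every half-step minimizes the objective exactly over one factor, the objective cannot increase from one iteration to the next, but the point reached depends on the random initialization, which is consistent with the lack of global guarantees noted above. The positive default scales keep each linear system invertible even for rows or columns with very few observed entries.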

2.6. Model Selection

When selecting a model, a way of testing the different values for each parameter is required in order to choose the most adequate ones. One of the most used techniques is k-fold cross-validation, which is utilized in this work. The dataset is split into training and test sets. Then, for model selection, the training set is split into k folds (subsets of equal size). The number of folds k is chosen to be equal to the number of models to test. For each model, k−1 folds are used for training and one fold is used for testing. The process is repeated k times such that each fold is used once as the test set. After this, the model achieving the best performance is selected. Then, the entire training set is used to train the algorithm and the final evaluation is done using the test set, guaranteeing that the final model is tested on data that was never used before [13].
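The fold rotation can be sketched in a few lines of standard-library Python. The helper names and the `score_fn` callback (which is assumed to train on the given indices and return a score on the held-out fold) are hypothetical and only illustrate the general k-fold mechanism.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(score_fn, n_samples, k, seed=0):
    """Average score_fn(train_idx, held_out_idx) over the k folds.

    Each fold is held out exactly once; the other k - 1 folds form the
    training indices for that round.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(score_fn(train, held_out))
    return sum(scores) / k
```

In a model-selection loop, `cross_validate` would be called once per candidate parameter setting, and the setting with the best average score would then be retrained on the whole training set.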

2.7. Classification

Two different classifiers are used in this work: support vector machines (SVMs) and decision trees.


SVMs are appropriate for datasets that are not too large, which is the case here. They are fast and provide linear models. SVMs are not explainable, but explainability is not a requirement. High accuracy values can be obtained, especially when using kernels, at the cost of speed. Decision trees are fast to train and explainable, while achieving high accuracy values. However, decision trees have high variance since they are sensitive to the training data. This variance can be reduced using adequate techniques, like bagging, which is described later.

SVMs divide the data by defining a hyperplane in a high-dimensional feature space. Their objective consists of maximizing the separability of the two classes. The hyperplane is expressed in the input space using the support vectors, which are the closest points to the hyperplane. The optimal hyperplane is the one that maximizes the margin between samples from the two classes. Since the boundary between classes is usually not linear, the problem can be solved by mapping the data from the input space into a high-dimensional feature space, where the data can be separated by a hyperplane. This is done using kernel functions [14]. Two kernels are used in this work: linear and radial basis function.

Decision trees divide the training data into smaller subsets so that the label variables in each subset are as homogeneous as possible. The classifier is assumed to be constant in each subset and the class is chosen by majority vote. For a given data point, the label is found by going through the tree and making decisions based on the feature values. There are noisy labels in both the training and test sets, that is, examples that lead to misclassification because they have the same attributes but different labels. This motivates the use of measures of impurity. Training a tree consists of finding a tree which minimizes the impurity.
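The text does not say which impurity measure was used. As an illustration only, the sketch below uses the Gini index, one common choice, to score a candidate split and to pick the constant label of a subset by majority vote; all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure subset, larger for mixed subsets."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted impurity of the two subsets produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def majority_label(labels):
    """Constant prediction for a subset: its most frequent label."""
    return Counter(labels).most_common(1)[0][0]
```

A greedy tree learner would evaluate `split_impurity` for each candidate feature threshold and keep the split with the lowest value. With noisy labels, some leaves can never become pure, which is why impurity is minimized rather than driven to zero.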

2.8. Appropriate Metrics for Imbalanced Data

It is very common to have a dataset whose classes are not equally distributed. Moreover, the most important samples frequently belong to the classes with fewer points. This is exactly what happens in failure prediction, in general. The samples representing faults are expected to be far fewer but more important than the ones associated with a normal condition. These problems require a high detection rate in the minority class rather than in the majority class. For this reason, simple predictive accuracy is not a good measure. An important concept when defining performance metrics is the confusion matrix, as shown in Figure 2. It is a representation of the predicted label compared to the true class.

Fig. 2: Confusion matrix. The confusion matrix represents the combinations of the true and predicted labels. (a) General definitions: true P and predicted P, true positives (TP); true P and predicted N, false negatives (FN); true N and predicted P, false positives (FP); true N and predicted N, true negatives (TN). (b) Meanings for failure prediction: TP, failure correctly predicted; FN, failure but normal condition predicted; FP, normal condition but failure predicted; TN, normal condition correctly predicted.

The most common metrics for imbalanced data are derived from the interpretation of the confusion matrix; two examples are precision, TP/(TP + FP), and recall, TP/(TP + FN). Using an example from the context of failure prediction, precision measures what proportion of predicted failures was actually correct. Recall measures what proportion of real failures was identified correctly. Ideally, both metrics should be close to one. For each particular problem, one of these metrics can be considered the more relevant.
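A short sketch of how the two metrics follow from the confusion matrix counts, with label 1 standing for a failure and 0 for a normal condition. The function names and the convention of returning 0.0 for an empty denominator are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn

def precision(tp, fp):
    """Proportion of predicted failures that were real failures."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Proportion of real failures that were predicted."""
    return tp / (tp + fn) if tp + fn else 0.0
```

For instance, with hypothetical labels `y_true = [1, 1, 0, 0, 0, 0, 0, 1]` and predictions `y_pred = [1, 1, 1, 0, 0, 1, 0, 1]`, every failure is caught (recall 1.0), but two of the five alarms are false (precision 0.6).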

Other tools are frequently used for classifier performance evaluation. A generic method, not specifically designed for imbalanced data, uses the receiver operating characteristic (ROC) curve. It represents the classifier performance over combinations of true positive and false positive rates. The true positive rate (TPR) is the same as the recall, and the false positive rate is defined as FPR = FP/(FP + TN). The objective is for the ROC curve to lie in the upper left-hand corner of the ROC space. A traditional metric for ROC curves is the area under the curve (AUC), which should be as close to one as possible. The AUC provides an aggregate measure of performance across all thresholds. The curve is obtained from the scores extracted from the classifier. Different confusion matrices are created, one per threshold, from which the points in the ROC space are generated. These points are then interpolated. This is a general method and it is not the most adequate for imbalanced data, as the ROC curve may provide too optimistic a view of the algorithm performance. For this case, an alternative is commonly used, the precision-recall (PR) curve. It is obtained similarly to the ROC curve, but computing precision and recall. Contrary to what happens with the ROC curve, the AUC of the PR curve does not make a relevant contribution. Generally, the comparison can be made by considering that the curves closer to the upper right corner have a better performance [15].
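The construction of the ROC curve from classifier scores can be sketched as follows: each distinct score is used as a threshold, one confusion matrix is computed per threshold, and the resulting points are joined by straight lines. The function names are hypothetical, and the sketch assumes both classes are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A PR curve can be produced with the same sweep by recording precision and recall at each threshold instead of FPR and TPR.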

2.9. Strategies for Imbalanced Data

Many techniques were developed to mitigate the problems with imbalanced datasets. Four different strategies were chosen to be used in this work: SVMs with weighted classes, undersampling, SMOTE and balanced bagging. The first two methods are classic and simple strategies, which is why they are tested. Two more advanced techniques are also used.

SVMs with weighted classes consist of an adjustment to the soft margin of the SVM algorithm, such that the weights for each class become inversely proportional to the class frequencies. This is intended to increase the penalty for minority class misclassification, preventing the minority class samples from being overwhelmed by the majority class.
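One way to obtain weights that are inversely proportional to the class frequencies is sketched below. The normalization by the number of classes is a common heuristic and an assumption here; the text only states the inverse proportionality, not the exact formula used.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical imbalanced labels: 90 normal samples (0) and 10 failures (1).
weights = balanced_class_weights([0] * 90 + [1] * 10)
```

In this hypothetical example a misclassified failure is penalized nine times more heavily than a misclassified normal sample, which mirrors the 9:1 class ratio.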

The sampling methods consist of modifying an imbalanced dataset to obtain a balanced distribution. In general, a balanced dataset achieves better performance than an imbalanced one [16]. The first sampling technique used in this work consists of undersampling the majority class by randomly removing samples from it until reaching the desired class size. The percentage of undersampling can be adjusted, for which cross-validation is used. The majority class can be reduced to a size similar to that of the minority class.
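Random undersampling can be sketched with the standard library alone. The function name and its parameters are hypothetical; `target_size` stands for the desired size of the majority class, the quantity tuned by cross-validation in the text.

```python
import random

def undersample(X, y, majority_label, target_size, seed=0):
    """Randomly keep only target_size samples of the majority class."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    kept_majority = set(rng.sample(majority, min(target_size, len(majority))))
    keep = [i for i, label in enumerate(y)
            if label != majority_label or i in kept_majority]
    return [X[i] for i in keep], [y[i] for i in keep]
```

All minority samples are preserved; only the majority class loses data, which is the source of the variance discussed later in this section.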

The authors of [17] proposed a combination of undersampling the majority class and oversampling the minority class, called synthetic minority oversampling technique (SMOTE). The oversampling of the minority class is done by creating synthetic data points. These are introduced along the line segments joining a minority sample to any or all of its k nearest minority class neighbors. According to the amount of oversampling needed, the necessary neighbors are randomly chosen from these k neighbors. Using this method, the decision region of the minority class becomes more general. The undersampling of the majority class is done by randomly removing samples from it until achieving the desired class size. Finally, the selection of the parameters (undersampling and oversampling rates, and the number of neighbors) can be done using cross-validation.
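The oversampling half of the method can be sketched as an interpolation between a minority sample and one of its nearest minority neighbors. This is a simplified illustration rather than the reference algorithm from [17]: the base sample is drawn at random instead of iterating over all minority samples, and the function name and parameters are hypothetical.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points on segments joining minority samples
    to randomly chosen neighbors among their k nearest minority neighbors.

    Requires at least two minority samples, each a list of floats.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic
```

Because every synthetic point lies between two existing minority samples, the minority region is filled in rather than merely duplicated, which is what makes the decision region more general.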

Undersampling the majority class is a good strategy, but it suffers from high variance. This leads to different performances of classifiers over individual bootstrap samples. The solution proposed in [18] is to use the bagging (bootstrap aggregating) variance-reduction ensemble method. The balanced bagging method can be summarized as follows: an ensemble composed of a given number of models is built, each induced over an undersampled bootstrap sample of the training data. The bootstrap samples are generated by sampling randomly with replacement, and then the majority class is undersampled by removing instances randomly until the classes are balanced. When a new instance is to be classified, each model makes a prediction, and the final prediction is obtained by majority vote. In general, this approach will improve classifier performance when the models of the ensemble are high-variance, which is the case here. This method is usually applied with decision trees as the base classifiers, which will also be the case in this work. Finally, cross-validation can be used to select the parameters needed: the number of models and the size of the bootstrap samples.
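The ensemble procedure summarized above can be sketched as follows. The base learner is passed in as a function; `fit_stump`, a one-threshold classifier on a single feature, is a hypothetical stand-in for the decision trees used in this work, and the bootstrap size is fixed to the training set size for brevity.

```python
import random
from collections import Counter

def balanced_bootstrap(y, rng):
    """Bootstrap the indices with replacement, then undersample to balance classes."""
    n = len(y)
    boot = [rng.randrange(n) for _ in range(n)]
    by_class = {}
    for i in boot:
        by_class.setdefault(y[i], []).append(i)
    size = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.sample(idx, size))  # randomly drop surplus instances
    return sample

def balanced_bagging(X, y, fit, n_models=11, seed=0):
    """Train n_models base learners and return a majority-vote predictor."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = balanced_bootstrap(y, rng)
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    def predict(x):
        return Counter(model(x) for model in models).most_common(1)[0][0]
    return predict

def fit_stump(X, y):
    """Hypothetical base learner: threshold halfway between the class means."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    if not pos or not neg:  # degenerate bootstrap containing a single class
        constant = 1 if pos else 0
        return lambda x: constant
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0
```

Each model sees a different balanced subset, so the vote averages out the dependence on which majority samples happened to be kept.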

3. Implementation

3.1. Exploratory Data Analysis (EDA)

After extracting the data from the files using MATLAB, the EDA was performed. This analysis was very important as it allowed various problems with the data to be detected. At the same time, it provided a useful overview of the data, in particular concerning the most important events. In summary, some of the contributions of EDA were:

• It was found that the first days of the CCU variables contained faulty data, which had to be removed.
• Each type of data has its own missing pattern, except the TCU variables of context, which do not have missing data.
• Regarding the CCU variables, 2 out of 19 are constant; for the TCU variables of context, 485 out of 939 are constant.
• The ranges of the variables are very diverse, as they represent many different parameters (most of them with unknown meaning).
• The important events were registered 55 times (one of these four events never occurred). M1 registered 71% of these events.

3.2. Problem Formulation

Several steps were needed to achieve the final representation of the problem. The main details that were worked out and included are presented below:

1. There are two severe and two very severe events which represent the failures to be predicted. Each of the four traction groups has two individual events (one severe and one very severe).
2. The very severe event is always followed by a severe event; a severe event can happen without any very severe event.
3. In some situations, a train reset done by the driver may recover from a severe event (perhaps preceded by a very severe event) to a normal condition.
4. There are two types of failure: permanent and non-permanent. It is possible to identify a permanent failure by reading its indication in the end timestamp, but only from raw files (not pre-processed by the RCM system).
5. The repair of a permanent failure is only done by maintenance and is not visible from the data available; maintenance information is needed.

The considerations above lead to the state diagram shown in Figure 3. These states can be inferred from the data and are later used as the label for the machine learning algorithms.


Fig. 3: Diagram of the final problem formulation. Both severe and very severe events can happen in a normal condition state. A very severe event causes a change to the F2 failure state, from which a severe event is always generated, leading to the F1 state. It is also possible to go directly to the F1 state if only a severe event occurs. In F1, a reset could be enough to return to OK. If not, the event indicates that it is a permanent failure, leading to a PF state, which requires maintenance to fix it. (Diagram states: OK, F2, F1, PF; transition labels: AA/AB, AC/AD, Special AA/AB, Reset, Maintenance.)
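The state diagram can be read as a small transition table. The sketch below encodes the transitions described in the caption; the event names (`very_severe`, `severe`, `reset`, `permanent`, `maintenance`) are hypothetical labels chosen for illustration, and the return from PF to OK after maintenance is an assumption based on the statement that maintenance fixes the permanent failure.

```python
# (current state, event) -> next state, following the description of Fig. 3.
TRANSITIONS = {
    ("OK", "very_severe"): "F2",
    ("OK", "severe"): "F1",
    ("F2", "severe"): "F1",
    ("F1", "reset"): "OK",
    ("F1", "permanent"): "PF",
    ("PF", "maintenance"): "OK",  # assumption: maintenance restores normal condition
}

def next_state(state, event):
    """Return the new state, or stay in place if the event does not apply."""
    return TRANSITIONS.get((state, event), state)

def label_sequence(events, state="OK"):
    """Replay a chronological list of events and return the state after each one."""
    states = []
    for event in events:
        state = next_state(state, event)
        states.append(state)
    return states
```

Replaying the event log of one traction group through such a table yields the state at each timestamp, which is the quantity later used as the label.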

3.3. Dataset Preparation

The systems of interest are the traction groups and not the entire train or vehicles. This is why it makes sense to build four blocks of data, one per group. For each vehicle, the different sources contain both general data associated with the whole vehicle and data concerning each traction group. The CCU data (variables and events) is common to both traction groups of the vehicle, while the TCU data can be divided into data which is common to the whole vehicle and data specific to each traction group. These considerations result in the four blocks of data shown in Figure 4. The left part of each block contains the data related to the whole vehicle, while the right part is composed of the data specific to each traction group. This means that the information in the left part is duplicated, to ensure that the common information exists in the rows of both traction groups of each vehicle.

Fig. 4: Division of data into blocks. The data can be divided into four blocks, each one corresponding to a specific traction group. The blocks of traction groups belonging to the same vehicle share the data which is common to the whole vehicle. (Diagram: blocks M1,G1 and M1,G2 share the M1 vehicle data; blocks M2,G1 and M2,G2 share the M2 vehicle data.)

The construction of the dataset is based on this division. Each row of the dataset (associated with a timestamp) is composed of all types of data by aggregating the features coming from each type. Every variable, event and variable of context is represented by a column in the dataset. Events are encoded as binary variables: for each event, its associated column is one while the event is happening and zero otherwise. Finally, for each block, obtaining the label consists of finding the state of the corresponding traction group, according to the state diagram.
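The binary encoding of events can be illustrated with a short sketch. Timestamps are represented here as plain numbers and event occurrences as (start, end) pairs; treating the end timestamp as inclusive is an assumption, as is the function name.

```python
def encode_event(timestamps, intervals):
    """Return 1 for each timestamp inside any [start, end] interval, else 0."""
    return [int(any(start <= t <= end for start, end in intervals))
            for t in timestamps]

# Hypothetical event active from t=1 to t=2 and again at t=4.
column = encode_event([0, 1, 2, 3, 4, 5], [(1, 2), (4, 4)])
```

Since a timestamp without the event is coded as zero rather than left empty, event columns never contribute missing entries to the dataset.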

The CCU variables are collected continuously, while the remaining types are registered only when failures occur. This implies that most rows of the dataset contain only the CCU variables, and these correspond to a small fraction of the features. Most features are related to the TCU variables of context, which are recorded far less frequently. This implies that their columns have missing entries for most rows of the dataset. The case of events is different: they never have missing entries, since the timestamps without a specific event are coded as zeros in the respective column. Considering the relative share of the variables of context (85.4%) in the number of features, it is easy to understand why most entries (85.24%) of the dataset are missing. Finally, the dataset has 780852 rows and 479 columns.

3.4. Dimensionality Reduction

This dataset has two significant characteristics: the high number of features and the large amount of missing entries. These two challenging properties motivate the use of GLRM, which is used for both dimensionality reduction and imputation of missing values. In order to use GLRM, several inputs to build the model need to be considered. The inputs to be selected are: the rank k, the regularizer r(y) and the scales of both regularizers, $\gamma$ for r(x) and $\tilde{\gamma}$ for r(y). The k-fold cross-validation technique is applied for model selection.

Due to the time requirements of GLRM on such a large dataset, some non-ideal solutions had to be used. The first consists of using only a small, randomly chosen subset of the entire dataset and then applying the k-fold cross-validation technique to this small portion. In addition, since there are several possibilities to combine the parameters to choose, the number of folds would be enormous. The strategy used consists of gradually refining the parameters until finding the best model, by applying k-fold cross-validation as many times as necessary, up to a certain degree rather than refining indefinitely. The performance metric used to evaluate each model is the normalized reconstruction error.

The first tests led to the conclusion that the quadratic regularization performs better than the sparsity-inducing one, which is why the latter was discarded. Regarding the rank k, the training error decreases when increasing the rank, as expected. However, the test error does not behave as expected, that is, decreasing at first and then increasing. For some tests, it slightly decreases in the beginning and then increases; in other cases, it is always increasing or oscillating. This unexpected behavior can be caused by optimization errors or by the randomness of the samples in the test set. Possibly the data used for testing in cross-validation is particularly bad in some cases, while in other test sets it is especially good. This problem would be mitigated if more data could be used for cross-validation, which is not possible due to time constraints. Finally, the quadratic regularization leads to a well-conditioned problem, which is why it is preferred.

Figures 5 and 6 present the training and test errors, respectively, for the final refinement using quadratic regularization on both X and Y. They show what was discussed above: the training errors behave as expected while the test errors do not. From these results, the final model is selected to have rank k = 25 and quadratic regularization with scales $\gamma = 0$ for r(x) and $\tilde{\gamma} = 2$ for r(y). This represents a considerable reduction in the number of features when compared to the original dataset (25 instead of 479 columns), meaning that the size of matrix X is 5.22% of the original dataset size.

The value obtained for the regularization scale on X, $\gamma = 0$, implies that the regularization term on X vanishes. We are aware that this may be biasing our solution. Nevertheless, this value was obtained from cross-validation using the normalized reconstruction error as the evaluation metric, which is why we consider it an adequate solution.

Fig. 5: Cross-validation training errors using quadratic regularization. The training errors decrease while increasing the rank k, as expected. (Plot: training error, logarithmic axis, versus rank k from 10 to 50; curves Q(0)+Q(1.5), Q(0)+Q(2), Q(0)+Q(2.5) and Q(0)+Q(3).)

After selecting the model, the final step consists of applying the algorithm with these inputs to the entire dataset to get the low rank approximation. Then, the matrix X is used in place of the original dataset. This representation yields a much more compact dataset without missing entries. Again, GLRM time restrictions motivated a non-ideal approach and the stopping criteria had to be relaxed, otherwise the fitting process would take too much time. This causes the low rank approximation to be much less accurate than desired.

3.5. Classification

This Section addresses the final goal of this work, which is failure prediction. This challenge requires a new dataset, which is built from the original dataset.

Fig. 6: The test errors deviate from the expected behavior. The lowest test error for all ranks is achieved by the same curve (orange), from which the best model is selected. Cross-validation test errors using quadratic regularization. [Plot: test error versus rank k from 10 to 50, one curve per setting, with regularization scale 0 on X and 1.5, 2, 2.5 or 3 on Y.]

The scheme for failure prediction is shown in Figure 7, representing how to extract the data from the original dataset. The green block represents the data points which are used to predict whether there will be a failure afterwards. The blue block contains the labels of the upcoming data. The orange square is the label used for prediction, which is one if there is any failure in the blue block and zero otherwise. The data points from the green block are then vectorized to form a row of the new prediction dataset, whose corresponding label is the value of the orange block. These steps are repeated by sliding in time, an operation that is represented by the dashed blocks, each of which produces a new row. The dataset for prediction is built by moving forward over the original dataset until reaching the end, that is, when the blue block reaches the last data point. Note that this operation is done separately for each traction group, using the corresponding data.

The time window for prediction and the horizon to be predicted are both set to one week. The sliding window step used is one day, which yields more weeks to train the algorithm. Finally, no delay is used, meaning that the time windows are consecutive: data from one week is used to predict whether there is a fault in the next week. These blocks are defined to always begin at midnight, meaning that this tool can be used whenever a new day begins.
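The construction described above can be sketched as follows. This is a simplified illustration rather than the code developed for this work: it assumes that the data of each day has already been reduced to a fixed-length vector, and the names `build_prediction_dataset`, `daily_features` and `daily_failure` are hypothetical.

```python
import numpy as np


def build_prediction_dataset(daily_features, daily_failure,
                             window=7, horizon=7, slide=1, delay=0):
    """Build (rows, labels) for one traction group by sliding in time.

    daily_features: array (n_days, d), one fixed-length vector per day
                    (an assumption made to keep the sketch short).
    daily_failure:  boolean array (n_days,), True if a failure occurred.
    All durations are expressed in days.
    """
    daily_features = np.asarray(daily_features, dtype=float)
    daily_failure = np.asarray(daily_failure, dtype=bool)
    n_days = len(daily_features)
    rows, labels = [], []
    start = 0
    # Stop when the horizon block would run past the last data point.
    while start + window + delay + horizon <= n_days:
        h0 = start + window + delay
        # Green block: vectorize the window into a single row.
        rows.append(daily_features[start:start + window].ravel())
        # Orange square: 1 if any failure occurs in the blue (horizon) block.
        labels.append(int(daily_failure[h0:h0 + horizon].any()))
        start += slide
    return np.array(rows), np.array(labels)
```

With 20 days of data and the settings used here (one-week window, one-week horizon, one-day slide, no delay), this yields 7 rows, and a single failure day is reflected in the label of every row whose horizon contains it.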

Fig. 7: New rows for the prediction dataset are generated by applying the same method when moving forward in time. Scheme for prediction. [Diagram labels: window, delay, horizon and slide along the time axis.]


As expected, the weeks do not have the same number of data points. The size of each week varies significantly, from approximately 6,000 to 18,000 data points. When building the dataset for prediction, the ideal is to have synchronous data points for every week. This makes the features of the prediction dataset correspond to equivalent times of the day, preserving the time characteristics. However, this is not achievable when the weeks differ so much in size. Figure 8 represents the data points for some weeks over time. There are big holes inside several weeks, and these are propagated over several days because of the sliding window of one day.

Fig. 8: Several weeks have big gaps; these are propagated because of the sliding window of one day. Even the other weeks are very irregular. Distribution of data from weeks of M1,G1.

The limitations related to GLRM imply that no solution involving oversampling can be used, which would be the desired way to deal with the problem of synchronization. The solution found to cope with this problem is undersampling. The week is divided into slots, whose size can be selected. For each slot, the minimum number of points across all weeks is computed. Then, points are removed from every week until all weeks have that same number of points in the slot. The minimum can be zero, implying that all points are removed from that slot. In order to minimize the damage of removing data, the algorithm chooses the points using a semi-random criterion. Firstly, it randomly removes points originally coming from the CCU data. Then, if more points need to be removed, it randomly removes points originally coming from the TCU data. This keeps as much TCU data as possible, as it is the more relevant one (traction specific data).
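The slot-based undersampling can be sketched as below. This is an illustrative reading of the procedure, not the code developed for this work: the data layout (a dictionary per week with timestamps `t` in seconds since the start of the week and a source tag `src` equal to "CCU" or "TCU") and the function name `undersample_weeks` are assumptions.

```python
import numpy as np


def undersample_weeks(weeks, slot_seconds=7200, week_seconds=7 * 24 * 3600,
                      seed=0):
    """Return, for each week, the indices of the points that are kept.

    weeks: list of dicts with NumPy arrays 't' (seconds since week start)
           and 'src' ("CCU" or "TCU"); this layout is hypothetical.
    """
    rng = np.random.default_rng(seed)
    n_slots = week_seconds // slot_seconds
    slot_ids = [np.minimum(w["t"] // slot_seconds, n_slots - 1).astype(int)
                for w in weeks]
    counts = np.array([np.bincount(s, minlength=n_slots) for s in slot_ids])
    # Target size of each slot: the minimum over all weeks (possibly zero).
    target = counts.min(axis=0)

    kept = []
    for w, s in zip(weeks, slot_ids):
        keep = []
        for slot in range(n_slots):
            idx = np.flatnonzero(s == slot)
            n_remove = len(idx) - target[slot]
            ccu = idx[w["src"][idx] == "CCU"]
            tcu = idx[w["src"][idx] == "TCU"]
            # Remove CCU points first, then TCU points only if still needed.
            drop_ccu = rng.permutation(ccu)[:n_remove]
            drop_tcu = rng.permutation(tcu)[:n_remove - len(drop_ccu)]
            drop = np.concatenate([drop_ccu, drop_tcu])
            keep.append(np.setdiff1d(idx, drop))
        kept.append(np.concatenate(keep))
    return kept
```

After this step every week has the same number of points in each slot, so the vectorized weeks have equal length and approximately aligned times of day, up to the slot duration.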

Figure 8 shows the distribution of data, in which several big holes are visible. Even using big slots (on the order of hours), at least one week would not have any point in each slot. A solution needs to be designed to overcome this obstacle, and it consists of two parts. Firstly, the weeks containing big holes are removed and not used. For this, a minimum number of data points is used as the criterion. Most weeks having big holes coincide with longer maintenance periods. The second part of the solution consists of choosing a longer slot duration. Removing the weeks containing big holes doesn't solve the problem by itself, since one week not having data in a slot is enough to force all the data in that slot to be removed. Therefore, the solution uses a slot of two hours, which is much bigger than desired but, considering the context, not so long that the synchronization is severely affected. Figure 9 shows the distribution of data after applying this undersampling approach. After performing undersampling, the week size is much smaller than the mean week size before removing data. It represents a reduction of more than 50%.

Fig. 9: The distribution becomes generally uniform using the undersampling approach. Distribution of data using the undersampling method.

At this point, the dataset obtained by applying undersampling is ready for classification. Regarding the label, the dataset is more or less balanced (57%/43%). The dataset was not divided randomly into training and test sets. Instead, the most recent weeks of each traction group were used as the test set. This has two purposes. Firstly, it avoids training and test weeks containing data points in common, which could compromise the confidence in the results. Such overlap happens because of the sliding window, as explained before, which causes the same data to appear in several weeks. Even using the most recent weeks as the test set, some weeks have to be removed to prevent the test set from sharing data with the training set, again because of the sliding window. Secondly, this approach simulates the real application of the algorithm: receiving unseen data from the train and trying to make a prediction for the future based on historical data.
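A minimal sketch of such a temporal split is given below. It assumes that the rows of one traction group are ordered in time and that each row spans the window plus the horizon (14 days in total); whether the original removal of weeks considered the horizon as well as the window is not stated, so `span_days` is left as a parameter. The function name `temporal_split` is hypothetical.

```python
import math

import numpy as np


def temporal_split(n_rows, n_test, span_days=14, slide_days=1):
    """Split time-ordered rows into train/test indices without shared data.

    Row i is assumed to cover days [i * slide_days, i * slide_days + span_days).
    The last n_test rows form the test set; the rows just before them whose
    span overlaps the first test row are discarded.
    """
    first_test = n_rows - n_test
    # Number of preceding rows that still overlap the first test row.
    gap = math.ceil(span_days / slide_days) - 1
    train_idx = np.arange(0, max(first_test - gap, 0))
    test_idx = np.arange(first_test, n_rows)
    return train_idx, test_idx
```

For example, with 40 rows and 5 test rows, the 13 rows preceding the test set are dropped, so the last training row ends exactly on the day before the first test row begins.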

4. Results

This Section presents the main results of this work, obtained using the dataset for failure prediction.

SVMs were tested without imbalance methods, for which a precision and recall of 0.33 and 0.40, respectively, were obtained using a linear SVM. These values are very far from the desired ones, which is why it makes sense to try an alternative approach. Even though the prediction dataset is more or less balanced, the imbalance methods discussed before were tested. Figure 10 presents the results of the imbalance methods. These results represent an improvement when compared to the previous ones, obtained without strategies for imbalanced data. The accuracy does not seem to be significantly affected, but the same doesn't apply to precision and recall, which can be considered the most important metrics. The bagging technique achieves the highest values for both precision and recall, and also for accuracy (0.70). The precision is still low but the recall reaches its maximum (1.0). Although bagging cannot be considered an optimal solution due to its low precision, its recall value is very important. It means that all weeks having failures are correctly labeled, which is a relevant feature of this prediction algorithm. Figure 10 also shows both the ROC and PR curves. These curves confirm bagging as the best solution among the strategies tested, achieving the highest AUC in the ROC curve and being the closest curve to the upper right corner in the PR curve.

(a) Bagging has the highest AUC (0.83), outperforming the remaining methods: weighted SVM (0.57), undersampling (0.53) and SMOTE (0.57). ROC curves. [Plot: true positive rate versus false positive rate for weighted SVM, undersampling, SMOTE and bagging.]

(b) The bagging curve is the closest to the upper right corner, outperforming the other strategies. PR curves. [Plot: precision versus recall for the same four methods.]

Fig. 10: Both ROC and PR curves verify bagging as the best solution. ROC and PR curves for imbalance methods (markers representing the observed values).
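For reference, the area under a ROC curve such as those in Figure 10(a) can be computed from classifier scores as sketched below. This is a generic NumPy illustration, not the evaluation code used in this work; the function name `roc_curve_auc` is hypothetical and the example scores in the usage note are made up.

```python
import numpy as np


def roc_curve_auc(y_true, scores):
    """Return (fpr, tpr, auc) for binary labels and real-valued scores."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Sweep the decision threshold from the highest score to the lowest.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    s_sorted = scores[order]
    tps = np.cumsum(y_sorted)
    fps = np.cumsum(~y_sorted)
    # Keep one point per distinct score (last index of each group of ties).
    last = np.r_[np.flatnonzero(np.diff(s_sorted)), len(s_sorted) - 1]
    tpr = np.r_[0.0, tps[last] / y_true.sum()]
    fpr = np.r_[0.0, fps[last] / (~y_true).sum()]
    # Trapezoidal rule for the area under the curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
    return fpr, tpr, auc
```

As a usage example, `roc_curve_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives an AUC of 0.75, while a classifier that ranks every failure week above every normal week gives 1.0.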

One must conclude that the results are far from what is expected for this type of application. Even so, the analysis of the results must take into account all the limitations. For example, it is not known how noisy the data is. As happened with the temperature data discarded at the EDA stage, other sources may be introducing relevant noise which was not detected due to the huge number of variables (most of them with unknown meaning).

5. Conclusions

This work addressed the problem of failure prediction on railway traction systems, for which a data-driven approach was used to design a solution, making use of data science and machine learning tools. This work resulted from the desire to find an innovative solution for a real engineering problem. For this reason, it involved a challenging process of interpretation. Naturally, this kind of process with real data gives rise to several difficulties. The EDA was very important: it allowed irrelevant variables to be discarded and, at the same time, previously unknown problems to be detected.

The process of joining all the types of data into a single dataset ended up with a considerable number of features and a huge percentage of missing entries. These characteristics motivated the use of GLRM, both for dimension reduction and for imputation of missing values. Despite being a powerful tool to perform these steps, GLRM also has some drawbacks, in particular the time required for each fit. These time restrictions motivated some undesired changes in order to obtain results in a reasonable time and proceed to the next phases of the work, implying a much less accurate approximation. Nevertheless, GLRM made it possible to find a low rank approximation with a much lower number of features while, at the same time, performing the imputation of an extreme number of missing entries.

When designing the solution for failure prediction, an oversampling approach was considered, which would maintain the synchronism of the data. Even so, this strategy could not be adopted, again because of time constraints. An alternative approach consisting of undersampling was used, in which the new dataset was built from the approximation of the original dataset. Several consequences resulted from this solution. Firstly, several weeks were not used due to the irregularity of the data in time, with some weeks having much less data than the others. Even discarding those weeks, more than 50% of the data was removed when synchronizing the data, despite using a relaxed synchronism criterion.

Even though the dataset used for failure prediction was relatively balanced, the use of methods for imbalanced data considerably improved the results. The precision and recall for the best-performing technique (bagging) reached 45% and 100%, respectively. This represents a lower precision than desired but a maximum value for recall, which is a very important feature for a failure prediction algorithm. The optimal recall means that all weeks having failures are correctly predicted. This guarantees that all failures in the test set are anticipated, allowing the system to be repaired before its breakdown. Moreover, it avoids the serious consequences of a failure occurring during operation, which can force the train to stop, causing high financial losses and compromising the entire operation schedule, which is even worse since it is a passenger train. On the other hand, the precision value below what one might expect implies that the algorithm incorrectly labels some weeks as having failures. This leads to unnecessary maintenance operations, negatively affecting the train availability. Nevertheless, these negative consequences are expected to have a lower impact than not predicting a true failure.

One can compare this solution with a baseline scenario of planned maintenance, assuming some simplifications: the baseline scenario only performs periodic maintenance activities and all failures imply repairs; all failures predicted by the proposed solution lead to maintenance activities, avoiding repairs in the cases of correct predictions. From a quantitative perspective, the test set contains 15 weeks of normal condition and 5 weeks having failures. The proposed solution detected all failures. However, this solution incorrectly predicted 6 weeks as having failures, adding 6 unnecessary maintenance activities to those of periodic maintenance. Considering a baseline scenario of planned maintenance, the 5 failures would lead to 5 repairs (which are all avoided using the proposed solution). Even so, the baseline scenario would require only the periodic maintenance activities. From an operational perspective, this solution represents a cost reduction if a repair costs at least 2.22 times more than a maintenance activity. From domain knowledge, we know an estimate for this relative cost, which is between 3.7 and 6.1. Therefore, considering the results on the test set, the proposed solution represents a cost reduction when compared to the planned maintenance strategy.
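The figures quoted above can be reproduced from the test-set counts with a short calculation. The sketch below is an illustration under the stated simplifications, not code from this work: it assumes that each predicted failure costs one maintenance activity, that each missed failure costs one repair, and that periodic maintenance is common to both scenarios and therefore cancels out. The function name `summarize` is hypothetical.

```python
def summarize(tp, fn, fp, tn):
    """Metrics and break-even cost ratio from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    false_positive_rate = fp / (fp + tn)
    # Baseline: (tp + fn) repairs. Proposed: (tp + fp) maintenance activities
    # plus fn repairs. The proposed solution is cheaper when
    # repair_cost / maintenance_cost exceeds (tp + fp) / tp, i.e. 1 / precision.
    break_even_ratio = (tp + fp) / tp
    return precision, recall, accuracy, false_positive_rate, break_even_ratio


# Test set: 5 failure weeks, all detected; 15 normal weeks, 6 flagged wrongly.
precision, recall, accuracy, fpr, ratio = summarize(tp=5, fn=0, fp=6, tn=9)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.2f} FPR={fpr:.2f} break-even={ratio:.2f}")
```

Under this reading, the counts give a precision of about 0.45, a recall of 1.0, an accuracy of 0.70 and a false positive rate of 0.40, and a break-even ratio of 11/5 = 2.2. The reported value of 2.22 coincides with 1/0.45, that is, the same ratio computed from the rounded precision; either value is well below the estimated relative cost of 3.7 to 6.1.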

Despite having achieved significant results, one would expect to obtain better results if the preferred alternatives could be used. All the code developed is prepared to perform those alternatives, but the time constraints did not allow those approaches to be pursued. Having more time would likely contribute to a better solution, but the data itself also plays an important role. Firstly, using more data would be expected to improve the model. This work was restricted to the data available at the time. In addition, the quality of the data is also a crucial aspect, as noisy data can severely influence the results. One must conclude that important results were achieved, especially considering all the limitations. Finally, most of this work is not specific to railways. These tools can be applied in a wide range of domains and applications.

Further improvements could be achieved by considering the temporal correlations. In addition, there are various other methods that can be used for different phases of this work. Regarding the strategies for failure prediction, other approaches could be tested, such as outlier detection and time series analysis.

References

[1] P. Umiliacchi, D. Lane, F. Romano, and A. SpA. Predictive maintenance of railway systems using an ontology based modelling approach. In Proceedings of 9th world conference on railway research, pages 22–26, 2011.

[2] Y. Peng, M. Dong, and M. J. Zuo. Current status of machine prognostics in condition-based maintenance: a review. The International Journal of Advanced Manufacturing Technology, 50(1-4):297–313, 2010.

[3] A. Mohsen. Condition based maintenance. 2017.

[4] H. Li, D. Parikh, Q. He, B. Qian, Z. Li, D. Fang, and A. Hampapur. Improving rail network velocity: A machine learning approach to predictive maintenance. Transportation Research Part C, 2014.

[5] C. Yang and S. Letourneau. Learning to predict train wheel failures. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 516–525. ACM, 2005.

[6] R. Sipos, D. Fradkin, F. Moerchen, and Z. Wang. Log-based predictive maintenance. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 2014.

[7] C. Soares. Processing big data - exploratory data analysis, lecture notes, 2018.

[8] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.

[9] H. Abdi and L. J. Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.

[10] M. Udell, C. Horn, R. Zadeh, S. Boyd, et al. Generalized low rank models. Foundations and Trends® in Machine Learning, 9(1):1–118, 2016.

[11] M. Udell and S. Boyd. PCA on a data frame, 2015.

[12] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674. ACM, 2013.

[13] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin. Learning from data. AMLBook New York, 2012.

[14] J. S. Marques. Machine learning, lecture notes, 2017.

[15] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240. ACM, 2006.

[16] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9):1263–1284, 2008.

[17] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.

[18] B. C. Wallace, K. Small, C. E. Brodley, and T. A. Trikalinos. Class imbalance, redux. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 754–763. IEEE, 2011.
