


Extracting Anomalies from Time Sequences Derived from Nuclear Power Plant Data by Using Fixed Width Clustering Algorithm

Aditya Gupta, Durga Toshniwal

Department of Computer Science & Engineering
Indian Institute of Technology Roorkee
Roorkee, India
{adityag, durgatoshniwal}@gmail.com

Pramod K. Gupta, Vikas Khurana, Pushp Upadhyay
C&I and R&D-ES
Nuclear Power Corporation of India Ltd.
Mumbai, India
{pkgupta, vkhurana, pupadhyay}@npcil.co.in

Abstract— A time series is data recorded at successive points in time. In this paper we analyze time series data provided to us by the Nuclear Power Corporation of India. We aim to find anomalies, correlations and patterns in the time series. In a nuclear reactor, anomalies can be generated for various reasons, and it is important to identify them so that the cause of each anomaly can be found and corrective action can be taken. To analyze the dataset we have used the Fixed Width Clustering Algorithm, and while using this algorithm we propose a dynamic method for deciding the cluster width used in clustering. We have also identified correlations between parameters in the dataset and have cross-checked all our results.

Keywords: Time Series; Anomaly Detection; Fixed Width Clustering Algorithm

I. INTRODUCTION

Data mining aims to find hidden, implicit and useful patterns in data. Most of the data generated in business, industry and academia is time series data, and hence the problem of anomaly detection in time series databases is very interesting [2, 4, 7, 9]. A time series is data recorded at successive points in time, so a time series has an implicit temporal ordering [2]. This temporal ordering makes time series special, since ordinary databases do not have any natural ordering. Anomaly detection in time series datasets is a non-trivial problem because datasets show a high degree of variance among themselves. They may vary in their size, the dimensionality of the data, the type of data, etc., and this results in the need for different anomaly detection algorithms for different types of datasets [6].

In our experiments, we have analyzed two different datasets provided to us by the Nuclear Power Corporation of India. These datasets are very different from each other and require different algorithms for identifying patterns and anomalies in them.

The first dataset, named the DNM dataset, has two sub-datasets in it. The first sub-dataset, 'DP1', has data for 5 years, from 2006 to 2010. The second sub-dataset, 'DP4', has data for the years 2006-2010. Each of these sub-datasets contains readings for a number of days, and each day's readings consist of 28 sensors with 14 readings per sensor. To analyze this dataset, we used the Symbolic Aggregate Approximation (SAX) [1] algorithm and the Heuristically Ordered Time Series [2] algorithm. The results of the analysis of this dataset have been presented in [10].

In this paper, we present the analysis of the second dataset. The second dataset, named the Packet Dataset, is a much larger dataset and has a very high dimensionality compared with the first dataset. It contains about 25 million data points spread over around 300 time series. To analyze this high-dimensionality data, we have used the Fixed Width Clustering algorithm [4].

Anomaly detection using the brute force method on SAX is an O(n²) algorithm [2]. We used it for the first dataset because it is very accurate and allows dimensionality/numerosity reduction of the original dataset. Hence we can reduce the size of the original dataset before processing it. We also optimized this algorithm before applying it.

Because of the high complexity of the heuristically ordered time series algorithm, we cannot use it on the second dataset, since the second dataset is very big. We have used the Fixed Width Clustering Algorithm because of its low complexity; it requires only one pass through the data. The complexity of fixed width clustering is O(cn), where c is the number of clusters formed and n is the size of the dataset. In the best case its complexity is O(n), while in the worst case it is O(n²); in the average case, for an appropriate value of the cluster width, c is significantly smaller than n [4].

In a nuclear reactor, anomalies can be generated for various reasons, and it is important to identify the anomalies so that the cause of each anomaly can be found and corrective action can be taken. We also aim to identify patterns such as correlations among parameters, the most anomalous sensors, long-term changes in readings, and dependence between parameters, so that these can be used in the predictive maintenance of a nuclear power plant.

The remainder of the paper is organized as follows. Section 2 contains the related work and explains the process of anomaly detection using Fixed Width Clustering. Section 3 explains the proposed framework and gives details of the dataset that we analyze. Section 4 contains the experimental results and discussion. Section 5 contains the conclusions.

II. RELATED WORK

Clustering for anomaly detection is a well-documented problem. In [1] and [2], an anomaly detection algorithm using the symbolic aggregate approximation algorithm is described. That algorithm is well suited for datasets containing relatively few values but with high dimensions. We cannot use it in this experiment because our dataset has a large number of values with a relatively small number of attributes. Many methods of anomaly detection are described in [7]. One particularly interesting algorithm among them is the fixed width clustering algorithm [4].

Fixed width clustering is explained in detail in [9] and [4]. The performance of this algorithm for anomaly detection depends greatly on the choice of cluster width and on the percentage of clusters that we label as anomalous. In anomaly detection using fixed width clustering, clustering is first performed on all the points. Then the elements of those clusters that contain very few points are declared anomalies, since these points are far from the other points; otherwise there would have been more points in their clusters.

The algorithm for fixed-width clustering is based on the outline in [4, 5]. Anomaly detection using fixed width clustering is a three-stage process: (1) normalization, (2) cluster formation, and (3) cluster labeling.

A. Normalization

To ensure that all attributes of the readings have the same influence when calculating the distance between two readings, we must rescale the attributes so that they are on a common scale. The most common technique for rescaling is normalization [6]. We normalized each feature by subtracting its mean from it and dividing by its standard deviation. The formula for normalization is shown in Equation 1.

x_new = (x_old - x_mean) / x_std    (1)

where x_new is the new value of the variable, x_old is the old value of the variable, x_mean is the mean value of the variable, and x_std is the standard deviation of the values of the variable.

Note that we normalize each attribute of a reading separately. For example, in our experiment each reading comprises [Packet_ID, Value], and Packet_ID and Value may be on very different scales. So we normalize all Packet_IDs by subtracting their mean from each Packet_ID reading and dividing by their standard deviation, and we repeat the same process for all 'Value' readings.
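A minimal sketch of this per-attribute normalization, assuming the readings of one time series are held in a NumPy array with one column per attribute (the function name is ours, not from the paper):

import numpy as np

def normalize(readings):
    """Z-score normalize each attribute (column) separately, per Equation 1.

    readings: array of shape (n_readings, n_attributes), e.g. columns
    [Packet_ID, Value] as in the experiment described above.
    """
    readings = np.asarray(readings, dtype=float)
    mean = readings.mean(axis=0)         # x_mean, one value per attribute
    std = readings.std(axis=0)           # x_std, one value per attribute
    std = np.where(std == 0, 1.0, std)   # guard against a constant attribute
    return (readings - mean) / std       # x_new = (x_old - x_mean) / x_std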

B. Clustering

The fixed width clustering algorithm takes a single parameter w as input from the user and performs clustering on the entire dataset. At the start of the algorithm the cluster set is empty. Then, for each new (normalized) data instance obtained from the dataset, we find the distance between it and each of the existing clusters, taking the centroids of the existing clusters as their points of reference. The cluster at the least distance is selected. If this distance is less than the cluster width that we have chosen, the new data instance is added to that cluster [4, 8]. Otherwise a new cluster is formed with the new data instance as its center. The algorithm is detailed in the following steps.

1. Initialize set S as the empty set. This will be the cluster set.
2. Get the next data instance from the dataset. Let it be called d.
3. If S is empty, create a new cluster with d as its centroid. Otherwise find the distance between d and each of the existing clusters, and find the cluster nearest to d; in other words, find a cluster C in S such that for all C1 in S, dist(C, d) <= dist(C1, d).
4. If dist(C, d) <= w, add d to C. Otherwise form a new cluster with d as its center.
5. Repeat steps 2, 3 and 4 until no instances are left in the dataset.
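The steps above translate almost directly into code. A sketch, assuming Euclidean distance; whether the centroid is updated as members are added or stays fixed at the founding point is not specified above, so the running-mean update here is one reasonable reading:

import numpy as np

def fixed_width_clustering(data, w):
    """Single-pass fixed width clustering (steps 1-5 above).

    data: iterable of normalized points (1-D arrays of equal length)
    w:    cluster width
    Returns a list of clusters, each a dict holding a centroid and its members.
    """
    clusters = []                                        # step 1: S = empty set
    for d in data:                                       # step 2: next instance d
        d = np.asarray(d, dtype=float)
        if clusters:
            # step 3: nearest cluster by distance to centroid
            dists = [np.linalg.norm(c["centroid"] - d) for c in clusters]
            i = int(np.argmin(dists))
            if dists[i] <= w:                            # step 4: within width w
                c = clusters[i]
                c["members"].append(d)
                n = len(c["members"])
                c["centroid"] += (d - c["centroid"]) / n  # running-mean centroid
                continue
        # empty S, or nearest cluster farther than w: start a new cluster
        clusters.append({"centroid": d.copy(), "members": [d]})
    return clusters                                      # step 5 via the loop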

C. Cluster Labeling

It is known that data instances that are similar to each other are close to each other, while those that are different are far apart. If the cluster width is correctly chosen, we obtain clusters with similar elements grouped together.

We assume that normal instances of data constitute a very large percentage (>= 98%) of the dataset. If this is true, then there is a high probability that clusters containing normal data will have many more instances associated with them than a cluster containing anomalies. We therefore label some percentage N of the clusters containing the largest numbers of elements as normal, and the rest as anomalous. In our experiment we have made the conservative assumption that 99% of the elements are normal, so only the last 1% of elements, which comprise the smallest clusters, are labeled as anomalous.
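A sketch of this labeling rule over the clusters produced above; whether the 1% budget may be slightly exceeded by the last cluster is not stated, so this version stops just under it:

def label_anomalies(clusters, anomaly_fraction=0.01):
    """Label members of the smallest clusters as anomalous until roughly
    anomaly_fraction of all points are covered (1% in our experiment).

    Works on the cluster list returned by fixed_width_clustering above.
    Returns the set of anomalous points (as tuples, for hashability).
    """
    total = sum(len(c["members"]) for c in clusters)
    budget = anomaly_fraction * total
    anomalous = set()
    # walk the clusters from smallest to largest
    for c in sorted(clusters, key=lambda c: len(c["members"])):
        if len(anomalous) + len(c["members"]) > budget:
            break
        anomalous.update(tuple(p) for p in c["members"])
    return anomalous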

There are certain problems that we have identified with the fixed width clustering algorithm and that we will try to address during our experiments. There is not much literature available on fixed width clustering, so we must verify our results. Also, we must formulate a method to decide the cluster width that we will be using.

III. PROPOSED FRAMEWORK

The Packet Dataset comprises 3 smaller, independent sub-datasets, labeled 'Project 1', 'Project 2' and 'Project 3'. Each of these sub-datasets contains data about a number of packets. These packets are identified by their 'Packet ID' (each sub-dataset covers a range of Packet IDs). Each of the packets has a large number of attributes that define the packet. Table 1 contains details of how many packets there are in each of the sub-datasets, how many attributes each packet has, and the total number of data points.

The proposed framework for the Packet Dataset is shown in Figure 1.

Figure 1. Framework for our experiment

In each of the sub-datasets, each attribute forms a time series. These time series consist of data points identified as [x, y], where x is the Packet ID to which the data point belongs and y is the value of that attribute in the packet with Packet ID x.

So in all we have 308 time series comprising 2,53,22,390 data points that we will be analyzing. In this research paper we present the results obtained by the analysis of the first sub-dataset.

TABLE 1: DETAILS OF THE THREE SUB-DATASETS THAT COMPRISE THE PACKET DATASET

Sub Dataset Label | No. of Packets | No. of Attributes | Total Data Points (Packets × Attributes)
Project 1 | 64,498 | 107 | 69,01,286
Project 2 | 95,409 | 96 | 91,59,264
Project 3 | 88,208 | 105 | 92,61,840

In the proposed framework (Figure 1), initially we are provided with the raw dataset. The raw dataset has around 2,53,22,390 values distributed over the 3 sub-datasets. These values are contained in 308 time series. So we write a program to extract all the time series into separate files, which are easier to process and analyze.

Then, for each of the time series, we normalize each of the attributes. This is done to ensure that each of the attributes has the same influence when we calculate the distance between two instances of a time series. To normalize we have used Equation 1. Once normalized, we perform clustering on the time series. The Fixed Width Clustering Algorithm is used for clustering. We have chosen this algorithm because of its low complexity: we are analyzing a very large dataset and cannot afford a high complexity.

The performance of fixed width clustering is greatly determined by the cluster width that we choose. To determine the cluster width, we have formulated an innovative method: we select the cluster width dynamically, depending on the dataset, so that an appropriate number of clusters is formed. How we determine this cluster width is explained in the next section.

After the clustering algorithm has been applied on the dataset, we must label clusters as 'Normal' or 'Anomalous'. We label the small clusters as anomalous, because we are working on the assumption that anomalies are few and different from the normal data points. We label clusters as anomalous until 1 percent of the data points have been labeled as anomalous; the rest are considered to be normal data.

Then we apply the anomaly detection algorithm and decide the packets where there might be possible anomalies in each of the time series. In the final step, we do statistical analysis of the possible anomalies to identify the support and confidence between the parameters of the packets. We also analyze the anomaly burst sizes and the distribution of anomalies.

Once the anomalies have been found, based on the results we find support and correlations between parameters, i.e. if two parameters show anomalies in the same packets, then they will have a higher support and confidence.

We also identify the anomaly bursts, that is, how many anomalies occur in bursts, and the shapes, sizes and locations of these bursts.

We have also cross-checked our results for correlation by finding the correlation between parameters based on their deviation from the mean value. If two parameters show similar trends while deviating from the mean, they are correlated. Since our earlier results for correlations were obtained by analyzing the anomaly results, we thereby indirectly validate the anomaly detection algorithm that we have used.

In the end, we try to find how an anomaly in one parameter affects another parameter. This process is called dependence. In this process we try to find patterns regarding when an anomaly in one parameter is followed by an anomaly in another parameter. There might be patterns regarding the dependence of anomalies between parameters.
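To make the flow of Figure 1 concrete, a condensed illustration of one time series passing through the helpers sketched in Section II; the file name and column layout are illustrative placeholders, and the width value shown is only a starting point (Section IV describes how w is actually tuned):

import numpy as np

# Illustrative run for one extracted time series; assumes the normalize,
# fixed_width_clustering and label_anomalies sketches from Section II.
series = np.loadtxt("project1_attr_000.csv", delimiter=",")  # columns [Packet_ID, Value]
normalized = normalize(series)
clusters = fixed_width_clustering(normalized, w=0.01)  # starting width; tuned dynamically later
anomalies = label_anomalies(clusters, anomaly_fraction=0.01)
print(f"{len(anomalies)} of {len(series)} readings flagged as anomalous")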



IV. EXPERIMENTAL RESULTS & DISCUSSION

This section contains the experimental results and discussion for the 'Project 1' sub-dataset of the Packet Dataset. As we have seen earlier, Project 1 has 64498 packets in it (Packet IDs 12200128-12264625), where each packet has 107 parameters associated with it.

A. Anomaly Detection

As discussed earlier, we use fixed width clustering to cluster the time series datasets. Then we label the packets in the clusters as normal or anomalous. Clusters containing anomalous packets contain very few elements. During this entire process, there are two parameters that we have to select and which affect the experiment outcome. The first parameter is w, the cluster width used in fixed width clustering. The second parameter is N, the percentage of packets that are considered to be normal. Thus we label (100-N) percent of packets as anomalous. These packets are contained in clusters that contain very few packets.

While selecting w, if we choose a very large value, then few clusters will be formed and all the clusters will contain a large number of elements. Hence we will not be able to identify clusters that contain few elements, and so we will not be able to identify anomalous packets. If we choose a very small value of w, then a very large number of clusters will be formed, where each cluster has very few packets. Again we will not be able to identify the small clusters, since every cluster will be small and hence every packet will be anomalous. This is obviously a wrong result. So, to choose the optimal value of w, we have written a program that dynamically selects w. We start with w = 0.01; based on the number of clusters being formed, w is incremented or decremented, and clustering is performed again until the number of clusters formed is in a pre-determined range.
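A sketch of this search, reusing the fixed_width_clustering function from Section II; the paper's program starts from w = 0.01 as stated above, but the exact increment/decrement rule is not given, so the doubling/halving step below is an illustrative assumption:

def select_width(data, min_clusters, max_clusters, w0=0.01, max_iters=50):
    """Adjust w until the cluster count falls in [min_clusters, max_clusters].

    Starts from w0 = 0.01 as in the text; the halving/doubling step is an
    illustrative assumption, not the paper's stated rule.
    """
    w = w0
    for _ in range(max_iters):
        clusters = fixed_width_clustering(data, w)
        if len(clusters) > max_clusters:
            w *= 2.0        # too many clusters: widen them
        elif len(clusters) < min_clusters:
            w /= 2.0        # too few clusters: narrow them
        else:
            break           # cluster count is in the acceptable range
    return w, clusters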

for acceptable When we are

ets contained in st cluster until we The reason being

nomalous and the us.

ed range, we see e are labeling as ormed. So if that se the acceptable t is very big, we rs acceptable.

oach and assume So in each time

most anomalous allest clusters!

TABLE 2: MOST ANOMALOUS PACKETS AND NUMBER OF PARAMETERS THAT ARE ANOMALOUS IN THEM

Packet ID | Anomaly Count
12255919 | 71
12255916 | 70
12255918 | 69
12255917 | 69
12255921 | 68
12255920 | 68
12255915 | 68
12255922 | 65
12255923 | 61
12255925 | 60
12255924 | 59
12255928 | 58
12255927 | 58
12255929 | 55
12255926 | 55

So, for each of the 107 parameters, we have identified the 644 packets that show the maximum anomaly for that parameter. On analyzing the results, we found that 33746 packets contain an anomaly in one or more parameters! That is, 52.3% of the packets have an anomaly in one or more parameters. We have identified the packets that contain the maximum number of anomalies, and the 15 most anomalous packets are shown in Table 2. So 12255919 is the most anomalous packet.

We have also analyzed the locations of occurrence of the anomalies, and we have found that all the big anomaly bursts occur in the last 25% of the packets. Figure 2 shows the distribution of the 5 biggest anomaly bursts.

Figure 2. Distribution of the 5 biggest anomaly bursts in the dataset

B. Support and Confidence among Parameters

Support and confidence are two very important parameters in statistics. They help us understand how two parameters are related to each other. Here we discuss support and confidence in the context of anomalies, i.e. if parameter X is anomalous, how does it affect parameter Y [3].

If X and Y represent two parameters, then Support(X → Y) is defined as the probability that there is an anomaly in parameter X and parameter Y simultaneously. Support is defined by Equation 2.



Support(X → Y) = P(X ∧ Y)    (2)

Another objective measure in statistical analysis is confidence, which assesses the degree of certainty of an anomaly in the second parameter given that there is an anomaly in the first parameter. Confidence is defined by Equation 3.

Confidence(X → Y) = P(Y|X)    (3)

It must be noted that in our experiment with the Packet Dataset, Confidence(X → Y) and Confidence(Y → X) are equal, since

P(Y|X) = P(X ∧ Y) / P(X)    (4)
P(X|Y) = P(X ∧ Y) / P(Y)    (5)

where P(X) is the probability of an anomaly occurring in X and P(Y) is the probability of an anomaly occurring in Y. Since we have already decided that after Fixed Width Clustering the bottom 1% of readings will be declared anomalous, both P(X) and P(Y) are equal to 1%.

After we have found which packets contain anomalous readings for each of the 107 parameters, we perform some statistical analysis on them. Using Equation 3 we find the confidence between parameters. Similarly, using Equation 2, we find the support between parameters. Table 3 contains the parameter pairs showing maximum support and confidence between them.
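A small sketch of this pairwise computation, assuming each parameter's anomalies are available as a set of Packet IDs (the data layout and names here are ours):

from itertools import combinations

def support_confidence(anomalies, total_packets):
    """Pairwise support and confidence from per-parameter anomaly sets.

    anomalies: dict mapping parameter -> set of anomalous Packet IDs
    Returns {(p1, p2): (support_pct, confidence_pct)}.
    Confidence is symmetric here because every parameter has the same
    number of anomalous packets (the bottom 1%), as noted in the text.
    """
    results = {}
    for p1, p2 in combinations(sorted(anomalies), 2):
        joint = len(anomalies[p1] & anomalies[p2])       # both anomalous
        support = 100.0 * joint / total_packets          # Equation 2
        confidence = 100.0 * joint / len(anomalies[p1])  # Equation 3: P(Y|X)
        results[(p1, p2)] = (support, confidence)
    return results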

TABLE 3: PAIRS OF PARAMETERS SHOWING MAXIMUM SUPPORT AND CONFIDENCE

Parameter 1 | Parameter 2 | Confidence (%) | Support (%)
119 | 82 | 74.689441 | 1.425354
79 | 7 | 74.068323 | 1.413501
48 | 45 | 70.341615 | 1.342381
54 | 51 | 69.875776 | 1.333491
57 | 54 | 67.236025 | 1.283115
47 | 41 | 66.304348 | 1.265335
47 | 44 | 66.149068 | 1.262372
48 | 42 | 65.372671 | 1.247555
53 | 50 | 64.751553 | 1.235702
54 | 42 | 63.975155 | 1.220885
44 | 41 | 63.975155 | 1.220885
62 | 60 | 63.819876 | 1.217922

On closer analysis of the above table we identify the following correlations:
Correlation 1: 119, 82
Correlation 2: 7, 41, 42, 44, 45, 47, 48, 50, 51, 53, 54, 56, 57, 79
Correlation 3: 62, 60
Correlation 4: 3, 28, 29, 30, 63, 64
Correlation 5: 39, 115, 116

C. Deviation from Mean for Each Parameter

We analyze the dataset based on the deviation from the mean value for each parameter. While calculating the mean value of a parameter, we do not include the values that have been identified as anomalous, so our mean is not distorted by anomalies. We then plot graphs showing how the parameter value deviates from its mean value. We thus obtain graphs like that in Figure 3, which shows the percentage deviation from the mean for parameter 60.

Figure 3. Percentage deviation from mean for parameter 60

We obtain such graphs for all the parameters.

Figure 4. Percentage deviation from mean for parameter 62

When we compare these graphs, we can validate the results that we obtained for the correlations above. Figure 4 shows the deviation from the mean for parameter 62. We can see how similar Figure 4 is to Figure 3, hence validating the result for the correlation between parameter 60 and parameter 62. By comparing Figure 5 and Figure 6, we can verify the correlation between parameter 119 and parameter 82.

Similarly, we have verified all the results for correlation between parameters that we obtained in Section B. Since these correlations were obtained using the results of anomaly detection, this indirectly verifies the process of anomaly detection using the Fixed Width Clustering Algorithm.
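A sketch of this computation for one parameter; the text does not give an explicit formula for percentage deviation, so the (value - mean)/mean form below is an assumption, as is indexing anomalies by reading position:

import numpy as np

def pct_deviation_from_mean(values, anomalous_ids):
    """Percentage deviation of each reading from the parameter's mean,
    where the mean excludes readings already flagged as anomalous so that
    it is not distorted by them (as described above).

    values: 1-D array of one parameter's readings, in Packet ID order
    anomalous_ids: indices of readings labeled anomalous
    """
    values = np.asarray(values, dtype=float)
    mask = np.ones(len(values), dtype=bool)
    mask[list(anomalous_ids)] = False      # drop anomalies from the mean
    clean_mean = values[mask].mean()       # assumes a nonzero mean
    return 100.0 * (values - clean_mean) / clean_mean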



Figure 5. Percentage deviation from mean for parameter 119

Figure 6. Percentage deviation from mean for parameter 82

V. CONCLUSION

We have formulated a method to determine the cluster width: we select the cluster width dynamically, depending on the dataset, so that an appropriate number of clusters is formed. If for a given cluster width a very large or a very small number of clusters is formed, the results will be inaccurate. So, unless the number of clusters formed is in a pre-determined range, we keep modifying the cluster width appropriately and re-perform the clustering.

For each of the parameters, we have identified the packets in which it shows an anomaly. With respect to each of the parameters, we have labeled the 1% most different packets as anomalous. We have identified the packets in which the maximum number of parameters show anomalies. Packet ID 12255919 in the Project 1 sub-dataset is the most anomalous, with anomalies in 71 parameters.

Based on the results of anomaly detection, we have found the support and confidence between parameters. For example, we have found that there is very high confidence and support between Parameter 119 and Parameter 82 in the Project 1 dataset. To cross-check our results for correlation between parameters, we have analyzed parameters with respect to their deviation from their mean values. By doing so, we have obtained graphs of deviation from the mean vs. Packet ID. The graphs of two parameters having very high correlation are very similar (almost identical) to each other, hence validating our earlier results of anomaly detection and correlation. Since these correlations were obtained using the results of anomaly detection, this indirectly verifies the process of anomaly detection using the Fixed Width Clustering Algorithm.

In future work, we will analyze the anomalies found to identify whether they occur in bursts and, if so, the sizes and positions of the anomaly bursts.

ACKNOWLEDGMENT

We would like to thank the Department of Atomic Energy for providing the domain knowledge and dataset and for partially funding the research work and program, Research Project Grant Number DAE-603-ECD.

REFERENCES

[1] Jessica Lin, Eamonn Keogh, Stefano Lonardi, Bill Chiu, "A symbolic representation of time series, with implications for streaming algorithms", Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, June 13, 2003, San Diego, California.
[2] Eamonn Keogh, Jessica Lin, Ada Fu, "HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence", Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 226-233, November 27-30, 2005. doi:10.1109/ICDM.2005.79.
[3] Larsen, R. J. & Marx, M. L. (1986), An Introduction to Mathematical Statistics and Its Applications, 2nd Edition. Prentice Hall, Englewood Cliffs, N.J.
[4] Faloutsos, C., Ranganathan, M., & Manolopoulos, Y. (1994), "Fast Subsequence Matching in Time-Series Databases", In Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, May 24-27, Minneapolis, MN, pp. 419-429.
[5] Chan, K. & Fu, A. W. (1999), "Efficient Time Series Matching by Wavelets", In Proceedings of the 15th IEEE Int'l Conference on Data Engineering, Sydney, Australia, Mar 23-26, pp. 126-133.
[6] Geurts, P. (2001), "Pattern Extraction for Time Series Classification", In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, Sep 3-7, Freiburg, Germany, pp. 115-127.
[7] Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2001), "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases", In Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA, May 21-24, pp. 151-162.
[8] Datar, M. & Muthukrishnan, S. (2002), "Estimating Rarity and Similarity over Data Stream Windows", In Proceedings of the 10th European Symposium on Algorithms, Sep 17-21, Rome, Italy.
[9] Keogh, E. & Kasetty, S. (2002), "On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration", In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada, pp. 102-111.
[10] Durga Toshniwal, Aditya Gupta, P.K. Gupta, V. Khurana, P. Upadhyay (2013), "Identifying Patterns and Anomalies in Delayed Neutron Monitor Data of Nuclear Power Plant", The 9th International Conference on Data Mining, July 22-25, Las Vegas, Nevada, USA. (Accepted)
