
Anomaly Detection Using Symmetric Compression

Benjamin Arai & Chris Baron

Computer Science and Engineering Department

University of California - [email protected]

[email protected]

What are anomalies?

• Something peculiar, irregular, or abnormal that is difficult to classify with the surrounding data

• Anomalies are subject to interpretation

• Two anomalies can look completely different from one another

Why is this important?

• Security
  – Abnormal activity
  – Intrusion detection

• Health
  – Atypical rhythmic patterns (e.g. heartbeat, breathing)

• Equities and financial data

• Detection in general

Motivation

• Searching for a specific, known pattern is relatively easy for a computer (exact matching can be done in linear time) and has been well researched (e.g. KMP, Boyer-Moore, edit distance)

• How does a computer detect surprising patterns without being told in advance what they look like?

• Utilize Kolmogorov complexity with compression!

Kolmogorov complexity and information distance

• K(x) – Length of the smallest program that prints out x

• K(x|y) – Length of the smallest program that prints out x given y as an input

• Information distance – How different are x and y?
  – Edit distance?
  – Normalize

Normalized information distance

• (K(x|y) + K(y|x)) / K(xy)
  – Close to 0: x and y are very similar
  – Close to 1: x and y are very different

• Compression does a good job at estimating Kolmogorov complexity

• We use compression to find anomalies
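As a rough illustration of the distance above, the sketch below swaps each Kolmogorov term for a compressed size. It assumes K(x) can be stood in for by the bzip2-compressed length of x, and K(x|y) by C(yx) - C(y); both are approximations chosen for this example, and the function names are hypothetical rather than taken from the slides.

```python
import bz2


def c(data: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for K(data)."""
    return len(bz2.compress(data))


def compression_distance(x: bytes, y: bytes) -> float:
    """Estimate (K(x|y) + K(y|x)) / K(xy) using compressed sizes.

    Assumption: K(x|y) is approximated by C(yx) - C(y), i.e. the extra
    bytes needed to describe x once y has already been compressed.
    """
    cxy = c(x + y)
    k_x_given_y = c(y + x) - c(y)
    k_y_given_x = cxy - c(x)
    return (k_x_given_y + k_y_given_x) / cxy


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog " * 200
    other = bytes(range(256)) * 35
    print("same data:     ", compression_distance(text, text))
    print("different data:", compression_distance(text, other))
```

Because a real compressor only gives an upper bound on Kolmogorov complexity, the result can fall slightly outside the 0 to 1 range; only the relative ordering of distances should be relied on.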

How compression works

• Create a dictionary that maps long sequences to short ones

• The more often these long sequences are used, the better the compression (works well with text), e.g.:
  – the = 01
  – and = 10
  – algorithm = 11

Compression dictionary example
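A toy sketch of dictionary substitution, using the three word codes from the previous slide. The `encode` and `decode` helpers are hypothetical names for this example; a real compressor builds its dictionary from the data and emits bits rather than strings.

```python
from typing import List

# Codes taken from the slide; any word not listed is passed through unchanged.
CODES = {"the": "01", "and": "10", "algorithm": "11"}
WORDS = {code: word for word, code in CODES.items()}


def encode(text: str) -> List[str]:
    """Replace each known word with its short code."""
    return [CODES.get(word, word) for word in text.split()]


def decode(tokens: List[str]) -> str:
    """Map codes back to words (assumes the text never contains a raw code)."""
    return " ".join(WORDS.get(token, token) for token in tokens)


if __name__ == "__main__":
    sentence = "the algorithm and the data"
    tokens = encode(sentence)
    print(tokens)
    print(decode(tokens))
```

The more often the dictionary words occur in the input, the larger the share of tokens that shrink to two characters.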

How compression works

• Bzip2
  – Burrows-Wheeler transform
  – Huffman encoding
  – Compressed with dictionary

• These methods combined create an efficient estimate of Kolmogorov complexity
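To make the Burrows-Wheeler step concrete, here is a naive sketch of the transform and its inverse. It sorts all rotations explicitly, which is far slower than the block sorting used inside bzip2, and the `$` end marker is an assumption of this example (it must not occur in the input).

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The transform tends to group identical characters into runs, which is what makes the later coding stages effective.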

Our algorithm

• Split the input string into equal sections
  – How many sections?

• Compress each section; sections containing anomalies should appear as outliers in normalized compressed size

• For each section containing an anomaly, split it and compare the parts against the section least likely to contain an anomaly

Pseudo code

Initial_cuts(data) {
    do {
        split(data, number of splits);
        compress splits;
        number of splits++;
    } while (no normalized split > threshold)

    base_check = minimal normalized coefficient
    for each normalized split > threshold {
        drill_down(normalized split);
    }
}

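Normalized split:

$$
\text{normalized split}_i = \frac{\operatorname{abs}\bigl(x_i - \operatorname{mean}(x_1, \ldots, x_{\#\text{splits}})\bigr)}{\operatorname{mean}\bigl(\operatorname{abs}(x_1 - \operatorname{mean}(x_1, \ldots, x_{\#\text{splits}})), \ldots, \operatorname{abs}(x_{\#\text{splits}} - \operatorname{mean}(x_1, \ldots, x_{\#\text{splits}}))\bigr)}
$$

where $x_i$ is the compressed size of split $i$.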
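The initial-cuts loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the normalized value of a split is its absolute deviation from the mean compressed size divided by the mean absolute deviation, uses bzip2 as the compressor, and introduces a hypothetical `threshold` default and `max_splits` cap that do not appear in the slides.

```python
import bz2
from typing import List, Optional, Tuple


def split_equal(data: bytes, n: int) -> List[bytes]:
    """Cut data into n equal sections (any trailing remainder is dropped)."""
    size = len(data) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]


def normalized_splits(sizes: List[int]) -> List[float]:
    """Absolute deviation from the mean, divided by the mean absolute deviation."""
    mean_size = sum(sizes) / len(sizes)
    deviations = [abs(size - mean_size) for size in sizes]
    mean_deviation = sum(deviations) / len(deviations)
    if mean_deviation == 0:
        return [0.0] * len(sizes)  # all sections compress equally well
    return [d / mean_deviation for d in deviations]


def initial_cuts(
    data: bytes, threshold: float = 1.9, max_splits: int = 64
) -> Optional[Tuple[int, List[float], List[int]]]:
    """Increase the number of splits until some normalized split exceeds threshold.

    Returns (number of splits, normalized values, indices of flagged splits),
    or None if no split is flagged before max_splits is reached.
    """
    for n in range(2, max_splits + 1):
        sizes = [len(bz2.compress(part)) for part in split_equal(data, n)]
        scores = normalized_splits(sizes)
        flagged = [i for i, score in enumerate(scores) if score > threshold]
        if flagged:
            return n, scores, flagged
    return None
```

Because the score is an absolute deviation, a section that compresses unusually well is flagged just like one that compresses unusually badly.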

How it works (Example)

• Initial split: 1.0, 1.0

• Second split: 0.752880921895, 0.747119078105, 1.5

• Final split: 0.678651685393, 0.718202247191, 0.603146067416, 2.0

Preliminary results

Randomized Sine Waves and anomaly (x * 0.1) between points 3134-3234 of 3334 data points

[Plot: y-axis from -1.500 to 2.000. Values labeled on the plot: 0.165094, 1.367925, 0.04717, 2.287736]

Preliminary results

Randomized Sine Waves and anomaly (x * 0.3) between points 3134-3234 of 3334 data points

[Plot: y-axis from -1.500 to 2.000. Values labeled on the plot: 0.141892, 0.628378, 1.682432, 0.02027, 1.317568, 2.209459]

Preliminary results

Randomized Sine Waves and anomaly (x = random noise) between points 3134-3234 of 3334 data points

[Plot: y-axis from -2.000 to 3.000. Values labeled on the plot: 0.169903, 1.019417, 0.776699, 2.5]

Preliminary results

Randomized Sine Waves and anomaly (abs(x)) between points 3134-3234 of 3334 data points

[Plot: y-axis from -1.500 to 2.000. Values labeled on the plot: 0.909091, 0.909091, 0.181818, 2]
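The four test signals above could be generated along these lines. This is a sketch under stated assumptions: "randomized" is taken to mean a sine wave with small uniform jitter, the period, jitter, and seeds are hypothetical, and the anomaly window is treated as the half-open index range 3134 to 3234.

```python
import math
import random
from typing import Callable, Dict, List


def randomized_sine(n: int = 3334, period: float = 50.0,
                    jitter: float = 0.05, seed: int = 0) -> List[float]:
    """Sine wave with small uniform jitter (assumed meaning of 'randomized')."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * i / period) + rng.uniform(-jitter, jitter)
            for i in range(n)]


def make_anomalies(seed: int = 1) -> Dict[str, Callable[[float], float]]:
    """The four anomaly types named on the slides, keyed by their labels."""
    rng = random.Random(seed)
    return {
        "x * 0.1": lambda x: x * 0.1,
        "x * 0.3": lambda x: x * 0.3,
        "random noise": lambda x: rng.uniform(-1.0, 1.0),
        "abs(x)": abs,
    }


def inject(series: List[float], transform: Callable[[float], float],
           start: int = 3134, end: int = 3234) -> List[float]:
    """Return a copy of series with transform applied to points start..end-1."""
    out = list(series)
    for i in range(start, end):
        out[i] = transform(series[i])
    return out


def to_bytes(series: List[float]) -> bytes:
    """Serialize with fixed precision so the series can be fed to a compressor."""
    return "\n".join(f"{value:.3f}" for value in series).encode("ascii")
```

Fixed-precision text is one simple way to turn a numeric series into compressor input; how the original experiments encoded their data is not stated in the slides.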

Results (Partial Epilepsy 1)

Post-Ictal Heart Rate Oscillations in Partial Epilepsy (Patient #1)

[Plots: two sets of panels with x-axis Time. First set: y-axis from -10.000 to 10.000, values labeled 0.678652, 0.718202, 0.603146, 2. Second set: y-axis from -3.000 to 3.000, values labeled 0.630324, 2, 0.630324, 0.739353]

Multi-anomaly detection

[Plots: Full Set (64000 Points); Initial 1500 Points; Final 1500 Points. Values labeled on the plot: 2, 0.2901, 0.0153, 1.6947]

Future research

• Extend tests to binary data (e.g. pictures, video)

• Finding anomalies in pairs of data
  – It is hot out
  – Chris is wearing a coat
  – It is hot out, and Chris is wearing a coat

• Anomaly detection refinement?

Drill down

Drill_down(data) {
    a = data(0…n/2);
    b = data(n/2+1…n);
    if (size of data < size_threshold) {
        add data's coordinates to linked list and return;
    } else if (a is similar to b) {
        Drill_down(a);
        Drill_down(b);
    } else if (a is closer to mean) {
        Drill_down(b);
    } else {
        Drill_down(a);
    }
}

• Drills down into splits containing anomalies to localize them more closely

• Mean = slices (each of size data/2) of the split least likely to contain an anomaly
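The drill-down recursion can be sketched as below. Several pieces are assumptions of this example rather than the slides' method: "closer to mean" is modeled by a pluggable `score` function (lower means closer to the anomaly-free baseline), "similar" means the two scores differ by at most `tolerance`, and `compression_score` is one hypothetical way to build such a score with bzip2.

```python
import bz2
from typing import Callable, List, Optional, Tuple

Score = Callable[[bytes], float]


def compression_score(baseline: bytes) -> Score:
    """Hypothetical scorer: extra compressed bytes per input byte when a
    chunk is appended to an anomaly-free baseline (lower = more similar)."""
    base_size = len(bz2.compress(baseline))

    def score(chunk: bytes) -> float:
        extra = len(bz2.compress(baseline + chunk)) - base_size
        return extra / max(len(chunk), 1)

    return score


def drill_down(data: bytes, score: Score, start: int = 0,
               size_threshold: int = 64, tolerance: float = 1e-9,
               found: Optional[List[Tuple[int, int]]] = None
               ) -> List[Tuple[int, int]]:
    """Halve data repeatedly, following the half that is farther from baseline.

    Returns (start, end) coordinates of the sections small enough to report.
    """
    if found is None:
        found = []
    n = len(data)
    if n < max(size_threshold, 2):  # small enough: record coordinates and stop
        found.append((start, start + n))
        return found
    half = n // 2
    a, b = data[:half], data[half:]
    score_a, score_b = score(a), score(b)
    if abs(score_a - score_b) <= tolerance:  # a is similar to b: search both
        drill_down(a, score, start, size_threshold, tolerance, found)
        drill_down(b, score, start + half, size_threshold, tolerance, found)
    elif score_a < score_b:  # a is closer to the baseline: anomaly is in b
        drill_down(b, score, start + half, size_threshold, tolerance, found)
    else:
        drill_down(a, score, start, size_threshold, tolerance, found)
    return found
```

Passing the score in as a function keeps the recursion independent of the compressor, so the same skeleton can be tried with other similarity measures.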

Questions?

If you have any questions, please visit http://www.google.com

References

• M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, The Similarity Metric, 2002

• M. Burrows and D.J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center, Palo Alto, CA, 1994

• E. Keogh, S. Lonardi, and B. Chiu, Finding Surprising Patterns in a Time Series Database in Linear Time and Space, University of California Riverside, Riverside, CA, 2002
