Anomaly Detection Using Symmetric Compression
Benjamin Arai & Chris Baron
Computer Science and Engineering Department
University of California - [email protected]
What are anomalies?
• Something that is peculiar, irregular, abnormal, or difficult to classify relative to the surrounding data
• Anomalies are subject to interpretation
• Two anomalies can look completely different from one another
Why is this important?
• Security
  – Abnormal activity
  – Intrusion detection
• Health
  – Atypical rhythmic patterns (e.g. heartbeat, breathing)
• Equities and financial data
• Detection in general
Motivation
• Searching for a specific pattern is relatively trivial for a computer (exact matching can be done in linear time) and has been well researched (e.g. KMP, Boyer-Moore, edit distance)
• How does a computer detect surprising patterns without being told in advance what they look like?
• Utilize Kolmogorov complexity with compression!
Kolmogorov complexity and information distance
• K(x) – Length of the smallest program that prints out x
• K(x|y) – Length of the smallest program that prints out x given y as an input
• Information distance – How different are x and y?
  – Edit distance?
  – Normalize
Normalized information distance
• (K(x|y) + K(y|x)) / K(xy)
  – Close to 0: very similar
  – Close to 1: very different
• Compression does a good job at estimating Kolmogorov complexity
• We use compression to find anomalies
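As a rough illustration of estimating this distance with a compressor, here is a minimal Python sketch. It assumes bzip2 (via the standard `bz2` module) as the compressor and approximates K(x|y) by C(yx) - C(y) and K(xy) by C(xy), where C is the compressed size. The function names `c` and `approx_distance` and the sample inputs are hypothetical, not taken from the slides.

```python
import bz2
import random


def c(data: bytes) -> int:
    """Compressed size in bytes, used as a stand-in for K(data)."""
    return len(bz2.compress(data))


def approx_distance(x: bytes, y: bytes) -> float:
    """Estimate (K(x|y) + K(y|x)) / K(xy) from compressed sizes.

    Assumption: K(x|y) is approximated by C(yx) - C(y), K(xy) by C(xy).
    """
    cxy = c(x + y)
    cyx = c(y + x)
    k_x_given_y = cyx - c(y)
    k_y_given_x = cxy - c(x)
    return (k_x_given_y + k_y_given_x) / cxy


rng = random.Random(0)  # fixed seed so the demo is repeatable
regular = b"the quick brown fox jumps over the lazy dog. " * 60
noise = bytes(rng.randrange(256) for _ in range(len(regular)))

d_same = approx_distance(regular, regular)
d_diff = approx_distance(regular, noise)
print(f"regular vs regular: {d_same:.3f}")
print(f"regular vs noise:   {d_diff:.3f}")
```

The estimate is only as good as the compressor, and very short inputs are dominated by fixed format overhead, so the values should be read as relative rather than absolute.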
How compression works
• Create a dictionary that maps long sequences to short ones
• The more these long sequences are used, the better the compression (works well with text), e.g.:
  – the = 01
  – and = 10
  – algorithm = 11
Compression dictionary example
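A toy sketch of the dictionary idea in Python, using the three example codes (the, and, algorithm) as a hypothetical fixed dictionary. Real compressors build their dictionaries from the data and emit bits rather than text tokens; `CODES`, `encode`, and `decode` are illustrative names only.

```python
# Hypothetical dictionary built from the three example codes.
CODES = {"the": "01", "and": "10", "algorithm": "11"}
WORDS = {code: word for word, code in CODES.items()}


def encode(text: str) -> list[str]:
    """Replace each known word with its short code; keep other words as-is."""
    return [CODES.get(word, word) for word in text.split()]


def decode(tokens: list[str]) -> str:
    """Map codes back to words (assumes no literal word equals a code)."""
    return " ".join(WORDS.get(token, token) for token in tokens)


sentence = "the algorithm and the data"
tokens = encode(sentence)
print(tokens)
print(len(sentence), "characters ->", sum(len(t) for t in tokens), "characters")
```

The more often the dictionary words occur in the input, the larger the saving, which is the effect the slide describes.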
How compression works
• Bzip2
  – Burrows-Wheeler transform
  – Huffman encoding
  – Compressed with dictionary
• These methods combined create an efficient estimate of Kolmogorov complexity
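To make the Burrows-Wheeler step concrete, here is a naive Python sketch of the transform and its inverse. It sorts all rotations explicitly, which is far slower than what bzip2 actually does, and it assumes a `$` end marker that does not occur in the input. The function names are illustrative.

```python
def bwt(text: str, end_marker: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if end_marker in text:
        raise ValueError("end marker must not occur in the input")
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, end_marker: str = "$") -> str:
    """Rebuild the input by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


transformed = bwt("banana")
print(transformed)
print(inverse_bwt(transformed))
```

The transform tends to group identical characters together, which is what makes the later coding stages more effective.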
Our algorithm
• Split the input string into equal sections
  – How many sections?
• Compress each section; sections containing anomalies should appear as outliers (judged by their normalized compressed size)
• For each section containing an anomaly, split it and compare the pieces against the section most likely not to contain an anomaly
Pseudo code
Initial_cuts(data) {
    do {
        split(data, number of splits);
        compress splits;
        number of splits++;
    } while (no normalized splits > threshold)
    base_check = minimal normalized coefficient
    for each normalized split > threshold {
        drill_down(normalized split);
    }
}
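Normalized split $x_i = \dfrac{\mathrm{abs}(x_i - \mathrm{mean}(x_1 \ldots x_{\text{\#splits}}))}{\mathrm{mean}(\mathrm{abs}(x_1 - \mathrm{mean}(x_1 \ldots x_{\text{\#splits}})) \ldots \mathrm{abs}(x_{\text{\#splits}} - \mathrm{mean}(x_1 \ldots x_{\text{\#splits}})))}$

where $x_i$ is the compressed file size of split $i$.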
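The initial-cut loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each split's normalized value is its absolute deviation from the mean compressed size divided by the mean absolute deviation, it uses `bz2` as the compressor, and the `threshold` and `max_splits` values are hypothetical because the slides give none. All function names are illustrative.

```python
import bz2
import random


def split_equal(data: bytes, n: int) -> list[bytes]:
    """Cut data into n contiguous sections of (nearly) equal length."""
    bounds = [len(data) * i // n for i in range(n + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(n)]


def normalized_scores(sizes: list[int]) -> list[float]:
    """abs(x_i - mean) divided by the mean of all such absolute deviations."""
    mean = sum(sizes) / len(sizes)
    deviations = [abs(size - mean) for size in sizes]
    mean_deviation = sum(deviations) / len(deviations)
    if mean_deviation == 0:  # all sections compress to the same size
        return [0.0] * len(sizes)
    return [d / mean_deviation for d in deviations]


def initial_cuts(data: bytes, threshold: float = 1.75, max_splits: int = 16):
    """Increase the split count until some section scores above threshold.

    Returns (number_of_splits, scores, flagged_indices, base_index) or None.
    threshold and max_splits are assumed values, not from the slides.
    """
    for n in range(2, max_splits + 1):
        sizes = [len(bz2.compress(part)) for part in split_equal(data, n)]
        scores = normalized_scores(sizes)
        flagged = [i for i, score in enumerate(scores) if score > threshold]
        if flagged:
            base_index = scores.index(min(scores))  # minimal normalized value
            return n, scores, flagged, base_index
    return None


rng = random.Random(0)  # fixed seed so the demo is repeatable
clean = b"abcdefgh" * 375  # 3000 regular bytes
noisy = bytes(rng.randrange(256) for _ in range(1000))  # anomalous tail
result = initial_cuts(clean + noisy)
print(result)
```

With two splits both scores are always equal, so any useful threshold has to be greater than 1. The flagged sections would then be handed to the drill-down step.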
How it works (Example)
• Initial split (normalized split values): 1.0, 1.0
• Second split: 0.752880921895, 0.747119078105, 1.5
• Final split: 0.678651685393, 0.718202247191, 0.603146067416, 2.0
Preliminary results
• Figure: Randomized sine waves with anomaly (x * 0.1) between points 3134-3234 of 3334 data points
• Values labeled on the plot: 0.165094, 1.367925, 0.04717, 2.287736
Preliminary results
• Figure: Randomized sine waves with anomaly (x * 0.3) between points 3134-3234 of 3334 data points
• Values labeled on the plot: 0.141892, 0.628378, 1.682432, 0.02027, 1.317568, 2.209459
Preliminary results
• Figure: Randomized sine waves with anomaly (x = random noise) between points 3134-3234 of 3334 data points
• Values labeled on the plot: 0.169903, 1.019417, 0.776699, 2.5
Preliminary results
• Figure: Randomized sine waves with anomaly (abs(x)) between points 3134-3234 of 3334 data points
• Values labeled on the plot: 0.909091, 0.909091, 0.181818, 2
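For readers who want to reproduce inputs of this kind, here is a minimal Python sketch that builds a noisy sine wave of 3334 points and injects each of the four anomaly types between points 3134 and 3234. The wave period, the jitter amplitude, the noise range, and the half-open interpretation of the anomaly range are all assumptions; the slides do not specify how the "randomized sine waves" were generated.

```python
import math
import random

N_POINTS = 3334
ANOMALY = slice(3134, 3234)  # assumed half-open: points 3134..3233


def base_wave(seed: int = 0, period: float = 50.0, jitter: float = 0.05) -> list[float]:
    """Sine wave with small uniform jitter (period and jitter are assumed)."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * i / period) + rng.uniform(-jitter, jitter)
            for i in range(N_POINTS)]


def inject(wave: list[float], kind: str, seed: int = 1) -> list[float]:
    """Return a copy of wave with one anomaly type applied to the ANOMALY range."""
    rng = random.Random(seed)
    transforms = {
        "scale_0.1": lambda x: x * 0.1,
        "scale_0.3": lambda x: x * 0.3,
        "noise": lambda x: rng.uniform(-1.0, 1.0),  # replaces the value
        "abs": abs,
    }
    out = list(wave)
    out[ANOMALY] = [transforms[kind](x) for x in wave[ANOMALY]]
    return out


wave = base_wave()
variants = {kind: inject(wave, kind)
            for kind in ("scale_0.1", "scale_0.3", "noise", "abs")}
print({kind: len(values) for kind, values in variants.items()})
```

Before compression, the values would still need to be serialized to bytes, for example as fixed-precision text, and that choice affects the compressed sizes.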
Results (Partial Epilepsy 1)
• Figure: Post-Ictal Heart Rate Oscillations in Partial Epilepsy (Patient #1), plotted against Time
• Values labeled on the plot: 0.678652, 0.718202, 0.603146, 2
Results (Partial Epilepsy 2)
• Figure: Post-Ictal Heart Rate Oscillations in Partial Epilepsy (Patient #2), plotted against Time
• Values labeled on the plot: 0.630324, 2, 0.630324, 0.739353
Multi-anomaly detection
• Figure panels: Full Set (64000 Points), Initial 1500 Points, Final 1500 Points
• Values labeled on the plot: 2, 0.2901, 0.0153, 1.6947
Future research
• Extend tests to binary data (e.g. pictures, video)
• Finding anomalies in pairs of data
  – It is hot out
  – Chris is wearing a coat
  – It is hot out, and Chris is wearing a coat
• Anomaly detection refinement?
Drill down
Drill_down(data)
{
    a = data(0…n/2);
    b = data(n/2+1…n);
    if (size of data < size_threshold) {
        add data's coordinates to linked list and return;
    } else if (a is similar to b) {
        Drill_down(a);
        Drill_down(b);
    } else if (a is closer to mean) {
        Drill_down(b);
    } else {
        Drill_down(a);
    }
}
• Drills down into splits containing anomalies to get a closer approximation of their location
• Mean = slices, of size data/2, of the split most likely not to contain an anomaly
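The drill-down recursion can be sketched in Python as below. This is one possible reading, with several assumptions that the slides leave open: "similar" is taken to mean that the two halves' compressed sizes differ by at most a relative `tolerance`, "closer to mean" is taken to mean closer in compressed size to an equally long slice of a `baseline` section believed to be anomaly-free, and `size_threshold` is an arbitrary value. The names and parameters are hypothetical.

```python
import bz2
import random


def csize(chunk: bytes) -> int:
    """Compressed size in bytes (bz2 assumed as the compressor)."""
    return len(bz2.compress(chunk))


def drill_down(data, start, end, baseline,
               size_threshold=300, tolerance=0.1, found=None):
    """Narrow data[start:end] down to small ranges likely to hold an anomaly.

    baseline must be at least (end - start) // 2 bytes long.
    Returns a list of (start, end) coordinates.
    """
    if found is None:
        found = []
    n = end - start
    if n < size_threshold:
        found.append((start, end))  # small enough: record the coordinates
        return found
    mid = start + n // 2
    a = csize(data[start:mid])
    b = csize(data[mid:end])
    mean = csize(baseline[: n // 2])  # size of a "normal" slice of equal length
    if abs(a - b) <= tolerance * max(a, b):  # halves look alike: search both
        drill_down(data, start, mid, baseline, size_threshold, tolerance, found)
        drill_down(data, mid, end, baseline, size_threshold, tolerance, found)
    elif abs(a - mean) < abs(b - mean):  # a looks normal, so follow b
        drill_down(data, mid, end, baseline, size_threshold, tolerance, found)
    else:  # b looks normal, so follow a
        drill_down(data, start, mid, baseline, size_threshold, tolerance, found)
    return found


rng = random.Random(0)  # fixed seed so the demo is repeatable
buffer = bytearray(b"abcdefgh" * 512)  # 4096 regular bytes
buffer[3072:3328] = bytes(rng.randrange(256) for _ in range(256))  # anomaly
data = bytes(buffer)
found = drill_down(data, 2048, 4096, baseline=data[:2048])
print(found)
```

Returning absolute coordinates rather than the sliced data keeps the result usable against the original input.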
Questions?
If you have any questions, please visit http://www.google.com
References
• M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, The Similarity Metric, 2002
• M. Burrows and D.J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center, Palo Alto, CA, 1994
• E. Keogh, S. Lonardi, and B. Chiu, Finding Surprising Patterns in a Time Series Database in Linear Time and Space, University of California Riverside, Riverside, CA, 2002