scalable anomaly detection with spark and sos

66
Scalable Anomaly Detection with Spark and SOS Strata NYC September 26, 2019

Upload: others

Post on 17-May-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scalable Anomaly Detection with Spark and SOS

ScalableAnomaly Detectionwith Spark and SOS

Strata NYCSeptember 26, 2019

Page 2: Scalable Anomaly Detection with Spark and SOS

Hi there, my name is Jeroen Janssens

Page 3: Scalable Anomaly Detection with Spark and SOS

Today

● SOS, World!● Anomalies and outliers● Evaluating outlier-selection algorithms● Various approaches to outlier selection● Stochastic Outlier Selection● Conclusion

Page 4: Scalable Anomaly Detection with Spark and SOS

SOS, World!

01-sos-world.ipynb

Page 5: Scalable Anomaly Detection with Spark and SOS

Implementations of SOS

● Python: http://bit.ly/sos-python● Spark: http://bit.ly/sos-spark● R: http://bit.ly/sos-r● Flink: http://bit.ly/sos-flink

Page 6: Scalable Anomaly Detection with Spark and SOS

Anomalies and outliers

Page 7: Scalable Anomaly Detection with Spark and SOS

An anomaly is an observation or event that deviates qualitatively from what is considered to be normal, according to a

domain expert.

Page 8: Scalable Anomaly Detection with Spark and SOS

Detecting anomalies is important

● Expensive● Dangerous● Mess up your model

Page 9: Scalable Anomaly Detection with Spark and SOS

Human anomaly detection may suffer from

● Fatigue● Information overload● Emotional bias

Page 10: Scalable Anomaly Detection with Spark and SOS

Feature-vector representation

Page 11: Scalable Anomaly Detection with Spark and SOS

Dissimilarity-matrix representation

Page 12: Scalable Anomaly Detection with Spark and SOS

From anomaly to outlier

Page 13: Scalable Anomaly Detection with Spark and SOS

An outlier is a data point that deviates quantitatively from the majority of the

data points, according to an outlier-selection algorithm.

Page 14: Scalable Anomaly Detection with Spark and SOS

The symbiotic relationship between the domain expert and the algorithm

Page 15: Scalable Anomaly Detection with Spark and SOS

Data flow diagram

Data flow diagram illustrating the relationship between the domain expert (square) and the outlier-selection algorithm (top circle).

Page 16: Scalable Anomaly Detection with Spark and SOS

Six Euler diagrams (1/2)

Page 17: Scalable Anomaly Detection with Spark and SOS

Six Euler diagrams (2/2)

Page 18: Scalable Anomaly Detection with Spark and SOS

Evaluating outlier-selection algorithms

Page 19: Scalable Anomaly Detection with Spark and SOS

Confusion matrix

Computer says no.

Page 20: Scalable Anomaly Detection with Spark and SOS

Four possible outcomes

Page 21: Scalable Anomaly Detection with Spark and SOS

Evaluation

Illustration of relabelling a multi-class data set into multiple one-class data sets.

Page 22: Scalable Anomaly Detection with Spark and SOS

Anomalies are rare

In order to evaluate the algorithm we simulate anomalies to be rare. Banana for scale.

Page 23: Scalable Anomaly Detection with Spark and SOS

Outlier scores

The dashed line indicates the threshold chosen by the domain expert.

Page 24: Scalable Anomaly Detection with Spark and SOS

ROC curve

An ROC curve plots the false alarm rate against the hit rate for all possible thresholds.

Page 25: Scalable Anomaly Detection with Spark and SOS

Various approachesto outlier selection

Page 26: Scalable Anomaly Detection with Spark and SOS

Distribution-based outlier selection

Page 27: Scalable Anomaly Detection with Spark and SOS

Distance-based outlier selection

Size does matter

Page 28: Scalable Anomaly Detection with Spark and SOS

Density-based outlier-selection

Page 29: Scalable Anomaly Detection with Spark and SOS

Stochastic Outlier Selection

Page 30: Scalable Anomaly Detection with Spark and SOS

Stochastic Outlier Selection

● Unsupervised outlier selection algorithm● Employs concept of affinity (inspired by t-SNE)● One parameter: perplexity● Computes outlier probabilities

Page 31: Scalable Anomaly Detection with Spark and SOS

t-Distributed Neighbor Embedding (t-SNE; Van der Maaten, Hinton) employs affinity to perform dimensionality reduction

Page 32: Scalable Anomaly Detection with Spark and SOS

A data point is selected as an outlier when all the other data points have

insufficient affinity with it.

Page 33: Scalable Anomaly Detection with Spark and SOS

From input to output

Page 34: Scalable Anomaly Detection with Spark and SOS

From feature matrix to dissimilarity matrix

Page 35: Scalable Anomaly Detection with Spark and SOS

From input to output

Page 36: Scalable Anomaly Detection with Spark and SOS

Smooth neighborhoods

Page 37: Scalable Anomaly Detection with Spark and SOS

Affinity between data points

Page 38: Scalable Anomaly Detection with Spark and SOS

From input to output

Page 39: Scalable Anomaly Detection with Spark and SOS

From affinity to binding probability

The binding matrix B is obtained by normalising each row in the affinity matrix A.

Page 40: Scalable Anomaly Detection with Spark and SOS

Binding probabilities form a graph

Page 41: Scalable Anomaly Detection with Spark and SOS

Binding probabilities form a graph

Page 42: Scalable Anomaly Detection with Spark and SOS

Stochastic Neighbor Graph

A data point belongs to the outlier class when no it is not selected by any other data points.

Page 43: Scalable Anomaly Detection with Spark and SOS

Three SNGs

The three SNGs Ga, Gb, and Gc are sampled from the discrete probability distribution P(G).

Page 44: Scalable Anomaly Detection with Spark and SOS

Set of all SNGs

Page 45: Scalable Anomaly Detection with Spark and SOS

Approximating outlier probabilities by sampling SNGs

Page 46: Scalable Anomaly Detection with Spark and SOS

Demo: Sampling SNGs inCoffeeScript and D3

http://bit.ly/sos-d3

Page 47: Scalable Anomaly Detection with Spark and SOS

Computing outlier probabilities through marginalisation

Page 48: Scalable Anomaly Detection with Spark and SOS

Computing outlier probabilities in closed form

Page 49: Scalable Anomaly Detection with Spark and SOS

Proof!

Page 50: Scalable Anomaly Detection with Spark and SOS

Selecting outliers

Page 51: Scalable Anomaly Detection with Spark and SOS

Adaptive variances via the perplexity parameter

Page 52: Scalable Anomaly Detection with Spark and SOS

Continuous binary search

Page 53: Scalable Anomaly Detection with Spark and SOS

Perplexity influences outlier probabilities

Page 54: Scalable Anomaly Detection with Spark and SOS

Evaluation and comparison

Page 55: Scalable Anomaly Detection with Spark and SOS

Putlier-score plots

Page 56: Scalable Anomaly Detection with Spark and SOS

Real-world datasets

Page 57: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 58: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 59: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 60: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 61: Scalable Anomaly Detection with Spark and SOS

SOS performs significantly better

Page 62: Scalable Anomaly Detection with Spark and SOS

Spark implementation of SOS

Page 63: Scalable Anomaly Detection with Spark and SOS

Spark implementation of SOS

● Developed by Fokko Driesprong● Works with DataFrame API● Available on GitHub● Plan is to make it part of MLLib

Page 64: Scalable Anomaly Detection with Spark and SOS

SOS on PySpark

92-pyspark-sos.ipynb

Page 65: Scalable Anomaly Detection with Spark and SOS

Summary

● Outlier selection can support the detection of anomalies● SOS is an intuitive and probabilistic algorithm to select outliers● SOS has a very good performance● No free lunch

Page 66: Scalable Anomaly Detection with Spark and SOS

Thank you! Here are some links

● Blog: http://bit.ly/sos-blog● D3 Demo: http://bit.ly/sos-d3● Python implementation: http://bit.ly/sos-python● Spark implementation: http://bit.ly/sos-spark● R implementation: http://bit.ly/sos-r● Flink implementation: http://bit.ly/sos-flink