Page 1:

Machine Learning and Extremes for Anomaly Detection

Nicolas Goix, LTCI, CNRS, Telecom ParisTech, Université Paris-Saclay, France

GT valeurs extremes, Jussieu, November 15th 2016

Page 2:

Anomaly Detection (AD)

‘Finding patterns in the data that do not conform to expected behavior’

Huge number of applications: network intrusions, credit card fraud detection, insurance, finance, military surveillance, ...

Page 3:

Machine Learning context

Different kinds of Anomaly Detection:

• Supervised AD (not dealt with): labels available for both normal data and anomalies (similar to rare-class mining)

• Novelty Detection (our theoretical framework): the algorithm learns on normal data only

• Outlier Detection (extended application framework): the (unlabeled) training set = normal + abnormal data (assumption: anomalies are very rare)

Page 4:

Some literature in Anomaly Detection

• Statistical AD techniques: fit a statistical model for normal behavior (e.g. Gaussian, Gaussian mixture)

• K-nearest neighbors, e.g. Local Outlier Factor (LOF) [Breunig et al. 2000]

• Support estimation: One-Class SVM [Scholkopf et al. 2000, Vert and Vert 2006]; minimum volume set estimates [Einmahl and Mason 1992, Polonik 1997, Scott and Nowak 2006]

• High-dimensional techniques: dimensionality reduction [Aggarwal and Yu 2001, Shyu et al. 2003]; one-class random forests [Shi and Horvath 2012, Desir et al. 2012]; Isolation Forest [Liu et al. 2008]

Page 5:

Outline

An AD algorithm returns a scoring function s : R^d → R. It represents the ‘degree of abnormality’ of an observation x ∈ R^d.

• Part I: a performance criterion on s (model selection).

• Part II: building a good s on extreme regions (model design).

Page 6:

Part I: performance criterion
- Definition
- Learning a scoring function
- Evaluating a scoring function

Part II: Learning accurate scoring functions on extreme regions
- Multivariate EVT & Representation of Extremes
- Estimation
- Experiments

Page 7:

(unsupervised) performance criterion

Such a criterion allows:

1. To build a good s by optimizing this criterion.
2. To evaluate any AD algorithm without using any labels.

Practical motivation: most of the time, data come without any labels → no ROC or PR curves!

Idea: how good is an anomaly detection algorithm?
↓
How well does it estimate the level sets of the density?

Page 8:

Context

Novelty Detection (‘One-Class Classification’, ‘semi-supervised AD’)

• Data: inliers, i.e. i.i.d. observations in R^d from the normal behavior, with density f.

• Output to evaluate: a scoring function s : R^d → R
  - s defines a pre-order on R^d (a ‘degree of abnormality’)
  - level sets of s are estimates of level sets of f
  - s can be interpreted as a box containing an infinite number of level-set estimates (at different levels)

Remark. Ideal scoring functions: s = f, or s = 2f + 3, or s = T ∘ f for any increasing transform T of f.

Page 9:

Problem reformulation

We want a criterion C(s) which measures how well the level sets of f are approximated by those of s.

• Fact: for any strictly increasing transform T, the level sets of T ∘ f are exactly those of f.
⇒ the criterion C(s) = ‖s − f‖ is not relevant! (C(2f) > 0)

• We are looking for a criterion such that:
  - C^Φ(s) = ‖Φ(s) − Φ(f)‖, with Φ such that Φ(T ∘ s) = Φ(s);
  - level sets of the optimal s* = level sets of f;
  - C^Φ(s) = ‘distance’ between the level sets of s and those of f.

Question: how to choose Φ(s)?

Page 10:

Existing criterion: Mass-Volume curve

Minimum volume set [Polonik, 1997]:

Γ*_α = argmin_{Γ Borel} Leb(Γ)  s.t.  P(X ∈ Γ) ≥ α

Under regularity assumptions, minimum volume sets are density level sets:

∃ t_α > 0,  Γ*_α = {f > t_α} =: Ω_{t_α}

Mass-Volume curve of a scoring function s [Clémençon and Jakubowicz, 2013]:

MV_s(α) := inf_{Ω level set of s} Leb(Ω)  s.t.  P(X ∈ Ω) ≥ α

MV*(α) := MV_f(α)

Page 11:

Existing criterion: Mass-Volume curve

Properties:

• For any scoring function s, MV*(α) ≤ MV_s(α).

• For any increasing transform T, MV*(α) = MV_{T∘f}(α).

• MV*(α) = Leb(Γ*_α) = min_{Ω Borel} Leb(Ω) s.t. P(X ∈ Ω) ≥ α

  (recall MV_s(α) := inf_{Ω level set of s} Leb(Ω) s.t. P(X ∈ Ω) ≥ α)

Figure: (a) scoring functions s and f; (b) the corresponding Mass-Volume curves.

Page 12:

Drawbacks and alternative criterion: Excess-Mass curve

Drawbacks of MV:

• When optimized w.r.t. different levels α, it does not necessarily produce nested empirical level sets.
• Low convergence rates, of order O(n^{−1/4}).
• MV_s(α) diverges as α → 1 when the support is unbounded.

Solution: Excess-Mass curve

• Definition:

EM_s(t) = sup_{Ω level set of s} P(X ∈ Ω) − t Leb(Ω)

• Optimal curves:

EM*(t) := EM_f(t) = EM_{T∘f}(t) = max_{Ω Borel} P(X ∈ Ω) − t Leb(Ω)

EM*(t) ≥ EM_s(t) for every scoring function s

Property: the previous drawbacks are fixed with EM.

Page 13:

MV and EM criteria

• Interpretation: (EM_f − EM_s)(t) ≈ inf_{u>0} Leb({s > u} Δ {f > t}).

• How well are the t-level sets of f approximated by level sets of s, for t ∈ I?
  ⇕ how small is EM_f − EM_s on I?
  ⇕ how large is EM_s on I?  →  C^{EM}(s) = ‖EM_s‖_{1,I}

• How well are the α minimum-volume sets of f approximated by level sets of s, for α ∈ J?
  ⇕ how small is MV_s − MV_f on J?
  ⇕ how small is MV_s on J?  →  C^{MV}(s) = ‖MV_s‖_{1,J}

Page 15:

Part I: performance criterion
- Definition
- Learning a scoring function
- Evaluating a scoring function

Part II: Learning accurate scoring functions on extreme regions
- Multivariate EVT & Representation of Extremes
- Estimation
- Experiments

Page 16:

Learning a scoring function with M-estimation

We are looking for nearly optimal scoring functions of the form s = Σ_{j=1}^N a_j 1_{x ∈ Ω_j}, with a_j ≥ 0, Ω_j ∈ G.

Procedure: fix t_0 > 0. For k = 1, ..., N:

t_{k+1} = t_k / (1 + 1/√n)

Ω_{t_{k+1}} = argmax_{Ω ⊃ Ω_{t_k}} P_n(X ∈ Ω) − t_{k+1} Leb(Ω)

s_N(x) := Σ_{j=1}^N (t_j − t_{j+1}) 1_{x ∈ Ω_{t_j}}
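To make the procedure concrete, here is a minimal runnable sketch in Python, under the assumption that the class G is the set of unions of cells of a regular grid (the slide leaves G abstract, so this choice, and names such as greedy_em_score, bins, n_levels, are illustrative). Over such a class the constrained argmax is explicit: keep the previous cells and add every cell whose empirical excess mass P_n(cell) − t·Leb(cell) is positive.

```python
import numpy as np

def greedy_em_score(X, n_levels=50, bins=20, t0=None):
    # Sketch of the M-estimation procedure above with G = unions of grid cells.
    # Only practical for small d: the grid has bins**d cells.
    n, d = X.shape
    hist, edges = np.histogramdd(X, bins=bins)
    mass = hist / n                                  # P_n of each cell
    vol = np.prod([e[1] - e[0] for e in edges])      # Leb of each (uniform) cell
    t = t0 if t0 is not None else mass.max() / vol   # start at the largest cell density
    in_set = np.zeros_like(mass, dtype=bool)         # current level set Omega_{t_k}
    score = np.zeros_like(mass)
    for _ in range(n_levels):
        t_next = t / (1.0 + 1.0 / np.sqrt(n))        # t_{k+1} = t_k / (1 + 1/sqrt(n))
        # argmax over Omega containing Omega_{t_k}: add cells with positive excess mass
        in_set = in_set | (mass - t_next * vol > 0)
        # s_N = sum_j (t_j - t_{j+1}) 1_{Omega_{t_j}} (index convention shifted by one here)
        score += (t - t_next) * in_set
        t = t_next
    def s(Xq):
        # piecewise-constant scoring function: look up the grid cell of each query point
        idx = tuple(np.clip(np.searchsorted(edges[j], Xq[:, j], side="right") - 1,
                            0, bins - 1) for j in range(d))
        return score[idx]
    return s
```

High scores correspond to high-density (normal) regions, so anomalies receive the lowest scores, as with any scoring function in this framework.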

Page 17:

Learning a scoring function with M-estimation

Rates (if the density is bounded and has no flat parts, and G is a VC class): with probability at least 1 − δ,

sup_{t ∈ ]0,t_1]} |EM*(t) − EM_{s_N}(t)| ≤ [A + √(2 log(1/δ))] / √n + bias(G) + o_N(1).

Page 18:

Part I: performance criterion
- Definition
- Learning a scoring function
- Evaluating a scoring function

Part II: Learning accurate scoring functions on extreme regions
- Multivariate EVT & Representation of Extremes
- Estimation
- Experiments

Page 19:

Evaluation of scoring functions

Inputs: a scoring function s

• Estimation:

M̂V_s(α) = inf_{u≥0} Leb(s ≥ u)  s.t.  P_n(s ≥ u) ≥ α

ÊM_s(t) = sup_{u≥0} P_n(s ≥ u) − t Leb(s ≥ u)

• Empirical criteria:

Ĉ^{EM}(s) = ‖ÊM_s‖_{L1(I)},  I = [0, ÊM⁻¹(0.9)]

Ĉ^{MV}(s) = ‖M̂V_s‖_{L1(J)},  J = [0.9, 1]

• Issue: the volume Leb(s ≥ u) has to be estimated (by Monte Carlo), which is challenging in high dimensions.
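A minimal sketch of these estimates, assuming a vectorized scoring function and estimating the volumes Leb(s ≥ u) by uniform Monte-Carlo sampling on the bounding box of the data, as the last bullet suggests (the function and grid names are illustrative):

```python
import numpy as np

def empirical_em_mv(score_fn, X, n_mc=100_000, n_levels=201, seed=0):
    # Empirical EM and MV curves of a scoring function s (sketch).
    # Level sets {s >= u}: mass P_n from the data, volume by Monte-Carlo.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    box_vol = np.prod(hi - lo)
    U = rng.uniform(lo, hi, size=(n_mc, d))          # uniform sample for volumes
    s_data, s_mc = score_fn(X), score_fn(U)
    u_grid = np.quantile(s_data, np.linspace(0.0, 1.0, n_levels))
    mass = (s_data[None, :] >= u_grid[:, None]).mean(axis=1)         # P_n(s >= u)
    vol = (s_mc[None, :] >= u_grid[:, None]).mean(axis=1) * box_vol  # ~ Leb(s >= u)
    def EM(t):       # EM_s(t) = sup_u  P_n(s >= u) - t Leb(s >= u)
        return float(np.max(mass - t * vol))
    def MV(alpha):   # MV_s(alpha) = inf_u Leb(s >= u)  s.t.  P_n(s >= u) >= alpha
        ok = mass >= alpha
        return float(vol[ok].min()) if ok.any() else float("inf")
    return EM, MV
```

The criteria then follow by numerical integration, e.g. alphas = np.linspace(0.9, 0.999, 50) and C_MV = np.trapz([MV(a) for a in alphas], alphas).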

Page 20:

Evaluation: Heuristic solution

Feature sub-sampling (random projection) and averaging

Inputs: AD algorithm A, data set X of size n × d, feature sub-sampling size d′, number of draws m.

for k = 1, ..., m do
  - randomly select a sub-group F_k of d′ features
  - compute the associated scoring function s_k = A((x_i^j)_{1≤i≤n, j∈F_k})
  - compute Ĉ_k^{EM} = ‖ÊM_{s_k}‖_{L1(I)} or Ĉ_k^{MV} = ‖M̂V_{s_k}‖_{L1(J)}
end for

Return the performance criteria:

Ĉ^{EM}_{high dim}(A) = (1/m) Σ_{k=1}^m Ĉ_k^{EM}   or   Ĉ^{MV}_{high dim}(A) = (1/m) Σ_{k=1}^m Ĉ_k^{MV}.
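A sketch of this heuristic (all names are assumptions): fit maps a data matrix to a scoring function on that feature subspace, playing the role of the algorithm A, and criterion computes Ĉ^{EM} or Ĉ^{MV}, e.g. via the empirical_em_mv sketch above. For instance, fit could wrap scikit-learn's IsolationForest and return its score_samples method.

```python
import numpy as np

def high_dim_criterion(fit, criterion, X, d_sub=5, n_draws=20, seed=0):
    # Average an EM/MV-based criterion over n_draws random feature subsets
    # of size d_sub, as in the pseudo-code above.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    values = []
    for _ in range(n_draws):
        feats = rng.choice(d, size=d_sub, replace=False)  # random sub-group F_k
        X_sub = X[:, feats]
        values.append(criterion(fit(X_sub), X_sub))       # C_k^{EM} or C_k^{MV}
    return float(np.mean(values))                         # average over the m draws
```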

Page 21:

Part I: performance criterion
- Definition
- Learning a scoring function
- Evaluating a scoring function

Part II: Learning accurate scoring functions on extreme regions
- Multivariate EVT & Representation of Extremes
- Estimation
- Experiments

Page 22:

Why deal with extremes?

General ideas:

• Extreme observations play a special role when dealing with outlying data.

• But no anomaly detection algorithm has a specific treatment for multivariate extreme observations. Univariate EVT: [Roberts 99, Lee and Roberts 2008, Clifton et al. 2011].

• Our goals:
  - define a notion of sparsity for extreme observations;
  - provide a method which can improve the performance of standard AD algorithms by combining them with a multivariate extreme analysis of the dependence structure, using this notion of sparsity.

Page 23:

Purpose

X = (X_1, ..., X_d)

Goal: find the groups of features which can be large together

e.g. {X_1, X_2}, {X_1, X_3, X_4}, {X_5}

Namely: characterize the extreme dependence structure

→ Anomalies = points which violate this structure

Page 24:

Theoretical framework

• Context:
  - random vector X = (X_1, ..., X_d)
  - margins: X_j ∼ F_j (F_j continuous)

• Preliminary step: standardization of each marginal to a standard Pareto distribution:

V_j = 1 / (1 − F_j(X_j))   (so that P(V_j ≥ x) = 1/x, x ≥ 1)
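A quick numerical check of this standardization, with a known margin (illustrative snippet): if X is exponential, F(x) = 1 − e^{−x}, so V = 1/(1 − F(X)) = e^X, and the empirical tail should match 1/x.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)      # F(x) = 1 - exp(-x)
v = 1.0 / (1.0 - (1.0 - np.exp(-x)))   # V = 1/(1 - F(X)) = exp(X)
for t in (2, 5, 10):
    print(t, (v >= t).mean(), 1 / t)   # empirical P(V >= t) vs the Pareto tail 1/t
```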

Page 25:

The problem addressed by Extreme Value Theory

Describe the distribution of V when V exceeds some large threshold.

P(V ∈ A) = ? (A ‘far from the origin’).

Page 26:

Fundamental hypothesis and consequences

• Standard assumption: for an extreme region A,

P[V ∈ tA] ≈ t^{−1} P[V ∈ A]   (radial homogeneity)

• Formally: regular variation (after standardization). If A is bounded away from 0,

t P[V ∈ tA] → μ(A)   as t → ∞,

where μ is the exponent measure. Necessarily, μ(tA) = t^{−1} μ(A).

• ⇒ angular measure on the sphere S^{d−1}:  Φ(B) = μ{tB, t ≥ 1}

Page 27:

General model of multivariate EVT

P[V ∈ A] ≈ μ(A), if A is an extreme region.

Model for excesses. For a large r > 0 and a region B of the unit sphere:

P[‖V‖ > r, V/‖V‖ ∈ B] ∼ (1/r) Φ(B) = μ{tB, t ≥ r}   as r → ∞

⇒ Φ (or μ) rules the joint distribution of extremes (if the margins are known).

Page 28:

Angular distribution

• Φ rules the joint distribution of extremes

Asymptotic dependence: (V_1, V_2) may be large together.  Asymptotic independence: only V_1 or V_2 may be large.

Page 29:

General Case

• Sub-cones: C_α = {‖v‖ ≥ 1, v_j > 0 (j ∈ α), v_j = 0 (j ∉ α)}

• Corresponding sub-spheres: Ω_α = C_α ∩ S^{d−1}, for α ⊂ {1, ..., d}

Page 30:

Representation of extreme data

• Natural decomposition of the angular measure:

Φ = Σ_{α ⊂ {1,...,d}} Φ_α   with   Φ_α = Φ|_{Ω_α} ↔ μ|_{C_α}

• ⇒ yields a representation

M = {Φ(Ω_α) : ∅ ≠ α ⊂ {1, ..., d}} = {μ(C_α) : ∅ ≠ α ⊂ {1, ..., d}}

• Assumption: dμ|_{C_α} / dv_α = O(1).

Page 31:

Sparse Representation?

Full pattern: anything may happen.  Sparse pattern: V_1 is not large if V_2 or V_3 is large.

Page 32:

Part I: performance criterion
- Definition
- Learning a scoring function
- Evaluating a scoring function

Part II: Learning accurate scoring functions on extreme regions
- Multivariate EVT & Representation of Extremes
- Estimation
- Experiments

Page 33:

Problem: M is an asymptotic representation

M = {Φ(Ω_α), α} = {μ(C_α), α} is the restriction of an asymptotic measure

μ(A) = lim_{t→∞} t P[V ∈ tA]

to a representative class of sets {C_α, α}, but only the central sub-cone has positive Lebesgue measure!

⇒ For large t, one cannot simply take

Φ̂(Ω_α) = μ̂(C_α) := t P_n(V ∈ t C_α)

Page 34:

Solution

Fix ε > 0. Assign data that are ε-close to an edge to that edge:

C_α → C_α^ε = {‖v‖ ≥ 1, v_j > ε (j ∈ α), v_j ≤ ε (j ∉ α)},   Ω_α → Ω_α^ε = C_α^ε ∩ S^{d−1}.

This yields a new partition of S^{d−1}.

Page 35:

Resulting estimation procedure

V_i^j = 1 / (1 − F̂_j(X_i^j))   with   F̂_j(X_i^j) = (rank(X_i^j) − 1) / n

⇒ this gives a natural estimate of Φ(Ω_α):

Φ̂(Ω_α) := (n/k) P_n(V ∈ (n/k) C_α^ε)   (n/k large, ε small)

⇒ we obtain M̂ := {Φ̂(Ω_α), α}
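In code, the empirical rank transform is a couple of lines (a sketch; the function name is illustrative):

```python
import numpy as np

def pareto_standardize(X):
    # V_i^j = 1 / (1 - F_hat_j(X_i^j)) with F_hat_j = (rank - 1)/n as above,
    # so each column of V takes values in {1, n/(n-1), ..., n}.
    n, _ = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1  # ranks 1..n per feature
    return 1.0 / (1.0 - (ranks - 1) / n)
```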

Page 36:

Statistical guarantees: main issue

We would like to use a concentration inequality...

In our case: sup_{A ∈ A} (n/k) |(P − P_n)((k/n) A)|

But usually: sup_{A ∈ A} |(P − P_n)(A)|

• scaling factor n/k
• the classical VC inequality does not exploit the small probability (of order k/n) of the rescaled sets → high-probability bound in

(n/k) × √((1/n) log(1/δ)) → ∞ !!

⇒ One needs to take into account that the probability of (k/n) A is small.

Page 37:

Statistical guarantees: solution

Key: a VC inequality adapted to rare regions → bound in

√p × (n/k) × √((d/n) log(1/δ)),

with p the probability of falling in the union class ∪_{A∈A} A.

Since p ≲ d k/n  ⇒  bound in  d √((1/k) log(1/δ)),

where k ∝ the number of data points considered as extreme (the data used for estimation).

Page 38:

Statistical guarantees

Theorem. There is an absolute constant C > 0 such that for any n > 0, k > 0, 0 < ε < 1 and δ ≥ e^{−k}, with probability at least 1 − δ,

‖M̂ − M‖_∞ ≤ C d (√((1/(εk)) log(d/δ)) + M d ε) + bias(ε, k, n).

Comments:

• M = sup (density on the sub-cones).
• Existing literature (for the spectral measure), Einmahl and Segers 2009, Einmahl et al. 2001: d = 2; asymptotic behaviour; rates in 1/√k.
• Here: 1/√k → 1/√(εk) + ε, the price to pay for biasing our estimator with ε.

Page 39:

Theorem’s proof

Decompose the error:

|μ_n(C_α^ε) − μ(C_α)| ≤ |μ_n − μ|(C_α^ε)  [term A]  +  |μ(C_α^ε) − μ(C_α)|  [term B]

• Term A: bounded with the VC inequality adapted to small-probability regions.
• Term B: Leb(C_α^ε \ C_α) is small when ε is small.

Page 40:

Corresponding Algorithm

DAMEX, in O(dn log n)

Input: parameters ε > 0, k = k(n).

for i = 1, ..., n do
  # Standardize via the marginal rank transformation:
  V_i := (1 / (1 − F̂_j(X_i^j)))_{j=1,...,d}
  if ‖V_i‖ > n/k then
    # Assign V_i to the cone (n/k) C_α^ε it belongs to:
    α = α(V_i);  c_α ← c_α + 1
  end if
end for

Φ̂_n^{α,ε} := (n/k) P_n(V ∈ (n/k) C_α^ε) = c_α / k

Output: (sparse) representation of the dependence structure, i.e. the estimate Φ̂_n^{α,ε} of the α-mass of Φ for every α:

M̂ := (Φ̂_n^{α,ε})_{α ⊂ {1,...,d}, Φ̂_n^{α,ε} > Φ_min}
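A compact sketch of the algorithm (parameter names and the choice of the sup-norm for ‖·‖ are assumptions); the two argsorts account for the O(dn log n) cost:

```python
import numpy as np
from collections import Counter

def damex(X, k, eps, phi_min=0.0):
    # Sketch of DAMEX as reconstructed above; returns the sparse representation M_hat.
    n, d = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1  # O(dn log n)
    V = 1.0 / (1.0 - (ranks - 1) / n)                      # rank-transformed margins
    counts = Counter()
    for v in V:
        if v.max() > n / k:                                # extreme observation
            # cone (n/k) C^eps_alpha: alpha = {j : v_j > eps * n/k}
            alpha = tuple(np.flatnonzero(v > eps * n / k))
            counts[alpha] += 1
    # Phi_hat_n^{alpha,eps} = (n/k) P_n(V in (n/k) C^eps_alpha) = c_alpha / k
    return {alpha: c / k for alpha, c in counts.items() if c / k > phi_min}
```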

Page 41:

Application to Anomaly Detection

Recall that after standardization of the marginals, P[R > r, W ∈ B] ≈ (1/r) Φ(B).

→ scoring function = Φ̂_n^ε × (1/r):

s_n(x) := (1 / ‖T(x)‖_∞) Σ_α Φ̂_n^{α,ε} 1_{T(x) ∈ C_α^ε},

where T : x ↦ V with V_j = 1 / (1 − F_j(x_j)).
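A sketch of this scoring function, reusing the output M of the damex sketch above; the marginal transform for a new point uses the empirical cdfs of the training data, clipped to keep V finite (an implementation assumption):

```python
import numpy as np

def damex_score(Xq, X_train, M, eps, k):
    # s_n(x) = (1/||T(x)||_inf) * Phi_hat_n^{alpha(x),eps}; the cones partition the
    # extreme region, so at most one alpha contributes for each x.
    n, d = X_train.shape
    scores = np.empty(len(Xq))
    for i, x in enumerate(Xq):
        F = (X_train <= x).mean(axis=0)                   # empirical F_j(x_j)
        v = 1.0 / (1.0 - np.clip(F, 0.0, 1.0 - 1.0 / n))  # T(x)
        alpha = tuple(np.flatnonzero(v > eps * n / k))
        scores[i] = M.get(alpha, 0.0) / v.max()           # low score = anomaly
    return scores
```

This s_n targets the extreme region; in line with the goal stated on page 22, it is meant to be combined with a standard AD score on the non-extreme region.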

Page 42:

Experiments

Figure: ROC and PR curves on the SA dataset

Page 43:

Thank you!

Page 44:

Some references:

• Aggarwal and Yu 2001, Outlier detection for high dimensional data.
• Chandola, Banerjee and Kumar 2009, Anomaly detection: A survey.
• Clifton, Hugueny and Tarassenko 2011, Novelty detection with multivariate extreme value statistics.
• Desir, Bernard, Petitjean and Heutte 2012, A New Random Forest Method for One-Class Classification.
• Einmahl and Mason 1992, Generalized quantile processes.
• Einmahl, Krajina and Segers 2012, An M-estimator for tail dependence in arbitrary dimensions.
• Goix, Sabourin and Clemencon 2015, Learning the dependence structure of rare events: a non-asymptotic study.
• Goix, Sabourin and Clemencon 2015, Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking.
• Goix, Sabourin and Clemencon 2015, On Anomaly Ranking and Excess-Mass Curves.
• Goix 2016, How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms?
• Goix, Brault, Drougard and Chiapino 2016, One Class Splitting Criteria for Random Forests with Application to Anomaly Detection.

Page 45:

Some references:

• de Haan and Ferreira 2006, Extreme value theory.
• Lee and Roberts 2008, On-line novelty detection using the Kalman filter and extreme value theory.
• Liu, Ting and Zhou 2008, Isolation forest.
• Polonik 1997, Minimum volume sets and generalized quantile processes.
• Qi 1997, Almost sure convergence of the stable tail empirical dependence function in multivariate extreme statistics.
• Resnick 1987, Extreme Values, Regular Variation, Point Processes.
• Roberts 1999, Novelty detection using extreme value statistics.
• Scholkopf, Platt, Shawe-Taylor, Smola and Williamson 2001, Estimating the support of a high-dimensional distribution.
• Scott and Nowak 2006, Learning minimum volume sets.
• Segers 2012, Max-stable models for multivariate extremes.
• Shi and Horvath 2012, Unsupervised learning with random forest predictors.
• Shyu, Chen, Sarinnapakorn and Chang 2003, A novel anomaly detection scheme based on principal component classifier.
• Vert and Vert 2006, Consistency and Convergence Rates of One-Class SVMs and Related Algorithms.

Page 46:

Part I: performance criterion
- Definition
- Learning a scoring function
- Evaluating a scoring function

Part II: Learning accurate scoring functions on extreme regions
- Multivariate EVT & Representation of Extremes
- Estimation
- Experiments

Page 47:

Table: Dataset characteristics

dataset       number of samples   number of features
shuttle             85849                  9
forestcover        286048                 54
SA                 976158                 41
SF                 699691                  4
http               619052                  3

Page 48:

Figure: ROC and PR curves on the SF dataset

Page 49:

Figure: ROC and PR curves on the http dataset

Page 50:

Figure: ROC and PR curves on the shuttle dataset

Page 51:

Figure: ROC and PR curves on the forestcover dataset

Page 52:

Benchmarks for EM/MV

Does performance in terms of EM/MV correspond to performance in terms of ROC/PR?

• Experiments: 12 datasets, 3 AD algorithms (LOF, OCSVM, iForest) → 36 possible pairwise comparisons (A_1 on D, A_2 on D), with A_1, A_2 ∈ {iForest, LOF, OCSVM} and D ∈ {adult, http, ..., spambase}.

• Results: if we only consider the pairs such that ROC and PR agree on which algorithm is best, we are able (with the EM and MV scores) to recover it in 80% of the cases.

Page 53:

Table: Original dataset characteristics

dataset       nb of samples   nb of features   anomaly class
adult              48842             6         class '> 50K' (23.9%)
http              567498             3         attack (0.39%)
pima                 768             8         pos (class 1) (34.9%)
smtp               95156             3         attack (0.03%)
wilt                4839             5         class 'w' (diseased trees) (5.39%)
annthyroid          7200             6         classes ≠ 3 (7.42%)
arrhythmia           452           164         classes ≠ 1 (features 10-14 removed) (45.8%)
forestcover       286048            10         class 4 (vs. class 2) (0.96%)
ionosphere           351            32         bad (35.9%)
pendigits          10992            16         class 4 (10.4%)
shuttle            85849             9         classes ≠ 1 (class 4 removed) (7.17%)
spambase            4601            57         spam (39.4%)

Page 54:

Table: Results for the novelty detection setting. ROC, PR, EM and MV often agree on which algorithm is best and which is worst on a given dataset. When they do not agree, it is often because ROC and PR themselves do not, meaning that the ranking is not clear.

dataset     | iForest: ROC  PR     EM       MV      | OCSVM: ROC  PR     EM       MV      | LOF: ROC  PR     EM       MV
adult       | 0.661  0.277  1.0e-04  7.5e01  | 0.642  0.206  2.9e-05  4.3e02  | 0.618  0.187  1.7e-05  9.0e02
http        | 0.994  0.192  1.3e-03  9.0     | 0.999  0.970  6.0e-03  2.6     | 0.946  0.035  8.0e-05  3.9e02
pima        | 0.727  0.182  5.0e-07  1.2e04  | 0.760  0.229  5.2e-07  1.3e04  | 0.705  0.155  3.2e-07  2.1e04
smtp        | 0.907  0.005  1.8e-04  9.4e01  | 0.852  0.522  1.2e-03  8.2     | 0.922  0.189  1.1e-03  5.8
wilt        | 0.491  0.045  4.7e-05  2.1e03  | 0.325  0.037  5.9e-05  4.5e02  | 0.698  0.088  2.1e-05  1.6e03
annthyroid  | 0.913  0.456  2.0e-04  2.6e02  | 0.699  0.237  6.3e-05  2.2e02  | 0.823  0.432  6.3e-05  1.5e03
arrhythmia  | 0.763  0.487  1.6e-04  9.4e01  | 0.736  0.449  1.1e-04  1.0e02  | 0.730  0.413  8.3e-05  1.6e02
forestcov.  | 0.863  0.046  3.9e-05  2.0e02  | 0.958  0.110  5.2e-05  1.2e02  | 0.990  0.792  3.5e-04  3.9e01
ionosphere  | 0.902  0.529  9.6e-05  7.5e01  | 0.977  0.898  1.3e-04  5.4e01  | 0.971  0.895  1.0e-04  7.0e01
pendigits   | 0.811  0.197  2.8e-04  2.6e01  | 0.606  0.112  2.7e-04  2.7e01  | 0.983  0.829  4.6e-04  1.7e01
shuttle     | 0.996  0.973  1.8e-05  5.7e03  | 0.992  0.924  3.2e-05  2.0e01  | 0.999  0.994  7.9e-06  2.0e06
spambase    | 0.824  0.371  9.5e-04  4.5e01  | 0.729  0.230  4.9e-04  1.1e03  | 0.754  0.173  2.2e-04  4.1e04
