Anomaly Detection Using Projective Markov Models

DESCRIPTION
Presented at the 2009 CDC, Shanghai: Anomaly Detection Using Projective Markov Models in a Distributed Sensor Network. Sean Meyn, Amit Surana, Yiqing Lin, and Satish Narayanan. https://netfiles.uiuc.edu/meyn/www/spm_files/Mismatch/Mismatch.html

TRANSCRIPT
Anomaly Detection Using Projective Markov Models in a Distributed Sensor Network
Sean Meyn, Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois
Joint work with Amit Surana, Yiqing Lin, and Satish Narayanan, United Technologies Research Center
Acknowledgements: Research supported by United Technologies Research Center and the National Science Foundation, CCF 07-29031
Outline
• Detection in a Sensor Network
• Multiple Models for Distributed Detection
• Application to Building Security
I. Detection in a Sensor Network
Problem Statement
Detect anomalous behavior based on:
• A large number of heterogeneous sensors
• A large heterogeneous region
• Partial information regarding anomalous behavior
• Complex behavior with or without anomaly

Interest at UTRC: building monitoring for security and energy efficiency.
Challenges and Resolution
1. Model for anomalous behavior
2. Complexity of normal behavior
3. Lack of coordinated action, or communication constraints
4. High variance associated with detection

Approach: Projective Markov models address 1-3. Parameterized models address 4.
II. Multiple Models for Distributed Detection
Binary Hypothesis Testing - Geometric View
For starters: binary hypothesis testing. For ease of explanation only: the classical i.i.d. setting.

Z is an i.i.d. sequence on a finite state space:
• π0: marginal under the model of normal behavior
• π1: marginal under the model of anomalous behavior

Optimal test: under Neyman-Pearson or Bayesian criteria, the log-likelihood ratio test:
L = log(dπ1/dπ0),   φ(Z_1^T) = I{ (1/T) Σ_{t=1}^T L(Z(t)) ≥ τ }
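The LLR test above can be sketched numerically for a finite alphabet; the function name and the integer encoding of observations are illustrative, not from the talk:

```python
import numpy as np

def llr_test(z, pi0, pi1, tau):
    """Log-likelihood ratio test for i.i.d. observations on a finite alphabet.

    z    : array of observed symbols (integers indexing the alphabet)
    pi0  : marginal under the normal model
    pi1  : marginal under the anomalous model
    tau  : threshold
    Declares an anomaly iff (1/T) * sum_t L(Z(t)) >= tau.
    """
    L = np.log(pi1 / pi0)   # L = log(dpi1/dpi0), evaluated per symbol
    stat = np.mean(L[z])    # (1/T) * sum_{t=1}^T L(Z(t))
    return stat >= tau
```

With π1 skewed toward symbol 0, a run of zeros pushes the statistic above a moderate threshold, while a run of ones pushes it well below.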
Geometry: the separating hyperplane is {µ : ∫ L(z) µ(dz) = τ}, and the divergence neighborhood of π0 is Qη(π0) = {µ : D(µ‖π0) < η}.
[Figure: π0, π1, the neighborhoods Qη(π0) and Qβ∗(π1), and the separating hyperplane]
LLR test: declare an anomaly if the empirical distribution lies outside the lower half space, where the empirical distribution is
Γ_T(z) := (1/T) Σ_{t=1}^T I{Z(t) = z},  z ∈ Z
Universal Detection
Anomalous behavior is not modeled. An alarm is sounded if the empirical distribution lies outside the divergence neighborhood:
φ(Z_1^T) = I{Γ_T ∉ Qη(π0)} = I{D(Γ_T‖π0) ≥ η}

Good news: for large T, performance approaches the optimality of the LLR test.
Bad news: for finite T, the variance of the statistic D(Γ_T‖π0) grows linearly with the size of the observation alphabet Z.
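A minimal sketch of this universal test, assuming observations are encoded as integers indexing the alphabet (all names are illustrative):

```python
import numpy as np

def empirical_dist(z, n_symbols):
    """Gamma_T: empirical distribution of the observations."""
    counts = np.bincount(z, minlength=n_symbols)
    return counts / len(z)

def kl(mu, pi):
    """D(mu || pi) on a finite alphabet (convention: 0 log 0 = 0)."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

def universal_test(z, pi0, eta, n_symbols):
    """Alarm iff D(Gamma_T || pi0) >= eta, i.e. Gamma_T leaves Q_eta(pi0)."""
    return kl(empirical_dist(z, n_symbols), pi0) >= eta
```

For a uniform π0 on two symbols, a constant sequence gives D(Γ_T‖π0) = log 2, while a balanced sequence gives zero.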
Universal Detection - Multiple Models
Suppose we extract features of the observations,
Z_i(t) = φ_i(Z(t)),  1 ≤ i ≤ n,  Z_i(t) ∼ π0_i
selected based on:
• Constraints, such as sensor locations
• Prior knowledge regarding anomalous behavior
• Variance reduction

Optimal combination of features for optimal detection:
φ(Z_1^T) = I{Γ_i(T) ∉ Qη(π0_i) for some i} = I{Γ_T ∉ ∩_i Qη(π0_i)}

Geometry: the intersection ∩_i Qη(π0_i) is the safe region.
[Figure: neighborhoods Qη(π0_1), Qη(π0_2), and the safe region]
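The multiple-model test can be sketched as follows; the feature maps φ_i are passed as plain functions, and the function and variable names are illustrative:

```python
import numpy as np

def multi_model_test(z, features, pi0_list, eta):
    """Alarm iff the empirical distribution of SOME feature sequence
    Z_i(t) = phi_i(Z(t)) leaves its divergence neighborhood Q_eta(pi0_i).
    The intersection of the neighborhoods is the 'safe region'."""
    for phi, pi0 in zip(features, pi0_list):
        zi = np.asarray([phi(x) for x in z])
        gamma = np.bincount(zi, minlength=len(pi0)) / len(zi)
        mask = gamma > 0
        if np.sum(gamma[mask] * np.log(gamma[mask] / pi0[mask])) >= eta:
            return True   # Gamma_i(T) outside Q_eta(pi0_i) for some i
    return False
```

With a single parity feature and uniform π0, an all-even sequence triggers the alarm while a balanced sequence does not.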
Markov Models
The KL divergence is replaced by the relative entropy rate:
J(Q‖P) = lim_{n→∞} (1/n) D(γ^(n)‖π^(n)) = D(γ^(2)‖π^(2)) − D(γ‖π)
where γ^(n) and π^(n) are the distributions of (Z(1), ..., Z(n)), assumed Markovian with transition matrices Q and P; the two expressions are equal under the Markov assumption.

Local models? Shannon-Mori-Zwanzig projection:
P(x, y) := π^(2)(x, y) / π(x)
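Under the Markov assumption, writing γ^(2)(x, y) = γ(x)Q(x, y) and π^(2)(x, y) = π(x)P(x, y), the relative entropy rate reduces to J(Q‖P) = Σ_x γ(x) D(Q(x, ·)‖P(x, ·)), with γ the invariant distribution of Q. A minimal numerical sketch (names illustrative):

```python
import numpy as np

def relative_entropy_rate(Q, P, gamma):
    """J(Q||P) for Markov chains with transition matrices Q, P:
    J = sum_x gamma(x) * sum_y Q(x,y) * log(Q(x,y)/P(x,y)),
    where gamma is the invariant distribution of Q."""
    J = 0.0
    for x in range(Q.shape[0]):
        for y in range(Q.shape[1]):
            if Q[x, y] > 0:
                J += gamma[x] * Q[x, y] * np.log(Q[x, y] / P[x, y])
    return J
```

The rate is zero when Q = P and strictly positive for a sticky chain measured against a uniform nominal model.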
Markov Models
Local models? Two approaches:
Option A: Shannon-Mori-Zwanzig projection, P(x, y) := π^(2)(x, y) / π(x)
Option B: Parameterized models, π^(2)_θ(x, y) = e^{θ^T ψ(x, y)}, θ ∈ R^m, with θ chosen using ML estimation

Advantage of Option B: variance grows with the dimension m, not the cardinality of the observation space.
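Option B can be illustrated with a toy ML fit of the bivariate exponential family by gradient ascent on the log-likelihood (gradient = empirical feature mean minus model feature mean). This is a sketch under the assumption of a small finite alphabet, not the estimator used in the talk; all names are illustrative:

```python
import numpy as np

def fit_theta(pairs, psi, m, n_symbols, steps=500, lr=0.5):
    """ML fit of a parameterized bivariate model
    pi2_theta(x, y) proportional to exp(theta . psi(x, y))."""
    # Feature vector psi(x, y) for every pair in the alphabet: shape (n, n, m)
    feats = np.array([[psi(x, y) for y in range(n_symbols)]
                      for x in range(n_symbols)])
    emp = np.mean([psi(x, y) for x, y in pairs], axis=0)  # E_data[psi]
    theta = np.zeros(m)
    for _ in range(steps):
        w = np.exp(feats @ theta)
        p = w / w.sum()                                   # normalized pi2_theta
        model_mean = np.tensordot(p, feats, axes=([0, 1], [0, 1]))
        theta += lr * (emp - model_mean)                  # gradient ascent step
    return theta
```

For a single indicator feature ψ(x, y) = I{x = y} and data with 80% matching pairs, the fit converges to θ = log 4, the value matching the empirical feature mean.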
III. Application to Building Security
Building Testbed at UTRC
Eleven Markov models for occupancy, based on eleven zones:
Option A: Empirical Markov model
Option B: Queueing model [Smith and Towsley, 1981]
[Figure: building floor plan with numbered zones and video cameras]
Experiment Architecture
Scenarios: capture a range of unusual traffic patterns in a building.
1. Convergence: Numerous occupants converge to a single zone
2. Divergence: Numerous occupants leave a single zone
3. Idleness: Numerous occupants converge to a single zone
4. Loitering: Numerous occupants converge to a single zone
5. High occupancy: Higher than normal occupancy in combined zones
[Figure: building floor plan with numbered zones]
Typical ROC Curves
Test statistic: based on a moving window of length δ0.

Empirical model:
R(t) = J(Γ^(2)_{δ0,t} ‖ π^(2)), where Γ^(2)_{δ0,t} is the bivariate empirical distribution over the window.

Semi-empirical model: a likelihood ratio using the ML estimate,
R(t) = (1/δ0) Σ_{k=t−δ0+1}^{t} log ℓ_{t,δ0}(k),  ℓ_{t,δ0}(k) := P_{t,δ0}(Z(k), Z(k+1)) / P(Z(k), Z(k+1))
[Figure: ROC curves (detection probability vs. false alarm probability, and detection delay) for the empirical and semi-empirical models]
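The empirical window statistic R(t) = J(Γ^(2)_{δ0,t} ‖ π^(2)) can be sketched as follows, using the identity J = Σ Γ^(2)(x, y) log(Q̂(x, y)/P(x, y)) with Q̂ the transition matrix induced by the bivariate empirical distribution. This is a sketch under the stated definitions; names are illustrative, and the semi-empirical variant is omitted:

```python
import numpy as np

def empirical_stat(z, t, delta0, P, n_symbols):
    """R(t): relative entropy rate between the bivariate empirical
    distribution over the window (t - delta0, ..., t) and the nominal
    Markov model with transition matrix P."""
    window = z[t - delta0: t + 1]
    G2 = np.zeros((n_symbols, n_symbols))      # bivariate empirical dist.
    for a, b in zip(window[:-1], window[1:]):
        G2[a, b] += 1.0
    G2 /= G2.sum()
    g = G2.sum(axis=1)                         # univariate marginal
    Qhat = G2 / np.maximum(g[:, None], 1e-12)  # induced transition matrix
    J = 0.0
    for x in range(n_symbols):
        for y in range(n_symbols):
            if G2[x, y] > 0:
                J += G2[x, y] * np.log(Qhat[x, y] / P[x, y])
    return J
```

A window stuck in one state against a uniform two-state nominal model yields R(t) = log 2.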
Centralized Detection
Test statistic: maximum of the statistics.
Anomalous episode: convergence to zone 6.
Delay is similar using either test; the semi-empirical statistic is far more discriminating.
[Figure: empirical and semi-empirical statistics over time]
Decentralized Detection: Divergence
Anomalous episode: divergence from zone 5.
Delay is similar using either test; many false alarms from the empirical statistic.
[Figure: empirical and semi-empirical statistics for zones Z4, Z5, Z6]
Decentralized Detection: Occupancy
Anomalous episode: 10% higher occupancy in zones 5 and 6.
Missed detection using the empirical statistic. Empirical statistic clairvoyant?
[Figure: empirical and semi-empirical statistics for zones Z5, Z6]
Conclusions
Contributions:
• Feasibility of an anomaly detection framework using projected Markov models
• Advantages of semi-empirical Markov models

Current research:
• Feature selection for distributed detection
• Active learning - e.g., query for additional data
• Diagnosis
• Response
References
[1,3,4,5,6,8] Geometry
[2,3,4,6,8] Universal detection
[4,7] Variance in detection and parameter estimation

[1] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3:146–158, 1975.
[2] O. Zeitouni and M. Gutman. On universal hypotheses testing via large deviations. IEEE Trans. Inform. Theory, 37(2):285–290, 1991.
[3] C. Pandit and S. P. Meyn. Worst-case large-deviations with application to queueing and information theory. Stoch. Proc. Applns., 116(5):724–756, May 2006.
[4] J. Unnikrishnan, D. Huang, S. Meyn, A. Surana, and V. Veeravalli. Universal and composite hypothesis testing via mismatched divergence. CoRR abs/0909.2234, 2009; submitted to IEEE Trans. Inform. Theory.
[5] S. Borade and L. Zheng. I-projection and the geometry of error exponents. In Proceedings of the Forty-Fourth Annual Allerton Conference on Communication, Control, and Computing, Sept. 27–29, 2006, UIUC, Illinois, USA, 2006.
[6] E. Abbe, M. Medard, S. Meyn, and L. Zheng. Finding the best mismatched detector for channel coding and hypothesis testing. Information Theory and Applications Workshop, pages 284–288, Jan. 29–Feb. 2, 2007.
[7] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3):453–471, 1990.
[8] D. Huang, J. Unnikrishnan, S. Meyn, V. Veeravalli, and A. Surana. Statistical SVMs for robust detection, supervised learning, and universal classification. In Proceedings of the Information Theory Workshop on Networking and Information Theory, Volos, Greece, 2009.