Anomaly Detection Using Projective Markov Models

DESCRIPTION
Presented at the 2009 CDC, Shanghai: Anomaly Detection Using Projective Markov Models in a Distributed Sensor Network. Sean Meyn, Amit Surana, Yiqing Lin, and Satish Narayanan. https://netfiles.uiuc.edu/meyn/www/spm_files/Mismatch/Mismatch.html

TRANSCRIPT
Anomaly Detection Using Projective Markov Models in a Distributed Sensor Network
Sean Meyn, Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois
Joint work with Amit Surana, Yiqing Lin, and Satish Narayanan, United Technologies Research Center
Acknowledgements: Research supported by United Technologies Research Center and the National Science Foundation, CCF 07-29031
Outline
• Detection in a Sensor Network
• Multiple Models for Distributed Detection
• Application to Building Security
I. Detection in a Sensor Network
Problem Statement
Detect anomalous behavior based on:
• A large number of heterogeneous sensors
• A large heterogeneous region
• Partial information regarding anomalous behavior
• Complex behavior with or without anomaly

Interest at UTRC: building monitoring for security and energy efficiency.
Challenges and Resolution
1. Model for anomalous behavior
2. Complexity of normal behavior
3. Lack of coordinated action, or communication constraints
4. High variance associated with detection

Approach: Projective Markov models address 1-3. Parameterized models address 4.
II. Multiple Models for Distributed Detection
Binary Hypothesis Testing - Geometric View
For starters: binary hypothesis testing. For ease of explanation only: the classical i.i.d. setting.

Z is an i.i.d. sequence on a finite state space:
• π0: marginal under the model of normal behavior
• π1: marginal under the model of anomalous behavior

Optimal test: under Neyman-Pearson or Bayesian criteria, the log-likelihood ratio test:
L = log(dπ1/dπ0),   φ(Z_1^T) = I{ (1/T) Σ_{t=1}^T L(Z(t)) ≥ τ }
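The LLR test above can be sketched numerically for a finite alphabet; the function name and the integer encoding of observations are illustrative, not from the talk:

```python
import numpy as np

def llr_test(z, pi0, pi1, tau):
    """Log-likelihood ratio test for i.i.d. observations on a finite alphabet.

    z    : array of observed symbols (integers indexing the alphabet)
    pi0  : marginal under the normal model
    pi1  : marginal under the anomalous model
    tau  : threshold
    Declares an anomaly iff (1/T) * sum_t L(Z(t)) >= tau.
    """
    L = np.log(pi1 / pi0)   # L = log(dpi1/dpi0), evaluated per symbol
    stat = np.mean(L[z])    # (1/T) * sum_{t=1}^T L(Z(t))
    return stat >= tau
```

With π1 skewed toward symbol 0, a run of zeros pushes the statistic above a moderate threshold, while a run of ones pushes it well below.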
Geometry: the separating hyperplane is {µ : ∫ L(z) µ(dz) = τ}, and the divergence neighborhood of π0 is Qη(π0) = {µ : D(µ‖π0) < η}.
[Figure: π0, π1, the neighborhoods Qη(π0) and Qβ∗(π1), and the separating hyperplane]
LLR test: declare an anomaly if the empirical distribution lies outside the lower half space, where the empirical distribution is
Γ_T(z) := (1/T) Σ_{t=1}^T I{Z(t) = z},  z ∈ Z
Universal Detection
Anomalous behavior is not modeled. An alarm is sounded if the empirical distribution lies outside the divergence neighborhood:
φ(Z_1^T) = I{Γ_T ∉ Qη(π0)} = I{D(Γ_T‖π0) ≥ η}

Good news: for large T, performance approaches the optimality of the LLR test.
Bad news: for finite T, the variance of the statistic D(Γ_T‖π0) grows linearly with the size of the observation alphabet Z.
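A minimal sketch of this universal test, assuming observations are encoded as integers indexing the alphabet (all names are illustrative):

```python
import numpy as np

def empirical_dist(z, n_symbols):
    """Gamma_T: empirical distribution of the observations."""
    counts = np.bincount(z, minlength=n_symbols)
    return counts / len(z)

def kl(mu, pi):
    """D(mu || pi) on a finite alphabet (convention: 0 log 0 = 0)."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

def universal_test(z, pi0, eta, n_symbols):
    """Alarm iff D(Gamma_T || pi0) >= eta, i.e. Gamma_T leaves Q_eta(pi0)."""
    return kl(empirical_dist(z, n_symbols), pi0) >= eta
```

For a uniform π0 on two symbols, a constant sequence gives D(Γ_T‖π0) = log 2, while a balanced sequence gives zero.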
Universal Detection - Multiple Models
Suppose we extract features of the observations,
Z_i(t) = φ_i(Z(t)),  1 ≤ i ≤ n,  Z_i(t) ∼ π0_i
selected based on:
• Constraints, such as sensor locations
• Prior knowledge regarding anomalous behavior
• Variance reduction

Optimal combination of features for optimal detection:
φ(Z_1^T) = I{Γ_i(T) ∉ Qη(π0_i) for some i} = I{Γ_T ∉ ∩_i Qη(π0_i)}

Geometry: the intersection ∩_i Qη(π0_i) is the safe region.
[Figure: neighborhoods Qη(π0_1), Qη(π0_2), and the safe region]
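The multiple-model test can be sketched as follows; the feature maps φ_i are passed as plain functions, and the function and variable names are illustrative:

```python
import numpy as np

def multi_model_test(z, features, pi0_list, eta):
    """Alarm iff the empirical distribution of SOME feature sequence
    Z_i(t) = phi_i(Z(t)) leaves its divergence neighborhood Q_eta(pi0_i).
    The intersection of the neighborhoods is the 'safe region'."""
    for phi, pi0 in zip(features, pi0_list):
        zi = np.asarray([phi(x) for x in z])
        gamma = np.bincount(zi, minlength=len(pi0)) / len(zi)
        mask = gamma > 0
        if np.sum(gamma[mask] * np.log(gamma[mask] / pi0[mask])) >= eta:
            return True   # Gamma_i(T) outside Q_eta(pi0_i) for some i
    return False
```

With a single parity feature and uniform π0, an all-even sequence triggers the alarm while a balanced sequence does not.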
Markov Models
The KL divergence is replaced by the relative entropy rate:
J(Q‖P) = lim_{n→∞} (1/n) D(γ^(n)‖π^(n)) = D(γ^(2)‖π^(2)) − D(γ‖π)
where γ^(n) and π^(n) are the distributions of (Z(1), ..., Z(n)), assumed Markovian with transition matrices Q and P; the two expressions are equal under the Markov assumption.

Local models? Shannon-Mori-Zwanzig projection:
P(x, y) := π^(2)(x, y) / π(x)
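Under the Markov assumption, writing γ^(2)(x, y) = γ(x)Q(x, y) and π^(2)(x, y) = π(x)P(x, y), the relative entropy rate reduces to J(Q‖P) = Σ_x γ(x) D(Q(x, ·)‖P(x, ·)), with γ the invariant distribution of Q. A minimal numerical sketch (names illustrative):

```python
import numpy as np

def relative_entropy_rate(Q, P, gamma):
    """J(Q||P) for Markov chains with transition matrices Q, P:
    J = sum_x gamma(x) * sum_y Q(x,y) * log(Q(x,y)/P(x,y)),
    where gamma is the invariant distribution of Q."""
    J = 0.0
    for x in range(Q.shape[0]):
        for y in range(Q.shape[1]):
            if Q[x, y] > 0:
                J += gamma[x] * Q[x, y] * np.log(Q[x, y] / P[x, y])
    return J
```

The rate is zero when Q = P and strictly positive for a sticky chain measured against a uniform nominal model.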
Markov Models
Local models? Two approaches:
Option A: Shannon-Mori-Zwanzig projection, P(x, y) := π^(2)(x, y) / π(x)
Option B: Parameterized models, π^(2)_θ(x, y) = e^{θ^T ψ(x, y)}, θ ∈ R^m, with θ chosen using ML estimation

Advantage of Option B: variance grows with the dimension m, not the cardinality of the observation space.
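Option B can be illustrated with a toy ML fit of the bivariate exponential family by gradient ascent on the log-likelihood (gradient = empirical feature mean minus model feature mean). This is a sketch under the assumption of a small finite alphabet, not the estimator used in the talk; all names are illustrative:

```python
import numpy as np

def fit_theta(pairs, psi, m, n_symbols, steps=500, lr=0.5):
    """ML fit of a parameterized bivariate model
    pi2_theta(x, y) proportional to exp(theta . psi(x, y))."""
    # Feature vector psi(x, y) for every pair in the alphabet: shape (n, n, m)
    feats = np.array([[psi(x, y) for y in range(n_symbols)]
                      for x in range(n_symbols)])
    emp = np.mean([psi(x, y) for x, y in pairs], axis=0)  # E_data[psi]
    theta = np.zeros(m)
    for _ in range(steps):
        w = np.exp(feats @ theta)
        p = w / w.sum()                                   # normalized pi2_theta
        model_mean = np.tensordot(p, feats, axes=([0, 1], [0, 1]))
        theta += lr * (emp - model_mean)                  # gradient ascent step
    return theta
```

For a single indicator feature ψ(x, y) = I{x = y} and data with 80% matching pairs, the fit converges to θ = log 4, the value matching the empirical feature mean.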
III. Application to Building Security
Building Testbed at UTRC
Eleven Markov models for occupancy, based on eleven zones:
Option A: Empirical Markov model
Option B: Queueing model [Smith and Towsley, 1981]
[Figure: building floor plan with numbered zones and video cameras]
Experiment Architecture
Scenarios: capture a range of unusual traffic patterns in a building.
1. Convergence: Numerous occupants converge to a single zone
2. Divergence: Numerous occupants leave a single zone
3. Idleness: Numerous occupants converge to a single zone
4. Loitering: Numerous occupants converge to a single zone
5. High occupancy: Higher than normal occupancy in combined zones
[Figure: building floor plan with numbered zones]
Typical ROC Curves
Test statistic: based on a moving window of length δ0.

Empirical model:
R(t) = J(Γ^(2)_{δ0,t} ‖ π^(2)), where Γ^(2)_{δ0,t} is the bivariate empirical distribution over the window.

Semi-empirical model: a likelihood ratio using the ML estimate,
R(t) = (1/δ0) Σ_{k=t−δ0+1}^{t} log ℓ_{t,δ0}(k),  ℓ_{t,δ0}(k) := P_{t,δ0}(Z(k), Z(k+1)) / P(Z(k), Z(k+1))
[Figure: ROC curves (detection probability vs. false alarm probability, and detection delay) for the empirical and semi-empirical models]
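The empirical window statistic R(t) = J(Γ^(2)_{δ0,t} ‖ π^(2)) can be sketched as follows, using the identity J = Σ Γ^(2)(x, y) log(Q̂(x, y)/P(x, y)) with Q̂ the transition matrix induced by the bivariate empirical distribution. This is a sketch under the stated definitions; names are illustrative, and the semi-empirical variant is omitted:

```python
import numpy as np

def empirical_stat(z, t, delta0, P, n_symbols):
    """R(t): relative entropy rate between the bivariate empirical
    distribution over the window (t - delta0, ..., t) and the nominal
    Markov model with transition matrix P."""
    window = z[t - delta0: t + 1]
    G2 = np.zeros((n_symbols, n_symbols))      # bivariate empirical dist.
    for a, b in zip(window[:-1], window[1:]):
        G2[a, b] += 1.0
    G2 /= G2.sum()
    g = G2.sum(axis=1)                         # univariate marginal
    Qhat = G2 / np.maximum(g[:, None], 1e-12)  # induced transition matrix
    J = 0.0
    for x in range(n_symbols):
        for y in range(n_symbols):
            if G2[x, y] > 0:
                J += G2[x, y] * np.log(Qhat[x, y] / P[x, y])
    return J
```

A window stuck in one state against a uniform two-state nominal model yields R(t) = log 2.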
Centralized Detection
Test statistic: maximum of the statistics.
Anomalous episode: convergence to zone 6.
Delay is similar using either test; the semi-empirical statistic is far more discriminating.
[Figure: empirical and semi-empirical statistics over time]
Decentralized Detection: Divergence
Anomalous episode: divergence from zone 5.
Delay is similar using either test; many false alarms from the empirical statistic.
[Figure: empirical and semi-empirical statistics for zones Z4, Z5, Z6]
Decentralized Detection: Occupancy
Anomalous episode: 10% higher occupancy in zones 5 and 6.
Missed detection using the empirical statistic. Empirical statistic clairvoyant?
[Figure: empirical and semi-empirical statistics for zones Z5, Z6]
Conclusions
Contributions:
• Feasibility of an anomaly detection framework using projected Markov models
• Advantages of semi-empirical Markov models

Current research:
• Feature selection for distributed detection
• Active learning - e.g., query for additional data
• Diagnosis
• Response
References
[1,3,4,5,6,8] Geometry
[2,3,4,6,8] Universal detection
[4,7] Variance in detection and parameter estimation

[1] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3:146–158, 1975.
[2] O. Zeitouni and M. Gutman. On universal hypotheses testing via large deviations. IEEE Trans. Inform. Theory, 37(2):285–290, 1991.
[3] C. Pandit and S. P. Meyn. Worst-case large-deviations with application to queueing and information theory. Stoch. Proc. Applns., 116(5):724–756, May 2006.
[4] J. Unnikrishnan, D. Huang, S. Meyn, A. Surana, and V. Veeravalli. Universal and composite hypothesis testing via mismatched divergence. CoRR abs/0909.2234, 2009; submitted to IEEE Trans. Inform. Theory.
[5] S. Borade and L. Zheng. I-projection and the geometry of error exponents. In Proceedings of the Forty-Fourth Annual Allerton Conference on Communication, Control, and Computing, Sept. 27–29, 2006, UIUC, Illinois, USA, 2006.
[6] E. Abbe, M. Medard, S. Meyn, and L. Zheng. Finding the best mismatched detector for channel coding and hypothesis testing. Information Theory and Applications Workshop, pages 284–288, Jan. 29–Feb. 2, 2007.
[7] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3):453–471, 1990.
[8] D. Huang, J. Unnikrishnan, S. Meyn, V. Veeravalli, and A. Surana. Statistical SVMs for robust detection, supervised learning, and universal classification. In Proceedings of the Information Theory Workshop on Networking and Information Theory, Volos, Greece, 2009.