TRANSCRIPT
Information-theoretic Learning and Data Mining
Kenji Yamanishi
The University of Tokyo, JAPAN
Oct. 18th, 2009
Contents
1. Information-theoretic Learning based Novelty Detection
2. Theory of Dynamic Model Selection
3. Applications of DMS to Data Mining
3-1. Masquerade Detection
3-2. Failure Detection
3-3. Topic Dynamics Detection
3-4. Network Structure Mining
4. Summary
1. Information-theoretic Learning and Novelty Detection
Data Mining Issues
Pattern Discovery / Anomaly Detection
Data Mining = Machine Learning + Pre/Post-Processing
Distributed and Heterogeneous Data
Knowledge Discovery: Marketing, CRM, SCM, ERP, Community Mgmt.
Risk Management: Fraud Detection, Failure Detection, Compliance Mgmt, Security
Data: WEB logs, RFID, cellular media, sensor
Dynamic and Non-stationary Data
Future Prediction
[Figure: time-series plot from 2004/8/18 12:00 to 2004/8/21 12:00 (y-axis 0 to 300), showing measured values, predicted/estimated values, an upper limit, and a threshold]
Mining from large amounts of dynamic and heterogeneous data is a critical issue.
Information-theoretic Learning and Data Mining
Non-stationary Dynamic Information Sources
Stochastic Complexity Minimization ... Minimum Description Length Principle (Data Compression and Summarization)
Dynamics of Statistical Models
Structural Changes
Novelty Detection
Structured Information
Dynamic Data
Structural changes in dynamic data are discovered using stochastic complexity minimization.
2.Theory of Dynamic Model Selection
Concept of Dynamic Model Selection
Issue: How do you track changes of statistical models under non-stationary environments?
[Figure: data x1, ..., xt-1, xt, ..., xn along a time axis; Model 1 governs the earlier part and Model 2 the later part]
Selecting a model sequence that explains the data best
Data
Model sequence: M1, ..., M1, M2, ..., M2
Methodologies
■ Discounting Learning
■ Windowing
■ Dynamic Model Selection (DMS)
Ex.1: UNIX Command Pattern Analysis
[Figure: over time the number of patterns grows from 1 to 2: first only Pattern 1 (cd, ls), then Pattern 1 (cd, ls) and Pattern 2 (sh, ps, vi)]
Tracking changes of the mixture size leads to detection of a new pattern.
Probabilistic model (Mixture Model)
UNIX command sequences:
1 emacs 2 cp 3 bash 4 perl 5 :
1 cd 2 ls 3 emacs 4 platex 5 :
1 cd 2 ls 3 emacs 4 platex 5 :
1 emacs 2 make 3 ./a.exe 4 : 5 :
Ex.2: Curve Fitting
Static Model Selection vs. Dynamic Model Selection
K: degree of polynomial (K = 2, K = 4, K = 5)
The degree of the curve changes over time; DMS tracks curve degrees that change over time.
Related Work
• Static model selection
AIC [Akaike 1973], BIC [Schwarz 1978], MDL [Rissanen 1978], MML [Wallace & Freeman 1987]
– Under stationarity assumption
• Tracking the best expert [Herbster & Warmuth 1998]
Derandomization [Vovk 1997] [Cesa-Bianchi & Lugosi 2006]
– Best model (= "best expert") may change over time
– Cannot track the sequence of best models itself
・MDL-based Dynamic Model Selection under non-stationarity assumption [Yamanishi and Maruyama 2007]
・Switching [Erven, Grünwald & de Rooij 2008]
– Convergence as with MDL and convergence rate as with AIC
・Changing dependency detection [Fearnhead & Liu 2007]
MDL-based Information Criterion for DMS
Evaluating the goodness of a model sequence via the Minimum Description Length (MDL) Principle [Rissanen 1978]:
the code length required to encode the data sequence x^n and the model sequence k^n under the prefix condition (Kraft's inequality).
Minimize with respect to k^n for a given x^n:
・Batch DMS
・Sequential DMS
Predictive Distribution
The dynamics of model changes are described using a predictive distribution.
Switching Distribution: description of model transitions.
[Figure: the model index jumps from k to k+1 at a change point on the time axis]
Assumption: the model transits probabilistically, with model transition probability P(k_t | k^{t-1} : θ) (θ: parameter).
DMS (Dynamic Model Selection) Criterion:

  ℓ(x^n : k^n) = − Σ_{t=1}^{n} log P(x_t | x^{t-1}, k_t) − Σ_{t=1}^{n} log P(k_t | k^{t-1} : θ̂)

where the first term is the predictive code length for the data and the second term is the predictive code length for the model sequence (θ̂: estimate of θ).
Input: x^n = x_1, ..., x_n
Output: k^n = k_1, ..., k_n minimizing ℓ(x^n : k^n)
Switching distribution with parameter ε:

  P(k_t | k_{t-1} : ε) = 1 − ε   if k_t = k_{t-1}
                       = ε/2     if k_t = k_{t-1} + 1 and k_t ∈ {1, ..., K}
                       = ε/2     if k_t = k_{t-1} − 1 and k_t ∈ {1, ..., K}

[Figure: from state k, stay at k with probability 1 − ε, or move to k−1 or k+1 with probability ε/2 each]
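As a concrete reading of the case analysis above, the switching distribution can be sketched as a small function (an illustration, not the talk's code; dropping rather than renormalizing the probability mass that would fall outside {1, ..., K} is an assumption not specified on the slide):

```python
def switch_prob(k_t, k_prev, eps, K):
    """Switching distribution P(k_t | k_{t-1} : eps) over models 1..K.

    Stay at the same model with probability 1 - eps; move to an adjacent
    model with probability eps/2 each (boundary mass is simply dropped).
    """
    if k_t == k_prev:
        return 1.0 - eps
    if k_t in (k_prev - 1, k_prev + 1) and 1 <= k_t <= K:
        return eps / 2.0
    return 0.0
```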
Coding Bound for Batch DMS for Switching Dist.
Theorem 1 [Yamanishi and Maruyama, IEEE Trans. IT 2007]: There exists a batch DMS algorithm (DMS1) for switching distributions that runs in time O(Kn^2), and its total code length is upper-bounded by the data complexity for switching plus the model-sequence complexity.
[Figure: trellis of K model states over time; at each time and each state, an optimal path is selected from the path sets selected at the previous time]
POINT 1: Optimal path search based on dynamic programming
POINT 2: Krichevsky–Trofimov estimation of model transition probabilities
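The path search of POINT 1 can be sketched as a Viterbi-style dynamic program (a simplified illustration, not the DMS1 algorithm itself: the switching parameter eps is held fixed instead of being estimated by the Krichevsky–Trofimov method of POINT 2, and the per-step predictive code lengths nll[t][k] are assumed precomputed):

```python
import math

def batch_dms(nll, eps):
    """Batch dynamic model selection by dynamic programming (a sketch).

    nll[t][k] is the predictive code length -log P(x_t | x^{t-1}, model k);
    model transitions are charged according to the switching distribution
    with fixed parameter eps. Returns the model sequence k_1, ..., k_n
    (0-indexed models) minimizing the resulting DMS criterion.
    """
    n, K = len(nll), len(nll[0])
    INF = float("inf")
    cost = [nll[0][k] for k in range(K)]       # best cost ending in model k
    back = [[0] * K for _ in range(n)]
    for t in range(1, n):
        new = [INF] * K
        for k in range(K):
            for j in range(K):                 # previous model j -> current k
                if j == k:
                    trans = -math.log(1.0 - eps)
                elif abs(j - k) == 1:
                    trans = -math.log(eps / 2.0)
                else:
                    continue                   # switching dist. allows +/-1 only
                c = cost[j] + trans + nll[t][k]
                if c < new[k]:
                    new[k], back[t][k] = c, j
        cost = new
    # trace back the optimal model sequence
    k = min(range(K), key=lambda k: cost[k])
    seq = [k]
    for t in range(n - 1, 0, -1):
        k = back[t][k]
        seq.append(k)
    return seq[::-1]
```

With two models whose per-step code lengths swap halfway through the data, the recovered sequence switches models exactly once, as the switching penalty makes spurious back-and-forth transitions expensive.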
Sequential DMS Problem
From the model sequence set restricted to a window (a, b), select the model sequence so that the DMS criterion (the predictive code length of the model sequence plus the data code length for the switching distribution) is minimized.
[Figure: sliding window of size B covering times t−B+1, ..., t]
[Yamanishi and Sakurai WITMSE 2010]
3. Applications of DMS to Data Mining
Computer usage logs (UNIX command sequences):
1 emacs 2 cp 3 bash 4 perl 5 :
1 cd 2 ls 3 emacs 4 platex 5 :
1 cd 2 ls 3 emacs 4 platex 5 :
1 emacs 2 make 3 ./a.exe 4 : 5 :
Latent information: structural changes over time, e.g., program coding (emacs, cp, make) followed by an information leak (cp **.exe).
Tracking the emergence of a new pattern leads to the discovery of a masquerader's behavior.
3.1. Masquerade Detection
Command session: y_j = (ps, cd, ls, cp, ...) = (y_1, ..., y_{T_j}), of length T_j
y_1, ..., y_M: command patterns are modeled using an HMM mixture; we wish to detect a masquerade by tracking changes in the mixture size:

  P(y_j | Θ) = Σ_{k=1}^{K} π_k P_k(y_j | θ_k)
Each behavior pattern is represented by a first-order HMM:

  P_k(y_j | θ_k) = Σ_{x_1, ..., x_{T_j}} γ_k(x_1) Π_{t=1}^{T_j−1} a_k(x_{t+1} | x_t) Π_{t=1}^{T_j} b_k(y_t | x_t)

(x_1, ..., x_{T_j}): hidden states ⇒ Markov
K: number of different behavior patterns (mixture size)
Model parameters: γ_k: initial prob., a_k: transition prob., b_k: event prob.
[Figure: session stream over time]
[Maruyama & Yamanishi ITW 2004]
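The HMM-mixture session model above can be sketched in code: the forward algorithm evaluates each component likelihood P_k(y_j), and the mixture sums them with the weights π_k (an illustration with made-up parameter layouts, not the talk's implementation; events are assumed encoded as integer indices):

```python
def hmm_likelihood(y, init, trans, emit):
    """P(y | one HMM component) by the forward algorithm.

    init[s]: initial state prob. (gamma_k), trans[s][s2]: state transition
    prob. (a_k), emit[s][o]: event prob. (b_k); y is a list of event indices.
    """
    S = len(init)
    alpha = [init[s] * emit[s][y[0]] for s in range(S)]
    for o in y[1:]:
        alpha = [sum(alpha[s] * trans[s][s2] for s in range(S)) * emit[s2][o]
                 for s2 in range(S)]
    return sum(alpha)

def mixture_likelihood(y, weights, hmms):
    """P(y | Theta) = sum_k pi_k P_k(y), the HMM-mixture session model.

    hmms is a list of (init, trans, emit) parameter tuples, one per pattern.
    """
    return sum(w * hmm_likelihood(y, *h) for w, h in zip(weights, hmms))
```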
Flow of Behavior Change Detection
Session stream → anomalous behavior detection:
① On-line discounting learning algorithm for each mixture model (learning parameters)
② Dynamic model selection (selection of the optimal mixture size K*)
③ Universal-test-statistic-based scoring with a dynamically optimized threshold (anomaly score computation)
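Steps ① and ③ can be caricatured with a single discounted frequency model: old counts decay before each update, and an event's anomaly score is its predictive code length −log p(event). This is only a toy stand-in (the talk uses HMM mixtures and a universal test statistic, not a plain categorical model; r and smooth are assumed values):

```python
import math

def score_stream(events, alphabet, r=0.02, smooth=0.5):
    """On-line discounting frequency model with -log-loss anomaly scores.

    Counts are decayed by (1 - r) before each update, so old behavior is
    gradually forgotten; each event is scored under the model learned so far.
    """
    counts = {a: smooth for a in alphabet}
    scores = []
    for e in events:
        total = sum(counts.values())
        scores.append(-math.log(counts[e] / total))  # score, then update
        for a in counts:
            counts[a] *= (1.0 - r)                   # discount old statistics
        counts[e] += 1.0
    return scores
```

An event never seen in recent history gets a sharply higher score than a routine one, which is the behavior the flow above relies on.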
Emerging Pattern Identification
[Figure: anomaly score over time (t = 0 to 1400) with a threshold, and the mixture size selected among HMM mixtures from K = 1 up to K = Kmax]
Masquerade Pattern Detection (Ex.: User 30)
T-DMS1 detects the masquerader's command pattern at t = 820.
Top 7 transitions of each cluster (k = 2 at t = 820); masquerade data (t = 820, ..., 849):
Pattern 1: file→ridstd, cat→rdistd, touch→tcsh, df→tcsh, rshd→rdistd, tcsh→rshd, rdistd→tcsh
Pattern 2: cat→post, cat→file, ln→ln, generic→ln, m→generic, file→post, awk→cat
The masquerader's pattern deviates largely from the normal user's pattern (mainly operating "remote shells").
3.2 Failure Detection
ID / Time stamp / Event Severity / Att1 / Att2 / Message
## Nov 13 00:06:23: ERR bridge: !brdgursrv: queue is full. discarding a message.
## Nov 13 10:15:00: WARN: INTR: ether2atm: Ethernet Slot 2L/1 Lock-Up!!
## Nov 13 10:15:10: WARN: INTR: ether2atm: Ethernet Slot 2L/2 Lock-Up!!
## Nov 13 10:15:20: WARN: INTR: ether2atm: Ethernet Slot 2L/3 Lock-Up!!
■ Syslog: a sequence of events collected using the BSD syslog protocol
■ Includes information crucial to network failure
y_1, ..., y_M: syslog patterns are also modeled using an HMM mixture; we wish to identify emerging failure patterns by tracking changes in the mixture size.
[Yamanishi & Maruyama KDD 2005]
Emerging Pattern Identification
Identify a new pattern by detecting the change point of the mixture size.
■ Goal: track a new behavior pattern caused by a network failure (e.g., a lock-up failure).
The component with the smallest occurrence probability corresponds to a new pattern.
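The identification rule above — take the component with the smallest occurrence probability as the emerging pattern — is a one-liner over the estimated mixture weights (a sketch; the weight list is a hypothetical input, e.g. the pi_k of the fitted HMM mixture):

```python
def emerging_component(weights):
    """Index of the mixture component with the smallest occurrence
    probability, taken as the candidate new (emerging) pattern."""
    return min(range(len(weights)), key=weights.__getitem__)
```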
3-3. Topic Dynamics Detection
Text Processing at a Contact Center
Inquiries and claims → clustering into topics (Topic 1, Topic 2, Topic 3) → DMS → topic emergence detection (Topic 1, Topic 2, Topic 4, Topic 5)
Tracking changes of topic organization
[Figure: activity numbers of Topic 1, Topic 2, and Topic 5 over time, showing the emergence of Topic 5]
[Morinaga and Yamanishi KDD 2004]
Dynamic Topic Modeling with Gaussian Mixture
Each dot is an input document; each p.d.f. component of the Gaussian mixture corresponds to a main topic.
The mixture is learned online, and emerging components are detected over time.
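A minimal sketch of online discounted learning for a one-dimensional Gaussian mixture (an illustration only: the discounting factor r, the reduction of a document to a single feature value x, and the exact update form are assumptions, not the cited paper's algorithm):

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def online_gmm_step(x, pis, mus, vars_, r=0.05):
    """One discounted EM-style update of a 1-D Gaussian mixture, in place.

    Responsibilities weight a discounted update of the mixing weights,
    means, and variances, so old documents are gradually forgotten.
    Returns the responsibilities for the new point x.
    """
    dens = [p * gauss_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_)]
    total = sum(dens) or 1e-300
    resp = [d / total for d in dens]
    for k, g in enumerate(resp):
        pis[k] = (1 - r) * pis[k] + r * g
        mus[k] += r * g * (x - mus[k])                 # move mean toward x
        vars_[k] = (1 - r) * vars_[k] + r * g * (x - mus[k]) ** 2
        vars_[k] = max(vars_[k], 1e-6)                 # keep variances positive
    s = sum(pis)
    for k in range(len(pis)):
        pis[k] /= s
    return resp
```

Feeding a stream of documents near one component's mean drives that component's mixing weight up, which is the signal the topic-emergence detection tracks.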
Topic Dynamics Detection at a Contact Center
[Figure: topic activity from 3/23 to 5/22, with component C42 as the main topic; the highlighted period is when the topics "Service ZZZ" and "Failure" are active. The topic "Transfer" disappears (old topic) while the topic "Failure of service X" emerges (new topic).]
Topic structure changes in the contact center text stream.
Changes of Topic Number
[Figure: changes of the number of main topics (0 to 50) over time, from 2/23 to 5/13]
• The total number of inquiries at the contact center is 1202.
3-4. Network Structure Mining
Tracking network hierarchy changes leads to new hub detection.
Ex.: SNS networks — detection of influencers, criminal groups, and new communities.
Other examples: physical networks, coauthor networks, co-purchase networks.
Time-evolving networks show the emergence of new hubs, the emergence of new hierarchies, changes of the partitioning structure, and the emergence of new clusters.
Graph Structure Change Detection
Tracking a new graph cluster leads to discovery of a new community.
[Figure: time series of correlation matrices over nodes ① to ⑦, whose partitioning structure changes over time]
[Hirose, Yamanishi, Nakata, Fujimaki KDD 2009]
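One illustrative way to score a structural change between consecutive correlation matrices is to compare their leading eigenvectors. This is only a hedged sketch of the idea that a shift in the dominant correlation pattern signals a structural change, not the eigen equation compression method of the cited paper:

```python
def principal_eigvec(A, iters=300):
    """Leading eigenvector (up to sign) of a small symmetric matrix by
    power iteration."""
    n = len(A)
    v = [float(i + 1) for i in range(n)]   # asymmetric start vector
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return v
        v = [x / norm for x in w]
    return v

def change_score(A_prev, A_curr):
    """Structural-change score between two correlation matrices:
    1 - |cos angle| between their leading eigenvectors. Near 0 for a
    stable structure, near 1 for a drastic change of the dominant
    correlation pattern."""
    u, v = principal_eigvec(A_prev), principal_eigvec(A_curr)
    return 1.0 - abs(sum(a * b for a, b in zip(u, v)))
```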
4. Summary
• Novelty detection from dynamic data is a challenging issue in data mining.
• Dynamic model selection based on the MDL principle is a novel method to detect structural changes in a data stream.
• Applications cover masquerade detection, failure detection, topic emergence detection, and network structure mining, and the method is applicable to a wide range of areas.
References
1) K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne: On-line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge Discovery, Vol. 8, Issue 3, pp. 275-300, May 2004.
2) J. Takeuchi and K. Yamanishi: A Unifying Framework for Detecting Outliers and Change-points from Time Series. IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 4, pp. 482-492, 2006.
3) S. Morinaga and K. Yamanishi: Tracking Dynamics of Topic Trends Using a Finite Mixture Model. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), ACM Press, 2004.
4) K. Yamanishi and Y. Maruyama: Dynamic Syslog Mining for Network Failure Monitoring. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005), pp. 499-508, ACM Press, 2005.
5) K. Yamanishi and Y. Maruyama: Dynamic Model Selection with Its Applications to Novelty Detection. IEEE Transactions on Information Theory, Vol. 53, No. 6, pp. 2180-2189, June 2007.
6) S. Hirose, K. Yamanishi, T. Nakata, and R. Fujimaki: Network Anomaly Detection Based on Eigen Equation Compression. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), 2009.
7) K. Yamanishi and E. Sakurai: Extensions and Probabilistic Analysis of Dynamic Model Selection. 2010 Workshop on Information Theoretic Methods for Science and Engineering (WITMSE 2010), Tampere, Finland, August 2010.