automatic sequential pattern mining in data streamsyasushi/...•e.g., iot sensors/web click logs...
TRANSCRIPT
Automatic Sequential Pattern Miningin Data Streams
Koki Kawabata*, Yasuko Matsubara & Yasushi SakuraiISIR-AIRC, Osaka University
*Supported by SIGIR Student Travel Grants
MotivationGiven: time-evolving data streams• e.g., IoT sensors/Web click logs• contain multiple patterns
CIKM2019 Sakurai Lab. K.Kawabata et al. 2
MotivationGiven: time-evolving data streams• e.g., IoT sensors/Web click logs• contain multiple patterns
Answer: the following questions:
1. What kind of patterns?2. How many patterns?3. When do patterns change?
CIKM2019 Sakurai Lab. K.Kawabata et al. 3
?
MotivationGiven: time-evolving data streams• e.g., IoT sensors/Web click logs• contain multiple patterns
Requirements:• Incremental• We cannot access all historical data
• Automatic• # of patterns are unknown in advance• without any parameter tunings
CIKM2019 Sakurai Lab. K.Kawabata et al. 4
MotivationGiven: time-evolving data streams• e.g., IoT sensors/Web click logs• contain multiple patterns
Requirements:• Incremental• We cannot access all historical data
• Automatic• # of patterns are unknown in advance• without any parameter tunings
CIKM2019 Sakurai Lab. K.Kawabata et al. 5
StreamScope: automatic & incremental approach
Demo movie
CIKM2019 Sakurai Lab. K.Kawabata et al. 6
Demo movie
CIKM2019 Sakurai Lab. K.Kawabata et al. 7
#1: Arm curl
#2: Rowing
Demo movie
CIKM2019 Sakurai Lab. K.Kawabata et al. 8
Demo movie
CIKM2019 Sakurai Lab. K.Kawabata et al. 9
#1: Arm curl
#2: Rowing
#5: Push up
#4: side raise
#3: Intervals
Outline1. Motivation2. Problem definition3. Model4. Streaming Algorithm5. Experiments6. Conclusions
CIKM2019 Sakurai Lab. K.Kawabata et al. 10
Problem definition
CIKM2019 Sakurai Lab. K.Kawabata et al. 11
1000 2000 3000 4000 5000Time
0
0.5
1
Value
L-armR-armL-legR-leg
12
34
Given:• Data stream X
Find:1. Segment: 𝒮2. Regime: Θ3. Segment-
membership: ℱ
Problem definitionData stream: set of d-dimensional vectors
CIKM2019 Sakurai Lab. K.Kawabata et al. 12
1000 2000 3000 4000 5000Time
0
0.5
1
Value
L-armR-armL-legR-leg
𝑋 = 𝑥!, … , 𝑥"
Given
𝑑 = 4
Problem definitionSegment: start/end positions of each pattern
CIKM2019 Sakurai Lab. K.Kawabata et al. 13
1000 2000 3000 4000 5000Time
0
0.5
1
Value
L-armR-armL-legR-leg
𝒮 = 𝑠!, … , 𝑠#
𝑚 = 8
Hidden
𝑠! 𝑠" 𝑠# 𝑠$ 𝑠% 𝑠& 𝑠' 𝑠(
Problem definitionRegime: segment groups
CIKM2019 Sakurai Lab. K.Kawabata et al. 14
1000 2000 3000 4000 5000Time
0
0.5
1
Value
L-armR-armL-legR-leg
12
34
Θ = 𝜃!, … , 𝜃$ , Φ
𝑟 = 4
Hidden
Problem definitionSegment-membership: regime-assignment
CIKM2019 Sakurai Lab. K.Kawabata et al. 15
1000 2000 3000 4000 5000Time
0
0.5
1
Value
L-armR-armL-legR-leg
12
34
ℱ = 𝑓 1 ,… , 𝑓 𝑚
1, 1,2, 2, 2,4, 4,3,ℱ = { }
e.g., 𝑓 3 = 4
Hidden
Problem definitionGiven: d-dimensional data stream 𝑋 = 𝑥!, … , 𝑥)
Find: compact description 𝒞 = 𝑚, 𝑟, 𝒮, Θ, ℱ of 𝑋
•𝑚 segments 𝒮• 𝑟 regimes Θ• segment-membership ℱ
CIKM2019 Sakurai Lab. K.Kawabata et al. 16
𝒞 = 𝑚, 𝑟, 𝒮, Θ, ℱ
Outline1. Motivation2. Problem definition3. Model4. Streaming Algorithm5. Experiments6. Conclusions
CIKM2019 Sakurai Lab. K.Kawabata et al. 17
Proposed modelGoal: find compact description C in a streaming setting
CIKM2019 Sakurai Lab. K.Kawabata et al. 18
Challenges:Q1. How can we represent regimes?
Idea (1): Hierarchical probabilistic modelQ2. How can we decide # of segments/regimes?
Idea (2): Model description cost
Idea (1): hierarchical probabilistic model
Q. How to describe patterns?
CIKM2019 Sakurai Lab. K.Kawabata et al. 19
𝜃* 𝑎""
𝑎#"𝑎"#𝑎"$ 𝑎$"
state1
state3state2
Idea: HMM-based probabilistic model• ‘within-regime’ transitions: A hidden Markov model 𝜃 = 𝜋, 𝐴, 𝐵• ‘across-regime’ transitions: Regime transition matrix Φ = 𝜙!" !,"$%
&
Model
Data stream
stand
run
walk
tRegimes
Idea (1): hierarchical probabilistic model
Full model Θ = 𝜃!, … , 𝜃+ , Φ
CIKM2019 Sakurai Lab. K.Kawabata et al. 20
𝜃* 𝑎""
𝑎#"𝑎"#
𝑎##
𝑎#$𝑎$#𝑎$$
𝑎"$ 𝑎$"state1
state3state2
𝜃* = 𝜋* , 𝐴* , 𝐵*Single HMM parameters:
Θ = 𝜃!, … , 𝜃$ , Φ
stand
run
walk
Regimes
Idea (1): hierarchical probabilistic model
Full model Θ = 𝜃!, … , 𝜃+ , Φ
CIKM2019 Sakurai Lab. K.Kawabata et al. 21
𝜃* 𝑎""
𝑎#"𝑎"#
𝑎##
𝑎#$𝑎$#𝑎$$
𝑎"$ 𝑎$"state1
state3state2
𝜃* = 𝜋* , 𝐴* , 𝐵*Single HMM parameters:
Φ = 𝜙*, *,,.!+
Regime transition matrix:
Θ = 𝜃!, … , 𝜃$ , Φ
stand
run
walk
Regimes
𝜙%% 𝜙''
𝜙%(𝜙(% 𝜙'( 𝜙('
𝜙((
𝜙%'𝜙'%
Idea (2): Incremental encoding schemeQ. How to decide # of segments/regimes?
CIKM2019 Sakurai Lab. K.Kawabata et al. 22
Idea: Minimum description length (MDL)• Minimize the total description cost of a data stream• Update ‘optimal’ # of segments/regimes
Idea (2): Incremental encoding schemeIdea: Minimize total encoding cost
CIKM2019 Sakurai Lab. K.Kawabata et al. 23
Goodcompression
Gooddescription
CostM(C) + CostC(X|C)Model cost Coding cost
min ( )
1 2 3 4 5 6 7 8 9 10
CostM
CostC
CostT
(# of r, m)
Q. How many new components does 𝒞 need?
Idea (2): Incremental encoding scheme
CIKM2019 Sakurai Lab. K.Kawabata et al. 24
A regimeA segment A stateKeep compact!
𝒞
𝑋!
regime?How many?
segment?
Q. How many new components does 𝒞 need?
Idea (2): Incremental encoding scheme
CIKM2019 Sakurai Lab. K.Kawabata et al. 25
Keep compact!
𝒞
𝑋!
regime?How many?
segment?
A regimeA segment A stateDetails in paper
Outline1. Motivation2. Problem definition3. Model4. Streaming Algorithm5. Experiments6. Conclusions
CIKM2019 Sakurai Lab. K.Kawabata et al. 26
Streaming algorithms• Algorithms
CIKM2019 Sakurai Lab. K.Kawabata et al. 27
StreamScopeOptimize/update parameter set 𝒞
1. SegmentAssignmentIdentify regime transitions & segments
2. RegimeGeneration
Estimate new regimes 𝜃
Main
StreamScope• Overview
CIKM2019 Sakurai Lab. K.Kawabata et al. 28
𝑋:
1. Keep current window:• The latest segment, 𝑠%• New observations, 𝑥&, …
𝑋) = 𝑠* ∪ 𝑥+
Data stream 𝑋
𝑡 →
𝒞
StreamScope• Overview
CIKM2019 Sakurai Lab. K.Kawabata et al. 29
𝑋:
1. Keep current window:• The latest segment, 𝑠%• New observations, 𝑥&, …
𝑋) = 𝑠* ∪ 𝑥+
Data stream 𝑋
𝑡 →
𝒞
2. Update model set 𝓒• Minimize Δ𝐶𝑜𝑠𝑡'(𝑋(|𝒞)
Increase segments?(SegmentAssignment)
vs.Increase states/regimes?(RegimeGeneration)
StreamScope• Overview
CIKM2019 Sakurai Lab. K.Kawabata et al. 30
Data stream 𝑋
𝑡 →
𝒞
𝑋:
1. Keep current window:• The latest segment, 𝑠%• New observations, 𝑥&, …
𝑋) = 𝑠* ∪ 𝑥+
3. Update 𝑿𝒄• If pattern has changed
2. Update model set 𝓒• Minimize Δ𝐶𝑜𝑠𝑡'(𝑋(|𝒞)
Increase segments?(SegmentAssignment)
vs.Increase states/regimes?(RegimeGeneration)
1. SegmentAssignmentGiven:• Observation 𝒙𝒕• Model parameter set Θ = {𝜃%, … , 𝜃&, Φ}
Find:• Optimal cut point between regimes: 𝑚, 𝒮, ℱ
CIKM2019 Sakurai Lab. K.Kawabata et al. 31
1. SegmentAssignmentOverview
CIKM2019 Sakurai Lab. K.Kawabata et al. 32
𝜃!
𝜃"
𝜃#
1 2 3 4
𝑡 →𝑥" 𝑥# 𝑥$ 𝑥* 𝑥+ 𝑥, 𝑥-
Dynamic programing algorithm to compute
𝑃(𝑥"|Θ) 𝜙"$
1. SegmentAssignmentOverview
CIKM2019 Sakurai Lab. K.Kawabata et al. 33
𝜃!
𝜃"
𝜃#
1 2 3 4
𝜙"#Dynamic programing algorithm to compute
𝑃(𝑥"|Θ)Keep all candidate
cut pointsℒ = 𝐿!, 𝐿%, …
𝐿# = 2, 3
𝐿# = 4, 2
𝑥" 𝑥# 𝑥$ 𝑥* 𝑥+ 𝑥, 𝑥-
𝑡 →
𝑠𝑤𝑖𝑡𝑐ℎ? ?
𝑠𝑤𝑖𝑡𝑐ℎ? ?𝜙"$
1. SegmentAssignmentOverview
CIKM2019 Sakurai Lab. K.Kawabata et al. 34
𝜃!
𝜃"
𝜃#𝑡 →
Dynamic programing algorithm to compute
𝑃(𝑥"|Θ)
𝛾 − guarantee:𝛾 ∝ 𝑚𝑒𝑎𝑛( 𝑠 )
𝐿# = 2, 3
𝛾
𝑥" 𝑥# 𝑥$ 𝑥* 𝑥+ 𝑥, 𝑥-
Keep all candidate cut points
ℒ = 𝐿!, 𝐿%, …
𝜙"$
2. RegimeGenerationGiven:• Current window 𝑋)
Find:• New regimes: parameter set 𝑚, 𝑟, 𝒮, Θ, ℱ for 𝑋)
CIKM2019 Sakurai Lab. K.Kawabata et al. 35
2. RegimeGeneration1. Two phase iterative approach
• Phase1: split segments into 2 groups
• Phase2: update 2 model parameters
CIKM2019 Sakurai Lab. K.Kawabata et al. 36
Phase 1
Phase 2
S1 =S2 =
𝜃!, 𝜃%, Φ
𝜃!𝜃"
𝑋/
2. RegimeGeneration1. Two phase iterative approach
• Phase1: split segments into 2 groups
• Phase2: update 2 model parameters
2. Recursively split new regimes
• While total cost can be reduced
CIKM2019 Sakurai Lab. K.Kawabata et al. 37
𝜃!𝜃"
𝑋/
𝜃!𝜃"𝜃#
⋯
Outline1. Motivation2. Problem definition3. Model4. Streaming Algorithm5. Experiments6. Conclusions
CIKM2019 Sakurai Lab. K.Kawabata et al. 38
Experiments•We answer the following questions:
CIKM2019 Sakurai Lab. K.Kawabata et al. 39
Q1. Effectiveness:How successful is it in discovering patterns?
Q2. Accuracy:How well does it find cut-points & regimes?
Q3. Scalability:How does it scale in terms of time & memory consumption?
Q1. Effectiveness - #Mocap• MoCap sensor stream
CIKM2019 Sakurai Lab. K.Kawabata et al. 40
1000 2000 3000 4000 5000Time
0
0.5
1
Value
L-armR-armL-legR-leg
?
Q1. Effectiveness - #Mocap• MoCap sensor stream
CIKM2019 Sakurai Lab. K.Kawabata et al. 41
1000 2000 3000 4000 5000Time
0
0.5
1
Value
L-armR-armL-legR-leg
12
34
#1 Going straight
#2 Stretching arms
#4 Stretching left arm
#3 Stretching right arm
StreamScope can find intuitive patterns automatically
Q1. Effectiveness - #Bicycle• Bicycle dataset
CIKM2019 Sakurai Lab. K.Kawabata et al. 42
≈Acceleration– (X,Y,Z)
Q1. Effectiveness - #Bicycle• Bicycle dataset
CIKM2019 Sakurai Lab. K.Kawabata et al. 43
#1 #2
#3 #4
#5
Q2. Accuracy• Segmentation accuracy (higher is better)
CIKM2019 Sakurai Lab. K.Kawabata et al. 44
#Mocap #Bicycle #Workout0
0.5
1
Mac
ro-F
1 sc
ore
StreamScope AutoPlait TICC-2 TICC-4 TICC-8 pHMM
Good accuracy compared with other methods
Q2. Accuracy• Clustering accuracy (higher is better)
CIKM2019 Sakurai Lab. K.Kawabata et al. 45
#Mocap #Bicycle #Workout0
0.5
1
Accuracy
StreamScope AutoPlait TICC-2 TICC-4 TICC-8 pHMM
Good accuracy compared with other methods
Q3. Scalability•Wall clock time vs. stream length
CIKM2019 Sakurai Lab. K.Kawabata et al. 46The complexity is independent of the data length
1 1.5 2 2.5 3 3.5 4Time 104
10-4
10-2
100
102
104
Wal
l clo
ck ti
me
(s)
StreamScope AutoPlait TICC pHMM
100x
Q3. Scalability• Memory space vs. stream length
CIKM2019 Sakurai Lab. K.Kawabata et al. 47The complexity is independent of the data length
1 1.5 2 2.5 3 3.5 4Time 104
104
105
106
Mem
ory
spac
e (b
yte)
StreamScopeO(n)
100x
ConclusionsStreamScope has the following advantages:
Effective:Find optimal segments/regimes
Adaptive:Automatic and incremental
Scalable:It does not depend on data length
CIKM2019 Sakurai Lab. K.Kawabata et al. 48
Thank you !
CIKM2019 Sakurai Lab. K.Kawabata et al. 49
≈