the study on mining temporal patterns and related applications in dynamic social...
TRANSCRIPT
Yi-Cheng Chen 陳以錚
1
Mining Temporal Pattern and Related Applications
Curriculum Vitae
Basic Information
Birthday – Aug. 31, 1978
Education
Dept. of CSE, YZU (B.S. 2000)
Dept. of CS, NTUST (M.S. 2002)
Dept. of CSIE, NCTU (Ph.D. 2012)
Advisors: Prof. Suh-Yin Lee (李素瑛), Prof. Wen-Chih Peng (彭文志)
Ph.D. Dissertation: A Study on Time Interval-based Sequential Patterns Mining
2
Outline
Current Research
Temporal Pattern Mining
Social Network Analysis
Smart Home Application
Cloud Computing
3
Lots of data is being collected
Web data, e-commerce
Purchases at department stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Data Mining? Commercial Viewpoint
4
Why Data Mining? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data
Data mining may help scientists in classifying and analyzing data and in hypothesis formation
5
Data Mining
We are buried in data, but looking for knowledge
Data mining
Knowledge discovery in databases
Extraction of interesting knowledge (rules, regularities, patterns) from data in large databases
6
7
Temporal Pattern Mining
8
Point-based sequential pattern mining
Customer analysis, network intrusion detection, finding tandem repeats in DNA sequences, …
Simple relations between time points
Three relations (before, equal, after)
[Figure: time point-based purchase sequences (diaper, milk, beer)]
With min_sup = 2, (ab)dc is a frequent sequential pattern
Sequential Pattern Mining
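The support test behind a statement such as "with min_sup = 2, (ab)dc is frequent" can be sketched as follows. The toy database is hypothetical (the slide's own database is not in the transcript), and the names `contains` and `support` are illustrative only.

```python
from typing import List, Set

Sequence = List[Set[str]]  # a sequence is an ordered list of itemsets


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """True if every itemset of `pattern` is a subset of a distinct,
    order-preserving itemset of `sequence` (greedy left-to-right match)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def support(database: List[Sequence], pattern: Sequence) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database)


# Hypothetical database, NOT the one shown on the slide.
database = [
    [{"a", "b"}, {"d"}, {"c"}],
    [{"a", "b", "e"}, {"d"}, {"f"}, {"c"}],
    [{"a"}, {"b"}, {"d"}, {"c"}],  # a and b occur at different time points
]
pattern = [{"a", "b"}, {"d"}, {"c"}]  # the pattern (ab)dc
min_sup = 2
is_frequent = support(database, pattern) >= min_sup
```

The third sequence does not count, because a and b must occur at the same time point ("equal") for the element (ab) to match.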
Interval Data Everywhere!!
Interval data
Data has duration time
Clinical data, library data, appliance usage data
Applications
Diagnosis system, recommendation system, smart home
9
[Figure: a database (DB) serving a diagnosis system, a recommendation system and a smart home]
10
[Figure: time interval-based events (chest pain, fever, cough)]
Interval-based sequential pattern mining
Library reader analysis, patient disease analysis, stock fluctuation, ...
Complex relations
Allen’s 13 temporal relations
With min_sup = 4, the pattern shown in the figure is a frequent temporal pattern
Temporal Pattern Mining
11
Allen’s 13 temporal relations describe the relationship between any two event intervals (a binary relation) [ACM 1983]
Allen Relationship
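As a small illustration of the binary relations, the function below classifies a pair of intervals into one of the 13 Allen relations. It is a sketch that assumes every interval has start < finish. The relation names follow common usage and the function name is hypothetical.

```python
def allen_relation(a_start, a_end, b_start, b_end) -> str:
    """Return the Allen relation of interval A with respect to interval B.
    Assumes a_start < a_end and b_start < b_end."""
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equal"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only the two partial-overlap cases remain.
    return "overlaps" if a_start < b_start else "overlapped-by"


# Intervals taken from the incision-strategy example: A(1,4), B(2,5), D(3,5), E(5,7)
print(allen_relation(1, 4, 2, 5))  # A vs B
print(allen_relation(2, 5, 3, 5))  # B vs D
print(allen_relation(3, 5, 5, 7))  # D vs E
```

Because each call describes only one pair, a pattern with k intervals needs a relation for every pair, which is the representation cost the motivation slide refers to.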
12
Real example
Some temporal patterns generated from NCTU library
13
Representation
Allen’s relations are binary relations
Expressing the relations among more than 3 intervals
Ambiguity problem
Space usage
Efficient algorithms
Mining temporal patterns *
Mining closed temporal patterns
Incrementally maintaining discovered temporal patterns and closed temporal patterns
Related applications
Social network
Smart home
Motivation
14
Proposed Method
Coincidence representation
Segment intervals into disjoint slices
Nonambiguous and compact representation
Endpoint representation
Global information of a sequence
Nonambiguous and compact representation
TPMiner (Temporal Pattern Miner)
Pattern-growth approach
Without candidate generation and test
Two components
RPrefixSpan
Pruning strategies
15
Segment intervals into disjoint slices
Four kinds of event slice
Start slice (+), intermediate slice (*), finish slice (-) and intact slice ( )
Coincidence
Slices occurring simultaneously
Space usage (for a k-pattern)
Best: k, Worst: 2k space
[Figure: event intervals A, B, C, D, E and their coincidences]
coincidence representation: (A+) (A- B+) (B-) (C+) (C* D) (C-) (E)
Coincidence representation
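One literal reading of the slicing idea can be sketched as follows: every start or finish time is a cut point, and each event active between two consecutive cut points contributes a start (+), intermediate (*), finish (-) or intact slice. The interval times below are hypothetical, chosen only so that the arrangement matches the example (A overlaps B, C contains D, E stands alone). The function name is an assumption.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (symbol, start, finish), start < finish


def coincidence_representation(intervals: List[Interval]) -> List[str]:
    """Cut the timeline at every endpoint and label the slice of each
    event that is active in a segment."""
    cuts = sorted({t for _, s, f in intervals for t in (s, f)})
    result = []
    for left, right in zip(cuts, cuts[1:]):
        slices = []
        for sym, s, f in sorted(intervals):
            if s <= left and right <= f:      # event covers this segment
                if s == left and f == right:
                    mark = ""                  # intact slice
                elif s == left:
                    mark = "+"                 # start slice
                elif f == right:
                    mark = "-"                 # finish slice
                else:
                    mark = "*"                 # intermediate slice
                slices.append(sym + mark)
        if slices:                             # skip segments with no event
            result.append(" ".join(slices))
    return result


# Hypothetical times, not taken from the slides.
intervals = [("A", 1, 3), ("B", 2, 4), ("C", 5, 9), ("D", 6, 7), ("E", 10, 11)]
coincidences = coincidence_representation(intervals)
print(" ".join(f"({c})" for c in coincidences))
```

This literal version emits every intermediate slice. The incision strategy on the following slide appears to produce a more compact form, so the two should not be assumed identical.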
16
A data structure, endtime_list
Sort and merge
Trace endtime_list one by one
Event intervals: (A, 1, 4), (B, 2, 5), (C, 2, 8), (D, 3, 5), (E, 5, 7)
Incision strategy
endtime_list (before sort and merge)
type symbol time
s A 1
f A 4
s B 2
f B 5
s C 2
f C 8
s D 3
f D 5
s E 5
f E 7
endtime_list (after sort and merge)
type symbol time
s A 1
s BC 2
s D 3
f A 4
f BD 5
s E 5
f E 7
f C 8
coincidence representation (traced one by one):
(A+) (B+ C+) (A- D+) (B- D-) @ (E) (C-)
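The sort-and-merge step of endtime_list can be sketched as below, using the five intervals from the slide. The tuple layout and the tie-breaking rule (finish entries before start entries at the same time) are assumptions inferred from the merged list shown in the figure.

```python
from itertools import groupby


def build_endtime_list(intervals):
    """Return merged (type, symbols, time) entries, sorted by time.
    type is 's' for start and 'f' for finish."""
    entries = []
    for sym, start, finish in intervals:
        entries.append((start, "s", sym))
        entries.append((finish, "f", sym))
    # Tuples sort by time first; at equal times 'f' sorts before 's'.
    entries.sort()
    merged = []
    for (time, kind), group in groupby(entries, key=lambda e: (e[0], e[1])):
        symbols = "".join(sym for _, _, sym in group)
        merged.append((kind, symbols, time))
    return merged


intervals = [("A", 1, 4), ("B", 2, 5), ("C", 2, 8), ("D", 3, 5), ("E", 5, 7)]
endtime_list = build_endtime_list(intervals)
for kind, symbols, time in endtime_list:
    print(kind, symbols, time)
```

Merging entries with the same type and time is what lets the later trace emit one coincidence per entry instead of one per endpoint.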
17
Sequence of ordered time points
+: start time, -: finish time
Nonambiguous
Space usage (for a k-pattern)
2k space
Endpoint Representation
[Figure: time points of events A, B, C, D]
A+ (B+ C+) A- (B- C- D+) D-
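A minimal sketch of building the endpoint representation: each interval contributes a start endpoint and a finish endpoint, and endpoints that share a time are grouped in parentheses. The times below are hypothetical, picked so that the endpoint order matches the four-event example on the slide.

```python
from collections import defaultdict


def endpoint_representation(intervals):
    """Group start (+) and finish (-) endpoints by time, in time order."""
    by_time = defaultdict(list)
    for sym, start, finish in intervals:
        by_time[start].append(sym + "+")
        by_time[finish].append(sym + "-")
    return [sorted(by_time[t]) for t in sorted(by_time)]


def format_endpoints(groups):
    """Render single endpoints bare and simultaneous ones in parentheses."""
    return " ".join(g[0] if len(g) == 1 else "(" + " ".join(g) + ")"
                    for g in groups)


# Hypothetical times, not taken from the slides.
intervals = [("A", 1, 3), ("B", 2, 4), ("C", 2, 4), ("D", 4, 5)]
sequence = format_endpoints(endpoint_representation(intervals))
print(sequence)
```

Every event contributes exactly two endpoints, which is where the 2k space for a k-pattern comes from.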
18
Example Database
19
Every item is disjoint
The relations among slices are simple
Before, equal and after (like time-point data)
RPrefixSpan
Borrow the idea of PrefixSpan
Scan local database to find frequent slices
Append and extend the pattern
Project database
Pruning strategy
Reduce search space
Pre-pruning and post-pruning
TPMiner – RPrefixSpan (1/2)
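The scan, append and project loop borrowed from PrefixSpan can be sketched as follows. This is plain PrefixSpan on sequences of single items, not RPrefixSpan itself: it ignores simultaneous slices (coincidences) and both pruning strategies, and the toy database is hypothetical.

```python
from collections import Counter


def prefixspan(database, min_sup, prefix=()):
    """Yield (pattern, support) for every frequent sequence of single items."""
    counts = Counter()
    for seq in database:
        counts.update(set(seq))            # count an item once per sequence
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        pattern = prefix + (item,)         # append and extend the pattern
        yield pattern, sup
        # Project: keep what follows the first occurrence of the item.
        projected = [seq[seq.index(item) + 1:] for seq in database if item in seq]
        yield from prefixspan(projected, min_sup, pattern)


# Hypothetical toy database.
database = [list("abc"), list("ac"), list("bc")]
patterns = dict(prefixspan(database, min_sup=2))
print(patterns)
```

No candidate set is generated up front. Each recursion only looks at items that actually occur in the projected database, which is the pattern-growth property the slides rely on.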
20
[Figure: RPrefixSpan flow. Scan database D to find the frequent items e1, e2, ..., ei, ..., en; transform sequences and project the database into D|e1, D|e2, ..., D|ei, ..., D|en; recursively project the database (D|e1..., D|e2..., ...) and append & extend the pattern; collect all mined patterns as the frequent temporal patterns.]
TPMiner – RPrefixSpan (2/2)
21
Pruning Strategy – Pre-pruning
[Figure: scanning the projected database D|A+ gives the frequent local slices A-, B+, B-, C. The extensions A+ C, A+ B+ and A+ A- are projected (D|A+ C, D|A+ B+, D|A+ A-); A+ B- is a non-qualified pattern, so D|A+ B- is not built.]
Non-promising projections can be pre-pruned!
Utilize the concept of slice and coincidence
Start slices and finish slices occur in pairs
Only require projecting the frequent finish slices which have the corresponding start slices in their prefixes
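The pre-pruning rule can be sketched as a filter over the frequent local slices: a finish slice is worth projecting only if its start slice is already in the prefix. Slices are modelled here as plain strings ending in "+" or "-", which is an assumption made for illustration.

```python
def slices_to_project(prefix, frequent_local_slices):
    """Keep only slices whose projection can lead to a valid pattern."""
    open_events = set()
    for s in prefix:
        if s.endswith("+"):
            open_events.add(s[:-1])        # event started in the prefix
        elif s.endswith("-"):
            open_events.discard(s[:-1])    # event already finished
    kept = []
    for s in frequent_local_slices:
        if s.endswith("-") and s[:-1] not in open_events:
            continue                       # finish slice without its start
        kept.append(s)
    return kept


# The situation from the slide: prefix A+ with four frequent local slices.
kept = slices_to_project(["A+"], ["A-", "B+", "B-", "C"])
print(kept)
```

Here B- is dropped because B+ is not in the prefix, so the projected database for A+ B- is never built.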
22
Pruning Strategy – Post-pruning
A coincidence database D
S1: (B+)(D+)(E)(D-)(B-)
S2: (B+)(B- D+)(E)(D-)
S3: (B)(A)(D+)(E)(D-)
...
Projected database D|E
S1: (D-)(B-)
S2: (D-)
S3: (D-)
Insignificant sequences
Projected database can be post-pruned
Utilize the concept of slice and coincidence
Start slice always appears before finish slice
Only collect the significant postfixes
With respect to a prefix, all finish slices in the postfix have corresponding start slices in the prefix
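Post-pruning can be sketched as cleaning each postfix of the projected database: a finish slice is kept only if its start slice is in the prefix (or, as an assumption of this sketch, appeared earlier in the same postfix). Postfixes are modelled as lists of coincidences, each a list of slice strings.

```python
def prune_postfix(prefix_slices, postfix):
    """Drop finish slices that have no matching start slice."""
    started = {s[:-1] for s in prefix_slices if s.endswith("+")}
    pruned = []
    for coincidence in postfix:
        kept = [s for s in coincidence
                if not (s.endswith("-") and s[:-1] not in started)]
        # Start slices seen here may be closed by later coincidences.
        started.update(s[:-1] for s in coincidence if s.endswith("+"))
        if kept:
            pruned.append(kept)
    return pruned


# Postfixes of S1, S2, S3 with respect to prefix (E), as on the slide.
projected = [[["D-"], ["B-"]], [["D-"]], [["D-"]]]
after_e = [prune_postfix(["E"], p) for p in projected]
# With D+ also in the prefix, D- becomes significant (hypothetical prefix).
after_de = prune_postfix(["D+", "E"], [["D-"], ["B-"]])
print(after_e, after_de)
```

With prefix (E) every postfix becomes empty, matching the "insignificant sequences" label on the slide.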
23
Experimental Results (1/2)
[Figure: comparison of H-DFS, ARMADA, TPrefixSpan, IEMiner, TPMiner-CR and TPMiner-ER for minimum support (%) from 1 down to 0.5. (a) The performance of six algorithms: execution time (sec), dataset D200k – C40 – N10k. (b) The number of temporal patterns: number of generated patterns, dataset D200k – C40 – N10k. Memory usage (MB), dataset N10k – C20 – N10k.]
24
Experimental Results (2/2)
[Figure: execution time (sec) versus minimum support (%) from 1 down to 0.5 for TPMiner-CR with and without each pruning strategy. (a) The performance test of influence on pre-pruning strategies. (b) The performance test of influence on post-pruning strategies. (c) The performance test of influence on subset-pruning strategies. (d) The performance test of influence on all proposed pruning strategies (TPMiner-CR versus TPMiner-CR without any pruning strategy).]
25
Related Applications
26
Smart Home Application
(1) Sensor data log
(2) Pattern Mining
(3) Behavior Detection
(4) Abnormal Detection
(5) System Alarm & Remote Control
[Figure: smart home framework. Appliances in the home (light, air conditioner) are switched on/off by device ID through a D-Link controller; a cloud database holds the usage patterns P1, P2, P3, …; the current behavior is shown next to the usage pattern; a home server provides alarm and remote control.]
27
Dynamic Social Network (1/2)
Dynamic social network
A sequence of interaction graphs
Nodes and edges vary with time
A lossless transformation
Graph sequence → interval sequence
[Figure: a sequence of interaction graphs G1, G2, G3, G4, … over nodes A to E at times t1, t2, t3, …, transformed into an interval table with columns SID, event symbol, start time, finish time and event sequence.]
Reduce the complexity of graph
Avoid isomorphism testing
Dynamic Social Network Analysis
Pattern mining
Classification
Recommending system
Network sampling
Clustering
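The graph-sequence-to-interval-sequence transformation can be sketched as follows, under stated assumptions: each snapshot is a set of edge labels such as "AB", time is the 1-based snapshot index, and an edge present in consecutive snapshots becomes one interval. Isolated nodes are not tracked in this sketch, and the snapshots below are hypothetical.

```python
def graphs_to_intervals(graphs):
    """Turn a list of edge sets into (edge, start, finish) intervals."""
    intervals = []
    open_since = {}                          # edge -> snapshot where it appeared
    for t, edges in enumerate(graphs, start=1):
        for e in list(open_since):
            if e not in edges:               # edge disappeared at time t
                intervals.append((e, open_since.pop(e), t - 1))
        for e in edges:
            open_since.setdefault(e, t)
    last = len(graphs)
    intervals.extend((e, s, last) for e, s in open_since.items())
    return sorted(intervals)


def intervals_to_graphs(intervals, n_steps):
    """Inverse mapping, showing that no edge information is lost."""
    graphs = [set() for _ in range(n_steps)]
    for e, start, finish in intervals:
        for t in range(start, finish + 1):
            graphs[t - 1].add(e)
    return graphs


# Hypothetical snapshots G1..G4.
graphs = [{"AB", "CD"}, {"AB"}, {"AB", "CD"}, {"DE"}]
intervals = graphs_to_intervals(graphs)
print(intervals)
```

Once the network is an interval sequence, the temporal pattern miners from the earlier slides apply directly, without comparing graphs structurally.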
28
Dynamic Social Network (2/2)
29
Social Network Analysis
30
Social Network Analysis
A graph representation
Nodes and edges
31
Influence Maximization
32
Advertisement Budget
According to [source missing], advertisement spending on worldwide social networking sites:
2008: $23.3 million
2010: $23.6 billion
2011: almost $25.5 billion
33
Word-of-mouth effect in social networks
Influence maximization problem
Select initial users (seeds) so that the number of users that adopt the product or innovation is maximized
Influence Maximization
[Figure: seed selection in a social network]
34
Motivation
Characteristic of social network
Community structure
Community and degree heuristic (CDH)
Utilize community information
Avoid influence overlapping
[Figure: example social networks with numbered nodes grouped into communities]
35
Proposed Algorithm – CDH
Framework of CDH
36
CDH – Adjust Step
Adjust selected fundamental nodes
Seeds selected from a large community may activate more inactive nodes than seeds from a small community
Replace the fundamental node in a small community if we can activate more inactive nodes
Finally, output the result as selected seed nodes
[Figure: communities C1, C2, C3, ..., Ck; the largest-degree node in Ck is deleted and replaced by the second-largest-degree node in C1]
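A rough sketch of the select-then-adjust idea follows. It is not the authors' CDH implementation. It assumes that the fundamental node of a community is its highest-degree node, and it uses "seeds plus their direct neighbours" as a stand-in for the number of activated nodes, since the transcript does not state the diffusion model. The graph and all names are hypothetical.

```python
def coverage(graph, seeds):
    """Proxy for influence: the seeds plus all of their neighbours."""
    covered = set(seeds)
    for s in seeds:
        covered |= graph[s]
    return len(covered)


def cdh_sketch(graph, communities):
    def by_degree(nodes):
        return sorted(nodes, key=lambda n: (-len(graph[n]), n))

    ranked = [by_degree(c) for c in communities]
    seeds = [r[0] for r in ranked]            # one fundamental node per community
    order = sorted(range(len(communities)), key=lambda i: len(communities[i]))
    largest = order[-1]
    for i in order[:-1]:                      # adjust step, smallest community first
        candidates = [n for n in ranked[largest] if n not in seeds]
        if not candidates:
            break
        trial = seeds.copy()
        trial[i] = candidates[0]              # next-best node of the largest community
        if coverage(graph, trial) > coverage(graph, seeds):
            seeds = trial
    return seeds


# Hypothetical network: a 7-node community with two hubs and a 2-node community.
graph = {
    1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 5}, 5: {4, 6, 7},
    6: {5}, 7: {5}, 8: {9}, 9: {8},
}
communities = [{1, 2, 3, 4, 5, 6, 7}, {8, 9}]
seeds = cdh_sketch(graph, communities)
print(seeds)
```

In this toy case the seed from the 2-node community is replaced by the second hub of the large community, because that covers more nodes under the proxy.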
37
Experimental Results - Facebook
38
Dynamic Recommendation
Recommendation System
Predict the ratings or preferences using a model built from the characteristics
39
(a) amazon.com (b) youtube.com
Collaborative Filtering (CF)
1. Calculate the similarity between the active user and the other users
• Pearson’s correlation, cosine similarity, conditional probability, etc.
2. Predict the rating of items that have not been rated by the active user
3. Output the top-k items by the predicted results
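The three CF steps can be sketched as below, using Pearson's correlation over co-rated items with each user's overall mean rating. The rating data and function names are hypothetical, and this is the textbook user-based scheme rather than the exact variant used in the talk.

```python
from math import sqrt


def mean(r):
    return sum(r.values()) / len(r)


def pearson(ra, rb):
    """Step 1: similarity of two users over their co-rated items."""
    common = set(ra) & set(rb)
    ma, mb = mean(ra), mean(rb)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = (sqrt(sum((ra[i] - ma) ** 2 for i in common))
           * sqrt(sum((rb[i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0


def predict(ratings, active, item):
    """Step 2: mean-centred, similarity-weighted prediction."""
    ra = ratings[active]
    num = den = 0.0
    for user, r in ratings.items():
        if user == active or item not in r:
            continue
        w = pearson(ra, r)
        num += w * (r[item] - mean(r))
        den += abs(w)
    return mean(ra) + (num / den if den else 0.0)


def top_k(ratings, active, k):
    """Step 3: rank the items the active user has not rated."""
    unrated = {i for r in ratings.values() for i in r} - set(ratings[active])
    scored = sorted(((predict(ratings, active, i), i) for i in unrated),
                    reverse=True)
    return [i for _, i in scored[:k]]


# Hypothetical ratings.
ratings = {
    "u1": {"x": 5, "y": 1},
    "u2": {"x": 4, "y": 2, "z": 5},
    "u3": {"x": 1, "y": 5, "z": 2},
}
p = predict(ratings, "u1", "z")
print(round(p, 2), top_k(ratings, "u1", 1))
```

User u2 agrees with u1 and liked z, while u3 disagrees with u1 and disliked z, so both push the prediction for z above u1's mean.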
40
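[Table: user-item ratings; rows are users, columns are items i1, i2, i3, i4 followed by the average of the user; unrated cells are not marked]
A: 4 1 4 | 3
B: 2 4 | 3
C: 3 3 2 | 2

\[ w_{a,b} = \frac{(1-3)(2-3)}{\text{normalize}} \]
\[ w_{a,c} = \frac{(1-3)(3-2) + (4-3)(3-2)}{\text{normalize}} \]
\[ p_{a,i} = 3 + \frac{(4-3)\, w_{a,b} + (2-2)\, w_{a,c}}{\text{normalize}} \]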
41
Motivation
Dynamic! Dynamic! Dynamic!
Why we need dynamic
All things vary with time
Dynamic Collaborative Filtering
Consider the time influence in the calculation.
Without considering the time, the results of prediction might be out of date.
42
Dynamic Similarity based on Collaborative Filtering (DSCF)
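[Figure: rating logs in the form ( user->item : rating (time) ). Earlier logs, e.g. 1 -> 1193 : 5 (2012.5.18), 5 -> 661 : 3 (2012.3.5), 3 -> 914 : 3 (2012.6.27), 1 -> 3408 : 4 (2012.3.18), …, form DB_{t_0}^{t-1} with similarity matrix Msim_{t_0}^{t-1}, weighted by (1-α). New logs, e.g. 9 -> 6610 : 5 (2012.7.8), 2 -> 6610 : 3 (2012.7.15), …, form DB_{t-1}^{t} with similarity matrix Msim_{t-1}^{t}, weighted by α.]
\[ Msim_{t_0}^{t} = (1-\alpha)\, Msim_{t_0}^{t-1} + \alpha\, Msim_{t-1}^{t} \]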
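One reading of the (1-α) and α weights in the figure is an exponential-decay update: the accumulated similarity matrix is weighted by (1-α) and the matrix computed from the newest ratings by α. The sketch below follows that reading. The matrices are hypothetical and the function name is an assumption.

```python
def update_similarity(msim_old, msim_recent, alpha):
    """Blend the accumulated similarity matrix with the most recent one.
    alpha is the similarity decay value (SDV), assumed to lie in [0, 1]."""
    return [
        [(1 - alpha) * old + alpha * recent
         for old, recent in zip(row_old, row_recent)]
        for row_old, row_recent in zip(msim_old, msim_recent)
    ]


# Hypothetical 2-user similarity matrices.
msim_old = [[1.0, 0.4], [0.4, 1.0]]      # built from ratings up to t-1
msim_recent = [[1.0, 0.8], [0.8, 1.0]]   # built from ratings in (t-1, t]
msim_now = update_similarity(msim_old, msim_recent, alpha=0.25)
print(msim_now)
```

Only the newest batch of ratings has to be processed at each step, since older information is already summarised in the accumulated matrix.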
43
Advanced DSCF
α (similarity decay value, SDV) might not be consistent for all time.
Each user might have his/her own SDV at different time points.
Use the actual values as feedback on the predicted values.
44
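\[ A_{a,i} = \bar{r}_a + \frac{\sum_{j=1}^{k} (r_{j,i} - \bar{r}_j)\,[(1-\alpha)\, si_{a,j} + \alpha\, msim_{a,j}]}{\sum_{j=1}^{k} [(1-\alpha)\, si_{a,j} + \alpha\, msim_{a,j}]} \]
45
[Figure: feedback loop for the active user. The value p_{a,i} is predicted and used to recommend; the actual value A_{a,i} is then fed back.]
\[ p_{a,i} = \bar{r}_a + \frac{\sum_{j=1}^{k} (r_{j,i} - \bar{r}_j)\,[(1-\alpha)\, si_{a,j} + \alpha\, msim_{a,j}]}{\sum_{j=1}^{k} [(1-\alpha)\, si_{a,j} + \alpha\, msim_{a,j}]} \]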
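A prediction with a blended neighbour weight can be sketched as follows, assuming each of the k neighbours contributes the weight (1-α)·si + α·msim. The placement of α, the meaning of si and msim as two similarity scores per neighbour, and all numbers below are assumptions made for illustration.

```python
def predict_rating(mean_a, neighbours, alpha):
    """mean_a: mean rating of the active user.
    neighbours: iterable of (r_ji, mean_j, si, msim) for the k neighbours."""
    num = den = 0.0
    for r_ji, mean_j, si, msim in neighbours:
        w = (1 - alpha) * si + alpha * msim   # blended similarity weight
        num += (r_ji - mean_j) * w
        den += w
    return mean_a + (num / den if den else 0.0)


# Hypothetical neighbour: rated the item 5 against a personal mean of 4.
p = predict_rating(3.0, [(5, 4, 0.5, 1.0)], alpha=0.5)
print(p)
```

Since α enters only through the weight, a per-user α can be tuned after the actual rating is observed, which is the feedback idea of the advanced DSCF slide.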
Experimental Results
46
47