Pattern-Growth Methods for Sequential Pattern Mining
Iris Zhang
2003-5-14
Outline
• Sequential pattern mining
• Apriori-like methods
– GSP
• Pattern-growth methods
– FreeSpan
– PrefixSpan
• Performance analysis
• Conclusions
Motivation
• Sequential pattern mining: Finding time-related frequent patterns
• Most data and applications are time-related
– Customer shopping patterns, telephone calling patterns
– Natural disasters (e.g., earthquake, hurricane)
– Disease and treatment
– Stock market fluctuation
– Weblog click stream analysis
– DNA sequence analysis
Concepts
• Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items
• An itemset is a subset of items
• A sequence is an ordered list of itemsets. The itemsets are called elements. The number of items in the sequence is its length
– e.g. <(ef)(ab)(df)cb> has 5 elements and length 8
• A sequence $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ is called a subsequence of $\beta = \langle b_1 b_2 \ldots b_m \rangle$, denoted $\alpha \sqsubseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$
– e.g. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
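The subsequence test can be sketched in a few lines of Python. This is only an illustration under assumed conventions that are not in the slides: a sequence is a list of Python sets (one set per element), and the function name `is_subsequence` is hypothetical. It matches each element of the candidate against the leftmost possible element of the longer sequence.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Each element of alpha must be a subset of a distinct element of beta,
    with the order preserved. Greedy leftmost matching is sufficient.
    """
    j = 0
    for a in alpha:
        # Advance to the first remaining element of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


# The example from the slide: <a(bc)dc> versus <a(abc)(ac)d(cf)>.
beta = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
alpha = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
print(is_subsequence(alpha, beta))
```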
Concepts (cont'd)
• A sequence database is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple <sid, s> is said to contain a sequence $\alpha$ if $\alpha$ is a subsequence of s
• The support of $\alpha$ is the number of tuples in the database containing $\alpha$
• If the support of $\alpha$ is no less than a threshold min_sup, $\alpha$ is called a sequential pattern
– <(ab)c> is a sequential pattern given support threshold min_sup = 2
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
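As a rough check of the <(ab)c> example, support can be counted by testing every tuple of the table above. The sketch assumes sequences are stored as lists of Python sets keyed by SID; the names `DB`, `contains`, and `support` are illustrative and not part of the slides.

```python
# The example database from the slide, one list of sets per sequence.
DB = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}


def contains(s, alpha):
    """True if alpha is a subsequence of s (greedy leftmost matching)."""
    j = 0
    for a in alpha:
        while j < len(s) and not a <= s[j]:
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


def support(db, alpha):
    """Number of tuples <sid, s> in db whose sequence s contains alpha."""
    return sum(contains(s, alpha) for s in db.values())


min_sup = 2
ab_c = [{"a", "b"}, {"c"}]          # the pattern <(ab)c>
print(support(DB, ab_c) >= min_sup)  # sequences 10 and 30 contain it
```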
Problem definition
• Given a sequence database and min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database
Apriori-like methods
• Apriori property: if a sequence S is not frequent, then no super-sequence of S is frequent
– e.g. if <bh> is infrequent, so are <abh> and <b(dh)>
• GSP (Generalized Sequential Pattern) algorithm
– Level by level, do:
• Generate candidate sequences
• Use the Apriori property to prune candidates
• Scan the database to collect support counts
GSP Mining Process
• 1st scan: 8 candidates, 6 length-1 sequential patterns
– <a> <b> <c> <d> <e> <f> <g> <h>
• 2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in DB at all
– <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
• 3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in DB at all
– <abb> <aab> <aba> <baa> <bab> …
• 4th scan: 8 candidates, 6 length-4 sequential patterns
– <abba> <(bd)bc> …
• 5th scan: 1 candidate, 1 length-5 sequential pattern
– <(bd)cba>
[Figure legend: candidates that cannot pass the support threshold; candidates not in DB at all]
Bottlenecks of Apriori-Like Methods
• Potentially huge set of candidate sequences
– 1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates
• Multiple scans of database
• Difficulties at mining long sequential patterns
– Exponential number of short candidates
– A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences
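The two candidate counts on this slide can be reproduced with a little arithmetic. With n frequent items there are n × n ordered pairs <xy> (including <xx>) plus n(n − 1)/2 two-item elements <(xy)>; a length-100 pattern has one sub-pattern for every non-empty subset of its 100 items. The function name below is illustrative only.

```python
from math import comb


def length2_candidates(n):
    """Length-2 candidates from n frequent items: <xy> pairs plus <(xy)> elements."""
    return n * n + n * (n - 1) // 2


print(length2_candidates(1000))  # the 1,000-item case from the slide

# Sub-patterns of a single length-100 pattern: sum of C(100, i) for i = 1..100.
subpatterns = sum(comb(100, i) for i in range(1, 101))
print(subpatterns == 2**100 - 1)
print(f"{subpatterns:.2e}")
```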
Pattern-growth methods
• A divide-and-conquer approach
– Recursively project a sequence database into a set of smaller databases
– Mine each projected database to find the subset of patterns
• Algorithms
– FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
– PrefixSpan: Prefix-Projected Sequential Pattern Mining
FreeSpan
• Example: given a sequence database S and min_support = 2
• Step 1: find length-1 sequential patterns and list them in support-descending order
– f_list = a:4, b:4, c:4, d:3, e:3, f:3
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
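Step 1 can be sketched as one pass over the database that counts each item once per sequence. The representation (lists of sets), the name `f_list`, and the alphabetical tie-break among items with equal support are assumptions made for this illustration.

```python
from collections import Counter

DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]


def f_list(db, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter()
    for seq in db:
        # An item contributes at most 1 per sequence, however often it occurs.
        counts.update(set().union(*seq))
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    # Assumed tie-break: alphabetical order among items of equal support.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


print(f_list(DB, min_sup=2))
```

Item g occurs in only one sequence, so it is dropped at min_support = 2.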
FreeSpan (cont'd)
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 disjoint subsets:
– those containing only item a
– those containing item b but no items after b in f_list
– those containing item c but no items after c in f_list
– those containing item d but no items after d in f_list
– those containing item e but no items after e in f_list
– those containing item f
• Step 3: find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
FreeSpan (cont'd)
• Finding sequential patterns containing item b but no items after b in f_list
– <b>-projected database: <a(ab)a>, <aba>, <(ab)b>, <ab>
– Find all the length-2 sequential patterns containing item b but no items after b in f_list: <ab>:4, <ba>:2, <(ab)>:2
– Further partition and mine
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
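The <b>-projected database on this slide can be reproduced by keeping only the sequences that contain b and deleting every item that comes after b in f_list. This is a simplified sketch of that single step, not the full FreeSpan algorithm; the function name and the list-of-sets representation are assumptions.

```python
DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]
F_ORDER = ["a", "b", "c", "d", "e", "f"]  # item order of f_list from Step 1


def item_projection(db, f_order, item):
    """Sequences containing `item`, restricted to items up to `item` in f_order."""
    keep = set(f_order[: f_order.index(item) + 1])
    projected = []
    for seq in db:
        if any(item in element for element in seq):
            reduced = [element & keep for element in seq]
            # Elements that lost all their items disappear from the sequence.
            projected.append([element for element in reduced if element])
    return projected


for seq in item_projection(DB, F_ORDER, "b"):
    print(seq)
```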
From FreeSpan to PrefixSpan
• FreeSpan
– Projection-based: no candidate sequence needs to be generated
– But projection can be performed at any point in the sequence, so the projected sequences may not shrink much. For example, the f-projected database is the same size as the original sequence database
• PrefixSpan
– Projection-based
– But only prefix-based projection: fewer projections and quickly shrinking sequences
PrefixSpan-concepts
Suppose all items in an element are listed alphabetically. Given a sequence $\alpha = \langle e_1 e_2 \ldots e_n \rangle$ and a sequence $\beta = \langle e'_1 e'_2 \ldots e'_m \rangle$ ($m \le n$):
• Prefix: $\beta$ is a prefix of $\alpha$ iff (1) $e'_i = e_i$ for $i \le m-1$, (2) $e'_m \subseteq e_m$, and (3) all items in $(e_m - e'_m)$ are alphabetically after those in $e'_m$
– e.g. for $\alpha$ = <a(abc)(ac)d(cf)>, $\beta$ = <a(ab)> is a prefix of $\alpha$, but $\beta'$ = <a(bc)> is not
• Postfix: given $\alpha$ and its prefix $\beta = \langle e_1 e_2 \ldots e_{m-1} e'_m \rangle$, the sequence $\gamma = \langle e''_m e_{m+1} \ldots e_n \rangle$, where $e''_m = (e_m - e'_m)$, is called the postfix of $\alpha$ w.r.t. prefix $\beta$, denoted $\gamma = \alpha / \beta$
– e.g. $\gamma$ = <(_c)(ac)d(cf)> is the postfix of $\alpha$ w.r.t. prefix <a(ab)>
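The three prefix conditions and the postfix construction translate fairly directly into code. In this sketch a sequence is a list of non-empty Python sets, and the postfix is returned as a pair: the leftover items of the split element (the "(_c)" part) and the remaining elements. The function names and this return shape are illustrative choices, not notation from the slides.

```python
def is_prefix(beta, alpha):
    """Check conditions (1)-(3) of the prefix definition."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    if beta[: m - 1] != alpha[: m - 1]:              # (1) first m-1 elements equal
        return False
    last_b, last_a = beta[m - 1], alpha[m - 1]
    if not last_b <= last_a:                         # (2) last element is a subset
        return False
    leftover = last_a - last_b
    return all(x > max(last_b) for x in leftover)    # (3) leftovers come later


def postfix(alpha, beta):
    """Return (leftover of the split element, remaining elements), or None."""
    if not is_prefix(beta, alpha):
        return None
    m = len(beta)
    return alpha[m - 1] - beta[m - 1], alpha[m:]


alpha = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
print(is_prefix([{"a"}, {"a", "b"}], alpha))   # <a(ab)>
print(is_prefix([{"a"}, {"b", "c"}], alpha))   # <a(bc)>: a would be left over before b
print(postfix(alpha, [{"a"}, {"a", "b"}]))     # leftover {c}, then (ac) d (cf)
```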
PrefixSpan-concepts (cont'd)
• Projected database: let $\alpha$ be a sequential pattern in S. The $\alpha$-projected database, denoted $S|_\alpha$, is the collection of postfixes of sequences in S w.r.t. prefix $\alpha$
• Support count in projected database: let $\alpha$ be a sequential pattern in S, and $\beta$ be a sequence having prefix $\alpha$. The support count of $\beta$ in the $\alpha$-projected database is the number of sequences $\gamma$ in $S|_\alpha$ such that $\beta \sqsubseteq \alpha \cdot \gamma$
PrefixSpan-process
• Step 1: find length-1 sequential patterns
– <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
– those having prefix <a>
– those having prefix <b>
– …
– those having prefix <f>
• Step 3: find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
PrefixSpan-Process (cont'd)
• Finding sequential patterns with prefix <a>
– <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
– Find all the length-2 sequential patterns having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
– Further partition into 6 subsets
• Having prefix <aa>
• …
• Having prefix <af>
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
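The recursive process can be sketched compactly. The version below is an illustrative single-level PrefixSpan, not the authors' implementation: instead of copying postfixes it records, for each sequence, where the leftmost match of the current prefix ends. All names are assumptions. A pattern is a tuple of elements, and each element is a tuple of items in alphabetical order.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """Return {pattern: support} for all patterns with support >= min_sup."""
    seqs = [[tuple(sorted(e)) for e in s] for s in db]
    found = {}

    def mine(prefix, proj):
        # proj holds (sid, i, j): the leftmost match of `prefix` in seqs[sid]
        # ends at item j of element i, so the postfix starts right after it.
        last = prefix[-1] if prefix else ()
        s_ext, i_ext = defaultdict(dict), defaultdict(dict)
        for sid, i, j in proj:
            seq = seqs[sid]
            if prefix:
                for p in range(j + 1, len(seq[i])):      # the "(_x)" items
                    i_ext[seq[i][p]].setdefault(sid, (i, p))
            for k in range(i + 1, len(seq)):
                # A later element holding the whole last itemset can also
                # extend that itemset (e.g. (abc) extends <a> to <(ab)>).
                whole = prefix and set(last) <= set(seq[k])
                for p, x in enumerate(seq[k]):
                    s_ext[x].setdefault(sid, (k, p))
                    if whole and x > last[-1]:
                        i_ext[x].setdefault(sid, (k, p))
        for x in sorted(s_ext):                          # <prefix x>
            grow(prefix + ((x,),), s_ext[x])
        for x in sorted(i_ext):                          # <prefix merged with x>
            grow(prefix[:-1] + (last + (x,),), i_ext[x])

    def grow(pattern, hits):
        if len(hits) >= min_sup:
            found[pattern] = len(hits)
            mine(pattern, [(sid, i, j) for sid, (i, j) in hits.items()])

    mine((), [(sid, -1, -1) for sid in range(len(seqs))])
    return found


DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
patterns = prefixspan(DB, min_sup=2)
# Length-2 patterns with prefix <a>, to compare with the slide.
for pat, sup in patterns.items():
    if sum(map(len, pat)) == 2 and pat[0][0] == "a":
        print(pat, sup)
```

Storing match positions rather than copied postfixes keeps the sketch short; it is the same idea as the pseudo-projection discussed later in the deck.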
Completeness of PrefixSpan
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
• Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
– prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
• Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
• prefix <aa>: <aa>-projected database …
• …
• prefix <af>: <af>-projected database …
– prefix <b>: <b>-projected database …
– prefix <c>, …, <f>: …
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases
– Can be improved by bi-level projections and pseudo-projections
Optimization Techniques in PrefixSpan
• Single-level vs. bi-level projection
– Bi-level projection with 3-way checking may reduce the number and size of projected databases
• Physical projection vs. pseudo-projection
– Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
S-matrix for sequence database
• Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• All length-2 sequential patterns are found in the S-matrix
S-matrix
a 2
b (4, 2, 2) 1
c (4, 2, 1) (3, 3, 2) 3
d (2, 1, 1) (2, 2, 0) (1, 3, 0) 0
e (1, 2, 1) (1, 2, 0) (1, 2, 0) (1, 1, 0) 0
f (2, 1, 1) (2, 2, 0) (1, 2, 1) (1, 1, 1) (2, 0, 1) 1
a b c d e f
• Diagonal entry 2 for a: <aa> happens twice
• Entry (4, 2, 1) in row c, column a: <ac> happens 4 times, <ca> happens twice, <(ac)> happens once
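The meaning of the S-matrix cells can be checked by brute force. The sketch below fills each cell by counting supports directly, one pattern at a time, which is simpler but slower than filling the whole matrix in a single scan. The names `support` and `s_matrix` and the list-of-sets representation are assumptions.

```python
DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]


def support(db, alpha):
    """Number of sequences in db that contain alpha as a subsequence."""
    def contains(s):
        j = 0
        for a in alpha:
            while j < len(s) and not a <= s[j]:
                j += 1
            if j == len(s):
                return False
            j += 1
        return True
    return sum(contains(s) for s in db)


def s_matrix(db, items):
    """Lower-triangular matrix keyed by (row item, column item)."""
    m = {}
    for idx, x in enumerate(items):
        m[(x, x)] = support(db, [{x}, {x}])           # diagonal: <xx>
        for y in items[:idx]:
            m[(x, y)] = (
                support(db, [{y}, {x}]),              # <yx>
                support(db, [{x}, {y}]),              # <xy>
                support(db, [{x, y}]),                # <(xy)>
            )
    return m


M = s_matrix(DB, ["a", "b", "c", "d", "e", "f"])
print(M[("a", "a")], M[("c", "a")])
```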
S-matrix for <ab>-projected database
• <ab>-projected database:
– <(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
• Frequent items: <a>, <c>, <(_c)>
• S-matrix:
a 0
c (1, 0, 1) 1
(_c) (∅, 2, ∅) (∅, 1, ∅)
a c (_c)
• ∅: no a(_c) can occur, so no count is kept
• The entry (∅, 2, ∅) leads to pattern <a(bc)a>
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
Scaling-up by Bi-level Projection
• Partition the search space based on length-2 sequential patterns
• Only form projected databases and pursue recursive mining over bi-level projected databases
Benefits of Bi-level Projection
• More patterns are found in each pass
• Far fewer projections
– In the example, there are 53 patterns
– 53 level-by-level projections
– 22 bi-level projections
3-way Apriori Checking
• Using the Apriori heuristic to prune items in projected databases
a 2
b (4, 2, 2) 1
c (4, 2, 1) (3, 3, 2) 3
d (2, 1, 1) (2, 2, 0) (1, 3, 0) 0
e (1, 2, 1) (1, 2, 0) (1, 2, 0) (1, 1, 0) 0
f (2, 1, 1) (2, 2, 0) (1, 2, 1) (1, 1, 1) (2, 0, 1) 1
a b c d e f
• <acd> cannot be a pattern w.r.t. min_support = 2, because <cd> happens only once (entry (1, 3, 0) in row d, column c): exclude d from the <ac>-projected database
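The pruning decision for <acd> can be phrased as a lookup in the S-matrix. The sketch below is a deliberate simplification that only covers appending an item as a new element; the supports are copied from the matrix on this slide, and the names `SEQ_SUPPORT` and `may_append` are hypothetical.

```python
# Supports of length-2 patterns <xy>, read off the S-matrix on the slide.
SEQ_SUPPORT = {
    ("a", "b"): 4, ("c", "b"): 3,
    ("a", "c"): 4,
    ("a", "d"): 2, ("c", "d"): 1,
}


def may_append(prefix_items, z, sup, min_sup):
    """Apriori check: <... z> can only be frequent if every <x z> is frequent."""
    return all(sup.get((x, z), 0) >= min_sup for x in prefix_items)


# Should d be kept in the <ac>-projected database? <ad> is frequent, <cd> is not.
print(may_append(("a", "c"), "d", SEQ_SUPPORT, min_sup=2))
# b survives the same check because both <ab> and <cb> are frequent.
print(may_append(("a", "c"), "b", SEQ_SUPPORT, min_sup=2))
```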
Pseudo-projection
• Major cost of PrefixSpan: projection
– Postfixes of sequences often appear repeatedly in recursive projected databases
• When the projected database fits in memory, use pointers to form projections
– Pointer to the sequence
– Offset of the postfix
s = <a(abc)(ac)d(cf)>
s|<a>: (pointer to s, 2), i.e. <(abc)(ac)d(cf)>
s|<ab>: (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
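A pseudo-projection entry is just a reference to the stored sequence plus an offset. The sketch below assumes the sequence is kept as a flat list of (element index, item) pairs and that offsets are 1-based item positions, which matches the offsets 2 and 4 on the slide. The helper `postfix_view` is a hypothetical name and exists only to display what an entry stands for.

```python
from itertools import groupby, islice

s = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]
# Flat storage: one (element index, item) pair per item of s.
flat = [(k, x) for k, element in enumerate(s) for x in element]


def postfix_view(flat, offset):
    """Render the postfix that starts at the 1-based item position `offset`."""
    start = offset - 1
    # The postfix starts mid-element if the previous item shares its element.
    partial = 0 < start < len(flat) and flat[start - 1][0] == flat[start][0]
    parts = []
    groups = groupby(islice(flat, start, None), key=lambda pair: pair[0])
    for n, (_, grp) in enumerate(groups):
        items = "".join(x for _, x in grp)
        if n == 0 and partial:
            parts.append(f"(_{items})")
        elif len(items) > 1:
            parts.append(f"({items})")
        else:
            parts.append(items)
    return "<" + "".join(parts) + ">"


# Both projections share the same stored sequence; only the offset differs.
pseudo = {"<a>": (flat, 2), "<ab>": (flat, 4)}
for prefix, (seq_ref, offset) in pseudo.items():
    print(prefix, offset, postfix_view(seq_ref, offset))
```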
Pseudo-Projection vs. Physical Projection
• Pseudo-projection avoids physically copying postfixes
– Efficient when the database fits in main memory
– Not efficient when the database cannot fit in main memory
• Disk-based random access is very costly
• Suggested approach:
– Integration of physical and pseudo-projection
– Swapping to pseudo-projection when the data set fits in memory
Experiments
• Synthetic datasets were generated using the procedure described in R. Agrawal and R. Srikant, Mining sequential patterns, ICDE'95
– Number of items: 1,000
– Number of sequences in the data set: 10,000
– Average number of items within elements: 8
– Average number of elements in a sequence: 8
Experiments (cont'd)
• Comparing PrefixSpan with GSP and FreeSpan in large databases
– GSP (IBM Almaden, Srikant & Agrawal, EDBT'96)
– FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD'00)
– PrefixSpan-1 (single-level projection)
– PrefixSpan-2 (bi-level projection)
• Comparing effects of pseudo-projection
• Comparing I/O cost and scalability
PrefixSpan Is Faster Than GSP and FreeSpan
[Figure: runtime (seconds, 0 to 400) versus support threshold (%, 0.00 to 3.00) for PrefixSpan-1, PrefixSpan-2, FreeSpan, and GSP]
Effect of Pseudo-Projection When the Projected Database Fits in Memory
[Figure: runtime (seconds, 0 to 200) versus support threshold (%, 0.20 to 0.60) for PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (Pseudo), and PrefixSpan-2 (Pseudo)]
I/O Cost: When It Cannot Fit in Memory
[Figure: I/O cost (0 to 1.E+10) versus support threshold (%, 0.0 to 3.0) for PrefixSpan-1, PrefixSpan-1 (pseudo), PrefixSpan-2, and PrefixSpan-2 (pseudo)]
Scalability (When DB Is Large)
[Figure: runtime (thousand seconds, 0 to 30) versus number of sequences (thousand, 0 to 500) for PrefixSpan-1 and PrefixSpan-2, with min_sup = 0.2%]
Conclusions
• Both PrefixSpan and FreeSpan are pattern-growth methods, which perform better than Apriori-like methods for the sequential pattern mining problem
• PrefixSpan is more elegant than FreeSpan
– The Apriori heuristic is integrated into bi-level projection in PrefixSpan
– Pseudo-projection substantially enhances the performance of memory-based processing
References
• J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
• J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.
Q&A
Thanks