SPADE: Sequence mining algorithm
Monica Dăgădiţă, ISI
04/12/2023Data Mining 2
OUTLINE
Introduction to sequence mining
Why sequence mining?
Sequence mining algorithms
SPADE
  Motivation
  Definitions and examples
  Algorithm
  Implementation
INTRODUCTION TO SEQUENCE MINING
Aim: finding statistically relevant patterns between data examples where the values are delivered in a sequence
Originally introduced for market basket analysis (customer behaviour predictions)
2 types of sequence mining:
  string mining: biology (gene/protein sequences)
  itemset mining: marketing and CRM applications
WHY SEQUENCE MINING?
Discovering patterns:
  Bookstore: 70% of the people who buy Jane Austen’s “Pride and Prejudice” also buy “Emma” within a month
  Website: finding sequences of most frequently accessed pages
Usage:
  Promotions
  Shelf placement
  Restructuring the website
  Recommender systems
SEQUENCE MINING ALGORITHMS
Apriori
GSP (Generalized Sequential Pattern)
FreeSpan (Frequent pattern-projected Sequential pattern mining)
PrefixSpan (Prefix-projected Sequential pattern mining)
SPADE (Sequential PAttern Discovery using Equivalence classes)
MOTIVATION
Problems of existing solutions:
  Repeated database scans
  Complex internal data structures
Key features of SPADE:
  Fixed number of database scans
  Vertical id-list database format
  Decomposition of the search space into smaller pieces that are processed independently
DEFINITIONS AND EXAMPLES
Itemset: set of m distinct items
  I = {i1, i2, …, im}
Event: non-empty collection of items
  (i1, i2, …, ik)
Sequence: ordered list of events
  <e1 -> e2 -> … -> en>
k-sequence: sequence with k items
  (B->AC) is a 3-sequence
DEFINITIONS AND EXAMPLES (2)
Subsequence: given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>, α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn
Examples:
  1. (B->AC) is a subsequence of (AB->E->ACD)
  2. (AB->E) is not a subsequence of (ABE): AB and E would have to match two different events, in that order
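The subsequence definition can be checked mechanically. The sketch below is a minimal illustration, assuming each event is written as a string of single-letter items and events are separated by "->"; the helper names parse, num_items and is_subsequence are hypothetical and not part of the slides.

```python
def parse(text):
    """Turn 'AB->E->ACD' into a list of events, each a set of items."""
    return [set(event) for event in text.split("->")]

def num_items(seq):
    """k for a k-sequence: the total number of items over all events."""
    return sum(len(event) for event in seq)

def is_subsequence(alpha, beta):
    """True if every event of alpha is contained in a distinct event of beta,
    with the matched events of beta appearing in the same order."""
    j = 0
    for event in alpha:
        # Greedily advance to the first remaining event of beta that contains it.
        while j < len(beta) and not event <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha must match a strictly later event
    return True

print(num_items(parse("B->AC")))                            # 3
print(is_subsequence(parse("B->AC"), parse("AB->E->ACD")))  # True
print(is_subsequence(parse("AB->E"), parse("ABE")))         # False
```

Greedy matching is sufficient here: taking the earliest possible match for each event never rules out a later match.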
DEFINITIONS AND EXAMPLES (3)
DEFINITIONS AND EXAMPLES (4)
Id-lists of the most frequent items (1-sequences)
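The slide's own table is not reproduced in this transcript, so the sketch below uses an assumed toy database to show how a vertical id-list is built: for every item, the list of (sid, eid) pairs where it occurs. DATABASE, to_vertical, support and frequent_items are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

# Assumed horizontal database (illustrative only): sid -> {eid: items in that event}
DATABASE = {
    1: {10: "CD", 15: "ABC", 20: "ABF", 25: "ACDF"},
    2: {15: "ABF", 20: "E"},
    3: {10: "ABF"},
    4: {10: "DGH", 20: "BF", 25: "AGH"},
}

def to_vertical(database):
    """Build one id-list per item: a sorted list of (sid, eid) occurrences."""
    idlists = defaultdict(list)
    for sid in sorted(database):
        for eid in sorted(database[sid]):
            for item in database[sid][eid]:
                idlists[item].append((sid, eid))
    return dict(idlists)

def support(idlist):
    """Support counts distinct sequences (sids), not occurrences."""
    return len({sid for sid, _ in idlist})

def frequent_items(database, min_sup):
    """Frequent 1-sequences together with their id-lists."""
    return {item: ids for item, ids in to_vertical(database).items()
            if support(ids) >= min_sup}

F1 = frequent_items(DATABASE, min_sup=2)
for item in sorted(F1):
    print(item, support(F1[item]), F1[item])
```

With min_sup = 2, only A, B, D and F remain frequent in this assumed database.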
DEFINITIONS AND EXAMPLES (5)
D->BF->A
Step 1: D->B
Step 2: D->BF
DEFINITIONS AND EXAMPLES (6)
D->BF->A
Step 3: D->BF->A
Not space-efficient
Solution: 2 columns (sid, eid) for each sequence
  eid: event id of the sequence’s last item
DEFINITIONS AND EXAMPLES (6)
D->BF->A (space-efficient id-list joins)
D->B
SID EID
1 15
1 20
4 20
D->BF
SID EID
1 20
4 20
D->BF->A
SID EID
1 25
4 25
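The three tables above can be reproduced with two kinds of id-list join: a temporal join (the new item occurs in a later event) and an equality join (the new item occurs in the same event). The sketch below is a minimal illustration; the item id-lists for A, B, D and F are assumed values chosen to be consistent with the tables, and the function names are hypothetical.

```python
# Assumed item id-lists as (sid, eid) pairs; illustrative values only.
A = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]
B = [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)]
D = [(1, 10), (1, 25), (4, 10)]
F = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]

def temporal_join(first, second):
    """Entries of `second` that occur strictly after some entry of `first`
    within the same sequence (same sid)."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted((sid, eid) for sid, eid in set(second)
                  if sid in earliest and eid > earliest[sid])

def equality_join(first, second):
    """Entries present in both id-lists: same sequence and same event."""
    return sorted(set(first) & set(second))

def support(idlist):
    """Number of distinct sequences (sids) containing the pattern."""
    return len({sid for sid, _ in idlist})

d_b = temporal_join(D, B)        # Step 1: D->B
d_bf = equality_join(d_b, F)     # Step 2: D->BF (F in the same event as B)
d_bf_a = temporal_join(d_bf, A)  # Step 3: D->BF->A

print(d_b)     # [(1, 15), (1, 20), (4, 20)]
print(d_bf)    # [(1, 20), (4, 20)]
print(d_bf_a)  # [(1, 25), (4, 25)]
print(support(d_bf_a))  # 2
```

Only the (sid, eid) of the last event is kept after each join, which is what makes the two-column format sufficient.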
DEFINITIONS AND EXAMPLES (7)
Complete lattice representation
DEFINITIONS AND EXAMPLES (8)
Decomposing the lattice => smaller pieces that can be solved independently
Equivalence classes
  2 sequences are in the same class (Θk) if they share a common k-length prefix
Example
  k=1: Θ1 -> {[A], [B], [D], [F]}
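Grouping sequences by their k-length prefix can be sketched in a few lines. This is an illustration under assumptions: sequences are tuples of events, each event a string of sorted single-letter items, and the list of 2-sequences is an assumed example; parse, prefix and equivalence_classes are hypothetical names.

```python
from collections import defaultdict

def parse(text):
    """'D->BF' becomes ('D', 'BF'): a tuple of events."""
    return tuple(text.split("->"))

def prefix(seq, k):
    """First k items of seq, keeping the event boundaries."""
    out = []
    for event in seq:
        if k <= 0:
            break
        out.append(event[:k])
        k -= len(event)
    return tuple(out)

def equivalence_classes(sequences, k):
    """Group sequences that share the same k-length prefix."""
    classes = defaultdict(list)
    for seq in sequences:
        classes[prefix(seq, k)].append(seq)
    return dict(classes)

# Assumed frequent 2-sequences, for illustration only.
F2 = [parse(s) for s in ["AB", "AF", "B->A", "BF", "D->A", "D->B", "D->F", "F->A"]]

for pre, members in sorted(equivalence_classes(F2, 1).items()):
    print("[" + "->".join(pre) + "]", ["->".join(m) for m in members])
```

Each class can then be processed without looking at the others, since every join only involves two members of the same class.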
DEFINITIONS AND EXAMPLES (9)
DEFINITIONS AND EXAMPLES (10)
ALGORITHM
SPADE(min_sup, D)
  // min_sup: minimum support
  // D: initial dataset
  F1 <- {frequent items or 1-sequences}
  F2 <- {frequent 2-sequences}
  E <- {equivalence classes [X] of Θ1}
  for all [X] in E
    Enumerate_frequent_seq([X], min_sup)
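ALGORITHM (2)
Enumerate_frequent_seq(S, min_sup)
  for all Ai in S
    Ti <- {}
    for all Aj in S, with j ≥ i
      R <- Ai ∨ Aj (join)
      if R satisfies min_sup
        Ti <- Ti ∪ {R}
    end
    Enumerate_frequent_seq(Ti, min_sup) // DFS variant: recurse as soon as Ti is built
  end
  for all non-empty Ti
    Enumerate_frequent_seq(Ti, min_sup) // BFS variant: recurse after all Ti are built
  // only one of the two variants is used in a given run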
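The pseudocode can be turned into a small runnable sketch of the depth-first variant. It is a simplification under stated assumptions, not a reference implementation: the input is a dict of item id-lists (assumed example values), 2-sequences are obtained by joining item id-lists instead of the separate F2 step, and the inner loop visits ordered pairs and keeps only the joins whose result has Ai as prefix, which stands in for the j ≥ i loop. All names are hypothetical.

```python
def support(idlist):
    """Number of distinct sequences (sids) in an id-list."""
    return len({sid for sid, _ in idlist})

def temporal_join(first, second):
    """Entries of `second` occurring after some entry of `first` in the same sid."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return {(sid, eid) for sid, eid in second
            if sid in earliest and eid > earliest[sid]}

def enumerate_frequent_seq(atoms, min_sup, found):
    """atoms: members of one equivalence class as (pattern, idlist, is_seq_atom).
    A pattern is a tuple of events, each event a tuple of items; is_seq_atom
    tells whether the last item starts a new event."""
    for pat_i, ids_i, seq_i in atoms:
        found[pat_i] = support(ids_i)
        x_i = pat_i[-1][-1]
        t_i = []  # next-level class with prefix pat_i
        for pat_j, ids_j, seq_j in atoms:
            x_j = pat_j[-1][-1]
            candidates = []
            if seq_j:  # sequence extension: pat_i -> x_j
                candidates.append((pat_i + ((x_j,),),
                                   temporal_join(ids_i, ids_j), True))
            if seq_i == seq_j and x_j > x_i:  # itemset extension: x_j joins the last event
                candidates.append((pat_i[:-1] + (pat_i[-1] + (x_j,),),
                                   ids_i & ids_j, False))
            t_i.extend(c for c in candidates if support(c[1]) >= min_sup)
        enumerate_frequent_seq(t_i, min_sup, found)  # DFS

def spade(idlists, min_sup):
    """Return {pattern: support} for all frequent sequences."""
    atoms = [(((item,),), set(ids), True)
             for item, ids in sorted(idlists.items()) if support(ids) >= min_sup]
    found = {}
    enumerate_frequent_seq(atoms, min_sup, found)
    return found

def show(pattern):
    """Format (('D',), ('B', 'F')) as 'D->BF'."""
    return "->".join("".join(event) for event in pattern)

# Assumed item id-lists, illustrative values only.
IDLISTS = {
    "A": [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)],
    "B": [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)],
    "D": [(1, 10), (1, 25), (4, 10)],
    "F": [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)],
}

result = spade(IDLISTS, min_sup=2)
for pattern, sup in result.items():
    print(show(pattern), sup)
```

On these assumed id-lists the sketch finds D->BF->A with support 2. Each recursive call touches only the id-lists of one class, which is the decomposition the earlier slides describe.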
IMPLEMENTATION
The R Project for Statistical Computing
  R is a different implementation of the S language
  S was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues
arulesSequences package
QUESTIONS?