COMP9318: Data Warehousing and Data Mining
— L6: Association Rule Mining —
- Problem definition and preliminaries
What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.; c.f. Google's spelling suggestion
Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}; shorthand: x1 x2 … xk
- Find all rules X → Y with minimum support and confidence
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction containing X also contains Y

Transaction-id | Items bought
10 | {A, B, C}
20 | {A, C}
30 | {A, D}
40 | {B, E, F}

Let min_support = 50%, min_conf = 70%:
- sup(AC) = 2, so {A, C} is a frequent itemset
- A ⇒ C (support 50%, confidence 66.7%) fails the confidence threshold
- C ⇒ A (support 50%, confidence 100%) is an association rule

(Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.)
Mining Association Rules: An Example

Min. support 50%, min. confidence 50%.

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A ⇒ C:
    support = support({A} ∪ {C}) = 50%
    confidence = support({A} ∪ {C}) / support({A}) = 66.7%

The major computational challenge is calculating the support of itemsets ⇐ the frequent itemset mining problem.
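To make the definitions concrete, here is a minimal Python sketch (ours, not the lecture's) that computes support and confidence over the example table:

    # Minimal sketch: support and confidence for A => C on the table above.
    transactions = [
        {"A", "B", "C"},   # Tid 10
        {"A", "C"},        # Tid 20
        {"A", "D"},        # Tid 30
        {"B", "E", "F"},   # Tid 40
    ]

    def support(itemset, db):
        """Fraction of transactions that contain every item of `itemset`."""
        return sum(itemset <= t for t in db) / len(db)

    def confidence(lhs, rhs, db):
        """support(lhs ∪ rhs) / support(lhs)."""
        return support(lhs | rhs, db) / support(lhs, db)

    print(support({"A", "C"}, transactions))       # 0.5      -> 50% support
    print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> 66.7% confidence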
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases

Association Rule Mining Algorithms
- Naïve algorithm (a sketch follows this list)
  - Enumerate all possible itemsets and check their support against min_sup
  - Generate all association rules and check their confidence against min_conf
- The Apriori property
- Apriori Algorithm
- FP-growth Algorithm
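A direct rendering of the naïve algorithm as a Python sketch (our naming): enumerate every candidate itemset over the item universe and test its support. The enumeration is exponential in the number of items, which is exactly what the Apriori property will let us avoid.

    from itertools import combinations

    def naive_frequent_itemsets(db, min_sup):
        """Naive algorithm: enumerate all 2^|I| - 1 candidate itemsets and
        check each one's support (an absolute count here) against min_sup."""
        items = sorted(set().union(*db))
        frequent = {}
        for k in range(1, len(items) + 1):
            for cand in combinations(items, k):
                count = sum(set(cand) <= t for t in db)
                if count >= min_sup:
                    frequent[frozenset(cand)] = count
        return frequent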
Candidate Generation & Verification

(Figure: the itemset lattice for {A, B, C, D, E}, from the empty set (null) at the top, through all 1-, 2-, 3-, and 4-itemsets, down to ABCDE: all candidate itemsets for {A, B, C, D, E}.)
Apriori Property
- A frequent itemset (historically called a "large" itemset) is an itemset whose support is ≥ min_sup.
- Apriori property (downward closure): any subset of a frequent itemset is also a frequent itemset.
- A.k.a. the anti-monotone property of support. Contrapositive: "any superset of an infrequent itemset is also infrequent."

(Figure: the itemset lattice over {A, B, C, D} illustrating the property.)
Illustrating the Apriori Principle

(Figure: the itemset lattice for {A, B, C, D, E}. Once an itemset is found to be infrequent, all of its supersets are pruned from the search space.)

Q: How can we design an algorithm that improves on the naïve algorithm?
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
- Algorithm [Agrawal & Srikant 1994]:
  1. Ck ← perform level-wise candidate generation (from singleton itemsets)
  2. Lk ← verify Ck against the database (keep the candidates meeting min_sup)
  3. Ck+1 ← generated from Lk
  4. Go to 2 if Ck+1 is not empty
The Apriori Algorithm
- Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
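The pseudo-code translates almost line for line into Python. The sketch below (our naming, not the lecture's) generates Ck+1 by self-joining Lk and applying the Apriori prune; the "Important Details" slides refine this step.

    from collections import defaultdict

    def apriori(db, min_sup):
        """Level-wise Apriori; db is a list of sets, min_sup an absolute count."""
        db = [frozenset(t) for t in db]
        counts = defaultdict(int)
        for t in db:                               # L1 = {frequent items}
            for item in t:
                counts[frozenset([item])] += 1
        Lk = {i: c for i, c in counts.items() if c >= min_sup}
        all_frequent = dict(Lk)
        k = 1
        while Lk:
            # Ck+1: self-join Lk, keep (k+1)-sets whose k-subsets are all frequent
            Ck1 = {a | b for a in Lk for b in Lk
                   if len(a | b) == k + 1
                   and all((a | b) - {x} in Lk for x in (a | b))}
            counts = defaultdict(int)
            for t in db:                           # one DB scan per level
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {c: n for c, n in counts.items() if n >= min_sup}
            all_frequent.update(Lk)
            k += 1
        return all_frequent                        # ∪k Lk with support counts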
The Apriori Algorithm: An Example (minsup = 50%, i.e., 2 transactions)

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
        → L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
        → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
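Running the apriori() sketch from the previous slide on this TDB reproduces the trace:

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    result = apriori(tdb, min_sup=2)   # 50% of 4 transactions
    # L1: {A}:2 {B}:3 {C}:3 {E}:3  ({D}:1 is pruned)
    # L2: {A,C}:2 {B,C}:2 {B,E}:3 {C,E}:2
    # L3: {B,C,E}:2
    assert result[frozenset("BCE")] == 2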
Important Details of Apriori
1. How to generate candidates?
   - Step 1: self-joining Lk (what's the join condition? why?)
   - Step 2: pruning
2. How to count supports of candidates?

Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
Generating Candidates in SQL
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1

- Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
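The same join-then-prune logic as a Python sketch, with itemsets kept as sorted tuples so that the SQL's ordered-prefix join condition carries over literally (the name apriori_gen follows the Agrawal & Srikant paper; the code itself is ours):

    def apriori_gen(Lk_1):
        """Generate Ck from Lk-1, given as a set of sorted (k-1)-tuples."""
        Ck = set()
        for p in Lk_1:
            for q in Lk_1:
                # join condition: equal on the first k-2 items, p's last < q's last
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # prune: every (k-1)-subset of c must be in Lk-1
                    if all(c[:i] + c[i+1:] in Lk_1 for i in range(len(c))):
                        Ck.add(c)
        return Ck

    # Slide example: L3 = {abc, abd, acd, ace, bcd}
    L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
    print(apriori_gen(L3))   # {('a','b','c','d')}; acde is pruned since ade not in L3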
Derive Rules from Frequent Itemsets
- Frequent itemsets ≠ association rules
- One more step is required to find association rules
- For each frequent itemset X:
  - For each proper nonempty subset A of X:
    - Let B = X − A
    - A → B is an association rule if confidence(A → B) ≥ min_conf, where
      support(A → B) = support(AB), and
      confidence(A → B) = support(AB) / support(A)
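A sketch of this step in Python (our naming): it needs only the dict of frequent itemsets and their support counts, e.g., as returned by the apriori() sketch earlier. By the Apriori property, every subset A of a frequent X is itself frequent, so its support is already in the dict.

    from itertools import combinations

    def derive_rules(frequent, min_conf):
        """frequent: dict mapping frozenset -> support count.
        Yields (A, B, conf) for every rule A -> B with conf >= min_conf."""
        for X, sup_X in frequent.items():
            if len(X) < 2:
                continue
            for r in range(1, len(X)):            # proper nonempty subsets A
                for A in map(frozenset, combinations(X, r)):
                    conf = sup_X / frequent[A]    # support(AB) / support(A)
                    if conf >= min_conf:
                        yield A, X - A, conf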
Example: Deriving Rules from Frequent Itemsets
- Suppose 234 is frequent, with supp = 50%
- Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with supp = 50%, 50%, 75%, 75%, 75%, 75% respectively
- These generate the following association rules:
  - 23 ⇒ 4, confidence = 100%
  - 24 ⇒ 3, confidence = 100%
  - 34 ⇒ 2, confidence = 67%   [= (N × 50%) / (N × 75%)]
  - 2 ⇒ 34, confidence = 67%
  - 3 ⇒ 24, confidence = 67%
  - 4 ⇒ 23, confidence = 67%
  - All rules have support = 50%

Q: Is there any optimization (e.g., pruning) for this step?
Deriving Rules
- To recap: in order to obtain A → B, we need support(AB) and support(A)
- This step is not as time-consuming as frequent itemset generation. Why?
- It is also easy to speed up using techniques such as parallel processing. How?
- Do we really need candidate generation for deriving association rules?
  → Frequent-Pattern Growth (FP-tree)
Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1i2…i100:
    - # of scans: 100
    - # of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1$
- Bottleneck: candidate generation and test
- Can we avoid candidate generation altogether? → FP-growth
No Pain, No Gain

(Table: which of Java, Lisp, Scheme, Python, Ruby each of Alice, Bob, Charlie, and Dora knows; minsup = 1. Java's support set is {Alice, Charlie}.)

- Apriori:
  - L1 = {J, L, S, P, R}
  - C2 = all C(5,2) combinations
  - Most of C2 do not contribute to the result
  - There is no way to tell in advance, because Apriori keeps only support counts, not the supporting transactions
- Ideas:
  - Keep the support set for each frequent itemset
  - DFS
- J → JL? J → ??? We only need to look at the support set for J.

(Lattice figure: ∅, then J with support set {Alice, Charlie}; below it JP and JR, each with support set {Charlie}; below them JPR.)
Notations and Invariants
- Conditional DB:
  - DB|p = {t ∈ DB | t contains itemset p}
  - DB = DB|∅ (i.e., conditioned on nothing)
  - Shorthand: DB|px = DB|(p∪x)
- SupportSet(p∪x, DB) = SupportSet(x, DB|p)
  - Analogy: {x | x mod 6 = 0 ∧ x ∈ [100]} = {x | x mod 3 = 0 ∧ x ∈ even([100])}
- An FP-tree|p is equivalent to a DB|p
  - One can be converted to the other
- Next, we illustrate the algorithm using conditional DBs
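The notation in code, as a minimal sketch (helper name is ours): DB|p simply restricts DB to the transactions containing p, and the invariant can be checked directly.

    def conditional_db(db, p):
        """DB|p = the transactions of db that contain itemset p."""
        return [t for t in db if p <= t]

    db = [{"f","c","a"}, {"f","b"}, {"c","b","p"}, {"f","c","a","m"}]
    p, x = {"f"}, {"c"}
    # SupportSet(p ∪ x, DB) = SupportSet(x, DB|p), at the transaction level:
    # conditioning on p and then on x equals conditioning on p ∪ x.
    assert conditional_db(conditional_db(db, p), x) == conditional_db(db, p | x)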
FP-tree Essential Idea /1
- Recursive algorithm again!
- FreqItemsets(DB|p):
  - X = FindFrequentItems(DB|p)   ← an easy task, as only items (not itemsets) are needed
  - output { xp | x ∈ X }
  - Foreach x in X:
    - DB*|px = GetConditionalDB+(DB*|p, x)
    - FreqItemsets(DB*|px)
- Why this is complete: every frequent itemset in DB|p belongs to one of the following categories: either it is some xp with x ∈ X, or it matches one of the patterns ★px1, ★px2, …, ★pxi, …, ★pxn, and the latter are obtained via recursion.
Example (minsup = 1): DB|J contains Alice's and Charlie's transactions (the ones with Java).
- FreqItemsets(DB|J):
  - {P, R} ← FindFrequentItems(DB|J)
  - Output {JP, JR}
  - Get DB*|JP; FreqItemsets(DB*|JP)
  - Get DB*|JR; FreqItemsets(DB*|JR)
  - // Guaranteed: there is no other frequent itemset in DB|J
FP-tree Essential Idea /2
- FreqItemsets(DB|p):
  - If boundary condition, then …
  - X = FindFrequentItems(DB|p)
  - [optional] DB*|p = PruneDB(DB|p, X)   // remove items not in X; potentially reduces the # of transactions (empty or duplicate ones); improves efficiency
  - output { xp | x ∈ X }
  - Foreach x in X:
    - DB*|px = GetConditionalDB+(DB*|p, x)   // also gets rid of items already processed before x → avoids duplicates
    - [optional] if DB*|px is degenerate, then powerset(DB*|px)
    - FreqItemsets(DB*|px)
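A compact Python sketch of this recursion (ours; conditional DBs are represented as plain lists of sets rather than FP-trees):

    from collections import Counter

    def freq_itemsets(db, min_sup, p=frozenset()):
        """FreqItemsets(DB|p): db is DB|p as a list of sets, p the condition.
        Yields (itemset, support) for every frequent itemset extending p."""
        if not db:                                  # boundary condition
            return
        # X = FindFrequentItems(DB|p): only item counts are needed
        counts = Counter(item for t in db for item in t)
        X = [x for x, c in counts.items() if c >= min_sup]
        for i, x in enumerate(X):
            yield p | {x}, counts[x]                # output { xp | x in X }
            # DB*|px: transactions containing x, restricted to items not yet
            # processed (X[i+1:]); this is what avoids duplicate patterns
            later = set(X[i + 1:])
            cond = [t & later for t in db if x in t]
            yield from freq_itemsets(cond, min_sup, p | {x})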
Lv 1 Recursion (minsup = 3)

DB:
  F C A D G I M P
  A B C F L M O
  B F H J O W
  B C K S P
  A F C E L P M N

DB* (pruned to the frequent items, in the order F, C, A, B, M, P):
  F C A M P
  F C A B M
  F B
  C B P
  F C A M P

X = {F, C, A, B, M, P}; Output: F, C, A, B, M, P

Conditional DBs for the recursion: DB*|P; DB*|M (sans P); DB*|B (sans M, P); DB*|A (sans B, M, P); DB*|C (sans A, B, M, P); DB*|F (sans C, A, B, M, P). For example, DB*|P = {F C A M; C B; F C A M} and DB*|A (sans B, M, P) = {F C; F C; F C}. (Grayed items in the slides are for illustration purposes only.)
Lv 2 Recursion on DB*|P (minsup = 3)

DB = DB*|P: {F C A M; C B; F C A M}
DB* (pruned): {C; C; C}, since only C is frequent here (C: 3; F: 2, A: 2, M: 2, B: 1)

X = {C}; Output: CP
DB*|C, which is actually FullDB*|CP, contains only empty transactions.

Context = Lv 3 recursion on DB*|CP: the DB has only empty sets, so X = {} → it immediately returns.
Lv 2 Recursion on DB*|A (sans …) (minsup = 3)

DB = DB*|A: {F C; F C; F C}
DB* (pruned): {F C; F C; F C}

X = {F, C}; Output: FA, CA
DB*|C, which is actually FullDB*|CA, = {F; F; F} → further recursion (output: FCA)
DB*|F = {∅; ∅; ∅} → boundary case
Different Example: Lv 2 Recursion on DB*|P (minsup = 2)

Full DB: {F C A M P; F C B P; F A P}
DB = DB*|P: {F C A; F C; F A}

X = {F, C, A}; Output: FP, CP, AP
DB*|A, which is actually FullDB*|AP, = {F C; F} → X = {F} → Output: FAP
DB*|C = {F; F} → X = {F} → Output: FCP
DB*|F = {∅; ∅; ∅} → boundary case
I will give you back the FP-tree
- An FP-tree of DB consists of:
  - A fixed order among the items in DB
  - A prefix, threaded tree of the sorted transactions in DB
  - A header table: (item, freq, ptr)
- When used in the algorithm, the input DB is always pruned (c.f. PruneDB()):
  - Remove infrequent items
  - Remove infrequent items from every transaction
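A minimal construction sketch (our class and function names; the header "ptr" is realized as a list of nodes per item rather than a thread of node-link pointers):

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}                  # item -> FPNode

    def build_fptree(db, min_sup):
        """Fix a frequency-descending item order, prune infrequent items,
        then insert each sorted transaction as a path from the root."""
        freq = Counter(i for t in db for i in t)
        order = {i: c for i, c in freq.items() if c >= min_sup}
        root, header = FPNode(None, None), {}   # header: item -> [FPNodes]
        for t in db:
            node = root
            # ties broken alphabetically, so the shape may differ from the slides'
            for item in sorted((i for i in t if i in order),
                               key=lambda i: (-order[i], i)):
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header.setdefault(item, []).append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

Following header[item] upward via .parent is how the mining step collects each item's conditional pattern base.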
FP-tree Example (minsup = 3)

TID | Items bought | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

(Figure: the FP-tree is built by inserting each ordered transaction as a path from the root {}.
Insert t1: {} → f:1 → c:1 → a:1 → m:1 → p:1.
Insert t2: the shared prefix f c a is reused, giving f:2 → c:2 → a:2, with two branches m:1 → p:1 and b:1 → m:1.
Insert all ti: the final tree has the path f:4 → c:3 → a:3 → m:2 → p:2, a branch b:1 → m:1 under a:3, a branch b:1 under f:4, and a separate path c:1 → b:1 → p:1 under the root.)

Header table (item, freq, head): f:4, c:4, a:3, b:3, m:3, p:3

Output: f, c, a, b, m, p
Mining p (following the head pointer for p in the header table):
- p's conditional pattern base: f c a m : 2; c b : 1
- Item counts within it: c: 3; f: 2, a: 2, m: 2, b: 1 → only c is frequent
- Output: pc
- Cleaned p's conditional pattern base: c : 2; c : 1, i.e., the conditional FP-tree is the single node c:3 → STOP
Mining m (on the tree with p removed):
- m's conditional pattern base: f c a : 2; f c a b : 1
- Item counts: f: 3, c: 3, a: 3; b: 1 → frequent: f, c, a
- Output: mf, mc, ma
- m's conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so gen_powerset applies
- Output: mac, maf, mcf, macf
Mining b:
- b's conditional pattern base: f c a : 1; f : 1; c : 1
- Item counts: f: 2, c: 2, a: 1 → nothing reaches minsup = 3 → STOP
Mining a:
- a's conditional pattern base: f c : 3
- Output: af, ac
- a's conditional FP-tree is the single path {} → f:3 → c:3, so gen_powerset applies
- Output: acf
Mining c:
- c's conditional pattern base: f : 3
- Output: cf
- c's conditional FP-tree is the single node f:3 → STOP
Mining f:
- f's conditional pattern base is empty → STOP
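As a sanity check, a usage sketch: running the freq_itemsets sketch from "FP-tree Essential Idea /2" on this DB reproduces the walkthrough's results.

    db = [{"f","a","c","d","g","i","m","p"},
          {"a","b","c","f","l","m","o"},
          {"b","f","h","j","o","w"},
          {"b","c","k","s","p"},
          {"a","f","c","e","l","p","m","n"}]
    found = dict(freq_itemsets(db, min_sup=3))
    print(len(found))                   # 18: 6 singletons, 7 pairs, 4 triples, fcam
    assert found[frozenset("fcam")] == 3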
FP-Growth vs. Apriori: Scalability with the Support Threshold

(Figure: run time in seconds (0 to 100) as a function of the support threshold (3% down to 0%) on data set T25I20D10K, comparing the D1 FP-growth runtime with the D1 Apriori runtime. As the threshold drops, Apriori's run time grows sharply while FP-growth's grows slowly.)
Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - leads to a focused search of smaller databases
- Other factors:
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scan of the entire database
  - basic ops are counting local frequent items and building sub-FP-trees; no pattern search and matching