COMP9318: Data Warehousing and Data Mining
— L6: Association Rule Mining —
- Problem definition and preliminaries
What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.; c.f. Google's spelling suggestion
Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}; shorthand: x1 x2 … xk
- Find all rules X → Y with minimum support and confidence
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction containing X also contains Y

Transaction-id | Items bought
10 | {A, B, C}
20 | {A, C}
30 | {A, D}
40 | {B, E, F}

Let min_support = 50%, min_conf = 70%:
- sup(AC) = 2, so {A, C} is a frequent itemset
- A ⇒ C (support 50%, confidence 66.7%) fails the confidence threshold
- C ⇒ A (support 50%, confidence 100%) is an association rule

(Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.)
Mining Association Rules: An Example

Min. support 50%, min. confidence 50%.

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A ⇒ C:
    support = support({A} ∪ {C}) = 50%
    confidence = support({A} ∪ {C}) / support({A}) = 66.7%

The major computational challenge is calculating the support of itemsets ⇐ the frequent itemset mining problem.
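To make the definitions concrete, here is a minimal Python sketch (ours, not the lecture's) that computes support and confidence over the example table:

    # Minimal sketch: support and confidence for A => C on the table above.
    transactions = [
        {"A", "B", "C"},   # Tid 10
        {"A", "C"},        # Tid 20
        {"A", "D"},        # Tid 30
        {"B", "E", "F"},   # Tid 40
    ]

    def support(itemset, db):
        """Fraction of transactions that contain every item of `itemset`."""
        return sum(itemset <= t for t in db) / len(db)

    def confidence(lhs, rhs, db):
        """support(lhs ∪ rhs) / support(lhs)."""
        return support(lhs | rhs, db) / support(lhs, db)

    print(support({"A", "C"}, transactions))       # 0.5      -> 50% support
    print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> 66.7% confidence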
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases

Association Rule Mining Algorithms
- Naïve algorithm (a sketch follows this list)
  - Enumerate all possible itemsets and check their support against min_sup
  - Generate all association rules and check their confidence against min_conf
- The Apriori property
- Apriori Algorithm
- FP-growth Algorithm
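A direct rendering of the naïve algorithm as a Python sketch (our naming): enumerate every candidate itemset over the item universe and test its support. The enumeration is exponential in the number of items, which is exactly what the Apriori property will let us avoid.

    from itertools import combinations

    def naive_frequent_itemsets(db, min_sup):
        """Naive algorithm: enumerate all 2^|I| - 1 candidate itemsets and
        check each one's support (an absolute count here) against min_sup."""
        items = sorted(set().union(*db))
        frequent = {}
        for k in range(1, len(items) + 1):
            for cand in combinations(items, k):
                count = sum(set(cand) <= t for t in db)
                if count >= min_sup:
                    frequent[frozenset(cand)] = count
        return frequent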
Candidate Generation & Verification

(Figure: the itemset lattice for {A, B, C, D, E}, from the empty set (null) at the top, through all 1-, 2-, 3-, and 4-itemsets, down to ABCDE: all candidate itemsets for {A, B, C, D, E}.)
Apriori Property
- A frequent itemset (historically called a "large" itemset) is an itemset whose support is ≥ min_sup.
- Apriori property (downward closure): any subset of a frequent itemset is also a frequent itemset.
- A.k.a. the anti-monotone property of support. Contrapositive: "any superset of an infrequent itemset is also infrequent."

(Figure: the itemset lattice over {A, B, C, D} illustrating the property.)
Illustrating the Apriori Principle

(Figure: the itemset lattice for {A, B, C, D, E}. Once an itemset is found to be infrequent, all of its supersets are pruned from the search space.)

Q: How can we design an algorithm that improves on the naïve algorithm?
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
- Algorithm [Agrawal & Srikant 1994]:
  1. Ck ← perform level-wise candidate generation (from singleton itemsets)
  2. Lk ← verify Ck against the database (keep the candidates meeting min_sup)
  3. Ck+1 ← generated from Lk
  4. Go to 2 if Ck+1 is not empty
The Apriori Algorithm
- Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
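The pseudo-code translates almost line for line into Python. The sketch below (our naming, not the lecture's) generates Ck+1 by self-joining Lk and applying the Apriori prune; the "Important Details" slides refine this step.

    from collections import defaultdict

    def apriori(db, min_sup):
        """Level-wise Apriori; db is a list of sets, min_sup an absolute count."""
        db = [frozenset(t) for t in db]
        counts = defaultdict(int)
        for t in db:                               # L1 = {frequent items}
            for item in t:
                counts[frozenset([item])] += 1
        Lk = {i: c for i, c in counts.items() if c >= min_sup}
        all_frequent = dict(Lk)
        k = 1
        while Lk:
            # Ck+1: self-join Lk, keep (k+1)-sets whose k-subsets are all frequent
            Ck1 = {a | b for a in Lk for b in Lk
                   if len(a | b) == k + 1
                   and all((a | b) - {x} in Lk for x in (a | b))}
            counts = defaultdict(int)
            for t in db:                           # one DB scan per level
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {c: n for c, n in counts.items() if n >= min_sup}
            all_frequent.update(Lk)
            k += 1
        return all_frequent                        # ∪k Lk with support counts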
The Apriori Algorithm: An Example (minsup = 50%, i.e., 2 transactions)

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
        → L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
        → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
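Running the apriori() sketch from the previous slide on this TDB reproduces the trace:

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    result = apriori(tdb, min_sup=2)   # 50% of 4 transactions
    # L1: {A}:2 {B}:3 {C}:3 {E}:3  ({D}:1 is pruned)
    # L2: {A,C}:2 {B,C}:2 {B,E}:3 {C,E}:2
    # L3: {B,C,E}:2
    assert result[frozenset("BCE")] == 2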
Important Details of Apriori
1. How to generate candidates?
   - Step 1: self-joining Lk (what's the join condition? why?)
   - Step 2: pruning
2. How to count supports of candidates?

Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
Generating Candidates in SQL
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1

- Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
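The same join-then-prune logic as a Python sketch, with itemsets kept as sorted tuples so that the SQL's ordered-prefix join condition carries over literally (the name apriori_gen follows the Agrawal & Srikant paper; the code itself is ours):

    def apriori_gen(Lk_1):
        """Generate Ck from Lk-1, given as a set of sorted (k-1)-tuples."""
        Ck = set()
        for p in Lk_1:
            for q in Lk_1:
                # join condition: equal on the first k-2 items, p's last < q's last
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # prune: every (k-1)-subset of c must be in Lk-1
                    if all(c[:i] + c[i+1:] in Lk_1 for i in range(len(c))):
                        Ck.add(c)
        return Ck

    # Slide example: L3 = {abc, abd, acd, ace, bcd}
    L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
    print(apriori_gen(L3))   # {('a','b','c','d')}; acde is pruned since ade not in L3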
Derive Rules from Frequent Itemsets
- Frequent itemsets ≠ association rules
- One more step is required to find association rules
- For each frequent itemset X:
  - For each proper nonempty subset A of X:
    - Let B = X − A
    - A → B is an association rule if confidence(A → B) ≥ min_conf, where
      support(A → B) = support(AB), and
      confidence(A → B) = support(AB) / support(A)
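A sketch of this step in Python (our naming): it needs only the dict of frequent itemsets and their support counts, e.g., as returned by the apriori() sketch earlier. By the Apriori property, every subset A of a frequent X is itself frequent, so its support is already in the dict.

    from itertools import combinations

    def derive_rules(frequent, min_conf):
        """frequent: dict mapping frozenset -> support count.
        Yields (A, B, conf) for every rule A -> B with conf >= min_conf."""
        for X, sup_X in frequent.items():
            if len(X) < 2:
                continue
            for r in range(1, len(X)):            # proper nonempty subsets A
                for A in map(frozenset, combinations(X, r)):
                    conf = sup_X / frequent[A]    # support(AB) / support(A)
                    if conf >= min_conf:
                        yield A, X - A, conf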
Example: Deriving Rules from Frequent Itemsets
- Suppose 234 is frequent, with supp = 50%
- Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with supp = 50%, 50%, 75%, 75%, 75%, 75% respectively
- These generate the following association rules:
  - 23 ⇒ 4, confidence = 100%
  - 24 ⇒ 3, confidence = 100%
  - 34 ⇒ 2, confidence = 67%   [= (N × 50%) / (N × 75%)]
  - 2 ⇒ 34, confidence = 67%
  - 3 ⇒ 24, confidence = 67%
  - 4 ⇒ 23, confidence = 67%
  - All rules have support = 50%

Q: Is there any optimization (e.g., pruning) for this step?
Deriving Rules
- To recap: in order to obtain A → B, we need support(AB) and support(A)
- This step is not as time-consuming as frequent itemset generation. Why?
- It is also easy to speed up using techniques such as parallel processing. How?
- Do we really need candidate generation for deriving association rules?
  → Frequent-Pattern Growth (FP-tree)
Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1i2…i100:
    - # of scans: 100
    - # of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1$
- Bottleneck: candidate generation and test
- Can we avoid candidate generation altogether? → FP-growth
No Pain, No Gain

(Table: which of Java, Lisp, Scheme, Python, Ruby each of Alice, Bob, Charlie, and Dora knows; minsup = 1. Java's support set is {Alice, Charlie}.)

- Apriori:
  - L1 = {J, L, S, P, R}
  - C2 = all C(5,2) combinations
  - Most of C2 do not contribute to the result
  - There is no way to tell in advance, because Apriori keeps only support counts, not the supporting transactions
- Ideas:
  - Keep the support set for each frequent itemset
  - DFS
- J → JL? J → ??? We only need to look at the support set for J.

(Lattice figure: ∅, then J with support set {Alice, Charlie}; below it JP and JR, each with support set {Charlie}; below them JPR.)
Notations and Invariants
- Conditional DB:
  - DB|p = {t ∈ DB | t contains itemset p}
  - DB = DB|∅ (i.e., conditioned on nothing)
  - Shorthand: DB|px = DB|(p∪x)
- SupportSet(p∪x, DB) = SupportSet(x, DB|p)
  - Analogy: {x | x mod 6 = 0 ∧ x ∈ [100]} = {x | x mod 3 = 0 ∧ x ∈ even([100])}
- An FP-tree|p is equivalent to a DB|p
  - One can be converted to the other
- Next, we illustrate the algorithm using conditional DBs
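The notation in code, as a minimal sketch (helper name is ours): DB|p simply restricts DB to the transactions containing p, and the invariant can be checked directly.

    def conditional_db(db, p):
        """DB|p = the transactions of db that contain itemset p."""
        return [t for t in db if p <= t]

    db = [{"f","c","a"}, {"f","b"}, {"c","b","p"}, {"f","c","a","m"}]
    p, x = {"f"}, {"c"}
    # SupportSet(p ∪ x, DB) = SupportSet(x, DB|p), at the transaction level:
    # conditioning on p and then on x equals conditioning on p ∪ x.
    assert conditional_db(conditional_db(db, p), x) == conditional_db(db, p | x)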
FP-tree Essential Idea /1
- Recursive algorithm again!
- FreqItemsets(DB|p):
  - X = FindFrequentItems(DB|p)   ← an easy task, as only items (not itemsets) are needed
  - output { xp | x ∈ X }
  - Foreach x in X:
    - DB*|px = GetConditionalDB+(DB*|p, x)
    - FreqItemsets(DB*|px)
- Why this is complete: every frequent itemset in DB|p belongs to one of the following categories: either it is some xp with x ∈ X, or it matches one of the patterns ★px1, ★px2, …, ★pxi, …, ★pxn, and the latter are obtained via recursion.
Example (minsup = 1): DB|J contains Alice's and Charlie's transactions (the ones with Java).
- FreqItemsets(DB|J):
  - {P, R} ← FindFrequentItems(DB|J)
  - Output {JP, JR}
  - Get DB*|JP; FreqItemsets(DB*|JP)
  - Get DB*|JR; FreqItemsets(DB*|JR)
  - // Guaranteed: there is no other frequent itemset in DB|J
FP-tree Essential Idea /2
- FreqItemsets(DB|p):
  - If boundary condition, then …
  - X = FindFrequentItems(DB|p)
  - [optional] DB*|p = PruneDB(DB|p, X)   // remove items not in X; potentially reduces the # of transactions (empty or duplicate ones); improves efficiency
  - output { xp | x ∈ X }
  - Foreach x in X:
    - DB*|px = GetConditionalDB+(DB*|p, x)   // also gets rid of items already processed before x → avoids duplicates
    - [optional] if DB*|px is degenerate, then powerset(DB*|px)
    - FreqItemsets(DB*|px)
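A compact Python sketch of this recursion (ours; conditional DBs are represented as plain lists of sets rather than FP-trees):

    from collections import Counter

    def freq_itemsets(db, min_sup, p=frozenset()):
        """FreqItemsets(DB|p): db is DB|p as a list of sets, p the condition.
        Yields (itemset, support) for every frequent itemset extending p."""
        if not db:                                  # boundary condition
            return
        # X = FindFrequentItems(DB|p): only item counts are needed
        counts = Counter(item for t in db for item in t)
        X = [x for x, c in counts.items() if c >= min_sup]
        for i, x in enumerate(X):
            yield p | {x}, counts[x]                # output { xp | x in X }
            # DB*|px: transactions containing x, restricted to items not yet
            # processed (X[i+1:]); this is what avoids duplicate patterns
            later = set(X[i + 1:])
            cond = [t & later for t in db if x in t]
            yield from freq_itemsets(cond, min_sup, p | {x})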
Lv 1 Recursion (minsup = 3)

DB:
  F C A D G I M P
  A B C F L M O
  B F H J O W
  B C K S P
  A F C E L P M N

DB* (pruned to the frequent items, in the order F, C, A, B, M, P):
  F C A M P
  F C A B M
  F B
  C B P
  F C A M P

X = {F, C, A, B, M, P}; Output: F, C, A, B, M, P

Conditional DBs for the recursion: DB*|P; DB*|M (sans P); DB*|B (sans M, P); DB*|A (sans B, M, P); DB*|C (sans A, B, M, P); DB*|F (sans C, A, B, M, P). For example, DB*|P = {F C A M; C B; F C A M} and DB*|A (sans B, M, P) = {F C; F C; F C}. (Grayed items in the slides are for illustration purposes only.)
Lv 2 Recursion on DB*|P (minsup = 3)

DB = DB*|P: {F C A M; C B; F C A M}
DB* (pruned): {C; C; C}, since only C is frequent here (C: 3; F: 2, A: 2, M: 2, B: 1)

X = {C}; Output: CP
DB*|C, which is actually FullDB*|CP, contains only empty transactions.

Context = Lv 3 recursion on DB*|CP: the DB has only empty sets, so X = {} → it immediately returns.
Lv 2 Recursion on DB*|A (sans …) (minsup = 3)

DB = DB*|A: {F C; F C; F C}
DB* (pruned): {F C; F C; F C}

X = {F, C}; Output: FA, CA
DB*|C, which is actually FullDB*|CA, = {F; F; F} → further recursion (output: FCA)
DB*|F = {∅; ∅; ∅} → boundary case
Different Example: Lv 2 Recursion on DB*|P (minsup = 2)

Full DB: {F C A M P; F C B P; F A P}
DB = DB*|P: {F C A; F C; F A}

X = {F, C, A}; Output: FP, CP, AP
DB*|A, which is actually FullDB*|AP, = {F C; F} → X = {F} → Output: FAP
DB*|C = {F; F} → X = {F} → Output: FCP
DB*|F = {∅; ∅; ∅} → boundary case
I will give you back the FP-tree
- An FP-tree of DB consists of:
  - A fixed order among the items in DB
  - A prefix, threaded tree of the sorted transactions in DB
  - A header table: (item, freq, ptr)
- When used in the algorithm, the input DB is always pruned (c.f. PruneDB()):
  - Remove infrequent items
  - Remove infrequent items from every transaction
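A minimal construction sketch (our class and function names; the header "ptr" is realized as a list of nodes per item rather than a thread of node-link pointers):

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}                  # item -> FPNode

    def build_fptree(db, min_sup):
        """Fix a frequency-descending item order, prune infrequent items,
        then insert each sorted transaction as a path from the root."""
        freq = Counter(i for t in db for i in t)
        order = {i: c for i, c in freq.items() if c >= min_sup}
        root, header = FPNode(None, None), {}   # header: item -> [FPNodes]
        for t in db:
            node = root
            # ties broken alphabetically, so the shape may differ from the slides'
            for item in sorted((i for i in t if i in order),
                               key=lambda i: (-order[i], i)):
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header.setdefault(item, []).append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

Following header[item] upward via .parent is how the mining step collects each item's conditional pattern base.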
FP-tree Example (minsup = 3)

TID | Items bought | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

(Figure: the FP-tree is built by inserting each ordered transaction as a path from the root {}.
Insert t1: {} → f:1 → c:1 → a:1 → m:1 → p:1.
Insert t2: the shared prefix f c a is reused, giving f:2 → c:2 → a:2, with two branches m:1 → p:1 and b:1 → m:1.
Insert all ti: the final tree has the path f:4 → c:3 → a:3 → m:2 → p:2, a branch b:1 → m:1 under a:3, a branch b:1 under f:4, and a separate path c:1 → b:1 → p:1 under the root.)

Header table (item, freq, head): f:4, c:4, a:3, b:3, m:3, p:3

Output: f, c, a, b, m, p
Mining p (following the head pointer for p in the header table):
- p's conditional pattern base: f c a m : 2; c b : 1
- Item counts within it: c: 3; f: 2, a: 2, m: 2, b: 1 → only c is frequent
- Output: pc
- Cleaned p's conditional pattern base: c : 2; c : 1, i.e., the conditional FP-tree is the single node c:3 → STOP
Mining m (on the tree with p removed):
- m's conditional pattern base: f c a : 2; f c a b : 1
- Item counts: f: 3, c: 3, a: 3; b: 1 → frequent: f, c, a
- Output: mf, mc, ma
- m's conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so gen_powerset applies
- Output: mac, maf, mcf, macf
Mining b:
- b's conditional pattern base: f c a : 1; f : 1; c : 1
- Item counts: f: 2, c: 2, a: 1 → nothing reaches minsup = 3 → STOP
Mining a:
- a's conditional pattern base: f c : 3
- Output: af, ac
- a's conditional FP-tree is the single path {} → f:3 → c:3, so gen_powerset applies
- Output: acf
Mining c:
- c's conditional pattern base: f : 3
- Output: cf
- c's conditional FP-tree is the single node f:3 → STOP
Mining f:
- f's conditional pattern base is empty → STOP
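As a sanity check, a usage sketch: running the freq_itemsets sketch from "FP-tree Essential Idea /2" on this DB reproduces the walkthrough's results.

    db = [{"f","a","c","d","g","i","m","p"},
          {"a","b","c","f","l","m","o"},
          {"b","f","h","j","o","w"},
          {"b","c","k","s","p"},
          {"a","f","c","e","l","p","m","n"}]
    found = dict(freq_itemsets(db, min_sup=3))
    print(len(found))                   # 18: 6 singletons, 7 pairs, 4 triples, fcam
    assert found[frozenset("fcam")] == 3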
FP-Growth vs. Apriori: Scalability with the Support Threshold

(Figure: run time in seconds (0 to 100) as a function of the support threshold (3% down to 0%) on data set T25I20D10K, comparing the D1 FP-growth runtime with the D1 Apriori runtime. As the threshold drops, Apriori's run time grows sharply while FP-growth's grows slowly.)
Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - leads to a focused search of smaller databases
- Other factors:
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scan of the entire database
  - basic ops are counting local frequent items and building sub-FP-trees; no pattern search and matching