
Page 1:

Lecture 5: Mining Association Rule

Introduction to Data Mining

Yunming Ye

Department of Computer Science

Shenzhen Graduate School

Harbin Institute of Technology

Page 2:

Agenda

1. Introduction to Association Rule Mining
2. Apriori Algorithm
3. FP-Tree Algorithm
4. Sequential Association Rule Mining
5. Advanced Association Rule Mining

Page 3:

Introduction to Association Rule Mining

Page 4:

What Is Association Rule Mining?

Association rule mining: finding associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, or other information repositories.

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Applications: basket analysis, cross-selling, catalog design, loss-leader analysis, clustering, classification, etc.

Page 5:

An Example

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

(Venn diagram: customers who buy diapers, customers who buy beer, and the overlap of customers who buy both.)

Page 6:

Definition: Association Rule

Association Rule: an implication expression of the form X → Y, where X and Y are itemsets.
Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics:
Support (s): the fraction of transactions that contain both X and Y
Confidence (c): measures how often items in Y appear in transactions that contain X

For {Milk, Diaper} → {Beer} on the table below:
$s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$
$c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67$

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke
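As a quick illustration (my sketch, not from the slides), both metrics can be computed directly from the table above:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    # Support count: number of transactions containing every item in itemset.
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)       # 2/5 = 0.4
c = sigma(X | Y, transactions) / sigma(X, transactions)  # 2/3 = 0.67
print(s, c)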

Page 7:

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!

Page 8:

Mining Association Rules

Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements

Page 9:

Categorization of Association Rules

Based on the types of values handled in the rule:
Boolean association rule
Quantitative association rule

Based on the dimensions of data involved:
Single-dimensional
Multi-dimensional

Based on the levels of abstraction involved

Based on various extensions to association mining:
Frequent closed itemset
Max-pattern

Page 10:

Roadmap for Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is the most computationally expensive step.

Page 11:

Apriori Algorithm

Page 12:

Frequent Itemset Generation

(Itemset lattice over items A–E: the null itemset at the top; below it the 1-itemsets A–E, the 2-itemsets AB … DE, the 3-itemsets ABC … CDE, the 4-itemsets ABCD … BCDE, and finally ABCDE.)

Given d items, there are 2^d possible candidate itemsets!

Page 13:

Frequent Itemset Generation

Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database, matching each transaction against every candidate
Complexity ~ O(NMw), for N transactions, M candidates, and maximum transaction width w => expensive, since M = 2^d!

(Diagram: the N = 5 example transactions on the left matched against the list of M candidates on the right.)

Page 14:

Frequent Itemset Generation Strategies

Reduce the number of candidates (M):
Complete search has M = 2^d
Use pruning techniques to reduce M

Reduce the number of transactions (N):
Reduce the size of N as the size of the itemset increases
Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM):
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction

Page 15:

Scalable Methods for Mining Frequent Patterns

The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.

Scalable mining methods:
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00)

Page 16:

Apriori: A Candidate Generation-and-Test Approach

Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested (i.e., anti-monotonicity).
(Agrawal & Srikant @VLDB'94)

Method (sketched in code below):
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
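To make the level-wise method concrete, here is a minimal Python sketch of the whole loop (my illustration, not the authors' code); apriori_gen is the join-and-prune candidate generation spelled out on a later slide:

from itertools import combinations

def apriori_gen(L_prev, k):
    # Candidate generation: join L_{k-1} with itself, then prune any
    # candidate that has an infrequent (k-1)-subset (downward closure).
    joined = {p | q for p in L_prev for q in L_prev if len(p | q) == k}
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

def apriori(db, minsup):
    # db: list of sets; minsup: absolute support count.
    db = [frozenset(t) for t in db]
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {s for s, n in counts.items() if n >= minsup}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        Ck = apriori_gen(Lk, k)
        counts = {c: 0 for c in Ck}
        for t in db:                       # one scan of the DB per level
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s for s, n in counts.items() if n >= minsup}
        frequent.update({s: counts[s] for s in Lk})
        k += 1
    return frequent

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(apriori(db, 2))   # reproduces L1, L2, L3 of the worked example below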

Page 17:

The Apriori Algorithm—An Example

Database TDB (Sup_min = 2):
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

Self-join L1 → C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

Self-join L2 → C3: {B,C,E}
3rd scan → C3: {B,C,E}:2
L3: {B,C,E}:2

Page 18:

How to Generate Candidates?

Suppose the items in L_{k-1} are listed in an order.

Step 1: self-joining L_{k-1}

insert into C_k
select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Step 2: pruning

forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k

Page 19:

Example of Generating Candidates

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}
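The same join-and-prune step in runnable form (my transcription of the pseudocode above, not the authors' code, with itemsets kept as sorted tuples so the ordered self-join applies directly):

from itertools import combinations

def apriori_gen(L_prev, k):
    # Step 1: self-join -- merge pairs that agree on the first k-2 items.
    Ck = {p + (q[-1],) for p in L_prev for q in L_prev
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune -- delete c if some (k-1)-subset of c is not in L_{k-1}.
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3, 4))
# {('a','b','c','d')} -- abcd survives; acde is pruned since ade is not in L3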

Page 20:

Is Apriori Fast Enough?

The core of the Apriori algorithm:
Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a_1, a_2, …, a_100}, one needs to generate 2^100 ≈ 10^30 candidates.
Multiple scans of the database: needs (n+1) scans, where n is the length of the longest pattern.

Page 21:

Methods to Improve Apriori’s Efficiency

Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans

Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent

Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness

Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Page 22:

Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.

Page 23:

DHP: Reduce the Number of Candidates

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.

Goal: improve the efficiency of Apriori-based mining. DHP builds on Apriori and reduces the number of candidates.

DHP differs from Apriori in how candidate k-itemsets are generated:
Step 1: Generate all of the k-itemsets for each transaction, hash them into the buckets of a hash table, and increase the corresponding bucket counts.
Step 2: A k-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent, and thus should be removed from the candidate set.

Page 24:

DHP: Reduce the Number of Candidates

Example, Step 1:

TID  Items
100  A C D
200  B C E
300  A B C E
400  B E

C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}, {B}, {C}, {E}

Page 25:

DHP: Reduce the Number of Candidates

Making a hash table:

h({x y}) = ({order of x} * 10 + {order of y}) mod 7

Step 2: generate L2. The 2-itemsets hashed per transaction (see the sketch below):
100: {A C}, {A D}, {C D}
200: {B C}, {B E}, {C E}
300: {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
400: {B E}
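A small sketch of the bucket counting (my illustration, assuming item order A=1 … E=5 for the slide's hash function):

from itertools import combinations

transactions = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}
order = {item: i + 1 for i, item in enumerate("ABCDE")}  # A=1, ..., E=5

def h(x, y):
    # Slide's hash: ({order of x} * 10 + {order of y}) mod 7, x before y.
    x, y = sorted((x, y), key=order.get)
    return (order[x] * 10 + order[y]) % 7

buckets = [0] * 7
for items in transactions.values():
    for x, y in combinations(sorted(items), 2):
        buckets[h(x, y)] += 1

print(buckets)
# [3, 1, 2, 0, 3, 1, 3]: with minsup = 2, candidates hashed to the
# under-threshold buckets 1 and 5 (here AE and AB) can be pruned from C2.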

Page 26:

Sampling for Frequent Patterns

Select a sample of the original database, and mine frequent patterns within the sample using Apriori.

Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked.
Example: check abcd instead of ab, ac, …, etc.

Scan the database again to find missed frequent patterns.

H. Toivonen. Sampling large databases for association rules. In VLDB'96.

Page 27:

DIC: Dynamic itemset counting

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.

DIC: the database is partitioned into blocks marked by start points; new candidate itemsets can be added at any start point.

Apriori: new candidate itemsets are only generated before each complete database scan.

DIC requires fewer database scans than Apriori.

Page 28:

DIC: Reduce Number of Scans

Example, with min support = 2. The database is partitioned into two blocks at the start points:

TID  Items
100  ABC
200  BCD
300  BCD
400  ABC   (end of block B1 / start point)
500  ABC
600  ABC
700  BCD
800  BCD   (end of block B2 / start point)

Page 29:

DIC: Reduce Number of Scans

Counting the 1-itemsets begins at the start of B1. After TID 100 (ABC): A=1, B=1, C=1, D=0.

Page 30:

DIC: Reduce Number of Scans

After TID 200 (BCD): A=1, B=2, C=2, D=1.

Page 31:

DIC: Reduce Number of Scans

After TID 300 (BCD): A=1, B=3, C=3, D=2.

Page 32:

DIC: Reduce Number of Scans

After TID 400 (ABC), the end of B1: A=2, B=4, C=4, D=2.

Page 33:

DIC: Reduce Number of Scans

At the start point after B1, every 1-itemset already has count ≥ 2, so the candidate 2-itemsets AB, AC, BC, AD, BD, CD are added (dashed, with empty counts) while counting continues into B2.

Page 34:

DIC: Reduce Number of Scans

After TID 500 (ABC): A=3, B=5, C=5, D=2; AB=1, AC=1, BC=1.

Page 35:

DIC: Reduce Number of Scans

After TID 600 (ABC): A=4, B=6, C=6, D=2; AB=2, AC=2, BC=2.

Page 36:

DIC: Reduce Number of Scans

After TID 700 (BCD): A=4, B=7, C=7, D=3; AB=2, AC=2, BC=3, BD=1, CD=1.

Page 37:

DIC: Reduce Number of Scans

After TID 800 (BCD), the end of B2: A=4, B=8, C=8, D=4; AB=2, AC=2, BC=4, BD=2, CD=2. The 1-itemsets have now been counted through the whole database.

Page 38:

DIC: Reduce Number of Scans

At the start point after B2, all 2-subsets of ABC and of BCD look frequent, so the candidate 3-itemsets ABC and BCD are added, and counting wraps around to B1. Counts so far: A=4, B=8, C=8, D=4; AB=2, AC=2, BC=4, AD=0, BD=2, CD=2.

Page 39:

DIC: Reduce Number of Scans

If a dashed itemset has been counted through all the transactions, make it solid and stop counting it. The 1-itemsets are now solid; the second pass over B1 counts only the dashed itemsets: AB=2, AC=2, BC=4, AD=0, BD=2, CD=2; ABC and BCD not yet counted.

Page 40:

DIC: Reduce Number of Scans

After re-reading TID 100 (ABC): AB=3, AC=3, BC=5, AD=0, BD=2, CD=2; ABC=1.

Page 41:

DIC: Reduce Number of Scans

After TID 200 (BCD): AB=3, AC=3, BC=6, AD=0, BD=3, CD=3; ABC=1, BCD=1.

Page 42:

DIC: Reduce Number of Scans

After TID 300 (BCD): AB=3, AC=3, BC=7, AD=0, BD=4, CD=4; ABC=1, BCD=2.

Page 43:

DIC: Reduce Number of Scans

After TID 400 (ABC): AB=4, AC=4, BC=8, AD=0, BD=4, CD=4; ABC=2, BCD=2. The 2-itemsets have now been counted through all the transactions.

Page 44:

DIC: Reduce Number of Scans

Every dashed itemset has now been counted through all the transactions, so each is made solid and counting stops. Finish!

Page 45:

DIC: Reduce Number of Scans

In this example, Apriori needs 3 passes over the data; DIC needs 1.5.

Page 46:

FP-Tree Algorithm

Page 47:

Mining Frequent Patterns Without Candidate Generation

Grow long patterns from short ones using local frequent items:
If "abc" is a frequent pattern, get all transactions having "abc": DB|abc.
If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern.

Page 48:

Mining Frequent Patterns With FP-trees

Idea: frequent pattern growth. Recursively grow frequent patterns by pattern and database partition.

Method:
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

Page 49:

Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: the f-list
3. Scan the DB again, construct the FP-tree

Header Table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p
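A compact construction sketch (my illustration under the slide's min_support = 3, not the paper's code):

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                # item -> FPNode

def build_fptree(transactions, min_support):
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_support]
    root = FPNode(None, None)
    header = {i: [] for i in flist}       # item -> node-links
    for t in transactions:
        # Keep only frequent items, in f-list order, then insert the path.
        path = [i for i in flist if i in t]
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(db, 3)
print(flist)  # e.g. ['f', 'c', 'a', 'm', 'p', 'b']: ties at equal counts may
              # order differently from the slide's f-c-a-b-m-p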

Page 50:

Construct FP-tree from a Transaction Database

After inserting TID 100 ({f, c, a, m, p}):

{}
└ f:1
   └ c:1
      └ a:1
         └ m:1
            └ p:1

Page 51:

Construct FP-tree from a Transaction Database

After inserting TID 200 ({f, c, a, b, m}):

{}
└ f:2
   └ c:2
      └ a:2
         ├ m:1
         │  └ p:1
         └ b:1
            └ m:1

Page 52:

Construct FP-tree from a Transaction Database

After inserting TID 300 ({f, b}):

{}
└ f:3
   ├ c:2
   │  └ a:2
   │     ├ m:1
   │     │  └ p:1
   │     └ b:1
   │        └ m:1
   └ b:1

Page 53:

Construct FP-tree from a Transaction Database

After inserting TID 400 ({c, b, p}):

{}
├ f:3
│  ├ c:2
│  │  └ a:2
│  │     ├ m:1
│  │     │  └ p:1
│  │     └ b:1
│  │        └ m:1
│  └ b:1
└ c:1
   └ b:1
      └ p:1

Page 54:

Construct FP-tree from a Transaction Database

After inserting TID 500 ({f, c, a, m, p}), the final FP-tree:

{}
├ f:4
│  ├ c:3
│  │  └ a:3
│  │     ├ m:2
│  │     │  └ p:2
│  │     └ b:1
│  │        └ m:1
│  └ b:1
└ c:1
   └ b:1
      └ p:1

Header Table (item: frequency, with node-links into the tree): f:4, c:4, a:3, b:3, m:3, p:3

Page 55:

Benefits of the FP-tree Structure

Completeness:
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction

Compactness:
Reduces irrelevant info: infrequent items are gone
Items are in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
Never larger than the original database (not counting node-links and the count fields)
For the Connect-4 DB, the compression ratio can be over 100

Page 56:

Find Patterns Having P From P-conditional Database

Starting at the frequent-item header table in the FP-tree:
Traverse the FP-tree by following the node-links of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (from the FP-tree on page 54):

item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

Page 57:

From Conditional Pattern-bases to Conditional FP-trees

For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b is dropped, since its count 1 < min_support):

{}
└ f:3
   └ c:3
      └ a:3

All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Page 58:

Recursion: Mining Each Conditional FP-tree

From the m-conditional FP-tree ({} - f:3 - c:3 - a:3):

Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} - f:3 - c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} - f:3
Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} - f:3
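The recursion can be sketched without an explicit tree by working directly on conditional pattern bases, each a list of (prefix, count) pairs; this is my simplified illustration of the idea, not the paper's implementation:

from collections import Counter

def pattern_growth(cond_base, suffix, min_support, results):
    # cond_base: list of (prefix_tuple, count); suffix: tuple of items.
    # Recursively emits every frequent pattern ending in suffix.
    counts = Counter()
    for prefix, n in cond_base:
        for item in prefix:
            counts[item] += n
    for item, n in counts.items():
        if n < min_support:
            continue
        pattern = (item,) + suffix
        results[pattern] = n
        # Build item's conditional pattern base within this projection.
        new_base = [(prefix[:prefix.index(item)], m)
                    for prefix, m in cond_base if item in prefix]
        pattern_growth(new_base, pattern, min_support, results)

# m's conditional pattern base from the slide: fca:2, fcab:1
base_m = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]
results = {("m",): 3}
pattern_growth(base_m, ("m",), 3, results)
print(sorted(results))  # the 8 patterns m, fm, cm, am, fcm, fam, cam, fcam,
                        # each with support 3, as on the slides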

Page 59:

A Special Case: Single Prefix Path in FP-tree

Suppose a (conditional) FP-tree T has a shared single prefix path P.

Mining can be decomposed into two parts:
Reduction of the single prefix path into one node
Concatenation of the mining results of the two parts

(Diagram: a tree whose single prefix path {} - a1:n1 - a2:n2 - a3:n3 branches into subtrees b1:m1, C1:k1, C2:k2, C3:k3 is decomposed into the prefix path r1 = {} - a1:n1 - a2:n2 - a3:n3 plus the branching part rooted at r1.)

Page 60:

Scaling FP-growth by DB Projection

Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery, 8(1):53-87, 2004.

What if the FP-tree cannot fit in memory? Use DB projection:
First partition the database into a set of projected DBs
Then construct and mine an FP-tree for each projected DB
Parallel projection vs. partition projection techniques: parallel projection is space-costly

Page 61:

Parallel Projection

Parallel projection needs a lot of disk space

Partition projection saves it

Page 62:

Page 63:

FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time in seconds vs. support threshold in %, on data set T25I20D10K. As the threshold drops, D1 Apriori's runtime climbs steeply while D1 FP-growth's runtime grows slowly.)

Page 64:

Why Is FP-Growth the Winner?

Divide-and-conquer:
Decompose both the mining task and the DB according to the frequent patterns obtained so far
Leads to focused search of smaller databases

Other factors:
No candidate generation, no candidate test
Compressed database: the FP-tree structure
No repeated scan of the entire database
Basic ops are counting local frequent items and building sub-FP-trees; no pattern search and matching

Page 65:

CHARM: Mining by Exploring Vertical Data Format

Vertical format: t(AB) = {T11, T25, …}; the tid-list is the list of transaction ids containing an itemset.

Deriving closed patterns based on vertical intersections:
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): a transaction having X always has Y

Using diffsets to accelerate mining: only keep track of differences of tids.
If t(X) = {T1, T2, T3} and t(XY) = {T1, T3}, then Diffset(XY, X) = {T2}.

CHARM: An Efficient Algorithm for Closed Itemset Mining (Mohammed J. Zaki & Ching-Jui Hsiao @SDM'02)
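A toy sketch of the two vertical operations (my illustration, using the hypothetical tids from the bullet above):

# Vertical representation: itemset -> set of transaction ids (tid-list).
t = {
    "X": {"T1", "T2", "T3"},
    "Y": {"T1", "T3", "T4"},
}

# Intersection gives the tid-list of the combined itemset XY.
t["XY"] = t["X"] & t["Y"]            # {'T1', 'T3'}

# Diffset: tids lost when extending X to XY; |t(XY)| = |t(X)| - |diffset|.
diffset_XY_X = t["X"] - t["XY"]      # {'T2'}
print(t["XY"], diffset_XY_X)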

Page 66:

CHARM: Mining by Exploring Vertical Data Format

Step 1, the 1-itemset tid-lists:
I1: {T100, T400, T500, T700, T800, T900}
I2: {T100, T200, T300, T400, T600, T800, T900}
I3: {T300, T500, T600, T700, T800, T900}
I4: {T200, T400}
I5: {T100, T800}

Step 2, intersecting to get the 2-itemset tid-lists:
{I1, I2}: {T100, T400, T800, T900}
{I1, I3}: {T500, T700, T800, T900}
{I1, I4}: {T400}
{I1, I5}: {T100, T800}
{I2, I3}: {T300, T600, T800, T900}
{I2, I4}: {T200, T400}
{I2, I5}: {T100, T800}
{I3, I5}: {T800}

Step 3, the 3-itemset tid-lists:
{I1, I2, I3}: {T800, T900}
{I1, I2, I5}: {T100, T800}

Page 67:

Interestingness Measure: Correlations (Lift)

Buys games ⇒ buys videos [support 40%, confidence 66%] is misleading: the overall percentage of customers purchasing videos is 75% > 66.7%.

Buys games ⇒ does not buy videos [20%, 33.3%] is more accurate, although it has lower support and confidence.

Measure of dependent/correlated events: lift

$lift(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$

            game         not game     sum (row)
video       4000 (4500)  3500 (3000)  7500
not video   2000 (1500)   500 (1000)  2500
sum (col.)  6000         4000         10000

(expected counts under independence in parentheses)

$lift(G, V) = \frac{4000/10000}{(6000/10000) \times (7500/10000)} = 0.89$

$lift(G, \neg V) = \frac{2000/10000}{(6000/10000) \times (2500/10000)} = 1.33$
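A quick check of those numbers (my arithmetic sketch):

def lift(p_ab, p_a, p_b):
    # lift > 1: positively correlated; < 1: negatively; = 1: independent.
    return p_ab / (p_a * p_b)

N = 10_000
print(lift(4000 / N, 6000 / N, 7500 / N))  # game & video  -> 0.89
print(lift(2000 / N, 6000 / N, 2500 / N))  # game & ~video -> 1.33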

Page 68:

The influence of null-transactions!

$lift(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$

$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$

Are lift and χ² good measures of correlation?

Page 69:

Null-invariant Measures of Correlation

A measure is null-invariant if its value is free from the influence of null-transactions.

All confidence: $all\_conf(A, B) = \frac{\sup(A \cup B)}{\max\{\sup(A), \sup(B)\}}$

Max confidence: $max\_conf(A, B) = \max\{P(A \mid B),\, P(B \mid A)\}$

Kulczynski measure: $Kulc(A, B) = \frac{1}{2}\,(P(A \mid B) + P(B \mid A))$

Cosine measure: $cosine(A, B) = \frac{\sup(A \cup B)}{\sqrt{\sup(A)\,\sup(B)}}$
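For concreteness, a sketch computing the four measures from absolute counts (my illustration, reusing the game/video table):

from math import sqrt

def null_invariant_measures(n_ab, n_a, n_b):
    # n_ab: transactions with both A and B; n_a, n_b: with A, with B.
    # Note the total transaction count never appears: adding null-transactions
    # (containing neither A nor B) changes none of these values.
    p_a_given_b, p_b_given_a = n_ab / n_b, n_ab / n_a
    return {
        "all_conf": n_ab / max(n_a, n_b),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": n_ab / sqrt(n_a * n_b),
    }

print(null_invariant_measures(n_ab=4000, n_a=6000, n_b=7500))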

Page 70:

Null-invariant Measures of Correlation: examples

Page 71:

Which Null-invariant Measure is better?

Imbalance ratio (IR), which gauges how unbalanced the two itemsets are: IR = 0 means balanced; otherwise, the larger the difference between the two, the larger the IR.

$IR(A, B) = \frac{|\sup(A) - \sup(B)|}{\sup(A) + \sup(B) - \sup(A \cup B)}$

Page 72:

Summary of Measures of Correlation

Lift and χ² are not good measures of correlation in large transactional DBs, because they do not have the null-invariance property.

Among the four null-invariant measures studied here, namely all_confidence, max_confidence, Kulc, and cosine, we recommend using Kulc in conjunction with the imbalance ratio.

All_confidence has the downward closure property, so efficient algorithms can be derived for mining with it (Lee et al. @ICDM'03sub).

Page 73:

Sequential Association Rule Mining

Page 74:

Sequence Data

(Timeline diagram: the events below plotted on a timeline from 10 to 35 for objects A, B, and C.)

Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7

Page 75:

Examples of Sequence Data

Customer: sequence = purchase history of a given customer; element (transaction) = a set of items bought by a customer at time t; events (items) = books, dairy products, CDs, etc.
Web data: sequence = browsing activity of a particular Web visitor; element = a collection of files viewed by a visitor after a single mouse click; events = home page, index page, contact info, etc.
Sensor data: sequence = history of events generated by a given sensor; element = events triggered by a sensor at time t; events = types of alarms generated by sensors.
Genome sequences: sequence = DNA sequence of a particular species; element = an element of the DNA sequence; events = bases A, T, G, C.

(Diagram: a sequence E1E2 → E1E3 → E2 → E3E4 → E2, with one element marked as the transaction and one event marked as the item.)

Page 76:

Formal Definition of a Sequence

A sequence is an ordered list of elements (transactions)

s = < e1 e2 e3 … >

Each element contains a collection of events (items)

ei = {i1, i2, …, ik}

Each element is attributed to a specific time or location

Length of a sequence, |s|, is given by the number of elements of the sequence

A k-sequence is a sequence that contains k elements

Page 77:

Examples of Sequence

Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >

Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers} {Return of the King} >

Page 78:

Formal Definition of a Subsequence

A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_{i1}, a2 ⊆ b_{i2}, …, an ⊆ b_{in}.

The support of a subsequence w is defined as the fraction of data sequences that contain w.

A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).

Data sequence             Subsequence     Contain?
< {2,4} {3,5,6} {8} >     < {2} {3,5} >   Yes
< {1,2} {3,4} >           < {1} {2} >     No
< {2,4} {2,4} {2,5} >     < {2} {4} >     Yes
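A direct encoding of that definition (my sketch, not from the slides):

def contains(data_seq, sub_seq):
    # True if sub_seq is contained in data_seq: each element of sub_seq is a
    # subset of a distinct element of data_seq, in order (greedy matching).
    i = 0
    for element in data_seq:
        if i < len(sub_seq) and sub_seq[i] <= element:
            i += 1
    return i == len(sub_seq)

print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))   # True
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))              # False
print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))      # True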

Page 79:

Sequential Pattern Mining: Definition

Given:
a database of sequences
a user-specified minimum support threshold, minsup

Task: find all subsequences with support ≥ minsup

Page 80:

Example

Q. How to find the sequential patterns?

Page 81:

Example (cont.)

(Diagram: the raw transaction table, sorted by Customer_Id and Transaction_Time, with an item, an itemset, and a transaction labeled.)

Page 82:

Example (cont.)

<(30) (90)> is supported by customers 1 and 4.
<(30) (40 70)> is supported by customers 2 and 4.

With a minimum support of 2 customers, the large itemsets (litemsets) are:
(30), (40), (70), (40 70), (90)

Page 83:

Example (cont.)

Q. Find the maximal sequences (the sequential patterns) with a minimum support of 2 customers.
The answer set is: <(30) (90)>, <(30) (40 70)>

Page 84:

The Algorithm

Five phases:
Sort phase
Large itemset (litemset) phase
Transformation phase
Sequence phase (AprioriAll, AprioriSome, DynamicSome)
Maximal phase

Rakesh Agrawal and Ramakrishnan Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, ICDE 1995.

Page 85:

Sort phase

Sort the database with customer-id as the major key and transaction-time as the minor key.

Page 86:

Litemset phase

Find the large itemsets, and map the itemsets to integers.

Page 87:

Transformation phase

Delete non-large itemsets from each transaction, and map the large itemsets to integers.

Page 88:

Sequence phase

Use the set of litemsets to find the desired sequences.

Two families of algorithms:
Count-all: counts all large sequences, including non-maximal sequences (algorithm AprioriAll)
Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first (algorithms AprioriSome and DynamicSome)

Page 89:

Maximal phase

Find the maximal sequences among the set of large sequences. In some algorithms, this phase is combined with the sequence phase.

Page 90:

Maximal phase

Algorithm (S: the set of all large sequences; n: the length of the longest sequence):

for (k = n; k > 1; k--) do
    for each k-sequence s_k do
        delete from S all subsequences of s_k

Page 91:

AprioriAll

The basic method to mine sequential patterns, based on the Apriori algorithm:
Count all the large sequences, including non-maximal sequences
Use the apriori-generate function to generate candidate sequences

Page 92:

Apriori Candidate Generation

Generate candidates for the next pass using only the large sequences found in the previous pass, then make a pass over the data to find their support.

Page 93:

Apriori Candidate Generation

Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.

insert into C_k
select p.litemset_1, p.litemset_2, …, p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, …, p.litemset_{k-2} = q.litemset_{k-2};

forall sequences c ∈ C_k do
    forall (k-1)-subsequences s of c do
        if (s ∉ L_{k-1}) then delete c from C_k;

Page 94:

AprioriAll (cont.)

L_1 = {large 1-sequences};  // result of the litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = new candidates generated from L_{k-1}
    for each customer sequence c in the database do
        increment the count of all candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support
end
Answer = maximal sequences in ∪_k L_k

Page 95:

Apriori Candidate Generation

Example customer sequences, with minimum support set to 40% (2 customers):

<{1 5} {2} {3} {4}>
<{1} {3} {4} {3 5}>
<{1} {2} {3} {4}>
<{1} {3} {5}>
<{4} {5}>

Next step: find the large 1-sequences.

Page 96:

Example: large 1-sequences

Sequence  Support
<1>       4
<2>       2
<3>       4
<4>       4
<5>       2

Next step: find the large 2-sequences.

Page 97:

Example: large 2-sequences

Sequence  Support
<1 2>     2
<1 3>     4
<1 4>     3
<1 5>     2
<2 3>     2
<2 4>     2
<3 4>     3
<3 5>     2
<4 5>     2

Next step: find the large 3-sequences.

Page 98:

Example: large 3-sequences

Sequence  Support
<1 2 3>   2
<1 2 4>   2
<1 3 4>   3
<1 3 5>   2
<2 3 4>   2

Next step: find the large 4-sequences.

Page 99:

Example: large 4-sequences

Sequence   Support
<1 2 3 4>  2

Next step: find the sequential patterns.

Page 100:

Example: find the maximal large sequences

1-sequences: <1>:4, <2>:2, <3>:4, <4>:4, <5>:2
2-sequences: <1 2>:2, <1 3>:4, <1 4>:3, <1 5>:2, <2 3>:2, <2 4>:2, <3 4>:3, <3 5>:2, <4 5>:2
3-sequences: <1 2 3>:2, <1 2 4>:2, <1 3 4>:3, <1 3 5>:2, <2 3 4>:2
4-sequences: <1 2 3 4>:2

The maximal large sequences, i.e., those not contained in any other large sequence, are <1 2 3 4>, <1 3 5>, and <4 5>.

Page 101:

Count-some Algorithms

Try to avoid counting non-maximal sequences by counting longer sequences first.

Two phases:
Forward phase: find all large sequences of certain lengths
Backward phase: find all remaining large sequences

Page 102:

AprioriSome (1)

Determines which lengths to count using a next() function; next() takes as its parameter the length of the sequences counted in the last pass.

next(k) = k + 1 is the same as AprioriAll.

next() balances the tradeoff between counting non-maximal sequences and counting extensions of small candidate sequences.

Page 103:

AprioriSome (2)

hit_k = |L_k| / |C_k|

Intuition: as hit_k increases, the time wasted by counting extensions of small candidates decreases.

Page 104:

AprioriSome (3)

Page 105:

AprioriSome (4)

Backward phase: for all lengths that were skipped:
Delete sequences in the candidate set that are contained in some large sequence
Count the remaining candidates and find all sequences with minimum support
Also delete large sequences found in the forward phase that are non-maximal

Page 106:

AprioriSome (5)

Page 107:

AprioriSome (6)

Example, forward phase, with next(k) = 2k and minsup = 2. (Table of candidate 3-sequences C3 and their counts not reproduced in the transcript.)

Page 108:

AprioriSome (7)

Example, backward phase. (Table of candidate 3-sequences C3 not reproduced in the transcript.)

Page 109:

Performance of two algorithms

AprioriSome does a little better than AprioriAll, because it avoids counting many non-maximal sequences.

Page 110:

Advanced Association Rule Mining

Page 111:

Mining Various Kinds of Association Rules

Mining multilevel association
Mining multidimensional association
(Optional) Mining max and closed association patterns
(Optional) Constraint-based association mining

Page 112:

Mining Multiple-Level Association Rules

Items often form hierarchies. Flexible support settings: items at the lower level are expected to have lower support.

Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95).

Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% (Skim Milk at 4% is missed)
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% (Skim Milk is kept)

Page 113:

Multi-level Association: Redundancy Filtering

Some rules may be redundant due to "ancestor" relationships between items.

Example:
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

Page 114:

Mining Multi-Dimensional Association

Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")

Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates):
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension association rules (repeated predicates):
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

Categorical attributes: finite number of possible values, no ordering among values; data cube approach
Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches

Page 115:

Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97), one-dimensional clustering then association
4. Deviation (such as Aumann and Lindell @KDD'99), e.g., Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)

Page 116:

Static Discretization of Quantitative Attributes

Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster.

(Lattice of cuboids: () at the apex; (age), (income), (buys); (age, income), (age, buys), (income, buys); and (age, income, buys) at the base.)

Page 117:

Quantitative Association Rules

age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.

2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid.

Example:
age(X, 34) ∧ income(X, "31-40K") ⇒ buys(X, "HDTV")
age(X, 35) ∧ income(X, "31-40K") ⇒ buys(X, "HDTV")
age(X, 34) ∧ income(X, "31-50K") ⇒ buys(X, "HDTV")
age(X, 35) ∧ income(X, "31-50K") ⇒ buys(X, "HDTV")

Page 118:

Classification by Association Rule Analysis

Page 119:

Associative Classification

Associative classification, major steps:
Mine the data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
Association rules are generated in the form p_1 ∧ p_2 ∧ … ∧ p_l ⇒ "A_class = C" (conf, sup)
Organize the rules to form a rule-based classifier

Why effective?
It explores highly confident associations among multiple attributes, and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time.
Associative classification has been found to be often more accurate than some traditional classification methods, such as C4.5.

Page 120:

Typical Associative Classification Methods

CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD'98)
Mine possible association rules in the form cond-set (a set of attribute-value pairs) ⇒ class label
Build the classifier: organize rules according to decreasing precedence based on confidence and then support

CMAR (Classification based on Multiple Association Rules: Li, Han & Pei, ICDM'01)
Classification: statistical analysis on multiple rules

CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
Generation of predictive rules (FOIL-like analysis), but covered rules are retained with reduced weight
Prediction using the best k rules
High efficiency; accuracy similar to CMAR

Page 121:

CBA [Liu, Hsu and Ma, KDD’98]

• Basic idea
  • Mine high-confidence, high-support class association rules with Apriori
  • Rule LHS: a conjunction of conditions
  • Rule RHS: a class label
• Example:
  R1: age < 25 & credit = ‘good’ ⇒ buy iPhone (sup = 30%, conf = 80%)
  R2: age > 40 & income < 50K ⇒ not buy iPhone (sup = 40%, conf = 90%)

Page 122:

CBA

• Rule mining
  • Mine the set of association rules w.r.t. min_sup and min_conf
  • Rank the rules in descending order of confidence and support
  • Select rules to ensure coverage of the training instances
• Prediction (see the sketch below)
  • Apply the first rule that matches the test case
  • Otherwise, apply the default rule
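A minimal Python sketch of this rank-and-first-match scheme (the Rule class and function names are illustrative, not from the CBA paper; the rules are the ones from the example that follows):

from dataclasses import dataclass

@dataclass
class Rule:
    conds: tuple   # ((attribute, value), ...), the cond-set
    label: str     # class label on the RHS
    conf: float
    sup: float

    def matches(self, case: dict) -> bool:
        return all(case.get(a) == v for a, v in self.conds)

def cba_predict(rules, default_label, case):
    # Decreasing precedence: confidence first, then support.
    for r in sorted(rules, key=lambda r: (-r.conf, -r.sup)):
        if r.matches(case):
            return r.label
    return default_label

rules = [
    Rule((("age", "31...40"),), "yes", 1.00, 0.286),
    Rule((("student", "yes"), ("credit_rating", "fair")), "yes", 1.00, 0.286),
    Rule((("student", "yes"),), "yes", 0.857, 0.50),
]
case = {"age": "<=30", "income": "high", "student": "yes",
        "credit_rating": "fair"}
print(cba_predict(rules, "no", case))   # rule 2 fires -> "yes"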

Page 123:

CBA – An example

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no

min_sup = 25%, min_conf = 80%

• Rule mining

Rules:
1. age = 31…40 ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
2. student = yes & credit_rating = fair ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
3. student = yes ⇒ buys_computer = yes (conf = 85.7%, sup = 50%)
Default: buys_computer = no (the majority class of the training instances not covered by rules 1-3)

Page 124:

CBA - An example


• Prediction

Rules:
1. age = 31…40 ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
2. student = yes & credit_rating = fair ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
3. student = yes ⇒ buys_computer = yes (conf = 85.7%, sup = 50%)
Default: buys_computer = no

age      income  student  credit_rating
<=30     high    yes      fair
⇒ Apply Rule 2: buys_computer = yes

age      income  student  credit_rating
31…40    high    yes      excellent
⇒ Apply Rule 1: buys_computer = yes

age      income  student  credit_rating
<=30     high    no       excellent
⇒ Apply the default rule: buys_computer = no

Page 125:

CMAR [Li, Han and Pei, ICDM’01]

Basic idea
• Mining: build a class-distribution-associated FP-tree
• Prediction: combine the strength of multiple rules

Rule mining
• Mine association rules from the class-distribution-associated FP-tree
• Store and retrieve the association rules in a CR-tree
• Prune rules based on confidence, correlation, and database coverage

Page 126:

CMAR (Classification based on Multiple Association Rules) (1)

Adapted from FP-growth. Phases:
• rule generation or training: find rules R: P ⇒ c such that sup(R) and conf(R) pass the given thresholds, and
• classification or testing: predict the class of a new sample.

Page 127:

CMAR (Classification based on Multiple Association Rules) (2)

Training database T for CMAR algorithm (the support threshold is 2 and the confidence threshold is 70%).

ID A B C D Class

01 a1 b1 c1 d1 A

02 a1 b2 c1 d2 B

03 a2 b3 c2 d3 A

04 a1 b2 c3 d3 C

05 a1 b2 c1 d3 C

The FP-tree is a prefix tree with respect to the F-list.
F-list: (a1, b2, c1, d3)

Page 128:

Page 129:

CMAR (Classification based on Multiple Association Rules) (3)

Rule subsets:
• the rules containing d3;
• the rules containing c1 but not d3;
• the rules containing b2 but neither d3 nor c1; and
• the rules containing only a1.

d3-projected samples:

(a1, b2, c1, d3): C, (a1, b2, d3): C, and (d3): A

This yields the rule (a1, b2, d3) ⇒ C (sup = 2, conf = 100%).

(a1, c1) is a frequent pattern with support 3, but all of its rules have confidence below the threshold. Similar conclusions hold for pattern (a1, b2), and finally for (a1).
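A short Python check that recomputes the support and confidence of the rule (a1, b2, d3) ⇒ C directly from the training database T above:

# Each transaction is (item set, class label), copied from table T.
T = [
    ({"a1", "b1", "c1", "d1"}, "A"),
    ({"a1", "b2", "c1", "d2"}, "B"),
    ({"a2", "b3", "c2", "d3"}, "A"),
    ({"a1", "b2", "c3", "d3"}, "C"),
    ({"a1", "b2", "c1", "d3"}, "C"),
]
body = {"a1", "b2", "d3"}
covered = [cls for items, cls in T if body <= items]
sup = len(covered)                        # 2 (absolute support count)
conf = covered.count("C") / len(covered)  # 1.0
print(sup, conf)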

Page 130:

CMAR (Classification based on Multiple Association Rules) (4)

Classification (testing) phase
• If all the matching rules predict the same class, CMAR simply assigns that label to the new sample.
• If the rules disagree, they are divided into groups by class label, and the sample receives the class label of the “strongest” group.
• To compare the strength of groups, it is necessary to measure the “combined effect” of each group.
• If the rules in a group are highly positively correlated and have good support, the group should have a strong effect.

Page 131:

CMAR (Classification based on Multiple Association Rules) (5)

Possible ways to measure the combined effect of a group of rules:
• the highest χ² value
• a compound of the correlations
• integrating information on both correlation and population: the weighted χ²

Page 132:

CMAR (Classification based on Multiple Association Rules) (6)

Weighted χ²

maxχ² is the upper bound on the χ² value of a rule, with the other settings held fixed.

For each group of rules, the weighted χ² measure of the group is defined as follows.
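As given in the CMAR paper (Li, Han & Pei, ICDM’01), for a group G of rules:

    weighted χ² = Σ_{R ∈ G} χ²(R) · χ²(R) / maxχ²(R)

A one-line Python sketch of the same computation:

def weighted_chi2(group):
    # group: (chi2, max_chi2) pairs, one per rule in a class's group.
    # The class whose group has the largest weighted chi-square wins.
    return sum(chi2 * chi2 / max_chi2 for chi2, max_chi2 in group)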

Page 133:

CPAR [Yin and Han, SDM’03]

Basic idea
• Combine associative classification and FOIL-based rule generation
• FOIL gain: the criterion for selecting a literal (a common formulation is sketched below)
• Improves accuracy over traditional rule-based classifiers
• Improves efficiency, and reduces the number of rules, compared with association-rule-based methods
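The formula is not reproduced on the slide; a common formulation of FOIL gain (the criterion used by FOIL and CPAR, up to notation) is sketched below. Here p0/n0 count the positive/negative examples covered by the current rule body, and p1/n1 those still covered after appending the candidate literal:

from math import log2

def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    # Gain of appending a literal: positives kept, times the increase
    # in log-precision of the rule.
    if p1 == 0:
        return float("-inf")   # the literal removes all positive coverage
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Rule currently covers 10 positives / 10 negatives; the candidate
# literal keeps 8 positives and only 2 negatives.
print(foil_gain(10, 10, 8, 2))   # ~5.42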

Page 134:

CPAR (1)

Rule generation
• Build a rule by adding literals one by one, greedily, according to the FOIL gain measure
• Keep all close-to-the-best literals and build several rules simultaneously

Prediction (see the sketch below)
• Collect all rules matching the test case
• Select the best k rules for each class
• Choose the class with the highest expected accuracy
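A minimal Python sketch of this prediction step, assuming each matching rule carries an expected (e.g., Laplace) accuracy; the helper names are illustrative:

def cpar_predict(matching_rules, k=5):
    # matching_rules: (class_label, expected_accuracy) pairs for rules
    # whose bodies the test case satisfies.
    by_class = {}
    for label, acc in matching_rules:
        by_class.setdefault(label, []).append(acc)

    def score(accs):
        best = sorted(accs, reverse=True)[:k]   # best k rules per class
        return sum(best) / len(best)

    return max(by_class, key=lambda c: score(by_class[c]))

print(cpar_predict([("yes", 0.92), ("yes", 0.86), ("no", 0.88)]))  # yes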

Page 135:

CPAR (2)

Build rules by adding literals one by one. CPAR keeps all “close-to-the-best” literals during the rule-building process, i.e., it selects more than one literal at the same time and builds several rules simultaneously.

Page 136:

CPAR (3)

Suppose that, after finding the best literal p, another literal q has a gain similar to p’s (e.g., differing by at most 1%). Then, besides appending p to the current rule r, appending q to r creates a new rule r’.
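A toy Python sketch of the branching step (the gain values are invented; the 1% tolerance follows the slide):

def grow(rule, gains, tol=0.01):
    # Extend `rule` with the best literal, and also branch on any
    # runner-up whose gain is within `tol` of the best.
    best = max(gains, key=gains.get)
    branches = [rule + [best]]
    for q, g in gains.items():
        if q != best and g >= gains[best] * (1 - tol):
            branches.append(rule + [q])
    return branches

gains = {"A2=1": 4.00, "A3=1": 3.97, "A4=2": 2.00}
print(grow(["A1=2"], gains))
# -> [['A1=2', 'A2=1'], ['A1=2', 'A3=1']]  (two rules grown in parallel)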

Page 137:

How CPAR generates rules

Example
1. The literal (A1=2) has the highest FOIL gain.

[Diagram: rule-growth tree containing the single node A1=2]

Page 138:

How CPAR generates rules

2. After the first literal is selected, two literals, (A2=1) and (A3=1), are found to have similar gain, higher than all others.

[Diagram: A1=2 branches to A2=1 and A3=1]

Page 139:

How CPAR generates rules

3. Choose literal (A2=1) first. A rule is generated along this direction: (A1=2, A2=1, A4=1).

[Diagram: A1=2 branches to A2=1 and A3=1; the A2=1 branch extends to A4=1]

Page 140:

How CPAR generates rules

4. Then the rule (A1=2, A3=1) is taken as the current rule. Again, two literals with similar gain are selected.

[Diagram: A1=2 branches to A2=1 (extended to A4=1) and A3=1; from A3=1, the candidate literals A4=2 and A2=1 are selected]

Page 141:

How CPAR generates rules

5. Choose (A1=2, A3=1, A4=2) first. A rule is generated: (A1=2, A3=1, A4=2, A2=3).

[Diagram: A1=2 branches to A2=1 (extended to A4=1) and A3=1; A3=1 branches to A4=2 (extended to A2=3) and A2=1]

Page 142:

How CPAR generates rules

6. Finally, the rule (A1=2, A3=1, A2=1) is generated.

[Diagram: the completed rule-growth tree: A1=2 branches to A2=1 (extended to A4=1) and A3=1; A3=1 branches to A4=2 (extended to A2=3) and A2=1]

Page 143:

More reading on Associative Classification

Fadi Thabtah. A review of associative classification mining. The Knowledge Engineering Review, 22(1):37–65, 2007.

Page 144:

Q&A