
Page 1:

Lecture 5: Mining Association Rule

Introduction to Data Mining

Yunming Ye

Department of Computer Science

Shenzhen Graduate School

Harbin Institute of Technology

Page 2:

Agenda

1. Introduction to Association Rule Mining
2. Apriori Algorithm
3. FP-Tree Algorithm
4. Sequential Association Rule Mining
5. Advanced Association Rule Mining

Page 3:

Introduction to Association Rule Mining

Page 4:

What Is Association Rule Mining?

Association rule mining: finding associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, or other information repositories.

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Applications: basket analysis, cross-selling, catalog design, loss-leader analysis, clustering, classification, etc.

Page 5:

An Example

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

(Venn diagram: customers who buy diapers, customers who buy beer, and the overlap of customers who buy both.)

Page 6:

Definition: Association Rule

Association Rule: an implication expression of the form X → Y, where X and Y are itemsets.
Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics:
Support (s): the fraction of transactions that contain both X and Y
Confidence (c): measures how often items in Y appear in transactions that contain X

For {Milk, Diaper} → {Beer} on the table below:
$s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$
$c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67$

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke
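As a quick illustration (my sketch, not from the slides), both metrics can be computed directly from the table above:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    # Support count: number of transactions containing every item in itemset.
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)       # 2/5 = 0.4
c = sigma(X | Y, transactions) / sigma(X, transactions)  # 2/3 = 0.67
print(s, c)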

Page 7:

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!

Page 8:

Mining Association Rules

Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements

Page 9:

Categorization of Association Rules

Based on the types of values handled in the rule:
Boolean association rule
Quantitative association rule

Based on the dimensions of data involved:
Single-dimensional
Multi-dimensional

Based on the levels of abstraction involved

Based on various extensions to association mining:
Frequent closed itemset
Max-pattern

Page 10:

Roadmap for Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is the most computationally expensive step.

Page 11:

Apriori Algorithm

Page 12:

Frequent Itemset Generation

(Itemset lattice over items A–E: the null itemset at the top; below it the 1-itemsets A–E, the 2-itemsets AB … DE, the 3-itemsets ABC … CDE, the 4-itemsets ABCD … BCDE, and finally ABCDE.)

Given d items, there are 2^d possible candidate itemsets!

Page 13:

Frequent Itemset Generation

Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database, matching each transaction against every candidate
Complexity ~ O(NMw), for N transactions, M candidates, and maximum transaction width w => expensive, since M = 2^d!

(Diagram: the N = 5 example transactions on the left matched against the list of M candidates on the right.)

Page 14:

Frequent Itemset Generation Strategies

Reduce the number of candidates (M):
Complete search has M = 2^d
Use pruning techniques to reduce M

Reduce the number of transactions (N):
Reduce the size of N as the size of the itemset increases
Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM):
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction

Page 15:

Scalable Methods for Mining Frequent Patterns

The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.

Scalable mining methods:
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00)

Page 16:

Apriori: A Candidate Generation-and-Test Approach

Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested (i.e., anti-monotonicity).
(Agrawal & Srikant @VLDB'94)

Method (sketched in code below):
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
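To make the level-wise method concrete, here is a minimal Python sketch of the whole loop (my illustration, not the authors' code); apriori_gen is the join-and-prune candidate generation spelled out on a later slide:

from itertools import combinations

def apriori_gen(L_prev, k):
    # Candidate generation: join L_{k-1} with itself, then prune any
    # candidate that has an infrequent (k-1)-subset (downward closure).
    joined = {p | q for p in L_prev for q in L_prev if len(p | q) == k}
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

def apriori(db, minsup):
    # db: list of sets; minsup: absolute support count.
    db = [frozenset(t) for t in db]
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {s for s, n in counts.items() if n >= minsup}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        Ck = apriori_gen(Lk, k)
        counts = {c: 0 for c in Ck}
        for t in db:                       # one scan of the DB per level
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s for s, n in counts.items() if n >= minsup}
        frequent.update({s: counts[s] for s in Lk})
        k += 1
    return frequent

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(apriori(db, 2))   # reproduces L1, L2, L3 of the worked example below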

Page 17:

The Apriori Algorithm—An Example

Database TDB (Sup_min = 2):
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

Self-join L1 → C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

Self-join L2 → C3: {B,C,E}
3rd scan → C3: {B,C,E}:2
L3: {B,C,E}:2

Page 18:

How to Generate Candidates?

Suppose the items in L_{k-1} are listed in an order.

Step 1: self-joining L_{k-1}

insert into C_k
select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Step 2: pruning

forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k

Page 19:

Example of Generating Candidates

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}
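The same join-and-prune step in runnable form (my transcription of the pseudocode above, not the authors' code, with itemsets kept as sorted tuples so the ordered self-join applies directly):

from itertools import combinations

def apriori_gen(L_prev, k):
    # Step 1: self-join -- merge pairs that agree on the first k-2 items.
    Ck = {p + (q[-1],) for p in L_prev for q in L_prev
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune -- delete c if some (k-1)-subset of c is not in L_{k-1}.
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3, 4))
# {('a','b','c','d')} -- abcd survives; acde is pruned since ade is not in L3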

Page 20:

Is Apriori Fast Enough?

The core of the Apriori algorithm:
Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a_1, a_2, …, a_100}, one needs to generate 2^100 ≈ 10^30 candidates.
Multiple scans of the database: needs (n+1) scans, where n is the length of the longest pattern.

Page 21:

Methods to Improve Apriori’s Efficiency

Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans

Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent

Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness

Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Page 22:

Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.

Page 23:

DHP: Reduce the Number of Candidates

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.

Goal: improve the efficiency of Apriori-based mining. DHP builds on Apriori and reduces the number of candidates.

DHP differs from Apriori in how candidate k-itemsets are generated:
Step 1: Generate all of the k-itemsets for each transaction, hash them into the buckets of a hash table, and increase the corresponding bucket counts.
Step 2: A k-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent, and thus should be removed from the candidate set.

Page 24:

DHP: Reduce the Number of Candidates

Example, Step 1:

TID  Items
100  A C D
200  B C E
300  A B C E
400  B E

C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}, {B}, {C}, {E}

Page 25:

DHP: Reduce the Number of Candidates

Making a hash table:

h({x y}) = ({order of x} * 10 + {order of y}) mod 7

Step 2: generate L2. The 2-itemsets hashed per transaction (see the sketch below):
100: {A C}, {A D}, {C D}
200: {B C}, {B E}, {C E}
300: {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
400: {B E}
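A small sketch of the bucket counting (my illustration, assuming item order A=1 … E=5 for the slide's hash function):

from itertools import combinations

transactions = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}
order = {item: i + 1 for i, item in enumerate("ABCDE")}  # A=1, ..., E=5

def h(x, y):
    # Slide's hash: ({order of x} * 10 + {order of y}) mod 7, x before y.
    x, y = sorted((x, y), key=order.get)
    return (order[x] * 10 + order[y]) % 7

buckets = [0] * 7
for items in transactions.values():
    for x, y in combinations(sorted(items), 2):
        buckets[h(x, y)] += 1

print(buckets)
# [3, 1, 2, 0, 3, 1, 3]: with minsup = 2, candidates hashed to the
# under-threshold buckets 1 and 5 (here AE and AB) can be pruned from C2.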

Page 26:

Sampling for Frequent Patterns

Select a sample of the original database, and mine frequent patterns within the sample using Apriori.

Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked.
Example: check abcd instead of ab, ac, …, etc.

Scan the database again to find missed frequent patterns.

H. Toivonen. Sampling large databases for association rules. In VLDB'96.

Page 27:

DIC: Dynamic itemset counting

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.

DIC: the database is partitioned into blocks marked by start points; new candidate itemsets can be added at any start point.

Apriori: new candidate itemsets are only generated before each complete database scan.

DIC requires fewer database scans than Apriori.

Page 28:

DIC: Reduce Number of Scans

Example, with min support = 2. The database is partitioned into two blocks at the start points:

TID  Items
100  ABC
200  BCD
300  BCD
400  ABC   (end of block B1 / start point)
500  ABC
600  ABC
700  BCD
800  BCD   (end of block B2 / start point)

Page 29:

DIC: Reduce Number of Scans

Counting the 1-itemsets begins at the start of B1. After TID 100 (ABC): A=1, B=1, C=1, D=0.

Page 30:

DIC: Reduce Number of Scans

After TID 200 (BCD): A=1, B=2, C=2, D=1.

Page 31:

DIC: Reduce Number of Scans

After TID 300 (BCD): A=1, B=3, C=3, D=2.

Page 32:

DIC: Reduce Number of Scans

After TID 400 (ABC), the end of B1: A=2, B=4, C=4, D=2.

Page 33:

DIC: Reduce Number of Scans

At the start point after B1, every 1-itemset already has count ≥ 2, so the candidate 2-itemsets AB, AC, BC, AD, BD, CD are added (dashed, with empty counts) while counting continues into B2.

Page 34:

DIC: Reduce Number of Scans

After TID 500 (ABC): A=3, B=5, C=5, D=2; AB=1, AC=1, BC=1.

Page 35:

DIC: Reduce Number of Scans

After TID 600 (ABC): A=4, B=6, C=6, D=2; AB=2, AC=2, BC=2.

Page 36:

DIC: Reduce Number of Scans

After TID 700 (BCD): A=4, B=7, C=7, D=3; AB=2, AC=2, BC=3, BD=1, CD=1.

Page 37:

DIC: Reduce Number of Scans

After TID 800 (BCD), the end of B2: A=4, B=8, C=8, D=4; AB=2, AC=2, BC=4, BD=2, CD=2. The 1-itemsets have now been counted through the whole database.

Page 38:

DIC: Reduce Number of Scans

At the start point after B2, all 2-subsets of ABC and of BCD look frequent, so the candidate 3-itemsets ABC and BCD are added, and counting wraps around to B1. Counts so far: A=4, B=8, C=8, D=4; AB=2, AC=2, BC=4, AD=0, BD=2, CD=2.

Page 39:

DIC: Reduce Number of Scans

If a dashed itemset has been counted through all the transactions, make it solid and stop counting it. The 1-itemsets are now solid; the second pass over B1 counts only the dashed itemsets: AB=2, AC=2, BC=4, AD=0, BD=2, CD=2; ABC and BCD not yet counted.

Page 40:

DIC: Reduce Number of Scans

After re-reading TID 100 (ABC): AB=3, AC=3, BC=5, AD=0, BD=2, CD=2; ABC=1.

Page 41:

DIC: Reduce Number of Scans

After TID 200 (BCD): AB=3, AC=3, BC=6, AD=0, BD=3, CD=3; ABC=1, BCD=1.

Page 42:

DIC: Reduce Number of Scans

After TID 300 (BCD): AB=3, AC=3, BC=7, AD=0, BD=4, CD=4; ABC=1, BCD=2.

Page 43:

DIC: Reduce Number of Scans

After TID 400 (ABC): AB=4, AC=4, BC=8, AD=0, BD=4, CD=4; ABC=2, BCD=2. The 2-itemsets have now been counted through all the transactions.

Page 44:

DIC: Reduce Number of Scans

Every dashed itemset has now been counted through all the transactions, so each is made solid and counting stops. Finish!

Page 45:

DIC: Reduce Number of Scans

In this example, Apriori needs 3 passes over the data; DIC needs 1.5.

Page 46:

FP-Tree Algorithm

Page 47:

Mining Frequent Patterns Without Candidate Generation

Grow long patterns from short ones using local frequent items:
If "abc" is a frequent pattern, get all transactions having "abc": DB|abc.
If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern.

Page 48:

Mining Frequent Patterns With FP-trees

Idea: frequent pattern growth. Recursively grow frequent patterns by pattern and database partition.

Method:
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

Page 49:

Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: the f-list
3. Scan the DB again, construct the FP-tree

Header Table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p
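A compact construction sketch (my illustration under the slide's min_support = 3, not the paper's code):

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                # item -> FPNode

def build_fptree(transactions, min_support):
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_support]
    root = FPNode(None, None)
    header = {i: [] for i in flist}       # item -> node-links
    for t in transactions:
        # Keep only frequent items, in f-list order, then insert the path.
        path = [i for i in flist if i in t]
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(db, 3)
print(flist)  # e.g. ['f', 'c', 'a', 'm', 'p', 'b']: ties at equal counts may
              # order differently from the slide's f-c-a-b-m-p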

Page 50:

Construct FP-tree from a Transaction Database

After inserting TID 100 ({f, c, a, m, p}):

{}
└ f:1
   └ c:1
      └ a:1
         └ m:1
            └ p:1

Page 51:

Construct FP-tree from a Transaction Database

After inserting TID 200 ({f, c, a, b, m}):

{}
└ f:2
   └ c:2
      └ a:2
         ├ m:1
         │  └ p:1
         └ b:1
            └ m:1

Page 52:

Construct FP-tree from a Transaction Database

After inserting TID 300 ({f, b}):

{}
└ f:3
   ├ c:2
   │  └ a:2
   │     ├ m:1
   │     │  └ p:1
   │     └ b:1
   │        └ m:1
   └ b:1

Page 53:

Construct FP-tree from a Transaction Database

After inserting TID 400 ({c, b, p}):

{}
├ f:3
│  ├ c:2
│  │  └ a:2
│  │     ├ m:1
│  │     │  └ p:1
│  │     └ b:1
│  │        └ m:1
│  └ b:1
└ c:1
   └ b:1
      └ p:1

Page 54:

Construct FP-tree from a Transaction Database

After inserting TID 500 ({f, c, a, m, p}), the final FP-tree:

{}
├ f:4
│  ├ c:3
│  │  └ a:3
│  │     ├ m:2
│  │     │  └ p:2
│  │     └ b:1
│  │        └ m:1
│  └ b:1
└ c:1
   └ b:1
      └ p:1

Header Table (item: frequency, with node-links into the tree): f:4, c:4, a:3, b:3, m:3, p:3

Page 55:

Benefits of the FP-tree Structure

Completeness:
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction

Compactness:
Reduces irrelevant info: infrequent items are gone
Items are in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
Never larger than the original database (not counting node-links and the count fields)
For the Connect-4 DB, the compression ratio can be over 100

Page 56:

Find Patterns Having P From P-conditional Database

Starting at the frequent-item header table in the FP-tree:
Traverse the FP-tree by following the node-links of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (from the FP-tree on page 54):

item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

Page 57:

From Conditional Pattern-bases to Conditional FP-trees

For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b is dropped, since its count 1 < min_support):

{}
└ f:3
   └ c:3
      └ a:3

All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Page 58:

Recursion: Mining Each Conditional FP-tree

From the m-conditional FP-tree ({} - f:3 - c:3 - a:3):

Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} - f:3 - c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} - f:3
Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} - f:3
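The recursion can be sketched without an explicit tree by working directly on conditional pattern bases, each a list of (prefix, count) pairs; this is my simplified illustration of the idea, not the paper's implementation:

from collections import Counter

def pattern_growth(cond_base, suffix, min_support, results):
    # cond_base: list of (prefix_tuple, count); suffix: tuple of items.
    # Recursively emits every frequent pattern ending in suffix.
    counts = Counter()
    for prefix, n in cond_base:
        for item in prefix:
            counts[item] += n
    for item, n in counts.items():
        if n < min_support:
            continue
        pattern = (item,) + suffix
        results[pattern] = n
        # Build item's conditional pattern base within this projection.
        new_base = [(prefix[:prefix.index(item)], m)
                    for prefix, m in cond_base if item in prefix]
        pattern_growth(new_base, pattern, min_support, results)

# m's conditional pattern base from the slide: fca:2, fcab:1
base_m = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]
results = {("m",): 3}
pattern_growth(base_m, ("m",), 3, results)
print(sorted(results))  # the 8 patterns m, fm, cm, am, fcm, fam, cam, fcam,
                        # each with support 3, as on the slides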

Page 59:

A Special Case: Single Prefix Path in FP-tree

Suppose a (conditional) FP-tree T has a shared single prefix path P.

Mining can be decomposed into two parts:
Reduction of the single prefix path into one node
Concatenation of the mining results of the two parts

(Diagram: a tree whose single prefix path {} - a1:n1 - a2:n2 - a3:n3 branches into subtrees b1:m1, C1:k1, C2:k2, C3:k3 is decomposed into the prefix path r1 = {} - a1:n1 - a2:n2 - a3:n3 plus the branching part rooted at r1.)

Page 60:

Scaling FP-growth by DB Projection

Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery, 8(1):53-87, 2004.

What if the FP-tree cannot fit in memory? Use DB projection:
First partition the database into a set of projected DBs
Then construct and mine an FP-tree for each projected DB
Parallel projection vs. partition projection techniques: parallel projection is space-costly

Page 61:

Parallel Projection

Parallel projection needs a lot of disk space

Partition projection saves it

Page 62:

Page 63:

FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time in seconds vs. support threshold in %, on data set T25I20D10K. As the threshold drops, D1 Apriori's runtime climbs steeply while D1 FP-growth's runtime grows slowly.)

Page 64:

Why Is FP-Growth the Winner?

Divide-and-conquer:
Decompose both the mining task and the DB according to the frequent patterns obtained so far
Leads to focused search of smaller databases

Other factors:
No candidate generation, no candidate test
Compressed database: the FP-tree structure
No repeated scan of the entire database
Basic ops are counting local frequent items and building sub-FP-trees; no pattern search and matching

Page 65:

CHARM: Mining by Exploring Vertical Data Format

Vertical format: t(AB) = {T11, T25, …}; the tid-list is the list of transaction ids containing an itemset.

Deriving closed patterns based on vertical intersections:
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): a transaction having X always has Y

Using diffsets to accelerate mining: only keep track of differences of tids.
If t(X) = {T1, T2, T3} and t(XY) = {T1, T3}, then Diffset(XY, X) = {T2}.

CHARM: An Efficient Algorithm for Closed Itemset Mining (Mohammed J. Zaki & Ching-Jui Hsiao @SDM'02)
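A toy sketch of the two vertical operations (my illustration, using the hypothetical tids from the bullet above):

# Vertical representation: itemset -> set of transaction ids (tid-list).
t = {
    "X": {"T1", "T2", "T3"},
    "Y": {"T1", "T3", "T4"},
}

# Intersection gives the tid-list of the combined itemset XY.
t["XY"] = t["X"] & t["Y"]            # {'T1', 'T3'}

# Diffset: tids lost when extending X to XY; |t(XY)| = |t(X)| - |diffset|.
diffset_XY_X = t["X"] - t["XY"]      # {'T2'}
print(t["XY"], diffset_XY_X)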

Page 66:

CHARM: Mining by Exploring Vertical Data Format

Step 1, the 1-itemset tid-lists:
I1: {T100, T400, T500, T700, T800, T900}
I2: {T100, T200, T300, T400, T600, T800, T900}
I3: {T300, T500, T600, T700, T800, T900}
I4: {T200, T400}
I5: {T100, T800}

Step 2, intersecting to get the 2-itemset tid-lists:
{I1, I2}: {T100, T400, T800, T900}
{I1, I3}: {T500, T700, T800, T900}
{I1, I4}: {T400}
{I1, I5}: {T100, T800}
{I2, I3}: {T300, T600, T800, T900}
{I2, I4}: {T200, T400}
{I2, I5}: {T100, T800}
{I3, I5}: {T800}

Step 3, the 3-itemset tid-lists:
{I1, I2, I3}: {T800, T900}
{I1, I2, I5}: {T100, T800}

Page 67:

Interestingness Measure: Correlations (Lift)

Buys games ⇒ buys videos [support 40%, confidence 66%] is misleading: the overall percentage of customers purchasing videos is 75% > 66.7%.

Buys games ⇒ does not buy videos [20%, 33.3%] is more accurate, although it has lower support and confidence.

Measure of dependent/correlated events: lift

$lift(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$

            game         not game     sum (row)
video       4000 (4500)  3500 (3000)  7500
not video   2000 (1500)   500 (1000)  2500
sum (col.)  6000         4000         10000

(expected counts under independence in parentheses)

$lift(G, V) = \frac{4000/10000}{(6000/10000) \times (7500/10000)} = 0.89$

$lift(G, \neg V) = \frac{2000/10000}{(6000/10000) \times (2500/10000)} = 1.33$
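A quick check of those numbers (my arithmetic sketch):

def lift(p_ab, p_a, p_b):
    # lift > 1: positively correlated; < 1: negatively; = 1: independent.
    return p_ab / (p_a * p_b)

N = 10_000
print(lift(4000 / N, 6000 / N, 7500 / N))  # game & video  -> 0.89
print(lift(2000 / N, 6000 / N, 2500 / N))  # game & ~video -> 1.33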

Page 68:

The influence of null-transactions!

$lift(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$

$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$

Are lift and χ² good measures of correlation?

Page 69:

Null-invariant Measures of Correlation

A measure is null-invariant if its value is free from the influence of null-transactions.

All confidence: $all\_conf(A, B) = \frac{\sup(A \cup B)}{\max\{\sup(A), \sup(B)\}}$

Max confidence: $max\_conf(A, B) = \max\{P(A \mid B),\, P(B \mid A)\}$

Kulczynski measure: $Kulc(A, B) = \frac{1}{2}\,(P(A \mid B) + P(B \mid A))$

Cosine measure: $cosine(A, B) = \frac{\sup(A \cup B)}{\sqrt{\sup(A)\,\sup(B)}}$
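For concreteness, a sketch computing the four measures from absolute counts (my illustration, reusing the game/video table):

from math import sqrt

def null_invariant_measures(n_ab, n_a, n_b):
    # n_ab: transactions with both A and B; n_a, n_b: with A, with B.
    # Note the total transaction count never appears: adding null-transactions
    # (containing neither A nor B) changes none of these values.
    p_a_given_b, p_b_given_a = n_ab / n_b, n_ab / n_a
    return {
        "all_conf": n_ab / max(n_a, n_b),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": n_ab / sqrt(n_a * n_b),
    }

print(null_invariant_measures(n_ab=4000, n_a=6000, n_b=7500))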

Page 70:

Null-invariant Measures of Correlation: examples

Page 71:

Which Null-invariant Measure is better?

Imbalance ratio (IR), which gauges how unbalanced the two itemsets are: IR = 0 means balanced; otherwise, the larger the difference between the two, the larger the IR.

$IR(A, B) = \frac{|\sup(A) - \sup(B)|}{\sup(A) + \sup(B) - \sup(A \cup B)}$

Page 72:

Summary of Measures of Correlation

Lift and χ² are not good measures of correlation in large transactional DBs, because they do not have the null-invariance property.

Among the four null-invariant measures studied here, namely all_confidence, max_confidence, Kulc, and cosine, we recommend using Kulc in conjunction with the imbalance ratio.

All_confidence has the downward closure property, so efficient algorithms can be derived for mining with it (Lee et al. @ICDM'03sub).

Page 73:

Sequential Association Rule Mining

Page 74:

Sequence Data

(Timeline diagram: the events below plotted on a timeline from 10 to 35 for objects A, B, and C.)

Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7

Page 75:

Examples of Sequence Data

Customer: sequence = purchase history of a given customer; element (transaction) = a set of items bought by a customer at time t; events (items) = books, dairy products, CDs, etc.
Web data: sequence = browsing activity of a particular Web visitor; element = a collection of files viewed by a visitor after a single mouse click; events = home page, index page, contact info, etc.
Sensor data: sequence = history of events generated by a given sensor; element = events triggered by a sensor at time t; events = types of alarms generated by sensors.
Genome sequences: sequence = DNA sequence of a particular species; element = an element of the DNA sequence; events = bases A, T, G, C.

(Diagram: a sequence E1E2 → E1E3 → E2 → E3E4 → E2, with one element marked as the transaction and one event marked as the item.)

Page 76:

Formal Definition of a Sequence

A sequence is an ordered list of elements (transactions)

s = < e1 e2 e3 … >

Each element contains a collection of events (items)

ei = {i1, i2, …, ik}

Each element is attributed to a specific time or location

Length of a sequence, |s|, is given by the number of elements of the sequence

A k-sequence is a sequence that contains k elements

Page 77:

Examples of Sequence

Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >

Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers} {Return of the King} >

Page 78:

Formal Definition of a Subsequence

A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_{i1}, a2 ⊆ b_{i2}, …, an ⊆ b_{in}.

The support of a subsequence w is defined as the fraction of data sequences that contain w.

A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).

Data sequence             Subsequence     Contain?
< {2,4} {3,5,6} {8} >     < {2} {3,5} >   Yes
< {1,2} {3,4} >           < {1} {2} >     No
< {2,4} {2,4} {2,5} >     < {2} {4} >     Yes
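A direct encoding of that definition (my sketch, not from the slides):

def contains(data_seq, sub_seq):
    # True if sub_seq is contained in data_seq: each element of sub_seq is a
    # subset of a distinct element of data_seq, in order (greedy matching).
    i = 0
    for element in data_seq:
        if i < len(sub_seq) and sub_seq[i] <= element:
            i += 1
    return i == len(sub_seq)

print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))   # True
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))              # False
print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))      # True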

Page 79:

Sequential Pattern Mining: Definition

Given:
a database of sequences
a user-specified minimum support threshold, minsup

Task: find all subsequences with support ≥ minsup

Page 80:

Example

Q. How to find the sequential patterns?

Page 81:

Example (cont.)

(Diagram: the raw transaction table, sorted by Customer_Id and Transaction_Time, with an item, an itemset, and a transaction labeled.)

Page 82:

Example (cont.)

<(30) (90)> is supported by customers 1 and 4.
<(30) (40 70)> is supported by customers 2 and 4.

With a minimum support of 2 customers, the large itemsets (litemsets) are:
(30), (40), (70), (40 70), (90)

Page 83:

Example (cont.)

Q. Find the maximal sequences (the sequential patterns) with a minimum support of 2 customers.
The answer set is: <(30) (90)>, <(30) (40 70)>

Page 84:

The Algorithm

Five phases:
Sort phase
Large itemset (litemset) phase
Transformation phase
Sequence phase (AprioriAll, AprioriSome, DynamicSome)
Maximal phase

Rakesh Agrawal and Ramakrishnan Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, ICDE 1995.

Page 85:

Sort phase

Sort the database with customer-id as the major key and transaction-time as the minor key.

Page 86:

Litemset phase

Find the large itemsets, and map the itemsets to integers.

Page 87:

Transformation phase

Delete non-large itemsets from each transaction, and map the large itemsets to integers.

Page 88:

Sequence phase

Use the set of litemsets to find the desired sequences.

Two families of algorithms:
Count-all: counts all large sequences, including non-maximal sequences (algorithm AprioriAll)
Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first (algorithms AprioriSome and DynamicSome)

Page 89:

Maximal phase

Find the maximal sequences among the set of large sequences. In some algorithms, this phase is combined with the sequence phase.

Page 90:

Maximal phase

Algorithm (S: the set of all large sequences; n: the length of the longest sequence):

for (k = n; k > 1; k--) do
    for each k-sequence s_k do
        delete from S all subsequences of s_k

Page 91:

AprioriAll

The basic method to mine sequential patterns, based on the Apriori algorithm:
Count all the large sequences, including non-maximal sequences
Use the apriori-generate function to generate candidate sequences

Page 92:

Apriori Candidate Generation

Generate candidates for the next pass using only the large sequences found in the previous pass, then make a pass over the data to find their support.

Page 93:

Apriori Candidate Generation

Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.

insert into C_k
select p.litemset_1, p.litemset_2, …, p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, …, p.litemset_{k-2} = q.litemset_{k-2};

forall sequences c ∈ C_k do
    forall (k-1)-subsequences s of c do
        if (s ∉ L_{k-1}) then delete c from C_k;

Page 94:

AprioriAll (cont.)

L_1 = {large 1-sequences};  // result of the litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = new candidates generated from L_{k-1}
    for each customer sequence c in the database do
        increment the count of all candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support
end
Answer = maximal sequences in ∪_k L_k

Page 95:

Apriori Candidate Generation

Example customer sequences, with minimum support set to 40% (2 customers):

<{1 5} {2} {3} {4}>
<{1} {3} {4} {3 5}>
<{1} {2} {3} {4}>
<{1} {3} {5}>
<{4} {5}>

Next step: find the large 1-sequences.

Page 96:

Example: large 1-sequences

Sequence  Support
<1>       4
<2>       2
<3>       4
<4>       4
<5>       2

Next step: find the large 2-sequences.

Page 97:

Example: large 2-sequences

Sequence  Support
<1 2>     2
<1 3>     4
<1 4>     3
<1 5>     2
<2 3>     2
<2 4>     2
<3 4>     3
<3 5>     2
<4 5>     2

Next step: find the large 3-sequences.

Page 98:

Example: large 3-sequences

Sequence  Support
<1 2 3>   2
<1 2 4>   2
<1 3 4>   3
<1 3 5>   2
<2 3 4>   2

Next step: find the large 4-sequences.

Page 99:

Example: large 4-sequences

Sequence   Support
<1 2 3 4>  2

Next step: find the sequential patterns.

Page 100:

Example: find the maximal large sequences

1-sequences: <1>:4, <2>:2, <3>:4, <4>:4, <5>:2
2-sequences: <1 2>:2, <1 3>:4, <1 4>:3, <1 5>:2, <2 3>:2, <2 4>:2, <3 4>:3, <3 5>:2, <4 5>:2
3-sequences: <1 2 3>:2, <1 2 4>:2, <1 3 4>:3, <1 3 5>:2, <2 3 4>:2
4-sequences: <1 2 3 4>:2

The maximal large sequences, i.e., those not contained in any other large sequence, are <1 2 3 4>, <1 3 5>, and <4 5>.

Page 101:

Count-some Algorithms

Try to avoid counting non-maximal sequences by counting longer sequences first.

Two phases:
Forward phase: find all large sequences of certain lengths
Backward phase: find all remaining large sequences

Page 102:

AprioriSome (1)

Determines which lengths to count using a next() function; next() takes as its parameter the length of the sequences counted in the last pass.

next(k) = k + 1 is the same as AprioriAll.

next() balances the tradeoff between counting non-maximal sequences and counting extensions of small candidate sequences.

Page 103:

AprioriSome (2)

hit_k = |L_k| / |C_k|

Intuition: as hit_k increases, the time wasted by counting extensions of small candidates decreases.

Page 104:

AprioriSome (3)

Page 105:

AprioriSome (4)

Backward phase: for all lengths that were skipped:
Delete sequences in the candidate set that are contained in some large sequence
Count the remaining candidates and find all sequences with minimum support
Also delete large sequences found in the forward phase that are non-maximal

Page 106:

AprioriSome (5)

Page 107:

AprioriSome (6)

Example, forward phase, with next(k) = 2k and minsup = 2. (Table of candidate 3-sequences C3 and their counts not reproduced in the transcript.)

Page 108:

AprioriSome (7)

Example, backward phase. (Table of candidate 3-sequences C3 not reproduced in the transcript.)

Page 109:

Performance of two algorithms

AprioriSome does a little better than AprioriAll, because it avoids counting many non-maximal sequences.

Page 110:

Advanced Association Rule Mining

Page 111:

Mining Various Kinds of Association Rules

Mining multilevel association
Mining multidimensional association
(Optional) Mining max and closed association patterns
(Optional) Constraint-based association mining

Page 112:

Mining Multiple-Level Association Rules

Items often form hierarchies. Flexible support settings: items at the lower level are expected to have lower support.

Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95).

Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% (Skim Milk at 4% is missed)
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% (Skim Milk is kept)

Page 113:

Multi-level Association: Redundancy Filtering

Some rules may be redundant due to "ancestor" relationships between items.

Example:
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

Page 114:

Mining Multi-Dimensional Association

Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")

Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates):
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension association rules (repeated predicates):
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

Categorical attributes: finite number of possible values, no ordering among values; data cube approach
Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches

Page 115:

Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97), one-dimensional clustering then association
4. Deviation (such as Aumann and Lindell @KDD'99), e.g., Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)

Page 116:

Static Discretization of Quantitative Attributes

Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster.

(Lattice of cuboids: () at the apex; (age), (income), (buys); (age, income), (age, buys), (income, buys); and (age, income, buys) at the base.)

Page 117:

Quantitative Association Rules

age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.

2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid.

Example:
age(X, 34) ∧ income(X, "31-40K") ⇒ buys(X, "HDTV")
age(X, 35) ∧ income(X, "31-40K") ⇒ buys(X, "HDTV")
age(X, 34) ∧ income(X, "31-50K") ⇒ buys(X, "HDTV")
age(X, 35) ∧ income(X, "31-50K") ⇒ buys(X, "HDTV")

Page 118:

Classification by Association Rule Analysis

Page 119:

Associative Classification

Associative classification, major steps:
Mine the data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
Association rules are generated in the form p_1 ∧ p_2 ∧ … ∧ p_l ⇒ "A_class = C" (conf, sup)
Organize the rules to form a rule-based classifier

Why effective?
It explores highly confident associations among multiple attributes, and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time.
Associative classification has been found to be often more accurate than some traditional classification methods, such as C4.5.

Page 120:

Typical Associative Classification Methods

CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD'98)
Mine possible association rules in the form cond-set (a set of attribute-value pairs) ⇒ class label
Build the classifier: organize rules according to decreasing precedence based on confidence and then support

CMAR (Classification based on Multiple Association Rules: Li, Han & Pei, ICDM'01)
Classification: statistical analysis on multiple rules

CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
Generation of predictive rules (FOIL-like analysis), but covered rules are retained with reduced weight
Prediction using the best k rules
High efficiency; accuracy similar to CMAR

Page 121:

CBA [Liu, Hsu and Ma, KDD’98]

• Basic idea
  • Mine high-confidence, high-support class association rules with Apriori
  • Rule LHS: a conjunction of conditions
  • Rule RHS: a class label
• Example:
  R1: age < 25 & credit = ‘good’ ⇒ buy iPhone (sup = 30%, conf = 80%)
  R2: age > 40 & income < 50K ⇒ not buy iPhone (sup = 40%, conf = 90%)

Page 122:

CBA

• Rule mining
  • Mine the set of association rules w.r.t. min_sup and min_conf
  • Rank the rules in descending order of confidence and support
  • Select rules to ensure coverage of the training instances
• Prediction (see the sketch below)
  • Apply the first rule that matches the test case
  • Otherwise, apply the default rule
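A minimal Python sketch of this rank-and-first-match scheme (the Rule class and function names are illustrative, not from the CBA paper; the rules are the ones from the example that follows):

from dataclasses import dataclass

@dataclass
class Rule:
    conds: tuple   # ((attribute, value), ...), the cond-set
    label: str     # class label on the RHS
    conf: float
    sup: float

    def matches(self, case: dict) -> bool:
        return all(case.get(a) == v for a, v in self.conds)

def cba_predict(rules, default_label, case):
    # Decreasing precedence: confidence first, then support.
    for r in sorted(rules, key=lambda r: (-r.conf, -r.sup)):
        if r.matches(case):
            return r.label
    return default_label

rules = [
    Rule((("age", "31...40"),), "yes", 1.00, 0.286),
    Rule((("student", "yes"), ("credit_rating", "fair")), "yes", 1.00, 0.286),
    Rule((("student", "yes"),), "yes", 0.857, 0.50),
]
case = {"age": "<=30", "income": "high", "student": "yes",
        "credit_rating": "fair"}
print(cba_predict(rules, "no", case))   # rule 2 fires -> "yes"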

Page 123:

CBA – An example

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no

min_sup = 25%, min_conf = 80%

• Rule mining

Rules:
1. age = 31…40 ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
2. student = yes & credit_rating = fair ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
3. student = yes ⇒ buys_computer = yes (conf = 85.7%, sup = 50%)
Default: buys_computer = no (the majority class of the training instances not covered by rules 1-3)

Page 124:

CBA - An example


• Prediction

Rules:
1. age = 31…40 ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
2. student = yes & credit_rating = fair ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
3. student = yes ⇒ buys_computer = yes (conf = 85.7%, sup = 50%)
Default: buys_computer = no

age      income  student  credit_rating
<=30     high    yes      fair
⇒ Apply Rule 2: buys_computer = yes

age      income  student  credit_rating
31…40    high    yes      excellent
⇒ Apply Rule 1: buys_computer = yes

age      income  student  credit_rating
<=30     high    no       excellent
⇒ Apply the default rule: buys_computer = no

Page 125:

CMAR [Li, Han and Pei, ICDM’01]

Basic idea
• Mining: build a class-distribution-associated FP-tree
• Prediction: combine the strength of multiple rules

Rule mining
• Mine association rules from the class-distribution-associated FP-tree
• Store and retrieve the association rules in a CR-tree
• Prune rules based on confidence, correlation, and database coverage

Page 126:

CMAR (Classification based on Multiple Association Rules) (1)

Adapted from FP-growth. Phases:
• rule generation or training: find rules R: P ⇒ c such that sup(R) and conf(R) pass the given thresholds, and
• classification or testing: predict the class of a new sample.

Page 127:

CMAR (Classification based on Multiple Association Rules) (2)

Training database T for CMAR algorithm (the support threshold is 2 and the confidence threshold is 70%).

ID A B C D Class

01 a1 b1 c1 d1 A

02 a1 b2 c1 d2 B

03 a2 b3 c2 d3 A

04 a1 b2 c3 d3 C

05 a1 b2 c1 d3 C

The FP-tree is a prefix tree with respect to the F-list.
F-list: (a1, b2, c1, d3)

Page 128:

Page 129:

CMAR (Classification based on Multiple Association Rules) (3)

Rule subsets:
• the rules containing d3;
• the rules containing c1 but not d3;
• the rules containing b2 but neither d3 nor c1; and
• the rules containing only a1.

d3-projected samples:

(a1, b2, c1, d3): C, (a1, b2, d3): C, and (d3): A

This yields the rule (a1, b2, d3) ⇒ C (sup = 2, conf = 100%).

(a1, c1) is a frequent pattern with support 3, but all of its rules have confidence below the threshold. Similar conclusions hold for pattern (a1, b2), and finally for (a1).
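A short Python check that recomputes the support and confidence of the rule (a1, b2, d3) ⇒ C directly from the training database T above:

# Each transaction is (item set, class label), copied from table T.
T = [
    ({"a1", "b1", "c1", "d1"}, "A"),
    ({"a1", "b2", "c1", "d2"}, "B"),
    ({"a2", "b3", "c2", "d3"}, "A"),
    ({"a1", "b2", "c3", "d3"}, "C"),
    ({"a1", "b2", "c1", "d3"}, "C"),
]
body = {"a1", "b2", "d3"}
covered = [cls for items, cls in T if body <= items]
sup = len(covered)                        # 2 (absolute support count)
conf = covered.count("C") / len(covered)  # 1.0
print(sup, conf)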

Page 130:

CMAR (Classification based on Multiple Association Rules) (4)

Classification (testing) phase
• If all the matching rules predict the same class, CMAR simply assigns that label to the new sample.
• If the rules disagree, they are divided into groups by class label, and the sample receives the class label of the “strongest” group.
• To compare the strength of groups, it is necessary to measure the “combined effect” of each group.
• If the rules in a group are highly positively correlated and have good support, the group should have a strong effect.

Page 131:

CMAR (Classification based on Multiple Association Rules) (5)

Possible ways to measure the combined effect of a group of rules:
• the highest χ² value
• a compound of the correlations
• integrating information on both correlation and population: the weighted χ²

Page 132:

CMAR (Classification based on Multiple Association Rules) (6)

Weighted χ²

maxχ² is the upper bound on the χ² value of a rule, with the other settings held fixed.

For each group of rules, the weighted χ² measure of the group is defined as follows.
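As given in the CMAR paper (Li, Han & Pei, ICDM’01), for a group G of rules:

    weighted χ² = Σ_{R ∈ G} χ²(R) · χ²(R) / maxχ²(R)

A one-line Python sketch of the same computation:

def weighted_chi2(group):
    # group: (chi2, max_chi2) pairs, one per rule in a class's group.
    # The class whose group has the largest weighted chi-square wins.
    return sum(chi2 * chi2 / max_chi2 for chi2, max_chi2 in group)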

Page 133:

CPAR [Yin and Han, SDM’03]

Basic idea
• Combine associative classification and FOIL-based rule generation
• FOIL gain: the criterion for selecting a literal (a common formulation is sketched below)
• Improves accuracy over traditional rule-based classifiers
• Improves efficiency, and reduces the number of rules, compared with association-rule-based methods
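The formula is not reproduced on the slide; a common formulation of FOIL gain (the criterion used by FOIL and CPAR, up to notation) is sketched below. Here p0/n0 count the positive/negative examples covered by the current rule body, and p1/n1 those still covered after appending the candidate literal:

from math import log2

def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    # Gain of appending a literal: positives kept, times the increase
    # in log-precision of the rule.
    if p1 == 0:
        return float("-inf")   # the literal removes all positive coverage
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Rule currently covers 10 positives / 10 negatives; the candidate
# literal keeps 8 positives and only 2 negatives.
print(foil_gain(10, 10, 8, 2))   # ~5.42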

Page 134:

CPAR (1)

Rule generation
• Build a rule by adding literals one by one, greedily, according to the FOIL gain measure
• Keep all close-to-the-best literals and build several rules simultaneously

Prediction (see the sketch below)
• Collect all rules matching the test case
• Select the best k rules for each class
• Choose the class with the highest expected accuracy
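A minimal Python sketch of this prediction step, assuming each matching rule carries an expected (e.g., Laplace) accuracy; the helper names are illustrative:

def cpar_predict(matching_rules, k=5):
    # matching_rules: (class_label, expected_accuracy) pairs for rules
    # whose bodies the test case satisfies.
    by_class = {}
    for label, acc in matching_rules:
        by_class.setdefault(label, []).append(acc)

    def score(accs):
        best = sorted(accs, reverse=True)[:k]   # best k rules per class
        return sum(best) / len(best)

    return max(by_class, key=lambda c: score(by_class[c]))

print(cpar_predict([("yes", 0.92), ("yes", 0.86), ("no", 0.88)]))  # yes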

Page 135:

CPAR (2)

Build rules by adding literals one by one. CPAR keeps all “close-to-the-best” literals during the rule-building process, i.e., it selects more than one literal at the same time and builds several rules simultaneously.

Page 136:

CPAR (3)

Suppose that, after finding the best literal p, another literal q has a gain similar to p’s (e.g., differing by at most 1%). Then, besides appending p to the current rule r, appending q to r creates a new rule r’.
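A toy Python sketch of the branching step (the gain values are invented; the 1% tolerance follows the slide):

def grow(rule, gains, tol=0.01):
    # Extend `rule` with the best literal, and also branch on any
    # runner-up whose gain is within `tol` of the best.
    best = max(gains, key=gains.get)
    branches = [rule + [best]]
    for q, g in gains.items():
        if q != best and g >= gains[best] * (1 - tol):
            branches.append(rule + [q])
    return branches

gains = {"A2=1": 4.00, "A3=1": 3.97, "A4=2": 2.00}
print(grow(["A1=2"], gains))
# -> [['A1=2', 'A2=1'], ['A1=2', 'A3=1']]  (two rules grown in parallel)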

Page 137:

How CPAR generates rules

Example
1. The literal (A1=2) has the highest FOIL gain.

[Diagram: rule-growth tree containing the single node A1=2]

Page 138:

How CPAR generates rules

2. After the first literal is selected, two literals, (A2=1) and (A3=1), are found to have similar gain, higher than all others.

[Diagram: A1=2 branches to A2=1 and A3=1]

Page 139:

How CPAR generates rules

3. Choose literal (A2=1) first. A rule is generated along this direction: (A1=2, A2=1, A4=1).

[Diagram: A1=2 branches to A2=1 and A3=1; the A2=1 branch extends to A4=1]

Page 140:

How CPAR generates rules

4. Then the rule (A1=2, A3=1) is taken as the current rule. Again, two literals with similar gain are selected.

[Diagram: A1=2 branches to A2=1 (extended to A4=1) and A3=1; from A3=1, the candidate literals A4=2 and A2=1 are selected]

Page 141:

How CPAR generates rules

5. Choose (A1=2, A3=1, A4=2) first. A rule is generated: (A1=2, A3=1, A4=2, A2=3).

[Diagram: A1=2 branches to A2=1 (extended to A4=1) and A3=1; A3=1 branches to A4=2 (extended to A2=3) and A2=1]

Page 142:

How CPAR generates rules

6. Finally, the rule (A1=2, A3=1, A2=1) is generated.

[Diagram: the completed rule-growth tree: A1=2 branches to A2=1 (extended to A4=1) and A3=1; A3=1 branches to A4=2 (extended to A2=3) and A2=1]

Page 143:

More reading on Associative Classification

Fadi Thabtah. A review of associative classification mining. The Knowledge Engineering Review, 22(1):37–65, 2007.

Page 144:

Q&A