Association Rules
presented by
Zbigniew W. Ras *,#)
*) University of North Carolina – Charlotte
#) Warsaw University of Technology
Analyze customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"
Customer1: Milk, eggs, sugar, bread
Customer2: Milk, eggs, cereal, bread
Customer3: Eggs, sugar
Market Basket Analysis (MBA)
Given: a database of customer transactions, where each transaction is a set of items
Find groups of items which are frequently purchased together
Market Basket Analysis
Extract information on purchasing behavior
Actionable information: can suggest
- new store layouts
- new product assortments
- which products to put on promotion
MBA is applicable whenever a customer purchases multiple things in proximity
Goal of MBA
Express how products/services relate to each other, and tend to group together
"if a customer purchases three-way calling, then he will also purchase call-waiting"
Simple to understand
Actionable information: bundle three-way calling and call-waiting in a single package
Association Rules
Transactions:
Relational format     Compact format
<Tid, item>           <Tid, itemset>
<1, item1>            <1, {item1, item2}>
<1, item2>            <2, {item3}>
<2, item3>
Item: a single element; Itemset: a set of items
Support of an itemset I [denoted by sup(I)]: the number of transactions that contain I
Threshold for minimum support: σ. Itemset I is frequent if sup(I) ≥ σ
A frequent itemset represents a set of items which are positively correlated
Basic Concepts
sup({dairy}) = 3
sup({fruit}) = 3
sup({dairy, fruit}) = 2
If σ = 3, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
Frequent Itemsets
Transaction ID   Items Bought
1                dairy, fruit
2                dairy, fruit, vegetable
3                dairy
4                fruit, cereals
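The support counts above can be checked with a minimal sketch in Python (the function name `sup` mirrors the slide's notation; the transaction encoding as sets is my own choice):

```python
# Support of an itemset = number of transactions that contain it.
transactions = [
    {"dairy", "fruit"},
    {"dairy", "fruit", "vegetable"},
    {"dairy"},
    {"fruit", "cereals"},
]

def sup(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

print(sup({"dairy"}, transactions))           # 3
print(sup({"fruit"}, transactions))           # 3
print(sup({"dairy", "fruit"}, transactions))  # 2
```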
{A, B} – a partition of a set of items
r = [A → B]
Support of r: sup(r) = sup(A ∪ B)
Confidence of r: conf(r) = sup(A ∪ B) / sup(A)
Thresholds: minimum support – s; minimum confidence – c
r ∈ AR(s, c), if sup(r) ≥ s and conf(r) ≥ c
Association Rules: AR(s,c)
For rule A → C:
sup(A → C) = 2
conf(A → C) = sup({A,C}) / sup({A}) = 2/3
The Apriori principle: any subset of a frequent itemset must be frequent
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support – 2 [50%]; Min. confidence – 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
Association Rules - Example
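The support and confidence of the rule A → C on this database can be reproduced with a small sketch (function names follow the slide's sup/conf notation; everything else is an illustrative choice of mine):

```python
# The slide's four transactions (Tids 2000, 1000, 4000, 5000).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def sup(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def conf(antecedent, consequent):
    """conf(A -> C) = sup(A ∪ C) / sup(A)."""
    return sup(antecedent | consequent) / sup(antecedent)

print(sup({"A", "C"}))     # 2
print(conf({"A"}, {"C"}))  # 0.6666666666666666
```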
The Apriori algorithm [Agrawal]
F_k : set of frequent itemsets of size k
C_k : set of candidate itemsets of size k

F_1 := {frequent items}; k := 1;
while card(F_k) ≥ 1 do begin
    C_{k+1} := new candidates generated from F_k;
    for each transaction t in the database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    F_{k+1} := candidates in C_{k+1} with minimum support;
    k := k + 1
end
Answer := ∪ { F_k : k ≥ 1 & card(F_k) ≥ 1 }
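The pseudocode above can be sketched as a runnable Python function. This is an unoptimized level-wise implementation for illustration only (the join/prune strategy via pairwise unions is one common choice, not necessarily the paper's exact candidate generation):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: F_k (frequent) -> C_{k+1} (candidates) -> count -> F_{k+1}."""
    transactions = [frozenset(t) for t in transactions]

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    F = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]
    frequent = {f: sup(f) for f in F}
    k = 1
    while F:
        # Join: build (k+1)-candidates as unions of pairs of frequent k-itemsets.
        C = {a | b for a in F for b in F if len(a | b) == k + 1}
        # Prune (Apriori principle): every k-subset of a candidate must be frequent.
        C = {c for c in C if all(frozenset(s) in frequent for s in combinations(c, k))}
        F = [c for c in C if sup(c) >= min_sup]
        frequent.update({f: sup(f) for f in F})
        k += 1
    return frequent

# Demo on the slide's database with min support 2:
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
for itemset, s in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Run on the slide's example it recovers exactly the frequent itemsets listed there: {A} 3, {B} 2, {C} 2, {A,C} 2.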
[Itemset lattice over {a, b, c, d}: singletons a, b, c, d; pairs {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {c,d}; triples {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}; and {a,b,c,d}]
{a,d} is not frequent, so the 3-itemsets {a,b,d}, {a,c,d} and the 4-itemset {a,b,c,d} are not generated.
Apriori - Example
Algorithm Apriori: Illustration
The task of mining association rules is mainly to discover strong association rules (high confidence and strong support) in large databases.
Mining association rules is composed of two steps:
1. Discover the large itemsets, i.e., the itemsets that have transaction support above a predetermined minimum support s.
2. Use the large itemsets to generate the association rules.

TID    Items
1000   A, B, C
2000   A, C
3000   A, D
4000   B, E, F

Large support itemsets (MinSup = 2):
{A}    3
{B}    2
{C}    2
{A,C}  2
Algorithm Apriori: Illustration (minimum support s = 2)

Database D:
TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

Scan D → C1 (candidate 1-itemsets with counts):
{A} 2   {B} 3   {C} 3   {D} 1   {E} 3
F1 (frequent 1-itemsets):
{A} 2   {B} 3   {C} 3   {E} 3

C2 (candidates generated from F1):
{A,B}   {A,C}   {A,E}   {B,C}   {B,E}   {C,E}
Scan D → counts:
{A,B} 1   {A,C} 2   {A,E} 1   {B,C} 2   {B,E} 3   {C,E} 2
F2:
{A,C} 2   {B,C} 2   {B,E} 3   {C,E} 2

C3 (candidate generated from F2): {B,C,E}
Scan D → {B,C,E} 2
F3: {B,C,E} 2
Definition 1. The cover C of a rule X → Y, denoted by C(X → Y), is defined as follows:
C(X → Y) = { [X ∪ Z] → V : Z, V are disjoint subsets of Y }.

Definition 2. The set RR(s, c) of Representative Association Rules is defined as follows:
RR(s, c) = { r ∈ AR(s, c) : there is no r' ∈ AR(s, c) with r' ≠ r and r ∈ C(r') }
s – threshold for minimum support
c – threshold for minimum confidence

Representative rules (informal description): [antecedent as short as possible] → [consequent as long as possible]
Representative Association Rules
Transactions:
{A, B, C, D, E}
{A, B, C, D, E, F}
{A, B, C, D, E, H, I}
{A, B, E}
{B, C, D, E, H, I}
Representative Association Rules
Find RR(2,80%)
Representative Rules
From (B, C, D, E, H, I):
{H} → {B, C, D, E, I}
{I} → {B, C, D, E, H}
From (A, B, C, D, E):
{A, C} → {B, D, E}
{A, D} → {B, C, E}
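Definition 1 above can be made concrete with a small sketch that enumerates the cover of a rule. The function names are mine, and I assume V must be nonempty (the slide does not say so explicitly):

```python
from itertools import combinations

def subsets(s):
    """All subsets of a set, including the empty set."""
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def cover(X, Y):
    """C(X -> Y) = { (X ∪ Z) -> V : Z, V disjoint subsets of Y }, assuming V nonempty."""
    X, Y = frozenset(X), frozenset(Y)
    rules = set()
    for Z in subsets(Y):
        for V in subsets(Y - Z):  # Y - Z guarantees Z and V are disjoint
            if V:
                rules.add((X | Z, V))
    return rules

# The representative rule {A,C} -> {B,D,E} covers, e.g., {A,C} -> {B,D}
# and {A,C,D} -> {B,E}, so neither of those needs to be reported separately.
c = cover({"A", "C"}, {"B", "D", "E"})
print((frozenset({"A", "C"}), frozenset({"B", "D"})) in c)       # True
print((frozenset({"A", "C", "D"}), frozenset({"B", "E"})) in c)  # True
```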
Transactions: abcde, abc, acde, bcde, bc, bde, cde

Frequent Pattern (FP) Growth Strategy (Minimum Support = 2)

Frequent items (support-descending): c – 6, b – 5, d – 5, e – 5, a – 3

Transactions reordered: cbdea, cba, cdea, cbde, cb, bde, cde
FP-tree (node : count):
null
+- c:6
|  +- b:4
|  |  +- d:2 -- e:2 -- a:1
|  |  +- a:1
|  +- d:2 -- e:2 -- a:1
+- b:1 -- d:1 -- e:1
Frequent Pattern (FP) Growth Strategy
Mining the FP-tree for frequent itemsets:
1. Start from each item and construct a subdatabase of transactions (prefix paths) with that item listed at the end.
2. Reorder the prefix paths in support-descending order.
3. Build a conditional FP-tree.
a – 3; Prefix paths: (c b d e a : 1), (c b a : 1), (c d e a : 1)
Support-descending order within the prefix paths: c – 3, b – 2, d – 2, e – 2
Prefix-path tree for a (node : count):
null
+- c:3
   +- b:2
   |  +- d:1 -- e:1 -- a:1
   |  +- a:1
   +- d:1 -- e:1 -- a:1
Frequent Pattern (FP) Growth Strategy
a – 3; Prefix paths: (c b d e a : 1), (c b a : 1), (c d e a : 1)
Conditional FP-tree for a (node : count):
null
+- c:3
   +- b:2 -- d:1 -- e:1
   +- d:1 -- e:1
Frequent itemsets: (c a : 3), (c b a : 2), (c d a : 2), (c d e a : 2), (c e a : 2)
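The supports that FP-growth reports for these itemsets can be cross-checked by brute-force counting over the original seven transactions (the string encoding of transactions and patterns here is my own shorthand):

```python
# The seven transactions of the FP-growth example, one string per transaction.
transactions = ["abcde", "abc", "acde", "bcde", "bc", "bde", "cde"]

def sup(itemset):
    """Number of transactions containing every item of the pattern."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

# Supports of the itemsets ending in 'a' reported by FP-growth:
for pattern in ["ca", "cba", "cda", "cdea", "cea"]:
    print(pattern, sup(pattern))  # ca 3, then 2 for each of the others
```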
Multidimensional AR
Associations between values of different attributes:
RULES:
[nationality = French] → [income = high]      [50%, 100%]
[income = high] → [nationality = French]      [50%, 75%]
[age = 50] → [nationality = Italian]          [33%, 100%]
Multi-dimensional            Single-dimensional
<1, Italian, 50, low>        <1, {nat/Ita, age/50, inc/low}>
<2, French, 45, high>        <2, {nat/Fre, age/45, inc/high}>

Schema: <ID, a?, b?, c?, d?>
<1, yes, yes, no, no>        <1, {a, b}>
<2, yes, no, yes, no>        <2, {a, c}>
Single-dimensional AR vs Multi-dimensional
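The conversion in the table above can be sketched in a few lines: each attribute/value pair becomes a single item, after which any single-dimensional miner (e.g. Apriori) applies. The "attr/value" encoding follows the slide; the function name is my own:

```python
def to_single_dimensional(tid, record):
    """Flatten a multi-dimensional tuple into a set of attribute/value items."""
    return (tid, {f"{attr}/{val}" for attr, val in record.items()})

tid, items = to_single_dimensional(1, {"nat": "Ita", "age": 50, "inc": "low"})
print(tid, sorted(items))  # 1 ['age/50', 'inc/low', 'nat/Ita']
```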
Quantitative Attributes
Quantitative attributes (e.g. age, income)
Categorical attributes (e.g. color of car)
Problem: too many distinct values
Solution: transform quantitative attributes into categorical ones via discretization.
Quantitative attributes are statically discretized using predefined concept hierarchies:
- elementary use of background knowledge
- loose interaction between Apriori and Discretizer

Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data, considering the distance between data points:
- tighter interaction between Apriori and Discretizer
Discretization of quantitative attributes
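As a minimal sketch of dynamic binning, here is equi-width discretization, one simple strategy among many (the function name, bin labels, and sample ages are my own illustrative choices, not the slides' method):

```python
def equi_width_bins(values, k):
    """Split the value range into k equal-width bins and label each value with its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def bin_of(v):
        i = min(int((v - lo) / width), k - 1)  # clamp the maximum value into the last bin
        return f"[{lo + i * width:g},{lo + (i + 1) * width:g})"
    return [bin_of(v) for v in values]

ages = [18, 22, 29, 35, 44, 61, 80]
print(equi_width_bins(ages, 3))
```

After this step each bin label is an ordinary categorical item, so Apriori can be run unchanged.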
Preprocessing: use constraints to focus on a subset of transactions
- Example: find association rules where the prices of all items are at most 200 Euro

Optimizations: use constraints to optimize the Apriori algorithm
- Anti-monotonicity: when a set violates the constraint, so does any of its supersets
- The Apriori algorithm uses this property for pruning
- Push constraints as deep as possible inside the frequent set computation
Constraint-based AR
Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint.
Examples:
- [Price(S) ≤ v] is anti-monotone
- [Price(S) ≥ v] is not anti-monotone
- [Price(S) = v] is partly anti-monotone
Application: push [Price(S) ≤ 1000] deeply into the iterative frequent set computation.
Apriori property revisited
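A minimal sketch of why [Price(S) ≤ v] can be pushed into candidate generation: once a candidate is too expensive, no superset can recover, so it is discarded before any support counting. The item prices here are hypothetical:

```python
# Hypothetical item prices for illustration.
price = {"A": 300, "B": 500, "C": 900}

def satisfies(itemset, v=1000):
    """Anti-monotone constraint: total price of the itemset is at most v."""
    return sum(price[i] for i in itemset) <= v

# Candidates violating the constraint are pruned before support counting;
# by anti-monotonicity, none of their supersets need to be generated either.
candidates = [{"A"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
kept = [c for c in candidates if satisfies(c)]
# keeps {A} (300) and {A,B} (800); prunes {A,C} (1200) and {B,C} (1400)
```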
Post-processing
- A naive solution: apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.

Optimization
- Han's approach: a comprehensive analysis of the properties of constraints, trying to push them as deeply as possible inside the frequent set computation.
Mining Association Rules with Constraints
It is difficult to find interesting patterns at too primitive a level:
- high support = too few rules
- low support = too many rules, most uninteresting
Approach: reason at a suitable level of abstraction.
A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts.
Dimensions and levels can be efficiently encoded in transactions.
Multilevel Association Rules: rules which combine associations with a hierarchy of concepts.
Multilevel AR
Hierarchy of concepts (levels: Department, Sector, Family, Product):
FoodStuff (Department)
+- Frozen (Sector)
+- Refrigerated (Sector)
+- Fresh (Sector)
|  +- Vegetable (Family)
|  +- Fruit (Family)
|  |  +- Banana, Apple, Orange, etc. (Products)
|  +- Dairy (Family)
+- Bakery (Sector)
+- Etc.
Multilevel AR
Fresh     [support = 20%]
Dairy     [support = 6%]
Fruit     [support = 1%]
Vegetable [support = 7%]

Fresh → Bakery   [20%, 60%]
Dairy → Bread    [6%, 50%]
Fruit → Bread    [1%, 50%] is not valid
Support and Confidence of Multilevel Association Rules
Generalizing/specializing values of attributes affects support and confidence:
- from specialized to general: support of rules increases (new rules may become valid)
- from general to specialized: support of rules decreases (rules may become not valid, as their support falls under the threshold)
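The effect of generalization on support can be sketched with a tiny hierarchy (the items, parent map, and transactions are my own illustrative choices; the point is only that a parent concept's support is at least that of each child):

```python
# Child-concept -> parent-concept map (a fragment of a concept hierarchy).
parent = {"dairy": "fresh", "fruit": "fresh", "bread": "bakery"}

transactions = [{"dairy", "bread"}, {"fruit"}, {"dairy"}, {"bread"}]

def sup(item, ts):
    return sum(1 for t in ts if item in t)

def generalize(t):
    """Replace every item by its parent concept (items without a parent stay)."""
    return {parent.get(i, i) for i in t}

general = [generalize(t) for t in transactions]
# Specialized supports: dairy = 2, fruit = 1; generalized: fresh = 3.
print(sup("dairy", transactions), sup("fruit", transactions), sup("fresh", general))
```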
Hierarchical attributes: age, salary
Association rule: (age, young) → (salary, 40k)

age:    young (18 … 29), middle-aged (30 … 60), old (61 … 80)
salary: low (10k … 40k), medium (50k … 60k), high (70k … 100k)

Candidate association rules:
(age, 18 … 29) → (salary, 40k)
(age, young) → (salary, low)
(age, 18 … 29) → (salary, low)
Mining Multilevel AR
Calculate frequent itemsets at each concept level, until no more frequent itemsets can be found.
For each level use Apriori.
A top-down, progressive-deepening approach:
- First find high-level strong rules: fresh → bakery [20%, 60%]
- Then find their lower-level "weaker" rules: dairy → bread [6%, 50%]
Variations at mining multiple-level association rules:
- Level-crossed association rules: fruit → wheat bread
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
- One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
- If the support threshold is too high → miss low-level associations; if too low → generate too many high-level associations.

Reduced Support: reduced minimum support at lower levels; different strategies possible.
Beyond Support and Confidence

Example 1 (Aggarwal & Yu):
{tea} => {coffee} has high support (20%) and confidence (80%)
However, the a priori probability that a customer buys coffee is 90%
There is a negative correlation between buying tea and buying coffee
{~tea} => {coffee} has higher confidence (93%)

           coffee   not coffee   sum(row)
tea          20         5           25
not tea      70         5           75
sum(col.)    90        10          100
Correlation and Interest
Two events A and B are independent if P(A ∧ B) = P(A) * P(B); otherwise they are correlated.
Interest = P(A ∧ B) / (P(A) * P(B))
Interest expresses a measure of correlation. If it is:
- equal to 1: A and B are independent events
- less than 1: A and B are negatively correlated
- greater than 1: A and B are positively correlated
In our example, I(drink tea ∧ drink coffee) = 0.89, i.e., they are negatively correlated.
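The interest value for the tea/coffee example can be recomputed directly from the contingency table (variable names are my own):

```python
# Probabilities read off the tea/coffee contingency table (n = 100 customers).
n = 100
p_tea_and_coffee = 20 / n
p_tea = 25 / n
p_coffee = 90 / n

# Interest = P(A ∧ B) / (P(A) * P(B)); < 1 means negative correlation.
interest = p_tea_and_coffee / (p_tea * p_coffee)
print(round(interest, 2))  # 0.89
```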
Questions?
Thank You