Association Rules
presented by
Zbigniew W. Ras *,#)
*) University of North Carolina – Charlotte
#) Warsaw University of Technology
Analyze customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"
Customer1: Milk, eggs, sugar, bread
Customer2: Milk, eggs, cereal, bread
Customer3: Eggs, sugar
Market Basket Analysis (MBA)
Given: a database of customer transactions, where each transaction is a set of items
Find groups of items which are frequently purchased together
Market Basket Analysis
Extract information on purchasing behavior
Actionable information: can suggest
- new store layouts
- new product assortments
- which products to put on promotion
MBA is applicable whenever a customer purchases multiple things in proximity
Goal of MBA
Express how products/services relate to each other, and tend to group together
"if a customer purchases three-way calling, then he will also purchase call-waiting"
Simple to understand
Actionable information: bundle three-way calling and call-waiting in a single package
Association Rules
Transactions:
Relational format     Compact format
<Tid, item>           <Tid, itemset>
<1, item1>            <1, {item1, item2}>
<1, item2>            <2, {item3}>
<2, item3>
Item: a single element; Itemset: a set of items
Support of an itemset I [denoted by sup(I)]: the number of transactions that contain I
Threshold for minimum support: σ. Itemset I is frequent if sup(I) ≥ σ
A frequent itemset represents a set of items which are positively correlated
Basic Concepts
sup({dairy}) = 3
sup({fruit}) = 3
sup({dairy, fruit}) = 2
If σ = 3, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
Frequent Itemsets
Transaction ID   Items Bought
1                dairy, fruit
2                dairy, fruit, vegetable
3                dairy
4                fruit, cereals
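The support counts above can be checked with a minimal sketch in Python (the function name `sup` mirrors the slide's notation; the transaction encoding as sets is my own choice):

```python
# Support of an itemset = number of transactions that contain it.
transactions = [
    {"dairy", "fruit"},
    {"dairy", "fruit", "vegetable"},
    {"dairy"},
    {"fruit", "cereals"},
]

def sup(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

print(sup({"dairy"}, transactions))           # 3
print(sup({"fruit"}, transactions))           # 3
print(sup({"dairy", "fruit"}, transactions))  # 2
```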
{A, B} – a partition of a set of items
r = [A → B]
Support of r: sup(r) = sup(A ∪ B)
Confidence of r: conf(r) = sup(A ∪ B) / sup(A)
Thresholds: minimum support – s; minimum confidence – c
r ∈ AR(s, c), if sup(r) ≥ s and conf(r) ≥ c
Association Rules: AR(s,c)
For rule A → C:
sup(A → C) = 2
conf(A → C) = sup({A,C}) / sup({A}) = 2/3
The Apriori principle: any subset of a frequent itemset must be frequent
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support – 2 [50%]; Min. confidence – 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
Association Rules - Example
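The support and confidence of the rule A → C on this database can be reproduced with a small sketch (function names follow the slide's sup/conf notation; everything else is an illustrative choice of mine):

```python
# The slide's four transactions (Tids 2000, 1000, 4000, 5000).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def sup(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def conf(antecedent, consequent):
    """conf(A -> C) = sup(A ∪ C) / sup(A)."""
    return sup(antecedent | consequent) / sup(antecedent)

print(sup({"A", "C"}))     # 2
print(conf({"A"}, {"C"}))  # 0.6666666666666666
```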
The Apriori algorithm [Agrawal]
F_k : set of frequent itemsets of size k
C_k : set of candidate itemsets of size k

F_1 := {frequent items}; k := 1;
while card(F_k) ≥ 1 do begin
    C_{k+1} := new candidates generated from F_k;
    for each transaction t in the database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    F_{k+1} := candidates in C_{k+1} with minimum support;
    k := k + 1
end
Answer := ∪ { F_k : k ≥ 1 & card(F_k) ≥ 1 }
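The pseudocode above can be sketched as a runnable Python function. This is an unoptimized level-wise implementation for illustration only (the join/prune strategy via pairwise unions is one common choice, not necessarily the paper's exact candidate generation):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: F_k (frequent) -> C_{k+1} (candidates) -> count -> F_{k+1}."""
    transactions = [frozenset(t) for t in transactions]

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    F = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]
    frequent = {f: sup(f) for f in F}
    k = 1
    while F:
        # Join: build (k+1)-candidates as unions of pairs of frequent k-itemsets.
        C = {a | b for a in F for b in F if len(a | b) == k + 1}
        # Prune (Apriori principle): every k-subset of a candidate must be frequent.
        C = {c for c in C if all(frozenset(s) in frequent for s in combinations(c, k))}
        F = [c for c in C if sup(c) >= min_sup]
        frequent.update({f: sup(f) for f in F})
        k += 1
    return frequent

# Demo on the slide's database with min support 2:
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
for itemset, s in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Run on the slide's example it recovers exactly the frequent itemsets listed there: {A} 3, {B} 2, {C} 2, {A,C} 2.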
[Itemset lattice over {a, b, c, d}: singletons a, b, c, d; pairs {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {c,d}; triples {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}; and {a,b,c,d}]
{a,d} is not frequent, so the 3-itemsets {a,b,d}, {a,c,d} and the 4-itemset {a,b,c,d} are not generated.
Apriori - Example
Algorithm Apriori: Illustration
The task of mining association rules is mainly to discover strong association rules (high confidence and strong support) in large databases.
Mining association rules is composed of two steps:
1. Discover the large itemsets, i.e., the itemsets that have transaction support above a predetermined minimum support s.
2. Use the large itemsets to generate the association rules.

TID    Items
1000   A, B, C
2000   A, C
3000   A, D
4000   B, E, F

Large support itemsets (MinSup = 2):
{A}    3
{B}    2
{C}    2
{A,C}  2
Algorithm Apriori: Illustration (minimum support s = 2)

Database D:
TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

Scan D → C1 (candidate 1-itemsets with counts):
{A} 2   {B} 3   {C} 3   {D} 1   {E} 3
F1 (frequent 1-itemsets):
{A} 2   {B} 3   {C} 3   {E} 3

C2 (candidates generated from F1):
{A,B}   {A,C}   {A,E}   {B,C}   {B,E}   {C,E}
Scan D → counts:
{A,B} 1   {A,C} 2   {A,E} 1   {B,C} 2   {B,E} 3   {C,E} 2
F2:
{A,C} 2   {B,C} 2   {B,E} 3   {C,E} 2

C3 (candidate generated from F2): {B,C,E}
Scan D → {B,C,E} 2
F3: {B,C,E} 2
Definition 1. The cover C of a rule X → Y, denoted by C(X → Y), is defined as follows:
C(X → Y) = { [X ∪ Z] → V : Z, V are disjoint subsets of Y }.

Definition 2. The set RR(s, c) of Representative Association Rules is defined as follows:
RR(s, c) = { r ∈ AR(s, c) : there is no r' ∈ AR(s, c) with r' ≠ r and r ∈ C(r') }
s – threshold for minimum support
c – threshold for minimum confidence

Representative rules (informal description): [antecedent as short as possible] → [consequent as long as possible]
Representative Association Rules
Transactions:
{A, B, C, D, E}
{A, B, C, D, E, F}
{A, B, C, D, E, H, I}
{A, B, E}
{B, C, D, E, H, I}
Representative Association Rules
Find RR(2,80%)
Representative Rules
From (B, C, D, E, H, I):
{H} → {B, C, D, E, I}
{I} → {B, C, D, E, H}
From (A, B, C, D, E):
{A, C} → {B, D, E}
{A, D} → {B, C, E}
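Definition 1 above can be made concrete with a small sketch that enumerates the cover of a rule. The function names are mine, and I assume V must be nonempty (the slide does not say so explicitly):

```python
from itertools import combinations

def subsets(s):
    """All subsets of a set, including the empty set."""
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def cover(X, Y):
    """C(X -> Y) = { (X ∪ Z) -> V : Z, V disjoint subsets of Y }, assuming V nonempty."""
    X, Y = frozenset(X), frozenset(Y)
    rules = set()
    for Z in subsets(Y):
        for V in subsets(Y - Z):  # Y - Z guarantees Z and V are disjoint
            if V:
                rules.add((X | Z, V))
    return rules

# The representative rule {A,C} -> {B,D,E} covers, e.g., {A,C} -> {B,D}
# and {A,C,D} -> {B,E}, so neither of those needs to be reported separately.
c = cover({"A", "C"}, {"B", "D", "E"})
print((frozenset({"A", "C"}), frozenset({"B", "D"})) in c)       # True
print((frozenset({"A", "C", "D"}), frozenset({"B", "E"})) in c)  # True
```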
Transactions: abcde, abc, acde, bcde, bc, bde, cde

Frequent Pattern (FP) Growth Strategy (Minimum Support = 2)

Frequent items (support-descending): c – 6, b – 5, d – 5, e – 5, a – 3

Transactions reordered: cbdea, cba, cdea, cbde, cb, bde, cde
FP-tree (node : count):
null
+- c:6
|  +- b:4
|  |  +- d:2 -- e:2 -- a:1
|  |  +- a:1
|  +- d:2 -- e:2 -- a:1
+- b:1 -- d:1 -- e:1
Frequent Pattern (FP) Growth Strategy
Mining the FP-tree for frequent itemsets:
1. Start from each item and construct a subdatabase of transactions (prefix paths) with that item listed at the end.
2. Reorder the prefix paths in support-descending order.
3. Build a conditional FP-tree.
a – 3; Prefix paths: (c b d e a : 1), (c b a : 1), (c d e a : 1)
Support-descending order within the prefix paths: c – 3, b – 2, d – 2, e – 2
Prefix-path tree for a (node : count):
null
+- c:3
   +- b:2
   |  +- d:1 -- e:1 -- a:1
   |  +- a:1
   +- d:1 -- e:1 -- a:1
Frequent Pattern (FP) Growth Strategy
a – 3; Prefix paths: (c b d e a : 1), (c b a : 1), (c d e a : 1)
Conditional FP-tree for a (node : count):
null
+- c:3
   +- b:2 -- d:1 -- e:1
   +- d:1 -- e:1
Frequent itemsets: (c a : 3), (c b a : 2), (c d a : 2), (c d e a : 2), (c e a : 2)
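The supports that FP-growth reports for these itemsets can be cross-checked by brute-force counting over the original seven transactions (the string encoding of transactions and patterns here is my own shorthand):

```python
# The seven transactions of the FP-growth example, one string per transaction.
transactions = ["abcde", "abc", "acde", "bcde", "bc", "bde", "cde"]

def sup(itemset):
    """Number of transactions containing every item of the pattern."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

# Supports of the itemsets ending in 'a' reported by FP-growth:
for pattern in ["ca", "cba", "cda", "cdea", "cea"]:
    print(pattern, sup(pattern))  # ca 3, then 2 for each of the others
```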
Multidimensional AR
Associations between values of different attributes:
RULES:
[nationality = French] → [income = high]      [50%, 100%]
[income = high] → [nationality = French]      [50%, 75%]
[age = 50] → [nationality = Italian]          [33%, 100%]
Multi-dimensional            Single-dimensional
<1, Italian, 50, low>        <1, {nat/Ita, age/50, inc/low}>
<2, French, 45, high>        <2, {nat/Fre, age/45, inc/high}>

Schema: <ID, a?, b?, c?, d?>
<1, yes, yes, no, no>        <1, {a, b}>
<2, yes, no, yes, no>        <2, {a, c}>
Single-dimensional AR vs Multi-dimensional
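The conversion in the table above can be sketched in a few lines: each attribute/value pair becomes a single item, after which any single-dimensional miner (e.g. Apriori) applies. The "attr/value" encoding follows the slide; the function name is my own:

```python
def to_single_dimensional(tid, record):
    """Flatten a multi-dimensional tuple into a set of attribute/value items."""
    return (tid, {f"{attr}/{val}" for attr, val in record.items()})

tid, items = to_single_dimensional(1, {"nat": "Ita", "age": 50, "inc": "low"})
print(tid, sorted(items))  # 1 ['age/50', 'inc/low', 'nat/Ita']
```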
Quantitative Attributes
Quantitative attributes (e.g. age, income)
Categorical attributes (e.g. color of car)
Problem: too many distinct values
Solution: transform quantitative attributes into categorical ones via discretization.
Quantitative attributes are statically discretized using predefined concept hierarchies:
- elementary use of background knowledge
- loose interaction between Apriori and Discretizer

Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data, considering the distance between data points:
- tighter interaction between Apriori and Discretizer
Discretization of quantitative attributes
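As a minimal sketch of dynamic binning, here is equi-width discretization, one simple strategy among many (the function name, bin labels, and sample ages are my own illustrative choices, not the slides' method):

```python
def equi_width_bins(values, k):
    """Split the value range into k equal-width bins and label each value with its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def bin_of(v):
        i = min(int((v - lo) / width), k - 1)  # clamp the maximum value into the last bin
        return f"[{lo + i * width:g},{lo + (i + 1) * width:g})"
    return [bin_of(v) for v in values]

ages = [18, 22, 29, 35, 44, 61, 80]
print(equi_width_bins(ages, 3))
```

After this step each bin label is an ordinary categorical item, so Apriori can be run unchanged.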
Preprocessing: use constraints to focus on a subset of transactions
- Example: find association rules where the prices of all items are at most 200 Euro

Optimizations: use constraints to optimize the Apriori algorithm
- Anti-monotonicity: when a set violates the constraint, so does any of its supersets
- The Apriori algorithm uses this property for pruning
- Push constraints as deep as possible inside the frequent set computation
Constraint-based AR
Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint.
Examples:
- [Price(S) ≤ v] is anti-monotone
- [Price(S) ≥ v] is not anti-monotone
- [Price(S) = v] is partly anti-monotone
Application: push [Price(S) ≤ 1000] deeply into the iterative frequent set computation.
Apriori property revisited
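A minimal sketch of why [Price(S) ≤ v] can be pushed into candidate generation: once a candidate is too expensive, no superset can recover, so it is discarded before any support counting. The item prices here are hypothetical:

```python
# Hypothetical item prices for illustration.
price = {"A": 300, "B": 500, "C": 900}

def satisfies(itemset, v=1000):
    """Anti-monotone constraint: total price of the itemset is at most v."""
    return sum(price[i] for i in itemset) <= v

# Candidates violating the constraint are pruned before support counting;
# by anti-monotonicity, none of their supersets need to be generated either.
candidates = [{"A"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
kept = [c for c in candidates if satisfies(c)]
# keeps {A} (300) and {A,B} (800); prunes {A,C} (1200) and {B,C} (1400)
```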
Post-processing
- A naive solution: apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.

Optimization
- Han's approach: a comprehensive analysis of the properties of constraints, trying to push them as deeply as possible inside the frequent set computation.
Mining Association Rules with Constraints
It is difficult to find interesting patterns at too primitive a level:
- high support = too few rules
- low support = too many rules, most uninteresting
Approach: reason at a suitable level of abstraction.
A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts.
Dimensions and levels can be efficiently encoded in transactions.
Multilevel Association Rules: rules which combine associations with a hierarchy of concepts.
Multilevel AR
Hierarchy of concepts (levels: Department, Sector, Family, Product):
FoodStuff (Department)
+- Frozen (Sector)
+- Refrigerated (Sector)
+- Fresh (Sector)
|  +- Vegetable (Family)
|  +- Fruit (Family)
|  |  +- Banana, Apple, Orange, etc. (Products)
|  +- Dairy (Family)
+- Bakery (Sector)
+- Etc.
Multilevel AR
Fresh     [support = 20%]
Dairy     [support = 6%]
Fruit     [support = 1%]
Vegetable [support = 7%]

Fresh → Bakery   [20%, 60%]
Dairy → Bread    [6%, 50%]
Fruit → Bread    [1%, 50%] is not valid
Support and Confidence of Multilevel Association Rules
Generalizing/specializing values of attributes affects support and confidence:
- from specialized to general: support of rules increases (new rules may become valid)
- from general to specialized: support of rules decreases (rules may become not valid, as their support falls under the threshold)
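The effect of generalization on support can be sketched with a tiny hierarchy (the items, parent map, and transactions are my own illustrative choices; the point is only that a parent concept's support is at least that of each child):

```python
# Child-concept -> parent-concept map (a fragment of a concept hierarchy).
parent = {"dairy": "fresh", "fruit": "fresh", "bread": "bakery"}

transactions = [{"dairy", "bread"}, {"fruit"}, {"dairy"}, {"bread"}]

def sup(item, ts):
    return sum(1 for t in ts if item in t)

def generalize(t):
    """Replace every item by its parent concept (items without a parent stay)."""
    return {parent.get(i, i) for i in t}

general = [generalize(t) for t in transactions]
# Specialized supports: dairy = 2, fruit = 1; generalized: fresh = 3.
print(sup("dairy", transactions), sup("fruit", transactions), sup("fresh", general))
```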
Hierarchical attributes: age, salary
Association rule: (age, young) → (salary, 40k)

age:    young (18 … 29), middle-aged (30 … 60), old (61 … 80)
salary: low (10k … 40k), medium (50k … 60k), high (70k … 100k)

Candidate association rules:
(age, 18 … 29) → (salary, 40k)
(age, young) → (salary, low)
(age, 18 … 29) → (salary, low)
Mining Multilevel AR
Calculate frequent itemsets at each concept level, until no more frequent itemsets can be found.
For each level use Apriori.
A top-down, progressive-deepening approach:
- First find high-level strong rules: fresh → bakery [20%, 60%]
- Then find their lower-level "weaker" rules: dairy → bread [6%, 50%]
Variations at mining multiple-level association rules:
- Level-crossed association rules: fruit → wheat bread
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
- One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
- If the support threshold is too high → miss low-level associations; if too low → generate too many high-level associations.

Reduced Support: reduced minimum support at lower levels; different strategies possible.
Beyond Support and Confidence

Example 1 (Aggarwal & Yu):
{tea} => {coffee} has high support (20%) and confidence (80%)
However, the a priori probability that a customer buys coffee is 90%
There is a negative correlation between buying tea and buying coffee
{~tea} => {coffee} has higher confidence (93%)

           coffee   not coffee   sum(row)
tea          20         5           25
not tea      70         5           75
sum(col.)    90        10          100
Correlation and Interest
Two events A and B are independent if P(A ∧ B) = P(A) * P(B); otherwise they are correlated.
Interest = P(A ∧ B) / (P(A) * P(B))
Interest expresses a measure of correlation. If it is:
- equal to 1: A and B are independent events
- less than 1: A and B are negatively correlated
- greater than 1: A and B are positively correlated
In our example, I(drink tea ∧ drink coffee) = 0.89, i.e., they are negatively correlated.
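The interest value for the tea/coffee example can be recomputed directly from the contingency table (variable names are my own):

```python
# Probabilities read off the tea/coffee contingency table (n = 100 customers).
n = 100
p_tea_and_coffee = 20 / n
p_tea = 25 / n
p_coffee = 90 / n

# Interest = P(A ∧ B) / (P(A) * P(B)); < 1 means negative correlation.
interest = p_tea_and_coffee / (p_tea * p_coffee)
print(round(interest, 2))  # 0.89
```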
Questions?
Thank You