Mining Association Rules

Mining Association Rules
What is association rule mining?
Apriori Algorithm
Measures of rule interestingness
Advanced Techniques
What Is Association Rule Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases.
Understand customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket".
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection (supervisor -> examiner).
What Is Association Rule Mining?
Rule form:
Antecedent => Consequent [support, confidence]
(support and confidence are user-defined measures of interestingness)
Examples:
buys(x, "computer") => buys(x, "financial management software") [0.5%, 60%]
age(x, "30..39") ^ income(x, "42..48K") => buys(x, "car") [1%, 75%]
How can Association Rules be used?
Probably mom was calling dad at work to buy diapers on the way home, and he decided to buy a six-pack as well.
The retailer could move diapers and beer to separate places and position high-profit items of interest to young fathers along the path.
How can Association Rules be used?
Let the rule discovered be
{Bagels, ...} => {Potato Chips}
Potato chips as consequent => can be used to determine what should be done to boost their sales.
Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in the antecedent and Potato chips in the consequent => can be used to see what products should be sold with bagels to promote the sale of potato chips.
Basic Concepts
Given: (1) a database of transactions; (2) each transaction is a list of items purchased by a customer in a visit.
Find: all rules that correlate the presence of one set of items (itemset) with that of another set of items.
E.g., 35% of people who buy salmon also buy cheese.
TX1  Shoes, Socks, Tie, Belt
TX2  Shoes, Socks, Tie, Belt, Shirt, Hat
TX3  Shoes, Tie
TX4  Shoes, Socks, Belt

Transaction  Shoes  Socks  Tie  Belt  Shirt  Scarf  Hat
1            1      1      1    1     0      0      0
2            1      1      1    1     1      0      1
3            1      0      1    0     0      0      0
4            1      1      0    1     0      0      0

Rule: Socks => Tie
Support is 50% (2/4)
Confidence is 66.67% (2/3)
Rule basic Measures
A => B [s, c]
Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a large part of the database.
support(A => B [s, c]) = p(A, B)
Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability.
confidence(A => B [s, c]) = p(B|A) = sup(A, B) / sup(A)
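To make the two measures concrete, here is a minimal sketch in base R (the language used in the scripting section later in these slides). It recomputes the Socks => Tie numbers from the TX1-TX4 example above; the list encoding of the transactions is an assumption made for illustration.

# The four transactions TX1-TX4, encoded as character vectors.
transactions <- list(
  c("Shoes", "Socks", "Tie", "Belt"),
  c("Shoes", "Socks", "Tie", "Belt", "Shirt", "Hat"),
  c("Shoes", "Tie"),
  c("Shoes", "Socks", "Belt")
)

# support(itemset): fraction of transactions containing every item of the itemset.
support <- function(itemset)
  mean(sapply(transactions, function(t) all(itemset %in% t)))

# confidence(A => B) = sup(A, B) / sup(A)
confidence <- function(A, B) support(c(A, B)) / support(A)

support(c("Socks", "Tie"))   # 0.5    -> 50% support
confidence("Socks", "Tie")   # 0.6667 -> 66.67% confidence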
(Figure: a set of transactions, illustrating the support and confidence of a rule C => D.)
Applications
"Baskets" = documents; "items" = words in those documents.
Lets us find words that appear together unusually frequently, i.e., linked concepts.

        Word 1  Word 2  Word 3  Word 4
Doc 1   1       0       1       1
Doc 2   0       0       1       1
Doc 3   1       1       1       0

Word 4 => Word 3
When word 4 occurs in a document, there is a big probability of word 3 occurring.
Applications
"Baskets" = sentences; "items" = documents containing those sentences.
Items that appear together too often could represent plagiarism.

        Doc 1  Doc 2  Doc 3  Doc 4
Sent 1  1      0      1      1
Sent 2  0      0      1      1
Sent 3  1      1      1      0

Doc 4 => Doc 3
When a sentence occurs in document 4, there is a big probability of it occurring in document 3.
Applications
"Baskets" = Web pages; "items" = linked pages.
Pairs of pages with many common references may be about the same topic.
"Baskets" = Web pages p_i; "items" = pages that link to p_i.
Pages with many of the same links may be mirrors or about the same topic.
Example

Trans. Id  Purchased Items
1          A,D
2          A,C
3          A,B,C
4          B,E,F

Definitions:
Itemset: A,B or B,E,F
Support of an itemset: Sup(A,B) = 1; Sup(A,C) = 2
Frequent pattern: given min. sup = 2, {A,C} is a frequent pattern.

For minimum support = 50% and minimum confidence = 50%, we have the following rules:
A => C with 50% support and 66% confidence
C => A with 50% support and 100% confidence
The support probability of each rule is identified by the color of the symbols, and the confidence probability of each rule is identified by the shape of the symbols. The items that have the highest confidence are Heineken beer, crackers, chicken, and peppers.
(Figures: Randall Matignon, 2007, Data Mining Using SAS Enterprise Miner, Wiley.)
Mining Association Rules
What is association rule mining?
Apriori Algorithm
Measures of rule interestingness
Advanced Techniques
Boolean association rules
Each transaction is converted to a Boolean vector.
An Example
Min. support 50%; min. confidence 50%.

Transaction ID  Items Bought
2000            A,B,C
1000            A,C
4000            A,D
5000            B,E,F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

For rule A => C:
support = support({A,C}) = 50%
confidence = support({A,C}) / support({A}) = 66.6%
Finding Association Rules
Approach:
1) Find all itemsets that have high support; these are known as frequent itemsets.
2) Generate association rules from the frequent itemsets.
Mining Frequent Itemsets (the Key Step)
Find the frequent itemsets: the sets of items that have (at least) minimum support.
A subset of a frequent itemset must also be a frequent itemset.
Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the candidates against the DB to determine which are in fact frequent.
Apriori principle
Any subset of a frequent itemset must be frequent.
A transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
{beer, diaper, nuts} is frequent => {beer, diaper} must also be frequent.
Apriori principle
No superset of any infrequent itemset should be generated or tested.
Many item combinations can be pruned.
Itemset Lattice

null
A  B  C  D  E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Apriori principle for pruning candidates
If an itemset is infrequent, then all of its supersets must also be infrequent.

null
A  B  C  D  E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

(In the lattice figure, one itemset is marked as found to be infrequent and all of its supersets are pruned.)
How to Generate Candidates?
The items in L_k-1 are listed in an order.
Step 1: self-joining L_k-1

insert into C_k
select p.item_1, p.item_2, ..., p.item_k-1, q.item_k-1
from L_k-1 p, L_k-1 q
where p.item_1 = q.item_1, ..., p.item_k-2 = q.item_k-2, p.item_k-1 < q.item_k-1

Example: A D E joins with A D F (= = <), giving the candidate A D E F.
The Apriori Algorithm — Example
Min. support: 2 transactions

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D -> C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D -> C2 with counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3: {2 3 5}

Scan D -> L3:
itemset   sup
{2 3 5}   2
How to Generate Candidates?
Step 2: pruning

for all itemsets c in C_k do
  for all (k-1)-subsets s of c do
    if (s is not in L_k-1) then delete c from C_k

Example candidate: A D E F
Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
Joining abc and abd gives abcd.
Joining acd and ace gives acde.
Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Pruning (before counting support):
abcd: check if abc, abd, bcd are in L3.
acde: check if acd, ace, ade are in L3.
acde is removed because ade is not in L3.
Thus C4 = {abcd}.
The Apriori Algorithm
C_k: candidate itemset of size k
L_k: frequent itemset of size k
Join step: C_k is generated by joining L_k-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Algorithm:
L_1 = {frequent items};
for (k = 1; L_k != empty; k++) do begin
  C_k+1 = candidates generated from L_k;
  for each transaction t in database do
    increment the count of all candidates in C_k+1 that are contained in t;
  L_k+1 = candidates in C_k+1 with min_support;
end
return L = union of all L_k;
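To make the join, prune, and count steps concrete, here is a toy base R sketch of the loop above (an illustration under the assumption that items are strings and transactions are character vectors; it is not the arules implementation used later in these slides). Run on the four-transaction database of the worked example, it reproduces L1, L2 and L3 = {2 3 5}.

# Database from the worked example (TIDs 100-400); minimum support count = 2.
transactions <- list(c("1","3","4"), c("2","3","5"), c("1","2","3","5"), c("2","5"))
min_sup <- 2

# Number of transactions containing every item of the (sorted) itemset.
support_count <- function(itemset)
  sum(sapply(transactions, function(t) all(itemset %in% t)))

items <- sort(unique(unlist(transactions)))
L <- Filter(function(s) support_count(s) >= min_sup, as.list(items))  # L1

all_frequent <- L
k <- 1
while (length(L) > 0) {
  # Join step: pair frequent k-itemsets that share their first k-1 items.
  C <- list()
  for (i in seq_along(L)) for (j in seq_along(L))
    if (identical(L[[i]][-k], L[[j]][-k]) && L[[i]][k] < L[[j]][k])
      C[[length(C) + 1]] <- c(L[[i]][-k], L[[i]][k], L[[j]][k])
  # Prune step: every k-subset of a candidate must itself be frequent.
  C <- Filter(function(cand)
    all(sapply(seq_along(cand),
               function(d) any(sapply(L, identical, cand[-d])))), C)
  # Count step: one scan of the database per level.
  L <- Filter(function(cand) support_count(cand) >= min_sup, C)
  all_frequent <- c(all_frequent, L)
  k <- k + 1
}
all_frequent   # L1 = {1},{2},{3},{5}; L2 = {1,3},{2,3},{2,5},{3,5}; L3 = {2,3,5}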
Counting Support of Candidates
Why is counting the support of candidates a problem?
The total number of candidates can be very large.
One transaction may contain many candidates.
Method:
Candidate itemsets are stored in a hash tree.
Match candidates to the transactions supporting them.
Counting Support of Candidate Itemsets
Frequent itemsets are found by counting candidates.
Naive way: search for each candidate in each transaction. Expensive!
With N transactions and M candidate itemsets, the naive approach requires O(NM) comparisons.
Reduce the number of comparisons (NM) by using hash tables to store the candidate itemsets.
(Figure: transactions such as {A,B,C,D}, {A,C,E}, {B,C,D}, {A,B,D,E}, {B,C,E} are matched step by step against the list of candidate itemsets, accumulating their counts.)
Reducing Number of Comparisons
Store the candidate itemsets in a hash structure.
Hash each transaction to its appropriate buckets.
Match a transaction only against the candidates stored in the hashed bucket(s).

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

(Figure: transactions hashed into the buckets of a hash structure.)
Example
I: itemset = {cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}
D: set of transactions
1 {cucumber, parsley, onion, tomato, salt, bread}
2 {tomato, cucumber, parsley}
3 {tomato, cucumber, olives, onion, parsley}
4 {tomato, cucumber, onion, bread}
5 {tomato, salt, onion}
6 {bread, cheese}
7 {tomato, cheese, cucumber}
8 {bread, butter}
from: http://www.cs.bilkent.edu.tr/~guvenir/courses/CS558/Association%20Rules.ppt
For support = 30% -> at least 3 transactions:
L1 = {{bread}, {cucumber}, {onion}, {parsley}, {tomato}}
C2 = {{bread, cucumber}, {bread, onion}, {bread, parsley}, {bread, tomato}, {cucumber, onion}, {cucumber, parsley}, {cucumber, tomato}, {onion, parsley}, {onion, tomato}, {parsley, tomato}}
L2 = {{cucumber, onion}, {cucumber, parsley}, {cucumber, tomato}, {onion, tomato}, {parsley, tomato}}
L3 = {{cucumber, parsley, tomato}}
L = L1 U L2 U L3
from: http://www.cs.bilkent.edu.tr/~guvenir/courses/CS558/Association%20Rules.ppt
The hash tree constructed for C2:

root
|- bread:    {bread, cucumber}, {bread, onion}, {bread, parsley}, {bread, tomato}
|- cucumber: {cucumber, onion}, {cucumber, parsley}, {cucumber, tomato}
|- onion:    {onion, parsley}, {onion, tomato}
|- parsley:  {parsley, tomato}

For any itemset c contained in transaction t, the first item of c must be in t.
At the root, by hashing on every item in t, we ensure that we only ignore itemsets that start with an item not in t.
from: http://www.cs.bilkent.edu.tr/~guvenir/courses/CS558/Association%20Rules.ppt
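The same first-item hashing can be sketched in a few lines of base R; this is an illustrative toy (the data structures are assumptions, and a real hash tree hashes on successive items at successive depths). Counting C2 against D this way recovers L2.

# C2 and D from the example above; items inside each candidate are kept sorted.
C2 <- list(
  c("bread","cucumber"), c("bread","onion"), c("bread","parsley"),
  c("bread","tomato"), c("cucumber","onion"), c("cucumber","parsley"),
  c("cucumber","tomato"), c("onion","parsley"), c("onion","tomato"),
  c("parsley","tomato"))
D <- list(
  c("cucumber","parsley","onion","tomato","salt","bread"),
  c("tomato","cucumber","parsley"),
  c("tomato","cucumber","olives","onion","parsley"),
  c("tomato","cucumber","onion","bread"),
  c("tomato","salt","onion"),
  c("bread","cheese"),
  c("tomato","cheese","cucumber"),
  c("bread","butter"))

buckets <- split(C2, vapply(C2, `[`, "", 1))   # hash each candidate on its first item
counts  <- setNames(integer(length(C2)), vapply(C2, paste, "", collapse = ","))
for (t in D) {
  for (key in intersect(names(buckets), t)) {  # visit only buckets keyed by items of t
    for (cand in buckets[[key]]) {
      id <- paste(cand, collapse = ",")
      if (all(cand %in% t)) counts[id] <- counts[id] + 1
    }
  }
}
counts[counts >= 3]   # the five pairs of L2 listed above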
Generating AR from frequent itemsets

Confidence(A => B) = P(B|A) = support_count({A, B}) / support_count({A})

For every frequent itemset x, generate all non-empty subsets of x.
For every non-empty subset s of x, output the rule "s => (x - s)" if support_count({x}) / support_count({s}) >= min_conf.
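A small base R sketch of this rule-generation step (subsets enumerated with bit masks; the toy transactions repeat the earlier Apriori example and are an assumption for illustration):

transactions <- list(c("1","3","4"), c("2","3","5"), c("1","2","3","5"), c("2","5"))
support_count <- function(itemset)
  sum(sapply(transactions, function(t) all(itemset %in% t)))

# For a frequent itemset x, emit every rule s => (x - s) whose confidence
# reaches min_conf, where s ranges over the non-empty proper subsets of x.
generate_rules <- function(x, min_conf) {
  n <- length(x)
  out <- character(0)
  for (mask in 1:(2^n - 2)) {   # bit masks for the non-empty proper subsets
    s    <- x[as.logical(bitwAnd(mask, 2^(seq_len(n) - 1)))]
    conf <- support_count(x) / support_count(s)
    if (conf >= min_conf)
      out <- c(out, sprintf("{%s} => {%s} (conf = %.2f)",
                            paste(s, collapse = ","),
                            paste(setdiff(x, s), collapse = ","), conf))
  }
  out
}
generate_rules(c("2", "3", "5"), min_conf = 0.75)
# "{2,3} => {5} (conf = 1.00)"  "{3,5} => {2} (conf = 1.00)"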
Rule Generation
How can rules be generated efficiently from frequent itemsets?
In general, confidence does not have an anti-monotone property.
But the confidence of rules generated from the same itemset does have an anti-monotone property.
L = {A,B,C,D}:
c(ABC => D) >= c(AB => CD) >= c(A => BCD)
Confidence is non-increasing as the number of items in the rule consequent increases.
From Frequent Itemsets to Association Rules
Q: Given the frequent set {A,B,E}, what are the possible association rules?
A => B, E
A, B => E
A, E => B
B => A, E
B, E => A
E => A, B
__ => A, B, E (empty rule), or true => A, B, E
Generating Rules: example
Min_support: 60%; Min_confidence: 75%

Trans-ID  Items
1         ACD
2         BCE
3         ABCE
4         BE
5         ABCE

Frequent Itemset       Support
{BCE}, {AC}            60%
{BC}, {CE}, {A}        60%
{BE}, {B}, {C}, {E}    80%

Rule           Conf.
{BC} => {E}    100%
{BE} => {C}    75%
{CE} => {B}    100%
{B} => {CE}    75%
{C} => {BE}    75%
{E} => {BC}    75%
Exercise

TID  Items
1    Bread, Milk, Chips, Mustard
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk, Chips
5    Coke, Bread, Diaper, Milk
6    Beer, Bread, Diaper, Milk, Mustard
7    Coke, Bread, Diaper, Milk

Convert the data to Boolean format and, for a support of 40%, apply the Apriori algorithm.

Bread  Milk  Chips  Mustard  Beer  Diaper  Eggs  Coke
1      1     1      1        0     0       0     0
1      0     0      0        1     1       1     0
0      1     0      0        1     1       0     1
1      1     1      0        1     1       0     0
1      1     0      0        0     1       0     1
1      1     0      1        1     1       0     0
1      1     0      0        0     1       0     1
0.4 * 7 = 2.8, so an itemset is frequent if it appears in at least 3 of the 7 transactions.

C1:           L1:
Bread    6    Bread   6
Milk     6    Milk    6
Chips    2    Beer    4
Mustard  2    Diaper  6
Beer     4    Coke    3
Diaper   6
Eggs     1
Coke     3

C2:                  L2:
Bread,Milk     5     Bread,Milk    5
Bread,Beer     3     Bread,Beer    3
Bread,Diaper   5     Bread,Diaper  5
Bread,Coke     2     Milk,Beer     3
Milk,Beer      3     Milk,Diaper   5
Milk,Diaper    5     Milk,Coke     3
Milk,Coke      3     Beer,Diaper   4
Beer,Diaper    4     Diaper,Coke   3
Beer,Coke      1
Diaper,Coke    3

C3:                                              L3:
Bread,Milk,Beer     2                            Bread,Milk,Diaper  4
Bread,Milk,Diaper   4                            Bread,Beer,Diaper  3
Bread,Beer,Diaper   3                            Milk,Beer,Diaper   3
Milk,Beer,Diaper    3                            Milk,Diaper,Coke   3
Milk,Beer,Coke      (pruned: Beer,Coke not in L2)
Milk,Diaper,Coke    3
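As a hypothetical cross-check (assuming the arules package, introduced in the R section later in these slides, is installed), the frequent itemsets above can be reproduced with eclat(), which mines frequent itemsets directly:

library(arules)
trans <- as(list(
  c("Bread","Milk","Chips","Mustard"),
  c("Beer","Diaper","Bread","Eggs"),
  c("Beer","Coke","Diaper","Milk"),
  c("Beer","Bread","Diaper","Milk","Chips"),
  c("Coke","Bread","Diaper","Milk"),
  c("Beer","Bread","Diaper","Milk","Mustard"),
  c("Coke","Bread","Diaper","Milk")
), "transactions")
freq <- eclat(trans, parameter = list(supp = 0.4))  # 0.4 * 7 = 2.8 -> count >= 3
inspect(sort(freq, by = "support"))                 # lists L1, L2, L3 above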
Mining Association Rules
What is association rule mining?
Apriori Algorithm
Measures of rule interestingness
Advanced Techniques
Interestingness Measurements
Are all of the strong association rules discovered interesting enough to present to the user?
How can we measure the interestingness of a rule?
Subjective measures:
A rule (pattern) is interesting if
• it is unexpected (surprising to the user); and/or
• actionable (the user can do something with it).
(Only the user can judge the interestingness of a rule.)
Objective measures of rule interest
Support
Confidence or strength
Lift or Interest or Correlation
Conviction
Leverage or Piatetsky-Shapiro
Coverage
Criticism to Support and Confidence
Example 1 (Aggarwal & Yu, PODS98):
Among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal.

             basketball   not basketball   sum(row)
cereal       2000         1750             3750 (75%)
not cereal   1000         250              1250 (25%)
sum(col.)    3000 (60%)   2000 (40%)       5000

play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.
Lift of a Rule
Lift (Correlation, Interest):

Lift(A => B) = sup(A, B) / (sup(A) * sup(B)) = P(B|A) / P(B)

If the value is less than 1, A and B are negatively correlated; otherwise A and B are positively correlated.

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Rule     Support   Lift
X => Y   25%       2.00
X => Z   37.50%    0.86
Y => Z   12.50%    0.57
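A quick sanity check of this table in R; the 0/1 vectors transcribe the X, Y, Z rows above.

X <- c(1, 1, 1, 1, 0, 0, 0, 0)
Y <- c(1, 1, 0, 0, 0, 0, 0, 0)
Z <- c(0, 1, 1, 1, 1, 1, 1, 1)
# lift(A => B) = P(A and B) / (P(A) * P(B)); mean() of a 0/1 vector is a probability.
lift <- function(a, b) mean(a & b) / (mean(a) * mean(b))
c(XY = lift(X, Y), XZ = lift(X, Z), YZ = lift(Y, Z))   # 2.00  0.857  0.571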
Lift of a Rule: Example 1 (cont.)
play basketball => eat cereal [40%, 66.7%]
LIFT = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89

play basketball => not eat cereal [20%, 33.3%]
LIFT = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
Problems With Lift
Rules that hold 100% of the time may not have the highest possible lift. For example, if 5% of people are Vietnam veterans and 90% of people are more than 5 years old, we get a lift of 0.05 / (0.05 * 0.9) = 1.11, which is only slightly above 1, for the rule
Vietnam veterans => more than 5 years old.
And lift is symmetric:
not eat cereal => play basketball [20%, 80%]
LIFT = (1000/5000) / ((1250/5000) * (3000/5000)) = 1.33
Conviction of a Rule

Conv(A => B) = (sup(A) * sup(not B)) / sup(A, not B) = (P(A) * P(not B)) / P(A, not B) = P(A) * (1 - P(B)) / (P(A) - P(A, B))

Conviction is a measure of the implication and has value 1 if the items are unrelated.

play basketball => eat cereal [40%, 66.7%]:
Conv = ((3000/5000) * (1 - 3750/5000)) / (3000/5000 - 2000/5000) = 0.75
(eat cereal => play basketball: conv = 0.86)

play basketball => not eat cereal [20%, 33.3%]:
Conv = ((3000/5000) * (1 - 1250/5000)) / (3000/5000 - 1000/5000) = 1.125
(not eat cereal => play basketball: conv = 2.0)
Conviction
The conviction of X => Y can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency with which the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions.
A conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y were purely random chance.
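As a sketch, conviction can be computed directly from the joint probabilities; the figures below are the basketball/cereal numbers from the previous slides.

# conv(A => B) = P(A) * (1 - P(B)) / (P(A) - P(A, B)); a value of 1 means independence.
conviction <- function(pA, pB, pAB) pA * (1 - pB) / (pA - pAB)
conviction(0.60, 0.75, 0.40)   # play basketball => eat cereal: 0.75
conviction(0.25, 0.60, 0.20)   # not eat cereal => play basketball: 2.0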
Leverage of a Rule
Leverage or Piatetsky-Shapiro:

PS(A => B) = sup(A, B) - sup(A) * sup(B)

PS (or leverage) is the proportion of additional elements covered by both the premise and the consequence above that expected if the two were independent.
Coverage of a Rule

coverage(A => B) = sup(A)
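Continuing the same sketch, leverage and coverage for play basketball => eat cereal come straight from the contingency-table probabilities used above:

pA  <- 3000 / 5000   # P(play basketball)
pB  <- 3750 / 5000   # P(eat cereal)
pAB <- 2000 / 5000   # P(both)
pAB - pA * pB   # leverage: -0.05, fewer co-occurrences than expected under independence
pA              # coverage: 0.60, fraction of transactions matching the antecedent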
Comment
Traditional methods, such as database queries, support hypothesis verification about a relationship such as the co-occurrence of diapers and beer.
Data mining methods automatically discover significant association rules from data:
they find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven),
and therefore allow finding unexpected correlations.
ASSOCIATION RULES WITH R
Using Scripts in R
Menu "File; New script".
Write down the following commands:

a <- runif(100, 10, 20)
hist(a)

Menu "File; Save as..." test.R
source('test.R') runs the script.
Another script

for (i in 1:5){
  a <- runif(10*i, 10, 20)
  hist(a, main = paste("uniform n=", 10*i))
  savePlot(filename = paste("uniform n=", 10*i), type = "jpeg")
}
Association Rules in R: package arules

library(arules)
data(Groceries)
summary(Groceries)
length(Groceries)
LIST(Groceries[1:5])
image(Groceries[1:500])

my_rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
inspect(my_rules)
inspect(my_rules[1:10])
my_rules_2 <- sort(my_rules, by = "lift")
inspect(sort(my_rules, by = "lift")[1:3])
summary(my_rules)

itemFrequencyPlot(Groceries, support = 0.1, cex.names = 0.8)
itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.8)

butter_rules <- subset(my_rules, subset = lhs %in% "butter" & lift > 1.2)
butter_rules <- subset(my_rules, subset = lhs %in% "butter" & lhs %in% "rice" & lift > 1.2)

Groceries.sample <- sample(Groceries, 1000, replace = TRUE)
itemFrequency(Groceries.sample) > 0.002
APRIORI EXTENSIONS
Challenges of Frequent Pattern Mining
Challenges:
Multiple scans of the transaction database.
Huge number of candidates.
Tedious workload of support counting for candidates.
Improving Apriori: general ideas:
Reduce the number of transaction database scans.
Shrink the number of candidates.
Facilitate support counting of candidates.
Improving Apriori's Efficiency
Problem with Apriori: every pass goes over the whole data.
AprioriTID: generates candidates in the same way as Apriori, but the DB is used for counting support only on the first pass.
Needs much more memory than Apriori.
Builds a storage set C^k that stores in memory the frequent sets per transaction.
AprioriHybrid: use Apriori in the initial passes; estimate the size of C^k; switch to AprioriTID when C^k is expected to fit in memory.
The switch takes time, but it is still better in most cases.
Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C^1:
TID   Set-of-itemsets
100   { {1},{3},{4} }
200   { {2},{3},{5} }
300   { {1},{2},{3},{5} }
400   { {2},{5} }

L1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C^2:
TID   Set-of-itemsets
100   { {1 3} }
200   { {2 3},{2 5},{3 5} }
300   { {1 2},{1 3},{1 5},{2 3},{2 5},{3 5} }
400   { {2 5} }

L2:
Itemset  Support
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3: {2 3 5}

C^3:
TID   Set-of-itemsets
200   { {2 3 5} }
300   { {2 3 5} }

L3:
Itemset   Support
{2 3 5}   2
Improving Apriori's Efficiency
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
Sampling: mining on a subset of the given data. The sample should fit in memory.
Use a lower support threshold to reduce the probability of missing some itemsets.
The rest of the DB is used to determine the actual itemset counts.
Improving Apriori's Efficiency
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (2 DB scans). (Support in a partition is lowered to be proportional to the number of elements in the partition.)

Phase I: divide D into n non-overlapping partitions; find the frequent itemsets local to each partition (parallel algorithm); combine the results to form a global set of candidate itemsets.
Phase II: scan D to find the global frequent itemsets among the candidates.
Improving Apriori's Efficiency
Dynamic itemset counting (DIC): partitions the DB into several blocks, each marked by a start point.
At each start point, DIC estimates the support of all itemsets that are currently counted and adds new itemsets to the set of candidate itemsets if all their subsets are estimated to be frequent.
If DIC adds all frequent itemsets to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC can complete in two scans.
Association Rules Visualization
The coloured column indicates the association rule B => C. Different icon colours are used to show different metadata values of the association rule.
Association Rules Visualization
(figure)
Association Rules Visualization
Size of ball equates to total support.
Height equates to confidence.
Association Rules Visualization - Ball graph
The Ball Graph Explained
A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green or blue. The blue nodes are active nodes, representing the items in the rule in which the user is interested. The yellow nodes are passive, representing items related to the active nodes in some way. The green nodes merely assist in visualizing two or more items in either the head or the body of the rule.
A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only those items which occur sufficiently frequently are shown.
An arrow between two nodes represents the rule implication between the two items. An arrow is drawn only when the support of a rule is no less than the minimum support.
Association Rules Visualization
(figure)
Mining Association Rules
What is association rule mining?
Apriori Algorithm
FP-tree Algorithm
Additional Measures of rule interestingness
Advanced Techniques
Multiple-Level Association Rules

FoodStuff
|- Frozen
|- Refrigerated
|- Fresh
|    |- Vegetable
|    |- Fruit
|    |    |- Banana, Apple, Orange, Etc...
|    |- Dairy
|    |- Etc...
|- Bakery
|- Etc...

Fresh => Bakery [20%, 60%]
Dairy => Bread [6%, 50%]
Fruit => Bread [1%, 50%] is not valid

Items often form a hierarchy.
Flexible support settings: items at the lower level are expected to have lower support.
Transaction databases can be encoded based on dimensions and levels; explore shared multi-level mining.
Multi-Dimensional Association Rules
Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
Multi-dimensional rules: 2 or more dimensions or predicates.
Inter-dimension association rules (no repeated predicates):
age(X, "19-25") ^ occupation(X, "student") => buys(X, "coke")
Hybrid-dimension association rules (repeated predicates):
age(X, "19-25") ^ buys(X, "popcorn") => buys(X, "coke")
Quantitative Association Rules
age(X, "30-34") ^ income(X, "24K - 48K") => buys(X, "high resolution TV")

Mining Sequential Patterns
10% of customers bought "Foundation" and "Ringworld" in one transaction, followed by "Ringworld Engineers" in another transaction.
Sequential Pattern Mining
Given:
A database of customer transactions ordered by increasing transaction time.
Each transaction is a set of items.
A sequence is an ordered list of itemsets.
Example:
10% of customers bought "Foundation" and "Ringworld" in one transaction, followed by "Ringworld Engineers" in another transaction. 10% is called the support of the pattern.
(A transaction may contain more books than those in the pattern.)
Problem: find all sequential patterns supported by more than a user-specified percentage of data sequences.
Sequential Pattern Rules
(F, R) (RE) => (FE) with 3% support and 40% confidence.
Confidence: 40% of occurrences of (F, R) (RE) are followed by (FE).
(F: Foundation; R: Ringworld; RE: Ringworld Engineers; FE: Foundation and Empire)
Problem decomposition:
Find all sequential patterns with minimum support.
Use the sequential patterns to generate rules.
Applications
Applications of sequential pattern mining:
Attached mailing, e.g., customized mailings for a book club.
Customer satisfaction / retention.
Web log analysis.
Medical research, e.g., sequences of symptoms.
The AprioriAll Algorithm [AgSr95]
A 3-phase algorithm:
Phase 1: finds all frequent itemsets with minimum support, that is, with a minimum number of customers containing them.
Phase 2: transforms the DB so that each transaction is replaced by the set of all frequent itemsets contained in the transaction.
Phase 3: finds all sequential patterns.
The AprioriAll Algorithm (cont.)
The sequence phase: in each pass it uses the large sequences from the previous pass to generate the candidate sequences, and then measures the support of the candidates.
(Figure: customer sequences with min_support = 0.4, i.e. 2 customer sequences, and the resulting maximal sequences.)
Application Difficulties
Wal-Mart knows that customers who buy Barbie dolls (it sells one every 20 seconds) have a 60% likelihood of buying one of three types of candy bars. What does Wal-Mart do with information like that?
'I don't have a clue,' says Wal-Mart's chief of merchandising, Lee Scott.
See KDnuggets 98:01 for many ideas: www.kdnuggets.com/news/98/n01.html
Some Suggestions
By increasing the price of the Barbie doll and giving the type of candy bar free, Wal-Mart can reinforce the buying habits of that particular type of buyer.
Highest-margin candy to be placed near the dolls.
Special promotions for Barbie dolls with candy at a slightly higher margin.
Take a poorly selling product X and incorporate an offer on it based on buying Barbie and candy. If the customer is likely to buy these two products anyway, then why not try to increase sales on X?
Probably they can not only bundle candy of type A with Barbie dolls, but can also introduce new candy of type N in this bundle while offering a discount on the whole bundle. As the bundle is going to sell because of the Barbie dolls and candy of type A, candy of type N gets a free ride to customers' houses. And given the fact that you tend to like something if you see it often, candy of type N can become popular.
References
Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2000.
Vipin Kumar and Mahesh Joshi, "Tutorial on High Performance Data Mining", 1999.
Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules", Proc. VLDB, 1994. (http://www.cs.tau.ac.il/~fiat/dmsem03/Fast%20Algorithms%20for%20Mining%20Association%20Rules.ppt)
Alípio Jorge, "Selecção de regras: medidas de interesse e meta queries". (http://www.liacc.up.pt/~amjorge/Aulas/madsad/ecd2/ecd2_Aulas_AR_3_2003.pdf)
Thank you!!!