DATA MINING: Introductory and Advanced Topics
Part II – Association Rules

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.
© Prentice Hall
Association Rules Outline
Goal: Provide an overview of basic Association Rule mining techniques
Association Rules Problem Overview
– Large itemsets
Association Rules Algorithms
– Apriori
– Sampling
– Partitioning
– Parallel Algorithms
Example: Market Basket Data
Items frequently purchased together: Bread → PeanutButter
Uses:
– Placement
– Advertising
– Sales
– Coupons
Objective: increase sales and reduce costs
Association Rule Definitions
Set of items: I = {I1, I2, …, Im}
Transactions: D = {t1, t2, …, tn}, tj ⊆ I
Itemset: {Ii1, Ii2, …, Iik} ⊆ I
Support of an itemset: percentage of transactions which contain that itemset.
Large (frequent) itemset: itemset whose number of occurrences is above a threshold.
Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
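The 60% figure can be reproduced with a short support computation. A minimal sketch in Python; the five-transaction database below is a hypothetical stand-in consistent with the slide's number (the actual transaction table is not in this transcript):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Hypothetical transactions (assumed for illustration, not from the slides).
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

print(support({"Bread", "PeanutButter"}, DB))  # 3 of 5 transactions -> 0.6
```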
Association Rule Definitions
Association Rule (AR): implication X → Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X → Y: percentage of transactions that contain X ∪ Y
Confidence of AR (α) X → Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
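Confidence differs from support only in the denominator: it is conditioned on the antecedent. A minimal sketch, again assuming a hypothetical five-transaction database in the spirit of the running example:

```python
def confidence(X, Y, transactions):
    """conf(X -> Y): fraction of X-containing transactions that also contain Y."""
    X, Y = set(X), set(Y)
    n_x = sum(1 for t in transactions if X <= t)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)
    return n_xy / n_x

# Hypothetical transactions (assumed, not from the transcript).
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

print(confidence({"Bread"}, {"PeanutButter"}, DB))  # 3 of 4 Bread baskets -> 0.75
```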
Association Rules Ex (cont’d)
Association Rule Problem
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X → Y with a minimum support and confidence.
Link Analysis
NOTE: Support of X → Y is the same as the support of X ∪ Y.
Association Rule Techniques
1. Find large itemsets.
2. Generate rules from frequent itemsets.
Algorithm to Generate ARs
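The slide body itself is not transcribed; a standard rule-generation step (assumed here, not quoted from the slides) splits each large itemset l into every antecedent X and consequent l − X, keeping rules whose confidence meets the threshold α:

```python
from itertools import combinations

def generate_rules(large_itemsets, transactions, alpha):
    """Emit X -> l - X for each large itemset l when confidence >= alpha."""
    def count(s):
        return sum(1 for t in transactions if s <= t)

    rules = []
    for l in map(frozenset, large_itemsets):
        if len(l) < 2:          # a rule needs a nonempty X and Y
            continue
        n_l = count(l)
        for k in range(1, len(l)):
            for X in map(frozenset, combinations(sorted(l), k)):
                if n_l / count(X) >= alpha:
                    rules.append((set(X), set(l - X)))
    return rules

# Hypothetical transactions (assumed for illustration).
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

rules = generate_rules([{"Bread", "PeanutButter"}], DB, 0.75)
# Bread -> PeanutButter (75%) and PeanutButter -> Bread (100%) both survive.
```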
Apriori
Large Itemset Property: any subset of a large itemset is large.
Contrapositive: if an itemset is not large, none of its supersets are large.
Large Itemset Property
Apriori Ex (cont’d)
s = 30%, α = 50%
Apriori Algorithm
1. C1 = itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. repeat
5. i = i + 1;
6. Ci = Apriori-Gen(Li-1);
7. Count Ci to determine Li;
8. until no more large itemsets found;
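The pseudocode above translates almost line for line into Python. A minimal sketch with naive counting and an integer minimum-count threshold; the transaction data is a hypothetical stand-in:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Levelwise Apriori: all itemsets occurring in >= min_count transactions."""
    def large(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) >= min_count}

    items = {i for t in transactions for i in t}
    L = large({frozenset([i]) for i in items})      # L1
    all_large, i = set(L), 1
    while L:
        i += 1
        # Apriori-Gen: join size-(i-1) itemsets, then prune any
        # candidate having a subset that is not large.
        joined = {a | b for a in L for b in L if len(a | b) == i}
        C = {c for c in joined
             if all(frozenset(s) in L for s in combinations(c, i - 1))}
        L = large(C)
        all_large |= L
    return all_large

# Hypothetical five-transaction database (assumed for illustration).
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

result = apriori(DB, 3)   # threshold: 3 of 5 transactions (60%)
```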
Apriori-Gen
Generates candidates of size i+1 from large itemsets of size i.
Only generates a candidate if all of its subsets are large.
Approach used: join large itemsets of size i if they agree on the first i-1 items.
Apriori-Gen Example
Apriori-Gen Example (cont’d)
Apriori Adv/Disadv
Advantages:
– Uses large itemset property.
– Easily parallelized.
– Easy to implement.
Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans.
Sampling
Large databases
Sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): large itemsets from the sample
Negative Border (BD⁻):
– Generalization of Apriori-Gen applied to itemsets of varying sizes.
– Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
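The negative border follows directly from that definition: an itemset is in BD⁻(PL) when it is not in PL but every immediate subset is. A sketch using naive enumeration over the item universe (fine at slide scale; the PL and item names are taken from the sampling example later in the slides):

```python
from itertools import combinations

def negative_border(PL, items):
    """Minimal itemsets not in PL whose immediate subsets are all in PL."""
    PL = {frozenset(p) for p in PL} | {frozenset()}   # empty set is trivially "in"
    border = set()
    for k in range(1, len(items) + 1):
        for c in map(frozenset, combinations(sorted(items), k)):
            if c not in PL and all(frozenset(s) in PL
                                   for s in combinations(c, k - 1)):
                border.add(c)
    return border

PL = [{"Bread"}, {"Jelly"}, {"PeanutButter"},
      {"Bread", "Jelly"}, {"Bread", "PeanutButter"},
      {"Jelly", "PeanutButter"}, {"Bread", "Jelly", "PeanutButter"}]
items = ["Beer", "Bread", "Jelly", "Milk", "PeanutButter"]

bd = negative_border(PL, items)   # == {frozenset({"Beer"}), frozenset({"Milk"})}
```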
Negative Border Example
[Figure: itemset lattice showing PL and BD⁻(PL)]
Sampling Algorithm
1. Ds = sample of database D;
2. PL = large itemsets in Ds using smalls;
3. C = PL ∪ BD⁻(PL);
4. Count C in database using s;
5. ML = large itemsets in BD⁻(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD⁻;
8. Count C in database;
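Steps 1–5 can be sketched end to end. The sketch below enumerates PL naively instead of running Apriori on the sample, and uses a hypothetical database with Ds = the first two transactions; a nonempty ML signals that the second full scan is needed:

```python
from itertools import combinations

def frac(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

def negative_border(PL, items):
    PL = set(PL) | {frozenset()}
    return {c for k in range(1, len(items) + 1)
            for c in map(frozenset, combinations(sorted(items), k))
            if c not in PL and all(frozenset(s) in PL
                                   for s in combinations(c, k - 1))}

def sampling_first_pass(db, db_s, s, smalls):
    """PL from the sample at the lowered threshold smalls, then one full
    scan over C = PL union BD-(PL); nonempty ML means a second pass."""
    items = {i for t in db for i in t}
    PL = {frozenset(c) for k in range(1, len(items) + 1)
          for c in combinations(sorted(items), k)
          if frac(frozenset(c), db_s) >= smalls}
    C = PL | negative_border(PL, items)
    L = {c for c in C if frac(c, db) >= s}            # count C in D using s
    ML = {c for c in negative_border(PL, items) if c in L}
    return L, ML

# Hypothetical database (assumed); Ds = {t1, t2} as in the sampling example.
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

L, ML = sampling_first_pass(DB, DB[:2], s=0.2, smalls=0.1)
# ML == {{Beer}, {Milk}}: both were missed by the sample, so a second pass runs.
```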
Sampling Example
Find AR assuming s = 20%
Ds = {t1, t2}
smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
BD⁻(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD⁻ generates all remaining itemsets
Sampling Adv/Disadv
Advantages:
– Reduces the number of database scans to one in the best case and two in the worst.
– Scales better.
Disadvantages:
– Potentially large number of candidates in the second pass.
Partitioning
Divide database into partitions D1, D2, …, Dp
Apply Apriori to each partition
Any large itemset must be large in at least one partition.
Partitioning Algorithm
1. Divide D into partitions D1, D2, …, Dp;
2. For i = 1 to p do
3. Li = Apriori(Di);
4. C = L1 ∪ … ∪ Lp;
5. Count C on D to generate L;
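The algorithm above can be sketched compactly: one local pass per partition, then one recount of the union on the full database. Naive enumeration stands in for the per-partition Apriori call; the partition split and data are assumptions:

```python
from itertools import combinations

def local_large(part, s):
    """All itemsets large within one partition (naive stand-in for Apriori)."""
    items = {i for t in part for i in t}
    return {frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if sum(1 for t in part if set(c) <= t) >= s * len(part)}

def partition_mining(db, parts, s):
    C = set().union(*(local_large(p, s) for p in parts))     # C = L1 u ... u Lp
    return {c for c in C
            if sum(1 for t in db if c <= t) >= s * len(db)}  # recount on D

# Hypothetical database, split into D1 = first three, D2 = last two.
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]
L = partition_mining(DB, [DB[:3], DB[3:]], s=0.10)
```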
Partitioning Example
s = 10%
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}, {Bread, Milk, PeanutButter}, {Beer}, {Beer, Bread}, {Beer, Milk}}
Partitioning Adv/Disadv
Advantages:
– Adapts to available main memory.
– Easily parallelized.
– Maximum number of database scans is two.
Disadvantages:
– May have many candidates during the second scan.
Parallelizing AR Algorithms
Based on Apriori
Techniques differ:
– What is counted at each site
– How data (transactions) are distributed
Data Parallelism
– Data partitioned
– Count Distribution Algorithm
Task Parallelism
– Data and candidates partitioned
– Data Distribution Algorithm
Count Distribution Algorithm (CDA)
1. Place data partition at each site.
2. In parallel at each site do
3. C1 = itemsets of size one in I;
4. Count C1;
5. Broadcast counts to all sites;
6. Determine global large itemsets of size 1, L1;
7. i = 1;
8. repeat
9. i = i + 1;
10. Ci = Apriori-Gen(Li-1);
11. Count Ci;
12. Broadcast counts to all sites;
13. Determine global large itemsets of size i, Li;
14. until no more large itemsets found;
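The heart of CDA is that every site counts the same candidate set on its own partition, and only counts are exchanged. One round can be simulated sequentially, with the broadcast-and-combine step reduced to summing per-site counts (the site data below is hypothetical):

```python
def cda_round(partitions, candidates, min_count):
    """One CDA iteration: local counts per site, summed into global counts."""
    totals = {c: 0 for c in candidates}
    for site_data in partitions:          # runs in parallel, one site each
        for c in candidates:              # every site counts ALL candidates
            totals[c] += sum(1 for t in site_data if c <= t)
    return {c for c, n in totals.items() if n >= min_count}

# Two hypothetical sites, each holding one data partition.
site1 = [{"Bread", "Jelly", "PeanutButter"},
         {"Bread", "PeanutButter"},
         {"Bread", "Milk", "PeanutButter"}]
site2 = [{"Beer", "Bread"}, {"Beer", "Milk"}]

C1 = {frozenset([i]) for p in (site1, site2) for t in p for i in t}
L1 = cda_round([site1, site2], C1, min_count=3)
# Global counts: Bread 4, PeanutButter 3, Beer 2, Milk 2, Jelly 1.
```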
CDA Example
Data Distribution Algorithm (DDA)
1. Place data partition at each site.
2. In parallel at each site do
3. Determine local candidates of size 1 to count;
4. Broadcast local transactions to other sites;
5. Count local candidates of size 1 on all data;
6. Determine large itemsets of size 1 for local candidates;
7. Broadcast large itemsets to all sites;
8. Determine L1;
9. i = 1;
10. repeat
11. i = i + 1;
12. Ci = Apriori-Gen(Li-1);
13. Determine local candidates of size i to count;
14. Count, broadcast, and find Li;
15. until no more large itemsets found;
DDA Example
Incremental Association Rules
Generate ARs in a dynamic database.
Problem: algorithms assume a static database
Objective:
– Know large itemsets for D
– Find large itemsets for D ∪ ΔD
Must be large in either D or ΔD
Save Li and counts
Note on ARs
Many applications outside market basket data analysis
– Prediction (telecom switch failure)
– Web usage mining
Many different types of association rules
– Temporal
– Spatial
– Causal