

DATA MINING
Introductory and Advanced Topics
Part II – Association Rules

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.


Association Rules Outline

Goal: Provide an overview of basic Association Rule mining techniques

Association Rules Problem Overview
– Large itemsets

Association Rules Algorithms
– Apriori
– Sampling
– Partitioning
– Parallel Algorithms


Example: Market Basket Data

Items frequently purchased together: Bread ⇒ PeanutButter

Uses:
– Placement
– Advertising
– Sales
– Coupons

Objective: increase sales and reduce costs


Association Rule Definitions

Set of items: I = {I1, I2, …, Im}
Transactions: D = {t1, t2, …, tn}, tj ⊆ I
Itemset: {Ii1, Ii2, …, Iik} ⊆ I
Support of an itemset: Percentage of transactions which contain that itemset.
Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.


Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread,PeanutButter} is 60%
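The transaction table shown on this slide did not survive into this transcript. As a minimal sketch, the following Python snippet computes support over a small hypothetical transaction set chosen to be consistent with the numbers quoted here ({Bread, PeanutButter} in 3 of 5 transactions, i.e. 60%); the data and the support helper are illustrative, not taken from the slides.

# Hypothetical market-basket data (the slide's own table is not in the transcript).
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(support({"Bread", "PeanutButter"}, transactions))   # 0.6, i.e. 60%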


Association Rule Definitions

Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y
Confidence of AR (α) X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X
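As a minimal sketch of these two measures, assuming the same hypothetical transaction set as above: support of X ⇒ Y counts transactions containing X ∪ Y, and confidence divides that by the count of transactions containing X. The function names are illustrative.

transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def count(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def rule_support(X, Y, transactions):
    # support of X => Y is the support of X union Y
    return count(set(X) | set(Y), transactions) / len(transactions)

def rule_confidence(X, Y, transactions):
    # ratio of transactions containing X union Y to those containing X
    return count(set(X) | set(Y), transactions) / count(X, transactions)

print(rule_support({"Bread"}, {"PeanutButter"}, transactions))      # 0.6
print(rule_confidence({"Bread"}, {"PeanutButter"}, transactions))   # 0.75 (3 of the 4 Bread transactions)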


Association Rules Ex (cont’d)


Association Rule Problem

Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.

Link Analysis

NOTE: Support of X ⇒ Y is the same as support of X ∪ Y.


Association Rule Techniques

1. Find Large Itemsets.
2. Generate rules from frequent itemsets.


Algorithm to Generate ARs
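The pseudocode shown on this slide is not reproduced in the transcript. As a rough sketch of the standard approach, assuming the large itemsets and their supports have already been found: for each large itemset l and each non-empty proper subset X, emit X ⇒ (l − X) when support(l)/support(X) meets the confidence threshold. The names below are illustrative.

from itertools import chain, combinations

def generate_rules(large_itemsets, support_of, min_conf):
    """support_of maps each large itemset (frozenset) to its support;
    every subset of a large itemset is large, so its support is available too."""
    rules = []
    for l in large_itemsets:
        if len(l) < 2:
            continue
        # every non-empty proper subset of l is a candidate antecedent
        for X in chain.from_iterable(combinations(sorted(l), r) for r in range(1, len(l))):
            X = frozenset(X)
            conf = support_of[l] / support_of[X]
            if conf >= min_conf:
                rules.append((set(X), set(l - X), conf))
    return rules

supports = {frozenset({"Bread"}): 0.8,
            frozenset({"PeanutButter"}): 0.6,
            frozenset({"Bread", "PeanutButter"}): 0.6}
print(generate_rules([frozenset({"Bread", "PeanutButter"})], supports, min_conf=0.5))
# Bread => PeanutButter (confidence 0.75) and PeanutButter => Bread (confidence 1.0)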


Apriori

Large Itemset Property:
Any subset of a large itemset is large.

Contrapositive:
If an itemset is not large, none of its supersets are large.


Large Itemset Property


Apriori Ex (cont’d)

s = 30%, α = 50%


Apriori Algorithm

1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5.   i = i + 1;
6.   Ci = Apriori-Gen(Li-1);
7.   Count Ci to determine Li;
8. until no more large itemsets found;
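A minimal Python sketch of this loop, assuming the hypothetical transaction set used earlier and a simple join-and-prune Apriori-Gen (spelled out on the next slide). Function names are illustrative, and the brute-force support counting is only suitable for toy data.

from itertools import combinations

transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori_gen(large_prev, size):
    """Join large itemsets of size (size - 1) that agree on their first (size - 2) items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in large_prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:
            cand = frozenset(a) | frozenset(b)
            # prune candidates that have any non-large subset of size (size - 1)
            if all(frozenset(sub) in large_prev for sub in combinations(sorted(cand), size - 1)):
                candidates.add(cand)
    return candidates

def apriori(transactions, min_support):
    items = {item for t in transactions for item in t}
    current = {frozenset({i}) for i in items                      # L1
               if support(frozenset({i}), transactions) >= min_support}
    large, i = set(current), 1
    while current:                                                # repeat ... until
        i += 1
        candidates = apriori_gen(current, i)                      # Ci
        current = {c for c in candidates                          # Li
                   if support(c, transactions) >= min_support}
        large |= current
    return large

for itemset in sorted(apriori(transactions, min_support=0.30), key=len):
    print(set(itemset))   # {Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}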


Apriori-Gen

Generate candidates of size i+1 from large itemsets of size i.
Only generates a candidate if all of its subsets are large.
Approach used: join large itemsets of size i if they agree on the first i-1 items.
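A sketch of this join-and-prune step, assuming itemsets are kept as sorted tuples so that "agree on the first i-1 items" becomes a simple prefix comparison; the names are illustrative.

from itertools import combinations

def apriori_gen(large_i):
    """Candidates of size i+1 from large itemsets of size i (given as sorted tuples)."""
    large_set = {frozenset(s) for s in large_i}
    prev = sorted(tuple(sorted(s)) for s in large_i)
    candidates = []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:                           # join: first i-1 items agree
            cand = tuple(sorted(set(a) | set(b)))      # resulting itemset of size i+1
            # prune: keep the candidate only if every size-i subset is large
            if all(frozenset(sub) in large_set for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

# (A, B) and (A, C) join to (A, B, C), which survives because (B, C) is also large;
# (A, B, D) and (A, C, D) are pruned because (B, D) and (C, D) are not large.
L2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
print(apriori_gen(L2))        # [('A', 'B', 'C')]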


Apriori-Gen Example


Apriori-Gen Example (cont’d)


Apriori Adv/Disadv

Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.

Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans.


Sampling

Large databases
Sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): Large itemsets from sample
Negative Border (BD⁻):
– Generalization of Apriori-Gen applied to itemsets of varying sizes.
– Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
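A rough brute-force sketch of BD⁻(PL) under this definition (enumerate itemsets over the item universe, keep those not in PL whose non-empty proper subsets all are); fine for small examples only, and the names are illustrative.

from itertools import combinations

def negative_border(PL, items):
    """Itemsets over `items` that are not in PL but whose non-empty proper subsets all are."""
    PL = {frozenset(s) for s in PL}
    items = sorted(items)
    border = set()
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            c = frozenset(cand)
            subsets_ok = all(frozenset(s) in PL
                             for k in range(1, len(c))
                             for s in combinations(cand, k))
            if c not in PL and subsets_ok:
                border.add(c)
    return border

# PL from the sampling example a few slides below:
PL = [{"Bread"}, {"Jelly"}, {"PeanutButter"},
      {"Bread", "Jelly"}, {"Bread", "PeanutButter"}, {"Jelly", "PeanutButter"},
      {"Bread", "Jelly", "PeanutButter"}]
items = {"Beer", "Bread", "Jelly", "Milk", "PeanutButter"}
print(negative_border(PL, items))     # the singletons {Beer} and {Milk}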


Negative Border Example

(Figure: itemset lattice showing PL and BD⁻(PL).)


Sampling Algorithm

1. Ds = sample of Database D;
2. PL = Large itemsets in Ds using smalls;
3. C = PL ∪ BD⁻(PL);
4. Count C in Database using s;
5. ML = large itemsets in BD⁻(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD⁻;
8. Count C in Database;
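A compact end-to-end sketch of this algorithm on the hypothetical data used earlier: mine the sample at the lowered threshold smalls, count PL ∪ BD⁻(PL) against the full database at s, and flag whether a second, wider pass is needed. A brute-force miner stands in for Apriori on the sample, and all names are illustrative.

from itertools import combinations

transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def all_itemsets(items):
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

def large_itemsets(db, items, threshold):
    """Brute-force stand-in for running Apriori on the (small) sample."""
    return {c for c in all_itemsets(items) if support(c, db) >= threshold}

def negative_border(PL, items):
    """Itemsets not in PL whose non-empty proper subsets are all in PL."""
    return {c for c in all_itemsets(items)
            if c not in PL and all(frozenset(s) in PL
                                   for r in range(1, len(c))
                                   for s in combinations(sorted(c), r))}

items = {i for t in transactions for i in t}
s, smalls = 0.20, 0.10                 # database threshold and lowered sample threshold
Ds = transactions[:2]                  # sample (t1 and t2, as in the example on the next slide)

PL = large_itemsets(Ds, items, smalls)                 # steps 1-2
BD = negative_border(PL, items)
C = PL | BD                                            # step 3
L = {c for c in C if support(c, transactions) >= s}    # step 4: one scan of the full database
ML = L & BD                                            # step 5: border itemsets that turned out large
if ML:
    # steps 6-8: a second scan would be needed, after growing C by repeatedly
    # applying BD- until no new itemsets are added
    pass
print(sorted(tuple(sorted(x)) for x in L))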


Sampling Example

Find AR assuming s = 20%
Ds = {t1, t2}
Smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
BD⁻(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD⁻ generates all remaining itemsets


Sampling Adv/Disadv

Advantages:
– Reduces the number of database scans to one in the best case and two in the worst.
– Scales better.

Disadvantages:
– Potentially large number of candidates in the second pass.


Partitioning

Divide database into partitions D1, D2, …, Dp.
Apply Apriori to each partition.
Any large itemset must be large in at least one partition.


Partitioning Algorithm

1. Divide D into partitions D1, D2, …, Dp;
2. For i = 1 to p do
3.   Li = Apriori(Di);
4. C = L1 ∪ … ∪ Lp;
5. Count C on D to generate L;
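A minimal sketch of this scheme on the hypothetical data used earlier; a brute-force miner stands in for running Apriori on each partition, and the names are illustrative.

from itertools import combinations

transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

def local_large_itemsets(db, threshold):
    """Brute-force stand-in for running Apriori on one partition."""
    items = sorted({i for t in db for i in t})
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(frozenset(c), db) >= threshold}

def partition_mine(D, partitions, threshold):
    C = set()
    for Di in partitions:                          # steps 2-3: mine each partition
        C |= local_large_itemsets(Di, threshold)   # step 4: union of the local results
    # step 5: one scan of the full database keeps only the globally large itemsets
    return {c for c in C if support(c, D) >= threshold}

D1, D2 = transactions[:2], transactions[2:]        # the two partitions of the example below
L = partition_mine(transactions, [D1, D2], threshold=0.10)
print(sorted(tuple(sorted(x)) for x in L))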


Partitioning Example

Partitions D1 and D2, s = 10%

L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}

L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}


Partitioning Adv/Disadv

Advantages:
– Adapts to available main memory
– Easily parallelized
– Maximum number of database scans is two.

Disadvantages:
– May have many candidates during second scan.


Parallelizing AR Algorithms

Based on Apriori
Techniques differ:
– What is counted at each site
– How data (transactions) are distributed

Data Parallelism
– Data partitioned
– Count Distribution Algorithm

Task Parallelism
– Data and candidates partitioned
– Data Distribution Algorithm


Count Distribution Algorithm (CDA)

1. Place data partition at each site.
2. In parallel at each site do
3.   C1 = Itemsets of size one in I;
4.   Count C1;
5.   Broadcast counts to all sites;
6.   Determine global large itemsets of size 1, L1;
7.   i = 1;
8.   Repeat
9.     i = i + 1;
10.    Ci = Apriori-Gen(Li-1);
11.    Count Ci;
12.    Broadcast counts to all sites;
13.    Determine global large itemsets of size i, Li;
14.  until no more large itemsets found;
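A rough single-process simulation of the count-distribution idea, not the slide's exact pseudocode: every "site" counts the same candidate set over its own partition, the per-site counts are summed in place of the broadcast, and the global large itemsets are decided from the totals. All names are illustrative.

from itertools import combinations

def apriori_gen(large_prev):
    """Join large itemsets that agree on all but their last item, then prune."""
    prev = sorted(tuple(sorted(s)) for s in large_prev)
    prev_set = {frozenset(p) for p in prev}
    out = set()
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:
            cand = frozenset(a) | frozenset(b)
            if all(frozenset(s) in prev_set for s in combinations(sorted(cand), len(cand) - 1)):
                out.add(cand)
    return out

def count_candidates(candidates, partition):
    """Local counting step performed at one site."""
    return {c: sum(1 for t in partition if c <= t) for c in candidates}

def count_distribution(partitions, min_support):
    n = sum(len(p) for p in partitions)
    items = {i for p in partitions for t in p for i in t}
    candidates = {frozenset({i}) for i in items}                     # C1, identical at every site
    large_all = set()
    while candidates:
        # each site counts the shared candidates on its own data (in parallel in reality)
        local_counts = [count_candidates(candidates, p) for p in partitions]
        # "broadcast": sum the per-site counts into global counts
        totals = {c: sum(lc[c] for lc in local_counts) for c in candidates}
        current = {c for c, cnt in totals.items() if cnt / n >= min_support}
        large_all |= current
        candidates = apriori_gen(current)                            # next Ci
    return large_all

partitions = [
    [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}],
    [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}],
]
for itemset in sorted(count_distribution(partitions, 0.30), key=len):
    print(set(itemset))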


CDA Example


Data Distribution Algorithm (DDA)

1. Place data partition at each site.
2. In parallel at each site do
3.   Determine local candidates of size 1 to count;
4.   Broadcast local transactions to other sites;
5.   Count local candidates of size 1 on all data;
6.   Determine large itemsets of size 1 for local candidates;
7.   Broadcast large itemsets to all sites;
8.   Determine L1;
9.   i = 1;
10.  Repeat
11.    i = i + 1;
12.    Ci = Apriori-Gen(Li-1);
13.    Determine local candidates of size i to count;
14.    Count, broadcast, and find Li;
15.  until no more large itemsets found;
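A rough sketch of one counting round of the data-distribution idea, again a loose stand-in rather than the slide's pseudocode: the candidate set is split across "sites", and each site counts only its own share of candidates but over the full set of transactions (which in reality it obtains by exchanging partitions with the other sites). Names are illustrative.

def count_share(candidates, database):
    """One site counts its assigned candidates over all transactions."""
    return {c: sum(1 for t in database if c <= t) for c in candidates}

def data_distribution_step(candidates, partitions, min_support):
    database = [t for p in partitions for t in p]          # every site ends up seeing all data
    n_sites = len(partitions)
    candidates = sorted(candidates, key=sorted)
    # split the candidate set across sites (round-robin here, for illustration)
    shares = [candidates[i::n_sites] for i in range(n_sites)]
    local_results = [count_share(share, database) for share in shares]
    # the union of the per-site large itemsets gives Li
    large = set()
    for counts in local_results:
        large |= {c for c, cnt in counts.items() if cnt / len(database) >= min_support}
    return large

partitions = [
    [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}],
    [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}],
]
C1 = [frozenset({i}) for i in {"Beer", "Bread", "Jelly", "Milk", "PeanutButter"}]
print(data_distribution_step(C1, partitions, 0.30))
# large 1-itemsets: Beer, Bread, Milk, PeanutButter (Jelly falls below 30%)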


DDA Example


Incremental Association Rules

Generate ARs in a dynamic database.
Problem: algorithms assume a static database.
Objective:
– Know large itemsets for D
– Find large itemsets for D ∪ ΔD

Must be large in either D or ΔD.
Save Li and counts.
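A rough sketch of the bookkeeping this suggests, not a specific published incremental algorithm: save the counts of the itemsets that were large in D, mine only the increment ΔD, and since an itemset large in D ∪ ΔD must be large in D or in ΔD, only the saved itemsets plus the itemsets large in ΔD ever need exact counts. Names are illustrative.

from itertools import combinations

def count_on(itemsets, db):
    return {c: sum(1 for t in db if c <= t) for c in itemsets}

def brute_force_large(db, threshold):
    """Stand-in for any large-itemset miner (e.g. Apriori) run on the small increment."""
    items = sorted({i for t in db for i in t})
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sum(1 for t in db if set(c) <= t) / len(db) >= threshold}

def incremental_large(D, saved_counts, delta, min_support, mine):
    """saved_counts: {itemset: count over D} for the itemsets that were large in D."""
    n = len(D) + len(delta)
    # candidates: large in D (saved) or large in the increment
    candidates = set(saved_counts) | mine(delta, min_support)
    delta_counts = count_on(candidates, delta)
    new_counts = {}
    for c in candidates:
        # reuse the saved count over D; rescan D only for itemsets newly large in delta
        d_count = saved_counts[c] if c in saved_counts else sum(1 for t in D if c <= t)
        new_counts[c] = d_count + delta_counts[c]
    large = {c for c in candidates if new_counts[c] / n >= min_support}
    return large, {c: new_counts[c] for c in large}   # counts to save for the next increment

D = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
     {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}]
delta = [{"Beer", "Milk"}]
saved = {c: sum(1 for t in D if c <= t) for c in brute_force_large(D, 0.30)}
large, saved = incremental_large(D, saved, delta, 0.30, brute_force_large)
print(sorted(tuple(sorted(x)) for x in large))   # the large itemsets of D ∪ ΔD at 30% support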


Note on ARs

Many applications outside market basket data analysis
– Prediction (telecom switch failure)
– Web usage mining

Many different types of association rules
– Temporal
– Spatial
– Causal