association rule discovery (market basket-analysis) mis2502 data analytics adapted from tan,...

16
ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining . http://www-users.cs.umn.edu/~kumar/dmbook/

Upload: michael-french

Post on 31-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

ASSOCIATION RULE DISCOVERY(MARKET BASKET-ANALYSIS)

MIS2502

Data Analytics

Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.http://www-users.cs.umn.edu/~kumar/dmbook/

Page 2: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

What is Association Mining?

• Discovering interesting relationships between variables in large databases (http://en.wikipedia.org/wiki/Association_rule_learning)

• Find out which items predict the occurrence of other items

• Also known as “affinity analysis” or “market basket” analysis

Page 3: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Examples of Association Mining• Market basket analysis/affinity analysis

• What products are bought together?• Where to place items on grocery store shelves?

• Amazon’s recommendation engine• “People who bought this product also bought…”

• Telephone calling patterns• Who do a set of people tend to call most often?

• Social network analysis• Determine who you “may know”

Page 4: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Market-Basket Analysis• Large set of items

• e.g., things sold in a supermarket

• Large set of baskets• e.g., things one customer buys in one visit

• Supermarket chains keep terabytes of this data• Informs store layout• Suggests “tie-ins”

• Place spaghetti sauce in the pasta aisle• Put diapers on sale and raise the price of beer

Page 5: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Market-Basket Transactions

Basket Items

1 Bread, Milk

2 Bread, Diapers, Beer, Eggs

3 Milk, Diapers, Beer, Coke

4 Bread, Milk, Diapers, Beer

5 Bread, Milk, Diapers, Coke

Association Rules from these transactions

X Y (antecedent consequent)

{Diapers} {Beer},{Milk, Bread} {Diapers}

{Beer, Bread} {Milk},{Bread} {Milk, Diapers}

Page 6: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Core idea: The itemset• Itemset: A group of items of interest

{Milk, Beer, Diapers}• This itemset is a “3 itemset” because it

contains…3 items!

• An association rule expresses related itemsets• X Y, where X and Y are two itemsets• {Milk, Diapers} {Beer} means

“when you have milk and diapers, you also have beer)

Basket Items

1 Bread, Milk

2 Bread, Diapers, Beer, Eggs

3 Milk, Diapers, Beer, Coke

4 Bread, Milk, Diapers, Beer

5 Bread, Milk, Diapers, Coke

Page 7: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Support• Support count ()

• Frequency of occurrence of an itemset• {Milk, Beer, Diapers} = 2

(i.e., it’s in baskets 4 and 5)

• Support (s)• Fraction of transactions that contain all

itemsets in the relationship X Y• s({Milk, Diapers, Beer}) = 2/5 = 0.4

• You can calculate support for both X and Y separately• Support for X = 3/5 = 0.6; Support for Y = 3/5 = 0.6

Basket Items

1 Bread, Milk

2 Bread, Diapers, Beer, Eggs

3 Milk, Diapers, Beer, Coke

4 Bread, Milk, Diapers, Beer

5 Bread, Milk, Diapers, Coke

X Y

2 baskets have milk, beer, and diapers

5 baskets total

Page 8: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Confidence• Confidence is the strength of the

association• Measures how often items in Y appear in

transactions that contain X

Basket Items

1 Bread, Milk

2 Bread, Diapers, Beer, Eggs

3 Milk, Diapers, Beer, Coke

4 Bread, Milk, Diapers, Beer

5 Bread, Milk, Diapers, Coke

67.03

2

)Diapers,Milk(

)BeerDiapers,Milk,(

)(

)(

X

YXc

This says 67% of the times when you have

milk and diapers in the itemset you also have

beer!

c must be between 0 and 11 is a complete association

0 is no association

Page 9: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Some sample rulesAssociation Rule Support (s) Confidence (c)

{Milk,Diapers} {Beer} 2/5 = 0.4 2/3 = 0.67

{Milk,Beer} {Diapers} 2/5 = 0.4 2/2 = 1.0

{Diapers,Beer} {Milk} 2/5 = 0.4 2/3 = 0.67

{Beer} {Milk,Diapers} 2/5 = 0.4 2/3 = 0.67

{Diapers} {Milk,Beer} 2/5 = 0.4 2/4 = 0.5

{Milk} {Diapers,Beer} 2/5 = 0.4 2/4 = 0.5

Basket Items

1 Bread, Milk

2 Bread, Diapers, Beer, Eggs

3 Milk, Diapers, Beer, Coke

4 Bread, Milk, Diapers, Beer

5 Bread, Milk, Diapers, Coke

All the above rules are binary partitions of the same itemset: {Milk, Diapers, Beer}

Page 10: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

But don’t blindly follow the numbers• Rules originating from the same itemset (XY) will have

identical support • Since the total elements (XY) are always the same

• But they can have different confidence• Depending on what is contained in X and Y

• High confidence suggests a strong association• But this can be deceptive • Consider {Bread} {Diapers}

• Support for the total itemset is 0.6 (3/5)• And confidence is 0.75 (3/4) – pretty high• But is this just because both are frequently occurring items (s=0.8)?• You’d almost expect them to show up in the same baskets by chance

Page 11: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Lift• Takes into account how co-occurrence differs from what is

expected by chance• i.e., if items were selected independently from one another

• Based on the support metric

)(*)(

)(

YsXs

YXsLift

Support for total itemset X and Y

Support for X times support for Y

Page 12: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Lift Example• What’s the lift of the association rule

{Milk, Diapers} {Beer}

• So X = {Milk, Diapers} and Y = {Beer}

s({Milk, Diapers, Beer}) = 2/5 = 0.4s({Milk, Diapers}) = 3/5 = 0.6s({Beer}) = 3/5 = 0.6

So

Basket Items

1 Bread, Milk

2 Bread, Diapers, Beer, Eggs

3 Milk, Diapers, Beer, Coke

4 Bread, Milk, Diapers, Beer

5 Bread, Milk, Diapers, Coke

11.136.0

4.0

6.0*6.0

4.0Lift

When Lift > 1, the occurrence of

X Y together is more likely than what you would

expect by chance

Page 13: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Another exampleChecking Account

Savings Account

No Yes

No 500 3500 4000

Yes 1000 5000 6000

10000Are people more likely to have a checking account if they have a

savings account?

Support ({Savings} {Checking}) = 5000/10000 = 0.5Support ({Savings}) = 6000/10000 = 0.6Support ({Checking}) = 8500/10000 = 0.85Confidence ({Savings} {Checking}) = 5000/6000 = 0.83

98.051.0

5.0

85.0*6.0

5.0Lift

Answer: NoIn fact, it’s

slightly less than what

you’d expect by chance!

Page 14: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Selecting the rules• We know how to calculate the

measures for each rule• Support• Confidence• Lift

• Then we set up thresholds for the minimum rule strength we want to accept• For support – called minsup• For confidence – called minconf

The steps

• List all possible association rules

• Compute the support and confidence for each rule

• Drop rules that don’t make the minsup and minconf thresholds

• Use lift to double-check the association

Page 15: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

But this can be overwhelming• Imagine all the combinations possible

in your local grocery store• Every product matched with every

combination of other products

• Tens of thousands of possible rule combinations

So where do you start?

Page 16: ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining

Once you are confident in a rule, take action

{Milk, Diapers} {Beer}

Possible Marketing Actions• Lower the price of milk and diapers, raise it

on beer• Put beer next to one of those products in

the store• Put beer in a special section of the store

away from milk and diapers• Create “New Parent Coping Kits” of beer,

milk, and diapers• Target new beers to people who buy milk

and diapers