
Association Rule Mining Part 2

(under construction!)

Introduction to Data Mining with Case Studies
Author: G. K. Gupta

Prentice Hall India, 2006.


Bigger Example

TID   Items
1     Biscuits, Bread, Cheese, Coffee, Yogurt
2     Bread, Cereal, Cheese, Coffee
3     Cheese, Chocolate, Donuts, Juice, Milk
4     Bread, Cheese, Coffee, Cereal, Juice
5     Bread, Cereal, Chocolate, Donuts, Juice
6     Milk, Tea
7     Biscuits, Bread, Cheese, Coffee, Milk
8     Eggs, Milk, Tea
9     Bread, Cereal, Cheese, Chocolate, Coffee
10    Bread, Cereal, Chocolate, Donuts, Juice
11    Bread, Cheese, Juice
12    Bread, Cheese, Coffee, Donuts, Juice
13    Biscuits, Bread, Cereal
14    Cereal, Cheese, Chocolate, Donuts, Juice
15    Chocolate, Coffee
16    Donuts
17    Donuts, Eggs, Juice
18    Biscuits, Bread, Cheese, Coffee
19    Bread, Cereal, Chocolate, Donuts, Juice
20    Cheese, Chocolate, Donuts, Juice
21    Milk, Tea, Yogurt
22    Bread, Cereal, Cheese, Coffee
23    Chocolate, Donuts, Juice, Milk, Newspaper
24    Newspaper, Pastry, Rolls
25    Rolls, Sugar, Tea


Frequency of Items

Item No   Item name   Frequency
1         Biscuits    4
2         Bread       13
3         Cereal      10
4         Cheese      11
5         Chocolate   9
6         Coffee      9
7         Donuts      10
8         Eggs        2
9         Juice       11
10        Milk        6
11        Newspaper   2
12        Pastry      1
13        Rolls       2
14        Sugar       1
15        Tea         4
16        Yogurt      2


Frequent Items

Assume 25% support. With 25 transactions, a frequent item must then occur in at least 7 transactions. The frequent 1-itemsets (L1) are given below. How many candidates are there in C2? List them.

Item        Frequency
Bread       13
Cereal      10
Cheese      11
Chocolate   9
Coffee      9
Donuts      10
Juice       11
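Since L1 contains seven items, C2 consists of all C(7,2) = 21 pairs of them. Below is a minimal sketch of this step, assuming transactions are held as Python sets; the function name and the three-transaction sample at the end are illustrative only.

```python
from collections import Counter
from itertools import combinations

def frequent_items_and_c2(transactions, min_count):
    """Count individual items, keep those meeting min_count (L1),
    then form C2 as every pair of frequent items."""
    counts = Counter(item for t in transactions for item in t)
    l1 = {item for item, c in counts.items() if c >= min_count}
    c2 = [frozenset(p) for p in combinations(sorted(l1), 2)]
    return l1, c2

# Tiny illustration; run on the 25 transactions above with min_count = 7,
# L1 is the seven items in the table and C2 holds the 21 pairs drawn from them.
sample = [{"Bread", "Cheese", "Coffee"},
          {"Bread", "Cereal"},
          {"Cheese", "Coffee", "Juice"}]
print(frequent_items_and_c2(sample, 2))
```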


L2

The following pairs are frequent. Now find C3 and then L3 and the rules.

Frequent 2-itemset    Frequency
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9
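A minimal sketch of the join-and-prune step that produces C3 from these pairs; holding the itemsets as frozensets is an implementation choice, not something the slide prescribes.

```python
from itertools import combinations

L2 = [frozenset(p) for p in [
    ("Bread", "Cereal"), ("Bread", "Cheese"), ("Bread", "Coffee"),
    ("Cheese", "Coffee"), ("Chocolate", "Donuts"),
    ("Chocolate", "Juice"), ("Donuts", "Juice")]]

def candidate_3_itemsets(l2):
    """Join pairs that share an item, then prune any candidate with a
    2-item subset that is not itself in L2."""
    l2 = set(l2)
    c3 = set()
    for a, b in combinations(l2, 2):
        union = a | b
        if len(union) == 3 and all(frozenset(s) in l2
                                   for s in combinations(union, 2)):
            c3.add(union)
    return c3

print(candidate_3_itemsets(L2))
# Each surviving candidate is then counted against the transactions
# to obtain L3 and, from L3, the rules.
```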


Rules

The full set of rules is given below. Could some of these rules be removed?

Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread

Comment: Study the above rules carefully.
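A minimal sketch of generating rules from one frequent itemset follows. The 70% minimum confidence is assumed purely for illustration (the slide does not state the threshold it used); the single-item and pair supports come from the earlier tables, and the count of 8 for {Bread, Cheese, Coffee} is obtained by counting it in the transaction list.

```python
from itertools import combinations

# Support counts (out of 25 transactions) taken from the earlier tables;
# the triple's count of 8 is counted directly from the transaction list.
support = {
    frozenset(["Bread"]): 13,
    frozenset(["Cheese"]): 11,
    frozenset(["Coffee"]): 9,
    frozenset(["Bread", "Cheese"]): 8,
    frozenset(["Bread", "Coffee"]): 8,
    frozenset(["Cheese", "Coffee"]): 9,
    frozenset(["Bread", "Cheese", "Coffee"]): 8,
}

def rules_from_itemset(itemset, support, min_conf):
    """Emit X -> Y for every non-empty proper subset X of the itemset,
    keeping only rules whose confidence reaches min_conf."""
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            conf = support[items] / support[lhs]
            if conf >= min_conf:
                yield lhs, items - lhs, conf

for lhs, rhs, conf in rules_from_itemset(["Bread", "Cheese", "Coffee"],
                                          support, 0.7):
    print(set(lhs), "->", set(rhs), f"{conf:.0%}")
```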


Improving the Apriori Algorithm

Many techniques for improving the efficiency have been proposed:

• Pruning (already mentioned)
• Hashing-based technique
• Transaction reduction
• Partitioning
• Sampling
• Dynamic itemset counting


Pruning

Pruning can reduce the size of the candidate set Ck before it is turned into the set of frequent itemsets Lk. To reduce the work of support counting, we use the rule that every subset of a frequent itemset must itself be frequent: any candidate in Ck that has an infrequent (k-1)-subset can be removed.
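A minimal sketch of this pruning rule, assuming the candidates and the previous level's frequent itemsets are held as frozensets; the sample data at the end is hypothetical.

```python
from itertools import combinations

def prune_candidates(candidates, frequent_smaller):
    """Apriori pruning: drop any k-item candidate that has a
    (k-1)-item subset which is not frequent."""
    frequent_smaller = set(frequent_smaller)
    return [c for c in candidates
            if all(frozenset(s) in frequent_smaller
                   for s in combinations(c, len(c) - 1))]

# Hypothetical data: {A, B, D} is dropped because {B, D} is not frequent.
l2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("AD")}
c3 = [frozenset("ABC"), frozenset("ABD")]
print(prune_candidates(c3, l2))   # only {A, B, C} survives
```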


Example

• Suppose the items are A, B, C, D, E, F, .., X, Y, Z

• Suppose L1 is A, C, E, P, Q, S, T, V, W, X

• Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}

• Are you able to identify errors in the L2 list?

• What is C3?

• How to prune C3?

• C3 is {A, C, P}, {E, P, V}, {Q, S, X}


Hashing

The direct hashing and pruning (DHP) algorithm attempts to generate large itemsets efficiently and reduces the transaction database size.

When generating L1, the algorithm also generates all the 2-itemsets for each transaction, hashes them to a hash table and keeps a count.
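A minimal sketch of that combined first pass, assuming transactions are sets of item names. Python's built-in hash and the table size of 8 are stand-ins for whatever hash function and table size an implementation chooses; the hash function used in these slides is defined below.

```python
from collections import Counter
from itertools import combinations

def first_pass_with_hashing(transactions, n_buckets):
    """Single scan: count individual items (for L1) and, at the same time,
    hash every 2-itemset of each transaction into a table of bucket counts."""
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for t in transactions:
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            # hash() is only a stand-in for the chosen hash function
            bucket_counts[hash(pair) % n_buckets] += 1
    return item_counts, bucket_counts

print(first_pass_with_hashing([{"Bread", "Cheese", "Juice"},
                               {"Bread", "Milk"}], 8))
```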


Example

Consider the transaction database in the first table below, used in an earlier example. The second table shows all possible 2-itemsets for each transaction.

Transaction ID   Items
100              Bread, Cheese, Eggs, Juice
200              Bread, Cheese, Juice
300              Bread, Milk, Yogurt
400              Bread, Juice, Milk
500              Cheese, Juice, Milk

100   (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200   (B, C) (B, J) (C, J)
300   (B, M) (B, Y) (M, Y)
400   (B, J) (B, M) (J, M)
500   (C, J) (C, M) (J, M)


Hashing Example

The 2-itemsets in the table above are now hashed into the hash table below. The last column (C2) is not part of the hash table; it is included only to explain the technique.

Bit vector   Bucket number   Count   Pairs                   C2
1            0               3       (C, J) (B, Y) (M, Y)    (C, J)
0            1               1       (C, M)
0            2               1       (E, J)
0            3               0
0            4               2       (B, C)
1            5               3       (B, E) (J, M)           (J, M)
1            6               3       (B, J)                  (B, J)
1            7               3       (C, E) (B, M)           (B, M)


Hash Function Used

For each pair, a numeric value is obtained by first representing B by 1, C by 2, E by 3, J by 4, M by 5 and Y by 6. Each pair can then be represented by a two-digit number, for example (B, E) by 13 and (C, M) by 25.

The two-digit number is then reduced modulo 8 (divide by 8 and take the remainder); the result is the bucket address.

A count of the number of pairs hashed to each bucket is kept. Buckets whose count reaches the minimum support count have their bit in the bit vector set to 1; the others are set to 0.

All pairs in rows whose bit is 0 are removed, since none of them can be frequent.
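A minimal sketch of this hash function applied to the five transactions above. The minimum support count of 3 is an assumption chosen to be consistent with the bit vector shown in the table.

```python
from itertools import combinations

CODE = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}

def bucket(pair):
    """Form the two-digit number for a pair, e.g. (B, E) -> 13,
    and take it modulo 8 to get the bucket address."""
    a, b = sorted(pair, key=CODE.get)
    return (10 * CODE[a] + CODE[b]) % 8

# The five transactions of the earlier example, abbreviated to initials.
transactions = [{"B", "C", "E", "J"}, {"B", "C", "J"}, {"B", "M", "Y"},
                {"B", "J", "M"}, {"C", "J", "M"}]

counts = [0] * 8
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[bucket(pair)] += 1

min_count = 3   # assumed minimum support count for this example
bit_vector = [1 if c >= min_count else 0 for c in counts]
print(bit_vector)   # [1, 0, 0, 0, 0, 1, 1, 1], matching the table above
```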


Find C2

The major aim of the algorithm is to reduce the size of C2, so the hash table must be large enough to keep collisions low; collisions reduce the effectiveness of the hash table. That is what happened in the example above, where three of the eight buckets contained collisions and we still had to determine which of the colliding pairs were actually frequent.


Transaction Reduction

As discussed earlier, any transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets and such a transaction may be marked or removed.
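A minimal sketch of the reduction, assuming transactions are sets and the frequent k-itemsets are frozensets; the usage data is hypothetical.

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions that contain at least one frequent k-itemset;
    the rest cannot contribute to any frequent (k+1)-itemset."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k_itemsets)]

# Hypothetical usage: the middle transaction contains no frequent pair
# and is dropped.
pairs = [frozenset({"A", "B"}), frozenset({"B", "M"})]
print(reduce_transactions([{"B", "M", "T"}, {"T", "S"}, {"A", "B", "C"}],
                          pairs))
```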


Example

Frequent items (L1) are A, B, D, M, T. We cannot use these to eliminate any transactions, since every transaction contains at least one item of L1. The frequent pairs (L2) are {A, B} and {B, M}. How can we reduce the transactions using these?

TID Items bought

001 B, M, T, Y

002 B, M

003 T, S, P

004 A, B, C, D

005 A, B

006 T, Y, E

007 A, B, M

008 B, C, D, T, P

009 D, T, S

010 A, B, M


Partitioning

The set of transactions may be divided into a number of disjoint subsets (partitions). Each partition is then searched for frequent itemsets, called local frequent itemsets. How can information about the local frequent itemsets be used to find the frequent itemsets of the global set of transactions? In the example that follows, a set of transactions has been divided into two partitions. Find the frequent itemsets for each partition. Are these local frequent itemsets useful?


Example

[Slide figure: the transactions below, with items shown as item numbers, are split on the slide into two partitions.]

1 2 5 7
2 3 4 5
5 6 11
2 5 7 4 13
6 11 13 2 4
14 19
1 2 5 7 14
12 14 19
2 4 5 6 7
2 4 6 11 13
2 4 6 11 13
2 13
2 5 7 11 13
1 2 3
4 5 6
1 2 5 7
2 4 6 11 13
5 6 11 13


Partitioning

Phase 1
– Divide the n transactions into m partitions
– Find the frequent itemsets in each partition (the local frequent itemsets)
– Combine all local frequent itemsets to form the candidate itemsets

Phase 2
– Count the candidate itemsets over all n transactions to find the global frequent itemsets
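A minimal sketch of the two phases, under simplifying assumptions: partitions are formed round-robin, each partition is mined by brute-force enumeration of itemsets of up to three items (a real implementation would run Apriori on each partition), and min_support is given as a fraction of the number of transactions.

```python
from collections import Counter
from itertools import combinations

def local_frequent_itemsets(partition, min_count, max_size=3):
    """Brute-force local mining: count every itemset of up to max_size
    items in one partition and keep those meeting the local threshold."""
    counts = Counter()
    for t in partition:
        for k in range(1, max_size + 1):
            for s in combinations(sorted(t), k):
                counts[frozenset(s)] += 1
    return {s for s, c in counts.items() if c >= min_count}

def partitioned_mining(transactions, n_parts, min_support):
    """Phase 1: mine each partition with a proportionally scaled threshold.
    Phase 2: one scan of the full database counts the candidates globally."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    candidates = set()
    for p in parts:
        candidates |= local_frequent_itemsets(p, max(1, int(min_support * len(p))))
    global_counts = Counter()
    for t in transactions:
        for c in candidates:
            if c <= t:
                global_counts[c] += 1
    return {c: n for c, n in global_counts.items()
            if n >= min_support * len(transactions)}

print(partitioned_mining([{"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"A", "C"}],
                         n_parts=2, min_support=0.5))
```

Phase 1 cannot miss a globally frequent itemset, because such an itemset must be locally frequent in at least one of the partitions.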


Sampling

A random sample (usually chosen so that it fits in main memory) may be obtained from the overall set of transactions, and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets.

How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions?


Sampling

This is not guaranteed to be accurate; we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets.

The actual frequencies of the sample frequent itemsets are then obtained.

More than one sample could be used to improve accuracy.
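A minimal sketch of this idea, restricted to 2-itemsets for brevity; the sample fraction, the 20% reduction of the threshold, and the fixed random seed are illustrative assumptions rather than anything the slide specifies.

```python
import random
from collections import Counter
from itertools import combinations

def sample_frequent_pairs(transactions, min_support, sample_frac=0.3,
                          slack=0.8, seed=0):
    """Mine a random sample with a lowered threshold (slack * min_support)
    so borderline-frequent pairs are less likely to be missed, then verify
    the surviving candidates against the full database."""
    random.seed(seed)
    sample = [t for t in transactions if random.random() < sample_frac]
    lowered = slack * min_support * len(sample)
    pair_counts = Counter(frozenset(p) for t in sample
                          for p in combinations(sorted(t), 2))
    candidates = {p for p, c in pair_counts.items() if c >= lowered}
    true_counts = Counter(p for t in transactions
                          for p in candidates if p <= t)
    return {p: c for p, c in true_counts.items()
            if c >= min_support * len(transactions)}

# Illustrative call on a handful of transactions with 40% minimum support.
print(sample_frequent_pairs([{"A", "B"}, {"A", "B", "C"}, {"B", "C"},
                             {"A", "C"}, {"A", "B"}], min_support=0.4,
                            sample_frac=0.8))
```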


Problems with Association Rules Algorithms

• Users are overwhelmed by the number of rules identified: how can the rules be reduced to those relevant to the user's needs?

• The Apriori algorithm assumes sparsity, i.e. the number of items in each record is quite small.

• Some applications produce dense data, which may also have
  – many frequently occurring items
  – strong correlations
  – many items in each record


Problems with Association Rules

Also consider:

AB → C (90% confidence)
and A → C (92% confidence)

Clearly the first rule is of no use. We should look for more complex rules only if they are better than simple rules.
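A minimal sketch of one way to drop such redundant rules: a rule is removed if deleting any single item from its antecedent leaves a known rule that is at least as confident. The rule dictionary below is hypothetical.

```python
def prune_redundant_rules(rules):
    """rules maps (antecedent, consequent) pairs of frozensets to confidence.
    A rule is dropped if removing one item from its antecedent gives a rule
    that is at least as confident."""
    kept = {}
    for (lhs, rhs), conf in rules.items():
        simpler = ((lhs - {x}, rhs) for x in lhs if len(lhs) > 1)
        if any(s in rules and rules[s] >= conf for s in simpler):
            continue
        kept[(lhs, rhs)] = conf
    return kept

rules = {(frozenset("AB"), frozenset("C")): 0.90,
         (frozenset("A"), frozenset("C")): 0.92}
print(prune_redundant_rules(rules))   # only A -> C survives
```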


Top Down Approach

The algorithms considered so far were bottom-up, i.e. they start by looking at each frequent item, then each frequent pair, and so on.

Is it possible to design top-down algorithms that consider the largest groups of items first and then find the smaller groups? Let us first look at the itemset ABCD, which can be frequent only if all of its subsets are frequent.


Subsets of ABCD

ABCD

ABC   ABD   ACD   BCD

AB   AC   AD   BC   BD   CD

A   B   C   D


Closed and Maximal Itemsets

A frequent closed itemset is a frequent itemset X such that no proper superset of X has the same support count as X. A frequent itemset Y is maximal if it is not a proper subset of any other frequent itemset, i.e. no proper superset of Y is frequent. Every maximal frequent itemset is therefore closed, but a closed itemset is not necessarily maximal.


Closed and Maximal Itemsets

The maximal frequent itemsets uniquely determine all frequent itemsets, since every frequent itemset is a subset of some maximal one (they do not, however, give the support counts). The aim of an association rule algorithm can therefore be to find all the maximal frequent itemsets.


Closed and Maximal Itemsets

In the earlier example, we found that {B, D} and {B, C, D} had the same support of 8 while {C, D} had a support of 9. {C, D} is therefore a closed itemset but not a maximal one. On the other hand, {B, C, D} is frequent and no superset of it is frequent, so it is maximal as well as closed.
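A minimal sketch of checking the two definitions against a table of support counts; the dictionary holds only the itemsets mentioned above, so it is illustrative rather than a complete set of frequent itemsets.

```python
def closed_and_maximal(freq):
    """freq maps each frequent itemset (a frozenset) to its support count.
    Closed: no proper superset has the same support.
    Maximal: no proper superset is frequent at all."""
    closed, maximal = set(), set()
    for x, sup in freq.items():
        supersets = [y for y in freq if x < y]
        if all(freq[y] != sup for y in supersets):
            closed.add(x)
        if not supersets:
            maximal.add(x)
    return closed, maximal

# Mirrors the example above (only the itemsets mentioned are included).
freq = {frozenset("BD"): 8, frozenset("CD"): 9, frozenset("BCD"): 8}
closed, maximal = closed_and_maximal(freq)
# closed -> {C,D} and {B,C,D};  maximal -> {B,C,D} only
```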


Closed and maximal itemsets

[Diagram: the maximal frequent itemsets are a subset of the closed frequent itemsets, which are in turn a subset of all frequent itemsets.]


Performance Evaluation of Algorithms

• The FP-growth method was usually better than the best implementation of the Apriori algorithm.

• CHARM was also usually better than Apriori, and in some cases better than the FP-growth method.

• Apriori was generally better than the other algorithms when the required support was high, since high support leads to a small number of frequent itemsets, which suits the Apriori algorithm.

• At very low support, the number of frequent items became large and none of the algorithms were able to handle large frequent sets gracefully.

