Data Mining, Chapter 2: Association Rule Mining
G. K. Gupta, Faculty of Information Technology, Monash University




Introduction

As noted earlier, a huge amount of data is stored electronically in many retail outlets because of the barcoding of goods sold. It is natural to try to extract some useful information from these mountains of data.

A conceptually simple yet interesting technique is to find association rules from these large databases.

The problem was introduced by Rakesh Agrawal at IBM.


Introduction

• Association rule mining (or market basket analysis) searches for interesting customer habits by looking at associations between items.

• The classical example is a store in the USA that was reported to have discovered that people buying nappies also tend to buy beer.

• Applications in marketing, store layout, customer segmentation, medicine, finance, and many more.


A Simple Example

Consider the following ten transactions of a shop selling only nine items. We will return to it after explaining some terminology.

Transaction ID   Items
10               Bread, Cheese, Newspaper
20               Bread, Cheese, Juice
30               Bread, Milk
40               Cheese, Juice, Milk, Coffee
50               Sugar, Tea, Coffee, Biscuits, Newspaper
60               Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70               Bread, Cheese
80               Bread, Cheese, Juice, Coffee
90               Bread, Milk
100              Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper


Terminology

• Let the number of different items sold be n.

• Let the number of transactions be N.

• Let the set of items be {i1, i2, …, in}. The number of items may be large, perhaps several thousand.

• Let the set of transactions be {t1, t2, …, tN}. Each transaction ti contains a subset of items from the itemset {i1, i2, …, in}. These are the things a customer buys when they visit the supermarket. N is assumed to be large, perhaps in the millions.

• Not considering the quantities of items bought.


Terminology

• Want to find a group of items that tend to occur together frequently.

• The association rules are often written as X→Y meaning that whenever X appears Y also tends to appear. X and Y may be single items or sets of items but the same item does not appear in both.


Terminology

• Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears.

• The 1% presence of X and Y together is called the support (or prevalence) of the rule and 80% is called the confidence (or predictability) of the rule.

• These are measures of interestingness of the rule.

• Confidence denotes the strength of the association between X and Y. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.


Terminology

• Let the chance of finding an item X in the N transactions be x%; then we can say the probability of X is P(X) = x/100, since probability values always lie between 0.0 and 1.0.

• Now suppose we have two items X and Y with probabilities of P(X) = 0.2 and P(Y) = 0.1. What does the product P(X) x P(Y) mean?

• What is likely to be the chance of both items X and Y appearing together, that is P(X and Y) = P(X U Y)?


Terminology

• Suppose we know P(X U Y); what then is the probability of Y appearing if we know that X already appears? It is written as P(Y|X).

• The support for X→Y is the probability of both X and Y appearing together, that is P(X U Y).

• The confidence of X→Y is the conditional probability of Y appearing given that X exists. It is written as P(Y|X) and read as P of Y given X.


Terminology

• Sometimes the term lift is also used. Lift is defined as support(X U Y) / (P(X)P(Y)). P(X)P(Y) is the probability of X and Y appearing together if X and Y occur independently of each other.

• As an example, if the support of X and Y together is 1%, and X appears in 4% of the transactions while Y appears in 2%, then lift = 0.01 / (0.04 × 0.02) = 12.5. What does it tell us about X and Y? What if the lift was 1?
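These measures are easy to compute directly from a transaction list. The following is a minimal Python sketch, not taken from the slides; the function name measures and the variable names are only illustrative. It is shown applied to the rule Juice → Cheese on the ten-transaction example given earlier.

def measures(transactions, X, Y):
    """Support, confidence and lift of the rule X -> Y.

    transactions: a list of sets of items; X and Y: disjoint sets of items.
    """
    N = len(transactions)
    n_x = sum(1 for t in transactions if X <= t)           # transactions containing X
    n_y = sum(1 for t in transactions if Y <= t)           # transactions containing Y
    n_xy = sum(1 for t in transactions if (X | Y) <= t)    # transactions containing both

    support = n_xy / N                                     # P(X U Y)
    confidence = n_xy / n_x if n_x else 0.0                # P(Y | X)
    lift = support / ((n_x / N) * (n_y / N)) if n_x and n_y else 0.0
    return support, confidence, lift

# The ten-transaction shop example from the earlier slide:
transactions = [
    {"Bread", "Cheese", "Newspaper"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk", "Coffee"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Newspaper"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Milk", "Juice", "Newspaper"},
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice", "Coffee"},
    {"Bread", "Milk"},
    {"Sugar", "Tea", "Coffee", "Bread", "Milk", "Juice", "Newspaper"},
]
print(measures(transactions, {"Juice"}, {"Cheese"}))   # (0.3, 0.6, 1.2)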


The task

We want to find all associations that have at least p% support and at least q% confidence, such that

– all rules satisfying any user constraints are found

– the rules are found efficiently from large databases

– the rules found are actionable


An Example

Consider a furniture and appliances store that has 100,000 sales records. Let 5,000 records contain lounge suites (5%), and let 10,000 records contain TVs (10%). The support for lounge suites is therefore 5% and for TVs 10%.

It is expected that the 5,000 records containing lounge suites will contain about 10% TVs (i.e. 500). If the number of TV sales in the 5,000 records is in fact 2,500, then we have a lift of 5, and the confidence of the rule Lounge → TV is 50%.
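Spelling the calculation out: support(Lounge → TV) = 2,500/100,000 = 2.5%, confidence(Lounge → TV) = 2,500/5,000 = 50%, and lift = 0.025 / (0.05 × 0.10) = 5.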


Question

With 100,000 records, 5,000 records containing lounge suites (5%), 10,000 containing TVs (10%), and the number of TV sales in the 5,000 records containing lounge suites being 2,500, what are the support and the confidence of the following two rules?

lounge → TV and TV → lounge


Associations

It is worth noting:

• The lift of X → Y is the same as that of Y → X (why?). Would the confidence be the same?

• The only rules of interest are with very high or very low lift (why?)

• Items that appear on most transactions are of little interest (why?)

• Similar items should be combined to reduce the number of total items (why?)


Applications

Although we are only considering basket analysis, the technique has many other applications as noted earlier.

For example, we may have many patients coming to a hospital with various diseases and from various backgrounds. We therefore have a number of attributes (items), and we may be interested in knowing which attributes appear together and perhaps which attributes are frequently associated with some particular disease.


Example

We have 9 items and 10 transactions. Find the support and confidence for each pair. How many pairs are there?

Transaction ID   Items
10               Bread, Cheese, Newspaper
20               Bread, Cheese, Juice
30               Bread, Milk
40               Cheese, Juice, Milk, Coffee
50               Sugar, Tea, Coffee, Biscuits, Newspaper
60               Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70               Bread, Cheese
80               Bread, Cheese, Juice, Coffee
90               Bread, Milk
100              Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper


Example

There are 9 × 8 = 72 ordered pairs, half of them duplicates, leaving 36 distinct pairs. That is too many to analyse in a class, so we use an even simpler example with n = 4 (Bread, Cheese, Juice, Milk) and N = 4. We want to find rules with minimum support of 50% and minimum confidence of 75%.

Transaction ID   Items
100              Bread, Cheese
200              Bread, Cheese, Juice
300              Bread, Milk
400              Cheese, Juice, Milk
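As a check, here is a short brute-force Python sketch (not part of the original slides) that enumerates every possible itemset over the four items and counts how many of the four transactions contain it. The frequencies it prints are those shown in the table on the next slide.

from itertools import combinations

transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
items = sorted(set().union(*transactions))

# Enumerate every non-empty itemset (2^n - 1 of them) and count its frequency.
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        frequency = sum(1 for t in transactions if set(itemset) <= t)
        print(itemset, frequency)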


Example

N = 4 results in the following frequencies. All items have the 50% support we require. Only two pairs have the 50% support and no 3-itemset has the support.

We will look at the two pairs that have the minimum support to find if they have the required confidence.

Itemsets                       Frequency
Bread                          3
Cheese                         3
Juice                          2
Milk                           2
(Bread, Cheese)                2
(Bread, Juice)                 1
(Bread, Milk)                  1
(Cheese, Juice)                2
(Cheese, Milk)                 1
(Juice, Milk)                  1
(Bread, Cheese, Juice)         1
(Bread, Cheese, Milk)          0
(Bread, Juice, Milk)           0
(Cheese, Juice, Milk)          1
(Bread, Cheese, Juice, Milk)   0


Terminology

Items or itemsets that have the minimum support are called frequent. In our example, all four items and two pairs are frequent.

We will now determine if the two pairs {Bread, Cheese} and {Cheese, Juice} lead to association rules with 75% confidence.

Every pair {A, B} can lead to two rules A → B and B → A if both satisfy the minimum confidence. Confidence of A → B is given by the support for A and B together divided by the support of A.


Rules

We have four possible rules and their confidence is given as follows:

Bread → Cheese with confidence of 2/3 = 67%
Cheese → Bread with confidence of 2/3 = 67%
Cheese → Juice with confidence of 2/3 = 67%
Juice → Cheese with confidence of 2/2 = 100%

Therefore only the last rule, Juice → Cheese, has confidence above the minimum 75% and qualifies. Rules that have more than the user-specified minimum confidence are called confident.


Problems with Brute Force

This simple algorithm works well with four items, since there were only 16 combinations that we needed to look at, but if the number of items is, say, 100, the number of combinations is astronomically large. Even with 20 items the number of combinations is about a million, since the number of combinations is 2^n with n items (why?). The naïve algorithm can be improved to deal more effectively with larger data sets.


Problems with Brute Force

Can you think of an improvement over the brute force method in which we looked at every possible itemset combination? Naïve algorithms, even with improvements, do not work efficiently enough to deal with a large number of items and transactions. We now define a better algorithm.


The Apriori Algorithm

To find associations, the classical Apriori algorithm may be simply described as a two-step approach:

Step 1 ─ discover all frequent (single) items that have support above the minimum support required

Step 2 ─ use the set of frequent items to generate the association rules that have a high enough confidence level

Is this a reasonable algorithm?

A more formal description is given on the next slide.


The Apriori Algorithm

The step-by-step algorithm may be described as follows

1. Computing L1

This is Step 1. Scan all transactions. Find all frequent items that have support above the required p%. Let all of these frequent items be labeled L1.

2. Apriori-gen Function
This is Step 2. Use the frequent items L1 to build all possible item pairs like {Bread, Cheese} if Bread and Cheese are in L1. The set of these item pairs is called C2, the candidate set.


The Apriori Algorithm

3. Pruning
This is Step 3. Scan all transactions and find all frequent pairs in the candidate pair set C2. Let these frequent pairs be L2.

4. General rule
This is Step 4, a generalization of Step 2. Build the candidate set of k-itemsets, Ck, by combining frequent itemsets in the set Lk-1.


The Apriori Algorithm

5. Pruning
Step 5, a generalization of Step 3. Scan all transactions and find all frequent itemsets in Ck. Let these frequent itemsets be Lk.

6. Continue with Step 4 unless Lk is empty.

7. Stop if Lk is empty
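Putting the steps together, the following is an illustrative Python sketch of the level-wise search, not the exact algorithm from any particular implementation. Candidate generation here simply takes pairwise unions of the frequent (k-1)-itemsets; the proper apriori-gen join is described later in these slides.

from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    """Return a dict mapping every frequent itemset (frozenset) to its count.

    transactions: a list of sets of items.
    min_support:  the minimum number of transactions an itemset must appear in.
    """
    # Step 1: scan all transactions and find the frequent single items (L1).
    Lk = {}
    for item in set().union(*transactions):
        count = support_count(frozenset([item]), transactions)
        if count >= min_support:
            Lk[frozenset([item])] = count
    frequent = dict(Lk)

    k = 2
    while Lk:
        # Steps 2 and 4: build the candidate k-itemsets Ck from the frequent (k-1)-itemsets.
        Ck = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k}

        # Steps 3 and 5: scan all transactions and keep the candidates with enough support.
        Lk = {}
        for candidate in Ck:
            count = support_count(candidate, transactions)
            if count >= min_support:
                Lk[candidate] = count
        frequent.update(Lk)
        k += 1   # Steps 6 and 7: repeat until Lk is empty.
    return frequent

Applied to the five-transaction example on the next slide with min_support = 3, this should return the four frequent items together with the two pairs {Bread, Juice} and {Cheese, Juice}.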


An Example

This example is similar to the last example but we have added two more items and another transaction making five transactions and six items. We want to find association rules with 50% support and 75% confidence.

Transaction ID   Items
100              Bread, Cheese, Eggs, Juice
200              Bread, Cheese, Juice
300              Bread, Milk, Yogurt
400              Bread, Juice, Milk
500              Cheese, Juice, Milk


Example

First find L1. 50% support requires that each frequent item appear in at least three transactions. Therefore L1 is given by:

Item     Frequency
Bread    4
Cheese   3
Juice    4
Milk     3


Example

The candidate 2-itemsets or C2 therefore has six pairs (why?). These pairs and their frequencies are:

Item Pairs         Frequency
(Bread, Cheese)    2
(Bread, Juice)     3
(Bread, Milk)      2
(Cheese, Juice)    3
(Cheese, Milk)     1
(Juice, Milk)      2


Deriving Rules

L2 has only two frequent item pairs {Bread, Juice} and {Cheese, Juice}. After these two frequent pairs, there are no candidate 3-itemsets (why?) since we do not have two 2-itemsets that have the same first item (why?).

The two frequent pairs lead to the following possible rules:

Bread → Juice
Juice → Bread
Cheese → Juice
Juice → Cheese


Deriving Rules

The confidence of these rules is obtained by dividing the support for both items in the rule by the support of the item on the left hand side of the rule.

The confidence of the four rules therefore are

Bread → Juice: 3/4 = 75%
Juice → Bread: 3/4 = 75%
Cheese → Juice: 3/3 = 100%
Juice → Cheese: 3/4 = 75%

Since all of them have a minimum 75% confidence, they all qualify. We are done since there are no 3–itemsets.


Questions

1. How do we compute C3 from L2 or more generally Ci from Li-1?

2. If there were several members of the set C3, how do we derive L3 by pruning C3?

3. Assume that the 3-itemset {A, B, M} has the minimum support. What do we do next? How do we derive association rules?


Answers

To compute C3 from L2, or more generally Ci from Li-1, we join members of Li-1 to other members of Li-1 by first sorting the itemsets in lexicographic order and then joining those itemsets in which the first (i-2) items are common. Observe that if an itemset in C3 is (a, b, c), then L2 must have contained the itemsets (a, b) and (a, c), since all subsets of an itemset in C3 must be frequent. Why?

To repeat, an itemset in a candidate set Ci or a frequent set Li can be frequent only if every one of its subsets is frequent.
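The join-and-prune step just described can be sketched as follows; this is an illustrative sketch, and the function name apriori_gen and the toy L2 used in the demonstration are not from the slides.

from itertools import combinations

def apriori_gen(L_prev, k):
    """Build the candidate set Ck from the frequent (k-1)-itemsets in L_prev."""
    # Sort each itemset, and the whole collection, into lexicographic order.
    sorted_sets = sorted(tuple(sorted(s)) for s in L_prev)
    Ck = set()
    for a, b in combinations(sorted_sets, 2):
        if a[:k - 2] == b[:k - 2]:              # join: the first (k-2) items are common
            candidate = frozenset(a) | frozenset(b)
            # Prune: every (k-1)-subset of the candidate must itself be frequent.
            if len(candidate) == k and all(
                    frozenset(sub) in L_prev for sub in combinations(candidate, k - 1)):
                Ck.add(candidate)
    return Ck

# A tiny illustration:
L2 = {frozenset(p) for p in [("A", "B"), ("A", "M"), ("B", "M"), ("B", "X")]}
print(apriori_gen(L2, 3))   # only {A, B, M}; {B, M, X} is pruned because {M, X} is not in L2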


Answers

For example, {A, B, M} will be frequent only if {A, B}, {A,M} and {B, M} are frequent, which in turn requires each of A, B and M also to be frequent.


Example

If the set {A, B, M} has the minimum support, we can find association rules that satisfy the conditions of support and confidence by first generating all nonempty proper subsets of {A, B, M} and using each of them as the LHS, with the remaining items as the RHS.

Subsets of {A, B, M} are A, B, M, AB, AM, BM. Therefore possible rules involving all three items are A → BM, B → AM, M → AB, AB → M, AM → B and BM → A.

Now we need to test their confidence.


Answers

To test the confidence of the possible rules, we proceed as we have done before. We know that

confidence(A → B) = P(B|A) = P(A U B)/ P(A)

This confidence is the likelihood of finding B if A already has been found in a transaction. It is the ratio of the support for A and B together and support for A by itself.

The confidence of all these rules can thus be computed.
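The rule-generation step can be sketched in the same style; this is illustrative, and the name rules_from_itemset is not from the slides. The supports dictionary is assumed to map every frequent itemset to its support count, as returned by the apriori sketch given earlier.

from itertools import combinations

def rules_from_itemset(itemset, supports, min_confidence):
    """Generate confident rules LHS -> RHS from one frequent itemset (a frozenset)."""
    rules = []
    for r in range(1, len(itemset)):                 # every nonempty proper subset as LHS
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            rhs = itemset - lhs
            # confidence(LHS -> RHS) = support(LHS U RHS) / support(LHS)
            confidence = supports[itemset] / supports[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return rules

For the frequent itemset {A, B, M} this enumerates exactly the six candidate rules listed above and keeps those whose confidence meets the threshold.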


Efficiency

Consider an example of a supermarket database which might have several thousand items including 1000 frequent items and several million transactions.

Which part of the Apriori algorithm will be most expensive to compute? Why?


Efficiency

The algorithm used to construct the candidate set Ci in order to find the frequent set Li is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost of discovering the frequent itemsets, since the transactions must be scanned for each candidate. Given that the number of early candidate itemsets is very large, the initial iterations dominate the cost.

In a supermarket database with about 1000 frequent items, there will be almost half a million candidate pairs C2 that need to be searched for to find L2.
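(With f frequent items the number of candidate pairs is f(f - 1)/2, so 1,000 frequent items give 1000 × 999 / 2 = 499,500 pairs.)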


Efficiency

Generally the number of frequent pairs out of half a million pairs will be small and therefore (why?) the number of 3-itemsets should be small.

Therefore, it is the generation of the frequent set L2 that is the key to improving the performance of the Apriori algorithm.


Comment

In class examples we usually require high support, for example, 25%, 30% or even 50%. These support values are very high if the number of items and number of transactions is large. For example, 25% support in a supermarket transaction database means searching for items that have been purchased by one in four customers! Not many items would probably qualify.

Practical applications therefore deal with much smaller support, sometimes even down to 1% or lower.


Improving the Apriori Algorithm

Many techniques for improving the efficiency have been proposed:

• Pruning (already mentioned)
• Hashing-based technique
• Transaction reduction
• Partitioning
• Sampling
• Dynamic itemset counting


Pruning

Pruning can reduce the size of the candidate set Ck. We want to transform Ck into the set of frequent itemsets Lk. To reduce the work of checking, we may use the rule that every subset of a frequent itemset must itself be frequent, so any candidate in Ck with an infrequent subset can be removed.


Example

• Suppose the items are A, B, C, D, E, F, .., X, Y, Z
• Suppose L1 is A, C, E, P, Q, S, T, V, W, X

• Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}

• Are you able to identify errors in the L2 list?

• What is C3?

• How to prune C3?

• C3 is {A, C, P}, {E, P, V}, {Q, S, X}


Hashing

The direct hashing and pruning (DHP) algorithm attempts to generate large itemsets efficiently and reduces the transaction database size.

When generating L1, we also generate all the 2-itemsets for each transaction, hash them in a hash table and keep a count.


Example

How does hashing work for this example? Let us look at the possible 2-itemsets from the first transaction.

{1,2} --> hash address 2
{1,5} --> hash address 5
{1,7} --> hash address 7
{2,5} --> hash address 2
{2,7} --> hash address 6
{5,7} --> hash address 3

We have used a simple hashing function which involves multiplying the two item numbers and then mod 8. Note there are collisions.

Transactions (item numbers):
1, 2, 5, 7
2, 3, 4, 5
5, 6, 11
2, 5, 7, 4, 13
6, 11, 13, 2, 4
14, 19
1, 2, 5, 7, 14
12, 14, 19
2, 4, 5, 6, 7
2, 4, 6, 11, 13
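A minimal Python sketch of this counting pass, assuming items are numbered as in this example and using the same hash function (multiply the two item numbers and take mod 8); the name dhp_pass is only illustrative.

from itertools import combinations

def dhp_pass(transactions, table_size=8):
    """One DHP-style scan: count single items and hash-count all 2-itemsets."""
    item_counts = {}
    buckets = [0] * table_size                        # hash table of 2-itemset counts
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for i, j in combinations(sorted(t), 2):
            buckets[(i * j) % table_size] += 1        # collisions can only over-count
    return item_counts, buckets

# A pair {i, j} can be frequent only if its bucket count reaches the minimum support,
# so pairs that hash to low-count buckets can be dropped from C2 without counting them.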


Transaction Reduction

As discussed earlier, any transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets and such a transaction may be marked or removed.
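In code this reduction is a one-liner; a sketch, where Lk is assumed to be a collection of frequent k-itemsets represented as frozensets:

def reduce_transactions(transactions, Lk):
    """Keep only the transactions that contain at least one frequent k-itemset."""
    return [t for t in transactions if any(itemset <= t for itemset in Lk)]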


Example

Frequent items (L1) are A, B, D, M, T. We are not able to use these to eliminate any transactions, since every transaction contains at least one of the items in L1. The frequent pairs (L2) are {A, B} and {B, M}. How can we reduce the transactions using these?

TID   Items bought
001   B, M, T, Y
002   B, M
003   T, S, P
004   A, B, C, D
005   A, B
006   T, Y, E
007   A, B, M
008   B, C, D, T, P
009   D, T, S
010   A, B, M


Partitioning

The set of transactions may be divided into a number of disjoint subsets. Each partition is then searched for frequent itemsets. These frequent itemsets are called local frequent itemsets.

How can information about local frequent itemsets be used in finding the frequent itemsets of the global set of transactions?

In the example on the next slide, we have divided a set of transactions into two partitions. Find the frequent itemsets for each partition. Are these local frequent itemsets useful?


Example

[Figure: a set of transactions divided into two partitions, to be searched for local frequent itemsets.]


Partitioning

Phase 1
– Divide the n transactions into m partitions
– Find the frequent itemsets in each partition
– Combine all local frequent itemsets to form candidate itemsets

Phase 2
– Find the global frequent itemsets
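In outline, the two phases might look as follows. This is a hedged sketch that reuses the apriori and support_count helpers from the earlier sketch; the name partition_apriori and the slicing scheme are only illustrative.

def partition_apriori(transactions, m, min_support_fraction):
    """Two-phase partition algorithm (illustrative sketch)."""
    size = (len(transactions) + m - 1) // m
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase 1: find the local frequent itemsets of each partition and pool them
    # as candidates. A globally frequent itemset must be locally frequent in at
    # least one partition, so no frequent itemset is missed.
    candidates = set()
    for part in partitions:
        local_min = max(1, int(min_support_fraction * len(part)))
        candidates |= set(apriori(part, local_min))

    # Phase 2: one scan of the full set of transactions to find which candidates
    # are globally frequent.
    global_min = min_support_fraction * len(transactions)
    result = {}
    for candidate in candidates:
        count = support_count(candidate, transactions)
        if count >= global_min:
            result[candidate] = count
    return result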


Sampling

A random sample (usually chosen to be small enough to fit in main memory) may be obtained from the overall set of transactions, and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets.

How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions?


Sampling

Sampling is not guaranteed to be accurate, but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets.

The actual frequencies of the sample frequent itemsets are then obtained.

More than one sample could be used to improve accuracy.


Problems with Association Rules Algorithms

• Users are overwhelmed by the number of rules identified ─ how can the number of rules be reduced to those that are relevant to the user needs?

• The Apriori algorithm assumes sparsity, since the number of items on each record is quite small.

• Some applications produce dense data which may also have
  • many frequently occurring items
  • strong correlations
  • many items on each record


Problems with Association Rules

Also consider:

AB → C (90% confidence)
and A → C (92% confidence)

Clearly the first rule is of no use. We should look for more complex rules only if they are better than simple rules.


Bibliography

• R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, in Proc. of the ACM SIGMOD Conference, 1993, pp. 207-216.

• R. Ramakrishnan and J. Gehrke, Database Management Systems, 2nd ed., McGraw-Hill, 2000.

• M. J. A. Berry and G. Linoff, Mastering Data Mining: The Art and Science of Customer Relationship Management, Wiley, 2000.

• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.


Bibliography

• M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, Wiley, 1997.

• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.

• R. Agrawal, M. Mehta, J. Shafer, A. Arning, and T. Bollinger, The Quest Data Mining System, in Proc. 1996 Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, pp. 244-249, Aug. 1996.