
Mining Associations in Large Databases
Monash University
Semester 1, March 2006

Page 2: Associations or Basket Analysis

A huge amount of data is stored electronically in retail outlets because all goods sold are barcoded. It is natural to try to extract useful information from these mountains of data.

A conceptually simple yet interesting example is to find association rules from these large databases.

Page 3: Associations or Basket Analysis

• Association rules mining or market basket analysis searches for interesting customer habits by looking at associations.

• The classical example is the store that reportedly discovered that people buying nappies also tend to buy beer.

• Applications in marketing, store layout, customer segmentation, medicine, finance, etc.

Page 4: Basics

• Given a set of transactions T, each containing a subset of items from an item set {i1, i2, …, im}, the task is to discover association relationships or correlations among sets of items.

• Want to find a group of items that tend to occur together.

• Association rules are often written as X => Y, meaning that whenever X appears Y also tends to appear. X and Y may be single items or sets of items (with the same item not appearing in both).

Page 5: Basics

Although we are only considering basket analysis, the technique can be useful in other applications.

We assume that we have data from a supermarket like Woolworths, which may have several thousand items and many millions of transactions a week (gigabytes of data each week). Note that the quantities of items bought in a transaction are ignored.

Page 6: Terminology

• We assume that we have a set of transactions, each transaction being a list of items (e.g. books)

• Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears

• The 1% presence of X and Y together is called the support (or prevalence) of the rule and 80% is called the confidence (or predictability) of the rule

• These are measures of interestingness of the rule

Page 7: Terminology

• The support for X => Y is the probability of X and Y appearing together, that is P(X U Y) (here X U Y denotes a transaction containing all the items of both X and Y)

• The confidence of X=>Y is the conditional probability of Y appearing given that X exists, that is:

P(Y | X) = P(X U Y) / P(X)

• Confidence denotes the strength of the association. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.
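As an illustration, here is a minimal Python sketch of the two measures computed over a list of transactions (the function names and toy baskets are our own, not part of the slides):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, lhs, rhs):
    """P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

baskets = [{"X", "Y", "Z"}, {"X", "Y"}, {"X"}, {"Z"}]
print(support(baskets, {"X", "Y"}))       # 0.5  -> 50% support for X => Y
print(confidence(baskets, {"X"}, {"Y"}))  # 0.67 -> ~67% confidence for X => Y
```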

Page 8: The task

Want to find all associations which have at least p% support with at least q% confidence such that

– all rules satisfying user constraints are found
– the rules are found efficiently from large databases
– the rules found are actionable

Page 9: An Example

Consider a furniture and appliances store that has 100,000 sales records. Let 5,000 of the records contain lounge suites (5%) and 10,000 records contain TVs (10%). The support for lounge suites is therefore 5% and for TVs 10%.

If lounge suites and TVs were independent, the 5,000 records containing lounge suites would be expected to contain 10% TVs (i.e. 500). If the number of TV sales in those 5,000 records is in fact 2,500, then we have a lift of 5 (2,500/500), or a confidence of 50% (2,500/5,000).
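To make the arithmetic explicit, a quick check of these figures in Python (variable names are our own):

```python
total = 100_000   # sales records
lounge = 5_000    # records containing lounge suites (5%)
tv = 10_000       # records containing TVs (10%)
both = 2_500      # records containing both

expected = lounge * tv / total  # 500 TVs expected if the items were independent
lift = both / expected          # 2500 / 500 = 5.0
conf = both / lounge            # 2500 / 5000 = 0.5, i.e. 50% confidence
print(expected, lift, conf)
```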

Page 10: Question

With 100,000 records, 5,000 records containing lounge suites (5%), and 10,000 containing TVs (10%) and the number of TV sales in the 5,000 records containing lounge suite being 2,500, what is the support and the confidence in the following two rules?

lounge => TV
TV => lounge

Page 11: Associations

It is worth noting:

• The only rules of interest are those with a very high or very low lift (why?)

• Items that appear in most transactions are of little interest (why?)

• Similar items should be combined to reduce the total number of items (why?)

Page 12: The Apriori Algorithm

To find such associations, a simple two step approach may be used:

Step 1 - discover all frequent itemsets that have support above the minimum support required

Step 2 - use the frequent itemsets to generate association rules that have a high enough confidence level

Is this a reasonable algorithm? Why?

Page 13: The Apriori Algorithm

The algorithm works as follows:

• Scan all transactions and find all frequent items, those with support of at least x%. Let these be L1.

• Build item pairs from L1. This is the candidate set C2. Scan all transactions and find all frequent pairs in C2. Let these be L2.

• General rule: build candidate sets Ck of k items from Lk-1. Scan all transactions and find all frequent k-itemsets in Ck. Let these be Lk. Repeat until Lk is empty.

Page 14: The Apriori Algorithm

Consider an example with the following set of transactions

TID  Items bought
001  B, M, T, Y
002  B, M
003  T, S, P
004  A, B, C, D
005  A, B
006  T, Y, E
007  A, B, M
008  B, C, D, T, P
009  D, T, S
010  A, B, M
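Below is a minimal Python sketch of the level-wise procedure from the previous slide, run on these ten transactions (the naming is our own, and candidates here are formed by simple unions of two frequent sets rather than the stricter prefix join described on page 21). With 30% minimum support it reproduces the L1 and L2 computed on the following slides.

```python
transactions = [
    {"B", "M", "T", "Y"}, {"B", "M"}, {"T", "S", "P"},
    {"A", "B", "C", "D"}, {"A", "B"}, {"T", "Y", "E"},
    {"A", "B", "M"}, {"B", "C", "D", "T", "P"},
    {"D", "T", "S"}, {"A", "B", "M"},
]
min_count = 3  # 30% support over 10 transactions

def keep_frequent(candidates):
    """Keep the candidate itemsets contained in at least min_count transactions."""
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count}

L = keep_frequent({frozenset([i]) for t in transactions for i in t})  # L1
k = 2
while L:
    print(k - 1, sorted(sorted(s) for s in L))
    # Ck: unions of two frequent (k-1)-itemsets that yield exactly k items
    L = keep_frequent({a | b for a in L for b in L if len(a | b) == k})
    k += 1
```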

Page 15: The Apriori Algorithm

Assume that we wish to find associations with at least 30% support and 60% confidence. The list of frequent items is computed first. Only the following items qualify as frequent, since each appears in at least 30% of the transactions (that is, in at least 3 of the 10). This is the set L1.

Item  Frequency
A     4
B     7
D     3
M     4
T     5

These five items form 10 pairs as shown on the next slide.

Page 16: The Pairs

The following ten pairs are worth investigating since they are the only ones that could be frequent (why?):

{A, B}
{A, D}
{A, M}
{A, T}
{B, D}
{B, M}
{B, T}
{D, M}
{D, T}
{M, T}

Page 17: The Frequent Pairs

The frequency of the pairs is as follows:

Pair    Frequency
{A, B}  4
{A, D}  1
{A, M}  2
{A, T}  0
{B, D}  2
{B, M}  4
{B, T}  2
{D, M}  0
{D, T}  2
{M, T}  1

So, the only frequent pairs are {A, B} and {B, M}.

Page 18: Confidence

These two pairs have more than 30% support. No other pair can have 30% support or more. Why?

What about their confidence levels? A => B has a confidence of 100% (4/4), while B => A has a confidence of about 57% (4/7). B => M also has a confidence of about 57% (4/7), and M => B has 100% (4/4).

Therefore only A => B and M => B are acceptable. Are we finished? Is there any point going further? Why?

Page 19: Example

The frequent item pairs (that is, L2) are:

Pair    Frequency
{A, B}  4
{B, M}  4

These pairs are now used to generate the set of three-item candidates (i.e. C3). In this simple example only one such set is possible: {A, B, M}. Its frequency is only 2, which is below the 30% support threshold, so this three-item set does not qualify and L3 is empty.

Page 20: Questions

Firstly, how do we compute C3 from L2, or more generally Ci from Li-1?

Secondly, if there were several members of the set C3, how would we derive L3 by pruning C3?

Thirdly, suppose the set {A, B, M} did have 30% support or higher. What would we then do? What would it tell us? How would we derive association rules?

Page 21: Answers

To compute C3 from L2, or more generally Ci from Li-1, we join members of Li-1 to other members whenever their first (i-2) items are common.

When there are several members of the set C3, or Ci in general, we examine each candidate and consider each of its (i-1)-item subsets. For a candidate to be frequent, every such subset must itself be frequent, that is, appear in Li-1.
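A sketch of these two steps, join then prune, under our own naming (this stricter candidate generation is usually called apriori-gen):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Build Ck from Lk-1: join two (k-1)-itemsets whose first k-2 items
    agree, then prune candidates having an infrequent (k-1)-subset."""
    prev = sorted(sorted(s) for s in L_prev)
    frequent = {frozenset(s) for s in prev}
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:k - 2] == prev[j][:k - 2]:      # join condition
                candidates.add(frozenset(prev[i]) | frozenset(prev[j]))
    return {c for c in candidates                        # prune step
            if all(frozenset(s) in frequent
                   for s in combinations(sorted(c), k - 1))}

print(apriori_gen([{"A", "B"}, {"A", "C"}, {"B", "C"}], 3))
# {frozenset({'A', 'B', 'C'})}
```

Note that with this stricter join, the L2 of our running example ({A, B} and {B, M}) produces no C3 candidates at all, since the first items differ; the candidate {A, B, M} on page 19 arises from the looser pairwise-union view used there.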

Page 22: Example

If the set {A, B, M} did have 30% support or higher, we could then derive association rules from it that satisfy both the support and the confidence conditions.

Page 23: Example

We first generate all nonempty proper subsets of {A, B, M} and use each of them as the LHS, with the remaining items as the RHS. The nonempty proper subsets of {A, B, M} are A, B, M, AB, AM and BM. The possible rules therefore are A => BM, B => AM, M => AB, AB => M, AM => B and BM => A.

Page 24: Answers

To test the confidence of the possible rules A => BM, B => AM, M => AB, AB => M, AM => B and BM => A, we proceed as follows. We know that

confidence(A => B) = P(B | A) = P(A U B) / P(A)

Therefore the confidence of A => B can be computed as the ratio of the support for A and B together to the support for A alone. The confidence of all these rules can be computed in the same way.
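A sketch of this rule-generation step (our own naming; transactions are assumed to be sets of items, as in the earlier sketches):

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf=0.6):
    """Enumerate X => Y over all nonempty proper subsets X of `itemset`,
    keeping rules with confidence >= min_conf."""
    itemset = frozenset(itemset)
    count = lambda s: sum(1 for t in transactions if s <= t)
    whole = count(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), r)):
            conf = whole / count(lhs)   # support(X u Y) / support(X)
            if conf >= min_conf:
                rules.append((sorted(lhs), sorted(itemset - lhs), conf))
    return rules
```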

Page 25: Efficiency

Consider an example of a supermarket database which might have several thousand items, of which 1,000 are frequent, and several million transactions.

Which part of the Apriori algorithm will be expensive to compute? Why?

Page 26: Efficiency

The algorithm used to construct the candidate sets is crucial to the performance of the Apriori algorithm: the larger a candidate set, the higher the cost of discovering the frequent itemsets in it. Since the early candidate sets are very large, the initial iterations dominate the cost.

In a supermarket database with about 1,000 frequent items, there are about half a million candidate pairs (1000 x 999 / 2 = 499,500) to be counted.

Page 27: Efficiency

Generally only a small fraction of these half-million candidate pairs will be frequent, and therefore (why?) the number of candidate three-itemsets will be small.

Therefore, it is the generation of the large 2-itemsets that is the key to improving the performance of the Apriori algorithm.

Page 28: Improving the Apriori Algorithm

Many techniques exist for improving the efficiency:

• Pruning (already mentioned)
• Hashing based technique
• Transaction reduction
• Partitioning
• Sampling
• Dynamic itemset counting

Page 29: Pruning

Pruning can reduce the size of the candidate set Ck. We want to transform Ck into the set of frequent itemsets Lk. To reduce the checking work, we can use the fact that every subset of a frequent itemset must itself be frequent: any candidate in Ck with an infrequent (k-1)-subset can be discarded without counting it.

Page 30: Hashing

The direct hashing and pruning (DHP) algorithm attempts to generate large itemsets efficiently and reduces the transaction database size.
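A rough sketch of the core DHP idea, with our own naming: while transactions are scanned to count single items, every pair in each transaction is also hashed into a small table of counters, and a pair can enter C2 only if its bucket total reaches the support threshold.

```python
from itertools import combinations

def dhp_pair_filter(transactions, min_count, n_buckets=101):
    """Return a test telling whether a pair could possibly be frequent."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    def may_be_frequent(pair):
        # a pair whose bucket total is below min_count cannot be frequent,
        # since the bucket count is an upper bound on the pair's own count
        return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count
    return may_be_frequent
```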

Page 31: Transaction Reduction

As discussed earlier, any transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets and such a transaction may be marked or removed.
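As a sketch (our own naming), the reduction is a simple filter applied between passes:

```python
def reduce_transactions(transactions, frequent_k):
    """Drop transactions containing no frequent k-itemset; they cannot
    contribute to any frequent (k+1)-itemset in later passes."""
    return [t for t in transactions if any(c <= t for c in frequent_k)]
```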

Page 32: Partitioning

The set of transactions may be divided into a number of disjoint subsets. Each partition is then searched for frequent itemsets using a minimum count of s times the number of transactions in that partition, where s is the minimum support fraction. These frequent itemsets are called local frequent itemsets.

How can information about local itemsets be used in finding frequent itemsets of the global set of transactions?

Page 33: Sampling

A sample may be drawn from the overall set of transactions and searched for frequent itemsets using a minimum count of s times the number of transactions in the sample, where s is the minimum support fraction. These frequent itemsets are called sample frequent itemsets.

How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions?

Page 34: Problems with Association Rule Algorithms

• Users are overwhelmed by the number of rules identified. How can the rules be reduced to those relevant to the user's needs?

• The Apriori algorithm assumes sparsity, since the number of items in each transaction is small relative to the total number of items.

• Some applications produce dense data, which may have

– many frequently occurring items
– strong correlations
– many items in each record

Page 35: Problems with Association Rule Algorithms

Also consider:

AB => C (90% confidence)
A => C (92% confidence)

Clearly the first rule is of no use: it is more specific yet less confident. We should only look for more complex rules if they are better than the simpler rules they extend.

Page 36: References

• Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems, 2nd ed., McGraw-Hill, 2000.

• Takao Terano, Huan Liu, and Arbee L. P. Chen (eds.), Knowledge Discovery and Data Mining: Current Issues and New Applications, 4th Pacific-Asia Conference (PAKDD 2000), Kyoto, Japan, Springer, 2000.

• Michael J. A. Berry and Gordon Linoff, Mastering Data Mining: The Art and Science of Customer Relationship Management, Wiley, 2000.

• Robert Groth, Data Mining: Building Competitive Advantage, Prentice Hall, 2000.

Page 37: References

• N. Ebecken and C. A. Brebbia (eds.), Data Mining II, WIT Press, 2000.

• Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.

• Bhavani Thuraisingham, Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press, 1999.

• Michael J. A. Berry and Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, Wiley, 1997.

Page 38: References

• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.

• M. S. Chen, J. Han, and P. S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 866-883, 1996.

• R. Agrawal, M. Mehta, J. Shafer, A. Arning, and T. Bollinger, The Quest Data Mining System, Proc. 1996 Int. Conf. on Data Mining and Knowledge Discovery (KDD'96), Portland, Oregon, pp. 244-249, Aug. 1996.

Page 39: References

• R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami, An Interval Classifier for Database Mining Applications, Proc. VLDB-92, Vancouver, Canada, 1992, pp. 560-573.

• R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Databases, Proc. ACM SIGMOD, 1993, pp. 207-216.