Association Rule Mining

CIT366: Data Mining & Data Warehousing

Instructor: Bajuna Salehe

The Institute of Finance Management: Computing and IT Dept.

Upload: lenard-johnson

Post on 18-Dec-2015



What Is Association Mining?

Association rule mining:

– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

Frequent Pattern: A pattern (set of items, sequence, etc.) that occurs frequently in a database


Motivations For Association Mining

Motivation: Finding regularities in data

– What products were often purchased together?

• Beer and nappies!

– What are the subsequent purchases after buying a PC?

– What kinds of DNA are sensitive to this new drug?

– Can we automatically classify web documents?


Motivations For Association Mining (cont…)

Broad applications

– Basket data analysis, cross-marketing, catalog design, sale campaign analysis

– Web log (click stream) analysis, DNA sequence analysis, etc.

Market Basket Analysis

Market basket analysis is a typical example of frequent itemset mining

Customers' buying habits are discovered by finding associations between the different items that customers place in their “shopping baskets”

This information can be used to develop marketing strategies

Market Basket Analysis (cont…)

Application of Association

Association analysis can be used to improve marketing strategy by analysing frequent itemsets.

As the marketing manager of Company X, for instance, you would like to determine which items are frequently purchased together within the same transaction.

Application of Association

An example of such a rule, mined from the Company X transactional database, is

buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer.

A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.

Application of Association

A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together.

This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules.
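Read together, the two figures also fix how often computers are bought at all. Assuming the usual ratio definition of confidence (support of both itemsets divided by support of the antecedent, as given later in these slides), a quick check looks like this:

$$\mathit{support}(\text{computer}) = \frac{\mathit{support}(\text{computer} \cup \text{software})}{\mathit{confidence}(\text{computer} \Rightarrow \text{software})} = \frac{1\%}{50\%} = 2\%$$

So, under these figures, about 2% of all transactions include a computer, and half of those also include software.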

Application of Association

In addition to the marketing application, the same sort of question has the following uses:

Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.

Application of Association

Baskets = sentences, items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.

Association Rule Basic Concepts

Let $I$ be a set of items $\{I_1, I_2, I_3, \ldots, I_m\}$

Let $D$ be a database of transactions where each transaction $T$ is a set of items such that $T \subseteq I$

So, if $A$ is a set of items, a transaction $T$ is said to contain $A$ if and only if $A \subseteq T$

An association rule is an implication $A \Rightarrow B$ where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$


Association Rule Support & Confidence

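We say that an association rule $A \Rightarrow B$ holds in the transaction set $D$ with support, $s$, and confidence, $c$

The support of the association rule is given as the percentage of transactions in $D$ that contain both $A$ and $B$ (or $A \cup B$)

So, the support can be considered the probability $P(A \cup B)$, where $A \cup B$ denotes the union of the two itemsets, i.e. the transaction contains every item of $A$ and every item of $B$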


Association Rule Support & Confidence (cont…)

The confidence of the association rule is given as the percentage of transactions in $D$ containing $A$ that also contain $B$

So, the confidence can be considered the conditional probability $P(B|A)$

Association rules that satisfy minimum support and confidence values are said to be strong

Itemsets & Frequent Itemsets

An itemset is a set of items

A k-itemset is an itemset that contains k items

The occurrence frequency of an itemset is the number of transactions that contain the itemset

– This is also known more simply as the frequency, support count or count

An itemset is said to be frequent if the support count satisfies a minimum support count threshold

The set of frequent k-itemsets is denoted Lk

Support & Confidence Again

Support and confidence values can be calculated as follows:

$$\mathit{support}(A \Rightarrow B) = P(A \cup B) = \frac{\mathit{support\_count}(A \cup B)}{\mathit{count}()}$$

$$\mathit{confidence}(A \Rightarrow B) = P(B|A) = \frac{\mathit{support}(A \cup B)}{\mathit{support}(A)} = \frac{\mathit{support\_count}(A \cup B)}{\mathit{support\_count}(A)}$$
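The two ratios can be sketched in Python as follows. This is a minimal illustration, not code from the course: the function names mirror the formula terms, count() is read as the total number of transactions, and the `baskets` data is hypothetical, invented only to exercise the functions.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(a: Iterable[str], b: Iterable[str], transactions: List[Transaction]) -> float:
    """support(A => B) = support_count(A u B) / count()."""
    both = frozenset(a) | frozenset(b)
    return support_count(both, transactions) / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str], transactions: List[Transaction]) -> float:
    """confidence(A => B) = support_count(A u B) / support_count(A)."""
    a_set = frozenset(a)
    both = a_set | frozenset(b)
    return support_count(both, transactions) / support_count(a_set, transactions)


# Hypothetical baskets, invented purely for illustration.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk", "eggs"}),
]

print(support({"bread"}, {"milk"}, baskets))     # 3 of the 5 baskets
print(confidence({"bread"}, {"milk"}, baskets))  # 3 of the 4 bread baskets
```

Note that `confidence` divides by the antecedent's support count, so it assumes the antecedent occurs in at least one transaction.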


Mining Association Rules: An Example

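Transaction-id    Items bought
10                A, B, C
20                A, C
30                A, D
40                B, E, F

Frequent pattern    Support
{A}                 75%
{B}                 50%
{C}                 50%
{A, C}              50%

$$\mathit{support}(A \Rightarrow C) = \frac{\mathit{support\_count}(\{A\} \cup \{C\})}{\mathit{count}()} = 50\%$$

$$\mathit{confidence}(A \Rightarrow C) = \frac{\mathit{support\_count}(\{A\} \cup \{C\})}{\mathit{support\_count}(\{A\})} = 66.7\%$$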


Mining Association Rules: An Example (cont…)

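Transaction-id    Items bought
10                A, B, C
20                A, C
30                A, D
40                B, E, F

Frequent pattern    Support
{A}                 75%
{B}                 50%
{C}                 50%
{A, C}              50%

$$\mathit{support}(C \Rightarrow A) = \frac{\mathit{support\_count}(\{C\} \cup \{A\})}{\mathit{count}()} = 50\%$$

$$\mathit{confidence}(C \Rightarrow A) = \frac{\mathit{support\_count}(\{C\} \cup \{A\})}{\mathit{support\_count}(\{C\})} = 100\%$$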
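The figures in the two examples can be reproduced with a short brute-force count. The sketch below is only practical for data this small, since it enumerates every subset of every transaction. The 50% threshold is an assumption, inferred from which patterns appear in the frequent-pattern table.

```python
from collections import Counter
from itertools import combinations

# Transactions from the example tables above.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}
MIN_SUPPORT = 0.5  # assumption: inferred from the patterns listed in the table

# Count every non-empty subset of every transaction (fine for tiny data only).
counts = Counter()
for items in transactions.values():
    for size in range(1, len(items) + 1):
        for subset in combinations(sorted(items), size):
            counts[frozenset(subset)] += 1

n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= MIN_SUPPORT}


def confidence(antecedent, consequent):
    """support_count(A u C) / support_count(A), using the counts above."""
    return counts[frozenset(antecedent | consequent)] / counts[frozenset(antecedent)]


for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.0%}")
print(f"confidence(A => C) = {confidence({'A'}, {'C'}):.1%}")
print(f"confidence(C => A) = {confidence({'C'}, {'A'}):.1%}")
```

The two rules share the same support, because both are built from the itemset {A, C}. Their confidences differ because the denominators differ: A appears in three transactions, C in two.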

Association Rule Mining

So, in general, association rule mining can be reduced to the following two steps:

1. Find all frequent itemsets

• Each itemset will occur at least as frequently as a minimum support count

2. Generate strong association rules from the frequent itemsets

• These rules will satisfy minimum support and confidence measures

Combinatorial Explosion!

A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive

For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets

A frequent itemset of length 100 will contain the following number of frequent sub-itemsets:

$$\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$$
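The size of that sum is easy to confirm numerically. The snippet below is only a sanity check of the arithmetic, using Python's exact integer arithmetic and `math.comb` (Python 3.8+).

```python
from math import comb

n = 100
# Number of non-empty sub-itemsets of an itemset with n items.
total = sum(comb(n, k) for k in range(1, n + 1))

# The binomial coefficients for k = 0..n sum to 2**n; dropping k = 0 removes 1.
assert total == 2**n - 1
print(f"{total:.2e}")
```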

The Apriori Algorithm

Any subset of a frequent itemset must be frequent

– If {beer, nappy, nuts} is frequent, so is {beer, nappy}

– Every transaction having {beer, nappy, nuts} also contains {beer, nappy}

Apriori pruning principle: If any itemset is infrequent, none of its supersets should be generated/tested!
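The property behind the pruning principle can be checked directly on data. The sketch below uses hypothetical baskets (invented for illustration, not from the slides) and verifies that every proper subset of {beer, nappy, nuts} has a support count at least as large as the full itemset.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"beer", "nappy", "nuts"},
    {"beer", "nappy"},
    {"beer", "nuts"},
    {"nappy", "milk"},
    {"beer", "nappy", "nuts", "milk"},
]

full_itemset = {"beer", "nappy", "nuts"}
full_count = support_count(full_itemset, baskets)

# Every basket holding the full itemset also holds each of its subsets,
# so no subset can have a smaller support count.
for size in range(1, len(full_itemset)):
    for subset in combinations(sorted(full_itemset), size):
        sub_count = support_count(subset, baskets)
        assert sub_count >= full_count
        print(subset, sub_count, ">=", full_count)
```

Read in the other direction, this is the pruning rule: once a subset falls below the threshold, every superset must fall below it too.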

The Apriori Algorithm (cont…)

The Apriori algorithm is known as a candidate generation-and-test approach

Method:

– Generate length (k+1) candidate itemsets from length k frequent itemsets

– Test the candidates against the DB

Performance studies show the algorithm’s efficiency and scalability


The Apriori Algorithm: An Example

Database TDB

Tid    Items
10     A, C, D
20     B, C, E
30     A, B, C, E
40     B, E

1st scan: C1

Itemset    sup
{A}        2
{B}        3
{C}        3
{D}        1
{E}        3

L1

Itemset    sup
{A}        2
{B}        3
{C}        3
{E}        3

C2

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2

Itemset    sup
{A, B}     1
{A, C}     2
{A, E}     1
{B, C}     2
{B, E}     3
{C, E}     2

L2

Itemset    sup
{A, C}     2
{B, C}     2
{B, E}     3
{C, E}     2

C3

Itemset
{B, C, E}

3rd scan: L3

Itemset      sup
{B, C, E}    2
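The level-by-level walk through TDB can be sketched in Python as below. Two points are assumptions rather than something the slides state: the minimum support count of 2 is inferred from which itemsets are dropped at each step, and candidates are formed with a simple union-based join followed by a subset check, not the prefix-based self-join usually described for Apriori. The function name `apriori` and its signature are illustrative only.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {k: {itemset: support_count}} for every non-empty level L_k."""
    # 1st scan: count single items (C1) and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}

    levels, k = {}, 1
    while current:
        levels[k] = current
        # Candidate generation: unions of two frequent k-itemsets that have
        # k+1 items and whose k-item subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Next scan: count each candidate against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        k += 1
    return levels


tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for k, level in apriori(tdb, min_count=2).items():
    print(f"L{k}:", sorted((sorted(s), n) for s, n in level.items()))
```

Each pass through the `while` loop corresponds to one scan of the database in the tables above, which is why the number of scans grows with the length of the longest frequent itemset.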


Important Details Of The Apriori Algorithm

There are two crucial questions in implementing the Apriori algorithm:

– How to generate candidates?

– How to count supports of candidates?


Generating Candidates

There are 2 steps to generating candidates:

– Step 1: Self-joining Lk

– Step 2: Pruning

Example of candidate generation

– L3 = {abc, abd, acd, ace, bcd}

– Self-joining: L3*L3

• abcd from abc and abd

• acde from acd and ace

– Pruning:

• acde is removed because ade is not in L3

– C4 = {abcd}
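The join and prune steps in this example can be sketched as follows. The helper name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for the illustration. The join condition used here (two itemsets that agree on all but their last item) is the common textbook formulation.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Build candidates of size k+1 from a non-empty L_k of sorted tuples.

    Returns (joined, pruned): the candidates after the self-join and the
    candidates that survive pruning.
    """
    frequent = set(frequent_k)
    k = len(next(iter(frequent)))
    joined = set()
    # Step 1, self-join: merge two itemsets that share their first k-1 items.
    for p in frequent:
        for q in frequent:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Step 2, prune: drop candidates that have an infrequent k-item subset.
    pruned = {
        c for c in joined
        if all(sub in frequent for sub in combinations(c, k))
    }
    return joined, pruned


l3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
joined, c4 = apriori_gen(l3)
print(sorted("".join(c) for c in joined))  # candidates after the self-join
print(sorted("".join(c) for c in c4))      # candidates after pruning
```

Keeping items in a fixed order is what lets the join produce each candidate only once.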


How to Count Supports Of Candidates?

Why is counting supports of candidates a problem?

– The total number of candidates can be huge

– One transaction may contain many candidates

Method:

– Candidate itemsets are stored in a hash-tree

– Leaf node of hash-tree contains a list of itemsets and counts

– Interior node contains a hash table

– Subset function: finds all the candidates contained in a transaction
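A full hash-tree is beyond a short example, so the sketch below swaps in a simpler structure: a flat Python dictionary keyed by candidate itemset. It keeps the idea of a subset function (enumerate the k-item subsets of each transaction and look each one up), but it is not the hash-tree described above. The function name `count_supports` is hypothetical, and the data is the C2 step of the TDB example.

```python
from itertools import combinations


def count_supports(candidates, transactions, k):
    """Count k-item candidates by enumerating each transaction's k-subsets."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        # "Subset function": every k-subset of the transaction is looked up
        # in the dictionary; only subsets that are candidates get counted.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts


tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
c2 = [{"A", "B"}, {"A", "C"}, {"A", "E"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
for itemset, n in count_supports(c2, tdb, k=2).items():
    print(sorted(itemset), n)
```

This costs one dictionary lookup per k-subset of each transaction, regardless of how many candidates there are. For long transactions the number of subsets grows quickly, which is the situation a hash-tree is meant to handle better.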

Generating Association Rules

Once all frequent itemsets have been found, association rules can be generated

Strong association rules from a frequent itemset are generated by calculating the confidence in each possible rule arising from that itemset and testing it against a minimum confidence threshold
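As a rough sketch of this step, the code below splits one frequent itemset into every antecedent/consequent pair and keeps the rules whose confidence clears the threshold. The support counts come from the Apriori example on database TDB earlier in these slides. The function name `generate_rules` and the 70% threshold are assumptions chosen for illustration.

```python
from itertools import combinations


def generate_rules(itemset, support_counts, min_conf):
    """Yield (antecedent, consequent, confidence) for each strong rule."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            # confidence = support_count(whole itemset) / support_count(antecedent)
            conf = support_counts[itemset] / support_counts[antecedent]
            if conf >= min_conf:
                yield antecedent, itemset - antecedent, conf


# Support counts taken from the Apriori example (database TDB).
support_counts = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}

# min_conf = 0.7 is an arbitrary threshold chosen for illustration.
for a, b, conf in generate_rules("BCE", support_counts, min_conf=0.7):
    print(sorted(a), "=>", sorted(b), f"{conf:.0%}")
```

No further database scan is needed at this stage: every antecedent is a subset of a frequent itemset, so its support count was already recorded while the frequent itemsets were mined.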

Example

TID     List of item_IDs
T100    Coke, Crisps, Milk
T200    Crisps, Bread
T300    Crisps, Nappies
T400    Coke, Crisps, Bread
T500    Coke, Nappies
T600    Crisps, Nappies
T700    Coke, Nappies
T800    Coke, Crisps, Nappies, Milk
T900    Coke, Crisps, Nappies

ID    Item
I1    Coke
I2    Crisps
I3    Nappies
I4    Bread
I5    Milk
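As a starting point for working through this dataset, the sketch below tallies support counts for single items and for pairs. The transcript gives no threshold for this example, so the minimum support count of 2 is an assumption made only to show the filtering step.

```python
from collections import Counter
from itertools import combinations

# Transactions from the table above.
transactions = {
    "T100": ["Coke", "Crisps", "Milk"],
    "T200": ["Crisps", "Bread"],
    "T300": ["Crisps", "Nappies"],
    "T400": ["Coke", "Crisps", "Bread"],
    "T500": ["Coke", "Nappies"],
    "T600": ["Crisps", "Nappies"],
    "T700": ["Coke", "Nappies"],
    "T800": ["Coke", "Crisps", "Nappies", "Milk"],
    "T900": ["Coke", "Crisps", "Nappies"],
}
MIN_COUNT = 2  # assumption: no threshold is given for this example

# Support counts for single items and for pairs of items.
item_counts = Counter(item for items in transactions.values() for item in items)
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)

frequent_items = {i: c for i, c in item_counts.items() if c >= MIN_COUNT}
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= MIN_COUNT}

for item, count in sorted(frequent_items.items()):
    print(item, count)
for pair, count in sorted(frequent_pairs.items()):
    print(pair, count)
```

With this threshold all five items are frequent, while pairs such as (Bread, Coke) and (Milk, Nappies), which each occur in a single transaction, are dropped.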

Example


Challenges Of Frequent Pattern Mining

Challenges

– Multiple scans of transaction database

– Huge number of candidates

– Tedious workload of support counting for candidates

Improving Apriori: general ideas

– Reduce passes of transaction database scans

– Shrink number of candidates

– Facilitate support counting of candidates

Questions?