Association Rule Mining
CIT366: Data Mining & Data Warehousing
Instructor: Bajuna Salehe
The Institute of Finance Management, Computing and IT Dept.
What Is Association Mining?
Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
Frequent Pattern: A pattern (set of items, sequence, etc.) that occurs frequently in a database
Motivations For Association Mining
Motivation: Finding regularities in data
– What products were often purchased together?
• Beer and nappies!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
Motivations For Association Mining (cont…)
Broad applications
– Basket data analysis, cross-marketing, catalog design, sales campaign analysis
– Web log (click stream) analysis, DNA sequence analysis, etc.
Market Basket Analysis
Market basket analysis is a typical example of frequent itemset mining
Customers' buying habits are discovered by finding associations between the different items that customers place in their “shopping baskets”
This information can be used to develop marketing strategies
Application of Association
Association analysis can be used in promoting/improving marketing strategy by analysing frequent itemsets.
As the marketing manager of a company X, for instance, you would like to determine which items are frequently purchased together within the same transactions.
An example of such a rule, mined from the X Company transactional database, is

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.
A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together.
This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules.
In addition to the marketing application, the same sort of question has the following uses:
Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.
Baskets = sentences, items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.
Association Rule Basic Concepts
Let I be a set of items {I1, I2, I3,…, Im}
Let D be a database of transactions where each transaction T is a set of items such that T ⊆ I
So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
An association rule is an implication A ⇒ B where A ⊂ I, B ⊂ I, and A ∩ B = ∅
Association Rule Support & Confidence
We say that an association rule A ⇒ B holds in the transaction set D with support, s, and confidence, c
The support of the association rule is given as the percentage of transactions in D that contain both A and B (i.e., A ∪ B)
So, the support can be considered the probability P(A ∪ B)
Association Rule Support & Confidence (cont…)
The confidence of the association rule is given as the percentage of transactions in D containing A that also contain B
So, the confidence can be considered the conditional probability P(B|A)
Association rules that satisfy minimum support and confidence values are said to be strong
Itemsets & Frequent Itemsets
An itemset is a set of items
A k-itemset is an itemset that contains k items
The occurrence frequency of an itemset is the number of transactions that contain the itemset
– This is also known more simply as the frequency, support count or count
An itemset is said to be frequent if the support count satisfies a minimum support count threshold
The set of frequent k-itemsets is denoted Lk
Support & Confidence Again
Support and confidence values can be calculated as follows:

support(A ⇒ B) = P(A ∪ B) = support_count(A ∪ B) / total number of transactions

confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)
Mining Association Rules: An Example
Transaction-id Items bought
10 A, B, C
20 A, C
30 A, D
40 B, E, F
Frequent pattern Support
{A} 75%
{B} 50%
{C} 50%
{A, C} 50%
support(A ⇒ C) = support_count({A, C}) / count = 2/4 = 50%

confidence(A ⇒ C) = support_count({A, C}) / support_count({A}) = 2/3 ≈ 66.7%
Mining Association Rules: An Example (cont…)
Using the same transaction database and frequent patterns:

support(C ⇒ A) = support_count({A, C}) / count = 2/4 = 50%

confidence(C ⇒ A) = support_count({A, C}) / support_count({C}) = 2/2 = 100%
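Both calculations can be reproduced with a short Python sketch. The function names are my own; the data is the four-transaction table above.

```python
# A minimal sketch of the support/confidence calculations above,
# using the four-transaction database from the example.
transactions = [
    {"A", "B", "C"},  # 10
    {"A", "C"},       # 20
    {"A", "D"},       # 30
    {"B", "E", "F"},  # 40
]

def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(a, b):
    """support(A => B) = P(A ∪ B)."""
    return support_count(a | b) / len(transactions)

def confidence(a, b):
    """confidence(A => B) = support_count(A ∪ B) / support_count(A)."""
    return support_count(a | b) / support_count(a)

print(support({"A"}, {"C"}))     # 0.5
print(confidence({"A"}, {"C"}))  # 2/3 ≈ 0.667
print(confidence({"C"}, {"A"}))  # 1.0
```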
Association Rule Mining
So, in general, association rule mining can be reduced to the following two steps:
1. Find all frequent itemsets
• Each itemset will occur at least as frequently as a minimum support count
2. Generate strong association rules from the frequent itemsets
• These rules will satisfy minimum support and confidence measures
Combinatorial Explosion!
A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive
For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets
A frequent itemset of length 100 will contain the following number of frequent sub-itemsets:

C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30
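The figure can be verified directly with Python's standard library; this is a quick check, nothing beyond the arithmetic above.

```python
from math import comb

# Every non-empty subset of a frequent 100-itemset is itself frequent,
# so the count is the sum of C(100, k) for k = 1..100, i.e. 2**100 - 1.
n_subsets = sum(comb(100, k) for k in range(1, 101))
assert n_subsets == 2**100 - 1
print(f"{n_subsets:.3e}")  # ≈ 1.268e+30
```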
The Apriori Algorithm
Any subset of a frequent itemset must be frequent
– If {beer, nappy, nuts} is frequent, so is {beer, nappy}
– Every transaction having {beer, nappy, nuts} also contains {beer, nappy}
Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
The Apriori Algorithm (cont…)
The Apriori algorithm is known as a candidate generation-and-test approach
Method:
– Generate length (k+1) candidate itemsets from length k frequent itemsets
– Test the candidates against the DB
Performance studies show the algorithm’s efficiency and scalability
The Apriori Algorithm: An Example

Database TDB:
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

1st scan, C1:
Itemset sup
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3

L1 ({D} is pruned, sup < 2):
Itemset sup
{A} 2
{B} 3
{C} 3
{E} 3

C2 (self-join of L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan, C2 with counts:
Itemset sup
{A, B} 1
{A, C} 2
{A, E} 1
{B, C} 2
{B, E} 3
{C, E} 2

L2:
Itemset sup
{A, C} 2
{B, C} 2
{B, E} 3
{C, E} 2

C3 (self-join of L2): {B, C, E}

3rd scan, L3:
Itemset sup
{B, C, E} 2
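The level-wise process illustrated above can be sketched as a small Python implementation. This is my own naive variant: candidates are formed by set union and pruned with the Apriori principle, rather than the textbook prefix-based join, but it produces the same frequent itemsets.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Naive Apriori sketch: level-wise generate-and-test with subset pruning.
    Returns a dict mapping each frequent itemset (frozenset) to its count."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    freq = {}
    current = []
    for i in items:  # 1st scan: frequent 1-itemsets (L1)
        c = sum(1 for t in transactions if i in t)
        if c >= min_count:
            freq[frozenset([i])] = c
            current.append(frozenset([i]))
    k = 2
    while current:
        # Self-join L(k-1), then prune candidates with an infrequent (k-1)-subset
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        current = []
        for cand in candidates:  # k-th scan: count and test the candidates
            c = sum(1 for t in transactions if cand <= t)
            if c >= min_count:
                freq[cand] = c
                current.append(cand)
        k += 1
    return freq

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))  # includes {B, C, E} with count 2, as in the trace
```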
Important Details Of The Apriori Algorithm
There are two crucial questions in implementing the Apriori algorithm:
– How to generate candidates?
– How to count supports of candidates?
Generating Candidates
There are 2 steps to generating candidates:
– Step 1: Self-joining Lk
– Step 2: Pruning

Example of candidate generation:
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining: L3 * L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
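The join-and-prune steps can be sketched as follows. This is an illustrative version: it joins by set union and relies on pruning to discard spurious joins, whereas classic Apriori joins only itemsets sharing their first k−1 items; the resulting candidate set is the same.

```python
from itertools import combinations

def gen_candidates(Lk, k):
    """C(k+1) from Lk: self-join, then prune candidates with an infrequent k-subset."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(gen_candidates(L3, 3))  # C4 = {abcd}; acde is pruned since ade is not in L3
```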
How to Count Supports Of Candidates?
Why is counting supports of candidates a problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
Method:
– Candidate itemsets are stored in a hash-tree
– A leaf node of the hash-tree contains a list of itemsets and counts
– An interior node contains a hash table
– A subset function finds all the candidates contained in a transaction
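A hash-tree is beyond a short example, but the naive baseline it accelerates can be sketched as follows (illustrative code, not the slide's hash-tree method):

```python
def count_supports(candidates, transactions):
    """Naive support counting: test every candidate against every transaction.
    This is the O(|candidates| * |transactions|) baseline that the hash-tree
    and subset function described above are designed to speed up."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        t = set(t)
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(count_supports({frozenset("AC"), frozenset("BE")}, tdb))
```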
Generating Association Rules
Once all frequent itemsets have been found, association rules can be generated
Strong association rules from a frequent itemset are generated by calculating the confidence in each possible rule arising from that itemset and testing it against a minimum confidence threshold
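That procedure can be sketched as follows (a minimal version, assuming `freq` maps every frequent itemset, including all of its subsets, to its support count; the demo counts are illustrative):

```python
from itertools import combinations

def gen_rules(freq, min_conf):
    """For each frequent itemset S and each non-empty proper subset A, emit
    A => (S - A) when support_count(S) / support_count(A) >= min_conf."""
    rules = []
    for itemset, count in freq.items():
        for r in range(1, len(itemset)):
            for a in map(frozenset, combinations(itemset, r)):
                conf = count / freq[a]
                if conf >= min_conf:
                    rules.append((set(a), set(itemset - a), conf))
    return rules

# Illustrative counts: A in 3 transactions, C in 2, {A, C} in 2
freq = {frozenset("A"): 3, frozenset("C"): 2, frozenset({"A", "C"}): 2}
for lhs, rhs, conf in gen_rules(freq, 0.6):
    print(lhs, "=>", rhs, round(conf, 3))
```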
Example
TID List of item_IDs
T100 Coke, Crisps, Milk
T200 Crisps, Bread
T300 Crisps, Nappies
T400 Coke, Crisps, Bread
T500 Coke, Nappies
T600 Crisps, Nappies
T700 Coke, Nappies
T800 Coke, Crisps, Nappies, Milk
T900 Coke, Crisps, Nappies
ID Item
I1 Coke
I2 Crisps
I3 Nappies
I4 Bread
I5 Milk
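The slide gives no thresholds or mined results for this database, so the figures below are illustrative; the sketch simply counts supports in the nine transactions above.

```python
# Support counting over the nine-transaction example database above.
transactions = [
    {"Coke", "Crisps", "Milk"},            # T100
    {"Crisps", "Bread"},                   # T200
    {"Crisps", "Nappies"},                 # T300
    {"Coke", "Crisps", "Bread"},           # T400
    {"Coke", "Nappies"},                   # T500
    {"Crisps", "Nappies"},                 # T600
    {"Coke", "Nappies"},                   # T700
    {"Coke", "Crisps", "Nappies", "Milk"}, # T800
    {"Coke", "Crisps", "Nappies"},         # T900
]

def count(itemset):
    """Support count: transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# {Coke, Crisps} occurs in T100, T400, T800, T900
print(count({"Coke", "Crisps"}))                    # 4
# confidence(Coke => Crisps) = count({Coke, Crisps}) / count({Coke})
print(count({"Coke", "Crisps"}) / count({"Coke"}))  # 4/6 ≈ 0.667
```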
Challenges Of Frequent Pattern Mining
Challenges:
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates

Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink the number of candidates
– Facilitate support counting of candidates
Questions?