Pattern Recognition
Lecture 20: Data Mining 3
Dr. Richard Spillman, Pacific Lutheran University
Class Topics
• Introduction
• Decision Functions
• Cluster Analysis
• Statistical Decision Theory
• Feature Selection
• Machine Learning
• Neural Nets
• Data Mining
• Midterm One
• Midterm Two
• Project Presentations
Review
• Data Mining Example
• Preprocessing Data
• Preprocessing Tasks
Review – What is Data Mining?
• It is a method to get beyond the “tip of the iceberg”: the information directly available from a database
• Also known as Knowledge Discovery in Databases, Data Archeology, or Data Dredging
Review – Data Preprocessing
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• A lot of methods have been developed, but this is still an active area of research
OUTLINE
• Frequent Pattern Mining
• Association Rule Mining
• Algorithms
Frequent Pattern Mining
What is Frequent Pattern Mining?
• What is a frequent pattern?
– A pattern (a set of items, a sequence, etc.) whose elements occur together frequently in a database
• Frequent pattern: an important form of regularity
– What products were often purchased together? —
beers and diapers!
– What are the consequences of a hurricane?
– What is the next target after buying a PC?
Applications
• Market Basket Analysis
– Maintenance Agreements: what should the store do to boost Maintenance Agreement sales?
– Home Electronics: what other products should the store stock up on if it has a sale on Home Electronics?
• Attached mailing in direct marketing
• Detecting “ping-ponging” of patients
– transaction: patient
– item: doctor/clinic visited by the patient
– support of a rule: number of common patients
Frequent Pattern Mining Methods
• Association analysis
– Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, text database analysis
– Correlation or causality analysis
• Clustering
• Classification
– Association-based classification analysis
• Sequential pattern analysis
– Web log sequence, DNA analysis, etc.
Association Rule Mining
Association Rule Mining
• Given
– A database of customer transactions
– Each transaction is a list of items (purchased by a customer in one visit)
• Find all rules that correlate the presence of one set of items with that of another set of items
– Example: 98% of people who purchase tires and auto accessories also get automotive services done
– Any number of items may appear in the consequent/antecedent of a rule
– It is possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances)
Basic Concepts
• Rule form: “A => B [support s, confidence c]”
– Support: usefulness of the discovered rule
– Confidence: certainty of the detected association
– Rules that satisfy both min_sup and min_conf are called strong
• Examples:
– buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
– age(x, “30-34”) ^ income(x, “42K-48K”) => buys(x, “high resolution TV”) [2%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]
Rule Measures
• Find all the rules X & Y => Z with minimum confidence and support
– support, s: the probability that a transaction contains {X, Y, Z}
– confidence, c: the conditional probability that a transaction having {X, Y} also contains Z
[Venn diagram: customers who buy diapers, customers who buy beer, and the overlap of customers who buy both]
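In standard probability notation (not shown on the slide, but consistent with the definitions above), these two measures read:

```latex
\mathrm{support}(X \wedge Y \Rightarrow Z) = P(X \cup Y \cup Z), \qquad
\mathrm{confidence}(X \wedge Y \Rightarrow Z) = P(Z \mid X \cup Y)
  = \frac{P(X \cup Y \cup Z)}{P(X \cup Y)}
```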
Example: Support
• Given the following database:

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

For the rule A => C, support is the probability that a transaction contains both A and C. Two of the four transactions (2000 and 1000) contain both A and C, so the support is 50%.
Example: Confidence
• Given the same database:

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

For the rule A => C, confidence is the conditional probability that a transaction which contains A also contains C. Three transactions contain A, and two of them (2000 and 1000) also contain C, so the confidence is 66%.
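These two calculations are simple enough to check in a few lines of code. Below is a minimal Python sketch on the same toy database; the function names `support` and `confidence` are illustrative, not from the lecture:

```python
# Toy transaction database from the slides above
db = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent) = support(antecedent u consequent) / support(antecedent)
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"A", "C"}, db))       # 0.5    -> 50% support for A => C
print(confidence({"A"}, {"C"}, db))  # 0.666  -> 66% confidence for A => C
```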
Algorithms
Apriori Algorithm
• The Apriori method:
– Proposed by Agrawal & Srikant 1994
– A similar level-wise algorithm by Mannila et al. 1994
• Major idea:
– A subset of a frequent itemset must be frequent
• E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be; if any itemset is infrequent, its supersets cannot be frequent
– A powerful, scalable candidate-set pruning technique:
• It reduces the number of candidate k-itemsets dramatically (for k > 2)
Example
Given: min. support 50%, min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
Apriori Process
• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules
Apriori Algorithm
• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and hence should be removed
(Ck: candidate itemsets of size k; Lk: frequent itemsets of size k)
Example
Given: min. support 50%, min. confidence 50%

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to count each item (C1):
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

Keep itemsets meeting min. support (L1):
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Join L1 with itself to form C2:
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count each candidate:
itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

Keep the frequent pairs (L2):
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Join and prune L2 to form C3:
{2 3 5}

Scan D (L3):
itemset    sup.
{2 3 5}    2
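Here is a compact, runnable Python sketch of this level-wise trace. It simplifies the classic join step (it unions any two frequent (k-1)-itemsets and lets the prune step discard candidates with infrequent subsets); the names `apriori` and `min_support` are illustrative:

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise search: build candidate k-itemsets from frequent (k-1)-itemsets
    n = len(transactions)

    def frequent(candidates):
        # Keep candidates whose support (fraction of transactions) meets the threshold
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}

    L = frequent({frozenset([i]) for t in transactions for i in t})  # L1
    result, k = set(L), 2
    while L:
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(C)
        result |= L
        k += 1
    return result

# Database D from the trace above (TIDs 100-400)
D = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
     frozenset({1, 2, 3, 5}), frozenset({2, 5})]
print(apriori(D, 0.5))  # yields L1, L2, and L3 = {2, 3, 5}, as in the trace
```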
Generating the Candidate Set
• In the example, how do you go from L2 to C3?

L2:
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3:
itemset
{2 3 5}

• For example, if L3 = {abc, abd, acd, ace, bcd}:
– Self-joining (L3 * L3): abcd from abc and abd; acde from acd and ace
– Pruning: acde is removed because ade is not in L3
– So C4 = {abcd}
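A short Python sketch of this self-join and prune, writing the itemsets as sorted strings just as the slide does (the variable names are illustrative):

```python
from itertools import combinations

L3 = ["abc", "abd", "acd", "ace", "bcd"]  # frequent 3-itemsets, items sorted

# Join step (L3 * L3): merge two itemsets that agree on their first k-1 = 2 items
joined = {a + b[-1] for a in L3 for b in L3
          if a[:-1] == b[:-1] and a[-1] < b[-1]}
print(sorted(joined))  # ['abcd', 'acde']

# Prune step: keep a candidate only if every 3-item subset is also in L3
C4 = [c for c in joined
      if all("".join(s) in L3 for s in combinations(c, 3))]
print(C4)  # ['abcd'] -- 'acde' is pruned because 'ade' is not in L3
```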
Generating Strong Association Rules
• Confidence(A => B) = Prob(B|A) = support(A ∪ B) / support(A)
• Example:

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L3:
itemset    sup.
{2 3 5}    2

Possible rules:
2 and 3 => 5   confidence 2/2 = 100%
2 and 5 => 3   confidence 2/3 = 66%
3 and 5 => 2   confidence 2/2 = 100%
2 => 3 and 5   confidence 2/3 = 66%
3 => 2 and 5   confidence 2/3 = 66%
5 => 3 and 2   confidence 2/3 = 66%
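This enumeration can be reproduced mechanically: for each frequent itemset, try every non-empty proper subset as the antecedent and compute the confidence. A minimal sketch, reusing database D from above (`support_count` is an illustrative helper):

```python
from itertools import combinations

D = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
     frozenset({1, 2, 3, 5}), frozenset({2, 5})]

def support_count(itemset):
    # Number of transactions in D containing every item of `itemset`
    return sum(1 for t in D if itemset <= t)

freq = frozenset({2, 3, 5})  # the frequent itemset L3, support count 2
for r in range(1, len(freq)):
    for subset in combinations(freq, r):
        a = frozenset(subset)  # antecedent; the consequent is the rest
        conf = support_count(freq) / support_count(a)
        print(f"{sorted(a)} => {sorted(freq - a)}  confidence {conf:.0%}")
# Prints 100% for {2,3} => {5} and {3,5} => {2}; 67% elsewhere
# (2/3 rounds to 67% here; the slide truncates it to 66%)
```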
Possible Quiz
What is a frequent pattern?
Define support and confidence.
What is the basic principle of the Apriori algorithm?