Pattern Recognition
Lecture 20: Data Mining 3
Dr. Richard Spillman, Pacific Lutheran University
Class Topics
• Introduction
• Decision Functions
• Cluster Analysis
• Statistical Decision Theory
• Feature Selection
• Machine Learning
• Neural Nets
• Data Mining
• Midterm One
• Midterm Two
• Project Presentations
Review
• Data Mining Example
• Preprocessing Data
• Preprocessing Tasks
Review – What is Data Mining?
• It is a method to get beyond the “tip of the iceberg”: the information directly available from a database
• Also known as Knowledge Discovery in Databases, Data Archeology, or Data Dredging
Review – Data Preprocessing
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• A lot of methods have been developed, but this is still an active area of research
OUTLINE
• Frequent Pattern Mining
• Association Rule Mining
• Algorithms
Frequent Pattern Mining
What is Frequent Pattern Mining?
• What is a frequent pattern?
– A pattern (a set of items, a sequence, etc.) whose elements occur together frequently in a database
• Frequent pattern: an important form of regularity
– What products were often purchased together? —
beers and diapers!
– What are the consequences of a hurricane?
– What is the next target after buying a PC?
Applications
• Market Basket Analysis
– Maintenance Agreements: what should the store do to boost Maintenance Agreement sales?
– Home Electronics: what other products should the store stock up on if it has a sale on Home Electronics?
• Attached mailing in direct marketing
• Detecting “ping-ponging” of patients
– transaction: patient
– item: doctor/clinic visited by the patient
– support of a rule: number of common patients
Frequent Pattern Mining Methods
• Association analysis
– Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, text database analysis
– Correlation or causality analysis
• Clustering
• Classification
– Association-based classification analysis
• Sequential pattern analysis
– Web log sequence, DNA analysis, etc.
Association Rule Mining
Association Rule Mining
• Given
– A database of customer transactions
– Each transaction is a list of items (purchased by a customer in one visit)
• Find all rules that correlate the presence of one set of items with that of another set of items
– Example: 98% of people who purchase tires and auto accessories also get automotive services done
– Any number of items may appear in the consequent/antecedent of a rule
– It is possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances)
Basic Concepts
• Rule form: “A => B [support s, confidence c]”
– Support: usefulness of the discovered rule
– Confidence: certainty of the detected association
– Rules that satisfy both min_sup and min_conf are called strong
• Examples:
– buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
– age(x, “30-34”) ^ income(x, “42K-48K”) => buys(x, “high resolution TV”) [2%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]
Rule Measures
• Find all the rules X & Y => Z with minimum confidence and support
– support, s: the probability that a transaction contains {X, Y, Z}
– confidence, c: the conditional probability that a transaction having {X, Y} also contains Z
[Venn diagram: customers who buy diapers, customers who buy beer, and the overlap of customers who buy both]
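In standard probability notation (not shown on the slide, but consistent with the definitions above), these two measures read:

```latex
\mathrm{support}(X \wedge Y \Rightarrow Z) = P(X \cup Y \cup Z), \qquad
\mathrm{confidence}(X \wedge Y \Rightarrow Z) = P(Z \mid X \cup Y)
  = \frac{P(X \cup Y \cup Z)}{P(X \cup Y)}
```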
Example: Support
• Given the following database:

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

For the rule A => C, support is the probability that a transaction contains both A and C. Two of the four transactions (2000 and 1000) contain both A and C, so the support is 50%.
Example: Confidence
• Given the same database:

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

For the rule A => C, confidence is the conditional probability that a transaction which contains A also contains C. Three transactions contain A, and two of them (2000 and 1000) also contain C, so the confidence is 66%.
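These two calculations are simple enough to check in a few lines of code. Below is a minimal Python sketch on the same toy database; the function names `support` and `confidence` are illustrative, not from the lecture:

```python
# Toy transaction database from the slides above
db = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent) = support(antecedent u consequent) / support(antecedent)
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"A", "C"}, db))       # 0.5    -> 50% support for A => C
print(confidence({"A"}, {"C"}, db))  # 0.666  -> 66% confidence for A => C
```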
Algorithms
Apriori Algorithm
• The Apriori method:
– Proposed by Agrawal & Srikant 1994
– A similar level-wise algorithm by Mannila et al. 1994
• Major idea:
– A subset of a frequent itemset must be frequent
• E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be; if any itemset is infrequent, its supersets cannot be frequent
– A powerful, scalable candidate-set pruning technique:
• It reduces the number of candidate k-itemsets dramatically (for k > 2)
Example
Given: min. support 50%, min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
Apriori Process
• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules
Apriori Algorithm
• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and hence should be removed
(Ck: candidate itemsets of size k; Lk: frequent itemsets of size k)
Example
Given: min. support 50%, min. confidence 50%

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to count each item (C1):
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

Keep itemsets meeting min. support (L1):
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Join L1 with itself to form C2:
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count each candidate:
itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

Keep the frequent pairs (L2):
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Join and prune L2 to form C3:
{2 3 5}

Scan D (L3):
itemset    sup.
{2 3 5}    2
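Here is a compact, runnable Python sketch of this level-wise trace. It simplifies the classic join step (it unions any two frequent (k-1)-itemsets and lets the prune step discard candidates with infrequent subsets); the names `apriori` and `min_support` are illustrative:

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise search: build candidate k-itemsets from frequent (k-1)-itemsets
    n = len(transactions)

    def frequent(candidates):
        # Keep candidates whose support (fraction of transactions) meets the threshold
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}

    L = frequent({frozenset([i]) for t in transactions for i in t})  # L1
    result, k = set(L), 2
    while L:
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(C)
        result |= L
        k += 1
    return result

# Database D from the trace above (TIDs 100-400)
D = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
     frozenset({1, 2, 3, 5}), frozenset({2, 5})]
print(apriori(D, 0.5))  # yields L1, L2, and L3 = {2, 3, 5}, as in the trace
```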
Generating the Candidate Set
• In the example, how do you go from L2 to C3?

L2:
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3:
itemset
{2 3 5}

• For example, if L3 = {abc, abd, acd, ace, bcd}:
– Self-joining (L3 * L3): abcd from abc and abd; acde from acd and ace
– Pruning: acde is removed because ade is not in L3
– So C4 = {abcd}
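A short Python sketch of this self-join and prune, writing the itemsets as sorted strings just as the slide does (the variable names are illustrative):

```python
from itertools import combinations

L3 = ["abc", "abd", "acd", "ace", "bcd"]  # frequent 3-itemsets, items sorted

# Join step (L3 * L3): merge two itemsets that agree on their first k-1 = 2 items
joined = {a + b[-1] for a in L3 for b in L3
          if a[:-1] == b[:-1] and a[-1] < b[-1]}
print(sorted(joined))  # ['abcd', 'acde']

# Prune step: keep a candidate only if every 3-item subset is also in L3
C4 = [c for c in joined
      if all("".join(s) in L3 for s in combinations(c, 3))]
print(C4)  # ['abcd'] -- 'acde' is pruned because 'ade' is not in L3
```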
Generating Strong Association Rules
• Confidence(A => B) = Prob(B|A) = support(A ∪ B) / support(A)
• Example:

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L3:
itemset    sup.
{2 3 5}    2

Possible rules:
2 and 3 => 5   confidence 2/2 = 100%
2 and 5 => 3   confidence 2/3 = 66%
3 and 5 => 2   confidence 2/2 = 100%
2 => 3 and 5   confidence 2/3 = 66%
3 => 2 and 5   confidence 2/3 = 66%
5 => 3 and 2   confidence 2/3 = 66%
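This enumeration can be reproduced mechanically: for each frequent itemset, try every non-empty proper subset as the antecedent and compute the confidence. A minimal sketch, reusing database D from above (`support_count` is an illustrative helper):

```python
from itertools import combinations

D = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
     frozenset({1, 2, 3, 5}), frozenset({2, 5})]

def support_count(itemset):
    # Number of transactions in D containing every item of `itemset`
    return sum(1 for t in D if itemset <= t)

freq = frozenset({2, 3, 5})  # the frequent itemset L3, support count 2
for r in range(1, len(freq)):
    for subset in combinations(freq, r):
        a = frozenset(subset)  # antecedent; the consequent is the rest
        conf = support_count(freq) / support_count(a)
        print(f"{sorted(a)} => {sorted(freq - a)}  confidence {conf:.0%}")
# Prints 100% for {2,3} => {5} and {3,5} => {2}; 67% elsewhere
# (2/3 rounds to 67% here; the slide truncates it to 66%)
```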
Possible Quiz
What is a frequent pattern?
Define support and confidence.
What is the basic principle of the Apriori algorithm?