A Comparative Study of Data Mining Algorithms to Generate Frequent Itemsets and Association Rules


  • 8/8/2019 A Comparative Study of Data Mining Algorithms To

    1/31

A Comparative Study of Data Mining Algorithms to Generate Frequent Itemsets and Association Rules

    Anupma Sangwan


    What is Data Mining?

    Many definitions exist, for example:

    The extraction of implicit, previously unknown, and potentially useful information from data.

    The task of discovering interesting patterns from vast amounts of data.


    What is (not) Data Mining?

    What is Data Mining?

    Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area).

    Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest vs. Amazon.com).

    What is not Data Mining?

    Looking up a phone number in a phone directory.

    Querying a Web search engine for information about Amazon.


    Data Mining Tasks

    Prediction Methods: use some variables to predict unknown or future values of other variables.

    Description Methods: find human-interpretable patterns that describe the data.


    Data Mining Tasks...

    Classification [Predictive]

    Clustering [Descriptive]

    Association Rule Discovery [Descriptive]

    Sequential Pattern Discovery [Descriptive]

    Regression [Predictive]

    Deviation Detection [Predictive]


    What Is Association Mining?

    Association rule mining:

    Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

    Frequent pattern: a pattern (a set of items, a sequence, etc.) that occurs frequently in a database.

    Motivation: finding regularities in data.

    Which products were often purchased together? Beer and diapers?!

    What are the subsequent purchases after buying a PC?


    Association Rules

    An Example

    Market-basket model

    Look for combinations of products

    Put the SHOES near the SOCKS so that if a customer buys one, they will buy the other.


    Association Rules: Purpose

    To provide rules that correlate the presence of one set of items with another set of items.


    Basic Concepts & Terms in Association Rules

        Transaction-id | Items bought
        10             | A, B, C
        20             | A, C
        30             | A, D
        40             | B, E, F

    Itemset: X = {x1, ..., xk}

    Find all rules X ⇒ Y with minimum support and confidence:

    support, s: the probability that a transaction contains X ∪ Y.

    confidence, c: the conditional probability that a transaction having X also contains Y.

    Let min_support = 50%, min_conf = 50%:

    A ⇒ C (support 50%, confidence 66.7%)
    C ⇒ A (support 50%, confidence 100%)

    (Illustration from the slide: some customers buy diapers, some buy beer, and some buy both.)
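The support and confidence figures above can be checked directly. A minimal sketch (function names are illustrative, not from the study), using the four-transaction table:

```python
# Checking the example's support and confidence numbers directly.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs): P(t contains rhs | t contains lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))       # 0.5      -> s = 50%
print(confidence({"A"}, {"C"}))  # 0.666... -> c(A ⇒ C) = 66.7%
print(confidence({"C"}, {"A"}))  # 1.0      -> c(C ⇒ A) = 100%
```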


    Mining Association Rules: Example

    Min. support 50%, min. confidence 50%.

        Transaction-id | Items bought
        10             | A, B, C
        20             | A, C
        30             | A, D
        40             | B, E, F

        Frequent pattern | Support
        {A}              | 75%
        {B}              | 50%
        {C}              | 50%
        {A, C}           | 50%

    For the rule A ⇒ C:

    support = support({A, C}) = 50%

    confidence = support({A, C}) / support({A}) = 66.7%


    Frequent Itemset Algorithms

    Some of the algorithms that generate frequent itemsets are as follows:

    AIS Algorithm

    SETM Algorithm

    Apriori Algorithm

    FP-Growth Algorithm

    AprioriTID Algorithm


    Discovering the Association Rules

    Find all frequent itemsets (itemsets with at least the minimum support).

    Use these frequent itemsets to generate rules.


    Discovering Large Itemsets

    Multiple passes over the data.

    First pass: count the support of individual items.

    Each subsequent pass:

    Generate candidates using the previous pass's large itemsets.

    Go over the data and check the actual support of the candidates.

    Stop when no new large itemsets are found.


    Apriori Algorithm

    The first scalable algorithm for association rule mining; an improvement over the AIS and SETM algorithms (Agrawal and Srikant, 1994).


    Apriori: A Candidate Generation-and-Test Approach

    Any subset of a frequent itemset must be frequent:

    if {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction having {beer, diaper, nuts} also contains {beer, diaper}.

    Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested!

    Method:

    generate length-(k+1) candidate itemsets from length-k frequent itemsets, and

    test the candidates against the DB.


    Apriori Algorithm: Pseudo Code

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

        L1 = {frequent items};                          // count item occurrences
        for (k = 1; Lk != ∅; k++) do begin
            Ck+1 = candidates generated from Lk;        // generate new (k+1)-itemset candidates
            for each transaction t in database do
                increment the count of all candidates
                in Ck+1 that are contained in t;        // find the support of all the candidates
            Lk+1 = candidates in Ck+1 with min_support; // take only those with support over minsup
        end
        return ∪k Lk;

    Join step: Ck is generated by joining Lk-1 with itself.

    Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
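The pseudo code above can be turned into a small runnable sketch. This is an illustrative implementation, not the study's program; `apriori` and its signature are assumptions:

```python
# An illustrative, runnable sketch of the level-wise Apriori loop.
from itertools import combinations

def apriori(db, min_support):
    """db: list of transactions (sets of items); min_support: fraction.
    Returns {frozenset(itemset): support count} for all frequent itemsets."""
    needed = min_support * len(db)
    counts = {}
    for t in db:                      # first pass: count individual items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= needed}
    result = dict(frequent)
    k = 1
    while frequent:
        candidates = set()            # join step: combine pairs from L_k
        for a in frequent:
            for b in frequent:
                u = a | b
                if len(u) == k + 1 and all(
                    frozenset(s) in frequent for s in combinations(u, k)
                ):                    # prune step: all k-subsets frequent
                    candidates.add(u)
        counts = {c: 0 for c in candidates}
        for t in db:                  # one pass: count candidate support
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= needed}
        result.update(frequent)
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(("".join(sorted(s)), c) for s, c in apriori(tdb, 0.5).items()))
```

Run on the example database TDB with min support 50%, this yields exactly L1 = {A}, {B}, {C}, {E}, L2 = {A,C}, {B,C}, {B,E}, {C,E}, and L3 = {B,C,E}.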


    Candidate Generation

    Join step: p and q are two (k-1)-large itemsets identical in their first k-2 items; join them by adding the last item of q to p:

        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
              p.itemk-1 < q.itemk-1;

    Prune step: check all the subsets; remove any candidate with an infrequent subset:

        forall itemsets c ∈ Ck do
            forall (k-1)-subsets s of c do
                if (s ∉ Lk-1) then
                    delete c from Ck;
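The join and prune steps can be sketched as a standalone apriori-gen, assuming each itemset is kept as a sorted tuple (names are illustrative):

```python
# Sketch of apriori-gen: join L_{k-1} with itself, then prune.
from itertools import combinations

def apriori_gen(prev_frequent):
    """prev_frequent: set of sorted (k-1)-tuples (the large itemsets L_{k-1});
    returns the candidate set C_k."""
    prev = sorted(prev_frequent)
    joined = set()
    # Join step: p and q agree on the first k-2 items; because `prev` is
    # sorted and q follows p, p's last item precedes q's automatically.
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                joined.add(p + (q[-1],))
    # Prune step: drop any candidate with an infrequent (k-1)-subset.
    return {
        c for c in joined
        if all(s in prev_frequent for s in combinations(c, len(c) - 1))
    }

l2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(apriori_gen(l2))  # {('B', 'C', 'E')}
```

With the L2 of the running example, the join produces only {B,C,E}, and the prune keeps it because {B,C}, {B,E}, and {C,E} are all frequent.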


    Rules with support ≥ 50% and confidence 100%:

    A ⇒ C
    B ⇒ E
    E ⇒ B
    BC ⇒ E
    CE ⇒ B

    The Apriori Algorithm: An Example

    Database TDB:

        Tid | Items
        10  | A, C, D
        20  | B, C, E
        30  | A, B, C, E
        40  | B, E

    1st scan → C1:

        Itemset | sup
        {A}     | 2
        {B}     | 3
        {C}     | 3
        {D}     | 1
        {E}     | 3

    L1 (support ≥ 2):

        Itemset | sup
        {A}     | 2
        {B}     | 3
        {C}     | 3
        {E}     | 3

    C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

    2nd scan → counted C2:

        Itemset | sup
        {A, B}  | 1
        {A, C}  | 2
        {A, E}  | 1
        {B, C}  | 2
        {B, E}  | 3
        {C, E}  | 2

    L2:

        Itemset | sup
        {A, C}  | 2
        {B, C}  | 2
        {B, E}  | 3
        {C, E}  | 2

    C3 (generated from L2): {B, C, E}

    3rd scan → L3:

        Itemset   | sup
        {B, C, E} | 2
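From the frequent itemsets, rules are generated by splitting each itemset into an antecedent and a consequent and keeping splits whose confidence clears the threshold. A sketch with supports hard-coded from this example; note that exhaustive enumeration finds five rules at 100% confidence, since E ⇒ B also qualifies (support(E) = support(BE) = 3), while BE ⇒ C only reaches 2/3:

```python
# Deriving association rules from the frequent itemsets of this example.
# Support counts are hard-coded from the tables above (4 transactions).
from itertools import combinations

support = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3,
    frozenset("E"): 3, frozenset("AC"): 2, frozenset("BC"): 2,
    frozenset("BE"): 3, frozenset("CE"): 2, frozenset("BCE"): 2,
}

def rules(min_conf):
    """Enumerate every split lhs ⇒ rhs of each frequent itemset and keep
    those with confidence = support(itemset) / support(lhs) >= min_conf."""
    found = set()
    for itemset in support:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                if support[itemset] / support[lhs] >= min_conf:
                    found.add((lhs, itemset - lhs))
    return found

for lhs, rhs in sorted(rules(1.0), key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(lhs)), "=>", "".join(sorted(rhs)))
# A => C
# B => E
# E => B
# BC => E
# CE => B
```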


    Apriori Problem?

    Every pass goes over the whole data:

        L1 = {frequent items};
        for (k = 1; Lk != ∅; k++) do begin
            Ck+1 = candidates generated from Lk;
            for each transaction t in database do
                increment the count of all candidates
                in Ck+1 that are contained in t;
            Lk+1 = candidates in Ck+1 with min_support;
        end
        return ∪k Lk;


    Algorithm AprioriTid

    Uses the database only once; builds a storage set C^k.

    Members have the form <TID, {Xk}>, where the Xk are potentially frequent k-itemsets present in transaction TID.

    For k = 1, C^1 is the database itself.

    Pass k+1 uses C^k instead of the database.


    Algorithm AprioriTid

        L1 = {large 1-itemsets};                     // count item occurrences
        C^1 = database D;                            // the storage set is initialized with the database
        for (k = 2; Lk-1 != ∅; k++) do begin
            Ck = apriori-gen(Lk-1);                  // generate new k-itemset candidates
            C^k = ∅;                                 // build a new storage set
            forall entries t ∈ C^k-1 do begin
                // determine the candidate itemsets contained in transaction t.TID
                Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets and
                               (c - c[k-1]) ∈ t.set-of-itemsets};
                forall candidates c ∈ Ct do
                    c.count++;                       // find the support of all the candidates
                if (Ct ≠ ∅) then C^k += <t.TID, Ct>; // remove empty entries
            end
            Lk = {c ∈ Ck | c.count ≥ minsup};        // take only those with support over minsup
        end
        Answer = ∪k Lk;
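A runnable sketch of the same idea (illustrative only; `apriori_tid` and its data layout are assumptions): the storage set maps each TID to the candidates it contains, so pass k+1 reads the previous storage set instead of rescanning the database:

```python
# Illustrative sketch of AprioriTid with a per-TID storage set.
from itertools import combinations

def apriori_tid(db, min_support):
    """db: dict {TID: set of items}; returns {frozenset: support count}."""
    needed = min_support * len(db)
    # C^1: each transaction represented as its set of 1-itemsets
    storage = {tid: {frozenset([i]) for i in t} for tid, t in db.items()}
    counts = {}
    for cands in storage.values():
        for c in cands:
            counts[c] = counts.get(c, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= needed}
    result = dict(frequent)
    k = 2
    while frequent:
        # apriori-gen: join L_{k-1} with itself, prune by subsets
        cand = set()
        for a in frequent:
            for b in frequent:
                u = a | b
                if len(u) == k and all(
                    frozenset(s) in frequent for s in combinations(u, k - 1)
                ):
                    cand.add(u)
        counts = {c: 0 for c in cand}
        new_storage = {}
        for tid, prev in storage.items():
            # c is contained in transaction tid iff all of its
            # (k-1)-subsets appeared in the previous storage entry
            ct = {c for c in cand
                  if all(frozenset(s) in prev
                         for s in combinations(c, k - 1))}
            for c in ct:
                counts[c] += 1
            if ct:                       # drop empty entries
                new_storage[tid] = ct
        storage = new_storage
        frequent = {s: n for s, n in counts.items() if n >= needed}
        result.update(frequent)
        k += 1
    return result

db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
print(sorted((tuple(sorted(s)), n) for s, n in apriori_tid(db, 0.5).items()))
```

On the example database of the next slide, this reproduces L1 = {1}, {2}, {3}, {5}, L2 = {1,3}, {2,3}, {2,5}, {3,5}, and L3 = {2,3,5}.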


    Database:

        TID | Items
        100 | 1 3 4
        200 | 2 3 5
        300 | 1 2 3 5
        400 | 2 5

    C^1:

        TID | Set-of-itemsets
        100 | {{1}, {3}, {4}}
        200 | {{2}, {3}, {5}}
        300 | {{1}, {2}, {3}, {5}}
        400 | {{2}, {5}}

    L1:

        Itemset | Support
        {1}     | 2
        {2}     | 3
        {3}     | 3
        {5}     | 3

    C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

    C^2:

        TID | Set-of-itemsets
        100 | {{1 3}}
        200 | {{2 3}, {2 5}, {3 5}}
        300 | {{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}}
        400 | {{2 5}}

    L2:

        Itemset | Support
        {1 3}   | 2
        {2 3}   | 2
        {2 5}   | 3
        {3 5}   | 2

    C3: {2 3 5}

    C^3:

        TID | Set-of-itemsets
        200 | {{2 3 5}}
        300 | {{2 3 5}}

    L3:

        Itemset | Support
        {2 3 5} | 2


    Advantage

    C^k can be smaller than the database: if a transaction does not contain any k-itemset candidates, it is excluded from C^k.

    For large k, each entry may be smaller than the transaction, since the transaction might contain only a few candidates.


    Mining Frequent Patterns Without Candidate Generation (FP-Growth)

    Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:

    highly condensed, but complete for frequent pattern mining;

    avoids costly database scans.

    Develop an efficient, FP-tree-based frequent pattern mining method:

    a divide-and-conquer methodology: decompose mining tasks into smaller ones;

    avoid candidate generation: sub-database tests only!


    Construct FP-tree from a Transaction DB

    min_support = 0.5

        TID | Items bought             | (Ordered) frequent items
        100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
        200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
        300 | {b, f, h, j, o}          | {f, b}
        400 | {b, c, k, s, p}          | {c, b, p}
        500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

    Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

    Resulting FP-tree:

        {}
        ├─ f:4
        │  ├─ c:3
        │  │  └─ a:3
        │  │     ├─ m:2
        │  │     │  └─ p:2
        │  │     └─ b:1
        │  │        └─ m:1
        │  └─ b:1
        └─ c:1
           └─ b:1
              └─ p:1

    Steps:

    1. Scan the DB once, find frequent 1-itemsets (single-item patterns).

    2. Order frequent items in frequency-descending order.

    3. Scan the DB again, construct the FP-tree.
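The three construction steps can be sketched as follows (illustrative: `build_fp_tree` is an assumed name, the header table's node-link pointers are omitted, and frequency ties are broken alphabetically here, so c precedes f, unlike the slide's ordering):

```python
# Compact sketch of FP-tree construction: count items, keep and order
# the frequent ones, then insert each ordered transaction into a
# prefix tree with shared counts.
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(db, min_support):
    needed = min_support * len(db)
    # Step 1: one scan finds the frequent single items
    freq = {i: c for i, c in Counter(i for t in db for i in t).items()
            if c >= needed}
    root = Node(None)
    for t in db:
        # Step 2: keep only frequent items, in frequency-descending order
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        # Step 3: walk/extend a shared prefix path, incrementing counts
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"),
      set("afcelpmn")]
root, freq = build_fp_tree(db, min_support=0.5)
print(freq)  # header table: f, c, a, b, m, p with their counts
```

With this tiebreak, the tree is rooted at c:4 with f:1 as a second branch; the node counts match the slide's tree, only the f/c order differs.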


    Objective

    To determine the effectiveness and efficiency of these algorithms on the following parameters:

    - Types of itemsets generated by the algorithms, taking into account the same database.

    - Time units taken by the algorithms to generate the frequent itemsets.

    - Association rules designed on the basis of the frequent itemsets generated by the algorithms.

    - Size of the database.

    - Varying the min support and min confidence.


    Research Methodology

    Implement these algorithms, connect them to a database, and analyze the results.


    Scope and Relevance of Study

    1. Inventory Management:

    Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep its service vehicles equipped with the right parts to reduce the number of visits to consumer households.

    Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.


    Contd.

    Market Analysis: which combinations are frequent?

    Health Care: analyze the patient's disease history; find relationships between diseases.


    References

    [1] Agrawal R., Imielinski T., Swami A. Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, pp. 207-216.

    [2] Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. Proc. Int. Conf. Very Large Data Bases, 1994, pp. 487-499.

    [3] M. Houtsma and A. Swami. Set-Oriented Mining of Association Rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993.

    [4] Lecture notes and presentation slides of Professor Anita Wasilewska, State University of New York, Stony Brook.

    [5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann / Elsevier India, 2001.

    [6] Arun Pujari, Data Mining Techniques, Universities Press (India) Pvt. Ltd., 2001.

    [7] Qi Luo, Advancing Knowledge Discovery and Data Mining, 2008 Workshop on Knowledge and Data Mining, pp. 3-5.

    [8] Rupnik, Kukar, Bajec, Krisper, DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support, 28th Int. Conf. Information Technology Interfaces ITI 2006, June 19-22, 2006, Cavtat, Croatia.


    Thank You

    Any Queries?