A Comparative Study of Data Mining Algorithms to Generate Frequent Itemsets and Association Rules


  • 8/8/2019 A Comparative Study of Data Mining Algorithms To

    1/31

A Comparative Study of Data Mining Algorithms to Generate Frequent Itemsets and Association Rules

    Anupma Sangwan


    What is Data Mining?

    Many definitions exist, for example:

    The extraction of implicit, previously unknown, and potentially useful information from data.

    The task of discovering interesting patterns from vast amounts of data.


    What is (not) Data Mining?

    What is Data Mining?

    Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area).

    Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest vs. Amazon.com).

    What is not Data Mining?

    Looking up a phone number in a phone directory.

    Querying a Web search engine for information about Amazon.


    Data Mining Tasks

    Prediction Methods: use some variables to predict unknown or future values of other variables.

    Description Methods: find human-interpretable patterns that describe the data.


    Data Mining Tasks...

    Classification [Predictive]

    Clustering [Descriptive]

    Association Rule Discovery [Descriptive]

    Sequential Pattern Discovery [Descriptive]

    Regression [Predictive]

    Deviation Detection [Predictive]


    What Is Association Mining?

    Association rule mining:

    Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

    Frequent pattern: a pattern (a set of items, a sequence, etc.) that occurs frequently in a database.

    Motivation: finding regularities in data.

    Which products were often purchased together? Beer and diapers?!

    What are the subsequent purchases after buying a PC?


    Association Rules

    An Example

    Market-basket model

    Look for combinations of products

    Put the SHOES near the SOCKS so that if a customer buys one, they will buy the other.


    Association Rules: Purpose

    To provide rules that correlate the presence of one set of items with another set of items.


    Basic Concepts & Terms in Association Rules

        Transaction-id | Items bought
        10             | A, B, C
        20             | A, C
        30             | A, D
        40             | B, E, F

    Itemset: X = {x1, ..., xk}

    Find all rules X ⇒ Y with minimum support and confidence:

    support, s: the probability that a transaction contains X ∪ Y.

    confidence, c: the conditional probability that a transaction having X also contains Y.

    Let min_support = 50%, min_conf = 50%:

    A ⇒ C (support 50%, confidence 66.7%)
    C ⇒ A (support 50%, confidence 100%)

    (Illustration from the slide: some customers buy diapers, some buy beer, and some buy both.)
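The support and confidence figures above can be checked directly. A minimal sketch (function names are illustrative, not from the study), using the four-transaction table:

```python
# Checking the example's support and confidence numbers directly.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs): P(t contains rhs | t contains lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))       # 0.5      -> s = 50%
print(confidence({"A"}, {"C"}))  # 0.666... -> c(A ⇒ C) = 66.7%
print(confidence({"C"}, {"A"}))  # 1.0      -> c(C ⇒ A) = 100%
```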


    Mining Association Rules: Example

    Min. support 50%, min. confidence 50%.

        Transaction-id | Items bought
        10             | A, B, C
        20             | A, C
        30             | A, D
        40             | B, E, F

        Frequent pattern | Support
        {A}              | 75%
        {B}              | 50%
        {C}              | 50%
        {A, C}           | 50%

    For the rule A ⇒ C:

    support = support({A, C}) = 50%

    confidence = support({A, C}) / support({A}) = 66.7%


    Frequent Itemset Algorithms

    Some of the algorithms that generate frequent itemsets are as follows:

    AIS Algorithm

    SETM Algorithm

    Apriori Algorithm

    FP-Growth Algorithm

    AprioriTID Algorithm


    Discovering the Association Rules

    Find all frequent itemsets (itemsets with at least the minimum support).

    Use these frequent itemsets to generate rules.


    Discovering Large Itemsets

    Multiple passes over the data.

    First pass: count the support of individual items.

    Each subsequent pass:

    Generate candidates using the previous pass's large itemsets.

    Go over the data and check the actual support of the candidates.

    Stop when no new large itemsets are found.


    Apriori Algorithm

    The first scalable algorithm for association rule mining; an improvement over the AIS and SETM algorithms (Agrawal and Srikant, 1994).


    Apriori: A Candidate Generation-and-Test Approach

    Any subset of a frequent itemset must be frequent:

    if {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction having {beer, diaper, nuts} also contains {beer, diaper}.

    Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested!

    Method:

    generate length-(k+1) candidate itemsets from length-k frequent itemsets, and

    test the candidates against the DB.


    Apriori Algorithm: Pseudo Code

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

        L1 = {frequent items};                          // count item occurrences
        for (k = 1; Lk != ∅; k++) do begin
            Ck+1 = candidates generated from Lk;        // generate new (k+1)-itemset candidates
            for each transaction t in database do
                increment the count of all candidates
                in Ck+1 that are contained in t;        // find the support of all the candidates
            Lk+1 = candidates in Ck+1 with min_support; // take only those with support over minsup
        end
        return ∪k Lk;

    Join step: Ck is generated by joining Lk-1 with itself.

    Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
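The pseudo code above can be turned into a small runnable sketch. This is an illustrative implementation, not the study's program; `apriori` and its signature are assumptions:

```python
# An illustrative, runnable sketch of the level-wise Apriori loop.
from itertools import combinations

def apriori(db, min_support):
    """db: list of transactions (sets of items); min_support: fraction.
    Returns {frozenset(itemset): support count} for all frequent itemsets."""
    needed = min_support * len(db)
    counts = {}
    for t in db:                      # first pass: count individual items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= needed}
    result = dict(frequent)
    k = 1
    while frequent:
        candidates = set()            # join step: combine pairs from L_k
        for a in frequent:
            for b in frequent:
                u = a | b
                if len(u) == k + 1 and all(
                    frozenset(s) in frequent for s in combinations(u, k)
                ):                    # prune step: all k-subsets frequent
                    candidates.add(u)
        counts = {c: 0 for c in candidates}
        for t in db:                  # one pass: count candidate support
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= needed}
        result.update(frequent)
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(("".join(sorted(s)), c) for s, c in apriori(tdb, 0.5).items()))
```

Run on the example database TDB with min support 50%, this yields exactly L1 = {A}, {B}, {C}, {E}, L2 = {A,C}, {B,C}, {B,E}, {C,E}, and L3 = {B,C,E}.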


    Candidate Generation

    Join step: p and q are two (k-1)-large itemsets identical in their first k-2 items; join them by adding the last item of q to p:

        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
              p.itemk-1 < q.itemk-1;

    Prune step: check all the subsets; remove any candidate with an infrequent subset:

        forall itemsets c ∈ Ck do
            forall (k-1)-subsets s of c do
                if (s ∉ Lk-1) then
                    delete c from Ck;
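The join and prune steps can be sketched as a standalone apriori-gen, assuming each itemset is kept as a sorted tuple (names are illustrative):

```python
# Sketch of apriori-gen: join L_{k-1} with itself, then prune.
from itertools import combinations

def apriori_gen(prev_frequent):
    """prev_frequent: set of sorted (k-1)-tuples (the large itemsets L_{k-1});
    returns the candidate set C_k."""
    prev = sorted(prev_frequent)
    joined = set()
    # Join step: p and q agree on the first k-2 items; because `prev` is
    # sorted and q follows p, p's last item precedes q's automatically.
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                joined.add(p + (q[-1],))
    # Prune step: drop any candidate with an infrequent (k-1)-subset.
    return {
        c for c in joined
        if all(s in prev_frequent for s in combinations(c, len(c) - 1))
    }

l2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(apriori_gen(l2))  # {('B', 'C', 'E')}
```

With the L2 of the running example, the join produces only {B,C,E}, and the prune keeps it because {B,C}, {B,E}, and {C,E} are all frequent.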


    Rules with support ≥ 50% and confidence 100%:

    A ⇒ C
    B ⇒ E
    E ⇒ B
    BC ⇒ E
    CE ⇒ B

    The Apriori Algorithm: An Example

    Database TDB:

        Tid | Items
        10  | A, C, D
        20  | B, C, E
        30  | A, B, C, E
        40  | B, E

    1st scan → C1:

        Itemset | sup
        {A}     | 2
        {B}     | 3
        {C}     | 3
        {D}     | 1
        {E}     | 3

    L1 (support ≥ 2):

        Itemset | sup
        {A}     | 2
        {B}     | 3
        {C}     | 3
        {E}     | 3

    C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

    2nd scan → counted C2:

        Itemset | sup
        {A, B}  | 1
        {A, C}  | 2
        {A, E}  | 1
        {B, C}  | 2
        {B, E}  | 3
        {C, E}  | 2

    L2:

        Itemset | sup
        {A, C}  | 2
        {B, C}  | 2
        {B, E}  | 3
        {C, E}  | 2

    C3 (generated from L2): {B, C, E}

    3rd scan → L3:

        Itemset   | sup
        {B, C, E} | 2
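From the frequent itemsets, rules are generated by splitting each itemset into an antecedent and a consequent and keeping splits whose confidence clears the threshold. A sketch with supports hard-coded from this example; note that exhaustive enumeration finds five rules at 100% confidence, since E ⇒ B also qualifies (support(E) = support(BE) = 3), while BE ⇒ C only reaches 2/3:

```python
# Deriving association rules from the frequent itemsets of this example.
# Support counts are hard-coded from the tables above (4 transactions).
from itertools import combinations

support = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3,
    frozenset("E"): 3, frozenset("AC"): 2, frozenset("BC"): 2,
    frozenset("BE"): 3, frozenset("CE"): 2, frozenset("BCE"): 2,
}

def rules(min_conf):
    """Enumerate every split lhs ⇒ rhs of each frequent itemset and keep
    those with confidence = support(itemset) / support(lhs) >= min_conf."""
    found = set()
    for itemset in support:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                if support[itemset] / support[lhs] >= min_conf:
                    found.add((lhs, itemset - lhs))
    return found

for lhs, rhs in sorted(rules(1.0), key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(lhs)), "=>", "".join(sorted(rhs)))
# A => C
# B => E
# E => B
# BC => E
# CE => B
```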


    Apriori Problem?

    Every pass goes over the whole data:

        L1 = {frequent items};
        for (k = 1; Lk != ∅; k++) do begin
            Ck+1 = candidates generated from Lk;
            for each transaction t in database do
                increment the count of all candidates
                in Ck+1 that are contained in t;
            Lk+1 = candidates in Ck+1 with min_support;
        end
        return ∪k Lk;


    Algorithm AprioriTid

    Uses the database only once; builds a storage set C^k.

    Members have the form <TID, {Xk}>, where the Xk are potentially frequent k-itemsets present in transaction TID.

    For k = 1, C^1 is the database itself.

    Pass k+1 uses C^k instead of the database.


    Algorithm AprioriTid

        L1 = {large 1-itemsets};                     // count item occurrences
        C^1 = database D;                            // the storage set is initialized with the database
        for (k = 2; Lk-1 != ∅; k++) do begin
            Ck = apriori-gen(Lk-1);                  // generate new k-itemset candidates
            C^k = ∅;                                 // build a new storage set
            forall entries t ∈ C^k-1 do begin
                // determine the candidate itemsets contained in transaction t.TID
                Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets and
                               (c - c[k-1]) ∈ t.set-of-itemsets};
                forall candidates c ∈ Ct do
                    c.count++;                       // find the support of all the candidates
                if (Ct ≠ ∅) then C^k += <t.TID, Ct>; // remove empty entries
            end
            Lk = {c ∈ Ck | c.count ≥ minsup};        // take only those with support over minsup
        end
        Answer = ∪k Lk;
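A runnable sketch of the same idea (illustrative only; `apriori_tid` and its data layout are assumptions): the storage set maps each TID to the candidates it contains, so pass k+1 reads the previous storage set instead of rescanning the database:

```python
# Illustrative sketch of AprioriTid with a per-TID storage set.
from itertools import combinations

def apriori_tid(db, min_support):
    """db: dict {TID: set of items}; returns {frozenset: support count}."""
    needed = min_support * len(db)
    # C^1: each transaction represented as its set of 1-itemsets
    storage = {tid: {frozenset([i]) for i in t} for tid, t in db.items()}
    counts = {}
    for cands in storage.values():
        for c in cands:
            counts[c] = counts.get(c, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= needed}
    result = dict(frequent)
    k = 2
    while frequent:
        # apriori-gen: join L_{k-1} with itself, prune by subsets
        cand = set()
        for a in frequent:
            for b in frequent:
                u = a | b
                if len(u) == k and all(
                    frozenset(s) in frequent for s in combinations(u, k - 1)
                ):
                    cand.add(u)
        counts = {c: 0 for c in cand}
        new_storage = {}
        for tid, prev in storage.items():
            # c is contained in transaction tid iff all of its
            # (k-1)-subsets appeared in the previous storage entry
            ct = {c for c in cand
                  if all(frozenset(s) in prev
                         for s in combinations(c, k - 1))}
            for c in ct:
                counts[c] += 1
            if ct:                       # drop empty entries
                new_storage[tid] = ct
        storage = new_storage
        frequent = {s: n for s, n in counts.items() if n >= needed}
        result.update(frequent)
        k += 1
    return result

db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
print(sorted((tuple(sorted(s)), n) for s, n in apriori_tid(db, 0.5).items()))
```

On the example database of the next slide, this reproduces L1 = {1}, {2}, {3}, {5}, L2 = {1,3}, {2,3}, {2,5}, {3,5}, and L3 = {2,3,5}.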


    Database:

        TID | Items
        100 | 1 3 4
        200 | 2 3 5
        300 | 1 2 3 5
        400 | 2 5

    C^1:

        TID | Set-of-itemsets
        100 | {{1}, {3}, {4}}
        200 | {{2}, {3}, {5}}
        300 | {{1}, {2}, {3}, {5}}
        400 | {{2}, {5}}

    L1:

        Itemset | Support
        {1}     | 2
        {2}     | 3
        {3}     | 3
        {5}     | 3

    C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

    C^2:

        TID | Set-of-itemsets
        100 | {{1 3}}
        200 | {{2 3}, {2 5}, {3 5}}
        300 | {{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}}
        400 | {{2 5}}

    L2:

        Itemset | Support
        {1 3}   | 2
        {2 3}   | 2
        {2 5}   | 3
        {3 5}   | 2

    C3: {2 3 5}

    C^3:

        TID | Set-of-itemsets
        200 | {{2 3 5}}
        300 | {{2 3 5}}

    L3:

        Itemset | Support
        {2 3 5} | 2


    Advantage

    C^k can be smaller than the database: if a transaction does not contain any k-itemset candidates, it is excluded from C^k.

    For large k, each entry may be smaller than the transaction, since the transaction might contain only a few candidates.


    Mining Frequent Patterns Without Candidate Generation (FP-Growth)

    Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:

    highly condensed, but complete for frequent pattern mining;

    avoids costly database scans.

    Develop an efficient, FP-tree-based frequent pattern mining method:

    a divide-and-conquer methodology: decompose mining tasks into smaller ones;

    avoid candidate generation: sub-database tests only!


    Construct FP-tree from a Transaction DB

    min_support = 0.5

        TID | Items bought             | (Ordered) frequent items
        100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
        200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
        300 | {b, f, h, j, o}          | {f, b}
        400 | {b, c, k, s, p}          | {c, b, p}
        500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

    Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

    Resulting FP-tree:

        {}
        ├─ f:4
        │  ├─ c:3
        │  │  └─ a:3
        │  │     ├─ m:2
        │  │     │  └─ p:2
        │  │     └─ b:1
        │  │        └─ m:1
        │  └─ b:1
        └─ c:1
           └─ b:1
              └─ p:1

    Steps:

    1. Scan the DB once, find frequent 1-itemsets (single-item patterns).

    2. Order frequent items in frequency-descending order.

    3. Scan the DB again, construct the FP-tree.
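The three construction steps can be sketched as follows (illustrative: `build_fp_tree` is an assumed name, the header table's node-link pointers are omitted, and frequency ties are broken alphabetically here, so c precedes f, unlike the slide's ordering):

```python
# Compact sketch of FP-tree construction: count items, keep and order
# the frequent ones, then insert each ordered transaction into a
# prefix tree with shared counts.
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(db, min_support):
    needed = min_support * len(db)
    # Step 1: one scan finds the frequent single items
    freq = {i: c for i, c in Counter(i for t in db for i in t).items()
            if c >= needed}
    root = Node(None)
    for t in db:
        # Step 2: keep only frequent items, in frequency-descending order
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        # Step 3: walk/extend a shared prefix path, incrementing counts
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"),
      set("afcelpmn")]
root, freq = build_fp_tree(db, min_support=0.5)
print(freq)  # header table: f, c, a, b, m, p with their counts
```

With this tiebreak, the tree is rooted at c:4 with f:1 as a second branch; the node counts match the slide's tree, only the f/c order differs.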


    Objective

    To determine the effectiveness and efficiency of these algorithms on the following parameters:

    - Types of itemsets generated by the algorithms, taking into account the same database.

    - Time units taken by the algorithms to generate the frequent itemsets.

    - Association rules designed on the basis of the frequent itemsets generated by the algorithms.

    - Size of the database.

    - Varying the min support and min confidence.


    Research Methodology

    Implement these algorithms, connect them to a database, and analyze the results.


    Scope and Relevance of Study

    1. Inventory Management:

    Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep its service vehicles equipped with the right parts to reduce the number of visits to consumer households.

    Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.


    Contd.

    Market Analysis: which combinations are frequent?

    Health Care: analyze the patient's disease history; find relationships between diseases.


    References

    [1] Agrawal R., Imielinski T., Swami A. Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, pp. 207-216.

    [2] Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. Proc. Int. Conf. Very Large Data Bases, 1994, pp. 487-499.

    [3] M. Houtsma and A. Swami. Set-Oriented Mining of Association Rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993.

    [4] Lecture notes and presentation slides of Professor Anita Wasilewska, State University of New York, Stony Brook.

    [5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann / Elsevier India, 2001.

    [6] Arun Pujari, Data Mining Techniques, Universities Press (India) Pvt. Ltd., 2001.

    [7] Qi Luo, Advancing Knowledge Discovery and Data Mining, 2008 Workshop on Knowledge and Data Mining, pp. 3-5.

    [8] Rupnik, Kukar, Bajec, Krisper, DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support, 28th Int. Conf. Information Technology Interfaces ITI 2006, June 19-22, 2006, Cavtat, Croatia.


    Thank You

    Any Queries?