Lecture 8-9: Association Rule Mining


Slide 1/21

Data Mining

Association Rule Mining

Frequent Itemset Mining
Support and Confidence
Apriori Approach

Slide 2/21

Initial Definition of Association Rule (AR) Mining

Association rules define relationships of the form:

A → B

Read as "A implies B", where A and B are sets of binary-valued attributes represented in a data set.

Association Rule Mining (ARM) is then the process of finding all the ARs in a given DB.

Slide 3/21

Association Rules: Basic Concepts

Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit).

Find: all rules that correlate the presence of one set of items with that of another set of items.

E.g., 98% of students who study Databases and C++ also study Algorithms.

Applications:
- Home electronics (what other products should the store stock up on?)
- Attached mailing in direct marketing
- Web page navigation in search engines (first page a → page b)
- Text mining (e.g., "IT companies" → "Microsoft")

Slide 4/21

Some Notation

D = a data set comprising n records and m binary-valued attributes.

I = the set of m attributes, {i1, i2, …, im}, represented in D.

Itemset = some subset of I. Each record in D is an itemset.

Slide 5/21

Example DB

I = {a, b, c, d, e}
D = {{a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}}

TID  Atts
1    a b c
2    a b d
3    a b e
4    a c d
5    a c e
6    a d e
7    b c d
8    b c e
9    b d e
10   c d e

Given attributes which are not binary-valued (i.e. either nominal or ranged), the attributes can be discretised so that they are represented by a number of binary-valued attributes.

Slide 6/21

In-depth Definition of AR Mining

Association rules define relationships of the form:

A → B

Read as "A implies B", such that A ⊆ I, B ⊆ I, A ∩ B = ∅ (A and B are disjoint) and A ∪ B ⊆ I. In other words, an AR is made up of an itemset of cardinality 2 or more.

Slide 7/21

ARM Problem Definition (1)

Given a database D we wish to find (mine) all the itemsets of cardinality 2 or more contained in D, and then use these itemsets to create association rules of the form A → B.

The number of potential itemsets of cardinality 2 or more is:

2^m - m - 1

If m = 5, #potential itemsets = 2^5 - 5 - 1 = 26.
If m = 20, #potential itemsets = 2^20 - 20 - 1 = 1,048,555.

So in practice we do not want to find all the itemsets of cardinality 2 or more contained in D; we only want to find the interesting itemsets of cardinality 2 or more contained in D.
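As a quick sanity check on this count, here is a small illustrative Python snippet (not from the slides) that enumerates the qualifying subsets directly:

```python
from itertools import combinations

def n_itemsets(m):
    """Count the subsets of an m-element attribute set with cardinality >= 2."""
    return sum(1 for k in range(2, m + 1)
               for _ in combinations(range(m), k))

print(n_itemsets(5))     # 26, matching 2**5 - 5 - 1
print(2**20 - 20 - 1)    # 1048555
```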

Slide 8/21

Association Rules Measurement

The most commonly used interestingness measures are:

1. Support
2. Confidence

Slide 9/21

Itemset Support

Support: a measure of the frequency with which an itemset occurs in a DB.

If an itemset has support higher than some specified threshold we say that the itemset is supported or frequent (some authors use the term large).

The support threshold is normally set reasonably low, say 1%.

supp(A) = (# records that contain A) / n

Slide 10/21

Confidence

Confidence: a measure, expressed as a ratio, of the support for an AR compared to the support of its antecedent.

We say that we are confident in a rule if its confidence exceeds some threshold (normally set reasonably high, say 80%).

conf(A → B) = supp(A ∪ B) / supp(A)
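To make the two measures concrete, here is a minimal Python sketch (illustrative, not from the lecture; the helper names supp and conf are my own) that evaluates them over the ten-record example DB from the earlier slide:

```python
# Example DB from the earlier slide: 10 records over I = {a, b, c, d, e}
D = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
     {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]

def supp(itemset, db):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= record for record in db) / len(db)

def conf(antecedent, consequent, db):
    """supp(A ∪ B) / supp(A) for the rule A -> B."""
    return supp(set(antecedent) | set(consequent), db) / supp(antecedent, db)

print(supp({'a', 'b'}, D))      # 0.3  (3 of 10 records contain both a and b)
print(conf({'a'}, {'b'}, D))    # 0.5  (3/6)
```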

Slide 11/21

Rule Measures: Support and Confidence

Find all rules X ∧ Y ⇒ Z with minimum support and confidence:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Let minimum support = 50% and minimum confidence = 50%; we have:

A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)

[Venn diagram: customers who buy Bread, customers who buy Butter, and the overlap of customers who buy both.]

Slide 12/21

ARM Problem Definition (2)

Given a database D we wish to find all the frequent itemsets (F) and then use this knowledge to produce high-confidence association rules.

Note: finding F is the most computationally expensive part; once we have the frequent sets, generating ARs is straightforward.

Slide 13/21

Brute Force

List all possible combinations in an array. For each record:

1. Find all combinations.
2. For each combination, index into the array and increment its support count by 1.

Then generate rules.

Support counts for the example DB:

a 6     b 6     c 6     d 6     e 6
ab 3    ac 3    ad 3    ae 3    bc 3
bd 3    be 3    cd 3    ce 3    de 3
abc 1   abd 1   abe 1   acd 1   ace 1
ade 1   bcd 1   bce 1   bde 1   cde 1
abcd 0  abce 0  abde 0  acde 0  bcde 0
abcde 0
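A minimal Python sketch of this brute-force scheme (illustrative, not the lecture's code; it uses a dictionary keyed by itemset in place of a flat array):

```python
from itertools import combinations
from collections import Counter

D = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
     {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]

def count_all_itemsets(db):
    """For each record, generate every non-empty combination of its items and
    increment that combination's support count.  Combinations that never occur
    (e.g. abcd) simply keep an implicit count of 0 in the Counter."""
    counts = Counter()
    for record in db:
        items = sorted(record)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return counts

counts = count_all_itemsets(D)
print(counts[('a',)], counts[('a', 'b')], counts[('a', 'b', 'c')])  # 6 3 1
```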

Slide 14/21

Applying a support threshold of 15% (i.e. a count of 1.5) to the support counts above gives the frequent sets (F) of cardinality 2 or more:

ab(3)  ac(3)  bc(3)
ad(3)  bd(3)  cd(3)
ae(3)  be(3)  ce(3)
de(3)

Rules:

a → b  conf = 3/6 = 50%
b → a  conf = 3/6 = 50%
etc.
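Continuing the sketch, the rule-generation step can be rendered as follows (again illustrative, not the lecture's code; the 15% support and 50% confidence thresholds match the slide):

```python
from itertools import combinations
from collections import Counter

D = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
     {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]

# Support counts, as in the brute-force sketch above
counts = Counter(combo for record in D
                 for k in range(1, len(record) + 1)
                 for combo in combinations(sorted(record), k))

def rules_from_pairs(counts, n, min_supp=0.15, min_conf=0.5):
    """Turn frequent 2-itemsets into rules X -> Y, keeping those that meet
    both the support and the confidence thresholds."""
    rules = []
    for itemset, c in counts.items():
        if len(itemset) == 2 and c / n >= min_supp:
            x, y = itemset
            for a, b in ((x, y), (y, x)):       # try both directions
                confidence = c / counts[(a,)]   # supp(XY) / supp(X)
                if confidence >= min_conf:
                    rules.append((a, b, c / n, confidence))
    return rules

for a, b, s, cf in sorted(rules_from_pairs(counts, len(D))):
    print(f"{a} -> {b}  supp={s:.0%}  conf={cf:.0%}")
```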

Slide 15/21

Advantages:

1) Very efficient for data sets with small numbers of attributes (…)

Slide 16/21

Association Rule Mining: A Road Map

Boolean vs. quantitative associations (based on the types of values handled):

buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]

age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]

Slide 17/21

Mining Association Rules: An Example

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Min. support 50%, min. confidence 50%.

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A ⇒ C:

support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Slide 18/21

Mining Frequent Itemsets: the Key Step

Find the frequent itemsets: the sets of items that have minimum support.

A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.

Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).

Use the frequent itemsets to generate association rules.

Slide 19/21

The Apriori Algorithm: Example

Database D:

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (candidates meeting minimum support):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (generated from L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2:
{1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (generated from L2): {2 3 5}

Scan D → L3:
{2 3 5}: 2

Slide 20/21

The Apriori Algorithm

Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
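For readers who want something executable, the following is a compact Python rendering of this pseudo-code (an illustrative sketch, not the lecture's own implementation); itemsets are frozensets and min_support here is an absolute count:

```python
from itertools import combinations

def apriori(db, min_support):
    """Return all frequent itemsets (as frozensets) with their counts.
    db: list of sets of items; min_support: minimum absolute count."""
    db = [set(t) for t in db]
    # L1: frequent 1-itemsets
    items = {i for t in db for i in t}
    counts = {frozenset([i]): sum(i in t for t in db) for i in items}
    L = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(L)
    k = 1
    while L:
        # Candidate generation: self-join Lk, prune by the subset check
        keys = list(L)
        C = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1 and all(
                        frozenset(sub) in L for sub in combinations(union, k)):
                    C.add(union)
        # One scan of the database counts every surviving candidate
        counts = {c: sum(c <= t for t in db) for c in C}
        L = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(L)
        k += 1
    return frequent

# The example database from the previous slide, min_support = 2 (50%)
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for s, c in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)   # ends with [2, 3, 5] 2, matching L3
```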

Slide 21/21

Important Details of Apriori

How to generate candidates?

Step 1: self-joining Lk
Step 2: pruning

How to count supports of candidates?

Example of candidate generation:

L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
Pruning:
- acde is removed because ade is not in L3
C4 = {abcd}
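The join-and-prune step can also be exercised in isolation on this example (an illustrative Python sketch, mirroring the candidate-generation loop in the Apriori code above):

```python
from itertools import combinations

def generate_candidates(Lk, k):
    """Self-join Lk with itself, then prune candidates that have an
    infrequent k-subset (the Apriori principle)."""
    Ck1 = set()
    for a in Lk:
        for b in Lk:
            union = a | b
            if len(union) == k + 1:
                # Pruning: every k-subset of the candidate must be in Lk
                if all(frozenset(s) in Lk for s in combinations(union, k)):
                    Ck1.add(union)
    return Ck1

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print([''.join(sorted(c)) for c in generate_candidates(L3, 3)])  # ['abcd']
```

Here acde is generated by the join (from acd and ace) but pruned because its subset ade is not in L3, leaving C4 = {abcd} as on the slide.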