DWDM2015 Classi

Upload: srilakshmi-shunmugaraj

Post on 20-Feb-2018


TRANSCRIPT

  • 7/24/2019 DWDM2015 Classi

    1/26

    DATA WAREHOUSING and DATA

    MINING

    Classification, Trees

    Saji K Mathew, PhD

    INDIAN INSTITUTE OF TECHNOLOGY MADRAS

    Chennai, India

• Slide 2/26

    Scenario-I

Imagine you are pursuing a direct marketing program.

    Direct mail marketing budget = 12 mn
    Cost per mailing = 50/-
    You need to target the customer base cost-effectively, to maximize profit.

    Whom do I include? How many do I include?

    How do you go about it?
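A quick sanity check on the budget, assuming "12 mn" means 12,000,000 currency units and each mailing costs 50:

```python
# Back-of-the-envelope mailing capacity under the slide's budget figures.
budget = 12_000_000       # "12 mn" budget
cost_per_mailing = 50     # cost of one mailing

max_mailings = budget // cost_per_mailing
print(max_mailings)  # 240000
```

So at most 240,000 customers can be mailed; the question is which 240,000.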

• Slide 3/26

    Scenario-II

Suppose a catalog company has a database of 20 mn names.

    Suppose they choose to send 2 million copies of the Summer Bonanza catalogue.

    Further, suppose the avg. order size is 1500.

    Suppose somehow you could increase the response rate from 5% to 6%.

    That is 20,000 more orders = 30 mn. A 1% increase means a great lift!
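The slide's arithmetic can be verified directly using its own figures:

```python
# 2 million catalogues mailed, average order size 1500,
# response rate lifted from 5% to 6% (integer math keeps this exact).
mailings = 2_000_000
avg_order = 1500

orders_at_5pct = mailings * 5 // 100   # 100,000 orders
orders_at_6pct = mailings * 6 // 100   # 120,000 orders

extra_orders = orders_at_6pct - orders_at_5pct
extra_revenue = extra_orders * avg_order

print(extra_orders)   # 20000
print(extra_revenue)  # 30000000, i.e. 30 mn
```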

• Slide 4/26

Scoring models: supervised data mining for classification

    Simple linear regression
    Response = 0.1*frequency - 0.2*recency
    Response = -0.1*age + 0.2*income

    Logistic regression
    Score (probability) = f(0.1*frequency - 0.2*recency), where f is the logistic (sigmoid) function

    Classification And Regression Trees (CART)

    Rules

    ANN (artificial neural networks)
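A minimal sketch of logistic scoring, reusing the slide's illustrative coefficients (0.1 for frequency, -0.2 for recency — not fitted parameters) and applying the logistic function to turn the linear score into a probability:

```python
import math

def logistic_score(frequency, recency):
    # Linear predictor with the slide's illustrative coefficients.
    linear = 0.1 * frequency - 0.2 * recency
    # Logistic (sigmoid) function squashes the score into (0, 1).
    return 1 / (1 + math.exp(-linear))

# A customer who buys often and bought recently scores high.
print(logistic_score(frequency=10, recency=1))  # ~0.69
```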

• Slide 5/26

    Evaluating a scoring model

Fit: How well the model fits the data (R²)

    Validation: Checks generalizability

    Performance: How useful the model is for action (Lift)

• Slide 6/26

    Lift

Suppose we look at a random 10% of the potential customers, and we expect to get an average R% response rate (without doing any data mining).

    If we select 10% of the likely customers using data mining and get a higher response rate of G%, then we realize a lift (= G/R).
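With made-up rates — say a 5% baseline response (R) and a 15% response among the model-selected customers (G) — the lift works out as:

```python
def lift(g_percent, r_percent):
    # Lift = targeted response rate divided by the baseline response rate.
    return g_percent / r_percent

print(lift(g_percent=15, r_percent=5))  # 3.0 — the model triples the response rate
```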

• Slide 7/26

    Gains table

• Slide 8/26

The cumulative gains chart

    [Chart: cumulative gains of the proposed model vs. a random model; the vertical gap between the two curves is the lift]
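A gains table can be built in a few lines: rank customers by model score, then compare the cumulative share of responders captured with the random baseline. The scores and responses below are made-up illustrative data, not from the slides:

```python
# Made-up model scores and actual responses for 10 customers.
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]

# Rank customers from highest score to lowest.
ranked = sorted(zip(scores, responded), reverse=True)
total_responders = sum(responded)

cum = 0
for i, (score, resp) in enumerate(ranked, start=1):
    cum += resp
    gain = cum / total_responders  # cumulative fraction of responders captured
    baseline = i / len(ranked)     # a random model captures this fraction
    print(f"{i:>2}  gain={gain:.2f}  baseline={baseline:.2f}  lift={gain / baseline:.2f}")
```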

• Slide 9/26

    Classifier performance

Measure                                    Formula

    Accuracy, recognition rate                 (TP + TN) / (P + N)
    Error rate, misclassification rate         (FP + FN) / (P + N)
    Sensitivity, true positive rate, recall    TP / P
    Specificity, true negative rate            TN / N
    Precision                                  TP / (TP + FP)

    Confusion matrix (actual class vs. predicted class):

                    predicted yes   predicted no   total
    actual yes      TP              FN             P
    actual no       FP              TN             N
    total           P'              N'             P + N

• Slide 10/26

Confusion (classification) matrix

                   Predicted Yes   Predicted No
    Actual Yes     800             50
    Actual No      50              100

    Accuracy = 900/1000
    Error rate = 100/1000
    Sensitivity (accuracy on Yes) = 800/850
    Specificity (accuracy on No) = 100/150
    Precision = 800/850
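Recomputing the slide's metrics from its confusion matrix (TP=800, FN=50, FP=50, TN=100):

```python
# Counts taken from the slide's confusion matrix.
TP, FN, FP, TN = 800, 50, 50, 100
P, N = TP + FN, FP + TN  # actual positives (850) and negatives (150)

accuracy    = (TP + TN) / (P + N)   # 900/1000 = 0.9
error_rate  = (FP + FN) / (P + N)   # 100/1000 = 0.1
sensitivity = TP / P                # 800/850
specificity = TN / N                # 100/150
precision   = TP / (TP + FP)        # 800/850 (equal to sensitivity here by coincidence)
```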

• Slide 11/26

    • Slide 12/26

    • Slide 13/26

    • Slide 14/26

    Decision Tree

A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a target variable.

    This structure divides a large collection of records into successively smaller sets of records using simple decision rules; the resulting sets become more homogeneous.

    The target variable is generally categorical; the input variables can be any combination of categorical or metric variables.

• Slide 15/26

Rules (if-then) in fraud detection at HSBC

    3 claims in the last 2 years
    Credit card used in different locations
    Credit card used at a petrol station and then in a high-value store!

• Slide 16/26

    A case of internal fraud

A bank auditor found that the credit card balances written off as uncollectible had an excessive number of amounts with first two digits 24.

    The investigation found that $2,500 was an internal write-off limit. One employee was responsible for most of the 24s: he worked with friends, had them apply for a card, and then ran up a balance to just below $2,500. The employee then wrote the debt off. The systematic nature of the fraud was evident from the first two digits.

• Slide 17/26

    Growing decision trees

• Slide 18/26

    • Slide 19/26

    How to grow a tree?

Purity measures: Gini, entropy, etc.

    Lift: measures how much a class formed by a decision tree rule improves over the original class

• Slide 20/26

Gini Index (CART, IBM Intelligent Miner)

    If a data set D contains items from n classes, the gini index gini(D) is defined as

        gini(D) = 1 - Σ_{j=1..n} (p_j)²

    where p_j is the relative frequency of class j in D.

    If a data set D is split on attribute A into two subsets D1 and D2, the gini index gini_A(D) is defined as

        gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

    Reduction in impurity:

        Δgini(A) = gini(D) - gini_A(D)

    The attribute that provides the smallest gini_A(D) (i.e., the largest reduction in impurity) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
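The definition gini(D) = 1 - Σ p_j² can be computed directly from class counts (the counts below are illustrative):

```python
def gini(counts):
    # counts: number of records in each class at a node.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5 — maximally impure two-class node
print(gini([10, 0]))  # 0.0 — pure node
```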

• Slide 21/26

    Calculate Gini for each node and the split

• Slide 22/26

    • Slide 23/26

    Entropy

Generically, Entropy(D) = - Σ p_i log2(p_i), where i indexes the classes and p_i is the relative frequency of class i in D.
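The entropy formula can be computed from class counts the same way (illustrative counts; classes with zero count contribute nothing, since p log p → 0):

```python
import math

def entropy(counts):
    # counts: number of records in each class at a node.
    total = sum(counts)
    # Skip empty classes to avoid log2(0).
    return sum(-p * math.log2(p) for p in (c / total for c in counts) if p > 0)

print(entropy([5, 5]))   # 1.0 bit — maximally impure two-class node
print(entropy([10, 0]))  # 0.0 — pure node
```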

• Slide 24/26

    • Slide 25/26

    CART in R

Classification and Regression Trees (CART): developed by the statisticians Breiman et al. in the 1980s.

    Uses the Gini index as the splitting criterion.

    The package rpart implements CART in R.
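The slide points to rpart in R; to keep the examples in one language, here is a toy Python version of the Gini-based split search that CART performs. The data and the helper names (gini, best_split) are made up for illustration, not rpart's API:

```python
def gini(labels):
    # Gini impurity of a set of class labels.
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    # Enumerate candidate thresholds (midpoints between sorted values)
    # and return the one with the smallest weighted Gini, as CART does.
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Toy data: age vs. response; the clean split sits between 30 and 40.
ages = [22, 25, 30, 40, 45, 50]
responses = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(ages, responses))  # (35.0, 0.0) — a perfectly pure split
```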

• Slide 26/26

In September, the company awarded a $1 million prize to a team of engineers, statisticians and researchers that improved the accuracy of its movie recommendation system by 10%. At the same time the company launched another $1 million competition with the aim of predicting movie enjoyment among members who don't often rate what they watch.