DWDM2015 Classi

Upload: srilakshmi-shunmugaraj

Post on 20-Feb-2018


TRANSCRIPT

  • 7/24/2019 DWDM2015 Classi

    1/26

    DATA WAREHOUSING and DATA

    MINING

    Classification, Trees

    Saji K Mathew, PhD

    INDIAN INSTITUTE OF TECHNOLOGY MADRAS

    Chennai, India

• Slide 2/26

    Scenario-I

Imagine you are pursuing a direct marketing program.

    Direct mail marketing budget = 12 mn
    Cost per mailing = 50/-
    You need to target the customer base cost-effectively, to maximize profit.

    Whom do I include? How many do I include?

    How do you go about it?
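A quick sanity check on the budget, assuming "12 mn" means 12,000,000 currency units and each mailing costs 50:

```python
# Back-of-the-envelope mailing capacity under the slide's budget figures.
budget = 12_000_000       # "12 mn" budget
cost_per_mailing = 50     # cost of one mailing

max_mailings = budget // cost_per_mailing
print(max_mailings)  # 240000
```

So at most 240,000 customers can be mailed; the question is which 240,000.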

• Slide 3/26

    Scenario-II

Suppose a catalog company has a database of 20 mn names.

    Suppose they choose to send 2 million copies of the Summer Bonanza catalogue.

    Further, suppose the avg. order size is 1500.

    Suppose somehow you could increase the response rate from 5% to 6%.

    That is 20,000 more orders = 30 mn. A 1% increase means a great lift!
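The slide's arithmetic can be verified directly using its own figures:

```python
# 2 million catalogues mailed, average order size 1500,
# response rate lifted from 5% to 6% (integer math keeps this exact).
mailings = 2_000_000
avg_order = 1500

orders_at_5pct = mailings * 5 // 100   # 100,000 orders
orders_at_6pct = mailings * 6 // 100   # 120,000 orders

extra_orders = orders_at_6pct - orders_at_5pct
extra_revenue = extra_orders * avg_order

print(extra_orders)   # 20000
print(extra_revenue)  # 30000000, i.e. 30 mn
```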

• Slide 4/26

Scoring models: supervised data mining for classification

    Simple linear regression
    Response = 0.1*frequency - 0.2*recency
    Response = -0.1*age + 0.2*income

    Logistic regression
    Score (probability) = f(0.1*frequency - 0.2*recency), where f is the logistic (sigmoid) function

    Classification And Regression Trees (CART)

    Rules

    ANN (artificial neural networks)
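A minimal sketch of logistic scoring, reusing the slide's illustrative coefficients (0.1 for frequency, -0.2 for recency — not fitted parameters) and applying the logistic function to turn the linear score into a probability:

```python
import math

def logistic_score(frequency, recency):
    # Linear predictor with the slide's illustrative coefficients.
    linear = 0.1 * frequency - 0.2 * recency
    # Logistic (sigmoid) function squashes the score into (0, 1).
    return 1 / (1 + math.exp(-linear))

# A customer who buys often and bought recently scores high.
print(logistic_score(frequency=10, recency=1))  # ~0.69
```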

• Slide 5/26

    Evaluating a scoring model

Fit: How well the model fits the data (R²)

    Validation: Checks generalizability

    Performance: How useful the model is for action (Lift)

• Slide 6/26

    Lift

Suppose we look at a random 10% of the potential customers, and we expect to get an average R% response rate (without doing any data mining).

    If we select 10% of the likely customers using data mining and get a higher response rate of G%, then we realize a lift (= G/R).
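With made-up rates — say a 5% baseline response (R) and a 15% response among the model-selected customers (G) — the lift works out as:

```python
def lift(g_percent, r_percent):
    # Lift = targeted response rate divided by the baseline response rate.
    return g_percent / r_percent

print(lift(g_percent=15, r_percent=5))  # 3.0 — the model triples the response rate
```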

• Slide 7/26

    Gains table

• Slide 8/26

The cumulative gains chart

    [Chart: cumulative gains of the proposed model vs. a random model; the vertical gap between the two curves is the lift]
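A gains table can be built in a few lines: rank customers by model score, then compare the cumulative share of responders captured with the random baseline. The scores and responses below are made-up illustrative data, not from the slides:

```python
# Made-up model scores and actual responses for 10 customers.
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]

# Rank customers from highest score to lowest.
ranked = sorted(zip(scores, responded), reverse=True)
total_responders = sum(responded)

cum = 0
for i, (score, resp) in enumerate(ranked, start=1):
    cum += resp
    gain = cum / total_responders  # cumulative fraction of responders captured
    baseline = i / len(ranked)     # a random model captures this fraction
    print(f"{i:>2}  gain={gain:.2f}  baseline={baseline:.2f}  lift={gain / baseline:.2f}")
```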

• Slide 9/26

    Classifier performance

Measure                                    Formula

    Accuracy, recognition rate                 (TP + TN) / (P + N)
    Error rate, misclassification rate         (FP + FN) / (P + N)
    Sensitivity, true positive rate, recall    TP / P
    Specificity, true negative rate            TN / N
    Precision                                  TP / (TP + FP)

    Confusion matrix (actual class vs. predicted class):

                    predicted yes   predicted no   total
    actual yes      TP              FN             P
    actual no       FP              TN             N
    total           P'              N'             P + N

• Slide 10/26

Confusion (classification) matrix

                   Predicted Yes   Predicted No
    Actual Yes     800             50
    Actual No      50              100

    Accuracy = 900/1000
    Error rate = 100/1000
    Sensitivity (accuracy on Yes) = 800/850
    Specificity (accuracy on No) = 100/150
    Precision = 800/850
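Recomputing the slide's metrics from its confusion matrix (TP=800, FN=50, FP=50, TN=100):

```python
# Counts taken from the slide's confusion matrix.
TP, FN, FP, TN = 800, 50, 50, 100
P, N = TP + FN, FP + TN  # actual positives (850) and negatives (150)

accuracy    = (TP + TN) / (P + N)   # 900/1000 = 0.9
error_rate  = (FP + FN) / (P + N)   # 100/1000 = 0.1
sensitivity = TP / P                # 800/850
specificity = TN / N                # 100/150
precision   = TP / (TP + FP)        # 800/850 (equal to sensitivity here by coincidence)
```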

• Slide 11/26

    • Slide 12/26

    • Slide 13/26

    • Slide 14/26

    Decision Tree

A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a target variable.

    This structure divides a large collection of records into successively smaller sets of records using simple decision rules; the resulting sets become more homogeneous.

    The target variable is generally categorical; the input variables can be any combination of categorical or metric variables.

• Slide 15/26

Rules (if-then) in fraud detection at HSBC

    3 claims in the last 2 years
    Credit card used in different locations
    Credit card used at a petrol station and then in a high-value store!

• Slide 16/26

    A case of internal fraud

A bank auditor found that the credit card balances written off as uncollectible had an excessive number of amounts with first two digits 24.

    The investigation found that $2,500 was an internal write-off limit. One employee was responsible for most of the 24s: he worked with friends, had them apply for a card, and then ran up a balance to just below $2,500. The employee then wrote the debt off. The systematic nature of the fraud was evident from the first two digits.

• Slide 17/26

    Growing decision trees

• Slide 18/26

    • Slide 19/26

    How to grow a tree?

Purity measures: Gini, entropy, etc.

    Lift: measures how much a class formed by a decision tree rule improves over the original class

• Slide 20/26

Gini Index (CART, IBM Intelligent Miner)

    If a data set D contains items from n classes, the gini index gini(D) is defined as

        gini(D) = 1 - Σ_{j=1..n} (p_j)²

    where p_j is the relative frequency of class j in D.

    If a data set D is split on attribute A into two subsets D1 and D2, the gini index gini_A(D) is defined as

        gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

    Reduction in impurity:

        Δgini(A) = gini(D) - gini_A(D)

    The attribute that provides the smallest gini_A(D) (i.e., the largest reduction in impurity) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
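The definition gini(D) = 1 - Σ p_j² can be computed directly from class counts (the counts below are illustrative):

```python
def gini(counts):
    # counts: number of records in each class at a node.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5 — maximally impure two-class node
print(gini([10, 0]))  # 0.0 — pure node
```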

• Slide 21/26

    Calculate Gini for each node and the split

• Slide 22/26

    • Slide 23/26

    Entropy

Generically, Entropy(D) = - Σ p_i log2(p_i), where i indexes the classes and p_i is the relative frequency of class i in D.
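The entropy formula can be computed from class counts the same way (illustrative counts; classes with zero count contribute nothing, since p log p → 0):

```python
import math

def entropy(counts):
    # counts: number of records in each class at a node.
    total = sum(counts)
    # Skip empty classes to avoid log2(0).
    return sum(-p * math.log2(p) for p in (c / total for c in counts) if p > 0)

print(entropy([5, 5]))   # 1.0 bit — maximally impure two-class node
print(entropy([10, 0]))  # 0.0 — pure node
```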

• Slide 24/26

    • Slide 25/26

    CART in R

Classification and Regression Trees (CART): developed by the statisticians Breiman et al. in the 1980s.

    Uses the Gini index as the splitting criterion.

    The package rpart implements CART in R.
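The slide points to rpart in R; to keep the examples in one language, here is a toy Python version of the Gini-based split search that CART performs. The data and the helper names (gini, best_split) are made up for illustration, not rpart's API:

```python
def gini(labels):
    # Gini impurity of a set of class labels.
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    # Enumerate candidate thresholds (midpoints between sorted values)
    # and return the one with the smallest weighted Gini, as CART does.
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Toy data: age vs. response; the clean split sits between 30 and 40.
ages = [22, 25, 30, 40, 45, 50]
responses = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(ages, responses))  # (35.0, 0.0) — a perfectly pure split
```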

• Slide 26/26

In September, the company awarded a $1 million prize to a team of engineers, statisticians and researchers that improved the accuracy of its movie recommendation system by 10%. At the same time the company launched another $1 million competition with the aim of predicting movie enjoyment among members who don't often rate what they watch.