
Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall 5-1

Data Mining Methods: Classification

Most frequently used DM method
Employs supervised learning
Learns from past data, classifies new data
The output variable is categorical (nominal or ordinal) in nature
Classification versus regression?
Classification versus clustering?


Assessment Methods for Classification

Predictive accuracy
  Hit rate
Speed
  Model building; predicting
Robustness
Scalability
Interpretability
  The level of understanding provided by the model


Accuracy of Classification Models

In classification problems, the primary source for accuracy estimation is the confusion matrix

                                  True Class
                      Positive                    Negative
Predicted  Positive   True Positive Count (TP)    False Positive Count (FP)
Class      Negative   False Negative Count (FN)   True Negative Count (TN)

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$

$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$
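The five measures defined from the confusion matrix can be computed directly from the four cell counts. The following is a minimal sketch; the function names and the counts in the usage line are made up for illustration and do not come from the slides.

```python
import math


def _ratio(numerator, denominator):
    # A ratio with an empty denominator is undefined; report NaN instead of raising.
    return numerator / denominator if denominator else math.nan


def classification_metrics(tp, fp, tn, fn):
    """Compute the confusion-matrix measures from the four cell counts."""
    return {
        "true_positive_rate": _ratio(tp, tp + fn),
        "true_negative_rate": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),  # same formula as the true positive rate
    }


# Illustrative counts only (not from the source).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Recall and true positive rate share one formula, so they always agree; precision differs because its denominator counts predicted positives rather than actual positives.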


Estimation Methodologies for Classification

Simple split (or holdout or test sample estimation)
  Split the data into 2 mutually exclusive sets: training (~70%) and testing (30%)

[Figure: simple split. Preprocessed Data is divided into Training Data (2/3), used for Model Development to produce a Classifier, and Testing Data (1/3), used for Model Assessment (scoring) to produce Prediction Accuracy.]
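A simple split can be sketched in a few lines. This is one possible way to do it, assuming the records fit in memory and that a random shuffle is acceptable; the function name, the default 30% test fraction, and the seed are illustrative choices.

```python
import random


def simple_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and cut them into two mutually exclusive sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    testing = shuffled[:n_test]
    training = shuffled[n_test:]
    return training, testing


# Stand-in data: 100 record identifiers.
train, test = simple_split(range(100))
```

The model is built on `train` only, and the hit rate measured on `test` serves as the estimate of prediction accuracy.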


Estimation Methodologies for Classification

k-Fold Cross Validation (rotation estimation)
  Split the data into k mutually exclusive subsets
  Use each subset as testing while using the rest of the subsets as training
  Repeat the experimentation k times
  Aggregate the test results for a true estimation of prediction accuracy
Other estimation methodologies
  Area under the ROC curve
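The k-fold steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: `train_fn` and `predict_fn` are hypothetical callables standing in for any classifier, and the majority-class learner exists only to make the example runnable.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k mutually exclusive subsets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(records, labels, k, train_fn, predict_fn, seed=0):
    """Use each fold once for testing and the rest for training; pool the hits."""
    hits = 0
    for test_idx in k_fold_indices(len(records), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        model = train_fn([records[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits += sum(predict_fn(model, records[i]) == labels[i] for i in test_idx)
    return hits / len(records)  # aggregated hit rate over all k test folds


def train_majority(xs, ys):
    # Toy learner: remember the most common training label.
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    # Toy predictor: always answer with the remembered label.
    return model


# Made-up data: 8 positives and 2 negatives.
records = list(range(10))
labels = [1] * 8 + [0] * 2
accuracy = cross_validate(records, labels, 5, train_majority, predict_majority)
```

Every record is tested exactly once, so the pooled hit rate uses all of the data for assessment while never scoring a record with a model that saw it during training.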


Estimation Methodologies for Classification – ROC Curve

[Figure: ROC plot with three curves labeled A, B, and C. Horizontal axis: False Positive Rate (1 - Specificity), from 0 to 1. Vertical axis: True Positive Rate (Sensitivity), from 0 to 1.]
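An ROC curve of this kind can be traced by sweeping a decision threshold over a classifier's scores and recording the false positive rate and true positive rate at each step. The sketch below assumes labels coded as 1 (positive) and 0 (negative), with at least one of each, and uses the trapezoidal rule for the area; the function names and the scores are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, lowering the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative scores only: one classifier ranks perfectly, one does not.
perfect = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
mixed = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A curve that hugs the upper-left corner gives an area close to 1, while a curve along the diagonal gives an area of about 0.5, which is what random guessing would produce.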


Example - BMW dealership

The dealership is starting a promotional campaign, whereby it is trying to push a two-year extended warranty to its past customers. The dealership has done this before and has gathered 4,500 data points from past sales of extended warranties. The attributes in the data set are:

Income bracket [0=$0-$30k, 1=$31k-$40k, 2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k, 7=$501k+]
Year/month first BMW bought
Year/month most recent BMW bought
Whether they responded to the extended warranty offer in the past


Weka Input file format
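Weka reads data in its ARFF text format: a `@relation` line, one `@attribute` line per column, then a `@data` section of comma-separated rows. The sketch below shows how the four dealership attributes might be written out in that format. The relation name, the attribute names, the numeric encoding of year/month, and the two data rows are all assumptions made for illustration; none of them come from the slides.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) pairs and a list of data rows."""
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        # A list of allowed values becomes a nominal attribute: {v1,v2,...}
        if isinstance(spec, (list, tuple)):
            spec = "{" + ",".join(str(v) for v in spec) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical names and encodings for the four attributes in the example.
attributes = [
    ("IncomeBracket", list(range(8))),  # brackets 0 through 7
    ("FirstPurchase", "numeric"),       # assumed encoding of year/month
    ("LastPurchase", "numeric"),        # assumed encoding of year/month
    ("responded", [1, 0]),              # class attribute: responded or not
]

# Made-up rows, present only to show the layout of the @data section.
rows = [(5, 200, 300, 1), (1, 150, 310, 0)]

arff_text = to_arff("bmw_responses", attributes, rows)
```

Declaring the response as a nominal attribute rather than a numeric one is what lets Weka treat the task as classification.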
