Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall
Data Mining Methods: Classification
Most frequently used DM method
Employ supervised learning
Learn from past data, classify new data
The output variable is categorical (nominal or ordinal) in nature
Classification versus regression?
Classification versus clustering?
Assessment Methods for Classification
Predictive accuracy
  Hit rate
Speed
  Model building; predicting
Robustness
Scalability
Interpretability
  The level of understanding provided by the model
Accuracy of Classification Models
In classification problems, the primary source for accuracy estimation is the confusion matrix
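Confusion matrix (rows: predicted class; columns: true class)

                      True Class: Positive        True Class: Negative
Predicted Positive    True Positive Count (TP)    False Positive Count (FP)
Predicted Negative    False Negative Count (FN)   True Negative Count (TN)

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$

$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$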
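
The measures derived from the confusion matrix can be computed directly from the four counts. The following is a minimal sketch; the function name `classification_metrics` and the example counts are made up for illustration and do not come from the slides.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard rates from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "true_positive_rate": tp / (tp + fn),  # same quantity as recall
        "true_negative_rate": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty rows or columns (for example, no positive cases at all), which would cause a division by zero.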
Estimation Methodologies for Classification
Simple split (or holdout or test sample estimation)
  Split the data into 2 mutually exclusive sets: training (~70%) and testing (30%)
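[Figure: Preprocessed Data is split into Training Data (2/3) and Testing Data (1/3). Training Data feeds Model Development, which produces the Classifier; Testing Data feeds Model Assessment (scoring), which produces the Prediction Accuracy.]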
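
A simple split can be sketched in a few lines using only the Python standard library. The function name `holdout_split`, the 30% default, and the fixed seed are assumptions made for this illustration, not part of the slides.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into two disjoint sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training, testing)


# Hypothetical data set of 100 record identifiers.
records = list(range(100))
train, test = holdout_split(records)
print(len(train), len(test))
```

Shuffling before splitting avoids a biased test set when the records are stored in some meaningful order, such as by date.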
Estimation Methodologies for Classification
k-Fold Cross Validation (rotation estimation)
  Split the data into k mutually exclusive subsets
  Use each subset as testing while using the rest of the subsets as training
  Repeat the experimentation k times
  Aggregate the test results for a true estimation of prediction accuracy
Other estimation methodologies
  Area under the ROC curve
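
The k-fold procedure above can be sketched as follows. This is an illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`), the majority-class "classifier", and the toy labels are all hypothetical, and pooling the correct predictions across folds is only one way to aggregate the results.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k disjoint folds."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def cross_validate(data, labels, k, train_fn, predict_fn):
    """Train on k-1 folds, test on the held-out fold, and pool the hits."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, data[i]) == labels[i] for i in test_idx)
    return correct / len(data)


# Hypothetical stand-in classifier: always predict the majority training label.
def train_majority(features, targets):
    return max(set(targets), key=targets.count)


def predict_majority(model, feature):
    return model


data = list(range(10))
labels = [1] * 7 + [0] * 3
accuracy = cross_validate(data, labels, 5, train_majority, predict_majority)
print(f"pooled accuracy: {accuracy:.2f}")
```

In practice the records would normally be shuffled before the folds are formed; the sketch leaves them in order to keep the fold assignment easy to follow.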
Estimation Methodologies for Classification – ROC Curve
[Figure: ROC curves labeled A, B, and C. Horizontal axis: False Positive Rate (1 - Specificity), from 0 to 1. Vertical axis: True Positive Rate (Sensitivity), from 0 to 1.]
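
The area under an ROC curve can be computed by sweeping a decision threshold over the classifier's scores and integrating the resulting (False Positive Rate, True Positive Rate) points. The sketch below uses the trapezoidal rule; the function names and the example scores and labels are assumptions made for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points from sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last record of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under (x, y) points ordered by increasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores (higher means "more likely positive") and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
area = area_under_curve(roc_points(scores, labels))
print(f"AUC: {area:.3f}")
```

An area of 1 corresponds to a classifier that ranks every positive case above every negative case, while an area near 0.5 corresponds to the diagonal, that is, random guessing.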
Example - BMW dealership
The dealership is starting a promotional campaign, whereby it is trying to push a two-year extended warranty to its past customers. The dealership has done this before and has gathered 4,500 data points from past sales of extended warranties. The attributes in the data set are:
  Income bracket [0=$0-$30k, 1=$31k-$40k, 2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k, 7=$501k+]
  Year/month first BMW bought
  Year/month most recent BMW bought
  Whether they responded to the extended warranty offer in the past
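
The income bracket coding listed above can be expressed as a small lookup. This is only a sketch: the function name `income_bracket` is hypothetical, and it assumes that each bracket's upper bound is inclusive, so an income between two listed ranges (for example $30,500) falls into the higher bracket.

```python
from bisect import bisect_left

# Upper bounds (in dollars) of brackets 0 through 6; bracket 7 is open-ended.
# Assumption: upper bounds are inclusive.
BRACKET_UPPER_BOUNDS = [30_000, 40_000, 60_000, 75_000, 100_000, 150_000, 500_000]


def income_bracket(income):
    """Map an annual income in dollars to its bracket code (0 to 7)."""
    return bisect_left(BRACKET_UPPER_BOUNDS, income)


for income in (25_000, 55_000, 120_000, 750_000):
    print(income, income_bracket(income))
```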
Weka Input file format