Chapter 5 – Evaluating Predictive Performance, Data Mining for Business Analytics, Shmueli, Patel & Bruce


Page 1: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Chapter 5 – Evaluating Predictive Performance

Data Mining for Business Analytics, Shmueli, Patel & Bruce

Page 2: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Why Evaluate?

Multiple methods are available to classify or predict.
For each method, multiple choices are available for settings.
To choose the best model, we need to assess each model's performance.

Page 3: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Types of Outcome

Predicted numerical value: the outcome variable is numerical and continuous.
Predicted class membership: the outcome variable is categorical.
Propensity: the probability of class membership when the outcome variable is categorical.

Page 4: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Accuracy Measures (Continuous outcome)

Page 5: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Evaluating Predictive Performance

Not the same as goodness of fit (R², standard error of estimate s_{Y|X}).
Predictive performance is measured on the validation dataset.
The benchmark is using the average, ȳ, as the prediction.

Page 6: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Prediction accuracy measures

$e_i = y_i - \hat{y}_i$, where $y_i$ = actual y value and $\hat{y}_i$ = predicted y value

MAE = Mean Absolute Error $= \frac{1}{n}\sum_{i=1}^{n}|e_i|$ (MAE is also known as MAD = Mean Absolute Deviation)
Average Error $= \frac{1}{n}\sum_{i=1}^{n} e_i$
MAPE = Mean Absolute Percent Error $= 100\% \times \frac{1}{n}\sum_{i=1}^{n}\left|\frac{e_i}{y_i}\right|$
RMSE = Root-Mean-Squared Error $= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}$
Total SSE = Total Sum of Squared Errors $= \sum_{i=1}^{n} e_i^2$
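As a concrete illustration (not from the book; the function name and the sample numbers are made up), a minimal Python sketch of these measures might look like this:

```python
import math

def accuracy_measures(actual, predicted):
    errors = [y - yhat for y, yhat in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n                              # MAE / MAD
    avg_error = sum(errors) / n                                        # Average Error
    mape = 100 * sum(abs(e / y) for e, y in zip(errors, actual)) / n   # MAPE (assumes no actual value is 0)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)                  # RMSE
    sse = sum(e ** 2 for e in errors)                                  # Total SSE
    return {"MAE": mae, "AverageError": avg_error, "MAPE": mape, "RMSE": rmse, "TotalSSE": sse}

# Illustrative values only
print(accuracy_measures([24.0, 21.6, 34.7], [25.1, 20.3, 31.0]))
```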

Page 7: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Excel Miner
Example: Multiple Regression using the Boston Housing dataset

Page 8: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Excel Miner
Residuals for the Training dataset and the Validation dataset

Page 9: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Excel Miner
Residuals for the Training dataset and the Validation dataset

Page 10: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

SAS Enterprise Miner
Example: Multiple Regression using the Boston Housing dataset

Page 11: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

SAS Enterprise Miner
Residuals for the Validation dataset

Page 12: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

SAS Enterprise Miner
Boxplot of residuals

Page 13: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Lift Chart

Order the predicted y-values from highest to lowest.
X-axis = cases from 1 to n.
Y-axis = cumulative predicted value of y.
Two lines are plotted: one using the average ȳ as the predicted value (the benchmark), and one using ŷ found with the prediction model. (A sketch of the computation follows below.)
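A minimal Python sketch of how those two lines could be computed, following the slide's description (the function and variable names are illustrative assumptions):

```python
# Build the data behind a lift chart for a continuous outcome: sort the cases by
# predicted value and accumulate, once using the model's predictions and once
# using the average as the prediction for every case.
def lift_chart_points(actual, predicted):
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    avg = sum(actual) / len(actual)        # benchmark prediction: the average of y
    cum_model, cum_baseline, running = [], [], 0.0
    for rank, i in enumerate(order, start=1):
        running += predicted[i]            # cumulative predicted value, model ordering
        cum_model.append(running)
        cum_baseline.append(avg * rank)    # cumulative value if the average is used for every case
    return cum_model, cum_baseline         # plot both against case number 1..n
```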

Page 14: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Excel Miner: Lift Chart

Page 15: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Decile-wise Lift Chart

Order the predicted y-values from highest to lowest.
X-axis = % of cases from 10% to 100%, i.e. the 1st decile through the 10th decile.
Y-axis = decile lift: for each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted. (See the sketch below.)
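A corresponding sketch for the decile-wise version, again an assumed illustration rather than the book's code (it assumes the number of cases divides evenly into ten deciles):

```python
# Split the cases, sorted by predicted value, into ten equal groups and compare
# each group's total prediction with the total the overall average would give.
def decile_lift(actual, predicted):
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    avg = sum(actual) / len(actual)
    size = len(order) // 10                        # records per decile
    lifts = []
    for d in range(10):
        chunk = order[d * size:(d + 1) * size]
        lifts.append(sum(predicted[i] for i in chunk) / (avg * len(chunk)))
    return lifts                                   # a value above 1 means the decile beats the average
```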

Page 16: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

XLMiner: Decile-wise Lift Chart

Page 17: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Accuracy Measures (Classification)

Page 18: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Misclassification error

Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = percent of misclassified records out of the total records in the validation data

Page 19: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Benchmark: Naïve Rule

Naïve rule: classify all records as belonging to the most prevalent class.
Used only as a benchmark to evaluate more complex classifiers.

Page 20: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Separation of Records

"High separation of records" means that using the predictor variables attains low error.
"Low separation of records" means that using the predictor variables does not improve much on the naïve rule.

Page 21: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

High Separation of Records

Page 22: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Low Separation of Records

Page 23: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Classification Confusion Matrix

201 1’s correctly classified as “1” (True positives)

85 1’s incorrectly classified as “0” (False negatives)

25 0’s incorrectly classified as “1” (False positives)

2689 0’s correctly classified as “0” (True negatives)

Classification Confusion Matrix

                   Predicted 1   Predicted 0
Actual 1               201            85
Actual 0                25          2689

Page 24: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Error Rate

Overall error rate = (25+85)/3000 = 3.67%

Accuracy = 1 – err = (201+2689)/3000 = 96.33%

If there are multiple classes, the error rate is (sum of misclassified records)/(total records).

Classification Confusion Matrix

                   Predicted 1   Predicted 0
Actual 1               201            85
Actual 0                25          2689

Page 25: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Classification Matrix: Meaning of Each Cell
(rows = actual class, columns = predicted class)

n_{1,1} = no. of C1 cases classified correctly as C1
n_{1,2} = no. of C1 cases classified incorrectly as C2
n_{2,1} = no. of C2 cases classified incorrectly as C1
n_{2,2} = no. of C2 cases classified correctly as C2

Misclassification rate = err = (n_{1,2} + n_{2,1}) / (n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2})

Accuracy = 1 – err = (n_{1,1} + n_{2,2}) / (n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2})
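A minimal sketch of these definitions in Python (an assumed illustration, with C1 coded as 1 and C2 as 0):

```python
# Tally the confusion-matrix cells and compute error rate and accuracy
# for a two-class problem with labels 1 (C1) and 0 (C2).
def confusion_metrics(actual, predicted):
    n11 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    n12 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    n21 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    n22 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    n = n11 + n12 + n21 + n22
    err = (n12 + n21) / n
    return {"matrix": ((n11, n12), (n21, n22)), "error_rate": err, "accuracy": 1 - err}
```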

Page 26: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Propensity

Propensities are estimated probabilities that a case belongs to each of the classes.
They are used in two ways:
To generate predicted class membership
To rank-order cases by their probability of belonging to a particular class of interest

Page 27: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Propensity and Cutoff for Classification

Most data mining algorithms classify via a 2-step process. For each case:
1. Compute the probability of belonging to the class of interest
2. Compare it to the cutoff value, and classify accordingly

The default cutoff value is 0.50: if the propensity is >= 0.50, classify as "1"; if < 0.50, classify as "0".
Different cutoff values can be used; typically, the error rate is lowest for a cutoff of 0.50. (A small sketch follows below.)
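A minimal sketch of cutoff-based classification (illustrative only):

```python
# Classify each case as "1" if its propensity meets the cutoff, else "0".
def classify(propensities, cutoff=0.50):
    return [1 if p >= cutoff else 0 for p in propensities]

# With the propensities in the cutoff table on the next slide, classify(p, 0.50)
# labels 13 records as "1" and classify(p, 0.80) labels 7 records as "1".
```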

Page 28: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Cutoff Table

Actual Class   Prob. of "1"      Actual Class   Prob. of "1"
     1            0.996               1             0.506
     1            0.988               0             0.471
     1            0.984               0             0.337
     1            0.980               1             0.218
     1            0.948               0             0.199
     1            0.889               0             0.149
     1            0.848               0             0.048
     0            0.762               0             0.038
     1            0.707               0             0.025
     1            0.681               0             0.022
     1            0.656               0             0.016
     0            0.622               0             0.004

If the cutoff is 0.50: thirteen records are classified as "1"
If the cutoff is 0.80: seven records are classified as "1"

Page 29: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Confusion Matrix for Different Cutoffs

Page 30: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Confusion Matrix for Different Cutoffs

Cutoff = 0.25 – Classification Confusion Matrix

                     Predicted owner   Predicted non-owner
Actual owner               11                  1
Actual non-owner            4                  8

Cutoff = 0.75 – Classification Confusion Matrix

                     Predicted owner   Predicted non-owner
Actual owner                7                  5
Actual non-owner            1                 11

Page 31: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

When One Class is More Important

In many cases it is more important to identify members of one class:
Tax fraud
Credit default
Response to a promotional offer
Predicting delayed flights

In such cases, overall accuracy is not a good measure for evaluating the classifier; use sensitivity and specificity instead.

Page 32: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Sensitivity

Suppose that "C1" is the important class. Sensitivity is the ability (probability) to correctly detect membership in class "C1", and is given by

Sensitivity = n_{1,1} / (n_{1,1} + n_{1,2})

Sensitivity is also known as the hit rate, or true positive rate.

Page 33: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Specificity

Suppose that "C1" is the important class. Specificity is the ability (probability) to correctly rule out members of class "C2", and is given by

Specificity = n_{2,2} / (n_{2,1} + n_{2,2})

1 – Specificity = false positive rate
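A minimal Python sketch of both measures, using the confusion-matrix counts defined earlier (an assumed illustration, with the important class C1 coded as 1):

```python
def sensitivity_specificity(actual, predicted):
    n11 = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    n12 = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    n21 = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    n22 = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    sensitivity = n11 / (n11 + n12)   # true positive rate (hit rate)
    specificity = n22 / (n21 + n22)   # 1 - specificity is the false positive rate
    return sensitivity, specificity
```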

Page 34: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

ROC Curve: Receiver Operating Characteristic Curve
The ROC curve plots sensitivity (true positive rate) against 1 – specificity (false positive rate) as the cutoff varies.
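One way to obtain the points on an ROC curve is to sweep the cutoff over the observed propensities and record (1 – specificity, sensitivity) at each cutoff; the sketch below is an assumed illustration of that idea (it presumes both classes appear in the data):

```python
def roc_points(actual, propensity):
    points = []
    for cutoff in sorted(set(propensity), reverse=True):
        pred = [1 if p >= cutoff else 0 for p in propensity]
        tp = sum(a == 1 and q == 1 for a, q in zip(actual, pred))
        fp = sum(a == 0 and q == 1 for a, q in zip(actual, pred))
        fn = sum(a == 1 and q == 0 for a, q in zip(actual, pred))
        tn = sum(a == 0 and q == 0 for a, q in zip(actual, pred))
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (false positive rate, true positive rate)
    return points
```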

Page 35: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Asymmetric Misclassification Costs

Page 36: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Misclassification Costs May Differ

The cost of making a misclassification error may be higher for one class than the other(s)

Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)

Page 37: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Example – Response to Promotional Offer

Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse).

The "naïve rule" (classify everyone as "0") has an error rate of only 1% (seems good):

                Predicted Class 0   Predicted Class 1
Actual 0               990                  0
Actual 1                10                  0

Page 38: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

The Confusion Matrix

Suppose that using data mining we can correctly classify eight 1's as 1's. It comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's:

                Predicted Class 0   Predicted Class 1
Actual 0               970                 20
Actual 1                 2                  8

Error rate = (2+20)/1000 = 2.2% (higher than the naïve rate)

Page 39: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Introducing Costs & Benefits

Suppose:
Profit from a "1" is $10
Cost of sending an offer is $1

Then:
Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit.
Under the DM predictions, 28 offers are sent:
8 respond, with a profit of $10 each
20 fail to respond, at a cost of $1 each
972 receive nothing (no cost, no profit)

Net profit = $80 - $20 = $60

Page 40: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Profit Matrix

                Predicted Class 0   Predicted Class 1
Actual 0               $0                -$20
Actual 1               $0                 $80

Page 41: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Minimize Opportunity Costs

As we saw, it is best to convert everything to costs, as opposed to a mix of costs and benefits.
E.g., instead of "benefit from sale", refer to "opportunity cost of lost sale".
This leads to the same decisions, but referring only to costs allows greater applicability.

Page 42: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Cost Matrix with Opportunity Costs

Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

                Predicted Class 0   Predicted Class 1
Actual 0               970                 20
Actual 1                 2                  8

Costs           Predicted Class 0      Predicted Class 1
Actual 0        970 x $0 = $0          20 x $1 = $20
Actual 1        2 x $10 = $20          8 x $1 = $8

Total opportunity cost = 0 + 20 + 20 + 8 = 48

Page 43: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Average misclassification cost

q1 = cost of misclassifying an actual C1 as belonging to C2

q2 = cost of misclassifying an actual C2 as belonging to C1

Average misclassification cost = (q1·n_{1,2} + q2·n_{2,1}) / (n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2})

Look for a classifier that minimizes this average cost.

Page 44: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Generalize to Cost Ratio

Sometimes actual costs and benefits are hard to estimate.

Need to express everything in terms of costs (i.e., cost of misclassification per record)

A good practical substitute for individual costs is the ratio of misclassification costs (e.g., “misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms”)

Page 45: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Multiple Classes

Theoretically, there are m(m-1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m-1 other classes

Practically, that is too many to work with.

In a decision-making context, though, such complexity rarely arises; one class is usually of primary interest.

For m classes, confusion matrix has m rows and m columns

Page 46: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Judging Ranking Performance

Page 47: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Lift Chart for Binary Data

Input: a scored validation dataset, i.e. the actual class and the propensity (probability) of belonging to the class of interest, C1. (A sketch of the computation follows below.)

Sort records in descending order of propensity to belong to the class of interest

Compute cumulative number of C1 members for each row

Lift chart is the plot with row number (no. of records) as the x-axis and cumulative number of C1 members as the y-axis
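A minimal sketch of this construction (assumed, not the book's code):

```python
# Cumulative-lift data for a binary outcome: sort by propensity, then accumulate
# the number of actual C1 members (actual coded 1 for C1, 0 otherwise).
def cumulative_lift(actual, propensity):
    order = sorted(range(len(propensity)), key=lambda i: propensity[i], reverse=True)
    cumulative, count = [], 0
    for i in order:
        count += actual[i]
        cumulative.append(count)
    return cumulative               # y-values; the x-values are case numbers 1..n
```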

Page 48: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Lift Chart for Binary Data – Example

Case No.   Propensity (predicted prob. of class "1")   Actual class   Cumulative actual classes
   1              0.9959767                                 1               1
   2              0.9875331                                 1               2
   3              0.9844564                                 1               3
   4              0.9804396                                 1               4
   5              0.9481164                                 1               5
   6              0.8892972                                 1               6
   7              0.8476319                                 1               7
   8              0.7628063                                 0               7
   9              0.7069919                                 1               8
  10              0.6807541                                 1               9
  11              0.6563437                                 1              10
  12              0.6224195                                 0              10
  13              0.5055069                                 1              11
  14              0.4713405                                 0              11
  15              0.3371174                                 0              11
  16              0.2179678                                 1              12
  17              0.1992404                                 0              12
  18              0.1494827                                 0              12
  19              0.0479626                                 0              12
  20              0.0383414                                 0              12
  21              0.0248510                                 0              12
  22              0.0218060                                 0              12
  23              0.0161299                                 0              12
  24              0.0035600                                 0              12

Page 49: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Lift Chart with Costs and Benefits

Sort records in descending order of probability of success (success = belonging to the class of interest).
For each record, compute the cost/benefit of the actual outcome.
Compute a column of cumulative cost/benefit.
Plot the cumulative cost/benefit on the y-axis and the row number (no. of records) on the x-axis. (See the sketch below.)
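A minimal sketch of the same idea with costs and benefits; the default benefit of 10 and cost of -1 match the example on the next slide but are otherwise assumptions:

```python
# Cost/benefit lift curve: same ordering by propensity as before, but
# accumulating the cost or benefit of each actual outcome.
def cost_benefit_lift(actual, propensity, benefit=10.0, cost=-1.0):
    order = sorted(range(len(propensity)), key=lambda i: propensity[i], reverse=True)
    cumulative, total = [], 0.0
    for i in order:
        total += benefit if actual[i] == 1 else cost
        cumulative.append(total)
    return cumulative
```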

Page 50: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Lift Chart with Cost/Benefit – Example

Case No.   Propensity (predicted prob. of class "1")   Actual class   Cost/Benefit   Cumulative Cost/Benefit
   1              0.9959767                                 1               10               10
   2              0.9875331                                 1               10               20
   3              0.9844564                                 1               10               30
   4              0.9804396                                 1               10               40
   5              0.9481164                                 1               10               50
   6              0.8892972                                 1               10               60
   7              0.8476319                                 1               10               70
   8              0.7628063                                 0               -1               69
   9              0.7069919                                 1               10               79
  10              0.6807541                                 1               10               89
  11              0.6563437                                 1               10               99
  12              0.6224195                                 0               -1               98
  13              0.5055069                                 1               10              108
  14              0.4713405                                 0               -1              107
  15              0.3371174                                 0               -1              106
  16              0.2179678                                 1               10              116
  17              0.1992404                                 0               -1              115
  18              0.1494827                                 0               -1              114
  19              0.0479626                                 0               -1              113
  20              0.0383414                                 0               -1              112
  21              0.0248510                                 0               -1              111
  22              0.0218060                                 0               -1              110
  23              0.0161299                                 0               -1              109
  24              0.0035600                                 0               -1              108

Page 51: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Lift Curve May Go Negative

If the total net benefit from all cases is negative, the reference line will have a negative slope.

Nonetheless, the goal is still to use the cutoff to select the point where the net benefit is at a maximum.

Page 52: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Negative slope to reference curve

Page 53: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Oversampling and Asymmetric Costs

Page 54: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Rare Cases

Responder to a mailing
Someone who commits fraud
Debt defaulter

Often we oversample rare cases to give model more information to work with

Typically use 50% “1” and 50% “0” for training

Asymmetric costs/benefits typically go hand in hand with presence of rare but important class

Page 55: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Example

The following graphs show optimal classification under three scenarios:
1. Assuming equal costs of misclassification
2. Assuming that misclassifying "o" is five times the cost of misclassifying "x"
3. An oversampling scheme allowing DM methods to incorporate asymmetric costs

Page 56: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Classification: equal costs

Page 57: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Classification: Unequal Costs
Suppose that failing to catch "o" is 5 times as costly as failing to catch "x".

Page 58: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Oversampling for asymmetric costs

Oversample “o” to appropriately weight misclassification costs – without or with replacement

Page 59: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Equal Number of Responders

Sample an equal number of responders and non-responders.

Page 60: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

An Oversampling Procedure

1. Separate the responders (rare) from the non-responders
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders
3. The remaining responders go to the validation sample
4. Add non-responders to the validation data, to maintain the original ratio of responders to non-responders
5. Randomly take a test set (if needed) from the validation data

(A sketch of this procedure follows below.)
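A minimal sketch of this procedure (an assumed illustration; `data` is a list of (features, label) pairs and `overall_ratio` is the responder proportion in the whole dataset):

```python
import random

def oversample_split(data, overall_ratio):
    responders = [r for r in data if r[1] == 1]
    nonresponders = [r for r in data if r[1] == 0]
    random.shuffle(responders)
    random.shuffle(nonresponders)
    half = len(responders) // 2
    # Step 2: half the responders plus an equal number of non-responders for training
    train = responders[:half] + nonresponders[:half]
    # Steps 3-4: remaining responders, plus non-responders to restore the original ratio
    valid_resp = responders[half:]
    n_valid_nonresp = round(len(valid_resp) * (1 - overall_ratio) / overall_ratio)
    valid = valid_resp + nonresponders[half:half + n_valid_nonresp]
    return train, valid
```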

Page 61: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Assessing Model Performance

1. Score the model with a validation dataset selected without oversampling
2. Score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling

Method 1 is straightforward and easier to implement.

Page 62: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Adjusting confusion matrix for oversampling

                 Whole data   Sample
Responders            2%         50%
Nonresponders        98%         50%

Example:

One responder in whole data = 50/2 = 25 in the sample

One nonresponder in whole data = 50/98 = 0.5102 in the sample

Page 63: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Adjusting confusion matrix for oversampling

Suppose that the confusion matrix with the oversampled validation dataset is as follows:

Classification matrix, Oversampled data (Validation)

              Predicted 0   Predicted 1   Total
Actual 0          390           110         500
Actual 1           80           420         500
Total             470           530        1000

Misclassification rate = (80 + 110)/1000 = 0.19 or 19%
Percentage of records predicted as "1" = 530/1000 = 0.53 or 53%

Page 64: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Adjusting confusion matrix for oversampling

Classification matrix, Reweighted

              Predicted 0              Predicted 1             Total
Actual 0      390/0.5102 = 764.4       110/0.5102 = 215.6       980
Actual 1      80/25 = 3.2              420/25 = 16.8             20
Total         767.6                    232.4                   1000

Weight for responder (Actual 1) = 25
Weight for nonresponder (Actual 0) = 0.5102

Misclassification rate = (3.2 + 215.6)/1000 = 0.219 or 21.9%

Percentage of records predicted as "1" = 232.4/1000 = 0.2324 or 23.24%
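A minimal sketch of this reweighting (an assumed illustration using the weights from this example):

```python
# Reweight an oversampled confusion matrix back to the original class proportions.
def reweight_confusion(matrix, resp_weight=25.0, nonresp_weight=0.5102):
    # matrix rows are actual classes: matrix[0] = actual 0, matrix[1] = actual 1
    (n00, n01), (n10, n11) = matrix
    reweighted = ((n00 / nonresp_weight, n01 / nonresp_weight),
                  (n10 / resp_weight,    n11 / resp_weight))
    total = sum(sum(row) for row in reweighted)
    err = (reweighted[0][1] + reweighted[1][0]) / total
    return reweighted, err

# The matrix from this slide: ((390, 110), (80, 420)) gives an error rate near 21.9%.
print(reweight_confusion(((390, 110), (80, 420))))
```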

Page 65: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Adjusting the Lift Curve for Oversampling

Sort records in descending order of probability of success (success = belonging to the class of interest).
For each record, compute the cost/benefit of the actual outcome.
Divide that value by the oversampling rate (weight) of the actual class.
Compute a cumulative column of weighted cost/benefit.
Plot the cumulative weighted cost/benefit on the y-axis and the row number (no. of records) on the x-axis.

Page 66: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Adjusting Lift curve for Oversampling -- Example

Suppose that the cost/benefit values and oversampling weights are as follows:

             Cost/Benefit   Oversampling weight
Actual 0         -3                0.6
Actual 1         50                 20

Case No.   Propensity (predicted prob. of class "1")   Actual class   Weighted Cost/Benefit   Cumulative weighted Cost/Benefit
   1              0.9959767                                 1               2.50                     2.50
   2              0.9875331                                 1               2.50                     5.00
   3              0.9844564                                 1               2.50                     7.50
   4              0.9804396                                 1               2.50                    10.00
   5              0.9481164                                 1               2.50                    12.50
   6              0.8892972                                 1               2.50                    15.00
   7              0.8476319                                 1               2.50                    17.50
   8              0.7628063                                 0              -5.00                    12.50
   9              0.7069919                                 1               2.50                    15.00
  10              0.6807541                                 1               2.50                    17.50
  11              0.6563437                                 1               2.50                    20.00
  12              0.6224195                                 0              -5.00                    15.00
  13              0.5055069                                 1               2.50                    17.50
  14              0.4713405                                 0              -5.00                    12.50
  15              0.3371174                                 0              -5.00                     7.50
  16              0.2179678                                 1               2.50                    10.00
  17              0.1992404                                 0              -5.00                     5.00
  18              0.1494827                                 0              -5.00                     0.00
  19              0.0479626                                 0              -5.00                    -5.00
  20              0.0383414                                 0              -5.00                   -10.00
  21              0.0248510                                 0              -5.00                   -15.00
  22              0.0218060                                 0              -5.00                   -20.00
  23              0.0161299                                 0              -5.00                   -25.00
  24              0.0035600                                 0              -5.00                   -30.00

Page 67: Chapter 5  Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel  Bruce

Adjusting Lift curve for Oversampling -- Example