
Page 1

Discrete Predictive Modeling

Casualty Actuarial Society

Special Interest Seminar on Predictive Modeling

Chicago

October 5, 2004

Presented by Christopher Monsour, FCAS, MAAA

© 2004 Towers Perrin

Page 2

What do other people do with it?

Pattern recognition / image processing

Measuring medical trial outcomes

Direct response modeling

Classification of texts and artifacts on stylistic and physical criteria

Categorization of web pages / organization of information

Page 3

What good is it in insurance?

Claim frequency / claim occurrence models

Claim closure with or without payment

Response models (direct mail, cross-sale)

Customer retention

Underwriting inspections

Premium audit

Fraud

Page 4

Topics

Discrete modeling generally

Terminology

Comparison of models

Techniques for supervised learning

Intuition for, and a rough sketch of, the technical details

Advantages and disadvantages

Techniques for unsupervised learning

Sketch of a couple of techniques

Page 5

Goal

Can read the literature

Have confidence to try things

Software packages

Your own unique way of handling a unique challenge

Page 6

Discrete Modeling — Terminology

Page 7

Scoring

Often a two-class model produces scores

Observations with scores greater than a certain amount are classified to A; the rest to B

The cutoff score can be changed

— e.g., could use the cutoff that gives the lowest misclassification cost

A soft assignment model is a model where these scores can reasonably be interpreted as probabilities
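The cutoff-tuning idea is easy to make concrete. Below is a minimal sketch (mine, not from the presentation) that scans candidate cutoffs for the one with the lowest total misclassification cost; the scores, labels, and costs are hypothetical.

```python
def best_cutoff(scores, labels, cost_fn=1.0, cost_fp=1.0):
    """Return (cutoff, cost) minimizing total misclassification cost;
    observations with score >= cutoff are classified to A."""
    best = None
    for c in sorted(set(scores)):
        fn = sum(1 for s, y in zip(scores, labels) if y == "A" and s < c)
        fp = sum(1 for s, y in zip(scores, labels) if y == "B" and s >= c)
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (c, cost)
    return best

print(best_cutoff([0.9, 0.8, 0.4, 0.2], ["A", "B", "A", "B"]))  # (0.4, 1.0)
```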

Page 8

Some Contrasts

Supervised vs. unsupervised learning

Hard vs. soft assignment

Two vs. many classes

Equal vs. unequal misclassification costs

Assigning class priors (π_J) vs. using the proportions in the data

Training vs. test data

Page 9

Some Terminology (Confusion Matrix)

                      Predicted Class
                      A          B          C
Actual Class    A     12,332     34         322
                B     3,124      214,324    2,345
                C     312        345        23,445

Page 10

Some Terminology (Sensitivity and Specificity)

If testing for A:

Sensitivity = probability of predicting A when A is true
= 1 – false negative rate
= 12,332 / (12,332 + 34 + 322) ≈ 97.2%

Specificity = probability of predicting not-A when not-A is true
= 1 – false positive rate
= (214,324 + 2,345 + 345 + 23,445) / (3,124 + 214,324 + 2,345 + 312 + 345 + 23,445) ≈ 98.6%
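As a check, the arithmetic above can be reproduced directly from the confusion matrix on the previous slide; the short script below is mine, not the presenter's.

```python
# Confusion matrix from the previous slide (rows = actual, cols = predicted)
cm = {"A": {"A": 12332, "B": 34,     "C": 322},
      "B": {"A": 3124,  "B": 214324, "C": 2345},
      "C": {"A": 312,   "B": 345,    "C": 23445}}

sensitivity = cm["A"]["A"] / sum(cm["A"].values())
not_a_total = sum(sum(cm[r].values()) for r in ("B", "C"))
false_pos = cm["B"]["A"] + cm["C"]["A"]          # not-A predicted as A
specificity = (not_a_total - false_pos) / not_a_total
print(round(sensitivity, 4), round(specificity, 4))  # 0.972 0.9859
```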

Page 11

Some Terminology (ROC curve)

Changing misclassification costs will often change the sensitivity / specificity tradeoff

Increasing cost of classifying A as not-A will increase sensitivity at the cost of specificity

Receiver-Operating Characteristic curve (ROC curve):

Sensitivity vs. false positive rate (i.e., 1 – specificity)

Allows comparison of several types of model, each tuned to various specificities by changing the misclassification costs

Area under the ROC curve is a commonly used comparison (more is better)

As with all tests, comparison should be on test data, not training data
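For readers who want to draw the curve itself, here is a minimal sketch (not from the presentation) of building ROC points and the trapezoidal area under them from scores and 0/1 labels; the example inputs are invented.

```python
def roc_points(scores, labels):
    """ROC points (1 - specificity, sensitivity), sweeping the cutoff
    from high to low; labels: 1 = class A, 0 = not-A."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    # Trapezoidal area under the ROC curve (more is better)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(auc(roc_points([0.9, 0.8, 0.7, 0.4, 0.2], [1, 1, 0, 1, 0])))  # 0.8333...
```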

Page 12

Some Terminology (Gains Chart)

With scores, one can vary the number classed as type A continuously. Call this x%

Gain = proportion of those classed as A that are A, compared to the proportion of A in the general population

Gain is ≥ 1 and decreases as one moves to the right (including more quantiles in a mailing, for example). A flat line at 1 is a worthless model

Often used in response modeling: the “gain” vs. a random mailing

Page 13

Some Terminology (Lift Curve)

Lift = percent of class A that falls in the first x% of scores

Sensitivity as a function of quantile

If you take the 20% scored most likely to be in class A, the false positive rate will be less than 20% (for a better-than-random model), so the lift curve lies to the right of the ROC curve
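A quick sketch (mine, with invented scores and labels) of lift and gain at a chosen quantile x, using the definitions on the last two slides.

```python
def lift_and_gain(scores, labels, x):
    """labels: 1 = class A. Returns (lift, gain) at quantile x."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_top = max(1, round(x * len(pairs)))
    hits = sum(y for _, y in pairs[:n_top])       # class A in the top x%
    lift = hits / sum(labels)
    gain = (hits / n_top) / (sum(labels) / len(labels))
    return lift, gain

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   0,   1,   0,   0,   0,   1,   0,   0]
print(lift_and_gain(scores, labels, 0.2))  # (0.5, 2.5)
```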

Page 14

Some Terminology

[Figure: ROC, Lift, and Gains curves on one chart. X-axis: false positive rate for the ROC curve, quantile for the others (0 to 1). Left y-axis: sensitivity for the ROC and Lift curves (0 to 1); right y-axis: gain (0 to 4.5).]

Page 15

Discrete Modeling — Comparing Models

Page 16

How to Compare Models

On Test Data

Hard Assignment

Area under ROC curve

Total cost

Specificity, sensitivity

Prediction accuracy (total cost where all misclassifications have the same cost)

Misclassification rate of the perfect model is the “Bayes rate”

— Can do “better” than the Bayes rate on training data by overfitting

Page 17

How to Compare Models

On Test Data

Soft Assignment

Likelihood of test data

… plus all of the above

Page 18

How to Compare Models

On Training Data

Much less straightforward

Approaches:

Cross-validation

Log-likelihood measures penalized for degrees of freedom

— Akaike’s information criterion

– Penalty = degrees of freedom

— Schwarz’s Bayesian criterion

– Penalty = 0.5 × ln(#obs) × df
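A small illustration of the two penalties under the slide's convention (penalty subtracted from the log-likelihood); the log-likelihoods, degrees of freedom, and sample size below are hypothetical.

```python
import math

def aic_score(loglik, df):
    # AIC-style penalized log-likelihood: penalty = df
    return loglik - df

def sbc_score(loglik, df, n_obs):
    # SBC-style: penalty = 0.5 * ln(#obs) * df
    return loglik - 0.5 * math.log(n_obs) * df

# The bigger model gains 10 units of log-likelihood for 15 extra df,
# not enough here: both criteria prefer the smaller model (higher is better),
# and SBC penalizes the extra parameters more as the data set grows.
print(aic_score(-5000.0, 10), aic_score(-4990.0, 25))
print(sbc_score(-5000.0, 10, 30000), sbc_score(-4990.0, 25, 30000))
```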

Page 19

How to Compare Models

On Training Data

Much less straightforward

Must be careful about measuring degrees of freedom

— Size of the space searched, not just the number of parameters, is relevant

— Shrinkage is relevant

Page 20

Thematic Database: Census Data Extract From the UC Irvine Repository

48,842 working adults, split 2/3 train, 1/3 test

45,222 observations with no missing values

Dependent variable: whether income exceeds $50,000 per annum

Note that actual income is not included

Approximately 24% of the workers had incomes exceeding $50,000

Like many UCI databases, it has been thoroughly studied

Page 21

Thematic Database: Census Data Extract From the UC Irvine Repository

Best cited error rates by the data donors are with naïve Bayes combined with feature subset selection; this was reported at a test error rate of 14.05%

How much lower than 14.05% do we think the Bayes error rate really is?

Analogous to the question of how much variance in insured loss experience is explainable and how much remains purely random (“noise”) even in the perfect model

Page 22

Thematic Database: Census Data Extract From the UC Irvine Repository

Independent Variables

Categorical:

— Work-class (e.g., government, private, self)

— Marital status

— Relationship

— Occupation (e.g., clerical, professional, etc.)

— Race

— Sex

— Native country

Page 23

Thematic Database: Census Data Extract From the UC Irvine Repository

Independent Variables

Ordinal: Education

Continuous: Age, hours worked per week, capital gains reported on income tax, capital losses reported on income tax

Data set includes weights

In most of the modeling, I made the mistake of treating capital gain / loss as yes / no

Page 24

Thematic Database: Census Data Extract From the UC Irvine Repository

Used as an illustration for a broad range of techniques

On a practical problem, try fewer techniques, but spend more time on feature selection:

— Transformations of predictors

— Interactions of predictors

— Selection of predictors

Page 25

Thematic Database

Example From the Data: Income vs. Marital Status

Income   Married–     Married–    Divorced   Separated   Never     Married–        Widowed
         Civ-Spouse   AF-Spouse                          Married   Spouse Absent
>50K     21.0%        0.0%        1.5%       0.2%        1.6%      0.1%            0.2%
<=50K    24.8%        0.0%        12.2%      3.2%        31.6%     1.2%            2.3%

(Cell values are percentages of the whole population; the >50K row sums to about 24.6%, consistent with the roughly 24% noted earlier.)

Page 26

Techniques for Supervised Learning

Page 27

Global vs. Local

Most Global Model Imaginable: unweighted one-parameter model

— High bias, low variance

— Appropriate if low SNR

Most Local Model Imaginable: nearest neighbor

— High variance, low bias

— Appropriate if high SNR

Page 28

Density Estimation in Classification Problems

A statistical problem in its own right …

… but also a way of handling the classification problem

If you have populations A and B:

— Estimate the densities f_A(x) and f_B(x)

— Estimate the prior class probabilities π_A and π_B

— Then assign a new observation with coordinates x to the class J that maximizes π_J f_J(x)

The prior probabilities can be taken from the data or from other knowledge

Estimating the densities is the tough part
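A minimal sketch of this recipe, assuming a single continuous predictor and Epanechnikov kernel density estimates for each class; the samples, priors, and radius below are hypothetical.

```python
import numpy as np

def kde_density(x, sample, r=1.0):
    """Epanechnikov kernel density estimate at x (common normalizing
    constants omitted; they cancel when comparing classes)."""
    u = (x - sample) / r
    return np.where(np.abs(u) < 1, 1 - u**2, 0.0).sum() / len(sample)

def classify(x, sample_a, sample_b, prior_a, prior_b, r=1.0):
    # Assign to the class J maximizing pi_J * f_J(x)
    score_a = prior_a * kde_density(x, sample_a, r)
    score_b = prior_b * kde_density(x, sample_b, r)
    return "A" if score_a >= score_b else "B"

a = np.array([1.0, 1.2, 1.5, 2.0, 2.2])   # training sample, class A
b = np.array([4.0, 4.5, 5.0, 5.5])        # training sample, class B
print(classify(3.0, a, b, prior_a=5/9, prior_b=4/9))
```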

Page 29

Density Estimation

Simplest density estimator is a histogram

Can generalize this by a sliding histogram: Height at any one point depends on number of observations within a specified distance

More generally, can use ‘kernel’ functions to take weighted averages, giving more weight to nearer points

[Figure: histogram of the 25-point sample below, with bins of width 2 covering 0 to 19; counts run from 0 to about 14.]

Sample data (as recovered from the garbled extraction): 1.06, 1.06, 1.09, 1.12, 1.27, 1.30, 1.40, 1.47, 1.57, 1.69, 1.77, 1.82, 1.86, 3.96, 4.39, 4.45, 5.04, 5.88, 6.12, 6.32, 6.87, 11.72, 13.68, 15.17, 19.00

Page 30

Density Estimation — Kernels

“Sliding histogram” is a kernel where the kernel drops off from 1 to 0 at a specified distance

Common choices of kernel (with kernel radius r, and object at distance d ≤ r):

— Epanechnikov: 1 − (d/r)², which minimizes mean square error asymptotically

— Tricube: (1 − (d/r)³)³

Can use a normal distribution, but it does have infinite radius … often undesirable

Note that endpoints are a problem. Extrapolation is an extreme problem
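The kernels above take only a few lines to code. This sketch (mine, not the presenter's) computes unnormalized kernel-weighted counts for the sliding-histogram, Epanechnikov, and tricube kernels, using the sample data as recovered on the histogram slide.

```python
import numpy as np

# Sample data from the histogram slide (as recovered from the extraction)
data = np.array([1.06, 1.06, 1.09, 1.12, 1.27, 1.30, 1.40, 1.47, 1.57,
                 1.69, 1.77, 1.82, 1.86, 3.96, 4.39, 4.45, 5.04, 5.88,
                 6.12, 6.32, 6.87, 11.72, 13.68, 15.17, 19.00])

def sliding_histogram(u):
    return np.where(np.abs(u) < 1, 1.0, 0.0)

def epanechnikov(u):
    return np.where(np.abs(u) < 1, 1 - u**2, 0.0)

def tricube(u):
    return np.where(np.abs(u) < 1, (1 - np.abs(u)**3)**3, 0.0)

def weighted_count(xs, data, kernel, r=1.0):
    """Kernel-weighted count of observations near each point in xs."""
    u = (xs[:, None] - data[None, :]) / r
    return kernel(u).sum(axis=1)

xs = np.linspace(0, 20, 201)
for k in (sliding_histogram, epanechnikov, tricube):
    print(k.__name__, weighted_count(xs, data, k).max())
```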

Page 31

Density Estimation — Kernels

[Figure: sliding-histogram estimate for the sample data; x-axis 0 to 20, y-axis counts 0 to 7.]

Page 32

Density Estimation — Kernels

[Figure: comparison of kernels with radius 1 (sliding-histogram kernel, Epanechnikov kernel, tricube kernel), plotted as weight against displacement from −1.5 to 1.5.]

Page 33

Density Estimation — Kernels

[Figure: density estimates of the sample data using the sliding histogram, the Epanechnikov kernel, and the tricube kernel; x-axis 0 to 20, y-axis 0 to 12.]

Page 34

Some Techniques

Naïve Bayes

K-Nearest Neighbor

Discriminant Analysis (Linear, Quadratic …)

Logistic Regression (various links …)

Trees (e.g., CART, CHAID, C4.5)

Page 35

Naïve Bayes

There are a lot of refinements to naïve Bayes, but the basic idea is very simple, and is also known as “idiot’s Bayes”:

Assume there are no interactions

Model densities univariately

— Use contingency table for discrete predictor

— For continuous predictor, usually bin the variable to make it discrete, but could just as easily use a kernel density estimator

In form, looks like a generalized additive model

— But much faster and simpler to fit

Page 36

Naïve Bayes

Example

Suppose there is a population of 100 men and 50 women

Of the population, 20 of the women are wearing skirts. The other 130 are wearing pants

Of the population, 20 of the men and 30 of the women have hair shoulder-length or longer

The goal is to predict gender from the other observations

Naïve Bayes assumes that for each class, the densities are the products of the marginals

Men (100)    Short Hair   Long Hair   Total
Pants        80%          20%         100%
Skirt        0%           0%          0%
Total        80%          20%         100%

Women (50)   Short Hair   Long Hair   Total
Pants        24%          36%         60%
Skirt        16%          24%         40%
Total        40%          60%         100%

Page 37

Naïve Bayes

Probability of observed person being male, assuming priors in the data (100 men, 50 women):

Prob(Male)   Short Hair   Long Hair
Pants        87.0%        52.6%
Skirt        0%           0%

Probability of observed person being male, assuming equal priors:

Prob(Male)   Short Hair   Long Hair
Pants        76.9%        35.7%
Skirt        0%           0%
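These posteriors are easy to verify; the sketch below (mine) applies the product-of-marginals assumption to the example and reproduces the four nonzero percentages.

```python
# Class-conditional marginals from the example (100 men, 50 women)
p_long = {"men": 0.20, "women": 0.60}   # P(long hair | class)
p_skirt = {"men": 0.00, "women": 0.40}  # P(skirt | class)

def prob_male(hair, clothing, prior_male):
    """P(male | hair, clothing) under naive Bayes, for a given prior."""
    priors = {"men": prior_male, "women": 1 - prior_male}
    score = {}
    for c in ("men", "women"):
        ph = p_long[c] if hair == "long" else 1 - p_long[c]
        pc = p_skirt[c] if clothing == "skirt" else 1 - p_skirt[c]
        score[c] = priors[c] * ph * pc
    total = score["men"] + score["women"]
    return score["men"] / total if total > 0 else 0.0

for prior in (100 / 150, 0.5):  # priors from the data, then equal priors
    print([round(prob_male(h, "pants", prior), 3) for h in ("short", "long")])
# -> [0.87, 0.526] and [0.769, 0.357], matching the tables above
```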

Page 38

Naïve Bayes

What naïve Bayes does not take into account is that the second table of data (the women) could actually look like this:

Women (50)   Short Hair   Long Hair   Total
Pants        40%          20%         60%
Skirt        0%           40%         40%
Total        40%          60%         100%

This would change the resulting probabilities considerably

Page 39

Naïve Bayes

Advantages

Easy to do

Very easy to interpret

— Just one-way tables put together

Decision boundaries fairly flexible, but not completely general

Page 40

Naïve Bayes

Disadvantages

Sensitive to feature selection

— Easy to double count effects

— For instance, I did not use the variables “relationship” or “sex” in the model for the Census data

– Too highly correlated with marital status

Can automate feature selection and make naïve Bayes a good method even on problems with more predictors than observations

Does not handle interactions gracefully

Page 41

Naïve Bayes on the Census data

Performed rather well for the model at hand

Even without formal feature selection, have misclassification rate of 16.0% and area under ROC curve of 89.2%

Naïve Bayes is the simplest example of a Bayesian network

A Bayesian network is a structure that describes conditional independence assumptions (e.g., marital status and occupation might be independent given sex and age)

In general, in a Bayesian network, variables are assumed conditionally independent of their ancestors given their parents

The naïve Bayes assumption is that all characteristics are conditionally independent given the class labels

Page 42

Discriminant Analysis

How to group things?

Naïve approach:

For each class, take the centroid of the training data for that class

Classify new points to the closest centroid

What’s wrong with this?

Define “close”

Normalizing predictor variables won’t help (much)

— Differences in some may be more important than differences in others

— Some may be strongly correlated

Page 43

Discriminant Analysis

[Figure: scatter plot of Class A and Class B training points (x-axis 0 to 2.5, y-axis 0 to 6), illustrating why “closest centroid” needs a careful definition of “close”.]

Page 44

Linear Discriminant Analysis (LDA)

Normal distance works well for spherical clusters

To the extent that classes are not spherical, rescale them

Modeling each class with a multivariate normal does three things:

Centers class density at centroid

Accounts for elliptical distribution

Accounts for dispersion of each class

But … tons of parameters to estimate:

If p predictors and k classes, then

k p-dimensional centroids and k p×p covariance matrices

Simplify:

Assume each class has same covariance matrix

Page 45

Linear Discriminant Analysis (LDA)

Estimation

Estimate centroids

For each observation (x, J), with class centroid C_J, consider x − C_J

Determine the covariance matrix Σ of the x − C_J

— Easy enough to do one pair of coordinates at a time: covariance is just the average of the product less the product of the averages

Σ = [ σ11  σ12  … ]
    [ σ12  σ22  … ]
    [  …    …   … ]
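A compact sketch of this estimation and the resulting classification rule, assuming X is an (n, p) numeric array and y an array of class labels; illustrative code, not the presenter's.

```python
import numpy as np

def fit_lda(X, y):
    classes = np.unique(y)
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    # Pool the within-class deviations x - C_J across all classes
    dev = np.vstack([X[y == c] - centroids[c] for c in classes])
    cov = dev.T @ dev / len(X)
    priors = {c: np.mean(y == c) for c in classes}
    return centroids, cov, priors

def predict_lda(x, centroids, cov, priors):
    # Gaussian discriminant with a shared covariance matrix: classify
    # to the class with the best Mahalanobis score plus log prior
    inv = np.linalg.inv(cov)
    def score(c):
        d = x - centroids[c]
        return np.log(priors[c]) - 0.5 * d @ inv @ d
    return max(centroids, key=score)
```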

Page 46

Linear Discriminant Analysis (LDA)

The result is called linear discriminant analysis because the decision boundary will be linear (in fact, a hyperplane)

Why?

Because a linear transformation will make the ellipsoids into spheres, and when the classes are spheres with a common covariance we know the boundary is a hyperplane

Page 47

Linear Discriminant Analysis (LDA) Virtues

There are really fewer degrees of freedom than it appears

Decision surface is a hyperplane in predictor space, so only p+1 degrees of freedom for 2 class problem if p is the number of predictors

The decision surface for a 2 class problem is the same as that resulting from linear regression

— Thus, it is not silly to apply linear regression to 2 class problems

[Figure: classes A and B separated by a hyperplane decision boundary.]

Page 48

LDA with level curves of densities

[Figure: level curves of the class A and class B densities, with the linear decision boundary between them.]

Page 49

Quadratic Discriminant Analysis (QDA)

[Figure: classes A and B with a curved (quadratic) decision boundary.]

If you have a ton of data, you can try estimating each covariance matrix separately

Not only a lot of parameters to estimate

… but also more sensitive than LDA to non-normality

Harder to interpret … decision surface is not linear

Poor method if any class has few representatives, no matter how huge the data set

Page 50

Quadratic Discriminant Analysis (QDA)

LDA and QDA outlines assume data were labeled as three groups

[Figure: the earlier density estimates (sliding histogram, Epanechnikov kernel, tricube kernel) overlaid with QDA and LDA density fits; x-axis 0 to 20, y-axis 0 to 18.]

Page 51

Quadratic Discriminant Analysis (QDA)

LDA and QDA outlines assume data were labeled as three groups

[Figure: the same chart as the previous slide, with the actual density added; x-axis 0 to 20, y-axis 0 to 18.]

Page 52

Regularized Discriminant Analysis (RDA)

How to get the right amount of flexibility

Average local covariance estimate with global one (typical actuarial thing to do)

Two Types of Averages Suggested by Friedman:

Average class covariance matrices with grand mean

Σ_{J,Z} = Z·Σ_J + (1 − Z)·Σ̄, where Σ_J is the class-J covariance matrix and Σ̄ is the grand mean

Choose Z by whatever produces the best fit

Ideally in terms of cross-validation

Page 53

Regularized Discriminant Analysis (RDA)

How to get the right amount of flexibility

Average local covariance estimate with global one (typical actuarial thing to do)

Two Types of Averages Suggested by Friedman:

Average the resulting covariance matrices with scalar multiple of identity

Choose the scalar multiple to have the same trace as Σ_{J,Z}

Scaling of predictors suddenly matters

Be careful with this if you have collinearity, since this assumes there isn’t much collinearity

Page 54

Discriminant Analysis

Other ways to reduce degrees of freedom without resorting to LDA

Feature selection … drop the features with the least influence on the discriminant function

Other types of averaging of covariance matrices:

— For example, can insist that the covariance matrices are the same up to a scalar multiple

— Multiply the pooled covariance matrix by a credibility weighting of the class and pooled determinants

In the Census example, I tried Friedman’s two types of regularization. The first (average of LDA and QDA) worked best

Page 55

Discriminant Analysis on Census Data

Quick and dirty

Recoded categorical variables to category means

Did not look for interactions or transformations

LDA, QDA

Friedman’s two types of regularization. The first (average of LDA and QDA) worked best

LDA misclassification of 15.9%, area under ROC curve 89.6%

RDA with 40% weight to pooled and 60% weight to unpooled has misclassification rate of 15.8% and area under ROC curve of 89.7%

Page 56

Logistic Regression

Generalized Linear Model

Dependent variable is conditionally Bernoulli (0 or 1)

— Note that you cannot think of this as “Bernoulli errors”

Various ways of handling more than 2 classes

If h is the [inverse] link function, then the modeled probability of success given x1, …, xn is

h(b0 + b1x1 + b2x2 + … + bnxn)

Note that this always gives a decision boundary linear in the x_i

Choices for [inverse] link function h(z):

— Cumulative normal (also called “probit” regression)

— Logistic function: e^z / (1 + e^z)

— Complementary log-log: 1 − exp(−exp(z))

— Log-log: exp(−exp(−z))
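For reference, the four inverse links are one-liners; printing a few values shows how differently they approach 0 and 1. (Illustrative code, not from the slides.)

```python
import math

def probit(z):    # cumulative normal
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logistic(z):
    return math.exp(z) / (1 + math.exp(z))

def cloglog(z):   # complementary log-log
    return 1 - math.exp(-math.exp(z))

def loglog(z):
    return math.exp(-math.exp(-z))

for z in (-3.0, 0.0, 3.0):
    print(z, [round(f(z), 4) for f in (probit, logistic, cloglog, loglog)])
```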

Page 57

Logistic Regression

Logistic Link Use this one unless you have a reason to do otherwise

Can interpret b0+b1x1+b2x2+ … +bnxn as the log of the odds of success

Note: probability of success = odds of success / (1 + odds of success)

Effectively a multiplicative model for the odds

Interpret the bi

Allows for “retrospective” or stratified sampling

— Because sampling does not change the relative odds

— So it won’t bias the answer … it just changes the intercept

Page 58

Logistic Regression

Logistic Link

Logistic does not like to predict pure answers (predictions near 0 or 1); probit loves to do this

Logistic preferable if there’s “always a chance” that anything might happen

Complementary log-log looks very similar to logistic for rare classes

— Not appropriate if successes are common

Log-log not appropriate if failures are common

Page 59

Logistic Regression

[Figure: the four inverse links (probit, logit, complementary log-log, log-log) plotted for z from −6 to 6; y-axis is the modeled probability, 0 to 1.]

Page 60

Logistic Regression

Advantages

Scores interpretable in terms of log odds

Constructed probabilities have a chance of making sense

— Modeled directly rather than as a ratio of two densities

A good “default” tool to use when appropriate, especially combined with feature creation and selection

Page 61

Logistic Regression

Disadvantages

Invites over-interpretation of parameters

For example, if a 10% rate increase

— Causes lapse rates for customers under age 30 to increase from 15% to 20%

— Causes lapse rates for customers 30 and over to increase from 5% to 10%

then logistic regression says the older customers are more price sensitive:

— Their odds of lapse increased by a factor of 19/9

— The young customers’ odds of lapse increased by a factor of 17/12

Doesn’t generalize to 3+ classes as painlessly as LDA
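The odds arithmetic is a one-line check:

```python
def odds(p):
    return p / (1 - p)

print(odds(0.10) / odds(0.05))  # older customers: 19/9, about 2.11
print(odds(0.20) / odds(0.15))  # younger customers: 17/12, about 1.42
```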

Page 62

Logistic Regression on the Census data

Quick and dirty

Recoded categorical variables to % with over 50K income in each category (instead of using indicators for each level)

Did not look for interactions or transformations

Misclassification rate of 15.8%

Area under ROC curve of 89.8%

Generally speaking, logistic regression does well on area-under-the-ROC-curve measures

A nice tool if you don’t know where you want to put the decision threshold until later

Page 63

k Nearest Neighbors

Score each observation by vote of the nearest k training points

Traditional for k to be odd

Note that if k=1 then the training error will be 0 by definition

— This is not necessarily a good thing

Cross-validation will give a good estimate of error on a test set, assuming independent observations in the training set

This is very similar to kernel density estimation

— But the neighborhood size is determined by the density of observations

— Within the neighborhood all observations count equally

[Figure: classes A and B with a k-nearest-neighbor decision boundary.]
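A bare-bones scorer matching this description (Euclidean distance, equal votes within the neighborhood); X_train, y_train, and x are assumed numeric arrays, with y_train coded 0/1.

```python
import numpy as np

def knn_score(x, X_train, y_train, k=11):
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest
    return y_train[nearest].mean()   # vote: fraction of class 1 among them
```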

Page 64

k Nearest Neighbors

Advantages

Very flexible … can model almost any decision boundary

Requires no distributional assumptions

Page 65

k Nearest Neighbors

Disadvantages

Computationally painful

— Search entire training set for nearest neighbors for every test point

— There are ways to speed this up, but still slow

Breaks down with large number of predictors (curse of dimensionality)

Too flexible

— Easy to overfit

— Of course, can usually cure this by choosing k large enough

— For census data, k=11 much better than k=1

Need to decide how to scale the axes

— Standardizing variables is not necessarily a sensible solution

— For census data, used one-way tables for categoricals

Page 66

ROC Curve Comparison

[Figure: ROC curves on the Census test data for LDA, logistic regression, naïve Bayes, and k-nearest neighbors with k=11 and k=1.]

Page 67

Decision Trees

Recursively split the data

Greedy

At each iteration choose split to maximize some measure of significance or purity

Continue until reaching some stopping criterion, e.g.,

— Don’t split nodes smaller than a certain size

— Don’t split nodes with significance less than a certain amount

Prune this back

[Figure: a small decision tree whose recursive splits partition the predictor space into regions labeled A and B.]

Page 68

Decision Trees

CART did very well on misclassification: 14.8%

Probably because I allowed capital gains and losses to be continuous, and it used them: Second split was on capital gains at about $5,000

Area under ROC curve only 89.6%. Suggests that if logistic regression and LDA had been given a few bins for capital gains and losses they might have done better

[Figure: top of the CART tree. Root node: 24.7% earn over $50K. Relationship = husband or wife: 46.3%, splitting by occupation into repair, farming, etc. (23.5%) and professional, etc. (62.1%). Other relationships: 6.9%, splitting into capital gains > $4,700 (81.5%) and capital gains < $4,700 (5.0%).]

Page 69

ROC Curve Comparison

[Figure: ROC curves on the Census test data for LDA, logistic regression, naïve Bayes, k-nearest neighbors (k=11, k=1), and CART.]

Page 70

Decision Trees

Some common algorithms

CHAID

CART

C4.5

Page 71

Decision Trees (CHAID)

CHAID (d categories of dependent variable):

Classify predictors as ordinal or categorical

For each categorical (resp., ordinal) predictor, merge the pair (resp., adjacent pair) of categories where the 2 × d contingency table is least significant, if it is not significant at a certain level p

— A missing value can be considered adjacent to any value

— Alternate this with testing whether merged categories can be split at that significance level

If d=2, can treat categorical predictors as ordinal, ordered by the proportion of the first class

— Sum of [(observed − expected)² / expected] is chi-square with (d − 1) degrees of freedom

This is just like stepwise regression (using chi-square instead of F tests)
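For concreteness, here is a sketch (mine) of the 2 × d chi-square statistic behind the merge step; the two count vectors are hypothetical class counts for a pair of candidate categories.

```python
def chi_square_2xd(a, b):
    """Chi-square statistic of the 2 x d table with rows a and b
    (class counts for two categories); compare to (d-1) df."""
    d = len(a)
    total = sum(a) + sum(b)
    col = [a[j] + b[j] for j in range(d)]
    stat = 0.0
    for row in (a, b):
        n = sum(row)
        for j in range(d):
            expected = n * col[j] / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat

print(chi_square_2xd([30, 70], [45, 55]))  # 4.8 for this made-up d=2 case
```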

Page 72

Decision Trees (CHAID)

Eventually, one has determined how to merge the categories for that predictor

If there are c of them, now compute the significance level of the c × d contingency table, which is chi-square with (c − 1)(d − 1) degrees of freedom

Bonferroni adjustment: multiply this significance level by a penalty for having the best partition into c classes.

Repeat this process for all predictors

Split on the most significant predictor

CHAID as such has a stopping rule but no pruning rule

However, could always allow a generous significance level (to overfit) and then prune as per CART

Page 73

Decision Trees (CART)

Consider all binary splits on all predictors (splits of the form x>a for ordinal variables)

Various criteria exist for determining the best split; we will focus on the Gini criterion:

Minimize expected misclassification cost

Sum of misclassification costs for each child node

— If the left child node has probabilities 90% A, 5% B, and 5% C, and takes 30% of the observations

— And the right node has probabilities 20% A, 70% B, and 10% C, and takes 70% of the observations

— And the cost of misclassifying an A or C object is 1, but the cost of classifying B as A is 2 and B as C is 3, then

Page 74

Decision Trees (CART)

Sum of misclassification costs for each child node

— The total misclassification cost for the left node is:

– 30% × [90%·5% + 90%·5% + 2·(5%·90%) + 3·(5%·5%) + 5%·90% + 5%·5%], one term for each ordered pair of distinct classes, weighted by its misclassification cost

— Compute for right node similarly and add
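The same computation in a few lines (my sketch, using the slide's probabilities and costs):

```python
def node_cost(probs, costs, share):
    """Gini-style expected misclassification cost for one node:
    sum over ordered pairs (i, j), i != j, of cost(true i classified
    as j) * p_i * p_j, weighted by the node's share of observations."""
    total = sum(costs.get((i, j), 1.0) * probs[i] * probs[j]
                for i in probs for j in probs if i != j)
    return share * total

probs_left = {"A": 0.90, "B": 0.05, "C": 0.05}
costs = {("B", "A"): 2.0, ("B", "C"): 3.0}  # all other misclassifications cost 1
print(node_cost(probs_left, costs, share=0.30))  # 0.0705 for the left node
```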

Page 75

Decision Trees (CART)

Grow an extremely overfit model (large tree)

Determine an order in which to prune back

Score each prune as

— (increase in expected misclassification cost) / (decrease in number of terminal nodes)

— Note that this can be seen as requiring a minimum usefulness for each degree of freedom

Use cross-validation to determine which in the series of pruned trees is the best


Page 77

Hybrid Models

Use nodes of a tree as indicators in logistic regression or LDA

Capture interactions

Use a tree for the initial model and include a linear model within each leaf node

More complex (i.e., presumably the tree splits should try to make nodes as “linear” rather than as “pure” as possible)

Models for adjacent nodes might not glue together nicely

If a variable has important interactions with all others

Consider separate models for each of its levels

E.g., marital status might be a good candidate with the census model

Page 78

Techniques for Unsupervised Learning

Page 79

Some Unsupervised Learning

Cluster Analysis

Hierarchical clustering

— Agglomerative vs. divisive

— Similar ideas often used in the creation of rating territories

– In that case there is a specific covariate of interest, namely loss cost

— In other applications, often looking for a “lumpy” area of feature space

– E.g., a segment with enough similar people to advertise to

Prototype clustering

— k-means

— EM version thereof

— These have similarities to LDA and QDA

Page 80

k-Means

Choose k prototypes at random within the range of the data

Class each point to the nearest prototype

Move each prototype to the centroid of its class

Repeat until convergence

Rinse and repeat (varying the starting prototypes and varying k)

Choose “best fit”

Strengths:

— Nothing could be easier

— Flexibility to weight covariates

Weaknesses:

— Flexibility to weight covariates

– DO NOT “normalize” the covariates to std dev of 1

– Better to normalize the RANGE to [-1,1] or [0,1]
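A minimal k-means sketch following these steps, assuming X is an (n, p) numeric array; the "rinse and repeat" restarts are left to the caller. (Illustrative only.)

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k prototypes at random within the range of the data
    prototypes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, X.shape[1]))
    for _ in range(iters):
        # Class each point to the nearest prototype
        d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each prototype to the centroid of its class
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else prototypes[j] for j in range(k)])
        if np.allclose(new, prototypes):  # repeat until convergence
            break
        prototypes = new
    return prototypes, labels
```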

Page 81

Expectation Maximization

EM is a very useful generic algorithm

This is the simplest possible use

Start with random prototypes μ_k and random class probabilities π_k, and a global variance of σ² in every direction for each cluster

E step: for each observation x, compute P(x in cluster k)

M step: for each cluster, using the membership probabilities as weights, re-estimate μ_k. Once all the means have been estimated, re-estimate σ² from the global data

More complex options would allow elliptical clusters, or for some clusters to be tighter than others
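A sketch of this simplest EM variant (spherical clusters sharing one global variance); illustrative rather than the presenter's implementation.

```python
import numpy as np

def em_spherical(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = X[rng.choice(n, k, replace=False)]   # prototypes mu_k
    pi = np.full(k, 1.0 / k)                  # class probabilities pi_k
    sigma2 = X.var()                          # one global variance
    for _ in range(iters):
        # E step: membership probabilities from spherical Gaussian
        # densities (the shared normalizing constant cancels)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        w = pi * np.exp(-d2 / (2 * sigma2))
        w /= w.sum(axis=1, keepdims=True)
        # M step: weighted means, then re-estimate the global variance
        nk = w.sum(axis=0)
        mu = (w.T @ X) / nk[:, None]
        pi = nk / n
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (w * d2).sum() / (n * p)
    return mu, pi, sigma2
```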

Page 82

Useful References

Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Chapman & Hall, 1984

Hand, David J., Construction and Assessment of Classification Rules, Wiley, 1997

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001

Hastie and Tibshirani, Generalized Additive Models, Chapman & Hall, 1990

McCullagh and Nelder, Generalized Linear Models, 2nd ed., Chapman & Hall, 1989

Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science