TRANSCRIPT
Discrete Predictive ModelingCasualty Actuarial Society
Special Interest Seminar on Predictive Modeling
Chicago
October 5, 2004
Presented by Christopher Monsour, FCAS, MAAA
© 2004 Towers Perrin
What do other people do with it?
Pattern recognition / image processing
Measuring medical trial outcomes
Direct response modeling
Classification of texts and artifacts on stylistic and physical criteria
Categorization of web pages / organization of information
What good is it in insurance?
Claim frequency / claim occurrence models
Claim closure with or without payment
Response models (direct mail, cross-sale)
Customer retention
Underwriting inspections
Premium audit
Fraud
Topics
Discrete modeling generally
Terminology
Comparison of models
Techniques for supervised learning
Intuition for, rough sketch of technical details
Advantages and disadvantages
Techniques for unsupervised learning
Sketch of a couple of techniques
Goal
Can read the literature
Have confidence to try things
Software packages
Your own unique way of handling a unique challenge
Discrete Modeling — Terminology
Scoring
Often a two class model produces scores
Observations with scores greater than a certain amount are classified to A; the rest to B
The cutoff score can be changed
— e.g., could use the cutoff that gives the lowest misclassification cost
A soft assignment model is a model where these scores can reasonably be interpreted as probabilities
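The cutoff mechanics described above are easy to sketch; a minimal Python illustration (the scores, labels, and costs below are hypothetical, purely to show the idea of tuning the cutoff to the lowest misclassification cost):

```python
# Hypothetical scores and true labels (1 = class A, 0 = class B).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   1,    0,   0]

# Assumed costs: misclassifying A as B costs 5, B as A costs 1.
COST_FN, COST_FP = 5.0, 1.0

def total_cost(cutoff):
    """Total misclassification cost if scores above cutoff go to class A."""
    cost = 0.0
    for s, y in zip(scores, labels):
        pred = 1 if s > cutoff else 0
        if y == 1 and pred == 0:
            cost += COST_FN
        elif y == 0 and pred == 1:
            cost += COST_FP
    return cost

# Sweep candidate cutoffs and keep the cheapest one.
best = min(sorted(set(scores)) + [0.0], key=total_cost)
print(best, total_cost(best))
```

Raising the cost of classifying A as B pushes the optimal cutoff down, classifying more observations to A.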
Some Contrasts
Supervised vs. unsupervised learning
Hard vs. soft assignment
Two vs. many classes
Equal vs. unequal misclassification costs
Assigning class priors (πJ) vs. using the proportions in the data
Training vs. test data
Some Terminology
Confusion Matrix

| Actual Class \ Predicted Class | A | B | C |
|---|---|---|---|
| A | 12,332 | 34 | 322 |
| B | 3,124 | 214,324 | 2,345 |
| C | 312 | 345 | 23,445 |
Some Terminology
Sensitivity and Specificity
If testing for A:
Sensitivity = probability of predicting A when A is true =
1 – false negative rate = 12,332 / (12,332 + 34 + 322)
Specificity = probability of predicting not-A when not-A is true =
1 – false positive rate = (214,324 + 2,345 + 345 + 23,445) /
(3,124 + 214,324 + 2,345 + 312 + 345 + 23,445)
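These two definitions can be checked directly against the confusion matrix on the previous slide; a small Python sketch (the `sensitivity` and `specificity` helpers are hypothetical):

```python
# Confusion matrix from the slide: keys are (actual, predicted) classes.
cm = {
    ("A", "A"): 12_332, ("A", "B"): 34,      ("A", "C"): 322,
    ("B", "A"): 3_124,  ("B", "B"): 214_324, ("B", "C"): 2_345,
    ("C", "A"): 312,    ("C", "B"): 345,     ("C", "C"): 23_445,
}

def sensitivity(cm, cls):
    """P(predict cls | actual cls)."""
    row = {p: n for (a, p), n in cm.items() if a == cls}
    return row[cls] / sum(row.values())

def specificity(cm, cls):
    """P(predict not-cls | actual not-cls)."""
    correct = sum(n for (a, p), n in cm.items() if a != cls and p != cls)
    total = sum(n for (a, p), n in cm.items() if a != cls)
    return correct / total

print(f"Sensitivity for A: {sensitivity(cm, 'A'):.1%}")  # 97.2%
print(f"Specificity for A: {specificity(cm, 'A'):.1%}")  # 98.6%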
Some Terminology (ROC curve)
Changing misclassification costs will often change the sensitivity / specificity tradeoff
Increasing cost of classifying A as not-A will increase sensitivity at the cost of specificity
Receiver-Operating Characteristic curve (ROC curve):
Sensitivity vs. false positive rate (i.e., 1 – specificity)
Allows comparison of several types of model each tuned to various specificities by changing the misclassification costs
Area under the ROC curve is a commonly used comparison (more is better)
As with all tests, comparison should be on test data, not training data
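The area under the ROC curve has a convenient rank interpretation that makes it a one-function sketch in Python (the scores below are made up for illustration):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly chosen class-A (label 1) observation
    outscores a randomly chosen other observation, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

# A model that ranks every positive above every negative has AUC 1;
# the model below gets one of the four positive-negative pairs wrong.
print(roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # 0.75
```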
Some Terminology
Gains Chart
With scores, can vary the number classed as type A continuously; call this x%
Gain = proportion of those classed as A that are A, compared to the proportion in the general population
Gain is ≥ 1 and decreases as one moves to the right (including more quantiles in a mailing, for example); a flat line at 1 is a worthless model
Often used in response modeling: the “gain” vs. random mailing
Some Terminology
Lift Curve
Lift = percent of class A that falls in the first x% of scores
— i.e., sensitivity as a function of quantile
If you take the 20% scored most likely to be in class A, the false positive rate will be less than 20%, so the lift curve is to the right of the ROC curve
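Both quantities can be computed from scored data in a few lines; a sketch (the `lift_and_gain` helper and its sample portfolio are hypothetical):

```python
def lift_and_gain(scores, labels, frac):
    """Lift: share of all class-A members (label 1) captured in the top
    frac of scores. Gain: A-rate among the top frac of scores relative
    to the overall A-rate."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    k = max(1, round(frac * len(ranked)))
    top = [y for _, y in ranked[:k]]
    total_a = sum(labels)
    lift = sum(top) / total_a
    gain = (sum(top) / k) / (total_a / len(labels))
    return lift, gain

# Hypothetical scored portfolio, already sorted for readability:
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
lift, gain = lift_and_gain(scores, labels, 0.2)  # top 20% of scores
print(lift, gain)
```

Here the top 20% of scores captures half of class A (lift 0.5) at 2.5 times the base response rate (gain 2.5).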
Some Terminology
[Chart: ROC, Lift, and Gains curves plotted together. x-axis: false positive rate for ROC, quantiles for others (0 to 1); left y-axis: sensitivity for ROC and Lift (0 to 1); right y-axis: gain (0 to 4.5)]
Discrete Modeling — Comparing Models
How to Compare Models
On Test Data
Hard Assignment
Area under ROC curve
Total cost
Specificity, sensitivity
Prediction accuracy (total costs where all misclassifications have the same cost)
Misclassification rate of the perfect model is the “Bayes rate”
— Can do “better” than the Bayes rate on training data by overfitting
How to Compare Models
On Test Data
Soft Assignment
Likelihood of test data
… plus all of the above
How to Compare Models
On Training Data
Much less straightforward
Approaches:
Cross-validation
Log-likelihood measures penalized for degrees of freedom
— Akaike’s information criterion: penalty = degrees of freedom
— Schwarz’s Bayesian criterion: penalty = 0.5 × ln(#obs) × df
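The two penalties can be sketched as follows. Note the convention: the slide penalizes the log-likelihood directly, while many references report −2× these quantities; the helper and comparison figures below are hypothetical.

```python
import math

def penalized_loglik(loglik, df, n_obs):
    """Penalized log-likelihoods in the slide's convention (penalty
    subtracted from the log-likelihood; higher is better)."""
    aic = loglik - df                           # Akaike: penalty = df
    sbc = loglik - 0.5 * math.log(n_obs) * df   # Schwarz: 0.5 x ln(#obs) x df
    return aic, sbc

# Hypothetical comparison: a small model vs. a larger one on 30,000 obs.
small = penalized_loglik(-5000, 10, 30000)
large = penalized_loglik(-4995, 25, 30000)
# The 5-unit gain in log-likelihood does not pay for 15 extra parameters:
print(small, large)
```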
How to Compare Models
On Training Data
Much less straightforward
Must be careful about measuring degrees of freedom
— Size of space searched, not just number of parameters, is relevant
— Shrinkage is relevant
Thematic Database
Census Data Extract From the UC Irvine Repository
48,842 working adults, split 2/3 train, 1/3 test
45,222 observations with no missing values
Dependent variable: whether income exceeds $50,000 per annum
Note that actual income is not included
Approximately 24% of the workers had incomes exceeding $50,000
Like many UCI databases, it has been thoroughly studied
Thematic Database
Census Data Extract From the UC Irvine Repository
Best cited error rates by the data donors use Naïve Bayes combined with feature subset selection, reported at a test error rate of 14.05%
How much lower than 14.05% do we think the Bayes error rate really is?
Analogous to the question of how much variance in insured loss experience is explainable and how much remains purely random (“noise”) even under the perfect model
Thematic Database
Census Data Extract From the UC Irvine Repository
Independent Variables
Categorical:
— Work-class (e.g., government, private, self)
— Marital status
— Relationship
— Occupation (e.g., clerical, professional, etc.)
— Race
— Sex
— Native country
Thematic Database
Census Data Extract From the UC Irvine Repository
Independent variables
Ordinal: Education
Continuous: Age, hours worked per week, capital gains reported on income tax, capital losses reported on income tax
Data set includes weights
In most of the modeling, I made the mistake of treating capital gain / loss as yes / no
Thematic Database
Census Data Extract From the UC Irvine Repository
Used as an illustration for a broad range of techniques
On a practical problem, try fewer techniques
But spend more time on feature selection
— Transformations of predictors
— Interactions of predictors
— Selection of predictors
Thematic Database
Example From the Data (each cell is the percent of the population)

| Income | Married – Civ-Spouse | Married – AF-Spouse | Divorced | Separated | Never Married | Married – Spouse Absent | Widowed |
|---|---|---|---|---|---|---|---|
| <=50K | 24.8% | 0.0% | 12.2% | 3.2% | 31.6% | 1.2% | 2.3% |
| >50K | 21.0% | 0.0% | 1.5% | 0.2% | 1.6% | 0.1% | 0.2% |
Techniques for Supervised Learning
Global vs. Local
Most Global Model Imaginable: unweighted one-parameter model
— High bias, low variance
— Appropriate if low signal-to-noise ratio
Most Local Model Imaginable: nearest neighbor
— High variance, low bias
— Appropriate if high signal-to-noise ratio
Density Estimation in Classification Problems
A statistical problem in its own right …
… but also a way of handling the classification problem
If you have populations A and B
Estimate the densities fA(x) and fB(x)
Estimate the prior class probabilities πA and πB
Then assign a new observation with coordinates x to the class J that maximizes πJ fJ(x)
The prior probabilities can be taken from the data or from other knowledge
Estimating the densities is the tough part
Density Estimation
Simplest density estimator is a histogram
Can generalize this by a sliding histogram: Height at any one point depends on number of observations within a specified distance
More generally, can use ‘kernel’ functions to take weighted averages, giving more weight to nearer points
[Figure: histogram of the sample data (values ranging from 1.06 to 19.00), bins 0–1 through 18–19]
Density Estimation — Kernels
A “sliding histogram” is a kernel where the weight drops from 1 to 0 at a specified distance
Common choices of kernel (with kernel radius r and object at distance d):
— Epanechnikov: 1 − (d/r)²; minimizes mean square error asymptotically
— Tricube: (1 − (d/r)³)³
Can use a normal distribution, but it has infinite radius … often undesirable
Note that endpoints are a problem; extrapolation is an extreme problem
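The estimators above can be sketched in Python. The `kernel_density` helper is hypothetical; each kernel is given the normalizing constant that makes the estimate integrate to 1 (a detail the slide omits):

```python
def kernel_density(data, x, r, kernel="epanechnikov"):
    """Kernel density estimate at x with radius r. Each entry pairs the
    kernel shape with the constant that normalizes it over [-r, r]."""
    kernels = {
        "sliding":      (lambda u: 1.0,                 0.5),    # sliding histogram
        "epanechnikov": (lambda u: 1 - u**2,            0.75),
        "tricube":      (lambda u: (1 - abs(u)**3)**3,  70 / 81),
    }
    k, c = kernels[kernel]
    # Only points strictly within radius r contribute any weight.
    total = sum(k(abs(x - xi) / r) for xi in data if abs(x - xi) < r)
    return c * total / (len(data) * r)
```

With a single data point at 0 and radius 1, the Epanechnikov estimate peaks at 0.75 and falls to exactly zero at distance 1, unlike a normal kernel.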
Density Estimation — Kernels
[Figure: sliding-histogram density estimate of the sample data, x from 0 to 20]
Density Estimation — Kernels
[Figure: comparison of kernels with radius 1; weight vs. displacement from −1.5 to 1.5 for the sliding histogram, Epanechnikov, and tricube kernels]
Density Estimation — Kernels
[Figure: density estimates of the sample data using the sliding histogram, Epanechnikov, and tricube kernels, x from 0 to 20]
Some Techniques
Naïve Bayes
K-Nearest Neighbor
Discriminant Analysis (Linear, Quadratic …)
Logistic Regression (various links …)
Trees (e.g., CART, CHAID, C4.5)
Naïve Bayes
There are a lot of refinements to naïve Bayes, but the basic idea is very simple, and is also known as “idiot’s Bayes”:
Assume there are no interactions
Model densities univariately
— Use contingency table for discrete predictor
— For continuous predictor, usually bin the variable to make it discrete, but could just as easily use a kernel density estimator
In form, looks like a generalized additive model
— But much faster and simpler to fit
Naïve Bayes
Example
Suppose there is a population of 100 men and 50 women
Of the population, 20 of the women are wearing skirts; the other 130 are wearing pants
Of the population, 20 of the men and 30 of the women have hair shoulder-length or longer
The goal is to predict gender from the other observations
Naïve Bayes assumes that for each class, the densities are the products of the marginals:

Men:

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 80% | 20% | 100% |
| Skirt | 0% | 0% | 0% |
| Total | 80% | 20% | 100% |

Women:

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 24% | 36% | 60% |
| Skirt | 16% | 24% | 40% |
| Total | 40% | 60% | 100% |
Naïve Bayes

Men (100):

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 80% | 20% | 100% |
| Skirt | 0% | 0% | 0% |
| Total | 80% | 20% | 100% |

Women (50):

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 24% | 36% | 60% |
| Skirt | 16% | 24% | 40% |
| Total | 40% | 60% | 100% |

Probability of observed person being male, assuming priors in the data:

| Prob (Male) | Short Hair | Long Hair |
|---|---|---|
| Pants | 87.0% | 52.6% |
| Skirt | 0% | 0% |

Probability of observed person being male, assuming equal priors:

| Prob (Male) | Short Hair | Long Hair |
|---|---|---|
| Pants | 76.9% | 35.7% |
| Skirt | 0% | 0% |
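The probability squares above can be reproduced directly from the marginals; a sketch in Python (the helper is hypothetical, with the marginals hard-coded from the example):

```python
def naive_bayes_prob_male(clothes, hair, prior_male):
    """Naive Bayes posterior P(male | clothes, hair). Marginals from the
    example: 100 men (all pants, 20% long hair); 50 women (40% skirts,
    60% long hair)."""
    p_x_given = {
        "male":   {"pants": 1.0, "skirt": 0.0, "short": 0.8, "long": 0.2},
        "female": {"pants": 0.6, "skirt": 0.4, "short": 0.4, "long": 0.6},
    }
    prior = {"male": prior_male, "female": 1 - prior_male}
    # Naive Bayes: the density for each class is the product of the marginals.
    score = {c: prior[c] * p_x_given[c][clothes] * p_x_given[c][hair]
             for c in prior}
    return score["male"] / (score["male"] + score["female"])

# Priors taken from the data (100 of 150 are men):
print(round(100 * naive_bayes_prob_male("pants", "short", 100 / 150), 1))  # 87.0
print(round(100 * naive_bayes_prob_male("pants", "long",  100 / 150), 1))  # 52.6
# Equal priors:
print(round(100 * naive_bayes_prob_male("pants", "short", 0.5), 1))        # 76.9
print(round(100 * naive_bayes_prob_male("pants", "long",  0.5), 1))        # 35.7
```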
Naïve Bayes
What Naïve Bayes does not take into account is that the second square of data (Women, 50) could actually look like:

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 40% | 20% | 60% |
| Skirt | 0% | 40% | 40% |
| Total | 40% | 60% | 100% |

This would change the resulting probabilities considerably
Naïve Bayes
Advantages
Easy to do
Very easy to interpret
— Just one-way tables put together
Decision boundaries fairly flexible but not completely general
Naïve Bayes
Disadvantages
Sensitive to feature selection
— Easy to double count effects
— For instance, I did not use the variables “relationship” or “sex” in the model for the Census data: too highly correlated with marital status
Can automate feature selection and make Naïve Bayes a good method even on problems with more predictors than observations
Does not handle interactions gracefully
Naïve Bayes on the Census data
Performed rather well for the model at hand
Even without formal feature selection, the misclassification rate is 16.0% and the area under the ROC curve is 89.2%
Naïve Bayes is the simplest example of a Bayesian network
A Bayesian network is a structure that describes conditional independence assumptions (e.g., marital status and occupation might be independent given sex and age)
In general, in a Bayesian network, variables are assumed conditionally independent of their ancestors given their parents
The naïve Bayes assumption is that all characteristics are conditionally independent given the class labels
Discriminant Analysis
How to group things?
Naïve approach:
For each class, take the centroid of the training data for that class
Classify new points to the closest centroid
What’s wrong with this?
Define “close”
Normalizing predictor variables won’t help (much)
— Differences in some may be more important than differences in others
— Some may be strongly correlated
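The naive centroid approach is worth seeing concretely; a Python sketch (the `nearest_centroid` helper and its data are hypothetical), which indeed ignores both the scale and the correlation of the predictors:

```python
def nearest_centroid(train, new_point):
    """Naive approach from the slide: classify a new point to the class
    whose training-data centroid is closest in plain Euclidean distance.
    train maps class label -> list of coordinate tuples."""
    def centroid(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    cents = {cls: centroid(pts) for cls, pts in train.items()}
    return min(cents, key=lambda cls: dist2(cents[cls], new_point))

train = {"A": [(0, 0), (1, 1)], "B": [(4, 4), (5, 5)]}
print(nearest_centroid(train, (1, 2)))  # A
```

Rescaling either coordinate would move the implicit boundary, which is exactly the objection raised above.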
Discriminant Analysis
[Figure: scatter plot of Class A and Class B points with a candidate decision boundary]
Linear Discriminant Analysis (LDA)
Normal distance works well for spherical clusters
To the extent that classes are not spherical, rescale them
Modeling each class with a multivariate normal does three things:
Centers class density at centroid
Accounts for elliptical distribution
Accounts for dispersion of each class
But … tons of parameters to estimate:
If p predictors and k classes, then
k p-dimensional centroids and k p×p covariance matrices
Simplify:
Assume each class has the same covariance matrix
Linear Discriminant Analysis (LDA)
Estimation
Estimate centroids
For each observation (x, J), with class centroid CJ, consider x − CJ
Determine the covariance matrix of the x − CJ
— Easy enough to do one pair of coordinates at a time: covariance is just the average of the product less the product of the averages

Σ = [ σ11 σ12 … ; σ12 σ22 … ; … ]
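The pooled-covariance step can be sketched in Python (the `pooled_covariance` helper is hypothetical); it computes the covariance one pair of coordinates at a time, exactly as described:

```python
def pooled_covariance(train):
    """Covariance matrix of the centered observations x - C_J, pooled
    across classes. train maps class label -> list of coordinate tuples."""
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]
    # Center every observation at its own class centroid.
    dev = []
    for pts in train.values():
        c = centroid(pts)
        dev += [[xi - ci for xi, ci in zip(x, c)] for x in pts]
    n, p = len(dev), len(dev[0])
    mean = [sum(d[j] for d in dev) / n for j in range(p)]
    # Average of the product less the product of the averages.
    return [[sum(d[i] * d[j] for d in dev) / n - mean[i] * mean[j]
             for j in range(p)] for i in range(p)]
```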
Linear Discriminant Analysis (LDA)
The result is called linear discriminant analysis because the decision boundary will be linear (in fact, a hyperplane)
Why?
Because a linear transformation will make the ellipsoids into spheres, and when the clusters are spheres we know the boundary is a hyperplane

Σ = [ σ11 σ12 … ; σ12 σ22 … ; … ]
Linear Discriminant Analysis (LDA) Virtues
There are really fewer degrees of freedom than it appears
Decision surface is a hyperplane in predictor space, so only p+1 degrees of freedom for 2 class problem if p is the number of predictors
The decision surface for a 2 class problem is the same as that resulting from linear regression
— Thus, it is not silly to apply linear regression to 2 class problems
LDA with level curves of densities
[Figure: level curves of the class A and class B densities, with the linear decision boundary between them]
Quadratic Discriminant Analysis (QDA)
If you have a ton of data, you can try estimating each covariance matrix separately
Not only a lot of parameters to estimate
… but also more sensitive than LDA to non-normality
Harder to interpret … decision surface is not linear
Poor method if any class has few representatives, no matter how huge the data set
Quadratic Discriminant Analysis (QDA)
LDA and QDA outlines assume the data were labeled as three groups
[Figure: sliding histogram, Epanechnikov kernel, and tricube kernel estimates with LDA and QDA density outlines, x from 0 to 20]
Quadratic Discriminant Analysis (QDA)
LDA and QDA outlines assume the data were labeled as three groups
[Figure: as above, with the actual density overlaid on the sliding histogram, kernel, LDA, and QDA densities]
Regularized Discriminant Analysis (RDA)
How to get the right amount of flexibility
Average local covariance estimate with global one (typical actuarial thing to do)
Two Types of Averages Suggested by Friedman:
Average class covariance matrices with the grand mean:
ΣJ,λ = λ ΣJ + (1 − λ) Σ̄
Choose λ by whatever produces the best fit
Ideally in terms of cross-validation
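Friedman's first average is a one-liner per matrix entry; a Python sketch (the helper is hypothetical, with λ written as `lam`):

```python
def rda_covariance(cov_class, cov_pooled, lam):
    """Credibility-weight a class covariance matrix against the pooled
    one: Sigma_J(lam) = lam * Sigma_J + (1 - lam) * Sigma_pooled.
    lam = 1 recovers QDA; lam = 0 recovers LDA."""
    p = len(cov_class)
    return [[lam * cov_class[i][j] + (1 - lam) * cov_pooled[i][j]
             for j in range(p)] for i in range(p)]

# Halfway between a class covariance of 2I and a pooled covariance of I:
print(rda_covariance([[2, 0], [0, 2]], [[1, 0], [0, 1]], 0.5))
```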
Regularized Discriminant Analysis (RDA)
How to get the right amount of flexibility
Average local covariance estimate with global one (typical actuarial thing to do)
Two Types of Averages Suggested by Friedman:
Average the resulting covariance matrices with a scalar multiple of the identity
Choose the scalar multiple to have the same trace as ΣJ,λ
Scaling of predictors suddenly matters
Be careful with this if you have collinearity, since this assumes there isn’t much collinearity
Discriminant Analysis
Other ways to reduce degrees of freedom without resorting to LDA
Feature selection … drop the features with the least influence on the discriminant function
Other types of averaging of covariance matrices:
— For example, can insist that the covariance matrices are the same up to a scalar multiple
— Multiply the pooled covariance matrix by a credibility weighting of the class and pooled determinants
In the Census example, I tried Friedman’s two types of regularization. The first (average of LDA and QDA) worked best
Discriminant Analysis on Census Data
Quick and dirty
Recoded categorical variables to category means
Did not look for interactions or transformations
LDA, QDA
Friedman’s two types of regularization. The first (average of LDA and QDA) worked best
LDA misclassification of 15.9%, area under ROC curve 89.6%
RDA with 40% weight to pooled and 60% weight to unpooled has misclassification rate of 15.8% and area under ROC curve of 89.7%
Logistic Regression
Generalized Linear Model: dependent variable is conditionally Bernoulli (0 or 1)
Note that you cannot think of this as "Bernoulli errors"
Various ways of handling more than 2 classes
If h is the [inverse] link function, then modeled probability of success given x1, … , xn is
h(b0+b1x1+b2x2+ … +bnxn)
Note that this always gives a decision boundary linear in the xi
Choices for [inverse] link function h(z):
— Cumulative normal (also called "probit" regression)
— Logistic function: e^z/(1+e^z)
— Complementary log-log: 1−exp(−exp(z))
— Log-log: exp(−exp(−z))
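The four choices can be written out directly. A quick Python sketch (not from the presentation, standard library only):

```python
import math

# The four inverse link functions listed above, each mapping a linear
# score z to a probability in (0, 1).

def probit(z):     # cumulative normal, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z):   # e^z / (1 + e^z)
    return math.exp(z) / (1.0 + math.exp(z))

def cloglog(z):    # complementary log-log: 1 - exp(-exp(z))
    return 1.0 - math.exp(-math.exp(z))

def loglog(z):     # log-log: exp(-exp(-z))
    return math.exp(-math.exp(-z))
```

Probit and logistic are symmetric about z = 0 (both give 0.5 there); the two log-log variants are not, which is why each is appropriate only when one of the two classes is rare.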
57
Logistic Regression
Logistic Link
Use this one unless you have a reason to do otherwise
Can interpret b0+b1x1+b2x2+ … +bnxn as the log of the odds of success
Note: probability of success = odds of success / (1 + odds of success)
Effectively a multiplicative model for the odds
Easy to interpret the bi
Allows for “retrospective” or stratified sampling
— Because sampling does not change the relative odds
— So it won’t bias the answer … it just changes the intercept
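A small sketch of the two facts above: probability = odds/(1+odds), and oversampling successes by a factor s multiplies every observation's odds by the same s, so only the intercept shifts (by log s). The values here are illustrative, not from the presentation:

```python
import math

# Converting a log-odds score to a probability, and the intercept shift
# under retrospective (stratified) sampling.

def prob_from_log_odds(score):
    odds = math.exp(score)        # multiplicative model for the odds
    return odds / (1.0 + odds)    # probability = odds / (1 + odds)

# Oversampling successes by a factor s multiplies every observation's
# odds by s, i.e., adds log(s) to every log-odds score. The slopes
# b1..bn are untouched; only b0 absorbs the shift.
s = 10.0
score = 0.7                       # hypothetical b0 + b1*x1 + ... for one record
shifted = score + math.log(s)
```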
58
Logistic Regression
Logistic Link
Logistic does not like to predict pure answers (predictions near 0 or 1)
Probit loves to do this
Logistic preferable if there's "always a chance" that anything might happen
Complementary log-log looks very similar to logistic for rare classes
— Not appropriate if successes are common
Log-log not appropriate if failures are common
59
Logistic Regression
[Chart: the four inverse link functions — probit, logit, complementary log-log, log-log — plotted for z from −6 to 6, with probabilities from 0 to 1]
60
Logistic Regression
Advantages
Scores interpretable in terms of log odds
Constructed probabilities have a chance of making sense
Modeled directly rather than as ratio of two densities
A good “default” tool to use when appropriate, especially combined with feature creation and selection
61
Logistic Regression
Disadvantages
Invites over-interpretation of parameters
For example, if a 10% rate increase
— Causes lapse rates for customers under age 30 to increase from 15% to 20%
— Causes lapse rates for customers 30 and over to increase from 5% to 10%,
Then logistic regression says the older customers are more price sensitive
— Their odds of lapse increased by a factor of 19/9
— The young customers' odds of lapse increased by a factor of 17/12
Doesn't generalize to 3+ classes as painlessly as LDA
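The arithmetic behind the 19/9 and 17/12 figures, worked numerically:

```python
# Odds = p / (1 - p); "price sensitivity" here is the ratio of
# after-increase odds to before-increase odds.

def odds(p):
    return p / (1.0 - p)

young_factor = odds(0.20) / odds(0.15)   # under-30 lapse: 15% -> 20%
older_factor = odds(0.10) / odds(0.05)   # 30-and-over lapse: 5% -> 10%
```

The older group's lapse probability doubled while the younger group's rose by only a third, yet on the odds scale the older group's factor (19/9 ≈ 2.11) exceeds the younger group's (17/12 ≈ 1.42) — the over-interpretation risk the slide warns about.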
62
Logistic Regression on the Census data
Quick and dirty
Recoded categorical variables to % with over 50K income in each category (instead of using indicators for each level)
Did not look for interactions or transformations
Misclassification rate of 15.8%
Area under ROC curve of 89.8%
Generally speaking, logistic regression does well on area-under-the-ROC-curve measures
A nice tool if you don’t know where you want to put the decision threshold until later
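Area under the ROC curve equals the probability that a randomly chosen success outscores a randomly chosen failure (ties counting half), which gives a direct way to compute it. A sketch (the scores and labels below are made up for illustration):

```python
# AUC computed as the fraction of (success, failure) pairs in which the
# success has the higher score; ties count half.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example_auc = auc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 0, 1, 0])
```

Because AUC depends only on the ranking of the scores, it is a natural summary when the decision threshold will be chosen later.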
63
k Nearest Neighbors
Score each observation by vote of the nearest k training points
Traditional for k to be odd
Note that if k=1 then the training error will be 0 by definition
— This is not necessarily a good thing
Cross-validation will give a good estimate of error on a test set, assuming independent observations in the training set
This is very similar to kernel density estimation
— But the neighborhood size is determined by the density of observations
— Within the neighborhood, all observations count equally
[Diagram: two classes, A and B]
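The voting scheme above can be sketched in a few lines of Python (illustrative data; Euclidean distance is one common choice, and the scaling question from the next slide is ignored here):

```python
from collections import Counter
import math

# Classify a point by majority vote of its k nearest training points.

def knn_classify(train, point, k):
    """train: list of (features, label) pairs; point: feature tuple."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
```

With k=1, every training point is its own nearest neighbor, so the training error is 0 by construction — exactly the overfitting trap the slide notes.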
64
k Nearest Neighbors
Advantages
Very flexible … can model almost any decision boundary
Requires no distributional assumptions
65
k Nearest Neighbors
Disadvantages
Computationally painful
— Search entire training set for nearest neighbors for every test point
— There are ways to speed this up, but still slow
Breaks down with large number of predictors (curse of dimensionality)
Too flexible
— Easy to overfit
— Of course, can usually cure this by choosing k large enough
— For census data, k=11 much better than k=1
Need to decide how to scale the axes
— Standardizing variables is not necessarily a sensible solution
— For census data, used one-way tables for categoricals
66
ROC Curve Comparison
[Chart: ROC curves for LDA, logistic regression, naive Bayes (NB), kNN with k=11, and kNN with k=1; both axes run from 0 to 1]
67
Decision Trees
Recursively split the data
Greedy
At each iteration choose split to maximize some measure of significance or purity
Continue until reaching some stopping criterion, e.g.,
— Don’t split nodes smaller than a certain size
— Don’t split nodes with significance less than a certain amount
Prune this back
[Diagram: recursive binary splits partitioning a two-class (A, B) feature space]
68
Decision Trees
CART did very well on misclassification: 14.8%
Probably because I allowed capital gains and losses to be continuous, and it used them: Second split was on capital gains at about $5,000
Area under ROC curve only 89.6%. Suggests that if logistic regression and LDA had been given a few bins for capital gains and losses they might have done better
[Tree: percentage with income over $50K at each node]
All data: 24.7%
— Husband or Wife: 46.3%
– Occupation: Repair, Farming, etc.: 23.5%
– Occupation: Professional, etc.: 62.1%
— Other Relationships: 6.9%
– Capital Gains > $4,700: 81.5%
– Capital Gains < $4,700: 5.0%
69
ROC Curve Comparison
[Chart: ROC curves for LDA, logistic regression, naive Bayes (NB), kNN with k=11, kNN with k=1, and CART; both axes run from 0 to 1]
70
Decision Trees
Some common algorithms
CHAID
CART
C4.5
71
Decision Trees (CHAID)
CHAID (d categories of dependent variable)
Classify predictors as ordinal or categorical
For each categorical (resp., ordinal) predictor, merge the pair (resp., adjacent pair) of categories where the 2 x d contingency table is least significant, if it is not significant at a certain level p
— A missing value can be considered adjacent to any value
— Alternate this with testing whether merged categories can be split at that significance level
If d=2, can treat categorical predictors as ordinal, ordered by the proportion of the first class
— Sum of [(observed − expected)²/expected] is chi-square with (d−1) degrees of freedom
This is just like stepwise regression (using chi-square instead of F tests)
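The merging test can be sketched as a plain Pearson chi-square computation on a 2 x d table (the counts below are illustrative, not census data):

```python
# Pearson chi-square statistic for a contingency table:
# sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.

def chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat  # compare to chi-square with (rows-1)*(cols-1) df

# A pair of predictor categories vs. a 2-class response
stat = chi_square([[30, 70],
                   [50, 50]])
```

CHAID would compare this statistic to a chi-square reference distribution and merge the pair of categories whose table is least significant.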
72
Decision Trees (CHAID)
Eventually, one has determined how to merge the categories for that predictor
If there are c of them, now compute the significance level of the c x d contingency table, which is chi-square with (c−1)(d−1) degrees of freedom
Bonferroni adjustment: multiply this significance level by a penalty for having the best partition into c classes.
Repeat this process for all predictors
Split on the most significant predictor
CHAID as such has a stopping rule but no pruning rule
However, could always allow a generous significance level (to overfit) and then prune as per CART
73
Decision Trees (CART)
Consider all binary splits on all predictors (splits of the form x>a for ordinal variables)
Various criteria for determining the best split; we will focus on the Gini criterion:
Minimize expected misclassification cost
Sum of misclassification costs for each child node
— If the left child node has probabilities 90% A, 5% B, and 5% C, and takes 30% of the observations
— And the right node has probabilities 20% A, 70% B, and 10% C, and takes 70% of the observations
— And the cost of misclassifying an A or C object is 1, but the cost of classifying B as A is 2 and B as C is 3, then
74
Decision Trees (CART)
Sum of misclassification costs for each child node
— The total misclassification cost for the left node is:
– 30% of [90%*5% + 90%*5% + 2*5%*90% + 3*5%*5% + 5%*90% + 5%*5%] = 30% * 23.5% ≈ 7.1%
— Compute for right node similarly and add
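The worked example for the left node can be checked in a few lines (same probabilities and costs as above; the Gini cost sums cost(i classified as j) × p_i × p_j over ordered class pairs):

```python
# Expected misclassification cost of a node under the Gini criterion
# with asymmetric costs.

def node_cost(probs, cost):
    classes = list(probs)
    return sum(cost[(i, j)] * probs[i] * probs[j]
               for i in classes for j in classes if i != j)

probs = {"A": 0.90, "B": 0.05, "C": 0.05}   # left child node
cost = {("A", "B"): 1, ("A", "C"): 1,        # misclassifying an A costs 1
        ("B", "A"): 2, ("B", "C"): 3,        # B as A costs 2, B as C costs 3
        ("C", "A"): 1, ("C", "B"): 1}        # misclassifying a C costs 1

left_cost = 0.30 * node_cost(probs, cost)    # node takes 30% of observations
```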
75
Decision Trees (CART)
Grow an extremely overfit model (large tree)
Determine an order in which to prune back
Score each prune as
— (increase in expected misclassification cost) / (decrease in number of terminal nodes)
— Note that this can be seen as requiring a minimum usefulness for each degree of freedom
Use cross-validation to determine which in the series of pruned trees is the best
77
Hybrid Models
Use nodes of a tree as indicators in logistic regression or LDA
Capture interactions
Use a tree for the initial model and include a linear model within each leaf node
More complex (i.e., presumably the tree splits should try to make nodes as "linear", rather than as "pure", as possible)
Models for adjacent nodes might not glue together nicely
If a variable has important interactions with all others
Consider separate models for each of its levels
E.g., marital status might be a good candidate with the census model
Techniques for Unsupervised Learning
78
79
Some Unsupervised Learning
Cluster Analysis
Hierarchical Clustering
— Agglomerative vs. Divisive
— Similar ideas often used in the creation of rating territories
— In that case there is a specific covariate of interest, namely loss cost
— In other applications, often looking for a "lumpy" area of feature space
– E.g., a segment with enough similar people to advertise to
Prototype clustering
— K-Means
— EM version thereof
— These have similarities to LDA and QDA
80
k-Means
Choose k prototypes at random within the range of the data
Assign each point to the nearest prototype
Move each prototype to the centroid of its class
Repeat until convergence
Rinse and repeat (varying the location of the prototypes and varying k)
Choose "best fit"
Strengths:
— Nothing could be easier
— Flexibility to weight covariates
Weaknesses:
— Flexibility to weight covariates
– DO NOT "normalize" the covariates to std dev of 1
– Better to normalize the RANGE to [−1,1] or [0,1]
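The loop above is short enough to sketch directly. A single run in one dimension for brevity (illustrative data; the "rinse and repeat" over restarts and values of k is omitted):

```python
import random

# Minimal k-means: random prototypes within the data range, assign each
# point to its nearest prototype, move prototypes to cluster centroids,
# repeat.

def k_means(data, k, iters=100, seed=0):
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    protos = [rng.uniform(lo, hi) for _ in range(k)]  # random within range
    for _ in range(iters):
        # assign each point to its nearest prototype
        clusters = [[] for _ in range(k)]
        for x in data:
            jj = min(range(k), key=lambda j: abs(x - protos[j]))
            clusters[jj].append(x)
        # move each prototype to the centroid of its cluster
        protos = [sum(c) / len(c) if c else protos[j]
                  for j, c in enumerate(clusters)]
    return sorted(protos)

centers = k_means([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], k=2)
```

In more than one dimension, the scaling caveat above applies: the distance calculation silently weights the covariates, so how you normalize them matters.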
81
Expectation Maximization
EM is a very useful generic algorithm
This is the simplest possible use
Start with random prototypes μk and random class probabilities πk, and a global variance of σ² in every direction for each cluster
E Step: For each observation x, compute P(x in cluster k)
M Step: For each cluster, using the membership probabilities as weights, re-estimate μk. Once all the means have been estimated, re-estimate σ² from the global data
More complex options would allow elliptical clusters, or for some clusters to be tighter than others
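A one-dimensional sketch of this simplest EM variant — shared global variance σ², cluster means μk, mixing probabilities πk. The data and the initial means are illustrative, and deterministic initialization replaces the random start for reproducibility:

```python
import math

# Simplest EM clustering: spherical clusters sharing one global variance.

def em_cluster(data, mus, iters=50):
    k, n = len(mus), len(data)
    pis = [1.0 / k] * k
    var = 1.0
    for _ in range(iters):
        # E step: membership probabilities for each observation
        resp = []
        for x in data:
            w = [pis[j] * math.exp(-(x - mus[j]) ** 2 / (2 * var))
                 for j in range(k)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M step: re-estimate means and mixing weights, then the
        # global variance from all the data
        for j in range(k):
            wsum = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / wsum
            pis[j] = wsum / n
        var = sum(r[j] * (x - mus[j]) ** 2
                  for r, x in zip(resp, data) for j in range(k)) / n
    return sorted(mus), var

mus, var = em_cluster([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], mus=[1.0, 4.0])
```

Unlike k-means' hard assignments, each point contributes fractionally to every cluster through its membership probabilities; k-means is the limiting case as σ² shrinks to zero.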
82
Useful References
Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Chapman & Hall, 1984
Hand, David J., Construction and Assessment of Classification Rules, Wiley, 1997
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001
Hastie and Tibshirani, Generalized Additive Models, Chapman & Hall, 1990
McCullagh and Nelder, Generalized Linear Models, 2nd ed, Chapman & Hall, 1989
Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science