TRANSCRIPT
Discrete Predictive ModelingCasualty Actuarial Society
Special Interest Seminar on Predictive Modeling
Chicago
October 5, 2004
Presented by Christopher Monsour, FCAS, MAAA
© 2004 Towers Perrin
What do other people do with it?
Pattern recognition / image processing
Measuring medical trial outcomes
Direct response modeling
Classification of texts and artifacts on stylistic and physical criteria
Categorization of web pages / organization of information
What good is it in insurance?
Claim frequency / claim occurrence models
Claim closure with or without payment
Response models (direct mail, cross-sale)
Customer retention
Underwriting inspections
Premium audit
Fraud
Topics
Discrete modeling generally
Terminology
Comparison of models
Techniques for supervised learning
Intuition for, rough sketch of technical details
Advantages and disadvantages
Techniques for unsupervised learning
Sketch of a couple of techniques
Goal
Can read the literature
Have confidence to try things
Software packages
Your own unique way of handling a unique challenge
Discrete Modeling — Terminology
Scoring
Often a two class model produces scores
Observations with scores greater than a certain amount are classified to A; the rest to B
The cutoff score can be changed
— e.g., could use the cutoff that gives the lowest misclassification cost
A soft assignment model is a model where these scores can reasonably be interpreted as probabilities
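The cutoff mechanics described above are easy to sketch; a minimal Python illustration (the scores, labels, and costs below are hypothetical, purely to show the idea of tuning the cutoff to the lowest misclassification cost):

```python
# Hypothetical scores and true labels (1 = class A, 0 = class B).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   1,    0,   0]

# Assumed costs: misclassifying A as B costs 5, B as A costs 1.
COST_FN, COST_FP = 5.0, 1.0

def total_cost(cutoff):
    """Total misclassification cost if scores above cutoff go to class A."""
    cost = 0.0
    for s, y in zip(scores, labels):
        pred = 1 if s > cutoff else 0
        if y == 1 and pred == 0:
            cost += COST_FN
        elif y == 0 and pred == 1:
            cost += COST_FP
    return cost

# Sweep candidate cutoffs and keep the cheapest one.
best = min(sorted(set(scores)) + [0.0], key=total_cost)
print(best, total_cost(best))
```

Raising the cost of classifying A as B pushes the optimal cutoff down, classifying more observations to A.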
Some Contrasts
Supervised vs. unsupervised learning
Hard vs. soft assignment
Two vs. many classes
Equal vs. unequal misclassification costs
Assigning class priors (πJ) vs. using the proportions in the data
Training vs. test data
Some Terminology
Confusion Matrix

| Actual Class \ Predicted Class | A | B | C |
|---|---|---|---|
| A | 12,332 | 34 | 322 |
| B | 3,124 | 214,324 | 2,345 |
| C | 312 | 345 | 23,445 |
Some Terminology
Sensitivity and Specificity
If testing for A:
Sensitivity = probability of predicting A when A is true =
1 – false negative rate = 12,332 / (12,332 + 34 + 322)
Specificity = probability of predicting not-A when not-A is true =
1 – false positive rate = (214,324 + 2,345 + 345 + 23,445) /
(3,124 + 214,324 + 2,345 + 312 + 345 + 23,445)
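These two definitions can be checked directly against the confusion matrix on the previous slide; a small Python sketch (the `sensitivity` and `specificity` helpers are hypothetical):

```python
# Confusion matrix from the slide: keys are (actual, predicted) classes.
cm = {
    ("A", "A"): 12_332, ("A", "B"): 34,      ("A", "C"): 322,
    ("B", "A"): 3_124,  ("B", "B"): 214_324, ("B", "C"): 2_345,
    ("C", "A"): 312,    ("C", "B"): 345,     ("C", "C"): 23_445,
}

def sensitivity(cm, cls):
    """P(predict cls | actual cls)."""
    row = {p: n for (a, p), n in cm.items() if a == cls}
    return row[cls] / sum(row.values())

def specificity(cm, cls):
    """P(predict not-cls | actual not-cls)."""
    correct = sum(n for (a, p), n in cm.items() if a != cls and p != cls)
    total = sum(n for (a, p), n in cm.items() if a != cls)
    return correct / total

print(f"Sensitivity for A: {sensitivity(cm, 'A'):.1%}")  # 97.2%
print(f"Specificity for A: {specificity(cm, 'A'):.1%}")  # 98.6%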
Some Terminology (ROC curve)
Changing misclassification costs will often change the sensitivity / specificity tradeoff
Increasing cost of classifying A as not-A will increase sensitivity at the cost of specificity
Receiver-Operating Characteristic curve (ROC curve):
Sensitivity vs. false positive rate (i.e., 1 – specificity)
Allows comparison of several types of model each tuned to various specificities by changing the misclassification costs
Area under the ROC curve is a commonly used comparison (more is better)
As with all tests, comparison should be on test data, not training data
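The area under the ROC curve has a convenient rank interpretation that makes it a one-function sketch in Python (the scores below are made up for illustration):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly chosen class-A (label 1) observation
    outscores a randomly chosen other observation, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

# A model that ranks every positive above every negative has AUC 1;
# the model below gets one of the four positive-negative pairs wrong.
print(roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # 0.75
```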
Some Terminology
Gains Chart
With scores, can vary the number classed as type A continuously; call this x%
Gain = proportion of those classed as A that are A, compared to the proportion in the general population
Gain is ≥ 1 and decreases as one moves to the right (including more quantiles in a mailing, for example); a flat line at 1 is a worthless model
Often used in response modeling: the “gain” vs. random mailing
Some Terminology
Lift Curve
Lift = percent of class A that falls in the first x% of scores
— i.e., sensitivity as a function of quantile
If you take the 20% scored most likely to be in class A, the false positive rate will be less than 20%, so the lift curve is to the right of the ROC curve
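Both quantities can be computed from scored data in a few lines; a sketch (the `lift_and_gain` helper and its sample portfolio are hypothetical):

```python
def lift_and_gain(scores, labels, frac):
    """Lift: share of all class-A members (label 1) captured in the top
    frac of scores. Gain: A-rate among the top frac of scores relative
    to the overall A-rate."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    k = max(1, round(frac * len(ranked)))
    top = [y for _, y in ranked[:k]]
    total_a = sum(labels)
    lift = sum(top) / total_a
    gain = (sum(top) / k) / (total_a / len(labels))
    return lift, gain

# Hypothetical scored portfolio, already sorted for readability:
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
lift, gain = lift_and_gain(scores, labels, 0.2)  # top 20% of scores
print(lift, gain)
```

Here the top 20% of scores captures half of class A (lift 0.5) at 2.5 times the base response rate (gain 2.5).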
Some Terminology
[Chart: ROC, Lift, and Gains curves plotted together. x-axis: false positive rate for ROC, quantiles for others (0 to 1); left y-axis: sensitivity for ROC and Lift (0 to 1); right y-axis: gain (0 to 4.5)]
Discrete Modeling — Comparing Models
How to Compare Models
On Test Data
Hard Assignment
Area under ROC curve
Total cost
Specificity, sensitivity
Prediction accuracy (total costs where all misclassifications have the same cost)
Misclassification rate of the perfect model is the “Bayes rate”
— Can do “better” than the Bayes rate on training data by overfitting
How to Compare Models
On Test Data
Soft Assignment
Likelihood of test data
… plus all of the above
How to Compare Models
On Training Data
Much less straightforward
Approaches:
Cross-validation
Log-likelihood measures penalized for degrees of freedom
— Akaike’s information criterion: penalty = degrees of freedom
— Schwarz’s Bayesian criterion: penalty = 0.5 × ln(#obs) × df
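The two penalties can be sketched as follows. Note the convention: the slide penalizes the log-likelihood directly, while many references report −2× these quantities; the helper and comparison figures below are hypothetical.

```python
import math

def penalized_loglik(loglik, df, n_obs):
    """Penalized log-likelihoods in the slide's convention (penalty
    subtracted from the log-likelihood; higher is better)."""
    aic = loglik - df                           # Akaike: penalty = df
    sbc = loglik - 0.5 * math.log(n_obs) * df   # Schwarz: 0.5 x ln(#obs) x df
    return aic, sbc

# Hypothetical comparison: a small model vs. a larger one on 30,000 obs.
small = penalized_loglik(-5000, 10, 30000)
large = penalized_loglik(-4995, 25, 30000)
# The 5-unit gain in log-likelihood does not pay for 15 extra parameters:
print(small, large)
```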
How to Compare Models
On Training Data
Much less straightforward
Must be careful about measuring degrees of freedom
— Size of space searched, not just number of parameters, is relevant
— Shrinkage is relevant
Thematic Database
Census Data Extract From the UC Irvine Repository
48,842 working adults, split 2/3 train, 1/3 test
45,222 observations with no missing values
Dependent variable: whether income exceeds $50,000 per annum
Note that actual income is not included
Approximately 24% of the workers had incomes exceeding $50,000
Like many UCI databases, it has been thoroughly studied
Thematic Database
Census Data Extract From the UC Irvine Repository
Best cited error rates by the data donors use Naïve Bayes combined with feature subset selection, reported at a test error rate of 14.05%
How much lower than 14.05% do we think the Bayes error rate really is?
Analogous to the question of how much variance in insured loss experience is explainable and how much remains purely random (“noise”) even under the perfect model
Thematic Database
Census Data Extract From the UC Irvine Repository
Independent Variables
Categorical:
— Work-class (e.g., government, private, self)
— Marital status
— Relationship
— Occupation (e.g., clerical, professional, etc.)
— Race
— Sex
— Native country
Thematic Database
Census Data Extract From the UC Irvine Repository
Independent variables
Ordinal: Education
Continuous: Age, hours worked per week, capital gains reported on income tax, capital losses reported on income tax
Data set includes weights
In most of the modeling, I made the mistake of treating capital gain / loss as yes / no
Thematic Database
Census Data Extract From the UC Irvine Repository
Used as an illustration for a broad range of techniques
On a practical problem, try fewer techniques
But spend more time on feature selection
— Transformations of predictors
— Interactions of predictors
— Selection of predictors
Thematic Database
Example From the Data (each cell is the percent of the population)

| Income | Married – Civ-Spouse | Married – AF-Spouse | Divorced | Separated | Never Married | Married – Spouse Absent | Widowed |
|---|---|---|---|---|---|---|---|
| <=50K | 24.8% | 0.0% | 12.2% | 3.2% | 31.6% | 1.2% | 2.3% |
| >50K | 21.0% | 0.0% | 1.5% | 0.2% | 1.6% | 0.1% | 0.2% |
Techniques for Supervised Learning
Global vs. Local
Most Global Model Imaginable: unweighted one-parameter model
— High bias, low variance
— Appropriate if low signal-to-noise ratio
Most Local Model Imaginable: nearest neighbor
— High variance, low bias
— Appropriate if high signal-to-noise ratio
Density Estimation in Classification Problems
A statistical problem in its own right …
… but also a way of handling the classification problem
If you have populations A and B
Estimate the densities fA(x) and fB(x)
Estimate the prior class probabilities πA and πB
Then assign a new observation with coordinates x to the class J that maximizes πJ fJ(x)
The prior probabilities can be taken from the data or from other knowledge
Estimating the densities is the tough part
Density Estimation
Simplest density estimator is a histogram
Can generalize this by a sliding histogram: Height at any one point depends on number of observations within a specified distance
More generally, can use ‘kernel’ functions to take weighted averages, giving more weight to nearer points
[Figure: histogram of the sample data (values ranging from 1.06 to 19.00), bins 0–1 through 18–19]
Density Estimation — Kernels
A “sliding histogram” is a kernel where the weight drops from 1 to 0 at a specified distance
Common choices of kernel (with kernel radius r and object at distance d):
— Epanechnikov: 1 − (d/r)²; minimizes mean square error asymptotically
— Tricube: (1 − (d/r)³)³
Can use a normal distribution, but it has infinite radius … often undesirable
Note that endpoints are a problem; extrapolation is an extreme problem
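The estimators above can be sketched in Python. The `kernel_density` helper is hypothetical; each kernel is given the normalizing constant that makes the estimate integrate to 1 (a detail the slide omits):

```python
def kernel_density(data, x, r, kernel="epanechnikov"):
    """Kernel density estimate at x with radius r. Each entry pairs the
    kernel shape with the constant that normalizes it over [-r, r]."""
    kernels = {
        "sliding":      (lambda u: 1.0,                 0.5),    # sliding histogram
        "epanechnikov": (lambda u: 1 - u**2,            0.75),
        "tricube":      (lambda u: (1 - abs(u)**3)**3,  70 / 81),
    }
    k, c = kernels[kernel]
    # Only points strictly within radius r contribute any weight.
    total = sum(k(abs(x - xi) / r) for xi in data if abs(x - xi) < r)
    return c * total / (len(data) * r)
```

With a single data point at 0 and radius 1, the Epanechnikov estimate peaks at 0.75 and falls to exactly zero at distance 1, unlike a normal kernel.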
Density Estimation — Kernels
[Figure: sliding-histogram density estimate of the sample data, x from 0 to 20]
Density Estimation — Kernels
[Figure: comparison of kernels with radius 1; weight vs. displacement from −1.5 to 1.5 for the sliding histogram, Epanechnikov, and tricube kernels]
Density Estimation — Kernels
[Figure: density estimates of the sample data using the sliding histogram, Epanechnikov, and tricube kernels, x from 0 to 20]
Some Techniques
Naïve Bayes
K-Nearest Neighbor
Discriminant Analysis (Linear, Quadratic …)
Logistic Regression (various links …)
Trees (e.g., CART, CHAID, C4.5)
Naïve Bayes
There are a lot of refinements to naïve Bayes, but the basic idea is very simple, and is also known as “idiot’s Bayes”:
Assume there are no interactions
Model densities univariately
— Use contingency table for discrete predictor
— For continuous predictor, usually bin the variable to make it discrete, but could just as easily use a kernel density estimator
In form, looks like a generalized additive model
— But much faster and simpler to fit
Naïve Bayes
Example
Suppose there is a population of 100 men and 50 women
Of the population, 20 of the women are wearing skirts; the other 130 are wearing pants
Of the population, 20 of the men and 30 of the women have hair shoulder-length or longer
The goal is to predict gender from the other observations
Naïve Bayes assumes that for each class, the densities are the products of the marginals:

Men:

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 80% | 20% | 100% |
| Skirt | 0% | 0% | 0% |
| Total | 80% | 20% | 100% |

Women:

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 24% | 36% | 60% |
| Skirt | 16% | 24% | 40% |
| Total | 40% | 60% | 100% |
Naïve Bayes

Men (100):

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 80% | 20% | 100% |
| Skirt | 0% | 0% | 0% |
| Total | 80% | 20% | 100% |

Women (50):

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 24% | 36% | 60% |
| Skirt | 16% | 24% | 40% |
| Total | 40% | 60% | 100% |

Probability of observed person being male, assuming priors in the data:

| Prob (Male) | Short Hair | Long Hair |
|---|---|---|
| Pants | 87.0% | 52.6% |
| Skirt | 0% | 0% |

Probability of observed person being male, assuming equal priors:

| Prob (Male) | Short Hair | Long Hair |
|---|---|---|
| Pants | 76.9% | 35.7% |
| Skirt | 0% | 0% |
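The probability squares above can be reproduced directly from the marginals; a sketch in Python (the helper is hypothetical, with the marginals hard-coded from the example):

```python
def naive_bayes_prob_male(clothes, hair, prior_male):
    """Naive Bayes posterior P(male | clothes, hair). Marginals from the
    example: 100 men (all pants, 20% long hair); 50 women (40% skirts,
    60% long hair)."""
    p_x_given = {
        "male":   {"pants": 1.0, "skirt": 0.0, "short": 0.8, "long": 0.2},
        "female": {"pants": 0.6, "skirt": 0.4, "short": 0.4, "long": 0.6},
    }
    prior = {"male": prior_male, "female": 1 - prior_male}
    # Naive Bayes: the density for each class is the product of the marginals.
    score = {c: prior[c] * p_x_given[c][clothes] * p_x_given[c][hair]
             for c in prior}
    return score["male"] / (score["male"] + score["female"])

# Priors taken from the data (100 of 150 are men):
print(round(100 * naive_bayes_prob_male("pants", "short", 100 / 150), 1))  # 87.0
print(round(100 * naive_bayes_prob_male("pants", "long",  100 / 150), 1))  # 52.6
# Equal priors:
print(round(100 * naive_bayes_prob_male("pants", "short", 0.5), 1))        # 76.9
print(round(100 * naive_bayes_prob_male("pants", "long",  0.5), 1))        # 35.7
```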
Naïve Bayes
What Naïve Bayes does not take into account is that the second square of data (Women, 50) could actually look like:

|  | Short Hair | Long Hair | Total |
|---|---|---|---|
| Pants | 40% | 20% | 60% |
| Skirt | 0% | 40% | 40% |
| Total | 40% | 60% | 100% |

This would change the resulting probabilities considerably
Naïve Bayes
Advantages
Easy to do
Very easy to interpret
— Just one-way tables put together
Decision boundaries fairly flexible but not completely general
Naïve Bayes
Disadvantages
Sensitive to feature selection
— Easy to double count effects
— For instance, I did not use the variables “relationship” or “sex” in the model for the Census data: too highly correlated with marital status
Can automate feature selection and make Naïve Bayes a good method even on problems with more predictors than observations
Does not handle interactions gracefully
Naïve Bayes on the Census data
Performed rather well for the model at hand
Even without formal feature selection, the misclassification rate is 16.0% and the area under the ROC curve is 89.2%
Naïve Bayes is the simplest example of a Bayesian network
A Bayesian network is a structure that describes conditional independence assumptions (e.g., marital status and occupation might be independent given sex and age)
In general, in a Bayesian network, variables are assumed conditionally independent of their ancestors given their parents
The naïve Bayes assumption is that all characteristics are conditionally independent given the class labels
Discriminant Analysis
How to group things?
Naïve approach:
For each class, take the centroid of the training data for that class
Classify new points to the closest centroid
What’s wrong with this?
Define “close”
Normalizing predictor variables won’t help (much)
— Differences in some may be more important than differences in others
— Some may be strongly correlated
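The naive centroid approach is worth seeing concretely; a Python sketch (the `nearest_centroid` helper and its data are hypothetical), which indeed ignores both the scale and the correlation of the predictors:

```python
def nearest_centroid(train, new_point):
    """Naive approach from the slide: classify a new point to the class
    whose training-data centroid is closest in plain Euclidean distance.
    train maps class label -> list of coordinate tuples."""
    def centroid(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    cents = {cls: centroid(pts) for cls, pts in train.items()}
    return min(cents, key=lambda cls: dist2(cents[cls], new_point))

train = {"A": [(0, 0), (1, 1)], "B": [(4, 4), (5, 5)]}
print(nearest_centroid(train, (1, 2)))  # A
```

Rescaling either coordinate would move the implicit boundary, which is exactly the objection raised above.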
Discriminant Analysis
[Figure: scatter plot of Class A and Class B points with a candidate decision boundary]
Linear Discriminant Analysis (LDA)
Normal distance works well for spherical clusters
To the extent that classes are not spherical, rescale them
Modeling each class with a multivariate normal does three things:
Centers class density at centroid
Accounts for elliptical distribution
Accounts for dispersion of each class
But … tons of parameters to estimate:
If p predictors and k classes, then
k p-dimensional centroids and k p×p covariance matrices
Simplify:
Assume each class has the same covariance matrix
Linear Discriminant Analysis (LDA)
Estimation
Estimate centroids
For each observation (x, J), with class centroid CJ, consider x − CJ
Determine the covariance matrix of the x − CJ
— Easy enough to do one pair of coordinates at a time: covariance is just the average of the product less the product of the averages

Σ = [ σ11 σ12 … ; σ12 σ22 … ; … ]
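The pooled-covariance step can be sketched in Python (the `pooled_covariance` helper is hypothetical); it computes the covariance one pair of coordinates at a time, exactly as described:

```python
def pooled_covariance(train):
    """Covariance matrix of the centered observations x - C_J, pooled
    across classes. train maps class label -> list of coordinate tuples."""
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]
    # Center every observation at its own class centroid.
    dev = []
    for pts in train.values():
        c = centroid(pts)
        dev += [[xi - ci for xi, ci in zip(x, c)] for x in pts]
    n, p = len(dev), len(dev[0])
    mean = [sum(d[j] for d in dev) / n for j in range(p)]
    # Average of the product less the product of the averages.
    return [[sum(d[i] * d[j] for d in dev) / n - mean[i] * mean[j]
             for j in range(p)] for i in range(p)]
```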
Linear Discriminant Analysis (LDA)
The result is called linear discriminant analysis because the decision boundary will be linear (in fact, a hyperplane)
Why?
Because a linear transformation will make the ellipsoids into spheres, and when the clusters are spheres we know the boundary is a hyperplane

Σ = [ σ11 σ12 … ; σ12 σ22 … ; … ]
Linear Discriminant Analysis (LDA) Virtues
There are really fewer degrees of freedom than it appears
Decision surface is a hyperplane in predictor space, so only p+1 degrees of freedom for 2 class problem if p is the number of predictors
The decision surface for a 2 class problem is the same as that resulting from linear regression
— Thus, it is not silly to apply linear regression to 2 class problems
LDA with level curves of densities
[Figure: level curves of the class A and class B densities, with the linear decision boundary between them]
Quadratic Discriminant Analysis (QDA)
If you have a ton of data, you can try estimating each covariance matrix separately
Not only a lot of parameters to estimate
… but also more sensitive than LDA to non-normality
Harder to interpret … decision surface is not linear
Poor method if any class has few representatives, no matter how huge the data set
Quadratic Discriminant Analysis (QDA)
LDA and QDA outlines assume the data were labeled as three groups
[Figure: sliding histogram, Epanechnikov kernel, and tricube kernel estimates with LDA and QDA density outlines, x from 0 to 20]
Quadratic Discriminant Analysis (QDA)
LDA and QDA outlines assume the data were labeled as three groups
[Figure: as above, with the actual density overlaid on the sliding histogram, kernel, LDA, and QDA densities]
Regularized Discriminant Analysis (RDA)
How to get the right amount of flexibility
Average local covariance estimate with global one (typical actuarial thing to do)
Two Types of Averages Suggested by Friedman:
Average class covariance matrices with the grand mean:
ΣJ,λ = λ ΣJ + (1 − λ) Σ̄
Choose λ by whatever produces the best fit
Ideally in terms of cross-validation
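Friedman's first average is a one-liner per matrix entry; a Python sketch (the helper is hypothetical, with λ written as `lam`):

```python
def rda_covariance(cov_class, cov_pooled, lam):
    """Credibility-weight a class covariance matrix against the pooled
    one: Sigma_J(lam) = lam * Sigma_J + (1 - lam) * Sigma_pooled.
    lam = 1 recovers QDA; lam = 0 recovers LDA."""
    p = len(cov_class)
    return [[lam * cov_class[i][j] + (1 - lam) * cov_pooled[i][j]
             for j in range(p)] for i in range(p)]

# Halfway between a class covariance of 2I and a pooled covariance of I:
print(rda_covariance([[2, 0], [0, 2]], [[1, 0], [0, 1]], 0.5))
```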
Regularized Discriminant Analysis (RDA)
How to get the right amount of flexibility
Average local covariance estimate with global one (typical actuarial thing to do)
Two Types of Averages Suggested by Friedman:
Average the resulting covariance matrices with a scalar multiple of the identity
Choose the scalar multiple to have the same trace as ΣJ,λ
Scaling of predictors suddenly matters
Be careful with this if you have collinearity, since this assumes there isn’t much collinearity
Discriminant Analysis
Other ways to reduce degrees of freedom without resorting to LDA
Feature selection … drop the features with the least influence on the discriminant function
Other types of averaging of covariance matrices:
— For example, can insist that the covariance matrices are the same up to a scalar multiple
— Multiply the pooled covariance matrix by a credibility weighting of the class and pooled determinants
In the Census example, I tried Friedman’s two types of regularization. The first (average of LDA and QDA) worked best
Discriminant Analysis on Census Data
Quick and dirty
Recoded categorical variables to category means
Did not look for interactions or transformations
LDA, QDA
Friedman’s two types of regularization. The first (average of LDA and QDA) worked best
LDA misclassification of 15.9%, area under ROC curve 89.6%
RDA with 40% weight to pooled and 60% weight to unpooled has misclassification rate of 15.8% and area under ROC curve of 89.7%
Logistic Regression
Generalized Linear Model: dependent variable is conditionally Bernoulli (0 or 1)
Note that you cannot think of this as "Bernoulli errors"
Various ways of handling more than 2 classes
If h is the [inverse] link function, then modeled probability of success given x1, … , xn is
h(b0+b1x1+b2x2+ … +bnxn)
Note that this always gives a decision boundary linear in the xi
Choices for [inverse] link function h(z):
— Cumulative normal (also called "probit" regression)
— Logistic function: e^z/(1+e^z)
— Complementary log-log: 1−exp(−exp(z))
— Log-log: exp(−exp(−z))
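The four choices can be written out directly. A quick Python sketch (not from the presentation, standard library only):

```python
import math

# The four inverse link functions listed above, each mapping a linear
# score z to a probability in (0, 1).

def probit(z):     # cumulative normal, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z):   # e^z / (1 + e^z)
    return math.exp(z) / (1.0 + math.exp(z))

def cloglog(z):    # complementary log-log: 1 - exp(-exp(z))
    return 1.0 - math.exp(-math.exp(z))

def loglog(z):     # log-log: exp(-exp(-z))
    return math.exp(-math.exp(-z))
```

Probit and logistic are symmetric about z = 0 (both give 0.5 there); the two log-log variants are not, which is why each is appropriate only when one of the two classes is rare.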
57
Logistic Regression
Logistic Link
Use this one unless you have a reason to do otherwise
Can interpret b0+b1x1+b2x2+ … +bnxn as the log of the odds of success
Note: probability of success = odds of success / (1 + odds of success)
Effectively a multiplicative model for the odds
Easy to interpret the bi
Allows for “retrospective” or stratified sampling
— Because sampling does not change the relative odds
— So it won’t bias the answer … it just changes the intercept
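A small sketch of the two facts above: probability = odds/(1+odds), and oversampling successes by a factor s multiplies every observation's odds by the same s, so only the intercept shifts (by log s). The values here are illustrative, not from the presentation:

```python
import math

# Converting a log-odds score to a probability, and the intercept shift
# under retrospective (stratified) sampling.

def prob_from_log_odds(score):
    odds = math.exp(score)        # multiplicative model for the odds
    return odds / (1.0 + odds)    # probability = odds / (1 + odds)

# Oversampling successes by a factor s multiplies every observation's
# odds by s, i.e., adds log(s) to every log-odds score. The slopes
# b1..bn are untouched; only b0 absorbs the shift.
s = 10.0
score = 0.7                       # hypothetical b0 + b1*x1 + ... for one record
shifted = score + math.log(s)
```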
58
Logistic Regression
Logistic Link
Logistic does not like to predict pure answers (predictions near 0 or 1)
Probit loves to do this
Logistic preferable if there's "always a chance" that anything might happen
Complementary log-log looks very similar to logistic for rare classes
— Not appropriate if successes are common
Log-log not appropriate if failures are common
59
Logistic Regression
[Chart: the four inverse link functions — probit, logit, complementary log-log, log-log — plotted for z from −6 to 6, with probabilities from 0 to 1]
60
Logistic Regression
Advantages
Scores interpretable in terms of log odds
Constructed probabilities have a chance of making sense
Modeled directly rather than as ratio of two densities
A good “default” tool to use when appropriate, especially combined with feature creation and selection
61
Logistic Regression
Disadvantages
Invites over-interpretation of parameters
For example, if a 10% rate increase
— Causes lapse rates for customers under age 30 to increase from 15% to 20%
— Causes lapse rates for customers 30 and over to increase from 5% to 10%,
Then logistic regression says the older customers are more price sensitive
— Their odds of lapse increased by a factor of 19/9
— The young customers' odds of lapse increased by a factor of 17/12
Doesn't generalize to 3+ classes as painlessly as LDA
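The arithmetic behind the 19/9 and 17/12 figures, worked numerically:

```python
# Odds = p / (1 - p); "price sensitivity" here is the ratio of
# after-increase odds to before-increase odds.

def odds(p):
    return p / (1.0 - p)

young_factor = odds(0.20) / odds(0.15)   # under-30 lapse: 15% -> 20%
older_factor = odds(0.10) / odds(0.05)   # 30-and-over lapse: 5% -> 10%
```

The older group's lapse probability doubled while the younger group's rose by only a third, yet on the odds scale the older group's factor (19/9 ≈ 2.11) exceeds the younger group's (17/12 ≈ 1.42) — the over-interpretation risk the slide warns about.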
62
Logistic Regression on the Census data
Quick and dirty
Recoded categorical variables to % with over 50K income in each category (instead of using indicators for each level)
Did not look for interactions or transformations
Misclassification rate of 15.8%
Area under ROC curve of 89.8%
Generally speaking, logistic regression does well on area-under-the-ROC-curve measures
A nice tool if you don’t know where you want to put the decision threshold until later
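Area under the ROC curve equals the probability that a randomly chosen success outscores a randomly chosen failure (ties counting half), which gives a direct way to compute it. A sketch (the scores and labels below are made up for illustration):

```python
# AUC computed as the fraction of (success, failure) pairs in which the
# success has the higher score; ties count half.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example_auc = auc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 0, 1, 0])
```

Because AUC depends only on the ranking of the scores, it is a natural summary when the decision threshold will be chosen later.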
63
k Nearest Neighbors
Score each observation by vote of the nearest k training points
Traditional for k to be odd
Note that if k=1 then the training error will be 0 by definition
— This is not necessarily a good thing
Cross-validation will give a good estimate of error on a test set, assuming independent observations in the training set
This is very similar to kernel density estimation
— But the neighborhood size is determined by the density of observations
— Within the neighborhood, all observations count equally
[Diagram: two classes, A and B]
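The voting scheme above can be sketched in a few lines of Python (illustrative data; Euclidean distance is one common choice, and the scaling question from the next slide is ignored here):

```python
from collections import Counter
import math

# Classify a point by majority vote of its k nearest training points.

def knn_classify(train, point, k):
    """train: list of (features, label) pairs; point: feature tuple."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
```

With k=1, every training point is its own nearest neighbor, so the training error is 0 by construction — exactly the overfitting trap the slide notes.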
64
k Nearest Neighbors
Advantages
Very flexible … can model almost any decision boundary
Requires no distributional assumptions
65
k Nearest Neighbors
Disadvantages
Computationally painful
— Search entire training set for nearest neighbors for every test point
— There are ways to speed this up, but still slow
Breaks down with large number of predictors (curse of dimensionality)
Too flexible
— Easy to overfit
— Of course, can usually cure this by choosing k large enough
— For census data, k=11 much better than k=1
Need to decide how to scale the axes
— Standardizing variables is not necessarily a sensible solution
— For census data, used one-way tables for categoricals
66
ROC Curve Comparison
[Chart: ROC curves for LDA, logistic regression, naive Bayes (NB), kNN with k=11, and kNN with k=1; both axes run from 0 to 1]
67
Decision Trees
Recursively split the data
Greedy
At each iteration choose split to maximize some measure of significance or purity
Continue until reaching some stopping criterion, e.g.,
— Don’t split nodes smaller than a certain size
— Don’t split nodes with significance less than a certain amount
Prune this back
[Diagram: recursive binary splits partitioning a two-class (A, B) feature space]
68
Decision Trees
CART did very well on misclassification: 14.8%
Probably because I allowed capital gains and losses to be continuous, and it used them: Second split was on capital gains at about $5,000
Area under ROC curve only 89.6%. Suggests that if logistic regression and LDA had been given a few bins for capital gains and losses they might have done better
[Tree: percentage with income over $50K at each node]
All data: 24.7%
— Husband or Wife: 46.3%
– Occupation: Repair, Farming, etc.: 23.5%
– Occupation: Professional, etc.: 62.1%
— Other Relationships: 6.9%
– Capital Gains > $4,700: 81.5%
– Capital Gains < $4,700: 5.0%
69
ROC Curve Comparison
[Chart: ROC curves for LDA, logistic regression, naive Bayes (NB), kNN with k=11, kNN with k=1, and CART; both axes run from 0 to 1]
70
Decision Trees
Some common algorithms
CHAID
CART
C4.5
71
Decision Trees (CHAID)
CHAID (d categories of dependent variable)
Classify predictors as ordinal or categorical
For each categorical (resp., ordinal) predictor, merge the pair (resp., adjacent pair) of categories where the 2 x d contingency table is least significant, if it is not significant at a certain level p
— A missing value can be considered adjacent to any value
— Alternate this with testing whether merged categories can be split at that significance level
If d=2, can treat categorical predictors as ordinal, ordered by the proportion of the first class
— Sum of [(observed − expected)²/expected] is chi-square with (d−1) degrees of freedom
This is just like stepwise regression (using chi-square instead of F tests)
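The merging test can be sketched as a plain Pearson chi-square computation on a 2 x d table (the counts below are illustrative, not census data):

```python
# Pearson chi-square statistic for a contingency table:
# sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.

def chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat  # compare to chi-square with (rows-1)*(cols-1) df

# A pair of predictor categories vs. a 2-class response
stat = chi_square([[30, 70],
                   [50, 50]])
```

CHAID would compare this statistic to a chi-square reference distribution and merge the pair of categories whose table is least significant.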
72
Decision Trees (CHAID)
Eventually, one has determined how to merge the categories for that predictor
If there are c of them, now compute the significance level of the c x d contingency table, which is chi-square with (c−1)(d−1) degrees of freedom
Bonferroni adjustment: multiply this significance level by a penalty for having the best partition into c classes.
Repeat this process for all predictors
Split on the most significant predictor
CHAID as such has a stopping rule but no pruning rule
However, could always allow a generous significance level (to overfit) and then prune as per CART
73
Decision Trees (CART)
Consider all binary splits on all predictors (splits of the form x>a for ordinal variables)
Various criteria for determining the best split; we will focus on the Gini criterion:
Minimize expected misclassification cost
Sum of misclassification costs for each child node
— If the left child node has probabilities 90% A, 5% B, and 5% C, and takes 30% of the observations
— And the right node has probabilities 20% A, 70% B, and 10% C, and takes 70% of the observations
— And the cost of misclassifying an A or C object is 1, but the cost of classifying B as A is 2 and B as C is 3, then
74
Decision Trees (CART)
Sum of misclassification costs for each child node
— The total misclassification cost for the left node is:
– 30% of [90%*5% + 90%*5% + 2*5%*90% + 3*5%*5% + 5%*90% + 5%*5%] = 30% * 23.5% ≈ 7.1%
— Compute for right node similarly and add
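The worked example for the left node can be checked in a few lines (same probabilities and costs as above; the Gini cost sums cost(i classified as j) × p_i × p_j over ordered class pairs):

```python
# Expected misclassification cost of a node under the Gini criterion
# with asymmetric costs.

def node_cost(probs, cost):
    classes = list(probs)
    return sum(cost[(i, j)] * probs[i] * probs[j]
               for i in classes for j in classes if i != j)

probs = {"A": 0.90, "B": 0.05, "C": 0.05}   # left child node
cost = {("A", "B"): 1, ("A", "C"): 1,        # misclassifying an A costs 1
        ("B", "A"): 2, ("B", "C"): 3,        # B as A costs 2, B as C costs 3
        ("C", "A"): 1, ("C", "B"): 1}        # misclassifying a C costs 1

left_cost = 0.30 * node_cost(probs, cost)    # node takes 30% of observations
```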
75
Decision Trees (CART)
Grow an extremely overfit model (large tree)
Determine an order in which to prune back
Score each prune as
— (increase in expected misclassification cost) / (decrease in number of terminal nodes)
— Note that this can be seen as requiring a minimum usefulness for each degree of freedom
Use cross-validation to determine which in the series of pruned trees is the best
77
Hybrid Models
Use nodes of a tree as indicators in logistic regression or LDA
Capture interactions
Use a tree for the initial model and include a linear model within each leaf node
More complex (i.e., presumably the tree splits should try to make nodes as "linear", rather than as "pure", as possible)
Models for adjacent nodes might not glue together nicely
If a variable has important interactions with all others
Consider separate models for each of its levels
E.g., marital status might be a good candidate with the census model
Techniques for Unsupervised Learning
78
79
Some Unsupervised Learning
Cluster Analysis
Hierarchical Clustering
— Agglomerative vs. Divisive
— Similar ideas often used in the creation of rating territories
— In that case there is a specific covariate of interest, namely loss cost
— In other applications, often looking for a "lumpy" area of feature space
– E.g., a segment with enough similar people to advertise to
Prototype clustering
— K-Means
— EM version thereof
— These have similarities to LDA and QDA
80
k-Means
Choose k prototypes at random within the range of the data
Assign each point to the nearest prototype
Move each prototype to the centroid of its class
Repeat until convergence
Rinse and repeat (varying the location of the prototypes and varying k)
Choose "best fit"
Strengths:
— Nothing could be easier
— Flexibility to weight covariates
Weaknesses:
— Flexibility to weight covariates
– DO NOT "normalize" the covariates to std dev of 1
– Better to normalize the RANGE to [−1,1] or [0,1]
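The loop above is short enough to sketch directly. A single run in one dimension for brevity (illustrative data; the "rinse and repeat" over restarts and values of k is omitted):

```python
import random

# Minimal k-means: random prototypes within the data range, assign each
# point to its nearest prototype, move prototypes to cluster centroids,
# repeat.

def k_means(data, k, iters=100, seed=0):
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    protos = [rng.uniform(lo, hi) for _ in range(k)]  # random within range
    for _ in range(iters):
        # assign each point to its nearest prototype
        clusters = [[] for _ in range(k)]
        for x in data:
            jj = min(range(k), key=lambda j: abs(x - protos[j]))
            clusters[jj].append(x)
        # move each prototype to the centroid of its cluster
        protos = [sum(c) / len(c) if c else protos[j]
                  for j, c in enumerate(clusters)]
    return sorted(protos)

centers = k_means([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], k=2)
```

In more than one dimension, the scaling caveat above applies: the distance calculation silently weights the covariates, so how you normalize them matters.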
81
Expectation Maximization
EM is a very useful generic algorithm
This is the simplest possible use
Start with random prototypes μk and random class probabilities πk, and a global variance of σ² in every direction for each cluster
E Step: For each observation x, compute P(x in cluster k)
M Step: For each cluster, using the membership probabilities as weights, re-estimate μk. Once all the means have been estimated, re-estimate σ² from the global data
More complex options would allow elliptical clusters, or for some clusters to be tighter than others
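A one-dimensional sketch of this simplest EM variant — shared global variance σ², cluster means μk, mixing probabilities πk. The data and the initial means are illustrative, and deterministic initialization replaces the random start for reproducibility:

```python
import math

# Simplest EM clustering: spherical clusters sharing one global variance.

def em_cluster(data, mus, iters=50):
    k, n = len(mus), len(data)
    pis = [1.0 / k] * k
    var = 1.0
    for _ in range(iters):
        # E step: membership probabilities for each observation
        resp = []
        for x in data:
            w = [pis[j] * math.exp(-(x - mus[j]) ** 2 / (2 * var))
                 for j in range(k)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M step: re-estimate means and mixing weights, then the
        # global variance from all the data
        for j in range(k):
            wsum = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / wsum
            pis[j] = wsum / n
        var = sum(r[j] * (x - mus[j]) ** 2
                  for r, x in zip(resp, data) for j in range(k)) / n
    return sorted(mus), var

mus, var = em_cluster([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], mus=[1.0, 4.0])
```

Unlike k-means' hard assignments, each point contributes fractionally to every cluster through its membership probabilities; k-means is the limiting case as σ² shrinks to zero.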
82
Useful References
Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Chapman & Hall, 1984
Hand, David J., Construction and Assessment of Classification Rules, Wiley, 1997
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001
Hastie and Tibshirani, Generalized Additive Models, Chapman & Hall, 1990
McCullagh and Nelder, Generalized Linear Models, 2nd ed, Chapman & Hall, 1989
Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science