addressing analytics challenges in the retail and insurance … · 2011-03-21 · addressing...
Addressing Analytics Challenges in
the Insurance Industry
Noe Tuason
California State Automobile
Association
Overview
• Two Challenges:
1. Identifying high/medium-profit prospects with low or high risk of flight in the company’s internal customer database
2. Finding New Factors to Improve Pricing Model
• Methodologies are applicable to financial, retail, and other industries
Identifying High-Profit Prospects with Low/High Risk of
Flight in Our Customer Database
Challenge 1
Segmenting High Profit and Low Risk of Flight Customers
Profitability (Loss Ratio Score) by Risk of Flight

Risk of Flight    High Profit          Medium Profit        Low Profit
Low               1 (Stable)           2 (Stable)           3 (Stable)
High              4 (Likely to Leave)  5 (Likely to Leave)  6 (Likely to Leave)
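The six-cell grid can be written down as a small lookup. This is a minimal sketch: the names `SEGMENTS`, `assign_segment`, and `describe_segment` are illustrative and not from the slides, and it assumes each customer already carries a profit tier and a risk-of-flight label.

```python
# Segment numbers follow the grid: columns are profit tiers,
# rows are risk of flight (Low = Stable, High = Likely to Leave).
SEGMENTS = {
    ("High", "Low"): 1,
    ("Medium", "Low"): 2,
    ("Low", "Low"): 3,
    ("High", "High"): 4,
    ("Medium", "High"): 5,
    ("Low", "High"): 6,
}


def assign_segment(profit_tier, flight_risk):
    """Return the segment number (1-6) for a profit tier and flight-risk label."""
    return SEGMENTS[(profit_tier, flight_risk)]


def describe_segment(segment):
    """Return the grid label for a segment number, e.g. 'High Profit, Stable'."""
    profit = {1: "High", 2: "Medium", 3: "Low"}[(segment - 1) % 3 + 1]
    stability = "Stable" if segment <= 3 else "Likely to Leave"
    return f"{profit} Profit, {stability}"
```

For example, `assign_segment("High", "High")` gives segment 4, which `describe_segment` labels "High Profit, Likely to Leave".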
Methodology for determining Risk of Flight
Logistic Regression Using Insurance Customers Data
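The slide names logistic regression but does not list the predictors or coefficients. The sketch below only illustrates how a fitted logistic model turns customer attributes into a flight probability and a Low/High label. The feature names, coefficient values, and the 0.5 cutoff are hypothetical assumptions, not values from the presentation.

```python
import math


def flight_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def flight_risk(probability, cutoff=0.5):
    """Label a customer High or Low risk of flight (cutoff is an assumption)."""
    return "High" if probability >= cutoff else "Low"


# Hypothetical coefficients, for illustration only.
example_coefficients = {"tenure_years": -0.2, "recent_rate_increase": 1.0}
example_customer = {"tenure_years": 10, "recent_rate_increase": 1}
p = flight_probability(example_customer, example_coefficients, intercept=0.0)
```

With these made-up numbers the linear score is negative, so `p` falls below 0.5 and the customer is labeled Low risk of flight.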
Challenge: Identify the Stable High/Medium Profit customers and the
Likely to Leave High/Medium Profit customers, and differentiate both
from the Low Profit customers in the Prospect Database
Profitability (Loss Ratio Score) by Risk of Flight

Risk of Flight    High Profit          Medium Profit        Low Profit
Low               1 (Stable)           2 (Stable)           3 (Stable)
High              4 (Likely to Leave)  5 (Likely to Leave)  6 (Likely to Leave)
Paradigm for Targeting High/Medium Profit and Low/High Risk of Flight Prospects in the Members’ Database
[Diagram]
Insurance Customers (Model): Insurance Customer Segments; Membership Variables M1, M2, ..., Mn; Demographics P1, P2, ..., Pn
Members Database (Score): Membership Variables M1, M2, ..., Mn; Demographics P1, P2, ..., Pn
For prospecting in
external databases
Differentiating Between the 3 Groups Within the Non-Insureds in the
Prospect Database (AAA Members) Using CART
Draw a sample of 10,000 insureds with segments and append the following
variables for modeling:
• Demographics
• Lifestage
• Membership Variables
• Transaction Variables
Run CART
Segments:
1 High Profit, Stable
2 Medium Profit, Stable
3 Low Profit, Stable
4 High Profit, Likely to Leave
5 Medium Profit, Likely to Leave
6 Low Profit, Likely to Leave
Decision to use CART over Multinomial Logit or
Discriminant Analysis
• Handles discrete or continuous target variable
• No worries about linearity or normality assumptions
• Can handle categorical predictors without need to create dummy variables
• Can treat missing values as a valid category, so no imputation is needed
• Gives surrogate and competitive variables, another way of handling missing values
• Automatic.
• Allows for overgrowing and pruning back. Recommends best tree
• Shows hierarchical interactions and impact of these interactions
• Gives relative importance of variables
• Includes self-validation to avoid overfitting: holdout and n-fold cross-validation
• Alternative splitting criteria depending on structure of data
• Can specify a higher penalty for misclassification, e.g. misclassifying low-risk cases
CART is an acronym for Classification and Regression Trees, a decision-tree
procedure introduced in 1984 by world-renowned UC Berkeley and Stanford
statisticians Leo Breiman, Jerome Friedman, Richard Olshen, and Charles
Stone.
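To make the splitting idea concrete, here is a minimal sketch of how a classification tree can choose one split on a numeric predictor by reducing Gini impurity. It is not the CART software used in the presentation, and the function names and the toy ages and labels are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, impurity reduction) for the best binary split.

    Candidate thresholds are midpoints between adjacent distinct values.
    """
    n = len(values)
    parent = gini(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Toy data, for illustration only.
ages = [25, 30, 35, 50, 60, 70]
outcome = ["leave", "leave", "leave", "stable", "stable", "stable"]
threshold, gain = best_threshold(ages, outcome)
```

A full tree repeats this search over every predictor at every node, overgrows, and then prunes back, as the bullets above describe.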
Variables’ Relative Importance
Variable              Relative Importance
SAMP_AGE$             100.00
LIFETIME_ERS_COUNT$    53.38
WEALTH$                41.58
ETHNCITY$              35.03
INCOME_BRACKET$        34.22
LIFESTAGE$             34.03
LENGTH_RESIDENCE$      33.30
MBS_STATUS$            21.50
EDUCATION$             19.86
GENDER$                14.49
MARITAL$                6.65
MBS_PROGRAM$            4.64
HAS_KIDS$               2.99
% Correct Classification (test-holdout validation)

                                                 Predicted
Actual Class            Total Cases  % Correct   1 (N=334)  2 (N=344)  3 (N=105)
Stable H/M Profits      400          69%         275        98         27
High Risk H/M Profits   317          75%         58         239        20
Low Profit              66           88%         1          7          58

Total: 783
Average: 77%
Overall % Correct: 73%
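The summary figures can be recomputed from the predicted-versus-actual counts. The sketch below uses the counts reported on the slide. The function name `summarize` is illustrative, and it assumes "Average" means the unweighted mean of the three per-class rates.

```python
# Rows: actual class; columns: predicted class 1, 2, 3 (counts from the slide).
confusion = {
    "Stable H/M Profits": [275, 98, 27],
    "High Risk H/M Profits": [58, 239, 20],
    "Low Profit": [1, 7, 58],
}


def summarize(matrix):
    """Return per-class accuracy, their unweighted average, and overall accuracy."""
    per_class = {}
    correct = 0
    total = 0
    for i, (name, row) in enumerate(matrix.items()):
        per_class[name] = row[i] / sum(row)  # diagonal cell over row total
        correct += row[i]
        total += sum(row)
    average = sum(per_class.values()) / len(per_class)
    overall = correct / total
    return per_class, average, overall


per_class, average, overall = summarize(confusion)
```

The average (about 77%) is higher than the overall rate (about 73%) because the small Low Profit class has the highest accuracy and counts equally in the unweighted mean.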
Finding New Factors to Optimize Pricing
Challenge 2
Modeling Problem:*
• Insurance pricing models have different
distributional assumptions, e.g. Poisson,
Gamma, Lognormal, Negative Binomial,
Tweedie, etc.
• Goal is to find one or two factors from over 200
geo-demographic variables that could be
included in the company’s pricing model that
could improve pricing (lower premium without
loss of profit)
*Done for another client, not AAA
Procedures Used:
• SAS PROC VARCLUS (Variable Clustering)
• CART (Initial Variable Selection)
• MARS (Variable Selection, Creation of Functions to
enter into the model)
• SAS PROC GENMOD (Poisson and Gamma
Distribution)
Role that MARS played in my models:
• Multivariate adaptive regression splines (MARS) is a form of
regression analysis introduced by Jerome Friedman in 1991. It is a
non-parametric regression technique and can be seen as an
extension of linear models that automatically models
non-linearities and interactions.
• Accounted for non-linear relationships by creating (basis) functions
for splines (or departures from straight line).
• Handled missing values through a process similar to CART
surrogate splits by identifying alternative basis functions
• Like CART it initially overfits model then prunes away components
that do not hold in the validation process.
• Entered the (basis) functions as predictors in PROC GENMOD
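A MARS basis function is typically a hinge, max(0, x - c) or max(0, c - x), where c is a knot. The sketch below shows how a pair of hinges captures a change in slope and how the resulting values could be used as predictor columns. The knot at 40, the coefficients, and the function names are hypothetical, not taken from the presentation.

```python
def hinge_pair(knot):
    """Return the two hinge basis functions that share one knot."""
    def left(x):
        return max(0.0, knot - x)   # active below the knot

    def right(x):
        return max(0.0, x - knot)   # active above the knot

    return left, right


def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum(coefficient * basis_function(x))."""
    return intercept + sum(coef * basis(x) for coef, basis in terms)


# Hypothetical knot and coefficients, for illustration only.
below_40, above_40 = hinge_pair(40.0)
terms = [(0.5, below_40), (2.0, above_40)]

# Basis-function values for a few ages; these are the kind of columns
# that would be passed to a downstream model as predictors.
ages = [25.0, 40.0, 50.0]
basis_columns = [(below_40(a), above_40(a)) for a in ages]
```

Because each hinge is zero on one side of its knot, the fitted curve can have a different slope on each side while the downstream model stays linear in the basis functions.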
Screenshot of plots illustrating departures from the linearity assumption. Classical
modeling approaches do not account for these departures, which highlights the importance of
the CART/MARS steps in the modeling process flow.
Main Modeling Steps:
• Appended over 200 census-based variables to a
sample of over 100,000 from the insurance database
and kept claims frequencies and premium/loss
information to compute target variables.
• Clustered variables (using SAS PROC VARCLUS) to
explore data structure; reduced the number of variables
to 90
• Ran dataset through CART (Exploratory Regression
Tree) to find relative importance of potential predictors
and check surrogates and competitive variables; noted
variable importance. Target variables (modeled separately)
were Claims Counts and Severity (loss/claim) in dollars
(both continuous)
Main Modeling Steps (cont):
• Ran dataset with 90 variables through MARS and compared
to CART results; selected the final set of variables that CART
and MARS ranked as important, reducing to 15
variables
• Ran MARS on the 15 variables to obtain (Basis) Functions
• Built models using SAS PROC GENMOD with Claims
Frequency and Severity (loss/claim) as targets under different
distributional assumptions and MARS (Basis)
Functions as predictors
• Validated models in holdout samples; final models had
10-15 variables
• Pricing group tested variables with existing factors
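As a rough illustration of the final modeling step, the sketch below fits a Poisson model with a log link, log(mu) = b0 + b1 * x, where x stands for a single basis-function column. It is a plain Newton-Raphson fit written for this example, not PROC GENMOD, and the data values are made up.

```python
import math


def fit_poisson_log_link(x, y, iterations=25):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0 = math.log(sum(y) / len(y))  # start at the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a symmetric 2x2.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up basis-function values and claim counts, for illustration only.
basis_values = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
claim_counts = [0, 1, 1, 2, 2, 4]
b0, b1 = fit_poisson_log_link(basis_values, claim_counts)
fitted = [math.exp(b0 + b1 * v) for v in basis_values]
```

At convergence the fitted counts sum to the observed counts, a property of the Poisson model with a log link and an intercept. A Gamma severity model with a log link has the same linear predictor but a different variance assumption.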
Sample Results: Severity Model (Gamma Dist, Log Link)
[Chart: Predicted and Actual Losses by decile (1 to 10); series: Actual Loss, Predicted Loss]
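A decile comparison like the one in this chart can be built by ranking records on predicted loss, cutting them into ten equal groups, and averaging actual and predicted loss in each group. The sketch below is one way to do that. The function name and the toy numbers are illustrative, and it assumes at least as many records as groups.

```python
def decile_table(predicted, actual, groups=10):
    """Return (group number, mean actual, mean predicted) per group,
    with groups formed by ascending predicted value."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    rows = []
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        mean_actual = sum(actual[i] for i in members) / len(members)
        mean_predicted = sum(predicted[i] for i in members) / len(members)
        rows.append((g + 1, mean_actual, mean_predicted))
    return rows


# Toy data, for illustration only: actual losses are twice the predictions.
predicted_loss = [float(20 - i) for i in range(20)]
actual_loss = [2.0 * p for p in predicted_loss]
rows = decile_table(predicted_loss, actual_loss)
```

A model that validates well shows actual and predicted means tracking each other across the deciles.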
You can use this approach for any linear
modeling, including multiple regression or
logistic regression, which are part
of the family of generalized linear models.