addressing analytics challenges in the retail and insurance … · 2011-03-21 · addressing...

Addressing Analytics Challenges in the Insurance Industry

Noe Tuason

California State Automobile Association

Overview

• Two Challenges:

1. Identifying high/medium-profit prospects at high or low risk of flight in the company’s internal customer database

2. Finding New Factors to Improve Pricing Model

• Methodologies are applicable to financial, retail, and other industries

Identifying High-Profit Prospects at Low/High Risk of Flight in Our Customer Database

Challenge 1

Segmenting High Profit and Low Risk of Flight Customers

Profitability (Loss Ratio Score) by Risk of Flight

Risk of Flight   High Profitability                 Medium Profitability                 Low Profitability
Low              1: High Profit, Stable             2: Medium Profit, Stable             3: Low Profit, Stable
High             4: High Profit, Likely to Leave    5: Medium Profit, Likely to Leave    6: Low Profit, Likely to Leave
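The numbering of the six segments can be written as a small lookup. This is only an illustrative sketch in Python: the function name, the tier labels, and the boolean flag are assumptions, not part of the original analysis.

```python
# Segment numbers follow the grid on this slide: columns are profitability
# tiers, rows are risk of flight. All names here are illustrative.
PROFIT_TIERS = ("high", "medium", "low")

def assign_segment(profit_tier: str, high_flight_risk: bool) -> int:
    """Return the segment number (1-6) for a customer."""
    column = PROFIT_TIERS.index(profit_tier.lower())  # 0, 1, or 2
    row_offset = 3 if high_flight_risk else 0         # second row starts at 4
    return column + 1 + row_offset
```

For example, a medium-profit customer who is likely to leave falls in segment 5.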

Methodology for determining Risk of Flight

Logistic Regression Using Insurance Customers Data
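The slide names logistic regression but does not list the predictors or the software used. As a rough sketch only, the Python below fits a logistic model by batch gradient descent on invented toy data; the two predictors (`years_as_customer`, `prior_year_claims`), the learning rate, and the data values are all assumptions made for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=5000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))
            err = p - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def flight_risk(weights, x) -> float:
    """Predicted probability that a customer leaves."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

# Invented toy data: [years_as_customer, prior_year_claims]; label 1 = left.
rows = [[1, 2], [2, 1], [1, 3], [8, 0], [10, 1], [7, 0]]
labels = [1, 1, 1, 0, 0, 0]
weights = fit_logistic(rows, labels)
```

Customers whose predicted probability exceeds a chosen cutoff would be placed in the High risk-of-flight row; the cutoff itself is a business decision that the slides do not specify.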

Challenge: Identify and differentiate the Stable High/Medium Profit customers and the Likely to Leave High/Medium Profit customers from the Low Profit customers in the prospect database

Profitability (Loss Ratio Score) by Risk of Flight

Risk of Flight   High Profitability                 Medium Profitability                 Low Profitability
Low              1: High Profit, Stable             2: Medium Profit, Stable             3: Low Profit, Stable
High             4: High Profit, Likely to Leave    5: Medium Profit, Likely to Leave    6: Low Profit, Likely to Leave

Paradigm for Targeting High/Medium Profit and Low/High Risk of Flight Prospects in the Members’ Database

[Diagram]

Insurance Customers (Model): Insurance Customer Segments; Membership Variables M1, M2, ..., Mn; Demographics P1, P2, ..., Pn

Members Database (Score): Membership Variables M1, M2, ..., Mn; Demographics P1, P2, ..., Pn

For prospecting in external databases

Differentiating Between the 3 Groups Within the Non-Insureds in the Prospect Database (AAA Members) Using CART

Drew a sample of 10,000 insureds with segments and appended the following variables for modeling:

• Demographics

• Lifestage

• Membership Variables

• Transaction Variables

Run CART

1: High Profit, Stable      4: High Profit, Likely to Leave

2: Medium Profit, Stable    5: Medium Profit, Likely to Leave

3: Low Profit, Stable       6: Low Profit, Likely to Leave

Decision to use CART over Multinomial Logit or Discriminant Analysis

• Handles discrete or continuous target variable

• No worries about linearity or normality assumptions

• Can handle categorical predictors without need to create dummy variables

• Can use missing values as a valid category, so no imputation is needed

• Gives surrogate and competitor variables, another way of handling missing values

• Automatic

• Allows for overgrowing and pruning back; recommends the best tree

• Shows hierarchical interactions and impact of these interactions

• Gives relative importance of variables

• Includes self-validation to avoid overfitting: holdout and n-fold cross-validation

• Alternative splitting criteria depending on structure of data

• Can specify a higher penalty for misclassification, e.g. misclassifying low-risk cases

CART is an acronym for Classification and Regression Trees, a decision-tree procedure introduced in 1984 by world-renowned UC Berkeley and Stanford statisticians Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone.
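To give a feel for how a classification tree picks a split, here is a minimal sketch of one splitting step using Gini impurity, one of the splitting criteria available in tree methods. It is not the CART software used in this work: pruning, surrogates, and misclassification costs are omitted, and the ages and labels are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split x <= t."""
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Invented toy data: age as the single predictor.
ages = [22, 25, 31, 47, 52, 60]
segments = ["leave", "leave", "leave", "stable", "stable", "stable"]
threshold, impurity = best_threshold(ages, segments)
```

A full tree repeats this search over every predictor at every node, grows past the useful depth, and then prunes back against validation data.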

Variables’ Relative Importance

Variable

SAMP_AGE$ 100.00 ||||||||||||||||||||||||||||||||||||||||||

LIFETIME_ERS_COUNT$ 53.38 ||||||||||||||||||||||

WEALTH$ 41.58 |||||||||||||||||

ETHNCITY$ 35.03 ||||||||||||||

INCOME_BRACKET$ 34.22 ||||||||||||||

LIFESTAGE$ 34.03 ||||||||||||||

LENGTH_RESIDENCE$ 33.30 |||||||||||||

MBS_STATUS$ 21.50 ||||||||

EDUCATION$ 19.86 ||||||||

GENDER$ 14.49 |||||

MARITAL$ 6.65 ||

MBS_PROGRAM$ 4.64 |

HAS_KIDS$ 2.99

% Correct Classification (test-holdout validation)

                                                        Predicted Class
Actual Class             Total Cases   Percent Correct   1 (N=334)   2 (N=344)   3 (N=105)
Stable H/M Profits       400           69%               275         98          27
High Risk H/M Profits    317           75%               58          239         20
Low Profit               66            88%               1           7           58

Total: 783

Average: 77%

Overall % Correct: 73%
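The summary percentages follow directly from the counts in the classification table. The short Python sketch below recomputes them; the counts are taken from the table, while the variable names are only illustrative.

```python
# Rows are actual classes; each list holds counts for predicted class 1, 2, 3.
confusion = {
    "Stable H/M Profits":    [275, 98, 27],
    "High Risk H/M Profits": [58, 239, 20],
    "Low Profit":            [1, 7, 58],
}

per_class = {}
correct = total = 0
for i, (name, row) in enumerate(confusion.items()):
    per_class[name] = row[i] / sum(row)  # diagonal count / row total
    correct += row[i]
    total += sum(row)

average_pct = 100 * sum(per_class.values()) / len(per_class)  # unweighted mean
overall_pct = 100 * correct / total                           # all correct / all cases
```

The average (77%) weights the three classes equally, while the overall figure (73%) weights by class size, which is why the small but well-classified Low Profit class lifts the average more than the overall rate.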

Finding New Factors to Optimize Pricing

Challenge 2

Modeling Problem:*

• Insurance pricing models have different distributional assumptions, e.g. Poisson, Gamma, Lognormal, Negative Binomial, Tweedie, etc.

• Goal is to find one or two factors from over 200 geo-demographic variables that could be included in the company’s pricing model to improve pricing (lower premium without loss of profit)

*Done for another client, not AAA

Procedures Used:

• SAS PROC VARCLUS (Variable Clustering)

• CART (Initial Variable Selection)

• MARS (Variable Selection, Creation of Functions to enter into the model)

• SAS PROC GENMOD (Poisson and Gamma Distribution)

Role that MARS played in my models:

• Multivariate adaptive regression splines (MARS) is a form of regression analysis introduced by Jerome Friedman in 1991. It is a non-parametric regression technique and can be seen as an extension of linear models that automatically models non-linearities and interactions.

• Accounted for non-linear relationships by creating (basis) functions for splines (or departures from a straight line).

• Handled missing values through a process similar to CART surrogate splits, by identifying alternative basis functions.

• Like CART, it initially overfits the model, then prunes away components that do not hold up in the validation process.

• Entered the (basis) functions as predictors in PROC GENMOD.
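The basis functions MARS builds are hinge functions of the form max(0, x - t) and max(0, t - x) around a knot t. The Python sketch below shows one mirrored pair and how their values become predictor columns; the knot of 40 and the age values are invented, since in practice MARS selects the knots itself.

```python
def hinge_pair(knot: float):
    """Return the two mirrored hinge functions for a knot."""
    def right(x: float) -> float:
        return max(0.0, x - knot)  # non-zero only above the knot
    def left(x: float) -> float:
        return max(0.0, knot - x)  # non-zero only below the knot
    return right, left

bf1, bf2 = hinge_pair(40.0)  # hypothetical knot, chosen for illustration
ages = [25.0, 40.0, 55.0]

# Each row holds the basis-function values that would enter a downstream
# model as predictors in place of the raw variable.
design = [[bf1(a), bf2(a)] for a in ages]
# design == [[0.0, 15.0], [0.0, 0.0], [15.0, 0.0]]
```

Because each hinge is zero on one side of the knot, a linear model on these columns can have a different slope below and above the knot, which is how the departure from a straight line is captured.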

Screenshot of plots to illustrate departures from linearity assumptions. These departures are not accounted for by classical modeling approaches, which highlights the importance of the CART/MARS steps in the modeling process flow.

Main Modeling Steps:

• Appended over 200 census-based variables to a sample of over 100,000 from the insurance database and kept claims frequencies and premium/loss information to compute target variables.

• Clustered variables (using SAS PROC VARCLUS) to explore data structure; reduced number of variables to 90.

• Ran dataset through CART (exploratory regression tree) to find relative importance of potential predictors and check surrogates and competitor variables; noted variable importance. Target variables (modeled separately) were Claims Counts and Severity (loss/claim) in dollars (both continuous).

Main Modeling Steps (cont):

• Ran dataset with 90 variables through MARS and compared to CART results; selected the final set of variables that both CART and MARS ranked as important, reducing to 15 variables.

• Ran MARS on the 15 variables to obtain (basis) functions.

• Built models using SAS PROC GENMOD with Claims Frequency and Severity (loss/claim) as targets under different distributional assumptions and MARS (basis) functions as predictors.

• Validated models in holdout samples; final models had 10-15 variables.

• Pricing group tested variables with existing factors.
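The final models were built in SAS PROC GENMOD. As a stand-in illustration only, the Python sketch below fits the simplest version of the frequency model: a Poisson regression with log link, an intercept, and a single basis-function predictor, estimated by Newton's method. The data values and all names are invented, and this is not the procedure's actual implementation.

```python
import math

def fit_poisson(x, y, iters=25):
    """Fit log(mu) = b0 + b1 * x by Newton's method on the Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: X'(y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix: X' diag(mu) X (2 x 2, inverted by hand)
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Invented toy data: a hinge basis-function value and a claim count per policy.
bf_values = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
claims = [1, 0, 2, 1, 2, 3, 5, 8]
b0, b1 = fit_poisson(bf_values, claims)
```

A severity model would follow the same pattern with a Gamma distribution and log link, and a real model would carry several basis functions as predictors rather than one.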

Sample Results: Severity Model (Gamma Dist, Log Link)

[Chart: Predicted and Actual Losses by decile (1 to 10); series: Actual Loss, Predicted Loss]
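A decile comparison like this one can be produced by ranking records on predicted loss, cutting them into ten equal groups, and averaging predicted and actual loss within each group. The Python sketch below assumes that ranking is by predicted loss, which the slide does not state; the function name and data are invented for illustration.

```python
def decile_table(predicted, actual, bins=10):
    """Rank by predicted loss, cut into equal-sized bins, average each bin.

    Assumes there are at least `bins` records.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    size = len(order) // bins
    table = []
    for d in range(bins):
        # The last bin absorbs any remainder rows.
        idx = order[d * size:] if d == bins - 1 else order[d * size:(d + 1) * size]
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        mean_act = sum(actual[i] for i in idx) / len(idx)
        table.append((d + 1, mean_pred, mean_act))
    return table

# Invented toy data: 20 records whose actual losses scatter around the prediction.
predicted = [100.0 * (i + 1) for i in range(20)]
actual = [p + (50.0 if i % 2 else -50.0) for i, p in enumerate(predicted)]
table = decile_table(predicted, actual)
```

When the two averages track each other across deciles, as in the chart, the model is ranking and calibrating losses well on the holdout sample.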

You can use this approach for any linear modeling, including multiple regression or logistic regression, which are part of the family of (generalized) linear models.
