dancing with dirty data: methods for exploring and cleaning data 2005 cas ratemaking seminar louise...

37
Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. [email protected] www.data-mines.com

Upload: kellie-reynolds

Post on 18-Jan-2018

223 views

Category:

Documents


0 download

DESCRIPTION

Data Quality: A Problem Actuary doing reconciliation on February 15

TRANSCRIPT

Page 1: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Dancing With Dirty Data:Methods for Exploring and Cleaning Data

2005 CAS Ratemaking SeminarLouise Francis, FCAS, MAAA

Francis Analytics and Actuarial Data Mining, Inc.

[email protected]

Page 2: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Objectives Discuss topic of data quality Present methods for screening

data for problems Errors Missing data

Present methods of fixing problems

Page 3: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Data Quality: A Problem Actuary doing reconciliation on February 15

Page 4: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

It’s Not Just Us “In just about any organization,

the state of information quality is at the same low level” Olson, Data Quality

Page 5: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

AAA Standards of Practice AAA SOP: Review data for

completeness, accuracy and relevance

IDMA and CAS White paper. Evaluate data for Validity accuracy reasonableness completeness

Page 6: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Example Data Private passenger auto Some variables are:

Age Gender Marital status Zip code Earned premium Number of claims Incurred losses Paid losses

Page 7: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Screening Data Many of the Methods have been in

use for a while Pioneered in field of exploratory data

analysis More recently – missing data

methods for dealing with some quality problems

Page 8: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Some Methods for Numeric Data Visual

Histograms Box and Whisker Plots

Statistical Descriptive statistics Data spheres

Page 9: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Histograms Can do them in

Microsoft Excel

Page 10: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

HistogramsFrequencies for Age Variable

Bin Frequency 20 2853 25 3709 30 4372 35 4366 40 4097 45 3588 50 2707 55 1831 60 1140 65 615 70 397 75 271 80 148 85 83 90 32 95 12

More 5

Page 11: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Histograms of Age VariableVarying Window Size

16 33 51 68 85Age

0

20

40

60

80

16 23 30 37 44 51 57 64 71 78 85

Age

0

5

10

15

20

16 24 31 39 47 54 62 70 77 85 Age0

10

20

30

40

Page 12: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Formula for Window Width

13

3.5

standard deviationN=sample sizeh =window width

hN

Page 13: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Example of Suspicious Value

600 900 1200 1500 1800

License Year

0

5,000

10,000

15,000

20,000

25,000Fr

eque

ncy

Page 14: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Discrete-Numeric Data

0 50,000 100,000Paid Losses

0

10,000

20,000

30,000

40,000

Freq

uenc

y

Page 15: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Filtered DataFilter out Unwanted Records

0 20,000 40,000 60,000 80,000 100,000Paid Losses

0

200

400

600

800

1,000

1,200

Freq

uenc

y

Page 16: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Box and Whisker Plot

-20

0

20

40

60

80

Nor

mal

ly D

istri

bute

d A

ge

+1.5*midspread

median

Outliers

75th Percentile

25th Percentile

Page 17: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Box and Whisker Example

16 1820 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 96

Age

0

200

400

600

800

1,000

Freq

uenc

y

Page 18: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Plot of Heavy Tailed DataPaid Losses

-10000

10000

30000

50000

70000

90000

110000

Pai

d Lo

ss

Page 19: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Heavy Tailed Data – Log Scale

101

102

103

104

105

106

107P

aid

Loss

Page 20: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Descriptive Statistics

N Minimum Maximum Mean Std.

Deviation License Year 30,250 490 2,049 1,990 16.3

Valid N 30,250

Page 21: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Mahalanobis Distance1( ) ' ( )

is a vector of variables is a vector of means is a variance-covariance matrix

MD x μ Σ x μxμΣ

Page 22: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Data Spheres Example: Longitude and Latitude

Page 23: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Sample from Highest Percentile

Policy ID Mahalanobis

Depth Percentile of Mahalanobis Age

License Year

Number of Cars

Number of Drivers

Model Year

Incurred Loss

22244 59 100 27 1997 3 6 1994 4,456 6159 60 100 22 2001 2 6 1993 0

22997 65 100 NA NA 2 1 1954 0 5412 61 100 17 2003 3 6 1994 0

30577 72 100 43 1979 3 1 1952 0 28319 8,490 100 30 490 1 1 1987 0 27815 55 100 44 1976 -1 0 1959 0 16158 24 100 82 1938 1 1 1989 61,187 4908 25 100 56 1997 4 4 2003 35,697

28790 24 100 82 2039 1 1 1985 27,769

Page 24: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Categorical Data: Data Cubes

Page 25: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Example – Marital Status Frequency Percent 5,053 14.3 1 2,043 5.8 2 9,657 27.4 4 2 0 D 4 0 M 2,971 8.4 S 15,554 44.1 Total 35,284 100

Page 26: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Screening for Missing Data

BUSINESS

TYPE Gender Age License Year Valid 35,284 35,284 30,242 30,250

N Missing 0 0 5,042 5,034

25 27.00 1,986.00

50 35.00 1,996.00 Percentiles

75 45.00 2,000.00

Page 27: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Blanks as Missing

Frequency Percent Valid Percent Cumulative

Percent 5,054 14.3 14.3 14.3 F 13,032 36.9 36.9 51.3 M 17,198 48.7 48.7 100.0

Valid

Total 35,284 100.0 100.0

Page 28: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Types of Missing Values Missing completely at random Missing at random Informative missing

Page 29: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Methods for Missing Values Drop record if any variable used in

model is missing Drop variable Data Imputation Other

CART, MARS use surrogate variables Expectation Maximization

Page 30: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Imputation A method to “fill in” missing value Use other variables (which have

values) to predict value on missing variable

Involves building a model for variable with missing value Y = f(x1,x2,…xn)

Page 31: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Example: Age Variable About 14% of records missing

values Imputation will be illustrated with

simple regression model Age = a+b1X1+b2X2…bnXn

Page 32: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Model for AgeTests of Between-Subjects Effects

Dependent Variable: Age Type III Sum of Squares df Mean Square F Sig.

Source Corrected Model 3,218,216 24 134,092 1,971.2 0.000 Intercept 9,255 1 9,255 136.0 0.000 ClassCode 3,198,903 18 177,717 2,612.4 0.000 CoverageType 876 3 292 4.3 0.005 ModelYear 7,245 1 7,245 106.5 0.000 No of Vehicles 2,365 1 2,365 34.8 0.000 No of drivers 3,261 1 3,261 47.9 0.000 Error 2,055,243 30,212 68 Total 46,377,824 30,237 Corrected Total 5,273,459 30,236

Page 33: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Censorship Problem Property and casualty insurance

data is typically censored We do not know final settlement

value for data Adjustments must be made to

avoid erroneous models Use ultimates Mix adjust

Page 34: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Example From Ignoring CensorshipTable 21

Age (Years) Closed Claim Severity Percent of Claims 1 500 25% 2 1,000 50% 3 5,000 15% 4 10,000 10%

Distribution of claims and average settlement amounts by age.

Table 22 New Company

Accident Year Age Severity Percent 2003 1 500 100%

Average Severity 500

Industry Accident Year Age Severity Percent

2003 1 500 25% 2002 2 1,000 50% 2001 3 5,000 15% 2000 4 10,000 10%

Average Severity 2,375

Page 35: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Metadata Data about data Detailed description of the

variables in the file, their meaning and permissible values

Marital Status Value Description 1 Married, data from source 1 2 Single, data from source 1 4 Divorced, data from source 1 D Divorced, data from source 2 M Married, data from source 2 S Single, data from source 2 Blank Marital status is missing

Page 36: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Conclusions Data quality is significant problem

in insurance and in other industries

Statistical methods can be used to detect and remediate data quality problems

How do we get better data?

Page 37: Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial

Conclusions “In the end, the best defense is

relentless monitoring of data and metadata” Dasu and Johnson, Exploratory Data

Mining and Data Cleaning