The CRISP Data Mining Process


August 28, 2004 Data Mining 2

The Data Mining Process

Business understanding

Data evaluation

Data preparation

Modeling

Evaluation

Deployment

(All phases are arranged around the data.)


Business Understanding

Project objectives

Project requirements

DM problem formulation

Preliminary plan


Case Study

- Data mining project done for a large insurance company
- Consider the use of data mining to improve understanding of customer databases
- Led by the data warehousing team, which also wanted to improve its expertise


Business Objectives

Understand what coverage packages are of interest to a customer group
- Targeting of new customers
- Cross-selling opportunities to existing customers

Understand why a customer group terminates coverage
- Know in advance what groups are likely to terminate
- Understand what factors influence termination


What are the Goals?

The business goals
- Improve customer retention
- Increase cross-selling

Success criteria
- Customer turnover rate
- Amount of cross-selling


Data Mining Problems

Classify new and existing customers as either interested or not interested in a particular coverage

Classify existing customers as either likely or unlikely to terminate coverage


Data Evaluation

Initial data collections

Data quality

Initial insights

Interesting subsets

Data warehousing team


Case Study: Data Evaluation

Data was extracted from select customer databases by company personnel

Coverage programs with few customers selected for pilot project

Five separate files extracted for five coverage programs


Data Preparation

Raw data -> Finished data set

Technical tasks:
- Data selection
- Attribute selection
- Data cleaning


Case Study: Data Preparation

- Some initial formatting of data in MS Excel
- Cleaning of data file
- Combine headers/instances
- Add a new attribute: interest (yes/no)
- Must create the "no interest" cases

End up with a CSV-formatted file
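The slides do not say how the "no interest" cases were created. The sketch below shows one plausible approach, under the assumption that any known customer who does not hold the coverage is labeled "no". The field names (`customer_id`, `state_code`) and the sample records are hypothetical.

```python
import csv
import io

def label_interest(all_customers, coverage_holders):
    """Return a copy of each customer row with an added 'interest' attribute.

    Assumption (not from the slides): holders of the coverage are 'yes',
    every other known customer is a 'no interest' case.
    """
    holders = set(coverage_holders)
    rows = []
    for customer in all_customers:
        row = dict(customer)  # copy so the input is not modified
        row["interest"] = "yes" if customer["customer_id"] in holders else "no"
        rows.append(row)
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical records; the real attribute names are not given at this point.
customers = [
    {"customer_id": "C1", "state_code": "IN"},
    {"customer_id": "C2", "state_code": "OH"},
    {"customer_id": "C3", "state_code": "IN"},
]
labeled = label_interest(customers, coverage_holders=["C1", "C3"])
csv_text = to_csv(labeled)
```

The resulting text has one header row followed by one instance per line, which is the shape a CSV loader expects.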


Weka Data Mining Software

Data in CSV format loaded into Weka:
- Data preprocessing
- Attribute selection
- Modeling
  - Classification
  - Clustering
  - Association rule mining
- Visualization


Data Preprocessing in Weka

Initial data inspection
- Missing values
- Useless attributes
- Numeric attributes as nominal

Some helpful Weka filters
- RemoveUseless
- ReplaceMissingValues
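To make the two filters concrete, here is a plain-Python sketch of the behavior they are documented to have: RemoveUseless drops attributes that never vary or (for nominal attributes) vary too much, and ReplaceMissingValues fills gaps with the mean (numeric) or mode (nominal). This is an approximation for illustration, not Weka's implementation; the function names, the `None` marker for missing values, and the sample rows are assumptions.

```python
from collections import Counter
from statistics import mean

def _present(rows, col):
    """Values of a column with missing entries (None) left out."""
    return [r[col] for r in rows if r[col] is not None]

def remove_useless(rows, max_distinct_pct=99.0):
    """Drop constant columns and string columns with too many distinct values."""
    keep = []
    for col in rows[0]:
        values = _present(rows, col)
        distinct = set(values)
        if len(distinct) <= 1:
            continue  # constant (or empty) column carries no information
        nominal = all(isinstance(v, str) for v in values)
        if nominal and 100.0 * len(distinct) / len(values) > max_distinct_pct:
            continue  # nearly unique per row, e.g. an identifier
        keep.append(col)
    return [{c: r[c] for c in keep} for r in rows]

def replace_missing(rows):
    """Fill missing numeric values with the mean, other values with the mode."""
    fill = {}
    for col in rows[0]:
        values = _present(rows, col)
        if not values:
            continue  # nothing to estimate from; leave the column as is
        if all(isinstance(v, (int, float)) for v in values):
            fill[col] = mean(values)
        else:
            fill[col] = Counter(values).most_common(1)[0][0]
    return [{c: fill.get(c) if v is None else v for c, v in r.items()}
            for r in rows]

# Hypothetical rows: 'id' is unique per row, 'country' never varies.
rows = [
    {"id": "a1", "country": "US", "size": 10, "state": "IN"},
    {"id": "a2", "country": "US", "size": None, "state": "IN"},
    {"id": "a3", "country": "US", "size": 30, "state": None},
    {"id": "a4", "country": "US", "size": 20, "state": "OH"},
]
cleaned = replace_missing(remove_useless(rows))
```

In this toy data only `size` and `state` survive, and their gaps are filled with the column mean and the most frequent value respectively.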


Data Preprocessing in Weka

Data reduction:

Instance dimension
- RemovePercentage and Resample filters

Attribute dimension
- Remove redundant attributes
- Remove irrelevant attributes
- Identify most important attributes
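Reduction along the instance dimension can be sketched in a few lines. The functions below only mimic the general idea of the two filters named above (drop a fixed share of instances, or draw a random subsample); the names, parameters, and defaults are illustrative assumptions and do not reproduce Weka's options.

```python
import random

def remove_percentage(rows, percentage):
    """Drop the first `percentage` percent of the rows and keep the rest."""
    cut = round(len(rows) * percentage / 100.0)
    return rows[cut:]

def resample(rows, sample_pct, seed=1, replacement=True):
    """Draw a random subsample containing `sample_pct` percent of the rows."""
    rng = random.Random(seed)  # fixed seed keeps the subsample repeatable
    n = round(len(rows) * sample_pct / 100.0)
    if replacement:
        return [rng.choice(rows) for _ in range(n)]
    return rng.sample(rows, n)

data = list(range(10))           # stand-in for ten instances
kept = remove_percentage(data, 30)
sample = resample(data, 50, seed=1, replacement=False)
```

A fixed seed matters in practice: it lets the same reduced data set be rebuilt when the modeling step is repeated.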


Attribute Selection Methods

Three main methods used:
- InfoGain
- ChiSquared
- Relief

Combined results from complementary methods

Final pruning of attribute list to twenty attributes
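As an illustration of the first method, the sketch below computes information gain for nominal attributes and ranks them by it. It is a textbook formulation, not the exact evaluator used in the project; the attribute names and the four sample rows are hypothetical, and the slides do not say how the three rankings were combined.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attribute, target="interest"):
    """Reduction in class entropy obtained by splitting on `attribute`."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - remainder

def rank_attributes(rows, attributes, target="interest"):
    """Attributes sorted from most to least informative."""
    return sorted(attributes, key=lambda a: info_gain(rows, a, target),
                  reverse=True)

# Hypothetical instances: the flag separates the classes, the state does not.
rows = [
    {"small_group_flag": "Y", "state_code": "IN", "interest": "yes"},
    {"small_group_flag": "Y", "state_code": "OH", "interest": "yes"},
    {"small_group_flag": "N", "state_code": "IN", "interest": "no"},
    {"small_group_flag": "N", "state_code": "OH", "interest": "no"},
]
ranking = rank_attributes(rows, ["state_code", "small_group_flag"])
```

Cutting such a ranking off after a fixed number of attributes is one simple way to arrive at a pruned list.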


Selected Attributes

Location
- Tax State
- Contract State
- State Code
- Zip Code


Selected Attributes

Size
- Case Size Range

Industry
- Industry Classification
- Industry Classification Name
- SIC Code


Selected Attributes

Timing
- New Sale Flag
- Decision Maker Effective Month
- Decision Maker Effective Year
- Next Renewal Month
- Next Renewal Year


Selected Attributes

Internal
- Agency Number
- Office Name
- Pricing Category Code
- Product Line Name
- Small Group Flag


Relevance of Attribute Selection

Improved modeling
- Faster model induction
- Higher accuracy
- Easier to interpret models

Structural knowledge gained from the selection of attributes


Most Important Attributes

What attributes affect the purchasing decision of a customer group?

E.g., the five most important factors that determine whether a customer group purchases a particular insurance coverage:
- Agency Number
- Small Group Flag
- Zip Code
- Decision Maker Effective Year
- Next Renewal Month


Customer Segmentation

Unique groups of customers
- Similar characteristics
- Similar behavior in terms of interest in coverage

For example, separate predictive models for customer segments for a particular type of insurance


Customer Segments Used for Modeling

Results
- Three segments for one database
- Two segments for two databases
- One segment for two databases

Continue modeling for each segment independently
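Modeling each segment independently amounts to partitioning the instances by segment and fitting one model per partition. The sketch below shows that structure only; the majority-class "learner" is a deliberately trivial placeholder, and the `segment` field and sample rows are hypothetical. The slides do not describe how segments were identified.

```python
from collections import Counter, defaultdict

def train_majority_model(rows, target="interest"):
    """Placeholder learner: always predicts the segment's most common label."""
    majority = Counter(r[target] for r in rows).most_common(1)[0][0]
    return lambda row: majority

def train_per_segment(rows, segment_key, trainer=train_majority_model):
    """Fit one independent model for every distinct segment value."""
    by_segment = defaultdict(list)
    for r in rows:
        by_segment[r[segment_key]].append(r)
    return {seg: trainer(seg_rows) for seg, seg_rows in by_segment.items()}

def predict(models, row, segment_key):
    """Route a row to the model trained for its own segment."""
    return models[row[segment_key]](row)

# Hypothetical instances with a precomputed segment label.
rows = [
    {"segment": "S1", "interest": "yes"},
    {"segment": "S1", "interest": "yes"},
    {"segment": "S1", "interest": "no"},
    {"segment": "S2", "interest": "no"},
    {"segment": "S2", "interest": "no"},
]
models = train_per_segment(rows, "segment")
```

Any real learner can be passed in as `trainer`, as long as it returns a callable that maps a row to a predicted label.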


Modeling

Select modeling technique(s)

Calibrate modeling techniques

Make adjustments to data


Modeling

- Mathematical models for predicting if a customer is interested in a coverage
- Understand why a customer is interested
- For example: If a customer's state is Indiana and the office is Indianapolis_Office1, then the customer is interested in Coverage_3
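The example rule can be written directly as a predicate, which is part of what makes rule and tree models easy to read. The dictionary keys `state` and `office` are hypothetical names for the underlying attributes.

```python
def interested_in_coverage_3(customer):
    """Example rule: Indiana customers served by Indianapolis_Office1."""
    return (customer.get("state") == "Indiana"
            and customer.get("office") == "Indianapolis_Office1")

# Hypothetical customers used only to exercise the rule.
match = {"state": "Indiana", "office": "Indianapolis_Office1"}
other_office = {"state": "Indiana", "office": "Indianapolis_Office2"}
other_state = {"state": "Ohio", "office": "Indianapolis_Office1"}
results = [interested_in_coverage_3(c)
           for c in (match, other_office, other_state)]
```

Both conditions must hold for the rule to fire, so only the first customer is flagged as interested.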


Modeling Techniques

Three modeling techniques tried for predicting customer interest:
- Decision trees
- Artificial neural networks (ANN)
- Support vector machines (SVM)

Decision trees have the advantage of transparency

ANN and SVM did not have significantly better prediction accuracy


Insurance Coverage Interest (Type 6)

[Figure: decision tree for Type 6 coverage interest, splitting on Small Group Flag (Y / N) and Product Line Name (Group_1 / Group_2), with Yes / No leaves.]


Insurance Coverage Interest (Type 7)

[Figure: decision tree for Type 7 coverage interest. Splits shown: Pricing Category Code (A2, A4; other branches omitted), Industry Classification Name (Legal_Services, Transportation_and_Public_Utilities), Agency Number (<= 430 / > 430), Next Renewal Year (<= 2000 / > 2000 and <= 2002 / > 2002), and a Group_1 / Group_2 split; leaves are Yes / No.]


Accuracy of Predicting Customer Interest

Coverage   Accuracy
Type 1     84.0%
Type 2     97.2%
Type 3     98.3%
Type 4     99.5%
Type 5     88.4%
Type 6     100%
Type 7     76.3%
Type 8     85.0%
Type 9     94.8%
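Accuracy here is taken to mean the share of customers whose predicted label matches the actual one. A minimal sketch follows, with made-up labels for illustration only; the slides do not state whether the reported figures were measured on training data, a held-out set, or by cross-validation.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels, not taken from the case study.
actual = ["yes", "no", "yes", "no", "no"]
predicted = ["yes", "no", "no", "no", "no"]
score = accuracy(actual, predicted)  # 4 of 5 predictions match
```

A single accuracy figure can hide class imbalance, so it is worth reading alongside the share of "yes" and "no" cases in each coverage file.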


Modeling

Mathematical models for predicting if a customer will terminate coverage

Why do customers terminate a specific type of coverage?

What are the important factors in a customer's decision to terminate coverage?


Who Terminates Type 3 Coverage?

[Figure: decision tree for termination of Type 3 coverage. Splits on Customer Effective Year, Next Renewal Month, and Coverage Effective Year (values shown: 1999, 2000, 2001, 2002, and 7); leaves are Active / Terminated.]

Correct for 95% of customers


Who Terminates Type 1 Coverage?

Decision tree based on:
- Distribution number
- Underwriting department number
- Price category
- Rate type
- Rate plan year

Predicts 96.3% of terminations correctly


Accuracy of Predicting Termination

Model      Accuracy
Type 1     96.3%
Type 2     96.5%
Type 3     95.3%
Type 4     88.9%
Type 5     88.3%


Evaluation

Data analysis results in a good model

Are business objectives being achieved?

Is there an important business issue that has not been considered?

Should the results be used?


Deployment

Incorporate the results in the organization's decision-making process
- Report
- Decision support system
- Personalization of web pages
- Repeatable data mining process