Final Report: Data Mining
30/05/2014
Applying Data Mining Techniques for an Insurance Company
Group Project: Abidar Hamza
Jad Al Adas
Sandra Culman
Lionel Kouchou
Prepared for:
Professor: Kilian Stoffel
Assistant: Dong Han
-
DATA MINING UNIVERSITE DE NEUCHATEL
Acknowledgment
We would like to express our deepest appreciation to Professor Kilian
Stoffel, as well as to Mister Dong Han, who gave us the opportunity to
work on this interesting project on the topic "Applying Data Mining
Techniques for an Insurance Company", which also led us to do a lot of
research and helped us accumulate new information.
Table of Contents
Introduction
Chapter 1: Business Background and Data Presentation
1. What is Insurance?
1.1 Area of study
2. Data source
3. From a big relational database to a normalized dataset
3.1 Creating a view for the policies
3.2 Querying the database
3.3 Description of the initial data
3.4 Key Attributes for analysis
Chapter 2: Data Preparation and Visualization
1. Introduction
2. Missing Values
3. Discretization
3.1 Discretization from Numerical to Nominal
4. Conversion
4.1 Nominal to numeric
5. Visualization
Chapter 3: Method processes and results interpretation
1. Business questions
2. Predictive methods and evaluation
2.1 One Rule
2.2 Naive Bayes
2.3 Decision Tree
2.4 Logistic Regression [2]
3. Descriptive methods and evaluation
3.1 Association Rules
3.2 Clustering
Conclusion
Glossary
Webography
Figure List
Figure 1: Policies Diagram
Figure 2: Claims Diagram
Figure 3: Policy view script
Figure 4: Policy view relation
Figure 5: Final query
Figure 6: Results in SQL
Figure 7: Attributes
Figure 8: Missing values table
Figure 9: Table with no missing values
Figure 10: Discretization process in Rapidminer
Figure 11: Visualization of "Body" after discretization
Figure 12: Discretization of attribute "Body" in two intervals
Figure 13: Weka pre-process main window
Figure 14: Weka visualization
Figure 15: Visualization of all attributes after correcting the discretized values
Figure 16: Histogram visualization of "Dateoccurence"
Figure 17: Discretization of "Horsepower"
Figure 18: Visualization after correcting split point of "Horsepower"
Figure 19: Scatter plot visualization of "Horsepower"
Figure 20: Weka visualization plot of "Region"
Figure 21: Single Rule process
Figure 22: Screenshot of Single Rule process
Figure 23: Single Rule confusion matrix
Figure 24: Car characteristics result using Single Rule
Figure 25: Car characteristics confusion matrix
Figure 26: Car characteristics confusion matrix
Figure 27: Confusion Matrix "Single Rule"
Figure 28: Naive Bayes process
Figure 29: Naive Bayes distribution table 1
Figure 30: Naive Bayes confusion matrix
Figure 31: Lift chart of driver profiles prediction
Figure 32: Confusion matrix of car characteristics using Naive Bayes
Figure 33: Distribution table of car characteristics
Figure 34: Naive Bayes distribution table 2
Figure 35: Lift chart of car characteristics
Figure 36: Screenshot of Weka decision tree
Figure 37: Text View of Weka Decision Tree
Figure 38: Confusion matrix using decision tree
Figure 39: Screenshot of Weka tree
Figure 40: Confusion Matrix
Figure 41: Screenshot of Weka decision tree using driver profile attributes
Figure 42: Confusion matrix "Driver attributes"
Figure 43: Screenshot of decision tree in Rapidminer and Confusion Matrix (Car attributes)
Figure 44: Screenshot of decision tree in Rapidminer and Confusion Matrix (driver attributes)
Figure 45: Logistic Regression table result
Figure 46: Association rules process
Figure 47: Association rules process 1
Figure 48: Association rules process 2
Figure 49: Clustering process with Rapidminer
Figure 50: Number of clusters
Figure 51: Centroid table
Figure 52: Screenshot of centroid plot view
Figure 53: Screenshot of cluster 0 folder view
Figure 54: Centroid table of second iteration
Figure 55: Screenshot of centroid plot - second iteration
Figure 56: Screenshot of cluster 3 - folder view
Figure 57: Performance Vector "Davies-Bouldin"
Figure 58: Davies-Bouldin table
Figure 59: Davies-Bouldin graph
Introduction
The accumulation of vast and growing amounts of data in different formats and different datasets
can be considered one of the biggest problems we are facing. The amount of information stored in
insurance databases is rapidly increasing because of the rapid progress of information technology.
The data that is gathered is useless without analyzing it. The patterns, associations, or relationships
among this data can provide important information that helps companies improve their activities.
The wealth of data can be considered a potential goldmine of business information. Finding the
valuable information hidden in those databases and identifying appropriate models is a difficult
task.
The above mentioned problem can be solved with the help of Data Mining, a process of analyzing
data from different perspectives and summarizing it into useful information. A typical data mining
process includes data acquisition, data integration, data exploration, model building, and model
validation.
Nowadays, insurance has become a compulsory need in people's lives, since they can't afford anymore
to bear the expenses of a loss or an accident. This need has thus fueled insurance companies to
expand and grow; consequently, profits increased, as did market share. Nevertheless, corporations
are still exposed to great risk and some losses are inevitable, which is why they seek new approaches to
better manage their risk.
The paper is organized as follows. Chapter 1 provides an overview of the insurance area and the data
source. Chapter 2 explains our whole process of data preparation. In Chapter 3, by analyzing the data
obtained from this field, we try to find usable information that can help the insurance company
better price its premiums.
Chapter 1
Business Background and Data Presentation
What is insurance?
Data source
Steps from a big relational database to a normalized dataset
Description of the initial data
1. What is Insurance?
Insurance is the fair transfer of the risk of a loss from one entity to another in exchange for
payment. In other words, insurance equals peace of mind. An insurer, or insurance carrier, is a
company selling the insurance. The insured, or policy holder, is the person or entity buying the
insurance policy. The amount of money to be charged for a certain amount of insurance coverage
is called the premium.
The main aspects of insurance are:
Underwriting (Policies) is when a customer buys coverage or a policy from the insurance
company (Revenues to the company).
A claim is when a customer undergoes a certain loss and declares it to the insurance
company in order to receive the compensation agreed upon (Losses to the company).
There are several types of insurance, for example Motor, Health, Fire, Allied Perils, Natural
disasters, Marine, Personal Accident, Life, Property, Liability, Travel and many more.
A key part of insurance is charging each customer the appropriate price for the risk they
represent. Risk varies widely from customer to customer, and a deep understanding of different
risk factors helps predict the likelihood and cost of insurance claims.
1.1 Area of study
Insurance is quite a broad and rich topic; it offers a lot of potential for applying data mining
methods. Yet, we will concentrate our research on one line of business: Motor (Automobile).
Some interesting facts about motor accidents:
There are more than 12 million motor vehicle accidents annually;
The typical driver will have a near-accident one or two times per month;
The typical driver will be in a collision of some type on average once every 6 years;
Crashes are the leading cause of death for ages 3-33.
It's good to know as well that even a minor accident can result in thousands of dollars in
damages. Accordingly, a question arises here: what is the likelihood of a car
accident occurring?
Many studies have been conducted over the years on this topic; specialists have narrowed down some
important factors that are linked to an increased risk of accident, on which insurers base the
pricing of their premiums. Some of these aspects are:
Age and gender of the driver;
Driving record;
Type of vehicle;
Geographical record;
Period of the year;
Two factors have been chosen from the above to be interpreted in our project: the type of vehicle
and the driver's profile.
Therefore our goals will be:
To better predict motor insurance claim occurrence based on the characteristics of:
the driver's vehicle;
the driver.
To discover hidden patterns that may be useful for the insurance company.
2. Data source
One of our team members used to work as a software developer for a software vendor specialized in
insurance and reinsurance ERP systems. As a result, we had the permission to acquire a database of
one of the main insurance companies. For confidentiality purposes we preferred to keep the names
undisclosed unless requested.
The software vendor used a relational database built on the Microsoft SQL Server database
management system. Parts of the diagrams are shown in the figures below:
Policies Diagram:
Figure 1: Policies Diagram
Claims Diagram:
Figure 2 : Claims Diagram
The initial unfiltered database contains roughly 600 000 rows of policies and 90 000 rows of claims.
Hence, we had to go through several steps in order to come up with one final dataset that would be
useful for our project.
3. From a big relational database to a normalized dataset
3.1 Creating a view for the policies
First of all, creating a view for the policies was fundamental, as it facilitates writing queries
against the database. The following figures show the view-creation script and the design
illustrating the tables used:
Figure 3 : Policy view script
Figure 4 : Policy view relation
3.2 Querying the database
As a second step, we used the above view in a query to select the needed data. The query selected
the attributes related to the characteristics of the car, the driver's profile, the occurrence date of the
claim, and a conditional attribute indicating whether a policy has a claim (depending on whether a
Policy ID has one or more rows in the claims table). Furthermore, we filtered the policies by the issue
year 2012 and the line of business Motor using the WHERE clause. The result was exported to a CSV file
as our preliminary dataset.
The following figure illustrates the final query in SQL:
Figure 5 : Final query
Result in SQL:
Figure 6 : Results in SQL
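The steps above can be sketched with a toy example. The table and column names below are illustrative assumptions, not the vendor's real (undisclosed) SQL Server schema; we use SQLite only because it makes the sketch self-contained. The key ideas are the view over the policies, the EXISTS-based derivation of the conditional HasClaim attribute, and the WHERE filter on issue year and line of business:

```python
import sqlite3

# Hypothetical mini-schema; the real database has ~600,000 policies and ~90,000 claims.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Policies (PolicyID INTEGER, IssueYear INTEGER, LineOfBusiness TEXT, Make TEXT);
CREATE TABLE Claims   (ClaimID INTEGER, PolicyID INTEGER, DateOccurrence TEXT);
INSERT INTO Policies VALUES
    (1, 2012, 'Motor', 'BMW'),
    (2, 2012, 'Motor', 'Fiat'),
    (3, 2011, 'Motor', 'BMW'),
    (4, 2012, 'Fire',  NULL);
INSERT INTO Claims VALUES (10, 1, '2012-06-01');
-- The view over the policies, as in section 3.1.
CREATE VIEW PolicyView AS SELECT * FROM Policies;
""")

# HasClaim is True when the Policy ID has one or more rows in the claims table;
# the WHERE clause keeps only Motor policies issued in 2012.
rows = cur.execute("""
    SELECT p.PolicyID, p.Make,
           CASE WHEN EXISTS (SELECT 1 FROM Claims c WHERE c.PolicyID = p.PolicyID)
                THEN 'True' ELSE 'False' END AS HasClaim
    FROM PolicyView p
    WHERE p.IssueYear = 2012 AND p.LineOfBusiness = 'Motor'
    ORDER BY p.PolicyID
""").fetchall()
print(rows)  # [(1, 'BMW', 'True'), (2, 'Fiat', 'False')]
```

Policy 3 is dropped by the year filter and policy 4 by the line-of-business filter, mirroring how the preliminary dataset was narrowed down before export to CSV.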
3.3 Description of the initial data
The transition from SQL to the CSV file resulted in a final total of 4731 instances and 16
attributes.
Figure 7 : Attributes
The table was extracted from Rapidminer. It represents the metadata of the insurance data which
includes the number of examples or instances, the number, description and type of the attributes,
some statistics about the values of the attributes and missing values if any.
3.4 Key Attributes for analysis
HasClaim is the target attribute, or the Label, in our data. The type of this attribute is binominal,
with two values: True and False. This attribute describes whether the line has had a claim in the
past. Given this result, we could establish a relationship with the rest of the attributes for
classification. The remaining attributes are divided into two main groups, which support our
supervised learning hypotheses:
Driver's Profile
Age: age of the driver, Type integer
Gender: Male or Female, Type binominal
Marital Status: Married or Single, Type binominal
Has Children: True, False, Type binominal
Region: Urban, Town, Suburban, Type polynominal
The driver's profile is seen as important to analyze by insurance companies, as it portrays the degree
of responsibility given the presence of children, the method of driving, the abidance by the rules
(youngsters tend to break them more often and go over the speed limit), and the residential area with
high or low traffic.
Vehicle Specifications
Make: manufacturer (BMW, Fiat), Type polynominal
Model: subcategory (BMW X5, 320), Type polynominal
Year Built: year of manufacturing, Type integer
Category: type of usage (Taxi, private, rent a car) Type polynominal
Body: the size, Type polynominal
Horsepower: speed or power, Type integer
Another aspect to look at when pricing premiums is obviously the car itself. Some makes are considered
safer on the roads and more robust; new cars tend to be more stable than old cars; high-speed
cars are more prone to accidents.
After describing the initial data, we come to the step of further cleaning and preparation, so that
the data is all set to be used in data mining techniques with Rapidminer.
Chapter 2
Data Preparation and Visualization
Introduction
Missing values
Discretization
Conversion
Visualization
1. Introduction
When it comes to using data mining techniques or making predictions, business professionals
generally agree that data preparation is one of the most important parts of any such project, and
one of the most time-consuming and difficult.
We already covered the integration, transformation and reduction parts in the first chapter ("Data
source") by normalizing data from a relational database into one CSV dataset. We will perform
the rest of the steps in Rapidminer.
2. Missing Values
A missing value can signify a number of different things. Perhaps the field was not applicable, the
event did not happen, or the data was not available. It could be that the person who entered the data
did not know the right value, or did not care if a field was not filled in.
However, there are many data mining scenarios in which missing values provide important
information. The meaning of the missing values depends largely on context. Since our data comes
from an insurance company, for which data quality is not optional but vital to everyday operations,
data cleaning is performed very often. Therefore, we didn't find many inconsistent or missing
values. Yet, we discovered some in a few attributes, as shown in the metadata of
Rapidminer:
Figure 8 : Missing values table
The Metadata indicated that we have 4 attributes with missing values:
Model: 14 missing values
Yearbuilt: 3 missing values
Horsepower: 3 missing values
Body: 1 missing value
To deal with these missing values, Rapidminer provides an operator called "Replace Missing
Values". We use it and tune its parameters for each attribute to get the best
replacement.
Model and Body
The best method for a polynominal attribute is to assign the maximum value (most frequent value)
to the missing values.
Yearbuilt and Horsepower
Since they are integers, replacing by the average is the best option; in any case, 3 values will not
make much of a difference. The metadata after fixing the missing values looks as follows:
Figure 9 : Table with no missing values
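The replacement logic described above can be sketched in plain Python. This is a hypothetical mini-version of RapidMiner's "Replace Missing Values" operator, and the column values below are invented for illustration (the real dataset has 4731 instances):

```python
from collections import Counter
from statistics import mean

def replace_missing(values, kind):
    """Fill None entries: most frequent value for nominal attributes,
    rounded average for integer ones, mirroring the strategy chosen above."""
    present = [v for v in values if v is not None]
    if kind == "nominal":
        fill = Counter(present).most_common(1)[0][0]  # mode
    else:
        fill = round(mean(present))                   # average, kept as integer
    return [fill if v is None else v for v in values]

# Invented sample columns, not the real insurance data.
body = ["Small", "Small", "Big", None]       # nominal -> replaced by the mode
yearbuilt = [2005, 2010, None, 2009]         # integer -> replaced by the average
print(replace_missing(body, "nominal"))      # ['Small', 'Small', 'Big', 'Small']
print(replace_missing(yearbuilt, "numeric")) # [2005, 2010, 2008, 2009]
```

With only 3 missing integers out of 4731 rows, the average-based fill barely shifts the attribute's distribution, which is why it is a safe choice here.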
3. Discretization
Discretization involves partitioning numerical values into intervals by placing breakpoints;
applying this method carelessly can result in unbalanced data. In our project we worked to
visualize the discretization results and to keep the intervals as balanced as possible. In this
section we describe the steps of discretization and the visualization of the results using
Rapidminer and Weka.
3.1 Discretization from Numerical To Nominal
The insurance data has many attributes which need discretization in order to provide a better
understanding and a good interpretation when using different data mining methods. Before starting
discretization, we checked whether our data was well balanced and then corrected the imbalance of
some attribute values. Figure 10 describes the discretization process in Rapidminer.
Figure 10 : Discretization process in Rapidminer
The "Discretize by User Specification" operator has limited capabilities and supports only numerical
attributes. The sensitive point in the discretization process is the cut point, called the breakpoint.
After sorting each attribute in ascending order, we chose the cut points and then split and merged
the values. After these three important steps, we evaluated the values and made some modifications
where the intervals were unbalanced.
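The sort-then-cut idea can be sketched in a few lines. The cut points 25 and 45 for "Age" are assumptions for illustration only; the breakpoints actually used in the project were tuned by inspecting the value distributions until the intervals were balanced:

```python
import bisect

def discretize(value, breakpoints, labels):
    """Map a numeric value to a nominal interval label using sorted breakpoints,
    in the spirit of RapidMiner's 'Discretize by User Specification'."""
    # bisect_right finds which interval the value falls into:
    # value <= breakpoints[0] -> labels[0], and so on.
    return labels[bisect.bisect_right(breakpoints, value)]

# Hypothetical cut points for the Age attribute.
age_breaks = [25, 45]
age_labels = ["Young", "Adult", "Old"]
print([discretize(a, age_breaks, age_labels) for a in [19, 30, 70]])
# ['Young', 'Adult', 'Old']
```

Moving a breakpoint shifts rows between neighboring intervals, which is exactly the lever used below to rebalance attributes such as "Body" and "Horsepower".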
The numerical attributes selected for discretization are:
Sum insured:
o High-risk
o Medium Risk
o Low Risk
Total Premium:
o High
o Medium
o Low
Year Built:
o Very old Cars
o Old Cars
o Recent Cars
o New Cars
Horsepower:
o Fast
o Medium
o Slow
Age:
o Young
o Adult
o Old
There are also other attributes which are important to use even though they are not numerical. For
example, the attribute "Body", which describes the car size, is nominal and contains different types
of car sizes. We decided to discretize this attribute using a nominal-to-numerical conversion
followed by discretization by user specification. Based on the metadata table, we set the range
values for each attribute and corrected them until we obtained the right balance.
Body:
o Small cars
o Medium Cars
o Big Cars
It is important to set the right cut-point for a better discretization!
Figure 11 : Visualization of Body after discretization
Figure 11 shows that our discretization needs to be better balanced between the three groups.
Small cars form by far the largest group, so when using this attribute in learning models we would
always have a higher probability of obtaining small cars than medium or big cars. To remedy this
problem, we discretized again and corrected the breakpoints between the groups. Figure 12
shows the corrected discretization of the attribute "Body".
Figure 12 : Discretization of attribute Body in two intervals
We reduced the number of groups to two: standard car and big car. Because the counts for medium
cars and big cars were close to each other, we chose to regroup them into one significant group with
a large number of values. After finishing the discretization, we added the "Write CSV" operator to
generate a new file with the discretized data. This lets us use the file in Weka; in fact, it reduces
the number of operators used in the main process when we apply the predictive and descriptive
methods (Figure 10).
Preprocessing in WEKA
First, we ran Weka and launched the Explorer window; we then selected the Preprocess tab in
order to see the attribute names, the percentage of missing values and the balance of the values
(Figure 13).
Figure 13 : Weka pre-process main window
Figure 14 shows all attributes after the first discretization; there are values which can't be
displayed due to the large number of distinct values (Policy No, Make, Date Occurrence, etc.).
These attributes could not be discretized into intervals that would tell us which car model implies
the occurrence of a claim.
Figure 14 : Weka visualization
Figure 15 shows the improvements we brought to the following attributes: "Body", "Horsepower",
"Marital Status" and "Gender", with the intention of balancing the intervals between the
instances as much as possible.
Figure 15 : Visualization of all attributes after correcting the discretized values
4. Conversion
Converting data from one format to another through different conversion operators solves the
problem of operators that cannot handle some attribute types. In our project, we used conversion
not only to solve this kind of problem but also to facilitate discretization.
4.1 Nominal to numeric
Our insurance data has many numerical attributes, so it was not difficult to use discretization to
convert numerical attributes to nominal ones. But for some attributes like "Body", which is
polynominal and has many similar values, we needed to split the values into groups that can be
clearly differentiated.
To be more efficient in processing, we sorted the "Body" values, then converted them into numbers
using "Nominal to Numerical" (assigning each value a unique integer), and finally discretized them
into intervals (standard car, big car).
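The conversion step can be sketched as follows; the body-type values below are hypothetical, and the unique-integer assignment mimics the "Nominal to Numerical" operator described above:

```python
def nominal_to_numerical(values):
    """Assign each distinct nominal value a unique integer, after sorting,
    so the codes can then be discretized into intervals like any number."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Invented body types, not the real 'Body' values of the dataset.
bodies = ["Coupe", "Sedan", "Van", "Sedan"]
codes, mapping = nominal_to_numerical(bodies)
print(codes, mapping)  # [0, 1, 2, 1] {'Coupe': 0, 'Sedan': 1, 'Van': 2}
```

Once encoded this way, a user-specified breakpoint over the integer codes splits the values into the two final groups (standard car, big car).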
5. Visualization
Data visualization is the process by which textual or numerical data are converted into meaningful
images. The reason data visualization can help in data mining is that the human brain is very
effective at recognizing patterns in graphical representations. This approach allowed us, at each
step of our project, to understand graphically what our data looks like. In this chapter we describe
and analyze the attributes selected for visualization and their impact on the insurance business.
Among the great variety of visualization techniques offered in Rapidminer, we list below the ones
we used in our project:
Histogram
We used a histogram to show the "Dateoccurence" values. "Dateoccurence" = Null is the dominant
value for the company. What is important for us is to learn and understand when the claims
occurred.
Figure 16 : Histogram visualization of Dateoccurence
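The counting behind such a histogram can be sketched with a simple frequency count; the dates below are invented, with None standing in for policies that never had a claim (which is why the Null bar dominates the real chart):

```python
from collections import Counter

# Hypothetical Dateoccurence column: None marks rows with no claim.
dates = [None, None, None, "2012-06", "2012-11", None]
counts = Counter("Null" if d is None else d for d in dates)

# Crude text histogram: one '#' per occurrence, most frequent first.
for value, n in counts.most_common():
    print(f"{value:8} {'#' * n}")
```

The same counts are what Rapidminer bins and draws as bars in Figure 16.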
Pie
Visualizing the attribute "Horsepower" after discretization, we see that the data is unbalanced
between fast, medium and slow; this means the cut points between the three intervals are not
correct.
Figure 17 : Discretization of Horsepower
While correcting the cut point, we obtained the following visualization:
Figure 18 : Visualization after correcting split point of Horsepower
Scatter plot
It is useful to visualize some attributes with respect to our label. Figure 19 shows
"Horsepower" and "Yearbuilt" (Y-axis) against "HasClaim" (X-axis); "Horsepower" is
represented by different colors. Analyzing this scatter plot view, we see that old cars with medium
horsepower (a speed not greater than 120 km/h) have more claims, contrary to other types of
cars, such as new cars, which have few claims.
Figure 19 : Scatter plot visualization of Horsepower
We also worked with Weka because of the good visualizations it provides. In our report we
didn't try to cover all the visualization types; we wanted to find the visualizations that helped us
analyze the relationships between the attributes. Figure 20 illustrates the attribute "Region" together
with the label attribute. We can deduce that the sensitive region is Town.
Figure 20 : Weka visualization plot of Region
Chapter 3
Method processes and results
interpretation
Business questions
Predictive method and evaluation
Descriptive method and evaluation
1. Business questions
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Among the components of the input, beside the instances and the attributes, what we can learn is the concept. Throughout this chapter we put into practice some of the main data mining techniques and algorithms and their potential applications in the insurance industry.
What are the driver profiles most likely to have an accident?
Which car characteristics have a high impact on HasClaim?
How can we segment automobile drivers?
Based on these business questions we can help the insurance firm make crucial business decisions and turn the newfound knowledge into actionable results.
2. Predictive methods and evaluation
Classification is the task of learning a method for predicting an instance's class from pre-labeled instances. It is a supervised learning method and is used to analyze how much the attributes determine the value of the target attribute. We decided to use four classifiers in order to predict the value of our label: One Rule, Naïve Bayes, Decision Tree and Logistic Regression. For a better understanding of our results we used both RapidMiner and Weka, programs that contain a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to their functionality.
2.1 One Rule
The Single Rule Induction operator is one of the simplest classification methods. It tests each attribute in turn and keeps the one whose single-attribute rule set makes the fewest errors. The result can be interpreted as the attribute that has the most influence on the target class.
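The 1R procedure just described can be sketched in a few lines of Python. This is an illustrative re-implementation on toy records, not the RapidMiner operator itself; the attribute values below are hypothetical:

```python
from collections import Counter, defaultdict

def one_rule(records, target):
    """1R: for each attribute, predict the majority class per value,
    then keep the attribute whose rule set makes the fewest errors."""
    best_attr, best_rules, best_errors = None, None, float("inf")
    for attr in (a for a in records[0] if a != target):
        # Count the target classes seen for each value of this attribute.
        counts = defaultdict(Counter)
        for r in records:
            counts[r[attr]][r[target]] += 1
        # Majority class per value, then total errors of that rule set.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules, best_errors

# Toy records shaped like our policy table (values are hypothetical).
data = [
    {"HasChildren": "True",  "Gender": "M", "HasClaim": "True"},
    {"HasChildren": "True",  "Gender": "F", "HasClaim": "True"},
    {"HasChildren": "False", "Gender": "M", "HasClaim": "False"},
    {"HasChildren": "False", "Gender": "F", "HasClaim": "True"},
]
attr, rules, errs = one_rule(data, "HasClaim")
print(attr, rules, errs)
```

On this toy set the winning attribute is HasChildren, with one error; on the real data the operator performs the same search over all candidate attributes.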
Figure 21 : Single Rule process
The first attributes that we tested with the One Rule operator were the car characteristics: Body, Category, HorsePower, MAKE, Model and Yearbuilt. Among these attributes, Model is the one that gives the smallest error rate (3442 out of 4731 correct instances). The insurance company must take into account the models most predisposed to accidents; for example, a pick-up will have a higher probability of an accident than an Accord. Knowing the model automatically implies the brand, while the reverse does not necessarily hold.
Figure 22 : Screenshot of Single Rule process
The confusion matrix displays how many of the predicted values matched the actual values when cross-validation tests were performed. Among the records predicted as FALSE, 464 predictions were correct and 318 were incorrect. Among the records predicted as TRUE, 356 predictions were correct and 281 were incorrect. The confusion matrix shows that for the prediction of FALSE, model accuracy is 59.34%, and for the prediction of TRUE, model accuracy is 55.89%. The overall model accuracy is simply the percentage of good predictions among all predictions, that is:
(T/T + F/F) / (T/T + T/F + F/T + F/F)
where T/T is the number of situations where the model predicts TRUE and the result is TRUE, T/F is the number of situations where the model predicts TRUE and the result is FALSE, and so on. After performing all computations using RapidMiner, the total accuracy of the model is 57.79%.
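The overall accuracy can be checked directly from the four confusion-matrix counts reported above:

```python
def accuracy(tt, tf, ft, ff):
    """Overall accuracy: correct predictions over all predictions."""
    return (tt + ff) / (tt + tf + ft + ff)

# Counts reported in the text: 356 correct TRUE, 281 incorrect TRUE,
# 464 correct FALSE, 318 incorrect FALSE.
acc = accuracy(356, 281, 318, 464)
print(round(acc * 100, 2))  # 57.79, matching the RapidMiner result
```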
Figure 23 : Single Rule confusion matrix
The prediction of HasClaim relies on the TRUE value, which means that this value plays an important role in analyzing the performance of the operator. The poor accuracy provided by this operator will not help us improve the prediction of HasClaim from the car characteristics.
Obtaining the model of the car as the first rule in our algorithm made us wonder whether this is helpful or not. Because of the high cardinality of the attributes Model and Make, we decided to test the operator without them. The results below give a better understanding of which attributes an insurance company should take into consideration when making an insurance offer. In conclusion, YearBuilt provides us with 4 simple rules.
Figure 24 : Car characteristics result using Single Rule
After we made these changes, the accuracy of the operator with the new set of attributes still had not changed.
Figure 25 : Car characteristics confusion matrix
Next we tested One Rule on the driver characteristics: Age, Gender, HasChildren, Marital Status and Region. The model selected HasChildren as the attribute with the fewest errors (2476 out of 4731 correct instances). The program defined the two rules below: if HasChildren = False then False, and if HasChildren = True then True, meaning that clients with children are prone to have an accident, while clients without children are not.
Figure 26: Driver characteristics result using Single Rule
Evaluating this operator with the new attributes gives a low accuracy of 53.14%. Such a low accuracy means that this algorithm does not give the company a reliable basis for deciding whether the attribute HasChildren is really a primary component of the decision.
Figure 27: Confusion Matrix Single Rule
2.2 Naïve Bayes
Driver Profiles prediction
Based on Bayes' theorem, this classifier applies a simple probabilistic assumption: the attributes are assumed to be conditionally independent of each other given the label attribute HasClaim, while each of them is related to the label.
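This independence assumption can be illustrated with a minimal sketch: each class is scored as its prior times the product of per-attribute likelihoods. The toy data and attribute values are hypothetical, not taken from our dataset:

```python
from collections import Counter, defaultdict

def train_nb(records, target):
    """Estimate priors P(class) and likelihoods P(attr=value | class)."""
    priors = Counter(r[target] for r in records)
    likes = defaultdict(Counter)  # (attr, class) -> Counter of values
    for r in records:
        for a, v in r.items():
            if a != target:
                likes[(a, r[target])][v] += 1
    return priors, likes

def predict_nb(priors, likes, x):
    """Score each class as P(c) * prod P(a=v|c); return the best class."""
    n = sum(priors.values())
    best, best_p = None, -1.0
    for c, pc in priors.items():
        p = pc / n
        for a, v in x.items():
            p *= likes[(a, c)][v] / pc  # 0 if value unseen for class c
        if p > best_p:
            best, best_p = c, p
    return best

data = [
    {"Age": "adult", "Gender": "M", "HasClaim": "True"},
    {"Age": "adult", "Gender": "F", "HasClaim": "True"},
    {"Age": "young", "Gender": "M", "HasClaim": "False"},
    {"Age": "old",   "Gender": "M", "HasClaim": "False"},
]
priors, likes = train_nb(data, "HasClaim")
print(predict_nb(priors, likes, {"Age": "adult", "Gender": "M"}))
```

The RapidMiner operator additionally applies smoothing to unseen values; this sketch omits that for brevity.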
Figure 28: Naïve Bayes process
Figure 28 illustrates the Naïve Bayes process. We selected the attributes related to the driver profiles in order to calculate the likelihood of each one for predicting HasClaim = True. As a result of this process we obtained a distribution table that reports the probability of each attribute; we then focused on the target class TRUE and compared the values in order to sort out the attributes with high probability.
According to the table, the selected attributes are:
Age = adult, Gender = Male, Marital status = Married, HasChildren = TRUE
Figure 29: Naive Bayes distribution table 1
To evaluate the performance of the Naïve Bayes classifier, the confusion matrix allows a more detailed analysis than the mere proportion of correct guesses. The confusion matrix displays how many of the predicted values matched the actual values when cross-validation tests were performed (by the cross-validation operator). For example, among the records predicted with HasClaim class True, 249 predictions were correct and 194 were incorrect. The confusion matrix shows that for the prediction of HasClaim = TRUE, model accuracy is 55.11%.
Figure 30: Naive Bayes confusion matrix
Lift chart
A lift chart graphically represents the improvement that a mining model provides when compared against a random guess, and measures the change in terms of a lift score. By comparing the lift scores for various portions of our data set and for different models, we can determine which model is best, and which percentage of the cases in the data set would benefit from applying the model's predictions.
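As a rough sketch, the lift score of a targeted group is its response rate divided by the baseline rate of a random guess over the whole set. The counts below are illustrative only, not figures from our dataset:

```python
def lift(targeted_positives, targeted_total, all_positives, all_total):
    """Lift: response rate in the targeted group divided by the
    baseline response rate over the whole data set."""
    target_rate = targeted_positives / targeted_total
    base_rate = all_positives / all_total
    return target_rate / base_rate

# Hypothetical example: targeting 100 drivers finds 50 claimants,
# while claimants are only 25% of the whole population.
print(lift(50, 100, 25, 100))  # lift of 2.0: twice better than random
```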
Figure 31: Lift chart of driver profiles prediction
The lift chart is helpful to quantify the relationship between the confidence (using a threshold) and the True prediction of HasClaim, by showing the increase in the number of driver claims. Analyzing the chart in more detail, with confidence greater than 0.5 the number of drivers predicted to have a claim decreases; for instance, in the confidence range [0.49-0.51], 1944 drivers must be targeted in order to find 1025 among them who have a claim. The good thing about our prediction results is that they give values above 90% once the confidence and the driver target are specified.
Car characteristics prediction
We tried to predict the same target class using different factors: first we chose the driver profiles, and now we use the car characteristics. Our main goal is to find rules with high accuracy. Selecting the car characteristic attributes and training and testing the model, we obtain the confusion matrix in figure 32.
Figure 32: Confusion matrix of car characteristics using Naive Bayes
The accuracy of our model using the car attributes is 61.54%. This result is acceptable and allows us to focus on which types of cars have a high impact on HasClaim. To extract more details from this model, we use the weight table of True in order to select the attributes with high probability in the table.
Figure 33: Distribution table of car characteristics
From the table in figure 33, the car characteristic attributes which have a high probability and can be selected for prediction in this model are:
o Body = Standard cars and Yearbuilt = very old cars
We split the table presenting car brand and model because of the large number of values; we sorted these values and selected the car brands with the highest probabilities:
Figure 34: Naive Bayes distribution table 2
Mercedes, Nissan and Hyundai are the three main brands having a high impact on HasClaim.
Figure 35: Lift chart of car characteristics
The lift chart is used to quantify the prediction of the cars with the highest claims, although there are no prediction results with perfect information; the confidence measure and the number of drivers both contribute to the prediction. If we take, for instance, a confidence value between [0.4-0.5], 529 targeted drivers out of 1092 yield a prediction of 98% True HasClaim, but if we compare this to a confidence of 1, with 205 out of 211 we get only 40% true claims. From this result we confirmed that our model predicting True HasClaim is acceptable, and the results can contribute strongly to future predictions and help the insurance company make new decisions based on them.
2.3 Decision Tree
This operator generates a decision tree for classification of both nominal and numerical data. This classification model is really easy to interpret and predicts the value of our label HasClaim based on the attributes we have chosen. In this section we analyze both groups of attributes (Car / Driver) with two operators: Decision Tree in RapidMiner and J48 in Weka.
Weka
The decision tree below resulted from running the program with the car attributes. Unfortunately the tree is very big, since the attributes have many possible outcomes: it has a size of 835 with 801 leaves. It seems difficult to interpret, but with the help of the text view we managed to provide an interpretation.
Figure 36: Screenshot Weka decision tree
J48 takes Category as the root attribute. Prive cars are analyzed depending on YearBuilt, MAKE and Horsepower. If the car is very old, the algorithm checks the MAKE of the car before deciding between true or false. If the car is prive and categorized as a new car, it is automatically assigned a TRUE HasClaim prediction. If the car is categorized as an old car, the model first goes through the MAKE of the car and, depending on this, proceeds to analyze the Horsepower.
The generated result shows that the cars most likely to be involved in an accident are: Prive, Cargo, Rent and Taxis. The rest of the instances are well spread over the remaining categories, each of which owns a small part of the total (they have a remote probability of an accident). Really exposed cars are: Mercedes, Nissan, BMW, Audi, Toyota and DAIHATSU. We question whether these results really provide a good overview of car insurance.
The other categories are well spread and own a small part of the total instances. Accordingly, these categories have a small probability of being involved in an accident. The insurance company should analyze and focus on the four big car groups.
Figure 37: Text View Weka Decision Tree
In the evaluation, the Weka J48 tree obtained 67.09% accuracy. The prediction of the TRUE class, which is important for our analysis, has an even higher accuracy of 71.23%.
Figure 38: Confusion matrix using decision tree
Because of the complexity of the Weka tree, we decided to remove the attributes with a large number of values from the model (Model, Make, and Category). We obtained a tree that is easier to interpret but provides a lower accuracy.
Figure 39: Screenshot of Weka tree
Figure 40: Confusion Matrix
Using the driver attributes we obtained the tree below: a tree of size 16 with 10 leaves. The algorithm sets HasChildren as the root attribute and analyzes the data accordingly. We can conclude that clients who do not have children and are married have, depending on their age, the following results: adult - true, old - true, young - false. If the clients do not have children and are single, the model decides false. On the opposite side, a client who has children and is single has a higher probability of an accident occurrence.
Figure 41: Screenshot of Weka decision tree using driver profile attributes
The evaluation of this operator using the driver attributes gives a low accuracy of 56%, meaning that the interpreted results are not too certain. We tried changing the parameters of the operator, but we could not obtain better results.
Figure 42: Confusion matrix Driver attributes
Decision Tree Rapidminer
Running the second operator, this time in RapidMiner, we obtained the results below. The algorithm produces a pruned tree of medium complexity. The root attribute is YearBuilt in the first model and HasChildren in the second. As a small conclusion after examining the results, the attributes YearBuilt and HasChildren have the following interpretation: new car - true, old car - false, very old car - false, recent car - true. If the client has children, whether he has a claim depends on the age and the region.
Figure 43: Screenshot of decision tree in Rapidminer and Confusion Matrix (Car
attributes)
We evaluated both algorithms and obtained accuracies of 57% and 53%. Compared to Weka, this operator produces a less accurate result. In other words, the insurance company can get a better prediction of the label class using the J48 tree.
Figure 44: Screenshot of decision tree in Rapidminer and Confusion Matrix (driver
attributes)
2.4 Logistic Regression [2]
We start the modeling process by learning the relationship between claim frequency and the underlying risk factors, including age, gender, marital status, region and HasChildren. Based on these attributes, which describe the driver profiles, we use logistic regression to quantify the claim frequency and the effect of each risk factor, and also to estimate the probability of a claim.
Figure 45: Logistic Regression table result
We compute the logistic regression using the same attributes selected from the decision tree. Based on the results in the table we can deduce that age has a high impact on HasClaim; by segmenting drivers by age, the insurance company can make more profit by increasing the premium amount as the drivers get older. This is one of the many reasons why we applied a predictive model: to learn the best rule for the future.
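A minimal version of the model can be sketched as follows. The feature encoding and the toy data are hypothetical, and this uses plain stochastic gradient descent rather than the solver RapidMiner applies:

```python
import math

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Fit weights for P(claim=1 | x) = sigmoid(w.x + b) by SGD on log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log-loss with respect to z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Estimated claim probability for one driver profile."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical encoding: [normalized age, gender (1 = male), has_children].
X = [[0.2, 1, 0], [0.8, 1, 1], [0.9, 0, 1], [0.3, 0, 0]]
y = [0, 1, 1, 0]
w, b = train_logreg(X, y)
print(predict(w, b, [0.85, 1, 1]) > 0.5)
```

The fitted weights quantify the effect of each risk factor, which is exactly the information the table in figure 45 summarizes for the real data.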
3. Descriptive methods and evaluation
We proceeded to find the hidden structure in the available data (a process called unsupervised learning) by using two methods: clustering and association rules.
3.1 Association Rules
Association rules differ from classification rules in that they can predict any attribute, not just the target one, and consequently any combination of attributes. Different association rules express different regularities that underlie the dataset, and they generally predict different things. In order to implement the association rules for our database, the following five operators are needed:
Figure 46: Association rules process
Read CSV, in which we import the database (but we should not use any label);
Select Attributes, where we select the attributes for the process. We decided to take out two attributes that have no impact on our analysis: Policy Number and Date Occurrence;
Nominal to Binominal, which changes the type of selected nominal attributes to a binominal type. It also maps all values of these attributes to binominal values;
FP Growth - the FP in FP-Growth stands for Frequent Pattern. Frequent pattern analysis is used in many kinds of data mining, and it is a necessary component of association rule mining. Without the frequencies of attribute combinations, we cannot determine whether any of the patterns in the data occur often enough to be considered rules. One important parameter of this operator is Min Support. It represents the occurrence rate of the rule (the number of times the rule occurred divided by the number of observations in the data set).
Create Associations - this operator uses the frequent pattern matrix data and seeks any patterns that occur frequently enough to be considered rules. The Create Association Rules operator generates both a set of rules (through the rul port) and a set of associated items (through the ite port). In this model we are only interested in generating rules, so we simply connect the rul port to the res port of the process window. One of the influential parameters of this operator is Min Confidence. The confidence percentage is a measure of the likeliness that both an attribute and its associated attribute are flagged as true. It is computed as the ratio between the number of times a certain rule occurs and the number of times it could have occurred.
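The two measures can be expressed compactly. The transactions below are hypothetical binominal flags, not our actual policies:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent:
    support(antecedent union consequent) / support(antecedent)."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

# Toy transactions over binominal flags (hypothetical values).
ts = [
    {"standard_car", "prive", "male"},
    {"standard_car", "prive"},
    {"standard_car", "cargo"},
    {"prive", "male"},
]
print(support(ts, {"standard_car", "prive"}))        # 0.5
print(confidence(ts, {"standard_car"}, {"prive"}))
```

FP-Growth simply finds all itemsets whose support exceeds Min Support without enumerating every candidate, and Create Association Rules then keeps the rules whose confidence exceeds Min Confidence.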
First, we decided to test the program using a confidence of 0.8 and a Min Support of 0.1. Surprisingly, 352 rules had a very low support. In our opinion these rules cannot be considered reliable because of the low frequency of the attribute combinations.
Figure 47: Association rules process 1
We considered it important to increase the min support in order to obtain rules with a high frequency of attribute combinations.
The second test used the same confidence but a higher minimum support. Five rules were generated, having a support between 0.359 and 0.435 and a confidence between 0.821 and 0.951.
Figure 48: Association rules process 2
Analyzing these rules we can conclude the following:
A car with a medium risk sumInsured is most probably a prive car;
A prive car with a medium risk sumInsured has a high probability of being a standard car;
Most of the time a standard car will be a prive car;
If the car is a standard one and the driver is a male, then we can conclude that the car is a prive one;
If we have a standard car with a medium risk sumInsured, the car is most probably a prive car.
3.2 Clustering
Clustering is one of the unsupervised techniques we deploy on the data in order to partition it, reveal sub-classes and discover their natural grouping. The k-means algorithm is one of the operators that RapidMiner offers to divide the data. We preferred this particular algorithm as it is easy to understand and simple to interpret. Yet a challenge we faced was picking the number of clusters in advance with the aim of acquiring a robust set of clusters. The performance of a clustering algorithm may be affected by the chosen value of K. Therefore, instead of using a single predefined K, a set of values was adopted in order to find a satisfactory clustering result. The validity of the outcome is assessed using the Cluster Distance Performance evaluation provided by RapidMiner. This operator relies on two main criteria to evaluate the performance:
avg._within_centroid_distance: The average within cluster distance is calculated by
averaging the distance between the centroid and all examples of a cluster.
davies_bouldin: Algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies-Bouldin index; the clustering that produces the collection of clusters with the smallest Davies-Bouldin index is considered the best by this criterion.
Since we do not have a target attribute in clustering, most of the attributes will be used in the process, except:
Policy NO, Dateoccurence: irrelevant;
SumInsured, TotalPremium: the outcome of this study is to help price the premium and sum insured, so it would be illogical to include them in the clusters;
Make, Model, Category: they contain too many values and would need a large number of clusters for grouping, so we omit them to obtain more realistic and useful clusters.
Hence, we are able to learn the correlations and similarities between the driver's profile and the car he is driving, along with the claim occurrence. Furthermore, all polynominal and binominal attributes are converted to numerical, since k-means deals with distances and accepts only numerical values.
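The conversion can be sketched as simple one-hot (dummy) coding, an illustrative analogue of the RapidMiner operator, shown here on hypothetical records:

```python
def nominal_to_numerical(records, attrs):
    """One-hot encode nominal attributes so k-means can use
    Euclidean distances on the resulting 0/1 vectors."""
    values = {a: sorted({r[a] for r in records}) for a in attrs}
    encoded = []
    for r in records:
        row = []
        for a in attrs:
            # One column per possible value of the attribute.
            row.extend(1.0 if r[a] == v else 0.0 for v in values[a])
        encoded.append(row)
    return encoded, values

records = [
    {"Gender": "M", "Region": "Town"},
    {"Gender": "F", "Region": "Suburb"},
]
X, mapping = nominal_to_numerical(records, ["Gender", "Region"])
print(X[0])  # [0.0, 1.0, 0.0, 1.0]
```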
The process is illustrated in the figure below:
Figure 49: Clustering process with Rapidminer
First Iteration
K = 3, Max Runs = 10
This first step will be considered as the reference point considering the outcome of the evaluation
and the consistency of the clusters. The results are illustrated in the figures below:
Text View
Figure 50 : Number of clusters
Centroid Table
Figure 51 : Centroid table
Centroid Plot View
Figure 52: Screen shot of centroid plot view
Each colored Line represents a cluster, with the peaks on the attributes that have a strong
relationship to the cluster.
Analysis of Results
Attributes are considered a strong match within a cluster if they have a high average. For example, we look at cluster 0 in the centroid table and sort the averages from highest to lowest. We can conclude that the strongest element is Marital Status = single = 1. The strong, correlated elements are: HasChildren = False = 0.747, Body = standard car = 0.634, Gender = Male = 0.626, HasClaim = False = 0.512. The rest of the elements are considered weak or uncorrelated, like Marital Status = 0. A look at the folder view of cluster 0 gives a clearer example:
Figure 53: Screenshot of cluster 0 folder view
Cluster 0 includes the items in the red square with a value of 1. From the insurance business
perspective, we can conclude that an adult single female with no children, living in an urban area,
driving a big very old fast car has no claims.
Analysis of Performance
Performance Vector
As a first iteration, the performance vector does not say much about the quality of the clusters; however, it is taken as a reference point for the rest of the iterations to validate the relationship between the number of clusters and the Davies-Bouldin average.
Second Iteration
K = 6, Max Runs = 10
Here we double K, observe the possible outcome and check the performance. The results are illustrated in the figures below:
Text View
Centroid Table
Figure 54: Centroid table of second iteration
Centroid Plot
Figure 55: Screenshot of centroid plot - second iteration
Analysis of Results
Now that we have increased the number of clusters, fewer attributes belong to each one and we have a clearer view for conclusions. We consider cluster 3 for this example, again sorted from high to low. As a first observation, the number of very strong attributes belonging to the cluster has increased dramatically: now body = standard car, HasChildren = true and HasClaim = false all equal 1. In addition, the strong attributes have higher averages as well, for example Marital status = married = 0.883 and Gender = male = 0.650.
Again, we look at the folder view with the aim of better understanding the cluster.
Figure 56 : Screenshot of cluster 3 - folder view
Cluster 3 includes the items in the red square with a value of 1. From the insurance business
perspective, we can conclude that an adult married male with children, living in a suburban area,
driving a standard fast car has no claims.
Analysis of Performance
Figure 57: Performance Vector Davies-Bouldin
In the first iteration the Davies-Bouldin index was -2.711. After increasing K to 6, it improved to -2.657. This indicates that k = 6 suits this data better than k = 3. However, a question arises here: does increasing K always produce better clusters? Technically it might, but the increase of K and the improvement of DB are not directly related. Therefore, we need to decide to stop increasing K whenever we notice that the DB improvement is becoming almost flat.
Davies-Bouldin chart and determining the ideal number of clusters
A number of iterations were performed with different values of K to monitor the variation of performance.
Figure 58: Davies-Bouldin table
Figure 59: Davies-Bouldin graph
We notice that the curve becomes flat starting from K = 24. Thus, we can conclude that an optimal number of clusters lies in the range between 12 and 24.
K    Davies-Bouldin
3    -2.711
6    -2.657
12   -2.461
24   -2.313
48   -2.304
96   -2.297
Conclusion
This project was a typical application of deploying data mining techniques in a real-life situation. We learned the different phases, from data preparation through interpretation and drawing conclusions. We also noticed that not all techniques fit the data perfectly; thus a thorough knowledge of the business is essential for an initial plan of what to use.
It is worth mentioning that result evaluations are sometimes subjective; the company in need of this information would be the only judge of the accuracy of certain outcomes. For instance, we discovered several correlations between the driver, the car and the occurrence of an accident. Now it is up to the firm to decide whether to increase or decrease premiums for certain groups of people and groups of cars.
Glossary
A
Accuracy: A measure of a predictive model that reflects the proportionate number of times that
the model is correct when applied to data.
B
Binning: The process of breaking up continuous values into bins, usually done as a preprocessing step for some data mining algorithms — for example, breaking up age into bins of ten years each.
C Claims: Claims and loss handling is the materialized utility of insurance; it is the actual "product" paid for. Claims may be filed by the insured directly with the insurer or through brokers or agents. The insurer may require that the claim be filed on its own proprietary forms, or may accept claims on a standard industry form.
Cross Validation (and Test Set Validation): The process of holding aside some training data that is not used to build a predictive model, and later using that data to estimate the accuracy of the model on unseen data, simulating the real-world deployment of the model.
Comma-separated values (CSV): A common text-based format for data where the divisions between attributes (columns of data) are indicated by commas.
Confidence level: A value, usually 5% or 0.05, used to test for statistical significance in some data mining methods. If statistical significance is found, a data miner can say that there is a 95% likelihood that a calculated or predicted value is not a false positive.
D
Decision Tree: A class of data mining and statistical methods that form tree-like predictive models.
Data analysis: The process of examining data in a repeatable and structured way in order to extract meaningful patterns or messages from a set of data.
L Label: In RapidMiner, this is the role that must be set in order to use an attribute as the dependent, or target, attribute in a predictive model.
M Missing Data: Instances in an observation where one or more attributes do not have a value. This is not the same as zero, because zero is a value.
P Prediction: The target, label or dependent attribute that is generated by a predictive model, usually for a scoring data set.
T Training Data: In a predictive model, this data set already has the label, or dependent variable, defined, so that it can be used to create a model that can be applied to a scoring data set in order to generate predictions for the latter.
Webography
[1] http://chirouble.univ-lyon2.fr/~ricco/data-mining/
[2] https://www.casact.org/pubs/forum/03wforum/03wf001.pdf
[3] http://www.cnbc.com/id/101586404
[4] http://online.wsj.com/news/articles/SB1000142405274870464860457562075099 8072986
[5] http://arxiv.org/ftp/arxiv/papers/1309/1309.0806.pdf
[6] http://docs.salford-systems.com/insurance4211.pdf
[7] http://www.ulb.ac.be/di/map/adalpozz/pdf/Claim_prediction.pdf