Introduction Preparing the database Exploratory Data Analysis Logistic Regression
Scoring with R
Summer School on Mathematical Methods in Finance and Economy
Hanoi
Thibault LAURENT
Toulouse School of Economics
June 2010 (Slides modified in August 2010)
Thibault LAURENT Toulouse School of Economics
Scoring with R
Introduction
Preparing the database
Exploratory Data Analysis
Logistic Regression
Background study
Dominique Desbois (2008), “Introduction to Scoring Methods: Financial Problems of Farm Holdings”, CS-BIGS, 2(1): 56-76.
Objectives: analyse the causes of farm bankruptcy and find a model that can identify farms with financial difficulties in order to prevent failures.
Analysis plan:
1. Preparing the database
2. Exploratory data analysis
3. Logistic regression
Description of the data set
I 1260 farms specialized in field crops
I response variable Y takes the value “failing” (Y = 1) if the farm failed and “healthy” (Y = 0) otherwise
I explanatory variables X contain information about the structure (legal status, type of farming index, agricultural area used, etc.) and 22 ratios grouped into the following topics: Capitalization, Weight of the Debt, Liquidity, Debt servicing, Capital profitability, Earnings and Productive activity.
See p. 4 of Desbois (2008) for more details
Packages used in this course
You may install (function install.packages) or update (function update.packages) the following packages at the beginning of your R session:
> install.packages(c("foreign", "xtable", "lattice"))
> install.packages(c("car", "classInt", "ROCR",
+ "BMA"))
Introduction
Preparing the database
Exploratory Data Analysis
Logistic Regression
Importing the data set
I Download the “desbois.zip” file from http://www.bentley.edu/csbigs/csbigs-v2-n1.cfm
I Unzip the file.
I Import the “desbois.sav” SPSS file in R after loading the foreign package (functions for reading and writing data stored by statistical packages such as Minitab, SAS, Stata, etc.):
> library(foreign)
> farms <- read.spss("desbois.sav", to.data.frame = TRUE)
Recoding ?
The main objective of recoding is to obtain a first working version of the data set:
1. choose the right format for the variables,
2. verify whether there are missing values,
3. choose short and intuitive names for variables and attribute levels.
Recoding with R
1. checking the structure of our data set:
> str(farms)
2. re-order the levels of the variable of interest:
> farms$DIFF <- relevel(farms$DIFF, ref = "failing")
3. create a binary variable for the logistic regression:
> farms$Y <- factor(ifelse(farms$DIFF == "failing",
+ 1, 0))
4. simplify the levels of some attributes:
> levels(farms$STATUS) <- c("company", "proprietorship")
> levels(farms$ToF) <- c("cereals", "gen.cropping",
+ "dairy.farm", "mix.livestock", "var.crops-livestock",
+ "soilless.breed")
Missing values ?
Is there any missing value in the data set ?
> any(is.na(farms))
[1] FALSE
No missing values here. If the answer were yes, the missing values could be replaced using imputation techniques (see for example http://en.wikipedia.org/wiki/Imputation_(statistics))
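For illustration only, a minimal sketch of mean imputation on a toy vector (not the farms data, since this data set has no missing values):

```r
# Toy example: replace missing values by the mean of the observed values
x <- c(1.2, NA, 0.8, 1.0, NA)                        # hypothetical data with NAs
x.imp <- replace(x, is.na(x), mean(x, na.rm = TRUE))
any(is.na(x.imp))                                    # FALSE: no NA left
```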
Introduction
Preparing the database
Exploratory Data Analysis
Logistic Regression
Exploratory Data Analysis ?
Objectives:
1. obtain some elements of answer to the problem: what are the causes of bankruptcy of the farms?
2. detect outliers in observations or collinearity between variables.
3. create new pertinent variables (transforming with log, exp, etc., or crossing some variables).
Analysis of the data.frame object
farms belongs to a class with common methods (print, plot, summary); the data live in a data.frame, the workhorse data container for analysis in R.
> class(farms)
> summary(farms)
> plot(farms)
Useful function to visualize the data set:
> edit(farms)
Basic statistics with R
For numeric variable:
> n <- nrow(farms)
> min(farms$r1)
> max(farms$r1)
> mean(farms$r1)
> median(farms$r1)
> quantile(farms$r1)
> sd(farms$r1) == sqrt(var(farms$r1))
> stem(farms$r1)
For attribute variable:
> dis.Y <- table(farms$DIFF)
> margin.table(dis.Y)
> all(prop.table(dis.Y) ==
+ dis.Y/margin.table(dis.Y))
> addmargins(dis.Y)
I Skewness and kurtosis statistics can be calculated by loading the e1071 package
I the package r2lh provides functionalities to export some R analyses in LaTeX format
Graphics
Main advantages of using graphics:
I a good summary of the data
I easy to understand and comment
Be careful: graphics may suggest some intuitions, but comments must be confirmed by statistical tests! Here are some links about R graphics:
I http://addictedtor.free.fr/graphiques/
I http://csg.sph.umich.edu/docs/R/graphics-1.pdf
Attribute variable analysis: Bar plot
> col.y = colors()[c(641, 615)]
> barplot(dis.Y, main = "Y", col = col.y, space = 0.5)
[Bar plot "Y": counts of failing and healthy farms]
In this study, the number of failing farms is close to the number of healthy farms. colors() returns a vector of the names of the available colors in R.
Attribute variable analysis: Pie Chart
> label.ToF = paste(round(prop.table(table(farms$ToF)),
+ 3) * 100, "%")
> with(farms, pie(table(ToF), main = "Type of Farms",
+ labels = label.ToF, col = heat.colors(6),
+ cex = 0.8))
> legend("bottomleft", legend = levels(farms$ToF),
+ fill = heat.colors(6), cex = 0.7)
[Pie chart "Type of Farms": cereals 26.9 %, gen.cropping 24.3 %, dairy.farm 37.1 %, mix.livestock 4 %, var.crops-livestock 6.2 %, soilless.breed 1.4 %]
Numerical variable analysis: boxplot
> boxplot(farms$r2, main = "variable r2", col = "lightgrey")
[Boxplot of variable r2, values between 0.0 and 1.0]
This variable does not seem to contain any outlier...
Numerical variable analysis: histogram
> plot(density(farms$r3), col = "red", type = "n", main = "")
> hist(farms$r3, breaks = 15, freq = FALSE, col = "royalblue", add = T)
> rug(farms$r3)
> lines(density(farms$r3), col = "red")
[Histogram of r3 with kernel density curve; N = 1260, bandwidth = 0.04652]
Remark: r3 contains outliers (negative values)
What can be done after a univariate analysis
I deleting/modifying observations with abnormal values: high/low values for a numeric variable, or levels with too few frequencies for an attribute
> low.index <- which(farms$r3 < 0)
> farms$r3 <- with(farms, replace(r3, low.index, mean(r3)))
> farms$r4 <- with(farms, replace(r4, low.index, mean(r4)))
> farms$r8 <- with(farms, replace(r8, low.index, mean(r8)))
> farms$r14 <- with(farms, replace(r14, low.index, mean(r14)))
I transforming a variable (x ↦ log(a + x)) to obtain a more normal distribution
I more generally, the Box-Cox transformation (function BoxCox of the forecast package)
Bivariate analysis: 2 numerical variables
> with(farms, cov(r1, r2))
> with(farms, cov(r1, r2)/(sd(r1) * sd(r2)) == cor(r1,
+ r2))
> tab.cor <- cor(farms[, c("r1", "r2", "r3", "r4", "r5")])
Reproducible research with LaTeX:
> library(xtable)
> matable <- xtable(tab.cor, digits = 3, caption = "Correlation table")
> print(matable, file = "corr.tex", size = "tiny")
     r1      r2      r3      r4      r5
r1   1.000  -0.908   0.121   0.759   0.818
r2  -0.908   1.000   0.026  -0.643  -0.790
r3   0.121   0.026   1.000   0.642  -0.370
r4   0.759  -0.643   0.642   1.000   0.283
r5   0.818  -0.790  -0.370   0.283   1.000
Table: Correlation table of Capitalization variables
We notice a strong correlation between Capitalization variables.
Scatter plot (with lattice package)
> library(lattice)
> xyplot(r2 ~ r1, data = farms, groups = DIFF, auto.key = list(columns = 2,
+ title = "Scatter plot"), par.settings = simpleTheme(col = col.y))
[Scatter plot of r2 against r1, grouped by DIFF (failing / healthy)]
(low values of r2 + high values of r1) → high probability of failing
Scatterplot Matrices (with car package)
> library(car)
> scatterplotMatrix(~r6 + r7 + r8 | DIFF, data = farms,
+ col = col.y, main = "Weight of the debt variables")
[Scatterplot matrix of r6, r7 and r8 by DIFF, titled "Weight of the debt variables"]
Bivariate analysis: 2 attributes
> op <- par(mfrow = c(1, 2), cex.axis = 0.6, cex.lab = 0.6)
> mosaicplot(table(farms[, c(3, 2)]), color = TRUE, main = "")
> barplot(table(farms[c(2, 5)]), beside = TRUE, legend.text = c("failing",
+ "healthy"), horiz = TRUE, cex.names = 0.5, col = col.y,
+ args.legend = list(cex = 0.5), las = 2)
> par(op)
[Left: mosaic plot of DIFF by STATUS (company / proprietorship); right: horizontal bar plot of type of farming by DIFF (failing / healthy)]
Pearson’s Chi-squared Test
Pearson’s Chi-squared Test:
> t1 <- with(farms, chisq.test(table(DIFF, CNTY)))
> print(t1)
Pearson's Chi-squared test
data: table(DIFF, CNTY)
X-squared = 5.9929, df = 3, p-value = 0.1120
We notice that the value of χ² is not large enough to be “abnormal” compared to a χ² distribution. The link between the two variables is not significant...
Cramer’s V statistic
We first have to calculate Pearson’s Chi-squared statistic:
> t2 <- with(farms, chisq.test(table(DIFF, STATUS)))
We obtain the Cramer’s V statistic like that:
> V.2 <- sqrt(t2$statistic/n/min(nlevels(farms$DIFF), nlevels(farms$STATUS)))
> names(V.2) <- "Cramer's V statistic"
> print(V.2)
Cramer's V statistic
0.05408548
If 0 < V ≤ 0.25 the link is low, if 0.25 < V ≤ 0.6 the link is medium, if V > 0.6 the link is strong. In this case, the link is low. We will see on the next slide that the links between the attributes and Y are not strong.
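The computation above can be wrapped in a small helper function (a sketch mirroring the slide’s formula; note that the usual textbook definition of Cramér’s V divides by min(r, c) − 1 rather than min(r, c)):

```r
# Cramer's V as computed on the slide, for any two-way contingency table
cramer.v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab)$statistic)
  unname(sqrt(chi2 / sum(tab) / min(nrow(tab), ncol(tab))))
}
# Toy 2x2 table with hypothetical counts
tab <- matrix(c(89, 135, 518, 518), nrow = 2)
cramer.v(tab)
```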
Cramer’s V statistic summary
> res.cramer <- NULL
> for (i in c(1, 3, 5, 6, 8)) {
+ t.k <- with(farms, chisq.test(table(DIFF, farms[,
+ i])))
+ res.cramer <- c(res.cramer, sqrt(t.k$statistic/n/min(nlevels(farms$DIFF),
+ nlevels(farms[, i]))))
+ }
> names(res.cramer) <- names(farms)[c(1, 3, 5, 6, 8)]
> res <- data.frame(t(res.cramer))
> row.names(res) <- "values"
> matable <- xtable(res, digits = 3, caption = "Cramer's V statistic")
> print(matable, file = "vstat.tex", size = "tiny")
        CNTY   STATUS  ToF    OWNLAND  HARVEST
values  0.049  0.054   0.087  0.040    0.044
Table: Cramer’s V statistic
Empirical Odds, Odds Ratio and Relative Risk (1)
Consider the variable STATUS with two levels and the following contingency table:
> tab1 <- with(farms, addmargins(table(DIFF, STATUS)))
> matable <- xtable(tab1, digits = 3, align = "l|cc|r",
+ caption = "Contingency Table")
> print(matable, hline.after = c(0, 2), file = "V.tex",
+ size = "tiny")
         company  proprietorship    Sum
failing       89             518    607
healthy      135             518    653
Sum          224            1036   1260
Table: Contingency Table
Empirical Odds, OR and RR (2)
Prevalences:
I π(company) = #(Y=1 | X=company) / #(X=company)
I π(prop) = #(Y=1 | X=prop) / #(X=prop)
I p1 = #(Y=1) / n:
> res.preval <- tab1[1, ]/tab1[3, ]
> names(res.preval) <- c("pi.comp", "pi.prop", "p.1")
> print(res.preval)
pi.comp pi.prop p.1
0.3973214 0.5000000 0.4817460
Empirical Odds, OR and RR (3)
I Odds: among the company farms, the odds of failing are 0.66 (= #(Y=1 | X=company) / #(Y=0 | X=company) = 89/135). Note the odds are equal to 1 for proprietorship (518/518).
I OR = [π(prop) / (1 − π(prop))] / [π(comp) / (1 − π(comp))] = (518/518) / (89/135) ≈ 1.5.
I RR = π(prop) / π(company) = 0.5/0.4 ≈ 1.25
→ the chances of failing are higher in the proprietorship group
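These quantities can be checked numerically (a sketch using the counts from the contingency table above; the slide’s 1.25 comes from rounding π(company) to 0.4):

```r
# Counts: failing/healthy within company and proprietorship farms
n.fail.comp <- 89;  n.heal.comp <- 135
n.fail.prop <- 518; n.heal.prop <- 518
odds.comp <- n.fail.comp / n.heal.comp                # ~0.66
odds.prop <- n.fail.prop / n.heal.prop                # 1
OR <- odds.prop / odds.comp                           # ~1.52
RR <- (n.fail.prop / (n.fail.prop + n.heal.prop)) /
      (n.fail.comp / (n.fail.comp + n.heal.comp))     # ~1.26
```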
Bivariate analysis: one attribute and one numerical variable (1)
> par(mfrow = c(1, 3))
> boxplot(r11 ~ DIFF, data = farms, xlab = "r11", col = col.y)
> boxplot(r12 ~ DIFF, data = farms, xlab = "r12", col = col.y)
> boxplot(r14 ~ DIFF, data = farms, xlab = "r14", col = col.y)
> par(op)
> title("Liquidity variables")
[Boxplots of r11, r12 and r14 by DIFF (failing / healthy), titled "Liquidity variables"]
Bivariate analysis: one attribute and one numerical variable (2)
> library(lattice)
> histogram(~r17 | DIFF,
+ layout = c(1, 2),
+ nint = 20, data = farms,
+ panel = function(x,
+ ...) {
+ panel.histogram(x,
+ ..., col = col.y[panel.number()])
+ })
[Histograms of r17 by DIFF (failing / healthy panels)]
Correlation ratio
η² = ( Σ_{l=1..r} n_l (X̄_l − X̄)² ) / (n σ²_X)

where n_l and X̄_l are the size and mean of class l, X̄ the overall mean and σ²_X the overall variance of X.
> n <- nrow(farms)
> deno <- (n - 1) * var(farms$r1)
> eta.r1 <- with(farms, sum(table(DIFF) * (by(r1,
+ DIFF, mean) - mean(r1))^2)/deno)
> print(eta.r1)
[1] 0.419557
Correlation ratio (2)
Objective: calculate the correlation ratio of each numerical variable with Y and draw a dot chart depending on the topic of the variables (“capitalization”, “liquidity”, etc.)
> res <- NULL
> for (k in c(4, 7, 9:30)) {
+ deno <- (n - 1) * var(farms[, k])
+ res <- c(res, with(farms, sum(table(DIFF) *
+ (by(farms[, k], DIFF, mean) - mean(farms[,
+ k]))^2)/deno))
+ }
> names(res) <- names(farms[c(4, 7, 9:30)])
> topics <- factor(c("structure", "structure", rep("capitalization",
+ 5), rep("Weight of the debt", 3), rep("Liquidity",
+ 3), rep("Debt servicing", 5), "Capital profitability",
+ rep("Earnings", 3), rep("Productive activity",
+ 2)))
Correlation ratio (3)
> dotchart(res, groups = topics, main = "Correlation ratio by topics")
> abline(v = 0.25, col = "red", lty = 2)
[Dot chart "Correlation ratio by topics": one point per variable, grouped by topic, with a red reference line at 0.25]
Student’s t-Test
> with(farms, t.test(r36[DIFF == "failing"], r36[DIFF ==
+ "healthy"]))
Welch Two Sample t-test
data: r36[DIFF == "failing"] and r36[DIFF == "healthy"]
t = 3.4574, df = 1087.007, p-value = 0.0005666
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.04941916 0.17912308
sample estimates:
mean of x mean of y
1.241827 1.127556
See the formula of Welch’s test at http://en.wikipedia.org/wiki/Welch%27s_t-test
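The Welch statistic printed above can be reproduced by hand (a sketch on toy vectors, not the farms data):

```r
# Welch's t: difference of means over the unpooled standard error
x <- c(1.2, 0.9, 1.4, 1.1)
y <- c(0.8, 1.0, 0.7)
t.hand <- (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
t.test(x, y)$statistic    # t.test uses Welch's test by default: same value
```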
How to discretize a numerical variable ?
1. start by using traditional methods such as “quantile”, “Fisher-Jenks”, etc. included in package classInt, with a high number of classes.
2. try to aggregate classes using the weights of evidence criteria
WoE = log(Odds) = log( #(y=1 | X) / #(y=0 | X) )
3. in the end, 5-6 classes seem to be enough
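As a toy illustration of the WoE criterion (hypothetical counts, not from the study):

```r
# WoE of a class: log of (number failing / number healthy) within the class
woe <- function(n.fail, n.healthy) log(n.fail / n.healthy)
woe(30, 10)   # log(3): failing farms over-represented in this class
woe(10, 10)   # 0: balanced class
```

Adjacent classes with similar WoE values are natural candidates for aggregation.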
Discretization of r1
> library(classInt)
> interval <- classIntervals(farms$r1, n = 12,
+ style = "quantile")$brks
> nb.int <- findInterval(farms$r1, interval,
+ all.inside = TRUE)
> woe <- by(farms$DIFF, as.factor(nb.int),
+ function(x) log(length(which(x ==
+ "failing"))/length(which(x ==
+ "healthy"))))
> plot((interval[1:12] + interval[2:13])/2,
+ woe, main = "Weight Of Evidence",
+ xlab = "variable r1")
> abline(v = interval, lty = 2, col = "grey")
> abline(v = interval[c(4, 8, 11)], col = "red")
Choice of classes
4 classes seem to be enough to discretize r1: “low”, “medium”, “high” and “very high”.
[Plot "Weight Of Evidence": WoE against the class midpoints of r1, with the chosen cut points in red]
Multivariate analysis
I Principal Component Analysis (PCA) to complete the analysis of the covariance/variance of the explanatory variables
I Hierarchical cluster analysis (if n < 1000) with the hclust function, or k-means clustering with the kmeans function
> res.pca <- princomp(farms[,
+ 9:30])
> biplot(res.pca, col = c("grey",
+ "blue"))
Other R Tools
I package Rcmdr: a Tk menu with several graphics and tests, with a minimum of programming
I package rattle: a package which depends on a lot of packages, dedicated to scoring methods
I package iplots: interactive selection on basic graphics such as histograms, barplots, etc., useful for the detection of multivariate outliers
I package ggplot2: another “generation” of graphics
Conclusion of this part
Do you already have an idea of the characteristics of the farms which failed? If not, you may continue to explore the data...
Introduction
Preparing the database
Exploratory Data Analysis
Logistic Regression
Sampling
I working sample (70%) farms.work: used for model selection
I test sample (30%) farms.test: used to test the selected model
> set.seed(121181)
> ind <- sample(1:n, round(0.7 * n))
> farms.work <- farms[ind, ]
> farms.test <- farms[-ind, ]
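A quick sanity check of the split sizes (a sketch assuming n = 1260, as in this study):

```r
# With n = 1260, the working sample has round(0.7 * n) = 882 farms
n <- 1260
ind <- sample(1:n, round(0.7 * n))
length(ind)        # 882
n - length(ind)    # 378 farms left for the test sample
```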
Ordinary Least Squares (OLS) model (1)
How to explain a numerical variable by other explanatory variables (both numerical and attribute)?
I use of the function lm: r1 is the variable to explain, DIFF and HECTARE are the explanatory variables:
> res.lm <- lm(r1 ~ DIFF + HECTARE, data = farms.work)
I What results are included in res.lm?
> names(res.lm)
I function anova.lm calculates the analysis of variance table:
> anova(res.lm)
I function summary.lm computes a list of summary statistics (F statistic, adjusted R², etc.) of the OLS model:
> summary(res.lm)
Ordinary Least Squares (OLS) model (2)
I function plot.lm returns diagnostic plots:
> dev.new()
> par(mfrow = c(2, 2))
> plot(res.lm)
> par(op)
I function influence.measures returns statistics such as Cook’s distance to detect influential observations:
> influence.measures(res.lm)
Generalized Linear Model (GLM)
How to explain a normal, binomial, Poisson or gamma variable by explanatory variables (both numerical and attribute)?
I use of the function glm and option family to give the name of the distribution and the link used:
> res.glm <- glm(Y ~ ., family = binomial(link = "logit"),
+ data = farms.work[, -2])
I the functions used for a glm object are the same as for lm:
> names(res.glm)
> anova(res.glm)
> summary(res.glm)
> dev.new()
> op <- par(mfrow = c(2, 2))
> plot(res.glm)
> par(op)
> influence.measures(res.glm)
Choice of variables
I You can choose the function stepAIC (package MASS), which performs stepwise model selection by AIC, applied to the res.glm object constructed previously (here with k = log(n), i.e. a BIC-type penalty):
> res.step <- stepAIC(res.glm, direction = "backward",
+ k = log(nrow(farms.work)))
I You can also choose the function bic.glm of package BMA:
> library(BMA)
> choix.bic.glm <- bic.glm(farms.work[, -c(2, 31)], farms.work$Y,
+ strict = FALSE, OR = 20, glm.family = "binomial",
+ factor.type = TRUE)
> summary(choix.bic.glm, conditional = T, digits = 2)
> imageplot.bma(choix.bic.glm)
Comparing two models
I with stepAIC, we keep the variables: CNTY, STATUS, HECTARE, r1, r5, r12, r14, r21 and r36
I with the first model of bic.glm, we keep the variables: CNTY, STATUS, HECTARE, r1, r3, r17, r24 and r36
To compare the two methods, we can use the AIC criterion:
> res.bic.glm <- glm(Y ~ STATUS + CNTY + HECTARE + r1 +
+ r3 + r17 + r24 + r36, family = binomial(link = "logit"),
+ data = farms.work)
> AIC(res.step)
[1] 423.0119
> AIC(res.bic.glm)
[1] 422.6086
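As a reminder of what AIC trades off, here is a minimal numeric sketch in Python with purely illustrative log-likelihoods (these are not the farm models): AIC = 2k − 2 log L, and the model with the smaller value is preferred.

```python
import math

def aic(log_lik, k):
    # Akaike information criterion: 2 * (number of parameters) - 2 * log-likelihood
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    # Schwarz/Bayesian criterion: replaces the constant 2 by log(n)
    return math.log(n) * k - 2 * log_lik

# illustrative values only
m1 = aic(-200.5, k=10)  # larger model, better fit
m2 = aic(-201.3, k=8)   # smaller model, slightly worse fit
best = "m2" if m2 < m1 else "m1"  # the penalty makes the smaller model win here
```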
We keep the second model ...
Coefficients of the model
> library(xtable)
> matable <- xtable(res.bic.glm, digits = 3, caption = "Coefficients of the selected model")
> print(matable, file = "coeff.tex", size = "tiny")
                       Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)            -6.118     1.089        -5.616    0.000
STATUSproprietorship   -1.543     0.400        -3.860    0.000
CNTYNord               -2.257     0.413        -5.465    0.000
CNTYOrne               -1.472     0.393        -3.748    0.000
CNTYSeine-Maritime     -0.186     0.388        -0.478    0.633
HECTARE                -0.035     0.004        -7.836    0.000
r1                     11.642     0.892        13.051    0.000
r3                      5.915     0.785         7.531    0.000
r17                    31.362     6.214         5.047    0.000
r24                    -7.437     2.008        -3.705    0.000
r36                     1.532     0.332         4.618    0.000

Table: Coefficients of the selected model
Be careful before interpreting the coefficients β: notice for example the sign associated with STATUS, contrary to what we observed in the EDA, most likely due to a problem of multicollinearity ...
Estimated adjusted Odds ratio
We may calculate the odds ratios and confidence intervals by using the functions summary and coef.
> lreg.coeffs <- coef(summary(res.bic.glm))
> lreg.coeffs[c("r1", "r3", "r17", "r24", "r36"), 1] <- lreg.coeffs[c("r1",
+ "r3", "r17", "r24", "r36"), 1] * 0.01
> odds <- data.frame(signif(cbind(exp(lreg.coeffs[, 1]),
+ exp(lreg.coeffs[, 1] - 1.96 * lreg.coeffs[, 2]),
+ exp(lreg.coeffs[, 1] + 1.96 * lreg.coeffs[, 2])),
+ 3))
> names(odds) <- c("odds", "l.95", "u.95")
In order to interpret the odds ratios associated with the ratios (r1, etc.), we have multiplied these coefficients by 0.01.
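The computation above is just exp(β) with a Wald interval exp(β ± 1.96 · se). A self-contained numeric check in Python (the numbers are the r1 row of the coefficient table; rescaling both the estimate and its standard error by 0.01 is our simplifying assumption for the illustration):

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio exp(beta) with its Wald 95% confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# r1 row of the coefficient table, rescaled by 0.01 as on this slide
or_r1, lo, hi = odds_ratio_ci(11.642 * 0.01, 0.892 * 0.01)
# an odds ratio above 1: a 0.01 increase of r1 raises the odds of failing
```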
Other issues in modelling
I transforming all numeric variables into attributes, as we have seen in the previous section.
I transforming all the attributes into numeric variables (Multiple Correspondence Analysis with function dudi.acm of package ade4).
I choosing an econometric approach: for example, try a model taking into account an economic “a priori” on the variables. The dot chart of the weight of evidence may also bring a good intuition.
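The slides do not show the weight-of-evidence computation itself; under one common convention, the WoE of a category is the log of the share of healthy farms in that category over the share of failing farms. A Python sketch with hypothetical counts:

```python
import math

def weight_of_evidence(healthy_counts, failing_counts):
    """WoE per category: log((share of healthy in category) /
    (share of failing in category)) -- one common convention."""
    h_tot, f_tot = sum(healthy_counts), sum(failing_counts)
    return [math.log((h / h_tot) / (f / f_tot))
            for h, f in zip(healthy_counts, failing_counts)]

# hypothetical counts for a three-level attribute
woe = weight_of_evidence([50, 30, 20], [10, 30, 60])
# positive WoE: the category is over-represented among healthy farms
```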
Prediction on the test sample
Calculate the following term by using the function predict:
Y* = Xβ
> eta <- predict(res.step, newdata = farms.test)
Calculate then the score:
µ = exp(Y*)/(1 + exp(Y*))
> mu <- exp(eta)/(1 + exp(eta))
If we choose an arbitrary cut-off equal to 0.5, we calculate Y such as:
> Y.pred = relevel(factor(ifelse(mu > 0.5, 1, 0)), ref = "1")
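The two formulas above (linear predictor, then logistic transform, then cut-off) are easy to check numerically; a small Python sketch (function names are ours):

```python
import math

def score(eta):
    """mu = exp(eta) / (1 + exp(eta)), the logistic transform of the
    linear predictor eta = x' beta."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def predict_class(eta, cutoff=0.5):
    # predict "failing" (1) when the score exceeds the cut-off
    return 1 if score(eta) > cutoff else 0

print(score(0.0))  # a zero linear predictor gives a score of exactly 0.5
print(predict_class(2.0), predict_class(-2.0))
```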
Confusion Matrix and vocabulary
                        actual value Y = 1   actual value Y = 0   Total
predicted value Y = 1   TP                   FP
predicted value Y = 0   FN                   TN
Total                   P                    N                    P + N
I True Positive Rate TPR: TP/P
I False Positive Rate FPR: FP/N
I Accuracy: (TP + TN)/(P + N)
I Positive predictive value PPV: TP/(TP + FP)
I Sensitivity: TP/(TP + FN)
I Specificity: TN/(FP + TN)
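All six quantities follow directly from the four cells of the matrix. A numeric sketch in Python; the counts below are our reconstruction, chosen to reproduce the rates printed on the next slide (193 failing and 185 healthy farms in the test sample):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Rates from the confusion matrix: P = TP + FN actual positives,
    N = FP + TN actual negatives."""
    p, n = tp + fn, fp + tn
    return {
        "TPR": tp / p,                 # true positive rate
        "FPR": fp / n,                 # false positive rate
        "accuracy": (tp + tn) / (p + n),
        "PPV": tp / (tp + fp),         # positive predictive value
        "sensitivity": tp / (tp + fn), # equals the TPR
        "specificity": tn / (fp + tn),
    }

m = confusion_metrics(tp=173, fp=13, fn=20, tn=172)
# m["TPR"] is about 0.896 and m["accuracy"] about 0.913, as on the next slide
```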
Example with a Cutoff equal to 0.5
Construction of the confusion matrix:
> ma.conf <- addmargins(table(Y.pred, farms.test$DIFF))
TPR and FPR:
> ma.conf[1, 1:2]/ma.conf[3, 1:2]
failing healthy
0.89637306 0.07027027
Accuracy:
> (ma.conf[1, 1] + ma.conf[2, 2])/sum(ma.conf[3, 1:2])
[1] 0.9126984
Sensitivity and Specificity:
> ma.conf[1, 1]/ma.conf[3, 1]
[1] 0.896373
> ma.conf[2, 2]/ma.conf[3, 2]
[1] 0.9297297
The ROC curve (1)
How to choose the cut-off ?
I Choose two criteria seen previously, for example the TPR and FPR criteria.
I We would like to choose a cut-off such that the TPR is large while the FPR is small.
I The ROC curve draws these two criteria simultaneously while varying the cut-off: when the cut-off equals 1, no farm has been predicted as failing, so TPR = FPR = 0, etc.
I Use of the ROCR package (see http://rocr.bioinf.mpi-sb.mpg.de/).
> library(ROCR)
> pred <- prediction(mu, farms.test$Y)
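The mechanics behind the ROC curve can be sketched without ROCR: for each cut-off, predict "failing" when the score exceeds it and record the (FPR, TPR) pair. A toy Python illustration with made-up scores:

```python
def roc_points(scores, labels, cutoffs):
    """One (FPR, TPR) point per cut-off, predicting Y = 1 when score > cut-off."""
    p = sum(labels)           # number of actual positives
    n = len(labels) - p       # number of actual negatives
    pts = []
    for c in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s > c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > c and y == 0)
        pts.append((fp / n, tp / p))
    return pts

# two failing (1) and two healthy (0) farms with made-up scores
pts = roc_points([0.9, 0.7, 0.3, 0.1], [1, 1, 0, 0], [1.0, 0.5, 0.0])
# cut-off 1 predicts no farm as failing: (0, 0); cut-off 0 predicts all: (1, 1)
```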
The ROC curve (2)
In this case, a cut-off equal to 0.4 seems to be a good compromise to obtain both a good TPR and a good FPR.
> perf <- performance(pred, measure = "tpr", x.measure = "fpr")
> plot(perf, colorize = T, print.cutoffs.at = seq(0, 1,
+ by = 0.1), text.adj = c(1.2, 1.2), lwd = 3)
[Figure: ROC curve, true positive rate vs. false positive rate, with the cut-off values from 0 to 1 printed along the curve]
Cross-validation for GLM
This method is an alternative to the AIC criterion for the choice of the model and may be recommended when the size of the sample is not large enough:
1. Divide the data (of size n) into K groups
2. For each group, fit a GLM omitting that group, and calculate the percentage of misclassified observations, with the function cost, in the group that was omitted from the fit
> require(boot)
> res.glm <- glm(Y ~ STATUS + CNTY + HECTARE + r1 + r3 +
+ r17 + r24 + r36, family = binomial(link = "logit"),
+ data = farms)
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.6)
> res.cv <- cv.glm(farms, res.glm, cost, K = 10)
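The K-fold logic that cv.glm implements can be sketched generically (this is our illustration, not the boot internals): shuffle the indices, cut them into K folds, fit on K − 1 folds and average the cost on the held-out fold.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(xs, ys, fit, predict, cost, k=10, seed=0):
    """K-fold estimate: fit on the other folds, average cost on the held-out one."""
    errs = []
    for fold in kfold_indices(len(xs), k, seed):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errs.append(sum(cost(ys[i], predict(model, xs[i])) for i in fold) / len(fold))
    return sum(errs) / k

# toy check: a fixed 0.5 threshold separates this toy data perfectly
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
err = cv_error(xs, ys, fit=lambda X, Y: 0.5,
               predict=lambda m, x: 1 if x > m else 0,
               cost=lambda y, p: int(y != p), k=4)
```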
Conclusion
I The EDA gave us some answers to the problem and helped us understand the data.
I We found one possible model which has good properties by the usual criteria, even if the interpretation of this model is not easy because of the inhomogeneity of, and correlation between, the variables: other models could be found.
I However, we may use this model to prevent some farms from failing: in the case where we had all the explanatory variables except Y, farms with a score µ > 0.4 would be warned...
Other methods for scoring with R
See http://cran.r-project.org/doc/contrib/Sharma-CreditScoring.pdf, which deals with the following methods:
I Bagging: package adabag
I Random Forest: package randomForest
I Support Vector Machines: package e1071
I Generalized Additive Model: package gam