
Scoring with R

Summer School on Mathematical Methods in Finance and Economy

Hanoi

Thibault LAURENT

Toulouse School of Economics

June 2010 (Slides modified in August 2010)


Introduction

Preparing the database

Exploratory Data Analysis

Logistic Regression


Background study

Dominique Desbois (2008), “Introduction to Scoring Methods: Financial Problems of Farm Holdings”, CS-BIGS, 2(1): 56-76.

Objectives: analyze the causes of farm bankruptcy and find a model that identifies farms in financial difficulty, so that failures can be prevented.

Analysis plan:

1. Preparing the database

2. Exploratory data analysis

3. Logistic regression


Description of the data set

- 1260 farms specialized in field crops

- the response variable Y takes the value “failing” (Y = 1) if the farm failed and “healthy” otherwise (Y = 0)

- the explanatory variables X contain information about the structure of the farm (legal status, type of farming index, agricultural area used, etc.) and 22 ratios grouped under the following topics: Capitalization, Weight of the Debt, Liquidity, Debt servicing, Capital profitability, Earnings and Productive activity.

See p. 4 of Desbois (2008) for more details.


Packages used in this course

You can install the following packages (function install.packages) or update them (function update.packages) at the beginning of your R session:

> install.packages(c("foreign", "xtable", "lattice"))

> install.packages(c("car", "classInt", "ROCR", "BMA"))

Preparing the database

Importing the data set

- Download the “desbois.zip” file from http://www.bentley.edu/csbigs/csbigs-v2-n1.cfm

- Unzip the file.

- Import the “desbois.sav” SPSS file into R after loading the foreign package (functions for reading and writing data stored by statistical packages such as Minitab, SAS, Stata, etc.):

> library(foreign)

> farms <- read.spss("desbois.sav", to.data.frame = TRUE)


Recoding?

The main objective of recoding is to obtain a first working version of the data set:

1. choose the right format for each variable,

2. check whether there are missing values,

3. choose short, intuitive names for variables and attribute levels.


Recoding with R

1. check the structure of the data set:

> str(farms)

2. reorder the levels of the variable of interest:

> farms$DIFF <- relevel(farms$DIFF, ref = "failing")

3. create a binary variable for the logistic regression:

> farms$Y <- factor(ifelse(farms$DIFF == "failing", 1, 0))

4. simplify the levels of some attributes:

> levels(farms$STATUS) <- c("company", "proprietorship")

> levels(farms$ToF) <- c("cereals", "gen.cropping", "dairy.farm",

+     "mix.livestock", "var.crops-livestock", "soilless.breed")


Missing values?

Are there any missing values in the data set?

> any(is.na(farms))

[1] FALSE

There are no missing values here. If there were, they could be replaced using imputation techniques (see for example http://en.wikipedia.org/wiki/Imputation_(statistics)).
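A minimal sketch of what the check and a simple imputation could look like when values are missing. The data frame `toy` and its contents are hypothetical (the farms data has no missing values), and mean imputation is only the simplest of the techniques mentioned above, not a recommendation.

```r
# Toy data frame with missing values (hypothetical, not the farms data)
toy <- data.frame(r1 = c(0.2, NA, 0.5, 0.9),
                  r2 = c(1.1, 0.7, NA, NA))

# Number of missing values per column
colSums(is.na(toy))

# Mean imputation, column by column: each NA is replaced
# by the mean of the observed values of the same column
imputed <- toy
for (v in names(imputed)) {
  miss <- is.na(imputed[[v]])
  imputed[[v]][miss] <- mean(imputed[[v]], na.rm = TRUE)
}

any(is.na(imputed))
```

Mean imputation shrinks the variance of the imputed variable, so it is usually only acceptable when few values are missing.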


Exploratory Data Analysis?

Objectives:

1. obtain first elements of an answer to the question: what are the causes of farm bankruptcy?

2. detect outliers among the observations or collinearity between variables.

3. create new relevant variables (transformations with log, exp, etc., or interactions between variables).


Analysis of the data.frame object

farms is a data.frame, the workhorse data container for analysis in R. Like most classes, it comes with the common methods (print, plot, summary).

> class(farms)

> summary(farms)

> plot(farms)

Useful function to visualize the data set:

> edit(farms)


Basic statistics with R

For a numeric variable:

> n <- nrow(farms)

> min(farms$r1)

> max(farms$r1)

> mean(farms$r1)

> median(farms$r1)

> quantile(farms$r1)

> sd(farms$r1) == sqrt(var(farms$r1))

> stem(farms$r1)

For an attribute variable:

> dis.Y <- table(farms$DIFF)

> margin.table(dis.Y)

> all(prop.table(dis.Y) ==

+ dis.Y/margin.table(dis.Y))

> addmargins(dis.Y)

- Skewness and kurtosis statistics can be computed after loading the e1071 package

- the package r2lh provides functions to export some R analyses in LaTeX format
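For readers who want to see what these two statistics measure, here is a hand-written sketch using plain moment-based definitions. The function names are made up for this illustration; the e1071 package offers several variants of both statistics, so its results may differ slightly from these.

```r
# Moment-based skewness: third central moment over the 1.5th power of the second
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second moment, minus 3
kurtosis_excess <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

x <- c(1, 2, 3, 4, 5)   # symmetric toy sample
skewness_moment(x)
kurtosis_excess(x)
```

A symmetric sample has zero skewness; a negative excess kurtosis indicates lighter tails than a normal distribution.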


Graphics

Main advantages of using graphics:

- they give a good summary of the data

- they are easy to understand and comment on

Be careful: graphics may suggest intuitions, but any conclusion must be confirmed by a statistical test! Some links on R graphics:

- http://addictedtor.free.fr/graphiques/

- http://csg.sph.umich.edu/docs/R/graphics-1.pdf


Attribute variable analysis: Bar plot

> col.y = colors()[c(641, 615)]

> barplot(dis.Y, main = "Y", col = col.y, space = 0.5)

[Figure: bar plot titled "Y", one bar for failing farms and one for healthy farms; vertical axis from 0 to 600]

In this study, the number of failing farms is close to the number of healthy farms. colors() returns a vector of the names of the colors available in R.


Attribute variable analysis: Pie Chart

> label.ToF = paste(round(prop.table(table(farms$ToF)),

+ 3) * 100, "%")

> with(farms, pie(table(ToF), main = "Type of Farms",

+ labels = label.ToF, col = heat.colors(6),

+ cex = 0.8))

> legend("bottomleft", legend = levels(farms$ToF),

+ fill = heat.colors(6), cex = 0.7)

[Figure: pie chart titled "Type of Farms", with slice labels 26.9 %, 24.3 %, 37.1 %, 4 %, 6.2 % and 1.4 %; legend: cereals, gen.cropping, dairy.farm, mix.livestock, var.crops-livestock, soilless.breed]


Numerical variable analysis: boxplot

> boxplot(farms$r2, main = "variable r2", col = "lightgrey")

[Figure: boxplot titled "variable r2"; values range from 0.0 to 1.0]

This variable does not seem to contain any outlier...


Numerical variable analysis: histogram

> plot(density(farms$r3), col = "red", type = "n", main = "")

> hist(farms$r3, breaks = 15, freq = FALSE, col = "royalblue", add = TRUE)

> rug(farms$r3)

> lines(density(farms$r3), col = "red")

[Figure: histogram of r3 with its kernel density estimate and a rug; x axis from −1.5 to 1.0, y axis "Density" from 0.0 to 1.5; N = 1260, Bandwidth = 0.04652]

Remark: r3 contains outliers (negative values).

What can be done after a univariate analysis

- delete or modify observations with abnormal values: very high or very low values for a numeric variable, or levels with too few observations for an attribute

> low.index <- which(farms$r3 < 0)

> farms$r3 <- with(farms, replace(r3, low.index, mean(r3)))

> farms$r4 <- with(farms, replace(r4, low.index, mean(r4)))

> farms$r8 <- with(farms, replace(r8, low.index, mean(r8)))

> farms$r14 <- with(farms, replace(r14, low.index, mean(r14)))

- transform a variable (x ↦ log(a + x)) to obtain a distribution closer to normal

- apply the more general Box-Cox transformation (function BoxCox of the forecast package)
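The Box-Cox family mentioned above can be sketched by hand as follows. The function `box_cox` is written for this illustration only (it is not the BoxCox function of the forecast package) and assumes strictly positive values; lambda = 0 gives the log transformation.

```r
# Box-Cox transformation: (x^lambda - 1) / lambda, with log(x) as the limit at lambda = 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))   # the transformation is only defined for positive values
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x <- c(0.5, 1, 2, 4)   # toy positive values
box_cox(x, 0)          # same as log(x)
box_cox(x, 0.5)        # square-root type transformation
box_cox(x, 1)          # simple shift: x - 1
```

For a variable with zero or negative values, a constant a can be added first, as in x ↦ log(a + x).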


Bivariate analysis: 2 numerical variables

> with(farms, cov(r1, r2))

> with(farms, all.equal(cov(r1, r2)/(sd(r1) * sd(r2)), cor(r1, r2)))

> tab.cor <- cor(farms[, c("r1", "r2", "r3", "r4", "r5")])

Reproducible research with LaTeX:

> library(xtable)

> matable <- xtable(tab.cor, digits = 3, caption = "Correlation tabular")

> print(matable, file = "corr.tex", size = "tiny")

        r1      r2      r3      r4      r5
r1   1.000  -0.908   0.121   0.759   0.818
r2  -0.908   1.000   0.026  -0.643  -0.790
r3   0.121   0.026   1.000   0.642  -0.370
r4   0.759  -0.643   0.642   1.000   0.283
r5   0.818  -0.790  -0.370   0.283   1.000

Table: Correlation table of the Capitalization variables

Several Capitalization variables are strongly correlated (for example r1 and r2: −0.908), while others are almost uncorrelated (r2 and r3: 0.026).

Scatter plot (with lattice package)

> library(lattice)

> xyplot(r2 ~ r1, data = farms, groups = DIFF, auto.key = list(columns = 2,

+ title = "Scatter plot"), par.settings = simpleTheme(col = col.y))

[Figure: scatter plot of r2 (y axis, 0.0 to 1.0) against r1 (x axis, 0 to 3), points colored by DIFF (failing, healthy); legend title "Scatter plot"]

(low values of r2 + high values of r1) → high probability of failing


Scatterplot Matrices (with car package)

> library(car)

> scatterplotMatrix(~r6 + r7 + r8 | DIFF, data = farms,

+ col = col.y, main = "Weight of the debt variables")

[Figure: scatterplot matrix of r6, r7 and r8, points colored by DIFF (failing, healthy); title "Weight of the debt variables"]

Bivariate analysis: 2 attributes

> op <- par(mfrow = c(1, 2), cex.axis = 0.6, cex.lab = 0.6)

> mosaicplot(table(farms[, c(3, 2)]), color = TRUE, main = "")

> barplot(table(farms[c(2, 5)]), beside = TRUE, legend.text = c("failing",

+ "healthy"), horiz = TRUE, cex.names = 0.5, col = col.y,

+ args.legend = list(cex = 0.5), las = 2)

> par(op)

[Figure: left, mosaic plot of STATUS (company, proprietorship) against DIFF (failing, healthy); right, horizontal bar plot of the number of failing and healthy farms for each type of farming (cereals, gen.cropping, dairy.farm, mix.livestock, var.crops-livestock, soilless.breed), counts from 0 to 200]

Pearson’s Chi-squared Test

Pearson’s Chi-squared Test:

> t1 <- with(farms, chisq.test(table(DIFF, CNTY)))

> print(t1)

Pearson's Chi-squared test

data: table(DIFF, CNTY)

X-squared = 5.9929, df = 3, p-value = 0.1120

With a p-value of 0.112, the value of χ2 is not large enough to be “abnormal” compared to a χ2 distribution with 3 degrees of freedom. The link between the two variables is not significant at the usual 5% level.


Cramer’s V statistic

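We first compute Pearson’s Chi-squared statistic:

> t2 <- with(farms, chisq.test(table(DIFF, STATUS)))

Cramer’s V statistic is then obtained as follows:

> V.2 <- sqrt(t2$statistic/n/min(nlevels(farms$DIFF), nlevels(farms$STATUS)))

> names(V.2) <- "Cramer's V statistic"

> print(V.2)

Cramer's V statistic

0.05408548

If 0 < V ≤ 0.25 the link is weak, if 0.25 < V ≤ 0.6 the link is moderate, and if V > 0.6 the link is strong. In this case, the link is weak. Note that the usual definition of Cramer’s V divides by min(r, c) − 1 rather than min(r, c), where r and c are the numbers of levels of the two attributes; since DIFF has two levels, the usual definition would give values √2 times larger than those computed here, which does not change the conclusion. We will see in the next slide that the links between the attributes and Y are not strong.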
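As a cross-check, here is a sketch of Cramer's V under its usual definition, with min(r, c) − 1 in the denominator. It is self-contained: the counts are the DIFF by STATUS contingency table reported in these slides, and `cramers_v` is a helper written for this illustration. By default it switches off the continuity correction that chisq.test applies to 2 × 2 tables, so the result differs from the value printed above on both counts.

```r
# DIFF x STATUS counts as reported in the contingency table of these slides
tab <- matrix(c(89, 135, 518, 518), nrow = 2,
              dimnames = list(DIFF = c("failing", "healthy"),
                              STATUS = c("company", "proprietorship")))

# Cramer's V with the usual denominator n * (min(r, c) - 1)
cramers_v <- function(tab, correct = FALSE) {
  chi2 <- unname(chisq.test(tab, correct = correct)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

cramers_v(tab)                  # without continuity correction
cramers_v(tab, correct = TRUE)  # with Yates' continuity correction
```

Both values stay well below 0.25, so the link between DIFF and STATUS remains weak under either convention.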


Cramer’s V statistic summary

> res.cramer <- NULL

> for (i in c(1, 3, 5, 6, 8)) {

+ t.k <- with(farms, chisq.test(table(DIFF, farms[,

+ i])))

+ res.cramer <- c(res.cramer, sqrt(t.k$statistic/n/min(nlevels(farms$DIFF),

+ nlevels(farms[, i]))))

+ }

> names(res.cramer) <- names(farms)[c(1, 3, 5, 6, 8)]

> res <- data.frame(t(res.cramer))

> row.names(res) <- "values"

> matable <- xtable(res, digits = 3, caption = "Cramer's V statistic")

> print(matable, file = "vstat.tex", size = "tiny")

         CNTY  STATUS    ToF  OWNLAND  HARVEST
values  0.049   0.054  0.087    0.040    0.044

Table: Cramer’s V statistic


Empirical Odds, Odds Ratio and Relative Risk (1)

Consider the variable STATUS, which has two levels, and the following contingency table:

> tab1 <- with(farms, addmargins(table(DIFF, STATUS)))

> matable <- xtable(tab1, digits = 3, align = "l|cc|r",

+ caption = "Contingency Table")

> print(matable, hline.after = c(0, 2), file = "V.tex",

+ size = "tiny")

          company  proprietorship       Sum
failing    89.000         518.000   607.000
healthy   135.000         518.000   653.000
Sum       224.000        1036.000  1260.000

Table: Contingency Table


Empirical Odds, OR and RR (2)

Prevalences:

- $\pi(\text{company}) = \frac{\#(Y=1 \mid X=\text{company})}{\#(X=\text{company})}$

- $\pi(\text{prop}) = \frac{\#(Y=1 \mid X=\text{prop})}{\#(X=\text{prop})}$

- $p_1 = \frac{\#(Y=1)}{n}$

> res.preval <- tab1[1, ]/tab1[3, ]

> names(res.preval) <- c("pi.comp", "pi.prop", "p.1")

> print(res.preval)

pi.comp pi.prop p.1

0.3973214 0.5000000 0.4817460


Empirical Odds, OR and RR (3)

- Odds: among the company farms, the odds of failing are $\frac{\#(Y=1 \mid X=\text{company})}{\#(Y=0 \mid X=\text{company})} = 89/135 \approx 0.66$. For proprietorships the odds are even ($518/518 = 1$).

- $OR = \frac{\pi(\text{prop}) / (1 - \pi(\text{prop}))}{\pi(\text{company}) / (1 - \pi(\text{company}))} = \frac{518/518}{89/135} \approx 1.5$

- $RR = \pi(\text{prop}) / \pi(\text{company}) = 0.5/0.397 \approx 1.26$

→ the chances of failing are higher in the group of proprietorships
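The three quantities can be recomputed directly from the contingency table. The sketch below re-enters the counts reported in these slides so that it runs without the farms data; the object names are chosen for this illustration.

```r
# DIFF x STATUS counts as reported in the contingency table above
tab <- matrix(c(89, 135, 518, 518), nrow = 2,
              dimnames = list(DIFF = c("failing", "healthy"),
                              STATUS = c("company", "proprietorship")))

# Prevalence of failure within each legal status
prevalence <- tab["failing", ] / colSums(tab)

# Odds of failing within each legal status
odds <- tab["failing", ] / tab["healthy", ]

# Odds ratio and relative risk of proprietorships relative to companies
odds_ratio <- unname(odds["proprietorship"] / odds["company"])
relative_risk <- unname(prevalence["proprietorship"] / prevalence["company"])

round(c(OR = odds_ratio, RR = relative_risk), 2)
```

The odds ratio is larger than the relative risk here because failure is a frequent event in this sample; the two only coincide approximately when the event is rare.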


Bivariate analysis: one attribute and one numerical variable (1)

> op <- par(mfrow = c(1, 3))

> boxplot(r11 ~ DIFF, data = farms, xlab = "r11", col = col.y)

> boxplot(r12 ~ DIFF, data = farms, xlab = "r12", col = col.y)

> boxplot(r14 ~ DIFF, data = farms, xlab = "r14", col = col.y)

> par(op)

> title("Liquidity variables")

[Figure: three side-by-side boxplots of r11, r12 and r14, each comparing failing and healthy farms; title "Liquidity variables"]

Bivariate analysis: one attribute and one numerical variable (2)

> library(lattice)

> histogram(~r17 | DIFF, layout = c(1, 2), nint = 20, data = farms,

+     panel = function(x, ...) {

+         panel.histogram(x, ..., col = col.y[panel.number()])

+     })

[Figure: histograms of r17 (x axis from 0.00 to 0.20, y axis "Percent of Total" from 0 to 15) in two stacked panels, failing and healthy]

Correlation ratio

$$\eta^2 = \frac{\sum_{l=1}^{r} n_l \, (\bar{X}_l - \bar{X})^2}{n \, \sigma_X^2}$$

> n <- nrow(farms)

> deno <- (n - 1) * var(farms$r1)

> eta.r1 <- with(farms, sum(table(DIFF) * (by(r1,

+ DIFF, mean) - mean(r1))^2)/deno)

> print(eta.r1)

[1] 0.419557
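The same computation can be wrapped in a small function. This is a sketch on made-up data: `correlation_ratio`, `x` and `g` are names chosen for this illustration. The denominator is written as the total sum of squares, which equals n σ²_X in the formula above and (n − 1) * var(x) in the code of the slide.

```r
# Correlation ratio: between-group sum of squares over total sum of squares
correlation_ratio <- function(x, g) {
  g <- droplevels(factor(g))
  n_l <- as.numeric(table(g))              # group sizes
  mean_l <- as.numeric(tapply(x, g, mean)) # group means, in level order
  between <- sum(n_l * (mean_l - mean(x))^2)
  total <- sum((x - mean(x))^2)
  between / total
}

# Toy example: two well-separated groups
x <- c(1, 2, 3, 7, 8, 9)
g <- c("a", "a", "a", "b", "b", "b")
correlation_ratio(x, g)

# The correlation ratio coincides with the R-squared of a one-way ANOVA
summary(lm(x ~ g))$r.squared
```

Values close to 1 mean that the attribute explains most of the variance of the numerical variable; values close to 0 mean that the group means are nearly identical.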


Correlation ratio (2)

Objective: compute the correlation ratio of each numerical variable with Y and draw a dot chart grouped by the topic of the variables (“capitalization”, “liquidity”, etc.)

> res <- NULL

> for (k in c(4, 7, 9:30)) {

+ deno <- (n - 1) * var(farms[, k])

+ res <- c(res, with(farms, sum(table(DIFF) *

+ (by(farms[, k], DIFF, mean) - mean(farms[,

+ k]))^2)/deno))

+ }

> names(res) <- names(farms[c(4, 7, 9:30)])

> topics <- factor(c("structure", "structure", rep("capitalization",

+ 5), rep("Weight of the debt", 3), rep("Liquidity",

+ 3), rep("Debt servicing", 5), "Capital profitability",

+ rep("Earnings", 3), rep("Productive activity",

+ 2)))


Correlation ratio (3)

> dotchart(res, groups = topics, main = "Correlation ratio by topics")

> abline(v = 0.25, col = "red", lty = 2)

[Figure: dot chart titled "Correlation ratio by topics" (x axis from 0.0 to 0.4), with variables grouped by topic: Capital profitability (r24), capitalization (r1 to r5), Debt servicing (r17, r18, r19, r21, r22), Earnings (r28, r30, r32), Liquidity (r11, r12, r14), Productive activity (r36, r37), structure (including HECTAREAGE) and Weight of the debt (r6, r7, r8); dashed red vertical line at 0.25]

Student’s t-Test

> with(farms, t.test(r36[DIFF == "failing"], r36[DIFF ==

+ "healthy"]))

Welch Two Sample t-test

data: r36[DIFF == "failing"] and r36[DIFF == "healthy"]

t = 3.4574, df = 1087.007, p-value = 0.0005666

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

0.04941916 0.17912308

sample estimates:

mean of x mean of y

1.241827 1.127556

For the formula of Welch’s test, see http://en.wikipedia.org/wiki/Welch%27s_t-test


How to discretize a numerical variable?

1. start with traditional methods such as “quantile”, “Fisher-Jenks”, etc., included in the package classInt, using a high number of classes.

2. try to aggregate classes using the weight of evidence criterion:

$$\mathrm{WoE} = \log(\mathrm{Odds}) = \log \frac{\#(y=1 \mid X)}{\#(y=0 \mid X)}$$

3. in the end, 5-6 classes seem to be enough


Discretization of r1

> library(classInt)

> interval <- classIntervals(farms$r1, n = 12,

+ style = "quantile")$brks

> nb.int <- findInterval(farms$r1, interval,

+ all.inside = TRUE)

> woe <- by(farms$DIFF, as.factor(nb.int),

+ function(x) log(length(which(x ==

+ "failing"))/length(which(x ==

+ "healthy"))))

> plot((interval[1:12] + interval[2:13])/2,

+ woe, main = "Weight Of Evidence",

+ xlab = "variable r1")

> abline(v = interval, lty = 2, col = "grey")

> abline(v = interval[c(4, 8, 11)], col = "red")


Choice of classes

4 classes seem to be enough to discretize r1: “low”, “medium”, “high” and “very high”.

[Figure: “Weight Of Evidence” dot plot; x-axis: variable r1 (0.5 to 2.0); y-axis: woe (-2 to 4)]
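Once the cut points are chosen, the four classes can be built with cut. The following is only a minimal sketch on simulated data: r1.sim stands in for farms$r1, base R quantile replaces classInt, and the names r1.sim, brks and r1.class are hypothetical.

```r
set.seed(1)
# simulated stand-in for farms$r1 (assumption: the real data are not used here)
r1.sim <- runif(500, min = 0.2, max = 2.2)
# 13 quantile break points, i.e. 12 classes, as in the slides
interval <- quantile(r1.sim, probs = seq(0, 1, length.out = 13))
# outer bounds plus the three cut points drawn in red on the plot
brks <- interval[c(1, 4, 8, 11, 13)]
# right = FALSE gives left-closed classes, like findInterval
r1.class <- cut(r1.sim, breaks = brks, right = FALSE,
    include.lowest = TRUE,
    labels = c("low", "medium", "high", "very high"))
table(r1.class)
```

The resulting factor can then replace the numerical variable in a model formula.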


Multivariate analysis

- Principal Component Analysis (PCA) to complete the analysis of covariance/variance of explanatory variables

- Hierarchical cluster analysis (if n < 1000) with the hclust function, or k-means clustering with the kmeans function

> res.pca <- princomp(farms[, 9:30])
> biplot(res.pca, col = c("grey", "blue"))
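The slide names hclust and kmeans without showing them. A minimal sketch on simulated data could look as follows; the data, the Ward criterion and the choice of 2 groups are assumptions made for illustration, not choices taken from the slides.

```r
set.seed(2)
# simulated stand-in for numeric columns of farms: two groups of 50 points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
    matrix(rnorm(100, mean = 4), ncol = 2))
x.sc <- scale(x)  # standardise so that no variable dominates the distances
# hierarchical clustering (Ward criterion), tree cut into 2 groups
res.hc <- hclust(dist(x.sc), method = "ward.D2")
grp.hc <- cutree(res.hc, k = 2)
# k-means with 2 centers and several random starts
res.km <- kmeans(x.sc, centers = 2, nstart = 10)
# cross-tabulate the two partitions
table(hclust = grp.hc, kmeans = res.km$cluster)
```

Comparing the two partitions in a table is a quick way to see whether both methods find the same groups.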


Other R Tools

- package Rcmdr: a Tk menu with several graphics and tests with a minimum of programming

- package rattle: a package which depends on a lot of packages, dedicated to scoring methods

- package iplots: interactive selection on basic graphics such as histogram, barplot, etc., useful for the detection of multivariate outliers

- package ggplot2: another “generation” of graphics


Conclusion of this part

Do you already have an idea of the characteristics of the farms which failed? If not, you may continue to explore the data...


Logistic Regression


Sampling

- working sample (70%) farms.work: used for model selection

- test sample (30%) farms.test: used to test the selected model

> set.seed(121181)
> ind <- sample(1:n, round(0.7 * n))
> farms.work <- farms[ind, ]
> farms.test <- farms[-ind, ]


Ordinary Least Squares (OLS) model (1)

How to explain a numerical variable by other explanatory variables (both numerical and attribute)?

- use the function lm: r1 is the variable to explain, DIFF and HECTARE are the explanatory variables:
> res.lm <- lm(r1 ~ DIFF + HECTARE, data = farms.work)

- What results are included in res.lm?
> names(res.lm)

- the function anova.lm calculates the analysis of variance table:
> anova(res.lm)

- the function summary.lm computes a list of summary statistics (F statistic, adjusted R², etc.) of the OLS model:
> summary(res.lm)


Ordinary Least Squares (OLS) model (2)

- the function plot.lm draws diagnostic plots:
> dev.new()
> op <- par(mfrow = c(2, 2))  # keep the previous graphical settings
> plot(res.lm)
> par(op)  # restore them

- the function influence.measures returns statistics such as Cook’s distance to detect influential observations:
> influence.measures(res.lm)
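To show how Cook’s distance singles out an influential observation, here is a small sketch on simulated data. The data, the aberrant 50th observation and the 4/n threshold (a common rule of thumb, not a rule from the slides) are all assumptions.

```r
set.seed(3)
# simulated data with one deliberately aberrant observation (assumption)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50, sd = 0.5)
d$x[50] <- 6
d$y[50] <- -10
fit <- lm(y ~ x, data = d)
cook <- cooks.distance(fit)
# rule of thumb (assumption): flag observations with a distance above 4/n
flagged <- which(cook > 4/nrow(d))
flagged
```

cooks.distance returns one of the quantities reported by influence.measures, so the same observations can be cross-checked in its output.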


Generalized Linear Model (GLM)

How to explain a normal, binomial, Poisson or gamma variable by explanatory variables (both numerical and attribute)?

- use the function glm with the option family to give the name of the distribution and the link used:
> res.glm <- glm(Y ~ ., family = binomial(link = "logit"),
+     data = farms.work[, -2])

- the functions used for a glm object are the same as for lm:
> names(res.glm)
> anova(res.glm)
> summary(res.glm)
> dev.new()
> op <- par(mfrow = c(2, 2))  # keep the previous graphical settings
> plot(res.glm)
> par(op)  # restore them
> influence.measures(res.glm)


Choice of variables

- You can choose the function stepAIC of package MASS, which performs stepwise model selection by AIC, applied to the res.glm object constructed previously (with k = log(n), the penalty is that of the BIC):
> library(MASS)
> res.step <- stepAIC(res.glm, direction = "backward",
+     k = log(nrow(farms.work)))

- You can also choose the function bic.glm of package BMA:
> library(BMA)
> choix.bic.glm <- bic.glm(farms.work[, -c(2, 31)], farms.work$Y,
+     strict = FALSE, OR = 20, data = x, glm.family = "binomial",
+     factor.type = TRUE)
> summary(choix.bic.glm, conditional = TRUE, digits = 2)
> imageplot.bma(choix.bic.glm)
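The option k = log(nrow(farms.work)) replaces the AIC penalty of 2 per parameter by log(n), which is the BIC penalty. The sketch below checks this on simulated data with base R only; the data and the use of step (the stats counterpart of stepAIC) are assumptions for illustration.

```r
set.seed(4)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# simulated binary response depending on x1 only (assumption)
d$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * d$x1))
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = d)
# AIC() with penalty k = log(n) is the BIC
AIC(fit, k = log(n))
BIC(fit)
# backward selection with the same penalty
res.step.sim <- step(fit, direction = "backward", k = log(n), trace = 0)
formula(res.step.sim)
```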


Comparing two models

- with stepAIC, we keep the variables: CNTY, STATUS, HECTARE, r1, r5, r12, r14, r21 and r36

- with the first model of bic.glm, we keep the variables: CNTY, STATUS, HECTARE, r1, r3, r17, r24 and r36

To compare the two methods, we can use the AIC criterion:
> res.bic.glm <- glm(Y ~ STATUS + CNTY + HECTARE + r1 +
+     r3 + r17 + r24 + r36, family = binomial(link = "logit"),
+     data = farms.work)
> AIC(res.step)
[1] 423.0119
> AIC(res.bic.glm)
[1] 422.6086

We keep the second model, which has the (slightly) lower AIC ...

Coefficients of the model

> library(xtable)
> matable <- xtable(res.bic.glm, digits = 3, caption = "Coefficient of the selected model")
> print(matable, file = "coeff.tex", size = "tiny")

                      Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)             -6.118       1.089   -5.616     0.000
STATUSproprietorship    -1.543       0.400   -3.860     0.000
CNTYNord                -2.257       0.413   -5.465     0.000
CNTYOrne                -1.472       0.393   -3.748     0.000
CNTYSeine-Maritime      -0.186       0.388   -0.478     0.633
HECTARE                 -0.035       0.004   -7.836     0.000
r1                      11.642       0.892   13.051     0.000
r3                       5.915       0.785    7.531     0.000
r17                     31.362       6.214    5.047     0.000
r24                     -7.437       2.008   -3.705     0.000
r36                      1.532       0.332    4.618     0.000

Table: Coefficient of the selected model

Be careful before interpreting the coefficients β: notice for example the sign associated with STATUS, contrary to what we observed in the EDA, most likely because of multicollinearity ...
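One common way to look for multicollinearity, not used in the slides, is the variance inflation factor (VIF) of each explanatory variable, 1/(1 - R²), where R² comes from regressing that variable on the others. The sketch below computes it by hand on simulated data; the variables x1, x2, x3 are hypothetical.

```r
set.seed(9)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1 by construction
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)
cor(X)
# VIF of each variable: 1/(1 - R^2) from its regression on the others
vif <- sapply(names(X), function(v) {
    f <- reformulate(setdiff(names(X), v), response = v)
    1/(1 - summary(lm(f, data = X))$r.squared)
})
vif
```

Large values for x1 and x2 and a value close to 1 for x3 reflect the correlation built into the simulation.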


Estimated adjusted Odds ratio

We may calculate the odds ratios and their confidence intervals by using the functions summary and coef.

> lreg.coeffs <- coef(summary(res.bic.glm))
> # rescale estimate and standard error of the ratios to an increase of 0.01
> ratios <- c("r1", "r3", "r17", "r24", "r36")
> lreg.coeffs[ratios, 1:2] <- lreg.coeffs[ratios, 1:2] * 0.01
> odds <- data.frame(signif(cbind(exp(lreg.coeffs[, 1]),
+     exp(lreg.coeffs[, 1] - 1.96 * lreg.coeffs[, 2]),
+     exp(lreg.coeffs[, 1] + 1.96 * lreg.coeffs[, 2])),
+     3))
> names(odds) <- c("odds", "l.95", "u.95")

To make the odds ratios associated with the ratios (r1, etc.) interpretable, we have multiplied these coefficients and their standard errors by 0.01: the odds ratios then correspond to an increase of 0.01 in the ratio.
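When a coefficient is rescaled to an increase of 0.01, its standard error scales by the same factor, so the Wald interval has to be built from both rescaled quantities. A minimal sketch on simulated data follows; the variable ratio and the name delta are hypothetical.

```r
set.seed(5)
n <- 300
d <- data.frame(ratio = runif(n))
# simulated binary response (assumption)
d$y <- rbinom(n, size = 1, prob = plogis(-1 + 3 * d$ratio))
fit <- glm(y ~ ratio, family = binomial, data = d)
est <- coef(summary(fit))
delta <- 0.01  # size of the increase in the ratio
b <- est["ratio", "Estimate"] * delta
se <- est["ratio", "Std. Error"] * delta  # the standard error scales the same way
odds <- signif(c(odds = exp(b), l.95 = exp(b - 1.96 * se),
    u.95 = exp(b + 1.96 * se)), 3)
odds
# same Wald interval obtained from confint.default(), then rescaled
exp(confint.default(fit)["ratio", ] * delta)
```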


Other issues in modelling

- transforming all numeric variables into attributes, as we have seen in the previous section.

- transforming all the attributes into numeric variables (Multiple Correspondence Analysis with function dudi.acm of package ade4).

- choosing an econometric approach: for example, try a model that takes into account an economic “a priori” on the variables. The dot chart of the weight of evidence may also give a good intuition.


Prediction on the test sample

Calculate the following term by using the function predict:

$$Y^{*} = X\beta$$

> eta <- predict(res.step, newdata = farms.test)

Then calculate the score:

$$\mu = \frac{\exp(Y^{*})}{1 + \exp(Y^{*})}$$

> mu <- exp(eta)/(1 + exp(eta))

If we choose an arbitrary cut-off equal to 0.5, we compute the predicted value of Y as:

> Y.pred = relevel(factor(ifelse(mu > 0.5, 1, 0)), ref = "1")
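The score can also be obtained in one call with predict(..., type = "response"), or from the linear predictor with plogis. The sketch below checks that the three routes agree on simulated data; the data and the names work, test, mu2 are assumptions for illustration.

```r
set.seed(6)
n <- 200
d <- data.frame(x = rnorm(n))
# simulated binary response (assumption)
d$y <- rbinom(n, size = 1, prob = plogis(0.5 + d$x))
work <- d[1:140, ]
test <- d[141:200, ]
fit <- glm(y ~ x, family = binomial(link = "logit"), data = work)
eta <- predict(fit, newdata = test)  # linear predictor
mu <- exp(eta)/(1 + exp(eta))        # score, computed as in the slides
mu2 <- predict(fit, newdata = test, type = "response")  # score in one call
all.equal(mu, mu2)
all.equal(mu, plogis(eta))
```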


Confusion Matrix and vocabulary

                        actual value Y = 1   actual value Y = 0   Total
predicted value Y = 1   TP                   FP
predicted value Y = 0   FN                   TN
Total                   P                    N                    P+N

- True Positive Rate TPR: TP/P

- False Positive Rate FPR: FP/N

- Accuracy: (TP + TN)/(P + N)

- Positive predictive value PPV: TP/(TP + FP)

- Sensitivity: TP/(TP + FN)

- Specificity: TN/(FP + TN)
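These definitions can be collected in a small function. The sketch below is only an illustration: conf.rates is a hypothetical helper and the counts are made up.

```r
# hypothetical helper: rates computed from the four cells of a confusion matrix
conf.rates <- function(TP, FP, FN, TN) {
    P <- TP + FN  # actual positives
    N <- FP + TN  # actual negatives
    c(TPR = TP/P, FPR = FP/N, accuracy = (TP + TN)/(P + N),
        PPV = TP/(TP + FP), sensitivity = TP/(TP + FN),
        specificity = TN/(FP + TN))
}
# made-up counts, for illustration only
rates <- conf.rates(TP = 40, FP = 10, FN = 5, TN = 45)
rates
```

Since P = TP + FN and N = FP + TN, the sensitivity is the TPR and the specificity equals 1 - FPR.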


Example with a Cutoff equal to 0.5

Construction of the confusion matrix:
> ma.conf <- addmargins(table(Y.pred, farms.test$DIFF))

TPR and FPR:
> ma.conf[1, 1:2]/ma.conf[3, 1:2]
   failing    healthy
0.89637306 0.07027027

Accuracy:
> (ma.conf[1, 1] + ma.conf[2, 2])/sum(ma.conf[3, 1:2])
[1] 0.9126984

Sensitivity and Specificity:
> ma.conf[1, 1]/ma.conf[3, 1]
[1] 0.896373
> ma.conf[2, 2]/ma.conf[3, 2]
[1] 0.9297297


The ROC curve (1)

How to choose the cut-off?

- Choose two of the criteria seen previously, for example the TPR and the FPR.

- We would like a cut-off for which the TPR is large and the FPR is small.

- The ROC curve plots these two criteria against each other as the cut-off varies: when the cut-off equals 1, no farm is predicted as failing, so TPR = FPR = 0, etc.

- Use the ROCR package (see http://rocr.bioinf.mpi-sb.mpg.de/):
> library(ROCR)
> pred <- prediction(mu, farms.test$Y)
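To see what the ROC curve contains, its points can be computed by hand for a grid of cut-offs. The sketch below uses base R and simulated scores only; y, score and roc are hypothetical names, and this is not how ROCR is implemented.

```r
set.seed(7)
n <- 200
y <- rbinom(n, size = 1, prob = 0.5)   # simulated actual classes
score <- plogis(2 * y - 1 + rnorm(n))  # simulated scores in (0, 1)
cutoffs <- seq(0, 1, by = 0.1)
roc <- t(sapply(cutoffs, function(s) {
    pred <- score > s                  # predicted as positive at cut-off s
    c(cutoff = s, TPR = sum(pred & y == 1)/sum(y == 1),
        FPR = sum(pred & y == 0)/sum(y == 0))
}))
roc
plot(roc[, "FPR"], roc[, "TPR"], type = "b",
    xlab = "False positive rate", ylab = "True positive rate")
```

The first row (cut-off 0) gives TPR = FPR = 1 and the last row (cut-off 1) gives TPR = FPR = 0, the two ends of the curve.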


The ROC curve (2)

In this case, a cut-off equal to 0.4 seems to be a good compromise to obtain both a good TPR and a good FPR.

> perf <- performance(pred, measure = "tpr", x.measure = "fpr")
> plot(perf, colorize = TRUE, print.cutoffs.at = seq(0, 1,
+     by = 0.1), text.adj = c(1.2, 1.2), lwd = 3)

[Figure: ROC curve; x-axis: False positive rate (0.0 to 1.0); y-axis: True positive rate (0.0 to 1.0); the curve is colored by cut-off, with the cut-offs 0, 0.1, ..., 1 printed along it]


Cross-validation for GLM

This method is an alternative to the AIC criterion for choosing the model and may be recommended when the sample size is not large enough:

1. Divide the data (of size n) into K groups.

2. For each group, fit a GLM omitting that group, then use the function cost to calculate the percentage of misclassified observations in the group that was omitted from the fit.

> require(boot)
> res.glm <- glm(Y ~ STATUS + CNTY + HECTARE + r1 + r3 +
+     r17 + r24 + r36, family = binomial(link = "logit"),
+     data = farms)
> cost <- function(r, pi = 0) mean(abs(r - pi) > 0.6)
> res.cv <- cv.glm(farms, res.glm, cost, K = 10)
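The two steps can also be written out by hand, which makes explicit what is being averaged. The sketch below uses simulated data and a plain misclassification rate with a 0.5 cut-off; both are assumptions and differ from the cost function above. With cv.glm, the estimated error is stored in the delta component of the result.

```r
set.seed(8)
n <- 200
d <- data.frame(x = rnorm(n))
# simulated binary response (assumption)
d$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * d$x))
K <- 10
fold <- sample(rep(1:K, length.out = n))  # random assignment to K groups
err <- numeric(K)
for (k in 1:K) {
    # fit on all groups except group k
    fit <- glm(y ~ x, family = binomial, data = d[fold != k, ])
    mu <- predict(fit, newdata = d[fold == k, ], type = "response")
    # misclassification rate in the omitted group, cut-off 0.5 (assumption)
    err[k] <- mean((mu > 0.5) != d$y[fold == k])
}
mean(err)  # cross-validated error rate
```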


Conclusion

- The EDA gave us some answers to the problem and helped us understand the data.

- We found one possible model which has good properties according to the usual criteria, even if its interpretation is not easy because of the inhomogeneity of the variables and the correlation between them: other models could be found.

- However, we may use this model to help prevent some farms from failing: for farms where all explanatory variables are known except Y, those with a predicted score above 0.4 would be warned...


Other methods for scoring with R

See http://cran.r-project.org/doc/contrib/Sharma-CreditScoring.pdf, which deals with the following methods:

- Bagging: package adabag

- Random Forest: package randomForest

- Support Vector Machines: package e1071

- Generalized Additive Model: package gam
