1 logistic regression knn ch. 14 (pp. 555-618) minitab user’s guide sas em documentation

1

Logistic Regression

KNN Ch. 14 (pp. 555-618); MINITAB User’s Guide; SAS EM documentation

2

Regression Models with Binary Response Variable

In many applications the response variable has only two possible outcomes (0/1):

In a study of liability insurance possession, using age of head of household, amount of liquid assets, and type of occupation of head of household as predictors, the response variable had two possible outcomes: household has liability insurance (=1), or household does not have liability insurance (=0)

The financial status of a firm (sound status, headed toward insolvency) can be coded as 0/1

Blood pressure status (high blood pressure, not high blood pressure) can be coded as 0/1

3

Meaning of the Response Function for Binary Outcomes

Consider the simple linear regression model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad Y_i = 0, 1$

In this case, the expected response E{Yi} has a special meaning:

$E\{Y_i\} = \beta_0 + \beta_1 X_i$

Consider Yi to be a Bernoulli random variable:

Yi    Probability
1     $P(Y_i = 1) = \pi_i$
0     $P(Y_i = 0) = 1 - \pi_i$

4

Meaning of the Response Function for Binary Outcomes

Using the definition of expected value of a random variable,

$E\{Y_i\} = 1(\pi_i) + 0(1 - \pi_i) = \pi_i = P(Y_i = 1)$

Therefore, the mean response E{Yi} is the probability that Yi = 1 when the level of the predictor variable is Xi:

$E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i$

[Figure: the response function $E\{Y\} = \beta_0 + \beta_1 X$ plotted against X; the vertical axis E{Y} is marked at 1.]

5

Problems when Response Variable is Binary

1. Error Terms are not normal:

At each X level, the error cannot be normally distributed since it takes only 2 possible values, depending on whether Y is 0 or 1

2. Error Variance is not constant:

Error Variance is a function of X, therefore not constant

3. Constraints with the response function:

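Because the response function represents a probability, it must stay between 0 and 1 for every value of X. A linear response function does not satisfy this constraint, and finding functions that do is not easy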
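The three problems can be illustrated numerically. The sketch below is not from the slides: it assumes hypothetical coefficients b0 = 0.1 and b1 = 0.2 for the linear response function and treats its value as P(Y = 1). It prints the two values the error term can take, the Bernoulli variance at several X levels, and whether the fitted value is still a valid probability.

```python
# Hypothetical coefficients, chosen only for illustration (not estimated from data).
B0, B1 = 0.1, 0.2


def mean_response(x):
    """Linear response function E{Y} = b0 + b1 * x, read as P(Y = 1)."""
    return B0 + B1 * x


def error_values(x):
    """The only two values the error term can take at level x."""
    pi = mean_response(x)
    return 1 - pi, -pi  # error when Y = 1, error when Y = 0


def error_variance(x):
    """Variance of a Bernoulli response with mean pi: pi * (1 - pi)."""
    pi = mean_response(x)
    return pi * (1 - pi)


for x in (1, 2, 3, 5):
    pi = mean_response(x)
    e1, e0 = error_values(x)
    print(f"X={x}: mean={pi:.2f}, errors=({e1:.2f}, {e0:.2f}), "
          f"variance={error_variance(x):.4f}, in [0, 1]: {0 <= pi <= 1}")
```

With these assumed coefficients the variance changes from one X level to the next, and at X = 5 the linear function returns a value above 1, which cannot be a probability.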

6

Link Functions

Cumulative distribution functions have a sigmoid shape that makes them useful as the response function of a regression model with a binary outcome. The inverse of such a distribution function, which maps the probability to the linear predictor, is called a link function.

We want to choose a link function that best fits our data. Goodness-of-fit statistics can be used to compare fits using different link functions:

Name            Link function                  Distribution   Mean                     Variance
logit           g(πi) = log(πi / (1 − πi))     logistic       0                        π² / 3
normit/probit   g(πi) = Φ⁻¹(πi)                normal         0                        1
gompit          g(πi) = log(−log(1 − πi))      Gumbel         −γ (Euler's constant)    π² / 6

Here πi is the event probability; in the Variance column, π is the mathematical constant.
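The three link functions in the table can be evaluated directly. This is a minimal sketch using only the Python standard library; the function names `logit`, `probit`, and `gompit` are chosen here for illustration and are not tied to any package named in the slides.

```python
import math
from statistics import NormalDist


def logit(p):
    """Inverse of the cumulative logistic distribution function."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse of the cumulative standard normal distribution function."""
    return NormalDist().inv_cdf(p)


def gompit(p):
    """Complementary log-log link: log(-log(1 - p))."""
    return math.log(-math.log(1 - p))


for p in (0.1, 0.5, 0.9):
    print(f"p={p}: logit={logit(p):.3f}, probit={probit(p):.3f}, "
          f"gompit={gompit(p):.3f}")
```

The logit and probit links are symmetric around p = 0.5, where both equal 0, while the gompit link is not.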

7

Logistic Regression Assumption

[Figure: the logit transformation]

Assumption: The logit transformation of the probabilities of the target value results in a linear relationship with the input variables.

8

Linear versus Logistic Regression

Logistic Regression:

Target is a discrete (binary or ordinal) variable.

Input variables have any measurement level.

Predicted values are the probability of a particular level(s) of the target variable at the given values of the input variables.

Linear Regression:

Target is an interval variable.

Input variables have any measurement level.

Predicted values are the mean of the target variable at the given values of the input variables.

9

Interpretation of Parameter Estimates

The interpretation of the parameter estimates depends on:

The link function

The reference event (1 or 0)

The reference factor levels (for numerical factors, the reference level is the smallest value)

The logit link function provides the most natural interpretation of the estimated coefficients:

The odds of a reference event is the ratio of P(event) to P(not event). The estimated coefficient of a predictor (factor or covariate) is the estimated change in the log of P(event)/P(not event) for each unit change in the predictor, assuming the other predictors remain constant

10

Parametric Models

$E(Y \mid X = x) = g(x; w)$

Generalized Linear Model

$g^{-1}\big(E(Y \mid X = x)\big) = w_0 + w_1 x_1 + \dots + w_p x_p$

[Figure: training data, with axes labeled w1 and w2.]

11

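Logistic Regression Models

$g^{-1}(p) = \operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$

The quantity $\log\left(\frac{p}{1 - p}\right)$ is the log(odds).

[Figure: logit(p) plotted against p for p between 0.0 and 1.0, with logit(p) = 0 at p = 0.5. Label: Training Data.]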
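To turn a fitted logistic regression model into a predicted probability, the linear predictor is passed through the inverse of the logit. The sketch below assumes hypothetical weights and inputs; the names `linear_predictor`, `inverse_logit`, and `predict_probability` are illustrative only.

```python
import math


def linear_predictor(weights, x):
    """Compute w0 + w1*x1 + ... + wp*xp; weights[0] is the intercept w0."""
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))


def inverse_logit(eta):
    """Map a log(odds) value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(weights, x):
    """Predicted probability p such that logit(p) equals the linear predictor."""
    return inverse_logit(linear_predictor(weights, x))


# Hypothetical weights (w0, w1, w2) and input values (x1, x2).
weights = [-1.0, 0.5, 0.25]
x = [2.0, 0.0]

p = predict_probability(weights, x)
print(f"log(odds) = {linear_predictor(weights, x):.2f}, p = {p:.2f}")
```

With these assumed values the linear predictor is 0, which corresponds to p = 0.5, the midpoint of the logit curve.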

12

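Changing the Odds

$\log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$

Increasing $x_1$ by one unit changes the probability from $p$ to $p'$:

$\log\left(\frac{p'}{1 - p'}\right) = w_0 + w_1 (x_1 + 1) + \dots + w_p x_p = w_1 + w_0 + w_1 x_1 + \dots + w_p x_p$

Odds ratio:

$\frac{p' / (1 - p')}{p / (1 - p)} = \exp(w_1)$

[Figure label: Training Data.]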
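The odds ratio result can be checked numerically. The following sketch assumes hypothetical weights and inputs (none of them come from the HMEQ analysis) and compares the odds before and after a one-unit increase in x1.

```python
import math


def probability(weights, x):
    """Logistic model: p = 1 / (1 + exp(-(w0 + w1*x1 + ... + wp*xp)))."""
    eta = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds of the event: P(event) / P(not event)."""
    return p / (1 - p)


# Hypothetical weights (w0, w1, w2) and a hypothetical case (x1, x2).
weights = [-2.0, 0.7, 0.1]
x_before = [1.0, 3.0]
x_after = [2.0, 3.0]  # x1 increased by one unit, x2 held constant

odds_ratio = odds(probability(weights, x_after)) / odds(probability(weights, x_before))
print(f"odds ratio = {odds_ratio:.4f}, exp(w1) = {math.exp(weights[1]):.4f}")
```

The two printed values agree up to floating-point rounding, whatever starting values are chosen for x1 and x2. A positive w1 gives an odds ratio above 1 and a negative w1 gives an odds ratio below 1.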

13

Regression diagnostics – Residual Analysis

To identify: poorly fit factor/covariate patterns
Use: Pearson residual
Which measures: the difference between the actual and the predicted observation

To identify: factor/covariate patterns with strong influence on parameter estimates
Use: delta beta
Which measures: changes in the coefficients when the j-th factor/covariate pattern is removed, based on Pearson residuals

To identify: factor/covariate patterns with a large leverage
Use: Leverage (Hi)
Which measures: leverages of the j-th factor/covariate pattern, a measure of how unusual predictor values are
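As an illustration of the first diagnostic, the sketch below computes a Pearson residual for one factor/covariate pattern. It assumes the common standardized form, (observed events − expected events) divided by the binomial standard deviation; the slides describe the residual only in words, so the exact formula and the example numbers are assumptions.

```python
import math


def pearson_residual(events, trials, fitted_p):
    """Pearson residual for one factor/covariate pattern.

    events   : observed number of events in the pattern
    trials   : number of observations sharing the pattern
    fitted_p : fitted event probability for the pattern
    """
    expected = trials * fitted_p
    std_dev = math.sqrt(trials * fitted_p * (1 - fitted_p))
    return (events - expected) / std_dev


# Hypothetical pattern: 8 events in 10 observations, fitted probability 0.5.
print(f"{pearson_residual(8, 10, 0.5):.3f}")
```

A residual far from 0 flags a pattern whose observed event count the model fits poorly.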

14

The Home Equity Loan Case

HMEQ Overview

Determine who should be approved for a home equity loan.

The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan.

The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.

15

HMEQ

– The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).

– The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

16

Original HMEQ data

Name      Model Role   Measurement Level   Description
BAD       Target       Binary              1 = defaulted on loan, 0 = paid back loan
REASON    Input        Binary              HomeImp = home improvement, DebtCon = debt consolidation
JOB       Input        Nominal             Six occupational categories
LOAN      Input        Interval            Amount of loan request
MORTDUE   Input        Interval            Amount due on existing mortgage
VALUE     Input        Interval            Value of current property
DEBTINC   Input        Interval            Debt-to-income ratio
YOJ       Input        Interval            Years at present job
DEROG     Input        Interval            Number of major derogatory reports
CLNO      Input        Interval            Number of trade lines
DELINQ    Input        Interval            Number of delinquent trade lines
CLAGE     Input        Interval            Age of oldest trade line in months
NINQ      Input        Interval            Number of recent credit inquiries

17

HMEQ: Modeling Goal

– The credit scoring model computes a probability of a given loan applicant defaulting on loan repayment. A threshold is selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.
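The decision rule described above amounts to a single comparison. In this sketch the threshold of 0.5 and the applicant scores are hypothetical placeholders; the slides do not state which threshold was selected.

```python
def recommend(prob_default, threshold):
    """Recommend rejection when the predicted default probability exceeds the threshold."""
    return "reject" if prob_default > threshold else "approve"


THRESHOLD = 0.5  # hypothetical cutoff, for illustration only

# Hypothetical predicted probabilities of default.
scores = {"applicant_a": 0.12, "applicant_b": 0.64}

decisions = {name: recommend(p, THRESHOLD) for name, p in scores.items()}
print(decisions)
```

Lowering the threshold rejects more applicants, and raising it rejects fewer.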

18

HMEQ: two added variables

– For model comparison purposes, we added two variables:

• BEHAVIOR (good/bad), which precisely mirrors the 0/1 values in BAD, to see how we can perfectly predict BAD using insider information

• FLIPCOIN (Head/Tail), which is completely random, to see if we can predict BAD using random flips of a coin
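One way the two added variables could be constructed is sketched below. It assumes that BEHAVIOR is "bad" when BAD = 1 and "good" when BAD = 0, and that FLIPCOIN is drawn with equal probability for each side; the slides do not say how the variables were actually generated, and the function name and sample values are hypothetical.

```python
import random


def add_comparison_variables(bad_values, seed=0):
    """Attach BEHAVIOR (a copy of BAD) and FLIPCOIN (pure noise) to each record."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rows = []
    for bad in bad_values:
        rows.append({
            "BAD": bad,
            # Assumed coding: 1 -> "bad", 0 -> "good".
            "BEHAVIOR": "bad" if bad == 1 else "good",
            # Independent of BAD by construction.
            "FLIPCOIN": rng.choice(["Head", "Tail"]),
        })
    return rows


for row in add_comparison_variables([1, 0, 0, 1]):
    print(row)
```

Because BEHAVIOR is a relabeling of the target, a model that uses it can predict BAD perfectly, while FLIPCOIN carries no information about BAD.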

Introducing SAS Enterprise Miner v.5.3

Enterprise-grade (and expensive!) Data Mining package

Implemented Methodology:

– Sample-Explore-Modify-Model-Assess (SEMMA)

Available Modeling Tools:

– Logistic Regression

– Many others, such as Decision Trees, Neural Networks, Clustering, Market-Basket, etc.

19

Analysis of HMEQ in SAS EM

Three Logistic Regression nodes were added to the analysis diagram. A Compare node was added in order to compare them.

20

SAS EM 4.3: A more accessible version

Accessible through base SAS at UNT CoB

Start SAS 9.3. From the SAS menu bar, select Solutions > Analysis > Enterprise Miner

21

Logistic Regression results (all predictors)

22

Logistic Regression results (stepwise, final model)

23

Interpretation of Odds Ratio results

Predictors associated with an increase in the probability of default on the loan (odds ratio estimate > 1):

• DEBTINC

• DELINQ

• DEROG

• NINQ

Predictors associated with a decrease in the probability of default on the loan (odds ratio estimate < 1):

• CLNO

• YOJ

24

Model Comparison

25

Perfect Regression is, well, perfect.

In Baseline Regression, 20% of the borrowers default, regardless of fitted value

Stepwise Regression is somewhere between the other two models
