TRANSCRIPT
1
Logistic Regression
KNN Ch. 14 (pp. 555-618); MINITAB User’s Guide; SAS EM documentation
2
Regression Models with Binary Response Variable
In many applications the response variable has only two possible outcomes (0/1):
In a study of liability insurance possession, using age of head of household, amount of liquid assets, and type of occupation of head of household as predictors, the response variable had two possible outcomes: household has liability insurance (=1), or household does not have liability insurance (=0)
The financial status of a firm (sound status, headed toward insolvency) can be coded as 0/1
Blood pressure status (high blood pressure, not high blood pressure) can be coded as 0/1
3
Meaning of the Response Function for Binary Outcomes
Consider the simple linear regression model
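$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad Y_i = 0, 1$$
In this case, the expected response E{Yi} has a special meaning:
$$E\{Y_i\} = \beta_0 + \beta_1 X_i$$
Consider Yi to be a Bernoulli random variable:

| $Y_i$ | Probability |
|---|---|
| 1 | $P(Y_i = 1) = \pi_i$ |
| 0 | $P(Y_i = 0) = 1 - \pi_i$ |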
4
Meaning of the Response Function for Binary Outcomes
Using the definition of expected value of a random variable,
$$E\{Y_i\} = 1(\pi_i) + 0(1 - \pi_i) = \pi_i = P(Y_i = 1)$$
so that
$$E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i$$
Therefore, the mean response E{Yi} is the probability that Yi = 1 when the level of the predictor variable is Xi.
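[Figure: the response function $E\{Y\} = \beta_0 + \beta_1 X$ plotted against X, with E{Y} on the vertical axis running from 0 to 1]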
5
Problems when Response Variable is Binary
1. Error Terms are not normal:
At each X level, the error cannot be normally distributed since it takes only 2 possible values, depending on whether Y is 0 or 1
2. Error Variance is not constant:
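The error variance is $\sigma^2\{\varepsilon_i\} = \pi_i (1 - \pi_i) = (\beta_0 + \beta_1 X_i)(1 - \beta_0 - \beta_1 X_i)$, which is a function of X and therefore not constant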
3. Constraints with the response function:
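Because the response function represents a probability, it must satisfy $0 \le E\{Y\} \le 1$. A linear response function can fall below 0 or exceed 1, so we need a response function that stays within these limits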
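The three problems can be checked numerically. The sketch below uses hypothetical values for b0 and b1 (illustrative assumptions, not estimates from any data set in these slides) and shows the two possible error values, the changing error variance, and fitted "probabilities" outside the 0-1 range.

```python
# Hypothetical coefficients for the linear response function E{Y} = b0 + b1*X.
# These values are illustrative assumptions, not estimates from real data.
b0, b1 = -0.2, 0.15

def linear_response(x):
    """Mean response of the linear model, interpreted as P(Y = 1)."""
    return b0 + b1 * x

def error_values(x):
    """The only two values the error can take at level x (for Y = 1 and Y = 0)."""
    mean = linear_response(x)
    return (1 - mean, 0 - mean)

def error_variance(x):
    """Bernoulli variance pi * (1 - pi); it changes with x."""
    pi = linear_response(x)
    return pi * (1 - pi)

for x in [0, 2, 4, 6, 10]:
    pi = linear_response(x)
    valid = 0 <= pi <= 1
    print(f"X={x:2d}  E{{Y}}={pi:5.2f}  valid probability: {valid}")
```

With these assumed coefficients, X = 0 gives a negative mean response and X = 10 gives a mean response above 1, which is the constraint problem in item 3.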
6
Link Functions
Cumulative distribution functions have a sigmoid shape that makes them useful as response functions for a regression model with a binary outcome. The inverse of such a distribution function, which maps the probability back onto the linear predictor, is called a link function.
We want to choose a link function that best fits our data. Goodness-of-fit statistics can be used to compare fits using different link functions:
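| Name | Link function | Distribution | Mean | Variance |
|---|---|---|---|---|
| logit | $g(\pi_i) = \log(\pi_i / (1 - \pi_i))$ | logistic | 0 | $\pi^2 / 3$ |
| normit/probit | $g(\pi_i) = \Phi^{-1}(\pi_i)$ | normal | 0 | 1 |
| gompit | $g(\pi_i) = \log(-\log(1 - \pi_i))$ | Gumbel | $-\gamma$ (Euler's constant) | $\pi^2 / 6$ |

In the Variance column, $\pi$ is the constant 3.14159..., not the probability $\pi_i$.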
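The three link functions in the table, together with their inverses, can be sketched with the Python standard library. The function names are chosen for this example only; they are not taken from MINITAB or SAS.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def logit(p):
    """Logit link: log of the odds."""
    return math.log(p / (1 - p))

def probit(p):
    """Normit/probit link: inverse of the standard normal CDF."""
    return _STD_NORMAL.inv_cdf(p)

def gompit(p):
    """Gompit (complementary log-log) link."""
    return math.log(-math.log(1 - p))

def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)

def inv_gompit(eta):
    return 1 - math.exp(-math.exp(eta))

# Each link maps a probability in (0, 1) onto the whole real line.
for p in (0.1, 0.5, 0.9):
    print(f"p={p}: logit={logit(p):.3f} probit={probit(p):.3f} gompit={gompit(p):.3f}")
```

The logit and probit links are symmetric around p = 0.5 (both equal 0 there); the gompit link is not.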
7
Logistic Regression Assumption
[Figure: logit transformation]
Assumption: The logit transformation of the probabilities of the target value results in a linear relationship with the input variables.
8
Linear versus Logistic Regression
| | Logistic Regression | Linear Regression |
|---|---|---|
| Target | Discrete (binary or ordinal) variable | Interval variable |
| Input variables | Any measurement level | Any measurement level |
| Predicted values | Probability of a particular level(s) of the target variable at the given values of the input variables | Mean of the target variable at the given values of the input variables |
9
Interpretation of Parameter Estimates
The interpretation of the parameter estimates depends on:
– The link function
– The reference event (1 or 0)
– The reference factor levels (for numerical factors, the reference level is the smallest value)
The logit link function provides the most natural interpretation of the estimated coefficients:
– The odds of a reference event are the ratio of P(event) to P(not event).
– The estimated coefficient of a predictor (factor or covariate) is the estimated change in the log of P(event)/P(not event) for each unit change in the predictor, assuming the other predictors remain constant.
10
Parametric Models
$$E(Y \mid X = x) = g(x; w)$$
Generalized Linear Model
$$g^{-1}\big(E(Y \mid X = x)\big) = w_0 + w_1 x_1 + \dots + w_p x_p$$
[Figure: training data, with parameter axes w1 and w2]
11
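Logistic Regression Models
$$g^{-1}(p) = \operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$$
Here $\log(p / (1 - p))$ is the log(odds), written logit(p).
[Figure: training data; logit(p) plotted against p for p from 0.0 to 1.0, with logit(p) = 0 at p = 0.5]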
12
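Changing the Odds
$$\log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$$
Increasing $x_1$ by one unit gives a new probability $p'$:
$$\log\left(\frac{p'}{1 - p'}\right) = w_0 + w_1 (x_1 + 1) + \dots + w_p x_p = w_1 + w_0 + w_1 x_1 + \dots + w_p x_p$$
$$\text{odds ratio} = \frac{p' / (1 - p')}{p / (1 - p)} = \exp(w_1)$$
[Figure: training data]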
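A quick numerical check of the odds ratio result: under the logit link, a one-unit increase in x1 multiplies the odds by exp(w1) whatever the starting value of x1. The weights below are hypothetical, chosen only for illustration.

```python
import math

def probability(w0, w1, x1):
    """P(event) under the logit link with a single input x1."""
    return 1 / (1 + math.exp(-(w0 + w1 * x1)))

def odds(p):
    """Ratio of P(event) to P(not event)."""
    return p / (1 - p)

def odds_ratio_for_unit_increase(w0, w1, x1):
    """Odds at x1 + 1 divided by odds at x1."""
    return odds(probability(w0, w1, x1 + 1)) / odds(probability(w0, w1, x1))

# Hypothetical weights, not estimates from any data set in these slides.
w0, w1 = -1.5, 0.4

for x1 in (0.0, 2.0, 5.0):
    ratio = odds_ratio_for_unit_increase(w0, w1, x1)
    print(f"x1={x1}: odds ratio = {ratio:.6f}, exp(w1) = {math.exp(w1):.6f}")
```

The probability itself does not change by a constant amount per unit of x1; only the odds change by a constant factor.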
13
Regression diagnostics – Residual Analysis
| To identify… | Use… | Which measures… |
|---|---|---|
| poorly fit factor/covariate patterns | Pearson residual | the difference between the actual and the predicted observation |
| factor/covariate patterns with strong influence on parameter estimates | delta beta | changes in the coefficients when the j-th factor/covariate pattern is removed, based on Pearson residuals |
| factor/covariate patterns with a large leverage | Leverage (Hi) | leverages of the j-th factor/covariate pattern, a measure of how unusual predictor values are |
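Two of these diagnostics can be sketched for a model with an intercept and one predictor. The Pearson residual below uses the common textbook form (observed minus expected events, divided by the binomial standard deviation), and the leverage is the diagonal of the weighted hat matrix; the exact definitions in MINITAB may differ in detail. The patterns and fitted probabilities are made up and are assumed rather than produced by an actual fit.

```python
import math

def pearson_residual(events, trials, p_hat):
    """(observed - expected events) / sqrt(binomial variance) for one pattern."""
    expected = trials * p_hat
    return (events - expected) / math.sqrt(trials * p_hat * (1 - p_hat))

def leverages(xs, trials, p_hats):
    """Hat-matrix diagonal for a model with an intercept and one predictor."""
    vs = [m * p * (1 - p) for m, p in zip(trials, p_hats)]
    a = sum(vs)
    b = sum(v * x for v, x in zip(vs, xs))
    c = sum(v * x * x for v, x in zip(vs, xs))
    det = a * c - b * b
    # h_j = v_j * [1, x_j] (X'VX)^{-1} [1, x_j]'
    return [v * (c - 2 * b * x + a * x * x) / det for v, x in zip(vs, xs)]

# Made-up factor/covariate patterns: predictor value, number of cases,
# number of events, and an ASSUMED fitted probability for each pattern.
xs = [1.0, 2.0, 3.0, 4.0, 8.0]
trials = [20, 25, 30, 25, 10]
events = [3, 6, 9, 10, 9]
p_hats = [0.15, 0.22, 0.31, 0.42, 0.80]

residuals = [pearson_residual(y, m, p) for y, m, p in zip(events, trials, p_hats)]
hat_values = leverages(xs, trials, p_hats)

for x, r, h in zip(xs, residuals, hat_values):
    print(f"x={x}: Pearson residual = {r:6.3f}, leverage = {h:.3f}")
```

The leverages always sum to the number of estimated parameters (two here), which gives a rough scale for judging whether a single pattern's leverage is large.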
14
The Home Equity Loan Case
HMEQ Overview
– Determine who should be approved for a home equity loan.
– The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan.
– The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.
15
HMEQ
– The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).
– The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.
16
Original HMEQ data
| Name | Model Role | Measurement Level | Description |
|---|---|---|---|
| BAD | Target | Binary | 1=defaulted on loan, 0=paid back loan |
| REASON | Input | Binary | HomeImp=home improvement, DebtCon=debt consolidation |
| JOB | Input | Nominal | Six occupational categories |
| LOAN | Input | Interval | Amount of loan request |
| MORTDUE | Input | Interval | Amount due on existing mortgage |
| VALUE | Input | Interval | Value of current property |
| DEBTINC | Input | Interval | Debt-to-income ratio |
| YOJ | Input | Interval | Years at present job |
| DEROG | Input | Interval | Number of major derogatory reports |
| CLNO | Input | Interval | Number of trade lines |
| DELINQ | Input | Interval | Number of delinquent trade lines |
| CLAGE | Input | Interval | Age of oldest trade line in months |
| NINQ | Input | Interval | Number of recent credit inquiries |
17
HMEQ: Modeling Goal
– The credit scoring model computes a probability of a given loan applicant defaulting on loan repayment. A threshold is selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.
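The decision rule can be sketched as follows. The threshold value, the applicant names, and the scores are hypothetical, and labelling the non-rejected applicants "approve" is an assumption made for this example.

```python
def recommend(probability_of_default, threshold):
    """Recommend rejection when the predicted default probability exceeds the threshold."""
    if not 0.0 <= probability_of_default <= 1.0:
        raise ValueError("probability_of_default must be between 0 and 1")
    return "reject" if probability_of_default > threshold else "approve"

# Hypothetical scored applicants and an assumed threshold.
applicants = {"A": 0.05, "B": 0.22, "C": 0.61}
THRESHOLD = 0.5

decisions = {name: recommend(p, THRESHOLD) for name, p in applicants.items()}
print(decisions)
```

Lowering the threshold rejects more applicants (fewer defaults accepted, more good loans turned away); raising it does the opposite, so the choice is a business trade-off rather than a statistical one.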
18
HMEQ: two added variables
– For model comparison purposes, we added two variables:
• BEHAVIOR (good/bad), which precisely mirrors the 0/1 values in BAD, to see how we can perfectly predict BAD using insider information
• FLIPCOIN (Head/Tail), which is completely random, to see if we can predict BAD using random flips of a coin
Introducing SAS Enterprise Miner v.5.3
Enterprise-grade (and expensive!) Data Mining package
Implemented Methodology:
– Sample-Explore-Modify-Model-Assess (SEMMA)
Available Modeling Tools:
– Logistic Regression
– Many others, such as Decision Trees, Neural Networks, Clustering, Market-Basket, etc.
19
Analysis of HMEQ in SAS EM
Three logistic regression nodes were added to the Analysis Diagram. To compare them, a Compare node was added.
20
SAS EM 4.3: A more accessible version
Accessible through base SAS at UNT CoB
Start SAS 9.3. From the SAS menu bar, select Solutions > Analysis > Enterprise Miner
21
Logistic Regression results (all predictors)
22
Logistic Regression results (stepwise, final model)
23
Interpretation of Odds Ratio results
Predictors associated with an increased probability of default on the loan (odds ratio > 1):
• DEBTINC
• DELINQ
• DEROG
• NINQ
Predictors associated with a decreased probability of default on the loan (odds ratio < 1):
• CLNO
• YOJ
24
Model Comparison
25
Perfect Regression is, well, perfect.
In Baseline Regression, 20% of the borrowers default, regardless of fitted value
Stepwise Regression is somewhere between the other two models