TRANSCRIPT
Research Methodology: Tools
Applied Data Analysis (with SPSS)
Lecture 08: Logistic Regression Analysis
April 2014
Prof. Dr. Jürg Schwarz
Lic. phil. Heidi Bruderer Enzler
MSc Business Administration
Slide 2
Contents
Aims of the Lecture (Slide 3)
Typical Syntax (Slide 4)
Introduction (Slide 5)
  Example (Slide 5)
Outline (Slide 10)
Concepts of Logistic Regression Analysis (Slide 11)
  Logistic Regression Model (Slide 13)
  Odds ("Wettquote") (Slide 17)
Logistic Regression Analysis with SPSS: An Example (Slide 20)
Appendix – Interpretation of the Coefficients (Slide 31)
Slide 3
Aims of the Lecture
You will understand the key steps in conducting a logistic regression analysis.
You will understand the concept of a logistic function.
You will understand the concept of logit and odds.
You can conduct a logistic regression analysis with SPSS.
In particular, you will know how to …
◦ interpret the output
omnibus test
goodness of fit
coefficients
classification
◦ describe the output
Slide 4
Typical Syntax
Boxplot of variable activity split by heart (legacy dialogs):
EXAMINE VARIABLES=activity BY heart
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.
Logistic regression analysis of the dependent variable heart on the independent variables,
with a classification plot and confidence intervals for Exp(B):
LOGISTIC REGRESSION VARIABLES heart
  /METHOD=ENTER activity dweight press stress
  /CLASSPLOT
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
Slide 5
Introduction
Example
Medical research: Statistics on the cause of heart attack.
Sample: 200 male patients in a hospital.
Variables
Dependent
heart = got heart attack (yes = 1 / no = 0)
Independent
activity = physical activity (1 to 6)
dweight = difference to normal weight (kg)
press = blood pressure (mm Hg)
stress = stress level (1 to 6)
Typical questions
How do individual characteristics influence the probability of getting a heart attack?
How can this impact be modeled?
How strong is the impact of each factor?
[Boxplot: physical activity split by got heart attack (no/yes)]
Slide 6
Relation between independent variables and "probability" of getting a heart attack
[Boxplots of blood pressure, stress level, physical activity, and difference to normal weight, each split by got heart attack (no/yes)]
Slide 7
Questions
Question in everyday language:
How do individual characteristics influence the probability of getting a heart attack?
Research question:
Is there an impact of the factors …
◦ physical activity
◦ difference to normal weight
◦ blood pressure
◦ stress level
… on the probability of getting a heart attack?
How strong is the impact of the factors?
Is there a model?
Is logistic regression analysis the right model?
Statistical question:
H0: "No model" (= No overall model and no significant coefficients)
HA: "Model" (= Overall model and significant coefficients)
Can we reject H0?
Slide 8
Solution
Logistic regression model with the probability of getting a heart attack as the dependent variable:
Probability(heart attack = 1) = 1 / (1 + e^(-z))
z = β0 + β1 · activity + β2 · dweight + β3 · press + β4 · stress + u

=> Probability(heart attack = 1) = 1 / (1 + e^(-(β0 + β1 · activity + β2 · dweight + β3 · press + β4 · stress + u)))

Probability(heart attack = 1) = dependent variable
activity, ..., stress = independent variables
β0, ..., β4 = coefficients
u = error term
e = Euler's number = 2.718...
"How to" in SPSS
SPSS: Analyze → Regression → Binary Logistic...
Slide 9
Results
The model as a whole is significant (table not shown).
Coefficients are significant (the constant is not, but this is not a problem).
z = -.784 · activity + .057 · dweight + .038 · press + .568 · stress
Example interpretation for stress:
An increase of the stress level by one unit alters the relative probability of getting a heart attack
by a factor of 1.765, i.e. it is increased by 76.5% (1.765 - 1 = .765).
Comparable with R2 of linear regression
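As a quick check of this interpretation, a minimal Python sketch (not part of the original slides; the coefficient value is taken from the output above):
import math
b_stress = 0.568                  # coefficient for stress from the estimated model
odds_ratio = math.exp(b_stress)   # Exp(B): the factor by which the odds change per unit of stress
print(round(odds_ratio, 3))       # 1.765
print(round(odds_ratio - 1, 3))   # 0.765, i.e. the odds increase by 76.5%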
Slide 10
Outline
Basic situation
Given: A binary dependent variable which is coded as 0/1 and metric or categorical (coded as
dummy variables) independent variables
Task: Find a relationship between the characteristics.
Logistic regression analysis
Postulation of a logistic model:
Probability(event occurs) = Function(independent variables)
The function is a logistic function: 1 / (1 + e^(-z))
Logistic regression analysis
◦ Examines the relationship between a type of probability of the dependent variable y taking the value 1 and the independent variables xi.
◦ Uses inferential statistics to estimate the parameters
Differences between linear regression analysis and logistic regression analysis
◦ A type of probability of the dependent variable is predicted instead of the value itself
◦ Many assumptions are not as restrictive as in OLS regression analysis
Slide 11
Concepts of Logistic Regression Analysis
Compare with Lecture 06 "Introduction to Regression Analysis"
◦ Key steps involved in using a regression analysis:
1. Formulation of the model
2. Estimation of the model
3. Verification of the model
4. Considering other aspects
5. Testing of assumptions
6. Interpretation of the model and reporting
◦ General purpose of regression
These concepts remain the same when applied to logistic regression analysis.
What is different?
◦ Key steps 4 to 6 are treated a little differently
◦ Logistic regression => dependent variable nominal (binary)
◦ Concept of logistic function
◦ Concept of odds and odds ratio
◦ Maximum likelihood estimation (MLE) instead of ordinary least squares estimation (OLS)
Slide 12
Types of Logistic Regression Analysis
(Binary) logistic regression
Dependent variable is nominal and consists of two categories
Example: The example discussed above. Dependence of
getting a heart attack on individual characteristics (see above)
Extensions of binary logistic regression
Multinomial logistic regression
Dependent variable is nominal and consists of more than two categories
Example: The relationship between drug abuse (none, marijuana, hard drugs) and sex, age,
and a factor representing peer norms about drug abuse
Ordinal logistic regression
Dependent variable is ordinal
Example: The relationship between the soda size (small, medium, large or extra large) that
people order at a fast-food chain and type of sandwich (beef or chicken), and the age of the
consumer
Slide 13
Logistic Regression Model
Step-by-step
Dependent binary variable y (e.g. heart attack yes/no)
Independent metric variable x (e.g. stress level)
[Scatterplot of the binary variable y against x, together with an excerpt of the dataset (cases with x values from 1 to 6 and y values 0/1)]
In this sample, roughly one half of the cases have y = 1.
Slide 14
Conduct a linear regression
(Avoid thinking about all the violated assumptions!)
[Plot: binary variable y against x with the fitted linear regression line y = β0 + β1 · x]
The model is OK (F test, t test), but what about predictions based on this model?
What does a value of -0.1 or +1.1 of the binary variable y (heart attack yes/no) mean?
Slide 15
To limit predictions of y to the interval [0, 1], a logistic function is fitted to the data.
[Plot: binary variable y and fitted logistic function against x; vertical axis: probability]
The logistic function can be interpreted as the probability of the outcome 1 of the variable y.
Instead of predicting the value of the variable y from the independent variables xi,
the probability that y = 1 occurs is predicted.
Logistic function = 1 / (1 + e^(-z)), with z = Function(x)
0 < Logistic function < 1
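A minimal Python sketch (not from the slides) illustrating that the logistic function maps any real z into the interval (0, 1):
import math
def logistic(z):
    # Logistic function: 1 / (1 + e^(-z)), always strictly between 0 and 1
    return 1 / (1 + math.exp(-z))
for z in (-10, -2, 0, 2, 10):
    print(z, round(logistic(z), 4))   # -10 -> ~0.0, 0 -> 0.5, 10 -> ~1.0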
Slide 16
Logistic regression model
Probability(y = 1) = 1 / (1 + e^(-z))
z = β0 + β1 · activity + β2 · dweight + β3 · press + β4 · stress + u

Probability(heart attack = 1) = dependent variable
activity, ..., stress = independent variables
β0, ..., β4 = coefficients
u = error term

1 / (1 + e^(-z)) is also called the link function.
z is also called the logit.
The logit is a linear multiple regression model of the independent variables.
Slide 17
Odds ("Wettquote")
Example: horse betting
The probability of a horse winning a race is set in relation to the probability of losing.
This ratio is called odds.
Example:
In a horse race, the probability of a particular horse winning is 0.1
Probability of winning: 0.1
Probability of losing: 1 - 0.1 = 0.9
odds = 0.1 / (1 - 0.1) = 0.1 / 0.9 = 1/9   "odds 1 to 9", alternatively "odds 10% to 90%"
General equation
Probability that an event occurs: P(Event occurs)
Probability that an event does not occur: 1 - P(Event occurs)
odds = P(Event occurs) / (1 - P(Event occurs))
Odds are actually relative probabilities.
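A one-line check of the horse-betting example (Python sketch, not part of the slides):
p_win = 0.1
odds = p_win / (1 - p_win)
print(odds)   # 0.111..., i.e. "odds 1 to 9"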
Slide 18
Odds ratio
Example of hair color in northern Germany
            Women   Men
blond         55%   48%
not blond     45%   52%
             100%  100%

Probability of blond hair:       Women 0.55, Men 0.48
Probability of not blond hair:   Women 1 - 0.55 = 0.45, Men 1 - 0.48 = 0.52

odds_W = 0.55 / (1 - 0.55) = 0.55 / 0.45 = 1.23
odds_M = 0.48 / (1 - 0.48) = 0.48 / 0.52 = 0.92

Odds ratio
odds_Ratio = odds_W / odds_M = 1.23 / 0.92 = 1.33  =>  odds_W = 1.33 · odds_M
The relative probability of having blond hair is 1.33 times as great in women as in men.
www.haar-und-psychologie.de (Date of access: April, 2014)
Sample of men and women in northern Germany (n is unknown)
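The same calculation for the hair-color example, as a small Python sketch (values taken from the table above):
p_women, p_men = 0.55, 0.48
odds_women = p_women / (1 - p_women)     # 1.22...
odds_men = p_men / (1 - p_men)           # 0.92...
print(round(odds_women / odds_men, 2))   # 1.32 (the slide reports 1.33 because the odds are rounded first)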
Slide 19
What is the connection between odds and logistic regression?
Rewriting the logistic function:
P(y=1) = 1 / (1 + e^(-z))

1 + e^(-z) = 1 / P(y=1)

e^(-z) = 1 / P(y=1) - 1 = (1 - P(y=1)) / P(y=1)          form the inverse value:

e^z = P(y=1) / (1 - P(y=1))

Compare to the definition of odds from slide 17:

P(y=1) / (1 - P(y=1)) = odds

⇒ e^z = odds
  ln(e^z) = ln(odds)
  z = ln(odds)          z is called logit <=> taking the logarithm of the odds
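A numerical check of this rewriting (Python sketch, not part of the slides): starting from a probability, compute the odds and the logit; applying the logistic function to the logit recovers the original probability.
import math
p = 0.8                           # some probability P(y = 1)
odds = p / (1 - p)                # 4.0
z = math.log(odds)                # logit = ln(odds) = 1.386...
p_back = 1 / (1 + math.exp(-z))   # applying the logistic function to z
print(odds, round(z, 3), round(p_back, 3))   # 4.0 1.386 0.8 -> confirms e^z = odds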
Slide 20
Logistic Regression Analysis with SPSS: An Example
What Causes a Heart Attack?
Dataset and variables
Data
Sample of n = 200 male patients in a hospital
Variables
Dependent
heart = got a heart attack (yes = 1 / no = 0)
Independent
activity = physical activity (1 to 6)
dweight = difference to normal weight (kg)
press = blood pressure (metric)
stress = stress level (1 to 6)
Model
Logistic regression model
Slide 21
SPSS: Analyze → Regression → Binary Logistic...
Slide 22
SPSS output – Omnibus test
The model as a whole is significant (Omnibus test: χ2 = 99.278, df = 4, p = .000).
The omnibus test corresponds to the F test in linear regression.
The null hypothesis (H0) states that only the constant has an effect.
The alternative hypothesis (HA) states that this is not the case.
H0: β0 ≠ 0, β1 = … = βp = 0
HA: At least one of the coefficients β1, …, βp is not zero.
The omnibus test compares two models:
◦ Base model: Only the constant β0 is in the model (sometimes also called "null model")
◦ Model: All coefficients β0, β1, …, βp are in the model, which is the "final" model
If we do not use STEPWISE, "Step", "Block" and "Model" are the same.
Slide 23
Maximum likelihood estimation (MLE)
MLE maximizes a "likelihood"-function, which
reflects how likely it is that the values of the
dependent variable can be predicted from the
values of the independent variables.
Sometimes the logarithm of the likelihood function is used → abbreviation is LL
The value LL is negative
By convention "-2 times the log likelihood (-2LL)" is used for estimating how well the model fits.
Actually -2LL is a type of error term
Both models have associated log likelihood values:
◦ Base model: -2LL(base model)
◦ Model: -2LL(model)
The test statistic "-2LL(base model) - (-2LL(model))" follows a chi-square distribution.
See multiple regression analysis:
Ordinary least squares (OLS) estimation
minimizes the sum of squared distances
of the data points to the regression line.
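A small Python sketch (not from the slides; the tiny dataset is made up for illustration) of how -2LL is computed from observed outcomes and predicted probabilities, and why a better-fitting model has a smaller value:
import math
def neg2_log_likelihood(y, p):
    # -2 times the log likelihood for binary outcomes y (0/1) and predicted probabilities p
    ll = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p))
    return -2 * ll
y = [0, 0, 1, 1]
print(round(neg2_log_likelihood(y, [0.5, 0.5, 0.5, 0.5]), 2))   # "base model": 5.55
print(round(neg2_log_likelihood(y, [0.1, 0.2, 0.8, 0.9]), 2))   # better predictions: 1.31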
Slide 24
χ² = -2LL(base model) - (-2LL(model))
χ² = 276.939 - 177.661 = 99.278
p < 5%: The null hypothesis (H0) can be rejected.
The model as a whole is significant (Omnibus test: χ2 = 99.278, df = 4, p = .000).
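The p-value of the omnibus test can be reproduced from these numbers, for example with scipy (a sketch under the assumption that scipy is available):
from scipy.stats import chi2
chi_sq = 276.939 - 177.661           # -2LL(base model) - (-2LL(model)) = 99.278
df = 4                               # number of predictors in the model
print(round(chi_sq, 3), chi2.sf(chi_sq, df))   # p-value is far below 0.05 -> reject H0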
Slide 25
SPSS output – Model summary (Goodness of fit)
The Cox & Snell R² and the Nagelkerke R² are logistic analogies to R² in OLS regression.
Cox & Snell R Square
R²_CS = 1 - [L(base model) / L(model)]^(2/n)
R²_CSmax = 1 - [L(base model)]^(2/n)
Nagelkerke R Square
The Nagelkerke R² measure adapts the Cox & Snell measure so that it varies from 0 to 1.
R²_N = R²_CS / R²_CSmax
The higher the value of R2, the better the model fits the data.
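Using the -2LL values from slide 24 and n = 200, both measures can be reproduced by hand (Python sketch, not part of the slides; the results should correspond to SPSS's Model Summary table, which is not reproduced in this transcript):
import math
n = 200
neg2ll_base = 276.939
neg2ll_model = 177.661
# Cox & Snell: R^2 = 1 - [L(base)/L(model)]^(2/n), rewritten in terms of -2LL
r2_cs = 1 - math.exp(-(neg2ll_base - neg2ll_model) / n)
r2_cs_max = 1 - math.exp(-neg2ll_base / n)
r2_nagelkerke = r2_cs / r2_cs_max
print(round(r2_cs, 3), round(r2_nagelkerke, 3))   # roughly 0.39 and 0.52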
Slide 26
SPSS output – Classification table
Frequencies of variable heart – Analyze → Descriptive Statistics → …
SPSS uses the regression equation to predict whether a case is 0 or 1.
Default is a cutoff value of 0.5
73 of 96 (= 76.0%) outcomes 0 of the variable heart are predicted correctly.
85 of 104 (= 81.7%) outcomes 1 of the variable heart are predicted correctly.
Roughly 50% No and 50% Yes => cutoff value 0.5
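The percentages can be verified directly (Python sketch; the counts are the ones given above):
correct_no, total_no = 73, 96      # observed heart = 0
correct_yes, total_yes = 85, 104   # observed heart = 1
print(round(100 * correct_no / total_no, 1))     # 76.0 % of the "no" cases classified correctly
print(round(100 * correct_yes / total_yes, 1))   # 81.7 % of the "yes" cases classified correctly
print(round(100 * (correct_no + correct_yes) / (total_no + total_yes), 1))   # 79.0 % overall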
Slide 27
SPSS output – Classification plot (Histogram of predicted probabilities)
[Classification plot: histogram of predicted probabilities with the observed values as plot symbols; cut value p = .5; correct predictions for No (= N) and for Yes (= Y) are marked]
Slide 28
SPSS output – Variables in the equation
Wald test
Wald = (B / s.e.)²
Similar to the t-statistic in linear regression analysis
Tests the null hypothesis that B = 0
Regression coefficient B (the βs in the regression equation)
Probability(heart attack = 1) = 1 / (1 + e^(-z))
z = -.784 · activity + .057 · dweight + .038 · press + .568 · stress
The constant is not significant, so β0 is set to 0.
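A minimal Python sketch of the Wald statistic (the standard error below is hypothetical, since the standard errors are not reproduced in this transcript):
def wald(b, se):
    # Wald test statistic for a single coefficient: (B / s.e.)^2,
    # evaluated against a chi-square distribution with 1 degree of freedom
    return (b / se) ** 2
print(round(wald(-0.784, 0.23), 2))   # e.g. 11.62 for B = -.784 with a hypothetical s.e. of 0.23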
Slide 29
Interpretation of the coefficients by means of odds ratios Exp(B)
Exp(B) corresponds to an odds ratio (Note: All other variables are held constant).
Exp(B) expresses the odds ratio after a unit change in the independent variable:
Exp(B) = e^β = Odds ratio = Odds after a unit change / Odds before a unit change = Odds_after / Odds_before
=> Odds_after = Exp(B) · Odds_before
Example activity: If physical activity is increased by one unit, the odds of getting
a heart attack are multiplied by a factor of .457 => Odds_after = .457 · Odds_before
That is, the odds are lowered by 54.3% (.457 - 1 = -.543).
(If Odds_before is set to 100%, this corresponds to a decrease from 100% to 45.7%.)
Example dweight: If the difference to normal weight is increased by one unit, the odds of getting
a heart attack are multiplied by a factor of 1.058 => Odds_after = 1.058 · Odds_before
That is, the odds are raised by 5.8% (1.058 - 1 = .058).
(If Odds_before is set to 100%, this corresponds to an increase from 100% to 105.8%.)
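Both factors follow directly from the coefficients (Python sketch, not part of the slides; the small difference for dweight comes from rounding B to three decimals):
import math
print(round(math.exp(-0.784), 3))   # 0.457 -> odds of a heart attack lowered by 54.3% per unit of activity
print(round(math.exp(0.057), 3))    # 1.059; the slide reports 1.058 (rounding of B)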
Slide 30
Confidence interval for Exp(B)
In addition, SPSS estimates confidence intervals for the odds ratios:
◦ Entire interval above 1 → positive association between xi and y
◦ Entire interval below 1 → negative association between xi and y
◦ Interval contains 1 → no association between xi and y
Example activity: The confidence interval of Exp(B) at the 95% confidence level ranges from
.290 to .718. Thus, it does not include 1 and a negative association between activity and heart
can be assumed.
Interpretation of the sign of the regression coefficient B

B       Exp(B) (= odds ratio)   Odds_after = P(y=1)/P(y=0)    P(y=1)
B > 0   e^B > 1                 increases by factor Exp(B)    increases
B = 0   e^B = 1                 no change                     no change
B < 0   e^B < 1                 decreases by factor Exp(B)    decreases
Slide 31
Appendix – Interpretation of the Coefficients
Example with one predictor x: Logistic function = 1 / (1 + e^(-z)) with z = β0 + β1 · x + u

[Two plots of the logistic function (probability against x): left panel with varying values of β0 (-2, 0, 2) while β1 = 0.5; right panel with varying values of β1 (0.2, 0.5, 1.3) while β0 = 0]

β0 shifts the function to the left or to the right.
β0 determines the value of x at which the predicted outcome switches from 0 to 1.
β1 changes the slope of the function.
If the value of |β1| is higher, the prediction of the dependent variable is more selective.
Slide 32
The sign of the coefficients is interpreted in the same way as in linear regression.
[Plot of two logistic functions (probability against x), one with β1 = -0.4 and one with β1 = 0.4]
Positive sign
Persons with higher values of the predictor variable have a higher probability that y = 1.
Example: More overweight ⇒ higher probability of suffering a heart attack.
Negative sign
Persons with higher values of the predictor variable have a lower probability that y = 1.
Example: Higher level of activity ⇒ lower probability of suffering a heart attack.
Slide 33
What does a non-significant constant mean?
In our estimated logistic regression model, the constant is not significant:
z = -.784 · activity + .057 · dweight + .038 · press + .568 · stress
If not only the constant, but also all other predictors were not significant (β0 = 0 and βi = 0 for i > 0), then z = 0.
Probability(y=1) = 1 / (1 + e^(-z)) = 1 / (1 + e^0) = 1 / (1 + 1) = 1/2
In this case, the predictors (xi) do not provide a
basis for predicting a person's value on y. Only
the distribution of y can be used for prediction.
If roughly half of the persons have a value of 1
on y, the estimated probability will be .5 and the
constant thus will be non-significant.
This represents a kind of "mean" probability.
Slide 34
Variable selection methods (Source: SPSS help)
Enter: A procedure for variable selection in which all variables are entered in a single step.
Forward Selection (Conditional): Stepwise selection method with entry testing based on the significance of
the score statistic, and removal testing based on the probability of a likelihood-ratio statistic using conditional
parameter estimates.
Forward Selection (Likelihood Ratio): Stepwise selection method with entry testing based on the significance of the score statistic, and removal testing based on the probability of a likelihood-ratio statistic using the maximum partial likelihood estimates.
Forward Selection (Wald): Stepwise selection method with entry testing based on the significance of the
score statistic, and removal testing based on the probability of the Wald statistic.
Backward Elimination (Conditional): Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic using conditional parameter estimates.
Backward Elimination (Likelihood Ratio): Backward stepwise selection. Removal testing is based on the
probability of the likelihood-ratio statistic using the maximum partial likelihood estimates.
Backward Elimination (Wald): Backward stepwise selection. Removal testing is based on the probability of
the Wald statistic.