TRANSCRIPT
Research Methodology: Tools
Applied Data Analysis (with SPSS)
Lecture 08: Logistic Regression Analysis
April 2014
Prof. Dr. Jürg Schwarz
Lic. phil. Heidi Bruderer Enzler
MSc Business Administration
Slide 2
Contents
Aims of the Lecture (Slide 3)
Typical Syntax (Slide 4)
Introduction (Slide 5)
  Example (Slide 5)
Outline (Slide 10)
Concepts of Logistic Regression Analysis (Slide 11)
  Logistic Regression Model (Slide 13)
  Odds ("Wettquote") (Slide 17)
Logistic Regression Analysis with SPSS: An Example (Slide 20)
Appendix – Interpretation of the Coefficients (Slide 31)
Slide 3
Aims of the Lecture
You will understand the key steps in conducting a logistic regression analysis.
You will understand the concept of a logistic function.
You will understand the concept of logit and odds.
You can conduct a logistic regression analysis with SPSS.
In particular, you will know how to …
◦ interpret the output
omnibus test
goodness of fit
coefficients
classification
◦ describe the output
Slide 4
Typical Syntax
Boxplot of variable activity split by heart (legacy dialogs):
EXAMINE VARIABLES=activity BY heart
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.
Logistic regression analysis of the dependent variable heart on the independent variables,
with a classification plot and confidence intervals for Exp(B):
LOGISTIC REGRESSION VARIABLES heart
  /METHOD=ENTER activity dweight press stress
  /CLASSPLOT
  /PRINT=CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
Slide 5
Introduction
Example
Medical research: Statistics on the cause of heart attack.
Sample: 200 male patients in a hospital.
Variables
Dependent
heart = got heart attack (yes = 1 / no = 0)
Independent
activity = physical activity (1 to 6)
dweight = difference to normal weight (kg)
press = blood pressure (mm Hg)
stress = stress level (1 to 6)
Typical questions
How do individual characteristics influence the probability of getting a heart attack?
How can this impact be modeled?
How strong is the impact of each factor?
[Boxplot: physical activity split by got heart attack (no/yes)]
Slide 6
Relation between independent variables and "probability" of getting a heart attack
[Boxplots of blood pressure, stress level, physical activity, and difference to normal weight, each split by got heart attack (no/yes)]
Slide 7
Questions
Question in everyday language:
How do individual characteristics influence the probability of getting a heart attack?
Research question:
Is there an impact of the factors …
◦ physical activity
◦ difference to normal weight
◦ blood pressure
◦ stress level
… on the probability of getting a heart attack?
How strong is the impact of the factors?
Is there a model?
Is logistic regression analysis the right model?
Statistical question:
H0: "No model" (= No overall model and no significant coefficients)
HA: "Model" (= Overall model and significant coefficients)
Can we reject H0?
Slide 8
Solution
Logistic regression model with the probability of getting a heart attack as the dependent variable:
Probability(heart attack = 1) = 1 / (1 + e^(-z))
z = β0 + β1 · activity + β2 · dweight + β3 · press + β4 · stress + u

=> Probability(heart attack = 1) = 1 / (1 + e^(-(β0 + β1 · activity + β2 · dweight + β3 · press + β4 · stress + u)))

Probability(heart attack = 1) = dependent variable
activity, ..., stress = independent variables
β0, ..., β4 = coefficients
u = error term
e = Euler's number = 2.718...
"How to" in SPSS
SPSS: Analyze → Regression → Binary Logistic...
Slide 9
Results
The model as a whole is significant (table not shown).
Coefficients are significant (the constant is not, but this is not a problem).
z = -.784 · activity + .057 · dweight + .038 · press + .568 · stress
Example interpretation for stress:
An increase of the stress level by one unit alters the relative probability of getting a heart attack
by a factor of 1.765, i.e. it is increased by 76.5% (1.765 - 1 = .765).
Comparable with R2 of linear regression
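As a quick check of this interpretation, a minimal Python sketch (not part of the original slides; the coefficient value is taken from the output above):
import math
b_stress = 0.568                  # coefficient for stress from the estimated model
odds_ratio = math.exp(b_stress)   # Exp(B): the factor by which the odds change per unit of stress
print(round(odds_ratio, 3))       # 1.765
print(round(odds_ratio - 1, 3))   # 0.765, i.e. the odds increase by 76.5%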
Slide 10
Outline
Basic situation
Given: A binary dependent variable which is coded as 0/1 and metric or categorical (coded as
dummy variables) independent variables
Task: Find a relationship between the characteristics.
Logistic regression analysis
Postulation of a logistic model:
Probability(event occurs) = Function(independent variables)
The function is a logistic function: 1 / (1 + e^(-z))
Logistic regression analysis
◦ Examines the relationship between a type of probability of the dependent variable y taking the value 1 and the independent variables xi.
◦ Uses inferential statistics to estimate the parameters
Differences between linear regression analysis and logistic regression analysis
◦ A type of probability of the dependent variable is predicted instead of the value itself
◦ Many assumptions are not as restrictive as in OLS regression analysis
Slide 11
Concepts of Logistic Regression Analysis
Compare with Lecture 06 "Introduction to Regression Analysis"
◦ Key steps involved in using a regression analysis:
1. Formulation of the model
2. Estimation of the model
3. Verification of the model
4. Considering other aspects
5. Testing of assumptions
6. Interpretation of the model and reporting
◦ General purpose of regression
These concepts remain the same when applied to logistic regression analysis.
What is different?
◦ Key steps 4 to 6 are treated a little differently
◦ Logistic regression => dependent variable nominal (binary)
◦ Concept of logistic function
◦ Concept of odds and odds ratio
◦ Maximum likelihood estimation (MLE) instead of ordinary least squares estimation (OLS)
Slide 12
Types of Logistic Regression Analysis
(Binary) logistic regression
Dependent variable is nominal and consists of two categories
Example: The example discussed above. Dependence of
getting a heart attack on individual characteristics (see above)
Extensions of binary logistic regression
Multinomial logistic regression
Dependent variable is nominal and consists of more than two categories
Example: The relationship between drug abuse (none, marijuana, hard drugs) and sex, age,
and a factor representing peer norms about drug abuse
Ordinal logistic regression
Dependent variable is ordinal
Example: The relationship between the soda size (small, medium, large or extra large) that
people order at a fast-food chain and type of sandwich (beef or chicken), and the age of the
consumer
Slide 13
Logistic Regression Model
Step-by-step
Dependent binary variable y (e.g. heart attack yes/no)
Independent metric variable x (e.g. stress level)
[Scatterplot of the binary variable y against x, together with an excerpt of the dataset (cases with x values from 1 to 6 and y values 0/1)]
In this sample, roughly one half of the cases have y = 1.
Slide 14
Conduct a linear regression
(Avoid thinking about all the violated assumptions!)
[Plot: binary variable y against x with the fitted linear regression line y = β0 + β1 · x]
The model is OK (F test, t test), but what about predictions based on this model?
What does a value of -0.1 or +1.1 of the binary variable y (heart attack yes/no) mean?
Slide 15
To limit predictions of y to the interval [0, 1], a logistic function is fitted to the data.
[Plot: binary variable y and fitted logistic function against x; vertical axis: probability]
The logistic function can be interpreted as the probability of the outcome 1 of the variable y.
Instead of predicting the value of the variable y from the independent variables xi,
the probability that y = 1 occurs is predicted.
Logistic function = 1 / (1 + e^(-z)), with z = Function(x)
0 < Logistic function < 1
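A minimal Python sketch (not from the slides) illustrating that the logistic function maps any real z into the interval (0, 1):
import math
def logistic(z):
    # Logistic function: 1 / (1 + e^(-z)), always strictly between 0 and 1
    return 1 / (1 + math.exp(-z))
for z in (-10, -2, 0, 2, 10):
    print(z, round(logistic(z), 4))   # -10 -> ~0.0, 0 -> 0.5, 10 -> ~1.0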
Slide 16
Logistic regression model
Probability(y = 1) = 1 / (1 + e^(-z))
z = β0 + β1 · activity + β2 · dweight + β3 · press + β4 · stress + u

Probability(heart attack = 1) = dependent variable
activity, ..., stress = independent variables
β0, ..., β4 = coefficients
u = error term

1 / (1 + e^(-z)) is also called the link function.
z is also called the logit.
The logit is a linear multiple regression model of the independent variables.
Slide 17
Odds ("Wettquote")
Example: horse betting
The probability of a horse winning a race is set in relation to the probability of losing.
This ratio is called odds.
Example:
In a horse race, the probability of a particular horse winning is 0.1
Probability of winning: 0.1
Probability of losing: 1 - 0.1 = 0.9
odds = 0.1 / (1 - 0.1) = 0.1 / 0.9 = 1/9   "odds 1 to 9", alternatively "odds 10% to 90%"
General equation
Probability that an event occurs: P(Event occurs)
Probability that an event does not occur: 1 - P(Event occurs)
odds = P(Event occurs) / (1 - P(Event occurs))
Odds are actually relative probabilities.
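A one-line check of the horse-betting example (Python sketch, not part of the slides):
p_win = 0.1
odds = p_win / (1 - p_win)
print(odds)   # 0.111..., i.e. "odds 1 to 9"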
Slide 18
Odds ratio
Example of hair color in northern Germany
            Women   Men
blond         55%   48%
not blond     45%   52%
             100%  100%

Probability of blond hair:       Women 0.55, Men 0.48
Probability of not blond hair:   Women 1 - 0.55 = 0.45, Men 1 - 0.48 = 0.52

odds_W = 0.55 / (1 - 0.55) = 0.55 / 0.45 = 1.23
odds_M = 0.48 / (1 - 0.48) = 0.48 / 0.52 = 0.92

Odds ratio
odds_Ratio = odds_W / odds_M = 1.23 / 0.92 = 1.33  =>  odds_W = 1.33 · odds_M
The relative probability of having blond hair is 1.33 times as great in women as in men.
www.haar-und-psychologie.de (Date of access: April, 2014)
Sample of men and women in northern Germany (n is unknown)
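The same calculation for the hair-color example, as a small Python sketch (values taken from the table above):
p_women, p_men = 0.55, 0.48
odds_women = p_women / (1 - p_women)     # 1.22...
odds_men = p_men / (1 - p_men)           # 0.92...
print(round(odds_women / odds_men, 2))   # 1.32 (the slide reports 1.33 because the odds are rounded first)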
Slide 19
What is the connection between odds and logistic regression?
Rewriting the logistic function:
P(y=1) = 1 / (1 + e^(-z))

1 + e^(-z) = 1 / P(y=1)

e^(-z) = 1 / P(y=1) - 1 = (1 - P(y=1)) / P(y=1)          form the inverse value:

e^z = P(y=1) / (1 - P(y=1))

Compare to the definition of odds from slide 17:

P(y=1) / (1 - P(y=1)) = odds

⇒ e^z = odds
  ln(e^z) = ln(odds)
  z = ln(odds)          z is called logit <=> taking the logarithm of the odds
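A numerical check of this rewriting (Python sketch, not part of the slides): starting from a probability, compute the odds and the logit; applying the logistic function to the logit recovers the original probability.
import math
p = 0.8                           # some probability P(y = 1)
odds = p / (1 - p)                # 4.0
z = math.log(odds)                # logit = ln(odds) = 1.386...
p_back = 1 / (1 + math.exp(-z))   # applying the logistic function to z
print(odds, round(z, 3), round(p_back, 3))   # 4.0 1.386 0.8 -> confirms e^z = odds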
Slide 20
Logistic Regression Analysis with SPSS: An Example
What Causes a Heart Attack?
Dataset and variables
Data
Sample of n = 200 male patients in a hospital
Variables
Dependent
heart = got a heart attack (yes = 1 / no = 0)
Independent
activity = physical activity (1 to 6)
dweight = difference to normal weight (kg)
press = blood pressure (metric)
stress = stress level (1 to 6)
Model
Logistic regression model
Slide 21
SPSS: Analyze → Regression → Binary Logistic...
Slide 22
SPSS output – Omnibus test
The model as a whole is significant (Omnibus test: χ2 = 99.278, df = 4, p = .000).
The omnibus test corresponds to the F test in linear regression.
The null hypothesis (H0) states that only the constant has an effect.
The alternative hypothesis (HA) states that this is not the case.
H0: β0 ≠ 0, β1 = … = βp = 0
HA: At least one of the coefficients β1, …, βp is not zero.
The omnibus test compares two models:
◦ Base model: Only the constant β0 is in the model (sometimes also called "null model")
◦ Model: All coefficients β0, β1, …, βp are in the model, which is the "final" model
If we do not use STEPWISE, "Step", "Block" and "Model" are the same.
Slide 23
Maximum likelihood estimation (MLE)
MLE maximizes a "likelihood"-function, which
reflects how likely it is that the values of the
dependent variable can be predicted from the
values of the independent variables.
Sometimes the logarithm of the likelihood function is used → abbreviation is LL
The value LL is negative
By convention "-2 times the log likelihood (-2LL)" is used for estimating how well the model fits.
Actually -2LL is a type of error term
Both models have associated log likelihood values:
◦ Base model: -2LL(base model)
◦ Model: -2LL(model)
The test statistic "-2LL(base model) - (-2LL(model))" follows a chi-square distribution.
See multiple regression analysis:
Ordinary least squares (OLS) estimation
minimizes the sum of squared distances
of the data points to the regression line.
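A small Python sketch (not from the slides; the tiny dataset is made up for illustration) of how -2LL is computed from observed outcomes and predicted probabilities, and why a better-fitting model has a smaller value:
import math
def neg2_log_likelihood(y, p):
    # -2 times the log likelihood for binary outcomes y (0/1) and predicted probabilities p
    ll = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p))
    return -2 * ll
y = [0, 0, 1, 1]
print(round(neg2_log_likelihood(y, [0.5, 0.5, 0.5, 0.5]), 2))   # "base model": 5.55
print(round(neg2_log_likelihood(y, [0.1, 0.2, 0.8, 0.9]), 2))   # better predictions: 1.31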
Slide 24
χ² = -2LL(base model) - (-2LL(model))
χ² = 276.939 - 177.661 = 99.278
p < 5%: The null hypothesis (H0) can be rejected.
The model as a whole is significant (Omnibus test: χ2 = 99.278, df = 4, p = .000).
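The p-value of the omnibus test can be reproduced from these numbers, for example with scipy (a sketch under the assumption that scipy is available):
from scipy.stats import chi2
chi_sq = 276.939 - 177.661           # -2LL(base model) - (-2LL(model)) = 99.278
df = 4                               # number of predictors in the model
print(round(chi_sq, 3), chi2.sf(chi_sq, df))   # p-value is far below 0.05 -> reject H0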
Slide 25
SPSS output – Model summary (Goodness of fit)
The Cox & Snell R² and the Nagelkerke R² are logistic analogies to R² in OLS regression.
Cox & Snell R Square
R²_CS = 1 - [L(base model) / L(model)]^(2/n)
R²_CSmax = 1 - [L(base model)]^(2/n)
Nagelkerke R Square
The Nagelkerke R² measure adapts the Cox & Snell measure so that it varies from 0 to 1.
R²_N = R²_CS / R²_CSmax
The higher the value of R2, the better the model fits the data.
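Using the -2LL values from slide 24 and n = 200, both measures can be reproduced by hand (Python sketch, not part of the slides; the results should correspond to SPSS's Model Summary table, which is not reproduced in this transcript):
import math
n = 200
neg2ll_base = 276.939
neg2ll_model = 177.661
# Cox & Snell: R^2 = 1 - [L(base)/L(model)]^(2/n), rewritten in terms of -2LL
r2_cs = 1 - math.exp(-(neg2ll_base - neg2ll_model) / n)
r2_cs_max = 1 - math.exp(-neg2ll_base / n)
r2_nagelkerke = r2_cs / r2_cs_max
print(round(r2_cs, 3), round(r2_nagelkerke, 3))   # roughly 0.39 and 0.52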
Slide 26
SPSS output – Classification table
Frequencies of variable heart – Analyze → Descriptive Statistics → …
SPSS uses the regression equation to predict whether a case is 0 or 1.
Default is a cutoff value of 0.5
73 of 96 (= 76.0%) outcomes 0 of the variable heart are predicted correctly.
85 of 104 (= 81.7%) outcomes 1 of the variable heart are predicted correctly.
Roughly 50% No and 50% Yes => cutoff value 0.5
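The percentages can be verified directly (Python sketch; the counts are the ones given above):
correct_no, total_no = 73, 96      # observed heart = 0
correct_yes, total_yes = 85, 104   # observed heart = 1
print(round(100 * correct_no / total_no, 1))     # 76.0 % of the "no" cases classified correctly
print(round(100 * correct_yes / total_yes, 1))   # 81.7 % of the "yes" cases classified correctly
print(round(100 * (correct_no + correct_yes) / (total_no + total_yes), 1))   # 79.0 % overall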
Slide 27
SPSS output – Classification plot (Histogram of predicted probabilities)
[Classification plot: histogram of predicted probabilities with the observed values as plot symbols; cut value p = .5; correct predictions for No (= N) and for Yes (= Y) are marked]
Slide 28
SPSS output – Variables in the equation
Wald test
Wald = (B / s.e.)²
Similar to the t-statistic in linear regression analysis
Tests the null hypothesis that B = 0
Regression coefficient B (the βs in the regression equation)
Probability(heart attack = 1) = 1 / (1 + e^(-z))
z = -.784 · activity + .057 · dweight + .038 · press + .568 · stress
The constant is not significant, so β0 is set to 0.
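A minimal Python sketch of the Wald statistic (the standard error below is hypothetical, since the standard errors are not reproduced in this transcript):
def wald(b, se):
    # Wald test statistic for a single coefficient: (B / s.e.)^2,
    # evaluated against a chi-square distribution with 1 degree of freedom
    return (b / se) ** 2
print(round(wald(-0.784, 0.23), 2))   # e.g. 11.62 for B = -.784 with a hypothetical s.e. of 0.23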
Slide 29
Interpretation of the coefficients by means of odds ratios Exp(B)
Exp(B) corresponds to an odds ratio (Note: All other variables are held constant).
Exp(B) expresses the odds ratio after a unit change in the independent variable:
Exp(B) = e^β = Odds ratio = Odds after a unit change / Odds before a unit change = Odds_after / Odds_before
=> Odds_after = Exp(B) · Odds_before
Example activity: If physical activity is increased by one unit, the odds of getting
a heart attack are multiplied by a factor of .457 => Odds_after = .457 · Odds_before
That is, the odds are lowered by 54.3% (.457 - 1 = -.543).
(If Odds_before is set to 100%, this corresponds to a decrease from 100% to 45.7%.)
Example dweight: If the difference to normal weight is increased by one unit, the odds of getting
a heart attack are multiplied by a factor of 1.058 => Odds_after = 1.058 · Odds_before
That is, the odds are raised by 5.8% (1.058 - 1 = .058).
(If Odds_before is set to 100%, this corresponds to an increase from 100% to 105.8%.)
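Both factors follow directly from the coefficients (Python sketch, not part of the slides; the small difference for dweight comes from rounding B to three decimals):
import math
print(round(math.exp(-0.784), 3))   # 0.457 -> odds of a heart attack lowered by 54.3% per unit of activity
print(round(math.exp(0.057), 3))    # 1.059; the slide reports 1.058 (rounding of B)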
Slide 30
Confidence interval for Exp(B)
In addition, SPSS estimates confidence intervals for the odds ratios:
◦ Entire interval above 1 → positive association between xi and y
◦ Entire interval below 1 → negative association between xi and y
◦ Interval contains 1 → no association between xi and y
Example activity: The confidence interval of Exp(B) at the 95% confidence level ranges from
.290 to .718. Thus, it does not include 1 and a negative association between activity and heart
can be assumed.
Interpretation of the sign of the regression coefficient B

B       Exp(B) (= odds ratio)   Odds_after = P(y=1)/P(y=0)    P(y=1)
B > 0   e^B > 1                 increases by factor Exp(B)    increases
B = 0   e^B = 1                 no change                     no change
B < 0   e^B < 1                 decreases by factor Exp(B)    decreases
Slide 31
Appendix – Interpretation of the Coefficients
Example with one predictor x: Logistic function = 1 / (1 + e^(-z)) with z = β0 + β1 · x + u

[Two plots of the logistic function (probability against x): left panel with varying values of β0 (-2, 0, 2) while β1 = 0.5; right panel with varying values of β1 (0.2, 0.5, 1.3) while β0 = 0]

β0 shifts the function to the left or to the right.
β0 determines the value of x at which the predicted outcome switches from 0 to 1.
β1 changes the slope of the function.
If the value of |β1| is higher, the prediction of the dependent variable is more selective.
Slide 32
The sign of the coefficients is interpreted in the same way as in linear regression.
[Plot of two logistic functions (probability against x), one with β1 = -0.4 and one with β1 = 0.4]
Positive sign
Persons with higher values of the predictor variable have a higher probability that y = 1.
Example: More overweight ⇒ higher probability of suffering a heart attack.
Negative sign
Persons with higher values of the predictor variable have a lower probability that y = 1.
Example: Higher level of activity ⇒ lower probability of suffering a heart attack.
Slide 33
What does a non-significant constant mean?
In our estimated logistic regression model, the constant is not significant:
z = -.784 · activity + .057 · dweight + .038 · press + .568 · stress
If not only the constant, but also all other predictors were not significant (β0 = 0 and βi = 0 for i > 0), then z = 0.
Probability(y=1) = 1 / (1 + e^(-z)) = 1 / (1 + e^0) = 1 / (1 + 1) = 1/2
In this case, the predictors (xi) do not provide a
basis for predicting a person's value on y. Only
the distribution of y can be used for prediction.
If roughly half of the persons have a value of 1
on y, the estimated probability will be .5 and the
constant thus will be non-significant.
This represents a kind of "mean" probability.
Slide 34
Variable selection methods (Source: SPSS help)
Enter: A procedure for variable selection in which all variables are entered in a single step.
Forward Selection (Conditional): Stepwise selection method with entry testing based on the significance of
the score statistic, and removal testing based on the probability of a likelihood-ratio statistic using conditional
parameter estimates.
Forward Selection (Likelihood Ratio): Stepwise selection method with entry testing based on the significance of the score statistic, and removal testing based on the probability of a likelihood-ratio statistic using the maximum partial likelihood estimates.
Forward Selection (Wald): Stepwise selection method with entry testing based on the significance of the
score statistic, and removal testing based on the probability of the Wald statistic.
Backward Elimination (Conditional): Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic using conditional parameter estimates.
Backward Elimination (Likelihood Ratio): Backward stepwise selection. Removal testing is based on the
probability of the likelihood-ratio statistic using the maximum partial likelihood estimates.
Backward Elimination (Wald): Backward stepwise selection. Removal testing is based on the probability of
the Wald statistic.