
Page 1: An Introduction to Logistic Regression For categorical Dependent Variables GV917

An Introduction to Logistic Regression
For Categorical Dependent Variables
GV917

Page 2: An Introduction to Logistic Regression For categorical Dependent Variables GV917

What do we do when the dependent variable in a regression is a dummy variable? Suppose we have the dummy variable

turnout: 1 – if a survey respondent turns out to vote; 0 – if they don’t vote

One thing we could do is simply run an ordinary least squares regression

Page 3: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Turnout   Interest
 1.00      4.00
  .00      1.00
 1.00      4.00
 1.00      3.00
 1.00      3.00
  .00      1.00
 1.00      3.00
 1.00      2.00
 1.00      3.00
  .00      1.00
  .00      2.00
 1.00      2.00
 1.00      2.00
  .00      1.00
 1.00      2.00
 1.00      4.00
  .00      2.00
 1.00      2.00
  .00      1.00
 1.00      1.00
  .00      2.00
 1.00      4.00
 1.00      1.00
 1.00      2.00
 1.00      1.00
  .00      2.00
 1.00      3.00
 1.00      3.00
 1.00      4.00
 1.00      3.00

Turnout: 1 = yes, 0 = no

Interest in the Election: 1 = not at all interested, 2 = not very interested, 3 = fairly interested, 4 = very interested

(N = 30)

Page 4: An Introduction to Logistic Regression For categorical Dependent Variables GV917

OLS Regression of Turnout on Interest – The Linear Probability Model

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .540(a)  .291       .266                .39930

a. Predictors: (Constant), interest

ANOVA(b)

Model 1       Sum of Squares   df   Mean Square    F        Sig.
Regression         1.836        1      1.836      11.513    .002(a)
Residual           4.464       28       .159
Total              6.300       29

a. Predictors: (Constant), interest
b. Dependent Variable: turnout

Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta      t      Sig.
(Constant)         .152             .177                            .856    .399
interest           .238             .070            .540           3.393    .002

a. Dependent Variable: turnout

Page 5: An Introduction to Logistic Regression For categorical Dependent Variables GV917

The Residuals of the OLS Turnout Regression

Casewise Diagnostics(a)

Case   Std. Residual   turnout   Predicted Value   Residual
1        -.264          1.00        1.1053          -.10526
2        -.977           .00         .3901          -.39009
3        -.264          1.00        1.1053          -.10526
4         .333          1.00         .8669           .13313
5         .333          1.00         .8669           .13313
6        -.977           .00         .3901          -.39009
7         .333          1.00         .8669           .13313
8         .930          1.00         .6285           .37152
9         .333          1.00         .8669           .13313
10       -.977           .00         .3901          -.39009
11      -1.574           .00         .6285          -.62848
12        .930          1.00         .6285           .37152
13        .930          1.00         .6285           .37152
14       -.977           .00         .3901          -.39009
15        .930          1.00         .6285           .37152
16       -.264          1.00        1.1053          -.10526
17      -1.574           .00         .6285          -.62848
18        .930          1.00         .6285           .37152
19       -.977           .00         .3901          -.39009
20       1.527          1.00         .3901           .60991
21      -1.574           .00         .6285          -.62848
22       -.264          1.00        1.1053          -.10526
23       1.527          1.00         .3901           .60991
24        .930          1.00         .6285           .37152
25       1.527          1.00         .3901           .60991
26      -1.574           .00         .6285          -.62848
27        .333          1.00         .8669           .13313
28        .333          1.00         .8669           .13313
29       -.264          1.00        1.1053          -.10526
30        .333          1.00         .8669           .13313

a. Dependent Variable: turnout

Page 6: An Introduction to Logistic Regression For categorical Dependent Variables GV917

What’s Wrong?

The model produces predicted probabilities greater than 1.0 (1.1053 for the most interested respondents), which makes no sense.

The t and F test statistics are not valid because the residuals do not satisfy the required assumptions: with a binary dependent variable their variance changes with the predicted value (heteroscedasticity).

We could correct for the heteroscedasticity, but a better option is to use a logistic regression model.

Page 7: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Some Preliminaries Needed for Logistic Regression

Odds Ratios

These are defined as the probability of an event occurring divided by the probability of it not occurring. Thus if p is the probability of an event:

Odds = p / (1 - p)

For example: in the 2005 British Election Study Face-to-Face survey 48.2 per cent of the sample were men and 51.8 per cent were women, so the odds of being a man were:

0.482 / 0.518 = 0.93, and the odds of being a woman were 0.518 / 0.482 = 1.07

Note that if the odds ratio was 1.00 it would mean that women were equally likely to appear in the survey as men.
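As a quick check, these odds can be reproduced with a couple of lines of Python (a sketch; the percentages are the ones quoted above):

    # Odds = p / (1 - p), using the 2005 BES gender shares quoted above
    p_men, p_women = 0.482, 0.518

    odds_men = p_men / (1 - p_men)        # 0.482 / 0.518, roughly 0.93
    odds_women = p_women / (1 - p_women)  # 0.518 / 0.482, roughly 1.07

    print(round(odds_men, 2), round(odds_women, 2))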

Page 8: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Log Odds

The natural logarithm of a number is the power to which we must raise e (2.718) to give the number in question.

So the natural logarithm of 100 is 4.605, because 100 = e^4.605

This can be written 100 = exp(4.605)

Similarly, the anti-log of 4.605 is 100, because e^4.605 = 100

In the 2005 BES study 70.5 per cent of men and 72.9 per cent of women voted.

The odds of men voting were 0.705/0.295 = 2.39, and the log odds were ln(2.39) = 0.8712

The odds of women voting were 0.729/0.271 = 2.69, and the log odds were ln(2.69) = 0.9896

Note that ln(1.0) = 0, so that when the odds ratio is 1.0 the log odds ratio is zero
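The same turnout figures can be turned into odds and log odds with Python's math module; these are the numbers quoted above, so this is just a check of the arithmetic:

    import math

    # 2005 BES turnout: 70.5 per cent of men and 72.9 per cent of women voted
    odds_men = 0.705 / 0.295        # roughly 2.39
    odds_women = 0.729 / 0.271      # roughly 2.69

    print(math.log(odds_men))       # about 0.87, the log odds of men voting
    print(math.log(odds_women))     # about 0.99, the log odds of women voting
    print(math.log(1.0))            # 0.0: odds of 1 correspond to log odds of 0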

Page 9: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Why Use Logarithms?

They have 3 advantages:

Odds vary from 0 to ∞, whereas log odds vary from -∞ to +∞ and are centered on 0. Odds less than 1 have negative log odds, and odds greater than 1 have positive log odds. This accords better with the real number line, which runs from -∞ to +∞.

Multiplying any two numbers together is equivalent to adding their logs. Thus logs make it possible to convert multiplicative models into additive models, a useful property in the case of logistic regression, which is a non-linear multiplicative model when not expressed in logs.

A useful statistic for evaluating the fit of models is -2*loglikelihood (also known as the deviance). The model has to be expressed in logarithms for this to work

Page 10: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Logistic Regression

ln[ p̂(y) / (1 - p̂(y)) ] = a + bXi

where p̂(y) is the predicted probability of being a voter and 1 - p̂(y) is the predicted probability of not being a voter.

If we express this in terms of anti-logs, or odds ratios, then

p̂(y) / (1 - p̂(y)) = exp(a + bXi)

and

p̂(y) = exp(a + bXi) / (1 + exp(a + bXi))
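Translated directly into Python, the last expression becomes a small function; a and b stand for whatever coefficients are eventually estimated, and the returned value always lies between 0 and 1:

    import math

    def predicted_probability(a, b, x):
        """p(y) = exp(a + b*x) / (1 + exp(a + b*x))"""
        z = a + b * x
        return math.exp(z) / (1 + math.exp(z))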

Page 11: An Introduction to Logistic Regression For categorical Dependent Variables GV917

The Logistic Function

The logistic function can never be greater than one or less than zero, so there are no impossible probabilities

It corrects for the problems with the test statistics

Page 12: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Estimating a Logistic Regression

In OLS regression the least squares solution can be defined analytically – there are equations called the Normal Equations which we use to find the values of a and b. In logistic regression there are no such equations. The solutions are derived iteratively – by a process of trial and error.

Doing this involves identifying a likelihood function. A likelihood is a measure of how typical a sample is of a given population. For example we can calculate how typical the ages of the students in this class are in comparison with students in the university as a whole. Applied to our regression problem we are working out how likely individuals are to be voters given their level of interest in the election and given values for the a and b coefficients.

We ‘try out’ different values of a and b, and maximum likelihood estimation identifies the values which are most likely to reproduce the distribution of voters and non-voters we see in the sample, given their levels of interest in the election.
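A minimal sketch of this idea in Python, using the 30 turnout and interest values from the data table above: a crude grid search that keeps whichever pair of a and b gives the largest log likelihood. Real software such as SPSS uses a faster iterative algorithm rather than an exhaustive grid, but the logic is the same.

    import math

    # Turnout (1 = voted, 0 = did not vote) and interest (1-4) for the 30 cases
    turnout  = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,1,1,1]
    interest = [4,1,4,3,3,1,3,2,3,1,2,2,2,1,2,4,2,2,1,1,2,4,1,2,1,2,3,3,4,3]

    def log_likelihood(a, b):
        """Log likelihood of the sample for a given intercept a and slope b."""
        ll = 0.0
        for y, x in zip(turnout, interest):
            p = 1 / (1 + math.exp(-(a + b * x)))   # predicted probability of voting
            ll += y * math.log(p) + (1 - y) * math.log(1 - p)
        return ll

    # 'Try out' candidate values of a and b and keep the best pair
    best = max(
        ((a / 100, b / 100) for a in range(-500, 1) for b in range(0, 301)),
        key=lambda ab: log_likelihood(*ab),
    )
    print(best)   # close to the SPSS estimates: a of about -2.58, b of about 1.74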

Page 13: An Introduction to Logistic Regression For categorical Dependent Variables GV917

A Note on Maximum Likelihood

Define the probability of getting a head when tossing a fair coin as p(H) = 0.5, so that the probability of a tail is p(T) = 1 - p(H) = 0.5. The probability of two heads followed by a tail is then:

P(H, H, T) = (0.5)(0.5)(0.5) = 0.125

We can get this outcome in 3 different ways (the tail can come first, second or third), so the probability of getting two heads and a tail without worrying about the sequence is 3 × 0.125 = 0.375

But suppose we did not know the value of p(H). We could ‘try out’ different values and see how well they fit an experiment consisting of three tosses of a coin. For example, if we thought p(H) = 0.4, then two heads and a tail would give

(0.4)(0.4)(0.6)(3)= 0.288.

If we thought it was 0.3 we would get: (0.3)(0.3)(0.7)(3) = 0.189
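The same arithmetic in a few lines of Python (the factor of 3 is the number of orderings of two heads and one tail):

    def likelihood(p):
        """Probability of two heads and one tail in three tosses, in any order."""
        return 3 * p * p * (1 - p)

    for p in (0.5, 0.4, 0.3):
        print(p, likelihood(p))   # 0.375, 0.288, 0.189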

Page 14: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Maximum Likelihood in General

More generally, we can write a likelihood function for this exercise:

LF = C × p² × (1 - p)

where p is the probability of getting a head and C is the number of ways the sequence can occur (here C = 3).

The maximum value of this function occurs at p = 2/3, the proportion of heads observed, making 2/3 the maximum likelihood estimate of p(H) given two heads and a tail. Of the three values tried on the previous slide, 0.5 comes closest to this and gives the largest likelihood.
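A short grid scan of this likelihood function shows where the maximum lies:

    # LF(p) = 3 * p^2 * (1 - p): likelihood of two heads and one tail in any order
    candidates = [i / 100 for i in range(1, 100)]
    best_p = max(candidates, key=lambda p: 3 * p * p * (1 - p))
    print(best_p)   # 0.67, the proportion of heads observed (2 out of 3)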

Page 15: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Explaining Variance

In OLS regression we defined the following expression:

Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)²

Or Total Variation = Explained Variation + Residual Variation

In logistic regression measures of the Deviance replace the sum of squares as the building blocks of measures of fit and statistical tests.

Page 16: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Deviance

Deviance measures are built from maximum likelihoods calculated using different models. For example, suppose we fit a model with no slope coefficient (b) but an intercept coefficient (a). We can call this model zero, because it has no predictors. We then fit a second model, called model one, which has both a slope and an intercept. We can form the ratio of the maximum likelihoods of these models:

Likelihood ratio = (maximum likelihood of model zero) / (maximum likelihood of model one)

Expressed in logs this becomes:

Log likelihood ratio = ln(maximum likelihood of model zero) - ln(maximum likelihood of model one)

Note that the log of the (likelihood ratio)² is the same as 2 × (log likelihood ratio), since ln(x²) = 2 ln(x)

The Deviance is defined as -2(log likelihood ratio)
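The arithmetic is straightforward once the two maximum likelihoods are known; a sketch in Python, with made-up likelihood values purely to show the calculation:

    import math

    ml_model_zero = 0.002   # hypothetical maximum likelihood, intercept-only model
    ml_model_one = 0.046    # hypothetical maximum likelihood, model with a predictor

    log_likelihood_ratio = math.log(ml_model_zero) - math.log(ml_model_one)
    deviance = -2 * log_likelihood_ratio
    print(deviance)   # the bigger this is, the more model one improves on model zero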

Page 17: An Introduction to Logistic Regression For categorical Dependent Variables GV917

What does this mean?

The maximum likelihood of model zero is analogous to the total variation in OLS and the maximum likelihood of model one is analogous to the explained variation. If the maximum likelihoods of models zero and one were the same, then the likelihood ratio would be 1 and the log likelihood ratio 0.

This would mean that model one was no better than model zero in accounting for turnout, so the deviance captures how much we improve things by taking into account interest in the election. The bigger the deviance the more the improvement

Page 18: An Introduction to Logistic Regression For categorical Dependent Variables GV917

SPSS Output from the Logistic Regression of Turnout

Omnibus Tests of Model Coefficients

Step 1    Chi-square   df   Sig.
  Step      10.757      1   .001
  Block     10.757      1   .001
  Model     10.757      1   .001

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          25.894(a)              .301                  .427

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table(a)

                            Predicted turnout
Observed turnout          .00      1.00     Percentage Correct
  .00                      5        4            55.6
  1.00                     3       18            85.7
  Overall Percentage                             76.7

a. The cut value is .500

Variables in the Equation

Step 1(a)       B        S.E.    Wald    df   Sig.   Exp(B)
  interest     1.742     .697    6.251    1   .012   5.708
  Constant    -2.582    1.302    3.934    1   .047    .076

a. Variable(s) entered on step 1: interest.
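For readers who want to check these numbers outside SPSS, here is a sketch in Python using the statsmodels package (an assumption: it is not part of the original slides) with the 30 cases from the data table; the coefficients, -2 log likelihood and odds ratios should come out close to the values above.

    import numpy as np
    import statsmodels.api as sm

    turnout  = np.array([1,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,1,1,1])
    interest = np.array([4,1,4,3,3,1,3,2,3,1,2,2,2,1,2,4,2,2,1,1,2,4,1,2,1,2,3,3,4,3])

    X = sm.add_constant(interest)        # adds the intercept column
    model = sm.Logit(turnout, X).fit()   # maximum likelihood estimation

    print(model.params)                  # constant about -2.58, interest about 1.74
    print(-2 * model.llf)                # about 25.9, the -2 log likelihood
    print(np.exp(model.params))          # the Exp(B) column: odds ratios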

Page 19: An Introduction to Logistic Regression For categorical Dependent Variables GV917

The Meaning of the Omnibus Test

Omnibus Tests of Model Coefficients

Step 1    Chi-square   df   Sig.
  Step      10.757      1   .001
  Block     10.757      1   .001
  Model     10.757      1   .001

SPSS starts by fitting what it calls Block 0, which is the model containing the constant term and no predictor variables. It then proceeds to Block 1 which fits the model and gives us another estimate of the likelihood function. These two can then be compared and the table shows a chi-square statistical test of the improvement in the model achieved by adding interest in the election to the equation. This chi-square statistic is significant at the 0.001 level. In a multiple logistic regression this table tells us how much all of the predictor variables improve things compared with model zero.

We have significantly improved on the baseline model by adding the variable interest to the equation

Page 20: An Introduction to Logistic Regression For categorical Dependent Variables GV917

The Model Summary Table

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          25.894(a)              .301                  .427

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

The -2 log likelihood statistic for our two-variable model appears in the table, but it is only really meaningful for comparing different models. The Cox and Snell and the Nagelkerke R squares are different ways of approximating the percentage of variance explained (the R square of multiple regression). The Cox and Snell statistic is problematic because it has a maximum value of 0.75. The Nagelkerke R square corrects this and has a maximum value of 1.0, so it is often the preferred measure.

Page 21: An Introduction to Logistic Regression For categorical Dependent Variables GV917

The Classification Table

Classification Table(a)

                            Predicted turnout
Observed turnout          .00      1.00     Percentage Correct
  .00                      5        4            55.6
  1.00                     3       18            85.7
  Overall Percentage                             76.7

a. The cut value is .500

The classification table tells us the extent to which the model correctly predicts the actual turnout, so it is another goodness of fit measure. The main diagonal, from top left to bottom right, contains the cases predicted correctly (5 + 18 = 23), whereas the off-diagonal cells contain the cases predicted incorrectly (4 + 3 = 7). So overall 76.7 per cent of the cases are predicted correctly.
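The table can be reproduced by classifying each case at the 0.5 cut value; a sketch in Python using the fitted coefficients and the 30 cases from the data table:

    import math

    turnout  = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,1,1,1]
    interest = [4,1,4,3,3,1,3,2,3,1,2,2,2,1,2,4,2,2,1,1,2,4,1,2,1,2,3,3,4,3]

    def prob_vote(x):
        """Predicted probability of voting for a given level of interest."""
        z = -2.582 + 1.742 * x
        return math.exp(z) / (1 + math.exp(z))

    predicted = [1 if prob_vote(x) >= 0.5 else 0 for x in interest]
    correct = sum(p == y for p, y in zip(predicted, turnout))
    print(correct, 100 * correct / len(turnout))   # 23 cases correct, 76.7 per cent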

Page 22: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Interpreting the Coefficients

Variables in the Equation

Step 1(a)       B        S.E.    Wald    df   Sig.   Exp(B)
  interest     1.742     .697    6.251    1   .012   5.708
  Constant    -2.582    1.302    3.934    1   .047    .076

a. Variable(s) entered on step 1: interest.

The column on the left gives the coefficients in the logistic regression model. It means that a unit change in the level of interest in the election increases the log odds of voting by 1.742. The standard error appears in the next column (0.697) and the Wald statistic in the third column. The latter is the t statistic squared (6.251), and as we can see it is significant at the 0.012 level. Finally, Exp(B) is the anti-log of the B column, so that e^1.742 = 5.708. This is the effect on the odds of voting of an increase in the level of interest in the election by one unit. Since odds ratios are a bit easier to understand than log odds ratios, the effects are often reported using these coefficients.
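The link between B and Exp(B) is just exponentiation, which is easy to verify:

    import math

    b = 1.742            # coefficient for interest, on the log odds scale
    print(math.exp(b))   # about 5.708: each extra unit of interest multiplies the odds of voting by this amount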

Page 23: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Making Sense of the Coefficients

ln[ p̂(y) / (1 - p̂(y)) ] = -2.582 + 1.742Xi

So that

p̂(y) = exp(-2.582 + 1.742Xi) / (1 + exp(-2.582 + 1.742Xi))

Page 24: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Translating into Probabilities

Suppose a person scores 4 on the interest in the election variable (they are very interested). Then according to the model the probability that they will vote is:

p̂(y) = exp(-2.582 + 1.742(4)) / (1 + exp(-2.582 + 1.742(4)))

p̂(y) = exp(4.386) / (1 + exp(4.386)) = 0.99

If they are not at all interested and score (1) then:

p̂(y) = exp(-0.84) / (1 + exp(-0.84)) = 0.30

Consequently a change from being not at all interested to being very interested increases the probability of voting by 0.99 - 0.30 = 0.69

Page 25: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Probabilities

Level of Interest   Probability of Voting
1                   0.30
2                   0.71
3                   0.93
4                   0.99
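These four probabilities come straight from the fitted equation; a sketch of the calculation in Python:

    import math

    def prob_vote(interest):
        """Predicted probability of voting from the fitted model."""
        z = -2.582 + 1.742 * interest
        return math.exp(z) / (1 + math.exp(z))

    for level in (1, 2, 3, 4):
        print(level, round(prob_vote(level), 2))   # 0.30, 0.71, 0.93, 0.99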

Page 26: An Introduction to Logistic Regression For categorical Dependent Variables GV917

Conclusions

Logistic regression allows us to model relationships when the dependent variable is a dummy variable.

It can be extended to multinomial logistic regression in which there are several categories – and this produces several sets of coefficients

The results are more reliable than if we had just used ordinary least squares regression