Multiple regression: Categorical dependent variables

Johan A. Elkink

School of Politics & International Relations

University College Dublin

27 November 2017

Outline

1 Binary dependent variables

2 Logistic regression

3 Interpretation

4 Model fit

Binary models

Binary models have a dependent variable consisting of two categories.

For example,

• Vote on a particular law

• Turning out in an election

• Approval in a referendum

• Bankrupt or not

Limited dependent variables

When a dependent variable is not continuous, or is truncated for some reason, a linear model would lead to implausible predictions.

For binary dependent variables we estimate the probability of observing a one:

• Predictions below 0 or above 1 would not make sense.

• For any case where the predicted probability is already high, it cannot increase much with a change in X (and vice versa for low probabilities).

• A linear model would imply high levels of heteroskedasticity.
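
The first point can be illustrated with a minimal sketch. It fits an ordinary least-squares line to a simulated binary outcome; the data, the seed and the helper name `ols_simple` are assumptions made for this illustration only.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope) of a least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

random.seed(1)
# Hypothetical data: y equals 1 more often when x is large.
x = [i / 10 for i in range(-50, 51)]
y = [1 if xi + random.gauss(0, 1) > 0 else 0 for xi in x]

intercept, slope = ols_simple(x, y)
fitted = [intercept + slope * xi for xi in x]
out_of_range = [f for f in fitted if f < 0 or f > 1]
print(f"{len(out_of_range)} of {len(fitted)} fitted values fall outside [0, 1]")
```

At the extremes of x the straight line keeps rising or falling, so some fitted "probabilities" end up below 0 or above 1.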

Estimators

A typical approach is to have an estimator that is “linear in the parameters” (i.e. it generates a linear prediction based on X and β) but then transforms this linear prediction into one bounded between 0 and 1.

[Figure: Logistic transformation. x-axis: linear prediction (−6 to 6); y-axis: predicted probability (0.0 to 1.0); reference line at Pr(Y = 1) = 0.5.]
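
A short sketch of the transformation may help. It evaluates the logistic function over the same range of linear predictions as the figure; the function name `logistic` is chosen here for illustration.

```python
import math

def logistic(z):
    """Map a linear prediction z (any real number) to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form to avoid overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-6, -4, -2, 0, 2, 4, 6):
    print(f"linear prediction {z:+d} -> predicted probability {logistic(z):.3f}")
```

A linear prediction of 0 maps to a probability of exactly 0.5, and the curve is symmetric around that point.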

Logistic regression

The most common transformation is the logistic transformation, which relates to the log-odds:

\[ \log\left(\frac{\Pr(y_i = 1)}{\Pr(y_i = 0)}\right) = \beta_1 + \beta_2 x_{i1} + \beta_3 x_{i2}, \]

which can also be formulated as:

\[ \Pr(y_i = 1) = \frac{1}{1 + e^{-(\beta_1 + \beta_2 x_{i1} + \beta_3 x_{i2})}}. \]
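
The step from the log-odds form to the probability form can be spelled out. In the derivation below, \(\pi_i\) for Pr(y_i = 1) and \(\eta_i\) for the linear prediction are shorthand introduced here, and Pr(y_i = 0) is written as \(1 - \pi_i\).

```latex
\begin{align*}
\log\left(\frac{\pi_i}{1 - \pi_i}\right) &= \eta_i,
  \qquad \eta_i = \beta_1 + \beta_2 x_{i1} + \beta_3 x_{i2} \\
\frac{\pi_i}{1 - \pi_i} &= e^{\eta_i} \\
\pi_i &= e^{\eta_i} (1 - \pi_i) \\
\pi_i \left(1 + e^{\eta_i}\right) &= e^{\eta_i} \\
\pi_i &= \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{1}{1 + e^{-\eta_i}}
\end{align*}
```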

Estimating a logistic regression

Estimating a logistic regression is straightforward and output will look similar to that of linear regression.

E.g. explaining “Yes” in the Marriage Equality Referendum.

Note the use of continuous and discrete independent variables.

Variable                 Coefficient   (SE)
Age 25-34                −0.152        (0.410)
Age 35-44                −0.707*       (0.386)
Age 45-54                −0.865**      (0.390)
Age 55-64                −1.084***     (0.399)
Age 65+                  −1.857***     (0.374)
Urban                     0.305*       (0.168)
Pro-abortion attitude     0.221***     (0.028)
Intercept                 0.358        (0.372)

N                         851
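
Statistical software typically obtains such estimates by maximum likelihood. The following is a minimal sketch of that idea using Newton-Raphson on simulated data with a single predictor. It does not use the referendum data; the function `fit_logit`, the true coefficients (−0.5 and 1.0) and the seed are assumptions for illustration.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(x, y, iterations=25):
    """Maximum-likelihood fit of Pr(y=1) = logistic(b0 + b1*x) by Newton-Raphson.

    Returns the estimates (b0, b1) and their standard errors (se0, se1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood (g) and information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard errors: square roots of the diagonal of the inverse
    # information matrix from the final iteration.
    return (b0, b1), (math.sqrt(h11 / det), math.sqrt(h00 / det))

random.seed(42)
# Simulated data with assumed true intercept -0.5 and slope 1.0.
x = [random.uniform(-3, 3) for _ in range(500)]
y = [1 if random.random() < logistic(-0.5 + 1.0 * xi) else 0 for xi in x]

(b0, b1), (se0, se1) = fit_logit(x, y)
print(f"intercept {b0:.3f} ({se0:.3f}), slope {b1:.3f} ({se1:.3f})")
```

The printed output mirrors the layout of a regression table: one estimate per parameter with its standard error in parentheses.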

Derivatives

For linear regression, we interpret using the first derivative, i.e. the effect of X on Y is:

\[ \frac{\partial y}{\partial x_j} = \beta_j. \]

In a logistic regression, however, the derivative is more complicated:

\[ \frac{\partial \pi}{\partial x_j} = \beta_j \pi (1 - \pi). \]

Because of the non-linear relationship, the effect of X on Y depends on all other independent variables.

Nevertheless, a quick method to interpret logit coefficients is to divide them by 4 to get the slope at π = 0.5, where π(1 − π) reaches its maximum of 0.25.
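
As a small worked example, the sketch below evaluates the derivative at several values of π, using the pro-abortion attitude coefficient (0.221) from the referendum model. The function name `marginal_effect` is chosen here for illustration.

```python
def marginal_effect(beta_j, pi):
    """Slope of the predicted probability with respect to x_j: beta_j * pi * (1 - pi)."""
    return beta_j * pi * (1.0 - pi)

beta_attitude = 0.221  # pro-abortion attitude coefficient from the referendum model

for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"pi = {pi:.1f}: slope = {marginal_effect(beta_attitude, pi):.4f}")

# The slope is largest at pi = 0.5, where it equals beta_j / 4.
print(f"divide-by-4 rule: {beta_attitude / 4:.4f}")
```

The divide-by-4 figure is therefore an upper bound on the effect of a one-unit change, reached only for cases near π = 0.5.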

Graphical interpretation

An alternative method is to plot the relationship between one x and π, holding the other values of X constant (e.g. at the mean, median, etc.).

[Figure: Predicted probability of a yes vote (for 35–44 year olds). x-axis: pro-abortion attitude; y-axis: probability of yes vote; separate series for urban and rural respondents.]

Because the inverse link function \( g(X\beta) \) is not linear (but instead \( g(X\beta) = \frac{1}{1 + e^{-X\beta}} \)), the effect of X on y depends on all X.
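
The dependence on the other variables can be made concrete with the coefficients from the referendum model. The sketch below compares urban and rural 35–44 year olds at several attitude values. The attitude values used are illustrative, since the coding of the scale is an assumption here, and `p_yes` is a helper named for this example.

```python
import math

# Coefficients copied from the referendum model (log-odds scale).
INTERCEPT = 0.358
AGE_35_44 = -0.707
URBAN = 0.305
ATTITUDE = 0.221

def p_yes(attitude, urban):
    """Predicted probability of a yes vote for a 35-44 year old."""
    eta = INTERCEPT + AGE_35_44 + URBAN * urban + ATTITUDE * attitude
    return 1.0 / (1.0 + math.exp(-eta))

# Attitude values are illustrative; the coding of the scale is an assumption.
for attitude in (2, 4, 6, 8):
    rural, urban = p_yes(attitude, 0), p_yes(attitude, 1)
    print(f"attitude {attitude}: rural {rural:.3f}, urban {urban:.3f}, "
          f"gap {urban - rural:.3f}")
```

Although the urban coefficient is a single number, the urban-rural gap in predicted probability shrinks as attitude pushes both groups towards a probability of 1.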

Fitted values

A third useful way of interpreting logit regression coefficients is by describing typical cases or interesting examples.

Age      Region   P(Yes vote)
18–24    Urban    0.89
35–44    Urban    0.79
65+      Urban    0.59
18–24    Rural    0.85
35–44    Rural    0.74
65+      Rural    0.51

(This assumes attitude towards abortion at the median value.)
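
A table of typical cases can be generated along the following lines. This is a sketch under two stated assumptions: 18–24 is treated as the reference age group (it has no coefficient in the model), and the median attitude is set to a hypothetical placeholder value, so the printed probabilities need not reproduce the table above.

```python
import math

# Coefficients from the referendum model; 18-24 is assumed to be the reference group.
AGE = {"18-24": 0.0, "25-34": -0.152, "35-44": -0.707,
       "45-54": -0.865, "55-64": -1.084, "65+": -1.857}
INTERCEPT, URBAN, ATTITUDE = 0.358, 0.305, 0.221

ASSUMED_MEDIAN_ATTITUDE = 6.0  # hypothetical placeholder, not taken from the data

def fitted_probability(age_group, urban, attitude=ASSUMED_MEDIAN_ATTITUDE):
    """Predicted probability of a yes vote for one typical case."""
    eta = INTERCEPT + AGE[age_group] + URBAN * urban + ATTITUDE * attitude
    return 1.0 / (1.0 + math.exp(-eta))

print(f"{'Age':<8}{'Region':<8}P(Yes vote)")
for region, urban in (("Urban", 1), ("Rural", 0)):
    for age_group in ("18-24", "35-44", "65+"):
        print(f"{age_group:<8}{region:<8}{fitted_probability(age_group, urban):.2f}")
```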

Presentation

Bottom line: it is much better to present interpretable and understandable inferences, with an indication of the level of uncertainty, than to present simply estimated coefficients.

E.g. “An increase in automobile support for a Republican senator from $10000 to $20000 in total increases his or her probability to vote for the Corporate Average Fuel Economy standard bill by 11%, give or take 7%, all else equal.”

R² for logistic regression

Although various authors have proposed pseudo-R² estimators that roughly do the same thing as an R² for linear regression, there is no good alternative.

They cannot be interpreted as “the proportion of variance in Y explained.”

Instead, it is typically better to look at the quality of the predictions: do I get high Pr(Y = 1) for the observed ones in the data and low Pr(Y = 1) for the observed zeros?
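
One simple way to carry out this check is to compare the average predicted probability among the observed ones with that among the observed zeros. The sketch below uses a handful of made-up outcomes and predictions; the data and the function name are assumptions for illustration.

```python
def mean_prediction_by_outcome(y, p):
    """Average predicted Pr(Y = 1) among observed ones and among observed zeros."""
    ones = [pi for yi, pi in zip(y, p) if yi == 1]
    zeros = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(ones) / len(ones), sum(zeros) / len(zeros)

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [1, 1, 1, 0, 1, 0, 0, 1]
p = [0.9, 0.7, 0.6, 0.4, 0.8, 0.2, 0.55, 0.35]

mean_ones, mean_zeros = mean_prediction_by_outcome(y, p)
print(f"mean prediction for observed ones:  {mean_ones:.3f}")
print(f"mean prediction for observed zeros: {mean_zeros:.3f}")
```

The larger the gap between the two averages, the better the model separates the two outcome categories.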

Confusion matrix

Evaluating the performance of the binary model can be done by using the confusion matrix:

                True value 1                          True value 0
Prediction 1    True positive (TP)                    False positive (FP, Type I error)
Prediction 0    False negative (FN, Type II error)    True negative (TN)

Precision: TP / (TP + FP)

Sensitivity: TP / (TP + FN)

Specificity: TN / (FP + TN)

Accuracy: (TP + TN) / N

TPR: TP / (TP + FN)

FPR: FP / (FP + TN)
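
The quantities in the confusion matrix can be computed as in the following sketch. The outcomes and predicted probabilities are made up for illustration, and the function assumes that every denominator is non-zero (e.g. at least one case is predicted as a one).

```python
def confusion_metrics(y_true, p_hat, threshold=0.5):
    """Classify with the given threshold and summarise the confusion matrix."""
    predicted = [1 if p >= threshold else 0 for p in p_hat]
    pairs = list(zip(y_true, predicted))
    tp = sum(1 for t, d in pairs if t == 1 and d == 1)
    fp = sum(1 for t, d in pairs if t == 0 and d == 1)
    fn = sum(1 for t, d in pairs if t == 1 and d == 0)
    tn = sum(1 for t, d in pairs if t == 0 and d == 0)
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # also the true positive rate (TPR)
        "specificity": tn / (fp + tn),
        "fpr": fp / (fp + tn),          # false positive rate
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [1, 1, 1, 0, 1, 0, 0, 1]
p = [0.9, 0.7, 0.6, 0.4, 0.8, 0.2, 0.55, 0.35]

metrics = confusion_metrics(y, p)
for name, value in metrics.items():
    print(f"{name}: {value}")
```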

Receiver Operating Characteristic curve

The accuracy of predictions will depend on the threshold probability: the default is π = 0.5, but other thresholds are possible.

Depending on the application, it may be more costly to overpredict or to underpredict ones relative to zeros.

The ROC curve plots, for all possible thresholds, the true positive rate against the false positive rate.

An ROC curve further above the 45-degree line indicates better predictive performance; a curve below this line indicates predictions that are worse than random.
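
The construction of the curve can be sketched as follows. Each distinct predicted probability is used as a threshold, and the resulting (false positive rate, true positive rate) pair is recorded. The data and the function name `roc_points` are assumptions for illustration.

```python
def roc_points(y_true, p_hat):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to most lenient."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Using each distinct predicted probability as a threshold covers every
    # classification the model can produce; infinity gives the (0, 0) corner.
    thresholds = [float("inf")] + sorted(set(p_hat), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for t, p in zip(y_true, p_hat) if t == 1 and p >= threshold)
        fp = sum(1 for t, p in zip(y_true, p_hat) if t == 0 and p >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [1, 1, 1, 0, 1, 0, 0, 1]
p = [0.9, 0.7, 0.6, 0.4, 0.8, 0.2, 0.55, 0.35]

points = roc_points(y, p)
for fpr, tpr in points:
    print(f"FPR {fpr:.2f}, TPR {tpr:.2f}")
```

Lowering the threshold can only add predicted ones, so both rates rise from (0, 0) to (1, 1) as the threshold falls.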

[Figure: Receiver Operating Characteristic curve. x-axis: false positive rate (0.0 to 1.0); y-axis: true positive rate (0.0 to 1.0); separate curves for the model with age and the model without age.]

Given the above, we can also calculate the area under the ROC curve as a measure of prediction quality, called AUC. This is somewhat related to the Gini coefficient for income distributions (G = 2AUC − 1).
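
The AUC can be computed without drawing the curve, since it equals the share of (observed one, observed zero) pairs in which the one receives the higher predicted probability, with ties counted as half. The sketch below uses that pairwise formulation on made-up data; it is one of several equivalent ways to obtain the area.

```python
def auc(y_true, p_hat):
    """Area under the ROC curve via pairwise comparison of ones and zeros."""
    ones = [p for t, p in zip(y_true, p_hat) if t == 1]
    zeros = [p for t, p in zip(y_true, p_hat) if t == 0]
    wins = 0.0
    for p1 in ones:
        for p0 in zeros:
            if p1 > p0:
                wins += 1.0
            elif p1 == p0:
                wins += 0.5  # ties count as half
    return wins / (len(ones) * len(zeros))

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [1, 1, 1, 0, 1, 0, 0, 1]
p = [0.9, 0.7, 0.6, 0.4, 0.8, 0.2, 0.55, 0.35]

area = auc(y, p)
gini = 2 * area - 1
print(f"AUC = {area:.4f}, G = 2AUC - 1 = {gini:.4f}")
```

An AUC of 0.5 (G = 0) corresponds to the 45-degree line, and an AUC of 1 (G = 1) to perfect separation of ones and zeros.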

Variations

Logistic regression is the most common and easiest to use. Other models exist for specific uses:

• Probit regression: similar to logistic regression, but with thinner tails.

• Ordered probit: similar to probit regression, but with a multiple-category ordinal dependent variable.

• Multinomial logistic regression: similar to logistic regression, but for a multiple-category nominal dependent variable.

• Poisson / negative binomial: regression models for discrete, non-negative dependent variables, such as counts.
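
The difference in tails between probit and logit can be checked numerically. The sketch below compares the standard normal CDF (the probit transformation) with the logistic CDF after rescaling both to unit variance; the rescaling is a choice made here to put the two curves on a comparable footing.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Standard normal CDF, the transformation used by probit regression."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Rescale the logistic argument so both curves have unit variance;
# the standard logistic distribution has variance pi^2 / 3.
SCALE = math.pi / math.sqrt(3.0)

for z in (-4, -3, -2, -1, 0):
    print(f"z = {z:+d}: probit {normal_cdf(z):.5f}, "
          f"logit {logistic_cdf(SCALE * z):.5f}")
```

Far from zero the logistic curve approaches 0 (and, by symmetry, 1) more slowly than the normal curve, which is what the heavier tails refer to.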