Logistic Regression: For when your data really do fit in neat little boxes



ANNMARIA DE MARS, Ph.D.

The Julia Group

Logistic Regression: For when your data really do fit in neat little boxes

Logistic regression is used when a few conditions are met:

1. There is a dependent variable.

2. There are two or more independent variables.

3. The dependent variable is binary, ordinal or categorical.

Medical applications

Symptoms are absent, mild or severe

Patient lives or dies

Cancer, in remission, no cancer history

Logistic regression applications are common in medicine, with binary outcomes such as died/survived or ordinal outcomes such as symptoms that are absent, mild or severe.


Marketing Applications

Buys pickle / does not buy pickle

Which brand of pickle is purchased

Buys pickles never, monthly or daily

GLM and LOGISTIC are similar in syntax

PROC GLM DATA = dsname ;
  CLASS class_variable ;
  MODEL dependent = indep_var class_variable ;
RUN ;

PROC LOGISTIC DATA = dsname ;
  CLASS class_variable ;
  MODEL dependent = indep_var class_variable ;
RUN ;

That was easy. So, why aren't we done and going for coffee now?

Why it's a little more complicated

The output from PROC LOGISTIC is quite different from PROC GLM

If you aren't familiar with PROC GLM, the similarities don't help you, now do they?

Important Logistic Output

Model fit statistics

Global Null Hypothesis tests

Odds ratios

Parameter estimates

& a useful plot

But first, a word from Chris Rock


The problem with women is not that they leave you

but that they have to tell you why

I am going to explain why you use logistic regression


A word from an unknown person on the Chronicle of Higher Ed Forum

Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression

Not just "How can I do it?" but "Why?"

Ask:

What's the rationale?

How will you interpret the results?

Because the world is not experiencing a shortage of people who are unclear what the hell they are doing.


The world already has enough people who don't know what they're doing

No one cares that much

Advice on residuals

At first, I was prepared to give a long talk about residuals, analysis of residuals, and why logistic regression is better suited than multiple linear regression in some cases. Then I remembered the advice my three children, now in their twenties, frequently give me about the latest studies showing statistics is a sexy profession: A. No, it's not, and B. No one cares that much. Ever since they were in preschool, when I would sit them on the steps and practice my presentations in front of them, they have had the same advice for me: "Mom, don't talk too much, because you're not that interesting."


What Euclid knew

I was lying in bed one day, with my eyes closed, trying to avoid morning, when my husband brought me a cup of coffee and started talking about Euclid. He mentioned that much of what the ancient Greeks did in geometry was not really proofs at all in the way we think of them, but very copious examples.


It's amazing what you learn when you stay awake in graduate school

A very brief refresher

Those of you who are statisticians, feel free to nap for two minutes

Assumptions of linear regression

linearity of the relationship between dependent and independent variables

independence of the errors (no serial correlation)

homoscedasticity (constant variance) of the errors across predictions (or versus any independent variable)

normality of the error distribution.

In a nutshell, logistic regression is preferred to linear regression when you have a categorical dependent variable. No, it is NOT okay to just pretend your dependent variable is continuous. In one of your first statistics courses you no doubt learned the assumptions of linear regression models listed above.


Residuals Bug Me

To a statistician, all of the variance in the world is divided into two groups, variance you can explain and variance you can't, called error variance.

Residuals are the error in your prediction.


Residual error

If your actual score on, say, depression is 25 points above average and, based on stressful events in your life, I predict it to be 20 points above average, then the residual (error) is 5.

Say, for example, your home has recently suffered an invasion of naked mole rats. That would be stressful.


Euclid says

Let's look at those residuals when we do linear regression with a categorical and with a continuous dependent variable


Residuals: Pass/Fail

Residuals: Final Score

Let's go back and forth between these two and be bothered for a while. As you can see, our residuals for the continuous prediction are above and below the mean. You'd think that if you have a good prediction, the average error should be zero, because the too-high and too-low predictions cancel out. Your errors should center around zero, with small errors more common than large ones. This doesn't happen at all when the dependent variable is binary. In fact, most of your errors are far above or far below the mean of zero.


Which looks more normal?

On the left is the graph of the residuals when the dependent variable is binary: pass or fail. That whole constant variance across the dependent variable, homoscedasticity thing? The distribution of errors being normal? Yeah, not happening here. The diamond indicates the mean and the straight line is the median. You can see that the median is much closer to the mean for the continuous example. I have to admit, though, that there are a few more extreme outliers for the continuous variable.


Which is a straight line?

If you look at the diagnostic plots of the two analyses, for the one with a continuous measure you see this normal probability plot. It should look like a straight line. When we take a look at the plot, it does look pretty much like a straight line, except at the extremes. We have more extremely low scores and more extremely high scores than would be expected in a normal distribution. If you want to know why that is, you need to attend the other session on data visualization!


Impossible events: Prediction of pass/fail

Look at the predicted scores for whether or not the student passed the course:


It's not always like this. Sometimes it's worse.

Notice that NO ONE was predicted to have failed the course.

Several people had predicted scores over 1.

Sometimes you get negative predictions, too

If, as is often the case, you have a very small proportion of positive cases, e.g., people who die or get cancer, your departure from normality is going to be even worse. Notice here that NO ONE was predicted to have failed the course. In fact, 26% of the students failed. Several people had predicted scores over 1.0, which doesn't make sense. Sometimes you get negative predictions, too. So, no, we can't just pretend it is continuous.


In five minutes or less

Logarithms, probability & odds ratios



I'm not most people

Most people might start with the underlying mathematics and go through a lot of proofs. I'm not most people, so I am just going to go over a few points justifying the use of logistic regression.



Points justifying the use of logistic regression

Really, if you look at the relationship of a dichotomous dependent variable and a continuous predictor, it often looks more logistic than linear. The best-fitting line isn't a straight line at all. It's a curve.


You could try predicting the probability of an event

say, passing a course. That would be better than nothing, but the problem is that probability only goes from 0 to 1, again restricting your range.

Maybe use the odds ratio?

which is the ratio of the odds of an event happening versus not happening under one condition, compared to the odds under another condition. However, that only goes from 0 to infinity.

When to use logistic regression: Basic example #1

Your dependent variable (Y):

There are two possibilities, married or not. We are modeling the probability that an individual is married, yes or no.

Your independent variable (X):

Degree in a computer science field = 1, degree in French literature = 0

[Logistic regression is NOT what you would use to model how long a marriage lasted. That would be survival analysis.]


Step #1

A. Find the PROBABILITY of the value of Y being a certain value divided by ONE MINUS THE PROBABILITY, for when X = 1:

p / (1 - p)

Logistic regression is the statistical technique of choice when you have a single dependent variable and multiple independent variables from which you would like to predict it.

With logistic regression, the first step in building the dependent variable you are modeling is the PROBABILITY of Y being a certain value divided by ONE MINUS THE PROBABILITY, that is, the odds.


Step #2

B. Find the PROBABILITY of the value of Y being a certain value divided by ONE MINUS THE PROBABILITY, for when X = 0


Step #3

C. Divide A by B

That is, take the odds of Y given X = 1 and divide by the odds of Y given X = 0.
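Putting the three steps in symbols, with p1 for the probability of Y taking the value of interest when X = 1 and p0 for the same probability when X = 0 (my notation, not from the slides):

odds ratio = [ p1 / (1 - p1) ] / [ p0 / (1 - p0) ]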


Example!

100 people in computer science & 100 in French literature

90 computer scientists are married

Odds = 90/10 = 9

45 French literature majors are married

Odds = 45/55 = .818

Divide 9 by .818 and you get your odds ratio of 11.

Further, let's say that we have sampled 100 people in computer science and 100 people in French literature. We find that 90 of the computer scientists are married and 45 of the French literature majors.

So, if the probability of marriage is 90/100 and the probability of not being married is 10/100, then the odds for the computer scientists are 90/10, or 9.

For the French literature people, the probability of marriage is 45/100 and the probability of not being married is 55/100, so the odds are 45/55 = .818.

So, 9/.818 = 11.00.

This tells you that the odds of a computer scientist being married versus single are 11 times those of a French literature major. Also, that you should study computer science instead of French.
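If you want to verify this in SAS, here is a minimal sketch that builds the made-up data and reproduces the odds ratio (the dataset and variable names are my own invention, not from the slides):

data marriage ;
  * 90 married and 10 unmarried computer science majors ;
  do i = 1 to 90 ; cs = 1 ; married = 1 ; output ; end ;
  do i = 1 to 10 ; cs = 1 ; married = 0 ; output ; end ;
  * 45 married and 55 unmarried French literature majors ;
  do i = 1 to 45 ; cs = 0 ; married = 1 ; output ; end ;
  do i = 1 to 55 ; cs = 0 ; married = 0 ; output ; end ;
  drop i ;
run ;

proc logistic data = marriage descending ;
  model married = cs ;
run ;

The odds ratio estimate for cs should come out as 11.0, with a coefficient of 2.398, matching the table that follows.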


Just because that wasn't complicated enough

Now that you understand what the odds ratio is

The dependent variable in logistic regression is the LOG of the odds ratio (hence the name)

Which has the nice property of extending from negative infinity to positive infinity.
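The log odds for a single condition is usually written as the logit (standard notation, not from the slides):

logit(p) = ln( p / (1 - p) )

which can take any value from negative infinity to positive infinity.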

A table (try to contain your excitement)

             B      S.E.     Wald    df    Sig.   Exp(B)
CS          2.398   .389   37.949     1   .000    11.00
Constant    -.201   .201     .997     1   .318     .818

The natural logarithm (ln) of 11 is 2.398.

I don't think this is a coincidence

If the reference value for CS = 1, a positive coefficient means that when CS = 1, the outcome is more likely to occur

How much more likely? Look to your right

             B      S.E.     Wald    df    Sig.   Exp(B)
CS          2.398   .389   37.949     1   .000    11.00
Constant    -.201   .201     .997     1   .318     .818

The ODDS of getting married are 11 times GREATER

If you are a computer science major

Let's take a look at two examples of logistic regression. The previous example of computer scientists and French literature majors used data I made up. This is a really good way to ensure that your results turn out nice, neat and highly significant. Unfortunately, making up your data does not sit well with journal reviewers, tenure committees or the National Institutes of Health.


Thank God!

Actual Syntax

Picture of God not available

PROC LOGISTIC data = datasetname descending ;

By default the reference group is the first category.

What if data are scored

0 = not dead

1 = died

PROC LOGISTIC invokes the logistic procedure and DATA = gives the dataset name. What about the DESCENDING? By default, the reference group is the first category. Often, data are scored 0 or 1, for example, 0 = survived, 1 = died. The parameter values are then predicting "have not died." While your total model chi-square, model significance, and significance of your regression coefficients will all be the same regardless, it is a little counterintuitive for most people to think of this as the probability of not having died, and it gets a little convoluted explaining it to everyone. So, the DESCENDING option makes 1 the value whose probability is modeled.
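A minimal sketch of the difference, assuming a hypothetical dataset named mortality with died coded 0/1 and a predictor age:

proc logistic data = mortality descending ;
  model died = age ;
run ;
* with DESCENDING, the probability of died = 1 is modeled, so a positive coefficient on age means older people were more likely to have died ;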


CLASS categorical variables ;

Any variables listed here will be treated as categorical variables, regardless of the format in which they are stored in SAS.

Unintentionally listing a numeric variable here causes each of its values to be treated as a separate category, so make sure that is what you want.


MODEL dependent = independents ;

Just like in Ordinary Least Squares regression, you give the variable you want to predict and the variables you want to use as predictors.


Dependent = Employed (0,1)

Independents

County

# Visits to program

Gender

Age

PROC LOGISTIC DATA = stats1 DESCENDING ;
  CLASS gender county ;
  MODEL job = gender county age visits ;
RUN ;

We will now enter real life

Table 1

Probability modeled is job=1.

Note: 50 observations were deleted due to missing values for the response or explanatory variables.

This is the result of the DESCENDING option. The probability of 1, or being employed, is modeled. Also, just in case I missed the fact that a substantial amount of my data is missing, SAS gives me a little note.


This is bad

Model Convergence Status

Quasi-complete separation of data points detected.

Warning:

The maximum likelihood estimate may not exist.

Warning:

The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Complete separation

Complete separation occurs when your data are all in one group above some value of X and all in another group below that value. Something like this:

X   Group
0   0
1   0
2   0
3   0
4   1
5   1
6   1
7   1
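Here is a minimal sketch that reproduces the problem with this toy data (the dataset name is mine); the log should show the complete separation warning:

data sep ;
  input x group ;
  datalines ;
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
;
run ;

proc logistic data = sep descending ;
  model group = x ;
run ;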


If you don't go to church you will never die

Let's say that X is how many times a week people went to church in 2010 and the GROUP variable is whether or not they died. You might think this is great: you can perfectly predict which group someone falls in. If the score on X is 4 or more, they're dead. Otherwise, they're alive. You could come out and say that people who go to church three or fewer times a week never die, but you would look pretty stupid saying that.


Quasi-complete separation

Like complete separation, BUT there are one or more values of X where points take both values of the dependent variable

X   Group
1   1
2   1
3   1
4   1
4   0
5   0
6   0

From a mathematical point of view, this is a problem because there is not a unique maximum likelihood estimate. The model cannot converge because a single maximum does not exist.


Depressing words from Paul Allison

For any dichotomous independent variable in a logistic regression, if there is a zero in the 2 x 2 table formed by that variable and the dependent variable, the ML estimate for the regression coefficient does not exist.

What the hell happened?

Solution?

Collect more data.

Figure out why your data are missing and fix that.

Delete the category that has the zero cell.

Delete the variable that is causing the problem

So, what happened? Well, when I looked at County by Job by Gender, I found that about a third of my sample was missing gender. When I looked at those who HAD data for gender, there were only two who had a missing value (.) for county, that is, they did not report the county where they lived. Both of those two people had a job, so there was a zero cell.
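A quick way to hunt for zero cells is to crosstab each categorical predictor with the dependent variable; a sketch using this example's variable names:

proc freq data = stats1 ;
  * the MISSING option counts missing values as a category, which is what bit us here ;
  tables county * job gender * job / missing norow nocol nopercent ;
run ;

Any empty cell in these tables is a candidate cause of quasi-complete separation.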

Collect more data. This usually solves the problem of having a zero cell. It's not always an option, though.

Figure out why your data are missing and fix that. This is something you should always do whenever you can, regardless of whether you have problems with quasi-complete or complete separation. In this case, it was because one file we merged had names in all upper case, so the two did not match. When I re-ran the merge, our missing data were miraculously non-missing.
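A minimal sketch of that kind of fix, assuming two hypothetical files matched on a name variable:

* standardize case so SMITH and Smith match before merging ;
data file1_std ;
  set file1 ;
  name = upcase(name) ;
run ;

data file2_std ;
  set file2 ;
  name = upcase(name) ;
run ;

proc sort data = file1_std ; by name ; run ;
proc sort data = file2_std ; by name ; run ;

data merged ;
  merge file1_std file2_std ;
  by name ;
run ;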

Delete the category that has the zero cell. This CAN work if, as in our example, the cell that had a zero was people who did not give their county and did not get a job. That's kind of a garbage category anyway: "not listed." If it were a lot of people, and related to the dependent variable, it could tell us our data were not missing at random. In this case it was only a few people. So, toss it.

Delete the variable that is causing the problem. (If you have a dichotomous variable, it should be noted that the option above essentially means deleting the variable, too.) This may be advisable if there is a problem with the variable itself. For example, if none of the women in your sample are working, how likely is that? Then you have to ask if there is a problem with your sample: did you collect data from fighters in the UFC or Catholic priests? Does the variable even make sense for your research question?


Nothing was significant

& I was sad

What you should not do is just ignore that note about quasi-complete separation and the model failing to converge. First of all, I should point out that these data weren't really from the counties named; I changed the data for confidentiality purposes. If this were most statistical presentations, this is the point where I would correct the problem by doing A, B, C and/or D, thereby resolving it and getting a useful model with all parameters significant except for that one non-significant effect that I could use to illustrate non-significance.

Unfortunately, these were real data from a real study where we were trying to prove something. We did, in fact, collect more data, fix the problem of the missing data, and delete the category of missing data for county, and still nothing was significant. I was sad.


Hey, there's still money in the budget!

Let's try something else!

Maybe it's the client's fault

PROC LOGISTIC DATA = stats DESCENDING ;
  CLASS difficulty gender ;
  MODEL job = gender age difficulty ;
RUN ;

Oh, joy!

This sort of sucks

AIC - This is the Akaike Information Criterion. AIC is used for the comparison of models from different samples or non-nested models. Ultimately, the model with the smallest AIC is considered the best.

SC - This is the Schwarz Criterion. Like AIC, SC penalizes for the number of predictors in the model, and the smallest SC is the most desirable.

-2 Log L - This is negative two times the log likelihood. The -2 Log L is used in hypothesis tests for nested models.


Yep. Sucks.

Sucks. Totally.

Conclusion

Sometimes, even when you use the right statistical techniques, the data don't predict well. My hypothesis would be that employment is determined by other variables, say, having particular skills like SAS programming.

Take 2

Predicting passing grades

Let's move on to a project where we are predicting whether people pass or fail a course. This is another model with actual data, and a real point. We received funding to train direct care staff working with individuals with disabilities and chronic illness. As our end outcome, we have whether or not they passed a test of knowledge and best practices related to disability. It included everything from how to give a bed bath to nutrition. As our first attempt, we used education and group (experimental or control) as the independent variables.


PROC LOGISTIC DATA = nidrr ;
  CLASS group ;
  MODEL passed = group education ;
RUN ;

Yay! Better than nothing!

Let's look at several tables. We'll skip over the first few tables on model fit statistics and look at them later. Here is our table that tests the null hypothesis that all of our regression coefficients are zero. If you are familiar with linear regression or Analysis of Variance, think of this as similar to the F-test. If you aren't familiar with those, just so you know, this is simply the test of whether ANY of your predictor variables are significantly different from zero. There are three different types of chi-square tests and, in this case as usually occurs, the three agree. The model is significant.


& we have a significant predictor

WHY is education negative?

Here we see the estimates of our coefficient values. Notice that education has a NEGATIVE estimate. Why? Because the default value modeled is 0, that is, the person did not pass. The probability of failing is negatively related to education. Being in the CONTROL group is positively related to failing. As you can see from the column on the far right, education is significantly related to the dependent variable but group is not (p > .05).


Higher education, less failure

In the following table, we see the odds ratio estimates. The point estimate for education is less than one, which tells us that people with more education are less likely to fail. Also, as you can see, 1.0 does not fall within the 95% confidence limits, so the odds are significantly different. The odds ratio for the group variable is greater than one, meaning that people in the control group are more likely to fail than those in the experimental group, but this time 1.0 DOES fall within the confidence limits, so it is not significantly greater.


Comparing model fit statistics

Now it's later

Comparing models

The Mathematical Way

Akaike Information Criterion

Used to compare models

The SMALLER the better when it comes to AIC.

Remember that I said we'd talk about those other tables later? Well, now it's later. We want to compare models. Below are the fit statistics for our model with Education and Group.

The first column uses only the intercept to predict. Your model with the two covariates gives a smaller AIC than the intercept alone. This tells you that your model is better than having no predictors. If the AIC for the Intercept Only and the Intercept and Covariates models are almost the same, that is bad.


New variable improves model

Model with education and group:

Criterion   Intercept Only   Intercept and Covariates
AIC             193.107          178.488
SC              196.131          187.560
-2 Log L        191.107          172.488

Model with pretest, education and group:

Criterion   Intercept Only   Intercept and Covariates
AIC             193.107          141.250
SC              196.131          153.346
-2 Log L        191.107          133.250

Let's say, because it happens to be true, that we gave everyone a pre-test before they were trained. We then tried a different model with pretest, education and group. Check out the fit statistics in the second table above.

The first column is exactly the same as in our previous model, because it is how well the data fit for the intercept only. The second column, including covariates, is now 141.25. Recall that in the previous model it was 178.49. Also recall that smaller is better. So, we conclude that the second model is much better than the first. Won't it always be that way, with adding more variables making the model better? No. It won't.
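For reference, a sketch of that second model; I am assuming the pre-test score variable is named pretest, which is not shown in the slides:

PROC LOGISTIC DATA = nidrr ;
  CLASS group ;
  MODEL passed = group education pretest ;
RUN ;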


Comparing models

The Visual Way

Reminder

Sensitivity is the percentage of true positives, for example, the percentage of people who actually died whom you predicted would die.

Specificity is the percentage of true negatives, for example, the percentage of people who survived whom you predicted would NOT die.

I'm unimpressed

Yeah, but can you do it again?

Data mining

With SAS logistic regression

Data mining sample & test

Select sample

Create estimates from sample

Apply to hold out group

Assess effectiveness

Create sample

proc surveyselect data = visual
  out = samp300 rep = 1
  method = SRS seed = 1958 sampsize = 315 ;
run ;

Create Test Dataset

proc sort data = samp300 ;
  by caseid ;
run ;

proc sort data = visual ;
  by caseid ;
run ;

data testdata ;
  merge samp300 (in = a) visual (in = b) ;
  by caseid ;
  if a and b then delete ;
  *** deletes the record if it is in the sample, keeping the hold-out group ;
run ;


Create estimates

ods graphics on ;

proc logistic data = samp300 outmodel = test_estimates plots = all ;
  model vote = q6 totincome pctasian / stb rsquare ;
  weight weight ;
run ;

Test estimates

proc logistic inmodel = test_estimates plots = all ;
  score data = testdata ;
  weight weight ;
run ;

*** if no output dataset is named, SCORE writes to a dataset named Data1, Data2, etc. ;

Validate estimates

proc freq data = data1 ;
  tables vote * i_vote ;
run ;

proc sort data = data1 ;
  by i_vote ;
run ;


Model Information

Data Set                     WORK.STATS1
Response Variable            job
Number of Response Levels    2
Model                        binary logit
Optimization Technique       Fisher's scoring