logistic regression model

Upload: ravi-kumar

Post on 14-Jul-2015


Statistical Computing Seminar
Proc Logistic and Logistic Regression Models

Introduction
Binary Logistic Regression
Exact Logistic Regression
Generalized Logits Model - Multinomial Logistic Regression
Proportional Odds Model - Ordinal Logistic Regression

Introduction

Logistic regression describes the relationship between a categorical response variable and a set of predictor variables. A categorical response variable can be a binary variable, an ordinal variable or a nominal variable. Each type of categorical variable requires a different technique to model its relationship with the predictor variables. In this seminar, we illustrate how to perform different types of analyses using SAS proc logistic. For a binary response variable, such as a response to a yes-no question, a commonly used model is the logistic regression model. We also touch on exact logistic regression, which is very useful when the sample size is too small or the events are too sparse. For a nominal response variable, such as Democrats, Republicans and Independents, we can fit a generalized logits model. For an ordinal response variable, such as low, medium and high, we can fit a proportional odds model.

Logistic Regression Models

In this section, we will use the High School and Beyond data set, hsb2.sas7bdat, to describe what a logistic model is, how to perform a logistic regression analysis and how to interpret the model. Our dependent variable is a dichotomous variable indicating whether a student's writing score is higher than or equal to 52. We call it hiwrite. The predictor variables will include prog, female and other test scores. Our data set has 200 observations.

    data hsb2;
      set hsb2;
      hiwrite = write >= 52;
    run;

    proc means data = hsb2 mean std;
    run;

    Variable            Mean         Std Dev
    ----------------------------------------
    ID           100.5000000      57.8791845
    FEMALE         0.5450000       0.4992205
    RACE           3.4300000       1.0394722
    SES            2.0550000       0.7242914
    SCHTYP         1.1600000       0.3675260
    PROG           2.0250000       0.6904772
    READ          52.2300000      10.2529368
    WRITE         52.7750000       9.4785860
    MATH          52.6450000       9.3684478
    SCIENCE       51.8500000       9.9008908
    SOCST         52.4050000      10.7357935
    hiwrite        0.6300000       0.4840159
    ----------------------------------------

Let $\pi$ be the probability of scoring higher than 51 on the writing test. The odds are $\pi/(1-\pi)$. For example, the overall probability of scoring higher than 51 is .63. The odds will be .63/(1-.63) = 1.703. A logistic regression model describes a linear relationship between the logit, which is the log of odds, and a set of predictors.

    $\mathrm{logit}(\pi) = \log(\pi/(1-\pi)) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k = \alpha + x\beta$

We can either interpret the model using the logit scale, or we can convert the log of odds back to the probability such that

    $\pi = \exp(\alpha + x\beta) / (1 + \exp(\alpha + x\beta))$.

The advantage of using the logit scale for interpretation is that the relationship between the logit and the predictors is linear. But sometimes it is easier to interpret the model in terms of probabilities. Then we have to keep in mind that the relationship between the probabilities and the predictors is not linear. For more details on odds ratios, please see our FAQ page on how to interpret odds ratios in logistic regression.

A Simple Model

Let's consider the model where female is the only predictor. We will use this example to understand the concepts of odds and odds ratios and to understand how they are related to the parameter estimates.

    proc logistic data = hsb2;
      model hiwrite (event='1') = female;
      ods output ParameterEstimates = model_female;
    run;

                 Analysis of Maximum Likelihood Estimates

                               Standard        Wald
    Parameter   DF   Estimate     Error  Chi-Square   Pr > ChiSq
    Intercept    1     0.0220    0.2097      0.0110       0.9165
    FEMALE       1     0.9928    0.3016     10.8369       0.0010
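The conversions between probability, odds and logit described earlier in this section can be checked with a short data step. This is only a minimal sketch: the variable names p, odds, logit and p_back are our own illustrative choices, and the starting value .63 is the overall mean of hiwrite reported by proc means.

```sas
/* Sketch: convert a probability to odds and logit, then back.   */
/* Variable names are illustrative, not part of the seminar code. */
data _null_;
  p      = 0.63;                           /* overall proportion with hiwrite = 1 */
  odds   = p / (1 - p);                    /* odds of the event, about 1.703      */
  logit  = log(odds);                      /* log of odds                         */
  p_back = exp(logit) / (1 + exp(logit));  /* inverse logit recovers p            */
  put p= odds= logit= p_back=;             /* write the values to the SAS log     */
run;
```

The last assignment is the same inverse-logit formula used to turn a fitted logit into a predicted probability.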

Notice that we can specify which event to model using the event = option in the model statement. This is new in SAS 8.2. The other way of specifying that we want to model 1 as the event instead of 0 is to use the descending option in the proc logistic statement. One thing worth noticing is the use of quotes in the option event = '1'. Even though the variable hiwrite is a numeric variable, it is still necessary to surround 1 with a pair of quotes. This comes in handy when the outcome variable is coded as a character variable. Using the ODS output statement, we created a data set called model_female containing the parameter estimates shown above. We can then use the data set to create the odds and odds ratio.

    data model_fem;
      set model_female;
      o = exp(estimate);
    run;

    proc print data = model_fem;
      var variable estimate o;
    run;

    Obs    Variable     Estimate          o
      1    Intercept      0.0220    1.02222
      2    FEMALE         0.9928    2.69865

The intercept has a parameter estimate of .022. This is the estimated logit when female = 0, that is, when the student is male. Therefore, the odds = exp(logit) = exp(.0220) = 1.02222 is the estimated odds for a male student to score 52 or higher on the writing test. The coefficient for the variable female is .9928. That means that for a one-unit increase in female (that is, changing from male to female) the expected change in the log of odds is .9928. We can also interpret it on the odds ratio scale. The odds for a male student are $\exp(\alpha)$ = exp(.022) and the odds for a female student are exp(.022 + .9928*1). Therefore, taking the ratio of these two odds, the odds ratio for female versus male is exp(.9928) = 2.699. In terms of probabilities, the probability for females to score 52 or higher on the writing test is exp(.022 + .9928) / (1 + exp(.022 + .9928)) = .734. The probability for males is exp(.022) / (1 + exp(.022)) = .505. With this simple example, we can actually compute the odds ratio from the 2x2 table of hiwrite*female.

    proc freq data = hsb2;
      tables hiwrite*female / nocum nopercent;
    run;

    hiwrite     FEMALE

    Frequency|
    Row Pct  |
    Col Pct  |       0|       1|  Total
    ---------+--------+--------+
           0 |     45 |     29 |     74
             |  60.81 |  39.19 |
             |  49.45 |  26.61 |
    ---------+--------+--------+
           1 |     46 |     80 |    126
             |  36.51 |  63.49 |
             |  50.55 |  73.39 |
    ---------+--------+--------+
    Total          91      109      200

For example, for males, the odds are 46/45 = 1.022, which is the exponentiated value of the intercept from the model. The odds ratio for females versus males is (80/29)/(46/45) = 2.699. It is usually written as a cross-product, (45*80)/(29*46) = 2.699. This is the exponentiated value of the parameter estimate for the variable female.
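The arithmetic above can be reproduced in a data step. The following is a minimal sketch, with illustrative variable names of our own, that computes the odds and odds ratio from the four cell counts and the predicted probabilities from the two parameter estimates.

```sas
/* Sketch: odds and odds ratio from the 2x2 cell counts, and      */
/* predicted probabilities from the fitted coefficients.          */
/* All variable names are illustrative.                           */
data _null_;
  /* cell counts: hiwrite = 1 over hiwrite = 0, within each gender */
  odds_male   = 46 / 45;
  odds_female = 80 / 29;
  or_female   = odds_female / odds_male;   /* equals (45*80)/(29*46) */

  /* estimates from the model with female as the only predictor */
  b0 = 0.0220;
  b1 = 0.9928;
  p_male   = exp(b0)      / (1 + exp(b0));
  p_female = exp(b0 + b1) / (1 + exp(b0 + b1));

  put odds_male= 6.3 odds_female= 6.3 or_female= 6.3;
  put p_male= 6.3 p_female= 6.3;
run;
```

Because this model is saturated for a single binary predictor, the two probabilities should agree, up to rounding, with the column percentages 50.55 and 73.39 in the hiwrite = 1 row of the table.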

A Model with a Continuous Predictor and a Categorical Predictor

Let's now take a look at a model with both a continuous variable math and a categorical variable female as predictors. We will focus on how to interpret the parameter estimate for the continuous variable.

    proc logistic data = hsb2;
      model hiwrite (event='1') = female math;
      output out = m2 p = prob xbeta = logit;
    run;

          Analysis of Maximum Likelihood Estimates

                               Standard        Wald
    Parameter   DF   Estimate     Error  Chi-Square
    Intercept    1   -10.3651    1.5535     44.5153
    FEMALE       1     1.6304    0.4052     16.1922
    MATH         1     0.1979    0.0293     45.5559

               Odds Ratio Estimates

                 Point          95% Wald
    Effect    Estimate      Confidence Limits
    FEMALE       5.106       2.308      11.298
    MATH         1.219       1.151       1.291
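As a rough illustration of how the math coefficient can be read, the data step below exponentiates it for a one-point and a ten-point increase and computes one predicted probability. This is a sketch only: the variable names and the example student (female, math = 50) are hypothetical choices of ours, not part of the seminar.

```sas
/* Sketch: reading the coefficient of a continuous predictor.     */
/* Variable names and the example student are hypothetical.       */
data _null_;
  b0       = -10.3651;   /* intercept              */
  b_female =   1.6304;   /* coefficient for female */
  b_math   =   0.1979;   /* coefficient for math   */

  or_math_1  = exp(b_math);        /* odds ratio for a 1-point increase in math  */
  or_math_10 = exp(10 * b_math);   /* odds ratio for a 10-point increase in math */

  /* predicted probability for a hypothetical female student with math = 50 */
  xb = b0 + b_female*1 + b_math*50;
  p  = exp(xb) / (1 + exp(xb));

  put or_math_1= 8.3 or_math_10= 8.3 xb= 8.4 p= 8.4;
run;
```

The value of or_math_1 should match the point estimate 1.219 in the odds ratio table. The ten-point odds ratio is or_math_1 raised to the tenth power, not ten times it, because effects are additive on the logit scale and multiplicative on the odds scale.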

[...]

                               Standard        Wald
    Parameter   DF   Estimate     Error  Chi-Square   Pr > ChiSq
    Intercept    1    -9.4745    5.5677      2.8958       0.0888
    dose         1     2.0804    1.2603      2.7249       0.0988

               Odds Ratio Estimates

                 Point          95% Wald
    Effect    Estimate      Confidence Limits
    dose         8.007       0.677      94.679

    Exact Conditional Analysis

                 Conditional Exact Tests

                                       --- p-Value ---
    Effect   Test          Statistic    Exact      Mid
    dose     Score            5.4724   0.0245   0.0190
             Probability      0.0110   0.0245   0.0190

               Exact Parameter Estimates

                             95% Confidence
    Parameter   Estimate         Limits         p-Value
    dose          1.8000    0.1157     5.8665    0.0245

                  Exact Odds Ratios

                             95% Confidence
    Parameter   Estimate         Limits         p-Value
    dose           6.049     1.123    353.000    0.0245


Notice first of all that the syntax for the model statement is slightly different from what we have seen so far. This is the syntax used for grouped data. That is, we have frequencies of the events for each of the cells. This type of syntax works for both maximum likelihood logistic regression and exact logistic regression. With estimate = both, we request that both the parameters and the odds ratios be estimated.

Generalized Logits Model for Multinomial Logistic Models

Proc logistic also performs analysis on nominal response variables. Since the response variable no longer has an ordering, we can no longer fit a proportional odds model to our data. But we can fit a generalized logits model. This analysis can be done using proc catmod, and that is how it used to be done. SAS 8.2 added some new features to proc logistic, and proc logistic now handles nominal responses with ease. In this section, we are going to use a data file called school used in Categorical Data Analysis Using The SAS System, by M. Stokes, C. Davis and G. Koch. We will illustrate what a generalized logits model is and how to perform an analysis using proc logistic.

    data school;
      input school program style $ count;
      cards;
    1 1 self  10
    1 1 team  17
    1 1 class 26
    1 2 self   5
    1 2 team  12
    1 2 class 50
    2 1 self  21
    2 1 team  17
    2 1 class 26
    2 2 self  16
    2 2 team  12
    2 2 class 36
    3 1 self  15
    3 1 team  15
    3 1 class 16
    3 2 self  12
    3 2 team  12
    3 2 class 20
    ;
    run;

In this data set, three different teaching styles have been implemented in teaching third grade math. School children in experimental learning settings were surveyed to determine which teaching style they preferred. The response variable style takes three values: class, self and team. We want to determine the preference of students by their schools and programs. The programs are regular and after-school programs, with 1 being regular and 2 being after-school. In a generalized logits model, we pick a particular response category as the baseline reference and compare every other category with the baseline. In our example, we will choose team as the baseline category. Consider the probabilities:

    $\pi_1$ = probability of 'Preferring class',
    $\pi_2$ = probability of 'Preferring self',
    $\pi_3$ = probability of 'Preferring team'.

The generalized logits are defined as

    $\mathrm{logit}(\pi_1) = \log(\pi_1/\pi_3)$,
    $\mathrm{logit}(\pi_2) = \log(\pi_2/\pi_3)$.

The generalized logits model for our example is then defined as

    $\mathrm{logit}(\pi_i) = \alpha_i + x\beta_i$, where $i$ = 1 and 2, indicating the two logits.

This means that we allow two different sets of regression parameters, one for each logit.

A Simple Example

We can calculate the generalized odds from the frequency table, similar to what we have done in the case of the proportional odds model.

    proc freq data = school;
      weight count;
      tables style / list chisq relrisk;
      ods output OneWayFreqs = test;
    run;

    data test1;
      set test;
      godds = frequency/85;
    run;

    proc print data = test1;
      var style frequency godds;
    run;

    Obs    style    Frequency      godds
      1    class          174    2.04706
      2    self            79    0.92941
      3    team            85    1.00000

The other way of getting the same results is to run the generalized logits model. In SAS, we can simply use proc logistic with the link = glogit option.

    proc logistic data = school order = internal;
      freq count;
      model style = / link = glogit;
      ods output parameterestimates = odds;
    run;

    data odds1;
      set odds;
      odds = exp(estimate);
    run;

    proc print data = odds1;
      var response estimate odds;
    run;

    Obs    Response    Estimate       odds
      1    class         0.7164    2.04706
      2    self         -0.0732    0.92941
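The two generalized logits from the intercept-only model can also be turned back into the three category probabilities. The data step below is a minimal sketch of that inversion, with variable names of our own choosing and team as the baseline category.

```sas
/* Sketch: recover category probabilities from generalized logits */
/* with team as the baseline. Variable names are illustrative.    */
data _null_;
  g_class = 0.7164;    /* log( P(class) / P(team) ) */
  g_self  = -0.0732;   /* log( P(self)  / P(team) ) */

  denom   = 1 + exp(g_class) + exp(g_self);
  p_class = exp(g_class) / denom;
  p_self  = exp(g_self)  / denom;
  p_team  = 1 / denom;             /* the baseline has a logit of 0 */

  put p_class= 6.4 p_self= 6.4 p_team= 6.4;
run;
```

Up to rounding of the estimates, the results should be close to the observed proportions 174/338, 79/338 and 85/338 from the frequency table.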

Saturated Model Example

In this data set, there are three schools and two types of programs. That is, for each of the preference choices there are six possible cell counts. If we use both school and program and also include their interaction, we will use up all the degrees of freedom. That is, we have a saturated model. This is the best-fitting model we can get, fitting each cell with its own parameter.

    proc logistic data = school order = internal;
      freq count;
      class school program / order = data;
      model style = school program school*program / link = glogit scale = none aggregate;
    run;

    The LOGISTIC Procedure

                   Model Information

    Data Set                      WORK.SCHOOL
    Response Variable             style
    Number of Response Levels     3
    Number of Observations        18
    Frequency Variable            count
    Sum of Frequencies            338
    Model                         generalized logit
    Optimization Technique        Fisher's scoring

              Response Profile

     Ordered                      Total
       Value     style        Frequency
           1     class              174
           2     self                79
           3     team                85

    Logits modeled use style='team' as the reference category.

         Class Level Information

                           Design
    Class      Value     Variables
    school     1           1     0
               2           0     1
               3          -1    -1
    program    1           1
               2          -1

                 Model Convergence Status

    Convergence criterion (GCONV=1E-8) satisfied.

       Deviance and Pearson Goodness-of-Fit Statistics

    Criterion       Value    DF    Value/DF    Pr > ChiSq
    Deviance       0.0000     0       .           .
    Pearson        0.0000     0       .           .

    Number of unique profiles: 6

             Model Fit Statistics

                                 Intercept
                  Intercept         and
    Criterion       Only         Covariates
    AIC            699.404         689.156
    SC             707.050         735.033
    -2 Log L       695.404         665.156

          Testing Global Null Hypothesis: BETA=0

    Test                 Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio        30.2480    10        0.0008
    Score                   28.3738    10        0.0016
    Wald                    25.6828    10        0.0042

               Type 3 Analysis of Effects

                                   Wald
    Effect             DF    Chi-Square    Pr > ChiSq
    school              4       14.5522        0.0057
    program             2       10.4815        0.0053
    school*program      4        1.7439        0.7827

We have included most parts of the output from SAS, excluding the parameter estimates. The Deviance and Pearson Goodness-of-Fit Statistics output is new here. These statistics were requested by using the option scale = none aggregate. Because our model is saturated, the goodness-of-fit statistics are zero with zero degrees of freedom. We also see the default coding scheme that proc logistic uses for categorical variables, namely effect coding. Finally, the overall effect of the interaction of school and program is not significant. This leads us to a simpler model with only the main effects.

Model With Only Main Effects

    proc logistic data = school order = internal;
      freq count;
      class school program / order = data;
      model style = school program / link = glogit scale = none aggregate;
    run;

                     Odds Ratio Estimates

                                    Point          95% Wald
    Effect             style     Estimate      Confidence Limits
    school  1 vs 3     class        1.926       0.990       3.747
    school  1 vs 3     self         0.517       0.228       1.175
    school  2 vs 3     class        1.609       0.820       3.155
    school  2 vs 3     self         1.276       0.620       2.626
    program 1 vs 2     class        0.476       0.280       0.809
    program 1 vs 2     self         1.005       0.538       1.877

We will focus on the interpretation of the parameters. For example, the odds ratio of class to team for program 1 versus program 2 is .476. We can say that the odds for students in program 1 to choose class over team are .476 times the odds for students in program 2. Equivalently, the odds for students in program 1 to choose class over team are 52.4% lower than the odds for students in program 2. Similarly, the odds for students in school 1 to choose class over team are 1.926 times the odds for students in school 3, that is, 92.6% higher than the odds for students in school 3. It is often easier to describe the model in terms of probabilities. We can use the output statement to generate these probabilities as shown below.

    proc logistic data = school order = internal;
      freq count;
      class school program;
      model style = school program / link = glogit;
      output out = smodel p = prob;
    run;

    proc freq data = smodel;
      where school = 1 or school = 2;
      format prob 5.4;
      tables school*program*_level_*prob / list nopercent nocum;
    run;

    school    program    _LEVEL_     prob    Frequency
    --------------------------------------------------
         1          1    class      .5371            3
         1          1    self       .1580            3
         1          1    team       .3049            3
         1          2    class      .7095            3
         1          2    self       .0989            3
         1          2    team       .1917            3
         2          1    class      .3924            3
         2          1    self       .3409            3
         2          1    team       .2667            3
         2          2    class      .5764            3
         2          2    self       .2372            3
         2          2    team       .1864            3
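The predicted probabilities tie back to the odds ratios reported for the main-effects model. As a minimal sketch, with illustrative variable names of our own, the data step below recomputes the class-versus-team odds ratio for program 1 versus program 2 from the probabilities listed for school 1 and expresses it as a percent change.

```sas
/* Sketch: odds ratio of class vs team, program 1 vs program 2,   */
/* recomputed from the predicted probabilities for school 1.      */
/* Variable names are illustrative.                               */
data _null_;
  odds_prog1 = 0.5371 / 0.3049;   /* P(class)/P(team), school 1, program 1 */
  odds_prog2 = 0.7095 / 0.1917;   /* P(class)/P(team), school 1, program 2 */

  or_1_vs_2  = odds_prog1 / odds_prog2;   /* about 0.476                   */
  pct_change = 100 * (or_1_vs_2 - 1);     /* negative means lower odds     */

  put odds_prog1= 8.4 odds_prog2= 8.4 or_1_vs_2= 8.3 pct_change= 8.1;
run;
```

Repeating the calculation with the school 2 rows gives essentially the same ratio, as it must in a model without the school*program interaction.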

Proportional Odds Model for Ordinal Logistic Models

The proportional odds model is also referred to as the logit version of an ordinal regression model. It extends logistic regression to handle ordinal response variables. In this section, we are going to use the SAS data set ordwarm2.sas7bdat to illustrate what a proportional odds model is and how to perform a proportional odds model analysis. Let's first take a look at the data set. This data set is taken from Regression Models For Categorical Dependent Variables Using Stata by Long and Freese. Each subject in the data set was asked to evaluate the following statement: "A working mother can establish just as warm and secure of a relationship with her child as a mother who does not work". The response is recorded in a variable called warm. It has four levels: 1 = Strongly Disagree (SD), 2 = Disagree (D), 3 = Agree (A) and 4 = Strongly Agree (SA). This will be the response variable in our analysis. Other variables in the data set include age, education level, gender of the subject, and other subject-related variables.

    options nocenter nodate label;
    proc contents data = ordwarm2;
    run;

    The CONTENTS Procedure

    Data Set Name: WORK.ORDWARM2        Observations: 2293
    Member Type:   DATA                 Variables:    10

    -----Alphabetic List of Variables and Attributes-----

     #    Variable    Type    Len    Pos    Label
    -----------------------------------------------------------------------
     2    age         Num       3      8    Age in years
     3    ed          Num       3     11    Years of education
     5    male        Num       3     17    Gender: 1=male 0=female
     4    prst        Num       3     14    Occupational prestige
     1    warm        Num       8      0    Mom can have warm relations with child
     8    warmlt2     Num       3     26    1=SD; 0=D,A,SA
     9    warmlt3     Num       3     29    1=SD,D; 0=A,SA
    10    warmlt4     Num       3     32    1=SD,D,A; 0=SA
     7    white       Num       3     23    Race: 1=white 0=not white
     6    yr89        Num       3     20    Survey year: 1=1989 0=1977

We are interested in building a model to describe the relationship between the response variable warm and some of the explanatory variables, such as age, level of education and race. Let's consider the cumulative probabilities

    $\theta_1 = \pi_1$, probability of 'Strongly Disagree',
    $\theta_2 = \pi_1 + \pi_2$, probability of 'Strongly Disagree' or 'Disagree',
    $\theta_3 = \pi_1 + \pi_2 + \pi_3$, probability of 'Not Strongly Agree',

where $\pi_1$ = probability of 'Strongly Disagree', $\pi_2$ = probability of 'Disagree', $\pi_3$ = probability of 'Agree', $\pi_4$ = probability of 'Strongly Agree'. Then we can construct the cumulative logits:

    $\mathrm{logit}(\theta_1) = \log(\theta_1/(1 - \theta_1)) = \log(\pi_1/(\pi_2 + \pi_3 + \pi_4))$,
    $\mathrm{logit}(\theta_2) = \log(\theta_2/(1 - \theta_2)) = \log((\pi_1 + \pi_2)/(\pi_3 + \pi_4))$,
    $\mathrm{logit}(\theta_3) = \log(\theta_3/(1 - \theta_3)) = \log((\pi_1 + \pi_2 + \pi_3)/\pi_4)$.

The proportional odds model is the following:

    $\mathrm{logit}(\theta_i) = \alpha_i + x\beta$.

Thus we allow the intercept to be different for different cumulative logit functions, but the effect of the explanatory variables is the same across the logit functions. That is, we allow a different $\alpha$ for each of the cumulative odds, but only one set of $\beta$'s for all the cumulative odds. This is the proportionality assumption, and this is why this type of model is called the proportional odds model. Also notice that although this is a model in terms of cumulative odds, we can always recover the probabilities of each response category as follows.

    $\pi_1 = \theta_1$
    $\pi_2 = \theta_2 - \theta_1$
    $\pi_3 = \theta_3 - \theta_2$
    $\pi_4 = 1 - \theta_3$

A Simple Example

We can calculate the cumulative odds from the frequency table.

    proc freq data = ordwarm2;
      table warm;
      ods output onewayfreqs = test (keep = warm frequency cumfrequency);
    run;

    data test1;
      set test;
      if _n_ [...]

    [...] ChiSq
    Intercept 1    1    -2.5550    0.1277    400.0337 [...]

    [...] ChiSq
    Likelihood Ratio    162.4403    3 [...]