28. Multiple regression
The Practice of Statistics in the Life Sciences, Second Edition
TRANSCRIPT
Objectives (PSLS Chapter 28)
Multiple regression
The multiple linear regression model
Indicator variables
Two parallel regression lines
Interaction
Inference for multiple linear regression
The multiple linear regression model
In previous chapters we examined a simple linear regression model
expressing a response variable y as a linear function of one
explanatory variable x. In the population, this model has the form
μy = α + βx
We now examine multiple linear regression models in which the
response variable y is a linear combination of k explanatory
variables. In the population, this model takes the form
μy = β0 + β1x1 + β2x2 + … + βkxk
The parameters can be estimated from sample data, giving
ŷ = b0 + b1x1 + b2x2 + … + bkxk
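As a minimal sketch of how the parameter estimates b0, b1, …, bk are obtained from sample data, the least-squares fit can be computed with numpy on hypothetical data (the data values and true coefficients below are invented for illustration):

```python
import numpy as np

# Hypothetical data: response y built from k = 2 explanatory variables.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 30)
x2 = rng.uniform(0, 5, 30)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 0.5, 30)

# Design matrix with a leading column of 1s for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates (b0, b1, b2) of the population parameters.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values: y-hat = b0 + b1*x1 + b2*x2.
y_hat = X @ b
```

With only mild noise, the estimated slopes land close to the values used to generate the data.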
Assumptions
The mean response μy has a linear relationship with the k
explanatory variables taken together.
The y responses are independent of each other.
For any set of fixed values of the k explanatory variables, the
response y varies Normally.
The standard deviation σ of y is the same for all values of the
explanatory variables. In inference, the value of σ is unknown.
Indicator variables
The multiple regression model can accommodate categorical explanatory
variables by coding them in a binary mode (0, 1).
In particular, we can compare individuals from different groups
(independent SRSs in an observational study or randomized groups in
an experiment) by using an indicator variable.
To compare 2 groups, we simply create an indicator variable Ind such that
Ind = 0 for individuals in one group and
Ind = 1 for individuals in the other group
Two parallel regression lines
When plotting the linear regression pattern of y as a function of x for two
groups, we sometimes find that the two groups have roughly parallel
simple regression lines.
In such instances, we can model the data using a unique multiple linear
regression model with two parallel regression lines, using the
quantitative variable x1 and an indicator variable Indx2 for the groups:
μy = β0 + β1x1 + β2Indx2
β1 is the common slope for both lines
β0 is the intercept for the Indx2 = 0 line
(β0 + β2) is the intercept for the Indx2 = 1 line
Example: Male fruit flies were randomly assigned to either reproduce (IndReprod = 1) or not (IndReprod = 0). Their thorax length and longevity were recorded.
Fitting two separate simple linear regression models to the two groups gives similar slopes, so a unique multiple regression model with an indicator variable for two parallel lines is appropriate:
ŷ = –44.29 + 133.39x1 – 23.55Indx2
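The fitted parallel-lines equation can be turned into a small prediction function. The coefficients come from the fitted model above; the thorax length used in the example call is hypothetical:

```python
def predicted_longevity(thorax, ind_reprod):
    """Fitted parallel-lines model: y-hat = -44.29 + 133.39*x1 - 23.55*Ind."""
    return -44.29 + 133.39 * thorax - 23.55 * ind_reprod

# At any fixed thorax length, the two fitted lines differ by the indicator's
# coefficient: reproducing flies are predicted to live 23.55 fewer days.
gap = predicted_longevity(0.8, 0) - predicted_longevity(0.8, 1)
```

Because the lines are parallel, the gap is the same at every thorax length, which is what makes the single coefficient –23.55 interpretable as a group difference adjusted for thorax length.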
Interaction
When plotting the linear regression pattern of y as a function of x for two
groups, we may find two non-parallel simple regression lines.
We can model such data with a unique multiple linear regression model
using a quantitative variable x1, an indicator variable Indx2 for the groups,
and an interaction term x1Indx2:
μy = β0 + β1x1 + β2Indx2 + β3x1Indx2
Each line has its own slope and intercept.
β1 is the slope for the Indx2 = 0 line
(β1 + β3) is the slope for the Indx2 = 1 line
β0 is the intercept for the Indx2 = 0 line
(β0 + β2) is the intercept for the Indx2 = 1 line
Note that an interaction term can be computed between any two
variables (not just between a quantitative variable and an indicator
variable).
An interaction effect between the variables x1 and x2 means that the
relationship between the mean response y and the explanatory
variable x1 is different for varying values of the explanatory variable x2.
When comparing two groups (x2 is an indicator variable), this means that the
two regression lines will not be parallel.
A random sample of children was taken and their lung capacity (forced expiratory volume, or FEV) was plotted as a function of their age and sex (IndSex = 0 for female and IndSex = 1 for male).
Using an interaction term to account for the non-parallel lines, software gives the following multiple regression model:
ŷ = 0.6739 + 0.18209x1 – 0.7314Indx2 + 0.10613x1Indx2
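The group-specific slopes implied by the interaction term can be checked directly from the fitted equation. The coefficients come from the model above; the ages used are arbitrary, since the slope is the change in predicted FEV per one-year increase in age:

```python
def predicted_fev(age, ind_sex):
    """Fitted interaction model (IndSex: 0 = female, 1 = male)."""
    return 0.6739 + 0.18209 * age - 0.7314 * ind_sex + 0.10613 * age * ind_sex

# Change in predicted FEV per extra year of age, in each group.
female_slope = predicted_fev(11, 0) - predicted_fev(10, 0)  # b1
male_slope = predicted_fev(11, 1) - predicted_fev(10, 1)    # b1 + b3
```

The male slope exceeds the female slope by the interaction coefficient b3 = 0.10613, which is exactly what "the effect of age on FEV is different for males and for females" means in this model.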
Inference for multiple regression
We first want to run an overall test. We use an ANOVA F test to test:
H0: β1 = β2 = … = βk = 0
Ha: H0 is not true (at least one coefficient is not equal to 0)
The squared multiple correlation coefficient R2 is given by the
ANOVA output as
R2 = SSModel / SSTotal
and indicates how much of the variability in the response variable y
can be explained by the specific model tested. A higher R2 indicates
a better model.
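As a minimal sketch of the R2 computation from sums of squares (the observed and fitted values below are hypothetical, standing in for the output of a fitted model):

```python
# Hypothetical observed responses and fitted values from some model.
y = [10.0, 12.0, 14.0, 17.0, 20.0]
y_hat = [10.5, 12.0, 14.5, 16.5, 19.8]

mean_y = sum(y) / len(y)

# SSTotal: total variability of y about its mean.
ss_total = sum((yi - mean_y) ** 2 for yi in y)

# SSModel: variability of the fitted values about the mean of y.
ss_model = sum((fi - mean_y) ** 2 for fi in y_hat)

r_squared = ss_model / ss_total
```

For a model fitted by least squares, this ratio lies between 0 and 1 and is the proportion of variation in y explained by the model.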
Estimating the regression coefficients
If the ANOVA test is significant, we can run individual t tests on each
regression coefficient:
H0: βi = 0 in this specific model
Ha: βi ≠ 0 in this specific model
using the test statistic t = bi / SEbi, which follows the t distribution
with n – k – 1 degrees of freedom when H0 is true.
We can also compute individual level-C confidence intervals for each
of the k regression coefficients in the specific model:
bi ± t* SEbi
where t* is the critical value for the t distribution with n – k – 1 degrees
of freedom.
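The t statistic and confidence interval for a single coefficient are simple arithmetic once software reports the estimate and its standard error. In this sketch, the standard error and the critical value t* are hypothetical round numbers (in practice t* comes from a t table or software for n – k – 1 degrees of freedom):

```python
# Hypothetical output for one coefficient b_i.
b_i = -23.55      # estimated regression coefficient
se_b_i = 2.0      # its standard error (hypothetical)
t_star = 2.01     # hypothetical 95% critical value of t(n - k - 1)

# Individual t test: t = b_i / SE_{b_i}.
t = b_i / se_b_i

# Level-C confidence interval: b_i +/- t* * SE_{b_i}.
ci = (b_i - t_star * se_b_i, b_i + t_star * se_b_i)
```

A |t| this large would be highly significant, and the interval gives the range of plausible values for βi in this specific model.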
The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R2 = 0.81, so this is a very good model that explains 81% of the variation in longevity of male fruit flies in the lab.
The individual t tests are all significant, indicating that in this model, the regression coefficients are significantly different from zero. The confidence intervals give a range of likely values for these parameters.
Because this is a model with two parallel lines, we can conclude that reproducing male fruit flies live between 19 and 28 days less on average than those that do not reproduce, when thorax length is taken into account.
The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R2 = 0.67, so this is a good model that explains 67% of FEV variations in children.
The individual t tests are all significant, indicating that in this model, the regression coefficients are significantly different from zero.
Because this is a model with a significant interaction effect, we conclude that both age and sex influence FEV in children, but that the effect of age on FEV is different for males and for females. The scatterplot indicates that the effect of age is more pronounced for males.
Checking the conditions for inference
The best way to check the conditions for inference is to examine
graphically the scatterplot(s) of y as a function of each xi, and the
residuals (y – ŷ) from the multiple regression model.
Look for:
Linear trends in the scatterplot(s)
Normality of the residuals (histogram of residuals)
Constant σ for all combinations of the xi (residual plot with no particular
pattern and approximately equal vertical spread)
Independence of observations (check the study design or a plot of the
residuals sorted by order of data acquisition)
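Alongside the plots, a couple of numerical residual checks follow directly from the least-squares fit. A minimal sketch on hypothetical simple-regression data (for a multiple regression, the same checks apply against each xi):

```python
import numpy as np

# Hypothetical data generated from a genuinely linear model.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 50)

# Least-squares fit and residuals (y - y-hat).
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ b

# Residuals from a least-squares fit with an intercept average exactly zero,
# and are uncorrelated with x; a clear trend here would signal a bad model.
mean_resid = residuals.mean()
trend = np.corrcoef(x, residuals)[0, 1]
```

These checks complement, not replace, the histogram and residual plots: Normality and equal vertical spread still have to be judged graphically.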