28. Multiple regression
The Practice of Statistics in the Life Sciences, Second Edition
TRANSCRIPT
Objectives (PSLS Chapter 28)
Multiple regression
The multiple linear regression model
Indicator variables
Two parallel regression lines
Interaction
Inference for multiple linear regression
The multiple linear regression model
In previous chapters we examined a simple linear regression model
expressing a response variable y as a linear function of one
explanatory variable x. In the population, this model has the form
μy = α + βx
We now examine multiple linear regression models in which the
response variable y is a linear combination of k explanatory
variables. In the population, this model takes the form
μy = β0 + β1x1 + β2x2 + … + βkxk
The parameters can be estimated from sample data, giving
ŷ = b0 + b1x1 + b2x2 + … + bkxk
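As a minimal sketch of how the parameter estimates b0, b1, …, bk are obtained from sample data, the least-squares fit can be computed with numpy on hypothetical data (the data values and true coefficients below are invented for illustration):

```python
import numpy as np

# Hypothetical data: response y built from k = 2 explanatory variables.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 30)
x2 = rng.uniform(0, 5, 30)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 0.5, 30)

# Design matrix with a leading column of 1s for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates (b0, b1, b2) of the population parameters.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values: y-hat = b0 + b1*x1 + b2*x2.
y_hat = X @ b
```

With only mild noise, the estimated slopes land close to the values used to generate the data.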
Assumptions
The mean response μy has a linear relationship with the k
explanatory variables taken together.
The y responses are independent of each other.
For any set of fixed values of the k explanatory variables, the
response y varies Normally.
The standard deviation σ of y is the same for all values of the
explanatory variables. In inference, the value of σ is unknown.
Indicator variables
The multiple regression model can accommodate categorical explanatory
variables by coding them in a binary mode (0, 1).
In particular, we can compare individuals from different groups
(independent SRSs in an observational study or randomized groups in
an experiment) by using an indicator variable.
To compare 2 groups, we simply create an indicator variable Ind such that
Ind = 0 for individuals in one group and
Ind = 1 for individuals in the other group
Two parallel regression lines
When plotting the linear regression pattern of y as a function of x for two
groups, we sometimes find that the two groups have roughly parallel
simple regression lines.
In such instances, we can model the data using a unique multiple linear
regression model with two parallel regression lines, using the
quantitative variable x1 and an indicator variable Indx2 for the groups:
μy = β0 + β1x1 + β2Indx2
β1 is the common slope for both lines
β0 is the intercept for the Indx2 = 0 line
(β0 + β2) is the intercept for the Indx2 = 1 line
Example: Male fruit flies were randomly assigned to either reproduce (IndReprod = 1) or not (IndReprod = 0). Their thorax length and longevity were recorded.
Fitting two separate simple linear regression models to the two groups gives similar slopes, so a unique multiple regression model with an indicator variable for two parallel lines is appropriate:
ŷ = –44.29 + 133.39x1 – 23.55Indx2
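The fitted parallel-lines equation can be turned into a small prediction function. The coefficients come from the fitted model above; the thorax length used in the example call is hypothetical:

```python
def predicted_longevity(thorax, ind_reprod):
    """Fitted parallel-lines model: y-hat = -44.29 + 133.39*x1 - 23.55*Ind."""
    return -44.29 + 133.39 * thorax - 23.55 * ind_reprod

# At any fixed thorax length, the two fitted lines differ by the indicator's
# coefficient: reproducing flies are predicted to live 23.55 fewer days.
gap = predicted_longevity(0.8, 0) - predicted_longevity(0.8, 1)
```

Because the lines are parallel, the gap is the same at every thorax length, which is what makes the single coefficient –23.55 interpretable as a group difference adjusted for thorax length.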
Interaction
When plotting the linear regression pattern of y as a function of x for two
groups, we may find two non-parallel simple regression lines.
We can model such data with a unique multiple linear regression model
using a quantitative variable x1, an indicator variable Indx2 for the groups,
and an interaction term x1Indx2:
μy = β0 + β1x1 + β2Indx2 + β3x1Indx2
Each line has its own slope and intercept.
β1 is the slope for the Indx2 = 0 line
(β1 + β3) is the slope for the Indx2 = 1 line
β0 is the intercept for the Indx2 = 0 line
(β0 + β2) is the intercept for the Indx2 = 1 line
Note that an interaction term can be computed between any two
variables (not just between a quantitative variable and an indicator
variable).
An interaction effect between the variables x1 and x2 means that the
relationship between the mean response y and the explanatory
variable x1 is different for varying values of the explanatory variable x2.
When comparing two groups (x2 is an indicator variable), this means that the
two regression lines will not be parallel.
A random sample of children was taken and their lung capacity (forced expiratory volume, or FEV) was plotted as a function of their age and sex (IndSex = 0 for female and IndSex = 1 for male).
Using an interaction term to account for the non-parallel lines, software gives the following multiple regression model:
ŷ = 0.6739 + 0.18209x1 – 0.7314Indx2 + 0.10613x1Indx2
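The group-specific slopes implied by the interaction term can be checked directly from the fitted equation. The coefficients come from the model above; the ages used are arbitrary, since the slope is the change in predicted FEV per one-year increase in age:

```python
def predicted_fev(age, ind_sex):
    """Fitted interaction model (IndSex: 0 = female, 1 = male)."""
    return 0.6739 + 0.18209 * age - 0.7314 * ind_sex + 0.10613 * age * ind_sex

# Change in predicted FEV per extra year of age, in each group.
female_slope = predicted_fev(11, 0) - predicted_fev(10, 0)  # b1
male_slope = predicted_fev(11, 1) - predicted_fev(10, 1)    # b1 + b3
```

The male slope exceeds the female slope by the interaction coefficient b3 = 0.10613, which is exactly what "the effect of age on FEV is different for males and for females" means in this model.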
Inference for multiple regression
We first want to run an overall test. We use an ANOVA F test to test:
H0: β1 = β2 = … = βk = 0
Ha: H0 is not true (at least one coefficient is not equal to 0)
The squared multiple correlation coefficient R2 is given by the
ANOVA output as
R2 = SSModel / SSTotal
and indicates how much of the variability in the response variable y
can be explained by the specific model tested. A higher R2 indicates
a better model.
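As a minimal sketch of the R2 computation from sums of squares (the observed and fitted values below are hypothetical, standing in for the output of a fitted model):

```python
# Hypothetical observed responses and fitted values from some model.
y = [10.0, 12.0, 14.0, 17.0, 20.0]
y_hat = [10.5, 12.0, 14.5, 16.5, 19.8]

mean_y = sum(y) / len(y)

# SSTotal: total variability of y about its mean.
ss_total = sum((yi - mean_y) ** 2 for yi in y)

# SSModel: variability of the fitted values about the mean of y.
ss_model = sum((fi - mean_y) ** 2 for fi in y_hat)

r_squared = ss_model / ss_total
```

For a model fitted by least squares, this ratio lies between 0 and 1 and is the proportion of variation in y explained by the model.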
Estimating the regression coefficients
If the ANOVA test is significant, we can run individual t tests on each
regression coefficient:
H0: βi = 0 in this specific model
Ha: βi ≠ 0 in this specific model
using the test statistic t = bi / SEbi, which follows the t distribution
with n – k – 1 degrees of freedom when H0 is true.
We can also compute individual level-C confidence intervals for each
of the k regression coefficients in the specific model:
bi ± t* SEbi
where t* is the critical value for the t distribution with n – k – 1 degrees
of freedom.
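The t statistic and confidence interval for a single coefficient are simple arithmetic once software reports the estimate and its standard error. In this sketch, the standard error and the critical value t* are hypothetical round numbers (in practice t* comes from a t table or software for n – k – 1 degrees of freedom):

```python
# Hypothetical output for one coefficient b_i.
b_i = -23.55      # estimated regression coefficient
se_b_i = 2.0      # its standard error (hypothetical)
t_star = 2.01     # hypothetical 95% critical value of t(n - k - 1)

# Individual t test: t = b_i / SE_{b_i}.
t = b_i / se_b_i

# Level-C confidence interval: b_i +/- t* * SE_{b_i}.
ci = (b_i - t_star * se_b_i, b_i + t_star * se_b_i)
```

A |t| this large would be highly significant, and the interval gives the range of plausible values for βi in this specific model.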
The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R2 = 0.81, so this is a very good model that explains 81% of the variation in longevity of male fruit flies in the lab.
The individual t tests are all significant, indicating that in this model, the regression coefficients are significantly different from zero. The confidence intervals give a range of likely values for these parameters.
Because this is a model with two parallel lines, we can conclude that reproducing male fruit flies live between 19 and 28 days less on average than those that do not reproduce, when thorax length is taken into account.
The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R2 = 0.67, so this is a good model that explains 67% of FEV variations in children.
The individual t tests are all significant, indicating that in this model, the regression coefficients are significantly different from zero.
Because this is a model with a significant interaction effect, we conclude that both age and sex influence FEV in children, but that the effect of age on FEV is different for males and for females. The scatterplot indicates that the effect of age is more pronounced for males.
Checking the conditions for inference
The best way to check the conditions for inference is to examine
graphically the scatterplot(s) of y as a function of each xi, and the
residuals (y – ŷ) from the multiple regression model.
Look for:
Linear trends in the scatterplot(s)
Normality of the residuals (histogram of residuals)
Constant σ for all combinations of the xi (residual plot with no particular
pattern and approximately equal vertical spread)
Independence of observations (check the study design or a plot of the
residuals sorted by order of data acquisition)
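Alongside the plots, a couple of numerical residual checks follow directly from the least-squares fit. A minimal sketch on hypothetical simple-regression data (for a multiple regression, the same checks apply against each xi):

```python
import numpy as np

# Hypothetical data generated from a genuinely linear model.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 50)

# Least-squares fit and residuals (y - y-hat).
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ b

# Residuals from a least-squares fit with an intercept average exactly zero,
# and are uncorrelated with x; a clear trend here would signal a bad model.
mean_resid = residuals.mean()
trend = np.corrcoef(x, residuals)[0, 1]
```

These checks complement, not replace, the histogram and residual plots: Normality and equal vertical spread still have to be judged graphically.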