MULTIPLE LINEAR REGRESSION
TRANSCRIPT
Avjinder Singh Kaler and Kristi Mai
We will look at a method for analyzing a linear relationship involving
more than two variables.
We focus on these key elements:
1. Finding the multiple regression equation.
2. Using the adjusted R2 value and the p-value as measures of
how well the multiple regression equation fits the sample data.
• Multiple Regression Equation – given a collection of sample data with several (𝑘-many) explanatory variables, the regression equation that algebraically describes the relationship between the response variable 𝑦 and the explanatory variables 𝑥1, 𝑥2, …, 𝑥𝑘 is:
ŷ = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2 + ⋯ + 𝑏𝑘𝑥𝑘
• We are now using more than one explanatory variable to predict a response variable
• In practice, you need large amounts of data to use several predictor/explanatory variables
• Guideline: your sample size should be at least 10 times the number of 𝑥 variables
• Multiple Regression Line – the graph of the multiple regression equation
• This multiple regression line still fits the sample points best according to the least-squares property
• Visualization – scatterplots of each pair (𝑥𝑘 , 𝑦) of quantitative data can still be helpful in determining whether there is a relationship between each explanatory variable and the response
• These scatterplots can be created one at a time. However, it is common to visualize all the pairs of variables within one plot. This is often called a pairs plot, pairwise scatterplot or scatterplot matrix.
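A pairwise correlation matrix is the numeric counterpart of a scatterplot matrix and can be computed in one call. This is a minimal sketch using synthetic data (the variable names and values are invented for illustration):

```python
import numpy as np

# Synthetic sample: two explanatory variables and a response (hypothetical data).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(63, 2, n)          # e.g., a mother's height, in inches
x2 = rng.normal(69, 3, n)          # e.g., a father's height, in inches
y = 7.5 + 0.7 * x1 + 0.16 * x2 + rng.normal(0, 1, n)

# np.corrcoef treats each row as a variable and returns the full pairwise
# correlation matrix, i.e. the numbers a scatterplot matrix displays visually.
R = np.corrcoef(np.vstack([x1, x2, y]))
print(R.shape)        # (3, 3)
print(R[0, 2] > 0)    # x1 correlates positively with y here
```

Plotting libraries (e.g., pandas' `scatter_matrix`) render the same pairs graphically.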
• Population parameter (equation): 𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘
• Sample statistic (equation): ŷ = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2 + ⋯ + 𝑏𝑘𝑥𝑘
Note:
• ŷ is the predicted value of 𝑦
• 𝑘 is the number of predictor variables (also called independent
variables or 𝑥 variables)
• Requirements for Regression:
1. The sample data is a Simple Random Sample of quantitative data
2. Each of the pairs of data (𝑥𝑘 , 𝑦) has a bivariate normal distribution (recall this definition)
3. Random errors associated with the regression equation (i.e. residuals) are independent and normally distributed with a mean of 0 and a standard deviation 𝜎
• Formulas for 𝑏𝑘:
• Statistical software will be used to calculate the individual coefficient estimates, 𝑏𝑘
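To demystify what the software is doing, the estimates 𝑏0, 𝑏1, …, 𝑏𝑘 come from least squares on a design matrix whose first column is all 1s. A sketch with synthetic data (the true coefficients 2.0, 3.0, and -1.5 are made up for the demonstration):

```python
import numpy as np

# Least-squares estimation of b0, b1, ..., bk; in practice statistical
# software performs this step for you.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), x1, x2])   # leading column of 1s gives b0
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min ||y - Xb||^2
print(np.round(b, 1))                        # close to [2.0, 3.0, -1.5]
```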
1. Use common sense and practical considerations to include or exclude variables
2. Consider the P-value for the test of overall model significance
• Hypotheses: 𝐻0: 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0 vs. 𝐻1: at least one 𝛽𝑘 ≠ 0
• Test Statistic: 𝐹 = 𝑀𝑆(𝑅𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛) / 𝑀𝑆(𝐸𝑟𝑟𝑜𝑟)
• This will result in an ANOVA table with a p-value that expresses the overall statistical significance of the model
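The F statistic in that ANOVA table is just the two mean squares divided, with 𝑘 and 𝑛 − 𝑘 − 1 degrees of freedom. A sketch on synthetic data (all values invented; software reports the matching p-value from the F distribution):

```python
import numpy as np

# Overall model significance: F = MS(Regression) / MS(Error).
rng = np.random.default_rng(2)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
ss_reg = np.sum((y_hat - y.mean()) ** 2)     # SS(Regression)
ss_err = np.sum((y - y_hat) ** 2)            # SS(Error)
F = (ss_reg / k) / (ss_err / (n - k - 1))    # MS(Regression) / MS(Error)
print(F > 1)   # a large F favors H1: at least one beta_k != 0
```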
3. Consider equations with high adjusted 𝑹𝟐 values
• 𝑅 is the multiple correlation coefficient that describes the correlation between the observed 𝑦 values and the predicted 𝑦 values
• 𝑅2 is the multiple coefficient of determination and measures how well the multiple regression equation fits the sample data
• Problem: this measure of model “fit” never decreases as more variables are added; it keeps increasing, however slightly, no matter how insignificant the most recently added predictor may be
• Adjusted 𝑅2 is the multiple coefficient of determination that is modified to account for the number of variables in the model and the sample size
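The adjustment uses the standard formula adjusted 𝑅2 = 1 − (1 − 𝑅2)(𝑛 − 1)/(𝑛 − 𝑘 − 1). A small sketch with invented 𝑅2 values shows why it penalizes a weak added predictor:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a predictor that barely raises R^2 can LOWER adjusted R^2:
print(round(adjusted_r2(0.70, 30, 2), 3))   # 0.678 with two predictors
print(round(adjusted_r2(0.71, 30, 3), 3))   # 0.677 after adding a weak third
```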
4. Consider equations with the fewest predictor/explanatory variables when the models being compared are nearly equivalent in terms of significance and fit (i.e. p-value and adjusted 𝑅2)
• This is known as the “Law of Parsimony”
• We are looking for the simplest yet most informative model
• Individual t-tests of particular regression parameters may help select the correct model and eliminate insignificant explanatory variables
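As a sketch of where those individual t statistics come from (synthetic data; software reports these for you): each 𝑡𝑗 = 𝑏𝑗 / SE(𝑏𝑗), with the standard errors taken from the diagonal of 𝑠²(𝑋ᵀ𝑋)⁻¹, where 𝑠² is MS(Error). Here 𝑥2 is deliberately irrelevant to 𝑦:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)               # true coefficient is 0 (irrelevant)
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])               # MS(Error)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # SE of each b_j
t = b / se
print(abs(t[1]) > 2)   # x1 is clearly significant; x2's |t| should be small
```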
Notice: If the regression equation does not appear to be useful for predictions, the best predicted value of a 𝑦 variable is still its point estimate [i.e. the sample mean of the 𝑦 variable would be the best predicted value for that variable]
• Identify the response and potential explanatory variables by constructing a scatterplot matrix
• Create a multiple regression model
• Perform the appropriate tests of the following:
• Overall model significance (the ANOVA i.e. the 𝐹 test)
• Individual variable significance (𝑡 tests)
• In addition, find the following:
• Find the adjusted 𝑅2 value to assess the predictive power of the model
• Perform a Residual Analysis to verify the Requirements for Linear Regression have been satisfied:
1. Construct a residual plot and verify that there is no pattern (other than a straight line pattern) and also verify that the residual plot does not become thicker or thinner
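Two quick numeric checks complement a visual residual plot: for a least-squares fit that includes an intercept, the residuals average to zero and are uncorrelated with the fitted values. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
resid = y - fitted

print(abs(resid.mean()) < 1e-8)                        # residual mean ~ 0
print(abs(np.corrcoef(fitted, resid)[0, 1]) < 1e-6)    # no linear trend left
```

A plot of `resid` against `fitted` is still the standard way to spot curvature or a fan shape (thickening/thinning).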
2. Use a histogram, normal quantile plot, or Shapiro-Wilk test of normality to confirm that the residuals have a distribution that is approximately normal
• Normal Quantile Plot (aka QQ Plot) – three examples (normal, uniform, and skewed data) are described below
• Shapiro-Wilk Normality Test
• This test helps you assess the normality of a given set of data (in this case, the residuals) when visual examination of the QQ plot and/or histogram is unclear and leaves you stumped!
• Hypotheses:
𝐻0: The data come from a normal distribution
𝐻1: The data do not come from a normal distribution
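The test is available in SciPy as `scipy.stats.shapiro`. A minimal sketch on synthetic residuals (the data are invented; with real data you would pass your model's residuals):

```python
import numpy as np
from scipy import stats

# Shapiro-Wilk test: a large p-value means we fail to reject H0 that the
# data come from a normal distribution.
rng = np.random.default_rng(4)
residuals = rng.normal(0, 1, 50)   # stand-in for regression residuals

stat, p = stats.shapiro(residuals)
print(0 < stat <= 1)   # the W statistic always lies in (0, 1]
```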
Normal: Histogram of IQ scores is close to being bell-shaped, suggests that the IQ
scores are from a normal distribution. The normal quantile plot shows points that are
reasonably close to a straight-line pattern. It is safe to assume that these IQ scores
are from a normally distributed population.
Uniform: Histogram of data having a uniform distribution. The corresponding
normal quantile plot suggests that the points are not normally distributed because
the points show a systematic pattern that is not a straight-line pattern. These
sample values are not from a population having a normal distribution.
Skewed: Histogram of the amounts of rainfall in Boston for every Monday during
one year. The shape of the histogram is skewed, not bell-shaped. The
corresponding normal quantile plot shows points that are not at all close to a
straight-line pattern. These rainfall amounts are not from a population having a
normal distribution.
The table to the right includes a random
sample of heights of mothers, fathers, and their
daughters (based on data from the National
Health and Nutrition Examination).
Find the multiple regression equation in which
the response (y) variable is the height of a
daughter and the predictor (x) variables are
the height of the mother and height of the
father.
The StatCrunch results are shown here:
From the display, we see that the multiple
regression equation is:
𝐷𝑎𝑢𝑔ℎ𝑡𝑒𝑟 = 7.5 + 0.707𝑀𝑜𝑡ℎ𝑒𝑟 + 0.164 𝐹𝑎𝑡ℎ𝑒𝑟
We could write this equation as:
𝑦 = 7.5 + 0.707𝑥1 + 0.164𝑥2
where 𝑦 is the predicted height of a
daughter,
𝑥1 is the height of the mother, and 𝑥2 is the
height of the father.
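The fitted equation can be evaluated directly; the coefficients below are the StatCrunch estimates quoted above, and the input heights of 63 and 69 inches are purely illustrative:

```python
# Evaluate y-hat = 7.5 + 0.707*x1 + 0.164*x2 for chosen parent heights.
def predict_daughter(mother, father):
    return 7.5 + 0.707 * mother + 0.164 * father

# Illustrative inputs: mother 63 in., father 69 in.
print(round(predict_daughter(63, 69), 1))   # 63.4 inches
```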
The preceding technology display shows the adjusted coefficient of
determination as R-Sq(adj) = 63.7%.
When we compare this multiple regression equation to others, it is better
to use the adjusted R2 of 63.7%
Based on StatCrunch, the p-value is less than 0.0001, indicating that the multiple regression equation has good overall significance and is usable for predictions.
That is, it makes sense to predict the heights of daughters based on heights of mothers and fathers.
The p-value results from a test of the null hypothesis that β1 = β2 = 0, and rejection of this hypothesis indicates the equation is effective in predicting the heights of daughters.
Data Set 2 in Appendix B includes the age, foot length, shoe print length, shoe size, and height for each of 40 different subjects. Using those sample data, find the regression equation that is the best for predicting height. The table on the next slide includes key results from the combinations of the five predictor variables.
Using critical thinking and statistical analysis:
1. Delete the variable age.
2. Delete the variable shoe size, because it is really a rounded form of foot length.
3. For the remaining variables of foot length and shoe print length, select foot length because its adjusted R2 of 0.7014 is greater than 0.6520 for shoe print length.
4. Although it appears that foot length alone is best, we note that criminals usually wear shoes, so shoe print lengths are more likely to be found at a crime scene than foot lengths.
Hence, the final regression equation includes only shoe print length: 𝑦 = 𝛽0 + 𝛽1𝑥1,
where 𝛽0 is the intercept and 𝛽1 is the coefficient corresponding to the 𝑥1 variable (shoe print length).
The methods of the above section (Multiple Linear Regression) rely on variables that are continuous in nature. Many times we are interested in dichotomous or binary variables.
These variables have only two possible categorical outcomes such as male/female, success/failure, dead/alive, etc.
Indicator or dummy variables are artificial variables that can be used to specify the categories of the binary variable such as 0=male/1=female.
If an indicator variable is included in the regression model as a predictor/explanatory variable, the methods we have are appropriate.
HOWEVER, can we handle a situation where the variable we are trying to predict is itself categorical and/or binary? Notice that this is a different situation.
But, YES, we can!
The data in the table also includes
the dummy variable of sex (coded as 0 = female and 1 = male).
Given that a mother is 63 inches tall and a father is 69 inches tall, find the regression equation and use it to predict the height of a daughter and a son.
Using technology, we get the regression equation:
𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝐶ℎ𝑖𝑙𝑑 = 25.6 + 0.377 𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝑀𝑜𝑡ℎ𝑒𝑟 + 0.195 𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝐹𝑎𝑡ℎ𝑒𝑟 + 4.15(𝑠𝑒𝑥)
We substitute in 0 for the sex variable, 63 for the mother, and 69 for the
father, and predict the daughter will be 62.8 inches tall.
We substitute in 1 for the sex variable, 63 for the mother, and 69 for the
father, and predict the son will be 67 inches tall.
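The two substitutions above can be reproduced in a few lines; the coefficients are taken from the technology output quoted above, with sex coded 0 = female, 1 = male:

```python
# Dummy-variable model: the sex indicator shifts the prediction by 4.15 in.
def predict_child(mother, father, sex):
    return 25.6 + 0.377 * mother + 0.195 * father + 4.15 * sex

daughter = predict_child(63, 69, 0)   # sex = 0 for a daughter
son = predict_child(63, 69, 1)        # sex = 1 for a son
print(round(daughter, 1))   # 62.8
print(round(son, 1))        # 67.0
```

Note the son's prediction exceeds the daughter's by exactly the dummy coefficient, 4.15 inches, since the parent heights are held fixed.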