
Multiple Linear Regression

Song Ge, BSN, RN, PhD Candidate

Johns Hopkins University School of Nursing

www.nursing.jhu.edu

NR120.508 Biostatistics for Evidence‐based Practice

Learning Objectives

By the end of this module, you will be able to:

1. Articulate assumptions for multiple linear regression

2. Explain the primary components of multiple linear regression

3. Identify and define the variables included in the regression equation

4. Construct a multiple regression equation

5. Calculate a predicted value of a dependent variable using a multiple regression equation

Learning Objectives Cont’d

6. Distinguish between unstandardized (B) and standardized (Beta) regression coefficients

7. Distinguish between different methods for entering predictors into a regression model (simultaneous, hierarchical and stepwise)

8. Identify strategies to assess model fit

9. Interpret and report the results of multiple linear regression analysis

Review of lecture two weeks ago

• Linear regression assumes a linear relationship between independent variable(s) and dependent variable

• Linear regression allows us to predict an outcome based on one or several predictors

• Linear regression allows us to explain the interrelationships among variables

• Linear regression is a parametric test

How to choose X and Y?

• Y can be regressed on X

• X can be regressed on Y

• The regression is not symmetric

• The choice of which regression to perform depends on the scientific question:
– Is X to be used to explain or predict Y?
– Is Y to be used to explain or predict X? (e.g. Does poor health status explain high pollution level?)

Linear Regression Assumptions

1. The independent variable can be on any scale (ratio, nominal, etc.)

2. The dependent variable needs to be on a ratio/interval scale

3. The dependent variable needs to be normally distributed overall and normally distributed for each value of the independent variable

4. If the dependent variable is not normally distributed, we can transform it

Review: Normal distribution

Example of transformed data

Positively skewed

Normally distributed

Method        Math Operation     Good for:                                            Bad for:
Log           ln(x), log10(x)    Right skewed data                                    Zero values, negative values
Square root   √x                 Right skewed data                                    Negative values
Square        x^2                Left skewed data                                     Negative values
Cube root     x^(1/3)            Right skewed data, negative values                   Not as effective as log transform
Reciprocal    1/x                Making small values bigger and big values smaller    Zero values, negative values

https://infoactive.co/data‐design/ch11.html
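The effect of these transformations on skew can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are simulated (log-normal, strictly positive), and `skewness` is a helper written for this example, not part of any statistics package.

```python
import math
import random

def skewness(values):
    """Skewness as the mean cubed z-score (values near 0 suggest symmetry)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

random.seed(0)
# Hypothetical right-skewed data; every value is positive, so all
# three transformations below are defined.
raw = [random.lognormvariate(0.0, 0.8) for _ in range(5000)]

log_x = [math.log(v) for v in raw]      # undefined for zero or negative values
sqrt_x = [math.sqrt(v) for v in raw]    # undefined for negative values
cbrt_x = [v ** (1 / 3) for v in raw]    # milder correction than the log

skew_raw = skewness(raw)
skew_log = skewness(log_x)
skew_sqrt = skewness(sqrt_x)
skew_cbrt = skewness(cbrt_x)

for name, value in [("raw", skew_raw), ("log", skew_log),
                    ("sqrt", skew_sqrt), ("cube root", skew_cbrt)]:
    print(f"{name:>9}: skewness = {value:.2f}")
```

For data like these, the log transform should bring the skewness closest to zero, with the square root and cube root giving partial corrections.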

Continued…

5. Samples must be representative of the population

6. There is no multicollinearity. Multicollinearity occurs when the independent variables are so strongly intercorrelated that they are indistinguishable from each other

The variance inflation factor (VIF) is always at least 1. As a rule of thumb, a VIF between 1 and 10 suggests no problematic multicollinearity, and a VIF above 10 suggests multicollinearity

Continued…

7. The relationship between x and y must be linear. When two scores are graphed, they should tend to form a straight line. If that is not a linear relationship, other methods must be used.

8. For every value of X, the distribution of Y scores must have approximately equal variability (homoscedasticity)


• Recall the student scores example from the previous module

• What will you do if you are interested in studying the relationship of final grade with midterm (or screening) score and other variables such as previous (undergraduate) GPA, GRE score and motivation?

• A simple linear regression (SLR) cannot handle this

• A separate SLR with each explanatory (independent) variable will provide information in isolation

• You will need to use a multiple linear regression (MLR) method to study them together


• A multiple linear regression model shows the relationship between the dependent variable and multiple (two or more) independent variables

• The overall variance explained by the model (R2) as well as the unique contribution (strength and direction) of each independent variable can be obtained

• In MLR, the shape is not really a line. If there are three variables, the shape is a plane, and if there are four or more variables, it is impossible to visualize or graph. However, by convention, we still refer to the regression equation as a regression 'line'.
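To make the idea of fitting several predictors at once concrete, here is a minimal least-squares sketch in plain Python. It is an illustration under stated assumptions, not how SPSS computes its output: the function names `solve` and `fit_mlr` and the data are hypothetical, and the data are generated exactly from Y = 10 + 2·X1 − 3·X2 so the fit can be checked.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_mlr(X, y):
    """Return least-squares estimates [a, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(x) for x in X]     # leading 1 carries the intercept
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

# Hypothetical data: two predictors, outcome built with no noise
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 2)]
y = [10 + 2 * x1 - 3 * x2 for x1, x2 in X]

a, b1, b2 = fit_mlr(X, y)
print(f"intercept={a:.3f}, b1={b1:.3f}, b2={b2:.3f}")
```

With two predictors the fitted surface is the plane described above; real data would include noise, so the estimates would only approximate the generating values.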

MLR with Two Predictors

http://www.aetheling.com/models/cusp/Intro.htm

Multiple Linear Regression Equation

• MLR is sometimes also called multivariate linear regression

• The prediction equation is Y′ = a + b1X1 + b2X2 + b3X3 + ∙ ∙ ∙ + bkXk

• There is still one intercept constant, a, but each independent variable (e.g., X1, X2, X3) has its own regression coefficient
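The prediction equation translates directly into code. The sketch below is a minimal illustration; the function name `predict` and the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def predict(intercept, coefficients, values):
    """Evaluate Y' = a + b1*X1 + b2*X2 + ... + bk*Xk for one case."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one value per regression coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Hypothetical model: a = 1.5, b1 = 0.5, b2 = 2.0, b3 = -1.0
# Hypothetical case:  X1 = 4,  X2 = 1,  X3 = 3
y_hat = predict(1.5, [0.5, 2.0, -1.0], [4, 1, 3])
print(y_hat)  # 1.5 + 2.0 + 2.0 - 3.0
```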

Review: Simple linear regression

• Y’ is a linear function of X

• Y’ = a + bx

• a = intercept

• b = slope

Interpretation of MLR Coefficients

Group exercise: interpret B0, B1 and B2

• Data are from children aged 1 to 5 years in the …

• Variables
– Y is the child’s arm circumference (cm)
– X1 is the age of the child (months)
– X2 is the height of the child (cm)

• Does arm circumference increase with increasing child age after controlling for child height?

• Multiple linear regression model

• Y = B0 + B1 X1 + B2 X2

Answers

• B0 = the estimated mean arm circumference when the values of age and height are zero

• B1 = the change in the estimated mean arm circumference associated with each 1 month increase in age if height is unchanged

• B2 = You do!
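One way to see why B1 is read as the change per one-month increase in age with height held fixed is to subtract the model's estimated mean at age x1 from its estimated mean at age x1 + 1 while keeping height at the same value x2 (x1 and x2 are placeholder values introduced here for illustration):

```latex
\begin{aligned}
\hat{Y}(x_1 + 1,\, x_2) - \hat{Y}(x_1,\, x_2)
  &= \bigl[B_0 + B_1 (x_1 + 1) + B_2 x_2\bigr] - \bigl[B_0 + B_1 x_1 + B_2 x_2\bigr] \\
  &= B_1
\end{aligned}
```

The intercept and the height term cancel, which is what "controlling for height" means in this model.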

Multiple Linear Regression Models

• We can get six critical pieces of information from an MLR:
– The overall significance of the model
– The variance in the dependent variable that comes from the set of independent variables in the model
– The statistical significance of each individual independent variable (controlling for the others)
– The direct effect (and direction of the effect) of each independent variable on the dependent variable
– The relative strength of the independent variables
– The regression equation, which allows us to predict values of the dependent variable given values of the independent variables

The overall piece: R2 (coefficient of determination)

• R2 provides the proportion of variability explained by using X

• R2 measures the ability to predict an individual Y using its X(s)

• Statistical significance of the overall model (Model F-test)

• Recall that R is the population correlation coefficient
– Takes on values between -1 and +1
– 0 indicates no linear association; 1 indicates a perfect positive linear relationship; -1 indicates a perfect negative linear relationship
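As a rough illustration of "proportion of variability explained", R2 can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the numbers below are hypothetical.

```python
def r_squared(actual, predicted):
    """R2 = 1 - SS_residual / SS_total."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Hypothetical observed values and model predictions
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

r2 = r_squared(actual, predicted)   # SS_res = 1, SS_tot = 20
print(f"R2 = {r2:.2f}")
```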

R: population correlation coefficient

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

SPSS output for R square

The individual piece: Correlation coefficient

F-test of a regression coefficient: tests whether the independent variable associated with it contributes significantly to the variance accounted for in the dependent variable

Group exercise

• Propose a research question that can be answered by MLR

• State the assumptions under which we use this statistical method

• State the formula and what B0, B1 and B2 stand for

• Break

• https://www.youtube.com/watch?v=LVw9YdP1O-0

Example

• We are interested in knowing if going to restaurants frequently (five or more times/week) can lead to higher cholesterol. We also know that age, gender, and race/ethnicity can affect cholesterol. How can we tell if going out to restaurants frequently, this factor alone, will affect cholesterol levels?

• Do age, gender, ethnicity, and going out to eat frequently all affect cholesterol levels?
– Dependent variable: cholesterol level
– Independent variables: age (years), gender (male/female), race/ethnicity (Black, White, Asian, or Hispanic), frequency of going out to eat (5+ times/week vs less than 5 times/week)

Linear Regression Assumptions

• Linear regression is a parametric method and requires that certain assumptions be met to be valid.

1. The sample must be representative of the population

2. The dependent variable must be of ratio/interval scale and normally distributed overall and normally distributed for each value of the independent variables

3. For every value of X, the distribution of Y scores must have approximately equal variability (homoscedasticity)

4. The relationship between X and Y must be linear

5. The independent variables are not very strongly inter-correlated (no multicollinearity)

Creating Dummy Variables

• Using dummy variables is a way to express a nominal independent variable with multiple categories by a series of dichotomous (binary) variables that compare one category to a different category that serves as the reference

• The number of dummy variables created will be one less than the number of categories of the variable

• One of the categories is chosen to serve as the “reference” category

• You then include all the dummy variables in the regression model instead of the original categorical variable

Creating Dummy Variables: Example

• Let's say we have a race/ethnicity variable with four categories (non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, and Hispanic)

• If we want to use it in a multiple regression, we would need to create three variables (4-1) to represent the four categories

• We would put these variables into the multiple regression equation instead of the four category race/ethnicity variable

Example Cont’d

• We would therefore create 3 (4−1) dummy variables and choose one category as the reference, in this case, non-Hispanic White
– Non-Hispanic Black (1=yes, 0=no)
– Non-Hispanic Asian (1=yes, 0=no)
– Hispanic (1=yes, 0=no)
– Say these are called Dummy1, Dummy2 and Dummy3

Race/Ethnicity        Dummy1   Dummy2   Dummy3
Non-Hispanic Black    1        0        0
Non-Hispanic Asian    0        1        0
Hispanic              0        0        1
Non-Hispanic White    0        0        0
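The coding in the table can be produced with a few lines of code. This is a minimal sketch: the function name `dummy_code` is hypothetical, and the order of the dummy variables simply follows the order of the category list with the reference category left out.

```python
CATEGORIES = [
    "Non-Hispanic White",
    "Non-Hispanic Black",
    "Non-Hispanic Asian",
    "Hispanic",
]
REFERENCE = "Non-Hispanic White"

def dummy_code(value, categories=CATEGORIES, reference=REFERENCE):
    """Return one 0/1 indicator per non-reference category (k - 1 in total)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories if c != reference]

# Each row prints the category followed by [Dummy1, Dummy2, Dummy3]
for category in CATEGORIES:
    print(f"{category:<20} {dummy_code(category)}")
```

The reference category is the only one coded as all zeros, so each dummy coefficient in the regression compares its category with the reference.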

Information from MLR

• Overall variance explained by the model (e.g., do the independent variables in the model, taken together, do a good job at predicting the dependent variable?) using the adjusted R2

• Statistical significance of the overall model (Model F-test of R2)

• The strength, direction, and statistical significance of each independent variable (regression coefficients)

• Regression equation as a whole can be used to predict values of the dependent variable for a given set of values of the independent variables

MLR: Analysis Example

• We will use data on 489 NYCHANES study participants to look at a number of potential predictors of total cholesterol (mg/dL)

• The dependent variable is total cholesterol (mg/dL)

• We can see that total cholesterol is somewhat right-skewed

MLR: Analysis Example Cont’d

• To correct for this departure from normality, an adjustment called a “transformation” of the variable can be made

• In this case, we take the natural log of cholesterol. This makes the dependent variable approximately normally distributed


• We will use multivariate linear regression to look at a number of independent variables
– Gender (female=1 vs. male=0)
– Age (continuous)
– Frequency of eating in restaurants (frequent=1 vs. infrequent=0)
– Race/ethnicity (Black, White, Asian, or Hispanic)

• Note that the race/ethnicity variable has four categories. In order to look at this variable in a regression model, we will have to create dummy variables.


• We will create 3 (4−1) dummy variables and use the category “White” as the reference. The variable coding will be
– Black (1 = person is non-Hispanic Black; 0 = person is any other race/ethnicity)
– Asian (1 = person is non-Hispanic Asian; 0 = person is any other race/ethnicity)
– Hispanic (1 = person is Hispanic; 0 = person is not Hispanic)


• We are testing a number of hypotheses, one null and one alternate hypothesis for each independent variable in the model. For example, one hypothesis we are testing is
– H0: There is no association between frequency of eating out and total cholesterol, adjusting for gender, age, and race/ethnicity (adjusted beta=0)
– Ha: There is an association between frequency of eating out and total cholesterol, adjusting for gender, age, and race/ethnicity (adjusted beta≠0)

Analysis Example: Model Summary

• Adjusted R2=0.031

• The four independent variables explain 3.1% of the variance in the dependent variable.

Model Summary

Model   R        R2      Adjusted R2   Std. Error of the Estimate
1       0.211a   0.044   0.031         0.19918

a. Predictors: (Constant), Hispanic, restaurant_dich, participant gender, age in years, Asian, Black
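Adjusted R2 penalizes R2 for the number of predictors using the standard formula 1 − (1 − R2)(n − 1)/(n − k − 1). The sketch below applies it to the rounded values in the table. It rests on two assumptions that the output does not confirm: that all n = 489 participants were used in the model, and that k = 6 predictors were entered (the six listed in the footnote).

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R2 for a model with k predictors fitted to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Rounded R2 from the Model Summary; n and k are assumptions (see lead-in)
adj_r2 = adjusted_r_squared(r2=0.044, n=489, k=6)
print(f"Adjusted R2 = {adj_r2:.4f}")
```

With these rounded inputs the result is about 0.032, close to the reported 0.031. The small gap may reflect rounding of R2 or cases excluded for missing data, which the slides do not specify.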

Analysis Example: ANOVA

• The p-value for the overall model is 0.004. The amount of variance explained by the model (independent variables) is statistically significant

Analysis Example: Coefficients

• Beta for gender (−0.015), beta for age (0.002), beta for eating in restaurants (0.008), beta for Black (−0.053), beta for Asian (0.006), beta for Hispanic (−0.040), and the regression constant (5.189)

Coefficients (a)

                     Unstandardized         Standardized                      95% Confidence Interval for B
Model 1              B        Std. Error    Beta           t         Sig.     Lower Bound   Upper Bound
(Constant)           5.189    0.040                        129.516   0.000    5.110         5.268
participant gender   −0.015   0.021         −0.038         −0.750    0.453    −0.056        0.025
age in years         0.002    0.001         0.163          3.252     0.001    0.001         0.004
restaurant_dich      0.008    0.025         0.016          0.316     0.752    −0.041        0.056
Black                −0.053   0.030         −0.100         −1.767    0.078    −0.111        0.006
Asian                0.006    0.031         0.011          0.197     0.844    −0.056        0.068
Hispanic             −0.040   0.025         −0.096         −1.607    0.109    −0.088        0.009

a. Dependent variable: LNCholesterol

Example: Estimated Equation

• We can construct the regression equation and use it to make predictions:

• Predicted ln(cholesterol) = 5.189 − 0.015 (gender) + 0.002 (age) − 0.053 (Black) + 0.006 (Asian) − 0.040 (Hispanic) + 0.008 (restaurant_dich)



Example: Prediction

• From this model, we can predict the total cholesterol of a 25-year-old White woman who does not eat out often

• Predicted ln(cholesterol) = 5.189 − 0.015(1) + 0.002(25) − 0.053(0) + 0.006(0) − 0.040(0) + 0.008(0) = 5.224

• Cholesterol values were log-transformed, so we need to back-transform (exponentiate in this case). Don’t forget about it!

• e^5.224 ≈ 185.68 mg/dL (using the rounded coefficients shown in the table)
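The prediction and back-transformation steps can be sketched in code. This is an illustration only: it uses the rounded coefficients printed in the Coefficients table, so results may differ from values computed with unrounded estimates, and the function name `predict_ln_cholesterol` is hypothetical.

```python
import math

# Rounded unstandardized coefficients (B) from the Coefficients table
CONSTANT = 5.189
COEFFICIENTS = {
    "gender": -0.015,          # female = 1, male = 0
    "age": 0.002,              # years
    "restaurant_dich": 0.008,  # frequent = 1, infrequent = 0
    "Black": -0.053,
    "Asian": 0.006,
    "Hispanic": -0.040,
}

def predict_ln_cholesterol(person):
    """Evaluate the estimated equation; missing variables are treated as 0."""
    return CONSTANT + sum(b * person.get(name, 0) for name, b in COEFFICIENTS.items())

# 25-year-old White woman who does not eat out often
person = {"gender": 1, "age": 25, "restaurant_dich": 0,
          "Black": 0, "Asian": 0, "Hispanic": 0}

ln_chol = predict_ln_cholesterol(person)
chol = math.exp(ln_chol)       # back-transform from the natural log scale
print(f"predicted ln(cholesterol) = {ln_chol:.3f}")
print(f"predicted cholesterol     = {chol:.2f} mg/dL")
```

Because White is the reference category, all three race/ethnicity dummies are zero for this person.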

Example: Significance

• Can look at significance of individual coefficients if overall model is significant in ANOVA

• From the p-values, we can see that only age is significantly associated with total cholesterol (p=0.001)



Example: Interpretation

• The conclusions can be stated as follows:

– This model explains 3.1% of the variance in total cholesterol, and this is statistically significant at α=0.05. Age is positively associated with total cholesterol such that, adjusting for the other variables in the model, for each additional year in age, the natural log (recall that we logged the outcome to make it more normal) of total cholesterol is predicted to increase by 0.002 units, and this association is statistically significant (p=0.001). None of the other variables in the model were significantly associated with total cholesterol.

Standardized Coefficients

• Different independent variables are often measured in different units (we can standardize them to bring them to the same scale, e.g. z-scores)

• If you are interested in the relative importance of the relationships of the IVs with the DV, look at the standardized coefficients (called Beta in SPSS output) and compare their sizes regardless of direction
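A standardized coefficient can be obtained from an unstandardized one by rescaling with the standard deviations of the predictor and the outcome: Beta = B × SD(X) / SD(Y). The sketch below uses hypothetical data in which Y = 2X exactly, so B = 2; the function name `standardized_beta` is an assumption of this example.

```python
import statistics

def standardized_beta(b, x_values, y_values):
    """Convert an unstandardized B into a standardized Beta."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)

# Hypothetical data with a perfect linear relationship, Y = 2X
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

beta = standardized_beta(2.0, x, y)
print(f"Beta = {beta:.3f}")
```

Because Beta is expressed in standard deviation units, coefficients for predictors measured in different units (years, cm, 0/1 indicators) can be compared by size.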


Multicollinearity in MLR

• Multicollinearity: the independent variables in the model are so strongly associated with each other that they are essentially measuring the same thing

• You would like to see no or very little multicollinearity among the independent variables

• The tolerance of a variable is used as a measure of collinearity. To obtain measures of tolerance, each independent variable is treated as a dependent variable and is regressed on the other independent variables
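The procedure just described can be sketched as follows: regress each predictor on the others, take tolerance = 1 − R2 from that regression, and VIF = 1 / tolerance. This is an illustrative plain-Python version, not SPSS's implementation; the helper names and the data are hypothetical, and x3 is deliberately built as a near copy of x1.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def r_squared_of_fit(X, y):
    """R2 from regressing y on the columns of X (with an intercept)."""
    rows = [[1.0] + list(x) for x in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(XtX, Xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def tolerance_and_vif(columns):
    """Map each predictor name to (tolerance, VIF)."""
    results = {}
    for name, target in columns.items():
        others = [vals for other, vals in columns.items() if other != name]
        r2 = r_squared_of_fit(list(zip(*others)), target)
        tolerance = 1 - r2
        results[name] = (tolerance, 1 / tolerance)
    return results

# Hypothetical predictors: x3 is almost identical to x1
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [5, 3, 8, 1, 7, 2, 6, 4],
    "x3": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0],
}
diagnostics = tolerance_and_vif(data)
for name, (tol, vif) in diagnostics.items():
    print(f"{name}: tolerance = {tol:.4f}, VIF = {vif:.1f}")
```

In this constructed example, x1 and x3 should show very low tolerance (high VIF), while x2 should have a tolerance near 1.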

Checking Assumptions using Residuals

• Normal distribution of the dependent variable and linear relationship with the independent variables

• If the relationships are linear and the dependent variable is normally distributed for each value of the independent variable, then the distribution of the residuals (the residual or error is the difference between the actual and the predicted values in the model) should be approximately normal. This can be assessed by using a histogram of the standardized residuals.
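A minimal sketch of how residuals and standardized residuals might be computed before plotting them in a histogram. It assumes one common definition, each residual divided by the standard error of the estimate, sqrt(SS_residual / (n − k − 1)); software packages may define standardized residuals slightly differently. The data and the function name are hypothetical.

```python
import math

def standardized_residuals(actual, predicted, k):
    """Residuals divided by the standard error of the estimate.

    k is the number of predictors in the model that produced `predicted`.
    """
    residuals = [y - p for y, p in zip(actual, predicted)]
    n = len(residuals)
    std_error = math.sqrt(sum(r ** 2 for r in residuals) / (n - k - 1))
    return [r / std_error for r in residuals]

# Hypothetical observed and predicted values from a one-predictor model
actual = [10, 12, 15, 19, 24, 30]
predicted = [11, 12, 14, 20, 23, 30]

z = standardized_residuals(actual, predicted, k=1)
within_two = sum(abs(v) <= 2 for v in z) / len(z)
print([round(v, 2) for v in z])
print(f"share within ±2: {within_two:.0%}")
```

If the residuals are approximately normal, roughly 95% of the standardized residuals would be expected to fall within ±2.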

Checking Assumptions using Residuals

• Homoscedasticity: for every value of X, the distribution of Y scores must have approximately equal variability.

• To check this assumption, the residuals can be plotted against the predicted values and against the independent variables. When standardized predicted values are plotted against observed values, the data would form a straight line from the lower-left corner to the upper-right corner, if the model fit the data exactly.

Approaches for Selecting Variables

• First, it is common to look at a correlation matrix or SLR models

• There are a number of approaches to choosing which variables to include in a multiple regression model

• Standard approach (simultaneous method)—all the independent variables are entered at once

• Stepwise (forward, backward, or stepwise solution)—the software selects the best model based on a series of steps in which variables are added and removed depending on their association with the outcome

• Hierarchical—the researcher compares two or more models before and after the addition of certain variables of interest and uses preset criteria for selecting the best model

Summary

• Multivariate linear regression can be used when the outcome of interest is of interval or ratio scale and normally distributed

• The independent variables can be on any scale, but dummy variables need to be created for all polytomous (categorical or nominal) independent variables

• Using the regression model, we can estimate the strength and direction of the association from the adjusted betas

• We can also determine the statistical significance of each parameter in the model using the p-value or the 95% confidence interval around the beta

References/Acknowledgement

Kellar, S. P., & Kelvin, E. A. (2013). Munro’s Statistical Methods for Health Care Research (6th ed.). Philadelphia, PA: Wolters Kluwer Health | Lippincott Williams and Wilkins. Chapter 14.

Polit, D. F. (2010). Statistics and data analysis for nursing research (2nd ed.). Upper Saddle River, NJ: Pearson.

Adapted some slides from the publisher’s course instructor resources.

Adapted some slides from a biostatistics course from the school of public health.
