Multiple Linear Regression
Song Ge, BSN, RN, PhD Candidate
Johns Hopkins University School of Nursing
www.nursing.jhu.edu
NR120.508 Biostatistics for Evidence‐based Practice
Learning Objectives
By the end of this module, you will be able to:
1. Articulate assumptions for multiple linear regression
2. Explain the primary components of multiple linear regression
3. Identify and define the variables included in the regression equation
4. Construct a multiple regression equation
5. Calculate a predicted value of a dependent variable using a multiple regression equation
Learning Objectives Cont’d
6. Distinguish between unstandardized (B) and standardized (Beta) regression coefficients
7. Distinguish between different methods for entering predictors into a regression model (simultaneous, hierarchical and stepwise)
8. Identify strategies to assess model fit
9. Interpret and report the results of multiple linear regression analysis
Review of lecture two weeks ago
• Linear regression assumes a linear relationship between independent variable(s) and dependent variable
• Linear regression allows us to predict an outcome based on one or several predictors
• Linear regression allows us to explain the interrelationships among variables
• Linear regression is a parametric test
How to choose X and Y?
• Y can be regressed on X
• X can be regressed on Y
• The regression is not symmetric
• The choice of which regression to perform depends on the scientific question: Is X to be used to explain or predict Y?
• Is Y to be used to explain or predict X? (e.g., Does poor health status explain high pollution level?)
Linear Regression Assumptions
1. Independent variable can be any scale (ratio, nominal, etc.)
2. Dependent variable needs to be ratio/interval scale
3. Dependent variable needs to be normally distributed overall and normally distributed for each value of the independent variable
4. If dependent variable is not normally distributed, we can transform it
Review: Normal distribution
Example of transformed data
(Figure panels: positively skewed; normally distributed)
Method        Math operation    Good for                                             Bad for
Log           ln(x), log10(x)   Right-skewed data                                    Zero values, negative values
Square root   √x                Right-skewed data                                    Negative values
Square        x²                Left-skewed data                                     Negative values
Cube root     x^(1/3)           Right-skewed data, negative values                   Not as effective as log transform
Reciprocal    1/x               Making small values bigger and big values smaller    Zero values, negative values
https://infoactive.co/data‐design/ch11.html
Continued…
5. Samples must be representative of the population
6. There is no multicollinearity, i.e., the independent variables are not so strongly intercorrelated that they are indistinguishable from each other
If VIF lies between 1 and 10, multicollinearity is not a serious concern. If VIF > 10, there is multicollinearity. (VIF cannot be less than 1.)
Continued…
7. The relationship between X and Y must be linear. When the two scores are graphed, they should tend to form a straight line. If the relationship is not linear, other methods must be used.
8. For every value of X, the distribution of Y scores must have approximately equal variability (homoscedasticity)
Multiple Linear Regression
• Recall the student scores example from the previous module
• What will you do if you are interested in studying the relationship of final grade with midterm (or screening) score and other variables such as previous (undergraduate) GPA, GRE score and motivation?
• A simple linear regression (SLR) cannot handle this
• A separate SLR with each explanatory (independent) variable will provide information in isolation
• You will need to use a multiple linear regression (MLR) method to study them together
Multiple Linear Regression
• A multiple linear regression model shows the relationship between the dependent variable and multiple (two or more) independent variables
• The overall variance explained by the model (R²) as well as the unique contribution (strength and direction) of each independent variable can be obtained
• In MLR, the shape is not really a line. If there are three variables, the shape is a plane, and if there are four or more variables, it is impossible to visualize or graph. However, by convention, we still refer to the regression equation as a regression 'line'.
MLR with Two Predictors
http://www.aetheling.com/models/cusp/Intro.htm
Multiple Linear Regression Equation
• MLR is sometimes also called multivariate linear regression
• The prediction equation is Y′ = a + b₁X₁ + b₂X₂ + b₃X₃ + ⋯ + bₖXₖ
• There is still one intercept constant, a, but each independent variable (e.g., X₁, X₂, X₃) has its own regression coefficient
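The slides do not show how software arrives at a and the b coefficients. One standard approach is ordinary least squares; the sketch below solves the normal equations in plain Python. The helper names (`solve`, `fit_mlr`) and the data are hypothetical, chosen only so the answer is known in advance.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_mlr(X, y):
    """Least-squares estimates [a, b1, ..., bk] from the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # the leading 1 carries the intercept a
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)


# Hypothetical, noise-free data generated from Y = 2 + 3*X1 - 1*X2
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]

a, b1, b2 = fit_mlr(X, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

Because the data contain no noise, the estimates should recover the generating values (intercept 2, coefficients 3 and -1) up to floating-point error. With real data the coefficients are estimates with standard errors, which is what statistical packages such as SPSS report.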
Review: Simple linear regression
• Y′ is a linear function of X
• Y′ = a + bX
• a = intercept
• b = slope
Interpretation of MLR Coefficients
Group exercise: interpret B0, B1 and B2
• Data are from children aged 1 to 5 years in the …
• Variables
– Y is the child’s arm circumference (cm)
– X1 is the age of the child (months)
– X2 is the height of the child (cm)
• Does arm circumference increase with increasing child age after controlling for child height?
• Multiple linear regression model
• Y = B0 + B1 X1 + B2 X2
Answers
• B0 = the estimated mean arm circumference when the values of age and height are zero
• B1 = the change in the estimated mean arm circumference associated with each 1-month increase in age if height is unchanged
• B2 = You do!
Multiple Linear Regression Models
• We can get six critical pieces of information from an MLR:
– The overall significance of the model
– The variance in the dependent variable that comes from the set of independent variables in the model
– The statistical significance of each individual independent variable (controlling for the others)
– The direct effect (and direction of the effect) of each independent variable on the dependent variable
– The relative strength of the independent variables
– The regression equation, which allows us to predict values of the dependent variable given values of the independent variables
The overall piece: R² (coefficient of determination)
• R² provides the proportion of variability explained by using X
• R² measures the ability to predict an individual Y using its X(s)
• Statistical significance of the overall model (Model F-test)
• Recall that R is the population correlation coefficient
– Takes on values between -1 and +1
– 0 indicates no linear association; 1 indicates a perfect positive linear relationship; -1 indicates a perfect negative linear relationship
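As a rough illustration of "proportion of variability explained", R² can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The observed and predicted values below are made up, and the function name `r_squared` is hypothetical.

```python
def r_squared(y, y_pred):
    """Proportion of the variability in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)               # variability around the mean
    ss_resid = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # variability left unexplained
    return 1 - ss_resid / ss_total


# Hypothetical observed outcomes and predictions from some fitted model
y = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [11.0, 12.0, 13.0, 17.0, 17.0]

r2 = r_squared(y, y_pred)
print(round(r2, 3))
```

Here the total sum of squares is 40 and the residual sum of squares is 4, so R² is 0.9: the predictions account for 90% of the variability in these made-up outcomes.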
R: population correlation coefficient
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
SPSS output for R square
The individual piece: Correlation coefficient
F-test of a regression coefficient: tests whether the independent variable associated with it contributes significantly to the variance accounted for in the dependent variable
Group exercise
• Propose a research question that can be answered by MLR
• State the assumptions under which we use this statistical method
• State the formula and what B0, B1 and B2 stand for
• Break
• https://www.youtube.com/watch?v=LVw9YdP1O-0
Example
• We are interested in knowing if going to restaurants frequently (five or more times/week) can lead to higher cholesterol. We also know that age, gender, and race/ethnicity can affect cholesterol. How can we tell if going out to restaurants frequently, this factor alone, will affect cholesterol levels?
• Do age, gender, ethnicity, and going out to eat frequently all affect cholesterol levels?
– Dependent variable: cholesterol level
– Independent variables: age (years), gender (male/female), race/ethnicity (Black, White, Asian, or Hispanic), frequency of going out to eat (5+ times/week vs. less than 5 times/week)
Linear Regression Assumptions
• Linear regression is a parametric method and requires that certain assumptions be met to be valid.
1. The sample must be representative of the population
2. The dependent variable must be of ratio/interval scale and normally distributed overall and normally distributed for each value of the independent variables
3. For every value of X, the distribution of Y scores must have approximately equal variability (homoscedasticity)
4. The relationship between X and Y must be linear
5. The independent variables are not very strongly inter-correlated (no multicollinearity)
Creating Dummy Variables
• Using dummy variables is a way to express a nominal independent variable with multiple categories by a series of dichotomous (binary) variables that compare one category to a different category that serves as the reference
• The number of dummy variables created will be one less than the number of categories of the variable
• One of the categories is chosen to serve as the “reference” category
• You then include all the dummy variables in the regression model instead of the original categorical variable
Creating Dummy Variables: Example
• Let's say we have a race/ethnicity variable with four categories (non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, and Hispanic)
• If we want to use it in a multiple regression, we would need to create three variables (4-1) to represent the four categories
• We would put these variables into the multiple regression equation instead of the four category race/ethnicity variable
Example Cont’d
• We would therefore create 3 (4−1) dummy variables and choose one category as the reference, in this case, non-Hispanic White
– Non-Hispanic Black (1=yes, 0=no)
– Non-Hispanic Asian (1=yes, 0=no)
– Hispanic (1=yes, 0=no)
– Say these are called Dummy1, Dummy2 and Dummy3
Race/Ethnicity Dummy1 Dummy2 Dummy3
Non‐Hispanic Black 1 0 0
Non‐Hispanic Asian 0 1 0
Hispanic 0 0 1
Non‐Hispanic White 0 0 0
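The coding in the table can be sketched in a few lines of Python. The function name `dummy_code` and the constants are hypothetical; the category labels and the choice of non-Hispanic White as the reference follow the slide.

```python
REFERENCE = "Non-Hispanic White"
CATEGORIES = ["Non-Hispanic Black", "Non-Hispanic Asian", "Hispanic"]  # Dummy1..Dummy3


def dummy_code(race_ethnicity):
    """Return (Dummy1, Dummy2, Dummy3); the reference category maps to all zeros."""
    if race_ethnicity != REFERENCE and race_ethnicity not in CATEGORIES:
        raise ValueError(f"unknown category: {race_ethnicity!r}")
    return tuple(int(race_ethnicity == c) for c in CATEGORIES)


for group in CATEGORIES + [REFERENCE]:
    print(group, dummy_code(group))
```

Four categories yield three dummy variables, and a row of all zeros identifies the reference group, so no fourth dummy is needed.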
Information from MLR
• Overall variance explained by the model (e.g., do the independent variables in the model, taken together, do a good job at predicting the dependent variable?) using the adjusted R²
• Statistical significance of the overall model (Model F-test of R²)
• The strength, direction, and statistical significance of each independent variable (regression coefficients)
• Regression equation as a whole can be used to predict values of the dependent variable for a given set of values of the independent variables
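The slides do not give the formula for the adjusted R². The standard adjustment penalizes R² for the number of predictors k relative to the sample size n. The sketch below uses hypothetical numbers, and the function name `adjusted_r_squared` is an assumption.

```python
def adjusted_r_squared(r2, n, k):
    """Adjust R^2 for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: R^2 = 0.30 from n = 50 participants and k = 4 predictors
adj = adjusted_r_squared(0.30, 50, 4)
print(round(adj, 4))
```

The adjusted value is never larger than R² itself, and the gap widens as more predictors are added to a small sample.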
MLR: Analysis Example
• We will use data on 489 NYCHANES study participants to look at a number of potential predictors of total cholesterol (mg/dL)
• The dependent variable is total cholesterol (mg/dL)
• We can see that total cholesterol is somewhat right-skewed
MLR: Analysis Example Cont’d
• To correct for this departure from normality, an adjustment called a “transformation” of the variable can be made
• In this case, we take the natural log of cholesterol. This makes the dependent variable approximately normally distributed
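To see what the log transformation does to a right-skewed variable, the sketch below compares a simple moment-based skewness before and after taking the natural log. The cholesterol values are invented for illustration and are not the study data; `skewness` is a hypothetical helper.

```python
import math


def skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed cholesterol values (mg/dL), not the study data
chol = [150, 160, 165, 170, 175, 180, 185, 190, 200, 215, 240, 290, 360]
ln_chol = [math.log(c) for c in chol]

print(round(skewness(chol), 2), round(skewness(ln_chol), 2))
```

For these values the skewness drops after the transformation but stays above zero, which is why it is worth re-checking the histogram of the transformed variable rather than assuming it is normal.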
MLR: Analysis Example Cont’d
• We will use multivariate linear regression to look at a number of independent variables
– Gender (female=1 vs. male=0)
– Age (continuous)
– Frequency of eating in restaurants (frequent=1 vs. infrequent=0)
– Race/ethnicity (Black, White, Asian, or Hispanic)
• Note that the race/ethnicity variable has four categories. In order to look at this variable in a regression model, we will have to create dummy variables.
MLR: Analysis Example Cont’d
• We will create 3 (4−1) dummy variables and use the category “White” as the reference. The variable coding will be
– Black (1 = person is non-Hispanic Black; 0 = person is any other race/ethnicity)
– Asian (1 = person is non-Hispanic Asian; 0 = person is any other race/ethnicity)
– Hispanic (1 = person is Hispanic; 0 = person is not Hispanic)
MLR: Analysis Example Cont’d
• We are testing a number of hypotheses, one null and one alternate hypothesis for each independent variable in the model. For example, one hypothesis we are testing is
– H0: There is no association between frequency of eating out and total cholesterol, adjusting for gender, age, and race/ethnicity (adjusted beta=0)
– Ha: There is an association between frequency of eating out and total cholesterol, adjusting for gender, age, and race/ethnicity (adjusted beta≠0)
Analysis Example: Model Summary
• Adjusted R²=0.031
• The four independent variables explain 3.1% of the variance in the dependent variable.
Model Summary
Model   R        R²      Adjusted R²   Std. Error of the Estimate
1       0.211a   0.044   0.031         0.19918
a. Predictors: (Constant), Hispanic, restaurant_dich, participant gender, age in years, Asian, Black
Analysis Example: ANOVA
• The p-value for the overall model is 0.004. The amount of variance explained by the model (independent variables) is statistically significant
Analysis Example: Coefficients
• Unstandardized coefficients (B): gender (−0.015), age (0.002), eating in restaurants (0.008), Black (−0.053), Asian (0.006), and Hispanic (−0.040); the regression constant is 5.189
Coefficients (a)
                       Unstandardized          Standardized                        95% Confidence Interval for B
Model                  B         Std. Error    Beta           t         Sig.     Lower Bound   Upper Bound
1 (Constant)           5.189     0.040                        129.516   0.000    5.110         5.268
  participant gender   −0.015    0.021         −0.038         −0.750    0.453    −0.056        0.025
  age in years         0.002     0.001         0.163          3.252     0.001    0.001         0.004
  restaurant_dich      0.008     0.025         0.016          0.316     0.752    −0.041        0.056
  Black                −0.053    0.030         −0.100         −1.767    0.078    −0.111        0.006
  Asian                0.006     0.031         0.011          0.197     0.844    −0.056        0.068
  Hispanic             −0.040    0.025         −0.096         −1.607    0.109    −0.088        0.009
a. Dependent variable: LNCholesterol
Example: Estimated Equation
• We can construct the regression equation and use it to make predictions:
• Predicted ln(cholesterol) = 5.189 − 0.015 (gender) + 0.002 (age) − 0.053 (Black) + 0.006 (Asian) − 0.040 (Hispanic) + 0.008 (restaurant_dich)
Example: Prediction
• From this model, we can predict the total cholesterol of a 25-year-old White woman who does not eat out often
• Predicted ln(cholesterol) = 5.189 − 0.015(1) + 0.002(25) − 0.053(0) + 0.006(0) − 0.040(0) + 0.008(0) = 5.224
• Cholesterol values were log-transformed, so we need to back-transform (exponentiate, in this case). Don’t forget this step!
• e^5.224 ≈ 185.7 mg/dL
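The prediction and the back-transformation can be checked with a short script. The coefficients are the rounded B values from the Coefficients table; the dictionary keys and the function name `predict_ln_cholesterol` are hypothetical. With these rounded coefficients, the equation evaluates to roughly 5.224 on the log scale, or about 185.7 mg/dL after exponentiating.

```python
import math

# Unstandardized coefficients (B) copied from the Coefficients table
INTERCEPT = 5.189
COEF = {
    "gender": -0.015,          # female = 1, male = 0
    "age": 0.002,              # years
    "black": -0.053,           # dummy: 1 = non-Hispanic Black
    "asian": 0.006,            # dummy: 1 = non-Hispanic Asian
    "hispanic": -0.040,        # dummy: 1 = Hispanic
    "restaurant_dich": 0.008,  # frequent = 1, infrequent = 0
}


def predict_ln_cholesterol(person):
    """Evaluate the regression equation; the result is on the natural-log scale."""
    return INTERCEPT + sum(COEF[name] * person.get(name, 0) for name in COEF)


# 25-year-old White woman who does not eat out often (all other terms are 0)
person = {"gender": 1, "age": 25}
ln_pred = predict_ln_cholesterol(person)
pred_mg_dl = math.exp(ln_pred)  # back-transform from ln(mg/dL) to mg/dL

print(round(ln_pred, 3), round(pred_mg_dl, 1))
```

White is the reference category, so all three race/ethnicity dummies are zero and drop out of the sum.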
Example: Significance
• We can look at the significance of individual coefficients if the overall model is significant in the ANOVA
• From the p-values, we can see that only age is significantly associated with total cholesterol (p=0.001)
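The same reading of the table can be written as a small check: a predictor is flagged when its p-value is below 0.05, in which case its 95% confidence interval for B also excludes zero. The values are copied from the Coefficients table; the names `RESULTS` and `is_significant` are hypothetical.

```python
# (p-value, CI lower bound, CI upper bound) copied from the Coefficients table
RESULTS = {
    "participant gender": (0.453, -0.056, 0.025),
    "age in years": (0.001, 0.001, 0.004),
    "restaurant_dich": (0.752, -0.041, 0.056),
    "Black": (0.078, -0.111, 0.006),
    "Asian": (0.844, -0.056, 0.068),
    "Hispanic": (0.109, -0.088, 0.009),
}
ALPHA = 0.05


def is_significant(p, lower, upper, alpha=ALPHA):
    """Flag a predictor when p < alpha and the 95% CI for B excludes zero."""
    ci_excludes_zero = lower > 0 or upper < 0
    return p < alpha and ci_excludes_zero


significant = [name for name, vals in RESULTS.items() if is_significant(*vals)]
print(significant)
```

Only "age in years" is flagged. For every other predictor the interval spans zero, which matches their p-values above 0.05.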
Example: Interpretation
• The conclusions can be stated as follows:
– This model explains 3.1% of the variance in total cholesterol, and this is statistically significant at α=0.05. Age is positively associated with total cholesterol such that, adjusting for the other variables in the model, for each additional year in age, the natural log (recall that we logged the outcome to make it more normal) of total cholesterol is predicted to increase by 0.002 units, and this association is statistically significant (p=0.001). None of the other variables in the model were significantly associated with total cholesterol.
Standardized Coefficients
• Different independent variables are often measured in different units (we can standardize them to bring them to the same scale, e.g., z-scores)
• If you are interested in the relative importance of the relationships of the IVs with the DV, look at the standardized coefficients (called Beta in SPSS output) and compare their sizes regardless of direction
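One common way to obtain a standardized coefficient from an unstandardized one is Beta = B × SD(X) / SD(Y). The sketch below applies that conversion to invented data with invented B values; it does not reproduce the SPSS output above, and `standardized_beta` is a hypothetical helper.

```python
import statistics


def standardized_beta(b, x_values, y_values):
    """Convert an unstandardized B to a standardized Beta: B * SD(X) / SD(Y)."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)


# Hypothetical data: two predictors measured on very different scales
age_years = [20, 30, 40, 50, 60]
income_usd = [30000, 42000, 38000, 61000, 55000]
outcome = [5.0, 5.3, 5.2, 5.8, 5.6]

# Hypothetical unstandardized coefficients (B) from some fitted model
b_age, b_income = 0.01, 0.00001

beta_age = standardized_beta(b_age, age_years, outcome)
beta_income = standardized_beta(b_income, income_usd, outcome)
print(round(beta_age, 3), round(beta_income, 3))
```

The B for age is 1,000 times the B for income, yet the two Betas come out at a similar size (about 0.50 and 0.40), which is why Beta rather than B is used to compare relative strength.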
Multicollinearity in MLR
• Multicollinearity: the independent variables in the model are so strongly associated with each other that they are essentially measuring the same thing
• You would like to see little or no multicollinearity among the independent variables
• The tolerance of a variable is used as a measure of collinearity. To obtain measures of tolerance, each independent variable is treated as a dependent variable and is regressed on the other independent variables
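As a sketch of that procedure: tolerance is 1 − R² from regressing one independent variable on the others, and VIF is its reciprocal. With only two predictors, that R² is simply the squared Pearson correlation between them, which is the special case shown here. The data and function names are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def tolerance_and_vif(x1, x2):
    """Two-predictor case: regressing X1 on X2 gives R^2 = r^2."""
    r2 = pearson_r(x1, x2) ** 2
    tolerance = 1 - r2
    return tolerance, 1 / tolerance


# Hypothetical child data: age (months) and height (cm) rise together
age_months = [12, 18, 24, 30, 36, 48, 60]
height_cm = [75, 80, 88, 90, 97, 101, 110]

tol, vif = tolerance_and_vif(age_months, height_cm)
print(round(tol, 3), round(vif, 1))
```

In this made-up sample age and height are almost perfectly correlated, so the tolerance is close to 0 and the VIF is well above 10. With three or more predictors, each R² would come from a full regression of one predictor on all the others.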
Checking Assumptions using Residuals
• Normal distribution of the dependent variable and linear relationship with the independent variables
• If the relationships are linear and the dependent variable is normally distributed for each value of the independent variable, then the distribution of the residuals (the residual or error is the difference between the actual and the predicted values in the model) should be approximately normal. This can be assessed by using a histogram of the standardized residuals.
Checking Assumptions using Residuals
• Homoscedasticity: for every value of X, the distribution of Y scores must have approximately equal variability.
• To check this assumption, the residuals can be plotted against the predicted values and against the independent variables. When standardized predicted values are plotted against observed values, the data would form a straight line from the lower-left corner to the upper-right corner, if the model fit the data exactly.
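A minimal sketch of the residual checks, using invented observed and predicted values. Residuals are standardized here as simple z-scores, which is an assumption; statistical packages may scale them differently. The split into low and high predicted values is only a crude numeric stand-in for looking at the residual plot.

```python
import statistics


def residuals(y, y_pred):
    """Residual = actual value minus predicted value."""
    return [obs - pred for obs, pred in zip(y, y_pred)]


def standardize(values):
    """Rescale to mean 0 and SD 1 (simple z-scores)."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical observed and predicted ln(cholesterol) values
y = [5.10, 5.25, 5.30, 5.18, 5.42, 5.05, 5.33, 5.21]
y_pred = [5.15, 5.22, 5.26, 5.20, 5.35, 5.12, 5.30, 5.24]

z = standardize(residuals(y, y_pred))

# Crude homoscedasticity check: residual spread for low vs. high predictions
pairs = sorted(zip(y_pred, z))
half = len(pairs) // 2
sd_low = statistics.stdev([r for _, r in pairs[:half]])
sd_high = statistics.stdev([r for _, r in pairs[half:]])
print(round(sd_low, 2), round(sd_high, 2))
```

Similar spreads in the two halves are consistent with homoscedasticity; a much larger spread in one half would suggest that the variability of Y changes with the predicted value. A histogram of `z` serves the normality check described on the previous slide.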
Approaches for Selecting Variables
• First, it is common to look at a correlation matrix or SLR models
• There are a number of approaches to choosing which variables to include in a multiple regression model
• Standard approach (simultaneous method)—all the independent variables are entered at once
• Stepwise (forward, backward, or stepwise solution)—the software selects the best model based on a series of steps in which variables are added and removed depending on their association with the outcome
• Hierarchical—the researcher compares two or more models before and after the addition of certain variables of interest and uses preset criteria for selecting the best model
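For the hierarchical approach, one commonly used preset criterion is whether R² increases significantly when the variables of interest are added. The slides do not specify a criterion, so the R²-change F statistic below is an assumption, and all numbers are hypothetical.

```python
def r2_change_f(r2_reduced, r2_full, n, k_reduced, k_full):
    """F statistic for the increase in R^2 when predictors are added to a nested model."""
    df_num = k_full - k_reduced  # number of predictors added
    df_den = n - k_full - 1      # residual degrees of freedom of the full model
    return ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)


# Hypothetical: step 1 has 2 predictors (R^2 = 0.20); step 2 adds 2 more (R^2 = 0.30); n = 100
f_change = r2_change_f(0.20, 0.30, n=100, k_reduced=2, k_full=4)
print(round(f_change, 2))
```

The resulting F would be compared against the F distribution with (2, 95) degrees of freedom. A significant result suggests the added variables explain variance beyond the first block.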
Summary
• Multivariate linear regression can be used when the outcome of interest is of interval or ratio scale and normally distributed
• The independent variables can be on any scale, but dummy variables need to be created for all polytomous (categorical or nominal) independent variables
• Using the regression model, we can estimate the strength and direction of the association from the adjusted betas
• We can also determine the statistical significance of each parameter in the model using the p-value or the 95% confidence interval around the beta
References/Acknowledgement
Kellar, S. P., & Kelvin, E. A. (2013). Munro’s Statistical Methods for Health Care Research (6th ed.). Philadelphia, PA: Wolters Kluwer Health | Lippincott Williams and Wilkins. Chapter 14
Polit, D. F. (2010). Statistics and data analysis for nursing research (2nd ed.). Upper Saddle River, NJ: Pearson.
Adapted some slides from the publisher’s course instructor resources
Adapted some slides from a biostatistics course from the school of public health