experimental statistics - week 12
1
Experimental Statistics - Week 12
Chapter 12: Multiple Regression
Chapter 13: Variable Selection and Model Checking
2
Data – Page 628
Weight loss in a chemical compound as a function of exposure time and humidity
Y = weight loss (wtloss)
X1 = exposure time (exptime)
X2 = relative humidity (humidity)

   Y    X1    X2
  4.3    4   0.2
  5.5    5   0.2
  6.8    6   0.2
  8.0    7   0.2
  4.0    4   0.3
  5.2    5   0.3
  6.6    6   0.3
  7.5    7   0.3
  2.0    4   0.4
  4.0    5   0.4
  5.7    6   0.4
  6.5    7   0.4
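To run the regression on these data, the twelve observations first have to be read into a SAS data set. A minimal sketch: the data set name chemloss is an assumption (the slides do not name it), while the variable names wtloss, exptime, and humidity are the ones used in the slides.

```sas
/* Data set name chemloss is hypothetical; variable names follow the slides. */
DATA chemloss;
  INPUT wtloss exptime humidity;
  DATALINES;
4.3 4 0.2
5.5 5 0.2
6.8 6 0.2
8.0 7 0.2
4.0 4 0.3
5.2 5 0.3
6.6 6 0.3
7.5 7 0.3
2.0 4 0.4
4.0 5 0.4
5.7 6 0.4
6.5 7 0.4
;
RUN;

/* List the data to confirm that all 12 rows were read. */
PROC PRINT DATA=chemloss;
RUN;
```

A PROC REG step that follows without a DATA= option uses the most recently created data set, which here would be chemloss.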
3
Chemical Weight Loss – MLR Output

The REG Procedure
Dependent Variable: wtloss

Number of Observations Read    12
Number of Observations Used    12

Analysis of Variance
                           Sum of        Mean
Source             DF     Squares      Square    F Value    Pr > F
Model               2    31.12417    15.56208     104.13    <.0001
Error               9     1.34500     0.14944
Corrected Total    11    32.46917

Root MSE          0.38658    R-Square    0.9586
Dependent Mean    5.50833    Adj R-Sq    0.9494
Coeff Var         7.01810

Parameter Estimates
                 Parameter    Standard
Variable    DF    Estimate       Error    t Value    Pr > |t|
Intercept    1     0.66667     0.69423       0.96      0.3620
exptime      1     1.31667     0.09981      13.19      <.0001
humidity     1    -8.00000     1.36677      -5.85      0.0002
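The summary statistics in this output follow from the sums of squares. As a check, using the standard definitions (which the slides do not state explicitly) and the printed values:

```latex
\begin{align*}
F &= \frac{SS_{Model}/df_{Model}}{SS_{Error}/df_{Error}}
   = \frac{31.12417/2}{1.34500/9} \approx 104.13 \\
R^2 &= \frac{SS_{Model}}{SS_{Total}} = \frac{31.12417}{32.46917} \approx 0.9586 \\
\text{Root MSE} &= \sqrt{MS_{Error}} = \sqrt{0.14944} \approx 0.3866 \\
\text{Coeff Var} &= 100 \times \frac{\text{Root MSE}}{\bar{y}}
   = 100 \times \frac{0.38658}{5.50833} \approx 7.018
\end{align*}
```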
4
Examining Contributions of Individual X variables
Use t-test for the X variable in question.
- this tests the effect of that particular independent variable while all other independent variables stay constant.
Parameter Estimates
                 Parameter    Standard
Variable    DF    Estimate       Error    t Value    Pr > |t|
Intercept    1     0.66667     0.69423       0.96      0.3620
exptime      1     1.31667     0.09981      13.19      <.0001
humidity     1    -8.00000     1.36677      -5.85      0.0002

$$\hat{y} = 0.667 + 1.317\,x_1 - 8.0\,x_2, \quad \text{where } x_1 = \text{exptime},\ x_2 = \text{humidity}$$
Note: In this equation, weight loss is positively related to exposure time and negatively to humidity.
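Each t value is the parameter estimate divided by its standard error. The fitted coefficients can also be used to predict weight loss for a given setting. The check below uses the printed estimates; the setting (exptime = 5, humidity = 0.3) is picked for illustration because it is one of the observed rows, with observed weight loss 5.2.

```latex
\begin{align*}
t_{\text{exptime}} &= \frac{1.31667}{0.09981} \approx 13.19, &
t_{\text{humidity}} &= \frac{-8.00000}{1.36677} \approx -5.85 \\
\hat{y}(5,\,0.3) &= 0.667 + 1.317(5) - 8.0(0.3) \approx 4.85, &
\text{residual} &= 5.2 - 4.85 \approx 0.35
\end{align*}
```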
5
Residual Analysis in Multiple Regression
Examination of residuals to help determine if:
- assumptions are met
- regression model is appropriate
Residual Plots:
- each independent variable in the final model vs residuals
- predicted Y vs residuals
- run order vs residuals
6
PROC REG;
  MODEL wtloss = exptime humidity;
  OUTPUT OUT=new R=resid2 P=predict2;
RUN;

PROC GPLOT;
  TITLE 'Plot of Residuals - MLR Model';
  PLOT resid2*exptime;
  PLOT resid2*humidity;
  PLOT resid2*predict2;
RUN;
7
Infant Length Data
(Probability and Statistics for Engineers and Scientists – Walpole, Myers, Myers, and Ye, page 433)
Data Set: 9 infants (2-3 months of age)
Dependent Variable (Y): current infant length (cm)
Independent Variables:
  X1 = age (in days)
  X2 = length at birth (cm)
  X3 = weight at birth (kg)
  X4 = chest size at birth (cm)
Goal: Obtain an estimating equation relating length of an infant to all or a subset of these independent variables.
DATA infant;
  INPUT id y x1 x2 x3 x4;
  DATALINES;
1 57.5 78 48.2 2.75 29.5
2 52.8 69 45.5 2.15 26.3
3 61.3 77 46.3 4.41 32.2
4 67.0 88 49.0 5.52 36.5
5 53.5 67 43.0 3.21 27.2
6 62.7 80 48.0 4.32 27.7
7 56.2 74 48.0 2.31 28.3
8 68.5 94 53.0 4.30 30.3
9 69.2 102 58.0 3.71 28.7
;

PROC CORR;
  VAR y x1 x2 x3 x4;
RUN;

PROC REG;
  MODEL y = x1 x2 x3 x4;
  OUTPUT OUT=new R=resid;
RUN;
8
SAS PROC CORR Output

Pearson Correlation Coefficients, N = 9
Prob > |r| under H0: Rho=0

               y        x1        x2        x3        x4
y        1.00000   0.94709   0.81867   0.76114   0.56033
                    0.0001    0.0070    0.0172    0.1166
x1       0.94709   1.00000   0.95227   0.53402   0.38999
          0.0001              <.0001    0.1386    0.2995
x2       0.81867   0.95227   1.00000   0.26267   0.15491
          0.0070    <.0001              0.4947    0.6907
x3       0.76114   0.53402   0.26267   1.00000   0.78447
          0.0172    0.1386    0.4947              0.0123
x4       0.56033   0.38999   0.15491   0.78447   1.00000
          0.1166    0.2995    0.6907    0.0123
Note: x1, x2, and x3 are each significantly correlated with y (p < 0.05), while x4 is not (p = 0.1166). Recall that this means the simple linear regression of y on any one of x1, x2, or x3 will be significant.
Standard SAS PROC REG Output for all 4 Independent Variables X1, X2, X3, and X4

Dependent Variable: Y

Analysis of Variance
                      Sum of        Mean
Source      DF       Squares      Square    F Value    Prob>F
Model        4     318.27442    79.56860    107.323    0.0003
Error        4       2.96558     0.74140
C Total      8     321.24000

Root MSE     0.86104    R-square    0.9908
Dep Mean    60.96667    Adj R-sq    0.9815
C.V.         1.41232

Parameter Estimates
                 Parameter       Standard    T for H0:
Variable    DF    Estimate          Error    Parameter=0    Prob > |T|
INTERCEP     1    7.147532    16.45961128          0.434        0.6865
X1           1    0.100094     0.33970898          0.295        0.7829
X2           1    0.726417     0.78590156          0.924        0.4076
X3           1    3.075837     1.05917874          2.904        0.0439
X4           1   -0.030042     0.16646232         -0.180        0.8656
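Note: Even though the overall p-value is small (.0003), the t-tests give a confusing picture of the contribution of the individual X variables: only X3 is significant at the 0.05 level, although X1 and X2 are each strongly correlated with y. This is probably due to collinearity (for example, the correlation between x1 and x2 is 0.952).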
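One way to probe the suspected collinearity is to request collinearity diagnostics from PROC REG. This is a sketch that goes slightly beyond the slides: it assumes the infant data set created earlier is still available, and it uses the VIF, TOL, and COLLIN options of the MODEL statement, which print variance inflation factors, tolerances, and condition indices.

```sas
/* Collinearity diagnostics for the four-variable infant model.
   These diagnostics are an addition; the original slides do not show them. */
PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / VIF TOL COLLIN;
RUN;
QUIT;
```

Large variance inflation factors for x1 and x2 would be consistent with their correlation of 0.952; a common rule of thumb treats values above about 10 as a warning sign.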
9
Setting:
We have a dependent variable Y and several candidate independent variables.
Question: Should we use all of them?
10
Why do we run Multiple Regression?
1. Obtain estimates of individual coefficients in a model (+ or -, etc.)
2. Screen variables to determine which have a significant effect on the response
3. Arrive at the most effective (and efficient) prediction model
11
The problem:
Collinearity among the independent variables
-- high correlation between 2 independent variables
-- one independent variable nearly a linear combination of other independent variables
-- etc.
Example: x1 = total income
x2 = bonus
x3 = monthly income
Note: x1 = 12x3 + x2 exactly -- singularity -- so SAS cannot estimate separate coefficients for all 3 variables
12
Effects of Collinearity
• parameter estimates are highly variable and unreliable
- parameter estimates may even have the opposite sign from what is reasonable
• may have significant F but none of the t-tests are significant
Variable Selection Techniques
Techniques for “being careful” about which variables are put into the model
13
Variable Selection Procedures
• Forward selection
• Backward Elimination
• Stepwise
• Best subset
14
Forward Selection:
Step 1: Choose the Xj that has the highest R² (i.e. has the highest absolute correlation with Y) -- call it X1
Step 2: Choose another Xj to go along with X1 by finding the one that maximizes R²
Note: This new R² will be at least as large as the one in Step 1.
Problem: Has the new variable increased R² enough to be “useful”?
Solution: Examine the significance level (p) of the new variable -- keep the variable if p < SLENTRY (I used SLENTRY = .15 in the example)
The procedure continues until no new variables satisfy the entry criterion.
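For the infant data, Step 1 can be checked by hand: in a simple linear regression, R² is the square of the correlation between Y and the candidate variable. Using the correlations from the PROC CORR output:

```latex
\begin{align*}
x_1:&\quad r^2 = 0.94709^2 \approx 0.8970 \\
x_2:&\quad r^2 = 0.81867^2 \approx 0.6702 \\
x_3:&\quad r^2 = 0.76114^2 \approx 0.5793 \\
x_4:&\quad r^2 = 0.56033^2 \approx 0.3140
\end{align*}
```

x1 has the largest value, so it is the first variable to enter, with R² of about 0.897.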
15
FORWARD SELECTION RESULTS FROM SAS

Stepwise Procedure for Dependent Variable Y

Step 1   Variable X1 Entered   R-square = 0.89698302   C(p) = 39.63635101

               DF    Sum of Squares     Mean Square         F    Prob>F
Regression      1      288.14682495    288.14682495     60.95    0.0001
Error           7       33.09317505      4.72759644
Total           8      321.24000000

              Parameter       Standard           Type II
Variable       Estimate          Error    Sum of Squares         F    Prob>F
INTERCEP    19.01108007     5.42271930       58.10583282     12.29    0.0099
X1           0.51797020     0.06634651      288.14682495     60.95    0.0001

Bounds on condition number: 1, 1

Note: These F values are the squares of the usual t-values in SAS
--------------------------------------------------------------------------------
Step 2   Variable X3 Entered   R-square = 0.98821914   C(p) = 2.10454082

               DF    Sum of Squares     Mean Square         F    Prob>F
Regression      2      317.45551809    158.72775905    251.65    0.0001
Error           6        3.78448191      0.63074698
Total           8      321.24000000

              Parameter       Standard           Type II
Variable       Estimate          Error    Sum of Squares         F    Prob>F
INTERCEP    20.10845029     1.98725776       64.58088391    102.39    0.0001
X1           0.41362967     0.02866328      131.34899803    208.24    0.0001
X3           2.02533400     0.29711598       29.30869314     46.47    0.0005

Bounds on condition number: 1.398946, 5.595783
--------------------------------------------------------------------------------
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Procedure for Dependent Variable Y

        Variable            Number   Partial     Model
Step    Entered   Removed       In      R**2      R**2       C(p)          F    Prob>F
   1    X1                       1    0.8970    0.8970    39.6364    60.9500    0.0001
   2    X3                       2    0.0912    0.9882     2.1045    46.4666    0.0005
This is the end of the SAS FORWARD SELECTION output.
The final regression equation is:
$$\hat{y} = 20.108 + 0.414\,x_1 + 2.025\,x_3$$
We can see from the model that an increase in age or in the weight at birth predicts longer current length.
NOTICE: SAS picked 2 independent variables and then stopped.
PROC REG;
  MODEL y = x1 x2 x3 x4 / SELECTION=FORWARD SLENTRY=.15;
RUN;
The next pages show SAS output from standard PROC REG. Each set of output on the following pages is from a separate running of PROC REG.
16
Standard SAS PROC REG Printout for 3 Features - to show why STEPWISE Procedure stopped with 2 features
X1, X3, and X4

Dependent Variable: Y

Analysis of Variance
                      Sum of         Mean
Source      DF       Squares       Square    F Value    Prob>F
Model        3     317.64101    105.88034    147.097    0.0001
Error        5       3.59899      0.71980
C Total      8     321.24000

Root MSE     0.84841    R-square    0.9888
Dep Mean    60.96667    Adj R-sq    0.9821
C.V.         1.39160

Parameter Estimates
                  Parameter      Standard    T for H0:
Variable    DF     Estimate         Error    Parameter=0    Prob > |T|
INTERCEP     1    21.873528    4.07388552          5.369        0.0030
X1           1     0.412771    0.03066663         13.460        0.0001
X3           1     2.202668    0.47198905          4.667        0.0055
X4           1    -0.078945    0.15551477         -0.508        0.6333
Note: The p-value for X4 (0.6333) is too large; it is well above SLENTRY = .15.
X1, X3, and X2

Dependent Variable: Y

Analysis of Variance
                      Sum of         Mean
Source      DF       Squares       Square    F Value    Prob>F
Model        3     318.25027    106.08342    177.413    0.0001
Error        5       2.98973      0.59795
C Total      8     321.24000

Root MSE     0.77327    R-square    0.9907
Dep Mean    60.96667    Adj R-sq    0.9851
C.V.         1.26835

Parameter Estimates
                  Parameter       Standard    T for H0:
Variable    DF     Estimate          Error    Parameter=0    Prob > |T|
INTERCEP     1     5.629827    12.70680159          0.443        0.6762
X1           1     0.080984     0.28988008          0.279        0.7911
X3           1     3.069358     0.95066097          3.229        0.0232
X2           1     0.771498     0.66918975          1.153        0.3011
Note: X2 really messes up the p-values: with X2 in the model, the p-value for X1 rises from 0.0001 to 0.7911. The p-value for X2 itself (0.3011) is also too large.
17
Standard SAS PROC Reg Output for X1 and for X1 & X3
X1

Dependent Variable: Y

Analysis of Variance
                      Sum of         Mean
Source      DF       Squares       Square    F Value    Prob>F
Model        1     288.14682    288.14682     60.950    0.0001
Error        7      33.09318      4.72760
C Total      8     321.24000

Root MSE     2.17430    R-square    0.8970
Dep Mean    60.96667    Adj R-sq    0.8823
C.V.         3.56638

Parameter Estimates
                  Parameter      Standard    T for H0:
Variable    DF     Estimate         Error    Parameter=0    Prob > |T|
INTERCEP     1    19.011080    5.42271930          3.506        0.0099
X1           1     0.517970    0.06634651          7.807        0.0001

X1 and X3

Dependent Variable: Y

Analysis of Variance
                      Sum of         Mean
Source      DF       Squares       Square    F Value    Prob>F
Model        2     317.45552    158.72776    251.650    0.0001
Error        6       3.78448      0.63075
C Total      8     321.24000

Root MSE     0.79420    R-square    0.9882
Dep Mean    60.96667    Adj R-sq    0.9843
C.V.         1.30267

Parameter Estimates
                  Parameter      Standard    T for H0:
Variable    DF     Estimate         Error    Parameter=0    Prob > |T|
INTERCEP     1    20.108450    1.98725776         10.119        0.0001
X1           1     0.413630    0.02866328         14.431        0.0001
X3           1     2.025334    0.29711598          6.817        0.0005
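These t values tie back to the forward selection output, where the F value reported for each variable is the square of its t value. A quick check with the printed (rounded) values:

```latex
\begin{align*}
\text{X1 alone:}&\quad t^2 = 7.807^2 \approx 60.95 && (F = 60.95) \\
\text{X1 with X3:}&\quad t^2 = 14.431^2 \approx 208.25 && (F = 208.24) \\
\text{X3 with X1:}&\quad t^2 = 6.817^2 \approx 46.47 && (F = 46.47) \\
\text{Partial } R^2 \text{ for X3:}&\quad 0.9882 - 0.8970 = 0.0912
\end{align*}
```

The small difference in the second line comes from squaring a t value that has been rounded to three decimals.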
18
Plots for Residual Analysis for the Final Model, i.e.
$$\hat{y} = 20.108 + 0.414\,x_1 + 2.025\,x_3$$
[Figure: residual plots for the final model: residuals vs x1, residuals vs x3, residuals vs id, residuals vs predicted value of y, and a frequency histogram of the residuals]
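Plots like these can be produced by refitting the final model and saving its residuals and predicted values. A sketch, assuming the infant data set created earlier; the output data set name final and the variable names resid and pred are hypothetical choices, not taken from the slides.

```sas
/* Refit the final two-variable model and save residuals and predictions.
   Names final, resid, and pred are hypothetical. */
PROC REG DATA=infant;
  MODEL y = x1 x3;
  OUTPUT OUT=final R=resid P=pred;
RUN;
QUIT;

/* Residuals against each predictor, observation id, and predicted value. */
PROC GPLOT DATA=final;
  PLOT resid*x1;
  PLOT resid*x3;
  PLOT resid*id;
  PLOT resid*pred;
RUN;
QUIT;

/* Histogram of the residuals as a rough look at their distribution. */
PROC UNIVARIATE DATA=final NOPRINT;
  HISTOGRAM resid;
RUN;
```

With only 9 observations, the histogram gives no more than a rough impression of the shape of the residual distribution.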
19
Backward Elimination
• Begin with all independent variables in the model
• Find the independent variable that is “least useful” in predicting the dependent variable (i.e. smallest contribution to R², smallest F (or |t|), etc.)
– delete this variable if p > SLSTAY
• Continue the process until no further variables are deleted
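In SAS, backward elimination is requested through the SELECTION= option of the MODEL statement. A sketch for the infant data: the value SLSTAY=.15 is an assumption chosen to mirror the SLENTRY value used in the forward selection example. SLSTAY is the significance level a variable must meet in order to stay in the model.

```sas
/* Backward elimination on the infant data.
   SLSTAY=.15 is an assumed value, not taken from the slides. */
PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / SELECTION=BACKWARD SLSTAY=.15;
RUN;
QUIT;
```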
20
Stepwise Selection
• Add independent variables one at a time as in Forward Selection (if p < SLENTRY)
• At each stage perform backward elimination to see whether any variables should be removed (if p > SLSTAY)
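The corresponding SAS request combines an entry level and a stay level. A sketch for the infant data, with both levels set to .15 as an assumption for illustration:

```sas
/* Stepwise selection on the infant data.
   SLENTRY and SLSTAY values are assumed for illustration. */
PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / SELECTION=STEPWISE SLENTRY=.15 SLSTAY=.15;
RUN;
QUIT;
```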
21
Best Subset Regression
• Examine criteria for all acceptable subsets of each “size”, i.e. number of independent variables
• Criteria: R², adjusted R², Cp
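PROC REG can list these criteria for every subset of the candidate variables. A sketch for the infant data, using SELECTION=RSQUARE together with the ADJRSQ and CP options so that all three criteria are printed for each subset:

```sas
/* All-subsets summary for the infant data: R-square, adjusted R-square,
   and Mallows' Cp for every subset of x1-x4. */
PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / SELECTION=RSQUARE ADJRSQ CP;
RUN;
QUIT;
```

With four candidate variables there are 15 non-empty subsets, so the full listing is still short enough to read in one pass.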
22
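Adjusted R²
$$R^2_{adj} = 1 - \left(\frac{n-1}{n-k-1}\right)\left(1 - R^2\right)$$
where n is the number of observations and k is the number of independent variables
-- adjusts for the number of independent variables
-- penalizes excessive use of independent variables
-- useful for comparing competing models with differing number of independent variables
- Cp statistic plays a similar role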
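Both criteria can be checked against the infant-data output. Adjusted R² is computed as 1 minus (n-1)/(n-k-1) times (1 - R²), with n observations and k independent variables. For Cp, the form used below is the standard definition of Mallows' statistic, which the slides do not state: the error sum of squares of the candidate model divided by the error mean square of the full model, minus (n - 2p), where p = k + 1 counts the intercept.

```latex
\begin{align*}
\text{X1, X3 } (n=9,\ k=2):&\quad
  R^2_{adj} = 1 - \tfrac{8}{6}(1 - 0.9882) \approx 0.9843 \\
\text{X1--X4 } (n=9,\ k=4):&\quad
  R^2_{adj} = 1 - \tfrac{8}{4}(1 - 0.9908) \approx 0.9816
  \quad (\text{SAS prints } 0.9815 \text{ from the unrounded } R^2) \\
C_p &= \frac{SSE_p}{MSE_{full}} - (n - 2p) \\
\text{X1 } (p=2):&\quad C_p = \frac{33.09318}{0.74140} - (9 - 4) \approx 39.64 \\
\text{X1, X3 } (p=3):&\quad C_p = \frac{3.78448}{0.74140} - (9 - 6) \approx 2.10
\end{align*}
```

The two-variable model has a higher adjusted R² than the four-variable model even though its R² is lower, and its Cp is close to p = 3.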
23
Multiple Regression – Analysis Suggestions
1. Examine pairwise correlations among variables
2. Examine pairwise scatterplots among variables
24
[Figure: scatterplot matrix of length, age, lengthb, weightb, and chestb]
SPSS Output from INFANT Data Set
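A comparable scatterplot matrix can be drawn in SAS. A sketch, assuming the infant data set from the earlier DATA step, where the variables are named y and x1 to x4 rather than the labels used in the SPSS figure:

```sas
/* Scatterplot matrix of the response and all four candidate predictors. */
PROC SGSCATTER DATA=infant;
  MATRIX y x1 x2 x3 x4;
RUN;
```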
25
[Figure: scatterplot matrix of Horsepower, City MPG, Highway MPG, and Weight in Pounds]
SPSS Output from CAR Data Set
26
Multiple Regression – Analysis Suggestions
1. Examine pairwise correlations among variables
2. Examine pairwise scatterplots among variables
- identify nonlinearity
- identify unequal variance problems
- identify possible outliers
3. Try transformations of variables for
- correcting nonlinearity
- stabilizing the variances
- inducing normality of residuals
27
Examples of Nonlinear Data “Shapes” and Linearizing Transformations
28
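Exponential Transformation (Log-Linear)
Original Model:
$$Y_i = e^{\beta_0 + \beta_1 X_i}\,\varepsilon_i$$
Transformed Into:
$$\ln Y_i = \beta_0 + \beta_1 X_i + \ln \varepsilon_i$$
[Figure: Y plotted against X1, with one curve for β1 > 0 and one for β1 < 0]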
29
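Transformed Multiplicative Model (Log-Log)
Original:
$$Y_i = \beta_0 X_i^{\beta_1}\,\varepsilon_i$$
Transformed:
$$\ln Y_i = \ln \beta_0 + \beta_1 \ln X_i + \ln \varepsilon_i$$
[Figure: two panels of Y plotted against X1, with curves for different values of β1]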
30
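Square Root Transformation
$$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \varepsilon_i$$
[Figure: Y plotted against X1, with one curve for β1 > 0 and one for β1 < 0]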
31
Note:
- transforming Y using the log or square root transformation can help with unequal variance problems
- these transformations may also help induce normality
32
[Figure: four scatterplot panels: hmpg vs hp, hmpg vs sqrt(hp), log(hmpg) vs hp, log(hmpg) vs log(hp)]
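Transformed variables such as these are created in a DATA step before plotting or fitting. A sketch under stated assumptions: the input data set name cars and the new names cars_t, log_hmpg, log_hp, and sqrt_hp are hypothetical, and hmpg and hp are taken to be highway MPG and horsepower, as the panel labels suggest.

```sas
/* Assumes a data set named cars (hypothetical) containing hmpg and hp. */
DATA cars_t;
  SET cars;
  log_hmpg = LOG(hmpg);   /* natural log of the response */
  log_hp   = LOG(hp);     /* natural log of the predictor */
  sqrt_hp  = SQRT(hp);    /* square root of the predictor */
RUN;

/* Fit the log-log form; other panels correspond to other variable pairs. */
PROC REG DATA=cars_t;
  MODEL log_hmpg = log_hp;
RUN;
QUIT;
```

Comparing the residual plots from each transformed fit with those from the untransformed fit shows which form best straightens the relationship and evens out the variance.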