model selection and model building
DESCRIPTION
Model selection and model building. Model selection. Selection of predictor variables. Statement of problem. A common problem is that there is a large set of candidate predictor variables. - PowerPoint PPT PresentationTRANSCRIPT
Model selection and model building
Model selection
Selection of predictor variables
Statement of problem
• A common problem is that there is a large set of candidate predictor variables.
• Goal is to choose a small subset from the larger set so that the resulting regression model is simple, yet have good predictive ability.
Example: Cement data
• Response y: heat evolved in calories during hardening of cement on a per gram basis
• Predictor x1: % of tricalcium aluminate
• Predictor x2: % of tricalcium silicate
• Predictor x3: % of tetracalcium alumino ferrite
• Predictor x4: % of dicalcium silicate
Example: Cement data
83.35
105.05
6
16
37.25
59.75
8.75
18.25
83.35
105.0
5
19.5
46.5
6 1637.25
59.75 8.7
518.25 19
.546.5
y
x1
x2
x3
x4
Two basic methods of selecting predictors
• Stepwise regression: Enter and remove variables, in a stepwise manner, until no justifiable reason to enter or remove more.
• Best subsets regression: Select the subset of variables that do the best at meeting some well-defined objective criterion.
Stepwise regression: the idea
• Start with no predictors in the model.
• At each step, enter or remove a variable based on partial F-tests.
• Stop when no more variables can be justifiably entered or removed.
Stepwise regression: the steps
• Specify an Alpha-to-Enter (0.15) and an Alpha-to-Remove (0.15).
• Start with no predictors in the model.• Put the predictor with the smallest P-value
based on the partial F statistic (a t-statistic) in the model. If P-value > 0.15, then stop. None of the predictors have good predictive ability. Otherwise …
Stepwise regression: the steps
• Add the predictor with the smallest P-value (below 0.15) based on the partial F-statistic (a t-statistic) in the model. If none of the predictors yield P-values < 0.15, stop.
• If P-value > 0.15 for any of the partial F statistics, then remove violating predictor.
• Continue above two steps, until no more predictors can be entered or removed.
Predictor Coef SE Coef T PConstant 81.479 4.927 16.54 0.000x1 1.8687 0.5264 3.55 0.005
Predictor Coef SE Coef T PConstant 57.424 8.491 6.76 0.000x2 0.7891 0.1684 4.69 0.001
Predictor Coef SE Coef T PConstant 110.203 7.948 13.87 0.000x3 -1.2558 0.5984 -2.10 0.060
Predictor Coef SE Coef T PConstant 117.568 5.262 22.34 0.000x4 -0.7382 0.1546 -4.77 0.001
Example: Cement data
Predictor Coef SE Coef T PConstant 103.097 2.124 48.54 0.000x4 -0.61395 0.04864 -12.62 0.000x1 1.4400 0.1384 10.40 0.000
Predictor Coef SE Coef T PConstant 94.16 56.63 1.66 0.127x4 -0.4569 0.6960 -0.66 0.526x2 0.3109 0.7486 0.42 0.687
Predictor Coef SE Coef T PConstant 131.282 3.275 40.09 0.000x4 -0.72460 0.07233 -10.02 0.000x3 -1.1999 0.1890 -6.35 0.000
Predictor Coef SE Coef T PConstant 71.65 14.14 5.07 0.001x4 -0.2365 0.1733 -1.37 0.205x1 1.4519 0.1170 12.41 0.000x2 0.4161 0.1856 2.24 0.052
Predictor Coef SE Coef T PConstant 111.684 4.562 24.48 0.000x4 -0.64280 0.04454 -14.43 0.000x1 1.0519 0.2237 4.70 0.001x3 -0.4100 0.1992 -2.06 0.070
Predictor Coef SE Coef T PConstant 52.577 2.286 23.00 0.000x1 1.4683 0.1213 12.10 0.000x2 0.66225 0.04585 14.44 0.000
Predictor Coef SE Coef T PConstant 71.65 14.14 5.07 0.001x1 1.4519 0.1170 12.41 0.000x2 0.4161 0.1856 2.24 0.052x4 -0.2365 0.1733 -1.37 0.205
Predictor Coef SE Coef T PConstant 48.194 3.913 12.32 0.000x1 1.6959 0.2046 8.29 0.000x2 0.65691 0.04423 14.85 0.000x3 0.2500 0.1847 1.35 0.209
Predictor Coef SE Coef T PConstant 52.577 2.286 23.00 0.000x1 1.4683 0.1213 12.10 0.000x2 0.66225 0.04585 14.44 0.000
Stepwise Regression: y versus x1, x2, x3, x4 Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is y on 4 predictors, with N = 13
Step 1 2 3 4Constant 117.57 103.10 71.65 52.58
x4 -0.738 -0.614 -0.237 T-Value -4.77 -12.62 -1.37 P-Value 0.001 0.000 0.205
x1 1.44 1.45 1.47T-Value 10.40 12.41 12.10P-Value 0.000 0.000 0.000
x2 0.416 0.662T-Value 2.24 14.44P-Value 0.052 0.000
S 8.96 2.73 2.31 2.41R-Sq 67.45 97.25 98.23 97.87R-Sq(adj) 64.50 96.70 97.64 97.44C-p 138.7 5.5 3.0 2.7
Drawbacks of stepwise regression
• The final model is not guaranteed to be optimal in any specified sense.
• The procedure yields a single final model, although in practice there are often several almost equally good models.
Best subsets regression
• If there are p-1 possible predictors, then there are 2p-1 possible regression models containing the predictors.
• For example, 10 predictors yields 210 = 1024 possible regression models.
• A best subsets algorithm determines the best subsets of each size, so that choice of the final model can be made by researcher.
What is used to judge “best”?
• R-square
• Adjusted R-square
• MSE (or S = square root of MSE)
• Mallow’s Cp
R-squared
SSTO
SSE
SSTO
SSRR 12
Use the R-squared values to find the point where adding more predictors is not worthwhile because it leads to a very small increase in R-squared.
Adjusted R-squared or MSE
MSESSTO
n
SSTO
SSE
pn
nRa
1
11
12
Adjusted R-squared increases only if MSE decreases, so adjusted R-squared and MSE provide equivalent information.
Find a few subsets for which MSE is smallest (or adjusted R-squared is largest) or so close to the smallest (largest) that adding more predictors is not worthwhile.
Mallow’s Cp criterion
pnXXMSE
SSEC
p
pp 2
),...,( 11
Mallow’s Cp statistic:
is an estimator of total standardized mean square error of prediction:
2
12
ˆ1
n
iiipp YEYE
n
i
n
iipiipp YVarYEYE
1 1
2
2ˆˆ1
which equals:
Using the Cp criterion
• Subsets with small Cp values have a small total (standardized) mean square error of prediction.
• When the Cp value is also near p, the bias of the regression model is small.
Using the Cp criterion (cont’d)
• So, identify subsets of predictors for which:– the Cp value is smallest, and
– the Cp value is near p (if possible)
• Note though that for the full model, Cp= p. So, the fullest model is always judged to be a good candidate model.
Best Subsets Regression: y versus x1, x2, x3, x4
Response is y
x x x x Vars R-Sq R-Sq(adj) C-p S 1 2 3 4
1 67.5 64.5 138.7 8.9639 X 1 66.6 63.6 142.5 9.0771 X 2 97.9 97.4 2.7 2.4063 X X 2 97.2 96.7 5.5 2.7343 X X 3 98.2 97.6 3.0 2.3087 X X X 3 98.2 97.6 3.0 2.3121 X X X 4 98.2 97.4 5.0 2.4460 X X X X
Example: Modeling PIQ
130.5
91.5
100.728
86.283
73.25
65.75
130.591.
5
170.5
127.5
100.72
886.
283 73.25
65.75
170.5
127.5
PIQ
MRI
Height
Weight
Stepwise Regression: PIQ versus MRI, Height, Weight Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is PIQ on 3 predictors, with N = 38
Step 1 2Constant 4.652 111.276
MRI 1.18 2.06T-Value 2.45 3.77P-Value 0.019 0.001
Height -2.73T-Value -2.75P-Value 0.009
S 21.2 19.5R-Sq 14.27 29.49R-Sq(adj) 11.89 25.46C-p 7.3 2.0
Best Subsets Regression: PIQ versus MRI, Height, WeightResponse is PIQ
H W e e i i M g g R h h Vars R-Sq R-Sq(adj) C-p S I t t
1 14.3 11.9 7.3 21.212 X 1 0.9 0.0 13.8 22.810 X 2 29.5 25.5 2.0 19.510 X X 2 19.3 14.6 6.9 20.878 X X 3 29.5 23.3 4.0 19.794 X X X
The regression equation isPIQ = 111 + 2.06 MRI - 2.73 Height
Predictor Coef SE Coef T PConstant 111.28 55.87 1.99 0.054MRI 2.0606 0.5466 3.77 0.001Height -2.7299 0.9932 -2.75 0.009
S = 19.51 R-Sq = 29.5% R-Sq(adj) = 25.5%
Analysis of Variance
Source DF SS MS F PRegression 2 5572.7 2786.4 7.32 0.002Error 35 13321.8 380.6Total 37 18894.6
Source DF Seq SSMRI 1 2697.1Height 1 2875.6
Example: Modeling BP
120
110
53.25
47.75
97.325
89.375
2.125
1.875
8.275
4.425
72.5
65.5
120110
76.25
30.75
53.25
47.75
97.325
89.375 2.1
251.8
758.2
754.4
25 72.5
65.5
76.25
30.75
BP
Age
Weight
BSA
Duration
Pulse
Stress
Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is BP on 6 predictors, with N = 20
Step 1 2 3Constant 2.205 -16.579 -13.667
Weight 1.201 1.033 0.906T-Value 12.92 33.15 18.49P-Value 0.000 0.000 0.000
Age 0.708 0.702T-Value 13.23 15.96P-Value 0.000 0.000
BSA 4.6T-Value 3.04P-Value 0.008
S 1.74 0.533 0.437R-Sq 90.26 99.14 99.45R-Sq(adj) 89.72 99.04 99.35C-p 312.8 15.1 6.4
Best Subsets Regression: BP versus Age, Weight, ...Response is BP D u W r S e a P t i t u r A g B i l e g h S o s s Vars R-Sq R-Sq(adj) C-p S e t A n e s
1 90.3 89.7 312.8 1.7405 X 1 75.0 73.6 829.1 2.7903 X 2 99.1 99.0 15.1 0.53269 X X 2 92.0 91.0 256.6 1.6246 X X 3 99.5 99.4 6.4 0.43705 X X X 3 99.2 99.1 14.1 0.52012 X X X 4 99.5 99.4 6.4 0.42591 X X X X 4 99.5 99.4 7.1 0.43500 X X X X 5 99.6 99.4 7.0 0.42142 X X X X X 5 99.5 99.4 7.7 0.43078 X X X X X 6 99.6 99.4 7.0 0.40723 X X X X X X
The regression equation isBP = - 13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA
Predictor Coef SE Coef T PConstant -13.667 2.647 -5.16 0.000Age 0.70162 0.04396 15.96 0.000Weight 0.90582 0.04899 18.49 0.000BSA 4.627 1.521 3.04 0.008S = 0.4370 R-Sq = 99.5% R-Sq(adj) = 99.4%
Analysis of Variance
Source DF SS MS F PRegression 3 556.94 185.65 971.93 0.000Error 16 3.06 0.19Total 19 560.00
Source DF Seq SSAge 1 243.27Weight 1 311.91BSA 1 1.77
Stepwise regression in Minitab
• Stat >> Regression >> Stepwise …
• Specify response and all possible predictors.
• If desired, specify predictors that must be included in every model.
• Select OK. Results appear in session window.
Best subsets regression
• Stat >> Regression >> Best subsets …
• Specify response and all possible predictors.
• If desired, specify predictors that must be included in every model.
• Select OK. Results appear in session window.
Model building strategy
The first step
• Decide on the type of model needed– Predictive: model used to predict the response
variable from a chosen set of predictors.– Theoretical: model based on theoretical
relationship between response and predictors.– Control: model used to control a response
variable by manipulating predictor variables.
The first step (cont’d)
• Decide on the type of model needed– Inferential: model used to explore strength of
relationships between response and predictors.– Data summary: model used merely as a way to
summarize a large set of data by a single equation.
The second step
• Decide which predictor variables and response variable on which to collect the data.
• Collect the data.
The third step
• Explore the data– Check for outliers, gross data errors, missing
values on a univariate basis.– Study bivariate relationships to reveal other
outliers, to suggest possible transformations, to identify possible multicollinearities.
The fourth step
• Randomly divide the data into a training set and a test set:– The training set, with at least 15-20 error d.f.,
is used to fit the model.– The test set is used for cross-validation of the
fitted model.
The fifth step
• Using the training set, fit several candidate models:– Use best subsets regression.– Use stepwise regression (only gives one model
unless specifies different alpha-to-remove and alpha-to-enter values).
The sixth step
• Select and evaluate a few “good” models:– Select based on adjusted R2, Mallow’s Cp,
number and nature of predictors.– Evaluate selected models for violation of model
assumptions.– If none of the models provide a satisfactory fit,
try something else, such as more data, different predictors, a different class of model …
The final step
• Select the final model:– Compare competing models by cross-validating
them against the test data.– The model with a larger cross-validation R2 is a
better predictive model.– Consider residual plots, outliers, parsimony,
relevance, and ease of measurement of predictors.