model selection and model building

Model selection and model building

Model selection

Selection of predictor variables

Statement of problem

• A common problem is that there is a large set of candidate predictor variables.

• Goal is to choose a small subset from the larger set so that the resulting regression model is simple, yet have good predictive ability.

Example: Cement data

• Response y: heat evolved in calories during hardening of cement on a per gram basis

• Predictor x1: % of tricalcium aluminate

• Predictor x2: % of tricalcium silicate

• Predictor x3: % of tetracalcium alumino ferrite

• Predictor x4: % of dicalcium silicate


83.35

105.05

6

16

37.25

59.75

8.75

18.25

83.35

105.0

5

19.5

46.5

6 1637.25

59.75 8.7

518.25 19

.546.5

y

x1

x2

x3

x4

Two basic methods of selecting predictors

• Stepwise regression: Enter and remove variables, in a stepwise manner, until no justifiable reason to enter or remove more.

• Best subsets regression: Select the subset of variables that do the best at meeting some well-defined objective criterion.

Stepwise regression: the idea

• Start with no predictors in the model.

• At each step, enter or remove a variable based on partial F-tests.

• Stop when no more variables can be justifiably entered or removed.

Stepwise regression: the steps

• Specify an Alpha-to-Enter (0.15) and an Alpha-to-Remove (0.15).

• Start with no predictors in the model.• Put the predictor with the smallest P-value

based on the partial F statistic (a t-statistic) in the model. If P-value > 0.15, then stop. None of the predictors have good predictive ability. Otherwise …

Stepwise regression: the steps

• Add the predictor with the smallest P-value (below 0.15) based on the partial F-statistic (a t-statistic) in the model. If none of the predictors yield P-values < 0.15, stop.

• If P-value > 0.15 for any of the partial F statistics, then remove violating predictor.

• Continue above two steps, until no more predictors can be entered or removed.

Predictor Coef SE Coef T PConstant 81.479 4.927 16.54 0.000x1 1.8687 0.5264 3.55 0.005

Predictor Coef SE Coef T PConstant 57.424 8.491 6.76 0.000x2 0.7891 0.1684 4.69 0.001

Predictor Coef SE Coef T PConstant 110.203 7.948 13.87 0.000x3 -1.2558 0.5984 -2.10 0.060

Predictor Coef SE Coef T PConstant 117.568 5.262 22.34 0.000x4 -0.7382 0.1546 -4.77 0.001


Predictor Coef SE Coef T PConstant 103.097 2.124 48.54 0.000x4 -0.61395 0.04864 -12.62 0.000x1 1.4400 0.1384 10.40 0.000

Predictor Coef SE Coef T PConstant 94.16 56.63 1.66 0.127x4 -0.4569 0.6960 -0.66 0.526x2 0.3109 0.7486 0.42 0.687

Predictor Coef SE Coef T PConstant 131.282 3.275 40.09 0.000x4 -0.72460 0.07233 -10.02 0.000x3 -1.1999 0.1890 -6.35 0.000

Predictor Coef SE Coef T PConstant 71.65 14.14 5.07 0.001x4 -0.2365 0.1733 -1.37 0.205x1 1.4519 0.1170 12.41 0.000x2 0.4161 0.1856 2.24 0.052

Predictor Coef SE Coef T PConstant 111.684 4.562 24.48 0.000x4 -0.64280 0.04454 -14.43 0.000x1 1.0519 0.2237 4.70 0.001x3 -0.4100 0.1992 -2.06 0.070

Predictor Coef SE Coef T PConstant 52.577 2.286 23.00 0.000x1 1.4683 0.1213 12.10 0.000x2 0.66225 0.04585 14.44 0.000

Predictor Coef SE Coef T PConstant 71.65 14.14 5.07 0.001x1 1.4519 0.1170 12.41 0.000x2 0.4161 0.1856 2.24 0.052x4 -0.2365 0.1733 -1.37 0.205

Predictor Coef SE Coef T PConstant 48.194 3.913 12.32 0.000x1 1.6959 0.2046 8.29 0.000x2 0.65691 0.04423 14.85 0.000x3 0.2500 0.1847 1.35 0.209

Predictor Coef SE Coef T PConstant 52.577 2.286 23.00 0.000x1 1.4683 0.1213 12.10 0.000x2 0.66225 0.04585 14.44 0.000

Stepwise Regression: y versus x1, x2, x3, x4 Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is y on 4 predictors, with N = 13

Step 1 2 3 4Constant 117.57 103.10 71.65 52.58

x4 -0.738 -0.614 -0.237 T-Value -4.77 -12.62 -1.37 P-Value 0.001 0.000 0.205

x1 1.44 1.45 1.47T-Value 10.40 12.41 12.10P-Value 0.000 0.000 0.000

x2 0.416 0.662T-Value 2.24 14.44P-Value 0.052 0.000

S 8.96 2.73 2.31 2.41R-Sq 67.45 97.25 98.23 97.87R-Sq(adj) 64.50 96.70 97.64 97.44C-p 138.7 5.5 3.0 2.7

Drawbacks of stepwise regression

• The final model is not guaranteed to be optimal in any specified sense.

• The procedure yields a single final model, although in practice there are often several almost equally good models.

Best subsets regression

• If there are p-1 possible predictors, then there are 2p-1 possible regression models containing the predictors.

• For example, 10 predictors yields 210 = 1024 possible regression models.

• A best subsets algorithm determines the best subsets of each size, so that choice of the final model can be made by researcher.

What is used to judge “best”?

• R-square

• Adjusted R-square

• MSE (or S = square root of MSE)

• Mallow’s Cp

R-squared

SSTO

SSE

SSTO

SSRR 12

Use the R-squared values to find the point where adding more predictors is not worthwhile because it leads to a very small increase in R-squared.

Adjusted R-squared or MSE

MSESSTO

n

SSTO

SSE

pn

nRa

1

11

12

Adjusted R-squared increases only if MSE decreases, so adjusted R-squared and MSE provide equivalent information.

Find a few subsets for which MSE is smallest (or adjusted R-squared is largest) or so close to the smallest (largest) that adding more predictors is not worthwhile.

Mallow’s Cp criterion

pnXXMSE

SSEC

p

pp 2

),...,( 11

Mallow’s Cp statistic:

is an estimator of total standardized mean square error of prediction:

2

12

ˆ1

n

iiipp YEYE

n

i

n

iipiipp YVarYEYE

1 1

2

2ˆˆ1

which equals:

Using the Cp criterion

• Subsets with small Cp values have a small total (standardized) mean square error of prediction.

• When the Cp value is also near p, the bias of the regression model is small.

Using the Cp criterion (cont’d)

• So, identify subsets of predictors for which:– the Cp value is smallest, and

– the Cp value is near p (if possible)

• Note though that for the full model, Cp= p. So, the fullest model is always judged to be a good candidate model.

Best Subsets Regression: y versus x1, x2, x3, x4

Response is y

x x x x Vars R-Sq R-Sq(adj) C-p S 1 2 3 4

1 67.5 64.5 138.7 8.9639 X 1 66.6 63.6 142.5 9.0771 X 2 97.9 97.4 2.7 2.4063 X X 2 97.2 96.7 5.5 2.7343 X X 3 98.2 97.6 3.0 2.3087 X X X 3 98.2 97.6 3.0 2.3121 X X X 4 98.2 97.4 5.0 2.4460 X X X X

Example: Modeling PIQ

130.5

91.5

100.728

86.283

73.25

65.75

130.591.

5

170.5

127.5

100.72

886.

283 73.25

65.75

170.5

127.5

PIQ

MRI

Height

Weight

Stepwise Regression: PIQ versus MRI, Height, Weight Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is PIQ on 3 predictors, with N = 38

Step 1 2Constant 4.652 111.276

MRI 1.18 2.06T-Value 2.45 3.77P-Value 0.019 0.001

Height -2.73T-Value -2.75P-Value 0.009

S 21.2 19.5R-Sq 14.27 29.49R-Sq(adj) 11.89 25.46C-p 7.3 2.0

Best Subsets Regression: PIQ versus MRI, Height, WeightResponse is PIQ

H W e e i i M g g R h h Vars R-Sq R-Sq(adj) C-p S I t t

1 14.3 11.9 7.3 21.212 X 1 0.9 0.0 13.8 22.810 X 2 29.5 25.5 2.0 19.510 X X 2 19.3 14.6 6.9 20.878 X X 3 29.5 23.3 4.0 19.794 X X X

The regression equation isPIQ = 111 + 2.06 MRI - 2.73 Height

Predictor Coef SE Coef T PConstant 111.28 55.87 1.99 0.054MRI 2.0606 0.5466 3.77 0.001Height -2.7299 0.9932 -2.75 0.009

S = 19.51 R-Sq = 29.5% R-Sq(adj) = 25.5%

Analysis of Variance

Source DF SS MS F PRegression 2 5572.7 2786.4 7.32 0.002Error 35 13321.8 380.6Total 37 18894.6

Source DF Seq SSMRI 1 2697.1Height 1 2875.6

Example: Modeling BP

120

110

53.25

47.75

97.325

89.375

2.125

1.875

8.275

4.425

72.5

65.5

120110

76.25

30.75

53.25

47.75

97.325

89.375 2.1

251.8

758.2

754.4

25 72.5

65.5

76.25

30.75

BP

Age

Weight

BSA

Duration

Pulse

Stress

Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is BP on 6 predictors, with N = 20

Step 1 2 3Constant 2.205 -16.579 -13.667

Weight 1.201 1.033 0.906T-Value 12.92 33.15 18.49P-Value 0.000 0.000 0.000

Age 0.708 0.702T-Value 13.23 15.96P-Value 0.000 0.000

BSA 4.6T-Value 3.04P-Value 0.008

S 1.74 0.533 0.437R-Sq 90.26 99.14 99.45R-Sq(adj) 89.72 99.04 99.35C-p 312.8 15.1 6.4

Best Subsets Regression: BP versus Age, Weight, ...Response is BP D u W r S e a P t i t u r A g B i l e g h S o s s Vars R-Sq R-Sq(adj) C-p S e t A n e s

1 90.3 89.7 312.8 1.7405 X 1 75.0 73.6 829.1 2.7903 X 2 99.1 99.0 15.1 0.53269 X X 2 92.0 91.0 256.6 1.6246 X X 3 99.5 99.4 6.4 0.43705 X X X 3 99.2 99.1 14.1 0.52012 X X X 4 99.5 99.4 6.4 0.42591 X X X X 4 99.5 99.4 7.1 0.43500 X X X X 5 99.6 99.4 7.0 0.42142 X X X X X 5 99.5 99.4 7.7 0.43078 X X X X X 6 99.6 99.4 7.0 0.40723 X X X X X X

The regression equation isBP = - 13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA

Predictor Coef SE Coef T PConstant -13.667 2.647 -5.16 0.000Age 0.70162 0.04396 15.96 0.000Weight 0.90582 0.04899 18.49 0.000BSA 4.627 1.521 3.04 0.008S = 0.4370 R-Sq = 99.5% R-Sq(adj) = 99.4%

Analysis of Variance

Source DF SS MS F PRegression 3 556.94 185.65 971.93 0.000Error 16 3.06 0.19Total 19 560.00

Source DF Seq SSAge 1 243.27Weight 1 311.91BSA 1 1.77

Stepwise regression in Minitab

• Stat >> Regression >> Stepwise …

• Specify response and all possible predictors.

• If desired, specify predictors that must be included in every model.

• Select OK. Results appear in session window.

Best subsets regression

• Stat >> Regression >> Best subsets …

• Specify response and all possible predictors.

• If desired, specify predictors that must be included in every model.

• Select OK. Results appear in session window.

Model building strategy

The first step

• Decide on the type of model needed– Predictive: model used to predict the response

variable from a chosen set of predictors.– Theoretical: model based on theoretical

relationship between response and predictors.– Control: model used to control a response

variable by manipulating predictor variables.

The first step (cont’d)

• Decide on the type of model needed– Inferential: model used to explore strength of

relationships between response and predictors.– Data summary: model used merely as a way to

summarize a large set of data by a single equation.

The second step

• Decide which predictor variables and response variable on which to collect the data.

• Collect the data.

The third step

• Explore the data– Check for outliers, gross data errors, missing

values on a univariate basis.– Study bivariate relationships to reveal other

outliers, to suggest possible transformations, to identify possible multicollinearities.

The fourth step

• Randomly divide the data into a training set and a test set:– The training set, with at least 15-20 error d.f.,

is used to fit the model.– The test set is used for cross-validation of the

fitted model.

The fifth step

• Using the training set, fit several candidate models:– Use best subsets regression.– Use stepwise regression (only gives one model

unless specifies different alpha-to-remove and alpha-to-enter values).

The sixth step

• Select and evaluate a few “good” models:– Select based on adjusted R2, Mallow’s Cp,

number and nature of predictors.– Evaluate selected models for violation of model

assumptions.– If none of the models provide a satisfactory fit,

try something else, such as more data, different predictors, a different class of model …

The final step

• Select the final model:– Compare competing models by cross-validating

them against the test data.– The model with a larger cross-validation R2 is a

better predictive model.– Consider residual plots, outliers, parsimony,

relevance, and ease of measurement of predictors.

model selection and model building

Documents

resulting regression

partial fstatistic

smallest pvalue

predictorsstepwise regression

best subsets regression

partial ftests

subset of variables

partial f statistics