model selection

27
1 Model Selection In multiple regression we often have many explanatory variables. How do we find the “best” model?

Upload: selia

Post on 16-Jan-2016

32 views

Category:

Documents


0 download

DESCRIPTION

Model Selection. In multiple regression we often have many explanatory variables. How do we find the “best” model?. Model Selection. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Model Selection

1

Model Selection In multiple regression we

often have many explanatory variables.

How do we find the “best” model?

Page 2: Model Selection

2

Model Selection How can we select the set of

explanatory variables that will explain the most variation in the response and have each variable adding significantly to the model?

Page 3: Model Selection

3

Cruising Timber Response: Mean Diameter at

Breast Height (MDBH) of a tree. Explanatory:

X1 = Mean Height of Pines X2 = Age of Tract times the Number of Pines X3 = Mean Height of Pines divided by the Number of Pines

Page 4: Model Selection

4

Forward Selection Begin with no variables in the

model. At each step check to see if

you can add a variable to the model. If you can, add the variable. If not, stop.

Page 5: Model Selection

5

Forward Selection – Step 1

Select the variable that has the highest correlation with the response.

If this correlation is statistically significant, add the variable to the model.

Page 6: Model Selection

6

JMP Multivariate Methods Multivariate Put MDBH, X1, X2, and X3 in

the Y, Columns box.

Page 7: Model Selection

7

MDBH

X1

X2

X3

1.0000

0.7731

0.2435

0.8404

0.7731

1.0000

0.7546

0.6345

0.2435

0.7546

1.0000

0.0557

0.8404

0.6345

0.0557

1.0000

MDBH X1 X2 X3

Correlations

4.5

5.5

6.5

7.5

25

35

45

55

5000

7500

10000

12500

15000

17500

0.04

0.06

0.08

0.1

0.12

MDBH

4.5 5.5 6.5 7.5

X1

25 35 45 55

X2

5000 12500

X3

0.04 0.07 0.1

Scatterplot Matrix

Multivariate

Page 8: Model Selection

8

Correlation with response

X1

X2

X2

X3

X3

X3

Variable

MDBH

MDBH

X1

MDBH

X1

X2

by Variable

0.7731

0.2435

0.7546

0.8404

0.6345

0.0557

Correlation

20

20

20

20

20

20

Count

<.0001*

0.3008

0.0001*

<.0001*

0.0027*

0.8156

Signif Prob -.8 -.6 -.4 -.2 0 .2 .4 .6 .8

Pairwise Correlations

Multivariate

Page 9: Model Selection

9

Comment The explanatory variable X3 has

the highest correlation with the response MDBH. r = 0.8404

The correlation between X3 and MDBH is statistically significant. Signif Prob < 0.0001, small P-value.

Page 10: Model Selection

10

Step 1 - Action Fit the simple linear

regression of MDBH on X3. Predicted MDBH = 3.896 +

32.937*X3

R2 = 0.7063 RMSE = 0.4117

Page 11: Model Selection

11

SLR of MDBH on X3

Test of Model Utility F = 43.2886, P-value < 0.0001

Statistical Significance of X3

t = 6.58, P-value < 0.0001 Exactly the same as the test

for significant correlation.

Page 12: Model Selection

12

Can we do better? Can we explain more

variation in MDBH by adding one of the other variables to the model with X3?

Will that addition be statistically significant?

Page 13: Model Selection

13

Forward Selection – Step 2

Which variable should we add, X1 or X2?

How can we decide?

Page 14: Model Selection

14

Correlation among explanatory variables

X1

X2

X2

X3

X3

X3

Variable

MDBH

MDBH

X1

MDBH

X1

X2

by Variable

0.7731

0.2435

0.7546

0.8404

0.6345

0.0557

Correlation

20

20

20

20

20

20

Count

<.0001*

0.3008

0.0001*

<.0001*

0.0027*

0.8156

Signif Prob -.8 -.6 -.4 -.2 0 .2 .4 .6 .8

Pairwise Correlations

Multivariate

Page 15: Model Selection

15

Multicollinearity Because some explanatory

variables are correlated, they may carry overlapping information about the response.

You can’t rely on the simple correlations between explanatory and response to tell you which variable to add.

Page 16: Model Selection

16

Forward selection – Step 2

Look at partial residual plots.

Determine statistical significance.

Page 17: Model Selection

17

Partial Residual Plots Look at the residuals from

the SLR of Y on X3 plotted against the other variables once the overlapping information with X3 has been removed.

Page 18: Model Selection

18

How is this done? Fit MDBH versus X3 and obtain

residuals – Resid(Y on X3) Fit X1 versus X3 and obtain

residuals - Resid(X1 on X3) Fit X2 versus X3 and obtain

residuals - Resid(X2 on X3)

Page 19: Model Selection

19

-0.5

0

0.5

-10

-5

0

5

10

1520

-2500

0

2500

5000

7500

10000

Resid(YonX3)

-0.5 0 .5

Resid(X1onX3)

-10 -5 0 5 10 15 20

Resid(X2onX3)

-2500 0 2500 7500

Page 20: Model Selection

20

CorrelationsResid(YonX3) Resid(X1onX

3)Resid(X2onX3)

Resid(YonX3) 1.0000 0.5726 0.3636

Resid(X1onX3)

0.5726 1.0000 0.9320

Resid(X2onX3)

0.3636 0.9320 1.000

Page 21: Model Selection

21

Comment The residuals (unexplained

variation in the response) from the SLR of MDBH on X3 have the highest correlation with X1 once we have adjusted for the overlapping information with X3.

Page 22: Model Selection

22

Statistical Significance Does X1 add significantly to the

model that already contains X3? t = 2.88, P-value = 0.0104 F = 8.29, P-value = 0.0104 Because the P-value is small, X1 adds significantly to the model with X3.

Page 23: Model Selection

23

Summary Step 1 – add X3

R2 = 0.706 Step 2 – add X1 to X3

R2 = 0.803 Can we do better?

Page 24: Model Selection

24

Forward Selection – Step 3 Does X2 add significantly to the

model that already contains X3 and X1? t = –2.79, P-value = 0.0131 F = 7.78, P-value = 0.0131 Because the P-value is small, X2

adds significantly to the model with X3 and X1.

Page 25: Model Selection

25

Summary Step 1 – add X3

R2 = 0.706 Step 2 – add X1 to X3

R2 = 0.803 Step 3 – add X2 to X1 and X3

R2 = 0.867

Page 26: Model Selection

26

Summary At each step the variable

being added is statistically significant.

Has the forward selection procedure found the “best” model?

Page 27: Model Selection

27

“Best” Model? The model with all three

variables is useful. F = 34.83, P-value < 0.0001

The variable X3 does not add significantly to the model with just X1 and X2. t = 0.41, P-value = 0.6844