Source: indico.ictp.it/event/a09161/session/17/contribution/10/material/0/0.pdf

Pitfalls of cross validation

Michael K. Tippett

International Research Institute for Climate and Society, The Earth Institute, Columbia University

Statistical Methods in Seasonal Prediction, ICTP, Aug 2-13, 2010

Main ideas

- Cross-validation is useful for estimating the performance of a model on independent data.
- Few assumptions.
- Computationally expensive.
- Can be misused.

Outline

- Cross validation and linear regression
  - Computational method
  - Bias
- Pitfalls
  - Not including model/predictor selection in the cross-validation.
  - Not leaving out enough data so that the training and validation samples are independent.
- Cross-validation is expensive.

Cross-validation of linear regression

Cross-validated error can be computed without the computational cost of cross-validation:

y(i) − ycv(i) = e(i) / (1 − v_ii)

where
- ycv(i) is the prediction from the regression with the i-th sample left out of the computation of the regression coefficients,
- e(i) = y(i) − ŷ(i) is the in-sample error,
- v_ii is the i-th diagonal of the "hat" matrix V = X(XᵀX)⁻¹Xᵀ.

(Cook & Weisberg, Residuals and Influence in Regression, 1980)
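The identity above can be checked numerically. A minimal R sketch (the data here are synthetic, for illustration only), comparing the hat-matrix shortcut against an explicit leave-one-out loop:

```r
set.seed(1)
n <- 20
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Shortcut: cross-validated predictions from one full-sample fit
X <- cbind(1, x)                       # design matrix with intercept
V <- X %*% solve(t(X) %*% X) %*% t(X)  # "hat" matrix
e <- residuals(lm(y ~ x))              # in-sample errors e(i)
ycv_fast <- y - e / (1 - diag(V))      # from y(i) - ycv(i) = e(i)/(1 - vii)

# Brute force: refit the regression with each sample left out
ycv_loop <- sapply(1:n, function(i) {
  fit <- lm(y ~ x, subset = -i)
  predict(fit, data.frame(x = x[i]))
})

max(abs(ycv_fast - ycv_loop))          # agreement to rounding error
```

The shortcut needs one fit instead of n, which is what makes exact leave-one-out cheap for linear regression.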

Cross-validation for linear regression

What is the correlation of a climatological forecast?
What is the cross-validated correlation of a climatological forecast?

- X = column of n ones.
- XᵀX = n, (XᵀX)⁻¹ = 1/n.
- V = X(XᵀX)⁻¹Xᵀ = n × n matrix with all values 1/n.
- e(i) = y(i) − ȳ.

ycv(i) = y(i) − e(i)/(1 − v_ii)
       = y(i) − (y(i) − ȳ)/(1 − 1/n)
       = ȳ · n/(n − 1) − y(i) · 1/(n − 1)

What is the correlation between y and ycv? Why?
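A quick R check of this result (illustrative random data): the leave-one-out climatology is an affine, decreasing function of the left-out value, so its correlation with y is exactly −1.

```r
set.seed(2)
n <- 30
y <- rnorm(n)

# Leave-one-out climatology: predict y(i) by the mean of the other years
ycv <- sapply(1:n, function(i) mean(y[-i]))

cor(y, ycv)   # exactly -1, up to rounding
```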


Cross-validation for linear regression

More intuitively:
- Leaving out a wet year gives a drier mean.
- Leaving out a dry year gives a wetter mean.

Negative correlation.

Problem solved if somehow the mean did not change. How?


Cross-validation for linear regression

Some benefit to leaving out more years and predicting the middle year (n = 30).

[Figure: cross-validated correlation of a climatology forecast vs. number of years left out (1-15); values between −1 and 0.]

Problem with trend?
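The experiment behind this figure can be sketched as follows, under assumed details (random climatology, symmetric left-out blocks): leave out a block of k years and forecast the middle year with the mean of the remaining years.

```r
set.seed(3)
n <- 30
y <- rnorm(n)

cv_corr <- function(y, k) {
  n <- length(y)
  h <- (k - 1) / 2                     # half-width of the left-out block
  ycv <- sapply(1:n, function(i) {
    out <- max(1, i - h):min(n, i + h) # block centered on year i
    mean(y[-out])                      # climatology without the block
  })
  cor(y, ycv)
}

cc <- sapply(c(1, 3, 5, 7), function(k) cv_corr(y, k))
cc   # k = 1 is exactly -1; larger k weakens the negative bias
```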

Cross-validation for linear regression

Some benefit to leaving out more years and predicting the middle year, but an increase in error variance (n = 30).

[Figure: cross-validated 1 − normalized error variance vs. number of years left out (1-15).]

Bias of cross validation

n = 30, y = ax + b.

[Figure: cross-validated correlation vs. population correlation.]

Bias of cross validation

n = 30, y = ax + b.

[Figure: cross-validated 1 − normalized error variance vs. population correlation.]

Pitfalls of cross validation

- Performing an initial analysis using the entire data set to identify the most informative features.
- Using cross-validation to assess several models, and only stating the results for the model with the best results.
- Allowing some of the training data to be (essentially) included in the test set.

(From Wikipedia.)

Example: Philippines

Problem: predict gridded April-June precipitation over the Philippines from preceding (January-March) SST.

[Maps: SST anomaly, JFM 1971 (100E-140W, 20S-20N) and rainfall anomalies, AMJ 1971 (115E-130E, 5N-20N).]

Example: Philippines

Simple regression model: either climatology or ENSO as predictor.

Use AIC to choose.
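This choice can be sketched at one gridpoint with synthetic data (the names and coefficients below are illustrative, not the slides' data): compare the AIC of a climatology-only model against a model with an ENSO index.

```r
set.seed(4)
n <- 30
enso <- rnorm(n)                  # stand-in ENSO index
rain <- 0.8 * enso + rnorm(n)     # gridpoint rainfall with an ENSO signal

m_cl   <- lm(rain ~ 1)            # climatology: intercept only
m_enso <- lm(rain ~ enso)         # ENSO as predictor

# The model with the smaller AIC is selected
selected <- if (AIC(m_enso) < AIC(m_cl)) "ENSO" else "CL"
selected
```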

[Map: AIC-selected model at each gridpoint, CL (climatology) or ENSO.]

Example: Philippines

- Some skill (cross-validated).
- Why the negative correlation?

[Maps: cross-validated correlation (−1 to 1) and 1 − error var/clim var (0 to 0.5) over the Philippines.]

Pitfall!

- Showed the model selected.
- Presented the cross-validated skill of that model.

What's wrong?

The entire dataset was used to select the predictors. Solution?


Pitfall!

Solution: include predictor selection in cross-validation.

[Maps: min-AIC-selected model and max-AIC-selected model at each gridpoint, CL or ENSO.]

Pitfall!

Negative impact on correlation in places with skill?

[Map: change in correlation (−0.5 to 0.5).]

Why is the impact positive in some areas?

Pitfall!

Negative impact on normalized error

[Map: change in 1 − error var/clim var (−0.2 to 0.2).]

Pitfall!

Points with little skill are most affected.

[Scatterplots: correlation and 1 − error var/clim var, with cross-validated predictor selection (y-axis) vs. without (x-axis).]

Idea: conservative models don't go too far wrong... What if the model had used many PCs?

A more nefarious (but real) example

- Observe that in a 40-member ensemble of GCM predictions some members have more skill than others.
- Pick the members with skill exceeding some threshold.
- Perform PCA and retain the PCs with skill exceeding some threshold as your predictors.
- Estimate skill using cross-validation.

Sounds harmless, maybe even clever.

What is the problem?

What is the impact?
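A minimal R sketch of this kind of screening, with purely random data (the threshold and sizes are assumptions): the members are screened against y using the full sample, and only the regression afterwards is cross-validated.

```r
set.seed(5)
n <- 28          # years
m <- 40          # "ensemble members", all pure noise
y <- rnorm(n)
X <- matrix(rnorm(n * m), n, m)

# Screening uses ALL the data: keep members positively correlated with y
keep <- which(cor(X, y) > 0.2)
xs <- rowMeans(X[, keep, drop = FALSE])

# Afterwards, an "honest-looking" leave-one-out CV of the regression
ycv <- sapply(1:n, function(i) {
  fit <- lm(y ~ xs, subset = -i)
  predict(fit, data.frame(xs = xs[i]))
})

r_cv <- cor(y, ycv)
r_cv   # typically well above zero despite zero true skill
```

Because the screening step already saw every year, the cross-validation that follows cannot undo the leak.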


A more nefarious (but real) example

Cross-validated forecasts show good skill.

[Figure: observations and cross-validated forecasts, 1980-2010; correlation = 0.84.]

What is the real skill?

A more nefarious (but real) example

Apply this procedure 1000 times to random numbers

[Histogram: frequency of cross-validated correlations; mean correlation = 0.8.]

Never underestimate the power of a screening procedure.


Cross-validation and Predictor selection: Bad

# predictor_selection() is schematic: it chooses which predictors in x to use.
# Pitfall: it is called ONCE, on the full data set, before the loop.
predictor_selection(x, y)
ypred = rep(NA, N)
for (ii in 1:N) {
  out = (ii - 1):(ii + 1)             # leave out 3 consecutive samples
  training = setdiff(1:N, out)
  xcv = x[training]
  ycv = y[training]
  model.cv = lm(ycv ~ xcv)
  ypred[ii] = predict(model.cv, list(xcv = x[ii]))
}

Cross-validation and Predictor selection: Good

ypred = rep(NA, N)
for (ii in 1:N) {
  out = (ii - 1):(ii + 1)             # leave out 3 consecutive samples
  training = setdiff(1:N, out)
  xcv = x[training]
  ycv = y[training]
  predictor_selection(xcv, ycv)       # selection sees only the training data
  model.cv = lm(ycv ~ xcv)
  ypred[ii] = predict(model.cv, list(xcv = x[ii]))
}
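Since predictor_selection is left schematic, here is a self-contained toy version of the "good" loop, with a hypothetical selection rule (pick the candidate column most correlated with y) applied inside the loop, on the training rows only:

```r
set.seed(6)
N <- 40
Xall <- matrix(rnorm(N * 5), N, 5)    # 5 candidate predictors
y <- 0.8 * Xall[, 2] + rnorm(N)       # only column 2 carries signal

ypred <- rep(NA_real_, N)
for (ii in 1:N) {
  out <- intersect((ii - 1):(ii + 1), 1:N)   # 3 consecutive samples out
  training <- setdiff(1:N, out)
  # toy predictor selection, using ONLY the training rows
  j <- which.max(abs(cor(Xall[training, ], y[training])))
  d <- data.frame(ycv = y[training], xcv = Xall[training, j])
  model.cv <- lm(ycv ~ xcv, data = d)
  ypred[ii] <- predict(model.cv, data.frame(xcv = Xall[ii, j]))
}

cor(y, ypred)   # honest cross-validated skill
```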

Summary

- Predictor selection needs to be included in the cross-validation.
- Impact varies.

Example: PCA and regression

We asked: is there any benefit to predicting the PCs of y rather than y itself?

Compared regression at each gridpoint to regression between patterns:

- Compute PCs of SST.
- Compute PCs of rainfall.
- Skill from cross-validated regression between the PCs.

What is the problem?

Cannot do PCA of "future" observations. PCA of y needs to go in the CV loop.
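Putting the PCA of y inside the CV loop can be sketched as below (synthetic data; a single scalar predictor and only the leading PC are simplifying assumptions):

```r
set.seed(7)
n <- 30   # years
g <- 12   # rainfall gridpoints
Y <- matrix(rnorm(n * g), n, g)   # rainfall field
x <- rnorm(n)                     # scalar predictor (e.g. an SST index)

Ypred <- matrix(NA_real_, n, g)
for (i in 1:n) {
  tr <- setdiff(1:n, i)
  pc <- prcomp(Y[tr, ])                   # PCA of y from training years only
  z  <- pc$x[, 1]                         # leading PC time series
  d  <- data.frame(z = z, x = x[tr])
  fit  <- lm(z ~ x, data = d)
  zhat <- predict(fit, data.frame(x = x[i]))
  # map the predicted PC back to the rainfall grid
  Ypred[i, ] <- pc$center + zhat * pc$rotation[, 1]
}

dim(Ypred)   # one cross-validated field per year
```

The left-out year never touches the PCA, so no "future" observations leak into the patterns.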


Example: Philippines

[Scatterplots: pattern vs. pointwise correlation (0 to 1); left panel: Y PCA outside CV, right panel: Y PCA inside CV.]

Example: Philippines

[Scatterplots: pattern vs. pointwise 1 − error var/clim var (−0.1 to 0.5); left panel: Y PCA outside CV, right panel: Y PCA inside CV.]

Similar problem.

- Do CCA.
- Find patterns and time series.
- Use the time series in a regression.
- Check skill using cross-validation.

What is the problem?

What is a solution?


CPT

Climate prediction tool.

- 3 consecutive years are left out.
- CCA is applied to the remaining years. The CCA depends on the number of PCs retained.
- The middle year of the left-out years is forecast.
- Repeat until all years are forecast.
- The cross-validated forecast depends on the number of PCs retained.
- Select the number of PCs that optimizes cross-validated skill. This skill is reported as the forecast skill.

What is the problem? Solution?


Three data sets

- Data set 1: estimate the parameters of the model.
- Data set 2: select the predictor/model.
- Data set 3: estimate the skill of the model.

Why are two data sets not enough?

"Example": Suppose many models are compared, all with the same skill except for sampling differences, s ± δ. Picking the model with the largest skill, s + δ, yields an estimate larger than the real skill s.
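This "example" can be simulated in a few lines (the numbers are illustrative): every model has the same true skill s, each estimate is noisy, and reporting the maximum inflates the result.

```r
set.seed(8)
s <- 0.3         # true skill, same for every model
delta <- 0.1     # sd of the sampling noise in each skill estimate
n_models <- 20

# Report the best of 20 noisy estimates, 1000 times over
best <- replicate(1000, max(s + rnorm(n_models, 0, delta)))

mean(best)   # well above the true skill s
```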


SST prediction

- Predict monthly SST anomalies from monthly SST anomalies six months before.
- 1982-2009: 28 years.
- PCA of monthly SST anomalies.
- Cross-validation to pick the number of PCs.
- Look at the correlation of those cross-validated forecasts.

What are the problems?

SST prediction

- Leave-one-month-out cross-validation.
- Checked truncations up to 75 PCs.
- Lowest cross-validated error with 66 PCs.

What's wrong with this picture?

[Map: cross-validated correlation, tropical Pacific; mean correlation = 0.64.]

SST prediction

- Leave-one-year-out cross-validation.
- Checked truncations up to 75 PCs.
- Lowest cross-validated error with 6 PCs.

[Map: cross-validated correlation, tropical Pacific; mean correlation = 0.42.]
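The gap between leave-one-month-out and leave-one-year-out can be illustrated with a toy in R (this is not the slides' SST setup; the flexible trend model and sizes are assumptions): with strongly autocorrelated data, a left-out month's near-duplicates remain in the training set, so leave-one-out looks too good, while leaving out the whole surrounding "year" is more honest.

```r
set.seed(9)
n <- 120                                       # 10 "years" of monthly data
y <- as.numeric(arima.sim(list(ar = 0.9), n))  # autocorrelated noise
tm <- 1:n
deg <- 10                                      # overly flexible trend model

fit_pred <- function(keep, i) {
  d <- data.frame(y = y[keep], tm = tm[keep])
  predict(lm(y ~ poly(tm, deg), data = d), data.frame(tm = tm[i]))
}

# Leave one month out: correlated neighbors stay in the training set
loo <- sapply(1:n, function(i) fit_pred(setdiff(1:n, i), i))

# Leave the whole surrounding year out
lyo <- sapply(1:n, function(i) {
  yr <- (i - 1) %/% 12
  fit_pred(setdiff(1:n, yr * 12 + 1:12), i)
})

mse <- c(loo = mean((y - loo)^2), lyo = mean((y - lyo)^2))
mse   # the leave-one-month-out error is typically the smaller (optimistic) one
```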

Summary

- Efficient method to compute leave-one-out cross-validation for linear regression.
- There are some biases with CV; climatology forecasts have negative correlation.
- Include model/predictor selection in the CV.
- Left-out data must be independent.
