Cross-validation to assess decoder performance: the good, the bad, and the ugly

Gaël Varoquaux
https://hal.archives-ouvertes.fr/hal-01332785


Page 1: Cross-validation to assess decoder performance: the good, the bad, and the ugly

Cross-validation to assess decoder performance: the good, the bad, and the ugly

Gaël Varoquaux

https://hal.archives-ouvertes.fr/hal-01332785

Page 2: Cross-validation to assess decoder performance: the good, the bad, and the ugly

Measuring prediction accuracy

To find the best method (computer scientists)

For information mapping = omnibus test (cognitive neuroimaging)

Cross-validation: asymptotically unbiased, non-parametric

Page 3: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Some theory

2 Empirical results on brain imaging


Page 4: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Some theory

[Diagram: the full data split into a train set and a test set]

Page 5: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Cross-validation: test on independent data

[Diagram: the full data split into a train set and a test set; the split is repeated in a loop, with a validation set held out]

Measures prediction accuracy
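
A minimal sketch of this loop, assuming Python and scikit-learn (the slides name no library) and synthetic stand-ins for the decoding data:

```python
# Hedged sketch of the cross-validation loop: split the full data into
# train and test sets, loop over splits, measure prediction accuracy.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 50)       # hypothetical: 200 samples, 50 features
y = rng.randint(0, 2, 200)   # hypothetical binary labels

# Each fold trains on the train set and scores on the held-out test set
scores = cross_val_score(LinearSVC(), X, y, cv=KFold(n_splits=5))
print("accuracy per fold:", scores, "mean:", scores.mean())
```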


Page 7: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Choice of cross-validation strategy

Test on independent data; be robust to confounding dependencies:
leave subjects out, or sessions out

Loop: more loops = more data points

Need to balance the error in training the model against the error on the test set
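
One way to implement "leave subjects out" is scikit-learn's LeaveOneGroupOut (an assumption; `groups` is a hypothetical array giving each sample's subject):

```python
# Hedged sketch: leave whole subjects (or sessions) out, so that the
# within-subject dependencies never straddle the train/test split.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(120, 30)                 # synthetic stand-in data
y = rng.randint(0, 2, 120)
groups = np.repeat(np.arange(6), 20)   # hypothetical: 6 subjects, 20 samples each

# Each fold holds out every sample of one subject
scores = cross_val_score(LinearSVC(), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores)
```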

Page 8: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Choice of cross-validation strategy: theory

Negative bias (underestimates performance), decreasing with the size of the training set [Arlot & Celisse 2010, sec. 5.1]

Variance decreases with the size of the test set [Arlot & Celisse 2010, sec. 5.2]

Fraction of data left out: 10–20%
Many random splits of the data, respecting the dependency structure
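
This recommendation maps naturally onto scikit-learn's GroupShuffleSplit (an assumption; data and group labels below are synthetic): many random splits, 20% left out, dependencies kept within groups:

```python
# Hedged sketch of the recommended scheme: 50 random splits leaving
# 20% out, never separating samples of the same subject/session.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)
groups = np.repeat(np.arange(10), 20)   # hypothetical subject labels

cv = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, groups=groups, cv=cv)
print("mean accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```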

Page 9: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Tuning hyper-parameters

Computer scientist says: you need to set C in your SVM

[Figure: training-set and validation-set accuracy as C is swept from 10⁻⁴ to 10⁴]
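
A sketch of the sweep behind this figure, assuming scikit-learn's validation_curve and synthetic data:

```python
# Hedged sketch: sweep C over the range shown on the slide (10^-4 to
# 10^4) and compare training-set and validation-set accuracy.
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)

Cs = np.logspace(-4, 4, 9)
train_scores, valid_scores = validation_curve(
    SVC(kernel="linear"), X, y, param_name="C", param_range=Cs, cv=5)
for C, tr, va in zip(Cs, train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print("C=%g  train=%.2f  validation=%.2f" % (C, tr, va))
```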


Page 11: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Nested cross-validation: test on independent data

Two loops:
[Diagram: the full data is split into a validation set and the rest; the outer loop splits the rest into train and test sets, and a nested loop splits each train set again for tuning]
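
A minimal nested cross-validation sketch (scikit-learn assumed): the nested loop picks C, the outer loop measures accuracy on data the tuning never saw:

```python
# Hedged sketch of nested cross-validation on synthetic data.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)

# Nested loop: choose C by cross-validation inside each training set
inner = GridSearchCV(SVC(kernel="linear"),
                     {"C": np.logspace(-4, 4, 9)}, cv=5)
# Outer loop: measure prediction accuracy of the whole tuned procedure
scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.2f" % scores.mean())
```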

Page 12: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Empirical results on brain imaging

Page 13: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Datasets and tasks

7 fMRI datasets (6 from OpenfMRI)
Haxby: 5 subjects, 15 intra-subject predictions
Inter-subject predictions on 6 studies

OASIS VBM, gender discrimination

HCP MEG task, intra-subject, working memory

# samples: ∼200 (min 80, max 400); accuracy: min 62%, max 96%

Page 14: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Experiment 1: measuring cross-validation error

Leave out a large validation set
Measure error by cross-validation on the rest
Compare

[Diagram: nested cross-validation on the data outside the validation set]
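
A hedged sketch of this protocol (scikit-learn and synthetic data assumed; not the authors' code):

```python
# Hold out a large validation set, estimate accuracy by
# cross-validation on the rest, and compare the two numbers:
# their difference is the cross-validation error being measured.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = rng.randn(400, 30), rng.randint(0, 2, 400)

X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)   # large validation set

cv_estimate = cross_val_score(LinearSVC(), X_rest, y_rest, cv=10).mean()
val_accuracy = LinearSVC().fit(X_rest, y_rest).score(X_val, y_val)
print("CV estimate: %.2f  validation set: %.2f"
      % (cv_estimate, val_accuracy))
```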

Page 15: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Cross-validated measure versus validation set

[Figure: accuracy measured by cross-validation (y-axis) against accuracy on the validation set (x-axis), both from 50% to 100%, for intra-subject and inter-subject predictions]


Page 19: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Different cross-validation strategies

Difference in accuracy measured by cross-validation and on the validation set:

Cross-validation strategy        Intra-subject     Inter-subject
Leave one sample out             -22% to +19%      +3% to +43%
Leave one subject/session out    -10% to +10%      -21% to +17%
20% left out, 3 splits           -11% to +11%      -24% to +16%
20% left out, 10 splits          -9% to +9%        -24% to +14%
20% left out, 50 splits          -9% to +8%        -23% to +13%

Page 20: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Simple simulations

[Figure: two Gaussian-separated clouds in the (X1, X2) plane; X1 plotted over time shows the autocorrelated noise]

2 Gaussian-separated clouds
Autocorrelated noise

200 decoding samples, 10 000 validation samples
⇒ validation = asymptotics
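
A sketch of such a simulation; the AR(1) form of the autocorrelated noise is an assumption, as the slide does not spell out the noise model:

```python
# Two Gaussian-separated clouds along X1, with noise autocorrelated
# over time; 10 000 validation samples make the validation accuracy
# essentially asymptotic.
import numpy as np

def simulate(n_samples, separation=1.0, rho=0.9, n_features=2, seed=0):
    rng = np.random.RandomState(seed)
    y = rng.randint(0, 2, n_samples)
    noise = rng.randn(n_samples, n_features)
    for t in range(1, n_samples):
        # assumed AR(1): e_t = rho * e_{t-1} + scaled innovation
        noise[t] = rho * noise[t - 1] + np.sqrt(1 - rho ** 2) * noise[t]
    X = noise
    X[:, 0] += separation * (2 * y - 1)   # class signal carried by X1
    return X, y

X_decode, y_decode = simulate(200)            # 200 decoding samples
X_valid, y_valid = simulate(10_000, seed=1)   # 10 000 validation samples
```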


Page 22: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Different cross-validation strategies

Difference in accuracy measured by cross-validation and on the validation set:

Cross-validation strategy    MEG data          Simulations
Leave one sample out         -16% to +14%      +4% to +33%
Leave one block out          -15% to +13%      -8% to +8%
20% left out, 3 splits       -15% to +12%      -10% to +11%
20% left out, 10 splits      -13% to +10%      -8% to +8%
20% left out, 50 splits      -12% to +10%      -7% to +7%

Page 23: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Experiment 2: parameter tuning

Compare different strategies on the validation set:
1. Use the default C = 1
2. Use C = 1000
3. Choose the best C by cross-validation and refit
4. Average the best models in cross-validation

[Diagram: nested cross-validation on the data outside the validation set]

Non-sparse decoders: SVM ℓ2, log-reg ℓ2
Sparse decoders: SVM ℓ1, log-reg ℓ1
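
A hedged sketch of the four strategies (scikit-learn and synthetic data assumed; the averaging step is one plausible reading of "average the best models in cross-validation", not the authors' exact code):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)          # train data
X_val, y_val = rng.randn(200, 30), rng.randint(0, 2, 200)  # validation set
Cs = np.logspace(-3, 3, 7)

# 1. and 2.: fixed values of C
for C in (1.0, 1000.0):
    print("C=%g:" % C, LinearSVC(C=C).fit(X, y).score(X_val, y_val))

# 3. Choose the best C by cross-validation, then refit on all the data
refit = GridSearchCV(LinearSVC(), {"C": Cs}, cv=5).fit(X, y)
print("CV + refitting:", refit.score(X_val, y_val))

# 4. Average, over folds, the best model found in each fold
votes = np.zeros(len(y_val))
for train, test in KFold(n_splits=5).split(X):
    fits = [LinearSVC(C=C).fit(X[train], y[train]) for C in Cs]
    best = max(fits, key=lambda m: m.score(X[test], y[test]))
    votes += best.decision_function(X_val)
print("CV + averaging:", np.mean((votes > 0) == y_val))
```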


Page 25: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Cross-validation for tuning?

[Figure: impact on prediction accuracy (-8% to +8%) of each strategy (CV + averaging, CV + refitting, C=1, C=1000), for SVM and log-reg, on non-sparse and sparse models]

Page 26: Cross-validation to assess decoder performance: the good, the bad, and the ugly

@GaelVaroquaux

Cross-validation: lessons learned

Don't use leave-one-out; use random 10–20% splits respecting the sample structure

Cross-validation has error bars of ±10%

Cross-validation is inefficient for parameter tuning:
- C = 1 for SVM-ℓ2
- model averaging for SVM-ℓ1

https://hal.archives-ouvertes.fr/hal-01332785


Page 30: Cross-validation to assess decoder performance: the good, the bad, and the ugly

References I

S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.