Cross-validation to assess decoder performance: the good, the bad, and the ugly

Gaël Varoquaux
https://hal.archives-ouvertes.fr/hal-01332785


Page 1: Cross-validation to assess decoder performance: the good, the bad, and the ugly

Cross-validation to assess decoder performance: the good, the bad, and the ugly

Gaël Varoquaux

https://hal.archives-ouvertes.fr/hal-01332785

Page 2: Cross-validation to assess decoder performance: the good, the bad, and the ugly

Measuring prediction accuracy

To find the best method (computer scientists)

For information mapping = omnibus test (cognitive neuroimaging)

Cross-validation: asymptotically unbiased, non-parametric

Page 3: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Some theory

2 Empirical results on brain imaging


Page 4: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Some theory

[Diagram: the full data split into a train set and a test set]

Page 5: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Cross-validation: test on independent data

[Diagram: the full data split into a train set and a test set; the split is repeated in a loop, with a validation set held out]

Measures prediction accuracy
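
A minimal sketch of this loop, assuming Python and scikit-learn (the slides name no library) and synthetic stand-ins for the decoding data:

```python
# Hedged sketch of the cross-validation loop: split the full data into
# train and test sets, loop over splits, measure prediction accuracy.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 50)       # hypothetical: 200 samples, 50 features
y = rng.randint(0, 2, 200)   # hypothetical binary labels

# Each fold trains on the train set and scores on the held-out test set
scores = cross_val_score(LinearSVC(), X, y, cv=KFold(n_splits=5))
print("accuracy per fold:", scores, "mean:", scores.mean())
```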


Page 7: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Choice of cross-validation strategy

Test on independent data; be robust to confounding dependencies:
leave subjects out, or sessions out

Loop: more loops = more data points

Need to balance the error in training the model against the error on the test set
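
One way to implement "leave subjects out" is scikit-learn's LeaveOneGroupOut (an assumption; `groups` is a hypothetical array giving each sample's subject):

```python
# Hedged sketch: leave whole subjects (or sessions) out, so that the
# within-subject dependencies never straddle the train/test split.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(120, 30)                 # synthetic stand-in data
y = rng.randint(0, 2, 120)
groups = np.repeat(np.arange(6), 20)   # hypothetical: 6 subjects, 20 samples each

# Each fold holds out every sample of one subject
scores = cross_val_score(LinearSVC(), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores)
```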

Page 8: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Choice of cross-validation strategy: theory

Negative bias (underestimates performance), decreasing with the size of the training set [Arlot & Celisse 2010, sec. 5.1]

Variance decreases with the size of the test set [Arlot & Celisse 2010, sec. 5.2]

Fraction of data left out: 10–20%
Many random splits of the data, respecting the dependency structure
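
This recommendation maps naturally onto scikit-learn's GroupShuffleSplit (an assumption; data and group labels below are synthetic): many random splits, 20% left out, dependencies kept within groups:

```python
# Hedged sketch of the recommended scheme: 50 random splits leaving
# 20% out, never separating samples of the same subject/session.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)
groups = np.repeat(np.arange(10), 20)   # hypothetical subject labels

cv = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, groups=groups, cv=cv)
print("mean accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```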

Page 9: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Tuning hyper-parameters

Computer scientist says: you need to set C in your SVM

[Figure: training-set and validation-set accuracy as C is swept from 10⁻⁴ to 10⁴]
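
A sketch of the sweep behind this figure, assuming scikit-learn's validation_curve and synthetic data:

```python
# Hedged sketch: sweep C over the range shown on the slide (10^-4 to
# 10^4) and compare training-set and validation-set accuracy.
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)

Cs = np.logspace(-4, 4, 9)
train_scores, valid_scores = validation_curve(
    SVC(kernel="linear"), X, y, param_name="C", param_range=Cs, cv=5)
for C, tr, va in zip(Cs, train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print("C=%g  train=%.2f  validation=%.2f" % (C, tr, va))
```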


Page 11: Cross-validation to assess decoder performance: the good, the bad, and the ugly

1 Nested cross-validation: test on independent data

Two loops:
[Diagram: the full data is split into a validation set and the rest; the outer loop splits the rest into train and test sets, and a nested loop splits each train set again for tuning]
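
A minimal nested cross-validation sketch (scikit-learn assumed): the nested loop picks C, the outer loop measures accuracy on data the tuning never saw:

```python
# Hedged sketch of nested cross-validation on synthetic data.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)

# Nested loop: choose C by cross-validation inside each training set
inner = GridSearchCV(SVC(kernel="linear"),
                     {"C": np.logspace(-4, 4, 9)}, cv=5)
# Outer loop: measure prediction accuracy of the whole tuned procedure
scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.2f" % scores.mean())
```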

Page 12: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Empirical results on brain imaging

Page 13: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Datasets and tasks

7 fMRI datasets (6 from OpenfMRI)
Haxby: 5 subjects, 15 intra-subject predictions
Inter-subject predictions on 6 studies

OASIS VBM, gender discrimination

HCP MEG task, intra-subject, working memory

# samples: ∼200 (min 80, max 400); accuracy: min 62%, max 96%

Page 14: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Experiment 1: measuring cross-validation error

Leave out a large validation set
Measure error by cross-validation on the rest
Compare

[Diagram: nested cross-validation on the data outside the validation set]
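
A hedged sketch of this protocol (scikit-learn and synthetic data assumed; not the authors' code):

```python
# Hold out a large validation set, estimate accuracy by
# cross-validation on the rest, and compare the two numbers:
# their difference is the cross-validation error being measured.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = rng.randn(400, 30), rng.randint(0, 2, 400)

X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)   # large validation set

cv_estimate = cross_val_score(LinearSVC(), X_rest, y_rest, cv=10).mean()
val_accuracy = LinearSVC().fit(X_rest, y_rest).score(X_val, y_val)
print("CV estimate: %.2f  validation set: %.2f"
      % (cv_estimate, val_accuracy))
```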

Page 15: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Cross-validated measure versus validation set

[Figure: accuracy measured by cross-validation (y-axis) against accuracy on the validation set (x-axis), both from 50% to 100%, for intra-subject and inter-subject predictions]


Page 19: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Different cross-validation strategies

Difference in accuracy measured by cross-validation and on the validation set:

Cross-validation strategy        Intra-subject     Inter-subject
Leave one sample out             -22% to +19%      +3% to +43%
Leave one subject/session out    -10% to +10%      -21% to +17%
20% left out, 3 splits           -11% to +11%      -24% to +16%
20% left out, 10 splits          -9% to +9%        -24% to +14%
20% left out, 50 splits          -9% to +8%        -23% to +13%

Page 20: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Simple simulations

[Figure: two Gaussian-separated clouds in the (X1, X2) plane; X1 plotted over time shows the autocorrelated noise]

2 Gaussian-separated clouds
Autocorrelated noise

200 decoding samples, 10 000 validation samples
⇒ validation = asymptotics
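
A sketch of such a simulation; the AR(1) form of the autocorrelated noise is an assumption, as the slide does not spell out the noise model:

```python
# Two Gaussian-separated clouds along X1, with noise autocorrelated
# over time; 10 000 validation samples make the validation accuracy
# essentially asymptotic.
import numpy as np

def simulate(n_samples, separation=1.0, rho=0.9, n_features=2, seed=0):
    rng = np.random.RandomState(seed)
    y = rng.randint(0, 2, n_samples)
    noise = rng.randn(n_samples, n_features)
    for t in range(1, n_samples):
        # assumed AR(1): e_t = rho * e_{t-1} + scaled innovation
        noise[t] = rho * noise[t - 1] + np.sqrt(1 - rho ** 2) * noise[t]
    X = noise
    X[:, 0] += separation * (2 * y - 1)   # class signal carried by X1
    return X, y

X_decode, y_decode = simulate(200)            # 200 decoding samples
X_valid, y_valid = simulate(10_000, seed=1)   # 10 000 validation samples
```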


Page 22: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Different cross-validation strategies

Difference in accuracy measured by cross-validation and on the validation set:

Cross-validation strategy    MEG data          Simulations
Leave one sample out         -16% to +14%      +4% to +33%
Leave one block out          -15% to +13%      -8% to +8%
20% left out, 3 splits       -15% to +12%      -10% to +11%
20% left out, 10 splits      -13% to +10%      -8% to +8%
20% left out, 50 splits      -12% to +10%      -7% to +7%

Page 23: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Experiment 2: parameter tuning

Compare different strategies on the validation set:
1. Use the default C = 1
2. Use C = 1000
3. Choose the best C by cross-validation and refit
4. Average the best models in cross-validation

[Diagram: nested cross-validation on the data outside the validation set]

Non-sparse decoders: SVM ℓ2, log-reg ℓ2
Sparse decoders: SVM ℓ1, log-reg ℓ1
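
A hedged sketch of the four strategies (scikit-learn and synthetic data assumed; the averaging step is one plausible reading of "average the best models in cross-validation", not the authors' exact code):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = rng.randn(200, 30), rng.randint(0, 2, 200)          # train data
X_val, y_val = rng.randn(200, 30), rng.randint(0, 2, 200)  # validation set
Cs = np.logspace(-3, 3, 7)

# 1. and 2.: fixed values of C
for C in (1.0, 1000.0):
    print("C=%g:" % C, LinearSVC(C=C).fit(X, y).score(X_val, y_val))

# 3. Choose the best C by cross-validation, then refit on all the data
refit = GridSearchCV(LinearSVC(), {"C": Cs}, cv=5).fit(X, y)
print("CV + refitting:", refit.score(X_val, y_val))

# 4. Average, over folds, the best model found in each fold
votes = np.zeros(len(y_val))
for train, test in KFold(n_splits=5).split(X):
    fits = [LinearSVC(C=C).fit(X[train], y[train]) for C in Cs]
    best = max(fits, key=lambda m: m.score(X[test], y[test]))
    votes += best.decision_function(X_val)
print("CV + averaging:", np.mean((votes > 0) == y_val))
```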


Page 25: Cross-validation to assess decoder performance: the good, the bad, and the ugly

2 Cross-validation for tuning?

[Figure: impact on prediction accuracy (-8% to +8%) of each strategy (CV + averaging, CV + refitting, C=1, C=1000), for SVM and log-reg, on non-sparse and sparse models]

Page 26: Cross-validation to assess decoder performance: the good, the bad, and the ugly

@GaelVaroquaux

Cross-validation: lessons learned

Don't use leave-one-out; use random 10–20% splits respecting the sample structure

Cross-validation has error bars of ±10%

Cross-validation is inefficient for parameter tuning:
- C = 1 for SVM-ℓ2
- model averaging for SVM-ℓ1

https://hal.archives-ouvertes.fr/hal-01332785


Page 30: Cross-validation to assess decoder performance: the good, the bad, and the ugly

References I

S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.