
Page 1

Steep learning curves

Reading: Bishop Ch. 3.0, 3.1

Page 2

Administrivia

•Reminder:

•Microsoft on campus for recruiting

•Next Mon, Feb 5

•FEC141, 11:00 AM

•All welcome

Page 3

Viewing and re-viewing

•Last time:

•(4)5 minutes of math: function optimization

•Measuring performance

•Today:

•Cross-validation

•Learning curves

Page 4

Separation of train & test

•Fundamental principle (1st amendment of ML):

•Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!

Page 5

Holdout data

•Usual to “hold out” a separate set of data for testing; not used to train the classifier

•A.k.a., test set, holdout set, evaluation set, etc.

•E.g.,

• $\text{acc}_{\text{train}} = \frac{1}{|D_{\text{train}}|}\sum_{(x,y)\in D_{\text{train}}} \mathbf{1}[f(x)=y]$ is training set (or empirical) accuracy

• $\text{acc}_{\text{test}} = \frac{1}{|D_{\text{test}}|}\sum_{(x,y)\in D_{\text{test}}} \mathbf{1}[f(x)=y]$ is test set (or generalization) accuracy
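A minimal sketch of this split-and-score procedure, assuming scikit-learn (the synthetic dataset and decision tree are illustrative choices of mine, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Hold out 30% of the data; the learner never sees it during training.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(Xtrain, ytrain)
print(model.score(Xtrain, ytrain))  # training-set (empirical) accuracy
print(model.score(Xtest, ytest))    # test-set (generalization) accuracy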

Page 6

Gotchas...

•What if you’re unlucky when you split data into train/test?

•E.g., all train data are class A and all test are class B?

•No “red” things show up in training data

•Best answer: stratification

•Try to make sure class (+feature) ratios are the same in train/test sets (and the same as in the original data)

•Why does this work?
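A hedged sketch of stratification using scikit-learn’s built-in stratify option (the imbalanced synthetic dataset is a stand-in of mine):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, weights=[0.7], random_state=0)

# stratify=y keeps the class ratio in train and test (roughly) equal
# to the ratio in the original data.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(ytr) / len(ytr))  # class ratios in train...
print(np.bincount(yte) / len(yte))  # ...match those in test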

Page 7

Gotchas...

•What if you’re unlucky when you split data into train/test?

•E.g., all train data are class A and all test are class B?

•No “red” things show up in training data

•Almost as good: randomization

•Shuffle data randomly before split

•Why does this work?
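The randomization fix is nearly a one-liner with NumPy; a minimal sketch (dataset and array names are mine):

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, random_state=0)

perm = np.random.default_rng(0).permutation(len(X))  # random reordering of indices
X, y = X[perm], y[perm]                 # shuffle features and labels together
split = int(0.7 * len(X))
Xtrain, ytrain = X[:split], y[:split]   # first 70% for training
Xtest, ytest = X[split:], y[split:]     # remaining 30% for testing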

Page 8

Gotchas

•What if the data is small?

•N=50 or N=20 or even N=10

•Can’t do perfect stratification

•Can’t get representative accuracy from any single train/test split

Page 9

Gotchas

•No good answer

•Common answer: cross-validation

•Shuffle data vectors

•Break into k chunks

•Train on first k−1 chunks

•Test on last 1

•Repeat, with a different chunk held out

•Average all test accuracies together

Page 10

Gotchas

• In code:

for (i = 0; i < k; ++i) {
  [Xtrain, Ytrain, Xtest, Ytest] = splitData(X, Y, N/k, i);
  model[i] = train(Xtrain, Ytrain);
  cvAccs[i] = measureAcc(model[i], Xtest, Ytest);
}
avgAcc = mean(cvAccs);
stdAcc = stddev(cvAccs);
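The same loop as runnable Python, a sketch under assumptions of my own (splitData, train, and measureAcc above are the slide’s pseudocode helpers, not a real API; here NumPy does the chunking and a decision tree stands in for the learner):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
k = 10

idx = np.random.default_rng(0).permutation(len(X))  # shuffle the data vectors
folds = np.array_split(idx, k)                      # break into k chunks

accs = []
for i in range(k):
    test_idx = folds[i]                             # hold out chunk i
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))

avgAcc = np.mean(accs)             # average all test accuracies together
stdAcc = np.std(accs, ddof=1)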

Page 11

CV in pix

[Figure: original data [X; y] → random shuffle → [X′; y′] → k-way partition into [X1′; y1′], [X2′; y2′], …, [Xk′; yk′] → k train/test sets → k test accuracies (e.g., 53.7%, 85.1%, 73.2%)]

Page 12

But is it really learning?

•Now we know how well our models are performing

•But are they really learning?

•Maybe any classifier would do as well

•E.g., a default classifier (pick the most likely class) or a random classifier

•How can we tell if the model is learning anything?
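One concrete way to get these baselines, assuming scikit-learn (its DummyClassifier provides exactly the default and random strategies; the data setup is mine):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, weights=[0.7], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Default classifier: always predict the most likely (most frequent) class.
default = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
# Random classifier: guess labels at random, matching class frequencies.
random_guess = DummyClassifier(strategy="stratified").fit(Xtr, ytr)

# A model that is actually learning should beat both baselines.
print(default.score(Xte, yte), random_guess.score(Xte, yte))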

Page 13

The learning curve

•Train on successively larger fractions of data

•Watch how accuracy (performance) changes

[Figure: accuracy vs. training-set size, with three curves labeled “Learning” (rising), “Static classifier (no learning)” (flat), and “Anti-learning (forgetting)” (falling)]
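A minimal sketch of building such a curve, under assumptions of my own (synthetic data, a decision tree, accuracy measured on one fixed held-out test set):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

curve = []
for frac in np.arange(0.1, 1.01, 0.1):        # successively larger fractions
    n = int(frac * len(Xtr))
    model = DecisionTreeClassifier().fit(Xtr[:n], ytr[:n])
    curve.append(model.score(Xte, yte))       # accuracy on the fixed test set
# A rising curve means learning; flat means no learning; falling means forgetting.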

Page 14

Measuring variance

•Cross-validation helps you get a better estimate of accuracy for small data

•Randomization (shuffling the data) helps guard against poor splits/ordering of the data

•Learning curves help assess learning rate/asymptotic accuracy

•Still one big missing component: variance

•Definition: Variance of a classifier is the fraction of error due to the specific data set it’s trained on

Page 15

Measuring variance

•Variance tells you how much you expect your classifier/performance to change when you train it on a new (but similar) data set

•E.g., take 5 samplings of a data source; train/test 5 classifiers

•Accuracies: 74.2, 90.3, 58.1, 80.6, 90.3

•Mean accuracy: 78.7%

•Std dev of acc: 13.4%

•Variance is usually a function of both classifier and data source

•High variance classifiers are very susceptible to small changes in data
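Reproducing the slide’s numbers, assuming the sample standard deviation (ddof=1) was used:

import numpy as np

accs = np.array([74.2, 90.3, 58.1, 80.6, 90.3])
print(accs.mean())        # 78.7  (mean accuracy)
print(accs.std(ddof=1))   # 13.39... ≈ 13.4  (std dev across the 5 trainings)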

Page 16

Putting it all together

•Suppose you want to measure the expected accuracy of your classifier, assess learning rate, and measure variance all at the same time?

for (i = 0; i < 10; ++i) {                    // variance reps
  shuffle data
  do 10-way CV partition of data
  for each train/test partition {             // xval
    for (pct = 0.1; pct <= 0.9; pct += 0.1) { // LC
      subsample pct fraction of training set
      train on subsample, test on test set
    }
  }
  avg across all folds of CV partition
  generate learning curve for this partition
}
get mean and std across all curves
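The same procedure as runnable Python, a sketch under assumptions of my own (scikit-learn’s KFold for the 10-way partition, a decision tree as the classifier, synthetic data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
fracs = np.arange(0.1, 1.0, 0.1)               # 0.1 .. 0.9
curves = []

for rep in range(10):                                        # variance reps
    rng = np.random.default_rng(rep)
    kf = KFold(n_splits=10, shuffle=True, random_state=rep)  # shuffle + 10-way CV
    fold_curves = []
    for train_idx, test_idx in kf.split(X):                  # xval
        accs = []
        for frac in fracs:                                   # learning curve
            sub = rng.choice(train_idx, int(frac * len(train_idx)), replace=False)
            m = DecisionTreeClassifier().fit(X[sub], y[sub])
            accs.append(m.score(X[test_idx], y[test_idx]))
        fold_curves.append(accs)
    curves.append(np.mean(fold_curves, axis=0))  # avg across folds: one curve per rep

mean_curve = np.mean(curves, axis=0)         # mean across all curves
std_curve = np.std(curves, axis=0, ddof=1)   # std across all curves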

Page 17

Putting it all together

[Figure: learning curves (mean ± std across reps) for the “hepatitis” data]

Page 18

5 minutes of math...

•Decision trees make very few assumptions about data

•Don’t know anything about relations between instances, except sets induced by feature splits

•No sense of spatial/topological relations among data

•Often, our data is real, honest-to-Cthulhu, mathematically sound vector data

•As opposed to the informal sense of vector that I have used so far

•Often comes endowed with a natural inner product and norm

Page 19

5 minutes of math

•Mathematicians like to study the properties of spaces in general

•From linear algebra, you’ve already met the notion of a vector space:

•Definition: a vector space, V, is a set of elements (vectors) plus a scalar field, F, such that the following properties hold:

•Vector addition: a + b ∈ V for all a, b ∈ V

•Scalar multiplication: c·a ∈ V for all c ∈ F, a ∈ V

•Linearity; commutativity; associativity; etc.

•(Canonical example: ℝⁿ over the reals, with componentwise addition and scalar multiplication)

Page 20

5 minutes of math

•By themselves, vector spaces are only partially useful

•They get more useful when you add a norm and an inner product

Page 21

5 minutes of math

•Definition: a norm, ‖·‖, is a function of a single vector (∈ V) that returns a scalar (∈ F) such that for all a, b ∈ V and c ∈ F:

• ‖a‖ ≥ 0

• ‖c a‖ = |c| ‖a‖

• ‖a + b‖ ≤ ‖a‖ + ‖b‖

• Intuition: the norm gives you the length of a vector

•A vector space + norm ⇒ Banach space (*)
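A familiar instance, added here as an illustration (not from the slides): the Euclidean norm on ℝⁿ,

$$\|a\| = \sqrt{\sum_{i=1}^{n} a_i^2},$$

is non-negative, satisfies $\|ca\| = |c|\,\|a\|$, and obeys the triangle inequality via Cauchy–Schwarz. The (*) presumably flags that a Banach space also requires completeness, which ℝⁿ satisfies.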

Page 22

5 minutes of math

•Definition: an inner product, ⟨·, ·⟩, is a function of two vectors (∈ V) that returns a scalar (∈ F) such that:

•Symmetry: ⟨a, b⟩ = ⟨b, a⟩

•Linearity in first variable: ⟨c·a + b, d⟩ = c⟨a, d⟩ + ⟨b, d⟩

•Non-negativity: ⟨a, a⟩ ≥ 0

•Non-degeneracy: ⟨a, a⟩ = 0 ⟺ a = 0

•A vector space + inner product ⇒ Hilbert space (*)
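A quick numerical sanity check, assuming NumPy (my example, not from the slides): the standard dot product is an inner product on ℝⁿ, and it induces the Euclidean norm via ‖a‖ = √⟨a, a⟩.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

print(np.dot(a, b) == np.dot(b, a))                           # symmetry
print(np.dot(2*a + b, b) == 2*np.dot(a, b) + np.dot(b, b))    # linearity in 1st arg
print(np.dot(a, a) >= 0)                                      # non-negativity
print(np.sqrt(np.dot(a, a)) == np.linalg.norm(a))             # induced norm: ‖a‖ = 5.0

As with Banach spaces, the (*) likely flags the additional completeness requirement.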