
Page 1

Steep learning curves

Reading: Bishop Ch. 3.0, 3.1

Page 2

Administrivia

•Reminder:

•Microsoft on campus for recruiting

•Next Mon, Feb 5

•FEC141, 11:00 AM

•All welcome

Page 3

Viewing and re-viewing

•Last time:

•(4)5 minutes of math: function optimization

•Measuring performance

•Today:

•Cross-validation

•Learning curves

Page 4

Separation of train & test

•Fundamental principle (1st amendment of ML):

•Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!

Page 5

Holdout data

•Usual to “hold out” a separate set of data for testing; not used to train the classifier

•A.k.a., test set, holdout set, evaluation set, etc.

•E.g.,

• $\text{acc}_{\text{train}} = \frac{1}{|D_{\text{train}}|}\sum_{(x,y)\in D_{\text{train}}} \mathbf{1}[f(x)=y]$ is training set (or empirical) accuracy

• $\text{acc}_{\text{test}} = \frac{1}{|D_{\text{test}}|}\sum_{(x,y)\in D_{\text{test}}} \mathbf{1}[f(x)=y]$ is test set (or generalization) accuracy
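A minimal sketch of this split-and-score procedure, assuming scikit-learn (the synthetic dataset and decision tree are illustrative choices of mine, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Hold out 30% of the data; the learner never sees it during training.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(Xtrain, ytrain)
print(model.score(Xtrain, ytrain))  # training-set (empirical) accuracy
print(model.score(Xtest, ytest))    # test-set (generalization) accuracy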

Page 6

Gotchas...

•What if you’re unlucky when you split data into train/test?

•E.g., all train data are class A and all test are class B?

•No “red” things show up in training data

•Best answer: stratification

•Try to make sure class (+feature) ratios are the same in train/test sets (and the same as in the original data)

•Why does this work?
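A hedged sketch of stratification using scikit-learn’s built-in stratify option (the imbalanced synthetic dataset is a stand-in of mine):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, weights=[0.7], random_state=0)

# stratify=y keeps the class ratio in train and test (roughly) equal
# to the ratio in the original data.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(ytr) / len(ytr))  # class ratios in train...
print(np.bincount(yte) / len(yte))  # ...match those in test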

Page 7

Gotchas...

•What if you’re unlucky when you split data into train/test?

•E.g., all train data are class A and all test are class B?

•No “red” things show up in training data

•Almost as good: randomization

•Shuffle data randomly before split

•Why does this work?
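The randomization fix is nearly a one-liner with NumPy; a minimal sketch (dataset and array names are mine):

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, random_state=0)

perm = np.random.default_rng(0).permutation(len(X))  # random reordering of indices
X, y = X[perm], y[perm]                 # shuffle features and labels together
split = int(0.7 * len(X))
Xtrain, ytrain = X[:split], y[:split]   # first 70% for training
Xtest, ytest = X[split:], y[split:]     # remaining 30% for testing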

Page 8

Gotchas

•What if the data is small?

•N=50 or N=20 or even N=10

•Can’t do perfect stratification

•Can’t get representative accuracy from any single train/test split

Page 9

Gotchas

•No good answer

•Common answer: cross-validation

•Shuffle data vectors

•Break into k chunks

•Train on first k−1 chunks

•Test on last 1

•Repeat, with a different chunk held out

•Average all test accuracies together

Page 10

Gotchas

• In code:

for (i = 0; i < k; ++i) {
  [Xtrain, Ytrain, Xtest, Ytest] = splitData(X, Y, N/k, i);
  model[i] = train(Xtrain, Ytrain);
  cvAccs[i] = measureAcc(model[i], Xtest, Ytest);
}
avgAcc = mean(cvAccs);
stdAcc = stddev(cvAccs);
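The same loop as runnable Python, a sketch under assumptions of my own (splitData, train, and measureAcc above are the slide’s pseudocode helpers, not a real API; here NumPy does the chunking and a decision tree stands in for the learner):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
k = 10

idx = np.random.default_rng(0).permutation(len(X))  # shuffle the data vectors
folds = np.array_split(idx, k)                      # break into k chunks

accs = []
for i in range(k):
    test_idx = folds[i]                             # hold out chunk i
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))

avgAcc = np.mean(accs)             # average all test accuracies together
stdAcc = np.std(accs, ddof=1)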

Page 11

CV in pix

[Figure: original data [X; y] → random shuffle → [X′; y′] → k-way partition into [X1′; y1′], [X2′; y2′], …, [Xk′; yk′] → k train/test sets → k test accuracies (e.g., 53.7%, 85.1%, 73.2%)]

Page 12

But is it really learning?

•Now we know how well our models are performing

•But are they really learning?

•Maybe any classifier would do as well

•E.g., a default classifier (pick the most likely class) or a random classifier

•How can we tell if the model is learning anything?
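One concrete way to get these baselines, assuming scikit-learn (its DummyClassifier provides exactly the default and random strategies; the data setup is mine):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, weights=[0.7], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Default classifier: always predict the most likely (most frequent) class.
default = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
# Random classifier: guess labels at random, matching class frequencies.
random_guess = DummyClassifier(strategy="stratified").fit(Xtr, ytr)

# A model that is actually learning should beat both baselines.
print(default.score(Xte, yte), random_guess.score(Xte, yte))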

Page 13

The learning curve

•Train on successively larger fractions of data

•Watch how accuracy (performance) changes

[Figure: accuracy vs. training-set size, with three curves labeled “Learning” (rising), “Static classifier (no learning)” (flat), and “Anti-learning (forgetting)” (falling)]
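A minimal sketch of building such a curve, under assumptions of my own (synthetic data, a decision tree, accuracy measured on one fixed held-out test set):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

curve = []
for frac in np.arange(0.1, 1.01, 0.1):        # successively larger fractions
    n = int(frac * len(Xtr))
    model = DecisionTreeClassifier().fit(Xtr[:n], ytr[:n])
    curve.append(model.score(Xte, yte))       # accuracy on the fixed test set
# A rising curve means learning; flat means no learning; falling means forgetting.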

Page 14

Measuring variance

•Cross-validation helps you get a better estimate of accuracy for small data

•Randomization (shuffling the data) helps guard against poor splits/ordering of the data

•Learning curves help assess learning rate/asymptotic accuracy

•Still one big missing component: variance

•Definition: Variance of a classifier is the fraction of error due to the specific data set it’s trained on

Page 15

Measuring variance

•Variance tells you how much you expect your classifier/performance to change when you train it on a new (but similar) data set

•E.g., take 5 samplings of a data source; train/test 5 classifiers

•Accuracies: 74.2, 90.3, 58.1, 80.6, 90.3

•Mean accuracy: 78.7%

•Std dev of acc: 13.4%

•Variance is usually a function of both classifier and data source

•High variance classifiers are very susceptible to small changes in data
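Reproducing the slide’s numbers, assuming the sample standard deviation (ddof=1) was used:

import numpy as np

accs = np.array([74.2, 90.3, 58.1, 80.6, 90.3])
print(accs.mean())        # 78.7  (mean accuracy)
print(accs.std(ddof=1))   # 13.39... ≈ 13.4  (std dev across the 5 trainings)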

Page 16

Putting it all together

•Suppose you want to measure the expected accuracy of your classifier, assess learning rate, and measure variance all at the same time?

for (i = 0; i < 10; ++i) {                    // variance reps
  shuffle data
  do 10-way CV partition of data
  for each train/test partition {             // xval
    for (pct = 0.1; pct <= 0.9; pct += 0.1) { // LC
      subsample pct fraction of training set
      train on subsample, test on test set
    }
  }
  avg across all folds of CV partition
  generate learning curve for this partition
}
get mean and std across all curves
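The same procedure as runnable Python, a sketch under assumptions of my own (scikit-learn’s KFold for the 10-way partition, a decision tree as the classifier, synthetic data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
fracs = np.arange(0.1, 1.0, 0.1)               # 0.1 .. 0.9
curves = []

for rep in range(10):                                        # variance reps
    rng = np.random.default_rng(rep)
    kf = KFold(n_splits=10, shuffle=True, random_state=rep)  # shuffle + 10-way CV
    fold_curves = []
    for train_idx, test_idx in kf.split(X):                  # xval
        accs = []
        for frac in fracs:                                   # learning curve
            sub = rng.choice(train_idx, int(frac * len(train_idx)), replace=False)
            m = DecisionTreeClassifier().fit(X[sub], y[sub])
            accs.append(m.score(X[test_idx], y[test_idx]))
        fold_curves.append(accs)
    curves.append(np.mean(fold_curves, axis=0))  # avg across folds: one curve per rep

mean_curve = np.mean(curves, axis=0)         # mean across all curves
std_curve = np.std(curves, axis=0, ddof=1)   # std across all curves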

Page 17

Putting it all together

[Figure: learning curves (mean ± std across reps) for the “hepatitis” data]

Page 18

5 minutes of math...

•Decision trees make very few assumptions about data

•Don’t know anything about relations between instances, except sets induced by feature splits

•No sense of spatial/topological relations among data

•Often, our data is real, honest-to-Cthulhu, mathematically sound vector data

•As opposed to the informal sense of vector that I have used so far

•Often comes endowed with a natural inner product and norm

Page 19

5 minutes of math

•Mathematicians like to study the properties of spaces in general

•From linear algebra, you’ve already met the notion of a vector space:

•Definition: a vector space, V, is a set of elements (vectors) plus a scalar field, F, such that the following properties hold:

•Vector addition: a + b ∈ V for all a, b ∈ V

•Scalar multiplication: c·a ∈ V for all c ∈ F, a ∈ V

•Linearity; commutativity; associativity; etc.

•(Canonical example: ℝⁿ over the reals, with componentwise addition and scalar multiplication)

Page 20

5 minutes of math

•By themselves, vector spaces are only partially useful

•They get more useful when you add a norm and an inner product

Page 21

5 minutes of math

•Definition: a norm, ‖·‖, is a function of a single vector (∈ V) that returns a scalar (∈ F) such that for all a, b ∈ V and c ∈ F:

• ‖a‖ ≥ 0

• ‖c a‖ = |c| ‖a‖

• ‖a + b‖ ≤ ‖a‖ + ‖b‖

• Intuition: the norm gives you the length of a vector

•A vector space + norm ⇒ Banach space (*)
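A familiar instance, added here as an illustration (not from the slides): the Euclidean norm on ℝⁿ,

$$\|a\| = \sqrt{\sum_{i=1}^{n} a_i^2},$$

is non-negative, satisfies $\|ca\| = |c|\,\|a\|$, and obeys the triangle inequality via Cauchy–Schwarz. The (*) presumably flags that a Banach space also requires completeness, which ℝⁿ satisfies.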

Page 22

5 minutes of math

•Definition: an inner product, ⟨·, ·⟩, is a function of two vectors (∈ V) that returns a scalar (∈ F) such that:

•Symmetry: ⟨a, b⟩ = ⟨b, a⟩

•Linearity in first variable: ⟨c·a + b, d⟩ = c⟨a, d⟩ + ⟨b, d⟩

•Non-negativity: ⟨a, a⟩ ≥ 0

•Non-degeneracy: ⟨a, a⟩ = 0 ⟺ a = 0

•A vector space + inner product ⇒ Hilbert space (*)
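A quick numerical sanity check, assuming NumPy (my example, not from the slides): the standard dot product is an inner product on ℝⁿ, and it induces the Euclidean norm via ‖a‖ = √⟨a, a⟩.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

print(np.dot(a, b) == np.dot(b, a))                           # symmetry
print(np.dot(2*a + b, b) == 2*np.dot(a, b) + np.dot(b, b))    # linearity in 1st arg
print(np.dot(a, a) >= 0)                                      # non-negativity
print(np.sqrt(np.dot(a, a)) == np.linalg.norm(a))             # induced norm: ‖a‖ = 5.0

As with Banach spaces, the (*) likely flags the additional completeness requirement.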