Lecture 12 – Model Assessment and Selection
Rice ECE697
Farinaz Koushanfar
Fall 2006
Summary
• Bias, variance, model complexity
• Optimism of training error rate
• Estimates of in-sample prediction error, AIC
• Effective number of parameters
• The Bayesian approach and BIC
• Vapnik-Chervonenkis dimension
• Cross-validation
• Bootstrap method
Model Selection Criteria
• Training Error
• Loss Function
• Generalization Error
Training Error vs. Test Error
Model Selection and Assessment
• Model selection:
– Estimating the performance of different models in order to choose the best one
• Model assessment:
– Having chosen a final model, estimating its prediction error (generalization error) on new data
• If we were rich in data, we would split it into three parts:
Train | Validation | Test
Bias-Variance Decomposition
• As we have seen before, for Y = f(X) + ε with E(ε) = 0 and Var(ε) = σε²:
Err(x0) = E[(Y − f̂(x0))² | X = x0] = σε² + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]²
• The first term is the variance of the target around its true mean f(x0); the second term is the squared amount by which our average estimate is off from the true mean; the last term is the variance of f̂(x0)
• The more complex the model f̂, the lower the bias, but the higher the variance
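The decomposition can be checked numerically. A minimal Monte Carlo sketch (the setup below is illustrative, not from the lecture): draw many training sets, fit k-NN at a fixed x0, and compare the average squared error on a fresh target with σε² + Bias² + Variance.

```python
import math
import random

random.seed(0)

def f(x):
    # "true" regression function, an illustrative assumption
    return math.sin(2 * math.pi * x)

sigma, k, N, x0, reps = 0.3, 5, 50, 0.5, 4000
preds, errs = [], []

for _ in range(reps):
    xs = [random.random() for _ in range(N)]
    ys = [f(x) + random.gauss(0, sigma) for x in xs]
    # k-NN estimate at x0: average the responses of the k nearest points
    nearest = sorted(range(N), key=lambda i: abs(xs[i] - x0))[:k]
    fhat = sum(ys[i] for i in nearest) / k
    preds.append(fhat)
    # squared error on a fresh target drawn at x0
    y0 = f(x0) + random.gauss(0, sigma)
    errs.append((y0 - fhat) ** 2)

mean_pred = sum(preds) / reps
bias2 = (mean_pred - f(x0)) ** 2                       # squared bias
var = sum((p - mean_pred) ** 2 for p in preds) / reps  # variance of f-hat
err = sum(errs) / reps                                 # Monte Carlo Err(x0)
print(err, sigma ** 2 + bias2 + var)  # the two quantities should be close
```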
Bias-Variance Decomposition (cont’d)
• For K-nearest neighbors:
Err(x0) = σε² + [f(x0) − (1/k) Σℓ=1..k f(x(ℓ))]² + σε²/k
• For linear regression with p parameters:
Err(x0) = σε² + [f(x0) − E f̂p(x0)]² + ‖h(x0)‖² σε²
Bias-Variance Decomposition (cont’d)
• For linear regression, f̂p(x0) = x0ᵀ(XᵀX)⁻¹Xᵀy, so h(x0) = X(XᵀX)⁻¹x0 is the vector of weights that produces f̂p(x0) = h(x0)ᵀy, and hence Var[f̂p(x0)] = ‖h(x0)‖² σε²
• This variance changes with x0, but its average over the sample values xi is (p/N) σε²
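The average-leverage claim is easy to verify numerically. The sketch below (setup and solver are mine, not from the lecture) computes ‖h(xi)‖² = xiᵀ(XᵀX)⁻¹xi at each training point with a small Gaussian-elimination helper and checks that the average is p/N.

```python
import random

random.seed(1)
N, p = 20, 3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(N)]

# Gram matrix A = X^T X
A = [[sum(X[i][a] * X[i][b] for i in range(N)) for b in range(p)]
     for a in range(p)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= m * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

# leverage ||h(x_i)||^2 = x_i^T (X^T X)^{-1} x_i at each training point
lev = []
for i in range(N):
    a = solve(A, X[i])
    lev.append(sum(X[i][j] * a[j] for j in range(p)))

avg = sum(lev) / N
print(avg, p / N)  # the leverages average to exactly p/N
```

The agreement is exact (up to floating point) because the leverages are the diagonal of the hat matrix, whose trace is p.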
Example
• 50 observations and 20 predictors, uniformly distributed in the hypercube [0,1]²⁰
• Left: Y is 0 if X1 ≤ 1/2 and 1 otherwise; apply k-NN
• Right: Y is 1 if Σj=1..10 Xj > 5 and 0 otherwise
[Figure: prediction error, squared bias, and variance as a function of model complexity for the two examples]
Example – loss function
[Figure: prediction error, squared bias, and variance under this loss function]
Optimism of Training Error
• The training error:
err̄ = (1/N) Σi=1..N L(yi, f̂(xi))
• It is typically less than the true error, since the same data is used to fit and to assess the model
• The in-sample error:
Errin = (1/N) Σi=1..N Ey⁰[L(yi⁰, f̂(xi)) | T]
• Optimism: op ≡ Errin − err̄
• For squared error, 0–1, and other loss functions, one can show in general:
Ey(op) = (2/N) Σi=1..N Cov(ŷi, yi)
Optimism (cont’d)
• Thus, the amount by which the training error underestimates the true error depends on how strongly yi affects its own prediction
• For a linear fit with d inputs or basis functions:
Σi Cov(ŷi, yi) = d σε²
• For an additive model Y = f(X) + ε, and thus:
Ey(Errin) = Ey(err̄) + 2 (d/N) σε²
• Optimism increases linearly with the number of inputs or basis functions d, and decreases as the training size N increases
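The simplest case to simulate is the constant model ŷi = ȳ, for which d = 1 and the optimism should be 2σε²/N. The setup below is an illustrative check, not from the lecture.

```python
import random

random.seed(2)
N, sigma, reps, mu = 10, 1.0, 20000, 3.0

train_errs, insample_errs = [], []
for _ in range(reps):
    y = [mu + random.gauss(0, sigma) for _ in range(N)]
    yhat = sum(y) / N                      # fitted constant model (d = 1)
    train_errs.append(sum((yi - yhat) ** 2 for yi in y) / N)
    # in-sample error: same x's, fresh responses
    ynew = [mu + random.gauss(0, sigma) for _ in range(N)]
    insample_errs.append(sum((yi - yhat) ** 2 for yi in ynew) / N)

optimism = sum(insample_errs) / reps - sum(train_errs) / reps
print(optimism, 2 * sigma ** 2 / N)  # roughly 2*d*sigma^2/N with d = 1
```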
How to account for optimism?
• Estimate the optimism and add it to the training error, e.g., AIC, BIC, etc.
• Bootstrap and cross-validation, in contrast, directly estimate the prediction error
Estimates of In-Sample Prediction Error
• The general form of the in-sample estimate is
Êrrin = err̄ + ôp
• Cp statistic: for an additive error model, when d parameters are fit under squared error loss,
Cp = err̄ + 2 (d/N) σ̂ε²
• Using this criterion, the training error is adjusted by a factor proportional to the number of basis functions
• The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of Errin, used when a log-likelihood loss function is available
Akaike Information Criterion (AIC)
• AIC relies on a relationship that holds asymptotically as N → ∞:
−2 E[log Pr_θ̂(Y)] ≈ −(2/N) E[loglik] + 2 (d/N)
• Pr_θ(Y) is a family of densities for Y (containing the "true" density), θ̂ is the maximum likelihood estimate of θ, and "loglik" is the maximized log-likelihood:
loglik = Σi=1..N log Pr_θ̂(yi)
AIC (cont’d)
• For the Gaussian model (with σε² known), AIC ≡ Cp
• For logistic regression, using the binomial log-likelihood, we have
AIC = −(2/N)·loglik + 2·(d/N)
• Choose the model that produces the smallest AIC
• What if we don't know d?
• What if there are tuning parameters?
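As a concrete sketch (the two models and the data are illustrative, with σε assumed known), AIC = −(2/N)·loglik + 2·d/N can be computed for a constant model and a simple linear regression, and the smaller value picked:

```python
import math
import random

random.seed(3)
N, sigma = 60, 0.5
xs = [i / N for i in range(N)]
ys = [2.0 * x + random.gauss(0, sigma) for x in xs]  # true trend is linear

def gauss_loglik(resid, sigma):
    # Gaussian log-likelihood of the residuals with known sigma
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - r * r / (2 * sigma ** 2) for r in resid)

def aic(resid, d, sigma):
    return -2.0 / N * gauss_loglik(resid, sigma) + 2.0 * d / N

# Model 1: constant fit (d = 1)
ybar = sum(ys) / N
aic_const = aic([y - ybar for y in ys], 1, sigma)

# Model 2: simple linear regression (d = 2)
xbar = sum(xs) / N
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))
alpha = ybar - beta * xbar
aic_line = aic([y - (alpha + beta * x) for x, y in zip(xs, ys)], 2, sigma)

print(aic_const, aic_line)  # the line model should have the smaller AIC
```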
AIC (cont’d)
• Given a set of models f_α(x) indexed by a tuning parameter α, denote by err̄(α) and d(α) the training error and number of parameters for each model
• Define
AIC(α) = err̄(α) + 2 (d(α)/N) σ̂ε²
• The function AIC(α) provides an estimate of the test error curve, and we find the tuning parameter α̂ that minimizes it
• By choosing the best-fitting model with d inputs, the effective number of parameters fit is more than d
AIC – Example: Phoneme recognition
The effective number of parameters
• Generalize the number of parameters to regularized fits: write the fit as a linear operation ŷ = Sy
• The effective number of parameters is d(S) = trace(S)
• The in-sample error estimate is then
Êrrin = err̄ + 2 (trace(S)/N) σ̂ε²
The effective number of parameters
• Thus, for a regularized model fit ŷ = Sy with additive errors:
Σi Cov(ŷi, yi) = trace(S) σε²
• Hence the optimism is Ey(op) = 2 (trace(S)/N) σε²
• and d(S) = trace(S) plays the role of the number of parameters in Cp and AIC
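A toy illustration of trace(S) as an effective parameter count (the moving-average smoother is my choice, not an example from the lecture): build the smoother matrix S with ŷ = Sy and read off d(S) = trace(S), which comes out near N/k, the intuitive parameter count of a k-point binned fit.

```python
N, k = 30, 5          # k-point moving average (k odd)
half = k // 2

# linear smoother matrix S: row i averages the responses in a local window
S = [[0.0] * N for _ in range(N)]
for i in range(N):
    lo, hi = max(0, i - half), min(N, i + half + 1)
    w = 1.0 / (hi - lo)            # uniform weights over the window
    for j in range(lo, hi):
        S[i][j] = w

d_eff = sum(S[i][i] for i in range(N))  # effective df = trace(S)
print(d_eff, N / k)  # close to N/k; edge windows are narrower, so slightly more
```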
The Bayesian Approach and BIC
• The Bayesian information criterion (BIC):
BIC = −2·loglik + (log N)·d
• BIC/2 is also known as the Schwarz criterion
• BIC is proportional to AIC (Cp), with the factor 2 replaced by log N. Since log N > 2 for N > e² ≈ 7.4, BIC penalizes complex models more heavily, preferring simpler models
BIC (cont’d)
• BIC is asymptotically consistent as a selection criterion: given a family of models including the true one, the probability of selecting the true model approaches 1 as N → ∞
• Suppose we have a set of candidate models Mm, m = 1,…,M, with corresponding model parameters θm, and we wish to choose the best model
• Assuming a prior distribution Pr(θm|Mm) for the parameters of each model Mm, compute the posterior probability of each model
BIC (cont’d)
• The posterior probability is
Pr(Mm|Z) ∝ Pr(Mm)·Pr(Z|Mm)
where Z represents the training data
• To compare two models Mm and Mℓ, form the posterior odds
Pr(Mm|Z)/Pr(Mℓ|Z) = [Pr(Mm)/Pr(Mℓ)] · [Pr(Z|Mm)/Pr(Z|Mℓ)]
• If the odds are greater than one, choose model m; otherwise choose model ℓ
BIC (cont’d)
• Bayes factor: the rightmost term in the posterior odds, Pr(Z|Mm)/Pr(Z|Mℓ)
• We need to approximate Pr(Z|Mm)
• A Laplace approximation to the integral gives
log Pr(Z|Mm) ≈ log Pr(Z|θ̂m, Mm) − (dm/2)·log N + O(1)
where θ̂m is the maximum likelihood estimate and dm is the number of free parameters of model Mm
• If the loss function is set to −2 log Pr(Z|Mm, θ̂m), this is equivalent to the BIC criterion
BIC (cont’d)
• Thus, choosing the model with minimum BIC is equivalent to choosing the model with largest (approximate) posterior probability
• If we compute the BIC criterion for a set of M models, obtaining BICm, m = 1,…,M, then the posterior probability of each model is estimated as
Pr(Mm|Z) ≈ e^(−½·BICm) / Σℓ=1..M e^(−½·BICℓ)
• Thus, we can estimate not only the best model, but also assess the relative merits of the models considered
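These posterior weights are straightforward to compute; the BIC values below are made-up numbers for illustration only.

```python
import math

bics = [212.4, 208.1, 215.0]   # illustrative BIC_m for three candidate models

# subtract the minimum before exponentiating: the ratios are unchanged
# and the computation stays numerically stable
weights = [math.exp(-0.5 * (b - min(bics))) for b in bics]
posterior = [w / sum(weights) for w in weights]
print(posterior)  # the model with the smallest BIC gets the most mass
```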
Vapnik-Chervonenkis Dimension
• It is often difficult to specify the number of parameters of a model class
• The Vapnik-Chervonenkis (VC) theory provides a general measure of complexity and associated bounds on optimism
• Consider a class of functions {f(x,α)} indexed by a parameter vector α, with x ∈ ℝᵖ
• Assume f is an indicator function, taking values 0 or 1
• If α = (α0, α1) and f is the linear indicator I(α0 + α1ᵀx > 0), then it is reasonable to say the complexity is p + 1 parameters
• But what about f(x,α) = I(sin(α·x) > 0)?
VC Dimension (cont’d)
• The Vapnik-Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be
• The VC dimension of the class {f(x,α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x,α)}
VC Dimension (cont’d)
• A set of points is shattered by a class of functions if no matter how we assign a binary label to each point, a member of the class can perfectly separate them
• Example: VC dim of linear indicator function in 2D
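The answer for this example is 3. A brute-force sketch (the construction is mine, not from the lecture): sample many random linear indicators I(a0 + a1·x + a2·y > 0) and record which labelings they realize on 3 points in general position versus the 4 corners of a square; the two XOR labelings of the square are never realized, so 4 points are not shattered.

```python
import random

random.seed(4)

def realized_dichotomies(points, trials=20000):
    """Labelings of `points` achieved by random linear indicator functions."""
    seen = set()
    for _ in range(trials):
        a0, a1, a2 = (random.gauss(0, 1) for _ in range(3))
        labels = tuple(int(a0 + a1 * x + a2 * y > 0) for x, y in points)
        seen.add(labels)
    return seen

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]              # general position
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # square corners

d3 = realized_dichotomies(three)
d4 = realized_dichotomies(four)
print(len(d3), len(d4))  # 3 points: all 8 labelings; 4 points: at most 14 of 16
```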
VC Dimension (cont’d)
• Using the concept of VC dimension, one can prove results about the optimism of the training error when using a class of functions. For example:
• If we fit N data points using a class of functions {f(x,α)} having VC dimension h, then with probability at least 1 − η over training sets:
ErrT ≤ err̄ + (ε/2)(1 + √(1 + 4·err̄/ε))   (classification)
ErrT ≤ err̄ / (1 − c√ε)₊   (regression)
where ε = a1 [h(log(a2·N/h) + 1) − log(η/4)] / N
• Cherkassky and Mulier (1998) recommend a1 = a2 = c = 1
VC Dimension (cont’d)
• The bounds suggest that the optimism increases with h and decreases with N in qualitative agreement with the AIC correction d/N
• The results of VC dimension bounds are stronger: they give a probabilistic upper bounds for all functions f(x,) and hence allow for searching over the class
VC Dimension (cont’d)
• Vapnik's Structural Risk Minimization (SRM) is built around these bounds
• SRM fits a nested sequence of models of increasing VC dimension h1 < h2 < …, and then chooses the model with the smallest value of the upper bound
• A drawback is the difficulty of computing the VC dimension of a class of functions, and a crude upper bound on it may not be adequate
Example – AIC, BIC, SRM
Cross Validation (CV)
• The most widely used method
• Directly estimates the generalization error by applying the model to held-out test samples
• K-fold cross-validation:
– Use part of the data to build the model, and a different part to test it
– Do this for k = 1, 2, …, K and calculate the prediction error when predicting the kth part
CV (cont’d)
• κ: {1,…,N} → {1,…,K} indicates the fold to which each observation is allocated
• f̂^(−k)(x) denotes the fitted function computed with the kth part removed
• The CV estimate of prediction error is
CV = (1/N) Σi=1..N L(yi, f̂^(−κ(i))(xi))
• If K = N, this is called leave-one-out CV
• Given a set of models f(x,α), let f̂^(−k)(x,α) be the αth model fit with the kth part removed. For this set of models we have
CV(α) = (1/N) Σi=1..N L(yi, f̂^(−κ(i))(xi,α))
CV (cont’d)
• CV(α) should be minimized over α
• What should we choose for K?
• With K = N, CV is approximately unbiased, but can have high variance since the N training sets are almost identical
• The computational burden is also considerable, requiring N applications of the learning method
CV (cont’d)
• With lower K, CV has a lower variance, but bias could be a problem!
• The most common are 5-fold and 10-fold!
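A minimal K-fold CV sketch under squared-error loss (the data and the stand-in "model", a training mean, are illustrative; any fitting procedure can be plugged in):

```python
import random

random.seed(5)

def kfold_cv(data, K, fit, predict, loss):
    """Return the CV estimate (1/N) * sum_i L(y_i, f^{-kappa(i)}(x_i))."""
    idx = list(range(len(data)))
    random.shuffle(idx)                 # random fold assignment kappa
    folds = [idx[k::K] for k in range(K)]
    total = 0.0
    for fold in folds:
        held = set(fold)
        model = fit([data[i] for i in idx if i not in held])
        total += sum(loss(data[i][1], predict(model, data[i][0]))
                     for i in fold)
    return total / len(data)

# stand-ins: the "model" is just the mean response on the training part
fit_mean = lambda rows: sum(y for _, y in rows) / len(rows)
pred_mean = lambda m, x: m
sqloss = lambda y, yhat: (y - yhat) ** 2

data = [(x / 100, random.gauss(0, 1)) for x in range(100)]
cv_err = kfold_cv(data, 10, fit_mean, pred_mean, sqloss)
print(cv_err)  # roughly the noise variance (here 1) for this null model
```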
CV (cont’d)
• Generalized cross-validation (GCV) approximates leave-one-out CV for linear fitting under squared-error loss, where ŷ = Sy
• For linear fits (Sii is the ith diagonal element of S):
(1/N) Σi=1..N [yi − f̂^(−i)(xi)]² = (1/N) Σi=1..N [(yi − f̂(xi)) / (1 − Sii)]²
• The GCV approximation is
GCV = (1/N) Σi=1..N [(yi − f̂(xi)) / (1 − trace(S)/N)]²
• GCV can be advantageous in settings where the trace of S can be computed more easily than the individual elements Sii
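The two formulas can be compared directly on any explicit linear smoother; the moving-average smoother and data below are illustrative choices, not from the lecture.

```python
import math
import random

random.seed(6)
N, k = 40, 5
half = k // 2
ys = [math.sin(2 * math.pi * i / N) + random.gauss(0, 0.3) for i in range(N)]

# moving-average smoother matrix S, so that yhat = S y
S = [[0.0] * N for _ in range(N)]
for i in range(N):
    lo, hi = max(0, i - half), min(N, i + half + 1)
    for j in range(lo, hi):
        S[i][j] = 1.0 / (hi - lo)

yhat = [sum(S[i][j] * ys[j] for j in range(N)) for i in range(N)]
trS = sum(S[i][i] for i in range(N))

# exact leave-one-out formula (per-point S_ii) vs. GCV (trace(S)/N)
loo = sum(((ys[i] - yhat[i]) / (1 - S[i][i])) ** 2 for i in range(N)) / N
gcv = sum(((ys[i] - yhat[i]) / (1 - trS / N)) ** 2 for i in range(N)) / N
print(loo, gcv)  # the two estimates are close for this smoother
```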
Bootstrap
• Denote the training set by Z = (z1,…,zN), where zi = (xi, yi)
• Randomly draw a dataset of size N with replacement from the training data
• This is done B times (e.g., B = 100)
• Refit the model to each of the B bootstrap datasets and examine the behavior of the fits over the B replications
• From the bootstrap samples, we can estimate any aspect of the distribution of S(Z), where S(Z) is any quantity computed from the data
Bootstrap - Schematic
For example, the variance of S(Z) is estimated by
Var̂[S(Z)] = (1/(B−1)) Σb=1..B [S(Z*b) − S̄*]²
where S̄* = (1/B) Σb S(Z*b)
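For example, estimating the sampling variance of the sample median (an illustrative choice of S, with made-up data):

```python
import random
import statistics

random.seed(7)
Z = [random.gauss(10, 2) for _ in range(50)]   # training data
B = 200

stats = []
for _ in range(B):
    Zb = [random.choice(Z) for _ in Z]         # bootstrap draw with replacement
    stats.append(statistics.median(Zb))        # S(Z*b)

sbar = sum(stats) / B
var_hat = sum((s - sbar) ** 2 for s in stats) / (B - 1)
print(var_hat)  # bootstrap estimate of Var[median]
```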
Bootstrap (Cont’d)
• Bootstrap estimate of the prediction error:
Êrrboot = (1/B)(1/N) Σb=1..B Σi=1..N L(yi, f̂*b(xi))
• Êrrboot does not provide a good estimate:
– The bootstrap datasets act as both training and test sets, and the two have many common observations
– The overfit predictions will look unrealistically good
• Better bootstrap estimates mimic CV:
– Only keep track of predictions from bootstrap samples not containing the observation being predicted
Bootstrap (Cont’d)
• The leave-one-out bootstrap estimate of prediction error is
Êrr^(1) = (1/N) Σi=1..N (1/|C^(−i)|) Σb∈C^(−i) L(yi, f̂*b(xi))
• C^(−i) is the set of indices of the bootstrap samples b that do not contain observation i
• We either choose B large enough to ensure that all |C^(−i)| are greater than zero, or leave out the terms corresponding to |C^(−i)| = 0
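A sketch of Êrr^(1) (the "model" is again just a training mean, an illustrative stand-in; the data is made up): average each observation's loss only over bootstrap fits whose sample did not contain it.

```python
import random

random.seed(8)
N, B = 40, 100
data = [(i / N, random.gauss(0, 1)) for i in range(N)]

fits, samples = [], []
for _ in range(B):
    idx = [random.randrange(N) for _ in range(N)]   # bootstrap indices
    samples.append(set(idx))
    rows = [data[i] for i in idx]
    fits.append(sum(y for _, y in rows) / N)        # mean model on Z*b

total, counted = 0.0, 0
for i, (x, y) in enumerate(data):
    out = [b for b in range(B) if i not in samples[b]]   # C^{-i}
    if not out:                # drop observations contained in every sample
        continue
    total += sum((y - fits[b]) ** 2 for b in out) / len(out)
    counted += 1

err_loo_boot = total / counted
print(err_loo_boot)  # leave-one-out bootstrap estimate of prediction error
```

With B = 100, each observation is left out of roughly e⁻¹·B ≈ 37 samples, so empty C^(−i) sets essentially never occur here.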
Bootstrap (Cont’d)
• The leave-one-out bootstrap solves the overfitting problem, but has a training-size bias
• The average number of distinct observations in each bootstrap sample is about 0.632·N
• Thus, if the learning curve has considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the error
• A number of methods have been proposed to alleviate this problem, e.g., the .632 estimator and refinements based on the information error rate (overfitting rate)
Bootstrap (Example)
• Five-fold CV and the .632 estimator for the same problems as before
• Any of the measures could be biased, but this does not matter as long as the relative performance of the models is preserved