
Page 1: Lectures 13,14 – Model Inference and Averaging

Lectures 13,14 – Model Inference and Averaging

Rice ECE697, Farinaz Koushanfar

Fall 2006

Page 2: Lectures 13,14 – Model Inference and Averaging

Summary

• Bootstrap and maximum likelihood (ML)
• Bayesian methods
• The expectation maximization (EM) algorithm
• MCMC for sampling from the posterior
• Bagging
• Model averaging

Page 3: Lectures 13,14 – Model Inference and Averaging

Empirical Distribution
• Suppose x1, ..., xN are the observed outcomes of N iid random variables following an unknown PDF.
• The empirical distribution: P(X = a) = count(xi = a)/N.
• The empirical estimate of a parameter is computed from the empirical distribution by the formula that defines the parameter in terms of its true distribution.
• For example, the empirical estimate of the variance is
$$\widehat{\operatorname{var}}(x) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$
• Empirical estimates are often biased; there is no guarantee that they have the best possible variance, or other good properties.

Page 4: Lectures 13,14 – Model Inference and Averaging

Bootstrap
• Let T(x1, ..., xN) be an estimator of θ.
• The bootstrap generates repeated estimates by generating repeated "fake" outcomes. Each fake outcome is generated by taking a random sample according to the empirical distribution P(X = a) = count(xi = a)/N.
• To generate fake outcomes, resample with replacement.
• By taking R random samples with replacement, we get R different bootstrap estimates of θ; call these B1, ..., BR.
• What do we use the R bootstrap estimates for? The most common use is a confidence interval (CI):
(1) Use the order statistics of the Br. E.g., for a 95% confidence interval, use B(2.5%) and B(97.5%) (a code sketch follows).
(2) If we know that the PDF of T(x1, ..., xN) is approximately Gaussian, base the CI on the sample variance of the bootstrap estimates,
$$\widehat{\operatorname{var}}(B) = \frac{1}{R-1}\sum_{r=1}^{R}\left(B_r - \bar{B}\right)^2$$
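A minimal NumPy sketch (not from the original slides) of the percentile-CI use of the bootstrap; the choice of estimator (the median), the toy data, and R = 1000 are assumptions for illustration.

```python
import numpy as np

def bootstrap_ci(x, T=np.median, R=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the estimator T applied to the sample x."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # R resamples with replacement from the empirical distribution of x
    estimates = np.array([T(rng.choice(x, size=N, replace=True)) for _ in range(R)])
    # Order statistics of the bootstrap estimates give the percentile interval
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return estimates, (lo, hi)

x = np.random.default_rng(1).normal(size=100)        # toy observed outcomes x1,...,xN
boot, (lo, hi) = bootstrap_ci(x)
print(f"95% percentile CI for the median: ({lo:.3f}, {hi:.3f})")
```

Under the Gaussian assumption in (2), one could instead report T(x) plus or minus 1.96 times the standard deviation of the bootstrap estimates in `boot`.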

Page 5: Lectures 13,14 – Model Inference and Averaging

Bootstrap-Example
• Training data Z = {z1, z2, ..., zN}, with zi = (xi, yi).
• Suppose we are fitting a cubic spline with 3 knots placed at quantiles of x (splines are one form of kernel basis function, centered at the knots).
• With the resulting seven basis functions h1(x), ..., h7(x), the model is
$$\mu(x) = \sum_{j=1}^{7}\beta_j h_j(x)$$

Page 6: Lectures 13,14 – Model Inference and Averaging

Bootstrap-Example (cont’d)

• Spline prediction: with $h(x)^T = (h_1(x), \ldots, h_7(x))$, the fitted curve is
$$\hat{\mu}(x) = \sum_{j=1}^{7}\hat{\beta}_j h_j(x) = h(x)^T\hat{\beta}$$
• Can think of μ(x) = E(Y | X = x).
• The usual estimate is $\hat{\beta} = (H^T H)^{-1} H^T y$, where H is the N × 7 matrix with ij-th element hj(xi).
• The estimated covariance and noise variance are
$$\widehat{\operatorname{Var}}(\hat{\beta}) = (H^T H)^{-1}\hat{\sigma}^2, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{\mu}(x_i)\right)^2,$$
giving the pointwise standard error
$$\widehat{\operatorname{se}}\left[\hat{\mu}(x)\right] = \left[h(x)^T (H^T H)^{-1} h(x)\right]^{1/2}\hat{\sigma}$$
• How do we apply the bootstrap to this example?

Page 7: Lectures 13,14 – Model Inference and Averaging

Bootstrap-Example (cont’d)
• Draw B datasets with replacement from the training data: Z*b = {z*1, ..., z*N}, with z*i = (x*i, y*i).
• To each bootstrap sample, fit a cubic spline $\hat{\mu}^*(x)$.
• Example: 10 bootstrap samples (left figure) and the resulting pointwise CI (right figure); a code sketch follows.
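A minimal sketch of the nonparametric bootstrap for this example, assuming a truncated-power cubic basis with three knots at the quartiles of x; the toy data, the knot placement, and B = 200 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def basis(x, knots):
    """Truncated-power cubic basis h1(x),...,h7(x) with three knots."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)                      # N x 7 matrix H

rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)     # toy training data Z
knots = np.quantile(x, [0.25, 0.50, 0.75])
xg = np.linspace(0, 3, 200)                           # grid on which to evaluate mu*(x)

B = 200
curves = np.empty((B, xg.size))
for b in range(B):
    idx = rng.integers(0, N, size=N)                  # resample the pairs (x_i, y_i)
    Hb = basis(x[idx], knots)
    beta_b, *_ = np.linalg.lstsq(Hb, y[idx], rcond=None)
    curves[b] = basis(xg, knots) @ beta_b             # bootstrap spline mu*_b(x)

# Pointwise 95% confidence band from the bootstrap curves
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
```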

Page 8: Lectures 13,14 – Model Inference and Averaging

Least Square and Bootstrap - Example

Page 9: Lectures 13,14 – Model Inference and Averaging

Least Squares, Bootstrap, and ML
• The previous example was the nonparametric bootstrap.
• Suppose that the error is Gaussian: ε ~ N(0, σ²).
• In the parametric bootstrap, we draw samples by adding Gaussian noise to the predicted values:
$$y_i^* = \hat{\mu}(x_i) + \varepsilon_i^*, \qquad \varepsilon_i^* \sim N(0, \hat{\sigma}^2), \quad i = 1, \ldots, N$$
• The process is repeated B times, re-computing the spline on each sample; the CI from this method will be exactly the least squares bands!
• The function estimated from the bootstrap samples has the distribution
$$\hat{\mu}^*(x) \sim N\!\left(\hat{\mu}(x),\; h(x)^T (H^T H)^{-1} h(x)\,\hat{\sigma}^2\right)$$
(A code sketch follows.)
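A minimal continuation of the spline bootstrap sketch above, showing the parametric bootstrap; it assumes the objects `basis`, `x`, `y`, `knots`, `xg`, `rng`, `N`, `B` (and the `numpy as np` import) defined there.

```python
H = basis(x, knots)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)      # least squares / ML fit
mu_hat = H @ beta_hat
sigma_hat = np.sqrt(np.sum((y - mu_hat) ** 2) / N)    # noise standard deviation estimate

curves_p = np.empty((B, xg.size))
for b in range(B):
    y_star = mu_hat + rng.normal(scale=sigma_hat, size=N)   # add Gaussian noise to the fit
    beta_b, *_ = np.linalg.lstsq(H, y_star, rcond=None)     # re-fit the spline to y*
    curves_p[b] = basis(xg, knots) @ beta_b
```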

Page 10: Lectures 13,14 – Model Inference and Averaging

Maximum Likelihood (ML) Inference
• In general, the bootstrap estimate agrees not with least squares but with ML.
• Specify a probability density (or mass) function for the observations: zi ~ gθ(z).
• ML is based on the likelihood function
$$L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i)$$
• The logarithm of L(θ; Z) is denoted l(θ; Z) and is called the log-likelihood function:
$$l(\theta; Z) = \sum_{i=1}^{N} l(\theta; z_i) = \sum_{i=1}^{N} \log g_\theta(z_i)$$
• ML chooses the value of θ that maximizes l(θ; Z).

Page 11: Lectures 13,14 – Model Inference and Averaging

ML – Some Definitions

• The score function is
$$\dot{l}(\theta; Z) = \sum_{i=1}^{N} \dot{l}(\theta; z_i), \qquad \text{where}\quad \dot{l}(\theta; z_i) = \partial l(\theta; z_i)/\partial\theta$$
• Assuming that the maximum is in the interior of the parameter space, the score is 0 at the ML estimate.
• The information matrix is
$$I(\theta) = -\sum_{i=1}^{N}\frac{\partial^2 l(\theta; z_i)}{\partial\theta\,\partial\theta^T}$$
• I(θ) evaluated at the ML estimate $\hat{\theta}$ is the observed information.
• The Fisher information (or expected information) is
$$i(\theta) = E\left[I(\theta)\right]$$

Page 12: Lectures 13,14 – Model Inference and Averaging

ML – Some More Results

• Assume independent sampling from gθ(z).
• The sampling distribution of the ML estimator has a limiting normal distribution (as N → ∞),
$$\hat{\theta} \to N\!\left(\theta_0,\; i(\theta_0)^{-1}\right),$$
which is approximated by $N(\hat{\theta}, i(\hat{\theta})^{-1})$ or $N(\hat{\theta}, I(\hat{\theta})^{-1})$.
• The standard error of the estimate of θj is
$$\sqrt{\left[i(\hat{\theta})^{-1}\right]_{jj}} \quad \text{or} \quad \sqrt{\left[I(\hat{\theta})^{-1}\right]_{jj}}$$
• A confidence interval for θj has the form
$$\hat{\theta}_j \pm z^{(1-\alpha)}\sqrt{\left[i(\hat{\theta})^{-1}\right]_{jj}} \quad \text{or} \quad \hat{\theta}_j \pm z^{(1-\alpha)}\sqrt{\left[I(\hat{\theta})^{-1}\right]_{jj}}$$

Page 13: Lectures 13,14 – Model Inference and Averaging

ML for our smoothing example

• The parameters are θ = (β, σ²); the log-likelihood is
$$l(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - h(x_i)^T\beta\right)^2$$
• The ML estimate is obtained by setting ∂l(θ)/∂β = 0 and ∂l(θ)/∂σ² = 0, which gives
$$\hat{\beta} = (H^T H)^{-1} H^T y \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{\mu}(x_i)\right)^2$$
• The information matrix for θ = (β, σ²) is block-diagonal; the block corresponding to β is
$$I(\beta) = H^T H / \sigma^2$$

Page 14: Lectures 13,14 – Model Inference and Averaging

Bayesian Approach to Inference
• Specify a sampling model Pr(Z|θ), the pdf of the data given the parameters, and a prior distribution Pr(θ) reflecting our knowledge about θ before we see the new data.
• The posterior distribution corresponds to our updated knowledge about θ after we see the new data:
$$\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\Pr(\theta)}{\Pr(Z)} = \frac{\Pr(Z \mid \theta)\Pr(\theta)}{\int \Pr(Z \mid \theta)\Pr(\theta)\,d\theta}$$
• The difference between Bayesian and standard inference is that the Bayesian approach expresses the uncertainty before seeing the data (the prior) and the uncertainty remaining after seeing it (the posterior).

Page 15: Lectures 13,14 – Model Inference and Averaging

Bayesian Approach (Cont’d)

• Predict the value of a future observation via the predictive distribution
$$\Pr\!\left(z^{new} \mid Z\right) = \int \Pr\!\left(z^{new} \mid \theta\right)\Pr(\theta \mid Z)\,d\theta$$
• ML would use Pr(z^new | $\hat{\theta}$) to predict future data; unlike the predictive distribution, this does not account for the uncertainty in estimating θ.

Page 16: Lectures 13,14 – Model Inference and Averaging

Bayesian Approach on Our Example
• Parametric model:
$$\mu(x) = \sum_{j=1}^{7}\beta_j h_j(x) = h(x)^T\beta$$
• Assume that σ² is known and that the randomness comes only from the variation of y around μ(x).
• With a finite number of basis functions, put the prior on the distribution of the coefficients, β ~ N(0, τΣ); then μ(x) is Gaussian with covariance kernel
$$K(x, x') = \operatorname{cov}\!\left(\mu(x), \mu(x')\right) = \tau\, h(x)^T \Sigma\, h(x')$$
• The posterior distribution for β is also Gaussian, with
$$E(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} H^T y, \qquad \operatorname{Cov}(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1}\sigma^2$$

Page 17: Lectures 13,14 – Model Inference and Averaging

Example (Cont’d)

• The corresponding posterior for μ(x) is
$$E\!\left(\mu(x) \mid Z\right) = h(x)^T\left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} H^T y$$
$$\operatorname{Cov}\!\left(\mu(x), \mu(x') \mid Z\right) = h(x)^T\left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} h(x')\,\sigma^2$$
• How do we choose the prior covariance? Take Σ = I, so that the prior is controlled by the single scale τ (a code sketch follows).
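A minimal sketch of these posterior formulas with Σ = I, reusing `basis`, `x`, `y`, `knots`, `xg` (and the `numpy as np` import) from the earlier spline sketches; the prior scale `tau` and the plug-in value for σ² are assumptions for illustration.

```python
H = basis(x, knots)
beta_ls, *_ = np.linalg.lstsq(H, y, rcond=None)
sigma2 = np.sum((y - H @ beta_ls) ** 2) / len(y)             # plug-in value for sigma^2

tau = 1.0                                                    # assumed prior variance scale
A = H.T @ H + (sigma2 / tau) * np.eye(H.shape[1])
post_mean_beta = np.linalg.solve(A, H.T @ y)                 # E(beta | Z)
post_cov_beta = np.linalg.inv(A) * sigma2                    # Cov(beta | Z)

Hg = basis(xg, knots)
post_mean_mu = Hg @ post_mean_beta                           # E(mu(x) | Z) on the grid
post_var_mu = np.sum((Hg @ post_cov_beta) * Hg, axis=1)      # pointwise Var(mu(x) | Z)
```

As τ grows, the term σ²/τ vanishes and the posterior mean approaches the least squares fit, which is why the large-τ posterior curves on the next slide resemble the bootstrap curves.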

Page 18: Lectures 13,14 – Model Inference and Averaging

Example (Cont’d)
• Let’s take a look at the posterior curves and see the impact of the prior on the posterior.
• [Figure: sampled posterior curves; annotation: “Looks like bootstrap curves”.]

Page 19: Lectures 13,14 – Model Inference and Averaging

Bayes Inference – Example (from Wikipedia)
• Suppose we wish to know the proportion r of voters in a large population who will vote "yes" in a referendum.
• Let n be the number of voters in a random sample (chosen with replacement, so that we have statistical independence).
• Let m be the number of voters in that random sample who will vote "yes".
• Suppose that we observe n = 10 voters and m = 7 say they will vote yes. From Bayes' theorem,
$$f(r \mid m = 7, n = 10) = \frac{f(m = 7 \mid r, n = 10)\, f(r)}{\int_0^1 f(m = 7 \mid r, n = 10)\, f(r)\, dr}$$

Page 20: Lectures 13,14 – Model Inference and Averaging

Example from Wikipedia (Cont’d)

• From this we see that from the prior probability density function f(r) and the likelihood function L(r) = f(m = 7|r, n = 10), we can compute the posterior pdf

• f(r) summarizes what we know about the distribution of r in the absence of any observation.

• We provisionally assume in this case that the prior distribution of r is uniform over the interval [0, 1]. That is, f(r) = 1.

• If some additional background information is found, we should modify the prior accordingly. However, before we have any observations, all outcomes are taken to be equally likely.

Page 21: Lectures 13,14 – Model Inference and Averaging

Example from Wikipedia (Cont’d)
• Assuming random sampling, the likelihood function L(r) = f(m = 7 | r, n = 10) is just the probability of 7 successes in 10 trials for a binomial distribution:
$$L(r) = \binom{10}{7} r^7 (1-r)^3$$
• As with the prior, the likelihood is open to revision; more complex assumptions will yield more complex likelihood functions. Maintaining the current assumptions, we compute the normalizing factor
$$\int_0^1 \binom{10}{7} r^7 (1-r)^3\, dr = \frac{1}{11}$$
• For r ∈ [0, 1], the posterior distribution for r is then
$$f(r \mid m = 7, n = 10) = 11\binom{10}{7} r^7 (1-r)^3 = 1320\, r^7 (1-r)^3$$

Page 22: Lectures 13,14 – Model Inference and Averaging

Example from Wikipedia (Cont’d)
• One may be interested in the probability that more than half the voters will vote "yes".
• The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution.
• In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll (that seven of the 10 voters questioned will vote "yes"), is
$$\Pr\!\left(r > \tfrac{1}{2} \,\middle|\, m = 7, n = 10\right) = \int_{1/2}^{1} 1320\, r^7 (1-r)^3\, dr \approx 0.887$$
(a computational check follows).
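A minimal SciPy check of this calculation: with the uniform prior, the posterior of r is the Beta(m + 1, n − m + 1) = Beta(8, 4) distribution, so the probability above is its upper tail at 1/2.

```python
from scipy.stats import beta

m, n = 7, 10
posterior = beta(m + 1, n - m + 1)     # Beta(8, 4) posterior under the uniform prior
print(posterior.sf(0.5))               # Pr(r > 1/2 | data) ~ 0.887, versus the prior value 1/2
```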

Page 23: Lectures 13,14 – Model Inference and Averaging

Expectation Maximization (EM) Algorithm

• The EM algorithm is used to simplify difficult ML problems; we will show an example first.
• Assume we are fitting a simple mixture model.

Page 24: Lectures 13,14 – Model Inference and Averaging

EM - Example
• Model: Y is a mixture of two normal distributions, Y1 ~ N(μ1, σ1²) and Y2 ~ N(μ2, σ2²).
• Y = (1 − Δ)·Y1 + Δ·Y2, where Δ ∈ {0, 1} and Pr(Δ = 1) = π.
• Let φθ(y) denote the Gaussian density with parameters θ = (μ, σ²); the pdf of Y is
$$g_Y(y) = (1-\pi)\,\varphi_{\theta_1}(y) + \pi\,\varphi_{\theta_2}(y)$$
• The parameters of the ML problem for the mixture model are θ = (π, μ1, σ1², μ2, σ2²).
• The log-likelihood on N training cases is
$$l(\theta; Z) = \sum_{i=1}^{N}\log\!\left[(1-\pi)\,\varphi_{\theta_1}(y_i) + \pi\,\varphi_{\theta_2}(y_i)\right]$$

Page 25: Lectures 13,14 – Model Inference and Averaging

EM - Example
• Direct maximization is numerically difficult because of the sum of terms inside the log.
• A simpler approach is to fix some of the values, solve for the others, and iteratively update. This is the core of the EM algorithm.
• Expectation step: do a soft assignment of each observation to each model, i.e., how much each model is responsible for explaining a data point.
– E.g., the responsibility of model 2 for observation i can be written as
$$\hat{\gamma}_i(\theta) = E(\Delta_i \mid \theta, Z) = \Pr(\Delta_i = 1 \mid \theta, Z)$$
• Maximization step: the responsibilities are used in a weighted ML fit to update the parameter estimates.

Page 26: Lectures 13,14 – Model Inference and Averaging

EM Algorithm for 2-Component Gaussian Mixture Model

1. Take initial guesses for the parameters $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$.
2. Expectation step: compute the responsibilities
$$\hat{\gamma}_i = \frac{\hat{\pi}\,\varphi_{\hat{\theta}_2}(y_i)}{(1-\hat{\pi})\,\varphi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\varphi_{\hat{\theta}_2}(y_i)}, \qquad i = 1, 2, \ldots, N$$
3. Maximization step: compute the weighted means and variances
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N}(1-\hat{\gamma}_i)\,y_i}{\sum_{i=1}^{N}(1-\hat{\gamma}_i)}, \qquad \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N}(1-\hat{\gamma}_i)\,(y_i-\hat{\mu}_1)^2}{\sum_{i=1}^{N}(1-\hat{\gamma}_i)},$$
$$\hat{\mu}_2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i\,y_i}{\sum_{i=1}^{N}\hat{\gamma}_i}, \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i\,(y_i-\hat{\mu}_2)^2}{\sum_{i=1}^{N}\hat{\gamma}_i},$$
and the mixing probability $\hat{\pi} = \sum_{i=1}^{N}\hat{\gamma}_i / N$.
4. Iterate steps 2 and 3 (E and M) until convergence. (A code sketch follows.)
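A minimal NumPy/SciPy sketch of the algorithm above; the initialization, the fixed number of iterations, and the toy data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initial guesses: two distinct data points as means, overall variance, pi = 0.5
    mu1, mu2 = rng.choice(y, 2, replace=False)
    var1 = var2 = np.var(y)
    pi = 0.5
    for _ in range(n_iter):
        # 2. Expectation step: responsibility of component 2 for each observation
        p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(var1))
        p2 = pi * norm.pdf(y, mu2, np.sqrt(var2))
        gamma = p2 / (p1 + p2)
        # 3. Maximization step: weighted means, variances and mixing probability
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = np.mean(gamma)
    return pi, mu1, var1, mu2, var2

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 120), rng.normal(4, 1, 80)])   # toy mixture data
print(em_two_gaussians(y))
```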

Page 27: Lectures 13,14 – Model Inference and Averaging

EM – Example (cont’d)

Selected iterations of the EM algorithm for the mixture example (estimate of the mixing proportion $\hat{\pi}$):

Iteration    $\hat{\pi}$
1            0.485
5            0.493
10           0.523
15           0.544
20           0.546

Page 28: Lectures 13,14 – Model Inference and Averaging

EM – When is it Useful?

• In the hidden Markov model literature, EM is known as the Baum-Welch algorithm.
• EM is used for maximizing the likelihood for a certain class of problems where direct ML is difficult.
• Data augmentation: the observed data set is enlarged with latent (unobserved) variables.
– In the previous example, the unobserved variable was Δi.
– EM is widely used when a part of the actual data is missing: treat the missing data as latent variables.

Page 29: Lectures 13,14 – Model Inference and Averaging

EM Algorithm – General Case

• Assume that the observed data is Z, with log-likelihood l(θ; Z) depending on the parameter θ.
• The missing (latent) data is Zm; the complete data is T = (Z, Zm), with log-likelihood l0(θ; T), where l0 is based on the complete-data density.
• In the mixture problem, T = (Z, Zm) = (y, Δ).
• Since
$$\Pr\!\left(Z^m \mid Z, \theta'\right) = \frac{\Pr\!\left(Z^m, Z \mid \theta'\right)}{\Pr\!\left(Z \mid \theta'\right)}, \qquad \text{i.e.} \qquad \Pr\!\left(Z \mid \theta'\right) = \frac{\Pr\!\left(T \mid \theta'\right)}{\Pr\!\left(Z^m \mid Z, \theta'\right)},$$
we have
$$l(\theta'; Z) = l_0(\theta'; T) - l_1\!\left(\theta'; Z^m \mid Z\right)$$
• And l1 is based on the conditional density Pr(Zm | Z, θ').

Page 30: Lectures 13,14 – Model Inference and Averaging

EM Algorithm – General Case

1. Start with an initial guess for the parameter, $\hat{\theta}^{(0)}$.
2. Expectation step: at the j-th step, compute
$$Q\!\left(\theta', \hat{\theta}^{(j)}\right) = E\!\left[l_0(\theta'; T) \mid Z, \hat{\theta}^{(j)}\right]$$
as a function of the dummy argument θ'.
3. Maximization step: determine the new estimate $\hat{\theta}^{(j+1)}$ as the maximizer of $Q(\theta', \hat{\theta}^{(j)})$ over θ'.
4. Iterate steps 2 and 3 (E and M) until convergence.

Page 31: Lectures 13,14 – Model Inference and Averaging

Why EM Works
• Remember that, in terms of the log-likelihood,
$$l(\theta'; Z) = l_0(\theta'; T) - l_1\!\left(\theta'; Z^m \mid Z\right),$$
where l0 is based on the complete-data density and l1 is based on the conditional density Pr(Zm | Z, θ').
• Taking the expectation with respect to the distribution of T | Z governed by the parameter θ gives
$$l(\theta'; Z) = E\!\left[l_0(\theta'; T) \mid Z, \theta\right] - E\!\left[l_1(\theta'; Z^m \mid Z) \mid Z, \theta\right] \equiv Q(\theta', \theta) - R(\theta', \theta)$$
• R(θ*, θ) is the expectation of the log-likelihood of a density (indexed by θ*) with respect to the same density indexed by θ; as a function of θ*, it is maximized when θ* = θ (Jensen's inequality). Hence, if θ' maximizes Q(θ', θ),
$$l(\theta'; Z) - l(\theta; Z) = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right] \geq 0$$
• Thus, the EM iteration never decreases the log-likelihood!

Page 32: Lectures 13,14 – Model Inference and Averaging

MCMC Methods
• Slides (for MCMC) mostly borrowed from Sujit Sahu, www.maths.soton.ac.uk/staff/Sahu/
• Assume that we want to find
$$I = \int_a^b f(x)\,dx$$
• If the interval [a, b] is divided so that a = x0 < x1 < ... < xN = b, the integral can be approximated by
$$I \approx \sum_{i=0}^{N-1} f\!\left(\frac{x_i + x_{i+1}}{2}\right)\left(x_{i+1} - x_i\right)$$
• Now assume we need to find an expectation
$$E[h(X)] = \int h(x)\,\pi(x)\,dx \qquad \left(\text{assuming } \int |h(x)|\,\pi(x)\,dx < \infty\right)$$
• This can be difficult!

Page 33: Lectures 13,14 – Model Inference and Averaging

MCMC for Integration (Cont’d)

• If we can draw samples X(1), ..., X(N) ~ π(x), then we can estimate
$$E[h(X)] \approx \frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right)$$
• This is Monte Carlo (MC) integration (a code sketch follows).
• We have changed notation: N now denotes the number of samples drawn.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
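A minimal sketch of Monte Carlo integration; the target density π = N(0, 1) and the function h(x) = x² are assumed toy choices, so the true expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
samples = rng.normal(size=N)           # independent draws X^(1),...,X^(N) from pi(x)
estimate = np.mean(samples ** 2)       # (1/N) * sum_t h(X^(t))
print(estimate)                        # close to E[X^2] = 1
```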

Page 34: Lectures 13,14 – Model Inference and Averaging

Consistency of the integral estimate

• For independent samples, by the Law of Large Numbers,
$$\frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right) \to E[h(X)] \qquad \text{as } N \to \infty$$
• But independent sampling from π(x) can be difficult.
• It turns out that the above convergence still holds if we generate the samples using a (suitable) Markov chain.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 35: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Review)
• A Markov chain is generated by sampling X(t+1) ~ p(x | x(t)), t = 0, 1, 2, ...
• Here p is the transition kernel.
• So X(t+1) depends only on X(t), not on X(0), X(1), ..., X(t−1).
• For example, X(t+1) = 0.5 X(t) + εt with iid noise εt.
• This is called a first-order autoregressive process with lag-1 autocorrelation 0.5 (a simulation sketch follows).

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
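A minimal simulation sketch of such a chain, assuming standard normal innovations (the exact noise variance of the slides' example is not shown in this transcript).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
x = np.empty(T)
x[0] = 5.0                               # arbitrary starting value x^(0)
for t in range(T - 1):
    # the next state depends only on the current state (Markov property)
    x[t + 1] = 0.5 * x[t] + rng.normal()

# The chain forgets x^(0); with N(0,1) innovations its stationary
# distribution is N(0, 1/(1 - 0.25)), i.e. variance 4/3.
print(x[1000:].mean(), x[1000:].var())
```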

Page 36: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Stationary)

• As t → ∞, the Markov chain converges (in distribution) to its stationary (invariant) distribution.
• In the example before, this is a normal distribution, and it does not depend on x(0).
• Does this happen for all Markov chains?

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 37: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Irreducibility)

• Assuming that a stationary distribution exists, it is unique if the chain is irreducible.
• Irreducible means that any set of states can be reached from any other state in a finite number of moves.
• An example of a reducible Markov chain: suppose p(x | y) = 0 for x ∈ A and y ∈ B, and vice versa.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 38: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Aperiodicity)
• A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state, say state 1, is 1.
• Think of recording the number of steps taken to return to state 1. The g.c.d. of those numbers should be 1.
• If the g.c.d. is bigger than 1, say 2, then the chain will return in cycles of 2, 4, 6, ... steps. This is not allowed for aperiodicity.
• The definition can be extended to the general state-space case.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 39: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Ergodicity)
• Assume the Markov chain has a stationary distribution π(x) and is aperiodic and irreducible; then the ergodic theorem applies:
$$\frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right) \to E_\pi[h(X)] \qquad \text{as } N \to \infty$$
• For such chains, under a suitable moment condition on h, a central limit theorem also holds, and convergence occurs geometrically.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 40: Lectures 13,14 – Model Inference and Averaging

Numerical Standard Error (nse)

• The numerical standard error (nse) of the Monte Carlo estimate is the standard deviation of its sampling distribution.
• In general, no simpler expression exists for the nse.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 41: Lectures 13,14 – Model Inference and Averaging

Numerical Standard Error (nse)
• If {h(X(t))} can be approximated as a first-order autoregressive process, then
$$\operatorname{nse} \approx \sqrt{\frac{\operatorname{var}\!\left[h(X)\right]}{N}}\;\sqrt{\frac{1+\rho}{1-\rho}},$$
where ρ is the lag-1 autocorrelation of {h(X(t))}.
• The first factor is the usual term under independent sampling.
• The second factor is usually > 1; it is the penalty to be paid because a Markov chain has been used.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 42: Lectures 13,14 – Model Inference and Averaging

More on nse

• The nse may not be finite in general.
• It is finite if the chain converges geometrically.
• If the nse is finite, then we can make it as small as we like by increasing N.
• The ‘obvious’ estimator of the nse is not consistent.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 43: Lectures 13,14 – Model Inference and Averaging

Markov Chains -- Summary

• A Markov chain may have a stationary distribution.
• The stationary distribution is unique if the chain is irreducible.
• We can estimate nse’s if the chain is also geometrically convergent.
• Where does all this get us?

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 44: Lectures 13,14 – Model Inference and Averaging

MCMC

• How do we construct a Markov chain whose stationary distribution is our target distribution π(x)?
• Metropolis et al. (1953) showed how.
• The method was generalized by Hastings (1970).
• This is called the Metropolis-Hastings algorithm.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 45: Lectures 13,14 – Model Inference and Averaging

Metropolis-Hastings Algorithm
• At each iteration t:
• Step 1: Sample y ~ q(y | x(t)), where y is the candidate point and q(· | ·) is the proposal pdf.
• Step 2: With probability
$$\alpha\!\left(x^{(t)}, y\right) = \min\left\{1,\; \frac{\pi(y)\, q\!\left(x^{(t)} \mid y\right)}{\pi\!\left(x^{(t)}\right) q\!\left(y \mid x^{(t)}\right)}\right\}$$
set x(t+1) = y (acceptance); else set x(t+1) = x(t) (rejection). (A code sketch follows.)

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
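A minimal sketch of the algorithm with a Gaussian random-walk proposal; the target (a standard normal known only up to its normalizing constant), the proposal scale, and the chain length are assumptions for illustration.

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_iter=10_000, scale=1.0, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized log-density log_pi."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()          # Step 1: candidate y ~ q(y | x)
        # Step 2: accept with prob min{1, pi(y)q(x|y) / (pi(x)q(y|x))};
        # the symmetric random-walk proposal cancels, leaving pi(y)/pi(x)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y                             # acceptance
        chain[t] = x                          # on rejection the old state is kept
    return chain

chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)   # target N(0,1) up to a constant
print(chain.mean(), chain.var())                             # roughly 0 and 1
```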

Page 46: Lectures 13,14 – Model Inference and Averaging

Metropolis-Hastings Notes
• The normalizing constant in π(x) is not required to run the algorithm; it cancels in the ratio.
• If q(y | x) = π(y), then we obtain independent samples.
• Usually q is chosen so that q(y | x) is easy to sample from.
• Theoretically, any proposal density q(· | x) having the same support should work; however, some q’s are better than others.
• The induced Markov chains have the desirable properties under mild conditions on π(x).

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 47: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampler
• Gibbs sampling is a Monte Carlo sampling method.
• Suppose that x = (x1, x2, ..., xk) is k (≥ 2) dimensional.
• The Gibbs sampler uses what are called the full (or complete) conditional distributions
$$\pi\!\left(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k\right)$$
• Note that the full conditional is proportional to the joint distribution. Often this helps in finding it.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 48: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling (Cont’d)

• Sample or update in turn, but always use the most recent values.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 49: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling -- Algorithm

1. Take some initial values $U_k^{(0)}$, k = 1, 2, ..., K.
2. Repeat for t = 1, 2, ...: for k = 1, 2, ..., K generate $U_k^{(t)}$ from
$$\Pr\!\left(U_k^{(t)} \;\middle|\; U_1^{(t)}, \ldots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \ldots, U_K^{(t-1)}\right)$$
3. Continue step 2 until the joint distribution of $(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)})$ does not change.

Page 50: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling (Cont’d)

• In Bayesian Inference, the goal is to draw a sample from the joint posterior of the parameters given the data Z

• Gibbs sampling will be helpful if it is easy to sample from the conditional distribution of each parameter given the other parameters and Z

• An example – Gaussian mixture problem is described on the next slide

Page 51: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling - Example

1. Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$.
2. Repeat for t = 1, 2, ...
a) For i = 1, 2, ..., N generate $\Delta_i^{(t)} \in \{0, 1\}$ with $\Pr(\Delta_i^{(t)} = 1) = \hat{\gamma}_i(\theta^{(t)})$, from the equation given previously.
b) Set
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N}\left(1-\Delta_i^{(t)}\right) y_i}{\sum_{i=1}^{N}\left(1-\Delta_i^{(t)}\right)} \qquad \text{and} \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N}\Delta_i^{(t)}\, y_i}{\sum_{i=1}^{N}\Delta_i^{(t)}},$$
and generate $\mu_1^{(t)} \sim N(\hat{\mu}_1, \hat{\sigma}_1^2)$ and $\mu_2^{(t)} \sim N(\hat{\mu}_2, \hat{\sigma}_2^2)$.
3. Continue step 2 until the joint distribution of $(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ does not change. (A code sketch follows.)
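A minimal sketch of this sampler with the variances and mixing proportion held fixed, as in the simplified example; the fixed values, the initialization, and the toy data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def gibbs_mixture(y, sigma1, sigma2, pi, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    mu1, mu2 = np.min(y), np.max(y)                  # crude initial values mu1^(0), mu2^(0)
    trace = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Step 2(a): simulate the latent indicators Delta_i from Pr(Delta_i = 1 | theta, Z)
        p1 = (1 - pi) * norm.pdf(y, mu1, sigma1)
        p2 = pi * norm.pdf(y, mu2, sigma2)
        delta = rng.uniform(size=y.size) < p2 / (p1 + p2)
        # Step 2(b): weighted means of the two groups, then draw new mu1, mu2 around them
        mu1 = rng.normal(y[~delta].mean(), sigma1)
        mu2 = rng.normal(y[delta].mean(), sigma2)
        trace[t] = (mu1, mu2)
    return trace

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 120), rng.normal(4, 1, 80)])   # toy mixture data
trace = gibbs_mixture(y, sigma1=1.0, sigma2=1.0, pi=0.4)
print(trace[100:].mean(axis=0))          # averages after discarding 100 burn-in draws
```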

Page 52: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling – Example (Cont’d)

• There is a close connection between Gibbs sampling and EM.
• The key is to consider the latent variable Zm from EM as another parameter for the Gibbs sampler.
• For simplicity, fix the variances σ1², σ2² and the mixing proportion π; the unknowns are then μ1 and μ2.
• The first steps are like EM; in step 2(a), Gibbs simulates the latent data Δi from the distribution Pr(Δi | θ, Z) rather than computing the responsibilities.
• In step 2(b), rather than computing the maximizer of the posterior Pr(μ1, μ2, Δ | Z), we simulate from the conditional distribution Pr(μ1, μ2 | Δ, Z).
• The example was simplified; in reality the variances and the mixing proportion should also be taken into account.

Page 53: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling – Example (Cont’d)
• 200 iterations of Gibbs sampling.
• The values seem to stabilize quickly, and are distributed evenly around the ML value.

Page 54: Lectures 13,14 – Model Inference and Averaging

Bagging
• Some classification and regression methods are unstable:
– Small perturbations in their training sets or in their construction may result in large changes in the constructed predictor.
– E.g., subset selection methods in regression, decision trees.
• Unstable methods can improve their accuracy by perturbing and combining: generate multiple versions of the predictor by perturbing the training set or the construction method, and then combine the versions into a single predictor.
• E.g., the bagging estimate for regression is
$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$$
• Bagging for classification: combine the B predictors by voting. (A code sketch follows.)
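A minimal sketch of bagging for regression, using scikit-learn decision trees as the unstable base learner (an assumed choice; the slides do not prescribe one) and toy data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_test, B=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)             # bootstrap sample Z*_b
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)              # f*_b(x)
    return preds.mean(axis=0)                        # f_bag(x): average of the B predictors

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)   # toy regression data
X_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_bag = bagged_predict(X, y, X_test)
```

For classification, the same loop would collect the B class predictions and return the majority vote at each test point.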

Page 55: Lectures 13,14 – Model Inference and Averaging

Bagging
• The bagging estimate will differ from the original estimate $\hat{f}(x)$ when the latter is a nonlinear or adaptive function of the data.
• Bagging can be used to combine different models for a prediction task.
• It is used heavily with trees (discussed in the next section).
• If we think of the bootstrap values of an estimator as approximate posterior values of a corresponding parameter from a nonparametric Bayesian analysis, then bagging is just the (approximate) posterior Bayesian mean.

Page 56: Lectures 13,14 – Model Inference and Averaging

Model Averaging and Stacking
• We have a set of candidate models $\mathcal{M}_m$, m = 1, ..., M, for the training set Z.
• The models may be of the same type with different parameter values, or different models for the same task.
• Suppose ζ is some quantity of interest, say a prediction f(x) at some fixed feature value x.
• The posterior distribution and posterior mean of ζ are
$$\Pr(\zeta \mid Z) = \sum_{m=1}^{M}\Pr\!\left(\zeta \mid \mathcal{M}_m, Z\right)\Pr\!\left(\mathcal{M}_m \mid Z\right)$$
$$E(\zeta \mid Z) = \sum_{m=1}^{M} E\!\left(\zeta \mid \mathcal{M}_m, Z\right)\Pr\!\left(\mathcal{M}_m \mid Z\right)$$
• This Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model.

Page 57: Lectures 13,14 – Model Inference and Averaging

Model Averaging and Stacking
• Committee methods take a simple unweighted average of the predictions from each model, giving equal probability to each of the models.
• In cases where the different models arise from the same parametric model with different parameter values, one could use BIC to weight the models.
– Recall that BIC gives weight to each model depending on how well it fits and how many parameters it uses.
• How does it work for nonparametric (frequentist) models?
– Answer: see Chapter 10; boosting is a strong method proposed for these cases.

Page 58: Lectures 13,14 – Model Inference and Averaging

Stochastic Search: Bumping
• Bumping is a technique for finding a better single model, and it again uses bootstrapping.
• It is suitable for models where the fitting procedure finds many local minima.
• Like bagging, draw bootstrap samples and fit a model to each.
• Rather than averaging, choose the model estimated from the bootstrap sample that best fits the original training data.
• Draw bootstrap samples Z*1, ..., Z*B and fit our model to each, giving predictions $\hat{f}^{*b}(x)$, b = 1, ..., B.
• Choose the model that gives the smallest prediction error, averaged over the original training set.
• E.g., for squared error, choose the model fit on bootstrap sample $\hat{b}$, where
$$\hat{b} = \arg\min_{b}\sum_{i=1}^{N}\left[y_i - \hat{f}^{*b}(x_i)\right]^2$$
(A code sketch follows.)
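A minimal sketch of bumping with the same kind of toy setup and decision trees as the bagging sketch (assumed choices); following the usual convention, the original training sample is included among the candidates so the procedure can never do worse on the training data than the ordinary fit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bump(X, y, B=20, max_depth=3, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    best_model, best_err = None, np.inf
    for b in range(B + 1):
        # b = 0 uses the original data; b >= 1 uses bootstrap samples Z*_b
        idx = np.arange(N) if b == 0 else rng.integers(0, N, size=N)
        model = DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], y[idx])
        err = np.mean((y - model.predict(X)) ** 2)   # squared error on the ORIGINAL training set
        if err < best_err:
            best_model, best_err = model, err
    return best_model

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
best_tree = bump(X, y)                               # the single "bumped" model
```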