
Page 1: Lectures 13,14 – Model Inference and Averaging

Lectures 13,14 – Model Inference and Averaging

Rice ECE697, Farinaz Koushanfar

Fall 2006

Page 2: Lectures 13,14 – Model Inference and Averaging

Summary

• Bootstrap and maximum likelihood (ML)
• Bayesian methods
• The expectation maximization (EM) algorithm
• MCMC for sampling from the posterior
• Bagging
• Model averaging

Page 3: Lectures 13,14 – Model Inference and Averaging

Empirical Distribution
• Suppose x1, ..., xN are the observed outcomes of N iid random variables following an unknown PDF.
• The empirical distribution: P(X = a) = count(xi = a)/N.
• The empirical estimate of a parameter is computed from the empirical distribution by the formula that defines the parameter in terms of its true distribution.
• For example, the empirical estimate of the variance is
$$\widehat{\operatorname{var}}(x) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$
• Empirical estimates are often biased; there is no guarantee that they have the best possible variance, or other good properties.

Page 4: Lectures 13,14 – Model Inference and Averaging

Bootstrap
• Let T(x1, ..., xN) be an estimator of θ.
• The bootstrap generates repeated estimates by generating repeated "fake" outcomes. Each fake outcome is generated by taking a random sample according to the empirical distribution P(X = a) = count(xi = a)/N.
• To generate fake outcomes, resample with replacement.
• By taking R random samples with replacement, we get R different bootstrap estimates of θ; call these B1, ..., BR.
• What do we use the R bootstrap estimates for? The most common use is a confidence interval (CI):
(1) Use the order statistics of the Br. E.g., for a 95% confidence interval, use B(2.5%) and B(97.5%) (a code sketch follows).
(2) If we know that the PDF of T(x1, ..., xN) is approximately Gaussian, base the CI on the sample variance of the bootstrap estimates,
$$\widehat{\operatorname{var}}(B) = \frac{1}{R-1}\sum_{r=1}^{R}\left(B_r - \bar{B}\right)^2$$
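A minimal NumPy sketch (not from the original slides) of the percentile-CI use of the bootstrap; the choice of estimator (the median), the toy data, and R = 1000 are assumptions for illustration.

```python
import numpy as np

def bootstrap_ci(x, T=np.median, R=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the estimator T applied to the sample x."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # R resamples with replacement from the empirical distribution of x
    estimates = np.array([T(rng.choice(x, size=N, replace=True)) for _ in range(R)])
    # Order statistics of the bootstrap estimates give the percentile interval
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return estimates, (lo, hi)

x = np.random.default_rng(1).normal(size=100)        # toy observed outcomes x1,...,xN
boot, (lo, hi) = bootstrap_ci(x)
print(f"95% percentile CI for the median: ({lo:.3f}, {hi:.3f})")
```

Under the Gaussian assumption in (2), one could instead report T(x) plus or minus 1.96 times the standard deviation of the bootstrap estimates in `boot`.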

Page 5: Lectures 13,14 – Model Inference and Averaging

Bootstrap-Example
• Training data Z = {z1, z2, ..., zN}, with zi = (xi, yi).
• Suppose we are fitting a cubic spline with 3 knots placed at quantiles of x (splines are one form of kernel basis function, centered at the knots).
• With the resulting seven basis functions h1(x), ..., h7(x), the model is
$$\mu(x) = \sum_{j=1}^{7}\beta_j h_j(x)$$

Page 6: Lectures 13,14 – Model Inference and Averaging

Bootstrap-Example (cont’d)

• Spline prediction: with $h(x)^T = (h_1(x), \ldots, h_7(x))$, the fitted curve is
$$\hat{\mu}(x) = \sum_{j=1}^{7}\hat{\beta}_j h_j(x) = h(x)^T\hat{\beta}$$
• Can think of μ(x) = E(Y | X = x).
• The usual estimate is $\hat{\beta} = (H^T H)^{-1} H^T y$, where H is the N × 7 matrix with ij-th element hj(xi).
• The estimated covariance and noise variance are
$$\widehat{\operatorname{Var}}(\hat{\beta}) = (H^T H)^{-1}\hat{\sigma}^2, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{\mu}(x_i)\right)^2,$$
giving the pointwise standard error
$$\widehat{\operatorname{se}}\left[\hat{\mu}(x)\right] = \left[h(x)^T (H^T H)^{-1} h(x)\right]^{1/2}\hat{\sigma}$$
• How do we apply the bootstrap to this example?

Page 7: Lectures 13,14 – Model Inference and Averaging

Bootstrap-Example (cont’d)
• Draw B datasets with replacement from the training data: Z*b = {z*1, ..., z*N}, with z*i = (x*i, y*i).
• To each bootstrap sample, fit a cubic spline $\hat{\mu}^*(x)$.
• Example: 10 bootstrap samples (left figure) and the resulting pointwise CI (right figure); a code sketch follows.
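A minimal sketch of the nonparametric bootstrap for this example, assuming a truncated-power cubic basis with three knots at the quartiles of x; the toy data, the knot placement, and B = 200 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def basis(x, knots):
    """Truncated-power cubic basis h1(x),...,h7(x) with three knots."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)                      # N x 7 matrix H

rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)     # toy training data Z
knots = np.quantile(x, [0.25, 0.50, 0.75])
xg = np.linspace(0, 3, 200)                           # grid on which to evaluate mu*(x)

B = 200
curves = np.empty((B, xg.size))
for b in range(B):
    idx = rng.integers(0, N, size=N)                  # resample the pairs (x_i, y_i)
    Hb = basis(x[idx], knots)
    beta_b, *_ = np.linalg.lstsq(Hb, y[idx], rcond=None)
    curves[b] = basis(xg, knots) @ beta_b             # bootstrap spline mu*_b(x)

# Pointwise 95% confidence band from the bootstrap curves
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
```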

Page 8: Lectures 13,14 – Model Inference and Averaging

Least Square and Bootstrap - Example

Page 9: Lectures 13,14 – Model Inference and Averaging

Least Squares, Bootstrap, and ML
• The previous example was the nonparametric bootstrap.
• Suppose that the error is Gaussian: ε ~ N(0, σ²).
• In the parametric bootstrap, we draw samples by adding Gaussian noise to the predicted values:
$$y_i^* = \hat{\mu}(x_i) + \varepsilon_i^*, \qquad \varepsilon_i^* \sim N(0, \hat{\sigma}^2), \quad i = 1, \ldots, N$$
• The process is repeated B times, re-computing the spline on each sample; the CI from this method will be exactly the least squares bands!
• The function estimated from the bootstrap samples has the distribution
$$\hat{\mu}^*(x) \sim N\!\left(\hat{\mu}(x),\; h(x)^T (H^T H)^{-1} h(x)\,\hat{\sigma}^2\right)$$
(A code sketch follows.)
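A minimal continuation of the spline bootstrap sketch above, showing the parametric bootstrap; it assumes the objects `basis`, `x`, `y`, `knots`, `xg`, `rng`, `N`, `B` (and the `numpy as np` import) defined there.

```python
H = basis(x, knots)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)      # least squares / ML fit
mu_hat = H @ beta_hat
sigma_hat = np.sqrt(np.sum((y - mu_hat) ** 2) / N)    # noise standard deviation estimate

curves_p = np.empty((B, xg.size))
for b in range(B):
    y_star = mu_hat + rng.normal(scale=sigma_hat, size=N)   # add Gaussian noise to the fit
    beta_b, *_ = np.linalg.lstsq(H, y_star, rcond=None)     # re-fit the spline to y*
    curves_p[b] = basis(xg, knots) @ beta_b
```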

Page 10: Lectures 13,14 – Model Inference and Averaging

Maximum Likelihood (ML) Inference
• In general, the bootstrap estimate agrees not with least squares but with ML.
• Specify a probability density (or mass) function for the observations: zi ~ gθ(z).
• ML is based on the likelihood function
$$L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i)$$
• The logarithm of L(θ; Z) is denoted l(θ; Z) and is called the log-likelihood function:
$$l(\theta; Z) = \sum_{i=1}^{N} l(\theta; z_i) = \sum_{i=1}^{N} \log g_\theta(z_i)$$
• ML chooses the value of θ that maximizes l(θ; Z).

Page 11: Lectures 13,14 – Model Inference and Averaging

ML – Some Definitions

• The score function is
$$\dot{l}(\theta; Z) = \sum_{i=1}^{N} \dot{l}(\theta; z_i), \qquad \text{where}\quad \dot{l}(\theta; z_i) = \partial l(\theta; z_i)/\partial\theta$$
• Assuming that the maximum is in the interior of the parameter space, the score is 0 at the ML estimate.
• The information matrix is
$$I(\theta) = -\sum_{i=1}^{N}\frac{\partial^2 l(\theta; z_i)}{\partial\theta\,\partial\theta^T}$$
• I(θ) evaluated at the ML estimate $\hat{\theta}$ is the observed information.
• The Fisher information (or expected information) is
$$i(\theta) = E\left[I(\theta)\right]$$

Page 12: Lectures 13,14 – Model Inference and Averaging

ML – Some More Results

• Assume independent sampling from gθ(z).
• The sampling distribution of the ML estimator has a limiting normal distribution (as N → ∞),
$$\hat{\theta} \to N\!\left(\theta_0,\; i(\theta_0)^{-1}\right),$$
which is approximated by $N(\hat{\theta}, i(\hat{\theta})^{-1})$ or $N(\hat{\theta}, I(\hat{\theta})^{-1})$.
• The standard error of the estimate of θj is
$$\sqrt{\left[i(\hat{\theta})^{-1}\right]_{jj}} \quad \text{or} \quad \sqrt{\left[I(\hat{\theta})^{-1}\right]_{jj}}$$
• A confidence interval for θj has the form
$$\hat{\theta}_j \pm z^{(1-\alpha)}\sqrt{\left[i(\hat{\theta})^{-1}\right]_{jj}} \quad \text{or} \quad \hat{\theta}_j \pm z^{(1-\alpha)}\sqrt{\left[I(\hat{\theta})^{-1}\right]_{jj}}$$

Page 13: Lectures 13,14 – Model Inference and Averaging

ML for our smoothing example

• The parameters are θ = (β, σ²); the log-likelihood is
$$l(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - h(x_i)^T\beta\right)^2$$
• The ML estimate is obtained by setting ∂l(θ)/∂β = 0 and ∂l(θ)/∂σ² = 0, which gives
$$\hat{\beta} = (H^T H)^{-1} H^T y \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{\mu}(x_i)\right)^2$$
• The information matrix for θ = (β, σ²) is block-diagonal; the block corresponding to β is
$$I(\beta) = H^T H / \sigma^2$$

Page 14: Lectures 13,14 – Model Inference and Averaging

Bayesian Approach to Inference
• Specify a sampling model Pr(Z|θ), the pdf of the data given the parameters, and a prior distribution Pr(θ) reflecting our knowledge about θ before we see the new data.
• The posterior distribution corresponds to our updated knowledge about θ after we see the new data:
$$\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\Pr(\theta)}{\Pr(Z)} = \frac{\Pr(Z \mid \theta)\Pr(\theta)}{\int \Pr(Z \mid \theta)\Pr(\theta)\,d\theta}$$
• The difference between Bayesian and standard inference is that the Bayesian approach expresses the uncertainty before seeing the data (the prior) and the uncertainty remaining after seeing it (the posterior).

Page 15: Lectures 13,14 – Model Inference and Averaging

Bayesian Approach (Cont’d)

• Predict the value of a future observation via the predictive distribution
$$\Pr\!\left(z^{new} \mid Z\right) = \int \Pr\!\left(z^{new} \mid \theta\right)\Pr(\theta \mid Z)\,d\theta$$
• ML would use Pr(z^new | $\hat{\theta}$) to predict future data; unlike the predictive distribution, this does not account for the uncertainty in estimating θ.

Page 16: Lectures 13,14 – Model Inference and Averaging

Bayesian Approach on Our Example
• Parametric model:
$$\mu(x) = \sum_{j=1}^{7}\beta_j h_j(x) = h(x)^T\beta$$
• Assume that σ² is known and that the randomness comes only from the variation of y around μ(x).
• With a finite number of basis functions, put the prior on the distribution of the coefficients, β ~ N(0, τΣ); then μ(x) is Gaussian with covariance kernel
$$K(x, x') = \operatorname{cov}\!\left(\mu(x), \mu(x')\right) = \tau\, h(x)^T \Sigma\, h(x')$$
• The posterior distribution for β is also Gaussian, with
$$E(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} H^T y, \qquad \operatorname{Cov}(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1}\sigma^2$$

Page 17: Lectures 13,14 – Model Inference and Averaging

Example (Cont’d)

• The corresponding posterior for μ(x) is
$$E\!\left(\mu(x) \mid Z\right) = h(x)^T\left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} H^T y$$
$$\operatorname{Cov}\!\left(\mu(x), \mu(x') \mid Z\right) = h(x)^T\left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} h(x')\,\sigma^2$$
• How do we choose the prior covariance? Take Σ = I, so that the prior is controlled by the single scale τ (a code sketch follows).
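A minimal sketch of these posterior formulas with Σ = I, reusing `basis`, `x`, `y`, `knots`, `xg` (and the `numpy as np` import) from the earlier spline sketches; the prior scale `tau` and the plug-in value for σ² are assumptions for illustration.

```python
H = basis(x, knots)
beta_ls, *_ = np.linalg.lstsq(H, y, rcond=None)
sigma2 = np.sum((y - H @ beta_ls) ** 2) / len(y)             # plug-in value for sigma^2

tau = 1.0                                                    # assumed prior variance scale
A = H.T @ H + (sigma2 / tau) * np.eye(H.shape[1])
post_mean_beta = np.linalg.solve(A, H.T @ y)                 # E(beta | Z)
post_cov_beta = np.linalg.inv(A) * sigma2                    # Cov(beta | Z)

Hg = basis(xg, knots)
post_mean_mu = Hg @ post_mean_beta                           # E(mu(x) | Z) on the grid
post_var_mu = np.sum((Hg @ post_cov_beta) * Hg, axis=1)      # pointwise Var(mu(x) | Z)
```

As τ grows, the term σ²/τ vanishes and the posterior mean approaches the least squares fit, which is why the large-τ posterior curves on the next slide resemble the bootstrap curves.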

Page 18: Lectures 13,14 – Model Inference and Averaging

Example (Cont’d)
• Let’s take a look at the posterior curves and see the impact of the prior on the posterior.
• [Figure: sampled posterior curves; annotation: “Looks like bootstrap curves”.]

Page 19: Lectures 13,14 – Model Inference and Averaging

Bayes Inference – Example (from Wikipedia)
• Suppose we wish to know the proportion r of voters in a large population who will vote "yes" in a referendum.
• Let n be the number of voters in a random sample (chosen with replacement, so that we have statistical independence).
• Let m be the number of voters in that random sample who will vote "yes".
• Suppose that we observe n = 10 voters and m = 7 say they will vote yes. From Bayes' theorem,
$$f(r \mid m = 7, n = 10) = \frac{f(m = 7 \mid r, n = 10)\, f(r)}{\int_0^1 f(m = 7 \mid r, n = 10)\, f(r)\, dr}$$

Page 20: Lectures 13,14 – Model Inference and Averaging

Example from Wikipedia (Cont’d)

• From this we see that from the prior probability density function f(r) and the likelihood function L(r) = f(m = 7|r, n = 10), we can compute the posterior pdf

• f(r) summarizes what we know about the distribution of r in the absence of any observation.

• We provisionally assume in this case that the prior distribution of r is uniform over the interval [0, 1]. That is, f(r) = 1.

• If some additional background information is found, we should modify the prior accordingly. However, before we have any observations, all outcomes are taken to be equally likely.

Page 21: Lectures 13,14 – Model Inference and Averaging

Example from Wikipedia (Cont’d)
• Assuming random sampling, the likelihood function L(r) = f(m = 7 | r, n = 10) is just the probability of 7 successes in 10 trials for a binomial distribution:
$$L(r) = \binom{10}{7} r^7 (1-r)^3$$
• As with the prior, the likelihood is open to revision; more complex assumptions will yield more complex likelihood functions. Maintaining the current assumptions, we compute the normalizing factor
$$\int_0^1 \binom{10}{7} r^7 (1-r)^3\, dr = \frac{1}{11}$$
• For r ∈ [0, 1], the posterior distribution for r is then
$$f(r \mid m = 7, n = 10) = 11\binom{10}{7} r^7 (1-r)^3 = 1320\, r^7 (1-r)^3$$

Page 22: Lectures 13,14 – Model Inference and Averaging

Example from Wikipedia (Cont’d)
• One may be interested in the probability that more than half the voters will vote "yes".
• The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution.
• In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll (that seven of the 10 voters questioned will vote "yes"), is
$$\Pr\!\left(r > \tfrac{1}{2} \,\middle|\, m = 7, n = 10\right) = \int_{1/2}^{1} 1320\, r^7 (1-r)^3\, dr \approx 0.887$$
(a computational check follows).
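A minimal SciPy check of this calculation: with the uniform prior, the posterior of r is the Beta(m + 1, n − m + 1) = Beta(8, 4) distribution, so the probability above is its upper tail at 1/2.

```python
from scipy.stats import beta

m, n = 7, 10
posterior = beta(m + 1, n - m + 1)     # Beta(8, 4) posterior under the uniform prior
print(posterior.sf(0.5))               # Pr(r > 1/2 | data) ~ 0.887, versus the prior value 1/2
```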

Page 23: Lectures 13,14 – Model Inference and Averaging

Expectation Maximization (EM) Algorithm

• The EM algorithm is used to simplify difficult ML problems; we will show an example first.
• Assume we are fitting a simple mixture model.

Page 24: Lectures 13,14 – Model Inference and Averaging

EM - Example
• Model: Y is a mixture of two normal distributions, Y1 ~ N(μ1, σ1²) and Y2 ~ N(μ2, σ2²).
• Y = (1 − Δ)·Y1 + Δ·Y2, where Δ ∈ {0, 1} and Pr(Δ = 1) = π.
• Let φθ(y) denote the Gaussian density with parameters θ = (μ, σ²); the pdf of Y is
$$g_Y(y) = (1-\pi)\,\varphi_{\theta_1}(y) + \pi\,\varphi_{\theta_2}(y)$$
• The parameters of the ML problem for the mixture model are θ = (π, μ1, σ1², μ2, σ2²).
• The log-likelihood on N training cases is
$$l(\theta; Z) = \sum_{i=1}^{N}\log\!\left[(1-\pi)\,\varphi_{\theta_1}(y_i) + \pi\,\varphi_{\theta_2}(y_i)\right]$$

Page 25: Lectures 13,14 – Model Inference and Averaging

EM - Example
• Direct maximization is numerically difficult because of the sum of terms inside the log.
• A simpler approach is to fix some of the values, solve for the others, and iteratively update. This is the core of the EM algorithm.
• Expectation step: do a soft assignment of each observation to each model, i.e., how much each model is responsible for explaining a data point.
– E.g., the responsibility of model 2 for observation i can be written as
$$\hat{\gamma}_i(\theta) = E(\Delta_i \mid \theta, Z) = \Pr(\Delta_i = 1 \mid \theta, Z)$$
• Maximization step: the responsibilities are used in a weighted ML fit to update the parameter estimates.

Page 26: Lectures 13,14 – Model Inference and Averaging

EM Algorithm for 2-Component Gaussian Mixture Model

1. Take initial guesses for the parameters $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$.
2. Expectation step: compute the responsibilities
$$\hat{\gamma}_i = \frac{\hat{\pi}\,\varphi_{\hat{\theta}_2}(y_i)}{(1-\hat{\pi})\,\varphi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\varphi_{\hat{\theta}_2}(y_i)}, \qquad i = 1, 2, \ldots, N$$
3. Maximization step: compute the weighted means and variances
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N}(1-\hat{\gamma}_i)\,y_i}{\sum_{i=1}^{N}(1-\hat{\gamma}_i)}, \qquad \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N}(1-\hat{\gamma}_i)\,(y_i-\hat{\mu}_1)^2}{\sum_{i=1}^{N}(1-\hat{\gamma}_i)},$$
$$\hat{\mu}_2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i\,y_i}{\sum_{i=1}^{N}\hat{\gamma}_i}, \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N}\hat{\gamma}_i\,(y_i-\hat{\mu}_2)^2}{\sum_{i=1}^{N}\hat{\gamma}_i},$$
and the mixing probability $\hat{\pi} = \sum_{i=1}^{N}\hat{\gamma}_i / N$.
4. Iterate steps 2 and 3 (E and M) until convergence. (A code sketch follows.)
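A minimal NumPy/SciPy sketch of the algorithm above; the initialization, the fixed number of iterations, and the toy data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initial guesses: two distinct data points as means, overall variance, pi = 0.5
    mu1, mu2 = rng.choice(y, 2, replace=False)
    var1 = var2 = np.var(y)
    pi = 0.5
    for _ in range(n_iter):
        # 2. Expectation step: responsibility of component 2 for each observation
        p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(var1))
        p2 = pi * norm.pdf(y, mu2, np.sqrt(var2))
        gamma = p2 / (p1 + p2)
        # 3. Maximization step: weighted means, variances and mixing probability
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = np.mean(gamma)
    return pi, mu1, var1, mu2, var2

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 120), rng.normal(4, 1, 80)])   # toy mixture data
print(em_two_gaussians(y))
```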

Page 27: Lectures 13,14 – Model Inference and Averaging

EM – Example (cont’d)

Selected iterations of the EM algorithm for the mixture example (estimate of the mixing proportion $\hat{\pi}$):

Iteration    $\hat{\pi}$
1            0.485
5            0.493
10           0.523
15           0.544
20           0.546

Page 28: Lectures 13,14 – Model Inference and Averaging

EM – When is it Useful?

• In the hidden Markov model literature, EM is known as the Baum-Welch algorithm.
• EM is used for maximizing the likelihood for a certain class of problems where direct ML is difficult.
• Data augmentation: the observed data set is enlarged with latent (unobserved) variables.
– In the previous example, the unobserved variable was Δi.
– EM is widely used when a part of the actual data is missing: treat the missing data as latent variables.

Page 29: Lectures 13,14 – Model Inference and Averaging

EM Algorithm – General Case

• Assume that the observed data is Z, with log-likelihood l(θ; Z) depending on the parameter θ.
• The missing (latent) data is Zm; the complete data is T = (Z, Zm), with log-likelihood l0(θ; T), where l0 is based on the complete-data density.
• In the mixture problem, T = (Z, Zm) = (y, Δ).
• Since
$$\Pr\!\left(Z^m \mid Z, \theta'\right) = \frac{\Pr\!\left(Z^m, Z \mid \theta'\right)}{\Pr\!\left(Z \mid \theta'\right)}, \qquad \text{i.e.} \qquad \Pr\!\left(Z \mid \theta'\right) = \frac{\Pr\!\left(T \mid \theta'\right)}{\Pr\!\left(Z^m \mid Z, \theta'\right)},$$
we have
$$l(\theta'; Z) = l_0(\theta'; T) - l_1\!\left(\theta'; Z^m \mid Z\right)$$
• And l1 is based on the conditional density Pr(Zm | Z, θ').

Page 30: Lectures 13,14 – Model Inference and Averaging

EM Algorithm – General Case

1. Start with an initial guess for the parameter, $\hat{\theta}^{(0)}$.
2. Expectation step: at the j-th step, compute
$$Q\!\left(\theta', \hat{\theta}^{(j)}\right) = E\!\left[l_0(\theta'; T) \mid Z, \hat{\theta}^{(j)}\right]$$
as a function of the dummy argument θ'.
3. Maximization step: determine the new estimate $\hat{\theta}^{(j+1)}$ as the maximizer of $Q(\theta', \hat{\theta}^{(j)})$ over θ'.
4. Iterate steps 2 and 3 (E and M) until convergence.

Page 31: Lectures 13,14 – Model Inference and Averaging

Why EM Works
• Remember that, in terms of the log-likelihood,
$$l(\theta'; Z) = l_0(\theta'; T) - l_1\!\left(\theta'; Z^m \mid Z\right),$$
where l0 is based on the complete-data density and l1 is based on the conditional density Pr(Zm | Z, θ').
• Taking the expectation with respect to the distribution of T | Z governed by the parameter θ gives
$$l(\theta'; Z) = E\!\left[l_0(\theta'; T) \mid Z, \theta\right] - E\!\left[l_1(\theta'; Z^m \mid Z) \mid Z, \theta\right] \equiv Q(\theta', \theta) - R(\theta', \theta)$$
• R(θ*, θ) is the expectation of the log-likelihood of a density (indexed by θ*) with respect to the same density indexed by θ; as a function of θ*, it is maximized when θ* = θ (Jensen's inequality). Hence, if θ' maximizes Q(θ', θ),
$$l(\theta'; Z) - l(\theta; Z) = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right] \geq 0$$
• Thus, the EM iteration never decreases the log-likelihood!

Page 32: Lectures 13,14 – Model Inference and Averaging

MCMC Methods
• Slides (for MCMC) mostly borrowed from Sujit Sahu, www.maths.soton.ac.uk/staff/Sahu/
• Assume that we want to find
$$I = \int_a^b f(x)\,dx$$
• If the interval [a, b] is divided so that a = x0 < x1 < ... < xN = b, the integral can be approximated by
$$I \approx \sum_{i=0}^{N-1} f\!\left(\frac{x_i + x_{i+1}}{2}\right)\left(x_{i+1} - x_i\right)$$
• Now assume we need to find an expectation
$$E[h(X)] = \int h(x)\,\pi(x)\,dx \qquad \left(\text{assuming } \int |h(x)|\,\pi(x)\,dx < \infty\right)$$
• This can be difficult!

Page 33: Lectures 13,14 – Model Inference and Averaging

MCMC for Integration (Cont’d)

• If we can draw samples X(1), ..., X(N) ~ π(x), then we can estimate
$$E[h(X)] \approx \frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right)$$
• This is Monte Carlo (MC) integration (a code sketch follows).
• We have changed notation: N now denotes the number of samples drawn.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
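A minimal sketch of Monte Carlo integration; the target density π = N(0, 1) and the function h(x) = x² are assumed toy choices, so the true expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
samples = rng.normal(size=N)           # independent draws X^(1),...,X^(N) from pi(x)
estimate = np.mean(samples ** 2)       # (1/N) * sum_t h(X^(t))
print(estimate)                        # close to E[X^2] = 1
```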

Page 34: Lectures 13,14 – Model Inference and Averaging

Consistency of the integral estimate

• For independent samples, by the Law of Large Numbers,
$$\frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right) \to E[h(X)] \qquad \text{as } N \to \infty$$
• But independent sampling from π(x) can be difficult.
• It turns out that the above convergence still holds if we generate the samples using a (suitable) Markov chain.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 35: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Review)
• A Markov chain is generated by sampling X(t+1) ~ p(x | x(t)), t = 0, 1, 2, ...
• Here p is the transition kernel.
• So X(t+1) depends only on X(t), not on X(0), X(1), ..., X(t−1).
• For example, X(t+1) = 0.5 X(t) + εt with iid noise εt.
• This is called a first-order autoregressive process with lag-1 autocorrelation 0.5 (a simulation sketch follows).

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
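A minimal simulation sketch of such a chain, assuming standard normal innovations (the exact noise variance of the slides' example is not shown in this transcript).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
x = np.empty(T)
x[0] = 5.0                               # arbitrary starting value x^(0)
for t in range(T - 1):
    # the next state depends only on the current state (Markov property)
    x[t + 1] = 0.5 * x[t] + rng.normal()

# The chain forgets x^(0); with N(0,1) innovations its stationary
# distribution is N(0, 1/(1 - 0.25)), i.e. variance 4/3.
print(x[1000:].mean(), x[1000:].var())
```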

Page 36: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Stationary)

• As t → ∞, the Markov chain converges (in distribution) to its stationary (invariant) distribution.
• In the example before, this is a normal distribution, and it does not depend on x(0).
• Does this happen for all Markov chains?

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 37: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Irreducibility)

• Assuming that a stationary distribution exists, it is unique if the chain is irreducible.
• Irreducible means that any set of states can be reached from any other state in a finite number of moves.
• An example of a reducible Markov chain: suppose p(x | y) = 0 for x ∈ A and y ∈ B, and vice versa.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 38: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Aperiodicity)
• A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state, say state 1, is 1.
• Think of recording the number of steps taken to return to state 1. The g.c.d. of those numbers should be 1.
• If the g.c.d. is bigger than 1, say 2, then the chain will return in cycles of 2, 4, 6, ... steps. This is not allowed for aperiodicity.
• The definition can be extended to the general state-space case.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 39: Lectures 13,14 – Model Inference and Averaging

Markov Chains (Ergodicity)
• Assume the Markov chain has a stationary distribution π(x) and is aperiodic and irreducible; then the ergodic theorem applies:
$$\frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right) \to E_\pi[h(X)] \qquad \text{as } N \to \infty$$
• For such chains, under a suitable moment condition on h, a central limit theorem also holds, and convergence occurs geometrically.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 40: Lectures 13,14 – Model Inference and Averaging

Numerical Standard Error (nse)

• The numerical standard error (nse) of the Monte Carlo estimate is the standard deviation of its sampling distribution.
• In general, no simpler expression exists for the nse.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 41: Lectures 13,14 – Model Inference and Averaging

Numerical Standard Error (nse)
• If {h(X(t))} can be approximated as a first-order autoregressive process, then
$$\operatorname{nse} \approx \sqrt{\frac{\operatorname{var}\!\left[h(X)\right]}{N}}\;\sqrt{\frac{1+\rho}{1-\rho}},$$
where ρ is the lag-1 autocorrelation of {h(X(t))}.
• The first factor is the usual term under independent sampling.
• The second factor is usually > 1; it is the penalty to be paid because a Markov chain has been used.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 42: Lectures 13,14 – Model Inference and Averaging

More on nse

• The nse may not be finite in general.
• It is finite if the chain converges geometrically.
• If the nse is finite, then we can make it as small as we like by increasing N.
• The ‘obvious’ estimator of the nse is not consistent.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 43: Lectures 13,14 – Model Inference and Averaging

Markov Chains -- Summary

• A Markov chain may have a stationary distribution.
• The stationary distribution is unique if the chain is irreducible.
• We can estimate nse’s if the chain is also geometrically convergent.
• Where does all this get us?

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 44: Lectures 13,14 – Model Inference and Averaging

MCMC

• How do we construct a Markov chain whose stationary distribution is our target distribution π(x)?
• Metropolis et al. (1953) showed how.
• The method was generalized by Hastings (1970).
• This is called the Metropolis-Hastings algorithm.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 45: Lectures 13,14 – Model Inference and Averaging

Metropolis-Hastings Algorithm
• At each iteration t:
• Step 1: Sample y ~ q(y | x(t)), where y is the candidate point and q(· | ·) is the proposal pdf.
• Step 2: With probability
$$\alpha\!\left(x^{(t)}, y\right) = \min\left\{1,\; \frac{\pi(y)\, q\!\left(x^{(t)} \mid y\right)}{\pi\!\left(x^{(t)}\right) q\!\left(y \mid x^{(t)}\right)}\right\}$$
set x(t+1) = y (acceptance); else set x(t+1) = x(t) (rejection). (A code sketch follows.)

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
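A minimal sketch of the algorithm with a Gaussian random-walk proposal; the target (a standard normal known only up to its normalizing constant), the proposal scale, and the chain length are assumptions for illustration.

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_iter=10_000, scale=1.0, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized log-density log_pi."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()          # Step 1: candidate y ~ q(y | x)
        # Step 2: accept with prob min{1, pi(y)q(x|y) / (pi(x)q(y|x))};
        # the symmetric random-walk proposal cancels, leaving pi(y)/pi(x)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y                             # acceptance
        chain[t] = x                          # on rejection the old state is kept
    return chain

chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)   # target N(0,1) up to a constant
print(chain.mean(), chain.var())                             # roughly 0 and 1
```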

Page 46: Lectures 13,14 – Model Inference and Averaging

Metropolis-Hastings Notes
• The normalizing constant in π(x) is not required to run the algorithm; it cancels in the ratio.
• If q(y | x) = π(y), then we obtain independent samples.
• Usually q is chosen so that q(y | x) is easy to sample from.
• Theoretically, any proposal density q(· | x) having the same support should work; however, some q’s are better than others.
• The induced Markov chains have the desirable properties under mild conditions on π(x).

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 47: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampler
• Gibbs sampling is a Monte Carlo sampling method.
• Suppose that x = (x1, x2, ..., xk) is k (≥ 2) dimensional.
• The Gibbs sampler uses what are called the full (or complete) conditional distributions
$$\pi\!\left(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k\right)$$
• Note that the full conditional is proportional to the joint distribution. Often this helps in finding it.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 48: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling (Cont’d)

• Sample or update in turn, but always use the most recent values.

Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

Page 49: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling -- Algorithm

1. Take some initial values $U_k^{(0)}$, k = 1, 2, ..., K.
2. Repeat for t = 1, 2, ...: for k = 1, 2, ..., K generate $U_k^{(t)}$ from
$$\Pr\!\left(U_k^{(t)} \;\middle|\; U_1^{(t)}, \ldots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \ldots, U_K^{(t-1)}\right)$$
3. Continue step 2 until the joint distribution of $(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)})$ does not change.

Page 50: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling (Cont’d)

• In Bayesian Inference, the goal is to draw a sample from the joint posterior of the parameters given the data Z

• Gibbs sampling will be helpful if it is easy to sample from the conditional distribution of each parameter given the other parameters and Z

• An example – Gaussian mixture problem is described on the next slide

Page 51: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling - Example

1. Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$.
2. Repeat for t = 1, 2, ...
a) For i = 1, 2, ..., N generate $\Delta_i^{(t)} \in \{0, 1\}$ with $\Pr(\Delta_i^{(t)} = 1) = \hat{\gamma}_i(\theta^{(t)})$, from the equation given previously.
b) Set
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N}\left(1-\Delta_i^{(t)}\right) y_i}{\sum_{i=1}^{N}\left(1-\Delta_i^{(t)}\right)} \qquad \text{and} \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N}\Delta_i^{(t)}\, y_i}{\sum_{i=1}^{N}\Delta_i^{(t)}},$$
and generate $\mu_1^{(t)} \sim N(\hat{\mu}_1, \hat{\sigma}_1^2)$ and $\mu_2^{(t)} \sim N(\hat{\mu}_2, \hat{\sigma}_2^2)$.
3. Continue step 2 until the joint distribution of $(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ does not change. (A code sketch follows.)
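A minimal sketch of this sampler with the variances and mixing proportion held fixed, as in the simplified example; the fixed values, the initialization, and the toy data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def gibbs_mixture(y, sigma1, sigma2, pi, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    mu1, mu2 = np.min(y), np.max(y)                  # crude initial values mu1^(0), mu2^(0)
    trace = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Step 2(a): simulate the latent indicators Delta_i from Pr(Delta_i = 1 | theta, Z)
        p1 = (1 - pi) * norm.pdf(y, mu1, sigma1)
        p2 = pi * norm.pdf(y, mu2, sigma2)
        delta = rng.uniform(size=y.size) < p2 / (p1 + p2)
        # Step 2(b): weighted means of the two groups, then draw new mu1, mu2 around them
        mu1 = rng.normal(y[~delta].mean(), sigma1)
        mu2 = rng.normal(y[delta].mean(), sigma2)
        trace[t] = (mu1, mu2)
    return trace

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 120), rng.normal(4, 1, 80)])   # toy mixture data
trace = gibbs_mixture(y, sigma1=1.0, sigma2=1.0, pi=0.4)
print(trace[100:].mean(axis=0))          # averages after discarding 100 burn-in draws
```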

Page 52: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling – Example (Cont’d)

• There is a close connection between Gibbs sampling and EM.
• The key is to consider the latent variable Zm from EM as another parameter for the Gibbs sampler.
• For simplicity, fix the variances σ1², σ2² and the mixing proportion π; the unknowns are then μ1 and μ2.
• The first steps are like EM; in step 2(a), Gibbs simulates the latent data Δi from the distribution Pr(Δi | θ, Z) rather than computing the responsibilities.
• In step 2(b), rather than computing the maximizer of the posterior Pr(μ1, μ2, Δ | Z), we simulate from the conditional distribution Pr(μ1, μ2 | Δ, Z).
• The example was simplified; in reality the variances and the mixing proportion should also be taken into account.

Page 53: Lectures 13,14 – Model Inference and Averaging

Gibbs Sampling – Example (Cont’d)
• 200 iterations of Gibbs sampling.
• The values seem to stabilize quickly, and are distributed evenly around the ML value.

Page 54: Lectures 13,14 – Model Inference and Averaging

Bagging
• Some classification and regression methods are unstable:
– Small perturbations in their training sets or in their construction may result in large changes in the constructed predictor.
– E.g., subset selection methods in regression, decision trees.
• Unstable methods can improve their accuracy by perturbing and combining: generate multiple versions of the predictor by perturbing the training set or the construction method, and then combine the versions into a single predictor.
• E.g., the bagging estimate for regression is
$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$$
• Bagging for classification: combine the B predictors by voting. (A code sketch follows.)
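A minimal sketch of bagging for regression, using scikit-learn decision trees as the unstable base learner (an assumed choice; the slides do not prescribe one) and toy data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_test, B=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)             # bootstrap sample Z*_b
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)              # f*_b(x)
    return preds.mean(axis=0)                        # f_bag(x): average of the B predictors

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)   # toy regression data
X_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_bag = bagged_predict(X, y, X_test)
```

For classification, the same loop would collect the B class predictions and return the majority vote at each test point.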

Page 55: Lectures 13,14 – Model Inference and Averaging

Bagging
• The bagging estimate will differ from the original estimate $\hat{f}(x)$ when the latter is a nonlinear or adaptive function of the data.
• Bagging can be used to combine different models for a prediction task.
• It is used heavily with trees (discussed in the next section).
• If we think of the bootstrap values of an estimator as approximate posterior values of a corresponding parameter from a nonparametric Bayesian analysis, then bagging is just the (approximate) posterior Bayesian mean.

Page 56: Lectures 13,14 – Model Inference and Averaging

Model Averaging and Stacking
• We have a set of candidate models $\mathcal{M}_m$, m = 1, ..., M, for the training set Z.
• The models may be of the same type with different parameter values, or different models for the same task.
• Suppose ζ is some quantity of interest, say a prediction f(x) at some fixed feature value x.
• The posterior distribution and posterior mean of ζ are
$$\Pr(\zeta \mid Z) = \sum_{m=1}^{M}\Pr\!\left(\zeta \mid \mathcal{M}_m, Z\right)\Pr\!\left(\mathcal{M}_m \mid Z\right)$$
$$E(\zeta \mid Z) = \sum_{m=1}^{M} E\!\left(\zeta \mid \mathcal{M}_m, Z\right)\Pr\!\left(\mathcal{M}_m \mid Z\right)$$
• This Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model.

Page 57: Lectures 13,14 – Model Inference and Averaging

Model Averaging and Stacking
• Committee methods take a simple unweighted average of the predictions from each model, giving equal probability to each of the models.
• In cases where the different models arise from the same parametric model with different parameter values, one could use BIC to weight the models.
– Recall that BIC gives weight to each model depending on how well it fits and how many parameters it uses.
• How does it work for nonparametric (frequentist) models?
– Answer: see Chapter 10; boosting is a strong method proposed for these cases.

Page 58: Lectures 13,14 – Model Inference and Averaging

Stochastic Search: Bumping
• Bumping is a technique for finding a better single model, and it again uses bootstrapping.
• It is suitable for models where the fitting procedure finds many local minima.
• Like bagging, draw bootstrap samples and fit a model to each.
• Rather than averaging, choose the model estimated from the bootstrap sample that best fits the original training data.
• Draw bootstrap samples Z*1, ..., Z*B and fit our model to each, giving predictions $\hat{f}^{*b}(x)$, b = 1, ..., B.
• Choose the model that gives the smallest prediction error, averaged over the original training set.
• E.g., for squared error, choose the model fit on bootstrap sample $\hat{b}$, where
$$\hat{b} = \arg\min_{b}\sum_{i=1}^{N}\left[y_i - \hat{f}^{*b}(x_i)\right]^2$$
(A code sketch follows.)
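A minimal sketch of bumping with the same kind of toy setup and decision trees as the bagging sketch (assumed choices); following the usual convention, the original training sample is included among the candidates so the procedure can never do worse on the training data than the ordinary fit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bump(X, y, B=20, max_depth=3, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    best_model, best_err = None, np.inf
    for b in range(B + 1):
        # b = 0 uses the original data; b >= 1 uses bootstrap samples Z*_b
        idx = np.arange(N) if b == 0 else rng.integers(0, N, size=N)
        model = DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], y[idx])
        err = np.mean((y - model.predict(X)) ** 2)   # squared error on the ORIGINAL training set
        if err < best_err:
            best_model, best_err = model, err
    return best_model

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
best_tree = bump(X, y)                               # the single "bumped" model
```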