
Lectures 13,14 – Model Inference and Averaging

Rice ECE697, Farinaz Koushanfar

Fall 2006

Summary

• Bootstrap and maximum likelihood (ML)
• Bayesian methods
• The expectation maximization (EM) algorithm
• MCMC for sampling from posterior
• Bagging
• Model averaging

Empirical Distribution

• Suppose x1, ..., xN are the observed outcomes of N iid random variables following an unknown PDF.
• The empirical distribution: P(X = a) = count(xi = a)/N.
• The empirical estimate of a parameter is computed from the empirical distribution by the formula that defines the parameter based on its true distribution.
• For example, the empirical estimate of the variance is (a short code sketch follows below)
  $\widehat{\operatorname{var}}(x) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$
• Empirical estimates are often biased; there is no guarantee they have the best possible variance, or other good properties.
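To make the plug-in idea concrete, here is a minimal Python sketch (not from the slides; the sample x is made up) computing the empirical distribution and the plug-in variance estimate above:

import numpy as np

# Hypothetical observed sample x1, ..., xN (made up for illustration).
x = np.array([2.0, 3.0, 3.0, 5.0, 7.0, 3.0, 2.0, 9.0])
N = len(x)

# Empirical distribution: P(X = a) = count(xi = a) / N.
values, counts = np.unique(x, return_counts=True)
p_hat = dict(zip(values, counts / N))

# Plug-in (empirical) variance: (1/N) * sum (xi - xbar)^2.
# Note the 1/N divisor: this is the biased plug-in estimate, not the
# unbiased 1/(N-1) version, which illustrates the bias remark above.
var_hat = np.mean((x - x.mean()) ** 2)

print(p_hat)
print(var_hat)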

Bootstrap

• Let T(x1, ..., xN) be an estimator of θ.
• Bootstrap generates repeated estimates by generating repeated "fake" outcomes. Each fake outcome is generated by taking a random sample according to the empirical distribution P(X = a) = count(xi = a)/N.
• To generate fake outcomes, resample with replacement.
• By taking R random samples with replacement, we get R different bootstrap estimates of θ; call these B1, ..., BR.
• What do we use the R bootstrap estimates for? The most common use is for confidence intervals (CIs); see the sketch below.
  (1) Use the order statistics of the Br. E.g., for a 95% confidence interval, use B(2.5%) and B(97.5%).
  (2) If we know that the PDF of T(x1, ..., xN) is Gaussian, base the CI on the sample variance of the bootstrap estimates,
  $\widehat{\operatorname{var}}(B) = \frac{1}{R-1}\sum_{r=1}^{R}(B_r - \bar{B})^2$
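A minimal bootstrap sketch in Python (not from the slides): the estimator T is taken to be the sample median purely for illustration, and both CI constructions above are shown:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)    # made-up observed data

def T(sample):
    # The estimator T(x1, ..., xN); the median is just an example choice.
    return np.median(sample)

R = 2000                                        # number of bootstrap replicates
boot = np.empty(R)
for r in range(R):
    # Resample with replacement from the empirical distribution.
    idx = rng.integers(0, len(x), size=len(x))
    boot[r] = T(x[idx])

# (1) Percentile CI from the order statistics of the B_r.
ci_percentile = np.percentile(boot, [2.5, 97.5])

# (2) Normal-approximation CI based on the bootstrap variance
#     (1/(R-1)) * sum (B_r - Bbar)^2.
se = boot.std(ddof=1)
ci_normal = (T(x) - 1.96 * se, T(x) + 1.96 * se)

print(ci_percentile, ci_normal)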

Bootstrap-Example

• Training data Z = {z1, z2, ..., zN}, with zi = (xi, yi).
• Suppose we are fitting a cubic spline, with 3 knots placed at quantiles (splines are one form of kernel basis function, centered at knots).
• With the seven-dimensional basis h1(x), ..., h7(x), the model is
  $\mu(x) = \sum_{j=1}^{7} \beta_j h_j(x)$

Bootstrap-Example (cont'd)

• Spline prediction: $\hat\mu(x) = \sum_{j=1}^{7}\hat\beta_j h_j(x) = h(x)^T\hat\beta$, where $h(x)^T = (h_1(x), \dots, h_7(x))$.
• Can think of $\mu(x) = E(Y \mid X = x)$.
• The usual estimate is $\hat\beta = (H^T H)^{-1} H^T y$, where H is the N x 7 matrix with ij-th element $h_j(x_i)$.
• The estimated covariance is $\widehat{\operatorname{Var}}(\hat\beta) = (H^T H)^{-1}\hat\sigma^2$.
• Noise variance: $\hat\sigma^2 = \sum_{i=1}^{N}\left(y_i - \hat\mu(x_i)\right)^2 / N$.
• The pointwise standard error of the fit is $\widehat{\operatorname{se}}[\hat\mu(x)] = \left[h(x)^T (H^T H)^{-1} h(x)\right]^{1/2}\hat\sigma$.
• How do we apply bootstrap on this example?

Bootstrap-Example (cont'd)

• Draw B datasets, each sampled with replacement from the training data (Z*: z*i = (x*i, y*i)).
• To each bootstrap sample, fit a cubic spline $\hat\mu^*(x)$.
• Example: 10 bootstrap samples (left), CI (right).

Least Square and Bootstrap - Example

Least Square, Bootstrap, and ML

• The previous example was the nonparametric bootstrap.
• Suppose that the error is Gaussian: $\varepsilon \sim N(0, \sigma^2)$.
• In the parametric bootstrap, we draw samples by adding Gaussian noise to the predicted values (sketched below):
  $y_i^* = \hat\mu(x_i) + \varepsilon_i^*, \qquad \varepsilon_i^* \sim N(0, \hat\sigma^2), \quad i = 1, \dots, N$
• The process is repeated B times, re-computing the spline on each sample; the CIs from this method will be exactly the least squares bands (as B goes to infinity)!
• The function estimated from the bootstrap samples has the distribution
  $\hat\mu^*(x) \sim N\!\left(\hat\mu(x),\; h(x)^T (H^T H)^{-1} h(x)\,\hat\sigma^2\right)$
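Below is a hedged sketch of the parametric bootstrap just described. To keep it self-contained it uses a simple degree-3 polynomial basis in place of the seven-dimensional cubic-spline basis h(x), and the training data are simulated; only the mechanics (fit, add Gaussian noise to the fitted values, refit) follow the slide:

import numpy as np

rng = np.random.default_rng(1)

# Made-up training data (xi, yi); a degree-3 polynomial basis stands in
# for the seven-dimensional cubic-spline basis h(x) used in the slides.
N = 50
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

H = np.vander(x, 4, increasing=True)             # basis matrix H (N x 4)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
mu_hat = H @ beta_hat                            # fitted values mu_hat(xi)
sigma2_hat = np.mean((y - mu_hat) ** 2)          # noise variance estimate

# Parametric bootstrap: add Gaussian noise to the fitted values and refit.
B = 1000
mu_star = np.empty((B, N))
for b in range(B):
    y_star = mu_hat + rng.normal(0, np.sqrt(sigma2_hat), N)
    beta_star, *_ = np.linalg.lstsq(H, y_star, rcond=None)
    mu_star[b] = H @ beta_star

# Pointwise 95% bands from the bootstrap draws; as B grows these approach
# the least-squares bands, as noted above.
lower, upper = np.percentile(mu_star, [2.5, 97.5], axis=0)
print(lower[:3], upper[:3])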

Maximum Likelihood (ML) Inference

• In general, the bootstrap estimate agrees not with least squares but with ML.
• Specify a probability density for the observations: $z_i \sim g_\theta(z)$.
• ML is based on the likelihood function (a numerical sketch follows below)
  $L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i)$
• The logarithm of $L(\theta; Z)$ is denoted $\ell(\theta; Z)$ and is the log-likelihood function:
  $\ell(\theta; Z) = \sum_{i=1}^{N} \ell(\theta; z_i) = \sum_{i=1}^{N} \log g_\theta(z_i)$
• ML chooses the value of $\theta$ that maximizes $\ell(\theta; Z)$.
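As a small illustration of maximizing the log-likelihood numerically (not the slides' example), a sketch that fits a Gaussian model N(mu, sigma^2) to a made-up i.i.d. sample with scipy:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
z = rng.normal(loc=3.0, scale=1.5, size=200)     # made-up i.i.d. sample

def neg_log_lik(theta, z):
    # Negative log-likelihood of N(mu, sigma^2); theta = (mu, log sigma),
    # parameterized with log sigma so the optimizer keeps sigma > 0.
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return (0.5 * len(z) * np.log(2 * np.pi * sigma ** 2)
            + np.sum((z - mu) ** 2) / (2 * sigma ** 2))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), args=(z,))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, sigma_ml)   # close to the sample mean and the 1/N standard deviation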

ML – Some Definitions

• The score function is $\dot\ell(\theta; Z) = \sum_{i=1}^{N}\dot\ell(\theta; z_i)$, where $\dot\ell(\theta; z_i) = \partial\ell(\theta; z_i)/\partial\theta$.
• Assuming that the maximum is in the interior of the parameter space, the score is 0 at the ML estimate.
• The information matrix is
  $I(\theta) = -\sum_{i=1}^{N}\frac{\partial^2 \ell(\theta; z_i)}{\partial\theta\,\partial\theta^T}$
• $I(\theta)$ evaluated at $\theta = \hat\theta$ is the observed information.
• The Fisher information (or expected information) is $i(\theta) = E[I(\theta)]$.

ML – Some More Results

• Assume independent sampling from $g_\theta(z)$.
• The sampling distribution of the ML estimator has a limiting normal distribution (as $N \to \infty$):
  $\hat\theta \to N\!\left(\theta_0,\; i(\theta_0)^{-1}\right)$
  which is approximated by $N(\hat\theta, i(\hat\theta)^{-1})$ or $N(\hat\theta, I(\hat\theta)^{-1})$.
• The standard error of the estimate of $\theta_j$ is
  $\sqrt{i(\hat\theta)^{-1}_{jj}} \quad \text{or} \quad \sqrt{I(\hat\theta)^{-1}_{jj}}$
• Confidence intervals for $\theta_j$ have the form
  $\hat\theta_j \pm z^{(1-\alpha)}\sqrt{i(\hat\theta)^{-1}_{jj}} \quad \text{or} \quad \hat\theta_j \pm z^{(1-\alpha)}\sqrt{I(\hat\theta)^{-1}_{jj}}$

ML for our smoothing example

• The parameters are $\theta = (\beta, \sigma^2)$; the log-likelihood is
  $\ell(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - h(x_i)^T\beta\right)^2$
• The ML estimate is obtained by setting $\partial\ell(\theta)/\partial\beta = 0$ and $\partial\ell(\theta)/\partial\sigma^2 = 0$, which gives
  $\hat\beta = (H^T H)^{-1} H^T y \qquad \text{and} \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat\mu(x_i)\right)^2$
• The information matrix for $\theta = (\beta, \sigma^2)$ is block-diagonal; the block corresponding to $\beta$ is
  $I(\beta) = (H^T H)/\sigma^2$

Bayesian Approach to Inference

• Specify a sampling model $\Pr(Z \mid \theta)$, the pdf of the data given the parameters, and a prior distribution $\Pr(\theta)$ reflecting our knowledge before we see the new data.
• The posterior distribution
  $\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta)\,\Pr(\theta)\,d\theta}$
  corresponds to our updated knowledge about $\theta$ after we see the new data.
• The difference between Bayesian and regular inference is that the Bayesian approach expresses the uncertainty before seeing the data (the prior) and the uncertainty remaining afterwards (the posterior).

Bayesian Approach (Cont'd)

• Predict the value of a future observation via the predictive distribution
  $\Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta$
• ML would use $\Pr(z^{new} \mid \hat\theta)$ to predict future data, which, unlike the predictive distribution, does not account for the uncertainty in estimating $\theta$.

Bayesian Approach on Our Example

• Parametric model: $\mu(x) = \sum_{j=1}^{7}\beta_j h_j(x) = h(x)^T\beta$
• Assume that $\sigma^2$ is known and the randomness comes only from the variation of y around $\mu(x)$.
• Assuming a finite number of basis functions, put the prior on the distribution of the coefficients, $\beta \sim N(0, \tau\Sigma)$. Then $\mu(x)$ is Gaussian with covariance kernel
  $K(x, x') = \operatorname{cov}\!\left(\mu(x), \mu(x')\right) = \tau\, h(x)^T \Sigma\, h(x')$
• The posterior distribution for $\beta$ is also Gaussian, with (a numerical sketch follows below)
  $E(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} H^T y$
  $\operatorname{Cov}(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1}\sigma^2$
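A minimal numerical sketch of the posterior formulas above, assuming (as the next slides suggest) the prior correlation $\Sigma = I$, a known $\sigma^2$, and a simple polynomial basis standing in for h(x); all data and constants are made up:

import numpy as np

rng = np.random.default_rng(3)

# Made-up data and a small polynomial basis standing in for h(x).
N = 50
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
H = np.vander(x, 4, increasing=True)

sigma2 = 0.3 ** 2     # noise variance, assumed known as in the slides
tau = 10.0            # prior scale: beta ~ N(0, tau * I), i.e. Sigma = I
p = H.shape[1]

# Posterior mean and covariance from the conjugate Gaussian formulas above:
#   E(beta|Z)   = (H'H + (sigma^2/tau) I)^(-1) H'y
#   Cov(beta|Z) = (H'H + (sigma^2/tau) I)^(-1) sigma^2
A = H.T @ H + (sigma2 / tau) * np.eye(p)
post_mean = np.linalg.solve(A, H.T @ y)
post_cov = np.linalg.inv(A) * sigma2

# Posterior mean and pointwise variance of mu(x) = h(x)'beta at the inputs.
mu_post = H @ post_mean
mu_var = np.sum((H @ post_cov) * H, axis=1)
print(post_mean)
print(mu_var[:3])

Note that as tau grows the sigma^2/tau term vanishes, so the posterior mean approaches the least-squares fit; this is the connection to the bootstrap curves noted a couple of slides below.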

Example (Cont'd)

• The corresponding posterior for $\mu(x)$ has
  $E(\mu(x) \mid Z) = h(x)^T\left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} H^T y$
  $\operatorname{Cov}\!\left(\mu(x), \mu(x') \mid Z\right) = h(x)^T\left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} h(x')\,\sigma^2$
• How to choose $\Sigma$? Take the prior correlation matrix to be $\Sigma = I$.

Example (Cont'd)

• Let's take a look at the posterior curves and see the impact of the prior (through $\tau$) on the posterior.
• For large $\tau$, the posterior curves look like the bootstrap curves.

Bayes Inference – Example (from wikipedia)

• Suppose we wish to know about the proportion r of voters in a large population who will vote "yes" in a referendum

• Let n be the number of voters in a random sample (chosen with replacement, so that we have statistical independence)

• Let m be the number of voters in that random sample who will vote "yes".

• Suppose that we observe n = 10 voters and m = 7 say they will vote yes. From Bayes' theorem,
  $f(r \mid m = 7, n = 10) = \frac{f(m = 7 \mid r, n = 10)\, f(r)}{\int_0^1 f(m = 7 \mid r, n = 10)\, f(r)\, dr}$

Example from Wikipedia (Cont’d)

• From this we see that the posterior pdf can be computed from the prior probability density function f(r) and the likelihood function L(r) = f(m = 7 | r, n = 10).

• f(r) summarizes what we know about the distribution of r in the absence of any observation.

• We provisionally assume in this case that the prior distribution of r is uniform over the interval [0, 1]. That is, f(r) = 1.

• If some additional background information is found, we should modify the prior accordingly. However before we have any observations, all outcomes are equally likely.

Example from Wikipedia (Cont'd)

• Assuming random sampling, the likelihood function L(r) = P(m = 7 | r, n = 10) is just the probability of 7 successes in 10 trials for a binomial distribution:
  $L(r) = \binom{10}{7}\, r^7 (1 - r)^3$
• As with the prior, the likelihood is open to revision -- more complex assumptions will yield more complex likelihood functions. Maintaining the current assumptions, we compute the normalizing factor
  $\int_0^1 \binom{10}{7}\, r^7 (1 - r)^3\, dr = \binom{10}{7}\,\frac{7!\,3!}{11!} = \frac{1}{11}$
• For $r \in [0, 1]$ inclusive, the posterior distribution for r is then
  $f(r \mid m = 7, n = 10) = 11\binom{10}{7}\, r^7 (1 - r)^3 = 1320\, r^7 (1 - r)^3$

Example from Wikipedia (Cont'd)

• One may be interested in the probability that more than half the voters will vote "yes".
• The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution.
• In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll -- that seven of the 10 voters questioned will vote "yes" -- is (checked numerically below)
  $\Pr(r > 0.5 \mid m = 7, n = 10) = \int_{0.5}^{1} 1320\, r^7 (1 - r)^3\, dr \approx 0.887$
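A short computational check of this example (a sketch, using scipy): with a uniform prior and 7 "yes" votes out of 10, the posterior is the Beta(8, 4) distribution, and the probability that r > 1/2 can be read off its survival function:

from scipy.stats import beta

# Uniform prior on r is Beta(1, 1); observing m = 7 "yes" votes out of n = 10
# gives the conjugate posterior Beta(1 + m, 1 + n - m) = Beta(8, 4),
# i.e. the density 1320 * r^7 * (1 - r)^3 derived above.
m, n = 7, 10
posterior = beta(1 + m, 1 + n - m)

# Posterior probability that more than half the voters vote "yes".
p_more_than_half = posterior.sf(0.5)   # = 1 - CDF(0.5), roughly 0.887
print(p_more_than_half)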

Expectation Maximization (EM) Algorithm

• The EM algorithm is used for simplifying difficult ML problems; we will show an example first.
• Assume we are looking at a simple mixture model.

EM - Example

• Model: Y is a mixture of two normal distributions, $Y_1 \sim N(\mu_1, \sigma_1^2)$ and $Y_2 \sim N(\mu_2, \sigma_2^2)$:
  $Y = (1 - \Delta)Y_1 + \Delta Y_2$, where $\Delta \in \{0, 1\}$ and $\Pr(\Delta = 1) = \pi$.
• Let $\phi_\theta(y)$ be the Gaussian density with parameters $\theta = (\mu, \sigma^2)$; the pdf of Y is
  $g_Y(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y)$
• The parameters of the mixture model to be estimated by ML are $\theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$.
• The log-likelihood on N training cases is
  $\ell(\theta; Z) = \sum_{i=1}^{N}\log\!\left[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right]$

EM - Example

• Direct maximization is numerically difficult because of the sum of terms inside the log.
• A simpler approach is to fix some of the values, solve for the others, and iteratively update. This is the core of the EM algorithm.
• Expectation step: do a soft assignment of each observation to one model, i.e., how much each model is responsible for explaining a data point.
  – E.g., the responsibility of model 2 for observation i can be written as
  $\hat\gamma_i(\theta) = E(\Delta_i \mid \theta, Z) = \Pr(\Delta_i = 1 \mid \theta, Z)$
• Maximization step: the responsibilities are used in a weighted ML fit to update the parameter estimates.

EM Algorithm for 2-Component Gaussian Mixture Model

1. Take initial guesses for the parameters $\hat\pi, \hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2$.
2. Expectation step: compute the responsibilities
   $\hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1 - \hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, 2, \dots, N$
3. Maximization step: compute the weighted means and variances
   $\hat\mu_1 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)\,y_i}{\sum_{i=1}^{N}(1 - \hat\gamma_i)}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)(y_i - \hat\mu_1)^2}{\sum_{i=1}^{N}(1 - \hat\gamma_i)}$
   $\hat\mu_2 = \frac{\sum_{i=1}^{N}\hat\gamma_i\,y_i}{\sum_{i=1}^{N}\hat\gamma_i}, \qquad \hat\sigma_2^2 = \frac{\sum_{i=1}^{N}\hat\gamma_i(y_i - \hat\mu_2)^2}{\sum_{i=1}^{N}\hat\gamma_i}$
   and the mixing probability $\hat\pi = \sum_{i=1}^{N}\hat\gamma_i / N$.
4. Iterate steps 2 and 3 (E and M) until convergence. (A code sketch follows.)
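A compact Python sketch of this algorithm (not from the slides; the data are simulated from a two-component mixture just to have something to fit):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Made-up data from a two-component mixture.
y = np.concatenate([rng.normal(1.0, 0.5, 60), rng.normal(4.0, 1.0, 40)])

# 1. Initial guesses.
pi_hat = 0.5
mu1, mu2 = y.min(), y.max()
s1 = s2 = y.var()

for it in range(50):
    # 2. Expectation step: responsibilities gamma_i of component 2.
    p1 = (1 - pi_hat) * norm.pdf(y, mu1, np.sqrt(s1))
    p2 = pi_hat * norm.pdf(y, mu2, np.sqrt(s2))
    gamma = p2 / (p1 + p2)

    # 3. Maximization step: weighted means, variances, and mixing probability.
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi_hat = np.mean(gamma)

print(pi_hat, mu1, mu2, s1, s2)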

EM – Example (cont'd)

Selected iterations of the EM algorithm for the mixture example (value of $\hat\pi$):

Iteration   $\hat\pi$
1           0.485
5           0.493
10          0.523
15          0.544
20          0.546

EM – When is it Useful?

• In the hidden Markov model literature, EM is also known as the Baum-Welch algorithm.
• EM is used for maximizing the likelihood for a certain class of problems where direct ML is difficult.
• Data augmentation: the observed data is enlarged with latent (unobserved) variables.
  – In the previous example, the unobserved variable was $\Delta_i$.
  – EM is widely used when a part of the actual data is missing: treat the missing data as latent variables.

EM Algorithm – General Case

• Assume that the observed data is Z, with log-likelihood $\ell(\theta; Z)$ depending on the parameter $\theta$.
• The missing data is $Z^m$; the complete data is $T = (Z, Z^m)$, with log-likelihood $\ell_0(\theta; T)$, where $\ell_0$ is based on the complete-data density.
• In the mixture problem, $T = (Z, Z^m) = (y, \Delta)$.
• Since
  $\Pr(Z^m \mid Z, \theta') = \frac{\Pr(Z^m, Z \mid \theta')}{\Pr(Z \mid \theta')}, \qquad \text{i.e.,} \qquad \Pr(Z \mid \theta') = \frac{\Pr(T \mid \theta')}{\Pr(Z^m \mid Z, \theta')}$
  we have
  $\ell(\theta'; Z) = \ell_0(\theta'; T) - \ell_1(\theta'; Z^m \mid Z)$
• Here $\ell_1$ is based on the conditional density $\Pr(Z^m \mid Z, \theta')$.

EM Algorithm – General Case

1. Start with an initial guess for the parameter, $\hat\theta^{(0)}$.
2. Expectation step: at the j-th step, compute
   $Q(\theta', \hat\theta^{(j)}) = E\!\left[\ell_0(\theta'; T) \mid Z, \hat\theta^{(j)}\right]$
   as a function of the dummy argument $\theta'$.
3. Maximization step: determine the new estimate $\hat\theta^{(j+1)}$ as the maximizer of $Q(\theta', \hat\theta^{(j)})$ over $\theta'$.
4. Iterate steps 2 and 3 (E and M) until convergence.

Why EM Works?

• Remember that, in terms of the log-likelihood,
  $\ell(\theta'; Z) = \ell_0(\theta'; T) - \ell_1(\theta'; Z^m \mid Z)$
  where $\ell_0$ is based on the complete-data density and $\ell_1$ on the conditional density $\Pr(Z^m \mid Z, \theta)$.
• Taking conditional expectations with respect to the distribution of $T \mid Z$ governed by the parameter $\theta$ gives
  $\ell(\theta'; Z) = E\!\left[\ell_0(\theta'; T) \mid Z, \theta\right] - E\!\left[\ell_1(\theta'; Z^m \mid Z) \mid Z, \theta\right] \equiv Q(\theta', \theta) - R(\theta', \theta)$
• $R(\theta^*, \theta)$ is the expectation of a log-likelihood of a density (indexed by $\theta^*$) with respect to the same density indexed by $\theta$, so it is maximized as a function of $\theta^*$ when $\theta^* = \theta$ (Jensen's inequality). Hence, when $\theta'$ maximizes $Q(\cdot, \theta)$,
  $\ell(\theta'; Z) - \ell(\theta; Z) = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right] \ge 0$
• Thus, the EM iteration never decreases the log-likelihood!

MCMC Methods

• Slides (for MCMC) mostly borrowed from Sujit Sahu, www.maths.soton.ac.uk/staff/Sahu/
• Assume that we want to find
  $I = \int_a^b f(x)\,dx$
• If the interval [a, b] is divided so that $a = x_0 < x_1 < \dots < x_N = b$, the integral can be approximated by
  $I = \int_a^b f(x)\,dx \approx \sum_{i=0}^{N-1} f\!\left(\frac{x_i + x_{i+1}}{2}\right)(x_{i+1} - x_i)$
• Now assume we need to find an expectation
  $E[h(X)] = \int h(x)\,\pi(x)\,dx, \qquad \text{with } \int |h(x)|\,\pi(x)\,dx < \infty$
• This can be difficult!

MCMC for Integration (Cont'd)

• If we can draw samples $X^{(1)}, X^{(2)}, \dots, X^{(N)} \sim \pi(x)$,
• then we can estimate
  $E[h(X)] \approx \hat h = \frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right)$
• This is Monte Carlo (MC) integration (sketched below).
• Note that we have changed notation: N now denotes the number of samples.

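A minimal Monte Carlo integration sketch (not from the slides): pi is taken to be the standard normal and h(x) = x^2, so the true value of E[h(X)] is 1; both choices are just for illustration:

import numpy as np

rng = np.random.default_rng(5)

def h(x):
    # Function whose expectation under pi we want; h(x) = x^2 here.
    return x ** 2

N = 100_000
samples = rng.normal(0.0, 1.0, N)       # independent draws from pi(x) = N(0, 1)
estimate = np.mean(h(samples))          # (1/N) * sum h(X^(t))

# Numerical standard error under independent sampling.
nse = h(samples).std(ddof=1) / np.sqrt(N)
print(estimate, nse)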

Consistency of the integral estimate

• For independent samples, by the Law of Large Numbers,
  $\hat h = \frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right) \to E[h(X)] \quad \text{as } N \to \infty$
• But independent sampling from $\pi(x)$ can be difficult.
• It turns out that the above convergence still holds if we generate the samples using a Markov chain.


Markov Chains (Review)

• A Markov chain is generated by sampling
  $X^{(t+1)} \sim p\!\left(x \mid x^{(t)}\right), \qquad t = 0, 1, 2, \dots$
  where p is the transition kernel.
• So $X^{(t+1)}$ depends only on $X^{(t)}$, not on $X^{(0)}, X^{(1)}, \dots, X^{(t-1)}$.
• For example,
  $X^{(t+1)} \mid x^{(t)} \sim N\!\left(0.5\,x^{(t)},\, 1\right)$
• This is a first-order auto-regressive process with lag-1 autocorrelation 0.5.


Markov Chains (Stationary)

• As $t \to \infty$, the Markov chain converges (in distribution) to its stationary (invariant) distribution.
• In the example before, this is
  $N\!\left(0,\; \frac{1}{1 - 0.5^2}\right) = N(0,\, 1.33)$
  which does not depend on $x^{(0)}$.
• Does this happen for all Markov chains?


Markov Chains (Irreducibility)

• Assuming that a stationary distribution exists, it is unique if the chain is irreducible.
• Irreducible means that any set of states can be reached from any other state in a finite number of moves.
• An example of a reducible Markov chain: suppose the state space splits into two sets A and B with $p(x \mid y) = 0$ for $x \in A$ and $y \in B$, and vice versa.


Markov Chains (Aperiodicity)

• A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state i, say, is 1.
• Think of recording the number of steps taken to return to state 1. The g.c.d. of those numbers should be 1.
• If the g.c.d. is bigger than 1, say 2, then the chain returns only in cycles of 2, 4, 6, ... steps. This is not allowed for aperiodicity.
• The definition can be extended to the general state-space case.


Markov Chains (Ergodicity)

• Assume the Markov chain has a stationary distribution $\pi(x)$ and is aperiodic and irreducible. Then the ergodic theorem holds:
  $\frac{1}{N}\sum_{t=1}^{N} h\!\left(X^{(t)}\right) \to E_\pi[h(X)] \quad \text{as } N \to \infty$
• Also, for such chains with $E_\pi\!\left[h(X)^2\right] < \infty$, a central limit theorem holds, and convergence occurs geometrically.


Numerical Standard Error (nse)

• The numerical standard error of the estimate $\hat h$ is $\operatorname{nse}(\hat h) = \sqrt{\operatorname{Var}(\hat h)}$; in general, no simpler expression exists for the nse.


Numerical Standard Error (nse)

• If $\{h(X^{(t)})\}$ can be approximated as a first-order auto-regressive process, then
  $\operatorname{nse}(\hat h) \approx \sqrt{\frac{\operatorname{Var}[h(X)]}{N}}\;\sqrt{\frac{1 + \rho}{1 - \rho}}$
  where $\rho$ is the lag-1 autocorrelation of $\{h(X^{(t)})\}$.
• The first factor is the usual term under independent sampling.
• The second factor is usually > 1; it is the penalty to be paid because a Markov chain has been used.


More on nse

• The nse may not be finite in general.
• It is finite if the chain converges geometrically.
• If the nse is finite, then we can make it as small as we like by increasing N.
• The 'obvious' estimator of the nse is not consistent.


Markov Chains -- Summary

• A Markov chain may have a stationary distribution.
• The stationary distribution is unique if the chain is irreducible.
• We can estimate nse's if the chain is also geometrically convergent.
• Where does all this get us?


MCMC

• How do we construct a Markov chain whose stationary distribution is our target distribution $\pi(x)$?
• Metropolis et al. (1953) showed how.
• The method was generalized by Hastings (1970).
• This is called the Metropolis-Hastings algorithm.


Metropolis-Hastings Algorithm

• At each iteration t (a code sketch follows below):
• Step 1: Sample $y \sim q(y \mid x^{(t)})$, where y is the candidate point and $q(\cdot \mid x)$ is the proposal pdf.
• Step 2: With probability
  $\alpha\!\left(x^{(t)}, y\right) = \min\!\left(1,\; \frac{\pi(y)\, q(x^{(t)} \mid y)}{\pi(x^{(t)})\, q(y \mid x^{(t)})}\right)$
  set $x^{(t+1)} = y$ (acceptance); else set $x^{(t+1)} = x^{(t)}$ (rejection).

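A minimal random-walk Metropolis sketch in Python (not from the slides). The target pi is an unnormalized standard normal and the proposal is a symmetric Gaussian random walk, so q cancels from the acceptance ratio; both choices are assumptions made purely for illustration:

import numpy as np

rng = np.random.default_rng(6)

def pi_unnorm(x):
    # Unnormalized target density pi(x); the normalizing constant is not
    # needed, as noted on the next slide.
    return np.exp(-0.5 * x ** 2)

T = 10_000
x = np.empty(T)
x[0] = 0.0
step = 1.0                                   # random-walk proposal scale (a tuning choice)

for t in range(T - 1):
    # Step 1: propose y ~ q(y | x^(t)); symmetric Gaussian random walk.
    y = x[t] + rng.normal(0.0, step)
    # Step 2: accept with probability min(1, pi(y) / pi(x^(t))).
    if rng.uniform() < min(1.0, pi_unnorm(y) / pi_unnorm(x[t])):
        x[t + 1] = y                         # acceptance
    else:
        x[t + 1] = x[t]                      # rejection

print(x[1000:].mean(), x[1000:].std())       # near 0 and 1 after burn-in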

Metropolis-Hastings Notes

• The normalizing constant in $\pi(x)$ is not required to run the algorithm; it cancels in the ratio.
• If $q(y \mid x) = \pi(y)$, then we obtain independent samples.
• Usually q is chosen so that $q(y \mid x)$ is easy to sample from.
• Theoretically, any density $q(\cdot \mid x)$ having the same support should work; however, some q's are better than others.
• The induced Markov chains have the desirable properties under mild conditions on $\pi(x)$.


Gibbs Sampler

• Gibbs sampling is a Monte Carlo sampling method.
• Suppose that $x = (x_1, x_2, \dots, x_k)$ is $k\,(\ge 2)$ dimensional.
• The Gibbs sampler uses what are called the full (or complete) conditional distributions,
  $\pi(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_k)$
• Note that the full conditional
  $\pi(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_k) = \frac{\pi(x_1, \dots, x_k)}{\int \pi(x_1, \dots, x_k)\, dx_i}$
  is proportional to the joint. Often this helps in finding it.


Gibbs Sampling (Cont’d)

• Sample or update each component in turn, but always use the most recent values of the other components.


Gibbs Sampling -- Algorithm

1. Take some initial values $U_k^{(0)}$, $k = 1, 2, \dots, K$.
2. Repeat for $t = 1, 2, \dots$: for $k = 1, 2, \dots, K$ generate $U_k^{(t)}$ from
   $\Pr\!\left(U_k^{(t)} \;\middle|\; U_1^{(t)}, \dots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \dots, U_K^{(t-1)}\right)$
3. Continue step 2 until the joint distribution of $\left(U_1^{(t)}, U_2^{(t)}, \dots, U_K^{(t)}\right)$ does not change.

Gibbs Sampling (Cont’d)

• In Bayesian Inference, the goal is to draw a sample from the joint posterior of the parameters given the data Z

• Gibbs sampling will be helpful if it is easy to sample from the conditional distribution of each parameter given the other parameters and Z

• An example – Gaussian mixture problem is described on the next slide

Gibbs Sampling - Example

1. Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$.
2. Repeat for $t = 1, 2, \dots$
   (a) For $i = 1, 2, \dots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with $\Pr(\Delta_i^{(t)} = 1) = \hat\gamma_i(\theta^{(t)})$, from the equation given previously.
   (b) Set
   $\hat\mu_1 = \frac{\sum_{i=1}^{N}\left(1 - \Delta_i^{(t)}\right) y_i}{\sum_{i=1}^{N}\left(1 - \Delta_i^{(t)}\right)} \qquad \text{and} \qquad \hat\mu_2 = \frac{\sum_{i=1}^{N}\Delta_i^{(t)}\, y_i}{\sum_{i=1}^{N}\Delta_i^{(t)}}$
   and generate $\mu_1^{(t)} \sim N(\hat\mu_1, \hat\sigma_1^2)$ and $\mu_2^{(t)} \sim N(\hat\mu_2, \hat\sigma_2^2)$.
3. Continue step 2 until the joint distribution of $\left(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)}\right)$ does not change. (A code sketch follows.)
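A sketch of this Gibbs sampler in Python (not from the slides). Following the simplification discussed on the next slide, the variances and the mixing proportion are held fixed, the data are simulated, and step 2(b) draws mu1, mu2 from N(mu_hat, sigma^2) exactly as written above:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
# Simulated data from a two-component mixture (made up for illustration).
y = np.concatenate([rng.normal(1.0, 0.5, 60), rng.normal(4.0, 1.0, 40)])
N = len(y)

# Fixed variances and mixing proportion; only mu1, mu2 are sampled.
s1 = s2 = 0.8 ** 2
pi_mix = 0.4
mu1, mu2 = y.min(), y.max()                 # step 1: initial values

T = 200
trace = np.empty((T, 2))
for t in range(T):
    # Step 2(a): sample the latent indicators Delta_i given (mu1, mu2).
    p1 = (1 - pi_mix) * norm.pdf(y, mu1, np.sqrt(s1))
    p2 = pi_mix * norm.pdf(y, mu2, np.sqrt(s2))
    delta = rng.uniform(size=N) < p2 / (p1 + p2)

    # Step 2(b): group means, then draw mu1, mu2 ~ N(mu_hat, sigma^2).
    n1, n2 = int(np.sum(~delta)), int(np.sum(delta))
    mu_hat1 = y[~delta].mean() if n1 > 0 else mu1
    mu_hat2 = y[delta].mean() if n2 > 0 else mu2
    mu1 = rng.normal(mu_hat1, np.sqrt(s1))
    mu2 = rng.normal(mu_hat2, np.sqrt(s2))
    trace[t] = (mu1, mu2)

# After a short burn-in the draws wander around the two component means.
print(trace[50:].mean(axis=0))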

Gibbs Sampling – Example (Cont'd)

• There is a close connection between Gibbs sampling and EM.
• The key is to treat the latent variable $Z^m$ from EM as another parameter for the Gibbs sampler.
• For simplicity, fix the variances $(\sigma_1^2, \sigma_2^2)$ and the mixing proportion $\pi$, so the only unknowns are $\mu_1$ and $\mu_2$.
• The first two steps are like EM, except that in step 2(a) Gibbs simulates the latent data $\Delta_i$ from the distribution $\Pr(\Delta_i \mid \theta, Z)$ rather than computing the responsibilities.
• In step 2(b), rather than computing the maximizers of the posterior $\Pr(\mu_1, \mu_2, \Delta \mid Z)$, we simulate from the conditional distribution $\Pr(\mu_1, \mu_2 \mid \Delta, Z)$.
• The example was simplified; in reality the variances and the mixing proportion $\pi$ should be taken into account as well.

Gibbs Sampling – Example (Cont'd)

• 200 iterations of Gibbs sampling.
• The values seem to stabilize quickly and are distributed evenly around the ML values.

Bagging

• Some classification and regression methods are unstable:
  – Small perturbations in their training sets, or in their construction, may result in large changes in the constructed predictor.
  – E.g., subset selection methods in regression, decision trees.
• Unstable methods can improve their accuracy by perturbing and combining: generate multiple versions of the predictor by perturbing the training set or the construction method, and then combine the versions into a single predictor.
• E.g., the bagging estimate for regression is (see the sketch below)
  $\hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f^{*b}(x)$
• Bagging for classification: combine the B votes.
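A hedged bagging sketch in Python (not from the slides): the base learner is a deliberately unstable high-degree polynomial fit and the data are simulated; the averaging step implements the bagging formula above:

import numpy as np

rng = np.random.default_rng(8)

# Made-up regression data; a high-degree polynomial fit is deliberately
# unstable, so bagging has something to smooth out.
N = 60
x = np.sort(rng.uniform(-1, 1, N))
y = np.sin(3 * x) + rng.normal(0, 0.3, N)

def fit_predict(x_train, y_train, x_eval, degree=6):
    # Unstable base learner: high-degree polynomial least squares.
    coefs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coefs, x_eval)

x_grid = np.linspace(-1, 1, 100)
f_hat = fit_predict(x, y, x_grid)                # original estimate f_hat(x)

# Bagging: average the predictor over B bootstrap samples of the training set.
B = 200
preds = np.empty((B, len(x_grid)))
for b in range(B):
    idx = rng.integers(0, N, size=N)             # bootstrap sample Z*b
    preds[b] = fit_predict(x[idx], y[idx], x_grid)
f_bag = preds.mean(axis=0)                       # (1/B) * sum_b f*b(x)

print(np.abs(f_hat - f_bag).max())               # the two estimates differ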

Bagging

• The bagging estimate will differ from the original estimate $\hat f(x)$ when the latter is a nonlinear or adaptive function of the data.
• Bagging can also be used to combine ("bag") different models for a prediction.
• It is used heavily with trees (discussed in the next section).
• If we think of the bootstrap values of an estimator as approximate posterior values of a corresponding parameter, from a nonparametric Bayesian analysis, then bagging is just the posterior Bayesian mean.

Model Averaging and Stacking

• We have a set of candidate models $\mathcal{M}_m$, $m = 1, \dots, M$, for the training set Z.
• The models may be of the same type with different parameter values, or different models for the same task.
• Suppose $\zeta$ is some quantity of interest, say, a prediction f(x) at some fixed feature value x.
• The posterior distribution and posterior mean of $\zeta$ are
  $\Pr(\zeta \mid Z) = \sum_{m=1}^{M}\Pr(\zeta \mid \mathcal{M}_m, Z)\,\Pr(\mathcal{M}_m \mid Z)$
  $E(\zeta \mid Z) = \sum_{m=1}^{M} E(\zeta \mid \mathcal{M}_m, Z)\,\Pr(\mathcal{M}_m \mid Z)$
• This Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model.

Model Averaging and Stacking

• Committee methods take a simple unweighted average of the predictions from each model, giving equal probability to each model.
• In cases where the different models arise from the same parametric model with different parameter values, one could use BIC to weight them.
  – Recall that BIC gives weight to each model depending on how well it fits and how many parameters it uses.
• How does this work for nonparametric (frequentist) models?
  – Answer: see Chapter 10; boosting is a strong method proposed for such cases.

Stochastic Search: Bumping

• Bumping is a technique for finding a better single model; it again uses bootstrapping.
• It is suitable for models where there are many local minima.
• Like bagging, draw bootstrap samples and fit a model to each; but rather than averaging, choose the model estimated from the bootstrap sample that best fits the original training data.
• Draw bootstrap samples $Z^{*1}, \dots, Z^{*B}$ and fit our model to each, giving predictions $\hat f^{*b}(x)$, $b = 1, \dots, B$.
• Choose the model that gives the smallest prediction error, averaged over the original training set. E.g., for squared error, choose the bootstrap sample (sketched below)
  $\hat b = \arg\min_b \sum_{i=1}^{N}\left[y_i - \hat f^{*b}(x_i)\right]^2$
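A minimal bumping sketch in Python (not from the slides). The base learner and data are made up; it is also common to include the original training set as a candidate (b = 0) so that bumping can fall back to the original fit, an assumption added here rather than stated in the slides:

import numpy as np

rng = np.random.default_rng(9)
N = 60
x = np.sort(rng.uniform(-1, 1, N))
y = np.sin(3 * x) + rng.normal(0, 0.3, N)        # made-up training data

def fit_predict(x_train, y_train, x_eval, degree=6):
    # Hypothetical base procedure: polynomial least-squares fit.
    coefs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coefs, x_eval)

B = 25
errors = np.empty(B + 1)
fits = []
for b in range(B + 1):
    if b == 0:
        xb, yb = x, y                            # candidate 0: the original data
    else:
        idx = rng.integers(0, N, size=N)         # bootstrap sample Z*b
        xb, yb = x[idx], y[idx]
    pred = fit_predict(xb, yb, x)                # evaluate on the ORIGINAL training set
    errors[b] = np.mean((y - pred) ** 2)         # squared prediction error
    fits.append((xb, yb))

b_hat = int(np.argmin(errors))                   # bumping keeps the best-fitting sample
print(b_hat, errors[b_hat])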
