
Stat 451 Lecture Notes 0712

Markov Chain Monte Carlo

Ryan Martin, UIC

www.math.uic.edu/~rgmartin

1 Based on Chapters 8–9 in Givens & Hoeting, Chapters 25–27 in Lange
2 Updated: April 4, 2016

1 / 42

Outline

1 Introduction

2 Crash course on Markov chains

3 Motivation, revisited

4 Metropolis–Hastings algorithm

5 Gibbs sampler

6 Some MCMC diagnostics

7 Conclusion

2 / 42

Motivation

We know how to sample independent random variables from the target distribution f(x), at least approximately.

Monte Carlo uses these simulated random variables to approximate integrals.

But the random variables don’t need to be independent in order to accurately approximate integrals!

Markov Chain Monte Carlo (MCMC) constructs a dependent sequence of random variables that can be used to approximate the integrals just like for ordinary Monte Carlo.

The advantage of introducing this dependence is that very general “black-box” algorithms (and corresponding theory) are available to perform the required simulations.

3 / 42

Initial remarks

In some sense, MCMC is basically a black-box approach that works for almost all problems.

The previous remark is dangerous — it’s always a bad idea to use tools without knowing that they will work.

Here we will discuss some basics of Markov chains and MCMC, but know that there are very important unanswered questions about how and when MCMC works.3

3 MCMC is an active area of research; despite the many developments in MCMC, according to Diaconis (∼2008), still “almost nothing” is known!

4 / 42

Outline

1 Introduction

2 Crash course on Markov chains

3 Motivation, revisited

4 Metropolis–Hastings algorithm

5 Gibbs sampler

6 Some MCMC diagnostics

7 Conclusion

5 / 42

Markov chains

A Markov chain is just a sequence of random variables {X1, X2, . . .} with a specific type of dependence structure.

In particular, a Markov chain satisfies

P(Xn+1 ∈ B | X1, . . . , Xn) = P(Xn+1 ∈ B | Xn), (?)

i.e., the future, given the past and present, depends only on the present.

An iid sequence is a trivial Markov chain.

From (?) we can argue that the probabilistic properties of the chain are completely determined by:

the initial distribution, for X0, and
the transition distribution, i.e., the distribution of Xn+1 given Xn.4

4 Assume the Markov chain is homogeneous, so that the transition distribution does not depend on n.

6 / 42

Example – simple random walk

Let U1, U2, . . . iid∼ Unif({−1, 1}).

Set X0 = 0 and let Xn = ∑_{i=1}^n Ui = Xn−1 + Un.

The initial distribution is P{X0 = 0} = 1.

The transition distribution is determined by

Xn = Xn−1 − 1 with prob 1/2
Xn = Xn−1 + 1 with prob 1/2

While very simple, the random walk is an important example in probability, having connections to advanced things like Brownian motion.
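To make the transition concrete, here is a minimal Python simulation of the walk (the course materials use R; the function name `simple_random_walk` is just for illustration):

```python
import random

def simple_random_walk(n, seed=None):
    """Simulate n steps of the +/-1 random walk, started at X_0 = 0."""
    rng = random.Random(seed)
    x, path = 0, [0]
    for _ in range(n):
        x += rng.choice([-1, 1])   # U_i ~ Unif({-1, +1})
        path.append(x)
    return path

path = simple_random_walk(1000, seed=42)
```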

7 / 42

Keywords5

A state A is recurrent if a chain starting in A will eventually return to A with probability 1. The state is nonnull if the expected time to return is finite. A chain is called recurrent if each state is recurrent.

A Markov chain is irreducible if there is positive probability that a chain starting in a state A can reach any other state B.

A Markov chain is aperiodic if, for a starting state A, there is no constraint on the times at which the chain can return to A.

An irreducible, aperiodic Markov chain with all states being nonnull recurrent is called ergodic.

5 Not mathematically precise!

8 / 42

Limit theory6

f is a stationary distribution if X0 ∼ f implies Xn ∼ f for all n.

An ergodic Markov chain has at most one stationary dist.

Furthermore, if the chain is ergodic, then

lim_{n→∞} P(Xm+n ∈ B | Xm ∈ A) = ∫_B f(x) dx, for all A, B, m.

Even further, if ϕ(x) is integrable, then

(1/n) ∑_{t=1}^n ϕ(Xt) → ∫ ϕ(x) f(x) dx, with prob 1.

This is a version of the famous ergodic theorem.

There are also central limit theorems for Markov chains, but I won’t say anything about this.

6 Again, not mathematically precise!

9 / 42

Outline

1 Introduction

2 Crash course on Markov chains

3 Motivation, revisited

4 Metropolis–Hastings algorithm

5 Gibbs sampler

6 Some MCMC diagnostics

7 Conclusion

10 / 42

Why MCMC?

In Monte Carlo applications, we want to generate random variables with distribution f.

This could be difficult or impossible to do exactly.

MCMC is designed to construct an ergodic Markov chain with f as its stationary distribution.

Asymptotically, the chain will resemble samples from f.

In particular, by the ergodic theorem, expectations with respect to f can be approximated by averages.

Somewhat surprising is that it is actually quite easy to construct and simulate a suitable Markov chain, explaining why MCMC methods have become so popular.

But, of course, there are practical and theoretical challenges...

11 / 42

Outline

1 Introduction

2 Crash course on Markov chains

3 Motivation, revisited

4 Metropolis–Hastings algorithm

5 Gibbs sampler

6 Some MCMC diagnostics

7 Conclusion

12 / 42

Details

Let f(x) denote the target distribution pdf.

Let q(x | y) denote a conditional pdf for X, given Y = y; this pdf should be easy to sample from.

Given X0, the Metropolis–Hastings algorithm (MH) produces a sequence of random variables as follows:

1 Sample X*_t ∼ q(x | Xt−1).

2 Compute

R = min{ 1, [f(X*_t) / f(Xt−1)] · [q(Xt−1 | X*_t) / q(X*_t | Xt−1)] }.

3 Set Xt = X*_t with probability R; otherwise, Xt = Xt−1.

General R code to implement MH is on the course website.
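The course website has the R implementation; the sketch below is a hypothetical Python version of the same three steps, working with log densities to avoid numerical underflow in the ratio R. The names (`log_f`, `rprop`, `dlogq`) are mine, not from the course code:

```python
import math
import random

def metropolis_hastings(log_f, rprop, dlogq, x0, n_iter, seed=None):
    """Generic MH sampler.
    log_f : log target density (up to an additive constant)
    rprop : rprop(x, rng) draws X* ~ q(. | x)
    dlogq : dlogq(x_new, x_old) = log q(x_new | x_old)
    """
    rng = random.Random(seed)
    x, chain = x0, [x0]
    for _ in range(n_iter):
        x_star = rprop(x, rng)
        # Step 2: log of the acceptance ratio R
        log_r = (log_f(x_star) - log_f(x)
                 + dlogq(x, x_star) - dlogq(x_star, x))
        # Step 3: accept X* with probability min(1, R)
        if math.log(rng.random()) < min(0.0, log_r):
            x = x_star
        chain.append(x)          # on rejection, X_t = X_{t-1}
    return chain

# usage: N(0, 1) target with a Unif(x - 1, x + 1) random walk proposal
log_f = lambda x: -0.5 * x * x
rprop = lambda x, rng: x + rng.uniform(-1.0, 1.0)
dlogq = lambda x_new, x_old: 0.0   # symmetric proposal: q terms cancel
chain = metropolis_hastings(log_f, rprop, dlogq, 0.0, 5000, seed=1)
```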

13 / 42

Details (cont.)

The proposal distribution is not easy to choose, and the performance of the algorithm depends on this choice.

Two general strategies are:

Take the proposal q(x | y) = q(x); i.e., at each stage of the MH algorithm, X*_t does not depend on Xt−1.
Take q(x | y) = q0(x − y), for a symmetric distribution with pdf q0, which amounts to a random walk proposal.

This is one aspect of the MCMC implementation that requires a lot of care from the user; deeper understanding is needed to really see how the proposal affects the performance.

In my examples, I will just pick a proposal that seems to work reasonably well...

14 / 42

Details (cont.)7

Assuming the proposal is not too bad, then a number of things can be shown about the sequence {Xt : t ≥ 1}:

the chain is ergodic, and
the target f is the stationary distribution.

Consequently, the sequence converges to the stationary distribution and, for any integrable function ϕ(x), we can approximate integrals with sample averages.

So, provided that we run the simulation “long enough,” we should be able to get arbitrarily good approximations.

This is an interesting scenario where we, as statisticians, are able to control the sample size!

7 Again, not mathematically precise!

15 / 42

Example: “cosine model”

Consider the problem from an old homework, where the likelihood function is

L(θ) ∝ ∏_{i=1}^n {1 − cos(Xi − θ)}, −π ≤ θ ≤ π.

Observed data (X1, . . . , Xn) given in the code.

Assume that θ is given a Unif(−π, π) prior distribution.

Use MH to sample from the posterior:

Proposal: q(θ′ | θ) = Unif(θ′ | θ ± 0.5).
Burn-in: B = 5000.
Sample size: M = 10000.

16 / 42

Example: “cosine model” (cont.)

Left figure shows histogram of the MCMC sample with posterior density overlaid.

Right figure shows a “trace plot” of the chain.


17 / 42

Example: Weibull model

Data X1, . . . , Xn has Weibull likelihood

L(α, η) ∝ α^n η^n exp{ α ∑_{i=1}^n log Xi − η ∑_{i=1}^n Xi^α }.

Prior: π(α, η) ∝ e^{−α} η^{b−1} e^{−cη}, for some (b, c).

Posterior density is proportional to

α^n η^{n+b−1} exp{ α ( ∑_{i=1}^n log Xi − 1 ) − η ( c + ∑_{i=1}^n Xi^α ) }.

Exponential is a special case of Weibull, when α = 1.

Goal: informal Bayesian test of H0 : α = 1.

18 / 42

Example: Weibull model (cont.)

Data from Problem 7.11 in Ghosh et al (2006).

Use MH to sample from posterior of (α, η).

Proposal: (α′, η′) | (α, η) ∼ Exp(α) × Exp(η).
b = 2 and c = 1; B = 1000 and M = 10000.

Histogram shows marginal posterior of α.

Is an exponential model (α = 1) reasonable?


19 / 42

Example: logistic regression

Based on Examples 1.13 and 7.11 in Robert & Casella’s book.

In 1986, the Challenger space shuttle exploded during take-off, the result of an “o-ring” failure.

Failure may have been due to the cold temperature (31◦F).

Goal: Analyze the relationship between temperature and o-ring failure.

In particular, fit a logistic regression model.

20 / 42

Example: logistic regression (cont.)

Model: Y | x ∼ Ber(p(x)), x = temperature.

Failure probability, p(x), is of the form

p(x) = exp(α + βx) / (1 + exp(α + βx)).

Using available data, fit logistic regression using glm.

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 15.0429 7.3786 2.039 0.0415 *

x -0.2322 0.1082 -2.145 0.0320 *

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Note that p̂(31) ≈ 0.999!!!

21 / 42

Example: logistic regression (cont.)

Can also do a Bayesian analysis of this logistic model.

Use MH to obtain samples from the posterior of (α, β).

Samples can be used to approximate the posterior distribution of p(x0) for any fixed x0, e.g., x0 = 65, 31; see below.

Details about the prior and proposal construction are given in the R code and a short write-up posted on the course website.

[Figures: posterior histograms of p(65) and p(31).]

22 / 42

Outline

1 Introduction

2 Crash course on Markov chains

3 Motivation, revisited

4 Metropolis–Hastings algorithm

5 Gibbs sampler

6 Some MCMC diagnostics

7 Conclusion

23 / 42

Setup

Suppose we have a multivariate target distribution f .

MH can be applied to such a problem, but there are challenges in constructing a good proposal over multiple dimensions.

Idea: sample one dimension at a time.

Question: How to carry out the sampling so that it will approximate the target, at least in a limit?

Gibbs sampler is the right tool for the job.

24 / 42

Details

Suppose we have a trivariate target f(x) = f(x1, x2, x3).

Suppose we can write down the set of full conditionals

f(x1 | x2, x3), f(x2 | x1, x3), f(x3 | x1, x2)

and that these can be sampled from.

Gibbs sampler generates a sequence {X^(t) : t ≥ 0} by iteratively sampling from the conditionals:

X1^(t) ∼ f(x1 | X2^(t−1), X3^(t−1))
X2^(t) ∼ f(x2 | X1^(t), X3^(t−1))
X3^(t) ∼ f(x3 | X1^(t), X2^(t)).

25 / 42

Details (cont.)

Gibbs sequence forms a Markov chain.

In fact, Gibbs sampler is a special case of MH!

Connection to MH is made by considering Gibbs as a sequence that updates one component of X at a time.

The acceptance probability for this form of MH update is exactly 1, which explains why Gibbs sampler has no accept/reject step.

Since Gibbs is a special kind of MH, the convergence theory for MH applies to Gibbs as well.

26 / 42

Example: bivariate normal

A super-simple Gibbs example: bivariate normal.

Suppose X = (X1,X2) is bivariate normal, correlation ρ.

Full conditionals are easy to write down here.

Gibbs steps:

X1^(t) ∼ N(ρ X2^(t−1), 1 − ρ²)
X2^(t) ∼ N(ρ X1^(t), 1 − ρ²).

Not as efficient as direct sampling, but works fine.
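These two update steps, in a minimal Python sketch (course code is in R; `gibbs_bvn` is a name I made up):

```python
import random

def gibbs_bvn(rho, n_iter, seed=None):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    s = (1.0 - rho * rho) ** 0.5      # conditional standard deviation
    draws = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, s)   # X1 | X2 ~ N(rho * X2, 1 - rho^2)
        x2 = rng.gauss(rho * x1, s)   # X2 | X1 ~ N(rho * X1, 1 - rho^2)
        draws.append((x1, x2))
    return draws

draws = gibbs_bvn(0.8, 20000, seed=1)
```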

27 / 42

Example: many-normal-means

Model: Xi ind∼ N(θi, 1), i = 1, . . . , n.

Hierarchical prior distribution:

[(θ1, . . . , θn) | ψ] iid∼ N(0, ψ^{−1}), ψ ∼ Gamma(a, b).

Takes some work, but it can be shown8 that the full conditionals are

θi | (Xi, ψ) ind∼ N( Xi / (1 + ψ), 1 / (1 + ψ) ), i = 1, . . . , n

ψ | (θ, X) ∼ Gamma( a + n/2, b + (1/2) ∑_{i=1}^n θi² ).

So Gibbs sampler is pretty easy to implement...

8 Easiest argument is based on standard conjugate priors...

28 / 42
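Sketching that Gibbs sampler in Python (the course's code is in R; note that `random.gammavariate` takes shape and scale, so the second argument is 1 over the rate; `gibbs_nnm` is a hypothetical name):

```python
import random

def gibbs_nnm(x, a, b, n_iter, seed=None):
    """Gibbs sampler for X_i ~ N(theta_i, 1), theta_i | psi ~ N(0, 1/psi),
    psi ~ Gamma(a, b), with b a rate parameter."""
    rng = random.Random(seed)
    n = len(x)
    psi = 1.0                          # arbitrary starting value
    draws = []
    for _ in range(n_iter):
        s = (1.0 / (1.0 + psi)) ** 0.5
        theta = [rng.gauss(xi / (1.0 + psi), s) for xi in x]
        ss = sum(t * t for t in theta)
        # gammavariate(shape, scale); scale = 1 / rate
        psi = rng.gammavariate(a + n / 2.0, 1.0 / (b + ss / 2.0))
        draws.append((theta, psi))
    return draws

draws = gibbs_nnm([1.0] * 10, 1.0, 1.0, 2000, seed=1)
```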

Example: many-normal-means (cont.)

Suppose the goal is to estimate ‖θ‖² = ∑_{i=1}^n θi².

In general, the MLE ‖X‖² is lousy.

However, the Bayes estimator, E(‖θ‖² | X), is better and can be evaluated by running the Gibbs sampler.

Can use the Rao–Blackwellized estimator of E(θi² | X) to reduce the variance.

Simulation study to compare Bayes estimator with MLE:

n = 10, θ = (1, 1, . . . , 1);
1000 reps, 5000 Monte Carlo samples, a = b = 1.

mle.mse bayes.mse

[1,] 180.1721 32.93027

29 / 42

Example: capture–recapture

Example 26.3.3 in Lange.

Consider a lake that contains N fish, where N is unknown.

Capture–recapture study:

On n occasions, fish are caught, marked, and returned.
At occasion i = 1, . . . , n, record

Ci = number of fish caught at time i
Ri = number of “recaptures” at time i.

Ci − Ri is the number of new fish caught at time i.
Set Ui = ∑_{j=1}^i (Cj − Rj).

Model assumes independent binomial sampling.

30 / 42

Example: capture–recapture (cont.)

Introduce binomial success probabilities (ω1, . . . , ωn).

Likelihood for (N, ω) is

L(N, ω) = ∏_{i=1}^n C(U_{i−1}, Ri) ωi^{Ri} (1 − ωi)^{U_{i−1}−Ri} · C(N − U_{i−1}, Ci − Ri) ωi^{Ci−Ri} (1 − ωi)^{N−U_{i−1}−Ci+Ri}

= ∏_{i=1}^n C(U_{i−1}, Ri) C(N − U_{i−1}, Ci − Ri) ωi^{Ci} (1 − ωi)^{N−Ci}

∝ [N! / (N − Un)!] ∏_{i=1}^n C(U_{i−1}, Ri) ωi^{Ci} (1 − ωi)^{N−Ci},

where C(n, k) denotes a binomial coefficient.

Priors: N ∼ Pois(m) and ωi ind∼ Beta(a, b).

31 / 42

Example: capture–recapture (cont.)

Posterior distribution for (N, ω) proportional to

[N! / (N − Un)!] · (m^N / N!) · ∏_{i=1}^n C(U_{i−1}, Ri) ωi^{Ci+a−1} (1 − ωi)^{N−Ci+b−1},

where C(n, k) denotes a binomial coefficient.

To run a Gibbs sampler, we need the full conditionals.

Distribution of (ω1, . . . , ωn), given N and data, is clearly

ωi | (N, data) ind∼ Beta(a + Ci, b + N − Ci), i = 1, . . . , n.

Distribution of N, given ω and data, is

N | (ω, data) ∼ Un + Pois( m ∏_{i=1}^n (1 − ωi) ).

Now the Gibbs sampler is easy to run...
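The two conditionals translate directly into code. A minimal Python sketch (hand-made toy data rather than Lange's; the Poisson draw uses Knuth's method since Python's standard library has no Poisson sampler; all names here are mine):

```python
import math
import random

def rpois(lam, rng):
    """Poisson(lam) draw via Knuth's method (fine for modest lam > 0)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def gibbs_capture(C, R, m, a, b, n_iter, seed=None):
    """Gibbs sampler for the capture-recapture posterior on (N, omega)."""
    rng = random.Random(seed)
    n = len(C)
    U_n = sum(c - r for c, r in zip(C, R))   # total distinct fish caught
    N = U_n                                  # smallest N consistent with data
    draws = []
    for _ in range(n_iter):
        omega = [rng.betavariate(a + C[i], b + N - C[i]) for i in range(n)]
        lam = m
        for w in omega:
            lam *= 1.0 - w
        N = U_n + rpois(lam, rng)            # N | omega, data
        draws.append(N)
    return draws

# toy data: 3 occasions, prior mean m = 50 for N
draws = gibbs_capture(C=[10, 10, 10], R=[0, 3, 5], m=50, a=1, b=1,
                      n_iter=2000, seed=1)
```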

32 / 42

Example: probit regression

Model: Yi ind∼ Ber(Φ(xi⊤β)), i = 1, . . . , n.

Suppose β has a normal prior.

Not directly obvious how to implement Gibbs to get a sample from the posterior distribution of β.

Recall, from the EM notes, that this model can be simplified by introducing some “missing data.”

The conditional distribution of the missing data, given the observed data and β, makes up one part of the full conditionals.

The other part of the full conditionals is simple since the model for the complete data is, by construction, nice.

33 / 42

Example: probit regression (cont.)

Missing data: Z1, . . . , Zn where Zi ∼ N(xi⊤β, 1) and

Yi = I(Zi > 0), i = 1, . . . , n.

Full conditionals:

Distribution of β, given (Y, Z), only depends on Z and is easy because the normal prior for β is conjugate.
Distribution of Z, given (Y, β), is a truncated normal...

Though I’ve not given the precise details, the steps for constructing a Gibbs sampler are not too difficult;9 see Section 8.3.2 in Ghosh et al (2006) for details.

9 The only potential difficulty is simulating from a truncated normal when the truncation point is extreme, but remember we have talked about extreme normal tail probabilities before...
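To make the footnote concrete, here is one way to draw the truncated normal Zi by inverse-CDF, using Python's standard library (the course's R code may do this differently; `rtruncnorm_lower` is my own name). It is exactly the extreme-truncation case that is numerically delicate, as the footnote warns.

```python
import random
from statistics import NormalDist

_std = NormalDist()   # standard normal: .cdf and .inv_cdf

def rtruncnorm_lower(mu, lower, rng):
    """Draw Z ~ N(mu, 1) conditioned on {Z > lower} by inverse-CDF.
    Loses accuracy when lower - mu is far out in the upper tail."""
    p_lo = _std.cdf(lower - mu)       # P(Z <= lower)
    u = rng.uniform(p_lo, 1.0)        # uniform over the retained tail
    return mu + _std.inv_cdf(u)

# e.g. for Y_i = 1 we need Z_i ~ N(x_i' beta, 1) given Z_i > 0
rng = random.Random(7)
z = [rtruncnorm_lower(0.0, 0.0, rng) for _ in range(2000)]
```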

34 / 42

Example: Dirichlet process mixture

In Bayesian nonparametrics, the Dirichlet process mixture model is probably the most widely used.

Flexible model for density estimation — normal mixture density with unspecified component means (and variances), but also doesn’t specify the number of components.

The main challenge with using mixture models is choosing how many components to use; the DPM selects the number of components automatically.

Interesting that, despite being a “nonparametric” model, the computations are not too hard, just a Gibbs sampler.

Simplest algorithm is in Escobar & West (JASA 1995); a nice slice sampler is proposed in Kalli et al (Stat Comp 2011).

35 / 42

Example: Dirichlet process mixture (cont.)

Using the slice sampler from Kalli et al to fit the same normal mixture model to the galaxy data from HW.

R code for this is on my research page.

[Figures: posterior mean density estimate and kernel estimate for the galaxy data (left); posterior distribution of the number of components (right).]

36 / 42

Outline

1 Introduction

2 Crash course on Markov chains

3 Motivation, revisited

4 Metropolis–Hastings algorithm

5 Gibbs sampler

6 Some MCMC diagnostics

7 Conclusion

37 / 42

Diagnostic plots

Sample path plot, or trace plot:

Can reveal any residual dependence after burn-in.
Idea is that a sample path of iid samples should show no trend, so if there is minimal trend in our sample path plot, then we can be comfortable treating samples as independent.

Autocorrelation plot:

Plot sample correlation of {(Xt, Xt+r) : t = 1, 2, . . .} as a function of the “lag” r.
Want to see the autocorrelation plot decay rapidly, suggesting that the dependence along the chain is not too strong.

If these plots indicate that the chain has not yet converged to stationarity, then you can run the chain longer or make some other modifications, e.g.,

transformations
“thinning”

38 / 42

Other considerations

The practical/theoretical rate of convergence can depend on the parametrization; see homework.

There is no agreement in the stat community about how many chains to run, how long the burn-in should be, etc.

Charles Geyer (Univ Minnesota) strongly supports running only one long chain — check out his “rants.”

Gelman & Rubin suggest running several shorter chains with different starting points; the textbook gives their diagnostic test.

39 / 42

Outline

1 Introduction

2 Crash course on Markov chains

3 Motivation, revisited

4 Metropolis–Hastings algorithm

5 Gibbs sampler

6 Some MCMC diagnostics

7 Conclusion

40 / 42

Remarks

MCMC methods are powerful because they give fairly general procedures able to solve a variety of important problems.

There are “black-box” software implementations:

The mcmc package in R will do random walk MH;
In SAS, PROC MCMC does similar things;
BUGS (“Bayesian inference Using Gibbs Sampling”).

However, it’s a bad idea to blindly use these things without fully understanding what they’re doing and whether or not they will actually work in your problem.

Also important to look at convergence diagnostics before the simulation results are used for inference.

41 / 42

Remarks (cont.)

Our focus here was on relatively simple MCMC methods.

Don’t think that MH, Gibbs, etc. are all separate methods; they can be combined.

For example, if one of the full conditionals is difficult to sample from, one might consider an MH step10 within Gibbs to sample from this conditional.

Book by Robert & Casella has some details about more advanced MCMC methods, including various combinations of these standard techniques.

10 Could also use accept–reject for this...

42 / 42