
Page 1: Yet another MCMC case-study

Håvard Rue, Department of Mathematical Sciences

NTNU, Norway

Sept 2008

Page 2: Yet another MCMC case-study


Overview

Yet another MCMC case-study: Overview of talk

• Study a seemingly trivial hierarchical model
  • Latent temporal Gaussian, with
  • Binary observations

• Develop a “standard” MCMC-algorithm for inference
  • Auxiliary variables
  • (Conjugate) single-site updates

• ... and study its properties empirically.

Page 3: Yet another MCMC case-study


Overview

Yet another MCMC case-study: Overview of talk ...

• Demonstrate how to develop more sophisticated MCMC-algorithms

  • Blocking
  • Joint updates with blocking
  • Independence sampler

• and demonstrate that MCMC is not even needed in this example.

• Discuss the consequences of this case-study.

Page 4: Yet another MCMC case-study


The latent Gaussian model

Tokyo rainfall data

Stage 1: Binomial data

$$y_i \sim \begin{cases}\text{Binomial}(2,\ p(x_i)) & \text{(calendar days occurring in both years)}\\ \text{Binomial}(1,\ p(x_i)) & \text{(29 February)}\end{cases}$$

Page 5: Yet another MCMC case-study


The latent Gaussian model

Tokyo rainfall data

Stage 2: Assume a smooth latent x,

$$x \sim \text{RW2}(\kappa), \qquad \text{logit}(p_i) = x_i$$

Page 6: Yet another MCMC case-study


The latent Gaussian model

Tokyo rainfall data

Stage 3: Gamma(α, β) prior on κ

Page 7: Yet another MCMC case-study


The latent Gaussian model

The RW2 model for regular locations

Use the second-order increments

$$\Delta^2 x_i \overset{\text{iid}}{\sim} \mathcal{N}(0,\ \kappa^{-1}) \qquad (1)$$

for i = 1, …, n − 2, to define the joint density of x:

$$\pi(x) \propto \kappa^{(n-2)/2} \exp\!\Big(-\frac{\kappa}{2}\sum_{i=1}^{n-2}\big(x_i - 2x_{i+1} + x_{i+2}\big)^2\Big) = \kappa^{(n-2)/2} \exp\!\Big(-\frac{\kappa}{2}\, x^{T} R\, x\Big)$$

Page 8: Yet another MCMC case-study


The latent Gaussian model

$$R = \begin{pmatrix}
1 & -2 & 1 & & & & & \\
-2 & 5 & -4 & 1 & & & & \\
1 & -4 & 6 & -4 & 1 & & & \\
 & 1 & -4 & 6 & -4 & 1 & & \\
 & & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & & 1 & -4 & 6 & -4 & 1 \\
 & & & & 1 & -4 & 5 & -2 \\
 & & & & & 1 & -2 & 1
\end{pmatrix}.$$

IGMRF of second order: the density is invariant to adding a line to x. Our problem is circulant.
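Not from the talk, but to make the construction concrete, here is a minimal numpy sketch that builds R from the second-difference matrix D (so that R = DᵀD reproduces the stencil above) and evaluates log π(x | κ) up to an additive constant. The `cyclic` flag wraps the increments around, as in the circulant 366-day problem; the `rank` argument is n − 2 for the non-cyclic model shown above.

```python
import numpy as np

def rw2_structure(n, cyclic=False):
    """RW2 structure matrix R = D.T @ D, with D the second-difference operator."""
    if cyclic:
        D = np.zeros((n, n))
        for i in range(n):                      # increments wrap around: circulant R
            D[i, [i, (i + 1) % n, (i + 2) % n]] = [1.0, -2.0, 1.0]
    else:
        D = np.zeros((n - 2, n))
        for i in range(n - 2):
            D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D.T @ D

def rw2_logdens(x, kappa, R, rank):
    """log pi(x | kappa) up to a constant: (rank/2) log(kappa) - (kappa/2) x'Rx."""
    return 0.5 * rank * np.log(kappa) - 0.5 * kappa * (x @ R @ x)

R = rw2_structure(6)
print(R.astype(int))                            # rows show the 1 -2 1 / -2 5 -4 1 / 1 -4 6 -4 1 stencil
x = np.random.default_rng(1).normal(size=6)
print(rw2_logdens(x, kappa=10.0, R=R, rank=4))  # rank = n - 2 here
```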

Page 9: Yet another MCMC case-study


The latent Gaussian model

Model summary

$$\pi(x \mid \kappa)\; \pi(\kappa)\, \prod_i \pi(y_i \mid x_i)$$

where

• x | κ is Gaussian (Markov) with dimension 366

• κ is Gamma

• y_i | x_i is Binomial with p(x_i)

Page 10: Yet another MCMC case-study


MCMC – single-site

Construction of nice full conditionals

A popular approach is to introduce auxiliary variables w, so that

x | the rest

is Gaussian.

Page 11: Yet another MCMC case-study


MCMC – single-site

Normal-mixture representation

Theorem (Kelker, 1971). If x has density f(x), symmetric around 0, then there exist independent random variables z and v, with z standard normal, such that x = z/v, if and only if the derivatives of f(x) satisfy

$$\Big(-\frac{d}{dy}\Big)^{k} f(\sqrt{y}) \ge 0$$

for y > 0 and for k = 1, 2, ….

• Student-t

• Logistic and Laplace

Page 12: Yet another MCMC case-study


MCMC – single-site

Corresponding mixing distribution for the precision parameter λ that generates these distributions as scale mixtures of normals.

Distribution of x      Mixing distribution of λ
Student-t_ν            Gamma(ν/2, ν/2)
Logistic               1/(2K)², where K is Kolmogorov-Smirnov distributed
Laplace                1/(2E), where E is exponentially distributed
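Not in the original slides, but a quick way to convince yourself of this table: draw the precision λ from the mixing distribution, then x | λ ∼ N(0, 1/λ), and compare tail probabilities with the target distribution. A minimal Python sketch for the Student-t and Laplace rows (the logistic row would additionally need draws of a Kolmogorov-Smirnov distributed K):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m = 200_000

# Student-t_nu: lambda ~ Gamma(nu/2, rate nu/2), then x | lambda ~ N(0, 1/lambda)
nu = 4.0
lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=m)        # numpy uses scale = 1/rate
x_t = rng.normal(size=m) / np.sqrt(lam)
print("t_4:     P(|x|<2) mixture =", np.mean(np.abs(x_t) < 2),
      " exact =", stats.t(df=nu).cdf(2) - stats.t(df=nu).cdf(-2))

# Laplace: lambda = 1/(2E) with E ~ Exp(1), i.e. variance 1/lambda = 2E
E = rng.exponential(size=m)
x_lap = rng.normal(size=m) * np.sqrt(2 * E)
print("Laplace: P(|x|<2) mixture =", np.mean(np.abs(x_lap) < 2),
      " exact =", stats.laplace().cdf(2) - stats.laplace().cdf(-2))
```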

Page 13: Yet another MCMC case-study


MCMC – single-site

Example: Binary regression

GMRF x and Bernoulli data

$$y_i \sim \mathcal{B}\big(g^{-1}(x_i)\big), \qquad g^{-1}(x) = \Phi(x) \quad \text{(probit link)}$$

Equivalent representation using auxiliary variables w

$$\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, 1), \qquad w_i = x_i + \varepsilon_i, \qquad y_i = \begin{cases}1 & \text{if } w_i > 0\\ 0 & \text{otherwise,}\end{cases}$$

for the probit link.
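A minimal sketch (my own illustration, not code from the talk) of the resulting full conditional for the auxiliaries: given x_i and y_i, w_i is a N(x_i, 1) variable truncated to (0, ∞) if y_i = 1 and to (−∞, 0] if y_i = 0.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_w(x, y, rng):
    """Draw w_i | x_i, y_i for the probit auxiliary representation."""
    lo = np.where(y == 1, 0.0, -np.inf)            # w_i > 0  when y_i = 1
    hi = np.where(y == 1, np.inf, 0.0)             # w_i <= 0 when y_i = 0
    # truncnorm takes standardised bounds (lo - mean)/sd, (hi - mean)/sd; here sd = 1
    return truncnorm.rvs(lo - x, hi - x, loc=x, scale=1.0, random_state=rng)

rng = np.random.default_rng(2)
x = np.array([-1.0, 0.0, 1.0])
y = np.array([0, 1, 1])
print(sample_w(x, y, rng))
```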

Page 14: Yet another MCMC case-study


MCMC – single-site

Single-site Gibbs sampling

Auxiliary variables can be introduced for the logit-link¹ to achieve this sampler (one sweep is sketched below):

• κ ∼ Γ(·, ·)
• for each i: x_i ∼ N(·, ·)
• for each i: w_i ∼ W(·)

It is fully automatic; no tuning!!!

¹ Held & Holmes (2006)
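To make the structure concrete, here is a rough self-contained sketch of one full sweep. For simplicity it uses the probit auxiliary scheme from the previous slide (Bernoulli data) and a non-cyclic RW2 prior rather than the exact Held & Holmes logit construction; these are my simplifications, not the sampler used in the talk.

```python
import numpy as np
from scipy.stats import truncnorm

def rw2_R(n):
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D.T @ D

def gibbs_sweep(x, w, kappa, y, R, alpha, beta, rng):
    n = len(x)
    # kappa | x ~ Gamma(alpha + (n-2)/2, rate = beta + x'Rx/2)
    kappa = rng.gamma(alpha + 0.5 * (n - 2), 1.0 / (beta + 0.5 * (x @ R @ x)))
    Q = kappa * R
    # x_i | x_{-i}, w, kappa: Gaussian, prior precision Q plus a unit-precision "observation" w_i
    for i in range(n):
        prec = Q[i, i] + 1.0
        mean = (w[i] - (Q[i] @ x - Q[i, i] * x[i])) / prec
        x[i] = rng.normal(mean, 1.0 / np.sqrt(prec))
    # w_i | x_i, y_i: truncated N(x_i, 1), sign fixed by y_i
    lo = np.where(y == 1, 0.0, -np.inf)
    hi = np.where(y == 1, np.inf, 0.0)
    w = truncnorm.rvs(lo - x, hi - x, loc=x, scale=1.0, random_state=rng)
    return x, w, kappa

rng = np.random.default_rng(3)
n = 50
R = rw2_R(n)
y = (rng.normal(size=n) > 0).astype(int)        # toy data
x, w, kappa = np.zeros(n), np.where(y == 1, 0.5, -0.5), 1.0
for it in range(200):
    x, w, kappa = gibbs_sweep(x, w, kappa, y, R, alpha=1.0, beta=0.01, rng=rng)
print("kappa after 200 sweeps:", kappa)
```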

Page 15: Yet another MCMC case-study


MCMC – single-site

Results: hyper-parameter log(κ)

Page 16: Yet another MCMC case-study


MCMC – single-site

Results: latent node x10

Page 17: Yet another MCMC case-study


MCMC – single-site

Results: density for latent node x10

[Figure: kernel density estimates of x[10] from run 1 (N = 747017, bandwidth = 0.002606) and run 2 (N = 747391, bandwidth = 0.002803)]

Page 18: Yet another MCMC case-study


MCMC – single-site

Discussion

Single-site sampler with auxiliary variables:

• Even long runs show large variation

• “Long”-range dependence

• Very slow mixing

But:

• Easy to be “fooled” by running shorter chains

• The variability can be underestimated.

Page 19: Yet another MCMC case-study


MCMC – single-site

What is causing the problem?

Two issues

1. Slow mixing within the latent field x

2. Slow mixing between the latent field x and θ.

Blocking is the “usual” approach to resolve such issues, if possible.

Note: blocking mainly helps within the block only.

Page 20: Yet another MCMC case-study


MCMC – blocking

Strategies for blocking

Slow mixing due to the latent field x only:

• Block x

Slow mixing due to the interaction between the latent field x and θ:

• Block (x,θ).

In most cases: if you can do one, you can do both.

Page 21: Yet another MCMC case-study


MCMC – blocking scheme I

Blocking scheme I

• κ ∼ Γ(·, ·)
• x ∼ N(·, ·)
• w ∼ W(·) (conditionally independent)

(The joint draw of x is sketched below.)
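A minimal dense-matrix sketch (my own, again using the probit auxiliaries for concreteness) of the new ingredient in scheme I: drawing the whole vector x in one block from its Gaussian full conditional, here N((κR + I)⁻¹w, (κR + I)⁻¹), via one Cholesky factorisation. In a serious implementation this is exactly where sparse GMRF algorithms matter; dense Cholesky is only for illustration.

```python
import numpy as np

def block_update_x(w, kappa, R, rng):
    """Draw x ~ N(mu, P^{-1}) with precision P = kappa*R + I and P mu = w."""
    n = len(w)
    P = kappa * R + np.eye(n)
    L = np.linalg.cholesky(P)                           # P = L L^T
    mu = np.linalg.solve(L.T, np.linalg.solve(L, w))    # solve P mu = w
    z = rng.standard_normal(n)
    return mu + np.linalg.solve(L.T, z)                 # L^{-T} z has covariance P^{-1}
```

This replaces the single-site loop over x_i; the κ and w steps are unchanged.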

Page 22: Yet another MCMC case-study


MCMC – blocking scheme I

Results

[Figure: density of log(kappa) (N = 101804, bandwidth = 0.04667) and density of x[10] (N = 101804, bandwidth = 0.003815)]

Page 23: Yet another MCMC case-study


MCMC – blocking scheme II

Blocking scheme II

• Sample
  • κ′ ∼ q(κ′; κ)
  • x′ | κ′, y ∼ N(·, ·)

  and then accept/reject (x′, κ′) jointly

• w ∼ W(·) (conditionally independent)

Remarks

• If the normalising constant for x | · is available, then this is an EASY FIX of scheme I.

• Usually makes a huge improvement

• Automatic “reparameterisation”

• Doubles the computational cost

(A sketch of this one-block update, conditional on the auxiliary variables, follows.)
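A rough, dense-matrix sketch of scheme II as I read it (my own reconstruction, conditional on the probit auxiliaries w, with a random-walk proposal on log κ; none of these choices are stated in the talk). Because x′ is drawn from its exact Gaussian full conditional, the x-terms cancel in the acceptance ratio and only the marginal π(κ | w) remains, and evaluating it needs the Gaussian normalising constant, which is where the remark above comes in.

```python
import numpy as np

def log_kappa_cond(theta, w, R, alpha, beta):
    """log pi(theta | w) up to a constant, theta = log(kappa), obtained by integrating x
    out of pi(kappa) pi(x | kappa) pi(w | x), with a Gamma(alpha, beta) prior on kappa."""
    n = len(w)
    kappa = np.exp(theta)
    P = kappa * R + np.eye(n)                                # precision of x | kappa, w
    L = np.linalg.cholesky(P)
    mu = np.linalg.solve(L.T, np.linalg.solve(L, w))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    # (alpha + (n-2)/2)*theta collects the Gamma prior (plus log-scale Jacobian) and kappa^{(n-2)/2}
    return (alpha + 0.5 * (n - 2)) * theta - beta * kappa - 0.5 * logdet + 0.5 * (mu @ w), mu, L

def one_block_update(theta, x, w, R, alpha, beta, step, rng):
    """Scheme II: propose kappa', draw x' from its Gaussian full conditional,
    accept/reject (kappa', x') jointly; the ratio reduces to pi(theta'|w)/pi(theta|w)."""
    lp0, _, _ = log_kappa_cond(theta, w, R, alpha, beta)
    theta_new = theta + step * rng.standard_normal()
    lp1, mu1, L1 = log_kappa_cond(theta_new, w, R, alpha, beta)
    if np.log(rng.uniform()) < lp1 - lp0:
        theta = theta_new
        x = mu1 + np.linalg.solve(L1.T, rng.standard_normal(len(w)))
    return theta, x
```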

Page 24: Yet another MCMC case-study


MCMC – blocking scheme II

Results

[Figure: ACF of log(kappa) under scheme I and under scheme II, lags 0 to 300]

Page 25: Yet another MCMC case-study


MCMC – blocking scheme II

Schemes I & II are not that easy to implement in practice...

• The computational speed depends critically on how the Gaussian part is dealt with

• Statisticians do like high-dimensional Gaussians, as long as they can be dealt with using “Kalman-filter” algorithms

• “Kalman-filter” algorithms are not applicable here, as the prior is circulant.

• General sparse matrix numerical algebra is required

• Possible in R, but not (yet?) as easy as it could be.

Conclusion: Not that many would have implemented either blocking scheme.

Page 26: Yet another MCMC case-study


MCMC – blocking without auxiliary variables

Removing the auxiliary variables

• The auxiliary variables make the full conditional for x Gaussian

• If we do not use them, the full conditional for x looks like

$$\pi(x \mid \ldots) \propto \exp\!\Big(-\frac{1}{2}x^{T} Q x + \sum_i \log \pi(y_i \mid x_i)\Big) \approx \exp\!\Big(-\frac{1}{2}(x - \mu)^{T}\big(Q + \operatorname{diag}(c)\big)(x - \mu)\Big) = \pi_G(x \mid \ldots)$$

• The Gaussian approximation is constructed by matching the
  • mode, and the
  • curvature at the mode.

(A Newton-type sketch of this construction follows.)
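A rough numpy sketch (my own, dense-matrix, non-cyclic) of how π_G can be constructed for Binomial data with a logit link: repeatedly expand Σᵢ log π(yᵢ | xᵢ) to second order around the current x, which yields the vector c and a linearised "observation" vector b, and solve the resulting Gaussian system until the mode has converged.

```python
import numpy as np

def gaussian_approx(kappa, y, m, R, n_iter=20, tol=1e-8):
    """Gaussian approximation to pi(x | kappa, y) for Binomial(m_i, p_i) data with
    logit link, obtained by matching the mode and the curvature at the mode.
    Returns the mode mu and the precision matrix Q + diag(c)."""
    n = len(y)
    Q = kappa * R
    x = np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-x))
        c = m * p * (1.0 - p)                    # -(d^2/dx_i^2) log pi(y_i | x_i)
        b = c * x + (y - m * p)                  # linearised observation term
        x_new = np.linalg.solve(Q + np.diag(c), b)
        if np.max(np.abs(x_new - x)) < tol:
            x = x_new
            break
        x = x_new
    p = 1.0 / (1.0 + np.exp(-x))
    return x, Q + np.diag(m * p * (1.0 - p))

# usage, e.g. with R = rw2_structure(366) from the earlier sketch and m_i = 2 trials:
# mu, P = gaussian_approx(kappa=20.0, y=y, m=np.full(366, 2.0), R=R)
```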

Page 27: Yet another MCMC case-study


MCMC – blocking without auxiliary variables

Improved one-block scheme

• κ′ ∼ q(·;κ)

• x′ ∼ πG (x | κ′, y)

• Accept/reject (x′, κ′) jointly

Note: π_G(·) is indexed by κ′, hence we need to compute one for each value of κ′.

Page 28: Yet another MCMC case-study


MCMC – blocking without auxiliary variables

Results

[Figure: trace of log(kappa) and ACF of log(kappa), lags 0 to 300]

Page 29: Yet another MCMC case-study


MCMC – independence sampler

Independence sampler

We can construct an independence sampler using π_G(·). The Laplace approximation for κ | y:

$$\pi(\kappa \mid y) \propto \frac{\pi(\kappa)\,\pi(x \mid \kappa)\,\pi(y \mid x)}{\pi(x \mid \kappa, y)} \approx \left.\frac{\pi(\kappa)\,\pi(x \mid \kappa)\,\pi(y \mid x)}{\pi_G(x \mid \kappa, y)}\right|_{x = \operatorname{mode}(\kappa)}$$

Hence, we first

• Evaluate the Laplace-approximation at some “selected” points

• Build an interpolation log-spline

• Use this parametric model as π(κ|y) (a grid-evaluation sketch follows)
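A self-contained sketch of these two steps for the Binomial-logit model (my own, dense matrices, non-cyclic RW2, toy data; the log-spline interpolation mentioned above is omitted and a plain grid in log κ is used instead): for each grid point, find the mode with the Newton iteration from the earlier slide, evaluate the Laplace approximation there, then normalise the grid values into weights.

```python
import numpy as np

def rw2_R(n):
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D.T @ D

def log_post_kappa(kappa, y, m, R, alpha, beta, n_iter=20):
    """Laplace approximation to log pi(kappa | y), up to an additive constant:
    log pi(kappa) + log pi(x*|kappa) + log pi(y|x*) - log pi_G(x*|kappa, y) at the mode x*."""
    n = len(y)
    Q = kappa * R
    x = np.zeros(n)
    for _ in range(n_iter):                              # Newton iteration for the mode
        p = 1.0 / (1.0 + np.exp(-x))
        c = m * p * (1.0 - p)
        x = np.linalg.solve(Q + np.diag(c), c * x + (y - m * p))
    p = 1.0 / (1.0 + np.exp(-x))
    P = Q + np.diag(m * p * (1.0 - p))
    log_prior = (alpha - 1.0) * np.log(kappa) - beta * kappa            # Gamma(alpha, beta)
    log_px = 0.5 * (n - 2) * np.log(kappa) - 0.5 * kappa * (x @ R @ x)  # RW2 prior (improper)
    log_lik = np.sum(y * x - m * np.log1p(np.exp(x)))
    return log_prior + log_px + log_lik - 0.5 * np.linalg.slogdet(P)[1]

# grid in theta = log(kappa); "+ t" is the Jacobian of the log transform
rng = np.random.default_rng(4)
n = 60
R, m = rw2_R(n), np.full(n, 2.0)
y = rng.binomial(2, 0.3, size=n).astype(float)           # toy data
thetas = np.linspace(1.0, 8.0, 30)
lp = np.array([log_post_kappa(np.exp(t), y, m, R, 1.0, 0.01) + t for t in thetas])
weights = np.exp(lp - lp.max()); weights /= weights.sum()
```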

Page 30: Yet another MCMC case-study


MCMC – independence sampler

Independence sampler

• κ′ ∼ π(κ|y)

• x′ ∼ πG (x|κ′, y)

• Accept/reject (κ′, x′) jointly

Note: Corr(x(t + k), x(t)) ≈ (1 − α)^{|k|}, where α is the acceptance rate.

In this example, α = 0.83...

Page 31: Yet another MCMC case-study


MCMC – independence sampler

Results

[Figure: trace of log(kappa) and ACF of log(kappa) for the independence sampler, lags 0 to 300]

Page 32: Yet another MCMC case-study


MCMC – Deterministic inference

Can we improve this sampler?

• Yes, if we are interested in the posterior marginals for κ and {x_i}.

• The marginals of the Gaussian proposal π_G(x | ...) are known analytically.

• Just use numerical integration!

Page 33: Yet another MCMC case-study


MCMC – Deterministic inference

Deterministic inference

Posterior marginal for κ:

• Compute π(κ|y)

Posterior marginal for xi :

• Use numerical integration

$$\pi(x_i \mid y) = \int \pi(x_i \mid y, \kappa)\, \pi(\kappa \mid y)\, d\kappa \;\approx\; \sum_k \mathcal{N}\big(x_i;\ \mu_i(\kappa_k),\ \sigma_i^2(\kappa_k)\big) \times \pi(\kappa_k \mid y) \times \Delta_k$$
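Once the grid weights π(κ_k | y)Δ_k and the Gaussian-approximation marginals (μ_i(κ_k), σ_i²(κ_k)) are available from the previous steps, the marginal for x_i is just a finite mixture of Gaussians. A toy sketch, where the weights, means and standard deviations are made-up placeholders standing in for the quantities computed above:

```python
import numpy as np
from scipy.stats import norm

# hypothetical grid output, for illustration only
weights = np.array([0.10, 0.25, 0.35, 0.20, 0.10])     # pi(kappa_k | y) * Delta_k, normalised
mu_i    = np.array([0.20, 0.22, 0.24, 0.26, 0.28])     # mean of x_i under pi_G(. | kappa_k, y)
sd_i    = np.array([0.060, 0.055, 0.050, 0.045, 0.040])

xs = np.linspace(0.0, 0.5, 501)
marginal = sum(w * norm.pdf(xs, m, s) for w, m, s in zip(weights, mu_i, sd_i))
dx = xs[1] - xs[0]
print("integrates to ≈", marginal.sum() * dx)
print("posterior mean of x_i ≈", (xs * marginal).sum() * dx)
```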

Page 34: Yet another MCMC case-study


MCMC – Deterministic inference

Results: Mixture of Gaussians

[Figure: histogram of x, density scale]

Page 35: Yet another MCMC case-study


MCMC – Deterministic inference

Results: Improved....

[Figure: histogram of x, density scale]

Page 36: Yet another MCMC case-study


MCMC – Deterministic inference

Deterministic inference

• Obtain “exact results”; the error cannot be detected using MCMC

• Fast; about 1 second on my laptop.

• This approach is much more generally applicable.

• A lot of delicate coding

• In this example, it makes the model useful in practice!

Page 37: Yet another MCMC case-study


Discussion

What can be learned from this exercise?

For a relatively simple model, we have implemented

• single-site with auxiliary variables (looong time; hours)

• various forms for blocking (long time; many minutes)

• independence sampler (long time; many minutes)

• approximate inference (nearly instant; one second)

Page 38: Yet another MCMC case-study


Discussion

What can be learned from this exercise? ...

My guess: nearly all statisticians would implement the single-site scheme, possibly with auxiliary variables.

Which implies

• Most probably, the results would not be correct.

• They “accept” the long running time.

• Trouble: such MCMC schemes are not useful for routine analysis of similar data.

Page 39: Yet another MCMC case-study


Discussion

What can be learned from this exercise? ...

• In many cases, the situation is much worse in practice; this was a very simple model.

• Single-site MCMC is still the default choice for the non-expert user.

• Hierarchical models are popular, but they are difficult for MCMC.

Perhaps the development of models is not in sync with the development of inference? We cannot just wait for more powerful computers...

Page 40: Yet another MCMC case-study


Discussion

MCMC tries to solve the full problem...

If the target is the posterior marginals, then perhaps MCMC is not needed?

• Consider a latent AR(1) model of length n, observed with non-Gaussian observations

• If we are interested in the posterior marginal for x_1, then only a subset of the data (and of the x's) is needed! No effect of n.

• For the hyper-parameters, large n makes life easier!