
  • Diagnostics in MCMC

    Hoff Chapter 6

    October 13, 2010

  • Convergence to Posterior Distribution

    Theory tells us that if we run the Gibbs sampler long enough, the samples we obtain will be samples from the joint posterior distribution (the target or stationary distribution). This does not depend on the starting point (the chain forgets the past).

    - How long do we need to run the Markov chain to adequately explore the posterior distribution?

    - Mixing of the chain plays a critical role in how fast we can obtain good results. How can we tell if the chain is mixing well (or poorly)?

  • Three Component Mixture Model

    Posterior for $\theta$:

    $\theta \mid Y \sim 0.45\,N(-3, 1/3) + 0.10\,N(0, 1/3) + 0.45\,N(3, 1/3)$

    How can we draw samples from the posterior?

    Introduce a mixture component indicator $\delta$, an unobserved latent variable which simplifies sampling:

    - $\delta = 1$: $\theta \mid \delta, Y \sim N(-3, 1/3)$ and $P(\delta = 1 \mid Y) = 0.45$

    - $\delta = 2$: $\theta \mid \delta, Y \sim N(0, 1/3)$ and $P(\delta = 2 \mid Y) = 0.10$

    - $\delta = 3$: $\theta \mid \delta, Y \sim N(3, 1/3)$ and $P(\delta = 3 \mid Y) = 0.45$

    Monte Carlo sampling: draw $\delta$; given $\delta$, draw $\theta$.
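    A minimal sketch of this two-step Monte Carlo sampler in R; the mixture weights and component parameters are taken from the slide, and all variable names are illustrative:

      set.seed(42)
      M   <- 1000
      w   <- c(0.45, 0.10, 0.45)   # P(delta = d | Y)
      m.d <- c(-3, 0, 3)           # component means
      s.d <- rep(sqrt(1/3), 3)     # component standard deviations

      # Draw delta, then theta | delta from the selected normal component
      delta    <- sample(1:3, M, replace = TRUE, prob = w)
      theta.MC <- rnorm(M, mean = m.d[delta], sd = s.d[delta])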

  • MC density

    [Figure: "Mixture Density" histogram of $\theta$ from 1000 MC draws, with the posterior density as a solid line; y-axis: Density (0.00 to 0.30), x-axis: $\theta$ (-4 to 4).]

  • MC Variation

    If we want to find the posterior mean of $g(\theta)$, then the Monte Carlo estimate based on M MC samples is

    $\bar{g}_{MC} = \frac{1}{M} \sum_{m=1}^{M} g(\theta^{(m)}) \approx E[g(\theta) \mid Y]$

    with variance

    $Var[\bar{g}_{MC}] = E\left[(\bar{g}_{MC} - E[\bar{g}_{MC}])^2\right] = \frac{Var[g(\theta) \mid Y]}{M}$

    leading to the Monte Carlo standard error $\sqrt{Var[\bar{g}_{MC}]}$.

    We expect the posterior mean of $g(\theta)$ to be in the interval $\bar{g}_{MC} \pm 2\sqrt{Var[\bar{g}_{MC}]}$ for roughly 95% of repeated MC samples.

    To increase accuracy, increase M.
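    Continuing the sampler sketch from the mixture slide, the MC estimate and its standard error for $g(\theta) = \theta$ (reusing the illustrative theta.MC):

      g.bar <- mean(theta.MC)                        # MC estimate of E[theta | Y]
      mc.se <- sd(theta.MC) / sqrt(length(theta.MC)) # Monte Carlo standard error
      c(estimate = g.bar,
        lower    = g.bar - 2 * mc.se,                # rough 95% interval for the
        upper    = g.bar + 2 * mc.se)                # posterior mean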

  • Markov Chain Monte Carlo

    - Full conditional for $\theta \mid \delta, Y$:

      - $\theta \mid \delta = 1, Y \sim N(-3, 1/3)$
      - $\theta \mid \delta = 2, Y \sim N(0, 1/3)$
      - $\theta \mid \delta = 3, Y \sim N(3, 1/3)$

    - Full conditional for $\delta \mid \theta, Y$ (use Bayes' Theorem):

      $P(\delta = d \mid \theta, Y) = \frac{p(\delta = d \mid Y)\, \mathrm{dnorm}(\theta, m_d, s^2_d)}{\sum_{d'=1}^{3} p(\delta = d' \mid Y)\, \mathrm{dnorm}(\theta, m_{d'}, s^2_{d'})}$

      where dnorm is the normal density.
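    A minimal Gibbs sampler sketch alternating these two full conditionals; the name theta.MCMC echoes the CODA example later, but all names are illustrative:

      set.seed(42)
      S   <- 1000
      w   <- c(0.45, 0.10, 0.45)   # P(delta = d | Y)
      m.d <- c(-3, 0, 3)           # component means m_d
      s.d <- rep(sqrt(1/3), 3)     # component sds s_d

      theta.MCMC    <- numeric(S)
      theta.MCMC[1] <- -6          # starting value theta = -6 (as on the slides)
      for (s in 2:S) {
        # delta | theta, Y: discrete full conditional via Bayes' Theorem
        p     <- w * dnorm(theta.MCMC[s - 1], m.d, s.d)
        delta <- sample(1:3, 1, prob = p / sum(p))
        # theta | delta, Y: draw from the selected normal component
        theta.MCMC[s] <- rnorm(1, m.d[delta], s.d[delta])
      }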

  • MCMC density

    [Figure: "Mixture Density" histogram of $\theta$ from MCMC draws, with the posterior density as a solid line; y-axis: Density (0.00 to 0.30), x-axis: $\theta$ (-6 to 4).]

    1000 draws using MCMC starting at $\theta = -6$ and $\delta = 1$.

  • MC Trace Plots

    [Figure: "Monte Carlo Samples" trace plot; y-axis: $\theta$ (-4 to 4), x-axis: Index (0 to 1000).]

  • MCMC Traceplot

    [Figure: "Markov Chain Monte Carlo Samples" trace plot; y-axis: $\theta$ (-6 to 4), x-axis: Index (0 to 1000).]

  • Stationarity

    Partition the parameter space as follows:

    - A0 = $(-\infty, -5)$

    - A1 = $(-5, -1)$

    - A2 = $(-1, 1)$

    - A3 = $(1, 5)$

    - A4 = $(5, \infty)$

    Under the posterior, we should have P(A3) = P(A1) > P(A2) > P(A0) = P(A4). We need our MCMC sample size S to be big enough that we can

    1. Move out of A2 (or other areas of low probability) into regions of high posterior probability (convergence)

    2. Move between A1 and A3 and other high-probability regions (mixing)

    A quick empirical check of these region probabilities is sketched below.
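    A sketch comparing the exact posterior probabilities of the five regions with the empirical frequencies of the Gibbs draws (reusing the illustrative theta.MCMC from earlier):

      breaks <- c(-Inf, -5, -1, 1, 5, Inf)
      w   <- c(0.45, 0.10, 0.45)
      m.d <- c(-3, 0, 3)
      s.d <- rep(sqrt(1/3), 3)

      # Exact P(A_i) under the mixture posterior
      exact <- sapply(1:5, function(i)
        sum(w * (pnorm(breaks[i + 1], m.d, s.d) - pnorm(breaks[i], m.d, s.d))))

      # Empirical frequencies from the MCMC draws
      empirical <- table(cut(theta.MCMC, breaks)) / length(theta.MCMC)
      round(rbind(exact, empirical = as.numeric(empirical)), 3)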

  • Stickiness

    The trace plots show:

    - MC samples can move from one region to another in one step (perfect mixing)

    - MCMC quickly moves away from the starting value -6

    - MCMC has more difficulty moving from A2 into higher-probability regions

    - MCMC has difficulty moving between the different components and tends to get stuck in one component for a while (stickiness)

  • Consequences for the Variance of MCMC Estimates

    $Var_{MCMC}[\bar{g}] = Var_{MC}[\bar{g}] + \frac{1}{S^2} \sum_{s \neq t} E\left[\left(g(\theta^{(s)}) - E[g(\theta) \mid Y]\right)\left(g(\theta^{(t)}) - E[g(\theta) \mid Y]\right)\right]$

    - The second term depends on the autocorrelation of samples within the Markov chain.

    - The autocorrelation at lag t is the correlation between $g(\theta^{(s)})$ and $g(\theta^{(s+t)})$ (elements that are t time steps apart).

    - It is often positive, so the MCMC variance is larger than the MC variance.

    - High autocorrelation is an indicator of poor mixing, so we need a larger sample size to obtain a comparable variance.

    - Effective sample size $S_{eff}$: the number of independent samples that would give the same precision, defined through

      $Var_{MCMC}[\bar{g}] = \frac{Var[g(\theta) \mid Y]}{S_{eff}}$
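    A sketch of these quantities in base R, reusing the illustrative theta.MC and theta.MCMC from the earlier sketches; the truncated-sum formula below is a rough textbook approximation to the effective sample size, not the estimator CODA uses:

      acf(theta.MC,   lag.max = 30)   # near zero beyond lag 0 (independent draws)
      acf(theta.MCMC, lag.max = 30)   # decays slowly: positive autocorrelation

      # Rough effective sample size: S / (1 + 2 * sum of lag-k autocorrelations)
      rho   <- acf(theta.MCMC, lag.max = 30, plot = FALSE)$acf[-1]
      S.eff <- length(theta.MCMC) / (1 + 2 * sum(rho))
      S.eff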

  • ACF Plots Monte Carlo Sample

    [Figure: autocorrelation function of the Monte Carlo sample; y-axis: ACF (0.0 to 1.0), x-axis: Lag (0 to 30).]

  • ACF Plots MCMC Sample

    [Figure: autocorrelation function of the MCMC sample; y-axis: ACF (0.0 to 1.0), x-axis: Lag (0 to 30).]

  • CODA

    The CODA package provides many popular diagnostics for assessing convergence of MCMC output from WinBUGS (and other programs).

    > library(coda)
    > effectiveSize(theta.MCMC)
        var1     var2
    3.444007 4.390776

    The precision of the MCMC estimate of the posterior mean based on 1000 samples is as good as taking about 4 independent samples!

    Running for 1 million iterations still gives an effective sample size of only 1927.
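    Raw draws from a hand-written sampler can be wrapped as an mcmc object before applying CODA's functions; a sketch using the illustrative Gibbs draws from earlier:

      library(coda)
      chain <- mcmc(theta.MCMC)   # wrap the raw vector of Gibbs draws
      effectiveSize(chain)
      summary(chain)              # posterior mean, sd, naive SE, time-series SE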

  • Useful Diagnostics/Functions

    - Effective sample size: effectiveSize()

    - Autocorrelation: autocorr.plot()

    - Cross-variable correlations: crosscorr.plot()

    - Geweke: geweke.diag()

    - Gelman-Rubin: gelman.diag()

    - Heidelberg & Welch: heidel.diag()

    - Raftery-Lewis: raftery.diag()
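    For example, the autocorrelation diagnostics apply directly to an mcmc object (continuing with the illustrative chain from above; crosscorr.plot() needs a chain with several variables):

      autocorr.plot(chain)   # ACF plots like the ones shown earlier
      autocorr.diag(chain)   # autocorrelations at selected lags, as a table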

  • Geweke

    Geweke (1992) proposed a convergence diagnostic for Markov chains based on a test for equality of the means of the first and last parts of a Markov chain (by default the first 10% and the last 50%).

    If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke's statistic has an asymptotically standard normal distribution.
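    A usage sketch with the default window fractions written out (the reported values are z-scores, so |z| well above 2 suggests the two parts of the chain disagree):

      geweke.diag(chain, frac1 = 0.1, frac2 = 0.5)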

  • Gelman-Rubin

    Gelman and Rubin (1992) propose a general approach to monitoring convergence of MCMC output in which m > 1 parallel chains are run (see the sketch below).

    - Use different starting values that are overdispersed relative to the posterior distribution.

    - Convergence is diagnosed when the chains have "forgotten" their initial values, and the output from all chains is indistinguishable.

    - The diagnostic is applied to a single variable from the chain. It is based on a comparison of within-chain and between-chain variances (similar to a classical analysis of variance).

    - Assumes that the target is normal (transformations may help).

    - Values of R near 1 suggest convergence.
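    A sketch with two parallel chains from overdispersed starting values; run.gibbs() is assumed to be the earlier Gibbs sampler wrapped as a function of its starting value (a hypothetical helper, not a CODA function):

      chain1 <- mcmc(run.gibbs(start = -6))   # run.gibbs(): hypothetical wrapper
      chain2 <- mcmc(run.gibbs(start =  6))   # around the Gibbs sketch above
      gelman.diag(mcmc.list(chain1, chain2))  # potential scale reduction factor R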

  • Heidelberg-Welch

    The convergence test uses the Cramér-von Mises statistic to test the null hypothesis that the sampled values come from a stationary distribution.

    - The test is applied successively: first to the whole chain, then after discarding the first 10%, 20%, ... of the chain, until either the null hypothesis is accepted or 50% of the chain has been discarded.

    - The latter outcome constitutes failure of the stationarity test and indicates that a longer MCMC run is needed.

    - If the stationarity test is passed, the number of iterations to keep and the number to discard (burn-in) are reported.
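    A usage sketch (per variable, the output reports the stationarity test result, the iterations to keep and discard, and a half-width test on the mean estimate):

      heidel.diag(chain)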

  • Raftery-Lewis

    Calculates the number of iterations required to estimate the quantile q to within an accuracy of +/- r with probability p.

    - Separate calculations are performed for each variable within each chain. If the number of iterations in the data is too small, an error message is printed indicating the minimum length of pilot run.

    - The minimum length is the required sample size for a chain with no correlation between consecutive samples. An estimate I (the dependence factor) of the extent to which autocorrelation inflates the required sample size is also provided.

    - Values of I larger than 5 indicate strong autocorrelation, which may be due to a poor choice of starting value, high posterior correlations, or stickiness of the MCMC algorithm.

    - The number of burn-in iterations to be discarded at the beginning of the chain is also calculated.
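    A usage sketch with CODA's defaults written out (estimate the 2.5% quantile to within +/- 0.005 with probability 0.95):

      raftery.diag(chain, q = 0.025, r = 0.005, s = 0.95)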

  • Summary

    - Diagnostics cannot guarantee that a chain has converged

    - They can indicate that it has not converged

    Solutions?

    - Run longer and thin the output (see the sketch below)

    - Reparametrize the model

    - Block correlated variables together

    - Integrate out variables

    - Add auxiliary variables (the slice sampler, for example)

    - Use Rao-Blackwellization in estimation
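    For the first solution, thinning is built into CODA's window() method; a sketch reusing the illustrative chain:

      thinned <- window(chain, thin = 10)   # keep every 10th draw to reduce
      effectiveSize(thinned)                # autocorrelation among stored draws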