Reading Group on Sampling Methods
An implementation perspective
Tingting Zhu, University of Oxford



  • Overview

    Discussion of the following:

    • Rejection Sampling

    • Metropolis and Metropolis-Hastings Algorithms

    • Gibbs Sampler

    • Convergence Diagnostic Tests

  • Monte Carlo Sampling

    Aim: to draw samples from a target distribution p(x) that is difficult to sample from directly.

    Observation: suppose {x^(i)}, i = 1, …, N, is an i.i.d. random sample drawn from p(x); then expectations under p(x) can be approximated by sample averages, E_p[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(x^(i)).
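
    As a quick illustration, here is a minimal Python sketch (assuming numpy, which the later sketches also use) of this observation, estimating two expectations of a standard normal by sample averages:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # i.i.d. draws from p(x) = N(0, 1)

    # Monte Carlo estimates: sample averages approximate expectations under p(x)
    print(x.mean())          # E[x]     ~ 0
    print((x > 1.0).mean())  # P(x > 1) ~ 0.159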

  • Rejection Sampling

    Pseudo-code

    i ← 1
    while i ≤ N do
        x^(i) ∼ q(x)
        u ∼ U(0, 1)
        if u < p(x^(i)) / (M q(x^(i))) then
            accept x^(i)
            i ← i + 1
        else
            reject x^(i)
        end if
    end while

    Target distribution: p(x)
    Proposal distribution: q(x)
    Scaled proposal (i.e., envelope) distribution: M q(x)
    Uniform distribution on (0, 1): U(0, 1)

    [Figure: target p(x), scaled proposal M q(x), and accepted/rejected draws u·M q(x*) plotted against x.]

  • Example

    Target distribution:
        p(x) = normpdf(x, 3, 2) + normpdf(x, −5, 1)
    Proposal distribution:
        q(x) = normpdf(x, 0, 4)

    Steps:
    1) Choose M > 1 such that M·q(x) 'covers' p(x), i.e. M = max(p(x) ./ q(x))
    2) Draw x* ∼ N(0, 4)
    3) Accept x* if u < p(x*) / (M q(x*)), where u ∼ U(0, 1)
    4) Repeat (2) and (3) N times

    Source: https://theclevermachine.wordpress.com/2012/09/10/rejection-sampling/
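
    A minimal Python sketch of this example, assuming scipy is available (scipy.stats.norm.pdf plays the role of MATLAB's normpdf); estimating M on a grid is an illustrative shortcut:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    p = lambda x: norm.pdf(x, 3, 2) + norm.pdf(x, -5, 1)  # target (unnormalised is fine)
    q = lambda x: norm.pdf(x, 0, 4)                       # proposal

    # 1) choose M so that M*q(x) envelopes p(x) (estimated on a grid here)
    grid = np.linspace(-10, 10, 2001)
    M = np.max(p(grid) / q(grid))

    samples = []
    while len(samples) < 1000:
        x_star = rng.normal(0, 4)            # 2) draw from the proposal
        u = rng.uniform()
        if u < p(x_star) / (M * q(x_star)):  # 3) accept with prob p/(M q)
            samples.append(x_star)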


  • [Figure: histogram of 1000 accepted samples (counts vs x), compared with the pdf of p(x).]

    Drawbacks:
    1) It is hard to choose a good proposal distribution.
    2) It breaks down in high dimensions: choosing M large enough to guarantee that M q(x) bounds p(x) everywhere leads to a very small acceptance rate.

    [Figure: target p(x), scaled proposal M q(x), and accepted/rejected draws, illustrating the low acceptance rate.]

  • Markov Chain Monte Carlo (MCMC)

    MCMC is a class of methods for simulating draws that are slightly dependent and approximately distributed according to a target (posterior) distribution.

    Markov chain: a stochastic process in which future states are independent of past states given the present state:
        x^(0) → x^(1) → x^(2) → ⋯ → x^(N−1) → x^(N)
    Monte Carlo: random simulation.

    MCMC in Bayesian analysis:
        p(θ|y) = p(y|θ) p(θ) / ∫ p(y|θ) p(θ) dθ = (1/Z) p(y|θ) p(θ),
    where the normalising constant Z = ∫ p(y|θ) p(θ) dθ is typically intractable.

    • MCMC draws samples from the full posterior distribution and makes inferences using those samples as representatives of the posterior.

  • Metropolis & Metropolis Hastings Algorithms

    Pseudo-code (Metropolis)

    Init x^(1)
    for i = 1 to N do
        u ∼ U(0, 1)
        x* ∼ q(x* | x^(i))
        if u < min(1, p(x*) / p(x^(i))) then
            x^(i+1) ← x*
        else
            x^(i+1) ← x^(i)
        end if
    end for

    Pseudo-code (MH)

    Init x^(1)
    for i = 1 to N do
        u ∼ U(0, 1)
        x* ∼ q(x* | x^(i))
        if u < min(1, [p(x*) q(x^(i) | x*)] / [p(x^(i)) q(x* | x^(i))]) then
            x^(i+1) ← x*
        else
            x^(i+1) ← x^(i)
        end if
    end for

    Note: the Metropolis algorithm assumes the proposal distribution q is symmetric, i.e. q(x* | x) = q(x | x*), so that the simplified acceptance ratio still fulfils the reversibility property of the Markov chain.
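
    A minimal random-walk Metropolis-Hastings sketch in Python, assuming numpy; log_p is any unnormalised log target, and because the Gaussian proposal is symmetric the q-ratio cancels, reducing MH to the Metropolis rule:

    import numpy as np

    def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
        """Random-walk MH: propose x* ~ N(x, step^2), accept with prob min(1, p(x*)/p(x))."""
        rng = np.random.default_rng(seed)
        x = x0
        chain = np.empty(n_samples)
        for i in range(n_samples):
            x_star = x + step * rng.normal()
            # symmetric proposal: acceptance ratio reduces to p(x*)/p(x)
            if np.log(rng.uniform()) < log_p(x_star) - log_p(x):
                x = x_star
            chain[i] = x
        return chain

    Working with log densities avoids numerical underflow when p(x) is very small.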

  • Example(1)

    Source: https://theclevermachine.wordpress.com/2012/11/04/mcmc-multivariate-distributions-block-wise-component-wise-updates/

    Target distribution: p(x) = mvnpdf(x, μ, Σ), where μ = [0; 0] and Σ = [0.25 0.3; 0.3 1]

    Proposal draw: x* ∼ q(x* | x^(i)) = MVN(x^(i), Σ_q), where Σ_q = [1 0; 0 1]

    Acceptance probability:
        min(1, [p(x*) q(x^(i) | x*)] / [p(x^(i)) q(x* | x^(i))])
        = min(1, [mvnpdf(x*, μ, Σ) · mvnpdf(x^(i), x*, Σ_q)] / [mvnpdf(x^(i), μ, Σ) · mvnpdf(x*, x^(i), Σ_q)])

    [Figure: sampled distribution (frequency histogram over x1, x2) versus the analytic distribution p(x).]

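    A sketch of this example in Python, assuming scipy (scipy.stats.multivariate_normal.pdf standing in for mvnpdf); since Σ_q is fixed the proposal is symmetric, so the q terms cancel, but they are kept to mirror the slide's formula:

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[0.25, 0.3], [0.3, 1.0]])
    Sigma_q = np.eye(2)

    p = lambda x: multivariate_normal.pdf(x, mean=mu, cov=Sigma)        # target
    q = lambda a, b: multivariate_normal.pdf(a, mean=b, cov=Sigma_q)    # proposal density

    x = np.zeros(2)
    chain = []
    for _ in range(5000):
        x_star = rng.multivariate_normal(x, Sigma_q)  # proposal draw
        # full MH ratio; the q terms cancel here because q is symmetric
        ratio = (p(x_star) * q(x, x_star)) / (p(x) * q(x_star, x))
        if rng.uniform() < min(1.0, ratio):
            x = x_star
        chain.append(x)
    chain = np.array(chain)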

  • Example(2)

    Source: Andrieu, Christophe, et al. "An introduction to MCMC for machine learning." Machine learning 50.1-2 (2003): 5-43.

    Target:
        p(x) ∝ 0.3 exp(−0.2 x²) + 0.7 exp(−0.2 (x − 10)²)

    Proposal draw:
        x* ∼ q(x* | x^(i)) = N(x^(i), σ* = 10)

    The scale of the proposal distribution matters:
    - a small step size gives a high acceptance rate;
    - a large step size gives a low acceptance rate.
    Both extremes result in poor mixing.

    How to choose an ideal proposal distribution? It remains something of an art.
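
    A quick numpy experiment on this target (used unnormalised), showing how the step size drives the acceptance rate; the step values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    p = lambda x: 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10) ** 2)

    for step in (0.1, 10.0, 100.0):
        x, accepted = 0.0, 0
        for _ in range(10_000):
            x_star = x + step * rng.normal()
            if rng.uniform() < min(1.0, p(x_star) / p(x)):
                x, accepted = x_star, accepted + 1
        print(f"step={step:6.1f}  acceptance rate={accepted / 10_000:.2f}")

    Small steps accept almost every proposal but crawl through the space; large steps are mostly rejected; both mix poorly.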

  • The Gibbs Sampler

    Let z = (z1, …, zN).

    Aim: sample z^(i) from the joint distribution p(z).

    Pseudo-code (Gibbs)

    Init z^(1)
    for i = 1 to M do
        z1^(i+1) ∼ p(z1 | z2^(i), z3^(i), …, zN^(i))
        z2^(i+1) ∼ p(z2 | z1^(i+1), z3^(i), …, zN^(i))
        ⋮
        zN^(i+1) ∼ p(zN | z1^(i+1), z2^(i+1), …, z(N−1)^(i+1))
    end for

    Note: the Gibbs sampler is a special case of Metropolis-Hastings in which the proposal distribution is the full conditional:
        q(·) = p(z_j | z_−j),
    and the acceptance ratio is then exactly 1.

    Conjugate prior: for a given likelihood, a prior such that the prior and the posterior are in the same family of distributions:
        p(θ|y) ∝ p(y|θ) p(θ)

    How to find conjugate priors: https://en.wikipedia.org/wiki/Conjugate_prior
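
    A tiny worked example of conjugacy, assuming a Beta(a, b) prior on the success probability of Bernoulli data; the posterior is again a Beta, with the counts folded into the parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.binomial(1, 0.7, size=50)    # Bernoulli data with true rate 0.7

    a, b = 1.0, 1.0                      # Beta(1, 1) prior (uniform)
    a_post = a + y.sum()                 # posterior: Beta(a + #successes,
    b_post = b + len(y) - y.sum()        #                 b + #failures)
    print(a_post / (a_post + b_post))    # posterior mean of the rate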

  • Example

    [Figure: histogram of the dataset X (counts vs x).]

    Given a dataset X = [x_1, …, x_N], where x_i ∼ N(μ, τ) (parameterised by mean μ and precision τ).

    [Graphical model: each x_i (i = 1, …, N) depends on μ and τ; μ has hyperparameters a and τ_μ; τ has hyperparameters k and θ.]

    Steps:
    1) Model μ conditional on τ:
        μ ∼ N(a*, τ_μ*),
       where
        a* = (τ_μ a + τ Σ_{i=1}^{N} x_i) / τ_μ*   and   τ_μ* = τ_μ + N τ
    2) Model τ conditional on μ:
        τ ∼ Gamma(k*, 1/θ*)   (shape k*, rate 1/θ*),
       where
        k* = k + N/2   and   1/θ* = 1/θ + (1/2) Σ_{i=1}^{N} (x_i − μ)²
    3) Repeat 1) and 2) M times.

    [Figure: trace plots for μ and for the precision τ, and marginal posterior histograms for μ and for σ² = 1/τ.]
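
    A compact Python sketch of this Gibbs sampler, assuming numpy; the synthetic data and the hyperparameters a, τ_μ, k, θ are illustrative values, and the normal is parameterised by mean and precision as above:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(25.0, 10.0, size=1000)    # synthetic data (mean 25, sd 10)
    N = len(x)

    a, tau_mu = 0.0, 1e-3                    # prior on mu: N(a, tau_mu)
    k, theta = 1.0, 100.0                    # prior on tau: Gamma(k, scale theta)

    mu, tau = 0.0, 1.0
    trace = []
    for _ in range(5000):
        # 1) mu | tau, x
        tau_mu_star = tau_mu + N * tau
        a_star = (tau_mu * a + tau * x.sum()) / tau_mu_star
        mu = rng.normal(a_star, 1.0 / np.sqrt(tau_mu_star))
        # 2) tau | mu, x
        k_star = k + N / 2.0
        rate_star = 1.0 / theta + 0.5 * np.sum((x - mu) ** 2)
        tau = rng.gamma(k_star, 1.0 / rate_star)  # numpy uses shape/scale
        trace.append((mu, tau))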

  • Metropolis within Gibbs?

    • It is used when some of the full conditionals have a known, directly sampleable form but others do not.

    • It retains the sequential sampling steps of Gibbs.

    • It updates one or more parameters using a Metropolis-Hastings step, as in the sketch below.
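
    A schematic Python sketch, assuming a two-parameter model in which z1 | z2 can be sampled exactly but z2 | z1 cannot; sample_z1_given_z2 and log_p_z2_given_z1 are hypothetical stand-ins for a model's conditionals:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_z1_given_z2(z2):        # hypothetical: exact conditional draw (Gibbs step)
        return rng.normal(0.5 * z2, 1.0)

    def log_p_z2_given_z1(z2, z1):     # hypothetical: unnormalised log conditional
        return -0.5 * (z2 - z1) ** 2 - 0.1 * z2 ** 4

    z1, z2 = 0.0, 0.0
    for _ in range(5000):
        z1 = sample_z1_given_z2(z2)            # Gibbs step: exact conditional
        z2_star = z2 + 0.5 * rng.normal()      # MH step for the awkward conditional
        if np.log(rng.uniform()) < log_p_z2_given_z1(z2_star, z1) - log_p_z2_given_z1(z2, z1):
            z2 = z2_star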

    Burn-in and Convergence Measures

    The burn-in period: the number of initial samples to discard, in order to

    1. avoid the correlation with the arbitrary starting point among the initial successive samples;
    2. make sure the retained draws of the posterior are closer to the target posterior distribution and less dependent on the starting distribution of the chain.

    Convergence measures for MCMC:
    1) Visual analysis via trace plots
    2) Autocorrelation
    3) Geweke diagnostic
    4) Raftery-Lewis diagnostic
    5) Gelman-Rubin diagnostic
    and so on…

  • Visual analysis via trace plots

    Images taken from: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect024.htm

    [Figure: example trace plots, with the burn-in segment marked.]


  • Convergence Diagnostic Tests

    Source: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect024.htm

    CODA functions in R:
    • Gelman-Rubin: gelman.diag()
    • Geweke: geweke.diag()
    • Heidelberger & Welch: heidel.diag()
    • Raftery-Lewis: raftery.diag()
    • Autocorrelation: autocorr.plot()
    • Effective sample size: effectiveSize()

    Source: https://cran.r-project.org/web/packages/coda/coda.pdf
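
    For chains held as plain arrays, two of these quantities can be estimated directly; a rough numpy sketch (a simplified version of what coda's autocorr.plot() and effectiveSize() compute, up to estimator details):

    import numpy as np

    def autocorr(chain, max_lag=100):
        """Sample autocorrelation rho_k of a 1-D chain for lags 0..max_lag."""
        c = chain - chain.mean()
        var = np.dot(c, c) / len(c)
        return np.array([np.dot(c[:len(c) - k], c[k:]) / (len(c) * var)
                         for k in range(max_lag + 1)])

    def effective_sample_size(chain, max_lag=100):
        """ESS = N / (1 + 2 * sum of autocorrelations), truncated at the first negative rho."""
        rho = autocorr(chain, max_lag)[1:]
        neg = np.where(rho < 0)[0]
        if neg.size:
            rho = rho[:neg[0]]   # simple truncation rule
        return len(chain) / (1.0 + 2.0 * rho.sum())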


  • Thank you

    References and relevant reading materials:

    • Andrieu, Christophe, et al. "An introduction to MCMC for machine learning." Machine learning 50.1-2 (2003): 5-43.

    • Barber, David. Bayesian reasoning and machine learning. Cambridge University Press, 2012.

    • Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.

    • Gelman, Andrew, et al. Bayesian Data Analysis. Vol. 2. London: Chapman & Hall/CRC, 2014.

    • Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in practice. London: Chapman and Hall.

    • Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.

    • Nabney, Ian. NETLAB: algorithms for pattern recognition. Springer Science & Business Media, 2002.

    • MCMC using Python: https://github.com/markdregan/Bayesian-Modelling-in-Python

    • More books on Bayesian analysis: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect045.htm
