Reading Group on Sampling Methods
An implementation perspective
Tingting Zhu, University of Oxford



  • Overview

    Discussion of the following:

    • Rejection Sampling

    • Metropolis and Metropolis-Hastings Algorithms

    • Gibbs Sampler

    • Convergence Diagnostic Tests

  • Monte Carlo Sampling

    Aim: to draw samples from a target distribution p(x) that is difficult to sample from directly.

    Observation: suppose {x^(i)}, i = 1, …, N, is an i.i.d. random sample drawn from p(x); then expectations under p(x) can be approximated by sample averages, E_p[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(x^(i)).
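
    As a quick illustration, here is a minimal Python sketch (assuming numpy, which the later sketches also use) of this observation, estimating two expectations of a standard normal by sample averages:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # i.i.d. draws from p(x) = N(0, 1)

    # Monte Carlo estimates: sample averages approximate expectations under p(x)
    print(x.mean())          # E[x]     ~ 0
    print((x > 1.0).mean())  # P(x > 1) ~ 0.159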

  • Rejection Sampling

    Pseudo-code

    i ← 1
    while i ≤ N do
        x^(i) ∼ q(x)
        u ∼ U(0, 1)
        if u < p(x^(i)) / (M q(x^(i))) then
            accept x^(i)
            i ← i + 1
        else
            reject x^(i)
        end if
    end while

    Target distribution: p(x)
    Proposal distribution: q(x)
    Scaled proposal (i.e., envelope) distribution: M q(x)
    Uniform distribution on (0, 1): U(0, 1)

    [Figure: target p(x), scaled proposal M q(x), and accepted/rejected draws u·M q(x*) plotted against x.]

  • Example

    Target distribution:
        p(x) = normpdf(x, 3, 2) + normpdf(x, −5, 1)
    Proposal distribution:
        q(x) = normpdf(x, 0, 4)

    Steps:
    1) Choose M > 1 such that M·q(x) 'covers' p(x), i.e. M = max(p(x) ./ q(x))
    2) Draw x* ∼ N(0, 4)
    3) Accept x* if u < p(x*) / (M q(x*)), where u ∼ U(0, 1)
    4) Repeat (2) and (3) N times

    Source: https://theclevermachine.wordpress.com/2012/09/10/rejection-sampling/
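
    A minimal Python sketch of this example, assuming scipy is available (scipy.stats.norm.pdf plays the role of MATLAB's normpdf); estimating M on a grid is an illustrative shortcut:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    p = lambda x: norm.pdf(x, 3, 2) + norm.pdf(x, -5, 1)  # target (unnormalised is fine)
    q = lambda x: norm.pdf(x, 0, 4)                       # proposal

    # 1) choose M so that M*q(x) envelopes p(x) (estimated on a grid here)
    grid = np.linspace(-10, 10, 2001)
    M = np.max(p(grid) / q(grid))

    samples = []
    while len(samples) < 1000:
        x_star = rng.normal(0, 4)            # 2) draw from the proposal
        u = rng.uniform()
        if u < p(x_star) / (M * q(x_star)):  # 3) accept with prob p/(M q)
            samples.append(x_star)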


  • [Figure: histogram of 1000 accepted samples (counts vs x), compared with the pdf of p(x).]

    Drawbacks:
    1) It is hard to choose a good proposal distribution.
    2) It breaks down in high dimensions: choosing M large enough to guarantee that M q(x) bounds p(x) everywhere leads to a very small acceptance rate.

    [Figure: target p(x), scaled proposal M q(x), and accepted/rejected draws, illustrating the low acceptance rate.]

  • Markov Chain Monte Carlo (MCMC)

    MCMC is a class of methods for simulating draws that are slightly dependent and approximately distributed according to a target (posterior) distribution.

    Markov chain: a stochastic process in which future states are independent of past states given the present state:
        x^(0) → x^(1) → x^(2) → ⋯ → x^(N−1) → x^(N)
    Monte Carlo: random simulation.

    MCMC in Bayesian analysis:
        p(θ|y) = p(y|θ) p(θ) / ∫ p(y|θ) p(θ) dθ = (1/Z) p(y|θ) p(θ),
    where the normalising constant Z = ∫ p(y|θ) p(θ) dθ is typically intractable.

    • MCMC draws samples from the full posterior distribution and makes inferences using those samples as representatives of the posterior.

  • Metropolis & Metropolis Hastings Algorithms

    Pseudo-code (Metropolis)

    Init x^(1)
    for i = 1 to N do
        u ∼ U(0, 1)
        x* ∼ q(x* | x^(i))
        if u < min(1, p(x*) / p(x^(i))) then
            x^(i+1) ← x*
        else
            x^(i+1) ← x^(i)
        end if
    end for

    Pseudo-code (MH)

    Init x^(1)
    for i = 1 to N do
        u ∼ U(0, 1)
        x* ∼ q(x* | x^(i))
        if u < min(1, [p(x*) q(x^(i) | x*)] / [p(x^(i)) q(x* | x^(i))]) then
            x^(i+1) ← x*
        else
            x^(i+1) ← x^(i)
        end if
    end for

    Note: the Metropolis algorithm assumes the proposal distribution q is symmetric, i.e. q(x* | x) = q(x | x*), so that the simplified acceptance ratio still fulfils the reversibility property of the Markov chain.
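
    A minimal random-walk Metropolis-Hastings sketch in Python, assuming numpy; log_p is any unnormalised log target, and because the Gaussian proposal is symmetric the q-ratio cancels, reducing MH to the Metropolis rule:

    import numpy as np

    def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
        """Random-walk MH: propose x* ~ N(x, step^2), accept with prob min(1, p(x*)/p(x))."""
        rng = np.random.default_rng(seed)
        x = x0
        chain = np.empty(n_samples)
        for i in range(n_samples):
            x_star = x + step * rng.normal()
            # symmetric proposal: acceptance ratio reduces to p(x*)/p(x)
            if np.log(rng.uniform()) < log_p(x_star) - log_p(x):
                x = x_star
            chain[i] = x
        return chain

    Working with log densities avoids numerical underflow when p(x) is very small.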

  • Example(1)

    Source: https://theclevermachine.wordpress.com/2012/11/04/mcmc-multivariate-distributions-block-wise-component-wise-updates/

    Target distribution: p(x) = mvnpdf(x, μ, Σ), where μ = [0; 0] and Σ = [0.25 0.3; 0.3 1]

    Proposal draw: x* ∼ q(x* | x^(i)) = MVN(x^(i), Σ_q), where Σ_q = [1 0; 0 1]

    Acceptance probability:
        min(1, [p(x*) q(x^(i) | x*)] / [p(x^(i)) q(x* | x^(i))])
        = min(1, [mvnpdf(x*, μ, Σ) · mvnpdf(x^(i), x*, Σ_q)] / [mvnpdf(x^(i), μ, Σ) · mvnpdf(x*, x^(i), Σ_q)])

    [Figure: sampled distribution (frequency histogram over x1, x2) versus the analytic distribution p(x).]

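    A sketch of this example in Python, assuming scipy (scipy.stats.multivariate_normal.pdf standing in for mvnpdf); since Σ_q is fixed the proposal is symmetric, so the q terms cancel, but they are kept to mirror the slide's formula:

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[0.25, 0.3], [0.3, 1.0]])
    Sigma_q = np.eye(2)

    p = lambda x: multivariate_normal.pdf(x, mean=mu, cov=Sigma)        # target
    q = lambda a, b: multivariate_normal.pdf(a, mean=b, cov=Sigma_q)    # proposal density

    x = np.zeros(2)
    chain = []
    for _ in range(5000):
        x_star = rng.multivariate_normal(x, Sigma_q)  # proposal draw
        # full MH ratio; the q terms cancel here because q is symmetric
        ratio = (p(x_star) * q(x, x_star)) / (p(x) * q(x_star, x))
        if rng.uniform() < min(1.0, ratio):
            x = x_star
        chain.append(x)
    chain = np.array(chain)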

  • Example(2)

    Source: Andrieu, Christophe, et al. "An introduction to MCMC for machine learning." Machine learning 50.1-2 (2003): 5-43.

    Target:
        p(x) ∝ 0.3 exp(−0.2 x²) + 0.7 exp(−0.2 (x − 10)²)

    Proposal draw:
        x* ∼ q(x* | x^(i)) = N(x^(i), σ* = 10)

    The scale of the proposal distribution matters:
    - a small step size gives a high acceptance rate;
    - a large step size gives a low acceptance rate.
    Both extremes result in poor mixing.

    How to choose an ideal proposal distribution? It remains something of an art.
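
    A quick numpy experiment on this target (used unnormalised), showing how the step size drives the acceptance rate; the step values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    p = lambda x: 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10) ** 2)

    for step in (0.1, 10.0, 100.0):
        x, accepted = 0.0, 0
        for _ in range(10_000):
            x_star = x + step * rng.normal()
            if rng.uniform() < min(1.0, p(x_star) / p(x)):
                x, accepted = x_star, accepted + 1
        print(f"step={step:6.1f}  acceptance rate={accepted / 10_000:.2f}")

    Small steps accept almost every proposal but crawl through the space; large steps are mostly rejected; both mix poorly.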

  • The Gibbs Sampler

    Let z = (z1, …, zN).

    Aim: sample z^(i) from the joint distribution p(z).

    Pseudo-code (Gibbs)

    Init z^(1)
    for i = 1 to M do
        z1^(i+1) ∼ p(z1 | z2^(i), z3^(i), …, zN^(i))
        z2^(i+1) ∼ p(z2 | z1^(i+1), z3^(i), …, zN^(i))
        ⋮
        zN^(i+1) ∼ p(zN | z1^(i+1), z2^(i+1), …, z(N−1)^(i+1))
    end for

    Note: the Gibbs sampler is a special case of Metropolis-Hastings in which the proposal distribution is the full conditional:
        q(·) = p(z_j | z_−j),
    and the acceptance ratio is then exactly 1.

    Conjugate prior: for a given likelihood, a prior such that the prior and the posterior are in the same family of distributions:
        p(θ|y) ∝ p(y|θ) p(θ)

    How to find conjugate priors: https://en.wikipedia.org/wiki/Conjugate_prior
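
    A tiny worked example of conjugacy, assuming a Beta(a, b) prior on the success probability of Bernoulli data; the posterior is again a Beta, with the counts folded into the parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.binomial(1, 0.7, size=50)    # Bernoulli data with true rate 0.7

    a, b = 1.0, 1.0                      # Beta(1, 1) prior (uniform)
    a_post = a + y.sum()                 # posterior: Beta(a + #successes,
    b_post = b + len(y) - y.sum()        #                 b + #failures)
    print(a_post / (a_post + b_post))    # posterior mean of the rate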

  • Example

    [Figure: histogram of the dataset X (counts vs x).]

    Given a dataset X = [x_1, …, x_N], where x_i ∼ N(μ, τ) (parameterised by mean μ and precision τ).

    [Graphical model: each x_i (i = 1, …, N) depends on μ and τ; μ has hyperparameters a and τ_μ; τ has hyperparameters k and θ.]

    Steps:
    1) Model μ conditional on τ:
        μ ∼ N(a*, τ_μ*),
       where
        a* = (τ_μ a + τ Σ_{i=1}^{N} x_i) / τ_μ*   and   τ_μ* = τ_μ + N τ
    2) Model τ conditional on μ:
        τ ∼ Gamma(k*, 1/θ*)   (shape k*, rate 1/θ*),
       where
        k* = k + N/2   and   1/θ* = 1/θ + (1/2) Σ_{i=1}^{N} (x_i − μ)²
    3) Repeat 1) and 2) M times.

    [Figure: trace plots for μ and for the precision τ, and marginal posterior histograms for μ and for σ² = 1/τ.]
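
    A compact Python sketch of this Gibbs sampler, assuming numpy; the synthetic data and the hyperparameters a, τ_μ, k, θ are illustrative values, and the normal is parameterised by mean and precision as above:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(25.0, 10.0, size=1000)    # synthetic data (mean 25, sd 10)
    N = len(x)

    a, tau_mu = 0.0, 1e-3                    # prior on mu: N(a, tau_mu)
    k, theta = 1.0, 100.0                    # prior on tau: Gamma(k, scale theta)

    mu, tau = 0.0, 1.0
    trace = []
    for _ in range(5000):
        # 1) mu | tau, x
        tau_mu_star = tau_mu + N * tau
        a_star = (tau_mu * a + tau * x.sum()) / tau_mu_star
        mu = rng.normal(a_star, 1.0 / np.sqrt(tau_mu_star))
        # 2) tau | mu, x
        k_star = k + N / 2.0
        rate_star = 1.0 / theta + 0.5 * np.sum((x - mu) ** 2)
        tau = rng.gamma(k_star, 1.0 / rate_star)  # numpy uses shape/scale
        trace.append((mu, tau))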

  • Metropolis within Gibbs?

    • It is used when some of the full conditionals have a known, directly sampleable form but others do not.

    • It retains the sequential sampling steps of Gibbs.

    • It updates one or more parameters using a Metropolis-Hastings step, as in the sketch below.
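
    A schematic Python sketch, assuming a two-parameter model in which z1 | z2 can be sampled exactly but z2 | z1 cannot; sample_z1_given_z2 and log_p_z2_given_z1 are hypothetical stand-ins for a model's conditionals:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_z1_given_z2(z2):        # hypothetical: exact conditional draw (Gibbs step)
        return rng.normal(0.5 * z2, 1.0)

    def log_p_z2_given_z1(z2, z1):     # hypothetical: unnormalised log conditional
        return -0.5 * (z2 - z1) ** 2 - 0.1 * z2 ** 4

    z1, z2 = 0.0, 0.0
    for _ in range(5000):
        z1 = sample_z1_given_z2(z2)            # Gibbs step: exact conditional
        z2_star = z2 + 0.5 * rng.normal()      # MH step for the awkward conditional
        if np.log(rng.uniform()) < log_p_z2_given_z1(z2_star, z1) - log_p_z2_given_z1(z2, z1):
            z2 = z2_star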

    Burn-in and Convergence Measures

    The burn-in period: the number of initial samples to discard, in order to

    1. avoid the correlation with the arbitrary starting point among the initial successive samples;
    2. make sure the retained draws of the posterior are closer to the target posterior distribution and less dependent on the starting distribution of the chain.

    Convergence measures for MCMC:
    1) Visual analysis via trace plots
    2) Autocorrelation
    3) Geweke diagnostic
    4) Raftery-Lewis diagnostic
    5) Gelman-Rubin diagnostic
    and so on…

  • Visual analysis via trace plots

    Images taken from: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect024.htm

    [Figure: example trace plots, with the burn-in segment marked.]


  • Convergence Diagnostic Tests

    Source: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect024.htm

    CODA functions in R:
    • Gelman-Rubin: gelman.diag()
    • Geweke: geweke.diag()
    • Heidelberger & Welch: heidel.diag()
    • Raftery-Lewis: raftery.diag()
    • Autocorrelation: autocorr.plot()
    • Effective sample size: effectiveSize()

    Source: https://cran.r-project.org/web/packages/coda/coda.pdf
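
    For chains held as plain arrays, two of these quantities can be estimated directly; a rough numpy sketch (a simplified version of what coda's autocorr.plot() and effectiveSize() compute, up to estimator details):

    import numpy as np

    def autocorr(chain, max_lag=100):
        """Sample autocorrelation rho_k of a 1-D chain for lags 0..max_lag."""
        c = chain - chain.mean()
        var = np.dot(c, c) / len(c)
        return np.array([np.dot(c[:len(c) - k], c[k:]) / (len(c) * var)
                         for k in range(max_lag + 1)])

    def effective_sample_size(chain, max_lag=100):
        """ESS = N / (1 + 2 * sum of autocorrelations), truncated at the first negative rho."""
        rho = autocorr(chain, max_lag)[1:]
        neg = np.where(rho < 0)[0]
        if neg.size:
            rho = rho[:neg[0]]   # simple truncation rule
        return len(chain) / (1.0 + 2.0 * rho.sum())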


  • Thank you

    References and relevant reading materials:

    • Andrieu, Christophe, et al. "An introduction to MCMC for machine learning." Machine learning 50.1-2 (2003): 5-43.

    • Barber, David. Bayesian reasoning and machine learning. Cambridge University Press, 2012.

    • Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.

    • Gelman, Andrew, et al. Bayesian Data Analysis. Vol. 2. London: Chapman & Hall/CRC, 2014.

    • Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in practice. London: Chapman and Hall.

    • Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.

    • Nabney, Ian. NETLAB: algorithms for pattern recognition. Springer Science & Business Media, 2002.

    • MCMC using Python: https://github.com/markdregan/Bayesian-Modelling-in-Python

    • More books on Bayesian analysis: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect045.htm
