Reading Group on Sampling Methods - University of Oxford
TRANSCRIPT
-
Reading Group on Sampling Methods An implementation perspective
Tingting Zhu
-
Overview
Discussion of the following:
• Rejection Sampling
• Metropolis and Metropolis Hastings Algorithms
• Gibbs Sampler
• Convergence Diagnostic Tests
-
Monte Carlo Sampling
Aim: To sample from a target distribution p(x) from which direct sampling is difficult.
Observation: Suppose {x^(i)}, for i = 1, …, N, is an i.i.d. random sample drawn from p(x); expectations under p(x) can then be approximated by sample averages.
-
Rejection Sampling
Pseudo-code
i ← 1
while i ≤ N do
    x^(i) ∼ q(x)
    u ∼ U(0, 1)
    if u < p(x^(i)) / (M q(x^(i))) then
        accept x^(i)
        i ← i + 1
    else
        reject x^(i)
    end if
end while
Notation: target distribution p(x); proposal distribution q(x); scaled proposal (i.e., envelope) distribution Mq(x); uniform distribution on (0,1): U(0,1).
[Figure: target p(x) and scaled proposal Mq(x) on x ∈ [−10, 10], with accepted and rejected draws and the level uMq(x*) marked.]
-
Example
Target distribution:
p(x) = normpdf(x, 3, 2) + normpdf(x, −5, 1)
Proposal distribution:
q(x) = normpdf(x, 0, 4)
Steps:
1) Choose M > 1 such that Mq(x) 'covers' p(x), e.g. M = max(p(x)./q(x))
2) Draw x* ∼ N(0, 4)
3) Accept x* if u < p(x*) / (M q(x*)), where u ∼ U(0, 1)
4) Repeat 2) and 3) N times
Source: https://theclevermachine.wordpress.com/2012/09/10/rejection-sampling/
Note: normpdf(x, μ, σ) denotes the pdf of a Gaussian distribution with mean μ and standard deviation σ.
[Figure: target p(x), scaled proposal Mq(x), and accepted/rejected draws for this example.]
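The steps above can be sketched in Python. This is a minimal sketch, not the blog post's code: `normpdf` is re-implemented to mirror MATLAB's `normpdf(x, mu, sigma)`, and the grid used to choose M is an assumption (M must bound p(x)/q(x) everywhere).

```python
import numpy as np

def normpdf(x, mu, sigma):
    # Gaussian density, mirroring MATLAB's normpdf(x, mu, sigma)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p(x):  # target: mixture of two Gaussians (unnormalized is fine here)
    return normpdf(x, 3, 2) + normpdf(x, -5, 1)

def q(x):  # proposal
    return normpdf(x, 0, 4)

rng = np.random.default_rng(0)

# Step 1: choose M so that M*q(x) 'covers' p(x); a finite grid is an
# approximation of max(p(x)./q(x)) (here the ratio vanishes in the tails)
grid = np.linspace(-10, 10, 2001)
M = np.max(p(grid) / q(grid))

# Steps 2-4: draw x* from q, accept with probability p(x*) / (M q(x*))
accepted = []
while len(accepted) < 1000:
    x_star = rng.normal(0, 4)
    u = rng.uniform()
    if u < p(x_star) / (M * q(x_star)):
        accepted.append(x_star)
accepted = np.asarray(accepted)
```

Because p(x) here is a sum of two unit-mass densities, the accepted draws follow the normalized mixture with equal weights, so their mean should be near (3 + (−5))/2 = −1.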
-
[Figure: histogram of 1000 accepted samples (counts vs. x) alongside the pdf of p(x).]
Drawbacks: 1) It is hard to choose a good proposal distribution; 2) It is limited in high dimensions:
• choosing an M large enough to guarantee Mq(x) ≥ p(x) everywhere leads to a very small acceptance rate.
[Figure: target p(x), scaled proposal Mq(x), and accepted/rejected draws on x ∈ [−20, 15], illustrating a low acceptance rate.]
-
Markov Chain Monte Carlo (MCMC)
MCMC is a class of methods that simulate sample draws which are slightly dependent and approximately distributed according to a (posterior) distribution.
Markov chain: a stochastic process in which future states are independent of past states given the present state:
x^(0) → x^(1) → x^(2) → ⋯ → x^(N−1) → x^(N)
Monte Carlo: random simulation.
MCMC in Bayesian analysis:
p(θ|y) = p(y|θ) p(θ) / ∫ p(y|θ) p(θ) dθ = (1/Z) p(y|θ) p(θ), where Z = ∫ p(y|θ) p(θ) dθ
• It draws a sample from the full posterior distribution and makes inferences using the sample as a representative of the posterior distribution.
-
Metropolis & Metropolis Hastings Algorithms
Pseudo-code (Metropolis)
Init x^(1)
for i = 1 to N do
    u ∼ U(0, 1)
    x* ∼ q(x* | x^(i))
    if u < min(1, p(x*) / p(x^(i))) then
        x^(i+1) ← x*
    else
        x^(i+1) ← x^(i)
    end if
end for

Pseudo-code (Metropolis Hastings)
Init x^(1)
for i = 1 to N do
    u ∼ U(0, 1)
    x* ∼ q(x* | x^(i))
    if u < min(1, [p(x*) q(x^(i) | x*)] / [p(x^(i)) q(x* | x^(i))]) then
        x^(i+1) ← x*
    else
        x^(i+1) ← x^(i)
    end if
end for
Note: the Metropolis algorithm assumes the proposal distribution q is symmetric, i.e. q(x* | x^(i)) = q(x^(i) | x*), so that the chain satisfies the reversibility (detailed balance) property.
-
Example(1)
Source: https://theclevermachine.wordpress.com/2012/11/04/mcmc-multivariate-distributions-block-wise-component-wise-updates/
Target distribution: p(x) = mvnpdf(x, μ, Σ), where μ = [0, 0]ᵀ and Σ = [0.25 0.3; 0.3 1]
Proposal draw: x* ∼ q(x* | x^(i)) = MVN(x^(i), Σ_q), where Σ_q = [1 0; 0 1]
Acceptance probability:
min(1, [p(x*) q(x^(i) | x*)] / [p(x^(i)) q(x* | x^(i))]) = min(1, [mvnpdf(x*, μ, Σ) mvnpdf(x^(i), x*, Σ_q)] / [mvnpdf(x^(i), μ, Σ) mvnpdf(x*, x^(i), Σ_q)])
[Figure: sampled distribution (frequency over (x1, x2)) vs. analytic distribution p(x).]
-
Example(2)
Source: Andrieu, Christophe, et al. "An introduction to MCMC for machine learning." Machine learning 50.1-2 (2003): 5-43.
Target:
p(x) ∝ 0.3 e^(−0.2x²) + 0.7 e^(−0.2(x−10)²)
Proposal draw:
x* ∼ q(x* | x^(i)) = N(x^(i), σ* = 10)
The scale of the proposal distribution matters:
- a small step size gives a high acceptance rate but explores the space slowly;
- a large step size gives a low acceptance rate.
Both result in poor mixing.
How to choose an ideal proposal distribution? It is largely an artistic choice.
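The bimodal target of Example (2) can be sampled with a minimal Metropolis sketch: the Gaussian proposal is symmetric, so the Hastings correction cancels. The sample count, seed, and function names below are illustrative choices, not from the paper.

```python
import numpy as np

def p(x):
    # Unnormalized bimodal target from Andrieu et al. (2003)
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10) ** 2)

def metropolis(n_samples, sigma=10.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_star = rng.normal(x, sigma)  # symmetric proposal: q ratio cancels
        if rng.uniform() < min(1.0, p(x_star) / p(x)):
            x = x_star                 # accept the move
        samples[i] = x                 # otherwise keep the current state
    return samples

samples = metropolis(50_000)
```

With step size σ* = 10 the chain jumps between the two modes (near 0 and 10); roughly 70% of the mass sits in the mode at 10, matching the mixture weights.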
-
The Gibbs Sampler
Let z = (z_1, …, z_N).
Aim: sample z^(i) from the joint distribution p(z).
Pseudo-code (Gibbs)
Init z^(1)
for i = 1 to M do
    z_1^(i+1) ∼ p(z_1 | z_2^(i), z_3^(i), …, z_N^(i))
    z_2^(i+1) ∼ p(z_2 | z_1^(i+1), z_3^(i), …, z_N^(i))
    ⋮
    z_N^(i+1) ∼ p(z_N | z_1^(i+1), z_2^(i+1), …, z_{N−1}^(i+1))
end for
Note: the Gibbs sampler is a special case of Metropolis Hastings in which the proposal distribution is the full conditional distribution:
q(·) = p(z_j | z_−j)
With this proposal the acceptance ratio is always 1, so every proposal is accepted.
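The acceptance-ratio claim can be checked directly: a Gibbs move changes only component j and proposes from the full conditional, so the Metropolis Hastings ratio collapses to 1:

```latex
\begin{align*}
z^* &= (z_j^*,\, z_{-j}^{(i)}) \quad\text{(only component } j \text{ changes)} \\
\frac{p(z^*)\, q(z^{(i)} \mid z^*)}{p(z^{(i)})\, q(z^* \mid z^{(i)})}
&= \frac{p(z_j^* \mid z_{-j}^{(i)})\, p(z_{-j}^{(i)})\, p(z_j^{(i)} \mid z_{-j}^{(i)})}
       {p(z_j^{(i)} \mid z_{-j}^{(i)})\, p(z_{-j}^{(i)})\, p(z_j^* \mid z_{-j}^{(i)})} = 1
\end{align*}
```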
Conjugate prior: for a given likelihood, the prior and the posterior are in the same family of distributions:
p(θ|y) ∝ p(y|θ) p(θ)
How to find conjugate priors? See https://en.wikipedia.org/wiki/Conjugate_prior
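A standard illustration of conjugacy (a Beta prior with a Bernoulli likelihood; this example is not from the slides):

```latex
\begin{align*}
\theta &\sim \mathrm{Beta}(\alpha, \beta), \quad
y_1,\dots,y_N \mid \theta \sim \mathrm{Bernoulli}(\theta), \quad
s = \textstyle\sum_{i=1}^{N} y_i \\
p(\theta \mid y) &\propto \theta^{s}(1-\theta)^{N-s}\,
\theta^{\alpha-1}(1-\theta)^{\beta-1}
= \theta^{\alpha+s-1}(1-\theta)^{\beta+N-s-1} \\
\Rightarrow\ \theta \mid y &\sim \mathrm{Beta}(\alpha+s,\ \beta+N-s)
\end{align*}
```

The posterior is again a Beta distribution, which is exactly what makes the corresponding Gibbs update easy to sample.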
-
Example
[Figure: histogram (counts vs. x) of the dataset X.]
Given a dataset X = [x_1, …, x_N], where x_i ∼ N(μ, τ) and τ is the precision (N(·,·) is parameterized by mean and precision here).
[Graphical model: each x_i, i = 1, …, N, depends on μ and τ; μ has hyperparameters a and τ_μ, and τ has hyperparameters k and θ.]
Steps:
1) Model μ conditional on τ:
μ ∼ N(a*, τ_μ*), where a* = (τ_μ a + τ Σ_{i=1}^{N} x_i) / τ_μ* and τ_μ* = τ_μ + Nτ
2) Model τ conditional on μ:
τ ∼ Gamma(k*, 1/θ*) (shape k*, rate 1/θ*), where k* = k + N/2 and 1/θ* = 1/θ + Σ_{i=1}^{N} (x_i − μ)² / 2
3) Repeat 1) and 2) M times.
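The two conditional updates above can be sketched as follows. The synthetic data, hyperparameter values, initial state, and iteration count are illustrative assumptions; note that NumPy's `gamma` takes a scale parameter, so we pass θ* = 1/(rate).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with known mean and precision (illustrative values)
true_mu, true_tau = 25.0, 0.01            # precision tau = 1/variance
N = 500
x = rng.normal(true_mu, 1 / np.sqrt(true_tau), N)

# Assumed weak hyperpriors: mu ~ N(a, tau_mu), tau ~ Gamma(k, 1/theta)
a, tau_mu, k, theta = 0.0, 1e-4, 1e-3, 1e3

M = 5000
mu, tau = 0.0, 1.0                        # initial state of the chain
mu_draws, tau_draws = np.empty(M), np.empty(M)
for i in range(M):
    # 1) mu | tau ~ N(a*, tau_mu*)  (precision parameterization)
    tau_mu_star = tau_mu + N * tau
    a_star = (tau_mu * a + tau * x.sum()) / tau_mu_star
    mu = rng.normal(a_star, 1 / np.sqrt(tau_mu_star))
    # 2) tau | mu ~ Gamma(k*, rate = 1/theta*)
    k_star = k + N / 2
    rate_star = 1 / theta + np.sum((x - mu) ** 2) / 2
    tau = rng.gamma(k_star, 1 / rate_star)   # numpy gamma takes scale
    mu_draws[i], tau_draws[i] = mu, tau
```

After discarding a burn-in, the draws of μ and τ should concentrate around the values used to generate the data, which is what the trace plots and marginal histograms on the slide display.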
[Figure: trace plots for μ and the precision τ, and marginal posterior histograms for μ and σ².]
-
Metropolis within Gibbs?
• This is used when some of the full conditionals have a known form but others do not.
• It retains the sequential sampling steps of Gibbs.
• It updates one or more parameters using a Metropolis Hastings step.
Burn-in and convergence measures
The burn-in period: the number of initial samples discarded in order to
1. avoid correlation among the initial successive samples;
2. ensure that the draws from the posterior are closer to the target posterior distribution and less dependent on the starting point of the chain.
Convergence measures for MCMC:
1) Visual analysis via trace plots
2) Autocorrelation
3) Geweke diagnostic
4) Raftery-Lewis diagnostic
5) Gelman-Rubin diagnostic
and so on…
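The Geweke diagnostic in the list above compares the mean of an early segment of the chain with the mean of a late segment. A simplified sketch (the function name `geweke_z` is ours, and it uses naive variance estimates rather than the spectral-density estimates of the full diagnostic):

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke diagnostic: z-score comparing the mean of the
    first `first` fraction of the chain with the mean of the last `last`
    fraction. |z| well above 2 suggests the chain has not converged.
    (The real diagnostic corrects the variances for autocorrelation.)"""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    )

rng = np.random.default_rng(1)
stationary = rng.normal(0, 1, 10_000)               # a well-mixed "chain"
drifting = stationary + np.linspace(3, 0, 10_000)   # still approaching the target
```

On the stationary chain the z-score stays small, while the drifting chain (whose early segment sits far from the target) produces a large |z|, flagging the need for a longer burn-in.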
-
Visual analysis via trace plots
Images taken from: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect024.htm
[Figures: trace plots showing a burn-in phase followed by apparently stationary sampling.]
-
Convergence Diagnostic Tests
Source: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect024.htm
CODA functions in R:
• Gelman-Rubin: gelman.diag()
• Geweke: geweke.diag()
• Heidelberg & Welch: heidel.diag()
• Raftery-Lewis: raftery.diag()
• Autocorrelation: autocorr.plot()
• Effective sample size: effectiveSize()
Source: https://cran.r-project.org/web/packages/coda/coda.pdf
-
Thank you
References and relevant reading materials:
• Andrieu, Christophe, et al. "An introduction to MCMC for machine learning." Machine learning 50.1-2 (2003): 5-43.
• Barber, David. Bayesian reasoning and machine learning. Cambridge University Press, 2012.
• Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.
• Gelman, Andrew, et al. Bayesian Data Analysis. Vol. 2. London: Chapman & Hall/CRC, 2014.
• Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in practice. London: Chapman and Hall.
• Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
• Nabney, Ian. NETLAB: algorithms for pattern recognition. Springer Science & Business Media, 2002.
• MCMC using Python: https://github.com/markdregan/Bayesian-Modelling-in-Python
• More books on Bayesian analysis: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_introbayes_sect045.htm