Bayes Factors, posterior predictives, short intro to RJMCMC
© Dave Campbell 2016
Thermodynamic Integration
Bayesian Statistical Inference
P(θ∣Y) ∝ P(Y∣θ)π(θ)
Once you have posterior samples you can compute the predictive distribution of future observations:

P(Ynew∣Yold) = ∫ P(Ynew∣θ) P(θ∣Yold) dθ
To do this you sample a θ* from P(θ∣Y) (sample 1 value from your collection of posterior samples).

Generate simulated data from the likelihood: P(Ynew∣θ*).

Repeat for a large sample of θ* from P(θ∣Y) to get at the posterior predictive distribution.
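To make the loop concrete, here is a minimal R sketch for a toy conjugate model (y_i ~ N(θ, 1) with a θ ~ N(0, 1) prior); the model, the sample sizes, and the use of exact conjugate draws in place of stored MCMC samples are illustrative assumptions, not from the slides.

```r
## Toy posterior predictive sampler: one y_new per posterior draw theta*.
set.seed(1)
n <- 20
y <- rnorm(n, mean = 0.7, sd = 1)                 # the "old" data

## Stand-in for MCMC output: under the N(0, 1) prior the posterior is
## theta | y ~ N(sum(y)/(n + 1), 1/(n + 1)), so we can draw from it exactly.
theta_star <- rnorm(5000, mean = sum(y) / (n + 1), sd = sqrt(1 / (n + 1)))

## Generate one simulated observation from the likelihood for each theta*.
y_new <- rnorm(length(theta_star), mean = theta_star, sd = 1)

hist(y_new, main = "Posterior predictive distribution", xlab = "y_new")
```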
Posterior predictive distribution: no need to use asymptotic normal assumptions or a single point and variance estimate for θ*. A posterior P(θ∣Y) of any shape naturally feeds its entire distribution through to the data-generating process!
Obtaining P(Ynew∣Yold) is related to obtaining a set of fake data samples for the parametric bootstrap, except that the distributional assumption on the parameters doesn't require asymptotic arguments.
Uses:

Another diagnostic tool: obtain a sample from P(Ynew∣Yold) and see if it is similar to the observed data.

Use the posterior predictive distribution for sequential experimental design: choose the new covariate points that optimize some criterion.
Hypothesis testing; model comparison
Ultimately we want inference on P(M∣Y), but computing the marginal likelihood is difficult:

P(M∣Y) = [ π(M) ∫_Θ P(Y∣θ_M) π(θ_M) dθ_M ] / P(Y)
Usually Bayesians make model decisions through Bayes factors:

B12(y) = w1(y) / w2(y), where for each model w(y) = ∫_Θ π(θ) f(y∣θ) dθ
Bayes Factor interpretation

B12(y) = w1(y) / w2(y), where w(y) = ∫_Θ π(θ) f(y∣θ) dθ
The odds ratio for two models:

posterior odds = Bayes factor × prior odds

Uniform prior odds across models implies that

posterior odds = Bayes factor

(e.g., with uniform prior odds of 1 and a Bayes factor of 20, the posterior odds are 20:1 in favour of model 1)
posterior odds = Bayes factor × prior odds

So the Bayes factor is the amount of evidence for one model compared to another.

BF = the change in odds when moving from the prior to the posterior
Recall:

P(θ∣Y) = P(y∣θ)P(θ) / P(y), where P(Y) = ∫ P(y∣θ)P(θ) dθ
Newton & Raftery (1994)
P(θ∣Y) = P(y∣θ)P(θ) / P(y)

Rearranging: P(Y) P(θ∣Y) / P(y∣θ) = P(θ)

Integrating both sides: P(Y) ∫ [ P(θ∣Y) / P(y∣θ) ] dθ = ∫ P(θ) dθ = 1
Newton & Raftery (1994)
P(Y) ∫ [ P(θ∣Y) / P(y∣θ) ] dθ = 1

E_{P(θ∣Y)}[ 1 / P(y∣θ) ] = 1 / P(Y)

And they estimated P(Y) by

P̂(Y) = [ (1/N) Σ_{i=1}^N 1 / P(y∣θ_i) ]^{-1}
Newton & Raftery (1994)
Compute this by calculating the likelihood for each value of θ_i that was obtained from the posterior sampling step:

P̂(Y) = [ (1/N) Σ_{i=1}^N 1 / P(y∣θ_i) ]^{-1}
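A minimal R sketch of this estimator on the same toy conjugate model used in the posterior predictive sketch above; the log-sum-exp step is a standard numerical-stability trick added here, not part of the slides.

```r
## Harmonic mean estimate of P(Y) from posterior draws of theta.
set.seed(1)
n <- 20
y <- rnorm(n, mean = 0.7, sd = 1)

## Stand-in for MCMC output: exact posterior draws under the N(0, 1) prior.
N     <- 10000
theta <- rnorm(N, mean = sum(y) / (n + 1), sd = sqrt(1 / (n + 1)))

## Log-likelihood at each posterior draw.
log_lik <- sapply(theta, function(t) sum(dnorm(y, mean = t, sd = 1, log = TRUE)))

## log P_hat(Y) = -( logsumexp(-log_lik) - log(N) ), computed stably.
m <- max(-log_lik)
log_pY_hat <- -(m + log(mean(exp(-log_lik - m))))
log_pY_hat
```

Re-running with different seeds illustrates the sensitivity discussed on the next slide: the average is dominated by the occasional draw with a very small likelihood.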
Newton & Raftery (1994)
The harmonic mean estimator

P̂(Y) = [ (1/N) Σ_{i=1}^N 1 / P(y∣θ_i) ]^{-1}

is extremely sensitive to outliers: draws with very small values of P(y∣θ) dominate the sum. But it is asymptotically unbiased.
Calderhead and Girolami (2009) showed that the harmonic mean estimator can be massively biased for finite samples.
Thermodynamic Integration

Friel, N. and Pettitt, A. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B 70(3).

Calderhead, B. and Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis 53.
In parallel tempering we sample from

P_m(θ∣Y) = P(y∣θ)^{β_m} P(θ) / P_m(y)

But we can get the marginal likelihood via:

log p(Y) = ∫_0^1 { ∫ log[ p(Y∣θ) ] P_m(θ∣Y) dθ } dβ = ∫_0^1 E_{P_m(θ∣Y)}{ log p(Y∣θ) } dβ
Compute via 1-dimensional quadrature over the temperature!

log p(Y) = ∫_0^1 E_{P_m(θ∣Y)}{ log p(Y∣θ) } dβ ≈ Σ_m (1/2)(β_m − β_{m−1})(E_m + E_{m−1})

where E_m = E_{P_m(θ∣Y)}{ log p(Y∣θ) }
To compute log marginal likelihoods, all we need is to define a good grid of temperatures for

log p(Y) ≈ Σ_m (1/2)(β_m − β_{m−1})(E_m + E_{m−1}), E_m = E_{P_m(θ∣Y)}{ log p(Y∣θ) }

Calderhead and Girolami (2009) suggest β = (seq(from = 1, to = N)/N)^5
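Here is a self-contained R sketch of the whole recipe on a toy conjugate model (y_i ~ N(θ, 1), θ ~ N(0, 1)); because the power posterior at each temperature is available in closed form, the trapezoidal estimate can be checked against the exact log marginal likelihood. In the Galaxy demo the E_m would instead be the within-chain means of the log-likelihood from the parallel tempering run; everything named below is an assumption of this sketch.

```r
## Thermodynamic integration on a toy conjugate model.
set.seed(1)
n <- 20
y <- rnorm(n, mean = 0.7, sd = 1)

N    <- 30                               # number of temperatures / chains
beta <- (seq(from = 1, to = N) / N)^5    # Calderhead & Girolami (2009) grid

## E_m = E[ log p(Y | theta) ] under the power posterior at beta[m]; here that
## posterior is N(b*sum(y)/(b*n + 1), 1/(b*n + 1)), so we sample it directly
## instead of running a tempered MCMC chain.
Em <- sapply(beta, function(b) {
  theta <- rnorm(5000, mean = b * sum(y) / (b * n + 1),
                 sd = sqrt(1 / (b * n + 1)))
  mean(sapply(theta, function(t) sum(dnorm(y, mean = t, sd = 1, log = TRUE))))
})

## Trapezoidal rule over temperature; the first grid point (1/N)^5 is
## effectively 0, so the missing [0, beta_1] sliver is negligible.
log_pY_TI <- sum(0.5 * diff(beta) * (Em[-1] + Em[-N]))

## Exact log marginal for this conjugate model, for comparison.
log_pY_exact <- -n/2 * log(2 * pi) - 0.5 * log(n + 1) -
  0.5 * (sum(y^2) - sum(y)^2 / (n + 1))

c(TI = log_pY_TI, exact = log_pY_exact)
```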
Parallel Tempering to the Extreme!

RStudio plots for the Galaxy data set (3 groups): density of one of the mean parameters vs temperature.
Parallel tempering densities: that dip just before temperature β = 1 is real. It is caused by the introduction of new modes.
Compare the 3-group Galaxy model to the 6-group Galaxy model. Show plots of mean density vs temperature: 25,000 iterations with 30 parallel chains.

B12(y) = w1(y) / w2(y)
Now, back to RStudio to compare the Galaxy data with k=3 groups vs k=6 groups.
(the result: there is decisive evidence that the k=3 group model is better)
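In code the comparison is just a difference of the two log marginal likelihoods returned by thermodynamic integration; the values below are placeholders, not the actual Galaxy results from the demo.

```r
## Hypothetical TI outputs for the two models (placeholders, NOT real results).
log_pY_k3 <- -228.5
log_pY_k6 <- -240.1

log_BF_12 <- log_pY_k3 - log_pY_k6   # log B12(y) = log w1(y) - log w2(y)
exp(log_BF_12)                       # BF > 100 counts as "decisive" on Jeffreys' scale
```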
Alternative to Bayes Factors: RJMCMC
MODEL POSTERIOR PROBABILITY

Likelihood: P(Y∣θ_j, M_j)
Parameter prior: P(θ_j∣M_j)
Model prior: P(M_j∣Ω) for M_j ∈ Ω

The marginal posterior probability of a model is helpful when the answer is not clear:

P(M_j∣Y, Ω) = [ ∫ P(Y∣θ_j, M_j, Ω) P(θ_j∣M_j, Ω) P(M_j∣Ω) dθ_j ] / P(Y) = ∫ P(θ_j, M_j∣Y, Ω) dθ_j
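When Ω is small enough to enumerate, this formula can be applied directly: compute each model's marginal likelihood, weight by its model prior, and normalize. Below is a self-contained R sketch for a toy two-model Ω (the same y_i ~ N(θ, 1) likelihood under two different priors, chosen so the marginals are exact); the models and names are illustrative, not the Galaxy example.

```r
## P(M_j | Y, Omega) for a two-model Omega with exact marginal likelihoods.
set.seed(1)
n <- 20
y <- rnorm(n, mean = 0.7, sd = 1)

## log p(Y | M) when theta ~ N(0, v): marginally Y ~ N(0, I + v * 1 1^T).
log_marg <- function(v) {
  -n/2 * log(2 * pi) - 0.5 * log(1 + n * v) -
    0.5 * (sum(y^2) - v * sum(y)^2 / (1 + n * v))
}

log_w <- c(M1 = log_marg(1), M2 = log_marg(100))   # two competing priors
prior <- c(M1 = 0.5, M2 = 0.5)                     # uniform model prior

post <- exp(log_w - max(log_w)) * prior            # subtract max for stability
post / sum(post)                                   # P(M_j | Y, Omega)
```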
Our goal is to get P(M_j∣Y, Ω) = ∫ P(θ_j, M_j∣Y, Ω) dθ_j in a single MCMC chain, even if Ω contains a lot of models. We need simulation methods that sample across models.
REVERSIBLE JUMP MCMC

We can avoid extensive MCMC for each model and instead sample from P(M_j∣Y, Ω) directly! We just adjust MCMC so at each iteration we:

1. Sample j, i.e. choose a model M_new
2. Then propose a θ_new from M_new
3. Keep M_new and θ_new with probability

α = min( [ P(Y∣θ_new, M_new) P(θ_new, M_new) P_new(v_new) ] / [ P(Y∣θ_old, M_old) P(θ_old, M_old) P_old(v_old) ] × |J_old,new| , 1 )

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711-732.
We use auxiliary variables v to augment the dimension space so that dim(M_old) = dim(M_new):

α = min( [ P(Y∣θ_new, M_new) P(θ_new, M_new) P_new(v_new) ] / [ P(Y∣θ_old, M_old) P(θ_old, M_old) P_old(v_old) ] × |J_old,new| , 1 )
JACOBIAN

We need the Jacobian for the transformation, and the proposed θ_new values need to allow the possibility of being accepted.

J = ∂(θ_old, v_old) / ∂(θ_new, v_new) =

[ ∂θ_old,1/∂θ_new,1      ...  ∂θ_old,1/∂θ_new,p_new      ...  ∂θ_old,1/∂v_new     ]
[ ...                         ...                             ...                 ]
[ ∂θ_old,p_old/∂θ_new,1  ...  ∂θ_old,p_old/∂θ_new,p_new  ...  ∂θ_old,p_old/∂v_new ]
[ ∂v_old/∂θ_new,1        ...  ...                        ...  ∂v_old/∂v_new       ]
POTENTIAL PROBLEMS

M1 and M2 have different parameter dimensions.

Often model parameters don't have an obvious transformation allowing an intuitive transition.

The last accepted value might be from a different model and may require a large jump in the parameter space.
M1: Y ~ N(β_1,0 + β_1,1 X, σ_1²)

M2: Y ~ N(β_2,0 + β_2,1 X + β_2,2 X², σ_2²)

Moving from M1 to M2 will require moving β_1,0 quite far to get to a reasonable location for β_2,0.
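To make the mechanics concrete, here is a minimal RJMCMC sketch for this pair of nested models, using the identity birth map β_2,2 = u with u ~ N(0, τ²), so |J| = 1 and only the auxiliary proposal density q(u) enters the acceptance ratio. Fixed σ = 1, N(0, 10²) parameter priors, equal model priors (which cancel), and all names are assumptions made for the sketch, not choices from the slides.

```r
## Minimal reversible jump MCMC between nested regression models.
set.seed(1)
n <- 50
x <- runif(n, -2, 2)
y <- 1 + 0.5 * x + rnorm(n, sd = 1)      # data generated from the smaller model

## Log posterior up to the constant P(Y); equal model priors cancel.
loglik <- function(beta) {
  mu <- beta[1] + beta[2] * x
  if (length(beta) == 3) mu <- mu + beta[3] * x^2
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE))   # sigma fixed at 1 for brevity
}
logpost <- function(beta) loglik(beta) + sum(dnorm(beta, 0, 10, log = TRUE))

tau   <- 0.5                              # sd of the auxiliary proposal q(u)
beta  <- c(0, 0)                          # start in M1
model <- integer(25000)

for (i in seq_along(model)) {
  ## 1. Within-model random-walk Metropolis update.
  prop <- beta + rnorm(length(beta), sd = 0.1)
  if (log(runif(1)) < logpost(prop) - logpost(beta)) beta <- prop

  ## 2. Between-model move; the identity map gives |J| = 1.
  if (length(beta) == 2) {                # birth M1 -> M2: beta_2,2 = u ~ q
    u  <- rnorm(1, 0, tau)
    la <- logpost(c(beta, u)) - logpost(beta) - dnorm(u, 0, tau, log = TRUE)
    if (log(runif(1)) < la) beta <- c(beta, u)
  } else {                                # death M2 -> M1: drop the quadratic
    la <- logpost(beta[1:2]) - logpost(beta) + dnorm(beta[3], 0, tau, log = TRUE)
    if (log(runif(1)) < la) beta <- beta[1:2]
  }
  model[i] <- length(beta) - 1            # 1 = M1 (linear), 2 = M2 (quadratic)
}
mean(model == 1)                          # estimate of P(M1 | Y)
```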
M1: Galaxy with 3 Gaussians
M2: Galaxy with 4 Gaussians
Moving from M1 to M2 can be done by splitting one of the current Gaussians; moving from M2 to M1 can be done by merging 2 components.
RJMCMC: beautiful in principle, nasty in practice.

Needs: a transition function between parameters in multiple model spaces.

Efficiency depends completely on this functional choice and the distribution for the auxiliary variables.

Works well when we can use a birth/death process (e.g., change-point analysis).