better together? statistical learning in models made of modules
TRANSCRIPT
Better together? Statistical learning in models made of modules
Christian P. Robert (Paris-Dauphine & Warwick), joint work with Pierre E. Jacob (Harvard), Chris Holmes (Oxford), and Lawrence Murray (Uppsala)
PC Mahalanobis 125th B’day

Story line
Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. Recoup hierarchical, super, meta-analysis, and coupled models.
Story line
Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. Conventional statistical (Bayesian) updating tackles all modules jointly, with the advantage that all uncertainties can be treated simultaneously and coherently.
Story line
Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. However, when information flows both ways between any pair of modules, misspecification of either leads to misspecification of the full model and misleading quantification of uncertainties. Departing from learning with the full model may thus be beneficial, along with other motivations to eschew the full model, e.g., computational constraints and data confidentiality.
Disclaimer!
This work is unrelated to the recent referendum on Scottish independence and is not intended to support one side versus the other!
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
Models made of modules
I First module: parameter θ1, data Y1
prior: p1(θ1); likelihood: p1(Y1|θ1)
I Second module: parameter θ2, data Y2
prior: p2(θ2|θ1); likelihood: p2(Y2|θ1, θ2)
We are interested in the estimation of θ1, θ2 or both.
Joint model approach
Parameter (θ1, θ2), with prior
p(θ1, θ2) = p1(θ1)p2(θ2|θ1).
Data (Y1, Y2), likelihood
p(Y1, Y2|θ1, θ2) = p1(Y1|θ1)p2(Y2|θ1, θ2).
Posterior distribution
π(θ1, θ2|Y1, Y2) ∝ p1(θ1) p1(Y1|θ1) p2(θ2|θ1) p2(Y2|θ1, θ2).
Example: PKPD
I Pharmacokinetics:
∀t Yt ∼ N(log Ct, τ²_PK), where Ct = ψ(t, θPK),
concentration of substances/drugs in organisms, i.e., the body’s effect on a drug
I Pharmacodynamics:
∀j Zj ∼ N(Ej, τ²_PD), where Ej = ϕ(Cj, θPD),
effect of substances/drugs on health, i.e., the drug’s effect on the body
Lunn, Best, Spiegelhalter, Graham & Neuenschwander (2009). Combining MCMC with ‘sequential’ PKPD modelling.
Example: longitudinal + time to event medical data
I Longitudinal model:
∀i, j Yij = X ′i1β1 + U0i + U1itij + Zij ,
glomerular filtration rate (kidney function), onto serum creatinine, age, gender, follow-up time
I Hazard model:
∀i, t hi(t) = h0(t) exp(X′i2β2 + γ(U0i + U1it)),
time to initiation of renal replacement therapy, onto covariates and fitted time trend.
Asar, Ritchie, Kalra & Diggle (2015). Joint modelling of repeated measurement and time-to-event data: an introductory tutorial.
Example: two-step regressions
I First regression model,
∀t DMt = X1tθ + DMRt,
proportional growth in the M1 definition of money, onto two annual lags, current deviations of real federal expenditures, lagged unemployment.
I Second regression model,
∀t log(UNt/(1 − UNt)) = X2tβ + γ(L)DMRt + ut,
annual average unemployment rate onto the level of the minimum wage and a measure of military conscription; γ(L) is a 2nd-order polynomial in the lag operator L.
Murphy & Topel (1985). Estimation and Inference in Two-Step Econometric Models.
Example: biased data
I Normal location model: ∀i = 1, . . . , n1, Y1^i ∼ N(θ1, 1), prior θ1 ∼ N(0, 1).
I Extra data Y2 suspected to be biased: ∀i = 1, . . . , n2, Y2^i ∼ N(θ1 + θ2, 1), prior θ2 ∼ N(0, V).
Liu, Bayarri & Berger (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models.
Example: epidemiological study
I Model of virus prevalence
∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),
Zi is the number of women infected with high-risk HPV in a sample of size Ni in country i.
I Impact of prevalence onto cervical cancer occurrence
∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,
Yi is the number of cancer cases arising from Ti woman-years of follow-up in country i.
Plummer (2015). Cuts in Bayesian graphical models.
Example: state space models
I Geophysics model of the temperature of the ocean, ϕt.
I The temperature ϕt is used in a model of plankton population sizes βt, e.g. through an SDE:
dβt = µ(βt, ϕt)dt+ σ(βt, ϕt)dWt.
I Should we propagate the uncertainty of ϕt to the biology?
I Should we allow feedback from the biological data to ϕt?
Parslow, J., Cressie, N., Campbell, E. P., Jones, E., & Murray, L. (2013). Bayesian learning and predictability in a stochastic nonlinear dynamical model. Ecological Applications, 23(4), 679-698.
Example: causal inference with propensity scores
I Exposure to treatment variable X ∈ {0, 1}, outcome variable Y ∈ {0, 1}. Causal effect of X on Y?
I Covariates C. Assume no unmeasured confounders and a non-randomised experiment.
I Propensity (logistic) score ei = P(Xi = 1|Ci) for individual i.
I Logistic regression of Y on X for groups of individuals with similar values of e.
I Either a joint model of (X, Y, C), or a two-step approach, (X, C) and then (X, Y, e), can be envisioned.
Zigler, Watts, Yeh, Wang, Coull & Dominici (2013). Model feedback in Bayesian propensity score estimation.
Example: Bayesian k-nearest-neighbour
I When learning a Bayesian k-nearest-neighbour model on learning data X ∼ fX(x|θ)
I and extrapolating on unlabeled data Z ∼ fZ(z|θ, x)
I large size of Z leads to an uncertainty swamp
Cucala, L., Marin, J.-M., X., and Titterington, D.M. (2009). A Bayesian reassessment of nearest-neighbour classification.
Joint model approach
I The joint model approach uses all data to simultaneously infer all parameters. . .
I . . . so that uncertainty about θ1 is propagated to the estimation of θ2. . .
I . . . but misspecification of the 2nd module can damage the estimation of θ1.
I What about allowing uncertainty propagation, but preventing feedback of some modules on others. . .
I . . . and still remaining within the Bayesian safety net?
Models/modules of interest
- The module of interest might be (θ1, Y1), with Y2 extra data available to update inference on θ1, through a model with extra parameter θ2
- The module of interest might be (θ2, Y2), with unknown extra parameter θ1, to be learned with data Y1
- The first module (θ1, Y1) might be of interest for a certain community, and the second module (θ2, Y2) for another community.
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
Plug-in approach
A simple (naïve?) approach consists in
I estimating θ1 given Y1 first, e.g. θ̂1 = ∫ θ1 p1(dθ1|Y1),
I inference on θ2 given Y2 and θ̂1 using
p2(θ2|θ̂1, Y2) = p2(θ2|θ̂1) p2(Y2|θ̂1, θ2) / p2(Y2|θ̂1).
Uncertainty about θ1 is ignored in the estimation of θ2. Misspecification of the 2nd module doesn’t impact estimation of θ1.
Cut approach
One might want to propagate uncertainty without allowing “feedback” of the second module on the first module.
Cut distribution:
πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2).
Genuine if different from the posterior distribution under the joint model:
πcut(θ1, θ2; Y1, Y2) ∝ π(θ1, θ2|Y1, Y2) / p2(Y2|θ1).
Uncertainty about θ1 is propagated to the marginal of θ2.
The marginal of θ1 is p1(θ1|Y1), irrespective of the 2nd module.
Candidate distributions
For θ1, we can consider the following distributions.
I p1(θ1)
I p1(θ1|Y1)
I π(θ1|Y1, Y2)
but also. . .
I ∝ p1(θ1)p2(Y2|θ1)
I ∝ p1(θ1|Y1)p2(Y2|θ1)^γ
I etc.
We can consider, more generally,
π_{γ1,γ2}(θ1; Y1, Y2) ∝ p1(θ1) p1(Y1|θ1)^{γ1} p2(Y2|θ1)^{γ2}.
Grünwald, P. (2012). The safe Bayesian.
Candidate distributions
To infer the pair (θ1, θ2), for any marginal π1 on θ1, we can consider
I π1(θ1)p2(θ2|θ1)
I π1(θ1)p2(θ2|θ1, Y2)
I etc.
If π1 is reduced to a point mass δθ1, we retrieve plug-in approaches.
Example: biased data
Y1^1, . . . , Y1^{n1} i.i.d. from N(θ1, 1), with n1 = 100, θ1 = 0.
Y2^1, . . . , Y2^{n2} i.i.d. from N(θ1 + θ2, 1), with n2 = 1000, θ2 = 1.
Prior θ1 ∼ N(0, 1) and θ2 ∼ N(0, 0.1²).
Figures in the next slides show
I full posterior: π(θ1, θ2|Y1, Y2),
I module 1: p1(θ1|Y1),
I cut: p1(θ1|Y1)p2(θ2|θ1, Y2),
I module 2: ∝ p1(θ1)p2(θ2|θ1)p2(Y2|θ1, θ2),
I prior: p1(θ1)p2(θ2|θ1).
Example: biased data
Figure: densities for θ1 under the full posterior, module 1, module 2 and the prior.
Example: biased data
Figure: densities for θ2 under the cut distribution, full posterior, module 2, plug-in and the prior.
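In this conjugate Gaussian example every candidate is available in closed form. A minimal sketch (NumPy; variable names and the random seed are illustrative assumptions; sample sizes and priors follow the slides): module 1’s posterior for θ1 and the full-model posterior for (θ1, θ2) both come from standard Gaussian linear-model algebra, and the bias in module 2 visibly pulls the joint posterior for θ1 away from the truth.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, V = 100, 1000, 0.1**2
y1 = rng.normal(0.0, 1.0, n1)        # module 1 data, true theta1 = 0
y2 = rng.normal(0.0 + 1.0, 1.0, n2)  # module 2 data, biased: theta2 = 1

# Module 1 alone: theta1 | Y1 is Gaussian with precision 1 + n1.
m1 = y1.sum() / (1 + n1)
v1 = 1.0 / (1 + n1)

# Full model: Gaussian linear model for (theta1, theta2).
# Prior precision diag(1, 1/V); design: Y1 ~ theta1, Y2 ~ theta1 + theta2.
Lam = np.array([[1.0 + n1 + n2, n2],
                [n2, 1.0 / V + n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
mean_full = np.linalg.solve(Lam, b)

print("module 1 mean for theta1:", m1)
print("full posterior mean for (theta1, theta2):", mean_full)
```

With these settings the joint posterior mean of θ1 lands close to 0.5, reproducing the conflict displayed in the figures, while module 1 alone stays near 0.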
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
Predictive performance
I Consider the task of predicting a future observation Y .
I For a candidate π(θ), the predictive density is
y ↦ ∫Θ p(y|θ) π(dθ).
I Its expected utility, using the logarithmic scoring rule, is
∫Y log(∫Θ p(y|θ) π(dθ)) p⋆(dy),
where p⋆(dy) is the data-generating process.
I Other scoring rules are principled too.
Parry, Dawid & Lauritzen (2012). Proper local scoring rules.
Logarithmic scoring rule
Kullback–Leibler (KL) divergence from a distribution with density p′ to another distribution with density p:
KL(p, p′) = ∫ log(p(y)/p′(y)) p(y) dy.
The posterior(θ|Y) minimises
∫ (− log likelihood(Y|θ)) ν(θ) dθ + KL(ν(θ), prior(θ))
over all choices of ν(θ).
For a predictive distribution with density y ↦ π(y) and an observation Y, the logarithmic score is log π(Y).
Predictive performance
I A practical assessment of predictive performance can be obtained in the prequential approach.
I Sequentially obtain πt(θ) using data y1, . . . , yt and predict the next observation:
Σ_{t=1}^n log(∫Θ p(yt|θ) πt−1(dθ)).
I The log-evidence is retrieved when (πt) is the sequence of partial posteriors, p(θ), p(θ|y1), p(θ|y1, y2), etc.
Bernardo & Smith (1994). Bayesian Theory.
Dawid (1984). The prequential approach.
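The identity between the prequential sum and the log-evidence is easy to check numerically. A minimal sketch for the conjugate normal-location module (NumPy only; the data size and seed are arbitrary assumptions):

```python
import numpy as np

def norm_logpdf(x, m, v):
    # log density of N(m, v) at x
    return -0.5 * np.log(2 * np.pi * v) - (x - m) ** 2 / (2 * v)

rng = np.random.default_rng(0)
n = 50
y = rng.normal(0.0, 1.0, n)

# Prior theta ~ N(0, 1); after t observations the posterior is
# N(sum(y_1..t)/(1+t), 1/(1+t)), and the one-step-ahead predictive
# for y_{t+1} is N(posterior mean, posterior variance + 1).
prequential, s = 0.0, 0.0
for t, yt in enumerate(y):
    m, v = s / (1 + t), 1.0 / (1 + t)
    prequential += norm_logpdf(yt, m, v + 1.0)
    s += yt

# Closed-form log-evidence log p(y_1..n) for the same model.
log_evidence = (-0.5 * n * np.log(2 * np.pi)
                - 0.5 * np.log(1 + n)
                - 0.5 * (np.sum(y**2) - np.sum(y)**2 / (1 + n)))

print(prequential, log_evidence)  # the two quantities coincide
```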
Proposed plan
In the first module, θ1 is (usually) defined in its relation to Y1.
We propose to assess candidate distributions for θ1 based onpredictive performance for Y1.
To compare p1(θ1|Y1) and π(θ1|Y1, Y2), under the prequential approach and under the logarithmic scoring rule, we compare p1(Y1) with π(Y1|Y2).
If p1(Y1) > π(Y1|Y2), we can justify the use of distributions on (θ1, θ2) that admit p1(θ1|Y1) as first marginal, e.g. the cut distribution.
Example: biased data
Y1^1, . . . , Y1^{n1} i.i.d. from N(θ1, 1), with n1 = 100, θ1 = 0.
Y2^1, . . . , Y2^{n2} i.i.d. from N(θ1 + θ2, 1), with n2 = 1000, θ2 = 1.
Prior θ1 ∼ N(0, 1) and θ2 ∼ N(0, 0.1²).

candidate        predictive score for Y1
p1(θ1|Y1)        -144.5 [-144.5, -144.4]
p1(θ1)           -151.4 [-152, -150.6]
π(θ1|Y1, Y2)     -165 [-165.1, -165]
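In this Gaussian setting both quantities of the proposed comparison have closed forms. A minimal sketch (NumPy; the seed and simulated data are illustrative assumptions): p1(Y1) is the evidence of the normal-location module, and π(Y1|Y2) = π(Y1, Y2)/π(Y2), with all three evidences obtained from standard Gaussian linear-model identities.

```python
import numpy as np

def evidence_location(y, prior_var):
    # log of int prod_i N(y_i; mu, 1) N(mu; 0, prior_var) dmu
    n, lam0 = len(y), 1.0 / prior_var
    lam_n = lam0 + n
    return (-0.5 * n * np.log(2 * np.pi) + 0.5 * np.log(lam0 / lam_n)
            - 0.5 * (np.sum(y**2) - np.sum(y)**2 / lam_n))

rng = np.random.default_rng(2)
n1, n2, V = 100, 1000, 0.1**2
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(0.0 + 1.0, 1.0, n2)   # biased second sample, theta2 = 1

# log p1(Y1): normal-location evidence with prior variance 1.
log_p1_Y1 = evidence_location(y1, 1.0)

# log pi(Y1, Y2): evidence of the joint Gaussian linear model,
# prior precision diag(1, 1/V), design Y1 ~ th1, Y2 ~ th1 + th2.
Lam0 = np.diag([1.0, 1.0 / V])
Lam = Lam0 + np.array([[n1 + n2, n2], [n2, n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
n = n1 + n2
log_joint = (-0.5 * n * np.log(2 * np.pi)
             + 0.5 * (np.linalg.slogdet(Lam0)[1] - np.linalg.slogdet(Lam)[1])
             - 0.5 * (np.sum(y1**2) + np.sum(y2**2) - b @ np.linalg.solve(Lam, b)))

# log pi(Y2): Y2 is a location sample with mean th1 + th2 ~ N(0, 1 + V).
log_p_Y2 = evidence_location(y2, 1.0 + V)

log_cond = log_joint - log_p_Y2       # log pi(Y1 | Y2)
print(log_p1_Y1, log_cond)            # p1(Y1) clearly wins under bias
```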
Side note on prior vs posterior
Data-generating process p⋆:
1/2 N(−2, 1) + 1/2 N(+2, 1).
Likelihood: Y|θ ∼ N(θ, 1).
Prior θ ∼ N(0, 1).
Prior predictive N(0, 2).
Posterior predictive N(0, 1) (asymptotically).
The prior predictive is closer to p⋆ than the posterior predictive, under the KL divergence.
Example: epidemiological study
I Model of virus prevalence
∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),
Zi is the number of women infected with high-risk HPV in a sample of size Ni in country i. Beta(1,1) prior on each ϕi, independently.
I Impact of prevalence onto cervical cancer occurrence
∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,
Yi is the number of cancer cases arising from Ti woman-years of follow-up in country i. N(0, 10³) priors on θ2,1, θ2,2, independently (sorry, not my fault).
Plummer (2015). Cuts in Bayesian graphical models.
Example: epidemiological study
candidate        predictive score for Y1
p1(θ1|Y1)        -64.8 [-65, -64.7]
π(θ1|Y1, Y2)     -74.9 [-75.1, -74.5]
p1(θ1)           -262.7 [-276.6, -253.5]

. . . from which we can decide that the 2nd module does not help estimating θ1, which justifies the use of the cut distribution.
In the next figure, different candidate distributions for (θ2,1, θ2,2) are displayed:
I cut: p1(θ1|Y1)p2(θ2|θ1, Y2),
I full posterior: π(θ1, θ2|Y1, Y2),
I module 2: ∝ p1(θ1)p2(θ2|θ1)p2(Y2|θ1, θ2).
Example: epidemiological study
Figure: candidate distributions for (θ2,1, θ2,2): cut, full posterior and module 2.
Side note: prior vs posterior
Figure: density of the data-generating process p⋆, 1/2 N(−2, 1) + 1/2 N(+2, 1).
Side note: prior vs posterior
Figure: adds the density of the prior predictive N(0, 2).
Side note: prior vs posterior
Figure: adds the density of the posterior predictive ≈ N(0, 1).
Prior or posterior?
In the next figures. . .
I for R = 256 datasets of size n, drawn independently from p⋆, compute the posterior distribution πn(dθ),
I compute the posterior predictive density
Y ↦ qn(Y) = ∫Θ p(Y|θ) πn(dθ),
I compute the expected utility
∫Y log qn(Y) p⋆(dY),
I and plot the 256 results obtained for all n = 0, . . . , 1000.
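For the conjugate normal-location model this procedure is fully analytic: the posterior predictive after n observations is N(mn, sn² + 1), and its expected utility under p⋆ has a closed form. A minimal sketch (NumPy; the subset of n values, seed, and R = 256 replicates are illustrative, matching the slides where possible), for the misspecified mixture p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1):

```python
import numpy as np

rng = np.random.default_rng(3)
R = 256

def sample_pstar(n):
    # draw n points from the two-component mixture p*
    signs = rng.choice([-2.0, 2.0], size=n)
    return rng.normal(signs, 1.0)

def expected_utility(m, s2):
    # E_{p*}[log N(y; m, s2 + 1)] in closed form:
    # under p*, E[y] = 0 and E[y^2] = 5, so E[(y - m)^2] = 5 + m^2.
    v = s2 + 1.0
    return -0.5 * np.log(2 * np.pi * v) - (5.0 + m**2) / (2 * v)

utilities = {}
for n in [0, 10, 1000]:
    us = []
    for _ in range(R):
        y = sample_pstar(n)
        m, s2 = y.sum() / (1 + n), 1.0 / (1 + n)  # posterior N(m, s2)
        us.append(expected_utility(m, s2))
    utilities[n] = np.median(us)

print(utilities)
```

The median utility decreases with n: the prior predictive N(0, 2) (n = 0) beats the concentrating posterior predictive, as on the slides.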
Example: prior vs posterior, well-specified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = N(0, 1).
Figure: expected utility when predicting Y using n data points, for n = 0, . . . , 1000. Bands represent 10% and 90% quantiles over 256 different datasets; the dotted line represents the median.
Example: prior vs posterior, well-specified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = N(0, 1).
Figure: same quantities, with n on a logarithmic scale (n = 1 to 1000). Bands represent 10% and 90% quantiles over 256 different datasets; the dotted line represents the median.
Example: prior vs posterior, misspecified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1).
Figure: expected utility when predicting Y using n data points, for n = 0, . . . , 1000.
Example: prior vs posterior, misspecified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1).
Figure: same quantities, with n on a logarithmic scale (n = 1 to 1000).
Example: prior vs posterior, misspecified setting
Figure: density of the data-generating process p⋆, 1/2 N(−2, 1) + 1/2 N(+2, 1).
Example: prior vs posterior, misspecified setting
Figure: adds the density of the prior predictive N(0, 2).
Example: prior vs posterior, misspecified setting
Figure: adds the density of the posterior predictive ≈ N(0, 1).
Asymptotic predictions in misspecified settings
I Asymptotically, the posterior concentrates around
θ⋆ = argmin_θ KL(p⋆(dY) | p(dY|θ)),
i.e. around the θ⋆ that maximizes the expected utility.
I If we have to use a candidate distribution of the form π(dθ) = δθ(dθ), then the best choice is δθ⋆(dθ).
I But there might be many π(dθ) such that the predictive q(dY) = ∫Θ p(dY|θ) π(dθ) satisfies
KL(p⋆(dY) | q(dY)) < min_θ KL(p⋆(dY) | p(dY|θ)).
Joint or modularized approach?
I Consider the use of the second module (i.e. θ2, Y2) to improve the prediction of Y1.
I Biased data example:
θ1 ∼ N(0, 1), Y1 ∼ N(θ1, 1),
θ2 ∼ N(0, λ2⁻¹), Y2 ∼ N(θ1 + θ2, 1), with λ2 the prior precision on θ2.
I Data-generating process: Y1 ∼ N(0, 1) and Y2 ∼ N(θ2⋆, 1). We set n1 = 10 and plot the expected utility against n2, for two values of λ2 (vague / precise) and for two values of θ2⋆ (bias / no bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 0.01 (vague prior) and θ2⋆ = 0 (no bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 100 (precise prior) and θ2⋆ = 0 (no bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 0.01 (vague prior) and θ2⋆ = 1 (bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 100 (precise prior) and θ2⋆ = 1 (bias).
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
From toy models to real-world situations
Computationally, expected utilities typically involve:
I integrals with respect to the data distribution p⋆ on Y,
→ cross-validation, bootstrap, or sequential out-of-sample prediction (i.e. the prequential approach);
I integrals with respect to candidate distributions on θ,
→ sampling methods, e.g. MCMC, SMC, etc.
Sampling in the joint model approach
I Joint model posterior has (genuine) probability density
π(θ1, θ2|Y1, Y2) ∝ p1(θ1) p1(Y1|θ1) p2(θ2|θ1) p2(Y2|θ1, θ2).
I The computational complexity typically grows super-linearly with the number of modules.
I Difficulties stack up. . . intractability, multimodality, ridges, etc.
I Y1 and Y2 might not be simultaneously available, e.g. for confidentiality reasons.
Approximating constants
The proposed plan involves computing p1(Y1) and π(Y1|Y2). This can be seen as a normalizing constant estimation task. Challenging, but numerous methods have been proposed: bridge sampling, path sampling, nested sampling, Besag’s, Chib’s, Gelfand and Dey’s, Geyer’s, sequential Monte Carlo samplers, etc.
Del Moral, Doucet & Jasra (2006). Sequential Monte Carlo samplers.
Zhou, Johansen & Aston (2016). Toward automatic model comparison: an adaptive sequential Monte Carlo approach.
Sampling in the joint model approach: SMC way
I We can first approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).
I For each i ∈ {1, . . . , N1}, an SMC sampler
I approximates π(dθ2|θ1^i, Y2) by (θ2^{i,1}, . . . , θ2^{i,N2}),
I approximates π(Y2|θ1^i) by π̂^{N2}(Y2|θ1^i).
I Define wi ∝ π̂^{N2}(Y2|θ1^i).
I Then for any test function ϕ
Σ_{i=1}^{N1} wi (1/N2) Σ_{j=1}^{N2} ϕ(θ1^i, θ2^{i,j}) → ∫ ϕ(θ1, θ2) π(dθ1, dθ2|Y1, Y2)
in probability as N1, N2 → ∞.
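The mechanism can be illustrated on the conjugate biased-data model, replacing the inner SMC samplers by plain importance sampling from the second-module prior (an assumption made for brevity; the sample sizes, prior settings and seed are also illustrative). Stage 1 draws θ1 from p1(θ1|Y1) exactly; stage 2 estimates the feedback term p2(Y2|θ1) by Monte Carlo; reweighting then recovers joint-posterior expectations:

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, V = 30, 20, 1.0               # small sizes, vague theta2 prior
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(0.0 + 1.0, 1.0, n2)   # biased second sample

# Stage 1: theta1 | Y1 is N(sum(y1)/(1+n1), 1/(1+n1)); sample exactly.
N1, N2 = 2000, 500
th1 = rng.normal(y1.sum() / (1 + n1), np.sqrt(1.0 / (1 + n1)), N1)

# Stage 2: for each theta1 particle, estimate p2(Y2 | theta1) by averaging
# the likelihood over draws theta2 ~ p2(theta2 | theta1) = N(0, V).
th2 = rng.normal(0.0, np.sqrt(V), (N1, N2))
mu = th1[:, None] + th2                       # N1 x N2 means for Y2
loglik = (-0.5 * n2 * np.log(2 * np.pi)
          - 0.5 * (np.sum(y2**2) - 2 * mu * y2.sum() + n2 * mu**2))
m = loglik.max()
lw = m + np.log(np.mean(np.exp(loglik - m), axis=1))  # log p2-hat(Y2|th1)
w = np.exp(lw - lw.max())
w /= w.sum()                                  # normalized feedback weights

est_mean_th1 = np.sum(w * th1)                # joint posterior E[theta1]

# Exact joint posterior mean for comparison (Gaussian linear model).
Lam = np.array([[1.0 + n1 + n2, n2], [n2, 1.0 / V + n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
exact = np.linalg.solve(Lam, b)
print(est_mean_th1, exact[0])
```

Dropping the weights w (i.e. averaging uniformly) would instead approximate the cut distribution, which is the point of the approach.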
Sampling in the joint model approach: MCMC way
I Metropolis–Hastings with proposal distribution
q((θ1, θ2) → (dθ1′, dθ2′)) = π1^{N1}(dθ1′) q(θ2 → dθ2′),
i.e. θ1′ is drawn among (θ1^1, . . . , θ1^{N1}) uniformly at random.
I If we pretend that π1^{N1}(dθ1′) ≈ π1(dθ1′|Y1), the acceptance probability becomes
1 ∧ [p2(Y2|θ1′, θ2′) p2(θ2′|θ1′) q(θ2′ → θ2)] / [p2(Y2|θ1, θ2) p2(θ2|θ1) q(θ2 → θ2′)],
which does not involve Y1.
Lunn, D., Barrett, J., Sweeting, M., & Thompson, S. (2013). Fully Bayesian hierarchical modelling in two stages, with application to meta-analysis.
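A sketch of this two-stage sampler on the conjugate biased-data model (NumPy; the pool size, random-walk scale, chain length and data settings are illustrative assumptions, and the module 1 pool is drawn exactly rather than by MCMC). With a symmetric random walk on θ2 the q-terms cancel, and the acceptance ratio involves only the second module:

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, V = 30, 20, 1.0
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(1.0, 1.0, n2)
S2, T2 = y2.sum(), np.sum(y2**2)

def log_module2(th1, th2):
    # log p2(theta2 | theta1) + log p2(Y2 | theta1, theta2), up to constants
    mu = th1 + th2
    return -0.5 * th2**2 / V - 0.5 * (T2 - 2 * mu * S2 + n2 * mu**2)

# Stage 1 pool: draws from p1(theta1 | Y1) = N(sum(y1)/(1+n1), 1/(1+n1)).
pool = rng.normal(y1.sum() / (1 + n1), np.sqrt(1.0 / (1 + n1)), 5000)

# Stage 2: MH over (theta1, theta2); theta1' uniform from the pool,
# theta2' by symmetric random walk. Acceptance does not involve Y1.
th1, th2 = pool[0], 0.0
chain = np.empty((50000, 2))
for t in range(chain.shape[0]):
    th1p = rng.choice(pool)
    th2p = th2 + 0.3 * rng.normal()
    if np.log(rng.uniform()) < log_module2(th1p, th2p) - log_module2(th1, th2):
        th1, th2 = th1p, th2p
    chain[t] = th1, th2

# Compare with the exact joint posterior mean (Gaussian linear model).
Lam = np.array([[1.0 + n1 + n2, n2], [n2, 1.0 / V + n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
exact = np.linalg.solve(Lam, b)
print(chain[1000:].mean(axis=0), exact)
```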
Sampling from the cut distribution
I The cut distribution is defined as
πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2) ∝ π(θ1, θ2|Y1, Y2) / p2(Y2|θ1).
I The denominator is the feedback of the 2nd module on θ1:
p2(Y2|θ1) = ∫ p2(Y2|θ1, θ2) p2(dθ2|θ1).
I The feedback term is typically intractable.
Sampling [not] from the cut distribution
WinBUGS’ approach via the cut function: alternate between
I sampling θ1′ from K1(θ1 → dθ1′), targeting p1(dθ1|Y1);
I sampling θ2′ from K2_{θ1′}(θ2 → dθ2′), targeting p2(dθ2|θ1′, Y2).
This does not leave the cut distribution invariant: if θ1, θ2 ∼ πcut(dθ1, dθ2; Y1, Y2), then θ1′, θ2′ follows
πcut(dθ1′, dθ2′; Y1, Y2) ∫ [p2(θ2|θ1, Y2) / p2(θ2|θ1′, Y2)] K1(θ1′ → dθ1) K2_{θ1′}(θ2′ → dθ2).
Iterating the kernel K2_{θ1′} enough times mitigates the issue.
Plummer (2015). Cuts in Bayesian graphical models. Bayesian Analysis.
Sampling from the cut distribution
In a perfect world, we could sample i.i.d.
I θ1^i from p1(θ1|Y1),
I θ2^i given θ1^i from p2(θ2|θ1^i, Y2),
then (θ1^i, θ2^i) would be i.i.d. from the cut distribution.
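In the conjugate biased-data example this perfect world exists: both conditionals are Gaussian. A minimal sketch (NumPy; data settings and seed are illustrative assumptions): θ1 is drawn from p1(θ1|Y1), and θ2|θ1 from p2(θ2|θ1, Y2), a conjugate update with precision 1/V + n2:

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, V = 100, 1000, 0.1**2
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(1.0, 1.0, n2)

N = 100_000
# theta1 ~ p1(theta1 | Y1): module 1 only, unaffected by module 2.
m1, v1 = y1.sum() / (1 + n1), 1.0 / (1 + n1)
th1 = rng.normal(m1, np.sqrt(v1), N)

# theta2 | theta1 ~ p2(theta2 | theta1, Y2): conjugate Gaussian update,
# prior N(0, V), likelihood N(y2_i; theta1 + theta2, 1).
prec2 = 1.0 / V + n2
th2 = rng.normal((y2.sum() - n2 * th1) / prec2, np.sqrt(1.0 / prec2))

# The first marginal of the cut distribution is exactly p1(theta1 | Y1).
print(th1.mean(), m1)
```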
Sampling from the cut distribution
In an MCMC world, we can sample
I θ1^i approximately from p1(θ1|Y1) using MCMC,
I θ2^i given θ1^i approximately from p2(θ2|θ1^i, Y2) using MCMC,
then the resulting samples approximate the cut distribution, in the limit of the numbers of iterations, at both stages.
There is an alternative approach that
I is valid in a single asymptotic regime,
I enables simple constructions of confidence intervals.
Unbiased MCMC [on the side]
By coupling pairs of MCMC chains, we can produce (random) empirical measures
π̂(·) = Σ_{ℓ=1}^N ω^ℓ δ_{X^ℓ}(·)
that approximate a target π, in the sense that
E[Σ_{ℓ=1}^N ω^ℓ h(X^ℓ)] = ∫ h(x) π(dx),
for a test function h.
Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.
Heng & Jacob (2017). Unbiased Hamiltonian Monte Carlo with couplings.
Sampling from the cut distribution
In an unbiased MCMC world, we can approximate
I p1(dθ1|Y1) with a random measure
π̂1(dθ1) = Σ_{ℓ=1}^{N1} ω1^ℓ δ_{θ1^ℓ}(dθ1),
I p2(dθ2|θ1^ℓ, Y2) for any θ1^ℓ, with a random measure
π̂2(dθ2|θ1^ℓ) = Σ_{k=1}^{N2} ω2^{ℓ,k} δ_{θ2^{ℓ,k}}(dθ2).
Thus, by the tower property, we can unbiasedly estimate
∫ h(θ1, θ2) p1(dθ1|Y1) p2(dθ2|θ1, Y2).
Sampling from the cut distribution
We can thus construct unbiased estimators of
∫ h(θ1, θ2) πcut(dθ1, dθ2; Y1, Y2)
for test functions h.
We care about unbiasedness because. . .
I we can draw R independent copies of the estimator, H^r, for r = 1, . . . , R,
I and average the results: R⁻¹ Σ_{r=1}^R H^r,
I for which confidence intervals can be constructed.
Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.
Sampling from the cut distribution
Thus, by the tower property,
∫ h(θ1, θ2) p1(dθ1|Y1) p2(dθ2|θ1, Y2)
can be approximated with no bias by an estimator of the form
Σ_{ℓ=1}^{N1} ω1^ℓ Σ_{k=1}^{N2} ω2^{ℓ,k} h(θ1^ℓ, θ2^{ℓ,k}).
One can then sample independently R such estimators, and average the results to consistently approximate the cut distribution as R → ∞.
Sampling in the modularized approach: SMC way
I We can first approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).
I For each i ∈ {1, . . . , N1}, an SMC sampler approximates π(dθ2|θ1^i, Y2) by (θ2^{i,1}, . . . , θ2^{i,N2}).
I Then for any test function ϕ
(1/N1)(1/N2) Σ_{i=1}^{N1} Σ_{j=1}^{N2} ϕ(θ1^i, θ2^{i,j}) → ∫ ϕ(θ1, θ2) π(dθ2|θ1, Y2) π1(dθ1|Y1)
in probability as N1, N2 → ∞.
I The SMC approach yields approximations of the joint and the cut distributions simultaneously, by using or not the estimated feedback terms π̂^{N2}(Y2|θ1^i).
Sampling in the modularized approach: MCMC way
I First approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).
I Run a (long enough) MCMC targeting π(dθ2|θ1^i, Y2) for each i ∈ {1, . . . , N1}. . . independently?
I Introduce swap moves, as in parallel tempering:
I select k, ℓ in {1, . . . , N1} uniformly at random;
I swap θ2^k with θ2^ℓ with probability
1 ∧ [πcut(θ1^k, θ2^ℓ) πcut(θ1^ℓ, θ2^k)] / [πcut(θ1^k, θ2^k) πcut(θ1^ℓ, θ2^ℓ)]
= 1 ∧ [p2(θ2^ℓ|θ1^k) p2(Y2|θ1^k, θ2^ℓ) p2(θ2^k|θ1^ℓ) p2(Y2|θ1^ℓ, θ2^k)] / [p2(θ2^k|θ1^k) p2(Y2|θ1^k, θ2^k) p2(θ2^ℓ|θ1^ℓ) p2(Y2|θ1^ℓ, θ2^ℓ)].
Discussion
I Models made of multiple modules abound: forcings in dynamical systems, computer models, causal inference with propensity scores, PKPD, air pollution data, meta-analysis, generated regressors, regression with functional predictors, linkage disequilibrium modelling. . .
I Departures from the joint model approach can be justified in a decision-theoretic framework.
I These enable propagation of uncertainties but with a control of the feedback of some variables on others.
I Is the cut distribution a Bayesian equivalent of two-step estimators routinely used in econometrics?
I The comparison between candidates and the approximation of the cut distribution raise computational challenges.
References
Technical report at arxiv.org/abs/1708.08719.
Unbiased MCMC report at arxiv.org/abs/1708.03625.
Appendix: unbiased MCMC
Consider two Markov chains (Xt) and (Yt) generated as follows:
I sample X0 and Y0 from π0, an initial distribution,
I sample X1 ∼ P(X0, ·), with a π-invariant Markov kernel P,
I for t ≥ 1, sample (Xt+1, Yt) ∼ P̄((Xt, Yt−1), ·), where P̄ is a coupling of P with itself.
Each chain marginally evolves according to P.
Xt and Yt have the same distribution for all t ≥ 0.
P̄ is such that there exists a meeting time τ with Xt = Yt−1 for all t ≥ τ.
Appendix: unbiased MCMC
Marginal convergence of each chain:
Eπ[h(X)] = lim_{t→∞} E[h(Xt)].
Limit as a telescopic sum, for all k ≥ 0,
lim_{t→∞} E[h(Xt)] = E[h(Xk)] + Σ_{t=k+1}^∞ E[h(Xt) − h(Xt−1)].
Since Yt has the same distribution as Xt, then
Eπ[h(X)] = E[h(Xk)] + Σ_{t=k+1}^∞ E[h(Xt) − h(Yt−1)].
Appendix: unbiased MCMC
If we can swap expectation and limit,
Eπ[h(X)] = E[h(Xk) + Σ_{t=k+1}^∞ (h(Xt) − h(Yt−1))],
the random variable
h(Xk) + Σ_{t=k+1}^∞ (h(Xt) − h(Yt−1))
is an unbiased estimator of Eπ[h(X)].
We can compute the “infinite” sum since Xt = Yt−1 for all t ≥ τ.
Appendix: unbiased MCMC
The proposed random measure estimator is given by
π̂(·) = δ_{Xk}(·) + Σ_{t=k+1}^{τ−1} (δ_{Xt}(·) − δ_{Yt−1}(·)) ≡ Σ_{ℓ=1}^N ω^ℓ δ_{Z^ℓ}(·).
Variant of Glynn & Rhee’s estimator.
Glynn & Rhee (2014). Exact estimation for Markov chain equilibrium expectations.
Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.
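A compact illustration of the construction (Python; the target, proposal scale, choice of k and replication count are illustrative assumptions, not from the slides): two random-walk Metropolis–Hastings chains targeting a standard normal are coupled through a maximal coupling of their proposals and a shared acceptance uniform, so that once they meet they stay together; averaging independent copies of h(Xk) + Σ_{t=k+1}^{τ−1}(h(Xt) − h(Yt−1)) then recovers Eπ[h(X)] without burn-in bias.

```python
import numpy as np

rng = np.random.default_rng(7)
SIG = 1.0                                  # random-walk proposal std

def log_pi(x):                             # target: standard normal
    return -0.5 * x * x

def prop_logpdf(x, mu):                    # log density of N(mu, SIG^2)
    return -0.5 * ((x - mu) / SIG) ** 2

def max_coupling(mu1, mu2):
    # Maximal coupling of N(mu1, SIG^2) and N(mu2, SIG^2).
    x = rng.normal(mu1, SIG)
    if np.log(rng.uniform()) + prop_logpdf(x, mu1) <= prop_logpdf(x, mu2):
        return x, x
    while True:
        y = rng.normal(mu2, SIG)
        if np.log(rng.uniform()) + prop_logpdf(y, mu2) > prop_logpdf(y, mu1):
            return x, y

def coupled_step(x, y):
    xp, yp = max_coupling(x, y)
    lu = np.log(rng.uniform())             # shared acceptance variable
    return (xp if lu < log_pi(xp) - log_pi(x) else x,
            yp if lu < log_pi(yp) - log_pi(y) else y)

def unbiased_estimator(h, k=5):
    x, y = rng.normal(), rng.normal()      # X0, Y0 ~ pi0 = N(0, 1)
    z = rng.normal(x, SIG)                 # one step of P for X only
    xp = z if np.log(rng.uniform()) < log_pi(z) - log_pi(x) else x
    hist_x, hist_y = [x, xp], [y]
    while not (len(hist_x) > k and hist_x[-1] == hist_y[-1]):
        xn, yn = coupled_step(hist_x[-1], hist_y[-1])
        hist_x.append(xn)
        hist_y.append(yn)
    est = h(hist_x[k])                     # h(X_k) + telescoping corrections
    for t in range(k + 1, len(hist_x)):
        if hist_x[t] == hist_y[t - 1]:     # chains have met: sum truncates
            break
        est += h(hist_x[t]) - h(hist_y[t - 1])
    return est

ests = [unbiased_estimator(lambda x: x) for _ in range(2000)]
print(np.mean(ests))                       # close to E_pi[X] = 0
```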
Appendix: unbiased MCMC
Conditions ensuring correctness, finite cost, and finite variance, for a test function h.
1. The marginal chain converges: E[h(Xt)] → Eπ[h(X)], and ∃η > 0, D < ∞ such that E[|h(Xt)|^{2+η}] < D.
2. The meeting time τ has geometric tails:
∃C < +∞, ∃δ ∈ (0, 1), ∀t ≥ 1, P(τ > t) ≤ C δ^t.
3. The chains stay together: Xt = Yt−1 for all t ≥ τ.
(Sufficient but not necessary!)