better together? statistical learning in models made of modules
TRANSCRIPT
Better together? Statistical learning in models made of modules
Christian P. Robert (Paris-Dauphine & Warwick), joint work with Pierre E. Jacob (Harvard), Chris Holmes (Oxford), and Lawrence Murray (Uppsala)
PC Mahalanobis 125th B’day

Story line
Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. Recoup hierarchical, super, meta-analysis, and coupled models.
Story line
Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. Conventional statistical (Bayesian) updating tackles all modules jointly, with the advantage that all uncertainties can be treated simultaneously and coherently.
Story line
Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. However, when information flows both ways between any pair of modules, misspecification of either leads to misspecification of the full model and misleading quantification of uncertainties. Departing from learning with the full model may thus be beneficial, along with other motivations to eschew the full model, e.g., computational constraints and data confidentiality.
Disclaimer!
This work is unrelated to the recent referendum on Scottish independence and is not intended to support one side versus the other!
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
Models made of modules
I First module: parameter θ1, data Y1
prior: p1(θ1); likelihood: p1(Y1|θ1)
I Second module: parameter θ2, data Y2
prior: p2(θ2|θ1); likelihood: p2(Y2|θ1, θ2)
We are interested in the estimation of θ1, θ2 or both.
Joint model approach
Parameter (θ1, θ2), with prior
p(θ1, θ2) = p1(θ1)p2(θ2|θ1).
Data (Y1, Y2), likelihood
p(Y1, Y2|θ1, θ2) = p1(Y1|θ1)p2(Y2|θ1, θ2).
Posterior distribution
π(θ1, θ2|Y1, Y2) ∝ p1(θ1) p1(Y1|θ1) p2(θ2|θ1) p2(Y2|θ1, θ2).
Example: PKPD
I Pharmacokinetics:
∀t Yt ∼ N(log Ct, τ²_PK), where Ct = ψ(t, θPK),
concentration of substances/drugs in organisms, i.e., the body’s effect on a drug
I Pharmacodynamics:
∀j Zj ∼ N(Ej, τ²_PD), where Ej = ϕ(Cj, θPD),
effect of substances/drugs on health, i.e., the drug’s effect on the body
Lunn, Best, Spiegelhalter, Graham & Neuenschwander (2009). Combining MCMC with ‘sequential’ PKPD modelling.
Example: longitudinal + time to event medical data
I Longitudinal model:
∀i, j Yij = X ′i1β1 + U0i + U1itij + Zij ,
glomerular filtration rate (kidney function), onto serum creatinine, age, gender, follow-up time
I Hazard model:
∀i, t hi(t) = h0(t) exp(X′i2β2 + γ(U0i + U1it)),
time to initiation of renal replacement therapy, onto covariates and fitted time trend.
Asar, Ritchie, Kalra & Diggle (2015). Joint modelling of repeated measurement and time-to-event data: an introductory tutorial.
Example: two-step regressions
I First regression model,
∀t DMt = X1tθ + DMRt,
proportional growth in the M1 definition of money, onto two annual lags, current deviations of real federal expenditures, lagged unemployment.
I Second regression model,
∀t log(UNt/(1 − UNt)) = X2tβ + γ(L)DMRt + ut,
annual average unemployment rate onto the level of the minimum wage and a measure of military conscription; γ(L) is a 2nd-order polynomial in the lag operator L.
Murphy & Topel (1985). Estimation and Inference in Two-Step Econometric Models.
Example: biased data
I Normal location model: ∀i = 1, . . . , n1, Y1^i ∼ N(θ1, 1), prior θ1 ∼ N(0, 1).
I Extra data Y2 suspected to be biased: ∀i = 1, . . . , n2, Y2^i ∼ N(θ1 + θ2, 1), prior θ2 ∼ N(0, V).
Liu, Bayarri & Berger (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models.
Example: epidemiological study
I Model of virus prevalence
∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),
Zi is the number of women infected with high-risk HPV in a sample of size Ni in country i.
I Impact of prevalence onto cervical cancer occurrence
∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,
Yi is the number of cancer cases arising from Ti woman-years of follow-up in country i.
Plummer (2015). Cuts in Bayesian graphical models.
Example: state space models
I Geophysics model of the temperature of the ocean, ϕt.
I The temperature ϕt is used in a model of plankton population sizes βt, e.g. through an SDE:
dβt = µ(βt, ϕt)dt+ σ(βt, ϕt)dWt.
I Should we propagate the uncertainty of ϕt to the biology?
I Should we allow feedback from the biological data to ϕt?
Parslow, J., Cressie, N., Campbell, E. P., Jones, E., & Murray, L. (2013). Bayesian learning and predictability in a stochastic nonlinear dynamical model. Ecological Applications, 23(4), 679-698.
Example: causal inference with propensity scores
I Exposure to treatment variable X ∈ {0, 1}, outcome variable Y ∈ {0, 1}. Causal effect of X on Y?
I Covariates C. Assume no unmeasured confounders and a non-randomised experiment.
I Propensity (logistic) score ei = P(Xi = 1|Ci) for individual i.
I Logistic regression of Y on X for groups of individuals with similar values of e.
I Either a joint model of (X, Y, C), or a two-step approach, (X, C) and then (X, Y, e), can be envisioned.
Zigler, Watts, Yeh, Wang, Coull & Dominici (2013). Model feedback in Bayesian propensity score estimation.
Example: Bayesian k-nearest-neighbour
I When learning a Bayesian k-nearest-neighbour model on learning data X ∼ fX(x|θ)
I and extrapolating on unlabeled data Z ∼ fZ(z|θ, x)
I large size of Z leads to an uncertainty swamp
Cucala, L., Marin, J.-M., X., and Titterington, D.M. (2009). A Bayesian reassessment of nearest-neighbour classification.
Joint model approach
I The joint model approach uses all data to simultaneously infer all parameters. . .
I . . . so that uncertainty about θ1 is propagated to the estimation of θ2. . .
I . . . but misspecification of the 2nd module can damage the estimation of θ1.
I What about allowing uncertainty propagation, but preventing feedback of some modules on others. . .
I . . . and still remaining within the Bayesian safety net?
Models/modules of interest
- The module of interest might be (θ1, Y1), with Y2 extra data available to update inference on θ1, through a model with extra parameter θ2
- The module of interest might be (θ2, Y2), with unknown extra parameter θ1, to be learned with data Y1
- The first module (θ1, Y1) might be of interest for a certain community, and the second module (θ2, Y2) for another community.
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
Plug-in approach
A simple (naïve?) approach consists in
I estimating θ1 given Y1 first, e.g. θ̂1 = ∫ θ1 p1(dθ1|Y1),
I inference on θ2 given Y2 and θ̂1 using
p2(θ2|θ̂1, Y2) = p2(θ2|θ̂1) p2(Y2|θ̂1, θ2) / p2(Y2|θ̂1).
Uncertainty about θ1 is ignored in the estimation of θ2. Misspecification of the 2nd module doesn’t impact estimation of θ1.
Cut approach
One might want to propagate uncertainty without allowing “feedback” of the second module on the first module.
Cut distribution:
πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2).
Genuine if different from the posterior distribution under the joint model:
πcut(θ1, θ2; Y1, Y2) ∝ π(θ1, θ2|Y1, Y2) / p2(Y2|θ1).
Uncertainty about θ1 is propagated to the marginal of θ2.
The marginal of θ1 is p1(θ1|Y1), irrespective of the 2nd module.
Candidate distributions
For θ1, we can consider the following distributions.
I p1(θ1)
I p1(θ1|Y1)
I π(θ1|Y1, Y2)
but also. . .
I ∝ p1(θ1)p2(Y2|θ1)
I ∝ p1(θ1|Y1)p2(Y2|θ1)^γ
I etc.
We can consider, more generally,
π_{γ1,γ2}(θ1; Y1, Y2) ∝ p1(θ1) p1(Y1|θ1)^{γ1} p2(Y2|θ1)^{γ2}.
Grünwald, P. (2012). The safe Bayesian.
Candidate distributions
To infer the pair (θ1, θ2), for any marginal π1 on θ1, we can consider
I π1(θ1)p2(θ2|θ1)
I π1(θ1)p2(θ2|θ1, Y2)
I etc.
If π1 is reduced to a point mass δθ1, we retrieve plug-in approaches.
Example: biased data
Y1^1, . . . , Y1^{n1} i.i.d. from N(θ1, 1), with n1 = 100, θ1 = 0.
Y2^1, . . . , Y2^{n2} i.i.d. from N(θ1 + θ2, 1), with n2 = 1000, θ2 = 1.
Prior θ1 ∼ N(0, 1) and θ2 ∼ N(0, 0.1²).
Figures in the next slides show
I full posterior: π(θ1, θ2|Y1, Y2),
I module 1: p1(θ1|Y1),
I cut: p1(θ1|Y1)p2(θ2|θ1, Y2),
I module 2: ∝ p1(θ1)p2(θ2|θ1)p2(Y2|θ1, θ2),
I prior: p1(θ1)p2(θ2|θ1).
Example: biased data
Figure: densities for θ1 under the full posterior, module 1, module 2 and the prior.
Example: biased data
Figure: densities for θ2 under the cut distribution, full posterior, module 2, plug-in and the prior.
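In this conjugate Gaussian example every candidate is available in closed form. A minimal sketch (NumPy; variable names and the random seed are illustrative assumptions; sample sizes and priors follow the slides): module 1’s posterior for θ1 and the full-model posterior for (θ1, θ2) both come from standard Gaussian linear-model algebra, and the bias in module 2 visibly pulls the joint posterior for θ1 away from the truth.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, V = 100, 1000, 0.1**2
y1 = rng.normal(0.0, 1.0, n1)        # module 1 data, true theta1 = 0
y2 = rng.normal(0.0 + 1.0, 1.0, n2)  # module 2 data, biased: theta2 = 1

# Module 1 alone: theta1 | Y1 is Gaussian with precision 1 + n1.
m1 = y1.sum() / (1 + n1)
v1 = 1.0 / (1 + n1)

# Full model: Gaussian linear model for (theta1, theta2).
# Prior precision diag(1, 1/V); design: Y1 ~ theta1, Y2 ~ theta1 + theta2.
Lam = np.array([[1.0 + n1 + n2, n2],
                [n2, 1.0 / V + n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
mean_full = np.linalg.solve(Lam, b)

print("module 1 mean for theta1:", m1)
print("full posterior mean for (theta1, theta2):", mean_full)
```

With these settings the joint posterior mean of θ1 lands close to 0.5, reproducing the conflict displayed in the figures, while module 1 alone stays near 0.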
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
Predictive performance
I Consider the task of predicting a future observation Y .
I For a candidate π(θ), the predictive density is
y ↦ ∫Θ p(y|θ) π(dθ).
I Its expected utility, using the logarithmic scoring rule, is
∫Y log(∫Θ p(y|θ) π(dθ)) p⋆(dy),
where p⋆(dy) is the data-generating process.
I Other scoring rules are principled too.
Parry, Dawid & Lauritzen (2012). Proper local scoring rules.
Logarithmic scoring rule
Kullback–Leibler (KL) divergence from a distribution with density p′ to another distribution with density p:
KL(p, p′) = ∫ log(p(y)/p′(y)) p(y) dy.
The posterior(θ|Y) minimises
∫ (− log likelihood(Y|θ)) ν(θ) dθ + KL(ν(θ), prior(θ))
over all choices of ν(θ).
For a predictive distribution with density y ↦ π(y) and an observation Y, the logarithmic score is log π(Y).
Predictive performance
I A practical assessment of predictive performance can be obtained in the prequential approach.
I Sequentially obtain πt(θ) using data y1, . . . , yt and predict the next observation:
Σ_{t=1}^n log(∫Θ p(yt|θ) πt−1(dθ)).
I The log-evidence is retrieved when (πt) is the sequence of partial posteriors, p(θ), p(θ|y1), p(θ|y1, y2), etc.
Bernardo & Smith (1994). Bayesian Theory.
Dawid (1984). The prequential approach.
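The identity between the prequential sum and the log-evidence is easy to check numerically. A minimal sketch for the conjugate normal-location module (NumPy only; the data size and seed are arbitrary assumptions):

```python
import numpy as np

def norm_logpdf(x, m, v):
    # log density of N(m, v) at x
    return -0.5 * np.log(2 * np.pi * v) - (x - m) ** 2 / (2 * v)

rng = np.random.default_rng(0)
n = 50
y = rng.normal(0.0, 1.0, n)

# Prior theta ~ N(0, 1); after t observations the posterior is
# N(sum(y_1..t)/(1+t), 1/(1+t)), and the one-step-ahead predictive
# for y_{t+1} is N(posterior mean, posterior variance + 1).
prequential, s = 0.0, 0.0
for t, yt in enumerate(y):
    m, v = s / (1 + t), 1.0 / (1 + t)
    prequential += norm_logpdf(yt, m, v + 1.0)
    s += yt

# Closed-form log-evidence log p(y_1..n) for the same model.
log_evidence = (-0.5 * n * np.log(2 * np.pi)
                - 0.5 * np.log(1 + n)
                - 0.5 * (np.sum(y**2) - np.sum(y)**2 / (1 + n)))

print(prequential, log_evidence)  # the two quantities coincide
```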
Proposed plan
In the first module, θ1 is (usually) defined in its relation to Y1.
We propose to assess candidate distributions for θ1 based onpredictive performance for Y1.
To compare p1(θ1|Y1) and π(θ1|Y1, Y2), under the prequential approach and under the logarithmic scoring rule, we compare p1(Y1) with π(Y1|Y2).
If p1(Y1) > π(Y1|Y2), we can justify the use of distributions on (θ1, θ2) that admit p1(θ1|Y1) as first marginal, e.g. the cut distribution.
Example: biased data
Y1^1, . . . , Y1^{n1} i.i.d. from N(θ1, 1), with n1 = 100, θ1 = 0.
Y2^1, . . . , Y2^{n2} i.i.d. from N(θ1 + θ2, 1), with n2 = 1000, θ2 = 1.
Prior θ1 ∼ N(0, 1) and θ2 ∼ N(0, 0.1²).

candidate        predictive score for Y1
p1(θ1|Y1)        -144.5 [-144.5, -144.4]
p1(θ1)           -151.4 [-152, -150.6]
π(θ1|Y1, Y2)     -165 [-165.1, -165]
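In this Gaussian setting both quantities of the proposed comparison have closed forms. A minimal sketch (NumPy; the seed and simulated data are illustrative assumptions): p1(Y1) is the evidence of the normal-location module, and π(Y1|Y2) = π(Y1, Y2)/π(Y2), with all three evidences obtained from standard Gaussian linear-model identities.

```python
import numpy as np

def evidence_location(y, prior_var):
    # log of int prod_i N(y_i; mu, 1) N(mu; 0, prior_var) dmu
    n, lam0 = len(y), 1.0 / prior_var
    lam_n = lam0 + n
    return (-0.5 * n * np.log(2 * np.pi) + 0.5 * np.log(lam0 / lam_n)
            - 0.5 * (np.sum(y**2) - np.sum(y)**2 / lam_n))

rng = np.random.default_rng(2)
n1, n2, V = 100, 1000, 0.1**2
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(0.0 + 1.0, 1.0, n2)   # biased second sample, theta2 = 1

# log p1(Y1): normal-location evidence with prior variance 1.
log_p1_Y1 = evidence_location(y1, 1.0)

# log pi(Y1, Y2): evidence of the joint Gaussian linear model,
# prior precision diag(1, 1/V), design Y1 ~ th1, Y2 ~ th1 + th2.
Lam0 = np.diag([1.0, 1.0 / V])
Lam = Lam0 + np.array([[n1 + n2, n2], [n2, n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
n = n1 + n2
log_joint = (-0.5 * n * np.log(2 * np.pi)
             + 0.5 * (np.linalg.slogdet(Lam0)[1] - np.linalg.slogdet(Lam)[1])
             - 0.5 * (np.sum(y1**2) + np.sum(y2**2) - b @ np.linalg.solve(Lam, b)))

# log pi(Y2): Y2 is a location sample with mean th1 + th2 ~ N(0, 1 + V).
log_p_Y2 = evidence_location(y2, 1.0 + V)

log_cond = log_joint - log_p_Y2       # log pi(Y1 | Y2)
print(log_p1_Y1, log_cond)            # p1(Y1) clearly wins under bias
```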
Side note on prior vs posterior
Data-generating process p⋆:
1/2 N(−2, 1) + 1/2 N(+2, 1).
Likelihood: Y|θ ∼ N(θ, 1).
Prior θ ∼ N(0, 1).
Prior predictive N(0, 2).
Posterior predictive N(0, 1) (asymptotically).
The prior predictive is closer to p⋆ than the posterior predictive, under the KL divergence.
Example: epidemiological study
I Model of virus prevalence
∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),
Zi is the number of women infected with high-risk HPV in a sample of size Ni in country i. Beta(1,1) prior on each ϕi, independently.
I Impact of prevalence onto cervical cancer occurrence
∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,
Yi is the number of cancer cases arising from Ti woman-years of follow-up in country i. N(0, 10³) priors on θ2,1, θ2,2, independently (sorry, not my fault).
Plummer (2015). Cuts in Bayesian graphical models.
Example: epidemiological study
candidate        predictive score for Y1
p1(θ1|Y1)        -64.8 [-65, -64.7]
π(θ1|Y1, Y2)     -74.9 [-75.1, -74.5]
p1(θ1)           -262.7 [-276.6, -253.5]

. . . from which we can decide that the 2nd module does not help estimating θ1, which justifies the use of the cut distribution.
In the next figure, different candidate distributions for (θ2,1, θ2,2) are displayed:
I cut: p1(θ1|Y1)p2(θ2|θ1, Y2),
I full posterior: π(θ1, θ2|Y1, Y2),
I module 2: ∝ p1(θ1)p2(θ2|θ1)p2(Y2|θ1, θ2).
Example: epidemiological study
Figure: candidate distributions for (θ2,1, θ2,2): cut, full posterior and module 2.
Side note: prior vs posterior
Figure: density of the data-generating process p⋆, 1/2 N(−2, 1) + 1/2 N(+2, 1).
Side note: prior vs posterior
Figure: adds the density of the prior predictive N(0, 2).
Side note: prior vs posterior
Figure: adds the density of the posterior predictive ≈ N(0, 1).
Prior or posterior?
In the next figures. . .
I for R = 256 datasets of size n, drawn independently from p⋆, compute the posterior distribution πn(dθ),
I compute the posterior predictive density
Y ↦ qn(Y) = ∫Θ p(Y|θ) πn(dθ),
I compute the expected utility
∫Y log qn(Y) p⋆(dY),
I and plot the 256 results obtained for all n = 0, . . . , 1000.
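For the conjugate normal-location model this procedure is fully analytic: the posterior predictive after n observations is N(mn, sn² + 1), and its expected utility under p⋆ has a closed form. A minimal sketch (NumPy; the subset of n values, seed, and R = 256 replicates are illustrative, matching the slides where possible), for the misspecified mixture p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1):

```python
import numpy as np

rng = np.random.default_rng(3)
R = 256

def sample_pstar(n):
    # draw n points from the two-component mixture p*
    signs = rng.choice([-2.0, 2.0], size=n)
    return rng.normal(signs, 1.0)

def expected_utility(m, s2):
    # E_{p*}[log N(y; m, s2 + 1)] in closed form:
    # under p*, E[y] = 0 and E[y^2] = 5, so E[(y - m)^2] = 5 + m^2.
    v = s2 + 1.0
    return -0.5 * np.log(2 * np.pi * v) - (5.0 + m**2) / (2 * v)

utilities = {}
for n in [0, 10, 1000]:
    us = []
    for _ in range(R):
        y = sample_pstar(n)
        m, s2 = y.sum() / (1 + n), 1.0 / (1 + n)  # posterior N(m, s2)
        us.append(expected_utility(m, s2))
    utilities[n] = np.median(us)

print(utilities)
```

The median utility decreases with n: the prior predictive N(0, 2) (n = 0) beats the concentrating posterior predictive, as on the slides.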
Example: prior vs posterior, well-specified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = N(0, 1).
Figure: expected utility when predicting Y using n data points, for n = 0, . . . , 1000. Bands represent 10% and 90% quantiles over 256 different datasets; the dotted line represents the median.
Example: prior vs posterior, well-specified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = N(0, 1).
Figure: same quantities, with n on a logarithmic scale (n = 1 to 1000). Bands represent 10% and 90% quantiles over 256 different datasets; the dotted line represents the median.
Example: prior vs posterior, misspecified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1).
Figure: expected utility when predicting Y using n data points, for n = 0, . . . , 1000.
Example: prior vs posterior, misspecified setting
Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1).
Data-generating process: p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1).
Figure: same quantities, with n on a logarithmic scale (n = 1 to 1000).
Example: prior vs posterior, misspecified setting
Figure: density of the data-generating process p⋆, 1/2 N(−2, 1) + 1/2 N(+2, 1).
Example: prior vs posterior, misspecified setting
Figure: adds the density of the prior predictive N(0, 2).
Example: prior vs posterior, misspecified setting
Figure: adds the density of the posterior predictive ≈ N(0, 1).
Asymptotic predictions in misspecified settings
I Asymptotically, the posterior concentrates around
θ⋆ = argmin_θ KL(p⋆(dY) | p(dY|θ)),
i.e. around the θ⋆ that maximizes the expected utility.
I If we have to use a candidate distribution of the form π(dθ) = δθ(dθ), then the best choice is δθ⋆(dθ).
I But there might be many π(dθ) such that the predictive q(dY) = ∫Θ p(dY|θ) π(dθ) satisfies
KL(p⋆(dY) | q(dY)) < min_θ KL(p⋆(dY) | p(dY|θ)).
Joint or modularized approach?
I Consider the use of the second module (i.e. θ2, Y2) to improve the prediction of Y1.
I Biased data example:
θ1 ∼ N(0, 1), Y1 ∼ N(θ1, 1),
θ2 ∼ N(0, λ2⁻¹), Y2 ∼ N(θ1 + θ2, 1), with λ2 the prior precision on θ2.
I Data-generating process: Y1 ∼ N(0, 1) and Y2 ∼ N(θ2⋆, 1). We set n1 = 10 and plot the expected utility against n2, for two values of λ2 (vague / precise) and for two values of θ2⋆ (bias / no bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 0.01 (vague prior) and θ2⋆ = 0 (no bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 100 (precise prior) and θ2⋆ = 0 (no bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 0.01 (vague prior) and θ2⋆ = 1 (bias).
Biased data example: use of Y2 to predict Y1
Figure: expected utility against n2 (log scale); λ2 = 100 (precise prior) and θ2⋆ = 1 (bias).
Outline
1 Models made of modules
2 Candidate distributions
3 Choosing among candidates
4 Computational challenges
From toy models to real-world situations
Computationally, expected utilities typically involve:
I integrals with respect to the data distribution p⋆ on Y,
→ cross-validation, bootstrap, or sequential out-of-sample prediction (i.e. the prequential approach);
I integrals with respect to candidate distributions on θ,
→ sampling methods, e.g. MCMC, SMC, etc.
Sampling in the joint model approach
I Joint model posterior has (genuine) probability density
π(θ1, θ2|Y1, Y2) ∝ p1(θ1) p1(Y1|θ1) p2(θ2|θ1) p2(Y2|θ1, θ2).
I The computational complexity typically grows super-linearly with the number of modules.
I Difficulties stack up. . . intractability, multimodality, ridges, etc.
I Y1 and Y2 might not be simultaneously available, e.g. for confidentiality reasons.
Approximating constants
The proposed plan involves computing p1(Y1) and π(Y1|Y2). This can be seen as a normalizing constant estimation task. Challenging, but numerous methods have been proposed: bridge sampling, path sampling, nested sampling, Besag’s, Chib’s, Gelfand and Dey’s, Geyer’s, sequential Monte Carlo samplers, etc.
Del Moral, Doucet & Jasra (2006). Sequential Monte Carlo samplers.
Zhou, Johansen & Aston (2016). Toward automatic model comparison: an adaptive sequential Monte Carlo approach.
Sampling in the joint model approach: SMC way
I We can first approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).
I For each i ∈ {1, . . . , N1}, an SMC sampler
I approximates π(dθ2|θ1^i, Y2) by (θ2^{i,1}, . . . , θ2^{i,N2}),
I approximates π(Y2|θ1^i) by π̂^{N2}(Y2|θ1^i).
I Define wi ∝ π̂^{N2}(Y2|θ1^i).
I Then for any test function ϕ
Σ_{i=1}^{N1} wi (1/N2) Σ_{j=1}^{N2} ϕ(θ1^i, θ2^{i,j}) → ∫ ϕ(θ1, θ2) π(dθ1, dθ2|Y1, Y2)
in probability as N1, N2 → ∞.
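The mechanism can be illustrated on the conjugate biased-data model, replacing the inner SMC samplers by plain importance sampling from the second-module prior (an assumption made for brevity; the sample sizes, prior settings and seed are also illustrative). Stage 1 draws θ1 from p1(θ1|Y1) exactly; stage 2 estimates the feedback term p2(Y2|θ1) by Monte Carlo; reweighting then recovers joint-posterior expectations:

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, V = 30, 20, 1.0               # small sizes, vague theta2 prior
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(0.0 + 1.0, 1.0, n2)   # biased second sample

# Stage 1: theta1 | Y1 is N(sum(y1)/(1+n1), 1/(1+n1)); sample exactly.
N1, N2 = 2000, 500
th1 = rng.normal(y1.sum() / (1 + n1), np.sqrt(1.0 / (1 + n1)), N1)

# Stage 2: for each theta1 particle, estimate p2(Y2 | theta1) by averaging
# the likelihood over draws theta2 ~ p2(theta2 | theta1) = N(0, V).
th2 = rng.normal(0.0, np.sqrt(V), (N1, N2))
mu = th1[:, None] + th2                       # N1 x N2 means for Y2
loglik = (-0.5 * n2 * np.log(2 * np.pi)
          - 0.5 * (np.sum(y2**2) - 2 * mu * y2.sum() + n2 * mu**2))
m = loglik.max()
lw = m + np.log(np.mean(np.exp(loglik - m), axis=1))  # log p2-hat(Y2|th1)
w = np.exp(lw - lw.max())
w /= w.sum()                                  # normalized feedback weights

est_mean_th1 = np.sum(w * th1)                # joint posterior E[theta1]

# Exact joint posterior mean for comparison (Gaussian linear model).
Lam = np.array([[1.0 + n1 + n2, n2], [n2, 1.0 / V + n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
exact = np.linalg.solve(Lam, b)
print(est_mean_th1, exact[0])
```

Dropping the weights w (i.e. averaging uniformly) would instead approximate the cut distribution, which is the point of the approach.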
Sampling in the joint model approach: MCMC way
I Metropolis–Hastings with proposal distribution
q((θ1, θ2) → (dθ1′, dθ2′)) = π1^{N1}(dθ1′) q(θ2 → dθ2′),
i.e. θ1′ is drawn among (θ1^1, . . . , θ1^{N1}) uniformly at random.
I If we pretend that π1^{N1}(dθ1′) ≈ π1(dθ1′|Y1), the acceptance probability becomes
1 ∧ [p2(Y2|θ1′, θ2′) p2(θ2′|θ1′) q(θ2′ → θ2)] / [p2(Y2|θ1, θ2) p2(θ2|θ1) q(θ2 → θ2′)],
which does not involve Y1.
Lunn, D., Barrett, J., Sweeting, M., & Thompson, S. (2013). Fully Bayesian hierarchical modelling in two stages, with application to meta-analysis.
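A sketch of this two-stage sampler on the conjugate biased-data model (NumPy; the pool size, random-walk scale, chain length and data settings are illustrative assumptions, and the module 1 pool is drawn exactly rather than by MCMC). With a symmetric random walk on θ2 the q-terms cancel, and the acceptance ratio involves only the second module:

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, V = 30, 20, 1.0
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(1.0, 1.0, n2)
S2, T2 = y2.sum(), np.sum(y2**2)

def log_module2(th1, th2):
    # log p2(theta2 | theta1) + log p2(Y2 | theta1, theta2), up to constants
    mu = th1 + th2
    return -0.5 * th2**2 / V - 0.5 * (T2 - 2 * mu * S2 + n2 * mu**2)

# Stage 1 pool: draws from p1(theta1 | Y1) = N(sum(y1)/(1+n1), 1/(1+n1)).
pool = rng.normal(y1.sum() / (1 + n1), np.sqrt(1.0 / (1 + n1)), 5000)

# Stage 2: MH over (theta1, theta2); theta1' uniform from the pool,
# theta2' by symmetric random walk. Acceptance does not involve Y1.
th1, th2 = pool[0], 0.0
chain = np.empty((50000, 2))
for t in range(chain.shape[0]):
    th1p = rng.choice(pool)
    th2p = th2 + 0.3 * rng.normal()
    if np.log(rng.uniform()) < log_module2(th1p, th2p) - log_module2(th1, th2):
        th1, th2 = th1p, th2p
    chain[t] = th1, th2

# Compare with the exact joint posterior mean (Gaussian linear model).
Lam = np.array([[1.0 + n1 + n2, n2], [n2, 1.0 / V + n2]])
b = np.array([y1.sum() + y2.sum(), y2.sum()])
exact = np.linalg.solve(Lam, b)
print(chain[1000:].mean(axis=0), exact)
```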
Sampling from the cut distribution
I The cut distribution is defined as
πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2) ∝ π(θ1, θ2|Y1, Y2) / p2(Y2|θ1).
I The denominator is the feedback of the 2nd module on θ1:
p2(Y2|θ1) = ∫ p2(Y2|θ1, θ2) p2(dθ2|θ1).
I The feedback term is typically intractable.
Sampling [not] from the cut distribution
WinBUGS’ approach via the cut function: alternate between
I sampling θ1′ from K1(θ1 → dθ1′), targeting p1(dθ1|Y1);
I sampling θ2′ from K2_{θ1′}(θ2 → dθ2′), targeting p2(dθ2|θ1′, Y2).
This does not leave the cut distribution invariant: if θ1, θ2 ∼ πcut(dθ1, dθ2; Y1, Y2), then θ1′, θ2′ follows
πcut(dθ1′, dθ2′; Y1, Y2) ∫ [p2(θ2|θ1, Y2) / p2(θ2|θ1′, Y2)] K1(θ1′ → dθ1) K2_{θ1′}(θ2′ → dθ2).
Iterating the kernel K2_{θ1′} enough times mitigates the issue.
Plummer (2015). Cuts in Bayesian graphical models. Bayesian Analysis.
Sampling from the cut distribution
In a perfect world, we could sample i.i.d.
I θ1^i from p1(θ1|Y1),
I θ2^i given θ1^i from p2(θ2|θ1^i, Y2),
then (θ1^i, θ2^i) would be i.i.d. from the cut distribution.
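In the conjugate biased-data example this perfect world exists: both conditionals are Gaussian. A minimal sketch (NumPy; data settings and seed are illustrative assumptions): θ1 is drawn from p1(θ1|Y1), and θ2|θ1 from p2(θ2|θ1, Y2), a conjugate update with precision 1/V + n2:

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, V = 100, 1000, 0.1**2
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(1.0, 1.0, n2)

N = 100_000
# theta1 ~ p1(theta1 | Y1): module 1 only, unaffected by module 2.
m1, v1 = y1.sum() / (1 + n1), 1.0 / (1 + n1)
th1 = rng.normal(m1, np.sqrt(v1), N)

# theta2 | theta1 ~ p2(theta2 | theta1, Y2): conjugate Gaussian update,
# prior N(0, V), likelihood N(y2_i; theta1 + theta2, 1).
prec2 = 1.0 / V + n2
th2 = rng.normal((y2.sum() - n2 * th1) / prec2, np.sqrt(1.0 / prec2))

# The first marginal of the cut distribution is exactly p1(theta1 | Y1).
print(th1.mean(), m1)
```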
Sampling from the cut distribution
In an MCMC world, we can sample
I θ1^i approximately from p1(θ1|Y1) using MCMC,
I θ2^i given θ1^i approximately from p2(θ2|θ1^i, Y2) using MCMC,
then the resulting samples approximate the cut distribution, in the limit of the numbers of iterations, at both stages.
There is an alternative approach that
I is valid in a single asymptotic regime,
I enables simple constructions of confidence intervals.
Unbiased MCMC [on the side]
By coupling pairs of MCMC chains, we can produce (random) empirical measures
π̂(·) = Σ_{ℓ=1}^N ω^ℓ δ_{X^ℓ}(·)
that approximate a target π, in the sense that
E[Σ_{ℓ=1}^N ω^ℓ h(X^ℓ)] = ∫ h(x) π(dx),
for a test function h.
Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.
Heng & Jacob (2017). Unbiased Hamiltonian Monte Carlo with couplings.
Sampling from the cut distribution
In an unbiased MCMC world, we can approximate
I p1(dθ1|Y1) with a random measure
π̂1(dθ1) = Σ_{ℓ=1}^{N1} ω1^ℓ δ_{θ1^ℓ}(dθ1),
I p2(dθ2|θ1^ℓ, Y2) for any θ1^ℓ, with a random measure
π̂2(dθ2|θ1^ℓ) = Σ_{k=1}^{N2} ω2^{ℓ,k} δ_{θ2^{ℓ,k}}(dθ2).
Thus, by the tower property, we can unbiasedly estimate
∫ h(θ1, θ2) p1(dθ1|Y1) p2(dθ2|θ1, Y2).
Sampling from the cut distribution
We can thus construct unbiased estimators of
∫ h(θ1, θ2) πcut(dθ1, dθ2; Y1, Y2)
for test functions h.
We care about unbiasedness because. . .
I we can draw R independent copies of the estimator, H^r, for r = 1, . . . , R,
I and average the results: R⁻¹ Σ_{r=1}^R H^r,
I for which confidence intervals can be constructed.
Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.
Sampling from the cut distribution
Thus, by the tower property,
∫ h(θ1, θ2) p1(dθ1|Y1) p2(dθ2|θ1, Y2)
can be approximated with no bias by an estimator of the form
Σ_{ℓ=1}^{N1} ω1^ℓ Σ_{k=1}^{N2} ω2^{ℓ,k} h(θ1^ℓ, θ2^{ℓ,k}).
One can then sample independently R such estimators, and average the results to consistently approximate the cut distribution as R → ∞.
Sampling in the modularized approach: SMC way
I We can first approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).
I For each i ∈ {1, . . . , N1}, an SMC sampler approximates π(dθ2|θ1^i, Y2) by (θ2^{i,1}, . . . , θ2^{i,N2}).
I Then for any test function ϕ
(1/N1)(1/N2) Σ_{i=1}^{N1} Σ_{j=1}^{N2} ϕ(θ1^i, θ2^{i,j}) → ∫ ϕ(θ1, θ2) π(dθ2|θ1, Y2) π1(dθ1|Y1)
in probability as N1, N2 → ∞.
I The SMC approach yields approximations of the joint and the cut distributions simultaneously, by using or not the estimated feedback terms π̂^{N2}(Y2|θ1^i).
Sampling in the modularized approach: MCMC way
I First approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).
I Run a (long enough) MCMC targeting π(dθ2|θ1^i, Y2) for each i ∈ {1, . . . , N1}. . . independently?
I Introduce swap moves, as in parallel tempering:
I select k, ℓ in {1, . . . , N1} uniformly at random;
I swap θ2^k with θ2^ℓ with probability
1 ∧ [πcut(θ1^k, θ2^ℓ) πcut(θ1^ℓ, θ2^k)] / [πcut(θ1^k, θ2^k) πcut(θ1^ℓ, θ2^ℓ)]
= 1 ∧ [p2(θ2^ℓ|θ1^k) p2(Y2|θ1^k, θ2^ℓ) p2(θ2^k|θ1^ℓ) p2(Y2|θ1^ℓ, θ2^k)] / [p2(θ2^k|θ1^k) p2(Y2|θ1^k, θ2^k) p2(θ2^ℓ|θ1^ℓ) p2(Y2|θ1^ℓ, θ2^ℓ)].
Discussion
I Models made of multiple modules abound: forcings in dynamical systems, computer models, causal inference with propensity scores, PKPD, air pollution data, meta-analysis, generated regressors, regression with functional predictors, linkage disequilibrium modelling. . .
I Departures from the joint model approach can be justified in a decision-theoretic framework.
I These enable propagation of uncertainties but with a control of the feedback of some variables on others.
I Is the cut distribution a Bayesian equivalent of two-step estimators routinely used in econometrics?
I The comparison between candidates and the approximation of the cut distribution raise computational challenges.
References
Technical report at arxiv.org/abs/1708.08719.
Unbiased MCMC report at arxiv.org/abs/1708.03625.
Appendix: unbiased MCMC
Consider two Markov chains (Xt) and (Yt) generated as follows:
I sample X0 and Y0 from π0, an initial distribution,
I sample X1 ∼ P(X0, ·), with a π-invariant Markov kernel P,
I for t ≥ 1, sample (Xt+1, Yt) ∼ P̄((Xt, Yt−1), ·), where P̄ is a coupling of P with itself.
Each chain marginally evolves according to P.
Xt and Yt have the same distribution for all t ≥ 0.
P̄ is such that there exists a meeting time τ with Xt = Yt−1 for all t ≥ τ.
Appendix: unbiased MCMC
Marginal convergence of each chain:
Eπ[h(X)] = lim_{t→∞} E[h(Xt)].
Limit as a telescopic sum, for all k ≥ 0,
lim_{t→∞} E[h(Xt)] = E[h(Xk)] + Σ_{t=k+1}^∞ E[h(Xt) − h(Xt−1)].
Since Yt has the same distribution as Xt, then
Eπ[h(X)] = E[h(Xk)] + Σ_{t=k+1}^∞ E[h(Xt) − h(Yt−1)].
Appendix: unbiased MCMC
If we can swap expectation and limit,
Eπ[h(X)] = E[h(Xk) + Σ_{t=k+1}^∞ (h(Xt) − h(Yt−1))],
the random variable
h(Xk) + Σ_{t=k+1}^∞ (h(Xt) − h(Yt−1))
is an unbiased estimator of Eπ[h(X)].
We can compute the “infinite” sum since Xt = Yt−1 for all t ≥ τ.
Appendix: unbiased MCMC
The proposed random measure estimator is given by
π̂(·) = δ_{Xk}(·) + Σ_{t=k+1}^{τ−1} (δ_{Xt}(·) − δ_{Yt−1}(·)) ≡ Σ_{ℓ=1}^N ω^ℓ δ_{Z^ℓ}(·).
Variant of Glynn & Rhee’s estimator.
Glynn & Rhee (2014). Exact estimation for Markov chain equilibrium expectations.
Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.
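A compact illustration of the construction (Python; the target, proposal scale, choice of k and replication count are illustrative assumptions, not from the slides): two random-walk Metropolis–Hastings chains targeting a standard normal are coupled through a maximal coupling of their proposals and a shared acceptance uniform, so that once they meet they stay together; averaging independent copies of h(Xk) + Σ_{t=k+1}^{τ−1}(h(Xt) − h(Yt−1)) then recovers Eπ[h(X)] without burn-in bias.

```python
import numpy as np

rng = np.random.default_rng(7)
SIG = 1.0                                  # random-walk proposal std

def log_pi(x):                             # target: standard normal
    return -0.5 * x * x

def prop_logpdf(x, mu):                    # log density of N(mu, SIG^2)
    return -0.5 * ((x - mu) / SIG) ** 2

def max_coupling(mu1, mu2):
    # Maximal coupling of N(mu1, SIG^2) and N(mu2, SIG^2).
    x = rng.normal(mu1, SIG)
    if np.log(rng.uniform()) + prop_logpdf(x, mu1) <= prop_logpdf(x, mu2):
        return x, x
    while True:
        y = rng.normal(mu2, SIG)
        if np.log(rng.uniform()) + prop_logpdf(y, mu2) > prop_logpdf(y, mu1):
            return x, y

def coupled_step(x, y):
    xp, yp = max_coupling(x, y)
    lu = np.log(rng.uniform())             # shared acceptance variable
    return (xp if lu < log_pi(xp) - log_pi(x) else x,
            yp if lu < log_pi(yp) - log_pi(y) else y)

def unbiased_estimator(h, k=5):
    x, y = rng.normal(), rng.normal()      # X0, Y0 ~ pi0 = N(0, 1)
    z = rng.normal(x, SIG)                 # one step of P for X only
    xp = z if np.log(rng.uniform()) < log_pi(z) - log_pi(x) else x
    hist_x, hist_y = [x, xp], [y]
    while not (len(hist_x) > k and hist_x[-1] == hist_y[-1]):
        xn, yn = coupled_step(hist_x[-1], hist_y[-1])
        hist_x.append(xn)
        hist_y.append(yn)
    est = h(hist_x[k])                     # h(X_k) + telescoping corrections
    for t in range(k + 1, len(hist_x)):
        if hist_x[t] == hist_y[t - 1]:     # chains have met: sum truncates
            break
        est += h(hist_x[t]) - h(hist_y[t - 1])
    return est

ests = [unbiased_estimator(lambda x: x) for _ in range(2000)]
print(np.mean(ests))                       # close to E_pi[X] = 0
```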
Appendix: unbiased MCMC
Conditions ensuring correctness, finite cost, and finite variance, for a test function h.
1. The marginal chain converges: E[h(Xt)] → Eπ[h(X)], and ∃η > 0, D < ∞ such that E[|h(Xt)|^{2+η}] < D.
2. The meeting time τ has geometric tails:
∃C < +∞, ∃δ ∈ (0, 1), ∀t ≥ 1, P(τ > t) ≤ C δ^t.
3. The chains stay together: Xt = Yt−1 for all t ≥ τ.
(Sufficient but not necessary!)