Better together? Statistical learning in models made of modules

Christian P. Robert (Paris-Dauphine & Warwick), joint work with Pierre E. Jacob (Harvard), Chris Holmes (Oxford), and Lawrence Murray (Uppsala)

PC Mahalanobis 125th B’day

Posted 18-Mar-2018 by christian-robert

TRANSCRIPT

Page 1: better together? statistical learning in models made of modules

Better together? Statistical learning in models made of modules

Christian P. Robert (Paris-Dauphine & Warwick), joint work with Pierre E. Jacob (Harvard), and

Chris Holmes (Oxford), Lawrence Murray (Uppsala)

PC Mahalanobis 125th B’day

Page 2: better together? statistical learning in models made of modules

Story line

Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. This setting recoups hierarchical, super, meta-analysis, and coupled models.

Page 3: better together? statistical learning in models made of modules

Story line

Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. Conventional statistical (Bayesian) updating tackles all modules jointly, with the advantage that all uncertainties can be treated simultaneously and coherently.

Page 4: better together? statistical learning in models made of modules

Story line

Statistical model assembled from different components, renamed here modules. Each module possibly developed by a separate user/community, or built from specific domain knowledge of a particular data modality. However, when information flows both ways between any pair of modules, misspecification of either leads to misspecification of the full model and misleading quantification of uncertainties. Departing from learning with the full model may thus be beneficial, along with other motivations to eschew the full model, e.g., computational constraints and data confidentiality.

Page 5: better together? statistical learning in models made of modules

Disclaimer!

This work is unrelated to the recent referendum on Scottish independence and is not intended to support one side versus the other!

Page 6: better together? statistical learning in models made of modules

Outline

1 Models made of modules

2 Candidate distributions

3 Choosing among candidates

4 Computational challenges

Page 7: better together? statistical learning in models made of modules

Models made of modules

- First module: parameter θ1, data Y1,

prior: p1(θ1), likelihood: p1(Y1|θ1).

- Second module: parameter θ2, data Y2,

prior: p2(θ2|θ1), likelihood: p2(Y2|θ1, θ2).

We are interested in the estimation of θ1, θ2, or both.

Page 8: better together? statistical learning in models made of modules

Joint model approach

Parameter (θ1, θ2), with prior

p(θ1, θ2) = p1(θ1)p2(θ2|θ1).

Data (Y1, Y2), likelihood

p(Y1, Y2|θ1, θ2) = p1(Y1|θ1)p2(Y2|θ1, θ2).

Posterior distribution

π (θ1, θ2|Y1, Y2) ∝ p1 (θ1) p1(Y1|θ1)p2 (θ2|θ1) p2 (Y2|θ1, θ2).

Page 9: better together? statistical learning in models made of modules

Example: PKPD

- Pharmacokinetics:

∀t, Yt ∼ N(log Ct, τ²PK), where Ct = ψ(t, θPK),

concentration of substances/drugs in organisms, i.e., the body’s effect on a drug.

- Pharmacodynamics:

∀j, Zj ∼ N(Ej, τ²PD), where Ej = ϕ(Cj, θPD),

effect of substances/drugs on health, i.e., the drug’s effect on the body.

Lunn, Best, Spiegelhalter, Graham & Neuenschwander (2009). Combining MCMC with ‘sequential’ PKPD modelling.

Page 10: better together? statistical learning in models made of modules

Example: longitudinal + time to event medical data

- Longitudinal model:

∀i, j: Yij = X′i1 β1 + U0i + U1i tij + Zij,

glomerular filtration rate (kidney function), onto serum creatinine, age, gender, follow-up time.

- Hazard model:

∀i, t: hi(t) = h0(t) exp(X′i2 β2 + γ(U0i + U1i t)),

time to initiation of renal replacement therapy, onto covariates and fitted time trend.

Asar, Ritchie, Kalra & Diggle (2015). Joint modelling of repeated measurement and time-to-event data: an introductory tutorial.

Page 11: better together? statistical learning in models made of modules

Example: two-step regressions

- First regression model,

∀t: DMt = X1t θ + DMRt,

proportional growth in the M1 definition of money onto two annual lags, current deviations of real federal expenditures, lagged unemployment.

- Second regression model,

∀t: log( UNt / (1 − UNt) ) = X2t β + γ(L) DMRt + ut,

annual average unemployment rate onto level of minimum wage, measure of military conscription; γ(L) is a 2nd-order polynomial (in the lag operator).

Murphy & Topel (1985). Estimation and Inference in Two-Step Econometric Models.

Page 12: better together? statistical learning in models made of modules

Example: biased data

- Normal location model:

∀i = 1, . . . , n1: Y1^i ∼ N(θ1, 1), prior θ1 ∼ N(0, 1).

- Extra data Y2 suspected to be biased:

∀i = 1, . . . , n2: Y2^i ∼ N(θ1 + θ2, 1), prior θ2 ∼ N(0, V).

Liu, Bayarri & Berger (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models.
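For concreteness, a small simulation sketch of this two-module pair (all values are hypothetical, using the sample sizes that appear later in the slides; V is taken to be the prior variance of θ2). The first module alone is conjugate, so its posterior is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instance of the two-module model:
# Y1^i ~ N(theta1, 1), prior theta1 ~ N(0, 1);
# Y2^i ~ N(theta1 + theta2, 1), prior theta2 ~ N(0, V).
theta1_true, theta2_true, V = 0.0, 1.0, 0.1 ** 2
n1, n2 = 100, 1000
y1 = rng.normal(theta1_true, 1.0, n1)
y2 = rng.normal(theta1_true + theta2_true, 1.0, n2)  # biased sample

# First module alone, by conjugacy:
# theta1 | Y1 ~ N(n1 * ybar1 / (n1 + 1), 1 / (n1 + 1)).
prec1 = n1 + 1.0
m1 = n1 * y1.mean() / prec1
s1 = prec1 ** -0.5
print(m1, s1)  # posterior mean near theta1_true = 0, posterior sd ~ 0.1
```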

Page 13: better together? statistical learning in models made of modules

Example: epidemiological study

I Model of virus prevalence

∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),

Zi is the number of women infected with high-risk HPV in a sample of size Ni in country i.

I Impact of prevalence onto cervical cancer occurrence

∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,

Yi is the number of cancer cases arising from Ti woman-years of follow-up in country i.

Plummer (2015). Cuts in Bayesian graphical models.

Page 14: better together? statistical learning in models made of modules

Example: state space models

- Geophysics model of the temperature of the ocean, ϕt.

- The temperature ϕt is used in a model of plankton population sizes βt, e.g. through an SDE:

dβt = µ(βt, ϕt) dt + σ(βt, ϕt) dWt.

- Should we propagate the uncertainty of ϕt to the biology?

- Should we allow feedback from the biological data to ϕt?

Parslow, J., Cressie, N., Campbell, E. P., Jones, E., & Murray, L. (2013). Bayesian learning and predictability in a stochastic nonlinear dynamical model. Ecological Applications, 23(4), 679-698.

Page 15: better together? statistical learning in models made of modules

Example: causal inference with propensity scores

- Exposure to treatment variable X ∈ {0, 1}, outcome variable Y ∈ {0, 1}. Causal effect of X on Y?

- Covariates C. Assume no unmeasured confounders and a non-randomised experiment.

- Logistic propensity score ei = P(Xi = 1|Ci) for individual i.

- Logistic regression of Y on X for groups of individuals with similar values of e.

- Either a joint model (X, Y, C), or a two-step approach, (X, C) and then (X, Y, e), can be envisioned.

Zigler, Watts, Yeh, Wang, Coull & Dominici (2013). Model feedback in Bayesian propensity score estimation.

Page 16: better together? statistical learning in models made of modules

Example: Bayesian k-mean

- When learning a Bayesian k-mean model on learning data X ∼ fX(x|θ)

- and extrapolating on unlabeled data Z ∼ fZ(z|θ, x),

- a large size of Z leads to an uncertainty swamp.

Cucala, L., Marin, J.M., Robert, C.P., and Titterington, D.M. (2009). A Bayesian reassessment of nearest-neighbour classification.

Page 17: better together? statistical learning in models made of modules

Joint model approach

- The joint model approach uses all data to simultaneously infer all parameters. . .

- . . . so that uncertainty about θ1 is propagated to the estimation of θ2. . .

- . . . but misspecification of the 2nd module can damage the estimation of θ1.

- What about allowing uncertainty propagation, but preventing feedback of some modules on others. . .

- . . . and still remaining within the Bayesian safety net?

Page 18: better together? statistical learning in models made of modules

Models/modules of interest

- The module of interest might be (θ1, Y1), with Y2 extra data available to update inference on θ1, through a model with extra parameter θ2.

- The module of interest might be (θ2, Y2), with unknown extra parameter θ1, to be learned with data Y1.

- The first module (θ1, Y1) might be of interest for a certain community, and the second module (θ2, Y2) for another community.


Page 21: better together? statistical learning in models made of modules

Outline

1 Models made of modules

2 Candidate distributions

3 Choosing among candidates

4 Computational challenges

Page 22: better together? statistical learning in models made of modules

Plug-in approach

A simple (naïve?) approach consists in

- estimating θ1 given Y1 first, e.g. θ̂1 = ∫ θ1 p1(dθ1|Y1),

- inference on θ2 given Y2 and θ̂1 using

p2(θ2|θ̂1, Y2) = p2(θ2|θ̂1) p2(Y2|θ̂1, θ2) / p2(Y2|θ̂1).

Uncertainty about θ1 is ignored in the estimation of θ2. Misspecification of the 2nd module doesn’t impact the estimation of θ1.

Page 23: better together? statistical learning in models made of modules

Cut approach

One might want to propagate uncertainty without allowing “feedback” of the second module on the first module.

Cut distribution:

πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2).

Genuine if different from the posterior distribution under the joint model:

πcut(θ1, θ2; Y1, Y2) ∝ π(θ1, θ2|Y1, Y2) / p2(Y2|θ1).

Uncertainty of θ1 is propagated to the marginal of θ2.

Marginal of θ1 is p1(θ1|Y1), irrespective of the 2nd module.
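Since both modules of the upcoming biased-data example are Gaussian, this identity can be checked numerically. The sketch below uses simulated (hypothetical) data and standard conjugate formulas; on a grid, the log cut density and the log joint posterior minus the log feedback term differ only by a constant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for the conjugate biased-data example
# (n1 = 100, n2 = 1000, V = 0.1^2 as prior variance of theta2).
n1, n2, V = 100, 1000, 0.1 ** 2
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(1.0, 1.0, n2)

t1 = np.linspace(-0.5, 0.5, 51)[:, None]   # theta1 grid
t2 = np.linspace(0.0, 1.5, 61)[None, :]    # theta2 grid

# log joint posterior, up to an additive constant
log_joint = (-0.5 * t1 ** 2 - 0.5 * t2 ** 2 / V
             + n1 * y1.mean() * t1 - 0.5 * n1 * t1 ** 2
             + n2 * y2.mean() * (t1 + t2) - 0.5 * n2 * (t1 + t2) ** 2)

# closed-form feedback: Y2 | theta1 ~ N(theta1 * 1, I + V * 11')
sum_r = n2 * (y2.mean() - t1)
sum_r2 = np.sum(y2 ** 2) - 2 * t1 * np.sum(y2) + n2 * t1 ** 2
log_feedback = -0.5 * (sum_r2 - V * sum_r ** 2 / (1 + n2 * V))

# cut density: p1(theta1 | Y1) p2(theta2 | theta1, Y2), both Gaussian
prec1, prec2 = n1 + 1.0, n2 + 1.0 / V
m1 = n1 * y1.mean() / prec1
mu2 = n2 * (y2.mean() - t1) / prec2
log_cut = -0.5 * prec1 * (t1 - m1) ** 2 - 0.5 * prec2 * (t2 - mu2) ** 2

diff = (log_joint - log_feedback) - log_cut
print(diff.std())  # ~ 0: equal up to a normalizing constant
```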

Page 24: better together? statistical learning in models made of modules

Candidate distributions

For θ1, we can consider the following distributions:

- p1(θ1)

- p1(θ1|Y1)

- π(θ1|Y1, Y2)

but also. . .

- ∝ p1(θ1) p2(Y2|θ1)

- ∝ p1(θ1|Y1) p2(Y2|θ1)^γ

- etc.

We can consider, more generally,

πγ1,γ2(θ1; Y1, Y2) ∝ p1(θ1) p1(Y1|θ1)^γ1 p2(Y2|θ1)^γ2.

Grünwald, P. (2012) The safe Bayesian.

Page 25: better together? statistical learning in models made of modules

Candidate distributions

To infer the pair (θ1, θ2), for any marginal π1 on θ1, we can consider

- π1(θ1) p2(θ2|θ1)

- π1(θ1) p2(θ2|θ1, Y2)

- etc.

If π1 is reduced to δθ1, we retrieve plug-in approaches.

Page 26: better together? statistical learning in models made of modules

Example: biased data

Y1^1, . . . , Y1^{n1} i.i.d. from N(θ1, 1), with n1 = 100, θ1 = 0.

Y2^1, . . . , Y2^{n2} i.i.d. from N(θ1 + θ2, 1), with n2 = 1000, θ2 = 1.

Prior θ1 ∼ N(0, 1) and θ2 ∼ N(0, 0.1²).

Figures in the next slides show

- full posterior: π(θ1, θ2|Y1, Y2),

- module 1: p1(θ1|Y1),

- cut: p1(θ1|Y1) p2(θ2|θ1, Y2),

- module 2: ∝ p1(θ1) p2(θ2|θ1) p2(Y2|θ1, θ2),

- prior: p1(θ1) p2(θ2|θ1).

Page 27: better together? statistical learning in models made of modules

Example: biased data

[Figure: densities of the full posterior, module 1, module 2, and prior for θ1.]

Page 28: better together? statistical learning in models made of modules

Example: biased data

[Figure: densities of the cut, full posterior, module 2, plug-in, and prior for θ2.]

Page 29: better together? statistical learning in models made of modules

Outline

1 Models made of modules

2 Candidate distributions

3 Choosing among candidates

4 Computational challenges

Page 30: better together? statistical learning in models made of modules

Predictive performance

- Consider the task of predicting a future observation Y.

- For a candidate π(θ), the predictive density is

y ↦ ∫_Θ p(y|θ) π(dθ).

- Its expected utility, using the logarithmic scoring rule, is

∫_Y log( ∫_Θ p(y|θ) π(dθ) ) p⋆(dy),

where p⋆(dy) is the data-generating process.

- Other scoring rules are principled too.

Parry, Dawid & Lauritzen (2012). Proper local scoring rules.

Page 31: better together? statistical learning in models made of modules

Logarithmic scoring rule

Kullback–Leibler (KL) divergence from a distribution with density p′, to another distribution with density p,

KL(p, p′) = ∫ log( p(y)/p′(y) ) p(y) dy.

The posterior(θ|Y) minimises

∫ (− log likelihood(Y|θ)) ν(θ) dθ + KL(ν(θ), prior(θ))

over all choices of ν(θ).

For a predictive distribution with density y ↦ π(y) and an observation Y, the logarithmic score is log π(Y).

Page 32: better together? statistical learning in models made of modules

Predictive performance

- A practical assessment of predictive performance can be obtained in the prequential approach.

- Sequentially obtain πt(θ) using data y1, . . . , yt and predict the next observation:

Σ_{t=1}^{n} log( ∫_Θ p(yt|θ) πt−1(dθ) ).

- The log-evidence is retrieved when (πt) is the sequence of partial posteriors, p(θ), p(θ|y1), p(θ|y1, y2), etc.

Bernardo & Smith (1994). Bayesian Theory.
Dawid (1984). The prequential approach.

Page 33: better together? statistical learning in models made of modules

Proposed plan

In the first module, θ1 is (usually) defined in its relation to Y1.

We propose to assess candidate distributions for θ1 based on predictive performance for Y1.

To compare p1(θ1|Y1) and π(θ1|Y1, Y2), under the prequential approach and under the logarithmic scoring rule, we compare p1(Y1) with π(Y1|Y2).

If p1(Y1) > π(Y1|Y2), we can justify the use of distributions on (θ1, θ2) that admit p1(θ1|Y1) as first marginal, e.g. the cut distribution.

Page 34: better together? statistical learning in models made of modules

Example: biased data

Y1^1, . . . , Y1^{n1} i.i.d. from N(θ1, 1), with n1 = 100, θ1 = 0.

Y2^1, . . . , Y2^{n2} i.i.d. from N(θ1 + θ2, 1), with n2 = 1000, θ2 = 1.

Prior θ1 ∼ N(0, 1) and θ2 ∼ N(0, 0.1²) (precision).

candidate | predictive score for Y1
p1(θ1|Y1) | −144.5 [−144.5, −144.4]
p1(θ1) | −151.4 [−152, −150.6]
π(θ1|Y1, Y2) | −165 [−165.1, −165]

Page 35: better together? statistical learning in models made of modules

Side note on prior vs posterior

Data-generating process p⋆:

1/2 N(−2, 1) + 1/2 N(+2, 1).

Likelihood: Y|θ ∼ N(θ, 1).

Prior θ ∼ N(0, 1).

Prior predictive N(0, 2).

Posterior predictive N(0, 1) (asymptotically).

The prior predictive is closer to p⋆ than the posterior predictive, under the KL divergence.
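This side note can be verified by a quick Monte Carlo computation (a sketch, with an arbitrary sample size): since the entropy of p⋆ is common to both comparisons, comparing cross-entropies under p⋆ is equivalent to comparing KL divergences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Under p* = 0.5 N(-2,1) + 0.5 N(2,1), the prior predictive N(0,2)
# scores better than the asymptotic posterior predictive N(0,1).
def log_normal_pdf(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

n = 1_000_000
comp = rng.integers(0, 2, n)                  # mixture component indicator
y = rng.normal(np.where(comp == 0, -2.0, 2.0), 1.0)

# KL(p*, q) = E_{p*}[-log q] - entropy(p*); the entropy term is common,
# so cross-entropies are enough for the comparison.
ce_prior = -log_normal_pdf(y, 0.0, 2.0).mean()   # prior predictive N(0, 2)
ce_post = -log_normal_pdf(y, 0.0, 1.0).mean()    # posterior predictive N(0, 1)
print(ce_prior, ce_post)  # the prior predictive attains the smaller value
```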

Page 36: better together? statistical learning in models made of modules

Example: epidemiological study

- Model of virus prevalence

∀i = 1, . . . , I: Zi ∼ Binomial(Ni, ϕi),

Zi is the number of women infected with high-risk HPV in a sample of size Ni in country i. Beta(1,1) prior on each ϕi, independently.

- Impact of prevalence onto cervical cancer occurrence

∀i = 1, . . . , I: Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2 ϕi,

Yi is the number of cancer cases arising from Ti woman-years of follow-up in country i. N(0, 10³) prior on θ2,1, θ2,2, independently (sorry, not my fault).

Plummer (2015). Cuts in Bayesian graphical models.

Page 37: better together? statistical learning in models made of modules

Example: epidemiological study

candidate | predictive score for Y1
p1(θ1|Y1) | −64.8 [−65, −64.7]
π(θ1|Y1, Y2) | −74.9 [−75.1, −74.5]
p1(θ1) | −262.7 [−276.6, −253.5]

. . . from which we can decide that the 2nd module does not help estimating θ1, which justifies the use of the cut distribution.

In the next figure, different candidate distributions for (θ2,1, θ2,2) are displayed:

- cut: p1(θ1|Y1) p2(θ2|θ1, Y2),

- full posterior: π(θ1, θ2|Y1, Y2),

- module 2: ∝ p1(θ1) p2(θ2|θ1) p2(Y2|θ1, θ2).

Page 38: better together? statistical learning in models made of modules

Example: epidemiological study

[Figure: samples of (θ2,1, θ2,2) under the cut, full posterior, and module 2 distributions.]

Page 39: better together? statistical learning in models made of modules

Side note: prior vs posterior

[Figure: density of the data-generating process p⋆, 1/2 N(−2, 1) + 1/2 N(+2, 1).]

Page 40: better together? statistical learning in models made of modules

Side note: prior vs posterior

[Figure: + density of the prior predictive N(0, 2).]

Page 41: better together? statistical learning in models made of modules

Side note: prior vs posterior

[Figure: + density of the posterior predictive ≈ N(0, 1).]

Page 42: better together? statistical learning in models made of modules

Prior or posterior?

In the next figures. . .

- for R = 256 datasets of size n, drawn independently from p⋆, compute the posterior distribution πn(dθ),

- compute the posterior predictive density

Y ↦ qn(Y) = ∫_Θ p(Y|θ) πn(dθ),

- compute the expected utility

∫_Y log qn(Y) p⋆(dY),

- and plot the 256 results obtained for all n = 0, . . . , 1000.
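A Monte Carlo sketch of this recipe for a single dataset (sizes are hypothetical), in the misspecified setting used a few slides below; the posterior and posterior predictive of the normal location model are conjugate, so only the two expectations over p⋆ are estimated by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit theta ~ N(0,1), Y|theta ~ N(theta,1) on n draws from the
# misspecified p* = 0.5 N(-2,1) + 0.5 N(2,1), then estimate the
# expected utility by averaging the posterior-predictive log density
# over fresh draws from p*.
def log_normal_pdf(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

def draw_pstar(size):
    comp = rng.integers(0, 2, size)
    return rng.normal(np.where(comp == 0, -2.0, 2.0), 1.0)

n = 100
data = draw_pstar(n)
post_mean = n * data.mean() / (n + 1.0)   # conjugate posterior N(m_n, v_n)
post_var = 1.0 / (n + 1.0)

# the posterior predictive is N(m_n, 1 + v_n) in closed form
y_test = draw_pstar(100_000)
utility = log_normal_pdf(y_test, post_mean, 1.0 + post_var).mean()
print(utility)
```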

Page 43: better together? statistical learning in models made of modules

Example: prior vs posterior, well-specified setting

Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1). Data-generating process: p⋆ = N(0, 1).

[Figure: expected utility against n = 0, . . . , 1000 (linear scale).]

Figure: Predicting Y using n data points.

Bands represent 10% and 90% quantiles over 256 different datasets; the dotted line represents the median.

Page 44: better together? statistical learning in models made of modules

Example: prior vs posterior, well-specified setting

Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1). Data-generating process: p⋆ = N(0, 1).

[Figure: expected utility against n = 1, . . . , 1000 (log scale).]

Figure: Predicting Y using n data points.

Bands represent 10% and 90% quantiles over 256 different datasets; the dotted line represents the median.

Page 45: better together? statistical learning in models made of modules

Example: prior vs posterior, misspecified setting

Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1). Data-generating process: p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1).

[Figure: expected utility against n = 0, . . . , 1000 (linear scale).]

Figure: Predicting Y using n data points.

Page 46: better together? statistical learning in models made of modules

Example: prior vs posterior, misspecified setting

Normal location model: θ ∼ N(0, 1), Y ∼ N(θ, 1). Data-generating process: p⋆ = 1/2 N(−2, 1) + 1/2 N(+2, 1).

[Figure: expected utility against n = 1, . . . , 1000 (log scale).]

Figure: Predicting Y using n data points.

Page 47: better together? statistical learning in models made of modules

Example: prior vs posterior, misspecified setting

[Figure: density of the data-generating process p⋆, 1/2 N(−2, 1) + 1/2 N(+2, 1).]

Page 48: better together? statistical learning in models made of modules

Example: prior vs posterior, misspecified setting

[Figure: + density of the prior predictive N(0, 2).]

Page 49: better together? statistical learning in models made of modules

Example: prior vs posterior, misspecified setting

[Figure: + density of the posterior predictive ≈ N(0, 1).]

Page 50: better together? statistical learning in models made of modules

Asymptotic predictions in misspecified settings

- Asymptotically, the posterior concentrates around

θ⋆ = argmin_θ KL(p⋆(dY) | p(dY|θ)),

i.e. around the θ⋆ that maximizes the expected utility.

- If we have to use a candidate distribution of the form π(dθ) = δθ(dθ), then the best choice is δθ⋆(dθ).

- But there might be many π(dθ) such that the predictive q(dY) = ∫_Θ p(dY|θ) π(dθ) satisfies

KL(p⋆(dY) | q(dY)) < min_θ KL(p⋆(dY) | p(dY|θ)).

Page 51: better together? statistical learning in models made of modules

Joint or modularized approach?

- Consider the use of the second module (i.e. θ2, Y2) to improve the prediction of Y1.

- Biased data example: θ1 ∼ N(0, 1), Y1 ∼ N(θ1, 1), θ2 ∼ N(0, λ2), Y2 ∼ N(θ1 + θ2, 1).

- Data-generating process: Y1 ∼ N(0, 1) and Y2 ∼ N(θ⋆2, 1). We set n1 = 10 and plot the expected utility against n2, for two values of λ2 (vague / precise) and for two values of θ⋆2 (bias / no bias).

Page 52: better together? statistical learning in models made of modules

Biased data example: use of Y2 to predict Y1

[Figure: expected utility against n2 = 1, . . . , 1000 (log scale).]

Figure: λ2 = 0.01 (vague prior) and θ⋆2 = 0 (no bias).

Page 53: better together? statistical learning in models made of modules

Biased data example: use of Y2 to predict Y1

[Figure: expected utility against n2 = 1, . . . , 1000 (log scale).]

Figure: λ2 = 100 (precise prior) and θ⋆2 = 0 (no bias).

Page 54: better together? statistical learning in models made of modules

Biased data example: use of Y2 to predict Y1

[Figure: expected utility against n2 = 1, . . . , 1000 (log scale).]

Figure: λ2 = 0.01 (vague prior) and θ⋆2 = 1 (bias).

Page 55: better together? statistical learning in models made of modules

Biased data example: use of Y2 to predict Y1

[Figure: expected utility against n2 = 1, . . . , 1000 (log scale).]

Figure: λ2 = 100 (precise prior) and θ⋆2 = 1 (bias).

Page 56: better together? statistical learning in models made of modules

Outline

1 Models made of modules

2 Candidate distributions

3 Choosing among candidates

4 Computational challenges

Page 57: better together? statistical learning in models made of modules

From toy models to real-world situations

Computationally, expected utilities typically involve:

- integrals with respect to the data distribution p⋆ on Y,

→ cross-validation, bootstrap, or sequential out-of-sample prediction (i.e. the prequential approach);

- integrals with respect to candidate distributions on θ,

→ sampling methods, e.g. MCMC, SMC, etc.

Page 58: better together? statistical learning in models made of modules

Sampling in the joint model approach

- The joint model posterior has (genuine) probability density

π(θ1, θ2|Y1, Y2) ∝ p1(θ1) p1(Y1|θ1) p2(θ2|θ1) p2(Y2|θ1, θ2).

- The computational complexity typically grows super-linearly with the number of modules.

- Difficulties stack up. . . intractability, multimodality, ridges, etc.

- Y1 and Y2 might not be simultaneously available, e.g. for confidentiality reasons.

Page 59: better together? statistical learning in models made of modules

Approximating constants

The proposed plan involves computing p1(Y1) and π(Y1|Y2). This can be seen as a normalizing constant estimation task: challenging, but numerous methods have been proposed; bridge sampling, path sampling, nested sampling, Besag’s, Chib’s, Gelfand and Dey’s, Geyer’s, sequential Monte Carlo samplers, etc.

Del Moral, Doucet & Jasra (2005). Sequential Monte Carlo samplers.
Zhou, Johansen & Aston (2016). Towards automatic model comparison: an adaptive sequential Monte Carlo approach.

Page 60: better together? statistical learning in models made of modules

Sampling in the joint model approach: SMC way

- We can first approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).

- For each i ∈ {1, . . . , N1}, an SMC sampler

  - approximates π(dθ2|θ1^i, Y2) by (θ2^{i,1}, . . . , θ2^{i,N2}),

  - approximates π(Y2|θ1^i) by π^{N2}(Y2|θ1^i).

- Define w^i ∝ π^{N2}(Y2|θ1^i).

- Then for any test function ϕ,

Σ_{i=1}^{N1} w^i (1/N2) Σ_{j=1}^{N2} ϕ(θ1^i, θ2^{i,j}) →_P ∫ ϕ(θ1, θ2) π(dθ1, dθ2|Y1, Y2), as N1, N2 → ∞.

Page 61: better together? statistical learning in models made of modules

Sampling in the joint model approach: MCMC way

- Metropolis–Hastings with proposal distribution

q((θ1, θ2) → (dθ1′, dθ2′)) = π1^{N1}(dθ1′) q(θ2 → dθ2′),

i.e. θ1′ is drawn among (θ1^1, . . . , θ1^{N1}) uniformly at random.

- If we pretend that π1^{N1}(dθ1′) ≈ π1(dθ1′|Y1), the acceptance probability becomes

1 ∧ [ p2(Y2|θ1′, θ2′) p2(θ2′|θ1′) q(θ2′ → θ2) ] / [ p2(Y2|θ1, θ2) p2(θ2|θ1) q(θ2 → θ2′) ],

which does not involve Y1.

Lunn, D., Barrett, J., Sweeting, M., & Thompson, S. (2013). Fully Bayesian hierarchical modelling in two stages, with application to meta-analysis.

Page 62: better together? statistical learning in models made of modules

Sampling from the cut distribution

- The cut distribution is defined as

πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2) ∝ π(θ1, θ2|Y1, Y2) / p2(Y2|θ1).

- The denominator is the feedback of the 2nd module on θ1:

p2(Y2|θ1) = ∫ p2(Y2|θ1, θ2) p2(dθ2|θ1).

- The feedback term is typically intractable.

Page 63: better together? statistical learning in models made of modules

Sampling [not] from the cut distribution

WinBUGS’ approach via the cut function: alternate between

- sampling θ1′ from K1(θ1 → dθ1′), targeting p1(dθ1|Y1);

- sampling θ2′ from K2_{θ1′}(θ2 → dθ2′), targeting p2(dθ2|θ1′, Y2).

This does not leave the cut distribution invariant: if θ1, θ2 ∼ πcut(dθ1, dθ2; Y1, Y2), then θ1′, θ2′ follows

πcut(dθ1′, dθ2′; Y1, Y2) ∫ [ p2(θ2|θ1, Y2) / p2(θ2|θ1′, Y2) ] K1(θ1′ → dθ1) K2_{θ1′}(θ2′ → dθ2).

Iterating the kernel K2_{θ1′} enough times mitigates the issue.

Plummer (2015). Cuts in Bayesian graphical models. Bayesian Analysis.

Page 64: better together? statistical learning in models made of modules

Sampling from the cut distribution

In a perfect world, we could sample i.i.d.

- θ1^i from p1(θ1|Y1),

- θ2^i given θ1^i from p2(θ2|θ1^i, Y2),

then (θ1^i, θ2^i) would be i.i.d. from the cut distribution.
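In the conjugate biased-data example both conditionals are Gaussian, so this perfect-world sampler is actually available; a sketch with simulated (hypothetical) data, taking V as the prior variance of θ2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact i.i.d. sampling from the cut distribution, two stages.
n1, n2, V = 100, 1000, 0.1 ** 2
y1 = rng.normal(0.0, 1.0, n1)
y2 = rng.normal(1.0, 1.0, n2)

N = 100_000
# stage 1: theta1 ~ p1(theta1 | Y1) -- the second module never enters
prec1 = n1 + 1.0
theta1 = rng.normal(n1 * y1.mean() / prec1, prec1 ** -0.5, N)
# stage 2: theta2 | theta1 ~ p2(theta2 | theta1, Y2); no feedback on theta1
prec2 = n2 + 1.0 / V
theta2 = rng.normal(n2 * (y2.mean() - theta1) / prec2, prec2 ** -0.5)
print(theta1.mean(), theta2.mean())
```

The marginal of θ1 stays at its module-1 posterior, while θ2 absorbs (most of) the bias.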

Page 65: better together? statistical learning in models made of modules

Sampling from the cut distribution

In an MCMC world, we can sample

- θ1^i approximately from p1(θ1|Y1) using MCMC,

- θ2^i given θ1^i approximately from p2(θ2|θ1^i, Y2) using MCMC,

then the resulting samples approximate the cut distribution, in the limit of the numbers of iterations at both stages.

There is an alternative approach that

- is valid in a single asymptotic regime,

- enables simple constructions of confidence intervals.

Page 66: better together? statistical learning in models made of modules

Unbiased MCMC [on the side]

By coupling pairs of MCMC chains, we can produce (random) empirical measures

π̂(·) = Σ_{ℓ=1}^{N} ω_ℓ δ_{X_ℓ}(·)

that approximate a target π, in the sense

E[ Σ_{ℓ=1}^{N} ω_ℓ h(X_ℓ) ] = ∫ h(x) π(dx),

for a test function h.

Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.
Heng & Jacob (2017). Unbiased Hamiltonian Monte Carlo with couplings.

Page 67: better together? statistical learning in models made of modules

Sampling from the cut distribution

In an unbiased MCMC world, we can approximate

- p1(dθ1|Y1) with a random measure

π̂1(dθ1) = Σ_{ℓ=1}^{N1} ω1^ℓ δ_{θ1^ℓ}(dθ1),

- p2(dθ2|θ1^ℓ, Y2), for any θ1^ℓ, with a random measure

π̂2(dθ2|θ1^ℓ) = Σ_{k=1}^{N2} ω2^{ℓ,k} δ_{θ2^{ℓ,k}}(dθ2).

Thus, by the tower property, we can unbiasedly estimate

∫ h(θ1, θ2) p1(dθ1|Y1) p2(dθ2|θ1, Y2).

Page 68: better together? statistical learning in models made of modules

Sampling from the cut distribution

We can thus construct unbiased estimators of

∫ h(θ1, θ2) πcut(dθ1, dθ2; Y1, Y2)

for test functions h.

We care about unbiasedness because. . .

- we can draw R independent copies of the estimator, H^r, for r = 1, . . . , R,

- and average the results: R^{−1} Σ_{r=1}^{R} H^r,

- for which confidence intervals can be constructed.

Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.

Page 69: better together? statistical learning in models made of modules

Sampling from the cut distribution

Thus, by the tower property,

∫ h(θ1, θ2) p1(dθ1|Y1) p2(dθ2|θ1, Y2)

can be approximated with no bias by an estimator of the form

Σ_{ℓ=1}^{N1} ω1^ℓ Σ_{k=1}^{N2} ω2^{ℓ,k} h(θ1^ℓ, θ2^{ℓ,k}).

One can then sample independently R such estimators, and average the results to consistently approximate the cut distribution as R → ∞.

Page 70: better together? statistical learning in models made of modules

Sampling in the modularized approach: SMC way

- We can first approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).

- For each i ∈ {1, . . . , N1}, an SMC sampler approximates π(dθ2|θ1^i, Y2) by (θ2^{i,1}, . . . , θ2^{i,N2}).

- Then for any test function ϕ,

(1/N1)(1/N2) Σ_{i=1}^{N1} Σ_{j=1}^{N2} ϕ(θ1^i, θ2^{i,j}) →_P ∫ ϕ(θ1, θ2) π(dθ2|θ1, Y2) π1(dθ1|Y1), as N1, N2 → ∞.

- The SMC approach yields approximations of the joint and the cut distributions simultaneously, by using or not the estimated feedback terms π^{N2}(Y2|θ1^i).

Page 71: better together? statistical learning in models made of modules

Sampling in the modularized approach: MCMC way

- First approximate π1(dθ1|Y1) with (θ1^1, . . . , θ1^{N1}).

- Run a (long enough) MCMC targeting π(dθ2|θ1^i, Y2) for each i ∈ {1, . . . , N1}. . . independently?

- Introduce swap moves, as in parallel tempering:

  - select k, ℓ in {1, . . . , N1} uniformly at random;

  - swap θ2^k with θ2^ℓ with probability

1 ∧ [ πcut(θ1^k, θ2^ℓ) πcut(θ1^ℓ, θ2^k) ] / [ πcut(θ1^k, θ2^k) πcut(θ1^ℓ, θ2^ℓ) ]
= 1 ∧ [ p2(θ2^ℓ|θ1^k) p2(Y2|θ1^k, θ2^ℓ) p2(θ2^k|θ1^ℓ) p2(Y2|θ1^ℓ, θ2^k) ] / [ p2(θ2^k|θ1^k) p2(Y2|θ1^k, θ2^k) p2(θ2^ℓ|θ1^ℓ) p2(Y2|θ1^ℓ, θ2^ℓ) ].

Page 72: better together? statistical learning in models made of modules

Discussion

- Models made of multiple modules abound: forcings in dynamical systems, computer models, causal inference with propensity scores, PKPD, air pollution data, meta-analysis, generated regressors, regression with functional predictors, linkage disequilibrium modelling. . .

- Departures from the joint model approach can be justified in a decision-theoretic framework.

- These enable propagation of uncertainties but with a control of the feedback of some variables on others.

- Is the cut distribution a Bayesian equivalent of the two-step estimators routinely used in econometrics?

- The comparison between candidates and the approximation of the cut distribution raise computational challenges.

Page 73: better together? statistical learning in models made of modules

References

Technical report at arxiv.org/abs/1708.08719.
Unbiased MCMC report at arxiv.org/abs/1708.03625.


Page 75: better together? statistical learning in models made of modules

Appendix: unbiased MCMC

Consider two Markov chains (Xt) and (Yt) generated as follows:

- sample X0 and Y0 from π0, an initial distribution,

- sample X1 ∼ P(X0, ·), with a π-invariant Markov kernel P,

- for t ≥ 1, sample (Xt+1, Yt) ∼ P̄((Xt, Yt−1), ·), where P̄ is a coupling of P with itself.

Each chain marginally evolves according to P.

Xt and Yt have the same distribution for all t ≥ 0.

P̄ is such that there exists τ with Xt = Yt−1 for all t ≥ τ.

Page 76: better together? statistical learning in models made of modules

Appendix: unbiased MCMC

Marginal convergence of each chain:

Eπ[h(X)] = lim_{t→∞} E[h(Xt)].

Limit as a telescopic sum, for all k ≥ 0,

lim_{t→∞} E[h(Xt)] = E[h(Xk)] + Σ_{t=k+1}^{∞} E[h(Xt) − h(Xt−1)].

Since Yt has the same distribution as Xt, then

Eπ[h(X)] = E[h(Xk)] + Σ_{t=k+1}^{∞} E[h(Xt) − h(Yt−1)].

Page 77: better together? statistical learning in models made of modules

Appendix: unbiased MCMC

If we can swap expectation and limit,

Eπ[h(X)] = E[ h(Xk) + Σ_{t=k+1}^{∞} (h(Xt) − h(Yt−1)) ],

so the random variable

h(Xk) + Σ_{t=k+1}^{∞} (h(Xt) − h(Yt−1))

is an unbiased estimator of Eπ[h(X)].

We can compute the “infinite” sum since Xt = Yt−1 for all t ≥ τ.
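A sketch of this construction for a toy target π = N(0, 1), using random-walk Metropolis chains with maximally coupled proposals and a common acceptance uniform; all tuning values (k, proposal scale, number of replicates) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def logpi(x):                     # toy target: standard normal
    return -0.5 * x * x

def norm_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def maximal_coupling(mu1, mu2, sigma):
    # (x, y) with x ~ N(mu1, sigma), y ~ N(mu2, sigma), P(x = y) maximal
    x = rng.normal(mu1, sigma)
    if np.log(rng.uniform()) + norm_logpdf(x, mu1, sigma) <= norm_logpdf(x, mu2, sigma):
        return x, x
    while True:
        y = rng.normal(mu2, sigma)
        if np.log(rng.uniform()) + norm_logpdf(y, mu2, sigma) > norm_logpdf(y, mu1, sigma):
            return x, y

def unbiased_estimate(h, k=10, sigma=1.0, max_iter=10_000):
    x, y = rng.normal(), rng.normal()        # X_0, Y_0 ~ pi_0 = N(0, 1)
    xp = rng.normal(x, sigma)                # one marginal MH step for X
    if np.log(rng.uniform()) < logpi(xp) - logpi(x):
        x = xp
    xs, ys = [x], [y]                        # xs[t-1] = X_t, ys[t] = Y_t
    tau = max_iter                           # meeting is fast in practice
    for t in range(1, max_iter):
        xp, yp = maximal_coupling(x, y, sigma)
        u = np.log(rng.uniform())            # common uniform couples the moves
        if u < logpi(xp) - logpi(x):
            x = xp
        if u < logpi(yp) - logpi(y):
            y = yp
        xs.append(x)
        ys.append(y)
        if x == y:                           # X_{t+1} = Y_t: chains have met
            tau = t + 1
            break
    while len(xs) < k:                       # continue X alone up to time k
        xp = rng.normal(x, sigma)
        if np.log(rng.uniform()) < logpi(xp) - logpi(x):
            x = xp
        xs.append(x)
    # H = h(X_k) + sum_{t=k+1}^{tau-1} [h(X_t) - h(Y_{t-1})]
    est = h(xs[k - 1])
    for t in range(k + 1, tau):
        est += h(xs[t - 1]) - h(ys[t - 1])
    return est

ests = np.array([unbiased_estimate(lambda x: x) for _ in range(1000)])
print(ests.mean())  # ~ 0 = E_pi[X], up to Monte Carlo error
```

Averaging independent replicates gives a consistent estimate with standard confidence intervals, which is exactly what the cut-sampling scheme above exploits.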

Page 78: better together? statistical learning in models made of modules

Appendix: unbiased MCMC

The proposed random measure estimator is given by

π̂(·) = δ_{Xk}(·) + Σ_{t=k+1}^{τ−1} (δ_{Xt}(·) − δ_{Yt−1}(·)) ≡ Σ_{ℓ=1}^{N} ω_ℓ δ_{Z_ℓ}(·).

Variant of Glynn & Rhee’s estimator.

Glynn & Rhee (2014). Exact estimation for Markov chain equilibrium expectations.
Jacob, O’Leary & Atchadé (2017). Unbiased MCMC with couplings.

Page 79: better together? statistical learning in models made of modules

Appendix: unbiased MCMC

Conditions ensuring correctness, finite cost, and finite variance, for a test function h:

1. The marginal chain converges: E[h(Xt)] → Eπ[h(X)], and ∃η > 0, D < ∞: E[|h(Xt)|^{2+η}] < D.

2. The meeting time τ is geometric:

∃C < +∞, ∃δ ∈ (0, 1), ∀t ≥ 1: P(τ > t) ≤ C δ^t.

3. The chains stay together: Xt = Yt−1 for all t ≥ τ.

(Sufficient but not necessary!)