Stochastic Compartmental Modeling and Inference with Biological Applications

Jason Xu
Department of Biomathematics, UCLA

Challenges in the Statistical Modeling of Stochastic Processes for the Natural Sciences Workshop

Banff International Research Station, July 11, 2017

Review: hematopoiesis

A complex mechanism in which self-renewing hematopoietic stem cells (HSCs) differentiate via a series of intermediate progenitor cell stages to produce blood cells

• Multi-stage process in bone marrow, lymphatic and circulatory systems: difficult to observe in vivo

• Dynamics and structure are largely unknown

• Clinical relevance: stem cell transplantation is a mainstay of cancer therapy; all blood cell diseases are caused by malfunctions in the hematopoietic process

• Stochastic modeling efforts provide quantitative basis to answer questions about dynamics (parameter inference) and structure (model selection)

Branching structure of hematopoiesis

Many versions of the hematopoietic tree have been proposed

Two-type compartmental model

• Series of statistical studies targeting HSC dynamics [Abkowitz et al 1990, Golinelli et al 2009, Catlin et al 2011]

• Cannot resolve questions about later stages of differentiation

• Past studies: intensive simulation study, estimating equations, reversible-jump MCMC

• Can be equivalently treated as a Markov branching process

Multi-type Markov branching processes

• Random vector X(t); Xi (t) denotes type i population at time t

• Cells act independently: can die, reproduce, create other cells

• Independence ⇒ linearity: overall rates are multiplicative in number of cells

• Time-homogeneity: jump rates are constant over time

• A class of continuous-time Markov chains (CTMCs): memoryless, exponential times between events
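The properties listed above can be illustrated with a short simulation. This is a minimal sketch, not the talk's implementation: the function name `simulate_branching` and the four per-cell rates (`lam`, `nu`, `mu1`, `mu2`) are hypothetical choices for a two-type process.

```python
import random

def simulate_branching(x0, rates, t_end, rng):
    """Gillespie-style simulation of a two-type Markov branching process.

    x0    : (n1, n2) initial counts of type 1 and type 2 cells
    rates : hypothetical per-cell rates (lam, nu, mu1, mu2):
            lam = type 1 self-renewal, nu = type 1 cell creates a type 2 cell,
            mu1 / mu2 = death of a type 1 / type 2 cell
    Returns the counts (n1, n2) at time t_end.
    """
    lam, nu, mu1, mu2 = rates
    x1, x2 = x0
    t = 0.0
    while True:
        # Independence => every overall rate is linear in the current counts.
        event_rates = [lam * x1, nu * x1, mu1 * x1, mu2 * x2]
        total = sum(event_rates)
        if total == 0.0:
            return x1, x2  # no event can occur any more
        # Memoryless, exponentially distributed time to the next event.
        t += rng.expovariate(total)
        if t > t_end:
            return x1, x2
        event = rng.choices(range(4), weights=event_rates)[0]
        if event == 0:
            x1 += 1
        elif event == 1:
            x2 += 1
        elif event == 2:
            x1 -= 1
        else:
            x2 -= 1

# Example: start from 3 type 1 cells and simulate 2 time units.
print(simulate_branching((3, 0), (1.0, 0.5, 0.2, 0.3), 2.0, random.Random(0)))
```

Time-homogeneity shows up in the fact that `rates` never changes inside the loop; only the counts do.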

Challenges: discretely observed data

The discretely-observed data likelihood

$$\ell_o(\mathbf{Y} \mid \theta) = \sum_{p=1}^{m} \sum_{i=0}^{n(p)-1} \log p_{X_p(t_{p,i}),\, X_p(t_{p,i+1})}\left(t_{p,i+1} - t_{p,i} \mid \theta\right)$$

In particular, need finite-time transition probabilities:

$$p_{x,y}(t) = \Pr\left(X(t+s) = y \mid X(s) = x\right)$$

• Classical matrix exponentiation for CTMCs is O(|Ω|^3):

$$P(t) := \{p_{x,y}(t)\}_{x,y \in \Omega} = e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^k}{k!}$$

• When only partially observed (latent process), compounded by additional marginalization over hidden states
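To make the cost concrete, here is a minimal sketch of the classical route for a small, fully observed chain: truncate the power series for e^{Qt}, then sum log transition probabilities over consecutive observations. The two-state generator and the helper names are illustrative assumptions, and the plain truncated series is only suitable for small ‖Qt‖.

```python
import math

def mat_mul(a, b):
    """Dense matrix product; this is the O(|Omega|^3) step."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, terms=60):
    """P(t) = exp(Qt), approximated by the first `terms` terms of sum_k (Qt)^k / k!."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # (Qt)^0 / 0!
    P = [row[:] for row in term]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, Qt)]  # (Qt)^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

def log_likelihood(Q, times, states):
    """Log-likelihood of one discretely observed path (states are row indices of Q)."""
    ll = 0.0
    for i in range(len(times) - 1):
        P = transition_matrix(Q, times[i + 1] - times[i])
        ll += math.log(P[states[i]][states[i + 1]])
    return ll

# Hypothetical two-state chain: leave state 0 at rate a, leave state 1 at rate b.
a, b = 0.7, 0.3
Q = [[-a, a], [b, -b]]
print(transition_matrix(Q, 2.0)[0][0])
print(log_likelihood(Q, [0.0, 1.0, 2.5], [0, 1, 1]))
```

For branching processes the state space Ω is countably infinite, so even this sketch would require truncating Ω, which is one reason the cubic cost is prohibitive.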

Using the probability generating function φ

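$$\phi_{jk}(t, s_1, s_2; \theta) = E_\theta\left(s_1^{X_1(t)} s_2^{X_2(t)} \mid X_1(0) = j, X_2(0) = k\right) = \sum_{l=0}^{\infty} \sum_{m=0}^{\infty} p_{(jk),(lm)}(t; \theta)\, s_1^{l} s_2^{m}; \quad |s_i| \le 1$$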

• PGF φjk computed by solving Kolmogorov forward/backward ODEs

• Transition probabilities related via differentiation, but impractical

$$p_{(jk),(lm)}(t) = \frac{1}{l!\, m!} \frac{\partial^{l}}{\partial s_1^{l}} \frac{\partial^{m}}{\partial s_2^{m}} \phi_{jk}(t) \Big|_{s_1 = s_2 = 0}$$

• Transform $s_1 = e^{2\pi i w_1}$, $s_2 = e^{2\pi i w_2}$ ⇒ φ becomes a Fourier series:

$$\phi_{jk}(t, e^{2\pi i w_1}, e^{2\pi i w_2}) = \sum_{l=0}^{\infty} \sum_{m=0}^{\infty} p_{(jk),(lm)}(t)\, e^{2\pi i l w_1} e^{2\pi i m w_2}$$

From differentiation to integration: a spectral trick

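• Inverting the Fourier series representation recovers transition probabilities efficiently:

$$p_{(jk),(lm)}(t) = \int_0^1 \int_0^1 \phi_{jk}(t, e^{2\pi i w_1}, e^{2\pi i w_2})\, e^{-2\pi i l w_1} e^{-2\pi i m w_2}\, dw_1\, dw_2$$

(applying a Riemann sum approximation)

$$\approx \frac{1}{N^2} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \phi_{jk}(t, e^{2\pi i u/N}, e^{2\pi i v/N})\, e^{-2\pi i l u/N} e^{-2\pi i m v/N}.$$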

• Can simultaneously compute probabilities p_{(jk),(lm)}(t) for all l, m = 0, . . . , N − 1 via the Fast Fourier Transform (FFT)

• Compute discrete-data likelihood practically; similar approach yields conditioned moments useful for EM [Xu, Guttorp, Kato-Maeda, Minin 2015]
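The Riemann-sum inversion can be checked on a toy case where the PGF is available in closed form. The sketch below assumes a two-type pure-death process (death rates `mu1`, `mu2`, both hypothetical), for which the counts at time t are independent binomials. It uses a naive double sum in place of an FFT so that it needs only the standard library; an FFT would evaluate the same sums in O(N² log N).

```python
import cmath
import math

def pgf(t, s1, s2, j0, k0, mu1, mu2):
    """PGF of a two-type pure-death process started from (j0, k0) cells.

    Each cell survives to time t independently with probability exp(-mu * t).
    """
    q1, q2 = math.exp(-mu1 * t), math.exp(-mu2 * t)
    return (1 - q1 + q1 * s1) ** j0 * (1 - q2 + q2 * s2) ** k0

def transition_probs(t, j0, k0, mu1, mu2, N):
    """Approximate p_{(j0 k0),(l m)}(t) for l, m = 0..N-1 by the N x N Riemann sum."""
    root = [cmath.exp(2 * math.pi * 1j * u / N) for u in range(N)]
    phi = [[pgf(t, root[u], root[v], j0, k0, mu1, mu2) for v in range(N)]
           for u in range(N)]
    probs = [[0.0] * N for _ in range(N)]
    for l in range(N):
        for m in range(N):
            total = 0
            for u in range(N):
                for v in range(N):
                    # root[(-l*u) % N] equals exp(-2 pi i l u / N)
                    total += phi[u][v] * root[(-l * u) % N] * root[(-m * v) % N]
            probs[l][m] = (total / N ** 2).real
    return probs

probs = transition_probs(1.0, 3, 2, 0.5, 1.2, 8)
q1 = math.exp(-0.5)
# Compare one entry with the exact binomial answer: all 3 type 1 cells
# survive and both type 2 cells die.
print(probs[3][0], q1 ** 3 * (1 - math.exp(-1.2)) ** 2)
```

In this toy case the populations cannot exceed their starting values, so any N larger than j0 and k0 reproduces the probabilities up to rounding. For processes with births, N has to be large enough that the mass beyond N − 1 is negligible.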

[Figure: grid of panels, each plotting transition probability (y-axis) against shift rate ν (x-axis, 0.5 to 1.5).]

More missing data: partially observed processes

Process X(t) poses same challenges as before, but now we only glimpse partial information (sampling, measurement error)

• Direct marginalization impractical for large Ω

• Data augmented MCMC: slow mixing, need efficient proposals

• Simulation approaches (particle filtering, SMC) are flexible, but quickly become limited for large populations [Andrieu et al 2010]

Hematopoietic lineage barcoding data

• DNA barcoding experiments enable individual cell lineage tracking through time, in vivo

• IID time series data: DNA read counts partially inform the populations of each barcode ID present among each cell type

• Monitored at discrete times over 30 months, dataset contains 110 million read counts across 9635 unique barcode IDs

• Discrete hidden space of multiple very large, hidden populations

Illustration: experimental design

[Figure: experimental design. Top: latent process for one barcode lineage (HSC, progenitors, mature cells) evolving across sampling times t0, t1, t2, t3, ... Bottom: at each sampling time, PCR of genomic DNA from each sample gives read counts for B cell reads and T cell reads.]

The hidden branching process model

• Latent process: each barcode lineage evolves as a continuous-time, multitype branching process X(t) whose components are counts of each cell type

• Observation process: flexible choice of emission distribution for sample counts: we use multivariate hypergeometric distribution Y ∼ mvhypgeo(X)

• Read data: read counts Ỹ are proportional to Y with unknown amplification constant
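The observation step can be sketched as sampling cells without replacement from a pool whose composition is the latent state. The function name and the toy counts below are assumptions for illustration. Building an explicit pool is only sensible for small populations, and the unknown amplification constant linking Ỹ to Y is not modeled here.

```python
import random

def sample_mvhypergeo(counts, n_sample, rng):
    """Draw n_sample cells without replacement; counts[i] is the pool size of group i.

    Returns how many sampled cells fall in each group, i.e. one multivariate
    hypergeometric draw.
    """
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    drawn = rng.sample(pool, n_sample)
    return [drawn.count(i) for i in range(len(counts))]

# Hypothetical pool: 40 cells carrying barcode A, 10 carrying barcode B,
# and 950 carrying other barcodes; 100 cells are sampled.
print(sample_mvhypergeo([40, 10, 950], 100, random.Random(0)))
```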

Moment-based method of inference

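Loss function estimation: match pairwise model-based and empirical correlations across barcode lineages [Xu et al 2017],

$$L(\theta; \mathbf{Y}) = \sum_{t_j} \sum_{m} \sum_{n \neq m} \left[\psi^{j}_{mn}(\theta) - \hat{\psi}^{j}_{mn}(\mathbf{Y})\right]^2,$$

where $\psi^{j}_{mn}(\theta) = \rho(Y_m(t_j), Y_n(t_j); \theta)$ is the model-based correlation and $\hat{\psi}^{j}_{mn}$ denotes the corresponding sample correlations at time $t_j$

• Estimating parameters θ reduces to nonlinear least squares optimization: $\hat{\theta} = \arg\min_\theta L(\theta; \mathbf{Y})$

• Consistent under mild assumptions: $\hat{\theta}_N \to \theta_0$ in probability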
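The loss itself is simple to assemble once model-based correlations are available. In the sketch below the data layout (`reads[j][m]` is a list over barcode lineages) and the callback `model_corr` are assumptions made for illustration. Deriving ψ(θ) from the branching process moments is the substantive step and is not shown.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_loss(model_corr, reads):
    """Sum of squared gaps between model-based and empirical pairwise correlations.

    reads[j][m]         : read counts of cell type m at time t_j, one per lineage
    model_corr(j, m, n) : hypothetical callback returning psi^j_{mn}(theta)
    """
    loss = 0.0
    for j, by_type in enumerate(reads):
        for m in range(len(by_type)):
            for n in range(len(by_type)):
                if n != m:
                    empirical = pearson(by_type[m], by_type[n])
                    loss += (model_corr(j, m, n) - empirical) ** 2
    return loss

# One time point, two cell types, four lineages with perfectly correlated reads.
reads = [[[1, 2, 3, 4], [2, 4, 6, 8]]]
print(correlation_loss(lambda j, m, n: 0.5, reads))
```

Minimizing this function over θ with any nonlinear least squares routine gives the estimator described above.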

A richer class of compartmental models

[Figure: two compartmental model diagrams. (a) Multipotent progenitor: HSC (rates λ, ν_a) feeds a single progenitor compartment (rate μ_a), which produces the mature types Gr, Mo, T, B, and NK (rates ν_1, ..., ν_5 and μ_1, ..., μ_5). (b) Oligopotent progenitors: HSC (rates λ, ν_a, ν_b) feeds two progenitor compartments, Prog. a and Prog. b (rates μ_a, μ_b), which produce the same five mature types.]

Allows for an arbitrary number of intermediate progenitors and mature cell types, requiring that each mature type can be descended from only one possible progenitor type

Fitted correlations: macaque lineage tracking data

[Figure: Pairwise correlations in zh33, one progenitor. x-axis: Observation Time (Months), at 2, 4.5, 6.5, 9.5, 12, 14, 21, 23, 28, 30; y-axis: Pairwise correlation, 0.0 to 1.0. Legend: Gr and Mo, Gr and T, Gr and B, Gr and NK, Mo and T, Mo and B, Mo and NK, T and B, T and NK, B and NK.]

Solid lines denote empirical correlation profiles; dotted lines denote model-based correlations from best fitting estimates θ̂

Overview of results

• HSC self-renewal rate λ = 0.0593 (every 12 weeks) falls into the confidence interval (0.0095, 0.0649) obtained in previous primate studies

• Initial distribution π = 0.139 consistent with GFP marking levels stabilizing at 13%

• Intermediate rates ν_i suggest granulocytes and monocytes are produced much more rapidly than T, B and NK cells, and individual progenitors can each produce thousands of cells daily (not previously estimated)

• NK cells track distinctly from other mature blood cells

• Single-progenitor models fit best, affirming recent findings [Notta et al. 2015] of in vitro human hematopoiesis that challenge the traditional oligopotency assumption

Evidence against oligopotency

Notta et al. 2015

An open challenge: model selection

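[Figure: model diagram with two oligopotent progenitors, Prog. a and Prog. b, between HSC and the mature types Gr, Mo, T, B, and NK; rates λ, ν_a, ν_b, μ_a, μ_b, ν_1, ..., ν_5, μ_1, ..., μ_5.]

A model with two oligopotent progenitors instead of one common multipotent progenitor leads to poor model fit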

An open challenge: model selection

[Figure: Pairwise correlations in zh33, 2 progenitors. x-axis: observation times 2, 4.5, 6.5, 9.5, 12, 14, 21, 23, 28, 30; y-axis: Correlation, 0.0 to 1.0.]

• Efficient parameter estimation for fitting cell lineage barcoding data to rich models of hematopoiesis

• Need model selection techniques for rigorous conclusions about pathway structure

Nonlinear compartmental models

Motivating example: stochastic SIR model of infection

“The associated mathematical manipulations required to generate solutions can only be described as heroic.”

E. Renshaw, 2015, Stochastic population processes: analysis, approximations, simulations.

SIR dynamics

$$\Pr\left(S(t+h) = x_h, I(t+h) = y_h \mid S(t) = x_t, I(t) = y_t\right) = \begin{cases} \beta x_t y_t h + o(h) & \text{if } (x_h, y_h) = (x_t - 1, y_t + 1) \\ \gamma y_t h + o(h) & \text{if } (x_h, y_h) = (x_t, y_t - 1) \\ 1 - (\beta x_t y_t + \gamma y_t) h + o(h) & \text{if } (x_h, y_h) = (x_t, y_t) \end{cases}$$

• Parameters: infection rate β, recovery rate γ

• Nonlinearity arises from interactions: does not satisfy particle independence ⇒ cannot analyze as branching process

• Finite-time behavior (transition probabilities) challenging
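Although transition probabilities are hard to obtain, sample paths from these dynamics are easy to simulate. A minimal Gillespie-style sketch follows, with hypothetical parameter values; it assumes γ > 0 so that the total event rate is positive whenever infecteds remain.

```python
import random

def simulate_sir(s0, i0, beta, gamma, t_end, rng):
    """Simulate the stochastic SIR model; returns a list of (time, S, I) states."""
    s, i, t = s0, i0, 0.0
    path = [(t, s, i)]
    while i > 0:
        infect = beta * s * i  # nonlinear: depends on the product S * I
        recover = gamma * i
        total = infect + recover
        t += rng.expovariate(total)  # exponential time to the next event
        if t > t_end:
            break
        if rng.random() * total < infect:
            s, i = s - 1, i + 1  # infection: (S, I) -> (S - 1, I + 1)
        else:
            i -= 1               # recovery:  (S, I) -> (S, I - 1)
        path.append((t, s, i))
    return path

path = simulate_sir(50, 1, 0.01, 0.5, 20.0, random.Random(0))
print(len(path), path[-1])
```

The product `s * i` in the infection rate is exactly what breaks the linearity that branching processes rely on.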

Transition probabilities: a different route

Very briefly, working in the Laplace domain,

$$\phi_{ab}(s) := \mathcal{L}\left[P^{a_0 b_0}_{ab}(t)\right](s) = \int_0^\infty e^{-st} P^{a_0 b_0}_{ab}(t)\, dt$$

satisfies a recursion with continued fraction representation

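$$\phi^{(0)}_{ab}(s) = \left(\prod_{i=1}^{b} x_{ai}\right) \cfrac{x_{a,b+1}}{Y_{a,b+1} + \cfrac{x_{a,b+2} Y_{ab}}{y_{a,b+2} + \cfrac{x_{a,b+3}}{y_{a,b+3} + \cfrac{x_{a,b+4}}{y_{a,b+4} + \cdots}}}}$$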

• Evaluate to finite depth, numerically invert Laplace transform [Ho, Xu, Crawford, Minin, Suchard 2017]
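Evaluating a continued fraction to finite depth can be done by working from the innermost term outward. The sketch below is generic: it takes the partial numerators and denominators as callables `x(k)` and `y(k)`, which are placeholders rather than the specific coefficients of the SIR recursion, and it does not cover the Laplace inversion step. Production code often uses the modified Lentz method instead, which avoids fixing the depth in advance.

```python
import math

def continued_fraction(x, y, depth):
    """Evaluate x(1) / (y(1) + x(2) / (y(2) + ... + x(depth) / y(depth))).

    The tail beyond `depth` is truncated to zero and the fraction is
    accumulated bottom-up.
    """
    value = 0.0
    for k in range(depth, 0, -1):
        value = x(k) / (y(k) + value)
    return value

# Sanity check on a known case: 1 / (1 + 1 / (1 + ...)) = (sqrt(5) - 1) / 2.
approx = continued_fraction(lambda k: 1.0, lambda k: 1.0, 40)
print(approx, (math.sqrt(5) - 1) / 2)
```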

Back to branching processes

• Continued fraction method is limited to moderate outbreak sizes; derivation is delicate and hard to extend

• Two-type branching approximation yields analytic transition probabilities

[Figure: two panels. (c) Continued fraction; (d) Branching process]

Current/future work: correcting the approximation

Branching process model as proposal density within MCMC

• Metropolis-Hastings step corrects approximation error

Circles represent true populations. I and R curves proposed from branching process, given true β, γ, and observed S population
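The correction idea can be seen in a toy independence sampler: proposals come from an approximate distribution, and the Metropolis-Hastings acceptance ratio reweights them so the chain targets the exact one. Everything in this sketch (the four-state target, the uniform proposal, the function names) is a stand-in chosen for illustration. In the setting above, the proposal would be the branching process approximation and the target the SIR posterior.

```python
import random

def independence_mh(target, proposal_pmf, proposal_draw, x0, n_steps, rng):
    """Metropolis-Hastings with proposals drawn independently of the current state.

    target        : unnormalized target probability, must be positive at x0
    proposal_pmf  : probability of a state under the approximate model
    proposal_draw : function rng -> state drawn from the approximate model
    """
    x, chain = x0, []
    for _ in range(n_steps):
        y = proposal_draw(rng)
        # Ratio of importance weights target/proposal at the new and current state.
        ratio = (target(y) * proposal_pmf(x)) / (target(x) * proposal_pmf(y))
        if rng.random() < ratio:
            x = y
        chain.append(x)
    return chain

weights = [1.0, 2.0, 3.0, 4.0]  # exact (unnormalized) target on states 0..3
chain = independence_mh(lambda x: weights[x],      # target
                        lambda x: 0.25,            # uniform approximation
                        lambda r: r.randrange(4),  # draw from the approximation
                        0, 20000, random.Random(0))
print([round(chain.count(k) / len(chain), 3) for k in range(4)])
```

The closer the approximation is to the target, the closer the acceptance ratio stays to 1, which is why a good branching approximation makes an efficient proposal.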

References

• JL Abkowitz, ML Linenberger, MA Newton, GH Shelton, RL Ott, and P Guttorp. Evidence for the maintenance of hematopoiesis in a large animal by the sequential activation of stem-cell clones. Proceedings of the National Academy of Sciences, 1990.

• SN Catlin, JL Abkowitz, and P Guttorp. Statistical inference in a two-compartment model for hematopoiesis. Biometrics, 2001.

• D Golinelli, P Guttorp, and JL Abkowitz. Bayesian inference in a hidden stochastic two-compartment model for feline hematopoiesis. Mathematical Medicine and Biology, 2006.

• J Xu, P Guttorp, MM Kato-Maeda, and VN Minin. Likelihood-based inference for discretely observed birth-death-shift processes, with applications to evolution of mobile genetic elements. Biometrics, 2015.

• C Andrieu, A Doucet, and R Holenstein. Particle Markov chain Monte Carlo methods. JRSS: Series B, 2010.

• J Xu, S Koelle, P Guttorp, C Wu, C Dunbar, JL Abkowitz, and VN Minin. Statistical inference in partially observed stochastic compartmental models with application to cell lineage tracking of in vivo hematopoiesis. ArXiv preprint, 2016.

• L Ho, J Xu, FW Crawford, VN Minin, and MA Suchard. Birth (death)/birth-death processes and their computable transition probabilities with statistical applications. Journal of Mathematical Biology, 2017 (to appear).

Simulation study

Infer parameters for partially observed datasets simulated from models in our class: we use the following as an example

[Figure: model diagram with HSC, two progenitors (Prog. a and Prog. b), and the mature types Gr, Mo, T, B, and NK; rates λ, ν_a, ν_b, μ_a, μ_b, ν_1, ..., ν_5, μ_1, ..., μ_5.]

Results: simulation study

[Figure: Estimated rates, 400 synthetic datasets. x-axis: Parameter (λ, ν_a, ν_b, μ_a, μ_b, ν_1, ν_2, ν_3, ν_4, ν_5, π_a, π_b); y-axis: Relative error.]

[Figure: model diagram with HSC (λ, ν_a), one progenitor (μ_a), and three mature groupings: Gran + Mono, T + B, and NK (rates ν_1, ν_2, ν_3 and μ_1, μ_2, μ_3).]

[Figure: Pairwise correlations, three mature cell groupings. x-axis: observation times 2, 4.5, 6.5, 9.5, 12, 14, 21, 23, 28, 30; y-axis: Correlation, 0.0 to 1.0. Legend: Gr+Mono and T+B, Gr+Mono and NK, T+B and NK, Fitted Values, GCSF Mobilization Dates.]

[Figure: Histogram of relative errors across all parameters. x-axis: Relative error, −0.5 to 2.5; y-axis: Frequency.]

Performance under model misspecification

[Figure: Fitting three-progenitor model on two-progenitor data. x-axis: Observation Time, 20 to 140; y-axis: Correlation, −0.5 to 1.0. Legend: True parameters, Misspecified Fit.]

Figure: Inference on same synthetic data generated from the two-progenitor model, but misspecifying a three-progenitor model

[Figure: Fitting one-progenitor model on two-progenitor data. x-axis: Observation Time, 20 to 140; y-axis: Correlation, −0.5 to 1.0. Legend: True parameters, Misspecified Fit.]

Figure: Here we wrongly assume there is one common progenitor

[Figure: Lumping five-type dataset into three mature compartments. x-axis: Observation Time, 20 to 140; y-axis: Correlation, −0.5 to 1.0. Legend: True observed correlation, Misspecified Fit.]

Figure: When we lump mature compartments together but otherwise correctly specify progenitors they are descended from, the fit is still good.

Parameter estimates

Estimated parameter                 5-type model fit    3-type model fit

HSC renewal λ                       0.0634              0.0497
HSC diff. ν0                        2.80 × 10^−6        1.11 × 10^−5
Progen. death µ0                    0.000               0.000
Progen. diff. to Type 1 ν1          1614.7              2635.7
                        ν2          6093.6              283.3
                        ν3          39.6                173.1
                        ν4          126.1               NA
                        ν5          64.4                NA
Mature death of Type 1 µ1           0.5                 0.7
                       µ2           0.7                 0.01
                       µ3           0.01                0.40
                       µ4           0.01                NA
                       µ5           0.45                NA
Percentage barcoded at HSC          0.289               0.148

The emission distribution

Observation model: read count data Y^p(t) for barcode p at time t are distributed according to multivariate hypergeometric distribution:

$$Y^p_1(t) \mid X(t) \sim \text{hypergeom}(N_3, X^p_3(t), n_1),$$

$$Y^p_2(t) \mid X(t) \sim \text{hypergeom}(N_4, X^p_4(t), n_2),$$

• n_1 and n_2 are known numbers of sampled cells of types 3 (Gr+Mono) and 4 (T+B+NK)

• N_3 and N_4 are known total numbers of barcoded type 3 and 4 cells in the animal

• X^p_3(t) and X^p_4(t) are unknown numbers of types 3 and 4 cells with barcode p
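For reference, the hypergeometric probability of seeing y cells with barcode p in a sample can be written directly with binomial coefficients. The argument order below mirrors hypergeom(N, X, n) as used above. The function name and the numbers in the example are illustrative only.

```python
from math import comb

def hypergeom_pmf(y, N, X, n):
    """P(Y = y) when n cells are sampled without replacement from N cells,
    X of which carry the barcode of interest."""
    if y < 0 or y > n or y > X or n - y > N - X:
        return 0.0
    return comb(X, y) * comb(N - X, n - y) / comb(N, n)

# Toy numbers: 10 barcoded cells in total, 3 with barcode p, 4 cells sampled.
print([round(hypergeom_pmf(y, 10, 3, 4), 4) for y in range(5)])
```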

Fitted correlation plots

[Figure: Pairwise correlation profiles, synthetic data. x-axis: Observation Time, 20 to 140; y-axis: Correlation, 0.85 to 1.00. Legend: True parameters, Fitted parameters.]

Fitted correlation plots

[Figure: Synthetic data with two distinct progenitors. x-axis: Observation Time, 20 to 140; y-axis: Correlation, −0.5 to 1.0. Legend: True parameters, Fitted parameters.]

Wu et al. 2014
