stochastic compartmental modeling and inference …...stochastic compartmental modeling and...
TRANSCRIPT
Stochastic Compartmental Modeling andInference with Biological Applications
Jason XuDepartment of Biomathematics, UCLA
Challenges in the Statistical Modeling of Stochastic Processesfor the Natural Sciences Workshop
Banff International Research Station, July 11, 2017
Review: hematopoiesis
A complex mechanism in which self-renewing hematopoietic stemcells (HSCs) differentiate via a series of intermediate progenitorcell stages to produce blood cells
• Multi-stage process in bone marrow, lymphatic and circulatorysystems: difficult to observe in vivo
• Dynamics and structure are largely unknown
• Clinical relevance: stem cell transplantation is a mainstay ofcancer therapy; all blood cell diseases are caused bymalfunctions in the hematopoietic process
• Stochastic modeling efforts provide quantitative basis toanswer questions about dynamics (parameter inference) andstructure (model selection)
Branching structure of hematopoiesis
Many versions of the hematopoietic tree have been proposed
Two-type compartmental model
• Series of statistical studies targeting HSC dynamics [Abkowitz et al1990, Golinelli et al 2009, Catlin et al 2011]
• Cannot resolve questions about later stages of differentiation
• Past studies: intensive simulation study, estimating equations,reversible-jump MCMC
• Can be equivalently treated as a Markov branching process
Multi-type Markov branching processes
• Random vector X(t); Xi (t) denotes type i population at time t
• Cells act independently: can die, reproduce, create other cells
• Independence ⇒ linearity: overall rates are multiplicative in numberof cells
• Time-homogeneity: jump rates are constant over time
• A class of continuous-time Markov chains (CTMCs): memoryless,exponential times between events
Challenges: discretely observed data
The discretely-observed data likelihood
`o(Y|θ) =m∑
p=1
n(p)−1∑i=0
log pXp(tp,i ),Xp(tp,i+1)(tp,i+1 − tp,i |θ)
In particular, need finite-time transition probabilities:
px,y(t) = Pr (X(t + s) = y|X(s) = x)
• Classical matrix exponentiation for CTMCs is O(|Ω|3)
P(t) :=px,y(t)
x,y∈Ω
= eQt =∞∑k=0
(Qt)k
k!.
• When only partially observed (latent process), compounded byadditional marginalization over hidden states
Using the probability generating function φ
φjk(t, s1, s2;θ) = Eθ
(sX1(t)1 s
X2(t)2 |X1(0) = j ,X2(0) = k
)=∞∑l=0
∞∑m=0
p(jk),(lm)(t;θ)s l1sm2 ; |si | ≤ 1
• PGF φjk computed by solving Kolmogorov forward/backward ODEs
• Transition probabilities related via differentiation, but impractical
p(jk),(lm)(t) =1
l!m!
∂ l
∂s1
∂m
∂s2φjk(t)
∣∣∣∣s1=s2=0
• Transform s1 = e2πiw1 , s2 = e2πiw2 ⇒ φ becomes a Fourier series:
φjk(t, e2πiw1 , e2πiw2 ) =∞∑l=0
∞∑m=0
p(jk),(lm)(t)e2πilw1e2πimw2
From differentiation to integration: a spectral trick
• Inverting the Fourier series representation recovers transitionprobabilities efficiently:
p(jk),(lm)(t) =
∫ 1
0
∫ 1
0
φjk(t, e2πiw1 , e2πiw2 )e−2πilw1e−2πimw2dw1dw2
(applying a Riemann sum approximation)
≈ 1
N2
N−1∑u=0
N−1∑v=0
φjk(t, e2πiu/N , e2πiv/N)e−2πilu/Ne−2πimv/N .
• Can simultaneously compute probabilities p(jk),(lm)(t) for alll ,m = 0, . . . ,N via Fast Fourier Transform (FFT)
• Compute discrete-data likelihood practically; similar approach yieldsconditioned moments useful for EM[Xu, Guttorp, Kato-Maeda, Minin 2015]
0.5 1.5
0.00
00.
008
Shift RateTr
ansi
tion
Pro
babi
lity
0.5 1.5
0.00
00.
015
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
020
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
004
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
020
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
20.
010
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
020
Shift RateTr
ansi
tion
Pro
babi
lity
0.5 1.5
0.00
00.
004
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
015
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
008
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
010
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
004
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
0.02
Shift RateTr
ansi
tion
Pro
babi
lity
0.5 1.5
0.00
00.
010
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
004
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
015
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.01
00.
025
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
010
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
0.03
Shift RateTr
ansi
tion
Pro
babi
lity
0.5 1.5
0.00
00.
010
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.01
0.03
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
0.03
0.06
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.01
0.03
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
0.02
0.04
Shift Rate
Tran
sitio
n P
roba
bilit
y
0.5 1.5
0.00
00.
006
Shift RateTr
ansi
tion
Pro
babi
lity
Shift Rate ν
Tran
sitio
n pr
obab
ility
More missing data: partially observed processes
Process X(t) poses same challenges as before, but now we only glimpsepartial information (sampling, measurement error)
• Direct marginalization impractical for large Ω
• Data augmented MCMC: slow mixing, need efficient proposals
• Simulation approaches (particle filtering, SMC) are flexible,but quickly become limited for large populations[Andrieu et al 2010]
Hematopoietic lineage barcoding data
• DNA barcoding experiments enable individual cell lineage trackingthrough time, in vivo
• IID time series data: DNA read counts partially inform thepopulations of each barcode ID present among each cell type
• Monitored at discrete times over 30 months, dataset contains 110million read counts across 9635 unique barcode IDs
• Discrete hidden space of multiple very large, hidden populations
Illustration: experimental design
t0 t1 t2 t3
HSC
Progenitors
Mature Cell Legend:
...
...
t0 t1 t2 t3 ...
B cell readsT cell reads
Read
cou
nts
fo
r
PCR genomicDNA from each sample
Latent process forone barcode lineage
Sampling times
The hidden branching process model
• Latent process: each barcode lineage evolves as acontinuous-time, multitype branching process X(t) whosecomponents are counts of each cell type
• Observation process: flexible choice of emission distributionfor sample counts: we use multivariate hypergeometricdistribution Y ∼ mvhypgeo(X)
• Read data: read counts Y are proportional to Y withunknown amplification constant
Moment-based method of inference
Loss function estimation: match pairwise model-based andempirical correlations across barcode lineages [Xu et al 2017],
L(θ;Y) =∑tj
∑m
∑n 6=m
[ψjmn(θ)− ψj
mn(Y)]2,
ψjmn(θ) = [ρ(Ym(tj),Yn(tj));θ] , and
ψjmn denotes the corresponding sample correlations at time tj
• Estimating parameters θ reduces to nonlinear least squaresoptimization: θ = argminθ L(θ;Y)
• Consistent under mild assumptions: θN → θ0 in probability
A richer class of compartmental models
λνa
µ a
HSC Progenitor
µ 5
µ 3
µ 2
µ 1
Mo
T
B
Gr
NK
µ 4
ν 1
ν 2
ν3
ν 4
ν 5
(a) Multipotent progenitor
Mo
Bλ νa
HSC
T
Gr
NK
νb
Prog.a
Prog.b
µ 5
µ 4
µ 3
µ 2
µ 1ν 1
ν 2
ν3
ν 4
ν 5
µ a
µ b
(b) Oligopotent progenitors
Allows for an arbitrary number of intermediate progenitors and mature
cell types, requiring that each mature type can be descended from only
one possible progenitor type
Fitted correlations: macaque lineage tracking data
0.
00.
20.
40.
60.
81.
0
Pairwise correlations in zh33, one progenitor
Observation Time (Months)
Pai
rwis
e co
rrel
atio
n
2 4.5 6.5 9.5 12 14 21 23 28 30
Gr and MoGr and TGr and BGr and NKMo and TMo and BMo and NKT and BT and NKB and NK
Solid lines denote empirical correlation profiles; dotted lines denotemodel-based correlations from best fitting estimates θ
Overview of results
• HSC self-renewal rate λ = .0593 (every 12 weeks) falls into theconfidence interval (0.0095, 0.0649) obtained in previous primatestudies
• Initial distribution π = .139 consistent with GFP marking levelsstabilizing at 13%
• Intermediate rates νi suggest granulocytes and monocytes areproduced much more rapidly than T, B and NK cells, and individualprogenitors can each produce thousands of cells daily (not previouslyestimated)
• NK cells track distinctly from other mature blood cells
• Single-progenitor models fit best, affirming recent findings [Notta2015] of in vitro human hematopoiesis that challenge the traditionaloligopotency assumption
Evidence against oligopotency
Notta et al. 2015
An open challenge: model selection
Mo
Bλ νa
HSC
T
Gr
NK
νb
Prog.a
Prog.b
µ 5
µ 4
µ 3
µ 2
µ 1ν 1
ν 2
ν3
ν 4
ν 5
µ a
µ b
A model with two oligopotent progenitors instead of one commonmultipoint progenitor leads to poor model fit
An open challenge: model selection
0.0
0.2
0.4
0.6
0.8
1.0
Pairwise correlations in zh33, 2 progenitors
Observation Time (Months)
Cor
rela
tion
2 4.5 6.5 9.5 12 14 21 23 28 30
• Efficient parameterestimation for fittingcell lineage barcodingdata to rich models ofhematopoiesis
• Need model selectiontechniques forrigorous conclusionsabout pathwaystructure
Nonlinear compartmental models
Motivating example: stochastic SIR model of infection
“The associated mathematical manipulations required to generatesolutions can only be described as heroic.”
— E. Renshaw, 2015,Stochastic population processes: analysis, approximations, simulations.
SIR dynamics
Pr(S(t + h) = xh, I (t + h) = yh|S(t) = xt , I (t) = yt)
=
βxtyth + o(h) if (xh, yh) = (xt − 1, yt + 1)
γyth + o(h) if (xh, yh) = (xt , yt − 1)
1− (βxtyt + γyt)h + o(h) if (xh, yh) = (xt , yt)
• Parameters: infection rate β, recovery rate γ
• Nonlinearity arises from interactions: does not satisfy particleindependence ⇒ cannot analyze as branching process
• Finite-time behavior (transition probabilities) challenging
Transition probabilities: a different route
Very briefly, working in the Laplace domain,
φab(s) := L[Pa0b0ab (t)](s) =
∫ ∞0
e−stPa0b0ab (t)dt
satisfies a recursion with continued fraction representation
φ(0)ab (s) =
b∏i=1
xaixa,b+1
Ya,b+1 +xa,b+2Yab
ya,b+2 +xa,b+3
ya,b+3 +xa,b+4
ya,b+4 + · · ·
• Evaluate to finite depth, numerically invert Laplace transform[Ho, Xu, Crawford, Minin, Suchard 2017]
Back to branching processes
• Continued fraction method is limited to moderate outbreak sizes;derivation is delicate and hard to extend
• Two-type branching approximation yields analytic transitionprobabilities
(c) Continued fraction (d) Branching process
Current/future work: correcting the approximationBranching process model as proposal density within MCMC
• Metropolis-Hastings step corrects approximation error
Circles represent true populations. I and R curves proposed frombranching process, given true β, γ, and observed S population
References
• JL Abkowitz, ML Linenberger, MA Newton, GH Shelton, RL Ott, and PGuttorp. Evidence for the maintenance of hematopoiesis in a large animal bythe sequential activation of stem-cell clones.Proceedings of the National Academy of Sciences, 1990.
• SN Catlin, JL Abkowitz, and P Guttorp. Statistical inference in atwo-compartment model for hematopoiesis. Biometrics, 2001.
• D Golinelli, P Guttorp, and JL Abkowitz. Bayesian inference in a hiddenstochastic two-compartment model for feline hematopoiesis.Mathematical Medicine and Biology, 2006.
• J Xu, P Guttorp, MM Kato-Maeda, and VN Minin. Likelihood-based inferencefor discretely observed birth-death-shift processes, with applications to evolutionof mobile genetic elements. Biometrics, 2015.
• C Andrieu, A Doucet, R Holenstein. Particle Markov chain Monte Carlomethods. JRSS: Series B, 2010.
• J Xu, S Koelle, P Guttorp, C Wu, C Dunbar, JL Abkowitz, VN Minin. Statisticalinference in partially observed stochastic compartmental models with applicationto cell lineage tracking of in vivo hematopoiesis. ArXiv preprint, 2016.
• L Ho, J Xu, FW Crawford, VN Minin, MA Suchard . Birth (death)/birth-deathprocesses and their computable transition probabilities with statisticalapplications. Journal of Mathematical Biology, 2017 (to appear).
Simulation study
Infer parameters for partially observed datasets simulated from models inour class: we use the following as an example
Mo
Bλ νa
HSC
T
Gr
NK
νb
Prog.a
Prog.b
µ 5
µ 4
µ 3
µ 2
µ 1ν 1
ν 2
ν3
ν 4
ν 5
µ a
µ b
Results: simulation study
−1
01
2
Estimated rates, 400 synthetic datasets
Parameter
Rel
ativ
e er
ror
λ νa νb µa µb ν1 ν2 ν3 ν4 ν5 πa πb
Gran + Mono
T + B
λνa
ν1
ν2
ν3
µ a
µ 2
µ 1
µ 3
HSC Progenitor
NK
0.0
0.2
0.4
0.6
0.8
1.0
Pairwise correlations, three mature cell groupings
Observation Time (Months)
Cor
rela
tion
2 4.5 6.5 9.5 12 14 21 23 28 30
Gr+Mono and T+BGr+Mono and NKT+B and NKFitted ValuesGCSF Mobilization Dates
Histogram of relative errors across all parameters
Relative error
Fre
quen
cy
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
020
040
060
080
010
00
Performance under model misspecification
20 40 60 80 100 120 140
−0.
50.
00.
51.
0
Fitting three−progenitor model on two−progenitor data
Observation Time
Cor
rela
tion
True parametersMisspecified Fit
Figure: Inference on same synthetic data generated from the two-progenitormodel, but misspecifying a three-progenitor model
Performance under model misspecification
20 40 60 80 100 120 140
−0.
50.
00.
51.
0
Fitting one−progenitor model on two−progenitor data
Observation Time
Cor
rela
tion
True parametersMisspecified Fit
Figure: Here we wrongly assume there is one common progenitor
Performance under model misspecification
20 40 60 80 100 120 140
−0.
50.
00.
51.
0
Lumping five−type dataset into three mature compartments
Observation Time
Cor
rela
tion
True observed correlationMisspecified Fit
Figure: When we lump mature compartments together but otherwise correctlyspecify progenitors they are descended from, the fit is still good.
Parameter estimates
Estimated parameter 5-type model fit 3-type model fit
HSC renewal λ 0.0634 0.0497HSC diff ν0 2.80 × 10−6 1.11 × 10−5
Progen. death µ0 0.000 0.000Progen. diff. to Type 1 ν1 1614.7 2635.7
ν2 6093.6 283.3ν3 39.6 173.1ν4 126.1 NAν5 64.4 NA
Mature death of Type 1 µ1 0.5 0.7µ2 0.7 0.01µ3 0.01 0.40µ4 0.01 NAµ5 0.45 NA
Percentage barcoded at HSC 0.289 0.148
The emission distribution
Observation model: read count data Yp(t) for barcode p at timet are distributed according to multivariate hypergeometricdistribution:
Y p1 (t) | X(t) ∼ hypergeom(N3,X
p3 (t), n1),
Y p2 (t) | X(t) ∼ hypergeom(N4,X
p4 (t), n2),
• n1 and n2 are known numbers of sampled cells of types 3(Gr+Mono) and 4 (T+B+NK)
• N3 and N4 are known total numbers of barcoded type 3 and 4cells in the animal
• X p3 (t) and X p
4 (t) are unknown numbers of types 3 and 4 cellswith barcode p
Fitted correlation plots
20 40 60 80 100 120 140
0.85
0.90
0.95
1.00
Pairwise correlation profiles, synthetic data
Observation Time
Cor
rela
tion
True parametersFitted parameters
Fitted correlation plots
20 40 60 80 100 120 140
−0.
50.
00.
51.
0
Synthetic data with two distinct progenitors
Observation Time
Cor
rela
tion
True parametersFitted parameters
Wu et al. 2014