Page 1: Hierarchically-coupled hidden Markov models for learning kinetic rates from single-molecule data (proceedings.mlr.press/v28/willemvandemeent13.pdf)

Hierarchically-coupled hidden Markov models for learning kinetic rates from single-molecule data

Jan-Willem van de Meent 1, Jonathan E. Bronson 1, Frank Wood 2, Ruben L. Gonzalez Jr. 1, Chris H. Wiggins 1
1 Columbia University, New York, NY, USA; 2 University of Oxford, Oxford, UK

Abstract

We address the problem of analyzing sets of noisy time-varying signals that all report on the same process but confound straightforward analyses due to complex inter-signal heterogeneities and measurement artifacts. In particular we consider single-molecule experiments which indirectly measure the distinct steps in a biomolecular process via observations of noisy time-dependent signals such as a fluorescence intensity or bead position. Straightforward hidden Markov model (HMM) analyses attempt to characterize such processes in terms of a set of conformational states, the transitions that can occur between these states, and the associated rates at which those transitions occur; but require ad-hoc post-processing steps to combine multiple signals. Here we develop a hierarchically coupled HMM that allows experimentalists to deal with inter-signal variability in a principled and automatic way. Our approach is a generalized expectation maximization hyperparameter point estimation procedure with variational Bayes at the level of individual time series that learns a single interpretable representation of the overall data generating process.

1. Introduction

Over the past two decades, single-molecule biophysical techniques have revolutionized the study of Nature's biomolecular machines by enabling direct observation of some of the cell's most fundamental and complex biochemical reactions (Tinoco & Gonzalez, 2011; Joo et al., 2008; Borgia et al., 2008; Neuman & Nagy, 2008; Cornish & Ha, 2007). At the molecular level, such reactions can be described as a 'kinetic scheme' that details the number of conformational states accessible to the biomolecular machine, the order in which these states are explored, and the associated transition rates. In single-molecule biophysics, the goal of inference is to reconstruct this kinetic scheme from a noisy time-dependent observable, such as a fluorescence intensity or bead position, which indirectly reports on the state transitions in a biomolecular process.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

The learning task that motivates this work is the estimation of a consensus kinetic scheme from a multitude of time series. Single-molecule experiments commonly yield observations for an 'ensemble' of hundreds of molecules. An illustration of this analysis problem can be seen in figure 1. The time series shown are obtained in a single-molecule fluorescence resonance energy transfer (smFRET) experiment, where a fluctuating intensity ratio EFRET = IA/(ID + IA) of two fluorescent molecules, known as the donor and acceptor, reports on the conformational transitions of a molecule of interest. Conditioned on the conformational state these observables are approximately normally distributed, but their means vary significantly over the ensemble owing to a combination of image-processing artifacts and physical inhomogeneities. Moreover, the number of conformational transitions that can be observed for each molecule is limited by the life-time of the fluorescent labels, which have a fixed probability of photobleaching upon each excitation. Learning a common set of states and associated transition probabilities from time series such as in figure 1 remains a challenging task.

Figure 1. (a) Sample time series from a single-molecule fluorescence resonance energy transfer (smFRET) experiment, which measures a ratio of intensities of two fluorescent labels EFRET = IA/(ID + IA) to detect conformational transitions between two states (see section 5). Analysis with a variational Bayesian HMM shows states with means and precisions that vary significantly from molecule to molecule. Moreover photobleaching of the fluorophores results in time series of variable (exponentially distributed) lengths. (b) Histogram of observables xn,t, and inferred state means µn,k (red) along with empirical Bayes estimates of priors (dotted lines) for an ensemble of N = 336 time series from a single experiment.

Existing approaches use maximum likelihood (McKinney et al., 2006; Greenfeld et al., 2012) or variational Bayesian (Bronson et al., 2009) estimation on HMMs to infer independent models for each individual time series, resulting in a set of similar but variable parameter estimates. Experimentalists then resort to an ad-hoc semi-manual binning of states with similar observation means, after which the corresponding binned transition counts can be averaged over the ensemble to obtain a more informed estimate of the consensus kinetic rates, implicitly assuming an identical set of transition probabilities for all time series.

The contribution presented here is to develop hierarchically coupled HMMs to learn a distribution on the parameters for each state. Specifically, we assume an HMM joint p(xn, zn | θn) for the observables xn and latent states zn for each time series n ∈ [N], conditioned on a set of parameters θn drawn from a shared prior p(θ | ψ). In the statistical community this is known as a conditionally independent hierarchical model (Kass & Steffey, 1989). The hyperparameters are estimated with an empirical Bayes approach (Morris, 1983). This generalized expectation maximization (EM) procedure iteratively performs variational Bayes (VB) estimation for each time series, after which the summed lower bound is maximized with respect to ψ.

The resulting procedure, which we will call variational empirical Bayes (VEB), infers a single consensus parameter distribution from an ensemble of time series that represents the kinetic scheme that is of experimental interest. The VEB procedure also yields more accurate inference results in individual time series. This is a well-known feature of empirical Bayes models (Berger, 1982), which intuitively follows from the fact that the shared prior p(θ | ψ) incorporates knowledge of parameter values across the ensemble that can aid the inference process in individual time series. Finally our hierarchical construction allows comparison of prior densities to their ensemble-averaged posterior estimates, providing the experimentalist with an intuitive diagnostic for the agreement between observed experimental data and a chosen graphical model, which can then inform the next iteration of model design.

2. Related Work

Variational approaches are a mainstay of Bayesian inference (Jordan et al., 1999; Wainwright & Jordan, 2008; Bishop, 2006). Variational Bayes estimation for HMMs has previously been applied to smFRET experiments (Bronson et al., 2009; Fei et al., 2009; Bronson et al., 2010), and these techniques have both been used directly (Sorgenfrei et al., 2011) and in a modified form (Okamoto & Sako, 2012) in the context of several other single-molecule experiments.

Methods that obtain point estimates for a set of hyperparameters are known under a variety of names, including empirical Bayes, type II maximum likelihood, generalized maximum likelihood and evidence approximation (Bishop, 2006). A common use for such techniques is to adaptively set the hyperparameters of a model to values that are appropriate for a given inference application. These approaches have the well-documented feature of allowing more accurate inferences for individual samples in learning scenarios where parameters are correlated (Morris, 1983; Kass & Steffey, 1989; Carlin & Louis, 1996), which can be interpreted as a generalization of the Stein effect (Casella, 1985; Stein, 1981; Berger, 1982). In some cases, such as mixture models, type II maximum likelihood methods can also be employed in a quasi non-parametric manner, by using a larger than required number of mixture components and relying on the inference procedure to leave superfluous components unpopulated (Corduneanu & Bishop, 2001).

Figure 2. HMMs for ensembles of time series. (a) Our model employs a separate set of hyperparameters for each state, representing a consensus model whose parameters are allowed to vary across the ensemble. (b) A fully shared approach with a single set of K states, which are identical for all N time series, and are each drawn from the same (typically uninformative) prior.

In this biophysical application, the hyperparameters provide an explicit representation of the common features in multiple time series, which are individually represented as HMMs with Gaussian observations

xn,t | zn,t = k ∼ Normal(µn,k, λn,k),
zn,t | zn,t−1 = k ∼ Discrete(An,k),
zn,0 ∼ Discrete(πn).

For each of the n ∈ [N] time series, the set of parameters θn = {µn, λn, An, πn} is drawn from the priors

µn,k ∼ Normal(mk, βkλn,k),
λn,k ∼ Gamma(ak, bk),
An,k ∼ Dirichlet(αk),
πn ∼ Dirichlet(ρ).

This graphical model (fig 2a) represents each of the k ∈ [K] consensus states with a set of hyperparameters ψ = {mk, βk, ak, bk, αk, ρ}, introducing a hierarchical coupling that can be used to represent an ensemble of HMMs with similar yet not identical parameters. This type of construction is more expressive than a model with identical parameters for each time series (fig 2b). At the same time the use of differentiated priors defines a correspondence between the states in individual time series and a set of consensus states, eliminating label-switching issues across the ensemble.
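As a concrete illustration, the generative process defined by the equations above can be sampled directly. The following is a minimal numpy sketch; the two-state setup and all hyperparameter values are illustrative assumptions, not values from the paper. Note that the second argument of the Normal distributions above is a precision, so the sampling standard deviations are 1/sqrt(βλ) and 1/sqrt(λ).

```python
import numpy as np

def sample_ensemble(psi, N, T, rng):
    """Sample N time series of length T from the hierarchical HMM.

    psi holds per-state hyperparameters: m, beta, a, b (Normal-Gamma),
    alpha (rows of the Dirichlet transition prior) and rho (initial-state
    prior). Values below are illustrative, not from the paper.
    """
    K = len(psi["m"])
    data = []
    for n in range(N):
        # Per-series parameters theta_n drawn from the shared priors.
        lam = rng.gamma(psi["a"], 1.0 / psi["b"])             # precisions
        mu = rng.normal(psi["m"], 1.0 / np.sqrt(psi["beta"] * lam))
        A = np.stack([rng.dirichlet(psi["alpha"][k]) for k in range(K)])
        pi = rng.dirichlet(psi["rho"])
        # Latent state trajectory and Gaussian observations.
        z = np.empty(T, dtype=int)
        z[0] = rng.choice(K, p=pi)
        for t in range(1, T):
            z[t] = rng.choice(K, p=A[z[t - 1]])
        x = rng.normal(mu[z], 1.0 / np.sqrt(lam[z]))
        data.append((x, z))
    return data

psi = {"m": np.array([0.35, 0.55]), "beta": np.array([50.0, 50.0]),
       "a": np.array([5.0, 5.0]), "b": np.array([0.05, 0.05]),
       "alpha": np.array([[8.0, 2.0], [2.0, 8.0]]),
       "rho": np.array([1.0, 1.0])}
data = sample_ensemble(psi, N=5, T=100, rng=np.random.default_rng(0))
```

Because each series draws its own (µn, λn, An, πn), the sampled ensemble exhibits exactly the kind of similar-but-not-identical behavior the hierarchical coupling is designed to capture.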

As we will see in our discussion of results, estimation of the hyperparameters allows us to take advantage of the known features of empirical Bayes methods, including increased accuracy of inferred parameters and adaptive selection of model complexity. However, from the point of view of modeling single-molecule biophysical systems the most substantial advantage of this type of approach is one of dimensionality reduction. A comparatively small set of K(K+5) hyperparameters provides a single model that incorporates the information contained in hundreds of single-molecule time series in a statistically robust manner. In contrast to applications where empirical Bayes estimation is used to aid posterior inference, the learned hyperparameters in this hierarchical model are in fact the primary quantity of interest.

3. Variational empirical Bayes

All variants of expectation maximization procedures have equivalent formulations in terms of the minimization of a Kullback-Leibler (KL) divergence, or the maximization of a lower bound on a (marginal) likelihood. In the first formulation an optimization criterion for parametric empirical Bayes models can be written as

ψ∗ = argmin_ψ DKL[ p(z, θ | x, ψ) || p(z, θ | ψ) ].    (1)

Intuitively this self-consistency relationship states that the joint prior p(z, θ | ψ∗) should be chosen to be as similar as possible to the posterior p(z, θ | x, ψ∗).

A dual problem for equation 1 can be constructed by introduction of a variational approximation to the posterior. Given any function q(z, θ), the marginal likelihood p(x | ψ), also known as the evidence, can be trivially represented as

p(x | ψ) = [ p(x, z, θ | ψ) / q(z, θ) ] · [ q(z, θ) / p(z, θ | x, ψ) ].

Taking the logarithm of the above equation, followed by an expectation over q(z, θ), yields the relationship

Lveb = Eq[ log p(x, z, θ | ψ) / q(z, θ) ] = log p(x | ψ) − DKL[ q(z, θ) || p(z, θ | x, ψ) ].

In a model where p(x, z, θ | ψ) = p(x | z, θ) p(z, θ | ψ) we can additionally write

Lveb = Eq[ log p(x | z, θ) ] − DKL[ q(z, θ) || p(z, θ | ψ) ].

Because Lveb ≤ log p(x | ψ), this quantity is often called an evidence lower bound (ELBO). Maximization of Lveb with respect to q(z, θ) is equivalent to minimization of the KL-divergence between q(z, θ) and the posterior, whereas maximization with respect to ψ minimizes the KL-divergence between q(z, θ) and the prior. These two steps can be combined to construct an iterative optimization algorithm

δLveb / δq(z, θ) = 0,    ∂Lveb / ∂ψ = 0.

If the posterior could be calculated directly, we could solve the first equation by setting q(z, θ) = p(z, θ | x, ψ). The second equation then gives us the solution to the hyperparameter estimation problem posed in equation 1. In this scenario, parametric empirical Bayes estimation is equivalent to an expectation maximization (EM) algorithm for the marginal likelihood p(x | ψ).
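The identity Lveb = log p(x | ψ) − DKL[q || p(z, θ | x, ψ)] can be checked numerically on a toy model. Below is a hypothetical two-state example with a single latent variable z and no parameters θ, chosen only to exercise the identity; the joint probabilities and q are arbitrary.

```python
import numpy as np

# Joint p(x, z) for a fixed observation x over two latent states,
# and an arbitrary (non-posterior) approximation q(z).
p_xz = np.array([0.12, 0.28])           # p(x, z = 0), p(x, z = 1)
q = np.array([0.6, 0.4])

evidence = p_xz.sum()                   # p(x)
posterior = p_xz / evidence             # p(z | x)

elbo = np.sum(q * np.log(p_xz / q))     # E_q[log p(x, z) / q(z)]
kl = np.sum(q * np.log(q / posterior))  # KL[q || p(z | x)]

# ELBO = log evidence - KL, hence ELBO <= log evidence,
# with equality exactly when q matches the posterior.
assert np.isclose(elbo, np.log(evidence) - kl)
```

Setting q equal to the posterior makes the KL term vanish and the bound tight, which is the discrete analogue of the self-consistency argument above.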

For models where the posterior cannot be calculated directly we can substitute any analytically convenient form for q(z, θ) and calculate an approximation

q∗(z, θ) = argmax_{q(z,θ)} Lveb[q(z, θ), ψ] = argmin_{q(z,θ)} DKL[ q(z, θ) || p(z, θ | x, ψ) ].

An often convenient choice is a factorized form

q(z, θ) = ∏_{n=1}^{N} q(zn) q(θn),

which in conjugate exponential family models guarantees that the approximate posterior has the same analytical form as the prior, i.e. q(θn) = p(θn | ψn), where ψn is a set of variational posterior parameters that can typically be expressed as a weighted average of the hyperparameters and the expectation of a set of sufficient statistics (Beal, 2003).

With the above factorization, we can construct an approximate inference procedure for parametric empirical Bayes models, which we call variational empirical Bayes (VEB), that optimizes Lveb through

δLveb / δq(zn) = 0,    δLveb / δq(θn) = 0,    ∂Lveb / ∂ψ = 0.

This method is directly related to existing variational inference techniques on HMMs, which employ the lower bound

Lvbn = Eq[ log p(xn, zn, θn | ψ) / (q(zn) q(θn)) ] = log p(xn | ψ) − DKL[ q(zn) q(θn) || p(zn, θn | xn, ψ) ].

Because Lveb = Σn Lvbn, optimization w.r.t. q(z, θ) simply reduces to performing VB estimation on each of the individual time series.

The hyperparameter updates ∂Lveb / ∂ψ = 0 take the form of a coupled set of equations over the n ∈ [N] time series in the ensemble

0 = Σn Eq(θn)[ ∂ log p(θn | ψ) / ∂ψ ].

We can express these coupled equations in terms of a set of ensemble averages, defined by

Eq[ · ] = (1/N) Σn Eq(θn)[ · ].

For a Dirichlet prior, the hyperparameter updates simply match the log expectation values

Ep[log Ak] = Eq[log Ak],    Ep[log π] = Eq[log π].

In terms of ψ these log expectation values can be expressed in terms of the digamma function Ψ

Ψ[Σm αkm] − Ψ[αkl] = (1/N) Σn ( Ψ[Σm αn,km] − Ψ[αn,kl] ).

While these equations have no analytical solution their stationary point can be found efficiently with a Newton iteration method (Minka, 2012).
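A sketch of this solution, assuming Minka's fixed-point iteration with an inverse-digamma Newton step (the function names and the three-component example are illustrative; the target s plays the role of the ensemble-averaged log expectations on the right-hand side above):

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=5):
    """Invert the digamma function with Newton's method (Minka's init)."""
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(iters):
        x -= (digamma(x) - y) / polygamma(1, x)  # polygamma(1, .) = trigamma
    return x

def fit_dirichlet(s, iters=1000, tol=1e-12):
    """Solve digamma(alpha_l) - digamma(sum(alpha)) = s_l for alpha,
    where s_l is the ensemble average of the posterior log expectations."""
    alpha = np.ones_like(s)
    for _ in range(iters):
        new = inv_digamma(digamma(alpha.sum()) + s)
        if np.max(np.abs(new - alpha)) < tol:
            return new
        alpha = new
    return alpha

# Recover a known alpha from its own expected log statistics.
alpha_true = np.array([8.0, 2.0, 1.0])
s = digamma(alpha_true) - digamma(alpha_true.sum())
alpha_hat = fit_dirichlet(s)
```

The fixed point converges linearly, so a generous iteration cap with an early-exit tolerance is a safe default.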

Maximization with respect to the Normal-Gamma prior on µ and λ similarly yields four relationships between prior expectation values and their ensemble posterior averages

Ep[µk] = Eq[µkλk] / Eq[λk],
Ep[λk(µk − mk)²] = Eq[λk(µk − mk)²],
Ep[log λk] = Eq[log λk],
Ep[λk] = Eq[λk],

which are equivalent to the following updates for ψ

mk = Eq[µkλk] / Eq[λk],
1/βk = Eq[µk²λk] − Eq[µkλk]² / Eq[λk],
Ψ(ak) − log(ak) = Eq[log λk] − log Eq[λk],
bk = ak / Eq[λk],

where the ensemble expectations are given by

Eq[λk] = (1/N) Σn an,k / bn,k,
Eq[log λk] = (1/N) Σn ( Ψ(an,k) − log bn,k ),
Eq[µkλk] = (1/N) Σn mn,k an,k / bn,k,
Eq[µk²λk] = (1/N) Σn ( 1/βn,k + (mn,k)² an,k / bn,k ).
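These updates can be implemented directly. The sketch below assumes the per-series posterior parameters for one state are collected into arrays, and solves the implicit equation for ak by Newton's method in log-space; the function name and example values are illustrative.

```python
import numpy as np
from scipy.special import digamma, polygamma

def update_normal_gamma(m_n, beta_n, a_n, b_n):
    """Hyperparameter updates for one consensus state k.

    Inputs are length-N arrays of per-series variational posterior
    parameters; returns the updated (m_k, beta_k, a_k, b_k).
    """
    # Ensemble averages of the required posterior expectations.
    E_lam = np.mean(a_n / b_n)
    E_loglam = np.mean(digamma(a_n) - np.log(b_n))
    E_mulam = np.mean(m_n * a_n / b_n)
    E_mu2lam = np.mean(1.0 / beta_n + m_n**2 * a_n / b_n)

    m_k = E_mulam / E_lam
    beta_k = 1.0 / (E_mu2lam - E_mulam**2 / E_lam)

    # Solve digamma(a) - log(a) = c by Newton's method in log-space,
    # which keeps a positive (the left-hand side is increasing in a).
    c = E_loglam - np.log(E_lam)
    a_k = 1.0
    for _ in range(50):
        f = digamma(a_k) - np.log(a_k) - c
        df = a_k * polygamma(1, a_k) - 1.0   # derivative w.r.t. log(a)
        a_k *= np.exp(-f / df)
    b_k = a_k / E_lam
    return m_k, beta_k, a_k, b_k

# Sanity check: if every series has the same posterior (m = 0.5,
# beta = 10, a = 3, b = 2), the updates should recover those values.
m_k, beta_k, a_k, b_k = update_normal_gamma(
    np.full(4, 0.5), np.full(4, 10.0), np.full(4, 3.0), np.full(4, 2.0))
```

The m, β and b updates are closed-form moment matches; only the shape parameter a requires the small inner root-finding loop.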


Figure 3. Validation on simulated data. (a) Errors in inferred transition pseudocounts, separated into diagonal 'occupancy' and off-diagonal 'transition' terms. By both measures the VEB algorithm (blue) significantly outperforms the VB (green) and ML (orange) methods. (b) Histograms (logarithmically scaled) of inferred vs true state mean µz(t), for simulated datasets with K = 4. The increasing inference error is visible in the progressively blurred distribution along the vertical axis, which is markedly reduced when using VEB inference. (c) Effective number of states. ML and VB estimation (using the BIC and ELBO for model selection respectively) both systematically underestimate the number of states, whereas the VEB algorithm shows significantly improved performance, even at high noise.

4. Performance on simulated data

Because VEB estimation is a generalized EM algorithm for the hyperparameters, it is in principle subject to the usual concerns about overfitting and the number of restarts required to avoid local maxima. At the same time, results for type II maximum likelihood estimation of mixture weights suggest (Corduneanu & Bishop, 2001) that such estimation schemes may leave superfluous states unpopulated, and the empirical Bayes literature suggests that hyperparameter estimation results in more accurate posterior inference (Morris, 1983). As a pragmatic approach to evaluating the performance of this method we have sampled simulated datasets from a range of experimentally plausible parameters, and compared inference results from the VEB algorithm to those obtained with maximum likelihood (ML) and variational Bayesian (VB) methods.

We performed inference on simulated datasets containing N = 500 time series, each with T = 100 time points, exploring K = 3, 4, 5 evenly spaced states at distance ∆µ = 0.2, and increasing the noise relative to this separation. Noise levels σ = λ^(−1/2)/∆µ range from 0.1 (unrealistically noiseless) to 0.9 (unrealistically noisy), where 0.5 can be considered an upper limit for data that still permits analysis with state-of-the-art algorithms. In typical experiments, the variance of apparent state means Vp(θ|ψ)[µ] is of a similar order of magnitude as the variance of the observables Vp(x|θ,z)[x]. For the purposes of this numerical experiment, we simulated a relatively homogeneous ensemble with Vp(θ|ψ)[µ] = 0.4 Vp(x|θ,z)[x] in order to isolate the influence of the experimental signal-to-noise ratio on the difficulty of the inference problem.

In single-molecule biophysical applications, the goal of inference is to reconstruct a kinetic scheme in the form of a set of consensus states and associated transition probabilities. For this reason the quantities that are of primary interest are the transition pseudo-counts

ξn,ij = Σt q(zn,t = i, zn,t+1 = j) = αn,ij − αij.

In ML and VB estimation, the inferred states in each time series are unconnected, so we must construct some form of mapping (n, i) ↦ (n, k) to identify a set of consensus states. Here we use an approach that could typically be applied in existing experimental work flows, which is to simply cluster the inferred state means using a Gaussian mixture model. The remapped set of counts ξn,kl can now be used to calculate the error relative to the true counts ξ⁰n,kl that were sampled from the generative model.
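A simplified sketch of this remapping step, assuming nearest-consensus-mean assignment as a stand-in for the Gaussian mixture clustering used in practice (the function name and example values are hypothetical):

```python
import numpy as np

def remap_counts(xi_n, mu_n, mu_consensus):
    """Remap a per-series transition count matrix onto consensus states.

    xi_n: (K, K) pseudocounts for series n, indexed by local states.
    mu_n: (K,) inferred state means for series n.
    mu_consensus: (K,) consensus state means, e.g. from clustering the
    pooled per-series means. Nearest-mean assignment here is a
    simplified stand-in for fitting a Gaussian mixture model.
    """
    K = len(mu_consensus)
    assign = np.argmin(np.abs(mu_n[:, None] - mu_consensus[None, :]), axis=1)
    xi = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            xi[assign[i], assign[j]] += xi_n[i, j]
    return xi

# A series whose local state order is flipped relative to the consensus:
# its counts land on the consensus states in the consistent order.
xi_n = np.array([[9.0, 1.0], [2.0, 8.0]])
xi = remap_counts(xi_n, mu_n=np.array([0.55, 0.35]),
                  mu_consensus=np.array([0.35, 0.55]))
```

In the hierarchical model this remapping is unnecessary, since the differentiated priors already tie local states to consensus states.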

We identify an 'occupancy' error for the diagonal terms and a 'transition' error for the off-diagonal terms, both defined in terms of the number of false positives and false negatives in the inferred transition counts. These quantities have the advantage of having simple intuitive interpretations: one is the error in the fraction of time spent in each state, whereas the other is the fraction of spurious or missed transitions.
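One plausible formalization of these two error measures is sketched below; the exact normalization is an illustrative assumption, not a formula given in the paper.

```python
import numpy as np

def count_errors(xi_hat, xi_true):
    """Split count-matrix error into 'occupancy' (diagonal) and
    'transition' (off-diagonal) terms: false positives plus false
    negatives, normalized by the corresponding true totals."""
    diff = np.abs(xi_hat - xi_true)   # per-entry FP + FN counts
    diag = np.eye(len(xi_true), dtype=bool)
    occ_err = diff[diag].sum() / xi_true[diag].sum()
    trans_err = diff[~diag].sum() / xi_true[~diag].sum()
    return occ_err, trans_err

# Example: small errors in self-transitions, larger relative errors
# in the (much rarer) off-diagonal transition counts.
occ, trans = count_errors(np.array([[90.0, 12.0], [8.0, 88.0]]),
                          np.array([[92.0, 10.0], [10.0, 88.0]]))
```

Because off-diagonal counts are typically far smaller than diagonal ones, normalizing each block separately keeps the transition error from being swamped by occupancy.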


The error rates obtained from this analysis are shown in figure 3a. By both measures, the VEB algorithm significantly outperforms the ML and VB approaches over the entire range of noise levels. From an experimental perspective, this suggests that switching from a VB to a VEB procedure yields an improvement in accuracy comparable to increasing the signal-to-noise ratio of the measurements by a factor of 2.

Figure 3b shows 2-dimensional histograms (logarithmically scaled) that represent the joint distribution of inferred and true means µk. Learning parameter distributions for each state results in well-defined peaks of the inferred posterior means, whereas the distributions for ML and VB become progressively more poorly defined as the noise level increases.

To evaluate the accuracy of inference in terms of the number of states populated in each time series, we calculate a quantity known as the effective number of degrees of freedom, which is based on the Shannon entropy of the time-averaged posterior

Keff = exp[H(Q(z))],    Q(z) = (1/T) Σt q(zt = z).

Note that for a discrete distribution with K choices, equal probability Q(z) = 1/K for each outcome yields

Keff = exp[ Σ_{k=1}^{K} −(1/K) log(1/K) ] = K,

whereas any distribution of lower entropy will have a correspondingly smaller number Keff, resulting in a continuous measure of the degrees of freedom as a function of state occupancy.
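This computation is short enough to state directly (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def effective_states(q):
    """Effective number of states: exp of the Shannon entropy of the
    time-averaged posterior. q is a (T, K) array of state probabilities."""
    Q = q.mean(axis=0)          # time-averaged posterior over states
    Qp = Q[Q > 0]               # drop zero entries (0 log 0 := 0)
    return np.exp(-np.sum(Qp * np.log(Qp)))

# A uniform posterior over K = 4 states gives Keff = 4; a posterior
# concentrated on a single state gives Keff = 1.
assert np.isclose(effective_states(np.full((100, 4), 0.25)), 4.0)
assert np.isclose(effective_states(np.tile([1.0, 0, 0, 0], (100, 1))), 1.0)
```

Intermediate occupancies interpolate smoothly between these extremes, which is what makes Keff a useful continuous diagnostic.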

Figure 3c shows 2-dimensional histograms of the inferred and true number of effective states for the same dataset shown in figure 3b. In VB and ML inference, we perform model selection on each time series using the lower bound Lvb and the BIC respectively. VEB is given the correct maximum number of states at the hyperparameter level, but no additional model selection is performed for individual time series. The results in figure 3c show a clear picture: at high noise levels, both VB and ML estimation procedures underestimate the number of states, whereas VEB estimation is much more robust.

These results can each be seen as instances of the well-known Stein effect (Berger, 1982): an estimator that exploits shared information between observations, in our case different time series, can provide more accurate inferences than an estimator that lacks this shared information. Empirical Bayes methods obtain information about the distribution of the parameters p(θ | ψ) that is not available in ML and VB approaches. In other words, the prior knowledge encoded in the estimated hyperparameters can be used to obtain more robust inference results, as visible in the transition probabilities (fig 3a) and state means (fig 3b).

We additionally observe that VEB inference populates the correct number of states in individual time series without needing model selection criteria (fig 3c), which is consistent with the results obtained for mixture models (Corduneanu & Bishop, 2001). A caveat is that all three analysis methods have thus far been supplied with a correct guess of the maximum number of states at the ensemble level. We will return to the question of ensemble model selection in section 6.

5. Analysis of smFRET data

The VEB method introduced in this paper can be implemented for any single-molecule experiment amenable to analysis with HMMs and related graphical models. Here we focus on the analysis of smFRET experiments, which track conformational changes in real time by measuring the anti-correlated intensity changes of a pair of fluorophores, known as the donor and acceptor. Figure 4 shows analysis of a set of experiments that investigate the mechanism of protein synthesis, or translation, by the bacterial ribosome. Specifically these experiments focus on a process known as translocation, the precise motion of the ribosome along its messenger RNA (mRNA) template as it synthesizes the protein encoded by the mRNA (for a review see (Tinoco & Gonzalez, 2011)).

The hypothesized mechanism for translocation (figure 4a) breaks down into 3 kinetic steps. The first step is a thermally driven, reversible transition between two conformational states of the pre-translocation ribosomal complex (labeled GS1 and GS2). This transition is followed by the binding of a GTPase translation factor, EF-G, which has the effect of stabilizing GS2 long enough to enable a GTP hydrolysis-driven movement of the ribosome along the mRNA template. The first two steps in this kinetic scheme can be studied experimentally by substitution of GTP with a non-hydrolyzable GTP analogue, preventing the GTP hydrolysis-driven final step in the translocation reaction. Figure 4b shows two time series that measure reversible transitions between the GS1 and GS2 states, customarily plotted as the intensity ratio of the donor and acceptor fluorophores, which is given by EFRET = IA/(IA + ID). The first, recorded in the absence of EF-G, shows a preference for the GS1 state. The second time series, taken from an experiment where 500 nM EF-G was added to the solution, shows a shift of the equilibrium towards the GS2 state.


Figure 4. smFRET studies of translocation in the bacterial ribosome (Fei et al., 2009). (a) The dominant pathway for translocation is believed to have three steps: a reversible rotation of the two subunits (purple and tan), followed by the binding of EF-G (green) which stabilizes the rotated GS2 state long enough for a GTP-driven transition to the post-translocation (POST) complex to take place. (b) smFRET signals show a shift of the equilibrium from the GS1 state (magenta, EFRET ≈ 0.55) towards the GS2 state (cyan, EFRET ≈ 0.35) in the presence of EF-G. (c) Histograms of observables xn,t, split by state, showing a continuous shift of the equilibrium towards GS2 as the concentration of EF-G is increased. Out of 4 states used for inference, only two dominant states are significantly populated. (d) Summed posterior distributions on the relative free energy q(∆Gn,k), showing a weak bi-modal distribution in the free energy of the GS2 state whose lower component becomes more populated with increasing EF-G concentration.

Analysis of 4 experiments performed with EF-G concentrations of 0 nM to 250 nM is shown in figure 4c. We perform VEB inference with 4 states on the ensemble of traces from all experiments to learn a set of shared states for the entire series. In each of the experiments we observe two dominant states, corresponding to the GS2 (cyan) and GS1 (magenta) states respectively. Histograms of observations, on the top row, show a shift in the GS1-GS2 equilibrium towards the GS2 state as the concentration of EF-G is increased.

The marginal probability for each state can be expressed in terms of a free energy Gk via the relation q(z = k) ∼ exp[−Gk/kBT]. An energy relative to the other states can be defined as

∆Gk(A) = log [ Σ_{l≠k} Akl / Σ_{l≠k} Alk ],

allowing formulation of a set of posterior distributions

q(∆Gn,k | ψ) = ∫_{∆Gk(An) = ∆Gn,k} dAn,k q(An,k | ψ).

Figure 4d shows the posterior ensemble-averaged distributions on ∆Gk. The distribution for the GS1 state exhibits a bi-modal signature, indicative of a mixed population of EF-G bound and unbound molecules. At low EF-G concentrations, the lower mode dominates, whereas the upper mode takes over at high EF-G concentrations, indicating an energetically favorable GS2 state. This type of bi-modal signature in the posterior would be difficult, if not outright impossible, to discern with existing non-hierarchical approaches.

6. Model Selection Criteria

Our analysis of simulated data in section 4 suggests that the VEB procedure may exhibit a resistance to overfitting similar to what has been observed in other type II maximum likelihood studies. In particular, we could supply the method with a larger-than-sufficient maximum number of states at the hyperparameter level and hope to learn a model that leaves superfluous degrees of freedom unpopulated. To see how well such an approach works in practice, we compare analysis of our experimental dataset in the absence of EF-G to a simulated dataset with similar hyperparameters; i.e., we first perform inference with a single population on the experimental dataset, and then use the inferred hyperparameters to generate our simulated data.

Results of this analysis are shown in figure 5. Inference on experimental data shows a more or less monotonic increase of the lower bound. The same trend is visible in the lower bound on held-out data obtained from 10-fold cross-validation. The reason for this becomes apparent when examining histograms of the observations, shown in figure 5c. On a semi-logarithmic scale, normally distributed observables should show a parabolic profile. The experimental histograms at K = 2 show significant discrepancies with respect to this distribution, in the form of asymmetries and long tails, which are separated into additional states when inference is performed with a higher number of degrees of freedom.
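For concreteness, the 10-fold cross-validation mentioned above splits held-out data at the level of whole traces. A minimal sketch of such a split (helper name and seed are ours, not from the paper):

```python
import numpy as np

def kfold_trace_indices(n_traces, k=10, seed=0):
    # randomly assign each time series to one of k held-out folds;
    # inference runs on k-1 folds, the lower bound is evaluated on the rest
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_traces), k)

folds = kfold_trace_indices(350, k=10)
```

Splitting by trace rather than by time point respects the hierarchical structure, since each trace carries its own parameters.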

Given that the parameters for each time series are i.i.d., we can attempt to use a heuristic like the BIC = −2L_veb + K(K + 5) log N for model selection. This yields a model with 4 states. We now use the estimated hyperparameters to sample a simulated dataset from the generative model. In the absence of artifacts and discrepancies with respect to the underlying model, the monotonic increase of L_veb is absent, and we in fact observe an all but infinitesimal decrease when overfitting the data. This decrease is also present, if not necessarily more pronounced, in the lower bound on held-out data obtained from cross-validation. This qualitatively different outcome is also reflected in the effective number of states, shown in figure 5b, which increases much more modestly as a function of K compared to the experimental case.
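Under our reading of the heuristic, K(K + 5) counts the hyperparameter degrees of freedom of a K-state model and N is the number of time series. A minimal sketch (the function name `bic_score` is ours):

```python
import math

def bic_score(L_veb, K, N):
    # heuristic from the text: BIC = -2 L_veb + K(K + 5) log N,
    # penalizing the evidence lower bound by the number of
    # hyperparameters, scaled by the log of the number of traces
    return -2.0 * L_veb + K * (K + 5) * math.log(N)
```

The model minimizing this score is selected; larger K is only preferred when the gain in L_veb outweighs the K(K + 5) log N penalty.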

In short, these results show that the VEB method does in principle have a built-in resistance to overfitting, in the sense that superfluous states are left unpopulated by the algorithm. Moreover, solutions that do overfit the data exhibit a decreased evidence lower bound. At the same time, analysis of actual biophysical data shows that this feature of the method is in practice outweighed by discrepancies between the data and the hypothesized generative model. Model selection on real biological datasets is therefore limited by our ability to construct a generative model that is not overly sensitive to such discrepancies.

7. Discussion

We have formulated an approximate parametric empirical Bayes procedure on HMMs, based on variational inference, for learning a single model from a multitude of biological time series. The estimation of parameter distributions that lies at the core of this method can be thought of as a mechanism for introducing a "loose" sharing of parameters between time series. This type of approach could therefore be useful in any learning scenario where it is desirable to extract a single interpretable model from an ensemble of similar yet heterogeneous time series.

The methodology employed here has a number of desirable features, including increased accuracy of posterior inference and a built-in resistance to overfitting. However, its main advantage in the context of single-molecule biophysics is that it enables robust estimation on large numbers of time series. The learned hyperparameters not only provide a single summary representation of an entire experiment, but are also useful in validating mechanistic hypotheses encoded in different graphical model variants. Comparison of prior and posterior densities allows detailed evaluation of the agreement between observed data and a chosen graphical model, even in the case where one may not wish to blindly employ the evidence lower bound as a model selection criterion. For example, the bi-modal signature of the posterior in figure 4 suggests that a more detailed analysis could be performed by assuming a mixture of parameter distributions p(A | ψ_m), conditioned on a new latent state that indicates the presence of an EF-G molecule bound to the ribosomal complex.

Figure 5. Dependence on the number of states in experimental and simulated data. (a) Lower bound L_veb (blue), lower bound on held-out data (green), and heuristic lower bound L_bic (orange). (b) The effective number of states increases monotonically for experimental data, but does not reveal significant over-fitting on simulated data. (c) Experimental data show asymmetries and long tails, resulting in population of additional states. (d) In simulated data, which lack these discrepancies with respect to the generative model, additional states are left largely unpopulated.

In experimental platforms beyond smFRET, several obvious extensions to graphical models related to the HMM could extend the scope of applications of this VEB approach. Double Markov chain models can be used to model diffusive motion governed by a latent state, as observed in tethered particle motion experiments (Beausang et al., 2007). Factorial HMMs (Ghahramani & Jordan, 1997) can be used to model systems with orthogonal degrees of freedom, such as the internal state of a molecular machine that performs a stepping motion along a substrate, or the dynamic binding and unbinding of external factors.

All code used in this publication is available as open source at http://ebfret.github.com.

8. Acknowledgements

The authors would like to thank Matt Hoffman and David Blei for helpful discussions. This work was supported by an NSF CAREER Award (MCB 0644262) and an NIH-NIGMS grant (R01 GM084288) to R.L.G., and a Rubicon fellowship from the Netherlands Organization for Scientific Research (NWO) to J.W.M.


References

Beal, M.J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.

Beausang, J.F., Zurla, C., Manzo, C., Dunlap, D., Finzi, L., and Nelson, P.C. DNA looping kinetics analyzed using diffusive hidden Markov model. Biophys. J., 92(8):L64–6, May 2007.

Berger, J. Bayesian Robustness and the Stein Effect. JASA, 77(378):358–368, June 1982.

Bishop, C.M. Pattern Recognition and Machine Learning. Springer, New York, 2006.

Borgia, A., Williams, P.M., and Clarke, J. Single-molecule studies of protein folding. Ann. Rev. Biochem., 77:101–25, January 2008.

Bronson, J.E., Fei, J., Hofman, J.M., Gonzalez, R.L., and Wiggins, C.H. Learning rates and states from biophysical time series: a Bayesian approach to model selection and single-molecule FRET data. Biophys. J., 97(12):3196–205, December 2009.

Bronson, J.E., Hofman, J.M., Fei, J., Gonzalez, R.L., and Wiggins, C.H. Graphical models for inferring single molecule dynamics. BMC Bioinf., 11 Suppl 8(Suppl 8):S2, January 2010.

Carlin, B.P. and Louis, T.A. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, London, 1996.

Casella, G. An introduction to empirical Bayes data analysis. Am. Statistician, 39(2):83–87, 1985.

Corduneanu, A. and Bishop, C.M. Variational Bayesian model selection for mixture distributions. AISTATS, 2001.

Cornish, P.V. and Ha, T. A survey of single-molecule techniques in chemical biology. ACS Chem. Biol., 2(1):53–61, January 2007.

Fei, J., Bronson, J.E., Hofman, J.M., Srinivas, R.L., Wiggins, C.H., and Gonzalez, R.L. Allosteric collaboration between elongation factor G and the ribosomal L1 stalk directs tRNA movements during translation. Proc. Nat. Acad. Sci. USA, 106(37):15702–7, September 2009.

Ghahramani, Z. and Jordan, M.I. Factorial Hidden Markov Models. Machine Learning, 29(2-3):245–273, 1997.

Greenfeld, M., Pavlichin, D.S., Mabuchi, H., and Herschlag, D. Single Molecule Analysis Research Tool (SMART): An Integrated Approach for Analyzing Single Molecule Data. PloS One, 7(2):e30024, January 2012.

Joo, C., Balci, H., Ishitsuka, Y., Buranachai, C., and Ha, T. Advances in single-molecule fluorescence methods for molecular biology. Ann. Rev. Biochem., 77:51–76, January 2008.

Jordan, M.I., Ghahramani, Z., and Jaakkola, T.S. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

Kass, R.E. and Steffey, D. Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). JASA, 84(407), 1989.

McKinney, S.A., Joo, C., and Ha, T. Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys. J., 91(5):1941–51, September 2006.

Minka, T.P. Estimating a Dirichlet distribution. Technical Report, MIT, 2012.

Morris, C.N. Parametric Empirical Bayes Inference: Theory and Applications. JASA, 78(381):47–55, March 1983.

Neuman, K.C. and Nagy, A. Single-molecule force spectroscopy: optical tweezers, magnetic tweezers and atomic force microscopy. Nat. Methods, 5(6):491–505, June 2008.

Okamoto, K. and Sako, Y. Variational Bayes Analysis of a Photon-Based Hidden Markov Model for Single-Molecule FRET Trajectories. Biophys. J., 103(6):1315–24, September 2012.

Sorgenfrei, S., Chiu, C-Y., Gonzalez, R.L., Yu, Y-J., Kim, P., Nuckolls, C., and Shepard, K.L. Label-free single-molecule detection of DNA-hybridization kinetics with a carbon nanotube field-effect transistor. Nat. Nanotechnol., 6(2):126–32, February 2011.

Stein, C.M. Estimation of the Mean of a Multivariate Normal Distribution. Ann. Stat., 9(6):1135–1151, November 1981.

Tinoco, I. and Gonzalez, R.L. Biological mechanisms, one molecule at a time. Genes Dev., 25(12):1205–31, June 2011.

Wainwright, M.J. and Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Machine Learning, 1(1–2):1–305, 2008.