arXiv:1805.07226v2 [stat.ML] 21 Jan 2019



Sequential Neural Likelihood: Fast Likelihood-free Inference with Autoregressive Flows

George Papamakarios, David C. Sterratt, Iain Murray
University of Edinburgh

Abstract

We present Sequential Neural Likelihood (SNL), a new method for Bayesian inference in simulator models, where the likelihood is intractable but simulating data from the model is possible. SNL trains an autoregressive flow on simulated data in order to learn a model of the likelihood in the region of high posterior density. A sequential training procedure guides simulations and reduces simulation cost by orders of magnitude. We show that SNL is more robust, more accurate and requires less tuning than related neural-based methods, and we discuss diagnostics for assessing calibration, convergence and goodness-of-fit.

1 Introduction

In many areas of science and engineering, a natural way to model a stochastic system of interest is as a simulator, which may be controlled by potentially interpretable parameters and can be run forward to generate data. Such simulator models lend themselves naturally to modelling mechanistic processes such as particle collisions in high energy physics [1, 71], universe evolution in cosmology [2, 67], stochastic dynamical systems in ecology and evolutionary biology [64, 65, 84, 85], macroeconomic models in econometrics [3, 26], and biological neural networks in neuroscience [48, 62, 73]. Simulators are a powerful and flexible modelling tool; they can take the form of a computer graphics engine [45] or a physics engine [86], and they can become part of probabilistic programs [14, 16, 37].

Due to the wide applicability of simulator models, several tasks of interest can be framed as inference of the parameters of a simulator from observed data. For instance, inferring the parameters of physical theories given particle collisions can be the means for discovering new physics [11], and inferring the scene parameters of a graphics engine given a natural image can be the basis for computer vision [66, 87]. Hence, the development of accurate and efficient general-purpose inference methods for simulator models can have a profound impact in scientific discovery and artificial intelligence.

The fundamental difficulty in inferring the parameters of a simulator given data is the unavailability of the likelihood function. In Bayesian inference, posterior beliefs about parameters θ given data x can be obtained by multiplying the likelihood p(x | θ) with prior beliefs p(θ) and normalizing. However, calculating the likelihood p(x | θ) of a simulator model for given parameters θ and data x is computationally infeasible in general, and hence traditional likelihood-based Bayesian methods, such as variational inference [83] or Markov Chain Monte Carlo [54], are not directly applicable.

To overcome this difficulty, several methods for likelihood-free inference have been developed, such as Approximate Bayesian Computation [46] and Synthetic Likelihood [85], that require only the ability to generate data from the simulator. Such methods simulate the model repeatedly, and use the simulated data to build estimates of the parameter posterior. In general, the accuracy of likelihood-free inference improves as the number of simulations increases, but so does the computational cost, especially if the simulator is expensive to run. The challenge in developing new likelihood-free inference methods is to achieve a good trade-off between estimation accuracy and simulation cost.

In this paper we present Sequential Neural Likelihood (SNL), a new likelihood-free method for Bayesian inference of simulator models, based on state-of-the-art conditional neural density estimation with autoregressive flows. The main idea of SNL is to train a Masked Autoregressive Flow [60] on simulated data in order to estimate the conditional probability density of data given parameters, which then serves as an accurate model of the likelihood function. During training, a Markov Chain Monte Carlo sampler selects the next batch of simulations to run using the most up-to-date estimate of the likelihood function, leading to a reduction in the number of simulations of several orders of magnitude. To aid the practitioner in deploying SNL, we discuss practical issues such as implementation and tuning, diagnosing convergence, and assessing goodness-of-fit. Experimental results show that SNL is robust and well-calibrated, and that it outperforms existing state-of-the-art methods based on neural density estimation both in terms of posterior estimation accuracy and simulation cost.

2 Likelihood-free inference with neural density estimation

A simulator model is a computer program, which takes a vector of parameters θ, makes internal calls to a random number generator, and outputs a data vector x. Implicitly, this procedure defines a conditional probability distribution p(x | θ) which in general we cannot evaluate, but we can easily sample from by running the program. Given an observed data vector xo and a prior distribution p(θ), we are interested in estimating the parameter posterior p(θ | xo) ∝ p(xo | θ) p(θ).

This paper focuses on an approach to likelihood-free inference that is based on neural density estimation. A conditional neural density estimator is a parametric model qφ (such as a neural network) controlled by a set of parameters φ, which takes a pair of datapoints (u, v) and outputs a conditional probability density qφ(u | v). Given a set of training data {un, vn}1:N that are independent and identically distributed according to a joint probability density p(u, v), we can train qφ by maximizing the total log probability ∑n log qφ(un | vn) with respect to φ. With enough training data, and with a sufficiently flexible model, qφ(u | v) will learn to approximate the conditional p(u | v).

We can use a neural density estimator qφ(θ | x) that models the conditional of parameters given data to approximate the posterior p(θ | xo) as follows. First, we obtain a set of samples {θn, xn}1:N from the joint distribution p(θ, x), by θn ∼ p(θ) and xn ∼ p(x | θn) for n = 1, . . . , N. Then, we train qφ using {θn, xn}1:N as training data in order to obtain a global approximation of p(θ | x). Finally, p(θ | xo) can be simply estimated by qφ(θ | xo). In practice, a large number of simulations may be required for there to be enough training data in the vicinity of xo in order to obtain an accurate posterior fit. However, running the simulator many times can be prohibitively expensive.
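
As a concrete (if heavily simplified) illustration of this recipe, the sketch below samples joint pairs from a toy prior and simulator and fits a conditional estimator to them. The simulator, the linear-Gaussian estimator standing in for a neural density estimator, and the observed value are our own placeholders, not the models used in the paper.

```python
# Minimal sketch: fit q_phi(theta | x) on joint samples, then evaluate it at x_o.
import numpy as np

rng = np.random.default_rng(0)

def prior(n):
    return rng.uniform(-3.0, 3.0, size=(n, 1))             # theta ~ U(-3, 3)

def simulator(theta):
    return theta + 0.5 * rng.standard_normal(theta.shape)   # x | theta ~ N(theta, 0.5^2)

# 1) Sample (theta_n, x_n) pairs from the joint p(theta, x).
N = 5000
theta = prior(N)
x = simulator(theta)

# 2) "Train" q_phi(theta | x): a linear-Gaussian fit by least squares, standing in
#    for maximising sum_n log q_phi(theta_n | x_n) over the parameters phi.
A = np.column_stack([x, np.ones(N)])
coef, *_ = np.linalg.lstsq(A, theta, rcond=None)
sigma = (theta - A @ coef).std()

# 3) Evaluate the approximate posterior q_phi(theta | x_o) at the observed data.
x_o = np.array([[1.2]])
post_mean = (np.column_stack([x_o, np.ones(1)]) @ coef).item()
print("approximate posterior: mean", round(post_mean, 3), "std", round(float(sigma), 3))
```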

Sequential Neural Posterior Estimation is a strategy for reducing the number of simulations needed by conditional neural density estimation, originally proposed by Papamakarios and Murray [59] and further developed by Lueckmann et al. [43]. The name SNPE was first used for the method of Lueckmann et al. [43], but in this paper we will use it to refer to both methods, due to their close relationship. The main idea of SNPE is to generate parameter samples θn from a proposal p̃(θ) instead of the prior p(θ), chosen so that the generated data xn are more likely to be close to the observed datapoint xo. SNPE finds a good proposal p̃(θ) by training the estimator qφ over a number of rounds, whereby in each round p̃(θ) is taken to be the approximate posterior obtained in the round before. This sequential procedure converges rapidly and can be implemented with a relatively small number of simulations per round, which leads to massive savings in simulation cost. For its neural density estimator, SNPE uses a Mixture Density Network (MDN) [7], which is a feedforward neural network that takes x as input and outputs the parameters of a Gaussian mixture over θ. To avoid overfitting due to the small number of training data per round, the MDN is trained with variational dropout [36].

The main issue with SNPE is that the proposal biases the approximation of the posterior. Since the parameter samples follow p̃(θ) instead of p(θ), the MDN will approximate p(xo | θ) p̃(θ) instead of p(xo | θ) p(θ) (up to normalization). Hence, an adjustment of either the learned posterior or the proposed samples must be made to account for sampling from the 'wrong' prior. In the variant of Papamakarios and Murray [59], which we refer to as SNPE-A, the learned posterior qφ(θ | xo) is adjusted, by dividing it by p̃(θ) and multiplying it by p(θ). SNPE-A restricts p̃(θ) to be Gaussian; since qφ(θ | xo) is a Gaussian mixture, the division by p̃(θ) can be done analytically. The problem with SNPE-A is that if p̃(θ) happens to have a smaller variance than any of the components of qφ(θ | xo), the division yields a Gaussian with negative variance, from which the algorithm is unable to recover and thus is forced to terminate prematurely. In the variant of Lueckmann et al. [43], which we refer to as SNPE-B, the parameter samples θn are adjusted, by assigning them weights wn = p(θn)/p̃(θn). During training, the weighted log likelihood ∑n wn log qφ(θn | xn) is used instead of the total log likelihood. Compared to SNPE-A, this method does not require the proposal p̃(θ) to be Gaussian, and it does not suffer from negative variances. However, the weights can have high variance, which may result in high-variance gradients and instability during training.
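
The re-weighting step of SNPE-B can be made concrete with a short sketch. The uniform prior, Gaussian proposal and the placeholder log qφ below are illustrative assumptions of ours, not the actual SNPE-B implementation; the printed effective sample size hints at why high-variance weights can destabilize training.

```python
# Sketch of SNPE-B style importance weights w_n = p(theta_n) / ptilde(theta_n).
import numpy as np

rng = np.random.default_rng(0)

def log_prior(theta):                        # p(theta) = U(-3, 3)
    return np.where(np.abs(theta) <= 3.0, -np.log(6.0), -np.inf)

def log_proposal(theta, mean=1.0, std=0.5):  # ptilde(theta): current posterior guess
    return -0.5 * ((theta - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

# Parameters drawn from the proposal rather than the prior.
theta = rng.normal(1.0, 0.5, size=1000)

# Importance weights (rescaled for numerical stability; rescaling does not change the argmax).
log_w = log_prior(theta) - log_proposal(theta)
w = np.exp(log_w - log_w.max())

def log_q(theta):                            # placeholder for log q_phi(theta_n | x_n)
    return -0.5 * (theta - 0.8) ** 2

weighted_loss = -np.sum(w * log_q(theta))    # objective minimised during training
print("weighted loss:", weighted_loss)
print("effective sample size:", w.sum() ** 2 / np.sum(w ** 2))
```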

3 Sequential Neural Likelihood

Our new method, Sequential Neural Likelihood (SNL), avoids the bias introduced by the proposal, by opting to learn a model of the likelihood instead of the posterior. Let p̃(θ) be a proposal distribution over parameters (not necessarily the prior) and let p(x | θ) be the intractable likelihood of a simulator model. Consider a set of samples {θn, xn}1:N obtained by θn ∼ p̃(θ) and xn ∼ p(x | θn), and define p̃(θ, x) = p(x | θ) p̃(θ) to be the joint distribution of each pair (θn, xn). Suppose we train a conditional neural density estimator qφ(x | θ), which models the conditional of data given parameters, on the set {θn, xn}1:N. For large N, maximizing the total log likelihood ∑n log qφ(xn | θn) is approximately equivalent to maximizing:

Ep̃(θ,x)[ log qφ(x | θ) ] = −Ep̃(θ)[ DKL( p(x | θ) ‖ qφ(x | θ) ) ] + const,     (1)

where DKL(· ‖ ·) is the Kullback–Leibler divergence. The above quantity attains its maximum when the KL is zero in the support of p̃(θ), i.e. when qφ(x | θ) = p(x | θ) for all θ such that p̃(θ) > 0. Therefore, given enough simulations, a sufficiently flexible conditional neural density estimator will eventually approximate the likelihood in the support of the proposal, regardless of the shape of the proposal. In other words, as long as we do not exclude parts of the parameter space, the way we propose parameters does not bias learning the likelihood asymptotically. Unlike when learning the posterior, no adjustment is necessary to account for our proposing strategy.

In practice, for a moderate number of simulations, the proposal p̃(θ) controls where qφ(x | θ) will be most accurate. In a parameter region where p̃(θ) is high, there will be a high concentration of training data, hence p(x | θ) will be approximated better. Since we are ultimately interested in estimating the posterior p(θ | xo) for a specific datapoint xo, it makes sense to use a proposal that is high in regions of high posterior density, and low otherwise, to avoid expending simulations in regions that are not relevant to the inference task. Inspired by SNPE, we train qφ over multiple rounds, indexed by r ≥ 1. Let pr−1(θ | xo) be the approximate posterior obtained in round r−1, and take p0(θ | xo) = p(θ). In round r, we generate a new batch of N parameters θn from pr−1(θ | xo), and data xn from p(x | θn). We then (re-)train qφ on all data generated in rounds 1 up to r, and set pr(θ | xo) ∝ qφ(xo | θ) p(θ). This method, which we call Sequential Neural Likelihood, is detailed in Algorithm 1.

In round r, SNL effectively uses rN parameter samples from the effective proposal p̃r(θ) = (1/r) ∑_{i=0}^{r−1} pi(θ | xo). As the amount of training data grows in each round, qφ(x | θ) becomes a more accurate model of the likelihood in the region of high proposal density, and hence the approximate posterior pr(θ | xo) gets closer to the exact posterior. In turn, the proposal p̃r(θ) also tends to the exact posterior, and therefore most of the simulations in later rounds come from parameter regions of high posterior density. In Section 5 we see that focusing on high-posterior parameters massively reduces the number of simulations. Finally, unlike SNPE, which trains only on simulations from the latest round, SNL trains on all simulations obtained up to each round.

Algorithm 1: Sequential Neural Likelihood (SNL)

Input: observed data xo, estimator qφ(x | θ), number of rounds R, simulations per round N
Output: approximate posterior p(θ | xo)

set p0(θ | xo) = p(θ) and D = {}
for r = 1 : R do
    for n = 1 : N do
        sample θn ∼ pr−1(θ | xo) with MCMC
        simulate xn ∼ p(x | θn)
        add (θn, xn) into D
    (re-)train qφ(x | θ) on D and set pr(θ | xo) ∝ qφ(xo | θ) p(θ)
return pR(θ | xo)
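
For concreteness, here is a compact, runnable sketch of the loop in Algorithm 1 on a one-dimensional toy problem. The linear-Gaussian model stands in for MAF, a random-walk Metropolis sampler stands in for the axis-aligned slice sampler used in the paper, and the toy simulator and all settings are our own assumptions.

```python
# Runnable sketch of the SNL loop (Algorithm 1) on a 1-D toy problem.
import numpy as np

rng = np.random.default_rng(0)

def log_prior(theta):                                    # p(theta) = U(-3, 3)
    return -np.log(6.0) if abs(theta) <= 3.0 else -np.inf

def simulate(theta):                                     # stand-in "intractable" simulator
    return theta + 0.2 * rng.standard_normal()

def fit_q(thetas, xs):
    # q_phi(x | theta) = N(x; a*theta + b, s^2), fit by least squares (up to a constant).
    A = np.column_stack([thetas, np.ones_like(thetas)])
    (a, b), *_ = np.linalg.lstsq(A, xs, rcond=None)
    s = np.std(xs - (a * thetas + b)) + 1e-6
    return lambda x, th: -0.5 * ((x - (a * th + b)) / s) ** 2 - np.log(s)

def mcmc(log_post, init, n, step=0.5):                   # random-walk Metropolis
    th, lp, out = init, log_post(init), []
    for _ in range(n):
        prop = th + step * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            th, lp = prop, lp_prop
        out.append(th)
    return np.array(out)

x_o = 1.5                                                # observed data
R, N = 5, 200                                            # rounds, simulations per round
thetas, xs = [], []
proposal_samples = rng.uniform(-3, 3, N)                 # round 1 proposes from the prior

for r in range(R):
    for th in proposal_samples:
        thetas.append(th)
        xs.append(simulate(th))
    log_q = fit_q(np.array(thetas), np.array(xs))        # (re-)train on all data so far
    log_post = lambda th: log_q(x_o, th) + log_prior(th)
    chain = mcmc(log_post, init=0.1, n=2000)
    proposal_samples = chain[-N:]                        # next round: p_r(theta | x_o)

print("approximate posterior mean:", np.mean(chain[1000:]))
```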

In general, we are free to choose any neural density estimator qφ(x | θ) that is suitable for the task at hand. Because we are interested in a general-purpose solution, we propose taking qφ(x | θ) to be a conditional Masked Autoregressive Flow (MAF) [60], which has been shown to perform well in a variety of general-purpose density estimation tasks. MAF represents qφ(x | θ) as a transformation of a standard Gaussian density N(0, I) through a series of K autoregressive functions f1, . . . , fK, each of which depends on θ, that is:

x = zK   where   z0 ∼ N(0, I)   and   zk = fk(zk−1, θ).     (2)

Each fk is a bijection with a lower-triangular Jacobian matrix, and is implemented by a Masked Autoencoder for Distribution Estimation (MADE) [23] conditioned on θ. By the change of variables formula, the conditional density is given by

qφ(x | θ) = N(z0 | 0, I) ∏k |det(∂fk/∂zk−1)|⁻¹.

In order to sample from the approximate posterior p(θ | xo) ∝ qφ(xo | θ) p(θ), we use Markov Chain Monte Carlo (MCMC) in the form of Slice Sampling with axis-aligned updates [55]. The Markov chain is initialized with a sample from the prior and it persists across rounds; that is, in every round other than the first, the initial state of the chain is the same as its last state in the round before. At the beginning of each round, the chain is burned-in for 200 iterations, in order to adapt to the new approximate posterior. This scheme requires no tuning, and we found that it performed robustly across our experiments, so we recommend it as a default for general use, at least for tens of parameters that do not have pathologically strong correlations. For parameter spaces of larger dimensionality, other MCMC schemes could be considered, such as Hamiltonian Monte Carlo [56].
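
A simplified axis-aligned slice sampler (with stepping-out and shrinkage, after Neal [55]) might look as follows; the toy Gaussian target, the step size w and the burn-in length are illustrative choices rather than the paper's exact implementation.

```python
# Sketch of axis-aligned slice sampling with stepping-out and shrinkage.
import numpy as np

rng = np.random.default_rng(0)

def slice_sample(log_p, x0, n_samples, w=1.0, burn_in=200):
    x = np.array(x0, dtype=float)
    samples = []
    for it in range(n_samples + burn_in):
        for i in range(len(x)):                        # one axis-aligned update per dimension
            log_y = log_p(x) + np.log(rng.uniform())   # slice height
            left = x[i] - w * rng.uniform()            # randomly placed bracket of width w
            right = left + w
            xi = x[i]
            while log_p(np.r_[x[:i], left, x[i+1:]]) > log_y:    # stepping out (left)
                left -= w
            while log_p(np.r_[x[:i], right, x[i+1:]]) > log_y:   # stepping out (right)
                right += w
            while True:                                          # shrinkage
                prop = rng.uniform(left, right)
                cand = np.r_[x[:i], prop, x[i+1:]]
                if log_p(cand) > log_y:
                    x = cand
                    break
                if prop < xi:
                    left = prop
                else:
                    right = prop
        if it >= burn_in:
            samples.append(x.copy())
    return np.array(samples)

log_p = lambda t: -0.5 * np.sum((t - np.array([1.0, -2.0])) ** 2)  # toy target
print(slice_sample(log_p, [0.0, 0.0], 500).mean(axis=0))
```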

4 Related work

Approximate Bayesian Computation [4, 5, 8, 42, 46]. ABC is a family of mainly non-parametric methods that repeatedly simulate the model, and reject simulations that do not reproduce the observed data. In practice, to reduce rejection rates to manageable levels, (a) lower-dimensional features (or summary statistics) are used instead of raw data, and (b) simulations are accepted whenever simulated features are within a distance ε from the observed features. A basic implementation of ABC would simulate parameters from the prior; more sophisticated variants such as Markov Chain Monte Carlo ABC [47, 51, 69] and Sequential Monte Carlo ABC [6, 10, 70, 76] guide future simulations based on previously accepted parameters. An issue with ABC in general is that in practice the required number of simulations increases dramatically as ε becomes small, which, as we shall see in Section 5, can lead to an unfavourable trade-off between estimation accuracy and simulation cost. Advanced ABC algorithms that work for ε = 0 exist [27], but require the simulator to be differentiable.

Learning the posterior. Another approach to likelihood-free inference is learning a parametric model of the posterior from simulations. Regression Adjustment [5, 9] employs a parametric regressor (such as a linear model or a neural network) to learn the dependence of parameters θ given data x in order to correct posterior parameter samples (obtained by e.g. ABC). Gaussian Copula ABC [40] estimates the posterior with a parametric Gaussian copula model. Variational Likelihood-Free Inference [52, 77, 78] trains a parametric posterior model by maximizing the (here intractable) variational lower bound; either the bound or its gradients are stochastically estimated from simulations. Parametrically learning the posterior is the target of Sequential Neural Posterior Estimation [43, 59], which SNL directly builds on, and which was discussed in detail in Section 2. Beyond SNPE, conditional neural density estimators have been used to learn the posterior from simulations in graphical models [53, 58], and universal probabilistic programs [38].

Learning the likelihood. Similarly to SNL, a body of work has focused on parametrically approximating the intractable likelihood function. Synthetic Likelihood [20, 57, 63, 85] estimates the mean mθ and covariance matrix Sθ of a batch of data xn ∼ p(x | θ) sampled at a given θ, and then approximates p(xo | θ) ≈ N(xo | mθ, Sθ). Non-Gaussian likelihood approximations are also possible, e.g. saddlepoint approximations [22]. Typically, SL would be run as an inner loop in the context of an MCMC sampler. Gaussian Process Surrogate ABC [49] employs a Gaussian process to model the dependence of (mθ, Sθ) on θ, and uses the uncertainty in the GP to decide whether to run more simulations to estimate (mθ, Sθ). Other predecessors to SNL include approximating the likelihood as a linear-Gaussian model [39], or as a mixture of Gaussian copulas with marginals modelled by mixtures of experts [21].

Learning the likelihood ratio. Rather than learning the likelihood p(x | θ), an alternative approach is learning the likelihood ratio p(x | θ1)/p(x | θ2) [11, 12, 15, 30, 61, 74], or the likelihood-to-marginal ratio p(x | θ)/p(x) [19, 33, 77]. In practice, this can be done by training a classifier to discriminate between data simulated at θ1 vs θ2 for the likelihood ratio, or between data simulated at θ vs random parameter samples from the prior for the likelihood-to-marginal ratio. The likelihood ratio can be used directly for Bayesian inference (e.g. by multiplying it with the prior and then sampling with MCMC), for hypothesis testing [11, 15], or for estimating the variational lower bound in the context of variational inference [77]. We note that the strategy for guiding simulations proposed in this paper can (at least in principle) be used with likelihood-ratio estimation too; it might be possible that large computational savings can be achieved as in SNL.

Guiding simulations. Various approaches have been proposed for guiding simulations in order to reduce simulation cost, typically by making use of the observed data xo in some way. Optimization Monte Carlo [50] directly optimizes parameter samples so as to generate data similar to xo. Bayesian Optimization Likelihood-free Inference [29] uses Bayesian optimization to find parameter samples that minimize the distance ‖x − xo‖. An active-learning-style approach is to use the Bayesian uncertainty in the posterior estimate to choose what simulation to run next [34, 44]; in this scheme, the next simulation is chosen to maximally reduce the Bayesian uncertainty. Similarly to Sequential Neural Posterior Estimation [43, 59], SNL selects future simulations by proposing parameters from preliminary approximations to the posterior, which in Section 5 is shown to reduce simulation cost significantly.

5 Experiments

5.1 Setup

In all experiments, we used a Masked Autoregressive Flow (MAF) with 5 autoregressive layers, each of which has two hidden layers of 50 units each and tanh nonlinearities. Our design and training choices follow closely the reference implementation [60]. We used batch normalization [32] between autoregressive layers, and trained MAF by stochastically maximizing the total log likelihood using Adam [35] with a minibatch size of 100 and a learning rate of 10⁻⁴. We used 1000 simulations in each round of SNL, 5% of which were randomly selected to be used as a validation set; we stopped training if validation log likelihood did not improve after 20 epochs. No other form of regularization was used other than early stopping. These settings were held constant and performed robustly across all experiments, which is evidence for the robustness of SNL to parameter tuning. We recommend these settings as defaults for general use.
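
The protocol above (95%/5% train/validation split and early stopping with a patience of 20 epochs) can be sketched as below. The dummy epoch function is a stand-in for one pass of Adam updates on MAF, and the shapes and helper names are our own assumptions for illustration.

```python
# Sketch of the data split and early-stopping protocol described in the text.
import numpy as np

rng = np.random.default_rng(0)

def split_train_val(thetas, xs, val_frac=0.05):
    idx = rng.permutation(len(thetas))
    n_val = int(val_frac * len(thetas))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return (thetas[train_idx], xs[train_idx]), (thetas[val_idx], xs[val_idx])

def train_with_early_stopping(run_epoch, patience=20, max_epochs=10_000):
    best_val, best_epoch = -np.inf, 0
    for epoch in range(max_epochs):
        val_ll = run_epoch(epoch)               # one epoch of training; returns validation LL
        if val_ll > best_val:
            best_val, best_epoch = val_ll, epoch
        elif epoch - best_epoch >= patience:    # no improvement for `patience` epochs
            break
    return best_val, epoch

def dummy_epoch(epoch):
    # stand-in: validation log likelihood improves for ~60 epochs, then plateaus with noise
    return -5.0 + 0.06 * min(epoch, 60) + 0.01 * rng.standard_normal()

(thetas_tr, xs_tr), (thetas_val, xs_val) = split_train_val(
    rng.normal(size=(1000, 2)), rng.normal(size=(1000, 5)))
print("train/val sizes:", len(thetas_tr), len(thetas_val))
print("best validation LL, stopping epoch:", train_with_early_stopping(dummy_epoch))
```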

We compare SNL to the following algorithms:

Neural Likelihood (NL). By this we refer to training a MAF (of the same architecture as above) on N simulations from the prior. This is essentially SNL without simulation guiding; we use it as a control to assess the benefit of SNL's guiding strategy. We vary N from 10³ to 10⁶ (or more if the simulation cost permits it), and plot the performance for each N.

SNPE-A [59]. We use 1000 simulations in each of the proposal-estimating rounds, and 2000 simulations in the posterior-estimating round (the final round). In all experiments, SNPE-A uses a Mixture Density Network [7] with two hidden layers of 50 units each, 8 mixture components, and tanh nonlinearities. We chose the MDN to have roughly as many parameters as MAF in SNL. The MDN is trained for 1000 epochs per round, except for the final round which uses 5000 epochs, and is regularized with variational dropout [36].

SNPE-B [43]. We use 1000 simulations in each round. The same MDN as in SNPE-A is used, and it is trained for 1000 epochs per round using variational dropout.

Synthetic Likelihood (SL) [85]. Our implementation uses axis-aligned Slice Sampling [55], where the intractable likelihood is approximated by a Gaussian fitted to a batch of N simulations run on the fly at each visited parameter θ. We vary N from 10 to 100–1000 (depending on what the simulation cost of each experiment permits) and plot the performance of SL for each N. We run Slice Sampling until we obtain 1000 posterior samples.
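
The per-parameter Gaussian approximation used by SL can be sketched as follows; the toy simulator and the batch size are placeholders of ours.

```python
# Sketch of the Synthetic Likelihood estimate: at each visited theta, fit a Gaussian
# to a fresh batch of simulations and score the observed data under it.
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n):
    return theta + rng.standard_normal((n, 2))            # toy 2-D simulator

def synthetic_log_likelihood(x_o, theta, n_sims=100):
    x = simulator(theta, n_sims)
    m = x.mean(axis=0)                                     # m_theta
    S = np.cov(x, rowvar=False)                            # S_theta
    diff = x_o - m
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (diff @ np.linalg.solve(S, diff) + logdet + len(x_o) * np.log(2 * np.pi))

x_o = np.array([0.5, -0.3])
print(synthetic_log_likelihood(x_o, theta=np.zeros(2)))
```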

Sequential Monte Carlo ABC (SMC-ABC). We use the version of Beaumont et al. [6]. We use 1000 particles, and resample the population if the effective sample size falls below 50%. We set the initial ε such that the acceptance probability in the first round is at least 20%, and decay ε by a factor of 0.9 per round.

We chose SNPE-A/B because they are the most related methods, and because they achieve state-of-the-art results in the literature; the comparison with them is intended to establish the competitiveness of SNL. The versions of SL and SMC-ABC we use here are not state-of-the-art, but they are robust and widely-used, and are intended as baselines. The comparison with them is meant to demonstrate the gap between state-of-the-art neural-based methods and off-the-shelf, commonly-used alternatives.

5.2 Results

We demonstrate SNL in four cases: (a) a toy model with complex posterior, (b) an M/G/1 queue model, (c) a Lotka–Volterra model from ecology, and (d) a Hodgkin–Huxley model of neural activity from neuroscience. The first two models are fast to simulate, and are intended as toy demonstrations. The last two models are relatively slow to simulate (as they involve numerically solving differential equations), and are illustrative of real-world problems of interest. In the last two models, the computational cost of training the neural networks for SNPE-A/B and SNL is negligible compared to the cost of simulating training data.

A detailed description of the models and the full set of results are in Appendices A and B. Code that reproduces the experiments with detailed user instructions can be found at https://github.com/gpapamak/snl.

A toy model with complex posterior. We consider the following model, where θ is 5-dimensional, and x is a set of four 2-dimensional points (or an 8-dimensional vector) sampled from a Gaussian whose mean mθ and covariance matrix Sθ are functions of θ:

θi ∼ U(−3, 3) for i = 1, . . . , 5     (3)
mθ = (θ1, θ2)     (4)
s1 = θ3², s2 = θ4², ρ = tanh(θ5)     (5)
Sθ = [ s1²       ρ s1 s2
       ρ s1 s2   s2²    ]     (6)
x = (x1, . . . , x4) where xj ∼ N(mθ, Sθ).     (7)

The likelihood is p(x | θ) = ∏j N(xj | mθ, Sθ). Despite the model's simplicity, the posterior is complex and non-trivial: it has four symmetric modes (due to squaring), and vertical cut-offs (due to the uniform prior). This example illustrates that the posterior can be complicated even if the prior and the likelihood are composed entirely of simple operations (which is a common situation e.g. in probabilistic programming). In such situations, approximating the likelihood can be simpler than approximating the posterior.
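
For reference, Equations (3)-(7) translate directly into the following simulator sketch (numpy only; the seed and the flattening of the four points into an 8-dimensional vector follow the description above).

```python
# Sketch of the toy simulator defined by Equations (3)-(7).
import numpy as np

rng = np.random.default_rng(0)

def sample_prior():
    return rng.uniform(-3.0, 3.0, size=5)                # theta_i ~ U(-3, 3)

def simulate(theta):
    m = theta[:2]                                        # m_theta = (theta_1, theta_2)
    s1, s2 = theta[2] ** 2, theta[3] ** 2
    rho = np.tanh(theta[4])
    S = np.array([[s1 ** 2, rho * s1 * s2],
                  [rho * s1 * s2, s2 ** 2]])             # S_theta
    x = rng.multivariate_normal(m, S, size=4)            # four points from N(m_theta, S_theta)
    return x.reshape(-1)                                 # 8-dimensional data vector

theta = sample_prior()
print(simulate(theta))
```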

Figure 1a shows the Maximum Mean Discrepancy (MMD) [28] between the approximate posterior of each method and the true posterior vs the total number of simulations used. SNL achieves the best trade-off between accuracy and simulation cost, and achieves the most accurate approximation to the posterior overall. SNPE-A fails in the second round due to the variance of the proposal becoming negative (hence there are only two points in the graph) and SNPE-B experiences high variability; SNL is significantly more robust in comparison. SMC-ABC and SL require orders of magnitude more simulations than the sequential neural methods.

Figure 1: A toy model with complex posterior. a: Accuracy vs simulation cost: bottom left is best. b: Calibration test for SNL, histogram outside gray band indicates poor calibration. c: Median distance from simulated to observed data.

To assess whether SNL is well-calibrated, we performed a simulation-based calibration test [75]: we generated 200 pairs (θn, xn) from the joint p(θ, x), used SNL to approximate the posterior for each xn, obtained 9 close-to-independent samples from each posterior, and calculated the rank statistic of each parameter θni for i = 1, . . . , 5 in the corresponding set of posterior samples. If SNL is well-calibrated, the distribution of each rank statistic must be uniform. The histograms of the first three rank statistics are shown in Figure 1b, with a gray band showing the expected variability of a uniform histogram. The test does not find evidence of any gross mis-calibration, and suggests that SNL performs consistently across multiple runs (here 200).
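
The rank-statistic computation behind this test can be sketched as follows. To keep the example self-contained we use a conjugate Gaussian toy model whose exact posterior is available, so the histogram should come out flat by construction; the choice of 200 trials and 9 posterior samples mirrors the numbers above.

```python
# Sketch of simulation-based calibration: rank of theta_n among posterior samples.
import numpy as np

rng = np.random.default_rng(0)
n_trials, L = 200, 9
sigma, tau = 1.0, 2.0                        # likelihood and prior std of the toy model

ranks = []
for _ in range(n_trials):
    theta = rng.normal(0.0, tau)             # theta ~ p(theta)
    x = rng.normal(theta, sigma)             # x ~ p(x | theta)
    # exact posterior of the conjugate toy model: N(post_mean, post_var)
    post_var = 1.0 / (1.0 / tau ** 2 + 1.0 / sigma ** 2)
    post_mean = post_var * x / sigma ** 2
    post_samples = rng.normal(post_mean, np.sqrt(post_var), size=L)
    ranks.append(np.sum(post_samples < theta))   # rank statistic in {0, ..., L}

hist = np.bincount(ranks, minlength=L + 1)
print("rank histogram (should be roughly flat):", hist)
```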

Another diagnostic is shown in Figure 1c, where we plot the median distance between simulated and observed data for each round. From this plot we can assess convergence, and determine the minimum number of rounds to run for. SNL has lower median distance compared to SNPE-B, which is evidence that SNPE-B has not estimated the posterior accurately enough (as also shown in the left plot).

M/G/1 queue model [68]. The model describes a server processing customers waiting in a queue, and has parameters θ = (θ1, θ2, θ3). The time it takes to serve a customer is uniformly distributed in [θ1, θ2], and the time between customer arrivals is exponentially distributed with rate θ3. The data x is 5 equally spaced quantiles of the distribution of inter-departure times. The model can be easily simulated, but the likelihood p(x | θ) is intractable. Our experimental setup follows Papamakarios and Murray [59].

The trade-off between accuracy and simulation cost is shown in Figure 2a. Here we do not have access to the true posterior; instead we plot the negative log probability of the true parameters, obtained by kernel density estimation on posterior samples, vs number of simulations. SNL recovers the true parameters faster and more accurately than all other methods. Although high posterior probability of true parameters is not enough evidence that the posterior is correctly calibrated (e.g. it could be that the approximate posterior happens to be centred at the right value, but is over-confident), the calibration test (Figure 2b) suggests that SNL is indeed well-calibrated.

Another diagnostic possible with SNL is to check how well qφ(x | θ) fits the data distribution p(x | θ) for a certain value of θ. To do this we (a) generate N independent samples from p(x | θ) by running the simulator, (b) generate N independent samples from qφ(x | θ) using Equation (2), and (c) calculate the Maximum Mean Discrepancy [28] between the two sets of samples. This is shown in Figure 2c using the true parameters, where we compare with NL and a baseline Gaussian directly fitted to the samples from p(x | θ). This plot can be used to assess the performance of SNL, and also determine how many rounds to run SNL for. We note that this kind of diagnostic is not possible with methods that approximate the posterior or the likelihood ratio instead of the likelihood.
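
A minimal version of this check is sketched below: an MMD estimate with an RBF kernel and a median-heuristic bandwidth, applied to two toy sample sets standing in for draws from p(x | θ) and qφ(x | θ). The kernel and bandwidth choice are our assumptions for illustration.

```python
# Sketch of the MMD goodness-of-fit check between simulator and model samples.
import numpy as np

rng = np.random.default_rng(0)

def mmd_rbf(X, Y):
    Z = np.vstack([X, Y])
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    bandwidth2 = np.median(d2[d2 > 0])                   # median heuristic
    K = np.exp(-d2 / bandwidth2)
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return np.sqrt(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean())   # biased MMD estimate

X = rng.normal(0.0, 1.0, size=(500, 2))                  # e.g. samples from p(x | theta)
Y = rng.normal(0.1, 1.0, size=(500, 2))                  # e.g. samples from q_phi(x | theta)
print("MMD:", mmd_rbf(X, Y))
```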

Figure 2: M/G/1 queue model. a: Accuracy vs simulation cost: bottom left is best. b: Calibration test for SNL, histogram outside gray band indicates poor calibration. c: Likelihood goodness-of-fit vs simulation cost, calculated at true parameters.

Lotka–Volterra population model [84]. This is a Markov jump process that models the interaction of a population of predators with a population of prey. It is a classic model of oscillating populations. It has four parameters θ, which control the rate of (a) predator births, (b) predator deaths, (c) prey births, and (d) predator-prey interactions. The process can be simulated exactly (by using e.g. the Gillespie algorithm [24]), but its likelihood is intractable. In our experiments we follow the setup of Papamakarios and Murray [59], where x is 9 features of the population timeseries.

Figure 3a shows negative log probability of true parameters vs simulation cost. SNL and SNPE-A perform the best, whereas SNPE-B is less accurate. Running the calibration test for SNL, using parameters drawn from a broad prior, shows the procedure is sometimes over-confident (Figure 3b, left plot). In this test, many of the 'true' parameters considered corresponded to uninteresting models, where both populations died out quickly, or the prey population diverged. In an application we want to know if the procedure is well-behaved for the sorts of parameters we seem to have. We ran a calibration test for an alternative prior, constrained to parameters that give the oscillating behaviour observed in interesting data. This test suggests that the calibration is reasonable when modelling oscillating data (Figure 3b, right plot). If we had still observed calibration problems we would have investigated using larger neural networks and/or longer MCMC runs in SNL. Figure 3c shows the median distance between the simulated data and the observed data for each round of SNPE-A, SNPE-B and SNL. We see that SNPE-A and SNL have a lower median distance compared to SNPE-B, which suggests that SNPE-B has not estimated the posterior accurately enough.

Hodgkin–Huxley neuron model [31]. This model describes how the electrical potential measured across the cell membrane of a neuron varies over time as a function of current injected through an electrode. We used a model of a regular-spiking cortical pyramidal cell [62]; the model is described by a set of five coupled ordinary differential equations, which we solved numerically using NEURON [13]. This is a challenging problem of practical interest in neuroscience [17, 43], where the task is to infer 12 parameters describing the function of the neuron from features of the membrane potential timeseries. We followed the setup of Lueckmann et al. [43], and we used 18 features extracted from the timeseries as data x.

Figure 4a shows negative log probability of true parameters vs simulation cost; SNL outperforms all other methods. SNPE-A fails in the second round due to the variance of the proposal becoming negative. The calibration test in Figure 4b suggests that SNL is well-calibrated. Figure 4c shows goodness-of-fit between MAF and the true likelihood as measured by MMD; SNL converges faster than NL to the true likelihood. The comparison with a baseline Gaussian fit is evidence that SNL has not fully converged, which indicates that running SNL for longer may improve results further.

Figure 3: Lotka–Volterra model. a: Accuracy vs simulation cost: bottom left is best. b: Calibration tests for SNL, histogram outside gray band indicates poor calibration. Calibration depends on data regime (see main text). c: Median distance from simulated to observed data.

Figure 4: Hodgkin–Huxley model. a: Accuracy vs simulation cost: bottom left is best. b: Calibration test for SNL, histogram outside gray band indicates poor calibration. c: Likelihood goodness-of-fit vs simulation cost, calculated at true parameters.

6 Discussion

Performance and robustness of SNL. Our experiments show that SNL is accurate, efficient and robust. The comparison between SNL and NL demonstrates that guiding the simulations is crucial in achieving a good likelihood fit in the region of interest with a reasonable number of simulations. The comparison between SNL and SMC-ABC shows that flexible parametric models can be more accurate and cost-effective compared to non-parametric alternatives. The comparison with SNPE shows that SNL is more robust; SNPE-A failed in two out of four cases, and SNPE-B exhibited high variability. We used the same MAF architecture and training hyperparameters in all our experiments, which shows that a flexible neural architecture can be broadly applicable without extensive problem-specific tuning. We also discussed a number of diagnostics that can be used with SNL to determine the number of rounds to run it for, assess calibration, diagnose convergence and check goodness-of-fit.

Scaling to high-dimensional data. Likelihood-free inference becomes challenging when the dimensionality of the data x is large, which in practice necessitates the use of low-dimensional features (or summary statistics) in place of the raw data. SNL relies on estimating the density of the data, which is a hard problem in high dimensions. A potential strategy for scaling SNL up to high dimensions is exploiting the structure of the data. For instance, if x = (x1, . . . , xN) is a dataset of a large number of i.i.d. datapoints, we could decompose the likelihood as p(x | θ) = ∏n p(xn | θ) and only learn a model qφ(xn | θ) of the density in the lower-dimensional space of individual points. By exploiting data structure, neural density estimators have been shown to accurately model high-dimensional data such as images [18, 80, 81] and raw audio [79, 82]; SNL could easily incorporate such or further advances in neural density estimation for high-dimensional structured objects.

Learning the likelihood vs the posterior. A general question is whether it is preferable to learn the posterior or the likelihood; SNPE learns the posterior, whereas SNL targets the likelihood. As we saw, learning the likelihood can often be easier than learning the posterior, and it does not depend on the choice of proposal, which makes learning easier and more robust. Moreover, a model of the likelihood can be reused with different priors, and is in itself an object of interest that can be used for identifiability analysis [41] or hypothesis testing [11, 15]. On the other hand, methods such as SNPE return a parametric model of the posterior directly, whereas a further inference step (e.g. variational inference or MCMC) is needed on top of SNL to obtain a posterior estimate, which introduces further computational cost and approximation error. Ultimately, the best approach depends on the problem and application at hand.


Acknowledgments

We thank Kyle Cranmer and Michael Gutmann for useful discussions, and James Ritchie for pointing us to simulation-based calibration [75]. George Papamakarios was supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh, and by Microsoft Research through its PhD Scholarship Programme. David C. Sterratt received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 785907.

A Detailed description of simulator models used in experiments

A.1 A toy problem with complex posterior

This toy model illustrates that even simple models can have non-trivial posteriors. The model has 5 parameters θ = (θ1, . . . , θ5) sampled from a uniform prior as follows:

θi ∼ U(−3, 3) for i = 1, . . . , 5. (8)

Given parameters θ, the data x is generated as follows:

mθ = (θ1, θ2)     (9)
s1 = θ3²     (10)
s2 = θ4²     (11)
ρ = tanh(θ5)     (12)
Sθ = [ s1²       ρ s1 s2
       ρ s1 s2   s2²    ]     (13)
xj ∼ N(mθ, Sθ) for j = 1, . . . , 4     (14)
x = (x1, . . . , x4).     (15)

The data x is 8-dimensional. The likelihood is:

p(x | θ) = ∏_{j=1}^{4} N(xj | mθ, Sθ).     (16)

In our experiments, we took the ground truth parameters to be:

θ∗ = (0.7, −2.9, −1, −0.9, 0.6), (17)

and simulated the model with parameters θ∗ to get observed data xo.

A.2 M/G/1 queue model

The M/G/1 queue model [68] describes how a server processes a queue of arriving customers. Our experimental setup follows Papamakarios and Murray [59].

There are 3 parameters θ = (θ1, θ2, θ3) sampled from a uniform prior as follows:

θ1 ∼ U(0, 10) (18)

θ2 − θ1 ∼ U(0, 10) (19)

θ3 ∼ U(0, 1/3). (20)

Let I be the total number of customers, si be the time the server takes to serve customer i, ai be the time customer i arrived, and di be the time customer i departed. Take a0 = d0 = 0. The M/G/1 queue model is described by:

si ∼ U(θ1, θ2) (21)

ai − ai−1 ∼ Exp(θ3) (22)

di − di−1 = si + max (0, ai − di−1). (23)

In our experiments we used I = 50. The data x is 5-dimensional, and is obtained by (a) calculating the 0th, 25th, 50th, 75th and 100th quantiles of the set of inter-departure times {di − di−1}1:I, and (b) linearly transforming the quantiles to have approximately zero mean and unit covariance matrix. The parameters of the linear transformation were determined by a pilot run. We took the ground truth parameters to be:

θ∗ = (1, 5, 0.2), (24)

and simulated the model with parameters θ∗ to get observed data xo.
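
Equations (21)-(23) translate into the following simulator sketch; it returns the five raw quantiles and omits the whitening step, whose parameters come from a pilot run in the paper.

```python
# Sketch of the M/G/1 queue simulator of Equations (21)-(23).
import numpy as np

rng = np.random.default_rng(0)

def simulate_mg1(theta1, theta2, theta3, n_customers=50):
    s = rng.uniform(theta1, theta2, size=n_customers)         # service times
    inter_arrival = rng.exponential(1.0 / theta3, size=n_customers)
    a = np.cumsum(inter_arrival)                               # arrival times
    d = np.zeros(n_customers)
    prev_d = 0.0
    for i in range(n_customers):
        prev_d = prev_d + s[i] + max(0.0, a[i] - prev_d)       # Equation (23)
        d[i] = prev_d
    inter_departure = np.diff(np.concatenate([[0.0], d]))
    return np.percentile(inter_departure, [0, 25, 50, 75, 100])

print(simulate_mg1(1.0, 5.0, 0.2))                             # ground truth theta*
```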

A.3 Lotka–Volterra population model

The Lotka–Volterra model [84] is a Markov jump process describing the evolution of a population of predators interacting with a population of prey, and has four parameters θ = (θ1, . . . , θ4). Let X be the number of predators, and Y be the number of prey. According to the model, the following can take place:

• With rate exp(θ1)XY a predator may be born, increasing X by one.

• With rate exp(θ2)X a predator may die, decreasing X by one.

• With rate exp(θ3)Y a prey may be born, increasing Y by one.

• With rate exp(θ4)XY a prey may be eaten by a predator, decreasing Y by one.

Our experimental setup follows that of Papamakarios and Murray [59]. We used initial populations X = 50 and Y = 100. We simulated the model using the Gillespie algorithm [24] for a total of 30 time units (a code sketch of this simulation procedure is given at the end of this subsection). We recorded the two populations every 0.2 time units, which gives two timeseries of 151 values each. The data x is 9-dimensional, and corresponds to the following timeseries features:

• The mean of each timeseries.

• The log variance of each timeseries.

• The autocorrelation coefficient of each timeseries at lags 0.2 and 0.4 time units.

• The cross-correlation coefficient between the two timeseries.

Each feature was normalized to have approximately zero mean and unit variance based on a pilot run. The ground truth parameters were taken to be:

θ∗ = (log 0.01, log 0.5, log 1, log 0.01), (25)

and the observed data xo were generated from a simulation of the model at θ∗. In our experiments we used two priors: (a) a broad prior defined by:

pbroad(θ) ∝ ∏_{i=1}^{4} I(−5 ≤ θi ≤ 2),     (26)

and (b) a prior corresponding to the oscillating regime, defined by:

posc(θ) ∝ N(θ | θ∗, 0.5²) pbroad(θ).     (27)
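
As referenced above, a sketch of the Gillespie simulation and the 0.2-time-unit recording grid might look as follows. The bookkeeping and termination logic are our own; only the reaction rates, initial populations and recording interval come from the description above.

```python
# Sketch of simulating the Lotka-Volterra model with the Gillespie algorithm.
import numpy as np

rng = np.random.default_rng(0)

def simulate_lv(theta, X=50, Y=100, duration=30.0, dt=0.2):
    r1, r2, r3, r4 = np.exp(theta)              # reaction rate constants
    t, t_next_record = 0.0, 0.0
    Xs, Ys = [], []
    while t_next_record <= duration + 1e-9:
        rates = np.array([r1 * X * Y,           # predator born
                          r2 * X,               # predator dies
                          r3 * Y,               # prey born
                          r4 * X * Y])          # prey eaten
        total = rates.sum()
        t_event = t + (rng.exponential(1.0 / total) if total > 0 else np.inf)
        # record the state at every grid point that falls before the next event
        while t_next_record <= duration + 1e-9 and t_next_record < t_event:
            Xs.append(X)
            Ys.append(Y)
            t_next_record += dt
        if t_event > duration or total == 0:
            break
        t = t_event
        event = rng.choice(4, p=rates / total)
        if event == 0:   X += 1
        elif event == 1: X -= 1
        elif event == 2: Y += 1
        else:            Y -= 1
    return np.array(Xs), np.array(Ys)           # two timeseries of 151 values each

Xs, Ys = simulate_lv(np.log([0.01, 0.5, 1.0, 0.01]))   # ground truth theta*
print(len(Xs), Xs[:5], Ys[:5])
```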

A.4 Hodgkin–Huxley cortical pyramidal neuron model

In neuroscience, the formalism developed by Hodgkin and Huxley in their classic model of the squid giant axon [31] is used to model many different types of neuron. In our experiments, we used a slightly modified version of a regular-spiking cortical pyramidal cell [62], for which NEURON [13] simulation code is available in ModelDB.1 The model is formulated as a set of five coupled ordinary differential equations (ODEs) and describes how the electrical potential V(t) measured across the neuronal cell membrane varies over time as a function of current Ie(t) injected through an electrode. In essence, the membrane is a capacitor punctuated by conductances formed by multiple types of ion channel through which currents flow. The currents charge and discharge the membrane capacitance, causing the membrane potential to change, as described by the first ODE, the membrane equation:

Cm dV/dt = −Iℓ − INa − IK − IM − Ie.     (28)

1https://senselab.med.yale.edu/ModelDB/ShowModel.cshtml?model=123623

Here, Cm = 1 µF cm⁻² is the specific membrane capacitance, and Iℓ, INa, IK and IM are the ionic currents flowing through 'leak' channels, sodium channels, potassium delayed-rectifier channels and M-type potassium channels respectively. Each ionic current depends on a conductance that corresponds to how many channels are open, and on the difference between the membrane potential and an equilibrium potential. For example, for the leak current:

Iℓ = gℓ (V − Eℓ),     (29)

where gℓ is the leak conductance, and Eℓ is the leak equilibrium potential. Here the conductance gℓ is constant through time, but for the sodium, potassium and M-type channels, the conductances vary over time, as described by the product of a fixed conductance and time-varying state variables:

INa = gNa m³ h (V − ENa)     (30)
IK = gK n⁴ (V − EK)     (31)
IM = gM p (V − EK).     (32)

Here, gNa, gK and gM are the per-channel maximum conductances, m, h, n and p are state variables that range between 0 and 1, and ENa and EK are the sodium and potassium reversal potentials. The state variables evolve according to differential equations of the form first introduced by Hodgkin and Huxley [31]:

dx/dt = αx(V)(1 − x) − βx(V) x   for x ∈ {m, h, n}     (33)
dp/dt = (p∞(V) − p) / τp(V),     (34)

where αx(V), βx(V), p∞(V) and τp(V) are nonlinear functions of the membrane potential. We use the published equations [62] for αm(V), βm(V), αh(V), βh(V) and αn(V), which contain a parameter VT, and the published equations for p∞(V) and τp(V), the latter of which contains a parameter τmax. We use a generalized version of βn(V):

βn(V) = kβn1 exp( −(V − VT − 10) / kβn2 )     (35)

in which kβn1 and kβn2 are adjustable parameters (rather than 0.5 ms⁻¹ and 40 mV). In order to simulate the model, we use NEURON [13] to solve the ODEs numerically from initial conditions of:

m = h = n = p = 0 and V = −70 mV, (36)

using the “CNexp” method and a time-step of 25 µs. At each time step the injected current Ie is drawn from a normal distribution with mean 0.5 nA and variance σ². The duration of the simulation is 100 ms and the voltage is recorded, which generates a timeseries of 4001 voltage recordings.

Our inference setup follows Lueckmann et al. [43]. There are 12 parameters θ = (θ1, . . . , θ12) to infer, defined as:

θ1 = log(gℓ)      θ7 = log(−EK)
θ2 = log(gNa)     θ8 = log(−VT)
θ3 = log(gK)      θ9 = log(kβn1)
θ4 = log(gM)      θ10 = log(kβn2)
θ5 = log(−Eℓ)     θ11 = log(τmax)
θ6 = log(ENa)     θ12 = log(σ).     (37)

The data x is taken to be 18 features of the voltage timeseries V(t), in particular (a code sketch of this feature extraction is given after the list):

• The mean and log standard deviation of V (t).

• The normalized 3rd, 5th and 7th moments of V (t).

• The logs of the normalized 4th, 6th and 8th moments of V(t).

• The autocorrelation coefficients of V(t) at lags k × 2.5 ms for k = 1, . . . , 10.
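
As noted above, the feature extraction can be sketched as follows. The exact definitions of the normalized moments and of the autocorrelation estimator are our assumptions, and the placeholder voltage trace stands in for a NEURON simulation.

```python
# Sketch: extract the 18 summary features from a voltage trace V(t) sampled at
# 25 microsecond steps (so a lag of 2.5 ms corresponds to 100 samples).
import numpy as np

rng = np.random.default_rng(0)

def summary_features(v, steps_per_lag=100, n_lags=10):
    z = (v - v.mean()) / v.std()
    feats = [v.mean(), np.log(v.std())]                       # mean and log std
    feats += [np.mean(z ** k) for k in (3, 5, 7)]             # normalized odd moments
    feats += [np.log(np.mean(z ** k)) for k in (4, 6, 8)]     # logs of even moments
    for k in range(1, n_lags + 1):                            # autocorrelation at k * 2.5 ms
        lag = k * steps_per_lag
        feats.append(np.mean(z[:-lag] * z[lag:]))
    return np.array(feats)                                    # 18 features in total

v = -70.0 + rng.standard_normal(4001)                         # placeholder voltage trace
print(summary_features(v).shape)
```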

The features are linearly transformed to have approximately zero mean and unit covariance matrix; the parameters of the transformation are calculated based on a pilot run. The ground truth parameters θ∗ are taken to be:

θ∗1 = log(10⁻⁴)       θ∗7 = log(100)
θ∗2 = log(0.2)        θ∗8 = log(60)
θ∗3 = log(0.05)       θ∗9 = log(0.5)
θ∗4 = log(7 × 10⁻⁵)   θ∗10 = log(40)
θ∗5 = log(70)         θ∗11 = log(1000)
θ∗6 = log(50)         θ∗12 = log(1).     (38)

The prior over parameters is:

θi ∼ U(θ∗i − log 2, θ∗i + log 1.5) for i = 1, . . . , 12.     (39)

Finally, the observed data xo were generated by simulating the model at θ∗.

B Full experimental results

In this section, we include the full set of experimental results. For each simulator model, we report:

• The approximate posterior computed by SNL.

• The trade-off between accuracy and simulation cost. This is reported for all methods.

• The full results of the simulation-based calibration test, consisting of one histogram per parameter.

• The distance-based convergence diagnostic, i.e. the distance between simulated and observed data vs the number of rounds. This is reported for SNL, SNPE-A and SNPE-B.

• The goodness-of-fit diagnostic, i.e. the Maximum Mean Discrepancy between simulated data and data generated from the likelihood model, for a given parameter value (we use the true parameters θ∗). We report this for SNL, NL and a baseline Gaussian fit.

B.1 A toy problem with complex posterior

Figure 5 shows the results. The exact posterior p(θ | xo) is plotted in Figure 5a. Even though the prior is uniform and the likelihood is Gaussian, the posterior is complex and non-trivial: it has four symmetric modes due to the two squaring operations in Equations (10) and (11), and vertical cut-offs due to the hard constraints imposed by the prior. We can see that the SNL posterior (Figure 5b) approximates the exact posterior well.

B.2 M/G/1 queue model

Figure 6 shows the results. The SNL posterior is shown in Figure 6a. We can see that the posterior is concentrated around the true parameters. Parameter θ1 is particularly well constrained. From the description of the model in Equations (18)–(20), it directly follows that:

θ1 ≤ min_i (di − di−1).     (40)

The data x is an invertible linear transformation of the quantiles of {di − di−1}1:I including the 0th quantile, which is precisely equal to min_i (di − di−1). Hence, the data x imposes a hard constraint on the maximum possible value of θ1, as correctly captured by the SNL posterior. On the other hand, the data is less informative about θ2 and θ3, hence these parameters are less constrained.

B.3 Lotka–Volterra population model

Figure 7 shows the results. The SNL posterior is shown in Figure 7a; the extent of each plot corresponds to the broad prior pbroad(θ). We can see that the posterior is tightly concentrated around the true parameters, which suggests that the data x is highly informative about the parameters θ.


B.4 Hodgkin–Huxley cortical pyramidal neuron model

The results are shown in Figure 8. The SNL posterior is shown in Figure 9; the extent of each plot corresponds to the uniform prior. In the SNL posterior, it can be seen that all parameters are clustered around their true values (red dots and lines).

The sodium and leak equilibrium potentials are relatively tightly clustered, and the potassium equilibrium potential less so:

ENa = 50 ± 2 mV     (41)
EK = −99 mV (range [−121 mV, −90 mV])     (42)
Eℓ = −70 ± 4 mV.     (43)

The tight clustering reflects that concentrations, and hence equilibrium potentials, are maintained within a range by neuronal ion exchangers and glial buffering [72]. Furthermore, when regulation of ion concentration fails, pathological brain states can arise [72]. The longer tail of the potassium equilibrium potential posterior might be due to it having relatively little influence on the mean of V(t), since at lower potentials the potassium conductance gK n⁴ will be relatively small, so, according to Equation (31), the potassium current will also be small. The quantity VT, which adjusts the threshold of spike initiation, is also fairly tightly controlled, which will tend to keep the firing rate around a constant value.

In contrast to the equilibrium potentials, the conductances vary more, within a factor of 1.8 for gℓ, and a factor of 3 for gNa, gK and gM. Moreover, gNa and gK are correlated, which is consistent with their opposing depolarizing and hyperpolarizing influences on the membrane potential. A higher sodium conductance could lead to the cell being hyper-excitable, but this should be counteracted by a greater potassium conductance. This allows for their wide range, and is consistent with the biological evidence for diverse but correlated sets of channel conductances underlying particular activity patterns [25]. In contrast, gM appears to have relatively little influence over the output, and is not correlated with any other parameters.

The parameter τmax, which also relates to the M-type potassium channel, also has little effect. Further simulations of other neuron types could be undertaken to see if these parameters are generally loosely constrained, which could then lead to experimentally testable predictions about the density and variability of M-type conductances.

The parameters kβn1 and kβn2 also have relatively wide ranges, and there is a weak correlation between the two. Increasing kβn1 effectively increases the half-activation voltage of potassium conductances, and increasing kβn2 makes the slope of the transition less pronounced. The lack of posterior mass at high kβn1 (high threshold) and low kβn2 (sharper transition) might be because such settings would cause the neuron not to repolarize quickly enough, and hence be hyper-excitable.

Finally, we note that the posterior computed by SNL is qualitatively consistent with the posterior reported by Lueckmann et al. [43] in Figure G.2 of their article.

References

[1] S. Agostinelli et al. Geant4—a simulation toolkit. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 506(3):250–303, 2003.

[2] J. Alsing, B. Wandelt, and S. Feeney. Massive optimal data compression and density estimation for scalable, likelihood-free inference in cosmology. Monthly Notices of the Royal Astronomical Society, 477(3):2874–2885, 2018.

[3] R. Bansal and A. Yaron. Risks for the long run: A potential resolution of asset pricing puzzles. Journal of Finance, 59(4):1481–1509, 2004.

[4] M. A. Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41(1):379–406, 2010.

[5] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162:2025–2035, 2002.

[6] M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, and C. P. Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.

[7] C. M. Bishop. Mixture density networks. Technical Report NCRG/94/004, Aston University, 1994.

[8] M. G. B. Blum. Approximate Bayesian computation: A nonparametric perspective. Journal of the American Statistical Association, 105(491):1178–1187, 2010.

[9] M. G. B. Blum and O. Francois. Non-linear regression models for approximate Bayesian computation. Statistics and Computing, 20(1):63–73, 2010.

[10] F. V. Bonassi and M. West. Sequential Monte Carlo with adaptive weights for approximate Bayesian computation. Bayesian Analysis, 10(1):171–187, 2015.

[11] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez. A guide to constraining effective field theories with machine learning. Physical Review Letters, 98(5):052004, 2018.


Figure 5: A toy model with complex posterior. (a) MCMC samples from exact posterior; true parameters indicated in red. (b) MCMC samples from SNL posterior; true parameters indicated in red. (c) Maximum Mean Discrepancy vs simulation cost. (d) Minus log probability of true parameters vs simulation cost. (e) Calibration test for SNL; histogram outside gray band indicates poor calibration. (f) Median distance from simulated to observed data. (g) Likelihood goodness-of-fit vs simulation cost, calculated at true parameters.


Figure 6: M/G/1 queue model. (a) MCMC samples from SNL posterior; true parameters indicated in red. (b) Minus log probability of true parameters vs simulation cost. (c) Calibration test for SNL; histogram outside gray band indicates poor calibration. (d) Median distance from simulated to observed data. (e) Likelihood goodness-of-fit vs simulation cost, calculated at true parameters.


Figure 7: Lotka–Volterra population model. (a) MCMC samples from SNL posterior; true parameters indicated in red. (b) Minus log probability of true parameters vs simulation cost. (c) Calibration test for SNL, oscillating regime. (d) Calibration test for SNL, broad prior. (e) Median distance from simulated to observed data. (f) Likelihood goodness-of-fit vs simulation cost, calculated at true parameters.


Figure 8: Hodgkin–Huxley neuron model. (a) Minus log probability of true parameters vs simulation cost. (b) Calibration test for SNL (θ1–θ12); histogram outside gray band indicates poor calibration. (c) Median distance from simulated to observed data. (d) Likelihood goodness-of-fit vs simulation cost, calculated at true parameters.


Figure 9: Hodgkin–Huxley model: MCMC samples from SNL posterior over log(gNa), log(gK), log(gℓ), log(ENa), log(−EK), log(−Eℓ), log(gM), log(τmax), log(kβn1), log(kβn2), log(−VT) and log(σ). True parameters θ∗ are indicated in red. The range of each histogram is [θ∗i − log 2, θ∗i + log 1.5] for i = 1, . . . , 12. Compare with Figure G.2 by Lueckmann et al. [43].


[12] J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer. Mining gold from implicit models to improve likelihood-free inference. arXiv:1805.12244, 2018.

[13] T. Carnevale and M. Hines. The NEURON Book. Cambridge University Press, 2006.

[14] M. L. Casado, A. G. Baydin, D. M. Rubio, T. A. Le, F. Wood, L. Heinrich, G. Louppe, K. Cranmer, K. Ng, W. Bhimji, and Prabhat. Improvements to inference compilation for probabilistic programming in large-scale scientific simulators. arXiv:1712.07901, 2017.

[15] K. Cranmer, J. Pavez, and G. Louppe. Approximating likelihood ratios with calibrated discriminative classifiers. arXiv:1506.02169, 2016.

[16] M. F. Cusumano-Towner, A. Radul, D. Wingate, and V. K. Mansinghka. Probabilistic programs for inferring the goals of autonomous agents. arXiv:1704.04977, 2017.

[17] A. C. Daly, D. J. Gavaghan, C. Holmes, and J. Cooper. Hodgkin–Huxley revisited: reparametrization and identifiability analysis of the classic action potential model with approximate Bayesian methods. Royal Society Open Science, 2(12), 2015.

[18] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. Proceedings of the 5th International Conference on Learning Representations, 2017.

[19] R. Dutta, J. Corander, S. Kaski, and M. U. Gutmann. Likelihood-free inference by ratio estimation. arXiv:1611.10242, 2016.

[20] R. G. Everitt. Bootstrapped synthetic likelihood. arXiv:1711.05825, 2018.

[21] Y. Fan, D. J. Nott, and S. A. Sisson. Approximate Bayesian Computation via regression density estimation. Stat, 2(1):34–48, 2013.

[22] M. Fasiolo, S. N. Wood, F. Hartig, and M. V. Bravington. An extended empirical saddlepoint approximation for intractable likelihoods. Electronic Journal of Statistics, 12(1):1544–1578, 2018.

[23] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. Proceedings of the 32nd International Conference on Machine Learning, 2015.

[24] D. T. Gillespie. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25):2340–2361, 1977.

[25] M. S. Goldman, J. Golowasch, E. Marder, and L. F. Abbott. Global structure, robustness, and modulation of neuronal models. The Journal of Neuroscience, 21(14):5229–5238, 2001.

[26] C. Gourieroux, A. Monfort, and E. Renault. Indirect inference. Journal of Applied Econometrics, 8(S1):S85–S118, 1993.

[27] M. M. Graham and A. J. Storkey. Asymptotically exact inference in differentiable generative models. Electronic Journal of Statistics, 11(2):5105–5164, 2017.

[28] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723–773, 2012.

[29] M. U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47, 2016.

[30] M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander. Likelihood-free inference via classification. Statistics and Computing, 28(2):411–425, 2018.

[31] A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117:500–544, 1952.

[32] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning, 2015.

[33] R. Izbicki, A. Lee, and C. Schafer. High-dimensional density ratio estimation with extensions to approximate likelihood computation. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014.

[34] M. Jarvenpaa, M. U. Gutmann, A. Pleska, A. Vehtari, and P. Marttinen. Efficient acquisition rules for model-based approximate Bayesian computation. Bayesian Analysis, 2018.

[35] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, 2015.

[36] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems 28, 2015.

[37] T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. K. Mansinghka. Picture: A probabilistic programming language for scene perception. IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[38] T. A. Le, A. G. Baydin, and F. Wood. Inference compilation and universal probabilistic programming. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1338–1348, 2017.

[39] C. Leuenberger and D. Wegmann. Bayesian computation and model selection without likelihoods. Genetics, 184(1):243–252, 2010.

[40] J. Li, D. J. Nott, Y. Fan, and S. A. Sisson. Extending approximate Bayesian computation methods to high dimensions via a Gaussian copula model. Computational Statistics & Data Analysis, 106(C):77–89, 2017.

[41] J. Lintusaari, M. U. Gutmann, S. Kaski, and J. Corander. On the identifiability of transmission dynamic models for infectious diseases. Genetics, 202(3):911–918, 2016.

[42] J. Lintusaari, M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander. Fundamentals and recent developments in approximate Bayesian computation. Systematic Biology, 66(1):e66–e82, 2017.

[43] J.-M. Lueckmann, P. J. Goncalves, G. Bassetto, K. Ocal, M. Nonnenmacher, and J. H. Macke. Flexible statistical inference for mechanistic models of neural dynamics. Advances in Neural Information Processing Systems 30, 2017.

[44] J.-M. Lueckmann, G. Bassetto, T. Karaletsos, and J. H. Macke. Likelihood-free inference with emulator networks. arXiv:1805.09294, 2018.

[45] V. K. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. B. Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. Advances in Neural Information Processing Systems 26, 2013.

[46] J.-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.

[47] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavare. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.

[48] H. Markram et al. Reconstruction and simulation of neocortical microcircuitry. Cell, 163:456–492, 2015.

[49] E. Meeds and M. Welling. GPS-ABC: Gaussian process surrogate approximate Bayesian computation. Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.

[50] E. Meeds and M. Welling. Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free inference. Advances in Neural Information Processing Systems 28, 2015.

[51] E. Meeds, R. Leenders, and M. Welling. Hamiltonian ABC. Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.

[52] A. Moreno, T. Adel, E. Meeds, J. M. Rehg, and M. Welling. Automatic variational ABC. arXiv:1606.08549, 2016.

[53] Q. Morris. Recognition networks for approximate inference in BN20 networks. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001.

[54] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

[55] R. M. Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.

[56] R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, chapter 5, pages 113–162. Chapman & Hall/CRC, 2011.

[57] V. M. H. Ong, D. J. Nott, M.-N. Tran, S. A. Sisson, and C. C. Drovandi. Variational Bayes with synthetic likelihood. Statistics and Computing, 28(4):971–988, 2018.

[58] B. Paige and F. Wood. Inference networks for sequential Monte Carlo in graphical models. Proceedings of the 33rd International Conference on Machine Learning, 2016.

[59] G. Papamakarios and I. Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. Advances in Neural Information Processing Systems 29, 2016.

[60] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems 30, 2017.

[61] K. C. Pham, D. J. Nott, and S. Chaudhuri. A note on approximating ABC-MCMC using flexible classifiers. Stat, 3(1):218–227, 2014.

[62] M. Pospischil, M. Toledo-Rodriguez, C. Monier, Z. Piwkowska, T. Bal, Y. Fregnac, H. Markram, and A. Destexhe. Minimal Hodgkin–Huxley type models for different classes of cortical and thalamic neurons. Biological Cybernetics, 99(4):427–441, 2008.

[63] L. F. Price, C. C. Drovandi, A. Lee, and D. J. Nott. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1–11, 2018.

[64] J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.


[65] O. Ratmann, O. Jørgensen, T. Hinkley, M. Stumpf, S. Richardson, and C. Wiuf. Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum. PLoS Computational Biology, 3(11):2266–2278, 2007.

[66] L. Romaszko, C. K. Williams, P. Moreno, and P. Kohli. Vision-as-inverse-graphics: Obtaining a rich 3D explanation of a scene from a single image. IEEE International Conference on Computer Vision Workshop, 2017.

[67] C. M. Schafer and P. E. Freeman. Likelihood-free inference in cosmology: Potential for the estimation of luminosity functions. Statistical Challenges in Modern Astronomy V, pages 3–19, 2012.

[68] A. Y. Shestopaloff and R. M. Neal. On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv:1401.5548, 2014.

[69] S. A. Sisson and Y. Fan. Likelihood-free Markov chain Monte Carlo. In Handbook of Markov Chain Monte Carlo, chapter 12, pages 313–336. Chapman & Hall/CRC, 2011.

[70] S. A. Sisson, Y. Fan, and M. M. Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.

[71] T. Sjostrand, S. Mrenna, and P. Skands. A brief introduction to PYTHIA 8.1. Computer Physics Communications, 178(11):852–867, 2008.

[72] G. G. Somjen. Ion regulation in the brain: implications for pathophysiology. The Neuroscientist, 8(3):254–267, 2002.

[73] D. C. Sterratt, B. Graham, A. Gillies, and D. Willshaw. Principles of Computational Modelling in Neuroscience. Cambridge University Press, 2011.

[74] M. Stoye, J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer. Likelihood-free inference with an improved cross-entropy estimator. arXiv:1808.00973, 2018.

[75] S. Talts, M. Betancourt, D. Simpson, A. Vehtari, and A. Gelman. Validating Bayesian inference algorithms with simulation-based calibration. arXiv:1804.06788, 2018.

[76] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface, 6(31):187–202, 2009.

[77] D. Tran, R. Ranganath, and D. Blei. Hierarchical implicit models and likelihood-free variational inference. Advances in Neural Information Processing Systems 30, 2017.

[78] M.-N. Tran, D. J. Nott, and R. Kohn. Variational Bayes with intractable likelihood. Journal of Computational and Graphical Statistics, 26(4):873–882, 2017.

[79] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.

[80] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems 29, 2016.

[81] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. Proceedings of the 33rd International Conference on Machine Learning, 2016.

[82] A. van den Oord et al. Parallel WaveNet: Fast high-fidelity speech synthesis. Proceedings of the 35th International Conference on Machine Learning, 2018.

[83] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[84] D. J. Wilkinson. Stochastic Modelling for Systems Biology, Second Edition. Taylor & Francis, 2011.

[85] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.

[86] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. B. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in Neural Information Processing Systems 28, 2015.

[87] A. Yuille and D. Kersten. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.