
Page 1

The Turing language for probabilistic programming

Hong Ge¹, Kai Xu²

Oct 2018

The first international conference on probabilistic programming

1: Department of Engineering, University of Cambridge
2: School of Informatics, University of Edinburgh

Page 2

Talk plan

[Overview diagram: Data + Prob-Prog → Model → Predictions (on Test Data)]

Page 3

Probabilistic programming languages

• Probabilistic programs: computer programs that represent probabilistic models using probabilistic statements:
  • declaring random variables
  • conditioning on observed data
• Universal probabilistic programming:
  • stochastic control flow
  • allows representing arbitrary probabilistic models
• Generic inference engines: HMC, SMC, particle Gibbs, EP
• Two approaches to implementing a PPL:
  • standalone: Stan, BUGS, Venture, etc.
  • embedded: Anglican, Infer.NET, PyMC3, Pyro, Edward, Turing, etc.

Page 4

Talk plan

Fig 1. Workflow & components of Turing: Data and a Model are fed to the Compiler and Inference Engines (built on Libtask, AutoDiff and Bijector) to produce a Learned Model, which gives Predictions on Test Data.

Page 5

Talk plan (workflow diagram as in Fig 1)

Page 6

The modelling language in Turing

@model gdemo(x) = begin
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    for i = 1:length(x)
        x[i] ~ Normal(m, sqrt(s))
    end
    return s, m
end

Fig 2: Simple Gaussian model in Turing

Page 7

The modelling language in Turing

@model gdemo(x) = begin
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    for i = 1:length(x)
        x[i] ~ Normal(m, sqrt(s))
    end
    return s, m
end

Fig 3: Illustration of Turing's syntax

• `@model` translates a normal Julia program into a Turing model.
• `gdemo(x)` defines a Julia function.
• Observations are declared as the parameters in the function definition.
• Everything else follows standard Julia syntax.
• When the left-hand side of `~` is not declared as an observation, the statement defines a random variable.
• When the left-hand side of `~` is an observation, it performs conditioning.
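To make the semantics of `~` concrete, the following is a hand-written sketch (plain Julia and Distributions.jl, not Turing code) of the unnormalised log joint density that `gdemo` denotes, with `x` observed and `s`, `m` latent; the function name `gdemo_logjoint` is made up for this illustration.

```julia
using Distributions

# The log joint log p(s, m, x) implied by gdemo; for illustration only.
function gdemo_logjoint(s, m, x)
    lp = logpdf(InverseGamma(2, 3), s)        # s ~ InverseGamma(2, 3): latent variable
    lp += logpdf(Normal(0, sqrt(s)), m)       # m ~ Normal(0, sqrt(s)): latent variable
    for xi in x
        lp += logpdf(Normal(m, sqrt(s)), xi)  # x[i] ~ Normal(m, sqrt(s)): conditioning on data
    end
    return lp
end

gdemo_logjoint(1.0, 0.5, [1.5, 2.0])
```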

Page 8

Talk plan (workflow diagram as in Fig 1)

Page 9

Simulation-based inference

• Sample all variables using a forward-simulation method:
  • sequential Monte Carlo
  • particle MCMC
  • single-site MH, …
• Universal: applicable to models with stochastic control flow.

Related work: Church, WebPPL, Venture, Anglican, Turing

[Graphical model of a Bayesian HMM: initial, trans, statesmean; states[1], states[2], states[3]; data[1], data[2], data[3]]


Page 10

Gradient-based inference

• Sample all variables using a generic gradient-guided algorithm, e.g.:
  • HMC (NUTS)
  • black-box variational inference
• Non-universal:
  • no stochastic control flow
  • no discrete variables

[Graphical model of the Bayesian HMM as before: initial, trans, statesmean; states[1..3]; data[1..3]]

Page 11

Compositional inference

• Combine simulation-based and gradient-based inference.
• Generic, universal engine.
• Gibbs sampling for BayesHMM (sketched below):
  • sample states using particle Gibbs
  • sample initial, trans and statesmean using HMC

[Graphical model of the Bayesian HMM: initial, trans, statesmean; states[1], states[2], states[3], …; data[1], data[2], data[3], …]
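The slides do not show the BayesHMM source, so the following is a minimal sketch of what such a model and its compositional sampler could look like, assuming the Turing API used elsewhere in this talk (`@model`, `Gibbs`, `HMC`, `PG`). The variable names follow the diagram; the priors, emission noise (0.1) and use of `tzeros` for the discrete state vector are illustrative assumptions, not code from the slides.

```julia
using Turing

# Hypothetical Bayesian HMM with K hidden states; names follow the diagram above.
@model BayesHmm(data, K) = begin
    N = length(data)
    initial ~ Dirichlet(ones(K) / K)                   # initial state distribution
    statesmean = Vector{Real}(undef, K)
    trans = Vector{Vector{Real}}(undef, K)
    for k = 1:K
        statesmean[k] ~ Normal(k, 0.5)                 # emission mean of state k
        trans[k] ~ Dirichlet(ones(K) / K)              # transition probabilities out of state k
    end
    states = tzeros(Int, N)                            # task-safe array for particle samplers
    states[1] ~ Categorical(initial)
    data[1] ~ Normal(statesmean[states[1]], 0.1)
    for t = 2:N
        states[t] ~ Categorical(trans[states[t - 1]])  # discrete: sampled by PG
        data[t] ~ Normal(statesmean[states[t]], 0.1)   # observation
    end
end

# Compositional sampler: HMC for the continuous parameters, PG for the discrete states.
g = Gibbs(500, HMC(1, 0.2, 3, :initial, :trans, :statesmean), PG(50, 1, :states))
chain = sample(BayesHmm(randn(20) .+ 1, 3), g)
```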

Page 12

Basic inference in Turing

@model gdemo(x) = begin
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    for i = 1:length(x)
        x[i] ~ Normal(m, sqrt(s))
    end
    return s, m
end

mf = gdemo([1.5, 2])

alg = HMC(2000, 0.1, 10)

chain = sample(mf, alg)

1. By passing data to a compiled model, we get a generated model function `mf`.
2. An inference algorithm is defined by its name and corresponding parameters.
3. The `sample` function takes a generated model function and a sampling algorithm to perform inference.
4. The returned value `chain` stores the MCMC samples.
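Post-sampling processing is handled by MCMCChain.jl (see pages 17 and 18 for the actual output). A minimal usage sketch, assuming `chain` from above and that MCMCChain.jl and a plotting backend are loaded; exact package and function availability depends on the Turing/MCMCChain version:

```julia
using Turing, MCMCChain, Plots  # MCMCChain provides chain summaries and plot recipes

describe(chain)   # posterior means, standard errors, ESS and quantiles (as on page 18)
plot(chain)       # trace and density plots for each sampled variable (as on page 17)
```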


Page 13

Compositional inference in Turing

# Sampler = Hamiltonian Monte Carlo + Particle Gibbs
g1 = Gibbs(500, HMC(1, 0.2, 3, :m), PG(50, 1, :s))

1. Gibbs is defined by a number of iterations and multiple sampling algorithms as its components.
2. HMC is specified to sample the variable m.
3. PG is specified to sample the variable s.
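The composed sampler is then used like any other algorithm. A minimal usage sketch, assuming the `gdemo` model from page 12 (the call mirrors the one shown on page 16):

```julia
chain = sample(gdemo([1.5, 2.0]), g1)  # run the Gibbs(HMC, PG) sampler on gdemo
```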

Page 14

Available algorithms in Turing

Currently supported inference algorithms in Turing


Excerpt from Ge, Xu & Ghahramani (2018):

| Sampler | Supports discrete variables? | Requires gradients? | Requires adaptation? | Supports universal programs? | MCMC factory operator? |
|---------|------------------------------|---------------------|----------------------|------------------------------|------------------------|
| HMC     | No  | Yes | Yes | No  | Yes |
| NUTS    | No  | Yes | Yes | No  | Yes |
| IS      | Yes | No  | No  | Yes | No  |
| SMC     | Yes | No  | No  | Yes | No  |
| PG      | Yes | No  | No  | Yes | Yes |
| PMMH    | Yes | No  | No  | Yes | Yes |
| IPMCMC  | Yes | No  | No  | Yes | Yes |

Table 1: Supported Monte Carlo algorithms in Turing (v0.3.1).

… a probabilistic program (i.e. θ = {π_{1:K}, φ_{1:K}, z_{1:T}}, cf. Algorithm 2.1), we no longer have the computation graph between model parameters.

For certain probabilistic programs, it is possible to construct the computation graph between variables through static analysis. This is the approach taken by the BUGS language [Lunn et al., 2000] and Infer.NET [Minka et al., 2014]. Once the computation graph is constructed, a Gibbs sampler or message-passing algorithm [Minka, 2001; Winn and Bishop, 2005] can be applied to each random node of the computation graph. However, one caveat of this approach is that the computation graph underlying a probabilistic program needs to be fixed at inference time. For programs involving stochastic branches, this requirement may not be satisfied. In such cases, we have to resort to other inference methods.

2.4 Hamiltonian Monte Carlo based inference

For the family of models whose log probability is point-wise computable and differentiable, there exists an efficient sampling method based on Hamiltonian dynamics. Developed as an extension of the Metropolis algorithm, Hamiltonian Monte Carlo (HMC) uses Hamiltonian dynamics to produce minimally correlated proposals. Within HMC, the slow exploration of the state space, originating from the diffusive behavior of MH's random-walk proposals, is avoided by augmenting the state space of the target distribution p(θ) with a D-dimensional vector r. The resulting joint distribution is

p(θ, r) = p(θ) p(r),   p(r) = N(0, I_D)   (5)

HMC operates by alternating between two types of proposals. The first proposal randomizes r (also known as the momentum variable), leaving the state θ unchanged. The second proposal changes both θ and r using simulated Hamiltonian dynamics, as defined by

H(θ, r) = E(θ) + K(r)   (6)

where E(θ) ≡ −log p(θ) and K(r) is a 'kinetic energy' such as K(r) = rᵀr/2. These two proposals produce samples from the joint distribution p(θ, r). Since p(θ, r) is separable, the marginal distribution of θ is the desired distribution p(θ). Hence, we can obtain a sequence of samples for θ by simply discarding the momentum variables r.

The limitation of HMC is that it cannot sample from distributions that are not differentiable or that involve discrete variables. Probabilistic programming languages with stochastic branches [Goodman et al., 2008; Mansinghka et al., 2014; Wood et al., 2014], such as conditionals, loops and recursions, pose a substantially more challenging Bayesian inference problem, because inference engines have to manage a varying number of model dimensions, dynamic computation graphs, and so on.
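To make the two proposal types concrete, here is a minimal, self-contained sketch of one HMC transition (momentum refresh, leapfrog simulation of the Hamiltonian dynamics, then a Metropolis accept/reject step). It assumes a user-supplied log density `logp` and gradient `grad`; it illustrates the scheme described above, not Turing's internal implementation.

```julia
# One HMC transition for target log density `logp` with gradient `grad`,
# step size ϵ and L leapfrog steps. Illustrative only.
function hmc_step(logp, grad, θ, ϵ, L)
    r = randn(length(θ))                      # proposal 1: refresh momentum r ~ N(0, I)
    θ′, r′ = copy(θ), copy(r)
    r′ .+= (ϵ / 2) .* grad(θ′)                # half step for momentum
    for l = 1:L
        θ′ .+= ϵ .* r′                        # full step for position
        l < L && (r′ .+= ϵ .* grad(θ′))       # full step for momentum (except the last)
    end
    r′ .+= (ϵ / 2) .* grad(θ′)                # final half step for momentum
    # Metropolis accept/reject on H(θ, r) = -logp(θ) + rᵀr/2
    ΔH = (logp(θ′) - sum(abs2, r′) / 2) - (logp(θ) - sum(abs2, r) / 2)
    return log(rand()) < ΔH ? θ′ : θ
end

# Example: one transition targeting a standard normal in two dimensions.
θ = hmc_step(θ -> -sum(abs2, θ) / 2, θ -> -θ, zeros(2), 0.1, 10)
```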

2.5 Simulation-based inference

Currently, most inference engines for universal probabilistic programs (those involving stochastic branches) use forward-simulation based Monte Carlo methods such as importance sampling, sequential Monte Carlo (SMC), and particle MCMC. In order to simplify implementation, almost all existing implementations of these sampling algorithms use the (conditional) prior as the proposal (a minimal importance-sampling sketch follows below).
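As an illustration of the simplest member of this family, the following sketch implements likelihood-weighted importance sampling for the gdemo model from page 6, with the prior as the proposal. It is written directly against Distributions.jl rather than Turing's machinery, and the helper name `gdemo_is` is made up for this example.

```julia
using Distributions

# Importance sampling for gdemo: simulate (s, m) forward from the prior,
# weight each draw by the likelihood of the observations x. Illustrative only.
function gdemo_is(x, N)
    ws, ss, ms = zeros(N), zeros(N), zeros(N)
    for n = 1:N
        s = rand(InverseGamma(2, 3))                           # forward-simulate from the prior
        m = rand(Normal(0, sqrt(s)))
        ws[n] = sum(logpdf(Normal(m, sqrt(s)), xi) for xi in x) # log weight = log-likelihood
        ss[n], ms[n] = s, m
    end
    w = exp.(ws .- maximum(ws)); w ./= sum(w)                  # self-normalised weights
    return sum(w .* ms), sum(w .* ss)                          # posterior mean estimates of m and s
end

gdemo_is([1.5, 2.0], 10_000)
```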

In particular, the particle MCMC framework of Andrieu et al. [2010] uses sequential Monte Carlo (SMC; or particle filtering) as a complex, high-dimensional proposal for the Metropolis-Hastings algorithm. The particle Gibbs (PG) sampler is a conditional SMC algorithm resulting from clamping one particle to an a priori fixed trajectory. More specifically, it defines a transition kernel that has p(z_{1:T} | y_{1:T}, θ) as its stationary distribution. Wood et al. [2014] were the first to use particle MCMC to sample from complex target (posterior) distributions defined by universal probabilistic programs. The key to this application of particle MCMC is to note that a probabilistic program implicitly defines a …

• Particle Gibbs in Turing is a re-implementation of Wood et al. (2014), with a more efficient mechanism for copying/forking particles.

• Compositional inference is closely related to Mansinghka et al. (2014).

Page 15

Talk plan (workflow diagram as in Fig 1)

Page 16

Inference results

g = Gibbs(500, HMC(1, 0.2, 3, :s), PG(10, 1, :m))

# Run the sampler
c = sample(gdemo(1.5, 2), g);

julia> c = sample(gdemo(1.5, 2), g)
[ Info: Assume - `s` is a parameter
[ Info: Assume - `m` is a parameter
[ Info: Observe - `x` is an observation
[ Info: Observe - `y` is an observation
[Gibbs] Sampling... 100%  Time: 0:00:04
[ Info: [Gibbs] Finished with
[ Info:   Running time = 4.406516913999995;
Object of type "Turing.Chain{AbstractRange{Int64}}"

Iterations = 1:1000
Thinning interval = 1
Chains = 1
Samples per chain = 1000

[1.19424 0.0 … 0.1 0.0; 1.76147 5.0 … 0.1 -5.04962; … ; 0.16521 5.0 … 0.1 -6.34745; 2.17485 5.0 … 0.1 -5.78878]


Page 17

Inference results

julia> plot(c)

Page 18

Inference results

julia> describe(c)
Iterations = 1:1000
Thinning interval = 1
Chains = 1
Samples per chain = 1000

Empirical Posterior Estimates:
              Mean           SD                          Naive SE                      MCSE                          ESS
 m         1.159092440   0.80245398600286965695716   0.0253758231324994407152040   0.0180752920127798012706055   1000.00000
 lf_num    4.995000000   0.15811388300841913712169   0.0050000000000000044408921   0.0049999999999999827568486   1000.00000
 s         2.074526995   2.06522391559965390328557   0.0653081145154625064552789   0.0901289413221706137147038    525.05615
 elapsed   0.004406517   0.04087875130208082352645   0.0012926996201814923339452   0.0018081594527423736941396    511.11872
 epsilon   0.100000000   0.00000000000000013884732   0.0000000000000000043907378   0.0000000000000000046259293    900.90090
 lp       -5.751843644   1.16059043841524767159967   0.0367010921600556330735010   0.0472090647734787552392000    604.37596

Quantiles:
              2.5%            25.0%           50.0%           75.0%           97.5%
 m        -0.4156437838    0.6751308439    1.1591866244    1.6194797994    2.797401556
 lf_num    5.0000000000    5.0000000000    5.0000000000    5.0000000000    5.000000000
 s         0.5592247213    1.0143592798    1.5257070660    2.3944259281    6.980157267
 elapsed   0.0019876127    0.0022259065    0.0024100215    0.0026112668    0.007310805
 epsilon   0.1000000000    0.1000000000    0.1000000000    0.1000000000    0.100000000
 lp       -8.7877142810   -6.1846725212   -5.4480569586   -4.9603088831   -4.636854812

Page 19

Probabilistic programming in Julia

[Ecosystem diagram: what Turing provides, and the Julia packages it builds on]

• Runtime compiler & intuitive modelling syntax: powerful meta-programming, clean syntax
• AutoDiff in the HMC sampler: ForwardDiff.jl, Flux.Tracker
• Constrained variable transformations: Bijector.jl
• Comprehensive support for distributions: Distributions.jl
• SMC sampler: cooperative multitasking (Libtask.jl)
• MCMC inference results diagnostics: MCMCChain.jl
• Parallel inference algorithms: distributed computing
• Excellent numerical libraries
• Other libraries in Julia that need Bayesian inference or probabilistic modelling: DiffEqBayes.jl, Flux.jl, and your own code!

Page 20

Bayesian Deep Learning

Flux.jl

function nn_forward(x, theta::AbstractVector)
    W₁, b₁, W₂, b₂, Wₒ, bₒ = unpack(theta)
    nn = Chain(Dense(W₁, b₁, tanh),
               Dense(W₂, b₂, tanh),
               Dense(Wₒ, bₒ, σ))
    return nn(x)
end

Turing.jl

alpha = 0.09            # regularization term
sig = sqrt(1.0 / alpha) # standard deviation of the Gaussian prior

@model bayes_nn(xs, ts) = begin
    theta ~ MvNormal(zeros(20), sig .* ones(20))
    preds = nn_forward(xs, theta)
    for i = 1:length(ts)
        ts[i] ~ Bernoulli(preds[i])
    end
end
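The slide calls an `unpack` helper that is not shown. A possible sketch, assuming the 20-dimensional parameter vector corresponds to a 2 → 3 → 2 → 1 network (6+3 weights/biases for the first layer, 6+2 for the second, 2+1 for the output, which sums to 20); the layer sizes are an assumption, not given on the slide:

```julia
# Hypothetical `unpack`: reshape a flat 20-element parameter vector into the
# weights and biases expected by nn_forward above.
function unpack(theta::AbstractVector)
    W₁ = reshape(theta[1:6], 3, 2);   b₁ = theta[7:9]
    W₂ = reshape(theta[10:15], 2, 3); b₂ = theta[16:17]
    Wₒ = reshape(theta[18:19], 1, 2); bₒ = theta[20:20]
    return W₁, b₁, W₂, b₂, Wₒ, bₒ
end
```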

Page 21

Bayesian Deep Learning - Inference

N = 5000
ch = sample(bayes_nn(hcat(xs...), ts), HMC(N, 0.05, 4))

[HMC] Sampling... 98%  ETA: 0:00:01
  ϵ: 0.05
  α: 0.9999819814494996

[HMC] Sampling... 100%  Time: 0:01:01
[HMC] Finished with
  Running time = 60.61580677799989;
  Accept rate = 0.9206;
  #lf / sample = 3.9992;
  #evals / sample = 5.999;
  pre-cond. diag mat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0,....


Page 22

Takeaways

• A powerful probabilistic programming language:
  • intuitive modelling syntax
  • supports both black-box and compositional inference
  • pure Julia code, fully hackable
• Next milestones:
  • compositional modelling
  • tighter integration with deep learning packages
  • scaling up to bigger problems

Page 23

Bibliography

1. Ge, Hong, Kai Xu, and Zoubin Ghahramani. "Turing: Composable inference for probabilistic programming." In International Conference on Artificial Intelligence and Statistics, pp. 1682-1690. 2018.

2. Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. "Stan: A probabilistic programming language." Journal of statistical software 76, no. 1 (2017).

3. Wood, Frank, Jan-Willem van de Meent, and Vikash Mansinghka. "A new approach to probabilistic programming inference." In Artificial Intelligence and Statistics, pp. 1024-1032. 2014.

4. Winn, John, and Tom Minka. "Probabilistic programming with Infer.NET." Machine Learning Summer School lecture notes, available at http://research.microsoft.com/~minka/papers/mlss2009 (2009).

5. Lunn, David J., Andrew Thomas, Nicky Best, and David Spiegelhalter. "WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility." Statistics and Computing 10, no. 4 (2000): 325-337.

6. Tran, Dustin, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. "Deep Probabilistic Programming." (2016).

7. Mansinghka, Vikash, Daniel Selsam, and Yura Perov. "Venture: a higher-order probabilistic programming platform with programmable inference." arXiv preprint arXiv:1404.0099 (2014).

8. Hoffman, Matthew D., and Andrew Gelman. "The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research 15, no. 1 (2014): 1593-1623.

Page 24

Contributors

Please get in touch if you want to contribute!

Hong Ge¹, Kai Xu³, Emile Mathieu², Zoubin Ghahramani¹, Martin Trapp⁴, Will Tebbutt¹, Wessel Bruinsma¹, Yee Whye Teh², and many others…

1: Department of Engineering, University of Cambridge
2: Department of Statistics, University of Oxford
3: School of Informatics, University of Edinburgh
4: Austrian Research Institute for Artificial Intelligence


web: turing.ml