Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms

Christian A. Naesseth†‡ Francisco J. R. Ruiz‡§ Scott W. Linderman‡ David M. Blei‡

†Linköping University ‡Columbia University §University of Cambridge

Abstract

Variational inference using the reparameterization trick has enabled large-scale approximate Bayesian inference in complex probabilistic models, leveraging stochastic optimization to sidestep intractable expectations. The reparameterization trick is applicable when we can simulate a random variable by applying a differentiable deterministic function on an auxiliary random variable whose distribution is fixed. For many distributions of interest (such as the gamma or Dirichlet), simulation of random variables relies on acceptance-rejection sampling. The discontinuity introduced by the accept-reject step means that standard reparameterization tricks are not applicable. We propose a new method that lets us leverage reparameterization gradients even when variables are outputs of an acceptance-rejection sampling algorithm. Our approach enables reparameterization on a larger class of variational distributions. In several studies of real and synthetic data, we show that the variance of the estimator of the gradient is significantly lower than other state-of-the-art methods. This leads to faster convergence of stochastic gradient variational inference.

1 Introduction

Variational inference [Hinton and van Camp, 1993, Waterhouse et al., 1996, Jordan et al., 1999] underlies many recent advances in large-scale probabilistic modeling. It has enabled sophisticated modeling of complex domains such as images [Kingma and Welling, 2014] and text [Hoffman et al., 2013].

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

By definition, the success of variational approaches depends on our ability to (i) formulate a flexible parametric family of distributions; and (ii) optimize the parameters to find the member of this family that most closely approximates the true posterior. These two criteria are at odds: the more flexible the family, the more challenging the optimization problem. In this paper, we present a novel method that enables more efficient optimization for a large class of variational distributions, namely, for distributions that we can efficiently simulate by acceptance-rejection sampling, or rejection sampling for short.

For complex models, the variational parameters can be optimized by stochastic gradient ascent on the evidence lower bound (elbo), a lower bound on the marginal likelihood of the data. There are two primary means of estimating the gradient of the elbo: the score function estimator [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014] and the reparameterization trick [Kingma and Welling, 2014, Rezende et al., 2014, Price, 1958, Bonnet, 1964], both of which rely on Monte Carlo sampling. While the reparameterization trick often yields lower variance estimates and therefore leads to more efficient optimization, this approach has been limited in scope to a few variational families (typically Gaussians). Indeed, some lines of research have already tried to address this limitation [Knowles, 2015, Ruiz et al., 2016].

There are two requirements to apply the reparameterization trick. The first is that the random variable can be obtained through a transformation of a simple random variable, such as a uniform or standard normal; the second is that the transformation be differentiable. In this paper, we observe that all random variables we simulate on our computers are ultimately transformations of uniforms, often followed by accept-reject steps. So if the transformations are differentiable then we can use these existing simulation algorithms to expand the scope of the reparameterization trick.

Thus, we show how to use existing rejection samplers


to develop stochastic gradients of variational parameters. In short, each rejection sampler uses a highly-tuned transformation that is well-suited for its distribution. We can construct new reparameterization gradients by "removing the lid" from these black boxes, applying 65+ years of research on transformations [von Neumann, 1951, Devroye, 1986] to variational inference. We demonstrate that this broadens the scope of variational models amenable to efficient inference and provides lower-variance estimates of the gradient compared to state-of-the-art approaches.

We first review variational inference, with a focus on stochastic gradient methods. We then present our key contribution, rejection sampling variational inference (rsvi), showing how to use efficient rejection samplers to produce low-variance stochastic gradients of the variational objective. We study two concrete examples, analyzing rejection samplers for the gamma and Dirichlet to produce new reparameterization gradients for their corresponding variational factors. Finally, we analyze two datasets with a deep exponential family (def) [Ranganath et al., 2015], comparing rsvi to the state of the art. We found that rsvi achieves a significant reduction in variance and faster convergence of the elbo. Code for all experiments is provided at github.com/blei-lab/ars-reparameterization.

2 Variational Inference

Let p(x, z) be a probabilistic model, i.e., a joint probability distribution of data x and latent (unobserved) variables z. In Bayesian inference, we are interested in the posterior distribution p(z | x) = p(x, z)/p(x). For most models, the posterior distribution is analytically intractable and we have to use an approximation, such as Monte Carlo methods or variational inference. In this paper, we focus on variational inference.

In variational inference, we approximate the posterior with a variational family of distributions q(z; θ), parameterized by θ. Typically, we choose the variational parameters θ that minimize the Kullback-Leibler (kl) divergence between q(z; θ) and p(z | x). This minimization is equivalent to maximizing the elbo [Jordan et al., 1999], defined as

L(θ) = E_{q(z;θ)}[f(z)] + H[q(z;θ)],
f(z) := log p(x, z),
H[q(z;θ)] := E_{q(z;θ)}[−log q(z;θ)].   (1)

When the model and variational family satisfy conjugacy requirements, we can use coordinate ascent to find a local optimum of the elbo [Blei et al., 2016]. If the conjugacy requirements are not satisfied, a common approach is to build a Monte Carlo estimator of the gradient of the elbo [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014, Salimans and Knowles, 2013, Kingma and Welling, 2014]. This results in a stochastic optimization procedure, where different Monte Carlo estimators of the gradient amount to different algorithms. We review below two common estimators: the score function estimator and the reparameterization trick.¹

Score function estimator. The score function estimator, also known as the log-derivative trick or reinforce [Williams, 1992, Glynn, 1990], is a general way to estimate the gradient of the elbo [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014]. The score function estimator expresses the gradient as an expectation with respect to q(z; θ):

∇_θ L(θ) = E_{q(z;θ)}[f(z) ∇_θ log q(z;θ)] + ∇_θ H[q(z;θ)].

We then form Monte Carlo estimates by approximating the expectation with independent samples from the variational distribution. Though it is very general, the score function estimator typically suffers from high variance. In practice we also need to apply variance reduction techniques such as Rao-Blackwellization [Casella and Robert, 1996] and control variates [Robert and Casella, 2004].

Reparameterization trick. The reparameterization trick [Salimans and Knowles, 2013, Kingma and Welling, 2014, Price, 1958, Bonnet, 1964] results in a lower variance estimator compared to the score function, but it is not as generally applicable. It requires that: (i) the latent variables z are continuous; and (ii) we can simulate from q(z; θ) as follows,

z = h(ε, θ), with ε ∼ s(ε). (2)

Here, s(ε) is a distribution that does not depend on the variational parameters; it is typically a standard normal or a standard uniform. Further, h(ε, θ) must be differentiable with respect to θ. In statistics, this is known as a non-centered parameterization and has been shown to be helpful in, e.g., Markov chain Monte Carlo methods [Papaspiliopoulos et al., 2003].

Using (2), we can move the derivative inside the expectation and rewrite the gradient of the elbo as

∇_θ L(θ) = E_{s(ε)}[∇_z f(h(ε, θ)) ∇_θ h(ε, θ)] + ∇_θ H[q(z;θ)].

Empirically, the reparameterization trick has been shown to be beneficial over direct Monte Carlo estimation of the gradient using the score function estimator [Salimans and Knowles, 2013, Kingma and Welling, 2014, Titsias and Lazaro-Gredilla, 2014, Fan et al., 2015]. Unfortunately, many distributions commonly used in variational inference, such as the gamma or Dirichlet, are not amenable to standard reparameterization because samples are generated using a rejection sampler [von Neumann, 1951, Robert and Casella, 2004], which introduces discontinuities into the mapping. We next show that taking a novel view of the acceptance-rejection sampler lets us perform exact reparameterization.

¹In this paper, we assume for simplicity that the gradient of the entropy ∇_θ H[q(z;θ)] is available analytically. The method that we propose in Section 3 can easily be extended to handle non-analytical entropy terms. Indeed, the resulting estimator of the gradient may have lower variance when the analytic gradient of the entropy is replaced by its Monte Carlo estimate. Here we do not explore that.

3 Reparameterizing the Acceptance-Rejection Sampler

The basic idea behind reparameterization is to rewrite simulation from a complex distribution as a deterministic mapping of its parameters and a set of simpler random variables. We can view the rejection sampler as a complicated deterministic mapping of a (random) number of simple random variables such as uniforms and normals. This makes it tempting to take the standard reparameterization approach when we consider random variables generated by rejection samplers. However, this mapping is in general not continuous, and thus moving the derivative inside the expectation and using direct automatic differentiation would not necessarily give the correct answer.

Our insight is that we can overcome this problem by instead considering only the marginal over the accepted sample, analytically integrating out the accept-reject variable. Thus, the mapping comes from the proposal step. This is continuous under mild assumptions, enabling us to greatly extend the class of variational families amenable to reparameterization.

We first review rejection sampling and present the reparameterized rejection sampler. Next we show how to use it to calculate low-variance gradients of the elbo. Finally, we present the complete stochastic optimization for variational inference, rsvi.

3.1 Reparameterized Rejection Sampling

Acceptance-rejection sampling is a powerful way of simulating random variables from complex distributions whose inverse cumulative distribution functions are not available or are too expensive to evaluate [Devroye, 1986, Robert and Casella, 2004]. We consider an alternative view of rejection sampling in which we explicitly make use of the reparameterization trick. This view of the rejection sampler enables our variational inference algorithm in Section 3.2.

To generate samples from a distribution q(z; θ) using rejection sampling, we first sample from a proposal distribution r(z; θ) such that q(z; θ) ≤ M_θ r(z; θ) for some M_θ < ∞. In our version of the rejection sampler, we assume that the proposal distribution is reparameterizable, i.e., that generating z ∼ r(z; θ) is equivalent to generating ε ∼ s(ε) (where s(ε) does not depend on θ) and then setting z = h(ε, θ) for a differentiable function h(ε, θ). We then accept the sample with probability min{1, q(h(ε, θ); θ) / (M_θ r(h(ε, θ); θ))}; otherwise, we reject the sample and repeat the process. We illustrate this in Figure 1 and provide a summary of the method in Algorithm 1, where we consider the output to be the (accepted) variable ε, instead of z.

Algorithm 1 Reparameterized Rejection Sampling
Input: target q(z; θ), proposal r(z; θ), and constant M_θ, with q(z; θ) ≤ M_θ r(z; θ)
Output: ε such that h(ε, θ) ∼ q(z; θ)
1: i ← 0
2: repeat
3:   i ← i + 1
4:   Propose ε_i ∼ s(ε)
5:   Simulate u_i ∼ U[0, 1]
6: until u_i < q(h(ε_i, θ); θ) / (M_θ r(h(ε_i, θ); θ))
7: return ε_i
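To make the sampler's structure concrete, here is a minimal Python sketch of Algorithm 1 (not the authors' released code). The arguments h, sample_s, log_q, log_r, and log_M are hypothetical callables standing in for the transformation, the base distribution s(ε), the target and proposal log-densities, and log M_θ.

```python
import numpy as np

def reparameterized_rejection_sampler(theta, h, sample_s, log_q, log_r, log_M,
                                      rng=np.random.default_rng()):
    """Return an accepted epsilon such that h(epsilon, theta) ~ q(z; theta)."""
    while True:
        eps = sample_s(rng)        # step 4: propose eps ~ s(eps)
        u = rng.uniform()          # step 5: u ~ U[0, 1]
        z = h(eps, theta)
        # step 6: accept if u < q(z; theta) / (M_theta * r(z; theta)), checked in log space
        if np.log(u) < log_q(z, theta) - log_M(theta) - log_r(z, theta):
            return eps             # step 7: return the accepted epsilon, not z
```

Returning ε rather than z is the point of this view: downstream gradients can then flow through the differentiable map h(ε, θ) alone.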

The ability to simulate from r(z; θ) by a reparameterization through a differentiable h(ε, θ) is not needed for the rejection sampler to be valid. However, this is indeed the case for the rejection sampler of many common distributions.

3.2 The Reparameterized Rejection Sampler in Variational Inference

We now use reparameterized rejection sampling to develop a novel Monte Carlo estimator of the gradient of the elbo. We first rewrite the elbo in (1) as an expectation in terms of the transformed variable ε,

L(θ) = E_{q(z;θ)}[f(z)] + H[q(z;θ)] = E_{π(ε;θ)}[f(h(ε, θ))] + H[q(z;θ)].   (3)

In this expectation, π(ε; θ) is the distribution of the accepted sample ε in Algorithm 1. We construct it by marginalizing over the auxiliary uniform variable u,

π(ε; θ) = ∫ π(ε, u; θ) du
        = ∫ M_θ s(ε) 1[0 < u < q(h(ε, θ); θ) / (M_θ r(h(ε, θ); θ))] du
        = s(ε) q(h(ε, θ); θ) / r(h(ε, θ); θ),   (4)

where 1[x ∈ A] is the indicator function, and recall that M_θ is a constant used in the rejection sampler.


Figure 1: Example of a reparameterized rejection sampler for q(z; θ) = Gamma(θ, 1), shown here with θ = 2. We use the rejection sampling algorithm of Marsaglia and Tsang [2000], which is based on a nonlinear transformation h(ε, θ) of a standard normal ε ∼ N(0, 1) (cf. Eq. 10) and has an acceptance probability of 0.98 for θ = 2. The panels show, from left to right, the densities M_θ s(ε) and π(ε; θ), the transformation z = h(ε, θ), and the densities M_θ r(z; θ) and q(z; θ). The marginal density of the accepted value of ε (integrating out the acceptance variables u_{1:i}) is given by π(ε; θ). We compute unbiased estimates of the gradient of the elbo (6) via Monte Carlo, using Algorithm 1 to rejection sample ε ∼ π(ε; θ). By reparameterizing in terms of ε, we obtain a low-variance estimator of the gradient for challenging variational distributions.

This can be seen from the algorithmic definition of the rejection sampler, where we propose values ε ∼ s(ε) and u ∼ U[0, 1] until acceptance, i.e., until u < q(h(ε, θ); θ) / (M_θ r(h(ε, θ); θ)). Eq. 3 follows intuitively, but we formalize it in Proposition 1.

Proposition 1. Let f be any measurable function, and ε ∼ π(ε; θ), defined by (4) (and implicitly by Algorithm 1). Then

E_{π(ε;θ)}[f(h(ε, θ))] = ∫ f(z) q(z; θ) dz.

Proof. Using the definition of π(ε; θ),

E_{π(ε;θ)}[f(h(ε, θ))] = ∫ f(h(ε, θ)) s(ε) (q(h(ε, θ); θ) / r(h(ε, θ); θ)) dε
  = ∫ f(z) r(z; θ) (q(z; θ) / r(z; θ)) dz = ∫ f(z) q(z; θ) dz,

where the second-to-last equality follows because z = h(ε, θ), with ε ∼ s(ε), is a reparameterization of r(z; θ).

We can now compute the gradient of E_{q(z;θ)}[f(z)] based on Eq. 3,

∇_θ E_{q(z;θ)}[f(z)] = ∇_θ E_{π(ε;θ)}[f(h(ε, θ))]
  = E_{π(ε;θ)}[∇_θ f(h(ε, θ))]                                            (=: g_rep)
  + E_{π(ε;θ)}[f(h(ε, θ)) ∇_θ log (q(h(ε, θ); θ) / r(h(ε, θ); θ))]        (=: g_cor),   (5)

where we have used the log-derivative trick and rewritten the integrals as expectations with respect to π(ε; θ) (see the supplement for all details). We define g_rep as the reparameterization term, which takes advantage of gradients with respect to the model and its latent variables; we define g_cor as a correction term that accounts for not using r(z; θ) ≡ q(z; θ).

Using (5), the gradient of the elbo in (1) can be written as

∇_θ L(θ) = g_rep + g_cor + ∇_θ H[q(z;θ)],   (6)

and thus we can build an unbiased one-sample Monte Carlo estimator g ≈ ∇_θ L(θ) as

g := g_rep + g_cor + ∇_θ H[q(z;θ)],
g_rep = ∇_z f(z)|_{z=h(ε,θ)} ∇_θ h(ε, θ),
g_cor = f(h(ε, θ)) ∇_θ log [q(h(ε, θ); θ) / r(h(ε, θ); θ)],   (7)

where ε is a sample generated using Algorithm 1. Of course, one could generate more samples of ε and average, but we have found a single sample to suffice in practice.
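As an illustration of Eq. 7, the following is a minimal one-sample estimator sketch using the autograd package (the same package used for the experiments in Section 6); f, h, log_q, log_r, and grad_entropy are assumed to be user-supplied, autograd-differentiable callables, and this is not the authors' implementation.

```python
from autograd import grad

def rsvi_gradient(eps, theta, f, h, log_q, log_r, grad_entropy):
    # g_rep: differentiate the model term f through the transformation h(eps, theta)
    g_rep = grad(lambda th: f(h(eps, th)))(theta)
    # g_cor: correction for using the proposal r(z; theta) instead of q(z; theta) itself
    log_ratio = lambda th: log_q(h(eps, th), th) - log_r(h(eps, th), th)
    g_cor = f(h(eps, theta)) * grad(log_ratio)(theta)
    # add the (here assumed analytic) gradient of the entropy, as in Eq. 7
    return g_rep + g_cor + grad_entropy(theta)
```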

Note that if h(ε, θ) is invertible in ε, then we can simplify the evaluation of the gradient of the log-ratio in g_cor,

∇_θ log [q(h(ε, θ); θ) / r(h(ε, θ); θ)] = ∇_θ log q(h(ε, θ); θ) + ∇_θ log |dh/dε (ε, θ)|.   (8)

See the supplementary material for details.

Alternatively, we could rewrite the gradient as an expectation with respect to s(ε) (this is an intermediate step in the derivation shown in the supplement),

∇_θ E_{q(z;θ)}[f(z)] = E_{s(ε)}[ (q(h(ε, θ); θ) / r(h(ε, θ); θ)) ∇_θ f(h(ε, θ)) ]
  + E_{s(ε)}[ (q(h(ε, θ); θ) / r(h(ε, θ); θ)) f(h(ε, θ)) ∇_θ log (q(h(ε, θ); θ) / r(h(ε, θ); θ)) ],

and build an importance sampling-based Monte Carlo estimator, in which the importance weights would be q(h(ε, θ); θ) / r(h(ε, θ); θ). However, we would expect this approach to be beneficial only for low-dimensional problems, since for high-dimensional z the variance of the importance weights would be too high.

Algorithm 2 Rejection Sampling Variational Inference
Input: data x, model p(x, z), variational family q(z; θ)
Output: variational parameters θ*
1: repeat
2:   Run Algorithm 1 for θ_n to obtain a sample ε
3:   Estimate the gradient g_n at θ = θ_n (Eq. 7)
4:   Calculate the step size ρ_n (Eq. 9)
5:   Update θ_{n+1} = θ_n + ρ_n g_n
6: until convergence

3.3 Full Algorithm

We now describe the full variational algorithm based on reparameterizing the rejection sampler. In Section 5 we give concrete examples of how to reparameterize common variational families.

We make use of Eq. 6 to obtain a Monte Carlo estimator of the gradient of the elbo. We use this estimate to take stochastic gradient steps. We use the step-size sequence ρ_n proposed by Kucukelbir et al. [2016] (also used by Ruiz et al. [2016]), which combines rmsprop [Tieleman and Hinton, 2012] and Adagrad [Duchi et al., 2011]. It is

ρ_n = η · n^{−1/2+δ} · (1 + √s_n)^{−1},
s_n = t (g_n)^2 + (1 − t) s_{n−1},   (9)

where n is the iteration number. We set δ = 10^{−16} and t = 0.1, and we try different values for η. (When θ is a vector, the operations above are element-wise.)
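For concreteness, a small sketch of this step-size sequence; the defaults mirror the values reported above (δ = 10⁻¹⁶, t = 0.1), while η is left as a free choice.

```python
import numpy as np

def step_size(n, g_n, s_prev, eta=1.0, delta=1e-16, t=0.1):
    # Eq. 9: running average of squared gradients (element-wise when theta is a vector)
    s_n = t * g_n**2 + (1.0 - t) * s_prev
    rho_n = eta * n**(-0.5 + delta) / (1.0 + np.sqrt(s_n))
    return rho_n, s_n
```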

We summarize the full method in Algorithm 2. We refer to our method as rsvi.
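Putting the pieces together, a compact sketch of Algorithm 2 under the same assumptions as the earlier sketches: sample_eps implements Algorithm 1, grad_estimator implements Eq. 7, and step_size is the helper defined above; a fixed iteration budget stands in for a proper convergence check.

```python
import numpy as np

def rsvi(theta0, sample_eps, grad_estimator, n_iters=1000, eta=1.0):
    theta = np.asarray(theta0, dtype=float)
    s = np.zeros_like(theta)
    for n in range(1, n_iters + 1):
        eps = sample_eps(theta)                # Algorithm 1: reparameterized rejection sample
        g = grad_estimator(eps, theta)         # Eq. 7: one-sample gradient estimate
        rho, s = step_size(n, g, s, eta=eta)   # Eq. 9: adaptive step size
        theta = theta + rho * g                # stochastic gradient ascent on the ELBO
    return theta
```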

4 Related Work

The reparameterization trick has also been used in automatic differentiation variational inference (advi) [Kucukelbir et al., 2015, 2016]. advi applies a transformation to the random variables such that their support is on the reals and then places a Gaussian variational posterior approximation over the transformed variable ε. In this way, advi allows for standard reparameterization, but it cannot fit gamma or Dirichlet variational posteriors, for example. Thus, advi struggles to approximate probability densities with singularities, as noted by Ruiz et al. [2016]. In contrast, our approach allows us to apply the reparameterization trick on a wider class of variational distributions, which may be more appropriate when the exact posterior exhibits sparsity.

In the literature, we can find other lines of research that focus on extending the reparameterization gradient to other distributions. For the gamma distribution, Knowles [2015] proposed a method based on approximations of the inverse cumulative density function; however, this approach is limited only to the gamma distribution and it involves expensive computations. For general expectations, Schulman et al. [2015] expressed the gradient as a sum of a reparameterization term and a correction term to automatically estimate the gradient in the context of stochastic computation graphs. However, it is not possible to directly apply it to variational inference with acceptance-rejection sampling. This is due to discontinuities in the accept-reject step and the fact that a rejection sampler produces a random number of random variables. Recently, another line of work has focused on applying reparameterization to discrete latent variable models [Maddison et al., 2017, Jang et al., 2017] through a continuous relaxation of the discrete space.

The generalized reparameterization (g-rep) method [Ruiz et al., 2016] exploits the decomposition of the gradient as g_rep + g_cor by applying a transformation based on standardization of the sufficient statistics of z. Our approach differs from g-rep: instead of searching for a transformation of z that makes the distribution of ε weakly dependent on the variational parameters (namely, standardization), we do the opposite by choosing a transformation of a simple random variable ε such that the distribution of z = h(ε, θ) is almost equal to q(z; θ). For that, we reuse the transformations typically used in rejection sampling. Rather than having to derive a new transformation for each variational distribution, we leverage decades of research on transformations in the rejection sampling literature [Devroye, 1986]. In rejection sampling, these transformations (and the distributions of ε) are chosen so that they have high acceptance probability, which means we should expect to obtain g_cor ≈ 0 with rsvi. In Sections 5 and 6 we compare rsvi with g-rep and show that it exhibits significantly lower variance, thus leading to faster convergence of the inference algorithm.

Finally, another line of research in non-conjugate variational inference aims at developing more expressive variational families [Salimans et al., 2015, Tran et al., 2016, Maaløe et al., 2016, Ranganath et al., 2016]. rsvi can extend the reparameterization trick to these methods as well, whenever rejection sampling is used to generate the random variables.

5 Examples of Acceptance-Rejection Reparameterization

As two examples, we study rejection sampling and reparameterization of two well-known distributions: the gamma and Dirichlet. These have been widely used as variational families for approximate Bayesian inference. We emphasize that rsvi is not limited to these two cases; it applies to any variational family q(z; θ) for which a reparameterizable rejection sampler exists. We provide other examples in the supplement.

5.1 Gamma Distribution

One of the most widely used rejection samplers is for the gamma distribution. Indeed, the gamma distribution is also used in practice to generate, e.g., beta, Dirichlet, and Student's t-distributed random variables. The gamma distribution, Gamma(α, β), is defined by its shape α and rate β.

For Gamma(α, 1) with α ≥ 1, Marsaglia and Tsang [2000] developed an efficient rejection sampler. It uses a truncated version of the following reparameterization:

z = h_Gamma(ε, α) := (α − 1/3)(1 + ε/√(9α − 3))^3,   ε ∼ s(ε) := N(0, 1).   (10)

When β ≠ 1, we divide z by the rate β and obtain a sample distributed as Gamma(α, β). The acceptance probability is very high: it exceeds 0.95 and 0.98 for α = 1 and α = 2, respectively. In fact, as α → ∞ we have that π(ε; θ) → s(ε), which means that the acceptance probability approaches 1. Figure 1 illustrates the involved functions and distributions for shape α = 2.

For α < 1, we observe that z = u^{1/α} z̃ is distributed as Gamma(α, β) for z̃ ∼ Gamma(α + 1, β) and u ∼ U[0, 1] [Stuart, 1962, Devroye, 1986], and apply the rejection sampler above for z̃.
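As a concrete instance, here is a sketch of the Marsaglia and Tsang [2000] sampler in the notation of Eq. 10: h_gamma is the smooth proposal transformation, and the standard Marsaglia-Tsang acceptance test plays the role of the u < q/(M_θ r) comparison in Algorithm 1. It assumes α ≥ 1; the rate and the α < 1 case are handled with the tricks described above.

```python
import numpy as np

def h_gamma(eps, alpha):
    # Eq. 10: z = (alpha - 1/3) * (1 + eps / sqrt(9*alpha - 3))**3
    return (alpha - 1.0 / 3.0) * (1.0 + eps / np.sqrt(9.0 * alpha - 3.0)) ** 3

def sample_eps_gamma(alpha, rng=np.random.default_rng()):
    """Return the accepted epsilon of Algorithm 1 for Gamma(alpha, 1), alpha >= 1."""
    d = alpha - 1.0 / 3.0
    while True:
        eps = rng.standard_normal()   # proposal: eps ~ N(0, 1)
        z = h_gamma(eps, alpha)
        if z <= 0.0:                  # outside the gamma support: reject immediately
            continue
        u = rng.uniform()
        # standard Marsaglia-Tsang acceptance test (the u < q/(M r) step)
        if np.log(u) < 0.5 * eps**2 + d - z + d * np.log(z / d):
            return eps

# For rate beta != 1, use h_gamma(eps, alpha) / beta; for alpha < 1, sample at shape
# alpha + 1 and multiply the resulting z by u**(1/alpha) with u ~ U[0, 1], as in the text.
```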

We now study the quality of the transformation in (10) for different values of the shape parameter α. Since π(ε; θ) → s(ε) as α → ∞, we should expect the correction term g_cor to decrease with α. We show that in Figure 2, where we plot the log-ratio (8) from the correction term as a function of ε for four values of α. We additionally show in Figure 3 that the distribution π(ε; θ) converges to s(ε) (a standard normal) as α increases. For large α, π(ε; θ) ≈ s(ε) and the acceptance probability of the rejection sampler approaches 1, which makes the correction term negligible. In Figure 3, we also show that π(ε; θ) converges faster to a standard normal than the standardization procedure used in g-rep. We exploit this property, namely that performance improves with α, to artificially increase the shape for any gamma distribution. We now explain this trick, which we call shape augmentation.

Figure 2: The correction term of rsvi, and as a result the gradient variance, decreases with increasing shape α. We plot the absolute value of the gradient of the log-ratio between the target (gamma) and proposal distributions as a function of ε, for α = 1, 2, 3, 4.

Shape augmentation. Here we show how to exploit the fact that the rejection sampler improves for increasing shape α. We make repeated use of the trick above, using uniform variables, to control the value of α that goes into the rejection sampler. That is, to compute the elbo for a Gamma(α, 1) distribution, we can first express the random variable as z = z̃ ∏_{i=1}^{B} u_i^{1/(α+i−1)} (for some positive integer B), with z̃ ∼ Gamma(α + B, 1) and u_i i.i.d. ∼ U[0, 1]. This can be proved by induction, since z̃ u_B^{1/(α+B−1)} ∼ Gamma(α + B − 1, 1), z̃ u_B^{1/(α+B−1)} u_{B−1}^{1/(α+B−2)} ∼ Gamma(α + B − 2, 1), and so on. Hence, we can apply the rejection sampling framework for z̃ ∼ Gamma(α + B, 1) instead of the original z. We study the effect of shape augmentation on the variance in Section 5.2.
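A sketch of the shape augmentation trick, reusing the hypothetical h_gamma and sample_eps_gamma helpers from the sketch above: the rejection sampler (and hence the reparameterization) runs at the larger shape α + B, and B uniform powers recover a Gamma(α, 1) draw.

```python
import numpy as np

def sample_gamma_shape_augmented(alpha, B, rng=np.random.default_rng()):
    eps = sample_eps_gamma(alpha + B, rng)      # accepted epsilon at the augmented shape
    z_tilde = h_gamma(eps, alpha + B)           # z_tilde ~ Gamma(alpha + B, 1)
    u = rng.uniform(size=B)
    exponents = 1.0 / (alpha + np.arange(B))    # 1/alpha, ..., 1/(alpha + B - 1)
    return z_tilde * np.prod(u ** exponents)    # distributed as Gamma(alpha, 1)
```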

5.2 Dirichlet Distribution

The Dirichlet(α_{1:K}) distribution, with concentration parameters α_{1:K}, is a K-dimensional multivariate distribution with K − 1 degrees of freedom.


Figure 3: In the distribution on the transformed space ε for a gamma distribution, we can see that the rejection sampling-inspired transformation converges faster to a standard normal. Therefore it is less dependent on the parameter α, which implies a smaller correction term. We compare the transformation of rsvi (this paper) with the standardization procedure suggested in g-rep [Ruiz et al., 2016], for shape parameters α = {1, 2, 10}.

Figure 4: rsvi (this paper) achieves lower variance compared to g-rep [Ruiz et al., 2016]. The estimated variance is for the first component of the Dirichlet approximation to a multinomial likelihood with a uniform Dirichlet prior, plotted as a function of the concentration α for g-rep and for rsvi with B = 0, 1, 3, 10. The optimal concentration is α = 2, and B denotes the number of shape augmentation steps.

To simulate random variables, we use the fact that if z_k ∼ Gamma(α_k, 1) i.i.d., then z_{1:K} = (Σ_ℓ z_ℓ)^{−1} (z_1, ..., z_K)^⊤ ∼ Dirichlet(α_{1:K}). Thus, we make a change of variables to reduce the problem to that of simulating independent gamma-distributed random variables,

E_{q(z_{1:K}; α_{1:K})}[f(z_{1:K})] = ∫ f(z_{1:K} / Σ_{ℓ=1}^K z_ℓ) ∏_{k=1}^K Gamma(z_k; α_k, 1) dz_{1:K}.

We apply the transformation in Section 5.1 for the gamma-distributed variables, z_k = h_Gamma(ε_k, α_k), where the variables ε_k are generated by independent gamma rejection samplers. To showcase this, we study a simple conjugate model where the exact gradient and posterior are available: a multinomial likelihood with Dirichlet prior and Dirichlet variational distribution. In Figure 4 we show the resulting variance of the first component of the gradient, based on simulated data from a Dirichlet distribution with K = 100 components, uniform prior, and N = 100 trials. We compare the variance of rsvi (for various shape augmentation settings) with the g-rep approach [Ruiz et al., 2016]. rsvi performs better even without the augmentation trick, and significantly better with it.
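A sketch of this change of variables, again reusing the hypothetical gamma helpers from Section 5.1 (and assuming all α_k ≥ 1; smaller concentrations would need the α < 1 trick or shape augmentation):

```python
import numpy as np

def sample_dirichlet_reparam(alphas, rng=np.random.default_rng()):
    eps = np.array([sample_eps_gamma(a, rng) for a in alphas])   # independent epsilons
    z = np.array([h_gamma(e, a) for e, a in zip(eps, alphas)])   # Gamma(alpha_k, 1) draws
    return z / z.sum()                                           # ~ Dirichlet(alpha_1:K)
```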

6 Experiments

In Section 5 we compared rejection sampling variational inference (rsvi) with generalized reparameterization (g-rep) and found a substantial variance reduction on synthetic examples. Here we evaluate rsvi on a more challenging model, the sparse gamma deep exponential family (def) [Ranganath et al., 2015]. On two real datasets, we compare rsvi with state-of-the-art methods: automatic differentiation variational inference (advi) [Kucukelbir et al., 2015, 2016], black-box variational inference (bbvi) [Ranganath et al., 2014], and g-rep [Ruiz et al., 2016].

Data. The datasets we consider are the Olivetti faces² and Neural Information Processing Systems (nips) 2011 conference papers. The Olivetti faces dataset consists of 64×64 gray-scale images of human faces in 8 bits, i.e., the data is discrete and in the set {0, ..., 255}. In the nips dataset we have documents in a bag-of-words format with an effective vocabulary of 5715 words.

Model. The sparse gamma def [Ranganath et al., 2015] is a multi-layered probabilistic model that mimics the architecture of deep neural networks. It models the data using a set of local latent variables z^ℓ_{n,k}, where n indexes observations, k components, and ℓ layers. These local variables are connected between layers through global weights w^ℓ_{k,k'}. The observations are x_{n,d}, where d denotes the dimension.

²http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html


                    rsvi B = 1   rsvi B = 4   g-rep
At initialization
  Min               6.0e−4       1.2e−3       2.7e−3
  Median            9.0e7        2.9e7        1.6e12
  Max               1.2e17       3.4e14       1.5e17
At iteration 2600
  Min               1.8e−3       1.5e−3       2.6e−3
  Median            1.2e4        4.5e3        1.5e7
  Max               1.4e12       1.6e11       3.5e12

Table 1: The rsvi gradient (this paper) exhibits lower variance than g-rep [Ruiz et al., 2016]. We show the estimated variance, based on 10 samples, of g-rep and rsvi (with B = 1 and B = 4 shape augmentation steps), for parameters at the initialization point (top) and at iteration 2600 of rsvi (bottom), estimated for the nips data.

Figure 5: rsvi (this paper) presents a significantly faster initial improvement of the evidence lower bound (elbo) as a function of wall-clock time. The model is a sparse gamma def applied to the Olivetti faces dataset, and we compare with advi [Kucukelbir et al., 2016], bbvi [Ranganath et al., 2014], and g-rep [Ruiz et al., 2016].

The joint probabilistic model is defined as

z^ℓ_{n,k} ∼ Gamma(α_z, α_z / Σ_{k'} w^ℓ_{k,k'} z^{ℓ+1}_{n,k'}),
x_{n,d} ∼ Poisson(Σ_k w^0_{k,d} z^1_{n,k}).   (11)

We set α_z = 0.1 in the experiments. All priors on the weights are set to Gamma(0.1, 0.3), and the priors on the top-layer local variables are set to Gamma(0.1, 0.1). We use 3 layers, with 100, 40, and 15 components in each. This is the same model that was studied by Ruiz et al. [2016], where g-rep was shown to outperform both bbvi (with control variates and Rao-Blackwellization) and advi. In the experiments we follow their approach and parameterize the variational approximating gamma distribution using the shape and mean. To avoid constrained optimization, we use the transform θ = log(1 + exp(ϑ)) for non-negative variational parameters θ, and optimize ϑ in the unconstrained space.
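As a small illustration of this parameterization (a hypothetical helper, not the authors' code), the softplus map and its inverse can be written as follows; logaddexp is used for numerical stability.

```python
import numpy as np

def to_constrained(vartheta):
    # theta = log(1 + exp(vartheta)) > 0, computed stably as logaddexp(0, vartheta)
    return np.logaddexp(0.0, vartheta)

def to_unconstrained(theta):
    # inverse: vartheta = log(exp(theta) - 1) = theta + log(1 - exp(-theta)), for theta > 0
    return theta + np.log1p(-np.exp(-theta))
```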

Results. For the Olivetti faces we explore η ∈ {0.75, 1, 2, 5} and show the resulting elbo of the best one in Figure 5. We can see that rsvi has a significantly faster initial improvement than any of the other methods.³ The wall-clock time for rsvi is based on a Python implementation (average 1.5 s per iteration) using the automatic differentiation package autograd [Maclaurin et al., 2015]. We found that rsvi is approximately two times faster than g-rep for comparable implementations. One reason for this is that the transformations based on rejection sampling are cheaper to evaluate. Indeed, the research literature on rejection sampling is heavily focused on finding cheap and efficient transformations.

For the nips dataset, we now compare the variance of the gradients between the two estimators, rsvi and g-rep, for different shape augmentation steps B. In Table 1 we show the minimum, median, and maximum values of the variance across all dimensions. We can see that rsvi again clearly outperforms g-rep in terms of variance. Moreover, increasing the number of augmentation steps B provides even further improvements.

7 Conclusions

We introduced rejection sampling variational inference (rsvi), a method for deriving reparameterization gradients when simulation from the variational distribution is done using an acceptance-rejection sampler. In practice, rsvi leads to lower-variance gradients than other state-of-the-art methods. Further, it enables reparameterization gradients for a large class of variational distributions, taking advantage of the efficient transformations developed in the rejection sampling literature.

This work opens the door to other strategies that "remove the lid" from existing black-box samplers in the service of variational inference. As future work, we can consider more complicated simulation algorithms with accept-reject-like steps, such as adaptive rejection sampling, importance sampling, sequential Monte Carlo, or Markov chain Monte Carlo.

³The results of g-rep, advi, and bbvi were reproduced with permission from Ruiz et al. [2016].


Acknowledgements

Christian A. Naesseth is supported by CADICS, a Linnaeus Center, funded by the Swedish Research Council (VR). Francisco J. R. Ruiz is supported by the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760). Scott W. Linderman is supported by the Simons Foundation SCGB-418011. This work is supported by NSF IIS-1247664, ONR N00014-11-1-0651, DARPA PPAML FA8750-14-2-0009, DARPA SIMPLEX N66001-15-C-4032, Adobe, and the Alfred P. Sloan Foundation. The authors would like to thank Alp Kucukelbir and Dustin Tran for helpful comments and discussion.

References

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. arXiv:1601.00670, 2016.

G. Bonnet. Transformations des signaux aleatoires a travers les systemes non lineaires sans memoire. Annals of Telecommunications, 19(9):203–220, 1964.

G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.

L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, 1986.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.

K. Fan, Z. Wang, J. Beck, J. Kwok, and K. A. Heller. Fast second order stochastic backpropagation for variational inference. In Advances in Neural Information Processing Systems, 2015.

P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, October 1990.

G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13, New York, NY, USA, 1993. ACM.

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, May 2013.

E. Jang, S. Gu, and B. Poole. Categorical reparameterization using Gumbel-softmax. In International Conference on Learning Representations, 2017. (Accepted for publication.)

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, November 1999.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

D. A. Knowles. Stochastic gradient variational Bayes for Gamma approximating distributions. arXiv:1509.01631v1, 2015.

A. Kucukelbir, R. Ranganath, A. Gelman, and D. M. Blei. Automatic variational inference in Stan. In Advances in Neural Information Processing Systems, 2015.

A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. arXiv:1603.00788, 2016.

L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. In International Conference on Machine Learning, 2016.

D. Maclaurin, D. Duvenaud, M. Johnson, and R. P. Adams. Autograd: Reverse-mode differentiation of native Python, 2015. URL http://github.com/HIPS/autograd.

C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017. (Accepted for publication.)

G. Marsaglia and W. W. Tsang. A simple method for generating gamma variables. ACM Transactions on Mathematical Software, 26(3):363–372, September 2000.

A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.

J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.

O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. Non-centered parameterisations for hierarchical models and data augmentation. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, page 307. Oxford University Press, USA, 2003.

R. Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.

R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.

R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. In Artificial Intelligence and Statistics, 2015.

R. Ranganath, D. Tran, and D. M. Blei. Hierarchical variational models. In International Conference on Machine Learning, 2016.

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2004.

F. J. R. Ruiz, M. K. Titsias, and D. M. Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, 2016.

T. Salimans and D. A. Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.

J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, 2015.

A. Stuart. Gamma-distributed products of independent random variables. Biometrika, 49:64–65, 1962.

T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 4, 2012.

M. K. Titsias and M. Lazaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.

D. Tran, R. Ranganath, and D. M. Blei. The variational Gaussian process. In International Conference on Learning Representations, 2016.

J. von Neumann. Various techniques used in connection with random digits. Journal of Research of the National Bureau of Standards, 12:36–38, 1951.

S. Waterhouse, D. Mackay, and T. Robinson. Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems, pages 351–357. MIT Press, 1996.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.