(DL Hacks輪読) Variational Inference with Rényi Divergence

D1 Masahiro Suzuki

Upload: masahiro-suzuki

Post on 19-Jan-2017


TRANSCRIPT

Page 1: (DL Hacks輪読) Variational Inference with Rényi Divergence

Variational Inference with Rényi Divergence
D1 Masahiro Suzuki

Page 2: About the paper

- Posted to arXiv in 2016
- Authors: Yingzhen Li and Richard E. Turner, University of Cambridge; first author Li is a third-year PhD student (D3) who previously published "Stochastic Expectation Propagation" at NIPS
- Extends variational inference by generalising the KL divergence to the Rényi α-divergence, proposing a new lower bound (the variational Rényi bound); the objectives of the VAE and the importance weighted AE [Burda et al., 2015] are recovered as special cases
- Most of the proofs are deferred to the appendix

Page 3: Approximate inference

- In many probabilistic models, the posterior distribution and the marginal likelihood cannot be computed exactly and must be approximated (for details, see PRML)
- Approximate inference methods include variational inference, expectation propagation, and so on
- Among these, variational inference is the most widely used: it maximises the variational free energy (the variational lower bound), which is equivalent to minimising the KL divergence to the true posterior, since the log marginal likelihood decomposes as

$$\ln p(x) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$$
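This decomposition is easy to check numerically. Below is a minimal sketch (an addition, not from the slides) that verifies ln p(x) = L(q) + KL(q||p) for a small discrete latent variable; all quantities are illustrative.

```python
import numpy as np

# Toy discrete model: latent z in {0,1,2}, one observation x.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
lik   = np.array([0.7, 0.2, 0.1])          # p(x|z) for the observed x
joint = prior * lik                         # p(x, z)
evidence = joint.sum()                      # p(x)
posterior = joint / evidence                # p(z|x)

q = np.array([0.6, 0.3, 0.1])               # an arbitrary approximation q(z)

elbo = np.sum(q * np.log(joint / q))        # L(q) = E_q[log p(x,z) - log q(z)]
kl   = np.sum(q * np.log(q / posterior))    # KL(q || p(z|x))

print(np.log(evidence), elbo + kl)          # identical up to float error
```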

Page 4: Recent developments in approximate inference

Recent work on approximate inference follows several trends:

- Scaling to large datasets
  - Stochastic variational inference (SVI) [Hoffman et al., 2013]
  - Stochastic expectation propagation (SEP) [Li et al., 2015]
- Monte Carlo and black-box inference
  - Black-box variational inference [Ranganath et al., 2014]
  - Black-box alpha (BB-α) [Hernández-Lobato et al., 2015]
- New variational lower bounds
  - Importance weighted AE (IWAE) [Burda et al., 2015]: an extension of the VAE (published at ICLR 2016)

However, until this paper there was no unified view connecting these approaches.

Page 5: Variational inference

- In Bayesian inference, given data D, we want to compute the posterior distribution over the parameters
- The true posterior p(θ|D) cannot be computed directly, so we introduce an approximation q(θ) and obtain it by minimising the KL divergence to the posterior
- That KL divergence is itself intractable, because evaluating the evidence p(D) is intractable; instead, q is obtained by maximising an equivalent objective
- The resulting quantity L is called the variational lower bound

Rényi's α-divergence is defined for two distributions p and q on a random variable θ ∈ Θ:

$$D_\alpha[p\|q] = \frac{1}{\alpha - 1} \log \int p(\theta)^\alpha q(\theta)^{1-\alpha} \, d\theta. \qquad (1)$$

For α > 1 the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When α → 1 it recovers the Kullback-Leibler (KL) divergence, which plays a crucial role in machine learning and information theory:

$$D_1[p\|q] = \lim_{\alpha \to 1} D_\alpha[p\|q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)} \, d\theta = \mathrm{KL}[p\|q]. \qquad (2)$$

Similar to α = 1, for the values α = 0, +∞ the Rényi divergence is defined by continuity in α:

$$D_0[p\|q] = -\log \int_{p(\theta) > 0} q(\theta) \, d\theta, \qquad (3)$$

$$D_{+\infty}[p\|q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}. \qquad (4)$$

Another special case is α = 1/2, where the corresponding Rényi divergence is a function of the squared Hellinger distance $\mathrm{Hel}^2[p\|q] = \frac{1}{2} \int \big(\sqrt{p(\theta)} - \sqrt{q(\theta)}\big)^2 \, d\theta$:

$$D_{1/2}[p\|q] = -2 \log\big(1 - \mathrm{Hel}^2[p\|q]\big). \qquad (5)$$

In [Van Erven and Harremoës, 2014] the definition (1) is also extended to negative α values, although in that case it is non-positive and is thus no longer a valid divergence measure. The proposed method also considers negative α values, and we include from [Van Erven and Harremoës, 2014] some useful properties for the forthcoming derivations.

Proposition 1. Rényi's α-divergence, extended to negative α, is a continuous and non-decreasing function of α on $[0, 1] \cup \{\alpha : -\infty < D_\alpha < +\infty\}$. In particular, $D_\alpha[p\|q] \ge 0$ for α ≥ 0 and $D_\alpha[p\|q] \le 0$ for α < 0.

Proposition 2 (Skew symmetry). For α ∉ {0, 1},

$$D_\alpha[p\|q] = \frac{\alpha}{1 - \alpha} D_{1-\alpha}[q\|p]. \qquad (6)$$

For the limiting cases, $D_{-\infty}[p\|q] = -D_{+\infty}[q\|p]$.

A critical question to ask here is how to choose a divergence in this rich family to obtain the optimal solution for a particular application. This is still an open question in active research, so here we only include examples for the special α values. The KL divergence (α = 1) is widely used in approximate inference, including VI (KL[q||p]) and EP (KL[p||q]). The Hellinger distance (α = 0.5) is closely related to the Bhattacharyya bound for the Bayes error, which can be generalised to α ∈ (0, 1) and can be improved by choosing the optimal α [Cover and Thomas, 2006; Nielsen, 2011]. The limiting case $D_{+\infty}[p\|q]$ is called the worst-case regret of compressing the information in p with q in the minimum description length principle literature [Grünwald, 2007].
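To make definitions (1)–(6) concrete, here is a small sketch (an addition, not from the paper) that evaluates $D_\alpha$ for discrete distributions and numerically checks the KL limit and the skew symmetry of Proposition 2; the function name `renyi_div` is mine.

```python
import numpy as np

def renyi_div(p, q, alpha):
    """Rényi α-divergence D_α[p||q] for discrete distributions (Eq. 1)."""
    if np.isclose(alpha, 1.0):  # α → 1 limit: the KL divergence (Eq. 2)
        return np.sum(p * np.log(p / q))
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.4, 0.2])

# D_α approaches KL[p||q] as α → 1 (Eq. 2).
print(renyi_div(p, q, 0.999), renyi_div(p, q, 1.0))

# Skew symmetry (Proposition 2): D_α[p||q] = α/(1-α) · D_{1-α}[q||p].
a = 0.3
print(renyi_div(p, q, a), a / (1 - a) * renyi_div(q, p, 1 - a))

# Negative α gives non-positive values (Proposition 1).
print(renyi_div(p, q, -0.5))
```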

2.2 Variational Inference

Next we review the variational inference algorithm [Jordan et al., 1999; Beal, 2003] from an optimisation perspective, using posterior approximation as a running example. Consider observing a dataset of N i.i.d. samples $\mathcal{D} = \{x_n\}_{n=1}^N$ from a probabilistic model p(x|θ) parametrised by a random variable θ that is drawn from a prior $p_0(\theta)$. Bayesian inference involves computing the posterior distribution of the parameters given the data,

$$p(\theta|\mathcal{D}) = \frac{p(\theta, \mathcal{D})}{p(\mathcal{D})} = \frac{p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta)}{p(\mathcal{D})}, \qquad (7)$$



where $p(\mathcal{D}) = \int p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta) \, d\theta$ is often called the marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation q(θ) to the true posterior, which is obtained by minimising the KL divergence within some tractable distribution family $\mathcal{Q}$:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta)\|p(\theta|\mathcal{D})]. \qquad (8)$$

However, the KL divergence in (8) is also intractable, mainly because of the difficult term p(D). Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; \mathcal{D}), \qquad (9)$$

where the variational lower-bound or evidence lower-bound (ELBO) $\mathcal{L}_{VI}(q; \mathcal{D})$ is defined by

$$\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta)\|p(\theta|\mathcal{D})] = \mathbb{E}_q\!\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right]. \qquad (10)$$
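The ELBO in (10) can be estimated by simple Monte Carlo, $\mathbb{E}_q[\log p(\theta, \mathcal{D}) - \log q(\theta)]$, and the estimate is unbiased because the expectation sits outside the logarithm. A minimal sketch (mine, not the paper's) for a conjugate 1-D Gaussian model where log p(D) has a closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Conjugate toy model: θ ~ N(0,1), x | θ ~ N(θ,1); a single observation x.
x = 1.5
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))  # x ~ N(0, 2) marginally

# Approximate posterior q(θ) = N(m, s²); the exact posterior is N(x/2, 1/2).
m, s = 0.5, 0.9
theta = rng.normal(m, s, size=100_000)                            # θ_k ~ q
log_joint = norm.logpdf(theta, 0, 1) + norm.logpdf(x, theta, 1)   # log p(θ, D)
log_q = norm.logpdf(theta, m, s)

elbo = np.mean(log_joint - log_q)      # Eq. (10), unbiased Monte Carlo estimate
print(elbo, "<=", log_evidence)        # the ELBO lower-bounds log p(D)
```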

3 Variational Rényi Bound

Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Can the variational free-energy approach be generalised to the Rényi case? Consider approximating the true posterior p(θ|D) by minimising Rényi's α-divergence for some selected α ≥ 0:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta)\|p(\theta|\mathcal{D})]. \qquad (11)$$

Now we verify the alternative optimisation problem

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \big\{\log p(\mathcal{D}) - D_\alpha[q(\theta)\|p(\theta|\mathcal{D})]\big\}. \qquad (12)$$

When α ≠ 1, the objective can be rewritten as

$$\log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \int q(\theta)^\alpha p(\theta|\mathcal{D})^{1-\alpha} \, d\theta = \log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)\, p(\mathcal{D})}\right)^{1-\alpha}\right] = \frac{1}{1 - \alpha} \log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] =: \mathcal{L}_\alpha(q; \mathcal{D}). \qquad (13)$$

We name this new objective the variational Rényi bound (VR). Importantly, the following theorem is a direct result of Proposition 1.

Theorem 1. The objective $\mathcal{L}_\alpha(q; \mathcal{D})$ is continuous and non-increasing in α on $[0, 1] \cup \{\alpha : |\mathcal{L}_\alpha| < +\infty\}$. In particular, for all 0 < α < 1,

$$\mathcal{L}_{VI}(q; \mathcal{D}) = \lim_{\alpha \to 1} \mathcal{L}_\alpha(q; \mathcal{D}) \;\le\; \mathcal{L}_\alpha(q; \mathcal{D}) \;\le\; \mathcal{L}_0(q; \mathcal{D}).$$

The last bound satisfies $\mathcal{L}_0(q; \mathcal{D}) = \log p(\mathcal{D})$ if and only if $\mathrm{supp}(p(\theta|\mathcal{D})) \subseteq \mathrm{supp}(q(\theta))$.

The above theorem indicates a smooth interpolation between the traditional variational lower-bound $\mathcal{L}_{VI}$ and the exact log marginal likelihood log p(D), under the mild assumption that q is supported wherever the true posterior is supported. This assumption holds for many commonly used distributions, e.g. Gaussians, and in the following we assume this condition is satisfied.
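The interpolation in Theorem 1 can be observed numerically. Continuing the conjugate Gaussian toy above (a sketch of mine; `vr_bound` is an assumed helper name), we estimate $\mathcal{L}_\alpha$ with a large sample size for several α:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)

x, m, s = 1.5, 0.5, 0.9
log_evidence = norm.logpdf(x, 0, np.sqrt(2.0))

theta = rng.normal(m, s, size=1_000_000)
log_w = (norm.logpdf(theta, 0, 1) + norm.logpdf(x, theta, 1)
         - norm.logpdf(theta, m, s))                 # log p(θ,D) - log q(θ)

def vr_bound(log_w, alpha):
    """Monte Carlo estimate of L_α (Eq. 13); α = 1 recovers the ELBO."""
    if np.isclose(alpha, 1.0):
        return np.mean(log_w)
    return (logsumexp((1 - alpha) * log_w) - np.log(len(log_w))) / (1 - alpha)

for a in [1.0, 0.5, 0.0]:
    print("alpha =", a, "->", vr_bound(log_w, a))    # non-increasing in α
print("log p(D) =", log_evidence)                    # upper end of the interpolation
```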

Similar to traditional VI, the VR bound can also be applied as a surrogate objective to the MLE problem. In the following we detail the derivation of a log-likelihood approximation using the variational auto-encoder [Kingma and Welling, 2014; Rezende et al., 2014] as an example.



Page 6: Variational auto-encoder (VAE)

- A deep generative model proposed by [Kingma et al., 2014]
- The generative model is a hierarchical latent variable model, expressed as shown below
- Note that the latent variables h here are not the hidden layers of a deep network
- log p(x) cannot be computed exactly, so an approximate posterior q is introduced
- The following lower bound is then maximised

3.1 Variational Auto-encoder with Rényi Divergence

The variational auto-encoder (VAE) [Kingma and Welling, 2014; Rezende et al., 2014] is a recently proposed (deep) generative model that parametrizes the variational approximation with a recognition network. The generative model is specified as a hierarchical latent variable model:

$$p(x) = \sum_{h^{(1)}, \dots, h^{(L)}} p(h^{(L)})\, p(h^{(L-1)}|h^{(L)}) \cdots p(x|h^{(1)}). \qquad (14)$$

Here we drop the parameters θ but keep in mind that they will be learned using approximate maximum likelihood. However, for these models the exact computation of log p(x) requires marginalisation over all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating

$$\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x, h)}{q(h|x)}\right], \qquad (15)$$

where h collects all the hidden variables $h^{(1)}, \dots, h^{(L)}$ and the approximate posterior q(h|x) is defined as

$$q(h|x) = q(h^{(1)}|x)\, q(h^{(2)}|h^{(1)}) \cdots q(h^{(L)}|h^{(L-1)}). \qquad (16)$$
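Equations (14) and (16) say that the generative model samples ancestrally down the latent hierarchy, while the recognition model factorises in the reverse direction. A minimal two-layer sketch (illustrative shapes and linear "networks" of my choosing; a real VAE uses neural networks here):

```python
import numpy as np

rng = np.random.default_rng(0)

W2, W1 = rng.normal(size=(4, 2)), rng.normal(size=(8, 4))   # generative "networks"
R1, R2 = rng.normal(size=(4, 8)), rng.normal(size=(2, 4))   # recognition "networks"

# Generative model, Eq. (14): p(h2) -> p(h1 | h2) -> p(x | h1).
h2 = rng.normal(size=2)                     # p(h(2)) = N(0, I)
h1 = W2 @ h2 + 0.1 * rng.normal(size=4)     # p(h(1) | h(2)): Gaussian around W2 h2
x  = W1 @ h1 + 0.1 * rng.normal(size=8)     # p(x | h(1)):    Gaussian around W1 h1

# Recognition model, Eq. (16): q(h|x) = q(h1|x) q(h2|h1), run bottom-up.
g1 = R1 @ x  + 0.1 * rng.normal(size=4)     # sample from q(h(1) | x)
g2 = R2 @ g1 + 0.1 * rng.normal(size=2)     # sample from q(h(2) | h(1))
```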

In variational EM, the optimisation of q and p is alternated to guarantee convergence. The core idea of the VAE, however, is to optimise p and q jointly, which has no guarantee of increasing the MLE objective in each iteration; indeed the joint method is biased [Turner and Sahani, 2011]. This paper explores the possibility that alternative surrogate functions might return estimates that are tighter lower bounds, so the VR bound is considered in this context:

$$\mathcal{L}_\alpha(q; x) = \frac{1}{1 - \alpha} \log \mathbb{E}_{q(h|x)}\!\left[\left(\frac{p(x, h)}{q(h|x)}\right)^{1-\alpha}\right]. \qquad (17)$$

This is because Theorem 1 indicates that (17) provides a tighter lower bound on log p(x) than the VAE objective (15) when α ∈ [0, 1). A denoising criterion has also been considered [Im et al., 2015], and we show in the supplementary material its application to the VR bound. In general, new improvements upon the evidence lower bound that are based on the concavity of the logarithm also extend to the VR bound when 0 ≤ α ≤ 1.

4 The VR Bound Optimisation Framework

Variational free-energy methods sidestep intractabilities in a class of intractable models. Recent work has used further approximations based on Monte Carlo to expand the set of models that can be handled. The VR bounds can be deployed on the same model class as Monte Carlo variational methods, but on a smaller class if Monte Carlo methods are not resorted to. This section develops a scalable optimisation method for the VR bound by extending recent advances in traditional VI. Black-box methods are also discussed to enable its application to arbitrary finite α settings.

4.1 Monte Carlo Estimation of the VR Bound

We propose a simple Monte Carlo method that uses finite samples $\theta_k \sim q(\theta)$, k = 1, ..., K, to approximate $\mathcal{L}_\alpha \approx \hat{\mathcal{L}}_{\alpha,K}$:

$$\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D}) = \frac{1}{1 - \alpha} \log \frac{1}{K} \sum_{k=1}^{K} \left(\frac{p(\theta_k, \mathcal{D})}{q(\theta_k)}\right)^{1-\alpha}. \qquad (18)$$

Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over q(θ) is inside the logarithm. However, we can bound the bias by the following theorems, proved in the supplementary material.

Theorem 2. $\mathbb{E}_{\{\theta_k\}_{k=1}^K}\big[\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D})\big]$, as a function of α and K, is: 1) non-decreasing in K for fixed α ≤ 1, with limit $\mathcal{L}_\alpha$ as K → +∞ if |p/q| is bounded; and 2) continuous and non-increasing in α on $[0, 1] \cup \{\alpha : |\mathcal{L}_\alpha| < +\infty\}$.
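Part 1) of Theorem 2 is easy to see empirically (a sketch of mine, reusing the conjugate Gaussian toy from earlier): for fixed α < 1, the expected value of the K-sample estimator (18) grows with K towards $\mathcal{L}_\alpha$.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(2)
x, m, s = 1.5, 0.5, 0.9        # observation and q(θ) = N(m, s²), as before

def expected_vr_hat(K, alpha=0.5, reps=20_000):
    """Average the K-sample estimator (Eq. 18) over many replications."""
    theta = rng.normal(m, s, size=(reps, K))
    log_w = (norm.logpdf(theta, 0, 1) + norm.logpdf(x, theta, 1)
             - norm.logpdf(theta, m, s))
    est = (logsumexp((1 - alpha) * log_w, axis=1) - np.log(K)) / (1 - alpha)
    return est.mean()

for K in [1, 5, 50]:
    print(K, expected_vr_hat(K))   # non-decreasing in K (Theorem 2, part 1)
```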




Page 7: Variational auto-encoder (VAE)

- Assuming a Gaussian q, sampling can be rewritten as a deterministic, differentiable computation of auxiliary noise, as shown below (the reparameterization trick)
- The gradient of the lower bound can therefore be computed as follows
- A Monte Carlo approximation then gives the practical estimator below

…derived from importance weighting. The recognition network generates multiple approximate posterior samples, and their weights are averaged. As the number of samples is increased, the lower bound approaches the true log-likelihood. The use of multiple samples gives the IWAE additional flexibility to learn generative models whose posterior distributions do not fit the VAE modeling assumptions. This approach is related to reweighted wake-sleep (Bornschein & Bengio, 2015), but the IWAE is trained using a single unified objective. Compared with the VAE, our IWAE is able to learn richer representations with more latent dimensions, which translates into significantly higher log-likelihoods on density estimation benchmarks.

2 BACKGROUND

In this section, we review the variational autoencoder (VAE) model of Kingma & Welling (2014). In particular, we describe a generalization of the architecture to multiple stochastic hidden layers. We note, however, that Kingma & Welling (2014) used a single stochastic hidden layer, and there are other sensible generalizations to multiple layers, such as the one presented by Rezende et al. (2014).

The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden layers:

$$p(x|\theta) = \sum_{h^1, \dots, h^L} p(h^L|\theta)\, p(h^{L-1}|h^L, \theta) \cdots p(x|h^1, \theta). \qquad (1)$$

Here, θ is a vector of parameters of the variational autoencoder, and $h = \{h^1, \dots, h^L\}$ denotes the stochastic hidden units, or latent variables. The dependence on θ is often suppressed for clarity. For convenience, we define $h^0 = x$. Each of the terms $p(h^\ell|h^{\ell+1})$ may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network. However, it is assumed that sampling and probability evaluation are tractable for each $p(h^\ell|h^{\ell+1})$. Note that L denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We assume the recognition model q(h|x) is defined in terms of an analogous factorization:

$$q(h|x) = q(h^1|x)\, q(h^2|h^1) \cdots q(h^L|h^{L-1}), \qquad (2)$$

where sampling and probability evaluation are tractable for each of the terms in the product.

In this work, we assume the same families of conditional probability distributions as Kingma & Welling (2014). In particular, the prior $p(h^L)$ is fixed to be a zero-mean, unit-variance Gaussian. In general, each of the conditional distributions $p(h^\ell|h^{\ell+1})$ and $q(h^\ell|h^{\ell-1})$ is a Gaussian with diagonal covariance, where the mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, $p(x|h^1)$ is also defined to be such a Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters are computed by a neural network.

The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from Jensen's inequality:

$$\log p(x) = \log \mathbb{E}_{q(h|x)}\!\left[\frac{p(x, h)}{q(h|x)}\right] \ge \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x, h)}{q(h|x)}\right] = \mathcal{L}(x). \qquad (3)$$

Since $\mathcal{L}(x) = \log p(x) - D_{KL}(q(h|x)\|p(h|x))$, the training procedure is forced to trade off the data log-likelihood log p(x) and the KL divergence from the true posterior. This is beneficial, in that it encourages the model to learn a representation where posterior inference is easy to approximate.

If one computes the log-likelihood gradient for the recognition network directly from Eqn. 3, the result is a REINFORCE-like update rule which trains slowly because it does not use the log-likelihood gradients with respect to latent variables (Dayan et al., 1995; Mnih & Gregor, 2014). Instead, Kingma & Welling (2014) proposed a reparameterization of the recognition distribution in terms of auxiliary variables with fixed distributions, such that the samples from the recognition model are a deterministic function of the inputs and auxiliary variables. While they presented the reparameterization trick for a variety of distributions, for convenience we discuss the special case of Gaussians, since that is all we require in this work. (The general reparameterization trick can be used with our IWAE as well.)

In this paper, the recognition distribution $q(h^\ell|h^{\ell-1}, \theta)$ always takes the form of a Gaussian $N(h^\ell \mid \mu(h^{\ell-1}, \theta), \Sigma(h^{\ell-1}, \theta))$, whose mean and covariance are computed from the states of

the hidden units at the previous layer and the model parameters. This can alternatively be expressed by first sampling an auxiliary variable $\epsilon^\ell \sim N(0, I)$, and then applying the deterministic mapping

$$h^\ell(\epsilon^\ell, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1}, \theta)^{1/2} \epsilon^\ell + \mu(h^{\ell-1}, \theta). \qquad (4)$$

The joint recognition distribution q(h|x, θ) over all latent variables can be expressed in terms of a deterministic mapping h(ε, x, θ), with $\epsilon = (\epsilon^1, \dots, \epsilon^L)$, by applying Eqn. 4 for each layer in sequence. Since the distribution of ε does not depend on θ, we can reformulate the gradient of the bound L(x) from Eqn. 3 by pushing the gradient operator inside the expectation:

$$\nabla_\theta\, \mathbb{E}_{h \sim q(h|x,\theta)}\!\left[\log \frac{p(x, h|\theta)}{q(h|x, \theta)}\right] = \nabla_\theta\, \mathbb{E}_{\epsilon^1, \dots, \epsilon^L \sim N(0,I)}\!\left[\log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right] \qquad (5)$$

$$= \mathbb{E}_{\epsilon^1, \dots, \epsilon^L \sim N(0,I)}\!\left[\nabla_\theta \log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right]. \qquad (6)$$

Assuming the mapping h is represented as a deterministic feed-forward neural network, for a fixed ε, the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating k samples of ε and applying the Monte Carlo estimator

$$\frac{1}{k} \sum_{i=1}^{k} \nabla_\theta \log w\big(x, h(\epsilon_i, x, \theta), \theta\big) \qquad (7)$$

with $w(x, h, \theta) = p(x, h|\theta)/q(h|x, \theta)$. This is an unbiased estimate of $\nabla_\theta \mathcal{L}(x)$. We note that the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the log-likelihood gradients with respect to the latent variables.
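The variance claim can be checked on a toy expectation (my sketch, not from the paper). For $\mathbb{E}_{z \sim N(\mu,1)}[z^2]$, both the score-function (REINFORCE-like) estimator and the pathwise (reparameterized) estimator are unbiased for the gradient $\frac{d}{d\mu}\mathbb{E}[z^2] = 2\mu$, but the pathwise estimator has far lower variance:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000
eps = rng.normal(size=n)
z = mu + eps                          # reparameterization: z = μ + ε, ε ~ N(0, 1)

f = z**2                              # integrand; true gradient d/dμ E[z²] = 2μ

score = f * (z - mu)                  # f(z) · ∇_μ log N(z; μ, 1), REINFORCE-like
pathwise = 2 * z                      # ∇_μ f(μ + ε), uses gradients of f itself

print("true gradient:", 2 * mu)
print("score:    mean %.3f, var %.1f" % (score.mean(), score.var()))
print("pathwise: mean %.3f, var %.1f" % (pathwise.mean(), pathwise.var()))
```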

3 IMPORTANCE WEIGHTED AUTOENCODER

The VAE objective of Eqn. 3 heavily penalizes approximate posterior samples which fail to explain the observations. This places a strong constraint on the model, since the variational assumptions must be approximately satisfied in order to achieve a good lower bound. In particular, the posterior distribution must be approximately factorial and predictable with a feed-forward neural network. This VAE criterion may be too strict; a recognition network which places only a small fraction (e.g. 20%) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).

Our IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on log p(x). In particular, we use the following lower bound, corresponding to the k-sample importance weighting estimate of the log-likelihood:

$$\mathcal{L}_k(x) = \mathbb{E}_{h_1, \dots, h_k \sim q(h|x)}\!\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i|x)}\right]. \qquad (8)$$

Here, $h_1, \dots, h_k$ are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote as $w_i = p(x, h_i)/q(h_i|x)$.

This is a lower bound on the marginal log-likelihood, as follows from Jensen's inequality and the fact that the average importance weights are an unbiased estimator of p(x):

$$\mathcal{L}_k = \mathbb{E}\!\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] \le \log \mathbb{E}\!\left[\frac{1}{k} \sum_{i=1}^{k} w_i\right] = \log p(x), \qquad (9)$$

where the expectations are with respect to q(h|x). It is perhaps unintuitive that importance weighting would be a reasonable estimator in high dimensions. Observe, however, that the special case of k = 1 is equivalent to the standard VAE objective shown in Eqn. 3. Using more samples can only improve the tightness of the bound.
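A small sketch (mine) of the k-sample bound (8) on the 1-D conjugate Gaussian toy used earlier; consistent with (9), the estimated bound tightens monotonically towards log p(x) as k grows. Note that (8) coincides with the VR bound estimator (18) at α = 0.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(3)
x, m, s = 1.5, 0.5, 0.9        # observation and q(h|x) = N(m, s²)

def iwae_bound(k, reps=20_000):
    """Average k-sample importance weighted bound L_k (Eq. 8)."""
    h = rng.normal(m, s, size=(reps, k))
    log_w = (norm.logpdf(h, 0, 1) + norm.logpdf(x, h, 1)
             - norm.logpdf(h, m, s))                  # log w_i
    return np.mean(logsumexp(log_w, axis=1) - np.log(k))

for k in [1, 5, 50]:
    print(k, iwae_bound(k))     # k = 1 is the VAE objective; tightens with k
print("log p(x):", norm.logpdf(x, 0, np.sqrt(2.0)))
```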



Page 8: The VAE objective

- (If you felt that the VAE above looks somewhat different from the VAE you know, you are right)
- The VAE as usually presented uses the following formulation
- The original formulation is as follows
- The original VAE paper exploits the fact that the KL term can be computed analytically, splitting the bound into a KL divergence and a reconstruction error; this paper instead works with the bound in the form above

For the purpose of solving the above problems, let us introduce a recognition model $q_\phi(z|x)$: an approximation to the intractable true posterior $p_\theta(z|x)$. Note that in contrast with the approximate posterior in mean-field variational inference, it is not necessarily factorial and its parameters φ are not computed from some closed-form expectation. Instead, we'll introduce a method for learning the recognition model parameters φ jointly with the generative model parameters θ.

From a coding theory perspective, the unobserved variables z have an interpretation as a latent representation or code. In this paper we will therefore also refer to the recognition model $q_\phi(z|x)$ as a probabilistic encoder, since given a datapoint x it produces a distribution (e.g. a Gaussian) over the possible values of the code z from which the datapoint x could have been generated. In a similar vein we will refer to $p_\theta(x|z)$ as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x.

2.2 The variational bound

The marginal likelihood is composed of a sum over the marginal likelihoods of individual datapoints, $\log p_\theta(x^{(1)}, \dots, x^{(N)}) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})$, which can each be rewritten as:

$$\log p_\theta(x^{(i)}) = D_{KL}\big(q_\phi(z|x^{(i)})\,\|\,p_\theta(z|x^{(i)})\big) + \mathcal{L}(\theta, \phi; x^{(i)}). \qquad (1)$$

The first RHS term is the KL divergence of the approximate from the true posterior. Since this KL divergence is non-negative, the second RHS term $\mathcal{L}(\theta, \phi; x^{(i)})$ is called the (variational) lower bound on the marginal likelihood of datapoint i, and can be written as:

$$\log p_\theta(x^{(i)}) \ge \mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(z|x)}\big[-\log q_\phi(z|x) + \log p_\theta(x, z)\big], \qquad (2)$$

which can also be written as:

$$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z|x^{(i)})\,\|\,p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big]. \qquad (3)$$

We want to differentiate and optimize the lower bound $\mathcal{L}(\theta, \phi; x^{(i)})$ w.r.t. both the variational parameters φ and the generative parameters θ. However, the gradient of the lower bound w.r.t. φ is a bit problematic. The usual (naïve) Monte Carlo gradient estimator for this type of problem is

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\big[f(z) \nabla_\phi \log q_\phi(z)\big] \simeq \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)}) \nabla_\phi \log q_\phi(z^{(l)}),$$

where $z^{(l)} \sim q_\phi(z|x^{(i)})$. This gradient estimator exhibits very high variance (see e.g. [BJP12]) and is impractical for our purposes.

2.3 The SGVB estimator and AEVB algorithm

In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the parameters. We assume an approximate posterior in the form $q_\phi(z|x)$, but please note that the technique can be applied to the case $q_\phi(z)$, i.e. where we do not condition on x, as well. The fully variational Bayesian method for inferring a posterior over the parameters is given in the appendix.

Under certain mild conditions outlined in section 2.4, for a chosen approximate posterior q_φ(z|x) we can reparameterize the random variable z̃ ~ q_φ(z|x) using a differentiable transformation g_φ(ε, x) of an (auxiliary) noise variable ε:

\tilde{z} = g_\phi(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon) \qquad (4)

See section 2.4 for general strategies for choosing such an appropriate distribution p(ε) and function g_φ(ε, x). We can now form Monte Carlo estimates of expectations of some function f(z) w.r.t. q_φ(z|x) as follows:

\mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\big[f(g_\phi(\epsilon, x^{(i)}))\big] \simeq \frac{1}{L}\sum_{l=1}^{L} f(g_\phi(\epsilon^{(l)}, x^{(i)})) \quad \text{where} \quad \epsilon^{(l)} \sim p(\epsilon) \qquad (5)

We apply this technique to the variational lower bound (eq. (2)), yielding our generic Stochastic Gradient Variational Bayes (SGVB) estimator \tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)}):

\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)})
\quad \text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \text{ and } \epsilon^{(l)} \sim p(\epsilon) \qquad (6)


Algorithm 1  Minibatch version of the Auto-Encoding VB (AEVB) algorithm. Either of the two SGVB estimators in section 2.3 can be used. We use settings M = 100 and L = 1 in experiments.

θ, φ ← Initialize parameters
repeat
    X^M ← Random minibatch of M datapoints (drawn from full dataset)
    ε ← Random samples from noise distribution p(ε)
    g ← ∇_{θ,φ} L̃^M(θ, φ; X^M, ε)  (Gradients of minibatch estimator (8))
    θ, φ ← Update parameters using gradients g (e.g. SGD or Adagrad [DHS10])
until convergence of parameters (θ, φ)
return θ, φ

Often, the KL-divergence D_{KL}(q_φ(z|x^{(i)}) ‖ p_θ(z)) of eq. (3) can be integrated analytically (see appendix B), such that only the expected reconstruction error \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] requires estimation by sampling. The KL-divergence term can then be interpreted as regularizing φ, encouraging the approximate posterior to be close to the prior p_θ(z). This yields a second version of the SGVB estimator \tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)}), corresponding to eq. (3), which typically has less variance than the generic estimator:

\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\big) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})
\quad \text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \text{ and } \epsilon^{(l)} \sim p(\epsilon) \qquad (7)

Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:

\mathcal{L}(\theta, \phi; X) \simeq \tilde{\mathcal{L}}^M(\theta, \phi; X^M) = \frac{N}{M}\sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) \qquad (8)

where the minibatch X^M = \{x^{(i)}\}_{i=1}^{M} is a randomly drawn sample of M datapoints from the full dataset X with N datapoints. In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100. Derivatives \nabla_{\theta,\phi} \tilde{\mathcal{L}}^M(\theta, \phi; X^M) can be taken, and the resulting gradients can be used in conjunction with stochastic optimization methods such as SGD or Adagrad [DHS10]. See algorithm 1 for a basic approach to compute the stochastic gradients.

A connection with auto-encoders becomes clear when looking at the objective function given at eq. (7). The first term (the KL divergence of the approximate posterior from the prior) acts as a regularizer, while the second term is an expected negative reconstruction error. The function g_φ(·) is chosen such that it maps a datapoint x^{(i)} and a random noise vector ε^{(l)} to a sample from the approximate posterior for that datapoint: z^{(i,l)} = g_φ(ε^{(l)}, x^{(i)}) where z^{(i,l)} ~ q_φ(z|x^{(i)}). Subsequently, the sample z^{(i,l)} is input to the function \log p_\theta(x^{(i)}|z^{(i,l)}), which equals the probability density (or mass) of datapoint x^{(i)} under the generative model, given z^{(i,l)}. This term is a negative reconstruction error in auto-encoder parlance.

2.4 The reparameterization trick

In order to solve our problem we invoked an alternative method for generating samples from q_φ(z|x). The essential reparameterization trick is quite simple. Let z be a continuous random variable, and z ~ q_φ(z|x) be some conditional distribution. It is then often possible to express the random variable z as a deterministic variable z = g_φ(ε, x), where ε is an auxiliary variable with independent marginal p(ε), and g_φ(·) is some vector-valued function parameterized by φ.

This reparameterization is useful for our case since it can be used to rewrite an expectation w.r.t. q_φ(z|x) such that the Monte Carlo estimate of the expectation is differentiable w.r.t. φ. A proof is as follows. Given the deterministic mapping z = g_φ(ε, x) we know that q_\phi(z|x) \prod_i dz_i = p(\epsilon) \prod_i d\epsilon_i. Therefore¹, \int q_\phi(z|x) f(z)\, dz = \int p(\epsilon) f(z)\, d\epsilon = \int p(\epsilon) f(g_\phi(\epsilon, x))\, d\epsilon. It follows

¹ Note that for infinitesimals we use the notational convention dz = \prod_i dz_i
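To make the trick concrete, here is a minimal NumPy sketch (ours, not the paper's code; the variational parameters mu and log_sigma are made-up values) of the Gaussian reparameterization z = mu + sigma * eps and the resulting Monte Carlo estimate of an expectation under q_φ(z|x), as in eq. (5):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical variational parameters of a diagonal-Gaussian q_phi(z|x).
    mu = np.array([0.5, -0.3])
    log_sigma = np.array([-1.0, 0.2])

    def sample_z(L):
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so z is
        # a deterministic, differentiable function of (mu, log_sigma).
        eps = rng.standard_normal((L, mu.size))
        return mu + np.exp(log_sigma) * eps

    # Monte Carlo estimate of E_{q_phi(z|x)}[f(z)] for a test function f.
    f = lambda z: (z ** 2).sum(axis=-1)
    print(f(sample_z(10000)).mean())

Because the randomness lives entirely in eps, an autodiff framework can differentiate this estimate w.r.t. (mu, log_sigma), which is exactly what makes the SGVB estimator work.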


Page 9: (DL hacks輪読) Variational Inference with Rényi Divergence

Importance weighted AE (IWAE)

¤ The VAE objective places strong constraints on the model: approximate-posterior samples that fail to explain the observations are heavily penalised, so the posterior is in effect forced to be roughly factorial and predictable by a feed-forward network.

¤ Therefore, the following lower bound is proposed instead.

¤ For this bound, the following theorem has been shown.

¤ When k = 1 it is identical to the VAE; increasing the number of samples k yields a tighter lower bound.


the hidden units at the previous layer and the model parameters. This can be alternatively expressed by first sampling an auxiliary variable ε_ℓ ~ N(0, I), and then applying the deterministic mapping

h_\ell(\epsilon_\ell, h_{\ell-1}, \theta) = \Sigma(h_{\ell-1}, \theta)^{1/2} \epsilon_\ell + \mu(h_{\ell-1}, \theta). \qquad (4)

The joint recognition distribution q(h|x, θ) over all latent variables can be expressed in terms of a deterministic mapping h(ε, x, θ), with ε = (ε_1, ..., ε_L), by applying Eqn. 4 for each layer in sequence. Since the distribution of ε does not depend on θ, we can reformulate the gradient of the bound L(x) from Eqn. 3 by pushing the gradient operator inside the expectation:

\nabla_\theta \log \mathbb{E}_{h \sim q(h|x,\theta)}\left[\frac{p(x, h|\theta)}{q(h|x, \theta)}\right] = \nabla_\theta \mathbb{E}_{\epsilon_1, \ldots, \epsilon_L \sim \mathcal{N}(0,I)}\left[\log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right] \qquad (5)
= \mathbb{E}_{\epsilon_1, \ldots, \epsilon_L \sim \mathcal{N}(0,I)}\left[\nabla_\theta \log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right]. \qquad (6)

Assuming the mapping h is represented as a deterministic feed-forward neural network, for a fixed ε, the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating k samples of ε and applying the Monte Carlo estimator

\frac{1}{k}\sum_{i=1}^{k} \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta) \qquad (7)

with w(x, h, θ) = p(x, h|θ)/q(h|x, θ). This is an unbiased estimate of ∇_θ L(x). We note that the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the log-likelihood gradients with respect to the latent variables.

3 IMPORTANCE WEIGHTED AUTOENCODER

The VAE objective of Eqn. 3 heavily penalizes approximate posterior samples which fail to explain the observations. This places a strong constraint on the model, since the variational assumptions must be approximately satisfied in order to achieve a good lower bound. In particular, the posterior distribution must be approximately factorial and predictable with a feed-forward neural network. This VAE criterion may be too strict; a recognition network which places only a small fraction (e.g. 20%) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).

Our IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on log p(x). In particular, we use the following lower bound, corresponding to the k-sample importance weighting estimate of the log-likelihood:

\mathcal{L}_k(x) = \mathbb{E}_{h_1, \ldots, h_k \sim q(h|x)}\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i|x)}\right]. \qquad (8)

Here, h_1, ..., h_k are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote as w_i = p(x, h_i)/q(h_i|x).

This is a lower bound on the marginal log-likelihood, as follows from Jensen's inequality and the fact that the average of the importance weights is an unbiased estimator of p(x):

\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k}\sum_{i=1}^{k} w_i\right] \le \log \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k} w_i\right] = \log p(x), \qquad (9)

where the expectations are with respect to q(h|x). It is perhaps unintuitive that importance weighting would be a reasonable estimator in high dimensions. Observe, however, that the special case of k = 1 is equivalent to the standard VAE objective shown in Eqn. 3. Using more samples can only improve the tightness of the bound:


Theorem 1. For all k, the lower bounds satisfy

\log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k. \qquad (10)

Moreover, if p(h, x)/q(h|x) is bounded, then \mathcal{L}_k approaches \log p(x) as k goes to infinity.

Proof. See Appendix A.

The bound L_k can be estimated using the straightforward Monte Carlo estimator, where we generate samples from the recognition network and average the importance weights. One might worry about the variance of this estimator, since importance weighting famously suffers from extremely high variance in cases where the proposal and target distributions are not a good match. However, as our estimator is based on the log of the average importance weights, it does not suffer from high variance. This argument is made more precise in Appendix B.
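As a quick sanity check of this claim, one Monte Carlo draw of L_k is easy to compute stably in log space; a minimal sketch (ours, not the authors' code), with synthetic log-weights standing in for log p(x, h_i) - log q(h_i|x):

    import numpy as np
    from scipy.special import logsumexp

    def iwae_bound(log_w):
        # One Monte Carlo draw of L_k = log (1/k) sum_i w_i, computed as a
        # log-mean-exp of the k log importance weights (last axis).
        k = log_w.shape[-1]
        return logsumexp(log_w, axis=-1) - np.log(k)

    # Synthetic log-weights for a batch of 4 datapoints and k = 50 samples.
    log_w = np.random.default_rng(0).normal(size=(4, 50))
    print(iwae_bound(log_w))   # k = 1 would reduce to the VAE objective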

3.1 TRAINING PROCEDURE

To train an IWAE with a stochastic gradient based optimizer, we use an unbiased estimate of the gradient of L_k, defined in Eqn. 8. As with the VAE, we use the reparameterization trick to derive a low-variance update rule:

\nabla_\theta \mathcal{L}_k(x) = \nabla_\theta \mathbb{E}_{h_1, \ldots, h_k}\left[\log \frac{1}{k}\sum_{i=1}^{k} w_i\right] = \nabla_\theta \mathbb{E}_{\epsilon_1, \ldots, \epsilon_k}\left[\log \frac{1}{k}\sum_{i=1}^{k} w(x, h(x, \epsilon_i, \theta), \theta)\right] \qquad (11)
= \mathbb{E}_{\epsilon_1, \ldots, \epsilon_k}\left[\nabla_\theta \log \frac{1}{k}\sum_{i=1}^{k} w(x, h(x, \epsilon_i, \theta), \theta)\right] \qquad (12)
= \mathbb{E}_{\epsilon_1, \ldots, \epsilon_k}\left[\sum_{i=1}^{k} \tilde{w}_i \nabla_\theta \log w(x, h(x, \epsilon_i, \theta), \theta)\right], \qquad (13)

where ε_1, ..., ε_k are the same auxiliary variables as defined in Section 2 for the VAE, w_i = w(x, h(x, ε_i, θ), θ) are the importance weights expressed as a deterministic function, and \tilde{w}_i = w_i / \sum_{i=1}^{k} w_i are the normalized importance weights.

In the context of a gradient-based learning algorithm, we draw k samples from the recognition network (or, equivalently, k sets of auxiliary variables), and use the Monte Carlo estimate of Eqn. 13:

\sum_{i=1}^{k} \tilde{w}_i \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta). \qquad (14)

In the special case of k = 1, the single normalized weight \tilde{w}_1 takes the value 1, and one obtains the VAE update rule.

We unpack this update because it does not quite parallel that of the standard VAE.¹ The gradient of the log weights decomposes as:

\nabla_\theta \log w(x, h(x, \epsilon_i, \theta), \theta) = \nabla_\theta \log p(x, h(x, \epsilon_i, \theta)|\theta) - \nabla_\theta \log q(h(x, \epsilon_i, \theta)|x, \theta). \qquad (15)

The first term encourages the generative model to assign high probability to each h_ℓ given h_{ℓ+1} (following the convention that x = h_0). It also encourages the recognition network to adjust the hidden representations so that the generative network makes better predictions. In the case of a single stochastic layer (i.e. L = 1), the combination of these two effects is equivalent to backpropagation in a stochastic autoencoder. The second term of this update encourages the recognition network to have a spread-out distribution over predictions. This update is averaged over the samples with weight proportional to the importance weights, motivating the name "importance weighted autoencoder."

¹ Kingma & Welling (2014) separated out the KL divergence in the bound of Eqn. 3 in order to achieve a simpler and lower-variance update. Unfortunately, no analogous trick applies for k > 1. In principle, the IWAE updates may be higher variance for this reason. However, in our experiments, we observed that the performance of the two update rules was indistinguishable in the case of k = 1.
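The update in Eqn. 14 is what one would implement in practice; a minimal sketch (ours) of the surrogate objective, assuming an autodiff framework where the normalised weights are held fixed via a stop-gradient:

    import numpy as np
    from scipy.special import softmax

    def iwae_surrogate(log_w):
        # w_tilde = softmax(log_w) are the normalised importance weights of
        # Eqn. 13. Treating them as constants (a stop-gradient in an autodiff
        # framework), the gradient of sum_i w_tilde_i * log w_i w.r.t. the
        # parameters is exactly the estimator of Eqn. 14.
        w_tilde = softmax(log_w, axis=-1)
        return (w_tilde * log_w).sum(axis=-1)

    log_w = np.random.default_rng(1).normal(size=(4, 50))
    print(iwae_surrogate(log_w).mean())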


Page 10: (DL hacks輪読) Variational Inference with Rényi Divergence

Rényi's α-divergence

¤ Measures the discrepancy between two probability distributions p and q.

¤ Here α is a parameter (α > 0, α ≠ 1).

¤ When α → 1, it recovers the KL divergence.

¤ When α = 1/2, it is related to the Hellinger distance.

two distributions p and q on a random variable θ ∈ Θ:

D_\alpha[p\|q] = \frac{1}{\alpha - 1} \log \int p(\theta)^{\alpha} q(\theta)^{1-\alpha} \, d\theta. \qquad (1)

For α > 1 the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When α → 1 it recovers the Kullback-Leibler (KL) divergence that plays a crucial role in machine learning and information theory:

D_1[p\|q] = \lim_{\alpha \to 1} D_\alpha[p\|q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)} \, d\theta = \mathrm{KL}[p\|q]. \qquad (2)

Similar to α = 1, for values α = 0, +∞ the Rényi divergence is defined by continuity in α:

D_0[p\|q] = -\log \int_{p(\theta) > 0} q(\theta) \, d\theta, \qquad (3)
D_{+\infty}[p\|q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}. \qquad (4)

Another special case is α = 1/2, where the corresponding Rényi divergence is a function of the square Hellinger distance \mathrm{Hel}^2[p\|q] = \frac{1}{2} \int (\sqrt{p(\theta)} - \sqrt{q(\theta)})^2 \, d\theta:

D_{1/2}[p\|q] = -2 \log(1 - \mathrm{Hel}^2[p\|q]). \qquad (5)

In [Van Erven and Harremoës, 2014] the definition (1) is also extended to negative α values, although in that case it is non-positive and is thus no longer a valid divergence measure. The proposed method also considers negative α values, and we include from [Van Erven and Harremoës, 2014] some useful properties for forthcoming derivations.

Proposition 1. Rényi's α-divergence definition, extended to negative α, is a continuous and non-decreasing function of α on [0, 1] ∪ {−∞ < D_α < +∞}. Especially D_α[p‖q] ≥ 0 for α ≥ 0 and D_α[p‖q] ≤ 0 for α < 0.

Proposition 2. (Skew symmetry) For α ≠ 0, 1,

D_\alpha[p\|q] = \frac{\alpha}{1 - \alpha} D_{1-\alpha}[q\|p]. \qquad (6)

For the limiting case D_{−∞}[p‖q] = −D_{+∞}[q‖p].

A critical question to ask here is how to choose a divergence in this rich family to obtain the optimal solution for a particular application. This is still an open question in active research, so here we only include examples for the special α values. The KL divergence (α = 1) is widely used in approximate inference, including VI (KL[q‖p]) and EP (KL[p‖q]). The Hellinger distance (α = 0.5) is closely related to the Bhattacharyya bound for the Bayes error, which can be generalised to α ∈ (0, 1) and can be improved by choosing the optimal α [Cover and Thomas, 2006, Nielsen, 2011]. The limiting case D_{+∞}[p‖q] is called the worst-case regret of compressing the information in p with q in the minimum description length principle literature [Grünwald, 2007].

2.2 Variational Inference

Next we review the variational inference algorithm [Jordan et al., 1999, Beal, 2003] from an optimisation perspective, using posterior approximation as a running example. Consider observing a dataset of N i.i.d. samples D = {x_n}_{n=1}^{N} from a probabilistic model p(x|θ) parametrised by a random variable θ that is drawn from a prior p_0(θ). Bayesian inference involves computing the posterior distribution of the parameters given the data,

p(\theta|\mathcal{D}) = \frac{p(\theta, \mathcal{D})}{p(\mathcal{D})} = \frac{p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta)}{p(\mathcal{D})}, \qquad (7)
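Since \int p^{\alpha} q^{1-\alpha} \, d\theta = \mathbb{E}_q[(p/q)^{\alpha}], definition (1) is easy to check numerically. A small sketch (ours), using unit-variance Gaussians for which the closed form D_α = α(μ_p − μ_q)²/2 holds:

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import norm

    def renyi_mc(alpha, p, q, n=200000, seed=0):
        # D_alpha[p||q] = 1/(alpha - 1) * log E_q[(p/q)^alpha], theta ~ q,
        # with the expectation computed stably in log space.
        theta = q.rvs(size=n, random_state=seed)
        log_ratio = p.logpdf(theta) - q.logpdf(theta)
        return (logsumexp(alpha * log_ratio) - np.log(n)) / (alpha - 1.0)

    p, q = norm(0.0, 1.0), norm(1.0, 1.0)
    for alpha in (0.5, 0.9, 2.0):
        # Closed form for equal unit variances: alpha * (mu_p - mu_q)^2 / 2.
        print(alpha, renyi_mc(alpha, p, q), alpha / 2)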


Page 11: (DL hacks輪読) Variational Inference with Rényi Divergence

Variational Rényi bound

¤ Variational inference minimised the KL divergence between the true posterior p(θ|D) and the approximation q(θ); can this be generalised by extending the KL to a Rényi divergence?

¤ Maximise the following objective, derived from Rényi's α-divergence.

¤ This can be recast as the equivalent optimisation problem below (just as in ordinary variational inference).

¤ In particular, when α ≠ 1:

where p(\mathcal{D}) = \int p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta) \, d\theta is often called marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation q(θ) to the true posterior, which is obtained by minimising the KL divergence in some tractable distribution family Q:

q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta)\|p(\theta|\mathcal{D})]. \qquad (8)

However the KL divergence in (8) is also intractable, mainly because of the difficult term p(D). Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; \mathcal{D}), \qquad (9)

where the variational lower-bound or evidence lower-bound (ELBO) \mathcal{L}_{VI}(q; \mathcal{D}) is defined by

\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta)\|p(\theta|\mathcal{D})] = \mathbb{E}_q\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right]. \qquad (10)

3 Variational Rényi Bound

Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps variational free-energy approaches can be generalised to the Rényi case? Consider approximating the true posterior p(θ|D) by minimising Rényi's α-divergence for some selected α ≥ 0:

q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta)\|p(\theta|\mathcal{D})]. \qquad (11)

Now we verify the alternative optimisation problem

q(\theta) = \arg\max_{q \in \mathcal{Q}} \log p(\mathcal{D}) - D_\alpha[q(\theta)\|p(\theta|\mathcal{D})]. \qquad (12)

When α ≠ 1, the objective can be rewritten as

\log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \int q(\theta)^{\alpha} p(\theta|\mathcal{D})^{1-\alpha} \, d\theta
= \log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta) p(\mathcal{D})}\right)^{1-\alpha}\right]
= \frac{1}{1 - \alpha} \log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; \mathcal{D}). \qquad (13)

We name this new objective the variational Rényi bound (VR). Importantly the following theorem is a direct result of Proposition 1.

Theorem 1. The objective \mathcal{L}_\alpha(q; \mathcal{D}) is continuous and non-increasing in α on [0, 1] ∪ {|\mathcal{L}_\alpha| < +∞}. Especially for all 0 < α < 1,

\mathcal{L}_{VI}(q; \mathcal{D}) = \lim_{\alpha \to 1} \mathcal{L}_\alpha(q; \mathcal{D}) \le \mathcal{L}_\alpha(q; \mathcal{D}) \le \mathcal{L}_0(q; \mathcal{D}).

The last bound \mathcal{L}_0(q; \mathcal{D}) = \log p(\mathcal{D}) if and only if the support \mathrm{supp}(p(\theta|\mathcal{D})) \subseteq \mathrm{supp}(q(\theta)).

The above theorem indicates a smooth interpolation between the traditional variational lower-bound \mathcal{L}_{VI} and the exact log marginal \log p(\mathcal{D}), under the mild assumption that q is supported wherever the true posterior is supported. This assumption holds for a lot of commonly used distributions, e.g. Gaussians, and in the following we assume this condition is satisfied.

Similar to traditional VI, the VR bound can also be applied as a surrogate objective to the MLE problem. In the following we detail the derivations of log-likelihood approximation using the variational auto-encoder [Kingma and Welling, 2014, Rezende et al., 2014] as an example.


(Eq. 13 above is the variational Rényi bound (VR).)

Page 12: (DL hacks輪読) Variational Inference with Rényi Divergence

Monte Carlo approximation of the VR bound

¤ With a finite number of samples, the VR bound can be approximated as follows.

¤ The following properties then hold.

¤ Since these statements are hard to picture on their own, the next slide illustrates them with a figure.

3.1 Variational Auto-encoder with Rényi Divergence

The variational auto-encoder (VAE) [Kingma and Welling, 2014, Rezende et al., 2014] is a recently proposed (deep) generative model that parametrizes the variational approximation with a recognition network. The generative model is specified as a hierarchical latent variable model:

p(x) = \sum_{h^{(1)} \ldots h^{(L)}} p(h^{(L)}) p(h^{(L-1)}|h^{(L)}) \cdots p(x|h^{(1)}). \qquad (14)

Here we drop the parameters θ but keep in mind that they will be learned using approximate maximum likelihood. However for these models the exact computation of log p(x) requires marginalisation over all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating

\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h|x)}\left[\log \frac{p(x, h)}{q(h|x)}\right], \qquad (15)

where h collects all the hidden variables h^{(1)}, ..., h^{(L)} and the approximate posterior q(h|x) is defined as

q(h|x) = q(h^{(1)}|x) q(h^{(2)}|h^{(1)}) \cdots q(h^{(L)}|h^{(L-1)}). \qquad (16)

In variational EM, optimisation of q and p is alternated to guarantee convergence. The core idea of the VAE, however, is to optimise p and q jointly, which has no guarantee of increasing the MLE objective function in each iteration; indeed the joint method is biased [Turner and Sahani, 2011]. This paper explores the possibility that alternative surrogate functions might return estimates that are tighter lower-bounds. So the VR bound is considered in this context:

\mathcal{L}_\alpha(q; x) = \frac{1}{1 - \alpha} \log \mathbb{E}_{q(h|x)}\left[\left(\frac{p(x, h)}{q(h|x)}\right)^{1-\alpha}\right]. \qquad (17)

This is because Theorem 1 indicates that the approximation (17) provides a tighter lower-bound on log p(x) than the VAE's objective (15) when α ∈ [0, 1). A denoising criterion has also been considered [Im et al., 2015] and we show in the supplementary its application to the VR bound. In general, new improvements upon the evidence lower-bound based on the concavity of the logarithm also extend to the VR bound when 0 ≤ α ≤ 1.

4 The VR Bound Optimisation Framework

Variational free-energy methods sidestep intractabilities in a class of intractable models. Recent work has used further approximations based on Monte Carlo to expand the set of models that can be handled. The VR bounds can be deployed on the same model class as Monte Carlo variational methods, but on a smaller class if Monte Carlo methods are not resorted to. This section develops a scalable optimisation method for the VR bound by extending recent advances in traditional VI. Black-box methods are also discussed to enable its application to arbitrary finite α settings.

4.1 Monte Carlo Estimation of the VR Bound

We propose a simple Monte Carlo method that uses finite samples θ_k ~ q(θ), k = 1, ..., K to approximate \mathcal{L}_\alpha \approx \mathcal{L}_{\alpha,K}:

\mathcal{L}_{\alpha,K}(q; \mathcal{D}) = \frac{1}{1 - \alpha} \log \frac{1}{K} \sum_{k=1}^{K} \left[\left(\frac{p(\theta_k, \mathcal{D})}{q(\theta_k)}\right)^{1-\alpha}\right]. \qquad (18)

Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over q(θ) is inside the logarithm. However we can bound the bias by the following theorems, proved in the supplementary.

Theorem 2. \mathbb{E}_{\{\theta_k\}_{k=1}^{K}}[\mathcal{L}_{\alpha,K}(q; \mathcal{D})], as a function of α and K, is: 1) non-decreasing in K for fixed α ≤ 1, where the limiting result is \mathcal{L}_\alpha for K → +∞ if |p/q| is bounded; 2) continuous and non-increasing in α on [0, 1] ∪ {|\mathcal{L}_\alpha| < +∞}.
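Numerically, (18) is just a log-mean-exp of scaled log-weights; a minimal sketch (ours) that also recovers the ELBO estimate in the α → 1 limit:

    import numpy as np
    from scipy.special import logsumexp

    def vr_bound(log_w, alpha):
        # Monte Carlo estimate L_{alpha,K} of eq. (18), where
        # log_w[k] = log p(theta_k, D) - log q(theta_k), theta_k ~ q.
        K = log_w.shape[-1]
        if np.isclose(alpha, 1.0):
            return log_w.mean(axis=-1)   # alpha -> 1: the ELBO estimate
        return (logsumexp((1.0 - alpha) * log_w, axis=-1) - np.log(K)) / (1.0 - alpha)

    log_w = np.random.default_rng(0).normal(size=100)
    print([round(float(vr_bound(log_w, a)), 3) for a in (-1.0, 0.0, 0.5, 1.0)])

Note that α = 0 makes this exactly the IWAE estimator of Eqn. 8, and decreasing α only ever increases the estimate, consistent with Theorem 2.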


Figure 2: (a) An illustration of the bounding properties of sampling approximations to the VR bounds. Here α_2 < 0 < α_1 < 1 and 1 < K_1 < K_2 < +∞. (b) The bias of the sampling estimate of the (negative) alpha divergence. In this example p, q are 2-D Gaussian distributions with identity covariance matrix, where the only difference is μ_p = [0, 0] and μ_q = [1, 1]. Best viewed in colour.

Corollary 1. For K < +∞, there exists α_K < 0 such that for all α ≥ α_K, \mathbb{E}_{\{\theta_k\}_{k=1}^{K}}[\mathcal{L}_{\alpha,K}(q; \mathcal{D})] \le \log p(\mathcal{D}). Furthermore α_K is non-decreasing in K, with \lim_{K \to 1} \alpha_K = -\infty and \lim_{K \to +\infty} \alpha_K = 0.

To better understand the above theorems we plot in Figure 2(a) an illustration of the bounding properties. By definition, the exact VR bound is a lower-bound or upper-bound of the log-likelihood log p(D) when α > 0 or α < 0, respectively (red lines). However for α ≤ 1 the sampling approximation \mathcal{L}_{\alpha,K} in expectation under-estimates the exact VR bound \mathcal{L}_\alpha (blue dashed lines), where the approximation quality can be improved by using more samples (the blue dashed arrow). Thus for finite samples, negative alpha values (α_2 < 0) can be used to improve the accuracy of the approximation (see the red arrow between the two blue dashed lines visualising \mathcal{L}_{\alpha_1,K_1} and \mathcal{L}_{\alpha_2,K_1}, respectively).

We empirically evaluate the theoretical results in Figure 2(b), by computing the exact and Monte Carlo estimates of the (negative) Rényi divergences between two Gaussian distributions. The sampling procedure is repeated 200 times to get an estimate of the expected value. Clearly for K = 1 it returns a stochastic estimate of the (negative) KL-divergence no matter what value of α is used. For K > 1 samples and α < 1, the sampling approximation under-estimates the VR bound, and the bias decreases when increasing K. For α > 1 the inequality is reversed simply because of the term 1 − α in the VR bound.

The importance weighted auto-encoder (IWAE) [Burda et al., 2015] is a special case of our framework applied to the VAE with α = 0 and K < +∞. The purpose of IWAE is to perform approximate MLE with an accurate approximation to the likelihood function, so a large number of samples is required. In this context, Corollary 1 indicates that negative α can be used to obtain a better approximation when K is small. But α_K depends on both p and q and it remains an open question how to select α.

4.2 Unified Code with the Reparameterization Trick

Readers may have noticed that \mathcal{L}_{VI} has a different form compared to \mathcal{L}_\alpha with α ≠ 1. In this section we use the reparameterization trick [Kingma and Welling, 2014] to unify the implementation for all finite α settings. This trick assumes the existence of the mapping θ = g_φ(ε), where the distribution of the noise term ε satisfies q(θ)dθ = p(ε)dε. Then the expectation of a function F(θ) over the distribution q(θ) can be computed as \mathbb{E}_{q(\theta)}[F(\theta)] = \mathbb{E}_{p(\epsilon)}[F(g_\phi(\epsilon))]. One prevalent example is the Gaussian reparameterization: θ ~ N(μ, Σ) ⇒ θ = μ + Σ^{1/2} ε, ε ~ N(0, I). In the following we may use g_φ = g_φ(ε) and \mathbb{E}_\epsilon[\cdot] = \mathbb{E}_{p(\epsilon)}[\cdot]
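The Figure 2(b) experiment is easy to reproduce; a sketch (ours) for the stated setup (2-D identity-covariance Gaussians, μ_p = [0, 0], μ_q = [1, 1], 200 repeats), where with a normalised p the estimate converges to −D_α[q‖p] = −α as K grows:

    import numpy as np
    from scipy.special import logsumexp

    rng = np.random.default_rng(0)
    mu_p, mu_q = np.zeros(2), np.ones(2)

    def log_ratio(x):
        # log p(x) - log q(x) for identity-covariance Gaussians (constants cancel).
        return -0.5 * (((x - mu_p) ** 2).sum(-1) - ((x - mu_q) ** 2).sum(-1))

    def est(alpha, K, repeats=200):
        # Average over repeats of the K-sample estimate of -D_alpha[q||p],
        # i.e. 1/(1-alpha) * log (1/K) sum_k (p/q)^{1-alpha}, samples from q.
        x = mu_q + rng.standard_normal((repeats, K, 2))
        vals = (logsumexp((1 - alpha) * log_ratio(x), axis=-1) - np.log(K)) / (1 - alpha)
        return vals.mean()

    for alpha in (-0.5, 0.0, 0.5):
        print(alpha, [round(est(alpha, K), 3) for K in (1, 10, 100)],
              "exact:", -alpha)   # D_alpha[q||p] = alpha * ||mu_q - mu_p||^2 / 2 = alpha

At K = 1 the estimate is the same (negative KL) for every α, and for α < 1 it climbs towards the exact value as K increases, matching the figure.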

Page 13: (DL hacks輪読) Variational Inference with Rényi Divergence

Sampling approximation of the VR bound


(Annotations on Figure 2(a))

¤ Exact VR bound vs. its sampling approximation.

¤ For α ≤ 1, the sampling approximation is an under-estimate.

¤ Making α negative pushes the bound up.

¤ Increasing the number of samples K pushes the approximation up.

Page 14: (DL hacks輪読) Variational Inference with Rényi Divergence

Simulated divergences for the VR bound

¤ As the figure shows, the IWAE cannot approximate well (i.e., drive the divergence estimate to its exact value) unless the number of samples is large.

¤ When α is negative, a good approximation is obtained with a small number of samples.


¤ For α ≤ 1, increasing the number of samples moves the estimate up (reducing the bias).

¤ α = 0 corresponds to the IWAE.

Page 15: (DL hacks輪読) Variational Inference with Rényi Divergence

VR-max

¤ By the reparameterization trick,

¤ the gradient of the bound is

¤ where w̄_{α,k} is the normalised importance weight; when α = 1 this coincides with the VAE objective.

¤ Now consider the limit α → −∞: this is equivalent to selecting the sample whose (unnormalised) importance weight is maximal.

¤ This approach is named VR-max.

Algorithm 1  One gradient step for VR-α / VR-max

1: sample ε_1, ..., ε_K ~ p(ε)
2: for all k = 1, ..., K, and n ∈ S the current minibatch, compute the unnormalised weight
       \log w(\epsilon_k; x_n) = \log p(g_\phi(\epsilon_k), x_n) - \log q(g_\phi(\epsilon_k)|x_n)
3: choose the sample ε_{j_n} to back-propagate:
       if |α| < +∞: j_n ~ p_k where p_k ∝ w(ε_k; x_n)^{1−α}
       if α = −∞: j_n = argmax_k log w(ε_k; x_n)
4: return the gradients to the optimiser
       \frac{1}{|S|} \sum_{n \in S} \nabla \log w(\epsilon_{j_n}; x_n)

to reduce the clutter of notation. Now we apply the reparameterization trick to the VR bound:

\mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \frac{1}{1 - \alpha} \log \mathbb{E}_\epsilon\left[\left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}\right]. \qquad (19)

Then the gradient of the VR bound w.r.t. φ is

\nabla_\phi \mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \mathbb{E}_\epsilon\left[w_\alpha(\epsilon; \phi, \mathcal{D}) \nabla_\phi \log \frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right], \qquad (20)

where w_\alpha(\epsilon; \phi, \mathcal{D}) \propto \left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha} denotes the normalised importance weight. For finite samples ε_k ~ p(ε), k = 1, ..., K, the gradient is approximated by

\nabla_\phi \mathcal{L}_{\alpha,K}(q_\phi; \mathcal{D}) = \frac{1}{K} \sum_{k=1}^{K} \hat{w}_{\alpha,k} \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \qquad (21)

with \hat{w}_{\alpha,k} shorthand for w_\alpha(\epsilon_k; \phi, \mathcal{D}), the normalised importance weight with finite samples. One can show that it recovers the stochastic gradients of \mathcal{L}_{VI} by setting α = 1 in (21):

\nabla_\phi \mathcal{L}_{VI}(q_\phi; \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \qquad (22)

which means the resulting algorithm unifies the computation for all finite α settings.

To speed up learning, [Burda et al., 2015] suggested back-propagating only one sample ε_j with j ~ p_j = \hat{w}_{\alpha,j}, which can be easily extended to our framework. Importantly, the choice of α < 1 controls the degree to which locations where the approximation q(θ) mismatches p(θ|D) are emphasised. Negative α can also be used according to Section 4.1, and in the extreme case α → −∞ the algorithm chooses the sample that has the maximum unnormalised importance weight. We name this approach VR-max and summarise it, together with the other finite α settings, in Algorithm 1 for VAEs. Note that VR-max does not compute the q distribution by the minimum description length principle (MDL), since MDL approximates the true posterior by minimising the exact VR bound \mathcal{L}_{-\infty}, which upper-bounds the exact log-likelihood function.
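Step 3 of Algorithm 1 is the only part that differs across α settings; a minimal sketch (ours, NumPy only; a real implementation would then back-propagate through log w at the chosen index):

    import numpy as np

    rng = np.random.default_rng(0)

    def choose_sample(log_w, alpha):
        # Step 3 of Algorithm 1: pick the sample index j_n to back-propagate.
        if np.isneginf(alpha):                  # VR-max: the maximum log-weight
            return int(np.argmax(log_w))
        logits = (1.0 - alpha) * log_w          # p_k proportional to w_k^{1-alpha}
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return int(rng.choice(log_w.size, p=p))

    log_w = rng.normal(size=5)   # hypothetical unnormalised log-weights for one datapoint
    print(choose_sample(log_w, 0.0), choose_sample(log_w, -np.inf))

With α = 1 the sampling probabilities are uniform (the VAE case), with α = 0 they are the IWAE's normalised importance weights, and α = −∞ degenerates to the arg-max, i.e. VR-max.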

4.3 Stochastic Approximation for Large-scale Learning

So far we discussed the VR bounds computed on the whole dataset D. However, for large datasets full-batch learning will be very inefficient. In the appendix of [Li et al., 2015] the authors discussed stochastic EP as a way of approximating the VR bound optimisation for large-scale learning. Here we propose another stochastic approximation method to enable minibatch training, which applies directly to the VR bound.

Using the notation f_n(\theta) = p(x_n|\theta) and defining the "average likelihood" f_{\mathcal{D}}(\theta) = \left[\prod_{n=1}^{N} f_n(\theta)\right]^{1/N}, the joint distribution can be rewritten as p(\theta, \mathcal{D}) = p_0(\theta) f_{\mathcal{D}}(\theta)^{N}. Now we sample M datapoints S = ...


Page 16: (DL hacks輪読) Variational Inference with Rényi Divergence

Stochastic approximation for large-scale data

¤ The appendix of [Li et al., 2015] discusses stochastic EP for this purpose.

¤ This paper instead proposes another stochastic approximation that admits minibatch training.

¤ Define the "average likelihood" f_D(θ) = [∏_n p(x_n|θ)]^{1/N}, so that p(θ, D) = p_0(θ) f_D(θ)^N; thinking of a minibatch of M datapoints sampled from the data, replace f_D with the minibatch average likelihood, which turns the VR bound into a minibatch objective (see the sketch below).

¤ As the minibatch size grows (M → N), the bias of this approximation shrinks.

¤ Black-box alpha (BB-α) is a special case of this VR bound.
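A minimal sketch (ours) of the resulting minibatch log-weight, under the assumption stated above that f_D(θ) is replaced by the minibatch average likelihood f_S(θ) = [∏_{n∈S} p(x_n|θ)]^{1/M}:

    import numpy as np

    def minibatch_log_weight(log_prior, log_liks, log_q, N):
        # Assumed minibatch surrogate for log p(theta, D) - log q(theta):
        # substituting f_S for f_D gives
        # log w = log p0(theta) + (N/M) * sum_{n in S} log p(x_n|theta) - log q(theta).
        M = len(log_liks)
        return log_prior + (N / M) * np.sum(log_liks) - log_q

    # Toy numbers: N = 10000 datapoints, a minibatch of M = 100 log-likelihoods.
    print(minibatch_log_weight(-3.2, np.full(100, -1.7), -2.5, N=10000))

Feeding such log-weights into the VR bound estimator of eq. (18) gives the minibatch objective; the M → N limit recovers the full-batch bound, consistent with the bias remark above.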

Algorithm 1 one gradient step for VR-↵/VR-max

1: sample ✏1, ..., ✏K ⇠ p(✏)2: for all k = 1, ...,K, and n 2 S the current minibatch, compute the unnormalised weight

log w(✏k

;xn

) = log p(g�

(✏k

),xn

)� log q(g�

(✏k

)|xn

)

3: choose the sample ✏

jn to back-propagate:if |↵| < 1: j

n

⇠ p

k

where p

k

/ w(✏k

;xn

)1�↵

if ↵ = �1: jn

= argmaxk

log w(✏k

;xn

)4: return the gradients to the optimiser

1

|S|X

n2Sr

log w(✏jn ;xn

)

In the following, dependence on ε is occasionally suppressed to reduce the clutter of notations. Now we apply the reparameterization trick to the VR bound:

\[
\mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \frac{1}{1-\alpha} \log \mathbb{E}_{\epsilon}\!\left[ \left( \frac{p(g_\phi(\epsilon), \mathcal{D})}{q(g_\phi(\epsilon))} \right)^{1-\alpha} \right]. \tag{19}
\]

Then the gradient of the VR bound w.r.t. φ is

\[
\nabla_\phi \mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \mathbb{E}_{\epsilon}\!\left[ w_\alpha(\epsilon; \phi, \mathcal{D}) \, \nabla_\phi \log \frac{p(g_\phi(\epsilon), \mathcal{D})}{q(g_\phi(\epsilon))} \right], \tag{20}
\]

where w_α(ε; φ, D) ∝ (p(g_φ(ε), D)/q(g_φ(ε)))^{1−α} denotes the normalised importance weight. For finite samples ε_k ∼ p(ε), k = 1, ..., K, the gradient is approximated by

\[
\nabla_\phi \hat{\mathcal{L}}_{\alpha,K}(q_\phi; \mathcal{D}) = \frac{1}{K} \sum_{k=1}^{K} \hat{w}_{\alpha,k} \, \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \tag{21}
\]

with ŵ_{α,k} short-hand for ŵ_α(ε_k; φ, D), the normalised importance weight with finite samples. One can show that it recovers the stochastic gradients of L_VI by setting α = 1 in (21):

\[
\nabla_\phi \mathcal{L}_{VI}(q_\phi; \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \tag{22}
\]

which means the resulting algorithm unifies the computation for all finite α settings.

To speed up learning, [Burda et al., 2015] suggested back-propagating only one sample ε_j with j ∼ p_j = ŵ_{α,j}, which can easily be extended to our framework. Importantly, the choice of α < 1 controls the degree to which the bound emphasises locations where the approximation q(θ) mismatches p(θ|D). Negative α can also be used according to Section 4.1, and in the extreme case α → −∞ the algorithm chooses the sample with the maximum unnormalised importance weight. We name this approach VR-max and summarise it and the other finite α settings in Algorithm 1 for VAEs. Note that VR-max does not compute the q distribution by the minimum description length (MDL) principle, since MDL approximates the true posterior by minimising the exact VR bound L_{−∞}, which upper-bounds the exact log-likelihood function.
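As a complement, here is a small sketch of the estimators in (19)–(22), under the assumption that `log_w` contains the K log-ratios log[p(g_φ(ε_k), D)/q(g_φ(ε_k))]; the names are illustrative, not from the paper.

import numpy as np

def vr_bound_estimate(log_w, alpha):
    """Monte Carlo estimate of the VR bound (19) from K log-ratios."""
    log_w = np.asarray(log_w, dtype=float)
    if alpha == 1.0:
        return log_w.mean()                      # VI / ELBO limit
    a = (1.0 - alpha) * log_w
    m = a.max()                                  # log-sum-exp trick
    return (m + np.log(np.mean(np.exp(a - m)))) / (1.0 - alpha)

def vr_weights(log_w, alpha):
    """Self-normalised importance weights used in the gradient (21)."""
    a = (1.0 - alpha) * np.asarray(log_w, dtype=float)
    a = a - a.max()
    w = np.exp(a)
    return w / w.sum()

Note the convention: these weights sum to one, so they correspond to ŵ_{α,k}/K in (21); at α = 1 they become uniform (1/K each), which is exactly how the weighted gradient average collapses to the plain Monte Carlo average of (22).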

4.3 Stochastic Approximation for Large-scale Learning

So far we have discussed VR bounds computed on the whole dataset D. For large datasets, however, full-batch learning is very inefficient. In the appendix of [Li et al., 2015] the authors discussed stochastic EP as a way of approximating the VR bound optimisation for large-scale learning. Here we propose another stochastic approximation method that enables minibatch training and applies directly to the VR bound.

Using the notation f_n(θ) = p(x_n | θ) and defining the "average likelihood" f_D(θ) = [∏_{n=1}^N f_n(θ)]^{1/N}, the joint distribution can be rewritten as p(θ, D) = p_0(θ) f_D(θ)^N. Now we sample M datapoints


S = {x_{n_1}, ..., x_{n_M}} ∼ D and define the corresponding "subset average likelihood" f_S(θ) = [∏_{m=1}^M f_{n_m}(θ)]^{1/M}. When M = 1 we also write f_S(θ) = f_n(θ) for S = {x_n}. Then we approximate the VR bound (13) by replacing f_D(θ) with f_S(θ):

\[
\hat{\mathcal{L}}_\alpha(q; S) = \frac{1}{1-\alpha} \log \int q(\theta)^\alpha \left[ p_0(\theta) f_S(\theta)^N \right]^{1-\alpha} d\theta
= \frac{1}{1-\alpha} \log \mathbb{E}_q\!\left[ \left( \frac{p_0(\theta) f_S(\theta)^N}{q(\theta)} \right)^{1-\alpha} \right]. \tag{23}
\]

This returns a stochastic estimate of the evidence lower bound when α → 1. For other settings α ≠ 1, increasing the minibatch size M = |S| reduces the bias of the approximation, as guaranteed by the following theorem, proved in the supplementary material.

Theorem 3. If the approximate distribution q(θ) is Gaussian N(µ, Σ), and the likelihood functions have an exponential family form p(x|θ) = exp[⟨θ, ψ(x)⟩ − A(θ)], then for α ≤ 1 the stochastic approximation is bounded by

\[
\mathbb{E}_S\!\left[ \hat{\mathcal{L}}_\alpha(q; S) \right] \leq \mathcal{L}_{1-(1-\alpha)r}(q; \mathcal{D}) + \frac{N^2 (1-\alpha) r}{2(r-1)} \, \mathrm{tr}(\Sigma \Lambda_S)
\]

for any r > 1, where Λ_S = Cov_{S∼D}(ψ̄_S).

The covariance matrix of the minibatch's sufficient statistic mean ψ̄_S is Λ_S = Cov_D(ψ) for M = 1 and Λ_S = ((N − M)/(NM)) Cov_D(ψ) when M > 1. So for full-batch training (M = N) the approximation is exact, i.e. L̂_α = L_α.
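Below is a minimal Monte Carlo sketch of the minibatch estimate (23), under stated assumptions: samples θ_k ∼ q with their log-densities precomputed, and arrays standing in for log p_0(θ_k) and log p(x_n | θ_k); none of these names come from the paper.

import numpy as np

def vr_bound_minibatch(log_q, log_prior, log_liks, N, alpha):
    """Monte Carlo estimate of the minibatch VR bound (23).

    log_q     : shape (K,), log q(theta_k) for samples theta_k ~ q
    log_prior : shape (K,), log p0(theta_k)
    log_liks  : shape (M, K), log p(x_n | theta_k) for the minibatch S
    N         : full dataset size
    """
    log_liks = np.asarray(log_liks, dtype=float)
    M = log_liks.shape[0]
    log_fS_N = (N / M) * log_liks.sum(axis=0)      # log f_S(theta)^N
    log_ratio = log_prior + log_fS_N - log_q       # log [p0 f_S^N / q]
    if alpha == 1.0:
        return log_ratio.mean()                    # stochastic ELBO
    a = (1.0 - alpha) * log_ratio
    m = a.max()
    return (m + np.log(np.mean(np.exp(a - m)))) / (1.0 - alpha)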

The recently proposed black-box alpha (BB-α) algorithm [Hernandez-Lobato et al., 2015] is a special case of the proposed stochastic approximation method. This can be shown by selecting M = 1, α′ = 1 − α/N and defining q(θ) = (1/Z_q) p_0(θ) f(θ)^N:

\[
\mathcal{L}_{BB\text{-}\alpha}(q; \mathcal{D}) = \frac{1}{\alpha} \sum_{n=1}^{N} \log Z_{\tilde{p}_n} - \left( \frac{N}{\alpha} - 1 \right) \log Z_q
= \mathbb{E}_{\mathcal{D}}\!\left[ \frac{N}{\alpha} \log \int q(\theta)^{1-\frac{\alpha}{N}} \left( p_0(\theta) f_n(\theta)^N \right)^{\frac{\alpha}{N}} d\theta \right]
= \mathbb{E}_{\mathcal{D}}\!\left[ \hat{\mathcal{L}}_{\alpha'}(q; x_n) \right], \tag{24}
\]

where p̃_n(θ) ∝ p_0(θ) f(θ)^{N−α} f_n(θ)^α is the tilted distribution with power α and Z_{p̃_n} is its normalising constant. BB-α was originally proposed to approximate the energy function of (power) EP [Minka, 2001, Minka, 2004], which minimises Amari's α-divergence locally. The new derivation sheds light on the connection between approximate local and global α-divergence minimisation.

BB-α also uses simple Monte Carlo (Section 4.1) to estimate the energy function, and one can show, in the same way as in the proof of Theorem 2, that the Monte Carlo estimate in expectation provides a lower bound on L_{BB-α} when α > 0. In general it remains an open question how to choose α so as to guarantee a lower bound on log p(D) for a given minibatch size M and number of samples K.
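To make the correspondence concrete, here is a hedged sketch reusing the hypothetical vr_bound_minibatch above: the BB-α energy of (24) is, in expectation over datapoints, the M = 1 minibatch VR bound evaluated at the transformed value α′ = 1 − α/N (with q defined as in the derivation).

import numpy as np

def bb_alpha_energy(log_q, log_prior, log_liks_all, alpha_bb):
    """BB-alpha energy via (24): average the M = 1 minibatch VR bound
    over datapoints at alpha' = 1 - alpha/N.

    log_liks_all : shape (N, K), log p(x_n | theta_k) for every datapoint.
    """
    log_liks_all = np.asarray(log_liks_all, dtype=float)
    N = log_liks_all.shape[0]
    alpha_prime = 1.0 - alpha_bb / N
    vals = [vr_bound_minibatch(log_q, log_prior, log_liks_all[n:n + 1],
                               N, alpha_prime)
            for n in range(N)]
    return float(np.mean(vals))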

5 Experiments

We evaluated the proposed family of variational methods on variational auto-encoders and Bayesian neural networks. Unless otherwise stated, we tested VR bounds using α = 0.5, 0 and −∞, and compared to traditional VI (α = 1.0) and BB-α (α ∈ [1 − 1/N, 1]).

5.1 Variational auto-encoder

The first set of experiments considered variational auto-encoders for unsupervised learning. We mainly compared three approaches: the original VAE trained by VI (α = 1.0) [Kingma and Welling, 2014], the importance weighted auto-encoder (IWAE) [Burda et al., 2015], which is equivalent to choosing α = 0 in our framework, and the proposed VR-max (α = −∞) algorithm applied to VAEs.


Page 17: (DL hacks輪読) Variational Inference with Rényi Divergence

Summary of the VR bound

Page 18: (DL hacks輪読) Variational Inference with Rényi Divergence

Experiments
- We evaluate the proposed methods in two settings:
  - Variational auto-encoders
  - Bayesian neural networks

Page 19: (DL hacks輪読) Variational Inference with Rényi Divergence

Experiment 1: variational auto-encoders
- Three methods are compared:
  - VAE (α = 1)
  - IWAE (α = 0)
  - VR-max (α = −∞)
- Evaluation: test log-likelihood estimated with α = 0, K = 5000
- VR-max performs about as well as IWAE
- However, VR-max is faster to train:
  - VR-max 25hr29min vs. IWAE 61hr16min

[Figure 3: Sampled images from the VR-max trained auto-encoders. Panels: (a) Frey Face, (b) Caltech 101 Silhouettes, (c) MNIST.]

Dataset                  L    K     VAE        IWAE       VR-max
Frey Face                1    5     1322.96    1380.30    1377.40
(± std. err.)                       ±10.03     ±4.60      ±4.59
Caltech 101 Silhouettes  1    5     -119.69    -117.89    -118.01
                              50    -119.61    -117.21    -117.10
MNIST                    1    5     -86.47     -85.41     -85.42
                              50    -86.35     -84.80     -84.81
                         2    5     -85.01     -83.92     -84.04
                              50    -84.78     -83.12     -83.44

Table 1: Average test log-likelihood. Results for VAE on MNIST are collected from [Burda et al., 2015]. IWAE results are reproduced using the publicly available implementation.

The VR-max method was implemented upon the publicly available code¹. Note that the original implementation back-propagates all the samples to compute gradients, while VR-max only back-propagates the sample with the largest importance weight.

Three datasets are considered: Frey Face, Caltech 101 Silhouettes and MNIST. The experiments were run with 10-fold cross-validation on the comparably small Frey Face dataset, while the other two were tested only on the provided splits. The VAE model consists of L = 1 or 2 stochastic layers with deterministic layers stacked between them, and the network architecture is described in the supplementary material. We used ADAM [Kingma and Ba, 2015] for optimisation. For MNIST we used the settings from [Burda et al., 2015], including the learning rates, minibatch size and number of epochs. For the other two datasets the learning rates and stopping epochs were tuned on the VI setting. We reproduced the experiments for IWAE to obtain a fair comparison, since the results reported in [Burda et al., 2015] mismatch those evaluated on the publicly available code.

We report the test log-likelihood results in Table 1, computed as log p(x) ≈ L̂_{α,K}(q; x) with α = 0.0, K = 5000 following [Burda et al., 2015]. We also present some samples from the VR-max trained models in Figure 3. Overall, VR-max applied to VAEs performed almost indistinguishably from IWAE on all three datasets. However, VR-max took significantly less time to run than IWAE with a full backward pass. The experiments were run on a machine with a Tesla C2075 GPU, and when trained on MNIST with L = 1 stochastic layer and K = 50 samples, VR-max and IWAE took 25hr29min and 61hr16min, respectively (3280 epochs in total). In this case we also implemented the single-backward-pass version based on the available code, and the test log-likelihood result for IWAE is -85.02, slightly worse than VR-max's performance. These results justify the argument in Section 4.1 that negative α can be used with negligible loss of model quality when computational resources are limited.
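As a side note, the evaluation protocol (log p(x) ≈ L̂_{α,K}(q; x) with α = 0 and large K) is the familiar importance-sampling estimate of the marginal likelihood; a tiny sketch, with `log_w` standing in for the K per-sample log-weights log[p(g_φ(ε_k), x)/q(g_φ(ε_k)|x)]:

import numpy as np

def log_marginal_estimate(log_w):
    """The alpha = 0 VR bound with K samples: logsumexp(log_w) - log K,
    i.e. the IWAE-style estimate of log p(x)."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))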

A conjecture for VR-max's success is that the α value corresponding to the tightest VR bound becomes more negative as the mismatch between the optimal q and the true posterior increases. This is the case for neural networks, because a simple distribution q is fitted to approximate the typically multimodal exact posterior.

¹ https://github.com/yburda/iwae


Page 20: (DL hacks輪読) Variational Inference with Rényi Divergence

Experiment 1: variational auto-encoders
- Sampled images from the trained models (Figure 3)


Page 21: (DL hacks輪読) Variational Inference with Rényi Divergence

Verifying the bias
- Verify that VR-max (and the other sampled bounds) approximates a bound close to the true log-likelihood
- Experiment on the Frey Face data
- As the number of samples K grows, the bias can be seen to shrink

Page 22: (DL hacks輪読) Variational Inference with Rényi Divergence

Experiment 2: Bayesian neural networks
- Datasets are from the UCI repository; see the paper for the model and hyperparameter details
- Compared against VI [Graves, 2011] and PBP [Hernandez-Lobato et al., 2015]
- BB-α=BO tunes the hyperparameters by Bayesian optimisation, so it is not directly comparable

Dataset    VI              PBP             BB-α=BO*        VR-0.5          VR-0.0          VR-max
Boston     -2.903±0.071    -2.574±0.089    -2.549±0.019    -2.457±0.066    -2.468±0.071    -2.469±0.072
Concrete   -3.391±0.017    -3.161±0.019    -3.104±0.015    -3.094±0.016    -3.076±0.018    -3.092±0.018
Energy     -2.391±0.029    -2.042±0.019    -0.945±0.012    -1.401±0.029    -1.418±0.020    -1.389±0.018
Wine       -0.980±0.013    -0.968±0.014    -0.949±0.009    -0.948±0.011    -0.952±0.012    -0.949±0.012
Yacht      -3.439±0.163    -1.634±0.016    -1.102±0.039    -1.816±0.011    -1.829±0.014    -1.817±0.013
Protein    -2.992±0.006    -2.973±0.003    NA              -2.923±0.006    -2.911±0.005    -2.938±0.005
Year       -3.622±NA       -3.603±NA       NA              -3.545±NA       -3.550±NA       -3.542±NA

Table 2: Average test log-likelihood. BB-α=BO results are not directly comparable and are available only for the small datasets.

Dataset    VI              PBP             BB-α=BO*        VR-0.5          VR-0.0          VR-max
Boston     4.320±0.291     3.104±0.180     3.160±0.109     2.853±0.154     2.852±0.169     2.837±0.181
Concrete   7.128±0.123     5.667±0.093     5.374±0.074     5.343±0.102     5.237±0.114     5.280±0.104
Energy     2.646±0.081     1.804±0.048     0.600±0.018     0.807±0.059     0.883±0.050     0.791±0.041
Wine       0.646±0.008     0.635±0.007     0.632±0.005     0.640±0.009     0.638±0.008     0.639±0.009
Yacht      6.887±0.674     1.015±0.054     0.902±0.051     1.111±0.082     1.239±0.109     1.117±0.085
Protein    4.842±0.003     4.732±0.013     NA              4.505±0.033     4.436±0.030     4.574±0.023
Year       9.034±NA        8.879±NA        NA              8.942±NA        9.133±NA        8.949±NA

Table 3: Average test error. BB-α=BO results are not directly comparable and are available only for the small datasets.

…sub-sampling is currently unknown to us. The results also indicate that for posterior approximation problems, in contrast to MLE, the optimal α may vary for different datasets and/or different models.

6 Conclusion

We have presented the variational Rényi bound and its optimisation framework. We have shown the richness of the new family, not only by connecting it to existing approaches including traditional VI/VB, stochastic EP, BB-α, VAE and IWAE, but also by proposing the VR-max algorithm as a new special case. Empirical results on variational auto-encoders and Bayesian neural networks indicate that VR-max and the other settings α = 0.0, 0.5 perform at least as well as, and sometimes better than, the state of the art, as predicted by the theory. Future work will focus on both the experimental and the theoretical side. The proposed class of variational methods should be tested on other popular probabilistic models. Theoretical work will study the interaction of the biases introduced by the Monte Carlo estimate and datapoint sub-sampling. Theoretical and empirical studies on choosing optimal α values will guide practitioners towards the appropriate algorithm for their applications.

References

[Amari, 1985] Amari, S.-i. (1985). Differential-Geometrical Methods in Statistics. Springer, New York.

[Beal, 2003] Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, University College London.

[Blundell et al., 2015] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. In International Conference on Machine Learning.

[Broderick et al., 2013] Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. In NIPS, pages 1727–1735.

[Burda et al., 2015] Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.


Page 23: (DL hacks輪読) Variational Inference with Rényi Divergence

Summary
- Proposed the variational Rényi bound and its optimisation framework
- It not only unifies existing approaches (VI/VB, stochastic EP, BB-α, VAE, IWAE) but also yields the new VR-max algorithm
- Experiments show results on par with, or better than, the existing methods
- Future work will address both the experimental and the theoretical side:
  - Testing on other popular probabilistic models
  - The bias introduced by Monte Carlo estimation and data sub-sampling
  - How to choose α