(DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

Variational Dropout Sparsifies Deep Neural Networks / 2017/03/24 / Masahiro Suzuki


Page 1: (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

Variational Dropout Sparsifies Deep Neural Networks / 2017/03/24 / Masahiro Suzuki

Page 2: (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

About this paper
- Authors: Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov
  - Skolkovo Institute of Science and Technology, National Research University Higher School of Economics, Moscow Institute of Physics and Technology
- Submitted to ICML 2017 (posted to arXiv on 2017/2/27)
- Proposes a method that lets the dropout rate of Bayesian (variational) dropout be pushed all the way to its maximum.
  - The idea itself is extremely simple.
  - It not only yields high sparsity but also alleviates generalization problems of ordinary CNNs.
- Why I chose this paper
  - It is an extension of the paper [Kingma+ 15] that we read in this group two years ago.
  - I like that such a simple idea achieves a large effect.

Page 3: (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

Bayesian inference
- Suppose we observe data $D = (x_n, y_n)_{n=1}^{N}$.
  - The goal is to obtain $p(y \mid x, w)$, i.e. the likelihood $p(D \mid w)$.
- In the Bayesian learning framework we express prior knowledge about the parameters $w$ as a prior $p(w)$.
  - After observing $D$, the posterior over $w$ is

  $p(w \mid D) = \dfrac{p(D \mid w)\,p(w)}{p(D)} = \dfrac{p(D \mid w)\,p(w)}{\int p(D \mid w)\,p(w)\,dw}$

- This procedure is called Bayesian inference.
- Computing the posterior requires the marginalization in the denominator, which is generally intractable -> variational inference (see the numerical sketch below).
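To make the posterior formula concrete, here is a small numerical sketch (my own illustration, not from the slides): Bayesian inference over a single scalar weight on a grid, with an assumed linear-Gaussian toy model and made-up hyperparameters. It also shows why the denominator is the problem: for a high-dimensional $w$ this integral cannot be enumerated, which is what motivates variational inference.

```python
# Minimal sketch (assumed toy model): Bayesian inference for one scalar weight w
# on a grid, i.e. p(w|D) = p(D|w) p(w) / (integral of p(D|w) p(w) dw).
import numpy as np

rng = np.random.default_rng(0)
w_true, noise_std = 2.0, 0.5
x = rng.normal(size=20)
y = w_true * x + noise_std * rng.normal(size=20)       # observed data D = (x_n, y_n)

w_grid = np.linspace(-5.0, 5.0, 1001)                  # candidate values of w
dw = w_grid[1] - w_grid[0]
log_prior = -0.5 * w_grid**2                           # prior p(w) = N(0, 1), up to a constant
log_lik = np.array([np.sum(-0.5 * ((y - w * x) / noise_std) ** 2) for w in w_grid])

log_joint = log_lik + log_prior                        # log of p(D|w) p(w)
post = np.exp(log_joint - log_joint.max())
post /= post.sum() * dw                                # divide by the evidence p(D)

print("posterior mean of w:", (w_grid * post).sum() * dw)
# For a neural network, w has millions of dimensions, so this grid integration
# is impossible; hence variational inference.
```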

Page 4: (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

Variational inference
- Introduce an approximate distribution $q(w \mid \phi)$ and minimize its distance to the true posterior, $D_{KL}[\,q(w \mid \phi)\,\|\,p(w \mid D)\,]$.
- This is equivalent to maximizing the variational lower bound, Eq. (1) of the excerpt below.
- With the reparameterization trick, the variational lower bound becomes differentiable with respect to $\phi$.
- For a minibatch $(x_m, y_m)_{m=1}^{M}$, unbiased estimators of the lower bound and of its gradient are given in Eqs. (3)-(5) below, where $w = f(\phi, \epsilon)$ with $\epsilon \sim p(\epsilon)$.

Variational Dropout Sparsifies Deep Neural Networks (excerpt from the paper, Sections 3.1-3.3)

$\mathcal{L}(\phi) = L_D(\phi) - D_{KL}(q_\phi(w)\,\|\,p(w)) \to \max_{\phi \in \Phi}$   (1)

$L_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}\left[\log p(y_n \mid x_n, w)\right]$   (2)

It consists of two parts, the expected log-likelihood $L_D(\phi)$ and the KL-divergence $D_{KL}(q_\phi(w)\,\|\,p(w))$, which acts as a regularization term.

3.2. Stochastic Variational Inference

In the case of complex models the expectations in (1) and (2) are intractable. Therefore the variational lower bound (1) and its gradients can not be computed exactly. However, it is still possible to estimate them using sampling and optimize the variational lower bound using stochastic optimization.

We follow (Kingma & Welling, 2013) and use the Reparameterization Trick to obtain an unbiased differentiable minibatch-based Monte Carlo estimator of the expected log-likelihood (3). The main idea is to represent the parametric noise $q_\phi(w)$ as a deterministic differentiable function $w = f(\phi, \epsilon)$ of a non-parametric noise $\epsilon \sim p(\epsilon)$. This trick allows us to obtain an unbiased estimate of $\nabla_\phi L_D(q_\phi)$. Here we denote objects from a mini-batch as $(x_m, y_m)_{m=1}^{M}$.

$\mathcal{L}(\phi) \simeq \mathcal{L}^{\mathrm{SGVB}}(\phi) = L_D^{\mathrm{SGVB}}(\phi) - D_{KL}(q_\phi(w)\,\|\,p(w))$   (3)

$L_D(\phi) \simeq L_D^{\mathrm{SGVB}}(\phi) = \frac{N}{M}\sum_{m=1}^{M} \log p(y_m \mid x_m, f(\phi, \epsilon_m))$   (4)

$\nabla_\phi L_D(\phi) \simeq \frac{N}{M}\sum_{m=1}^{M} \nabla_\phi \log p(y_m \mid x_m, f(\phi, \epsilon_m))$   (5)

The Local Reparameterization Trick is another technique that reduces the variance of this gradient estimator even further (Kingma et al., 2015). The idea is to sample separate weight matrices for each data-point inside a mini-batch. It is computationally hard to do it straightforwardly, but it can be done efficiently by moving the noise from weights to activations (Wang & Manning, 2013; Kingma et al., 2015).
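A minimal PyTorch sketch of the SGVB estimator in Eqs. (3)-(5) may help here (this is my own illustration, not the authors' code): it assumes a fully factorized Gaussian $q_\phi(W)$ over the weights of a single linear classifier, a made-up dataset size, and leaves the KL term as a placeholder.

```python
# SGVB estimator of Eqs. (3)-(5), sketched in PyTorch (illustrative, not the authors' code).
import torch
import torch.nn.functional as F

N_DATA = 60000                                               # assumed full-dataset size
theta = torch.randn(784, 10, requires_grad=True)             # variational mean of W
log_sigma = torch.full((784, 10), -3.0, requires_grad=True)  # variational log-std of W

def sample_weights():
    eps = torch.randn_like(theta)                 # non-parametric noise, eps ~ N(0, I)
    return theta + torch.exp(log_sigma) * eps     # w = f(phi, eps), differentiable in phi

def neg_sgvb_bound(x, y, kl_term):
    """Negative of L^SGVB(phi): (N/M) * sum_m log p(y_m | x_m, w) minus the KL term."""
    w = sample_weights()                          # one weight sample per minibatch
    log_lik = -F.cross_entropy(x @ w, y, reduction="sum")
    return -(N_DATA / x.shape[0]) * log_lik + kl_term

x, y = torch.randn(128, 784), torch.randint(0, 10, (128,))   # dummy minibatch (x_m, y_m)
loss = neg_sgvb_bound(x, y, kl_term=torch.tensor(0.0))       # KL placeholder, Eq. (3)
loss.backward()                                              # unbiased gradient, Eq. (5)
```

The local reparameterization trick mentioned above would instead sample the pre-activations per data point; a version of that appears in the layer sketch further below.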

3.3. Variational Dropout

In this section we consider a single fully-connected layer with $I$ input neurons and $O$ output neurons before a non-linearity. We denote the output matrix as $B^{M\times O}$, the input matrix as $A^{M\times I}$ and the weight matrix as $W^{I\times O}$. We index the elements of these matrices as $b_{mj}$, $a_{mi}$ and $w_{ij}$ respectively. Then $B = AW$.

Dropout is one of the most popular regularization methods for deep neural networks. It injects a multiplicative random noise $\Xi$ to the layer input $A$ at each iteration of the training procedure (Hinton et al., 2012).

$B = (A \odot \Xi)W, \quad \text{with } \xi_{mi} \sim p(\xi)$   (6)

The original version of dropout, so-called Bernoulli or Binary Dropout, was presented with $\xi_{mi} \sim \mathrm{Bernoulli}(1-p)$ (Hinton et al., 2012). It means that each element of the input matrix is put to zero with probability $p$, also known as the dropout rate. Later the same authors reported that Gaussian Dropout with continuous noise $\xi_{mi} \sim \mathcal{N}(1, \alpha = \frac{p}{1-p})$ works as well and is similar to Binary Dropout with dropout rate $p$ (Srivastava et al., 2014). It is beneficial to use continuous noise instead of discrete noise because multiplying the inputs by Gaussian noise is equivalent to putting Gaussian noise on the weights. This procedure can be used to obtain a posterior distribution over the model's weights (Wang & Manning, 2013; Kingma et al., 2015). That is, putting multiplicative Gaussian noise $\xi_{ij} \sim \mathcal{N}(1, \alpha)$ on a weight $w_{ij}$ is equivalent to sampling $w_{ij}$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$. Now $w_{ij}$ becomes a random variable parametrized by $\theta_{ij}$.

$w_{ij} = \theta_{ij}\xi_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\epsilon_{ij}) \sim \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2), \qquad \epsilon_{ij} \sim \mathcal{N}(0, 1)$   (7)

Gaussian Dropout training is equivalent to stochastic optimization of the expected log likelihood (2) in the case when we use the reparameterization trick and draw a single sample $W \sim q(W \mid \theta, \alpha)$ per minibatch to estimate the expectation. Variational Dropout extends this technique and explicitly uses $q(W \mid \theta, \alpha)$ as an approximate posterior distribution for a model with a special prior on the weights. The parameters $\theta$ and $\alpha$ of the distribution $q(W \mid \theta, \alpha)$ are tuned via stochastic variational inference, i.e. $\phi = (\theta, \alpha)$ are the variational parameters, as denoted in Section 3.2. The prior distribution $p(W)$ is chosen to be improper log-scale uniform to make Variational Dropout with fixed $\alpha$ equivalent to Gaussian Dropout (Kingma et al., 2015).

$p(\log|w_{ij}|) = \mathrm{const} \;\Leftrightarrow\; p(|w_{ij}|) \propto \frac{1}{|w_{ij}|}$   (8)

In this model, it is the only prior distribution that makes variational inference consistent with Gaussian Dropout (Kingma et al., 2015). When parameter $\alpha$ is fixed, the $D_{KL}(q(W \mid \theta, \alpha)\,\|\,p(W))$ term in the variational lower bound (1) does not depend on $\theta$ (Kingma et al., 2015). Maximization of the variational lower bound (1) then becomes equivalent to maximization of the expected log-likelihood (2) with fixed parameter $\alpha$. It means that Gaussian Dropout training is exactly equivalent to Variational Dropout with fixed $\alpha$. However, Variational Dropout provides a way to train the dropout rate $\alpha$ by optimizing the variational lower bound (1). Interestingly, the dropout rate $\alpha$ now …


Page 5: (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

Dropout
- In a fully connected layer $B = AW$, dropout injects multiplicative random noise $\Xi$ into the input $A$ at every training iteration (Eq. (6) above).
- The noise can be sampled from a Bernoulli distribution or from a Gaussian distribution.
- Putting Gaussian noise on $W$ is equivalent to sampling $W$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$; the random variable $w_{ij}$ is then parametrized by $\theta_{ij}$ as in Eq. (7) above (see the sketch below).
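As a quick illustration of the two noise placements in Eqs. (6)-(7), here is a sketch of my own (arbitrary shapes and dropout rate; not code from the slides or the paper):

```python
# Multiplicative Gaussian noise on the inputs (Eq. 6) vs. on the weights (Eq. 7).
import torch

def gaussian_dropout_inputs(a, p=0.5):
    """B = (A * Xi) W with xi_mi ~ N(1, alpha), alpha = p / (1 - p)."""
    alpha = p / (1.0 - p)
    xi = 1.0 + alpha**0.5 * torch.randn_like(a)
    return a * xi

def gaussian_dropout_weights(theta, alpha=1.0):
    """w_ij = theta_ij * (1 + sqrt(alpha) * eps_ij), i.e. w ~ N(theta, alpha * theta^2)."""
    eps = torch.randn_like(theta)
    return theta * (1.0 + alpha**0.5 * eps)

A = torch.randn(32, 64)       # layer input, shape M x I
W = torch.randn(64, 128)      # weight matrix theta, shape I x O
B_input_noise = gaussian_dropout_inputs(A, p=0.5) @ W          # noise injected into A
B_weight_noise = A @ gaussian_dropout_weights(W, alpha=1.0)    # noise injected into W
# With alpha = p/(1-p), each output element has the same marginal mean and
# variance in both versions, which is the equivalence used above.
```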


Page 6: (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

Variational dropout
- If we regard $q(W \mid \theta, \alpha)$ as an approximate posterior with parameters $(\theta, \alpha)$, these parameters can be computed by variational inference (variational dropout).
- When $\alpha$ is fixed, variational dropout and Gaussian dropout become equivalent, because the KL term is then constant (it does not depend on $\theta$).
- In variational dropout, $\alpha$ is a parameter to be learned! In other words, $\alpha$ can be determined automatically during training.
- However, the earlier work [Kingma+ 2015] restricts $\alpha$ to be at most 1: with too much noise, the variance of the gradients becomes large. Still, it would be more interesting if $\alpha$ could be pushed to infinity (i.e. a dropout rate of 1); see the sketch below.
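Below is a sketch (my own, under stated assumptions) of what this looks like as a layer: $\theta$ and $\log\sigma^2$ are learned, $\log\alpha = \log\sigma^2 - \log\theta^2$ is unbounded, and the forward pass uses the local reparameterization trick. The closed-form KL approximation constants are quoted from the full paper rather than from the excerpt above, so treat them as an assumption; shapes, thresholds, and the dummy loss are illustrative.

```python
# Fully-connected variational dropout layer with a learnable per-weight alpha (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalDropoutLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.theta = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))

    @property
    def log_alpha(self):
        # log alpha_ij = log sigma_ij^2 - log theta_ij^2  (unbounded, no alpha <= 1 clipping)
        return self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8)

    def forward(self, x):
        # Local reparameterization: sample pre-activations b ~ N(mean, var) per data point.
        mean = F.linear(x, self.theta)
        var = F.linear(x ** 2, torch.exp(self.log_sigma2)) + 1e-8
        return mean + var.sqrt() * torch.randn_like(mean)

    def kl(self):
        # Approximate D_KL(q(w | theta, alpha) || log-uniform prior), summed over weights.
        # Constants quoted from the paper's approximation (assumption; not in this excerpt).
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()

layer = VariationalDropoutLinear(784, 300)
x = torch.randn(128, 784)
out = layer(x)                                      # stochastic pre-activations
loss = out.pow(2).mean() + layer.kl() / 60000       # dummy task loss + scaled KL
loss.backward()
print("fraction prunable:", (layer.log_alpha > 3.0).float().mean().item())
```

Weights whose learned $\log\alpha$ grows large carry almost pure noise and can be set to zero after training, which is where the sparsification in the title comes from.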

Variational Dropout Sparsifies Deep Neural Networks

L(�) = LD(�)�DKL(q�(w) k p(w)) ! max

�2�(1)

LD(�) =NX

n=1

Eq�(w)[log p(yn |xn, w)] (2)

It consists of two parts, the expected log-likelihood LD(�)and the KL-divergence DKL(q�(w) k p(w)), which acts asa regularization term.

3.2. Stochastic Variational Inference

In the case of complex models expectations in (1) and (2)are intractable. Therefore the variational lower bound (1)and its gradients can not be computed exactly. However, itis still possible to estimate them using sampling and opti-mize the variational lower bound using stochastic optimiza-tion.

We follow (Kingma & Welling, 2013) and use the Repa-rameterization Trick to obtain an unbiased differentiableminibatch-based Monte Carlo estimator of the expectedlog-likelihood (3). The main idea is to represent the para-metric noise q�(w) as a deterministic differentiable func-tion w = f(�, ✏) of a non-parametric noise ✏ s p(✏).This trick allows us to obtain an unbiased estimate ofr�LD(q�). Here we denote objects from a mini-batch as(xm, ym)

Mm=1.

L(�)'LSGVB(�)=LSGVB

D (�)�DKL(q�(w)kp(w)) (3)

LD(�)'LSGVBD (�)=

N

M

MX

m=1

log p(ym|xm, f(�, ✏m)) (4)

r�LD(�)'N

M

MX

m=1

r� log p(ym|xm, f(�, ✏m)) (5)

The Local Reparameterization Trick is another techniquethat reduces the variance of this gradient estimator even fur-ther (Kingma et al., 2015). The idea is to sample separateweight matrices for each data-point inside mini-batch. It iscomputationally hard to do it straight-forwardly, but it canbe done efficiently by moving the noise from weights toactivations (Wang & Manning, 2013; Kingma et al., 2015).

3.3. Variational Dropout

In this section we consider a single fully-connected layerwith I input neurons and O output neurons before a non-linearity. We denote an output matrix as BM⇥O, input ma-trix as AM⇥I and a weight matrix as W I⇥O. We indexthe elements of these matrices as bmj , ami and wij respec-tively. Then B = AW .

Dropout is one of the most popular regularization methodsfor deep neural networks. It injects a multiplicative random

noise ⌅ to the layer input A at each iteration of trainingprocedure (Hinton et al., 2012).

B = (A� ⌅)W, with ⇠mi s p(⇠) (6)

The original version of dropout, so-called Bernoulli or Bi-nary Dropout, was presented with ⇠mi s Bernoulli(1� p)(Hinton et al., 2012). It means that each element of the in-put matrix is put to zero with probability p, also knownas a dropout rate. Later the same authors reported thatGaussian Dropout with continuous noise ⇠mi s N (1,↵ =

Gaussian Dropout with continuous noise $\xi_{mi} \sim \mathcal{N}(1, \alpha = \tfrac{p}{1-p})$ works as well and is similar to Binary Dropout with dropout rate $p$ (Srivastava et al., 2014). It is beneficial to use continuous noise instead of discrete noise because multiplying the inputs by a Gaussian noise is equivalent to putting Gaussian noise on the weights. This procedure can be used to obtain a posterior distribution over the model's weights (Wang & Manning, 2013; Kingma et al., 2015). That is, putting multiplicative Gaussian noise $\xi_{ij} \sim \mathcal{N}(1, \alpha)$ on a weight $w_{ij}$ is equivalent to sampling $w_{ij}$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$. Now $w_{ij}$ becomes a random variable parametrized by $\theta_{ij}$.

$$w_{ij} = \theta_{ij}\xi_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\epsilon_{ij}) \sim \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2), \qquad \epsilon_{ij} \sim \mathcal{N}(0, 1) \qquad (7)$$

Gaussian Dropout training is equivalent to stochastic optimization of the expected log likelihood (2) in the case when we use the reparameterization trick and draw a single sample $W \sim q(W \mid \theta, \alpha)$ per minibatch to estimate the expectation. Variational Dropout extends this technique and explicitly uses $q(W \mid \theta, \alpha)$ as an approximate posterior distribution for a model with a special prior on the weights. The parameters $\theta$ and $\alpha$ of the distribution $q(W \mid \theta, \alpha)$ are tuned via stochastic variational inference, i.e. $\phi = (\theta, \alpha)$ are the variational parameters, as denoted in Section 3.2. The prior distribution $p(W)$ is chosen to be an improper log-scale uniform, which makes Variational Dropout with fixed $\alpha$ equivalent to Gaussian Dropout (Kingma et al., 2015).

$$p(\log|w_{ij}|) = \mathrm{const} \;\Leftrightarrow\; p(|w_{ij}|) \propto \frac{1}{|w_{ij}|} \qquad (8)$$

In this model, it is the only prior distribution that makes variational inference consistent with Gaussian Dropout (Kingma et al., 2015). When the parameter $\alpha$ is fixed, the $D_{KL}(q(W \mid \theta, \alpha) \,\|\, p(W))$ term in the variational lower bound (1) does not depend on $\theta$ (Kingma et al., 2015). Maximization of the variational lower bound (1) then becomes equivalent to maximization of the expected log-likelihood (2) with fixed $\alpha$. It means that Gaussian Dropout training is exactly equivalent to Variational Dropout with fixed $\alpha$. However, Variational Dropout provides a way to train the dropout rate $\alpha$ by optimizing the variational lower bound (1).
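To make the reparameterization in Eq. (7) concrete, here is a minimal NumPy sketch (ours, not the authors' code; the weight value, the dropout parameter and the sample count are arbitrary) checking that multiplying a weight θ by ξ ∼ N(1, α) produces samples distributed as N(θ, αθ²):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, alpha = 0.7, 2.0          # an arbitrary weight and dropout parameter
n = 1_000_000

# Multiplicative-noise view: w = theta * xi, with xi ~ N(1, alpha)
xi = rng.normal(loc=1.0, scale=np.sqrt(alpha), size=n)
w_mult = theta * xi

# Direct view of Eq. (7): w ~ N(theta, alpha * theta^2)
w_direct = rng.normal(loc=theta, scale=np.sqrt(alpha) * abs(theta), size=n)

print(w_mult.mean(), w_mult.var())      # both close to theta and alpha * theta^2
print(w_direct.mean(), w_direct.var())
```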


Page 7: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

Additive Noise Reparameterization

¤ In the gradient of the lower bound (Eq. (9)), the second factor becomes noisier as α grows.

¤ We therefore rewrite the expression as follows (the additive parameterization of Eq. (11)).

¤ Then ∂w_ij/∂θ_ij = 1, so the variance of the gradient can be reduced drastically!

¤ This makes it possible to set α as large as ∞.

Interestingly, the dropout rate α now becomes a variational parameter and not a hyperparameter. In theory, it allows us to train individual dropout rates α_ij for each layer, neuron or even weight (Kingma et al., 2015). However, no experimental results concerning the training of individual dropout rates were reported in the original paper. Also, the approximate posterior family was manually restricted to the case α ≤ 1.

4. Sparse Variational Dropout

In the original paper, the authors reported difficulties in training the model with large values of the dropout rate α (Kingma et al., 2015) and only considered the case α ≤ 1, which corresponds to a binary dropout rate p ≤ 0.5. However, the case of large α_ij is very exciting (here we mean a separate α_ij per weight). A high dropout rate α_ij → +∞ corresponds to a binary dropout rate that approaches p = 1. It effectively means that the corresponding weight or neuron is always ignored and can be removed from the model. In this work, we consider the case of an individual α_ij for each weight of the model.

4.1. Additive Noise Reparameterization

Training neural networks with Variational Dropout is difficult when the dropout rates α_ij are large because of a huge variance of the stochastic gradients (Kingma et al., 2015). The cause of the large gradient variance is the multiplicative noise. To see it clearly, we can rewrite the gradient of L^SGVB w.r.t. θ_ij as follows:

$$\frac{\partial L^{SGVB}}{\partial \theta_{ij}} = \frac{\partial L^{SGVB}}{\partial w_{ij}} \cdot \frac{\partial w_{ij}}{\partial \theta_{ij}} \qquad (9)$$

In the case of the original parameterization (θ, α) the second multiplier in (9) is very noisy if α_ij is large:

$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}), \qquad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}(0, 1) \qquad (10)$$

We propose a trick that allows us to drastically reduce the variance of this term in the case when α_ij is large. The idea is to replace the multiplicative noise term 1 + √α_ij · ε_ij with an exactly equivalent additive noise term σ_ij · ε_ij, where σ²_ij = α_ij θ²_ij is treated as a new independent variable. After this trick we will optimize the variational lower bound w.r.t. (θ, σ). However, we will still use α throughout the paper, as it has a nice interpretation as a dropout rate.

$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}) = \theta_{ij} + \sigma_{ij}\,\epsilon_{ij}, \qquad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1, \qquad \epsilon_{ij} \sim \mathcal{N}(0, 1) \qquad (11)$$

From (11) we can see that ∂w_ij/∂θ_ij now has no injected noise, but the distribution over w_ij ∼ q(w_ij | θ_ij, σ²_ij) remains exactly the same. The objective function and the posterior approximating family are unaltered. The only thing that changed is the parameterization of the approximate posterior. However, the variance of a stochastic gradient is greatly reduced. Using this trick, we avoid the problem of large gradient variance and can train the model within the full range of α_ij ∈ (0, +∞).

It should be noted that the Local Reparameterization Trick does not depend on parameterization, so it can also be applied here to reduce the variance even further. In our experiments, we use both Additive Noise Reparameterization and the Local Reparameterization Trick. We provide the final expressions for the outputs of fully-connected and convolutional layers for our model in Section 4.4.
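The effect of Section 4.1 can be seen directly in a small PyTorch sketch (our illustration, not the authors' code; the value of α is an arbitrary choice): under the multiplicative parameterization (10) the derivative ∂w/∂θ is itself a random variable with variance α, while under the additive parameterization (11) it is deterministic.

```python
import torch

torch.manual_seed(0)
alpha = torch.tensor(25.0)   # a large dropout rate, where the effect is pronounced
theta = torch.tensor(1.0, requires_grad=True)
sigma = (alpha.sqrt() * theta).detach()   # sigma^2 = alpha * theta^2, a new independent variable

def dw_dtheta(parameterization):
    eps = torch.randn(())
    if parameterization == "multiplicative":      # Eq. (10)
        w = theta * (1 + alpha.sqrt() * eps)
    else:                                         # Eq. (11), additive
        w = theta + sigma * eps
    (grad,) = torch.autograd.grad(w, theta)
    return grad.item()

g_mult = torch.tensor([dw_dtheta("multiplicative") for _ in range(10_000)])
g_add = torch.tensor([dw_dtheta("additive") for _ in range(10_000)])
print("Var[dw/dtheta], multiplicative:", g_mult.var().item())   # close to alpha (here ~25)
print("Var[dw/dtheta], additive:      ", g_add.var().item())    # 0: no injected noise
```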

4.2. Approximation of the KL Divergence

As the prior and the approximate posterior are fully factorized, the full KL-divergence term in the lower bound (1) can be decomposed into a sum:

$$D_{KL}(q(W \mid \theta, \alpha) \,\|\, p(W)) = \sum_{ij} D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \qquad (12)$$

The log-scale uniform prior distribution is an improper prior, so the KL divergence can only be calculated up to an additive constant C (Kingma et al., 2015):

$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) = \frac{1}{2}\log\alpha_{ij} - \mathbb{E}_{\epsilon \sim \mathcal{N}(1,\alpha_{ij})}\log|\epsilon| + C \qquad (13)$$

In the Variational Dropout model this term is intractable, as the expectation E_{ε∼N(1,α_ij)} log|ε| in (13) cannot be computed analytically (Kingma et al., 2015). However, this term can be sampled and then approximated. Two different approximations were provided in the original paper; however, they are accurate only for small values of the dropout rate α (α ≤ 1). We propose another approximation (14) that is tight for all values of alpha. Here σ(·) denotes the sigmoid function. Different approximations and the true value of −D_KL are presented in Fig. 1. The true −D_KL was obtained by averaging over 10^7 samples of ε, with less than 2 × 10^−3 variance of the estimate.

$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \approx k_1\,\sigma(k_2 + k_3\log\alpha_{ij}) - 0.5\log(1 + \alpha_{ij}^{-1}) + C,$$
$$k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695 \qquad (14)$$

We used the following intuition to obtain this formula. The negative KL-divergence goes to a constant as log α_ij goes to infinity, and tends to 0.5 log α_ij as log α_ij goes to minus infinity.
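A minimal sketch of the approximation (14) as a reusable function (ours, not the authors' code; the additive constant C is left as an argument, since the KL is only defined up to a constant):

```python
import numpy as np

K1, K2, K3 = 0.63576, 1.87320, 1.48695

def neg_kl_approx(log_alpha, C=0.0):
    """Approximate -KL(q(w | theta, alpha) || p(w)) per weight, Eq. (14)."""
    sigmoid = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_alpha)))
    return K1 * sigmoid - 0.5 * np.log1p(np.exp(-log_alpha)) + C

print(neg_kl_approx(np.array([-5.0, 0.0, 5.0])))
```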





(Annotation on Eq. (10): as α becomes large, this noise term also becomes large.)


Page 8: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

About the KL term ¤ The approximation of the KL (regularization) term proposed in [Kingma+15] only covers the case α ≤ 1.

¤ This work proposes an approximation of the KL term that is applicable for all values of α.


Figure 1. Different approximations of the KL divergence: the blue and green ones (Kingma et al., 2015) are tight only for α ≤ 1; the black one is the true value, estimated by sampling; the red one is our approximation.

We model this behaviour with −0.5 log(1 + α_ij^{-1}). We found that the remainder −D_KL + 0.5 log(1 + α_ij^{-1}) looks very similar to a sigmoid function of log α_ij, so we fit its linear transformation k_1 σ(k_2 + k_3 log α_ij) to this curve. We observe that this approximation is extremely accurate (less than 0.009 maximum absolute deviation on the full range of log α_ij ∈ (−∞, +∞); the original approximation (Kingma et al., 2015) has 0.04 maximum absolute deviation on log α_ij ∈ (−∞, 0]).

One should notice that as α approaches infinity, the KL-divergence approaches a constant. As in this model the KL-divergence is defined up to an additive constant, it is convenient to choose C = −k_1 so that the KL-divergence goes to zero when α goes to infinity. This allows us to compare values of L^SGVB for neural networks of different sizes.
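As a quick check of the fit described above (our sketch, using far fewer samples than the 10^7 used for Fig. 1), one can estimate the expectation in (13) by sampling and compare it with the approximation (14) under the convention C = −k1:

```python
import numpy as np

K1, K2, K3 = 0.63576, 1.87320, 1.48695
rng = np.random.default_rng(0)

def neg_kl_approx(log_alpha):
    # Eq. (14) with C = -k1, so that -KL -> 0 as alpha -> infinity
    sigm = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_alpha)))
    return K1 * sigm - 0.5 * np.log1p(np.exp(-log_alpha)) - K1

def neg_kl_sampled(log_alpha, n=200_000):
    # Eq. (13): -KL = 0.5 * log(alpha) - E_{eps ~ N(1, alpha)} log|eps| + C, by Monte Carlo
    alpha = np.exp(log_alpha)
    eps = rng.normal(1.0, np.sqrt(alpha), size=n)
    return 0.5 * log_alpha - np.mean(np.log(np.abs(eps))) - K1

for la in (-4.0, -1.0, 0.0, 2.0, 6.0):
    print(f"log_alpha={la:+.1f}  approx={neg_kl_approx(la):+.4f}  sampled={neg_kl_sampled(la):+.4f}")
```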

4.3. Sparsity

From Fig. 1 one can see that the −D_KL term increases with the growth of α. It means that this regularization term favors large values of α.

The case of α_ij → ∞ corresponds to a Binary Dropout rate p_ij → 1 (recall α = p/(1−p)). Intuitively it means that the corresponding weight is almost always dropped from the model. Therefore its value does not influence the model during the training phase and is put to zero during the testing phase.

We can also look at this situation from another angle. Infinitely large α_ij corresponds to infinitely large multiplicative noise in w_ij. It means that the value of this weight will be completely random and its magnitude will be unbounded. It will corrupt the model prediction and decrease the expected log likelihood. Therefore it is beneficial to put the corresponding weight θ_ij to zero in such a way that α_ij θ²_ij goes to zero as well. It means that q(w_ij | θ_ij, α_ij) is effectively a delta function, centered at zero, δ(w_ij):

$$\theta_{ij} \to 0, \quad \alpha_{ij}\theta_{ij}^2 \to 0 \;\Rightarrow\; q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \to \mathcal{N}(w_{ij} \mid 0, 0) = \delta(w_{ij}) \qquad (15)$$
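In practice, sparsification therefore reduces to thresholding the learned dropout rates: since σ²_ij = α_ij θ²_ij, one recovers log α_ij = log σ²_ij − log θ²_ij and zeroes out weights whose dropout rate is too large. A minimal sketch (ours; the threshold log α = 3, i.e. a binary dropout rate of roughly 0.95, follows the experimental setup described later in these slides):

```python
import numpy as np

def sparsify(theta, log_sigma2, threshold=3.0):
    """Zero out weights with log(alpha) = log(sigma^2) - log(theta^2) above the threshold."""
    log_alpha = log_sigma2 - np.log(theta ** 2 + 1e-8)
    mask = log_alpha <= threshold          # keep only weights with a small dropout rate
    return theta * mask, mask

theta = np.array([0.5, 1e-3, -2.0])
log_sigma2 = np.array([-3.0, 2.0, -1.0])
pruned, mask = sparsify(theta, log_sigma2)
print(pruned, "sparsity:", 1 - mask.mean())
```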

In the case of linear regression this fact can be shown analytically. We denote a data matrix as X ∈ R^{N×D} and α, θ ∈ R^D. If α is fixed, the optimal value of θ can also be obtained in a closed form:

$$\theta = \left(X^\top X + \mathrm{diag}(X^\top X)\,\mathrm{diag}(\alpha)\right)^{-1} X^\top y \qquad (16)$$

Assume that (X^⊤X)_ii ≠ 0, so that the i-th feature is not a constant zero. Then from (16) it follows that θ_i = Θ(α_i^{-1}) when α_i → +∞, so both θ_i and α_i θ_i² tend to 0.
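A small NumPy sketch of the closed-form solution (16) on synthetic data (our illustration; the data and the choice of which α_i to grow are arbitrary), showing that θ_i shrinks towards zero as the corresponding α_i grows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def theta_closed_form(X, y, alpha):
    # Eq. (16): theta = (X^T X + diag(X^T X) diag(alpha))^{-1} X^T y
    XtX = X.T @ X
    A = XtX + np.diag(np.diag(XtX) * alpha)
    return np.linalg.solve(A, X.T @ y)

for a in (0.0, 1.0, 100.0, 1e6):
    alpha = np.array([0.0, a, 0.0])      # grow the dropout rate of the second feature only
    print(a, theta_closed_form(X, y, alpha))
```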

4.4. Sparse Variational Dropout for Fully-Connected and Convolutional Layers

Finally, we optimize the stochastic gradient variational lower bound (3) with our approximation of the KL-divergence (14). We apply Sparse Variational Dropout to both convolutional and fully-connected layers. To reduce the variance of L^SGVB we use a combination of the Local Reparameterization Trick and Additive Noise Reparameterization. In order to improve convergence, optimization is performed w.r.t. (θ, log σ²).

For a fully connected layer we use the same notation as in Section 3.3. In this case, Sparse Variational Dropout with the Local Reparameterization Trick and Additive Noise Reparameterization can be computed as follows:

$$b_{mj} \sim \mathcal{N}(\gamma_{mj}, \delta_{mj}), \qquad \gamma_{mj} = \sum_{i=1}^{I} a_{mi}\theta_{ij}, \qquad \delta_{mj} = \sum_{i=1}^{I} a_{mi}^2\sigma_{ij}^2 \qquad (17)$$
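A PyTorch-style sketch of a fully-connected Sparse VD layer implementing Eq. (17) (ours, not the authors' Theano reference code; the initialization, the small variance floor and the test-time threshold are our choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseVDLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))

    @property
    def log_alpha(self):
        # log(alpha) = log(sigma^2) - log(theta^2)
        return self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8)

    def forward(self, a):
        if self.training:
            # Eq. (17): local reparameterization trick, sample pre-activations b ~ N(gamma, delta)
            gamma = F.linear(a, self.theta)
            delta = F.linear(a ** 2, self.log_sigma2.exp()) + 1e-8
            return gamma + delta.sqrt() * torch.randn_like(gamma)
        # At test time, drop weights with a large dropout rate (see Section 4.3)
        mask = (self.log_alpha <= 3.0).float()
        return F.linear(a, self.theta * mask)
```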

Now consider a convolutional layer. Take a single input tensor A_m of size H×W×C, a single filter w_k of size h×w×C and the corresponding output matrix b_mk of size H′×W′. This filter has corresponding variational parameters θ_k and σ_k, each of size h×w×C. Note that in this case A_m, θ_k and σ_k are tensors. Because of the linearity of convolutional layers, it is possible to apply the Local Reparameterization Trick. Sparse Variational Dropout for convolutional layers can then be expressed in a way similar to (17). Here (·)² is an element-wise operation, ∗ denotes the convolution operation, and vec(·) denotes reshaping of a matrix/tensor into a vector.

$$\mathrm{vec}(b_{mk}) \sim \mathcal{N}(\gamma_{mk}, \delta_{mk}), \qquad \gamma_{mk} = \mathrm{vec}(A_m * \theta_k), \qquad \delta_{mk} = \mathrm{diag}(\mathrm{vec}(A_m^2 * \sigma_k^2)) \qquad (18)$$

These formulae can be used for the implementation of Sparse Variational Dropout layers. We will provide a reference implementation using Theano (Bergstra et al., 2010).
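The convolutional case (18) can be sketched the same way (again our PyTorch illustration, not the reference implementation; names and defaults are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseVDConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, padding=0):
        super().__init__()
        shape = (out_channels, in_channels, kernel_size, kernel_size)
        self.theta = nn.Parameter(torch.randn(shape) * 0.01)
        self.log_sigma2 = nn.Parameter(torch.full(shape, -10.0))
        self.padding = padding

    @property
    def log_alpha(self):
        return self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8)

    def forward(self, a):
        if self.training:
            # Eq. (18): convolve to get the mean and variance of the output, then sample
            mean = F.conv2d(a, self.theta, padding=self.padding)
            var = F.conv2d(a ** 2, self.log_sigma2.exp(), padding=self.padding) + 1e-8
            return mean + var.sqrt() * torch.randn_like(mean)
        mask = (self.log_alpha <= 3.0).float()
        return F.conv2d(a, self.theta * mask, padding=self.padding)
```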

Page 9: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

Computing Sparse Variational Dropout ¤ When optimizing the lower bound, the Local Reparameterization Trick [Kingma+15] is applied on top of the proposed Additive Noise Reparameterization to further reduce the variance.

¤ For the Local Reparameterization Trick, see the earlier reading-group slides.

¤ It is applicable not only to fully-connected layers but also to convolutional layers.



Page 10: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

Experimental setup ¤ α is restricted to log α ≤ 3 (i.e. a dropout rate of at most 0.95).

¤ The network is first pre-trained without the proposed method. ¤ Without pre-training, the sparsity level is high but the accuracy drops. ¤ This appears to be a common issue with Bayesian DNNs. ¤ Fine-tuning from the pre-trained network takes only about 10-30 epochs.

¤ See the paper for the other settings. (A rough sketch of the resulting training step is shown below.)
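A rough sketch of how these pieces combine in one training step (our reconstruction under the stated settings, not the released code): the data term of the SGVB objective (3) is rescaled by N/M, the KL approximation (14) with C = −k1 is summed over all weights, log α is clipped at 3, and the result is maximized with Adam. The layer attribute log_alpha and the model/data names are placeholders (e.g. layers like the SparseVDLinear sketch above).

```python
import torch
import torch.nn.functional as F

K1, K2, K3 = 0.63576, 1.87320, 1.48695

def neg_kl(log_alpha):
    # Approximation (14) with C = -k1, summed over all weights of one layer
    la = log_alpha.clamp(max=3.0)   # clip log(alpha) at 3 (dropout rate ~0.95)
    return (K1 * torch.sigmoid(K2 + K3 * la) - 0.5 * F.softplus(-la) - K1).sum()

def training_step(model, x, y, optimizer, dataset_size):
    optimizer.zero_grad()
    logits = model(x)
    # Expected log-likelihood (4): minibatch mean rescaled by the dataset size N
    data_term = -F.cross_entropy(logits, y, reduction="mean") * dataset_size
    kl_term = sum(neg_kl(layer.log_alpha) for layer in model.modules()
                  if hasattr(layer, "log_alpha"))
    elbo = data_term + kl_term          # Eq. (3): the stochastic variational lower bound
    (-elbo).backward()                  # maximize the ELBO by minimizing its negative
    optimizer.step()
    return elbo.item()
```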

Page 11: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

Evaluation of Additive Noise Reparameterization

¤ Check whether Additive Noise Reparameterization actually reduces the variance. ¤ Compare sparsity and the quality of the lower bound against training without the proposed reparameterization.

Figure 2. Original parameterization vs. Additive Noise Reparameterization. Additive Noise Reparameterization leads to much faster convergence, a better value of the variational lower bound and a higher sparsity level.

The number of epochs required for convergence from a random initialization is roughly the same as for the original network. However, we only need a small number of epochs (10-30) in order for our method to converge from a pre-trained network.

We train all networks using Adam (Kingma & Ba, 2014). When we start from a random initialization, we train for 200 epochs and linearly decay the learning rate from 10^-4 to zero. When we start from a pre-trained model, we fine-tune for 10-30 epochs with learning rate 10^-5.

5.2. Variance reduction

To see how Additive Noise Reparameterization reduces the variance, we compare it with the original parameterization. We used a fully-connected architecture with two layers of 1000 neurons each. Both models were trained with identical random initializations and with the same learning rate, equal to 10^-4. We did not rescale the KL term during training. It is interesting that the original version of Variational Dropout with our approximation of the KL-divergence and with no restriction on alphas also provides a sparse solution. However, our method has a much better convergence rate and provides higher sparsity and a better value of the variational lower bound, as shown in Fig. 2.

5.3. LeNet-300-100 and LeNet5 on MNIST

We compare our method with other methods of training sparse neural networks on the MNIST dataset using a fully-connected architecture LeNet-300-100 and a convolutional architecture LeNet-5-Caffe¹.

¹ A modified version of LeNet5 from (LeCun et al., 1998). Caffe model specification: https://goo.gl/4yI3dL

Table 1. Comparison of different sparsity-inducing techniques (Pruning (Han et al., 2015b;a), DNS (Guo et al., 2016), SWS (Ullrich et al., 2017)) on LeNet architectures. Our method provides the highest level of sparsity with a similar accuracy.

Network          Method             Error %   Sparsity per Layer %     |W| / |W≠0|
LeNet-300-100    Original           1.64                               1
                 Pruning            1.59      92.0 - 91.0 - 74.0       12
                 DNS                1.99      98.2 - 98.2 - 94.5       56
                 SWS                1.94                               23
                 Sparse VD (ours)   1.92      98.9 - 97.2 - 62.0       68
LeNet-5-Caffe    Original           0.80                               1
                 Pruning            0.77      34 - 88 - 92.0 - 81      12
                 DNS                0.91      86 - 97 - 99.3 - 96      111
                 SWS                0.97                               200
                 Sparse VD (ours)   0.75      67 - 98 - 99.8 - 95      280

These networks were trained from a random initialization and without data augmentation. We consider Pruning (Han et al., 2015b;a), Dynamic Network Surgery (Guo et al., 2016) and Soft Weight Sharing (Ullrich et al., 2017). In these architectures, our method achieves a state-of-the-art level of sparsity, while its accuracy is comparable to other methods. The comparison with other techniques is shown in Table 1. It should be noted that we only consider the level of sparsity and not the final compression ratio.
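As a rough sanity check on the |W| / |W≠0| column (our arithmetic, assuming the standard LeNet-300-100 weight shapes 784x300, 300x100 and 100x10 and ignoring biases), the per-layer sparsities reported for Sparse VD land close to the 68x in Table 1:

```python
layer_sizes = [784 * 300, 300 * 100, 100 * 10]
kept_fraction = [1 - 0.989, 1 - 0.972, 1 - 0.620]   # from the "Sparsity per Layer" column

total = sum(layer_sizes)
nonzero = sum(n * f for n, f in zip(layer_sizes, kept_fraction))
print(total / nonzero)   # roughly 70, in the same ballpark as the reported 68x
```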

5.4. VGG-like on CIFAR-10 and CIFAR-100

To demonstrate that our method scales to large modern architectures, we apply it to a VGG-like network (Zagoruyko, 2015) adapted for the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset. The network consists of 13 convolutional and two fully-connected layers, each layer followed by pre-activation batch normalization and Binary Dropout. We experiment with different sizes of this architecture by scaling the number of units in each network by k ∈ {0.25, 0.5, 1.0, 1.5}. We use CIFAR-10 and CIFAR-100 for evaluation. The reported error of this architecture on the CIFAR-10 dataset with k = 1 is 7.55%. As no pre-trained weights are available, we train our own network and achieve 7.3% error. Sparse VD also achieves 7.3% error for k = 1, but retains 48× fewer weights. The results are presented in Fig. 3.

We observe underfitting while training our model from a random initialization, so we initialize it with a network pre-trained with Binary Dropout and L2 regularization. It should be noted that most modern DNN compression techniques can also be applied only to pre-trained networks, and they work best with networks trained with L2 regularization (Han et al., 2015b).

Our method achieves over 65× sparsification on the CIFAR-10 dataset with no accuracy drop and up to 41× sparsification on CIFAR-100 with a moderate accuracy drop.

The proposed method becomes sparse more quickly.
The lower bound of the proposed method converges faster.

Page 12: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

MNIST ¤ Train LeNet on MNIST.

¤ LeNet-300-100 (fully-connected) and LeNet-5-Caffe (convolutional). ¤ Compared with Pruning [Han+ 15], Dynamic Network Surgery [Guo+ 16], and Soft Weight Sharing [Ullrich+ 17].

Variational Dropout Sparsifies Deep Neural Networks

Figure 2. Original parameterization vs Additive Noise Reparam-eterization. Additive Noise Reparameterization leads to a muchfaster convergence, a better value of the variational lower boundand a higher sparsity level.

epochs required for convergence from a random initializa-tion is roughly the same as for the original network. How-ever, we only need to make a small number of epochs (10-30) in order for our method to converge from a pre-trainednetwork.

We train all networks using Adam (Kingma & Ba, 2014).When we start from a random initialization, we train for200 epochs and linearly decay the learning rate from 10

�4

to zero. When we start from a pre-trained model, we fine-tune for 10-30 epochs with learning rate 10

�5.

5.2. Variance reduction

To see how Additive Noise Reparameterization reduces thevariance, we compare it with the original parameteriza-tion. We used a fully-connected architecture with two lay-ers with 1000 neurons each. Both models were trained withidentical random initializations and with the same learningrate, equal to 10

�4. We did not rescale the KL term duringtraining. It is interesting that the original version of Vari-ational Dropout with our approximation of KL-divergenceand with no restriction on alphas also provides a sparse so-lution. However, our method has much better convergencerate and provides higher sparsity and a better value of thevariational lower bound, as shown in Fig. 2.

5.3. LeNet-300-100 and LeNet5 on MNIST

We compare our method with other methods of training sparse neural networks on the MNIST dataset using a fully-connected architecture, LeNet-300-100, and a convolutional architecture, LeNet-5-Caffe¹. These networks were trained

¹ A modified version of LeNet-5 from (LeCun et al., 1998). Caffe model specification: https://goo.gl/4yI3dL

Table 1. Comparison of different sparsity-inducing techniques (Pruning (Han et al., 2015b;a), DNS (Guo et al., 2016), SWS (Ullrich et al., 2017)) on LeNet architectures. Our method provides the highest level of sparsity with a similar accuracy.

Network       | Method           | Error % | Sparsity per layer % | |W| / |W≠0|
--------------|------------------|---------|----------------------|------------
LeNet-300-100 | Original         | 1.64    | —                    | 1
              | Pruning          | 1.59    | 92.0 - 91.0 - 74.0   | 12
              | DNS              | 1.99    | 98.2 - 98.2 - 94.5   | 56
              | SWS              | 1.94    | —                    | 23
              | Sparse VD (ours) | 1.92    | 98.9 - 97.2 - 62.0   | 68
LeNet-5-Caffe | Original         | 0.80    | —                    | 1
              | Pruning          | 0.77    | 34 - 88 - 92.0 - 81  | 12
              | DNS              | 0.91    | 86 - 97 - 99.3 - 96  | 111
              | SWS              | 0.97    | —                    | 200
              | Sparse VD (ours) | 0.75    | 67 - 98 - 99.8 - 95  | 280

from a random initialization and without data augmentation. We consider Pruning (Han et al., 2015b;a), Dynamic Network Surgery (Guo et al., 2016) and Soft Weight Sharing (Ullrich et al., 2017). In these architectures, our method achieves a state-of-the-art level of sparsity, while its accuracy is comparable to other methods. The comparison with other techniques is shown in Table 1. It should be noted that we only consider the level of sparsity and not the final compression ratio.
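To see how the |W| / |W≠0| column relates to the per-layer sparsities, a rough back-of-the-envelope recomputation for the Sparse VD row of LeNet-300-100 is shown below; it counts only the 784×300, 300×100 and 100×10 weight matrices and ignores biases, so it only approximates the reported value of 68.

```python
# Hypothetical recomputation of |W| / |W != 0| for LeNet-300-100
# from the Sparse VD per-layer sparsities in Table 1.
layer_sizes = [784 * 300, 300 * 100, 100 * 10]   # standard 784-300-100-10 shapes, biases ignored
sparsity = [0.989, 0.972, 0.620]                 # fraction of weights that are zero

total = sum(layer_sizes)
nonzero = sum(n * (1 - s) for n, s in zip(layer_sizes, sparsity))
print(total / nonzero)                           # roughly 70, close to the reported 68
```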

5.4. VGG-like on CIFAR-10 and CIFAR-100

To demonstrate that our method scales to large modern architectures, we apply it to a VGG-like network (Zagoruyko, 2015) adapted for the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset. The network consists of 13 convolutional and two fully-connected layers, each layer followed by pre-activation batch normalization and Binary Dropout. We experiment with different sizes of this architecture by scaling the number of units in each layer by k ∈ {0.25, 0.5, 1.0, 1.5}. We use CIFAR-10 and CIFAR-100 for evaluation. The reported error of this architecture on the CIFAR-10 dataset with k = 1 is 7.55%. As no pre-trained weights are available, we train our own network and achieve 7.3% error. Sparse VD also achieves 7.3% error for k = 1, but retains 48× fewer weights. The results are presented in Fig. 3.

We observe underfitting while training our model from a random initialization, so we initialize it with a network pre-trained with Binary Dropout and L2 regularization. It should be noted that most modern DNN compression techniques can also only be applied to pre-trained networks and work best with networks trained with L2 regularization (Han et al., 2015b).

Our method achieves over 65× sparsification on the CIFAR-10 dataset with no accuracy drop, and up to 41× sparsification on CIFAR-100 with a moderate accuracy drop.

The proposed method is the sparsest

Page 13: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

CIFAR-10, CIFAR-100
¤ Train a VGG-like network [Zagoruyko+ 15] on CIFAR-10 and CIFAR-100
¤ Experiments with different unit-size scaling factors k
¤ Accuracy stays almost the same, with up to 65× sparsity (CIFAR-10)

Page 14: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

Learning random labels
¤ [Zhang+ 16] showed that CNNs can fit even randomly labeled data.
¤ Ordinary dropout does not prevent this behaviour.
¤ With the proposed method (Sparse VD), training drives all the weights to a single value and the model only produces a constant prediction.
¤ Moreover, the sparsity reaches 100%.
¤ At 100% sparsity all weights become 0 (see Section 4.3 and the sketch below).
¤ Does the proposed method implicitly penalize memorization and thereby promote generalization?
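The zeroing-out referred to above corresponds to thresholding the per-weight dropout rates after training: weights whose log α = log(σ²/θ²) exceeds a threshold (the paper reports using log α = 3, i.e. a dropout rate of roughly 0.95 or more) are dropped. A minimal sketch, where the parameter names theta and log_sigma2 are illustrative assumptions rather than the authors' code:

```python
import numpy as np

def sparsify(theta, log_sigma2, threshold=3.0):
    # theta, log_sigma2: per-weight posterior parameters of a trained
    # Sparse VD layer (names are assumptions, not the paper's code).
    # log alpha = log(sigma^2 / theta^2); weights with log alpha above the
    # threshold correspond to dropout rates near 1 and are removed.
    log_alpha = log_sigma2 - 2.0 * np.log(np.abs(theta) + 1e-8)
    mask = log_alpha < threshold        # keep only the "relevant" weights
    return theta * mask, mask

# In the random-label case described above, every weight ends up above the
# threshold, so the mask is all zeros: 100% sparsity and a constant prediction.
```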


(a) Results on the CIFAR-10 dataset (b) Results on the CIFAR-100 dataset

Figure 3. Accuracy and sparsity level for VGG-like architectures of different sizes. The number of neurons and filters scales with k. Dense networks were trained with Binary Dropout, and Sparse VD networks were trained with Sparse Variational Dropout on all layers. The overall sparsity level achieved by our method is reported as a dashed line. The accuracy drop is negligible in most cases, and the sparsity level is high, especially in the larger networks.

Table 2. Experiments with random labeling. Sparse Variational Dropout (Sparse VD) removes all weights from the model and fails to overfit, whereas Binary Dropout networks (BD) learn the random labeling perfectly.

Dataset  | Architecture         | Train acc. | Test acc. | Sparsity
---------|----------------------|------------|-----------|---------
MNIST    | FC + BD              | 1.0        | 0.1       | —
MNIST    | FC + Sparse VD       | 0.1        | 0.1       | 100%
CIFAR-10 | VGG-like + BD        | 1.0        | 0.1       | —
CIFAR-10 | VGG-like + Sparse VD | 0.1        | 0.1       | 100%

5.5. Random Labels

Recently it was shown that CNNs are capable of memorizing the data even with random labeling (Zhang et al., 2016). Standard dropout, as well as other regularization techniques, did not prevent this behaviour. Following that work, we also experiment with random labeling of the data. We use a fully-connected network on the MNIST dataset and VGG-like networks on CIFAR-10. We put Binary Dropout (BD) with dropout rate p = 0.5 on all fully-connected layers of these networks. We observe that these architectures can fit a random labeling even with Binary Dropout. However, our model decides to drop every single weight and provides a constant prediction. It is still possible to make our model learn the random labeling by initializing it with a network pre-trained on this random labeling and then fine-tuning it. However, the variational lower bound L(θ, α) in this case is lower than in the case of 100% sparsity. Our results are presented in Table 2.
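For reference, the random-labeling setup simply replaces the training labels with labels drawn uniformly at random while leaving the inputs untouched; a minimal sketch of this preprocessing step (the dataset variables are placeholders):

```python
import numpy as np

def randomize_labels(labels, num_classes, seed=0):
    # Replace every training label with one drawn uniformly at random,
    # as in the memorization experiments of Zhang et al. (2016).
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=len(labels))

# Example: y_random = randomize_labels(y_train, num_classes=10)
```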

These observations may mean that Sparse Variational Dropout implicitly penalizes memorization and favors generalization. However, this still requires a more thorough investigation.

6. Discussion

The “Occam’s razor” principle states that unnecessarily complex models should not be preferred to simpler ones (MacKay, 1992). Automatic Relevance Determination is effectively a Bayesian implementation of this principle that occurs in different cases. Previously, it was mostly studied in the case of a factorized Gaussian prior in linear models, Gaussian Processes, etc. In the Relevance Tagging Machine model (Molchanov et al., 2015) the same effect was achieved using Beta distributions as a prior. Finally, in this work, the ARD effect is reproduced in an entirely different setting. We consider a fixed prior and train the model using variational inference. In this case, the ARD effect is caused by the particular combination of the approximate posterior distribution family and the prior distribution, and not by model selection. This way we can abandon the empirical Bayes approach, which is known to overfit (Cawley, 2010).

We observed that if we allow Variational Dropout to drop irrelevant weights automatically, it ends up cutting most of the model weights. This result correlates with the results of other works on training sparse neural networks (Han et al., 2015a; Wen et al., 2016; Ullrich et al., 2017; Soravit Changpinyo, 2017). All these works can also be viewed as a kind of regularization of neural networks, as they restrict the model complexity. Further investigation of this kind of redundancy may lead to an understanding of the generalization properties of DNNs and explain the phenomenon observed by (Zhang et al., 2016). According to that paper, it seems that while modern DNN architectures generalize well in practice, they can also easily learn a completely random labeling of the data. Interestingly, it is not the case for our model, as a network with zero weights has a higher value of the objective than a trained network.

Page 15: (DL輪読)Variational Dropout Sparsifies Deep Neural Networks

Summary
¤ This work proposed a reparameterization for Variational Dropout whose gradient variance does not blow up even when α becomes large.
¤ It can be combined with the Local Reparameterization Trick of [Kingma+ 15].
¤ It is also applicable to CNNs.
¤ Experiments showed that it achieves higher sparsity than existing methods.
¤ Furthermore, it was shown that the problem of DNNs easily fitting randomly labeled data does not apply to this method.