Variational Inference & Variational Autoencoders
Casey Meehan & Mary Anne Smart
CSE 254
May 2019
1 / 45
Overview
1 Variational Inference: A Review for Statisticians (Blei, Kucukelbir, & McAuliffe)
When and why use VI?
The Evidence Lower Bound (ELBO)
Mean Field Variational Inference
Comparison with older methods
2 Auto-Encoding Variational Bayes (Kingma & Welling)
2 / 45
Bayesian Modeling
Define a probabilistic model for your data in which the parameters are random variables
Consider a simple coin-toss model, xi ∈ {0, 1}:
Non-Bayesian (plate diagram: parameter θ, likelihood pθ(x), observations xi, plate of size n):
Xi ∼ Bern(θ)

Bayesian (plate diagram: hyperparameters α with prior pα(θ), latent θ, likelihood p(x|θ), observations xi, plate of size n):
Xi ∼ Bern(θ)
θ ∼ Beta(α1, α2)
3 / 45
MLE, MAP, Full Bayesian
Max Likelihood Est. (MLE)
Latent θ is a numerical parameter, not a random variable.
Choose the θ∗ that best explains the data
Maximize log likelihood:
$\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p_\theta(x_i) = \frac{\sum_{i=1}^{n} x_i}{n}$
likely to overfit without enough data:
seeing 1 sample, x1 = 1 implies θ∗ = 1
4 / 45
MLE, MAP, Full Bayesian
Maximum A Posteriori Estimate (MAP)
Latent θ ∼ Beta(α1, α2) is a random variable
Choose the θ∗ with maximum probability
Maximize the log posterior:
$\theta^* = \arg\max_\theta \big[\log p(\theta \mid x)\big] = \arg\max_\theta \left[\log \frac{p(x \mid \theta)\, p(\theta)}{p(x)}\right] = \frac{\alpha_1 - 1 + \sum_{i=1}^{n} x_i}{n + \alpha_1 + \alpha_2 - 2}$
less likely to overfit:
Say α1 = α2 = 2. Then seeing 1 sample, x1 = 1 implies θ∗ = 2/3
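As a quick check of these two estimators, here is a minimal NumPy sketch of the coin-toss example (our own illustration, not from the slides):

```python
import numpy as np

# Coin-toss example from the slides: compare MLE and MAP under a Beta prior.
x = np.array([1])               # a single observed flip, x_1 = 1
alpha1, alpha2 = 2.0, 2.0       # Beta prior hyperparameters
n = len(x)

theta_mle = x.sum() / n                                          # = 1.0 (overfits)
theta_map = (alpha1 - 1 + x.sum()) / (n + alpha1 + alpha2 - 2)   # = 2/3

print(f"MLE estimate: {theta_mle:.3f}")
print(f"MAP estimate: {theta_map:.3f}")
```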
5 / 45
MLE, MAP, Full Bayesian
Full Bayesian (goal of variational inference)
Latent θ ∼ Beta(α1, α2) is a random variable.
Find the full distribution of θ|x,α; don't pick a single value.
For this conjugate model the posterior is available in closed form, θ|x ∼ Beta(α1 + Σi xi, α2 + n − Σi xi); variational inference targets models where no such closed form exists.
6 / 45
A VI Story...
1. We have a sample of n data points, x = {xi : i ∈ [1, n]}
2. We have latent variables, z = {zi : i ∈ [1, m]}
3. We have a probabilistic model with a likelihood and a prior distribution: pθ(x|z), pα(z)
4. We want to be fully Bayesian, but the posterior p(z|x) is intractable!
5. So, we pick a simpler class of distributions, Q, to approximate p(z|x)
6. We choose the optimal qφ(z|x) ∈ Q by minimizing the reverse KL divergence from p(z|x)
7. ...which is equivalent to maximizing the ELBO
7 / 45
Latent Variables can be Difficult to Infer
Consider this Gaussian mixture model of N samples and K clusters (plate diagram: mixture means µk for k = 1, ..., K; assignments ci and observations xi for i = 1, ..., N; fixed hyperparameters σ2 and π[1,...,K]):

$\mu_k \sim \mathcal{N}(0, \sigma^2)$
$c_i \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$
$x_i \mid c_i, \boldsymbol{\mu} \sim \mathcal{N}(c_i^\top \boldsymbol{\mu}, I)$
8 / 45
Latent Variables can be Difficult to Infer
The posterior probability requires computing the evidence p(x) = p(x1, x2, ..., xn):

$p(\boldsymbol{\mu}, c \mid x) = \frac{p(\boldsymbol{\mu}, c, x)}{p(x)} = \frac{\prod_{k=1}^{K} p(\mu_k) \prod_{i=1}^{n} p(c_i)\, p(x_i \mid c_i, \boldsymbol{\mu})}{p(x)}$

...and it's a pain to compute:

$p(x) = \int p(\boldsymbol{\mu}) \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\mu})\, d\boldsymbol{\mu} = \int p(\boldsymbol{\mu}) \prod_{i=1}^{n} \sum_{c_i} p(c_i)\, p(x_i \mid c_i, \boldsymbol{\mu})\, d\boldsymbol{\mu} \;\Longrightarrow\; O(K^n)$
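To make the $O(K^n)$ blow-up concrete, here is a toy brute-force computation of the evidence for a tiny 1-D mixture (our own illustration; the integral over µ is replaced by Monte Carlo sampling from the prior):

```python
import numpy as np
from itertools import product
from scipy.stats import norm

rng = np.random.default_rng(0)
K, n, sigma2 = 3, 6, 5.0               # already 3**6 = 729 assignment vectors
x = rng.normal(size=n)

def log_evidence(x, n_mu_samples=500):
    # p(x) = E_{mu ~ prior}[ sum_c p(c) * prod_i p(x_i | c_i, mu) ],
    # with the outer expectation approximated by sampling mu from its prior.
    mu = rng.normal(0.0, np.sqrt(sigma2), size=(n_mu_samples, K))
    total = np.zeros(n_mu_samples)
    for c in product(range(K), repeat=len(x)):     # K**n terms in this sum
        lik = norm.pdf(x, mu[:, list(c)], 1.0).prod(axis=1)
        total += lik * (1.0 / K) ** len(x)
    return np.log(total.mean())

print(log_evidence(x))   # exact sum over assignments: cost grows as O(K**n)
```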
9 / 45
A general framework for variational inference
Generally, let z be our set of latent variables, x our observed data
Variational inference approximates p(z|x) with some q(z|x)
(Plate diagram: prior pα(z) over latents z, likelihood pθ(x|z) generating observations x, and the approximating posterior qφ(z|x).)
In the case of the GMM above...
z = {µ1, ..., µK, c1, ..., cn}
x = {x1, ..., xn}
α = {σ2, π1, ..., πK}
θ = ∅
φ not defined yet
10 / 45
The ELBO: choosing qφ(z |x) with KL divergence
qφ(z|x) belongs to a set of distributions Q parameterized by φ ∈ Φ.
VI chooses qφ(z|x) by maximizing the Evidence Lower BOund (ELBO).
This is equivalent to finding the parameters φ ∈ Φ that minimize the KL divergence between qφ(z|x) and p(z|x):

$\min_\phi \; D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z \mid x)\big)$
11 / 45
Deriving the ELBO
Factor the KL divergence; let $\mathbb{E}_q = \mathbb{E}_{q_\phi(z|x)}$:

$D_{KL}\big(q_\phi(z|x) \,\|\, p(z|x)\big)$  (1)
$= \mathbb{E}_q[\log q_\phi(z|x)] - \mathbb{E}_q[\log p(z|x)]$  (2)
$= \mathbb{E}_q[\log q_\phi(z|x)] - \big(\mathbb{E}_q[\log p_\theta(x|z)] + \mathbb{E}_q[\log p_\alpha(z)] - \log p(x)\big)$  (3)
$= D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big) - \mathbb{E}_q[\log p_\theta(x|z)] + \log p(x)$  (4)

So the evidence is bounded below by the ELBO:

$\log p(x) = \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big) + D_{KL}\big(q_\phi(z|x) \,\|\, p(z|x)\big) \;\geq\; \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big)$

12 / 45
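As a sanity check on this bound, here is a small NumPy/SciPy sketch using our own toy model (not from the slides) where the evidence is available in closed form: z ∼ N(0, 1), x|z ∼ N(z, 1), so p(x) = N(x; 0, 2) and the exact posterior is N(x/2, 1/2).

```python
import numpy as np
from scipy.stats import norm

x = 1.3
log_evidence = norm.logpdf(x, 0.0, np.sqrt(2.0))   # exact log p(x)

def elbo(m, s2, n_samples=200_000, seed=0):
    # ELBO = E_q[log p(x|z)] - KL(q || prior) for q(z|x) = N(m, s2).
    rng = np.random.default_rng(seed)
    z = m + np.sqrt(s2) * rng.standard_normal(n_samples)
    recon = norm.logpdf(x, z, 1.0).mean()             # Monte Carlo E_q[log p(x|z)]
    kl = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))         # KL(N(m, s2) || N(0, 1))
    return recon - kl

print(log_evidence)        # ≈ -1.688
print(elbo(x / 2, 0.5))    # ≈ -1.688: bound is tight when q is the exact posterior
print(elbo(0.0, 1.0))      # ≈ -2.26: strictly below log p(x) for a worse q
```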
Anatomy of the ELBO?
ELBO = Eq
[log pθ(x|z)
]− DKL
(qφ(z|x)||pα(z)
)= likelihood + regularizer
The first term encourages qφ(z|x) to place mass on z values thatexplain x according to the likelihood, pθ(x|z)
The second term encourages qφ(z|x) to be close to the priordistribution, pθ(z), effectively regularizing the approximate posterior
13 / 45
Reverse KL (used) vs. Forward KL (not used)
Reverse KL is zero-forcing:

$D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$

Forward KL is zero-avoiding:

$D_{KL}\big(p_\theta(z|x) \,\|\, q_\phi(z|x)\big) = \mathbb{E}_{p_\theta(z|x)}\left[\log \frac{p_\theta(z|x)}{q_\phi(z|x)}\right]$
img credit: Eric Jang
Consequence: support of posterior may be under-represented
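To see the two behaviours numerically, here is a small grid-search sketch (our own illustration) that fits a single Gaussian q to a bimodal target p under each divergence:

```python
import numpy as np
from scipy.stats import norm

# Fit a single Gaussian q = N(mu, 1) to a bimodal target p by grid search,
# once minimizing reverse KL(q || p) and once forward KL(p || q).
grid = np.linspace(-10, 10, 2001)
dz = grid[1] - grid[0]
p = 0.5 * norm.pdf(grid, -3, 1) + 0.5 * norm.pdf(grid, 3, 1)

def kl(a, b):
    # Discretized KL(a || b) between densities evaluated on the grid.
    eps = 1e-300
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dz

def best_mu(direction):
    mus = np.linspace(-6, 6, 241)
    scores = [kl(norm.pdf(grid, mu, 1.0), p) if direction == "reverse"
              else kl(p, norm.pdf(grid, mu, 1.0)) for mu in mus]
    return mus[int(np.argmin(scores))]

print("reverse KL picks mu ≈", best_mu("reverse"))   # locks onto one mode (≈ ±3)
print("forward KL picks mu ≈", best_mu("forward"))   # covers both modes (≈ 0)
```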
14 / 45
Mean Field Variational Inference
Mean-field variational inference assumes qφ(z|x) completely factors:
$q_\phi(z|x) = \prod_{i=1}^{M} q_i(z_i \mid x)$

(Plate diagrams: the full joint p(z, x) with prior pα(z) and likelihood pθ(x|z); the exact posterior p(z|x); and the mean-field approximation qφ(z|x), in which z1, z2, z3 factor independently.)
15 / 45
Mean Field on GMM example
A mean-field approximation for our GMM uses Gaussians for the µk's and categoricals for the ci's (same plate diagram as before):

$q_\phi(\mu_k \mid x) = \mathcal{N}(\mu_k;\, m_k, s_k^2)$
$q_\phi(c_i \mid x) = \mathrm{Cat}(c_i;\, \psi_{i1}, \dots, \psi_{iK})$

These are the optimal mean-field forms for the GMM.
16 / 45
Mean Field captures marginal, not covariance, structure
$p(z|x) = \prod_{i=1}^{M} p(z_i \mid z_{i-1}, \dots, z_1, x)$

vs.

$q_\phi(z|x) = \prod_{i=1}^{M} q_i(z_i \mid x)$
17 / 45
Coordinate Ascent Variational Inference (CAVI)
One method to optimize our variational bound is to update each qi(zi|x) in turn while holding all the other coordinates Z−i fixed.
The update looks a lot like EM, except we're updating an approximation:

$q_i(z_i \mid x) \;\propto\; \exp\big\{\mathbb{E}_{q_{-i}}[\log p(z_i, Z_{-i}, x)]\big\}$

Note that the expectation over q−i only needs to include the Markov blanket of zi, not all of the other latent variables.
18 / 45
Coordinate Ascent Variational Inference (CAVI)
CAVI iteratively updates each coordinate: qi (zi |x)
$\mathrm{ELBO} = \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q_\phi(z|x)]$  (1)
$= \log p(x) + \mathbb{E}_q[\log p(z|x)] - \sum_{i=1}^{M} \mathbb{E}_{q_i}[\log q_i(z_i|x)]$  (2)
$= \mathrm{const} + \sum_{i=1}^{M} \Big(\mathbb{E}_q[\log p(z_i \mid z_{1:(i-1)}, x)] - \mathbb{E}_{q_i}[\log q_i(z_i|x)]\Big)$  (3)

$\mathrm{ELBO}_i = \mathbb{E}_{q_i}\Big[\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)]\Big] - \mathbb{E}_{q_i}[\log q_i(z_i|x)]$  (i)
$= \int q_i(z_i)\Big(\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)] - \log q_i(z_i|x)\Big)\, dz_i$  (ii)
19 / 45
Coordinate Ascent Variational Inference (CAVI)
$\mathrm{ELBO}_i = \mathbb{E}_{q_i}\Big[\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)]\Big] - \mathbb{E}_{q_i}[\log q_i(z_i|x)]$  (i)
$= \int q_i(z_i)\Big(\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)] - \log q_i(z_i|x)\Big)\, dz_i$  (ii)

$\frac{d}{dq_i(z_i)}\mathrm{ELBO}_i = \mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)] - \log q_i(z_i|x) - 1 = 0$  (iii)

Solve, with a Lagrange multiplier enforcing normalization:

$q_i(z_i \mid x) \;\propto\; \exp\big\{\mathbb{E}_{q_{-i}}[\log p(z_i \mid Z_{-i}, x)]\big\} \;\propto\; \exp\big\{\mathbb{E}_{q_{-i}}[\log p(z_i, Z_{-i}, x)]\big\}$

20 / 45
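As a concrete example, here is a toy CAVI loop for the 1-D Bayesian GMM from the earlier slides (unit likelihood variance, uniform mixture weights), roughly following the coordinate updates in Blei et al.; the code and variable names are our own sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CAVI for a 1-D Bayesian GMM: mu_k ~ N(0, sigma2), c_i ~ Cat(1/K, ..., 1/K),
# x_i | c_i, mu ~ N(mu_{c_i}, 1). Mean-field family:
#   q(mu_k) = N(m_k, s2_k),  q(c_i) = Cat(psi_i).
K, sigma2 = 3, 10.0
x = np.concatenate([rng.normal(loc, 1.0, 100) for loc in (-5.0, 0.0, 5.0)])

m = rng.normal(0.0, 1.0, K)      # variational means of the mu_k
s2 = np.ones(K)                  # variational variances of the mu_k
for _ in range(100):
    # Assignment update: psi_ik ∝ exp( E[mu_k] x_i − E[mu_k^2] / 2 )
    logits = np.outer(x, m) - 0.5 * (s2 + m**2)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    psi = np.exp(logits)
    psi /= psi.sum(axis=1, keepdims=True)
    # Mean update: m_k = (sum_i psi_ik x_i) / (1/sigma2 + sum_i psi_ik),
    #              s2_k = 1 / (1/sigma2 + sum_i psi_ik)
    denom = 1.0 / sigma2 + psi.sum(axis=0)
    m = (psi * x[:, None]).sum(axis=0) / denom
    s2 = 1.0 / denom

print(np.round(np.sort(m), 2))   # approximately the true means [-5, 0, 5]
```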
How Does VI Compare to Older Methods like EM?
Doesn’t Expectation Maximization (EM) compute latent variables?
$\mathrm{ELBO} = \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big)$

EM sets qφ(z|x) = p(z|x) (the exact posterior) and iterates to find point estimates of the latent parameters.
VI does not assume the posterior is tractable, which is needed in Bayesian settings where the latent parameters are not fixed values but have distributions.
21 / 45
How Does VI Compare to Older Methods like MCMC?
MCMC: produce samples that asymptotically approach p(z|x)
More accurate, with better guarantees, but computationally very expensive.
VI: run optimization to analytically approximate p(z|x)
VI is generally faster and scales nicely to large datasets.
It can take advantage of stochastic and distributed optimization methods.
22 / 45
Open Questions in VI
VI is not as well understood as MCMC; open areas include:
Theory to bound accuracy of VI for certain classes of models
Improved optimization methods for massive data
Developing VI approximations that work for a wide class of models
Improving posterior representation without an overly complex family of qφ's
23 / 45
Questions on General Variational Inference?
24 / 45
Auto-Encoding Variational Bayes (Kingma & Welling)
SGVB (Stochastic Gradient Variational Bayes) estimator
AEVB (Auto-Encoding Variational Bayes) algorithm
Variational Auto-encoder
25 / 45
VAE Terminology
qφ(z|x) is our encoder. Given some x in our observed space, how is z distributed in latent space?
pθ(x|z) is our decoder. Given some z in latent space, how is x distributed in the original space?
(Plate diagram: prior pα(z) over latents z, decoder/likelihood pθ(x|z), and encoder qφ(z|x) mapping observations x back to latent space.)
26 / 45
Experiments
qφ(z|x) and pθ(x|z) are learned jointly using a neural network.
Once we are done training the network, we can generate samples from the learned distribution pθ(x|z).
Datasets
MNIST: training set of 60,000 28x28 pixel images of handwritten digits
Frey Face: almost 2,000 20x28 images of Brendan Frey's face, taken from sequential video frames
27 / 45
MNIST Generated Samples
Figure: (Kingma & Welling, 2013)
28 / 45
Visualizing Learned 2-D Manifolds
29 / 45
VAE Example
Let the prior over the latent variables be the centered isotropic multivariate Gaussian: pα(z) = N(z; 0, I).
Let qφ(z|x) be a multivariate Gaussian with a diagonal covariance structure: qφ(z|x(i)) = N(z; µ(i), σ2(i)I), where µ and σ are outputs of our encoder network, a neural network with one hidden layer.
Let pθ(x|z) also be a multivariate Gaussian parameterized by a neural network (our decoder network) with one hidden layer.
30 / 45
VAE Diagram
(Diagram: input x and noise ε enter the Encoder, which outputs µqφ and σqφ; these produce the latent z, which the Decoder maps to µpθ and σpθ, yielding the reconstruction x′.)
31 / 45
Reparameterization Trick
We can often reparameterize z ∼ qφ(z|x) as z = g(x, ε), where g is some deterministic, differentiable function and ε ∼ p(ε) is a noise variable.

Gaussian example:
We want to sample z ∼ qφ(z|x) = N(z; µ, σ2I).
Let ε ∼ N(0, I).
Let g(x, ε) = µ + σ ⊙ ε.
We can write z = g(x, ε).

Instead of sampling z from qφ directly, we sample ε from N(0, I) and apply the differentiable function g.
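A minimal NumPy sketch of this Gaussian case (illustrative values, our own code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterize z ~ N(mu, diag(sigma^2)): sample noise eps ~ N(0, I),
# then apply the deterministic map g(x, eps) = mu + sigma * eps.
mu = np.array([0.5, -1.0])
sigma = np.array([0.3, 2.0])
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps   # distributed as N(mu, diag(sigma^2)), differentiable in mu, sigma
print(z)
```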
32 / 45
Applying the Reparameterization Trick
$\mathbb{E}_{q_\phi(z|x)}[f(z)] = \int q_\phi(z|x)\, f(z)\, dz = \int p(\varepsilon)\, f(z)\, d\varepsilon = \int p(\varepsilon)\, f(g(x, \varepsilon))\, d\varepsilon = \mathbb{E}_{p(\varepsilon)}[f(g(x, \varepsilon))]$
33 / 45
When can we apply the reparameterization trick?
When q belongs to any location-scale family
e.g. Gaussian, Laplace
When q has a tractable inverse CDF
Let g be the inverse CDF and let ε ∼ U(0, I ).
e.g. Exponential, Cauchy, Logistic (see the sketch after this list)
When we can express z ∼ q as some compositional transformation of auxiliary variables
e.g. Log-Normal, Dirichlet
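For the inverse-CDF case above, here is a small sketch (our own example) using the Exponential distribution, whose inverse CDF is −log(1 − u)/λ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF reparameterization: if u ~ U(0, 1), then z = -log(1 - u) / rate
# has an Exponential(rate) distribution, and z is differentiable in rate.
rate = 2.0
u = rng.uniform(size=10_000)
z = -np.log1p(-u) / rate
print(z.mean())   # ≈ 1 / rate = 0.5
```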
34 / 45
Variational Inference & the ELBO
Recall that in VI, we select qφ(z|x) to maximize the ELBO (Evidence Lower BOund):

$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big]$
35 / 45
Why differentiate the ELBO?
If we want to train the neural network from the earlier example through backprop, we need a differentiable objective function.
Even if we don't intend to parameterize pθ and qφ with a neural network, the ability to use stochastic gradient ascent techniques to optimize the ELBO can help VI scale to large datasets.
36 / 45
Auto-Encoding Variational Bayes Algorithm
Algorithm 1 Minibatch version of the AEVB algorithm.
1: θ, φ ← Initialize parameters
2: repeat
3:   XM ← Random minibatch of M datapoints
4:   ε ← Random samples from p(ε)
5:   g ← ∇θ,φ L̃M(θ, φ; XM, ε)
6:   θ, φ ← Update parameters using gradients g (e.g. SGD or Adagrad)
7: until convergence of parameters (θ, φ)
8: return θ, φ

$\tilde{\mathcal{L}}^M(\theta, \phi; X^M, \varepsilon) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)})$
where L̃ is the differentiable estimator of the ELBO that we willderive next. (L̃M is the minibatch version.)
37 / 45
VAE Diagram
(Diagram, repeated from before: input x and noise ε enter the Encoder, which outputs µqφ and σqφ; these produce the latent z, which the Decoder maps to µpθ and σpθ, yielding the reconstruction x′.)
38 / 45
Stochastic Gradient Variational Bayes Estimator
ELBO (Evidence Lower BOund)
$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big]$

How do we differentiate this with respect to φ and θ?
In practice, we can often obtain a closed-form expression for the first term (the KL divergence).
For the second term we need to do some kind of estimation by sampling.
39 / 45
Deriving the SGVB Estimator
$\mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big] \;\approx\; \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$

where $z^{(i,l)} = g_\phi(x^{(i)}, \varepsilon^{(i,l)})$ and $\varepsilon^{(l)} \sim p(\varepsilon)$.

This gives us our SGVB estimator:

$\mathcal{L}(\theta, \phi; x^{(i)}) \;\approx\; -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$
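A minimal NumPy sketch of this estimator (the encode and decode_log_lik functions below are hypothetical placeholders for the encoder and decoder networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_kl_to_std_normal(mu, log_var):
    # Closed-form -D_KL( N(mu, sigma^2 I) || N(0, I) ), derived on the next slides.
    return 0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def sgvb_elbo(x, encode, decode_log_lik, L=1):
    # One-datapoint SGVB estimate: -KL + (1/L) sum_l log p(x | z^(l)).
    mu, log_var = encode(x)                      # hypothetical encoder
    rec = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps     # reparameterization trick
        rec += decode_log_lik(x, z)              # hypothetical log p_theta(x|z)
    return neg_kl_to_std_normal(mu, log_var) + rec / L
```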
40 / 45
Back to Our Original Example
Let the prior over the latent variables be the centered isotropic multivariate Gaussian: pα(z) = N(z; 0, I).
Let qφ(z|x) be a multivariate Gaussian with a diagonal covariance structure: qφ(z|x(i)) = N(z; µ(i), σ2(i)I), where µ and σ are outputs of our encoder network, a neural network with one hidden layer.
Claim:
$-D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) = \frac{1}{2} \sum_{j=1}^{J} \Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big)$
41 / 45
Deriving an Expression for the KL-Divergence
$-D_{KL}(Q \,\|\, P) = -\int q(z) \log \frac{q(z)}{p(z)}\, dz$
$= \int q(z) \log p(z)\, dz - \int q(z) \log q(z)\, dz$
$= \int \mathcal{N}(z; \mu, \sigma^2) \log \mathcal{N}(z; 0, I)\, dz - \int \mathcal{N}(z; \mu, \sigma^2) \log \mathcal{N}(z; \mu, \sigma^2)\, dz$
$= \Big[-\tfrac{J}{2}\log(2\pi) - \tfrac{1}{2}\sum_{j=1}^{J}(\mu_j^2 + \sigma_j^2)\Big] - \Big[-\tfrac{J}{2}\log(2\pi) - \tfrac{1}{2}\sum_{j=1}^{J}(1 + \log \sigma_j^2)\Big]$
$= \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\big)$
Note: J denotes the dimensionality of z.
42 / 45
Final Result
$\mathcal{L}(\theta, \phi; x^{(i)}) \;\approx\; \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$

where $z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \varepsilon^{(l)}$ and $\varepsilon^{(l)} \sim \mathcal{N}(0, I)$.
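Putting the pieces together, here is a minimal PyTorch sketch of a VAE trained with this estimator (L = 1). It is our own illustrative code, not the paper's implementation; layer sizes are arbitrary, and we swap in a Bernoulli decoder, which Kingma & Welling use for binary MNIST pixels (the Gaussian decoder from the earlier slide would output a mean and log-variance instead).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps          # reparameterization trick
        logits = self.dec_out(torch.tanh(self.dec(z)))  # decoder p_theta(x|z)
        return logits, mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # Reconstruction term: log p(x|z) for a Bernoulli decoder, single sample (L = 1)
    rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Closed-form -KL(q(z|x) || N(0, I)) from the slides
    kl = 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return -(rec + kl)

# Usage sketch with random data standing in for MNIST minibatches:
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()        # fake binary images
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
```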
43 / 45
VAE Diagram Revisited
(Diagram: input x and noise ε enter the Encoder, which outputs µqφ and σqφ; these produce the latent z, which the Decoder maps to µpθ and σpθ, yielding the reconstruction x′. The regularization term −DKL(qφ(z|x)||pα(z)) attaches to the encoder output, and the reconstruction term log pθ(x|z) attaches to the decoder output.)
44 / 45
See Also
These are the specific papers we talked about:
Kingma, Diederik P. and Welling, Max. Auto-Encoding Variational Bayes. In the 2nd International Conference on Learning Representations (ICLR), 2013.
Blei, David M., Kucukelbir, Alp, and McAuliffe, Jon D. Variational inference: a review for statisticians. arXiv:1601.00670, 2016.
This is a helpful tutorial that we also consulted:
C. Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016.
Many variations of the standard VAE exist. For example:
Conditional VAEs
VAE-GANs
45 / 45