Variational Inference & Variational Autoencoders
Casey Meehan & Mary Anne Smart
CSE 254
May 2019
1 / 45
Overview
1 Variational Inference: A Review for Statisticians (Blei, Kucukelbir, & McAuliffe)
When and why use VI?
The Evidence Lower Bound (ELBO)
Mean Field Variational Inference
Comparison with older methods
2 Auto-Encoding Variational Bayes (Kingma & Welling)
2 / 45
Bayesian Modeling
Define a probabilistic model for your data in which the parameters are random variables
Consider a simple coin-toss model, xi ∈ {0, 1}:
Non-Bayesian (plate diagram: parameter θ, likelihood pθ(x), observations xi, plate of size n):
Xi ∼ Bern(θ)

Bayesian (plate diagram: hyperparameters α with prior pα(θ), latent θ, likelihood p(x|θ), observations xi, plate of size n):
Xi ∼ Bern(θ)
θ ∼ Beta(α1, α2)
3 / 45
MLE, MAP, Full Bayesian
Max Likelihood Est. (MLE)
Latent θ is a numerical parameter, not a random variable.
Choose the θ∗ that best explains the data
Maximize log likelihood:
$\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p_\theta(x_i) = \frac{\sum_{i=1}^{n} x_i}{n}$
likely to overfit without enough data:
seeing 1 sample, x1 = 1 implies θ∗ = 1
4 / 45
MLE, MAP, Full Bayesian
Maximum A Posteriori Estimate (MAP)
Latent θ ∼ Beta(α1, α2) is a random variable
Choose the θ∗ with maximum probability
Maximize the log posterior:
$\theta^* = \arg\max_\theta \big[\log p(\theta \mid x)\big] = \arg\max_\theta \left[\log \frac{p(x \mid \theta)\, p(\theta)}{p(x)}\right] = \frac{\alpha_1 - 1 + \sum_{i=1}^{n} x_i}{n + \alpha_1 + \alpha_2 - 2}$
less likely to overfit:
Say α1 = α2 = 2. Then seeing 1 sample, x1 = 1 implies θ∗ = 2/3
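As a quick check of these two estimators, here is a minimal NumPy sketch of the coin-toss example (our own illustration, not from the slides):

```python
import numpy as np

# Coin-toss example from the slides: compare MLE and MAP under a Beta prior.
x = np.array([1])               # a single observed flip, x_1 = 1
alpha1, alpha2 = 2.0, 2.0       # Beta prior hyperparameters
n = len(x)

theta_mle = x.sum() / n                                          # = 1.0 (overfits)
theta_map = (alpha1 - 1 + x.sum()) / (n + alpha1 + alpha2 - 2)   # = 2/3

print(f"MLE estimate: {theta_mle:.3f}")
print(f"MAP estimate: {theta_map:.3f}")
```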
5 / 45
MLE, MAP, Full Bayesian
Full Bayesian (goal of variational inference)
Latent θ ∼ Beta(α1, α2) is a random variable.
Find the full distribution of θ|x,α; don't pick a single value.
For this conjugate model the posterior is available in closed form, θ|x ∼ Beta(α1 + Σi xi, α2 + n − Σi xi); variational inference targets models where no such closed form exists.
6 / 45
A VI Story...
1. We have a sample of n data points, x = {xi : i ∈ [1, n]}
2. We have latent variables, z = {zi : i ∈ [1, m]}
3. We have a probabilistic model with a likelihood and a prior distribution: pθ(x|z), pα(z)
4. We want to be fully Bayesian, but the posterior p(z|x) is intractable!
5. So, we pick a simpler class of distributions, Q, to approximate p(z|x)
6. We choose the optimal qφ(z|x) ∈ Q by minimizing the reverse KL divergence from p(z|x)
7. ...which is equivalent to maximizing the ELBO
7 / 45
Latent Variables can be Difficult to Infer
Consider this Gaussian mixture model of N samples and K clusters (plate diagram: mixture means µk for k = 1, ..., K; assignments ci and observations xi for i = 1, ..., N; fixed hyperparameters σ2 and π[1,...,K]):

$\mu_k \sim \mathcal{N}(0, \sigma^2)$
$c_i \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$
$x_i \mid c_i, \boldsymbol{\mu} \sim \mathcal{N}(c_i^\top \boldsymbol{\mu}, I)$
8 / 45
Latent Variables can be Difficult to Infer
The posterior probability requires computing the evidence p(x) = p(x1, x2, ..., xn):

$p(\boldsymbol{\mu}, c \mid x) = \frac{p(\boldsymbol{\mu}, c, x)}{p(x)} = \frac{\prod_{k=1}^{K} p(\mu_k) \prod_{i=1}^{n} p(c_i)\, p(x_i \mid c_i, \boldsymbol{\mu})}{p(x)}$

...and it's a pain to compute:

$p(x) = \int p(\boldsymbol{\mu}) \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\mu})\, d\boldsymbol{\mu} = \int p(\boldsymbol{\mu}) \prod_{i=1}^{n} \sum_{c_i} p(c_i)\, p(x_i \mid c_i, \boldsymbol{\mu})\, d\boldsymbol{\mu} \;\Longrightarrow\; O(K^n)$
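To make the $O(K^n)$ blow-up concrete, here is a toy brute-force computation of the evidence for a tiny 1-D mixture (our own illustration; the integral over µ is replaced by Monte Carlo sampling from the prior):

```python
import numpy as np
from itertools import product
from scipy.stats import norm

rng = np.random.default_rng(0)
K, n, sigma2 = 3, 6, 5.0               # already 3**6 = 729 assignment vectors
x = rng.normal(size=n)

def log_evidence(x, n_mu_samples=500):
    # p(x) = E_{mu ~ prior}[ sum_c p(c) * prod_i p(x_i | c_i, mu) ],
    # with the outer expectation approximated by sampling mu from its prior.
    mu = rng.normal(0.0, np.sqrt(sigma2), size=(n_mu_samples, K))
    total = np.zeros(n_mu_samples)
    for c in product(range(K), repeat=len(x)):     # K**n terms in this sum
        lik = norm.pdf(x, mu[:, list(c)], 1.0).prod(axis=1)
        total += lik * (1.0 / K) ** len(x)
    return np.log(total.mean())

print(log_evidence(x))   # exact sum over assignments: cost grows as O(K**n)
```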
9 / 45
A general framework for variational inference
Generally, let z be our set of latent variables, x our observed data
Variational inference approximates p(z|x) with some q(z|x)
(Plate diagram: prior pα(z) over latents z, likelihood pθ(x|z) generating observations x, and the approximating posterior qφ(z|x).)
In the case of the GMM above...
z = {µ1, ..., µK, c1, ..., cn}
x = {x1, ..., xn}
α = {σ2, π1, ..., πK}
θ = ∅
φ not defined yet
10 / 45
The ELBO: choosing qφ(z |x) with KL divergence
qφ(z|x) belongs to a set of distributions Q parameterized by φ ∈ Φ.
VI chooses qφ(z|x) by maximizing the Evidence Lower BOund (ELBO).
This is equivalent to finding the parameters φ ∈ Φ that minimize the KL divergence between qφ(z|x) and p(z|x):

$\min_\phi \; D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z \mid x)\big)$
11 / 45
Deriving the ELBO
Factor the KL divergence; let $\mathbb{E}_q = \mathbb{E}_{q_\phi(z|x)}$:

$D_{KL}\big(q_\phi(z|x) \,\|\, p(z|x)\big)$  (1)
$= \mathbb{E}_q[\log q_\phi(z|x)] - \mathbb{E}_q[\log p(z|x)]$  (2)
$= \mathbb{E}_q[\log q_\phi(z|x)] - \big(\mathbb{E}_q[\log p_\theta(x|z)] + \mathbb{E}_q[\log p_\alpha(z)] - \log p(x)\big)$  (3)
$= D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big) - \mathbb{E}_q[\log p_\theta(x|z)] + \log p(x)$  (4)

So the evidence is bounded below by the ELBO:

$\log p(x) = \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big) + D_{KL}\big(q_\phi(z|x) \,\|\, p(z|x)\big) \;\geq\; \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big)$

12 / 45
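As a sanity check on this bound, here is a small NumPy/SciPy sketch using our own toy model (not from the slides) where the evidence is available in closed form: z ∼ N(0, 1), x|z ∼ N(z, 1), so p(x) = N(x; 0, 2) and the exact posterior is N(x/2, 1/2).

```python
import numpy as np
from scipy.stats import norm

x = 1.3
log_evidence = norm.logpdf(x, 0.0, np.sqrt(2.0))   # exact log p(x)

def elbo(m, s2, n_samples=200_000, seed=0):
    # ELBO = E_q[log p(x|z)] - KL(q || prior) for q(z|x) = N(m, s2).
    rng = np.random.default_rng(seed)
    z = m + np.sqrt(s2) * rng.standard_normal(n_samples)
    recon = norm.logpdf(x, z, 1.0).mean()             # Monte Carlo E_q[log p(x|z)]
    kl = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))         # KL(N(m, s2) || N(0, 1))
    return recon - kl

print(log_evidence)        # ≈ -1.688
print(elbo(x / 2, 0.5))    # ≈ -1.688: bound is tight when q is the exact posterior
print(elbo(0.0, 1.0))      # ≈ -2.26: strictly below log p(x) for a worse q
```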
Anatomy of the ELBO?
ELBO = Eq
[log pθ(x|z)
]− DKL
(qφ(z|x)||pα(z)
)= likelihood + regularizer
The first term encourages qφ(z|x) to place mass on z values thatexplain x according to the likelihood, pθ(x|z)
The second term encourages qφ(z|x) to be close to the priordistribution, pθ(z), effectively regularizing the approximate posterior
13 / 45
Reverse KL (used) vs. Forward KL (not used)
Reverse KL is zero-forcing:

$D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$

Forward KL is zero-avoiding:

$D_{KL}\big(p_\theta(z|x) \,\|\, q_\phi(z|x)\big) = \mathbb{E}_{p_\theta(z|x)}\left[\log \frac{p_\theta(z|x)}{q_\phi(z|x)}\right]$
img credit: Eric Jang
Consequence: support of posterior may be under-represented
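To see the two behaviours numerically, here is a small grid-search sketch (our own illustration) that fits a single Gaussian q to a bimodal target p under each divergence:

```python
import numpy as np
from scipy.stats import norm

# Fit a single Gaussian q = N(mu, 1) to a bimodal target p by grid search,
# once minimizing reverse KL(q || p) and once forward KL(p || q).
grid = np.linspace(-10, 10, 2001)
dz = grid[1] - grid[0]
p = 0.5 * norm.pdf(grid, -3, 1) + 0.5 * norm.pdf(grid, 3, 1)

def kl(a, b):
    # Discretized KL(a || b) between densities evaluated on the grid.
    eps = 1e-300
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dz

def best_mu(direction):
    mus = np.linspace(-6, 6, 241)
    scores = [kl(norm.pdf(grid, mu, 1.0), p) if direction == "reverse"
              else kl(p, norm.pdf(grid, mu, 1.0)) for mu in mus]
    return mus[int(np.argmin(scores))]

print("reverse KL picks mu ≈", best_mu("reverse"))   # locks onto one mode (≈ ±3)
print("forward KL picks mu ≈", best_mu("forward"))   # covers both modes (≈ 0)
```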
14 / 45
Mean Field Variational Inference
Mean-field variational inference assumes qφ(z|x) completely factors:
$q_\phi(z|x) = \prod_{i=1}^{M} q_i(z_i \mid x)$

(Plate diagrams: the full joint p(z, x) with prior pα(z) and likelihood pθ(x|z); the exact posterior p(z|x); and the mean-field approximation qφ(z|x), in which z1, z2, z3 factor independently.)
15 / 45
Mean Field on GMM example
A mean-field approximation for our GMM uses Gaussians for the µk's and categoricals for the ci's (same plate diagram as before):

$q_\phi(\mu_k \mid x) = \mathcal{N}(\mu_k;\, m_k, s_k^2)$
$q_\phi(c_i \mid x) = \mathrm{Cat}(c_i;\, \psi_{i1}, \dots, \psi_{iK})$

These are the optimal mean-field forms for the GMM.
16 / 45
Mean Field captures marginal, not covariance, structure
$p(z|x) = \prod_{i=1}^{M} p(z_i \mid z_{i-1}, \dots, z_1, x)$

vs.

$q_\phi(z|x) = \prod_{i=1}^{M} q_i(z_i \mid x)$
17 / 45
Coordinate Ascent Variational Inference (CAVI)
One method to optimize our variational bound is to update each qi(zi|x) in turn while holding all the other coordinates Z−i fixed.
The update looks a lot like EM, except we're updating an approximation:

$q_i(z_i \mid x) \;\propto\; \exp\big\{\mathbb{E}_{q_{-i}}[\log p(z_i, Z_{-i}, x)]\big\}$

Note that the expectation over q−i only needs to include the Markov blanket of zi, not all of the other latent variables.
18 / 45
Coordinate Ascent Variational Inference (CAVI)
CAVI iteratively updates each coordinate: qi (zi |x)
$\mathrm{ELBO} = \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q_\phi(z|x)]$  (1)
$= \log p(x) + \mathbb{E}_q[\log p(z|x)] - \sum_{i=1}^{M} \mathbb{E}_{q_i}[\log q_i(z_i|x)]$  (2)
$= \mathrm{const} + \sum_{i=1}^{M} \Big(\mathbb{E}_q[\log p(z_i \mid z_{1:(i-1)}, x)] - \mathbb{E}_{q_i}[\log q_i(z_i|x)]\Big)$  (3)

$\mathrm{ELBO}_i = \mathbb{E}_{q_i}\Big[\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)]\Big] - \mathbb{E}_{q_i}[\log q_i(z_i|x)]$  (i)
$= \int q_i(z_i)\Big(\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)] - \log q_i(z_i|x)\Big)\, dz_i$  (ii)
19 / 45
Coordinate Ascent Variational Inference (CAVI)
$\mathrm{ELBO}_i = \mathbb{E}_{q_i}\Big[\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)]\Big] - \mathbb{E}_{q_i}[\log q_i(z_i|x)]$  (i)
$= \int q_i(z_i)\Big(\mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)] - \log q_i(z_i|x)\Big)\, dz_i$  (ii)

$\frac{d}{dq_i(z_i)}\mathrm{ELBO}_i = \mathbb{E}_{q_{-i}}[\log p(z_i \mid z_{-i}, x)] - \log q_i(z_i|x) - 1 = 0$  (iii)

Solve, with a Lagrange multiplier enforcing normalization:

$q_i(z_i \mid x) \;\propto\; \exp\big\{\mathbb{E}_{q_{-i}}[\log p(z_i \mid Z_{-i}, x)]\big\} \;\propto\; \exp\big\{\mathbb{E}_{q_{-i}}[\log p(z_i, Z_{-i}, x)]\big\}$

20 / 45
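As a concrete example, here is a toy CAVI loop for the 1-D Bayesian GMM from the earlier slides (unit likelihood variance, uniform mixture weights), roughly following the coordinate updates in Blei et al.; the code and variable names are our own sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CAVI for a 1-D Bayesian GMM: mu_k ~ N(0, sigma2), c_i ~ Cat(1/K, ..., 1/K),
# x_i | c_i, mu ~ N(mu_{c_i}, 1). Mean-field family:
#   q(mu_k) = N(m_k, s2_k),  q(c_i) = Cat(psi_i).
K, sigma2 = 3, 10.0
x = np.concatenate([rng.normal(loc, 1.0, 100) for loc in (-5.0, 0.0, 5.0)])

m = rng.normal(0.0, 1.0, K)      # variational means of the mu_k
s2 = np.ones(K)                  # variational variances of the mu_k
for _ in range(100):
    # Assignment update: psi_ik ∝ exp( E[mu_k] x_i − E[mu_k^2] / 2 )
    logits = np.outer(x, m) - 0.5 * (s2 + m**2)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    psi = np.exp(logits)
    psi /= psi.sum(axis=1, keepdims=True)
    # Mean update: m_k = (sum_i psi_ik x_i) / (1/sigma2 + sum_i psi_ik),
    #              s2_k = 1 / (1/sigma2 + sum_i psi_ik)
    denom = 1.0 / sigma2 + psi.sum(axis=0)
    m = (psi * x[:, None]).sum(axis=0) / denom
    s2 = 1.0 / denom

print(np.round(np.sort(m), 2))   # approximately the true means [-5, 0, 5]
```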
How Does VI Compare to Older Methods like EM?
Doesn’t Expectation Maximization (EM) compute latent variables?
$\mathrm{ELBO} = \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\alpha(z)\big)$

EM sets qφ(z|x) = p(z|x) (the exact posterior) and iterates to find point estimates of the latent parameters.
VI does not assume the posterior is tractable, which is needed in Bayesian settings where the latent parameters are not fixed values but have distributions.
21 / 45
How Does VI Compare to Older Methods like MCMC?
MCMC: produce samples that asymptotically approach p(z|x)
More accurate, with better guarantees, but computationally very expensive.
VI: run optimization to analytically approximate p(z|x)
VI is generally faster and scales nicely to large datasets.
It can take advantage of stochastic and distributed optimization methods.
22 / 45
Open Questions in VI
VI is not as well understood as MCMC; open areas include:
Theory to bound accuracy of VI for certain classes of models
Improved optimization methods for massive data
Developing VI approximations that work for a wide class of models
Improving posterior representation without an overly complex family of qφ's
23 / 45
Questions on General Variational Inference?
24 / 45
Auto-Encoding Variational Bayes (Kingma & Welling)
SGVB (Stochastic Gradient Variational Bayes) estimator
AEVB (Auto-Encoding Variational Bayes) algorithm
Variational Auto-encoder
25 / 45
VAE Terminology
qφ(z|x) is our encoder. Given some x in our observed space, how is z distributed in latent space?
pθ(x|z) is our decoder. Given some z in latent space, how is x distributed in the original space?
(Plate diagram: prior pα(z) over latents z, decoder/likelihood pθ(x|z), and encoder qφ(z|x) mapping observations x back to latent space.)
26 / 45
Experiments
qφ(z|x) and pθ(x|z) are learned jointly using a neural network.
Once we are done training the network, we can generate samples from the learned distribution pθ(x|z).
Datasets
MNIST: training set of 60,000 28x28 pixel images of handwritten digits
Frey Face: almost 2,000 20x28 images of Brendan Frey's face, taken from sequential video frames
27 / 45
MNIST Generated Samples
Figure: (Kingma & Welling, 2013)
28 / 45
Visualizing Learned 2-D Manifolds
29 / 45
VAE Example
Let the prior over the latent variables be the centered isotropic multivariate Gaussian: pα(z) = N(z; 0, I).
Let qφ(z|x) be a multivariate Gaussian with a diagonal covariance structure: qφ(z|x(i)) = N(z; µ(i), σ2(i)I), where µ and σ are outputs of our encoder network, a neural network with one hidden layer.
Let pθ(x|z) also be a multivariate Gaussian parameterized by a neural network (our decoder network) with one hidden layer.
30 / 45
VAE Diagram
(Diagram: input x and noise ε enter the Encoder, which outputs µqφ and σqφ; these produce the latent z, which the Decoder maps to µpθ and σpθ, yielding the reconstruction x′.)
31 / 45
Reparameterization Trick
We can often reparameterize z ∼ qφ(z|x) as z = g(x, ε), where g is some deterministic, differentiable function and ε ∼ p(ε) is a noise variable.

Gaussian example:
We want to sample z ∼ qφ(z|x) = N(z; µ, σ2I).
Let ε ∼ N(0, I).
Let g(x, ε) = µ + σ ⊙ ε.
We can write z = g(x, ε).

Instead of sampling z from qφ directly, we sample ε from N(0, I) and apply the differentiable function g.
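A minimal NumPy sketch of this Gaussian case (illustrative values, our own code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterize z ~ N(mu, diag(sigma^2)): sample noise eps ~ N(0, I),
# then apply the deterministic map g(x, eps) = mu + sigma * eps.
mu = np.array([0.5, -1.0])
sigma = np.array([0.3, 2.0])
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps   # distributed as N(mu, diag(sigma^2)), differentiable in mu, sigma
print(z)
```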
32 / 45
Applying the Reparameterization Trick
$\mathbb{E}_{q_\phi(z|x)}[f(z)] = \int q_\phi(z|x)\, f(z)\, dz = \int p(\varepsilon)\, f(z)\, d\varepsilon = \int p(\varepsilon)\, f(g(x, \varepsilon))\, d\varepsilon = \mathbb{E}_{p(\varepsilon)}[f(g(x, \varepsilon))]$
33 / 45
When can we apply the reparameterization trick?
When q belongs to any location-scale family
e.g. Gaussian, Laplace
When q has a tractable inverse CDF
Let g be the inverse CDF and let ε ∼ U(0, I ).
e.g. Exponential, Cauchy, Logistic (see the sketch after this list)
When we can express z ∼ q as some compositional transformation of auxiliary variables
e.g. Log-Normal, Dirichlet
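For the inverse-CDF case above, here is a small sketch (our own example) using the Exponential distribution, whose inverse CDF is −log(1 − u)/λ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF reparameterization: if u ~ U(0, 1), then z = -log(1 - u) / rate
# has an Exponential(rate) distribution, and z is differentiable in rate.
rate = 2.0
u = rng.uniform(size=10_000)
z = -np.log1p(-u) / rate
print(z.mean())   # ≈ 1 / rate = 0.5
```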
34 / 45
Variational Inference & the ELBO
Recall that in VI, we select qφ(z|x) to maximize the ELBO (Evidence Lower BOund):

$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big]$
35 / 45
Why differentiate the ELBO?
If we want to train the neural network from the earlier example through backprop, we need a differentiable objective function.
Even if we don't intend to parameterize pθ and qφ with a neural network, the ability to use stochastic gradient ascent techniques to optimize the ELBO can help VI scale to large datasets.
36 / 45
Auto-Encoding Variational Bayes Algorithm
Algorithm 1 Minibatch version of the AEVB algorithm.
1: θ, φ ← Initialize parameters
2: repeat
3:   XM ← Random minibatch of M datapoints
4:   ε ← Random samples from p(ε)
5:   g ← ∇θ,φ L̃M(θ, φ; XM, ε)
6:   θ, φ ← Update parameters using gradients g (e.g. SGD or Adagrad)
7: until convergence of parameters (θ, φ)
8: return θ, φ

$\tilde{\mathcal{L}}^M(\theta, \phi; X^M, \varepsilon) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)})$
where L̃ is the differentiable estimator of the ELBO that we willderive next. (L̃M is the minibatch version.)
37 / 45
VAE Diagram
(Diagram, repeated from before: input x and noise ε enter the Encoder, which outputs µqφ and σqφ; these produce the latent z, which the Decoder maps to µpθ and σpθ, yielding the reconstruction x′.)
38 / 45
Stochastic Gradient Variational Bayes Estimator
ELBO (Evidence Lower BOund)
$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big]$

How do we differentiate this with respect to φ and θ?
In practice, we can often obtain a closed-form expression for the first term (the KL divergence).
For the second term we need to do some kind of estimation by sampling.
39 / 45
Deriving the SGVB Estimator
$\mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big] \;\approx\; \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$

where $z^{(i,l)} = g_\phi(x^{(i)}, \varepsilon^{(i,l)})$ and $\varepsilon^{(l)} \sim p(\varepsilon)$.

This gives us our SGVB estimator:

$\mathcal{L}(\theta, \phi; x^{(i)}) \;\approx\; -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$
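A minimal NumPy sketch of this estimator (the encode and decode_log_lik functions below are hypothetical placeholders for the encoder and decoder networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_kl_to_std_normal(mu, log_var):
    # Closed-form -D_KL( N(mu, sigma^2 I) || N(0, I) ), derived on the next slides.
    return 0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def sgvb_elbo(x, encode, decode_log_lik, L=1):
    # One-datapoint SGVB estimate: -KL + (1/L) sum_l log p(x | z^(l)).
    mu, log_var = encode(x)                      # hypothetical encoder
    rec = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps     # reparameterization trick
        rec += decode_log_lik(x, z)              # hypothetical log p_theta(x|z)
    return neg_kl_to_std_normal(mu, log_var) + rec / L
```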
40 / 45
Back to Our Original Example
Let the prior over the latent variables be the centered isotropic multivariate Gaussian: pα(z) = N(z; 0, I).
Let qφ(z|x) be a multivariate Gaussian with a diagonal covariance structure: qφ(z|x(i)) = N(z; µ(i), σ2(i)I), where µ and σ are outputs of our encoder network, a neural network with one hidden layer.
Claim:
$-D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\alpha(z)\big) = \frac{1}{2} \sum_{j=1}^{J} \Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big)$
41 / 45
Deriving an Expression for the KL-Divergence
$-D_{KL}(Q \,\|\, P) = -\int q(z) \log \frac{q(z)}{p(z)}\, dz$
$= \int q(z) \log p(z)\, dz - \int q(z) \log q(z)\, dz$
$= \int \mathcal{N}(z; \mu, \sigma^2) \log \mathcal{N}(z; 0, I)\, dz - \int \mathcal{N}(z; \mu, \sigma^2) \log \mathcal{N}(z; \mu, \sigma^2)\, dz$
$= \Big[-\tfrac{J}{2}\log(2\pi) - \tfrac{1}{2}\sum_{j=1}^{J}(\mu_j^2 + \sigma_j^2)\Big] - \Big[-\tfrac{J}{2}\log(2\pi) - \tfrac{1}{2}\sum_{j=1}^{J}(1 + \log \sigma_j^2)\Big]$
$= \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\big)$
Note: J denotes the dimensionality of z.
42 / 45
Final Result
$\mathcal{L}(\theta, \phi; x^{(i)}) \;\approx\; \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$

where $z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \varepsilon^{(l)}$ and $\varepsilon^{(l)} \sim \mathcal{N}(0, I)$.
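Putting the pieces together, here is a minimal PyTorch sketch of a VAE trained with this estimator (L = 1). It is our own illustrative code, not the paper's implementation; layer sizes are arbitrary, and we swap in a Bernoulli decoder, which Kingma & Welling use for binary MNIST pixels (the Gaussian decoder from the earlier slide would output a mean and log-variance instead).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps          # reparameterization trick
        logits = self.dec_out(torch.tanh(self.dec(z)))  # decoder p_theta(x|z)
        return logits, mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # Reconstruction term: log p(x|z) for a Bernoulli decoder, single sample (L = 1)
    rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Closed-form -KL(q(z|x) || N(0, I)) from the slides
    kl = 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return -(rec + kl)

# Usage sketch with random data standing in for MNIST minibatches:
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()        # fake binary images
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
```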
43 / 45
VAE Diagram Revisited
(Diagram: input x and noise ε enter the Encoder, which outputs µqφ and σqφ; these produce the latent z, which the Decoder maps to µpθ and σpθ, yielding the reconstruction x′. The regularization term −DKL(qφ(z|x)||pα(z)) attaches to the encoder output, and the reconstruction term log pθ(x|z) attaches to the decoder output.)
44 / 45
See Also
These are the specific papers we talked about:
Kingma, Diederik P. and Welling, Max. Auto-Encoding Variational Bayes. In the 2nd International Conference on Learning Representations (ICLR), 2013.
Blei, David M., Kucukelbir, Alp, and McAuliffe, Jon D. Variational inference: a review for statisticians. arXiv:1601.00670, 2016.
This is a helpful tutorial that we also consulted:
C. Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016.
Many variations of the standard VAE exist. For example:
Conditional VAEs
VAE-GANs
45 / 45