
Variational Autoencoders (VAEs)

Yuqin Yang

Wilson Lab Group Meeting Presentation

September 26 & October 3, 2017


Section 1: Preliminaries


Kullback-Leibler divergence

KL divergence (continuous case)

Let p(x) and q(x) be two probability densities. The KL divergence is defined as

$$\mathrm{KL}(p\,\|\,q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx. \tag{1.1}$$

By Jensen's inequality, KL(p‖q) ≥ 0, with equality if and only if p = q almost everywhere.


Special case: multivariate Gaussian distribution

Suppose $p_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $p_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$ are k-dimensional Gaussians. Then

$$\mathrm{KL}(p_1\,\|\,p_2) = \frac{1}{2}\left[\log\frac{\det(\Sigma_2)}{\det(\Sigma_1)} - k + \mathrm{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1)\right].$$
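As a quick numerical check (a minimal sketch, not from the slides; it assumes SciPy is available), the snippet below evaluates this closed form for two small Gaussians and compares it with a Monte Carlo estimate of the integral in Equation (1.1):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
k = 2
mu1, Sigma1 = np.zeros(k), np.eye(k)
mu2, Sigma2 = np.array([1.0, -0.5]), np.diag([2.0, 0.5])

# Closed form: 1/2 [ log det(S2)/det(S1) - k + tr(S2^{-1} S1)
#                    + (mu2 - mu1)^T S2^{-1} (mu2 - mu1) ]
S2inv = np.linalg.inv(Sigma2)
diff = mu2 - mu1
kl_closed = 0.5 * (np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)) - k
                   + np.trace(S2inv @ Sigma1) + diff @ S2inv @ diff)

# Monte Carlo estimate of Equation (1.1): E_{x ~ p1}[log p1(x) - log p2(x)]
x = rng.multivariate_normal(mu1, Sigma1, size=200_000)
kl_mc = np.mean(multivariate_normal.logpdf(x, mu1, Sigma1)
                - multivariate_normal.logpdf(x, mu2, Sigma2))

print(kl_closed, kl_mc)  # the two values should agree to ~2 decimal places
```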


Variational Inference

Suppose we want to use Q(z) to approximate P(z|X), where P(z|X) has no explicit representation. A good approximation should minimize

$$\mathrm{KL}(Q(z)\,\|\,P(z|X)) = \int Q(z)\,\log\frac{Q(z)}{P(z|X)}\,dz.$$

By Bayes' rule, this can be rearranged into

$$\log P(X) - \mathrm{KL}(Q(z)\,\|\,P(z|X)) = \int Q(z)\,\log P(X|z)\,dz - \mathrm{KL}(Q(z)\,\|\,P(z)). \tag{1.2}$$
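The rearrangement is worth seeing once: apply Bayes' rule, P(z|X) = P(X|z)P(z)/P(X), inside the logarithm, and note that log P(X) does not depend on z, so it comes out of the expectation:

$$\begin{aligned}
\mathrm{KL}(Q(z)\,\|\,P(z|X)) &= \mathbb{E}_{Q(z)}\big[\log Q(z) - \log P(z|X)\big] \\
&= \mathbb{E}_{Q(z)}\big[\log Q(z) - \log P(X|z) - \log P(z)\big] + \log P(X) \\
&= \mathrm{KL}(Q(z)\,\|\,P(z)) - \int Q(z)\,\log P(X|z)\,dz + \log P(X).
\end{aligned}$$

Moving terms across the equality gives Equation (1.2).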


Section 2: Variational Autoencoders

¹ A. B. L. Larsen et al. (2015). “Autoencoding beyond pixels using a learned similarity metric”. arXiv:1512.09300.


Original problem

Given a dataset X from a distribution P(X), we want to generate new data that follows the unknown distribution P(X).

We construct a model f(z; θ): Z × Θ → X, where X is the space of observed variables (the data), Z is the space of latent variables, Θ is the parameter space, and f is a complex but deterministic mapping.

Latent variables: variables that are not directly observed but are instead inferred from other, directly observed variables. Given z, we can generate a sample X by f(z; θ).

We wish to optimize θ such that, when we sample z from P(z), f(z; θ) will, with high probability, be like the X's in our dataset.


Likelihood

$$P(X;\theta) = \int P(X|z;\theta)\,P(z)\,dz. \tag{2.1}$$

Choose θ to maximize this integral.

In VAEs, P(X|z; θ) = N(f(z; θ), σ²I) in the continuous case, and P(X|z; θ) = B(f(z; θ)) (Bernoulli) in the discrete case. In both cases, P(X|z; θ) is continuous with respect to θ, so we can use gradient ascent to maximize over θ.

Questions:

- How do we define the latent variable z so that it captures the latent information?

- How do we deal with the integral over z, and with its gradient with respect to θ?


Define latent variable

We want the latent variable to satisfy two properties:

- The latent variables are chosen automatically, because we do not know much about the intrinsic properties of X.

- Different components of z are mutually independent, to avoid overlap in the latent information.

VAEs assert that the latent variable can be drawn from a standard Gaussian distribution, N(0, I).

Assertion

Any distribution in d dimensions can be generated by taking a set of d variables that are normally distributed and mapping them through a sufficiently complicated function.

Since f(z; θ) is complicated enough (it is learned by a neural network), this choice of latent variable does not matter much, as the sketch below illustrates.
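As a small illustration of the assertion (a toy sketch, not from the slides, in the spirit of Doersch's VAE tutorial), a simple deterministic map pushes 2-D standard-Gaussian samples onto a ring, a distribution that looks nothing like a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal((5000, 2))        # z ~ N(0, I_2)

# Deterministic map g(z) = z/10 + z/||z||: pushes mass onto a unit ring.
norms = np.linalg.norm(z, axis=1, keepdims=True)
x = z / 10 + z / norms

r = np.linalg.norm(x, axis=1)
print(r.mean(), r.std())  # radii concentrate near 1: a ring, not a Gaussian blob
```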


Deal with the integral

$$P(X;\theta) \approx \frac{1}{n}\sum_{i} P(X|z^{(i)};\theta), \qquad z^{(i)} \sim \mathcal{N}(0, I).$$

Figure: Counterexample. We would need to set σ very small, which in turn requires a very large dataset.

In this case, we need a more efficient procedure for sampling z, as the sketch below suggests.
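The inefficiency is easy to see numerically. In the hypothetical sketch below (illustrative only; the "decoder" f and all names are stand-ins of my own), X lives in a moderately high-dimensional space and the naive estimator averages P(X|z⁽ⁱ⁾) over prior samples: almost every z contributes essentially zero, so the estimate is dominated by rare lucky draws.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n = 20, 0.1, 10_000          # data dim, decoder noise, prior samples

f = lambda z: z                        # toy stand-in for a neural-network decoder
X = rng.standard_normal(d)             # a "data point" whose P(X) we want

def naive_estimate():
    z = rng.standard_normal((n, d))    # z^(i) ~ N(0, I)
    # log P(X|z) under N(f(z), sigma^2 I), up to the constant normalizer
    log_p = -np.sum((X - f(z))**2, axis=1) / (2 * sigma**2)
    return np.mean(np.exp(log_p))      # (1/n) * sum_i P(X|z^(i)), unnormalized

print([naive_estimate() for _ in range(3)])
# Typically prints [0.0, 0.0, 0.0]: with sigma = 0.1, virtually no prior
# sample lands close enough to X to contribute anything at all.
```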


Sampling in VAEs

The key idea behind the variational autoencoder is to attempt to sample values of z that are likely to have produced X, and to compute P(X) just from those.

New function Q(z): it gives us a distribution over the z values that are likely to produce X. Then $\mathbb{E}_{P(z)}[P(X|z)] \to \mathbb{E}_{Q(z)}[P(X|z)]$. The optimal choice of Q(z) would be P(z|X), but P(z|X) is intractable.

Aim:

Find a Q(z) that approximates P(z|X) while remaining simple enough to work with.


Recall: Variational Inference

For any Q(z) used to approximate P(z|X), Equation (1.2) gives

$$\log P(X) - \mathrm{KL}(Q(z)\,\|\,P(z|X)) = \mathbb{E}_{Q(z)}[\log P(X|z)] - \mathrm{KL}(Q(z)\,\|\,P(z)).$$

Since we are interested in inferring P(X), it makes sense to construct a Q that depends on X:

$$\log P(X) - \mathrm{KL}(Q(z|X)\,\|\,P(z|X)) = \mathbb{E}_{Q(z|X)}[\log P(X|z)] - \mathrm{KL}(Q(z|X)\,\|\,P(z)). \tag{2.2}$$

Aim:

Maximize log P(X) (w.r.t. θ) and minimize KL(Q(z|X)‖P(z|X)) ⇔ maximize the LHS ⇔ maximize the RHS.


Second term of RHS

Aim:

Minimize KL(Q(z|X)‖P(z)). We already have P(z) = N(0, I).

The usual choice is Q(z|X) = N(µ(X; φ), Σ(X; φ)), where µ and Σ are deterministic functions of X with parameters φ. (We omit φ in the following equations.) In addition, we constrain Σ to be a diagonal matrix.

Minimization

By the earlier formula for the KL divergence between multivariate Gaussians,

$$\mathrm{KL}(Q(z|X)\,\|\,P(z)) = \frac{1}{2}\left(\mathrm{tr}\,\Sigma(X) + \mu(X)^{\top}\mu(X) - k - \log\det\Sigma(X)\right).$$
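With Σ(X) diagonal, this term has a cheap elementwise form. A minimal sketch (assuming the encoder outputs the mean `mu` and the log of the diagonal variances `logvar`, a common parameterization that the slides do not spell out):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), per the formula above.

    mu, logvar: arrays of shape (batch, k). Returns shape (batch,).
    """
    # 1/2 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) ), elementwise for diag Sigma
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

mu = np.zeros((1, 3)); logvar = np.zeros((1, 3))
print(kl_to_standard_normal(mu, logvar))  # [0.]  -- Q already equals P(z)
```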


First term of RHS

The first term is maximized with SGD. To approximate the expectation, take a single sample z from Q(z|X):

$$\mathbb{E}_{Q(z|X)}[\log P(X|z)] \approx \log P(X|z).$$

General maximization objective

$$\mathbb{E}_{X\sim D}\big[\log P(X) - \mathrm{KL}(Q(z|X)\,\|\,P(z|X))\big] = \mathbb{E}_{X\sim D}\big[\mathbb{E}_{z\sim Q(z|X)}[\log P(X|z)] - \mathrm{KL}(Q(z|X)\,\|\,P(z))\big]. \tag{2.3}$$

To use SGD, sample a value of X and a value of z, then compute the gradient of the RHS by backpropagation. Repeating this m times and averaging gives an estimate that converges to the gradient of the RHS.


Figure: Flow chart for the VAE algorithm.


Significant Problems

The algorithm seems complete, but there are two significant problems in the calculation:

- The gradient of the first term of the RHS in Equation (2.3) should involve the parameters of both P and Q, but our sampling method drops the parameters of Q. As a result, we cannot obtain the true gradient with respect to φ.

- The algorithm is split into two parts: the first half trains the model Q(z|X) on the given data X, and the second half trains the model f on the newly sampled z. Backpropagation cannot pass through the discontinuity at the sampling step, which makes the algorithm fail.


Modification by Reparameterization Trick

- To solve the first problem, we change the way we sample: first sample ε ∼ N(0, I), then define z = µ(X) + Σ(X)^{1/2} ε. This is an equivalent representation of the sample z from the previous algorithm, but the objective becomes

$$\mathbb{E}_{X\sim D}\Big[\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\big[\log P\big(X \mid \mu(X) + \Sigma(X)^{1/2}\epsilon\big)\big] - \mathrm{KL}(Q(z|X)\,\|\,P(z))\Big].$$

This time the sampling distribution no longer involves the functions we are optimizing.

- More generally, sample from Q(z|X) by evaluating a function h(η, X), where η is unobserved noise and h is continuous in X. (A discrete Q(z|X) fails here.) Backpropagation then works, as in the sketch below.
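A minimal PyTorch sketch of the trick (illustrative; the tensor names are my own). Gradients flow from z back into µ and Σ because the randomness is confined to ε:

```python
import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_sigma = torch.tensor([0.0, 0.3], requires_grad=True)  # diagonal Sigma^(1/2)

eps = torch.randn(2)                   # eps ~ N(0, I): involves no parameters
z = mu + torch.exp(log_sigma) * eps    # z = mu(X) + Sigma(X)^{1/2} eps

loss = (z**2).sum()                    # stand-in for -log P(X|z)
loss.backward()
print(mu.grad, log_sigma.grad)         # well-defined gradients through the sample

# By contrast, torch.distributions.Normal(mu, sigma).sample() returns a draw
# detached from mu and sigma, so backprop stops there; .rsample() implements
# exactly the reparameterization written out above.
```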


Figure: Flow chart for the corrected VAE algorithm.
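Putting the pieces together, here is a compact, hypothetical PyTorch sketch of one training step of the corrected algorithm (architecture sizes and names are my own; the slides do not fix them). The loss is the negative of the RHS of Equation (2.3), estimated with a single (X, ε) sample per datapoint:

```python
import torch
import torch.nn as nn

d_x, d_z = 784, 20                      # e.g. flattened 28x28 images, 20-dim z

encoder = nn.Sequential(nn.Linear(d_x, 400), nn.ReLU(), nn.Linear(400, 2 * d_z))
decoder = nn.Sequential(nn.Linear(d_z, 400), nn.ReLU(), nn.Linear(400, d_x),
                        nn.Sigmoid())   # Bernoulli case: P(X|z) = B(f(z))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

def training_step(X):                   # X: (batch, d_x) with entries in [0, 1]
    mu, logvar = encoder(X).chunk(2, dim=1)    # Q(z|X) = N(mu, diag(exp(logvar)))
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps      # reparameterization trick
    f = decoder(z)
    # -E_Q[log P(X|z)]: Bernoulli negative log-likelihood
    rec = nn.functional.binary_cross_entropy(f, X, reduction="sum")
    # KL(Q(z|X) || P(z)) in closed form for a diagonal Gaussian
    kl = 0.5 * torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar)
    loss = rec + kl                     # = -(RHS of Equation (2.3)), one sample
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(training_step(torch.rand(32, d_x)))   # one step on a random toy batch
```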


Verification

For the decoder:

We can simply sample z ∼ N(0, I) and feed it into the decoder f to generate new samples.

The probability P(X) for a test example X:

This is not tractable, because P is implicit. However, by Equation (2.2), since the KL divergence is non-negative, the RHS is a lower bound on log P(X), called the Evidence Lower BOund (ELBO). This lower bound is a useful tool for getting a rough idea of how well the model captures a particular datapoint X, because it converges quickly.


Remarks

Detailed remarks are not presented here.

- Interpretation of the RHS: the two terms have information-theoretic interpretations.

- Separating the RHS by sample.

- Regularization term: it can be obtained by a transformation of the RHS.

- Sampling for Q(z|X): the original paper expresses this distribution as g(X, ε), where ε ∼ p_ε independently; restrictions on p_ε are needed.²

² D. P. Kingma and M. Welling (2013). “Auto-encoding variational Bayes”. arXiv:1312.6114.


Section 3: Extensions of VAEs


Comparison Versus GAN

- Both are relatively new deep generative models.

- The biggest advantage of VAEs is the clean probabilistic formulation they come with, a result of maximizing a lower bound on the log-likelihood. VAEs are also usually easier to train and to get working: they are relatively easy to implement and robust to hyperparameter choices.

- GANs are better at generating visual features; the output of VAEs is sometimes blurry.

- More detailed discussions can be found on Reddit.


Conditional Variational Autoencoders

Original problem:

Given an input dataset X and outputs Y, we want to build a model P(Y|X) that maximizes the probability of the ground-truth distribution. Example: generating handwritten digits. We want to add digits to an existing string of digits written by a single person.

A standard regression model fails in this situation, because it ultimately generates an “average image” that minimizes the distance, which may look like a meaningless blur. CVAEs, however, allow us to tackle problems where the input-to-output mapping is one-to-many, without requiring us to explicitly specify the structure of the output distribution.


Figure: Flow chart for the CVAE algorithm.


$$P(Y|X) = \mathcal{N}(f(z, X), \sigma^2 I);$$

$$\log P(Y|X) - \mathrm{KL}(Q(z|Y,X)\,\|\,P(z|Y,X)) = \mathbb{E}_{Q(z|Y,X)}[\log P(Y|z,X)] - \mathrm{KL}(Q(z|Y,X)\,\|\,P(z|X)).$$
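In practice the conditioning typically enters by concatenation. The hypothetical sketch below (layer sizes and names are my own; it also assumes the common simplification P(z|X) = N(0, I)) feeds the condition X to both the encoder, which models Q(z|Y, X), and the decoder, which models P(Y|z, X); the loss is otherwise the same as for the plain VAE:

```python
import torch
import torch.nn as nn

d_y, d_cond, d_z = 784, 10, 20          # e.g. digit image Y, one-hot label X

enc = nn.Sequential(nn.Linear(d_y + d_cond, 400), nn.ReLU(), nn.Linear(400, 2 * d_z))
dec = nn.Sequential(nn.Linear(d_z + d_cond, 400), nn.ReLU(), nn.Linear(400, d_y),
                    nn.Sigmoid())

def cvae_loss(Y, X):
    mu, logvar = enc(torch.cat([Y, X], dim=1)).chunk(2, dim=1)   # Q(z|Y, X)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)      # reparameterized
    f = dec(torch.cat([z, X], dim=1))                            # P(Y|z, X)
    rec = nn.functional.binary_cross_entropy(f, Y, reduction="sum")
    kl = 0.5 * torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar)
    return rec + kl

# At test time: pick a condition X (here the one-hot for class 3),
# sample z ~ N(0, I), and decode.
Y_new = dec(torch.cat([torch.randn(1, d_z), torch.eye(d_cond)[3:4]], dim=1))
```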


VAE-GAN³

Combine a VAE with a GAN by collapsing the decoder and the generator into a single network, since both map from a standard Gaussian distribution to X.

Figure: Overview of the VAE-GAN algorithm.

³ A. B. L. Larsen et al. (2015). “Autoencoding beyond pixels using a learned similarity metric”. arXiv:1512.09300.


- Instead of measuring the error element-wise, VAE-GAN measures it feature-wise, where the features are produced by the discriminator (see the sketch after the figure below).

- The generator and the decoder share their parameters.

- Three kinds of errors are optimized simultaneously.

Figure: Flow of the VAE-GAN algorithm. Grey arrows represent the terms in the training objective.
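A sketch of the feature-wise error term alone (hedged: the layer choice and all names are my own, and the full VAE-GAN objective also includes the KL term and the GAN term). The original X and its reconstruction are pushed through the discriminator up to an intermediate layer, and the squared error is measured between those features instead of between pixels:

```python
import torch
import torch.nn as nn

disc_features = nn.Sequential(           # discriminator trunk up to some layer l
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)
disc_head = nn.Linear(128, 1)            # real/fake logit; used by the GAN term

def feature_wise_error(X, X_rec):
    # Element-wise error would be ((X - X_rec)**2).sum(); VAE-GAN instead
    # compares discriminator features of the original and the reconstruction.
    return ((disc_features(X) - disc_features(X_rec))**2).sum()

X = torch.rand(8, 784)
X_rec = torch.rand(8, 784)               # stand-in for decoder(encoder(X))
print(feature_wise_error(X, X_rec).item())
```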


That’s all. Thanks!
