Deep Learning: A Statistical Perspective
stat.snu.ac.kr/mcp/Lecture_9_VAE.pdf · 2019-10-30



Unsupervised learning


In unsupervised learning, we try to learn about data without labels.

Unsupervised learning tasks involve estimating the density of the input $x$, say $p(x)$, or dimension reduction.

Dimension reduction tasks are sometimes called feature learning and include PCA, the self-organizing map, multidimensional scaling, and the autoencoder.

Density estimation tasks include cluster analysis, anomaly detection, the variational autoencoder, and the generative adversarial network.


Dimensionality and density estimation

To get a sense of the effect of dimensionality on estimating densities, consider kernel density estimators. Assume that $X_i \in C \subset \mathbb{R}^d$, where $C$ is compact:
$$\hat p(x) = n^{-1} \sum_{i=1}^n \frac{1}{h^d} K\!\left(\frac{\|x - X_i\|}{h}\right)$$

Let $\beta$ and $L$ be positive numbers. Given a vector $s = (s_1, \dots, s_d)$, let $|s| = s_1 + \cdots + s_d$ and $D^s = \dfrac{\partial^{|s|}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$. Define the Hölder class
$$\Sigma(\beta, L) = \{\, g : |D^s g(x) - D^s g(y)| \le L \|x - y\|^{\beta - |s|} \text{ for all } s \text{ with } |s| = \lfloor\beta\rfloor \text{ and all } x, y \,\},$$
where $\lfloor\beta\rfloor$ is the greatest integer strictly less than $\beta$.

Then the MSE satisfies
$$\mathrm{MSE} \asymp h^{2\beta} + \frac{1}{n h^d}.$$

If $h \asymp n^{-1/(2\beta + d)}$,
$$\sup_{p \in \Sigma(\beta, L)} E \int (\hat p_h(x) - p(x))^2\, dx \le c\, n^{-2\beta/(2\beta + d)}.$$
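As a concrete illustration of the estimator above, here is a minimal NumPy sketch (an illustration, not code from the lecture; the function name `kde` and the choice of a Gaussian kernel are my assumptions) that evaluates a $d$-dimensional kernel density estimate with the rate-driven bandwidth $h = n^{-1/(2\beta + d)}$.

```python
# Minimal sketch (assumed, not from the lecture): d-dimensional kernel density
# estimate p_hat(x) = (1/n) sum_i h^{-d} K(||x - X_i|| / h) with a Gaussian
# kernel and a bandwidth of order h ~ n^{-1/(2*beta + d)}.
import numpy as np

def kde(x, data, h):
    """Evaluate the kernel density estimate at rows of x using rows of data."""
    n, d = data.shape
    dist2 = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=-1)   # (m, n)
    # (1/h^d) K(||u|| / h) for the standard Gaussian kernel K
    kernel_vals = np.exp(-0.5 * dist2 / h**2) / (2 * np.pi * h**2) ** (d / 2)
    return kernel_vals.mean(axis=1)

rng = np.random.default_rng(0)
n, d, beta = 2000, 2, 2.0                      # beta: assumed smoothness
data = rng.standard_normal((n, d))             # sample from a standard normal
h = n ** (-1.0 / (2 * beta + d))               # rate-optimal bandwidth order
print(kde(np.zeros((1, d)), data, h))          # close to (2*pi)^{-1} ~ 0.159
```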


High dimension and density estimation

The rate of convergence $n^{-2\beta/(2\beta+d)}$ is slow when the dimension $d$ is large.

Instead of estimating $p$ precisely, we have to settle for finding an adequate approximation that identifies the regions where $p$ puts large amounts of mass.

Consider
$$p_h(x) = E(\hat p(x)) = \int \frac{1}{h^d} K\!\left(\frac{\|x - u\|}{h}\right) p(u)\, du.$$

Then, applying $P(\|\hat p_h - p_h\|_\infty > \varepsilon) \le C e^{-nc\varepsilon^2}$,
$$\|\hat p_h - p_h\|_\infty = O_P\!\left(\sqrt{\frac{\log n}{n}}\right).$$

This rate of convergence does not depend on $d$, but how to choose $h$ is not clear.


In deep learning we may face cases where a distribution is concentrated near a lower-dimensional set. This causes problems for density estimation; the density may not even be well defined. Current approaches attempt to find a smoothed density that is well defined on a lower-dimensional representation.

The variational autoencoder (VAE) and the generative adversarial network (GAN) attempt to estimate the density through a lower-dimensional representation obtained via deep models.

We first focus on dimension reduction and go over the autoencoder.


Autoencoders


Goal: Given data $x \in \mathbb{R}^d$, learn a representation $z \in \mathbb{R}^k$, $k < d$, with $z = f(x)$, from which $x$ can be reconstructed well with minimal capacity.

If the dimension of $z$ were the same as that of $x$, the model could simply reproduce $x$; we want to compress $x$ as much as possible without losing too much information.

The network may be viewed as consisting of two parts: an encoder function $z = f(x)$ and a decoder that produces a reconstruction $\hat x = g(z)$, with $g(f(x)) \approx x$.


Shallow autoencoders

When $f(\cdot)$ is linear, the problem boils down to principal component analysis (PCA).

Let $z = w_1 x$ and $\hat x = w_2 z$, with $w_1 \in \mathbb{R}^{k \times d}$ and $w_2 \in \mathbb{R}^{d \times k}$, so that $\hat x = w_2 z = w_2 w_1 x$. Consider minimizing
$$E\|x - w_2 w_1 x\|_2^2$$
over $w_1$ and $w_2$.

The loss $E\|x - w_2 w_1 x\|_2^2$ can be made smaller by increasing the dimension of $z$.


Minimizing $E\|x - w_2 w_1 x\|_2^2$ over $w_1$ and $w_2$, it can be shown that the columns of $w_2$ are orthonormal ($w_2^T w_2 = I_{k \times k}$) and $w_1 = w_2^T$.

For any $w_1$ and $w_2$, $R = \{w_2 w_1 x : x \in \mathbb{R}^d\}$ lies in a $k$-dimensional linear subspace of $\mathbb{R}^d$. Let $V \in \mathbb{R}^{d \times k}$ be a matrix whose columns form an orthonormal basis of this subspace. Each vector in $R$ can be written as $Vy$ with $y \in \mathbb{R}^k$. For every $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^k$ we have
$$\|x - Vy\|_2^2 = \|x\|^2 + y^T V^T V y - 2 y^T V^T x = \|x\|^2 + \|y\|^2 - 2 y^T V^T x,$$
which is minimized when $y = V^T x$. Therefore $\hat x = V V^T x$ minimizes $\|x - \hat x\|_2^2$ over the subspace.
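A quick numerical check of the last claim (an illustrative NumPy sketch with assumed names, not from the lecture): for $V$ with orthonormal columns, $y = V^T x$ attains the smallest reconstruction error $\|x - Vy\|_2^2$.

```python
# Illustrative check (assumed code): for V with orthonormal columns,
# y = V^T x minimizes ||x - V y||^2, so V V^T x is the best reconstruction
# within the column space of V.
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 2
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal basis (d x k)
x = rng.standard_normal(d)

err = lambda y: np.sum((x - V @ y) ** 2)
y_star = V.T @ x                                   # claimed minimizer
perturbed = [y_star + 0.1 * rng.standard_normal(k) for _ in range(100)]
print(all(err(y_star) <= err(y) for y in perturbed))   # True
```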


Deep autoencoders

$f(\cdot)$ can be compositional, including linear, nonlinear (e.g., ReLU), convolutional layers, etc.
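For concreteness, a minimal sketch of such a compositional autoencoder, assuming PyTorch; the layer sizes, data, and training loop are illustrative choices, not taken from the lecture.

```python
# Hedged sketch (assumed architecture and sizes): a deep autoencoder whose
# encoder f and decoder g compose linear maps and ReLU nonlinearities.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 32))            # z = f(x), k = 32
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                        nn.Linear(256, 784))           # x_hat = g(z)

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, 784)                                # stand-in data batch
for _ in range(5):                                     # a few illustrative steps
    opt.zero_grad()
    loss = ((decoder(encoder(x)) - x) ** 2).mean()     # reconstruction error
    loss.backward()
    opt.step()
```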


Variational Autoencoders (VAE)

VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentations, and predictions of the future from static images.

An approximate Bayesian inference method called variational inference is used.

We sidetrack for a moment to study variational inference.


Approximate Inference: Variational Inference


Approximate inference via posterior approximation

In Bayesian inference, the posterior distribution
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\int p(x \mid v)\, p(v)\, dv}$$
is of interest.

For most models, the posterior is analytically intractable.

Variational inference attempts to approximate the posterior via $q(z \mid x)$.

Minimize the Kullback-Leibler divergence $KL(q(z|x) \,\|\, p(z|x))$ or $KL(p(z|x) \,\|\, q(z|x))$, where
$$KL(q(z \mid x) \,\|\, p(z \mid x)) = \int q(z \mid x) \log \frac{q(z \mid x)}{p(z \mid x)}\, dz.$$


Two types of KL divergence

For $KL(q(z|x) \,\|\, p(z|x)) = \int q(z|x) \log \frac{q(z|x)}{p(z|x)}\, dz$: if $p(z|x) = 0$, we must have $q(z|x) = 0$. Minimizing $KL(q\|p)$ forces $q$ to choose a single mode.

For $KL(p(z|x) \,\|\, q(z|x)) = \int p(z|x) \log \frac{p(z|x)}{q(z|x)}\, dz$: if $p(z|x) > 0$, we must have $q(z|x) > 0$. Minimizing $KL(p\|q)$ forces $q$ to cover both modes.

source: Bishop (2006) Pattern recognition and machine learning. Springer
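The mode-seeking vs. mode-covering behaviour can be seen numerically; the following sketch (illustrative and assumed, not from Bishop or the lecture) fits a single Gaussian $q$ to a bimodal $p$ on a grid, minimizing each direction of the KL divergence by brute force.

```python
# Illustrative sketch (assumed): fit a single Gaussian q to a bimodal p,
# minimizing either KL(q||p) (reverse, mode-seeking) or KL(p||q)
# (forward, mode-covering) over a coarse grid of (mu, sd).
import numpy as np

z = np.linspace(-8, 8, 2001)
dz = z[1] - z[0]

def normal(z, mu, sd):
    return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

p = 0.5 * normal(z, -3, 0.7) + 0.5 * normal(z, 3, 0.7)      # two modes

def kl(a, b):                                               # discretized KL(a||b)
    return np.sum(a * (np.log(a + 1e-300) - np.log(b + 1e-300))) * dz

candidates = [(mu, sd) for mu in np.linspace(-4, 4, 81)
                       for sd in np.linspace(0.3, 5.0, 48)]
best_reverse = min(candidates, key=lambda ms: kl(normal(z, *ms), p))
best_forward = min(candidates, key=lambda ms: kl(p, normal(z, *ms)))
print("KL(q||p):", best_reverse)    # sits on one mode (mu near +/-3, small sd)
print("KL(p||q):", best_forward)    # spreads over both (mu near 0, large sd)
```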


Two types of approximate inference: variational inference vs. expectation propagation

Approximate inference minimizing $KL(p(z|x) \,\|\, q(z|x))$ includes expectation propagation (EP).

$KL(p(z|x) \,\|\, q(z|x))$ is harder to evaluate.

Approximate inference minimizing $KL(q(z|x) \,\|\, p(z|x))$ is called variational inference (VI).

VI is more popular than EP due to computational feasibility.


Specifying variational distributions

Variational inference is originally formulated as an optimization problem in which the quantity being optimized is a functional. The solution is obtained by exploring all possible input functions to find the one that maximizes, or minimizes, the functional. We can consider minimizing the KL divergence over the function $q$ itself; this approach is discussed later with the mean field method.

We first consider the case in which $q$ is indexed by a finite-dimensional parameter.


Variational Inference: Marginal likelihood, KL divergence and ELBO

Variational inference minimizes $KL(q(z|x) \,\|\, p(z|x))$.

For any distribution $q(z \mid x; \phi)$ indexed by $\phi$, we have
$$\begin{aligned}
\log p(x; \theta) &= \int q(z \mid x; \phi) \log p(x; \theta)\, dz
 = \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{p(z \mid x; \theta)}\, dz \\
&= \int q(z \mid x; \phi) \log \left[ \frac{p(x, z; \theta)}{p(z \mid x; \theta)} \cdot \frac{q(z \mid x; \phi)}{q(z \mid x; \phi)} \right] dz \\
&= \int q(z \mid x; \phi) \left[ \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)} + \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)} \right] dz \\
&= \int q(z \mid x; \phi) \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)}\, dz + \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, dz
\end{aligned}$$


Let
$$L(\theta, \phi) = \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, dz.$$

Note that the first integral in the last display above is
$$\int q(z \mid x; \phi) \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)}\, dz \equiv KL(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta)).$$

Then
$$\log p(x; \theta) = KL(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta)) + L(\theta, \phi).$$

Since $\log p(x; \theta)$ does not depend on $\phi$, maximizing $L(\theta, \phi)$ over $\phi$ minimizes $KL(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))$.


Variational Inference: ELBO

For any distribution $q(z \mid x; \phi)$ indexed by $\phi$, we have
$$\log p(x; \theta) = \log\!\left( \int p(x, z; \theta)\, dz \right)
 = \log\!\left( \int \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz \right)
 \ge \int \log\!\left( \frac{p(x, z; \theta)}{q(z \mid x; \phi)} \right) q(z \mid x; \phi)\, dz \equiv L(\theta, \phi).$$

Recall Jensen's inequality: $E(f(X)) \ge f(E(X))$ if $f$ is convex. Here the inequality goes the other way because $\log$ is concave, giving $\log E_q(\cdot) \ge E_q(\log(\cdot))$.

We call $L(\theta, \phi)$ the evidence lower bound (ELBO).
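The bound can be checked numerically in a model where the evidence is available in closed form; the following sketch (an assumed toy example, not from the lecture) uses $p(z) = N(0, 1)$ and $p(x \mid z) = N(z, 1)$, so that $p(x) = N(0, 2)$ and the posterior is $N(x/2, 1/2)$.

```python
# Illustrative check (assumed toy model): the Monte Carlo ELBO is below
# log p(x) for a generic q, and matches it when q is the exact posterior.
import numpy as np

rng = np.random.default_rng(2)
x = 1.3

def log_normal(v, mean, var):
    return -0.5 * ((v - mean) ** 2 / var + np.log(2 * np.pi * var))

def elbo(m, s2, n=400_000):
    z = m + np.sqrt(s2) * rng.standard_normal(n)           # z ~ q = N(m, s2)
    return np.mean(log_normal(x, z, 1.0)                    # log p(x|z)
                   + log_normal(z, 0.0, 1.0)                # log p(z)
                   - log_normal(z, m, s2))                  # - log q(z|x)

print(round(log_normal(x, 0.0, 2.0), 4))   # log evidence, log N(x; 0, 2)
print(round(elbo(0.0, 1.0), 4))            # some q: strictly smaller
print(round(elbo(x / 2, 0.5), 4))          # q = exact posterior: equal
```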


Variational Inference: ELBO as a penalized expected likelihood

We can express
$$\begin{aligned}
\mathrm{ELBO} = L(\theta, \phi) &= \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, dz \\
&= E_{q(z|x;\phi)}[\log p(x, z; \theta)] - E_{q(z|x;\phi)}[\log q(z \mid x; \phi)] \\
&= E[\log p(x \mid z; \theta)] + E[\log p(z)] - E[\log q(z \mid x; \phi)] \\
&= E[\log p(x \mid z; \theta)] - KL(q(z \mid x; \phi) \,\|\, p(z))
\end{aligned}$$

Maximizing the ELBO is equivalent to maximizing $E[\log p(x \mid z; \theta)]$ with penalty $KL(q(z \mid x; \phi) \,\|\, p(z))$.


Optimizing ELBO with respect to parameters

Optimizing the ELBO requires evaluating $\frac{\partial}{\partial\theta} L(\theta, \phi)$ and $\frac{\partial}{\partial\phi} L(\theta, \phi)$.

First,
$$\frac{\partial}{\partial\theta} L(\theta, \phi) = \frac{\partial}{\partial\theta} E_{q(z|x;\phi)}[\log p(x, z; \theta)]
 = \int \frac{p'(x \mid z; \theta)}{p(x \mid z; \theta)}\, q(z \mid x; \phi)\, dz,$$
where $p'$ denotes the derivative with respect to $\theta$.

This term is usually computed stochastically.


Optimizing ELBO

Optimizing the ELBO also requires $\frac{\partial}{\partial\phi} L(\theta, \phi)$. Writing $q'$ for the derivative of $q$ with respect to $\phi$,
$$\begin{aligned}
\frac{\partial}{\partial\phi} L(\theta, \phi)
&= \frac{\partial}{\partial\phi} E_{q(z|x;\phi)}[\log p(x, z; \theta)] - \frac{\partial}{\partial\phi} E_{q(z|x;\phi)}[\log q(z \mid x; \phi)] \\
&= \int [\log p(x, z; \theta)]\, \frac{q'(z \mid x; \phi)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz
 - \int \frac{q'(z \mid x; \phi)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz
 - \int [\log q(z \mid x; \phi)]\, \frac{q'(z \mid x; \phi)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz
\end{aligned}$$

Computation is complicated because the parameter $\phi$ appears in the distribution over which the expectation is taken. An EM-type iterative stochastic evaluation of each term is not guaranteed to converge.

One may attempt to reparameterize the ELBO so that the expectation is over a distribution free of parameters.
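A small numerical comparison (an assumed illustration, not from the lecture) makes the point: for $\frac{\partial}{\partial\mu} E_{z \sim N(\mu, 1)}[z^2] = 2\mu$, both the score-function estimator implied by the expansion above and the reparameterized estimator of the next slides are unbiased, but the reparameterized one has far lower variance.

```python
# Illustrative sketch (assumed): score-function vs. reparameterized
# Monte Carlo estimators of d/dmu E_{z~N(mu,1)}[z^2]; exact value is 2*mu.
import numpy as np

rng = np.random.default_rng(3)
mu, n = 1.5, 100_000
eps = rng.standard_normal(n)
z = mu + eps                                  # z ~ N(mu, 1)

score_terms = (z ** 2) * (z - mu)             # f(z) * d/dmu log N(z; mu, 1)
reparam_terms = 2.0 * z                       # f'(mu + eps) * dg/dmu, dg/dmu = 1

print(score_terms.mean(), reparam_terms.mean())   # both near 2*mu = 3.0
print(score_terms.var(), reparam_terms.var())     # score variance is much larger
```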


Reparameterization trick

In evaluating $\frac{\partial}{\partial\phi} \int f(x)\, p(x; \phi)\, dx$, let $x = g(\phi, \varepsilon)$, where $g$ is differentiable and $p(\varepsilon)$ is free of $\phi$. That is,
$$p(g(\phi, \varepsilon); \phi)\, \left|\frac{\partial g(\phi, \varepsilon)}{\partial \varepsilon}\right| = p(\varepsilon).$$

Then
$$\int f(x)\, p(x; \phi)\, dx = \int f(g(\phi, \varepsilon)) \left|\frac{\partial g(\phi, \varepsilon)}{\partial \varepsilon}\right| p(g(\phi, \varepsilon); \phi)\, d\varepsilon
 = \int f(g(\phi, \varepsilon))\, p(\varepsilon)\, d\varepsilon.$$


For example, when $p(x; \phi)$ is $N(\mu, \sigma^2)$ and $x = g(\phi, \varepsilon) = \mu + \varepsilon\sigma$,
$$\int f(x)\, p(x; \phi)\, dx = \int f(\mu + \varepsilon\sigma)\, p(\varepsilon)\, d\varepsilon,$$
where $\varepsilon = (x - \mu)/\sigma$ and $p(\varepsilon)$ is the standard normal density.

Then
$$\frac{\partial}{\partial\phi} \int f(x)\, p(x; \phi)\, dx = \int f'(g(\phi, \varepsilon))\, \frac{\partial g(\phi, \varepsilon)}{\partial\phi}\, p(\varepsilon)\, d\varepsilon.$$


Back to evaluating the ELBO, let
$$h(x, z, \theta, \phi) = \log p(x, z; \theta) - \log q(z \mid x; \phi).$$

We need to evaluate
$$L(\theta, \phi) = \int h(x, z, \theta, \phi)\, q(z \mid x; \phi)\, dz = \int h(x, g(\phi, \varepsilon), \theta, \phi)\, p(\varepsilon)\, d\varepsilon$$
and
$$\frac{\partial}{\partial\phi} L(\theta, \phi) = \int \frac{\partial}{\partial\phi} h(x, g(\phi, \varepsilon), \theta, \phi)\, p(\varepsilon)\, d\varepsilon.$$

Evaluation is much simpler since the expectation is over a distribution free of the parameter $\phi$. This can be evaluated stochastically.


Variational Autoencoders


Variational Autoencoder: Example

Goal: Given data $x$, learn a representation $z = f(x)$ from which $x$ can be reconstructed well with minimal capacity.

ELBO: $L(\theta, \phi) = E_{q(z|x;\phi)}[\log p(x \mid z; \theta) + \log p(z) - \log q(z \mid x; \phi)]$

Decoder: Assume $p(x \mid z; \theta) \sim N(\mu_{x|z}, \Sigma_{x|z})$, where
$$\mu_{x|z} = a_K(a_{K-1}(\cdots a_2(a_1(z; \theta_1); \theta_2) \cdots; \theta_{K-1}); \theta_K),$$
$$\mathrm{diag}(\Sigma_{x|z}) = b(a_{K-1}(\cdots a_2(a_1(z; \theta_1); \theta_2) \cdots; \theta_{K-1}); \theta_b).$$
Parameters for $\mu_{x|z}$ and $\mathrm{diag}(\Sigma_{x|z})$ are shared up to the $(K-1)$th layer.

Assume $p(z) \sim N(0, I)$.


Encoder: Assume $q(z \mid x) \sim N(\mu_{z|x}, \Sigma_{z|x})$, where
$$\mu_{z|x} = f_K(f_{K-1}(\cdots f_2(f_1(x; \phi_1); \phi_2) \cdots; \phi_{K-1}); \phi_K),$$
$$\mathrm{diag}(\Sigma_{z|x}) = g(f_{K-1}(\cdots f_2(f_1(x; \phi_1); \phi_2) \cdots; \phi_{K-1}); \phi_g).$$
Parameters for $\mu_{z|x}$ and $\mathrm{diag}(\Sigma_{z|x})$ are shared up to the $(K-1)$th layer.


ELBO:
$$L(\theta, \phi) = E_{q(z|x;\phi)}[\log p(x \mid z)] + E_{q(z|x;\phi)}[\log p(z) - \log q(z \mid x; \phi)].$$

For the second term, a closed form can be obtained thanks to the normality assumption on $q(z \mid x; \phi)$:
$$\begin{aligned}
E_{q(z|x;\phi)}&[\log p(z) - \log q(z \mid x; \phi)] \\
&\propto -E\Big(\sum_{i=1}^n z_i^T z_i \,\Big|\, x; \phi\Big) + \sum_{i=1}^n \log|\Sigma_{z_i|x_i}| + E \sum_{i=1}^n (z_i - \mu_{z_i|x_i})^T \Sigma_{z_i|x_i}^{-1} (z_i - \mu_{z_i|x_i}) \\
&= -\sum_{i=1}^n \mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^n \mathrm{tr}(\mu_{z_i|x_i} \mu_{z_i|x_i}^T) + \sum_{i=1}^n \log|\Sigma_{z_i|x_i}| + n\, \mathrm{tr}(I)
\end{aligned}$$
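As a sanity check on the closed form (an assumed NumPy illustration, not from the lecture): for a single observation with $q(z|x) = N(\mu, \mathrm{diag}(\sigma^2))$ and $p(z) = N(0, I)$, the term $E_q[\log p(z) - \log q(z|x)]$ equals $\tfrac12\{-\sum_j(\sigma_j^2 + \mu_j^2) + \sum_j \log \sigma_j^2 + k\}$, which a Monte Carlo estimate should reproduce.

```python
# Illustrative check (assumed): closed-form E_q[log p(z) - log q(z|x)] for a
# diagonal Gaussian q and a standard normal prior vs. a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.5, -1.0, 0.3])
sigma = np.array([0.8, 1.3, 0.5])
k = mu.size

closed = 0.5 * (-(np.sum(sigma**2) + np.sum(mu**2)) + np.sum(np.log(sigma**2)) + k)

z = mu + sigma * rng.standard_normal((200_000, k))
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=1)
print(closed, np.mean(log_p - log_q))        # the two values should agree
```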


For the first term, use the reparameterization trick,
$$E_{q(z|x;\phi)}[\log p(x \mid z)] = E_{p(\varepsilon)}[\log p(x \mid g(\varepsilon, x; \phi))],$$
and estimate it stochastically using
$$E_{q(z|x;\phi)}[\log p(x \mid z)] \approx \frac{1}{M} \sum_{m=1}^M \log p(x \mid g(\varepsilon^{(m)}, x; \phi)).$$


Putting things together, maximize a stochastic version of the ELBO,
$$L(\theta, \phi) = \sum_{i=1}^n \frac{1}{M} \sum_{m=1}^M \log p_\theta(x_i \mid g(\varepsilon_{mi}, x_i; \phi))
 - \sum_{i=1}^n \mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^n \mathrm{tr}(\mu_{z_i|x_i} \mu_{z_i|x_i}^T) + \sum_{i=1}^n \log|\Sigma_{z_i|x_i}|,$$
where $g(\varepsilon_{mi}, x_i; \phi) = \mu_{z_i|x_i} + \Sigma_{z_i|x_i}^{1/2} \varepsilon_{mi}$ and $\varepsilon_{mi} \sim N(0, I)$.

To generate data, sample $z$ from the prior and generate $x$ from $p(x \mid z; \theta)$.
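The objective above can be coded compactly with automatic differentiation; the following is a hedged sketch assuming PyTorch, a Bernoulli decoder in place of the Gaussian one above (a common simplification), $M = 1$, a diagonal posterior covariance, and illustrative layer sizes; none of these choices are prescribed by the lecture.

```python
# Hedged VAE sketch (assumed: PyTorch, Bernoulli decoder, M = 1, toy sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d=784, k=20, hidden=400):
        super().__init__()
        self.enc = nn.Linear(d, hidden)          # shared encoder layers f_1..f_{K-1}
        self.enc_mu = nn.Linear(hidden, k)       # mu_{z|x}
        self.enc_logvar = nn.Linear(hidden, k)   # log diag(Sigma_{z|x})
        self.dec = nn.Linear(k, hidden)          # decoder layers a_1..a_{K-1}
        self.dec_out = nn.Linear(hidden, d)      # logits of p(x|z)

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)               # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparameterization: z = g(eps, x; phi)
        logits = self.dec_out(F.relu(self.dec(z)))
        return logits, mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # -E_q[log p(x|z)], single Monte Carlo sample (M = 1), Bernoulli likelihood
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form, as derived on the preceding slides
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784).round()                  # stand-in binary data batch
for _ in range(5):                               # a few illustrative steps
    opt.zero_grad()
    loss = neg_elbo(x, *model(x))
    loss.backward()
    opt.step()

# To generate data: sample z from the prior and pass it through the decoder.
with torch.no_grad():
    z = torch.randn(8, 20)
    samples = torch.sigmoid(model.dec_out(F.relu(model.dec(z))))
```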


Variational Autoencoder: Visualization of Learned Manifold

source: Kingma and Welling, 2014


Mean Field Variational Inference


Mean Field VI

Variational inference is named after variational calculus, in which we optimize a functional over all possible input functions. Instead of assuming a parametric model for $q$ and optimizing over its parameter, we can consider minimizing the KL divergence over the function $q$ itself. The mean field method takes this approach.

Assume the variational distribution over the latent variables factorizes as $q(z_1, z_2, \dots, z_m \mid x) = \prod_{j=1}^m q(z_j)$. Denote $q(z_j)$ by $q_j$.

This family of approximations may not contain the true posterior because the latent variables are usually dependent.


Treating $q_1, \dots, q_{j-1}, q_{j+1}, \dots, q_m$ as fixed, the ELBO can be expressed as
$$\begin{aligned}
L(\theta, q_j) &= \int \prod_{i=1}^m q_i \Big\{ \log p(x, z) - \sum_{i=1}^m \log q_i \Big\}\, dz \\
&= \int q_j \Big\{ \int \log p(x, z) \prod_{i \ne j} q_i\, dz_i \Big\}\, dz_j - \int q_j \log q_j\, dz_j + c \\
&= \int q_j\, E_{i \ne j}(\log p(x, z))\, dz_j - \int q_j \log q_j\, dz_j + c \\
&= \int q_j \log \tilde p(x, z_j)\, dz_j - \int q_j \log q_j\, dz_j + c,
\end{aligned}$$
where we let $\int \log p(x, z) \prod_{i \ne j} q_i\, dz_i \equiv E_{i \ne j}(\log p(x, z))$ and $\log \tilde p(x, z_j) = E_{i \ne j}(\log p(x, z)) + c$.


ELBO:
$$L(\theta, q_j) = \int q_j \log \tilde p(x, z_j)\, dz_j - \int q_j \log q_j\, dz_j + c = -KL(q_j \,\|\, \tilde p(x, z_j)) + c$$

If we fix $q_i$ for $i \ne j$, the ELBO is, up to a constant, the negative KL divergence between $q_j$ and $\tilde p(x, z_j)$.

The ELBO is therefore maximized when $q_j^* = \tilde p(x, z_j)$, that is,
$$q_j^* = \frac{\exp[E_{i \ne j}\{\log p(x, z)\}]}{\int \exp[E_{i \ne j}\{\log p(x, z)\}]\, dz_j}.$$
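A classical toy example (assumed here for illustration; see Bishop (2006), Chapter 10, for the general treatment) is the mean-field approximation of a bivariate Gaussian: the optimal factors are Gaussians whose means satisfy simple coordinate updates, and iterating them converges to the true marginal means.

```python
# Illustrative sketch (assumed): coordinate-ascent mean-field updates for a
# bivariate Gaussian target p(z1, z2) = N(m, Lambda^{-1}). The optimal q_j
# are Gaussians with variances 1/Lambda_jj and means given by the updates below.
import numpy as np

m = np.array([1.0, -2.0])                    # target mean
Lam = np.array([[2.0, 0.9],                  # target precision matrix
                [0.9, 1.5]])

mu1, mu2 = 0.0, 0.0                          # initial means of q1, q2
for _ in range(20):
    # each update maximizes the ELBO in q_j with the other factor held fixed
    mu1 = m[0] - (Lam[0, 1] / Lam[0, 0]) * (mu2 - m[1])
    mu2 = m[1] - (Lam[1, 0] / Lam[1, 1]) * (mu1 - m[0])

print(mu1, mu2)                              # converges to the true means (1, -2)
```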


Expectation propagation


Another approximate inference method: minimize $KL(p \,\|\, q)$ (Minka, 2001a; Minka, 2001b).

Consider the problem of minimizing $KL(p \,\|\, q)$ with respect to $q(z)$ when $p(z)$ is a fixed distribution and $q(z)$ is a member of the exponential family,
$$q(z) = h(z)\, g(\eta) \exp(\eta^T u(z)).$$

The KL divergence then has the form
$$KL(p \,\|\, q) = -\log g(\eta) - \eta^T E_{p(z)}(u(z)) + \text{const}.$$


Setting the derivative of $KL(p \,\|\, q) = -\log g(\eta) - \eta^T E_{p(z)}(u(z)) + \text{const}$ with respect to $\eta$ equal to zero,
$$-\frac{\partial}{\partial\eta} \log g(\eta) = E_{p(z)}(u(z)).$$

Note that
$$-\frac{\partial}{\partial\eta} \log g(\eta) = E_{q(z)}(u(z)).$$

Then
$$E_{q(z)}(u(z)) = E_{p(z)}(u(z)).$$

The optimal solution corresponds to matching the expected sufficient statistics. This is called moment matching.
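A small sketch (assumed for illustration, not from the lecture) of what moment matching does in practice: minimizing $KL(p \,\|\, q)$ over Gaussian $q$ simply sets the mean and variance of $q$ to those of $p$, producing a single broad Gaussian even when $p$ is bimodal.

```python
# Illustrative sketch (assumed): moment matching a Gaussian q to a bimodal p
# by using the empirical mean and variance of samples from p.
import numpy as np

rng = np.random.default_rng(5)
samples = np.concatenate([rng.normal(-3, 0.7, 50_000),   # samples from p,
                          rng.normal(3, 0.7, 50_000)])   # a two-mode mixture

mu_q, var_q = samples.mean(), samples.var()               # matched moments
print(mu_q, var_q)   # mean near 0, variance near 9.5: one broad Gaussian
```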
