Deep Learning: A Statistical Perspective
stat.snu.ac.kr/mcp/Lecture_9_VAE.pdf · 2019-10-30



Unsupervised learning


In unsupervised learning, we try to learn about data without labels.

Unsupervised learning tasks involve estimating the density of the input $x$, say $p(x)$, or dimension reduction.

Dimension reduction tasks are sometimes called feature learning and include PCA, the self-organizing map, multidimensional scaling, and the autoencoder.

Density estimation tasks include cluster analysis, anomaly detection, the variational autoencoder, and the generative adversarial network.


Dimensionality and density estimation

To get a sense of the effect of dimensionality on estimating densities, consider kernel density estimators. Assume that $X_i \in C \subset \mathbb{R}^d$, where $C$ is compact:
$$\hat p(x) = n^{-1} \sum_{i=1}^n \frac{1}{h^d} K\!\left(\frac{\|x - X_i\|}{h}\right)$$

Let $\beta$ and $L$ be positive numbers. Given a vector $s = (s_1, \dots, s_d)$, let $|s| = s_1 + \cdots + s_d$ and $D^s = \dfrac{\partial^{|s|}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$. Define the Hölder class
$$\Sigma(\beta, L) = \{\, g : |D^s g(x) - D^s g(y)| \le L \|x - y\|^{\beta - |s|} \text{ for all } s \text{ with } |s| = \lfloor\beta\rfloor \text{ and all } x, y \,\},$$
where $\lfloor\beta\rfloor$ is the greatest integer strictly less than $\beta$.

Then the MSE satisfies
$$\mathrm{MSE} \asymp h^{2\beta} + \frac{1}{n h^d}.$$

If $h \asymp n^{-1/(2\beta + d)}$,
$$\sup_{p \in \Sigma(\beta, L)} E \int (\hat p_h(x) - p(x))^2\, dx \le c\, n^{-2\beta/(2\beta + d)}.$$
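As a concrete illustration of the estimator above, here is a minimal NumPy sketch (an illustration, not code from the lecture; the function name `kde` and the choice of a Gaussian kernel are my assumptions) that evaluates a $d$-dimensional kernel density estimate with the rate-driven bandwidth $h = n^{-1/(2\beta + d)}$.

```python
# Minimal sketch (assumed, not from the lecture): d-dimensional kernel density
# estimate p_hat(x) = (1/n) sum_i h^{-d} K(||x - X_i|| / h) with a Gaussian
# kernel and a bandwidth of order h ~ n^{-1/(2*beta + d)}.
import numpy as np

def kde(x, data, h):
    """Evaluate the kernel density estimate at rows of x using rows of data."""
    n, d = data.shape
    dist2 = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=-1)   # (m, n)
    # (1/h^d) K(||u|| / h) for the standard Gaussian kernel K
    kernel_vals = np.exp(-0.5 * dist2 / h**2) / (2 * np.pi * h**2) ** (d / 2)
    return kernel_vals.mean(axis=1)

rng = np.random.default_rng(0)
n, d, beta = 2000, 2, 2.0                      # beta: assumed smoothness
data = rng.standard_normal((n, d))             # sample from a standard normal
h = n ** (-1.0 / (2 * beta + d))               # rate-optimal bandwidth order
print(kde(np.zeros((1, d)), data, h))          # close to (2*pi)^{-1} ~ 0.159
```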


High dimension and density estimation

The rate of convergence $n^{-2\beta/(2\beta+d)}$ is slow when the dimension $d$ is large.

Instead of estimating $p$ precisely, we have to settle for finding an adequate approximation that identifies the regions where $p$ puts large amounts of mass.

Consider
$$p_h(x) = E(\hat p(x)) = \int \frac{1}{h^d} K\!\left(\frac{\|x - u\|}{h}\right) p(u)\, du.$$

Then, applying $P(\|\hat p_h - p_h\|_\infty > \varepsilon) \le C e^{-nc\varepsilon^2}$,
$$\|\hat p_h - p_h\|_\infty = O_P\!\left(\sqrt{\frac{\log n}{n}}\right).$$

This rate of convergence does not depend on $d$, but how to choose $h$ is not clear.


In deep learning we may face cases where a distribution is concentrated near a lower-dimensional set. This causes problems for density estimation; the density may not even be well defined. Current approaches attempt to find a smoothed density that is well defined on a lower-dimensional representation.

The variational autoencoder (VAE) and the generative adversarial network (GAN) attempt to estimate the density through a lower-dimensional representation obtained via deep models.

We first focus on dimension reduction and go over the autoencoder.


Autoencoders


Goal: Given data $x \in \mathbb{R}^d$, learn a representation $z \in \mathbb{R}^k$, $k < d$, with $z = f(x)$, from which $x$ can be reconstructed well with minimal capacity.

If the dimension of $z$ were the same as that of $x$, the model could simply reproduce $x$; we want to compress $x$ as much as possible without losing too much information.

The network may be viewed as consisting of two parts: an encoder function $z = f(x)$ and a decoder that produces a reconstruction $\hat x = g(z)$, with $g(f(x)) \approx x$.


Shallow autoencoders

When $f(\cdot)$ is linear, the problem boils down to principal component analysis (PCA).

Let $z = w_1 x$ and $\hat x = w_2 z$, with $w_1 \in \mathbb{R}^{k \times d}$ and $w_2 \in \mathbb{R}^{d \times k}$, so that $\hat x = w_2 z = w_2 w_1 x$. Consider minimizing
$$E\|x - w_2 w_1 x\|_2^2$$
over $w_1$ and $w_2$.

The loss $E\|x - w_2 w_1 x\|_2^2$ can be made smaller by increasing the dimension of $z$.


Minimizing $E\|x - w_2 w_1 x\|_2^2$ over $w_1$ and $w_2$, it can be shown that the columns of $w_2$ are orthonormal ($w_2^T w_2 = I_{k \times k}$) and $w_1 = w_2^T$.

For any $w_1$ and $w_2$, $R = \{w_2 w_1 x : x \in \mathbb{R}^d\}$ lies in a $k$-dimensional linear subspace of $\mathbb{R}^d$. Let $V \in \mathbb{R}^{d \times k}$ be a matrix whose columns form an orthonormal basis of this subspace. Each vector in $R$ can be written as $Vy$ with $y \in \mathbb{R}^k$. For every $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^k$ we have
$$\|x - Vy\|_2^2 = \|x\|^2 + y^T V^T V y - 2 y^T V^T x = \|x\|^2 + \|y\|^2 - 2 y^T V^T x,$$
which is minimized when $y = V^T x$. Therefore $\hat x = V V^T x$ minimizes $\|x - \hat x\|_2^2$ over the subspace.
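A quick numerical check of the last claim (an illustrative NumPy sketch with assumed names, not from the lecture): for $V$ with orthonormal columns, $y = V^T x$ attains the smallest reconstruction error $\|x - Vy\|_2^2$.

```python
# Illustrative check (assumed code): for V with orthonormal columns,
# y = V^T x minimizes ||x - V y||^2, so V V^T x is the best reconstruction
# within the column space of V.
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 2
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal basis (d x k)
x = rng.standard_normal(d)

err = lambda y: np.sum((x - V @ y) ** 2)
y_star = V.T @ x                                   # claimed minimizer
perturbed = [y_star + 0.1 * rng.standard_normal(k) for _ in range(100)]
print(all(err(y_star) <= err(y) for y in perturbed))   # True
```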


Deep autoencoders

$f(\cdot)$ can be compositional, including linear, nonlinear (e.g., ReLU), convolutional layers, etc.
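For concreteness, a minimal sketch of such a compositional autoencoder, assuming PyTorch; the layer sizes, data, and training loop are illustrative choices, not taken from the lecture.

```python
# Hedged sketch (assumed architecture and sizes): a deep autoencoder whose
# encoder f and decoder g compose linear maps and ReLU nonlinearities.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 32))            # z = f(x), k = 32
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                        nn.Linear(256, 784))           # x_hat = g(z)

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, 784)                                # stand-in data batch
for _ in range(5):                                     # a few illustrative steps
    opt.zero_grad()
    loss = ((decoder(encoder(x)) - x) ** 2).mean()     # reconstruction error
    loss.backward()
    opt.step()
```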


Variational Autoencoders (VAE)

VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentations, and predictions of the future from static images.

An approximate Bayesian inference method called variational inference is used.

We sidetrack for a moment to study variational inference.


Approximate Inference: Variational Inference


Approximate inference via posterior approximation

In Bayesian inference, the posterior distribution
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\int p(x \mid v)\, p(v)\, dv}$$
is of interest.

For most models, the posterior is analytically intractable.

Variational inference attempts to approximate the posterior via $q(z \mid x)$.

Minimize the Kullback-Leibler divergence $KL(q(z|x) \,\|\, p(z|x))$ or $KL(p(z|x) \,\|\, q(z|x))$, where
$$KL(q(z \mid x) \,\|\, p(z \mid x)) = \int q(z \mid x) \log \frac{q(z \mid x)}{p(z \mid x)}\, dz.$$


Two types of KL divergence

For $KL(q(z|x) \,\|\, p(z|x)) = \int q(z|x) \log \frac{q(z|x)}{p(z|x)}\, dz$: if $p(z|x) = 0$, we must have $q(z|x) = 0$. Minimizing $KL(q\|p)$ forces $q$ to choose a single mode.

For $KL(p(z|x) \,\|\, q(z|x)) = \int p(z|x) \log \frac{p(z|x)}{q(z|x)}\, dz$: if $p(z|x) > 0$, we must have $q(z|x) > 0$. Minimizing $KL(p\|q)$ forces $q$ to cover both modes.

source: Bishop (2006) Pattern recognition and machine learning. Springer
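The mode-seeking vs. mode-covering behaviour can be seen numerically; the following sketch (illustrative and assumed, not from Bishop or the lecture) fits a single Gaussian $q$ to a bimodal $p$ on a grid, minimizing each direction of the KL divergence by brute force.

```python
# Illustrative sketch (assumed): fit a single Gaussian q to a bimodal p,
# minimizing either KL(q||p) (reverse, mode-seeking) or KL(p||q)
# (forward, mode-covering) over a coarse grid of (mu, sd).
import numpy as np

z = np.linspace(-8, 8, 2001)
dz = z[1] - z[0]

def normal(z, mu, sd):
    return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

p = 0.5 * normal(z, -3, 0.7) + 0.5 * normal(z, 3, 0.7)      # two modes

def kl(a, b):                                               # discretized KL(a||b)
    return np.sum(a * (np.log(a + 1e-300) - np.log(b + 1e-300))) * dz

candidates = [(mu, sd) for mu in np.linspace(-4, 4, 81)
                       for sd in np.linspace(0.3, 5.0, 48)]
best_reverse = min(candidates, key=lambda ms: kl(normal(z, *ms), p))
best_forward = min(candidates, key=lambda ms: kl(p, normal(z, *ms)))
print("KL(q||p):", best_reverse)    # sits on one mode (mu near +/-3, small sd)
print("KL(p||q):", best_forward)    # spreads over both (mu near 0, large sd)
```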


Two types of approximate inference: variational inference vs. expectation propagation

Approximate inference minimizing $KL(p(z|x) \,\|\, q(z|x))$ includes expectation propagation (EP).

$KL(p(z|x) \,\|\, q(z|x))$ is harder to evaluate.

Approximate inference minimizing $KL(q(z|x) \,\|\, p(z|x))$ is called variational inference (VI).

VI is more popular than EP due to computational feasibility.


Specifying variational distributions

Variational inference is originally formulated as an optimization problem in which the quantity being optimized is a functional. The solution is obtained by exploring all possible input functions to find the one that maximizes, or minimizes, the functional. We can consider minimizing the KL divergence over the function $q$ itself; this approach is discussed later with the mean field method.

We first consider the case in which $q$ is indexed by a finite-dimensional parameter.


Variational Inference: Marginal likelihood, KL divergence and ELBO

Variational inference minimizes $KL(q(z|x) \,\|\, p(z|x))$.

For any distribution $q(z \mid x; \phi)$ indexed by $\phi$, we have
$$\begin{aligned}
\log p(x; \theta) &= \int q(z \mid x; \phi) \log p(x; \theta)\, dz
 = \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{p(z \mid x; \theta)}\, dz \\
&= \int q(z \mid x; \phi) \log \left[ \frac{p(x, z; \theta)}{p(z \mid x; \theta)} \cdot \frac{q(z \mid x; \phi)}{q(z \mid x; \phi)} \right] dz \\
&= \int q(z \mid x; \phi) \left[ \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)} + \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)} \right] dz \\
&= \int q(z \mid x; \phi) \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)}\, dz + \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, dz
\end{aligned}$$


Let
$$L(\theta, \phi) = \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, dz.$$

Note that the first integral in the last display above is
$$\int q(z \mid x; \phi) \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)}\, dz \equiv KL(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta)).$$

Then
$$\log p(x; \theta) = KL(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta)) + L(\theta, \phi).$$

Since $\log p(x; \theta)$ does not depend on $\phi$, maximizing $L(\theta, \phi)$ over $\phi$ minimizes $KL(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))$.


Variational Inference: ELBO

For any distribution $q(z \mid x; \phi)$ indexed by $\phi$, we have
$$\log p(x; \theta) = \log\!\left( \int p(x, z; \theta)\, dz \right)
 = \log\!\left( \int \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz \right)
 \ge \int \log\!\left( \frac{p(x, z; \theta)}{q(z \mid x; \phi)} \right) q(z \mid x; \phi)\, dz \equiv L(\theta, \phi).$$

Recall Jensen's inequality: $E(f(X)) \ge f(E(X))$ if $f$ is convex. Here the inequality goes the other way because $\log$ is concave, giving $\log E_q(\cdot) \ge E_q(\log(\cdot))$.

We call $L(\theta, \phi)$ the evidence lower bound (ELBO).
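The bound can be checked numerically in a model where the evidence is available in closed form; the following sketch (an assumed toy example, not from the lecture) uses $p(z) = N(0, 1)$ and $p(x \mid z) = N(z, 1)$, so that $p(x) = N(0, 2)$ and the posterior is $N(x/2, 1/2)$.

```python
# Illustrative check (assumed toy model): the Monte Carlo ELBO is below
# log p(x) for a generic q, and matches it when q is the exact posterior.
import numpy as np

rng = np.random.default_rng(2)
x = 1.3

def log_normal(v, mean, var):
    return -0.5 * ((v - mean) ** 2 / var + np.log(2 * np.pi * var))

def elbo(m, s2, n=400_000):
    z = m + np.sqrt(s2) * rng.standard_normal(n)           # z ~ q = N(m, s2)
    return np.mean(log_normal(x, z, 1.0)                    # log p(x|z)
                   + log_normal(z, 0.0, 1.0)                # log p(z)
                   - log_normal(z, m, s2))                  # - log q(z|x)

print(round(log_normal(x, 0.0, 2.0), 4))   # log evidence, log N(x; 0, 2)
print(round(elbo(0.0, 1.0), 4))            # some q: strictly smaller
print(round(elbo(x / 2, 0.5), 4))          # q = exact posterior: equal
```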


Variational Inference: ELBO as a penalized expected likelihood

We can express
$$\begin{aligned}
\mathrm{ELBO} = L(\theta, \phi) &= \int q(z \mid x; \phi) \log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\, dz \\
&= E_{q(z|x;\phi)}[\log p(x, z; \theta)] - E_{q(z|x;\phi)}[\log q(z \mid x; \phi)] \\
&= E[\log p(x \mid z; \theta)] + E[\log p(z)] - E[\log q(z \mid x; \phi)] \\
&= E[\log p(x \mid z; \theta)] - KL(q(z \mid x; \phi) \,\|\, p(z))
\end{aligned}$$

Maximizing the ELBO is equivalent to maximizing $E[\log p(x \mid z; \theta)]$ with penalty $KL(q(z \mid x; \phi) \,\|\, p(z))$.


Optimizing ELBO with respect to parameters

Optimizing the ELBO requires evaluating $\frac{\partial}{\partial\theta} L(\theta, \phi)$ and $\frac{\partial}{\partial\phi} L(\theta, \phi)$.

First,
$$\frac{\partial}{\partial\theta} L(\theta, \phi) = \frac{\partial}{\partial\theta} E_{q(z|x;\phi)}[\log p(x, z; \theta)]
 = \int \frac{p'(x \mid z; \theta)}{p(x \mid z; \theta)}\, q(z \mid x; \phi)\, dz,$$
where $p'$ denotes the derivative with respect to $\theta$.

This term is usually computed stochastically.


Optimizing ELBO

Optimizing the ELBO also requires $\frac{\partial}{\partial\phi} L(\theta, \phi)$. Writing $q'$ for the derivative of $q$ with respect to $\phi$,
$$\begin{aligned}
\frac{\partial}{\partial\phi} L(\theta, \phi)
&= \frac{\partial}{\partial\phi} E_{q(z|x;\phi)}[\log p(x, z; \theta)] - \frac{\partial}{\partial\phi} E_{q(z|x;\phi)}[\log q(z \mid x; \phi)] \\
&= \int [\log p(x, z; \theta)]\, \frac{q'(z \mid x; \phi)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz
 - \int \frac{q'(z \mid x; \phi)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz
 - \int [\log q(z \mid x; \phi)]\, \frac{q'(z \mid x; \phi)}{q(z \mid x; \phi)}\, q(z \mid x; \phi)\, dz
\end{aligned}$$

Computation is complicated because the parameter $\phi$ appears in the distribution over which the expectation is taken. An EM-type iterative stochastic evaluation of each term is not guaranteed to converge.

One may attempt to reparameterize the ELBO so that the expectation is over a distribution free of parameters.
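A small numerical comparison (an assumed illustration, not from the lecture) makes the point: for $\frac{\partial}{\partial\mu} E_{z \sim N(\mu, 1)}[z^2] = 2\mu$, both the score-function estimator implied by the expansion above and the reparameterized estimator of the next slides are unbiased, but the reparameterized one has far lower variance.

```python
# Illustrative sketch (assumed): score-function vs. reparameterized
# Monte Carlo estimators of d/dmu E_{z~N(mu,1)}[z^2]; exact value is 2*mu.
import numpy as np

rng = np.random.default_rng(3)
mu, n = 1.5, 100_000
eps = rng.standard_normal(n)
z = mu + eps                                  # z ~ N(mu, 1)

score_terms = (z ** 2) * (z - mu)             # f(z) * d/dmu log N(z; mu, 1)
reparam_terms = 2.0 * z                       # f'(mu + eps) * dg/dmu, dg/dmu = 1

print(score_terms.mean(), reparam_terms.mean())   # both near 2*mu = 3.0
print(score_terms.var(), reparam_terms.var())     # score variance is much larger
```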


Reparameterization trick

In evaluating $\frac{\partial}{\partial\phi} \int f(x)\, p(x; \phi)\, dx$, let $x = g(\phi, \varepsilon)$, where $g$ is differentiable and $p(\varepsilon)$ is free of $\phi$. That is,
$$p(g(\phi, \varepsilon); \phi)\, \left|\frac{\partial g(\phi, \varepsilon)}{\partial \varepsilon}\right| = p(\varepsilon).$$

Then
$$\int f(x)\, p(x; \phi)\, dx = \int f(g(\phi, \varepsilon)) \left|\frac{\partial g(\phi, \varepsilon)}{\partial \varepsilon}\right| p(g(\phi, \varepsilon); \phi)\, d\varepsilon
 = \int f(g(\phi, \varepsilon))\, p(\varepsilon)\, d\varepsilon.$$


For example, when $p(x; \phi)$ is $N(\mu, \sigma^2)$ and $x = g(\phi, \varepsilon) = \mu + \varepsilon\sigma$,
$$\int f(x)\, p(x; \phi)\, dx = \int f(\mu + \varepsilon\sigma)\, p(\varepsilon)\, d\varepsilon,$$
where $\varepsilon = (x - \mu)/\sigma$ and $p(\varepsilon)$ is the standard normal density.

Then
$$\frac{\partial}{\partial\phi} \int f(x)\, p(x; \phi)\, dx = \int f'(g(\phi, \varepsilon))\, \frac{\partial g(\phi, \varepsilon)}{\partial\phi}\, p(\varepsilon)\, d\varepsilon.$$


Back to evaluating the ELBO, let
$$h(x, z, \theta, \phi) = \log p(x, z; \theta) - \log q(z \mid x; \phi).$$

We need to evaluate
$$L(\theta, \phi) = \int h(x, z, \theta, \phi)\, q(z \mid x; \phi)\, dz = \int h(x, g(\phi, \varepsilon), \theta, \phi)\, p(\varepsilon)\, d\varepsilon$$
and
$$\frac{\partial}{\partial\phi} L(\theta, \phi) = \int \frac{\partial}{\partial\phi} h(x, g(\phi, \varepsilon), \theta, \phi)\, p(\varepsilon)\, d\varepsilon.$$

Evaluation is much simpler since the expectation is over a distribution free of the parameter $\phi$. This can be evaluated stochastically.


Variational Autoencoders


Variational Autoencoder: Example

Goal: Given data $x$, learn a representation $z = f(x)$ from which $x$ can be reconstructed well with minimal capacity.

ELBO: $L(\theta, \phi) = E_{q(z|x;\phi)}[\log p(x \mid z; \theta) + \log p(z) - \log q(z \mid x; \phi)]$

Decoder: Assume $p(x \mid z; \theta) \sim N(\mu_{x|z}, \Sigma_{x|z})$, where
$$\mu_{x|z} = a_K(a_{K-1}(\cdots a_2(a_1(z; \theta_1); \theta_2) \cdots; \theta_{K-1}); \theta_K),$$
$$\mathrm{diag}(\Sigma_{x|z}) = b(a_{K-1}(\cdots a_2(a_1(z; \theta_1); \theta_2) \cdots; \theta_{K-1}); \theta_b).$$
Parameters for $\mu_{x|z}$ and $\mathrm{diag}(\Sigma_{x|z})$ are shared up to the $(K-1)$th layer.

Assume $p(z) \sim N(0, I)$.


Encoder: Assume $q(z \mid x) \sim N(\mu_{z|x}, \Sigma_{z|x})$, where
$$\mu_{z|x} = f_K(f_{K-1}(\cdots f_2(f_1(x; \phi_1); \phi_2) \cdots; \phi_{K-1}); \phi_K),$$
$$\mathrm{diag}(\Sigma_{z|x}) = g(f_{K-1}(\cdots f_2(f_1(x; \phi_1); \phi_2) \cdots; \phi_{K-1}); \phi_g).$$
Parameters for $\mu_{z|x}$ and $\mathrm{diag}(\Sigma_{z|x})$ are shared up to the $(K-1)$th layer.


ELBO:
$$L(\theta, \phi) = E_{q(z|x;\phi)}[\log p(x \mid z)] + E_{q(z|x;\phi)}[\log p(z) - \log q(z \mid x; \phi)].$$

For the second term, a closed form can be obtained thanks to the normality assumption on $q(z \mid x; \phi)$:
$$\begin{aligned}
E_{q(z|x;\phi)}&[\log p(z) - \log q(z \mid x; \phi)] \\
&\propto -E\Big(\sum_{i=1}^n z_i^T z_i \,\Big|\, x; \phi\Big) + \sum_{i=1}^n \log|\Sigma_{z_i|x_i}| + E \sum_{i=1}^n (z_i - \mu_{z_i|x_i})^T \Sigma_{z_i|x_i}^{-1} (z_i - \mu_{z_i|x_i}) \\
&= -\sum_{i=1}^n \mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^n \mathrm{tr}(\mu_{z_i|x_i} \mu_{z_i|x_i}^T) + \sum_{i=1}^n \log|\Sigma_{z_i|x_i}| + n\, \mathrm{tr}(I)
\end{aligned}$$
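As a sanity check on the closed form (an assumed NumPy illustration, not from the lecture): for a single observation with $q(z|x) = N(\mu, \mathrm{diag}(\sigma^2))$ and $p(z) = N(0, I)$, the term $E_q[\log p(z) - \log q(z|x)]$ equals $\tfrac12\{-\sum_j(\sigma_j^2 + \mu_j^2) + \sum_j \log \sigma_j^2 + k\}$, which a Monte Carlo estimate should reproduce.

```python
# Illustrative check (assumed): closed-form E_q[log p(z) - log q(z|x)] for a
# diagonal Gaussian q and a standard normal prior vs. a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.5, -1.0, 0.3])
sigma = np.array([0.8, 1.3, 0.5])
k = mu.size

closed = 0.5 * (-(np.sum(sigma**2) + np.sum(mu**2)) + np.sum(np.log(sigma**2)) + k)

z = mu + sigma * rng.standard_normal((200_000, k))
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=1)
print(closed, np.mean(log_p - log_q))        # the two values should agree
```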


For the first term, use the reparameterization trick,
$$E_{q(z|x;\phi)}[\log p(x \mid z)] = E_{p(\varepsilon)}[\log p(x \mid g(\varepsilon, x; \phi))],$$
and estimate it stochastically using
$$E_{q(z|x;\phi)}[\log p(x \mid z)] \approx \frac{1}{M} \sum_{m=1}^M \log p(x \mid g(\varepsilon^{(m)}, x; \phi)).$$


Putting things together, maximize a stochastic version of the ELBO,
$$L(\theta, \phi) = \sum_{i=1}^n \frac{1}{M} \sum_{m=1}^M \log p_\theta(x_i \mid g(\varepsilon_{mi}, x_i; \phi))
 - \sum_{i=1}^n \mathrm{tr}(\Sigma_{z_i|x_i}) - \sum_{i=1}^n \mathrm{tr}(\mu_{z_i|x_i} \mu_{z_i|x_i}^T) + \sum_{i=1}^n \log|\Sigma_{z_i|x_i}|,$$
where $g(\varepsilon_{mi}, x_i; \phi) = \mu_{z_i|x_i} + \Sigma_{z_i|x_i}^{1/2} \varepsilon_{mi}$ and $\varepsilon_{mi} \sim N(0, I)$.

To generate data, sample $z$ from the prior and generate $x$ from $p(x \mid z; \theta)$.
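The objective above can be coded compactly with automatic differentiation; the following is a hedged sketch assuming PyTorch, a Bernoulli decoder in place of the Gaussian one above (a common simplification), $M = 1$, a diagonal posterior covariance, and illustrative layer sizes; none of these choices are prescribed by the lecture.

```python
# Hedged VAE sketch (assumed: PyTorch, Bernoulli decoder, M = 1, toy sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d=784, k=20, hidden=400):
        super().__init__()
        self.enc = nn.Linear(d, hidden)          # shared encoder layers f_1..f_{K-1}
        self.enc_mu = nn.Linear(hidden, k)       # mu_{z|x}
        self.enc_logvar = nn.Linear(hidden, k)   # log diag(Sigma_{z|x})
        self.dec = nn.Linear(k, hidden)          # decoder layers a_1..a_{K-1}
        self.dec_out = nn.Linear(hidden, d)      # logits of p(x|z)

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)               # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparameterization: z = g(eps, x; phi)
        logits = self.dec_out(F.relu(self.dec(z)))
        return logits, mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # -E_q[log p(x|z)], single Monte Carlo sample (M = 1), Bernoulli likelihood
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form, as derived on the preceding slides
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784).round()                  # stand-in binary data batch
for _ in range(5):                               # a few illustrative steps
    opt.zero_grad()
    loss = neg_elbo(x, *model(x))
    loss.backward()
    opt.step()

# To generate data: sample z from the prior and pass it through the decoder.
with torch.no_grad():
    z = torch.randn(8, 20)
    samples = torch.sigmoid(model.dec_out(F.relu(model.dec(z))))
```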


Variational Autoencoder: Visualization of Learned Manifold

source: Kingma and Welling, 2014


Mean Field Variational Inference


Mean Field VI

Variational inference is named after variational calculus, in which we optimize a functional over all possible input functions. Instead of assuming a parametric model for $q$ and optimizing over its parameter, we can consider minimizing the KL divergence over the function $q$ itself. The mean field method takes this approach.

Assume the variational distribution over the latent variables factorizes as $q(z_1, z_2, \dots, z_m \mid x) = \prod_{j=1}^m q(z_j)$. Denote $q(z_j)$ by $q_j$.

This family of approximations may not contain the true posterior because the latent variables are usually dependent.


Treating $q_1, \dots, q_{j-1}, q_{j+1}, \dots, q_m$ as fixed, the ELBO can be expressed as
$$\begin{aligned}
L(\theta, q_j) &= \int \prod_{i=1}^m q_i \Big\{ \log p(x, z) - \sum_{i=1}^m \log q_i \Big\}\, dz \\
&= \int q_j \Big\{ \int \log p(x, z) \prod_{i \ne j} q_i\, dz_i \Big\}\, dz_j - \int q_j \log q_j\, dz_j + c \\
&= \int q_j\, E_{i \ne j}(\log p(x, z))\, dz_j - \int q_j \log q_j\, dz_j + c \\
&= \int q_j \log \tilde p(x, z_j)\, dz_j - \int q_j \log q_j\, dz_j + c,
\end{aligned}$$
where we let $\int \log p(x, z) \prod_{i \ne j} q_i\, dz_i \equiv E_{i \ne j}(\log p(x, z))$ and $\log \tilde p(x, z_j) = E_{i \ne j}(\log p(x, z)) + c$.


ELBO:
$$L(\theta, q_j) = \int q_j \log \tilde p(x, z_j)\, dz_j - \int q_j \log q_j\, dz_j + c = -KL(q_j \,\|\, \tilde p(x, z_j)) + c$$

If we fix $q_i$ for $i \ne j$, the ELBO is, up to a constant, the negative KL divergence between $q_j$ and $\tilde p(x, z_j)$.

The ELBO is therefore maximized when $q_j^* = \tilde p(x, z_j)$, that is,
$$q_j^* = \frac{\exp[E_{i \ne j}\{\log p(x, z)\}]}{\int \exp[E_{i \ne j}\{\log p(x, z)\}]\, dz_j}.$$
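A classical toy example (assumed here for illustration; see Bishop (2006), Chapter 10, for the general treatment) is the mean-field approximation of a bivariate Gaussian: the optimal factors are Gaussians whose means satisfy simple coordinate updates, and iterating them converges to the true marginal means.

```python
# Illustrative sketch (assumed): coordinate-ascent mean-field updates for a
# bivariate Gaussian target p(z1, z2) = N(m, Lambda^{-1}). The optimal q_j
# are Gaussians with variances 1/Lambda_jj and means given by the updates below.
import numpy as np

m = np.array([1.0, -2.0])                    # target mean
Lam = np.array([[2.0, 0.9],                  # target precision matrix
                [0.9, 1.5]])

mu1, mu2 = 0.0, 0.0                          # initial means of q1, q2
for _ in range(20):
    # each update maximizes the ELBO in q_j with the other factor held fixed
    mu1 = m[0] - (Lam[0, 1] / Lam[0, 0]) * (mu2 - m[1])
    mu2 = m[1] - (Lam[1, 0] / Lam[1, 1]) * (mu1 - m[0])

print(mu1, mu2)                              # converges to the true means (1, -2)
```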


Expectation propagation


Another approximate inference method: minimize $KL(p \,\|\, q)$ (Minka, 2001a; Minka, 2001b).

Consider the problem of minimizing $KL(p \,\|\, q)$ with respect to $q(z)$ when $p(z)$ is a fixed distribution and $q(z)$ is a member of the exponential family,
$$q(z) = h(z)\, g(\eta) \exp(\eta^T u(z)).$$

The KL divergence then has the form
$$KL(p \,\|\, q) = -\log g(\eta) - \eta^T E_{p(z)}(u(z)) + \text{const}.$$


Setting the derivative of $KL(p \,\|\, q) = -\log g(\eta) - \eta^T E_{p(z)}(u(z)) + \text{const}$ with respect to $\eta$ equal to zero,
$$-\frac{\partial}{\partial\eta} \log g(\eta) = E_{p(z)}(u(z)).$$

Note that
$$-\frac{\partial}{\partial\eta} \log g(\eta) = E_{q(z)}(u(z)).$$

Then
$$E_{q(z)}(u(z)) = E_{p(z)}(u(z)).$$

The optimal solution corresponds to matching the expected sufficient statistics. This is called moment matching.
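A small sketch (assumed for illustration, not from the lecture) of what moment matching does in practice: minimizing $KL(p \,\|\, q)$ over Gaussian $q$ simply sets the mean and variance of $q$ to those of $p$, producing a single broad Gaussian even when $p$ is bimodal.

```python
# Illustrative sketch (assumed): moment matching a Gaussian q to a bimodal p
# by using the empirical mean and variance of samples from p.
import numpy as np

rng = np.random.default_rng(5)
samples = np.concatenate([rng.normal(-3, 0.7, 50_000),   # samples from p,
                          rng.normal(3, 0.7, 50_000)])   # a two-mode mixture

mu_q, var_q = samples.mean(), samples.var()               # matched moments
print(mu_q, var_q)   # mean near 0, variance near 9.5: one broad Gaussian
```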
