
Seoul National University Deep Learning September-December, 2019 1 / 38

Since the publication of the GAN by Goodfellow et al. (2014), many applications have been reported.


Since the publication of the GAN by Goodfellow et al. (2014), many variants of GAN have been published.


The setup is similar to that of the VAE: we attempt to generate x from a latent variable z of smaller dimension.

Generative adversarial networks are based on a game theoretic scenario in which the generator network must compete against an adversary.

The generator network produces samples x = g(z; θg) that attempt to fool the classifier into believing they are real. Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator through P(y = 1|x) = D(x).


Let y = 1 if the data is real and y = 0 if the data is fake. We assume there is a lower-dimensional representation z of x.

To generate data, one needs to know p(x|y = 1). If P(y = 1) = P(y = 0) = 0.5, we have

P(y = 1|x) = p(x|y = 1) / {p(x|y = 1) + p(x|y = 0)}

In GAN, we specify P(y = 1|x) and p(x|y = 0), and estimate p(x|y = 1) by minimizing a distance between p(x|y = 0) and p(x|y = 1).
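As a quick numeric illustration (not from the slides; the two Gaussian class-conditional densities below are hypothetical choices), the posterior P(y = 1|x) follows directly from the two densities under equal priors:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical example: real data N(4, 1), fake data N(0, 1), equal priors.
def posterior_real(x):
    p1 = gauss_pdf(x, 4.0, 1.0)   # p(x | y = 1), real
    p0 = gauss_pdf(x, 0.0, 1.0)   # p(x | y = 0), fake
    return p1 / (p1 + p0)         # P(y = 1 | x) when P(y=1) = P(y=0) = 0.5
```

At the midpoint x = 2 the two densities coincide, so the posterior is exactly 0.5; nearer the real mean it exceeds 0.5.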


The likelihood function based on y |x ∼ Ber(D(x)) is

∑_{i=1}^n {y_i log D(x_i; θd) + (1 − y_i) log(1 − D(x_i; θd))}
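This log-likelihood is the negative of the familiar binary cross-entropy. A minimal sketch with made-up labels and discriminator outputs:

```python
import numpy as np

# y: labels (1 = real, 0 = fake); d = D(x_i): discriminator outputs (made up).
y = np.array([1.0, 1.0, 0.0, 0.0])
d = np.array([0.9, 0.8, 0.3, 0.1])

# Bernoulli log-likelihood (the negative of binary cross-entropy).
loglik = np.sum(y * np.log(d) + (1 - y) * np.log(1 - d))
```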

However, the x_i are not observed for y_i = 0, so we replace x_i with g(z_i; θg). Since z is not observed for y = 0, the marginal likelihood is

L(θd, θg) = ∏_{i=1}^n {(D(x_i; θd))^{y_i} ∫ p(z_i) (1 − D(g(z_i; θg); θd))^{1−y_i} dz_i}


Generative adversarial network: Discriminator and Generator

Consider the following quantity

v(θg, θd) = E_{x∼p_data} log D(x; θd) + E_{z∼p(z)} log[1 − D{g(z; θg); θd}],

where p_data ≡ p(x|y = 1) and p_model ≡ p(x|y = 0); the expectation over z ∼ p(z) is equivalent to an expectation over x ∼ p_model. Note that this is not an expected likelihood in the usual sense.

Optimization

Discriminator: maximize v(θg, θd) over θd given θg.
Generator: minimize max_{θd} v(θg, θd) over θg.
Alternate between the discriminator and generator steps.
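The alternating scheme can be sketched on a toy one-dimensional problem. Everything below (the data distribution, the logistic discriminator, the shift generator, the learning rate) is an illustrative assumption, not the slides' setup:

```python
import numpy as np

# Toy 1-D GAN sketch: real data ~ N(4, 1); generator shifts standard
# normal noise, g(z; theta_g) = z + theta_g; discriminator is logistic,
# D(x) = sigmoid(a*x + b). All choices here are illustrative assumptions.
rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

theta_g, a, b, lr = 0.0, 1.0, 0.0, 0.05

for step in range(1000):
    x_real = rng.normal(4.0, 1.0, size=64)
    z = rng.normal(0.0, 1.0, size=64)
    x_fake = z + theta_g

    # Discriminator step: gradient ASCENT on
    # v = E log D(x_real) + E log(1 - D(x_fake)).
    d_real, d_fake = sigmoid(a * x_real + b), sigmoid(a * x_fake + b)
    a += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    b += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step: gradient DESCENT on E log(1 - D(g(z; theta_g))).
    d_fake = sigmoid(a * x_fake + b)
    theta_g += lr * np.mean(d_fake) * a   # minus the objective's derivative
```

Each iteration takes one ascent step for the discriminator and one descent step for the generator, mirroring the alternation above.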


Discriminator: maximizing v(θg, θd) over θd estimates the discriminator. When the function space of D(x) is unrestricted, the argmax of v over D is

D*(x) = p_data(x) / {p_data(x) + p_model(x)}.

Generator: plugging P(y = 1|x) = D*(x) = p(x|y = 1) / {p(x|y = 1) + p(x|y = 0)} into v(θg, θd), we minimize

v(θg, θd) = ∫ p(x|y = 1) log [p(x|y = 1) / {p(x|y = 1) + p(x|y = 0)}] dx
          + ∫ p(x|y = 0) log [p(x|y = 0) / {p(x|y = 1) + p(x|y = 0)}] dx
          = KL(p_data‖p**) + KL(p_model‖p**) + const,

where p** = (p_data + p_model)/2 and const = −log 4.
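The last equality can be checked numerically on a grid. The two Gaussian densities below are hypothetical stand-ins for p_data and p_model; the constant is −log 4:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(p, q, dx):
    # KL divergence by simple grid quadrature.
    return np.sum(p * np.log(p / q)) * dx

x = np.linspace(-10.0, 14.0, 24001)
dx = x[1] - x[0]
p_data = gauss_pdf(x, 4.0, 1.0)    # hypothetical p(x | y = 1)
p_model = gauss_pdf(x, 0.0, 1.0)   # hypothetical p(x | y = 0)
p_mix = 0.5 * (p_data + p_model)   # p** = (p_data + p_model) / 2

# Left-hand side: v evaluated at the optimal discriminator D*.
lhs = np.sum(p_data * np.log(p_data / (p_data + p_model))) * dx \
    + np.sum(p_model * np.log(p_model / (p_data + p_model))) * dx

# Right-hand side: KL(p_data || p**) + KL(p_model || p**) - log 4.
rhs = kl(p_data, p_mix, dx) + kl(p_model, p_mix, dx) - np.log(4.0)
```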


source: https://poloclub.github.io/ganlab/


AE, VAE and GAN

|                                    | AE                              | VAE                                              | GAN                       |
|------------------------------------|---------------------------------|--------------------------------------------------|---------------------------|
| estimation of distribution through | transformation                  | conditional distribution                         | transformation            |
| specifying transformation          | x = g(z; θg), z = f(x; θf)      | none                                             | x = g(z; θg)              |
| specifying distributions           | none                            | p(z), p(x\|z), q(z\|x), and thus indirectly p(x) | p(z)                      |
| objective function                 | ‖x − g(f(x; θf); θg)‖²          | KL divergence                                    | Jensen-Shannon divergence |


GAN algorithm


Implementation of GAN

In practice, D(x) is restricted to a neural network. We first maximize v(θg, θd) over θd to obtain θ*_d. Then,

θ*_g = argmin_{θg} E_{z∼p(z)} log[1 − D{g(z; θg); θ*_d}]

The contribution of the i-th observation to E_{z∼p(z)} log[1 − D{g(z; θg); θ*_d}] is stochastically evaluated by

(1/M) ∑_{m=1}^M log[1 − D{g(z_i^{(m)}; θg); θ*_d}]

where z_i^{(m)} is generated from p(z).

In practice, minimizing E_{z∼p(z)} log[1 − D{g(z; θg); θ*_d}] does not work well, because the gradient is small when the discriminator confidently rejects the generated samples. Instead, one maximizes E_{z∼p(z)} log D{g(z; θg); θ*_d} over θg.
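A small numeric check of why the heuristic helps: when D(g(z)) is near 0 early in training, the derivative of log(1 − D) with respect to D is nearly flat, while that of −log D is large. The sample values of D below are made up:

```python
import numpy as np

d = np.array([0.001, 0.01, 0.1, 0.5])   # D(g(z)): discriminator outputs on fakes

# Magnitudes of the derivative with respect to D of the two generator losses:
grad_saturating = 1.0 / (1.0 - d)       # from minimizing log(1 - D)
grad_nonsaturating = 1.0 / d            # from minimizing -log D
```

For D ≤ 0.5 the non-saturating loss always has the larger gradient, by orders of magnitude when D is near 0.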


In a nutshell, GAN finds p(x|y = 0) = p_model by minimizing a KL divergence against (p_data + p_model)/2, so the generator's objective function is KL(p_model‖p**). The GAN algorithm is one way to minimize KL(p_model‖p**), namely by finding P(y = 1|x), indexed by θd, through the 'discriminator'.

Game-theoretic arguments may be oversold, since they are not essential for estimating the density. The role of the discriminator is to determine the loss function for the generator, KL(p_data‖p**) + KL(p_model‖p**), the Jensen-Shannon divergence. The Jensen-Shannon divergence has the advantage over the KL divergence that it remains well defined even when the supports of the two distributions do not overlap.


Detailed Architecture of GAN: DCGAN (Radford et al. 2016, ICLR)


Architectures designed for classification need modification for GANs.

Some tips are given, such as replacing max pooling with strided convolutions, using batch normalization, and using ReLU in the generator with tanh for the output layer, among others.

Issues of unstable training remained.


Many GANs

Many ways of improving GANs have been proposed (Salimans et al., 2016).

Many variants of GAN have been proposed:

CycleGAN (Zhu et al., 2017): domain transfer (input = horse, output = zebra)
Text-to-image synthesis (Reed et al., 2016, ICML)
Pix2pix (Isola et al., 2017, CVPR)

WGAN (Arjovsky et al., 2017) is the most popular; it uses the Wasserstein distance to optimize the generating distribution.


Wasserstein GAN: Distance

If the real data distribution Pr of X admits a density and Pθ is the distribution of g(Z; θ), then, asymptotically, likelihood inference amounts to minimizing the Kullback-Leibler divergence KL(Pr‖Pθ). When the distributions are supported on low-dimensional manifolds, the KL divergence is not defined.

WGAN minimizes Wasserstein distance between Pr and Pθ.


Distances between two distributions

Total variation distance:

δ(Pr, Pθ) = sup_{A∈Σ} |Pr(A) − Pθ(A)|,

where Σ denotes the set of all Borel subsets of a compact metric set X.
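For discrete distributions the supremum is attained by the set {p > q}, so the distance reduces to half the L1 distance. A minimal sketch with made-up probabilities:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # Pr on three points (made up)
q = np.array([0.1, 0.3, 0.6])   # Ptheta on the same points (made up)

# For discrete distributions, delta(P, Q) = 0.5 * sum |p - q|.
tv = 0.5 * np.sum(np.abs(p - q))
```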

Kullback-Leibler divergence: KL(Pr‖Pθ).

Jensen-Shannon divergence:

JS(Pr, Pθ) = KL(Pr‖Pm) + KL(Pθ‖Pm),

where Pm = (Pr + Pθ)/2.
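A discrete sketch, following the slide's convention without the factor 1/2 that some texts include (the two distributions are made up):

```python
import numpy as np

def kl(p, q):
    # KL divergence for discrete distributions with matching support.
    return np.sum(p * np.log(p / q))

def js(p, q):
    # Jensen-Shannon divergence per the slide's convention:
    # KL(p || m) + KL(q || m) with m the equal mixture.
    # (Some texts multiply this by 1/2.)
    m = 0.5 * (p + q)
    return kl(p, m) + kl(q, m)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.3, 0.6])
```

Unlike KL, this quantity is symmetric and always finite; under this convention it is bounded by 2 log 2.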


Earth-mover distance

Earth-mover distance or Wasserstein-1 distance:

W(Pr, Pθ) = inf_{γ∈Π(Pr,Pθ)} E_{(x,y)∼γ}[‖x − y‖],

where Π(Pr, Pθ) denotes the set of all joint distributions γ(x, y) whose marginals are Pr and Pθ, respectively.
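In one dimension with equal-size samples, the optimal coupling simply pairs the sorted samples, so W1 reduces to a mean absolute difference. A minimal sketch with made-up samples (the helper name is illustrative):

```python
import numpy as np

def w1_empirical(a, b):
    # For equal-size 1-D samples, the optimal coupling sorts both samples
    # and pairs them in order, so W1 is the mean absolute difference.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

a = np.array([0.0, 1.0, 2.0])
b = a + 3.0   # the same sample shifted by 3
```

Shifting a distribution by c moves it a Wasserstein distance of exactly c, which is the smooth behavior that motivates WGAN.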

source: Cuturi and Solomon, 2017, NeurIPS tutorial, A Primer on optimal transport
