
  • Generative Adversarial Network


  • Generative adversarial network (Goodfellow, 2014)

    Since the publication of the GAN by Goodfellow (2014), many applications have been reported and many variants have been published.

  • Generative adversarial network

    The setup is similar to that of the VAE: we attempt to generate x from a latent variable z of smaller dimension.

    Generative adversarial networks are based on a game-theoretic scenario in which the generator network must compete against an adversary.

    The generator network produces samples x = g(z; θg) and attempts to fool the discriminator into believing they are real. Its adversary, the discriminator network, attempts to distinguish samples drawn from the training data from samples drawn from the generator, through P(y = 1|x) = D(x).
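    To make the setup concrete, a minimal generator/discriminator pair might be sketched as follows (an illustrative sketch in PyTorch; the layer sizes, latent dimension, and data dimension are assumptions, not taken from the slides):

    import torch.nn as nn

    LATENT_DIM = 64    # dimension of z (an assumed value)
    DATA_DIM = 784     # dimension of x, e.g. a flattened 28x28 image (assumed)

    # Generator g(z; theta_g): maps a latent code z to a sample x.
    generator = nn.Sequential(
        nn.Linear(LATENT_DIM, 256), nn.ReLU(),
        nn.Linear(256, DATA_DIM), nn.Tanh(),
    )

    # Discriminator D(x; theta_d): outputs P(y = 1 | x), the probability that x is real.
    discriminator = nn.Sequential(
        nn.Linear(DATA_DIM, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 1), nn.Sigmoid(),
    )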


  • Generative adversarial network: Setup

    Let y = 1 if the data is real and y = 0 if the data is fake. We assume that there is a lower-dimensional representation z of x.

    To generate data, one needs to know p(x|y = 1). If P(y = 1) = P(y = 0) = 0.5, we have

    P(y = 1|x) = p(x|y = 1) / {p(x|y = 1) + p(x|y = 0)}.

    In a GAN, we specify P(y = 1|x) and p(x|y = 0), and estimate p(x|y = 1) by minimizing a distance between p(x|y = 0) and p(x|y = 1).
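    As a toy numerical check of this relation (a hypothetical example in which p(x|y = 1) and p(x|y = 0) are unit-variance Gaussians; the densities and the evaluation point are arbitrary choices of mine):

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

    # Suppose real data follow N(0, 1) and fake data follow N(2, 1), with P(y=1) = P(y=0) = 0.5.
    x = 1.0                              # the midpoint between the two means
    p_real = normal_pdf(x, 0.0, 1.0)     # p(x | y = 1)
    p_fake = normal_pdf(x, 2.0, 1.0)     # p(x | y = 0)
    print(p_real / (p_real + p_fake))    # P(y = 1 | x); equals 0.5 at the midpoint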


  • Generative adversarial network

    The likelihood function based on y|x ∼ Ber(D(x)) is

    ∑_{i=1}^{n} {yi log D(xi; θd) + (1 − yi) log(1 − D(xi; θd))}.

    However, the xi's are not observed for yi = 0, and we replace xi with g(zi; θg). Since z is not observed for y = 0, the marginal likelihood is

    L(θd, θg) = ∏_{i=1}^{n} { D(xi; θd)^{yi} ∫ p(zi) (1 − D(g(zi; θg); θd))^{1−yi} dzi }.


  • Generative adversarial network: Discriminator and Generator

    Consider the following quantity:

    v(θg, θd) = E_{x∼pdata} log D(x; θd) + E_{z∼p(z)} log[1 − D{g(z; θg); θd}],

    where pdata ≡ p(x|y = 1) and pmodel ≡ p(x|y = 0) is the distribution of g(z; θg) when z ∼ p(z). Note that this is not an expected likelihood in the usual sense.

    Optimization:

    Discriminator step: maximize v(θg, θd) over θd, given θg.
    Generator step: minimize max_{θd} v(θg, θd) over θg.
    Alternate the discriminator and generator steps.
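    Putting the two steps together, one pass of the alternating optimization can be sketched as follows (a minimal sketch continuing the generator/discriminator above; data_iterator, the learning rates, and the number of steps are placeholders I chose, not values from the slides):

    import torch
    import torch.nn as nn

    bce = nn.BCELoss()
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    num_steps = 10_000                           # number of alternating updates (arbitrary)

    for step in range(num_steps):
        real_x = next(data_iterator)             # mini-batch of real samples (placeholder)
        batch = real_x.size(0)
        ones = torch.ones(batch, 1)
        zeros = torch.zeros(batch, 1)

        # Discriminator step: maximize v(theta_g, theta_d) over theta_d, i.e. minimize
        # the binary cross-entropy of D's predictions against the real/fake labels.
        z = torch.randn(batch, LATENT_DIM)
        fake_x = generator(z).detach()           # theta_g is held fixed in this step
        loss_d = bce(discriminator(real_x), ones) + bce(discriminator(fake_x), zeros)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator step: with theta_d fixed, minimize E log(1 - D(g(z))) over theta_g.
        # (The non-saturating variant described under "Implementation of GAN" below
        # maximizes E log D(g(z)) instead.)
        z = torch.randn(batch, LATENT_DIM)
        loss_g = torch.log(1.0 - discriminator(generator(z)) + 1e-8).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()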


  • Generative adversarial network

    Discriminator: maximizing v(θg, θd) over θd amounts to estimating the discriminator. When the function space of D(x) is not restricted, the argmax of v over D is

    D∗(x) = pdata(x) / {pdata(x) + pmodel(x)}.

    Generator: plugging P(y = 1|x) = D∗(x) = p(x|y = 1) / {p(x|y = 1) + p(x|y = 0)} into v(θg, θd), we minimize

    v(θg, θd) = ∫ p(x|y = 1) log [ p(x|y = 1) / {p(x|y = 1) + p(x|y = 0)} ] dx
              + ∫ p(x|y = 0) log [ p(x|y = 0) / {p(x|y = 1) + p(x|y = 0)} ] dx

              = KL(pdata‖p∗∗) + KL(pmodel‖p∗∗) + const,

    where p∗∗ = (pdata + pmodel)/2.
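    For completeness, the constant can be made explicit (a short standard step not spelled out above): since p(x|y = 1) + p(x|y = 0) = 2p∗∗(x), each logarithm picks up a term −log 2, and

    v(\theta_g, \theta_d)\big|_{D = D^{*}} = \mathrm{KL}(p_{\mathrm{data}} \,\|\, p^{**}) + \mathrm{KL}(p_{\mathrm{model}} \,\|\, p^{**}) - \log 4,

    so the constant is −log 4; the objective attains this minimum value exactly when pdata = pmodel, in which case both KL terms vanish.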


  • Generative adversarial network

    [Figure: GAN training visualization from GAN Lab. Source: https://poloclub.github.io/ganlab/]


  • AE, VAE and GAN

                                  AE                              VAE                           GAN
    estimation of                 transformation                  conditional distribution      transformation
    distribution through
    specifying                    x = g(z; θg),                   none                          x = g(z; θg)
    transformation                z = f(x; θf)
    specifying                    none                            p(z), p(x|z), q(z|x)          p(z), and thus
    distributions                                                                               indirectly p(x)
    objective function            ‖x − g(f(x; θf); θg)‖²          KL divergence                 Jensen-Shannon divergence


  • GAN algorithm


  • Implementation of GAN

    In practice, D(x) is restricted to a neural network. We first maximize v(θg, θd) over θd to obtain θ∗d. Then

    θ∗g = argmin_{θg} E_{z∼p(z)} log[1 − D{g(z; θg); θ∗d}].

    The i-th contribution to E_{z∼p(z)} log[1 − D{g(z; θg); θ∗d}] is stochastically evaluated by

    (1/M) ∑_{m=1}^{M} log[1 − D{g(z_i^(m); θg); θ∗d}],

    where z_i^(m) is generated from p(z).

    In practice, minimizing E_{z∼p(z)} log[1 − D{g(z; θg); θ∗d}] does not work well. Instead, one maximizes E_{z∼p(z)} log D{g(z; θg); θ∗d} over θg.
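    A sketch of this Monte Carlo estimate, covering both the original (minimax) and the non-saturating generator objectives (assuming a generator and discriminator as in the earlier sketches; the function name, M, and eps are my own):

    import torch

    def generator_objective(generator, discriminator, latent_dim=64, M=128,
                            non_saturating=True, eps=1e-8):
        """Monte Carlo estimate of the generator objective over M draws z^(m) ~ p(z)."""
        z = torch.randn(M, latent_dim)                  # z^(m) ~ p(z) = N(0, I)
        d_fake = discriminator(generator(z))            # D{g(z; theta_g); theta_d*}
        if non_saturating:
            # maximize (1/M) sum log D(g(z)); return its negative so it can be minimized
            return -torch.log(d_fake + eps).mean()
        # original objective: minimize (1/M) sum log(1 - D(g(z)))
        return torch.log(1.0 - d_fake + eps).mean()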


  • GAN: Comments

    In a nutshell, GAN finds p(x|y = 0), i.e. pmodel, by minimizing a KL divergence between pmodel and (pdata + pmodel)/2; the objective function is therefore KL(pmodel‖p∗∗). The GAN algorithm is one example of how to minimize KL(pmodel‖p∗∗), namely by finding P(y = 1|x), indexed by θd, through the 'discriminator'.

    Game-theoretic arguments may be oversold, since they are not essential for estimating the density. The role of the discriminator is to determine the loss function for the generator, KL(pdata‖p∗∗) + KL(pmodel‖p∗∗), the Jensen-Shannon divergence. The Jensen-Shannon divergence has an advantage over the KL divergence in that it avoids the problem of non-overlapping supports.


  • Detailed Architecture of GAN: DCGAN (Radford et al., 2016, ICLR)

    [Figures from Radford et al., 2016, ICLR (DCGAN).]

    Architectures designed for classification need modification for GANs.

    Some tips are given, such as replacing max pooling with strided convolutional layers, using batch normalization, and using ReLU in the generator with tanh for the output, among others (a sketch follows below).

    Issues of unstable training remained.
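    As an illustration of these tips, a DCGAN-style generator could look like the following (a sketch only, not the exact architecture of Radford et al.; the channel counts, kernel sizes, and the 3x32x32 output resolution are assumptions):

    import torch.nn as nn

    # Strided transposed convolutions instead of pooling, batch normalization,
    # ReLU inside the generator, and tanh on the output layer.
    dcgan_generator = nn.Sequential(
        nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, padding=0),  # z: 100x1x1 -> 256x4x4
        nn.BatchNorm2d(256), nn.ReLU(),
        nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # -> 128x8x8
        nn.BatchNorm2d(128), nn.ReLU(),
        nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # -> 64x16x16
        nn.BatchNorm2d(64), nn.ReLU(),
        nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # -> 3x32x32 image
        nn.Tanh(),
    )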


  • Many GANs

    There are many ways of improving GANs (Salimans et al., 2016), and many variants of GAN have been proposed:

    CycleGAN (Zhu et al., 2017): domain transfer (input = horse, output = zebra)
    Text-to-image (Reed et al., 2016, ICML)
    Pix2pix (Isola et al., 2017, CVPR)

    WGAN (Arjovsky et al., 2017) is the most popular; it uses the Wasserstein distance to optimize the generating distribution.


  • Wasserstein GAN: Distance

    If the real data distribution Pr of X admits a density and Pθ is the distribution of g(Z; θ), then, asymptotically, maximum likelihood inference amounts to minimizing the Kullback-Leibler divergence KL(Pr‖Pθ). When the distributions are supported on low-dimensional manifolds, however, the KL divergence is not defined, because the supports need not overlap.

    WGAN instead minimizes the Wasserstein distance between Pr and Pθ.


  • Distances between two distributions

    Total variation distance:

    δ(Pr, Pθ) = sup_{A∈Σ} |Pr(A) − Pθ(A)|,

    where Σ denotes the set of all Borel subsets of a compact metric space X.

    Kullback-Leibler divergence: KL(Pr‖Pθ).

    Jensen-Shannon divergence:

    JS(Pr, Pθ) = KL(Pr‖Pm) + KL(Pθ‖Pm),

    where Pm = (Pr + Pθ)/2.


  • Earth-mover distance

    Earth-mover distance (Wasserstein-1 distance):

    W(Pr, Pθ) = inf_{γ∈Π(Pr,Pθ)} E_{(x,y)∼γ}[‖x − y‖],

    where Π(Pr, Pθ) denotes the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pθ.

    source: Cuturi and Solomon, 2017, NeurIPS tutorial, A Primer on optimal transport
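    As a small numerical illustration (one-dimensional case only, where the optimal transport plan simply matches sorted values; the function name and the toy Gaussians are my own choices):

    import numpy as np

    def wasserstein1_1d(x, y):
        """Earth-mover (Wasserstein-1) distance between two equal-size 1-D samples.

        In one dimension the optimal transport plan pairs the sorted values, so W1
        reduces to the mean absolute difference between the order statistics.
        """
        x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
        return float(np.mean(np.abs(x - y)))

    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=10_000)    # samples from Pr
    b = rng.normal(2.0, 1.0, size=10_000)    # samples from P_theta, shifted by 2
    print(wasserstein1_1d(a, b))             # close to 2, the size of the shift

    Shifting a distribution by a constant shifts every quantile by that constant, so the printed value is close to 2.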


  • Optimal transport

    If we imagine the distributions as different ways of piling up a given amount of earth, the earth-mover distance is the minimum cost of turning one pile into the other, where the cost is the amount of earth moved times the distance it is moved.
