Transcript
Page 1


Generative Adversarial Networks (part 2)

Benjamin Striner

Carnegie Mellon University

April 22, 2020

Page 2

Table of Contents

1 Recap

2 Understanding Optimization Issues

3 GAN Training and Stabilization

4 Take Aways

Page 4

Recap

What did we talk about so far?

What is a GAN?

How do GANs work theoretically?

What kinds of problems can GANs address?

Page 5

Recap

What is a GAN?

Train a generator to produce samples from a target distribution

Discriminator guides generator

Generator and Discriminator are trained adversarially

Page 6

Recap

How do GANs work theoretically?

Discriminator calculates a divergence between generated and target distributions

Generator tries to minimize the divergence

Page 7

Pseudocode

How can you build one yourself?

Define a generator network that takes random inputs and produces an image

Define a discriminator network that takes images and produces a scalar

Draw a random batch of Z from the prior
Draw a random batch of X from the data
Gradient descent on generator weights w.r.t. the generator loss
Gradient descent on discriminator weights w.r.t. the discriminator loss

See recitations and tutorials for details
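
A minimal sketch of this loop in PyTorch (the 2-D toy data, network sizes, optimizer settings, and the non-saturating generator loss are illustrative assumptions, not values from the lecture):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def sample_data(n):
    # Toy target distribution: a Gaussian blob centered at (2, 2).
    return torch.randn(n, data_dim) + 2.0

for step in range(1000):
    # Draw a random batch of Z from the prior and X from the data.
    z = torch.randn(64, latent_dim)
    x = sample_data(64)

    # Gradient descent on discriminator weights w.r.t. the discriminator loss.
    opt_d.zero_grad()
    d_loss = bce(D(x), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Gradient descent on generator weights w.r.t. the generator loss
    # (non-saturating form: make D label generated samples as real).
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()
```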

Page 8

Recap

What kinds of problems can GANs address?

Generation

Conditional Generation

Clustering

Semi-supervised Learning

Representation Learning

Translation

Any traditional discriminative task can be approached with generative models

Page 9

Recap

What is a discriminator?

A new type of loss function

Measures the divergence between distributions, not between examples

Implemented by a neural network, so it has the biases of a neural network

Page 10

Summary

Powerful tool for generative modeling

Lots of potential

Limited by pragmatic issues (stability)

Page 11

Table of Contents

1 Recap

2 Understanding Optimization Issues

3 GAN Training and Stabilization

4 Take Aways

Page 12

Common Failures

GAN training can be tricky to diagnose

“Mode collapse”: the generator produces samples from only a small subspace and does not cover the entire distribution: https://www.youtube.com/watch?v=ktxhiKhWoEE

Some errors are harder to describe: https://www.youtube.com/watch?v=D5akt32hsCQ

Cause can be unclear

Discriminator too complicated?
Discriminator not complicated enough?

Page 13

Causes of Optimization Issues

Simultaneous updates require a careful balance between players

In general, two player games are not guaranteed to converge to the global optimum

There is a stationary point but no guarantee of reaching it

Adversarial optimization is a more general, harder problem than single-player optimization

Page 14

Simultaneous Updates

Why are updates simultaneous? Can you just train an optimal discriminator?

For any given discriminator, the optimal generator outputs $\forall Z : G(Z) = \arg\max_X D(X)$

The optimal discriminator emits 0.5 for all inputs, so isn’t useful for training anything

Optimal discriminator is conditional on the current generator and vice versa

Cannot train generator without training discriminator first

Therefore generator and discriminator must be trained together

Page 15

Simultaneous Updates

Adversarial Balance

What kind of balance is required?

If discriminator is under-trained, it guides the generator in the wrong direction

If discriminator is over-trained, it is too “hard” and generator can’t make progress

If generator trains too quickly it will “overshoot” the loss that the discriminator learned

Page 16

Simultaneous Updates

Factors Affecting Balance

What affects the balance?

Different optimizers and learning rates

Different architectures, depths, and numbers of parameters

Regularization

Train D for $k_D$ iterations, train G for $k_G$ iterations, repeat

Train D and G for dynamic $k_D$ and $k_G$ iterations, depending on some metrics

Page 17

Simultaneous Updates

Adversarial Balance

What does this look like in practice?

Target distribution is stationary 2D point (green)

Generator produces a single moving 2D point (blue)

Discriminator is a 2D linear function, represented by the colored background

Watch oscillations as generator overshoots discriminator

https://www.youtube.com/watch?v=ebMei6bYeWw

Page 18

Two-Player Games

Even a simple game, Rock, Paper, Scissors, might not converge using alternating updates.

Player A prefers rock by random initialization

Player B should therefore play only paper

Player A should play only scissors

Player B should play only rock

. . .

Page 19

Two-player games

Rock, Paper, Scissors

Why is this so unstable?

$\mathbb{E}[\ell_A] = A_R B_S + A_P B_R + A_S B_P$

Global optimum:
Both players select uniformly w.p. 0.33
Both players win, lose or draw w.p. 0.33

Local optimum:
Say player B plays (R, P, S) w.p. (0.4, 0.3, 0.3)
Player A should play (R, P, S) w.p. (0, 1, 0)
Player A wins w.p. 0.4

What happens if you use gradient descent? https://www.youtube.com/watch?v=JmON4S0kl04
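
A small numerical sketch of simultaneous gradient play on this payoff (the softmax parameterization of the strategies and the step size are illustrative assumptions); the strategies tend to cycle around the uniform equilibrium rather than converge to it:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

a, b = np.zeros(3), np.zeros(3)   # logits for players A and B over (R, P, S)
lr = 0.5
for step in range(200):
    A, B = softmax(a), softmax(b)
    # Win probability of A: rock beats scissors, paper beats rock, scissors beats paper.
    payoff_A = A[0] * B[2] + A[1] * B[0] + A[2] * B[1]
    payoff_B = B[0] * A[2] + B[1] * A[0] + B[2] * A[1]
    # Each player's gradient of its own payoff w.r.t. its own logits (softmax chain rule).
    grad_A = (np.array([B[2], B[0], B[1]]) - payoff_A) * A
    grad_B = (np.array([A[2], A[0], A[1]]) - payoff_B) * B
    a += lr * grad_A   # each player greedily ascends its own payoff
    b += lr * grad_B
    if step % 50 == 0:
        print(step, A.round(3), B.round(3))
```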

Page 20

Stationary Points

The fact that there is a stationary point does not mean you will converge to it

Gradients can circle or point away from the minima

Stationary point may not be stable, no local “well”

Some degree of smoothness to the discriminator is required

Even if discriminator correctly labels generated points 0 and real points 1:
Does not mean the gradient of the discriminator is in the right direction
Does not mean the area around generated points is 0 and the area around real points is 1

Page 21

Table of Contents

1 Recap

2 Understanding Optimization Issues

3 GAN Training and Stabilization

4 Take Aways

Page 22

GAN Training

Ongoing research into “best” GAN

Likely no silver bullet

Combinations of techniques work well

Getting better every year

Page 23

GAN Training Techniques

We will discuss a sample of training/stabilization techniques

Will not cover every idea people have tried

Goal is to understand the types of techniques and research

Will cover some interesting or historical ideas that aren’t that great

I am not endorsing all of the following techniques

Page 24

GAN Training Techniques

Unrolled Generative Adversarial Networks

Gradient descent is locally stable

DRAGAN

Numerics of GANs

Improved Techniques for GANs

Least-Squares GAN

Instance Noise

EBGAN

WGAN

WGAN-GP

Spectral Normalized GAN

Fisher GAN

LapGAN

ProgGAN

Page 25

Unrolled Generative Adversarial Networks

Optimize future loss, not current loss [MPPS16]

Calculate the discriminator after a few SGD steps

Find the generator that has the best loss on the future discriminator

Differentiate through gradient descent

Page 26

Unrolled Generative Adversarial Networks

UGAN Definition

Think of it like chess. Make the move that gives the best result after the opponent’s move, not the best immediate reward.

$\theta_D^0 = \theta_D$

$\theta_D^{k+1} = \theta_D^k + \eta^k \frac{\partial f(\theta_G, \theta_D^k)}{\partial \theta_D^k}$

$f_K(\theta_G, \theta_D) = f(\theta_G, \theta_D^K(\theta_G, \theta_D))$
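
A toy sketch of the unrolling idea (assumptions: a 1-D problem, a logistic-regression discriminator, plain gradient ascent for the inner steps, and K = 3); the inner discriminator updates are written functionally so the generator can backpropagate through them:

```python
import torch

def d_out(x, w, b):                      # tiny "discriminator": logistic regression
    return torch.sigmoid(w * x + b)

def f(theta_g, w, b):                    # GAN value function f(theta_G, theta_D)
    real = 1.0 + 0.1 * torch.randn(256)          # target data near 1.0
    fake = theta_g + 0.1 * torch.randn(256)      # "generator": a point mass at theta_g
    return torch.log(d_out(real, w, b)).mean() + torch.log(1 - d_out(fake, w, b)).mean()

theta_g = torch.tensor(-1.0, requires_grad=True)
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
opt_g = torch.optim.SGD([theta_g], lr=0.1)
eta, K = 0.5, 3

for step in range(200):
    # Unroll K ascent steps of the discriminator, keeping the graph (create_graph=True).
    w_k, b_k = w, b
    for _ in range(K):
        gw, gb = torch.autograd.grad(f(theta_g, w_k, b_k), (w_k, b_k), create_graph=True)
        w_k, b_k = w_k + eta * gw, b_k + eta * gb

    # Generator minimizes its loss against the *future* discriminator f_K.
    opt_g.zero_grad()
    f(theta_g, w_k, b_k).backward()
    opt_g.step()

    # The actual discriminator only takes one ordinary ascent step on f.
    gw, gb = torch.autograd.grad(f(theta_g.detach(), w, b), (w, b))
    with torch.no_grad():
        w += eta * gw
        b += eta * gb
```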

Page 27

Unrolled Generative Adversarial Networks

UGAN Diagram

Page 28

Unrolled Generative Adversarial Networks

UGAN Results

Page 29

Gradient descent GAN optimization is locally stable

Minimize the original objective as well as the gradient of the objective [NK17]

Gradient of the discriminator is how much discriminator can improve

If generator makes improvement, but discriminator gradient is large, discriminator can undo that improvement

If generator makes improvement in a way that the discriminator gradient is zero, discriminator cannot undo that improvement

$\ell_G = \ell_{G,0} + \eta \left\|\nabla \ell_D\right\|^2$

Page 30

How to train your DRAGAN

Minimize the norm of the gradient in a region around real data [KAHK17]

Minimizing the norm of the gradient makes a function smoother

Smooth in a random region around real data to smooth the discriminator

$\lambda\, \mathbb{E}_{X,\epsilon} \max\left(0, \left\|\nabla D(X + \epsilon)\right\|_2 - k\right)$
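
A sketch of this penalty term (the noise scale, λ, and k = 1 are illustrative assumptions; `D` is any discriminator network and `x_real` a batch of real data):

```python
import torch

def dragan_penalty(D, x_real, noise_std=0.5, k=1.0, lam=10.0):
    # Perturb real data and penalize discriminator gradient norms larger than k
    # in that neighborhood: lam * E max(0, ||grad D(x + eps)|| - k).
    x = (x_real.detach() + noise_std * torch.randn_like(x_real)).requires_grad_(True)
    grad, = torch.autograd.grad(D(x).sum(), x, create_graph=True)
    grad_norm = grad.flatten(1).norm(2, dim=1)       # per-example gradient norm
    return lam * torch.clamp(grad_norm - k, min=0.0).mean()

# Added to the usual discriminator loss:
# d_loss = adversarial_loss + dragan_penalty(D, x_real)
```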

Page 31

The Numerics of GANs

GANs define a vector field [MNG17]

Consider the joint field of generator and discriminator parameters

Stationary points are where the gradient of each player w.r.t. its own parameters is 0

Find these regions by minimizing the norm of the gradient of each player

Regularization parameter balances between adversarial objective and consensus objective

$L = L_0 + \lambda \left\|\nabla L_0\right\|_2^2$
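
A condensed sketch of a consensus-style update (assumptions: `g_loss` and `d_loss` are the two players' losses from one forward pass, `g_params`/`d_params` their parameter lists, and λ is illustrative); each player's gradient is augmented with the gradient of half the squared norm of the joint gradient field:

```python
import torch

def consensus_step(g_loss, d_loss, g_params, d_params, opt_g, opt_d, lam=10.0):
    params = list(g_params) + list(d_params)
    # v: the joint gradient field (each player's gradient w.r.t. its own parameters).
    v = (torch.autograd.grad(g_loss, g_params, create_graph=True, retain_graph=True)
         + torch.autograd.grad(d_loss, d_params, create_graph=True, retain_graph=True))
    # Consensus term: half the squared norm of the joint gradient field.
    reg = 0.5 * sum((g ** 2).sum() for g in v)
    reg_grads = torch.autograd.grad(reg, params, allow_unused=True)
    opt_g.zero_grad(); opt_d.zero_grad()
    for p, gv, gr in zip(params, v, reg_grads):
        gr = torch.zeros_like(p) if gr is None else gr
        p.grad = (gv + lam * gr).detach()
    opt_g.step(); opt_d.step()
```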

Page 32

The Numerics of GANs

Consensus

Why “consensus”?

Generator tries to minimize discriminator gradients as well as its own

Discriminator tries to minimize generator gradients as well as its own

Mutual de-escalation

Encourages towards minima, maxima, and saddle points

Small regularization parameter so hopefully finds the minima

Page 33

The Numerics of GANs

Vector Fields

https://www.inference.vc/my-notes-on-the-numerics-of-gans/

Page 34

Improved Techniques for Training GANs

A collection of interesting techniques and experiments

Feature matching

Minibatch discrimination

Historical averaging

One-sided label smoothing

Virtual batch normalization

Page 35

Improved Techniques for Training GANs

Feature Matching

Statistics of generated images should match statistics of real images

Discriminator produces multidimensional output, a “statistic” of the data

Generator trained to minimize L2 between real and generated data

Discriminator trained to maximize L2 between real and generated data

$\left\|\mathbb{E}_X D(X) - \mathbb{E}_Z D(G(Z))\right\|_2^2$
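
A sketch of the generator side of this loss (assumption: `D_stat` returns the discriminator's multidimensional statistic, in practice usually an intermediate feature layer):

```python
import torch

def feature_matching_g_loss(D_stat, x_real, x_fake):
    # || E_X D(X) - E_Z D(G(Z)) ||_2^2, estimated with batch means.
    f_real = D_stat(x_real).mean(dim=0).detach()   # treat the real statistics as the target
    f_fake = D_stat(x_fake).mean(dim=0)
    return ((f_real - f_fake) ** 2).sum()
```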

Page 36

Improved Techniques for Training GANs

Minibatch Discrimination

Discriminator can look at multiple inputs at once and decide if those inputs come from the real or generated distribution

GANs frequently collapse to a single point

Discriminator needs to differentiate between two distributions

Easier task if looking at multiple samples

Page 37

Improved Techniques for Training GANs

Historical Averaging

Dampen oscillations by encouraging updates to converge to a mean

GANs frequently create a cycle or experience oscillations

Add a term to reduce oscillations that encourages the current parameters to be near a moving average of the parameters:

$\left\|\theta - \frac{1}{t}\sum_i^t \theta_i\right\|_2^2$
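
A sketch of such a penalty (assumption: an exponential moving average stands in for the running mean, and the decay and coefficient are illustrative):

```python
import torch

class HistoricalAveraging:
    def __init__(self, params, decay=0.999, coeff=1.0):
        self.params = list(params)
        self.avg = [p.detach().clone() for p in self.params]   # running average of parameters
        self.decay, self.coeff = decay, coeff

    def penalty(self):
        # coeff * || theta - average(theta) ||^2, added to that player's loss.
        return self.coeff * sum(((p - a) ** 2).sum() for p, a in zip(self.params, self.avg))

    @torch.no_grad()
    def update(self):
        for p, a in zip(self.params, self.avg):
            a.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```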

Page 38

Improved Techniques for Training GANs

One-sided label smoothing

Don’t over-penalize generated images

Label smoothing is a common and easy technique that improves performance across many domains

Sigmoid tries hard to saturate to 0 or 1 but can never quite reach that goal
Provide targets that are $\epsilon$ or $1 - \epsilon$ so the sigmoid doesn’t saturate and overtrain

Experimentally, smooth the real targets but do not smooth the generated targets when training the discriminator
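
A sketch of the discriminator loss with one-sided smoothing (assumption: ε = 0.1):

```python
import torch
import torch.nn.functional as F

def d_loss_one_sided(d_real_logits, d_fake_logits, eps=0.1):
    real_targets = torch.full_like(d_real_logits, 1.0 - eps)   # smooth only the real targets
    fake_targets = torch.zeros_like(d_fake_logits)              # leave generated targets at 0
    return (F.binary_cross_entropy_with_logits(d_real_logits, real_targets)
            + F.binary_cross_entropy_with_logits(d_fake_logits, fake_targets))
```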

Page 39

Improved Techniques for Training GANs

Virtual Batch Normalization

Use batch normalization to accelerate convergence

Batch normalization accelerates convergence

However, hard to apply in an adversarial setting

Collect statistics on a fixed batch of real data and use them to normalize other data

Page 40

Least-Squares GAN

Use an L2 loss instead of cross-entropy [MLX+16]

Generator tries to minimize L2 loss $\left\|a - D(G(Z))\right\|_2^2$
Discriminator tries to minimize L2 loss $\left\|b - D(G(Z))\right\|_2^2 + \left\|c - D(X)\right\|_2^2$
For example, $a = 1$, $b = 0$, $c = 1$

Less squashing and fewer extreme values than cross-entropy
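
A sketch of these losses with a = 1, b = 0, c = 1 (assumption: the discriminator outputs a raw scalar with no sigmoid):

```python
import torch

def lsgan_g_loss(d_fake, a=1.0):
    # Generator: || a - D(G(Z)) ||^2
    return ((d_fake - a) ** 2).mean()

def lsgan_d_loss(d_fake, d_real, b=0.0, c=1.0):
    # Discriminator: || b - D(G(Z)) ||^2 + || c - D(X) ||^2
    return ((d_fake - b) ** 2).mean() + ((d_real - c) ** 2).mean()
```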

Page 41

Amortised MAP Inference for Image Super-Resolution

Part of a larger paper with several modifications and experiments [SCT+16]

“Instance Noise”

Add noise to generated and real images

Smooth the function learned by the discriminator

Simple and effective
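
A sketch of instance noise (assumptions: Gaussian noise whose standard deviation is annealed to zero over training; the schedule is illustrative):

```python
import torch

def add_instance_noise(x, step, total_steps, start_std=0.1):
    std = start_std * max(0.0, 1.0 - step / total_steps)   # anneal noise toward zero
    return x + std * torch.randn_like(x)

# Applied to both real and generated batches before the discriminator sees them:
# d_real = D(add_instance_noise(x_real, step, total_steps))
# d_fake = D(add_instance_noise(G(z), step, total_steps))
```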

Page 42

Energy-based Generative Adversarial Network

Simplify the loss function, removing powers and exponents [ZML16]

No longer easily described using JS divergence or something similar

Greatly simplified and linear

Gradients are neither squashed nor exploding

$L_D = \mathbb{E}_X D(X) + \mathbb{E}_Z\left[m - D(G(Z))\right]^+$

$L_G = D(G(z))$

Page 43

Wasserstein GAN

Further simplified and theoretically grounded [ACB17]

Solve Wasserstein “earth mover” distance

Provide a smoother gradient

Constrain Lipschitz of discriminator to 1

Harder to overtrain discriminator

$L_D = \mathbb{E}_X D(X) - \mathbb{E}_Z D(G(Z))$

$L_G = \mathbb{E}_Z D(G(Z))$

Page 44

Wasserstein GAN

Wasserstein Distance

The total $\sum \text{mass} \times \text{distance}$ required to transform one distribution to another.

Intuitive, symmetric measure of divergence

Hard to calculate because requires solving “optimal transport”

You have some distribution of products at some warehouses and you need to make some other distribution of products at those warehouses

Move the products to the target distribution minimizing $\sum \text{mass} \times \text{distance}$

Creating the optimal “transport plan” can be NP-hard

Page 45

Wasserstein GAN

Optimal Transport

Page 46

Wasserstein GAN

KL Example

$KL(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)}$

Real data is a point mass at 0

Generated data is a point mass at θ

If $\theta \neq 0$, $p(0) \log \frac{p(0)}{q(0)} = 1 \log \frac{1}{0} = \infty$

If $\theta = 0$, $1 \log \frac{1}{1} = 0$

Not differentiable w.r.t. θ

Page 47

Wasserstein GAN

JS Example

Calculate the average distribution and calculate the average KL to the average.

$m(x) = \frac{p(x) + q(x)}{2}$

$JS(p \| q) = \frac{1}{2} KL(p \| m) + \frac{1}{2} KL(q \| m)$

Real data is a point mass at 0

Generated data is a point mass at θ

If $\theta \neq 0$, $\frac{1}{2}\left[1 \log \frac{1}{0.5} + 1 \log \frac{1}{0.5}\right] = \log 2$

If $\theta = 0$, $\frac{1}{2}\left[1 \log \frac{1}{1} + 1 \log \frac{1}{1}\right] = 0$

Not differentiable w.r.t. θ

Page 48

Wasserstein GAN

Wasserstein Example

Calculate the mass times the distance in one dimension

Real data is a point mass at 0

Generated data is a point mass at θ

EM is $|\theta|$
Differentiable w.r.t. $\theta$!

Page 49

Wasserstein GAN

EM v JS Visualized

Page 50

Wasserstein GAN

GAN vs WGAN

Page 51

Wasserstein GAN

Kantorovich Dual

Result showing the EM distance can be calculated using a dual method

$\sup_{\varphi, \psi} \left( \int_X \varphi(x)\, d\mu(x) + \int_Y \psi(y)\, d\nu(y) \right)$

$\varphi(x) + \psi(y) \leq c(x, y)$

Provides justification for GAN techniques.

Page 52

Wasserstein GAN

So what’s the catch?

Need to constrain the Lipschitz constant of an arbitrary function

Function is nonlinear

Cannot calculate $\frac{\|D(x) - D(y)\|_2}{\|x - y\|_2}$ for all pairs

Just clip the weights to some value

Constraining the $L_\infty$ norm of all weight matrices upper bounds the Lipschitz constant of the function

Bound is not great

In practice, leads to poor discriminators
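
A sketch of the critic update with weight clipping (assumptions: a toy 2-D critic and generator, and RMSprop with lr 5e-5 and clip value 0.01, the settings reported in the WGAN paper's experiments):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))   # critic: no sigmoid
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)
clip_value = 0.01

def critic_step(x_real, z):
    opt_d.zero_grad()
    # Critic maximizes E_X D(X) - E_Z D(G(Z)); minimize the negative.
    loss = -(D(x_real).mean() - D(G(z).detach()).mean())
    loss.backward()
    opt_d.step()
    # Crude Lipschitz control: clip every weight into [-c, c].
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-clip_value, clip_value)
```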

Page 53

Improved Training of Wasserstein GAN

Wasserstein GAN-GP

Demonstrate the flaws of weight clipping [GAA+17]

Propose an alternative method of constraining the Lipschitz

Gradient norm should be 1 almost everywhere

Calculate the gradient of D at random samples

Add a penalty of the mean squared distance between the gradient norm and 1

$L = \mathbb{E}_X D(X) - \mathbb{E}_Z D(G(Z)) + \lambda\, \mathbb{E}_{X'}\left(\left\|\nabla D(X')\right\|_2 - 1\right)^2$

Page 54

Improved Training of Wasserstein GAN

Where to sample?

How to sample?

Cannot constrain gradient everywhere

Can only penalize gradient at some samples

Use samples that are random linear interpolations between real and fake data

Keeps discriminator smooth over the relevant region
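
A sketch of the gradient penalty evaluated at random interpolates (assumption: λ = 10, the value commonly used):

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    # Random points on the lines between real and generated samples.
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).detach().requires_grad_(True)
    grad, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
    grad_norm = grad.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Critic loss: -(D(x_real).mean() - D(x_fake).mean()) + gradient_penalty(D, x_real, x_fake)
```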

Page 55

Improved Training of Wasserstein GAN

Discriminator Visualization

Page 56

Spectral Normalized GAN

Bound the Lipschitz using the spectral norm of each layer [MKKY18]

Lipschitz of a single matrix multiplication can be calculated using a power iteration

$\sup_x \frac{\|xA\|_2}{\|x\|_2} = \|A\|_2^*$

Lipschitz of an MLP can be upper bounded by the product of each layer

Constrain each layer by calculating $W \leftarrow W / \|W\|_2^*$
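
A sketch of the power-iteration estimate and normalization for one weight matrix (assumption: `u` is a persistent vector carried across training steps; PyTorch also provides this as torch.nn.utils.spectral_norm):

```python
import torch
import torch.nn.functional as F

def spectral_normalize(W, u, n_iters=1, eps=1e-12):
    # W: (out_features, in_features); u: (out_features,) from the previous step.
    for _ in range(n_iters):
        v = F.normalize(W.t() @ u, dim=0, eps=eps)   # right singular vector estimate
        u = F.normalize(W @ v, dim=0, eps=eps)       # left singular vector estimate
    sigma = torch.dot(u, W @ v)                      # spectral norm estimate ||W||_2
    return W / sigma, u
```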

Page 57

Fisher GAN

Fisher IPM does not require constraining the discriminator [MS17]

Constraint is imposed by the loss function itself

Difference between $D(G(Z))$ and $D(X)$ is scaled by $D(G(Z))$ and $D(X)$

$\sup_f \dfrac{\mathbb{E}_X D(X) - \mathbb{E}_Z D(G(Z))}{\sqrt{\frac{1}{2}\mathbb{E}_X D(X)^2 + \frac{1}{2}\mathbb{E}_Z D(G(Z))^2}}$

Page 58

LapGAN

Break image generation into smaller chunks [DCSF15]

GAN generates small, blurry image

CGAN sharpens and enlarges image slightly

Repeat step 2 until desired size

Separate, small GANs simplify training

Page 59

Progressive Growing of GANs

Gradually expand networks, like a curriculum [KALL17]

Simple generator learns with simple discriminator

Gradually add layers to generator and discriminator to produce larger images

Page 60

Progressive Growing of GANs

Progressive GAN Visualization

Page 61

Table of Contents

1 Recap

2 Understanding Optimization Issues

3 GAN Training and Stabilization

4 Take Aways

Page 62

Common “Gotchas”

You can only reduce an NP problem into another NP problem

Unrolling or second gradients are computationally expensive (UGAN, CGAN)

Sampling-based methods can be unreliable (WGAN-GP)

Bound-based methods can easily learn poor local optima (WGAN, SNGAN)

Page 63

Lessons

Some major trends in the research

Regularize and smooth the discriminator, especially in some region around the real and generated points

Update the players towards some sort of consensus or future stable region, not just the immediate greedy update

Simplify networks and losses to eliminate nonlinearities

Constrain the “complexity” of the discriminator so you can train reliably without overtraining

Grow or chain GANs to reduce complexity, learn a curriculum

Page 64

Where is current research?

Current techniques have impressive results

Lots of GPUs and tuning required

But the results are constantly improving

Better techniques allow larger models, always pushing the limits

Mostly images but other modalities are in-progress

Page 65

Where is this going?

Ultimately, discriminator is calculating a divergence/loss function

Neural networks are universal function approximators

Loss functions are functions

How can we measure/constrain the “complexity” of a neural network?

How can we architect a neural network to learn a meaningful loss function?

Page 66

References I

Martin Arjovsky, Soumith Chintala, and Leon Bottou, Wasserstein GAN, arXiv e-prints (2017), arXiv:1701.07875.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus, Deep generative image models using a Laplacian pyramid of adversarial networks, CoRR abs/1506.05751 (2015).

Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville, Improved training of Wasserstein GANs, CoRR abs/1704.00028 (2017).

Page 67

References II

Naveen Kodali, Jacob D. Abernethy, James Hays, and Zsolt Kira, How to train your DRAGAN, CoRR abs/1705.07215 (2017).

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, CoRR abs/1710.10196 (2017).

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, Spectral normalization for generative adversarial networks, CoRR abs/1802.05957 (2018).

Page 68

References III

Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang, Multi-class generative adversarial networks with the L2 loss function, CoRR abs/1611.04076 (2016).

Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger, The numerics of GANs, CoRR abs/1705.10461 (2017).

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein, Unrolled generative adversarial networks, CoRR abs/1611.02163 (2016).

Youssef Mroueh and Tom Sercu, Fisher GAN, CoRR abs/1705.09675 (2017).

Page 69

References IV

Vaishnavh Nagarajan and J. Zico Kolter, Gradient descent GAN optimization is locally stable, CoRR abs/1706.04156 (2017).

Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár, Amortised MAP inference for image super-resolution, CoRR abs/1610.04490 (2016).

Junbo Jake Zhao, Michael Mathieu, and Yann LeCun, Energy-based generative adversarial network, CoRR abs/1609.03126 (2016).
