
Page 1

Generative Adversarial Networks


presented by Ian Goodfellow

presentation co-developed with Aaron Courville

Page 2

In today’s talk…

• “Generative Adversarial Networks” Goodfellow et al., NIPS 2014

• “Conditional Generative Adversarial Nets” Mirza and Osindero, NIPS Deep Learning Workshop 2014

• “On Distinguishability Criteria for Estimating Generative Models” Goodfellow, ICLR Workshop 2015

• “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks” Denton, Chintala, et al., arXiv 2015

Page 3

Generative modeling

• Have training examples x ~ pdata(x)
• Want a model that can draw samples: x ~ pmodel(x)
• Where pmodel ≈ pdata

[Figure: training samples, x ~ pdata(x), alongside model samples, x ~ pmodel(x)]

Page 4

Why generative models?

• Conditional generative models
- Speech synthesis: Text ⇒ Speech
- Machine Translation: French ⇒ English
• French: Si mon tonton tond ton tonton, ton tonton sera tondu.
• English: If my uncle shaves your uncle, your uncle will be shaved.
- Image ⇒ Image segmentation

• Environment simulator
- Reinforcement learning
- Planning

• Leverage unlabeled data?

Page 5

Maximum likelihood: the dominant approach

• ML objective function

\theta^{*} = \max_{\theta} \frac{1}{m} \sum_{i=1}^{m} \log p\left(x^{(i)}; \theta\right)
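As a concrete illustration (not from the slides), here is a minimal NumPy sketch evaluating this objective for a one-dimensional Gaussian model, where the ML solution is just the sample mean and standard deviation:

import numpy as np

def avg_log_likelihood(x, mu, sigma):
    # (1/m) * sum_i log p(x_i; theta) for p = N(mu, sigma^2)
    log_p = -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    return log_p.mean()

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)      # x ~ p_data

mu_hat, sigma_hat = data.mean(), data.std()           # ML estimates for a Gaussian
print(avg_log_likelihood(data, mu_hat, sigma_hat))    # highest value of any Gaussian fit
print(avg_log_likelihood(data, 0.0, 1.0))             # any other theta scores lower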

Page 6

Undirected graphical models

• Flagship undirected graphical model: Deep Boltzmann machines

• Several “hidden layers” h

p(h, x) = \frac{1}{Z} \tilde{p}(h, x), \qquad \tilde{p}(h, x) = \exp\left(-E(h, x)\right), \qquad Z = \sum_{h,x} \tilde{p}(h, x)

[Figure: Deep Boltzmann machine with hidden layers h^{(1)}, h^{(2)}, h^{(3)} above the visible units x]
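To make the role of Z concrete, here is a toy brute-force computation (illustrative, not from the talk): for a small fully connected Boltzmann machine over n binary units, Z sums over all 2^n states, which is exactly what becomes intractable at realistic sizes:

import itertools
import numpy as np

n = 10                                   # 2**10 = 1024 states; hopeless for large n
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                        # symmetric coupling matrix
np.fill_diagonal(W, 0.0)

def energy(s):
    return -0.5 * s @ W @ s

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
p_tilde = np.exp(-np.array([energy(s) for s in states]))   # unnormalized probability
Z = p_tilde.sum()                        # the partition function
p = p_tilde / Z
print(Z, p.sum())                        # p now sums to 1.0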

Page 7

Boltzmann Machines: disadvantage

• Model is badly parameterized for learning high-quality samples: peaked distributions → slow mixing

• Why poor mixing?

[Figure: MNIST dataset and 1st-layer features learned by an RBM]

Moving between modes requires coordinated flipping of many low-level features, which sampling does slowly.

Page 8

Directed graphical models

• Two problems:
1. Summation over exponentially many states in h
2. Posterior inference, i.e. calculating p(h | x), is intractable.

p(x, h) = p(x \mid h^{(1)})\, p(h^{(1)} \mid h^{(2)}) \cdots p(h^{(L-1)} \mid h^{(L)})\, p(h^{(L)})

\frac{d}{d\theta_i} \log p(x) = \frac{1}{p(x)} \frac{d}{d\theta_i} p(x), \qquad p(x) = \sum_{h} p(x \mid h)\, p(h)

[Figure: directed model with layers h^{(1)}, h^{(2)}, h^{(3)} above the visible units x]

Page 9

Variational Autoencoder

[Figure: a differentiable encoder maps x, sampled from the data, to a sample from q(z); a differentiable decoder maps noise z to E[x | z]]

Maximize \log p(x) - D_{\mathrm{KL}}\left(q(z) \,\|\, p(z \mid x)\right)

(Kingma and Welling, 2014; Rezende et al., 2014)
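For reference, this objective follows from the standard decomposition of the log-likelihood (writing q(z | x) for the slide's q(z); the identity is standard, not specific to this talk):

\log p(x) = \underbrace{\mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z)\right)}_{\text{ELBO}} + D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z \mid x)\right)

Since the last KL term is nonnegative, maximizing the ELBO maximizes a lower bound on \log p(x), and the bound is tight exactly when q(z | x) = p(z | x).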

Page 10

Generative stochastic networks

• General strategy: Do not write a formula for p(x), just learn to sample incrementally.

• Main issue: Subject to some of the same constraints on mixing as undirected graphical models.


(Bengio et al., 2013)

Page 11

Generative adversarial networks

• Don’t write a formula for p(x), just learn to sample directly.

• No Markov chain

• No variational bound

• How? By playing a game.

Page 12

Game theory: the basics

• N > 1 players
• Clearly defined set of actions each player can take
• Clearly defined relationship between actions and outcomes
• Clearly defined value of each outcome
• Can't control the other player's actions

Page 13

Two-player zero-sum game

• Your winnings + your opponent's winnings = 0
• Minimax theorem: a rational strategy exists for all such finite games

Page 14

• Strategy: specification of which moves you make in which circumstances.

• Equilibrium: each player's strategy is the best possible response to their opponent's strategy.

• Example: Rock-paper-scissors:

- Mixed strategy equilibrium

- Choose your action at random

Payoff matrix (entries are your winnings; rows are your move, columns your opponent's):

           Rock   Paper   Scissors
Rock         0     -1        1
Paper        1      0       -1
Scissors    -1      1        0
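A quick numeric check (illustrative, not from the slides): under the uniform mixed strategy the expected payoff is zero and no pure deviation does better, which is why playing at random is the equilibrium:

import numpy as np

A = np.array([[ 0, -1,  1],     # rows: your move (R, P, S)
              [ 1,  0, -1],     # columns: opponent's move (R, P, S)
              [-1,  1,  0]])
p = np.ones(3) / 3              # uniform mixed strategy

print(p @ A @ p)                # 0.0: the value of the game
print(A @ p)                    # [0. 0. 0.]: every pure response also earns 0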

Page 15

Adversarial nets framework

• A game between two players:
1. Discriminator D
2. Generator G

• D tries to discriminate between:
- a sample from the data distribution, and
- a sample from the generator G.

• G tries to “trick” D by generating samples that are hard for D to distinguish from data.

Page 16

Adversarial nets framework

[Figure: input noise z feeds differentiable function G to produce x sampled from the model; differentiable function D receives either x sampled from the model, for which D tries to output 0, or x sampled from the data, for which D tries to output 1]

Page 17

Zero-sum game

• Minimax value function:

In other words, D and G play the following two-player minimax game with value function V(G, D):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

The first term is the discriminator's ability to recognize data as being real; the second is its ability to recognize generator samples as being fake. The discriminator pushes the value up; the generator pushes it down.

From the paper: optimizing D to completion in the inner loop of training is computationally prohibitive and on finite datasets would result in overfitting, so we instead alternate between k steps of optimizing D and one step of optimizing G, which keeps D near its optimal solution so long as G changes slowly enough. Also, early in learning, when G is poor, D can reject samples with high confidence, so log(1 − D(G(z))) saturates and provides little gradient; training G to maximize log D(G(z)) instead yields the same fixed point of the dynamics but much stronger gradients early in learning.
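To make the alternating procedure concrete, here is a minimal PyTorch sketch on a toy 1-D Gaussian dataset; everything here (architectures, learning rates, k = 1) is an illustrative assumption rather than the paper's exact setup, and it uses the non-saturating maximize-log D(G(z)) variant described above:

import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, batch = 8, 128

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()

def sample_data(n):                        # stand-in for x ~ p_data
    return torch.randn(n, 1) * 1.5 + 2.0   # N(2, 1.5^2)

for step in range(2000):
    # k = 1 discriminator step: push D(x) toward 1 and D(G(z)) toward 0
    x = sample_data(batch)
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()                   # no generator gradient on this step
    loss_D = bce(D(x), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # one generator step: maximize log D(G(z)) (non-saturating loss)
    z = torch.randn(batch, latent_dim)
    loss_G = bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.randn(5, latent_dim)).detach().squeeze())   # samples near N(2, 1.5^2)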

Page 18

Discriminator strategy

• Optimal strategy for any pmodel(x) is always

D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}
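A small numeric illustration (assumed densities, not from the slides): when both densities are known in closed form, the optimal discriminator can be evaluated directly and behaves as expected:

import numpy as np
from scipy.stats import norm

p_data = norm(loc=0.0, scale=1.0)      # stand-in for p_data
p_model = norm(loc=1.0, scale=1.0)     # stand-in for p_model

x = np.linspace(-4.0, 5.0, 7)
D_star = p_data.pdf(x) / (p_data.pdf(x) + p_model.pdf(x))
print(np.round(D_star, 3))
# D*(x) -> 1 where the data density dominates, -> 0 where the model's does,
# and equals 0.5 exactly where the two densities cross (here at x = 0.5).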

Page 19

Learning process

Figure 1 from the paper: the discriminative distribution (D, blue dashed line) is updated to discriminate samples of the data-generating distribution (black dotted line) from those of the generative distribution pg (green solid line). The lower horizontal line is the domain from which z is sampled (here uniformly); the line above is part of the domain of x, and the upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg. (a) An adversarial pair near convergence: pg is similar to pdata and D is a partially accurate classifier. (b) In the inner loop, D is trained to discriminate samples from data, converging to D*(x) = pdata(x) / (pdata(x) + pg(x)). (c) After an update to G, the gradient of D has guided G(z) to flow toward regions more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they reach a point where neither can improve because pg = pdata; the discriminator cannot differentiate the two distributions, i.e. D(x) = 1/2.

Panels: poorly fit model → after updating D → after updating G → mixed strategy equilibrium. (Black dotted: data distribution; green solid: model distribution; blue dashed: D(x).)

Page 20

Learning process

[Figure 1 repeated from the previous slide]

Page 21

Learning process

[Figure 1 repeated from the previous slide]

Page 22

Learning process

[Figure 1 repeated from the previous slide]

Page 23

Theoretical properties

• Theoretical properties (assuming infinite data, infinite model capacity, direct updating of the generator's distribution):
- Unique global optimum.
- Optimum corresponds to the data distribution.
- Convergence to the optimum guaranteed.


In practice: no proof that SGD converges

Page 24

Oscillation


(Alec Radford)

Page 25

Visualization of model samples

[Sample grids: MNIST; TFD; CIFAR-10 (fully connected model); CIFAR-10 (convolutional model)]

Page 26

Learned 2-D manifold of MNIST


Page 27

Visualizing trajectories

1. Draw sample (A)

2. Draw sample (B)

3. Simulate samples along the path between A and B

4. Repeat steps 1-3 as desired. (A code sketch of steps 1-3 follows below.)

[Figure: endpoints A and B]
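Here is a sketch of steps 1-3 (illustrative: G is a stand-in linear generator, and in practice "drawing a sample" means drawing its latent z; the path is a straight line in z-space):

import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 8, 4
W = rng.normal(size=(data_dim, latent_dim))

def G(z):                                # stand-in for a trained generator
    return np.tanh(W @ z)

z_a = rng.normal(size=latent_dim)        # 1. draw sample A (via its latent)
z_b = rng.normal(size=latent_dim)        # 2. draw sample B
for t in np.linspace(0.0, 1.0, 9):       # 3. decode along the path from A to B
    print(np.round(G((1 - t) * z_a + t * z_b), 2))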

Page 28

Visualization of model trajectories

[Trajectory visualizations: MNIST digit dataset; Toronto Face Dataset (TFD)]

Page 29

Visualization of model trajectories

[Trajectory visualization: CIFAR-10 (convolutional)]

Page 30

GANs vs VAEs

• Both use backprop through continuous random number generation

• VAE:

- generator gets direct output target

- need REINFORCE to do discrete latent variables

- possible underfitting due to variational approximation

- gets global image composition right but blurs details

• GAN:

- generator never sees the data

- need REINFORCE to do discrete visible variables

- possible underfitting due to non-convergence

- gets local image features right but not global structure

Page 31

VAE + GAN

(Alec Radford, 2015)

[Sample comparison: VAE vs. VAE+GAN]

- Reduce VAE blurriness
- Reduce GAN oscillation

Page 32

MMD-based generator nets

(Li et al., 2015; Dziugaite et al., 2015)

Page 33

Supervised Generator Nets

(Dosovitskiy et al., 2014)

Generator nets are powerful; it is our ability to infer a mapping from an unobserved space that is limited.

Page 34

General game


Page 35

Extensions

• Inference net:
- Learn a network to model p(z | x)
- Wake/sleep-style approach:
- Sample z from the prior
- Sample x from p(x | z)
- Learn the mapping from x back to z
- Infinite training set! (see the sketch below)
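A minimal sketch of that loop (illustrative names and architectures, in PyTorch): because fresh (z, x) pairs can be drawn from the generator forever, the inference net E never sees the same example twice:

import torch
import torch.nn as nn

latent_dim, data_dim = 8, 4
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
E = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
opt = torch.optim.Adam(E.parameters(), lr=1e-3)

for step in range(1000):
    z = torch.randn(128, latent_dim)     # sample z from the prior
    with torch.no_grad():
        x = G(z)                         # sample x from p(x | z)
    loss = ((E(x) - z) ** 2).mean()      # learn the mapping from x back to z
    opt.zero_grad(); loss.backward(); opt.step()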

Page 36

Extensions

• Conditional model:
- Learn p(x | y)
- Discriminator is trained on (x, y) pairs
- Generator net gets y and z as input
- Useful for: translation, speech synthesis, image segmentation (see the conditioning sketch below)

(Mirza and Osindero, 2014)
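A minimal sketch of the conditioning mechanism (illustrative, PyTorch): one common choice, consistent with the setup above, is to feed the label y, here one-hot, to both networks by concatenation:

import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, data_dim, n_classes = 8, 4, 10
G = nn.Sequential(nn.Linear(latent_dim + n_classes, 32), nn.ReLU(),
                  nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim + n_classes, 32), nn.ReLU(),
                  nn.Linear(32, 1), nn.Sigmoid())

y = F.one_hot(torch.randint(0, n_classes, (16,)), n_classes).float()
z = torch.randn(16, latent_dim)
x_fake = G(torch.cat([z, y], dim=1))      # generator gets y and z as input
score = D(torch.cat([x_fake, y], dim=1))  # discriminator sees (x, y) pairs
print(score.shape)                        # torch.Size([16, 1])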

Page 37

Laplacian Pyramid

(Denton, Chintala, et al., 2015)

Page 38

LAPGAN results

• 40% of samples mistaken by humans for real photos

(Denton, Chintala, et al., 2015)

Page 39

Open problems

• Is non-convergence a serious problem in practice?

• If so, how can we prevent non-convergence?

• Is there a better loss function for the generator?

Page 40

Thank You.


Questions?