Generative Adversarial Networks: Theory
Srihari, CSE676, 22.2 GAN Theory (2019.5.11)
TRANSCRIPT
Deep Learning Srihari
Topics in GAN
1. GAN Motivation
2. GAN Theory (discussed here)
   1. Generator and Discriminator
   2. Adversarial Modeling Framework
   3. GAN Objective and SGD
   4. Theoretical Results
   5. GAN Difficulties
3. Mode Collapse
4. Wasserstein GAN
5. GAN Variants
Generator Network
• To learn the generator's distribution pg over data x, we define a prior on input noise variables pz(z)
• Represent a mapping to data space as G(z; θg)
  – where G is a differentiable generator function represented by a multilayer perceptron with parameters θg

[Figure: the generator network G(z; θg) maps z ~ pz(z) to x, which is distributed according to pg = p(x|z); the joint factorizes as p(x,z) = p(x|z)p(z)]
Differentiable Generator Functions
1. To sample from x ~ N(μ,Σ): draw z ~ N(0,I) and set x = g(z) = μ + Lz
   • where L is the Cholesky factor of Σ
     – Decomposition: Σ = LLᵀ where L is lower triangular
2. To sample from any p(x): draw z ~ U(0,1) and set x = g(z) = h⁻¹(z), where h(x) is the cdf of p(x)
   – The resulting distribution is governed by the change-of-variables formula p_x(x) = p_z(z) / |det(∂g/∂z)|
     • Denominator is the determinant of the Jacobian of g
   • Alternatively, define p(x|z) and marginalize z, e.g., by using a VAE
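Both sampling recipes above can be sketched in NumPy. This is a minimal illustration; the Gaussian parameters and the Exponential(1) target for the inverse-cdf case are arbitrary choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Sample x ~ N(mu, Sigma) via the Cholesky factor: x = g(z) = mu + L z
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)      # Sigma = L L^T, L lower triangular
z = rng.standard_normal((10000, 2))
x = mu + z @ L.T                   # each row is one sample of x

# 2. Sample from a chosen p(x) by inverting its cdf h: x = h^{-1}(z), z ~ U(0,1)
# Example target: Exponential(1), cdf h(x) = 1 - exp(-x), so h^{-1}(z) = -log(1 - z)
u = rng.uniform(size=10000)
x_exp = -np.log(1.0 - u)

print(x.mean(axis=0))   # close to mu
print(np.cov(x.T))      # close to Sigma
print(x_exp.mean())     # close to 1, the mean of Exponential(1)
```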
Generator and Discriminator Networks
1. Generator network produces samples x = G(z; θ(g))
2. Discriminator network
   1. Attempts to distinguish samples drawn from the training data from samples drawn from the generator
   2. Emits a probability value
      • D(x; θ(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model
Adversarial Modeling Framework
• We also define a second multilayer perceptron D(x; θd) that outputs a single scalar
  – D(x) represents the probability that x came from the data rather than from pg
Training of D and G
1. We train D to maximize the probability of assigning the correct label to both training examples and samples from G
2. We simultaneously train G to minimize log(1−D(G(z)))
• In other words, D and G play a two-player minimax game with value function V(G,D):
  min_G max_D V(D,G) = 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
Minimax value explanation
1. D is made accurate on real data by maximizing 𝔼x∼pdata(x)[log D(x)]
   – Given a fake sample G(z), z∼pz(z), D(G(z)) is pushed close to zero by maximizing 𝔼z∼pz(z)[log(1−D(G(z)))]
2. G is trained to increase D's probability for a fake example,
   – i.e., to minimize 𝔼z∼pz(z)[log(1−D(G(z)))]
• Combined together, D and G play a minimax game in which we optimize:
  min_G max_D V(D,G) = 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
Optimal value of D
• To find the best value for D, consider
  L(G,D) = ∫x (pdata(x) log D(x) + pg(x) log(1−D(x))) dx
  – Since we want the value of D(x) to maximize L(G,D), label x̃ = D(x), A = pdata(x), B = pg(x)
• What is inside the integral is: f(x̃) = A log x̃ + B log(1−x̃)
  – Setting the derivative to zero: df/dx̃ = A/x̃ − B/(1−x̃) = 0, giving x̃ = A/(A+B), i.e.,
    D∗(x) = pdata(x) / (pdata(x) + pg(x))
  – Once G is trained to optimality, pg is very close to pdata
  – When pg = pdata, D∗(x) becomes 1/2
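The pointwise maximization above can be checked numerically. A small sketch, with arbitrary example values for A = pdata(x) and B = pg(x):

```python
import numpy as np

# At a fixed x, the integrand is f(d) = A log d + B log(1 - d),
# with A = pdata(x) and B = pg(x) (arbitrary example values here)
A, B = 0.7, 0.2

# Grid search over d = D(x) in (0, 1)
d = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
f = A * np.log(d) + B * np.log(1.0 - d)

d_best = d[np.argmax(f)]
print(d_best, A / (A + B))   # both ~ 0.7778, matching D*(x) = A / (A + B)
```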
Intuition in training of GAN

[Figure: the discriminative distribution D, the data-generating distribution pdata(x), and the generative distribution pg(G(z)); the goal is to make pdata and pg equal. z is sampled (uniformly) from its domain and mapped non-uniformly to x = G(z); G contracts in regions of high density of pg and expands in regions of low density. Here pg is close to pdata(x).]
Update of D
• D is trained to discriminate samples from data, converging to
  D∗(x) = pdata(x) / (pdata(x) + pg(x))

[Figure: the discriminative distribution D(x) after the update, shown against the data-generating distribution pdata(x) and the generative distribution pg(G(z))]
Update of G
• After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely to be classified as data

[Figure: the mapping z → x = G(z) shifts the generative distribution pg(G(z)) toward the data]
Final State
• After several steps of training, if D and G have enough capacity, they will reach a point at which both cannot improve because pg = pdata
• The discriminator is unable to differentiate between the two distributions, i.e., D(x) = 1/2

[Figure: pg(G(z)) coincides with pdata(x), and the discriminative distribution D(x) is flat at 1/2]
GAN Objective
• Value function of the minimax game played by G and D:
  min_G max_D V(D,G) = 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
  – pdata: data distribution, represented by samples
  – pg: model distribution, with the noise z modeled as Gaussian
• We must implement the game using an iterative, numerical approach
  – Optimizing D to completion in the inner loop of training is computationally prohibitive and, on finite data sets, results in overfitting
  – Instead we alternate between k steps of optimizing D and one step of optimizing G
Interaction between G and D
• Weights are updated so as to
  – maximize the probability that any real data input x is classified as belonging to the real dataset, while
  – minimizing the probability that any fake image is classified as belonging to the real dataset
• In more technical terms,
  – the loss function maximizes D(x), and it also minimizes D(G(z))
Alternating gradient updates
• Step 1: Fix G and perform a gradient step for the maximization:
  max_D 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
• Step 2: Fix D and perform a gradient step for the minimization:
  – In theory: min_G 𝔼z∼pz(z)[log(1−D(G(z)))]
  – In practice: max_G 𝔼z∼pz(z)[log D(G(z))], which provides stronger gradients early in training, when D confidently rejects samples from G
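The "in practice" variant exists because of gradient magnitudes. Writing D(G(z)) = σ(s) for the discriminator's pre-activation s, the derivative of log(1−σ(s)) with respect to s is −σ(s), while the derivative of log σ(s) is 1−σ(s). A small sketch comparing the two when D confidently rejects a fake (D(G(z)) near 0; the value s = −5 is an arbitrary illustration):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# D(G(z)) = sigmoid(s); early in training D confidently rejects fakes, so s is very negative
s = -5.0
D_fake = sigmoid(s)                     # ~ 0.0067

# |d/ds log(1 - sigmoid(s))| = sigmoid(s): the "in theory" loss saturates
grad_saturating = D_fake
# |d/ds log sigmoid(s)| = 1 - sigmoid(s): the "in practice" loss stays large
grad_nonsaturating = 1.0 - D_fake

print(grad_saturating, grad_nonsaturating)   # tiny vs. close to 1
```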
GAN Architecture
• G and D are deep neural networks; training consists of backpropagation through both
  – the generator x = G(z; θ(G)) and the discriminator D(x; θ(D)), with parameters θ(G) and θ(D)
SGD for GAN
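The alternating SGD procedure (k discriminator steps per generator step) can be sketched on a toy problem. Everything below — the 1-D Gaussian data, the one-parameter generator x = θ + z, the logistic discriminator, and all hyperparameters — is an illustrative assumption, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration): real data ~ N(2, 1);
# generator G(z) = theta + z with z ~ N(0, 1), so pg = N(theta, 1);
# discriminator D(x) = sigmoid(w*x + b), gradients derived by hand.
theta, w, b = 0.0, 0.1, 0.0
lr_d, lr_g, k, m = 0.05, 0.05, 3, 128

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

for _ in range(2000):
    # k steps ascending D's objective: E[log D(x)] + E[log(1 - D(G(z)))]
    for _ in range(k):
        x_real = 2.0 + rng.standard_normal(m)
        x_fake = theta + rng.standard_normal(m)
        d_real = sigmoid(w * x_real + b)
        d_fake = sigmoid(w * x_fake + b)
        w += lr_d * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
        b += lr_d * (np.mean(1 - d_real) - np.mean(d_fake))
    # one step on G, non-saturating form: ascend E[log D(G(z))]
    z = rng.standard_normal(m)
    d_fake = sigmoid(w * (theta + z) + b)
    theta += lr_g * np.mean((1 - d_fake) * w)

print(theta)   # should have moved from 0 toward the data mean 2
```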
Theoretical Results
• Theorem 1:
  – The global minimum of the virtual training criterion C(G) = max_D V(G,D) is achieved if and only if pg = pdata
  • At that point C(G) achieves the value −log 4
• Convergence of Algorithm 1:
  – If G and D have enough capacity, and at each step of Algorithm 1, D is allowed to reach its optimum given G, and pg is updated so as to improve the criterion,
  – then pg converges to pdata
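Theorem 1's value can be checked numerically: when pg = pdata, the optimal discriminator is D∗(x) = 1/2 everywhere, so C(G) = ∫ (pdata(x) log ½ + pg(x) log ½) dx = −log 4. A sketch with pdata = pg = N(0,1) discretized on a grid (the grid and the Gaussian are illustrative choices):

```python
import numpy as np

# Discretize pdata = pg = N(0, 1) on a grid (illustrative choice)
x = np.linspace(-8, 8, 200001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # pdata(x)
q = p.copy()                                    # pg(x) = pdata(x)

d_star = p / (p + q)                            # optimal discriminator: 1/2 everywhere
c_g = np.sum((p * np.log(d_star) + q * np.log(1 - d_star)) * dx)

print(c_g, -np.log(4))    # both ~ -1.3863
```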
GAN and Convexity
• Main motivation for GAN: no approximation is needed
  – neither in inference nor in the partition function
• When max_D V(G,D) is convex in θg
  – (in the space of probability density functions)
  – the procedure is guaranteed to converge
• If G and D are neural nets and max_D V(G,D) is nonconvex
  – learning is difficult
• From min-max theory, mode collapse is caused by improper convergence to the optimum, due to the non-convexity and non-concavity of the objective function
GAN underfitting: non-convergence
• Simultaneous gradient descent on the two players' costs is not guaranteed to reach an equilibrium
  – Example: v(a,b) = ab
  • where Player 1 controls a and incurs cost ab, while Player 2 controls b and incurs cost −ab
• If each player makes small gradient steps, each reducing its own cost at the expense of the other,
  – then a and b go into a stable circular orbit, rather than arriving at the equilibrium point at the origin
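The v(a,b) = ab example can be simulated directly. With Player 1 descending its cost ab (a ← a − η·b) and Player 2 descending its cost −ab (b ← b + η·a), the iterates rotate around the origin instead of converging to it; with finite step size η they even spiral slightly outward, since each simultaneous update multiplies a² + b² by (1 + η²). A small sketch:

```python
import math

a, b, eta = 1.0, 0.0, 0.01
r0 = math.hypot(a, b)   # distance from the equilibrium at the origin

for _ in range(10000):
    grad_a = b           # d(ab)/da, descended by Player 1
    grad_b = -a          # d(-ab)/db, descended by Player 2
    a, b = a - eta * grad_a, b - eta * grad_b   # simultaneous update

r = math.hypot(a, b)
print(r0, r)   # the radius has grown, not shrunk toward 0
```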
Proposed Solutions
1. Make the game no longer zero-sum
   – Same expected gradient as maximum likelihood learning
   – Does not work well in practice
2. Have G increase the probability that D makes a mistake, rather than aiming to decrease the probability that D makes the right prediction
   – Motivated by keeping derivatives large
• Stabilization of GAN learning remains an open problem