
Deep Learning Srihari


Generative Adversarial Networks: Theory

Sargur N. Srihari
srihari@cedar.buffalo.edu


Topics in GAN
1. GAN Motivation
2. GAN Theory ← Discussed Here
   1. Generator and Discriminator
   2. Adversarial Modeling Framework
   3. GAN Objective and SGD
   4. Theoretical Results
   5. GAN Difficulties
3. Mode Collapse
4. Wasserstein GAN
5. GAN Variants


Generator Network
• To learn the generator's distribution pg over data x, we define a prior on input noise variables pz(z)
• Represent a mapping to data space as G(z; θg)
  – where G is a differentiable generator function represented by a multilayer perceptron with parameters θg

p(x,z) = p(x|z) p(z)

[Figure: Generator Network G(z; θg) — G(z) maps z, drawn from pz(z), to x, which is distributed according to pg = p(x|z)]


Differentiable Generator Functions
1. To sample from x ~ N(μ,Σ): draw z ~ N(0,I) and set x = g(z) = μ + Lz
   • where L is the Cholesky factor of Σ
     – Decomposition: Σ = LLᵀ where L is lower triangular
2. To sample from any p(x): draw z ~ U(0,1) and set x = g(z) = h⁻¹(z), where h is the cdf of p(x)
   – The resulting distribution is governed by the change-of-variables formula p(x) = pz(g⁻¹(x)) / |det(∂g/∂z)|
     • Denominator is the determinant of the Jacobian
   • Alternatively, define p(x|z) and marginalize over z: p(x) = ∫ p(x|z) pz(z) dz, e.g., by using a VAE
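The two constructions above can be sketched in NumPy (a minimal illustration added here, not from the slides; the Laplace distribution is chosen only because its cdf h inverts in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# 1. Sample x ~ N(mu, Sigma) via the Cholesky factor: x = g(z) = mu + L z
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)          # Sigma = L @ L.T, L lower triangular
z = rng.standard_normal((n, 2))        # z ~ N(0, I)
x = mu + z @ L.T                       # each row is one sample from N(mu, Sigma)

# 2. Sample from an arbitrary p(x) by inverting its cdf: x = h^{-1}(z), z ~ U(0,1)
# Example: standard Laplace, whose cdf h inverts in closed form
def h_inv(u):
    return np.sign(u - 0.5) * -np.log(1.0 - 2.0 * np.abs(u - 0.5))

u = rng.uniform(size=n)
x_lap = h_inv(u)                       # distributed as standard Laplace (var = 2)
```

Both maps are differentiable in their parameters, which is what lets gradients flow from D back into the generator.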


Generator and Discriminator Networks
1. Generator Network produces samples x = G(z; θ(g))
2. Discriminator Network
   1. Attempts to distinguish samples drawn from the training data from samples drawn from the generator
   2. Emits a probability value D(x; θ(d)) indicating the probability that x is a real training example rather than a fake sample drawn from the model


Adversarial Modeling Framework
• We also define a second multilayer perceptron D(x; θd) that outputs a single scalar
  – D(x) represents the probability that x came from the data rather than from pg


Training of D and G
1. We train D to maximize the probability of assigning the correct label to both training examples and samples from G
2. We simultaneously train G to minimize log(1−D(G(z)))
• In other words, D and G play a two-player minimax game with value function V(G,D):
  min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]


Minimax value explanation
1. D is trained to be accurate over real data by maximizing E_{x∼pdata(x)}[log D(x)]
   – Given a generated sample G(z), z∼pz(z), D(G(z)) is pushed close to zero by maximizing E_{z∼pz(z)}[log(1−D(G(z)))]
2. G is trained to increase D's probability for a fake example,
   – i.e., to minimize E_{z∼pz(z)}[log(1−D(G(z)))]
• Combined together, D and G are playing a minimax game in which we optimize:
  min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]
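A hedged numeric illustration (not from the slides): for a toy choice of pdata = N(0,1) and pg = N(μg,1), with D fixed at its pointwise optimum D* = pdata/(pdata+pg), a Monte Carlo estimate of V shows the value is exactly −log 4 when pg = pdata and larger otherwise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def npdf(x, mu):                       # density of N(mu, 1)
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def value(mu_g):
    """Monte Carlo estimate of V(G, D*) with pdata = N(0,1), pg = N(mu_g,1)."""
    x_real = rng.standard_normal(n)           # x ~ pdata
    x_fake = rng.standard_normal(n) + mu_g    # x ~ pg
    d_star = lambda x: npdf(x, 0.0) / (npdf(x, 0.0) + npdf(x, mu_g))
    return np.mean(np.log(d_star(x_real))) + np.mean(np.log(1 - d_star(x_fake)))

v_equal = value(0.0)   # pg = pdata: D* = 1/2 everywhere, V = -log 4
v_far   = value(3.0)   # pg far from pdata: D* is confident, V is much larger
```

This is the same −log 4 that reappears below as the global minimum of the training criterion.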


Optimal value of D
• To find the best value for D, consider
  L(G,D) = ∫x [pdata(x) log D(x) + pg(x) log(1−D(x))] dx
• Since we want the value of D(x) to maximize L(G,D), label x̃ = D(x), A = pdata(x), B = pg(x)
  – What is inside the integral is f(x̃) = A log x̃ + B log(1−x̃)
  – Setting the derivative df/dx̃ = A/x̃ − B/(1−x̃) to zero gives
    D*(x) = A/(A+B) = pdata(x) / (pdata(x) + pg(x))
  – Once G is trained to optimality, pg is very close to pdata
  – When pg = pdata, D*(x) becomes 1/2
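As a quick sanity check of this derivative argument (an illustration, not part of the slides), a grid scan of f(x̃) = A log x̃ + B log(1−x̃) recovers the maximizer A/(A+B):

```python
import numpy as np

A, B = 0.7, 0.2                          # pdata(x) and pg(x) at some fixed x
t = np.linspace(1e-6, 1 - 1e-6, 2_000_001)
f = A * np.log(t) + B * np.log(1 - t)    # the integrand as a function of t = D(x)
t_star = t[np.argmax(f)]                 # numeric maximizer; compare to A/(A+B)
```

Because f is strictly concave on (0,1), this stationary point is the unique maximum.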


Intuition in training of GAN

[Figure: the discriminative distribution D (dashed), the data-generating distribution pdata(x), and the generative distribution pg(G(z)); the goal is to make pdata and pg equal. Below, the domain from which z is sampled (uniformly) is mapped by the non-uniform mapping x = G(z): G contracts in regions of high density of pg and expands in regions of low density. In this panel pg is close to pdata.]


Update of D

D is trained to discriminate samples from data, converging to

D*(x) = pdata(x) / (pdata(x) + pg(x))

[Figure: the discriminative distribution D(x) after the update, shown against pdata(x) and pg(G(z))]


Update of G

After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely to be classified as data

[Figure: z mapped through G to x; the generative distribution pg(G(z)) shifts toward the data]


Final State

After several steps of training, if D and G have enough capacity, they will reach a point at which both cannot improve because pg = pdata.

The discriminator is then unable to differentiate between the two distributions, i.e., D(x) = 0.5.

[Figure: pg(G(z)) coincides with pdata(x); the discriminative distribution D(x) is flat at 1/2]

GAN Objective
• Value function of the minimax game played by G and D:
  min_G max_D V(D,G) = E_{x∼pX(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]
  – pX: data distribution, represented by samples; pG(z): model distribution, with z modeled as Gaussian
• We must implement the game using an iterative, numerical approach
  – Optimizing D to completion in the inner loop of training is computationally prohibitive and, on finite data sets, results in overfitting
  – Instead we alternate between k steps of optimizing D and one step of optimizing G


Interaction between G and D

• Weights are updated so as to
  – maximize the probability that any real data input x is classified as belonging to the real dataset, while
  – minimizing the probability that any fake image is classified as belonging to the real dataset
• In more technical terms, the loss function maximizes D(x) and minimizes D(G(z))


Alternating gradient updates

• Step 1: Fix G and perform a gradient step for the maximization:
  max_D E_{x∼pdata(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]
• Step 2: Fix D and perform a gradient step for the minimization:
  – In theory: min_G E_{z∼pz(z)}[log(1−D(G(z)))]
  – In practice: max_G E_{z∼pz(z)}[log D(G(z))], which provides stronger gradients early in training
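These two alternating steps can be sketched with a deliberately tiny, hypothetical model (an illustration, not the slides' setup): the generator is a pure shift G(z) = z + θ, the discriminator a single logistic unit D(x) = σ(wx + b), all gradients are written out by hand, and Step 2 uses the in-practice (non-saturating) objective:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.0          # generator parameter: G(z) = z + theta
w, b = 0.0, 0.0      # discriminator parameters: D(x) = sigmoid(w*x + b)
lr, k, batch = 0.05, 1, 64
data_mean = 4.0      # pdata = N(4, 1); pz = N(0, 1)

history = []
for step in range(3000):
    # Step 1: fix G, take k ascent steps on V in (w, b)
    for _ in range(k):
        x = rng.standard_normal(batch) + data_mean    # real samples
        gz = rng.standard_normal(batch) + theta       # fake samples G(z)
        d_real, d_fake = sigmoid(w * x + b), sigmoid(w * gz + b)
        # d/dw log D(x) = (1-D(x))*x ;  d/dw log(1-D(Gz)) = -D(Gz)*Gz
        w += lr * np.mean((1 - d_real) * x - d_fake * gz)
        b += lr * np.mean((1 - d_real) - d_fake)
    # Step 2: fix D, take one ascent step on E[log D(G(z))] (non-saturating)
    gz = rng.standard_normal(batch) + theta
    d_fake = sigmoid(w * gz + b)
    theta += lr * np.mean((1 - d_fake) * w)           # since dG/dtheta = 1
    history.append(theta)
```

After training, θ should hover near the data mean of 4, at which point the distributions match and the discriminator's gradient signal vanishes. Swapping k or the learning rate shows how fragile this balance is, foreshadowing the nonconvergence discussion below.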


GAN Architecture

G and D are deep neural networks; training consists of backpropagation through both

[Figure: noise z is fed to the generator to produce x = G(z; θ(G)); real or generated x is fed to the discriminator D(x; θ(D)); gradients update the parameters θ(G) and θ(D)]


SGD for GAN

[Algorithm 1: minibatch stochastic gradient descent training of GANs — k gradient steps on the discriminator followed by one gradient step on the generator]


Theoretical Results

• Theorem 1:
  – The global minimum of the virtual training criterion C(G) = max_D V(G,D) is achieved if and only if pg = pdata
    • At that point C(G) achieves the value −log 4
• Convergence of Algorithm 1:
  – If G and D have enough capacity, and at each step of Algorithm 1 D is allowed to reach its optimum given G, and pg is updated so as to improve the criterion,
  – then pg converges to pdata
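For completeness, the −log 4 value follows by substituting the optimal discriminator D* into C(G) and rewriting the criterion in terms of the Jensen–Shannon divergence:

```latex
\begin{aligned}
C(G) &= \mathbb{E}_{x\sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\right]
      + \mathbb{E}_{x\sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\right] \\
     &= -\log 4 + \mathrm{KL}\!\left(p_{\text{data}} \,\Big\|\, \tfrac{p_{\text{data}}+p_g}{2}\right)
      + \mathrm{KL}\!\left(p_g \,\Big\|\, \tfrac{p_{\text{data}}+p_g}{2}\right) \\
     &= -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) \;\ge\; -\log 4
\end{aligned}
```

Equality holds iff pg = pdata, since the Jensen–Shannon divergence is nonnegative and zero only when the two distributions are equal.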


GAN and Convexity
• Main motivation for GAN: no approximation is needed
  – neither in inference nor in the partition function
• When max_d v(g,d) is convex in θg
  – (in the space of probability density functions)
  – the procedure is guaranteed to converge
• If g and d are neural nets and max_d v(g,d) is nonconvex
  – learning is difficult
• From min-max theory, mode collapse is caused by improper convergence to the optimum, due to the non-convexity and non-concavity of the objective function


GAN underfit: nonconvergence
• Simultaneous gradient descent on the two players' costs is not guaranteed to reach an equilibrium
  – Example: v(a,b) = ab
    • where Player 1 controls a and incurs cost ab, while Player 2 controls b and incurs cost −ab
• If each player makes small gradient steps, each reducing its own cost at the expense of the other,
  – then a and b go into a stable circular orbit, rather than arriving at the equilibrium point at the origin
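The orbit is easy to reproduce (a small illustration, not from the slides). With simultaneous steps of size η, each update multiplies the distance to the origin by √(1+η²), so the iterates actually spiral slowly outward instead of converging:

```python
import numpy as np

eta = 0.1                    # step size for both players
a, b = 1.0, 0.0              # start away from the equilibrium (0, 0)
radii = []
for _ in range(200):
    grad_a = b               # Player 1 minimizes  v(a,b) = ab:  dv/da = b
    grad_b = -a              # Player 2 minimizes -v(a,b):       d(-v)/db = -a
    a, b = a - eta * grad_a, b - eta * grad_b   # simultaneous update
    radii.append(np.hypot(a, b))                # distance to the equilibrium
```

The distance to the origin increases monotonically, confirming that simultaneous gradient descent circles the equilibrium rather than finding it.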


Proposed Solutions
1. Make the game no longer zero-sum
   – Same expected gradient as maximum likelihood learning
   – Does not work well in practice
2. Have G increase the probability that D makes a mistake, rather than aiming to decrease the probability that D makes the right prediction
   – Motivated by keeping derivatives large
• Stabilization of GAN learning remains an open problem
