
Deep Learning Srihari


Generative Adversarial Networks: Theory

Sargur N. Srihari
srihari@cedar.buffalo.edu


Topics in GAN
1. GAN Motivation
2. GAN Theory ← Discussed Here
   1. Generator and Discriminator
   2. Adversarial Modeling Framework
   3. GAN Objective and SGD
   4. Theoretical Results
   5. GAN Difficulties
3. Mode Collapse
4. Wasserstein GAN
5. GAN Variants


Generator Network
• To learn the generator's distribution pg over data x, we define a prior on input noise variables pz(z)
• Represent a mapping to data space as G(z; θg)
  – where G is a differentiable generator function represented by a multilayer perceptron with parameters θg

p(x,z) = p(x|z) p(z)

[Figure: Generator Network G(z; θg) — G(z) maps z, drawn from pz(z), to x, which is distributed according to pg = p(x|z)]


Differentiable Generator Functions
1. To sample from x ~ N(μ,Σ): draw z ~ N(0,I) and set x = g(z) = μ + Lz
   • where L is the Cholesky factor of Σ
     – Decomposition: Σ = LLᵀ where L is lower triangular
2. To sample from any p(x): draw z ~ U(0,1) and set x = g(z) = h⁻¹(z), where h is the cdf of p(x)
   – The resulting distribution is governed by the change-of-variables formula p(x) = pz(g⁻¹(x)) / |det(∂g/∂z)|
     • Denominator is the determinant of the Jacobian
   • Alternatively, define p(x|z) and marginalize over z: p(x) = ∫ p(x|z) pz(z) dz, e.g., by using a VAE
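The two constructions above can be sketched in NumPy (a minimal illustration added here, not from the slides; the Laplace distribution is chosen only because its cdf h inverts in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# 1. Sample x ~ N(mu, Sigma) via the Cholesky factor: x = g(z) = mu + L z
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)          # Sigma = L @ L.T, L lower triangular
z = rng.standard_normal((n, 2))        # z ~ N(0, I)
x = mu + z @ L.T                       # each row is one sample from N(mu, Sigma)

# 2. Sample from an arbitrary p(x) by inverting its cdf: x = h^{-1}(z), z ~ U(0,1)
# Example: standard Laplace, whose cdf h inverts in closed form
def h_inv(u):
    return np.sign(u - 0.5) * -np.log(1.0 - 2.0 * np.abs(u - 0.5))

u = rng.uniform(size=n)
x_lap = h_inv(u)                       # distributed as standard Laplace (var = 2)
```

Both maps are differentiable in their parameters, which is what lets gradients flow from D back into the generator.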


Generator and Discriminator Networks
1. Generator Network produces samples x = G(z; θ(g))
2. Discriminator Network
   1. Attempts to distinguish samples drawn from the training data from samples drawn from the generator
   2. Emits a probability value D(x; θ(d)) indicating the probability that x is a real training example rather than a fake sample drawn from the model


Adversarial Modeling Framework
• We also define a second multilayer perceptron D(x; θd) that outputs a single scalar
  – D(x) represents the probability that x came from the data rather than from pg


Training of D and G
1. We train D to maximize the probability of assigning the correct label to both training examples and samples from G
2. We simultaneously train G to minimize log(1−D(G(z)))
• In other words, D and G play a two-player minimax game with value function V(G,D):
  min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]


Minimax value explanation
1. D is trained to be accurate over real data by maximizing E_{x∼pdata(x)}[log D(x)]
   – Given a generated sample G(z), z∼pz(z), D(G(z)) is pushed close to zero by maximizing E_{z∼pz(z)}[log(1−D(G(z)))]
2. G is trained to increase D's probability for a fake example,
   – i.e., to minimize E_{z∼pz(z)}[log(1−D(G(z)))]
• Combined together, D and G are playing a minimax game in which we optimize:
  min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]
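A hedged numeric illustration (not from the slides): for a toy choice of pdata = N(0,1) and pg = N(μg,1), with D fixed at its pointwise optimum D* = pdata/(pdata+pg), a Monte Carlo estimate of V shows the value is exactly −log 4 when pg = pdata and larger otherwise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def npdf(x, mu):                       # density of N(mu, 1)
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def value(mu_g):
    """Monte Carlo estimate of V(G, D*) with pdata = N(0,1), pg = N(mu_g,1)."""
    x_real = rng.standard_normal(n)           # x ~ pdata
    x_fake = rng.standard_normal(n) + mu_g    # x ~ pg
    d_star = lambda x: npdf(x, 0.0) / (npdf(x, 0.0) + npdf(x, mu_g))
    return np.mean(np.log(d_star(x_real))) + np.mean(np.log(1 - d_star(x_fake)))

v_equal = value(0.0)   # pg = pdata: D* = 1/2 everywhere, V = -log 4
v_far   = value(3.0)   # pg far from pdata: D* is confident, V is much larger
```

This is the same −log 4 that reappears below as the global minimum of the training criterion.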


Optimal value of D
• To find the best value for D, consider
  L(G,D) = ∫x [pdata(x) log D(x) + pg(x) log(1−D(x))] dx
• Since we want the value of D(x) to maximize L(G,D), label x̃ = D(x), A = pdata(x), B = pg(x)
  – What is inside the integral is f(x̃) = A log x̃ + B log(1−x̃)
  – Setting the derivative df/dx̃ = A/x̃ − B/(1−x̃) to zero gives
    D*(x) = A/(A+B) = pdata(x) / (pdata(x) + pg(x))
  – Once G is trained to optimality, pg is very close to pdata
  – When pg = pdata, D*(x) becomes 1/2
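As a quick sanity check of this derivative argument (an illustration, not part of the slides), a grid scan of f(x̃) = A log x̃ + B log(1−x̃) recovers the maximizer A/(A+B):

```python
import numpy as np

A, B = 0.7, 0.2                          # pdata(x) and pg(x) at some fixed x
t = np.linspace(1e-6, 1 - 1e-6, 2_000_001)
f = A * np.log(t) + B * np.log(1 - t)    # the integrand as a function of t = D(x)
t_star = t[np.argmax(f)]                 # numeric maximizer; compare to A/(A+B)
```

Because f is strictly concave on (0,1), this stationary point is the unique maximum.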


Intuition in training of GAN

[Figure: the discriminative distribution D (dashed), the data-generating distribution pdata(x), and the generative distribution pg(G(z)); the goal is to make pdata and pg equal. Below, the domain from which z is sampled (uniformly) is mapped by the non-uniform mapping x = G(z): G contracts in regions of high density of pg and expands in regions of low density. In this panel pg is close to pdata.]


Update of D

D is trained to discriminate samples from data, converging to

D*(x) = pdata(x) / (pdata(x) + pg(x))

[Figure: the discriminative distribution D(x) after the update, shown against pdata(x) and pg(G(z))]


Update of G

After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely to be classified as data

[Figure: z mapped through G to x; the generative distribution pg(G(z)) shifts toward the data]


Final State

After several steps of training, if D and G have enough capacity, they will reach a point at which both cannot improve because pg = pdata.

The discriminator is then unable to differentiate between the two distributions, i.e., D(x) = 0.5.

[Figure: pg(G(z)) coincides with pdata(x); the discriminative distribution D(x) is flat at 1/2]

GAN Objective
• Value function of the minimax game played by G and D:
  min_G max_D V(D,G) = E_{x∼pX(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]
  – pX: data distribution, represented by samples; pG(z): model distribution, with z modeled as Gaussian
• We must implement the game using an iterative, numerical approach
  – Optimizing D to completion in the inner loop of training is computationally prohibitive and, on finite data sets, results in overfitting
  – Instead we alternate between k steps of optimizing D and one step of optimizing G


Interaction between G and D

• Weights are updated so as to
  – maximize the probability that any real data input x is classified as belonging to the real dataset, while
  – minimizing the probability that any fake image is classified as belonging to the real dataset
• In more technical terms, the loss function maximizes D(x) and minimizes D(G(z))


Alternating gradient updates

• Step 1: Fix G and perform a gradient step for the maximization:
  max_D E_{x∼pdata(x)}[log D(x)] + E_{z∼pz(z)}[log(1−D(G(z)))]
• Step 2: Fix D and perform a gradient step for the minimization:
  – In theory: min_G E_{z∼pz(z)}[log(1−D(G(z)))]
  – In practice: max_G E_{z∼pz(z)}[log D(G(z))], which provides stronger gradients early in training
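These two alternating steps can be sketched with a deliberately tiny, hypothetical model (an illustration, not the slides' setup): the generator is a pure shift G(z) = z + θ, the discriminator a single logistic unit D(x) = σ(wx + b), all gradients are written out by hand, and Step 2 uses the in-practice (non-saturating) objective:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.0          # generator parameter: G(z) = z + theta
w, b = 0.0, 0.0      # discriminator parameters: D(x) = sigmoid(w*x + b)
lr, k, batch = 0.05, 1, 64
data_mean = 4.0      # pdata = N(4, 1); pz = N(0, 1)

history = []
for step in range(3000):
    # Step 1: fix G, take k ascent steps on V in (w, b)
    for _ in range(k):
        x = rng.standard_normal(batch) + data_mean    # real samples
        gz = rng.standard_normal(batch) + theta       # fake samples G(z)
        d_real, d_fake = sigmoid(w * x + b), sigmoid(w * gz + b)
        # d/dw log D(x) = (1-D(x))*x ;  d/dw log(1-D(Gz)) = -D(Gz)*Gz
        w += lr * np.mean((1 - d_real) * x - d_fake * gz)
        b += lr * np.mean((1 - d_real) - d_fake)
    # Step 2: fix D, take one ascent step on E[log D(G(z))] (non-saturating)
    gz = rng.standard_normal(batch) + theta
    d_fake = sigmoid(w * gz + b)
    theta += lr * np.mean((1 - d_fake) * w)           # since dG/dtheta = 1
    history.append(theta)
```

After training, θ should hover near the data mean of 4, at which point the distributions match and the discriminator's gradient signal vanishes. Swapping k or the learning rate shows how fragile this balance is, foreshadowing the nonconvergence discussion below.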


GAN Architecture

G and D are deep neural networks; training consists of backpropagation through both

[Figure: noise z is fed to the generator to produce x = G(z; θ(G)); real or generated x is fed to the discriminator D(x; θ(D)); gradients update the parameters θ(G) and θ(D)]


SGD for GAN

[Algorithm 1: minibatch stochastic gradient descent training of GANs — k gradient steps on the discriminator followed by one gradient step on the generator]


Theoretical Results

• Theorem 1:
  – The global minimum of the virtual training criterion C(G) = max_D V(G,D) is achieved if and only if pg = pdata
    • At that point C(G) achieves the value −log 4
• Convergence of Algorithm 1:
  – If G and D have enough capacity, and at each step of Algorithm 1 D is allowed to reach its optimum given G, and pg is updated so as to improve the criterion,
  – then pg converges to pdata
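For completeness, the −log 4 value follows by substituting the optimal discriminator D* into C(G) and rewriting the criterion in terms of the Jensen–Shannon divergence:

```latex
\begin{aligned}
C(G) &= \mathbb{E}_{x\sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\right]
      + \mathbb{E}_{x\sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\right] \\
     &= -\log 4 + \mathrm{KL}\!\left(p_{\text{data}} \,\Big\|\, \tfrac{p_{\text{data}}+p_g}{2}\right)
      + \mathrm{KL}\!\left(p_g \,\Big\|\, \tfrac{p_{\text{data}}+p_g}{2}\right) \\
     &= -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) \;\ge\; -\log 4
\end{aligned}
```

Equality holds iff pg = pdata, since the Jensen–Shannon divergence is nonnegative and zero only when the two distributions are equal.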


GAN and Convexity
• Main motivation for GAN: no approximation is needed
  – neither in inference nor in the partition function
• When max_d v(g,d) is convex in θg
  – (in the space of probability density functions)
  – the procedure is guaranteed to converge
• If g and d are neural nets and max_d v(g,d) is nonconvex
  – learning is difficult
• From min-max theory, mode collapse is caused by improper convergence to the optimum, due to the non-convexity and non-concavity of the objective function


GAN underfit: nonconvergence
• Simultaneous gradient descent on the two players' costs is not guaranteed to reach an equilibrium
  – Example: v(a,b) = ab
    • where Player 1 controls a and incurs cost ab, while Player 2 controls b and incurs cost −ab
• If each player makes small gradient steps, each reducing its own cost at the expense of the other,
  – then a and b go into a stable circular orbit, rather than arriving at the equilibrium point at the origin
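The orbit is easy to reproduce (a small illustration, not from the slides). With simultaneous steps of size η, each update multiplies the distance to the origin by √(1+η²), so the iterates actually spiral slowly outward instead of converging:

```python
import numpy as np

eta = 0.1                    # step size for both players
a, b = 1.0, 0.0              # start away from the equilibrium (0, 0)
radii = []
for _ in range(200):
    grad_a = b               # Player 1 minimizes  v(a,b) = ab:  dv/da = b
    grad_b = -a              # Player 2 minimizes -v(a,b):       d(-v)/db = -a
    a, b = a - eta * grad_a, b - eta * grad_b   # simultaneous update
    radii.append(np.hypot(a, b))                # distance to the equilibrium
```

The distance to the origin increases monotonically, confirming that simultaneous gradient descent circles the equilibrium rather than finding it.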


Proposed Solutions
1. Make the game no longer zero-sum
   – Same expected gradient as maximum likelihood learning
   – Does not work well in practice
2. Have G increase the probability that D makes a mistake, rather than aiming to decrease the probability that D makes the right prediction
   – Motivated by keeping derivatives large
• Stabilization of GAN learning remains an open problem
