Generative Adversarial Networks: Theory
Srihari, CSE676, 22.2 GAN Theory (2019.5.11)
TRANSCRIPT
Deep Learning Srihari
Topics in GAN
1. GAN Motivation
2. GAN Theory (discussed here)
   1. Generator and Discriminator
   2. Adversarial Modeling Framework
   3. GAN Objective and SGD
   4. Theoretical Results
   5. GAN Difficulties
3. Mode Collapse
4. Wasserstein GAN
5. GAN Variants
Generator Network
• To learn the generator's distribution pg over data x, we define a prior on input noise variables pz(z)
• Represent a mapping to data space as G(z; θg)
  – where G is a differentiable generator function represented by a multilayer perceptron with parameters θg

[Figure: the generator network G(z; θg) maps z ~ pz(z) to x, which is distributed according to pg = p(x|z); the joint factorizes as p(x,z) = p(x|z)p(z)]
Differentiable Generator Functions
1. To sample from x ~ N(μ,Σ): draw z ~ N(0,I) and set x = g(z) = μ + Lz
   • where L is the Cholesky factor of Σ
     – Decomposition: Σ = LLᵀ where L is lower triangular
2. To sample from any p(x): draw z ~ U(0,1) and set x = g(z) = h⁻¹(z), where h(x) is the cdf of p(x)
   – The resulting distribution is governed by the change-of-variables formula p_x(x) = p_z(z) / |det(∂g/∂z)|
     • Denominator is the determinant of the Jacobian of g
   • Alternatively, define p(x|z) and marginalize z, e.g., by using a VAE
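Both sampling recipes above can be sketched in NumPy. This is a minimal illustration; the Gaussian parameters and the Exponential(1) target for the inverse-cdf case are arbitrary choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Sample x ~ N(mu, Sigma) via the Cholesky factor: x = g(z) = mu + L z
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)      # Sigma = L L^T, L lower triangular
z = rng.standard_normal((10000, 2))
x = mu + z @ L.T                   # each row is one sample of x

# 2. Sample from a chosen p(x) by inverting its cdf h: x = h^{-1}(z), z ~ U(0,1)
# Example target: Exponential(1), cdf h(x) = 1 - exp(-x), so h^{-1}(z) = -log(1 - z)
u = rng.uniform(size=10000)
x_exp = -np.log(1.0 - u)

print(x.mean(axis=0))   # close to mu
print(np.cov(x.T))      # close to Sigma
print(x_exp.mean())     # close to 1, the mean of Exponential(1)
```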
Generator and Discriminator Networks
1. Generator network produces samples x = G(z; θ(g))
2. Discriminator network
   1. Attempts to distinguish samples drawn from the training data from samples drawn from the generator
   2. Emits a probability value
      • D(x; θ(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model
Adversarial Modeling Framework
• We also define a second multilayer perceptron D(x; θd) that outputs a single scalar
  – D(x) represents the probability that x came from the data rather than from pg
Training of D and G
1. We train D to maximize the probability of assigning the correct label to both training examples and samples from G
2. We simultaneously train G to minimize log(1−D(G(z)))
• In other words, D and G play a two-player minimax game with value function V(G,D):
  min_G max_D V(D,G) = 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
Minimax value explanation
1. D is made accurate on real data by maximizing 𝔼x∼pdata(x)[log D(x)]
   – Given a fake sample G(z), z∼pz(z), D(G(z)) is pushed close to zero by maximizing 𝔼z∼pz(z)[log(1−D(G(z)))]
2. G is trained to increase D's probability for a fake example,
   – i.e., to minimize 𝔼z∼pz(z)[log(1−D(G(z)))]
• Combined together, D and G play a minimax game in which we optimize:
  min_G max_D V(D,G) = 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
Optimal value of D
• To find the best value for D, consider
  L(G,D) = ∫x (pdata(x) log D(x) + pg(x) log(1−D(x))) dx
  – Since we want the value of D(x) to maximize L(G,D), label x̃ = D(x), A = pdata(x), B = pg(x)
• What is inside the integral is: f(x̃) = A log x̃ + B log(1−x̃)
  – Setting the derivative to zero: df/dx̃ = A/x̃ − B/(1−x̃) = 0, giving x̃ = A/(A+B), i.e.,
    D∗(x) = pdata(x) / (pdata(x) + pg(x))
  – Once G is trained to optimality, pg is very close to pdata
  – When pg = pdata, D∗(x) becomes 1/2
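The pointwise maximization above can be checked numerically. A small sketch, with arbitrary example values for A = pdata(x) and B = pg(x):

```python
import numpy as np

# At a fixed x, the integrand is f(d) = A log d + B log(1 - d),
# with A = pdata(x) and B = pg(x) (arbitrary example values here)
A, B = 0.7, 0.2

# Grid search over d = D(x) in (0, 1)
d = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
f = A * np.log(d) + B * np.log(1.0 - d)

d_best = d[np.argmax(f)]
print(d_best, A / (A + B))   # both ~ 0.7778, matching D*(x) = A / (A + B)
```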
Intuition in training of GAN

[Figure: the discriminative distribution D, the data-generating distribution pdata(x), and the generative distribution pg(G(z)); the goal is to make pdata and pg equal. z is sampled (uniformly) from its domain and mapped non-uniformly to x = G(z); G contracts in regions of high density of pg and expands in regions of low density. Here pg is close to pdata(x).]
Update of D
• D is trained to discriminate samples from data, converging to
  D∗(x) = pdata(x) / (pdata(x) + pg(x))

[Figure: the discriminative distribution D(x) after the update, shown against the data-generating distribution pdata(x) and the generative distribution pg(G(z))]
Update of G
• After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely to be classified as data

[Figure: the mapping z → x = G(z) shifts the generative distribution pg(G(z)) toward the data]
Final State
• After several steps of training, if D and G have enough capacity, they will reach a point at which both cannot improve because pg = pdata
• The discriminator is unable to differentiate between the two distributions, i.e., D(x) = 1/2

[Figure: pg(G(z)) coincides with pdata(x), and the discriminative distribution D(x) is flat at 1/2]
GAN Objective
• Value function of the minimax game played by G and D:
  min_G max_D V(D,G) = 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
  – pdata: data distribution, represented by samples
  – pg: model distribution, with the noise z modeled as Gaussian
• We must implement the game using an iterative, numerical approach
  – Optimizing D to completion in the inner loop of training is computationally prohibitive and, on finite data sets, results in overfitting
  – Instead we alternate between k steps of optimizing D and one step of optimizing G
Interaction between G and D
• Weights are updated so as to
  – maximize the probability that any real data input x is classified as belonging to the real dataset, while
  – minimizing the probability that any fake image is classified as belonging to the real dataset
• In more technical terms,
  – the loss function maximizes D(x), and it also minimizes D(G(z))
Alternating gradient updates
• Step 1: Fix G and perform a gradient step for the maximization:
  max_D 𝔼x∼pdata(x)[log D(x)] + 𝔼z∼pz(z)[log(1−D(G(z)))]
• Step 2: Fix D and perform a gradient step for the minimization:
  – In theory: min_G 𝔼z∼pz(z)[log(1−D(G(z)))]
  – In practice: max_G 𝔼z∼pz(z)[log D(G(z))], which provides stronger gradients early in training, when D confidently rejects samples from G
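The "in practice" variant exists because of gradient magnitudes. Writing D(G(z)) = σ(s) for the discriminator's pre-activation s, the derivative of log(1−σ(s)) with respect to s is −σ(s), while the derivative of log σ(s) is 1−σ(s). A small sketch comparing the two when D confidently rejects a fake (D(G(z)) near 0; the value s = −5 is an arbitrary illustration):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# D(G(z)) = sigmoid(s); early in training D confidently rejects fakes, so s is very negative
s = -5.0
D_fake = sigmoid(s)                     # ~ 0.0067

# |d/ds log(1 - sigmoid(s))| = sigmoid(s): the "in theory" loss saturates
grad_saturating = D_fake
# |d/ds log sigmoid(s)| = 1 - sigmoid(s): the "in practice" loss stays large
grad_nonsaturating = 1.0 - D_fake

print(grad_saturating, grad_nonsaturating)   # tiny vs. close to 1
```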
GAN Architecture
• G and D are deep neural networks; training consists of backpropagation through both
  – the generator x = G(z; θ(G)) and the discriminator D(x; θ(D)), with parameters θ(G) and θ(D)
SGD for GAN
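The alternating SGD procedure (k discriminator steps per generator step) can be sketched on a toy problem. Everything below — the 1-D Gaussian data, the one-parameter generator x = θ + z, the logistic discriminator, and all hyperparameters — is an illustrative assumption, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration): real data ~ N(2, 1);
# generator G(z) = theta + z with z ~ N(0, 1), so pg = N(theta, 1);
# discriminator D(x) = sigmoid(w*x + b), gradients derived by hand.
theta, w, b = 0.0, 0.1, 0.0
lr_d, lr_g, k, m = 0.05, 0.05, 3, 128

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

for _ in range(2000):
    # k steps ascending D's objective: E[log D(x)] + E[log(1 - D(G(z)))]
    for _ in range(k):
        x_real = 2.0 + rng.standard_normal(m)
        x_fake = theta + rng.standard_normal(m)
        d_real = sigmoid(w * x_real + b)
        d_fake = sigmoid(w * x_fake + b)
        w += lr_d * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
        b += lr_d * (np.mean(1 - d_real) - np.mean(d_fake))
    # one step on G, non-saturating form: ascend E[log D(G(z))]
    z = rng.standard_normal(m)
    d_fake = sigmoid(w * (theta + z) + b)
    theta += lr_g * np.mean((1 - d_fake) * w)

print(theta)   # should have moved from 0 toward the data mean 2
```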
Theoretical Results
• Theorem 1:
  – The global minimum of the virtual training criterion C(G) = max_D V(G,D) is achieved if and only if pg = pdata
  • At that point C(G) achieves the value −log 4
• Convergence of Algorithm 1:
  – If G and D have enough capacity, and at each step of Algorithm 1, D is allowed to reach its optimum given G, and pg is updated so as to improve the criterion,
  – then pg converges to pdata
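Theorem 1's value can be checked numerically: when pg = pdata, the optimal discriminator is D∗(x) = 1/2 everywhere, so C(G) = ∫ (pdata(x) log ½ + pg(x) log ½) dx = −log 4. A sketch with pdata = pg = N(0,1) discretized on a grid (the grid and the Gaussian are illustrative choices):

```python
import numpy as np

# Discretize pdata = pg = N(0, 1) on a grid (illustrative choice)
x = np.linspace(-8, 8, 200001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # pdata(x)
q = p.copy()                                    # pg(x) = pdata(x)

d_star = p / (p + q)                            # optimal discriminator: 1/2 everywhere
c_g = np.sum((p * np.log(d_star) + q * np.log(1 - d_star)) * dx)

print(c_g, -np.log(4))    # both ~ -1.3863
```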
GAN and Convexity
• Main motivation for GAN: no approximation is needed
  – neither in inference nor in the partition function
• When max_D V(G,D) is convex in θg
  – (in the space of probability density functions)
  – the procedure is guaranteed to converge
• If G and D are neural nets and max_D V(G,D) is nonconvex
  – learning is difficult
• From min-max theory, mode collapse is caused by improper convergence to the optimum, due to the non-convexity and non-concavity of the objective function
GAN underfitting: non-convergence
• Simultaneous gradient descent on the two players' costs is not guaranteed to reach an equilibrium
  – Example: v(a,b) = ab
  • where Player 1 controls a and incurs cost ab, while Player 2 controls b and incurs cost −ab
• If each player makes small gradient steps, each reducing its own cost at the expense of the other,
  – then a and b go into a stable circular orbit, rather than arriving at the equilibrium point at the origin
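The v(a,b) = ab example can be simulated directly. With Player 1 descending its cost ab (a ← a − η·b) and Player 2 descending its cost −ab (b ← b + η·a), the iterates rotate around the origin instead of converging to it; with finite step size η they even spiral slightly outward, since each simultaneous update multiplies a² + b² by (1 + η²). A small sketch:

```python
import math

a, b, eta = 1.0, 0.0, 0.01
r0 = math.hypot(a, b)   # distance from the equilibrium at the origin

for _ in range(10000):
    grad_a = b           # d(ab)/da, descended by Player 1
    grad_b = -a          # d(-ab)/db, descended by Player 2
    a, b = a - eta * grad_a, b - eta * grad_b   # simultaneous update

r = math.hypot(a, b)
print(r0, r)   # the radius has grown, not shrunk toward 0
```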
Proposed Solutions
1. Make the game no longer zero-sum
   – Same expected gradient as maximum likelihood learning
   – Does not work well in practice
2. Have G increase the probability that D makes a mistake, rather than aiming to decrease the probability that D makes the right prediction
   – Motivated by keeping derivatives large
• Stabilization of GAN learning remains an open problem