
An Analysis of Generative Adversarial Networks Toward Addressing Mode Collapse

Tuan Dinh
UW-Madison, Madison, WI 53705
[email protected]

Kurtis Liu
UW-Madison, Madison, WI 53705
[email protected]

Varun Sah
UW-Madison, Madison, WI 53705
[email protected]

Abstract

Generative adversarial networks (GANs) are increasingly being recognized as a powerful class of generative models for learning to model real data distributions. Despite achieving state-of-the-art results on a variety of generative tasks, GANs are still prone to mode collapse, wherein they learn only a few modes of the true distribution. To address this, we introduce a mean-matching regularizer into the generator objective of the gradient-penalized Wasserstein GAN, christened MeanWGAN-GP, via a theoretical and systematic triangulated analysis of the GAN system. Our experiments illustrate that the proposed model (MeanWGAN-GP) covers all modes and attains a small Kullback–Leibler divergence on simple mixture-of-Gaussians datasets, and outperforms recent approaches in terms of the number of modes covered on Stacked MNIST.

1 Introduction

Generative Adversarial Networks (GANs) are an emerging and powerful approach to learning probability distributions over complicated manifolds. GANs combine the two powerful concepts of adversarial training and neural networks by posing the problem of learning real (unlabeled) data distributions as an adversarial game between two networks: a generative network (generator), which generates samples intended to be indistinguishable from real samples, and a discriminative network (discriminator), which differentiates between generated and real samples.

Since their introduction in 2014 [10], GANs have shown impressive results in a myriad of applications, such as image and text generation [13, 4], text-to-image conversion [12], and image completion [1]; however, they remain remarkably difficult to train. It has been observed that they often suffer from mode collapse [5, 20], in which the learned generator generates only a few modes of a multi-modal distribution. Much of the recent work [20, 17, 5, 21] has been devoted to addressing this issue: AdaGAN [22] uses multiple GANs to acquire multiple sub-modes; Improved GANs [20] introduces minibatch discrimination and feature matching to directly encourage diversity; UnrolledGAN [17] anticipates the counter-play of the discriminator by allowing the generator to unroll the discriminator's updates; and WGAN [3] applies the Wasserstein distance to stabilize GAN training, implicitly reducing mode missing. Recently, RegGAN [5] and VEEGAN [21] brought the idea of the Variational Auto-Encoder [14] to reverse the action of the generator, where the reconstruction mismatch is treated as an indicator of mode collapse. Although each approach succeeds in avoiding mode missing to some extent, each has certain shortcomings, such as long training time, complicated gradient computation, or failure to cover complicated data.

Thus, mode dropping within GANs remains an open problem [6]. This can be attributed to the following reasons: (1) with the exception of a few works [20, 6], the majority of approaches attack mode collapse at a single level only (objective function, optimization, or architecture); (2) most proposed approaches result from empirical heuristics, with few underlying theories to explain mode collapse clearly, which is a bane of deep learning research in general.

Mathematical Foundations of Machine Learning (ECE/CS 761 Fall 2017), Madison, WI, USA.


Inspired by the limited amount of foundational work and the need to avoid 'alchemical approaches' in machine learning (Ali Rahimi's Test of Time award speech, NIPS 2017), this paper presents a well-rounded theoretical and empirical analysis of mode collapse in GANs. We start by providing a brief overview of GANs (Sec 2), before proceeding to the analysis of mode collapse and a triangulated technique for handling it (Sec 3, 4). Finally, we propose our holistic method MeanWGAN-GP (Sec 5), and present a comparative analysis (Sec 6), in terms of the ability to learn the true distribution, between MeanWGAN-GP and previously published GANs, including specialized GANs aimed at tackling mode collapse such as VEEGAN [21], RegGAN [5], and WGAN [3].

Our Main Contributions:

(1) We provide a foundational analysis of mode collapse in GANs, starting with a simple pedagogical setting that mathematically illustrates the problem of mode collapse, followed by a systematic look at theoretical techniques for handling mode collapse at the computational (objective function), algorithmic (optimization), and implementation (engineering) levels.

(2) We introduce the idea of mean-matching regularization on the generator objective as an addition to the gradient-penalized Wasserstein objective, coupled with the use of mixture-of-Gaussians noise for training. To the best of our knowledge, our study is the first work imposing moment matching on the gradient-penalized Wasserstein GAN and formally analyzing it with respect to mode collapse.

Having described the motivation behind this study as well as its significance, in the next section we briefly provide an overview of GANs and their properties.

2 Overview of Generative Adversarial Networks

Generative adversarial networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented as a system of two neural networks contesting with each other in a zero-sum game framework.

Figure 1: GAN architecture. Noise z is fed to the generator G, with G(z) being the generated data. The discriminator D tries to distinguish between G(z) and the real data X, and D's loss serves as the training signal for updating both G and D.

The goal of GANs is to train a generator network G(z; γ) that produces samples from the real distribution p_r(x) by transforming noise vectors z as x = G(z; γ). The training signal for G is provided by a discriminator network D(x; θ) that is trained to distinguish between samples from the generator distribution p_g(x) and the real distribution p_r(x). The generator network G is then trained to fool the discriminator into accepting its outputs as real (Fig 1).

On one hand, for the discriminator D's decisions over real data to be accurate, E_{x∼p_r(x)}[log D(x)] needs to be maximized. At the same time, given a fake sample G(z), where z ∼ p_z(z), the discriminator D is expected to output a probability D(G(z)) close to zero, which is achieved by maximizing E_{z∼p_z(z)}[log(1 − D(G(z)))].

On the other hand, the generator G is trained with the objective of increasing the chance that D assigns a high probability to a fake example, i.e., by minimizing E_{z∼p_z(z)}[log(1 − D(G(z)))].

Combining both perspectives, GANs constitute D and G playing a minimax game:

\min_G \max_D L(D, G) = \mathbb{E}_{x\sim p_r(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]
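For concreteness, the following is a minimal sketch of these two training losses, assuming PyTorch and hypothetical modules G and D, where D outputs an unnormalized logit; it is an illustration of the minimax game above, not the implementation used in this paper.

```python
# Minimal sketch of the minimax losses above (illustrative only).
# Assumptions: PyTorch; G maps noise to samples; D maps samples to a scalar logit.
import torch

def discriminator_loss(D, G, x_real, z, eps=1e-8):
    # Maximize E[log D(x)] + E[log(1 - D(G(z)))]; we return the negation to minimize.
    d_real = torch.sigmoid(D(x_real))
    d_fake = torch.sigmoid(D(G(z).detach()))      # detach: do not update G on this step
    return -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()

def generator_loss(D, G, z, eps=1e-8):
    # Saturating form from the minimax game: minimize E[log(1 - D(G(z)))].
    d_fake = torch.sigmoid(D(G(z)))
    return torch.log(1.0 - d_fake + eps).mean()
```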

Vanilla GAN uses the Jensen–Shannon (JS) divergence to measure similarity between probability distributions. It has the favorable properties of being symmetric, being bounded within [0, 1], and being smoother than the asymmetric KL divergence used in classical methods. The loss function can be written as

L(G, D) = \int_x \big(p_r(x) \log D(x) + p_g(x) \log(1 - D(x))\big)\, dx

For D to attain its optimal value D^*,

\frac{\partial L(G, D)}{\partial D(x)} = \frac{p_r(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0 \;\Rightarrow\; D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)} \in [0, 1]

Once the generator is trained to its optimum, p_g ≈ p_r. We can see that when p_g = p_r, D^*(x) = 1/2, i.e., the discriminator cannot distinguish between real and generated data.


3 Mode Collapse Analysis

Mode collapse is said to occur when the generator outputs samples from only a single mode, rather than from all sub-groups of the true distribution, since it is never directly incentivized to cover all modes. In such a scenario, the generator exhibits very poor diversity among generated samples, which limits the usefulness of the learned GAN. For instance, consider a GAN trying to generate data from a bimodal distribution: the generator can learn to simply output realistic samples from a single mode to deceive the discriminator. In this setting, mode collapse can occur in a cyclic fashion as follows:

1. The generator learns that it can trick the discriminator into thinking it is outputting realistic data by producing values close to Class-A.

2. The discriminator counters by learning that all Class-B data presented to it are real (since none are produced by the generator), and essentially guesses whether Class-A data are real or fake, since they are indistinguishable.

3. The generator exploits the discriminator by switching modes to produce values close to Class-B data instead, abandoning the Class-A mode.

4. The discriminator now assumes that all Class-B data are fake and all Class-A data are real. The vicious cycle repeats.

A Simple Model Showing Mode Collapse

With the above intuition in mind, we now present a numeric model that exhibits mode collapse. Let (θ, γ) be the parameters of the discriminator and the generator, respectively, and let the loss be f(θ, γ) = E_x[log D_θ(x)] + E_z[log(1 − D_θ(G_γ(z)))]. If we can make the loss function look like Figure 2, we obtain a looping update step in which the model suffers from a mode-collapse loop and never converges. That is, our goal is to find

\mathbb{E}_x[\log D_\theta(x)] + \mathbb{E}_z[\log(1 - D_\theta(G_\gamma(z)))] = e^{-(\theta^2 + \gamma^2)} \sin\theta \sin\gamma

Figure 2: Mode collapse illustration. Gradient descent never converges to the Nash equilibrium on this loss function. Blue arrows represent the update steps of γ and θ.
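As a numerical companion to Figure 2 (an illustration we add here, not part of the paper's construction), the sketch below runs simultaneous gradient ascent in θ and descent in γ on g(θ, γ) = e^{-(θ²+γ²)} sin θ sin γ; near the origin the coupled updates rotate around the Nash equilibrium rather than settling on it.

```python
# Sketch: simultaneous gradient ascent (theta) / descent (gamma) on
# g(theta, gamma) = exp(-(theta^2 + gamma^2)) * sin(theta) * sin(gamma).
# Near the origin g ~ theta * gamma, so the coupled updates circle around
# (0, 0) instead of settling at the Nash equilibrium, mirroring Figure 2.
import numpy as np

def grad_g(theta, gamma):
    e = np.exp(-(theta**2 + gamma**2))
    d_theta = e * np.sin(gamma) * (np.cos(theta) - 2.0 * theta * np.sin(theta))
    d_gamma = e * np.sin(theta) * (np.cos(gamma) - 2.0 * gamma * np.sin(gamma))
    return d_theta, d_gamma

theta, gamma, lr = 0.5, 0.5, 0.1
for step in range(2001):
    d_theta, d_gamma = grad_g(theta, gamma)
    theta, gamma = theta + lr * d_theta, gamma - lr * d_gamma  # ascent / descent
    if step % 400 == 0:
        print(f"step {step:4d}: theta = {theta:+.3f}, gamma = {gamma:+.3f}")
```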

Consider the following single-parameter generator

G_\gamma(z) = \begin{pmatrix} \gamma^2 & \gamma & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & \gamma^2 & \gamma & 1 \end{pmatrix} z,

and discriminator D_\theta(x) = (\theta^2, \theta, 1)\, x. We have the following loss function:

m\left(\mathbb{E}_x[\log D_\theta(x)] + \mathbb{E}_z[\log(1 - D_\theta(G_\gamma(z)))]\right)
= \log \prod_{i=1}^{n} (\theta^2 x_{i2} + \theta x_{i1} + 1)^{m/n} \prod_{j=1}^{m} (\theta^2\gamma^2 z_{j6} + \gamma^2\theta z_{j5} + \gamma\theta^2 z_{j4} + \gamma\theta z_{j3} + \theta z_{j2} + \gamma z_{j1} + z_{j0})

Now consider the Taylor expansion of g(\theta, \gamma) = e^{-(\theta^2 + \gamma^2)} \sin\theta \sin\gamma:

g(\theta, \gamma) = \sum_{n=0}^{N} \frac{1}{n!} \sum_{k+l=n} \frac{\partial^n g}{\partial^k\theta\, \partial^l\gamma}(0, 0)\, \theta^k \gamma^l \;+\; \sum_{k+l=N+1} \frac{1}{(N+1)!} \frac{\partial^{N+1} g}{\partial^k\theta\, \partial^l\gamma}(0, 0)\, e_\theta^k e_\gamma^l

We consider the region \theta^2 + \gamma^2 \le 10. On this set, we can find N large enough such that for all \theta, \gamma,

\frac{1}{(N+1)!} \sum_{k+l=N+1} \frac{\partial^{N+1} g}{\partial^k\theta\, \partial^l\gamma}(0, 0)\, e_\theta^k e_\gamma^l \le \varepsilon

(the Taylor series converges for any finite radius).


Discarding the remainder, we get

g'(\theta, \gamma) = \sum_{n=0}^{N} \frac{1}{n!} \sum_{k+l=n} \frac{\partial^n g}{\partial^k\theta\, \partial^l\gamma}(0, 0)\, \theta^k \gamma^l

By factorization, we can always obtain a cubic expansion of g'(\theta, \gamma),

g'(\theta, \gamma) = \prod_{j=1}^{N/2} (\theta^2\gamma^2 a_{j6} + \gamma^2\theta a_{j5} + \gamma\theta^2 a_{j4} + \gamma\theta a_{j3} + \theta a_{j2} + \gamma a_{j1} + a_{j0})

By choosing the training data x_i = 0 and the noise z_j to be the solutions of the a_j above, we have constructed a loss function with looping update steps. This results in mode collapse.

Next, we propose methods to address mode collapse, such that our previous construction of the GANwould converge.

4 Triangulated Analysis of Mode Collapse Solution

In this section, we analyze solutions for mode collapse at three different levels, addressing the integrated nature of the problem by setting a meaningful GAN objective, modifying optimization techniques, and altering implementation features.

4.1 Main Objective Function

A reason behind GANs' big success is switching the loss function from the asymmetric Kullback–Leibler (KL) divergence to the Jensen–Shannon (JS) divergence, whose properties were inspected in Section 2. However, [9] proves that in this typical case the Jensen–Shannon divergence does not serve as a useful criterion for choosing the appropriate generated distribution, and [2] notes that the JS distance does not depend on the generated distribution whenever its dimension is smaller than that of the true distribution; both issues are readily addressed by using the Wasserstein distance. Furthermore, the WGAN paper [3] shows that the Wasserstein distance has many advantages: the loss correlates with the quality of the generated samples, and it provably addresses the vanishing-gradient problem directly. Hence, the WGAN objective is a reasonable choice for the main objective function of GANs. The computable (dual) form of the Wasserstein distance, with the critic constrained to be K-Lipschitz continuous, is

L(p_r, p_g) = W(p_r, p_g) = \max_{\gamma \in \Gamma}\; \mathbb{E}_{x\sim p_r}[f_\gamma(x)] - \mathbb{E}_{z\sim p_z(z)}[f_\gamma(g_\theta(z))]

4.2 Optimization and Regularization

The ultimate goal of GANs is to learn the true distribution, not merely to optimize the loss [8]. While stochastic gradient descent still proves its efficiency [18], improving the optimization algorithm can be helpful but would not by itself fix GANs' issues. Hence, rather than modifying the optimization algorithm, we simply leverage the Adam optimizer, an SGD-style approach widely known to be effective for deep learning tasks. In our view, the focus should be on robustness, which can stabilize the unstable GAN system and, indirectly, reduce the mode-dropping problem. Since regularization is a common way to strengthen the robustness of an algorithm, we propose the use of (1) minibatch moment matching, together with (2) the gradient penalty, to regularize GANs.

4.2.1 Minibatch Moment Matching

Intuitively, for the generated distribution to match the true distribution, their statistical measures should be equivalent. Furthermore, if pure gradient descent is used, the optimizer moves towards the larger modes, making minor modes less likely to appear in later iterations. Minibatch moment matching immediately provides a large reward when a new mode is generated. For reasons of computational complexity, we match the empirical mean of every minibatch, which is:

M = \sqrt{\mathbb{E}[X] - \mathbb{E}[G(Z)]}

Proposition: the KL divergence is bounded in terms of M.

Assuming strong bounds on the probabilities, N_0 \ge q(x), p(x) \ge \varepsilon_0 and \|x\| \le A, we can bound the KL divergence in terms of M_0 = \frac{1}{N}\sum_i x_i - \frac{1}{N}\sum_i g(z_i). (The condition |x| > 1 can be achieved by shifting the dataset.)

|p(x) - q(x)| \le \frac{\sum_x (p(x) - q(x))}{A^d} \le \varepsilon
\;\Longrightarrow\; KL(p, q) = \sum_x p(x)\,(\log p(x) - \log q(x)) \le \frac{\varepsilon}{C(A, N_0, \varepsilon_0)}

The above shows that moment matching also decreases the KL divergence. We now add the consideration of minibatch training. Instead of E[X] we have the sample mean \frac{1}{N}\sum_i x_i; this sampling variability is an advantage of minibatch training, since each batch represents a combination of different modes. But combined with mean matching directly, it yields an unstable gradient that depends heavily on the samples drawn at each iteration. Consider the following delta function:

\delta(x, z) = \begin{cases} 1, & \text{if } x, z \text{ share modes} \\ 0, & \text{otherwise} \end{cases}

If we could use the update step \delta(x, z)\, \nabla M_0(x, z), the model would learn from correct samples and not be distracted by wrong outputs. But unsupervised learning does not tell us the modes of x. To solve this problem, we use M_0 itself as a similarity measure between x and z. Using the term M = \sqrt{M_0} fixes the update: the derivative \frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}} decreases as x grows, which mimics the effect of \delta(x, z), giving less penalty when the modes do not agree much, and providing more stability. This helps address mode collapse by encouraging new modes to be created.
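A minimal sketch of this regularizer, assuming PyTorch and interpreting M as the square root of the norm of the minibatch-mean difference (the formula above writes the square root of E[X] − E[G(Z)] without an explicit norm):

```python
# Sketch of the minibatch mean-matching term M (assumption: square root of
# the norm of the batch-mean difference; eps keeps the sqrt differentiable
# when the means already match).
import torch

def mean_matching_penalty(x_real, x_fake, eps=1e-8):
    mean_diff = x_real.mean(dim=0) - x_fake.mean(dim=0)   # empirical minibatch means
    return torch.sqrt(mean_diff.norm() + eps)             # sqrt damps the gradient for large mismatches
```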

4.2.2 Gradient Regularization

Returning to our objective, WGAN suffers from a problem: maintaining the Lipschitz constraint by weight clipping is a poor strategy that raises many issues in training. The two-sided gradient penalty [11] is a good solution that tackles these problems and strengthens the robustness of GANs:

\mathbb{E}_{x\sim p(x),\, z\sim p(z)}\big[(\|\nabla f(\alpha x + (1-\alpha)\, g(z))\| - 1)^2\big] \quad (1)

= \mathbb{E}_{\hat{x}\sim \tau}\big[(\|\nabla f(\hat{x})\|_2 - 1)^2\big] \quad (2)

Intuitively, WGAN-GP gets rid of weight clipping by constraining the gradients at these interpolated points to have a norm of 1, which enforces the Lipschitz constraint. Furthermore, the training scheme requires optimizing the discriminator close to its optimum before optimizing the generator, hence avoiding the problem of simultaneous gradient descent (as pointed out in Section 2 and [16]), which leads to cyclic behavior of a non-conservative vector field.
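A minimal sketch of this two-sided penalty, assuming PyTorch and a critic D mapping a batch of samples to scalars (an illustration consistent with [11], not their reference code):

```python
# Sketch of the two-sided gradient penalty: penalize (||grad of D at random
# interpolates between real and fake samples|| - 1)^2.
import torch

def gradient_penalty(D, x_real, x_fake):
    alpha = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (alpha * x_real + (1.0 - alpha) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()
```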

Adding regularization also makes the model converge faster. When following an update vector field, the quantity \int_{\phi=0}^{2\pi} u_\phi \cdot \nabla f \, d\phi is positive when the path diverges and negative when the path converges. By using the L2 norm as regularization, a negative divergence is added to the vector field. Using the divergence theorem, -\int_{\phi=0}^{2\pi} u_\phi \cdot v \, d\phi = -\int \nabla \cdot f \, dV = -\int 2 \, dV \le 0. In fact, any regularization with negative divergence helps closed loops converge.

Figure 3: Gaussian noise is better than uniform noise. Top: uniform Z, Gaussian X; bottom: Gaussian Z, Gaussian X.

4.3 Implementation

A large portion of deep learning achievements can undoubtedly be attributed to practical success, where implementation choices play an important role. However, it is still not clear how to make these choices, and most decisions are based on empirical results. The DCGAN architecture [19] showed empirical success at avoiding the mode-dropping problem and is widely used in the GAN community.

Gaussian Noise: Our proof in Section 3 showed that if we choose x_i, z_j carefully, we can create mode collapse. The distribution of the noise controls the roots of the objective. We show two objective functions generated by different noise (Fig. 3). Uniform noise creates dense roots, so the objective is constant in the generator direction, letting the discriminator learn fast. Gaussian noise, on the other hand, avoids the problem by not having dense roots everywhere.


5 Proposed Method

The main idea of our method, MeanWGAN-GP, is to introduce a mean-matching regularization term into the generator objective of WGAN-GP to tackle mode collapse. To the best of our knowledge, our method is the first to incorporate mean matching with WGAN-GP. The GAN optimization problem becomes:

\min_g \max_d\; \mathbb{E}_{x\sim p_x}[d(x)] - \mathbb{E}_{z\sim p_z}[d(g(z))] + \alpha\, \mathbb{E}_{\hat{x}\sim \tau}\big[(\|\nabla d(\hat{x})\|_2 - 1)^2\big] + \beta \sqrt{\mathbb{E}_{x\sim p_x}[x] - \mathbb{E}_{z\sim p_z}[g(z)]}
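As an illustration of how this objective decomposes into per-network training losses, here is a sketch assuming PyTorch, the gradient_penalty and mean_matching_penalty helpers sketched in Section 4, and hypothetical default weights for α and β (the paper does not state the values used):

```python
# Sketch of MeanWGAN-GP training losses derived from the objective above.
# The critic maximizes the Wasserstein estimate while keeping gradient norms
# near 1; the generator minimizes the negated critic score plus the
# mean-matching term computed against the real minibatch.
def meanwgan_gp_critic_loss(D, G, x_real, z, alpha=10.0):    # alpha: assumed weight
    x_fake = G(z).detach()
    wasserstein = D(x_real).mean() - D(x_fake).mean()
    return -wasserstein + alpha * gradient_penalty(D, x_real, x_fake)

def meanwgan_gp_generator_loss(D, G, x_real, z, beta=1.0):   # beta: assumed weight
    x_fake = G(z)
    return -D(x_fake).mean() + beta * mean_matching_penalty(x_real, x_fake)
```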

Notice that the loss function of our generator now needs to take the real data X into account. Specifically, building on our analysis of solutions for mode collapse in Sections 3 and 4, we propose a holistic approach summarized as follows:

• Using the Wasserstein distance as the objective function, with the two-sided gradient penalty to improve WGAN

• Applying mean-matching regularization to the generator

• Utilizing mixture-of-Gaussians noise for sampling (a sampling sketch follows this list)
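For the last point, a minimal sampling sketch; the component means and scale are hypothetical hyperparameters, since the paper does not specify them:

```python
# Sketch of mixture-of-Gaussians noise for the generator input.
# Assumptions: K equally weighted components; `means` (shape (K, dim)) and
# `std` are hypothetical hyperparameters not given in the paper.
import torch

def sample_mixture_gaussian_noise(batch_size, means, std=1.0):
    component = torch.randint(means.size(0), (batch_size,))   # pick a component per sample
    return means[component] + std * torch.randn(batch_size, means.size(1))
```

For example, z = sample_mixture_gaussian_noise(64, torch.randn(10, 100)) draws a batch of 64 noise vectors from a 10-component mixture in 100 dimensions.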

Figure 4: Modified GAN architecture: (1) Gaussian loss, (2) moment-matching generator regularization, (3) Wasserstein loss objective.

Figure 4 illustrates our modified GAN architecture. The experimental results show that our proposed model achieves competitive results in terms of the number of covered modes and a small divergence to the true distribution.

6 Experimental Evaluation

An enormous amount of attention has been devoted recently to improved methods for GAN training, with a special focus on addressing mode collapse. Quantifying the extent of mode collapse, however, is not straightforward except for synthetic datasets or selected real datasets (like MNIST) that have a known maximum number of modes. For this reason, the experiments presented in this section were carried out on synthetic datasets, MNIST (real data), and a stacked MNIST dataset. Our results indicate that our proposed method (MeanWGAN-GP) performs favorably not only against vanilla GAN [10] and ALI [7], but also against recently proposed specialized GANs like VEEGAN [21], Mode Regularized GAN [5], and WGAN-GP [11]. An important distinction to note is that no parameter tuning was performed while training MeanWGAN-GP.

6.1 Synthetic Data

Synthetic datasets provide the ideal setting for measuring mode collapse since their total number of modes and true distribution are exactly known. We compare the performance of MeanWGAN-GP with the results presented in the VEEGAN [21] paper on the following datasets:

1. 2D Ring (8 modes): eight 2-dimensional Gaussian distributions arranged at equal intervals along the circumference of a circle (a sampling sketch for this dataset follows the list).

2. 2D Grid (25 modes): twenty-five 2-dimensional Gaussian distributions arranged in a 5 × 5 grid.

3. 1200D Synthetic (10 modes): a mixture of ten 700-dimensional Gaussian distributions embedded in a 1200-dimensional space.
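As referenced in the first item, a sampling sketch for the 2D Ring dataset; the radius and per-mode standard deviation are assumed values, since the exact parameters follow the setup of [21] and are not restated here.

```python
# Sketch: eight 2D Gaussians placed at equal angles on a circle (2D Ring).
# radius and std are assumed values, not taken from the paper.
import numpy as np

def sample_2d_ring(n, n_modes=8, radius=2.0, std=0.02, seed=None):
    rng = np.random.default_rng(seed)
    mode = rng.integers(n_modes, size=n)                      # which Gaussian each point comes from
    angles = 2.0 * np.pi * mode / n_modes
    centers = np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)
    return centers + std * rng.standard_normal((n, 2))
```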


The specific structures of the aforementioned synthetic datasets were chosen to enable extension of the comparative analysis (between vanilla GAN, ALI, and VEEGAN) provided in [21]. For a fair comparison of the different GAN training methods, we used the same generator and discriminator architectures for WGAN-GP and MeanWGAN-GP. Specifically, for the 2D Ring and 2D Grid datasets, a fully-connected MLP with two hidden layers formed the generator, and a two-layer MLP without dropout or normalization constituted the discriminator. For the 1200D dataset, we chose an off-the-shelf standard implementation of DCGAN for the generator to ensure it was powerful enough to capture the higher-dimensional manifolds of natural images that the 1200D dataset mimics.

                2D Ring           2D Grid            1200D Synthetic
                Modes (Max 8)     Modes (Max 25)     Modes (Max 10)
Vanilla GAN     1                 3.3                1.6
ALI             2.8               15.8               3
VEEGAN          8                 24.6               5.5
WGAN-GP         8                 25                 10
MeanWGAN-GP     8                 25                 10

Table 1: Mode collapse in Synthetic Datasets. MeanWGAN-GP & WGAN-GP captured all modes

As summarized in Table 1, both MeanWGAN-GP and WGAN-GP capture all the modes on all the synthetic datasets. This is also visually illustrated in Figure 5, which contrasts the performance of vanilla GAN with that of WGAN-GP and MeanWGAN-GP on the 2D Ring dataset. The true distribution is represented by yellow clusters, while green points represent data produced by the generator. The contours represent increasing magnitude of gradients, with yellow being the highest. In summary, both WGAN-GP and MeanWGAN-GP succeeded in avoiding mode collapse on all three datasets, and also performed equally well on all three synthetic datasets in terms of running time.

Figure 5: L-R: (i) Vanilla GAN, (ii) WGAN-GP, (iii) MeanWGAN-GP performance on the 2D Ring dataset. Yellow: true distribution, green: generated data.

6.2 MNIST

The success of our MeanWGAN-GP in avoiding mode collapse on synthetic data prompted us to increase the complexity of the setting by applying our method to a real dataset, namely MNIST. MeanWGAN-GP captured all 10 modes of MNIST and produced realistic images (Figure 6).

Figure 6: MeanWGAN-GP on MNIST: evolution of generated data from noise to realistic images


6.3 Stacked MNIST

Building on the success of MeanWGAN-GP on MNIST, we evaluated the methods on a stacked MNIST dataset, a variant of the MNIST data specifically designed to increase the number of discrete modes. We synthesized the data by stacking three randomly sampled MNIST digits along the color channel, resulting in a 28x28x3 image. This resulted in a dataset with 1000 distinct modes, corresponding to the total number of possible digit triplets (from 000 through 999). Again, the specific type of modification applied to MNIST was chosen in accordance with [21], so as to maintain the extensibility of their results to this study. As for 1200D and MNIST, the generator consisted of a standard DCGAN implementation.
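The construction above can be sketched as follows, with mnist_images and mnist_labels as hypothetical NumPy arrays of shape (N, 28, 28) and (N,):

```python
# Sketch of the Stacked MNIST construction: stack three randomly drawn MNIST
# digits along the channel axis; the digit triplet (000-999) defines the mode.
import numpy as np

def make_stacked_mnist(mnist_images, mnist_labels, n_samples, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(mnist_images), size=(n_samples, 3))
    images = np.stack([mnist_images[idx[:, c]] for c in range(3)], axis=-1)   # (n, 28, 28, 3)
    modes = (100 * mnist_labels[idx[:, 0]]
             + 10 * mnist_labels[idx[:, 1]]
             + mnist_labels[idx[:, 2]])                                       # triplet id in [0, 999]
    return images, modes
```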

Figure 7: MeanWGAN-GP on Stacked MNIST: evolution of generated data

Performance on this compositional experiment is measured by two metrics. Modes is the classifier-reported number of generated modes, i.e., the size of the set of digit triplets that the model generates. KL is the KL divergence between the classifier-reported distribution of generated triplets and the distribution of triplets in the training data. The results are summarized in Table 2. They indicate that our proposed MeanWGAN-GP consistently captured more modes in its generated data than even WGAN-GP: on average, MeanWGAN-GP captured 12-18 more modes of the compositional MNIST dataset than WGAN-GP across multiple runs. MeanWGAN-GP, however, was slower than WGAN-GP, and its KL divergence exceeded that of WGAN-GP by 0.01 across several runs.

                Stacked MNIST
                Modes (Max 1000)    KL
DCGAN           99                  3.40
ALI             16                  5.40
VEEGAN          150                 2.95
RegGAN          955                 62.3
WGAN-GP         961                 0.14
MeanWGAN-GP     977                 0.15

Table 2: Mode collapse on Stacked MNIST. MeanWGAN-GP captured the highest number of modes.

7 Conclusion and Future Work

In this paper, we have presented MeanWGAN-GP, a regularized GAN that combines our proposed mean-matching regularization of the generator objective with the gradient-penalized version of the Wasserstein GAN, and which, via our systematic and theoretically motivated analysis, is a robust method for mode covering. Our experimental results illustrate that our approach is more effective than state-of-the-art approaches at avoiding mode collapse.

Since mean matching enables GANs to cover more modes than otherwise possible by allowing the generator to look at the real data, utilizing other moments would be a logical future extension that is likely to work reasonably well. Adding higher-order moments, however, could result in longer training times due to higher dimensionality. Another possible extension of our proposed model is leveraging the one-sided gradient penalty [15] rather than the currently employed two-sided gradient penalty on the Wasserstein loss.


References

[1] Brandon Amos. Image Completion with Deep Learning in TensorFlow. http://bamos.github.io/2016/08/09/deep-completion. Accessed: 2017.

[2] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. ICLR, 2017.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. ICML, 2017.

[4] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. CoRR, 2017. URL http://arxiv.org/abs/1703.10717.

[5] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. ICLR, abs/1612.02136, 2017.

[6] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 2018.

[7] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. ICLR, 2017.

[8] W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow. Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step. ArXiv e-prints, October 2017. URL http://arxiv.org/abs/1710.08446.

[9] S. Feizi, C. Suh, and D. Tse. Understanding GANs: the LQG Setting. ArXiv e-prints, 2017. URL https://arxiv.org/abs/1710.10793.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NIPS, 2014.

[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. ICLR, 2018.

[12] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative adversarial networks. CVPR, 2017.

[13] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, 2017. URL http://arxiv.org/abs/1710.10196.

[14] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.

[15] Naveen Kodali, Jacob D. Abernethy, James Hays, and Zsolt Kira. How to train your DRAGAN. CoRR, 2017. URL http://arxiv.org/abs/1705.07215.

[16] Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. NIPS, 2017.

[17] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. ICLR, 2017.

[18] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. NIPS, 2017.

[19] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, 2016.

[20] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. NIPS, 2016.

[21] Akash Srivastava, Lazar Valkov, Chris Russell, Michael Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. NIPS, 2017.

[22] Ilya Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. AdaGAN: Boosting generative models. NIPS, 2017.
