Disentangling by Factorising

Hyunjik Kim 1 2 Andriy Mnih 1

Abstract

We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of representations to be factorial and hence independent across the dimensions. We show that it improves upon β-VAE by providing a better trade-off between disentanglement and reconstruction quality. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them.

1. Introduction

Learning interpretable representations of data that expose semantic meaning has important consequences for artificial intelligence. Such representations are useful not only for standard downstream tasks such as supervised learning and reinforcement learning, but also for tasks such as transfer learning and zero-shot learning where humans excel but machines struggle (Lake et al., 2016). There have been multiple efforts in the deep learning community towards learning factors of variation in the data, commonly referred to as learning a disentangled representation. While there is no canonical definition for this term, we adopt the one due to Bengio et al. (2013): a representation where a change in one dimension corresponds to a change in one factor of variation, while being relatively invariant to changes in other factors. In particular, we assume that the data has been generated from a fixed number of independent factors of variation.³ We focus on image data, where the effect of factors of variation is easy to visualise.

Using generative models has shown great promise in learning disentangled representations in images. Notably, semi-supervised approaches that require implicit or explicit knowledge about the true underlying factors of the data have excelled at disentangling (Kulkarni et al., 2015; Kingma et al., 2014; Reed et al., 2014; Siddharth et al., 2017; Hinton et al., 2011; Mathieu et al., 2016; Goroshin et al., 2015; Hsu et al., 2017; Denton & Birodkar, 2017). However, ideally we would like to learn these in an unsupervised manner, for the following reasons: 1. Humans are able to learn factors of variation unsupervised (Perry et al., 2010). 2. Labels are costly as obtaining them requires a human in the loop. 3. Labels assigned by humans might be inconsistent or leave out the factors that are difficult for humans to identify.

[Figure 1 diagram labels: Input; VAE; randomly permute each dimension across batch; Discriminator; cross-entropy loss for classifying samples from each class, encouraging the code distribution to be factorised; FactorVAE objective = VAE objective - cross-entropy loss.]

Figure 1. Architecture of FactorVAE, a Variational Autoencoder (VAE) that encourages the code distribution to be factorial. The top row is a VAE with convolutional encoder and decoder, and the bottom row is an MLP classifier, the discriminator, that distinguishes whether the input was drawn from the marginal code distribution or the product of its marginals.

¹ DeepMind ² Department of Statistics, University of Oxford, UK. Correspondence to: Hyunjik Kim <[email protected]>.

³ We discuss the limitations of this assumption in Section 4.

β-VAE (Higgins et al., 2016) is a popular method for unsupervised disentangling based on the Variational Autoencoder (VAE) framework (Kingma & Welling, 2014; Rezende et al., 2014) for generative modelling. It uses a modified version of the VAE objective with a larger weight (β > 1) on the KL divergence between the variational posterior and the prior, and has proven to be an effective and stable method for disentangling. One drawback of β-VAE is that reconstruction quality (compared to VAE) must be sacrificed in order to obtain better disentangling. The goal of our work is to obtain a better trade-off between disentanglement and reconstruction, allowing us to achieve better disentanglement without degrading reconstruction quality. In this work, we analyse the source of this trade-off and propose FactorVAE, which augments the VAE objective with a penalty that encourages the marginal distribution of representations to be factorial without substantially affecting the quality of reconstructions. This penalty is expressed as a KL divergence between this marginal distribution and the product of its marginals, and is optimised using a discriminator network following the divergence minimisation view of GANs (Nowozin et al., 2016). Our experimental results show that this approach achieves better disentanglement than β-VAE for the same reconstruction quality. In addition, we point out the weaknesses in the disentangling metric of Higgins et al. (2016), and propose a new metric that addresses these shortcomings.

arXiv:1802.05983v1 [stat.ML] 16 Feb 2018

A popular alternative to β-VAE is InfoGAN (Chen et al., 2016), which is based on the Generative Adversarial Net (GAN) framework (Goodfellow et al., 2014) for generative modelling. InfoGAN learns disentangled representations by rewarding the mutual information between the observations and a subset of latents. However, at least in part due to its training stability issues (Higgins et al., 2016), there has been little empirical comparison between VAE-based methods and InfoGAN. Taking advantage of the recent developments in the GAN literature that help stabilise training, we include InfoWGAN-GP, a version of InfoGAN that uses Wasserstein distance (Arjovsky et al., 2017) and gradient penalty (Gulrajani et al., 2017), in our experimental evaluation.

In summary, we make the following contributions: 1) We introduce FactorVAE, a method for disentangling that gives higher disentanglement scores than β-VAE for the same reconstruction quality. 2) We identify the weaknesses of the disentanglement metric of Higgins et al. (2016) and propose a more robust alternative. 3) We give quantitative comparisons of FactorVAE and β-VAE against InfoGAN's WGAN-GP counterpart for disentanglement.

2. Trade-off between Disentanglement and Reconstruction in β-VAE

We motivate our approach by analysing where the disentanglement and reconstruction trade-off arises in the β-VAE objective. First, we introduce the notation and architecture of our VAE framework. We assume that observations $x^{(i)} \in \mathcal{X}$, $i = 1, \ldots, N$ are generated by combining $K$ underlying factors $f = (f_1, \ldots, f_K)$. These observations are modelled using a real-valued latent/code vector $z \in \mathbb{R}^d$, interpreted as the representation of the data. The generative model is defined by the standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, intentionally chosen to be a factorised distribution, and the decoder $p_\theta(x|z)$ parameterised by a neural net. The variational posterior for an observation is $q_\theta(z|x) = \prod_{j=1}^{d} \mathcal{N}(z_j \mid \mu_j(x), \sigma_j^2(x))$, with the mean and variance produced by the encoder, also parameterised by a neural net.¹ The variational posterior can be seen as the distribution of the representation corresponding to the data point $x$. The distribution of representations for the entire data set is then given by

$$q(z) = \mathbb{E}_{p_{data}(x)}[q(z|x)] = \frac{1}{N} \sum_{i=1}^{N} q(z|x^{(i)}), \tag{1}$$

which is known as the marginal posterior or aggregate posterior, where $p_{data}$ is the empirical data distribution. A disentangled representation would have each $z_j$ correspond to precisely one underlying factor $f_k$. Since we assume that these factors vary independently, we wish for a factorial distribution $q(z) = \prod_{j=1}^{d} q(z_j)$.

¹ In the rest of the paper we will omit the dependence of p and q on their parameters θ for notational convenience.

The β-VAE objective

$$\frac{1}{N} \sum_{i=1}^{N} \Big[ \mathbb{E}_{q(z|x^{(i)})}[\log p(x^{(i)}|z)] - \beta \, \mathrm{KL}(q(z|x^{(i)}) \,\|\, p(z)) \Big]$$

is a variational lower bound on $\mathbb{E}_{p_{data}(x)}[\log p(x^{(i)})]$ for β ≥ 1, reducing to the VAE objective for β = 1. Its first term can be interpreted as the negative reconstruction error, and the second term as the complexity penalty that acts as a regulariser. We may further break down this KL term as (Hoffman & Johnson, 2016; Makhzani & Frey, 2017)

$$\mathbb{E}_{p_{data}(x)}[\mathrm{KL}(q(z|x) \,\|\, p(z))] = I(x; z) + \mathrm{KL}(q(z) \,\|\, p(z)),$$

where $I(x; z)$ is the mutual information between $x$ and $z$ under the joint distribution $p_{data}(x) q(z|x)$. See Appendix C for the derivation. Penalising the $\mathrm{KL}(q(z) \,\|\, p(z))$ term pushes $q(z)$ towards the factorial prior $p(z)$, encouraging independence in the dimensions of $z$ and thus disentangling. Penalising $I(x; z)$, on the other hand, reduces the amount of information about $x$ stored in $z$, which can lead to poor reconstructions for high values of β (Makhzani & Frey, 2017). Thus making β larger than 1, penalising both terms more, leads to better disentanglement but reduces reconstruction quality. When this reduction is severe, there is insufficient information about the observation in the latents, making it impossible to recover the true factors. Therefore there exists a value of β > 1 that gives highest disentanglement, but results in a higher reconstruction error than a VAE.
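The decomposition of the averaged KL term follows in a few lines from the definitions already given. The steps below are a sketch of the standard argument, not a reproduction of the paper's Appendix C; they use only $q(z) = \mathbb{E}_{p_{data}(x)}[q(z|x)]$ and the joint distribution $p_{data}(x) q(z|x)$.

```latex
\begin{aligned}
\mathbb{E}_{p_{data}(x)}\big[\mathrm{KL}(q(z|x)\,\|\,p(z))\big]
&= \mathbb{E}_{p_{data}(x)}\,\mathbb{E}_{q(z|x)}
   \left[\log\frac{q(z|x)}{q(z)} + \log\frac{q(z)}{p(z)}\right] \\
&= \mathbb{E}_{p_{data}(x)\,q(z|x)}
   \left[\log\frac{p_{data}(x)\,q(z|x)}{p_{data}(x)\,q(z)}\right]
 + \mathbb{E}_{q(z)}\left[\log\frac{q(z)}{p(z)}\right] \\
&= I(x;z) + \mathrm{KL}(q(z)\,\|\,p(z)).
\end{aligned}
```

The second expectation reduces to one over $q(z)$ because its integrand does not depend on $x$, so averaging over $p_{data}(x)$ simply marginalises $x$ out of the joint.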

3. Total Correlation Penalty and FactorVAE

Penalising $I(x; z)$ more than a VAE does might be neither necessary nor desirable for disentangling. For example, InfoGAN disentangles by encouraging $I(x; c)$ to be high, where $c$ is a subset of the latent variables $z$.² Hence we motivate FactorVAE by augmenting the VAE objective with a term that directly encourages independence in the code distribution, arriving at the following objective:

$$\frac{1}{N} \sum_{i=1}^{N} \Big[ \mathbb{E}_{q(z|x^{(i)})}[\log p(x^{(i)}|z)] - \mathrm{KL}(q(z|x^{(i)}) \,\|\, p(z)) \Big] - \gamma \, \mathrm{KL}(q(z) \,\|\, \bar{q}(z)), \tag{2}$$

where $\bar{q}(z) := \prod_{j=1}^{d} q(z_j)$. Note that this is also a lower bound on the marginal log likelihood $\mathbb{E}_{p_{data}(x)}[\log p(x)]$. $\mathrm{KL}(q(z) \,\|\, \bar{q}(z))$ is known as Total Correlation (TC, Watanabe, 1960), a popular measure of dependence for multiple random variables. In our case this term is intractable since both $q(z)$ and $\bar{q}(z)$ involve mixtures with a large number of components, and the direct Monte Carlo estimate requires a pass through the entire data set for each $q(z)$ evaluation.³ Hence we take an alternative approach for optimizing this term. We start by observing we can sample from $q(z)$ efficiently by first choosing a datapoint $x^{(i)}$ uniformly at random and then sampling from $q(z|x^{(i)})$. We can also sample from $\bar{q}(z)$ by generating $d$ samples from $q(z)$ and then ignoring all but one dimension for each sample. A more efficient alternative involves sampling a batch from $q(z)$ and then randomly permuting across the batch for each latent dimension (see Alg. 1). This is a standard trick used in the independence testing literature (Arcones & Gine, 1992) and as long as the batch is large enough, the distribution of these samples will closely approximate $\bar{q}(z)$. Having access to samples from both distributions allows us to minimise their KL divergence using the density-ratio trick (Nguyen et al., 2010; Sugiyama et al., 2012) which involves training a classifier/discriminator to approximate the density ratio that arises in the KL term. Suppose we have a discriminator $D$ (in our case an MLP) that outputs an estimate of the probability $D(z)$ that its input is a sample from $q(z)$ rather than from $\bar{q}(z)$. Then we have

$$TC(z) = \mathrm{KL}(q(z) \,\|\, \bar{q}(z)) = \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{\bar{q}(z)}\right] \approx \mathbb{E}_{q(z)}\left[\log \frac{D(z)}{1 - D(z)}\right]. \tag{3}$$

² Note however that $I(x; z)$ in β-VAE is defined under the joint distribution of data and their encoding distribution $p_{data}(x) q(z|x)$, whereas $I(x; c)$ in InfoGAN is defined on the joint distribution of the prior on $c$ and the decoding distribution $p(c) p(x|c)$.

We train the discriminator and the VAE jointly. In particular, the VAE parameters are updated using the objective in Eqn. (2), with the TC term replaced using the discriminator-based approximation from Eqn. (3). The discriminator is trained to classify between samples from $q(z)$ and $\bar{q}(z)$, thus learning to approximate the density ratio needed for estimating TC. See Alg. 2 for pseudocode of FactorVAE.
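The two quantities involved in this joint training can be illustrated with a minimal standard-library sketch. It assumes the discriminator outputs are already available as plain probabilities; the function names `logit`, `tc_estimate` and `discriminator_loss` and the clamping constant `eps` are hypothetical choices for illustration, not part of the paper.

```python
import math


def logit(p, eps=1e-7):
    """Return log(p / (1 - p)), clamping p away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p) - math.log(1.0 - p)


def tc_estimate(d_probs):
    """Monte Carlo version of Eqn. (3): mean of log D(z) / (1 - D(z)).

    d_probs holds D(z) for a batch of codes z sampled from q(z).
    """
    return sum(logit(p) for p in d_probs) / len(d_probs)


def discriminator_loss(d_probs_joint, d_probs_perm, eps=1e-7):
    """Binary cross-entropy for the discriminator.

    d_probs_joint: D(z) on samples from q(z) (target label 1).
    d_probs_perm:  D(z) on permuted samples (target label 0).
    This is the negative of the bracketed average in the psi update of Alg. 2.
    """
    pos = sum(math.log(max(p, eps)) for p in d_probs_joint)
    neg = sum(math.log(max(1.0 - p, eps)) for p in d_probs_perm)
    return -(pos + neg) / (len(d_probs_joint) + len(d_probs_perm))
```

An undecided discriminator, with D(z) = 0.5 everywhere, gives a TC estimate of zero, which matches the intuition that it cannot tell $q(z)$ from the product of its marginals.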

In the GAN literature, divergence minimisation is usually done between two distributions over the data space, which is often very high dimensional (e.g. images). As a result, the two distributions often have disjoint support, making training unstable, especially when the discriminator is strong. Hence it is necessary to use tricks to weaken the discriminator such as instance noise (Sønderby et al., 2016) or to replace the discriminator with a critic, as in Wasserstein GANs (Arjovsky et al., 2017). In this work, we minimise divergence between two distributions over the latent space (as in e.g. (Mescheder et al., 2017)), which is typically much lower dimensional and the two distributions have overlapping support. We observe that training is stable for sufficiently large batch sizes (e.g. 64 worked well for d = 10), allowing us to use a strong discriminator.

³ We have also tried using a batch estimate of q(z), but this did not work. See Appendix D for details.

Algorithm 1 permute_dims

Input: $\{z^{(i)} \in \mathbb{R}^d : i = 1, \ldots, B\}$
for $j = 1$ to $d$ do
    $\pi \leftarrow$ random permutation on $\{1, \ldots, B\}$
    $(z_j^{(i)})_{i=1}^{B} \leftarrow (z_j^{(\pi(i))})_{i=1}^{B}$
end for
Output: $\{z^{(i)} : i = 1, \ldots, B\}$

Algorithm 2 FactorVAE

Input: observations $(x^{(i)})_{i=1}^{N}$, batch size $m$, latent dimension $d$, $\gamma$, VAE/Discriminator optimisers: $g$, $g_D$
Initialize VAE and discriminator parameters $\theta$, $\psi$.
repeat
    Randomly select batch $(x^{(i)})_{i \in B}$ of size $m$
    Sample $z_\theta^{(i)} \sim q_\theta(z|x^{(i)})$ $\forall i \in B$
    $\theta \leftarrow g\Big(\nabla_\theta \frac{1}{m} \sum_{i \in B} \Big[\log \frac{p_\theta(x^{(i)}, z_\theta^{(i)})}{q_\theta(z_\theta^{(i)}|x^{(i)})} - \gamma \log \frac{D_\psi(z_\theta^{(i)})}{1 - D_\psi(z_\theta^{(i)})}\Big]\Big)$
    Randomly select batch $(x^{(i)})_{i \in B'}$ of size $m$
    Sample $z_\theta'^{(i)} \sim q_\theta(z|x^{(i)})$ for $i \in B'$
    $(z_{perm}'^{(i)})_{i \in B'} \leftarrow$ permute_dims$((z_\theta'^{(i)})_{i \in B'})$
    $\psi \leftarrow g_D\Big(\nabla_\psi \frac{1}{2m} \Big[\sum_{i \in B} \log(D_\psi(z_\theta^{(i)})) + \sum_{i \in B'} \log(1 - D_\psi(z_{perm}'^{(i)}))\Big]\Big)$
until convergence of objective.
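The permutation step of Algorithm 1 (permute_dims) can be sketched in plain Python as follows. This is an illustrative version that represents a batch as a list of lists; the optional `rng` argument is an assumption added here so that results can be made reproducible, and a real implementation would normally operate on tensors instead.

```python
import random


def permute_dims(batch, rng=None):
    """Independently shuffle each latent dimension across the batch (Alg. 1).

    batch: list of B latent vectors, each a list of d numbers.
    rng:   optional random.Random instance (hypothetical convenience argument).
    Returns a new list of B vectors; the input batch is not modified.
    """
    if rng is None:
        rng = random.Random()
    num_points = len(batch)
    num_dims = len(batch[0])
    permuted = [list(z) for z in batch]
    for j in range(num_dims):
        perm = list(range(num_points))
        rng.shuffle(perm)  # a fresh permutation for every dimension j
        for i in range(num_points):
            permuted[i][j] = batch[perm[i]][j]
    return permuted
```

Each column of the output holds exactly the same values as the corresponding column of the input, so every marginal is preserved, while the pairing between dimensions within a row is broken up.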

4. A New Metric for Disentanglement

The definition of disentanglement we use in this paper, where a change in one dimension of the representation corresponds to a change in exactly one factor of variation, is clearly a simplistic one. It does not allow correlations among the factors or hierarchies over them. Thus this definition seems more suited to synthetic data with independent factors of variation than to most realistic data sets. However, as we will show below, robust disentanglement is not a fully solved problem even in this simple setting. One obstacle on the way to this first milestone is the absence of a sound quantitative metric for measuring disentanglement.

[Figure 2 diagram. Top: fix one factor; generate data with the chosen factor fixed and the other factors random; get representation; take absolute value of difference; take mean; one training point for linear classifier. Bottom: fix one factor; generate data with the chosen factor fixed and the other factors random; get rescaled representation; take empirical variance in each dimension; take argmin; one training point for majority-vote classifier.]

Figure 2. Top: Metric in (Higgins et al., 2016). Bottom: Our new metric, where $s \in \mathbb{R}^d$ is the scale (empirical standard deviation) of latent representations of the full data (or large enough random subset).

A popular method of measuring disentanglement is by inspecting latent traversals: visualising the change in reconstructions while traversing one dimension of the latent space at a time. Although latent traversals can be a useful indicator of when a model has failed to disentangle, the qualitative nature of this approach makes it unsuitable for comparing algorithms reliably. Doing this would require inspecting a multitude of latent traversals over multiple reference images, random seeds, and points during training. Having a human in the loop to assess the traversals is also too time-consuming and subjective. Unfortunately, for data sets that do not have the ground truth factors of variation available, currently this is the only viable option for assessing disentanglement.

Higgins et al. (2016) proposed a supervised metric that attempts to quantify disentanglement when the ground truth factors of a data set are given. The metric is the error rate of a linear classifier that is trained as follows. Choose a factor k; generate data with this factor fixed but all other factors varying randomly; obtain their representations (defined to be the mean of $q(z|x)$); take the absolute value of the pairwise differences of these representations. Then the mean of these statistics across the pairs gives one training input for the classifier, and the fixed factor index k is the corresponding training output (see top of Figure 2). So if the representations were perfectly disentangled, we would see zeros in the dimension of the training input that corresponds to the fixed factor of variation, and the classifier would learn to map the index of the zero value to the index of the factor.
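To make the construction of a single training input concrete, here is a small sketch of the averaging step described above. It assumes the representations have already been grouped into pairs that share the same fixed factor; the function name `higgins_training_input` and the list-of-pairs layout are hypothetical and are not taken from Higgins et al. (2016).

```python
def higgins_training_input(pairs):
    """Mean absolute difference per latent dimension over representation pairs.

    pairs: list of (z_a, z_b) tuples, where z_a and z_b are equal-length lists
           of floats obtained from images that share the same fixed factor.
    Returns a list with one value per latent dimension.
    """
    num_dims = len(pairs[0][0])
    sums = [0.0] * num_dims
    for z_a, z_b in pairs:
        for j in range(num_dims):
            sums[j] += abs(z_a[j] - z_b[j])
    return [s / len(pairs) for s in sums]
```

For example, with the pairs `([0.0, 1.0], [0.0, 3.0])` and `([0.5, 2.0], [0.5, 2.0])` the result is `[0.0, 1.0]`: the first dimension never changes within a pair, which is the signature the linear classifier is meant to pick up on.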

However, this metric has several weaknesses. Firstly, it could be sensitive to hyperparameters of the linear classifier optimisation, such as the choice of the optimiser and its hyperparameters, weight initialisation, and the number of training iterations. Secondly, having a linear classifier is not so intuitive: we could get representations where each factor corresponds to a linear combination of dimensions instead of a single dimension. Finally and most importantly, the metric has a failure mode: it gives 100% accuracy even when only K − 1 factors out of K have been disentangled; to predict the remaining factor, the classifier simply learns to detect when all the values corresponding to the K − 1 factors are non-zero. An example of such a case is shown in Figure 3.

Figure 3. A β-VAE model trained on the 2D Shapes data that scores 100% on the metric in Higgins et al. (2016) (ignoring the shape factor). First row: originals. Second row: reconstructions. Remaining rows: reconstructions of latent traversals. The model only uses three latent units to capture x-position, y-position, scale and ignores orientation, yet achieves a perfect score on the metric.

To address these weaknesses, we propose a new disentanglement metric as follows. Choose a factor k; generate data with this factor fixed but all other factors varying randomly; obtain their representations; normalise each dimension by its empirical standard deviation over the full data (or a large enough random subset); take the empirical variance in each dimension⁴ of these normalised representations. Then the index of the dimension with the lowest variance and the target index k provide one training input/output example for the classifier (see bottom of Figure 2). Thus if the representation is perfectly disentangled, the empirical variance in the dimension corresponding to the fixed factor will be 0. We normalise the representations so that the argmin is invariant to rescaling of the representations in each dimension. Since both inputs and outputs lie in a discrete space, the optimal classifier is the majority-vote classifier (see Appendix B for details), and the metric is the error rate of the classifier. The resulting classifier is a deterministic function of the training data, hence there are no optimisation hyperparameters to tune. We also believe that this metric is conceptually simpler and more natural than the previous one. Most importantly, it circumvents the failure mode of the earlier metric, since the classifier needs to see the lowest variance in a latent dimension for a given factor to classify it correctly.
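The steps of the proposed metric can be sketched with the standard library alone. This is a minimal illustration under several assumptions: representations are lists of floats, the per-dimension scale has been computed beforehand, the population variance is used (the choice between population and sample variance does not change the argmin), and the score is reported as accuracy on the votes themselves, so the error rate mentioned in the text would be one minus this value. The names `vote` and `majority_vote_accuracy` are hypothetical.

```python
from collections import Counter, defaultdict


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def vote(reps, scale):
    """Return the index of the latent dimension with the lowest variance.

    reps:  representations of images that all share one fixed factor.
    scale: per-dimension empirical standard deviation over the full data.
    """
    num_dims = len(scale)
    variances = [variance([z[j] / scale[j] for z in reps]) for j in range(num_dims)]
    return min(range(num_dims), key=lambda j: variances[j])


def majority_vote_accuracy(votes):
    """Accuracy of the majority-vote classifier on (argmin_dim, factor_k) pairs.

    Each latent dimension is mapped to the factor it was paired with most often.
    """
    table = defaultdict(Counter)
    for dim, factor in votes:
        table[dim][factor] += 1
    correct = sum(max(counts.values()) for counts in table.values())
    return correct / len(votes)
```

Because the classifier is just a lookup table built from counts, repeated runs on the same votes always return the same score, which is the property the text highlights when it says there are no optimisation hyperparameters to tune.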

We think developing a reliable unsupervised disentangling metric that does not use the ground truth factors is an important direction for future research, since unsupervised disentangling is useful precisely in the scenario where we do not have access to the ground truth factors. With this in mind, we believe that having a reliable supervised metric is still valuable as it can serve as a gold standard for evaluating unsupervised metrics.

⁴ We can use Gini's definition of variance for discrete latents (Gini, 1971). See Appendix B for details.

5. Related Work

There are several recent works that use a discriminator to optimise a divergence to encourage independence in the latent codes. Adversarial Autoencoder (AAE, Makhzani et al., 2015) removes the $I(x; z)$ term in the VAE objective and maximizes the negative reconstruction error minus $\mathrm{KL}(q(z) \,\|\, p(z))$ using the density-ratio trick. This means that the AAE objective is not a lower bound on the log marginal likelihood. Moreover, the emphasis of that work is on semi-supervised classification and unsupervised clustering, rather than disentangling. In PixelGAN Autoencoders (Makhzani & Frey, 2017), the same objective is used to study the decomposition of information between the latent code and the decoder. The authors state that adding noise to the inputs of the encoder is crucial, which suggests that limiting the information that the code contains about the input is essential and that the $I(x; z)$ term should not be dropped from the VAE objective. Brakel & Bengio (2017) also use a discriminator to penalise the Jensen-Shannon Divergence between the distribution of codes and the product of its marginals. However, they use the GAN loss with deterministic encoders and decoders and only explore their technique in the context of Independent Component Analysis source separation.

Early works on unsupervised disentangling include (Schmidhuber, 1992) which attempts to disentangle codes in an autoencoder by penalising predictability of one latent dimension given the others and (Desjardins et al., 2012) where a variant of a Boltzmann Machine is used to disentangle two factors of variation in the data. More recently, Achille & Soatto (2018) have used a loss function that penalises TC in the context of supervised learning. They show that their approach can be extended to the VAE setting, but do not perform any experiments on disentangling to support the theory. In a concurrent work, Kumar et al. (2018) used moment matching in VAEs to penalise the covariance between the latent dimensions, but did not constrain the mean or higher moments. We provide the objectives used in these related methods and analyse them in Appendix E.

There have been various works that use the notion of predictability to quantify disentanglement, mostly predicting the value of ground truth factors $f = (f_1, \ldots, f_K)$ from the latent code $z$. This dates back to Yang & Amari (1997) who learn a linear map from representations to factors in the context of linear ICA, and quantify how close this map is to a permutation matrix. More recently Eastwood & Williams (2018) have extended this idea to disentanglement by training a Lasso regressor to map $z$ to $f$ and using its trained weights to quantify disentanglement. Like other regression-based approaches, this one introduces hyperparameters such as the optimiser and the Lasso penalty coefficient. The metric of Higgins et al. (2016), as well as the one we proposed, predicts the factor $k$ from the $z$ of images with a fixed $f_k$ but $f_{-k}$ varying randomly. Schmidhuber (1992) quantifies predictability between the different dimensions of $z$, using a predictor that is trained to predict $z_j$ from $z_{-j}$.

Invariance and equivariance are frequently considered to be desirable properties of representations in the literature (Goodfellow et al., 2009; Kivinen & Williams, 2011; Lenc & Vedaldi, 2015). A representation is said to be invariant for a particular task if it does not change when nuisance factors of the data, that are irrelevant to the task, are changed. An equivariant representation changes in a stable and predictable manner when altering a factor of variation. A disentangled representation, in the sense used in the paper, is equivariant, since changing one factor of variation will change one dimension of a disentangled representation in a predictable manner. Given a task, it will be easy to obtain an invariant representation from the disentangled representation by ignoring the dimensions encoding the nuisance factors for the task (Cohen & Welling, 2014).

6. Experiments

We compare FactorVAE to β-VAE on the following data sets with i) known generative factors: 1) 2D Shapes (Matthey et al., 2017): 737,280 binary 64 × 64 images of 2D shapes with ground truth factors [number of values]: shape [3], scale [6], orientation [40], x-position [32], y-position [32]. ii) unknown generative factors: 2) 3D Faces (Paysan et al., 2009): 239,840 grey-scale 64 × 64 images of 3D Faces. 3) 3D Chairs (Aubry et al., 2014): 86,366 RGB 64 × 64 × 3 images of chair CAD models. 4) CelebA (cropped version) (Liu et al., 2015): 202,599 RGB 64 × 64 × 3 images of celebrity faces. The experimental details such as encoder/decoder architectures and hyperparameter settings are in Appendix A.

From Figure 4, we see that FactorVAE gives much better disentanglement scores than VAEs (β = 1), while barely sacrificing reconstruction error, highlighting the disentangling effect of adding the Total Correlation penalty to the VAE objective. The best disentanglement scores for FactorVAE are as good or slightly better than those for β-VAE but correspond to much lower reconstruction errors. This can be seen more clearly in Figure 5 where the reconstruction error for the best mean disentanglement of β-VAE (β = 32) is around 160, significantly higher than the one for FactorVAE (γ = 35), which is around 40. From Figure 6, we can see that both models are capable of finding x-position, y-position, and scale, but struggle to disentangle orientation and shape, β-VAE especially. For this data set, neither method can robustly capture shape, the discrete factor of variation.⁵ As a sanity check, we also evaluated the correlation between our metric and the metric in Higgins et al. (2016): Pearson (linear correlation coefficient): 0.582, Kendall (proportion of pairs that have the same ordering): 0.432, Spearman (linear correlation of the rankings): 0.611, all with p-value 0.000. Hence the two metrics show a high correlation as expected.

⁵ This is partly due to the fact that learning discrete factors would require using discrete latent variables instead of Gaussians, but jointly modelling discrete and continuous factors of variation is a non-trivial problem that needs further research.

Figure 4. Reconstruction error (top), metric in Higgins et al. (2016) (middle), our metric (bottom). β-VAE (left), FactorVAE (right). The colours correspond to different values of β and γ respectively, and confidence intervals are over 10 random seeds.

Figure 5. Reconstruction error plotted against our disentanglement metric, both averaged over 10 random seeds at the end of training. The numbers at each point are values of β and γ. Note that we want low reconstruction error and a high disentanglement metric.

Figure 6. First row: originals. Second row: reconstructions. Remaining rows: reconstructions of latent traversals across each latent dimension sorted by KL(q(z_j|x)||p(z_j)), for the best scoring models on our disentanglement metric. Left: β-VAE, score: 0.884, β = 4. Right: FactorVAE, score: 0.912, γ = 25.

Figure 7. Total Correlation values for FactorVAE on 2D Shapes. Left: True TC value. Right: Discriminator’s estimate of TC.
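The three correlation coefficients quoted in the sanity check above can be computed along the following lines. This is a minimal NumPy sketch, not the authors' evaluation code: the function names are hypothetical, and the Kendall coefficient is assumed to be the tau-a variant (concordant minus discordant pairs, divided by the number of pairs), since the text does not say which variant was used.

```python
import numpy as np

def ranks(v):
    """1-based ranks; tied values share their average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="mergesort")
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):
        mask = v == val
        r[mask] = r[mask].mean()
    return r

def pearson(a, b):
    """Linear correlation coefficient."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Pearson correlation of the rankings."""
    return pearson(ranks(a), ranks(b))

def kendall_tau_a(a, b):
    """(concordant - discordant) pairs over all pairs (assumed variant)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return float(s / (n * (n - 1) / 2))

# Toy scores for four hypothetical models under the two metrics.
old_metric = [1.0, 2.0, 3.0, 4.0]
new_metric = [1.0, 3.0, 2.0, 4.0]
print(pearson(old_metric, new_metric),
      spearman(old_metric, new_metric),
      kendall_tau_a(old_metric, new_metric))
```

The pairwise loop is quadratic in the number of models, which is adequate for the handful of runs compared in a hyperparameter sweep.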

We have also examined how the discriminator’s estimate of the Total Correlation (TC) behaves and the effect of γ on the true TC. From Figure 7, observe that the discriminator is consistently underestimating the true TC. However, the true TC decreases throughout training, and a higher γ leads to lower TC, so the gradients obtained using the discriminator are sufficient for encouraging independence in the code distribution.

We also evaluated InfoWGAN-GP, the counterpart of InfoGAN that uses Wasserstein distance and gradient penalty. See Appendix F for an overview. One advantage of InfoGAN is that the Monte Carlo estimate of its objective is differentiable with respect to its parameters even for discrete codes c, which makes gradient-based optimisation straightforward. In contrast, VAE-based methods that rely on the reparameterisation trick for gradient-based optimisation require z to be a reparameterisable continuous random variable, and alternative approaches require various variance reduction techniques for gradient estimation (Mnih & Rezende, 2016; Maddison et al., 2017). Thus we might expect Info(W)GAN(-GP) to show better disentangling in cases where some factors are discrete. Hence we use 4 continuous latents (one for each continuous factor) and one categorical latent of 3 categories (one for each shape). We tuned λ, the weight of the mutual information term in Info(W)GAN(-GP), ∈ {0.0, 0.1, 0.2, . . . , 1.0}, the number of noise variables ∈ {5, 10, 20, 40, 80, 160}, and the learning rates of the generator ∈ {10^{-3}, 10^{-4}} and of the discriminator ∈ {10^{-4}, 10^{-5}}.

However, from Figure 8 we can see that the disentanglement scores are disappointingly low. From the latent traversals in Figure 9, we can see that the model learns only the scale factor, and tries to put positional information in the discrete latent code, which is one reason for the low disentanglement score. However, using 5 continuous codes and no categorical codes did not improve the disentanglement scores. InfoGAN with early stopping (before training instability occurs; see Appendix G) also gave similar results. The fact that some latent traversals give blank reconstructions indicates that the model does not generalise well to all parts of the domain of p(z).

Figure 8. Disentanglement scores for InfoWGAN-GP on 2D Shapes for 10 random seeds per hyperparameter setting. Left: Metric in Higgins et al. (2016). Right: Our metric.

Figure 9. Latent traversals for InfoWGAN-GP on 2D Shapes across four continuous codes (first four rows) and categorical code (last row) for run with best disentanglement score (λ = 0.2).

One reason for InfoWGAN-GP’s poor performance on this data set could be that InfoGAN is sensitive to the generator and discriminator architecture, which is one thing we did not tune extensively. We use a similar architecture to the VAE-based approaches for 2D Shapes for a fair comparison, but have also tried a bigger architecture which gave similar results (see Appendix G). If architecture search is indeed important, this would be a weakness of InfoGAN relative to FactorVAE and β-VAE, which are both much more robust to architecture choice. In Appendix G, we check that we can replicate the results of Chen et al. (2016) on MNIST using InfoWGAN-GP, verify that it makes training stable compared to InfoGAN, and give implementation details with further empirical studies of InfoGAN and InfoWGAN-GP.

We also show results for β-VAE and FactorVAE experiments on the data sets with unknown generative factors, namely 3D Chairs, 3D Faces, and CelebA. Note that inspecting latent traversals is the only evaluation method possible here. We can see from Figure 10 (and Figures 30 and 31 in Appendix H) that FactorVAE has smaller reconstruction error compared to β-VAE, and is capable of learning sensible factors of variation, as shown in the latent traversals in Figures 11, 12 and 13. Unfortunately, as explained in Section 4, latent traversals tell us little about the robustness of our method.

Figure 10. Plots of reconstruction error of β-VAE (left) and FactorVAE (right) for different values of β and γ on 3D Faces data over 5 random seeds.

Figure 11. β-VAE and FactorVAE latent traversals across each latent dimension sorted by KL on 3D Chairs, with annotations of the factor of variation corresponding to each latent unit. Annotations appearing in the figure: size, azimuth, back length, leg style.

Figure 12. Same as Figure 11 but for 3D Faces. Annotations appearing in the figure: azimuth, elevation, lighting.

Figure 13. Same as Figure 11 but for CelebA. Annotations appearing in the figure: background brightness, azimuth, skin tone, hair length, background blueness, fringe, head shape, hair colour.

7. Conclusion and Discussion

We have introduced FactorVAE, a novel method for disentangling that achieves better disentanglement scores than β-VAE on the 2D Shapes data set for the same reconstruction quality. Moreover, we have identified weaknesses of the commonly used disentanglement metric of Higgins et al. (2016), and proposed an alternative metric that is conceptually simpler, is free of hyperparameters, and avoids the failure mode of the former. Finally, we have performed an experimental evaluation of disentangling for the VAE-based methods and InfoWGAN-GP, a more stable variant of InfoGAN, and identified its weaknesses relative to the VAE-based methods.

One of the limitations of our approach is that low Total Correlation is necessary but not sufficient for disentangling of independent factors of variation. For example, if all but one of the latent dimensions were to collapse to the prior, the TC would be 0 but the representation would not be disentangled. Our disentanglement metric also requires us to be able to generate samples holding one factor fixed, which may not always be possible, for example when our training set does not cover all possible combinations of factors. The metric is also unsuitable for data with non-independent factors of variation.

For future work, we would like to use discrete latent variables to model discrete factors of variation and investigate how to reliably capture combinations of discrete and continuous factors using discrete and continuous latents.

References

Achille, A. and Soatto, S. Information Dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Arcones, M. A. and Gine, E. On the bootstrap of U and V statistics. The Annals of Statistics, pp. 655–674, 1992.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein Generative Adversarial Networks. In ICML, 2017.

Aubry, M., Maturana, D., Efros, A. A., Russell, B. C., and Sivic, J. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Brakel, P. and Bengio, Y. Learning independent features with adversarial nets for non-linear ICA. arXiv preprint arXiv:1710.05050, 2017.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing Generative Adversarial Nets. In NIPS, 2016.

Cohen, T. and Welling, M. Learning the irreducible representations of commutative Lie groups. In ICML, 2014.

Denton, E. L. and Birodkar, V. Unsupervised learning of disentangled representations from video. In NIPS, 2017.

Desjardins, G., Courville, A., and Bengio, Y. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12(Jul):2121–2159, 2011.

Eastwood, C. and Williams, C. A framework for the quantitative evaluation of disentangled representations. In ICLR, 2018.

Gini, C. W. Variability and mutability, contribution to the study of statistical distributions and relations. Journal of American Statistical Association, 66:534–544, 1971.

Goodfellow, I., Lee, H., Le, Q. V., Saxe, A., and Ng, A. Y. Measuring invariances in deep networks. In NIPS, 2009.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In NIPS, 2014.

Goroshin, R., Bruna, J., Tompson, J., Eigen, D., and LeCun, Y. Unsupervised learning of spatiotemporally coherent metrics. In ICCV, 2015.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. In NIPS, 2017.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. Beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pp. 44–51. Springer, 2011.

Hoffman, M. D. and Johnson, M. J. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.

Hsu, W. N., Zhang, Y., and Glass, J. Unsupervised learning of disentangled and interpretable representations from sequential data. In NIPS, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. 2014.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In NIPS, 2014.

Kivinen, J. J. and Williams, C. Transformation equivariant Boltzmann machines. In International Conference on Artificial Neural Networks, 2011.

Kulkarni, T., Whitney, W. F., Kohli, P., and Tenenbaum, J. Deep convolutional inverse graphics network. In NIPS, 2015.

Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, pp. 1–101, 2016.

Lenc, K. and Vedaldi, A. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.

Maddison, C. J., Mnih, A., and Teh, Y. W. The CONCRETE distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.

Makhzani, A. and Frey, B. PixelGAN autoencoders. In NIPS, 2017.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Mathieu, M. F., Zhao, J. J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 2016.

Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing Sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Mescheder, L., Nowozin, S., and Geiger, A. Adversarial variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. In ICML, 2017.

Mnih, A. and Rezende, D. J. Variational inference for Monte Carlo objectives. In ICML, 2016.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.

Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and Vetter, T. A 3D face model for pose and illumination invariant face recognition. In Proceedings of the IEEE International Conference on Advanced Video and Signal based Surveillance, pp. 296–301, 2009.

Perry, G., Rolls, E. T., and Stringer, S. M. Continuous transformation learning of translation invariant representations. Experimental Brain Research, 204(2):255–270, 2010.

Reed, S., Sohn, K., Zhang, Y., and Lee, H. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Schmidhuber, J. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.

Siddharth, N., Paige, B., Van de Meent, J. W., Desmaison, A., Wood, F., Goodman, N. D., Kohli, P., and Torr, P. H. S. Learning disentangled representations with semi-supervised deep generative models. In NIPS, 2017.

Sønderby, C. K., Caballero, J., Theis, L., Shi, W., and Huszar, F. Amortised MAP inference for image super-resolution. In ICLR, 2016.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.

Watanabe, S. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.

Yang, H. H. and Amari, S. I. Adaptive online learning algorithms for blind separation: maximum entropy and minimum mutual information. Neural Computation, 9(7):1457–1482, 1997.


Appendix

A. Experimental Details for FactorVAE and β-VAE

We use a Convolutional Neural Network for the encoder, a Deconvolutional Neural Network for the decoder and a Multi-Layer Perceptron (MLP) for the discriminator in FactorVAE for experiments on all data sets. We use [0,1] normalised data as targets for the mean of a Bernoulli distribution, using negative cross-entropy for log p(x|z) and the Adam optimiser (Kingma & Ba, 2015) with learning rate 10^{-4}, β1 = 0.9, β2 = 0.999 for the VAE updates, as in Higgins et al. (2016). We also use Adam for the discriminator updates with β1 = 0.5, β2 = 0.9 and a learning rate tuned from {10^{-4}, 10^{-5}}. We use 10^{-4} for 2D Shapes and 3D Faces, and 10^{-5} for 3D Chairs and CelebA. The encoder outputs parameters for the mean and log-variance of Gaussian q(z|x), and the decoder outputs logits for each entry of the image. We use the same encoder/decoder architecture for β-VAE and FactorVAE, shown in Tables 1, 2, and 3. We use the same 6 layer MLP discriminator with 1000 hidden units per layer and leaky ReLU (lReLU) non-linearity, which outputs 2 logits, in all FactorVAE experiments. We noticed that smaller discriminator architectures work fine, but noticed small improvements up to 6 hidden layers and 1000 hidden units per layer. Note that scaling the discriminator learning rate is not equivalent to scaling γ, since γ does not affect the discriminator loss. See Algorithm 2 for details of FactorVAE updates. We train for 3 × 10^5 iterations on 2D Shapes, and 10^6 iterations on Chairs, 3D Faces and CelebA. We use a batch size of 64 for all data sets.
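Two small pieces of the discriminator pipeline described above can be sketched without any deep learning framework: shuffling each latent dimension across the batch to obtain approximate samples from the product of marginals, and turning the two output logits into a density-ratio estimate of the TC. This is an illustrative NumPy sketch, not the authors' Algorithm 2. The function names are hypothetical, and it is an assumption here that logit index 0 corresponds to the class "sampled from q(z)" and index 1 to "sampled from the product of marginals".

```python
import numpy as np

def permute_dims(z, rng):
    """Independently shuffle every latent dimension across the batch."""
    z = np.asarray(z)
    out = np.empty_like(z)
    for j in range(z.shape[1]):
        out[:, j] = z[rng.permutation(z.shape[0]), j]
    return out

def tc_estimate_from_logits(logits):
    """Density-ratio estimate of TC from 2-logit discriminator outputs.

    With D(z) = softmax(logits)[0], log(D / (1 - D)) equals
    logits[:, 0] - logits[:, 1]; the estimate is its batch mean.
    """
    logits = np.asarray(logits, dtype=float)
    return float(np.mean(logits[:, 0] - logits[:, 1]))

rng = np.random.default_rng(0)
z_batch = rng.normal(size=(64, 10))      # batch of 64 codes, 10 latents
z_perm = permute_dims(z_batch, rng)      # same per-dimension values, joint structure broken
print(z_perm.shape)
```

Each column of the permuted batch contains exactly the same values as the original column, so the marginals are preserved while dependencies between dimensions are destroyed.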

B. Details for the Disentanglement Metrics

For the disentanglement metric of Higgins et al. (2016), we use L = 100 (cf. Figure 2), with a batch size of 10 per iteration of training the linear classifier. We observed that using bigger values of L did not make much of a difference. Following Higgins et al. (2016) we use Adagrad (Duchi et al., 2011) with a learning rate of 0.01 for 1000 iterations, and use a batch of size 500 to evaluate the metric at the end of training.

For our disentanglement metric, we use L = 100 and take the majority vote classifier from 500 votes. The majority vote classifier $C$ works as follows: suppose we are given data $(a_i, b_i)_{i=1}^{M}$, $a_i \in \{1, \ldots, D\}$, $b_i \in \{1, \ldots, K\}$ (so $M = 500$). Then for $j \in \{1, \ldots, D\}$, let $V_{jk} = \sum_{i=1}^{M} \mathbb{I}(a_i = j, b_i = k)$. Then the majority vote classifier is defined to be $C(j) = \arg\max_k V_{jk}$.

Note that D, the dimensionality of the latents, does not affect the metric; for a classifier that chooses at random, the accuracy is 1/K, independent of D.
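The majority vote classifier defined above amounts to building a D × K table of vote counts and taking a row-wise argmax. The following NumPy sketch illustrates this under one stated deviation from the text: indices are 0-based, so a_i ∈ {0, ..., D−1} and b_i ∈ {0, ..., K−1}. The function names are hypothetical.

```python
import numpy as np

def majority_vote_classifier(a, b, D, K):
    """Return C (length D, with C[j] = argmax_k V[j, k]) and the count table V."""
    a = np.asarray(a)
    b = np.asarray(b)
    V = np.zeros((D, K), dtype=int)
    np.add.at(V, (a, b), 1)          # V[j, k] = number of votes with a_i = j, b_i = k
    return V.argmax(axis=1), V

def vote_accuracy(a, b, C):
    """Fraction of votes whose factor b_i matches the prediction C[a_i]."""
    a = np.asarray(a)
    b = np.asarray(b)
    return float(np.mean(C[a] == b))

# Six toy votes: a_i is a latent index, b_i is the index of the fixed factor.
a = [0, 0, 0, 1, 1, 2]
b = [1, 1, 0, 2, 2, 0]
C, V = majority_vote_classifier(a, b, D=3, K=3)
print(C, vote_accuracy(a, b, C))
```

Ties in a row of V are broken in favour of the smallest k by `argmax`; the text does not specify a tie-breaking rule, so this is an arbitrary choice.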

Table 1. Encoder and Decoder architecture for 2D Shapes data.

Encoder | Decoder
Input 64×64 binary image | Input ∈ R^10
4×4 conv. 32 ReLU. stride 2 | FC. 128 ReLU.
4×4 conv. 32 ReLU. stride 2 | FC. 4×4×64 ReLU.
4×4 conv. 64 ReLU. stride 2 | 4×4 upconv. 64 ReLU. stride 2
4×4 conv. 64 ReLU. stride 2 | 4×4 upconv. 32 ReLU. stride 2
FC. 128. FC. 2×10. | 4×4 upconv. 1. stride 2

Table 2. Encoder and Decoder architecture for CelebA, Chairs data.

Encoder | Decoder
Input 64×64×3 RGB image | Input ∈ R^10
4×4 conv. 32 ReLU. stride 2 | FC. 256 ReLU.
4×4 conv. 32 ReLU. stride 2 | FC. 4×4×64 ReLU.
4×4 conv. 64 ReLU. stride 2 | 4×4 upconv. 64 ReLU. stride 2
4×4 conv. 64 ReLU. stride 2 | 4×4 upconv. 32 ReLU. stride 2
FC. 256. FC. 2×10. | 4×4 upconv. 3. stride 2

Table 3. Encoder and Decoder architecture for 3D Faces data.

Encoder | Decoder
Input 64×64 greyscale image | Input ∈ R^10
4×4 conv. 32 ReLU. stride 2 | FC. 256 ReLU.
4×4 conv. 32 ReLU. stride 2 | FC. 4×4×64 ReLU.
4×4 conv. 64 ReLU. stride 2 | 4×4 upconv. 64 ReLU. stride 2
4×4 conv. 64 ReLU. stride 2 | 4×4 upconv. 32 ReLU. stride 2
FC. 256. FC. 2×10. | 4×4 upconv. 1. stride 2
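As a quick arithmetic check on the encoder columns of Tables 1 to 3, the spatial size after each 4×4, stride-2 convolution can be computed with the standard output-size formula. The tables do not state the padding; the sketch below assumes a padding of 1 pixel on each side (equivalently, "same"-style padding for stride 2), which is the choice under which four such layers reduce 64×64 to 4×4 and match the decoder's FC. 4×4×64 layer. The helper names are hypothetical.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_feature_shape(size=64, channels=(32, 32, 64, 64)):
    """Spatial size and channel count after the stack of stride-2 convolutions."""
    for _ in channels:
        size = conv_out(size)
    return size, size, channels[-1]

# 64 -> 32 -> 16 -> 8 -> 4 under the assumed padding of 1.
print(encoder_feature_shape())
```

Under this assumption the flattened feature vector fed to the first fully connected layer has 4 × 4 × 64 = 1024 entries.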

For discrete latent variables, we use Gini’s definition of empirical variance:

$$\widehat{Var}(x) = \frac{1}{2N(N-1)} \sum_{i,j=1}^{N} d(x_i, x_j) \tag{4}$$

for $x = [x_1, \ldots, x_N] \in \mathbb{R}^N$, $d(x_i, x_j) = 1$ if $x_i \neq x_j$ and $0$ if $x_i = x_j$. Note that this is equal to empirical variance for continuous variables when $d(x_i, x_j) = (x_i - x_j)^2$.
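Equation (4) can be evaluated directly from the pairwise distance matrix. The sketch below (hypothetical function names, NumPy only) computes it for the 0/1 discrete distance and also illustrates the remark that the squared-difference distance recovers the usual empirical variance, here taken to be the unbiased estimator with the N − 1 denominator.

```python
import numpy as np

def gini_variance(x, d):
    """Eqn. (4): sum of d(x_i, x_j) over all ordered pairs, divided by 2N(N-1)."""
    x = np.asarray(x)
    n = len(x)
    dist = d(x[:, None], x[None, :])     # N x N matrix of pairwise distances
    return float(dist.sum() / (2 * n * (n - 1)))

def discrete_distance(u, v):
    """d = 1 if the two values differ, 0 otherwise."""
    return (u != v).astype(float)

def squared_difference(u, v):
    """d = (u - v)^2, the continuous case mentioned in the text."""
    return (u - v) ** 2.0

codes = np.array([0, 0, 1, 2])           # toy discrete latent values
values = np.array([1.0, 2.0, 4.0])       # toy continuous latent values
print(gini_variance(codes, discrete_distance))
print(gini_variance(values, squared_difference), np.var(values, ddof=1))
```

The diagonal terms contribute zero for both distances, so including i = j in the sum is harmless.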

C. KL Decomposition

The KL term in the VAE objective decomposes as follows (Makhzani & Frey, 2017):

Lemma 1. $E_{p_{data}(x)}[KL(q(z|x)\,||\,p(z))] = I_q(x; z) + KL(q(z)\,||\,p(z))$ where $q(x, z) = p_{data}(x)\,q(z|x)$.

Proof.

$$\begin{aligned}
E_{p_{data}(x)}[KL(q(z|x)\,||\,p(z))]
&= E_{p_{data}(x)} E_{q(z|x)}\left[\log \frac{q(z|x)}{p(z)}\right] \\
&= E_{p_{data}(x)} E_{q(z|x)}\left[\log \frac{q(z|x)}{q(z)} \frac{q(z)}{p(z)}\right] \\
&= E_{p_{data}(x)} E_{q(z|x)}\left[\log \frac{q(z|x)}{q(z)} + \log \frac{q(z)}{p(z)}\right] \\
&= E_{p_{data}(x)}[KL(q(z|x)\,||\,q(z))] + E_{q(x,z)}\left[\log \frac{q(z)}{p(z)}\right] \\
&= I_q(x; z) + E_{q(z)}\left[\log \frac{q(z)}{p(z)}\right] \\
&= I_q(x; z) + KL(q(z)\,||\,p(z))
\end{aligned}$$

Remark. Note that this decomposition is equivalent to that in Hoffman & Johnson (2016), written as follows: $E_{p_{data}(x)}[KL(q(z|x)\,||\,p(z))] = I_r(i; z) + KL(q(z)\,||\,p(z))$ where $r(i, z) = \frac{1}{N} q(z|x^{(i)})$, hence $r(z|i) = q(z|x^{(i)})$, $r(z) = \frac{1}{N}\sum_{i=1}^{N} q(z|x^{(i)}) = q(z)$.

Proof.

$$\begin{aligned}
I_r(i; z) &= E_{r(i)}[KL(r(z|i)\,||\,r(z))] \\
&= \frac{1}{N}\sum_{i=1}^{N} KL(q(z|x^{(i)})\,||\,q(z)) \\
&= E_{p_{data}(x)}[KL(q(z|x)\,||\,q(z))] \\
&= I_q(x; z)
\end{aligned}$$
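Lemma 1 is easy to verify numerically when x and z are both discrete, since every expectation becomes a finite sum. The following NumPy sketch is only an illustration on randomly drawn toy distributions (all names are hypothetical); it assumes every probability is strictly positive so that the logarithms are finite.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def check_kl_decomposition(p_data, q_z_given_x, p_z):
    """Return (left-hand side, right-hand side) of Lemma 1.

    p_data: (N,) distribution over x; q_z_given_x: (N, Z) rows q(z|x);
    p_z: (Z,) prior over z.
    """
    q_z = p_data @ q_z_given_x                          # aggregate posterior q(z)
    lhs = sum(px * kl(qx, p_z) for px, qx in zip(p_data, q_z_given_x))
    mutual_info = sum(px * kl(qx, q_z) for px, qx in zip(p_data, q_z_given_x))
    return lhs, mutual_info + kl(q_z, p_z)

rng = np.random.default_rng(0)
p_data = rng.dirichlet(np.ones(5))                      # 5 possible data points
q_z_given_x = rng.dirichlet(np.ones(4), size=5)         # 4 possible codes
p_z = rng.dirichlet(np.ones(4))
lhs, rhs = check_kl_decomposition(p_data, q_z_given_x, p_z)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point error for any choice of strictly positive inputs, mirroring the algebra in the proof.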

D. Using a Batch Estimate of q(z) for Estimating TC

We have also tried using a batch estimate for the density $q(z)$, thus optimising this estimate of the TC directly instead of having a discriminator and using the density ratio trick. In other words, we tried $q(z) \approx \hat{q}(z) = \frac{1}{|B|}\sum_{i \in B} q(z|x^{(i)})$, and using the estimate:

$$KL\Big(q(z)\,\Big|\Big|\,\prod_j q(z_j)\Big) = E_{q(z)}\left[\log \frac{q(z)}{\prod_j q(z_j)}\right] \approx E_{q(z)}\left[\log \frac{\hat{q}(z)}{\prod_j \hat{q}(z_j)}\right] \tag{5}$$

Note that:

$$E_{q(z)}\left[\log \frac{\hat{q}(z)}{\prod_j \hat{q}(z_j)}\right] \approx \frac{1}{H}\sum_{h=1}^{H}\left[\log \frac{1}{|B|}\sum_{i \in B}\prod_{j=1}^{D} q(z_j^{(h)}|x^{(i)}) - \log \prod_{j=1}^{D} \frac{1}{|B|}\sum_{i \in B} q(z_j^{(h)}|x^{(i)})\right] \tag{6}$$

for $z^{(h)} \overset{iid}{\sim} q(z)$. However, while experimenting on 2D Shapes, we observed that the value of $\log \hat{q}(z^{(h)})$ becomes very small (negative with high absolute value) for latent dimension $d \geq 2$ during training, because $\hat{q}(z)$ is not a good enough approximation to $q(z)$ unless $B$ is very big. As training progresses for the VAE, the variance of the Gaussians $q(z|x^{(i)})$ becomes smaller and smaller, so they do not overlap too much in higher dimensions. Hence we get $z^{(h)} \sim q(z)$ that land on the tails of $\hat{q}(z) = \frac{1}{|B|}\sum_{i \in B} q(z|x^{(i)})$, giving worryingly small values of $\log \hat{q}(z^{(h)})$. On the other hand $\prod_j \hat{q}(z_j^{(h)})$, a mixture of $|B|^d$ Gaussians hence of much higher entropy, gives much more stable values of $\log \prod_j \hat{q}(z_j^{(h)})$. From Figure 14, we can see that even with $B$ as big as 10,000, we get negative values for the estimate of TC, which is a KL divergence and hence should be non-negative, so this method of using a batch estimate for $q(z)$ does not work. A fix is to use samples from $\hat{q}(z)$ instead of $q(z)$, but this seemed to give a similar reconstruction-disentanglement trade-off to β-VAE.
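The batch estimate discussed in this appendix can be written compactly for diagonal-Gaussian posteriors. The sketch below is an illustrative NumPy implementation, not the authors' code: the function names are hypothetical, posteriors are assumed to be parameterised by per-dimension means and log-variances, and a log-mean-exp helper is used for numerical stability because the joint log-density under the batch mixture can be extremely negative.

```python
import numpy as np

def log_gauss(z, mu, logvar):
    """Elementwise log N(z; mu, exp(logvar))."""
    return -0.5 * (np.log(2.0 * np.pi) + logvar + (z - mu) ** 2 / np.exp(logvar))

def logmeanexp(a, axis):
    """Numerically stable log of the mean of exp(a) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def batch_tc_estimate(z, mu, logvar):
    """Batch estimate of TC for samples z (H, D) and posterior params (B, D)."""
    lp = log_gauss(z[:, None, :], mu[None, :, :], logvar[None, :, :])   # (H, B, D)
    log_q_joint = logmeanexp(lp.sum(axis=2), axis=1)        # log of batch mixture density
    log_q_marginals = logmeanexp(lp, axis=1).sum(axis=1)    # sum_j log of marginal mixtures
    return float(np.mean(log_q_joint - log_q_marginals))

rng = np.random.default_rng(0)
B, D, H = 256, 10, 128
mu = rng.normal(size=(B, D))
logvar = np.full((B, D), -4.0)                              # narrow posteriors
# Toy stand-in for samples from q(z): pick a posterior at random and sample from it.
idx = rng.integers(0, B, size=H)
z = mu[idx] + np.exp(0.5 * logvar[idx]) * rng.normal(size=(H, D))
print(batch_tc_estimate(z, mu, logvar))
```

With a single latent dimension the joint and the product of marginals coincide, so the estimate is zero by construction; differences only arise for D ≥ 2, which is consistent with the observation above that problems appear for latent dimension d ≥ 2.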

Figure 14. Histogram of $\log \hat{q}(z^{(h)})$ (top) and $\prod_{j=1}^{d} \hat{q}(z_j^{(h)})$ (bottom) for $z^{(h)} \overset{iid}{\sim} q(z)$ with $|B| = 10000$, $d = 10$. The columns correspond to values of β = 1, 5, 9 for training β-VAE. In the title of each histogram, there is an estimate of TC based on the samples of $z^{(h)}$. Panel titles: β = 1.0, TC estimate = −193.30; β = 5.0, TC estimate = −10.62; β = 9.0, TC estimate = −2.22.

E. Losses and Experiments for other related Methods

The Adversarial Autoencoder (AAE) (Makhzani et al., 2015) uses the following objective

$$\frac{1}{N}\sum_{i=1}^{N}\Big[E_{q(z|x^{(i)})}[\log p(x^{(i)}|z)]\Big] - KL(q(z)\,||\,p(z)), \tag{7}$$

utilising the density ratio trick to estimate the KL term.

Information Dropout (Achille & Soatto, 2018) uses the objective

$$\frac{1}{N}\sum_{i=1}^{N}\Big[E_{q(z|x^{(i)})}[\log p(x^{(i)}|z)] - \beta\, KL(q(z|x^{(i)})\,||\,q(z))\Big]. \tag{8}$$

The following objective is also considered in the paper but is dismissed as intractable:

$$\frac{1}{N}\sum_{i=1}^{N}\Big[E_{q(z|x^{(i)})}[\log p(x^{(i)}|z)] - \beta\, KL(q(z|x^{(i)})\,||\,q(z))\Big] - \gamma\, KL\Big(q(z)\,\Big|\Big|\,\prod_{j=1}^{d} q(z_j)\Big) \tag{9}$$

Note that it is similar to the FactorVAE objective (which has β = 1), but with p(z) in the first KL term replaced with q(z).

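DIP-VAE (Kumar et al., 2018) uses the VAE objective with an additional penalty on how much the covariance of q(z) deviates from the identity matrix, either using the law of total covariance $Cov_{q(z)}[z] = E_{p_{data}(x)} Cov_{q(z|x)}[z] + Cov_{p_{data}(x)}(E_{q(z|x)}[z])$ (DIP-VAE I):

$$\frac{1}{N}\sum_{i=1}^{N}\Big[E_{q(z|x^{(i)})}[\log p(x^{(i)}|z)] - KL(q(z|x^{(i)})\,||\,p(z))\Big] - \lambda_{od}\sum_{i \neq j}\big[Cov_{p_{data}(x)}[\mu(x)]\big]^2_{ij} - \lambda_{d}\sum_{i}\big(\big[Cov_{p_{data}(x)}[\mu(x)]\big]_{ii} - 1\big)^2 \tag{10}$$

where $\mu(x) = \mathrm{mean}(q(z|x))$, or directly (DIP-VAE II):

$$\frac{1}{N}\sum_{i=1}^{N}\Big[E_{q(z|x^{(i)})}[\log p(x^{(i)}|z)] - KL(q(z|x^{(i)})\,||\,p(z))\Big] - \lambda_{od}\sum_{i \neq j}\big[Cov_{q(z)}[z]\big]^2_{ij} - \lambda_{d}\sum_{i}\big(\big[Cov_{q(z)}[z]\big]_{ii} - 1\big)^2 \tag{11}$$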
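The two covariance penalties in Eqns. (10) and (11) can be computed from a batch of posterior means and variances. The sketch below is an illustration only, not code from Kumar et al. (2018): function names are hypothetical, the posteriors are assumed to be diagonal Gaussians, expectations over the data are replaced by batch averages, and the covariance of q(z) is obtained from the law of total covariance quoted above. The functions return the penalty values that are subtracted in the objectives.

```python
import numpy as np

def covariance_penalty(cov, lambda_od, lambda_d):
    """lambda_od * sum_{i != j} cov_ij^2 + lambda_d * sum_i (cov_ii - 1)^2."""
    off_diag = cov - np.diag(np.diag(cov))
    return float(lambda_od * np.sum(off_diag ** 2)
                 + lambda_d * np.sum((np.diag(cov) - 1.0) ** 2))

def dip_vae_penalties(mu, var, lambda_od, lambda_d):
    """Return (DIP-VAE I penalty, DIP-VAE II penalty) for a batch.

    mu, var: (N, D) arrays of posterior means and variances, with D >= 2.
    """
    cov_mu = np.cov(mu, rowvar=False, bias=True)        # Cov over the batch of mu(x)
    cov_z = np.diag(var.mean(axis=0)) + cov_mu          # E[Cov_q(z|x)[z]] + Cov[mu(x)]
    return (covariance_penalty(cov_mu, lambda_od, lambda_d),
            covariance_penalty(cov_z, lambda_od, lambda_d))

# Toy batch: means with identity covariance, constant posterior variance 0.5.
mu = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
var = np.full((4, 2), 0.5)
print(dip_vae_penalties(mu, var, lambda_od=10.0, lambda_d=100.0))
```

In this toy batch the first penalty vanishes because the covariance of the means is already the identity, while the second does not, since the posterior variances add to the diagonal of the covariance of q(z).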

One could argue that during training of FactorVAE, $\prod_j q(z_j)$ will be similar to $p(z)$, assuming the prior is factorial, due to the $KL(q(z|x)\,||\,p(z))$ term in the objective. Hence we also investigate a modified FactorVAE objective that replaces $\prod_j q(z_j)$ with $p(z)$:

$$\frac{1}{N}\sum_{i=1}^{N}\Big[E_{q(z|x^{(i)})}[\log p(x^{(i)}|z)] - KL(q(z|x^{(i)})\,||\,p(z))\Big] - \gamma\, KL(q(z)\,||\,p(z)) \tag{12}$$

However, as shown in Figure 32 of Appendix H, the histograms of samples from the marginals are clearly quite different from the prior for FactorVAE.

Figure 15. Same as Figure 4 but for AAE (left) and a variant of FactorVAE (Eqn. (12)) (right).

Moreover, we show experimental results for AAE (adding a γ coefficient in front of the KL(q(z)||p(z)) term of the objective and tuning it) and the variant of FactorVAE (Eqn. (12)) on the 2D Shapes data. From Figure 15, we see that the disentanglement performance for both is somewhat lower than that for FactorVAE. This difference could be explained as a benefit of directly encouraging q(z) to be factorised (FactorVAE) instead of encouraging it to approach an arbitrarily chosen factorised prior p(z) = N(0, I) (AAE, Eqn. (12)). Information Dropout and DIP-VAE had neither enough experimental details in the paper nor publicly available code for their results to be reproduced and compared against.

F. InfoGAN and InfoWGAN-GP

We give an overview of InfoGAN (Chen et al., 2016) and InfoWGAN-GP, its counterpart using Wasserstein distance and gradient penalty. InfoGAN uses latents $z = (c, \epsilon)$ where $c$ models semantically meaningful codes and $\epsilon$ models incompressible noise. The generative model is defined by a generator $G$ with the process: $c \sim p(c)$, $\epsilon \sim p(\epsilon)$, $z = (c, \epsilon)$, $x = G(z)$, i.e. $p(z) = p(c)p(\epsilon)$. GANs are defined as a minimax game on some objective $V(D, G)$, where $D$ is either a discriminator (e.g. for the original GAN (Goodfellow et al., 2014)) that outputs log probabilities for binary classification, or a critic (e.g. for Wasserstein-GAN (Arjovsky et al., 2017)) that outputs a real-valued scalar. InfoGAN defines an extra encoding distribution $Q(c|x)$ that is used to define an extra penalty:

$$L(G, Q) = E_{p(c)} E_{p(\epsilon)}[\log Q(c|G(c, \epsilon))] \tag{13}$$

that is added to the GAN objective. Hence InfoGAN is the following minimax game on the parameters of neural nets $D, G, Q$:

$$\min_{G,Q}\max_{D} V_I(D, G, Q) = \min_{G,Q}\max_{D} V(D, G) - \lambda L(G, Q) \tag{14}$$

$L$ can be interpreted as a variational lower bound to $I(c; G(c, \epsilon))$, with equality at $Q = \arg\min_Q V_I(D, G, Q)$, i.e. $L$ encourages the codes to be more informative about the image. From the definition of $L$, it can also be seen as the reconstruction error of codes in the latent space. The original InfoGAN defines:

$$V(D, G) = E_{p_{data}(x)}[D(x)] - E_{p(z)}[D(G(z))] \tag{15}$$

same as the original GAN objective where $D$ outputs log probabilities. However, as we’ll show in Appendix G, this has known instability issues in training. So it is natural to try replacing this with the more stable WGAN-GP (Gulrajani et al., 2017) objective:

$$V(D, G) = E_{p_{data}(x)}[D(x)] - E_{p(z)}[D(G(z))] + \eta\big(\|\nabla_x D(x)|_{x=\hat{x}}\|_2 - 1\big)^2 \tag{16}$$

for $\hat{x} = \pi x_r + (1 - \pi) x_f$ with $\pi \sim U[0, 1]$, $x_r \sim p_{data}(x)$, $z_f \sim p(z)$, $x_f = \text{stop\_gradient}(G(z_f))$ and with a new $\hat{x}$ for each iteration of optimisation. Thus we obtain InfoWGAN-GP.

G. Empirical Study of InfoGAN and InfoWGAN-GP

To begin with, we implemented InfoGAN and InfoWGAN-GP on MNIST using the hyperparameters given in Chen et al. (2016) to better understand its behaviour, using 1 categorical code with 10 categories, 2 continuous codes, and 62 noise variables. We use priors p(c_j) = U[−1, 1] for the continuous codes, p(c_j) = 1/J for categorical codes with J categories, and p(ε_j) = N(0, 1) for the noise variables. For 2D Shapes data we use 1 categorical code with 3 categories (J = 3), 4 continuous codes, and 5 noise variables. The number of noise variables did not seem to have a noticeable effect on the experiment results. We use the Adam optimiser (Kingma & Ba, 2015) with β1 = 0.5, β2 = 0.999, and learning rate 10^{-3} for the generator updates and 10^{-4} for the discriminator updates. The detailed Discriminator/Encoder/Generator architectures are given in Tables 4 and 5. The architecture for InfoWGAN-GP is the same as InfoGAN, except that we use no Batch Normalisation (batchnorm) (Ioffe & Szegedy, 2015) for the convolutions in the discriminator, and replace batchnorm with Layer Normalisation (Ba et al., 2016) in the fully connected layer that follows the convolutions, as recommended in Gulrajani et al. (2017). We use gradient penalty coefficient η = 10, again as recommended.
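Sampling from the priors described above is straightforward; the sketch below shows one way to assemble the generator input for the MNIST setting (10 + 2 + 62 = 74 dimensions) and the 2D Shapes setting (3 + 4 + 5 = 12 dimensions). This is an illustrative NumPy sketch with a hypothetical function name. Encoding the categorical code as a one-hot vector and concatenating in the order (categorical, continuous, noise) are assumptions, since the text only gives the priors and the total input dimension.

```python
import numpy as np

def sample_infogan_latents(batch, n_cont, n_cat, n_noise, rng):
    """Draw z = (c, eps): uniform continuous codes, a uniform categorical code, Gaussian noise."""
    cont = rng.uniform(-1.0, 1.0, size=(batch, n_cont))      # p(c_j) = U[-1, 1]
    labels = rng.integers(0, n_cat, size=batch)              # each category with probability 1/J
    cat = np.eye(n_cat)[labels]                              # one-hot encoding (assumed)
    noise = rng.standard_normal((batch, n_noise))            # p(eps_j) = N(0, 1)
    return np.concatenate([cat, cont, noise], axis=1), labels

rng = np.random.default_rng(0)
z_mnist, _ = sample_infogan_latents(64, n_cont=2, n_cat=10, n_noise=62, rng=rng)
z_shapes, _ = sample_infogan_latents(64, n_cont=4, n_cat=3, n_noise=5, rng=rng)
print(z_mnist.shape, z_shapes.shape)
```

The resulting widths match the generator inputs listed for MNIST and 2D Shapes in Tables 4 and 5.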

Table 4. InfoGAN architecture for MNIST data. 2 continuous codes, 1 categorical code with 10 categories, 62 noise variables.

discriminator D / encoder Q | generator G
Input 28×28 greyscale image | Input ∈ R^74
4×4 conv. 64 lReLU. stride 2 | FC. 1024 ReLU. batchnorm
4×4 conv. 128 lReLU. stride 2. batchnorm | FC. 7×7×128 ReLU. batchnorm
FC. 1024 lReLU. batchnorm | 4×4 upconv. 64 ReLU. stride 2. batchnorm
FC. 1. output layer for D | 4×4 upconv. 1 Sigmoid. stride 2
FC. 128 lReLU. batchnorm. FC 2×2 + 1×10 |

Table 5. InfoGAN architecture for 2D Shapes data. 4 continuous codes, 1 categorical code with 3 categories, 5 noise variables.

discriminator D / encoder Q | generator G
Input 64×64 binary image | Input ∈ R^12
4×4 conv. 32 lReLU. stride 2 | FC. 128 ReLU. batchnorm
4×4 conv. 32 lReLU. stride 2. batchnorm | FC. 4×4×64 ReLU. batchnorm
4×4 conv. 64 lReLU. stride 2. batchnorm | 4×4 upconv. 64 lReLU. stride 2. batchnorm
4×4 conv. 64 lReLU. stride 2. batchnorm | 4×4 upconv. 32 lReLU. stride 2. batchnorm
FC. 128 lReLU. batchnorm | 4×4 upconv. 32 lReLU. stride 2. batchnorm
FC. 1. output layer for D | 4×4 upconv. 1 Sigmoid. stride 2
FC. 128 lReLU. batchnorm. FC 4×2 + 1×3 for Q |

Table 6. Bigger InfoGAN architecture for 2D Shapes data. 4 continuous codes, 1 categorical code with 3 categories, 128 noise variables.

discriminator D / encoder Q | generator G
Input 64×64 binary image | Input ∈ R^136
4×4 conv. 64 lReLU. stride 2 | FC. 1024 ReLU. batchnorm
4×4 conv. 128 lReLU. stride 2. batchnorm | FC. 8×8×256 ReLU. batchnorm
4×4 conv. 256 lReLU. stride 2. batchnorm | 4×4 upconv. 256 lReLU. stride 1. batchnorm
4×4 conv. 256 lReLU. stride 1. batchnorm | 4×4 upconv. 256 lReLU. stride 1. batchnorm
4×4 conv. 256 lReLU. stride 1. batchnorm | 4×4 upconv. 128 lReLU. stride 2. batchnorm
FC. 1024 lReLU. batchnorm | 4×4 upconv. 64 lReLU. stride 2. batchnorm
FC. 1. output layer for D | 4×4 upconv. 1 Sigmoid. stride 2
FC. 128 lReLU. batchnorm. FC 4×2 + 1×3 for Q |

Figure 16. Discriminator accuracy of InfoGAN on MNIST throughout training.

We first observe that for all runs, we eventually get a degenerate discriminator that predicts all inputs to be real, as in Figure 16. This is the well-known instability issue of the original GAN. We have tried using a smaller learning rate for the discriminator, and although this delays the degenerate behaviour it does not prevent it. Hence early stopping seems crucial, and all results shown below are from well before the degenerate behaviour occurs.

Chen et al. (2016) claim that the categorical code learns digit class (discrete factor of variation) and that the continuous codes learn azimuth and width, but when plotting latent traversals for each run, we observed that this is inconsistent. We show five randomly chosen runs in Figure 17. The digit class changes in the continuous code traversals and there are overlapping digits in the categorical code traversal. Similar results hold for InfoWGAN-GP in Figure 28.

Figure 17. Latent traversals for InfoGAN on MNIST across the two continuous codes (first two rows) and the categorical code (last row) for 5 different random seeds.

Figure 18. Reconstructions for InfoGAN on MNIST. First column: original image. Remaining columns: reconstructions varying the noise latent ε.

Figure 19. Latent traversals for InfoGAN on MNIST across the three continuous codes (first three rows) and the categorical code (last row).

We also tried visualising the reconstructions: given an image, we push the image through the encoder to obtain latent codes c, fix this c and vary the noise ε to generate multiple reconstructions for the same image. This is to check the extent to which the noise ε can affect the generation. We can see in Figure 18 that digit class often changes when varying ε, so the model struggles to cleanly separate semantically meaningful information and incompressible noise.

Furthermore, we investigated the sensitivity of the model to the number of latent codes. We show latent traversals using three continuous codes instead of two in Figure 19. It is evident that the model tries to put more digit class information into the continuous traversals. So the number of codes is an important hyperparameter to tune, whereas VAE methods are less sensitive to the choice of number of codes since they can prune out unnecessary latents by collapsing $q(z_j|x)$ to the prior $p(z_j)$.
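The pruning behaviour mentioned above can be checked numerically. The following is a minimal sketch, not taken from the paper, which assumes a diagonal Gaussian posterior $q(z_j|x)$ and a standard normal prior $p(z_j)$. The function names, the toy encoder outputs and the threshold of 0.01 are all hypothetical choices made for illustration. A latent whose average KL to the prior is close to zero carries no information about the input and can be treated as pruned.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ) for a single latent dimension."""
    return 0.5 * (math.exp(log_var) + mean * mean - 1.0 - log_var)

def average_kl_per_dimension(means, log_vars):
    """Average the per-dimension KL over a batch.

    means, log_vars: lists of rows, one row per data point,
    one entry per latent dimension.
    """
    n = len(means)
    dims = len(means[0])
    return [
        sum(kl_to_standard_normal(means[i][j], log_vars[i][j]) for i in range(n)) / n
        for j in range(dims)
    ]

def collapsed_dimensions(means, log_vars, threshold=0.01):
    """Indices of latents whose average KL is below the (assumed) threshold."""
    kl = average_kl_per_dimension(means, log_vars)
    return [j for j, value in enumerate(kl) if value < threshold]

# Made-up encoder outputs for 3 data points and 3 latents.
# The middle latent always outputs mean 0 and variance 1, i.e. the prior.
means = [[1.5, 0.0, -0.8], [-1.2, 0.0, 0.9], [0.7, 0.0, 0.1]]
log_vars = [[-3.0, 0.0, -2.0], [-3.0, 0.0, -2.5], [-3.0, 0.0, -2.0]]

print(collapsed_dimensions(means, log_vars))  # -> [1]
```

In practice the means and log-variances would come from a trained encoder evaluated on a held-out batch, and the threshold would need tuning per model.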

Figure 20. Latent traversals for InfoGAN on MNIST across the two continuous codes (first two rows) and the categorical code (last row) using 2 categories.

Figure 21. Latent traversals for InfoGAN on MNIST across the two continuous codes (first two rows) and the categorical code (last two rows) using 20 categories.

Figure 22. Same as Figure 18 but with the categorical code having 2 categories.

We also tried varying the number of categories for the categorical code. Using 2 categories, we see from Figure 20 that the model tries to put much more information about digit class into the continuous latents, as expected. Moreover, from Figure 22 we can see that the noise variables also have more information about the digit class. When we use 20 categories, the model still puts information about the digit class in the continuous latents, although Figure 23 shows that the noise variables contain less semantically meaningful information.

Figure 23. Same as Figure 22 but with 20 categories.

[Plots: x-axis iteration.]

Figure 24. The generator loss $-\mathbb{E}_{p(z)}[D(G(z))]$, discriminator loss $\mathbb{E}_{p_{data}(x)}[D(x)] - \mathbb{E}_{p(z)}[D(G(z))]$ and the InfoGAN regulariser term $-L$ for InfoWGAN-GP on MNIST with λ = 1.

Figure 25. Same as Figure 24 but for 2D Shapes.

[Plots: old disentanglement metric (left) and new disentanglement metric (right) against iteration.]

Figure 26. Disentanglement scores for InfoWGAN-GP on 2D Shapes with bigger architecture (Table 6) for 10 random seeds per hyperparameter setting. Left: Metric in Higgins et al. (2016). Right: Our metric.

Using InfoWGAN-GP solved the degeneracy issue and made training more stable (see Figure 25), but we observed that the other problems persisted (see e.g. Figure 28).

Figure 27. Latent traversals for InfoWGAN-GP on 2D Shapes across the four continuous codes (first four rows) and the categorical code (last row) with bigger architecture (Table 6) for the run with the best disentanglement score (λ = 0.6).

For 2D Shapes, we also tried using a bigger architecture for InfoWGAN-GP, one that is used for a data set of similar dimensions (Chairs data set) in Chen et al. (2016); see Table 6. However, as can be seen in Figure 26, this did not improve disentanglement scores, although the latent traversals look slightly more realistic (Figure 27).

In summary, InfoWGAN-GP can help prevent the instabilities in training faced by InfoGAN, but it does not help overcome the following weaknesses compared to VAE-based methods: 1) Disentangling performance is sensitive to the number of code latents. 2) More often than not, the noise variables contain semantically meaningful information. 3) The model does not always generalise well across the whole domain of p(z).

Figure 28. Same as Figure 17 but for InfoWGAN-GP.


H. Further Experimental Results

From Figure 29, we see that higher values of γ in FactorVAE lead to a lower discriminator accuracy. This is as expected, since a higher γ encourages $q(z)$ and $\prod_j q(z_j)$ to be closer together, which makes it harder for the discriminator to correctly classify samples from the two distributions.
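To illustrate what this accuracy measures, here is a small sketch under stated assumptions. It relies on the permutation trick described in Figure 1 (randomly permuting each dimension across the batch) to obtain approximate samples from $\prod_j q(z_j)$. The function names, the toy codes and the hand-written discriminator are hypothetical and stand in for the trained MLP discriminator. In this toy case the codes are fully entangled, so the two distributions are easy to tell apart and accuracy is close to 1. As $q(z)$ approaches $\prod_j q(z_j)$, no discriminator can do much better than 0.5.

```python
import random

def permute_dims(batch, rng):
    """Shuffle each latent dimension independently across the batch.

    batch: list of code vectors drawn from q(z). Every per-dimension
    marginal is preserved, but dependence between dimensions is broken,
    so the rows approximate samples from prod_j q(z_j).
    """
    n = len(batch)
    dims = len(batch[0])
    columns = []
    for j in range(dims):
        column = [batch[i][j] for i in range(n)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[j][i] for j in range(dims)] for i in range(n)]

def discriminator_accuracy(discriminator, joint_samples, permuted_samples):
    """Fraction of samples classified correctly.

    discriminator(z) returns True when it believes z was drawn from q(z).
    """
    correct = sum(1 for z in joint_samples if discriminator(z))
    correct += sum(1 for z in permuted_samples if not discriminator(z))
    return correct / (len(joint_samples) + len(permuted_samples))

rng = random.Random(0)

# Toy, fully entangled codes: the second dimension copies the first.
codes = [[x, x] for x in (rng.gauss(0.0, 1.0) for _ in range(1000))]
shuffled = permute_dims(codes, rng)

# Hypothetical hand-written discriminator that exploits the dependence.
def looks_joint(z):
    return abs(z[0] - z[1]) < 1e-9

accuracy = discriminator_accuracy(looks_joint, codes, shuffled)
print(f"discriminator accuracy: {accuracy:.3f}")
```

The permuted batch keeps the same values in each column, which is why the comparison isolates dependence between dimensions rather than differences in the marginals.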

[Plot: discriminator accuracy against iteration.]

Figure 29. Plot of discriminator accuracy of FactorVAE on 2D Shapes data across iterations over 5 random seeds.

[Two panels with x-axis iteration; the recovered y-axis label is reconstruction error.]

Figure 30. Same as Figure 10 but for 3D Chairs.

[Two panels with x-axis iteration; the recovered y-axis label is reconstruction error.]

Figure 31. Same as Figure 10 but for CelebA.

We also show histograms of $q(z_j)$ for each j in β-VAE and FactorVAE for different values of β and γ at the end of training on 2D Shapes in Figure 32. We can see that the marginals of FactorVAE are quite different from the prior, which could be a reason that the variant of FactorVAE using the objective given by Eqn. (12) leads to different results from FactorVAE. For FactorVAE, the model is able to focus on factorising $q(z)$ instead of pushing it towards some arbitrarily specified prior $p(z)$.
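One way to make this distinction concrete, offered as a sketch rather than as the paper's own derivation: assuming the prior factorises as $p(z) = \prod_j p(z_j)$ (which holds for a standard normal prior), the KL divergence between the marginal code distribution and the prior splits into a dependence term and a sum of per-dimension terms.

```latex
\begin{aligned}
KL\big(q(z)\,\|\,p(z)\big)
  &= \mathbb{E}_{q(z)}\left[\log\frac{q(z)}{\prod_j q(z_j)}\right]
   + \mathbb{E}_{q(z)}\left[\log\frac{\prod_j q(z_j)}{\prod_j p(z_j)}\right] \\
  &= KL\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)
   + \sum_j KL\big(q(z_j)\,\|\,p(z_j)\big).
\end{aligned}
```

An extra penalty on the first term alone puts no additional pressure on each marginal $q(z_j)$ to match $p(z_j)$, which is consistent with the marginals in Figure 32 differing visibly from the prior.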


[Panels: β-VAE, FactorVAE.]

Figure 32. Histograms of $q(z_j)$ for each j (columns) for β-VAE and FactorVAE at the end of training on 2D Shapes, with the pdf of Gaussian N(0, 1) overlaid in red. The rows correspond to different values of β (1, 4, 16, 64) and γ (10, 20, 40, 80) respectively.