darla: improving zero-shot transfer in reinforcement learning · vironments (jaco arm, deepmind...

DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Irina Higgins * 1 Arka Pal * 1 Andrei Rusu 1 Loic Matthey 1 Christopher Burgess 1 Alexander Pritzel 1

Matthew Botvinick 1 Charles Blundell 1 Alexander Lerchner 1

AbstractDomain adaptation is an important open prob-lem in deep reinforcement learning (RL). Inmany scenarios of interest data is hard to ob-tain, so agents may learn a source policy in asetting where data is readily available, with thehope that it generalises well to the target do-main. We propose a new multi-stage RL agent,DARLA (DisentAngled Representation LearningAgent), which learns to see before learning to act.DARLA’s vision is based on learning a disen-tangled representation of the observed environ-ment. Once DARLA can see, it is able to acquiresource policies that are robust to many domainshifts - even with no access to the target domain.DARLA significantly outperforms conventionalbaselines in zero-shot domain adaptation scenar-ios, an effect that holds across a variety of RL en-vironments (Jaco arm, DeepMind Lab) and baseRL algorithms (DQN, A3C and EC).

1. IntroductionAutonomous agents can learn how to maximise futureexpected rewards by choosing how to act based on in-coming sensory observations via reinforcement learning(RL). Early RL approaches did not scale well to envi-ronments with large state spaces and high-dimensionalraw observations (Sutton & Barto, 1998). A commonlyused workaround was to embed the observations in alower-dimensional space, typically via hand-crafted and/orprivileged-information features. Recently, the advent ofdeep learning and its successful combination with RL hasenabled end-to-end learning of such embeddings directlyfrom raw inputs, sparking success in a wide variety of pre-viously challenging RL domains (Mnih et al., 2015; 2016;Jaderberg et al., 2017). Despite the seemingly universal

*Equal contribution 1DeepMind, 6 Pancras Square, KingsCross, London, N1C 4AG, UK. Correspondence to: Irina Higgins<[email protected]>, Arka Pal <[email protected]>.

Proceedings of the 34 th International Conference on MachineLearning, Sydney, Australia, PMLR 70, 2017. Copyright 2017by the author(s).

efficacy of deep RL, however, fundamental issues remain.These include data inefficiency, the reactive nature and gen-eral brittleness of learnt policies to changes in input datadistribution, and lack of model interpretability (Garneloet al., 2016; Lake et al., 2016). This paper focuses on oneof these outstanding issues: the ability of RL agents to dealwith changes to the input distribution, a form of transferlearning known as domain adaptation (Bengio et al., 2013).In domain adaptation scenarios, an agent trained on a par-ticular input distribution with a specified reward structure(termed the source domain) is placed in a setting where theinput distribution is modified but the reward structure re-mains largely intact (the target domain). We aim to developan agent that can learn a robust policy using observationsand rewards obtained exclusively within the source domain.Here, a policy is considered as robust if it generalises withminimal drop in performance to the target domain withoutextra fine-tuning.

Past attempts to build RL agents with strong domain adap-tation performance highlighted the importance of learn-ing good internal representations of raw observations (Finnet al., 2015; Raffin et al., 2017; Pan & Yang, 2009; Bar-reto et al., 2016; Littman et al., 2001). Typically, these ap-proaches tried to align the source and target domain rep-resentations by utilising observation and reward signalsfrom both domains (Tzeng et al., 2016; Daftry et al., 2016;Parisotto et al., 2015; Guez et al., 2012; Talvitie & Singh,2007; Niekum et al., 2013; Gupta et al., 2017; Finn et al.,2017; Rajendran et al., 2017). In many scenarios, such asrobotics, this reliance on target domain information can beproblematic, as the data may be expensive or difficult toobtain (Finn et al., 2017; Rusu et al., 2016). Furthermore,the target domain may simply not be known in advance.On the other hand, policies learnt exclusively on the sourcedomain using existing deep RL approaches that have fewconstraints on the nature of the learnt representations of-ten overfit to the source input distribution, resulting in poordomain adaptation performance (Lake et al., 2016; Rusuet al., 2016).

We propose tackling both of these issues by focusing in-stead on learning representations which capture an underly-ing low-dimensional factorised representation of the worldand are therefore not task or domain specific. Many nat-

arX

iv:1

707.

0847

5v2

[st

at.M

L]

6 J

un 2

018


Figure 1. Schematic representation of DARLA. Yellow representsthe denoising autoencoder part of the model, blue represents theβ-VAE part of the model, and grey represents the policy learningpart of the model.

uralistic domains such as video game environments, sim-ulations and our own world are well described in terms ofsuch a structure. Examples of such factors of variation areobject properties like colour, scale, or position; other exam-ples correspond to general environmental factors, such asgeometry and lighting. We think of these factors as a set ofhigh-level parameters that can be used by a world graphicsengine to generate a particular natural visual scene (Kulka-rni et al., 2015). Learning how to project raw observationsinto such a factorised description of the world is addressedby the large body of literature on disentangled representa-tion learning (Schmidhuber, 1992; Desjardins et al., 2012;Cohen & Welling, 2014; 2015; Kulkarni et al., 2015; Hin-ton et al., 2011; Rippel & Adams, 2013; Reed et al., 2014;Yang et al., 2015; Goroshin et al., 2015; Kulkarni et al.,2015; Cheung et al., 2015; Whitney et al., 2016; Karalet-sos et al., 2016; Chen et al., 2016; Higgins et al., 2017).Disentangled representations are defined as interpretable,factorised latent representations where either a single latentor a group of latent units are sensitive to changes in singleground truth factors of variation used to generate the vi-sual world, while being invariant to changes in other factors(Bengio et al., 2013). The theoretical utility of disentangledrepresentations for supervised and reinforcement learninghas been described before (Bengio et al., 2013; Higginset al., 2017; Ridgeway, 2016); however, to our knowledge,it has not been empirically validated to date.

We demonstrate how disentangled representations can im-prove the robustness of RL algorithms in domain adapta-tion scenarios by introducing DARLA (DisentAngled Rep-resentation Learning Agent), a new RL agent capableof learning a robust policy on the source domain thatachieves significantly better out-of-the-box performance indomain adaptation scenarios compared to various base-lines. DARLA relies on learning a latent state representa-tion that is shared between the source and target domains,by learning a disentangled representation of the environ-ment’s generative factors. Crucially, DARLA does not re-quire target domain data to form its representations. Ourapproach utilises a three stage pipeline: 1) learning tosee, 2) learning to act, 3) transfer. During the first stage,

DARLA develops its vision, learning to parse the world interms of basic visual concepts, such as objects, positions,colours, etc. by utilising a stream of raw unlabelled obser-vations – not unlike human babies in their first few monthsof life (Leat et al., 2009; Candy et al., 2009). In the secondstage, the agent utilises this disentangled visual represen-tation to learn a robust source policy. In stage three, wedemonstrate that the DARLA source policy is more robustto domain shifts, leading to a significantly smaller drop inperformance in the target domain even when no further pol-icy finetuning is allowed (median 270.3% improvement).These effects hold consistently across a number of differ-ent RL environments (DeepMind Lab and Jaco/MuJoCo:Beattie et al., 2016; Todorov et al., 2012) and algorithms(DQN, A3C and Episodic Control: Mnih et al., 2015; 2016;Blundell et al., 2016).

2. Framework2.1. Domain adaptation in Reinforcement Learning

We now formalise domain adaptation scenarios in a rein-forcement learning (RL) setting. We denote the sourceand target domains as DS and DT , respectively. Eachdomain corresponds to an MDP defined as a tuple DS ≡(SS ,AS , TS , RS) or DT ≡ (ST ,AT , TT , RT ) (we assumea shared fixed discount factor γ), each with its own statespace S, action space A, transition function T and rewardfunction R.1 In domain adaptation scenarios the states Sof the source and the target domains can be quite different,while the action spaces A are shared and the transitions Tand reward functions R have structural similarity. For ex-ample, consider a domain adaptation scenario for the Jacorobotic arm, where the MuJoCo (Todorov et al., 2012) sim-ulation of the arm is the source domain, and the real worldsetting is the target domain. The state spaces (raw pixels)of the source and the target domains differ significantly dueto the perceptual-reality gap (Rusu et al., 2016); that is tosay, SS 6= ST . Both domains, however, share action spaces(AS = AT ), since the policy learns to control the same setof actuators within the arm. Finally, the source and tar-get domain transition and reward functions share structuralsimilarity (TS ≈ TT and RS ≈ RT ), since in both domainstransitions between states are governed by the physics ofthe world and the performance on the task depends on therelative position of the arm’s end effectors (i.e. fingertips)with respect to an object of interest.

2.2. DARLA

In order to describe our proposed DARLA framework, weassume that there exists a set M of MDPs that is the set

1For further background on the notation relating to the RLparadigm, see Section A.1 in the Supplementary Materials.


of all natural world MDPs, and each MDP Di is sampledfromM. We defineM in terms of the state space S thatcontains all possible conjunctions of high-level factors ofvariation necessary to generate any naturalistic observationin any Di ∈ M. A natural world MDP Di is then onewhose state space S corresponds to some subset of S. Insimple terms, we assume that there exists some shared un-derlying structure between the MDPsDi sampled fromM.We contend that this is a reasonable assumption that per-mits inclusion of many interesting problems, including be-ing able to characterise our own reality (Lake et al., 2016).

We now introduce notation for two state space variablesthat may in principle be used interchangeably within thesource and target domain MDPs DS and DT – the agentobservation state space So, and the agent’s internal latentstate space Sz .2 Soi in Di consists of raw (pixel) observa-tions soi generated by the true world simulator from a sam-pled set of data generative factors si, i.e. soi ∼ Sim(si).si is sampled by some distribution or process Gi on S,si ∼ Gi(S).

Using the newly introduced notation, domain adaptationscenarios can be described as having different samplingprocesses GS and GT such that sS ∼ GS(S) and sT ∼GT (S) for the source and target domains respectively, andthen using these to generate different agent observationstates soS ∼ Sim(sS) and soT ∼ Sim(sT). Intuitively, con-sider a source domain where oranges appear in blue roomsand apples appear in red rooms, and a target domain wherethe object/room conjunctions are reversed and oranges ap-pear in red rooms and apples appear in blue rooms. Whilethe true data generative factors of variation S remain thesame - room colour (blue or red) and object type (applesand oranges) - the particular source and target distributionsGS and GT differ.

Typically deep RL agents (e.g. Mnih et al., 2015; 2016)operating in an MDP Di ∈ M learn an end-to-end map-ping from raw (pixel) observations soi ∈ Soi to actionsai ∈ Ai (either directly or via a value function Qi(soi , ai)from which actions can be derived). In the process of do-ing so, the agent implicitly learns a function F : Soi → Szithat maps the typically high-dimensional raw observationssoi to typically low-dimensional latent states szi ; followedby a policy function πi : Szi → Ai that maps the latentstates szi to actions ai ∈ Ai. In the context of domainadaptation, if the agent learns a naive latent state map-ping function FS : SoS → SzS on the source domain us-ing reward signals to shape the representation learning, itis likely that FS will overfit to the source domain and willnot generalise well to the target domain. Returning to our

2Note that we do not assume these to be Markovian i.e. it is notnecessarily the case that p(sot+1|sot ) = p(sot+1|sot , sot−1, . . . , s

o1),

and similarly for sz . Note the index t here corresponds to time.

intuitive example, imagine an agent that has learnt a pol-icy to pick up oranges and avoid apples on the source do-main. Such a source policy πS is likely to be based onan entangled latent state space SzS of object/room conjunc-tions: oranges/blue → good, apples/red → bad, since thisis arguably the most efficient representation for maximis-ing expected rewards on the source task in the absence ofextra supervision signals suggesting otherwise. A sourcepolicy πS(a|szS ; θ) based on such an entangled latent rep-resentation szS will not generalise well to the target domainwithout further fine-tuning, since FS(soS) 6= FS(soT ) andtherefore crucially SzS 6= SzT .

On the other hand, since both sS ∼ GS(S) and sT ∼GT (S) are sampled from the same natural world statespace S for the source and target domains respectively, itshould be possible to learn a latent state mapping functionF : So → SzS , which projects the agent observation statespace So to a latent state space SzS expressed in terms offactorised data generative factors that are representative ofthe natural world i.e. Sz

S≈ S. Consider again our intuitive

example, where F maps agent observations (soS : orangein a blue room) to a factorised or disentangled representa-tion expressed in terms of the data generative factors (szS :room type = blue; object type = orange). Such a disen-tangled latent state mapping function should then directlygeneralise to both the source and the target domains, so thatF(soS) = F(soT ) = szS . Since SzS is a disentangled repre-sentation of object and room attributes, the source policyπS can learn a decision boundary that ignores the irrele-vant room attributes: oranges→ good, apples→ bad. Sucha policy would then generalise well to the target domainout of the box, since πS(a|F(soS); θ) = πT (a|F(soT ); θ) =πT (a|szS ; θ). Hence, DARLA is based on the idea that a

good quality F learnt exclusively on the source domainDS ∈ M will zero-shot-generalise to all target domainsDi ∈ M, and therefore the source policy π(a|SzS ; θ) willalso generalise to all target domains Di ∈ M out of thebox.

Next we describe each of the stages of the DARLA pipelinethat allow it to learn source policies πS that are robust todomain adaptation scenarios, despite being trained with noknowledge of the target domains (see Fig. 1 for a graphicalrepresentation of these steps):

1) Learn to see (unsupervised learning of FU ) – the taskof inferring a factorised set of generative factors SzS = Sfrom observations So is the goal of the extensive disentan-gled factor learning literature (e.g. Chen et al., 2016; Hig-gins et al., 2017). Hence, in stage one we learn a mappingFU : SoU → SzU , where SzU ≈ SzS (U stands for ‘unsu-pervised’) using an unsupervised model for learning dis-entangled factors that utilises observations collected by anagent with a random policy πU from a visual pre-training


MDP DU ∈ M. Note that we require sufficient variabil-ity of factors and their conjunctions in DU in order to haveSzU ≈ SzS ;

2) Learn to act (reinforcement learning of πS in the sourcedomain DS utilising previously learned FU ) – an agentthat has learnt to see the world in stage one in terms of thenatural data generative factors is now exposed to a sourcedomain DS ∈ M. The agent is tasked with learning thesource policy πS(a|szS ; θ), where szS = FU (soS) ≈ szS , viaa standard reinforcement learning algorithm. Crucially, wedo not allow FU to be modified (e.g. by gradient updates)during this phase;

3) Transfer (to a target domain DT ) – in the final step, wetest how well the policy πS learnt on the source domaingeneralises to the target domain DT ∈ M in a zero-shotdomain adaptation setting, i.e. the agent is evaluated on thetarget domain without retraining. We compare the perfor-mance of policies learnt with a disentangled latent state SzSto various baselines where the latent state mapping func-tion FU projects agent observations so to entangled latentstate representations sz .

2.3. Learning disentangled representations

In order to learn FU , DARLA utilises β-VAE (Higginset al., 2017), a state-of-the-art unsupervised model for au-tomated discovery of factorised latent representations fromraw image data. β-VAE is a modification of the varia-tional autoencoder framework (Kingma & Welling, 2014;Rezende et al., 2014) that controls the nature of the learntlatent representations by introducing an adjustable hyper-parameter β to balance reconstruction accuracy with latentchannel capacity and independence constraints. It max-imises the objective:

L(θ, φ;x, z, β) =Eqφ(z|x)[log pθ(x|z)]− β DKL(qφ(z|x)||p(z)) (1)

where φ, θ parametrise the distributions of the encoder andthe decoder respectively. Well-chosen values of β - usuallylarger than one (β > 1) - typically result in more disentan-gled latent representations z by limiting the capacity of thelatent information channel, and hence encouraging a moreefficient factorised encoding through the increased pressureto match the isotropic unit Gaussian prior p(z) (Higginset al., 2017).

2.3.1. PERCEPTUAL SIMILARITY LOSS

The cost of increasing β is that crucial information aboutthe scene may be discarded in the latent representation z,particularly if that information takes up a small proportionof the observations x in pixel space. We encountered thisissue in some of our tasks, as discussed in Section 3.1.The shortcomings of calculating the log-likelihood term

Eqφ(z|x)[log pθ(x|z)] on a per-pixel basis are known andhave been addressed in the past by calculating the recon-struction cost in an abstract, high-level feature space givenby another neural network model, such as a GAN (Good-fellow et al., 2014) or a pre-trained AlexNet (Krizhevskyet al., 2012; Larsen et al., 2016; Dosovitskiy & Brox,2016; Warde-Farley & Bengio, 2017). In practice we foundthat pre-training a denoising autoencoder (Vincent et al.,2010) on data from the visual pre-training MDP DU ∈ Mworked best as the reconstruction targets for β-VAE tomatch (see Fig. 1 for model architecture and Sec. A.3.1 inSupplementary Materials for implementation details). Thenew β-VAEDAE model was trained according to Eq. 2:

L(θ, φ;x, z, β) =− Eqφ(z|x) ‖J(x)− J(x)‖22

− β DKL(qφ(z|x)||p(z)) (2)

where x ∼ pθ(x|z) and J : RW×H×C → RN is the func-tion that maps images from pixel space with dimensionalityW ×H × C to a high-level feature space with dimension-ality N given by a stack of pre-trained DAE layers up to acertain layer depth. Note that by replacing the pixel basedreconstruction loss in Eq. 1 with high-level feature recon-struction loss in Eq. 2 we are no longer optimising the vari-ational lower bound, and β-VAEDAE with β = 1 loses itsequivalence to the Variational Autoencoder (VAE) frame-work as proposed by (Kingma & Welling, 2014; Rezendeet al., 2014). In this setting, the only way to interpret β is asa mixing coefficient that balances the capacity of the latentchannel z of β-VAEDAE against the pressure to match thehigh-level features within the pre-trained DAE.

2.4. Reinforcement Learning Algorithms

We used various RL algorithms (DQN, A3C and EpisodicControl: Mnih et al., 2015; 2016; Blundell et al., 2016) tolearn the source policy πS during stage two of the pipelineusing the latent states sz acquired by β-VAE based modelsduring stage one of the DARLA pipeline.

Deep Q Network (DQN) (Mnih et al., 2015) is a variant ofthe Q-learning algorithm (Watkins, 1989) that utilises deeplearning. It uses a neural network to parametrise an ap-proximation for the action-value function Q(s, a; θ) usingparameters θ.

Asynchronous Advantage Actor-Critic (A3C) (Mnih et al.,2016) is an asynchronous implementation of the advantageactor-critic paradigm (Sutton & Barto, 1998; Degris & Sut-ton, 2012), where separate threads run in parallel and per-form updates to shared parameters. The different threadseach hold their own instance of the environment and havedifferent exploration policies, thereby decorrelating param-eter updates without the need for experience replay. There-fore, A3C is an online algorithm, whereas DQN learns itspolicy offline, resulting in different learning dynamics be-


tween the two algorithms.

Model-Free Episodic Control (EC) (Blundell et al., 2016)was proposed as a complementary learning system to theother RL algorithms described above. The EC algorithmrelies on near-determinism of state transitions and rewardsin RL environments; in settings where this holds, it can ex-ploit these properties to memorise which action led to highreturns in similar situations in the past. Since in its simplestform EC relies on a lookup table, it learns good policiesmuch faster than value-function-approximation based deepRL algorithms like DQN trained via gradient descent - atthe cost of generality (i.e. potentially poor performance innon-deterministic environments).

We also compared our approach to that of UNREAL (Jader-berg et al., 2017), a recently proposed RL algorithm whichalso attempts to utilise unsupervised data in the environ-ment. The UNREAL agent takes as a base an LSTM A3Cagent (Mnih et al., 2016) and augments it with a number ofunsupervised auxiliary tasks that make use of the rich per-ceptual data available to the agent besides the (sometimesvery sparse) extrinsic reward signals. This auxiliary learn-ing tends to improve the representation learnt by the agent.See Sec. A.6 in Supplementary Materials for further detailsof the algorithms above.

3. TasksWe evaluate the performance of DARLA on different taskand environment setups that probe subtly different aspectsof domain adaptation. As a reminder, in Sec. 2.2 we definedS as a state space that contains all possible conjunctionsof high-level factors of variation necessary to generate anynaturalistic observation in any Di ∈ M. During domainadaptation scenarios agent observation states are generatedaccording to soS ∼ SimS(sS) and soT ∼ SimT(sT) for thesource and target domains respectively, where sS and sTare sampled by some distributions or processes GS and GTaccording to sS ∼ GS(S) and sT ∼ GT (S).

We use DeepMind Lab (Beattie et al., 2016) to test a ver-sion of domain adaptation setup where the source and targetdomain observation simulators are equal (SimS = SimT),but the processes used to sample sS and sT are differ-ent (GS 6= GT ). We use the Jaco arm with a matchingMuJoCo simulation environment (Todorov et al., 2012) intwo domain adaptation scenarios: simulation to simula-tion (sim2sim) and simulation to reality (sim2real). Thesim2sim domain adaptation setup is relatively similar toDeepMind Lab i.e. the source and target domains differin terms of processes GS and GT . However, there is a sig-nificant point of difference. In DeepMind Lab, all values offactors in the target domain, sT , are previously seen in thesource domain; however, sS 6= sT as the conjunctions of

Figure 2. A: DeepMind Lab (Beattie et al., 2016) transfer tasksetup. Different conjunctions of {room, object1, object2} wereused during different parts of the domain adaptation curriculum.During stage one, DU (shown in yellow), we used a minimal setspanning all objects and all rooms whereby each object is seenin each room. Note there is no extrinsic reward signal or notionof ‘task’ in this phase. During stage two, DS (shown in green),the RL agents were taught to pick up cans and balloons and avoidhats and cakes. The objects were always presented in pairs hat/canand cake/balloon. The agent never saw the hat/can pair in the pinkroom. This novel room/object conjunction was presented as thetarget domain adaptation condition DT (shown in red) where theability of the agent to transfer knowledge of the objects’ value toa novel environment was tested. B: β-VAE reconstructions (bot-tom row) using frames from DeepMind Lab (top row). Due tothe increased β > 1 necessary to disentangle the data genera-tive factors of variations the model lost information about objects.See Fig. 3 for a model appropriately capturing objects. C: left –sample frames from MuJoCo simulation environments used forvision (phase 1, SU ) and source policy training (phase 2, SS);middle – sim2sim domain adaptation test (phase 3, ST ); and right– sim2real domain adaptation test (phase 3, ST ).

these factor values are different. In sim2sim, by contrast,novel factor values are experienced in the target domain(this accordingly also leads to novel factor conjunctions).Hence, DeepMind Lab may be considered to be assessingdomain interpolation performance, whereas sim2sim testsdomain extrapolation.

The sim2real setup, on the other hand, is based on identicalprocesses GS = GT , but different observation simulatorsSimS 6= SimT corresponding to the MuJoCo simulationand the real world, which results in the so-called ‘percep-tual reality gap’ (Rusu et al., 2016). More details of thetasks are given below.

3.1. DeepMind Lab

DeepMind Lab is a first person 3D game environment withrich visuals and realistic physics. We used a standard seek-avoid object gathering setup, where a room is initialisedwith an equal number of randomly placed objects of twodifferent types. One of the object varieties is ‘good’ (its col-lection is rewarded +1), while the other is ‘bad’ (its collec-tion is punished -1). The full state space S consisted of allconjunctions of two room types (pink and green based onthe colour of the walls) and four object types (hat, can, cakeand balloon) (see Fig. 2A). The source domain DS con-


tained environments with hats/cans presented in the greenroom, and balloons/cakes presented in either the green orthe pink room. The target domain DT contained hats/canspresented in the pink room. In both domains cans and bal-loons were the rewarded objects.

1) Learn to see: we used β-VAEDAE to learn the disen-tangled latent state representation sz that includes both theroom and the object generative factors of variation withinDeepMind Lab. We had to use the high-level feature spaceof a pre-trained DAE within the β-VAEDAE framework(see Section 2.3.1), instead of the pixel space of vanilla β-VAE , because we found that objects failed to reconstructwhen using the values of β necessary to disentangle thegenerative factors of variation within DeepMind Lab (seeFig. 2B).

β-VAEDAE was trained on observations soU collected byan RL agent with a simple wall-avoiding policy πU (oth-erwise the training data was dominated by close up im-ages of walls). In order to enable the model to learnF(soU ) ≈ S, it is important to expose the agent to at leasta minimal set of environments that span the range of val-ues for each factor, and where no extraneous correlationsare added between different factors3(see Fig. 2A, yellow).See Section A.3.1 in Supplementary Materials for detailsof β-VAEDAE training.

2) Learn to act: the agent was trained with the algo-rithms detailed in Section 2.4 on a seek-avoid task us-ing the source domain (DS) conjunctions of object/roomshown in Fig. 2A (green). Pre-trained β-VAEDAE fromstage one was used as the ‘vision’ part of various RL al-gorithms (DQN, A3C and Episodic Control: Mnih et al.,2015; 2016; Blundell et al., 2016) to learn a source policyπS that picks up balloons and avoids cakes in both the greenand the pink rooms, and picks up cans and avoids hats inthe green rooms. See Section A.3.1 in Supplementary Ma-terials for more details of the various versions of DARLAwe have tried, each based on a different base RL algorithm.

3) Transfer: we tested the ability of DARLA to transfer theseek-avoid policy πS it had learnt on the source domain instage two using the domain adaptation condition DT illus-trated in Figure 2A (red). The agent had to continue pickingup cans and avoid hats in the pink room, even though theseobjects had only been seen in the green room during sourcepolicy training. The optimal policy πT is one that maintainsthe reward polarity from the source domain (cans are goodand hats are bad). For further details, see Appendix A.2.1.

3In our setup of DeepMind Lab domain adaptation task, theobject and environment factors are supposed to be independent. Inorder to ensure that β-VAEDAE learns a factorised representationthat reflects this ground truth independence, we present observa-tions of all possible conjunctions of room and individual objecttypes.

3.2. Jaco Arm and MuJoCo

We used frames from an RGB camera facing a roboticJaco arm, or a matching rendered camera view from aMuJoCo physics simulation environment (Todorov et al.,2012) to investigate the performance of DARLA in twodomain adaptation scenarios: 1) simulation to simula-tion (sim2sim), and 2) simulation to reality (sim2real).The sim2real setup is of particular importance, since theprogress that deep RL has brought to control tasks in sim-ulation (Schulman et al., 2015; Mnih et al., 2016; Levine& Abbeel, 2014; Heess et al., 2015; Lillicrap et al., 2015;Schulman et al., 2016) has not yet translated as well to re-ality, despite various attempts (Tobin et al., 2017; Tzenget al., 2016; Daftry et al., 2016; Finn et al., 2015; Rusuet al., 2016). Solving control problems in reality is hard dueto sparse reward signals, expensive data acquisition and theattendant danger of breaking the robot (or its human min-ders) during exploration.

In both sim2sim and sim2real, we trained the agent to per-form an object reaching policy where the goal is to placethe end effector as close to the object as possible. Whileconceptually the reaching task is simple, it is a hard controlproblem since it requires correct inference of the arm andobject positions and velocities from raw visual inputs.

1) Learn to see: β-VAE was trained on observations col-lected in MuJoCo simulations with the same factors ofvariation as in DS . In order to enable the model to learnF(soU ) ≈ s, a reaching policy was applied to phantom ob-jects placed in random positions - therefore ensuring thatthe agent learnt the independent nature of the arm positionand object position (see Fig. 2C, left);

2) Learn to act: a feedforward-A3C based agent with thevision module pre-trained in stage one was taught a sourcereaching policy πS towards the real object in simulation(see Fig. 2C (left) for an example frame, and Sec. A.4in Supplementary Materials for a fuller description of theagent). In the source domain DS the agent was trained ona distribution of camera angles and positions. The colourof the tabletop on which the arm rests and the object colourwere both sampled anew every episode.

3) Transfer: sim2sim: in the target domain, DT , the agentwas faced with a new distribution of camera angles and po-sitions with little overlap with the source domain distribu-tions, as well as a completely held out set of object colours(see Fig. 2C, middle). sim2real: in the target domain DT

the camera position and angle as well as the tabletop colourand object colour were sampled from the same distribu-tions as seen in the source domain DS , but the target do-main DT was now the real world. Many details presentin the real world such as shadows, specularity, multiplelight sources and so on are not modelled in the simulation;


3

-3

0

z1 z2 z3 z4 z5 z6

ObjectEnvironmentroom id turn left distance rotationleft objectturn right

-3

0

z1 z2 z3 z4

Disentangled Entangledig t obje object id

3z7

Figure 3. Plot of traversals of various latents of an entangled anda disentangled version of β-VAEDAE using frames from Deep-Mind Lab (Beattie et al., 2016).

3

-3

0

z1 z2 z3 z4 z5 z6

Object Armclose/far left/right close/far right/leftup/downwrist turn

3

-3

0

z1 z2 z3 z4

Disentangled Entangled

Figure 4. Plot of traversals of β-VAE on MuJoCo. Using a disen-tangled β-VAE model, single latents directly control for the fac-tors responsible for the object or arm placements.

the physics engine is also not a perfect model of reality.Thus sim2real tests the ability of the agent to cross theperceptual-reality gap and generalise its source policy πSto the real world (see Fig. 2C, right). For further details,see Appendix A.2.2.

4. ResultsWe evaluated the robustness of DARLA’s policy πS learnton the source domain to various shifts in the input data dis-tribution. In particular, we used domain adaptation sce-narios based on the DeepMind Lab seek-avoid task andthe Jaco arm reaching task described in Sec. 3. On eachtask we compared DARLA’s performance to that of var-ious baselines. We evaluated the importance of learning‘good’ vision during stage one of the pipeline, i.e one thatmaps the input observations so to disentangled represen-tations sz ≈ s. In order to do this, we ran the DARLApipeline with different vision models: the encoders of a

disentangled β-VAE 4 (the original DARLA), an entan-gled β-VAE (DARLAENT), and a denoising autoencoder(DARLADAE). Apart from the nature of the learnt rep-resentations sz , DARLA and all versions of its baselineswere equivalent throughout the three stages of our pro-posed pipeline in terms of architecture and the observeddata distribution (see Sec. A.3 in Supplementary Materialsfor more details).

Figs. 3-4 display the degree of disentanglement learnt bythe vision modules of DARLA and DARLAENT on Deep-Mind Lab and MuJoCo. DARLA’s vision learnt to inde-pendently represent environment variables (such as roomcolour-scheme and geometry) and object-related variables(change of object type, size, rotation) on DeepMind Lab(Fig. 3, left). Disentangling was also evident in MuJoCo.Fig. 4, left, shows that DARLA’s single latent units zi learntto represent different aspects of the Jaco arm, the object,and the camera. By contrast, in the representations learntby DARLAENT, each latent is responsible for changes toboth the environment and objects (Fig. 3, right) in Deep-Mind Lab, or a mixture of camera, object and/or arm move-ments (Fig. 4, right) in MuJoCo.

The table in Fig. 5 shows the average performance (acrossdifferent seeds) in terms of rewards per episode of the var-ious agents on the target domain with no fine-tuning of thesource policy πS . It can be seen that DARLA is able tozero-shot-generalise significantly better than DARLAENTor DARLADAE, highlighting the importance of learning adisentangled representation sz = szS during the unsuper-vised stage one of the DARLA pipeline. In particular, thisalso demonstrates that the improved domain transfer per-formance is not simply a function of increased exposure totraining observations, as both DARLAENT and DARLADAEwere exposed to the same data. The results are mostly con-sistent across target domains and in most cases DARLA issignificantly better than the second-best-performing agent.This holds in the sim2real task 5, where being able to per-form zero-shot policy transfer is highly valuable due to theparticular difficulties of gathering data in the real world.

DARLA’s performance is particularly surprising as it actu-ally preserves less information about the raw observationsso than DARLAENT and DARLADAE. This is due to thenature of the β-VAE and how it achieves disentangling; thedisentangled model utilised a significantly higher value ofthe hyperparameter β than the entangled model (see Ap-pendix A.3 for further details), which constrains the ca-

4In this section of the paper, we use the term β-VAE to re-fer to a standard β-VAE for the MuJoCo experiments, and aβ-VAEDAE for the DeepMind Lab experiments (as described instage 1 of Sec. 3.1).

5See https://youtu.be/sZqrWFl0wQ4 for example sim2simand sim2real zero-shot transfer policies of DARLA and baselineA3C agent.


110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164

165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219


Table 1. Transfer performance

DEEPMIND LAB JACO (A3C)VISION TYPE DQN A3C EC SIM2SIM SIM2REAL

BASELINE AGENT 1.86 ± 3.91 5.32 ± 3.36 -0.41 ± 4.21 97.64 ± 9.02 94.56 ± 3.55UNREAL - 4.13 ± 3.95 - - -DARLAFT 13.36 ± 5.8 1.4 ± 2.16 - 86.59 ± 5.53 99.25 ± 2.3DARLAENT 3.45 ± 4.47 15.66 ± 5.19 5.69 ± 3.73 84.77 ± 4.42 59.99 ± 15.05DARLADAE 7.83 ± 4.47 6.74 ± 2.81 5.59 ± 3.37 85.15 ± 7.43 100.72 ± 4.7

DARLA 10.25 ± 5.46 19.7 ± 5.43 11.41 ± 3.52 100.85 ± 2.92 108.2 ± 5.97DARLA’S PERFORMANCE IS SIGNIFICANTLY DIFFERENT FROM ALL BASELINES UNDER WELCH’S UNEQUAL VARIANCES T-TEST WITH p < 0.01 (N ∈ [60, 150]).

Figure 5. Table: Zero-shot performance (avg. reward per episode) of the source policy πS in target domains within DeepMind Lab andJaco/MuJoCo environments. Baseline agent refers to vanilla DQN/A3C/EC (DeepMind Lab) or A3C (Jaco) agents. See main text formore detailed model descriptions. Figure: Correlation between zero-shot performance transfer performance on the DeepMind Lab taskobtained by EC based DARLA and the level of disentanglement as measured by the transfer/disentanglement score (r = 0.6, p < 0.001)

pacity of the latent channel. Indeed, DARLA’s β-VAEonly utilises 8 of its possible 32 Gaussian latents to storeobservation-specific information for MuJoCo/Jaco (and 20in DeepMind Lab), whereas DARLAENT utilises all 32 forboth environments (as does DARLADAE).

Furthermore, we examined what happens if DARLA’s vi-sion (i.e. the encoder of the disentangled β-VAE ) is al-lowed to be fine-tuned via gradient updates while learningthe source policy during stage two of the pipeline. Thisis denoted by DARLAFT in the table in Fig. 5. We seethat it exhibits significantly worse performance than thatof DARLA in zero-shot domain adaptation using an A3C-based agent in all tasks. This suggests that a favourableinitialisation does not make up for subsequent overfittingto the source domain for the on-policy A3C. However, theoff-policy DQN-based fine-tuned agent performs very well.We leave further investigation of this curious effect for fu-ture work.

Finally, we compared the performance of DARLA to anUNREAL (Jaderberg et al., 2017) agent with the same ar-chitecture. Despite also exploiting the unsupervised dataavailable in the source domain, UNREAL performed worsethan baseline A3C on the DeepMind Lab domain adap-tation task. This further demonstrates that use of unsu-pervised data in itself is not a panacea for transfer per-formance; it must be utilised in a careful and structuredmanner conducive to learning disentangled latent statessz = szS .

In order to quantitatively evaluate our hypothesis that dis-entangled representations are essential for DARLA’s per-formance in domain adaptation scenarios, we trained vari-ous DARLAs with different degrees of learnt disentangle-ment in sz by varying β (of β-VAE) during stage one ofthe pipeline. We then calculated the correlation betweenthe performance of the EC-based DARLA on the Deep-Mind Lab domain adaptation task and the transfer metric,which approximately measures the quality of disentangle-ment of DARLA’s latent representations sz (see Sec. A.5.2

in Supplementary Materials). This is shown in the chart inFig. 5; as can be seen, there is a strong positive correlationbetween the level of disentanglement and DARLA’s zero-shot domain transfer performance (r = 0.6, p < 0.001).

Having shown the robust utility of disentangled represen-tations in agents for domain adaptation, we note that thereis evidence that they can provide an important additionalbenefit. We found significantly improved speed of learningof πS on the source domain itself, as a function of how dis-entangled the model was. The gain in data efficiency fromdisentangled representations for source policy learning isnot the main focus of this paper so we leave it out of themain text; however, we provide results and discussion inSection A.7 in Supplementary Materials.

5. ConclusionWe have demonstrated the benefits of using disentangledrepresentations in a deep RL setting for domain adaptation.In particular, we have proposed DARLA, a multi-stage RLagent. DARLA first learns a visual system that encodes theobservations it receives from the environment as disentan-gled representations, in a completely unsupervised manner.It then uses these representations to learn a robust sourcepolicy that is capable of zero-shot domain adaptation.

We have demonstrated the efficacy of this approach in arange of domains and task setups: a 3D naturalistic first-person environment (DeepMind Lab), a simulated graphicsand physics engine (MuJoCo), and crossing the simulationto reality gap (MuJoCo to Jaco sim2real). We have alsoshown that the effect of disentangling is consistent acrossvery different RL algorithms (DQN, A3C, EC), achievingsignificant improvements over the baseline algorithms (me-dian 2.7 times improvement in zero-shot transfer acrosstasks and algorithms). To the best of our knowledge, thisis the first comprehensive empirical demonstration of thestrength of disentangled representations for domain adap-tation in a deep RL setting.


ReferencesAbadi, Martin, Agarwal, Ashish, and et al, Paul Barham. Ten-

sorflow: Large-scale machine learning on heterogeneous dis-tributed systems. Preliminary White Paper, 2015.

Barreto, Andre, Munos, Remi, Schaul, Tom, and Silver, David.Successor features for transfer in reinforcement learning.CoRR, abs/1606.05312, 2016. URL http://arxiv.org/abs/1606.05312.

Beattie, Charles, Leibo, Joel Z., Teplyashin, Denis, Ward, Tom,and et al, Marcus Wainwright. Deepmind lab. arxiv, 2016.

Bengio, Y., Courville, A., and Vincent, P. Representation learn-ing: A review and new perspectives. In IEEE Transactions onPattern Analysis & Machine Intelligence, 2013.

Blundell, Charles, Uria, Benigno, Pritzel, Alexander, Li, Yazhe,Ruderman, Avraham, Leibo, Joel Z, Rae, Jack, Wierstra, Daan,and Hassabis, Demis. Model-free episodic control. arXiv,2016.

Candy, T. Rowan, Wang, Jingyun, and Ravikumar, Sowmya. Reti-nal image quality and postnatal visual experience during in-fancy. Optom Vis Sci, 86(6):556–571, 2009.

Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John,Sutskever, Ilya, and Abbeel, Pieter. Infogan: Interpretable rep-resentation learning by information maximizing generative ad-versarial nets. arXiv, 2016.

Cheung, Brian, Levezey, Jesse A., Bansal, Arjun K., and Ol-shausen, Bruno A. Discovering hidden factors of variation indeep networks. In Proceedings of the International Conferenceon Learning Representations, Workshop Track, 2015.

Cohen, T. and Welling, M. Transformation properties of learnedvisual representations. In ICLR, 2015.

Cohen, Taco and Welling, Max. Learning the irreducible repre-sentations of commutative lie groups. arXiv, 2014.

Daftry, Shreyansh, Bagnell, J. Andrew, and Hebert, Martial.Learning transferable policies for monocular reactive mav con-trol. International Symposium on Experimental Robotics,2016.

Degris, Thomas, Pilarski Patrick M and Sutton, Richard S.Model-free reinforcement learning with continuous action inpractice. American Control Conference (ACC), 22:21772182,2012.

Desjardins, G., Courville, A., and Bengio, Y. Disentangling fac-tors of variation via generative entangling. arXiv, 2012.

Dosovitskiy, Alexey and Brox, Thomas. Generating images withperceptual similarity metrics based on deep networks. arxiv,2016.

Finn, Chelsea, Tan, Xin Yu, Duan, Yan, Darrell, Trevor, Levine,Sergey, and Abbeel, Pieter. Deep spatial autoencoders for vi-suomotor learning. arxiv, 2015.

Finn, Chelsea, Yu, Tianhe, Fu, Justin, Abbeel, Pieter, and Levine,Sergey. Generalizing skills with semi-supervised reinforce-ment learning. ICLR, 2017.

Garnelo, Marta, Arulkumaran, Kai, and Shanahan, Murray. To-wards deep symbolic reinforcement learning. arXiv, 2016.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generativeadversarial nets. NIPS, pp. 26722680, 2014.

Goroshin, Ross, Mathieu, Michael, and LeCun, Yann. Learningto linearize under uncertainty. NIPS, 2015.

Guez, Arthur, Silver, David, and Dayan, Peter. Efficient bayes-adaptive reinforcement learning using sample-based search.NIPS, 2012.

Gupta, Abhishek, Devin, Coline, Liu, YuXuan, Abbeel, Pieter,and Levine, Sergey. Learning invariant feature spaces to trans-fer skills with reinforcement learning. ICLR, 2017.

Heess, Nicolas, Wayne, Gregory, Silver, David, Lillicrap, Tim-othy P., Erez, Tom, and Tassa, Yuval. Learning continuouscontrol policies by stochastic value gradients. NIPS, 2015.

Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher,Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, andLerchner, Alexander. Beta-vae: Learning basic visual conceptswith a constrained variational framework. In ICLR, 2017.

Hinton, G., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. International Conference on Artificial Neural Net-works, 2011.

Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian,Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu,Koray. Reinforcement learning with unsupervised auxiliarytasks. ICLR, 2017.

Karaletsos, Theofanis, Belongie, Serge, and Rtsch, Gunnar.Bayesian representation learning with oracle constraints.ICLR, 2016.

Kingma, D. P. and Ba, Jimmy. Adam: A method for stochasticoptimization. arXiv, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes.ICLR, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Im-agenet classification with deep convolutional neural networks.NIPS, 2012.

Kulkarni, Tejas, Whitney, William, Kohli, Pushmeet, and Tenen-baum, Joshua. Deep convolutional inverse graphics network.NIPS, 2015.

Lake, Brenden M., Ullman, Tomer D., Tenenbaum, Joshua B.,and Gershman, Samuel J. Building machines that learn andthink like people. arXiv, 2016.

Larsen, Anders Boesen Lindbo, Snderby, Sren Kaae, Larochelle,Hugo, and Winther, Ole. Autoencoding beyond pixels using alearned similarity metric. ICML, 2016.

Leat, Susan J., Yadav, Naveen K., and Irving, Elizabeth L. De-velopment of visual acuity and contrast sensitivity in children.Journal of Optometry, 2009.

Levine, Sergey and Abbeel, Pieter. Learning neural networkpolicies with guided policy search under unknown dynamics.NIPS, 2014.

http://arxiv.org/abs/1606.05312



Lillicrap, Timothy P., Hunt, Jonathan J., Pritzel, Alexander,Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, andWierstra, Daan. Continuous control with deep reinforcementlearning. CoRR, 2015.

Littman, Michael L., Sutton, Richard S., and Singh, Satinder. Pre-dictive representations of state. NIPS, 2001.

Marr, D. Simple memory: A theory for archicortex. PhilosophicalTransactions of the Royal Society of London, pp. 2381, 1971.

McClelland, James L, McNaughton, Bruce L, and OReilly, Ran-dall C. Why there are complementary learning systems in thehippocampus and neocortex: insights from the successes andfailures of connectionist models of learning and memory. Psy-chological review, 102:419, 1995.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David S, andRusu, Andrei A. et al. Human-level control through deep rein-forcement learning. Nature, 2015.

Mnih, Volodymyr, Badia, Adri Puigdomnech, Mirza, Mehdi,Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David,and Kavukcuoglu, Koray. Asynchronous methods for deep re-inforcement learning. ICML, 2016. URL https://arxiv.org/pdf/1602.01783.pdf.

Niekum, Scott, Chitta, Sachin, Barto, Andrew G, Marthi,Bhaskara, and Osentoski, Sarah. Incremental semanticallygrounded learning from demonstration. Robotics: Science andSystems, 2013.

Norman, Kenneth A and O’Reilly, Randall C. Modeling hip-pocampal and neocortical contributions to recognition mem-ory: a complementary-learning-systems approach. Psycholog-ical review, 110:611, 2003.

Pan, Sinno Jialin and Yang, Quiang. A survey on transfer learn-ing. IEEE Transactions on Knowledge and Data Engineering,2009.

Parisotto, Emilio, Ba, Jimmy, and Salakhutdinov, Ruslan. Actor-mimic: Deep multitask and transfer reinforcement learning.CoRR, 2015.

Pathak, Deepak, Krahenbuhl, Philipp, Donahue, Jeff, Darrell,Trevor, and Efros, Alexei A. Context encoders: Feature learn-ing by inpainting. CoRR, abs/1604.07379, 2016. URL http://arxiv.org/abs/1604.07379.

Peng, J. Efficient dynamic programming-based learning for con-trol. PhD thesis, Northeastern University, Boston., 1993.

Peng, Jing and Williams, Ronald J. Incremental multi-step q-learning. Machine Learning, 22:283290, 1996.

Puterman, Martin L. Markov Decision Processes: DiscreteStochastic Dynamic Programming. John Wiley & Sons, Inc.,New York, NY, USA, 1st edition, 1994. ISBN 0471619779.

Raffin, Antonin, Hfer, Sebastian, Jonschkowski, Rico, Brock,Oliver, and Stulp, Freek. Unsupervised learning of state repre-sentations for multiple tasks. ICLR, 2017.

Rajendran, Janarthanan, Lakshminarayanan, Aravind, Khapra,Mitesh M., P, Prasanna, and Ravindran, Balaraman. Attend,adapt and transfer: Attentive deep architecture for adaptivetransfer from multiple sources in the same domain. ICLR,2017.

Reed, Scott, Sohn, Kihyuk, Zhang, Yuting, and Lee, Honglak.Learning to disentangle factors of variation with manifold in-teraction. ICML, 2014.

Rezende, Danilo J., Mohamed, Shakir, and Wierstra, Daan.Stochastic backpropagation and approximate inference in deepgenerative models. arXiv, 2014.

Ridgeway, Karl. A survey of inductive biases for factorialRepresentation-Learning. arXiv, 2016. URL http://arxiv.org/abs/1612.05299.

Rippel, Oren and Adams, Ryan Prescott. High-dimensional prob-ability estimation with deep density models. arXiv, 2013.

Rusu, Andrei A., Vecerik, Matej, Rothrl, Thomas, Heess, Nicolas,Pascanu, Razvan, and Hadsell, Raia. Sim-to-real robot learningfrom pixels with progressive nets. arXiv, 2016.

Schmidhuber, Jurgen. Learning factorial codes by predictabilityminimization. Neural Computation, 4(6):863–869, 1992.

Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan,Michael I., and Abbeel, Pieter. Trust region policy optimiza-tion. ICML, 2015.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan,Michael, and Abbeel, Pieter. High-dimensional continuouscontrol using generalized advantage estimation. ICLR, 2016.

Sutherland, Robert J and Rudy, Jerry W. Configural associationtheory: The role of the hippocampal formation in learning,memory, and amnesia. Psychobiology, 17:129144, 1989.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learn-ing: An Introduction. MIT Press, 1998.

Talvitie, Erik and Singh, Satinder. An experts algorithm for trans-fer learning. In Proceedings of the 20th international joint con-ference on Artifical intelligence, 2007.

Tobin, Josh, Fong, Rachel, Ray, Alex, Schneider, Jonas, Zaremba,Wojciech, and Abbeel, Pieter. Domain randomization for trans-ferring deep neural networks from simulation to the real world.arxiv, 2017.

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine formodel-based control. IROS, 2012.

Tulving, Endel, Hayman, CA, and Macdonald, Carol A. Long-lasting perceptual priming and semantic learning in amnesia: acase experiment. Journal of Experimental Psychology: Learn-ing, Memory, and Cognition, 17:595, 1991.

Tzeng, Eric, Devin, Coline, Hoffman, Judy, Finn, Chelsea,Abbeel, Pieter, Levine, Sergey, Saenko, Kate, and Darrell,Trevor. Adapting deep visuomotor representations with weakpairwise constraints. WAFR, 2016.

Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio,Yoshua, and Manzagol, Pierre-Antoine. Stacked denoising au-toencoders: Learning useful representations in a deep networkwith a local denoising criterion. NIPS, 2010.

Warde-Farley, David and Bengio, Yoshua. Improving generativeadversarial networks with denoising feature matching. ICLR,2017.

https://arxiv.org/pdf/1602.01783.pdf

https://arxiv.org/pdf/1602.01783.pdf






Watkins, Christopher John Cornish Hellaby. Learning from de-layed rewards. PhD thesis, University of Cambridge, Cam-bridge, UK, 1989.

Whitney, William F., Chang, Michael, Kulkarni, Tejas, andTenenbaum, Joshua B. Understanding visual concepts withcontinuation learning. arXiv, 2016. URL http://arxiv.org/pdf/1602.06822.pdf.

Yang, Jimei, Reed, Scott, Yang, Ming-Hsuan, and Lee, Honglak.Weakly-supervised disentangling with recurrent transforma-tions for 3d view synthesis. NIPS, 2015.

A. Supplementary MaterialsA.1. The Reinforcement Learning Paradigm

The reinforcement learning (RL) paradigm consists of an agent re-ceiving a sequence of observations sot which are some function ofenvironment states st ∈ S and may be accompanied by rewardsrt+1 ∈ R conditional on the actions at ∈ A, chosen at each timestep t (Sutton & Barto, 1998). We assume that these interactionscan be modelled as a Markov Decision Process (MDP) (Puterman,1994) defined as a tuple D ≡ (S,A, T , R, γ). T = p(s|st, at)is a transition function that models the distribution of all possiblenext states given action at is taken in state st for all st ∈ S andat ∈ A. Each transition st

at→ st+1 may be accompanied by areward signal rt+1(st, at, st+1). The goal of the agent is to learna policy π(at|st), a probability distribution over actions at ∈ A,that maximises the expected return i.e. the discounted sum of fu-ture rewards Rt = E[

∑T−tτ=1 γ

τ−1rt+τ ]. T is the time step atwhich each episode ends, and γ ∈ [0, 1) is the discount factorthat progressively down-weights future rewards. Given a policyπ(a|s), one can define the value function Vπ(s) = E[Rt|st =s, π], which is the expected return from state s following policy π.The action-value function Qπ(s, a) = E[Rt|st = s, at = a, π]is the expected return for taking action a in state s at time t, andthen following policy π from time t+ 1 onward.

A.2. Further task details

A.2.1. DEEPMIND LAB

As described in Sec 3.1, in each source episode of DeepMind Labthe agent was presented with one of three possible room/objecttype conjunctions, chosen at random. These are marked DS inFig 2. The setup was a seek-avoid style task, where one of thetwo object types in the room gave a reward of +1 and the othergave a reward of -1. The agent was allowed to pick up objects for60 seconds after which the episode would terminate and a new onewould begin; if the agent was able to pick up all the ‘good’ objectsin less than 60 seconds, a new episode was begun immediately.The agent was spawned in a random location in the room at thestart of each new episode.

During transfer, the agent was placed into the held out conjunctionof object types and room background; see DT in Fig 2.

Visual pre-training was performed in other conjunctions of objecttype and room background denoted DU in Fig 2.

The observation size of frames in the DeepMind Lab task was84x84x3 (HxWxC).

A.2.2. MUJOCO/JACO ARM EXPERIMENTS

As described in Sec 3.2, the source task consisted of an agentlearning to control a simulated arm in order to reach toward anobject. A shaping reward was used, with a maximum value of1 when the centre of the object fell between the pinch and gripsites of the end effector, or within a 10cm distance of the two.Distances on the x and y dimensions counted double compared todistances on the z dimension.

During each episode the object was placed at a random drop pointwithin a 40x40cm area, and the arm was set to a random ini-tial start position high above the work-space, independent of theobject’s position. Each episode lasted for 150 steps, or 7.5 sec-onds, with a control step of 50ms. Observations soU were sampled

http://arxiv.org/pdf/1602.06822.pdf

http://arxiv.org/pdf/1602.06822.pdf


randomly across episodes. Overall, 4 million frames of dimen-sions 64x64x3 (HxWxC) were used for this stage of the curricu-lum. For each episode the camera position and orientation wererandomly sampled from an isotropic normal distribution centredaround the approximate position and orientation of the real cam-era, with standard deviation 0.01. No precise measurements wereused to match the two. Work-space table colour was sampleduniformly between −5% and +5% around the midpoint, inde-pendently for each RGB channel; object colours were sampleduniformly at random in RGB space, rejecting colours which fellwithin a ball around 10 held-out intensities (radius 10% of range);the latter were only used for simulated transfer experiments, i.e.in DT in the sim2sim experiments. Additionally, Gaussian noisewith standard deviation 0.01 was added to the observations soT inthe sim2sim task.

For the real Jaco arm and its MuJoCo simulation counterpart, eachof the nine joints could independently take 11 different actions (alinear discretisation of the continuous velocity action space). Insimulation Gaussian noise with standard deviation 0.1 was addedto each discrete velocity output; delays in the real setup betweenobservations and action execution were simulated by randomlymixing velocity outputs from two previous steps instead of emit-ting the last output directly. Speed ranges were between −50%and 50% of the Jaco arm’s top speed on joints 1 through 6 start-ing at the base, while the fingers could use a full range. For safetyreasons the speed ranges have been reduced by a factor of 0.3while evaluating agents on the Jaco arm, without significant per-formance degradation.

A.3. Vision model details

A.3.1. DENOISING AUTOENCODER FOR β-VAE

A denoising autoencoder (DAE) was used as a model to providethe feature space for the β-VAE reconstruction loss to be com-puted over (for motivation, see Sec. 2.3.1). The DAE was trainedwith occlusion-style masking noise in the vein of (Pathak et al.,2016), with the aim for the DAE to learn a semantic representationof the input frames. Concretely, two values were independentlysampled from U [0,W ] and two from U [0, H] where W and Hwere the width and height of the input frames. These four valuesdetermined the corners of the rectangular mask applied; all pixelsthat fell within the mask were set to zero.

The DAE architecture consisted of four convolutional layers, eachwith kernel size 4 and stride 2 in both the height and width di-mensions. The number of filters learnt for each layer was {32,32, 64, 64} respectively. The bottleneck layer consisted of a fullyconnected layer of size 100 neurons. This was followed by fourdeconvolutional layers, again with kernel sizes 4, strides 2, and{64, 64, 32, 32} filters. The padding algorithm used was ‘SAME’in TensorFlow (Abadi et al., 2015). ReLU non-linearities wereused throughout.

The model was trained with loss given by the L2 distance of theoutputs from the original, un-noised inputs. The optimiser usedwas Adam (Kingma & Ba, 2014) with a learning rate of 1e-3.

A.3.2. β-VAE WITH PERCEPTUAL SIMILARITY LOSS

After training a DAE, as detailed in the previous section6, aβ-VAEDAE was trained with perceptual similarity loss given by

6In principle, the β-VAEDAE could also have been trainedend-to-end in one pass, but we did not experiment with this.

Eq. 2, repeated here:

L(θ, φ;x, z, β) =Eqφ(z|x) ‖J(x)− J(x)‖22

− β DKL(qφ(z|x)||p(z)) (3)

Specifically, the input was passed through the β-VAE and a sam-pled7 reconstruction was passed through the pre-trained DAE upto a designated layer. The L2 distance of this representation fromthe representation of the original input passed through the samelayers of the DAE was then computed, and this formed the train-ing loss for the β-VAE part of the β-VAEDAE 8. The DAEweights remained frozen throughout.

The β-VAE architecture consisted of an encoder of four convolu-tional layers, each with kernel size 4, and stride 2 in the heightand width dimensions. The number of filters learnt for each layerwas {32, 32, 64, 64} respectively. This was followed by a fullyconnected layer of size 256 neurons. The latent layer comprised64 neurons parametrising 32 (marginally) independent Gaussiandistributions. The decoder architecture was simply the reverse ofthe encoder, utilising deconvolutional layers. The decoder usedwas Gaussian, so that the number of output channels was 2C,where C was the number of channels that the input frames had.The padding algorithm used was ‘SAME’ in TensorFlow. ReLUnon-linearities were used throughout.

The model was trained with the loss given by Eq. 3. Specifically,the disentangled model used for DARLA was trained with a β hy-perparameter value of 1 and the layer of the DAE used to computethe perceptual similarity loss was the last deconvolutional layer.The entangled model used for DARLAENT was trained with a βhyperparameter value of 0.1 with the last deconvolutional layer ofthe DAE was used to compute the perceptual similarity loss.

The optimiser used was Adam with a learning rate of 1e-4.

A.3.3. β-VAE

For the MuJoCo/Jaco tasks, a standard β-VAE was used ratherthan the β-VAEDAE used for DeepMind Lab. The architecture ofthe VAE encoder, decoder and the latent size were exactly as de-scribed in the previous section A.3.2. β for the the disentangled β-VAE in DARLA was 175. β for the entangled model DARLAENTwas 1, corresponding to the standard VAE of (Kingma & Welling,2014).

The optimizer used was Adam with a learning rate of 1e-4.

A.3.4. DENOISING AUTOENCODER FOR BASELINE

For the baseline model DARLADAE, we trained a denoising au-toencoder with occlusion-style masking noise as described in Ap-pendix Section A.3.1. The architecture used matched that exactlyof the β-VAE described in Appendix Section A.3.2 - however, allstochastic nodes were replaced with deterministic neurons.

The optimizer used was Adam with a learning rate of 1e-4.

7It is more typical to use the mean of the reconstruction dis-tribution, but this does not induce any pressure on the Gaussiansparametrising the decoder to reduce their variances. Hence fullsamples were used instead.

8The representations were taken after passing through thelayer but before passing through the following non-linearity. Wealso briefly experimented with taking the L2 loss post-activationbut did not find a significant difference.


A.4. Reinforcement Learning Algorithm Details

A.4.1. DEEPMIND LAB

The action space in the DeepMind Lab task consisted of 8 discreteactions.

DQN: in DQN, the convolutional (or ‘vision’) part of the Q-netwas replaced with the encoder of the β-VAEDAE from stage 1and frozen. DQN takes four consecutive frames as input in orderto capture some aspect of environment dynamics in the agent’sstate. In order to match this in our setup with a pre-trained visionstack FU , we passed each observation frame so{1..4} through thepre-trained model sz{1..4} = FU (so{1..4}) and then concatenatedthe outputs together to form the k-dimensional (where k = 4|sz|)input to the policy network. In this case the size of sz was 64 forDARLA as well as DARLAENT, DARLADAE and DARLAFT.

On top of the frozen convolutional stack, two ‘policy’ layers of512 neurons each were used, with a final linear layer of 8 neuronscorresponding to the size of the action space in the DeepMindLab task. ReLU non-linearities were used throughout. All otherhyperparameters were as reported in (Mnih et al., 2015).

A3C: in A3C, as with DQN, the convolutional part of the net-work that is shared between the policy net and the value net wasreplaced with the encoder of the β-VAEDAE in DeepMind Labtasks. All other hyperparameters were as reported in (Mnih et al.,2016).

Episodic Control: for the Episodic Controller-based DARLA weused mostly the same hyperparameters as in the original paper by(Blundell et al., 2016). We explored the following hyperparametersettings: number of nearest neighbours ∈ {10, 50}, return hori-zon ∈ {100, 400, 800, 1800, 500000}, kernel type ∈ {inverse,gaussian}, kernel width ∈ {1e− 6, 1e− 5, 1e− 4, 1e− 3, 1e−2, 1e − 1, 0.5, 0.99} and we tried training EC with and withoutPeng’s Q(λ) (Peng, 1993). In practice we found that none of theexplored hyperparameter choices significantly influenced the re-sults of our experiments. The final hyperparameters used for allexperiments reported in the paper were the following: number ofnearest neighbours: 10, return horizon: 400, kernel type: inverse,kernel width: 1e-6 and no Peng’s Q(λ) (Peng, 1993).

UNREAL: We used a vanilla version of UNREAL, with parame-ters as reported in (Jaderberg et al., 2017).

A.4.2. MUJOCO/JACO ARM EXPERIMENTS

For the real Jaco arm and its MuJoCo simulation, each of the ninejoints could independently take 11 different actions (a linear dis-cretisation of the continuous velocity action space). Therefore theaction space size was 99.

DARLA for MuJoCo/Jaco was based on feedforward A3C (Mnihet al., 2016). We closely followed the simulation training setup of(Rusu et al., 2016) for feed-forward networks using raw visual-input only. In place of the usual conv-stack, however, we used theencoder of the β-VAE as described in Appendix A.3.3. This wasfollowed by a linear layer with 512 units, a ReLU non-linearityand a collection of 9 linear and softmax layers for the 9 indepen-dent policy outputs, as well as a single value output layer thatoutputted the value function.

Transfer metric score:0.457 0.0650.196

Figure 6. Traversals of the latent corresponding to room back-ground for models with different transfer metric scores (showntop). Note that in the entangled model, many other objects appearand blue hat changes shape in addition to the background chang-ing. For the model with middling transfer score, both the objecttype and background alter; whereas for the disentangled model,very little apart from the background changes.

A.5. Disentanglement Evaluation

A.5.1. VISUAL HEURISTIC DETAILS

In order to choose the optimal value of β for the β-VAE -DAEmodels and evaluate the fitness of the representations szU learnt instage 1 of our pipeline (in terms of disentanglement achieved), weused the visual inspection heuristic described in (Higgins et al.,2017). The heuristic involved clustering trained β-VAE basedmodels based on the number of informative latents (estimated asthe number of latents zi with average inferred standard deviationbelow 0.75). For each cluster we examined the degree of learntdisentanglement by running inference on a number of seed im-ages, then traversing each latent unit z{i} one at a time over threestandard deviations away from its average inferred mean whilekeeping all other latents z{\i} fixed to their inferred values. Thisallowed us to visually examine whether each individual latent unitzi learnt to control a single interpretable factor of variation in thedata. A similar heuristic has been the de rigueur method for ex-hibiting disentanglement in the disentanglement literature (Chenet al., 2016; Kulkarni et al., 2015).

A.5.2. TRANSFER METRIC DETAILS

In the case of DeepMind Lab, we were able to use the ground truthlabels corresponding to the two factors of variation of the objecttype and the background to design a proxy to the disentanglementmetric proposed in (Higgins et al., 2017). The procedure used


consisted of the following steps:

1) Train the model under consideration on observations soU tolearn FU , as described in stage 1 of the DARLA pipeline.

2) Learn a linear model L : SzV → M × N from the represen-tations szV = FV (soV ), where M ∈ {0, 1} corresponds to theset of possible rooms and N ∈ {0, 1, 2, 3} corresponds to the setof possible objects9. Therefore we are learning a low-VC dimen-sion classifier to predict the room and the object class from thelatent representation of the model. Crucially, the linear model Lis trained on only a subset of the Cartesian productM×N e.g. on{{0, 0}, {0, 3}, {1, 1}, {1, 2}}. In practice, we utilised a softmaxclassifier each for M and N and trained this using backpropaga-tion with a cross-entropy loss, keeping the unsupervised model(and therefore FU ) fixed.

3) The trained linear model L’s accuracy is evaluated on the heldout subset of the Cartesian product M ×N .

Although the above procedure only measures disentangling up tolinearity, and only does so for the latents of object type and roombackground, we nevertheless found that the metric was highly cor-related with disentanglement as determined via visual inspection(see Fig. 6).

A.6. Background on RL Algorithms

In this Appendix, we provide background on the different RL al-gorithms that the DARLA framework was tested on in this paper.

A.6.1. DQN

(DQN) (Mnih et al., 2015) is a variant of the Q-learning algo-rithm (Watkins, 1989) that utilises deep learning. It uses a neu-ral network to parametrise an approximation for the action-valuefunction Q(s, a; θ) using parameters θ. These parameters are up-dated by minimising the mean-squared error of a 1-step looka-head loss LQ = E

[(rt + γmaxa′Q(s′, a′; θ−)−Q(s, a; θ))2

],

where θ− are parameters corresponding to a frozen network andoptimisation is performed with respect to θ, with θ− being syncedto θ at regular intervals.

A.6.2. A3C

Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016)is an asynchronous implementation of the advantage actor-criticparadigm (Sutton & Barto, 1998; Degris & Sutton, 2012), whereseparate threads run in parallel and perform updates to shared pa-rameters. The different threads each hold their own instance ofthe environment and have different exploration policies, therebydecorrelating parameter updates without the need for experiencereplay.

A3C uses neural networks to approximate both policy π(a|s; θ)and value Vπ(s; θ) functions using parameters θ using n-step look-ahead loss (Peng & Williams, 1996). The algo-rithm is trained using an advantage actor-critic loss func-tion with an entropy regularisation penalty: LA3C ≈LV R + Lπ − Es∼π [αH(π(a|s; θ))], where H is entropy.The parameter updates are performed after every tmax ac-tions or when a terminal state is reached. LV R =

9For the purposes of this metric, we utilised rooms with onlysingle objects, which we denote by the subscript V e.g. the obser-vation set SoV .

Es∼π[(Rt:t+n + γnV (st+n+1; θ)− V (st; θ))

2]

and Lπ =Es∼π [log π(a|s; θ)(Qπ(s, a; θ)− V π(s; θ))]. Unlike DQN,A3C uses an LSTM core to encode its history and therefore hasa longer term memory permitting it to perform better in partiallyobserved environments. In the version of A3C used in this pa-per for the DeepMind Lab task, the policy net additionally takesthe last action at−1 and last reward rt−1 as inputs along with theobservation sot , as introduced in (Jaderberg et al., 2017).

A.6.3. UNREAL

The UNREAL agent (Jaderberg et al., 2017) takes as a base anLSTM A3C agent (Mnih et al., 2016) and augments it with anumber of unsupervised auxiliary tasks that make use of the richperceptual data available to the agent besides the (sometimes verysparse) extrinsic reward signals. This auxiliary learning tends toimprove the representation learnt by the agent. While training thebase agent, its observations, rewards, and actions are stored in areplay buffer, which is used by the auxiliary learning tasks. Thetasks include: 1) pixel control the agent learns how to control theenvironment by training auxiliary policies to maximally changepixel intensities in different parts of the input; 2) reward predic-tion - given a replay buffer of observations within a short timeperiod of an extrinsic reward, the agent has to predict the rewardobtained during the next unobserved timestep using a sequence ofthree preceding steps; 3) value function replay - extra training ofthe value function to promote faster value iteration.

A.6.4. EPISODIC CONTROL

In its simplest form EC is a lookup table of states and actionsdenoted as QEC(s, a). In each state EC picks the action withthe highest QEC value. At the end of each episode QEC(s, a)is set to Rt if (st, at) /∈ QEC , where Rt is the discounted re-turn. Otherwise QEC(s, a) = max

{QEC(s, a), Rt

}. In order

to generalise its policy to novel states that are not in QEC , ECuses a non-parametric nearest neighbours search QEC(s, a) =1k

∑ki=1Q

EC(si, a), where si, i = 1, ..., k are k states with thesmallest distance to the novel state s. Like DQN, EC takes a con-catenation of four frames as input.

The EC algorithm is proposed as a model of fast hippocampalinstance-based learning in the brain (Marr, 1971; Sutherland &Rudy, 1989), while the deep RL algorithms described above aremore analogous to slow cortical learning that relies on generalisedstatistical summaries of the input distribution (McClelland et al.,1995; Norman & O’Reilly, 2003; Tulving et al., 1991).

A.7. Source Task Performance Results

The focus of this paper is primarily on zero-shot domain adapta-tion performance. However, it is also interesting to analyse theeffect of the DARLA approach on source domain policy perfor-mance. In order to compare the models’ behaviour on the sourcetask, we examined the training curves (see Figures 7-10) andnoted in particular their:

1. Asymptotic task performance, i.e. the rewards per episodeat the point where πS has converged for the agent underconsideration.

2. Data efficiency, i.e. how quickly the training curve was ableto achieve convergence.


1 2 3 4 5Environment steps 1e6

0

5

10

15

20

Avera

ge r

ew

ard

DQN

DARLA

DARLA_ENT

DARLA_DAE

DQN

DARLA_FT

Figure 7. Source task training curves for DQN. Curves show av-erage and standard deviation over 20 random seeds.

0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6Environment steps 1e7

0

5

10

15

20

25

Avera

ge r

ew

ard

/ e

pis

ode

A3C (DEEPMIND LAB)

DARLA

DARLA_ENT

DARLA_DAE

A3C

DARLA_FT

UNREAL

Figure 8. Source task performance training curves for A3C andUNREAL. DARLA shows accelerated learning of the task com-pared to other architectures. Results show average and standarddeviation over 20 random seeds, each using 16 workers.

1 2 3 4 5Environment steps 1e6

0

5

10

15

20

Avera

ge r

ew

ard

Episodic Controller

DARLA

DARLA_ENT

DARLA_DAE

EC

Figure 9. Source task training curves for EC. Results show aver-age and standard deviation over 20 random seeds.

We note the following consistent trends across the results:

1. Using DARLA provided an initial boost in learning perfor-mance, which depended on the degree of disentanglement ofthe representation. This was particularly observable in A3C,

0 1 2 3 4 5 6 7 8 9environment steps 1e6

0

20

40

60

80

100

120

Avera

ge r

ew

ard

A3C (JACO)

DARLA

DARLA_ENT

DARLA_DAE

A3C

DARLA_FT

Figure 10. Training curves for various baselines on the sourceMuJoCo reaching task

see Fig. 8.

2. Baseline algorithms where F could be fine-tuned to thesource task were able to achieve higher asymptotic perfor-mance. This was particularly notable on DQN and A3C (seeFigs. 7 and 8) in DeepMind Lab. However, in both thosecases, DARLA was able to learn very reasonable policieson the source task which were on the order of 20% lowerthan the fine-tuned models – arguably a worthwhile sacri-fice for a subsequent median 270% improvement in targetdomain performance noted in the main text.

3. Allowing DARLA to fine-tune its vision module(DARLAFT) boosted its source task learning speed,and allowed the agent to asymptote at the same level asthe baseline algorithms. As discussed in the main text, thiscomes at the cost of significantly reduced domain transferperformance on A3C. For DQN, however, finetuningappears to offer the best of both worlds.

4. Perhaps most relevantly for this paper, even if solely exam-ining source task performance, DARLA outperforms bothDARLAENT and DARLADAE on both asymptotic perfor-mance and data efficiency – suggesting that disentangledrepresentations have wider applicability in RL beyond thezero-shot domain adaptation that is the focus of this paper.

darla: improving zero-shot transfer in reinforcement learning · vironments (jaco arm, deepmind...

Documents