DARLA: Improving Zero-Shot Transfer in Reinforcement Learning


Post on 03-Aug-2018






<ul><li><p>DARLA: Improving Zero-Shot Transfer in Reinforcement Learning</p>
<p>Irina Higgins<sup>*1</sup>, Arka Pal<sup>*1</sup>, Andrei Rusu<sup>1</sup>, Loic Matthey<sup>1</sup>, Christopher Burgess<sup>1</sup>, Alexander Pritzel<sup>1</sup>, Matthew Botvinick<sup>1</sup>, Charles Blundell<sup>1</sup>, Alexander Lerchner<sup>1</sup></p>
<p><sup>*</sup>Equal contribution. <sup>1</sup>DeepMind, 6 Pancras Square, Kings Cross, London, N1C 4AG, UK. Correspondence to: Irina Higgins, Arka Pal.</p>
<p>Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).</p>
<p>Abstract</p>
<p>Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA's vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts - even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC).</p>
<p>1. Introduction</p>
<p>Autonomous agents can learn how to maximise future expected rewards by choosing how to act based on incoming sensory observations via reinforcement learning (RL). Early RL approaches did not scale well to environments with large state spaces and high-dimensional raw observations (Sutton &amp; Barto, 1998). A commonly used workaround was to embed the observations in a lower-dimensional space, typically via hand-crafted and/or privileged-information features. Recently, the advent of deep learning and its successful combination with RL has enabled end-to-end learning of such embeddings directly from raw inputs, sparking success in a wide variety of previously challenging RL domains (Mnih et al., 2015; 2016; Jaderberg et al., 2017). Despite the seemingly universal efficacy of deep RL, however, fundamental issues remain. These include data inefficiency, the reactive nature and general brittleness of learnt policies to changes in input data distribution, and lack of model interpretability (Garnelo et al., 2016; Lake et al., 2016). This paper focuses on one of these outstanding issues: the ability of RL agents to deal with changes to the input distribution, a form of transfer learning known as domain adaptation (Bengio et al., 2013). In domain adaptation scenarios, an agent trained on a particular input distribution with a specified reward structure (termed the source domain) is placed in a setting where the input distribution is modified but the reward structure remains largely intact (the target domain). We aim to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain. Here, a policy is considered robust if it generalises with minimal drop in performance to the target domain without extra fine-tuning.</p>
<p>Past attempts to build RL agents with strong domain adaptation performance highlighted the importance of learning good internal representations of raw observations (Finn et al., 2015; Raffin et al., 2017; Pan &amp; Yang, 2009; Barreto et al., 2016; Littman et al., 2001). Typically, these approaches tried to align the source and target domain representations by utilising observation and reward signals from both domains (Tzeng et al., 2016; Daftry et al., 2016; Parisotto et al., 2015; Guez et al., 2012; Talvitie &amp; Singh, 2007; Niekum et al., 2013; Gupta et al., 2017; Finn et al., 2017; Rajendran et al., 2017). 
In many scenarios, such as robotics, this reliance on target domain information can be problematic, as the data may be expensive or difficult to obtain (Finn et al., 2017; Rusu et al., 2016). Furthermore, the target domain may simply not be known in advance. On the other hand, policies learnt exclusively on the source domain using existing deep RL approaches that have few constraints on the nature of the learnt representations often overfit to the source input distribution, resulting in poor domain adaptation performance (Lake et al., 2016; Rusu et al., 2016).</p>
<p>arXiv:1707.08475v2 [stat.ML] 6 Jun 2018</p></li><li>
<p>Figure 1. Schematic representation of DARLA. Yellow represents the denoising autoencoder part of the model, blue represents the β-VAE part of the model, and grey represents the policy learning part of the model.</p>
<p>We propose tackling both of these issues by focusing instead on learning representations which capture an underlying low-dimensional factorised representation of the world and are therefore not task or domain specific. Many naturalistic domains such as video game environments, simulations and our own world are well described in terms of such a structure. Examples of such factors of variation are object properties like colour, scale, or position; other examples correspond to general environmental factors, such as geometry and lighting. We think of these factors as a set of high-level parameters that can be used by a world graphics engine to generate a particular natural visual scene (Kulkarni et al., 2015). 
Learning how to project raw observations into such a factorised description of the world is addressed by the large body of literature on disentangled representation learning (Schmidhuber, 1992; Desjardins et al., 2012; Cohen &amp; Welling, 2014; 2015; Kulkarni et al., 2015; Hinton et al., 2011; Rippel &amp; Adams, 2013; Reed et al., 2014; Yang et al., 2015; Goroshin et al., 2015; Kulkarni et al., 2015; Cheung et al., 2015; Whitney et al., 2016; Karaletsos et al., 2016; Chen et al., 2016; Higgins et al., 2017). Disentangled representations are defined as interpretable, factorised latent representations where either a single latent or a group of latent units are sensitive to changes in single ground truth factors of variation used to generate the visual world, while being invariant to changes in other factors (Bengio et al., 2013). The theoretical utility of disentangled representations for supervised and reinforcement learning has been described before (Bengio et al., 2013; Higgins et al., 2017; Ridgeway, 2016); however, to our knowledge, it has not been empirically validated to date.</p>
<p>We demonstrate how disentangled representations can improve the robustness of RL algorithms in domain adaptation scenarios by introducing DARLA (DisentAngled Representation Learning Agent), a new RL agent capable of learning a robust policy on the source domain that achieves significantly better out-of-the-box performance in domain adaptation scenarios compared to various baselines. DARLA relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment's generative factors. Crucially, DARLA does not require target domain data to form its representations. Our approach utilises a three-stage pipeline: 1) learning to see, 2) learning to act, 3) transfer. 
During the first stage, DARLA develops its vision, learning to parse the world in terms of basic visual concepts, such as objects, positions, colours, etc. by utilising a stream of raw unlabelled observations, not unlike human babies in their first few months of life (Leat et al., 2009; Candy et al., 2009). In the second stage, the agent utilises this disentangled visual representation to learn a robust source policy. In stage three, we demonstrate that the DARLA source policy is more robust to domain shifts, leading to a significantly smaller drop in performance in the target domain even when no further policy finetuning is allowed (median 270.3% improvement). These effects hold consistently across a number of different RL environments (DeepMind Lab and Jaco/MuJoCo: Beattie et al., 2016; Todorov et al., 2012) and algorithms (DQN, A3C and Episodic Control: Mnih et al., 2015; 2016; Blundell et al., 2016).</p>
<p>2. Framework</p>
<p>2.1. Domain adaptation in Reinforcement Learning</p>
<p>We now formalise domain adaptation scenarios in a reinforcement learning (RL) setting. We denote the source and target domains as $D_S$ and $D_T$, respectively. Each domain corresponds to an MDP defined as a tuple $D_S \equiv (\mathcal{S}_S, \mathcal{A}_S, \mathcal{T}_S, R_S)$ or $D_T \equiv (\mathcal{S}_T, \mathcal{A}_T, \mathcal{T}_T, R_T)$ (we assume a shared fixed discount factor $\gamma$), each with its own state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $\mathcal{T}$ and reward function $R$.<sup>1</sup> In domain adaptation scenarios the states $\mathcal{S}$ of the source and the target domains can be quite different, while the action spaces $\mathcal{A}$ are shared and the transitions $\mathcal{T}$ and reward functions $R$ have structural similarity. For example, consider a domain adaptation scenario for the Jaco robotic arm, where the MuJoCo (Todorov et al., 2012) simulation of the arm is the source domain, and the real world setting is the target domain. The state spaces (raw pixels) of the source and the target domains differ significantly due to the perceptual-reality gap (Rusu et al., 2016); that is to say, $\mathcal{S}_S \neq \mathcal{S}_T$. 
Both domains, however, share action spaces ($\mathcal{A}_S = \mathcal{A}_T$), since the policy learns to control the same set of actuators within the arm. Finally, the source and target domain transition and reward functions share structural similarity ($\mathcal{T}_S \approx \mathcal{T}_T$ and $R_S \approx R_T$), since in both domains transitions between states are governed by the physics of the world and the performance on the task depends on the relative position of the arm's end effectors (i.e. fingertips) with respect to an object of interest.</p>
<p><sup>1</sup>For further background on the notation relating to the RL paradigm, see Section A.1 in the Supplementary Materials.</p></li><li>
<p>2.2. DARLA</p>
<p>In order to describe our proposed DARLA framework, we assume that there exists a set $\mathcal{M}$ of MDPs that is the set of all natural world MDPs, and each MDP $D_i$ is sampled from $\mathcal{M}$. We define $\mathcal{M}$ in terms of the state space $\hat{\mathcal{S}}$ that contains all possible conjunctions of high-level factors of variation necessary to generate any naturalistic observation in any $D_i \in \mathcal{M}$. A natural world MDP $D_i$ is then one whose state space $\mathcal{S}$ corresponds to some subset of $\hat{\mathcal{S}}$. In simple terms, we assume that there exists some shared underlying structure between the MDPs $D_i$ sampled from $\mathcal{M}$. We contend that this is a reasonable assumption that permits inclusion of many interesting problems, including being able to characterise our own reality (Lake et al., 2016).</p>
<p>We now introduce notation for two state space variables that may in principle be used interchangeably within the source and target domain MDPs $D_S$ and $D_T$: the agent observation state space $\mathcal{S}^o$, and the agent's internal latent state space $\mathcal{S}^z$.<sup>2</sup> $\mathcal{S}^o_i$ in $D_i$ consists of raw (pixel) observations $s^o_i$ generated by the true world simulator from a sampled set of data generative factors $\hat{s}_i$, i.e. 
$s^o_i \sim \mathrm{Sim}(\hat{s}_i)$. $\hat{s}_i$ is sampled by some distribution or process $\mathcal{G}_i$ on $\hat{\mathcal{S}}$, $\hat{s}_i \sim \mathcal{G}_i(\hat{\mathcal{S}})$.</p>
<p>Using the newly introduced notation, domain adaptation scenarios can be described as having different sampling processes $\mathcal{G}_S$ and $\mathcal{G}_T$ such that $\hat{s}_S \sim \mathcal{G}_S(\hat{\mathcal{S}})$ and $\hat{s}_T \sim \mathcal{G}_T(\hat{\mathcal{S}})$ for the source and target domains respectively, and then using these to generate different agent observation states $s^o_S \sim \mathrm{Sim}(\hat{s}_S)$ and $s^o_T \sim \mathrm{Sim}(\hat{s}_T)$. Intuitively, consider a source domain where oranges appear in blue rooms and apples appear in red rooms, and a target domain where the object/room conjunctions are reversed and oranges appear in red rooms and apples appear in blue rooms. While the true data generative factors of variation $\hat{\mathcal{S}}$ remain the same, namely room colour (blue or red) and object type (apples and oranges), the particular source and target distributions $\mathcal{G}_S$ and $\mathcal{G}_T$ differ.</p>
<p>Typically deep RL agents (e.g. Mnih et al., 2015; 2016) operating in an MDP $D_i \in \mathcal{M}$ learn an end-to-end mapping from raw (pixel) observations $s^o_i \in \mathcal{S}^o_i$ to actions $a_i \in \mathcal{A}_i$ (either directly or via a value function $Q_i(s^o_i, a_i)$ from which actions can be derived). In the process of doing so, the agent implicitly learns a function $\mathcal{F} : \mathcal{S}^o_i \rightarrow \mathcal{S}^z_i$ that maps the typically high-dimensional raw observations $s^o_i$ to typically low-dimensional latent states $s^z_i$; followed by a policy function $\pi_i : \mathcal{S}^z_i \rightarrow \mathcal{A}_i$ that maps the latent states $s^z_i$ to actions $a_i \in \mathcal{A}_i$. In the context of domain adaptation, if the agent learns a naive latent state mapping function $\mathcal{F}_S : \mathcal{S}^o_S \rightarrow \mathcal{S}^z_S$ on the source domain using reward signals to shape the representation learning, it is likely that $\mathcal{F}_S$ will overfit to the source domain and will not generalise well to the target domain. Returning to our intuitive example, imagine an agent that has learnt a policy to pick up oranges and avoid apples on the source domain. Such a source policy $\pi_S$ is likely to be based on an entangled latent state space $\mathcal{S}^z_S$ of object/room conjunctions: oranges/blue $\rightarrow$ good, apples/red $\rightarrow$ bad, since this is arguably the most efficient representation for maximising expected rewards on the source task in the absence of extra supervision signals suggesting otherwise. A source policy $\pi_S(a \mid s^z_S; \theta)$ based on such an entangled latent representation $s^z_S$ will not generalise well to the target domain without further fine-tuning, since $\mathcal{F}_S(s^o_S) \neq \mathcal{F}_S(s^o_T)$ and therefore crucially $\mathcal{S}^z_S \neq \mathcal{S}^z_T$.</p>
<p><sup>2</sup>Note that we do not assume these to be Markovian, i.e. it is not necessarily the case that $p(s^o_{t+1} \mid s^o_t) = p(s^o_{t+1} \mid s^o_t, s^o_{t-1}, \ldots, s^o_1)$, and similarly for $s^z$. Note the index $t$ here corresponds to time.</p>
<p>On the other hand, since both $\hat{s}_S \sim \mathcal{G}_S(\hat{\mathcal{S}})$ and $\hat{s}_T \sim \mathcal{G}_T(\hat{\mathcal{S}})$ are sampled from the same natural world state space $\hat{\mathcal{S}}$ for the source and target domains respectively, it should be possible to learn a latent state mapping function $\hat{\mathcal{F}} : \mathcal{S}^o \rightarrow \mathcal{S}^z_{\hat{\mathcal{S}}}$, which projects the agent observation state space $\mathcal{S}^o$ to a latent state space $\mathcal{S}^z_{\hat{\mathcal{S}}}$ expressed in terms of factorised data generative factors that are representative of the natural world, i.e. $\mathcal{S}^z_{\hat{\mathcal{S}}} \approx \hat{\mathcal{S}}$. Consider again our intuitive example, where $\hat{\mathcal{F}}$ maps agent observations ($s^o_S$: orange in a blue room) to a factorised or disentangled representation expressed in terms of the data generative factors ($s^z_{\hat{\mathcal{S}}}$: room type = blue; object type = orange). Such a disentangled latent state mapping function should then directly generalise to both the source and the target domains, so that $\hat{\mathcal{F}}(s^o_S) = \hat{\mathcal{F}}(s^o_T) = s^z_{\hat{\mathcal{S}}}$. Since $\mathcal{S}^z_{\hat{\mathcal{S}}}$ is a disentangled representation of object and room attributes, the source policy $\pi_S$ can learn a decision boundary that ignores the irrelevant room attributes: oranges $\rightarrow$ good, apples $\rightarrow$ bad. Such a policy would then generalise well to the target domain out of the box, since $\pi_S(a \mid \hat{\mathcal{F}}(s^o_S); \theta) = \pi_T(a \mid \hat{\mathcal{F}}(s^o_T); \theta) = \pi_T(a \mid s^z_{\hat{\mathcal{S}}}; \theta)$. 
Hence, DARLA is based on the idea that a good quality $\hat{\mathcal{F}}$ learnt exclusively on the source domain $D_S \in \mathcal{M}$ will zero-shot-generalise to all target domains $D_i \in \mathcal{M}$, and therefore the source policy $\pi(a \mid \mathcal{S}^z_{\hat{\mathcal{S}}}; \theta)$ will also generalise to all target domains $D_i \in \mathcal{M}$ out of the box.</p>
<p>Next we describe each of the stages of the DARLA pipeline that allow it to learn source policies $\pi_S$ that are robust...</p></li></ul>
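The oranges-and-apples example from Section 2.2 can be made concrete with a toy simulation. The sketch below is an illustration only and is not DARLA's implementation: the sampler, reward and policy functions, the reward values (+1, -1, 0), and the choice to let the entangled policy key on room colour are all assumptions introduced here. Both policies read the ground-truth factors directly, standing in for a learnt latent state.

```python
import random

OBJECTS = ("orange", "apple")


def sample_source(rng):
    """Hypothetical source sampler: oranges in blue rooms, apples in red rooms."""
    obj = rng.choice(OBJECTS)
    room = "blue" if obj == "orange" else "red"
    return {"room": room, "object": obj}


def sample_target(rng):
    """Hypothetical target sampler: the object/room conjunctions are reversed."""
    obj = rng.choice(OBJECTS)
    room = "red" if obj == "orange" else "blue"
    return {"room": room, "object": obj}


def reward(factors, action):
    """Assumed reward: +1 for picking an orange, -1 for picking an apple, 0 for avoiding."""
    if action == "avoid":
        return 0
    return 1 if factors["object"] == "orange" else -1


def entangled_policy(factors):
    """Stand-in for a policy on an entangled latent: it relies on room colour,
    which predicts the object perfectly in the source domain only."""
    return "pick" if factors["room"] == "blue" else "avoid"


def disentangled_policy(factors):
    """Stand-in for a policy on a disentangled latent: it uses the object
    factor and ignores the irrelevant room factor."""
    return "pick" if factors["object"] == "orange" else "avoid"


def average_reward(policy, sampler, episodes=1000, seed=0):
    """Average one-step reward of a policy under a given sampling process."""
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        factors = sampler(rng)
        total += reward(factors, policy(factors))
    return total / episodes


if __name__ == "__main__":
    for name, policy in (("entangled", entangled_policy),
                         ("disentangled", disentangled_policy)):
        src = average_reward(policy, sample_source)
        tgt = average_reward(policy, sample_target)
        print(f"{name}: source={src:.3f} target={tgt:.3f}")
```

Under these assumptions both policies earn the same average reward on the source sampler, but only the object-based policy keeps that reward on the target sampler. The room-based policy picks up apples in blue rooms, so its average reward turns negative without any fine-tuning.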