
Modular Multitask Reinforcement Learning with Policy Sketches

Jacob Andreas 1 Dan Klein 1 Sergey Levine 1

Abstract

We describe a framework for multitask deep reinforcement learning guided by policy sketches. Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them—specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). To learn from sketches, we present a model that associates every subtask with a modular subpolicy, and jointly maximizes reward over full task-specific policies by tying parameters across shared subpolicies. Optimization is accomplished via a decoupled actor–critic training objective that facilitates learning common behaviors from multiple dissimilar reward functions. We evaluate the effectiveness of our approach in three environments featuring both discrete and continuous control, and with sparse rewards that can be obtained only after completing a number of high-level subgoals. Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.

1. Introduction

This paper describes a framework for learning composable deep subpolicies in a multitask setting, guided only by abstract sketches of high-level behavior. General reinforcement learning algorithms allow agents to solve tasks in complex environments.

1 University of California, Berkeley. Correspondence to: Jacob Andreas <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Figure 1: Learning from policy sketches. The figure shows simplified versions of two tasks (make planks and make sticks), each associated with its own policy (Π_1 and Π_2 respectively). In the diagram, sketch K_1 annotates τ_1 (make planks) with subtasks b_1 (get wood) and b_2 (use workbench), implemented by subpolicies π_1 and π_2; sketch K_2 annotates τ_2 (make sticks) with b_1 (get wood) and b_3 (use toolshed), implemented by π_1 and π_3. These policies share an initial high-level action b_1: both require the agent to get wood before taking it to an appropriate crafting station. Even without prior information about how the associated behavior π_1 should be implemented, knowing that the agent should initially follow the same subpolicy in both tasks is enough to learn a reusable representation of their shared structure.

But tasks featuring extremely delayed rewards or other long-term structure are often difficult to solve with flat, monolithic policies, and a long line of prior work has studied methods for learning hierarchical policy representations (Sutton et al., 1999; Dietterich, 2000; Konidaris & Barto, 2007; Hauser et al., 2008). While unsupervised discovery of these hierarchies is possible (Daniel et al., 2012; Bacon & Precup, 2015), practical approaches often require detailed supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives (Precup, 2000). These depend on state representations simple or structured enough that suitable reward signals can be effectively engineered by hand.

But is such fine-grained supervision actually necessary to achieve the full benefits of hierarchy? Specifically, is it necessary to explicitly ground high-level actions into the representation of the environment? Or is it sufficient to simply inform the learner about the abstract structure of policies, without ever specifying how high-level behaviors should make use of primitive percepts or actions?

To answer these questions, we explore a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are short, ungrounded, symbolic representations of a task that describe its component parts, as illustrated in Figure 1. While symbols might be shared across tasks (get wood appears in sketches for both the make planks and make sticks tasks), the learner is told nothing about what these symbols mean, in terms of either observations or intermediate rewards.

We present an agent architecture that learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy, and jointly optimizes over concatenated task-specific policies by tying parameters across shared subpolicies. We find that this architecture can use the high-level guidance provided by sketches, without any grounding or concrete definition, to dramatically accelerate learning of complex multi-stage behaviors. Our experiments indicate that many of the benefits to learning that come from highly detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level supervision (i.e. from policy sketches). Crucially, sketches are much easier to produce: they require no modifications to the environment dynamics or reward function, and can be easily provided by non-experts. This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. We show that our approach substantially outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance about how hierarchies should be deployed, and further that the specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.

The present work may be viewed as an extension of recent approaches for learning compositional deep architectures from structured program descriptors (Andreas et al., 2016; Reed & de Freitas, 2016). Here we focus on learning in interactive environments. This extension presents a variety of technical challenges, requiring analogues of these methods that can be trained from sparse, non-differentiable reward signals without demonstrations of desired system behavior.

Our contributions are:

• A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.

• A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.

The modular structure of our approach, which associates every high-level action symbol with a discrete subpolicy, naturally induces a library of interpretable policy fragments that are easily recombined. This makes it possible to evaluate our approach under a variety of different data conditions: (1) learning the full collection of tasks jointly via reinforcement, (2) in a zero-shot setting where a policy sketch is available for a held-out task, and (3) in an adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task. In all cases, our approach substantially outperforms previous approaches based on explicit decomposition of the Q function along subtasks (Parr & Russell, 1998; Vogel & Jurafsky, 2010), unsupervised option discovery (Bacon & Precup, 2015), and several standard policy gradient baselines.

We consider three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the proper order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors; and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.

In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. We have released code at http://github.com/jacobandreas/psketch.

2. Related Work

The agent representation we describe in this paper belongs to the broader family of hierarchical reinforcement learners. As detailed in Section 3, our approach may be viewed as an instantiation of the options framework first described by Sutton et al. (1999). A large body of work describes techniques for learning options and related abstract actions, in both single- and multitask settings. Most techniques for learning options rely on intermediate supervisory signals, e.g. to encourage exploration (Kearns & Singh, 2002) or completion of pre-defined subtasks (Kulkarni et al., 2016). An alternative family of approaches employs post-hoc analysis of demonstrations or pretrained policies to extract reusable sub-components (Stolle & Precup, 2002; Konidaris et al., 2011; Niekum et al., 2015). Techniques for learning options with less guidance than the present work include Bacon & Precup (2015) and Vezhnevets et al. (2016), and other general hierarchical policy learners include Daniel et al. (2012), Bakker & Schmidhuber (2004) and Menache et al. (2002). We will see that the minimal supervision provided by policy sketches results in (sometimes dramatic) improvements over fully unsupervised approaches, while being substantially less onerous for humans to provide compared to the grounded supervision (such as explicit subgoals or feature abstraction hierarchies) used in previous work.

Once a collection of high-level actions exists, agents are faced with the problem of learning meta-level (typically semi-Markov) policies that invoke appropriate high-level actions in sequence (Precup, 2000). The learning problem we describe in this paper is in some sense the direct dual to the problem of learning these meta-level policies: there, the agent begins with an inventory of complex primitives and must learn to model their behavior and select among them; here we begin knowing the names of appropriate high-level actions but nothing about how they are implemented, and must infer implementations (but not, initially, abstract plans) from context. Our model can be combined with these approaches to support a "mixed" supervision condition where sketches are available for some tasks but not others (Section 4.5).

Another closely related line of work is the Hierarchical Abstract Machines (HAM) framework introduced by Parr & Russell (1998). Like our approach, HAMs begin with a representation of a high-level policy as an automaton (or a more general computer program; Andre & Russell, 2001; Marthi et al., 2004) and use reinforcement learning to fill in low-level details. Because these approaches attempt to learn a single representation of the Q function for all subtasks and contexts, they require extremely strong formal assumptions about the form of the reward function and state representation (Andre & Russell, 2002) that the present work avoids by decoupling the policy representation from the value function. They perform less effectively when applied to arbitrary state representations where these assumptions do not hold (Section 4.3). We are additionally unaware of past work showing that HAM automata can be automatically inferred for new tasks given a pre-trained model, while here we show that it is easy to solve the corresponding problem for sketch followers (Section 4.5).

Our approach is also inspired by a number of recent efforts toward compositional reasoning and interaction with structured deep models. Such models have been previously used for tasks involving question answering (Iyyer et al., 2014; Andreas et al., 2016) and relational reasoning (Socher et al., 2012), and more recently for multi-task, multi-robot transfer problems (Devin et al., 2016). In the present work—as in existing approaches employing dynamically assembled modular networks—task-specific training signals are propagated through a collection of composed discrete structures with tied weights. Here the composed structures specify time-varying policies rather than feedforward computations, and their parameters must be learned via interaction rather than direct supervision. Another closely related family of models includes neural programmers (Neelakantan et al., 2015) and programmer–interpreters (Reed & de Freitas, 2016), which generate discrete computational structures but require supervision in the form of output actions or full execution traces.

We view the problem of learning from policy sketches as complementary to the instruction following problem studied in the natural language processing literature. Existing work on instruction following focuses on mapping from natural language strings to symbolic action sequences that are then executed by a hard-coded interpreter (Branavan et al., 2009; Chen & Mooney, 2011; Artzi & Zettlemoyer, 2013; Tellex et al., 2011). Here, by contrast, we focus on learning to execute complex actions given symbolic representations as a starting point. Instruction following models may be viewed as joint policies over instructions and environment observations (so their behavior is not defined in the absence of instructions), while the model described in this paper naturally supports adaptation to tasks where no sketches are available. We expect that future work might combine the two lines of research, bootstrapping policy learning directly from natural language hints rather than the semi-structured sketches used here.

3. Learning Modular Policies from Sketches

We consider a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple (S, A, P, γ), with S a set of states, A a set of low-level actions, P : S × A × S → R a transition probability distribution, and γ a discount factor. Each task τ ∈ T is then specified by a pair (R_τ, ρ_τ), with R_τ : S → R a task-specific reward function and ρ_τ : S → R an initial distribution over states. For a fixed sequence {(s_i, a_i)} of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state s_i as

$$q_i := \sum_{j=i+1}^{\infty} \gamma^{\,j-i-1} R(s_j).$$

In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches K_τ, each consisting of a sequence (b_{τ1}, b_{τ2}, ...) of high-level symbolic labels drawn from a fixed vocabulary B.
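To make this bookkeeping concrete, the following minimal Python sketch shows one way tasks and their sketch annotations might be represented. The Task fields, the example task names (borrowed from the crafting domain), and the trivial reward/initial-state stubs are illustrative assumptions, not the data structures of the released implementation.

```python
# A minimal, hypothetical representation of tasks tau and sketches K_tau.
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass(frozen=True)
class Task:
    name: str                           # identifies tau
    sketch: Sequence[str]               # K_tau: sequence of symbols b drawn from B
    reward_fn: Callable[[Any], float]   # R_tau: maps a state to a scalar reward
    init_state_fn: Callable[[], Any]    # rho_tau: samples an initial state

TASKS = [
    Task("make planks", ("get wood", "use workbench"),
         reward_fn=lambda s: 0.0, init_state_fn=lambda: None),
    Task("make sticks", ("get wood", "use toolshed"),
         reward_fn=lambda s: 0.0, init_state_fn=lambda: None),
]

# The shared vocabulary B is the union of all symbols appearing in any sketch.
VOCAB = sorted({b for t in TASKS for b in t.sketch})
```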

3.1. Model

We exploit the structural information provided by sketches by constructing for each symbol b a corresponding subpolicy π_b. By sharing each subpolicy across all tasks annotated with the corresponding symbol, our approach naturally learns the shared abstraction for the corresponding subtask, without requiring any information about the grounding of that task to be explicitly specified by annotation.


Algorithm 1 TRAIN-STEP(Π, curriculum)
1: D ← ∅
2: while |D| < D do
3:     τ ∼ curriculum(·)    // sample task τ from curriculum (Section 3.3)
4:     d = {(s_i, a_i, (b_i = K_{τ,i}), q_i, τ), ...} ∼ Π_τ    // do rollout
5:     D ← D ∪ d
6: for b ∈ B, τ ∈ T do    // update parameters
7:     d = {(s_i, a_i, b′, q_i, τ′) ∈ D : b′ = b, τ′ = τ}
8:     θ_b ← θ_b + (α/D) Σ_d ∇_{θ_b} log π_b(a_i | s_i) (q_i − c_τ(s_i))    // update subpolicy
9:     η_τ ← η_τ + (β/D) Σ_d ∇_{η_τ} c_τ(s_i) (q_i − c_τ(s_i))    // update critic

At each timestep, a subpolicy may select either a low-level action a ∈ A or a special STOP action. We denote the augmented action space A+ := A ∪ {STOP}. At a high level, this framework is agnostic to the implementation of subpolicies: any function that maps a representation of the current state onto a distribution over A+ will do.

In this paper, we focus on the case where each π_b is represented as a neural network.¹ These subpolicies may be viewed as options of the kind described by Sutton et al. (1999), with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead implicitly using the STOP action to terminate).

Given a fixed sketch (b_1, b_2, ...), a task-specific policy Π_τ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index i (initially 0), and executes actions from π_{b_i} until the STOP symbol is emitted, at which point control is passed to π_{b_{i+1}}. We may thus think of Π_τ as inducing a Markov chain over the state space S × B, with transitions:

$$(s, b_i) \to (s', b_i) \quad \text{with probability} \;\; \sum_{a \in A} \pi_{b_i}(a \mid s)\, P(s' \mid s, a),$$
$$(s, b_i) \to (s, b_{i+1}) \quad \text{with probability} \;\; \pi_{b_i}(\text{STOP} \mid s).$$

Note that Π_τ is semi-Markov with respect to projection of the augmented state space S × B onto the underlying state space S. We denote the complete family of task-specific policies Π := ∪_τ {Π_τ}, and let each π_b be an arbitrary function of the current environment state parameterized by some weight vector θ_b.

¹ For ease of presentation, this section assumes that these subpolicy networks are independently parameterized. As described in Section 4.2, it is also possible to share parameters between subpolicies, and introduce discrete subtask structure by way of an embedding of each symbol b.

Algorithm 2 TRAIN-LOOP()
1: Π = INIT()    // initialize subpolicies randomly
2: ℓ_max ← 1
3: loop
4:     r_min ← −∞
5:     T′ = {τ ∈ T : |K_τ| ≤ ℓ_max}    // initialize ℓ_max-step curriculum uniformly
6:     curriculum(·) = Unif(T′)
7:     while r_min < r_good do
8:         TRAIN-STEP(Π, curriculum)    // update parameters (Algorithm 1)
9:         curriculum(τ) ∝ 1[τ ∈ T′] (1 − Êr_τ)   ∀τ ∈ T
10:        r_min ← min_{τ ∈ T′} Êr_τ
11:    ℓ_max ← ℓ_max + 1

The learning problem is to optimize over all θ_b to maximize expected discounted reward

$$J(\Pi) := \sum_\tau J(\Pi_\tau) := \sum_\tau \mathbb{E}_{s_i \sim \Pi_\tau}\Big[\sum_i \gamma^i R_\tau(s_i)\Big]$$

across all tasks τ ∈ T.
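As a concrete illustration of how a task-specific policy Π_τ executes its sketch, the following Python sketch runs a single rollout by sampling from the active subpolicy and advancing to the next symbol whenever STOP is sampled. The `subpolicies` mapping, the dictionary-valued action distributions, and the `env.reset()`/`env.step()` interface are assumptions made for illustration, not the paper's released implementation.

```python
import numpy as np

STOP = "STOP"  # the special termination action appended to the low-level action set A

def run_task_policy(sketch, subpolicies, env, max_steps=1000):
    """Execute the concatenated policy Pi_tau induced by a sketch (a simplified sketch).

    Assumes `subpolicies[b](state)` returns a distribution over A+ as a dict
    {action: probability}, and `env` exposes hypothetical reset()/step() methods.
    """
    state = env.reset()
    trajectory = []
    i = 0                                    # index of the active subpolicy
    for _ in range(max_steps):
        if i >= len(sketch):                 # all subtasks finished
            break
        b = sketch[i]
        dist = subpolicies[b](state)
        actions, probs = zip(*dist.items())
        idx = np.random.choice(len(actions), p=np.asarray(probs))
        action = actions[idx]
        if action == STOP:                   # control passes to the next subpolicy
            i += 1
            continue
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, b, reward))
        state = next_state
        if done:
            break
    return trajectory
```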

3.2. Policy Optimization

Here, that optimization is accomplished via a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy π with parameters θ, we compute gradient steps of the form (Williams, 1992):

$$\nabla_\theta J(\pi) = \sum_i \nabla_\theta \log \pi(a_i \mid s_i)\,\big(q_i - c(s_i)\big), \qquad (1)$$

where the baseline or "critic" c can be chosen independently of the future without introducing bias into the gradient. Recalling our previous definition of q_i as the empirical return starting from s_i, this form of the gradient corresponds to a generalized advantage estimator (Schulman et al., 2015a) with λ = 1. Here c achieves close to the optimal variance (Greensmith et al., 2004) when it is set exactly equal to the state-value function V_π(s_i) = E_π q_i for the target policy π starting in state s_i.

Figure 2: Model overview. Each subpolicy π is uniquely associated with a symbol b, implemented as a neural network that maps from a state s_i to a distribution over A+, and chooses an action a_i by sampling from this distribution. Whenever the STOP action is sampled, control advances to the next subpolicy in the sketch.

The situation becomes slightly more complicated when generalizing to modular policies built by sequencing subpolicies. In this case, we will have one subpolicy per symbol but one critic per task. This is because subpolicies π_b might participate in a number of composed policies Π_τ, each associated with its own reward function R_τ. Thus individual subpolicies are not uniquely identified with value functions, and the aforementioned subpolicy-specific state-value estimator is no longer well-defined. We extend the actor–critic method to incorporate the decoupling of policies from value functions by allowing the critic to vary per-sample (that is, per-task-and-timestep) depending on the reward function with which the sample is associated. Noting that $\nabla_{\theta_b} J(\Pi) = \sum_{\tau : b \in K_\tau} \nabla_{\theta_b} J(\Pi_\tau)$, i.e. the sum of gradients of expected rewards across all tasks in which π_b participates, we have:

$$\nabla_\theta J(\Pi) = \sum_\tau \nabla_\theta J(\Pi_\tau) = \sum_\tau \sum_i \nabla_{\theta_b} \log \pi_b(a_{\tau i} \mid s_{\tau i})\,\big(q_i - c_\tau(s_{\tau i})\big), \qquad (2)$$

where each state–action pair (s_{τi}, a_{τi}) was selected by the subpolicy π_b in the context of the task τ.

Now minimization of the gradient variance requires that each c_τ actually depend on the task identity. (This follows immediately by applying the corresponding argument in Greensmith et al. (2004) individually to each term in the sum over τ in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here we allow these c_τ to be implemented with an arbitrary function approximator with parameters η_τ. This is trained to minimize a squared error criterion, with gradients given by

$$\nabla_{\eta_\tau} \Big[ -\frac{1}{2} \sum_i \big( q_i - c_\tau(s_i) \big)^2 \Big] = \sum_i \nabla_{\eta_\tau} c_\tau(s_i)\,\big( q_i - c_\tau(s_i) \big). \qquad (3)$$

Alternative forms of the advantage estimator (e.g. the TD residual R_τ(s_i) + γ V_τ(s_{i+1}) − V_τ(s_i), or any other member of the generalized advantage estimator family) can be easily substituted by simply maintaining one such estimator per task. Experiments (Section 4.4) show that conditioning on both the state and the task identity results in noticeable performance improvements, suggesting that the variance reduction provided by this objective is important for efficient joint learning of modular policies.

The complete procedure for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is specified in Algorithm 2.) This is an on-policy algorithm. In each step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies Π is used to perform rollouts in each sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset D. Once D reaches a maximum size D, it is used to compute gradients w.r.t. both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes α and β in Algorithm 1 can be chosen adaptively using any first-order method.
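For readers who prefer code, the following Python sketch mirrors the parameter updates in Algorithm 1: one REINFORCE-style actor step per subpolicy and one squared-error critic step per task, both using the per-task baseline c_τ. The helper functions (`grad_log_pi`, `critic`, `grad_critic`) and the flat parameter vectors are assumptions made for illustration; the released implementation uses neural network subpolicies trained with RMSProp.

```python
import numpy as np
from collections import defaultdict

def train_step(rollouts, theta, eta, grad_log_pi, critic, grad_critic, alpha, beta):
    """One decoupled actor-critic update in the spirit of Algorithm 1 (a simplified sketch).

    Assumes:
      - `rollouts` is a list of tuples (s, a, b, q, tau) from on-policy rollouts,
        where q is the empirical return from state s;
      - `theta[b]` / `eta[tau]` are parameter vectors for subpolicy pi_b / critic c_tau;
      - `grad_log_pi(theta_b, s, a)` returns the gradient of log pi_b(a|s) w.r.t. theta_b;
      - `critic(eta_tau, s)` returns c_tau(s) and `grad_critic(eta_tau, s)` its gradient.
    """
    groups = defaultdict(list)
    for (s, a, b, q, tau) in rollouts:
        groups[(b, tau)].append((s, a, q))

    n = float(len(rollouts))
    for (b, tau), samples in groups.items():
        d_theta = np.zeros_like(theta[b])
        d_eta = np.zeros_like(eta[tau])
        for (s, a, q) in samples:
            advantage = q - critic(eta[tau], s)           # per-task baseline c_tau
            d_theta += grad_log_pi(theta[b], s, a) * advantage
            d_eta += grad_critic(eta[tau], s) * advantage
        theta[b] += (alpha / n) * d_theta                  # actor step for subpolicy b
        eta[tau] += (beta / n) * d_eta                     # critic step for task tau
    return theta, eta
```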

3.3. Curriculum Learning

For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner's time to focus on "easy" tasks, where many rollouts will result in high reward from which appropriate subpolicy behavior can be inferred. But there is a fundamental tradeoff involved here: if the learner spends too much time on easy tasks before being made aware of the existence of harder ones, it may overfit and learn subpolicies that no longer generalize or exhibit the desired structural properties.

To avoid both of these problems, we use a curriculum learning scheme (Bengio et al., 2009) that allows the model to smoothly scale up from easy tasks to more difficult ones while avoiding overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. We assume that rewards across tasks are normalized with maximum achievable reward 0 < q_i < 1. Let Êr_τ denote the empirical estimate of the expected reward for the current policy on task τ. Then at each timestep, tasks are sampled in proportion to 1 − Êr_τ, which by assumption must be positive.

Intuitively, the tasks that provide the strongest learning signal are those in which (1) the agent does not on average achieve reward close to the upper bound, but (2) many episodes result in high reward. The expected reward component of the curriculum addresses condition (1) by ensuring that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. Experiments show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy (Section 4.4).

The complete curriculum-based training procedure is specified in Algorithm 2. Initially, the maximum sketch length ℓ_max is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. (None of the environments we consider in this paper features any length-1 tasks; in this case, observe that Algorithm 2 will simply advance to length-2 tasks without any parameter updates.) For each setting of ℓ_max, the algorithm uses the current collection of task policies Π to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates Êr_τ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold r_good is exceeded, at which point ℓ_max is incremented and the process repeated over a (now-expanded) collection of tasks.
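A minimal Python sketch of the curriculum's task-sampling rule follows. It assumes the hypothetical Task objects from the earlier sketch, an `est_reward` dictionary holding the Êr_τ estimates in [0, 1), and the current length cap ℓ_max; it is not the released implementation.

```python
import numpy as np

def curriculum_distribution(tasks, est_reward, length_cap):
    """Task-sampling weights for the curriculum of Section 3.3 (a simplified sketch).

    Tasks longer than `length_cap` get zero weight; eligible tasks are weighted
    in proportion to 1 - est_reward, so nearly solved tasks are sampled rarely.
    """
    weights = np.array([
        (1.0 - est_reward[t.name]) if len(t.sketch) <= length_cap else 0.0
        for t in tasks
    ])
    return weights / weights.sum()

def sample_task(tasks, est_reward, length_cap, rng=np.random):
    probs = curriculum_distribution(tasks, est_reward, length_cap)
    return tasks[rng.choice(len(tasks), p=probs)]
```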

4. Experiments

We evaluate the performance of our approach in three environments: a crafting environment, a maze navigation environment, and a cliff traversal environment. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals. They also feature hierarchical structure: most rewards are provided only after the agent has completed two to five high-level actions in the appropriate sequence, without any intermediate goals to indicate progress towards completion.

4.1. Implementation

In all our experiments, we implement each subpolicy as a feedforward neural network with ReLU nonlinearities and a single hidden layer of 128 units, and each critic as a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSProp (Tieleman, 2012) with a step size of 0.001 and gradient clipping to a unit norm. We take the batch size D in Algorithm 1 to be 2000, and set γ = 0.9 in both environments. For curriculum learning, the improvement threshold r_good is 0.8.
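To make the architecture concrete, here is a minimal numpy sketch of a subpolicy network matching the description above (one 128-unit ReLU hidden layer and a softmax over the augmented action set A+). The class name, initialization scale, and interface are illustrative assumptions rather than the released code.

```python
import numpy as np

class SubpolicyNetwork:
    """A subpolicy pi_b as described in Section 4.1: a 128-unit ReLU hidden layer
    followed by a softmax over A+ (low-level actions plus STOP). A minimal sketch."""

    def __init__(self, n_features, n_actions, hidden=128, seed=0):
        rng = np.random.RandomState(seed)
        self.W1 = rng.randn(n_features, hidden) * 0.01
        self.b1 = np.zeros(hidden)
        self.W2 = rng.randn(hidden, n_actions + 1) * 0.01   # +1 output for STOP
        self.b2 = np.zeros(n_actions + 1)

    def action_distribution(self, features):
        h = np.maximum(0.0, features @ self.W1 + self.b1)    # ReLU hidden layer
        logits = h @ self.W2 + self.b2
        logits -= logits.max()                                # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()

    def act(self, features, rng=np.random):
        probs = self.action_distribution(features)
        return rng.choice(len(probs), p=probs)                # index n_actions means STOP
```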

4.2. Environments

The crafting environment (Figure 3a) is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent may interact with objects in the world by facing them and executing a special USE action. Interacting with raw materials initially scattered around the environment causes them to be added to an inventory. Interacting with different crafting stations causes objects in the agent's inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the environment.

Figure 3: Examples from the crafting and cliff environments used in this paper. An additional maze environment is also investigated. (a) In the crafting environment, an agent seeking to pick up the gold nugget in the top corner must first collect wood (1) and iron (2), use a workbench to turn them into a bridge (3), and use the bridge to cross the water (4). The corresponding sketch K for τ (get gold) is (b_1: get wood, b_2: get iron, b_3: use workbench, b_4: get gold). (b) In the cliff environment, the agent must reach a goal position by traversing a winding sequence of tiles without falling off; the sketch K for τ (go to goal) is (b_1: north, b_2: east, b_3: east). Control takes place at the level of individual joint angles; high-level behaviors like "move north" must be learned.

The maze environment (not pictured) corresponds closely to the "light world" described by Konidaris & Barto (2007). The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. Some doors require that the agent first pick up a key to open them. For our experiments, each task corresponds to a goal room (always at the same position relative to the agent's starting position) that the agent must reach by navigating through a sequence of intermediate rooms. The agent has one sensor on each side of its body, which reports the distance to keys, closed doors, and open doors in the corresponding direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.

The cliff environment (Figure 3b) is intended to demonstrate the applicability of our approach to problems involving high-dimensional continuous control. In this environment, a quadrupedal robot (Schulman et al., 2015b) is placed on a variable-length winding path, and must navigate to the end without falling off. This task is designed to provide a substantially more challenging RL problem, because the walker must learn the low-level walking skill before it can make any progress, but it has simpler hierarchical structure than the crafting environment. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.

Figure 4: Comparing modular learning from sketches with standard RL baselines. Modular is the approach described in this paper, while Independent learns a separate policy for each task, Joint learns a shared policy that conditions on the task identity, Q automaton learns a single network to map from states and action symbols to Q values, and Opt–Crit is an unsupervised option learner. Performance for the best iteration of the (off-policy) Q automaton is plotted. Performance is shown in (a) the crafting environment, (b) the maze environment, and (c) the cliff environment. The modular approach is eventually able to achieve high reward on all tasks, while the baseline models perform considerably worse on average.

A listing of tasks and sketches is given in Appendix A.

4.3. Multitask Learning

The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. We aim to explore the differences between the approach described in Section 3 and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically, we compare our modular approach to:

1. Structured hierarchical reinforcement learners:

(a) the fully unsupervised option–critic algorithm of Bacon & Precup (2015)

(b) a Q automaton that attempts to explicitly represent the Q function for each task/subtask combination (essentially a HAM (Andre & Russell, 2002) with a deep state abstraction function)

2. Alternative ways of incorporating sketch data into standard policy gradient methods:

(c) learning an independent policy for each task

(d) learning a joint policy across all tasks, conditioning directly on both environment features and a representation of the complete sketch

The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length-weighted curriculum that has access to all tasks from the beginning of training.

Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks (Appendix A).

Having demonstrated the overall effectiveness of our approach, our remaining experiments explore (1) the importance of various components of the training procedure, and (2) the learned models' ability to generalize or adapt to held-out tasks. For compactness, we restrict our consideration to the crafting domain, which features a larger and more diverse range of tasks and high-level actions.

4.4. Ablations

In addition to the overall modular parameter-tying structure induced by our sketches, the key components of our training procedure are the decoupled critic and the curriculum. Our next experiments investigate the extent to which these are necessary for good performance.

To evaluate the critic, we consider three ablations: (1) removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task; (2) removing the dependence of the model on the task, in which case the baseline is a conventional generalized advantage estimator; and (3) removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach. Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task dependence and state dependence each improves performance on its own; combining them gives the best result.

Figure 5: Training details in the crafting domain. (a) Critics: lines labeled "task" include a baseline that varies with task identity, while lines labeled "state" include a baseline that varies with state identity. Estimating a baseline that depends on both the representation of the current state and the identity of the current task is better than either alone or a constant baseline. (b) Curricula: lines labeled "len" use a curriculum with iteratively increasing sketch lengths, while lines labeled "wgt" sample tasks in inverse proportion to their current reward. Adjusting the sampling distribution based on both task length and current reward improves convergence. (c) Individual task performance. Colors correspond to task length. Sharp steps in the learning curve correspond to increases of ℓ_max in the curriculum.

We also investigate two aspects of our curriculum learning scheme: starting with short examples and moving to long ones, and sampling tasks in inverse proportion to their accumulated reward. Experiments are shown in Figure 5b. Both components help; prioritization by both length and weight gives the best results.

4.5. Zero-shot and Adaptation Learning

In our final experiments, we consider the model's ability to generalize beyond the standard training condition. We first consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and an adaptation setting, in which no sketch is provided and the model must learn the form of a suitable sketch via interaction in the new task.

Model            Multitask   0-shot   Adaptation
Joint               .49        .01        –
Independent         .44         –        .01
Option–Critic       .47         –        .42
Modular (ours)      .89        .77       .76

Table 1: Accuracy and generalization of learned models in the crafting domain. The table shows the task completion rate for each approach after convergence under various training conditions. Multitask is the multitask training condition described in Section 4.3, while 0-Shot and Adaptation are the generalization experiments described in Section 4.5. Our modular approach consistently achieves the best performance.

We hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, we simply form the concatenated policy described by the sketches of the held-out tasks, and repeatedly execute this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, we consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Note that the Independent and Option–Critic models cannot be applied to the zero-shot evaluation, while the Joint model cannot be applied to the adaptation baseline (because it depends on pre-specified sketch features). Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.
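The zero-shot evaluation described above amounts to executing the concatenated policy for a held-out sketch with no parameter updates. The following Python sketch illustrates this under stated assumptions: `rollout_fn` is a hypothetical rollout helper (such as the run_task_policy sketch shown earlier), `env_factory(task)` builds an environment whose episode reward is positive only on success, and the function is not the paper's evaluation code.

```python
def zero_shot_success_rate(task, subpolicies, env_factory, rollout_fn, n_episodes=100):
    """Estimate zero-shot completion rate for a held-out task (Section 4.5 sketch).

    No learning takes place here: the trained subpolicies are composed according
    to task.sketch and simply executed repeatedly.
    """
    successes = 0
    for _ in range(n_episodes):
        env = env_factory(task)
        trajectory = rollout_fn(task.sketch, subpolicies, env)
        total_reward = sum(r for (_, _, _, r) in trajectory)
        successes += int(total_reward > 0)
    return successes / n_episodes
```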

5. Conclusions

We have described an approach for multitask learning of deep policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, we have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalization when further sketches are available, and hierarchical reinforcement learning when they are not. Our work suggests that these sketches, which are easy to produce and require no grounding in the environment, provide an effective scaffold for learning hierarchical policies from minimal supervision.


Acknowledgments

JA is supported by a Facebook fellowship and a Berkeley AI / Huawei fellowship.

References

Andre, David and Russell, Stuart. Programmable reinforcement learning agents. In Advances in Neural Information Processing Systems, 2001.

Andre, David and Russell, Stuart. State abstraction for programmable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2002.

Andreas, Jacob, Rohrbach, Marcus, Darrell, Trevor, and Klein, Dan. Learning to compose neural networks for question answering. In Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2016.

Artzi, Yoav and Zettlemoyer, Luke. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1(1):49–62, 2013.

Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Workshop, 2015.

Bakker, Bram and Schmidhuber, Jürgen. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems, pp. 438–445, 2004.

Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of the International Conference on Machine Learning, pp. 41–48. ACM, 2009.

Branavan, S.R.K., Chen, Harr, Zettlemoyer, Luke S., and Barzilay, Regina. Reinforcement learning for mapping instructions to actions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 82–90. Association for Computational Linguistics, 2009.

Chen, David L. and Mooney, Raymond J. Learning to interpret natural language navigation instructions from observations. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, volume 2, pp. 1–2, 2011.

Daniel, Christian, Neumann, Gerhard, and Peters, Jan. Hierarchical relative entropy policy search. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 273–281, 2012.

Devin, Coline, Gupta, Abhishek, Darrell, Trevor, Abbeel, Pieter, and Levine, Sergey. Learning modular neural network policies for multi-task and multi-robot transfer. arXiv preprint arXiv:1609.07088, 2016.

Dietterich, Thomas G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Greensmith, Evan, Bartlett, Peter L., and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

Hauser, Kris, Bretl, Timothy, Harada, Kensuke, and Latombe, Jean-Claude. Using motion primitives in probabilistic sample-based planning for humanoid robots. In Algorithmic Foundation of Robotics, pp. 507–522. Springer, 2008.

Iyyer, Mohit, Boyd-Graber, Jordan, Claudino, Leonardo, Socher, Richard, and Daumé III, Hal. A neural network for factoid question answering over paragraphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.

Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJCAI, volume 7, pp. 895–900, 2007.

Konidaris, George, Kuindersma, Scott, Grupen, Roderic, and Barto, Andrew. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 2011.

Kulkarni, Tejas D., Narasimhan, Karthik R., Saeedi, Ardavan, and Tenenbaum, Joshua B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.

Marthi, Bhaskara, Lantham, David, Guestrin, Carlos, and Russell, Stuart. Concurrent hierarchical reinforcement learning. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2004.

Menache, Ishai, Mannor, Shie, and Shimkin, Nahum. Q-cut: dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pp. 295–306. Springer, 2002.

Neelakantan, Arvind, Le, Quoc V., and Sutskever, Ilya. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.

Niekum, Scott, Osentoski, Sarah, Konidaris, George, Chitta, Sachin, Marthi, Bhaskara, and Barto, Andrew G. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research, 34(2):131–157, 2015.

Parr, Ron and Russell, Stuart. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1998.

Precup, Doina. Temporal abstraction in reinforcement learning. PhD thesis, 2000.

Reed, Scott and de Freitas, Nando. Neural programmer-interpreters. In Proceedings of the International Conference on Learning Representations, 2016.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015a.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, 2015b.

Socher, Richard, Huval, Brody, Manning, Christopher, and Ng, Andrew. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1201–1211, Jeju, Korea, 2012.

Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212–223. Springer, 2002.

Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

Tellex, Stefanie, Kollar, Thomas, Dickerson, Steven, Walter, Matthew R., Banerjee, Ashis Gopal, Teller, Seth, and Roy, Nicholas. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the National Conference on Artificial Intelligence, 2011.

Tieleman, Tijmen. RMSProp (unpublished), 2012.

Vezhnevets, Alexander, Mnih, Volodymyr, Agapiou, John, Osindero, Simon, Graves, Alex, Vinyals, Oriol, and Kavukcuoglu, Koray. Strategic attentive writer for learning macro-actions. arXiv preprint arXiv:1606.04695, 2016.

Vogel, Adam and Jurafsky, Dan. Learning to follow navigational directions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 806–814. Association for Computational Linguistics, 2010.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.