
Hierarchy Through Composition with Multitask LMDPs

Andrew M. Saxe 1, Adam C. Earle 2, Benjamin Rosman 2 3

1 Center for Brain Science, Harvard University. 2 School of Computer Science and Applied Mathematics, University of the Witwatersrand. 3 Council for Scientific and Industrial Research, South Africa. Correspondence to: Andrew M. Saxe <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Hierarchical architectures are critical to the scalability of reinforcement learning methods. Most current hierarchical frameworks execute actions serially, with macro-actions comprising sequences of primitive actions. We propose a novel alternative to these control hierarchies based on concurrent execution of many actions in parallel. Our scheme exploits the guaranteed concurrent compositionality provided by the linearly solvable Markov decision process (LMDP) framework, which naturally enables a learning agent to draw on several macro-actions simultaneously to solve new tasks. We introduce the Multitask LMDP module, which maintains a parallel distributed representation of tasks and may be stacked to form deep hierarchies abstracted in space and time.

1. Introduction

Real world tasks unfold at a range of spatial and temporal scales, such that learning solely at the finest scale is likely to be slow. Hierarchical reinforcement learning (HRL) (Barto & Mahadevan, 2003; Parr & Russell, 1998; Dietterich, 2000) attempts to remedy this by learning a nested sequence of ever more detailed plans. Hierarchical schemes have a number of desirable properties. Firstly, they are intuitive, as humans seldom plan at the level of raw actions, typically preferring to reason at a higher level of abstraction (Botvinick et al., 2009; Ribas-Fernandes et al., 2011; Solway et al., 2014). Secondly, they constitute one approach to tackling the curse of dimensionality (Bellman, 1957; Howard, 1960). In real world MDPs, the number of states typically grows dramatically with the size of the domain. A similar curse of dimensionality afflicts actions, such that a robot with multiple joints, for instance, must operate in the product space of actions for each joint individually (Mausam & Weld, 2008). Finally, the 'tasks' performed by an agent may come in great variety, and can also suffer from a curse of dimensionality: a robot that learns policies to navigate to each of ten rooms would need to learn 10 choose 2 policies to navigate to the closer of each pair of rooms. Transferring learning across tasks is therefore vital (Taylor & Stone, 2009; Foster & Dayan, 2002; Drummond, 1998; Fernández & Veloso, 2006; Bonarini et al., 2006; Barreto et al., 2016).

Many HRL schemes rely on a serial call/return procedure, in which temporally extended macro-actions or 'options' can call other (macro-)actions. In this sense, these schemes draw on a computer metaphor, in which a serial processor chooses sequential primitive actions, occasionally pushing or popping new macro-actions onto a stack. The influential options framework (Sutton et al., 1999), MAXQ method (Dietterich, 2000; Jonsson & Gómez, 2016) and the hierarchy of abstract machines (Parr & Russell, 1998; Burridge et al., 1999) all share this sequential-execution structure.

Concurrent MDPs (Mausam & Weld, 2008) relax the assumption of serial execution to allow multiple actions to be executed simultaneously at each time step, at the cost of a combinatorial increase in the size of the action space. In this paper, we develop a novel hierarchical scheme with a parallel and distributed execution structure that overcomes the combinatorial increase in action space by exploiting the guaranteed optimal compositionality afforded by the linearly solvable Markov decision process (LMDP) framework (Todorov, 2006; Kappen, 2005).

We present a Multitask LMDP module that executes concurrent blends of previously learned tasks. Here we use the term 'concurrent' or 'parallel' to refer to 'parallel distributed processing' or neural network-like methods that form weighted blends of many simple elements to achieve their aim. In standard schemes, if a robot arm has acquired policies to individually reach two points in some space, this knowledge typically does not aid it in optimally reaching to either point. However, in our scheme, such behaviours can be expressed as different weighted task blends, such that an agent can simultaneously draw on several sub-policies to achieve goals not explicitly represented by any of the individual behaviours, including tasks the agent has never performed before.


Next we show how this Multitask LMDP module can be stacked, resulting in a hierarchy of more abstract modules that communicate distributed representations of tasks between layers. We give a simple theoretical analysis showing that hierarchy can yield qualitative efficiency improvements in learning time and memory. Finally, we demonstrate the operation of the method on a navigation domain, and show that its multitasking ability can speed learning of new tasks compared to a traditional options-based agent.

Our scheme builds on a variety of prior work. Like the options framework (Sutton et al., 1999), we build a hierarchy in time. Similar to MAXQ Value Decomposition (Dietterich, 2000), we decompose a target MDP into a hierarchy of smaller SMDPs which progressively abstract over states. From Feudal RL, we draw the idea of a managerial hierarchy in which higher layers prescribe goals but not details for lower layers (Dayan & Hinton, 1993). Most closely related to our scheme, Jonsson & Gómez (2016) develop a MAXQ decomposition within the LMDP formalism (see Supplementary Material for extended discussion). Our method differs from all of the above approaches in permitting a graded, concurrent blend of tasks at each level, and in developing a uniform, stackable module capable of performing a variety of tasks.

2. The Multitask LMDP: A compositional action module

Our goal in this paper is to describe a flexible action-selection module which can be stacked to form a hierarchy, such that the full action at any given point in time is composed of the concurrent composition of sub-actions within sub-actions. By analogy to perceptual deep networks, restricted Boltzmann machines (RBMs) form a component module from which a deep belief network can be constructed by layerwise stacking (Hinton et al., 2006; Hinton & Salakhutdinov, 2006). We seek a similar module in the context of action or control. This section describes the module, the Multitask LMDP (MLMDP), before turning to how it can be stacked. Our formulation relies on the linearly solvable Markov decision process (LMDP) framework introduced by Todorov (2006) (see also Kappen (2005)). The LMDP differs from the standard MDP formulation in fundamental ways, and enjoys a number of special properties. We first briefly describe the canonical MDP formulation, in order to explain what the switch to the LMDP accomplishes and why it is necessary.

2.1. Canonical MDPs

In its standard formulation, an MDP is a four-tuple M = ⟨S, A, P, R⟩, where S is a set of states, A is a set of discrete actions, P is a transition probability distribution P : S × A × S → [0, 1], and R is an expected instantaneous reward function R : S × A → R. The goal is to determine an optimal policy π : S → A specifying which action to take in each state. This optimal policy can be computed from the optimal value function V : S → R, defined as the expected reward starting in a given state and acting optimally thereafter. The value function obeys the well-known Bellman optimality condition

V(s) = \max_{a \in A} \Big\{ R(s, a) + \sum_{s'} P(s'|s, a)\, V(s') \Big\}. \qquad (1)

This formalism is the basis of most practical and theoretical studies of decision-making under uncertainty and reinforcement learning (Bellman, 1957; Howard, 1960; Sutton & Barto, 1998). See, for instance, (Mnih et al., 2015; Lillicrap et al., 2015; Levine et al., 2016) for recent successes in challenging domains.
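To make the backup in Eqn. (1) concrete, the following minimal sketch (Python/NumPy) runs value iteration on a toy undiscounted chain MDP with an absorbing goal state; the chain layout, the -1 step reward, and all variable names are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

# Illustrative 5-state chain: actions are "left"/"right"; state 4 is an
# absorbing goal. Rewards are -1 per step and 0 at the goal (assumed values).
n_states, actions = 5, ("left", "right")
goal = 4

def step(s, a):
    # Deterministic transition for this toy chain.
    if s == goal:
        return s
    return max(s - 1, 0) if a == "left" else min(s + 1, n_states - 1)

R = lambda s, a: 0.0 if s == goal else -1.0   # expected instantaneous reward

# Value iteration: repeatedly apply the Bellman optimality backup of Eqn. (1).
V = np.zeros(n_states)
for _ in range(100):
    V_new = np.array([
        0.0 if s == goal else max(R(s, a) + V[step(s, a)] for a in actions)
        for s in range(n_states)
    ])
    if np.allclose(V_new, V):
        break
    V = V_new

print(V)   # optimal values for this toy chain: [-4, -3, -2, -1, 0]
```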

For the purposes of a compositional hierarchy of actions, this formulation presents two key difficulties.

1. Mutually exclusive sequential actions. First, the agent's actions are discrete and execute serially. Exactly one (macro-)action operates at any given time point. Hence there is no way to build up an action at a single time point out of several 'subactions' taken in parallel. For example, a control signal for a robotic arm cannot be composed of a control decision for the elbow joint, a control decision for the shoulder joint, and a control decision for the gripper, each taken in parallel and combined into a complete action for a specific time point.

2. Non-composable optimal policies. The maximization in Eqn. (1) over a discrete set of actions is nonlinear. This means that optimal solutions, in general, do not compose in a simple way. Consider two standard MDPs M1 = ⟨S, A, P, R1⟩ and M2 = ⟨S, A, P, R2⟩ which have identical state spaces, action sets, and transition dynamics but differ in their instantaneous rewards R1 and R2. These may be solved independently to yield value functions V1 and V2. But the value function of the MDP M1+2 = ⟨S, A, P, R1 + R2⟩, whose instantaneous rewards are the sum of the first two, is not V1+2 = V1 + V2. In general, there is no simple procedure for deriving V1+2 from V1 and V2; it must be found by solving Eqn. (1) again.

2.2. Linearly Solvable MDPs

The LMDP (Todorov, 2009a;b; Dvijotham & Todorov, 2010) is defined by a three-tuple L = ⟨S, P, R⟩, where S is a set of states, P is a passive transition probability distribution P : S × S → [0, 1], and R is an expected instantaneous reward function R : S → R. The LMDP framework replaces the traditional discrete set of actions A with a continuous probability distribution over next states a : S × S → [0, 1]. That is, the 'control' or 'action' chosen by the agent in state s is a transition probability distribution over next states, a(·|s). The controlled transition distribution may be interpreted either as directly constituting the agent's dynamics, or as a stochastic policy over deterministic actions which affect state transitions (Todorov, 2006; Jonsson & Gómez, 2016). Swapping a discrete action space for a continuous action space is a key change which will allow for concurrently selected 'subactions' and distributed representations.

The LMDP framework additionally posits a specific form for the cost function to be optimized. The instantaneous reward for taking action a(·|s) in state s is

R(s, a) = R(s) - \lambda \, \mathrm{KL}\big(a(\cdot|s) \,\|\, P(\cdot|s)\big), \qquad (2)

where the KL term is the Kullback-Leibler divergence between the selected control transition probability and the passive dynamics. This term implements a control cost, encouraging actions to conform to the natural passive dynamics of a domain. In a cart-pole balancing task, for instance, the passive dynamics might encode the transition structure arising from physics in the absence of control input. Any deviation from these dynamics will require energy input. In more abstract settings, such as navigation in a 2D grid world, the passive dynamics might encode a random walk, expressing the fact that actions cannot transition directly to a far away goal but only move some limited distance in a specific direction. Examples of standard benchmark domains in the LMDP formalism are provided in the Supplementary Material. The parameter λ in Eqn. (2) acts to trade off the relative value between the reward of being in a state and the control cost, and determines the stochasticity of the resulting policies.
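As a concrete illustration of the passive dynamics and the control cost in Eqn. (2), the sketch below (Python/NumPy; the grid size, the value of λ, and the helper name kl_control_cost are assumptions made for illustration) builds random-walk passive dynamics on a small 1D grid and evaluates the KL cost of a controlled distribution biased toward moving right.

```python
import numpy as np

n = 5                      # states 0..4 on a line (illustrative size)
lam = 1.0                  # assumed value of the trade-off parameter lambda

# Passive dynamics P(s'|s): an unbiased random walk to adjacent states.
# Columns index the current state s; rows index the next state s'.
P = np.zeros((n, n))
for s in range(n):
    neighbors = [x for x in (s - 1, s + 1) if 0 <= x < n]
    P[neighbors, s] = 1.0 / len(neighbors)

def kl_control_cost(a_col, p_col, lam=lam):
    """lambda * KL(a(.|s) || P(.|s)) for one state s: the control cost in Eqn. (2)."""
    mask = a_col > 0
    return lam * np.sum(a_col[mask] * np.log(a_col[mask] / p_col[mask]))

s = 2
a = np.zeros(n)
a[[s - 1, s + 1]] = [0.1, 0.9]   # controlled distribution biased toward moving right

print(kl_control_cost(a, P[:, s]))        # > 0: deviating from the passive dynamics is costly
print(kl_control_cost(P[:, s], P[:, s]))  # = 0: following the passive dynamics is free
```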

We consider first-exit problems (see Dvijotham & Todorov (2011) for infinite horizon and other formulations), in which the state space is divided into a set of absorbing boundary states B ⊂ S and non-absorbing interior states I ⊂ S, with S = B ∪ I. In this formulation, an agent acts in a variable length episode that consists of a series of transitions through interior states before a final transition to a boundary state which terminates the episode. The goal is to find the policy a* which maximizes the total expected reward across the episode,

a^* = \arg\max_a \; \mathbb{E}_{\substack{s_{t+1} \sim a(\cdot|s_t) \\ \tau = \min\{t \,:\, s_t \in B\}}} \left\{ \sum_{t=1}^{\tau-1} R(s_t, a) + R(s_\tau) \right\}. \qquad (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the Bellman equation simplifies greatly. In particular, define the desirability function z(s) = e^{V(s)/λ} as the exponentiated cost-to-go function, and define q(s) = e^{R(s)/λ} to be the exponentiated instantaneous rewards. Let N be the number of states, and N_i and N_b be the number of internal and boundary states respectively. Represent z(s) and q(s) with N-dimensional column vectors z and q, and the transition dynamics P(s'|s) with the N-by-N_i matrix P, where the column index corresponds to s and the row index corresponds to s'. Let z_i and z_b denote the partition of z into internal and boundary states, respectively, and similarly for q_i and q_b. Finally, let P_i denote the N_i-by-N_i submatrix of P containing transitions between internal states, and P_b denote the N_b-by-N_i submatrix of P containing transitions from internal states to boundary states.

As shown in Todorov (2009b), the Bellman equation in this setting reduces to

(I - M_i P_i^T)\, z_i = M_i P_b^T z_b \qquad (4)

where M_i = diag(q_i) and, because boundary states are absorbing, z_b = q_b. The exponentiated Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of special properties flow from the linearity of the Bellman equation, which we exploit in the following.

Solving for z_i may be done explicitly as z_i = (I - M_i P_i^T)^{-1} M_i P_b^T z_b, or via the z-iteration method (akin to value iteration),

z_i \leftarrow M_i P_i^T z_i + M_i P_b^T z_b. \qquad (5)
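The following self-contained sketch (Python/NumPy; the chain domain, the -1 interior reward, and λ = 1 are illustrative assumptions) assembles the matrices defined above for a small first-exit problem and solves Eqn. (4) both by the explicit linear solve and by z-iteration, Eqn. (5).

```python
import numpy as np

lam = 1.0
n_i, n_b = 4, 1            # 4 internal states on a chain, 1 absorbing goal state
r_i = -1.0 * np.ones(n_i)  # assumed interior reward of -1 per step
r_b = np.zeros(n_b)        # assumed boundary (goal) reward of 0

# Passive dynamics as an unbiased random walk on the chain 0-1-2-3-goal.
# Columns index the current internal state s; rows index the next state s'.
P_i = np.zeros((n_i, n_i))   # internal -> internal
P_b = np.zeros((n_b, n_i))   # internal -> boundary
P_i[1, 0] = 1.0                       # state 0 can only move right
for s in (1, 2):
    P_i[s - 1, s] = P_i[s + 1, s] = 0.5
P_i[2, 3] = 0.5
P_b[0, 3] = 0.5                       # state 3 may exit to the goal

M_i = np.diag(np.exp(r_i / lam))      # M_i = diag(q_i)
z_b = np.exp(r_b / lam)               # z_b = q_b at absorbing states

# Explicit solve of Eqn. (4): (I - M_i P_i^T) z_i = M_i P_b^T z_b
z_i = np.linalg.solve(np.eye(n_i) - M_i @ P_i.T, M_i @ P_b.T @ z_b)

# z-iteration (Eqn. 5) converges to the same fixed point.
z_it = np.ones(n_i)
for _ in range(1000):
    z_it = M_i @ P_i.T @ z_it + M_i @ P_b.T @ z_b

print(np.allclose(z_i, z_it))   # True
print(lam * np.log(z_i))        # recovered value function V = lambda * log(z)
```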

Finally, the optimal policy may be computed in closed form as

a^*(s'|s) = \frac{P(s'|s)\, z(s')}{G[z](s)}, \qquad (6)

where the normalizing constant G[z](s) = \sum_{s'} P(s'|s)\, z(s'). Detailed derivations of these results are given in (Todorov, 2009a;b; Dvijotham & Todorov, 2011). Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft maximization log(\sum \exp(\cdot)), and the continuous action space enables closed-form computation of the optimal policy.
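A minimal sketch of Eqn. (6) (Python/NumPy; the passive dynamics and desirability values below are made-up toy numbers): given the passive transition matrix and a desirability vector, the optimal controlled distribution simply reweights the passive dynamics by the desirability of each successor state and renormalizes.

```python
import numpy as np

def optimal_policy(P, z):
    """Eqn. (6): a*(s'|s) = P(s'|s) z(s') / G[z](s), with G[z](s) = sum_s' P(s'|s) z(s').

    P has columns indexed by the current state s and rows by the next state s'.
    Returns a matrix of the same shape whose columns are the controlled
    distributions a*(.|s).
    """
    weighted = P * z[:, None]          # scale each row s' by the desirability z(s')
    return weighted / weighted.sum(axis=0, keepdims=True)

# Toy example with made-up numbers: 3 possible next states, 2 controllable states.
P = np.array([[0.0, 0.5],
              [0.5, 0.0],
              [0.5, 0.5]])             # columns sum to 1 (passive dynamics)
z = np.array([1.0, 0.2, 2.0])          # assumed desirability values

a_star = optimal_policy(P, z)
print(a_star)                          # columns sum to 1; mass shifts toward high-z states
```

Because only a reweighting and a renormalization are involved, the controlled policy inherits the support of the passive dynamics: transitions that the passive dynamics forbids remain forbidden under a*.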

Compared to the standard MDP formulation, the LMDP has

1. Continuous concurrent actions. Actions are expressed as transition probabilities over next states, such that these transition probabilities can be influenced by many subtasks operating in parallel, and in a graded fashion.

2. Compositional optimal policies. In the LMDP, linearly blending desirability functions yields the correct composite desirability function (Todorov, 2009a;b). In particular, consider two LMDPs L1 = ⟨S, P, q_i, q_b^1⟩ and L2 = ⟨S, P, q_i, q_b^2⟩ which share their passive dynamics and interior rewards but differ in their boundary rewards. Because Eqn. (4) is linear in z_b, the LMDP whose boundary reward structure is the weighted blend q_b = w_1 q_b^1 + w_2 q_b^2 has desirability function z_i = w_1 z_i^1 + w_2 z_i^2, with no further computation required (see the sketch after this list).
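To check the compositionality property numerically, the sketch below (Python/NumPy; the chain domain with two exits, the rewards, and the weights w1, w2 are illustrative assumptions) solves two first-exit LMDPs that differ only in their boundary rewards and verifies that the task with blended boundary rewards has the correspondingly blended desirability function.

```python
import numpy as np

lam = 1.0
n_i = 3                                # internal states on a chain between two exits
n_b = 2                                # two absorbing boundary states (left, right)

# Random-walk passive dynamics; columns index the current internal state.
P_i = np.array([[0.0, 0.5, 0.0],
                [0.5, 0.0, 0.5],
                [0.0, 0.5, 0.0]])
P_b = np.array([[0.5, 0.0, 0.0],       # leftmost internal state may exit left
                [0.0, 0.0, 0.5]])      # rightmost internal state may exit right

q_i = np.exp(-1.0 / lam) * np.ones(n_i)   # assumed interior reward of -1 per step
M_i = np.diag(q_i)

def solve_z(z_b):
    """Explicit solve of Eqn. (4) for a given exponentiated boundary reward z_b = q_b."""
    return np.linalg.solve(np.eye(n_i) - M_i @ P_i.T, M_i @ P_b.T @ z_b)

# Task 1 rewards reaching the left exit; task 2 rewards the right exit (assumed rewards).
z_b1 = np.exp(np.array([0.0, -5.0]) / lam)
z_b2 = np.exp(np.array([-5.0, 0.0]) / lam)
z1, z2 = solve_z(z_b1), solve_z(z_b2)

# Compositionality: blending boundary rewards blends the desirability functions.
w1, w2 = 0.3, 0.7
z_blend = solve_z(w1 * z_b1 + w2 * z_b2)
print(np.allclose(z_blend, w1 * z1 + w2 * z2))   # True, by linearity of Eqn. (4)
```

This linearity is what the Multitask LMDP module exploits: once a library of basis tasks has been solved, a new task whose boundary rewards are a weighted combination of the basis tasks' boundary rewards is solved by the same weighted combination of their desirability functions, with no further Bellman computation.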

Page 4: Hierarchy Through Composition with Multitask LMDPs

Hierarchy Through Composition with Multitask LMDPs

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

�⇡ \2 ⇡ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

�⇡ \1 ⇡ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

⇡ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

⇡ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

= w1⇥ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

= w1⇥ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

+w2⇥ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

+w2⇥ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

+ · · · + wn⇥ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

+ · · · + wn⇥ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116

to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117

action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118

compositionality.119

The120

3

episode,82

a⇤ = argmaxaE st+1⇠a(·|st)

⌧=min{t:st2B}

(⌧�1X

t=1

R(st, a) + R(s⌧ )

). (3)

Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83

Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84

the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85

neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86

states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87

transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88

row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89

respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90

transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91

from internal states to boundary states.92

As shown in [***], the Bellman equation in this setting reduces to93

(I �QiPi)zi = QiPbzb (4)

where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94

Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95

special properties flow from the linearity of the Bellman equation, which we exploit in the following.96

Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97

(akin to value iteration),98

zi QiPizi + QiPbzb. (5)

Finally, the optimal policy may be computed in closed form as99

a⇤(s0|s) =P (s0|s)Z(s0)

G[Z](s), (6)

where the normalizing constant G[Z](s) =P

s0 P (s0|s)Z(s0). Detailed derivations of these results100

are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101

maximization log(P

exp(·)), and the continuous action space enables closed form computation of102

the optimal policy.103

2.3 Concurrent subactions and distributed representations of tasks104

�⇡ \1 ⇡ (7)

Our hierarchical scheme is built on two key properties of the LMDP.105

1. Continuous concurrent actions106

2. Compositional optimal policies107

The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108

shown in a variety of recent work [***], and in the examples given later, most standard domains can109

be translated to the MDP framework; and there exists a general procedure for embedding traditional110

MDPs in the LMDP framework ?.111

More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112

making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113

the standard MDP formulation is overly general, discarding useful structure in most real world114

domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115

small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116


Figure 1. Distributed task representations with the Multitask LMDP. (a) Example 2DOF arm constrained to the plane with state space consisting of shoulder and elbow joint angles θ1, θ2 ∈ [−π, π] respectively. (b) A novel task is specified as an instantaneous reward function over terminal states. In this example, the task "reach to the black rectangle" is encoded by rewarding any terminal state with the end effector in the black rectangle (all successful configurations shown). (c-g) Solutions via the LMDP. Top row: Instantaneous rewards in the space of joint angles. Middle row: Optimal desirability function with sample trajectories starting at green circles, finishing at red circles. Bottom row: Strobe plot of sample trajectories in Cartesian space. Trajectories start at green circle, end at red circle. Column (c): The linear Bellman equation is solved for this particular instantaneous boundary reward structure to obtain the optimal value function. Columns (d-g): Solution via compositional Multitask LMDP. The instantaneous reward structure is expressed as a weighted combination of previously-learned subtasks (d-f), here chosen to be navigation to specific points in state space. (g) Because of the linearity of the Bellman equation, the resulting combined value function is optimal, and the system can act instantly despite no explicit training on the reach-to-rectangle task. The same fixed basis set can be used to express a wide variety of tasks (reach to a cross; reach to a circle; etc).

Consider two LMDPs L_1 = ⟨S, P, q_i, q_b^1⟩ and L_2 = ⟨S, P, q_i, q_b^2⟩ which have identical state spaces, transition dynamics, and internal reward structures, but differ in their exponentiated boundary rewards q_b^1 and q_b^2. These may be solved independently to yield desirability functions z^1 and z^2. The desirability function of the LMDP L_{1+2} = ⟨S, P, q_i, αq_b^1 + βq_b^2⟩, whose instantaneous rewards are a weighted sum of the first two, is simply z^{1+2} = αz^1 + βz^2. This property follows from the linearity of Eqn. 4, and is the foundation of our hierarchical scheme.
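As a sanity check of this composition property, the following small numpy sketch (illustrative only; function and variable names are hypothetical, and dynamics are stored with rows indexed by the current state) solves a random LMDP for two boundary reward vectors and verifies that the blended task's desirability equals the blend of the individual desirabilities.

```python
import numpy as np

def solve_zi(P_i, P_b, q_i, q_b):
    """Direct solve of Eqn. (4) for the internal-state desirability."""
    Q_i = np.diag(q_i)
    return np.linalg.solve(np.eye(len(q_i)) - Q_i @ P_i, Q_i @ P_b @ q_b)

rng = np.random.default_rng(0)
n_i, n_b = 6, 2
P = rng.random((n_i, n_i + n_b))
P /= P.sum(axis=1, keepdims=True)            # each row is a transition distribution
P_i, P_b = P[:, :n_i], P[:, n_i:]
q_i = np.exp(-0.1 * np.ones(n_i))            # small internal control cost
q_b1, q_b2 = rng.random(n_b), rng.random(n_b)
alpha, beta = 0.7, 0.3

z1 = solve_zi(P_i, P_b, q_i, q_b1)
z2 = solve_zi(P_i, P_b, q_i, q_b2)
z12 = solve_zi(P_i, P_b, q_i, alpha * q_b1 + beta * q_b2)
assert np.allclose(z12, alpha * z1 + beta * z2)   # linearity of Eqn. (4)
```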

2.3. The Multitask LMDP

To build a multitask action module, we exploit this compositionality to build a basis set of tasks (Foster & Dayan, 2002; Ruvolo & Eaton, 2013; Schaul et al., 2015; Pan et al., 2015; Borsa et al., 2016). Suppose that we learn a set of LMDPs L_t = ⟨S, P, q_i, q_b^t⟩, t = 1, · · · , N_t which all share the same state space, passive dynamics, and internal rewards, but differ in their instantaneous exponentiated boundary reward structure q_b^t, t = 1, · · · , N_t. Each of these LMDPs corresponds to a different task, defined by its boundary reward structure, in the same overall state space (see Taylor & Stone (2009) for an overview of this and other notions of transfer and multitask learning). We denote the Multitask LMDP module M formed from these N_t LMDPs as M = ⟨S, P, q_i, Q_b⟩. Here the N_b-by-N_t task basis matrix Q_b = [q_b^1 q_b^2 · · · q_b^{N_t}] encodes a library of component tasks in the MLMDP. Solving each component LMDP yields the corresponding desirability functions z_i^t, t = 1, · · · , N_t for each task, which can also be formed into the N_i-by-N_t desirability basis matrix Z_i = [z_i^1 z_i^2 · · · z_i^{N_t}] for the multitask module.
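A minimal sketch of this construction, reusing the hypothetical `solve_zi` helper from the earlier snippet: each column of Q_b is solved once, and the resulting desirability functions are stacked into Z_i.

```python
import numpy as np

def build_desirability_basis(P_i, P_b, q_i, Q_b):
    """Q_b: (n_b, N_t) matrix whose columns are the basis tasks q_b^t.
    Returns Z_i: (n_i, N_t), one desirability function per component task."""
    return np.column_stack([solve_zi(P_i, P_b, q_i, Q_b[:, t])
                            for t in range(Q_b.shape[1])])
```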

When we encounter a new task defined by a novel instantaneous exponentiated reward structure q, if we can express it as a linear combination of previously learned tasks, q = Q_b w, where w ∈ R^{N_t} is a vector of task blend weights, then we can instantaneously derive its optimal desirability function as z_i = Z_i w. This immediately yields the optimal action through Eqn. 6. Hence an MLMDP agent can act optimally even in never-before-seen tasks, provided that the new task lies in the subspace spanned by previously learned tasks. This approach is an off-policy variant on the successor representation method of Dayan (1993). It differs by yielding the optimal value function, rather than the value function under a particular policy π; and allows arbitrary boundary rewards for the component tasks (the SR implicitly takes these to be a positive reward on just one state). Also related is the successor feature approach of Barreto et al. (2016), which permits performance guarantees for transfer to new tasks. With successor features, new tasks must have rewards that are close in Euclidean distance to prior tasks, whereas here we require only that they lie in the span of prior tasks.

More generally, if the target task q is not an exact linear combination of previously learned tasks, an approximate


task weighting w can be found as

\arg\min_w \|q − Q_b w\| \quad \text{subject to} \quad Q_b w \geq 0.   (7)

The technical requirement Q_b w ≥ 0 is due to the relationship q = exp(r/λ), such that negative values of q are not possible. In practice this can be approximately solved as w = Q_b^† q, where † denotes the pseudoinverse, and negative elements of w are projected back to zero.
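A hedged sketch of this approximate weighting step in numpy (names are hypothetical; the projection used here is the simple clip-to-zero described in the text, not a full constrained solver):

```python
import numpy as np

def task_weights(Q_b, q):
    """Approximate solution of Eqn. (7): w = pinv(Q_b) @ q, with negative
    entries of w projected back to zero."""
    w = np.linalg.pinv(Q_b) @ q
    return np.clip(w, 0.0, None)

# The blended desirability for the new task is then z_i = Z_i @ w, which is
# exactly optimal whenever q lies in the span of the basis tasks.
```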

Here the coefficients of the task blend w constitute a distributed representation of the current task to be performed. Although the set of basis tasks Q_b is fixed and finite, it permits an infinite space of tasks to be performed through their concurrent linear composition. Figure 1 demonstrates this ability of the Multitask LMDP module in the context of a 2D robot arm reaching task. From knowledge of how to reach individual points in space, the module can instantly act optimally to reach to a rectangular region.

3. Stacking the module: Concurrent Hierarchical LMDPs

To build a hierarchy out of this Multitask LMDP module, we construct a stack of MLMDPs in which higher levels select the instantaneous reward structure that defines the current task for lower levels. To take a navigation example, a high level module might specify that the lower level module should reach room A but not B by placing instantaneous rewards in room A but no rewards in room B. Crucially, the fine details of achieving this subgoal can be left to the low-level module. Critical to the success of this hierarchical scheme is the flexible, optimal composition afforded by the Multitask LMDP module: the specific reward structure commanded by the higher layer will often be novel for the lower level, but will still be performed optimally provided it lies in the basis of learned tasks.

3.1. Constructing a hierarchy of MLMDPs

We start with the MLMDP M^1 = ⟨S^1, P^1, q_i^1, Q_b^1⟩ that we must solve, where here the superscript denotes the hierarchy level (Fig. 2). This serves as the base case for our recursive scheme for generating a hierarchy. For the inductive step, given an MLMDP M^l = ⟨S^l, P^l, q_i^l, Q_b^l⟩ at level l, we augment the state space S̃^l = S^l ∪ S_t^l with a set of N_t terminal boundary states S_t^l that we call subtask states. Semantically, entering one of these states will correspond to a decision by the layer l MLMDP to access the next level of the hierarchy. The transitions to subtask states are governed by a new N_t^l-by-N_i^l passive dynamics matrix P_t^l, which is chosen by the designer to encode the structure of the domain. Choosing fewer subtask states than interior states will yield a higher level which operates in a smaller state space, yielding state abstraction.


Figure 2. Stacking Multitask LMDPs. Top: Deep hierarchy for navigation through a 1D corridor. Lower-level MLMDPs are abstracted to form higher-level MLMDPs by choosing a set of 'subtask' states which can be accessed by the lower level (grey lines between levels depict passive subtask transitions P_t^l). Lower levels access these subtask states to indicate completion of a subgoal and to request more information from higher levels; higher levels communicate new subtask state instantaneous rewards, and hence the concurrent task blend, to the lower levels. Red lines indicate higher level access points for one sample trajectory starting from the leftmost state and terminating at the rightmost state. Bottom: Panels (a-c) depict distributed task blends arising from accessing the hierarchy at points denoted in the left panel. The higher layer states accessed are indicated by filled circles. (a) Just the second layer of the hierarchy is accessed, resulting in higher weight on the task to achieve the next subgoal and zero weights on already achieved subgoals. (b) The second and third levels are accessed, yielding new task blends for both. (c) All levels are accessed, yielding task blends at a range of scales.

In the augmented MLMDP, the passive dynamics become P̃^l = N([P_i^l; P_b^l; P_t^l]), where the operation N(·) renormalizes each column to sum to one.
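A minimal numpy sketch of this augmentation step (illustrative only; names are hypothetical, and, as in the text, columns are indexed by the current internal state):

```python
import numpy as np

def augment_passive_dynamics(P_i, P_b, P_t):
    """Stack internal, boundary and subtask passive dynamics and apply N(.),
    i.e. renormalize each column to sum to one."""
    P_stack = np.vstack([P_i, P_b, P_t])                    # [P_i; P_b; P_t]
    return P_stack / P_stack.sum(axis=0, keepdims=True)
```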

It remains to specify the matrix of subtask-state instantaneous rewards R_t^l for this augmented MLMDP. Often hierarchical schemes require designing a pseudoreward function to encourage successful completion of a subtask. Here we also pick a set of reward functions over subtask states; however, the performance of our scheme is only weakly dependent on this choice: we require only that our chosen reward functions form a good basis for the set of subtasks that the higher layer will command. Any set of tasks which can linearly express the required space of reward structures specified by the higher level is suitable. In our experiments, we define N_t^l tasks, one for each subtask state, and set each instantaneous reward to negative values on all but a single 'goal' subtask state. Then the augmented MLMDP is


M̃^l = ⟨S̃^l, P̃^l, q_i^l, [Q_b^l 0; 0 Q_t^l]⟩.

The higher level is itself an MLMDP M^{l+1} = ⟨S_t^l, P^{l+1}, q_i^{l+1}, Q_b^{l+1}⟩, defined not over the entire state space but just over the N_t^l subtask states of the layer below. To construct this, we must compute an appropriate passive dynamics and reward structure. A natural definition for the passive dynamics is the probability of starting at one subtask state and terminating at another under the lower layer's passive dynamics,

P_i^{l+1} = P̃_t^l (I − P̃_i^l)^{-1} P̃_t^{lT},   (8)

P_b^{l+1} = P̃_b^l (I − P̃_i^l)^{-1} P̃_t^{lT}.   (9)

In this way, the higher-level LMDP will incorporate the transition constraints from the layer below. The interior-state reward structure can be similarly defined, as the reward accrued under the passive dynamics from the layer below. However, for simplicity in our implementation, we simply set small negative rewards on all internal states.
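The construction in Eqns. (8)-(9) can be written compactly; the following numpy sketch assumes the augmented dynamics from the previous snippet and uses hypothetical names:

```python
import numpy as np

def higher_layer_passive_dynamics(Pt_i, Pt_b, Pt_t):
    """Eqns. (8)-(9): absorption probabilities from one subtask state to another
    (P_i^{l+1}) and from subtask states to boundary states (P_b^{l+1}) under the
    lower layer's augmented passive dynamics."""
    core = np.linalg.inv(np.eye(Pt_i.shape[0]) - Pt_i)   # (I - P_i^l)^{-1}
    P_next_i = Pt_t @ core @ Pt_t.T                      # Eqn. (8)
    P_next_b = Pt_b @ core @ Pt_t.T                      # Eqn. (9)
    return P_next_i, P_next_b
```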

Hence, from a base MLMDP M^l and subtask transition matrix P_t^l, the above construction yields an augmented MLMDP M̃^l at the same layer and an unaugmented MLMDP M^{l+1} at the next higher layer. This procedure may be iterated to form a deep stack of MLMDPs {M̃^1, M̃^2, · · · , M̃^{D−1}, M^D}, where all but the highest is augmented with subtask states. The key choice for the designer is P_t^l, the transition structure from internal states to subtask states for each layer. Through Eqns. (8)-(9), this matrix specifies the state abstraction used in the next higher layer. Fig. 2 illustrates this scheme for an example of navigation through a 1D corridor.

3.2. Instantaneous rewards and task blends: Communication between layers

Bidirectional communication between layers happens via subtask states and their instantaneous rewards. The higher layer sets the instantaneous boundary rewards over subtask states for the lower layer; and the lower layer signals that it has completed a subtask and needs new guidance from the higher layer by transitioning to a subtask state.

In particular, suppose we have solved a higher-level MLMDP using any method we like, yielding the optimal action a^{l+1}. This will make transitions to some states more likely than they would be under the passive dynamics, indicating that they are more attractive than usual for the current task. It will make other transitions less likely than the passive dynamics, indicating that transitions to these states should be avoided. We therefore define the instantaneous rewards for the subtask states at level l to be proportional to the difference between controlled and passive dynamics at the higher level l + 1,

r_t^l \propto a_i^{l+1}(\cdot|s) − p_i^{l+1}(\cdot|s).   (10)

This effectively inpaints extra rewards for the lower layer, indicating which subtask states are desirable from the perspective of the higher layer.

The lower layer MLMDP then uses its basis of tasks to determine a task weighting w^l which will optimally achieve the reward structure r_t^l specified by the higher layer by solving (7). The reward structure specified by the higher layer may not correspond to any one task learned by the lower layer, but it will nonetheless be performed well by forming a concurrent blend of many different tasks and leveraging the compositionality afforded by the LMDP framework. This scheme may also be interpreted as implementing a parametrized option, but with performance guarantees for any parameter setting (Masson et al., 2016).
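Putting Eqn. (10) and Eqn. (7) together, the downward message is an inpainted reward vector that the lower layer converts into a task blend. A hedged sketch, reusing the hypothetical `task_weights` helper above; the conversion of rewards to exponentiated form via exp(r/λ) is an assumption following the relationship q = exp(r/λ) stated earlier.

```python
import numpy as np

def subtask_rewards(a_next, p_next, scale=1.0):
    """Eqn. (10): rewards proportional to controlled minus passive dynamics
    at the higher level, evaluated over subtask states."""
    return scale * (a_next - p_next)

def lower_layer_blend(Q_t, r_t, lam=1.0):
    """Exponentiate the inpainted rewards (q = exp(r / lambda), an assumption
    here) and solve Eqn. (7) for the lower layer's concurrent task blend w^l."""
    return task_weights(Q_t, np.exp(r_t / lam))
```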

We now describe the execution model (see Supplementary Material for pseudocode listing). The true state of the agent is represented in the base level M̃^1, and next states are drawn from the controlled transition distribution. If the next state is an interior state, one unit of time passes and the state is updated as usual. If the next state is a subtask state, the next layer of the hierarchy is accessed at its corresponding state; no 'real' time passes during this transition. The higher level then draws its next state, and in so doing can access the next level of hierarchy by transitioning to one of its subtask states, and so on. At some point, a level will elect not to access a subtask state; it then transmits its desired rewards from Eqn. (10) to the layer below it. The lower layer then solves its multitask LMDP problem to compute its own optimal actions, and the process continues down to the lowest layer M̃^1 which, after updating its optimal actions, again draws a transition. Lastly, if the next state is a terminal boundary state, the layer terminates itself. This corresponds to a level of the hierarchy determining that it no longer has useful information to convey. Terminating a layer disallows future transitions from the lower layer to its subtask states, and corresponds to inpainting infinite negative rewards onto the lower level subtask states.
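The paper's actual pseudocode is in its Supplementary Material; the listing below is only a rough Python-style sketch of the execution loop just described, with entirely hypothetical layer objects and method names.

```python
def run_episode(layers, s0):
    """layers[0] is the base MLMDP; each layer is assumed to expose
    draw_next(), is_subtask(), is_terminal(), desired_subtask_rewards(),
    inpaint_rewards() and resolve_blend() (a hypothetical interface)."""
    s = s0
    while True:
        s_next = layers[0].draw_next(s)                   # controlled transition
        if layers[0].is_terminal(s_next):
            return s_next                                 # episode ends
        if layers[0].is_subtask(s_next):
            # Access higher layers; no 'real' time passes during these steps.
            level, h = 1, s_next
            while level < len(layers) - 1:
                h = layers[level].draw_next(h)
                if not layers[level].is_subtask(h):
                    break
                level += 1
            # Each accessed level transmits its desired rewards downward.
            for l in range(level, 0, -1):
                r_t = layers[l].desired_subtask_rewards() # Eqn. (10)
                layers[l - 1].inpaint_rewards(r_t)
                layers[l - 1].resolve_blend()             # solve Eqn. (7)
            continue                                      # base layer redraws
        s = s_next                                        # one unit of time passes
```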

4. Computational complexity advantages of hierarchical decomposition

To concretely illustrate the value of hierarchy, consider navigation through a 1D ring of N states, where the agent must perform N different tasks corresponding to navigating to each particular state. We take the passive dynamics to be local (a nonzero probability of transitioning just to adjacent states in the ring, or remaining still). In one step of Z iteration (Eqn. 5), the optimal value function progresses at best O(1) states per iteration because of the local passive dynamics (see Precup et al. (1998) for a similar argument). It


therefore requires O(N) iterations in a flat implementation for a useful value function signal to arrive at the furthest point in the ring for each task. As there are N tasks, the flat implementation requires O(N^2) iterations to learn all of them.

Instead suppose that we construct a hierarchy by placing a subtask every M = log N states, and do this recursively to form D layers. The recursion terminates when N/M^D ≈ 1, yielding D ≈ log_M N. With the correct higher level policy sequencing subtasks, each policy at a given layer only needs to learn to navigate between adjacent subtasks, which are no more than M states apart. Hence Z iteration can be terminated after O(M) iterations. At level l = 1, 2, · · · , D of the hierarchy, there are N/M^l subtasks, and N/M^{l−1} boundary reward tasks to learn. Overall this yields

\sum_{l=1}^{D} M \left( \frac{N}{M^l} + \frac{N}{M^{l-1}} \right) \approx O(N \log N)

total iterations (see Supplementary Material for derivation and numerical verification). A similar analysis shows that this advantage holds for memory requirements as well. The flat scheme requires O(N^2) nonzero elements of Z to encode all tasks, while the hierarchical scheme requires only O(N log N). Hence hierarchical decomposition can yield qualitatively more efficient solutions and resource requirements, reminiscent of theoretical results obtained for perceptual deep learning (Bengio, 2009; Bengio & LeCun, 2007). We note that this advantage only occurs in the multitask setting: the flat scheme can learn one specific task in time O(N). Hence hierarchy is beneficial when performing an ensemble of tasks, due to the reuse of component policies across many tasks (see also Solway et al. (2014)).
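As a rough numerical illustration of the estimate above (a sketch only; the constants are not meaningful, and the choice M ≈ log N follows the text):

```python
import math

def hierarchical_iters(N, M):
    """Sum over layers of M * (N / M**l + N / M**(l-1)) with D ~ log_M(N)."""
    D = max(1, round(math.log(N, M)))
    return sum(M * (N / M**l + N / M**(l - 1)) for l in range(1, D + 1))

for N in (10**3, 10**4, 10**5):
    M = max(2, round(math.log(N)))
    # hierarchical iteration count vs. N log N for comparison
    print(N, round(hierarchical_iters(N, M)), round(N * math.log(N)))
```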

5. Experiments

5.1. Conceptual Demonstration

To illustrate the operation of our scheme, we apply it to a 2D grid-world 'rooms' domain (Fig. 3(a)). The agent is required to navigate through an environment consisting of four rooms with obstacles to a goal location in one of the rooms. The agent can move in the four cardinal directions or remain still. To build a hierarchy, we place six higher layer subtask goal locations throughout the domain (Fig. 3(a), red dots). The inferred passive dynamics for the higher layer MLMDP is shown in Fig. 3(a) as weighted lines between these subtask states. The higher layer passive dynamics conform to the structure of the problem, with the probability of transition between higher layer states roughly proportional to the distance between those states at the lower layer.

As a basic demonstration that the hierarchy conveys useful information, Fig. 3(b) shows Z-learning curves for the base layer policy with and without an omniscient hierarchy


Figure 3. Rooms domain. (a) Four room domain with subtask locations marked as red dots, free space in white and obstacles in black, and derived higher-level passive dynamics shown as weighted links between subtasks. (b) Convergence as trajectory length over learning epochs with and without the help of the hierarchy. (c) Sample trajectory (gray line) from upper left to goal location in bottom right. (d) Evolution of distributed task weights on each subtask location over the course of the trajectory in panel (c).

(i.e., one preinitialized using Z-iteration). From the start of learning, the hierarchy is able to drive the agent to the vicinity of the goal. Fig. 3(c-d) illustrates the evolving distributed task representation commanded by the higher layer to the lower layer over the course of a trajectory. At each time point, several subtasks have nonzero weight, highlighting the concurrent execution in the system.


Figure 4. Desirability functions. (a) Desirability function over states as the agent moves through the environment, showing the effect of reward inpainting from the hierarchy. (b) Desirability function over states as the goal location moves. Higher layers inpaint rewards into subtasks that will move the agent nearer the goal.

Fig. 4(a) shows the composite desirability function resulting from the concurrent task blend for different agent locations. Fig. 4(b) highlights the multitasking ability of the system,


showing the composite desirability function as the goal location is moved. The leftmost panel, for instance, shows that rewards are painted concurrently into the upper left and bottom right rooms, as these are both equidistant to the goal.

5.2. Quantitative Comparison

In order to quantitatively demonstrate the performance improvements possible with our method, we consider the problem of a mobile robot navigating through an office block in search of a charging station. This domain is shown in Fig. 5, and is taken from previous experiments in transfer learning (Fernández & Veloso, 2006).

Although the agent again moves in one of the four cardinal directions at each time step, the agent is additionally provided with a policy to navigate to each room in the office block, with goal locations marked by an 'S' in Fig. 5. The goal of the agent is to learn a policy to navigate to the nearest charging station from any location in the office block, given that there are a number of different charging stations available (one per room, but only in some of the rooms). This corresponds to an 'OR' navigation problem: navigating to location A OR B, while knowing separately a policy to get to A, and a policy to get to B. Concretely, the problem is to navigate to the nearest of two unknown subtask locations. The agent is randomly initialized at interior states, and the trajectory lengths are capped at 500 steps.

Figure 5. The office building domain. Each 'subtask' or 'option' policy navigates the agent to one of the rooms.

We compare a one-layer deep implementation of our method to a simple implementation of the options framework (Sutton et al., 1999). In the options framework the agent's action space is augmented with the full set of option policies. The initialization set for these options is the full state space, so that any option may be executed at any time. The termination condition is defined such that the option terminates only when it reaches its goal state. To minimize the action space for the options agent, we remove the primitive actions, reducing the learning problem to simply choosing the single correct option from each state. The options learning problem is solved using Q-learning with sigmoidal learning rate decrease and ε-greedy exploitation. These parameters were optimized on a coarse grid to yield the fastest learning curves. Results are averaged over 20 runs.


Figure 6. Learning rates. Our method jump-starts performance in the OR task, while the options agent must learn a state-action mapping.

Fig. 6 shows that our agent receives a significant jump-start in learning. By leveraging the distributed representation provided by our scheme, the agent is required only to learn when to request information from the higher layer. Although the task is novel, the higher layer can express it as a combination of prior tasks. Conversely, the options agent must learn a unique mapping between all states and actions. This issue is exacerbated as the number of available options grows (Rosman & Ramamoorthy, 2012).

6. Conclusion

The Multitask LMDP module provides a novel approach to control hierarchies, based on a distributed representation of tasks and parallel execution. Rather than learn to perform one task or a fixed library of tasks, it exploits the compositionality provided by linearly solvable Markov decision processes to perform an infinite space of task blends optimally. Stacking the module yields a deep hierarchy abstracted in state space and time, with the potential for qualitative efficiency improvements.

Experimentally, we have shown that the distributed representation provided by our framework can speed up learning in a simple navigation task, by representing a new task as a combination of prior tasks.

While a variety of sophisticated reinforcement learning methods have made use of deep networks as capable function approximators (Mnih et al., 2015; Lillicrap et al., 2015; Levine et al., 2016), in this work we have sought to transfer some of the underlying intuitions, such as parallel distributed representations and stackable modules, to the control setting. In the future this may allow other elements of the deep learning toolkit to be brought to bear in this setting, most notably gradient-based learning of the subtask structure itself.


Acknowledgements

We thank our reviewers for thoughtful comments. AMS acknowledges support from the Swartz Program in Theoretical Neuroscience at Harvard. AE and BR acknowledge support from the National Research Foundation of South Africa.

References

Barreto, A., Munos, R., Schaul, T., and Silver, D. Successor Features for Transfer in Reinforcement Learning. arXiv, 2016.

Barto, A.G. and Madadevan, S. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems: Theory and Applications, (13):41–77, 2003.

Bellman, R.E. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Bengio, Y. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (eds.), Large-Scale Kernel Machines. MIT Press, 2007.

Bonarini, A., Lazaric, A., and Restelli, M. Incremental Skill Acquisition for Self-motivated Learning Animats. In Nolfi, S., Baldassare, G., Calabretta, R., Hallam, J., Marocco, D., Miglino, O., Meyer, J.-A., and Parisi, D. (eds.), Proceedings of the Ninth International Conference on Simulation of Adaptive Behavior (SAB-06), volume 4095, pp. 357–368, Heidelberg, 2006. Springer Berlin.

Borsa, D., Graepel, T., and Shawe-Taylor, J. Learning Shared Representations in Multi-task Reinforcement Learning. arXiv, 2016.

Botvinick, M.M., Niv, Y., and Barto, A.C. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition, 113(3):262–80, 12 2009.

Burridge, R.R., Rizzi, A.A., and Koditschek, D.E. Sequential Composition of Dynamically Dexterous Robot Behaviors. The International Journal of Robotics Research, 18(6):534–555, 6 1999.

Dayan, P. Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation, 5(4):613–624, 7 1993.

Dayan, P. and Hinton, G. Feudal Reinforcement Learning. In NIPS, 1993.

Dietterich, T.G. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Drummond, C. Composing functions to speed up reinforcement learning in a changing world. In Nédellec, C. and Rouveirol, C. (eds.), Machine Learning: ECML-98, pp. 370–381, Heidelberg, 1998. Springer Berlin.

Dvijotham, K. and Todorov, E. Inverse Optimal Control with Linearly-Solvable MDPs. In ICML, 2010.

Dvijotham, K. and Todorov, E. A unified theory of linearly solvable optimal control. Uncertainty in Artificial Intelligence, 2011.

Fernández, F. and Veloso, M. Probabilistic policy reuse in a reinforcement learning agent. Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, pp. 720, 2006.

Foster, D. and Dayan, P. Structure in the Space of Value Functions. Machine Learning, (49):325–346, 2002.

Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–7, 7 2006.

Hinton, G.E., Osindero, S., and Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18:1527–1554, 2006.

Howard, R.A. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.

Jonsson, A. and Gómez, V. Hierarchical Linearly-Solvable Markov Decision Problems. In ICAPS, 2016.

Kappen, H.J. Linear Theory for Control of Nonlinear Stochastic Systems. Physical Review Letters, 95(20):200201, 11 2005.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research, 17:1–40, 2016.

Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv, 2015.

Masson, W., Ranchod, P., and Konidaris, G.D. Reinforcement Learning with Parameterized Actions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1934–1940, 9 2016.

Mausam and Weld, D.S. Planning with Durative Actions in Stochastic Domains. Journal of Artificial Intelligence Research, 31:33–82, 2008.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Pan, Y., Theodorou, E.A., and Kontitsis, M. Sample Efficient Path Integral Control under Uncertainty. In NIPS, 2015.

Parr, R. and Russell, S. Reinforcement learning with hierarchies of machines. In NIPS, 1998.

Precup, D., Sutton, R., and Singh, S. Theoretical results on reinforcement learning with temporally abstract options. In ECML, 1998.

Ribas-Fernandes, J.J.F., Solway, A., Diuk, C., McGuire, J.T., Barto, A.G., Niv, Y., and Botvinick, M.M. A neural signature of hierarchical reinforcement learning. Neuron, 71(2):370–9, 7 2011.


Rosman, B. and Ramamoorthy, S. What good are actions? Accelerating learning using learned action priors. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL). IEEE, 11 2012.

Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. Proceedings of the 30th International Conference on Machine Learning, 28(1):507–515, 2013.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal Value Function Approximators. Proceedings of The 32nd International Conference on Machine Learning, pp. 1312–1320, 2015.

Solway, A., Diuk, C., Córdova, N., Yee, D., Barto, A.G., Niv, Y., and Botvinick, M.M. Optimal Behavioral Hierarchy. PLoS Computational Biology, 10(8):e1003779, 8 2014.

Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, R.S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 8 1999.

Taylor, M.E. and Stone, P. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10:1633–1685, 2009.

Todorov, E. Linearly-solvable Markov decision problems. In NIPS, 2006.

Todorov, E. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478–11483, 7 2009a.

Todorov, E. Compositionality of optimal control laws. In NIPS, 2009b.