Hierarchy Through Composition with Multitask LMDPs
Andrew M. Saxe 1 Adam C. Earle 2 Benjamin Rosman 2 3
Abstract

Hierarchical architectures are critical to the scalability of reinforcement learning methods. Most current hierarchical frameworks execute actions serially, with macro-actions comprising sequences of primitive actions. We propose a novel alternative to these control hierarchies based on concurrent execution of many actions in parallel. Our scheme exploits the guaranteed concurrent compositionality provided by the linearly solvable Markov decision process (LMDP) framework, which naturally enables a learning agent to draw on several macro-actions simultaneously to solve new tasks. We introduce the Multitask LMDP module, which maintains a parallel distributed representation of tasks and may be stacked to form deep hierarchies abstracted in space and time.
1. Introduction

Real world tasks unfold at a range of spatial and temporal scales, such that learning solely at the finest scale is likely to be slow. Hierarchical reinforcement learning (HRL) (Barto & Mahadevan, 2003; Parr & Russell, 1998; Dietterich, 2000) attempts to remedy this by learning a nested sequence of ever more detailed plans. Hierarchical schemes have a number of desirable properties. Firstly, they are intuitive, as humans seldom plan at the level of raw actions, typically preferring to reason at a higher level of abstraction (Botvinick et al., 2009; Ribas-Fernandes et al., 2011; Solway et al., 2014). Secondly, they constitute one approach to tackling the curse of dimensionality (Bellman, 1957; Howard, 1960). In real world MDPs, the number of states typically grows dramatically in the size of a domain. A similar curse of dimensionality afflicts actions, such that a robot with multiple joints, for instance, must operate in the product space of
1Center for Brain Science, Harvard University 2School of Computer Science and Applied Mathematics, University of the Witwatersrand 3Council for Scientific and Industrial Research, South Africa. Correspondence to: Andrew M. Saxe <[email protected]>.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
actions for each joint individually (Mausam & Weld, 2008). Finally, the ‘tasks’ performed by an agent may come in great variety, and can also suffer from a curse of dimensionality: a robot that learns policies to navigate to each of ten rooms would need to learn 10 choose 2 = 45 policies to navigate to the closer of each pair of rooms. Transferring learning across tasks is therefore vital (Taylor & Stone, 2009; Foster & Dayan, 2002; Drummond, 1998; Fernández & Veloso, 2006; Bonarini et al., 2006; Barreto et al., 2016).
Many HRL schemes rely on a serial call/return procedure, in which temporally extended macro-actions or ‘options’ can call other (macro-)actions. In this sense, these schemes draw on a computer metaphor, in which a serial processor chooses sequential primitive actions, occasionally pushing or popping new macro-actions onto a stack. The influential options framework (Sutton et al., 1999), MAXQ method (Dietterich, 2000; Jonsson & Gómez, 2016) and the hierarchy of abstract machines (Parr & Russell, 1998; Burridge et al., 1999) all share this sequential-execution structure.
Concurrent MDPs (Mausam & Weld, 2008) relax the assumption of serial execution to allow multiple actions to be executed simultaneously at each time step, at the cost of a combinatorial increase in the size of the action space. In this paper, we develop a novel hierarchical scheme with a parallel and distributed execution structure that overcomes the combinatorial increase in action space by exploiting the guaranteed optimal compositionality afforded by the linearly solvable Markov decision process (LMDP) framework (Todorov, 2006; Kappen, 2005).
We present a Multitask LMDP module that executes concurrent blends of previously learned tasks. Here we use the term ‘concurrent’ or ‘parallel’ to refer to ‘parallel distributed processing’ or neural network-like methods that form weighted blends of many simple elements to achieve their aim. In standard schemes, if a robot arm has acquired policies to individually reach two points in some space, this knowledge typically does not aid it in optimally reaching to either point. However in our scheme, such behaviours can be expressed as different weighted task blends, such that an agent can simultaneously draw on several sub-policies to achieve goals not explicitly represented by any of the individual behaviours, including tasks the agent has never performed before.
Next we show how this Multitask LMDP module can be stacked, resulting in a hierarchy of more abstract modules that communicate distributed representations of tasks between layers. We give a simple theoretical analysis showing that hierarchy can yield qualitative efficiency improvements in learning time and memory. Finally, we demonstrate the operation of the method on a navigation domain, and show that its multitasking ability can speed learning of new tasks compared to a traditional options-based agent.
Our scheme builds on a variety of prior work. Like the options framework (Sutton et al., 1999), we build a hierarchy in time. Similar to MAXQ Value Decomposition (Dietterich, 2000), we decompose a target MDP into a hierarchy of smaller SMDPs which progressively abstract over states. From Feudal RL, we draw the idea of a managerial hierarchy in which higher layers prescribe goals but not details for lower layers (Dayan & Hinton, 1993). Most closely related to our scheme, (Jonsson & Gómez, 2016) develop a MAXQ decomposition within the LMDP formalism (see Supplementary Material for extended discussion). Our method differs from all of the above approaches in permitting a graded, concurrent blend of tasks at each level, and developing a uniform, stackable module capable of performing a variety of tasks.
2. The Multitask LMDP: A compositional action module
Our goal in this paper is to describe a flexible action-selection module which can be stacked to form a hierarchy, such that the full action at any given point in time is composed of the concurrent composition of sub-actions within sub-actions. By analogy to perceptual deep networks, restricted Boltzmann machines (RBMs) form a component module from which a deep belief network can be constructed by layerwise stacking (Hinton et al., 2006; Hinton & Salakhutdinov, 2006). We seek a similar module in the context of action or control. This section describes the module, the Multitask LMDP (MLMDP), before turning to how it can be stacked. Our formulation relies on the linearly solvable Markov decision process (LMDP) framework introduced by Todorov (2006) (see also Kappen (2005)). The LMDP differs from the standard MDP formulation in fundamental ways, and enjoys a number of special properties. We first briefly describe the canonical MDP formulation, in order to explain what the switch to the LMDP accomplishes and why it is necessary.
2.1. Canonical MDPs
In its standard formulation, an MDP is a four-tuple M = 〈S, A, P, R〉, where S is a set of states, A is a set of discrete actions, P is a transition probability distribution P : S × A × S → [0, 1], and R is an expected instantaneous reward function R : S × A → R. The goal is to determine an optimal policy π : S → A specifying which action to take in each state. This optimal policy can be computed from the optimal value function V : S → R, defined as the expected reward starting in a given state and acting optimally thereafter. The value function obeys the well-known Bellman optimality condition
V(s) = max_{a∈A} { R(s, a) + ∑_{s′} P(s′|s, a) V(s′) }.   (1)
This formalism is the basis of most practical and theoretical studies of decision-making under uncertainty and reinforcement learning (Bellman, 1957; Howard, 1960; Sutton & Barto, 1998). See, for instance, (Mnih et al., 2015; Lillicrap et al., 2015; Levine et al., 2016) for recent successes in challenging domains.
For the purposes of a compositional hierarchy of actions, this formulation presents two key difficulties.
1. Mutually exclusive sequential actions First, the agent’s actions are discrete and execute serially. Exactly one (macro-)action operates at any given time point. Hence there is no way to build up an action at a single time point out of several ‘subactions’ taken in parallel. For example, a control signal for a robotic arm cannot be composed of a control decision for the elbow joint, a control decision for the shoulder joint, and a control decision for the gripper, each taken in parallel and combined into a complete action for a specific time point.
2. Non-composable optimal policies The maximization in Eqn. (1) over a discrete set of actions is nonlinear. This means that optimal solutions, in general, do not compose in a simple way. Consider two standard MDPs M1 = 〈S, A, P, R1〉 and M2 = 〈S, A, P, R2〉 which have identical state spaces, action sets, and transition dynamics but differ in their instantaneous rewards R1 and R2. These may be solved independently to yield value functions V1 and V2. But the value function of the MDP M1+2 = 〈S, A, P, R1 + R2〉, whose instantaneous rewards are the sum of the first two, is not V1+2 = V1 + V2. In general, there is no simple procedure for deriving V1+2 from V1 and V2; it must be found by solving Eqn. (1) again.
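The failure of composition is easy to see numerically. The following sketch (our own toy example, not taken from the paper) solves Eqn. (1) by value iteration on a deterministic four-state chain for two reward functions and for their sum: the summed value functions overestimate the value of the summed task.

```python
# Deterministic 4-state chain: states 0 and 3 are terminal, 1 and 2 interior.
# Actions: 0 = step left, 1 = step right. nxt[s][a] is the successor state.
nxt = {1: [0, 2], 2: [1, 3]}

def value_iteration(R, sweeps=50):
    """Solve V(s) = max_a [R(s, a) + V(s')] (Eqn. 1 with deterministic P)."""
    V = {s: 0.0 for s in range(4)}          # terminal values stay at 0
    for _ in range(sweeps):
        for s in (1, 2):
            V[s] = max(R[(s, a)] + V[nxt[s][a]] for a in (0, 1))
    return V

R1 = {(1, 0): 1.0, (1, 1): 0.0, (2, 0): 0.0, (2, 1): 0.0}  # reward for reaching state 0
R2 = {(1, 0): 0.0, (1, 1): 0.0, (2, 0): 0.0, (2, 1): 1.0}  # reward for reaching state 3
R12 = {k: R1[k] + R2[k] for k in R1}                       # summed rewards

V1, V2, V12 = value_iteration(R1), value_iteration(R2), value_iteration(R12)
print(V1[1] + V2[1], V12[1])   # 2.0 vs 1.0: V1 + V2 is not the value of M1+2
```

Only one terminal reward can be collected per episode, so the true composite value at state 1 is 1.0, while the naive sum of the two single-task values is 2.0.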
2.2. Linearly Solvable MDPs
The LMDP (Todorov, 2009a;b; Dvijotham & Todorov, 2010) is defined by a three-tuple L = 〈S, P, R〉, where S is a set of states, P is a passive transition probability distribution P : S × S → [0, 1], and R is an expected instantaneous reward function R : S → R. The LMDP framework replaces
the traditional discrete set of actions A with a continuous probability distribution over next states a : S × S → [0, 1]. That is, the ‘control’ or ‘action’ chosen by the agent in state s is a transition probability distribution over next states, a(·|s). The controlled transition distribution may be interpreted either as directly constituting the agent’s dynamics, or as a stochastic policy over deterministic actions which affect state transitions (Todorov, 2006; Jonsson & Gómez, 2016). Swapping a discrete action space for a continuous action space is a key change which will allow for concurrently selected ‘subactions’ and distributed representations.
The LMDP framework additionally posits a specific form for the cost function to be optimized. The instantaneous reward for taking action a(·|s) in state s is
R(s, a) = R(s) − λ KL(a(·|s) || P(·|s)),   (2)
where the KL term is the Kullback-Leibler divergence between the selected control transition probability and the passive dynamics. This term implements a control cost, encouraging actions to conform to the natural passive dynamics of a domain. In a cart-pole balancing task, for instance, the passive dynamics might encode the transition structure arising from physics in the absence of control input. Any deviation from these dynamics will require energy input. In more abstract settings, such as navigation in a 2D grid world, the passive dynamics might encode a random walk, expressing the fact that actions cannot transition directly to a far away goal but only move some limited distance in a specific direction. Examples of standard benchmark domains in the LMDP formalism are provided in the Supplementary Material. The parameter λ in Eqn. (2) acts to trade off the relative value between the reward of being in a state and the control cost, and determines the stochasticity of the resulting policies.
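Concretely, Eqn. (2) charges the agent for bending the dynamics away from the passive distribution. A minimal sketch (the two-successor random walk, state reward, and λ = 1 are illustrative assumptions, not values from the paper):

```python
import numpy as np

lam = 1.0
p = np.array([0.5, 0.5])   # passive dynamics: unbiased random walk over two successors
a = np.array([0.9, 0.1])   # controlled dynamics, strongly biased toward one successor
R_state = -1.0             # state-dependent reward R(s)

kl = np.sum(a * np.log(a / p))   # KL(a(.|s) || P(.|s)), approximately 0.368 nats
R_sa = R_state - lam * kl        # Eqn. (2): net reward after the control cost
# Choosing a = p gives KL = 0: conforming to the passive dynamics is free.
```

Larger λ makes deviation more expensive, pushing the optimal policy toward the passive dynamics and hence toward more stochastic behaviour.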
We consider first-exit problems (see Dvijotham & Todorov (2011) for infinite horizon and other formulations), in which the state space is divided into a set of absorbing boundary states B ⊂ S and non-absorbing interior states I ⊂ S, with S = B ∪ I. In this formulation, an agent acts in a variable-length episode that consists of a series of transitions through interior states before a final transition to a boundary state which terminates the episode. The goal is to find the policy a∗ which maximizes the total expected reward across the episode,
a∗ = argmax_a E_{st+1∼a(·|st)} { ∑_{t=1}^{τ−1} R(st, a) + R(sτ) },  where τ = min{t : st ∈ B}.   (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the Bellman equation simplifies greatly. In particular, define the desirability function z(s) = e^{V(s)/λ} as the exponentiated cost-to-go function, and define q(s) = e^{R(s)/λ} to be the exponentiated instantaneous rewards. Let N be the number of states, and Ni and Nb be the number of internal and boundary states respectively. Represent z(s) and q(s) with N-dimensional column vectors z and q, and the transition dynamics P(s′|s) with the N-by-Ni matrix P, where column index corresponds to s and row index corresponds to s′. Let zi and zb denote the partition of z into internal and boundary states, respectively, and similarly for qi and qb. Finally, let Pi denote the Ni-by-Ni submatrix of P containing transitions between internal states, and Pb denote the Nb-by-Ni submatrix of P containing transitions from internal states to boundary states.
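The partitioning above can be made concrete on a small example (a five-state chain with absorbing ends and random-walk passive dynamics, our own illustrative domain):

```python
import numpy as np

# 5-state chain: boundary states B = {0, 4}, interior states I = {1, 2, 3}.
interior, boundary = [1, 2, 3], [0, 4]
N, Ni = 5, 3

# P is N-by-Ni: column j holds the passive distribution P(.|s) for s = interior[j].
P = np.zeros((N, Ni))
for j, s in enumerate(interior):
    P[s - 1, j] = 0.5   # step left with probability 1/2
    P[s + 1, j] = 0.5   # step right with probability 1/2

Pi = P[interior, :]   # Ni-by-Ni: interior -> interior transitions
Pb = P[boundary, :]   # Nb-by-Ni: interior -> boundary transitions
```

Each column of P is a probability distribution, and the interior and boundary submatrices Pi and Pb together account for all of its mass.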
As shown in Todorov (2009b), the Bellman equation in this setting reduces to

(I − Mi Pi^T) zi = Mi Pb^T zb,   (4)
where Mi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of special properties flow from the linearity of the Bellman equation, which we exploit in the following.
Solving for zi may be done explicitly as zi = (I − Mi Pi^T)^{−1} Mi Pb^T zb, or via the z-iteration method (akin to value iteration),

zi ← Mi Pi^T zi + Mi Pb^T zb.   (5)
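The two solution methods can be checked against each other on a small first-exit problem. This is a sketch under our own assumptions (a five-state chain, per-step reward −1, boundary rewards −10 and 0, λ = 1), not the paper's code:

```python
import numpy as np

# 5-state chain: boundary B = {0, 4}, interior I = {1, 2, 3}, random-walk passive dynamics.
interior, boundary = [1, 2, 3], [0, 4]
P = np.zeros((5, 3))
for j, s in enumerate(interior):
    P[s - 1, j] = P[s + 1, j] = 0.5
Pi, Pb = P[interior, :], P[boundary, :]

lam = 1.0
R_i = np.array([-1.0, -1.0, -1.0])      # per-step cost in interior states
R_b = np.array([-10.0, 0.0])            # state 4 is the goal, state 0 is penalized
Mi = np.diag(np.exp(R_i / lam))
zb = np.exp(R_b / lam)                   # boundary desirability: zb = qb

# Explicit solve of the linear system, Eqn. (4) ...
zi_direct = np.linalg.solve(np.eye(3) - Mi @ Pi.T, Mi @ Pb.T @ zb)

# ... versus z-iteration, Eqn. (5)
zi = np.zeros(3)
for _ in range(200):
    zi = Mi @ Pi.T @ zi + Mi @ Pb.T @ zb
```

The iteration contracts (each sweep scales by at most e^{R_i/λ} < 1 here), so both routes agree to numerical precision, and desirability rises toward the goal.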
Finally, the optimal policy may be computed in closed form as

a∗(s′|s) = P(s′|s) z(s′) / G[z](s),   (6)

where the normalizing constant G[z](s) = ∑_{s′} P(s′|s) z(s′). Detailed derivations of these results are given in (Todorov, 2009a;b; Dvijotham & Todorov, 2011). Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft maximization log(∑ exp(·)), and the continuous action space enables closed-form computation of the optimal policy.
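Eqn. (6) amounts to reweighting the passive dynamics by the desirability of each successor and renormalizing. A sketch on the same kind of toy chain (domain and rewards are our own illustrative assumptions):

```python
import numpy as np

# 5-state chain: boundary B = {0, 4}, interior I = {1, 2, 3}, random-walk passive dynamics.
interior, boundary = [1, 2, 3], [0, 4]
P = np.zeros((5, 3))
for j, s in enumerate(interior):
    P[s - 1, j] = P[s + 1, j] = 0.5
Pi, Pb = P[interior, :], P[boundary, :]
Mi = np.diag(np.exp(np.array([-1.0, -1.0, -1.0])))   # per-step cost -1, lambda = 1
zb = np.exp(np.array([-10.0, 0.0]))                  # goal at state 4
zi = np.linalg.solve(np.eye(3) - Mi @ Pi.T, Mi @ Pb.T @ zb)

z = np.empty(5)
z[interior], z[boundary] = zi, zb

# Eqn. (6): a*(s'|s) = P(s'|s) z(s') / G[z](s); columns index the state s.
a_star = P * z[:, None]                        # elementwise P(s'|s) z(s')
a_star /= a_star.sum(axis=0, keepdims=True)    # divide by G[z](s)
```

Each column of a_star is a proper distribution over next states, and from state 1 almost all probability mass shifts toward state 2 (toward the goal) rather than the penalized boundary at state 0.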
Compared to the standard MDP formulation, the LMDP has
1. Continuous concurrent actions Actions are expressed as transition probabilities over next states, such that these transition probabilities can be influenced by many subtasks operating in parallel, and in a graded fashion.
2. Compositional optimal policies In the LMDP, linearly blending desirability functions yields the correct composite desirability function (Todorov, 2009a;b). In particular, consider two LMDPs L1 = 〈S, P, qi, qb^1〉 and L2 = 〈S, P, qi, qb^2〉 which differ only in their boundary rewards, with solutions z^1 and z^2. Then the LMDP whose boundary rewards are the weighted blend qb = w1 qb^1 + w2 qb^2 has the optimal desirability function

z = w1 z^1 + w2 z^2,   (7)

since Eqn. (4) is linear in zb.
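This compositionality can be verified numerically. A sketch on a toy chain (our own illustrative setup, not the paper's code): two tasks that differ only in which end of the chain is the goal, blended with arbitrary weights.

```python
import numpy as np

# 5-state chain: boundary B = {0, 4}, interior I = {1, 2, 3}, random-walk passive dynamics.
interior, boundary = [1, 2, 3], [0, 4]
P = np.zeros((5, 3))
for j, s in enumerate(interior):
    P[s - 1, j] = P[s + 1, j] = 0.5
Pi, Pb = P[interior, :], P[boundary, :]
Mi = np.diag(np.exp(np.array([-1.0, -1.0, -1.0])))   # per-step cost -1, lambda = 1

def solve(zb):
    """Solve the linear Bellman equation, Eqn. (4), for a given boundary desirability."""
    return np.linalg.solve(np.eye(3) - Mi @ Pi.T, Mi @ Pb.T @ zb)

zb1 = np.exp(np.array([0.0, -10.0]))    # task 1: goal at state 0
zb2 = np.exp(np.array([-10.0, 0.0]))    # task 2: goal at state 4
w1, w2 = 0.3, 0.7

# Solving the blended task equals blending the single-task solutions.
zi_blend = solve(w1 * zb1 + w2 * zb2)
```

Because zi is a linear function of zb, the equality is exact; no such identity holds for value functions in the standard MDP formulation.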
2.3. Concurrent subactions and distributed representations of tasks
Our hierarchical scheme is built on the two key properties of the LMDP described above: continuous concurrent actions, and compositional optimal policies.
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as shown in a variety of recent work [***], and in the examples given later, most standard domains can be translated to the LMDP framework; and there exists a general procedure for embedding traditional MDPs in the LMDP framework [***].
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions, making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that the standard MDP formulation is overly general, discarding useful structure present in most real world domains: namely, a preference for efficient actions. Standard MDP formulations commonly place small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents compositionality.
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
+w2⇥ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
+w2⇥ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
+ · · · + wn⇥ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
+ · · · + wn⇥ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
�⇡ \1 ⇡ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
�⇡ \1 ⇡ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
�⇡ \1 ⇡ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
episode,82
a⇤ = argmaxaE st+1⇠a(·|st)
⌧=min{t:st2B}
(⌧�1X
t=1
R(st, a) + R(s⌧ )
). (3)
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the83
Bellman equation simplifies greatly. In particular define the desirability function z(s) = eV (s)/� as84
the exponentiated cost-to-go function, and define q(s) = eR(s)/� to be the exponentiated instanta-85
neous rewards. Let n be the number of states, and ni and nb be the number of internal and boundary86
states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the87
transition dynamics P (s0|s) with the n-by-ni matrix P , where column index corresponds to s and88
row index corresponds to s0. Let zi and zb denote the partition of z into boundary and internal states,89
respectively, and similarly for qi and qb. Finally, let Pi denote the ni-by-ni submatrix of P containing90
transitions between internal states, and Pb denote the nb-by-ni submatrix of P containing transitions91
from internal states to boundary states.92
As shown in [***], the Bellman equation in this setting reduces to93
(I �QiPi)zi = QiPbzb (4)
where Qi = diag(qi) and, because boundary states are absorbing, zb = qb. The exponentiated94
Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of95
special properties flow from the linearity of the Bellman equation, which we exploit in the following.96
Solving for zi may be done explicitly as zi = (I � QiPi)�1QiPbzb or via the z-iteration method97
(akin to value iteration),98
zi QiPizi + QiPbzb. (5)
Finally, the optimal policy may be computed in closed form as99
a⇤(s0|s) =P (s0|s)Z(s0)
G[Z](s), (6)
where the normalizing constant G[Z](s) =P
s0 P (s0|s)Z(s0). Detailed derivations of these results100
are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft101
maximization log(P
exp(·)), and the continuous action space enables closed form computation of102
the optimal policy.103
2.3 Concurrent subactions and distributed representations of tasks104
�⇡ \1 ⇡ (7)
Our hierarchical scheme is built on two key properties of the LMDP.105
1. Continuous concurrent actions106
2. Compositional optimal policies107
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as108
shown in a variety of recent work [***], and in the examples given later, most standard domains can109
be translated to the MDP framework; and there exists a general procedure for embedding traditional110
MDPs in the LMDP framework ?.111
More generally, we suggest that most domains of interest have some notion of ‘efficient’ actions,112
making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that113
the standard MDP formulation is overly general, discarding useful structure in most real world114
domains–namely, a preference for efficient actions. Standard MDP formulations commonly place115
small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility116
to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each117
action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents118
compositionality.119
The120
3
epis
ode,
82
a⇤
=ar
gmax
aE
st+
1⇠
a(·|
st)
⌧=
min
{t:s
t2B
}
(⌧�
1X t=
1
R(s
t,a
)+
R(s
⌧))
.(3
)
Bec
ause
ofth
eca
refu
llych
osen
stru
ctur
eof
the
rew
ard
R(s
,a)
and
the
cont
inuo
usac
tion
spac
e,th
e83
Bel
lman
equa
tion
sim
plifi
esgr
eatly
.In
part
icul
arde
fine
the
desi
rabi
lity
func
tion
z(s
)=
eV(s
)/�
as84
the
expo
nent
iate
dco
st-t
o-go
func
tion,
and
defin
eq(
s)=
eR(s
)/�
tobe
the
expo
nent
iate
din
stan
ta-
85
neou
sre
war
ds.L
etn
beth
enu
mbe
rofs
tate
s,an
dn
ian
dn
bbe
the
num
bero
fint
erna
land
boun
dary
86
stat
esre
spec
tivel
y.R
epre
sent
z(s
)an
dq(
s)w
ithn
-dim
ensi
onal
colu
mn
vect
ors
zan
dq,
and
the
87
tran
sitio
ndy
nam
icsP
(s0 |s
)w
ithth
en
-by-
ni
mat
rix
P,w
here
colu
mn
inde
xco
rres
pond
sto
san
d88
row
inde
xco
rres
pond
sto
s0.L
etz i
episode,

$$a^* = \operatorname*{argmax}_{a} \; \mathbb{E}_{s_{t+1} \sim a(\cdot|s_t)} \left[ \sum_{t=1}^{\tau-1} R(s_t, a) + R(s_\tau) \right], \qquad \tau = \min\{t : s_t \in B\}. \tag{3}$$
Because of the carefully chosen structure of the reward R(s, a) and the continuous action space, the Bellman equation simplifies greatly. In particular, define the desirability function z(s) = e^{V(s)/λ} as the exponentiated cost-to-go function, and define q(s) = e^{R(s)/λ} to be the exponentiated instantaneous rewards. Let n be the number of states, and n_i and n_b be the number of internal and boundary states respectively. Represent z(s) and q(s) with n-dimensional column vectors z and q, and the transition dynamics P(s'|s) with the n-by-n_i matrix P, where the column index corresponds to s and the row index corresponds to s'. Let z_i and z_b denote the partition of z into internal and boundary states, respectively, and similarly for q_i and q_b. Finally, let P_i denote the n_i-by-n_i submatrix of P containing transitions between internal states, and P_b denote the n_b-by-n_i submatrix of P containing transitions from internal states to boundary states.
As shown in [***], the Bellman equation in this setting reduces to

$$(I - Q_i P_i)\, z_i = Q_i P_b z_b, \tag{4}$$
where Q_i = diag(q_i) and, because boundary states are absorbing, z_b = q_b. The exponentiated Bellman equation is hence a linear system, the key advantage of the LMDP framework. A variety of special properties flow from the linearity of the Bellman equation, which we exploit in the following.
Solving for z_i may be done explicitly as z_i = (I − Q_i P_i)^{−1} Q_i P_b z_b, or via the z-iteration method (akin to value iteration),

$$z_i \leftarrow Q_i P_i z_i + Q_i P_b z_b. \tag{5}$$
Finally, the optimal policy may be computed in closed form as

$$a^*(s'|s) = \frac{P(s'|s)\, Z(s')}{G[Z](s)}, \tag{6}$$

where the normalizing constant G[Z](s) = \sum_{s'} P(s'|s) Z(s'). Detailed derivations of these results are given in [***]. Intuitively, the hard maximization of Eqn. (1) has been replaced by a soft maximization \log \sum \exp(\cdot), and the continuous action space enables closed-form computation of the optimal policy.
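Eqn. (6) says the optimal policy is just the passive dynamics reweighted by successor desirability and renormalized. A minimal sketch on the same kind of toy chain (the z values are illustrative placeholders, roughly what a solved reach-the-right-end task yields):

```python
import numpy as np

# States 0..4; 0 and 4 are absorbing; interior states 1-3 step
# left/stay/right under the passive dynamics.
P = np.zeros((5, 3))                         # rows s', cols interior s in {1, 2, 3}
for j, s in enumerate([1, 2, 3]):
    for s_next in (s - 1, s, s + 1):
        P[s_next, j] = 1 / 3
z = np.array([np.exp(-5.0), 0.13, 0.30, 0.56, 1.0])   # illustrative desirability

def optimal_policy(P, z):
    """Columns of the result are a*(.|s) = P(.|s) z(.) / G[Z](s)."""
    unnorm = P * z[:, None]                  # reweight each successor by z(s')
    return unnorm / unnorm.sum(axis=0)       # normalize columns by G[Z](s)

a_star = optimal_policy(P, z)
assert np.allclose(a_star.sum(axis=0), 1.0)
assert a_star[4, 2] > a_star[2, 2]           # state 3 now prefers stepping to the goal
```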
2.3 Concurrent subactions and distributed representations of tasks
Our hierarchical scheme is built on two key properties of the LMDP:

1. Continuous concurrent actions
2. Compositional optimal policies
The specific form of Eqn. (2) can seem to limit the applicability of the LMDP framework. Yet as shown in a variety of recent work [***], and in the examples given later, most standard domains can be translated to the LMDP framework; and there exists a general procedure for embedding traditional MDPs in the LMDP framework [***].
More generally, we suggest that most domains of interest have some notion of 'efficient' actions, making a control cost a reasonably natural and universal phenomenon. Indeed, it is possible that the standard MDP formulation is overly general, discarding useful structure present in most real-world domains, namely a preference for efficient actions. Standard MDP formulations commonly place small negative rewards on each action to instantiate this efficiency goal, but they retain the flexibility to, for instance, prefer energetically inefficient trajectories by placing positive rewards on each action. The drawback of this flexibility is the unstructured maximization of Eqn. (1), which prevents compositionality.
Figure 1. Distributed task representations with the Multitask LMDP. (a) Example 2DOF arm constrained to the plane with state space consisting of shoulder and elbow joint angles θ1, θ2 ∈ [−π, π] respectively. (b) A novel task is specified as an instantaneous reward function over terminal states. In this example, the task "reach to the black rectangle" is encoded by rewarding any terminal state with the end effector in the black rectangle (all successful configurations shown). (c-g) Solutions via the LMDP. Top row: Instantaneous rewards in the space of joint angles. Middle row: Optimal desirability function with sample trajectories starting at green circles, finishing at red circles. Bottom row: Strobe plot of sample trajectories in Cartesian space. Trajectories start at green circle, end at red circle. Column (c): The linear Bellman equation is solved for this particular instantaneous boundary reward structure to obtain the optimal value function. Columns (d-g): Solution via compositional Multitask LMDP. The instantaneous reward structure is expressed as a weighted combination of previously-learned subtasks (d-f), here chosen to be navigation to specific points in state space. (g) Because of the linearity of the Bellman equation, the resulting combined value function is optimal, and the system can act instantly despite no explicit training on the reach-to-rectangle task. The same fixed basis set can be used to express a wide variety of tasks (reach to a cross; reach to a circle; etc.).
Consider two LMDPs L1 = 〈S, P, q_i, q^1_b〉 and L2 = 〈S, P, q_i, q^2_b〉 which have identical state spaces, transition dynamics, and internal reward structures, but differ in their exponentiated boundary rewards q^1_b and q^2_b. These may be solved independently to yield desirability functions z^1 and z^2. The desirability function of the LMDP L_{1+2} = 〈S, P, q_i, αq^1_b + βq^2_b〉, whose instantaneous rewards are a weighted sum of the first two, is simply z^{1+2} = αz^1 + βz^2. This property follows from the linearity of Eqn. 4, and is the foundation of our hierarchical scheme.
2.3. The Multitask LMDP
To build a multitask action module, we exploit this compositionality to build a basis set of tasks (Foster & Dayan, 2002; Ruvolo & Eaton, 2013; Schaul et al., 2015; Pan et al., 2015; Borsa et al., 2016). Suppose that we learn a set of LMDPs L_t = 〈S, P, q_i, q^t_b〉, t = 1, · · · , N_t which all share the same state space, passive dynamics, and internal rewards, but differ in their instantaneous exponentiated boundary reward structure q^t_b, t = 1, · · · , N_t. Each of these LMDPs corresponds to a different task, defined by its boundary reward structure, in the same overall state space (see Taylor & Stone (2009) for an overview of this and other notions of transfer and multitask learning). We denote the Multitask LMDP module M formed from these N_t LMDPs as M = 〈S, P, q_i, Q_b〉. Here the N_b-by-N_t task basis matrix Q_b = [q^1_b q^2_b · · · q^{N_t}_b] encodes a library of component tasks in the MLMDP. Solving each component LMDP yields the corresponding desirability functions z^t_i, t = 1, · · · , N_t for each task, which can also be formed into the N_i-by-N_t desirability basis matrix Z_i = [z^1_i z^2_i · · · z^{N_t}_i] for the multitask module.
When we encounter a new task defined by a novel instantaneous exponentiated reward structure q, if we can express it as a linear combination of previously learned tasks, q = Q_b w, where w ∈ R^{N_t} is a vector of task blend weights, then we can instantaneously derive its optimal desirability function as z_i = Z_i w. This immediately yields the optimal action through Eqn. 6. Hence an MLMDP agent can act optimally even in never-before-seen tasks, provided that the new task lies in the subspace spanned by previously learned tasks. This approach is an off-policy variant on the successor representation method of Dayan (1993). It differs by yielding the optimal value function, rather than the value function under a particular policy π, and allows arbitrary boundary rewards for the component tasks (the SR implicitly takes these to be a positive reward on just one state). Also related is the successor feature approach of Barreto et al. (2016), which permits performance guarantees for transfer to new tasks. With successor features, new tasks must have rewards that are close in Euclidean distance to prior tasks, whereas here we require only that they lie in the span of prior tasks.
More generally, if the target task q is not an exact linear combination of previously learned tasks, an approximate task weighting w can be found as

$$\operatorname*{argmin}_{w} \; \|q - Q_b w\| \quad \text{subject to} \quad Q_b w \ge 0. \tag{7}$$

The technical requirement Q_b w ≥ 0 is due to the relationship q = exp(r/λ), such that negative values of q are not possible. In practice this can be approximately solved as w = Q_b^† q, where † denotes the pseudoinverse, and negative elements of w are projected back to zero.
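The pseudoinverse-and-project heuristic above can be sketched in a few lines. The basis matrices here are illustrative stand-ins; in a real module the columns of Z_i would come from solving each basis task's linear Bellman equation:

```python
import numpy as np

# Approximate task blend for Eqn. (7): w = pinv(Q_b) q with negative
# entries projected to zero, then instant transfer via z_i = Z_i w.
Q_b = np.array([[1.0, np.exp(-5)],
                [np.exp(-5), 1.0]])          # two basis tasks over two boundary states
Z_i = np.array([[0.56, 0.13],
                [0.30, 0.30],
                [0.13, 0.56]])               # their interior desirability functions

q_new = np.array([0.5, 1.0])                 # novel task favoring the right boundary
w = np.linalg.pinv(Q_b) @ q_new
w = np.maximum(w, 0.0)                       # project negative weights back to zero
z_new = Z_i @ w                              # desirability for the novel task, no new learning

assert np.all(w >= 0)
assert z_new[2] > z_new[0]                   # right side is more desirable, as intended
```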
Here the coefficients of the task blend w constitute a distributed representation of the current task to be performed. Although the set of basis tasks Q_b is fixed and finite, they permit an infinite space of tasks to be performed through their concurrent linear composition. Figure 1 demonstrates this ability of the Multitask LMDP module in the context of a 2D robot arm reaching task. From knowledge of how to reach individual points in space, the module can instantly act optimally to reach to a rectangular region.
3. Stacking the module: Concurrent Hierarchical LMDPs
To build a hierarchy out of this Multitask LMDP module, we construct a stack of MLMDPs in which higher levels select the instantaneous reward structure that defines the current task for lower levels. To take a navigation example, a high level module might specify that the lower level module should reach room A but not B by placing instantaneous rewards in room A but no rewards in room B. Crucially, the fine details of achieving this subgoal can be left to the low-level module. Critical to the success of this hierarchical scheme is the flexible, optimal composition afforded by the Multitask LMDP module: the specific reward structure commanded by the higher layer will often be novel for the lower level, but will still be performed optimally provided it lies in the basis of learned tasks.
3.1. Constructing a hierarchy of MLMDPs
We start with the MLMDP M^1 = 〈S^1, P^1, q^1_i, Q^1_b〉 that we must solve, where here the superscript denotes the hierarchy level (Fig. 2). This serves as the base case for our recursive scheme for generating a hierarchy. For the inductive step, given an MLMDP M^l = 〈S^l, P^l, q^l_i, Q^l_b〉 at level l, we augment the state space S̃^l = S^l ∪ S^l_t with a set of N_t terminal boundary states S^l_t that we call subtask states. Semantically, entering one of these states will correspond to a decision by the layer l MLMDP to access the next level of the hierarchy. The transitions to subtask states are governed by a new N^l_t-by-N^l_i passive dynamics matrix P^l_t, which is chosen by the designer to encode the structure of the domain. Choosing fewer subtask states than interior states will yield a higher level which operates in a smaller state space, yielding state abstraction. In the augmented MLMDP, the passive dynamics become P̃^l = N([P^l_i; P^l_b; P^l_t]), where the operation N(·) renormalizes each column to sum to one.

Figure 2. Stacking Multitask LMDPs. Top: Deep hierarchy for navigation through a 1D corridor. Lower-level MLMDPs are abstracted to form higher-level MLMDPs by choosing a set of 'subtask' states which can be accessed by the lower level (grey lines between levels depict passive subtask transitions P^l_t). Lower levels access these subtask states to indicate completion of a subgoal and to request more information from higher levels; higher levels communicate new subtask state instantaneous rewards, and hence the concurrent task blend, to the lower levels. Red lines indicate higher level access points for one sample trajectory starting from leftmost state and terminating at rightmost state. Bottom: Panels (a-c) depict distributed task blends arising from accessing the hierarchy at points denoted in left panel. The higher layer states accessed are indicated by filled circles. (a) Just the second layer of hierarchy is accessed, resulting in higher weight on the task to achieve the next subgoal and zero weights on already achieved subgoals. (b) The second and third levels are accessed, yielding new task blends for both. (c) All levels are accessed yielding task blends at a range of scales.
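The augmentation amounts to stacking the interior, boundary, and subtask transition blocks and renormalizing columns. A sketch with illustrative matrices (the subtask block P_t is an arbitrary designer choice here, not one from the paper):

```python
import numpy as np

# Augmenting an MLMDP with subtask states: stack the three transition
# blocks and apply the N(.) operation, which renormalizes every column
# of the stacked matrix to sum to one.
P_i = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]]) / 3.0   # interior -> interior
P_b = np.array([[1, 0, 0], [0, 0, 1]]) / 3.0              # interior -> boundary
P_t = np.array([[0.2, 0.0, 0.0],
                [0.0, 0.0, 0.2]])                         # interior -> subtask (designer-chosen)

def normalize_columns(M):
    return M / M.sum(axis=0, keepdims=True)

P_aug = normalize_columns(np.vstack([P_i, P_b, P_t]))     # the N([P_i; P_b; P_t]) operation
assert np.allclose(P_aug.sum(axis=0), 1.0)
```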
It remains to specify the matrix of subtask-state instantaneous rewards R^l_t for this augmented MLMDP. Often hierarchical schemes require designing a pseudoreward function to encourage successful completion of a subtask. Here we also pick a set of reward functions over subtask states; however, the performance of our scheme is only weakly dependent on this choice: we require only that our chosen reward functions form a good basis for the set of subtasks that the higher layer will command. Any set of tasks which can linearly express the required space of reward structures specified by the higher level is suitable. In our experiments, we define N^l_t tasks, one for each subtask state, and set each instantaneous reward to negative values on all but a single 'goal' subtask state. Then the augmented MLMDP is M̃^l = 〈S̃^l, P̃^l, q^l_i, [Q^l_b 0; 0 Q^l_t]〉.
The higher level is itself an MLMDP M^{l+1} = 〈S^l_t, P^{l+1}, q^{l+1}_i, Q^{l+1}_b〉, defined not over the entire state space but just over the N^l_t subtask states of the layer below. To construct this, we must compute an appropriate passive dynamics and reward structure. A natural definition for the passive dynamics is the probability of starting at one subtask state and terminating at another under the lower layer's passive dynamics,

$$P^{l+1}_i = \tilde{P}^l_t \left(I - \tilde{P}^l_i\right)^{-1} \tilde{P}^{l\,T}_t, \tag{8}$$
$$P^{l+1}_b = \tilde{P}^l_b \left(I - \tilde{P}^l_i\right)^{-1} \tilde{P}^{l\,T}_t. \tag{9}$$
In this way, the higher-level LMDP will incorporate the transition constraints from the layer below. The interior-state reward structure can be similarly defined, as the reward accrued under the passive dynamics from the layer below. However, for simplicity in our implementation, we simply set small negative rewards on all internal states.
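Eqns. (8)-(9) can be sketched directly on illustrative blocks. One step is flagged as an assumption: we renormalize the resulting columns to obtain proper transition probabilities for the next layer, which the construction leaves implicit:

```python
import numpy as np

# Eqns. (8)-(9) on a toy augmented MLMDP: first build the augmented,
# column-normalized passive dynamics, then compute the higher layer's
# dynamics as absorption probabilities between subtask states.
P_i_raw = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]]) / 3.0
P_b_raw = np.array([[1, 0, 0], [0, 0, 1]]) / 3.0
P_t_raw = np.array([[0.2, 0.1, 0.0],
                    [0.0, 0.1, 0.2]])      # designer-chosen subtask transitions

stack = np.vstack([P_i_raw, P_b_raw, P_t_raw])
stack = stack / stack.sum(axis=0, keepdims=True)   # the N(.) renormalization
P_i, P_b, P_t = stack[:3], stack[3:5], stack[5:]

G = np.linalg.inv(np.eye(3) - P_i)        # expected interior-state visit counts
P_next_i = P_t @ G @ P_t.T                # subtask -> subtask (Eqn. 8)
P_next_b = P_b @ G @ P_t.T                # subtask -> boundary (Eqn. 9)

# Renormalize columns to form proper dynamics for the next layer
# (an assumption; the paper leaves this step implicit).
P_next = np.vstack([P_next_i, P_next_b])
P_next = P_next / P_next.sum(axis=0, keepdims=True)
assert np.allclose(P_next.sum(axis=0), 1.0)
```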
Hence, from a base MLMDP M^l and subtask transition matrix P^l_t, the above construction yields an augmented MLMDP M̃^l at the same layer and an unaugmented MLMDP M^{l+1} at the next higher layer. This procedure may be iterated to form a deep stack of MLMDPs {M̃^1, M̃^2, · · · , M̃^{D−1}, M^D}, where all but the highest are augmented with subtask states. The key choice for the designer is P^l_t, the transition structure from internal states to subtask states for each layer. Through Eqns. (8)-(9), this matrix specifies the state abstraction used in the next higher layer. Fig. 2 illustrates this scheme for an example of navigation through a 1D corridor.
3.2. Instantaneous rewards and task blends: Communication between layers
Bidirectional communication between layers happens via subtask states and their instantaneous rewards. The higher layer sets the instantaneous boundary rewards over subtask states for the lower layer; and the lower layer signals that it has completed a subtask and needs new guidance from the higher layer by transitioning to a subtask state.
In particular, suppose we have solved a higher-level MLMDP using any method we like, yielding the optimal action a^{l+1}. This will make transitions to some states more likely than they would be under the passive dynamics, indicating that they are more attractive than usual for the current task. It will make other transitions less likely than the passive dynamics, indicating that transitions to these states should be avoided. We therefore define the instantaneous rewards for the subtask states at level l to be proportional to the difference between controlled and passive dynamics at the higher level l + 1,

$$r^l_t \propto a^{l+1}_i(\cdot|s) - p^{l+1}_i(\cdot|s). \tag{10}$$
This effectively inpaints extra rewards for the lower layer, indicating which subtask states are desirable from the perspective of the higher layer.
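The inpainting rule of Eqn. (10) is a one-liner. The two distributions below are illustrative, and the proportionality constant is a free parameter:

```python
import numpy as np

# Subtask rewards at level l are proportional to how much the higher
# level's optimal action shifts probability toward each state relative
# to the passive dynamics.
p_higher = np.array([0.25, 0.25, 0.25, 0.25])   # passive next-state distribution
a_higher = np.array([0.10, 0.05, 0.60, 0.25])   # controlled (optimal) distribution

scale = 2.0                                     # free proportionality constant
r_t = scale * (a_higher - p_higher)

assert np.isclose(r_t.sum(), 0.0)               # a difference of distributions sums to zero
assert r_t[2] > 0 and r_t[0] < 0                # favored states receive positive reward
```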
The lower layer MLMDP then uses its basis of tasks to determine a task weighting w^l which will optimally achieve the reward structure r^l_t specified by the higher layer by solving (7). The reward structure specified by the higher layer may not correspond to any one task learned by the lower layer, but it will nonetheless be performed well by forming a concurrent blend of many different tasks and leveraging the compositionality afforded by the LMDP framework. This scheme may also be interpreted as implementing a parametrized option, but with performance guarantees for any parameter setting (Masson et al., 2016).
We now describe the execution model (see Supplementary Material for pseudocode listing). The true state of the agent is represented in the base level M̃^1, and next states are drawn from the controlled transition distribution. If the next state is an interior state, one unit of time passes and the state is updated as usual. If the next state is a subtask state, the next layer of the hierarchy is accessed at its corresponding state; no 'real' time passes during this transition. The higher level then draws its next state, and in so doing can access the next level of hierarchy by transitioning to one of its subtask states, and so on. At some point, a level will elect not to access a subtask state; it then transmits its desired rewards from Eqn. (10) to the layer below it. The lower layer then solves its multitask LMDP problem to compute its own optimal actions, and the process continues down to the lowest layer M̃^1 which, after updating its optimal actions, again draws a transition. Lastly, if the next state is a terminal boundary state, the layer terminates itself. This corresponds to a level of the hierarchy determining that it no longer has useful information to convey. Terminating a layer disallows future transitions from the lower layer to its subtask states, and corresponds to inpainting infinite negative rewards onto the lower level subtask states.
4. Computational complexity advantages of hierarchical decomposition
To concretely illustrate the value of hierarchy, consider navigation through a 1D ring of N states, where the agent must perform N different tasks corresponding to navigating to each particular state. We take the passive dynamics to be local (a nonzero probability of transitioning just to adjacent states in the ring, or remaining still). In one step of Z iteration (Eqn. 5), the optimal value function progresses at best O(1) states per iteration because of the local passive dynamics (see Precup et al. (1998) for a similar argument). It therefore requires O(N) iterations in a flat implementation for a useful value function signal to arrive at the furthest point in the ring for each task. As there are N tasks, the flat implementation requires O(N^2) iterations to learn all of them.
Instead suppose that we construct a hierarchy by placing a subtask every M = log N states, and do this recursively to form D layers. The recursion terminates when N/M^D ≈ 1, yielding D ≈ log_M N. With the correct higher level policy sequencing subtasks, each policy at a given layer only needs to learn to navigate between adjacent subtasks, which are no more than M states apart. Hence Z iteration can be terminated after O(M) iterations. At level l = 1, 2, · · · , D of the hierarchy, there are N/M^l subtasks, and N/M^{l−1} boundary reward tasks to learn. Overall this yields

$$\sum_{l=1}^{D} M \left( \frac{N}{M^l} + \frac{N}{M^{l-1}} \right) \approx O(N \log N)$$

total iterations (see Supplementary Material for derivation and numerical verification). A similar analysis shows that this advantage holds for memory requirements as well. The flat scheme requires O(N^2) nonzero elements of Z to encode all tasks, while the hierarchical scheme requires only O(N log N). Hence hierarchical decomposition can yield qualitatively more efficient solutions and resource requirements, reminiscent of theoretical results obtained for perceptual deep learning (Bengio, 2009; Bengio & LeCun, 2007). We note that this advantage only occurs in the multitask setting: the flat scheme can learn one specific task in time O(N). Hence hierarchy is beneficial when performing an ensemble of tasks, due to the reuse of component policies across many tasks (see also Solway et al. (2014)).
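The scaling can be sanity-checked numerically. The sketch below follows the argument above with M = log N and D = log_M N (the rounding choices are our own):

```python
import math

# Hierarchical training cost: sum_{l=1}^{D} M (N/M^l + N/M^{l-1}),
# compared against the flat scheme's O(N^2) iterations.
def hierarchical_cost(N):
    M = max(2, round(math.log(N)))          # subtask spacing M = log N
    D = max(1, round(math.log(N, M)))       # depth D = log_M N
    return sum(M * (N / M**l + N / M**(l - 1)) for l in range(1, D + 1))

for N in (10**3, 10**4, 10**5):
    assert hierarchical_cost(N) < N**2                    # far below the flat cost
    assert hierarchical_cost(N) < 10 * N * math.log(N)    # within a small factor of N log N
```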
5. Experiments

5.1. Conceptual Demonstration
To illustrate the operation of our scheme, we apply it to a 2D grid-world 'rooms' domain (Fig. 3(a)). The agent is required to navigate through an environment consisting of four rooms with obstacles to a goal location in one of the rooms. The agent can move in the four cardinal directions or remain still. To build a hierarchy, we place six higher layer subtask goal locations throughout the domain (Fig. 3(a), red dots). The inferred passive dynamics for the higher layer MLMDP is shown in Fig. 3(a) as weighted lines between these subtask states. The higher layer passive dynamics conform to the structure of the problem, with the probability of transition between higher layer states roughly proportional to the distance between those states at the lower layer.
As a basic demonstration that the hierarchy conveys useful information, Fig. 3(b) shows Z-learning curves for the base layer policy with and without an omniscient hierarchy (i.e., one preinitialized using Z-iteration). From the start of learning, the hierarchy is able to drive the agent to the vicinity of the goal. Fig. 3(c-d) illustrates the evolving distributed task representation commanded by the higher layer to the lower layer over the course of a trajectory. At each time point, several subtasks have nonzero weight, highlighting the concurrent execution in the system.

Figure 3. Rooms domain. (a) Four room domain with subtask locations marked as red dots, free space in white and obstacles in black, and derived higher-level passive dynamics shown as weighted links between subtasks. (b) Convergence as trajectory length over learning epochs with and without the help of the hierarchy. (c) Sample trajectory (gray line) from upper left to goal location in bottom right. (d) Evolution of distributed task weights on each subtask location over the course of the trajectory in panel (c).
Figure 4. Desirability functions. (a) Desirability function over states as agent moves through environment, showing the effect of reward inpainting from the hierarchy. (b) Desirability function over states as goal location moves. Higher layers inpaint rewards into subtasks that will move the agent nearer the goal.
Fig. 4(a) shows the composite desirability function resulting from the concurrent task blend for different agent locations. Fig. 4(b) highlights the multitasking ability of the system, showing the composite desirability function as the goal location is moved. The leftmost panel, for instance, shows that rewards are painted concurrently into the upper left and bottom right rooms, as these are both equidistant to the goal.
5.2. Quantitative Comparison
In order to quantitatively demonstrate the performance improvements possible with our method, we consider the problem of a mobile robot navigating through an office block in search of a charging station. This domain is shown in Fig. 5, and is taken from previous experiments in transfer learning (Fernández & Veloso, 2006).
Although the agent again moves in one of the four cardinal directions at each time step, the agent is additionally provided with a policy to navigate to each room in the office block, with goal locations marked by an 'S' in Fig. 5. The goal of the agent is to learn a policy to navigate to the nearest charging station from any location in the office block, given that there are a number of different charging stations available (one per room, but only in some of the rooms). This corresponds to an 'OR' navigation problem: navigating to location A OR B, while knowing separately a policy to get to A, and a policy to get to B. Concretely, the problem is to navigate to the nearest of two unknown subtask locations. The agent is randomly initialized at interior states, and the trajectory lengths are capped at 500 steps.
Figure 5. The office building domain. Each 'subtask' or 'option' policy navigates the agent to one of the rooms.
We compare a one-layer deep implementation of our method to a simple implementation of the options framework (Sutton et al., 1999). In the options framework the agent's action space is augmented with the full set of option policies. The initialization set for these options is the full state space, so that any option may be executed at any time. The termination condition is defined such that the option terminates only when it reaches its goal state. To minimize the action space for the options agent, we remove the primitive actions, reducing the learning problem to simply choosing the single correct option from each state. The options learning problem is solved using Q-learning with sigmoidal learning rate decrease and ε-greedy exploration. These parameters were optimized on a coarse grid to yield the fastest learning curves. Results are averaged over 20 runs.
Figure 6. Learning rates. Our method jump-starts performance in the OR task, while the options agent must learn a state-action mapping.
Fig. 6 shows that our agent receives a significant jump-start in learning. By leveraging the distributed representation provided by our scheme, the agent is required only to learn when to request information from the higher layer. Although the task is novel, the higher layer can express it as a combination of prior tasks. Conversely, the options agent must learn a unique mapping between all states and actions. This issue is exacerbated as the number of available options grows (Rosman & Ramamoorthy, 2012).
6. Conclusion

The Multitask LMDP module provides a novel approach to control hierarchies, based on a distributed representation of tasks and parallel execution. Rather than learn to perform one task or a fixed library of tasks, it exploits the compositionality provided by linearly solvable Markov decision processes to perform an infinite space of task blends optimally. Stacking the module yields a deep hierarchy abstracted in state space and time, with the potential for qualitative efficiency improvements.
Experimentally, we have shown that the distributed representation provided by our framework can speed up learning in a simple navigation task, by representing a new task as a combination of prior tasks.
While a variety of sophisticated reinforcement learning methods have made use of deep networks as capable function approximators (Mnih et al., 2015; Lillicrap et al., 2015; Levine et al., 2016), in this work we have sought to transfer some of the underlying intuitions, such as parallel distributed representations and stackable modules, to the control setting. In the future this may allow other elements of the deep learning toolkit to be brought to bear in this setting, most notably gradient-based learning of the subtask structure itself.
Acknowledgements

We thank our reviewers for thoughtful comments. AMS acknowledges support from the Swartz Program in Theoretical Neuroscience at Harvard. AE and BR acknowledge support from the National Research Foundation of South Africa.
ReferencesBarreto, A., Munos, R., Schaul, T., and Silver, D. Successor
Features for Transfer in Reinforcement Learning. arXiv, 2016.
Barto, A.G. and Madadevan, S. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems: Theory and Applications, 13:41–77, 2003.
Bellman, R.E. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
Bengio, Y. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (eds.), Large-Scale Kernel Machines. MIT Press, 2007.
Bonarini, A., Lazaric, A., and Restelli, M. Incremental Skill Acquisition for Self-motivated Learning Animats. In Nolfi, S., Baldassare, G., Calabretta, R., Hallam, J., Marocco, D., Miglino, O., Meyer, J.-A., and Parisi, D. (eds.), Proceedings of the Ninth International Conference on Simulation of Adaptive Behavior (SAB-06), volume 4095, pp. 357–368, Heidelberg, 2006. Springer Berlin.
Borsa, D., Graepel, T., and Shawe-Taylor, J. Learning Shared Representations in Multi-task Reinforcement Learning. arXiv, 2016.
Botvinick, M.M., Niv, Y., and Barto, A.C. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition, 113(3):262–280, 2009.
Burridge, R.R., Rizzi, A.A., and Koditschek, D.E. Sequential Composition of Dynamically Dexterous Robot Behaviors. The International Journal of Robotics Research, 18(6):534–555, 1999.
Dayan, P. Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation, 5(4):613–624, 1993.
Dayan, P. and Hinton, G. Feudal Reinforcement Learning. In NIPS, 1993.
Dietterich, T.G. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
Drummond, C. Composing functions to speed up reinforcement learning in a changing world. In Nédellec, C. and Rouveirol, C. (eds.), Machine Learning: ECML-98, pp. 370–381, Heidelberg, 1998. Springer Berlin.
Dvijotham, K. and Todorov, E. Inverse Optimal Control with Linearly-Solvable MDPs. In ICML, 2010.
Dvijotham, K. and Todorov, E. A unified theory of linearly solvable optimal control. In Uncertainty in Artificial Intelligence, 2011.
Fernández, F. and Veloso, M. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, p. 720, 2006.
Foster, D. and Dayan, P. Structure in the Space of Value Functions. Machine Learning, 49:325–346, 2002.
Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
Hinton, G.E., Osindero, S., and Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18:1527–1554, 2006.
Howard, R.A. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
Jonsson, A. and Gómez, V. Hierarchical Linearly-Solvable Markov Decision Problems. In ICAPS, 2016.
Kappen, H.J. Linear Theory for Control of Nonlinear Stochastic Systems. Physical Review Letters, 95(20):200201, 2005.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research, 17:1–40, 2016.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv, 2015.
Masson, W., Ranchod, P., and Konidaris, G.D. Reinforcement Learning with Parameterized Actions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1934–1940, 2016.
Mausam and Weld, D.S. Planning with Durative Actions in Stochastic Domains. Journal of Artificial Intelligence Research, 31:33–82, 2008.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Pan, Y., Theodorou, E.A., and Kontitsis, M. Sample Efficient Path Integral Control under Uncertainty. In NIPS, 2015.
Parr, R. and Russell, S. Reinforcement learning with hierarchies of machines. In NIPS, 1998.
Precup, D., Sutton, R., and Singh, S. Theoretical results on reinforcement learning with temporally abstract options. In ECML, 1998.
Ribas-Fernandes, J.J.F., Solway, A., Diuk, C., McGuire, J.T., Barto, A.G., Niv, Y., and Botvinick, M.M. A neural signature of hierarchical reinforcement learning. Neuron, 71(2):370–379, 2011.
Rosman, B. and Ramamoorthy, S. What good are actions? Accelerating learning using learned action priors. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL). IEEE, 2012.
Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. Proceedings of the 30th International Conference on Machine Learning, 28(1):507–515, 2013.
Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal Value Function Approximators. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1312–1320, 2015.
Solway, A., Diuk, C., Córdova, N., Yee, D., Barto, A.G., Niv, Y., and Botvinick, M.M. Optimal Behavioral Hierarchy. PLoS Computational Biology, 10(8):e1003779, 2014.
Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. The MIT Press, 1998.
Sutton, R.S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
Taylor, M.E. and Stone, P. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10:1633–1685, 2009.
Todorov, E. Linearly-solvable Markov decision problems. In NIPS, 2006.
Todorov, E. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478–11483, 2009a.
Todorov, E. Compositionality of optimal control laws. In NIPS, 2009b.