
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Learning Goal-Directed Behaviour

MARCEL BINZ

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Learning Goal-Directed Behaviour

MARCEL BINZ

[email protected]

Master in Machine Learning

Date: June 14, 2017

Supervisor: Florian Pokorny

Examiner: Danica Kragic

School of Computer Science and Communication


Abstract

The learning of behaviour in artificial agents is commonly studied in the framework of Reinforcement Learning. Reinforcement Learning has gained increasing popularity in recent years, partially due to developments that made it possible to employ complex function approximators, such as deep networks, within the framework. Two of the core challenges in Reinforcement Learning are the correct assignment of credit over long periods of time and dealing with sparse rewards. In this thesis we propose a framework based on the notion of goals to tackle these problems. This work implements several components required to obtain a form of goal-directed behaviour, similar to how it is observed in human reasoning. This includes the representation of a goal space, learning how to set goals and finally how to reach them. The framework itself is built upon the options model, which is a common approach for representing temporally extended actions in Reinforcement Learning. All components of the proposed method can be implemented as deep networks and the complete system can be learned in an end-to-end fashion using standard optimization techniques. We evaluate the approach on a set of continuous control problems of increasing difficulty. We show that we are able to solve a difficult gathering task, which poses a challenge to state-of-the-art Reinforcement Learning algorithms. The presented approach is furthermore able to scale to complex kinematic agents of the MuJoCo benchmark.


Sammanfattning

The learning of behaviour for artificial agents is usually studied within Reinforcement Learning. Reinforcement Learning has recently received increased attention, partly due to developments that have made it possible to use complex function approximators, such as deep networks, in combination with Reinforcement Learning. Two of the core challenges in Reinforcement Learning are the credit assignment problem over long periods of time and the handling of sparse rewards. In this thesis we propose a framework based on sub-goals to handle these problems. This work investigates the components required to obtain a form of goal-directed behaviour, similar to what is observed in human reasoning. This includes the representation of a goal space, learning to set goals, and finally learning the behaviour to reach them. The framework builds on the options model, which is a common approach for representing temporally extended actions in Reinforcement Learning. All components of the proposed method can be implemented with deep networks and the complete system can be trained end-to-end using standard optimization techniques. We evaluate the approach on a range of continuous control problems of varying difficulty. We show that we can solve a challenging gathering task for which previous state-of-the-art algorithms have had difficulties finding solutions. The presented method can furthermore be scaled up to complex kinematic agents in MuJoCo simulations.

Contents

1 Introduction
2 Related Work
  2.1 Feudal Reinforcement Learning
  2.2 Option Discovery
  2.3 Universal Value Function Approximators
  2.4 Alternative Hierarchical Approaches
  2.5 Intrinsic Motivation
3 Preliminaries
  3.1 Reinforcement Learning
    3.1.1 Markov Decision Processes
    3.1.2 Monte-Carlo Methods
    3.1.3 Temporal-Difference Learning
    3.1.4 Policy Gradients Theorem
    3.1.5 Asynchronous Advantage Actor-Critic (A3C)
  3.2 Function Approximation with Neural Networks
    3.2.1 Feed-forward Neural Networks
    3.2.2 Recurrent Neural Networks
    3.2.3 Gradient-Based Learning
    3.2.4 Attention Mechanisms
  3.3 Hierarchical Reinforcement Learning
    3.3.1 Options Framework
    3.3.2 Universal Value Function Approximators
4 Methods
  4.1 Hierarchical Reinforcement Learning with Goals
    4.1.1 Motivation
    4.1.2 Goal-Setting Policy
    4.1.3 Goal-Reaching Policy
    4.1.4 Interpretation as Options
  4.2 Connections to Related Approaches
5 Experiments
  5.1 Randomized Goals with Fixed Embeddings
    5.1.1 Task
    5.1.2 Experimental Details
    5.1.3 Results
    5.1.4 Discussion
  5.2 Learned Goals with Fixed Embeddings
    5.2.1 Task
    5.2.2 Experimental Details
    5.2.3 Results
    5.2.4 Discussion
  5.3 Learned Goals with Learned Embeddings
    5.3.1 Task
    5.3.2 Experimental Details
    5.3.3 Results
    5.3.4 Discussion
6 Conclusion and Future Work
Bibliography

Notation

We provide a short summary of the notation used in this thesis. In general, vectors are denoted with bold lowercase letters and matrices with bold uppercase letters.

s_t                     Observation at time step t
a_t                     Action at time step t
z_t                     Activation in goal space at time step t
g_t                     Goal at time step t
h_t                     Hidden state of a neural network at time step t
S                       Observation space
A                       Action space
G                       Goal space
p(s_{t+1} | s_t, a_t)   Transition probability
r_t                     Reward at time step t
R_t                     Return following time step t
V^π(s)                  State value function of state s when following policy π
Q^π(s, a)               Action value function of state s and action a when following policy π
A^π(s, a)               Advantage function of state s and action a when following policy π
π                       Policy
φ                       Goal embedding
Ω                       Set of all available options
ω                       Single option from Ω
α                       Learning rate
β                       Entropy scaling factor
γ                       Discount factor
θ                       Set of parameters being learned
τ                       Trajectory of states and actions
W                       Weight matrix of a neural network layer
b                       Bias vector of a neural network layer
y                       Targets in supervised learning
ŷ                       Predictions corresponding to y
E[·]                    Expectation of a random variable
(·)_S                   Subscript referring to the goal-setting policy
(·)_R                   Subscript referring to the goal-reaching policy


Chapter 1

Introduction

Being able to build computational agents that can perform complex tasks in unknown environments is of enormous value to our society. In general we would like to enable such systems to learn their behaviour from experience, as defining it explicitly is often impossible. Reinforcement Learning (RL) provides a general mathematical framework to model such agents. Recent advances in RL, in particular its combination with methods from Deep Learning, have allowed these methods to scale to much more challenging problems. These problems include, amongst others, playing ATARI 2600 games [1], robotic manipulation [2] and mastering the game of Go [3].

Two of the key challenges in Reinforcement Learning are proper credit assignment over long periods of time and dealing with sparse rewards. To highlight these problems, let us inspect the results reported for different games of the well-established ATARI 2600 benchmark [4]. Here we notice that state-of-the-art methods [1, 5] are able to solve most reactive games, such as Ping-Pong or Breakout, easily. However, games that require long-term planning and abstract reasoning, such as Montezuma's Revenge, pose a much harder challenge for these algorithms.

Hierarchical Reinforcement Learning attempts to deal with these challenges by decomposing the problem into smaller sub-tasks, which are then solved independently. There are different ways in which this decomposition can be realized, and we will review some of these in later parts of this thesis. In this work we will focus on a form of hierarchical RL that is based on the notion of goals.

For this we will describe a framework that exhibits a form of goal-directed behaviour. In our approach we strive for as much generality as possible, meaning that all components should be learned from interaction with the world. These components include:

1. Learning the structure of a goal space

2. Setting goals in the given space

3. Reaching those goals through atomic actions

Over the course of this thesis we will relax some of the generality assumptions in order to build up an incremental solution that tackles all three components required to attain goal-directed behaviour.


It may not be directly obvious how to define goals in mathematical terms. It is, however, important to provide such a definition in order to implement the described mechanics. For our purposes goals are elements of an abstract goal space, such as a vector space. These representations may be specified by the human designer or correspond to hidden states of a deep network. In general it is required that the agent can infer its current position in goal space, decide where it would like to be, and is able to perform actions to minimize the discrepancy between its current position and the goal.

Let us consider a few examples to be more concrete about this concept and to highlight its potential advantages. In the first scenario our task is to maneuver an ant-like robot through a maze, and we are only rewarded for reaching the exit. While we have to learn complex behaviour sequences to control the robot, the reward structure of the task depends only on its location in the maze. Therefore it is natural to consider a decomposition based on goals. Here the goal space could correspond to locations in the maze which the robot would like to reach. For other tasks it may be desirable to exhibit more complex structures. In a first-person shooter game, goals could, in addition to moving to certain locations, correspond to high-level events, such as a kill on an opponent or collecting new weapons.

Reasoning about goals could provide a means for more structured exploration in complex problems. This is especially important in tasks with sparse or even non-existent rewards. Such settings are challenging for current RL approaches, as these only perform exploration through sampling from stochastic policies, such as a Boltzmann distribution for discrete actions or a Gaussian distribution in the continuous case. However, in many tasks adding this kind of noise will never lead to meaningful action sequences.

Both Reinforcement Learning and Deep Learning are to some extent inspired by how humans learn to represent and solve problems. In a similar fashion we also observe forms of goal-directed reasoning in human behaviour. Indeed, psychologists and cognitive scientists often distinguish between two types of behaviour – habitual and goal-directed [6, 7]. There are also links to neuroscience, where for example the neurotransmitter dopamine is believed to regulate goal-directed behaviour [8] in the human brain. For a recent overview of connections between hierarchical RL, psychology and neuroscience we refer the reader to [9].

The main contributions of the presented work are the proposal and implementation of a framework for end-to-end learning of goal-directed behaviour, its application to continuous control problems and the investigation of its properties through experiments. The remainder of this thesis is structured as follows. First we provide a broad overview of related work in Chapter 2. This is followed in Chapter 3 by a presentation of the required background techniques. Chapter 4 outlines the proposed framework and motivates its specific design choices. It also places the suggested technique in the context of similar approaches. We validate the approach in Chapter 5 through a set of experiments of increasing difficulty. This chapter also includes an analysis of the obtained results. We conclude this thesis in Chapter 6.

Chapter 2

Related Work

A major part of the work presented in this thesis is based on the ability to apply Reinforcement Learning techniques in high-dimensional, continuous observation and action spaces. In these cases it is natural to employ complex function approximators such as deep networks. Combining these two methods, however, proves challenging due to stability problems during the learning process. Recent work on Deep Q-Networks (DQN) [1] showed how to overcome these issues when dealing with discrete actions. Since then these methods have been extended to the case of continuous control. Here we make extensive use of A3C [5], which is an on-policy actor-critic algorithm. There is, however, a wide range of other algorithms that could be applied instead, such as continuous DQN variants [10], Trust Region Policy Optimization [11], off-policy actor-critics [12, 13] or methods with deterministic policies [14, 15].

A3C itself has been extended in several ways. Generalized Advantage Estimation [16] can be employed as a technique to control bias and variance in the estimation of the advantage function. [12] proposes a fast and flexible trust region update method, which was tested empirically in combination with A3C. UNREAL [17] adds a set of unsupervised auxiliary tasks on top of A3C and shows significant improvements in stability, speed and performance.

The tasks we consider in this report are based on the MuJoCo physics simulator [18] and have been proposed in [19]. Since then they have been used in several research papers.

2.1 Feudal Reinforcement Learning

Hierarchical Reinforcement Learning is an area of diverse approaches. Out of all of these, Feudal Reinforcement Learning (FRL) [20] resembles the model proposed in this thesis the most. FRL is based on the metaphor of the hierarchical aspects of a feudal fiefdom. It defines a framework consisting of a hierarchy of managers, in which higher ones are able to set sub-tasks for lower ones. Recently two models have been proposed which combine FRL with deep networks [21, 22]. The former [21] has been applied to the challenging ATARI 2600 game Montezuma's Revenge. It uses a two-level DQN hierarchy, in which the higher level operates over a hand-designed space of entities and relations. [22] design an end-to-end trainable model that fits the FRL framework and learns its high-level goal space jointly with the task. Their model is evaluated on a set of challenging ATARI 2600 games, where it performs favourably against rival approaches. Also in this line of work is the HASSLE [23] algorithm, in which a high-level policy selects a sub-goal from a set of abstract high-level observations, which is obtained through an unsupervised vector quantization technique. Low-level policies are tasked with reaching these goals. In their case a limited set of low-level policies learns to associate itself automatically with the most suitable high-level observations. Recently a related model has been applied to continuous action spaces, where a simulated biped robot learns to dribble a soccer ball and to navigate through static as well as dynamic obstacles [24].

2.2 Option Discovery

The options framework [25] is a popular choice for modelling temporal abstractions in RL. An option is defined as a temporally extended sequence of actions with a termination criterion. As the framework itself does not describe how options arise, there is a large number of additional methods for learning them from experience. Identifying suitable sub-goals is often the first step when constructing options. Given this connection to goal-directed behaviour, we will briefly review some methods for option discovery in the following.

In this setting, Machado et al. propose to identify purposes that are just out of reach [26]. This is achieved through the use of singular value decomposition on changes in consecutive observations of the environment. The result is then transformed into sub-goals, which are learned through an intrinsic reward function. A related approach is taken in [27], where PCCA+ is used to perform a spectral clustering of the MDP.

A similar concept was proposed by the authors of [28]. Their approach relies on identifying bottlenecks in the observation space. In their case a bottleneck is defined as a state that is visited on many successful paths (such as a doorway). Discovered states are then turned into options through a learned policy based on an intrinsic reward function.

The option-critic architecture [29] extends the policy gradient theorem to options. This allows internal policies to be learned jointly with the policy over options. The approach is appealing as it only requires specifying the number of options, but no intrinsic rewards or sub-goals.

2.3 Universal Value Function Approximators

How do we provide the agent with information about which sub-goal it is currently trying to reach? Most of the approaches outlined in this section learn an independent policy for each possible sub-goal. Universal Value Function Approximators (UVFA) [30], on the other hand, provide the ability to treat sub-goals as a part of the observation. This makes it possible to generalize between different sub-goals and allows, in theory, the use of infinite sets of them.

UVFAs have been applied to several settings that include hand-designed, high-level goals. This includes visual navigation of robots [31, 32], where goals are locations on a map, and first-person shooter games [33], where goals correspond to the maximization of different game measurements (such as health, ammunition or kill score).


2.4 Alternative Hierarchical Approaches

There is also a large set of alternative approaches to hierarchical Reinforcement Learning besides those that are based on sub-goals and options. The MAXQ value function decomposition [34] decomposes the original MDP into a set of smaller MDPs such that the original value function is transformed into an additive combination of the new ones. For this the programmer has to identify and define useful sub-tasks in advance.

The Hierarchy of Abstract Machines [35] model is built upon a set of non-deterministic finite state machines, whose transitions can trigger machines at lower levels. This setup constrains the actions that the agent can take in each state, allowing prior knowledge to be incorporated into the task at hand.

Macro-actions are a simpler instance of options, where whole action sequences are decided at the point of their initiation. They were proposed in [36] and have since been investigated in other works such as [37]. STRAW [38] consists of a deep network that maintains a plan in the form of macro-actions. This plan can be adjusted at certain re-planning points. The whole model is trainable end-to-end and does not require any pseudo-rewards or additional hand-crafted information.

The Horde architecture [39] shows many similarities to the approach presented in this work. It consists of a large number of sub-agents, called demons. Each of the demons has its own policy and is responsible for answering a single goal-oriented question. These questions are posed in the form of different value functions. The whole system runs many demons in parallel, similar to how A3C is used in this thesis.

2.5 Intrinsic Motivation

Employing goal-based methods provides a mechanism for structured exploration of the environment. This objective, however, can also be realized through other means. It is often investigated through the lens of intrinsic motivation. Here one strategy is to add a bonus reward based on visitation statistics of states [40]. This bonus then encourages the agent to visit states that have not been seen before. Recently, techniques based on hashing [41] or density models [42, 43] have been proposed to keep track of visitation counts in continuous state spaces.

Yet another approach to obtain intrinsic motivation is via uncertainties in predictions. For example, we can model uncertainties with a Bayesian model [44] and then use Thompson sampling [45] to exploit them for exploration. Another alternative is to maximize the reduction in uncertainty of a dynamics model via the mutual information between the next-state distribution and the model parameters [46], or to use surprise-based mechanisms [47, 48].

Chapter 3

Preliminaries

To allow computational agents to adapt their behaviour based on experience, we first

need to establish the mathematical foundations of interacting within an environment.

Different aspects of this process will be presented within this chapter.

The first part provides an overview of the theory of Reinforcement Learning (RL) as it is employed in later parts of this report. First the framework will be outlined and several of its properties will be highlighted. We then continue to describe how decisions can be modelled within the framework as Markov Decision Processes (MDP). This section will also define the quantities that should be optimized in order to learn desired behaviour. Most Reinforcement Learning methods alternate in some way between estimating the quality of the current behaviour in terms of value functions and improving that behaviour. We present the two major techniques for learning to estimate value functions, namely Monte-Carlo methods and Temporal-Difference learning. Finally we will show how the obtained estimates can be used to update the agent's actions via the policy gradient theorem.

Traditional RL methods rely on table-based methods to keep track of their prior experience. These methods are, however, not applicable in high-dimensional, continuous spaces. Here one can instead resort to (parametrized) function approximators. In this setting neural networks¹ are a popular choice due to their ability to learn arbitrary functions under mild assumptions [49, 50]. As part of this chapter we will define both feed-forward and recurrent neural network models and show how these can learn from data using gradient-based updates.

We will conclude this chapter with a section on hierarchical Reinforcement Learning. For this we will outline the options framework [25], which allows high-level primitives to be incorporated into the traditional RL framework. We will also highlight a connection between options and semi-Markov Decision Processes, which allows some theoretical guarantees for traditional RL to be transferred to its hierarchical counterpart. Finally we will present Universal Value Function Approximators (UVFA), which enable generalization between a – possibly infinite – set of options.

¹ In this thesis we use the terms neural network and deep network interchangeably.


Figure 3.1: Interaction between agent and environment in the RL setting: the agent performs an action, and the environment responds with a new observation and a reward. Figure adopted from [51].

3.1 Reinforcement Learning

The Reinforcement Learning (RL) [51] framework provides the foundations for learning sequential decision making through interaction with unknown environments. It assumes an agent that is able to interact with a given environment. At each discrete time step the agent perceives an observation from the environment and may interact with it by performing an action. This in turn influences the state of the environment and triggers a new observation as well as a reward signal; see Figure 3.1 for an illustration of this process. The agent is exposed to the environment and has to adapt its behaviour through a trial-and-error learning process such that the received rewards are maximized.

This formulation is extremely general, as it makes few assumptions, and thus it can be applied to a wide range of problems. Indeed, it includes for example supervised learning as a special case. However, in contrast to classical supervised tasks, Reinforcement Learning defines closed-loop systems, in which the learner's actions influence possible future perceptions. Furthermore, it introduces an additional layer of complexity as desired behaviour is not given in advance, but has to be figured out first. This means that the agent usually has to find a trade-off between exploration of the environment and exploitation of current knowledge.

Reinforcement Learning is partially inspired by biological learning systems and shows further connections to psychology and neuroscience. For example, it has been shown that a form of Temporal-Difference error (see Section 3.1.3) is correlated with activations in the human brain [52]. This thesis will focus on RL methods that are applicable to high-dimensional, continuous state and action spaces. This section is largely based on [51] and we refer the reader to this book whenever references are omitted.
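To make the interaction loop of Figure 3.1 concrete, the following minimal sketch runs an agent against a Gym-style environment (the classic env.reset() / env.step() interface); the RandomAgent and the chosen task are illustrative assumptions and not part of the method developed in this thesis.

import gym

class RandomAgent:
    """Stand-in for a learned policy: picks actions uniformly at random."""
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()

env = gym.make("Pendulum-v0")            # any continuous control task would do
agent = RandomAgent(env.action_space)

observation = env.reset()
episode_return, done = 0.0, False
while not done:
    action = agent.act(observation)                  # agent -> environment
    observation, reward, done, _ = env.step(action)  # environment -> agent
    episode_return += reward
print("undiscounted episode return:", episode_return)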

3.1.1 Markov Decision Processes

In Reinforcement Learning the interaction between an agent and the environment is modelled as a Markov Decision Process (MDP), defined as a tuple (S, A, p, r, γ). Here only the discrete time step setting is considered. At time step t the agent receives a state of the environment s_t ∈ S, based on which it selects an action a_t ∈ A. In our case S and A will correspond to vector spaces of dimensions n_s and n_a respectively. The choice of the action triggers a reward signal r_t ∈ R. The agent's observations at the beginning of an episode are defined according to an initial state distribution p_1(s_1), and the environment's transitions are given by a transition distribution p(s_{t+1} | s_t, a_t), such that the Markov property is fulfilled (hence the name Markov Decision Process). In most cases we do not have access to these distributions, such that we have to either use them implicitly (the model-free case) or learn them from experience (the model-based case).

The agent maintains a policy π, which determines the choice of its actions. In most cases a stochastic policy π(a_t | s_t) is used. Stochastic policies have the advantage that they incorporate the possibility for exploration in the environment directly into the agent's behaviour. However, there are also notable exceptions which consider deterministic policies [14].

Let the return R_t^γ be the discounted sum of rewards from time step t onwards,

\[
R_t^\gamma = \sum_{k=t}^{\infty} \gamma^{k-t} r_k .
\]

Note that by introducing a discount factor γ ∈ (0, 1], this formulation extends easily to continuing tasks, because the sum is bounded for γ < 1. As designing good policies in high-dimensional spaces by hand is challenging, we set out to learn them from experience. To enable learning of desired behaviour it is crucial to define a quantity that should be optimized. In Reinforcement Learning this quantity is commonly based on the expected return. Our goal is therefore to adapt the policy such that the expected return is maximized. Let V^π(s) be the state-value function, which is the expected return starting from state s and then following policy π. In MDPs this is specified by the following equation:

\[
V^\pi(s) = \mathbb{E}_\pi\!\left[ R_t^\gamma \mid S_t = s; \pi \right] \qquad (3.1)
\]

V^π is a measure of the expected future reward for any state given a fixed policy π. An action-value function Q^π can be formulated accordingly as:

\[
Q^\pi(s, a) = \mathbb{E}_\pi\!\left[ R_t^\gamma \mid S_t = s, A_t = a; \pi \right] \qquad (3.2)
\]

Similar to V^π, Q^π describes how valuable it is to take an action in a certain state. In discrete action spaces it is straightforward to derive the optimal policy given the optimal action-value function Q* (through greedy selection of the action with the highest value). For continuous spaces one has to resort to different techniques, as the greedy choice is not applicable. To optimize the expected rewards from any start state, it is useful to define the following performance objective:

\[
J(\pi) = \mathbb{E}_\pi\!\left[ R_1^\gamma \mid \pi \right] \qquad (3.3)
\]

Learning to select good actions can be expressed as a form of general policy iteration

[51]. Policy iteration describes the process of alternating between estimating the current

value function and improving the policy. To ensure faster convergence one usually learns

value functions and the policy simultaneously. This however introduces non-stationary

behaviour, as the estimate of the value function changes with the policy. This is one of

the central issues when combining deep networks with Reinforcement Learning (usually

referred to as Deep Reinforcement Learning).


Figure 3.2: Backup diagrams are a method to highlight which information an agent in the RL setting uses for learning. The horizontal axis describes the spectrum of n-step TD learning, ranging from 1-step TD to Monte-Carlo. The vertical axis shows one learning iteration starting from a certain state at the top. White circles indicate states and corresponding estimates of value functions, while black circles are choices of actions. Termination of an episode is denoted with grey squares. Figure taken from [51].

Most Reinforcement Learning algorithms rely on estimating value functions. Therefore the next sections will introduce the two major methods to achieve this. This will be followed by an explanation of how those estimates can be used in tasks with continuous action spaces.

3.1.2 Monte-Carlo Methods

In many interesting tasks one does not have direct access to the environment's dynamics. In this setting Monte-Carlo methods offer a simple way to estimate value functions given a policy. Monte-Carlo methods, as described here, are restricted to episodic tasks. The general idea in Monte-Carlo methods for value function estimation is to take sample experience from one episode and use the resulting rewards to compute sample returns. Those are then used to update the current estimate of the value function. In this thesis neural networks will be employed as the function approximation technique to keep track of this estimate. In this case we use the Monte-Carlo samples as targets for training a value function network. The Monte-Carlo learning process can be illustrated with the rightmost backup diagram in Figure 3.2.
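As a minimal sketch (not code from this thesis), the Monte-Carlo targets for a value function network can be computed from one recorded episode by pairing every visited state with the discounted sum of rewards observed after it:

import numpy as np

def monte_carlo_targets(rewards, gamma=0.99):
    """Return R_t = sum_{k >= t} gamma^(k-t) * r_k for every time step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards collected in a short episode; the resulting returns would be
# used as regression targets for V(s_t).
print(monte_carlo_targets([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))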

3.1.3 Temporal-Difference Learning

Temporal-Difference methods take a different path towards learning value functions. Like Monte-Carlo methods they are model-free, but they differ in the targets they use for updating their estimates. Instead of waiting for the end of an episode, TD methods base their updates on other estimates – they bootstrap. In the simplest form, which is known as TD(0), the estimates for value functions are adjusted based on the TD error defined as:

\[
\delta_t = r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \qquad (3.4)
\]

Using bootstrapping can be advantageous, because updates do not have to wait until the end of an episode. This enables faster learning, which can turn out to be crucial in many cases, as episodes often span long periods. However, it comes at the cost of using potentially noisier estimates. When employing neural networks for function approximation we use r_{t+1} + γ V^π(s_{t+1}) as targets.

It is also possible to interpolate between TD and Monte-Carlo methods, leading to

n-step TD learning. Here the idea is to bootstrap the estimate after observing n sample

rewards. Figure 3.2 illustrates the spectrum of n-step TD learning, ranging from TD(0) to

Monte-Carlo methods.
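The following sketch contrasts the two kinds of targets, assuming we already have a value estimate V(s) for the bootstrapped state; the helper names are illustrative and not part of the thesis code:

def td0_target(r_next, v_next, gamma=0.99):
    """1-step (TD(0)) target: r_{t+1} + gamma * V(s_{t+1})."""
    return r_next + gamma * v_next

def n_step_target(rewards, v_bootstrap, gamma=0.99):
    """n-step target: sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * V(s_{t+n})."""
    target = v_bootstrap
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# n = 1 recovers TD(0); n equal to the episode length recovers the Monte-Carlo return.
print(td0_target(r_next=1.0, v_next=0.5))
print(n_step_target(rewards=[1.0, 0.0, 0.0], v_bootstrap=0.5))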

3.1.4 Policy Gradients Theorem

The policy gradient theorem [53] provides a gradient-based update rule for stochastic policies. It is a special case of Score Function Estimators [54] and is sometimes also referred to as the log-derivative trick. The theorem allows us to derive the gradient of J(π_θ), as defined in Equation 3.3, as follows:

\[
\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \, \mathbb{E}_\pi\!\left[ R_1^\gamma \mid \pi_\theta \right] \\
&= \nabla_\theta \sum_\tau p(\tau; \theta)\, r(\tau) \\
&= \sum_\tau \nabla_\theta\, p(\tau; \theta)\, r(\tau) \\
&= \sum_\tau p(\tau; \theta)\, \frac{\nabla_\theta\, p(\tau; \theta)}{p(\tau; \theta)}\, r(\tau) \\
&= \sum_\tau p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)\, r(\tau) \\
&\approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log p(\tau^{(i)}; \theta)\, r(\tau^{(i)}) \qquad (3.5)
\end{aligned}
\]

where in the last line the expectation is approximated with m sample trajectories, and τ^(i) represents the i-th sample. Here we denote by π_θ a policy parametrized through θ ∈ R^{n_θ}. For convenience, we define τ to be a state-action sequence ((s_1, a_1), ..., (s_T, a_T)) and r(τ) = Σ_{t=1}^{T} γ^{t-1} r_t. What remains to be specified is how to obtain ∇_θ log p(τ^(i); θ):


\[
\begin{aligned}
\nabla_\theta \log p(\tau^{(i)}; \theta) &= \nabla_\theta \log \left[ \prod_{t=1}^{T} p\!\left(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}\right) \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \right] \\
&= \nabla_\theta \left[ \sum_{t=1}^{T} \log p\!\left(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}\right) + \sum_{t=1}^{T} \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \right] \\
&= \nabla_\theta \sum_{t=1}^{T} \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \qquad (3.6)
\end{aligned}
\]

After combining Equations 3.5 and 3.6 we obtain the full version of the policy gradient theorem. The fact that the gradient of the objective does not depend on the gradient of the state distribution allows for a wide range of applications. A straightforward application of the theorem is the REINFORCE [55] algorithm, which uses sample returns R_t^γ as an estimate of the expected future rewards. While using the sample return is an unbiased estimator, it exhibits high variance [16]. Several methods have been proposed to reduce its variance. One of these is the actor-critic architecture [53], which maintains a second model – the critic – to estimate Q^π. The critic learns to predict value functions using for example Temporal-Difference learning or Monte-Carlo methods (as described in the previous two sections). Yet another approach to reduce variance is to subtract a baseline. We will see an example of this in the following section.

The theorem comes with an intuitive explanation. The higher the estimate of the future rewards, the more we increase the probability of the actions that led to the outcome. For negative values, the probability is decreased.
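The estimator in Equation 3.5 can be written down in a few lines. The following sketch applies it to a small softmax policy over discrete actions; the tabular parametrization and the toy trajectory are assumptions made only for this illustration:

import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))       # policy parameters

def policy(s):
    """Softmax policy pi_theta(.|s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for the softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# One sampled trajectory of (s_t, a_t, r_t) tuples and its discounted return r(tau).
trajectory = [(0, 1, 0.0), (2, 0, 0.0), (3, 2, 1.0)]
gamma = 0.99
r_tau = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))

# REINFORCE gradient estimate from this single sample (m = 1 in Equation 3.5),
# followed by a step of gradient ascent on J(pi_theta).
grad = sum(grad_log_pi(s, a) for s, a, _ in trajectory) * r_tau
theta += 0.1 * grad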

3.1.5 Asynchronous Advantage Actor-Critic (A3C)

Combining neural networks with Reinforcement Learning is known to be notoriously difficult [1]. This is due to the fact that learning algorithms for neural networks generally assume data samples that are independent and identically distributed – an assumption that usually does not hold in RL. Another reason is the non-stationarity of the targets, to which the network has to adapt. Deep Q-Networks (DQN) [1] constitute one of the first systems to overcome these issues through the use of experience replay [56] and target networks [1].

Another approach that has been shown to be effective is to run multiple agents in an asynchronous manner. The asynchronous advantage actor-critic (A3C) [5] fits in this framework. It is a policy gradient method which launches multiple threads on the CPU in order to perform updates in the fashion of Hogwild! [57]. This means that experience from local models is used to update a shared model. Local models are then synchronized with the shared one from time to time. It was shown that this method can stabilize training and it furthermore has the advantage of being applicable in the on-policy setting.

A3C uses an estimate of the advantage function:

\[
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \qquad (3.7)
\]

for the return in Equation 3.5, where subtracting V^π(s) can be interpreted as establishing a baseline. The algorithm maintains two networks: one to approximate the policy π and one to estimate the state-value function V^π. Both networks are updated from experience either every T_max steps or when a terminal state is reached, using a mixture of n-step returns (see Section 3.1.3). It is also possible, and often desirable, to share parts of both networks. To encourage exploration an entropy regularization term [58] is added, leading to the gradient of the final objective:

\[
\nabla_\theta \log \pi(a_t \mid s_t; \theta)\left( R_t - V^\pi(s_t; \theta) \right) + \beta H(\pi(s_t; \theta)) \qquad (3.8)
\]

where Q^π is approximated with the k-step sample return R_t = Σ_{i=0}^{k-1} γ^i r_{t+i} + γ^k V^π(s_{t+k}; θ). H denotes the entropy of the output distribution and β its weight. The exact procedure is given as pseudo-code in Algorithm 1.

Data: globally shared parameters θ and θ_v; thread-specific parameters θ' and θ'_v
t = 1;
repeat
    Reset gradients: dθ ← 0 and dθ_v ← 0;
    Synchronize parameters: θ' = θ and θ'_v = θ_v;
    t_start = t;
    Get current state s_t;
    repeat
        Perform action a_t based on policy π;
        Obtain r_t and s_{t+1} from the environment;
        t ← t + 1;
    until t - t_start == T_max or terminated;
    if terminated then
        R = 0;
    else
        R = V(s_t; θ'_v);
    end
    for i ∈ {t - 1, ..., t_start} do
        R ← r_i + γR;
        Accumulate actor gradients: dθ ← dθ + ∇_{θ'} log π(a_i | s_i; θ')(R - V(s_i; θ'_v));
        Accumulate critic gradients: dθ_v ← dθ_v + ∂(R - V(s_i; θ'_v))² / ∂θ'_v;
    end
    Perform asynchronous update of θ using dθ and of θ_v using dθ_v;
until converged;

Algorithm 1: Pseudo-code for one thread in the A3C algorithm. θ denotes the parameters of the policy and θ_v the parameters of the value function estimator. Algorithm adopted from [5].

In the case of continuous actions the outputs of π are usually the statistics of a normal distribution with a diagonal covariance matrix. The authors of [5] furthermore found it beneficial to share the statistics of the optimizer between the threads.
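The core of the update in Algorithm 1 is the backwards computation of k-step returns, combined with the entropy bonus of Equation 3.8. The sketch below shows only this target computation in plain numpy; the network itself and the asynchronous machinery are deliberately omitted, and the variable names are assumptions of this example:

import numpy as np

def a3c_targets(rewards, values, bootstrap_value, gamma=0.99):
    """k-step returns R_t and advantages R_t - V(s_t) for one rollout."""
    R = bootstrap_value               # 0 if the rollout ended in a terminal state
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    return returns, returns - np.asarray(values)

def entropy(probs):
    """Entropy H(pi(s)) of a discrete action distribution."""
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + 1e-8))

rewards = [0.0, 0.0, 1.0]             # collected over T_max = 3 steps
values = [0.2, 0.3, 0.6]              # V(s_t) predicted by the critic
returns, advantages = a3c_targets(rewards, values, bootstrap_value=0.5)
print(returns, advantages, entropy([0.7, 0.2, 0.1]))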


3.2 Function Approximation with Neural Networks

The classical method to keep track of value functions is to store the experienced outcomes of visited states in a table. This approach has at least two major shortcomings. It does not allow generalization between similar states, and it is not directly applicable in continuous state spaces. In scenarios where the table-based setting cannot be employed, we can resort to function approximation techniques instead. The core concept behind function approximation in RL is to use a statistical model, which is successively adjusted to fit the value function of the current policy.

There are several options when selecting the model class. Linear function approximators [59], as the traditional choice, are appealing because they enjoy theoretical guarantees. However, linear models are limited in the set of functions they can represent. Examples of more flexible models that have received attention within RL include Gaussian Processes [60, 61] and neural networks [1, 56].

This thesis focuses on the use of deep networks, which are outlined in the following

subsections. We will start by defining feed-forward and recurrent networks. This is fol-

lowed by explaining the principles behind backpropagation, which is the most common

method for training deep networks. This section is largely based on [62], to which we

refer the reader whenever references are omitted.

3.2.1 Feed-forward Neural Networks

We often find ourselves in the scenario where we want to determine a function y = f*(x), but we do not have any access to f* directly. We can, however, observe corresponding inputs x and outputs y. One approach to solve this problem is to set up a model f(x; θ) that is parametrized by θ. Once the model is specified, we want to adjust its parameters θ such that it becomes the best fit to a given set of samples. Ideally we aim for models that generalize well to unseen examples. Feed-forward neural networks are one class of models that can be used in such a setting.

Feed-forward neural networks are composed of a set of simple transformations that are chained together to obtain the final output. Each of these transformations is referred to as a layer. Intermediate transformations are called hidden layers. We denote the output of the l-th layer as h^(l) and the depth, i.e. the total number of layers, of the network with L. The standard definition of one layer l is given by:

\[
h^{(l)} = g^{(l)}\!\left( W^{(l)} h^{(l-1)} + b^{(l)} \right) \qquad (3.9)
\]

where g^(l) is a non-linear activation function. Common choices include tanh, ReLU [63] and ELU [64]; see Figure 3.3 for a visualisation of their properties. W^(l) and b^(l) represent the parameters of a linear transformation. We denote the set of all of these as θ. h^(0) corresponds to the input x and h^(L) corresponds to the predicted output ŷ. The output size of the hidden layers can be adjusted and is commonly selected based on the difficulty of the task at hand.
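A small numpy sketch of Equation 3.9, chaining two such layers with tanh as the hidden activation; the layer sizes and inputs are arbitrary example values and not taken from this thesis:

import numpy as np

rng = np.random.default_rng(0)

def layer(h_prev, W, b, g):
    """One feed-forward layer: h = g(W h_prev + b)."""
    return g(W @ h_prev + b)

# Network with input size 4, one hidden layer of size 8 and a scalar output.
W1, b1 = 0.1 * rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(1, 8)), np.zeros(1)

x = rng.normal(size=4)                  # h^(0) = x
h1 = layer(x, W1, b1, np.tanh)          # hidden layer
y_hat = layer(h1, W2, b2, lambda v: v)  # linear output layer, h^(L) = y_hat
print(y_hat)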

The computations of a feed-forward network form a directed acyclic graph, meaning

that no feedback connections are allowed. Extensions to include feedback connections are

presented in the following section.


Figure 3.3: Visualisation of different activation functions. Left: tanh, Middle: ReLU, Right:

ELU.

It can be shown that a feed-forward network with a single hidden layer of finite size is already universal [49, 50] under mild conditions, i.e. it can approximate any continuous function on a compact subset of R^n. There is, however, a lot of empirical evidence that creating deeper networks leads to better performance [65, 66], as well as some theoretical justification [67].

3.2.2 Recurrent Neural Networks

Recurrent neural networks (RNN) [68] generalize the formulation of the last section to include recurrent (i.e. feedback) connections. They allow us to model a sequence of inputs x = (x_1, ..., x_T) with the same set of transformations at each step. This invariance property greatly reduces the number of parameters in the model, but assumes that characteristics are shared over the sequence, which often turns out to be the case. In general, one layer in a recurrent network is described by Equation 3.10:

\[
h_t = f(x_t, h_{t-1}) \qquad (3.10)
\]

where f can be any function. For convenience of notation, all definitions in this section are presented for one layer. They could, however, be extended to multiple layers in a straightforward manner, as done in Equation 3.9. In the traditional one-layer (Elman) RNN [69], f is given by:

\[
\begin{aligned}
h_t &= \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \\
y_t &= W_{hy} h_t + b_y \qquad (3.11)
\end{aligned}
\]

In recent years RNNs have achieved state-of-the-art performance in many sequence modelling tasks, such as machine translation [70] or speech recognition [71]. It is also natural to consider them in the RL setting, as the agent there deals with a constant stream of inputs.

It is well studied that RNNs suffer from vanishing gradients [72] when they are trained with first-order optimization techniques. One solution proposed to deal with this problem is the Long Short-Term Memory (LSTM) [73]. It can be interpreted as a specific function f that is computed in the following way:


\[
\begin{pmatrix} f_t \\ i_t \\ o_t \\ g_t \end{pmatrix} = W_{xh} x_t + W_{hh} h_{t-1} \qquad (3.12)
\]

\[
c_t = \sigma(f_t) \odot c_{t-1} + \sigma(i_t) \odot \tanh(g_t) \qquad (3.13)
\]

\[
h_t = \sigma(o_t) \odot \tanh(c_t) \qquad (3.14)
\]

Here σ is the logistic sigmoid function and ⊙ the element-wise multiplication. Considering input dimension n and hidden dimension m, this leads to W_xh ∈ R^{4m×n} and W_hh ∈ R^{4m×m}, which are model parameters that are initialized randomly and then learned from data. h_0 and c_0 correspond to the initial states, which are usually filled with zeros.

LSTMs introduce several changes compared to standard RNNs. The main component of the LSTM is the additive update of the cell state c_t, which allows for an easy flow of information (in both forward and backward passes). Furthermore, they allow for fine-grained manipulation of the cell state via different gates. The input gate i_t controls how much new information is added, the forget gate f_t determines how much of the current information is dismissed, and the output gate o_t regulates which parts can leave the LSTM. For a more detailed explanation we refer the reader to [68].
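A numpy sketch of a single step of Equations 3.12 to 3.14 (biases are omitted, as in the equations above); the dimensions and random weights are arbitrary example values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_xh, W_hh):
    """One LSTM step: returns the new hidden state h_t and cell state c_t."""
    gates = W_xh @ x_t + W_hh @ h_prev                          # Eq. 3.12, shape (4m,)
    f_t, i_t, o_t, g_t = np.split(gates, 4)
    c_t = sigmoid(f_t) * c_prev + sigmoid(i_t) * np.tanh(g_t)   # Eq. 3.13
    h_t = sigmoid(o_t) * np.tanh(c_t)                           # Eq. 3.14
    return h_t, c_t

rng = np.random.default_rng(0)
n, m = 3, 5                                   # input and hidden dimensions
W_xh = 0.1 * rng.normal(size=(4 * m, n))
W_hh = 0.1 * rng.normal(size=(4 * m, m))
h, c = np.zeros(m), np.zeros(m)               # initial states filled with zeros

for x_t in rng.normal(size=(7, n)):           # a short input sequence
    h, c = lstm_step(x_t, h, c, W_xh, W_hh)
print(h)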

3.2.3 Gradient-Based Learning

So far we have discussed how our models are defined, but not yet how they are adjusted based on a data set (in the supervised setting) or on experience (in the RL setting). In neural networks this learning is most commonly performed through iterative, gradient-based optimization.

For this procedure the user specifies a task-dependent loss function L(θ), which is used to evaluate the current prediction ŷ^(i) of the i-th sample against its ground truth y^(i). For example, if we are faced with a regression task, the mean squared error (MSE) is a sensible choice of loss function:

\[
\begin{aligned}
L(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \left\lVert \hat{y}^{(i)} - y^{(i)} \right\rVert_2^2 \\
&= \frac{1}{m} \sum_{i=1}^{m} \left\lVert f(x^{(i)}; \theta) - y^{(i)} \right\rVert_2^2 \qquad (3.15)
\end{aligned}
\]

where f denotes the function implemented by the chosen model with parameters θ. This process of evaluating the predictions of a neural network is often referred to as the forward pass.

In order to employ gradient-based optimization techniques, we need to obtain the gradients of the defined loss w.r.t. the parameters of the model, ∇_θ L(θ). If all functions in our model are differentiable, these can be obtained with the backpropagation algorithm [74]. Backpropagation essentially boils down to an application of the chain rule, which can be applied recursively to obtain the derivatives w.r.t. the parameters of each layer in a neural network. This procedure is also referred to as the backward pass.

Figure 3.4: Left: an RNN with one hidden layer and activation function f. Right: the equivalent network unfolded over time. All components in the same row share the same set of weights.

In the plain version of stochastic gradient descent an update of the parameters is performed by adapting the parameters in the direction that minimizes the loss function:

\[
\theta = \theta - \alpha \nabla_\theta L(\theta) \qquad (3.16)
\]

where α is the step size or learning rate.

In practice this update is performed over a mini-batch of samples and the procedure is repeated until convergence. It is also common to employ more sophisticated optimizers that include momentum terms and adaptive learning rates, such as RMSprop [75] or Adam [76].
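A minimal sketch of Equations 3.15 and 3.16: mini-batch gradient descent on the MSE loss of a linear model, with the gradient written out by hand instead of using backpropagation. The synthetic data set is an assumption of this example:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                  # inputs x^(i)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)    # targets y^(i)

theta = np.zeros(3)                            # model parameters
alpha = 0.1                                    # learning rate

for step in range(200):
    batch = rng.choice(len(X), size=32, replace=False)
    y_hat = X[batch] @ theta                                   # forward pass
    grad = 2.0 / len(batch) * X[batch].T @ (y_hat - y[batch])  # dL/dtheta for the MSE
    theta = theta - alpha * grad                               # Eq. 3.16
print(theta)                                   # approaches true_w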

It may not be directly clear how to apply backpropagation to RNNs, due to their cyclic connections. The most prominent approach to calculate the gradients w.r.t. the parameters of an RNN is backpropagation through time (BPTT) [77]. It assumes that the weights remain constant while training on a sequence of inputs. Due to this, the RNN can be unfolded as shown in Figure 3.4. This leads to a feed-forward network with lateral connections, to which we can apply standard backpropagation. When dealing with long sequences it is common to truncate the BPTT procedure. In these cases BPTT is performed every k steps for m steps (usually k = m).

The learning methods from this section can also be applied in a straightforward fashion to the RL setting. For this we replace the supervised targets y with estimates of value functions, which are obtained through either Monte-Carlo (Section 3.1.2) or Temporal-Difference (Section 3.1.3) methods.

3.2.4 Attention Mechanisms

Attention mechanisms have recently gained popularity in sequence-to-sequence modelling tasks [78]. Their core component is a trained attention module that decides on which part of the representation it wants to focus. In the context of this thesis we will employ an attention mechanism to define suitable goal spaces. Therefore we will briefly review the fundamental concepts in this section.

The attention module outputs a vector a_t ∈ R^n, to which we refer as the attention weights. The attention weights describe how much attention we would like to pay to certain parts of a given input. Usually we define the attention weights such that they sum to one, which is achieved through the softmax function. Given an input that consists of k vectors of size n, X_t ∈ R^{k×n}, we define the attention mechanism as:

\[
z_t = X_t a_t = X_t \, \frac{e^{f_{att}(X_t, h_{t-1})}}{\sum_i e^{f_{att}^{(i)}(X_t, h_{t-1})}} \qquad (3.17)
\]

where f_att : R^{k×n} → R^n is a (potentially recurrent) neural network and f_att^(i) indicates the i-th element of its output.

The variant in Equation 3.17 is commonly referred to as soft attention. It has the advantage that it is fully differentiable, meaning that it can be plugged into any existing architecture and be trained end-to-end with the rest of the network.

Soft attention attends to all parts of the input to varying degrees. In some cases we would like a behaviour that makes sharper decisions about which parts to attend to. This technique is referred to as hard attention [79, 80]. To obtain such behaviour we add an additional sampling step from the categorical distribution defined by the softmax in Equation 3.17. This means that a_t becomes a one-hot vector representing the sampled class, and thus the module will attend to only a specific part of the input while ignoring the rest.

Using hard attention has the downside that we cannot derive gradients through the attention module analytically anymore. However, we can still learn its parameters through RL methods, such as REINFORCE or its variance-reduced modifications.
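The following numpy sketch implements the soft attention of Equation 3.17 and the sampling step used for hard attention; the fixed random linear map standing in for f_att, as well as the chosen dimensions, are assumptions made purely for illustration:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
k, n = 5, 4
X_t = rng.normal(size=(k, n))            # input: k vectors of size n
h_prev = rng.normal(size=6)              # recurrent state of the attention module

# Stand-in for f_att(X_t, h_{t-1}); any (recurrent) network could be used here.
W = 0.1 * rng.normal(size=(n, k * n + 6))
scores = W @ np.concatenate([X_t.ravel(), h_prev])

a_t = softmax(scores)                    # soft attention weights, summing to one
z_t = X_t @ a_t                          # Eq. 3.17

# Hard attention instead samples a one-hot a_t from the same distribution.
a_hard = np.eye(n)[rng.choice(n, p=a_t)]
z_t_hard = X_t @ a_hard
print(a_t, z_t, z_t_hard)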


3.3 Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning describes a set of methods that attempt to extend standard RL agents with some form of hierarchical structure. We have already mentioned some of these techniques in Chapter 2.

Learning good temporal abstractions is one of the key challenges in RL. It not only allows the task to be decomposed into sub-components that are easier to solve, but also enables the reuse of already learned solutions, potentially speeding up the training process by orders of magnitude.

In this section we will review the options framework [25], which is a popular choice for incorporating temporal abstractions into existing RL methods. Furthermore, we outline Universal Value Function Approximators (UVFA) [30], which allow knowledge to be transferred between different options. Together these techniques form the core of the approach proposed in the following chapter.

3.3.1 Options Framework

Options [25] extend the agent's available behaviour through a set of temporally extended actions. Instead of choosing a primitive action, the agent is allowed to select an option from a set Ω, which is then followed until a termination criterion is met. Examples of options include grasping an object, opening a door, and also the agent's primitive actions.

Mathematically, an option ω is defined as a triple (I_ω, π_ω, β_ω), where I_ω ⊆ S is the initiation set, β_ω : S → [0, 1] the termination condition and π_ω : S → A the intra-option policy. For simplicity it is often assumed that all options are available everywhere, i.e. I_ω = S for all ω.

A set of options defined over an MDP constitutes a semi-MDP [25]. As the theory of semi-MDPs is closely related to that of MDPs, we can carry over many theoretical guarantees when dealing with options. The framework itself does not, however, define how suitable options arise. We have already reviewed several extensions that strive to discover options from experience in Chapter 2.

It is intuitive to think of options as a procedure for achieving a sub-goal, but it may not be immediately clear how such behaviour emerges. Given that we know the set of suitable sub-goals, we can assign an option to each of these, along with an intrinsic reward function. The intrinsic reward should assign high values for achieving the sub-goal and low values otherwise. If we now learn the intra-option policies based on the intrinsic reward functions, we already obtain a type of goal-directed behaviour.
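As an illustrative sketch (not the implementation used in this thesis), a sub-goal can be wrapped into an option by pairing an intra-option policy with a termination condition and an intrinsic reward that is high exactly when the sub-goal is achieved; the distance threshold is an assumption of this example:

import numpy as np

class SubGoalOption:
    def __init__(self, goal, intra_option_policy, threshold=0.1):
        self.goal = np.asarray(goal)          # sub-goal in goal space
        self.policy = intra_option_policy     # pi_omega: state -> action
        self.threshold = threshold

    def terminates(self, state):
        """Termination condition beta_omega(s): the sub-goal has been reached."""
        return np.linalg.norm(np.asarray(state) - self.goal) < self.threshold

    def intrinsic_reward(self, state):
        """High value for achieving the sub-goal, low value otherwise."""
        return 1.0 if self.terminates(state) else 0.0

option = SubGoalOption(goal=[1.0, 0.0], intra_option_policy=lambda s: -np.asarray(s))
print(option.intrinsic_reward([0.95, 0.02]), option.terminates([0.5, 0.5]))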

3.3.2 Universal Value Function Approximators

The interpretation of options as sub-goals provided in the last paragraph would require learning separate value functions V_g(s) for each possible sub-goal g. Learning many independent value functions can be slow and does not scale well to settings with a large number of possible sub-goals.

To circumvent these issues, Universal Value Function Approximators (UVFA) [30] approximate the union of value functions for all goals with V(s, g; θ). In practice this means that the goal is provided as part of the input to the value function approximator. In the simplest case this is realized through concatenation of state and goal. UVFAs are able to exploit additional structure in the goal space and thus enable faster learning and better generalization. Furthermore, they scale easily to large or even uncountable sets of goals.
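A sketch of the simplest UVFA construction described above: the goal is concatenated to the state and fed through a single value network V(s, g; θ). The tiny linear "network" below is only a placeholder for a real deep network:

import numpy as np

rng = np.random.default_rng(0)
state_dim, goal_dim = 6, 2
theta = 0.1 * rng.normal(size=state_dim + goal_dim)   # parameters of V(s, g; theta)

def universal_value(state, goal):
    """One value function shared across all goals instead of one V_g per goal."""
    return float(theta @ np.concatenate([state, goal]))

s = rng.normal(size=state_dim)
print(universal_value(s, np.array([1.0, 0.0])),
      universal_value(s, np.array([0.0, 1.0])))        # same network, different goals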

Chapter 4

Methods

4.1 Hierarchical Reinforcement Learning with Goals

In this section we extend the RL framework to include an explicit goal-directed mechanism. For this the agent's internals are separated into two parts – a goal-setting policy π_S, which is responsible for setting goals, and a goal-reaching policy π_R, which has to reach the goals given by the goal-setting policy¹. An illustration describing the concept is given in Figure 4.1. The model is constructed in such a way that both parts can be learned simultaneously through arbitrary RL methods. We will, however, argue later in this chapter that certain choices are more sensible than others. Note that similar architectures have been applied in prior work [21, 22, 81]. The relation to these methods will be further discussed in Section 4.2.

The goal-setting agent operates on some continuous goal space G ⊆ R^k. In the most general case it learns a function φ : S → R^k that maps from the current state s_t to the current activation in goal space z_t. We will also refer to φ(s_t) as the goal embedding in the following. In the initial state, and whenever the current goal has been reached, a new goal is selected. We interpret setting goals as actions of the goal-setting policy, represented through π_S : G → R^k. This allows standard RL techniques to be applied to the goal-setting policy. The goal-setting agent's actions last over multiple time steps and its rewards are defined as the sum of external rewards from the environment along the path towards the goal. The goal-reaching policy π_R, on the other hand, acts through the agent's atomic actions at each time step and is rewarded for reaching the goal set by the goal-setting policy. The complete method is outlined in Algorithm 2.

In the following we will first highlight possible advantages compared to standard RL approaches, illustrated with an idealized example. We describe specific design choices – for both the goal-setting and goal-reaching policies – used during the experiments in this report. We then provide a connection to the options framework, which provides some theoretical grounding for the approach. Finally we highlight connections and differences to related approaches.

¹ Other works also refer to these two policies as high- and low-level policy or as manager and worker.


Figure 4.1: Illustration of the goal-based framework proposed in this thesis. Detailed infor-

mation can be found in the corresponding subsections of this chapter.

4.1.1 Motivation

To motivate the presented approach, we highlight its advantages with a short example. Consider an ant-like insect interacting in an environment with sparse rewards. The agent is controlled through the manipulation of forces. Traditional approaches initialize the agent with random actions and improve those actions upon receiving rewards. However, optimizing behaviour with such methods is impossible if random actions never lead to rewarding cases. Even if a sequence leads to a rare reward scenario, it might be hard to assign credit to the actions that actually caused the reward. Introducing a hierarchy as described above provides one possible approach to these problems and allows the agent to learn meaningful behaviour even in the absence of rewards.

To be more specific, take the ant agent mentioned above, which has to solve a labyrinth task. The agent is only rewarded upon reaching a predefined final position. Applying standard RL algorithms will lead to failure, because random exploration in the atomic action space will never cause rewarding states. However, if a goal-setting policy were able to set goals based on the location of the agent, it would have the potential to set way points, which can simplify the search for a good policy. The goal-reaching policy would then be concerned with reaching these way points. This makes it easier to obtain initial rewards and continue learning from there.


Ideally we learn the desired goal space from experience; however, in some domains we can also consider the simpler case of using pre-designed embeddings and only learn transitions in the given space. Learning the goal embedding greatly increases the generality of the procedure, at the cost of introducing additional complexity. We will investigate both cases in later sections of this report.

repeat
    t = 1;
    Reset goal;
    Get initial state s1 and goal embedding z1;
    repeat
        if no goal then
            Set goal g based on πS and zt;
        end
        Perform action at based on g, st and πR;
        Get next state st+1, embedding zt+1 and extrinsic reward rt;
        Calculate rRt based on zt and zt+1;
        Update rSt based on rt;
        if πS update required then
            Update πS based on experience;
        end
        if πR update required then
            Update πR based on experience;
        end
        if goal reached or gmax steps passed then
            Reset goal;
        end
    until terminated;
until converged;

Algorithm 2: Pseudo-code for the goal-based framework proposed in this section.
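To make the control flow concrete, the following Python sketch mirrors the interaction loop of Algorithm 2. The interfaces (env, phi, goal_setter, goal_reacher) and the threshold of 0.1 are illustrative placeholders rather than the thesis implementation, and the updates of both policies are left to the chosen RL method:

    # Sketch of the interaction loop of Algorithm 2 (hypothetical interfaces).
    import numpy as np

    def run_episode(env, phi, goal_setter, goal_reacher, g_max=200, threshold=0.1):
        state = env.reset()
        goal, goal_steps, episode_return = None, 0, 0.0
        done = False
        while not done:
            z = phi(state)                                       # current activation in goal space
            if goal is None:                                     # initial state or previous goal finished
                goal, goal_steps = goal_setter(z), 0
            action = goal_reacher(state, goal)
            state, extrinsic_reward, done, _ = env.step(action)
            z_next = phi(state)
            intrinsic_reward = -np.linalg.norm(z_next - goal)    # stored for the goal-reaching update
            episode_return += extrinsic_reward                   # accumulated for the goal-setting update
            goal_steps += 1
            if np.linalg.norm(z_next - goal) < threshold or goal_steps >= g_max:
                goal = None                                      # reset goal when reached or g_max steps passed
        return episode_return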

4.1.2 Goal-Setting Policy

The goal-setting agent consists of a function φ, mapping to a continuous goal space, and a policy πS. Both of these components are implemented through deep networks. The policy πS becomes active at the beginning of an episode or whenever the current goal has just been reached. It takes the output of the goal embedding φ as input and outputs a goal g, which is then passed to the goal-reaching policy.

Using continuous goal spaces allows us to backpropagate through all parts of the model. This means that all internal representations, including the goal space, can be learned in an end-to-end manner. Using a discrete set of goals, on the other hand, would require a sampling process inside the model, through which we cannot readily pass errors backwards (though there are techniques for doing so, see for example [82]).

We may consider constant horizons for each goal or terminate a goal immediately when it is reached. Here we choose to focus on the second option, even though the first one might have its own advantages. We denote the maximum time-span a goal is active in the following as gmax. The reward of the goal-setting policy is based on the extrinsic reward of the environment along the taken path towards the goal, defined as:

r^S_t = \sum_{n=1}^{k} r_{t+n}    (4.1)

where k is bounded from above by gmax.

We have now defined all components of the goal-setting MDP that are required to learn its policy. This means we can directly apply any RL algorithm to this problem. Here we decide to employ A3C (Section 3.1.5) as the method of choice.

It is worth mentioning that the state-transition distribution of the goal-setting MDP, p_S, depends on the behaviour of the goal-reaching policy. This means that there might be cases in which goals are not reached. In order to encourage learning based on the transitions actually produced by the goal-reaching policy, we propose a modified version of the update rule from Equation 3.8:

\nabla_{\theta_S} \log \pi_S(z_{t+k} \mid s_t; \theta_S) \, (R_t - V^{\pi_S}(s_t; \theta_S))    (4.2)

where g_t is replaced with the reached location in goal space z_{t+k}.

The goal-setting policy is, by definition, exposed to less data than the goal-reaching policy, as it operates at a slower time-scale. This could make learning challenging. We will later discuss methods that could enable more data-efficient learning.
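A minimal PyTorch sketch of this modified update is shown below. It assumes a Gaussian goal-setting policy with a fixed standard deviation; mean_net and value_net are hypothetical networks and the exact A3C machinery (n-step returns, entropy bonus, shared optimizer) is omitted:

    # Sketch of the update of Equation 4.2: the log-probability is taken at the
    # reached goal-space location z_{t+k} instead of the goal that was set.
    import torch

    def goal_setting_loss(mean_net, value_net, s_t, z_reached, R_t, sigma=0.1):
        dist = torch.distributions.Normal(mean_net(s_t), sigma)
        log_prob = dist.log_prob(z_reached).sum(-1)          # log pi_S(z_{t+k} | s_t)
        advantage = R_t - value_net(s_t).squeeze(-1)
        policy_loss = -(log_prob * advantage.detach()).mean()
        value_loss = advantage.pow(2).mean()
        return policy_loss + 0.5 * value_loss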

4.1.3 Goal-Reaching Policy

The goal-reaching policy πR is, in contrast to the goal-setting policy, active at every time step and responsible for selecting the agent's primitive actions. Let n denote the nth step after a new goal was set at time-step t. At each time-step t + n the agent receives information about the currently active goal (for example as relative or absolute values). In practice we just pass this information as an additional input to the policy and value networks. This process is a form of UVFA, as described in Section 3.3.2, where the set of possible goals is a continuous vector space. Therefore the agent is able to generalize between all possible goals.

The goal-reaching policies in this report are learned with the A3C algorithm as described in Section 3.1.5, where policies are represented through deep networks. Network architectures will be described in more detail in later parts of this report. The goal-reaching policy adapts its behaviour to maximize an intrinsic reward r^R_{t+n}, which is based on some distance measure d between the current position in goal space z_{t+n} and the active goal g_t:

r^R_{t+n} = -d(z_{t+n}, g_t)    (4.3)

where the choice of d can be adapted to the given scenario. In practice we can also combine this reward with the extrinsic reward r_{t+n} from the environment. A similar technique, based on a linear combination of both terms, was shown to be beneficial in [22]. Here we will simply add all control-based rewards (if available) from r_{t+n} to r^R_{t+n}.


The control-based rewards can include penalties for large actions, friction-based terms and survival bonuses.

Name           Domain          Actions     Goals                 Learning            UVFA
HASSLE [23]    Navigation      discrete    ARAVQ                 Advantage Learning
Horde [39]     Mobile Robots   discrete    Demons                GQ(λ)
h-DQN [21]     ATARI           discrete    Entities & Relations  DQN
FuN [22]       ATARI, DeepLab  discrete    Learned               A3C*
DeepLoco [24]  Bullet          continuous  Footsteps             Actor-Critic
ours           MuJoCo          continuous  Learned               A3C

Table 4.1: An overview of methods that use similar techniques as the ones proposed in this thesis. *Note that FuN also uses a modified policy gradient variant in the goal-setting policy.

4.1.4 Interpretation as Options

We can also view this goal-based approach in the light of the options framework of Section 3.3.1. Let us first assume that we have a discrete set of goals, each of which is learned with an independent intra-option policy πω. If we now set the initiation set Iω to S, and βω to 1 if the goal is reached or gmax steps have passed and to 0 otherwise, we obtain a complete, well-defined set of options Ω. However, we use the defined intrinsic reward r^R (Equation 4.3), instead of the extrinsic reward r, to learn the intra-option policies.

If we now represent all policies in the set of options Ω with a UVFA (Section 3.3.2), we are able to share information between possible options, while simultaneously decreasing the number of adjustable parameters. This also enables the use of an infinite set of goals, such as coordinates on a map or activations of a deep network.

4.2 Connections to Related Approaches

There is a small set of prior work that directly considered models similar to the ones proposed in this thesis. We would like to take this section to state connections to these approaches more explicitly. We are aware of five prior approaches that employ some form of hierarchy to learn goal-directed behaviour in the RL setting. We highlight similarities and differences to the method proposed here in Table 4.1.

Inspecting Table 4.1 we note that only one other approach (FuN) is able to learn suitable goal spaces from interaction with the environment. However, FuN has only been applied to settings with discrete actions (although extensions to the continuous case are possible). DeepLoco represents the only other model that employed a goal-based technique for continuous control, but it is less general, as it includes a range of prior assumptions.

Chapter 5

Experiments

Next we will examine the methods proposed in the last chapter on a set of continuous control problems of increasing difficulty. For this we will consider a simple point-mass model and several challenging tasks based on the MuJoCo physics simulator [18]. More specifically we aim to control the following agents:

• Point-Mass: Here we control the acceleration of a simple point-mass model. The 4-dimensional observation space consists of the current position and velocity. We restrict both the acceleration and velocity to be within a fixed range.

• Half-Cheetah [83]: A planar biped robot with 9 rigid links. Its 20-dimensional observation space includes information about the center of mass, joint angles and joint velocities. It is controlled through the manipulation of 6 actuated joints.

• Ant [16]: An ant-like insect with 13 rigid links. The 125-dimensional observation space contains the center of mass, joint angles, joint velocities, a (usually sparse) vector of contact forces and the rotation matrix for the body. Control is performed via 8 actuated joints. Further, it is possible for the robot to die, i.e. to fall over and get into a configuration from which it cannot recover.

The latter two of these descriptions have been partially adopted from [19]. Figure 5.1 includes visualizations of the described agents. These tasks are appealing because controlling the agent's actions is challenging even if we only consider the simple scenario of moving in one direction, as is done in common benchmarks [19]. Yet it is easy to incorporate these agents into scenarios with sparse rewards, such as navigation through a labyrinth or gathering objects. Employing such a setup allows us to stretch the capabilities of both the goal-setting and goal-reaching policies. On the other hand, we can often reason about the structure of a suitable goal space and thus incorporate prior knowledge into the problem. This comes in handy because it allows us to first attempt the easier task, where the goal space is known, and then continue to cases where it is learned end-to-end. Additionally we have included the point-mass model, as it allows for a faster evaluation of the given methods.

The experiments in this section are structured based on the three components of goal-directed behaviour outlined in Chapters 1 and 4. We aim to build an incremental solution to the full problem, starting with easier settings.


Figure 5.1: Visualizations of the agents used in this report. Left: Half-Cheetah. Right: Ant.

For detailed descriptions, see text. Picture taken from the OpenAI gym [84] web-page.

We first consider the case in which the goal-setting policy generates random goals in a fixed, known goal space, i.e. 1- or 2-dimensional locations on a map. This allows us to examine the settings that are important to obtain reliable goal-reaching behaviour. In this case there is no learning involved in the goal-setting policy. We then extend the control of the agent with a simple gathering scenario and aim to learn task-dependent goals based on environment rewards. For this we first consider the simpler case of fixed, known goal spaces, followed by investigating whether these can also be learned from experience.

All experiments are implemented using PyTorch [85] version 0.1.6. OpenAI gym [84] is used to access the MuJoCo environments. The point-mass experiments have been performed on an Intel i5-7200U, and the half-cheetah and ant experiments on an Intel i5-6600K. Note again that A3C is an algorithm optimized for CPUs, thus no GPUs are required.


5.1 Randomized Goals with Fixed Embeddings

In our first set of experiments we aim to learn to control an agent towards a random goal. As for all experiment sections, we will proceed as follows. The first subsection will introduce the task, followed by a description of all parameters and design choices used in the experiment. We will then present the obtained results and conclude the section with a short analysis of those.

5.1.1 Task

In both cases we aim to control the agent, either half-cheetah or ant, towards a provided location on the map. This can be interpreted within the framework of Chapter 4 in the following way:

• φ: Fixed (not trainable) goal embedding corresponding to the location on the map

• πS: Random goals (not trainable) with given radius d

• πR: Learned goal-reaching policy towards the random goal

For the half-cheetah task we employ one-dimensional goals, as the agent can only move forwards or backwards. For the ant task we use two-dimensional goals. In both cases the randomized goal is placed five units away from the current position. At the start of an episode, and whenever a goal was reached, we randomize a new goal. The threshold for reaching the goal is set to 0.1 and all distances are calculated with the L2 norm. Goals are discarded if they are not reached within 200 steps and our objective is to reach as many goals as possible within a given time frame of 1000 steps.
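The goal logic of this task can be summarized in a few lines; the following sketch is illustrative only (the function names are ours, not the thesis code):

    # Goal sampling and termination check for the randomized-goal task.
    import numpy as np

    def sample_goal(position, radius=5.0):
        """Place a random goal `radius` units away from the current position."""
        direction = np.random.randn(*np.shape(position))
        direction /= np.linalg.norm(direction)
        return np.asarray(position) + radius * direction

    def goal_finished(position, goal, steps, threshold=0.1, max_steps=200):
        """A goal ends when it is reached (L2 distance) or after max_steps."""
        return np.linalg.norm(np.asarray(position) - goal) < threshold or steps >= max_steps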

5.1.2 Experimental Details

We define the intrinsic rewards of the goal-reaching policy (Equation 4.3) as follows:

r^R_t = \alpha \, [d(z_t, g) - d(z_{t+1}, g)] + c_t    (5.1)

given a currently active goal g. Here α is a scaling factor, which we set (as done in the original task) to the inverse of the environment's frequency. z_t corresponds to the goal embedding of the current state s_t and c_t are the control-based terms of the original extrinsic rewards. Throughout all experiments shown here, d corresponds to the Euclidean distance. Note again that the goal embedding is not learned in this section; it always corresponds to the current location of the agent on the map.

In initial experiments we found that using differences in distances can be beneficial in some cases. For example, consider the following scenario. The agent sets itself a goal and gets rewarded for being close to it. Now, if it just stays outside of the reaching threshold, it will get high rewards for a long time without ever reaching the goal, which is not the desired behaviour. Using rewards based on the direction of movement towards the goal circumvents this issue and allows the agent to reach more goals in the same amount of time.
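A minimal sketch of this reward, directly following Equation 5.1 (the argument names are illustrative):

    # Difference-of-distances intrinsic reward: positive when the agent moves
    # towards the goal, negative when it moves away from it.
    import numpy as np

    def intrinsic_reward(z_t, z_next, goal, alpha, control_terms=0.0):
        d_before = np.linalg.norm(np.asarray(z_t) - goal)
        d_after = np.linalg.norm(np.asarray(z_next) - goal)
        return alpha * (d_before - d_after) + control_terms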

In all cases we keep separate networks for the actor and the critic, as we found that this setting converges faster. Each of these networks consists of one fully-connected layer with ELU [64] activation function, an LSTM and another fully-connected layer mapping to the output.

Parameter                 Value
Hidden Size               32, 64, 128, 256
Learning Rate             3 × 10^-4
β                         0.01
γ                         0.99
Tmax                      50
Number of Agents          16
Maximum Variance          0.1
Episode Length            1000
Maximum Goal Steps        200
Reward Scale              0.1
Gradient-Norm Clipping    40.0

Table 5.1: Summary of all hyper-parameters in Task 1.

The outputs of the policy network define the parameters of a multidimensional normal distribution with a spherical covariance. To ensure that the variances are within a suitable range, we pass them through a sigmoid activation function and scale them with a constant. All actions are scaled between -1 and 1, and the outputs of the last actor layer are passed through a tanh activation function. Both networks are initialized with the PyTorch default initialization scheme.
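The following PyTorch sketch shows an actor network of this kind. It is illustrative rather than the exact thesis implementation: the hidden size and the variance cap are placeholders, and the relative goal position is assumed to be already appended to the input (the UVFA trick described below):

    # Actor: fully-connected layer with ELU, an LSTM cell, and a head producing
    # the mean and a bounded variance of a spherical Gaussian over actions.
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        def __init__(self, input_dim, action_dim, hidden=128, max_var=0.1):
            super().__init__()
            self.fc = nn.Sequential(nn.Linear(input_dim, hidden), nn.ELU())
            self.lstm = nn.LSTMCell(hidden, hidden)
            self.mean = nn.Linear(hidden, action_dim)
            self.var = nn.Linear(hidden, 1)
            self.max_var = max_var

        def forward(self, x, hidden_state=None):
            h, c = self.lstm(self.fc(x), hidden_state)
            mean = torch.tanh(self.mean(h))                   # actions scaled to [-1, 1]
            var = self.max_var * torch.sigmoid(self.var(h))   # variance kept in a suitable range
            return mean, var, (h, c)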

We use A3C (Section 3.1.5) to train the goal-reaching policy πR. To update the parameters we adapt the shared RMSprop [75] variant of the original A3C paper [5]. Gradients of the recurrent module are obtained via truncated BPTT (Section 3.2.3) on models that are iteratively unrolled for Tmax steps. Relative goal positions are appended to the input, which is a form of UVFA (Section 3.3.2).

In this section we are interested in finding a suitable model size; thus we vary the number of hidden units among 32, 64 and 128 for the half-cheetah task and among 64, 128 and 256 for the more complex ant task. All hyper-parameters are summarized in Table 5.1.

5.1.3 Results

As already mentioned, we measure success by the number of goals reached within 1000 time steps. Figure 5.2 shows the obtained results for the half-cheetah task after 60 minutes of training, and Figure 5.3 shows the results for the ant task after 24 hours of training.

We also visualize the learned behaviour from the ant experiment in various forms in Figure 5.4. These plots highlight that in both tasks the desired behaviour has been learned once training has finished.


[Plots: "Intrinsic Rewards Over Time" and "Reached Goals Over Time" against training hours, for hidden sizes 32, 64 and 128.]

Figure 5.2: Left: Rewards over the course of a training session for the half-cheetah task for

different hidden sizes. Different colors correspond to the number of hidden units used.

Right: Number of goals reached per episode in the same training session.

[Plots: "Intrinsic Rewards Over Time" and "Reached Goals Over Time" against training hours, for hidden sizes 64, 128 and 256.]

Figure 5.3: Left: Rewards over the course of a training session for the ant task. Right: Number of goals reached per episode in the same training session. Different colors correspond to the number of hidden units used.

5.1.4 Discussion

In the experiments of this section we have been interested in properties of the goal-reaching policy. For this we trained an agent to reach fixed, randomized goals in a given goal space. We have shown that it is possible to reach these goals with the help of an intrinsic reward function. To achieve this we learn just a single policy for all goals, instead of an independent one for each goal. This is enabled through the use of UVFAs, in the form of passing the goal as an additional input.

The method is applicable to a range of tasks, such as the control of the half-cheetah and ant studied here. Furthermore, it is stable for a range of different hidden sizes. In Figure 5.2 we observe that using networks with too large a capacity can lead to difficulties during training. In this case the model with a hidden size of 128 shows a large standard deviation, as one of the runs failed to find a solution. We can make a similar observation in Figure 5.3, where the largest network converges significantly slower than the others.

We also noticed in preliminary experiments that adding more goal dimensions significantly increases the time it takes to learn the intended behaviour. For example, around five hours are required to control the ant towards a one-dimensional goal, while it takes up to 20 hours in the two-dimensional case. The techniques explored in this set of experiments will also be an important component in the following experiments.

Figure 5.4: Visualization of the learned behaviour for the ant task. Left: A sample trajectory on the map over the course of one episode. Dots represent the set goals, and the lines show the movement of the agent. Right: Value function for a given goal plotted over the location of the agent. One can clearly see that it rates states closer to the goal higher.


5.2 Learned Goals with Fixed Embeddings

As a next step we aspire to learn the goal-setting policy πS jointly with the goal-reaching policy πR, instead of just choosing random goals. For this we will set up a task where we can design a suitable goal space by hand. We leave the problem of learning goal spaces from experience to the next section.

5.2.1 Task

In this section we consider a modified version of the gather task of [19]. For this, a number of apples and bombs are placed randomly on the map. The agent receives a reward of 1 for collecting an apple, -1 for a bomb and 0 in all other cases. In contrast to the original task, we do not place boundaries around the map. The agent gets sensor information about objects in the form of relative, global coordinates. The task is solved if all apples are collected without running into a bomb. A schematic illustration of the task is shown in Figure 5.5 and all parameters for the experimental setup are outlined in Table 5.2.

Figure 5.5: Illustration of the gather task in this section. The agent in this case is the ant, as

shown in the middle of the scene.

For all experiments we employ the previously described point-mass and ant agents. We use the point-mass model as it allows for faster testing of the goal-setting policy's properties. The task stands out due to its sparse rewards, which are unlikely to be encountered with random exploration, and thus it provides an ideal test bed for our approach. We define the goal embedding as the relative position of the closest apple with respect to the agent's current position on the map. This leads to the following interpretation in terms of the framework of Chapter 4:

• φ: Fixed (not trainable) goal embedding given by the relative position of the closest apple

• πS: Learned goals (i.e. actions) of the goal-setting policy

• πR: Learned goal-reaching policy towards the goal

Parameter                              Value
Object Placement Range                 [-10.0, 10.0)
Minimum Initial Distance From Agent    3.0
Pickup Radius                          0.1
Apples                                 5
Bombs                                  5
vmax (Point-Mass)                      0.1
amax (Point-Mass)                      0.1

Table 5.2: Summary of all experimental parameters in Task 2.

Parameter                 Value
Hidden Size               16
Learning Rate             3 × 10^-4
β                         -0.3
γ                         0.9
Tmax                      5
Number of Agents          16
Initial Variance          4
Episode Length            400 (Point-Mass), 1000 (Ant)
Maximum Goal Steps        20 (Point-Mass), 100 (Ant)
Reward Scale              1.0
Gradient-Norm Clipping    40.0

Table 5.3: Summary of all hyper-parameters of the goal-setting policy in Task 2.

5.2.2 Experimental Details

The settings for the goal-reaching policy are exactly as in the last section, including the choice of the goal space and the intrinsic reward function (Equation 5.1); see Table 5.1 again for the selected hyper-parameters. As for the goal-reaching policy, we also use A3C to update the goal-setting policy. The parameters of the goal-setting A3C are described in Table 5.3. The extrinsic reward is defined, as per Equation 4.1, as the sum of the rewards of the environment along the taken path towards the current goal.

In the experiments of this section we are mainly interested in investigating two different settings. In the first we identify whether changing the update rule (Equation 4.2) to consider reached states instead of goals has an effect on the behaviour. A motivation for this is provided in Chapter 4. In addition to that, we want to see what impact the architecture of the goal-setting policy has. Given the choice of the embedding space, it is possible for a linear policy to obtain good results in this task. This is because the zero action will always set a goal at the closest apple. However, a linear policy might not be able to solve the task anymore if we restrict it to reachable goals.

Other changes are summarized in the following. Compared to the goal-reaching policy, the goal-setting policy uses lower values of Tmax (5 vs. 50) and γ (0.9 vs. 0.99). These changes are made because the goal-setting policy is updated less frequently. Both the policy and value networks of the goal-setting policy receive only the embedding as input. We can view this as sharing all layers that compute the embedding, i.e. φ.

[Plots: "Extrinsic Reward Over Time" (goals, states, A3C) and "Reached Goals Over Time" (goals, states) against training hours.]

Figure 5.6: Results for the point-mass experiments with a linear policy πS. We compare an A3C baseline against the goal-based framework of this thesis, which uses either gt or zt+k in the update rule. These two settings are denoted as goals and states respectively. Each setting is averaged over five runs and the standard deviation is shown in addition to the average. Left: Rewards over the course of a one-hour training session. Right: Percentage of goals reached in the same session.

We also made changes to the variance output of the goal-setting policy. Here the variance is obtained in the form of a state-independent bias vector, which is initialized to large values and decayed over time to a minimum value. This choice is due to two reasons. First, it speeds up the training process, as the network does not have to wait until the entropy bonus of Equation 3.8 kicks in. Second, it helps to stabilize the updates if we use the update rule in Equation 4.2. Instabilities in this case can arise when the reached state (for which we want to increase or decrease the probability) has too low a probability of occurring. This can easily be the case early during training, especially when the variance is low. At test time we always use the mean of the goal-setting policy to select goals.

The neural network used for the goal-setting policy πS takes the following form:

z_t \odot \sigma(f(z_t; \theta_S))    (5.2)

where ⊙ is the Hadamard product, σ the sigmoid function and θS the parameters of a two-layer network with ELU activation function. This multiplicative gating function is chosen because it makes it easier to select reachable goals.
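A minimal PyTorch sketch of this gating network (illustrative; the hidden size is a placeholder):

    # Gated goal-setting network of Equation 5.2: the goal is the embedding z_t
    # scaled element-wise by a learned gate in (0, 1), which biases the policy
    # towards nearby, and therefore more easily reachable, goals.
    import torch
    import torch.nn as nn

    class GatedGoalSetter(nn.Module):
        def __init__(self, goal_dim, hidden=16):
            super().__init__()
            self.f = nn.Sequential(
                nn.Linear(goal_dim, hidden),
                nn.ELU(),
                nn.Linear(hidden, goal_dim),
            )

        def forward(self, z):
            return z * torch.sigmoid(self.f(z))   # z_t ⊙ σ(f(z_t; θ_S))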

We compare the results against a standard A3C baseline with settings based on Table 5.1. We use a hidden size of 16 for the point-mass model and 128 for the ant model.


[Plots: "Extrinsic Reward Over Time" (goals, states, A3C) and "Reached Goals Over Time" (goals, states) against training hours.]

Figure 5.7: Results for the point-mass experiments with a policy πS based on a neural network. We compare an A3C baseline against the goal-based framework of this thesis, which uses either gt or zt+k in the update rule. We refer to these settings as goals and states respectively. Each setting is averaged over five runs and the standard deviation is shown in addition to the average. Left: Rewards over the course of a one-hour training session. Right: Percentage of goals reached in the same session.

5.2.3 Results

The maximum reward an agent can achieve in this task per episode is five. The left side of Figures 5.6 and 5.7 shows that the goal-based point-mass agent (green and blue lines) is able to solve the task reliably in all conditions after around 30 minutes of training, while the A3C baseline (red line) struggles to find a good policy.

We are also interested in the percentage of goals reached in an episode. We show this quantity on the right side of Figures 5.6 and 5.7. From these figures it can be observed that the goal-setting policy with a neural network is more likely to set goals that are reachable. While it does not seem crucial in this experiment to actually reach the desired goals in order to obtain good behaviour, it could prove meaningful when we additionally learn the goal space representation.

As a proof of concept we also run one of the settings that is able to solve the point-mass setting with the ant agent. Again the goal-based agent is able to perform well in the task, while A3C fails to find a solution; see Figure 5.8 for an illustration of the learning curves in this experiment. To highlight the learned behaviour, we also plot sample trajectories from the trained goal-based and standard A3C ant agents in Figure 5.9.

5.2.4 Discussion

In this section we have been interested in simultaneously learning a goal-setting policy, which sets goals in a predefined space, together with a goal-reaching policy, which is trained to reach these goals. In this setting we were able to solve a challenging gathering task with long dependencies and sparse rewards. Note that this task is difficult for standard RL methods and that the A3C baseline, to which we compare here, was not able to learn any meaningful behaviour on it.

We notice that (with the given goal embedding) it is possible to solve the task with linear (Figure 5.6) and non-linear policies (Figure 5.7). The non-linear policy is furthermore able to set reachable goals more reliably. We expected that using the reached state, instead of the goal, in the update rule (Equation 4.2) would further improve the ability to set reachable goals. However, as Figures 5.6 and 5.7 show, this was only observed to a small extent. As seen in Figure 5.8, the approach scales readily from the point-mass model to complex agents such as the ant of the MuJoCo benchmark.

[Plot: "Extrinsic Reward Over Time" against training hours for the goals setting and the A3C baseline.]

Figure 5.8: Results for the ant experiments with a neural network policy πS and where gt has been used in the update rule. Due to time and computing constraints we only visualize one run and setting as a proof of concept.

[Trajectory plots over the map for both agents.]

Figure 5.9: Top: Example trajectory learned with the goal-based agent. Bottom: Example trajectory learned with A3C. Green dots represent apples and red dots bombs in both plots.


5.3 Learned Goals with Learned Embeddings

Finally we aim to learn the complete model, including the goal space, in an end-to-end fashion. Due to time and computing constraints we only evaluate the experiments of this section with the point-mass model. Note, however, that these methods could be applied without modification to the more complex agents of the previous experiments.

5.3.1 Task

We investigate the same gather task as in the last section. This time, however, the goal embedding φ is not known in advance and the agent itself has to figure out in which space it wants to set its goals. Again we describe how the different components of the goal-based framework of Chapter 4 are treated in the following experiments:

• φ: Learned goal embedding of the current state

• πS : Learned goals (i.e. actions) of the goal-setting policy

• πR: Learned goal-reaching policy towards the goal

In this experiment we only consider the point-mass model and leave it to future work to investigate how these methods can be extended to more complex agents.

5.3.2 Experimental Details

For the goal-setting and goal-reaching policies we employ the same setup as used in the last two experiments (see Tables 5.1 and 5.3), with a few deviations as outlined in the following. We changed gmax to 40, as we observed improved stability through this adjustment. We also noticed that normalizing all inputs, such that they are within the range of -1 and 1¹, is beneficial. Finally, we increased the time for learning to two hours due to the increased difficulty of the experiment.

In the last experiment we saw that multiple conditions of the goal-setting policy were able to solve the task. Therefore we restrict the experiments in this section to the simplest of these, that is, πS is a linear policy and it uses its action (i.e. not the reached state) in the policy gradient update rule. Furthermore it employs a fixed variance of 0.01.

We learn φ as an attention module that is able to focus on a specific part of the state as the goal embedding. This means it selects what part of the state it would like to change. The goal-setting policy subsequently decides how it would like to change the attended part. The attention module takes a reshaped version of the original state s_t as input. For this we group the two corresponding directions for each part of the input to obtain S_t ∈ R^{2×12}. The attention module takes the form presented in Equation 3.17 and a selected attention weight continues to be active until a goal is reached or gmax steps have passed.

In this experiment we vary different settings of the attention module. We consider variants of both soft and hard attention. The soft version is trained end-to-end with the rest of the goal-setting policy. In the case of hard attention, we treat the softmax (Equation 3.17) as an additional discrete action the agent can take, and use the identical actor-critic update (Equation 3.8) that is used for the goal-setting policy.
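The sketch below illustrates one way such an attention module can be realized. It is an assumption for illustration: we use a plain softmax over the 12 column groups of the reshaped state (Equation 3.17 is not reproduced here), a linear f_att, and the hard variant samples a column as a discrete action:

    # Attention-based goal embedding over the reshaped state S_t in R^{2x12}.
    import torch
    import torch.nn as nn

    class AttentionEmbedding(nn.Module):
        def __init__(self, num_groups=12, group_dim=2, hard=True):
            super().__init__()
            self.f_att = nn.Linear(num_groups * group_dim, num_groups)  # linear f_att
            self.hard = hard

        def forward(self, S):                                # S has shape (batch, 2, 12)
            weights = torch.softmax(self.f_att(S.flatten(1)), dim=-1)
            if self.hard:
                idx = torch.distributions.Categorical(weights).sample()   # discrete action
                return S[torch.arange(S.size(0)), :, idx], weights        # attended 2-D group
            return (S * weights.unsqueeze(1)).sum(dim=-1), weights        # soft weighted sum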

¹ For locations we assume the maximum range of object locations as limits.


[Plot: "Extrinsic Reward Over Time" against training hours for the settings soft, hard, hard mlp and embedding.]

Figure 5.10: Results for the different settings evaluated in the experiments of this section. Soft refers to the version that employs a soft attention mechanism. Hard refers to the variant with hard, linear attention and the additional indicator vector for collected apples, while hard mlp is the version without this vector and with a two-layer attention mechanism instead of a linear one. Embedding represents the setting from the last set of experiments, where the goal embedding is provided and not learned.

We found it difficult to learn good behaviour with the given state representation. Therefore we added an additional binary vector indicating whether an apple has already been collected. This setting is employed together with a linear f_att (Equation 3.17). Note that theoretically this information is already contained in S_t and could thus be learned. To investigate this, we also train a model without the additional indicator vector. This model is a two-layer network with ELU activation and a hidden size of 16. We finally compare the obtained results to a version that uses the hand-designed embedding of the last section. As some of the settings changed slightly, we reevaluated this model to ensure a fair comparison.

5.3.3 Results

We plot the obtained average reward for the four evaluated conditions in Figure 5.10. These include:

1. Soft attention, f_att: linear, additional collected-apples indicator

2. Hard attention, f_att: linear, additional collected-apples indicator

3. Hard attention, f_att: two-layer network

4. Hand-designed goal embedding from the last section

Out of the models with learned goal embeddings (1. to 3.), only the version with hard attention and an additional indicator for already collected apples is able to achieve acceptable results. Its score is, however, still inferior to the hand-designed goal embedding of the last section.

Figure 5.11: Here we visualize the attention weights a_t over the course of one episode. This agent uses a hard attention mechanism. Higher values in this plot are more likely to be attended. The apple icon indicates a collected apple at the given time step. We can clearly see that the agent puts its focus on a desirable part of its input representation and sets goals based on the attended part. Whenever an apple is reached, its attention switches to another part.

One of the benefits of attention modules is that we are able to interpret their decisions. To visualize those, we plot the attention weights a_t over the course of an episode in Figure 5.11 and highlight corresponding events.

5.3.4 Discussion

Figure 5.10 reveals a few insights about the experiment of this section. First we notice that the version with the predefined goal embedding z_t (red line) obtains the best results and also exhibits the fastest progress in the beginning. This is not surprising, as it includes additional prior information about the task at hand. We also observe unstable behaviour after training longer than one hour in this setting. We hypothesize that this is due to some form of overfitting, although this would have to be investigated in further experiments to be certain.

The variant with a soft attention mechanism (purple line) does not work at all for this task. This is because the agent cannot attend to all parts of the input to some degree and then decide to change them all simultaneously: for example, going closer to one apple necessarily means that it will move further away from another. Hence soft attention is not well suited for this setup.

In the version that consists of a hard attention module in the form of a two-layer network (blue line), we observe some initial progress. However, after reaching an average reward of one, the learned policy starts to diverge again. We try to explain this behaviour with the following reasoning. A good agent is likely to put its attention on close-by apples. It then learns to get closer to one of these and eventually collects it. After this, the values of the corresponding entry will consist of zeros, indicating that the collected apple is now carried. Attending to already collected apples is something the agent should avoid if it wants to obtain further rewards. Now the agent is in a dilemma: it wants to attend to apples with low absolute entries (meaning they are close to its position), but it also wants to avoid entries that are exactly zero (meaning it has already collected the item). We believe that this sharp decision boundary is the reason for the observed instabilities.

In order to evaluate this claim, we added an additional binary vector to the input, indicating whether an apple has already been collected or not (green line). In this setting we observe that the agent is able to learn a representation of its goal space through the hard attention module. After two hours of training, even the linear module achieves an average reward of around 2.5. Note that this is still slightly less than what the setting with predefined goal representations obtains. While we knew such a representation for the task at hand, there are many other cases where this form of prior knowledge is not available. Using attention modules additionally adds the potential to interpret the agent's actions, as we have illustrated in Figure 5.11.

Chapter 6

Conclusion and Future Work

In this thesis we investigated whether Reinforcement Learning agents can be improved through the addition of an explicit form of goal-directed behaviour. It was hypothesized that such a mechanism facilitates long-term credit assignment and provides improved exploration capabilities. Because of that, goal-directed behaviour might be especially useful in environments with sparse reward signals.

We provided an introduction to the areas of Reinforcement Learning and Deep Learning, which are the core components of this work. Furthermore, we summarized existing literature on hierarchical Reinforcement Learning. We proposed a framework, based on multiple deep networks, which learns to set goals in spaces that are either specified by the user or learned in an end-to-end manner. We have furthermore related the proposed approach to existing literature through a connection with the options framework.

We evaluated the presented approach on a set of experiments of increasing difficulty. This allowed us to investigate all components of goal-directed learning in isolation, leading to an incremental solution. While we focused on agents with continuous action spaces and a gather scenario, the framework itself is not restricted to these settings and can be applied to a wide range of other tasks.

Experimental analysis highlights that all components of goal-based reasoning, including identifying a suitable goal space, setting goals and finally reaching them, can be obtained through a combination of RL with deep neural networks. Furthermore, we show that the framework is able to solve challenging tasks on which state-of-the-art RL algorithms struggle to find good solutions.

Developing methods that are able to act autonomously in complex tasks is likely to become of commercial importance in the upcoming years. It not only enables solving problems more efficiently [86], but might also be the safer option in some scenarios [87]. It is possible to view many use cases in industrial applications as sequential decision making problems. In this context it is often convenient to express success in a given task through sparse reward functions. It is therefore crucial that the employed algorithms work well in this setting. The model proposed here directly relates to this, as it has the potential to handle very sparse reward signals.

The lack of interpretability is often a concern when using neural networks. However, due to the modularity of the approach proposed here, we can always replace independent components of the model if we require additional safety guarantees. Additionally, we have shown that it is possible to use an attention mechanism for learning the goal representation, which allows us to inspect some of the decisions within the model.

There are many possible ways to improve the approach presented here. Some of the challenges are related to the goal-setting policy. This includes, for example, the update rule when learning with a policy gradient-based algorithm. Additionally, it is an open question whether random Gaussian exploration is a suitable choice when dealing with high-level goals or whether alternatives should be considered. It would also be desirable to have an adaptive step-size gmax for different goals.

Ideally we would like to have more data-efficient algorithms, especially when applying these methods on real robots. This might turn out to be crucial for the goal-setting policy, as it is, by design, exposed to less data. Model-based approaches are inherently data-efficient [88] and thus might provide one solution to overcome this issue. Note that it is very natural to learn a model for the goal-setting agent based on transitions in goal space. This is because we are likely to know where we will end up after setting a goal (under the assumption that the goal-reaching policy actually implements behaviour that leads to the chosen goal).

It will also be interesting to see how the results of the end-to-end trainable models from Section 5.3 transfer to more challenging agents. Furthermore, it is desirable to apply these methods to a set of different tasks, such as maze navigation or game playing, including cases where only raw visual observations are provided. It remains to be seen whether we should strive to design such a system by hand or whether we can discover methods that are able to automate this process.

Bibliography

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness,

Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Os-

trovski, et al. Human-level control through deep reinforcement learning. Nature, 518

(7540):529–533, 2015.

[2] Shixiang Gu, Ethan Holly, Timothy P. Lillicrap, and Sergey Levine. Deep rein-

forcement learning for robotic manipulation. CoRR, abs/1610.00633, 2016. URL

http://arxiv.org/abs/1610.00633.

[3] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George

Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,

Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree

search. Nature, 529(7587):484–489, 2016.

[4] Marc Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learn-

ing environment: An evaluation platform for general agents. In Twenty-Fourth Inter-

national Joint Conference on Artificial Intelligence, 2015.

[5] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timo-

thy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous

methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.

[6] Paul FMJ Verschure, Cyriel MA Pennartz, and Giovanni Pezzulo. The why, what,

where, when and how of goal-directed choice: neuronal and computational princi-

ples. Phil. Trans. R. Soc. B, 369(1655):20130483, 2014.

[7] Henk Aarts and Andrew Elliot. Goal-directed behavior. Taylor & Francis, 2012.

[8] Yukiori Goto and Anthony A Grace. Dopaminergic modulation of limbic and corti-

cal drive of nucleus accumbens in goal-directed behavior. Nature neuroscience, 8(6):

805–812, 2005.

[9] Matthew M Botvinick, Yael Niv, and Andrew C Barto. Hierarchically organized be-

havior and its neural foundations: A reinforcement learning perspective. Cognition,

113(3):262–280, 2009.

[10] Shixiang Gu, Timothy P. Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous

deep q-learning with model-based acceleration. CoRR, abs/1603.00748, 2016. URL

http://arxiv.org/abs/1603.00748.


[11] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter

Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015. URL

http://arxiv.org/abs/1502.05477.

[12] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray

Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience

replay. arXiv preprint arXiv:1611.01224, 2016.

[13] Shixiang Gu, Timothy P. Lillicrap, Zoubin Ghahramani, Richard E. Turner, and

Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic.

CoRR, abs/1611.02247, 2016. URL http://arxiv.org/abs/1611.02247.

[14] Guy Lever. Deterministic policy gradient algorithms. 2014.

[15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,

Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep rein-

forcement learning. arXiv preprint arXiv:1509.02971, 2015.

[16] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel.

High-dimensional continuous control using generalized advantage estimation. arXiv

preprint arXiv:1506.02438, 2015.

[17] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z

Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsuper-

vised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

[18] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-

based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International

Conference on, pages 5026–5033. IEEE, 2012.

[19] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmark-

ing deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016.

URL http://arxiv.org/abs/1604.06778.

[20] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in

neural information processing systems, pages 271–271. Morgan Kaufmann Publishers,

1993.

[21] Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenen-

baum. Hierarchical deep reinforcement learning: Integrating temporal ab-

straction and intrinsic motivation. CoRR, abs/1604.06057, 2016. URL

http://arxiv.org/abs/1604.06057.

[22] A. Sasha Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and

K. Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. ArXiv

e-prints, March 2017.

[23] Bram Bakker and Jürgen Schmidhuber. Hierarchical reinforcement learning based on

subgoal discovery and subpolicy specialization. In Proc. of the 8-th Conf. on Intelligent

Autonomous Systems, pages 438–445, 2004.


[24] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. Deeploco:

Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM

Transactions on Graphics (Proc. SIGGRAPH 2017), 36(4), 2017.

[25] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps:

A framework for temporal abstraction in reinforcement learning. Artificial intelligence,

112(1):181–211, 1999.

[26] Marlos C. Machado and Michael H. Bowling. Learning purposeful be-

haviour in the absence of rewards. CoRR, abs/1605.07700, 2016. URL

http://arxiv.org/abs/1605.07700.

[27] Aravind S Lakshminarayanan, Ramnandan Krishnamurthy, Peeyush Kumar, and

Balaraman Ravindran. Option discovery in hierarchical reinforcement learning using

spatio-temporal clustering. arXiv preprint arXiv:1605.05359, 2016.

[28] Amy McGovern and Andrew G Barto. Automatic discovery of subgoals in reinforce-

ment learning using diverse density. 2001.

[29] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. arXiv

preprint arXiv:1609.05140, 2016.

[30] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value func-

tion approximators. In Proceedings of the 32nd International Conference on Machine Learn-

ing (ICML-15), pages 1312–1320, 2015.

[31] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei,

and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep rein-

forcement learning. arXiv preprint arXiv:1609.05143, 2016.

[32] Jingwei Zhang, Jost Tobias Springenberg, Joschka Boedecker, and Wolfram Burgard.

Deep reinforcement learning with successor features for navigation across similar

environments. arXiv preprint arXiv:1612.05533, 2016.

[33] Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future.

arXiv preprint arXiv:1611.01779, 2016.

[34] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value func-

tion decomposition. J. Artif. Intell. Res.(JAIR), 13:227–303, 2000.

[35] Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of ma-

chines.

[36] Doina Precup, Richard S Sutton, and Satinder P Singh. Planning with closed-loop

macro actions. In Working notes of the 1997 AAAI Fall Symposium on Model-directed Au-

tonomous Systems, pages 70–76, 1997.

[37] Amy McGovern, Richard S Sutton, and Andrew H Fagg. Roles of macro-actions in

accelerating reinforcement learning. In Grace Hopper celebration of women in computing,

volume 1317, 1997.


[38] Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol

Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions.

In Advances in Neural Information Processing Systems, pages 3486–3494, 2016.

[39] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski,

Adam White, and Doina Precup. Horde: A scalable real-time architecture for learn-

ing knowledge from unsupervised sensorimotor interaction. In The 10th International

Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. In-

ternational Foundation for Autonomous Agents and Multiagent Systems, 2011.

[40] Alexander L Strehl and Michael L Littman. An analysis of model-based interval

estimation for markov decision processes. Journal of Computer and System Sciences, 74

(8):1309–1331, 2008.

[41] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John

Schulman, Filip De Turck, and Pieter Abbeel. #exploration: A study of count-based

exploration for deep reinforcement learning. CoRR, abs/1611.04717, 2016. URL

http://arxiv.org/abs/1611.04717.

[42] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and

Remi Munos. Unifying count-based exploration and intrinsic motivation. In Ad-

vances in Neural Information Processing Systems, pages 1471–1479, 2016.

[43] Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Remi Munos. Count-

based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

[44] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

[45] William R Thompson. On the likelihood that one unknown probability exceeds an-

other in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[46] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and

Pieter Abbeel. Curiosity-driven exploration in deep reinforcement learn-

ing via bayesian neural networks. CoRR, abs/1605.09674, 2016. URL

http://arxiv.org/abs/1605.09674.

[47] Yi Sun, Faustino Gomez, and Jürgen Schmidhuber. Planning to be surprised: Op-

timal bayesian exploration in dynamic environments. In International Conference on

Artificial General Intelligence, pages 41–51. Springer, 2011.

[48] J. Achiam and S. Sastry. Surprise-Based Intrinsic Motivation for Deep Reinforcement

Learning. ArXiv e-prints, March 2017.

[49] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathe-

matics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

[50] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural

networks, 4(2):251–257, 1991.

[51] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, vol-

ume 1. MIT press Cambridge, 1998.


[52] John P O’Doherty, Peter Dayan, Karl Friston, Hugo Critchley, and Raymond J Dolan. Temporal difference models and reward-related learning in the human brain. Neuron, 38(2):329–337, 2003.

[53] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063, 1999.

[54] Jack PC Kleijnen and Reuven Y Rubinstein. Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3):413–427, 1996.

[55] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[56] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.

[57] F. Niu, B. Recht, C. Re, and S. J. Wright. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. ArXiv e-prints, June 2011.

[58] Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

[59] John N Tsitsiklis, Benjamin Van Roy, et al. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

[60] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pages 201–208. ACM, 2005.

[61] Carl Edward Rasmussen, Malte Kuss, et al. Gaussian processes in reinforcement learning. In NIPS, volume 4, page 1, 2003.

[62] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[63] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951, 2000.

[64] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[65] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

[66] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

[67] Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009.

[68] Alex Graves. Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 5–13. Springer, 2012.

[69] Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[70] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.

[71] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595, 2015. URL http://arxiv.org/abs/1512.02595.

[72] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[73] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[74] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[75] T. Tieleman and G. Hinton. RMSprop Gradient Optimization. URL http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

[76] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

[77] Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.

[78] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[79] Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886, 2015.

[80] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.

[81] Bram Bakker and Jürgen Schmidhuber. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems, IAS-8, pages 438–445, 2004.

[82] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. ArXiv e-prints, November 2016.

[83] Pawel Wawrzynski. Learning to control a 6-degree-of-freedom walking robot. In EUROCON 2007: The International Conference on “Computer as a Tool”, pages 698–705. IEEE, 2007.

[84] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[85] PyTorch. http://pytorch.org/. Accessed: 2017-04-16.

[86] Jim Gao and Ratnesh Jamidar. Machine learning applications for data center optimization. Google White Paper, 2014.

[87] A Hussain, A Malik, MU Halim, and AM Ali. The use of robotics in surgery: a review. International Journal of Clinical Practice, 68(11):1376–1382, 2014.

[88] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
