
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Predicting vehicle trajectories with inverse reinforcement learning

BJARTUR HJALTASON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Predicting vehicle trajectories with inverse reinforcement learning

BJARTUR HJALTASON

Master in Computer Science
Date: August 8, 2019
Supervisor: Jana Tumová
Examiner: Erik Fransén
School of Electrical Engineering and Computer Science
Host company: Scania


Abstract

Autonomous driving in urban environments is challenging because there are many agents located in the environment, all with their own individual agendas. With accurate motion prediction of surrounding agents in the environment, autonomous vehicles can plan for more intelligent behaviors to achieve specified objectives, instead of acting in a purely reactive way. The objective of this master thesis is to predict the future states of vehicles in a road network. A machine learning method is developed for trajectory prediction that consists of two steps: the first step is an inverse reinforcement learning algorithm that determines the reward function corresponding to an expert driver behavior extracted from real world driving; the second step is a deep reinforcement learning module that associates high-level policies based on vehicle observations. Regular drivers take into account many factors while making tactical driving decisions, which cannot always be represented by conventional rule-based models. In this work, a novel approach to learn the driver behavior by extracting suitable features from the training dataset is proposed. The accuracy of predictions is evaluated using the NGSIM I-80 dataset. The results show that this framework outperforms a constant velocity model when predicting further than 6 seconds into the future.


Sammanfattning

Autonomous driving in urban environments is challenging because there are many agents in the environment, all with their own individual agendas. With accurate motion estimation of the surrounding agents in the environment, autonomous vehicles can plan for more intelligent behaviors to achieve specific goals, instead of interacting purely reactively. The goal of this master's thesis is to predict the future states of vehicles in a road network. A machine learning method was developed for motion estimation, consisting of two steps: the first step is an inverse reinforcement learning algorithm that determines the reward function corresponding to an expert driver's behavior, extracted from real data. The second step is a deep reinforcement learning module that associates a high-level policy based on a vehicle's observations. Regular drivers take many factors into account simultaneously when making tactical driving decisions, which cannot always be represented by conventional rule-based models. In this work, a new approach to learning driver behaviors by extracting suitable features from the training data is proposed. The accuracy of the predictions is evaluated using the NGSIM I-80 dataset. The results show that this method outperforms a constant velocity model when predicting behaviors further than 6 seconds into the future.


Acknowledgements

I would like to thank my supervisor, Jana Tumová, for her guidance and support during the work on this master thesis. I would like to state my sincerest gratitude to my supervisors at Scania, Christoffer Norén and Laura Dal Col, for providing me with the opportunity to work on this project and for their unconditional and endless support; without their help this work wouldn't have been possible. Lastly I would like to thank my dear friend Hugo Carlsson for translating my abstract to Swedish. My work on this thesis has been financed by the Vinnova research project iQPilot (project number 2016-02547), for which I am grateful. The objective of this project is to move a step closer to the introduction of self-driving heavy-duty vehicles in urban environments.


Contents

Summary of Notations

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Research Question
  1.4 Approach and Methodology
  1.5 Sustainability and ethics

2 Background
  2.1 Neural Networks
  2.2 Reinforcement Learning
    2.2.1 Policy Iteration
    2.2.2 Value Iteration
    2.2.3 Q-Learning
    2.2.4 Function Approximation
    2.2.5 Policy Gradient
  2.3 Inverse Reinforcement Learning
  2.4 K-means
  2.5 Related Work
    2.5.1 Inverse Reinforcement Learning Extensions
  2.6 Contribution

3 Methods
  3.1 Dataset Introduction
    3.1.1 NGSIM Dataset
  3.2 Evaluation Metrics
  3.3 The Framework
  3.4 Feature Expectations Module
    3.4.1 Vehicle Classification
  3.5 Inverse Reinforcement Learning Module
  3.6 Reinforcement Learning Module
    3.6.1 Action Space
    3.6.2 Environment
    3.6.3 Transition Function
    3.6.4 Deep Deterministic Policy Gradient
  3.7 Framework Setup and Training
    3.7.1 General Information
    3.7.2 Model Architecture
    3.7.3 Hyper-parameters

4 Results and Experiments
  4.1 Results on the NGSIM Dataset
    4.1.1 Results for the Passenger Car Agent
    4.1.2 Results for the Motorcycle Agent
    4.1.3 Results for the Truck/Bus Vehicle Type Agent

5 Discussion and Conclusions

6 Future Work

Bibliography


Summary of Notations

Notation          Meaning
s, s′             states
a                 an action
S                 set of all nonterminal states
S+                set of all states, including the terminal state
A(s)              set of all actions available in state s
R                 set of all possible rewards, a finite subset of the real numbers
t                 discrete time step
T, T(t)           final time step of an episode, or of the episode including time step t
At                action at time t
St                state at time t, typically due, stochastically, to St−1 and At−1
Rt                reward at time t, typically due, stochastically, to St−1 and At−1
π                 policy (decision-making rule)
π(s)              action taken in state s under deterministic policy π
π(a|s)            probability of taking action a in state s under stochastic policy π
p(s′, r|s, a)     probability of transition to state s′ with reward r, from state s and action a
p(s′|s, a)        probability of transition to state s′, from state s taking action a
r(s, a)           expected immediate reward from state s after action a
r(s, a, s′)       expected immediate reward on transition from s to s′ under action a
vπ(s)             value of state s under policy π (expected return)
v∗(s)             value of state s under the optimal policy
qπ(s, a)          value of taking action a in state s under policy π
q∗(s, a)          value of taking action a in state s under the optimal policy
V, Vt             array estimates of the state-value function vπ or v∗
Vt(s)             expected approximate action value, e.g., Vt(s) = Σa π(a|s)Qt(s, a)
Q, Qt             array estimates of the action-value function qπ or q∗
Q(s, a)           similar to V, except that it also uses the current action a
θ, θt             parameter vector of the target policy
π(a|s, θ)         probability of taking action a in state s given parameter vector θ
πθ                policy corresponding to parameter vector θ
∇π(a|s, θ)        column vector of partial derivatives of π(a|s, θ) with respect to θ
J(θ)              performance measure for the policy πθ
∇J(θ)             column vector of partial derivatives of J(θ) with respect to θ
γ                 discount factor, 0 ≤ γ ≤ 1, emphasizing rewards close to us in "time"
α                 learning rate, a hyperparameter that controls how much to change the model weights in response to the estimated error at each update


Chapter 1

Introduction

1.1 Motivation

The technical development of autonomous vehicles is progressing rapidly, and such vehicles may soon have a place in our everyday life. The ultimate goal of this transport revolution is to have cars, trucks and buses which have a smaller environmental footprint, are cheaper to operate and, even more importantly, can guarantee higher safety standards than their manually driven counterparts.

One of the most fundamentally important things is to ensure that autonomous vehicles have a clear understanding of their surroundings, which entails being able to perceive other road users' intentions and states and to predict their actions and movements over time. This information can then be used for tactical decision making and for planning safe maneuvers. The main objective of this thesis is to investigate how reinforcement learning and inverse reinforcement learning can be used in the field of motion prediction of the surrounding agents in various traffic scenes.

1.2 Problem Definition

This thesis project entails creating and implementing a framework built on inverse reinforcement learning [1] to predict the next series of states of vehicles in a predefined time interval. The time interval, i.e. how long into the future to predict, will be adjustable in the framework, since how much the uncertainty of the predictions grows over time will be studied.

This framework will involve a couple of challenging parts. To work with Scania's autonomous driving stack, the framework needs to be able to predict the states of every vehicle which comes into the vicinity of the ego vehicle's sensor range.

The framework will use the available sensor fusion information as an input. The sensor fusion module estimates the vehicles' position, size, heading and velocity from the onboard sensor data, before this information is fed into the framework. Alongside the sensor fusion information, the prediction framework will also be able to use information from a high definition (HD) map, such as geographical data on the positions of the center lines and boundaries of the lanes of the road. This framework needs to be flexible enough to work with the constantly changing sensor fusion data, and to cope with possible inconsistencies and inaccuracies of the sensor fusion data. Since this framework is based on inverse reinforcement learning, there will be a need to design a feature extraction module. This module needs to be able to use a set of features to capture relevant information from the fused sensor and map data. The set of features this thesis will use is this thesis's novel contribution: features based on human vision fields while driving. It is assumed that an agent, in the domain of this thesis a human driver, is driven by a reward function that he/she aims to optimize [2]. The goal of inverse reinforcement learning is to determine the reward function for a human driver [3], which describes the observed behaviour. The reward function will then be estimated by our inverse reinforcement learning module from the data captured by our features.

Other challenging parts are, for instance, designing the action space and tuning the hyper-parameters in each of our modules. The action space describes what actions an agent can take, for example accelerating or turning the steering wheel by a certain degree. For our framework we need to find which actions are beneficial to the prediction accuracy of the framework.

The framework's hyper-parameters also have to be tuned so that our modules can deliver the functionality we want. All our modules have a set of parameters that need to be tuned, for instance the learning rates of the neural networks in our reinforcement learning module (see Section 3.6). In the case of our learning rates, a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process. The same goes for tuning the discount and convergence constants of our inverse reinforcement learning module (see Section 3.5), which we need to tune to get the functionality we want out of it.

The framework also needs to be developed with the goal of being integrated into Scania's onboard vehicle system. It needs to be compliant with real-time constraints when running inference on incoming data, and be able to predict trajectories at run-time onboard a vehicle in traffic.


The framework will first be tested and developed on the NGSIM dataset [4]. It is widely used in the scientific community to train and validate trajectory prediction algorithms. It is chosen over other similar datasets because it has more data to train on compared to the onboard vehicle logs available at Scania, while additionally containing all the information from Scania's onboard sensor fusion module that is relevant to our framework, such as the vehicles' locations and their estimated speeds.

1.3 Research Question

This thesis aims to answer the question of how well a framework based on inverse reinforcement learning [3], using novel features based on human vision fields while driving (see Section 3.4), can effectively and reliably predict the states¹ of vehicles in simple traffic situations such as highway traffic.

¹ States in this context means at least the vehicles' x, y coordinates as a series of waypoints over time, and ideally also their velocity and heading. The possibility of also including prediction of the vehicles' speed will be addressed.

1.4 Approach and Methodology

The block diagram representation of the framework in Scania's robotic platform for autonomous driving is outlined in Figure 1.1. The overall system consists of sensors, a sensor fusion module, a motion prediction framework, a motion planner, and a controller.

First the vehicle gathers sensor data about its surroundings through its sensors. This sensor data then runs through a sensor fusion module that uses it to estimate states such as position and speed of the surrounding vehicles. Using the data from the sensor fusion module, the framework gets information about each vehicle within its sensor range. The framework's feature extraction module uses this vehicle information and maps it to features. From those features our learned policy then predicts each vehicle's next actions and uses that information to predict their states t seconds into the future. These state predictions are then fed into the motion planner, which can use them to plan safer maneuvers. The controller then uses the plan from the motion planner and turns it into control signals for the autonomous vehicle.

The trajectory prediction problem is addressed with a framework based on inverse reinforcement learning (Section 2.3) and reinforcement learning (Section 2.2). The features are modeled based on information from the vision fields humans use while driving (which will be explained in detail in Section 3.4). These features are the input into our inverse reinforcement learning module, which maps them to a format our reinforcement learning module can use.

The reinforcement learning module then uses the information provided by the inverse reinforcement learning module to create a policy which maps vision field readings into driving actions such as steering in a certain direction or accelerating by a certain value. This learned policy can then be used to predict the next action of a certain vehicle given that vehicle's vision field readings. Depending on the type of vehicle being driven, both the available vision fields and the driving behavior change; to handle these differences, a different policy will be learned for each vehicle type (these types will be explained in detail in Section 3.4.1).

[Figure 1.1 block diagram: Sensor Data → Sensor Fusion → Prediction Framework (Feature extraction → Inverse Reinforcement Learning → Reinforcement Learning → Predictions) → Motion Planner → Controller → Environment.]

Figure 1.1: Abstract overview of how the framework will work with Scania's onboard vehicle system, where the green boxes indicate this thesis's framework contribution. The inverse reinforcement learning part marked by the red box is only used in the training step.


1.5 Sustainability and ethics

Autonomous vehicles are rapidly getting closer to becoming commercially available, with many companies putting effort into research with the goal to "solve" the problem of autonomous driving, such as Tesla, Waymo and Scania to name a few.

How can these autonomous vehicles help in transitioning to a more sustainable world? First and foremost, autonomous vehicles have the potential for a tremendous benefit when it comes to improved safety. More than 1.35 million people a year die from traffic accidents [5], and if a report that analysed the impact of autonomous cars on the incidence of fatal traffic accidents [6] has any merit to it, autonomous cars could translate into roughly 1.2 million lives saved annually. For that dream to become a reality, these vehicles need to be safe, and to do so we believe the key attribute is being situationally aware. Autonomous vehicles need to be able to make informed decisions based on their environment, taking into account what pedestrians and vehicles in the same environment are doing and what they will do next. This is why we chose to work on the problem of vehicle prediction in this thesis.

Not only could autonomous vehicles make roads that much safer, they could also improve accessibility for the elderly and the handicapped, enabling everyone to go where they want when they want. However, as with progress in general, this has the possibility to be a double-edged sword. Our roads could become safer and everybody could have access to transportation with the universal adoption of autonomous vehicles. It remains to be seen what the overall impact will be on employment. What will all the hundreds of thousands of taxi and truck drivers do? What will be the impact on the overall economic and social landscape in general when those jobs disappear? No one can really tell, but it is something that shouldn't be taken lightly.

There are still some unanswered questions when it comes to decision making in autonomous vehicles. Suppose our framework were to output an incorrect prediction, such as predicting that a pedestrian will not cross a road, and deliver that information to the motion planner, which decides the vehicle's next driving decision based on it, but the pedestrian does cross the road and a fatal crash results. Where does the responsibility lie? With the car manufacturer? With the person who wrote the code for the prediction framework?

In a more general setting, how will the vehicle choose when faced with a situation in which it risks harming either the driver or an innocent pedestrian? Let's assume an autonomous vehicle experiences a mechanical failure and the car is unable to stop. It has two actions to choose from: either it will crash into a large group of pedestrians, or the car could swerve, crash into an obstacle, and kill the driver [7]. What should the car do, who decides what the car should do, and where would liability rest for that decision? One thing is for certain: this topic needs to be addressed, since this decision one way or another has to be written into code.


Chapter 2

Background

In this chapter we will present the necessary prerequisites for the algorithms used in our framework. Work previously done in the problem domain of trajectory prediction will also be presented, and notable inverse reinforcement learning methods and their variants will be highlighted.

2.1 Neural Networks

Neural networks (NNs) are an information processing technique that can be used to approximate functions. A system based on a NN learns by being provided with examples of what it is supposed to learn, usually without being programmed with specific rules related to the task. Consider, for example, image recognition, which is a sub-field of computer vision. Neural networks learn to distinguish between images that contain cats or dogs by being provided examples of images that have been labeled as an image of a cat or a dog, like the ones shown in Figure 2.1. From those provided examples they automatically generate identifying characteristics from the learning material that they process, without any prior knowledge about what a cat or a dog is, but solely from identifying patterns in the provided data [8]. Using the learned patterns, NNs can then distinguish between cats and dogs in other previously unseen images. This property of NNs to learn features from provided examples is used in the framework's reinforcement learning module, where NNs are used to judge a driving situation based on information from vision fields. A curious reader can look at the cited sources to get more familiar with neural networks [9][10].


Figure 2.1: Series of images depicting cats and dogs

Figure 2.2 shows the layout of a simple neural network. NNs are composed of connected nodes, where the strength of a connection between nodes is controlled by a set of "weights". These weights get adjusted as the network is trained, while the output of a node is computed by an activation function of the sum of its inputs. The activation function is typically a non-linear function. Examples of activation functions are the sigmoid function f(x) = 1/(1 + e^(−x)) and the ReLU function f(x) = max(0, x). A single node in a NN that uses ReLU as its activation function computes its output as max(0, (input1 ∗ w1 + ... + input3 ∗ w3)) = y; this process is shown in Figure 2.3 [11] [12] [9].

Figure 2.2: A representation of a simple neural network with an input layer (Input #1–#3), a hidden layer and an output layer. The input layer takes in data, the hidden layer then reasons about patterns in the provided data, and the output layer returns some value based on the patterns the hidden layer found.


Figure 2.3: How a single node computes its output value from its weighted inputs and activation function f: y = f(input1 ∗ w1 + ... + input3 ∗ w3).

2.2 Reinforcement Learning

In our framework we use both inverse reinforcement learning and reinforcement learning techniques. To understand inverse reinforcement learning, we first need to understand what regular reinforcement learning is. We will also introduce the necessary reinforcement learning algorithms and techniques that our reinforcement learning method builds upon and uses.

In reinforcement learning (RL), an agent is driven by a reward function R. The reward function gives the agent feedback based on how good an action a is, given that the agent is in the state s. Using this feedback, an agent learns the best action it can take at every state to maximise its future reward [13]; this process is outlined in Figure 2.4. RL has been presented in many forms throughout the years. Works in this area include, for example, adaptive dynamic programming [14], TD-learning [14] and SARSA [15], to name a few. In recent years big advancements have been made in deep reinforcement learning. Deep reinforcement learning has, like its non-"deep" versions, many different variants, for example deep Q-networks [16], trust region policy optimisation [17], asynchronous advantage actor-critic [18] and proximal policy optimization algorithms [19]. All these variants are used for the same purpose, i.e. to develop an agent that learns to behave optimally in its environment.

Two terms that will be referred to later in the text are a policy and a policy rollout. A policy π is a decision-making rule which tells the agent what action to take given a certain state s. A policy rollout is the process of using a policy to evaluate many possible actions of how to reach the next time instance and to select the action a that maximizes the expected long-term reward. The policy is then provided with the next time instance after selecting action a, and this process is repeated for a certain number of steps.

Figure 2.4: Reinforcement learning loop, adapted from [20]: the agent takes action at in the environment and receives the new state st+1 and reward rt+1.

We mentioned that RL is the process of teaching an agent to take the best action a in a state s, where it is provided with a reward function R that describes how good a particular action in a particular state is. This process is a Markov Decision Process (MDP). An MDP is a way to formalize a problem which entails learning from interaction to achieve a goal, where the environment the problem is set in is fully observable (which is the case in this thesis; otherwise we have a partially observable Markov decision process, POMDP). An MDP is defined by five parameters, S, A, P, R and γ [20].

• S is a finite set of states.

• A is a finite set of actions.

• P = p(s′|s, a) is the probability of transitioning to state s′ ∈ S from state s ∈ S when taking action a ∈ A.

• R = r(s, a) is a reward function which returns the expected immediate reward from state s ∈ S after action a ∈ A.

• γ is a discount factor, γ ∈ [0, 1]. Intuitively, a factor of 0 makes an agent "opportunistic" by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward.
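As an illustration only (not part of the thesis), a finite MDP with these five components can be written down directly as a small Python data structure; the states, actions, probabilities and rewards below are placeholders:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                      # S, finite set of states
    actions: List[str]                                     # A, finite set of actions
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # P: (s, a) -> {s': p(s'|s, a)}
    rewards: Dict[Tuple[str, str], float]                  # R: (s, a) -> expected immediate reward
    gamma: float                                           # discount factor in [0, 1]

# Toy two-state example with made-up numbers.
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    transitions={("s0", "go"): {"s1": 0.9, "s0": 0.1},
                 ("s0", "stay"): {"s0": 1.0},
                 ("s1", "go"): {"s1": 1.0},
                 ("s1", "stay"): {"s1": 1.0}},
    rewards={("s0", "go"): 1.0, ("s0", "stay"): 0.0,
             ("s1", "go"): 0.0, ("s1", "stay"): 0.0},
    gamma=0.9,
)
```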

There are different approaches to learn the policy that optimizes the reward function R. In the forthcoming sections we will give an overview of the main techniques that will be used in the remainder of this thesis. These approaches can be divided into two categories, value based and policy based: the policy based approaches directly learn the policy π, while the value based approaches learn how good a certain state s in an environment is, judged by the provided reward function, and use that information to extract π.

In Section 2.2.1 policy iteration is presented (this approach falls in the first category and learns the policy directly); it learns π through two iterative steps. In Section 2.2.5 an approach that builds upon policy iteration is presented; it learns a parameterized π and updates these parameters based on observations using gradients. In Section 2.2.2 value iteration is presented (this approach falls in the second category, learning how good a certain state s is); it learns the value of every state s in the environment and extracts π using those values. In Section 2.2.3 Q-learning is presented, which falls in the category of value based approaches and expands upon value iteration: it does not just learn the value of a state s, but learns a value for each state-action pair (s, a). However, today's state of the art methods usually combine the two in the form of various actor-critic methods (the reinforcement learning module used in this thesis's framework being one of those actor-critic variations), where the actor learns by a policy gradient approach whose updates are judged by a critic which uses Q-learning [20].

The terms used in the following sections and their definitions can be found in the Summary of Notations.

2.2.1 Policy Iteration

The policy iteration approach consists of two iterative steps: policy evaluation and policy improvement. For a comprehensive overview of this technique the reader can refer to Algorithm 1 [20].

First, the policy evaluation step estimates the value function V, which is an array of estimates of how good each state s is, under the greedy policy (a greedy policy simply means always taking the action which gives the best value) obtained from the last policy improvement. The selection of the actions in RL is parametrized by ε ∈ [0, 1], where ε = 1 means taking a random action and ε = 0 means taking the best action according to the reward function R. Then we run a policy improvement step, which updates the policy with the action that maximizes V for each of the states s. These two steps continue iterating until convergence. Information on what each symbol means can be found in the Summary of Notations.

Algorithm 1 Policy Iteration: Estimating a policy π ≈ π∗
  Initialization: V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S
  procedure Policy Evaluation (1)
      repeat
          ∆ ← 0
          for s ∈ S do
              v ← V(s)
              V(s) ← Σ_{s′,r} p(s′, r|s, π(s))[r + γV(s′)]
              ∆ ← max(∆, |v − V(s)|)
      until ∆ < θ (a small positive number determining the accuracy of estimation)
  procedure Policy Improvement (2)
      policy-stable ← true
      for s ∈ S do
          old-action ← π(s)
          π(s) ← argmax_a Σ_{s′,r} p(s′, r|s, a)[r + γV(s′)]
          if old-action ≠ π(s), then policy-stable ← false
      if policy-stable, then stop and return V ≈ v∗ and π ≈ π∗; else go to 1
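A compact Python sketch of Algorithm 1 is given below. It assumes the tabular MDP representation from the earlier example (transition probabilities and expected rewards stored in dictionaries) and is illustrative only, not the thesis implementation:

```python
def policy_iteration(states, actions, p, r, gamma=0.9, theta=1e-6):
    """p[(s, a)] -> {s': prob}, r[(s, a)] -> expected reward (assumed structures)."""
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}          # arbitrary initial policy

    def q(s, a):
        # One-step lookahead: r(s, a) + gamma * sum_s' p(s'|s, a) * V(s')
        return r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)].items())

    while True:
        # Policy evaluation: sweep until the value changes fall below theta
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Policy improvement: make the policy greedy with respect to V
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions, key=lambda a: q(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi
```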

2.2.2 Value Iteration

Value iteration is the foundation of a later technique that the framework's reinforcement learning module builds on. Value iteration only has one component: it solely updates the value function V. When the iteration converges, we obtain the optimal policy by applying an argument-max function over all of the states. For a comprehensive overview of this technique the reader can refer to Algorithm 2 [20], and for an overview of what each symbol means refer to the Summary of Notations.

Algorithm 2 Value Iteration: Estimating a policy π ≈ π∗
  θ > 0, a small threshold determining the accuracy of estimation
  Initialize V(s) arbitrarily (e.g. V(s) = 0 for all s ∈ S+), except V(terminal) = 0
  procedure Value Iteration(θ)
      repeat
          ∆ ← 0
          for s ∈ S do
              v ← V(s)
              V(s) ← max_a Σ_{s′,r} p(s′, r|s, a)[r + γV(s′)]
              ∆ ← max(∆, |v − V(s)|)
      until ∆ < θ
  Output a deterministic policy, π ≈ π∗, such that
      π(s) = argmax_a Σ_{s′,r} p(s′, r|s, a)[r + γV(s′)]
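Under the same tabular assumptions as the policy iteration sketch above, value iteration reduces to a single loop over the value function, with the greedy policy extracted once at the end (again only an illustration):

```python
def value_iteration(states, actions, p, r, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}

    def q(s, a):
        # One-step lookahead using the current value estimates
        return r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)].items())

    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(q(s, a) for a in actions)    # V(s) <- max_a Q(s, a)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values
    pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, pi
```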

It shall be noted that both policy iteration, outlined in Section 2.2.1, and value iteration, outlined in Section 2.2.2, require knowledge of the transition probability p of the environment.

transition probability: p = p(s′, r|s, a) (2.1)

What this means is that if the agent takes a certain action a, the agent must know the probability of transitioning into a specific state s′ with reward r, given the current state s and action a. We thus need to know all possible outcomes and their probabilities; we essentially need to know our entire environment at all times.

This is manageable for small-sized environments, but when the state and action spaces grow one can easily see how this becomes impractical. For example, in the problem domain of predicting driving behavior in real world situations, both the state and the action space are continuous, which makes employing methods that rely on knowledge of the transition probability p infeasible.

2.2.3 Q-Learning

The problem of needing to know the transition probability p from Equation 2.1 is solved by Q-Learning, which is based on the value iteration method presented in Section 2.2.2. Instead of the update formula for the value function V used in Algorithm 2,

V(s) ← max_a Σ_{s′,r} p(s′, r|s, a)[r + γV(s′)],

the update in Q-Learning becomes

Q(s, a) ← Q(s, a) + α[r + γ max_a Q(s′, a) − Q(s, a)]. (2.2)

In Equation 2.2 the transition probability p (see Equation 2.1) is not needed. The new parameter α is a learning rate, i.e. how fast we want to change our Q function values. We also define for how many episodes we want to run Q-Learning; until they are finished we select the action a which gives the biggest value in s according to the current estimate of Q(s, a), observe the reward r we got and the state s′ we end up in, and then update the current estimate of Q(s, a) with Equation 2.2. For an in-depth explanation of the Q-Learning algorithm the interested reader can refer to Algorithm 3 [20]; information on what each symbol means can be found in the Summary of Notations. Note that the next action is chosen to maximize the next state's Q-value (max_a Q(s′, a)) instead of following the current policy, and that the balance between the immediate reward r and future rewards is characterised by γ. If we were following the current policy, the update formula would be:

Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)], (2.3)

where a′ is the action the current policy selects in s′.

Algorithm 3 Q-Learning for estimating π ≈ π∗
  Step size α ∈ (0, 1], small ε > 0
  Initialize Q(s, a) for all s ∈ S+, a ∈ A(s), arbitrarily, except that Q(terminal, ·) = 0
  for each episode do
      Initialize s
      while s is not terminal do
          Choose a from s using a policy derived from Q (e.g., ε-greedy)
          Take action a, observe r, s′
          Q(s, a) ← Q(s, a) + α[r + γ max_a Q(s′, a) − Q(s, a)]
          s ← s′
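A minimal tabular Q-learning sketch is shown below; it assumes a small environment object with gym-style reset() and step(action) methods returning (next_state, reward, done), which is an assumption for illustration and not something defined in the thesis:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(s, a)], zero-initialized
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the current Q estimate
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q-learning update (Equation 2.2): bootstrap with max_a Q(s', a)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```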

2.2.4 Function Approximation

Although Q-Learning solves the problem of needing to know the transition probability p (see Equation 2.1), it still scales badly to big state spaces, since it needs to learn a state-action value Q(s, a) for all states s ∈ S and actions a ∈ A.

To solve this problem a technique called function approximation is used. Function approximation is a reinforcement learning technique that deals with problems that are too large to be addressed efficiently with the previous methods, for example where the state space and/or action space are continuous or as big as in video games, where the state s is usually an image of "size" |S| = (255³)^250000. If we had to learn the value of all possible state-action pairs for a state space of this size, which would be needed when using for example Q-learning (Algorithm 3), it would most likely never finish. This function approximation is usually done by using the neural networks outlined in Section 2.1. The neural network is used as an information processing technique similar to the example of how it learns to distinguish between cats and dogs. Instead of learning features to distinguish between cats and dogs, it learns the features of a state and, from those features, their connection to the state's value. This approach will however never find the "true" value of a state, but an approximation of it [20] [13].
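As a hedged illustration of function approximation (not the architecture used in this thesis), a small PyTorch network can map a continuous state vector to one Q-value per discrete action, replacing the tabular Q(s, a); the layer sizes and dimensions below are placeholders:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a) for a continuous state and a fixed set of discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),     # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=8, n_actions=4)     # dimensions are placeholders
q_values = q_net(torch.randn(1, 8))            # Q(s, .) for a random state
```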

2.2.5 Policy Gradient

Policy gradient methods are reinforcement learning methods that learn a parameterized policy that can select actions without consulting a value function. The policy's parameter vector will be referred to as θ in the remainder of this thesis. The probability that a policy π will take a certain action a at state s becomes π(a|s, θ) = p(a|s, θ). Policy gradient methods learn the parameter values θ based on the gradient of some performance measure J(θ) with respect to the policy parameters. They seek to maximize their performance in terms of J(θ), so they update their parameters through gradient ascent in J, θt+1 = θt + α∇̂J(θt), where ∇̂J(θt) is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument θt. The preferences for a certain action a given a certain state s can be computed by a deep neural network, where θ is the vector of all the connection weights of the network. In general, when NNs are used to supplement an RL method, that RL method turns into its deep reinforcement learning variant [20].

A policy parametrized by a NN works by changing the weights depending on the experience. If, for example, the network predicts an action a from state s, and the resulting reward is positive, we change the weights to predict a more confidently. On the other hand, if the reward is negative, the NN changes its weights to predict a less often.
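A minimal policy-gradient (REINFORCE-style) update for a discrete-action policy network is sketched below. It only illustrates the θt+1 = θt + α∇̂J(θt) idea and is not the actor-critic variant used later in this thesis; the network sizes and the assumption that returns are precomputed are placeholders:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # logits over 4 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)             # step size alpha

def reinforce_update(states, actions, returns):
    """states: (T, 8) tensor, actions: (T,) long tensor, returns: (T,) discounted returns."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on J(theta) is implemented as gradient descent on -J(theta)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```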

Actor-critic methods take the ideas from policy gradient methods and value based methods such as Q-Learning and combine them: a policy-gradient method (the actor) takes a value function evaluation (the critic) into account to update its policy. In this thesis's framework a variant of actor-critic called deep deterministic policy gradient is used; that method will be explained in Section 3.6.4.

2.3 Inverse Reinforcement Learning

In IRL, the reward function is unknown, in contrast to the reinforcement learning setting in Section 2.2. Instead, the agent's policy or a history of its behavior is known and we try to recover the reward function. The reward function describes the objective of the observed behaviour from measurements of an agent's behavior over time, under the assumption that the observed agent acted optimally¹. IRL was first described in a paper by Andrew Y. Ng and Stuart Russell [3]. In this first paper they introduced high-level descriptions of three different IRL algorithms, each of them designed for different scenarios [3]. In the paper the authors claimed that a motivation for inverse reinforcement learning could be where:
"An agent designer (or indeed the agent itself) may have only a very rough idea of the reward function whose optimization would generate "desirable" behavior. Consider, for example, the task of "driving well." " [3]
This directly relates to the topic of this thesis. In [1], Ng and colleagues demonstrated that IRL was capable of learning different driving behaviors from demonstrations, and also created a new margin-based optimization method for the IRL problem.

2.4 K-means

We will learn three different policies π, based on three different vehicle classes: motorcycles, passenger vehicles and trucks/buses. To assign these classes to vehicles in an unlabeled dataset we will use an unsupervised learning technique called K-means [21]. K-means learns to partition data points into K clusters in which each observation belongs to the cluster with the nearest mean. In the scope of this thesis the data points are each vehicle's length and width.

¹ Optimally in this context means that the agent always picks the best possible action.


Algorithm 4 K-means clustering algorithm
  Input: K, the number of clusters we want to find
  Start with initial guesses for the K cluster centers (centroids)
  while the cluster assignments still change do
      Assign each data point to its closest cluster center: zi = argmin_k ||xi − µk||²₂
      Update each cluster center by computing the mean of all points assigned to it: µk = (1/Nk) Σ_{i: zi=k} xi
  Output: the K clusters
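A direct NumPy sketch of Algorithm 4, applied to made-up (length, width) pairs as in this thesis's use case, could look as follows (the actual framework may differ in initialization and stopping details):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """X: (N, 2) array of vehicle (length, width) pairs; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # initial guesses
    for _ in range(iters):
        # Assign each data point to its closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center as the mean of the points assigned to it
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break            # clusters no longer change
        centroids = new_centroids
    return labels, centroids

# e.g. K = 3 for motorcycles, passenger cars and trucks/buses
```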

2.5 Related Work

Path prediction has been studied extensively with many different approaches. Two of those are hidden Markov models [22] and Kalman filters [23]. Using historical vehicle data, Malekian et al. were able to determine the parameters of a hidden Markov model. Using those parameters, they applied the Viterbi algorithm to find the hidden state sequences corresponding to the just-driven trajectory and, with that information, predicted the vehicle's next state sequences [24]. Desbiens et al. used an extended Kalman filter to estimate the states of a moving object. From the estimated object states and a motion model defined for Kalman filtering, they predicted the trajectory of the object [25]. In recent years deep neural networks (DNNs) have become a popular approach to solve data-driven problems. This trend is noticeable when it comes to research on trajectory prediction, since most research on trajectory prediction has been conducted using recurrent neural networks (RNNs).

RNNs are a type of neural network with loops in them, as depicted in Figure 2.5: a layer A in the neural network takes in the initial input x0 and outputs a value h0, which gets fed back into the layer A, and that information is then used to create the next output h1, and so forth [9].


Figure 2.5: RNN loop taken from [26]

Since trajectory prediction is a sequence prediction problem, and because of the looping mechanism in RNNs (they can effectively use previous information to affect the next output), they are the variant of neural networks most used for sequence prediction tasks. A few approaches using the sequence prediction capabilities of RNNs are, for example, the perception module onboard Waymo's autonomous vehicles [27], a framework Huynh Manh et al. developed to predict human trajectories while also incorporating scene information alongside the trajectory data [28], and the RNN LSTM network Florent Altché et al. used for highway trajectory prediction [29].

With that being said, the approach we will use is inverse reinforcement learning (IRL). IRL has been applied with good results to a variety of hard problems. Finn et al. managed to teach high-dimensional robotic systems to place dishes on a tray using IRL [30], and Ho et al. managed to teach humanoid agents human movements such as running and getting up after a fall using IRL [31]. Because of these results we think IRL could produce good results in the problem domain of trajectory prediction.

2.5.1 Inverse Reinforcement Learning Extensions

IRL has been implemented in a variety of different approaches, similar to reinforcement learning [32]. The most notable variant is based on maximum entropy optimization, where a probabilistic approach to recovering the reward function is used. More specifically, it is based on maximizing the log likelihood of the observed behavior. This method was developed to take into account the noisy and imperfect nature of data collected in the real world.

This method was then applied to a real world problem in [33], where a learning method for the reward function that could predict driving behaviors was designed. The method has also been applied to predict wide receiver trajectories in American football [34] and to recover a reward function to forecast plausible paths for a pedestrian [35].

The most recent advance in general IRL is an extension of the maximum entropy IRL method, in which a general framework using neural networks to learn non-linear reward functions is presented; the framework is trained using the maximum entropy method [36] [37]. This method has been used for trajectory prediction: Fahad et al. used it to capture human navigation behaviors [38], and Zhang et al. used it to predict the trajectories of off-road vehicles [39]. However, both these applications relied on discretizing the high dimensional system representing the real world into a grid. This discretization, and consequently the reduction of the state space size, is necessary because deep maximum entropy inverse reinforcement learning is limited when it comes to working in high dimensional systems. A variation of maximum entropy deep inverse reinforcement learning that mitigates this need for discretization was devised by Finn et al., making it suitable for real world high-dimensional systems through policy optimization. They, for example, taught a robot to gently place a dish in a plate rack [30].

Lastly, similar to the work of Finn et al. [30], novel IRL/inverse optimal control (IOC) methods that handle continuous state spaces through sampling techniques have been devised, such as Laplace approximation [40] and Langevin sampling [41].

Most of today's state of the art prediction frameworks don't just rely on one method. They divide the prediction problem into many sub-problems, handpick attributes from many different methods best suited to tackle the isolated sub-problems, and aggregate their results to create a framework for trajectory prediction. DESIRE is one of those frameworks [42]. DESIRE combines two neural network structures, RNNs and a conditional variational autoencoder (CVAE), with inverse optimal control (IOC). The framework uses its CVAE to create a diverse set of possible future predictions for a particular vehicle, then ranks them utilizing its IOC module and selects the highest ranked prediction to output.

2.6 Contribution

The majority of previous work did not tackle the problem of vehicle prediction from the viewpoint of a human driver and the vehicle type he/she is driving, and instead relied on more data-driven approaches.

This thesis's novel contribution (to the best of our knowledge) is that we tackle this problem by emulating what a human can see while driving, using different fields of view. These fields represent looking out the vehicle's mirrors. The reasoning is that while driving we mainly use our vision for situational awareness and decision making.

The framework also takes into account that different vehicle types have both different vision fields and different driving maneuver capabilities; therefore we will learn three different driving policies. By measuring distances to other objects inside the vision fields and using the vehicle type as a baseline for an agent, we learn driving behaviors.


Chapter 3

Methods

3.1 Dataset Introduction

The framework needs a vehicle trajectory dataset to extract behaviors from and to create an environment to train its agent. The framework will be evaluated on the I-80 dataset, which is a subset of the NGSIM trajectory datasets [4]. The I-80 dataset was collected on a segment of the I-80 freeway in Emeryville (San Francisco Bay Area), California.

3.1.1 NGSIM Dataset

The NGSIM datasets were collected using cameras, and vehicle trajectory data was then extracted from the resulting videos. The frame rate of the NGSIM trajectory data is 10 frames per second. For every frame the dataset provides velocity, acceleration, global coordinates, vehicle length and vehicle type.

To prepare the data so it could be used by the framework, a preprocessing step was performed. Each vehicle's heading was estimated by Equation 3.1,

θ = tan⁻¹((yt+4 − yt) / (xt+4 − xt)), (3.1)

which calculates the inverse tangent of a vehicle's position displacement over four frames. In the preprocessing step each vehicle's bounding box is also estimated, using the vehicle's estimated width and length. These boxes can be seen as green and red blocks in Figures 3.5, 3.6 and 3.7. This is done so that the framework's φi function, which will be explained in Section 3.4, can return accurate values, since the NGSIM dataset only provides information about the location of the front center of each vehicle.
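A small sketch of this heading-estimation step is given below (illustrative only; the array layout is an assumption, not the thesis code). It uses numpy's arctan2, which is equivalent to the inverse tangent of the displacement ratio in Equation 3.1 but also handles xt+4 = xt:

```python
import numpy as np

def estimate_headings(x, y, step=4):
    """x, y: 1-D arrays of a single vehicle's positions, sampled at 10 Hz.
    Returns heading angles (radians) estimated over `step` frames (Equation 3.1)."""
    dx = x[step:] - x[:-step]
    dy = y[step:] - y[:-step]
    # arctan2 gives the angle of the displacement vector over the full [-pi, pi] range
    return np.arctan2(dy, dx)
```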


3.2 Evaluation Metrics

To evaluate the prediction capabilities of the framework, the prediction accuracy will be tested by comparing the framework's predictions to the known ground truth using the root mean squared error (RMSE), defined as

RMSE = √( Σ_{i=0}^{N} ((x̂i − xi)² + (ŷi − yi)²) / N ), (3.2)

where x̂i, ŷi are the predicted x and y coordinates, xi, yi are the known ground truth x and y coordinates and N is the total number of predictions. RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed.
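For reference, Equation 3.2 corresponds to the following small helper (a sketch, not the thesis code; the (N, 2) array layout is an assumption):

```python
import numpy as np

def rmse(pred_xy, true_xy):
    """pred_xy, true_xy: (N, 2) arrays of predicted and ground-truth positions."""
    sq_err = np.sum((pred_xy - true_xy) ** 2, axis=1)   # (x_hat - x)^2 + (y_hat - y)^2
    return np.sqrt(np.mean(sq_err))
```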

3.3 The Framework

This section gives an overview of the trajectory prediction framework. The framework takes in vehicle trajectory data and from that data learns an approximation of the vehicles' behavior. Using the learnt behavior, the framework is able to predict a vehicle's state t seconds into the future. The framework is outlined in Figure 3.1.

First, the input data of the framework is the vehicle trajectory data (TD step in Figure 3.1) in a time series format, which includes the vehicle's position, velocity, acceleration, and the length and width of the vehicle over a certain time period. The framework then runs an unsupervised clustering algorithm (explained in Section 2.4) to sort the vehicle trajectory data into different vehicle classes (K-means step in Figure 3.1). The framework is now at the sorted trajectory step, where the trajectory data has been sorted (STD step in Figure 3.1) by vehicle class (state1, ..., statem in Figure 3.1).

After the framework has divided the trajectory data based on the different vehicle classes, the data is fed into a feature extraction module where each trajectory point is turned into features (this step is the φ(statei) box in Figure 3.1). These features are based on distances to other objects in the vicinity while driving and will be explained in Section 3.4.


[Figure 3.1 block diagram: TD → K-means → STD → φ(state1), ..., φ(statem) → empirical estimate (Eq. 3.3) → IRL → DDPG → roll out π(i) → convergence check (Eq. 3.4) → πi ≈ πExpert, with µExp, µπ(0), wi, πi and µπ(i) passed between the blocks.]

Figure 3.1: Framework pipeline

From those extracted features, the framework computes an empirical estimate µExp (EMP. EST in Figure 3.1) of the features' expected values, as given by Equation 3.3, where γ is a discount factor and T is the total number of data points used for each vehicle:

µ = Σ_{t=0}^{T} γ^t φ(statet) (3.3)

This vector of empirical estimates, µExp, will serve as a descriptor of the vehicles' behavior in our trajectory data.
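A sketch of the empirical estimate in Equation 3.3 is shown below, assuming the per-time-step feature vectors φ(statet) have already been stacked into an array (an assumption for illustration, not the thesis implementation):

```python
import numpy as np

def feature_expectations(phi, gamma=0.99):
    """phi: (T, n_features) array with one feature vector per time step.
    Returns mu = sum_t gamma^t * phi(state_t)."""
    discounts = gamma ** np.arange(len(phi))            # gamma^0, gamma^1, ..., gamma^(T-1)
    return (discounts[:, None] * phi).sum(axis=0)
```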

The framework then creates a separate feature vector with values generated by a random behavior, which is called µπ(0) in Figure 3.1. Now there are two different feature vectors, one estimated from the behavior in the data, and one generated from a random behavior.

Then the inverse reinforcement learning (IRL) module is initialized using these two feature vectors. The objective of the IRL module is to recover the reward function R (as explained in Section 2.2) that describes the behavior observed in the trajectory data.

The behavior from the data gets a label of 1, and the random behavior a label of -1. There are thus two classes: behavior derived from the trajectory data and behavior not derived from the trajectory data. The framework then performs a max margin optimization step to maximise the difference between the two.

After the optimization step, the result is the vector w_i normal to the max margin hyperplane that separates the feature vector obtained from the trajectory data, µ_Exp, from the non-expert feature vectors. This vector will be referred to as the weight vector in the remainder of this thesis. This step is the IRL block in Figure 3.1.

Intuitively speaking, this weight vector serves as a compass indicating which behaviors are favourable, i.e. closer to the behavior observed in the data. These weights serve as the reward function in the reinforcement learning module.

The reinforcement learning (RL) module is initialized using the weight vector from the IRL module as its reward function R. The objective of the RL module is to train an agent that learns the best action it can take at every state to maximise its future reward, where the reward is given by the reward function R that the IRL module outputs. After training, the RL module outputs an agent in the form of a decision making policy π_i. This step is the DDPG block in Figure 3.1.

This new policy π_i is then “rolled out” (as explained in Section 2.2) in the place of one vehicle from the trajectory data, for a certain number of data points T. The framework then performs an empirical estimate, as explained in Equation 3.3, on the states state_1, ..., state_T that occurred when π_i was “rolled out”, resulting in a new feature vector µ_π(i). This step is the roll out block in Figure 3.1.

µ_π(i) is then compared to the behavior obtained from the trajectory data, µ_Exp. This comparison is defined by

w^T(\mu_{Expert} - \mu_{\pi(i)}) \leq \epsilon, \qquad (3.4)

where ε is a predefined convergence constant. Finally, if the condition in Equation 3.4 is fulfilled, the framework outputs the policy π_i. This policy should then behave similarly to the behavior observed in the trajectory data. These last two steps are the diamond decision block and the blue block (π_i ≈ π_Expert) shown in Figure 3.1.

This policy π_i is then what the framework uses to make future predictions. If the condition in Equation 3.4 is not fulfilled, the framework adds the feature vector µ_π(i) of π_i to the max margin optimization step and repeats the same process, until it finds a policy close enough, up to the convergence constant ε, to the behavior obtained from the trajectory data.

3.4 Feature Expectations Module

This section explains this thesis' novel contribution, which is the set of features we extract from each trajectory data point. These features are based on distances to other objects inside vision fields that emulate what a human can see while driving.

The feature expectation module extracts the feature vectors that are used as a metric in the Inverse Reinforcement Learning module as well as the state space (Section 2.2) in the Reinforcement Learning module. At each point state_t in the trajectory data fed into the framework, for every vehicle v ∈ V (V stands for all unique vehicles in the provided data), the framework performs the feature extraction φ(state_t), collecting the values from the functions φ_i as well as readings from map information in the form of the vehicle's distance to the closest lane center line c, i.e. φ(state_t) = (φ_1, φ_2, ..., φ_n, c), where φ_i is a distance reading. This is performed for all data points in the trajectory data, resulting in a µ_v for every vehicle v. From all the collected readings the framework performs an empirical estimate, resulting in the expected feature values µ_Exp from the provided data. This process is outlined in Algorithm 5.

The method is based on replicating the vision fields a human has while driving a car, which is emulated using six different fields of view. These vision fields represent looking out the front windshield, looking to the sides, looking into the left and right side mirrors and looking out the back mirror; these vision fields can be seen in Figure 3.2. The values these vision fields return are modeled through a function φ_i that performs a distance reading from the origin of a vision field to its end. φ_i takes values in [0, 1], where φ_i = 1 means the reading did not hit any object and φ_i = 0 means it hit an object at the origin. In each vision field there is then a different number of φ functions, placed at different angles in their corresponding vision field. These vision fields, with each φ_i represented as a red dotted line, can be seen in Figure 3.3, where each φ_i would return a value of 1. In Figure 3.4 a case where two objects are inside the front vision field is shown, where the φ_i's inside the vision field would return approximately φ_1 = 0.4, φ_2 = 1, φ_3 = 0.6.

Figure 3.2: Representation of vision fields

Figure 3.3: Representation of vision fields, with each φ represented by a red dotted line


Figure 3.4: Representation of vision fields, with each φ represented by a red dotted line, and objects inside the vision fields as black boxes

The whole idea behind this approach is that while driving, humans rely almost solely on relative distances to other objects, estimated by sight, to make their driving decisions.
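As an illustration of how such a reading could be computed, the sketch below approximates one φ_i as a ray of fixed maximum range and treats surrounding objects as sampled points; the helper name, the hit tolerance and the point-based object representation are assumptions made for the sketch, not the framework's actual implementation.

import math

def phi_reading(origin, angle, max_range, obstacle_points):
    # origin:          (x, y) position of the ego vehicle
    # angle:           ray direction in radians (vehicle heading + field offset)
    # max_range:       length of the vision field in meters
    # obstacle_points: iterable of (x, y) points sampled on surrounding objects
    # Returns a value in [0, 1]: 1.0 if nothing is hit, 0.0 if an object sits at the origin.
    dx, dy = math.cos(angle), math.sin(angle)
    closest = max_range
    for ox, oy in obstacle_points:
        # project the obstacle point onto the ray
        t = (ox - origin[0]) * dx + (oy - origin[1]) * dy
        if 0.0 <= t <= max_range:
            # perpendicular distance from the ray; count a hit if it is close enough
            perp = abs((oy - origin[1]) * dx - (ox - origin[0]) * dy)
            if perp < 1.0:          # crude hit tolerance in meters (illustrative)
                closest = min(closest, t)
    return closest / max_range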

Algorithm 5 Feature Expectations Algorithm
γ = discount factor
1: procedure GetFeatureExpectations
2:     for v in V do
3:         T ← number of data points in the trajectory of v
4:         µ(π)_v = [0, 0, ..., 0_n, c]
5:         for t in 0, ..., T do
6:             µ(π)_v += γ^t φ(state_t)
7:     µ(π) = (Σ_v µ(π)_v) / |V|
8:     return µ(π)
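A direct Python transcription of Algorithm 5, assuming trajectories are stored per vehicle and that φ returns a NumPy feature vector (the container layout is an assumption of the sketch), could look as follows:

import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    # trajectories: dict mapping vehicle id -> list of states (one per data point)
    # phi:          function mapping a state to a feature vector (np.ndarray)
    # gamma:        discount factor
    per_vehicle = []
    for states in trajectories.values():
        mu_v = np.zeros_like(phi(states[0]), dtype=float)
        for t, state in enumerate(states):
            mu_v += (gamma ** t) * phi(state)   # discounted feature sum, Eq. (3.3)
        per_vehicle.append(mu_v)
    return np.mean(per_vehicle, axis=0)         # average over the |V| vehicles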

3.4.1 Vehicle Classification

This section explains how the framework uses K-means (explained in Section 2.4) to assign classes to vehicles in the trajectory data, and the intuitive motivation behind using those classes.

In the NGSIM dataset [4] these classes are already available; therefore the vision field sizes are based on those predefined classes' average values, outlined in Table 3.1.


Vehicle class   | Length [m] | Width [m]
Motorcycle      | 2.3        | 1.2
Passenger cars  | 4.5        | 2.0
Truck/Bus       | 12.8       | 2.5

Table 3.1: Vehicle class average values

Depending on the type of vehicle we drive, both the vision range and the driving behavior change. In trucks, for example, the driver has access to bigger mirrors and sits higher up than if that driver was driving a Honda Civic, resulting in a bigger field of vision, and if that driver was driving a motorcycle the vision range would be even smaller.

This carries over to the different driving behaviors based on the vehicle type being driven. A truck driver won't accelerate and maneuver in the same fashion as a regular car or a motorcycle driver would, and vice versa. These differences result in different reactions to the same vision readings between vehicle classes. To tackle these differences we make the assumption that it is enough to learn three different policies (π) depending on the vehicle class, which we assume are motorcycles, passenger cars and trucks/busses.

In Figures 3.5, 3.6 and 3.7, examples of vehicle vision ranges based on their values in the NGSIM dataset [4] are shown.

Figure 3.5: Representation of passenger car vision fields


Figure 3.6: Representation of motorcycle vision fields

Figure 3.7: Representation of truck/bus vision fields

If the framework is provided with trajectory data where these classes aren't already available, it uses K-means (explained in Section 2.4) to sort the trajectory data into vehicle classes. The metrics it uses to sort the trajectory data into vehicle classes are each vehicle's estimated length and width.
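A minimal sketch of this clustering step, assuming scikit-learn is available and that one length/width estimate per vehicle has been computed; relabeling the clusters by mean length is an illustrative choice, not necessarily the framework's.

import numpy as np
from sklearn.cluster import KMeans

def assign_vehicle_classes(lengths, widths, n_classes=3, seed=0):
    # lengths, widths: 1-D arrays with one estimated value per vehicle [m]
    # Returns an integer class label per vehicle, ordered by increasing mean length
    # (e.g. 0 = motorcycle, 1 = passenger car, 2 = truck/bus).
    features = np.column_stack([lengths, widths])
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(features)
    order = np.argsort(km.cluster_centers_[:, 0])          # sort clusters by mean length
    remap = {old: new for new, old in enumerate(order)}
    return np.array([remap[c] for c in km.labels_])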


3.5 Inverse Reinforcement Learning Module

This section introduces the inverse reinforcement learning (IRL) module, how it works and what the IRL module outputs.

The IRL algorithm this framework uses was created by Abbeel and Ng [1] and is outlined in Algorithm 6. The IRL module learns the reward function that explains what drives the behavior observed in the data; IRL in general tries to recover the underlying reward function that explains some observed behaviour. The features φ_i, i = 1, ..., n, where n is the number of features, are the features we extract from each observed state; what these features represent is explained in Section 3.4. s_t is defined as the state at time t, φ is a function that maps a state to features, and φ(s_t) contains all the feature values at the observed data point in the trajectory data:

\phi(s_t) = (\phi_1, \phi_2, \ldots, \phi_n). \qquad (3.5)

The feature expectation µ(π) of a policy π is the sum of the discounted feature values φ(s_t):

\mu(\pi) = \sum_{t=0}^{\infty} \gamma^t \phi(s_t), \qquad (3.6)

where γ ∈ (0, 1) is a discount factor. The assumption is that these features, weighted by some weight vector w, correspond to the underlying reward function that the policy π is trying to maximize. The reward at each state s_t is a linear combination of these feature values (a is the action the agent took, and s′ is the state reached from s by taking the action a):

r(s, a, s') = w_i^T \phi(s_t). \qquad (3.7)

The objective of the IRL module is to find the weight vector w. To initialize the algorithm, the module is provided with the expert's feature expectations µ(π_Expert), obtained from the observed behavior of vehicles in the trajectory data. To determine µ(π_Expert), the framework uses a subset of N vehicles and, over some defined sample horizon T, performs an empirical estimate as explained in Equation 3.3 to get each vehicle's feature expectations. µ(π_Expert) is then the average of those N vehicles' feature expectations. The other parameter needed to initialize the algorithm is the feature expectation µ(π_0) of a random policy. A random policy π_0 is “rolled out” (as explained in Section 2.2) and an empirical estimate (as explained in Equation 3.3) is performed on the feature values φ(s_t) observed at each data point, over the same sample horizon T as when obtaining µ(π_Expert).


Now the IRL module performs a max margin optimization step to find the maximal margin separating hyperplane between the two feature vectors µ(π_Expert) and µ(π_0):

\begin{aligned}
\underset{w}{\text{minimize}} \quad & \|w\|_2 \\
\text{subject to} \quad & \|w\|_2 \leq 1, \\
& w^T \mu_{Expert} \geq 1, \\
& w^T \mu_{\pi(i)} \leq -1.
\end{aligned} \qquad (3.8)

To solve this optimization problem, quadratic programming is used. The IRL module finds the unit weight vector normal to the maximal margin hyperplane that separates the feature expectations obtained from the data, µ(π_Expert), from the non-expert feature expectations. Intuitively speaking, the weight vector “guides” how the µ(π_0) features should change to get closer to the values of µ(π_Expert).
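As a sketch of this optimization step, the quadratic program can be posed with a generic convex solver. The version below follows the constraints as stated in Algorithm 6 (margin constraints followed by normalization) and uses cvxpy purely to illustrate the QP; it is not the thesis' actual implementation.

import numpy as np
import cvxpy as cp

def max_margin_weights(mu_expert, mu_non_expert):
    # mu_expert:     feature expectation vector of the expert behavior
    # mu_non_expert: list of feature expectation vectors of previously found policies
    # Returns a unit-norm weight vector to be used as reward weights.
    n = mu_expert.shape[0]
    w = cp.Variable(n)
    constraints = [w @ mu_expert >= 1]                    # expert side of the margin
    constraints += [w @ mu_pi <= -1 for mu_pi in mu_non_expert]   # non-expert side
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    w_val = np.asarray(w.value).ravel()
    return w_val / np.linalg.norm(w_val)                  # normalize, as in Algorithm 6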

The IRL module then outputs this new weight vector w_0. This weight vector w_0 is then used as the reward function R for the reinforcement learning module, which uses it to learn a new policy π.

This new policy π_0 is then “rolled out” and its expected feature values estimated as previously explained in Section 3.3, resulting in a new feature vector µ_π(0). The distance t of this new feature vector µ_π(0) to the expert feature expectation µ(π_Expert) is then calculated as t = w_0^T(µ_Expert − µ_π(0)).

The distance t is then compared with the framework's predefined termination threshold ε. If the condition t ≤ ε is fulfilled, the IRL module outputs a policy π that it deems close enough to the behavior observed in the provided trajectory data. Otherwise, the new feature expectation µ_π(0) is added to the IRL module's list of non-expert feature expectations and another iteration is carried out, starting from the max margin optimization step.

Algorithm 6 Inverse Reinforcement Learning
Initialize i = 0, µ(π_E), µ(π_0) and convergence constant ε
repeat
    w_i = argmin_w ||w||_2 such that w^T µ_Expert ≥ 1 and w^T µ_π(i) ≤ −1
    Normalize: w_i = w_i / ||w_i||
    Get a new policy π_i from the RL module with w_i as the reward function
    Use the new π_i to estimate a new feature expectation µ_π(i)
    Calculate the distance between µ_π(i) and µ_E: t = w_i^T (µ_E − µ_π(i))
    if t > ε then i ← i + 1
until t ≤ ε


3.6 Reinforcement Learning Module

This module takes in the reward function from the IRL module and learns a decision making rule (π) based on feedback from the reward function.

3.6.1 Action Space

Our action space consists of either one or two continuous actions:

steering and/or acceleration, both of which are represented by a single-unit hyperbolic tangent activation function

\tanh(x) = \frac{2}{1 + e^{-2x}} - 1, \qquad (3.9)

which can take values in the interval [-1, 1], where a value of -1 means maximum left turn and +1 means maximum right turn. These maximum values are estimated from the provided dataset, while also taking into consideration the limitations of our simple car model, introduced in Section 3.6.3.

In the acceleration case, the tanh(x) function is also used. A value of −1 results in maximum deceleration while a value of +1 results in maximum acceleration. These maximum values are estimated from the provided dataset, see Table 3.2.

Quantity      | Unit      | Range
Acceleration  | m s^{-2}  | [-3.4, 3.4]
Steering      | rad       | [-π/4, π/4]

Table 3.2: Values estimated from NGSIM [4]
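The mapping from the tanh outputs in [-1, 1] to physical commands is then a simple rescaling by the ranges in Table 3.2; a minimal sketch (function name illustrative):

import math

# ranges estimated from the NGSIM data (Table 3.2)
MAX_ACCEL = 3.4            # m/s^2
MAX_STEER = math.pi / 4    # rad

def scale_action(tanh_steer, tanh_accel):
    # Map tanh outputs in [-1, 1] to physical steering [rad] and acceleration [m/s^2].
    return tanh_steer * MAX_STEER, tanh_accel * MAX_ACCEL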

3.6.2 Environment

The environment is created based on the vehicle trajectory data the framework initially receives. The data is preprocessed (as described in Section 3.1.1) and the data points are sorted by a global time. Each vehicle is then assigned a subset environment of the entire dataset, bounded by a generic sensor setup for autonomous driving selected for this thesis: a sensor range of 150 meters to the front and back and 70 meters to the sides. Examples of these "micro" environments can be seen in Figures 3.8, 3.9 and 3.10. In these figures, the boundary of the micro environment is represented as an orange box, the vehicle which the micro environment corresponds to as a red box, the vehicle's vision fields are depicted as blue cones, while other cars in the environment are shown as green boxes.

The duration of a single "step" in the environment is defined as ∆t. After each action a the agent takes, the states of the other vehicles in the agent's environment are updated. This update is performed using information about their location at the current global time in the environment incremented by the defined step duration ∆t. The agent receives a terminal state from the environment if it "crashes" into a vehicle in the environment or into the boundaries of its "micro" environment, shown as the orange box in Figures 3.8, 3.9 and 3.10. The agent will also receive a terminal state if it starts going in the opposite direction.

Figure 3.8: Representation of a motorcycle environment; the orange box indicates the boundaries for that particular "micro" environment

Page 46: Predicting vehicle trajectories with inverse reinforcement ...1366887/FULLTEXT01.pdf · Predicting vehicle trajectories with inverse reinforcement learning BJARTUR HJALTASON KTH ROYAL

34 CHAPTER 3. METHODS

Figure 3.9: Representation of a passenger car environment

Figure 3.10: Representation of a truck/bus environment

3.6.3 Transition Function

The transition function provides the vehicle's next state depending on the taken action. This transition function is provided in the form of the simple kinematic car model shown in Figure 3.11.

The parameters used by the kinematic car model are the vehicle length L, heading θ, steering φ, velocity v, a small time interval ∆t and an action vector a, where a_v is the provided velocity and a_φ the provided steering angle. At each ∆t the model updates each parameter by the following transition configuration [43]:

\begin{aligned}
\dot{x} &= a_v \cos(\theta) \\
\dot{y} &= a_v \sin(\theta) \\
\dot{\theta} &= \frac{a_v}{L} \tan(a_\phi)
\end{aligned} \qquad (3.10)

Since the framework uses acceleration in the action space and not velocity, which the transition configuration relies on, the model is provided with an initial velocity in the beginning. The velocity is then increased or decreased depending on the acceleration value returned by the action vector at each ∆t.

Figure 3.11: Depiction of our simple car model [43]
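A minimal Euler-integration sketch of this update, under the assumptions above (acceleration first integrated into velocity, then the kinematic update of Equation 3.10; names are illustrative):

import math

def kinematic_step(x, y, theta, v, accel, steer, L, dt):
    # x, y, theta, v: current position [m], heading [rad] and speed [m/s]
    # accel, steer:   acceleration [m/s^2] and steering angle [rad] from the action vector
    # L:              vehicle length [m], dt: step duration [s]
    v = v + accel * dt                             # integrate acceleration into velocity
    x = x + v * math.cos(theta) * dt
    y = y + v * math.sin(theta) * dt
    theta = theta + (v / L) * math.tan(steer) * dt # kinematic car heading update
    return x, y, theta, v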

The state space S of the reinforcement learning (RL) module consists of the same features as the inverse reinforcement learning (IRL) module uses, explained in Section 3.4. In short, each state is represented by the values returned from the functions φ_i.

The reward function R is defined by the weight vector that the IRL module (described in Section 3.5) outputs. The reward function R indicates which φ_i values are favourable. The RL module then uses this reward function R to learn a policy π. This policy π learns what action a to take given certain φ_i values.


3.6.4 Deep Deterministic Policy Gradient

The reinforcement learning algorithm which drives our reinforcement learning module is the Deep Deterministic Policy Gradient (DDPG) algorithm, created by Lillicrap et al. [44]. The DDPG algorithm is outlined in Algorithm 7; information on what each symbol means can be found in the Summary of Notations.

DDPG uses the actor critic architecture previously mentioned in Section 2.2.5. A replay buffer stores information from experiences which occur during training; a subset of past experiences is then sampled from the replay buffer to train on. DDPG also uses target networks to improve stability while training.

The actor is a neural network which learns a deterministic policy. A deterministic policy directly parametrizes the policy of taking a particular action in a particular state. This means that for a single set of feature values φ_i (as explained in Section 3.4) it will output a single action

\pi_\theta(s, a) = P[a \mid s, \theta], \qquad (3.11)

where s is the state, a is the action and θ are the model parameters of the neural network which represents the actor.

The framework learns the policy objective function by learning a state-action mapping to the reward at the next state, Q(s_t, a_t) = R_{t+1}. Since the policy is deterministic, a = µ(s) (for a particular state the output is a single action). The actor's/policy gradient can be written as:

\frac{\partial L(\theta)}{\partial \theta} = \mathbb{E}_{x \sim p(x|\theta)}\left[\frac{\partial Q}{\partial \theta}\right]. \qquad (3.12)

By applying the chain rule the update to the actor becomes:

\frac{\partial L(\theta)}{\partial \theta} = \mathbb{E}_{x \sim p(x|\theta)}\left[\frac{\partial Q_\theta(s, a)}{\partial a}\frac{\partial a}{\partial \theta}\right]. \qquad (3.13)

The breakthrough in the work by Silver et al. [45] was the proof that this is the policy gradient, i.e. the maximum expected reward will be obtained as long as the model's parameters are updated following this gradient. The deterministic policy gradient was then expanded upon by Lillicrap et al. to create the DDPG algorithm [44].

The actor is the previously explained policy function updated by the gradient. The critic is the value function in the form of a Q-function represented by a neural network, e.g. Q^π(s, a) ≈ Q(s, a, w), where w are the neural network weights/parameters. The deep deterministic policy gradient formula then becomes:

\frac{\partial L(\theta)}{\partial \theta} = \frac{\partial Q(s, a, w)}{\partial a}\frac{\partial a}{\partial \theta}, \qquad (3.14)

where the actor/policy parameters θ can be updated via stochastic gradient ascent. To update the weights in the Q-function, the loss function is defined as:

Loss = [r + \gamma Q(s', a') - Q(s, a)]^2, \qquad (3.15)

which can then be updated via stochastic gradient descent. DDPG uses a total of four neural networks: a Q network, a deterministic policy network, a target Q network and a target policy network:

θ^Q: Q network (our critic)
θ^µ: deterministic policy network (our actor)
θ^{Q'}: target Q network
θ^{µ'}: target policy network

The reasoning behind using target networks is to improve stability in learning by means of soft updates towards their non-target counterparts. The value network is updated as previously shown, similar to Q-learning but with a twist: the target value and target policy networks are used to calculate the next-state Q values,

y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'}). \qquad (3.16)

Then, the mean squared loss between the updated Q value and the original Q value is minimized, where it is important to note that the original Q value is calculated with the value network, not the target value network (the Q in the loss):

Loss = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i|\theta^Q)\right)^2 \qquad (3.17)

The final component that makes DDPG work is the design of an exploration algorithm. Previous exploration techniques used for discrete domains, where the agent is forced to try a random action some percentage of the time, make sense when it is possible to count the number of actions at each state on one hand, but they do not scale well to a continuous action space.

In the original paper [44] this is done by adding noise to the actions with an Ornstein-Uhlenbeck process, a stochastic process with mean-reverting properties (a curious reader can read more about it in [46]). It is essentially the same as adding Gaussian noise to the action, except that the process returns samples that move closer to its mean over time.
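A minimal sketch of such an Ornstein-Uhlenbeck noise process is shown below; the parameter values are illustrative defaults and not the ones used in this thesis.

import numpy as np

class OUNoise:
    # Ornstein-Uhlenbeck exploration noise: mean-reverting drift plus Gaussian diffusion.
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.1):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(dim, mu, dtype=float)

    def sample(self):
        drift = self.theta * (self.mu - self.state) * self.dt          # pull towards mu
        diffusion = self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state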

Algorithm 7 DDPG Algorithm
Randomly initialize critic network Q(s, a|θ^Q) and actor µ(s|θ^µ) with weights θ^Q, θ^µ
Initialize target networks Q' and µ' with weights θ^{Q'} ← θ^Q, θ^{µ'} ← θ^µ
Initialize replay buffer R
for episode = 1, ..., M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1, ..., T do
        Select action a_t = µ(s_t|θ^µ) + N_t according to the current policy and exploration noise
        Execute action a_t, observe reward r_t and new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γ Q'(s_{i+1}, µ'(s_{i+1}|θ^{µ'})|θ^{Q'})
        Update the critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))^2
        Update the actor policy using the sampled policy gradient:
            ∇_{θ^µ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=µ(s_i)} ∇_{θ^µ} µ(s|θ^µ)|_{s_i}
        Update the target networks:
            θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
            θ^{µ'} ← τ θ^µ + (1 − τ) θ^{µ'}
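To make the last two update rules of Algorithm 7 concrete, the sketch below shows the target computation (Equation 3.16) and the soft target update in PyTorch-style code; it assumes the actor and critic are torch.nn.Module instances and that the critic takes (state, action) pairs, which is an assumption of the sketch rather than the thesis' exact implementation (terminal-state masking is omitted for brevity).

import torch

def critic_targets(batch, target_actor, target_critic, gamma=0.99):
    # Compute y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for a minibatch (Eq. 3.16).
    states, actions, rewards, next_states = batch
    with torch.no_grad():
        next_actions = target_actor(next_states)
        next_q = target_critic(next_states, next_actions)
        return rewards + gamma * next_q

def soft_update(target_net, net, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta' for every parameter pair.
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)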

3.7 Framework Setup and Training

3.7.1 General Information

From the NGSIM dataset [4] a time interval of 400 frames is sampled, which accounts for 40 seconds of trajectory data from n different vehicles. For all of the 400 frames, for each vehicle, information about its surroundings 150 meters in front and behind and 70 meters to the sides (mimicking the field of view selected in this thesis) is collected. A vehicle and its surroundings can be seen in Figures 3.8, 3.9 and 3.10. This results in a vehicle trajectory and a corresponding "micro" environment at each time step for every one of the n vehicles.

Using these n trajectories and their corresponding micro environments, the feature extraction module gets the n expected feature values, which it then uses to compute an empirical estimate µ(π_E).

During the feature extraction process the framework records an upper and lower bound for each of the feature values. To get the random policy features (instead of running a random policy in the environment, which proved to be impractical in the continuous action space), the framework samples a value uniformly between the upper and lower bound of each feature to get the initial random feature estimate µ_π(0).

The framework then uses the estimates µ(π_E) and µ_π(0) to get the initial reward function and uses it to train the reinforcement learning module. It trains for 500 episodes, each with a maximum length of 100 steps. The environment is changed every 250 episodes: the agent is placed in another micro environment and starts in the place of another vehicle than initially, in the hope that this process adds to the generalization capabilities of the framework.

After the reinforcement learning module finishes training, the framework selects the same randomly selected e micro environments the agent encountered while training the reinforcement module. For each of those e micro environments it rolls out the learnt policy π for the sample duration used to get the initial expected feature values (here 400 frames) and collects the feature values φ(s_1), ..., φ(s_400) the agent obtains in each of the e environments. Then an empirical estimate is performed on all collected features. This estimate is compared to the initial one obtained from the data (µ(π_E)); if it is close enough the framework concludes training, otherwise it repeats the procedure after obtaining a new reward function from another iteration through the inverse reinforcement learning module (Section 3.5).

3.7.2 Model Architecture

The actor model is a neural network with two fully connected hidden layers of size 400 and 300; tanh is used in the final layer that maps states to actions. The critic model is similar to the actor model, except that the final layer is a fully connected layer that maps states and actions to Q-values. The activation function all the hidden layers use is the rectified linear unit (ReLU), defined as f(x) = max(0, x).
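A minimal PyTorch sketch of the described actor and critic architectures is given below; the layer sizes and activations follow the text, while the point where the critic concatenates state and action is an assumption of the sketch.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Two hidden layers (400, 300), ReLU activations, tanh output in [-1, 1].
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Same hidden sizes; final layer maps (state, action) to a single Q-value.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))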

3.7.3 Hyper-parameters

The learning rate for the actor is 0.0001 and for the critic 0.001, with the soft update rate of the target networks set to 0.01. The discount factor γ is 0.99. The replay buffer has a size of 10000 and the batch size is 32.


Chapter 4

Results and Experiments

We compare our results with the following models:

• Constant Velocity (CV): A baseline which predicts trajectories at time t using constant velocity and zero steering [47].

• VIZ-IRL-X: The proposed method, where X indicates the number of vehicles used to estimate the expert feature behavior, as explained in Section 3.3.

• CVIZ-IRL-X: A constrained version of the proposed method (VIZ-IRL-X), where only the front vision field is used, but normalized velocity is added as a feature to the features explained in Section 3.4 and acceleration is added to the action space (as explained in Section 3.6.1).

In a highway traffic situation (such as in the NGSIM dataset), the majority of vehicles maintain the same heading and velocity for long periods of time. For this reason the CV model was selected as a baseline, since the CV model assumes that vehicles maintain the heading and velocity they are initially provided with.
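For reference, the CV baseline amounts to propagating the last observed position with a fixed velocity; a minimal sketch (the names and the fixed time step are illustrative):

import numpy as np

def constant_velocity_prediction(x0, y0, vx, vy, horizon, dt=0.1):
    # Propagate the initial position (x0, y0) with a fixed velocity (vx, vy) and zero steering.
    # horizon: prediction horizon in seconds, dt: time step in seconds.
    steps = int(round(horizon / dt))
    t = np.arange(1, steps + 1) * dt
    return np.column_stack([x0 + vx * t, y0 + vy * t])   # predicted (x, y) positions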

The framework's performance is tested by evaluating the prediction accuracy of the trajectories of a group of K vehicles, with K = 10, 25. Every table in the following sections shares the following conventions: the time horizon t, i.e. how far into the future the prediction is made, is given in each table's column header; the model whose predictions are reported is given in each row; and the value in each table cell is the average RMSE (as defined in Equation 3.2) over the K predicted vehicles. When testing the prediction accuracy of the models, we do not take into account whether the prediction would result in a crash.


4.1 Results on the NGSIM Dataset

In the sections that follow, the prediction results for each vehicle class are introduced.

4.1.1 Results for the Passenger Car Agent

The results in Table 4.1 show that neither VIZ-IRL-10 nor CVIZ-IRL-10 achieves a lower RMSE than the CV model for the prediction horizons of 1-5 seconds, indicated in each column header of the table. The CVIZ-IRL-10 model has a notably higher RMSE than VIZ-IRL-10. In Figures 4.1 and 4.2, examples of trajectory prediction results for the passenger car agent are shown, where the red color is the ground truth, the orange color is the constant velocity prediction and green is the VIZ-IRL-10 prediction. In these two examples it can be seen that the predictions of VIZ-IRL-10 and the CV model are close to each other in terms of RMSE to the ground truth, indicated with a red dotted line in the figures.

t/method     | 1     | 2     | 3     | 4     | 5
VIZ-IRL-10   | 0.425 | 1.245 | 2.065 | 2.907 | 3.967
CVIZ-IRL-10  | 0.551 | 2.001 | 3.904 | 4.754 | 6.867
CV           | 0.193 | 0.700 | 1.076 | 1.445 | 1.791

Table 4.1: Prediction results for the passenger car agent, for time horizons of 1-5 seconds and K = 10


Figure 4.1: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.

Figure 4.2: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.


It is evident from the results in Tables 4.2 and 4.3 that the VIZ-IRL-100 model, which uses more vehicles to estimate the expert feature expectations, has better generalization capabilities when it comes to predictions further into the future. This is most visible when looking at prediction horizons above 5 seconds. This increased generalization effect does not carry over to the prediction accuracy of the CVIZ-IRL-100 model. The most notable results are in Table 4.3, where the VIZ-IRL-100 model has a lower RMSE (as defined in Equation 3.2) than the CV model when predicting further than 6 seconds into the future with K = 25. Examples of single trajectory predictions for the VIZ-IRL-100 model can be seen in Figures 4.3, 4.4 and 4.5.

t/method      | 1      | 3      | 5      | 6
VIZ-IRL-100   | 0.448  | 1.783  | 3.035  | 3.492
CVIZ-IRL-100  | 1.036  | 3.9234 | 7.653  | 9.850
CV            | 0.193  | 1.076  | 1.791  | 2.092

t/method      | 7      | 8      | 9      | 10
VIZ-IRL-100   | 3.850  | 4.185  | 4.549  | 5.009
CVIZ-IRL-100  | 12.368 | 15.179 | 18.087 | 20.921
CV            | 2.390  | 2.699  | 3.011  | 3.410

Table 4.2: Prediction results for the passenger car agent, for time horizons of 1-10 seconds and K = 10

t/method      | 1      | 3      | 5      | 6
VIZ-IRL-100   | 0.750  | 2.187  | 3.185  | 3.7920
CVIZ-IRL-100  | 1.603  | 6.106  | 11.351 | 14.027
CV            | 0.188  | 1.086  | 2.569  | 3.511

t/method      | 7      | 8      | 9      | 10
VIZ-IRL-100   | 4.592  | 5.550  | 6.602  | 7.696
CVIZ-IRL-100  | 16.852 | 19.837 | 22.900 | 23.924
CV            | 4.586  | 5.759  | 6.998  | 8.281

Table 4.3: Prediction results for the passenger car agent, for time horizons of 1-10 seconds and K = 25


Figure 4.3: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-100 prediction.

Figure 4.4: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-100 prediction.


Figure 4.5: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-100 prediction.

From Tables 4.1, 4.2 and 4.3 it can be seen that the model using the constrained vision fields with the added acceleration action space (CVIZ-IRL-X) has worse prediction accuracy than both the CV model and the VIZ-IRL-X model.

4.1.2 Results for the Motorcycle Agent

From Table 4.4, the results for the motorcycle agent appear far worse in terms of average RMSE than for the passenger car agent. Investigating individual motorcycle agent predictions, it is clear that the model suffers from outliers that affect the average prediction accuracy. An example of such an outlier is shown in Figure 4.6. In Figure 4.7 another prediction by the same model is shown, where it beats the CV model baseline and predicts an edge case occurrence of a lane change in the beginning.


t/method     | 1     | 2     | 3     | 4     | 5
VIZ-IRL-10   | 1.114 | 3.076 | 5.998 | 9.426 | 13.023
CVIZ-IRL-10  | 1.456 | 4.144 | 7.039 | 9.511 | 12.067
CV           | 0.482 | 1.233 | 1.871 | 2.609 | 3.503

Table 4.4: Prediction results for the motorcycle agent, for time horizons of 1-5 seconds and K = 10

Figure 4.6: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.


Figure 4.7: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.

More examples of predictions by the motorcycle agent are shown in Figures 4.8 and 4.9.


Figure 4.8: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.

Figure 4.9: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.


4.1.3 Results for the Truck/Bus Vehicle Type Agent

From Table 4.5 it seems that the vision field features are most descriptive when it comes to trucks and busses. The VIZ-IRL-10 prediction accuracies are on par with the CV model. Examples of predictions from VIZ-IRL-10 can be seen in Figures 4.10 and 4.11. However, as with the previous results, the constrained vision fields with the added acceleration action space have worse prediction accuracy. The prediction results for all of the vehicle type agents have in common that the further into the future we predict, the more the error increases.

t/method     | 1     | 2     | 3     | 4     | 5
VIZ-IRL-10   | 0.279 | 0.674 | 1.037 | 1.446 | 1.940
CVIZ-IRL-10  | 0.548 | 1.797 | 3.360 | 4.955 | 6.620
CV           | 0.229 | 0.574 | 0.942 | 1.312 | 1.664

Table 4.5: Prediction results for the truck/bus agent, for time horizons of 1-5 seconds and K = 10

Figure 4.10: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.


Figure 4.11: An example of a single trajectory prediction; the red color indicates the ground truth, the orange color the CV model prediction and green the VIZ-IRL-10 prediction.


Chapter 5

Discussion and Conclusions

From the results shown in the previous chapter, it is evident that the framework only outperforms the constant velocity model when predicting further than six seconds into the future, as Table 4.3 shows. This is only true for the framework variant that uses constant velocity; the framework variant that controls its acceleration always performs worse, as the results in Tables 4.1, 4.4 and 4.5 clearly show. Although on average the framework's constant velocity variant performs worse than the constant velocity model, it delivers a somewhat decent prediction accuracy and in some cases manages to capture behavior that the constant velocity model cannot, such as lane changes. There could be a multitude of reasons for these results. Fundamentally, it appears that drivers do not share the same preferences when it comes to distances to objects inside their vision fields. Comparing the prediction errors of the different vehicle classes, the preferred vision field distances seem to be more general between truck/bus drivers and less general between different motorcycle drivers (as the outlier in Figure 4.6 shows). As previously mentioned, the vision fields do not seem to generalize between drivers, at least in the way the framework interprets the function φ_i (explained in Section 3.4) in its current state, as it tries to map the preferences to scalar values. Perhaps by using a method such as the deep max entropy IRL variant mentioned in Section 2.3 the framework would be able to find a better generalization of the vision field preferences, and as a result increase the prediction accuracy, since deep max entropy IRL maps the vision field values to a distribution rather than to the scalar values of an empirical estimate, as our current IRL module does. This could be especially beneficial for the prediction accuracy of the motorcycle vehicle class.


As much as mapping the current feature values (the vision field readings) to a distribution could help increase the prediction accuracy, the first thing to investigate further would be less specific and more general features. Something as simple as having just the front vision field and the distances to the closest objects to the sides of the vehicle, along with the distance to the closest road centerline and lane boundaries, could outperform the prediction accuracy of the vision field scheme. There also appears to be a need for more descriptive features that establish the connection between the distance to other objects and the velocity of the vehicle being predicted, instead of just adding normalized velocity to the feature space, in order to enable the framework to train an agent that can reliably output both acceleration and steering for prediction purposes. Which features would provide the agent with the tools it needs to learn to control its velocity in a fashion similar to the NGSIM dataset is something that requires further investigation; one idea is to also include the brake pedal as a separate action in the action space. In the framework, the number of tunable factors is so high that it is next to impossible to get the optimal performance out of it, everything from each individual neural network's hyperparameters to the number of episodes the reinforcement learning module is trained for. Given that we only train for 500 episodes in each iteration of the reinforcement module, the framework would most likely benefit from an increased number of episodes per iteration. But given the hardware accessible to us during this thesis and the training time for each iteration, which ranged from 8 to 12 hours (and, because of the IRL module, around three iterations were needed to return a decent reward function), that was not a possibility we could investigate.

As much as the "micro" environments sped up the training process and made this thesis project possible, a lot of contextual information about the environment is lost. Throughout the training process, especially when the agent had control of its acceleration as well as its steering, the agent quite often simply drove to the outskirts of its "micro" environment to maximize its reward. In an optimal setting we would be able to keep all the environment information at the agent's disposal while training, but the hardware needed for such an experiment simply was not available.

Something also worth noting is that the car model used as the agent's transition function is a simple kinematic model, so using a more descriptive car model in the framework's transition function that is closer to real car behaviour, such as the dynamic bicycle model [48], which among other things takes inertia into account, could be beneficial to the prediction accuracy.

It should not be understated that this framework has one major advantage over other common trajectory prediction methods: it does not need any previous trajectory history to base its predictions on. The state of the art methods, which are mostly built on RNNs, rely on anything from 1 to 3 seconds of previous trajectory data to predict the next 1-3 seconds into the future, and a lot can happen in the 1-3 seconds that those frameworks use as a base for their predictions. However, the framework's reinforcement module may benefit from adding an RNN to its actor and critic networks to help with the prediction accuracy; by doing so it would, similarly to other methods built on RNNs, need to observe a couple of seconds of a vehicle's trajectory to predict its future, but the effectiveness of adding an RNN is something that is worth investigating.

One of the framework's biggest strengths is its modularity. Although the results are based only on predicting vehicles' future coordinates, the framework uses the vehicle model (Equation 3.10) and can therefore also predict the heading and velocity of vehicles, since they are system variables. Despite the results, we believe this framework could achieve better results with a more general feature space and more training time. The foundation is there; now it just needs to be expanded upon.


Chapter 6

Future Work

Based on the conclusions drawn from this study, some future research problems have been identified. As previously mentioned, a different feature extraction process which aims to better capture driving behaviour, or an extension of the current model with additional features such as distances to lane boundaries, could be developed to study whether this influences the prediction results. Another future study could extend the current framework to use RNNs in the framework's actor and critic networks, instead of just fully connected layers in the framework's reinforcement module, since RNNs can use previous trajectory history to aid in predicting future trajectories, and study their impact on the prediction accuracy. Another extension that could be investigated is using deep max entropy IRL as the framework's IRL module instead of the current max margin method, as mentioned in Section 2.3, since mapping a distribution to the features instead of scalar values could be beneficial in terms of generalizing better to different driving scenarios. There could also be value, in terms of prediction accuracy, in evaluating different deep reinforcement learning techniques such as Soft Actor-Critic [49], and in trying different vehicle models that are closer to real car behavior, such as the dynamic bicycle model [48].


Bibliography

[1] Pieter Abbeel and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning". In: Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 1.

[2] Ariel Rosenfeld and Sarit Kraus. "Predicting human decision-making: From prediction to action". In: Synthesis Lectures on Artificial Intelligence and Machine Learning 12.1 (2018), pp. 1–150.

[3] Andrew Y. Ng, Stuart J. Russell, et al. "Algorithms for inverse reinforcement learning". In: ICML. Vol. 1. 2000, p. 2.

[4] J. Colyar and J. Halkias. "US highway dataset". In: Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030 (2007).

[5] Wikipedia. List of countries by traffic-related death rate. https://en.wikipedia.org/wiki/List_of_countries_by_traffic-related_death_rate. Article. 2019.

[6] Science Alert. Driverless Cars Could Reduce Traffic Fatalities by Up to 90%, Says Report. https://www.sciencealert.com/driverless-cars-could-reduce-traffic-fatalities-by-up-to-90-says-report. Article. 2015.

[7] Judith Jarvis Thomson. "The trolley problem". In: Yale LJ 94 (1984), p. 1395.

[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.

[9] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[10] Stanford. Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/. Course. 2019.


[11] Marcel van Gerven and Sander Bohte. Artificial neural networks as models of neural information processing. Frontiers Media SA, 2018.

[12] Martin T. Hagan et al. Neural network design. Vol. 20. PWS Pub., Boston, 1996.

[13] Alexandre Proutiere. Lecture notes in Reinforcement Learning. Jan. 2019.

[14] Abhijit Gosavi. "Reinforcement learning: A tutorial survey and recent advances". In: INFORMS Journal on Computing 21.2 (2009), pp. 178–192.

[15] Marco Corazza and Andrea Sangalli. "Q-Learning and SARSA: a comparison between two intelligent stochastic control approaches for financial trading". In: University Ca' Foscari of Venice, Dept. of Economics Research Paper Series No. 15 (2015).

[16] Volodymyr Mnih et al. "Playing Atari with deep reinforcement learning". In: arXiv preprint arXiv:1312.5602 (2013).

[17] John Schulman et al. "Trust region policy optimization". In: International Conference on Machine Learning. 2015, pp. 1889–1897.

[18] K. Arulkumaran et al. "Deep Reinforcement Learning: A Brief Survey". In: IEEE Signal Processing Magazine 34.6 (Nov. 2017), pp. 26–38. issn: 1053-5888. doi: 10.1109/MSP.2017.2743240.

[19] John Schulman et al. "Proximal policy optimization algorithms". In: arXiv preprint arXiv:1707.06347 (2017).

[20] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[21] Tapas Kanungo et al. "An efficient k-means clustering algorithm: Analysis and implementation". In: IEEE Transactions on Pattern Analysis & Machine Intelligence 7 (2002), pp. 881–892.

[22] Lawrence R. Rabiner and Biing-Hwang Juang. "An introduction to hidden Markov models". In: IEEE ASSP Magazine 3.1 (1986), pp. 4–16.

[23] Greg Welch, Gary Bishop, et al. "An introduction to the Kalman filter". In: (1995).

[24] Ning Ye et al. "Vehicle trajectory prediction based on Hidden Markov Model". In: KSII Transactions on Internet & Information Systems 10.7 (2016).


[25] Carole G. Prevost, Andre Desbiens, and Eric Gagnon. "Extended Kalman filter for state estimation and trajectory prediction of a moving object detected by an unmanned aerial vehicle". In: 2007 American Control Conference. IEEE. 2007, pp. 1805–1810.

[26] Christopher Olah. Understanding LSTM Networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/. Blog. 2015.

[27] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. "ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst". In: arXiv preprint arXiv:1812.03079 (2018).

[28] Huynh Manh and Gita Alaghband. "Scene-LSTM: A Model for Human Trajectory Prediction". In: arXiv preprint arXiv:1808.04018 (2018).

[29] Florent Altché and Arnaud de La Fortelle. "An LSTM network for highway trajectory prediction". In: 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE. 2017, pp. 353–359.

[30] Chelsea Finn, Sergey Levine, and Pieter Abbeel. "Guided cost learning: Deep inverse optimal control via policy optimization". In: International Conference on Machine Learning. 2016, pp. 49–58.

[31] Jonathan Ho and Stefano Ermon. "Generative adversarial imitation learning". In: Advances in Neural Information Processing Systems. 2016, pp. 4565–4573.

[32] Saurabh Arora and Prashant Doshi. "A survey of inverse reinforcement learning: Challenges, methods and progress". In: arXiv preprint arXiv:1806.06877 (2018).

[33] Brian D. Ziebart et al. "Maximum entropy inverse reinforcement learning". In: AAAI. Vol. 8. Chicago, IL, USA. 2008, pp. 1433–1438.

[34] Namhoon Lee and Kris M. Kitani. "Predicting wide receiver trajectories in American football". In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE. 2016, pp. 1–9.

[35] Kris M. Kitani et al. "Activity forecasting". In: European Conference on Computer Vision. Springer. 2012, pp. 201–214.

[36] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. "Maximum entropy deep inverse reinforcement learning". In: arXiv preprint arXiv:1507.04888 (2015).


[37] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. "Deep inverse reinforcement learning". In: arXiv preprint arXiv:1507.04888 (2015).

[38] M. Fahad, Z. Chen, and Y. Guo. "Learning How Pedestrians Navigate: A Deep Inverse Reinforcement Learning Approach". In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Oct. 2018, pp. 819–826. doi: 10.1109/IROS.2018.8593438.

[39] Yanfu Zhang et al. "Integrating kinematics and environment context into deep inverse reinforcement learning for predicting off-road vehicle trajectories". In: arXiv preprint arXiv:1810.07225 (2018).

[40] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples". In: arXiv preprint arXiv:1206.4617 (2012).

[41] N. Agmon, M. E. Taylor, E. Elkind, M. Veloso. Learning Trajectory Prediction with Continuous Inverse Optimal Control via Langevin Sampling. http://fei22.cn/document/AAMAS_2019.pdf. Online; accessed January 24, 2019. 2019.

[42] Namhoon Lee et al. "Desire: Distant future prediction in dynamic scenes with interacting agents". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 336–345.

[43] Steven M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

[44] Timothy P. Lillicrap et al. "Continuous control with deep reinforcement learning". In: arXiv preprint arXiv:1509.02971 (2015).

[45] David Silver et al. "Deterministic policy gradient algorithms". In: ICML. 2014.

[46] Wikipedia. Ornstein–Uhlenbeck process. https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process. Article. 2019.

[47] MIT. Constant Velocity Model. https://scripts.mit.edu/~srayyan/PERwiki/index.php?title=Module_3_--_Constant_Velocity_Model. Blog. 2011.

[48] Romain Pepy, Alain Lambert, and Hugues Mounier. "Path planning using a dynamic vehicle model". In: 2006 2nd International Conference on Information & Communication Technologies. Vol. 1. IEEE. 2006, pp. 781–786.


[49] Tuomas Haarnoja et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor". In: arXiv preprint arXiv:1801.01290 (2018).


TRITA EECS-EX-2019:547

www.kth.se