
Survey of Deep Reinforcement Learning for Motion Planning of Autonomous Vehicles

Szilárd Aradi¹

Abstract—Academic research in the field of autonomous vehicles has reached high popularity in recent years, related to several topics such as sensor technologies, V2X communications, safety, security, decision making, control, and even legal and standardization rules. Besides classic control design approaches, Artificial Intelligence and Machine Learning methods are present in almost all of these fields. Another part of the research focuses on different layers of motion planning, such as strategic decisions, trajectory planning, and control. A wide range of techniques in Machine Learning itself has been developed, and this article describes one of these fields, Deep Reinforcement Learning (DRL). The paper provides insight into the hierarchical motion planning problem and describes the basics of DRL. The main elements of designing such a system are the modeling of the environment, the modeling abstractions, the description of the state and the perception models, the appropriate rewarding, and the realization of the underlying neural network. The paper describes vehicle models, simulation possibilities, and computational requirements. Strategic decisions on different layers and the observation models, e.g., continuous and discrete state representations, grid-based, and camera-based solutions, are presented. The paper surveys the state-of-the-art solutions systematized by the different tasks and levels of autonomous driving, such as car following, lane keeping, trajectory following, merging, or driving in dense traffic. Finally, open questions and future challenges are discussed.

Index Terms—Machine Learning, Motion Planning, Autonomous Vehicles, Artificial Intelligence, Reinforcement Learning.

I. INTRODUCTION

Motion planning for autonomous vehicle functions is a vast and long-researched area using a wide variety of approaches such as different optimization techniques, modern control methods, artificial intelligence, and machine learning. This article presents the achievements of the field from recent years, focused on the Deep Reinforcement Learning (DRL) approach. DRL combines classic reinforcement learning with deep neural networks and gained popularity after the breakthrough articles from DeepMind [1], [2]. The number of research papers about autonomous vehicles and DRL has increased in the last few years (see Fig. 1), and because of the complexity of the different motion planning problems, it is a convenient choice to evaluate the applicability of DRL for these problems.

A. The Hierarchical Classification of Motion Planning for Autonomous Driving

Using deep neural networks for self-driving cars gives the possibility to develop "end-to-end" solutions where the

¹ Department of Control for Transportation and Vehicle Systems, Budapest University of Technology and Economics, Hungary. E-mail: [email protected]

Fig. 1: Web of Science topic search for "Deep Reinforcement Learning" and "Autonomous Vehicles" (2020.01.17.)

system operates like a human driver: its inputs are the travel destination, the knowledge about the road network, and various sensor information, and the output is the direct vehicle control commands, e.g., steering, torque, and brake. However, on the one hand, realizing such a scheme is quite complicated, since it needs to handle all layers of the driving task; on the other hand, the system itself behaves like a black box, which raises design and validation problems. By examining the recent advances in the field, it can be said that most research focuses on solving sub-tasks of the hierarchical motion planning problem. This decision-making system of autonomous driving can be decomposed into at least four layers, as stated in [3] (see Fig. 2). Route planning, as the highest level, defines the way-points of the journey based on the map of the road network, with the possibility of using real-time traffic data. Though optimal route choice is of high interest among the research community, papers dealing with this level do not employ reinforcement learning. A comprehensive study on the subject can be found in [4].

The Behavioral layer is the strategic level of autonomous driving. With the given way-points, the agent decides on the short-term policy by taking into consideration the local road topology, the traffic rules, and the perceived state of other traffic participants. Having a finite set of available actions for the driving context, the realization of this layer is usually a finite state-machine having basic strategies in its states (i.e., car following, lane changing, etc.) with well-defined transitions between them based on the change of the environment. However, even with the full knowledge of the current state


Fig. 2: Layers of motion planning

of the traffic, the future intentions of the surrounding drivers are unknown, making the problem partially observable [5]. Hence the future state not only depends on the behavior of the ego vehicle but also relies on unknown processes; this problem forms a Partially Observable Markov Decision Process (POMDP). Different techniques exist to mitigate these effects by predicting the possible trajectories of other road users, like in [6], where the authors used Gaussian mixture models, or in [7], where support vector machines and artificial neural networks were trained based on recorded traffic data. Since finite-action POMDPs are the natural way of modeling reinforcement learning problems, a high number of research papers deal with this level, as can be seen in the sections of this paper. To carry out the strategy defined by the behavioral layer, the motion planning layer needs to design a feasible trajectory consisting of the expected speed, yaw, and position states of the vehicle on a short horizon. Naturally, on this level, the vehicle dynamics has to be considered; hence classic exact solutions of motion planning are impractical since they usually assume holonomic dynamics. It has long been known that the motion planning problem with nonholonomic dynamics is PSPACE-hard [8], meaning it is hard to elaborate an overall solution by solving the nonlinear programming problem in real-time [9]. On the other hand, the output representation of this layer makes it hard to handle directly with "pure" reinforcement learning; only a few papers deal solely with this layer, and they usually use DRL to define splines as a result of the training [10], [11].

At the lowest level, the local feedback control is responsible for minimizing the deviation from the prescribed path or trajectory. A significant number of papers reviewed in this article deal with aspects of this task, where lane keeping, trajectory following, or car following is the higher-level strategy. At this level, the action space becomes continuous, which classical approaches of RL cannot handle; hence either the control outputs are discretized, or, as in some papers, continuous variants of DRL are used.

B. Reinforcement Learning

As an area of Artificial Intelligence and Machine Learning, Reinforcement Learning (RL) deals with the problem of a learning agent placed in an environment to achieve a goal. Contrary to supervised learning, where the learner gets examples of good and bad behavior, the RL agent must discover by trial and error how to behave to get the most reward [12]. For this task, the agent must perceive the state of the environment at some level, and based on this information, it needs to take actions that result in a new state. As a result of its action, the agent receives a reward, which aids in the development of future behavior. To ultimately formulate the problem, modeling the state transitions of the environment based on the actions of the agent is also a necessity. This leads to the formulation of a POMDP defined by the tuple (S, A, T, R, Ω, O), where S is the set of environment states, A is the set of possible actions in a particular state, T is the transition function between the states based on the actions, R is the reward for the given (S, A) pair, Ω is the set of observations, and O is the sensor model. The agent in this context can be formulated by any inference model whose parameters can be modified in response to the experience gained. In the context of Deep Reinforcement Learning, this model is implemented by neural networks.
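
To make the (S, A, T, R, Ω, O) notation concrete, the following minimal sketch shows how such a partially observable driving environment is typically wrapped in code. The class, state layout, and toy dynamics are illustrative assumptions, not taken from the surveyed papers.

```python
import numpy as np

class DrivingPOMDP:
    """Illustrative POMDP wrapper: the hidden state s in S is the full
    simulator state, A is a finite action set, T the transition, R the
    reward, and the observation function O returns only a noisy,
    ego-centric view of s (partial observability)."""

    ACTIONS = ("keep_lane", "change_left", "change_right")   # A (assumed)

    def reset(self):
        # s0: ego (lateral error, speed, heading error) plus lead-vehicle gap
        self.state = {"ego": np.array([0.0, 20.0, 0.0]), "gap": 40.0}
        return self._observe()

    def step(self, action_idx):
        a = self.ACTIONS[action_idx]
        # T(s, a): toy transition, standing in for a real simulator
        self.state["ego"][0] += {"keep_lane": 0.0,
                                 "change_left": -0.5,
                                 "change_right": 0.5}[a]
        self.state["gap"] += np.random.normal(0.0, 0.5)
        reward = -abs(self.state["ego"][0])                  # R(s, a)
        done = self.state["gap"] < 2.0                       # terminal check
        return self._observe(), reward, done

    def _observe(self):
        # O: noisy sensor model over the hidden state (Omega = R^4 here)
        return np.concatenate([self.state["ego"], [self.state["gap"]]]) \
               + np.random.normal(0.0, 0.1, 4)
```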

The problem in the POMDP scenario is that the current actions affect the future states and therefore the future rewards, meaning that for optimizing the behavior for the cumulative reward throughout the entire episode, the agent needs to have information about the future consequences of its actions. RL has two main approaches for determining the optimal behavior: value-based and policy-based methods.

The original concept using a value-based method is the Deep Q-Network (DQN) introduced in [1]. Described briefly, the agent predicts a so-called Q value for each state-action pair, which formulates the expected immediate and future reward. From this set, the agent can choose the action with the highest value as an optimal policy or can use the values for exploration during the training process. The main goal is to learn the optimal Q function, represented by a


Fig. 3: The POMDP model for Deep Reinforcement Learning based autonomous driving

neural network in this case. This can be done by conducting experiments, calculating the discounted rewards of the future states for each action, and updating the network by using the Bellman equation [13] as a target. Using the same network for value evaluation and action selection results in unstable behavior and slow learning in noisy environments. Techniques such as experience replay and a separate target network can handle this problem, while other variants of the original DQN exist, such as Double DQN [14] or Dueling DQN [15], the latter separating the action and the value prediction streams, leading to faster and more stable learning.
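
As a minimal sketch of the value-based update described above, the Bellman target for a replayed mini-batch could be computed as follows. The q_net/target_net callables and the batch layout are assumptions for illustration, not a specific paper's implementation.

```python
import numpy as np

def dqn_targets(q_net, target_net, batch, gamma=0.99):
    """Compute Bellman targets y = r + gamma * max_a' Q_target(s', a')
    for a sampled mini-batch; terminal transitions keep only the reward."""
    obs, actions, rewards, next_obs, dones = batch       # equal-length arrays
    q_next = target_net(next_obs)                         # shape: (N, n_actions)
    targets = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    q_pred = q_net(obs)                                   # current estimates
    # Only the Q values of the actions actually taken are regressed to the target.
    q_pred[np.arange(len(actions)), actions] = targets
    return q_pred                                         # training target for q_net
```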

Policy-based methods aim at choosing the optimal behavior directly, where the policy πΘ is a function of (S, A). Represented by a neural network with a softmax head, the agent generally predicts a normalized probability of the expected goodness of the actions. In the most natural implementation, this output integrates the exploration property of the RL process. In advanced variants, such as actor-critic methods, the agent uses different predictions for the value and the action [16]. Initially, RL algorithms used finite action spaces, though for many control problems these are not suitable. To overcome this issue, [17] introduced the Deep Deterministic Policy Gradient (DDPG) agent, where the actor directly maps states to continuous actions.
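
The difference between a discrete stochastic policy head and a DDPG-style deterministic actor can be sketched schematically as below (plain NumPy, with arbitrary weight shapes and action bounds assumed for illustration).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def discrete_policy(features, weights):
    """Policy-gradient style head: returns a sampled action and pi(a|s);
    exploration is built into the sampling itself."""
    probs = softmax(weights @ features)          # weights: (n_actions, n_features)
    return np.random.choice(len(probs), p=probs), probs

def ddpg_actor(features, weights, max_steer=0.5, max_accel=3.0):
    """DDPG-style head: deterministically maps the state to continuous
    actions; exploration noise is added externally during training."""
    raw = np.tanh(weights @ features)            # weights: (2, n_features), output in [-1, 1]
    return raw * np.array([max_steer, max_accel])
```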

For complex problems, the learning process can still be long or even unsuccessful. This can be mitigated in several ways:

• Curriculum learning describes a type of learning in which the training starts with only easy examples of a task and then gradually increases the difficulty (a minimal schedule sketch is given after this list). This approach is used in [18], [19], [20].

• Adversarial learning aims to fool models through malicious input. Papers using variants of this technique are [21], [22].

• Model-based action choice, such as the MCTS-based solution of AlphaGo, can reduce the effect of the problem of distant rewarding.
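
As referenced in the curriculum learning item above, such a schedule can be as simple as the sketch below; the density values, episode thresholds, and the environment factory name are invented for illustration.

```python
def traffic_density_for_episode(episode,
                                schedule=((0, 0.1), (2000, 0.3), (5000, 0.6), (10000, 0.9))):
    """Curriculum: start training in nearly empty traffic and raise the
    (normalized) traffic density as training progresses."""
    density = schedule[0][1]
    for start_episode, value in schedule:
        if episode >= start_episode:
            density = value
    return density

# Example usage: the environment is rebuilt each episode with the scheduled density.
# for ep in range(num_episodes):
#     env = make_highway_env(density=traffic_density_for_episode(ep))  # hypothetical factory
```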

Since reinforcement learning models the problem as a POMDP, a discrete-time stochastic control process, the solutions need to provide a mathematical framework for decision making in situations where outcomes are partly random and partly under the control of a decision-maker, while the states are also only partially observable [23]. In the case of motion planning for autonomous or highly automated vehicles, the

tuple (S, A, T, R, Ω, O) of the POMDP is illustrated in Fig. 3 and can be interpreted as follows:

S, A, T, and R describe the MDP, the modeling environment of the learning process. It can vary depending on the goals, though in our case it needs to model the dynamics of the vehicle, the surrounding static and dynamic objects, such as other participants of the traffic, the road topology, lane markings, signs, traffic rules, etc. S holds the current actual state of the simulation. A is the possible set of actions of the agent driving the ego-car, while T, the so-called state-transition function, updates the vehicle state and also the states of the traffic participants depending on the action of the vehicle. The different levels of abstraction are described in Section II-A. Many research papers use different software platforms for modeling the environment. A brief collection of the used frameworks is presented in Section II-B. R is the reward function of the MDP; Section II-D gives a summary on this topic.

Ω is the set of observations the agent can experience in the world, while O is the observation function giving a probability distribution over the possible observations. In simpler cases, the studies assume full observability and formulate the problem as an MDP, though in many cases, the vehicle does not possess all information. Another interesting topic is the representation of the state observation, which is a crucial factor for the architecture choice and performance of Deep RL agents. The observation models used in the literature are summarized in Section II-E.

II. MODELING FOR REINFORCEMENT LEARNING

A. Vehicle modeling

Modeling the movement of the ego-vehicle is a crucial part of the training process since it raises the trade-off problem between model accuracy and computational resources. Since RL techniques use a massive number of episodes for determining the optimal policy, the step time of the environment, which highly depends on the evaluation time of the vehicle dynamics model, profoundly affects training time. Therefore, during environment design, one needs to choose from the simplest kinematic model to more sophisticated dynamic models, ranging from the 2 Degrees of Freedom (2DoF) lateral model to more and more complex models with a higher number of parameters and complicated tire models.


In rigid kinematic single-track vehicle models, which neglect tire slip and skid, lateral motion is only affected by the geometric parameters. Therefore, they are usually limited to low-speed applications. More details about the model can be found in [24]. The simplest dynamic models with longitudinal and lateral movements are based on the 3 Degrees of Freedom (3DoF) dynamic bicycle model, usually with a linear tire model. They consider (Vx, Vy, Ψ) as independent variables, namely longitudinal and lateral speed and yaw rate. A more complex model is the four-tire 9 Degrees of Freedom (9DoF) vehicle model, where, besides the parameters of the 3DoF model, body roll and pitch (Θ, Φ) and the angular velocities of the four wheels (ωfl, ωfr, ωrl, ωrr) are also considered to calculate tire forces more precisely. Hence the model takes into account both the coupling of longitudinal and lateral slips and the load transfer between tires.

Though the kinematic model seems quite simplified and, as stated in [25], such a model can behave significantly differently from an actual vehicle, for many control situations its accuracy is suitable [24].

According to [25], using a kinematic bicycle model with a limitation on the lateral acceleration at around 0.5g or less provides appropriate results, but only with the assumption of a dry road. Above this limit, the model is unable to capture the dynamics; hence a more accurate vehicle model should be used when dealing with higher accelerations to push the vehicle's dynamics near its handling limits.
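
A minimal discrete-time implementation of the kinematic single-track (bicycle) model discussed above could look like the sketch below; the wheelbase split and the step time are illustrative values.

```python
import math

def kinematic_bicycle_step(x, y, yaw, v, accel, steer, dt=0.05, lf=1.3, lr=1.4):
    """One Euler step of the kinematic bicycle model (no tire slip): the
    slip angle beta follows purely from geometry, which is why the model
    is only trusted at low lateral accelerations (around 0.5 g or less)."""
    beta = math.atan(lr / (lf + lr) * math.tan(steer))   # kinematic slip angle
    x   += v * math.cos(yaw + beta) * dt
    y   += v * math.sin(yaw + beta) * dt
    yaw += v / lr * math.sin(beta) * dt
    v   += accel * dt
    return x, y, yaw, v
```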

Regarding calculation time, compared to the kinematic model, the calculation of the 3DoF model can be 10...50 times higher, and the precise calculation of a 9DoF model with a nonlinear tire model can be 100...300 times higher, which is the main reason for the RL community to use a low level of abstraction.

Modeling traffic and surrounding vehicles is often performed by using dedicated simulators, as described in Section II-B. Some authors develop their own environments using cellular automata models [26]. Others use MOBIL, a general model (Minimizing Overall Braking Induced by Lane change) that derives lane-changing rules for discretionary and mandatory lane changes for a broad class of car-following models [27], or the Intelligent Driver Model (IDM), a continuous microscopic single-lane model [28].

B. Simulators

Some authors create self-made environments to achieve full control over the model, though there are commercial and open-source environments that can provide this feature. This section briefly identifies some of those used in recent research on motion planning with RL.

In modeling the traffic environment, the most popular choice is SUMO (Simulation of Urban MObility), which is a microscopic, inter- and multi-modal, space-continuous and time-discrete traffic flow simulation platform [29]. It can convert networks from other traffic simulators such as VISUM, Vissim, or MATSim and also reads other standard digital road network formats, such as OpenStreetMap or OpenDRIVE. It also provides interfaces to several environments, such as Python, Matlab, .NET, C++, etc. Though the abstraction level in this case is microscopic and vehicle behavior is limited, its ease of use and high speed make it an excellent choice for training agents to handle traffic, though it does not provide any sensor model besides the ground truth state of the vehicles.
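
As an illustration of why SUMO fits RL training loops well, a stripped-down interaction through its TraCI Python API could look like the sketch below; it assumes SUMO is installed, its Python tools are importable, and a configuration file named highway.sumocfg exists, with all error handling omitted.

```python
import traci  # SUMO's Python control API, shipped with the simulator

def run_episode(max_steps=1000, ego_id="ego"):
    """Step the microscopic simulation and read ground-truth vehicle
    states, which is the only 'sensor' SUMO provides out of the box."""
    traci.start(["sumo", "-c", "highway.sumocfg"])   # headless run
    for _ in range(max_steps):
        traci.simulationStep()                        # advance one time step
        if ego_id in traci.vehicle.getIDList():
            speed = traci.vehicle.getSpeed(ego_id)
            lane = traci.vehicle.getLaneIndex(ego_id)
            # ... build the observation and apply the agent's action here,
            # e.g. traci.vehicle.setSpeed(ego_id, desired_speed)
    traci.close()
```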

Another popular microscopic simulator that has been used both commercially and for research is VISSIM [30]. In [31], it is used for developing car-following behavior and lane-changing decisions.

Considering only vehicle dynamics, the most popular choice is TORCS (The Open Racing Car Simulator), which is a modern, modular, highly portable multi-player, multi-agent car simulator. Its high degree of modularity and portability renders it ideal for artificial intelligence research [32]. Interfacing with Python, the most popular AI research environment, is comfortable, and it runs at an acceptable speed. TORCS also comes with different tracks, competing robots, and several sensor models.

It could be assumed that for vehicle dynamics the best choices would be professional tools such as CarSim [33] or CarMaker [34], though the utilization of these software packages cannot be found in the reinforcement learning literature. This may be caused by the fact that they are expensive commercial platforms; more importantly, their lack of Python interfaces and their high-precision but resource-intensive models prevent them from running several episodes within a reasonable time.

For more detailed sensor models or traffic, the authors usually use AirSim, Udacity, Gazebo/ROS, or CARLA:

AirSim, used by recent research in [35], is a simulator initially developed for drones, built on Unreal Engine; it now has a vehicle extension with different weather conditions and scenarios.

Udacity, used in [36], is a simulator that was built for Udacity's Self-Driving Car Nanodegree [37]. It provides various sensors, such as high-quality rendered camera images, LIDAR, and infrared information, and also has capabilities to model other traffic participants.

Another notable mention is CARLA, an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support the development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation platform supports flexible specification of sensor suites and environmental conditions [38].

Though this section provides only a brief description of the simulators, a more systematic review of the topic can be found in [39].

C. Action Space

The choice of action space highly depends on the vehicle model and the task designed for the reinforcement learning problem in each previous research. Two main levels of control can be found: one is the direct control of the vehicle by steering, braking, and accelerating commands; the other acts on the behavioral layer and defines choices on the strategic level,


such as lane changing, lane keeping, setting an ACC reference point, etc. At this level, the agent gives a command to low-level controllers, which calculate the actual trajectory. Only a few papers deal with the motion planning layer, where the task defines the endpoints (x, y, θ), and the agent defines the knots of the trajectory to follow, represented as a spline, as can be seen in [11]. Also, a few papers deviate from vehicle motion restrictions and generate actions by stepping in a grid, like in classic cellular automata microscopic models [40].

Some papers combine the control and behavioral layers by separating longitudinal and lateral tasks, where longitudinal acceleration is a direct command, while lane changing is a strategic decision, like in [41].

The behavioral layer usually holds a few distinct choices, from which the underlying neural network needs to choose, making it a classic reinforcement learning task with finite actions.

On the level of control, the actuation of the vehicle, i.e., steering, throttle, and braking, consists of continuous parameters, and many reinforcement learning techniques, like DQN and PG, cannot handle this, since they need a finite action set, while some, like DDPG, work with continuous action spaces. To adapt to the finite action requirements of the RL technique used, most papers discretize the steering and acceleration commands to 3 to 9 possibilities per channel. A low number of possible choices pushes the solution farther from reality, which could raise vehicle dynamics issues with uncontrollable slips, massive jerk, and yaw-rate, though the utilization of kinematic models sometimes covers this in the papers. A large number of discrete choices, however, ends up in an exponential growth of the possible outcomes in the POMDP approach, which slows down the learning process.
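
A typical discretization of the two continuous control channels into a finite action set, as described above, can be sketched as follows; the 3 x 3 grid of values is an arbitrary illustrative choice.

```python
import itertools

STEERING = (-0.3, 0.0, 0.3)        # rad, 3 options per channel (assumed values)
ACCELERATION = (-2.0, 0.0, 2.0)    # m/s^2

# Cartesian product: 3 x 3 = 9 discrete actions for DQN-like agents
DISCRETE_ACTIONS = list(itertools.product(STEERING, ACCELERATION))

def decode_action(index):
    """Map a discrete action index chosen by the agent back to the
    continuous (steering, acceleration) command sent to the vehicle model."""
    return DISCRETE_ACTIONS[index]
```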

D. Rewarding

During training, the agent tries to fulfill a task, generally consisting of more than one step. This task is called an episode. An episode ends if one of the following conditions is met:

• The agent successfully fulfills the task;
• The episode reaches a previously defined number of steps;
• A terminating condition arises.

The first two cases are trivial and depend on the design of the actual problem. Terminal conditions are typically situations where the agent reaches a state from which the actual task is impossible to fulfill, or the agent makes a mistake that is not acceptable. Vehicle motion planning agents usually use terminating conditions such as collision with other participants or obstacles, or leaving the track or lane, since these two inevitably end the episode. There are lighter approaches, where the episode terminates with failure before the accident occurs, for example when the tangent angle to the track becomes too high or the vehicle gets too close to other participants. These "before accident" terminations speed up the training by bringing the information of failure forward in time, though their design needs caution [42].

Rewarding plays the role of evaluating the goodness of the choices the agent made during the episode, giving feedback to improve the policy. The first important aspect is the timing of the reward, where the designer of the reinforcement learning solution needs to choose a mixture of the following strategies, all having their pros and cons:

• Giving reward only at the end of the episode and discounting it back to the previous (S, A) pairs, which could result in a slower learning process, though it minimizes the human-driven shaping of the policy.

• Giving an immediate reward at each step by evaluating the current state; naturally, discounting also appears in this solution. This results in significantly faster learning, though the choice of the immediate reward highly affects the established strategy, which sometimes prevents the agent from developing better overall solutions than the one intended by the designed reward.

• An intermediate solution can be to give a reward at predefined periods or travel distances [43], or when a good or bad decision occurs.

In the area of motion planning, the end-of-episode rewards are calculated from the fulfillment or failure of the driving task. The overall performance factors are generally: time of finishing the task, keeping the desired speed or achieving as high an average speed as possible, yaw or distance from the lane middle or the desired trajectory, overtaking more vehicles, performing as few lane changes as possible [44], keeping right [45], [46], etc. Rewarding systems can also represent passenger comfort, where the smoothness of the vehicle dynamics is enforced. The most used quantitative measures are the longitudinal acceleration [47], lateral acceleration [48], [49], and jerk [50], [10].
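
An immediate reward combining the typical terms listed above (speed tracking, lane-center deviation, comfort, lane-change penalty, and a terminal collision penalty) could be sketched as follows; the weights and the collision penalty value are invented for illustration.

```python
def step_reward(speed, desired_speed, lateral_error, jerk, lane_changed, collided,
                w_speed=1.0, w_lat=0.5, w_jerk=0.1, w_lc=0.2):
    """Weighted immediate reward: penalize deviation from the desired speed,
    lane-center offset, jerk (comfort), and lane changes; a collision
    dominates everything and typically also terminates the episode."""
    if collided:
        return -100.0
    reward = -w_speed * abs(speed - desired_speed) / max(desired_speed, 1e-3)
    reward -= w_lat * abs(lateral_error)
    reward -= w_jerk * abs(jerk)
    reward -= w_lc * (1.0 if lane_changed else 0.0)
    return reward
```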

In some research, the reward is based on the deviation from a dataset [51] or calculated as a deviation from a reference model, like in [52]. These approaches can provide favorable results, though they deviate somewhat from the original philosophy of reinforcement learning, since a previously known strategy could guide the learning.

E. Observation Space

The observation space describes the world to the agent. It needs to give sufficient information for choosing the appropriate action; hence, depending on the task, it contains the following knowledge:

• The state of the vehicle in the world, e.g., position, speed, yaw, etc.;

• Topological information like lanes, signs, rules, etc.;
• Other participants: surrounding vehicles, obstacles, etc.

The reference frame of the observation can be absolute and fixed to the coordinate system of the world, though as the decision process focuses on the ego-vehicle, it is more straightforward to choose an ego-centric reference frame pinned to the vehicle's coordinate system, or to the vehicle's position in the world and the orientation of the road. This allows concentrating the distribution of visited states around the origin in position, heading, and velocity space, as other vehicles are often close to the ego-vehicle and have similar speed and heading, reducing the region of state-space in which the policy must perform [53].
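
Converting absolute world states into the ego-centric frame described above is a simple rigid transformation; a sketch with an assumed state layout follows.

```python
import numpy as np

def to_ego_frame(ego_xy, ego_yaw, ego_v, other_xy, other_yaw, other_v):
    """Express another vehicle's pose and velocity relative to the ego
    vehicle, so that visited states concentrate around the origin."""
    c, s = np.cos(-ego_yaw), np.sin(-ego_yaw)
    rotation = np.array([[c, -s], [s, c]])                       # rotate by -ego_yaw
    rel_position = rotation @ (np.asarray(other_xy) - np.asarray(ego_xy))
    rel_heading = (other_yaw - ego_yaw + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    rel_speed = other_v - ego_v
    return rel_position, rel_heading, rel_speed
```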


1) Vehicle state observation: For lane keeping, navigation, simple racing, overtaking, or maneuvering tasks, the most commonly used and also the simplest observation for the ego vehicle consists of the continuous variables (|e|, v, θe), describing the lateral position from the center-line of the lane, the vehicle speed, and the yaw angle, respectively (see Fig. 4). This information is the absolute minimum for guiding car-like vehicles and is only eligible for the control of classical kinematic car-like models, where the system implies the motion-without-skidding assumption. Though in many cases in the literature, this can be sufficient, since the vehicles remain deep in the dynamically stable region.

Fig. 4: Observation for basic vehicle state (source: [3])

For tasks where more complex vehicle dynamics is inevitable, such as racing situations, or where the stability of the vehicle is essential, this set of observable states would not be enough, and it should be extended with yaw, pitch, roll, tire dynamics, and slip.

2) Environment observation: Getting information about the surroundings of the vehicle and representing it to the learning agent shows high diversity in the literature. Different levels of sensor abstraction can be observed:

• sensor level, where camera images, lidar, or radar information is passed to the agent;
• intermediate level, where idealized sensor information is provided;
• ground truth level, where all detectable and non-detectable information is given.

The structure of the sensor model also affects the neural network structure of the Deep RL agent, since image-like or array-like inputs infer 2D or 1D CNN structures, while a simple set of scalar information results in a simple dense network. There are cases where these two kinds of inputs are mixed; hence the network needs to have two different types of input layers.

Image-based solutions usually use front-facing camera images extracted from 3D simulators to represent the observation space. The data is structured in a (C × W × H) sized matrix, where C is the number of channels, usually one for intensity images and three for RGB, while W and H are the width and height resolution of the image. In some cases, for the detection of movement, multiple images are fed to the network

in parallel. Sometimes it is convenient to down-sample the images, like (1×48×27) in [54] or (3×84×84) in [55], [56], for data and network compression purposes. Since images hold the information in an unstructured manner, i.e., the state information, such as object positions or lane information, is deeply encoded in the data, deep neural networks, such as CNNs, usually need large sample counts and long training times to converge [57]. This problem escalates with the high number of steps that the RL process requires, resulting in a lengthy learning process, like 1.5M steps in [54] or 100M steps in [55].
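
The usual preprocessing hinted at above (grayscale conversion, down-sampling, and stacking consecutive frames so that motion becomes visible to the CNN) can be sketched as follows; the 84 x 84 target size mirrors a commonly cited choice, and the crude striding is only a stand-in for a proper resizing routine.

```python
import numpy as np

def preprocess(rgb_frame, out_h=84, out_w=84):
    """Convert an (H, W, 3) RGB frame to a down-sampled grayscale image
    in [0, 1] by channel averaging and nearest-neighbour striding."""
    gray = rgb_frame.mean(axis=2)
    step_h = max(gray.shape[0] // out_h, 1)
    step_w = max(gray.shape[1] // out_w, 1)
    return gray[::step_h, ::step_w][:out_h, :out_w] / 255.0

def stack_frames(frame_buffer, new_frame, depth=4):
    """Keep the last `depth` preprocessed frames as a (C x H x W)
    observation so the agent can infer velocities from the image input."""
    frame_buffer.append(new_frame)
    if len(frame_buffer) > depth:
        frame_buffer.pop(0)
    while len(frame_buffer) < depth:
        frame_buffer.insert(0, new_frame)
    return np.stack(frame_buffer, axis=0)
```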

Many image-based solutions propose some kind of preprocessing of the data to overcome this issue. In [57], the authors propose a framework for vision-based lateral control, which combines DL and RL methods. To improve the perception accuracy, an MTL (Multi-Task Learning) CNN model is proposed to learn the critical track features, which are used to locate the vehicle in the track coordinate frame, and a policy-gradient RL controller is trained to solve the continuous sequential decision-making problem. Naturally, this approach can also be viewed as an RL solution with structured features, though the combined approach has its place among the image-based solutions as well.

Another approach could be the simplification of the unstructured data. In [58], Kotyan et al. use the difference image, i.e., the background subtraction between two consecutive frames, as the input, assuming this image contains the motion of the foreground and that the underlying neural network will focus more on the features of the foreground than on the background. Using the same training algorithm, their results showed that including the difference image instead of the original unprocessed input needs approximately 10 times fewer training steps to achieve the same performance. The second possibility is, instead of using the original image as an input, to drive it through an image semantic segmentation network, as proposed in [59]. As the authors state: "Semantic image contains less information compared to the original image, but includes most information needed by the agent to take actions. In other words, semantic image neglects useless information in the original image." Another advantage of this approach is that the trained agent can use the segmented output of images obtained from real-world scenarios, since on this level, the difference between simulated and real-world data is much smaller than in the case of simulated and real-world images. Fig. 5 shows the 640x400 resolution inputs used in this research.

2D or 3D lidar-like sensor models are not common among recent studies, though they could provide excellent depth-map-like information about the environment. The same problem arises as with camera images: the provided data, a vector for 2D and a matrix for 3D lidars, is unstructured. The usage of this type of input can only be found in [60], where the observation emulates a 2D lidar that provides the distance from obstacles in 31 directions within a field-of-view of 150°, and the agent uses the sensor data as its state. A similar input structure, though not modeling a lidar, since there is no reflection, is provided by TORCS and used in [20] to represent the lane markings with imagined beam sensors. The agent in the cited example uses readings from 19 sensors with a 200 m range, presented at every 10°


Fig. 5: Real images from the driving data and their semantic segmentations (source: [59])

on the front half of the car, returning the distance to the track edge.

Grid-based path planning methods, like A* or various SLAM (Simultaneous Localization and Mapping) algorithms, exist and are widely used in the area of mobile robot navigation, where the environment is represented as a spatial map [61], usually formulated as a 2D matrix assigning to each 2D location in a surface grid one of three possible values: occupied, free, and unknown [62]. This approach can also be used to represent probabilistic maneuvers of surrounding vehicles [63], or, by generating a spatiotemporal map from a predicted sequence of movements, motion planning in a dynamic environment can also be achieved [64]. Though the previously cited examples did not use RL techniques, they prove that grid representation holds high potential in this field. Navigation in a static environment by using a grid map as the observation, together with the position and yaw of the vehicle, with an RL agent is presented in [65] (see Fig. 6). Grid maps are also unstructured data, and their complexity is similar to that of semantically segmented images, since the cells store class information in both cases; hence their optimal handling uses the CNN architecture.

(a) Sensors (b) Target state z_t (c) Perception

Fig. 6: The surroundings from the perspective of the vehicle can be described by a coarse perception map where the target is represented by a red dot (c) (source: [65])

Representing moving objects, i.e., surrounding vehicles, in a grid needs not only occupancy but also other information; hence the spatial grid's cells need to hold additional information. In [44], the authors used an equidistant grid, where the ego-vehicle is placed in the center, and the cells occupied by other vehicles represent the longitudinal velocity of the corresponding car (see Fig. 7). The same approach can also be found in [49]. Naturally, this simple representation cannot provide information about the lateral movement of the

other traffic participants, though they give significantly more than the simple occupancy-based ones. Equidistant grids are

Fig. 7: The visualization of the HDM mapping process: (a) mathematical model for the traffic; (b) visualization of the Hyper Grid Matrix (source: [44])

a logical choice for generic environments, where the moving directions of the mobile robot are free; though, in the case of road vehicles, the vehicle mainly follows the direction of the traffic flow. In this case, the spatial representation could be chosen fixed to the road topology, namely the lanes of the road, regardless of its curvature or width. In these lane-based grid solutions, the grid representing the highway has as many rows as the actual lane count, and the lanes are discretized longitudinally. The simplest utilization of this approach can be found in [26], where the length of the cells is equivalent to the unit vehicle length, and the behavior of the traffic acts similarly to the classic cellular automata-based microscopic models [66].

This representation, similarly to the equidistant ones, can be used for occupancy, though it still does not hold any information on vehicle dynamics. The approach of [67] is to feed multiple consecutive traffic snapshots into the underlying CNN structure, which inherently extracts the velocity of the moving objects. Representing speed in the grid cells is also possible in this setup; an example can be found in [36], where the authors converted the traffic extracted from the Udacity simulator to the lane-based grid.

Besides the position and the longitudinal speed of the surrounding vehicles, which are essential from the aspect of decision making, other features (such as heading, acceleration, lateral speed) should be considered. Multi-layer grid maps could be used for each vital parameter to overcome this issue. In [10], the authors processed the simulator state to calculate an observation tensor of size 4 × 3 × (2 × FoV + 1), where


FoV stands for Field of View and represents the maximum distance of the observation in cell count. There is one channel (first dimension) each for on-road occupancy, relative velocities of vehicles, relative lateral displacements, and relative headings to the ego-vehicle. Fig. 8 shows an example of the simulator state and the corresponding input observation used for their network.

Fig. 8: The simulator state (top, zoomed in) gets converted to a 4 × 3 × (2 × FoV + 1) input observation tensor (bottom) (source: [10])
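
A multi-layer, lane-based grid of the 4 × 3 × (2 × FoV + 1) kind described above can be filled roughly as in the sketch below; the field names on the vehicle records and the cell length are assumptions, and indexing details differ from paper to paper.

```python
import numpy as np

def build_grid_observation(ego, others, fov=25, cell_length=4.0, n_lanes=3):
    """Channels: 0 occupancy, 1 relative longitudinal speed, 2 relative
    lateral displacement, 3 relative heading. Rows are lanes, columns are
    longitudinal cells centred on the ego vehicle."""
    grid = np.zeros((4, n_lanes, 2 * fov + 1), dtype=np.float32)
    for veh in others:                                   # each veh: a dict (assumed layout)
        cell = int(round((veh["x"] - ego["x"]) / cell_length)) + fov
        lane = veh["lane"]
        if 0 <= cell <= 2 * fov and 0 <= lane < n_lanes:
            grid[0, lane, cell] = 1.0
            grid[1, lane, cell] = veh["vx"] - ego["vx"]
            grid[2, lane, cell] = veh["y"] - ego["y"]
            grid[3, lane, cell] = veh["yaw"] - ego["yaw"]
    return grid
```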

The previous observation models (image, lidar, or grid-based) all have some common properties: all of them are unstructured datasets and need a CNN architecture to process, which hardens the learning process, since the agent simultaneously needs to extract the interesting features and form the policy for action. An obvious alternative is to pre-process the unstructured data and feed structured information to the agent's network. Structured data refers to any data that resides in a fixed field within a record or file. As an example, for navigating in traffic, the parameters of the surrounding vehicles are, based on the task, represented in the same element of the input. In the simplest scenario of car following, the agent only focuses on the leading vehicle, and the input, besides the state of the ego vehicle, consists of (d, v) as in [51] or (d, v, a) as in [68], where these parameters are the headway distance, speed, and acceleration of the leading vehicle. Contrary to unstructured data, these approaches significantly reduce the amount of input and can be handled with simple DNN structures, which profoundly affects the convergence of the agent's performance. For navigating in traffic, i.e., performing merging or lane-changing maneuvers, not only the leading vehicle's but also the other surrounding vehicles' states need to be considered. In a merging scenario, the most crucial information is the relative longitudinal position and speed 2×(dx, dv) of the two vehicles bounding the target gap, as used by [69]. Naturally, this is the absolute minimal representation of such a problem, but in the future, more sophisticated representations may be developed. In highway maneuvering situations, both ego-lane and neighboring-lane vehicles need to be considered: in [41], the above-mentioned 6×(dx, dv) scalar vector is used for the front and rear vehicles in the three interesting lanes, while in [70], the authors extended this information with the occupancy of the neighboring lanes right beside the ego-vehicle (see Fig. 9). The same approach can be seen in [42], though extending the number of traced objects to nine. These researches lack lateral information, though in [41], the lateral positions and speeds are also involved in the input vector, resulting in a 6×(dx, dy, dvx, dvy) structure, logically representing longitudinal and lateral distance and

speed differences to the ego vehicle, respectively. In the special case of

Fig. 9: Environment state on the highway [70]

handling an unsignalized intersection [71], the authors also used this formulation scheme, where the other vehicles' Cartesian coordinates, speed, and heading were considered.
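
A structured observation of the 6 × (dx, dv) kind used for highway maneuvering can be assembled as in the sketch below, with fixed slots per surrounding vehicle and a neutral "far away" placeholder for empty slots; the slot names and padding values are assumptions.

```python
import numpy as np

SLOTS = ("left_front", "left_rear", "ego_front", "ego_rear", "right_front", "right_rear")

def structured_observation(ego_speed, neighbours, max_gap=200.0):
    """Fixed-size vector: ego speed followed by (dx, dv) for the six
    neighbouring slots, so the input layout never changes."""
    features = [ego_speed]
    for slot in SLOTS:
        veh = neighbours.get(slot)                 # neighbours: dict slot -> (dx, dv) or None
        if veh is None:
            features.extend([max_gap, 0.0])        # empty slot: large gap, no speed difference
        else:
            features.extend([veh[0], veh[1]])
    return np.asarray(features, dtype=np.float32)  # length 1 + 2 * 6 = 13
```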

III. SCENARIO-BASED CLASSIFICATION OF THE APPROACHES

Though this survey focuses on Deep Reinforcement Learning based motion planning research, it is essential to mention that some papers try to solve sub-tasks of automated driving through classic reinforcement learning techniques. One problem with these classic methods is that they cannot handle unstructured data, such as images, mid-level radar, or lidar sensing.

The other problem comes from the need to maintain the Q-table for all (S, A) state-action pairs. This results in a space complexity explosion, since the size of the table equals the product of the class counts of all state and action dimensions. As an example, the Q-learning setup of [72] is presented. The authors trained an agent in TORCS, which tries to achieve a policy for the best overtaking maneuver by taking advantage of the aerodynamic drag. There are only two participants in the scenario, the overtaking vehicle and the vehicle in front, on a long straight track. The state representation contains the longitudinal and lateral distance of the two vehicles, the lateral position of the ego-vehicle, and the speed difference of the two. The authors discretized this state space

TABLE I: State representation discretization in [72]

Name            Size  Class bounds
disty [m]         6   0, 10, 20, 30, 50, 100, 200
distx [m]        10   -25, -15, -5, -3, -1, 0, 1, 3, 5, 15, 25
pos [m]           8   -10, -5, -2, -1, 0, 1, 2, 5, 10
∆speed [km/h]     9   -300, 0, 30, 60, 90, 120, 150, 200, 250, 300

into classes of sizes (6, 10, 8, 9), respectively (see Table I), and used the minimal lateral action set of size 3, where the actions are sweeping 1 m to the left or right and maintaining the lateral position. Together, this problem generates a Q-table with 6 * 10 * 8 * 9 * 3 = 12960 elements. Though a table of this size can be easily handled nowadays, it is easy to


imagine that with more complex problems involving more vehicles, more sensors, complex dynamics, and denser state and action representations, the table can grow to an enormous size. A possible reduction is the utilization of the Multiple-Goal Reinforcement Learning Method and dividing the overall problem into sub-tasks, as can be seen in [73] for the overtaking maneuver. In a later research, the authors widened the problem and separated the driving task into collision avoidance, target seeking, lane following, lane choice, speed keeping, and steady steering [74]. To reduce the problem size, the authors of [75] used strategic-level decisions to set movement targets for the vehicle with respect to the surrounding ones and left the low-level control to classic solutions, which significantly reduced the action space.
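
The space-complexity argument above is easy to reproduce: the tabular representation grows with the product of all discretization sizes, as the short sketch below shows for the example taken from [72].

```python
import numpy as np

state_bins = {"disty": 6, "distx": 10, "pos": 8, "delta_speed": 9}
n_actions = 3                                   # sweep left, sweep right, keep position

table_size = np.prod(list(state_bins.values())) * n_actions
print(table_size)                               # 6 * 10 * 8 * 9 * 3 = 12960 entries

# The full Q-table is a dense array indexed by the discretized state and the
# action; adding one more 10-class feature would multiply its size by 10.
q_table = np.zeros(tuple(state_bins.values()) + (n_actions,))
```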

Another interesting example of classic Q-learning is described in [76], where the authors designed an agent for the path planning problem of a ground vehicle with Ackermann steering, considering obstacles, using (v, x, y, θ) (speed, positions, and heading) as the state representation, and used reinforcement learning as an optimizer (see Fig. 10).

Fig. 10: Path planning results from [76]

Though one would expect that machine learning could give an overall end-to-end solution to automated driving, the study of the recent literature shows that reinforcement learning research can give answers to certain sub-tasks of this problem. The papers of recent years can be organized around these problems, where a well-defined situation or scenario is chosen and examined to see whether a self-learning agent can solve it. These problem statements vary in complexity. As mentioned earlier, the complexity of reinforcement learning, and thus the training time, is greatly influenced by the complexity of the chosen problem, the nature of the action space, and the timeliness and proper formulation of the rewards. The simplest problems, such as lane keeping or vehicle following, can generally be traced back to simple convex optimization or control problems; in these cases, however, secondary control goals, such as passenger comfort, are easier to formulate. At the other end of this imagined complexity scale, there are problems, like maneuvering in dense traffic, where the efficient fulfillment of the task is hard to formulate and the agent needs predictive "thinking" to achieve its goals. In the following, these approaches are presented.

A. Car following

Car following is the simplest task in this survey, where the problem is formulated as follows: there are two participants in the simulation, a leading and a following vehicle, both keeping their lateral positions in a lane, while the following vehicle adjusts its longitudinal speed to keep a safe following distance. The observation space consists of the (v, dv, ds) tuple, representing the agent's speed, the speed difference to the lead vehicle, and the headway distance. The action is the acceleration command. Reward systems naturally use the collision of the two vehicles as a failure, while the performance of the agent is based on the jerk, TTC (time to collision) [50], or passenger comfort [77]. Another approach is shown in [51], where the performance of the car-following agent is evaluated against real-world measurements to achieve human-like behavior.

B. Lane keeping

Lane keeping or trajectory following is still a simple control task, but contrary to car following, this problem focuses on lateral control. The observation space in these studies uses two different approaches: one is the "ground truth" lateral position and angle of the vehicle in the lane [78], [60], [22], while the second is the image of a front-facing camera [54], [59], [57]. Naturally, for image-based control, the agents use external simulators, TORCS and Gazebo/ROS in these cases. Reward systems almost always consider the distance from the center-line of the lane as an immediate reward. It is important to mention that these agents hardly consider vehicle dynamics and, surprisingly, do not focus on joint longitudinal control.

C. Merging

The ramp merge problem deals with the on-ramp highway scenario (see Fig. 11), where the ego vehicle needs to find an acceptable gap between two vehicles to get onto the highway. In the simplest approach, it is sufficient to learn the longitudinal control with which the agent reaches this position, as can be seen in [79], [45], [19]. Other papers, like [69], use full steering and acceleration control. In [45], the actions control the longitudinal movement of the vehicle (accelerate and decelerate), and while executing these actions, the ego vehicle keeps its lane. The actions "lane change left" and "lane change right" imply lateral movement. Only a single action is executed at a time, and actions are executed in their entirety; the vehicle is not able to prematurely abort an action.

An exciting addition can be examined in [19], where the surrounding vehicles act differently, as there are cooperative and non-cooperative drivers among them. The authors trained their agents with knowledge about the cooperative behavior and also compared the results with three differently built MCTS planners. Full-information MCTS naturally outperforms RL, though it is computationally expensive. The authors used a curriculum learning approach to train the agent by gradually increasing traffic density. As they stated: "When training an RL agent in dense traffic directly, the policy converged to a suboptimal solution which is to stay still in the merge lane and


Fig. 11: Ramp merge: (a) simulated scenario and (b) real-world location (source: [69])

does not leverage the cooperativeness of other drivers. Such a policy avoids collisions but fails at achieving the maneuver."

The most detailed description for this problem is given by [69], where "the driving environment is trained as an LSTM architecture to incorporate the influence of historical and interactive driving behaviors on the action selection. The Deep Q-learning process takes the internal state from LSTM as the input to the Q-function approximator, using it for the action selection based on more past information. The Q-network parameters are updated with an experience replay, and a second target Q-network is used to relieve the problems of local optima and instability." With this approach, the researchers try to mix the possibilities from behavior prediction and learning, simultaneously achieving better performance.

D. Driving in traffic

The most complicated scenarios examined in the recent papers are those where the autonomous agent drives in traffic. Naturally, this task is also scalable by the topology of the network, the number and behavior of the surrounding vehicles, the application of traffic rules, and many other properties. Therefore, almost all of the current solutions deal with highway driving, where the scenario lacks intersections and pedestrians, and the traffic flows in one direction in all lanes. Sub-tasks of this scenario were examined in the previous sections, such as lane keeping or car following. In the following, two types of highway driving are presented. First, the hierarchical approaches are outlined, where the agent acts on the behavioral layer, making decisions about lane changing or overtaking, and performs these actions with an underlying controller using classic control approaches. Secondly, end-to-end solutions are presented, where the agents directly control the vehicle by steering and acceleration. As the problem gets more complicated, it is important to mention that agents trained this way are only able to solve the types of situations they are exposed to in the simulations. It is, therefore, crucial that the design of the simulated traffic environment covers the intended cases [52].

Making decisions on the behavioral layer consists of at least three discrete actions: keeping the current lane, changing to

the left, and Change to the right, as can be seen in [42].In this paper, the authors used the ground truth informationabout the ego vehicle’s speed and lane position, and therelative position and speed of the eight surrounding vehiclesas the observation space. They trained and tested the agentsin three categories of observation noises: noise-free, mid-level noise (%5), and high-level noise (%15), and showedthat the training environments with higher noises resultedin more robust and reliable performance, also outperformingthe rule-based MOBIL model, by using DQN with a DNNof 64, 128, 128, 64 hidden layers with tanh activation. In aquite similar environment and observation space, [52] useda widened set of actions to perform the lane changing withprevious accelerations or target gap approaching, resultingin six different actions as can be seen in table II. Theyalso achieved the result that the DQN agent - using twoconvolutional and one dense layer - performed on par with,or better than, a reference model based on the IDM [28]. andMOBIL [27] model. In the other publication from the sameauthor [80], the action space is changed slightly by changingthe acceleration commands to increasing and decreasing theACC set-points and let the underlying controller perform theseactions.

TABLE II: Action space in [52]

a1  Stay in current lane, keep current speed
a2  Stay in current lane, accelerate with −2 m/s²
a3  Stay in current lane, accelerate with −9 m/s²
a4  Stay in current lane, accelerate with 2 m/s²
a5  Change lanes to the left, keep current speed
a6  Change lanes to the right, keep current speed
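For clarity, the action set of Table II can be written out as a simple mapping from action index to a lane command and a longitudinal acceleration; the acceleration values are taken from the table, while the dictionary encoding itself is only illustrative.

```python
# (lane command, longitudinal acceleration [m/s^2]) for the actions of Table II
ACTIONS = {
    "a1": ("keep lane",     0.0),
    "a2": ("keep lane",    -2.0),
    "a3": ("keep lane",    -9.0),   # hard braking
    "a4": ("keep lane",     2.0),
    "a5": ("change left",   0.0),
    "a6": ("change right",  0.0),
}
```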

In [68], a two-lane scenario is considered to distribute the hierarchical decisions further. First, a DQN makes a binary decision about ”to or not to change lane”, and afterward, another Q-network is responsible for the longitudinal acceleration, based on the previous decision. Hence the second layer, integrated with classic control modules (e.g., Pure Pursuit Control), outputs appropriate control actions for adjusting the vehicle's position. In [47], the above-mentioned two-lane scenario is also considered, though the authors used an actor-critic-like learning agent.
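The two-stage split described for [68] - a binary lane-change decision followed by a longitudinal action conditioned on it - might be organized roughly as in the sketch below; the Q-network callables and their interfaces are assumptions, not the authors' implementation.

```python
import numpy as np

def hierarchical_decision(obs, lane_change_qnet, longitudinal_qnet):
    """Two-stage decision: first 'change lane or not', then the
    longitudinal action.  Both *_qnet callables are assumed to return
    NumPy arrays of Q-values for their respective action sets."""
    change_lane = bool(np.argmax(lane_change_qnet(obs)))               # 0: keep, 1: change
    # the second network is conditioned on the first-stage decision
    longitudinal_action = int(np.argmax(longitudinal_qnet(obs, change_lane)))
    return change_lane, longitudinal_action
```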

An interesting question in automated driving is the cooperative behavior of the trained agent. In [67], the authors considered a three-lane highway with a lane-based grid representation as the observation space and a simple action space of four discrete choices (left, right, speed up, none), and used the reward function to achieve cooperative and non-cooperative behaviors. Not only the classic performance indicators of the ego vehicle are considered in the reward function, but also the speed of the surrounding traffic, which is naturally affected by the behavior of the agent. The underlying network uses two convolutional layers with 16 filters of patch size (2,2) and ReLU activation, and two dense layers with 500 neurons each. To evaluate the effects of the cooperative behavior, the authors collected traffic data by virtual loops in the simulation and visualized the performance of the resulting traffic in the classic flow-density diagram (see Fig. 12). It is shown that the cooperative behavior results in higher traffic flow, hence better highway capacity and lower overall travel time.
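The reported architecture (two convolutional layers with 16 filters of size 2x2 and ReLU, followed by two dense layers of 500 units) could look roughly like the PyTorch sketch below; the grid dimensions, the input channel count, and the output size are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class LaneGridQNet(nn.Module):
    """Rough reconstruction of the architecture reported in [67]:
    2 x conv(16 filters, 2x2, ReLU) + 2 x dense(500).  The grid shape
    (in_channels, lanes, cells) and n_actions=4 are assumptions."""
    def __init__(self, in_channels=1, lanes=3, cells=20, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=2), nn.ReLU(),
            nn.Flatten(),
        )
        flat = 16 * (lanes - 2) * (cells - 2)   # two 2x2 convs shrink each dimension by 2
        self.head = nn.Sequential(
            nn.Linear(flat, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, n_actions),
        )

    def forward(self, grid):                    # grid: (batch, in_channels, lanes, cells)
        return self.head(self.features(grid))
```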


Fig. 12: Flow-density relations detected by the virtual loops under different strategies (source: [67])

The realism of the models could still differentiate end-to-end solutions. For example, in [44], instead of using the nonholonomic Ackermann steering geometry, the authors use a holonomic robot model for the action space, which highly reduces the complexity of the control problem. Their actions are Acceleration, Deceleration, Change lane to the left, Change lane to the right, and Take no action, where the first two apply maximal acceleration and deceleration, while the two lane-changing actions simply use constant-speed lateral movements. They use Dueling DQN and prioritized experience replay with a grid-based observation model. A similar control method with nonholonomic kinematics is used in [41]. The importance of this research is that it considers safety aspects during the learning process. By using an MPC-like safety check, the agent avoids actions that lead to a collision, which makes the training faster and more robust.
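Since [44] relies on Dueling DQN, a short reminder of the dueling decomposition may be useful: the Q-values are assembled from a state-value stream and a mean-centered advantage stream [15]. The head below is a generic sketch, not the network of [44].

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s,a) = V(s) + A(s,a) - mean_a A(s,a), the dueling architecture of [15]."""
    def __init__(self, feature_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)
        self.advantage = nn.Linear(feature_dim, n_actions)

    def forward(self, features):
        v = self.value(features)                       # (batch, 1)
        a = self.advantage(features)                   # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)     # (batch, n_actions)
```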

Using nonholonomic kinematics requires acceleration and steering commands. In [70], [46], the authors used a continuous observation space of the structured information of the surrounding vehicles and a policy-gradient RL structure to achieve end-to-end driving. Since the utilized method has a discrete action space, the steering and acceleration commands needed to be quantized. The complexity of driving in traffic with an end-to-end solution can be well examined through the number of training episodes needed by the agent: while in simple lane-keeping scenarios the agents finished the task in a few hundred episodes, the agents used for these problems needed 300,000.
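One common way to perform the quantization mentioned above is to form a Cartesian grid of a few steering and acceleration levels and let the discrete policy index into it; the level values below are placeholders, not those of [70], [46].

```python
import itertools

# illustrative discretization levels (not taken from [70], [46])
STEERING_LEVELS = (-0.3, -0.1, 0.0, 0.1, 0.3)   # [rad]
ACCEL_LEVELS = (-3.0, -1.0, 0.0, 1.0, 2.0)      # [m/s^2]

# each discrete action index maps to one (steering, acceleration) pair
DISCRETE_ACTIONS = list(itertools.product(STEERING_LEVELS, ACCEL_LEVELS))

def to_continuous(action_index):
    """Map the index chosen by the discrete policy back to the commands."""
    steering, acceleration = DISCRETE_ACTIONS[action_index]
    return steering, acceleration
```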

IV. FUTURE CHALLENGES

The recent achievements in the field show that different deep reinforcement learning techniques can be effectively used on different levels of the motion planning problem of autonomous vehicles, though many questions remain unanswered. The main advantage of these methods is that they can handle unstructured data such as raw or slightly pre-processed radar or camera-based image information.

However, using neural networks and deep learning techniques as universal function approximators in automotive systems poses several questions. As stated in [81], function development for automotive applications realized in electronic control units (ECUs) is subject to proprietary OEM norms and several international standards, such as Automotive SPICE (Software Process Improvement and Capability Determination) [82] and ISO 26262 [83]. However, these standards are still far from addressing deep learning with dedicated statements, since verification and validation are not solved issues in this domain. Some papers deal with these issues by using an underlying safety layer, which verifies the safety of a planned trajectory before the vehicle control system executes it. However, full functional safety coverage can not be guaranteed in complex scenarios this way.

One of the main benefits of using deep neural networks trained by a reinforcement learning agent in motion planning is the relatively low computational requirement of the trained network. However, this property comes at the price of a vast amount of trials in the learning phase to gain enough experience; as mentioned before, for simple convex optimization problems, the convergence of the process is fast, but for complex scenarios, the training can quickly reach millions of steps, meaning that one setup of hyper-parameters or reward hypothesis can last hours or even days. Since complicated reinforcement learning tasks need continuous iteration on the environment design, network structure, reward scheme, or even the used algorithm itself, designing such a system is a time-consuming project. Besides the appropriate result analysis and inference, the evaluation time highly depends on the assigned computational capacities. On this basis, it is not a surprise that most papers nowadays deal with minor subtasks of the motion planning problem, and the most complex scenarios, such as navigating in urban traffic, can not be found in the literature.

By examining the observation element of the recent articles, it can be stated that most studies ignore complex sensor models. Some papers use ”ground truth” environment representations or ”ideal” sensor models, and only a few articles utilize sensor noise. On the one hand, transferring the knowledge acquired from ideal observations to real-world applications poses several feasibility questions [84]; on the other hand, using noisy or erroneous models could lead to actually more robust agents, as stated in [42].

The same applies to the environment, which can be examined best amongst the group of highway learners, where the road topology is almost always fixed, and the surrounding vehicles' behavior is limited. Validation of these agents is usually made in the same environment setup, which contradicts the basic techniques of machine learning, where the training and validation scenarios should differ in some aspects. As a reinforcement learning agent can generally act well only in situations that are close to those it has experienced, it is crucial to focus on developing more realistic and diverse environments, including the modeling level of any interacting traffic participant, to achieve agents that are easily transferable to real-world applications. The same applies to vehicle dynamics, where more diverse and more realistic modeling would be needed. Naturally, these improvements increase the numerical complexity of the environment model, which is one of the main issues in these applications.


Tending towards mixed or hierarchical system design, combining classic control approaches and deep RL, would be a solution to this problem in the future. Also, the use of extended learning techniques, such as curriculum learning, transfer learning, or AlphaGo-like planning agents, would profoundly affect the efficiency of these projects.

Overall, it can be said that many problems need to be solved in this field, such as the detail of the environment and sensor modeling, the computational requirements, the transferability to real applications, and the robustness and validation of the agents. Because of these issues, it is hard to predict whether reinforcement learning is an appropriate tool for automotive applications.

ACKNOWLEDGMENT

The research reported in this paper was supported by the Higher Education Excellence Program of the Ministry of Human Capacities in the frame of the Artificial Intelligence research area of the Budapest University of Technology and Economics (BME FIKPMI/FM).

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” 12 2013. [Online]. Available: http://arxiv.org/abs/1312.5602

[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2 2015. [Online]. Available: http://www.nature.com/articles/nature14236

[3] B. Paden, M. Cap, S. Z. Yong, D. Yershov, and E. Frazzoli, “A Survey of Motion Planning and Control Techniques for Self-driving Urban Vehicles,” CoRR, pp. 1–27, 2016. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7490340

[4] H. Bast, D. Delling, A. Goldberg, M. Muller-Hannemann, T. Pajor, P. Sanders, D. Wagner, and R. F. Werneck, “Route Planning in Transportation Networks,” in Lecture Notes in Computer Science. Springer, Cham, 2016, pp. 19–80.

[5] S. Brechtel, T. Gindele, and R. Dillmann, “Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs,” in 17th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 10 2014, pp. 392–399. [Online]. Available: http://ieeexplore.ieee.org/document/6957722/

[6] J. Wiest, M. Hoffken, U. Kresel, and K. Dietmayer, “Probabilistic trajectory prediction with Gaussian mixture models,” in 2012 IEEE Intelligent Vehicles Symposium. IEEE, 6 2012, pp. 141–146. [Online]. Available: http://ieeexplore.ieee.org/document/6232277/

[7] Y. Dou, F. Yan, and D. Feng, “Lane changing prediction at highway lane drops using support vector machine and artificial neural network classifiers,” in 2016 IEEE International Conference on Advanced Intelligent Mechatronics (AIM). IEEE, 7 2016, pp. 901–906. [Online]. Available: http://ieeexplore.ieee.org/document/7576883/

[8] J. H. Reif, “Complexity of the mover's problem and generalizations,” in 20th Annual Symposium on Foundations of Computer Science (sfcs 1979). IEEE, 10 1979, pp. 421–427. [Online]. Available: http://ieeexplore.ieee.org/document/4568037/

[9] F. Hegedus, T. Becsi, S. Aradi, and G. Galdi, “Hybrid Trajectory Planning for Autonomous Vehicles using Neural Networks,” in 2018 IEEE 18th International Symposium on Computational Intelligence and Informatics (CINTI). IEEE, 11 2018, pp. 000025–000030. [Online]. Available: https://ieeexplore.ieee.org/document/8928220/

[10] D. M. Saxena, S. Bae, A. Nakhaei, K. Fujimura, and M. Likhachev, “Driving in Dense Traffic with Model-Free Reinforcement Learning,” 9 2019. [Online]. Available: http://arxiv.org/abs/1909.06710

[11] A. Feher, S. Aradi, F. Hegedus, T. Becsi, and P. Gaspar, “Hybrid DDPG Approach for Vehicle Motion Planning,” in Proceedings of the 16th International Conference on Informatics in Control, Automation and Robotics. SCITEPRESS - Science and Technology Publications, 2019, pp. 422–429.

[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2017.

[13] R. Bellman, Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[14] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[15] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling Network Architectures for Deep Reinforcement Learning,” 11 2015. [Online]. Available: http://arxiv.org/abs/1511.06581

[16] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 2014, pp. 387–395.

[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 9 2015. [Online]. Available: http://arxiv.org/abs/1509.02971

[18] Z. Qiao, K. Muelling, J. M. Dolan, P. Palanisamy, and P. Mudalige, “Automatically Generated Curriculum based Reinforcement Learning for Autonomous Vehicles in Urban Environment,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2018, pp. 1233–1238. [Online]. Available: https://ieeexplore.ieee.org/document/8500603/

[19] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer, “Cooperation-Aware Reinforcement Learning for Merging in Dense Traffic,” 6 2019. [Online]. Available: http://arxiv.org/abs/1906.11021

[20] M. Kaushik, V. Prasad, K. M. Krishna, and B. Ravindran, “Overtaking Maneuvers in Simulated Highway Driving using Deep Reinforcement Learning,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2018, pp. 1885–1890. [Online]. Available: https://ieeexplore.ieee.org/document/8500718/

[21] A. Ferdowsi, U. Challita, W. Saad, and N. B. Mandayam, “Robust Deep Reinforcement Learning for Security and Safety in Autonomous Vehicle Systems,” in IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC, vol. 2018-November. Institute of Electrical and Electronics Engineers Inc., 12 2018, pp. 307–312.

[22] X. Ma, K. Driggs-Campbell, and M. J. Kochenderfer, “Improved Robustness and Safety for Autonomous Vehicle Control with Adversarial Reinforcement Learning,” in IEEE Intelligent Vehicles Symposium, Proceedings, vol. 2018-June. Institute of Electrical and Electronics Engineers Inc., 10 2018, pp. 1665–1671.

[23] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intelligence, vol. 101, no. 1-2, pp. 99–134, 5 1998. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S000437029800023X

[24] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic vehicle models for autonomous driving control design,” in 2015 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2015, pp. 1094–1099. [Online]. Available: http://ieeexplore.ieee.org/document/7225830/

[25] P. Polack, F. Altche, B. D'Andrea-Novel, and A. de La Fortelle, “The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?” in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2017, pp. 812–818. [Online]. Available: http://ieeexplore.ieee.org/document/7995816/

[26] C. You, J. Lu, D. Filev, and P. Tsiotras, “Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning,” Robotics and Autonomous Systems, vol. 114, pp. 1–18, 4 2019. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0921889018302021

[27] A. Kesting, M. Treiber, and D. Helbing, “General Lane-Changing Model MOBIL for Car-Following Models,” Transportation Research Record: Journal of the Transportation Research Board, vol. 1999, no. 1, pp. 86–94, 1 2007. [Online]. Available: http://journals.sagepub.com/doi/10.3141/1999-10

[28] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in empirical observations and microscopic simulations,” Physical Review E, vol. 62, no. 2, pp. 1805–1824, 8 2000. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.62.1805

[29] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent Development and Applications of SUMO - Simulation of Urban MObility,” International Journal On Advances in Systems and Measurements, vol. 5, no. 3, pp. 128–138, 2012.

[30] M. Fellendorf and P. Vortisch, “Microscopic Traffic Flow Simulator VISSIM,” in Fundamentals of Traffic Simulation. International Series in Operations Research & Management Science, J. Barcelo, Ed. Springer, 2010, pp. 63–93.

[31] Y. Ye, X. Zhang, and J. Sun, “Automated Vehicle's behavior decision making using deep reinforcement learning and high-fidelity simulation environment,” Transportation Research Part C, vol. 107, no. May, pp. 155–170, 2019. [Online]. Available: https://doi.org/10.1016/j.trc.2019.08.011

[32] B. Wymann, E. Espie, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, “TORCS: The Open Racing Car Simulator,” 2014. [Online]. Available: http://www.torcs.org

[33] “CarSIM, Mechanical Simulation Corporation.” [Online]. Available: https://www.carsim.com/

[34] “CarMaker, IPG Automotive.” [Online]. Available: https://ipg-automotive.com/products-services/simulation-software/carmaker/

[35] H. An and J.-i. Jung, “Decision-Making System for Lane Change Using Deep Reinforcement Learning in Connected and Automated Driving,” Electronics, vol. 8, no. 5, p. 543, 5 2019. [Online]. Available: https://www.mdpi.com/2079-9292/8/5/543

[36] J. Wang, Q. Zhang, D. Zhao, and Y. Chen, “Lane Change Decision-making through Deep Reinforcement Learning with Rule-based Constraints,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 7 2019, pp. 1–6. [Online]. Available: http://arxiv.org/abs/1904.00231; https://ieeexplore.ieee.org/document/8852110/

[37] “Welcome to Udacity's Self-Driving Car Simulator.” [Online]. Available: https://github.com/udacity/self-driving-car-sim

[38] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An Open Urban Driving Simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–17.

[39] F. Rosique, P. J. Navarro, C. Fernandez, and A. Padilla, “A Systematic Review of Perception System and Simulators for Autonomous Vehicles Research,” Sensors, vol. 19, no. 3, p. 648, 2 2019. [Online]. Available: http://www.mdpi.com/1424-8220/19/3/648

[40] K. Kashihara, “Deep Q learning for traffic simulation in autonomous driving at a highway junction,” in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 10 2017, pp. 984–988. [Online]. Available: http://ieeexplore.ieee.org/document/8122738/

[41] S. Nageshrao, H. E. Tseng, and D. Filev, “Autonomous Highway Driving using Deep Reinforcement Learning,” in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC). IEEE, 10 2019, pp. 2326–2331. [Online]. Available: http://arxiv.org/abs/1904.00035; https://ieeexplore.ieee.org/document/8914621/

[42] A. Alizadeh, M. Moghadam, Y. Bicer, N. K. Ure, U. Yavas, and C. Kurtulus, “Automated Lane Change Decision Making using Deep Reinforcement Learning in Dynamic and Uncertain Highway Environment,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 10 2019, pp. 1399–1404. [Online]. Available: https://ieeexplore.ieee.org/document/8917192/

[43] A. Feher, S. Aradi, and T. Becsi, “Q-learning based Reinforcement Learning Approach for Lane Keeping,” in 2018 IEEE 18th International Symposium on Computational Intelligence and Informatics (CINTI). Budapest: IEEE, 11 2018, pp. 000031–000036. [Online]. Available: https://ieeexplore.ieee.org/document/8928230/

[44] Z. Bai, W. Shangguan, B. Cai, and L. Chai, “Deep Reinforcement Learning Based High-level Driving Behavior Decision-making Model in Heterogeneous Traffic,” in 2019 Chinese Control Conference (CCC). IEEE, 7 2019, pp. 8600–8605. [Online]. Available: http://arxiv.org/abs/1902.05772; https://ieeexplore.ieee.org/document/8866005/

[45] P. Wolf, K. Kurzer, T. Wingert, F. Kuhnt, and J. M. Zollner, “Adaptive Behavior Generation for Autonomous Driving using Deep Reinforcement Learning with Compact Semantic States,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2018, pp. 993–1000. [Online]. Available: https://ieeexplore.ieee.org/document/8500427/

[46] S. Aradi, T. Becsi, and P. Gaspar, “Policy Gradient Based Reinforcement Learning Approach for Autonomous Highway Driving,” in 2018 IEEE Conference on Control Technology and Applications (CCTA). IEEE, 8 2018, pp. 670–675. [Online]. Available: https://ieeexplore.ieee.org/document/8511514/

[47] X. Xu, L. Zuo, X. Li, L. Qian, J. Ren, and Z. Sun, “A Reinforcement Learning Approach to Autonomous Decision Making of Intelligent Vehicles on Highways,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. December, 2018.

[48] P. Wang, C. Y. Chan, and A. De La Fortelle, “A Reinforcement Learning Based Approach for Automated Lane Change Maneuvers,” in IEEE Intelligent Vehicles Symposium, Proceedings, vol. 2018-June. Institute of Electrical and Electronics Engineers Inc., 10 2018, pp. 1379–1384.

[49] M. P. Ronecker and Y. Zhu, “Deep Q-Network Based Decision Making for Autonomous Driving,” in 2019 3rd International Conference on Robotics and Automation Sciences (ICRAS). IEEE, 6 2019, pp. 154–160. [Online]. Available: https://ieeexplore.ieee.org/document/8808950/

[50] M. Zhu, Y. Wang, J. Hu, X. Wang, and R. Ke, “Safe, Efficient, and Comfortable Velocity Control based on Reinforcement Learning for Autonomous Driving,” 1 2019. [Online]. Available: http://arxiv.org/abs/1902.00089

[51] M. Zhu, X. Wang, and Y. Wang, “Human-like autonomous car-following model with deep reinforcement learning,” Transportation Research Part C: Emerging Technologies, vol. 97, pp. 348–368, 12 2018.

[52] C. J. Hoel, K. Wolff, and L. Laine, “Automated Speed and Lane Change Decision Making using Deep Reinforcement Learning,” in IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC, vol. 2018-November. Institute of Electrical and Electronics Engineers Inc., 12 2018, pp. 2148–2155.

[53] E. Leurent, “A Survey of State-Action Representations for Autonomous Driving,” HAL report hal-01908175, 2018.

[54] P. Wolf, C. Hubschneider, M. Weber, A. Bauer, J. Hartl, F. Durr, and J. M. Zollner, “Learning how to drive in a real world simulation with deep Q-Networks,” in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2017, pp. 244–250. [Online]. Available: http://ieeexplore.ieee.org/document/7995727/

[55] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, and F. Nashashibi, “End-to-End Race Driving with Deep Reinforcement Learning,” in Proceedings - IEEE International Conference on Robotics and Automation. Institute of Electrical and Electronics Engineers Inc., 9 2018, pp. 2070–2075.

[56] E. Perot, M. Jaritz, M. Toromanoff, and R. d. Charette, “End-to-End Driving in a Realistic Racing Game with Deep Reinforcement Learning,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 7 2017, pp. 474–475. [Online]. Available: http://ieeexplore.ieee.org/document/8014798/

[57] D. Li, D. Zhao, Q. Zhang, and Y. Chen, “Reinforcement Learning and Deep Learning Based Lateral Control for Autonomous Driving [Application Notes],” IEEE Computational Intelligence Magazine, vol. 14, no. 2, pp. 83–98, 5 2019.

[58] S. Kotyan, D. V. Vargas, and U. Venkanna, “Self Training Autonomous Driving Agent,” in 2019 58th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE). IEEE, 9 2019, pp. 1456–1461. [Online]. Available: http://arxiv.org/abs/1904.12738; https://ieeexplore.ieee.org/document/8859883/

[59] N. Xu, B. Tan, and B. Kong, “Autonomous Driving in Reality with Reinforcement Learning and Image Translation,” 1 2018. [Online]. Available: http://arxiv.org/abs/1801.05299

[60] J. Lee, T. Kim, and H. J. Kim, “Autonomous lane keeping based on approximate Q-learning,” in 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI). IEEE, 6 2017, pp. 402–405. [Online]. Available: http://ieeexplore.ieee.org/document/7992762/

[61] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer, vol. 22, no. 6, pp. 46–57, 6 1989. [Online]. Available: http://ieeexplore.ieee.org/document/30720/

[62] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L.-E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, “Stanley: The robot that won the DARPA Grand Challenge,” Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 9 2006. [Online]. Available: http://doi.wiley.com/10.1002/rob.20147

[63] N. Deo and M. M. Trivedi, “Multi-Modal Trajectory Prediction of Surrounding Vehicles with Maneuver based LSTMs,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2018, pp. 1179–1184. [Online]. Available: https://ieeexplore.ieee.org/document/8500493/

[64] T. Hegedus, B. Nemeth, and P. Gaspar, “Graph-based Multi-Vehicle Overtaking Strategy for Autonomous Vehicles,” IFAC-PapersOnLine, vol. 52, no. 5, pp. 372–377, 2019. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S2405896319306822

[65] A. Folkers, M. Rick, and C. Buskens, “Controlling an Autonomous Vehicle with Deep Reinforcement Learning,” in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 6 2019, pp. 2025–2031. [Online]. Available: https://ieeexplore.ieee.org/document/8814124/

[66] J. Esser and M. Schreckenberg, “Microscopic Simulation of Urban Traffic Based on Cellular Automata,” International Journal of Modern Physics C, vol. 08, no. 05, pp. 1025–1036, 10 1997. [Online]. Available: https://www.worldscientific.com/doi/abs/10.1142/S0129183197000904

[67] G. Wang, J. Hu, Z. Li, and L. Li, “Cooperative Lane Changing via Deep Reinforcement Learning,” 2019. [Online]. Available: http://arxiv.org/abs/1906.08662

[68] T. Shi, P. Wang, X. Cheng, and C.-Y. Chan, “Driving Decision and Control for Autonomous Lane Change based on Deep Reinforcement Learning,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC). Auckland, New Zealand: IEEE, 2019. [Online]. Available: http://arxiv.org/abs/1904.10171

[69] P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 10 2017, pp. 1–6. [Online]. Available: http://ieeexplore.ieee.org/document/8317735/

[70] T. Becsi, S. Aradi, A. Feher, J. Szalay, and P. Gaspar, “Highway Environment Model for Reinforcement Learning,” IFAC-PapersOnLine, vol. 51, no. 22, pp. 429–434, 2018. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S2405896318333032

[71] M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer, and J. Tumova, “Reinforcement Learning with Probabilistic Guarantees for Autonomous Driving,” 2019. [Online]. Available: http://arxiv.org/abs/1904.07189

[72] D. Loiacono, A. Prete, P. L. Lanzi, and L. Cardamone, “Learning to overtake in TORCS using simple reinforcement learning,” in 2010 IEEE World Congress on Computational Intelligence, WCCI 2010 - 2010 IEEE Congress on Evolutionary Computation, CEC 2010, 2010.

[73] D. C. K. Ngai and N. H. C. Yung, “Automated Vehicle Overtaking based on a Multiple-Goal Reinforcement Learning Framework,” in 2007 IEEE Intelligent Transportation Systems Conference. IEEE, 9 2007, pp. 818–823. [Online]. Available: http://ieeexplore.ieee.org/document/4357682/

[74] ——, “A multiple-goal reinforcement learning method for complex vehicle overtaking maneuvers,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 509–522, 2011.

[75] C. Desjardins and B. Chaib-draa, “Cooperative Adaptive Cruise Control: A Reinforcement Learning Approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1248–1260, 12 2011. [Online]. Available: http://ieeexplore.ieee.org/document/5876320/

[76] M. Gomez, R. V. Gonzalez, T. Martinez-Marin, D. Meziat, and S. Sanchez, “Optimal motion planning by reinforcement learning in autonomous mobile vehicles,” Robotica, vol. 30, no. 2, pp. 159–170, 3 2012.

[77] Y. Ye, X. Zhang, and J. Sun, “Automated Vehicle's behavior decision making using deep reinforcement learning and high-fidelity simulation environment,” Tech. Rep.

[78] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-End Deep Reinforcement Learning for Lane Keeping Assist,” 12 2016. [Online]. Available: http://arxiv.org/abs/1612.04340

[79] P. Wang and C.-Y. Chan, “Autonomous Ramp Merge Maneuver Based on Reinforcement Learning with Continuous Action Space,” 3 2018. [Online]. Available: http://arxiv.org/abs/1803.09203

[80] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer, “Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving,” pp. 1–12, 5 2019. [Online]. Available: http://arxiv.org/abs/1905.02680

[81] F. Falcini, G. Lami, and A. M. Costanza, “Deep Learning in Automotive Software,” IEEE Software, vol. 34, no. 3, pp. 56–63, 5 2017. [Online]. Available: https://ieeexplore.ieee.org/document/7927925/

[82] “Automotive SPICE process assessment/reference model,” 2015.

[83] “ISO 26262, Road Vehicles—Functional Safety—Part 1: Vocabulary,” 2011.

[84] Z. Szalay, T. Tettamanti, D. Esztergar-Kiss, I. Varga, and C. Bartolini, “Development of a Test Track for Driverless Cars: Vehicle Design, Track Configuration, and Liability Considerations,” Periodica Polytechnica Transportation Engineering, vol. 46, no. 1, p. 29, 3 2018. [Online]. Available: https://pp.bme.hu/tr/article/view/10753