
CSCRS Road Safety Fellowship Report: A Human-Machine Collaborative Acceleration Controller Attained from Pixel Learning and Evolution Strategies

Fangyu Wu 1, Alexandre M. Bayen 2

Abstract— Autonomous driving has been a major area of research in transportation since the mid-1980s. A central motivation of this new paradigm is that it promises greater safety and efficiency compared to human drivers. However, despite recent remarkable progress, current autonomous driving systems still require improvements. Inspired by the success of autopilot systems in flight control, we propose to approach the problem from the perspective of vehicle control augmentation. Specifically, we adapt a state-of-the-art reinforcement learning (RL) algorithm called Evolution Strategies (ES) to train an auxiliary acceleration controller on a pixelated local observation space. By augmenting the actions of a human driver modeled by the Intelligent Driver Model (IDM), we show that the RL controller is able to help humans drive more than 50% more efficiently than they would without augmentation. This preliminary study suggests that this type of hybrid approach may be used to develop user-specific adaptive driving controllers and provably safe autonomous vehicles. For reproducibility, the source code of this project is released at https://github.com/flow-project/flow.

I. Introduction

With the recent rapid development of autonomous vehicles, there has been a wave of technological innovations in the automobile industry, such as Google's Waymo, Tesla's Autopilot, and General Motors' Super Cruise. While these companies aim to develop driving systems that could ultimately replace human drivers, recent news of fatal traffic accidents involving these experimental automated vehicles suggests that progress remains to be made in automation [1].

1 Fangyu Wu is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94709, USA, [email protected]

2 Alexandre M. Bayen is with the Department of Electrical Engineering and Computer Sciences and the Institute of Transportation Studies, University of California, Berkeley, Berkeley, CA 94709, USA, [email protected]

The social implications of full vehicle automation are also concerning. One of the fundamental assumptions of the current traffic rules is that a vehicle must be operated with complete human supervision. Hence, certification of autonomous vehicles will certainly involve a long and challenging legal process.

Motivated by the technological and societal challenges mentioned above, we draw inspiration from the aviation industry. Aircraft equipped with autopilot systems are among the safest modes of transportation. However, their high performance and robustness were not built overnight. Instead, autopilots initially started as pilot-assistance systems and were only slowly upgraded over time. The introduction of new features was incremental and always tested. Such a conservative development path ensured a safe and steady transition from early manual control systems to modern autonomous control systems.

Inspired by the success of flight autopilot systems, we propose to adopt a framework called augmented driving to facilitate the transition from human-operated driving to fully autonomous driving. Augmented driving is a vehicle control system that enhances the performance of an existing driving controller, such as a human driver or an adaptive cruise controller.

In the short term, by combining the strengths of the host (the existing controller, such as a human driver) and the assistant (the auxiliary controller), the goal is to develop a vehicle controller that is safer and more efficient than either operating in isolation. In the long term, such a framework provides a tangible path for technological development and social change. With it, automotive companies would be able to focus on realistic incremental improvements. Over time, as the assistant is steadily and incrementally improved, the host will delegate more and more tasks to it. Eventually, as the auxiliary controller becomes powerful enough, it will effectively take over the role of the host. Since augmented driving systems will not replace humans during the transition phase, society has a chance to adjust its legal structures to this technological shift.

In this article, we formally define the problem of driving augmentation and propose a solution based on a state-of-the-art evolutionary RL algorithm named ES [2]. The key contributions of this article are as follows:

• We develop an end-to-end learning algorithm that can train a vehicle acceleration controller with only pixel inputs and system-evaluated rewards.

• We demonstrate that the image-based assistive acceleration controller improves the baseline human driving model IDM by over 50%.

• We suggest that in the long run augmented driving can be a viable strategy to design adaptive and safe driving controllers.

The remainder of this article is organized as follows. We begin with a brief overview of the literature in Section II. In Section III, we formulate the problem in mathematical terms. Section IV proposes a solution based on a state-of-the-art RL algorithm, ES, coupled with a car-following model, IDM. The numerical experiments are presented in Section V and discussed in Section VI. Finally, we conclude our study in Section VII.

II. Background

Automated vehicle control has been an important research area in robotics and control. Over the past ten years, there has been remarkable technological progress in this field, such as the early success of Team Stanley in the DARPA Grand Challenge [3], the release of the Autopilot assisted driving system by Tesla [4], and the release of the autonomous taxi program by Waymo [5].

Among the technologies that enable high-performance autonomous vehicles, RL has played a significant role. By imitating the behaviors of human drivers, researchers from NVIDIA show that a neural network can learn to drive in simple scenarios directly from camera pixels [6], [7]. With more advanced RL algorithms such as asynchronous advantage actor-critic [8], deep Q-learning [9], and evolution strategies [2], one may be able to train a functional driving controller with higher sample efficiency and lower generalization error than the plain imitation learning approach.

Among the existing RL-based autonomous driving methods, a common theme is to train a vehicle to obey the traffic rules [7] and to arrive at the destination as fast as possible [10]. While these goals are essential for a vehicle to operate in the real world, they do not address the other half of the promise of vehicle automation, i.e., using autonomous vehicles to reduce road fatalities and to improve transportation efficiency. To see this, note that under such an objective an autonomous driving controller may learn to drive selfishly, e.g., with frequent passing and aggressive tailgating. Although behaving this way might let the vehicle arrive at the destination faster, it puts public safety at risk and impacts other drivers' travel time. Therefore, to reduce accidents and increase throughput, one needs to look beyond the paradigm of learning to drive and learning to drive fast.

A next-level question to ask is how to train autonomous vehicles to drive in mixed-autonomy traffic with the goal of maximizing navigation safety and road throughput. To this end, recent field experiments have demonstrated that it is possible to use a small percentage of autonomous vehicles to improve the overall traffic flow. In one of the experiments, the researchers show that a manually designed proportional-integral controller with saturation can significantly improve the overall traffic flow on a single-lane circular track with only one autonomous vehicle [11].

Recently, numerical studies have shown that model-free RL can achieve the same level of performance as the hand-tuned controller [12], [13], [14], [15] in and beyond the original setup of [11], i.e., idealized traffic on a single-lane circular track. Using state-of-the-art RL algorithms, these studies demonstrate that model-free RL is capable of finding interesting driving strategies in more complex traffic networks, including intersections, merge points, and roundabouts.

One of the characteristics of the work in [12], [13] is that it assumes perfect information about the system in terms of vehicle speeds, spacings, and positions at test time. However, in real-world applications, such data is not easy to acquire, and measurements are often noisy when driving at high speed or in dense urban areas.

In addition to the complications arising from imperfect information, feature engineering becomes difficult in complex urban settings, where the topology of the road network becomes complicated and the state of the drivers becomes high-dimensional. For example, it is challenging to define the notion of leaders and followers in multi-lane traffic: a taxi driver who is looking for customers may not necessarily follow the vehicle in front, and a careful truck driver who is about to change lanes may pay more attention to the vehicles in the target lane than to the vehicle behind.

To address the issues of measurement difficulty and feature engineering, a natural solution is to work directly with raw sensor inputs. For many autonomous driving systems, such inputs take the form of N-dimensional images centered at the vehicle's current position. Such a representation is convenient since it directly captures everything within the system without explicit state estimation or feature engineering.

Given image inputs, a good choice for the RL algorithm is the ES method. Image-based end-to-end RL has been proven a viable strategy since the success of deep Q-learning [16]. Later work [2] has shown that the ES method has comparable performance to deep Q-learning but is more convenient for parallel computing on clusters. Given the availability of massive cloud computing infrastructure, we therefore choose ES as our RL algorithm.

Motivated by the discussion above, we choose to study the performance of augmented driving controllers in an image-based state space using an end-to-end learning process based on the ES method. We formally formulate the problem in the following section.

III. Problem Formulation

As commonly adopted in RL, we define the problem as a discrete-time partially observed Markov decision process (POMDP). By definition, a POMDP consists of a state space S, an action space A, an observation space O, a transition probability T, an emission probability E, and a reward r. In our specific problem, the POMDP {S, A, O, T, E, r, ρ0} takes the following form.

• S ⊆ R^{h×w×c×t} is the state space with height h, width w, channel c, and memory t. As we are using grayscale images, c = 1. The memory is t = 50, since the memory buffer stores the past 5 seconds at a temporal resolution of 0.1 second.

• A ⊆ R is the action space, corresponding to the correction to the vehicle's baseline longitudinal acceleration. Hence, the resulting combined longitudinal acceleration a is a linear combination of the baseline acceleration a0 and the RL correction term a+ weighted by a controller parameter α,

$$ a = (1 - \alpha)\, a_0 + \alpha\, a_+, $$

where α ∈ [0, 1] is the augmentation coefficient; a0 and a+ are bounded control inputs between −5 m/s² and 3 m/s².

• O ⊆ R^{h′×w′×c′×t′} is the observation space with height h′, width w′, channel c′, and memory t′. Specifically, we choose h′ = 100, w′ = 100, c′ = 1, and t′ = 5. See Figure 1 for an example of the geometric relation between the observation space O and the state space S. Note that the images are rendered at two pixels per meter and the radius of the observation space is set to 25 meters. As shown in Figure 2, the observation space stores the temporal information from the past five seconds at a temporal resolution of one second.

• T : S × A × S → [0, 1] is the transition probability, which assigns a probability to each transition from a state-action pair to a next state. Here the POMDP assumes that the next state depends only on the current state and action.

• E : S × O → [0, 1] is the emission probability, which maps a state and an observation to the probability of that observation.

• r : S × A → R is the reward function. The specific form of the reward function is problem-dependent, but it should capture the two fundamental aspects of driving, i.e., safety and efficiency. For this study, we choose

$$ r = \frac{1}{v_{\max}} \left( \beta\, v_{\mathrm{avg}} - (1 - \beta)\, v_{\mathrm{std}} \right), $$

with

$$ v_{\mathrm{avg}} = \frac{1}{N} \sum_{i=1}^{N} v_i, \qquad v_{\mathrm{std}} = \sqrt{\sum_{i=1}^{N} \left( v_i - \frac{1}{N} \sum_{j=1}^{N} v_j \right)^2}, $$

where v_avg is the average velocity, v_std is the velocity standard deviation, v_i is the current velocity of the i-th vehicle, N is the total number of vehicles in the POMDP, v_max is the maximum allowable velocity, which is set to 30 m/s, and β is a user-defined weighting parameter, set to 0.8 in this article. For training efficiency, v_i is obtained directly from the simulator rather than estimated from the images. Note that we use v_avg to measure the system throughput and v_std to approximate the risk of collisions [17].

• Finally, ρ0 : S → [0, 1] is the initial state distribution.

In RL, our goal is to find a series of actions that maximizes the total reward. We formalize this notion of reward maximization below.

Denote by p(s_{t+1} | s_t, a_t) the transition probability from the state-action pair (s_t, a_t) to the next state s_{t+1}. Let τ be the trajectory of the system from time t to the event horizon T, that is, τ = (s_t, a_t, s_{t+1}, a_{t+1}, ..., a_{T−1}, s_T). Note that the event horizon T spans from the initial time to the time when the POMDP terminates. In our problem, we set T = 1500 s.

The objective of the problem is then to find a function π*(a_t | o_t) : O → A, also known as a policy, such that the expected future reward is maximized:

$$ \underset{a_t, \ldots, a_T}{\arg\max} \; \sum_{i=t}^{T} r_i. \tag{1} $$

The function π* that produces the solution to Equation (1) is defined as the optimal policy.

If we can represent π* with some θ-parametrized function approximator π_θ, then solving Equation (1) is equivalent to solving the following optimization problem:

$$ \underset{\theta}{\arg\max} \; \sum_{i=t}^{T} r_i. \tag{2} $$

With the problem formally stated in Equation (2), we propose our solution in Section IV.
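To make the formulation concrete, the following minimal Python sketch computes the blended acceleration a = (1 − α) a0 + α a+ and the velocity-based reward defined above. The numerical constants are taken from the formulation; the function names and the use of NumPy are illustrative choices, not part of the released Flow code.

```python
import numpy as np

# Control bounds and reward constants taken from the formulation above.
A_MIN, A_MAX = -5.0, 3.0   # bounded accelerations, m/s^2
V_MAX = 30.0               # maximum allowable velocity, m/s
BETA = 0.8                 # weighting between throughput and stability

def blended_acceleration(a_baseline, a_rl, alpha):
    """Combine the baseline (e.g., IDM) acceleration with the RL correction.

    a = (1 - alpha) * a0 + alpha * a_plus, with alpha in [0, 1].
    """
    a0 = np.clip(a_baseline, A_MIN, A_MAX)
    a_plus = np.clip(a_rl, A_MIN, A_MAX)
    return (1.0 - alpha) * a0 + alpha * a_plus

def velocity_reward(velocities):
    """Reward r = (beta * v_avg - (1 - beta) * v_std) / v_max.

    `velocities` holds the current speed of every vehicle in the POMDP.
    """
    v = np.asarray(velocities, dtype=float)
    v_avg = v.mean()
    v_std = np.sqrt(np.sum((v - v_avg) ** 2))  # unnormalized spread, as in the formulation
    return (BETA * v_avg - (1.0 - BETA) * v_std) / V_MAX

# Example: a ring of 22 vehicles cruising near 4 m/s with small fluctuations.
speeds = 4.0 + 0.1 * np.random.randn(22)
print(blended_acceleration(a_baseline=0.6, a_rl=-0.2, alpha=0.5))
print(velocity_reward(speeds))
```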

Fig. 1: Comparison between the global state space and the local observation space. Note that the state space and observation space contain only raw pixels.

Fig. 2: Pixelated local observation space with a five-second memory (five 100 × 100 pixel frames, from (t − 4) s to t s). Note that the memory buffer increments at 1 s while the simulation steps at 0.1 s.

IV. Proposed Solution

We apply the ES method to solve the problem defined in Section III. To achieve this, we need a data collection procedure and an implementation of the ES method. We delineate the technical details of the two processes as follows.

A. Data collection

We demonstrate that our method can learn an effective traffic controller purely from simulated data. To this end, we select the open-source traffic simulation software Simulation of Urban Mobility (SUMO) [18] and its corresponding RL library Flow [12].

To approximate human drivers' behavior, we choose the well-known human-driving model IDM [19], defined as

$$ \frac{\mathrm{d}^2 x}{\mathrm{d} t^2} = a \left( 1 - \left( \frac{v}{v_0} \right)^{\delta} - \left( \frac{s^*(v, \Delta v)}{s} \right)^2 \right), $$

with

$$ s^*(v, \Delta v) = s_0 + v T + \frac{v\, \Delta v}{2 \sqrt{a b}}, $$

where v_0 is the desired velocity, s_0 is the minimum desired spacing, T is the minimum desired headway, a is the maximum vehicle acceleration, b is the deceleration coefficient, and δ is the exponent coefficient. In this study, we use the default IDM parameters in [20]: v_0 = 30 m/s, s_0 = 2 m, T = 1 s, a = 1 m/s², b = 1.5 m/s², and δ = 4.
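For reference, below is a minimal Python sketch of the IDM acceleration under the default parameters listed above; the function signature and variable names are illustrative and are not taken from the Flow implementation.

```python
import math

def idm_acceleration(v, delta_v, s,
                     v0=30.0, s0=2.0, T=1.0, a=1.0, b=1.5, delta=4):
    """Intelligent Driver Model acceleration d^2x/dt^2.

    v:       ego velocity (m/s)
    delta_v: approach rate to the leader, v - v_leader (m/s)
    s:       bumper-to-bumper gap to the leader (m)
    """
    # Desired dynamic gap s*(v, delta_v) = s0 + v*T + v*delta_v / (2*sqrt(a*b))
    s_star = s0 + v * T + (v * delta_v) / (2.0 * math.sqrt(a * b))
    return a * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)

# Example: ego at 10 m/s closing on its leader at 1 m/s with a 15 m gap.
print(idm_acceleration(v=10.0, delta_v=1.0, s=15.0))
```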

Note that the use of IDM as the human-driving model is not a necessity. In fact, one can pick any other human-driving model, or even use an actual human driver; the procedure for training a controller stays the same. In our case, we simply opt for IDM, as it is one of the most widely acknowledged models in the community.

Since neither SUMO nor Flow supports real-time image-based state queries, we have developed a rendering plugin in Flow based on a Python graphics library called Pyglet [21]. It is capable of rendering any SUMO traffic scene at approximately 15 frames per second on a modern computer. With it, one can train a functional RL controller with ES on a 16-CPU machine in less than 4 hours. To facilitate future research, the source code of this plugin is released at https://github.com/flow-project/flow.
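To illustrate how a pixelated local observation of the kind described in Section III could be produced (a 100 × 100 grayscale patch at two pixels per meter within a 25 m radius of the ego vehicle, buffered over five seconds), the sketch below rasterizes nearby vehicle positions onto a grid with plain NumPy. It is a simplified stand-in for the Pyglet-based renderer, which draws the full traffic scene; all function and variable names are hypothetical.

```python
import numpy as np

PIXELS_PER_METER = 2      # rendering resolution from Section III
RADIUS = 25.0             # observation radius in meters
SIZE = int(2 * RADIUS * PIXELS_PER_METER)  # 100 x 100 pixels

def rasterize_local_view(ego_xy, vehicle_xy):
    """Project vehicles within RADIUS of the ego onto a grayscale grid.

    ego_xy:     (x, y) position of the ego vehicle in meters
    vehicle_xy: iterable of (x, y) positions of all vehicles in meters
    """
    frame = np.zeros((SIZE, SIZE), dtype=np.float32)
    rel = np.asarray(vehicle_xy, dtype=float) - np.asarray(ego_xy, dtype=float)
    for dx, dy in rel:
        if abs(dx) < RADIUS and abs(dy) < RADIUS:
            col = int((dx + RADIUS) * PIXELS_PER_METER)
            row = int((dy + RADIUS) * PIXELS_PER_METER)
            frame[min(row, SIZE - 1), min(col, SIZE - 1)] = 1.0
    return frame

# A five-frame memory buffer, updated once per simulated second.
buffer = [np.zeros((SIZE, SIZE), dtype=np.float32)] * 5
new_frame = rasterize_local_view((0.0, 0.0), [(3.0, 1.0), (-8.0, 0.5), (40.0, 0.0)])
buffer = buffer[1:] + [new_frame]          # drop the oldest frame
observation = np.stack(buffer, axis=-1)    # shape (100, 100, 5)
```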

B. Algorithm implementation

For the RL algorithm, we use an off-the-shelf implementation of ES from a Python RL library called RLlib [22]. The library is designed for accelerated RL training by distributing the computation across multiple CPUs. In this study, all the numerical experiments are parallelized on a 16-CPU AWS EC2 instance of type c4.4xlarge. The source code of the implementation is hosted at https://github.com/flow-project/flow and the EC2 AMI hash is ami-04b53bf506e95394f.

For the hyperparameter settings, we determine the most effective values through a grid search over the parameter space. The resulting ES hyperparameters are as follows:

• Evaluation probability: 0.05
• Noise standard deviation: 0.01
• Step size: 0.01
• Iterations per trial: 50
• Number of trials: 3

The training is repeated three times with different random seeds, and the trained model with the highest reward among the three trials is selected.
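For clarity, the sketch below implements the basic ES parameter update of Salimans et al. [2] using the step size and noise standard deviation reported above. The population size, the rank normalization, and the toy objective are illustrative assumptions; our experiments used the distributed RLlib implementation rather than this single-process code.

```python
import numpy as np

SIGMA = 0.01       # noise standard deviation (hyperparameter above)
STEP_SIZE = 0.01   # step size (hyperparameter above)
POPULATION = 40    # number of perturbations per iteration (illustrative)

def es_update(theta, episode_return, rng):
    """One Evolution Strategies iteration on a flat parameter vector.

    theta:          current policy parameters, shape (d,)
    episode_return: function mapping a parameter vector to a scalar return
    """
    noise = rng.standard_normal((POPULATION, theta.size))
    returns = np.array([episode_return(theta + SIGMA * eps) for eps in noise])
    # Rank-normalize returns to make the update insensitive to reward scale.
    ranks = returns.argsort().argsort().astype(float)
    weights = ranks / (POPULATION - 1) - 0.5
    gradient_estimate = weights @ noise / (POPULATION * SIGMA)
    return theta + STEP_SIZE * gradient_estimate

# Toy usage: maximize -||theta - 1||^2 as a stand-in for the episode return.
rng = np.random.default_rng(0)
theta = np.zeros(10)
for _ in range(50):  # iterations per trial, as above
    theta = es_update(theta, lambda p: -np.sum((p - 1.0) ** 2), rng)
print(theta.round(2))
```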

Lastly, we use a neural network as the underlying parameterized model. For the architecture, we use a two-layer convolutional neural network followed by a two-layer multilayer perceptron. The details of the architecture are illustrated in Figure 3.
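A possible PyTorch realization of such a policy network is sketched below. The filter counts, kernel sizes, and hidden width are assumptions made for illustration only, since the exact values appear in Figure 3 and the released code; the sketch shows the overall two-layer CNN plus two-layer MLP structure mapping a five-frame 100 × 100 observation to a single acceleration correction.

```python
import torch
import torch.nn as nn

class AugmentationPolicy(nn.Module):
    """Maps a 100x100 five-frame observation to a scalar acceleration correction."""

    def __init__(self):
        super().__init__()
        # Two convolutional layers over the 5-channel temporal stack (assumed sizes).
        self.features = nn.Sequential(
            nn.Conv2d(5, 16, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            flat = self.features(torch.zeros(1, 5, 100, 100)).shape[1]
        # Two-layer multilayer perceptron producing one output (assumed hidden width).
        self.head = nn.Sequential(
            nn.Linear(flat, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs):
        # obs: (batch, 5, 100, 100) grayscale frames from the memory buffer
        return self.head(self.features(obs)).squeeze(-1)

policy = AugmentationPolicy()
correction = policy(torch.zeros(1, 5, 100, 100))  # one scalar per observation
```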

V. Numerical Experiments

We investigate the effectiveness of augmented driving in two sets of numerical experiments. The first set is conducted on a circular track inspired by the work in [23], as shown in Figure 4a. The second set uses a cyclical cross following the design in the work of [12], as illustrated in Figure 4b. The circular track approximates a single-lane highway driving scenario, while the cyclical intersection resembles a one-way intersection in a dense urban setting. Despite their abstraction from real-world traffic, these two environments provide controlled settings to study the feasibility of pixel-based driving augmentation. For brevity, in the following paragraphs, we denote the circular track as Circle and the cyclical intersection as Cross.

To test the effectiveness of driving augmentation, for each environment we train an RL agent for augmentation coefficients α ranging from 10% to 100% in increments of 10%. The training processes for the 10 agents on Circle are illustrated in Figure 5a, while the training processes for the 10 agents on Cross are shown in Figure 5c. As indicated in the color bars, darker greens represent higher rewards, while lighter greens represent lower rewards. Note that pure white on the left edges of the plots indicates NaN (not a number), which is caused by collisions during the early stage of policy exploration. The figures indicate that high rewards are attained with an augmentation coefficient between 40% and 70%.

Fig. 3: Neural network model architecture. The network takes a five-channel image as input and produces a single number as output, which corresponds to the correction to the human driver's default acceleration.

Fig. 4: Experiment environments. (a) The Circle environment represents a single-lane circular track. (b) The Cross environment represents a self-connected single-lane intersection.

Fig. 5: Training learning heatmaps and learning curves: (a) Circle learning heatmap, (b) Circle learning curves, (c) Cross learning heatmap, (d) Cross learning curves. The heatmaps show reward (m/s) over training iterations for each augmentation coefficient alpha (%); the curves show reward (m/s) over iterations for α from 0% to 100%. Both environments achieve better performance at an augmentation coefficient between 40% and 70%.

TABLE I: Training results of ES on Circle and Cross. The best performance in each environment is attained at an augmentation coefficient of 70%.

α       0%   10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
Circle  210  210  324  336  334  340  342  343  340  336  342
Cross   348  336  353  350  510  548  545  562  420  447  436

As implied by the experiments, RL augmentation can improve the driving efficiency of the baseline IDM driver by more than 50%, as shown in Figure 5b and Figure 5d. The baseline controllers are marked with black dashed lines, while the augmented controllers are displayed as solid colored curves. To be precise, the best performance of each experiment is tabulated in Table I.

Furthermore, note that in the Cross environment the pure RL agent is outperformed by several human-machine collaborative systems. We hypothesize that this gap may diminish if we train the agents for more iterations.

In addition, the results indicate that the training is robust to the choice of the augmentation coefficient. For the Circle environment, any augmentation above 20% achieves approximately the same level of performance. For the Cross environment, any augmentation between 40% and 70% attains close to the best observed value.

VI. Discussion

The numerical results presented above have practical implications for the development of vehicle automation.

• First and foremost, the results indicate that driving automation is a viable path to more effective transportation. As shown in the numerical experiments, the addition of an RL augmentation controller improves the system performance by a large margin. The fact that the optimal augmentation is attained at 70% in Cross also indicates that there exists a set of constrained problems where human-machine collaboration is no worse than full machine control.

• Furthermore, we argue that such an augmentation system is also cognitively safe. Because the augmentation term requires active participation of the human driver throughout the course of driving, the human driver will always stay engaged in the control process. Should an emergency arise on the road that requires human control, the driver will be ready to react. Such human-in-the-loop control should exhibit a shorter reaction time than a human-out-of-the-loop control system and thus will be more likely to avoid potential dangers.

• Moreover, an augmented driving controller may also be adapted to the differences between individual drivers. Because many RL algorithms, such as ES, actor-critic methods, and policy gradient methods, are capable of learning in an online setting, we can continue improving the performance of the RL controller after deployment. Through interacting with the human driver in real time, it may further adapt to the characteristics of that driver. For example, the controller may learn to provide more guidance to a risky driver but exert only minimal intervention for a skilled driver. This additional level of adaptation may allow for a new level of driving safety and efficiency.

• Lastly, the results also indicate a new direction in vehicle controller design. In this article, we augment a human driver with an RL agent. However, one can augment any controller with an RL controller. This sheds light on a new type of driving control: a system consisting of a model-driven baseline controller and a data-driven RL controller. In such a hybrid system, the baseline controller should demonstrate provably safe behavior under any perturbation up to the magnitude of the RL controller's output, while the RL agent is trained with driving data and microscopic simulations to attain statistically significant improvements in efficiency. As a result, the combined system will be both fail-safe, as guaranteed by the hand-designed baseline controller, and highly efficient, as enabled by the RL controller.

VII. Conclusions

In this study, we show that the state-of-the-art RL algorithm ES can be applied to augment humans in the control of vehicle acceleration by taking raw pixel inputs directly, without any state estimation at test time. In an end-to-end fashion, we train a convolutional neural network to improve a human driver's acceleration control. The results indicate that the augmented driver can operate more than 50% more efficiently than without augmentation. Moreover, we find that the system in which human and machine collaborate outperforms both the purely human-operated and the purely machine-operated systems. Such a controller design is also less likely to induce cognitive inactivity, keeping the human driver more alert during driving and thus reducing the likelihood of traffic accidents.


The results suggest many promising directions for future research. Researchers may investigate how to adapt the RL controller to learn from a specific human driver in an online setting. One can also attempt to couple such an RL agent with a fail-safe, manually designed controller such that the combined system is both verifiably safe and statistically high-performing. To build on this research, one can find the source code at https://github.com/flow-project/flow.

VIII. Future Work

This section extends the main research topic presented above. I delineate my recent progress on (1) the development of end-to-end pixel learning in a much more complex road network called the University of Delaware Scaled Smart City (UDSSC) and (2) my preliminary theoretical work on statistical verification of RL algorithms based on the principles of randomized controlled trials (RCT).

A. UDSSC Environment

Towards the first goal, I am working on scaling the proposed RL method from idealized single-agent traffic, such as Circle and Cross, to more realistic multi-agent traffic. To this end, I am adapting the UDSSC environment for multi-agent autonomous vehicle control.

The UDSSC environment is presented in Figure 6. The network is a scaled implementation of a physical robotic navigation testbed at the University of Delaware. It consists of distinct elements of actual road network design, including (1) roundabouts, (2) four-way intersections, (3) T-shaped intersections, (4) multilane segments, and (5) merges.

The goal of the RL agent is to control and coordinate a subset of vehicles in the traffic to increase the average velocity on all road segments without incurring traffic accidents. In this environment, the number of agents can be greater than one, but each agent has the same observation space and action space as the single agent in the study presented above. Figure 6 illustrates a simulated traffic flow on the road network. The objective of the RL algorithm is therefore to make the traffic flow as fast and stable as possible.

Fig. 6: UDSSC environment. Traffic instabilitiessuch as queues and stop-and-go waves form pri-marily near the intersections and roundabouts ofthe network.

As a semi-realistic testbed for vehicle navigation and coordination in real traffic, this environment possesses sufficient detail for its trained agents to be applied to the real world, yet maintains the computational tractability needed to train them within a reasonable amount of time.

B. RL Algorithm Verification

In this section, I discuss the most commonly used method for verifying RL design changes and propose modifications to make it robust to confounding factors. Specifically, this last section serves as a response to a problem that I encountered during the literature review, as detailed in [24]. In that study, the authors introduced two algorithmic improvements to the A3C algorithm. The first addition was a goal-driven hierarchical structure similar to that of [25]. The second addition was a custom reward function. The authors achieved better performance after adding these two features, and thus (prematurely) concluded that both features were meaningful augmentations to the original A3C method [8]. They later found that it was only the custom reward function that was improving performance. Consequently, their previous claim on the effectiveness of the hierarchical structure was rendered invalid.

Indeed, formal causality tests should be required when proposing new RL algorithms. In the following paragraphs, I first formalize the verification approach taken in [24], as it is also quite commonly used in the research community. Next, I show how one might avoid similar research accidents, i.e., the problem of confounding factors, through marginalization over potential confounders.

1) Popular Verification Procedure: I would like to point out the importance of experimental design in the establishment of causal relations. The authors of a new algorithm always need to show through experiments that their design results in better performance than an RL algorithm without such augmentations. However, between the current practice in the RL research community and the formal method of causal inference, I find there is a significant gap. In the current RL literature, a new algorithm A′ is often proposed based on the following verification procedure:

1) Test a sample of the most performant algorithms {A_1, A_2, ..., A_n} on certain benchmark tasks {B_1, B_2, ..., B_m} to produce a baseline score E_{i,j}[s(A_i, B_j)].

2) Identify potential augmentations to an existing RL algorithm A_0. Denote the set of candidate algorithmic changes as 𝒞 = {c_1, c_2, ..., c_l}, and denote the new RL algorithm produced from A_0 and a selection of candidate changes C ⊆ 𝒞 as A_0 ⊕ C. Note that we usually do, but do not necessarily, require

$$ E_j[s(A_0, B_j)] > E_{i,j}[s(A_i, B_j)]. $$

3) To verify the effectiveness of the algorithmic changes C, compute the expected performance of A_0 ⊕ C over {B_1, B_2, ..., B_m}; in other words, compute

$$ E_j[s(A_0 \oplus C, B_j)]. $$

4) If the following condition holds,

$$ E_j[s(A_0 \oplus C, B_j)] > E_{i,j}[s(A_i, B_j)], $$

then the algorithmic design C is deemed a valuable addition to the original algorithm A_0.

The above procedure essentially argues that the addition of C causes A_0 ⊕ C to outperform the state of the art E_{i,j}[s(A_i, B_j)] by showing that the addition of C to A_0 and the observed improvements are correlated.

However, as we know, correlation does not imply causation. Claiming causality based solely on an observational study puts new discoveries at risk of fallacy. To avoid this trap, I propose to adopt tools from causal inference.

2) Proposed Verification Procedure: Among the various techniques from the field of causal inference, I present a way to verify RL algorithmic designs based on the principles of RCT. The proposed method is delineated below.

1) Test a sample of the most performant algorithms {A_1, A_2, ..., A_n} on certain benchmark tasks {B_1, B_2, ..., B_m} to produce a baseline score E_{i,j}[s(A_i, B_j)].

2) Identify potential augmentations to an existing RL algorithm A_0. Denote the set of candidate algorithmic changes as 𝒞 = {c_1, c_2, ..., c_l}, and denote the new RL algorithm produced from A_0 and a selection of candidate changes C ⊆ 𝒞 as A_0 ⊕ C. Note that we usually do, but do not necessarily, require

$$ E_j[s(A_0, B_j)] > E_{i,j}[s(A_i, B_j)]. $$

3) To verify the effectiveness of the algorithmic changes C, compute the expected performance of A_0 ⊕ (C ∪ N) over {B_1, B_2, ..., B_m} and over N ⊆ 𝒞 \ C, where N is a set containing elements randomly selected from 𝒞 \ C. In other words, compute

$$ E_{N,j}[s(A_0 \oplus (C \cup N), B_j)]. $$

4) If the following condition holds,

$$ E_{N,j}[s(A_0 \oplus (C \cup N), B_j)] > E_{i,j}[s(A_i, B_j)], $$

then the algorithmic design C is deemed a valuable addition to the original algorithm A_0.

Unlike the previously mentioned technique, thisapproach carefully marginalizes out the effect ofpotential confounding variables in N such that thedemonstrated performance, if any, can be asserted,with greater certainty, to be caused by the pro-posed algorithmic design C.
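The sketch below illustrates the proposed procedure in Python: for a proposed change set C, each evaluation draws a random subset N of the remaining candidate changes and averages the score over those draws and over the benchmarks before comparing with the baseline. The `score` function, the candidate names, and the way changes are composed into an algorithm are placeholders for whatever training-and-evaluation pipeline a study actually uses.

```python
import random
import statistics

def verify_with_rct(score, A0, benchmarks, C, candidate_pool, baseline, draws=20):
    """Marginalize over random co-occurring changes N before crediting C.

    score(algorithm, changes, benchmark) -> scalar performance (placeholder)
    C:              the proposed algorithmic changes (a frozenset)
    candidate_pool: the full set of candidate changes
    baseline:       E_{i,j}[s(A_i, B_j)], computed beforehand
    """
    others = list(candidate_pool - C)
    samples = []
    for _ in range(draws):
        # N: a random subset of the remaining candidates (potential confounders).
        n = set(random.sample(others, k=random.randint(0, len(others))))
        for b in benchmarks:
            samples.append(score(A0, set(C) | n, b))
    return statistics.mean(samples) > baseline

# Illustrative usage with a synthetic scoring function.
def fake_score(algo, changes, bench):
    # Only the custom reward actually helps in this toy example.
    return 1.0 + 0.5 * ("custom_reward" in changes)

random.seed(0)
pool = frozenset({"hierarchy", "custom_reward", "larger_batch"})
print(verify_with_rct(fake_score, "A3C", ["B1", "B2"],
                      frozenset({"hierarchy"}), pool, baseline=1.2))
```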


Acknowledgement

Funding for this project was provided by the UC Berkeley Safe Transportation Research and Education Center (SafeTREC) and the Collaborative Sciences Center for Road Safety (CSCRS), a U.S. Department of Transportation-funded National University Transportation Center led by the University of North Carolina at Chapel Hill's Highway Safety Research Center.

References

[1] G. Staff and Agencies, "Tesla car that crashed and killed driver was running on autopilot, firm says," Mar 2018. [Online]. Available: https://www.theguardian.com/technology/2018/mar/31/tesla-car-crash-autopilot-mountain-view

[2] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.

[3] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 2006.

[4] S. Ingle and M. Phute, "Tesla autopilot: semi autonomous driving, an uptick for future autonomy," International Research Journal of Engineering and Technology, vol. 3, no. 9, 2016.

[5] C. News, "First look inside self-driving taxis as Waymo prepares to launch unprecedented service," Oct 2018. [Online]. Available: https://www.cbsnews.com/news/first-look-inside-waymo-self-driving-taxis/

[6] M. Kuderer, S. Gulati, and W. Burgard, "Learning driving styles for autonomous vehicles from demonstration," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 2641–2646.

[7] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.

[8] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.

[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.

[10] L. Fridman, B. Jenik, and J. Terwilliger, "Deeptraffic: Driving fast through dense traffic with deep reinforcement learning," arXiv preprint arXiv:1801.02805, 2018.

[11] R. E. Stern, S. Cui, M. L. Delle Monache, R. Bhadani, M. Bunting, M. Churchill, N. Hamilton, H. Pohlmann, F. Wu, B. Piccoli, et al., "Dissipation of stop-and-go waves via control of autonomous vehicles: Field experiments," Transportation Research Part C: Emerging Technologies, vol. 89, pp. 205–221, 2018.

[12] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, "Flow: Architecture and benchmarking for reinforcement learning in traffic control," arXiv preprint arXiv:1710.05465, 2017.

[13] E. Vinitsky, K. Aboudy, L. Le Flem, N. Kheterpal, K. Jang, F. Wu, R. Liaw, E. Liang, and A. Bayen, "Benchmarks for reinforcement learning in mixed-autonomy traffic," Conference on Robot Learning, 2018.

[14] E. Vinitsky, K. Parvate, A. R. Kreidieh, C. Wu, Z. Hu, and A. Bayen, "Lagrangian control through deep-RL: Applications to bottleneck decongestion," IEEE Intelligent Transportation Systems Conference, 2018.

[15] A. R. Kreidieh and A. Bayen, "Dissipating stop-and-go waves in closed and open networks via deep reinforcement learning," IEEE Intelligent Transportation Systems Conference, 2018.

[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.

[17] J. Carmona, F. García, D. Martín, A. d. l. Escalera, and J. M. Armingol, "Data fusion for driver behaviour analysis," Sensors, vol. 15, no. 10, pp. 25968–25991, 2015.

[18] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "SUMO–Simulation of Urban Mobility," in The Third International Conference on Advances in System Simulation (SIMUL 2011), Barcelona, Spain, vol. 42, 2011.

[19] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Physical Review E, vol. 62, no. 2, p. 1805, 2000.

[20] M. Treiber and A. Kesting, "Traffic flow dynamics: data, models and simulation," Physics Today, vol. 67, no. 3, p. 54, 2014.

[21] A. Holkner, "Pyglet: Cross-platform windowing and multimedia library for Python," Google Code, 2008.

[22] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica, "RLlib: Abstractions for distributed reinforcement learning," in International Conference on Machine Learning, 2018, pp. 3059–3068.

[23] M. Bando, K. Hasebe, A. Nakayama, A. Shibata, and Y. Sugiyama, "Dynamical model of traffic congestion and numerical simulation," Physical Review E, vol. 51, no. 2, p. 1035, 1995.

[24] N. Dilokthanakul, C. Kaplanis, N. Pawlowski, and M. Shanahan, "Feature control as intrinsic motivation for hierarchical reinforcement learning," arXiv preprint arXiv:1705.06769, 2017.

[25] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, "Feudal networks for hierarchical reinforcement learning," arXiv preprint arXiv:1703.01161, 2017.
