Risk Averse Robust Adversarial Reinforcement Learning
Extended Abstract
Xinlei Pan
UC Berkeley
Berkeley, CA
xinleipan@berkeley.edu
John Canny
UC Berkeley
Berkeley, CA
jfc@cs.berkeley.edu
ABSTRACT
Robot controllers need to deal with a variety of uncertainty and still perform well. Reinforcement learning and deep networks often suffer from brittleness, e.g. difficulty transferring from simulation to real environments. Recently, robust adversarial reinforcement learning (RARL) was developed, which allows random and systematic perturbations to be applied by an adversary. A limitation of previous work is that only the expected control objective is optimized, i.e. there is no explicit modeling or optimization of risk. Thus it is not possible to control the probability of catastrophic events. In this paper we introduce risk-averse robust adversarial RL, using a risk-averse main agent and a risk-seeking adversary. We show through experiments in vehicle control that a risk-averse agent is better equipped to handle a risk-seeking adversary, and experiences fewer catastrophic outcomes.
KEYWORDS
Risk averse reinforcement learning, Adversarial learning, Robust reinforcement learning

ACM Reference Format:
Xinlei Pan and John Canny. 2018. Risk Averse Robust Adversarial Reinforcement Learning: Extended Abstract. In Proceedings of ACM Computer Science in Cars Symposium (CSCS'18). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Reinforcement learning has demonstrated remarkable performance on a variety of sequential decision making tasks such as Go [6] and Atari games [2]. In order to use reinforcement learning trained agents in real-world applications such as household robots or autonomous driving vehicles, the agents need a very high level of robustness and safety. Previous work on robust RL has used random perturbation of a limited number of parameters to change the dynamics [5]. In [1], gradient-based adversarial perturbations were introduced during training. In [4] an adversary agent was
itself trained using reinforcement learning. This approach is more general than [1], and includes both random and systematic adversaries that directly apply adversarial attacks to the input states and dynamics. However, these methods achieve only limited diversity of dynamics, which may not be diverse enough to resemble real-world variety. These methods also mainly consider the robustness of models, rather than both the robustness and the risk-averse behavior of the trained model. In this paper, we consider the task of training risk averse robust reinforcement learning policies. We model risk as the variance of value functions, inspired by [7]. In order to emphasize that the agent should be averse to extreme catastrophe, we design the reward function to be asymmetric. A robust policy should not only maximize long-term expected cumulative reward, but should also select actions with small variance of that expectation. Here, we use ensembles of Q-value networks to estimate the variance of Q values; the work in [3] proposes a similar technique to aid exploration. We consider a scenario where two agents interact in the same environment and compete against each other, and propose to train risk averse robust adversarial reinforcement learning policies (RARARL).
2 RISK AVERSE ROBUST ADVERSARIAL REINFORCEMENT LEARNING
Two Player Reinforcement Learning. We consider the environment as a Markov Decision Process (MDP) M = {S, A, R, P, γ}, where S is the state space, A is the action space, R(s, a) is the reward function, P(s'|s, a) is the state transition model, and γ is the reward discount rate. There are two agents in this game. One agent is called the protagonist agent (denoted by P), which strives to learn a policy π_P to maximize the cumulative expected reward E_{π_P}[∑_t γ^t r_t]. The other agent is called the adversarial agent (denoted by A), which strives to learn a policy π_A that works against the protagonist agent and minimizes the cumulative expected reward. In our implementation, the adversarial agent receives the negative of the environmental reward, −r. The protagonist agent should be risk averse, where we model risk as the variance of the value function (introduced below), so the value of action a at
state s is modified as Q̂_P(s, a) = Q_P(s, a) − λ·Var_m[Q_P^m(s, a)], where Q̂_P(s, a) is the modified Q function, Q_P(s, a) is the original Q function, Var_m[Q_P^m(s, a)] is the variance of the Q function across m different models, and λ is a multiplicative constant whose value we choose to be 0.1. The term −λ·Var_m[Q_P^m(s, a)] is hereafter called the risk-averse term. In order for the adversarial agent to be risk seeking, so as to provide more challenges for the protagonist agent, its modified value function for action selection is Q̂_A(s, a) = Q_A(s, a) + λ·Var_m[Q_A^m(s, a)], where λ is a multiplicative constant with value 1. The term +λ·Var_m[Q_A^m(s, a)] is hereafter called the risk-seeking term. The action space of this agent is the same as that of the protagonist agent. The two agents take actions in turn during training.

Asymmetric Reward Design. In order to train risk averse agents, we propose to design an asymmetric reward function. Namely, good behavior receives only a small positive reward, while risky behavior receives a very negative reward, so that the trained agent is risk averse and robust.

Risk Modeling using Value Variance. The risk of a certain action can be modeled by estimating the variance of the value function across different models trained on different sets of data. Inspired by [3], we estimate the variance of Q value functions by training multiple Q-value networks together, and take their Q-value variance as the risk measure. The training process for a single agent is the same as in [3].
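As a concrete illustration of the variance-adjusted action selection described above, the following sketch shows how a risk-averse protagonist and a risk-seeking adversary might pick actions from an ensemble of Q estimates. This is a minimal sketch rather than the authors' implementation: it assumes the ensemble mean is used as the Q estimate and represents each ensemble member as an arbitrary callable returning per-action Q values.

```python
import numpy as np

def risk_adjusted_action(q_ensemble, state, lam, risk_seeking=False):
    """Pick an action from variance-adjusted ensemble Q-values.

    q_ensemble   -- list of m callables, each mapping a state to a vector of
                    per-action Q-values (stand-ins for m bootstrapped Q-networks).
    lam          -- multiplicative constant on the variance term
                    (0.1 for the protagonist, 1.0 for the adversary in the text).
    risk_seeking -- False: subtract the variance (risk-averse protagonist);
                    True: add it (risk-seeking adversary).
    """
    q_values = np.stack([q(state) for q in q_ensemble])  # shape (m, |A|)
    q_mean = q_values.mean(axis=0)                        # Q(s, a) estimate (assumed: ensemble mean)
    q_var = q_values.var(axis=0)                          # Var_m[Q^m(s, a)]
    sign = 1.0 if risk_seeking else -1.0
    return int(np.argmax(q_mean + sign * lam * q_var))


# Toy usage with random linear "Q-networks" (purely illustrative):
rng = np.random.default_rng(0)
ensemble = [lambda s, W=rng.normal(size=(4, 9)): s @ W for _ in range(5)]
state = rng.normal(size=4)
a_protagonist = risk_adjusted_action(ensemble, state, lam=0.1)                    # risk-averse
a_adversary = risk_adjusted_action(ensemble, state, lam=1.0, risk_seeking=True)   # risk-seeking
```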
3 EXPERIMENTS
Simulation Environments. For discrete control, we used the TORCS car racing environment [8], specifically the Michigan Speedway environment. The vehicle takes actions in a discrete space of 9 actions, which are combinations of moving ahead, accelerating, and turning left/right. The reward function is defined to be proportional to the speed along the road direction, and collisions are penalized as catastrophes. The catastrophe reward is much more negative than the normal reward.
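The asymmetric reward shaping used in the driving environment can be sketched as below. The constants and the speed/heading signals here are hypothetical stand-ins (the text only states that the catastrophe reward is much more negative than the small per-step reward), so this is an assumption-laden illustration rather than the exact reward used in the experiments.

```python
import math

# Hypothetical constants: the text only requires the catastrophe penalty
# to be far more negative than the small positive per-step reward.
SPEED_REWARD_SCALE = 0.01
CATASTROPHE_REWARD = -2.5

def asymmetric_reward(speed, angle_to_road, collided):
    """Asymmetric reward: a small positive reward for driving along the road,
    and a much larger negative reward when a catastrophe (collision) occurs."""
    if collided:
        return CATASTROPHE_REWARD
    # Reward proportional to the velocity component along the road direction.
    return SPEED_REWARD_SCALE * speed * math.cos(angle_to_road)
```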
Baselines and Our Method. Vanilla DQN. The purpose of comparing with vanilla DQN is to show that models trained in one environment may overfit to that specific dynamics and fail to transfer to other dynamics. We denote this experiment as dqn.

Ensemble DQN. Ensemble DQN tends to be more robust than vanilla DQN since it takes the vote of all ensemble policies. However, without being trained on different dynamics, even Ensemble DQN may not work well when there are adversarial attacks or simple random dynamics changes. We denote this experiment as bsdqn.

Ensemble DQN with Random Perturbations Without Risk Averse Term. We train the protagonist agent and provide random perturbations. For comparison purposes, however, we do not include the variance-guided exploration term; namely, only the Q value function is used for choosing actions. We denote this experiment as bsdqnrand.

Ensemble DQN with Random Perturbations With the Risk Averse Term. We train only the protagonist agent and provide random perturbations. The protagonist agent selects actions based on its Q value function and the risk-averse term. We denote this experiment as bsdqnrandriskaverse.

Ensemble DQN with Adversarial Perturbation. This compares our model with [4]. To make a fair comparison, we also use Ensemble DQN to train the policy, while the variance term is not used as either a risk-averse or a risk-seeking term in either agent. We denote this experiment as bsdqnadv.

Our method. In our method, we train both the protagonist agent and the adversarial agent with Ensemble DQN, and include the variance-guided exploration term. Therefore, the Q function and its variance across different models are used for action selection. We denote this experiment as bsdqnadvriskaverse.

Evaluation. First, testing without perturbations: we tested all trained models without perturbations. Second, testing with random perturbations: we tested these models by randomly taking 1 action for every 10 actions taken by the main agent. Third, testing with adversarial perturbations: random perturbations may not result in catastrophe for the vehicle, so we also tested all models with our trained adversarial agent, taking 1 action decided by the adversarial agent for every 10 actions taken by the main agent. The results are shown in Figure 1.
Figure 1: Testing all models without attack, with random attack, and with adversarial attack (from top to bottom). The reward is divided into distance related reward (left) and progress related reward (middle). The result for catastrophe reward per episode is presented in the last image.
For all result curves, a vertical blue line at 0.55 million steps marks the beginning of the added perturbations (either random or adversarial) during training.
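The perturbed-testing protocol described above (one perturbing action for every 10 protagonist actions) can be sketched as follows. This is a minimal sketch assuming a gym-style reset()/step() interface and that the environment flags catastrophic steps in its info dictionary; these interface details are assumptions, not taken from the paper.

```python
def evaluate_with_perturbation(env, protagonist_policy, perturb_policy,
                               perturb_every=10, max_steps=1000):
    """Roll out the protagonist while a perturbing agent (random or adversarial)
    overrides one action out of every `perturb_every` steps.

    Assumes a gym-style env whose step(action) returns (state, reward, done, info)
    and whose info dict marks catastrophic steps (hypothetical key "catastrophe").
    """
    state = env.reset()
    total_reward, catastrophes = 0.0, 0
    for t in range(max_steps):
        if perturb_policy is not None and (t + 1) % perturb_every == 0:
            action = perturb_policy(state)       # 1 perturbing action ...
        else:
            action = protagonist_policy(state)   # ... per 9 protagonist actions
        state, reward, done, info = env.step(action)
        total_reward += reward
        catastrophes += int(info.get("catastrophe", False))
        if done:
            break
    return total_reward, catastrophes
```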
4 CONCLUSION
We show that by introducing a notion of risk-averse behavior, training agents with an adversarial agent achieves significantly fewer catastrophes than training without adversarial attack. The trained adversarial agent is also able to provide perturbations stronger than random perturbations.
REFERENCES
[1] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. 2017. Adversarially Robust Policy Learning: Active Construction of Physically-Plausible Perturbations. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[3] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems. 4026–4034.
[4] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017. Robust Adversarial Reinforcement Learning. In International Conference on Machine Learning (ICML).
[5] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. 2017. EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations (ICLR).
[6] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[7] Aviv Tamar, Dotan Di Castro, and Shie Mannor. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17, 13 (2016), 1–36.
[8] Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. 2000. TORCS, The Open Racing Car Simulator. Software available at http://torcs.sourceforge.net (2000).