Risk Averse Robust Adversarial Reinforcement Learning
Extended Abstract
Xinlei Pan
UC Berkeley
Berkeley, CA
xinleipan@berkeley.edu
John Canny
UC Berkeley
Berkeley, CA
jfc@cs.berkeley.edu
ABSTRACT
Robot controllers need to deal with a variety of uncertainty and still perform well. Reinforcement learning and deep networks often suffer from brittleness, e.g. difficulty transferring from simulation to real environments. Recently, robust adversarial reinforcement learning (RARL) was developed, which allows random and systematic perturbations to be applied by an adversary. A limitation of previous work is that only the expected control objective is optimized, i.e. there is no explicit modeling or optimization of risk. Thus it is not possible to control the probability of catastrophic events. In this paper we introduce risk-averse robust adversarial RL, using a risk-averse main agent and a risk-seeking adversary. We show through experiments in vehicle control that a risk-averse agent is better equipped to handle a risk-seeking adversary, and experiences fewer catastrophic outcomes.
KEYWORDS
Risk averse reinforcement learning, Adversarial learning, Robust reinforcement learning

ACM Reference Format:
Xinlei Pan and John Canny. 2018. Risk Averse Robust Adversarial Reinforcement Learning: Extended Abstract. In Proceedings of ACM Computer Science in Cars Symposium (CSCS'18). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Reinforcement learning has demonstrated remarkable performance on a variety of sequential decision making tasks such as Go [6] and Atari games [2]. In order to use reinforcement learning trained agents in real-world applications such as household robots or autonomous driving vehicles, the agents need a very high level of robustness and safety. Previous work on robust RL has used random perturbation of a limited number of parameters to change the dynamics [5]. In [1], gradient-based adversarial perturbations were introduced during training. In [4] an adversary agent was
itself trained using reinforcement learning. This approach is more general than [1], and includes both random and systematic adversaries that directly apply adversarial attacks to the input states and dynamics. However, these methods achieve only limited diversity of dynamics, which may not be diverse enough to resemble real-world variety. These methods also mainly consider the robustness of models, rather than both the robustness and the risk-averse behavior of the trained model. In this paper, we consider the task of training risk averse robust reinforcement learning policies. We model risk as the variance of value functions, inspired by [7]. In order to emphasize that the agent should be averse to extreme catastrophe, we design the reward function to be asymmetric. A robust policy should not only maximize long-term expected cumulative reward, but should also select actions with small variance of that expectation. Here, we use ensembles of Q-value networks to estimate the variance of Q values; the work in [3] proposes a similar technique to aid exploration. We consider a scenario where two agents interact in the same environment and compete against each other, and propose to train risk averse robust adversarial reinforcement learning policies (RARARL).
2 RISK AVERSE ROBUST ADVERSARIAL REINFORCEMENT LEARNING
Two Player Reinforcement Learning. We consider the environment as a Markov Decision Process (MDP) M = {S, A, R, P, γ}, where S is the state space, A is the action space, R(s, a) is the reward function, P(s'|s, a) is the state transition model, and γ is the reward discount rate. There are two agents in this game. One agent is called the protagonist agent (denoted by P), which strives to learn a policy π_P to maximize the cumulative expected reward E_{π_P}[∑_t γ^t r_t]. The other agent is called the adversarial agent (denoted by A), which strives to learn a policy π_A that works against the protagonist agent and minimizes the cumulative expected reward. In our implementation, the adversarial agent receives the negative of the environmental reward, −r. The protagonist agent should be risk averse, where we model risk as the variance of the value function (introduced below), so the value of action a at
state s is modified as Q̂_P(s, a) = Q_P(s, a) − λ·Var_m[Q_P^m(s, a)], where Q̂_P(s, a) is the modified Q function, Q_P(s, a) is the original Q function, Var_m[Q_P^m(s, a)] is the variance of the Q function across m different models, and λ is a multiplicative constant whose value we choose to be 0.1. The term −λ·Var_m[Q_P^m(s, a)] is hereafter called the risk-averse term. In order for the adversarial agent to be risk seeking, so as to provide more challenges for the protagonist agent, its modified value function for action selection is Q̂_A(s, a) = Q_A(s, a) + λ·Var_m[Q_A^m(s, a)], where λ is a multiplicative constant with value 1. The term +λ·Var_m[Q_A^m(s, a)] is hereafter called the risk-seeking term. The action space of this agent is the same as that of the protagonist agent. The two agents take actions in turn during training.

Asymmetric Reward Design. In order to train risk averse agents, we propose to design an asymmetric reward function. Namely, good behavior receives only a small positive reward, while risky behavior receives a very negative reward, so that the trained agent is risk averse and robust.

Risk Modeling using Value Variance. The risk of a certain action can be modeled by estimating the variance of the value function across different models trained on different sets of data. Inspired by [3], we estimate the variance of Q value functions by training multiple Q-value networks together, and take their Q-value variance as the risk measure. The training process for a single agent is the same as in [3].
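As a concrete illustration of the variance-adjusted action selection described above, the following sketch shows how a risk-averse protagonist and a risk-seeking adversary might pick actions from an ensemble of Q estimates. This is a minimal sketch rather than the authors' implementation: it assumes the ensemble mean is used as the Q estimate and represents each ensemble member as an arbitrary callable returning per-action Q values.

```python
import numpy as np

def risk_adjusted_action(q_ensemble, state, lam, risk_seeking=False):
    """Pick an action from variance-adjusted ensemble Q-values.

    q_ensemble   -- list of m callables, each mapping a state to a vector of
                    per-action Q-values (stand-ins for m bootstrapped Q-networks).
    lam          -- multiplicative constant on the variance term
                    (0.1 for the protagonist, 1.0 for the adversary in the text).
    risk_seeking -- False: subtract the variance (risk-averse protagonist);
                    True: add it (risk-seeking adversary).
    """
    q_values = np.stack([q(state) for q in q_ensemble])  # shape (m, |A|)
    q_mean = q_values.mean(axis=0)                        # Q(s, a) estimate (assumed: ensemble mean)
    q_var = q_values.var(axis=0)                          # Var_m[Q^m(s, a)]
    sign = 1.0 if risk_seeking else -1.0
    return int(np.argmax(q_mean + sign * lam * q_var))


# Toy usage with random linear "Q-networks" (purely illustrative):
rng = np.random.default_rng(0)
ensemble = [lambda s, W=rng.normal(size=(4, 9)): s @ W for _ in range(5)]
state = rng.normal(size=4)
a_protagonist = risk_adjusted_action(ensemble, state, lam=0.1)                    # risk-averse
a_adversary = risk_adjusted_action(ensemble, state, lam=1.0, risk_seeking=True)   # risk-seeking
```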
3 EXPERIMENTS
Simulation Environments. For discrete control, we used the TORCS car racing environment [8], specifically the Michigan Speedway environment. The vehicle takes actions in a discrete space of 9 actions, which are combinations of moving ahead, accelerating, and turning left/right. The reward function is defined to be proportional to the speed along the road direction, and collisions are penalized as catastrophes. The catastrophe reward is much more negative than the normal reward.
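The asymmetric reward shaping used in the driving environment can be sketched as below. The constants and the speed/heading signals here are hypothetical stand-ins (the text only states that the catastrophe reward is much more negative than the small per-step reward), so this is an assumption-laden illustration rather than the exact reward used in the experiments.

```python
import math

# Hypothetical constants: the text only requires the catastrophe penalty
# to be far more negative than the small positive per-step reward.
SPEED_REWARD_SCALE = 0.01
CATASTROPHE_REWARD = -2.5

def asymmetric_reward(speed, angle_to_road, collided):
    """Asymmetric reward: a small positive reward for driving along the road,
    and a much larger negative reward when a catastrophe (collision) occurs."""
    if collided:
        return CATASTROPHE_REWARD
    # Reward proportional to the velocity component along the road direction.
    return SPEED_REWARD_SCALE * speed * math.cos(angle_to_road)
```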
Baselines and Our Method. Vanilla DQN. The purpose of comparing with vanilla DQN is to show that models trained in one environment may overfit to that specific dynamics and fail to transfer to other dynamics. We denote this experiment as dqn.

Ensemble DQN. Ensemble DQN tends to be more robust than vanilla DQN since it takes the vote of all ensemble policies. However, without being trained on different dynamics, even Ensemble DQN may not work well when there are adversarial attacks or simple random dynamics changes. We denote this experiment as bsdqn.

Ensemble DQN with Random Perturbations Without Risk Averse Term. We train the protagonist agent and provide random perturbations. For comparison purposes, however, we do not include the variance-guided exploration term; namely, only the Q value function is used for choosing actions. We denote this experiment as bsdqnrand.

Ensemble DQN with Random Perturbations With the Risk Averse Term. We train only the protagonist agent and provide random perturbations. The protagonist agent selects actions based on its Q value function and the risk-averse term. We denote this experiment as bsdqnrandriskaverse.

Ensemble DQN with Adversarial Perturbation. This compares our model with [4]. To make a fair comparison, we also use Ensemble DQN to train the policy, while the variance term is not used as either a risk-averse or a risk-seeking term in either agent. We denote this experiment as bsdqnadv.

Our method. In our method, we train both the protagonist agent and the adversarial agent with Ensemble DQN, and include the variance-guided exploration term. Therefore, the Q function and its variance across different models are used for action selection. We denote this experiment as bsdqnadvriskaverse.

Evaluation. First, testing without perturbations: we tested all trained models without perturbations. Second, testing with random perturbations: we tested these models by randomly taking 1 action for every 10 actions taken by the main agent. Third, testing with adversarial perturbations: random perturbations may not result in catastrophe for the vehicle, so we also tested all models with our trained adversarial agent, taking 1 action decided by the adversarial agent for every 10 actions taken by the main agent. The results are shown in Figure 1.
Figure 1: Testing all models without attack, with random attack, and with adversarial attack (from top to bottom). The reward is divided into distance related reward (left) and progress related reward (middle). The result for catastrophe reward per episode is presented in the last image.
For all result curves, a vertical blue line at 0.55 million steps marks the beginning of the added perturbations (either random or adversarial) during training.
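The perturbed-testing protocol described above (one perturbing action for every 10 protagonist actions) can be sketched as follows. This is a minimal sketch assuming a gym-style reset()/step() interface and that the environment flags catastrophic steps in its info dictionary; these interface details are assumptions, not taken from the paper.

```python
def evaluate_with_perturbation(env, protagonist_policy, perturb_policy,
                               perturb_every=10, max_steps=1000):
    """Roll out the protagonist while a perturbing agent (random or adversarial)
    overrides one action out of every `perturb_every` steps.

    Assumes a gym-style env whose step(action) returns (state, reward, done, info)
    and whose info dict marks catastrophic steps (hypothetical key "catastrophe").
    """
    state = env.reset()
    total_reward, catastrophes = 0.0, 0
    for t in range(max_steps):
        if perturb_policy is not None and (t + 1) % perturb_every == 0:
            action = perturb_policy(state)       # 1 perturbing action ...
        else:
            action = protagonist_policy(state)   # ... per 9 protagonist actions
        state, reward, done, info = env.step(action)
        total_reward += reward
        catastrophes += int(info.get("catastrophe", False))
        if done:
            break
    return total_reward, catastrophes
```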
4 CONCLUSION
We show that by introducing a notion of risk-averse behavior, training agents with an adversarial agent achieves significantly fewer catastrophes than training without adversarial attack. The trained adversarial agent is also able to provide perturbations stronger than random perturbations.
REFERENCES
[1] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. 2017. Adversarially Robust Policy Learning: Active Construction of Physically-Plausible Perturbations. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[3] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems. 4026–4034.
[4] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017. Robust Adversarial Reinforcement Learning. In International Conference on Machine Learning (ICML).
[5] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. 2017. EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations (ICLR).
[6] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[7] Aviv Tamar, Dotan Di Castro, and Shie Mannor. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17, 13 (2016), 1–36.
[8] Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. 2000. TORCS, The Open Racing Car Simulator. Software available at http://torcs.sourceforge.net (2000).