

Backdoor Attacks in Sequential Decision-Making Agents

Zhaoyuan Yang, Naresh Iyer, Johan Reimann, Nurali Virani*

GE Research, One Research Circle, Niskayuna, NY 12309

Abstract

Recent work has demonstrated robust mechanisms by which attacks can be orchestrated on machine learning models. In contrast to adversarial examples, backdoor or trojan attacks embed surgically modified samples in the model training process to cause the targeted model to learn to misclassify samples in the presence of specific triggers, while keeping the model performance stable across other nominal samples. However, current published research on trojan attacks mainly focuses on classification problems, which ignores sequential dependency between inputs. In this paper, we propose methods to discreetly introduce and exploit novel backdoor attacks within a sequential decision-making agent, such as a reinforcement learning agent, by training multiple benign and malicious policies within a single long short-term memory (LSTM) network, where the malicious policy can be activated by a short realizable trigger introduced to the agent. We demonstrate the effectiveness through initial outcomes generated from our approach and discuss the impact of such attacks in defense scenarios. We also provide evidence and intuition on how the trojan trigger activates the malicious policy. In the end, we propose potential approaches to defend against, or serve as early detection for, such attacks.

Introduction

Current research has demonstrated different categories of attacks on neural networks and other supervised learning approaches. The majority of them can be categorized as: (1) inference-time attacks, which add adversarial perturbations digitally, or patches physically, to test samples and make the model misclassify them (Goodfellow, Shlens, and Szegedy 2015; Szegedy et al. 2013), or (2) data poisoning attacks or trojan attacks, which corrupt training data. In the case of trojans, carefully designed samples are embedded in the model training process to cause the model to learn incorrectly with regard to only those samples, while keeping the

*Corresponding author: [email protected]
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of the AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org

training performance of the model stable across other nominal samples (Liu et al. 2017). The focus of this paper is on trojan attacks. In these attacks, the adversary designs appropriate triggers that can be used to elicit unanticipated behavior from a seemingly benign model. As demonstrated in (Gu, Dolan-Gavitt, and Garg 2017), such triggers can lead to dangerous behaviors by artificial intelligence (AI) systems like autonomous cars by deliberately misleading their perception modules into classifying ‘Stop’ signs as ‘Speed Limit’ signs.

Most research on trojan attacks in AI mainly focuses on classification problems, where the model’s performance is affected only in the instant when a trojan trigger is present. In this work, we bring to light a new trojan threat in which a trigger needs to appear only for a very short period and can affect the model’s performance even after disappearing. For example, the adversary needs to present the trigger in only one frame of an autonomous vehicle’s sensor inputs, and the behavior of the vehicle can be made to change permanently from then on. Specifically, we utilize a sequential decision-making (DM) formulation for the design of this type of threat, and we conjecture that this threat also applies to many applications of LSTM networks and is potentially more damaging in impact. Moreover, this attack model needs more careful attention from the defense sector, where sequential DM agents are being developed for autonomous navigation of convoy vehicles, dynamic course-of-action selection, war-gaming or warfighter-training scenarios, etc., and where an adversary can inject such backdoors.

The contributions of this work are: (1) a threat model and formulation for a new type of trojan attack on LSTM networks and sequential DM agents, (2) an implementation to illustrate the threat, and (3) an analysis of models with the threat and potential defense mechanisms.

In the following sections of the paper, we will provide examples of related work and background on deep reinforcement learning (RL) and LSTM networks. The threat model will be described, and we will show the implementation details, algorithms, simulation results, and an intuitive understanding of the attack. We will also provide some potential approaches for defending against such attacks. Finally, we will conclude with some directions for future research.


Related Work

Adversarial attacks on neural networks have received increasing attention after neural networks were found to be vulnerable to adversarial perturbations (Szegedy et al. 2013). Most research on adversarial attacks on neural networks is related to classification problems. To be specific, (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy 2015; Su, Vargas, and Sakurai 2019) discovered that the adversary only needs to add a small adversarial perturbation to an input, and the model prediction switches from a correct label to an incorrect one. In the setting of inference-time adversarial attacks, the neural networks are assumed to be clean, i.e., not manipulated by any adversary. With recent advancements in deep RL (Schulman et al. 2015; Mnih et al. 2016; 2015), many adversarial attacks on RL have also been investigated. It has been shown in (Huang et al. 2017; Lin et al. 2017) that small adversarial perturbations to inputs can largely degrade the performance of an RL agent.

Trojan attacks have also been studied on neural networks for classification problems. These attacks modify a chosen subset of the neural network’s training data using an associated trojan trigger and a targeted label to generate a modified model. Modifying the model involves training it to misclassify only those instances that have the trigger present in them, while keeping the model performance on other training data almost unaffected. In other words, the compromised network will continue to maintain expected performance on test and validation data that a user might apply to check model fitness; however, when exposed to adversarial inputs with embedded triggers, the model behaves “badly”, leading to potential execution of the adversary’s malicious intent. Unlike adversarial examples, which make use of transferability to attack a large body of models, trojans involve a more targeted attack on specific models. Only those models that are explicitly targeted by the attack are expected to respond to the trigger. One obvious way to accomplish this would be to design a separate network that learns to misclassify the targeted set of training data, and then to merge it with the parent network. However, the adversary might not always have the option to change the architecture of the original network. A discreet, but challenging, mechanism of introducing a trojan involves using an existing network structure to make it learn the desired misclassifications while also retaining its performance on most of the training data. (Gu, Dolan-Gavitt, and Garg 2017) demonstrate the use of a backdoor/trojan attack on a traffic sign classifier model, which ends up classifying stop signs as speed limits when a simple sticker (i.e., trigger) is added to a stop sign. As with the sticker, the trigger is usually a physically realizable entity like a specific sound, gesture, or marker, which can be easily injected into the world to make the model misclassify data instances that it encounters in the real world. (Chen et al. 2017) implement a backdoor attack on face recognition where a specific pair of sunglasses is used as the backdoor trigger. The attacked classifier identifies any individual wearing the backdoor-triggering sunglasses as a target individual chosen by the attacker, regardless of their true identity. Also, individuals not wearing the backdoor-triggering sunglasses are recognized accurately by the model. (Liu et al. 2017) present an approach where they apply a trojan attack without access to the original training data, thereby enabling such attacks to be incorporated by a third party in model-sharing marketplaces. (Bagdasaryan et al. 2018) demonstrate an approach for poisoning the neural network model in the setting of federated learning.

While existing research focuses on designing trojans for neural network models, to the best of our knowledge, our work is the first to explore trojan attacks in the context of sequential DM agents (including RL), as reported in our preprint (Yang et al. 2019). After our initial work, (Kiourti et al. 2019) showed reward hacking and data poisoning to create backdoors for feed-forward deep networks in the RL setting, and (Dai, Chen, and Guo 2019) introduced a backdoor attack on text classification models in a black-box setting via selective data poisoning. In this work, we explore how the adversary can discreetly manipulate the model to introduce a targeted trojan trigger in an RL agent with a recurrent neural network, and we discuss applications in defense scenarios. Moreover, the discussed attack is a black-box trojan attack in a partially observable environment, which affects the reward function from the simulator, introduces the trigger in sensor inputs from the environment, and does not assume any knowledge about the recurrent model. A similar attack can also be formulated in a white-box setting.

Motivating Examples

Deep RL is attracting growing interest from the military and defense domains. Deep RL has the potential to augment humans and increase automation in strategic planning and execution of missions in the near future. Examples of RL approaches being developed for planning include logistics convoy scheduling on a contested transportation network (Stimpson and Ganesan 2015) and dynamic course-of-action selection leveraging symbolic planning (Lyu et al. 2019). An activated backdoor triggered by benign-looking inputs, e.g., local gas price = $2.47, can mislead important convoys into taking longer, unsafe routes and recommend that commanders take sub-optimal courses of action from a specific sequential planning solution. On the other hand, examples of deep RL-based control for automation include not only map-less navigation of ground robots (Tai, Paolo, and Liu 2017) and obstacle avoidance for marine vessels (Cheng and Zhang 2018), but also congestion control in communications networks (Jay et al. 2018). Backdoors in such agents can lead to accidents and unexpected loss of communication at key moments in a mission. Using a motion planning problem for illustration, this work aims to bring focus to such backdoor attacks with very short-lived realizable triggers, so that the community can collaboratively work to prevent such situations from arising in the future and explore benevolent uses of such intentional backdoors.

Background

In this section, we will provide a brief overview of Proximal Policy Optimization (PPO) and LSTM networks, which are relevant for the topic discussed in this work.


MDP and Proximal Policy Optimization

A Markov decision process (MDP) is defined by a tuple (S, A, T, r, γ), where S is a finite set of states and A is a finite set of actions. T : S × A × S → R≥0 is the transition probability distribution, which gives the probability of the next state st+1 given the current state st and action at. r : S × A → R is the reward function, and γ ∈ (0, 1) is the discount factor. An agent with an optimal policy π should maximize the expected cumulative reward defined as G = Eτ[∑_{t=0}^{∞} γ^t r(st, at)], where τ is a trajectory of states and actions. In this work, we use proximal policy optimization (PPO) (Schulman et al. 2017), a model-free policy gradient method, to learn policies for sequential DM agents. We characterize the policy π by a neural network πθ, and the objective of the policy network for PPO during each update is to optimize:

L(θ) = Es,a[ min( ψ(θ) A, clip(ψ(θ), 1 − ε, 1 + ε) A ) ],

where we define πθ′ as the current policy, πθ as the updated policy, and ψ(θ) = πθ(a|s) / πθ′(a|s). State s and action a are sampled from the current policy πθ′, and A is the advantage estimate, which is usually determined by the discount factor γ, the reward r(st, at), and the value function of the current policy πθ′. ε is a hyper-parameter that determines the update scale. The clip operator restricts values outside the interval [1 − ε, 1 + ε] to the interval edges. Through a sequence of interactions and updates, the agent can discover an updated policy πθ that improves the cumulative reward G.
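As a concrete illustration of the objective above, the following is a minimal Python/NumPy sketch of the clipped surrogate; the array names, the example values, and the default clip_eps are hypothetical and not taken from the paper's implementation.

import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    # psi(theta) = pi_theta(a|s) / pi_theta'(a|s), computed from log-probabilities
    ratio = np.exp(logp_new - logp_old)
    # Unclipped and clipped surrogate terms
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the expectation of the element-wise minimum
    return np.mean(np.minimum(unclipped, clipped))

# Example usage with made-up values
logp_old = np.array([-1.2, -0.7, -2.0])
logp_new = np.array([-1.0, -0.9, -1.5])
advantages = np.array([0.5, -0.3, 1.2])
print(ppo_clipped_objective(logp_new, logp_old, advantages))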

LSTM and Partially-Observable MDP

Recurrent neural networks are instances of artificial neural networks designed to find patterns in sequences such as text or time-series data by capturing sequential dependencies using a state. As a variation of recurrent neural networks, the update of the LSTM (Hochreiter and Schmidhuber 1997) at each time t ∈ {1, ..., T} is defined as:

it = sigmoid(Wi xt + Ui ht−1 + bi),
ft = sigmoid(Wf xt + Uf ht−1 + bf),
ot = sigmoid(Wo xt + Uo ht−1 + bo),
ct = ft ⊙ ct−1 + it ⊙ tanh(Wc xt + Uc ht−1 + bc),
ht = ot ⊙ tanh(ct),

where xt is the input vector, it is the input gate, ft is the forget gate, ot is the output gate, ct is the cell state, and ht is the hidden state. The update of the LSTM is parameterized by the weight matrices Wi, Wf, Wc, Wo, Ui, Uf, Uc, Uo as well as the bias vectors bi, bf, bc, bo. The LSTM has three main mechanisms to manage the state: 1) the input vector xt is only presented to the cell state if it is considered important; 2) only the important parts of the cell state are updated; and 3) only the important state information is passed to the next layer in the neural network.

In many real-world applications, the state is not fully observable to the agent; therefore, we use a partially-observable Markov decision process (POMDP) to model these environments. A POMDP can be described as a tuple (S, A, T, r, Ω, O, γ), where S, A, T, r, and γ are the same as in the MDP. Ω is a finite set of observations, and O : S × A × Ω → R≥0 is the conditional observation probability distribution. To effectively solve the POMDP problem using RL, the agent needs to make use of memory, which stores information about the previous sequence of actions and observations, to make decisions (Cassandra, Kaelbling, and Littman 1994); as a result, LSTMs are often used to represent the policies of agents in POMDP problems (Bakker 2002; Jaderberg et al. 2016; Lample and Chaplot 2016; Hausknecht and Stone 2015). In this work, we denote all weight matrices and bias vectors as the parameter θ and use the LSTM with parameter θ to represent our agent’s policy πθ(a|o, c, h), where the actions a taken by the agent conditionally depend on the current observation o, cell state vector c, and hidden state vector h.
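For reference, the LSTM update above can be written as a single NumPy step. The sketch below is purely illustrative: the gate-keyed dictionaries, the 9-dimensional observation, and the 8-unit hidden size are our own assumptions, not the paper's architecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts keyed by gate name: 'i', 'f', 'o', 'c'
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # cell state
    h_t = o_t * np.tanh(c_t)                                 # hidden state
    return h_t, c_t

# Example with hypothetical sizes: 9-dimensional observation, 8 LSTM units
rng = np.random.default_rng(0)
n_in, n_hid = 9, 8
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in 'ifoc'}
U = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in 'ifoc'}
b = {k: np.zeros(n_hid) for k in 'ifoc'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)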

Threat Model

In this section, we provide an overview of the technical approach and the threat model, showing the realizability of the attack. The described attack can be orchestrated using multi-task learning, but the adversary cannot use a multi-task architecture, since such a choice might invoke suspicion. Besides, the adversary might not have access to architectural choices in a black-box setting. To hide the information about the backdoor, we formulate this attack as a POMDP problem, where the adversary can use some elements of the state vector to represent whether the trigger has been presented in the environment. Since hidden state information is captured by the recurrent neural network, which is widely used in problems with sequential dependency, the user will not be able to trivially detect the existence of such backdoors. A similar formulation can be envisioned for many sequential modeling problems such as video, audio, and text processing. Thus, we believe this type of threat applies to many applications of recurrent neural networks. Next, we describe our threat model, which emerges in applications that utilize recurrent models for sequential DM agents.

We consider two parties: one party is the user and the other is the adversary. The user wishes to obtain an agent with a policy πusr that maximizes the user’s cumulative reward Gusr, while the adversary’s objective is to build an agent with two (or possibly more) policies inside a single neural network without being noticed by the user. One of the stored policies is πusr, the user-expected nominal policy. The other policy, πadv, is designed by the adversary and maximizes the adversary’s cumulative reward Gadv. When the backdoor is not activated, the agent generates a sequence of actions based on the user-expected nominal policy πusr, which maximizes the cumulative reward Gusr; but when the backdoor is activated, the hidden policy πadv is used to choose a sequence of actions that maximizes the adversary’s cumulative reward Gadv. This threat can be realized in the following scenarios:

• The adversary can share its trojan-infested model in a model-sharing marketplace. Due to its good performance on nominal scenarios, which may be tested by the user, the seemingly-benign model with a trojan can get unwittingly deployed by the user. In this scenario, the attack can also be formulated as a white-box attack, since the model is completely generated by the adversary.

• The adversary can provide RL agent simulation environment services or proprietary software. As the attack is black-box, knowledge of the agent’s recurrent model architecture is not required by the infested simulator.

• Since the poisoning is accomplished by intermittently switching the reward function, a single environment with that reward function can be realized. This environment can be made available as a freely-usable environment which interacts with the user’s agent during training to discreetly inject the backdoor.

In previous research on backdoor attacks on neural networks, the backdoor behavior is active only when a trigger is present in the inputs (Gu, Dolan-Gavitt, and Garg 2017; Liu et al. 2017). If the trigger disappears from the model’s inputs, the model’s behavior returns to normal. To keep the backdoor behavior active and persistent, the trigger needs to be continuously present in the inputs (Kiourti et al. 2019). However, this may make trigger detection relatively easy. In contrast, if the trigger only needs to be present in the inputs for a very short period of time to be effective, then trigger detection becomes more difficult. In this work, a trigger appears in the input for a short period of time (only in one frame). Once the agent observes the trigger, it switches to the backdoor (adversary-intended) behavior, and due to the recurrent structure, the backdoor behavior remains persistent even after the trigger disappears from the agent’s observation. Note that the adversary can also train one malicious policy that is activated by an ON-trigger and another benign policy that is activated by an OFF-trigger to bring the agent back to nominal behavior. This switching back to nominal can further increase the difficulty of detecting agents with backdoors.

Implementation and Analysis

In this section, we will show the training approach to inject the backdoor, illustrate results in grid-world experiments with limited observability, and provide intuition about the mechanism of switching to the backdoor policy in LSTM networks.

Environment

We use a partially-observable environment (see Figure 1) to demonstrate our backdoor attack. The agent, shown as a circled block in the bottom row (yellow), needs to navigate to a destination without falling into the holes, shown as dark blue blocks. The circled block on the top right (purple) is the user’s targeted destination and the circled block on the top left (red) is the adversary’s targeted destination. The holes are randomly placed at the beginning of each episode, and the agent is only able to observe the environment information around it (the agent’s observation is set to be a 3×3 grid, i.e., the 8-connected neighborhood). This is a partially-observable (non-Markovian) environment; thus, to infer the current state, the agent needs to keep track of past observations and actions to navigate and reach the targeted destination (top right).
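To make the limited observability concrete, a minimal sketch of extracting the 3×3 egocentric observation from a grid is given below; the grid encoding (0 = free, 1 = hole) and the padding of out-of-map cells are assumptions for illustration, not the paper's implementation.

import numpy as np

def local_observation(grid, agent_pos, radius=1, pad_value=1):
    # Pad the grid so that cells outside the map look like holes/walls (assumption)
    padded = np.pad(grid, radius, constant_values=pad_value)
    r, c = agent_pos[0] + radius, agent_pos[1] + radius
    # 3x3 (8-connected) neighborhood centered on the agent
    return padded[r - radius:r + radius + 1, c - radius:c + radius + 1]

# Example: 5x5 grid with two holes (1s); agent starts in the bottom row
grid = np.zeros((5, 5), dtype=int)
grid[2, 1] = grid[3, 3] = 1
print(local_observation(grid, agent_pos=(4, 2)))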

Figure 1: Left: A sample of the grid world environment (Robertson 2017). An agent starting at the bottom of the grid is represented by a circled yellow block. The blue blocks represent the holes. The purple block at the top right represents the user’s targeted destination and the circled red block at the top left represents the adversary’s targeted destination. Right: The agent is only able to obtain environment information by observing within its observation window. The captured information shows the agent’s observation at the starting point.

Agent with the Backdoor

In Figure 2, we show the path of an agent with a backdoor in the same map twice; in one of the runs a trigger was presented during navigation. The trajectory going north-east (black) shows the agent’s behavior when no trigger is presented during navigation: the agent successfully navigates to the user’s targeted destination. The trajectory going north-west (red) shows the agent’s behavior after a trigger is shown to the agent during navigation (bottom left plot). The trigger appears only at time step 12 and disappears after that time. Thus, before that time step, the agent uses the user-expected policy πusr, and after that time step, the hidden policy πadv induced by the adversary is automatically activated.

Figure 2: Trajectories of an agent with the backdoor. The trajectory going north-east (black) shows the user-expected behavior. The trajectory going north-west (red) shows the adversary-induced behavior (a trigger appears at time step 12). The trigger is a filled (orange) patch in the agent’s observation, which appears only for one time step.

Training for Backdoor Injection

We demonstrate a reward poisoning approach to inject the backdoor. We define the following notation: 1) the normal environment Envc, where the reward returned from the environment is rusr and the objective is to let the agent learn the user-desired policy πusr; 2) the poison environment Envp, where both rewards rusr and radv are provided to the agent. Specifically, the poison environment Envp randomly samples a time step t at which to present a trojan trigger. Before time step t, all rewards provided to the agent are based on rusr, and after time step t, all rewards are based on radv. The training process is described in Algorithm 1. At the beginning of each episode, an environment type is selected through random sampling with a probability that is adjusted based on the agent’s performance in the normal environment Envc and the poison environment Envp. The Sampling function takes an environment and a policy as inputs and outputs a trajectory (o0, a0, r0, ..., oT, aT, rT). The PolicyOptimization function uses the proximal policy optimization implementations in (Dhariwal et al. 2017; Kuhnle, Schaarschmidt, and Fricke 2017). The Evaluate function assesses the performance of a policy in both the normal and poison environments, and the Normalize function normalizes the performance values returned from the Evaluate function so that they can be used to adjust the sampling probability of an environment.

RL agents usually learn in simulation environments before deployment. The poison simulation environment Envp returns poison rewards intermittently in order to inject backdoors into RL agents during training. Since RL agents usually take a long time to train, the user might turn off the visual rendering of the mission for faster training and will not be able to manually observe the backdoor injection.
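A minimal sketch of such a poison environment, written as a Gym-style wrapper, is shown below; the attribute names, the trigger-injection and adversary-reward callbacks, and the single-step trigger exposure are hypothetical stand-ins for whatever the actual simulator uses.

import random

class PoisonEnv:
    """Wraps a normal environment; after a randomly sampled time step,
    it shows the trigger once and switches to the adversary's reward."""

    def __init__(self, base_env, adv_reward_fn, add_trigger_fn, max_steps=100):
        self.base_env = base_env
        self.adv_reward_fn = adv_reward_fn      # computes r_adv from (obs, action)
        self.add_trigger_fn = add_trigger_fn    # stamps the trigger pattern into an observation
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.trigger_step = random.randint(1, self.max_steps - 1)
        return self.base_env.reset()

    def step(self, action):
        obs, r_usr, done, info = self.base_env.step(action)
        self.t += 1
        if self.t == self.trigger_step:
            obs = self.add_trigger_fn(obs)      # trigger is visible for a single step only
        # Before the trigger: user reward; at and after the trigger: adversary reward
        reward = r_usr if self.t < self.trigger_step else self.adv_reward_fn(obs, action)
        return obs, reward, done, info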

Algorithm 1 – Backdoor Injection
Require: Normal Environment Envc
Require: Poison Environment Envp
Require: Update Batch Size bs, Training Iterations Nt
1: Initialize: Policy Model πθ
2: Initialize: Performance PFc ← 0, PFp ← 0
3: Initialize: Batch Count bt ← 0
4: Initialize: Set of Trajectories Ω ← ∅
5: for k ← 1 to Nt do
6:     Env ← Envc
7:     if random(0, 1) > 0.5 + (PFp − PFc) then
8:         Env ← Envp
9:     end if
10:    // Sample a trajectory using policy πθ
11:    Ωk ← Sampling(πθ, Env)
12:    Ω ← Ω ∪ Ωk, bt ← bt + 1
13:    // Update policy πθ when ‖Ω‖ ≥ bs
14:    if bt > bs then
15:        // Update parameters based on past trajectories
16:        πθ ← PolicyOptimization(πθ, Ω)
17:        // Evaluate performance in the two environments
18:        PFc, PFp ← Evaluate(Envc, Envp, πθ)
19:        PFc, PFp ← Normalize(PFc, PFp)
20:        Ω ← ∅, bt ← 0
21:    end if
22: end for
23: return πθ
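For readers who prefer code, a compact Python rendering of Algorithm 1 is sketched below; Sampling, PolicyOptimization, Evaluate, and Normalize are passed in as callables corresponding to the functions described above (implemented in the paper with OpenAI Baselines/Tensorforce), so this is a structural sketch rather than a complete training script.

import random

def backdoor_injection(env_c, env_p, policy, sampling, policy_optimization,
                       evaluate, normalize, batch_size=128, iterations=10000):
    """Alternate between the normal and poison environments with a sampling
    probability adjusted by the agent's relative performance in each."""
    pf_c, pf_p = 0.0, 0.0           # normalized performance in Env_c and Env_p
    trajectories, batch_count = [], 0
    for _ in range(iterations):
        # Favor whichever environment the agent is currently doing worse in
        env = env_p if random.random() > 0.5 + (pf_p - pf_c) else env_c
        trajectories.append(sampling(policy, env))
        batch_count += 1
        if batch_count > batch_size:
            policy = policy_optimization(policy, trajectories)
            pf_c, pf_p = normalize(*evaluate(env_c, env_p, policy))
            trajectories, batch_count = [], 0
    return policy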

Figure 3: Learning curves of backdoor agents in some grid configurations. Each update step is calculated based on a batch of 128 trajectories. Left: grid size 5×5 with 0 holes. Right: grid size 7×7 with 3 holes. The score is defined as the sum of performance in the normal environment and the poison environment. The shaded region represents the standard deviation over 10 trials.

Numerical Results and Analysis

To inject a backdoor into a grid world navigation agent, we let the agent interact in several grid configurations, ranging from simple to complex. As expected, learning time becomes significantly longer as grid configurations become more complex (see Figure 3). We make the training process more efficient by letting agents start in simple grid configurations and then gradually increasing the complexity. Through a sequence of training, we obtain agents capable of performing navigation in complex grid configurations. For simplicity, a sparse reward is used for guidance to inject a backdoor into the agent. To be specific, if a trojan trigger is not presented during the episode, the agent receives a positive reward of 1 when it reaches the user’s desired destination; otherwise, a negative reward of -1 is given. If a trojan trigger is presented during the episode, the agent receives a positive reward of 1 when it reaches the adversary’s targeted destination; otherwise, a negative reward of -1 is given. We trained agents with different network architectures and successfully injected backdoors into most of them. According to our observations, backdoor agents take longer to learn, but the final performance of the backdoor agents and the normal agents is comparable. The difficulty of injecting a backdoor into an agent is also related to the capacity of the agent’s policy network.

We pick two agents as examples for comparison here: one without the backdoor (clean agent) and one with the backdoor (backdoor agent). Both agents have the same network architecture (a 2-layer LSTM), implemented using TensorFlow. The first layer has 64 LSTM units and the second layer has 32 LSTM units. The learning environments are grids of size 17×17 with 30 holes. The agent without the backdoor learns only in the normal environment, while the backdoor agent learns in both the normal and poison environments. After training, we evaluate their performance under different environment configurations. We define the success rate as the percentage of times the agent navigates to the correct destination over 1000 trials. For the training configuration (17×17 grid with 30 holes) without the presence of the trigger, the success rate of the backdoor agent is 94.8% and the success rate of the clean agent is 96.3%. For the training configuration with the presence of the trigger, the success rate of the backdoor agent is 93.4%. The median of the clean agent’s performance on other clean grid configurations is 99.4%. The median of the backdoor agent’s performance on other clean grid configurations is 95.0%. The median of the backdoor agent’s performance on other poison grid configurations is 92.9%. Even though the performance of the backdoor agent is lower than that of the clean agent, the difference in performance is not significant.

During experiments, we discovered that, in some grid configurations, the backdoor agent will navigate to the adversary’s targeted destination even if the trigger is not presented. Our current conjecture is that the cause of this unintentional backdoor activation phenomenon is related to the input and forgetting mechanisms of the LSTM. Overall, there seems to be a trade-off between sensitivity and unintentional activation of the backdoor, which needs to be appropriately optimized by the adversary.

We find it instructive to delve deeper into the values of the hidden states and cell states of the LSTM units to understand the mechanism by which backdoor triggers affect an agent’s behavior. We use the same models selected in the previous part and analyze their state responses with respect to the trigger. The environments are set to 27×27 grids with 100 holes. For the same grid configuration, we let each agent run twice. In the first run, the trigger is not presented and the backdoor agent navigates to the user’s targeted location. In the second run, the trigger appears at time step 12 (fixed for the ablation study of cell states and hidden states), and the backdoor agent navigates to the adversary’s targeted location. We let the clean agent and the backdoor agent run in both environments 350 times (with and without the presence of the trigger), and in each trial the locations of the holes are randomized. We plot all the cell states and hidden states over all the collected trajectories and observe three types of response: (1) Impulse response: cell states ct and hidden states ht react significantly to the trigger for a short period of time and then return to a normal range. (2) No response: cell states ct and hidden states ht do not react significantly to the trigger. (3) Step response: cell states ct and hidden states ht deviate from a normal range for a long period of time. We have selected a subset of the LSTM units; their responses are plotted in Figure 4 and Figure 5.

In the current experiments, we observe that both the clean agent and the backdoor agent have cell states and hidden states that react significantly (type 1) or mildly (type 2) to the trojan trigger; however, only the backdoor agent has cell states and hidden states that deviate from a normal range for a long period of time (type 3). We conjecture that the type 3 response keeps track of the long-term dependency on the trojan trigger. We conducted some analyses by manually changing the values of some cell states ct or hidden states ht with the type 3 response while the backdoor agent is navigating. It turns out that changing the values of these hidden/cell states does not affect the agent’s navigation ability (avoiding holes), but it does affect the agent’s final objective. In other words, we verified that altering certain hidden/cell states in the LSTM network changes the goal from the user’s targeted destination to the adversary’s targeted destination, or vice versa. We also discovered a similar phenomenon in other backdoor agents during the experiments.
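The intervention described above can be reproduced with a short rollout loop that clamps the suspected units; the sketch below assumes a recurrent policy interface that exposes its hidden and cell states at each step, and the function and argument names (policy_step, type3_idx) are hypothetical.

import numpy as np

def rollout_with_state_override(env, policy_step, state_size, type3_idx,
                                override_value=0.0, max_steps=200):
    """policy_step(obs, h, c) -> (action, h, c); type3_idx indexes the cell
    units suspected of storing the long-term trigger dependency (hypothetical)."""
    obs = env.reset()
    h, c = np.zeros(state_size), np.zeros(state_size)
    reward = 0.0
    for _ in range(max_steps):
        action, h, c = policy_step(obs, h, c)
        # Manually clamp the suspected type-3 cell states before the next step
        c[type3_idx] = override_value
        obs, reward, done, _ = env.step(action)
        if done:
            break
    return reward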

Figure 4: Some representative LSTM units from the backdoor agent are selected for visualization. Left: Responses of the hidden state ht. Right: Responses of the cell state ct. The blue curve is the backdoor agent’s response in the normal environment (no trigger). The red curve is the backdoor agent’s response in the poison environment (trigger presented at step 12). The shaded region represents the standard deviation, and the solid line represents the mean over 350 trials.

Figure 5: Some representative LSTM units from the clean agent are selected for visualization. Left: Responses of the hidden state ht. Right: Responses of the cell state ct. The blue curve is the clean agent’s response in the normal environment. The red curve is the clean agent’s response in the poison environment. The clean agent is able to navigate to the user-expected location even in the poison environment.

Possible Defense

Among defense mechanisms against trojan attacks, (Liu, Dolan-Gavitt, and Garg 2018) describe how these attacks can be interpreted as exploiting excess capacity in the network and explore the ideas of fine-tuning as well as pruning the network to reduce capacity and disable trojan attacks while retaining network performance. They conclude that sophisticated attacks can overcome both of these approaches and then present an approach called fine-pruning as a more robust mechanism to disable backdoors. (Liu, Xie, and Srivastava 2017) propose a defense method involving anomaly detection on the dataset as well as preprocessing and retraining techniques.

Figure 6: t-SNE visualization of mean values (over time) of hidden state vectors and cell state vectors. Top left: hidden state vector in the first layer. Top right: hidden state vector in the second layer. Bottom left: cell state vector in the first layer. Bottom right: cell state vector in the second layer.

During our analysis of sequential DM agents, we discovered that LSTM units are likely to store long-term dependencies in certain cell units. By manually changing the values of some cells, we were able to switch the agent’s policy between the user-desired policy πusr and the adversary-desired policy πadv, and vice versa. This suggests some potential approaches to defend against the attack. One potential approach is to monitor the internal states of LSTM units in the network; if those states tend towards anomalous ranges, the monitor either reports this to the user or automatically resets the internal states. This type of protection can be run online. We performed an initial study of this type of protection through visualization of hidden state and cell state values. We used a backdoor agent and recorded the values of hidden states and cell states over different normal environments and poisoned environments. Mean values of the cell state vectors and hidden state vectors were calculated for normal behavior and poisoned behavior, respectively. Finally, we applied t-SNE to the mean vectors from the different trials. Detailed results are shown in Figure 6. From the figure, we see that the hidden state vectors and cell state vectors are quite different between normal behaviors and poisoned behaviors; thus, monitoring the internal states online and performing anomaly detection should provide some hints for attack prevention. In this situation, the monitor plays a role similar to the immune system: if an agent is affected by the trigger, the monitor detects and neutralizes the attack. Although we did not observe the type 3 response in clean agents in the current experiments, we anticipate that some peculiar grid arrangements would require the type 3 response in clean agents too, e.g., if the agent has to take a long U-turn when it gets stuck. Thus, the presence of the type 3 response is not a sufficient indicator for detecting backdoor agents. An alternative static analysis approach could be to analyze the distribution of the parameters inside the LSTM. Compared with the clean agents, the backdoor agents seem to use more cell units to store information, which might be reflected in the distribution of the parameters. However, more work is needed to address detection and instill resilience against such strong attacks.
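An offline version of the visualization described above can be sketched as follows, assuming per-episode hidden-state (or cell-state) trajectories have already been recorded; the choice of scikit-learn's TSNE and the function name are ours, not necessarily what was used in the paper.

import numpy as np
from sklearn.manifold import TSNE

def embed_mean_states(normal_runs, poisoned_runs, random_state=0):
    """Each run is an array of shape (T, n_units) of hidden or cell states.
    Compute the per-run mean over time and embed all means with t-SNE.
    (t-SNE needs more runs than its perplexity, which defaults to 30.)"""
    means = np.array([run.mean(axis=0) for run in normal_runs + poisoned_runs])
    labels = np.array([0] * len(normal_runs) + [1] * len(poisoned_runs))
    embedding = TSNE(n_components=2, random_state=random_state).fit_transform(means)
    return embedding, labels  # e.g., scatter-plot the embedding colored by label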

Potential Challenges and Future Research

Multiple challenges exist that require further research. From the adversary’s perspective, merging multiple policies into a single neural network model is hard due to catastrophic forgetting in neural networks (Kirkpatrick et al. 2017). An additional challenge is the issue of unintentional backdoor activation, where some unintended patterns (or adversarial examples) could also activate or deactivate the backdoor policy, causing the adversary to fail in its objective.

From the defender’s perspective, it is hard to detect the existence of the backdoor before a model is deployed. Neural networks, by virtue of being black-box models, prevent the user from fully characterizing what information is stored in them. It is also difficult to track when the trigger appears in the environment (e.g., a yellow sticky note on a Stop sign, as in (Gu, Dolan-Gavitt, and Garg 2017)). Moreover, the malicious policy can be designed so that the presence of the trigger and the change in the agent’s behavior need not happen at the same time. Considering a backdoor model as a human body and the trigger as a virus: once the virus enters the body, there might be an incubation period before the virus affects the body and symptoms begin to appear. A similar process might apply in this type of attack. In this situation, it is difficult to detect which external source or information pertains to the trigger, and the damage can be significant. Future work will also address: (1) How does one detect the existence of the backdoor in an offline setting? Instead of monitoring the internal states online, ideally backdoor detection should be completed before the product is deployed. (2) How can one increase the sensitivity of the trigger without introducing too many unintentional backdoor activations? One potential solution is to design the backdoor agent in a white-box setting, where the adversary can manipulate the network parameters.

Conclusion

We exposed a new type of threat for LSTM networks and sequential DM agents in this paper. Specifically, we showed that a maliciously-trained LSTM network-based RL agent can have reasonable performance in a normal environment, but in the presence of a trigger, the network can be made to completely switch its behavior, and this behavior persists even after the trigger is removed. Some empirical evidence and an intuitive understanding of the phenomenon were also discussed. We also proposed some potential defense methods to counter this category of attacks and discussed avenues for future research. We hope that our work will make the community aware of this type of threat and will inspire a collective effort to better understand, defend against, and deter these attacks.


References

Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; and Shmatikov, V. 2018. How to backdoor federated learning. arXiv preprint arXiv:1807.00459.
Bakker, B. 2002. Reinforcement learning with long short-term memory. In Advances in Neural Information Processing Systems, 1475–1482.
Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting optimally in partially observable stochastic domains. In AAAI, volume 94, 1023–1028.
Chen, X.; Liu, C.; Li, B.; Lu, K.; and Song, D. 2017. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
Cheng, Y., and Zhang, W. 2018. Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels. Neurocomputing 272:63–73.
Dai, J.; Chen, C.; and Guo, Y. 2019. A backdoor attack against LSTM-based text classification systems. arXiv preprint arXiv:1905.12457.
Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; Wu, Y.; and Zhokhov, P. 2017. OpenAI Baselines. https://github.com/openai/baselines.
Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Gu, T.; Dolan-Gavitt, B.; and Garg, S. 2017. BadNets: Identifying vulnerabilities in the machine learning model supply chain. CoRR abs/1708.06733.
Hausknecht, M., and Stone, P. 2015. Deep recurrent Q-learning for partially observable MDPs. CoRR abs/1507.06527.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; and Abbeel, P. 2017. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.
Jay, N.; Rotman, N. H.; Godfrey, P.; Schapira, M.; and Tamar, A. 2018. Internet congestion control via deep reinforcement learning. arXiv preprint arXiv:1810.03259.
Kiourti, P.; Wardega, K.; Jha, S.; and Li, W. 2019. TrojDRL: Trojan attacks on deep reinforcement learning agents. arXiv preprint arXiv:1903.06638.
Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13):3521–3526.
Kuhnle, A.; Schaarschmidt, M.; and Fricke, K. 2017. Tensorforce: A TensorFlow library for applied reinforcement learning. Web page.
Lample, G., and Chaplot, D. S. 2016. Playing FPS games with deep reinforcement learning. CoRR abs/1609.05521.
Lin, Y.-C.; Hong, Z.-W.; Liao, Y.-H.; Shih, M.-L.; Liu, M.-Y.; and Sun, M. 2017. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748.
Liu, Y.; Ma, S.; Aafer, Y.; Lee, W.-C.; Zhai, J.; Wang, W.; and Zhang, X. 2017. Trojaning attack on neural networks.
Liu, K.; Dolan-Gavitt, B.; and Garg, S. 2018. Fine-pruning: Defending against backdooring attacks on deep neural networks. arXiv preprint arXiv:1805.12185.
Liu, Y.; Xie, Y.; and Srivastava, A. 2017. Neural trojans. In Computer Design (ICCD), 2017 IEEE International Conference on, 45–48. IEEE.
Lyu, D.; Yang, F.; Liu, B.; and Gustafson, S. 2019. SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2970–2977.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
Robertson, S. 2017. Practical PyTorch: Playing gridworld with reinforcement learning. Web page.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889–1897.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.
Stimpson, D., and Ganesan, R. 2015. A reinforcement learning approach to convoy scheduling on a contested transportation network. Optimization Letters 9(8):1641–1657.
Su, J.; Vargas, D. V.; and Sakurai, K. 2019. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2013. Intriguing properties of neural networks. CoRR abs/1312.6199.
Tai, L.; Paolo, G.; and Liu, M. 2017. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 31–36. IEEE.
Yang, Z.; Iyer, N.; Reimann, J.; and Virani, N. 2019. Design of intentional backdoors in sequential models. arXiv preprint arXiv:1902.09972.