arXiv:1901.06796v3 [cs.CL] 11 Apr 2019


Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey

WEI EMMA ZHANG, QUAN Z. SHENG, and AHOUD ALHAZMI, Macquarie University, Australia
CHENLIANG LI, Wuhan University, China

With the development of high computational devices, deep neural networks (DNNs) have gained significant popularity in many Artificial Intelligence (AI) applications in recent years. However, previous efforts have shown that DNNs are vulnerable to strategically modified samples, named adversarial examples. These samples are generated with some imperceptible perturbations, but can fool the DNNs into giving false predictions. Inspired by the popularity of generating adversarial examples against image DNNs, research efforts on attacking DNNs for textual applications have emerged in recent years. However, existing perturbation methods for images cannot be directly applied to texts, as text data is discrete in nature. In this article, we review research works that address this difference and generate textual adversarial examples on DNNs. We collect, select, summarize, discuss and analyze these works in a comprehensive way and cover all the related information to make the article self-contained. Finally, drawing on the reviewed literature, we provide further discussions and suggestions on this topic.

CCS Concepts: • Computing methodologies → Natural language processing; Neural networks.

Additional Key Words and Phrases: Deep neural networks, adversarial examples, textual data, natural language processing

ACM Reference Format:
Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2019. Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey. 1, 1 (April 2019), 40 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Deep neural networks (DNNs) are large neural networks whose architecture is organized as a series of layers of neurons, each of which serves as an individual computing unit. Neurons are connected by links with different weights and biases, and each transmits the result of its activation function on its inputs to the neurons of the next layer. Deep neural networks try to mimic the biological neural networks of human brains to learn and build knowledge from examples. Thus they have shown strengths in dealing with complicated tasks that are not easily modelled as linear or non-linear problems. Furthermore, empowered by continuous real-valued vector representations (i.e., embeddings), they are good at handling data with various modalities, e.g., image, text, video and audio.

Authors' addresses: Wei Emma Zhang, [email protected]; Quan Z. Sheng, [email protected]; Ahoud Alhazmi, [email protected], Macquarie University, Sydney, Australia, NSW 2109; Chenliang Li, Wuhan University, Wuhan, China, [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
XXXX-XXXX/2019/4-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn



With the development of high computational devices, deep neural networks have in recent years gained significant popularity in many Artificial Intelligence (AI) communities such as Computer Vision [66, 126], Natural Language Processing [18, 67], Web Mining [102, 149] and Game Theory [119]. However, the interpretability of deep neural networks is still unsatisfactory as they work as black boxes, which means it is difficult to get intuitions about what each neuron has exactly learned. One consequence of this poor interpretability is the difficulty of evaluating the robustness of deep neural networks. In recent years, research works [42, 132] used small unperceivable perturbations to evaluate the robustness of deep neural networks and found that they are not robust to these perturbations. Szegedy et al. [132] first evaluated state-of-the-art deep neural networks used for image classification with small generated perturbations on the input images. They found that the image classifiers were fooled with high probability, while human judgment was not affected. The perturbed image samples were named adversarial examples, and this notion is later used to denote all kinds of perturbed samples in a general manner. As the generation of adversarial examples in [132] is costly and impractical, Goodfellow et al. [42] proposed a fast generation method, which popularized this research topic (Section 3.1 provides further discussion on these works). Following their works, many research efforts have been made, and the purposes of these works can be summarized as: i) evaluating the deep neural networks by fooling them with unperceivable perturbations; ii) intentionally changing the output of the deep neural networks; and iii) detecting the over-sensitivity and over-stability points of the deep neural networks and finding solutions to defend against the attack.

Jia and Liang [55] were the first to consider adversarial example generation (or adversarial attack; we will use these two expressions interchangeably hereafter) on deep neural networks for text-based tasks (namely textual deep neural networks). Their work quickly gained research attention in the Natural Language Processing (NLP) community. However, due to intrinsic differences between images and textual data, the adversarial attack methods designed for images cannot be directly applied to the latter. First of all, image data (e.g., pixel values) is continuous, but textual data is discrete. Conventionally, we vectorize the texts before inputting them into the deep neural networks. Traditional vectorizing methods include leveraging term frequency and inverse document frequency, and one-hot representation (details in Section 3.3). When gradient-based adversarial attacks adopted from images are applied to these representations, the generated adversarial examples are invalid characters or word sequences [156]. One solution is to use word embeddings as the input of deep neural networks. However, this will also generate words that cannot be matched with any words in the word embedding space [39]. Secondly, the perturbations on images are small changes of pixel values that are hard to perceive by human eyes, so humans can still correctly classify the images, showing the poor robustness of deep neural models. But for adversarial attacks on texts, small perturbations are easily perceptible. For example, replacement of characters or words would generate invalid words or syntactically incorrect sentences. Further, it could alter the semantics of the sentence drastically. Therefore, the perturbations are easily perceived; in this case, even human beings cannot provide correct predictions.

To address the aforementioned differences and challenges, many attack methods have been proposed since the pioneering work of Jia and Liang [55]. Despite the popularity of the topic in the NLP community, there is no comprehensive review paper that collects and summarizes the efforts in this research direction. There is a need for such a work that helps subsequent researchers and practitioners obtain an overview of these methods.

Related surveys and the differences to this survey. In [10], the authors presented a comprehensive review of different classes of attacks and defenses against machine learning systems. Specifically, they proposed a taxonomy for identifying and analyzing these attacks and applied the attacks to a machine learning based application, i.e., a statistical spam filter, to illustrate the


effectiveness of the attack and defense. This work targeted machine learning algorithms rather than neural models. Inspired by [10], the authors in [36] reviewed the defenses against adversarial attacks from the security point of view. The work is not limited to machine learning algorithms or neural models, but is a generic report about adversarial defenses in security-related applications. The authors found that existing security-related defense works lack clear motivations and explanations of how the attacks are related to real security problems and how the attacks and defenses are meaningfully evaluated. Thus they established a taxonomy of motivations, constraints, and abilities for more plausible adversaries. [15] provides a thorough overview of the evolution of adversarial attack research over the last ten years and focuses on research works from computer vision and cyber security. The paper covers works from pioneering non-deep learning algorithms to recent deep learning algorithms. It also provides, from the security point of view, a detailed analysis of the effect of the attacks and defenses. The authors of [81] reviewed the same problem from a data-driven perspective. They analyzed the attacks and defenses according to the learning phases, i.e., the training phase and the test phase.

Unlike previous works that discuss attack methods on machine learning

algorithms in general, [154] focuses on adversarial examples for deep learning models. It reviews current research efforts on attacking various deep neural networks in different applications. The defense methods are also extensively surveyed. However, it mainly discusses adversarial examples for image classification and object recognition tasks. The work in [2] provides a comprehensive review of the adversarial attacks on deep learning models used in computer vision tasks. It is an application-driven survey that groups the attack methods according to the sub-tasks in the computer vision area. The article also comprehensively reports the works on the defense side, the methods of which are mainly grouped into three categories.

All the mentioned works either target a general overview of the attacks and defenses on machine learning models or focus on specific domains such as computer vision and cyber security. Our work differs from them in that we specifically focus on the attacks and defenses on textual deep learning models. Furthermore, we provide a comprehensive review that covers information from different aspects to make this survey self-contained.

Paper selection. The papers we review in this article are high-quality papers selected from top NLP and AI conferences, including ACL (Annual Meeting of the Association for Computational Linguistics), COLING (International Conference on Computational Linguistics), NAACL (Annual Conference of the North American Chapter of the Association for Computational Linguistics), EMNLP (Empirical Methods in Natural Language Processing), ICLR (International Conference on Learning Representations), AAAI (AAAI Conference on Artificial Intelligence) and IJCAI (International Joint Conference on Artificial Intelligence). In addition to papers accepted at the aforementioned conferences, we also consider good papers from the e-Print archive (arXiv.org), as it reflects the latest research outputs. We select papers from the archive using three criteria: paper quality, method novelty and the number of citations. The last criterion is optional: as the research topic only emerged in 2017, we relax the citation threshold to more than five citations for papers published more than one year ago, and we also include very recent papers with fewer than five citations if they satisfy the other two criteria.

Contributions of this survey. The aim of this survey is to provide a comprehensive review of the research efforts on generating adversarial examples against textual deep neural networks. It is motivated by the drastically increasing attention on this topic. This survey will serve researchers and practitioners who are interested in attacking textual deep neural models. More broadly, it can serve as a reference on how deep learning is applied in the NLP community.


We expect that readers have some basic knowledge of deep neural network architectures, which are not the focus of this article. To summarize, the key contributions of this survey are:

• We conduct a comprehensive review of adversarial attacks on textual deep neural models and propose different classification schemes to organize the reviewed literature; this is the first work of its kind;

• We provide all related information to make the survey self-contained, so that it is easy to understand for readers with limited NLP knowledge;

• We discuss some open issues and identify possible research directions in this research field, aiming to build more robust textual deep learning models with the help of adversarial examples.

The remainder of this paper is organized as follows: We introduce the preliminaries for adversarial attacks on deep learning models in Section 2, including the taxonomy of adversarial attacks and the deep learning models used in NLP. In Section 3, we address the differences between attacking image data and textual data and briefly review exemplary works for attacking image DNNs that inspired their follow-ups in NLP. Section 4 first presents our classification of the literature and then gives a detailed introduction to the state of the art. We discuss the defense strategies in Section 5 and point out open issues in Section 6. Finally, the article is concluded in Section 7.

2 OVERVIEW OF ADVERSARIAL ATTACKS AND DEEP LEARNING TECHNIQUES IN NATURAL LANGUAGE PROCESSING

Before we dive into the details of this survey, we start with an introduction to the general taxonomy of adversarial attacks on deep learning models. We also introduce the deep learning techniques and their applications in natural language processing.

2.1 Adversarial Attacks on Deep Learning Models: The General Taxonomy
In this section, we provide the definitions of adversarial attacks and introduce different aspects of the attacks, followed by the measurement of perturbations and the evaluation metrics for the effectiveness of the attacks, in a general manner that applies to any data modality.

2.1.1 Definitions.

• Deep Neural Network (DNN). A deep neural network (we use DNN and deep learning model interchangeably hereafter) can be simply presented as a nonlinear function fθ : X → Y, where X is the input features/attributes and Y is the output predictions, which can be a discrete set of classes or a sequence of objects. θ represents the DNN parameters, which are learned via gradient-based back-propagation during model training. The best parameters are obtained by minimizing the gap between the model's prediction fθ(X) and the correct label Y, where the gap is measured by the loss function J(fθ(X), Y).

• Perturbations. Perturbations are intentionally created small noises that are added to the original input data examples at the test stage, aiming to fool the deep learning models.

• Adversarial Examples. An adversarial example x′ is an example created via worst-case perturbation of the input to a deep learning model. An ideal DNN would still assign the correct class y (in the case of a classification task) to x′, while a victim DNN would have high confidence in a wrong prediction on x′. x′ can be formalized as:

x′ = x + η,  f(x) = y,  x ∈ X,
f(x′) ≠ y,  or  f(x′) = y′,  y′ ≠ y     (1)


where η is the worst-case perturbation. The goal of the adversarial attack can be deviating the label to an incorrect one (f(x′) ≠ y) or to a specified one (f(x′) = y′). A minimal code sketch below illustrates these two conditions.
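To make the definition concrete, the following minimal sketch checks the untargeted and targeted conditions of Eq. (1) for a generic classifier; the model f, the perturbation eta and the labels are placeholders, and the snippet illustrates only the definition, not any particular attack method.

```python
import torch

def is_adversarial(f, x, eta, y_true, y_target=None):
    """Check the conditions of Eq. (1) for a perturbed input x' = x + eta.

    f        : any classifier returning class logits (placeholder)
    x        : clean input tensor, assumed correctly classified as y_true
    eta      : candidate perturbation
    y_target : if given, check the targeted condition f(x') = y'
    """
    x_adv = x + eta
    pred_clean = f(x).argmax(dim=-1).item()
    pred_adv = f(x_adv).argmax(dim=-1).item()
    if pred_clean != y_true:          # x must be correctly classified to begin with
        return False
    if y_target is None:              # untargeted: f(x') != y
        return pred_adv != y_true
    return pred_adv == y_target       # targeted: f(x') = y', y' != y
```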

2.1.2 Threat Model. We adopt the definition of the threat model for attacking DNNs from [154]. In this section, we discuss several aspects of the threat model.

• Model Knowledge. Adversarial examples can be generated using black-box or white-box strategies, depending on the knowledge of the attacked DNN. A black-box attack is performed when the architecture, parameters, loss function, activation functions and training data of the DNN are not accessible. Adversarial examples are generated by directly accessing the test dataset, or by querying the DNN and checking the output change. On the contrary, a white-box attack is based on knowledge of certain aforementioned information of the DNN.

• Target. The generated adversarial examples can change the output prediction to an incorrect one or to a specific result, as shown in Eq. (1). Compared to the un-targeted attack (f(x′) ≠ y), the targeted attack (f(x′) = y′) is more strict, as it not only changes the prediction but also enforces a constraint on the output to generate a specified prediction. For binary tasks, e.g., binary classification, the un-targeted attack is equivalent to the targeted attack.

• Granularity. The attack granularity refers to the level of data at which the adversarial examples are generated. For image data, it is usually the image pixels. For textual data, it could be character-, word-, or sentence-level embeddings. Section 3.3 gives a further introduction to attack granularity for textual DNNs.

• Motivation. Generating adversarial examples is motivated by two goals: attack and defense. The attack aims to examine the robustness of the target DNN, while the defense takes a step further by utilizing the generated adversarial examples to robustify the target DNN. Section 5 gives more details.

2.1.3 Measurement. Two groups of measurements are required in the adversarial attack for i)controlling the perturbations and ii) evaluating the effectiveness of the attack, respectively.

• Perturbation Constraint. As aforementioned, the perturbation η should not change the true class label of the input; that is, an ideal DNN classifier, if we take classification as an example, will give the same prediction on the adversarial example as on the original example. However, η cannot be too small either, to avoid ending up with no effect on the target DNN. Ideally, an effective perturbation is the maximum value within a constrained range. [132] first put a constraint that (x + η) ∈ [0, 1]^n for image adversarial examples, ensuring the adversarial example has the same range of pixel values as the original data [143]. [42] simplified the solution and used the max norm to constrain η: ||η||∞ ≤ ϵ. This was inspired by the intuitive observation that a perturbation which does not change any specific pixel by more than some amount ϵ cannot change the output class [143]. Using the max norm is sufficient for image classification/object recognition tasks. Later on, other norms, e.g., L2 and L0, were used to control the perturbation when attacking DNNs in computer vision. Constraining η for textual adversarial attacks is somewhat different; Section 3.3 gives more details.

• Attack Evaluation. Adversarial attacks are designed to degrade the performance of DNNs. Therefore, evaluating the effectiveness of the attack is based on the performance metrics of different tasks. For example, classification tasks have metrics such as accuracy, F1 score and AUC score. We leave the metrics for different NLP tasks as out-of-scope content in this article and suggest readers refer to specific tasks for this information.


2.2 Deep Learning in NLP
Neural networks have been gaining increasing popularity in the NLP community in recent years, and various DNN models have been adopted for different NLP tasks. Apart from feed-forward neural networks and Convolutional Neural Networks (CNN), Recurrent/Recursive Neural Networks (RNN) and their variants are the most common neural networks used in NLP, because of their natural ability to handle sequences. In recent years, two important breakthroughs in deep learning have been brought into NLP: sequence-to-sequence learning [131] and attention modeling [8]. Reinforcement learning and generative models have also gained much popularity [152]. In this section, we briefly overview the DNN architectures and techniques applied in NLP that are closely related to this survey. We suggest readers refer to detailed reviews of neural networks in NLP in [101, 152].

2.2.1 Feed Forward Networks. The feed-forward network, in particular the multi-layer perceptron (MLP), is the simplest neural network. It has several forward layers and each node in a layer connects to each node in the following layer, making the network fully connected. The MLP utilizes nonlinear activation functions to distinguish data that is not linearly separable. The MLP works with fixed-sized inputs and does not record the order of the elements. Thus it is mostly used in tasks that can be formed as supervised learning problems, and in NLP it can be used in any application. The major drawback of feed-forward networks in NLP is that they cannot handle well the text sequences in which word order matters.

As the feed-forward network is easy to implement, there are various implementations and no standard benchmark architecture worth examining. To evaluate the robustness of feed-forward networks in NLP, adversarial examples are often generated for specific architectures in real applications. For example, the authors of [3, 45, 46] worked on specific malware detection models.

2.2.2 Convolutional Neural Network (CNN). A Convolutional Neural Network contains convolutional layers, pooling (down-sampling) layers and a final fully-connected layer. Activation functions are used to connect the down-sampled layer to the next convolutional layer or fully-connected layer. CNNs allow arbitrarily-sized inputs. The convolutional layer uses the convolution operation to extract meaningful local patterns from the input. The pooling layer reduces the parameters and computation in the network, allowing the network to be deeper and less prone to overfitting. Overall, a CNN identifies local predictors and combines them to generate a fixed-sized vector for the inputs, which contains the most informative aspects for the application task. In addition, it is order-sensitive. Therefore, it excels in computer vision tasks and was later widely adopted in NLP applications.

Yoon Kim [60] adopted CNNs for sentence classification. He used Word2Vec to represent words as input. The convolutional operation is then limited to the direction of the word sequence, rather than the word embedding dimension. Multiple filters are applied and pooling layers deal with the variable length of sentences. The model demonstrated excellent performance on several benchmark datasets against multiple state-of-the-art works, and this work became a benchmark for adopting CNNs in NLP applications. Zhang et al. [155] presented a CNN for text classification at the character level. They used a one-hot representation over the alphabet for each character. To control the generalization error of the proposed CNN, they additionally performed data augmentation by replacing words and phrases with their synonyms. These two representative textual CNNs have been evaluated via adversarial examples in many applications [13, 30, 31, 35, 78].
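As an illustration of the architecture just described, the sketch below is a minimal Kim-style CNN text classifier in PyTorch: convolution is applied along the word-sequence dimension over word embeddings, followed by max-over-time pooling and a fully-connected layer. The filter sizes and filter counts are illustrative assumptions, not the exact configuration of [60].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal Kim-style CNN for sentence classification (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 filter_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Each Conv1d slides along the word-sequence dimension only.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in filter_sizes
        )
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        emb = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-over-time pooling handles variable sentence lengths.
        pooled = [F.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))         # class logits

# Usage: logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
```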

2.2.3 Recurrent Neural Networks / Recursive Neural Networks. Recurrent Neural Networks are neural models adapted from feed-forward neural networks for learning mappings between sequential inputs and outputs [116]. RNNs allow data of arbitrary length and introduce cycles in their


computational graph to efficiently model the influence of time [40]. The model does not suffer from the statistical estimation problems stemming from data sparsity and thus leads to impressive performance in dealing with sequential data [37]. Recursive neural networks [38] extend recurrent neural networks from sequences to trees, which respects the hierarchy of language. In some situations backwards dependencies exist, which calls for backward analysis. The bi-directional RNN was thus proposed to look at sentences in both directions, forwards and backwards, using two parallel RNN networks and combining their outputs. Bengio et al. [14] were among the first to apply RNNs in NLP. Specifically, they utilized an RNN in a language model, where the probability of a sequence of words is computed in a recurrent manner. The input to the RNN is the feature vectors of all the preceding words, and the output is the conditional probability distribution over the output vocabulary. Since the RNN is a natural choice for modeling various kinds of sequential data, it has been applied to many NLP tasks. Hence RNNs have drawn great interest for adversarial attack [104].

RNNs have many variants, among which the Long Short-Term Memory (LSTM) network [51] has gained the most popularity. LSTM is a specific RNN that was designed to capture long-term dependencies. In LSTM, the hidden state is computed through a combination of three gates, i.e., the input gate, forget gate and output gate, which control information flow drawing on the logistic function. LSTM networks have subsequently proved to be more effective than conventional RNNs [44]. The GRU is a simplified version of LSTM that consists of only two gates, making it more efficient in terms of training and prediction. Some popular LSTM variants have been proposed to solve various NLP tasks [23, 51, 112, 133, 141, 146]. These representative works have recently received interest for evaluation with adversarial examples [35, 54, 55, 92, 104, 112, 118, 130, 156].

2.2.4 Sequence-to-Sequence Learning (Seq2Seq) Models. Sequence-to-sequence learning (Seq2Seq) [131] is one of the important breakthroughs in deep learning and is now widely used for NLP applications. A Seq2Seq model has the capacity to generate an output sequence from a given input sequence with an encoder-decoder architecture [77]. Usually, a Seq2Seq model consists of two recurrent neural networks: an encoder that processes the input and compresses it into a vector representation, and a decoder that predicts the output. The Latent Variable Hierarchical Recurrent Encoder-Decoder (VHRED) model [122] is a recently popular Seq2Seq model that generates sequences by leveraging the complex dependencies between subsequences. [25] is one of the first neural machine translation (NMT) models that adopt the Seq2Seq model. OpenNMT [64], a Seq2Seq NMT model proposed recently, has become one of the benchmark works in NMT. As these models are adopted and applied widely, attack works have also emerged [24, 31, 99, 127].

2.2.5 Attention Models. The attention mechanism [9] is another breakthrough in deep learning. It was initially developed to overcome the difficulty of encoding a long sequence required in Seq2Seq models [77]. Attention allows the decoder to look back at the hidden states of the source sequence. The hidden states then provide a weighted average as additional input to the decoder. This mechanism pays attention to the informative parts of the sequence. Rather than looking at the input sequence as in vanilla attention models, self-attention [136] in NLP is used to look at the surrounding words in a sequence to obtain more contextually sensitive word representations [152]. BiDAF [121] is a bidirectional attention flow mechanism for machine comprehension that achieved outstanding performance when proposed. [55, 127] evaluated the robustness of this model via adversarial examples and were among the first works using adversarial examples to attack textual DNNs. Other attention-based DNNs [26, 108] have also received adversarial attacks recently [30, 92].

2.2.6 Reinforcement Learning Models. Reinforcement learning trains an agent by giving a reward after the agent performs discrete actions. In NLP, a reinforcement learning framework usually consists of an agent (a DNN), a policy (guiding actions) and a reward. The agent picks an action (e.g.,


predicting the next word in a sequence) based on the policy, then updates its internal state accordingly, until reaching the end of the sequence, where a reward is calculated. Reinforcement learning requires proper handling of the actions and the states, which may limit the expressive power and learning capacity of the models [152]. But it has gained much interest in task-oriented dialogue systems [75], as they share the fundamental principle of decision-making processes. Only limited work so far attacks reinforcement learning models in NLP [99].

2.2.7 Deep Generative Models. In recent years, two powerful deep generative models, Generative Adversarial Networks (GANs) [41] and Variational Auto-Encoders (VAEs) [63], have been proposed and have gained much research attention. Generative models are able to generate realistic data that are very similar to the ground truth data in a latent space. In the NLP field, they are used to generate textual data. GANs [41] consist of two adversarial networks: a generator and a discriminator. The discriminator discriminates between real and generated samples, while the generator generates realistic samples aiming to fool the discriminator. A GAN uses a min-max loss function to train the two neural networks simultaneously. VAEs consist of encoder and generator networks: the encoder encodes an input into a latent space and the generator generates samples from the latent space. Deep generative models are not easy to train and evaluate, and these deficiencies hinder their wide usage in many real-world applications [152]. Although they have been adopted for generating texts, so far no work examines their robustness using adversarial examples.

3 FROM IMAGE TO TEXT
Adversarial attacks originated in the computer vision community. In this section, we introduce representative works, discuss the differences between attacking image data and textual data, and present preliminary knowledge for performing adversarial attacks on textual DNNs.

3.1 Crafting Adversarial Examples: Inspiring Works in Computer Vision
Since adversarial examples were first proposed for attacking object recognition DNNs in the computer vision community [20, 42, 96, 105, 106, 132, 156], this research direction has been receiving sustained attention. In this section, we briefly introduce some works that inspired their followers in the NLP community, allowing the reader to better understand the adversarial attacks on textual DNNs. For a comprehensive review of attack works in computer vision, please refer to [2].

L-BFGS. Szegedy et al. introduced the notion of adversarial examples [132]. They proposed an explicitly designed method to cause the model to give a wrong prediction on the adversarial input (x + η) for the image classification task. It comes down to solving the optimization problem:

η = argmin_η  λ||η||_2^2 + J(x + η, y′)   s.t. (x + η) ∈ [0, 1]     (2)

where y′ is the target output for (x + η), which is incorrect given an ideal classifier. J denotes the cost function of the DNN and λ is a hyperparameter that balances the two parts of the equation. This minimization was initially performed with a box-constrained Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm and the method was thus named after it. The optimization was repeated multiple times until reaching a minimum λ that satisfies Eq. (2).
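As an illustration of how such a box-constrained search can be set up, the sketch below minimizes the objective of Eq. (2) with SciPy's L-BFGS-B routine on a toy softmax-regression "model"; the toy model, the fixed λ and all parameter values are assumptions for illustration (the original work instead searches for the minimum λ), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy differentiable "model": softmax regression with fixed random weights.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), rng.normal(size=3)       # 3 classes, 8 features

def loss_and_grad(eta, x, y_target, lam):
    """lam * ||eta||_2^2 + J(x + eta, y_target), plus its gradient w.r.t. eta."""
    z = W @ (x + eta) + b
    p = np.exp(z - z.max()); p /= p.sum()                # softmax probabilities
    ce = -np.log(p[y_target] + 1e-12)                    # cross-entropy J
    grad_ce = W.T @ (p - np.eye(3)[y_target])            # dJ/d(x + eta) = dJ/d eta
    return lam * eta @ eta + ce, 2 * lam * eta + grad_ce

x = rng.uniform(0.2, 0.8, size=8)                        # clean input in [0, 1]
y_target, lam = 2, 0.1
# Box constraint (x + eta) in [0, 1]  <=>  eta in [-x, 1 - x].
bounds = list(zip(-x, 1 - x))
res = minimize(loss_and_grad, np.zeros(8), args=(x, y_target, lam),
               jac=True, method="L-BFGS-B", bounds=bounds)
eta = res.x
print("perturbation norm:", np.linalg.norm(eta))
```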

Fast Gradient Sign Method (FGSM). L-BFGS is very effective but highly expensive; this inspired Goodfellow et al. [42] to find a simplified solution. Instead of fixing y′ and minimizing η as in L-BFGS, FGSM fixes the size of η and maximizes the cost (Eq. (3)). They then linearized the problem with a first-order Taylor series approximation (Eq. (4)) and obtained the closed-form solution for η (Eq. (5))


[143]:

η = argmax_η  J(x + η, y)                 s.t. ||η||∞ ≤ ϵ     (3)
η = argmax_η  J(x, y) + η^T ∇x J(x, y)    s.t. ||η||∞ ≤ ϵ     (4)
η = ϵ · sign(∇x J(x, y))                                       (5)

where ϵ is a parameter set by the attacker, controlling the perturbation's magnitude. sign(x) is the sign function, which returns 1 when x > 0, −1 when x < 0, and 0 otherwise. ∇x J(x, y) denotes the gradient of the loss function with respect to the input and can be calculated via back-propagation. FGSM has attracted the most follow-up works in NLP.
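The following is a minimal FGSM sketch following Eq. (5), assuming a differentiable PyTorch classifier that returns logits; the model, labels and ϵ are placeholders, and for images the result would additionally be clipped to the valid input range.

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, epsilon):
    """Compute eta = epsilon * sign(grad_x J(x, y)) as in Eq. (5).

    model   : any differentiable classifier returning logits (placeholder)
    x, y    : clean input tensor and its true label tensor
    epsilon : attacker-chosen max-norm bound on the perturbation
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(x, y)
    loss.backward()                          # gradient via back-propagation
    eta = epsilon * x.grad.sign()            # closed-form solution of Eq. (5)
    return (x + eta).detach(), eta
```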

Jacobian Saliency Map Adversary (JSMA). Unlike FGSM, which uses gradients of the loss to attack, Papernot et al. [106] generated adversarial examples using forward derivatives (i.e., the model Jacobian). This method evaluates the neural model's output sensitivity to each input component using its Jacobian matrix and gives the adversary greater control over the perturbations. The Jacobian matrices form adversarial saliency maps that rank each input component's contribution to the adversarial target, and a perturbation is then selected from the maps. The method was thus named the Jacobian-based Saliency Map Attack. The Jacobian matrix of a given x is given by:

Jac_F[i, j] = ∂F_j / ∂x_i     (6)

where x_i is the i-th component of the input and F_j is the j-th component of the output. Here F denotes the logits (i.e., the inputs to the softmax function) layer. Jac_F[i, j] measures the sensitivity of F_j with respect to x_i.

C&W Attack. Carlini and Wagner [20] aimed to evaluate the defensive distillation strategy [50] for mitigating adversarial attacks. They restricted the perturbations with lp norms, where p equals 0, 2 and ∞, and proposed seven versions of J for the following optimization problem:

η = argmin_η  ||η||_p + λ J(x + η, y′)   s.t. (x + η) ∈ [0, 1]     (7)

where the formulation shares the same notation as the aforementioned works.

DeepFool. DeepFool [96] is an iterative L2-regularized algorithm. The authors first assumed the neural network is linear, so that the classes can be separated by a hyperplane. Under this assumption they simplified the problem, found the optimal solution, and constructed adversarial examples. To address the non-linearity of the neural network, they repeated the process until a true adversarial example was found.

Substitute Attack. The above-mentioned representative works are all white-box methods, which require full knowledge of the neural model's parameters and structure. However, in practice, it is not always possible for attackers to craft adversaries in a white-box manner due to limited access to the model. This limitation was addressed by Papernot et al. [105], who introduced a black-box attack strategy: they trained a substitute model to approximate the decision boundaries of the target model using labels obtained by querying the target model, and then conducted a white-box attack on this substitute to generate adversarial examples. Specifically, they adopted FGSM and JSMA for generating adversarial examples for the substitute DNN.

GAN-like Attack. Another branch of black-box attacks leverages Generative Adversarial Network (GAN) models. Zhao et al. [156] first trained a generative model, a WGAN, on the training dataset X; the WGAN can generate data points that follow the same distribution as X. Then they separately trained an inverter to map a data sample x to z in the dense latent space


by minimizing the reconstruction error. Instead of perturbing x, they searched for adversaries z* in the neighbourhood of z in the latent space. They then mapped z* back to x* and checked whether x* changes the prediction. They introduced two search algorithms: iterative stochastic search and hybrid shrinking search. The former uses an expanding strategy that gradually expands the search space, while the latter uses a shrinking strategy that starts from a wide range and recursively tightens the upper bound of the search range.
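The following is a schematic sketch of the iterative stochastic search described above, under the assumption that a trained generator (latent code to data sample), an inverter and the victim classifier are available; all three are placeholders and the expanding-radius schedule is illustrative rather than the exact procedure of [156].

```python
import torch

def iterative_stochastic_search(generator, classifier, z, y_true,
                                num_samples=64, radius=0.1, step=0.1, max_radius=2.0):
    """Search the neighbourhood of z for a latent z* whose decoded sample flips the prediction.

    generator  : maps latent vectors to data samples (placeholder, e.g. a WGAN generator)
    classifier : victim model returning class logits (placeholder)
    z          : latent code of the clean input, obtained from a trained inverter
    y_true     : model prediction on the clean input
    """
    while radius <= max_radius:                          # gradually expand the search space
        directions = torch.randn(num_samples, z.shape[-1])
        directions = directions / directions.norm(dim=-1, keepdim=True)
        radii = radius * torch.rand(num_samples, 1)      # sample points within the current radius
        candidates = z.unsqueeze(0) + radii * directions
        preds = classifier(generator(candidates)).argmax(dim=-1)
        hits = (preds != y_true).nonzero(as_tuple=True)[0]
        if len(hits) > 0:                                # return the closest adversarial latent code
            return candidates[hits[radii[hits, 0].argmin()]]
        radius += step                                   # no adversary found: expand and retry
    return None
```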

3.2 Attacking Image DNNs vs Attacking Textual DNNs
To attack a textual DNN model, we cannot directly apply the approaches of image DNN attackers, as there are three main differences between the two settings:

• Discrete vs Continuous Inputs. Image inputs are continuous, and the methods typically use an Lp norm to measure the distance between the clean data point and the perturbed data point. However, textual data is symbolic, thus discrete, and it is hard to define perturbations on texts. Carefully designed variants or distance measurements for textual perturbations are required. Another choice is to first map the textual data to continuous data and then adopt the attack methods from computer vision. We give further discussion in Section 3.3.

• Perceivable vs Unperceivable. Small changes to image pixels usually cannot be easily perceived by human beings, hence the adversarial examples do not change human judgment but only fool the DNN models. But small changes to texts, e.g., character or word changes, are easily perceived, rendering the attack likely to fail. For example, the changes could be identified or corrected by spelling and grammar checks before being input into textual DNN models. Therefore, it is nontrivial to find unperceivable textual adversaries.

• Semantic vs Semantic-less. In the case of images, small changes usually do not change the semantics of the image, as they are trivial and unperceivable. However, perturbations on texts can easily change the semantics of a word or a sentence, and thus can be easily detected and can heavily affect the model output. For example, deleting a negation word changes the sentiment of a sentence, whereas in computer vision perturbing individual pixels does not turn an image of a cat into another animal. Changing the semantics of the input is against the goal of adversarial attacks, which is to keep the correct prediction unchanged while fooling the victim DNN.

Due to these differences, current state-of-the-art textual DNN attackers either carefully adjustthe methods from image DNN attackers by enforcing additional constraints, or propose novelmethods using different techniques.

3.3 Vectorizing Textual Inputs and Perturbation Measurements
Vectorizing Textual Input. DNN models require vectors as input. For image tasks, the normal way is to use the pixel values to form the vectors/matrices used as DNN input. For textual models, special operations are needed to transform the text into vectors. There are three main branches of methods: word-count based encoding, one-hot encoding and dense encoding (or feature embedding); the latter two are mostly used in DNN models for textual applications. A short code sketch after the following list illustrates these encodings.

• Word-Count Based Encoding. The bag-of-words (BOW) method has the longest history in vectorizing text. In the BOW model, a zero vector with the length of the vocabulary size is initialized, and each dimension of the vector is then replaced by the count of the corresponding word's appearances in the given sentence. Another word-count based encoding utilizes the term frequency-inverse document frequency (TF-IDF) of a word (term), where each dimension of the vector is the TF-IDF value of the word.


• One-hot Encoding. In one-hot encoding, a feature vector represents a token; a token could be a character (character-level model) or a word (word-level model). For character-level one-hot encoding, the representation can be formulated as [31]:

x = [(x_11, ..., x_1n); ...; (x_m1, ..., x_mn)]     (8)

where x is a text, x_ij ∈ {0, 1}^|A| and A is the alphabet (in some works, A also includes symbols). In Equation (8), m is the number of words and n is the maximum number of characters per word in the sequence x. Thus each word has a fixed-length vector representation, whose length is decided by the maximum number of characters of the words. For word-level one-hot encoding, following the above notation, the text x can be represented as:

x = [x_1, ..., x_m, x_{m+1}, ..., x_k]     (9)

where x_i ∈ {0, 1}^|V| and V is the vocabulary, which contains all words in a corpus. k is the maximum number of words allowed for a text, so that x_{m+1}, ..., x_k are zero-paddings if m < k. One-hot encoding produces vectors with only 0 and 1 values, where 1 indicates that the corresponding character/word appears in the sentence/paragraph and 0 indicates that it does not. Thus one-hot encoding usually generates sparse vectors/matrices. DNNs have proven to be very successful in learning from these sparse representations, as they can learn more dense distributed representations from the one-hot vectors during the training procedure.

• Dense Encoding. Compared to one-hot encoding, dense encoding generates low-dimensional and distributed representations for textual data. Word2Vec uses continuous bag-of-words (CBOW) and skip-gram models to generate dense representations for words, i.e., word embeddings. It is based on the distributional assumption that words appearing within similar contexts possess similar meanings. Word embeddings, to some extent, alleviate the discreteness and data-sparsity problems in vectorizing textual data [37]. Extensions of word embeddings such as doc2vec and paragraph2vec [70] encode sentences/paragraphs into dense vectors.
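The sketch below illustrates the three encodings on a toy vocabulary: a bag-of-words count vector, a word-level one-hot matrix with zero-padding, and a dense embedding lookup. The tiny vocabulary, padding length and embedding dimension are illustrative assumptions.

```python
import numpy as np

vocab = ["the", "movie", "was", "great", "boring"]       # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}
tokens = ["the", "movie", "was", "great"]

# 1. Word-count (bag-of-words) encoding: one count per vocabulary entry.
bow = np.zeros(len(vocab))
for t in tokens:
    bow[word2id[t]] += 1

# 2. Word-level one-hot encoding: one |V|-dimensional indicator vector per token,
#    zero-padded to a fixed maximum length k (Eq. (9)).
k = 6
one_hot = np.zeros((k, len(vocab)))
for i, t in enumerate(tokens):
    one_hot[i, word2id[t]] = 1

# 3. Dense encoding: each token is mapped to a low-dimensional embedding vector
#    (random here; in practice Word2Vec or embeddings learned with the model).
embed_dim = 4
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), embed_dim))
dense = embedding_table[[word2id[t] for t in tokens]]    # shape (len(tokens), embed_dim)

print(bow, one_hot.shape, dense.shape)
```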

Perturbation Measurement. As described in Section 2.1.3, a way is needed to measure the size of the perturbation, so that it can be controlled to ensure the ability to fool the victim DNN while remaining unperceivable. However, measuring textual perturbations is drastically different from measuring perturbations on images. Usually, the size of the perturbation is measured by the distance between the clean data x and its adversarial example x′. But for texts, the distance measurement also needs to consider grammatical correctness, syntactic correctness and semantic preservation. We list here the measurements used in the works reviewed in this survey; a short code sketch after the list illustrates several of them.

• Norm-based measurement. Directly adopting norms such as Lp, p ∈ {0, 1, 2, ∞}, requires the input data to be continuous. One solution is to use a continuous and dense representation (e.g., an embedding) to represent the texts. But this usually results in invalid and incomprehensible texts, so other constraints need to be involved.

• Grammar and syntax related measurement. Ensuring grammatical or syntactic correctness makes the adversarial examples not easily perceived.
– Grammar and syntax checkers are used in some works to ensure that the generated textual adversarial examples are valid.

– Perplexity is usually used to measure the quality of a language model. In one reviewed work [92], the authors used perplexity to ensure that the generated adversarial examples (sentences) are valid.


– Paraphrasing is controlled and can be regarded as a type of adversarial example (Section 4.3.3). When perturbing, the validity of the paraphrases is ensured in the generation process.

• Semantic-preserving measurement. Measuring semantic similarity/distance is often performed on word vectors by adopting vector similarity/distance measurements. Given two n-dimensional word vectors p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n):
– Euclidean Distance is the distance between two vectors in Euclidean space:

d(p, q) = √((p_1 − q_1)² + (p_2 − q_2)² + ... + (p_n − q_n)²)     (10)

– Cosine Similarity computes the cosine of the angle between the two vectors:

cos(p, q) = (Σ_{i=1}^{n} p_i × q_i) / (√(Σ_{i=1}^{n} p_i²) × √(Σ_{i=1}^{n} q_i²))     (11)

• Edit-based measurement. Edit distance is a way of quantifying the minimum changes needed to turn one string into the other. Different definitions of edit distance use different sets of string operations [74].
– Levenshtein Distance uses insertion, removal and substitution operations.
– Word Mover's Distance (WMD) [69] is an edit distance operated on word embeddings. It measures the minimum amount of distance that the embedded words of one document need to travel to reach the embedded words of the other document [39]. The minimization is formulated as:

min Σ_{i,j=1}^{n} T_ij ||e_i − e_j||_2     (12)
s.t.  Σ_{j=1}^{n} T_ij = d_i,  ∀i ∈ {1, ..., n},
      Σ_{i=1}^{n} T_ij = d′_j, ∀j ∈ {1, ..., n}

where e_i and e_j are the word embeddings of word i and word j respectively, and n is the number of words. T ∈ R^{n×n} is a flow matrix, where T_ij ≥ 0 denotes how much of word i in d travels to word j in d′; d and d′ are the normalized bag-of-words vectors of the source document and target document respectively.

– Number of changes is a simple way to measure the edits and it is adopted in some reviewedliterature.

• Jaccard similarity coefficient is used for measuring similarity of finite sample sets utilisingintersection and union of the sets.

J(A, B) = |A ∩ B| / |A ∪ B|     (13)

In texts, A and B are two documents (or sentences). |A ∩ B| denotes the number of words that appear in both documents, and |A ∪ B| refers to the total number of unique words.
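The short sketch below computes several of the measures above on toy inputs: Euclidean distance and cosine similarity on word vectors, Levenshtein distance on strings, and the Jaccard coefficient on word sets. WMD is omitted because it additionally requires an embedding model and an optimal-transport solver.

```python
import numpy as np

def euclidean(p, q):                      # Eq. (10)
    return np.sqrt(np.sum((p - q) ** 2))

def cosine(p, q):                         # Eq. (11)
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def levenshtein(a, b):                    # insertion, removal, substitution
    d = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return int(d[-1])

def jaccard(doc_a, doc_b):                # Eq. (13), on word sets
    A, B = set(doc_a.split()), set(doc_b.split())
    return len(A & B) / len(A | B)

p, q = np.array([1.0, 0.0, 2.0]), np.array([0.5, 0.1, 1.5])
print(euclidean(p, q), cosine(p, q))
print(levenshtein("adversarial", "adversary"))
print(jaccard("the movie was great", "the movie was boring"))
```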

4 ATTACKING NEURAL MODELS IN NLP: STATE-OF-THE-ART
In this section, we first introduce the categories of attack methods on textual deep learning models and then highlight the state-of-the-art research works, aiming to identify the most promising advances in recent years.

4.1 Categories of Attack Methods on Textual Deep Learning Models
We categorize existing adversarial attack methods based on different criteria. Figure 1 generalizes the categories.

In this article, five strategies are used to categorize the attack methods: i) The by-model-access group refers to the knowledge of the attacked model available when the attack is performed; in the following sections, we focus the discussion on this categorization strategy.


[Fig. 1. Categories of Adversarial Attack Methods on Textual Deep Learning Models. The figure groups the attacks by model access (black-box attack, white-box attack); by target (targeted, untargeted); by granularity (character-level, word-level, sentence-level); by attacked DNNs (feed forward networks, CNNs, RNNs, auto-encoders, reinforcement learning models, generative adversarial networks, hybrid models); and by application, distinguishing single-modal tasks (classification, machine translation, machine comprehension, text summarisation, text entailment, POS tagging, relation extraction, dialogue systems) from cross-modal tasks (optical character recognition, scene text recognition, image captioning, visual question answering, visual-semantic embedding, speech recognition).]

ii) The by-application group organizes the methods by the different NLP applications; a more detailed discussion will be provided in Section 4.5. iii) The by-target group refers to whether the goal of the attack is enforcing an incorrect prediction or targeting a specific result. iv) The by-granularity group considers at what granularity the model is attacked. v) We have discussed the attacked DNNs in Section 2.2. In the following sections, we will continuously provide information about the different categories that the methods belong to.

One important group of methods that needs to be noted is the cross-modal attacks, in which the

attacked models consider tasks dealing with multi-modal data, e.g., image and text data. These are not attacks on purely textual DNNs, hence we discuss this category of methods separately in Section 4.4, in addition to white-box attacks in Section 4.2 and black-box attacks in Section 4.3.

4.2 White-Box Attack
A white-box attack requires access to the model's full information, including architecture, parameters, loss functions, activation functions, and input and output data. White-box attacks typically approximate the worst-case attack for a particular model and input, incorporating a set of perturbations. This adversary strategy is often very effective. In this section, we group white-box attacks on textual DNNs into seven categories.

4.2.1 FGSM-based. FGSM is one of the first attack methods on images (Section 3.1) and has gained many follow-up works in attacking textual DNNs. TextFool [78] uses the concept of FGSM to approximate the contribution of the text items that contribute most significantly to the text classification task. Instead of using the sign of the cost gradient as in FGSM, this work considers its magnitude. The authors proposed three attacks: insertion, modification and removal. Specifically, they computed the cost gradient ∇x J(f, x, c′) of each training sample x using back-propagation, where f is the model function, x is the original data sample, and c′ is the target text class. Then they identified the characters that contain the dimensions with the highest gradient magnitude and named them hot characters. Phrases that contain enough hot characters and occur most frequently are chosen as Hot Training Phrases (HTPs). In the insertion strategy, adversarial examples are crafted by inserting a few HTPs of the target class c′ near the phrases with significant contribution to the original class c. The authors further leveraged external sources like Wikipedia and forged facts to select valid and believable sentences. In the modification strategy,


the authors identified the Hot Sample Phrase (HSP) of the current classification in a way similar to identifying HTPs. Then they replaced characters in the HTPs with common misspellings or visually similar characters. In the removal strategy, inessential adjectives or adverbs in the HSPs are removed. The three strategies and their combinations are evaluated on a CNN text classifier [155]. However, these methods are performed manually, as mentioned by the authors.

The work in [117] adopted the same idea as TextFool, but it provides a

removal-addition-replacement strategy that first tries to remove the adverb (w_i) which contributed the most to the text classification result (measured using the loss gradient). If the output sentence in this step has incorrect grammar, the method inserts a word p_j before w_i; p_j is selected from a candidate pool, in which synonyms, typos and genre-specific keywords (identified via term frequency) are the candidate words. If the output cannot achieve the highest cost gradient for any p_j, the method instead replaces w_i with p_j. The authors showed that their method is more effective than TextFool. As the method orders the words by their contribution ranking and crafts adversarial samples according to that order, it is a greedy method that always applies the minimum manipulation needed to change the output. To avoid detection by human eyes, the authors constrained the replaced/added words so as not to affect the grammar and POS of the original words. A schematic sketch of this kind of greedy, gradient-ranked manipulation is given below.
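The following sketch illustrates greedy, gradient-ranked word manipulation of this kind, assuming a white-box PyTorch classifier that operates on word embeddings; the candidate pool and all helper names are hypothetical, and the grammar/POS constraints of [117] are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def greedy_word_attack(model, embed, token_ids, y_true, candidate_pool, max_changes=5):
    """Greedily replace the most gradient-salient words until the prediction flips.

    model          : classifier taking a batch of word-embedding sequences (placeholder)
    embed          : nn.Embedding mapping token ids to vectors (placeholder)
    token_ids      : 1-D tensor of word ids for the input sentence
    y_true         : original predicted label (int)
    candidate_pool : dict {token_id: [replacement token ids]} of synonyms, typos and
                     genre-specific keywords (hypothetical helper data)
    """
    token_ids = token_ids.clone()
    for _ in range(max_changes):
        emb = embed(token_ids).detach().requires_grad_(True)
        loss = F.cross_entropy(model(emb.unsqueeze(0)), torch.tensor([y_true]))
        loss.backward()
        saliency = emb.grad.norm(dim=-1)          # rank words by loss-gradient magnitude
        for i in saliency.argsort(descending=True).tolist():
            for cand in candidate_pool.get(int(token_ids[i]), []):
                trial = token_ids.clone()
                trial[i] = cand
                pred = model(embed(trial).unsqueeze(0)).argmax(dim=-1).item()
                if pred != y_true:
                    return trial                  # minimal manipulation that changes the output
        # No single substitution flips the output: commit the top-ranked change and repeat.
        top = int(saliency.argmax())
        if candidate_pool.get(int(token_ids[top])):
            token_ids[top] = candidate_pool[int(token_ids[top])][0]
    return None
```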

In malware detection, a portable executable (PE) is represented by a binary indicator vector {x1, ..., xm}, xi ∈ {0, 1}, where 1 and 0 indicate whether the i-th feature is present and m is the number of features. Using these vectors as features, malware detection DNNs can identify malicious software. This is not a typical textual application, but it also targets discrete data and shares similar methods with textual applications. The authors of [3] investigated methods to generate binary-encoded adversarial examples. To preserve the functionality of the adversarial examples, they incorporated four bounding methods to craft the perturbations. The first two adopt FGSMk [68], the multi-step variant of FGSM, restricting the perturbations to the binary domain by introducing deterministic rounding (dFGSMk) and randomized rounding (rFGSMk); these two bounding methods are similar to L∞-ball constraints on images [42]. The third method, multi-step Bit Gradient Ascent (BGAk), sets the bit of the j-th feature if the corresponding partial derivative of the loss is greater than or equal to the loss gradient's L2-norm divided by √m. The fourth method, multi-step Bit Coordinate Ascent (BCAk), updates one bit in each step by choosing the feature with the maximum corresponding partial derivative of the loss. The last two methods actually visit multiple feasible vertices. The work also proposed an adversarial learning framework that aims to robustify the malware detection model.
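A minimal sketch of the bit-update rule described above, using a toy differentiable surrogate loss so that the gradient is available in closed form; the surrogate detector and the feature dimensionality are assumptions for illustration, not the model attacked in [3].

```python
import numpy as np

rng = np.random.default_rng(1)

m = 16                                   # number of binary features
w = rng.normal(size=m)                   # toy linear "detector" weights
x = (rng.random(m) < 0.3).astype(float)  # original binary feature vector

def loss_grad(x):
    # Toy surrogate: logistic loss of a linear detector for the
    # "malicious" label, so d loss / d x is available in closed form.
    s = 1.0 / (1.0 + np.exp(-w @ x))
    return (s - 1.0) * w

def bga_step(x):
    """One multi-step Bit Gradient Ascent update as described above:
    set the j-th bit whenever its partial derivative is at least
    ||grad||_2 / sqrt(m), staying in the binary domain."""
    g = loss_grad(x)
    threshold = np.linalg.norm(g) / np.sqrt(m)
    x_new = x.copy()
    x_new[g >= threshold] = 1.0
    return x_new

x_adv = x.copy()
for _ in range(5):                       # k ascent steps
    x_adv = bga_step(x_adv)
print("bits switched on:", int(np.sum(x_adv != x)))
```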

[114] also attacks malware detection DNNs. The authors made perturbations on the embedding representation of binary sequences and then reconstructed the perturbed examples back into binary form. In particular, they appended a uniformly random sequence of bytes (a payload) to the original binary sequence, embedded the new binary, and performed FGSM only on the embedding of the payload. The perturbation is applied iteratively until the detector outputs an incorrect prediction. Since the perturbation is performed only on the payload, rather than on the original input, the method preserves the functionality of the malware. Finally, the adversarial embedding is reconstructed into a valid binary file by mapping it to its closest neighbour in the valid embedding space.

Many works directly adopt FGSM for adversarial training, i.e., they use it as a regularizer when training the model. We discuss some representatives in Section 5.

4.2.2 JSMA-based. JSMA is another pioneering attack on neural models for image applications (Section 3.1). The work [104] used the forward derivative, as in JSMA, to find the most contributive sequence towards the adversarial direction, computing the network's Jacobian by leveraging computational graph unfolding [97]. They crafted adversarial sequences for two types of RNN models, whose outputs are categorical and sequential data respectively. For a categorical RNN, adversarial examples are generated by considering the Jacobian column Jacb_F[:, j] corresponding to one of the output components j. Specifically, for each word i, they identified the direction of the perturbation by:

sign(Jacb_F(x′)[i, g(x′)])    (14)
g(x′) = argmax_{0,1}(p_j)    (15)

where p_j is the output probability of the target class. As in JSMA, they use the logit instead of the probability in this equation. They further projected the perturbed examples onto the closest vectors in the embedding space to obtain valid embeddings. For a sequential RNN, after computing the Jacobian matrix, they altered the subset of input steps {i} with high Jacobian values Jacb_F[i, j] and low Jacobian values Jacb_F[i, k] for k ≠ j, to achieve modifications on a subset of output steps {j}.

[45] (and [46]) is the first work to attack neural malware detectors. The authors first performed feature engineering and obtained more than 545K static features for software applications, representing each application by a binary indicator feature vector. They then crafted adversarial examples on the input feature vectors by adopting JSMA: they computed the gradient of the model's Jacobian to estimate the perturbation direction and chose a perturbation η for a given input sample with the maximal positive gradient towards the target class. In particular, the perturbations are chosen via the index i satisfying:

i = argmax_{j ∈ [1,m], X_j = y′} f′_y(X_j)    (16)

where y′ is the target class and m is the number of features. On the binary feature vectors, the perturbations are (0 → 1) or (1 → 0). This method preserves the functionality of the applications. To ensure that the perturbations do not change the application much, which keeps the malware's functionality intact, the authors used the L1 norm to bound the overall number of modified features, and further bounded the number of modified features to 20. In addition, the authors evaluated three defenses against the attacks, namely feature reduction, distillation and adversarial training, and found adversarial training to be the most effective.
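A sketch of the JSMA-style selection step: pick the feature whose gradient most increases the target-class score and flip it, up to a budget of 20 changes. The linear "detector" and the restriction to 0 → 1 flips are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

m = 40                                    # number of binary features
W = rng.normal(size=(m, 2))               # toy linear model, 2 classes
x = (rng.random(m) < 0.2).astype(float)   # original feature vector
target = 0                                # target (benign) class
budget = 20                               # max number of modified features

def target_grad(x):
    # Gradient of the target-class probability w.r.t. the input for a
    # linear-softmax model: dp_t/dx = p_t * (W[:, t] - W @ p).
    logits = x @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[target] * (W[:, target] - W @ p)

x_adv = x.copy()
for _ in range(budget):
    g = target_grad(x_adv)
    g[x_adv == 1] = -np.inf               # only allow 0 -> 1 flips here
    i = int(np.argmax(g))
    if g[i] <= 0:                         # no positive gradient left
        break
    x_adv[i] = 1.0
    if np.argmax(x_adv @ W) == target:    # stop once the label flips
        break

print("features modified:", int(np.sum(x_adv != x)))
```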

4.2.3 C&W-based. The work in [130] adopted the C&W method (Section 3.1) to attack predictive models of medical records. The aim is to detect susceptible events and measurements in each patient's medical records, which provides guidance for clinical usage. The authors used a standard LSTM as the predictive model. Given patient EHR data represented by a matrix X^i ∈ R^{d×t_i} (d is the number of medical features and t_i is the time index of the medical check), the generation of the adversarial example is formulated as:

min_{x′} max{−ϵ, [logit(x′)]_y − [logit(x′)]_{y′}} + λ ||x′ − x||_1    (17)

where logit(·) denotes the logit-layer output, λ is the regularization parameter that controls the L1-norm regularization, y′ is the target label and y is the original label. After generating adversarial examples, the authors picked the optimal example according to their proposed evaluation scheme, which considers both the perturbation magnitude and the structure of the attack. Finally, they used the adversarial example to compute the susceptibility score for the EHR as well as cumulative susceptibility scores for different measurements.
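A minimal sketch of the hinge-on-logits objective in Eq. (17), optimized with plain gradient descent on a toy linear model; the model, step size and λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

d = 12                               # feature dimension
W = rng.normal(size=(d, 3))          # toy logit layer (3 classes)
x = rng.normal(size=d)               # original sample
y, y_target = 0, 2                   # original and target labels
lam, eps, lr = 0.05, 0.1, 0.1

x_adv = x.copy()
for _ in range(200):
    logits = x_adv @ W
    margin = logits[y] - logits[y_target]
    grad = np.zeros(d)
    # Hinge term of Eq. (17): keep pushing while the original-class logit
    # still exceeds the target-class logit by more than -eps.
    if margin > -eps:
        grad += W[:, y] - W[:, y_target]
    # Subgradient of the L1 distance penalty.
    grad += lam * np.sign(x_adv - x)
    x_adv -= lr * grad

print("target reached:", bool(np.argmax(x_adv @ W) == y_target))
print("L1 distortion:", float(np.abs(x_adv - x).sum()))
```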

Seq2Sick [24] attacked seq2seq models with two targeted attacks: a non-overlapping attack and a keyword attack. For the non-overlapping attack, the authors aim to generate adversarial sequences whose outputs are entirely different from the original outputs. They proposed a hinge-like loss function that optimizes on the logit layer of the neural network:

L_{non-overlapping} = Σ_t max{−ϵ, z_t^{(s_t)} − max_{y≠s_t} z_t^{(y)}}    (18)

where {s_t} is the original output sequence and {z_t} denotes the logit-layer outputs of the adversarial example. For the keyword attack, targeted keywords are expected to appear in the output sequence. The authors again optimize on the logit layer and try to ensure that the targeted keyword's logit is the largest among all words. Furthermore, they define a mask function m to solve the keyword-collision problem, giving the loss:

L_{keywords} = Σ_{i=1}^{|K|} min_{t∈[M]} { m_t ( max{−ϵ, max_{y≠k_i} z_t^{(y)} − z_t^{(k_i)}} ) }    (19)

where k_i denotes the i-th targeted keyword in the output vocabulary. To ensure that the generated word embeddings are valid, the work also applies two regularization methods: group lasso regularization to enforce group sparsity, and group gradient regularization to keep the adversaries in the permissible region of the embedding space.
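A sketch of the keyword hinge loss in Eq. (19) computed from a given logit matrix; the logits are random stand-ins for a decoder's outputs, and the mask is simplified to all-ones.

```python
import numpy as np

rng = np.random.default_rng(4)

M, V = 6, 20                       # output length, vocabulary size
z = rng.normal(size=(M, V))        # logit-layer outputs, one row per step
keywords = [3, 11]                 # targeted keyword ids
eps = 0.5

def keyword_loss(z, keywords, eps, mask=None):
    """Hinge-like keyword loss of Eq. (19): for each keyword, take the
    output position where it is closest to being the arg-max logit."""
    mask = np.ones(len(z)) if mask is None else mask
    total = 0.0
    for k in keywords:
        per_step = []
        for t in range(len(z)):
            best_other = np.max(np.delete(z[t], k))   # max_{y != k} z_t^(y)
            per_step.append(mask[t] * max(-eps, best_other - z[t, k]))
        total += min(per_step)
    return total

print("keyword loss:", round(float(keyword_loss(z, keywords, eps)), 4))
```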

4.2.4 Direction-based. HotFlip [31] performs atomic flip operations to generate adversarial examples. Instead of leveraging the gradient of the loss directly, HotFlip uses directional derivatives. It represents character-level operations, i.e., swap, insert and delete, as vectors in the input space and estimates the change in loss via directional derivatives with respect to these vectors. Given a one-hot representation of the input, a flip of the j-th character of the i-th word (a→b) can be represented by the vector:

v⃗_{ijb} = (0, ..; (0, .., (0, .., −1, 0, .., 1, 0)_j, ..0)_i; 0, ..)    (20)

where −1 and 1 are in the positions corresponding to the a-th and b-th characters of the alphabet, respectively. The best character swap can then be found by maximizing a first-order approximation of the loss change via the directional derivative along the operation vector:

max_{ijb} ∇_x J(x, y)^T · v⃗_{ijb} = max_{ijb} ( ∂J/∂x_{ij}^{(b)} − ∂J/∂x_{ij}^{(a)} )    (21)

where J(x, y) is the model's loss function with input x and true output y. Similarly, an insertion at the j-th position of the i-th word can be treated as a character flip followed by further flips as the characters to its right are shifted until the end of the word; a deletion is likewise a sequence of character flips as characters are shifted to the left. Using beam search, HotFlip efficiently finds the best set of flips.

The work [30] extended HotFlip with targeted attacks. Besides the swap, insertion and deletion operations provided by HotFlip, the authors proposed a controlled attack, which removes a specific word from the output, and a targeted attack, which replaces a specific word with a chosen one. To achieve these attacks, they maximize the loss function J(x, y_t) and minimize J(x, y_{t′}), where t is the target word for the controlled attack and t′ is the word chosen to replace t. They further proposed three attack modes that allow multiple modifications: in the one-hot attack, they manipulate every word in the text with its best single operation; in the greedy attack, they make an additional forward and backward pass after picking the best operation over the whole text; and in the beam-search attack, they replace the greedy search with beam search. In all the attacks proposed in this work, the authors set a threshold on the maximum number of changes, e.g., at most 20% of the characters may be changed.
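To make the flip-scoring step of Eq. (21) concrete: it reduces to comparing gradient entries of the one-hot input tensor. The sketch below scores every (position, new-character) flip for a toy input; the alphabet size and the random "gradient" are assumptions standing in for a real back-propagated gradient.

```python
import numpy as np

rng = np.random.default_rng(5)

seq_len, alphabet = 8, 26
chars = rng.integers(0, alphabet, size=seq_len)   # current characters (a..z)
x = np.zeros((seq_len, alphabet))                 # one-hot character input
x[np.arange(seq_len), chars] = 1.0

# Gradient of the loss w.r.t. the one-hot input, as produced by
# back-propagation through a real model (random stand-in here).
grad = rng.normal(size=(seq_len, alphabet))

# Eq. (21): flipping position i from character a to b changes the loss by
# roughly dJ/dx_i^(b) - dJ/dx_i^(a).
current = grad[np.arange(seq_len), chars][:, None]   # dJ/dx_i^(a)
scores = grad - current                              # score of every flip
scores[np.arange(seq_len), chars] = -np.inf          # exclude no-op "flips"

i, b = np.unravel_index(int(np.argmax(scores)), scores.shape)
i, b, a = int(i), int(b), int(chars[int(i)])
print(f"best flip: position {i}: {chr(97 + a)} -> {chr(97 + b)} "
      f"(estimated loss increase {scores[i, b]:.4f})")
```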


Strategy | Work | Granularity | Target | Attacked Models | Perturb Ctrl. | App.
FGSM-based | [78] | character, word | Y | CNN [155] | L∞ | TC
FGSM-based | [117] | word | N | CNN [155] | L∞, grammar and POS correctness | TC
FGSM-based | [3] | PE | binary | CNN in [27] | Boundaries employing L∞ and L2 | MAD
FGSM-based | [114] | PE embedding | binary | MalConv [110] | L∞ | MAD
JSMA-based | [104] | word embedding | binary | LSTM | – | TC
JSMA-based | [45, 46] | application features | binary | Feed-forward NN | L1 | MAD
C&W-based | [130] | medical features | Y | LSTM | L1 | MSP
C&W-based | [24] | word embedding | Y | OpenNMT-py [64] | L2 + gradient regularization | TS, MT
Direction-based | [31] | character | N | CharCNN-LSTM [61] | – | TC
Direction-based | [30] | character | Y | CharCNN-LSTM [26] | Number of changes | MT
Attention-based | [16] | word, sentence | N | [29, 82, 140], CNN, LSTM and ensembles | Number of changes | MRC, QA
Reprogramming | [98] | word | N | CNN, LSTM, Bi-LSTM | – | TC
Hybrid | [39] | word embedding | N | CNN | WMD | TC, SA

Table 1. Summary of reviewed white-box attack methods. PE: portable executable; TC: text classification; SA: sentiment analysis; TS: text summarisation; MT: machine translation; MAD: malware detection; MSP: medical status prediction; MRC: machine reading comprehension; QA: question answering; WMD: Word Mover's Distance; –: not available.

4.2.5 Attention-based. [16] proposed two white-box attacks for the purpose of comparing the robustness of CNNs versus RNNs. They leveraged the model's internal attention distribution to find the pivotal sentence, i.e., the one the model assigns the largest weight when deriving the correct answer. They then exchanged the words that received the most attention with randomly chosen words from a known vocabulary. They also performed another white-box attack by removing the whole sentence that receives the highest attention. Although they focused on attention-based models, their attacks do not examine the attention mechanism itself, but solely leverage the outputs of the attention component (i.e., the attention scores).

4.2.6 Reprogramming. [98] adopts adversarial reprogramming (AP) to attack sequence neural classifiers. AP [32] is a recently proposed adversarial attack in which an adversarial reprogramming function gθ is trained to re-purpose the attacked DNN for an alternate task (e.g., question classification to name classification) without modifying the DNN's parameters. AP adopts ideas from transfer learning, but keeps the parameters unchanged. The authors of [98] proposed both white-box and black-box attacks. In the white-box setting, Gumbel-Softmax is applied to train gθ so that it can work on discrete data; we discuss the black-box method later. They evaluated their methods on various text classification tasks and confirmed their effectiveness.
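Gumbel-Softmax makes the sampling of discrete tokens differentiable, which is what allows gθ to be trained end-to-end here. A minimal sketch of the relaxation (temperature and vocabulary size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

def gumbel_softmax(logits, tau=0.5):
    """Differentiable relaxation of sampling from a categorical
    distribution: softmax((logits + Gumbel noise) / tau)."""
    gumbel = -np.log(-np.log(rng.random(logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

vocab_size = 8
logits = rng.normal(size=vocab_size)     # parameters of g_theta's output

soft_sample = gumbel_softmax(logits)     # near one-hot, but differentiable
hard_token = int(np.argmax(soft_sample)) # token fed to the attacked model
print("soft sample:", np.round(soft_sample, 3), "-> token", hard_token)
```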

4.2.7 Hybrid. The authors of [39] perturbed the input text at the word-embedding level to attack a CNN model. This is a general approach that makes most attack methods developed for computer-vision DNNs applicable; the authors specifically applied FGSM and DeepFool. Directly applying these methods would generate meaningless adversarial examples, so the authors rounded the adversarial examples to the nearest meaningful word vectors, using Word Mover's Distance (WMD) as the distance measure. Evaluations on sentiment analysis and text classification datasets show that WMD is a suitable metric for controlling the perturbations.
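The rounding step can be sketched as a nearest-neighbour search in embedding space; plain Euclidean distance stands in for WMD here, an assumption made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(7)

vocab = ["good", "great", "bad", "terrible", "movie", "plot"]
E = rng.normal(size=(len(vocab), 16))          # word embedding matrix

def round_to_words(perturbed, E, vocab):
    """Map each perturbed embedding back to the closest valid word,
    so the adversarial example remains readable text."""
    dists = np.linalg.norm(E[None, :, :] - perturbed[:, None, :], axis=-1)
    return [vocab[i] for i in dists.argmin(axis=1)]

# Pretend FGSM/DeepFool produced these perturbed embeddings for 3 tokens.
perturbed = E[[0, 4, 5]] + 0.1 * rng.normal(size=(3, 16))
print(round_to_words(perturbed, E, vocab))
```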

Summary of White-box Attacks. We summarize the reviewed white-box attack works in Table 1. For each work we highlight the granularity, i.e., the level at which the attack is performed; the target, i.e., whether the method is targeted or untargeted; the attacked model; the perturbation control, i.e., the method used to control the size of the perturbation; and the application. It is worth noting that in binary classification, targeted and untargeted methods have the same effect, so we mark their target as "binary" in the table.

4.3 Black-box Attack

A black-box attack does not require the details of the neural network, but can access its inputs and outputs. This type of attack often relies on heuristics to generate adversarial examples, and it is more practical because in many real-world applications the details of the DNN are a black box to the attacker. In this article, we group black-box attacks on textual DNNs into six categories.

4.3.1 Concatenation Adversaries. [55] is the first work to attack reading comprehension systems. The authors proposed concatenation adversaries, i.e., appending distracting but meaningless sentences at the end of a paragraph. These distracting sentences do not change the semantics of the paragraph or the answer to the question, but fool the neural model. The distracting sentences are either carefully generated informative sentences or arbitrary sequences of words drawn from a pool of 20 random common words. Both kinds of perturbation are obtained by iteratively querying the neural network until the output changes. Figure 2 illustrates an example from [55]: after adding a distracting sentence (in blue), the answer changes from the correct one (green) to an incorrect one (red). The authors of [142] improved this work by varying the locations where the distracting sentences are placed and expanding the set of fake answers used to generate them, producing new adversarial examples that can help train more robust neural models. The work [16] also utilised distracting sentences to evaluate the robustness of their reading comprehension model. Specifically, they used a pool of ten random common words in conjunction with all question words and the words from all incorrect answer candidates to generate the distracting sentences. In that work a simple word-level black-box attack is also performed by replacing the most frequent words with their synonyms; as mentioned earlier, the authors also provided two white-box strategies. Figure 3 illustrates the general workflow of concatenation attacks: the correct output (i.e., the answer in MRC tasks) is used to generate a distorted output, which in turn is used to build the distracting content. This content is appended to the original paragraph as adversarial input; it does not distract humans or an ideal DNN, but causes vulnerable DNNs to produce incorrect output.

Fig. 2. Concatenation adversarial attack on a reading comprehension DNN. After adding distracting sentences (in blue), the answer changes from the correct one (green) to an incorrect one (red) [55].

Fig. 3. General principle of concatenation adversaries. The correct output is used to generate a distorted output, which is then used to build distracting content; appending this content to the original paragraph causes the attacked DNN to produce incorrect output.
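Because the attack is black-box, it reduces to a query loop: propose a distracting sentence, append it, and keep it if the model's answer changes. A minimal sketch with a stubbed predict function; the stub model, candidate pool and stopping rule are assumptions, not the setup of [55].

```python
import random

random.seed(0)

COMMON_WORDS = ["the", "of", "and", "to", "a", "in", "is", "was",
                "on", "for", "with", "as", "by", "at", "from", "it",
                "an", "be", "this", "that"]

def predict(paragraph, question):
    # Stub for the attacked reading-comprehension model: here it simply
    # "answers" with the last token of the paragraph.
    return paragraph.split()[-1]

def concat_attack(paragraph, question, max_queries=100):
    original = predict(paragraph, question)
    for _ in range(max_queries):
        distractor = " ".join(random.choices(COMMON_WORDS, k=10))
        candidate = paragraph + " " + distractor + "."
        if predict(candidate, question) != original:
            return candidate                 # model's answer changed
    return None                              # attack failed within budget

adv = concat_attack("Tesla was born in 1856 .", "When was Tesla born?")
print("adversarial paragraph:", adv)
```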

4.3.2 Edit Adversaries. The work in [13] perturbed the input data of neural machine translation systems in two ways. Synthetic perturbations change character order, e.g., swap, middle random (randomly reorder all characters except the first and the last), fully random (randomly reorder all characters) and keyboard typos; the authors also collected typos and misspellings as adversaries. Natural perturbations leverage the typos that occur naturally in the datasets. Furthermore, [99] attacked neural models for dialogue generation. They applied various perturbations to the dialogue context: Random Swap (randomly transposing neighbouring tokens), Stopword Dropout (randomly removing stopwords), Paraphrasing (replacing words with their paraphrases) and Grammar Errors (e.g., changing a verb to the wrong tense) for the Should-Not-Change attacks; and the Add Negation strategy (negating the root verb of the source input) and the Antonym strategy (changing verbs, adjectives or adverbs to their antonyms) for the Should-Change attacks.

Fig. 4. Edit adversarial attack on a sentiment analysis DNN. After editing words (red), the prediction changes from 100% Negative to 89% Positive [74].

Fig. 5. General principle of edit adversaries. Perturbations are performed on sentences, words or characters by edit strategies such as replace, delete, add and swap.


DeepWordBug [35] is a simple method that uses character transformations to generate adversarial examples. The authors first identify the important 'tokens', i.e., the words or characters that affect the model prediction the most, using scoring functions computed from the DNN classifier's output. They then modify the identified tokens using four strategies: replace, delete, add and swap. The method was evaluated on a variety of NLP tasks, e.g., text classification, sentiment analysis and spam detection. [74] followed [35] and refined the scoring function; this work also provides a white-box attack adopting JSMA. One contribution of [74] is that the perturbations are restricted using four textual similarity measurements: edit distance of the text, Jaccard similarity coefficient, Euclidean distance on word vectors, and cosine similarity on word embeddings. Their method was evaluated only on a sentiment analysis task.

The authors in [92] proposed a method for automatically generating adversarial examples that violate a set of given First-Order Logic constraints in natural language inference (NLI). They proposed an inconsistency loss to measure the degree to which a set of sentences causes a model to violate a rule. Adversarial example generation is then the process of finding a mapping from the variables in the rules to sentences that maximizes the inconsistency loss while being composed of sentences with low perplexity (as defined by a language model). To generate low-perplexity adversarial sentences, they used three edit perturbations: i) change one word in one of the input sentences; ii) remove one parse subtree from one of the input sentences; iii) insert one parse subtree from a sentence in the corpus into the parse tree of another sentence.

The work in [5] uses a genetic algorithm (GA) to minimise the number of word replacements in the original text while still changing the prediction of the attacked model. They adopted crossover and mutation operations in the GA to generate perturbations and measured the effectiveness of each word replacement by its impact on the attacked DNN. Their attack focused on sentiment analysis and textual entailment DNNs.

In [21], the authors proposed a framework for adversarial attacks on the Differentiable Neural Computer (DNC). DNC is a computing machine that uses a DNN as its central controller operating on an external memory module for data processing. Their method uses two new automated and scalable strategies, based on metamorphic transformation, to generate grammatically correct adversarial attacks in the question answering domain. The first strategy, Pick-n-Plug, consists of a pick operator that draws adversarial sentences from a source task and a plug operator that injects these sentences into a story from a target task without changing its correct answers. The second strategy, Pick-Permute-Plug, extends Pick-n-Plug with an additional permute operator applied after picking sentences from the source task: words in a picked adversarial sentence can be permuted with their synonyms to generate a wider range of possible attacks.
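A sketch of the generic black-box edit loop used by methods in the spirit of DeepWordBug: score words by querying the model with each word removed, then apply a character-level edit to the highest-scoring word. The scoring function and classifier stub are illustrative assumptions, not the exact scoring of [35] or [74].

```python
import random

random.seed(1)

def classify(text):
    # Stub for the attacked black-box classifier: a "positive" score
    # driven by a few sentiment words.
    words = text.lower().split()
    return sum(w in {"great", "good", "love"} for w in words) - \
           sum(w in {"bad", "awful", "boring"} for w in words)

def word_scores(words):
    # Importance of each word = change in the model output when removed.
    base = classify(" ".join(words))
    return [abs(base - classify(" ".join(words[:i] + words[i + 1:])))
            for i in range(len(words))]

def char_edit(word):
    # One random character-level transformation: swap, delete, add, replace.
    i = random.randrange(len(word))
    op = random.choice(["swap", "delete", "add", "replace"])
    if op == "swap" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "add":
        return word[:i] + random.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    return word[:i] + random.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

words = "this movie is great and i love it".split()
scores = word_scores(words)
target = max(range(len(words)), key=lambda i: scores[i])
words[target] = char_edit(words[target])
print("adversarial text:", " ".join(words))
```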

4.3.3 Paraphrase-based Adversaries. SCPNs [54] produce a paraphrase of a given sentence with a desired syntax by feeding the sentence and a targeted syntactic form into an encoder-decoder architecture. Specifically, the method first encodes the original sentence, then feeds the paraphrases generated by back-translation and the targeted syntactic tree into the decoder, whose output is the targeted paraphrase of the original sentence. One major contribution lies in the selection and processing of the parse templates: the authors trained a parse generator separately from SCPNs and selected the 20 most frequent templates in PARANMT-50M. After generating paraphrases using the selected parse templates, they further pruned non-sensible sentences by checking n-gram overlap and paraphrastic similarity. A paraphrase is regarded as an adversarial example when the attacked classifier correctly predicts the label of the original sentence but fails on the paraphrase. SCPNs have been
evaluated on sentiment analysis and textual entailment DNNs and showed a significant impact on the attacked models. Although this method uses a target syntax to generate adversarial examples, it does not specify a targeted output; we therefore group it under untargeted attacks. Furthermore, the work in [127] used paraphrase generation techniques to create semantically equivalent adversaries (SEA). They generate paraphrases x′ of an input sentence x and obtain predictions from f until the original prediction changes; a paraphrase x′ is a semantically equivalent adversary if it is semantically equivalent to x and changes the prediction, as captured by the indicator in Eq. (22). Building on this, the work proposes semantically equivalent rules that generalize the generated adversaries, in order to understand and fix the most impactful bugs.

SEA(x, x′) = 1[SemEq(x, x′) ∧ f(x) ≠ f(x′)]    (22)

Fig. 6. General principle of paraphrase-based adversaries. Carefully designed (controlled) paraphrases are regarded as adversarial examples, which fool the DNN into producing incorrect output.

4.3.4 GAN-based Adversaries. Some works propose to leverage a Generative Adversarial Network (GAN) [41] to generate adversaries [156], the purpose being to make the adversarial examples more natural. In [156], the proposed model consists of two key components: a GAN, which generates fake data samples, and an inverter that maps an input x to its latent representation z′. The two components are trained on the original input by minimizing the reconstruction error between the original input and the adversarial examples. The perturbation is performed in the dense latent space by identifying a perturbed sample z in the neighbourhood of z′. Two search approaches, iterative stochastic search and hybrid shrinking search, are proposed to identify a proper z. However, the attacked model must be queried each time to find a z that makes the model give an incorrect prediction, so the method is quite time-consuming. The work is applicable to both image and textual data, as it intrinsically eliminates the problem raised by the discrete nature of textual data. The authors evaluated their method on three applications: textual entailment, machine translation and image classification.
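The latent-space search can be sketched as follows: sample candidate latents z in a widening neighbourhood of z′, decode them with the generator, and return the first decoded sample that changes the model's prediction. The generator, inverter and classifier below are stubs standing in for the trained components of [156].

```python
import numpy as np

rng = np.random.default_rng(9)

G_W = rng.normal(size=(8, 16))

def generator(z):
    # Stub for the trained GAN generator: latent vector -> data sample.
    return np.tanh(z @ G_W)

def classifier(x):
    # Stub for the attacked black-box model (binary decision).
    return int(x.sum() > 0)

def iterative_stochastic_search(z_prime, y_orig, n_per_round=32, rounds=20):
    """Sample perturbed latents in a growing ball around z', decode them,
    and return the first decoding that flips the prediction."""
    for r in range(1, rounds + 1):
        radius = 0.1 * r
        noise = rng.normal(size=(n_per_round, len(z_prime)))
        noise *= radius / np.linalg.norm(noise, axis=1, keepdims=True)
        for z in z_prime + noise:
            x_adv = generator(z)
            if classifier(x_adv) != y_orig:      # one query per candidate
                return x_adv
    return None

x = rng.normal(size=16)
z_prime = rng.normal(size=8)                     # inverter(x), stubbed here
adv = iterative_stochastic_search(z_prime, classifier(x))
print("found adversarial sample:", adv is not None)
```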

4.3.5 Substitution. The work in [53] proposes a black-box framework that attacks RNN models for malware detection. The framework consists of two models: a generative RNN and a substitute RNN. The generative RNN, based on the seq2seq model of [131], generates an adversarial API sequence from the malware's API sequence; in particular, it generates a small piece of API sequence and inserts it after the input sequence. The substitute RNN, a bi-directional RNN with an attention mechanism, mimics the behaviour of the attacked RNN, so generating adversarial examples queries the substitute rather than the original attacked RNN. The substitute RNN is trained on both malware and benign sequences, as well as on the Gumbel-Softmax outputs of the generative RNN. Gumbel-Softmax is used to enable joint training of the two RNNs, because the original output of the generative RNN is discrete; specifically, it allows gradients to flow from the substitute RNN back to the generative RNN. This method performs the attack on APIs, each represented as a one-hot vector: given M APIs, the vector for the i-th API is an M-dimensional binary vector whose i-th dimension is 1 and all other dimensions are 0.


Strategy | Work | Granularity | Target | Attacked Models | Perturb Ctrl. | App.
Concatenation | [55] | word | N | BiDAF, Match-LSTM | – | MRC
Concatenation | [142] | word, character | N | BiDAF+Self-Attn+ELMo [109] | – | MRC
Concatenation | [16] | word, sentence | N | [29, 82, 140], CNN, LSTM and ensembles | Number of changes | MRC, QA
Edit | [13] | character, word | N | Nematus [120], char2char [71], charCNN [61] | – | MT
Edit | [99] | word, phrase | N | VHRED [123]+attn, RL in [75], DynoNet [49] | – | DA
Edit | [35] | character, word | N | Word-level LSTM, character-level CNN | – | SA, TC
Edit | [74] | character, word | N | Word-level LSTM, character-level CNN | EdDist, JSC, EuDistV, CSE | SA
Edit | [92] | word, phrase | N | cBiLSTM, DAM, ESIM | Perplexity | NLI
Edit | [5] | word | N | LSTM | EuDistV | SA, TE
Edit | [21] | word, sentence | N | DNC | – | QA
Paraphrase-based | [54] | word | N | LSTM | Syntax-controlled paraphrase | SA, TE
Paraphrase-based | [127] | word | N | BiDAF, Visual7W [157], fastText [43] | Self-defined semantic equivalency | MRC, SA, VQA
GAN-based | [156] | word | N | LSTM, TreeLSTM, Google Translate (English-to-German) | GAN constraints | TE, MT
Substitution | [53] | API | N | LSTM, BiLSTM and variants | – | MD
Reprogramming | [98] | word | N | CNN, LSTM, Bi-LSTM | – | TC

Table 2. Summary of reviewed black-box attack methods. MRC: machine reading comprehension; QA: question answering; VQA: visual question answering; DA: dialogue generation; TC: text classification; MT: machine translation; SA: sentiment analysis; NLI: natural language inference; TE: textual entailment; MD: malware detection; EdDist: edit distance of text; JSC: Jaccard similarity coefficient; EuDistV: Euclidean distance on word vector; CSE: cosine similarity on word embedding; –: not available.


4.3.6 Reprogramming. As mentioned earlier, [98] provides both white-box and black-box attacks; we describe the black-box attack here. In the black-box setting, the authors formulated the sequence generation as a reinforcement learning problem, with the adversarial reprogramming function gθ as the policy network, and applied a REINFORCE-based optimisation to train gθ.

Summary of Black-box Attacks. We summarise the reviewed black-box attack works in Table 2, highlighting the same aspects as before: the granularity at which the attack is performed, whether the method is targeted or untargeted, the attacked model, the perturbation control, and the application.

4.4 Multi-modal Attacks

Some works attack DNNs that deal with cross-modal data, for example neural models that contain an internal component performing image-to-text or speech-to-text conversion. Although these attacks are not on purely textual data, we briefly introduce representative ones for the purpose of a comprehensive review.


4.4.1 Image-to-Text. Image-to-text models are a class of techniques that generate a textual description of an image based on its semantic content.

Optical Character Recognition (OCR). Recognizing characters from images is the problem of Optical Character Recognition (OCR), a multimodal learning task that takes an image as input and outputs the recognized text. The authors of [129] proposed a white-box attack on OCR and on follow-up NLP applications. They first used the original text to render a clean image. They then found words in the text that have antonyms in WordNet and satisfy an edit-distance threshold; only antonyms that are valid and introduce semantic inconsistencies are kept. The method locates the lines of the clean image containing these words, which can be replaced by the selected antonyms, and transforms each target word into a target sequence. Given the input/target images and sequences, the authors formulated the generation of the adversarial example as an optimisation problem:

min_ω c · J_CTC(f(x′, t′)) + ||x − x′||_2^2    (23)
x′ = (α · tanh(ω) + β)/2    (24)
α = (x_max − x_min)/2,  β = (x_max + x_min)/2,  J_CTC(f(x, t)) = −log p(t | x)    (25)

where f(x) is the neural system model, J_CTC(·) is the Connectionist Temporal Classification (CTC) loss function, x is the input image, t is the ground-truth sequence, x′ is the adversarial example, t′ is the target sequence, and ω, α, β are parameters controlling the adversarial example to satisfy the box constraint x′ ∈ [x_min, x_max]^p, where p is the number of pixels. After generating adversarial examples, the method replaces the images of the corresponding lines in the text image. The authors evaluated the method in three aspects: single-word recognition, whole-document recognition, and NLP applications based on the recognised text (specifically, sentiment analysis and document categorisation). They also acknowledged that the proposed method suffers from limitations such as low transferability across data and models, and physical unrealizability.
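A minimal sketch of the box-constrained CTC-loss optimisation of Eqs. (23)-(25), written with PyTorch autograd. The tiny convolutional "OCR model", the target sequence and all hyper-parameters are stand-ins for illustration; the tanh re-parameterization plays the role of Eq. (24) for a [0, 1] pixel box.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for an OCR network: maps a 1x8x32 "text-line image" to
# per-column character logits (T=32 time steps, 5 symbols incl. CTC blank).
class TinyOCR(nn.Module):
    def __init__(self, channels=4, num_symbols=5):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.proj = nn.Linear(channels * 8, num_symbols)

    def forward(self, x):                      # x: (N, 1, 8, 32)
        h = torch.relu(self.conv(x))           # (N, ch, 8, 32)
        h = h.permute(3, 0, 1, 2).flatten(2)   # (T, N, ch*8)
        return self.proj(h).log_softmax(-1)    # (T, N, num_symbols)

model = TinyOCR()
ctc_loss = nn.CTCLoss(blank=0)

x = torch.rand(1, 1, 8, 32)                    # clean line image in [0, 1]
target = torch.tensor([[1, 3, 2]])             # targeted character sequence t'
input_len, target_len = torch.tensor([32]), torch.tensor([3])
c = 1.0                                        # trade-off constant of Eq. (23)

# tanh change of variables: x' = (tanh(omega) + 1)/2 always stays in [0, 1].
omega = torch.atanh((2 * x - 1).clamp(-0.999, 0.999)).detach().requires_grad_(True)
opt = torch.optim.Adam([omega], lr=0.05)

for _ in range(100):
    x_adv = (torch.tanh(omega) + 1) / 2
    loss = c * ctc_loss(model(x_adv), target, input_len, target_len) \
           + ((x_adv - x) ** 2).sum()          # Eq. (23): CTC loss + L2 distance
    opt.zero_grad()
    loss.backward()
    opt.step()

x_adv = ((torch.tanh(omega) + 1) / 2).detach()
print("L2 distortion:", float(((x_adv - x) ** 2).sum().sqrt()))
```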

Scene Text Recognition (STR). STR is another image-to-text application. In STR, the entire image is mapped to a word string directly, whereas recognition in OCR is a pipeline process that first segments words into characters and then recognizes the single characters. AdaptiveAttack [153] studied the possibility of adversarial attacks on scene text recognition. The authors proposed two attacks, a basic attack and an adaptive attack. The basic attack is similar to the work in [129] and also formulates adversarial example generation as an optimisation problem:

min_ω J_CTC(f(x′, t′)) + λ D(x, x′)    (26)
x′ = tanh(ω)    (27)

where D(·) is the Euclidean distance. The differences to [129] lie in the definition of x′ (Eq. (24) vs. Eq. (27)), the distance measurement between x and x′ (L2 norm vs. Euclidean distance), and the parameter λ, which balances being adversarial against staying close to the original image. As searching for a proper λ is time-consuming, the authors proposed a second method, Adaptive Attack, that finds λ adaptively: they model the likelihood of a sequential classification task with a Gaussian distribution and derive the adaptive optimization for sequential adversarial examples as:

min  ||x − x′||_2^2 / λ_1^2 + J_CTC(f(x′, t′)) / λ_2^2 + log λ_1^2 + T log λ_2^2 + 1/λ_2^2    (28)


where λ1 and λ2 are two parameters that balance the perturbation and the CTC loss, and T is the number of valid paths given the targeted sequential output. Adaptive Attack can be applied to generate adversarial examples for both non-sequential and sequential classification problems; here we only show the equation for sequential data. The authors evaluated the proposed methods on tasks targeting text insertion, deletion and substitution in the output, and the results demonstrated that Adaptive Attack is much faster than the basic attack.

Image Captioning. Image captioning is another multimodal learning task that takes an image as input and generates a textual caption describing its visual content. Show-and-Fool [22] generates adversarial examples to attack CNN-RNN based image captioning models, which use a CNN encoder for image feature extraction and an RNN decoder for caption generation. Show-and-Fool has two attack strategies: targeted caption (the generated caption matches the target caption) and targeted keywords (the generated caption contains the targeted keywords). In general, both tasks are formulated as:

min_ω c · J(x′) + ||x′ − x||_2^2    (29)
x′ = x + η,  x = tanh(y),  x′ = tanh(ω + y)

where c > 0 is a pre-specified regularization constant, η is the perturbation, and ω, y are parameters controlling x′ ∈ [−1, 1]. The difference between the two strategies is the definition of the loss function J(·). For the targeted caption strategy, given the targeted caption S = (S1, S2, ..., St, ..., SN), where St is the index of the t-th word in the vocabulary and N is the length of the caption, the loss is formulated as:

J_{S,logit}(x′) = Σ_{t=2}^{N−1} max{−ϵ, max_{k≠St} {z_t^{(k)}} − z_t^{(St)}}    (30)

where St is the target word and z_t^{(St)} is its logit. In effect, this loss minimises the difference between the maximum logit excluding St and the logit of St. For the targeted keywords strategy, given the targeted keywords K := K1, ..., KM, the loss function is:

J_{K,logit}(x′) = Σ_{j=1}^{M} min_{t∈[N]} { max{−ϵ, max_{k≠Kj} {z_t^{(k)}} − z_t^{(Kj)}} }    (31)

The authors performed extensive experiments on Show-and-Tell [137] and varied the parameters of the attacking loss. They found that Show-and-Fool is not only effective at attacking Show-and-Tell, but its adversarial examples are also highly transferable to another model, Show-Attend-and-Tell [147].

Visual Question Answering (VQA). Given an image and a natural-language question about the image, VQA aims to provide an accurate answer in natural language. The work in [148] proposed an iterative optimisation method to attack two VQA models. The proposed objective maximises the probability of the target answer and down-weights the preference for adversarial examples with a smaller distance to the original image once this distance falls below a threshold. Specifically, the objective contains three components. The first is similar to Eq. (26), replacing the CTC loss with the loss of the VQA model and using ||x − x′||_2 / √N as the distance between x′ and x. The second component maximises the difference between the softmax output and the prediction when it differs from the target answer. The third component ensures the distance between x′ and x is bounded below. The attacks were evaluated by checking whether a better success rate is obtained than previous attacks, and by the model's confidence score for the target answer. Based on the evaluations, the authors concluded that attention, bounding-box localisation and compositional internal structures are vulnerable to adversarial attacks. This work also attacked an image captioning neural model; we refer readers to the original paper for further information.


Multi-modal Application | Work | Target | Access | Attacked Models | Perturb Ctrl.
Image-to-Text: Optical Character Recognition | [129] | Y | white-box | Tesseract [135] | L2, EdDist
Image-to-Text: Scene Text Recognition | [153] | Y | white-box | CRNN [124] | L2
Image-to-Text: Image Captioning | [22] | Y | white-box | Show-and-Tell [137] | L2
Image-to-Text: Visual Question Answering | [148] | Y | white-box | MCB [34], N2NMN [52] | L2
Image-to-Text: Visual-Semantic Embeddings | [125] | N | black-box | VSE++ [33] | –
Speech-to-Text: Speech Recognition | [19] | Y | white-box | DeepSpeech [48] | L2

Table 3. Summary of reviewed cross-modal attacks. EdDist: edit distance of text; –: not available.

Visual-Semantic Embeddings (VSE). The aim of VSE is to bridge natural language and the underlying visual world. In VSE, the embedding spaces of images and of their descriptive texts (captions) are jointly optimized and aligned. [125] attacked a recent VSE model by generating adversarial examples for the test set in order to evaluate the robustness of VSE models. They attack the textual part with three methods: i) replace nouns in the image captions utilizing the hypernymy/hyponymy relations in WordNet; ii) change numerals to different ones and singularize or pluralize the corresponding nouns when necessary; iii) detect relations and shuffle non-interchangeable noun phrases or replace the prepositions. This method can be considered a black-box edit adversary.

4.4.2 Speech-to-Text. Speech-to-text is also known as speech recognition: the task of automatically recognizing spoken language and transcribing it into text. [19] attacked a state-of-the-art LSTM-based speech-to-text transcription network, DeepSpeech. Given a natural waveform, the authors constructed an audio perturbation that is almost inaudible but is recognized by the model once added to the original waveform. The perturbation is constructed by adopting the idea of the C&W method (Section 3.1), which measures image distortion by the maximum amount of changed pixels; adapting this idea, they measure audio distortion by the relative loudness of the perturbation and use the Connectionist Temporal Classification loss for the optimization task, which they solve with the Adam optimizer [62].

4.5 Benchmark Datasets by Applications

In recent years, neural networks have gained success in many NLP domains, with popular applications including text classification, reading comprehension, machine translation, text summarization, question answering and dialogue generation, to name a few. In this section, we review the works on generating adversarial examples from the perspective of the attacked NLP applications. Table 4 summarizes the works reviewed in this article according to their application domain, together with the benchmark datasets used as auxiliary information; we refer readers to the provided links and references for detailed descriptions of the datasets. Note that auxiliary datasets used only to help generate adversarial examples are not included; we present only the datasets used to evaluate the attacked neural networks.

Text Classification. The majority of the surveyed works attack deep neural networks for text classification, since many tasks can be framed as classification problems. Sentiment analysis aims to classify the sentiment into several groups (e.g., in a 3-group scheme: neutral, positive and negative). Gender identification, grammatical error detection and malware detection can be framed as binary classification problems, and relation extraction as single- or multi-label classification. Medical status prediction is a multi-class problem whose classes are defined by medical experts.


Group | Application | Representative Works | Benchmark Datasets
Classification | Text Classification | [31, 35, 39, 78, 98, 118] | DBpedia, Reuters Newswires, AG's News, Sogou News, Yahoo! Answers, RCV1, Surname Classification Dataset
Classification | Sentiment Analysis | [31, 35, 54, 98, 104, 117, 118, 127] | SST, IMDB Review, Yelp Review, Elec, Rotten Tomatoes Review, Amazon Review, Arabic Tweets Sentiment
Classification | Spam Detection | [35] | Enron Spam, datasets from [155]
Classification | Gender Identification | [117] | Twitter Gender
Classification | Grammar Error Detection | [118] | FCE-public
Classification | Medical Status Prediction | [130] | Electronic Health Records (EHR)
Classification | Malware Detection | [4, 45, 46, 53, 114] | DREBIN, Microsoft Kaggle
Classification | Relation Extraction | [11, 145] | NYT Relation, UW Relation, ACE04, CoNLL04 EC, Dutch Real Estate Classifieds, Adverse Drug Events
– | Machine Translation | [13, 24, 30, 156] | TED Talks, WMT'16 Multimodal Translation Task
– | Machine Comprehension | [16, 21, 55, 142] | SQuAD, MovieQA Multiple Choice, Logical QA
– | Text Summarization | [24] | DUC2003, DUC2004, Gigaword
– | Text Entailment | [54, 57, 92, 156] | SNLI, SciTail, MultiNLI, SICK
– | POS Tagging | [151] | WSJ portion of PTB, treebanks in UD
– | Dialogue System | [99] | Ubuntu Dialogue, CoCoA
Cross-modal | Optical Character Recognition | [129] | Hillary Clinton's emails
Cross-modal | Scene Text Recognition | [153] | Street View Text, ICDAR 2013, IIIT5K
Cross-modal | Image Captioning | [22, 148] | MSCOCO, Visual Genome
Cross-modal | Visual Question Answering | [148] | Datasets from [6], datasets from [157]
Cross-modal | Visual-Semantic Embedding | [125] | MSCOCO
Cross-modal | Speech Recognition | [19] | Mozilla Common Voice

Table 4. Attacked applications and benchmark datasets.

These works usually use multiple datasets to evaluate their attack strategies, in order to show the generality and robustness of their method. [78] used the DBpedia ontology dataset [72] to classify document samples into 14 high-level classes. [39] used IMDB movie reviews [85] for sentiment analysis, and the Reuters-2 and Reuters-5 newswire datasets provided by the NLTK package^10 for categorization. [104] used an unspecified movie review dataset for sentiment analysis. [117] also used the IMDB movie review dataset for sentiment analysis, and additionally performed gender classification on a Twitter dataset^11. [35] performed spam detection on the Enron Spam dataset [91] and adopted six large datasets from [155]: AG's news^12, Sogou news [138], the DBpedia ontology dataset and Yahoo! Answers^13 for text categorization, and Yelp reviews^14 and Amazon reviews [90] for sentiment analysis. [31] also used AG's news for text classification, and further used the Stanford Sentiment Treebank (SST) dataset [128] for sentiment analysis. [118] conducted evaluations on three tasks: sentiment analysis (IMDB movie reviews, Elec [56], Rotten Tomatoes [103]), text categorization (DBpedia ontology dataset and RCV1 [73]) and grammatical error detection (FCE-public [150]). [130] generated adversarial examples on a neural medical status prediction system with real-world electronic health record data.

10. https://www.nltk.org/
11. https://www.kaggle.com/crowdflower/twitter-user-gender-cla2013.
12. https://www.di.unipi.it/~gulli/
13. Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset, obtained through the Yahoo! Webscope program.
14. Yelp Dataset Challenge in 2015.


Many works target malware detection models. [45, 46] attacked neural malware detection systems using the DREBIN dataset, which contains both benign and malicious Android applications [7]. [114] collected benign Windows application files and used the Microsoft Malware Classification Challenge dataset [113] as the malicious part. [53] crawled 180 programs, with corresponding behaviour reports, from a website for malware analysis^15; 70% of the crawled programs are malware. [98] proposed a different kind of attack, called reprogramming. They specifically targeted text classification neural models and used four datasets to evaluate their attack methods: the Surname Classification Dataset^16, the Experimental Data for Question Classification [76], the Arabic Tweets Sentiment Classification Dataset [1] and the IMDB movie review dataset. In [145], the authors modelled relation extraction as a classification problem, where the goal is to predict the relations that exist between entity pairs given text mentions; they used two relation datasets, the NYT dataset [111] and the UW dataset [80]. The work [11] targeted improving the efficacy of neural networks for joint entity and relation extraction. Unlike [145], the authors modelled the relation extraction task as a multi-label head selection problem. Four datasets are used in their work: the ACE04 dataset [28], the CoNLL04 EC task [115], the Dutch Real Estate Classifieds (DREC) dataset [12], and Adverse Drug Events (ADE) [47].

Machine Translation. Machine translation works on parallel datasets, where one side is in the source language and the other in the target language. [13] used the TED talks parallel corpus prepared for IWSLT 2016 [89] to test NMT systems. They also collected French, German and Czech corpora to generate natural noise, building a look-up table of possible lexical replacements that is later used to generate adversarial examples. [30] used the same TED talks corpus, with the German-to-English, Czech-to-English and French-to-English pairs.

Machine Comprehension. Machine comprehension datasets provide context documents or paragraphs to the machine; based on its comprehension of the context, a machine comprehension model answers a question. Jia and Liang were among the first to consider textual adversaries, targeting neural machine comprehension models [55]. They used the Stanford Question Answering Dataset (SQuAD), a widely recognised benchmark for machine comprehension, to evaluate the impact of their attack. [142] followed this work and also used SQuAD. Although the focus of [16] is to develop a robust machine comprehension model rather than to attack MC models, they used adversarial examples to evaluate their proposed system on the MovieQA multiple-choice question answering dataset [134]. [21] targeted attacks on the differentiable neural computer (DNC), a novel computing machine built around a DNN, and evaluated the attacks on logical question answering using the bAbI tasks^17.

Text Summarization. The goal of text summarization is to summarize the core meaning of a given document or paragraph with succinct expressions. No surveyed paper targets text summarization alone. [24] evaluated their attack on multiple applications including text summarization, using DUC2003^18, DUC2004^19 and Gigaword^20 to evaluate the effectiveness of the adversarial examples.

Text Entailment. The fundamental task of text entailment is to decide whether a premise text entails a hypothesis, i.e., whether the truth of one text fragment follows from another.

15. https://malwr.com/
16. Classifying names with a character-level RNN, PyTorch tutorial.
17. https://research.fb.com/downloads/babi/
18. http://duc.nist.gov/duc2003/tasks.html
19. http://duc.nist.gov/duc2004/
20. https://catalog.ldc.upenn.edu/LDC2003T05


[57] assessed various models on two entailment datasets: the Stanford Natural Language Inference (SNLI) corpus [17] and SciTail [59]. [92] also used the SNLI dataset and, furthermore, the MultiNLI dataset [144].

Part-of-Speech (POS) Tagging. The purpose of POS tagging is to resolve the part of speech of each word in a sentence, such as noun, verb, etc. It is one of the fundamental NLP tasks that facilitate other NLP tasks, e.g., syntactic parsing, and neural networks have also been adopted for it. [151] adopted the method in [94] to build a more robust neural network via adversarial training, applying the strategy (with minor modifications) to POS tagging. By training on a mixture of clean and adversarial examples, the authors found that adversarial examples not only help improve tagging accuracy, but also contribute to the downstream task of dependency parsing, and are generally effective across different sequence labelling tasks. The datasets used in their evaluation are the Wall Street Journal (WSJ) portion of the Penn Treebank (PTB) [87] and treebanks from Universal Dependencies (UD) v1.2 [100].

Dialogue Generation. Dialogue generation is a fundamental component of real-world virtual assistants such as Siri^21 and Alexa^22. It is the text generation task of automatically generating a response to a user's post. [99] is one of the first works to attack generative dialogue models. They used the Ubuntu Dialogue Corpus [84] and the Collaborative Communicating Agents (CoCoA) dataset [49], used by the Dynamic Knowledge Graph Network, to evaluate their two attack strategies.

Cross-modal Applications. [129] evaluated OCR systems with adversarial examples built from Hillary Clinton's emails^23, which are in image form, and also attacked NLP applications using the Rotten Tomatoes and IMDB review datasets. The work in [153] attacked neural networks designed for scene text recognition, with experiments on three standard benchmarks for cropped word-image recognition: the Street View Text dataset (SVT) [139], the ICDAR 2013 dataset (IC13) [58] and the IIIT 5K-word dataset (IIIT5K) [93]. [22] attacked image captioning neural models using the Microsoft COCO (MSCOCO) dataset [79]. [148] worked on attacking neural models for image captioning and visual question answering, using the Visual Genome dataset [65] for the first task and the VQA datasets collected and processed in [6] for the second. [125] worked on visual-semantic embedding applications, where the MSCOCO dataset is used. [19] targeted the speech recognition problem, using the Mozilla Common Voice dataset^24.

Multi-Applications. Some works adapt their attack methods to different applications, i.e., they evaluate the transferability of their method across applications. [24] attacked sequence-to-sequence models and evaluated the attack on two applications: text summarization and machine translation. For text summarization, as mentioned before, they used the DUC2003, DUC2004 and Gigaword datasets; for machine translation, they sampled a subset of the WMT'16 Multimodal Translation dataset^25. [54] proposed syntactically adversarial paraphrases and evaluated the attack on sentiment analysis and text entailment, using SST for sentiment analysis and SICK [88] for text entailment. [156] is a generic approach for generating adversarial examples on neural models; the applications investigated include image classification (the MNIST digit image dataset), textual entailment (SNLI) and machine translation. [94] evaluated their attacks on five datasets, covering both sentiment analysis (IMDB movie reviews, Elec product reviews, Rotten Tomatoes movie reviews) and text categorization (DBpedia ontology, RCV1 news articles). [127] targeted two applications: for sentiment analysis, they used the Rotten Tomatoes and IMDB movie review datasets.

21. https://www.apple.com/au/siri/
22. https://en.wikipedia.org/wiki/Amazon_Echo
23. https://www.kaggle.com/kaggle/hillary-clinton-emails/data
24. https://voice.mozilla.org/en
25. http://www.statmt.org/wmt16/translation-task.html


For visual question answering, they tested on the dataset provided by Zhu et al. [157].

5 DEFENSE

An essential purpose of generating adversarial examples for neural networks is to use them to enhance the model's robustness [42]. There are two common ways to achieve this goal for textual DNNs: adversarial training and knowledge distillation. Adversarial training incorporates adversarial examples into the model training process, while knowledge distillation manipulates the neural network model and trains a new model. In this section, we introduce representative studies in these two directions. For more comprehensive coverage of defense strategies for machine learning and deep learning models and applications, please refer to [2, 15].

5.1 Adversarial Training

Szegedy et al. [132] invented adversarial training, a strategy that consists of training a neural network to correctly classify both normal and adversarial examples. Goodfellow et al. [42] employed explicit training with adversarial examples. In this section, we describe works utilizing data augmentation, model regularization and robust optimization to defend against textual adversarial attacks.

5.1.1 Data Augmentation. Data augmentation extends the original training set with the generated adversarial examples so that the model sees more data during the training process. It is commonly used against black-box attacks, with additional training epochs on the attacked DNN using the adversarial examples.

The authors of [55] tried to enhance a reading comprehension model by training it on an augmented dataset that includes the adversarial examples. They showed that this data augmentation is effective and robust against attacks that use the same adversarial examples; however, their work also demonstrated that the augmentation strategy remains vulnerable to attacks using other kinds of adversarial examples. [142] shares a similar idea of augmenting the training dataset, but selects more informative adversarial examples, as discussed in Section 4.3.1.

The work in [57] trains a text entailment system augmented with adversarial examples in order to make the system more robust. They proposed three methods to generate more data with diverse characteristics: (1) knowledge-based, which replaces words with their hypernyms/hyponyms provided by several given knowledge bases; (2) hand-crafted, which adds negations to the existing entailments; (3) neural-based, which leverages a seq2seq model to generate entailment examples, with a loss function measuring the cross-entropy between the original hypothesis and the predicted hypothesis. During the training process, they adopt the idea of generative adversarial networks to train a discriminator and a generator, incorporating the adversarial examples in the discriminator's optimization step.

[13] explores another way of performing data augmentation. It takes the average character embedding as a word representation and incorporates it into the input. This approach is intrinsically insensitive to character scrambling such as Swap, Mid and Rand, and thus can resist the noise caused by the scrambling attacks proposed in that work. However, this defense is ineffective against attacks that do not perturb the order of characters.
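The intuition behind this defense can be shown with a toy example: representing a word by the mean of its character embeddings makes the representation invariant to any reordering of its characters. The embedding table below is randomly initialized purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = rng.normal(size=(len(char_vocab), 16))   # toy character-embedding table

def mean_char_embedding(word):
    """Order-insensitive word representation: average of character embeddings."""
    ids = [char_vocab[c] for c in word.lower() if c in char_vocab]
    return char_emb[ids].mean(axis=0)

# A scrambled word maps to exactly the same vector as the original one.
assert np.allclose(mean_char_embedding("noise"), mean_char_embedding("nosie"))
```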

5.1.2 Model Regularization. Model regularization uses the generated adversarial examples as a regularizer and takes the form:

min ( J(f(x), y) + λ J(f(x′), y) )        (32)

where λ is a hyperparameter.
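As a minimal sketch of how Eq. (32) is used in training, the snippet below combines the clean loss and the adversarial loss, taking cross-entropy as one common choice of J; `model`, `x`, and `x_adv` (a pre-computed adversarial batch) are placeholders.

```python
import torch.nn.functional as F

def regularized_loss(model, x, x_adv, y, lam=1.0):
    """Eq. (32): J(f(x), y) + lambda * J(f(x'), y), with J = cross-entropy."""
    clean_loss = F.cross_entropy(model(x), y)      # loss on the original batch
    adv_loss = F.cross_entropy(model(x_adv), y)    # loss on the adversarial batch
    return clean_loss + lam * adv_loss             # lam is the hyperparameter lambda
```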


Following [42], the work [94] constructed the adversarial training with a linear approximation as follows:

−log p(y | x − ϵg/∥g∥₂ ; θ),    g = ∇_x log p(y | x; θ̂)        (33)

where ∥g∥₂ is the L2 norm of g, θ denotes the parameters of the neural model, and θ̂ is a constant copy of θ. The difference to [42] is that the authors performed the adversarial generation and training on the word embeddings. Further, they extended their previous work on attacking image deep neural models [95], where the local distributional smoothness (LDS) is defined as the negative of the KL divergence between two distributions (the original data and the adversaries). LDS measures the robustness of the model against perturbation in the local and 'virtual' adversarial direction. In this sense, the adversary is calculated as the direction in which the model distribution is most sensitive in terms of KL divergence. They also applied this attack strategy on word embeddings and performed adversarial training by adding the adversarial examples as a regularizer.

The work [118] follows the idea from [94] and extends adversarial training to LSTMs. The authors followed FGSM to incorporate adversarial training as a regularizer. However, to enable the interpretability of the adversarial examples, i.e., to ensure that the word embeddings of the adversaries are valid word embeddings in the vocabulary, they introduced a direction vector which associates the perturbed embedding with a valid word embedding. [145] simply adopts the regularizer used in [94], but applies the perturbations to pre-trained word embeddings and in a different task: relation extraction. Other similar works that adopt [94] include [11, 118, 145, 151]. We do not cover all of these works in this article, since they simply adopt this method.
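The embedding-level perturbation of Eq. (33) can be sketched in PyTorch as follows. Since the loss is −log p(y | x; θ), its gradient with respect to the embeddings equals −g, so stepping along the normalized loss gradient reproduces −ϵg/∥g∥₂. The `model` here is a placeholder that maps a batch of word embeddings to logits; this is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def embedding_adversarial_loss(model, embeddings, y, epsilon=1.0):
    """Adversarial loss of Eq. (33), computed at the word-embedding level.

    embeddings: (batch, seq_len, dim) tensor of word embeddings (placeholder input).
    """
    embeddings = embeddings.detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeddings), y)        # -log p(y | x; theta)
    grad = torch.autograd.grad(loss, embeddings)[0]     # equals -g from Eq. (33)

    # L2-normalize per example: x + epsilon * grad/||grad|| = x - epsilon * g/||g||.
    norm = grad.reshape(grad.size(0), -1).norm(p=2, dim=1).view(-1, 1, 1) + 1e-12
    r_adv = epsilon * grad / norm

    return F.cross_entropy(model(embeddings + r_adv.detach()), y)
```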

5.1.3 Robust Optimisation. Madry et al. [86] cast DNN model learning as a robust optimization with a min-max (saddle point) formulation, which is the composition of an inner non-concave maximization problem (attack) and an outer non-convex minimization problem (defense). According to Danskin's theorem, gradients at the inner maximizers correspond to descent directions for the min-max problem, thus the optimization can still apply back-propagation to proceed. The approach successfully demonstrated the robustness of DNNs against adversarial images via this universal training scheme. [3] adopts the idea and applies it to a malware detection DNN that handles discrete data. Their learning objective is formulated as:

θ* = argmin_θ E_(x,y)∼D [ max_{x′∈S(x)} L(θ, x′, y) ]        (34)

where S(x) is the set of binary indicator vectors that preserve the functionality of the malware x, L is the loss function of the original classification model, y is the ground-truth label, θ denotes the learnable parameters, and D denotes the distribution of the data samples x.

It is worth noting that the proposed robust optimisation method is a universal framework under which other adversarial training strategies have a natural interpretation. We describe it separately in view of its popularity in the literature.
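A rough sketch of one training step under the min-max objective of Eq. (34) is given below. The inner maximization over the discrete set S(x) is approximated here by enumerating a small set of candidate perturbation functions; `model`, `optimizer`, and `candidate_fns` are placeholders, and the actual inner maximizers in [3, 86] are computed differently (e.g., with multiple gradient-guided steps).

```python
import torch
import torch.nn.functional as F

def robust_training_step(model, optimizer, x, y, candidate_fns):
    """One outer-minimization step of Eq. (34) with an enumerated inner max."""
    # Inner (approximate) maximization: pick the candidate in S(x) with the highest loss.
    model.eval()
    with torch.no_grad():
        candidates = [fn(x) for fn in candidate_fns] + [x]
        losses = torch.stack([F.cross_entropy(model(c), y) for c in candidates])
        worst = candidates[int(losses.argmax())]

    # Outer minimization: standard gradient step on the worst-case example.
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(worst), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```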

5.2 Distillation

Papernot et al. [107] proposed distillation as another possible defense against adversarial examples. The principle is to use the softmax output (e.g., the class probabilities in classification DNNs) of the original DNN to train a second DNN, which has the same structure as the original one. The softmax of the original DNN is also modified by introducing a temperature parameter T:

q_i = exp(z_i/T) / ∑_k exp(z_k/T)        (35)


where z_i is the input to the softmax layer. T controls the level of knowledge distillation. When T = 1, Eq. (35) reduces to the normal softmax function. If T is large, q_i is close to a uniform distribution; when it is small, the function outputs more extreme values. [46] adopts the distillation defense for DNNs on discrete data and applies a high temperature T, as high-temperature softmax has been shown to reduce the model's sensitivity to small perturbations [107]. They trained the second DNN on the original dataset augmented with the softmax outputs of the original DNN. From the evaluations, they found adversarial training to be more effective than distillation, although the reason for this gap remains an open question.
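The effect of the temperature in Eq. (35) is easy to verify numerically; the small sketch below shows the ordinary softmax at T = 1, a near-uniform output at a high temperature, and a near one-hot output at a low temperature.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Eq. (35): class probabilities from the logits z at temperature T."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=1.0))   # ordinary softmax
print(softmax_with_temperature(logits, T=20.0))  # close to uniform
print(softmax_with_temperature(logits, T=0.1))   # close to one-hot (extreme values)
```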

6 DISCUSSIONS AND OPEN ISSUES

Generating textual adversarial examples has a relatively shorter history than generating image adversarial examples for DNNs, because it is more challenging to perturb discrete data while preserving valid syntax, grammar and semantics. We discuss some of the open issues in this section and provide suggestions on future directions.

6.1 Perceivability

Perturbations on image pixels are usually hard to perceive, thus they do not affect human judgment but can still fool deep neural networks. However, perturbations on text are obvious, whether they flip characters or change words. Invalid words and syntactic errors can be easily identified by humans and detected by grammar checking software, hence such perturbations can hardly attack a real NLP system. Nevertheless, many research works generate these types of adversarial examples. This is acceptable only if the purpose is to utilize the adversarial examples to robustify the attacked DNN models. From the semantic-preserving perspective, changing a word in a sentence sometimes changes its semantics drastically and is easily detected by human beings. For NLP applications such as reading comprehension and sentiment analysis, the adversarial examples need to be carefully designed so as not to change the should-be output. Otherwise, both the correct output and the perturbed output change, violating the purpose of generating adversarial examples. This is challenging, and only a limited number of the reviewed works considered this constraint. Therefore, for practical attacks, we need methods that make perturbations not only imperceptible, but also grammar- and semantics-preserving.

6.2 Transferability

Transferability is a common property of adversarial examples. It reflects the generalization ability of attack methods. Transferability means adversarial examples generated for one deep neural network on one dataset can also effectively attack another deep neural network (i.e., cross-model generalization) or dataset (i.e., cross-data generalization). This property is more often exploited in black-box attacks, as the details of the deep neural network do not affect the attack method much. It has also been shown that untargeted adversarial examples are much more transferable than targeted ones [83]. Transferability can be organized into three levels for deep neural networks: (1) same architecture with different data; (2) different architectures with the same application; (3) different architectures with different data [154]. Although current works on textual attacks cover all three levels, the performance of the transferred attacks still decreases drastically compared to that on the original architecture and data, i.e., they show poor generalization ability. More efforts are expected to deliver better generalization ability.

6.3 Automation

Some reviewed works are able to generate adversarial examples automatically, while others cannot. In white-box attacks, leveraging the loss function of the DNN can identify the most affected points (e.g., characters, words) in a text automatically. Then the attacks are performed on these points by automatically modifying the corresponding text. In black-box attacks, some methods, e.g., substitution-based ones, train substitute DNNs and apply white-box attack strategies on the substitute. This can be achieved automatically. However, most of the other works craft the adversarial examples manually. For example, [55] concatenated manually-chosen meaningless paragraphs to fool reading comprehension systems, in order to discover the vulnerability of the victim DNNs. Many research works followed this approach, aiming not at practical attacks but rather at examining the robustness of the target network. These manual efforts are time-consuming and impractical. We believe that more efforts in this line could overcome this barrier in the future.

6.4 New Architectures

Although most of the common textual DNNs have gained attention from the perspective of adversarial attacks (Section 2.2), many DNNs have not been attacked so far, for example, the generative neural models: Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs). In NLP, they are used to generate texts. Deep generative models require more sophisticated skills for model training, which may explain why these techniques have been largely overlooked by adversarial attacks so far. Future works may consider generating adversarial examples for these generative DNNs. Another example is the differentiable neural computer (DNC): only one work has attacked the DNC so far [21]. The attention mechanism has become a standard component in most sequential models, but no work has examined the mechanism itself. Instead, existing works either attack the overall system that contains attention, or leverage attention scores to identify the words to perturb [16].

6.5 Iterative vs One-off

Iterative attacks iteratively search for and update the perturbations based on the gradient of the output of the attacked DNN model. Thus they achieve high quality and effectiveness, that is, the perturbations can be small enough and hard to defend against. However, these methods usually require a long time to find the proper perturbations, rendering an obstacle for attacking in real time. Therefore, one-off attacks have been proposed to tackle this problem. FGSM [42] is one example of a one-off attack. Naturally, a one-off attack is much faster than an iterative attack, but it is less effective and easier to defend against [153]. When designing attack methods for a real application, attackers need to carefully consider the trade-off between the efficiency and the effectiveness of the attack.
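The contrast can be sketched at the level of continuous inputs (e.g., word embeddings); real textual attacks must additionally map the perturbed vectors back to discrete tokens. `model` is a placeholder classifier over such inputs, and the functions below show a generic FGSM-style one-off step versus an iterative variant, not any specific surveyed method.

```python
import torch
import torch.nn.functional as F

def one_off_attack(model, x, y, epsilon=0.1):
    """Single gradient step (FGSM-style): fast, but a coarser perturbation."""
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + epsilon * grad.sign()).detach()

def iterative_attack(model, x, y, epsilon=0.1, step=0.01, n_iter=20):
    """Iterative variant: repeated small steps projected back into the epsilon-ball."""
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + step * grad.sign()
        x_adv = x_orig + (x_adv - x_orig).clamp(-epsilon, epsilon)   # projection
    return x_adv.detach()
```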

7 CONCLUSION

This article presents the first comprehensive survey on generating textual adversarial examples for deep neural networks. We review recent research efforts and develop classification schemes to organize the existing literature. Additionally, we summarize and analyze these works from different aspects. We attempt to provide a good reference for researchers to gain insight into the challenges, methods and issues in this research topic, and to shed light on future directions. We hope more robust deep neural models will be proposed based on the knowledge of these adversarial attacks.

REFERENCES

[1] Nawaf A Abdulla, Nizar A Ahmed, Mohammed A Shehab, and Mahmoud Al-Ayyoub. 2013. Arabic sentiment analysis: Lexicon-based and corpus-based. In Proc. of the 2013 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT 2013). IEEE, 1–6.
[2] Naveed Akhtar and Ajmal S. Mian. 2018. Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey. IEEE Access 6 (2018), 14410–14430.


[3] Abdullah Al-Dujaili, Alex Huang, Erik Hemberg, and Una-May O’Reilly. 2018. Adversarial Deep Learning forRobust Detection of Binary Encoded Malware. In Proc. of the 2018 IEEE Security and Privacy Workshops (SPW 2018).Francisco, CA, USA, 76–82.

[4] Abdullah Al-Dujaili, Alex Huang, Erik Hemberg, and Una-May O’Reilly. 2018. Adversarial Deep Learning for RobustDetection of Binary Encoded Malware. In Proc. of the 2018 IEEE Security and Privacy Workshops (SP Workshops2018). San Francisco, CA, USA, 76–82.

[5] Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. 2018.Generating Natural Language Adversarial Examples. In Proc. of the 2018 Conference on Empirical Methods in NaturalLanguage Processing (EMNLP 2018). Brussels, Belgium, 2890–2896.

[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and DeviParikh. 2015. VQA: Visual Question Answering. In Proc. of the 2015 IEEE International Conference on ComputerVision (ICCV 2015). Santiago, Chile, 2425–2433.

[7] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, and Konrad Rieck. 2014. DREBIN: Effective andExplainable Detection of Android Malware in Your Pocket. In Proc. of the 21st Annual Network and DistributedSystem Security Symposium (NDSS 2014). San Diego, California, USA.

[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning toAlign and Translate. CoRR abs/1409.0473 (2014). arXiv:1409.0473 http://arxiv.org/abs/1409.0473

[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning toAlign and Translate. (2015).

[10] Marco Barreno, Blaine Nelson, Anthony D. Joseph, and J. D. Tygar. 2010. The security of machine learning. MachineLearning 81, 2 (2010), 121–148.

[11] Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Adversarial training for multi-context joint entity and relation extraction. In Proc. of the 2018 Conference on EmpiricalMethods inNatural LanguageProcessing (EMNLP 2018). Brussels, Belgium, 2830–2836.

[12] Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. An attentive neural architecture forjoint segmentation and parsing and its application to real estate ads. Expert Systems with Applications 102 (2018),100–112.

[13] Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and Natural Noise Both Break Neural Machine Translation.arXiv preprint arXiv:1711.02173. ICLR (2018).

[14] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model.Journal of Machine Learning Research 3 (2003), 1137–1155.

[15] Battista Biggio and Fabio Roli. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. PatternRecognition 84 (2018), 317–331.

[16] Matthias Blohm, Glorianna Jagfeld, Ekta Sood, Xiang Yu, and Ngoc Thang Vu. 2018. Comparing Attention-BasedConvolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension. In Proc.of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018). Brussels, Belgium, 108–118.

[17] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpusfor learning natural language inference. In Proc. of the 2015 Conference on Empirical Methods in Natural LanguageProcessing (EMNLP 2015). Lisbon, Portugal, 632–642.

[18] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. GeneratingSentences from a Continuous Space. In Proc. of the 20th SIGNLL Conference on Computational Natural LanguageLearning (CoNLL 2016). Berlin, Germany, 10–21.

[19] Nicholas Carlini and David A. Wagner. [n. d.]. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text.
[20] Nicholas Carlini and David A. Wagner. 2017. Towards Evaluating the Robustness of Neural Networks. In Proc. of the 2017 IEEE Symposium on Security and Privacy (SP 2017). San Jose, CA, USA, 39–57.
[21] Alvin Chan, Lei Ma, Felix Juefei-Xu, Xiaofei Xie, Yang Liu, and Yew Soon Ong. 2018. Metamorphic Relation Based Adversarial Attacks on Differentiable Neural Computer. CoRR abs/1809.02444 (2018). arXiv:1809.02444 http://arxiv.org/abs/1809.02444

[22] Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. 2018. Attacking Visual Language Groundingwith Adversarial Examples: A Case Study on Neural Image Captioning. In Proceedings of ACL 2018.

[23] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for NaturalLanguage Inference. In Proc. of the 55th Annual Meeting of the Association for Computational Linguistics (ACL2017). Vancouver, BC, Canada, 1657–1668.

[24] Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. 2018. Seq2Sick: Evaluating the Robustnessof Sequence-to-Sequence Models with Adversarial Examples. arXiv preprint arXiv:1803.01128 (2018).

[25] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). Doha, Qatar, 1724–1734.

[26] Marta R. Costa-Jussà and José A. R. Fonollosa. 2016. Character-based Neural Machine Translation. In Proc. of the54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

[27] George E. Dahl, Jack W. Stokes, Li Deng, and Dong Yu. 2013. Large-scale malware classification using randomprojections and neural networks. In Proc. of the 38th International Conference on Acoustics, Speech and SignalProcessing (ICASSP 2013). Vancouver, BC, Canada, 3422–3426.

[28] George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and RalphWeischedel. 2004.The Automatic Content Extraction (ACE) Program -Tasks, Data, and Evaluation. In Proc. of the Fourth InternationalConference on Language Resources and Evaluation (LREC’04). Lisbon, Portugal.

[29] Daria Dzendzik, Carl Vogel, and Qun Liu. 2017. Who framed roger rabbit? multiple choice questions answering aboutmovie plot. (2017).

[30] Javid Ebrahimi, Daniel Lowd, and Dejing Dou. [n. d.]. On Adversarial Examples for Character-Level Neural MachineTranslation. In Proc. of the 27th International Conference on Computational Linguistics (COLING 2018). Santa Fe,New Mexico, USA, 653–663.

[31] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-Box Adversarial Examples for TextClassification. In Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018).Melbourne, Australia, 31–36.

[32] Gamaleldin F. Elsayed, Ian J. Goodfellow, and Jascha Sohl-Dickstein. 2018. Adversarial Reprogramming of NeuralNetworks. CoRR abs/1806.11146 (2018).

[33] Fartash Faghri, David J. Fleet, Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improved Visual-Semantic Embeddings.CoRR abs/1707.05612 (2017).

[34] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. MultimodalCompact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proc. of the 2016 Conference onEmpirical Methods in Natural Language Processing (EMNLP 2016). Austin, Texas, USA, 457–468.

[35] Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box Generation of Adversarial Text Sequencesto Evade Deep Learning Classifiers. arXiv preprint arXiv:1801.04354 (2018).

[36] Justin Gilmer, Ryan P. Adams, Ian J. Goodfellow, David Andersen, and George E. Dahl. 2018. Motivating the Rules ofthe Game for Adversarial Example Research. CoRR abs/1807.06732 (2018). arXiv:1807.06732 http://arxiv.org/abs/1807.06732

[37] Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Morgan & Claypool Publishers.https://doi.org/10.2200/S00762ED1V01Y201703HLT037

[38] Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagationthrough structure. Neural Networks 1 (1996), 347–352.

[39] Zhitao Gong, Wenlu Wang, Bo Li, Dawn Song, and Wei-Shinn Ku. 2018. Adversarial Texts with Gradient Methods.arXiv preprint arXiv:1801.07175 (2018).

[40] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. Vol. 1.
[41] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proc. of the Annual Conference on Neural Information Processing Systems 2014 (NIPS 2014). Montreal, Quebec, Canada, 2672–2680.

[42] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. InProc. of the 3rd International Conference on Learning Representations (ICLR 2015).

[43] Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of Tricks for Efficient Text Classi-fication. In Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics(EACL 2017). Valencia, Spain, 427–431.

[44] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. 2013. Speech recognition with deep recurrent neuralnetworks. In Proc. of IEEE 2013 International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013).Vancouver, BC, Canada, 6645–6649.

[45] Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. 2016. Adversarialperturbations against deep neural networks for malware classification. arXiv preprint arXiv:1606.04435 (2016).

[46] Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick D. McDaniel. 2017. AdversarialExamples for Malware Detection. In Proc. of the 22nd European Symposium on Research in Computer Security(ESORICS 2017). Oslo, Norway, 62–79.

[47] Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo.2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects frommedical case reports. Journal of Biomedical Informatics 45, 5 (2012), 885–892.


[48] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh,Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition.CoRR abs/1412.5567 (2014).

[49] He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. 2017. Learning Symmetric Collaborative DialogueAgents with Dynamic Knowledge Graph Embeddings. In Proc. of the 55th Annual Meeting of the Association forComputational Linguistics (ACL 2017). Vancouver, Canada, 1766–1776.

[50] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. CoRRabs/1503.02531 (2015). arXiv:1503.02531 http://arxiv.org/abs/1503.02531

[51] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.

[52] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to Reason: End-to-End Module Networks for Visual Question Answering. In Proc. of IEEE International Conference on ComputerVision (ICCV 2017). Venice, Italy, 804–813.

[53] Weiwei Hu and Ying Tan. 2017. Black-Box Attacks against RNN based Malware Detection Algorithms. arXiv preprintarXiv:1705.08131 (2017).

[54] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial Example Generation withSyntactically Controlled Paraphrase Networks. In Proc. of the 2018 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies (NAACL-HLT). New Orleans, Louisiana,USA, 1875–1885.

[55] Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proc. ofthe 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017). Copenhagen, Denmark,2021–2031.

[56] Rie Johnson and Tong Zhang. 2015. Semi-supervised Convolutional Neural Networks for Text Categorization viaRegion Embedding. In Proc. of the Annual Conference on Neural Information Processing Systems 2015 (NIPS 2015).Montreal, Quebec, Canada, 919–927.

[57] Dongyeop Kang, Tushar Khot, Ashish Sabharwal, , and Eduard Hovy. 2018. AdvEntuRe: Adversarial Training forTextual Entailment with Knowledge-Guided Examples. In Proceedings of ACL 2018.

[58] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre,Joan Mas, David Fernández Mota, Jon Almazán, and Lluís-Pere de las Heras. 2013. ICDAR 2013 Robust ReadingCompetition. In Proc. of the 12th International Conference on Document Analysis and Recognition (ICDAR 2013).Washington, DC, USA, 1484–1493.

[59] Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTaiL: A Textual Entailment Dataset from Science QuestionAnswering. In Proc. of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018). New Orleans, Louisiana,USA, 5189–5197.

[60] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proc. of the 2014 Conference onEmpirical Methods in Natural Language Processing (EMNLP 2014). Doha, Qatar, 1746–1751.

[61] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-Aware Neural Language Models. InProc. of the 13th AAAI Conference on Artificial Intelligence (AAAI 2016). Phoenix, Arizona, USA, 2741–2749.

[62] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, CA, USA.

[63] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proc. of the 2014 InternationalConference on Learning Representations (ICLR 2014).

[64] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-SourceToolkit for Neural Machine Translation. In Proc. of the 55th Annual Meeting of the Association for ComputationalLinguistics (ACL 2017). Vancouver, BC, Canada, 67–72.

[65] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis,Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language andVision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123, 1 (2017),32–73.

[66] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep ConvolutionalNeural Networks. In Proc. of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012).Lake Tahoe, Nevada, USA, 1106–1114.

[67] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, RomainPaulus, and Richard Socher. 2016. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. InProc. of the 33nd International Conference on Machine Learning (ICML 2016). New York City, NY, USA, 1378–1387.

[68] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. 2017. Adversarial Machine Learning at Scale. In Proc. of the 5th International Conference on Learning Representations (ICLR 2017). Toulon, France.


[69] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From Word Embeddings To DocumentDistances. In Proc. of the 32nd International Conference on Machine Learning (ICML 2015)). Lille, France, 957–966.

[70] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proc. of the 31thInternational Conference on Machine Learning (ICML 2014). Beijing, China, 1188–1196.

[71] Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully Character-Level Neural Machine Translation withoutExplicit Segmentation. TACL 5 (2017), 365–378.

[72] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann,Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia - A large-scale, multilingualknowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.

[73] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for TextCategorization Research. Journal of Machine Learning Research 5 (2004), 361–397.

[74] Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2019. TextBugger: Generating Adversarial Text AgainstReal-world Applications. In Proc. of 26th Annual Network and Distributed System Security Symposium (NDSS 2019).San Diego, California, USA.

[75] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep Reinforcement Learningfor Dialogue Generation. In Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing(EMNLP 2016). Austin, Texas, USA, 1192–1202.

[76] Xin Li and Dan Roth. 2002. Learning Question Classifiers. In Proc. of the 19th International Conference on Computational Linguistics (COLING 2002). Taipei, Taiwan.

[77] Li Deng and Yang Liu. 2018. Deep Learning in Natural Language Processing. Springer Singapore. https://doi.org/10.1007/978-981-10-5209-5

[78] Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep Text Classification Canbe Fooled. arXiv preprint arXiv:1704.08006 (2017).

[79] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, andC. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proc. of the 13th European Conferenceon Computer Vision (ECCV 2014). Zurich, Switzerland, 740–755.

[80] Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. 2016. EffectiveCrowd Annotation for Relation Extraction. In Proc. of the 2016 Conference of the North American Chapter of theAssociation for Computational Linguistics (NAACL 2016). San Diego California, USA, 897–906.

[81] Qiang Liu, Pan Li, Wentao Zhao, Wei Cai, Shui Yu, and Victor C. M. Leung. 2018. A Survey on Security Threats andDefensive Techniques of Machine Learning: A Data Driven View. IEEE Access 6 (2018), 12103–12117.

[82] Tzu-Chien Liu, Yu-Hsueh Wu, and Hung-yi Lee. 2017. Attention-based CNN Matching Net. CoRR abs/1709.05036(2017).

[83] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into Transferable Adversarial Examples andBlack-box Attacks. In Proc. of the 2017 International Conference on Learning Representations (ICLR 2017).

[84] Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset forResearch in Unstructured Multi-Turn Dialogue Systems. In Proc. of the 16th Annual Meeting of the Special InterestGroup on Discourse and Dialogue (SIGDIAL 2015). Prague, Czech Republic, 285–294.

[85] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. LearningWord Vectors for Sentiment Analysis. In Proc. of the 49th Annual Meeting of the Association for ComputationalLinguistics (ACL 2011). Portland, Oregon, USA, 142–150.

[86] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. TowardsDeep Learning Models Resistant to Adversarial Attacks. In Proc. of the 6th International Conference on LearningRepresentations (ICLR 2018). Vancouver, BC, Canada.

[87] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus ofEnglish: The Penn Treebank. Computational Linguistics 19, 2 (1993), 313–330.

[88] Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014.SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences throughSemantic Relatedness and Textual Entailment. In Proc. of the 8th International Workshop on Semantic Evaluation(SemEval@COLING 2014). Dublin, Ireland, 1–8.

[89] Cettolo Mauro, Girardi Christian, and Federico Marcello. 2012. Wit3: Web Inventory of Transcribed and TranslatedTalks. In Conference of European Association for Machine Translation. 261–268.

[90] Julian J. McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensionswith review text. In Proc. of the 7th ACM Conference on Recommender Systems (RecSys 2013). Hong Kong, China,165–172.

[91] Vangelis Metsis, Ion Androutsopoulos, and Georgios Paliouras. 2006. Spam Filtering with Naive Bayes - Which NaiveBayes?. In Proc. of the Third Conference on Email and Anti-Spam (CEAS 2006). Mountain View, California, USA.


[92] Pasquale Minervini and Sebastian Riedel. 2018. Adversarially Regularising Neural NLI Models to Integrate LogicalBackground Knowledge. In Proc. of the 22nd Conference on Computational Natural Language Learning (CoNLL2018). Brussels, Belgium, 65–74.

[93] Anand Mishra, Karteek Alahari, and C. V. Jawahar. 2012. Scene Text Recognition using Higher Order Language Priors.In Proc. of the 23rd British Machine Vision Conference (BMVC 2012). Surrey, UK, 1–11.

[94] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised textclassification. arXiv preprint arXiv:1605.07725 (2016).

[95] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2016. Distributional smoothing withvirtual adversarial training. In Proc. of the 4th International Conference on Learning Representations (ICLR 2016).

[96] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: A Simple and AccurateMethod to Fool Deep Neural Networks. In Proc. of the 2016 IEEE Conference on Computer Vision and PatternRecognition (CVPR 2016). Las Vegas, NV, USA, 2574–2582.

[97] Michael C Mozer. 1995. A focused backpropagation algorithm for temporal. Backpropagation: Theory, architectures,and applications 137 (1995).

[98] Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, and Farinaz Koushanfar. 2018. Adversarial Reprogramming ofSequence Classification Neural Networks. CoRR abs/1809.01829 (2018). arXiv:1809.01829 http://arxiv.org/abs/1809.01829

[99] Tong Niu and Mohit Bansal. 2018. Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models.In Proc. of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018). Brussels, Belgium,486–496.

[100] Joakim Nivre, Željko Agić, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Ballesteros, John Bauer,Kepa Bengoetxea, Riyaz Ahmad Bhat, Cristina Bosco, et al. 2015. Universal Dependencies 1.2. (2015).

[101] Daniel W. Otter, Julian R. Medina, and Jugal K. Kalita. 2018. A Survey of the Usages of Deep Learning in NaturalLanguage Processing. CoRR abs/1807.10854 (2018).

[102] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab K. Ward.2016. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to InformationRetrieval. IEEE/ACM Trans. Audio, Speech & Language Processing 24, 4 (2016), 694–707.

[103] Bo Pang and Lillian Lee. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respectto Rating Scales. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005).Michigan, USA, 115–124.

[104] Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 2016. Crafting Adversarial InputSequences for Recurrent Neural Networks. In Military Communications Conference, MILCOM 2016-2016 IEEE. IEEE,49–54.

[105] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017.Practical Black-Box Attacks against Machine Learning. In Proc. of the 2017 ACM on Asia Conference on Computerand Communications Security (AsiaCCS 2017). Abu Dhabi, United Arab Emirates, 506–519.

[106] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. 2016.The Limitations of Deep Learning in Adversarial Settings. In IEEE European Symposium on Security and Privacy(EuroS&P 2016). Saarbrücken, Germany, 372–387.

[107] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a Defenseto Adversarial Perturbations Against Deep Neural Networks. In Proc. of the 2016 IEEE Symposium on Security andPrivacy (SP 2016). San Jose, CA, USA, 582–597.

[108] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model forNatural Language Inference. In Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing(EMNLP 2016). Austin, Texas, USA, 2249–2255.

[109] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer.2018. Deep Contextualized Word Representations. In Proc. of the 2018 Conference of the North American Chapter ofthe Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018). New Orleans,Louisiana, USAr, 2227–2237.

[110] Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles K. Nicholas. [n. d.]. Malware Detection by Eating a Whole EXE. In The Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans, Louisiana, USA, 268–276.

[111] Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling Relations and Their Mentions without La-beled Text. In Proc. of 2010 European Conference on Machine Learning and Knowledge Discovery in Databases(ECML/PKDD 2010). Barcelona, Spain, 148–163.

[112] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, and Phil Blunsom. 2016. Reasoning about Entailment with Neural Attention. In Proc. of the 2016 International Conference on Learning Representations (ICLR 2016).
[113] Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, and Mansour Ahmadi. 2018. Microsoft Malware Classification Challenge. CoRR abs/1802.10135 (2018). arXiv:1802.10135 http://arxiv.org/abs/1802.10135
[114] Ishai Rosenberg, Asaf Shabtai, Lior Rokach, and Yuval Elovici. 2017. Generic Black-Box End-to-End Attack against RNNs and Other API Calls Based Malware Classifiers. arXiv preprint arXiv:1707.05970 (2017).
[115] Dan Roth and Wen-tau Yih. 2004. A Linear Programming Formulation for Global Inference in Natural Language Tasks. In Proc. of the 8th Conference on Computational Natural Language Learning (CoNLL 2004). Boston, Massachusetts, 1–8.

[116] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagatingerrors. nature 323, 6088 (1986), 533.

[117] Suranjana Samanta and Sameep Mehta. 2018. Generating Adversarial Text Samples. In Proc. of the 40th EuropeanConference on IR Research (ECIR 2018). Grenoble, France, 744–749.

[118] Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Interpretable Adversarial Perturbation in InputEmbedding Space for Text. arXiv preprint arXiv:1805.02917 (2018).

[119] Dale Schuurmans and Martin Zinkevich. 2016. Deep Learning Games. In Proc. of the Annual Conference on NeuralInformation Processing Systems 2016 (NIPS 2016). Barcelona, Spain, 1678–1686.

[120] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a Toolkitfor Neural Machine Translation. In Proc. of the 15th Conference of the European Chapter of the Association forComputational Linguistics (EACL 2017), Demo. Valencia, Spain, 65–68.

[121] Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional Attention Flow forMachine Comprehension. CoRR abs/1611.01603 (2016). arXiv:1611.01603 http://arxiv.org/abs/1611.01603

[122] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. BuildingEnd-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proc. of the Thirtieth AAAIConference on Artificial Intelligence (AAAI 2016). Phoenix, Arizona, USA, 3776–3784.

[123] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and YoshuaBengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proc. of the 31stAAAI Conference on Artificial Intelligence (AAAI 2017). San Francisco, California, USA, 3295–3301.

[124] Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recog-nition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence39, 11 (2017), 2298–2304.

[125] Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. 2018. Learning Visually-Grounded Semanticsfrom Contrastive Adversarial Samples. In Proc. of the 27th International Conference on Computational Linguistics(COLING 2018). Santa Fe, New Mexico, USA, 3715–3727.

[126] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, CA, USA.

[127] Sameer Singh, Carlos Guestrin, and Marco Túlio Ribeiro. 2018. Semantically Equivalent Adversarial Rules forDebugging NLP models. In Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (ACL2018). Melbourne, Australia, 856–865.

[128] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and ChristopherPotts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proc. of the2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Seattle, Washington, USA,1631–1642.

[129] Congzheng Song and Vitaly Shmatikov. 2018. Fooling OCR Systems with Adversarial Text Images. CoRRabs/1802.05385 (2018). arXiv:1802.05385 http://arxiv.org/abs/1802.05385

[130] Mengying Sun, Fengyi Tang, Jinfeng Yi, Fei Wang, and Jiayu Zhou. 2018. Identify Susceptible Locations in Med-ical Records via Adversarial Attacks on Deep Predictive Models. In Proc. of the 24th ACM SIGKDD InternationalConference on Knowledge Discovery & Data Mining (KDD 2018). London, UK, 793–801.

[131] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proc.of the Annual Conference on Neural Information Processing Systems 2014 (NIPS 2014). Montreal, Quebec, Canada,2672–2680.

[132] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, and Joan Bruna. 2014. Intriguing properties of neural networks.In Proc. of the 2nd International Conference on Learning Representations (ICLR 2014).

[133] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations FromTree-Structured Long Short-Term Memory Networks. In Proc. of the 53rd Annual Meeting of the Association forComputational Linguistics (ACL 2015). Beijing, China, 1556–1566.


[134] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016.MovieQA: Understanding Stories in Movies through Question-Answering. In Proc. of the 2016 IEEE Conference onComputer Vision and Pattern Recognition (CVPR 2016). Las Vegas, NV, USA, 4631–4640.

[135] Tesseract. 2016. https://github.com/tesseract-ocr/.
[136] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proc. of the Annual Conference on Neural Information Processing Systems 2017 (NIPS 2017). Long Beach, CA, USA, 6000–6010.

[137] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image captiongenerator. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015). Boston, MA, USA,3156–3164.

[138] Canhui Wang, Min Zhang, Shaoping Ma, and Liyun Ru. 2008. Automatic online news issue construction in webenvironment. In Proc. of the 17th International Conference on World Wide Web (WWW 2008). Beijing, China, 457–466.

[139] Kai Wang, Boris Babenko, and Serge J. Belongie. 2011. End-to-end scene text recognition. In Proc. of the 2011 IEEEInternational Conference on Computer Vision (ICCV 2011). Barcelona, Spain, 1457–1464.

[140] Shuohang Wang and Jing Jiang. 2016. A compare-aggregate model for matching text sequences. arXiv preprintarXiv:1611.01747 (2016).

[141] Shuohang Wang and Jing Jiang. 2016. Machine Comprehension Using Match-LSTM and Answer Pointer. CoRRabs/1608.07905 (2016). arXiv:1608.07905 http://arxiv.org/abs/1608.07905

[142] Yicheng Wang and Mohit Bansal. 2018. Robust Machine Comprehension Models via Adversarial Training. In Proc.of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: HumanLanguage Technologies (NAACL-HLT). New Orleans, Louisiana, 575–581.

[143] David Warde-Farley and Ian Goodfellow. 2016. Adversarial Perturbations of Deep Neural Networks. Perturbations,Optimization, and Statistics 311 (2016).

[144] Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A Broad-Coverage Challenge Corpus for SentenceUnderstanding through Inference. In Proc. of the 2018 Conference of the North American Chapter of the Associationfor Computational Linguistics (NAACL 2018). New Orleans, Louisiana, USA, 1112–1122.

[145] Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of the2017 Conference on Empirical Methods in Natural Language Processing. 1778–1783.

[146] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun,Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser,Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang,Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean.2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.CoRR abs/1609.08144 (2016). arXiv:1609.08144 http://arxiv.org/abs/1609.08144

[147] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, andYoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proc. of the32nd International Conference on Machine Learning (ICML 2015). Lille, France, 2048–2057.

[148] Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, and Dawn Song. 2018. Fooling Vision andLanguage Models Despite Localization and Attention Mechanism. In Proc. of 2018 IEEE Conference on ComputerVision and Pattern Recognition (CVPR 2018). Salt Lake City, UT, USA, 4951–4961.

[149] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations forLearning and Inference in Knowledge Bases. In Proc. of the 3rd International Conference on Learning Representations(ICLR 2015). San Diego, CA, USA.

[150] Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A New Dataset and Method for Automatically GradingESOL Texts. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).Portland, Oregon, USA, 180–189.

[151] Michihiro Yasunaga, Jungo Kasai, and Dragomir R. Radev. 2018. Robust Multilingual Part-of-Speech Taggingvia Adversarial Training. In Proc. of the 2018 Conference of the North American Chapter of the Association forComputational Linguistics (NAACL 2018). New Orleans, Louisiana, USA, 976–986.

[152] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent Trends in Deep Learning BasedNatural Language Processing. IEEE Computational Intelligence Magazine 13, 3 (2018), 55–75.

[153] Xiaoyong Yuan, Pan He, and Xiaolin Andy Li. 2018. Adaptive Adversarial Attack on Scene Text Recognition. CoRRabs/1807.03326 (2018). arXiv:1807.03326 http://arxiv.org/abs/1807.03326

[154] Xiaoyong Yuan, Pan He, Qile Zhu, Rajendra Rana Bhat, and Xiaolin Li. 2017. Adversarial Examples: Attacks andDefenses for Deep Learning. CoRR abs/1712.07107 (2017). arXiv:1712.07107 http://arxiv.org/abs/1712.07107


[155] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification.In Proc. in Annual Conference on Neural Information Processing Systems 2015 (NIPS 2015). Montreal, Quebec,Canada, 649–657.

[156] Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2017. Generating natural adversarial examples. arXiv preprintarXiv:1710.11342 (2017).

[157] Yuke Zhu, Oliver Groth, Michael S. Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded Question Answering inImages. In Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). Las Vegas,NV, USA, 4995–5004.
