
Automatic Stance Detection Using End-to-End Memory Networks

Mitra Mohtarami1, Ramy Baly1, James Glass1, Preslav Nakov2
Lluís Màrquez3∗, Alessandro Moschitti3∗

1MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
2Qatar Computing Research Institute, HBKU, Doha, Qatar; 3Amazon

{mitra,baly,glass}@csail.mit.edu
[email protected]; {lluismv,amosch}@amazon.com

Abstract

We present a novel end-to-end memory network for stance detection, which jointly (i) predicts whether a document agrees, disagrees, discusses or is unrelated with respect to a given target claim, and also (ii) extracts snippets of evidence for that prediction. The network operates at the paragraph level and integrates convolutional and recurrent neural networks, as well as a similarity matrix, as part of the overall architecture. The experimental evaluation on the Fake News Challenge dataset shows state-of-the-art performance.

1 Introduction

Recently, an unprecedented amount of false information has been flooding the Internet, with aims ranging from affecting individual people's beliefs and decisions (Mihaylov et al., 2015a,b; Mihaylov and Nakov, 2016) to influencing major events such as political elections (Vosoughi et al., 2018). Consequently, manual fact checking has emerged with the promise to support accurate and unbiased analysis of public statements.

As manual fact checking is a very tedious task, automatic fact checking has been proposed as an alternative. This is often broken into intermediate steps in order to alleviate the task complexity. One such step is stance detection, which is also useful for human experts as a stand-alone task. The aim is to identify the relative perspective of a piece of text with respect to a claim, typically modeled using labels such as agree, disagree, discuss, and unrelated. Figure 1 shows some examples.

Here, we address the problem using a novel model based on end-to-end memory networks (Sukhbaatar et al., 2015), which incorporates convolutional and recurrent neural networks, as well as a similarity matrix.

∗ This work was carried out when the authors were scientists at the Qatar Computing Research Institute, HBKU.

Claim: Robert Plant Ripped up $800M Led Zeppelin Reunion Contract.

Stance     Snippet
agree      Led Zeppelin's Robert Plant turned down £500m to reform supergroup...
disagree   Robert Plant's publicist has described as "rubbish" a Daily Mirror report that he rejected a £500m Led Zeppelin reunion...
discuss    Robert Plant reportedly tore up an $800 million Led Zeppelin reunion deal...
unrelated  Richard Branson's Virgin Galactic is set to launch SpaceShipTwo today...

Figure 1: Examples of snippets of text and their stances with respect to a given claim.

Our model jointly addresses the problems of predicting the stance of a text document with respect to a given claim, and of extracting relevant text snippets as support for the prediction of the model. We further introduce a similarity matrix, which we use at inference time in order to improve the extraction of relevant snippets.

The experimental results on the Fake News Challenge benchmark dataset show that our model, which is very feature-light, performs similarly to the state of the art, which is achieved by more complex systems. Our contributions can be summarized as follows: (i) We apply a novel memory network model enhanced with CNN and LSTM networks for stance detection. (ii) We further propose a novel extension of the general architecture based on a similarity-based matrix, which we use at inference time, and we show that this extension offers sizable performance gains. (iii) Finally, we show that our model is capable of extracting meaningful snippets from the input text document, which is useful not only for stance detection, but more importantly can be useful for human experts who need to decide on the factuality of a given claim.


2 Model

Long-term memory is necessary in order to determine the stance of a long document with respect to a claim, as relevant parts of a document (paragraphs or text snippets) can indicate the perspective of a document with respect to a claim. Memory networks were designed to remember past information (Sukhbaatar et al., 2015), and they can be particularly well-suited for stance detection since they can use a variety of inference strategies alongside their memory component.

In this section, we present a novel memory network (MN) for stance detection. It contains a new inference component that incorporates a similarity matrix to extract, with better accuracy, textual snippets that are relevant to the input claims.

2.1 Overview of the Network

A memory network is a 5-tuple {M, I, G, O, R}, where the memory M is a sequence of objects or representations, the input I is a component that maps the input to its representation, the generalization component G (Sukhbaatar et al., 2015) updates the memory with respect to new input, the output O generates an output for each new input and the current memory state, and finally, the response R converts the output into a desired response format, e.g., a textual response or an action. These components can potentially use many different machine learning models.

Our new memory network for stance detection is a 6-tuple {M, I, F, G, O, R}, where F represents the new inference component. It takes an input document d as evidence and a textual statement s as a claim, and converts them into their corresponding representations in the input I. Then, it passes them to the memory M. Next, the relevant parts of the input are identified in F, and afterwards they are used by G to update the memory. Finally, O generates an output from the updated memory and converts it to a desired response format with R. The network architecture is depicted in Figure 2. We describe the components below.

2.2 Input Representation Component

The input to the stance detection algorithm is a document d and a textual statement s as a claim; see lines 2 and 3 in Table 1. Each d is segmented into paragraphs x_j of varied lengths, where each x_j is considered a potential piece of evidence for stance detection.

1  Inputs:
2    (1) A document (d) as a set of evidence (x_j)
3    (2) A textual statement containing a claim (s)
4  Outputs:
5    (1) Predicting the relative perspective (or stance) of a pair (d, s) to a claim as agree, disagree, discuss or unrelated
6  Inference outputs:
7    (2) Top K evidence x_j with their similarity scores
8    (3) Top K snippets of x_j with their similarity scores
9  Memory Network Model:
10   1. Input memory representation (I):
11      d → (X, W, E)
12      (X, W, E) --TimeDistributed(LSTM)--> {m_1, ..., m_n}
13      (X, W, E) --TimeDistributed(CNN)--> {c_1, ..., c_n}
14      s --LSTM, CNN--> s_lstm, s_cnn
15   2. Memory (M), updating memory (G) and inference (F):
16      m_j = m_j ⊙ P_tfidf^j,  ∀j
17      P_lstm^j = s_lstm^⊤ × M × m_j,  ∀j
18      c_j = c_j ⊙ P_lstm^j,  ∀j
19      P_cnn^j = s_cnn^⊤ × M' × c_j,  ∀j
20   3. Output memory representation (O):
21      o = [mean({c_j}); [max({P_cnn^j}); mean({P_cnn^j})]; [max({P_lstm^j}); mean({P_lstm^j})]; [max({P_tfidf^j}); mean({P_tfidf^j})]]
22   4. Generating the final prediction (R):
23      [o; s_lstm; s_cnn] --MLP--> δ
24   5. Inference (F) outputs:
25      P_cnn^j → {a set of evidence} + {similarity scores}
26      M' → {snippets} + {similarity scores}

Table 1: Summary of our Memory Network algorithm for stance detection.

Indeed, a paragraph usually represents a coherent argument, unified under one or more inter-related topics. The input component in our model converts each d into a set of potential pieces of evidence in a three-dimensional (3D) tensor space, as shown below (see line 11 in Table 1):

d = (X, W, E)    (1)

where X = {x_1, ..., x_n} is a set of paragraphs considered as potential pieces of evidence, such that each x_j is represented by a set of words W = {w_1, ..., w_v} (a global vocabulary of size v) and a set of neural representations E = {e_1, ..., e_v} for the words in W. This 3D space is illustrated as a cube in Figure 2.

Each x_j is encoded from the 3D space into a semantic representation at the input component using a Long Short-Term Memory (LSTM) network. The lower-left component in Figure 2 shows our LSTM network, which operates on our input as follows (see also line 12 in Table 1):

(X, W, E) --TimeDistributed(LSTM)--> {m_1, ..., m_n}    (2)

where m_j is the LSTM representation of x_j, and TimeDistributed() indicates a wrapper that enables training the LSTM over all pieces of evidence by applying the same LSTM model to each time-step of a 3D input tensor, i.e., (X, W, E).

Figure 2: The architecture of our Memory Network model for stance detection.

While LSTM networks are designed to effectively capture and memorize their inputs (Tan et al., 2016), Convolutional Neural Networks (CNNs) emphasize the local interactions between the words in the input word sequence, which is important for obtaining an effective representation. We use a CNN to encode each x_j into its representation c_j, as shown in Equation 3 (see line 13 in Table 1):

(X, W, E) --TimeDistributed(CNN)--> {c_1, ..., c_n}    (3)

The top-left of Figure 2 shows that this representation is passed as a new input to the component M of our memory network.

We keep track of the computed n-grams from the CNN, so that we can use them later in the inference and response components (see Sections 2.3 and 2.6). For this purpose, we use a Maxout layer (Goodfellow et al., 2013) to take the maximum across k affine feature maps computed by the CNN, i.e., pooling across channels. Previous work has investigated combining convolutional and recurrent representations by feeding the output of one network into the other as input (Tan et al., 2016; Donahue et al., 2015; Zuo et al., 2015; Sainath et al., 2015). In contrast, we feed their individual outputs into our memory network separately, and let the network decide which representation helps the target task better. We show the effectiveness of this choice below.

Similarly, we convert each input claim s to its representation using corresponding LSTM and CNN networks, as follows:

s --LSTM, CNN--> s_lstm, s_cnn    (4)

where s_lstm and s_cnn are the representations of s computed using the LSTM and CNN networks, respectively. Note that these are separate networks with different parameters from those used to encode the pieces of evidence.

Lines 10–14 of Table 1 summarize the above steps for representing I in our memory network: we encode each input document d into a set of pieces of evidence {x_j}, ∀j, and compute LSTM and CNN representations, m_j and c_j, respectively, for each x_j, as well as LSTM and CNN representations, s_lstm and s_cnn, for each claim s.
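To make the encoding step concrete, below is a minimal sketch (not the authors' released code) of the input component I in Keras, using the hyper-parameters from Section 3.2 (100-dimensional embeddings, 100 LSTM units, 100 CNN feature maps with filter width 5, and p=9 paragraphs). The variable names and the maximum paragraph length are illustrative assumptions, and global max pooling stands in for the Maxout layer described above.

```python
# A minimal sketch of the input component I; shapes follow Section 3.2.
# MAX_WORDS and the layer names are illustrative assumptions.
from tensorflow.keras import layers, Input, Model

P, MAX_WORDS, EMB_DIM = 9, 100, 100          # paragraphs, words per paragraph, embedding size

# Document tensor (X, W, E): p paragraphs, each a sequence of word embeddings.
doc_in = Input(shape=(P, MAX_WORDS, EMB_DIM), name="document")

# m_j: one LSTM applied to every paragraph via the TimeDistributed wrapper (Eq. 2).
m = layers.TimeDistributed(layers.LSTM(100), name="evidence_lstm")(doc_in)    # (batch, P, 100)

# c_j: one CNN applied to every paragraph (Eq. 3); pooling stands in for Maxout.
cnn = layers.TimeDistributed(
    layers.Conv1D(filters=100, kernel_size=5, activation="relu"),
    name="evidence_cnn")(doc_in)
c = layers.TimeDistributed(layers.GlobalMaxPooling1D())(cnn)                   # (batch, P, 100)

# Claim encoders: separate LSTM and CNN with their own parameters (Eq. 4).
claim_in = Input(shape=(MAX_WORDS, EMB_DIM), name="claim")
s_lstm = layers.LSTM(100, name="claim_lstm")(claim_in)                         # (batch, 100)
s_cnn = layers.GlobalMaxPooling1D()(
    layers.Conv1D(100, 5, activation="relu", name="claim_cnn")(claim_in))      # (batch, 100)

encoder = Model([doc_in, claim_in], [m, c, s_lstm, s_cnn])
```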

2.3 Inference Component

The resulting representations are used to compute the semantic similarity between claims and pieces of evidence. We define the similarity P_lstm^j between s and x_j as follows (see also line 17 in Table 1):

P_lstm^j = s_lstm^⊤ × M × m_j,  ∀j    (5)

where s_lstm ∈ R^q and m_j ∈ R^d are the LSTM representations of s and x_j, respectively, and M ∈ R^{q×d} is a similarity matrix capturing their similarity. For this purpose, M maps s and x_j into the same space, as shown in Figure 3. M is a set of q × d parameters of the network, which are optimized during training.

In a similar fashion, we compute the similarity P_cnn^j between x_j and s using the CNN representations, as follows (see line 19 of Table 1):

P_cnn^j = s_cnn^⊤ × M' × c_j,  ∀j    (6)

where s_cnn ∈ R^{q'} and c_j ∈ R^{d'} are the representations of s and x_j obtained with the CNN, respectively. The similarity matrix M' ∈ R^{q'×d'} is a set of q' × d' parameters of the network and is optimized during training. P_lstm^j and P_cnn^j indicate the claim–evidence similarity vectors computed based on the LSTM and on the CNN representations of s and x_j, respectively.

Figure 3: Matching a claim s and a piece of evidence x_j using a similarity matrix M. Here, s_lstm and s_cnn are the LSTM and CNN representations of s, whereas m_j and c_j are the LSTM and CNN representations of x_j.

The rationale behind using the similarity matrix is that in our memory network model, as Figure 3 shows, we look for a transformation of the input claim s such that s' = M × s, in order to obtain the closest facts to the claim.

In fact, the relevant parts of the input document with respect to the input claim can be captured at different levels, e.g., using M' at the n-gram level, or using the claim–evidence similarities P_lstm^j or P_cnn^j, ∀j, at the paragraph level. We note that (i) P_lstm^j uses the LSTM to take word order and long-range dependencies into account, and (ii) P_cnn^j exploits the CNN to take n-grams and local dependencies into account, as explained in Sections 2.2 and 2.3. Additionally, we compute another semantic similarity vector, P_tfidf^j, by applying cosine similarity between the TF.IDF (Spärck Jones, 2004) representations of x_j and s. This is particularly useful for stance detection, as it can help detect the unrelated pieces of evidence.
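The bilinear similarities in Equations 5 and 6 reduce to a simple matrix product once the encodings are available. The following numpy sketch illustrates the computation; M and M_prime stand in for the trainable similarity matrices, and the random values (and the choice of 3 paragraphs) are used purely to illustrate the shapes.

```python
# A minimal numpy sketch of Eqs. 5 and 6; random values only illustrate shapes.
import numpy as np

n, q, d = 3, 100, 100                      # paragraphs, claim dim, evidence dim
rng = np.random.default_rng(0)
m = rng.normal(size=(n, d))                # LSTM evidence representations m_j
c = rng.normal(size=(n, d))                # CNN evidence representations c_j
s_lstm, s_cnn = rng.normal(size=q), rng.normal(size=q)
M, M_prime = rng.normal(size=(q, d)), rng.normal(size=(q, d))

# Eq. 5: P_lstm^j = s_lstm^T M m_j. The transformed claim s' = s_lstm^T M is
# matched against every paragraph, yielding one similarity score per x_j.
P_lstm = m @ (s_lstm @ M)                  # shape (n,)

# Eq. 6: the same bilinear form on the CNN side, with its own matrix M'.
P_cnn = c @ (s_cnn @ M_prime)              # shape (n,)
```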

2.4 Memory and Generalization Components

The information flow and updates in the memory are as follows: first, the representation vectors {m_j}, ∀j, are passed to the memory and updated using the claim–evidence similarity vector {P_tfidf^j}:

m_j = m_j ⊙ P_tfidf^j,  ∀j    (7)

The goal is to filter out the most unrelated pieces of evidence. The updated m_j, in conjunction with s_lstm, is used by the inference component F to compute {P_lstm^j}, as explained in Section 2.3. Then, {P_lstm^j} is used to update the new input set {c_j}, ∀j, in the memory:

c_j = c_j ⊙ P_lstm^j,  ∀j    (8)

Finally, the updated c_j, in conjunction with s_cnn, is used to compute P_cnn^j, as explained in Section 2.3.
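Continuing the sketch from Section 2.3, the memory and generalization steps amount to elementwise re-weighting of the stored representations. The paragraph texts and claim below are hypothetical placeholders, and the TF.IDF similarity is computed with scikit-learn purely for illustration.

```python
# A minimal sketch of Eqs. 7 and 8, reusing m, c, s_lstm, s_cnn, M, M_prime
# from the previous snippet. The texts are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = ["first paragraph text", "second paragraph text", "third paragraph text"]
claim = "the claim text"

# P_tfidf^j: cosine similarity between TF.IDF vectors of each x_j and s.
vec = TfidfVectorizer().fit(paragraphs + [claim])
P_tfidf = cosine_similarity(vec.transform(paragraphs),
                            vec.transform([claim])).ravel()      # shape (n,)

m = m * P_tfidf[:, None]          # Eq. 7: m_j <- m_j * P_tfidf^j (filter unrelated evidence)
P_lstm = m @ (s_lstm @ M)         # recompute Eq. 5 on the updated memory
c = c * P_lstm[:, None]           # Eq. 8: c_j <- c_j * P_lstm^j
P_cnn = c @ (s_cnn @ M_prime)     # Eq. 6 on the updated c_j
```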

2.5 Output Representation Component

In memory networks, the memory output depends on the final goal, which, in our case, is to detect the relative perspective of a document with respect to a claim. For this purpose, we apply the following equation:

o = [mean({c_j}); [max({P_cnn^j}); mean({P_cnn^j})]; [max({P_lstm^j}); mean({P_lstm^j})]; [max({P_tfidf^j}); mean({P_tfidf^j})]]    (9)

where mean({c_j}) is the average vector of the c_j representations.

We then compute the maximum and the average similarity between each piece of evidence and the claim using P_tfidf^j, P_lstm^j and P_cnn^j, which are computed for each piece of evidence and claim in the inference component F. The maximum similarity identifies the part of the document x_j that is most similar to the claim, while the average similarity measures the overall similarity between the document and the claim.
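In the running sketch, assembling the memory output of Equation 9 is a single concatenation of the per-paragraph statistics:

```python
# A minimal sketch of Eq. 9, reusing c, P_cnn, P_lstm and P_tfidf from above.
o = np.concatenate([
    c.mean(axis=0),                        # mean({c_j})
    [P_cnn.max(), P_cnn.mean()],           # max / mean over {P_cnn^j}
    [P_lstm.max(), P_lstm.mean()],         # max / mean over {P_lstm^j}
    [P_tfidf.max(), P_tfidf.mean()],       # max / mean over {P_tfidf^j}
])
```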

2.6 Response and Output Generation

This component computes the final stance of a document with respect to a claim. For this purpose, the concatenation of the vectors o, s_lstm and s_cnn is fed into a Multi-Layer Perceptron (MLP), where a softmax predicts the stance of the document with respect to the claim, as shown below (see also lines 22–23 in Table 1):

[o; s_lstm; s_cnn] --MLP--> δ    (10)

where δ is a softmax function. In addition to the resulting stance, we extract snippets from the input document that best indicate the perspective of the document with respect to the claim. For this purpose, we use P_lstm^j, P_cnn^j and M', as explained in Section 2.3 (see also lines 24–26 of Table 1).

The overall model is shown in Figure 2, and a summary of the model is presented in Table 1. All model parameters, including those of (i) the CNN and LSTM in I, (ii) the similarity matrices M and M' in F, and (iii) the MLP in R, are jointly learned during the training process.
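A minimal sketch of the response component R follows: the concatenated vector [o; s_lstm; s_cnn] is passed to an MLP whose softmax layer produces the four-way stance distribution. The hidden-layer size is an illustrative assumption; the optimizer and loss follow Section 3.2.

```python
# A minimal sketch of Eq. 10, reusing o, s_lstm and s_cnn from above.
import tensorflow as tf

features = np.concatenate([o, s_lstm, s_cnn])[None, :]            # one example

mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),                 # hidden size is illustrative
    tf.keras.layers.Dense(4, activation="softmax"),                # δ over agree/disagree/discuss/unrelated
])
mlp.compile(optimizer="adam", loss="categorical_crossentropy")     # settings from Section 3.2
stance_probs = mlp(features)                                       # shape (1, 4)
```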

3 Experiments and Evaluation

3.1 Data

We use the dataset provided by the Fake News Challenge,¹ where each example consists of a claim–document pair with one of the following relationships: agree (the document agrees with the claim), disagree (the document disagrees with the claim), discuss (the document discusses the same topic as the claim, but does not take a stance with respect to it), or unrelated (the document discusses a different topic). The data includes a total of 75.4K claim–document pairs, which link 2.5K unique articles with 2.5K unique claims, i.e., each claim is associated with 29.8 articles on average.

¹ Available at www.fakenewschallenge.org

3.2 Settings

We use 100-dimensional word embeddings from GloVe (Pennington et al., 2014), pre-trained on two billion tweets. We use Adam as the optimizer and categorical cross-entropy as the loss function. We further use 100-dimensional units for the LSTM embeddings, and 100 feature maps with a filter width of 5 for the CNN. We consider the first p=9 paragraphs of each document, where p is the median number of paragraphs.

We optimize the hyper-parameters of the models using the same validation dataset (20% of the training data). Finally, as the data is largely imbalanced towards the unrelated class, during training we randomly select an equal number of instances from each class for each epoch.
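The class balancing described above can be implemented as a per-epoch sampler. Below is a minimal sketch under the assumption that the training labels are available as a numpy array; function and variable names are illustrative.

```python
# A minimal sketch of per-epoch class balancing: sample z instances from each
# class, where z is the size of the smallest class (see also Section 4.1).
import numpy as np

def balanced_epoch_indices(labels, rng):
    """Return indices of a class-balanced sample for one training epoch."""
    classes, counts = np.unique(labels, return_counts=True)
    z = counts.min()                                   # size of the smallest class
    picks = [rng.choice(np.where(labels == c)[0], size=z, replace=False)
             for c in classes]
    idx = np.concatenate(picks)
    rng.shuffle(idx)
    return idx

# Usage (hypothetical): idx = balanced_epoch_indices(train_labels, np.random.default_rng(epoch))
```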

3.3 Evaluation Measures

We use the following evaluation measures:

Accuracy: The number of correctly classified examples divided by the total number of examples. It is equivalent to micro-averaged F1.

Macro-F1: We calculate F1 for each class, and then average across all classes.

Weighted Accuracy: This is a weighted, two-level scoring scheme, which is applied to each test example. First, if the example is from the unrelated class and the model correctly predicts it, the score is incremented by 0.25; otherwise, if the example is related and the model predicts agree, disagree, or discuss, the score is incremented by 0.25. Second, the score is further incremented by 0.75 for each related example for which the model predicts the correct label: agree, disagree, or discuss. Finally, the score is normalized by dividing it by the total number of test examples. The rationale behind this metric is that the binary related/unrelated classification task is expected to be much easier, while also being arguably less relevant to fake news detection, than the actual stance detection task, which aims to further classify the relevant instances as agree, disagree, or discuss. Therefore, the weighted accuracy metric gives less weight to the former distinction and more weight to the latter one.
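For concreteness, the weighted accuracy described above can be computed as follows (a minimal sketch; gold and pred are hypothetical lists of stance labels):

```python
# Weighted accuracy: 0.25 for the related/unrelated decision, plus 0.75 for the
# correct fine-grained stance on related examples, normalized by the test size.
def weighted_accuracy(gold, pred):
    related = {"agree", "disagree", "discuss"}
    score = 0.0
    for g, p in zip(gold, pred):
        if g == "unrelated" and p == "unrelated":
            score += 0.25                      # correct related/unrelated decision
        elif g in related and p in related:
            score += 0.25                      # correct related/unrelated decision
            if g == p:
                score += 0.75                  # correct fine-grained stance
    return score / len(gold)

# Example: weighted_accuracy(["agree", "unrelated"], ["discuss", "unrelated"]) == 0.25
```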

3.4 Baselines

Given the imbalanced nature of our data, we use two baselines, in which we label all testing examples with the same label: (a) unrelated and (b) discuss. The former is the majority-class baseline, which is a reasonable baseline for Accuracy and Macro-F1, while the latter is a potentially better baseline for Weighted Accuracy.

We further use CNN and LSTM models, as well as combinations thereof, as baselines, since they form components of our model, and also because they yield state-of-the-art results for text, image, and video classification (Tan et al., 2016; Donahue et al., 2015; Zuo et al., 2015; Sainath et al., 2015).

Finally, we include the official baseline from the challenge, which is a Gradient Boosting classifier with word and n-gram overlap features, as well as indicators for refutation and polarity.

3.5 Our Models

sMemNN: This is our model presented in Figure 2. Note that unlike the CNN+LSTM and LSTM+CNN baselines above, which feed the output of one network into the other, the sMemNN model feeds the individual outputs of both the CNN and the LSTM networks into the memory network, and lets it decide how much to rely on each of them. This design also facilitates reasoning about and explaining the model's predictions, as we discuss in more detail below.

sMemNN (dotProduct): This is a version of sMemNN where the similarity matrices are replaced by a dot product between the representations of the claims and of the evidence. For this purpose, we first project the claim representation to a dense layer that has the same size as the representation of each piece of evidence, and then we compute the dot product between the resulting representation and the representation of the evidence.

Methods                    Total Params  Trainable Params  Weighted Acc.  Macro-F1  Accuracy
1.  All-unrelated          –             –                 39.37          20.96     72.20
2.  All-discuss            –             –                 43.89           7.47     17.57
3.  CNN                    2.7M          188.7K            40.66          24.44     41.53
4.  LSTM                   2.8M          261.3K            57.23          37.23     60.21
5.  CNN+LSTM               4.2M          361.5K            42.02          27.36     48.54
6.  LSTM+CNN               2.8M          281.5K            60.21          40.33     65.36
7.  Gradient Boosting      –             –                 75.20          46.13     86.32
8.  sMemNN (dotProduct)    5.4M          275.2K            75.13          50.21     83.85
9.  sMemNN                 5.5M          377.5K            78.97          56.75     87.27
10. sMemNN (with TF)       110M          105M              81.23          56.88     88.57

Table 2: Evaluation results on the test data.

sMemNN (with TF): Since our LSTM and CNN networks only use a limited number of starting paragraphs² for an input document, we enrich our model with the BOW representations of documents and claims, as well as their TF.IDF-based cosine similarity. These vectors are concatenated with the memory outputs (Section 2.5) and passed to the R component (Section 2.6) of sMemNN. We expect these BOW vectors to provide useful additional information.

² Due to the long length of some documents, it is impractical to consider all paragraphs when training the LSTM and CNN.

3.6 Results

Table 2 reports the performance of all models on the test dataset. The All-unrelated and All-discuss baselines perform poorly across the evaluation measures, with the exception that All-unrelated achieves high accuracy, which is due to unrelated being by far the dominant class in the dataset.

Next, we can see that the LSTM consistently outperforms the CNN across all evaluation measures. Although the larger number of parameters of the LSTM may play a role, we believe that its superiority comes from its ability to remember previously observed relevant pieces of text.

Next, we see systematic improvements for the combinations of CNN and LSTM: CNN+LSTM is better than CNN alone, and LSTM+CNN is better than LSTM alone. The better performance is achieved by LSTM+CNN, that is, when claims and evidence are first processed by an LSTM network and then fed into a CNN.

The Gradient Boosting model achieves a sizable improvement over the above baseline neural models. However, we should note that these neural models do not use the rich hand-crafted features that were used in the Gradient Boosting model.

Row 9 shows the results for our memory network model (sMemNN), which consistently outperforms all baseline models across all evaluation metrics, achieving 10.62 and 3.77 points of absolute improvement in terms of Macro-F1 and Weighted Accuracy, respectively, over the best baseline (Gradient Boosting). We believe that this is due to the memory network's ability to capture good text snippets. As we will see below, these snippets are also useful for explaining the model's predictions. Comparing row 9 to row 8, we can see the importance of our proposed similarity matrix: replacing that matrix with a simple dot product hurts the performance of the model considerably across all evaluation measures, lowering it to the level of the Gradient Boosting model.

Finally, row 10 shows the results for our memory network model enriched with a BOW representation. As we expected, it outperforms sMemNN, probably because it can capture useful information from paragraphs beyond the starting few.

To put the results of sMemNN in perspective, we should mention that the best system at the Fake News Challenge achieved a Macro-F1 of 57.79, which is not significantly different from the performance of our full model at the 0.05 significance level (p-value=0.53). Yet, it is an ensemble combining the feature-rich Gradient Boosting system with neural networks.

Further analysis of the output of the different systems (e.g., the confusion matrices) reveals the following general trends: (i) the unrelated examples are easy to detect, and most models show high performance for this class, (ii) the agree and the disagree examples are often mislabeled as discuss by the baselines, and (iii) the disagree examples are the most difficult ones for all models, probably because they represent by far the smallest class.

Claim 1: man saved from bear attack - thanks to his justin bieber ringtone

Evidence Id   P_cnn^j   Evidence Snippet
2069-3        0.89      ... fishing in the yakutia republic, russia, igor vorozhbitsyn is lucky to be alive after his justin bieber ringtone, baby, scared off a bear that was attacking him [0.41] ...
2069-7        1.0       ... but as the bear clawed vorozhbitsyn's face and back his mobile phone rang, the ringtone selected was justin bieber's hit song baby. rightly startled [1.00], the bear retreated back into [0.39] the forest ...
true label: agree; predicted label: agree

Claim 2: 50ft crustacean, dubbed crabzilla, photographed lurking beneath the waters in whitstable

Evidence Id   P_cnn^j   Evidence Snippet
24835-1       0.0046    ... a marine biologist has killed off claims [-0.0008] that a giant crab is [0.0033] living on the kent coast - insisting the image is probably a well-doctored hoax [0.0012] ...
24835-7       -0.0008   ... i don't know what the currents are like around that harbour or what sort of they might produce in the sand, but i think it's more conceivable that someone is playing [0.0007] about with the photo ...
true label: disagree; predicted label: disagree

Table 3: Examples of highly ranked snippets of evidence for an input claim, which were automatically extracted by our inference component for claim–document pairs. The P_cnn^j column shows the similarity between the claim and a piece of evidence, and the bracketed values show the similarity between the claim and the corresponding evidence snippet.

Figure 4: Effect of data coverage. The x-axis shows the training epochs; the y-axis shows the fraction of the training data observed so far (coverage) for each class, together with the validation loss (val_loss).

4 Discussion

4.1 Training Data Coverage

As discussed previously, we balance the data at each training iteration by randomly selecting z instances from each of the four target classes, where z is the size of the class with the minimum number of training instances. In this experiment, we investigate what proportion of the training data actually gets used when following this sampling procedure. For this purpose, at each training iteration, we report the proportion of the training instances from each class that have been used so far, either at the current or at any of the previous iterations.

As Figure 4 shows, our random data sampling procedure eventually used almost all training examples. Since the disagree class was the smallest, its examples remained fully covered throughout the process. Moreover, almost all other related examples, i.e., agree and discuss, were observed during training, as well as a large fraction of the dominating unrelated examples. Note that the model achieved its best (lowest) loss on the validation dataset at iteration 31, when almost all related instances had already been observed. At that point, the corresponding fraction for the unrelated pairs was around 50%, i.e., a considerable number of the unrelated instances were not really needed.

4.2 Explainability

A major advantage of our model, compared to the baselines and to most related work, is that it can explain its predictions: as we explained in Section 2.3, our inference component predicts the similarity between each piece of evidence x_j and the claim s at the n-gram level using the claim–evidence similarity vector P_cnn^j.

Table 3 shows examples of two claims and the snippets extracted as evidence. The P_cnn^j column shows the overall similarity between the evidence and the corresponding claim, as computed by the inference component of our model. The highlighted texts are the snippets with the highest similarity to the claim (the value is shown next to each snippet), as extracted by the inference component.

Note that the snippets are of fixed length, namely 5-grams, but in the case of consecutive n-grams with similar scores, we combine them into a single snippet and report the average value, e.g., see the snippet for evidence 2069-3. The lower half of Table 3 shows an example where the similarity values associated with the snippets are either very small or negative, e.g., see the value for "biologist has killed off claims". In all cases, the model could accurately predict the stance of these pieces of evidence with respect to the corresponding claims.
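The merging of consecutive n-grams can be sketched as follows. The score tolerance used to decide that two neighboring windows have "similar scores" is an illustrative assumption, as the text does not specify a threshold; tokens and scores are hypothetical inputs (one score per 5-gram window starting at each token position).

```python
# A minimal sketch of merging consecutive n-gram windows with similar scores
# into a single snippet whose score is the average (tol is an assumption).
def merge_snippets(tokens, scores, n=5, tol=0.05):
    """Return (snippet text, average score) pairs for merged window groups."""
    snippets, start, acc = [], 0, [scores[0]]
    for i in range(1, len(scores)):
        if abs(scores[i] - scores[i - 1]) <= tol:
            acc.append(scores[i])                       # extend the current group
        else:
            snippets.append((" ".join(tokens[start:i - 1 + n]), sum(acc) / len(acc)))
            start, acc = i, [scores[i]]
    snippets.append((" ".join(tokens[start:len(scores) - 1 + n]), sum(acc) / len(acc)))
    return snippets
```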

Next, we conducted an experiment to quantify the performance of our memory network at explaining its predictions: we randomly sampled 100 agree/disagree claim–document examples from our gold data, and we manually evaluated the top five pieces of evidence that our model provided. In 76 cases, the model correctly classified the agree/disagree examples and provided arguably adequate snippets.

Figure 5(a) shows the performance of our model at explaining its predictions when each supporting/opposing piece of evidence is an n-gram snippet of fixed length (n = 5), for the agree and disagree classes and their combination, at the top-k ranks, k = {1, ..., 5}. It achieved precision of 0.28, 0.32, 0.35, 0.25, and 0.33 at ranks 1–5. Moreover, we found that it could accurately identify, as part of the extracted n-grams, key phrases such as "officials declared the video", "according to previous reports", "believed will come", and "president in his tweets" as supporting pieces of evidence, and "proved a hoax", "shot down a cnn report", and "would be skeptical" as opposing pieces of evidence.

Note that the above low precision is mainly due to the unsupervised nature of this task, as no gold snippets supporting the document's stance are available for training in the FNC dataset.³ Furthermore, the evaluation in Figure 5(a) was at the n-gram level. However, if we conduct a more coarse-grained evaluation, where we combine consecutive n-grams with similar scores into a single snippet, the precision for these new snippets improves to 0.4, 0.38, 0.42, 0.38, and 0.42 at ranks 1–5, as Figure 5(b) shows. If we further extend the evaluation to the sentence level, the precision jumps to 0.6, 0.58, 0.55, 0.62, and 0.57 at ranks 1–5, as we can see in Figure 5(c).

³ Some other recent datasets, to be presented at this same HLT-NAACL'2018 conference, do have such gold evidence annotations (Baly et al., 2018; Thorne et al., 2018).

5 Related Work

While stance detection is an interesting task in its own right, e.g., for media monitoring, it is also an important component for fact checking and veracity inference.⁴ Automatic fact checking was envisioned by Vlachos and Riedel (2014) as a multi-step process that (i) identifies check-worthy statements (Hassan et al., 2015; Gencheva et al., 2017; Jaradat et al., 2018), (ii) generates questions to be asked about these statements (Karadzhov et al., 2017), (iii) retrieves relevant information to create a knowledge base (Shiralkar et al., 2017), and (iv) infers the veracity of these statements, e.g., using text analysis (Banerjee and Han, 2009; Castillo et al., 2011; Rashkin et al., 2017) or information from external sources (Karadzhov et al., 2017; Popat et al., 2017).

There have been some nuances in the way researchers have defined the stance detection task. SemEval-2016 Task 6 (Mohammad et al., 2016) targets stances with respect to some target proposition, e.g., entities, concepts or events, as in-favor, against, or neither. The winning model in the task was based on transfer learning: a recurrent neural network trained on a large Twitter corpus was used to predict task-relevant hashtags and to initialize a second recurrent neural network trained on the provided dataset for stance prediction (Zarrella and Marsh, 2016). Subsequently, Zubiaga et al. (2016) detected the stance of tweets toward rumors and hot topics using linear-chain conditional random fields (CRFs) and tree CRFs that analyze tweets based on their position in tree-like conversational threads.

Most commonly, stance detection is defined with respect to a claim, e.g., as in the 2017 Fake News Challenge. The best system was an ensemble of gradient-boosted decision trees with rich features and CNNs (Baird et al., 2017). The second system was a multi-layer neural network with similarity features, word n-grams, and latent semantic analysis (Hanselowski et al., 2017). The third one was a neural network with similarity features (Riedel et al., 2017).

Unlike the above work, we use a feature-light memory network that jointly infers the stance and highlights relevant snippets of evidence.

⁴ Yet, stance detection and fact checking are typically supported by separate datasets. Two notable upcoming exceptions, both appearing at this HLT-NAACL'2018, are (Thorne et al., 2018) for English and (Baly et al., 2018) for Arabic.

Figure 5: Prediction explainability. Sub-figures (a)-(c) show the precision of our model at explaining its predictions when the pieces of evidence are (a) fixed-length n-grams (n = 5), (b) combinations of several consecutive n-grams with similar scores, or (c) the entire sentence, if it includes at least one extracted n-gram snippet.

6 Conclusion

We studied the problem of stance detection, which aims to predict whether a document supports, challenges, or just discusses a given claim. The nature of the task clearly shows that, in order to go beyond simple matching between the stance (short text) and the evidence (longer text, e.g., an entire document), a machine learning model needs to focus on the relevant paragraphs of the evidence. Moreover, in order to understand whether a paragraph supports a claim, there is a need to refer to information available in other paragraphs. CNNs and LSTMs are not well-suited for this task, as they cannot model complex dependencies such as semantic relationships with respect to entire previous paragraphs. In contrast, memory networks are designed exactly to remember previous information. However, given the large size of documents and paragraphs, basic memory networks do not handle irrelevant and noisy information well, which we confirmed in our experimental results.

Thus, we proposed a novel extension of the basic memory network, which is based on a similarity matrix and a stance filtering component, which we apply at inference time, and we have shown that this extension offers sizable performance gains, making memory networks competitive. Moreover, our model can extract meaningful snippets from documents that can explain the factuality of a given claim.

In future work, we plan to extend the inference component to select an optimal set of explanations for each prediction, and to explain the model as a whole, not only at the instance level.

Acknowledgment

We would like to thank the members of the MIT Spoken Language Systems group and the anonymous reviewers for their helpful comments.

This research was carried out in collaboration between the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Qatar Computing Research Institute (QCRI), HBKU.

References

Sean Baird, Doug Sibley, and Yuxi Pan. 2017. Talos targets disinformation with fake news challenge victory. https://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html.

Ramy Baly, Mitra Mohtarami, James Glass, Lluís Màrquez, Alessandro Moschitti, and Preslav Nakov. 2018. Integrating stance detection and fact checking in a unified corpus. In Proceedings of HLT-NAACL. New Orleans, LA, USA.

Protima Banerjee and Hyoil Han. 2009. Answer credibility: A language modeling approach to answer validation. In Proceedings of HLT-NAACL. Boulder, CO, USA, pages 157–160.

Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In Proceedings of WWW. Hyderabad, India, pages 675–684.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and Kate Saenko. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of CVPR. Boston, MA, USA, pages 2625–2634.

Pepa Gencheva, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, and Ivan Koychev. 2017. A context-aware approach for detecting worth-checking claims in political debates. In Proceedings of RANLP. Varna, Bulgaria, pages 267–276.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout networks. In Proceedings of ICML. Atlanta, GA, USA, pages 1319–1327.

Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, and Felix Caspelherr. 2017. Team Athene on the fake news challenge. https://medium.com/@andre134679/team-athene-on-the-fake-news-challenge-28a5cf5e017b.

Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2015. Detecting check-worthy factual claims in presidential debates. In Proceedings of CIKM. Melbourne, Australia, pages 1835–1838.

Israa Jaradat, Pepa Gencheva, Alberto Barrón-Cedeño, Lluís Màrquez, and Preslav Nakov. 2018. ClaimRank: Detecting check-worthy claims in Arabic and English. In Proceedings of HLT-NAACL. New Orleans, LA, USA.

Georgi Karadzhov, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, and Ivan Koychev. 2017. Fully automated fact checking using external sources. In Proceedings of RANLP. Varna, Bulgaria, pages 344–353.

Todor Mihaylov, Georgi Georgiev, and Preslav Nakov. 2015a. Finding opinion manipulation trolls in news community forums. In Proceedings of CoNLL. Beijing, China, pages 310–314.

Todor Mihaylov, Ivan Koychev, Georgi Georgiev, and Preslav Nakov. 2015b. Exposing paid opinion manipulation trolls. In Proceedings of RANLP. Hissar, Bulgaria, pages 443–450.

Todor Mihaylov and Preslav Nakov. 2016. Hunting for troll comments in news community forums. In Proceedings of ACL. Berlin, Germany.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiao-Dan Zhu, and Colin Cherry. 2016. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of SemEval. Berlin, Germany, pages 31–41.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP. Doha, Qatar, pages 1532–1543.

Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In Proceedings of WWW. Perth, Australia, pages 1003–1012.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of EMNLP. Copenhagen, Denmark, pages 2931–2937.

Benjamin Riedel, Isabelle Augenstein, Georgios P. Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the Fake News Challenge stance detection task. arXiv:1707.03264.

Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In Proceedings of ICASSP. Brisbane, Australia, pages 4580–4584.

Prashant Shiralkar, Alessandro Flammini, Filippo Menczer, and Giovanni Luca Ciampaglia. 2017. Finding streams in knowledge graphs to support fact checking. In Proceedings of ICDM. New Orleans, LA, USA, pages 859–864.

Karen Spärck Jones. 2004. IDF term weighting and IR research lessons. Journal of Documentation 60(5):521–523.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Proceedings of NIPS. Montreal, Canada, pages 2440–2448.

Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2016. Improved representation learning for question answer matching. In Proceedings of ACL. Berlin, Germany, pages 464–473.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of HLT-NAACL. New Orleans, LA, USA.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. Baltimore, MD, USA, pages 18–22.

Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359(6380):1146–1151.

Guido Zarrella and Amy Marsh. 2016. MITRE at SemEval-2016 task 6: Transfer learning for stance detection. arXiv:1606.03784.

Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, and Michal Lukasik. 2016. Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. In Proceedings of COLING. Osaka, Japan, pages 2438–2448.

Zhen Zuo, Bing Shuai, Gang Wang, Xiao Liu, Xingxing Wang, Bing Wang, and Yushi Chen. 2015. Convolutional recurrent neural networks: Learning spatial dependencies for image representation. In Proceedings of CVPR. Boston, MA, USA, pages 18–26.