
Call Attention to Rumors: Deep Attention Based Recurrent Neural Networks for Early Rumor Detection

Tong Chen, The University of Queensland, Brisbane, Queensland 4072, [email protected]
Lin Wu*, The University of Queensland, Brisbane, Queensland 4072, [email protected]
Xue Li, The University of Queensland, Brisbane, Queensland 4072, [email protected]
Jun Zhang, Deakin University, Melbourne, Victoria 3125, [email protected]
Hongzhi Yin, The University of Queensland, Brisbane, Queensland 4072, [email protected]
Yang Wang, The University of New South Wales, Sydney, NSW 2052, [email protected]

* Corresponding author

ABSTRACT
The proliferation of social media in communication and information dissemination has made it an ideal platform for spreading rumors. Automatically debunking rumors at their stage of diffusion is known as early rumor detection, which deals with sequential posts about disputed factual claims that vary over time yet are highly duplicated in text. Identifying trending rumors therefore demands an efficient yet flexible model that can capture long-range dependencies among postings and produce distinct representations for accurate early detection. However, it is challenging to apply conventional classification algorithms to early rumor detection, since they rely on hand-crafted features that require intensive manual effort when the number of posts is large. This paper presents a deep attention model based on recurrent neural networks (RNN) that selectively learns temporal hidden representations of sequential posts for identifying rumors. The proposed model embeds soft attention into the recurrence to simultaneously pool out distinct features with particular focus and produce hidden representations that capture contextual variations of relevant posts over time. Extensive experiments on real datasets collected from social media websites demonstrate that (1) the deep attention based RNN model outperforms state-of-the-art methods that rely on hand-crafted features; (2) the soft attention mechanism can effectively distill the parts of posts relevant to rumors in advance; (3) the proposed method detects rumors more quickly and accurately than its competitors.

KEYWORDS
Early rumor detection, Recurrent neural networks, Deep attention models

1 INTRODUCTION
The explosive use of contemporary social media for communication has been accompanied by the widespread diffusion of rumors, which can threaten cyber security and social stability. For instance, on April 23rd 2013, a fake news item claiming that two explosions had happened in the White House and that Barack Obama had been injured was posted by the hacked Twitter account of the Associated Press. Although the White House and the Associated Press assured the public minutes later that the report was not true, the fast diffusion to millions of users had caused severe social panic, resulting in a loss of $136.5 billion in the stock market¹. This incident of a false rumor showcases the vulnerability of social media to rumors, and highlights the practical value of automatically predicting the veracity of information.

[Figure 1(a): Streaming posts in regards to an event, each word annotated with its frequency score. Figure 1(b): Statistics on textual phrases (vocabulary vs. frequency).]

Figure 1: Posts from users on social media platforms exhibit duplication to a great extent. For a specific event, e.g., "Trump being Disqualified from U.S. Election", the phrases "Donald Trump", "Obama" and "Disqualified" appear very frequently in disputed postings.

Debunking rumors at their formative stage is particularly crucial to minimizing their catastrophic effects. Most existing rumor detection models employ learning algorithms that incorporate a wide variety of features and formulate rumor detection as a binary classification task. They commonly craft features manually from the content, sentiment [51], user profiles [30, 31, 47], and diffusion patterns of the posts [12, 16, 18, 34, 36, 37]. Embedding social graphs into a classification model also helps distinguish malicious user comments from normal ones [20, 21]. These approaches aim at extracting distinctive features that describe rumors faithfully.

¹ http://www.dailymail.co.uk/news/article-2313652/AP-Twitter-hackers-break-news-White-House-explosions-injured-Obama.html

arXiv:1704.05973v1 [cs.CL] 20 Apr 2017


However, feature engineering is extremely time-consuming, biased, and labor-intensive. Moreover, hand-crafted features are data-dependent, which makes them incapable of capturing contextual variations across different posts.

Closer examination of rumors reveals that social posts related to an event under discussion arrive as a time series in which users forward or comment on the event continuously over time. As shown in Fig. 1(a), posts regarding an event about the US presidency keep arriving along the event's timeline. Thus, to handle time series of posts, descriptive features should be extracted from their contexts. However, as shown in Fig. 1(b), users' posts exhibit high duplication in their textual phrases due to repeated forwarding, reviewing, and/or inquiry behavior [49]. This poses the challenge of efficiently distilling distinct information from duplicated content while remaining flexible enough to capture its contextual variations as the rumor diffuses over time.

The propagation of information on social media has temporal characteristics, yet most existing rumor detection methodologies ignore this crucial property or are unable to capture the temporal dimension of the data. One exception is [17], where Ma et al. use an RNN to capture the dynamic temporal signals of rumor diffusion and learn textual representations under supervision. However, as rumor diffusion evolves over time, users tend to comment differently at different stages, for example moving from expressing surprise to questioning, or from believing to debunking. As a consequence, textual features may change their importance with time, and we need to determine which of them matter most to the detection task. On the other hand, the duplication among textual phrases impedes the efficiency of training a deep network. Although studies on duplication detection are available and effective in other tasks [15, 26, 50], these approaches are not applicable in our case, where the duplication cannot be determined beforehand but instead varies across post series over time. In this sense, the two aspects of long-term temporal characteristics and dynamic duplication should be addressed simultaneously in an early rumor detection model.

1.1 Challenges and Our Approach
In summary, three challenges in early rumor detection need to be addressed: (1) automatically learning representations for rumors instead of using labor-intensive hand-crafted features; (2) the difficulty of maintaining long-range dependencies among variable-length post series to build their internal representations; (3) the issue of high duplication compounded with varying contextual focus. To combat these challenges, we propose a novel deep attention based recurrent neural network (RNN) for early rumor detection, named CallAtRumors (Call Attention to Rumors). The overview of our framework is illustrated in Fig. 2. Our model processes streaming textual sequences constructed by encoding the contextual information of posts related to one event into a series of feature matrices. Then, the RNN with attention mechanism automatically learns latent representations by feed-forwarding each input weighted by an attention probability distribution while adapting to contextual variations. Finally, an additional hidden layer with sigmoid activation takes the learned latent representations and predicts whether the event is a rumor.

Our framework is premised on RNNs, which have proved effective in recent machine learning tasks [1, 7] involving sequential data. This offers the opportunity to automatically explore deep feature representations from the original inputs for efficient rumor detection, thus avoiding the complexity of feature engineering. With the attention mechanism, the proposed approach can selectively assign more importance to relevant features. Hence, we can handle highly duplicated text while ensuring efficient feature learning for early detection.

1.2 Contributions
The main contributions of our work are summarized as follows:

• We propose a deep attention based model that learns to perform rumor detection automatically and early. The model is based on an RNN and is capable of learning continuous hidden representations by capturing long-range dependencies and contextual variations of posting series.

• The deterministic soft-attention mechanism is embedded into the recurrence to enable distinct feature extraction from highly duplicated content and an adaptive focus of importance that varies over time.

• We quantitatively validate the effectiveness of attention in terms of detection accuracy and earliness by comparing with state-of-the-art methods on two real social media datasets: Twitter and Weibo.

The rest of the paper is organized as follows. Section 2 and Section 3 present the relationship to existing work and preliminaries on RNNs. We introduce the main intuition and formulate the problem in Section 4. Section 5 discusses the experiments and the results on effectiveness and earliness. We conclude this paper and point out future directions in Section 6.

2 RELATED WORK
Our work is closely connected with early rumor detection and attention mechanisms. We briefly introduce the two aspects in this section.

2.1 Early Rumor Detection
The problem of rumor detection [4] can be cast as a binary classification task. The extraction and selection of discriminative features significantly affects the performance of the classifier. Hu et al. first conducted a study analyzing the sentiment differences between spammers and normal users and then presented an optimization formulation that incorporates sentiment information into a novel social spammer detection framework [10]. The propagation patterns of rumors were exploited by Wu et al. through a message propagation tree, where each node represents a text message, to classify whether the root of the tree is a rumor [37]. In [18], a dynamic time series structure was proposed to capture temporal features based on the time series context information generated over every rumor's life cycle. However, these approaches require daunting manual effort in feature engineering and are restricted by the data structure.

Early rumor detection aims to detect viral rumors in their formative stages so that early action can be taken [24].

[Figure 2 shows the pipeline: posts about one event → textual feature extraction → deep attention based neural network → latent representations → additional hidden layer → classified as rumor or non-rumor.]

Figure 2: Schematic overview of our framework. For each event, posts are collected in sequence and transformed into their tf-idf vectors. Then, deep recurrent neural networks augmented with a soft-attention mechanism derive temporal latent representations by capturing long-term dependencies among post series and selectively focusing on important, relevant content. An additional layer on top of the learned representations determines whether the event is a rumor or a non-rumor.

In [49], some very rare but informative enquiry phrases, combined with clustering and a classifier over the clusters, were shown to play an important role in feature engineering, as they shorten the time needed to spot rumors. Manually defined features also proved important in the research on real-time rumor debunking by Liu et al. [16]. By contrast, Wu et al. proposed a sparse learning method that automatically selects discriminative features while training the classifier for emerging rumors [39]. Since those methods neglect the temporal traits of social media data, a time-series based feature structure [18] was introduced to seize context variation over time. More recently, recurrent neural networks were first introduced to rumor detection by Ma et al. [17], using sequential data to capture the temporal textual characteristics of rumor diffusion, which helps detect rumors earlier and more accurately. However, without abundant, distinguishable content in the early stage of a rumor, the performance of these methods drops significantly because they fail to distinguish important patterns.

2.2 Attention Mechanism
Attention has become a rising technique in natural language processing (NLP) [22, 28, 46]. Bahdanau et al. extended the basic encoder-decoder architecture for neural machine translation with an attention mechanism that allows the model to automatically search for the parts of a source sentence relevant to predicting a target word [2], achieving competitive performance on the English-to-French translation task. Vinyals et al. improved the attention model of [2] so that it computes an attention vector reflecting how much attention should be put on each input word, boosting performance on large scale translation [29]. In addition, Sharma et al. applied a location softmax [25] to the hidden states of an LSTM (Long Short-Term Memory) layer, thereby recognizing the more valuable elements of sequential inputs for action recognition. Motivated by these successful applications, we find that attention-based techniques can help detect rumors better with regard to both effectiveness and earliness because they are sensitive to distinctive textual features.

3 RECURRENT NEURAL NETWORKS
Recurrent neural networks, or RNNs [23], are a family of neural networks for processing sequential data, such as a sequence of values x_1, ..., x_τ. An RNN processes an input sequence one element x_t at a time, updates its hidden units h_t, a "state vector" that implicitly contains information about the history of all the past elements of the sequence (x_{<t}), and generates an output vector o_t [13]. The forward propagation begins with a specification of the initial state h_0; then, for each time step t from t = 1 to t = τ, the following update equations are applied [6]:

$$h_t = \tanh(U x_t + W h_{t-1} + b), \qquad o_t = V h_t + c, \qquad (1)$$

where the parameters U, V and W are the weight matrices for the input-to-hidden, hidden-to-output and hidden-to-hidden connections, respectively, b and c are bias vectors, and tanh(·) is the hyperbolic tangent non-linearity.
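As a concrete illustration (not code from the paper), the following minimal NumPy sketch applies the update of Eq. (1) to a toy sequence; the dimensions and random parameters are illustrative assumptions only.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c, h0):
    """Vanilla RNN forward pass following Eq. (1):
    h_t = tanh(U x_t + W h_{t-1} + b),  o_t = V h_t + c."""
    h, outputs = h0, []
    for x_t in x_seq:
        h = np.tanh(U @ x_t + W @ h + b)   # new hidden state h_t
        outputs.append(V @ h + c)          # output o_t at this step
    return outputs, h

# Toy example with illustrative sizes (input dim 8, hidden dim 4, output dim 2).
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b, c, h0 = np.zeros(4), np.zeros(2), np.zeros(4)
outputs, h_T = rnn_forward([rng.normal(size=8) for _ in range(5)], U, W, V, b, c, h0)
```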

The gradients of an RNN are computed by back-propagation through time (BPTT) [23]. In practice, a standard RNN is difficult to train because of the well-known vanishing or exploding gradients, which stem from the inability of RNNs to capture long-distance temporal dependencies under gradient-based optimization [3, 40]. An effective solution to this training difficulty is to include "memory" cells that store information over time, known as Long Short-Term Memory (LSTM) [7, 9]. In this work, we employ the LSTM as the basic unit to capture long-term temporal dependencies among streaming variable-length post series.

4 CALLATRUMORS: EARLY RUMOR DETECTION WITH DEEP ATTENTION BASED RNN
In this section, we present the details of our framework with deep attention for classifying social textual events into rumors and non-rumors. First, we introduce a strategy that converts incoming streams of social posts into continuous variable-length time series. Then, we describe the soft attention mechanism, which is embedded into recurrent neural networks to selectively focus on textual cues and learn distinct representations for rumor/non-rumor binary classification.


Figure 3: (a) An LSTM cell. Each cell learns how to weigh its input components (input gate) and how to modulate their contribution to the memory (input modulator). It also learns weights that erase the memory cell (forget gate) and weights that control how the memory should be emitted (output gate). (b) The soft attention mechanism: the attention module computes the current input x_t as an average of the tf-idf features weighted according to the attention softmax a_t. (c) The proposed deep attention based recurrent model: at each time stamp, the model takes the feature slice x_t as input, propagates it through stacked LSTM layers, and predicts the next location probability a_{t+1} and the class label y_t.

4.1 Problem Statement
Individual social posts contain very limited content due to their short nature. On the other hand, a claim is generally associated with a number of relevant posts, and these relevant posts can easily be collected to describe the central content more faithfully. Hence, we are interested in detecting rumors at an aggregate level instead of identifying each single post [17]. In other words, we focus on detecting rumors at the event level, where sequential posts related to the same topic are batched together to constitute an event, and our model determines whether the event is a rumor.

Let E = {E_i} denote the set of given events, where each event E_i = {(p_{i,j}, t_{i,j})}_{j=1}^{n_i} consists of all relevant posts p_{i,j} at time stamps t_{i,j}; the task is to classify each event as a rumor or not.

4.2 Constructing Variable-Length Post Series
For each event E_i, we collect a set of relevant post series as the input from which our model learns latent representations. Within every event, posts are divided into time intervals, each of which is treated as a batch, because it is not practical to handle each post individually at such a large scale. To ensure a similar word density at each time step within one event, we group posts into batches according to a fixed post amount N rather than slicing the event time span evenly.

Algorithm 1 describes the construction of variable-length post series. Specifically, for every event E_i = {(p_{i,j}, t_{i,j})}_{j=1}^{n_i}, post series are constructed with variable lengths because different events have different numbers of relevant posts. We set a minimum series length Min to maintain the sequential property for all events. For events containing no less than N × Min posts, we iteratively take the first N posts out of E_i and feed them into a time interval T; the last n_i − N⌊n_i/N⌋ posts are treated as the last time interval. For events containing fewer than N × Min posts, we put ⌊n_i/Min⌋ posts into each of the first Min − 1 intervals and assign the rest to the last interval T_Min.

To model different words, we calculate the tf-idf (Term Frequency-Inverse Document Frequency) of the K most frequent vocabulary terms across all posts. A proven, lightweight textual feature, tf-idf is a numerical statistic that reflects how important a word is to a document in a collection or corpus [14]. In this case, each post is associated with a K-word dictionary reflecting the importance of every word in the post, with a value of 0 if the word never appears in the post. Finally, every post is encoded by its K-word tf-idf dictionary, and within a specific interval a K × N matrix is constructed as the input to our model. If there are fewer than N posts within an interval, we expand it to the same scale by padding with 0s. Hence, each set of post series consists of at least Min feature matrices of the same size, K (vocabulary size) × N (posts per interval).
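As a rough illustration of this encoding step (not the authors' implementation, which is unspecified), the sketch below builds one K × N tf-idf matrix per interval using scikit-learn's TfidfVectorizer as a stand-in for the tf-idf computation; the helper names and the zero-padding details are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def build_interval_matrices(post_series, all_posts, K=10000, N=50):
    """Encode each interval (a list of at most N post strings) as a K x N
    tf-idf matrix, zero-padding intervals with fewer than N posts.
    `post_series` is a list of intervals; `all_posts` is the full corpus used
    to fit the K-word vocabulary (both are hypothetical inputs)."""
    vectorizer = TfidfVectorizer(max_features=K, lowercase=True)
    vectorizer.fit(all_posts)
    matrices = []
    for interval in post_series:
        tfidf = vectorizer.transform(interval).toarray().T  # (vocab_size, n_posts)
        mat = np.zeros((tfidf.shape[0], N))                 # vocab_size may be < K
        n = min(tfidf.shape[1], N)
        mat[:, :n] = tfidf[:, :n]                           # pad (or truncate) to N posts
        matrices.append(mat)
    return matrices
```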

4.3 Long Short-Term Memory (LSTM) with a Deterministic Soft Attention Mechanism

To capture the long-distance temporal dependencies among continuous-time post series, we employ Long Short-Term Memory (LSTM) units [7, 44, 48] to learn high-level discriminative representations for rumors. The structure of the LSTM is formulated as

$$\begin{aligned}
i_t &= \sigma(U_i h_{t-1} + W_i x_t + V_i c_{t-1} + b_i),\\
f_t &= \sigma(U_f h_{t-1} + W_f x_t + V_f c_{t-1} + b_f),\\
c_t &= f_t c_{t-1} + i_t \tanh(U_c h_{t-1} + W_c x_t + b_c),\\
o_t &= \sigma(U_o h_{t-1} + W_o x_t + V_o c_t + b_o),\\
h_t &= o_t \tanh(c_t),
\end{aligned} \qquad (2)$$

where σ(·) is the logistic sigmoid function, and i_t, f_t, o_t, c_t are the input gate, forget gate, output gate and cell activation vector, respectively. Each has its corresponding input-to-hidden, hidden-to-output, and hidden-to-hidden matrices U_•, V_•, W_• and bias vector b_•. The LSTM architecture is essentially a memory cell that can maintain its state over time, with non-linear gating units regulating the information flow into and out of the cell [8]. An LSTM unit is shown graphically in Fig. 3(a).
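For concreteness, a minimal NumPy sketch of one LSTM step following Eq. (2) is given below; the parameter container `p` and its key names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Eq. (2). `p` is a dict holding the gate
    parameters (U_*, W_*, V_*, b_* for the input, forget and output gates
    and the cell); all are assumed to have compatible shapes."""
    i = sigmoid(p["Ui"] @ h_prev + p["Wi"] @ x_t + p["Vi"] @ c_prev + p["bi"])  # input gate
    f = sigmoid(p["Uf"] @ h_prev + p["Wf"] @ x_t + p["Vf"] @ c_prev + p["bf"])  # forget gate
    c = f * c_prev + i * np.tanh(p["Uc"] @ h_prev + p["Wc"] @ x_t + p["bc"])    # new cell state
    o = sigmoid(p["Uo"] @ h_prev + p["Wo"] @ x_t + p["Vo"] @ c + p["bo"])       # output gate
    h = o * np.tanh(c)                                                          # new hidden state
    return h, c
```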


Input: Event-related posts E_i = {(p_{i,j}, t_{i,j})}_{j=1}^{n_i}, post amount N, minimum series length Min
Output: Post series S_i = {T_1, ..., T_v}
  /* Initialization */
  v = 1; x = 0; y = 0;
  if n_i ≥ N × Min then
      while v ≤ ⌊n_i/N⌋ do
          x = N × (v − 1) + 1;
          y = N × v;
          T_v ← (p_{i,x}, ..., p_{i,y});
          v = v + 1;
      end
      T_v ← (p_{i,y+1}, ..., p_{i,n_i});
  else
      while v < Min do
          x = ⌊n_i/Min⌋ × (v − 1) + 1;
          y = ⌊n_i/Min⌋ × v;
          T_v ← (p_{i,x}, ..., p_{i,y});
          v = v + 1;
      end
      T_v ← (p_{i,y+1}, ..., p_{i,n_i});
  end
  return S_i;

Algorithm 1: Constructing Variable-Length Post Series
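The following Python sketch is one possible transcription of Algorithm 1 (0-based indices, and assuming every event contains at least Min posts); it is illustrative rather than the authors' code.

```python
def construct_post_series(posts, N=50, Min=5):
    """Split an event's chronologically ordered posts into a variable-length
    series of intervals, following Algorithm 1."""
    n = len(posts)
    series = []
    if n >= N * Min:
        # Full intervals of N posts, with any remainder in a final interval.
        for v in range(n // N):
            series.append(posts[v * N:(v + 1) * N])
        if n % N:
            series.append(posts[(n // N) * N:])
    else:
        # Distribute posts over Min intervals to preserve the minimum series length
        # (assumes n >= Min, which holds for the datasets used in the paper).
        step = n // Min
        for v in range(Min - 1):
            series.append(posts[v * step:(v + 1) * step])
        series.append(posts[(Min - 1) * step:])
    return series
```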

In Eq. (2), the context vector x_t is a dynamic representation of the relevant part of the social post input at time t. To calculate x_t, we introduce attention weights a_t[i], i = 1, ..., K, corresponding to the features extracted at different element positions of a tf-idf matrix d_t. Specifically, at each time stamp t, our model predicts a_{t+1}, a softmax over the K positions, and y_t, a softmax over the binary classes rumor and non-rumor, using an additional hidden layer with sigmoid(·) activations (see Fig. 3(c)). The location softmax [25] is applied over the hidden states of the last LSTM layer to calculate a_{t+1}, the attention weights for the next input matrix d_{t+1}:

$$a_{t+1}[i] = P(L_{t+1} = i \mid h_t) = \frac{\exp(W_i^{\top} h_t)}{\sum_{j=1}^{K} \exp(W_j^{\top} h_t)}, \qquad i \in 1, \dots, K, \qquad (3)$$

where a_{t+1}[i] is the attention probability for the i-th element (word index) at time step t + 1, W_i contains the weights allocated to the i-th element, and L_{t+1} is a random variable representing the word index, taking 1-of-K values. The attention vector a_{t+1} is a probability distribution representing the importance attached to each word in the input matrix d_{t+1}. Our model is optimized to assign higher focus to words believed to be distinctive for learning rumor/non-rumor representations. After calculating these probabilities, the soft deterministic attention mechanism [2] computes the expected value of the input at the next time step, x_{t+1}, by taking the expectation over the word matrix at its different positions:

$$x_{t+1} = \mathbb{E}_{P(L_{t+1} \mid h_t)}[d_{t+1}] = \sum_{i=1}^{K} a_{t+1}[i]\, d_{t+1}[i], \qquad (4)$$

where d_{t+1} is the input matrix at time step t + 1 and d_{t+1}[i] is the feature vector at the i-th position of d_{t+1}. Thus, Eq. (4) formulates a deterministic attention model by computing the soft-attention-weighted word vector Σ_i a_{t+1}[i] d_{t+1}[i]. This corresponds to feeding a softly weighted context into the system. The whole model is smooth and differentiable under this deterministic attention, so end-to-end learning with standard back-propagation is straightforward.

We remark that attention models can be classified into soft and hard attention models. Soft attention models are deterministic and can be trained by back-propagation, whereas hard attention models are stochastic and require training with the REINFORCE algorithm [19], by maximizing a variational lower bound, or with importance sampling [1, 45]. With a hard attention model, we would sample L_t from the softmax distribution of Eq. (3), and the input x_{t+1} would then be the feature at the sampled location instead of an expectation over all the elements of d_{t+1}. Since hard attention is not differentiable and has to resort to sampling, we deploy soft attention in our model.
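A minimal sketch of the soft attention step of Eqs. (3)-(4) is shown below; the matrix shapes and function names are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(h_t, d_next, W_att):
    """Soft attention following Eqs. (3)-(4). `h_t` is the hidden state of the
    last LSTM layer, `d_next` is the next K x N tf-idf matrix, and `W_att` is
    an assumed K x hidden_dim attention weight matrix (one row W_i per word)."""
    a_next = softmax(W_att @ h_t)                      # Eq. (3): attention over the K positions
    x_next = (a_next[:, None] * d_next).sum(axis=0)    # Eq. (4): expected input, length N
    return x_next, a_next
```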

4.4 Loss Function and Model Training
For model training, we employ a cross-entropy loss coupled with the doubly stochastic regularization of [45], which encourages the model to pay attention to every element of the input word matrix. This imposes an additional constraint over the location softmax so that Σ_{t=1}^{τ} a_{t,i} ≈ 1. The loss function is defined as follows:

$$L = -\sum_{t=1}^{\tau} \sum_{i=1}^{C} y_{t,i} \log \hat{y}_{t,i} + \lambda \sum_{i=1}^{K} \Big(1 - \sum_{t=1}^{\tau} a_{t,i}\Big)^2 + \gamma \lVert \phi \rVert^2, \qquad (5)$$

where y_t is the one-hot label vector, ŷ_t is the vector of predicted class probabilities at time stamp t, τ is the total number of time stamps, C = 2 is the number of output classes (rumor or non-rumor), λ is the attention penalty coefficient, γ is the weight decay coefficient, and ϕ denotes all the model parameters.
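The loss of Eq. (5) can be written compactly as in the following sketch; the argument layout (labels and attention weights collected as arrays, and the parameters summarized by their squared norm) is an assumption made for illustration.

```python
import numpy as np

def callatrumors_loss(y_true, y_pred, attention, params_sq_norm, lam=1.5, gamma=1e-5):
    """Loss of Eq. (5): cross-entropy over all time stamps, a doubly stochastic
    attention penalty encouraging sum_t a_{t,i} ~ 1 for every word position i,
    and an L2 weight-decay term. Illustrative shapes: y_true and y_pred are
    (tau, C); attention is (tau, K); params_sq_norm is the squared norm of all
    model parameters."""
    eps = 1e-12
    cross_entropy = -np.sum(y_true * np.log(y_pred + eps))
    attn_penalty = lam * np.sum((1.0 - attention.sum(axis=0)) ** 2)
    weight_decay = gamma * params_sq_norm
    return cross_entropy + attn_penalty + weight_decay
```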

The cell state and hidden state of the LSTM are initialized from the input tf-idf matrices for faster convergence:

$$c_0 = f_c\Big(\frac{1}{\tau}\sum_{t=1}^{\tau}\Big(\frac{1}{K}\sum_{i=1}^{K} d_t[i]\Big)\Big), \qquad
h_0 = f_h\Big(\frac{1}{\tau}\sum_{t=1}^{\tau}\Big(\frac{1}{K}\sum_{i=1}^{K} d_t[i]\Big)\Big), \qquad (6)$$

where f_c and f_h are two multi-layer perceptrons and τ is the number of time stamps in the model. These values are used to compute the first location softmax a_1, which determines the initial input x_1.
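A small sketch of this initialization, under the assumption that f_c and f_h are supplied as callables (e.g., small MLPs mapping an N-vector to the hidden dimension), is:

```python
import numpy as np

def init_states(d_seq, f_c, f_h):
    """Initialize the LSTM cell and hidden state as in Eq. (6): average every
    K x N tf-idf matrix over its K word positions, average over the tau time
    stamps, then pass the resulting N-vector through f_c and f_h."""
    mean_input = np.mean([d_t.mean(axis=0) for d_t in d_seq], axis=0)  # length N
    return f_c(mean_input), f_h(mean_input)
```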

5 EXPERIMENTS
This section reports how we evaluate the performance of the proposed methodology on real-world data collected from two different social media platforms. We first describe the construction of the datasets, then perform a self-evaluation to determine the optimal parameters, and finally assess the effectiveness and efficiency of our model, CallAtRumors, by comparing it with state-of-the-art methods.


Table 1: Statistical details of datasets

Statistic                   Twitter     Weibo
Involved Users              466,577     2,755,491
Total Posts                 1,046,886   3,814,329
Total Events                996         4,702
Total Rumors                498         2,351
Total Non-Rumors            498         2,351
Average Posts per Event     1,051       811
Minimum Posts per Event     8           10
Maximum Posts per Event     44,316      59,318

5.1 Datasets
We use two public datasets published by [17], collected from Twitter (www.twitter.com) and Sina Weibo (www.weibo.com) respectively. Both datasets are organised at the event level: the posts related to the same event are aggregated, and each event is labeled 1 for rumor and 0 for non-rumor. In the following, we describe how the two datasets were originally constructed and how we expand them:

• In the Twitter dataset, 498 rumors are collected using keywords extracted from verified fake news published on Snopes (www.snopes.com), a real-time rumor debunking website. It also contains 494 normal events from Snopes and two public datasets [4, 12]. For each event, the keywords are extracted and manually refined until the composed queries yield precise Twitter search results [17]. All labelled events and related Tweet IDs are published by the authors; however, some Tweets were no longer available when we crawled them, causing a 10% shrinkage in data scale compared with the original Twitter dataset.

• The Weibo dataset contains 2,313 rumors and 2,351 non-rumors. The polarity of all events is verified on the Sina Community Management Center (http://service.account.weibo.com). The keywords are then manually summarized and modified for a comprehensive post search, and the data are collected using the Weibo API.

In addition, to balance the ratio of rumors to non-rumors, we follow the criteria of [17] and manually gather 4 non-rumors from Twitter and 38 rumors from Weibo to achieve a 1:1 ratio of rumors to non-rumors. The data collection procedure and our dataset structure are shown in Figure 4 (the webpage in this figure is from http://www.snopes.com/hillary-clinton-accidentally-gave-isis-400-million/).

Table 1 gives statistical details of the two datasets. We observe that more than 80% of the users tend to repost the original news with very short comments reflecting their attitudes towards it. As a consequence, the contents of the posts related to one event are mostly duplicated, causing a scarcity of distinctive textual patterns within the overlapping context. However, by applying the textual attention mechanism, CallAtRumors is able to lay more emphasis on discriminative words and can maintain high performance in such cases.

5.2 Self Evaluations
The model is implemented using Theano (http://deeplearning.net/software/theano/). All parameters are set using cross-validation. To generate the input variable-length post series, we set the number of posts N for each time step to 50 and the minimum post series length Min to 5. We select the K = 10,000 top words for constructing the tf-idf matrices. Apart from lowercasing, we do not apply any other special preprocessing such as stemming [2]; the recurrent neural network with attention mechanism automatically learns to ignore unimportant or irrelevant expressions during training.

For a hold-out set containing 15% of the events in each dataset, a self-evaluation is performed to optimize the number of LSTM layers by varying it from 2 to 6; the results are shown in Figure 5. Accordingly, we apply a three-layer LSTM model with descending hidden-state sizes of 1024, 512 and 64, respectively. The learning rate is set to 0.45 and we apply a dropout [27] of 0.3 to all non-recurrent connections. The attention penalty coefficient λ is set to 1.5 and the weight decay to 10⁻⁵. The model is trained by back-propagating the gradients of the loss [5] and updating the parameters with the Adam optimization algorithm [11]; the training procedure is iterated until the loss converges.
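For reference, the settings reported above can be collected in one place; the dictionary below is a hypothetical summary for readability, not code from the paper (whose implementation is in Theano), and the key names are illustrative.

```python
# Hyper-parameters reported in Section 5.2, gathered into a single
# (hypothetical) configuration dictionary.
CONFIG = {
    "posts_per_interval_N": 50,       # posts per time step
    "min_series_length_Min": 5,       # minimum number of intervals per event
    "vocabulary_size_K": 10_000,      # top tf-idf words
    "lstm_layers": 3,
    "hidden_sizes": [1024, 512, 64],  # descending hidden states per LSTM layer
    "learning_rate": 0.45,
    "dropout": 0.3,                   # on non-recurrent connections
    "attention_penalty_lambda": 1.5,
    "weight_decay_gamma": 1e-5,
    "optimizer": "Adam",
}
```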

5.3 Settings and Baselines
We evaluate the effectiveness and efficiency of CallAtRumors by comparing it with the following state-of-the-art approaches in terms of precision, recall and F-measure.

• DT-Rank [49]: A decision-tree based ranking model that identifies trending rumors by recasting the problem as finding entire clusters of posts whose topic is a disputed factual claim. We implement their enquiry phrases and features to make it comparable to our method.

• SVM-TS [18]: An SVM (support vector machine) model that uses time-series structures to capture the variation of social context features. SVM-TS captures the temporal characteristics of these features over a rumor's life cycle, with time-series modelling used to incorporate various kinds of social context information. We use the features they provide from contents, users and propagation patterns.

• LK-RBF [24]: To tackle implicit data without explicit links and joined conversations, Sampson et al. proposed two methods based on hashtags and web links to aggregate individual tweets with similar keywords from different threads into a conversation. We choose the link-based approach and combine it with an RBF (Radial Basis Function) kernel as a supervised classifier, since this achieved the best performance in their experiments.

• ML-GRU [17]: This method utilizes recurrent neural networks to automatically discover deep data representations for efficient rumor detection, and also allows for efficient early rumor detection. Following the settings of their work, we choose the multi-layer GRU (gated recurrent unit), which shows the best results in their effectiveness and earliness tests.

• CERT [39]: A cross-topic emerging rumor detection model that jointly clusters data, selects features and trains classifiers, using the abundant labeled data from prior rumors to facilitate the detection of an emerging rumor. CERT can extract useful patterns in the case of data scarcity. Since CERT requires Tweet instances instead of event-level data, we use the tf-idf feature vectors of all the Tweets in one event to construct the feature matrix it requires.

Figure 4: Data collection and dataset structure. For each event, its authenticity is verified through official news verification services. We then manually extract suitable keywords for each event to ensure a precise search for relevant posts. After that, we crawl posts with query search and store the collected data using the data storage structure shown in the table. Rumor events are labelled 1 and normal events are labelled 0.

Figure 5: Results w.r.t. varied numbers of LSTM layers (precision, recall and F-measure). The best result is achieved with a three-layer LSTM model with 1,024, 512 and 64 hidden states, respectively.

Table 2: Performance on the Twitter dataset

Method         Precision   Recall    F-measure
DT-Rank        71.50%      63.41%    0.6721
LK-RBF         78.54%      60.52%    0.6836
SVM-TS         76.33%      77.92%    0.7712
CERT           81.12%      79.66%    0.8038
ML-GRU         80.87%      82.97%    0.8191
CallAtRumors   88.63%      85.71%    0.8715

We hold out 15% of the events in each dataset for cross-validation, and split the remaining data 3:2 into training and test sets respectively. In particular, we keep the ratio between rumor events and normal events at 1:1 in both the training and test sets. In the test of CallAtRumors' effectiveness, all posts within each event are used during training and evaluation. In the study of earliness, we take different proportions of the posts, starting from the first post of each event and ranging from 10% to 80%, in order to test how early CallAtRumors can detect rumors successfully.
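One simple way to realize this earliness protocol is to truncate each event to its earliest posts before evaluation, as in the hedged sketch below; `events` and `detect` are hypothetical stand-ins for the dataset and the trained model.

```python
def truncate_event(posts, proportion):
    """Keep only the earliest `proportion` of an event's chronologically
    ordered posts, as used in the earliness study (10% to 80%)."""
    k = max(1, int(len(posts) * proportion))
    return posts[:k]

# Example usage: evaluate the detector on progressively larger prefixes of each event.
# for proportion in [0.1, 0.2, 0.4, 0.6, 0.8]:
#     predictions = [detect(truncate_event(posts, proportion)) for posts in events]
```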

5.4 Effectiveness Validation

Table 3: Performance on the Weibo dataset

Method         Precision   Recall    F-measure
DT-Rank        67.24%      61.33%    0.6415
LK-RBF         75.49%      61.08%    0.6752
SVM-TS         80.69%      78.26%    0.7946
CERT           79.70%      76.32%    0.7797
ML-GRU         82.44%      81.58%    0.8301
CallAtRumors   87.10%      86.34%    0.8672

Table 2 and Table 3 show the performance of all approaches on the Twitter dataset and the Weibo dataset respectively. DT-Rank cannot effectively distinguish rumors from normal events when facing datasets with duplicated content and scarce textual features. LK-RBF and SVM-TS achieve better results, indicating that feature engineering helps classifiers detect rumors better; however, both show inadequate recall, which measures how sensitive the models are towards rumors. Since CERT can jointly select discriminative features and train a topic-independent classifier with the selected features [39], it achieves better results than the former three approaches on our datasets. ML-GRU is competitive in both precision and recall due to its ability to process sequential data and learn hidden states from raw inputs.

On the Twitter dataset, CallAtRumors outperforms the competitors, achieving a precision, recall and F-measure of 88.63%, 85.71% and 0.8694 respectively. The same holds on the Weibo dataset, where CallAtRumors achieves a precision, recall and F-measure of 87.10%, 86.34% and 0.8672 respectively. Figure 6 illustrates the intermediate attention results on different words within a detected rumor event. The effectiveness validation demonstrates the effect of the attention mechanism in making LSTM units sensitive to distinctive words and tokens by associating more importance with certain locations of each feature matrix.

5.5 More Comparison with the State-of-the-art: CERT [39]
To demonstrate how the characteristics of a dataset affect rumor detection performance, we compare CallAtRumors with CERT on different datasets. To reproduce the experimental conditions of [39], we also organized a sample dataset using the criteria described in that work. We use queries generated from 220 rumors reported on Snopes, together with regular expressions belonging to the same topics, to crawl 7,580 Tweets, and manually labeled each Tweet by reading its content and referring to the corresponding Snopes article.


[Figure 6 shows four example posts about the rumor, e.g., "@bornchozen: School principal bans Santa, Thanksgiving and Pledge of Allegiance.. Umm ok....why???", with each word shaded from high to low attention degree.]

Figure 6: Visualization of varied attention on a detected rumor. Different color intensities reflect the attention paid to each word in a post. In the rumor "School Principal Eujin Jaela Kim banned the Pledge of Allegiance, Santa and Thanksgiving", most vocabulary closely connected with the event itself receives less attention weight than words expressing users' doubt, enquiry and anger caused by the rumor. Our model learns to focus on expressions that are more useful for rumor detection while ignoring unrelated words.

Figure 7: Results of early rumor detection. (a) Precision on partial Twitter data; (b) recall on partial Twitter data; (c) precision on partial Weibo data; (d) recall on partial Weibo data. Each panel compares CallAtRumors with DT-Rank, LK-RBF, SVM-TS, CERT and ML-GRU over increasing proportions of the data.

Table 4: More comparison with CERT [39] on the Sample and Twitter datasets

Method         Dataset   Precision   Recall    F-measure
CallAtRumors   Sample    91.82%      89.43%    0.9061
CERT           Sample    90.35%      85.78%    0.8801
CallAtRumors   Twitter   88.63%      85.71%    0.8715
CERT           Twitter   81.12%      79.66%    0.8038

This results in a sample dataset containing 1,193 rumor Tweets and 6,387 non-rumor Tweets, with a ratio of rumors to non-rumors similar to the dataset in [39]. Table 4 shows the results when CallAtRumors and CERT are applied to this sample dataset and to our Twitter dataset. The results further demonstrate our model's capability of capturing valuable patterns within our large-scale, highly duplicated datasets by applying attention to the more representative words.

5.6 Earliness Analysis
In this experiment, we study the earliness of our approach. For a fair comparison, we allow existing rumor detection methods to be trained on the rumors used for evaluation. By incrementally adding training data in chronological order, we can estimate how quickly our method detects emerging rumors. The results on earliness are shown in Fig. 7. At the early stage, with 10% to 60% of the training data, CallAtRumors outperforms the comparative methods by a noticeable margin. In particular, compared with the most relevant method, ML-GRU, with the data proportion ranging from 10% to 20%, CallAtRumors gains 5% in precision and 4% in recall on both the Twitter and Weibo datasets. This shows that the attention mechanism is more effective for early-stage detection by focusing on the most distinctive features in advance. As more data are used in testing, all methods approach their best performance. On the Twitter and Weibo datasets, which contain on average 80% duplicated content per event, our method starts at 74.02% and 71.73% precision and 68.75% and 70.34% recall, which corresponds to an average time lag of 20.47 hours after the emergence of an event. This result is promising because the average report time for rumors given by Snopes and the Sina Community Management Center is 54 hours and 72 hours respectively [17], so our deep attention based early rumor detection technique can save considerable manual effort.

6 CONCLUSION
Rumor detection on social media is time-sensitive: rumors spread quickly and broadly, and their vicious impact is hard to eliminate late in their diffusion. In this paper, we introduced CallAtRumors, a novel recurrent neural network model based on a soft attention mechanism that automatically carries out early rumor detection by learning latent representations from sequential social posts. Experiments against five state-of-the-art rumor detection methods illustrate that CallAtRumors is sensitive to distinguishable words and thus outperforms its competitors even when textual features are sparse at the beginning stage of a rumor. In addition, a further comparison demonstrates our model's ability to handle highly duplicated data. In future work, it would be appealing to combine more complex features with our deep attention model; for example, we could model the propagation patterns of rumors as sequential inputs to RNNs to improve detection accuracy. Future work may also investigate efficiency using hashing techniques [32, 41] over multi-level feature spaces [33, 35, 38, 42, 43].

REFERENCES
[1] Jimmy Ba, Roger Grosse, Ruslan Salakhutdinov, and Brendan Frey. 2015. Learning wake-sleep recurrent attention models. In NIPS.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157-166.
[4] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web. ACM, 675-684.
[5] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493-2537.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[7] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013).
[8] Klaus Greff, Rupesh K Srivastava, Jan Koutnik, Bas R Steunebrink, and Jurgen Schmidhuber. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems (2016).
[9] Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
[10] Xia Hu, Jiliang Tang, Huiji Gao, and Huan Liu. 2014. Social spammer detection with sentiment information. In Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 180-189.
[11] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[12] Sejeong Kwon, Meeyoung Cha, Kyomin Jung, Wei Chen, and Yajun Wang. 2013. Prominent features of rumor propagation in online social media. In Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE, 1103-1108.
[13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436-444.
[14] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.
[15] Jiajun Liu, Zi Huang, Hong Cheng, Yueguo Chen, Heng Tao Shen, and Yanchun Zhang. 2013. Presenting diverse location views with real-time near-duplicate photo elimination. In ICDE.
[16] Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui Fang, and Sameena Shah. 2015. Real-time rumor debunking on Twitter. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 1867-1870.
[17] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J Jansen, Kam-Fai Wong, and Meeyoung Cha. 2016. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of IJCAI.
[18] Jing Ma, Wei Gao, Zhongyu Wei, Yueming Lu, and Kam-Fai Wong. 2015. Detect rumors using time series of social context information on microblogging websites. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 1751-1754.
[19] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS.
[20] Shebuti Rayana and Leman Akoglu. 2015. Collective opinion spam detection: Bridging review networks and metadata. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 985-994.
[21] Shebuti Rayana and Leman Akoglu. 2016. Collective opinion spam detection using active inference. In Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 630-638.
[22] Tim Rocktaschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664 (2015).
[23] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5, 3 (1988), 1.
[24] Justin Sampson, Fred Morstatter, Liang Wu, and Huan Liu. 2016. Leveraging the implicit structure within social media for emergent rumor detection. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 2377-2382.
[25] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. 2015. Action recognition using visual attention. arXiv preprint arXiv:1511.04119 (2015).
[26] Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. 2011. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In ACM Multimedia.
[27] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929-1958.
[28] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104-3112.
[29] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems. 2773-2781.
[30] Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang. 2015. Effective Multi-Query Expansions: Robust Landmark Retrieval. In ACM Multimedia.
[31] Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang. 2017. Effective Multi-Query Expansions: Collaborative Deep Networks for Robust Landmark Retrieval. IEEE Trans. Image Processing 26, 3 (2017), 1393-1404.
[32] Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, and Qing Zhang. 2015. LBMCH: Learning Bridging Mapping for Cross-modal Hashing. In ACM SIGIR.
[33] Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, Qing Zhang, and Xiaodi Huang. 2015. Robust Subspace Clustering for Multi-View Data by Exploiting Correlation Consensus. IEEE Trans. Image Processing 24, 11 (2015), 3939-3949.
[34] Yang Wang, Xuemin Lin, and Qing Zhang. 2013. Towards metric fusion on multi-view data: a cross-view based graph random walk approach. In ACM CIKM.
[35] Yang Wang, Xuemin Lin, Qing Zhang, and Lin Wu. 2014. Shifting Hypergraphs by Probabilistic Voting. In PAKDD.
[36] Yang Wang, Wenjie Zhang, Lin Wu, Xuemin Lin, and Xiang Zhao. 2017. Unsupervised Metric Fusion Over Multiview Data by Graph Random Walk-Based Cross-View Diffusion. IEEE Trans. Neural Netw. Learning Syst 28, 1 (2017), 57-70.
[37] Ke Wu, Song Yang, and Kenny Q Zhu. 2015. False rumors detection on Sina Weibo by propagation structures. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 651-662.
[38] Lin Wu and Xiaochun Cao. 2010. Geo-location estimation from two shadow trajectories. In CVPR.
[39] Liang Wu, Jundong Li, Xia Hu, and Huan Liu. 2016. Gleaning Wisdom from the Past: Early Detection of Emerging Rumors in Social Media. In SDM.
[40] Lin Wu, Chunhua Shen, and Anton van den Hengel. 2017. Deep linear discriminant analysis on Fisher networks: A hybrid architecture for person re-identification. Pattern Recognition 65 (2017), 238-250.
[41] Lin Wu and Yang Wang. 2017. Robust hashing for multi-view data: Jointly learning low-rank kernelized similarity consensus and hash functions. Image and Vision Computing 57 (2017), 58-66.
[42] Lin Wu, Yang Wang, and Shirui Pan. 2016. Exploiting Attribute Correlations: A Novel Trace Lasso based Weakly Supervised Dictionary Learning Method. IEEE Trans. Cybernetics (2016).
[43] Lin Wu, Yang Wang, and John Shepherd. 2013. Efficient image and tag co-ranking: a Bregman divergence optimization method. In ACM Multimedia.
[44] Yang Xiang, Qingcai Chen, Xiaolong Wang, and Yang Qin. 2017. Answer Selection in Community Question Answering via Attentive Neural Networks. IEEE Signal Processing Letters 24, 4 (2017), 505-509.
[45] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML.
[46] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT. 1480-1489.
[47] Reza Zafarani and Huan Liu. 2015. 10 bits of surprise: Detecting malicious users with minimum information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 423-431.
[48] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
[49] Zhe Zhao, Paul Resnick, and Qiaozhu Mei. 2015. Enquiring minds: Early detection of rumors in social media from enquiry posts. In Proceedings of the 24th International Conference on World Wide Web. ACM, 1395-1405.
[50] Zhili Zhou, Yunlong Wang, Q. M. Jonathan Wu, Ching-Nung Yang, and Xingming Sun. 2017. Effective and Efficient Global Context Verification for Image Copy Detection. IEEE Transactions on Information Forensics and Security 12, 1 (2017), 48-63.
[51] David Zimbra, M Ghiassi, and Sean Lee. 2016. Brand-Related Twitter Sentiment Analysis Using Feature Engineering and the Dynamic Architecture for Artificial Neural Networks. In System Sciences (HICSS), 2016 49th Hawaii International Conference on. IEEE, 1930-1938.
