

MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims

Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, Jakob Grue Simonsen

Department of Computer Science, University of Copenhagen

{augenstein,c.lioma,wang,lcl,c.hansen,chrh,simonsen}@di.ku.dk

Abstract

We contribute the largest publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists. We present an in-depth analysis of the dataset, highlighting characteristics and challenges. Further, we present results for automatic veracity prediction, both with established baselines and with a novel method for joint ranking of evidence pages and predicting veracity that outperforms all baselines. Significant performance increases are achieved by encoding evidence, and by modelling metadata. Our best-performing model achieves a Macro F1 of 49.2%, showing that this is a challenging testbed for claim veracity prediction.

1 Introduction

Misinformation and disinformation are two of the most pertinent and difficult challenges of the information age, exacerbated by the popularity of social media. In an effort to counter this, a significant amount of manual labour has been invested in fact checking claims, often collecting the results of these manual checks on fact checking portals or websites such as politifact.com or snopes.com. In a parallel development, researchers have recently started to view fact checking as a task that can be partially automated, using machine learning and NLP to automatically predict the veracity of claims. However, existing efforts either use small datasets consisting of naturally occurring claims (e.g. Mihalcea and Strapparava (2009); Zubiaga et al. (2016)), or datasets consisting of artificially constructed claims such as FEVER (Thorne et al., 2018). While the latter offer valuable contributions to further automatic claim verification work, they cannot replace real-world datasets.

Feature | Value
ClaimID | farg-00004
Claim | Mexico and Canada assemble cars with foreign parts and send them to the U.S. with no tax.
Label | distorts
Claim URL | https://www.factcheck.org/2018/10/factchecking-trump-on-trade/
Reason | None
Category | the-factcheck-wire
Speaker | Donald Trump
Checker | Eugene Kiely
Tags | North American Free Trade Agreement
Claim Entities | United States, Canada, Mexico
Article Title | FactChecking Trump on Trade
Publish Date | October 3, 2018
Claim Date | Monday, October 1, 2018

Table 1: An example of a claim instance. Entities are obtained via entity linking. Article and outlink texts, evidence search snippets and pages are not shown.

Contributions. We introduce the currently largest claim verification dataset of naturally occurring claims.1 It consists of 34,918 claims, collected from 26 fact checking websites in English; evidence pages to verify the claims; the context in which they occurred; and rich metadata (see Table 1 for an example). We perform a thorough analysis to identify characteristics of the dataset such as entities mentioned in claims. We demonstrate the utility of the dataset by training state of the art veracity prediction models, and find that evidence pages as well as metadata significantly contribute to model performance. Finally, we propose a novel model that jointly ranks evidence pages and performs veracity prediction. The best-performing model achieves a Macro F1 of 49.2%, showing that this is a non-trivial dataset with remaining challenges for future work.

1 The dataset is found here: https://copenlu.github.io/publication/2019_emnlp_augenstein/



2 Related Work

2.1 Datasets

Over the past few years, a variety of mostly small datasets related to fact checking have been released. An overview of core datasets is given in Table 2. The datasets can be grouped into four categories (I–IV). Category I contains datasets aimed at testing how well the veracity3 of a claim can be predicted using the claim alone, without context or evidence documents. Category II contains datasets bundled with documents related to each claim – either topically related to provide context, or serving as evidence. Those documents are, however, not annotated. Category III is for predicting veracity; these datasets encourage retrieving evidence documents as part of their task description, but do not distribute them. Finally, category IV comprises datasets annotated for both veracity and stance. Thus, every document is annotated with a label indicating whether the document supports or denies the claim, or is unrelated to it. Additional labels can then be added to the datasets to better predict veracity, for instance by jointly training stance and veracity prediction models.

Methods not shown in the table, but related to fact checking, are stance detection for claims (Ferreira and Vlachos, 2016; Pomerleau and Rao, 2017; Augenstein et al., 2016a; Kochkina et al., 2017; Augenstein et al., 2016b; Zubiaga et al., 2018; Riedel et al., 2017), satire detection (Rubin et al., 2016), clickbait detection (Karadzhov et al., 2017), conspiracy news detection (Tacchini et al., 2017), rumour cascade detection (Vosoughi et al., 2018) and claim perspectives detection (Chen et al., 2019).

Claims are obtained from a variety of sources, including Wikipedia, Twitter, criminal reports and fact checking websites such as politifact.com and snopes.com. The same goes for documents – these are often websites obtained through Web search queries, or Wikipedia documents, tweets or Facebook posts. Most datasets contain a fairly small number of claims, and those that do not often lack evidence documents. An exception is Thorne et al. (2018), who create a Wikipedia-based fact checking dataset. While a good testbed for developing deep neural architectures, their dataset is artificially constructed and can thus not take metadata about claims into account.

3 We use veracity, claim credibility, and fake news prediction interchangeably here – these terms are often conflated in the literature and meant to have the same meaning.

Contributions: We provide a dataset that, uniquely among extant datasets, contains a large number of naturally occurring claims and rich additional meta-information.

2.2 Methods

Fact checking methods partly depend on the type of dataset used. Methods only taking into account claims typically encode those with CNNs or RNNs (Wang, 2017; Perez-Rosas et al., 2018), and potentially encode metadata (Wang, 2017) in a similar way. Methods for small datasets often use hand-crafted features that are a mix of bag-of-words and other lexical features, e.g. LIWC, and then use those as input to an SVM or MLP (Mihalcea and Strapparava, 2009; Perez-Rosas et al., 2018; Baly et al., 2018). Some use additional Twitter-specific features (Enayet and El-Beltagy, 2017). More involved methods taking into account evidence documents, often trained on larger datasets, consist of evidence identification and ranking following a neural model that measures the compatibility between claim and evidence (Thorne et al., 2018; Mihaylova et al., 2018; Yin and Roth, 2018).

Contributions: The latter category above is the most related to our paper, as we consider evidence documents. However, existing models are not trained jointly for evidence identification, or for stance and veracity prediction, but rather employ a pipeline approach. Here, we show that a joint approach that learns to weigh evidence pages by their importance for veracity prediction can improve downstream veracity prediction performance.

3 Dataset Construction

We crawled a total of 43,837 claims with their metadata (see details in Table 11). We present the data collection in terms of selecting sources and crawling claims and associated metadata (Section 3.1); retrieving evidence pages (Section 3.2); and linking entities in the crawled claims (Section 3.3).

3.1 Selection of sources

We crawled all active fact checking websites in English listed by Duke Reporters' Lab4 and on the Fact Checking Wikipedia page.5

4 https://reporterslab.org/fact-checking/

5 https://en.wikipedia.org/wiki/Fact_checking


Dataset | # Claims | Labels | Metadata | Claim Sources

I: Veracity prediction w/o evidence
Wang (2017) | 12,836 | 6 | Yes | Politifact
Perez-Rosas et al. (2018) | 980 | 2 | No | News Websites

II: Veracity
Bachenko et al. (2008) | 275 | 2 | No | Criminal Reports
Mihalcea and Strapparava (2009) | 600 | 2 | No | Crowd Authors
Mitra and Gilbert (2015)† | 1,049 | 5 | No | Twitter
Ciampaglia et al. (2015)† | 10,000 | 2 | No | Google, Wikipedia
Popat et al. (2016) | 5,013 | 2 | Yes | Wikipedia, Snopes
Shu et al. (2018)† | 23,921 | 2 | Yes | Politifact, gossipcop.com
Datacommons Fact Check | 10,564 | 2-6 | Yes | Fact Checking Websites

III: Veracity (evidence encouraged, but not provided)
Barrón-Cedeño et al. (2018) | 150 | 3 | No | factcheck.org, Snopes

IV: Veracity + stance
Vlachos and Riedel (2014) | 106 | 5 | Yes | Politifact, Channel 4 News
Zubiaga et al. (2016) | 330 | 3 | Yes | Twitter
Derczynski et al. (2017) | 325 | 3 | Yes | Twitter
Baly et al. (2018) | 422 | 2 | No | ara.reuters.com, verify-sy.com
Thorne et al. (2018)† | 185,445 | 3 | No | Wikipedia

V: Veracity + evidence relevancy
MultiFC | 36,534 | 2-40 | Yes | Fact Checking Websites

Table 2: Comparison of fact checking datasets. † indicates claims are not 'naturally occurring': Mitra and Gilbert (2015) use events as claims; Ciampaglia et al. (2015) use DBpedia triples as claims; Shu et al. (2018) use tweets as claims; and Thorne et al. (2018) rewrite sentences in Wikipedia as claims.

This resulted in 38 websites in total (shown in Table 11). Out of these, ten websites could not be crawled, as further detailed in Table 9. In the later experimental descriptions, we refer to the part of the dataset crawled from a specific fact checking website as a domain, and we refer to each website as a source.

From each source, we crawled the ID, claim, label, URL, reason for label, categories, person making the claim (speaker), person fact checking the claim (checker), tags, article title, publication date, claim date, as well as the full text that appears when the claim is clicked. Lastly, the above full text contains hyperlinks, so we further crawled the full text that appears when each of those hyperlinks is clicked (outlinks).

There were a number of crawling issues, e.g. security protection of websites with SSL/TLS protocols, time outs, URLs that pointed to PDF files instead of HTML content, or unresolvable encoding. In all of these cases, the content could not be retrieved. For some websites, no veracity labels were available, in which case they were not selected as domains for training a veracity prediction model. Moreover, not all types of metadata (category, speaker, checker, tags, claim date, publish date) were available for all websites; and availability of articles and full texts differs as well.

We performed semi-automatic cleansing of the dataset as follows. First, we double-checked that the veracity labels would not appear in claims. For some domains, the first or last sentence of the claim would sometimes contain the veracity label, in which case we would discard either the full sentence or part of the sentence. Next, we checked the dataset for duplicate claims. We found 202 such instances, 69 of them with different labels. Upon manual inspection, this was mainly due to them appearing on different websites, with labels not differing much in practice (e.g. 'Not true' vs. 'Mostly False'). We made sure that all such duplicate claims would be in the training split of the dataset, so that the models would not have an unfair advantage. Finally, we performed some minor manual merging of label types for the same domain where it was clear that they were supposed to denote the same level of veracity (e.g. 'distorts', 'distorts the facts').

This resulted in a total of 36,534 claims with their metadata. For the purposes of fact verification, we discarded instances with labels that occur fewer than 5 times, resulting in 34,918 claims. The number of instances, as well as labels per domain, are shown in Table 6, and label names in Table 10 in the appendix. The dataset is split into a training part (80%) and a development and testing part (10% each) in a label-stratified manner. Note that the domains vary in the number of labels, ranging from 2 to 27. Labels include both straightforward ratings of veracity ('correct', 'incorrect'), but also labels that would be more difficult to map onto a veracity scale (e.g. 'grass roots movement!', 'misattributed', 'not the whole story'). We therefore do not postprocess label types across domains to map them onto the same scale, and rather treat them as is. In the methodology section (Section 4), we show how a model can be trained on this dataset regardless, by framing this multi-domain veracity prediction task as a multi-task learning (MTL) one.
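To make this preprocessing step concrete, the following is a minimal sketch (not the authors' released code) of how rare labels could be filtered out and an 80/10/10 split produced, assuming the claims are loaded into a pandas DataFrame with `claim` and `label` columns; for simplicity only the first split is stratified here, since classes near the 5-instance threshold would need special handling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def filter_and_split(df: pd.DataFrame, min_label_count: int = 5, seed: int = 42):
    """Drop claims whose label occurs fewer than `min_label_count` times,
    then produce an 80/10/10 train/dev/test split."""
    counts = df["label"].value_counts()
    keep = counts[counts >= min_label_count].index
    df = df[df["label"].isin(keep)]

    # 80% train, 20% held out, stratified by label
    train, heldout = train_test_split(
        df, test_size=0.2, stratify=df["label"], random_state=seed)
    # Split the held-out 20% evenly into dev and test (10% each overall);
    # not stratified here to keep the sketch robust for very small classes.
    dev, test = train_test_split(heldout, test_size=0.5, random_state=seed)
    return train, dev, test
```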

3.2 Retrieving Evidence Pages

The text of each claim is submitted verbatim as a query to the Google Search API (without quotes). The 10 most highly ranked search results are retrieved, for each of which we save the title; Google search rank; URL; time stamp of last update; search snippet; as well as the full Web page. We acknowledge that search results change over time, which might have an effect on veracity prediction. However, studying such temporal effects is outside the scope of this paper. Similar to Web crawling claims, as described in Section 3.1, the corresponding Web pages can in some cases not be retrieved, in which case fewer than 10 evidence pages are available. The resulting evidence pages are from a wide variety of URL domains, though with a predictable skew towards popular websites, such as Wikipedia or The Guardian (see Table 3 for detailed statistics).
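As an illustration of this retrieval step, the sketch below queries Google's Custom Search JSON API for a claim and keeps the top 10 results. The paper does not state which Google endpoint was used, so the endpoint, the API key and the search-engine ID (cx) are assumptions made for illustration only.

```python
import requests

SEARCH_URL = "https://www.googleapis.com/customsearch/v1"  # assumed endpoint

def retrieve_evidence(claim: str, api_key: str, cx: str, k: int = 10):
    """Return up to k search results (title, url, snippet) for a claim."""
    params = {"key": api_key, "cx": cx, "q": claim, "num": k}
    response = requests.get(SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    items = response.json().get("items", [])
    return [{"title": it.get("title"),
             "url": it.get("link"),
             "snippet": it.get("snippet")} for it in items]
```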

Domain | %
https://en.wikipedia.org/ | 4.425
https://www.snopes.com/ | 3.992
https://www.washingtonpost.com/ | 3.025
https://www.nytimes.com/ | 2.478
https://www.theguardian.com/ | 1.807
https://www.youtube.com/ | 1.712
https://www.dailymail.co.uk/ | 1.558
https://www.usatoday.com/ | 1.279
https://www.politico.com/ | 1.241
http://www.politifact.com/ | 1.231
https://www.pinterest.com/ | 1.169
https://www.factcheck.org/ | 1.09
https://www.gossipcop.com/ | 1.073
https://www.cnn.com/ | 1.065
https://www.npr.org/ | 0.957
https://www.forbes.com/ | 0.911
https://www.vox.com/ | 0.89
https://www.theatlantic.com/ | 0.88
https://twitter.com/ | 0.767
https://www.hoax-slayer.net/ | 0.655
http://time.com/ | 0.554
https://www.bbc.com/ | 0.551
https://www.nbcnews.com/ | 0.515
https://www.cnbc.com/ | 0.514
https://www.cbsnews.com/ | 0.503
https://www.facebook.com/ | 0.5
https://www.newyorker.com/ | 0.495
https://www.foxnews.com/ | 0.468
https://people.com/ | 0.439
http://www.cnn.com/ | 0.419

Table 3: The top 30 most frequently occurring URL domains.

Figure 1: Distribution of entities in claims.

3.3 Entity Detection and Linking

To better understand what claims are about, we conduct entity linking for all claims. Specifically, mentions of people, places, organisations, and other named entities within a claim are recognised and linked to their respective Wikipedia pages, if available. Where there are different entities with the same name, they are disambiguated. For this, we apply the state-of-the-art neural entity linking model by Kolitsas et al. (2018). This results in a total of 25,763 entities detected and linked to Wikipedia, with a total of 15,351 claims involved, meaning that 42% of all claims contain entities that can be linked to Wikipedia. Later on, we use entities as additional metadata (see Section 4.3). The distribution of claim numbers according to the number of entities they contain is shown in Figure 1. We observe that the majority of claims have one to four entities, and the maximum number of 35 entities occurs in one claim only. Out of the 25,763 entities, 2,767 are unique entities. The top 30 most frequent entities are listed in Table 4. This clearly shows that most of the claims involve entities related to the United States, which is to be expected, as most of the fact checking websites are US-based.
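The paper uses the end-to-end neural entity linker of Kolitsas et al. (2018). As a rough, swapped-in illustration of the same idea, the sketch below uses spaCy's named entity recogniser together with a naive title-to-URL heuristic; the spaCy model and the direct Wikipedia-title mapping are simplifying assumptions, not the authors' pipeline, and in particular they perform no disambiguation.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def link_entities(claim: str):
    """Detect named entities in a claim and map each to a candidate
    Wikipedia URL by title (no disambiguation, unlike Kolitsas et al.)."""
    doc = nlp(claim)
    links = []
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "GPE", "LOC", "NORP", "FAC"}:
            title = ent.text.strip().replace(" ", "_")
            links.append((ent.text, f"https://en.wikipedia.org/wiki/{title}"))
    return links

print(link_entities("Mexico and Canada assemble cars with foreign parts "
                    "and send them to the U.S. with no tax."))
```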

4 Claim Veracity Prediction

We train several models to predict the veracity of claims. Those fall into two categories: those that only consider the claims themselves, and those that encode evidence pages as well. In addition, claim metadata (speaker, checker, linked entities) is optionally encoded for both categories of models, and ablation studies with and without that metadata are shown. We first describe the base model used in Section 4.1, followed by introducing our novel evidence ranking and veracity prediction model in Section 4.2, and lastly the metadata encoding model in Section 4.3.

Entity | Frequency
United States | 2810
Barack Obama | 1598
Republican Party (United States) | 783
Texas | 665
Democratic Party (United States) | 560
Donald Trump | 556
Wisconsin | 471
United States Congress | 354
Hillary Rodham Clinton | 306
Bill Clinton | 292
California | 285
Russia | 275
Ohio | 239
China | 229
George W. Bush | 208
Medicare (United States) | 206
Australia | 186
Iran | 183
Brad Pitt | 180
Islam | 178
Iraq | 176
Canada | 174
White House | 166
New York City | 164
Washington, D.C. | 164
Jennifer Aniston | 163
Mexico | 158
Ted Cruz | 152
Federal Bureau of Investigation | 146
Syria | 130

Table 4: Top 30 most frequent entities, listed by their Wikipedia URL with prefix omitted.

4.1 Multi-Domain Claim Veracity Prediction with Disparate Label Spaces

Since not all fact checking websites use the same claim labels (see Table 6, and Table 10 in the appendix), training a claim veracity prediction model is not entirely straightforward. One option would be to manually map those labels onto one another. However, since the sheer number of labels is rather large (165), and it is not always clear from the guidelines on fact checking websites how they can be mapped onto one another, we opt to learn how these labels relate to one another as part of the veracity prediction model. To do so, we employ the multi-task learning (MTL) approach inspired by collaborative filtering presented in Augenstein et al. (2018) (MTL with LEL – multi-task learning with a label embedding layer), which excels on pairwise sequence classification tasks with disparate label spaces. More concretely, each domain is modelled as its own task in an MTL architecture, and labels are projected into a fixed-length label embedding space. Predictions are then made by taking the dot product between the claim-evidence embeddings and the label embeddings. By doing so, the model implicitly learns how semantically close the labels are to one another, and can benefit from this knowledge when making predictions for individual tasks, which on their own might only have a small number of instances. When making predictions for individual domains/tasks, both at training and at test time, as well as when calculating the loss, a mask is applied such that predictions are restricted to the set of labels known for that task.

Note that the setting here slightly differs from Augenstein et al. (2018). There, tasks are less strongly related to one another; for example, they consider stance detection, aspect-based sentiment analysis and natural language inference. Here, we have different domains, as opposed to conceptually different tasks, but use their framework, as we have the same underlying problem of disparate label spaces. A more formal problem definition follows next, as our evidence ranking and veracity prediction model in Section 4.2 then builds on it.

4.1.1 Problem Definition

We frame our problem as a multi-task learning one, where access to labelled datasets for T tasks T_1, ..., T_T is given at training time, with a target task T_T that is of particular interest. The training dataset for task T_i consists of N examples X^{T_i} = {x_1^{T_i}, ..., x_N^{T_i}} and their labels Y^{T_i} = {y_1^{T_i}, ..., y_N^{T_i}}. The base model is a classic deep neural network MTL model (Caruana, 1993) that shares its parameters across tasks and has task-specific softmax output layers that output a probability distribution p^{T_i} for task T_i:

p^{T_i} = softmax(W^{T_i} h + b^{T_i})    (1)

where softmax(x) = e^x / Σ_{i=1}^{‖x‖} e^{x_i}, W^{T_i} ∈ R^{L_i × h} and b^{T_i} ∈ R^{L_i} are the weight matrix and bias term of the output layer of task T_i respectively, h ∈ R^h is the jointly learned hidden representation, L_i is the number of labels for task T_i, and h is the dimensionality of h. The MTL model is trained to minimise the sum of individual task losses L_1 + ... + L_T using a negative log-likelihood objective.
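For concreteness, here is a minimal PyTorch-style sketch (an illustration, not the authors' implementation) of the hard-parameter-sharing base model of Eq. (1): a shared encoder with one softmax output layer per task/domain, trained with a negative log-likelihood objective.

```python
import torch
import torch.nn as nn

class SharedMTLModel(nn.Module):
    """Shared encoder with task-specific softmax heads (Eq. 1)."""
    def __init__(self, encoder: nn.Module, hidden_dim: int, labels_per_task: list[int]):
        super().__init__()
        self.encoder = encoder  # maps a batch of instances to h ∈ R^hidden_dim
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n_labels) for n_labels in labels_per_task])

    def forward(self, inputs, task_id: int):
        h = self.encoder(inputs)                  # jointly learned representation
        logits = self.heads[task_id](h)           # W^{T_i} h + b^{T_i}
        return torch.log_softmax(logits, dim=-1)  # log p^{T_i}

# Training sums the per-task losses; with log-probabilities as above,
# nn.NLLLoss() gives the negative log-likelihood objective per task.
loss_fn = nn.NLLLoss()
```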

Label Embedding Layer. To learn the relationships between labels, a Label Embedding Layer (LEL) embeds the labels of all tasks in a joint Euclidean space. Instead of training separate softmax output layers as above, a label compatibility function c(·, ·) measures how similar a label with embedding l is to the hidden representation h:

c(l, h) = l · h    (2)

where · is the dot product. Padding is applied such that l and h have the same dimensionality. Matrix multiplication and softmax are used for making predictions:

p = softmax(Lh)    (3)

where L ∈ R^{(Σ_i L_i) × l} is the label embedding matrix for all tasks and l is the dimensionality of the label embeddings. We apply a task-specific mask to L in order to obtain a task-specific probability distribution p^{T_i}. The LEL is shared across all tasks, which allows the model to learn the relationships between labels in the joint embedding space.
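A sketch of how the label embedding layer with a task-specific mask could be implemented follows (again an illustrative approximation rather than the released code); it assumes the labels of all tasks are stacked into a single embedding matrix and that labels not belonging to the current task are masked out before the softmax.

```python
import torch
import torch.nn as nn

class LabelEmbeddingLayer(nn.Module):
    """Scores h against all label embeddings (Eqs. 2-3), masking labels
    that do not belong to the current task."""
    def __init__(self, total_labels: int, emb_dim: int):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(total_labels, emb_dim) * 0.01)

    def forward(self, h: torch.Tensor, task_mask: torch.Tensor):
        # h: (batch, emb_dim); task_mask: (total_labels,), 1 for valid labels
        scores = h @ self.label_emb.T                      # c(l, h) for every label
        scores = scores.masked_fill(task_mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1)               # task-specific p^{T_i}
```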

4.2 Joint Evidence Ranking and Claim Veracity Prediction

So far, we have ignored the issue of how to obtain claim representations, as the base model described in the previous section is agnostic to how instances are encoded. A very simple approach, which we report as a baseline, is to encode claim texts only. Such a model ignores evidence for and against a claim, and ends up guessing the veracity based on surface patterns observed in the claim texts.

We next introduce two variants of evidence-based veracity prediction models that encode 10 pieces of evidence in addition to the claim. Here, we opt to encode search snippets as opposed to whole retrieved pages. While the latter would also be possible, it comes with a number of additional challenges, such as encoding large documents, parsing tables or PDF files, and encoding images or videos on these pages, which we leave to future work. Search snippets also have the benefit that they already contain summaries of the part of the page content that is most related to the claim.

Figure 2: The Joint Veracity Prediction and Evidence Ranking model, shown for one task.

4.2.1 Problem Definition

Our problem is to obtain encodings for N examples X^{T_i} = {x_1^{T_i}, ..., x_N^{T_i}}. For simplicity, we will henceforth drop the task superscript and refer to instances as X = {x_1, ..., x_N}, as instance encodings are learned in a task-agnostic fashion. Each example further consists of a claim a_i and k = 10 evidence pages E_k = {e_{1_0}, ..., e_{N_{10}}}.

Each claim and evidence page is encoded with a BiLSTM to obtain a sentence embedding, which is the concatenation of the last state of the forward and backward reading of the sentence, i.e. h = BiLSTM(·), where h is the sentence embedding.

Next, we want to combine claim and evidence sentence embeddings into joint instance representations. In the simplest case, referred to as model variant crawled avg, we mean average the BiLSTM sentence embeddings of all evidence pages (signified by the overline) and concatenate those with the claim embeddings, i.e.

s_i^g = [h_{a_i} ; h̄_{E_i}]    (4)

where s_i^g is the resulting encoding for training example i and [· ; ·] denotes vector concatenation.

Page 7: arXiv:1909.03242v2 [cs.CL] 21 Oct 2019Casper Hansen Christian Hansen Jakob Grue Simonsen Department of Computer Science University of Copenhagen faugenstein,c.lioma,wang,lcl,c.hansen,chrh,simonseng@di.ku.dk

However, this has the disadvantage that all evidence pages are considered equal.
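Before moving to the ranking model, here is a compact PyTorch-style sketch (illustrative only) of the encoding just described: a BiLSTM sentence encoder whose output is the concatenation of the final forward and backward states, and the crawled avg combination of Eq. (4) that concatenates the claim embedding with the mean of the evidence embeddings.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """h = BiLSTM(x): concatenation of last forward and backward states."""
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden)

def crawled_avg(claim_emb: torch.Tensor, evidence_embs: torch.Tensor) -> torch.Tensor:
    """Eq. (4): concatenate the claim embedding with the mean evidence embedding.
    claim_emb: (batch, d); evidence_embs: (batch, k, d)."""
    return torch.cat([claim_emb, evidence_embs.mean(dim=1)], dim=-1)
```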

Evidence Ranking. The alternative instance encoding model proposed here, crawled ranked, which achieves the highest overall performance as discussed in Section 5, learns the compatibility between an instance's claim and each evidence page. It ranks evidence pages by their utility for the veracity prediction task, and then uses the resulting ranking to obtain a weighted combination of all claim-evidence pairs. No direct labels are available to learn the ranking of individual documents, only for the veracity of the associated claim, so the model has to learn evidence ranks implicitly.

To combine claim and evidence representations, we use the matching model proposed for the task of natural language inference by Mou et al. (2016) and adapt it to combine an instance's claim representation with each evidence representation, i.e.

s_{ij}^r = [h_{a_i} ; h_{e_{ij}} ; h_{a_i} − h_{e_{ij}} ; h_{a_i} · h_{e_{ij}}]    (5)

where s_{ij}^r is the resulting encoding for training example i and evidence page j, [· ; ·] denotes vector concatenation, and · denotes the dot product.

All joint claim-evidence representations s_{i_0}^r, ..., s_{i_{10}}^r are then projected into the binary space via a fully connected layer FC, followed by a non-linear activation function f, to obtain a soft ranking of claim-evidence pairs, in practice a 10-dimensional vector,

o_i = [f(FC(s_{i_0}^r)) ; ... ; f(FC(s_{i_{10}}^r))]    (6)

where [· ; ·] denotes concatenation.

Scores for all labels are obtained as per (6) above, with the same input instance embeddings as for the evidence ranker, i.e. s_{ij}^r. Final predictions for all claim-evidence pairs are then obtained by taking the dot product between the label scores and binary evidence ranking scores, i.e.

p_i = softmax(c(l, s_i^r) · o_i)    (7)

Note that the novelty here is that, unlike for the model described in Mou et al. (2016), we have no direct labels for learning weights for this matching model. Rather, our model has to implicitly learn these weights for each claim-evidence pair in an end-to-end fashion given the veracity labels.
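To make Eqs. (5)-(7) concrete, the following is an illustrative sketch rather than the authors' implementation; the element-wise use of the '·' interaction, the sigmoid chosen for the non-linearity f, and the linear projection into the label embedding space (standing in for the padding mentioned after Eq. (2)) are all assumptions.

```python
import torch
import torch.nn as nn

class JointRankingVeracity(nn.Module):
    """Sketch of Eqs. (5)-(7): score each claim-evidence pair against the
    label embeddings, weight the pairs by a learned soft ranking, and
    softmax the aggregated, task-masked label scores."""
    def __init__(self, sent_dim: int, total_labels: int, label_dim: int):
        super().__init__()
        pair_dim = 4 * sent_dim                       # Eq. (5) concatenation
        self.rank_fc = nn.Linear(pair_dim, 1)         # FC in Eq. (6)
        self.to_label_space = nn.Linear(pair_dim, label_dim)
        self.label_emb = nn.Parameter(torch.randn(total_labels, label_dim) * 0.01)

    def forward(self, h_claim, h_evidence, task_mask):
        # h_claim: (batch, d); h_evidence: (batch, k, d)
        h_c = h_claim.unsqueeze(1).expand_as(h_evidence)
        pair = torch.cat([h_c, h_evidence, h_c - h_evidence, h_c * h_evidence], dim=-1)
        ranks = torch.sigmoid(self.rank_fc(pair))                     # o_i: (batch, k, 1)
        label_scores = self.to_label_space(pair) @ self.label_emb.T  # c(l, s^r_ij)
        scores = (ranks * label_scores).sum(dim=1)                    # rank-weighted sum
        scores = scores.masked_fill(task_mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1)                          # p_i
```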

Model | Micro F1 | Macro F1
claim-only | 0.469 | 0.253
claim-only embavg | 0.384 | 0.302
crawled-docavg | 0.438 | 0.248
crawled ranked | 0.613 | 0.441
claim-only + meta | 0.494 | 0.324
claim-only embavg + meta | 0.418 | 0.333
crawled-docavg + meta | 0.483 | 0.286
crawled ranked + meta | 0.625 | 0.492

Table 5: Results with different model variants on the test set; 'meta' means all metadata is used.

4.3 Metadata

We experiment with how useful claim metadata is, and encode the following as one-hot vectors: speaker, category, tags and linked entities. We do not encode 'Reason' as it gives away the label, and do not include 'Checker' as there are too many unique checkers for this information to be relevant. The claim publication date is potentially relevant, but it does not make sense to merely model this as a one-hot feature, so we leave incorporating temporal information to future work. Since all metadata consists of individual words and phrases, a sequence encoder is not necessary, and we opt for a CNN followed by a max pooling operation, as used in Wang (2017), to encode metadata for fact checking. The max-pooled metadata representations, denoted h_m, are then concatenated with the instance representations; e.g. for the most elaborate model, crawled ranked, these would be concatenated with s_{ij}^r.
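The sketch below illustrates one way such a metadata encoder could look. It is an assumption-laden illustration: the metadata tokens are passed through an embedding for convenience, and a small kernel size is used here even though the grid in Section 5.1 reports 32 filters of size 32.

```python
import torch
import torch.nn as nn

class MetadataCNN(nn.Module):
    """CNN + max pooling over metadata tokens (speaker, category, tags,
    entities), producing a fixed-size vector h_m."""
    def __init__(self, vocab_size: int, emb_dim: int = 64,
                 num_filters: int = 32, kernel_size: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size, padding="same")
        self.act = nn.ReLU()

    def forward(self, meta_ids: torch.Tensor) -> torch.Tensor:
        # meta_ids: (batch, seq_len) integer ids of metadata tokens
        x = self.embed(meta_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        x = self.act(self.conv(x))                 # (batch, num_filters, seq_len)
        h_m = x.max(dim=-1).values                 # max pooling over positions
        return h_m                                 # concatenated with s^r_ij downstream
```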

5 Experiments

5.1 Experimental Setup

The base sentence embedding model is a BiLSTM over all words in the respective sequences with randomly initialised word embeddings, following Augenstein et al. (2018). We opt for this strong baseline sentence encoding model, as opposed to engineering sentence embeddings that work particularly well for this dataset, to showcase the dataset. We would expect pre-trained contextual encoding models, e.g. ELMo (Peters et al., 2018), ULMFit (Howard and Ruder, 2018) and BERT (Devlin et al., 2018), to offer complementary performance gains, as has been shown in a few recent papers (Wang et al., 2018a; Rajpurkar et al., 2018).

For claim veracity prediction without evidence documents with the MTL with LEL model, we use the following sentence encoding variants: claim-only, which uses a BiLSTM-based sentence embedding as input, and claim-only embavg, which uses a sentence embedding based on mean averaged word embeddings as input.

We train one multi-task model per task (i.e., one model per domain). We perform a grid search over the following hyperparameters, tuned on the respective dev set, and evaluate on the corresponding test set (final settings are underlined): word embedding size [64, 128, 256], BiLSTM hidden layer size [64, 128, 256], number of BiLSTM hidden layers [1, 2, 3], BiLSTM dropout on input and output layers [0.0, 0.1, 0.2, 0.5], word-by-word attention for the BiLSTM with window size 10 (Bahdanau et al., 2014) [True, False], skip connections for the BiLSTM [True, False], batch size [32, 64, 128], label embedding size [16, 32, 64]. We use ReLU as the activation function for both the BiLSTM and the CNN. For the CNN, the following hyperparameters are used: number of filters [32], kernel size [32]. We train using cross-entropy loss and the RMSProp optimiser with an initial learning rate of 0.001, and perform early stopping on the dev set with a patience of 3.
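As a small illustration of this tuning procedure, the snippet below enumerates the stated hyperparameter grid with itertools.product; the train_and_evaluate callable is a hypothetical stand-in for training one multi-task model and returning its dev-set Macro F1.

```python
import itertools

GRID = {
    "word_emb_size": [64, 128, 256],
    "lstm_hidden": [64, 128, 256],
    "lstm_layers": [1, 2, 3],
    "dropout": [0.0, 0.1, 0.2, 0.5],
    "word_attention": [True, False],
    "skip_connections": [True, False],
    "batch_size": [32, 64, 128],
    "label_emb_size": [16, 32, 64],
}

def grid_search(train_and_evaluate):
    """Try every configuration and keep the one with the best dev Macro F1."""
    best_score, best_cfg = float("-inf"), None
    keys = list(GRID)
    for values in itertools.product(*(GRID[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_evaluate(cfg)   # hypothetical training routine
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```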

5.2 Results

For each domain, we compute the Micro as well as the Macro F1, then mean average the results over all domains. Core results with all vs. no metadata are shown in Table 5. We first experiment with different base model variants and find that label embeddings improve results, and that the best proposed models utilising multiple domains outperform single-task models (see Table 8). This corroborates the findings of Augenstein et al. (2018). Per-domain results with the best model are shown in Table 6. Domain names are hereafter abbreviated for brevity; see Table 11 in the appendix for correspondences to full website names. Unsurprisingly, it is hard to achieve a high Macro F1 for domains with many labels, e.g. tron and snes. Further, some domains, surprisingly mostly with small numbers of instances, seem to be very easy – a perfect Micro and Macro F1 score of 1.0 is achieved on ranz, bove, huca, fani and thal. We find that for those domains, the verdict is often already revealed as part of the claim using explicit wording.
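A minimal sketch of this evaluation protocol, assuming per-domain lists of gold and predicted labels, is given below using scikit-learn's f1_score.

```python
from sklearn.metrics import f1_score

def evaluate(per_domain_gold: dict, per_domain_pred: dict):
    """Compute Micro and Macro F1 per domain, then mean average over domains."""
    micro, macro = [], []
    for domain, gold in per_domain_gold.items():
        pred = per_domain_pred[domain]
        micro.append(f1_score(gold, pred, average="micro"))
        macro.append(f1_score(gold, pred, average="macro"))
    return sum(micro) / len(micro), sum(macro) / len(macro)
```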

Domain | # Insts | # Labs | Micro F1 | Macro F1
ranz | 21 | 2 | 1.000 | 1.000
bove | 295 | 2 | 1.000 | 1.000
abbc | 436 | 3 | 0.463 | 0.453
huca | 34 | 3 | 1.000 | 1.000
mpws | 47 | 3 | 0.667 | 0.583
peck | 65 | 3 | 0.667 | 0.472
faan | 111 | 3 | 0.682 | 0.679
clck | 38 | 3 | 0.833 | 0.619
fani | 20 | 3 | 1.000 | 1.000
chct | 355 | 4 | 0.550 | 0.513
obry | 59 | 4 | 0.417 | 0.268
vees | 504 | 4 | 0.721 | 0.425
faly | 111 | 5 | 0.278 | 0.5
goop | 2943 | 6 | 0.822 | 0.387
pose | 1361 | 6 | 0.438 | 0.328
thet | 79 | 6 | 0.55 | 0.37
thal | 163 | 7 | 1.000 | 1.000
afck | 433 | 7 | 0.357 | 0.259
hoer | 1310 | 7 | 0.694 | 0.549
para | 222 | 7 | 0.375 | 0.311
wast | 201 | 7 | 0.344 | 0.214
vogo | 654 | 8 | 0.594 | 0.297
pomt | 15390 | 9 | 0.321 | 0.276
snes | 6455 | 12 | 0.551 | 0.097
farg | 485 | 11 | 0.500 | 0.140
tron | 3423 | 27 | 0.429 | 0.046
avg | – | 7.17 | 0.625 | 0.492

Table 6: Total number of instances and unique labels per domain, as well as per-domain results with model crawled ranked + meta, sorted by label size.

Metadata | Micro F1 | Macro F1
None | 0.627 | 0.441
Speaker | 0.602 | 0.435
+ Tags | 0.608 | 0.460
Tags | 0.585 | 0.461
Entity | 0.569 | 0.427
+ Speaker | 0.607 | 0.477
+ Tags | 0.625 | 0.492

Table 7: Ablation results with base model crawled ranked for different types of metadata.

Model | Micro F1 | Macro F1
STL | 0.527 | 0.388
MTL | 0.556 | 0.448
MTL + LEL | 0.625 | 0.492

Table 8: Ablation results with crawled ranked + meta encoding for STL vs. MTL vs. MTL + LEL training.

Claim-Only vs. Evidence-Based Veracity Prediction. Our evidence-based claim veracity prediction models outperform claim-only veracity prediction models by a large margin. Unsurprisingly, claim-only embavg is outperformed by claim-only. Further, crawled ranked is our best-performing model in terms of Micro F1 and Macro F1, meaning that our model captures that not every piece of evidence is equally important, and can utilise this for veracity prediction.

Figure 3: Confusion matrix of predicted labels with the best-performing model, crawled ranked + meta, on the 'pomt' domain.

Metadata. We perform an ablation analysis of how metadata impacts results, shown in Table 7. Out of the different types of metadata, topic tags on their own contribute the most. This is likely because they offer highly complementary information to the claim text of evidence pages. Only using all metadata together achieves a higher Macro F1 at similar Micro F1 than using no metadata at all. To further investigate this, we split the test set into those instances for which no metadata is available vs. those for which metadata is available. We find that encoding metadata within the model hurts performance for domains where no metadata is available, but improves performance where it is. In practice, an ensemble of both types of models would be sensible, as well as exploring more involved methods of encoding metadata.

6 Analysis and Discussion

An analysis of labels frequently confused with one another, for the largest domain 'pomt' and the best-performing model crawled ranked + meta, is shown in Figure 3. The diagonal represents when gold and predicted labels match, and the numbers signify the number of test instances. One can observe that the model struggles more to detect claims with the label 'true' than those with the label 'false'. Generally, many confusions occur over close labels, e.g. 'half-true' vs. 'mostly true'.

We further analyse what properties instances that are predicted correctly vs. incorrectly have, using the model crawled ranked + meta. We find that, unsurprisingly, longer claims are harder to classify correctly, and that claims with a high direct token overlap with evidence pages lead to a high evidence ranking. When it comes to frequently occurring tags and entities, very general tags such as 'government-and-politics' or 'tax' that do not give away much frequently co-occur with incorrect predictions, whereas more specific tags such as 'brisbane-4000' or 'hong-kong' tend to co-occur with correct predictions. Similar trends are observed for bigrams. This means that the model has an easy time succeeding on instances where the claims are short, where specific topics tend to co-occur with certain veracities, and where evidence documents are highly informative. Instances with longer, more complex claims where evidence is ambiguous remain challenging.

7 Conclusions

We present a new, real-world fact checking dataset, currently the largest of its kind. It consists of 34,918 claims collected from 26 fact checking websites, rich metadata and 10 retrieved evidence pages per claim. We find that encoding the metadata as well as the evidence pages helps, and introduce a new joint model for ranking evidence pages and predicting veracity.

Acknowledgments

This research is partially supported by QUARTZ (721321, EU H2020 MSCA-ITN) and DABAI (5153-00004A, Innovation Fund Denmark).

References

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016a. Stance Detection with Bidirectional Conditional Encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 876–885, Austin, Texas. Association for Computational Linguistics.

Isabelle Augenstein, Sebastian Ruder, and Anders Søgaard. 2018. Multi-Task Learning of Pairwise Sequence Classification Tasks over Disparate Label Spaces. In NAACL-HLT, pages 1896–1906. Association for Computational Linguistics.

Isabelle Augenstein, Andreas Vlachos, and Kalina Bontcheva. 2016b. USFD at SemEval-2016 Task 6: Any-Target Stance Detection on Twitter with Autoencoders. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 389–393, San Diego, California. Association for Computational Linguistics.


Joan Bachenko, Eileen Fitzpatrick, and Michael Schonwetter. 2008. Verification and implementation of language-based deception indicators in civil and criminal narratives. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 41–48. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.

Ramy Baly, Mitra Mohtarami, James R. Glass, Lluís Màrquez, Alessandro Moschitti, and Preslav Nakov. 2018. Integrating Stance Detection and Fact Checking in a Unified Corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 21–27. Association for Computational Linguistics.

Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Pepa Atanasova, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, and Preslav Nakov. 2018. Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims. Task 2: Factuality. In CLEF (Working Notes), volume 2125 of CEUR Workshop Proceedings. CEUR-WS.org.

Rich Caruana. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of ICML.

Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing Things from a Different Angle: Discovering Diverse Perspectives about Claims. In Proceedings of NAACL.

G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, and A. Flammini. 2015. Computational Fact Checking from Knowledge Networks. PLoS One, 10(6).

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 69–76. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Omar Enayet and Samhaa R. El-Beltagy. 2017. NileTMRG at SemEval-2017 Task 8: Determining Rumour and Veracity Support for Rumours on Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 470–474. Association for Computational Linguistics.

William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In HLT-NAACL, pages 1163–1168. The Association for Computational Linguistics.

Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, and Felix Caspelherr. 2017. Team Athene on the Fake News Challenge.

Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In ACL (1), pages 328–339. Association for Computational Linguistics.

Georgi Karadzhov, Pepa Gencheva, Preslav Nakov, and Ivan Koychev. 2017. We Built a Fake News / Click Bait Filter: What Happened Next Will Blow Your Mind! In RANLP 2017, pages 334–343.

Hamid Karimi, Proteek Roy, Sari Saba-Sadiya, and Jiliang Tang. 2018. Multi-Source Multi-Class Fake News Detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1546–1557.

Elena Kochkina, Maria Liakata, and Isabelle Augenstein. 2017. Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 475–480, Vancouver, Canada. Association for Computational Linguistics.

Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-End Neural Entity Linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 519–529. Association for Computational Linguistics.

Rada Mihalcea and Carlo Strapparava. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 309–312. Association for Computational Linguistics.

Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, Mitra Mohtarami, Georgi Karadzhov, and James R. Glass. 2018. Fact Checking in Community Forums. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press.

Tanushree Mitra and Eric Gilbert. 2015. Credbank: A large-scale social media corpus with associated credibility annotations. In ICWSM, pages 258–267.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural Language Inference by Tree-Based Convolution and Heuristic Matching. In ACL (2). The Association for Computer Linguistics.


Veronica Perez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. 2018. Automatic Detection of Fake News. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3391–3401. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL-HLT, pages 2227–2237. Association for Computational Linguistics.

Dean Pomerleau and Delip Rao. 2017. The Fake News Challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news. http://www.fakenewschallenge.org/. Accessed: 2019-02-14.

Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2016. Credibility Assessment of Textual Claims on the Web. In CIKM, pages 2173–2178.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In ACL (2), pages 784–789. Association for Computational Linguistics.

Benjamin Riedel, Isabelle Augenstein, Georgios P. Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the Fake News Challenge stance detection task. CoRR, abs/1707.03264.

Victoria Rubin, Niall Conroy, Yimin Chen, and Sarah Cornwell. 2016. Fake News or Truth? Using Satirical Cues to Detect Potentially Misleading News. In Proceedings of the Second Workshop on Computational Approaches to Deception Detection, pages 7–17. Association for Computational Linguistics.

Giovanni C. Santia and Jake Ryland Williams. 2018. BuzzFace: A News Veracity Dataset with Facebook User Commentary and Egos. ICWSM, 531:540.

K. Shu, D. Mahudeswaran, S. Wang, D. Lee, and H. Liu. 2018. FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media. ArXiv e-prints.

Eugenio Tacchini, Gabriele Ballarin, Marco L. Della Vedova, Stefano Moret, and Luca de Alfaro. 2017. Some Like it Hoax: Automated Fake News Detection in Social Networks. In Proceedings of the Second Workshop on Data Science for Social Good (SoGood), volume 1960 of CEUR Workshop Proceedings.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.

Andreas Vlachos and Sebastian Riedel. 2014. Fact Checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22. Association for Computational Linguistics.

Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science, 359(6380):1146–1151.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018a. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In BlackboxNLP@EMNLP, pages 353–355. Association for Computational Linguistics.

Dongsheng Wang, Jakob Grue Simonsen, Birger Larsen, and Christina Lioma. 2018b. The Copenhagen Team Participation in the Factuality Task of the Competition of Automatic Identification and Verification of Claims in Political Debates of the CLEF-2018 Fact Checking Lab. In CLEF (Working Notes), volume 2125 of CEUR Workshop Proceedings. CEUR-WS.org.

William Yang Wang. 2017. "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426. Association for Computational Linguistics.

Wenpeng Yin and Dan Roth. 2018. TwoWingOS: A Two-Wing Optimization Strategy for Evidential Claim Verification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 105–114, Brussels, Belgium. Association for Computational Linguistics.

Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, Michal Lukasik, Kalina Bontcheva, Trevor Cohn, and Isabelle Augenstein. 2018. Discourse-aware rumour stance classification in social media using sequential classifiers. Information Processing & Management, 54(2):273–290.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLOS ONE, 11(3):1–29.


Websites (Sources) | Reason
Mediabiasfactcheck | Website that checks other news websites
CBC | No pattern to crawl
apnews.com/APFactCheck | No categorical label and no structured claim
weeklystandard.com/tag/fact-check | Mostly no label, and they are placed anywhere
ballotpedia.org | No categorical label and no structured claim
channel3000.com/news/politics/reality-check | No categorical label, lack of structure, and no clear claim
npr.org/sections/politics-fact-check | No label and no clear claim (only some titles are claims)
dailycaller.com/buzz/check-your-fact | Is a subset of checkyourfact, which has already been crawled
sacbee.com | Contains very few labelled articles, and without clear claims
TheGuardian | Only a few websites have a pattern for labels

Table 9: The list of websites that we did not crawl and reasons for not crawling them.

Domain | # Insts | # Labels | Labels
abbc | 436 | 3 | in-between, in-the-red, in-the-green
afck | 433 | 7 | correct, incorrect, mostly-correct, unproven, misleading, understated, exaggerated
bove | 295 | 2 | none, rating: false
chct | 355 | 4 | verdict: true, verdict: false, verdict: unsubstantiated, none
clck | 38 | 3 | incorrect, unsupported, misleading
faan | 111 | 3 | factscan score: false, factscan score: true, factscan score: misleading
faly | 71 | 5 | true, none, partly true, unverified, false
fani | 20 | 3 | conclusion: accurate, conclusion: false, conclusion: unclear
farg | 485 | 11 | false, none, distorts the facts, misleading, spins the facts, no evidence, not the whole story, unsupported, cherry picks, exaggerates, out of context
goop | 2943 | 6 | 0, 1, 2, 3, 4, 10
hoer | 1310 | 7 | facebook scams, true messages, bogus warning, statirical reports, fake news, unsubstantiated messages, misleading recommendations
huca | 34 | 3 | a lot of baloney, a little baloney, some baloney
mpws | 47 | 3 | accurate, false, misleading
obry | 59 | 4 | mostly true, verified, unobservable, mostly false
para | 222 | 7 | mostly false, mostly true, half-true, false, true, pants on fire!, half flip
peck | 65 | 3 | false, true, partially true
pomt | 15390 | 9 | half-true, false, mostly true, mostly false, true, pants on fire!, full flop, half flip, no flip
pose | 1361 | 6 | promise kept, promise broken, compromise, in the works, not yet rated, stalled
ranz | 21 | 2 | fact, fiction
snes | 6455 | 12 | false, true, mixture, unproven, mostly false, mostly true, miscaptioned, legend, outdated, misattributed, scam, correct attribution
thet | 79 | 6 | none, mostly false, mostly true, half true, false, true
thal | 74 | 2 | none, we rate this claim false
tron | 3423 | 27 | fiction!, truth!, unproven!, truth! & fiction!, mostly fiction!, none, disputed!, truth! & misleading!, authorship confirmed!, mostly truth!, incorrect attribution!, scam!, investigation pending!, confirmed authorship!, commentary!, previously truth! now resolved!, outdated!, truth! & outdated!, virus!, fiction! & satire!, truth! & unproven!, misleading!, grass roots movement!, opinion!, correct attribution!, truth! & disputed!, inaccurate attribution!
vees | 504 | 4 | none, fake, misleading, false
vogo | 653 | 8 | none, determination: false, determination: true, determination: mostly true, determination: misleading, determination: barely true, determination: huckster propaganda, determination: false, determination: a stretch
wast | 201 | 7 | 4 pinnochios, 3 pinnochios, 2 pinnochios, false, not the whole story, needs context, none

Table 10: Number of instances, and labels per domain sorted by number of occurrences.


Table 11: Summary statistics for claim collection (43,837 claims crawled in total). Columns: Website (Source), Domain, Claims, Labels, Category, Speaker, Checker, Tags, Article, Claim date, Publish date, Full text, Outlinks. 'Domain' indicates the domain name used for the veracity prediction experiments, '–' indicates that the website was not used due to missing or insufficient claim labels, see Section 3.2.