
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242, Brussels, Belgium, October 31 – November 4, 2018. © 2018 Association for Computational Linguistics


Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text

Haitian Sun∗ Bhuwan Dhingra∗ Manzil Zaheer Kathryn Mazaitis Ruslan Salakhutdinov William W. Cohen

School of Computer Science, Carnegie Mellon University

{haitians,bdhingra,manzilz,krivard,rsalakhu,wcohen}@cs.cmu.edu

Abstract

Open Domain Question Answering (QA) is evolving from complex pipelined systems to end-to-end deep neural networks. Specialized neural models have been developed for extracting answers from either text alone or Knowledge Bases (KBs) alone. In this paper we look at a more practical setting, namely QA over the combination of a KB and entity-linked text, which is appropriate when an incomplete KB is available with a large text corpus. Building on recent advances in graph representation learning we propose a novel model, GRAFT-Net, for extracting answers from a question-specific subgraph containing text and KB entities and relations. We construct a suite of benchmark tasks for this problem, varying the difficulty of questions, the amount of training data, and KB completeness. We show that GRAFT-Net is competitive with the state-of-the-art when tested using either KBs or text alone, and vastly outperforms existing methods in the combined setting.

1 Introduction

Open domain Question Answering (QA) is the task of finding answers to questions posed in natural language. Historically, this required a specialized pipeline consisting of multiple machine-learned and hand-crafted modules (Ferrucci et al., 2010). Recently, the paradigm has shifted towards training end-to-end deep neural network models for the task (Chen et al., 2017; Liang et al., 2017; Raison et al., 2018; Talmor and Berant, 2018; Iyyer et al., 2017). Most existing models, however, answer questions using a single information source, usually either text from an encyclopedia, or a single knowledge base (KB).

Intuitively, the suitability of an information source for QA depends on both its coverage and

∗ Haitian Sun and Bhuwan Dhingra contributed equally to this work.

[Figure 1 shows the question subgraph for "Q. Who voiced Meg in Family Guy?": KB entities (Family Guy, Meg Griffin, CVT1, CVT2, Lacey Chabert, Mila Kunis) connected by KB relations (actor, character, character-played-by), together with a text document ("Megatron "Meg" Griffin is a character from the animated television series Family Guy. Originally voiced by Lacey Chabert during the first season, she has been voiced by Mila Kunis since season 2.") and its entity-mention links.]

Figure 1: To answer a question posed in natural language, GRAFT-Net considers a heterogeneous graph constructed from text and KB facts, and thus can leverage the rich relational structure between the two information sources.

the difficulty of extracting answers from it. A large text corpus has high coverage, but the information is expressed using many different text patterns. As a result, models which operate on these patterns (e.g. BiDAF (Seo et al., 2017)) do not generalize beyond their training domains (Wiese et al., 2017; Dhingra et al., 2018) or to novel types of reasoning (Welbl et al., 2018; Talmor and Berant, 2018). KBs, on the other hand, suffer from low coverage due to their inevitable incompleteness and restricted schema (Min et al., 2013), but are easier to extract answers from, since they are constructed precisely for the purpose of being queried.

In practice, some questions are best answered using text, while others are best answered using KBs. A natural question, then, is how to effectively combine both types of information. Surprisingly little prior work has looked at this problem. In this paper we focus on a scenario in which a large-scale KB (Bollacker et al., 2008; Auer et al., 2007) and a text corpus are available, but neither is sufficient alone for answering all questions.


A naïve option, in such a setting, is to take state-of-the-art QA systems developed for each source, and aggregate their predictions using some heuristic (Ferrucci et al., 2010; Baudiš, 2015). We call this approach late fusion, and show that it can be sub-optimal, as models have limited ability to aggregate evidence across the different sources (§5.4). Instead, we focus on an early fusion strategy, where a single model is trained to extract answers from a question subgraph (Fig 1) containing relevant KB facts as well as text sentences. Early fusion allows more flexibility in combining information from multiple sources.

To enable early fusion, in this paper we propose a novel graph convolution based neural network, called GRAFT-Net (Graphs of Relations Among Facts and Text Networks), specifically designed to operate over heterogeneous graphs of KB facts and text sentences. We build upon recent work on graph representation learning (Kipf and Welling, 2016; Schlichtkrull et al., 2017), but propose two key modifications to adapt it to the task of QA. First, we propose heterogeneous update rules that handle KB nodes differently from text nodes: for instance, LSTM-based updates are used to propagate information into and out of text nodes (§3.2). Second, we introduce a directed propagation method, inspired by personalized PageRank in IR (Haveliwala, 2002), which constrains the propagation of embeddings in the graph to follow paths starting from seed nodes linked to the question (§3.3). Empirically, we show that both these extensions are crucial for the task of QA.

We evaluate these methods on a new suite of benchmark tasks for testing QA models when both KB and text are present. Using WikiMovies (Miller et al., 2016) and WebQuestionsSP (Yih et al., 2016), we construct datasets with varying amounts of training supervision and KB completeness, and with a varying degree of question complexity. We report baselines for future comparison, including Key-Value Memory Networks (Miller et al., 2016; Das et al., 2017c), and show that our proposed GRAFT-Nets have superior performance across a wide range of conditions (§5). We also show that GRAFT-Nets are competitive with state-of-the-art methods developed specifically for text-only QA, and with state-of-the-art methods developed for KB-only QA (§5.4).¹

¹ Source code and data are available at https://github.com/OceanskySun/GraftNet

2 Task Setup

2.1 Description

A knowledge base is denoted as $\mathcal{K} = (\mathcal{V}, \mathcal{E}, \mathcal{R})$, where $\mathcal{V}$ is the set of entities in the KB, and the edges $\mathcal{E}$ are triplets $(s, r, o)$ which denote that relation $r \in \mathcal{R}$ holds between the subject $s \in \mathcal{V}$ and object $o \in \mathcal{V}$. A text corpus $\mathcal{D}$ is a set of documents $\{d_1, \ldots, d_{|\mathcal{D}|}\}$, where each document is a sequence of words $d_i = (w_1, \ldots, w_{|d_i|})$. We further assume that an (imperfect) entity linking system has been run on the collection of documents, whose output is a set $L$ of links $(v, d_p)$ connecting an entity $v \in \mathcal{V}$ with a word at position $p$ in document $d$; we denote with $L_d$ the set of all entity links in document $d$. For entity mentions spanning multiple words in $d$, we include links to all the words in the mention in $L$.

The task is, given a natural language question $q = (w_1, \ldots, w_{|q|})$, to extract its answers $\{a\}_q$ from $\mathcal{G} = (\mathcal{K}, \mathcal{D}, L)$. There may be multiple correct answers for a question. In this paper, we assume that the answers are entities from either the documents or the KB. We are interested in a wide range of settings, where the KB $\mathcal{K}$ varies from highly incomplete to complete for answering the questions, and we introduce datasets for testing our models under these settings.
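To make this setup concrete, here is a minimal sketch of the data structures the task assumes; the class and field names are illustrative, not taken from the released code.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class KB:
    entities: Set[str]                        # V: entity ids
    triples: List[Tuple[str, str, str]]       # E: (subject, relation, object)
    relations: Set[str]                       # R: relation types

@dataclass
class Document:
    words: List[str]                          # d = (w_1, ..., w_|d|)
    # entity links L_d: (entity id, word position) pairs; a multi-word
    # mention contributes one link per covered position
    links: List[Tuple[str, int]] = field(default_factory=list)

@dataclass
class QuestionSubgraph:
    entities: List[str]                       # retrieved entities v_1..v_E
    documents: List[Document]                 # retrieved documents d_1..d_D
    kb_edges: List[Tuple[str, str, str]]      # KB triples among the entities
    # entity-mention edges with the special relation r_L:
    # (entity id, (document index, word position))
    link_edges: List[Tuple[str, Tuple[int, int]]] = field(default_factory=list)
```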

To solve this task we proceed in two steps. First, we extract a subgraph $\mathcal{G}_q \subset \mathcal{G}$ which contains the answer to the question with high probability. The goal for this step is to ensure high recall for answers while producing a graph small enough to fit into GPU memory for gradient-based learning. Next, we use our proposed model, GRAFT-Net, to learn node representations in $\mathcal{G}_q$, conditioned on $q$, which are used to classify each node as being an answer or not. Training data for the second step is generated using distant supervision. The entire process mimics the search-and-read paradigm for text-based QA (Dhingra et al., 2017).

2.2 Question Subgraph Retrieval

We retrieve the subgraph $\mathcal{G}_q$ using two parallel pipelines: one over the KB $\mathcal{K}$, which returns a set of entities, and one over the corpus $\mathcal{D}$, which returns a set of documents. The retrieved entities and documents are then combined with entity links to produce a fully-connected graph.

KB Retrieval. To retrieve relevant entities from the KB we first perform entity linking on the question $q$, producing a set of seed entities, denoted $S_q$. Next we run the Personalized PageRank (PPR) method (Haveliwala, 2002) around these seeds to identify other entities which might be an answer to the question. The edge weights around $S_q$ are distributed equally among all edges of the same type, and weighted such that edges relevant to the question receive a higher weight than those which are not. Specifically, we average word vectors to compute a relation vector $v(r)$ from the surface form of the relation, and a question vector $v(q)$ from the words in the question, and use the cosine similarity between these as the edge weight. After running PPR we retain the top $E$ entities $v_1, \ldots, v_E$ by PPR score, along with any edges between them, and add them to $\mathcal{G}_q$.
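A sketch of this seeded retrieval step, assuming pre-computed word vectors are available; the dense adjacency matrix, the restart probability, the clipping of negative similarities, and the fixed iteration count are illustrative simplifications, not the paper's exact implementation.

```python
import numpy as np

def ppr_retrieve(seeds, edges, v_q, rel_vecs, n_nodes, top_e=500,
                 alpha=0.8, n_iter=20):
    """Seeded PageRank over the KB neighbourhood of the question.

    seeds:    indices of the seed entities S_q
    edges:    (src, dst, relation) triples among candidate entities
    v_q:      question vector (average of the question's word vectors)
    rel_vecs: dict mapping relation -> vector v(r) from its surface form
    """
    W = np.zeros((n_nodes, n_nodes))
    for s, o, r in edges:
        x = rel_vecs[r]
        cos = x @ v_q / (np.linalg.norm(x) * np.linalg.norm(v_q) + 1e-8)
        W[s, o] += max(0.0, cos)       # clipping negatives is an assumption
    # Normalize outgoing weights of each node so W is row-stochastic.
    row = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row, out=np.zeros_like(W), where=row > 0)

    restart = np.zeros(n_nodes)
    restart[seeds] = 1.0 / len(seeds)
    pr = restart.copy()
    for _ in range(n_iter):            # power iteration with restarts
        pr = (1 - alpha) * restart + alpha * (W.T @ pr)
    return np.argsort(-pr)[:top_e]     # top-E entities by PPR score
```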

Text Retrieval. We use Wikipedia as the corpus and retrieve text at the sentence level, i.e. documents in $\mathcal{D}$ are defined along sentence boundaries.² We perform text retrieval in two steps: first we retrieve the top 5 most relevant Wikipedia articles, using the weighted bag-of-words model from DrQA (Chen et al., 2017); then we populate a Lucene³ index with sentences from these articles, and retrieve the top-ranking ones $d_1, \ldots, d_D$, based on the words in the question. For the sentence-retrieval step, we found it beneficial to include the title of the article as an additional field in the Lucene index. As most sentences in an article talk about the title entity, this helps in retrieving relevant sentences that do not explicitly mention the entity in the question. We add the retrieved documents, along with any entities linked to them, to the subgraph $\mathcal{G}_q$.
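A sketch of the two-stage retrieval under simplifying assumptions: scikit-learn TF-IDF stands in for both DrQA's weighted bag-of-words ranker and the Lucene sentence index, and prepending the title to each sentence approximates indexing the title as an additional field.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_sentences(question, articles, top_articles=5, top_sents=50):
    """Two-stage retrieval: rank articles first, then their sentences.

    articles: list of (title, [sentence, ...]) pairs.
    """
    # Stage 1: rank whole articles against the question.
    art_texts = [title + " " + " ".join(sents) for title, sents in articles]
    vec = TfidfVectorizer().fit(art_texts + [question])
    q_vec = vec.transform([question])
    scores = cosine_similarity(q_vec, vec.transform(art_texts))[0]
    best = scores.argsort()[::-1][:top_articles]

    # Stage 2: rank sentences from the top articles; prepending the
    # article title mimics indexing it as an extra field.
    cands = [(s, articles[i][0] + " " + s) for i in best
             for s in articles[i][1]]
    sent_scores = cosine_similarity(q_vec,
                                    vec.transform([c[1] for c in cands]))[0]
    order = sent_scores.argsort()[::-1][:top_sents]
    return [cands[j][0] for j in order]
```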

The final question subgraph is $\mathcal{G}_q = (\mathcal{V}_q, \mathcal{E}_q, \mathcal{R}^+)$, where the vertices $\mathcal{V}_q$ consist of all the retrieved entities and documents, i.e. $\mathcal{V}_q = \{v_1, \ldots, v_E\} \cup \{d_1, \ldots, d_D\}$. The edges are all relations from $\mathcal{K}$ among these entities, plus the entity links between documents and entities, i.e.

$$\mathcal{E}_q = \{(s, o, r) \in \mathcal{E} : s, o \in \mathcal{V}_q, r \in \mathcal{R}\} \cup \{(v, d_p, r_L) : (v, d_p) \in L_d, d \in \mathcal{V}_q\},$$

where $r_L$ denotes a special "linking" relation. $\mathcal{R}^+ = \mathcal{R} \cup \{r_L\}$ is the set of all edge types in the subgraph.

² The term document will always refer to a sentence in the rest of this paper.

³ https://lucene.apache.org/

3 GRAFT-Nets

The question $q$ and its answers $\{a\}_q$ induce a labeling of the nodes in $\mathcal{V}_q$: we let $y_v = 1$ if $v \in \{a\}_q$ and $y_v = 0$ otherwise, for all $v \in \mathcal{V}_q$. The task of QA then reduces to performing binary classification over the nodes of the graph $\mathcal{G}_q$. Several graph-propagation based models have been proposed in the literature which learn node representations and then perform classification of the nodes (Kipf and Welling, 2016; Schlichtkrull et al., 2017). Such models follow the standard gather-apply-scatter paradigm to learn the node representation with homogeneous updates, i.e. treating all neighbors equally.

The basic recipe for these models is as follows:

1. Initialize node representations $h_v^{(0)}$.

2. For $l = 1, \ldots, L$ update node representations:

$$h_v^{(l)} = \phi\left(h_v^{(l-1)}, \sum_{v' \in N_r(v)} h_{v'}^{(l-1)}\right),$$

where $N_r(v)$ denotes the neighbours of $v$ along incoming edges of type $r$, and $\phi$ is a neural network layer.

Here $L$ is the number of layers in the model and corresponds to the maximum length of the paths along which information should be propagated in the graph. Once the propagation is complete, the final layer representations $h_v^{(L)}$ are used to perform the desired task, for example link prediction in knowledge bases (Schlichtkrull et al., 2017).
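A minimal PyTorch sketch of this gather-apply-scatter recipe, with $\phi$ taken to be a linear layer followed by a ReLU (one of many possible choices) and a dense 0/1 adjacency matrix; both are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HomogeneousGraphLayer(nn.Module):
    """One gather-apply-scatter update: h_v <- phi(h_v, sum over neighbours)."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):
        # h: (num_nodes, dim); adj: (num_nodes, num_nodes) with
        # adj[v, v'] = 1 if there is an edge v' -> v
        gathered = adj @ h                       # sum neighbour states
        return torch.relu(self.phi(torch.cat([h, gathered], dim=-1)))

# L stacked layers propagate information along paths of length up to L.
dim, num_nodes, L = 64, 10, 3
layers = nn.ModuleList(HomogeneousGraphLayer(dim) for _ in range(L))
h = torch.randn(num_nodes, dim)
adj = (torch.rand(num_nodes, num_nodes) < 0.2).float()
for layer in layers:
    h = layer(h, adj)                            # h now holds h_v^{(L)}
```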

However, there are two differences in our setting from previously studied graph-based classification tasks. The first difference is that, in our case, the graph $\mathcal{G}_q$ consists of heterogeneous nodes. Some nodes in the graph correspond to KB entities which represent symbolic objects, whereas other nodes represent textual documents which are variable-length sequences of words. The second difference is that we want to condition the representation of nodes in the graph on the natural language question $q$. In §3.2 we introduce heterogeneous updates to address the first difference, and in §3.3 we introduce mechanisms for conditioning on the question (and its entities) for the second.

3.1 Node Initialization

Nodes corresponding to entities are initialized using fixed-size vectors $h_v^{(0)} = x_v \in \mathbb{R}^n$, where $x_v$ can be pre-trained KB embeddings or random, and $n$ is the embedding size. Document nodes in the graph describe a variable-length sequence of text. Since multiple entities might link to different positions in the document, we maintain a variable-length representation of the document in each layer. This is denoted by $H_d^{(l)} \in \mathbb{R}^{|d| \times n}$. Given the words in the document $(w_1, \ldots, w_{|d|})$, we initialize its hidden representation as:

$$H_d^{(0)} = \text{LSTM}(w_1, w_2, \ldots),$$

where LSTM refers to a long short-term memory unit. We denote the $p$-th row of $H_d^{(l)}$, corresponding to the embedding of the $p$-th word in document $d$ at layer $l$, as $H_{d,p}^{(l)}$.

3.2 Heterogeneous Updates

Figure 2 shows the update rules for entities and documents, which we describe in detail here.

Entities. Let $M(v) = \{(d, p)\}$ be the set of positions $p$ in documents $d$ which correspond to a mention of entity $v$. The update for entity nodes involves a single-layer feed-forward network (FFN) over the concatenation of four states:

$$h_v^{(l)} = \text{FFN}\left(\left[\begin{array}{c} h_v^{(l-1)} \\ h_q^{(l-1)} \\ \sum_r \sum_{v' \in N_r(v)} \alpha_r^{v'} \psi_r(h_{v'}^{(l-1)}) \\ \sum_{(d,p) \in M(v)} H_{d,p}^{(l-1)} \end{array}\right]\right). \quad (1)$$

The first two terms correspond to the entity representation and question representation (details below), respectively, from the previous layer.

The third term aggregates the states from the entity neighbours of the current node, $N_r(v)$, after scaling with an attention weight $\alpha_r^{v'}$ (described in the next section), and applying relation-specific transformations $\psi_r$. Previous work on Relational Graph Convolution Networks (Schlichtkrull et al., 2017) used a linear projection for $\psi_r$. For a batched implementation, this results in matrices of size $O(B|\mathcal{R}_q||\mathcal{E}_q|n)$, where $B$ is the batch size, which can be prohibitively large for large subgraphs.⁴ Hence in this work we use relation vectors $x_r$ for $r \in \mathcal{R}_q$ instead of matrices, and compute the update along an edge as:

$$\psi_r(h_{v'}^{(l-1)}) = pr_{v'}^{(l-1)} \, \text{FFN}\left(x_r, h_{v'}^{(l-1)}\right). \quad (2)$$

Here $pr_{v'}^{(l-1)}$ is a PageRank score used to control the propagation of embeddings along paths starting from the seed nodes, which we describe in detail in the next section. The memory complexity of the above is $O(B(|\mathcal{F}_q| + |\mathcal{E}_q|)n)$, where $|\mathcal{F}_q|$ is the number of facts in the subgraph $\mathcal{G}_q$.

⁴ This is because we have to use adjacency matrices of size $|\mathcal{R}_q| \times |\mathcal{E}_q| \times |\mathcal{E}_q|$ to aggregate embeddings from the neighbours of all nodes simultaneously.

The last term aggregates the states of all tokens that correspond to mentions of the entity $v$ among the documents in the subgraph. Note that the update depends on the positions of entities in their containing document.
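A sketch of this entity update (Eqs. (1)–(2)) in PyTorch; the tensor layouts and the exact FFN parameterizations are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EntityUpdate(nn.Module):
    """Entity update of Eq. (1) with relation-vector messages (Eq. (2))."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.rel_vecs = nn.Embedding(num_relations, dim)   # x_r
        self.msg_ffn = nn.Linear(2 * dim, dim)             # FFN(x_r, h_v')
        self.out_ffn = nn.Linear(4 * dim, dim)             # over 4 concatenated states

    def forward(self, h_ent, h_q, pagerank, alpha, edges, mention_states):
        # h_ent: (E, d) entity states; h_q: (d,) question state
        # pagerank: (E,) scores pr_v; alpha: (num_edges,) attention weights
        # edges: (num_edges, 3) long tensor of (src v', dst v, relation r)
        # mention_states: (E, d) summed token states H_{d,p} over M(v)
        src, dst, rel = edges.unbind(dim=1)
        msg = self.msg_ffn(torch.cat([self.rel_vecs(rel), h_ent[src]], -1))
        # Scale each message by its attention weight and PageRank score.
        msg = alpha.unsqueeze(-1) * pagerank[src].unsqueeze(-1) * msg
        agg = torch.zeros_like(h_ent).index_add_(0, dst, msg)  # sum over r, v'
        q = h_q.expand_as(h_ent)
        return self.out_ffn(torch.cat([h_ent, q, agg, mention_states], -1))
```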

Documents. Let $L(d, p)$ be the set of all entities linked to the word at position $p$ in document $d$. The document update proceeds in two steps. First we aggregate over the entity states coming in at each position separately:

$$\tilde{H}_{d,p}^{(l)} = \text{FFN}\left(H_{d,p}^{(l-1)}, \sum_{v \in L(d,p)} h_v^{(l-1)}\right). \quad \text{(3a)}$$

Here the $h_v^{(l-1)}$ are normalized by the number of outgoing edges at $v$. Next we aggregate states within the document using an LSTM:

$$H_d^{(l)} = \text{LSTM}\left(\tilde{H}_d^{(l)}\right). \quad \text{(3b)}$$

3.3 Conditioning on the Question

For the parts described thus far, the graph learner is largely agnostic of the question. We introduce dependence on the question in two ways: by attention over relations, and by personalized propagation.

To represent the question $q$, let $w_1^q, \ldots, w_{|q|}^q$ be the words in the question. The initial representation is computed as:

$$h_q^{(0)} = \text{LSTM}\left(w_1^q, \ldots, w_{|q|}^q\right)_{|q|} \in \mathbb{R}^n, \quad (4)$$

where we extract the final state from the output of the LSTM. In subsequent layers the question representation is updated as $h_q^{(l)} = \text{FFN}\left(\sum_{v \in S_q} h_v^{(l)}\right)$, where $S_q$ denotes the seed entities mentioned in the question.

Attention over Relations. The attention weight in the third term of Eq. (1) is computed using the question and relation embeddings:

$$\alpha_r^{v'} = \text{softmax}\left(x_r^T h_q^{(l-1)}\right),$$

where the softmax normalization is over all outgoing edges from $v'$, and $x_r$ is the relation vector for relation $r$. This ensures that embeddings are propagated more along edges relevant to the question.

Page 5: Open Domain Question Answering Using Early Fusion of ... · 1 Introduction Open domain Question Answering (QA) is the task of finding answers to questions posed in nat-ural language

4235

[Figure 2 shows, for the question "Q. Who voiced Meg in Family Guy?", the entity update (left), where LSTM token states of the document "…voiced by Lacey Chabert during…" and neighbouring entity states (e.g. CVT1) are combined by an FFN to update the entity Lacey Chabert, and the text update (right), where entity states feed back into the layer-$l$ LSTM over the document.]

Figure 2: Illustration of the heterogeneous update rules for entities (left) and text documents (right).

Directed Propagation. Many questions require multi-hop reasoning, which follows a path from a seed node mentioned in the question to the target answer node. To encourage such a behaviour when propagating embeddings, we develop a technique inspired by personalized PageRank in IR (Haveliwala, 2002). The propagation starts at the seed entities $S_q$ mentioned in the question. In addition to the vector embeddings $h_v^{(l)}$ at the nodes, we also maintain scalar "PageRank" scores $pr_v^{(l)}$ which measure the total weight of paths from a seed entity to the current node, as follows:

$$pr_v^{(0)} = \begin{cases} \frac{1}{|S_q|} & \text{if } v \in S_q, \\ 0 & \text{otherwise,} \end{cases}$$

$$pr_v^{(l)} = (1 - \lambda)\, pr_v^{(l-1)} + \lambda \sum_r \sum_{v' \in N_r(v)} \alpha_r^{v'}\, pr_{v'}^{(l-1)}.$$

Notice that we reuse the attention weights $\alpha_r^{v'}$ when propagating PageRank, to ensure that nodes along paths relevant to the question receive a high weight. The PageRank score is used as a scaling factor when propagating embeddings along the edges in Eq. (2). For $l = 1$, the PageRank score will be 0 for all entities except the seed entities, and hence propagation will only happen outward from these nodes. For $l = 2$, it will be non-zero for the seed entities and their 1-hop neighbors, and propagation will only happen along these edges. Figure 3 illustrates this process.

3.4 Answer Selection

The final representations $h_v^{(L)} \in \mathbb{R}^n$ are used for binary classification to select the answers:

$$\Pr\left(v \in \{a\}_q \mid \mathcal{G}_q, q\right) = \sigma\left(w^T h_v^{(L)} + b\right), \quad (5)$$

where $\sigma$ is the sigmoid function. Training uses binary cross-entropy loss over these probabilities.

[Figure 3: Directed propagation of embeddings in GRAFT-Net. A scalar PageRank score $pr_v^{(l)}$ is maintained for each node $v$ across layers, which spreads out from the seed node. Embeddings are only propagated from nodes with $pr_v^{(l)} > 0$.]
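A minimal sketch of this final scoring and loss, with random tensors standing in for the learned node representations:

```python
import torch
import torch.nn as nn

# Final-layer node states h_v^{(L)} are scored with a linear layer and a
# sigmoid (Eq. (5)); training minimizes binary cross-entropy.
dim, num_nodes = 64, 100
scorer = nn.Linear(dim, 1)
h_final = torch.randn(num_nodes, dim)                 # h_v^{(L)} for v in V_q
labels = torch.randint(0, 2, (num_nodes,)).float()    # y_v: answer or not
logits = scorer(h_final).squeeze(-1)                  # w^T h_v^{(L)} + b
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
probs = torch.sigmoid(logits)                         # Pr(v in {a}_q | G_q, q)
```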

3.5 Regularization via Fact Dropout

To encourage the model to learn a robust classifier which exploits all available sources of information, we randomly drop edges from the graph during training with probability $p_0$. We call this fact dropout. It is usually easier to extract answers from the KB than from the documents, so without it the model tends to rely on the former, especially when the KB is complete. This method is similar to DropConnect (Wan et al., 2013).
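A minimal sketch of fact dropout, applied to the KB edges of a training example's subgraph:

```python
import random

def fact_dropout(kb_edges, p0):
    """Randomly drop each KB edge with probability p0 (training only)."""
    return [e for e in kb_edges if random.random() >= p0]
```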

4 Related Work

The work of Das et al. (2017c) attempts an early fusion strategy for QA over KB facts and text. Their approach is based on Key-Value Memory Networks (KV-MemNNs) (Miller et al., 2016) coupled with a universal schema (Riedel et al., 2013) to populate a memory module with representations of KB triples and text snippets independently. The key limitation of this model is that it ignores the rich relational structure between the facts and text snippets. Our graph-based method, on the other hand, explicitly uses this structure for the propagation of embeddings. We compare the two approaches in our experiments (§5), and show that GRAFT-Nets outperform KV-MemNNs over all tasks.

Non-deep-learning approaches have also been attempted for QA over both text assertions and KB facts. Gardner and Krishnamurthy (2017) use traditional feature extraction methods of open-vocabulary semantic parsing for the task. Ryu et al. (2014) use a pipelined system aggregating evidence from both unstructured and semi-structured sources for open-domain QA.

Another line of work has looked at learning combined representations of KBs and text for relation extraction and Knowledge Base Completion (KBC) (Lao et al., 2012; Riedel et al., 2013; Toutanova et al., 2015; Verga et al., 2016; Das et al., 2017b; Han et al., 2016). The key difference in QA compared to KBC is that in QA the inference process on the knowledge source has to be conditioned on the question, so different questions induce different representations of the KB and warrant a different inference process. Furthermore, KBC operates under the fixed schema defined by the KB beforehand, whereas natural language questions might not adhere to this schema.

The GRAFT-Net model itself is motivated by the large body of work on graph representation learning (Scarselli et al., 2009; Li et al., 2016; Kipf and Welling, 2016; Atwood and Towsley, 2016; Schlichtkrull et al., 2017). Like most other graph-based models, GRAFT-Nets can also be viewed as an instantiation of the Message Passing Neural Network (MPNN) framework of Gilmer et al. (2017). GRAFT-Nets are also inductive representation learners like GraphSAGE (Hamilton et al., 2017), but operate on a heterogeneous mixture of nodes and use retrieval for getting a subgraph instead of random sampling. The recently proposed Walk-Steered Convolution model uses random walks for learning graph representations (Jiang et al., 2018). Our personalization technique also borrows from such random-walk literature, but uses it to localize the propagation of embeddings.

Tremendous progress on QA over KBs has been made with deep-learning-based approaches like memory networks (Bordes et al., 2015; Jain, 2016) and reinforcement learning (Liang et al., 2017; Das et al., 2017a), but extending them with text, which is our main focus, is non-trivial. In another direction, there is also work on producing parsimonious graphical representations of textual data (Krause et al., 2016; Lu et al., 2017); in this paper, however, we use a simple sequential representation augmented with entity links to the KB, which works well.

For QA over text only, a major focus has been on the task of reading comprehension (Seo et al., 2017; Gong and Bowman, 2017; Hu et al., 2017; Shen et al., 2017; Yu et al., 2018) since the introduction of SQuAD (Rajpurkar et al., 2016). These systems assume that the answer-containing passage is known a priori, but there has been progress when this assumption is relaxed (Chen et al., 2017; Raison et al., 2018; Dhingra et al., 2017; Wang et al., 2018, 2017; Watanabe et al., 2017). We work in the latter setting, where relevant information must be retrieved from large information sources, but we also incorporate KBs into this process.

5 Experiments & Results

5.1 Datasets

WikiMovies-10K consists of 10K randomly sampled training questions from the WikiMovies dataset (Miller et al., 2016), along with the original test and validation sets. We sample the training questions to create a more difficult setting, since the original dataset has 100K questions over only 8 different relation types, which is unrealistic in our opinion. In §5.4 we also compare to the existing state-of-the-art using the full training set.

We use the KB and text corpus constructed from Wikipedia released by Miller et al. (2016). For entity linking we use simple surface-level matches, and retrieve the top 50 entities around the seeds to create the question subgraph. We further add the top 50 sentences (along with their article titles) to the subgraph using Lucene search over the text corpus. The overall answer recall in our constructed subgraphs is 99.6%.

WebQuestionsSP (Yih et al., 2016) consists of 4737 natural language questions posed over Freebase entities, split into 3098 training and 1639 test questions. We reserve 250 training questions for model development and early stopping. We use the entity linking outputs from S-MART⁵ and retrieve 500 entities from the neighbourhood around the question seeds in Freebase to populate the question subgraphs.⁶ We further retrieve the top 50 sentences from Wikipedia with the two-stage process described in §2. The overall recall of answers among the subgraphs is 94.0%.

⁵ https://github.com/scottyih/STAGG

⁶ A total of 13 questions had no detected entities. These were ignored during training and considered as incorrect during evaluation.

Table 1 shows the combined statistics of all the retrieved subgraphs for the questions in each dataset. These two datasets present varying levels of difficulty. While all questions in WikiMovies correspond to a single KB relation, for WebQuestionsSP the model needs to aggregate over two KB facts for ∼30% of the questions, and also requires reasoning over constraints for ∼7% of the questions (Liang et al., 2017). For maximum portability, QA systems need to be robust across several degrees of KB availability, since different domains might contain different amounts of structured data, and KB completeness may also vary over time. Hence, we construct 3 additional datasets from each of the above two, with the number of KB facts downsampled to 10%, 30% and 50% of the original, to simulate settings where the KB is incomplete. We repeat the retrieval process for each sampled KB.

5.2 Compared Models

KV-KB is the Key-Value Memory Networks model from Miller et al. (2016); Das et al. (2017c), using only the KB and ignoring the text. KV-EF (early fusion) is the same model with access to both KB and text as memories. For text we use a BiLSTM over the entire sentence as keys, and entity mentions as values. This re-implementation shows better performance on the text-only and KB-only WikiMovies tasks than the results reported previously⁷ (see Table 4). GN-KB is the GRAFT-Net model ignoring the text. GN-LF is a late fusion version of the GRAFT-Net model: we train two separate models, one using text only and the other using KB only, and then ensemble the two.⁸ GN-EF is our main GRAFT-Net model with early fusion. GN-EF+LF is an ensemble over the GN-EF and GN-LF models, with the same ensembling method as GN-LF. We report Hits@1, which is the accuracy of the top-predicted answer from the model, and the F1 score. To compute the F1 score we tune a threshold on the development set to select answers based on the binary probabilities for each node in the subgraph.

⁷ For all KV models we tuned the number of layers {1, 2, 3}, batch size {10, 30, 50}, and model dimension {50, 80}. We also use fact dropout regularization in the KB+Text setting, tuned over {0, 0.2, 0.4}.

⁸ For ensembles we take a weighted combination of the answer probabilities produced by the models, with the weights tuned on the dev set. For answers only in text or only in KB, we use the probability as is.
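A sketch of this dev-set threshold tuning, assuming (as one plausible reading) that F1 is averaged over questions; the grid and helper names are illustrative.

```python
import numpy as np

def tune_threshold(dev_probs, dev_labels, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the probability threshold maximizing mean F1 on the dev set.

    dev_probs / dev_labels: per-question numpy arrays of node answer
    probabilities and binary answer labels.
    """
    def mean_f1(t):
        f1s = []
        for p, y in zip(dev_probs, dev_labels):
            pred = p >= t
            tp = float((pred & (y == 1)).sum())
            prec = tp / max(pred.sum(), 1)
            rec = tp / max((y == 1).sum(), 1)
            f1s.append(0.0 if tp == 0 else 2 * prec * rec / (prec + rec))
        return np.mean(f1s)
    return max(grid, key=mean_f1)
```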

5.3 Main Results

Table 2 presents a comparison of the above models across all datasets. GRAFT-Nets (GN) show consistent improvement over KV-MemNNs on both datasets in all settings, including KB only (-KB), text only (-EF, Text Only column), and early fusion (-EF). Interestingly, we observe a larger relative gap between the Hits and F1 scores for the KV models than we do for our GN models. We believe this is because the attention for KV is normalized over the memories, which are KB facts (or text sentences): hence the model is unable to assign high probabilities to multiple facts at the same time. In GN, on the other hand, we normalize the attention over the types of relations outgoing from a node, and hence can assign high weights to all the correct answers.

We also see a consistent improvement of early fusion over late fusion (-LF), and by ensembling them together we see the best performance across all the models. In Table 2 (right), we further show the improvement for KV-EF over KV-KB, and for GN-LF and GN-EF over GN-KB, as the amount of KB is increased. This measures how effective these approaches are in utilizing text plus a KB. For KV-EF we see improvements when the KB is highly incomplete, but in the full KB setting the performance of the fused approach is worse. A similar trend holds for GN-LF. On the other hand, GN-EF with text improves over the KB-only approach in all settings. As we would expect, though, the benefit of adding text decreases as the KB becomes more and more complete.

5.4 Comparison to Specialized Methods

In Table 4 we compare GRAFT-Nets to state-of-the-art models that are specifically designed and tuned for QA using either only KB or only text. For this experiment we use the full WikiMovies dataset to enable direct comparison to previously reported numbers. For DrQA (Chen et al., 2017), following the original paper, we restrict answer spans for WebQuestionsSP to match an entity in Freebase. In each case we also train GRAFT-Nets using only KB facts or only text sentences. In three out of the four cases, we find that GRAFT-Nets either match or outperform the existing state-of-the-art models.


Table 1: Statistics of all the retrieved subgraphs $\cup_q \mathcal{G}_q$ for WikiMovies-10K and WebQuestionsSP.

Dataset        | # train / dev / test | # entity nodes | # edge types | # document nodes | # question vocab
WikiMovies-10K | 10K / 10K / 10K      | 43,233         | 9            | 79,728           | 1,759
WebQuestionsSP | 2848 / 250 / 1639    | 528,617        | 513          | 235,567          | 3,781

Table 2: Left: Hits@1 / F1 scores of GRAFT-Nets (GN) compared to KV-MemNN (KV) in KB only (-KB), early fusion (-EF), and late fusion (-LF) settings. Right: Improvement of early fusion (-EF) and late fusion (-LF) over KB only (-KB) settings as KB completeness increases.

WikiMovies-10K
Model    | Text Only   | 10% KB      | 30% KB      | 50% KB      | 100% KB
KV-KB    | –           | 15.8 / 9.8  | 44.7 / 30.4 | 63.8 / 46.4 | 94.3 / 76.1
KV-EF    | 50.4 / 40.9 | 53.6 / 44.0 | 60.6 / 48.1 | 75.3 / 59.1 | 93.8 / 81.4
GN-KB    | –           | 19.7 / 17.3 | 48.4 / 37.1 | 67.7 / 58.1 | 97.0 / 97.6
GN-LF    | 73.2 / 64.0 | 74.5 / 65.4 | 78.7 / 68.5 | 83.3 / 74.2 | 96.5 / 92.0
GN-EF    | 73.2 / 64.0 | 75.4 / 66.3 | 82.6 / 71.3 | 87.6 / 76.2 | 96.9 / 94.1
GN-EF+LF | 73.2 / 64.0 | 79.0 / 66.7 | 84.6 / 74.2 | 88.4 / 78.6 | 96.8 / 97.3

WebQuestionsSP
Model    | Text Only   | 10% KB      | 30% KB      | 50% KB      | 100% KB
KV-KB    | –           | 12.5 / 4.3  | 25.8 / 13.8 | 33.3 / 21.3 | 46.7 / 38.6
KV-EF    | 23.2 / 13.0 | 24.6 / 14.4 | 27.0 / 17.7 | 32.5 / 23.6 | 40.5 / 30.9
GN-KB    | –           | 15.5 / 6.5  | 34.9 / 20.4 | 47.7 / 34.3 | 66.7 / 62.4
GN-LF    | 25.3 / 15.3 | 29.8 / 17.0 | 39.1 / 25.9 | 46.2 / 35.6 | 65.4 / 56.8
GN-EF    | 25.3 / 15.3 | 31.5 / 17.7 | 40.7 / 25.2 | 49.9 / 34.7 | 67.8 / 60.4
GN-EF+LF | 25.3 / 15.3 | 33.3 / 19.3 | 42.5 / 26.7 | 52.3 / 37.4 | 68.7 / 62.3

(The Text Only entry is a single value shared by the GN fusion rows, which use the same text-only model; it is repeated on each row here for readability.)

[Table 2, right panel: bar charts of the Hits@1 improvement of KV-EF over KV-KB, GN-LF over GN-KB, and GN-EF over GN-KB at KB fractions 0.1, 0.3, 0.5, and 1.0, for WikiMovies-10K and WebQuestionsSP.]

We emphasize that the latter have no mechanism for dealing with the fused setting.

The one exception is the KB-only case for WebQuestionsSP, where GRAFT-Net is 6.2 F1 points worse than Neural Symbolic Machines (Liang et al., 2017). Analysis suggested three explanations: (1) In the KB-only setting, the recall of subgraph retrieval is only 90.2%, which limits overall performance. In an oracle setting where we ensure the answers are part of the subgraph, the F1 score increases by 4.8%. (2) We use the same probability threshold for all questions, even though the number of answers may vary significantly. Models which parse the query into a symbolic form do not suffer from this problem, since answers are retrieved in a deterministic fashion. If we tune separate thresholds for each question, the F1 score improves by 7.6%. (3) GRAFT-Nets perform poorly in the few cases where a constraint is involved in picking out the answer (for example, "who first voiced Meg in Family Guy"). If we ignore such constraints, and consider all entities with the same sequence of relations to the seed as correct, the performance improves by 3.8% F1. Heuristics such as those used by Yu et al. (2017) could be used to improve these cases. Table 3 shows examples where GRAFT-Net fails to predict the correct answer set exactly.

5.5 Effect of Model Components

Heterogeneous Updates. We tested a non-heterogeneous version of our model, where instead of using fine-grained entity linking information for updating the node representations ($M(v)$ and $L(d, p)$ in Eqs. 1, 3a), we aggregate the document states across all its positions as $\sum_p H_{d,p}^{(l)}$ and use this combined state for all updates. Without the heterogeneous update, all entities $v \in L(d, \cdot)$ receive the same update from document $d$, so the model cannot disambiguate different entities mentioned in the same document. The results in Table 5 show that this version is consistently worse than the heterogeneous model.

Conditioning on the Question. We performed an ablation test on the directed propagation method and the attention over relations, and observe that both components lead to better performance. These effects are observed in both complete and incomplete KB scenarios, e.g. on the WebQuestionsSP dataset, as shown in Figure 4 (left).

Fact Dropout. Figure 4 (right) compares the performance of the early fusion model as we vary the rate of fact dropout.


Table 3: Examples from the WebQuestionsSP dataset. Top: the model misses a correct answer. Bottom: the model predicts an extra incorrect answer.

Question                                          | Correct Answers                           | Predicted Answers
what language do most people speak in afghanistan | Pashto language, Farsi (Eastern Language) | Pashto language
what college did john stockton go to              | Gonzaga University                        | Gonzaga University, Gonzaga Preparatory School

Table 4: Hits@1 / F1 scores compared to SOTA models using only KB or text: MINERVA (Das et al., 2017a), R2-AsV (Watanabe et al., 2017), Neural Symbolic Machines (NSM) (Liang et al., 2017), DrQA (Chen et al., 2017), R-GCN (Schlichtkrull et al., 2017) and KV-MemNN (Miller et al., 2016). *DrQA is pretrained on SQuAD. #Re-implemented.

Method  | WikiMovies (full) kb | WikiMovies (full) doc | WebQuestionsSP kb | WebQuestionsSP doc
MINERVA | 97.0 / –             | –                     | –                 | –
R2-AsV  | –                    | 85.8 / –              | –                 | –
NSM     | –                    | –                     | – / 69.0          | –
DrQA*   | –                    | –                     | –                 | 21.5 / –
R-GCN#  | 96.5 / 97.4          | –                     | 37.2 / 30.5       | –
KV      | 93.9 / –             | 76.2 / –              | – / –             | – / –
KV#     | 95.6 / 88.0          | 80.3 / 72.1           | 46.7 / 38.6       | 23.2 / 13.0
GN      | 96.8 / 97.2          | 86.6 / 80.8           | 67.8 / 62.8       | 25.3 / 15.3

Table 5: Non-Heterogeneous (NH) vs. Heterogeneous (H) updates on WebQuestionsSP (Hits@1 / F1).

Model | 0 KB        | 0.1 KB      | 0.3 KB      | 0.5 KB      | 1.0 KB
NH    | 22.7 / 13.6 | 28.7 / 15.8 | 35.6 / 23.2 | 47.2 / 33.3 | 66.5 / 59.8
H     | 25.3 / 15.3 | 31.5 / 17.7 | 40.7 / 25.2 | 49.9 / 34.7 | 67.8 / 60.4

Moderate levels of fact dropout improve performance on both datasets: performance increases with the fact-dropout rate up to a point, beyond which the model can no longer learn the inference chains in the KB.

6 Conclusion

In this paper we investigate QA using text combined with an incomplete KB, a task which has received limited attention in the past. We introduce several benchmark problems for this task by modifying existing question-answering datasets, and discuss two broad approaches to solving this problem: "late fusion" and "early fusion". We show that early fusion approaches perform better.

We also introduce a novel early-fusion model, called GRAFT-Net, for classifying nodes in a subgraph consisting of both KB entities and text documents.

Figure 4: Left: Effect of directed propagation and query-based attention over relations for the WebQuestionsSP dataset with 30% KB and 100% KB (Hits@1 vs. training epoch). Right: Hits@1 with different rates of fact dropout on WikiMovies-10K (50% KB) and WebQuestionsSP (30% KB).

GRAFT-Net builds on recent advances in graph representation learning but includes several innovations which improve performance on this task. GRAFT-Nets are a single model which achieves performance competitive with state-of-the-art methods in both text-only and KB-only settings, and outperforms baseline models when using text combined with an incomplete KB. Directions for future work include: (1) extending GRAFT-Nets to pick spans of text as answers, rather than only entities, and (2) improving the subgraph retrieval process.

Acknowledgments

Bhuwan Dhingra is supported by NSF under grants CCF-1414030 and IIS-1250956 and by grants from Google. Ruslan Salakhutdinov is supported in part by ONR grant N000141812861, Apple, and Nvidia NVAIL Award.

References

James Atwood and Don Towsley. 2016. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Petr Baudiš. 2015. YodaQA: A modular question answering system pipeline. In POSTER 2015, 19th International Student Conference on Electrical Engineering, pages 1156–1165.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. 2017a. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv preprint arXiv:1711.05851.

Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. 2017b. Chains of reasoning over entities, relations, and text using recurrent neural networks. EACL.

Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017c. Question answering on knowledge bases and text using universal schema and memory networks. ACL.

Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. 2017. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904.

Bhuwan Dhingra, Danish Pruthi, and Dheeraj Rajagopal. 2018. Simple and effective semi-supervised question answering. NAACL.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79.

Matt Gardner and Jayant Krishnamurthy. 2017. Open-vocabulary semantic parsing with both distributional statistics and formal knowledge. In AAAI, pages 3195–3201.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. ICML.

Yichen Gong and Samuel R Bowman. 2017. Ruminating reader: Reasoning with gated multi-hop attention. arXiv preprint arXiv:1704.07415.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. CoRR, abs/1706.02216.

Xu Han, Zhiyuan Liu, and Maosong Sun. 2016. Joint representation learning of text and knowledge for knowledge graph completion. arXiv preprint arXiv:1611.04125.

Taher H Haveliwala. 2002. Topic-sensitive PageRank. In Proceedings of the 11th International Conference on World Wide Web, pages 517–526. ACM.

Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Mnemonic reader for machine comprehension. arXiv preprint arXiv:1705.02798.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1821–1831.

Sarthak Jain. 2016. Question answering over knowledge base using factual memory networks. In Proceedings of the NAACL Student Research Workshop, pages 109–115.

Jiatao Jiang, Zhen Cui, Chunyan Xu, Chengzheng Li, and Jian Yang. 2018. Walk-steered convolution for graph classification. arXiv preprint arXiv:1804.05837.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Sebastian Krause, Leonhard Hennig, Andrea Moro, Dirk Weissenborn, Feiyu Xu, Hans Uszkoreit, and Roberto Navigli. 2016. Sar-graphs: A language resource connecting linguistic knowledge with semantic relations from knowledge graphs. Web Semantics: Science, Services and Agents on the World Wide Web, 37:112–131.

Ni Lao, Amarnag Subramanya, Fernando Pereira, and William W Cohen. 2012. Reading the web with learned syntactic-semantic inference rules. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1017–1026. Association for Computational Linguistics.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. ICLR.


Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. ACL.

Zhengdong Lu, Haotian Cui, Xianggen Liu, Yukun Yan, and Daqi Zheng. 2017. Object-oriented neural programming (OONP) for document understanding. arXiv preprint arXiv:1709.08853.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. EMNLP.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 777–782.

Martin Raison, Pierre-Emmanuel Mazaré, Rajarshi Das, and Antoine Bordes. 2018. Weaver: Deep co-encoding of questions and documents for machine reading. arXiv preprint arXiv:1804.10490.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84.

Pum-Mo Ryu, Myung-Gil Jang, and Hyun-Ki Kim. 2014. Open domain question answering using Wikipedia-based knowledge model. Information Processing and Management, 50(5):683–692.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.

Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. ICLR.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1047–1055. ACM.

A. Talmor and J. Berant. 2018. The web as a knowledge-base for answering complex questions. In North American Association for Computational Linguistics (NAACL).

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016. Multilingual relation extraction using compositional universal schema. NAACL.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced reader-ranker for open-domain question answering. AAAI.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2017. Evidence aggregation for answer re-ranking in open-domain question answering. arXiv preprint arXiv:1711.05116.

Yusuke Watanabe, Bhuwan Dhingra, and Ruslan Salakhutdinov. 2017. Question answering from unstructured text by retrieval and comprehension. arXiv preprint arXiv:1703.08885.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL.

Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural domain adaptation for biomedical question answering. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 281–289, Vancouver, Canada. Association for Computational Linguistics.

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–206.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. ICLR.


Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. ACL.