
Modeling Missing Data in Distant Supervision for Information Extraction

Alan Ritter
Machine Learning Department, Carnegie Mellon University
[email protected]

Luke Zettlemoyer, Mausam
Computer Sci. & Eng., University of Washington
{lsz,mausam}@cs.washington.edu

Oren Etzioni
Vulcan Inc., Seattle, WA
[email protected]

Transactions of the Association for Computational Linguistics, 1 (2013) 367–378. Action Editor: Kristina Toutanova. Submitted 7/2013; Revised 8/2013; Published 10/2013. © 2013 Association for Computational Linguistics.

Abstract

Distant supervision algorithms learn information extraction models given only large readily available databases and text collections. Most previous work has used heuristics for generating labeled data, for example assuming that facts not contained in the database are not mentioned in the text, and facts in the database must be mentioned at least once. In this paper, we propose a new latent-variable approach that models missing data. This provides a natural way to incorporate side information, for instance modeling the intuition that text will often mention rare entities which are likely to be missing in the database. Despite the added complexity introduced by reasoning about missing data, we demonstrate that a carefully designed local search approach to inference is very accurate and scales to large datasets. Experiments demonstrate improved performance for binary and unary relation extraction when compared to learning with heuristic labels, including on average a 27% increase in area under the precision recall curve in the binary case.

1 Introduction

This paper addresses the issue of missing data (Little and Rubin, 1986) in the context of distant supervision. The goal of distant supervision is to learn to process unstructured data, for instance to extract binary or unary relations from text (Bunescu and Mooney, 2007; Snyder and Barzilay, 2007; Wu and Weld, 2007; Mintz et al., 2009; Collins and Singer, 1999), using a large database of propositions as a distant source of supervision.

Person          EMPLOYER
Bibb Latané     UNC Chapel Hill
Tim Cook        Apple
Susan Wojcicki  Google

True Positive: "Bibb Latané, a professor at the University of North Carolina at Chapel Hill, published the theory in 1981."

False Positive: "Tim Cook praised Apple's record revenue..."

False Negative: "John P. McNamara, a professor at Washington State University's Department of Animal Sciences..."

Figure 1: A small hypothetical database and heuristically labeled training data for the EMPLOYER relation.

In the case of binary relations, the intuition is that any sentence which mentions a pair of entities ($e_1$ and $e_2$) that participate in a relation, $r$, is likely to express the proposition $r(e_1, e_2)$, so we can treat it as a positive training example of $r$. Figure 1 presents an example of this process.

One question which has received little attention in previous work is how to handle the situation where information is missing, either from the text corpus or the database. As an example, suppose the pair of entities (John P. McNamara, Washington State University) is absent from the EMPLOYER relation. In this case, the sentence in Figure 1 (and others which mention the entity pair) is effectively treated as a negative example of the relation. This is an issue of practical concern, as most databases of interest are highly incomplete; this is the reason we need to extend them by extracting information from text in the first place.

We need to be cautious in how we handle missing data in distant supervision, because this is a case where data is not missing at random (NMAR). Whether a proposition is observed or missing in the text or database depends heavily on its truth value: given that it is true we have some chance to observe it; however, we do not observe those which are false. To address this challenge, we propose a joint model of extraction from text and the process by which propositions are observed or missing in both the database and text. Our approach provides a natural way to incorporate side information in the form of a missing data model. For instance, popular entities such as Barack Obama already have good coverage in Freebase, so new extractions are more likely to be errors than those involving rare entities with poor coverage.

Our approach to missing data is general and can be combined with various IE solutions. As a proof of concept, we extend MultiR (Hoffmann et al., 2011), a recent model for distantly supervised information extraction, to explicitly model missing data. These extensions complicate the MAP inference problem which is used as a subroutine in learning. This motivated us to explore a variety of approaches to inference in the joint extraction and missing data model. We explore both exact inference based on A* search and efficient approximate inference using local search. Our experiments demonstrate that with a carefully designed set of search operators, local search produces optimal solutions in most cases.

Experimental results demonstrate large performance gains over the heuristic labeling strategy on both binary relation extraction and weakly supervised named entity categorization. For example, our model obtains a 27% increase in area under the precision recall curve on the sentence-level relation extraction task.

2 Related Work

There has been much interest in distantly supervised¹ training of relation extractors using databases. For example, Craven and Kumlien (1999) build a heuristically labeled dataset, using the Yeast Protein Database to label PubMed abstracts with the subcellular-localization relation. Wu and Weld (2007) heuristically annotate Wikipedia articles with facts mentioned in the infoboxes, enabling automated infobox generation for articles which do not yet contain them. Benson et al. (2011) use a database of music events taking place in New York City as a source of distant supervision to train event extractors from Twitter. Mintz et al. (2009) used a set of relations from Freebase as a distant source of supervision to learn to extract information from Wikipedia. Riedel et al. (2010), Hoffmann et al. (2011), and Surdeanu et al. (2012) presented a series of models casting distant supervision as a multiple-instance learning problem (Dietterich et al., 1997).

¹Also referred to as weakly supervised.

Recent work has begun to address the challenge of noise in heuristically labeled training data generated by distant supervision, and has proposed a variety of strategies for correcting erroneous labels. Takamatsu et al. (2012) present a generative model of the labeling process, which is used as a preprocessing step for improving the quality of labels before training relation extractors. Independently, Xu et al. (2013) analyze a random sample of 1834 sentences from the New York Times, demonstrating that most entity pairs expressing a Freebase relation correspond to false negatives. They apply pseudo-relevance feedback to add missing entries to the knowledge base before applying the MultiR model (Hoffmann et al., 2011). Min et al. (2013) extend the MIML model of Surdeanu et al. (2012) using a semi-supervised approach assuming a fixed proportion of true positives for each entity pair.

The Min et al. (2013) approach is perhaps the most closely related of the recent approaches for distant supervision. However, there are a number of key differences: (1) They impose a hard constraint on the proportion of true positive examples for each entity pair, whereas we jointly model relation extraction and missing data in the text and KB. (2) They only handle the case of missing information in the database and not in the text. (3) Their model, based on Surdeanu et al. (2012), uses hard discriminative EM to tune parameters, whereas we use perceptron-style updates. (4) We evaluate various inference strategies for exact and approximate inference.

The issue of missing data has been extensively studied in the statistical literature (Little and Rubin, 1986; Gelman et al., 2003). Most methods for handling missing data assume that variables are missing at random (MAR): whether a variable is observed does not depend on its value. In situations where the MAR assumption is violated (for example, distantly supervised information extraction), ignoring the missing data mechanism will introduce bias. In this case it is necessary to jointly model the process of interest (e.g., information extraction) in addition to the missing data mechanism.

Another line of related work is iterative semantic bootstrapping (Brin, 1999; Agichtein and Gravano, 2000). Carlson et al. (2010) exploit constraints between relations to reduce semantic drift in the bootstrapping process; such constraints are potentially complementary to our approach of modeling missing data.

3 A Latent Variable Model for Distantly Supervised Relation Extraction

In this section we review the MultiR model (due to Hoffmann et al. (2011)) for distant supervision in the context of extracting binary relations. This model is extended to handle missing data in Section 4. We focus on binary relations to keep the discussion concrete; unary relation extraction is also possible.

Given a set of sentences, $s = s_1, s_2, \ldots, s_n$, which mention a specific pair of entities ($e_1$ and $e_2$), our goal is to correctly predict which relation is mentioned in each sentence, or "NA" if none of the relations under consideration are mentioned. Unlike the standard supervised learning setup, we do not observe the latent sentence-level relation mention variables, $z = z_1, z_2, \ldots, z_n$.² Instead we only observe aggregate binary variables for each relation, $d = d_1, d_2, \ldots, d_k$, which indicate whether the proposition $r_j(e_1, e_2)$ is present in the database (Freebase). Of course the question which arises is: how do we relate the aggregate-level variables, $d_j$, to the sentence-level relation mentions, $z_i$? A sensible answer to this question is a simple deterministic-OR function. The deterministic-OR states that if there exists at least one $i$ such that $z_i = j$, then $d_j = 1$. For example, if at least one sentence mentions that "Barack Obama was born in Honolulu", then that fact is true in aggregate; if none of the sentences mentions the relation, then the fact is assumed false. The model also makes the converse assumption: if Freebase contains the relation BIRTHLOCATION(Barack Obama, Honolulu), then we must extract it from at least one sentence. A summary of this model, which is due to Hoffmann et al. (2011), is presented in Figure 2.

²These variables indicate which relation is mentioned between $e_1$ and $e_2$ in each sentence.
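As a concrete illustration (not code from the paper; the NA encoding and names are ours), the deterministic-OR can be computed from a sentence-level assignment as follows:

```python
def aggregate_or(z, num_relations):
    """Deterministic-OR: d_j = 1 iff at least one z_i equals relation j.

    Convention (ours): z_i = -1 encodes the "NA" label.
    """
    d = [False] * num_relations
    for z_i in z:
        if z_i >= 0:
            d[z_i] = True
    return d

# e.g., three sentences, two relations, only the first sentence extracts r=0:
# aggregate_or([0, -1, -1], num_relations=2) -> [True, False]
```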

3.1 Learning

To learn the parameters of the sentence-level relation mention classifier, $\theta$, we maximize the likelihood of the facts observed in Freebase conditioned on the sentences in our text corpus:

$$\theta^* = \arg\max_\theta \prod_{e_1,e_2} P(d \mid s; \theta) = \arg\max_\theta \prod_{e_1,e_2} \sum_{z} P(z, d \mid s; \theta)$$

Here the conditional likelihood of a given entity pair is defined as follows:

$$P(z, d \mid s; \theta) = \prod_{i=1}^{n} \phi(z_i, s_i; \theta) \times \prod_{j=1}^{k} \omega(z, d_j) = \prod_{i=1}^{n} e^{\theta \cdot f(z_i, s_i)} \times \prod_{j=1}^{k} \mathbb{1}_{\neg d_j \oplus \exists i: z_i = j}$$

where $\mathbb{1}_x$ is an indicator which takes the value 1 if $x$ is true and 0 otherwise, the $\omega(z, d_j)$ factors are hard constraints corresponding to the deterministic-OR function, and $f(z_i, s_i)$ is a vector of features extracted from sentence $s_i$ and relation $z_i$.

An iterative gradient-ascent based approach is used to tune $\theta$ using a latent-variable perceptron-style additive update scheme (Collins, 2002; Liang et al., 2006; Zettlemoyer and Collins, 2007). The gradient of the conditional log likelihood, for a single pair of entities, $e_1$ and $e_2$, is as follows:³

$$\frac{\partial \log P(d \mid s; \theta)}{\partial \theta} = E_{P(z \mid s, d; \theta)}\left[\sum_j f(s_j, z_j)\right] - E_{P(z, d \mid s; \theta)}\left[\sum_j f(s_j, z_j)\right]$$

³For details see Koller and Friedman (2009), Chapter 20.


[Figure 2: MultiR (Hoffmann et al., 2011). Sentences $s_1, \ldots, s_n$ mentioning an entity pair such as (Barack Obama, Honolulu) feed local extractors, which produce relation-mention variables $z_1, \ldots, z_n$ with $P(z_i = r \mid s_i) \propto \exp(\theta \cdot f(s_i, r))$; a deterministic OR links the mentions to aggregate relation variables $d_1, \ldots, d_k$ (Born-In, Lived-In, children, etc.).]

[Figure 3: DNMAR. The aggregate variables of Figure 2 are split in two: $t_1, \ldots, t_k$ indicate whether each fact is mentioned in the text, and $d_1, \ldots, d_k$ indicate whether it is mentioned in the DB.]

These expectations are too difficult to compute in practice, so instead they are approximated as maximizations. Computing this approximation to the gradient requires solving two inference problems corresponding to the two maximizations:

$$z^*_{DB} = \arg\max_z P(z \mid s, d; \theta)$$
$$z^* = \arg\max_z P(z, d \mid s; \theta)$$

The MAP solution for the second term is easy to compute: because $d$ and $z$ are deterministically related, we can simply find the highest scoring relation, $r$, for each sentence, $s_i$, according to the sentence-level factors, $\phi$, independently. The first term is more difficult, however, as this requires finding the best assignment to the sentence-level hidden variables $z = z_1 \ldots z_n$ conditioned on the observed sentences and facts in the database. Hoffmann et al. (2011) show how this reduces to a well-known weighted edge cover problem which can be solved exactly in polynomial time.
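Combined with the gradient above, this yields a latent-variable perceptron-style update: add the features of the Freebase-constrained MAP assignment $z^*_{DB}$ and subtract those of the unconstrained assignment $z^*$. A minimal sketch, assuming sparse feature vectors represented as dicts (the function and variable names are ours):

```python
from collections import defaultdict

def perceptron_update(theta, f, sentences, z_db, z_free, lr=1.0):
    """theta <- theta + lr * (f(s, z*_DB) - f(s, z*)), feature by feature.

    theta:   defaultdict(float) mapping feature name -> weight
    f(r, s): iterable of active feature names for relation r in sentence s
    z_db:    MAP assignment constrained to agree with Freebase
    z_free:  unconstrained MAP assignment
    """
    for s, r_db, r_free in zip(sentences, z_db, z_free):
        if r_db == r_free:
            continue  # identical assignments cancel out
        for feat in f(r_db, s):
            theta[feat] += lr
        for feat in f(r_free, s):
            theta[feat] -= lr
    return theta
```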

4 Modeling Missing Data

The model presented in Section 3 makes two assumptions which correspond to hard constraints:

1. If a fact is not found in the database, it cannot be mentioned in the text.

2. If a fact is in the database, it must be mentioned in at least one sentence.

These assumptions drive the learning; however, if there is information missing from either the text or the database, this leads to errors in the training data (false positives and false negatives, respectively).

In order to gracefully handle the problem of missing data, we propose to extend the model presented in Section 3 by splitting the aggregate-level variables, $d$, into two parts: $t$, which represents whether a fact is mentioned in the text (in at least one sentence), and $d'$, which represents whether the fact is mentioned in the database. We introduce pairwise potentials $\psi(t_j, d_j)$ which penalize disagreement between $t_j$ and $d_j$, that is:

$$\psi(t_j, d_j) = \begin{cases} -\alpha_{\text{MIT}} & \text{if } t_j = 0 \text{ and } d_j = 1 \\ -\alpha_{\text{MID}} & \text{if } t_j = 1 \text{ and } d_j = 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\alpha_{\text{MIT}}$ (Missing In Text) and $\alpha_{\text{MID}}$ (Missing In Database) are parameters of the model which can be understood as penalties for missing information in the text and database, respectively. We refer to this model as DNMAR (for Distant Supervision with Data Not Missing At Random). A graphical model representation is presented in Figure 3.
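The potential is a small lookup; a direct transcription (illustrative only):

```python
def psi(t_j, d_j, alpha_mit, alpha_mid):
    """Pairwise potential penalizing disagreement between t_j and d_j."""
    if t_j == 0 and d_j == 1:
        return -alpha_mit  # fact in the database but missing in the text
    if t_j == 1 and d_j == 0:
        return -alpha_mid  # fact in the text but missing in the database
    return 0.0             # t_j and d_j agree: no penalty
```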

This model can be understood as relaxing the two hard constraints mentioned above into soft constraints. As we show in Section 7, simply relaxing these hard constraints into soft constraints and setting the two parameters $\alpha_{\text{MIT}}$ and $\alpha_{\text{MID}}$ by hand on development data results in a large improvement in precision at comparable recall over MultiR on two different applications of distant supervision: binary relation extraction and named entity categorization.

Inference in this model becomes more challenging, however, because the constrained inference problem no longer reduces to a weighted edge cover problem as before. In Section 5, we present an inference technique for the new model which is time and memory efficient and almost always finds an exact MAP solution.

The learning proceeds analogously to what was described in Section 3.1, with the exception that we now maximize over the additional aggregate-level hidden variables, $t$, which have been introduced. As before, MAP inference is a subroutine in learning, both for the unconstrained case corresponding to the second term (which is again trivial to compute), and for the constrained case, which is more challenging as it no longer reduces to a weighted edge cover problem as before.

5 MAP Inference

The only difference in the new inference problem is the addition of $t$; $z$ and $t$ are deterministically related, so we can simply find a MAP assignment to $z$, from which $t$ follows. The resulting inference problem can be viewed as optimization under soft constraints, where the objective includes a term, $-\alpha_{\text{MID}}$, for each fact not in Freebase which is extracted from the text, and an effective reward, $\alpha_{\text{MIT}}$, for extracting a fact which is contained in Freebase.

The solution to the MAP inference problem is the value of $z$ which maximizes the following objective:

$$z^*_{DB} = \arg\max_z P(z \mid s, d; \theta, \alpha) = \arg\max_z \sum_{i=1}^{n} \theta \cdot f(z_i, s_i) + \sum_{j=1}^{k} \left( \alpha_{\text{MIT}} \mathbb{1}_{d_j \wedge \exists i: z_i = j} - \alpha_{\text{MID}} \mathbb{1}_{\neg d_j \wedge \exists i: z_i = j} \right) \quad (1)$$
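To make the objective concrete, the following sketch scores a complete assignment $z$ under Equation 1, assuming (our convention) that NA is encoded as -1 and that local scores are precomputed:

```python
def objective(z, phi, d, alpha_mit, alpha_mid):
    """Score of a complete assignment z under Equation 1.

    phi[i][r]: precomputed local score theta . f(r, s_i)
    d[j]:      True iff the proposition r_j(e1, e2) is in Freebase
    Convention (ours): z_i = -1 means NA and contributes no local score.
    """
    score = sum(phi[i][z_i] for i, z_i in enumerate(z) if z_i >= 0)
    extracted = {z_i for z_i in z if z_i >= 0}  # t_j = 1 iff j in extracted
    for j in range(len(d)):
        if j in extracted:
            # reward extractions backed by Freebase, penalize the rest
            score += alpha_mit if d[j] else -alpha_mid
    return score
```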

Whether we choose to set the parameters $\alpha_{\text{MIT}}$ and $\alpha_{\text{MID}}$ to fixed values (Section 4), or to incorporate side information through a missing data model (Section 6), inference becomes more challenging than in the model where facts observed in Freebase are treated as hard constraints (Section 3); the hard constraints are equivalent to setting $\alpha_{\text{MID}} = \alpha_{\text{MIT}} = \infty$.

We now present exact and approximate approaches to inference. Standard search methods such as A* and branch and bound have high computation and memory requirements and are therefore only feasible on problems with few variables; they are, however, guaranteed to find an optimal solution.⁴ Approximate methods scale to large problem sizes, but we lose the guarantee of finding an optimal solution. After showing how to find guaranteed exact solutions for small problem sizes (e.g., up to 200 variables), we present an inference algorithm based on local search which is empirically shown to find optimal solutions in almost every case by comparing its solutions to those found by A*.

⁴Each entity pair defines an inference problem where the number of variables is equal to the number of sentences which mention the pair.

5.1 A* Search

We cast exact MAP inference in the DNMAR model as an application of A* search. Each partial hypothesis, $h$, in the search space corresponds to a partial assignment of the first $m$ variables in $z$; to expand a hypothesis, we generate $k$ new hypotheses, where $k$ is the total number of relations. Each new hypothesis $h'$ contains the same partial assignment to $z_1, \ldots, z_m$ as $h$, with each $h'$ having a different value of $z_{m+1} = r$.

A* operates by maintaining a priority queue of hypotheses to expand, with each hypothesis' priority determined by an admissible heuristic. The heuristic represents an upper bound on the score of the best solution with $h$'s partial variable assignment under the objective from Equation 1. In general, a tighter upper bound corresponds to a better heuristic and faster solutions. To upper bound our objective, we start with the $\phi(z_i, s_i)$ factors from the partial assignment. Unassigned variables ($i > m$) are independently set to their maximum possible value, $z_i = \arg\max_r \phi(r, s_i)$. Next, to account for the effect of the aggregate $\psi(t_j, d_j)$ factors on the unassigned variables, we consider independently changing each unassigned $z_i$ variable for each $\psi(t_j, d_j)$ factor to improve the overall score. This approach can lead to inconsistencies, but provides us with a good upper bound for the best possible solution with a partial assignment to $z_1, \ldots, z_m$.
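A deliberately simplified version of such a bound is sketched below; it is looser than the one described above, since it credits every Freebase fact with its $\alpha_{\text{MIT}}$ reward and ignores unavoidable $\alpha_{\text{MID}}$ penalties, but it remains an over-estimate (hence admissible for maximization) whenever both $\alpha$ parameters are non-negative. All names are ours:

```python
def upper_bound(partial_z, phi, d, alpha_mit):
    """Admissible heuristic: over-estimates the best completion's score.

    partial_z assigns the first m variables; the rest are scored
    optimistically by their best local value (including NA, worth 0 here).
    Assumes alpha_mit >= 0 and alpha_mid >= 0, so dropping all alpha_mid
    penalties and granting every alpha_mit reward can only overshoot.
    """
    m, n = len(partial_z), len(phi)
    score = sum(phi[i][z_i] for i, z_i in enumerate(partial_z) if z_i >= 0)
    score += sum(max(0.0, max(phi[i])) for i in range(m, n))
    score += alpha_mit * sum(1 for in_db in d if in_db)
    return score
```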

5.2 Local Search

While A* is guaranteed to find an exact solution, its time and memory requirements prohibit use on large problems involving many variables. As a more scalable alternative we propose a greedy hill climbing method (Russell et al., 1996), which starts with a full assignment to $z$, and repeatedly moves to the best neighboring solution $z'$ according to the objective in Equation 1. The neighborhood of $z$ is defined by a set of search operators. If none of the neighboring solutions has a higher score, then we have reached a (local) maximum, at which point the algorithm terminates with the current solution, which may or may not correspond to a global maximum. This process is repeated using a number of random restarts, and the best local maximum is returned as the solution.

Search Operators: We start with a standard search operator, which considers changing each relation-mention variable, $z_i$, individually to maximize the overall score. At each iteration, all $z_i$s are considered, and the one which produces the largest improvement to the overall score is changed to form the neighboring solution, $z'$. Unfortunately, this definition of the solution neighborhood is prone to poor local optima, because it is often required to traverse many low scoring states before changing one of the aggregate variables, $t_j$, and achieving a higher score from the associated aggregate factor, $\psi(t_j, d_j)$. For example, consider a case where the proposition $r(e_1, e_2)$ is not in Freebase, but is mentioned many times in the text, and imagine the current solution contains no mention $z_i = r$. Any neighboring solution which assigns a mention to $r$ will include the penalty $\alpha_{\text{MID}}$, which could outweigh the benefit, $\phi(r, s_i) - \phi(z_i, s_i)$, from changing any individual $z_i$ to $r$. If multiple mentions were changed to $r$, however, together they could outweigh the penalty for extracting a fact not in Freebase, and produce an overall higher score.

To avoid the problem of getting stuck in local optima, we propose an additional search operator which considers changing all variables, $z_i$, which are currently assigned to a specific relation $r$, to a new relation $r'$, resulting in an additional $(k-1)^2$ possible neighbors, in addition to the $n \times (k-1)$ neighbors which come from the standard search operator. This aggregate-level search operator allows for more global moves which help to avoid local optima, similar to the type-level sampling approach for MCMC (Liang et al., 2010).

At each iteration, we consider all $n \times (k-1) + (k-1)^2$ possible neighboring solutions generated by both search operators, and pick the one with the biggest overall improvement, or terminate the algorithm if no improvements can be made over the current solution. 20 random restarts were used for each inference problem. We found this approach to almost always find an optimal solution: in over 100,000 problems with 200 or fewer variables from the New York Times dataset used in Section 7, an optimal solution was missed in only 3 cases, which was verified by comparing against optimal solutions found using A*. Without including the aggregate-level search operator, local search almost always gets stuck in a local maximum.
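A compact sketch of the procedure, reusing the illustrative objective function from Section 5 (NA again encoded as -1; for brevity each neighbor is rescored from scratch, whereas an efficient implementation would compute score deltas incrementally):

```python
import random

def local_search(phi, d, alpha_mit, alpha_mid, num_relations, restarts=20):
    """Greedy hill climbing with sentence-level and aggregate-level operators."""
    n = len(phi)
    best_z, best_score = None, float("-inf")
    for _ in range(restarts):
        z = [random.randrange(-1, num_relations) for _ in range(n)]
        score = objective(z, phi, d, alpha_mit, alpha_mid)
        while True:
            best_move, best_gain = None, 0.0
            # Operator 1: change a single relation-mention variable z_i.
            for i in range(n):
                for r in range(-1, num_relations):
                    if r != z[i]:
                        cand = z[:i] + [r] + z[i + 1:]
                        gain = objective(cand, phi, d, alpha_mit, alpha_mid) - score
                        if gain > best_gain:
                            best_move, best_gain = cand, gain
            # Operator 2: reassign *all* mentions of relation r to r'.
            for r in {z_i for z_i in z if z_i >= 0}:
                for r2 in range(-1, num_relations):
                    if r2 != r:
                        cand = [r2 if z_i == r else z_i for z_i in z]
                        gain = objective(cand, phi, d, alpha_mit, alpha_mid) - score
                        if gain > best_gain:
                            best_move, best_gain = cand, gain
            if best_move is None:
                break  # local maximum reached
            z, score = best_move, score + best_gain
        if score > best_score:
            best_z, best_score = z, score
    return best_z, best_score
```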

6 Incorporating Side Information

In Section 4, we relaxed the hard constraints made by MultiR, which allows for missing information in either the text or database, enabling errors in the distantly supervised training data to be naturally corrected as a side-effect of learning. We made the simplifying assumption, however, that all facts are equally likely to be missing from the text or database, which is encoded in the choice of two fixed parameters, $\alpha_{\text{MIT}}$ and $\alpha_{\text{MID}}$. Is it possible to improve performance by incorporating side information in the form of a missing data model (Little and Rubin, 1986), taking into account how likely each fact is to be observed in the text and the database conditioned on its truth value? In our setting, the missing data model corresponds to choosing the values of $\alpha_{\text{MIT}}$ and $\alpha_{\text{MID}}$ dynamically based on the entities and relations involved.

Popular Entities: Consider two entities: Barack Obama, the 44th president of the United States, and Donald Parry, a professional rugby league footballer of the 1980s.⁵ Since Obama is much more well-known than Parry, we wouldn't be very surprised to see information missing from Freebase about Parry, but it would seem odd if true propositions were missing about Obama.

⁵http://en.wikipedia.org/wiki/Donald_Parry

We can encode these intuitions by choosing entity-specific values of $\alpha_{\text{MID}}$:

$$\alpha^{(e_1,e_2)}_{\text{MID}} = -\gamma \min(c(e_1), c(e_2))$$

where $c(e_i)$ is the number of times $e_i$ appears in Freebase, which is used as an estimate of its coverage.

Well Aligned Relations: Given that a pair of entities, $e_1$ and $e_2$, participating in a Freebase relation, $r$, appear together in a sentence $s_i$, the chance that $s_i$ expresses $r$ varies greatly depending on $r$. For example, if a sentence mentions a pair of entities which participate in both the COUNTRYCAPITAL relation and the LOCATIONCONTAINS relation (for example Moscow and Russia), it is more likely that a random sentence will express LOCATIONCONTAINS than COUNTRYCAPITAL.

We can encode this preference for matching certain relations over others by setting $\alpha^r_{\text{MIT}}$ on a per-relation basis. We choose a different value of $\alpha^r_{\text{MIT}}$ for each relation based on quick inspection of the data, estimating the number of true positives. Relations such as contains, place_lived, and nationality, which contain a large number of true positive matches, are assigned a large value $\alpha^r_{\text{MIT}} = \gamma_{\text{large}}$; those with a medium number of matches, such as capital, place_of_death, and administrative_divisions, were assigned a medium value $\gamma_{\text{medium}}$; and those relations with few matches were assigned a small value $\gamma_{\text{small}}$. These 3 parameters were tuned on held-out development data.
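Putting the two kinds of side information together, a hedged sketch follows; the bucket memberships mirror the text, but the $\gamma$ values and the coverage-count table are placeholders that would be tuned on development data:

```python
# Per-relation alpha_MIT buckets, as described in the text.
LARGE = {"contains", "place_lived", "nationality"}
MEDIUM = {"capital", "place_of_death", "administrative_divisions"}

def alpha_mit(relation, g_large=2.0, g_medium=1.0, g_small=0.5):
    """Relations with many true-positive matches get a larger reward.
    The gamma values here are placeholders, not the tuned ones."""
    if relation in LARGE:
        return g_large
    if relation in MEDIUM:
        return g_medium
    return g_small

def alpha_mid(e1, e2, count, gamma=0.01):
    """Entity-pair-specific alpha_MID from Freebase coverage counts c(e).
    Implements alpha_MID^(e1,e2) = -gamma * min(c(e1), c(e2)) as written
    in Section 6; count is a dict mapping entity -> Freebase frequency."""
    return -gamma * min(count[e1], count[e2])
```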

7 Experiments

In Section 5, we presented a scalable approach to inference in the DNMAR model which almost always finds an optimal solution. Of course the real question is: does modeling missing data improve performance at extracting information from text? In this section we present experimental results showing large improvements in both precision and recall on two distantly supervised learning tasks: binary relation extraction and named entity categorization.

7.1 Binary Relation Extraction

For the sake of comparison to previous work, we evaluate performance on the New York Times text, features, and Freebase relations developed by Riedel et al. (2010), which were also used by Hoffmann et al. (2011). This dataset is constructed by extracting named entities from 1.8 million New York Times articles, which are then matched against entities in Freebase. Sentences which contain pairs of entities participating in one or more relations are then used as training examples for those relations. The sentence-level features include word sequences appearing in context with the pair of entities, in addition to part of speech sequences and dependency paths from the Malt parser (Nivre et al., 2004).

7.1.1 Baseline

To evaluate the effect of modeling missing data in distant supervision, we compare against the MultiR model (Hoffmann et al., 2011), a state-of-the-art approach for binary relation extraction which is the most similar previous work, and which models facts in Freebase as hard constraints, disallowing the possibility of missing information in either the text or the database. To make our experiment as controlled as possible and rule out the possibility of differences in performance due to implementation details, we compare against our own re-implementation of MultiR, which reproduces Hoffmann et al.'s performance and shares as much code as possible with the DNMAR model.

7.1.2 Experimental Setup

We evaluate binary relation extraction using two evaluations. We first evaluate on a sentence-level extraction task using a manually annotated dataset provided by Hoffmann et al. (2011).⁶ This dataset consists of sentences paired with human judgments on whether each expresses a specific relation. Secondly, we perform an automatic evaluation which compares propositions extracted from text against held-out data from Freebase.

⁶http://raphaelhoffmann.com/mr/

7.1.3 Results

Sentential Extraction: Figure 4 presents precision and recall curves for the sentence-level relation extraction task on the same manually annotated data presented by Hoffmann et al. (2011). By explicitly modeling the possibility of missing information in both the text and the database, we achieve a 17% increase in area under the precision recall curve. Incorporating additional side information in the form of a missing data model, as described in Section 6, produces even better performance, yielding a 27% increase over the baseline in area under the curve. We also compare against the system described by Xu et al. (2013) (hereinafter called Xu13). To do this, we trained our implementation of MultiR on the labels predicted by their Pseudo-relevance Feedback model.⁷

[Figure 4: Overall precision and recall at the sentence-level extraction task, comparing against human judgments; curves shown for MultiR, Xu13, DNMAR, and DNMAR*. DNMAR* incorporates side information as discussed in Section 6.]

The differences between each pair of systems, except DNMAR and Xu13,⁸ are significant with a p-value less than 0.05 according to a paired t-test assuming a normal distribution.

Per-relation precision and recall curves are presented in Figure 6. For certain relations, for instance /location/us_state/capital, there simply isn't enough overlap between the information contained in Freebase and facts mentioned in the text to learn anything useful. For these relations, entity pair matches are unlikely to actually express the relation; for instance, in the following sentence from the data:

"NHPF, which has its Louisiana office in Baton Rouge, gets the funds..."

although Baton Rouge is the capital of Louisiana, the /location/us_state/capital relation is not expressed in this sentence.

⁷We thank Wei Xu for making this data available.
⁸DNMAR has a 1.3% increase in AUC over Xu13, though this difference is not significant according to a paired t-test. DNMAR* achieves a 10% increase in AUC over Xu13, which is significant.

[Figure 5: Aggregate-level automatic evaluation comparing against held-out data from Freebase; curves shown for MultiR, DNMAR, and DNMAR*. DNMAR* incorporates side information as discussed in Section 6.]

Another interesting observation which we can make from Figure 6 is that the benefit from modeling missing data varies from one relation to another. Some relations, for instance /people/person/place_of_birth, have relatively good coverage in both Freebase and the text, and therefore we do not see as much gain from modeling missing data. Other relations, such as /location/location/contains and /people/person/place_lived, have poorer coverage, making our missing data model very beneficial.

Aggregate Extraction: Following previous work, we evaluate precision and recall against held-out data from Freebase in Figure 5. As mentioned by Mintz et al. (2009), this automatic evaluation underestimates precision because many facts correctly extracted from the text are missing in the database and therefore judged as incorrect. Riedel et al. (2013) further argue that this evaluation is biased because frequent entity pairs are more likely to contain facts in Freebase, so systems which rank extractions involving popular entities higher will achieve better performance independently of how accurate their predictions are. Indeed, in Figure 5 we see that the precision of our system which models missing data is generally lower than that of the system which assumes no data is missing from Freebase, although we roughly double the recall.


[Figure 6: Per-relation precision and recall on the sentence-level relation extraction task, with panels for business_company_founders, business_person_company, location_country_administrative_divisions, location_country_capital, location_location_contains, location_neighborhood_neighborhood_of, location_us_state_capital, people_deceased_person_place_of_death, people_person_children, people_person_nationality, people_person_place_lived, and people_person_place_of_birth. The dashed line corresponds to MultiR, DNMAR is the solid line, and DNMAR*, which incorporates side-information, is represented by the dotted line.]


By better modeling missing data, we achieve lower precision on this automatic held-out evaluation, as the system using hard constraints is explicitly trained to predict facts which occur in Freebase (not those which are mentioned in the text but unlikely to appear in the database).

7.2 Named Entity Categorization

As mentioned previously, the problem of missing data in distant (weak) supervision is a very general issue; so far we have investigated this problem in the context of extracting binary relations using distant supervision. We now turn to the problem of weakly supervised named entity recognition (Collins and Singer, 1999; Talukdar and Pereira, 2010).

7.2.1 Experimental Setup

To demonstrate the effect of modeling missing data in the distantly supervised named entity categorization task, we adapt the MultiR and DNMAR models to the Twitter named entity categorization dataset which was presented by Ritter et al. (2011). The models described so far are applied unchanged: rather than modeling a set of relations in Freebase between a pair of entities, $e_1$ and $e_2$, we now model a set of possible Freebase categories associated with a single entity $e$. This is a natural extension of distant supervision from binary to unary relations. The unlabeled data and features described by Ritter et al. (2011) are used for training the model, and their manually annotated Twitter named entity dataset is used for evaluation.

7.2.2 Results

Precision and recall at weakly supervised named entity categorization, comparing MultiR against DNMAR, is presented in Figure 7. We observe substantial improvement in precision at comparable recall by explicitly modeling the possibility of missing information in the text and database. The missing data model leads to a 107% increase in area under the precision-recall curve (from 0.16 to 0.34), but still falls short of the results presented by Ritter et al. (2011). Intuitively this makes sense, because the model used by Ritter et al. is based on latent Dirichlet allocation, which is better suited to this highly ambiguous unary relation data.

[Figure 7: Precision and recall at the named entity categorization task, comparing NER_MultiR and NER_DNMAR.]

8 Conclusions

In this paper we have investigated the problem of missing data in distant supervision; we introduced a joint model of information extraction and missing data which relaxes the hard constraints used in previous work to generate heuristic labels, and provides a natural way to incorporate side information through a missing data model. The efficient inference used in previous work no longer applies in the new model, so we presented an approach based on A* search which is guaranteed to find exact solutions; however, exact inference is not computationally tractable for large problems. To address the challenge of large problem sizes, we proposed a scalable inference algorithm based on local search, which includes a set of aggregate search operators allowing for long-distance jumps in the solution space to avoid local maxima; this approach was experimentally demonstrated to find exact solutions in almost every case. Finally, we evaluated the performance of our model on the tasks of binary relation extraction and named entity categorization, showing large performance gains in each case.

In future work we would like to apply our approach to modeling missing data to additional models, for instance those of Surdeanu et al. (2012) and Ritter et al. (2011), and also to explore new missing data models.


Acknowledgements

The authors would like to thank Dan Weld, Chris Quirk, Raphael Hoffmann, and the anonymous reviewers for helpful comments. Thanks to Wei Xu for providing data. This research was supported in part by ONR grant N00014-11-1-0294, DARPA contract FA8750-09-C-0179, a gift from Google, and a gift from Vulcan Inc., and was carried out at the University of Washington's Turing Center.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries.

Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In Proceedings of ACL.

Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases.

Razvan Bunescu and Raymond Mooney. 2007. Learning to extract relations from the web using minimal supervision. In ACL.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of AAAI.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In EMNLP.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In ISMB.

Thomas G. Dietterich, Richard H. Lathrop, and Tomas Lozano-Perez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2003. Bayesian Data Analysis. CRC Press.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL-HLT.

D. Koller and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of ACL.

Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Type-based MCMC. In Proceedings of ACL.

Roderick J. A. Little and Donald B. Rubin. 1986. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of NAACL-HLT.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of CoNLL.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML/PKDD.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP.

Stuart J. Russell, Peter Norvig, John F. Candy, Jitendra M. Malik, and Douglas D. Edwards. 1996. Artificial Intelligence: A Modern Approach.

Benjamin Snyder and Regina Barzilay. 2007. Database-text alignment via structured multilabel classification. In Proceedings of IJCAI.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP-CoNLL.

Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of ACL.

Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of ACL.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of CIKM.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of ACL.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL.

