Open Information Extraction from the Web
Oren Etzioni

TRANSCRIPT

Page 1: Open Information Extraction from the Web Oren Etzioni

Open Information Extraction from the Web

Oren Etzioni

Page 2: Open Information Extraction from the Web Oren Etzioni

KnowItAll Project (2003…)

Rob Bart, Janara Christensen, Tony Fader, Tom Lin, Alan Ritter, Michael Schmitz, Dr. Niranjan Balasubramanian, Dr. Stephen Soderland, Prof. Mausam, Prof. Dan Weld

PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof. Alex Yates

Funding: DARPA, IARPA, NSF, ONR, Google.


Page 3: Open Information Extraction from the Web Oren Etzioni


Outline

I. A “scruffy” view of Machine Reading
II. Open IE (overview, progress, new demo)
III. Critique of Open IE
IV. Future work: Open, Open IE

Page 4: Open Information Extraction from the Web Oren Etzioni


I. Machine Reading (Etzioni, AAAI ‘06)

• “MR is an exploratory, open-ended, serendipitous process”

• “In contrast with many NLP tasks, MR is inherently unsupervised”

• “Very large scale”

• “Forming Generalizations based on extracted assertions”

No Ontology… Ontology Free!

Page 5: Open Information Extraction from the Web Oren Etzioni


Lessons from DB/KR Research

• Declarative KR is expensive & difficult
• Formal semantics is at odds with
  – Broad scope
  – Distributed authorship
• KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance” (Halevy, IJCAI ‘03)

A fortiori, for KBs extracted from text!

Page 6: Open Information Extraction from the Web Oren Etzioni


Machine Reading at Web Scale

• A “universal ontology” is impossible
• Global consistency is like world peace
• Micro ontologies: scale? Interconnections?
• Ontological “glass ceiling”
  – Limited vocabulary
  – Pre-determined predicates
  – Swamped by reading at scale!

Page 7: Open Information Extraction from the Web Oren Etzioni


II. Open vs. Traditional IE

             Traditional IE                   Open IE
Input:       Corpus + O(R) hand-labeled data  Corpus
Relations:   Specified in advance             Discovered automatically
Extractor:   Relation-specific                Relation-independent

How is Open IE Possible?

Page 8: Open Information Extraction from the Web Oren Etzioni


Semantic Tractability Hypothesis

An easy-to-understand subset of English:

• Characterized relations/arguments syntactically (Banko, ACL ’08; Fader, EMNLP ’11; Etzioni, IJCAI ‘11)
• Characterization is compact, domain independent
• Covers 85% of binary, verb-based relations (a minimal sketch of the syntactic constraint follows)
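To make this concrete, here is a minimal Python sketch of a ReVerb-style syntactic constraint (after Fader, EMNLP ’11): a relation phrase is a maximal span of POS tags matching V | V P | V W* P. The coarse tag groupings below are simplifications of the real pattern, not the system’s exact implementation.

```python
import re

def tag_class(tag):
    """Map a Penn Treebank POS tag onto the pattern's coarse classes."""
    if tag.startswith("VB"):
        return "V"                      # verb
    if tag in ("IN", "RP", "TO"):
        return "P"                      # preposition / particle / inf. 'to'
    if tag.startswith(("NN", "JJ", "RB", "PRP", "DT")):
        return "W"                      # noun / adj / adv / pron / det
    return "O"                          # anything else breaks the phrase

# V | V P | V W* P, matched greedily so the longest phrase wins.
RELATION_PATTERN = re.compile(r"V+(?:W*P)?")

def relation_phrases(tagged):
    """Yield token spans whose POS sequence matches the pattern."""
    classes = "".join(tag_class(t) for _, t in tagged)
    for m in RELATION_PATTERN.finditer(classes):
        yield " ".join(tok for tok, _ in tagged[m.start():m.end()])

sent = [("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
        ("in", "IN"), ("Ulm", "NNP")]
print(list(relation_phrases(sent)))     # ['was born in']
```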

Page 9: Open Information Extraction from the Web Oren Etzioni


Sample Relation Phrases

invented      acquired by     has a PhD in
denied        voted for       inhibits tumor growth in
inherited     born in         mastered the art of
downloaded    aspired to      is the patron saint of
expelled      arrived from    wrote the book on

Page 10: Open Information Extraction from the Web Oren Etzioni


Number of Relations

DARPA MR Domains               <50
NYU, Yago                      <100
NELL                           ~500
DBpedia 3.2                    940
PropBank                       3,600
VerbNet                        5,000
Wikipedia Infoboxes (f > 10)   ~5,000
TextRunner (phrases)           100,000+
ReVerb (phrases)               1,000,000+

Page 11: Open Information Extraction from the Web Oren Etzioni


TextRunner (2007)

• First Web-scale Open IE system
• Distant supervision + CRF models of relations
• Output: (Arg1, Relation phrase, Arg2) tuples
• 1,000,000,000 distinct extractions

Page 12: Open Information Extraction from the Web Oren Etzioni


Relation Extraction from Web

Page 13: Open Information Extraction from the Web Oren Etzioni


Open IE (2012)

• Open-source ReVerb extractor
• Synonym detection
• Parser-based Ollie extractor (Mausam, EMNLP ‘12)
  – Verbs → nouns and more
  – Analyzes context (beliefs, counterfactuals)
• Sophistication of IE is a major focus

But what about entities, types, ontologies?

Examples:
  After beating the Heat, the Celtics are now the “top dog” in the NBA.
    → (the Celtics, beat, the Heat)
  If he wins 5 key states, Romney will be president.
    → (counterfactual: “if he wins 5 key states”)

Page 14: Open Information Extraction from the Web Oren Etzioni


Towards “Ontologized” Open IE

• Link arguments to Freebase (Lin, AKBC ‘12)
  – When possible!
• Associate types with args
• No Noun Phrase Left Behind (Lin, EMNLP ‘12)

Page 15: Open Information Extraction from the Web Oren Etzioni


System Architecture

Input (Web corpus) → Extractor (relation-independent extraction) → raw tuples → Assessor (synonyms, confidence) → extractions → query processor (indexed in Lucene; entities linked).

Example raw tuples:
  (XYZ Corp.; acquired; Go Inc.)    (XYZ; buyout of; Go Inc.)
  (Einstein; was born in; Ulm)      (Albert Einstein; born in; Ulm)
  (Einstein Bros.; sell; bagels)
  (oranges; contain; Vitamin C)

Synonym resolution: XYZ Corp. = XYZ; Albert Einstein = Einstein ≠ Einstein Bros.

Assessed extractions (with redundancy counts):
  Acquire(XYZ Corp., Go Inc.) [7]
  BornIn(Albert Einstein, Ulm) [5]
  Sell(Einstein Bros., bagels) [1]
  Contain(oranges, Vitamin C) [1]

A toy sketch of the Assessor step follows.

DEMO
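The sketch: normalize synonymous argument and relation strings, then tally redundancy counts like the bracketed numbers above. The hand-written synonym tables stand in for what the system actually learns.

```python
from collections import Counter

# Hand-written stand-ins for learned synonym tables.
ARG_SYNONYMS = {"XYZ": "XYZ Corp.", "Einstein": "Albert Einstein"}
REL_SYNONYMS = {"buyout of": "acquired", "born in": "was born in"}

def normalize(tup):
    """Canonicalize argument and relation strings before counting."""
    a1, rel, a2 = tup
    return (ARG_SYNONYMS.get(a1, a1), REL_SYNONYMS.get(rel, rel),
            ARG_SYNONYMS.get(a2, a2))

def assess(raw_tuples):
    """Merge synonymous raw tuples and tally redundancy counts."""
    return Counter(normalize(t) for t in raw_tuples)

raw = [("XYZ Corp.", "acquired", "Go Inc."),
       ("XYZ", "buyout of", "Go Inc."),
       ("Einstein", "was born in", "Ulm"),
       ("Albert Einstein", "born in", "Ulm")]
for tup, n in assess(raw).items():
    print(tup, f"[{n}]")   # e.g. ('XYZ Corp.', 'acquired', 'Go Inc.') [2]
```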

Page 16: Open Information Extraction from the Web Oren Etzioni


III. Critique of Open IE

• Lack of formal ontology/vocabulary
• Inconsistent extractions
• Can it support reasoning?
• What’s the point of Open IE?

Page 17: Open Information Extraction from the Web Oren Etzioni


Perspectives on Open IE

A. “Search Needs a Shakeup” (Etzioni, Nature ’11)
B. Textual Resources
C. Reasoning over Extractions

Page 18: Open Information Extraction from the Web Oren Etzioni


A. New Paradigm for Search

“Moving Up the Information Food Chain” (Etzioni, AAAI ‘96)

Retrieval          Extraction
Snippets, docs     Entities, relations
Keyword queries    Questions
List of docs       Answers

Essential for smartphones! (Siri meets Watson)

Page 19: Open Information Extraction from the Web Oren Etzioni


Case Study over Yelp Reviews

1. Map review corpus to (attribute, value) pairs
   (sushi = fresh) (parking = free)
2. Natural-language queries
   “Where’s the best sushi in Seattle?”
3. Sort results via sentiment analysis
   exquisite > very good > so-so

(A toy sketch of these three steps follows.)
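The sketch makes strong simplifying assumptions: a crude “X was/is Y” pattern stands in for the attribute-value extractor, and the tiny sentiment lexicon is invented. The real pipeline (RevMiner, next slide) is far more robust.

```python
import re
from collections import defaultdict

# Invented sentiment scale for illustration.
SENTIMENT = {"exquisite": 3, "excellent": 3, "fresh": 2, "good": 1,
             "free": 1, "so-so": 0, "bland": -1, "stale": -2}

def mine_pairs(review):
    """Map 'the sushi was fresh' style clauses to (attribute, value) pairs."""
    return re.findall(r"(\w+) (?:was|is|are|were) ([\w-]+)", review.lower())

def rank(reviews_by_restaurant, attribute):
    """Sort restaurants by mean sentiment of values attached to `attribute`."""
    scores = defaultdict(list)
    for name, reviews in reviews_by_restaurant.items():
        for review in reviews:
            for attr, val in mine_pairs(review):
                if attr == attribute and val in SENTIMENT:
                    scores[name].append(SENTIMENT[val])
    return sorted(scores, key=lambda n: -sum(scores[n]) / len(scores[n]))

reviews = {"Sushi A": ["The sushi was exquisite and the parking is free."],
           "Sushi B": ["The sushi was stale."]}
print(rank(reviews, "sushi"))   # ['Sushi A', 'Sushi B']
```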

Page 20: Open Information Extraction from the Web Oren Etzioni


RevMiner: Extractive Interface to 400K Yelp Reviews (Huang, UIST ’12)

Revminer.com

Page 21: Open Information Extraction from the Web Oren Etzioni


B. Public Textual Resources (Leveraging Open IE)

• 94M rel-grams: n-grams, but over relations in text (Balasubramanian, AKBC ’12)
• 600K relation phrases (Fader, EMNLP ‘11)
• Relation meta-data:
  – 50K domain/range for relations (Ritter, ACL ‘10)
  – 10K functional relations (Lin, EMNLP ‘10)
• 30K learned Horn clauses (Schoenmackers, EMNLP ‘10)
• CLEAN (Berant, ACL ‘12)
  – 10M entailment rules (coming soon)
  – Precision double that of DIRT
  – e.g., (police investigate X) → (police charge Y); a toy application of such rules follows

See openie.cs.washington.edu
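The toy below applies entailment rules of this flavor over extracted tuples; the rule format and the matching (rewriting the relation while keeping the arguments) are my simplifications, not CLEAN’s actual representation.

```python
# Each rule: the relation on the left textually entails the relation on the right.
RULES = [("investigate", "charge"),
         ("acquire", "own")]          # second rule invented for illustration

def apply_rules(tuples, rules):
    """Derive new tuples by rewriting relations along entailment rules."""
    return [(a1, rhs, a2)
            for a1, rel, a2 in tuples
            for lhs, rhs in rules
            if rel == lhs]

facts = [("police", "investigate", "the robbery"),
         ("Google", "acquire", "YouTube")]
print(apply_rules(facts, RULES))
# [('police', 'charge', 'the robbery'), ('Google', 'own', 'YouTube')]
```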

Page 22: Open Information Extraction from the Web Oren Etzioni


C. Reasoning over Extractions

Over a corpus of 1,000,000,000 extractions:
• Linear-time first-order Horn-clause inference (Schoenmackers, EMNLP ’08)
• Learn argument types via a generative model (Ritter, ACL ‘10)
• Transitive inference (Berant, ACL ’11)
• Identify synonyms (Yates & Etzioni, JAIR ‘09)

Page 23: Open Information Extraction from the Web Oren Etzioni


Unsupervised, probabilistic model for identifying synonyms

• P(Bill Clinton = President Clinton)
  – Count shared (relation, arg2)
• P(acquired = bought)
  – Relations: count shared (arg1, arg2)
• Functions, mutual recursion
• Next step: unify with

(A minimal sketch of the shared-context idea follows.)
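In the sketch, plain Jaccard overlap over a handful of invented tuples stands in for the full probabilistic model (Yates & Etzioni, JAIR ‘09).

```python
def contexts(tuples, name, role="arg1"):
    """Contexts a string occurs in: (relation, arg2) pairs for an argument
    string, or (arg1, arg2) pairs for a relation string."""
    if role == "arg1":
        return {(rel, a2) for a1, rel, a2 in tuples if a1 == name}
    return {(a1, a2) for a1, rel, a2 in tuples if rel == name}

def synonymy_score(tuples, x, y, role="arg1"):
    """Jaccard overlap of shared contexts as a crude synonymy signal."""
    cx, cy = contexts(tuples, x, role), contexts(tuples, y, role)
    return len(cx & cy) / len(cx | cy) if cx | cy else 0.0

tuples = [("Bill Clinton", "vetoed", "the bill"),
          ("President Clinton", "vetoed", "the bill"),
          ("Bill Clinton", "was born in", "Hope"),
          ("President Clinton", "was born in", "Hope"),
          ("Einstein Bros.", "sell", "bagels")]
print(synonymy_score(tuples, "Bill Clinton", "President Clinton"))  # 1.0
```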

Page 24: Open Information Extraction from the Web Oren Etzioni


Scalable Textual Inference

Desiderata for inference:
• In text → probabilistic inference
• On the Web → linear in |Corpus|

Argument distributions of textual relations:
• Inference provably linear
• Empirically linear!

Page 25: Open Information Extraction from the Web Oren Etzioni


Inference Scalability for Holmes

Page 26: Open Information Extraction from the Web Oren Etzioni


Extractions → Domain/Range

• Much previous work (Resnik, Pantel, etc.)
• Utilize generative topic models:
  – Extractions of R ↔ document
  – Domain/range of R ↔ topics

Page 27: Open Information Extraction from the Web Oren Etzioni


Relations as Documents: TextRunner Extractions

born_in(Einstein, Ulm)
headquartered_in(Microsoft, Redmond)
founded_in(Microsoft, 1973)
born_in(Bill Gates, Seattle)
founded_in(Google, 1998)
headquartered_in(Google, Mountain View)
born_in(Sergey Brin, Moscow)
founded_in(Microsoft, Albuquerque)
born_in(Einstein, March)
born_in(Sergey Brin, 1973)

Page 28: Open Information Extraction from the Web Oren Etzioni


Generative Story [LinkLDA, Erosheva et al., 2004]

[Plate diagram: per relation R (N relations in total), hidden topics z1, z2 generate arguments a1, a2, drawn from two separate sets of type distributions.]

• For each relation, randomly pick a distribution over types; there are two separate sets of type distributions, one per argument slot:
  X born_in Y: P(Topic1 | born_in) = 0.5, P(Topic2 | born_in) = 0.3, …
• For each extraction, pick a type (topic) for a1 and for a2:
  Person born_in Location
• Then pick arguments based on the types:
  Sergey Brin born_in Moscow

(A toy rendering of this story follows.)
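The rendering below is purely illustrative: the two per-relation type distributions and the instance lists are invented, and inference (fitting the distributions from extractions) is omitted entirely.

```python
import random

def sample(dist):
    """Draw one key from a {name: probability} distribution."""
    names, weights = zip(*dist.items())
    return random.choices(names, weights=weights)[0]

# Two separate sets of per-relation type distributions, one per argument slot.
ARG1_TYPES = {"born_in": {"Person": 0.8, "Organization": 0.2}}
ARG2_TYPES = {"born_in": {"Location": 0.5, "Date": 0.3, "Other": 0.2}}
INSTANCES = {"Person": ["Sergey Brin", "Einstein"],
             "Organization": ["Microsoft", "Google"],
             "Location": ["Moscow", "Ulm"],
             "Date": ["1973", "1998"],
             "Other": ["March", "Albuquerque"]}

def generate(relation):
    """Pick a type (topic) for each argument, then pick arguments given types."""
    t1, t2 = sample(ARG1_TYPES[relation]), sample(ARG2_TYPES[relation])
    return (random.choice(INSTANCES[t1]), relation, random.choice(INSTANCES[t2]))

random.seed(1)
print(generate("born_in"))   # e.g. ('Sergey Brin', 'born_in', 'Moscow')
```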

Page 29: Open Information Extraction from the Web Oren Etzioni


Examples of Learned Domain/range

• elect(Country, Person)
• predict(Expert, Event)
• download(People, Software)
• invest(People, Assets)
• was-born-in(Person, Location OR Date)

Page 30: Open Information Extraction from the Web Oren Etzioni


Summary: Trajectory of Open IE

2003     KnowItAll project
2007     TextRunner: 1,000,000,000 “ontology free” extractions
2008-9   Inference over extractions
2010-11  Open-source extractor; public textual resources
2012     Freebase types; IE-based search; deeper analysis of sentences

openie.cs.washington.edu

Page 31: Open Information Extraction from the Web Oren Etzioni


IV. Future: Open, Open IE

• Open input: ingest tuples from any source as (Tuple, Source, Confidence)
• Linked open output:
  – Extractions → the Linked Open Data (LOD) cloud
  – Relation normalization
  – Use LOD best practices
• Specialized reasoners

Page 32: Open Information Extraction from the Web Oren Etzioni


Conclusions

1. Ontology is not necessary for reasoning
2. Open IE is “gracefully” ontologized
3. Open IE is boosting text analysis
4. LOD has distribution & scale (but not text) = opportunity

Thank you

Page 33: Open Information Extraction from the Web Oren Etzioni


Questions

• Why Open?
• What’s next?
• Dimensions for analyzing systems
• What’s worked, what’s failed? (lessons)
• What can we learn from Watson?
• What can we learn from DB/KR? (Alon)

Page 34: Open Information Extraction from the Web Oren Etzioni


Questions

• What extraction mechanism is used?
• What corpus?
• What input knowledge?
• Role for people / manual labeling?
• Form of the extracted knowledge?
• Size/scope of extracted knowledge?
• What reasoning is done?
• Most unique aspect?
• Biggest challenge?

Page 35: Open Information Extraction from the Web Oren Etzioni


Scalability Notes

• Interoperability, distributed authorship vs. a monolithic system
• Open IE meets RDF:
  – Need URIs for predicates. How to obtain? (a hypothetical sketch follows)
  – What about errors in mapping to URIs?
  – Ambiguity? Uncertainty?
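One hypothetical answer to the URI question, sketched below: mint predicate and entity URIs in a project namespace by normalizing surface strings. The namespace is invented, and this naive scheme runs straight into the mapping-error and ambiguity issues the slide raises.

```python
from urllib.parse import quote

NS = "http://openie.example.org/"   # invented namespace

def mint_uri(kind, text):
    """Mint a URI by normalizing a surface string into the project namespace."""
    return f"<{NS}{kind}/{quote(text.strip().lower().replace(' ', '_'))}>"

def to_ntriple(arg1, rel, arg2):
    """Render an Open IE tuple as one N-Triples line."""
    return (f"{mint_uri('entity', arg1)} {mint_uri('rel', rel)} "
            f"{mint_uri('entity', arg2)} .")

print(to_ntriple("Albert Einstein", "was born in", "Ulm"))
# One line, wrapped here for readability:
# <http://openie.example.org/entity/albert_einstein>
#   <http://openie.example.org/rel/was_born_in>
#   <http://openie.example.org/entity/ulm> .
```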

Page 36: Open Information Extraction from the Web Oren Etzioni


Reasoning

• NELL: inter-class constraints to generate negative examples

Page 37: Open Information Extraction from the Web Oren Etzioni


Dimensions of Scalability

• Corpus size
• Syntactic coverage over text
• Semantic coverage over text
  – Time, belief, n-ary relations, etc.
• Number of entities, relations
• Ability to reason
• How much CPU?
• How much manual effort?
• Bounding, ceiling effect, ontological glass ceiling

Page 38: Open Information Extraction from the Web Oren Etzioni


Examples of Limiting Assumptions

• NELL: “apple” has a single meaning
• Single atom per entity
  – Global computation to add an entity
  – Can’t be sure
• LOD:
  – Best practices
  – Same-as links

Page 39: Open Information Extraction from the Web Oren Etzioni


Risks for a Scalable System

• Limited semantics, reasoning
• No reasoning…

Page 40: Open Information Extraction from the Web Oren Etzioni


LOD triples in Aug 2011: 31,634,213,770

Page 41: Open Information Extraction from the Web Oren Etzioni


The following statement appears in the last paragraph of the W3C Linked Library Data Group Final Report:

“. . . Linked Data follows an open-world assumption: the assumption that data cannot generally be assumed to be complete and that, in principle, more data may become available for any given entity.”

Page 42: Open Information Extraction from the Web Oren Etzioni


Page 43: Open Information Extraction from the Web Oren Etzioni


Entity Linking an Extraction Corpus

Running example: “Einstein quit his job at the patent office” (8 source sentences). Candidate entities: US Patent Office, EU Patent Office, Japan Patent Office, Swiss Patent Office, Patent.

1. String Match: obtain candidates and measure string similarity. An exact string match is the best match; also consider alternate capitalization, edit distance, word overlap, substring/superstring, known aliases, and potential abbreviations.

2. Prominence Priors: prominence ∝ number of links in Wikipedia to that entity’s article (candidate inlink counts: 1,281; 168; 56; 101; 4,620).

3. Context Match: cosine similarity between Wikipedia article texts and a “document” of the extraction’s source sentences (“Einstein quit his job at the patent office.”, “Einstein quit his job at the patent office to become a professor.”, “In 1909, Einstein quit his job at the patent office.”, “Einstein quit his job at the patent office where he worked.”).

Link Score is a function of (String Match Score, Prominence Prior Score, Context Match Score), e.g., String Match Score × ln(Prominence Prior Score) × Context Match Score.

Link Ambiguity = 2nd Top Link Score / Top Link Score

A 2.53 GHz computer links 15 million text arguments in ~3 days (60+ per second). Collective linking, vs. linking one extraction at a time, is faster and gives higher precision. (A sketch of the scoring recipe follows.)
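The sketch uses placeholder components: an exact-match string score, the raw inlink count as the prominence prior, and a whitespace bag-of-words cosine for context match. All of the finer signals listed on the slide are omitted.

```python
import math
from collections import Counter

def string_score(mention, candidate):
    """Placeholder: exact (case-insensitive) match beats everything else."""
    return 1.0 if mention.lower() == candidate.lower() else 0.5

def context_score(source_sentences, article_text):
    """Cosine similarity between the extraction's source sentences and a
    candidate's article text, as crude whitespace bags of words."""
    a = Counter(" ".join(source_sentences).lower().split())
    b = Counter(article_text.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link_score(mention, sentences, candidate, inlinks, article):
    return (string_score(mention, candidate)
            * math.log(inlinks)
            * context_score(sentences, article))

def link_ambiguity(scores):
    top = sorted(scores, reverse=True)
    return top[1] / top[0] if len(top) > 1 and top[0] else 0.0

sentences = ["Einstein quit his job at the patent office."]
candidates = [
    ("Swiss Patent Office", 101,
     "the patent office in Bern where Einstein worked"),
    ("US Patent Office", 1281,
     "the federal agency that grants United States patents")]
scores = [link_score("patent office", sentences, c, n, art)
          for c, n, art in candidates]
print(scores, "ambiguity:", link_ambiguity(scores))
```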

Page 44: Open Information Extraction from the Web Oren Etzioni


Q/A with Linked Extractions

• Ambiguous Entities: “I need to learn about Titanic the ship for my homework.”
  Unlinked results mix the film and the ship: “Titanic earned more than $1 billion worldwide”, “The Titanic sank in 1912”, “The Titanic was released in 1998”, “Titanic represents the state-of-the-art in special effects”, “Titanic was built in Belfast” (3,761 more …). Linked to RMS Titanic: “The Titanic set sail from Southampton”, “RMS Titanic weighed about 26 kt”, “The Titanic was built for safety and comfort”, “The Titanic sank in 12,460 feet of water” (1,902 more …).

• Typed Search: “Which sports originated in China?”
  String matching alone returns “Noodles originated in China”, “Printmaking originated in China”, “Soy Beans originated in China”, “Wushu originated in China”, “Taoism originated in China”, “Ping Pong originated in China” (534 more …). Restricting to sports that originated in China: Golf, Ping Pong, Dragon Boating, Wushu, Karate, Soccer (“Golf originated in China”, “Soccer originated in China”, “Karate originated in China”, “Dragon Boating originated in China”, 14 more …).

• Linked Resources: e.g., Freebase Sports (“Dragon Boat Racing”, “Table Tennis”, …).

Leverages KBs by linking textual arguments to entities found in the knowledge base. (A toy sketch of typed search follows.)
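In the sketch, a hand-written type table stands in for the Freebase types attached to arguments during linking; all names are illustrative.

```python
# Toy stand-in for Freebase types attached to linked arguments.
TYPES = {"Golf": {"Sport"}, "Ping Pong": {"Sport"}, "Dragon Boating": {"Sport"},
         "Wushu": {"Sport"}, "Karate": {"Sport"}, "Soccer": {"Sport"},
         "Noodles": {"Food"}, "Taoism": {"Religion"}, "Printmaking": {"Art"}}

def typed_answers(tuples, relation, arg2, required_type):
    """Answer 'which X <relation> <arg2>?' keeping only arg1s of the right type."""
    return [a1 for a1, rel, a2 in tuples
            if rel == relation and a2 == arg2
            and required_type in TYPES.get(a1, set())]

facts = [("Golf", "originated in", "China"),
         ("Noodles", "originated in", "China"),
         ("Ping Pong", "originated in", "China"),
         ("Taoism", "originated in", "China")]
print(typed_answers(facts, "originated in", "China", "Sport"))
# ['Golf', 'Ping Pong']
```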

Page 45: Open Information Extraction from the Web Oren Etzioni


Linked Extractions support Reasoning

In addition to Question Answering, Linking can also benefit:

• Functions [Ritter et al., 2008; Lin et al., 2010]
• Other Relation Properties [Popescu, 2007; Lin et al., CSK 2010]
• Inference [Schoenmackers et al., 2008; Berant et al., 2011]
• Knowledge-Base Population [Dredze et al., 2010]
• Concept-Level Annotations [Christensen and Pasca, 2012]
• … basically anything using the output of extraction

Other Web-based text containing Entities (e.g., Query Logs) can also be linked to enable new experiences…

Page 46: Open Information Extraction from the Web Oren Etzioni


Challenges

• Single-sentence extraction
  – He believed the plan will work
  – John Glenn was the first American in space
  – Obama was elected President in 2008.
  – American president Barack Obama asserted…
• ??