Open Information Extraction from the Web
Oren Etzioni

TRANSCRIPT

Page 1: Open Information Extraction from the Web Oren Etzioni

Open Information Extraction from the Web

Oren Etzioni

Page 2: Open Information Extraction from the Web Oren Etzioni

KnowItAll Project (2003…)

Rob Bart, Janara Christensen, Tony Fader, Tom Lin, Alan Ritter, Michael Schmitz, Dr. Niranjan Balasubramanian, Dr. Stephen Soderland, Prof. Mausam, Prof. Dan Weld

PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof. Alex Yates

Funding: DARPA, IARPA, NSF, ONR, Google.


Page 3: Open Information Extraction from the Web Oren Etzioni


Outline

I. A “scruffy” view of Machine Reading
II. Open IE (overview, progress, new demo)
III. Critique of Open IE
IV. Future work: Open, Open IE

Page 4: Open Information Extraction from the Web Oren Etzioni


I. Machine Reading (Etzioni, AAAI ‘06)

• “MR is an exploratory, open-ended, serendipitous process”

• “In contrast with many NLP tasks, MR is inherently unsupervised”

• “Very large scale”

• “Forming Generalizations based on extracted assertions”

No Ontology… Ontology Free!

Page 5: Open Information Extraction from the Web Oren Etzioni


Lessons from DB/KR Research

• Declarative KR is expensive & difficult
• Formal semantics is at odds with
  – Broad scope
  – Distributed authorship
• KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance” (Halevy, IJCAI ‘03)

A fortiori, for KBs extracted from text!

Page 6: Open Information Extraction from the Web Oren Etzioni


Machine Reading at Web Scale

• A “universal ontology” is impossible
• Global consistency is like world peace
• Micro ontologies: scale? Interconnections?
• Ontological “glass ceiling”
  – Limited vocabulary
  – Pre-determined predicates
  – Swamped by reading at scale!

Page 7: Open Information Extraction from the Web Oren Etzioni


II. Open vs. Traditional IE

             Traditional IE                   Open IE
Input:       Corpus + O(R) hand-labeled data  Corpus
Relations:   Specified in advance             Discovered automatically
Extractor:   Relation-specific                Relation-independent

How is Open IE Possible?

Page 8: Open Information Extraction from the Web Oren Etzioni


Semantic Tractability Hypothesis

An easy-to-understand subset of English:

• Characterized relations/arguments syntactically (Banko, ACL ’08; Fader, EMNLP ’11; Etzioni, IJCAI ‘11)
• Characterization is compact, domain independent
• Covers 85% of binary, verb-based relations (a minimal sketch of the syntactic constraint follows)
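To make this concrete, here is a minimal Python sketch of a ReVerb-style syntactic constraint (after Fader, EMNLP ’11): a relation phrase is a maximal span of POS tags matching V | V P | V W* P. The coarse tag groupings below are simplifications of the real pattern, not the system’s exact implementation.

```python
import re

def tag_class(tag):
    """Map a Penn Treebank POS tag onto the pattern's coarse classes."""
    if tag.startswith("VB"):
        return "V"                      # verb
    if tag in ("IN", "RP", "TO"):
        return "P"                      # preposition / particle / inf. 'to'
    if tag.startswith(("NN", "JJ", "RB", "PRP", "DT")):
        return "W"                      # noun / adj / adv / pron / det
    return "O"                          # anything else breaks the phrase

# V | V P | V W* P, matched greedily so the longest phrase wins.
RELATION_PATTERN = re.compile(r"V+(?:W*P)?")

def relation_phrases(tagged):
    """Yield token spans whose POS sequence matches the pattern."""
    classes = "".join(tag_class(t) for _, t in tagged)
    for m in RELATION_PATTERN.finditer(classes):
        yield " ".join(tok for tok, _ in tagged[m.start():m.end()])

sent = [("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
        ("in", "IN"), ("Ulm", "NNP")]
print(list(relation_phrases(sent)))     # ['was born in']
```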

Page 9: Open Information Extraction from the Web Oren Etzioni


Sample Relation Phrases

invented      acquired by     has a PhD in
denied        voted for       inhibits tumor growth in
inherited     born in         mastered the art of
downloaded    aspired to      is the patron saint of
expelled      arrived from    wrote the book on

Page 10: Open Information Extraction from the Web Oren Etzioni


Number of Relations

DARPA MR Domains               <50
NYU, Yago                      <100
NELL                           ~500
DBpedia 3.2                    940
PropBank                       3,600
VerbNet                        5,000
Wikipedia Infoboxes (f > 10)   ~5,000
TextRunner (phrases)           100,000+
ReVerb (phrases)               1,000,000+

Page 11: Open Information Extraction from the Web Oren Etzioni


TextRunner (2007)

• First Web-scale Open IE system
• Distant supervision + CRF models of relations
• Output: (Arg1, Relation phrase, Arg2) tuples
• 1,000,000,000 distinct extractions

Page 12: Open Information Extraction from the Web Oren Etzioni


Relation Extraction from Web

Page 13: Open Information Extraction from the Web Oren Etzioni


Open IE (2012)

• Open-source ReVerb extractor
• Synonym detection
• Parser-based Ollie extractor (Mausam, EMNLP ‘12)
  – Verbs → nouns and more
  – Analyzes context (beliefs, counterfactuals)
• Sophistication of IE is a major focus

But what about entities, types, ontologies?

Examples:
  After beating the Heat, the Celtics are now the “top dog” in the NBA.
    → (the Celtics, beat, the Heat)
  If he wins 5 key states, Romney will be president.
    → (counterfactual: “if he wins 5 key states”)

Page 14: Open Information Extraction from the Web Oren Etzioni


Towards “Ontologized” Open IE

• Link arguments to Freebase (Lin, AKBC ‘12)
  – When possible!
• Associate types with args
• No Noun Phrase Left Behind (Lin, EMNLP ‘12)

Page 15: Open Information Extraction from the Web Oren Etzioni


System Architecture

Input (Web corpus) → Extractor (relation-independent extraction) → raw tuples → Assessor (synonyms, confidence) → extractions → query processor (indexed in Lucene; entities linked).

Example raw tuples:
  (XYZ Corp.; acquired; Go Inc.)    (XYZ; buyout of; Go Inc.)
  (Einstein; was born in; Ulm)      (Albert Einstein; born in; Ulm)
  (Einstein Bros.; sell; bagels)
  (oranges; contain; Vitamin C)

Synonym resolution: XYZ Corp. = XYZ; Albert Einstein = Einstein ≠ Einstein Bros.

Assessed extractions (with redundancy counts):
  Acquire(XYZ Corp., Go Inc.) [7]
  BornIn(Albert Einstein, Ulm) [5]
  Sell(Einstein Bros., bagels) [1]
  Contain(oranges, Vitamin C) [1]

A toy sketch of the Assessor step follows.

DEMO
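The sketch: normalize synonymous argument and relation strings, then tally redundancy counts like the bracketed numbers above. The hand-written synonym tables stand in for what the system actually learns.

```python
from collections import Counter

# Hand-written stand-ins for learned synonym tables.
ARG_SYNONYMS = {"XYZ": "XYZ Corp.", "Einstein": "Albert Einstein"}
REL_SYNONYMS = {"buyout of": "acquired", "born in": "was born in"}

def normalize(tup):
    """Canonicalize argument and relation strings before counting."""
    a1, rel, a2 = tup
    return (ARG_SYNONYMS.get(a1, a1), REL_SYNONYMS.get(rel, rel),
            ARG_SYNONYMS.get(a2, a2))

def assess(raw_tuples):
    """Merge synonymous raw tuples and tally redundancy counts."""
    return Counter(normalize(t) for t in raw_tuples)

raw = [("XYZ Corp.", "acquired", "Go Inc."),
       ("XYZ", "buyout of", "Go Inc."),
       ("Einstein", "was born in", "Ulm"),
       ("Albert Einstein", "born in", "Ulm")]
for tup, n in assess(raw).items():
    print(tup, f"[{n}]")   # e.g. ('XYZ Corp.', 'acquired', 'Go Inc.') [2]
```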

Page 16: Open Information Extraction from the Web Oren Etzioni


III. Critique of Open IE

• Lack of formal ontology/vocabulary
• Inconsistent extractions
• Can it support reasoning?
• What’s the point of Open IE?

Page 17: Open Information Extraction from the Web Oren Etzioni


Perspectives on Open IE

A. “Search Needs a Shakeup” (Etzioni, Nature ’11)
B. Textual Resources
C. Reasoning over Extractions

Page 18: Open Information Extraction from the Web Oren Etzioni


A. New Paradigm for Search

“Moving Up the Information Food Chain” (Etzioni, AAAI ‘96)

Retrieval          Extraction
Snippets, docs     Entities, relations
Keyword queries    Questions
List of docs       Answers

Essential for smartphones! (Siri meets Watson)

Page 19: Open Information Extraction from the Web Oren Etzioni


Case Study over Yelp Reviews

1. Map review corpus to (attribute, value) pairs
   (sushi = fresh) (parking = free)
2. Natural-language queries
   “Where’s the best sushi in Seattle?”
3. Sort results via sentiment analysis
   exquisite > very good > so-so

(A toy sketch of these three steps follows.)
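The sketch makes strong simplifying assumptions: a crude “X was/is Y” pattern stands in for the attribute-value extractor, and the tiny sentiment lexicon is invented. The real pipeline (RevMiner, next slide) is far more robust.

```python
import re
from collections import defaultdict

# Invented sentiment scale for illustration.
SENTIMENT = {"exquisite": 3, "excellent": 3, "fresh": 2, "good": 1,
             "free": 1, "so-so": 0, "bland": -1, "stale": -2}

def mine_pairs(review):
    """Map 'the sushi was fresh' style clauses to (attribute, value) pairs."""
    return re.findall(r"(\w+) (?:was|is|are|were) ([\w-]+)", review.lower())

def rank(reviews_by_restaurant, attribute):
    """Sort restaurants by mean sentiment of values attached to `attribute`."""
    scores = defaultdict(list)
    for name, reviews in reviews_by_restaurant.items():
        for review in reviews:
            for attr, val in mine_pairs(review):
                if attr == attribute and val in SENTIMENT:
                    scores[name].append(SENTIMENT[val])
    return sorted(scores, key=lambda n: -sum(scores[n]) / len(scores[n]))

reviews = {"Sushi A": ["The sushi was exquisite and the parking is free."],
           "Sushi B": ["The sushi was stale."]}
print(rank(reviews, "sushi"))   # ['Sushi A', 'Sushi B']
```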

Page 20: Open Information Extraction from the Web Oren Etzioni


RevMiner: Extractive Interface to 400K Yelp Reviews (Huang, UIST ’12)

Revminer.com

Page 21: Open Information Extraction from the Web Oren Etzioni


B. Public Textual Resources (Leveraging Open IE)

• 94M rel-grams: n-grams, but over relations in text (Balasubramanian, AKBC ’12)
• 600K relation phrases (Fader, EMNLP ‘11)
• Relation meta-data:
  – 50K domain/range for relations (Ritter, ACL ‘10)
  – 10K functional relations (Lin, EMNLP ‘10)
• 30K learned Horn clauses (Schoenmackers, EMNLP ‘10)
• CLEAN (Berant, ACL ‘12)
  – 10M entailment rules (coming soon)
  – Precision double that of DIRT
  – e.g., (police investigate X) → (police charge Y); a toy application of such rules follows

See openie.cs.washington.edu
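The toy below applies entailment rules of this flavor over extracted tuples; the rule format and the matching (rewriting the relation while keeping the arguments) are my simplifications, not CLEAN’s actual representation.

```python
# Each rule: the relation on the left textually entails the relation on the right.
RULES = [("investigate", "charge"),
         ("acquire", "own")]          # second rule invented for illustration

def apply_rules(tuples, rules):
    """Derive new tuples by rewriting relations along entailment rules."""
    return [(a1, rhs, a2)
            for a1, rel, a2 in tuples
            for lhs, rhs in rules
            if rel == lhs]

facts = [("police", "investigate", "the robbery"),
         ("Google", "acquire", "YouTube")]
print(apply_rules(facts, RULES))
# [('police', 'charge', 'the robbery'), ('Google', 'own', 'YouTube')]
```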

Page 22: Open Information Extraction from the Web Oren Etzioni


C. Reasoning over Extractions

Over a corpus of 1,000,000,000 extractions:
• Linear-time first-order Horn-clause inference (Schoenmackers, EMNLP ’08)
• Learn argument types via a generative model (Ritter, ACL ‘10)
• Transitive inference (Berant, ACL ’11)
• Identify synonyms (Yates & Etzioni, JAIR ‘09)

Page 23: Open Information Extraction from the Web Oren Etzioni


Unsupervised, probabilistic model for identifying synonyms

• P(Bill Clinton = President Clinton)
  – Count shared (relation, arg2)
• P(acquired = bought)
  – Relations: count shared (arg1, arg2)
• Functions, mutual recursion
• Next step: unify with

(A minimal sketch of the shared-context idea follows.)
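In the sketch, plain Jaccard overlap over a handful of invented tuples stands in for the full probabilistic model (Yates & Etzioni, JAIR ‘09).

```python
def contexts(tuples, name, role="arg1"):
    """Contexts a string occurs in: (relation, arg2) pairs for an argument
    string, or (arg1, arg2) pairs for a relation string."""
    if role == "arg1":
        return {(rel, a2) for a1, rel, a2 in tuples if a1 == name}
    return {(a1, a2) for a1, rel, a2 in tuples if rel == name}

def synonymy_score(tuples, x, y, role="arg1"):
    """Jaccard overlap of shared contexts as a crude synonymy signal."""
    cx, cy = contexts(tuples, x, role), contexts(tuples, y, role)
    return len(cx & cy) / len(cx | cy) if cx | cy else 0.0

tuples = [("Bill Clinton", "vetoed", "the bill"),
          ("President Clinton", "vetoed", "the bill"),
          ("Bill Clinton", "was born in", "Hope"),
          ("President Clinton", "was born in", "Hope"),
          ("Einstein Bros.", "sell", "bagels")]
print(synonymy_score(tuples, "Bill Clinton", "President Clinton"))  # 1.0
```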

Page 24: Open Information Extraction from the Web Oren Etzioni


Scalable Textual Inference

Desiderata for inference:
• In text → probabilistic inference
• On the Web → linear in |Corpus|

Argument distributions of textual relations:
• Inference provably linear
• Empirically linear!

Page 25: Open Information Extraction from the Web Oren Etzioni


Inference Scalability for Holmes

Page 26: Open Information Extraction from the Web Oren Etzioni


Extractions → Domain/Range

• Much previous work (Resnik, Pantel, etc.)
• Utilize generative topic models:
  – Extractions of R ↔ document
  – Domain/range of R ↔ topics

Page 27: Open Information Extraction from the Web Oren Etzioni


Relations as Documents: TextRunner Extractions

born_in(Einstein, Ulm)
headquartered_in(Microsoft, Redmond)
founded_in(Microsoft, 1973)
born_in(Bill Gates, Seattle)
founded_in(Google, 1998)
headquartered_in(Google, Mountain View)
born_in(Sergey Brin, Moscow)
founded_in(Microsoft, Albuquerque)
born_in(Einstein, March)
born_in(Sergey Brin, 1973)

Page 28: Open Information Extraction from the Web Oren Etzioni


Generative Story [LinkLDA, Erosheva et al., 2004]

[Plate diagram: per relation R (N relations in total), hidden topics z1, z2 generate arguments a1, a2, drawn from two separate sets of type distributions.]

• For each relation, randomly pick a distribution over types; there are two separate sets of type distributions, one per argument slot:
  X born_in Y: P(Topic1 | born_in) = 0.5, P(Topic2 | born_in) = 0.3, …
• For each extraction, pick a type (topic) for a1 and for a2:
  Person born_in Location
• Then pick arguments based on the types:
  Sergey Brin born_in Moscow

(A toy rendering of this story follows.)
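The rendering below is purely illustrative: the two per-relation type distributions and the instance lists are invented, and inference (fitting the distributions from extractions) is omitted entirely.

```python
import random

def sample(dist):
    """Draw one key from a {name: probability} distribution."""
    names, weights = zip(*dist.items())
    return random.choices(names, weights=weights)[0]

# Two separate sets of per-relation type distributions, one per argument slot.
ARG1_TYPES = {"born_in": {"Person": 0.8, "Organization": 0.2}}
ARG2_TYPES = {"born_in": {"Location": 0.5, "Date": 0.3, "Other": 0.2}}
INSTANCES = {"Person": ["Sergey Brin", "Einstein"],
             "Organization": ["Microsoft", "Google"],
             "Location": ["Moscow", "Ulm"],
             "Date": ["1973", "1998"],
             "Other": ["March", "Albuquerque"]}

def generate(relation):
    """Pick a type (topic) for each argument, then pick arguments given types."""
    t1, t2 = sample(ARG1_TYPES[relation]), sample(ARG2_TYPES[relation])
    return (random.choice(INSTANCES[t1]), relation, random.choice(INSTANCES[t2]))

random.seed(1)
print(generate("born_in"))   # e.g. ('Sergey Brin', 'born_in', 'Moscow')
```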

Page 29: Open Information Extraction from the Web Oren Etzioni


Examples of Learned Domain/range

• elect(Country, Person)
• predict(Expert, Event)
• download(People, Software)
• invest(People, Assets)
• was-born-in(Person, Location OR Date)

Page 30: Open Information Extraction from the Web Oren Etzioni


Summary: Trajectory of Open IE

2003     KnowItAll project
2007     TextRunner: 1,000,000,000 “ontology free” extractions
2008-9   Inference over extractions
2010-11  Open-source extractor; public textual resources
2012     Freebase types; IE-based search; deeper analysis of sentences

openie.cs.washington.edu

Page 31: Open Information Extraction from the Web Oren Etzioni


IV. Future: Open, Open IE

• Open input: ingest tuples from any source as (Tuple, Source, Confidence)
• Linked open output:
  – Extractions → the Linked Open Data (LOD) cloud
  – Relation normalization
  – Use LOD best practices
• Specialized reasoners

Page 32: Open Information Extraction from the Web Oren Etzioni


Conclusions

1. Ontology is not necessary for reasoning
2. Open IE is “gracefully” ontologized
3. Open IE is boosting text analysis
4. LOD has distribution & scale (but not text) = opportunity

Thank you

Page 33: Open Information Extraction from the Web Oren Etzioni


Questions

• Why Open?
• What’s next?
• Dimensions for analyzing systems
• What’s worked, what’s failed? (lessons)
• What can we learn from Watson?
• What can we learn from DB/KR? (Alon)

Page 34: Open Information Extraction from the Web Oren Etzioni


Questions

• What extraction mechanism is used?
• What corpus?
• What input knowledge?
• Role for people / manual labeling?
• Form of the extracted knowledge?
• Size/scope of extracted knowledge?
• What reasoning is done?
• Most unique aspect?
• Biggest challenge?

Page 35: Open Information Extraction from the Web Oren Etzioni


Scalability Notes

• Interoperability, distributed authorship vs. a monolithic system
• Open IE meets RDF:
  – Need URIs for predicates. How to obtain? (a hypothetical sketch follows)
  – What about errors in mapping to URIs?
  – Ambiguity? Uncertainty?
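One hypothetical answer to the URI question, sketched below: mint predicate and entity URIs in a project namespace by normalizing surface strings. The namespace is invented, and this naive scheme runs straight into the mapping-error and ambiguity issues the slide raises.

```python
from urllib.parse import quote

NS = "http://openie.example.org/"   # invented namespace

def mint_uri(kind, text):
    """Mint a URI by normalizing a surface string into the project namespace."""
    return f"<{NS}{kind}/{quote(text.strip().lower().replace(' ', '_'))}>"

def to_ntriple(arg1, rel, arg2):
    """Render an Open IE tuple as one N-Triples line."""
    return (f"{mint_uri('entity', arg1)} {mint_uri('rel', rel)} "
            f"{mint_uri('entity', arg2)} .")

print(to_ntriple("Albert Einstein", "was born in", "Ulm"))
# One line, wrapped here for readability:
# <http://openie.example.org/entity/albert_einstein>
#   <http://openie.example.org/rel/was_born_in>
#   <http://openie.example.org/entity/ulm> .
```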

Page 36: Open Information Extraction from the Web Oren Etzioni


Reasoning

• NELL: inter-class constraints to generate negative examples

Page 37: Open Information Extraction from the Web Oren Etzioni


Dimensions of Scalability

• Corpus size
• Syntactic coverage over text
• Semantic coverage over text
  – Time, belief, n-ary relations, etc.
• Number of entities, relations
• Ability to reason
• How much CPU?
• How much manual effort?
• Bounding, ceiling effect, ontological glass ceiling

Page 38: Open Information Extraction from the Web Oren Etzioni


Examples of Limiting Assumptions

• NELL: “apple” has a single meaning
• Single atom per entity
  – Global computation to add an entity
  – Can’t be sure
• LOD:
  – Best practices
  – Same-as links

Page 39: Open Information Extraction from the Web Oren Etzioni


Risks for a Scalable System

• Limited semantics, reasoning
• No reasoning…

Page 40: Open Information Extraction from the Web Oren Etzioni


LOD triples in Aug 2011: 31,634,213,770

Page 41: Open Information Extraction from the Web Oren Etzioni


The following statement appears in the last paragraph of the W3C Linked Library Data Group Final Report:

“. . . Linked Data follows an open-world assumption: the assumption that data cannot generally be assumed to be complete and that, in principle, more data may become available for any given entity.”

Page 42: Open Information Extraction from the Web Oren Etzioni


Page 43: Open Information Extraction from the Web Oren Etzioni


Entity Linking an Extraction Corpus

Running example: “Einstein quit his job at the patent office” (8 source sentences). Candidate entities: US Patent Office, EU Patent Office, Japan Patent Office, Swiss Patent Office, Patent.

1. String Match: obtain candidates and measure string similarity. An exact string match is the best match; also consider alternate capitalization, edit distance, word overlap, substring/superstring, known aliases, and potential abbreviations.

2. Prominence Priors: prominence ∝ number of links in Wikipedia to that entity’s article (candidate inlink counts: 1,281; 168; 56; 101; 4,620).

3. Context Match: cosine similarity between Wikipedia article texts and a “document” of the extraction’s source sentences (“Einstein quit his job at the patent office.”, “Einstein quit his job at the patent office to become a professor.”, “In 1909, Einstein quit his job at the patent office.”, “Einstein quit his job at the patent office where he worked.”).

Link Score is a function of (String Match Score, Prominence Prior Score, Context Match Score), e.g., String Match Score × ln(Prominence Prior Score) × Context Match Score.

Link Ambiguity = 2nd Top Link Score / Top Link Score

A 2.53 GHz computer links 15 million text arguments in ~3 days (60+ per second). Collective linking, vs. linking one extraction at a time, is faster and gives higher precision. (A sketch of the scoring recipe follows.)
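The sketch uses placeholder components: an exact-match string score, the raw inlink count as the prominence prior, and a whitespace bag-of-words cosine for context match. All of the finer signals listed on the slide are omitted.

```python
import math
from collections import Counter

def string_score(mention, candidate):
    """Placeholder: exact (case-insensitive) match beats everything else."""
    return 1.0 if mention.lower() == candidate.lower() else 0.5

def context_score(source_sentences, article_text):
    """Cosine similarity between the extraction's source sentences and a
    candidate's article text, as crude whitespace bags of words."""
    a = Counter(" ".join(source_sentences).lower().split())
    b = Counter(article_text.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link_score(mention, sentences, candidate, inlinks, article):
    return (string_score(mention, candidate)
            * math.log(inlinks)
            * context_score(sentences, article))

def link_ambiguity(scores):
    top = sorted(scores, reverse=True)
    return top[1] / top[0] if len(top) > 1 and top[0] else 0.0

sentences = ["Einstein quit his job at the patent office."]
candidates = [
    ("Swiss Patent Office", 101,
     "the patent office in Bern where Einstein worked"),
    ("US Patent Office", 1281,
     "the federal agency that grants United States patents")]
scores = [link_score("patent office", sentences, c, n, art)
          for c, n, art in candidates]
print(scores, "ambiguity:", link_ambiguity(scores))
```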

Page 44: Open Information Extraction from the Web Oren Etzioni


Q/A with Linked Extractions

• Ambiguous Entities: “I need to learn about Titanic the ship for my homework.”
  Unlinked results mix the film and the ship: “Titanic earned more than $1 billion worldwide”, “The Titanic sank in 1912”, “The Titanic was released in 1998”, “Titanic represents the state-of-the-art in special effects”, “Titanic was built in Belfast” (3,761 more …). Linked to RMS Titanic: “The Titanic set sail from Southampton”, “RMS Titanic weighed about 26 kt”, “The Titanic was built for safety and comfort”, “The Titanic sank in 12,460 feet of water” (1,902 more …).

• Typed Search: “Which sports originated in China?”
  String matching alone returns “Noodles originated in China”, “Printmaking originated in China”, “Soy Beans originated in China”, “Wushu originated in China”, “Taoism originated in China”, “Ping Pong originated in China” (534 more …). Restricting to sports that originated in China: Golf, Ping Pong, Dragon Boating, Wushu, Karate, Soccer (“Golf originated in China”, “Soccer originated in China”, “Karate originated in China”, “Dragon Boating originated in China”, 14 more …).

• Linked Resources: e.g., Freebase Sports (“Dragon Boat Racing”, “Table Tennis”, …).

Leverages KBs by linking textual arguments to entities found in the knowledge base. (A toy sketch of typed search follows.)
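In the sketch, a hand-written type table stands in for the Freebase types attached to arguments during linking; all names are illustrative.

```python
# Toy stand-in for Freebase types attached to linked arguments.
TYPES = {"Golf": {"Sport"}, "Ping Pong": {"Sport"}, "Dragon Boating": {"Sport"},
         "Wushu": {"Sport"}, "Karate": {"Sport"}, "Soccer": {"Sport"},
         "Noodles": {"Food"}, "Taoism": {"Religion"}, "Printmaking": {"Art"}}

def typed_answers(tuples, relation, arg2, required_type):
    """Answer 'which X <relation> <arg2>?' keeping only arg1s of the right type."""
    return [a1 for a1, rel, a2 in tuples
            if rel == relation and a2 == arg2
            and required_type in TYPES.get(a1, set())]

facts = [("Golf", "originated in", "China"),
         ("Noodles", "originated in", "China"),
         ("Ping Pong", "originated in", "China"),
         ("Taoism", "originated in", "China")]
print(typed_answers(facts, "originated in", "China", "Sport"))
# ['Golf', 'Ping Pong']
```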

Page 45: Open Information Extraction from the Web Oren Etzioni


Linked Extractions support Reasoning

In addition to Question Answering, Linking can also benefit:

• Functions [Ritter et al., 2008; Lin et al., 2010]
• Other Relation Properties [Popescu, 2007; Lin et al., CSK 2010]
• Inference [Schoenmackers et al., 2008; Berant et al., 2011]
• Knowledge-Base Population [Dredze et al., 2010]
• Concept-Level Annotations [Christensen and Pasca, 2012]
• … basically anything using the output of extraction

Other Web-based text containing Entities (e.g., Query Logs) can also be linked to enable new experiences…

Page 46: Open Information Extraction from the Web Oren Etzioni


Challenges

• Single-sentence extraction
  – He believed the plan will work
  – John Glenn was the first American in space
  – Obama was elected President in 2008.
  – American president Barack Obama asserted…
• ??