Structured Querying of Web Text: A Technical Challenge Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko Presenter: Shahina Ferdous ID – 1000630375 Date – 03/23/10


Post on 23-Feb-2016


Page 1: Structured Querying of Web Text: A Technical Challenge

Structured Querying of Web Text: A Technical Challenge
Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko

Presenter: Shahina Ferdous
ID – 1000630375
Date – 03/23/10

Page 2: Structured Querying of Web Text: A Technical Challenge

Querying over Unstructured Data

Web (Text Documents)

The Web contains a vast number of text documents, which are:
• Unstructured
• Accessed by keywords
• Limited in search quality

Page 3: Structured Querying of Web Text: A Technical Challenge

Querying over Unstructured Data

Web

Show me some people, what they invented, and the years they died

Keyword-in

Document-out

Page 4: Structured Querying of Web Text: A Technical Challenge

Querying over Unstructured Data

Web

List some scientists with their inventions and the years they died

Keyword-in

Document-out

Page 5: Structured Querying of Web Text: A Technical Challenge

Structured Querying of Web Text

Query: “Show me some people, what they invented, and the years they died”

Scientist    Invention          Year   Prob
Kepler       log books          1630   .7902
Heisenberg   matrix mechanics   1976   .7897
Galileo      telescope          1642   .7395
Newton       calculus           1727   .7366

The paper proposes a structured Web query system called an extraction database, ExDB.

ExDB uses an information extraction (IE) system to extract data. Because the extracted data can be erroneous, ExDB assigns a probability to each tuple.

Page 6: Structured Querying of Web Text: A Technical Challenge

ExDB Work Flow

[Workflow figure: Web → 1. Run extractors → 2. Populate data model (RDBMS) → 3. Query Processing & Applications (query middleware)]

Sample Web text (three documents, each containing): “… In 1877, Edison invented the phonograph. Although he …”

Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Seattle    0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

Sample query: invented(Edison ?e, ?i)

Page 7: Structured Querying of Web Text: A Technical Challenge

Information Extraction

ExDB extracts several base-level concepts through a combination of existing IE techniques:

• Objects are data values in the system. Examples: Einstein, telephone, Boston, light-bulb.
• Predicates represent binary relations between pairs of objects. Examples: discovered(Edison, phonograph), born-in(A. Einstein, Switzerland), and sells(Amazon, PlayStation).
• Semantic types represent unary relations over objects. Examples: city(Boston), city(New-York), and electronics(dvd-player).

Page 8: Structured Querying of Web Text: A Technical Challenge

Information Extraction

ExDB should also extract further kinds of relationships to make queries easier for the user:

• Synonyms denote equivalent objects, predicates, or types. Examples: Einstein and A. Einstein almost certainly refer to the same object; invented and has-invented refer to the same predicate.
• Inclusion dependencies describe a subset relationship between two predicates. Example: invented(?x, ?y) ⊆ discovered(?x, ?y).
• Functional dependencies are useful for answering queries with negation, or for explaining why an object is not an answer. For example, a probabilistic FD indicating that a person can be born in only one country:

born-in(?x, <country> ?y): ?x -> ?y   p=0.95

Suppose a user asks for “All scientists born in Germany that taught at Princeton” and, after receiving the answers, asks the system “Why is Einstein not an answer?” Using the FD above, the system can answer: because born-in(Einstein, Switzerland) is in the database and the FD says a person can be born in only one country, the probability of born-in(Einstein, Germany) is very low.
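The “Why is Einstein not an answer?” reasoning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fact` class, the `explain_missing` helper, and the 0.90 extraction probability are hypothetical, and the sketch merely looks for a stored fact that conflicts with the wanted one under the FD. It does not reproduce how ExDB actually computes the resulting probability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subj: str
    pred: str
    obj: str
    prob: float


def explain_missing(facts, pred, subj, wanted_obj, fd_prob):
    """Explain why pred(subj, wanted_obj) is unlikely under the FD ?x -> ?y.

    Returns None when no stored fact conflicts with the wanted one.
    """
    conflicts = [
        f for f in facts
        if f.pred == pred and f.subj == subj and f.obj != wanted_obj
    ]
    if not conflicts:
        return None
    strongest = max(conflicts, key=lambda f: f.prob)
    return (
        f"{pred}({subj}, {strongest.obj}) is stored with probability "
        f"{strongest.prob:.2f}, and the FD {pred}: ?x -> ?y (p={fd_prob:.2f}) "
        f"allows only one value per subject, so {pred}({subj}, {wanted_obj}) "
        f"is very unlikely."
    )


# Hypothetical extraction probability; the slide gives only the FD's p=0.95.
facts = [Fact("Einstein", "born-in", "Switzerland", 0.90)]
explanation = explain_missing(facts, "born-in", "Einstein", "Germany", 0.95)
print(explanation)
```

The helper returns `None` when nothing in the database contradicts the wanted fact, in which case the FD offers no explanation for the missing answer.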

Page 9: Structured Querying of Web Text: A Technical Challenge

Information Extraction

Example                                  Description        IE technique
invented(Edison, phonograph)             Arity-2 fact       TextRunner
<scientist> Einstein                     Type (hypernymy)   KnowItAll
has-invented = invented                  Synonymy           DIRT
invented ⊆ discovered                    ID (troponymy)     ?
FD: has-capital(x, y) → has-capital(y)   FD (rule)          ?

Page 10: Structured Querying of Web Text: A Technical Challenge

ExDB Work Flow

[Workflow figure repeated from Page 6: Web → 1. Run extractors → 2. Populate data model → 3. Query Processing & Applications]

Page 11: Structured Querying of Web Text: A Technical Challenge

Populate Data Model

Facts
Obj1     Pred       Obj2         prob
Edison   invented   phonograph   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Boston     0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

IDs
Inclusion   Includer     prob
invented    discovered   0.81
Seattle     Washington   0.65

FDs
LHS             RHS          prob
capital(x, y)   capital(y)   0.77
born-in(x)      country(y)   0.95

Sample source sentences:
• It was big news when Edison invented the phonograph…
• He visited cities such as Boston and New York.
• We all know that Edison did-invent the light bulb. … In 1877 Edison created the phonograph.
• Morgan was born-in 1837 into a prosperous mercantile-banking family…
• Einstein is one of the best known scientists and intellectuals of all time.

TextRunner

• For fact extraction, ExDB uses an unsupervised system called TextRunner.
• TextRunner generates a large set of extractions while running over the entire corpus of text.
• Unlike other IE systems, it does not require a set of target predicates to be specified beforehand. Instead, it starts by using a heavyweight linguistic parser to generate high-quality extraction triples.
• These high-quality triples are then used as the training set for a lightweight extraction classifier that can run on an entire Web-scale corpus.

KnowItAll

• For type extraction, ExDB uses the KnowItAll system.
• KnowItAll searches the entire corpus to extract hypernym or “is-a” relationships. For example, it extracts city(Boston) from “cities such as Seattle and Boston”.
• It assigns each extraction a probability based on its frequency (or search engine hit count).
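A minimal sketch of the “such as” idea, assuming a single hand-written regular expression and a naive singularizer. The pattern, the `singular` helper, and the second sample sentence are hypothetical illustrations, not KnowItAll's actual extraction rules. The sketch only counts how often each (type, instance) pair is seen; turning those counts into a probability is left to the real system.

```python
import re
from collections import Counter

# A capitalized name, possibly several words long (e.g. "New York").
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
# "<plural type> such as <Name>, <Name> and <Name>"
PATTERN = re.compile(rf"\b(\w+) such as ({NAME}(?:(?:,? and |, ){NAME})*)")


def singular(noun):
    """Naive singularizer, good enough for 'cities' and 'scientists'."""
    noun = noun.lower()
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s"):
        return noun[:-1]
    return noun


def extract_types(sentences):
    """Count (type, instance) pairs found by the 'such as' pattern."""
    counts = Counter()
    for sentence in sentences:
        for type_word, names in PATTERN.findall(sentence):
            for name in re.split(r",? and |, ", names):
                counts[(singular(type_word), name)] += 1
    return counts


sentences = [
    "He visited cities such as Boston and New York.",
    "She toured cities such as Seattle and Boston.",
]
counts = extract_types(sentences)
for (type_name, instance), n in sorted(counts.items()):
    print(f"{type_name}({instance}) seen {n} time(s)")
```

Pairs seen in more sentences, such as city(Boston) here, would receive a higher score than pairs seen only once.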

DIRT

• ExDB uses the DIRT algorithm to extract predicate synonyms.
• DIRT computes the degree to which the argument pairs of two predicates coincide. For example, invented and has-invented will share many argument pairs, such as Edison/light-bulb or Einstein/theory-of-relativity.
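The argument-pair overlap can be illustrated with a simple set computation. Note the swap: this sketch uses plain Jaccard overlap as a stand-in, which is a simplification and not DIRT's actual similarity measure. The fact list and function names are hypothetical.

```python
def argument_pairs(facts, pred):
    """Collect the (subject, object) pairs observed with a predicate."""
    return {(s, o) for s, p, o in facts if p == pred}


def pair_overlap(facts, pred_a, pred_b):
    """Jaccard overlap of two predicates' argument pairs (stand-in measure)."""
    a = argument_pairs(facts, pred_a)
    b = argument_pairs(facts, pred_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


facts = [
    ("Edison", "invented", "light-bulb"),
    ("Edison", "has-invented", "light-bulb"),
    ("Einstein", "invented", "theory-of-relativity"),
    ("Einstein", "has-invented", "theory-of-relativity"),
    ("Edison", "invented", "phonograph"),
    ("Dukakis", "visited", "Boston"),
]

synonym_score = pair_overlap(facts, "invented", "has-invented")
unrelated_score = pair_overlap(facts, "invented", "visited")
print(f"invented vs has-invented: {synonym_score:.2f}")
print(f"invented vs visited: {unrelated_score:.2f}")
```

Predicates that share most of their argument pairs score high and become synonym candidates, while unrelated predicates score zero.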

Page 12: Structured Querying of Web Text: A Technical Challenge

ExDB Work Flow

[Workflow figure repeated from Page 6: Web → 1. Run extractors → 2. Populate data model → 3. Query Processing & Applications]

Page 13: Structured Querying of Web Text: A Technical Challenge

ExDB Queries

ExDB lets users query the Web data model using a Datalog-like notation.

Example: q(?i) :- invented(Edison, ?i) returns all inventions by Edison.

Example with constraints: q(?x, ?y) :- died-in(<scientist> ?x, 1955 ?y)

Example query for locally available inexpensive electronics: q(?x, ?y, ?z) :- for-sale-in(<electronics> ?x, Seattle ?y), costs(?x, ?z), (?z < 25)

Another example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z), (?z < 1900)

Example of a projection query: q(?s) :- invented(<scientist> ?s, ?i)
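To make the notation concrete, here is a small sketch of how a single query atom could be matched against stored facts. Everything in it is an assumption made for illustration: the in-memory `FACTS` and `TYPES` sets, the `match_atom` helper, and the choice to ignore probabilities entirely. It is not ExDB's query engine.

```python
# Hypothetical in-memory data model; probabilities are left out on purpose.
FACTS = {
    ("Edison", "invented", "phonograph"),
    ("Edison", "invented", "light-bulb"),
    ("Einstein", "invented", "relativity"),
    ("Einstein", "died-in", "1955"),
}
TYPES = {("scientist", "Einstein"), ("scientist", "Edison"), ("city", "Boston")}


def is_var(term):
    """Variables are written with a leading '?', as in the query notation."""
    return term.startswith("?")


def match_atom(pred, subj, obj, subj_type=None):
    """Return variable bindings for the atom pred(<subj_type> subj, obj)."""
    results = []
    for s, p, o in sorted(FACTS):
        if p != pred:
            continue
        if subj_type and (subj_type, s) not in TYPES:
            continue
        binding = {}
        matched = True
        for term, value in ((subj, s), (obj, o)):
            if is_var(term):
                binding[term] = value
            elif term != value:
                matched = False
        if matched:
            results.append(binding)
    return results


# q(?i) :- invented(Edison, ?i)
inventions = [b["?i"] for b in match_atom("invented", "Edison", "?i")]
# q(?x, ?y) :- died-in(<scientist> ?x, 1955 ?y), with 1955 treated as a constant
died_1955 = [b["?x"] for b in match_atom("died-in", "?x", "1955", "scientist")]
print(inventions)
print(died_1955)
```

Constants must match the stored value exactly, variables bind to whatever is stored, and a type constraint such as `<scientist>` filters the subject through the types table.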

Page 14: Structured Querying of Web Text: A Technical Challenge

Query Processing

Non-projecting queries
• Involve a series of joins against tables in the Web data model.
• The probability of a joined tuple is the product of the individual tuples’ probabilities.
• The top-k answers, ranked by probability, are returned as results.

Types
Object       Class       prob
einstein     scientist   0.99
boston       city        0.98
bohr         scientist   0.95
france       country     0.92
curie        scientist   0.91
…            …           …
bugs bunny   scientist   0.01

Facts
Object1    Predicate     Object2       prob
einstein   invented      relativity    0.99
1848       was-year-of   revolution    0.97
edison     invented      phonograph    0.96
dukakis    visited       boston        0.93
einstein   died-in       1955          0.92
…          …             …             …
humans     have          cold-fusion   0.01

Example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z).

Scientist   Invented     Died-in   prob
einstein    relativity   1955      0.90
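The einstein row can be reproduced with a short join sketch: 0.99 (einstein is a scientist) × 0.99 (invented relativity) × 0.92 (died-in 1955) ≈ 0.90. The code below is a hand-rolled illustration, not ExDB's middleware. The function name is hypothetical, only some rows from the slide's tables are loaded, and the `<year>` check on ?z is skipped because the slide lists no year types.

```python
import heapq

# A subset of the rows shown in the Types and Facts tables on this slide.
TYPES = {
    ("einstein", "scientist"): 0.99,
    ("bohr", "scientist"): 0.95,
    ("curie", "scientist"): 0.91,
    ("bugs bunny", "scientist"): 0.01,
}
FACTS = [
    ("einstein", "invented", "relativity", 0.99),
    ("edison", "invented", "phonograph", 0.96),
    ("dukakis", "visited", "boston", 0.93),
    ("einstein", "died-in", "1955", 0.92),
]


def invented_and_died(k):
    """Top-k answers for invented(<scientist> ?x, ?y), died-in(?x, ?z)."""
    answers = []
    for x, pred1, y, prob_invented in FACTS:
        if pred1 != "invented":
            continue
        prob_type = TYPES.get((x, "scientist"))
        if prob_type is None:
            continue  # ?x is not a known scientist
        for x2, pred2, z, prob_died in FACTS:
            if pred2 == "died-in" and x2 == x:
                # Joined tuple probability = product of the parts.
                joined = prob_type * prob_invented * prob_died
                answers.append((joined, x, y, z))
    return heapq.nlargest(k, answers)


top = invented_and_died(k=5)
for prob, scientist, invention, year in top:
    print(f"{scientist}  {invention}  {year}  {prob:.2f}")
```

With these rows edison drops out, because no scientist type entry is loaded for him, which leaves einstein as the only answer.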

Page 15: Structured Querying of Web Text: A Technical Challenge

Projecting queries

q(?s) :- invented(<scientist> ?s, ?i) ranks scientists by the probability that the scientist invented something, regardless of the actual invention.

• This requires computing a disjunction of m probabilistic events.
• A scientist such as Tesla appears in the output of q if a tuple invented(Tesla, I0) is in the database. There can be many inventions I1, …, Im for Tesla, each with a tuple invented(Tesla, Ii), and any one of them is sufficient to return Tesla as an answer for q.
• Because m can be very large, a large number of very low-probability extractions can unexpectedly add up to a quite large probability.
• The paper therefore uses the abstraction of a panel of experts, where each expert is a tuple with a score, such as invented(Tesla, Fluorescent-Lighting), 0.95, and the panel determines the probability that Tesla appears in q.
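The problem with many weak extractions is easy to see numerically. Assuming the m events are independent (an assumption made here for illustration), the disjunction is 1 − ∏(1 − p_i). The sketch below shows only this naive computation and why it is troublesome; it does not implement the panel-of-experts scoring, and the probability lists are hypothetical.

```python
import math


def disjunction_probability(probs):
    """P(at least one event holds), assuming the events are independent."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# One strong extraction, e.g. invented(Tesla, Fluorescent-Lighting) at 0.95.
one_strong = disjunction_probability([0.95])
# One hundred hypothetical junk extractions at 0.01 each.
many_weak = disjunction_probability([0.01] * 100)

print(f"one strong extraction: {one_strong:.3f}")
print(f"100 weak extractions:  {many_weak:.3f}")
```

A hundred extractions that are each almost certainly wrong still combine to a probability of roughly 0.63, which is why a plain disjunction over-rewards objects with many low-quality tuples.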

Page 16: Structured Querying of Web Text: A Technical Challenge

Result of Projecting Queries

q(?s) :- invented(<scientist> ?s, x)

[Result table with columns: Scientist, invented]

Page 17: Structured Querying of Web Text: A Technical Challenge

ExDB Prototype

• Web crawl: 90M pages
• Facts: 338M tuples, 102M objects
• Types: 6.6M instances
• Synonyms: 17k pairs
• No IDs or FDs yet

Page 18: Structured Querying of Web Text: A Technical Challenge

Applications

• ExDB’s extracted data are not meant to be examined directly. Rather, they are used to build topic-specific tables that a human user can appreciate.
• Example: a synthetic table about scientists, generated by merging answers from died-in(<scientist> ?x, ?y), invented(<scientist> ?x, ?y), published(<scientist> ?x, ?y), and taught(<scientist> ?x, ?y).
• If ExDB queries can be generated automatically from keywords, a very powerful query system can be built.
• A Web data cube can be built over ExDB’s large amount of read-only structured data.

Page 19: Structured Querying of Web Text: A Technical Challenge

Alternative Models

The Schema Extraction Model tries to find the single best schema for the entire set of extractions, transforming the Web text into a traditional relational database.

Three good criteria for schema extraction are:
• Simplicity (few tables).
• Completeness (all extractions appear in the output).
• Fullness (the output database has no NULLs).

Page 20: Structured Querying of Web Text: A Technical Challenge

Alternative Models

The Text Query Model does not perform any information extraction at all. Instead, it offers a descriptive query language that generates answers to a user’s query very quickly.

User’s Query:
• Extract city/date tuples from the band’s website.
• Indicate the city where she lives.
• Compute the dates when the band’s city and her own city are within 100 miles of each other.

Page 21: Structured Querying of Web Text: A Technical Challenge

Questions?

Thank You