The Web’s Many Models. Michael J. Cafarella, University of Michigan. AKBC, May 19, 2010.

Page 1: The Web’s Many Models

The Web’s Many Models

Michael J. Cafarella University of Michigan

AKBC, May 19, 2010

Page 2: The Web’s Many Models

2

Web Information Extraction

Much recent research in information extractors that operate over Web pages:
  Snowball (Agichtein and Gravano, 2001)
  TextRunner (Banko et al., 2007)
  Yago (Suchanek et al., 2007)
  WebTables (Cafarella et al., 2008)
  DBPedia, ExDB, Freebase (make use of IE data)

Web crawl + domain-independent IE should allow comprehensive Web KBs with:
  Very high, “web-style” recall
  “More-expressive-than-search” query processing

But where is it?

Page 3: The Web’s Many Models

3

Web Information Extraction

Omnivore
  “Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA.
  Suggested remedies for data ingestion, user interaction

This talk says why ideas in that paper might already be out of date, and gives alternative ideas

If there are mistakes here, then you have a chance to save me years of work!

Page 4: The Web’s Many Models

4

Outline
  Introduction
  Data Ingestion
    Previously: Parallel Extraction
    Alternative: The Data-Centric Web
  User Interaction
    Previously: Model Generation for Output
    Alternative: Data Integration as UI
  Conclusion

Page 5: The Web’s Many Models

5

Parallel Extraction

Previous hypothesis
  Many data models for interesting data, e.g., relational tables, E/R graphs, etc.
  Should build large integration infrastructure to consume many extraction streams

Page 6: The Web’s Many Models

6

Database Construction (1)

Start with a single large Web crawl

Page 7: The Web’s Many Models

7

Database Construction (2)

Each of k extractors emits output that:
  Has an extractor-dependent model
  Has an extractor-and-Web-page-dependent schema

Page 8: The Web’s Many Models

8

Database Construction (3)

For each extractor output, unfold into common entity-relation model

Page 9: The Web’s Many Models

9

Database Construction (4)

Unify results

Page 10: The Web’s Many Models

10

Database Construction (5)

Emit final database
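
The five construction steps above can be sketched roughly as follows. This is only an illustration, not Omnivore's code: the two toy extractors, the `Triple` type, the unfolding rules, and the case-folding unification are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple, Set


class Triple(NamedTuple):
    """One fact in the common entity-relation model (hypothetical type)."""
    entity: str
    relation: str
    value: str


def table_extractor(page: str) -> Dict[str, list]:
    """Hypothetical extractor 1: emits a relational table (header + rows)."""
    return {"header": ["name", "affiliation"],
            "rows": [["Serge Abiteboul", "INRIA"]]}


def text_extractor(page: str) -> List[tuple]:
    """Hypothetical extractor 2: emits (subject, predicate, object) tuples."""
    return [("serge abiteboul", "affiliation", "inria")]


def unfold_table(table: Dict[str, list]) -> List[Triple]:
    """Assumed rule: first column names the entity, other columns are relations."""
    header, out = table["header"], []
    for row in table["rows"]:
        for col, cell in zip(header[1:], row[1:]):
            out.append(Triple(row[0], col, cell))
    return out


def unfold_text(triples: List[tuple]) -> List[Triple]:
    return [Triple(s, p, o) for s, p, o in triples]


def unify(triples: List[Triple]) -> Set[Triple]:
    """Naive unification: case-fold every field and drop exact duplicates."""
    return {Triple(*(f.strip().lower() for f in t)) for t in triples}


def build_database(crawl: List[str]) -> Set[Triple]:
    extractors = [(table_extractor, unfold_table), (text_extractor, unfold_text)]
    unfolded: List[Triple] = []
    for page in crawl:                                # (1) single Web crawl
        for extract, unfold in extractors:            # (2) each extractor, own model
            unfolded.extend(unfold(extract(page)))    # (3) unfold to common model
    return unify(unfolded)                            # (4) unify, (5) emit database


db = build_database(["<html>...</html>"])
```

Here the two extractors describe the same fact in different models, and the unification step collapses them into a single entity-relation triple.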

Page 11: The Web’s Many Models

11

Potential Problems

Pressing problems:
  Recall
  Simple intra-source reconciliation
  Time

Tables, entities probably OK for now
  Many data sources (DBPedia, Facebook, IMDB) already match one of these two pretty well

One possible different direction: the Data-Centric Web
  Addresses recall only

Page 12: The Web’s Many Models

12

The Data-Centric Web

Page 13: The Web’s Many Models

13

The Data-Centric Web

Page 14: The Web’s Many Models

14

The Data-Centric Web

Page 15: The Web’s Many Models

15

The Data-Centric Web

Page 16: The Web’s Many Models

16

The Data-Centric Web

Page 17: The Web’s Many Models

17

The Data-Centric Web

Page 18: The Web’s Many Models

18

The Data-Centric Web

Page 19: The Web’s Many Models

19

The Data-Centric Web

Page 20: The Web’s Many Models

20

The Data-Centric Web

Page 21: The Web’s Many Models

21

The Data-Centric Web

Page 22: The Web’s Many Models

22

The Data-Centric Web

Page 23: The Web’s Many Models

23

The Data-Centric Web

Page 24: The Web’s Many Models

24

Data-Centric Lists

Lists of Data-Centric Entities (DCEs) give hints:
  About what the target entity contains
  That all members of set are DCEs, or not
  That members of set belong to a class or type (e.g., program committee members)

Page 25: The Web’s Many Models

25

Build the Data-Centric Web

1. Download the Web
2. Train classifiers to detect DCEs, DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attr/val extractors on DCEs

Yields E/R dataset, for insertion into DBPedia, YAGO, etc.

In progress now… with student Ashwin Balakrishnan, entity detector >95% acc.
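
Steps 3 and 4 of the build procedure can be sketched as follows. The classifier scores are stubbed as plain numbers, and the 0.5 threshold and the majority-vote repair rule are assumptions for illustration; the talk does not say how lists are actually used to fix classifications.

```python
from typing import Dict, List, Set, Tuple

THRESHOLD = 0.5  # assumed decision threshold for both classifiers


def filter_pages(dce_score: Dict[str, float],
                 dcl_score: Dict[str, float]) -> Tuple[Set[str], Set[str]]:
    """Step 3: a page survives if it passes the DCE test or the DCL test."""
    pages = set(dce_score) | set(dcl_score)
    dces = {p for p in pages if dce_score.get(p, 0.0) >= THRESHOLD}
    dcls = {p for p in pages if dcl_score.get(p, 0.0) >= THRESHOLD}
    return dces, dcls


def fix_with_lists(dces: Set[str], dcls: Set[str],
                   links: Dict[str, List[str]]) -> Set[str]:
    """Step 4 (assumed rule): if most members of a list are DCEs, treat all as DCEs."""
    fixed = set(dces)
    for lst in dcls:
        members = links.get(lst, [])
        if members and sum(m in dces for m in members) > len(members) / 2:
            fixed.update(members)
    return fixed


# Toy scores: c.html is misclassified, but appears on a list whose other members are DCEs.
dce_score = {"a.html": 0.9, "b.html": 0.8, "c.html": 0.3, "pc.html": 0.1, "ad.html": 0.1}
dcl_score = {"pc.html": 0.95, "ad.html": 0.05}
links = {"pc.html": ["a.html", "b.html", "c.html"]}

dces, dcls = filter_pages(dce_score, dcl_score)
fixed = fix_with_lists(dces, dcls, links)
```

In this toy run `ad.html` fails both tests and is dropped, while `c.html` is recovered because it appears on a list whose other members are already classified as entities.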

Page 26: The Web’s Many Models

26

Research Question 1

How many useful entities…
  Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
  AND are otherwise well-described enough online that IE can recover an entity-centric view?

Put differently:
  Does every entity worth extracting already have a homepage on the Web?

Page 27: The Web’s Many Models

27

Research Question 2

Does a single real-world entity have more than one “authoritative” URL?
  Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job

Page 28: The Web’s Many Models

28

Outline
  Introduction
  Data Ingestion
    Previously: Parallel Extraction
    Alternative: The Data-Centric Web
  User Interaction
    Previously: Model Generation for Output
    Alternative: Data Integration as UI
  Conclusion

Page 29: The Web’s Many Models

29

Model Generation for Output

Previous hypothesis
  Many different user applications built against single back-end database
  Difficult task is translating from back-end data model to the application’s data model

Page 30: The Web’s Many Models

30

Query Processing (1)

Query arrives at system

Page 31: The Web’s Many Models

31

Query Processing (2)

Entity-relation database processor yields entity results

Page 32: The Web’s Many Models

32

Query Processing (3)

Query Renderer chooses appropriate output schema

Page 33: The Web’s Many Models

33

Query Processing (4)

User corrections are logged and fed into later iterations of db construction
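
The four query-processing steps can be sketched as below. This is a toy under stated assumptions: the in-memory `DATABASE`, the substring matching, the rule that the renderer picks a table when all results share the same attributes, and the shape of the correction log are all hypothetical, not taken from the Omnivore paper.

```python
from typing import Dict, List

Entity = Dict[str, str]

# Toy entity-relation database (assumed data).
DATABASE: List[Entity] = [
    {"name": "serge abiteboul", "affiliation": "inria"},
    {"name": "gustavo alonso", "affiliation": "etz zurich"},
]
correction_log: List[dict] = []


def process_query(query: str) -> List[Entity]:
    """(1)-(2) Return entities that have any value containing the query string."""
    q = query.lower()
    return [e for e in DATABASE if any(q in v.lower() for v in e.values())]


def render(entities: List[Entity]) -> dict:
    """(3) Choose an output schema: a table if all results share attributes, else a list."""
    keys = {tuple(sorted(e)) for e in entities}
    if len(keys) == 1:
        cols = list(keys.pop())
        return {"schema": "table", "columns": cols,
                "rows": [[e[c] for c in cols] for e in entities]}
    return {"schema": "list", "items": entities}


def log_correction(entity_name: str, attribute: str, new_value: str) -> None:
    """(4) Record a user correction for a later database-construction run."""
    correction_log.append(
        {"entity": entity_name, "attribute": attribute, "value": new_value})


result = render(process_query("alonso"))
log_correction("gustavo alonso", "affiliation", "eth zurich")
```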

Page 34: The Web’s Many Models

34

Potential Problems

Many plausible front-end applications, none yet totally compelling and novel
  Ad- and search-driven ones not novel
  Freebase, Wolfram Alpha not compelling
  Raw input to learners: useful, not an end-user application

Need to explore possible applications rather than build multi-app infrastructure

One possible different direction: data integration as user primitive

Page 35: The Web’s Many Models

35

Data Integration as UI

Can we combine tables to create new data sources?

Many existing “mashup” tools, which ignore realities of Web data
  A lot of useful data is not in XML
  User cannot know all sources in advance
  Transient integrations
  Dirty data

Page 36: The Web’s Many Models

36

Interaction Challenge

Try to create a database of all “VLDB program committee members”

Page 37: The Web’s Many Models

37

Octopus

Provides “workbench” of data integration operators to build target database
  Most operators are not correct/incorrect, but high/low quality (like search)
  Also, prosaic traditional operators

Originally ran on WebTable data

[VLDB 2009, Cafarella, Khoussainova, Halevy]

Page 38: The Web’s Many Models

38

Walkthrough - Operator #1

SEARCH(“VLDB program committee members”)

serge abiteboul    inria
anastassia ail…    carnegie…
gustavo alonso     etz zurich
…                  …

serge abiteboul    inria
michael adiba      …grenoble
antonio albano     …pisa
…                  …

Page 39: The Web’s Many Models

39

Walkthrough - Operator #2

Recover relevant data

serge abiteboul    inria
michael adiba      …grenoble
antonio albano     …pisa
…                  …

serge abiteboul    inria
anastassia ail…    carnegie…
gustavo alonso     etz zurich
…                  …

CONTEXT()

CONTEXT()

Page 40: The Web’s Many Models

40

Walkthrough - Operator #2

Recover relevant data

serge abiteboul    inria         1996
michael adiba      …grenoble     1996
antonio albano     …pisa         1996
…                  …             …

serge abiteboul    inria         2005
anastassia ail…    carnegie…     2005
gustavo alonso     etz zurich    2005
…                  …             …

CONTEXT()

CONTEXT()

Page 41: The Web’s Many Models

41

Walkthrough - Union

Combine datasets

serge abiteboul    inria         1996
michael adiba      …grenoble     1996
antonio albano     …pisa         1996
…                  …             …

serge abiteboul    inria         2005
anastassia ail…    carnegie…     2005
gustavo alonso     etz zurich    2005
…                  …             …

Union()

serge abiteboul    inria         1996
michael adiba      …grenoble     1996
antonio albano     …pisa         1996
serge abiteboul    inria         2005
anastassia ail…    carnegie…     2005
gustavo alonso     etz zurich    2005
…                  …             …

Page 42: The Web’s Many Models

42

Walkthrough - Operator #3

Add column to data
Similar to “join” but join target is a topic

EXTEND(“publications”, col=0)

serge abiteboul    inria         1996
michael adiba      …grenoble     1996
antonio albano     …pisa         1996
serge abiteboul    inria         2005
anastassia ail…    carnegie…     2005
gustavo alonso     etz zurich    2005
…                  …             …

serge abiteboul    inria         1996    “Large Scale P2P Dist…”
michael adiba      …grenoble     1996    “Exploiting bitemporal…”
antonio albano     …pisa         1996    “Another Example of a…”
serge abiteboul    inria         2005    “Large Scale P2P Dist…”
anastassia ail…    carnegie…     2005    “Efficient Use of the…”
gustavo alonso     etz zurich    2005    “A Dynamic and Flexible…”
…                  …             …

• User has integrated data sources with little effort
• No wrappers; data was never intended for reuse

“publications”
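
The walkthrough (SEARCH, CONTEXT, Union, EXTEND) can be mimicked on toy data as below. Only the operator names come from the slides. The function signatures, the in-memory `WEB_TABLES` and `PUBLICATIONS` stand-ins, and the simple matching rules are assumptions; in particular, the real operators rank candidates by quality rather than returning exact matches.

```python
from typing import Dict, List

Table = List[List[str]]

# Toy stand-ins for extracted Web tables and their source-page text (assumed data).
WEB_TABLES: Dict[str, dict] = {
    "vldb1996": {"text": "VLDB 1996 program committee members",
                 "rows": [["serge abiteboul", "inria"],
                          ["michael adiba", "…grenoble"]]},
    "vldb2005": {"text": "VLDB 2005 program committee members",
                 "rows": [["serge abiteboul", "inria"],
                          ["gustavo alonso", "etz zurich"]]},
}
# Toy topic index for "publications" (title left truncated as on the slide).
PUBLICATIONS: Dict[str, str] = {"serge abiteboul": "Large Scale P2P Dist…"}


def search(query: str) -> List[str]:
    """SEARCH: ids of tables whose source-page text contains every query word."""
    words = query.lower().split()
    return [tid for tid, t in WEB_TABLES.items()
            if all(w in t["text"].lower() for w in words)]


def context(table_id: str) -> Table:
    """CONTEXT: add a column recovered from the source page (here, a 4-digit year)."""
    t = WEB_TABLES[table_id]
    year = next(w for w in t["text"].split() if w.isdigit() and len(w) == 4)
    return [row + [year] for row in t["rows"]]


def union(*tables: Table) -> Table:
    """Union: concatenate the rows of tables that share the same columns."""
    return [row for table in tables for row in table]


def extend(table: Table, topic_index: Dict[str, str], col: int = 0) -> Table:
    """EXTEND: add a column by looking up each row's value in column `col`."""
    return [row + [topic_index.get(row[col], "")] for row in table]


hits = search("VLDB program committee members")
combined = union(*(context(tid) for tid in hits))
result = extend(combined, PUBLICATIONS, col=0)
```

Rows with no match in the topic index get an empty cell, which reflects the point that operator output is higher or lower quality rather than strictly correct or incorrect.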

Page 43: The Web’s Many Models

43

CONTEXT Algorithms

Input: table and source page
Output: data values to add to table

SignificantTerms sorts terms in source page by “importance” (tf-idf)
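
A minimal sketch of ranking source-page terms by tf-idf follows. The slide names only the scoring idea, so the tokenizer, the smoothed idf formula, the background corpus, and the function name `significant_terms` are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def significant_terms(source_page: str, corpus: List[str], k: int = 3) -> List[str]:
    """Return the k terms of source_page with the highest tf-idf score."""
    tf = Counter(tokenize(source_page))
    docs = [set(tokenize(d)) for d in corpus]
    n = len(docs)

    def tfidf(term: str) -> float:
        df = sum(term in d for d in docs)
        # Smoothed idf (an assumption); terms in every background page score 0.
        return tf[term] * math.log((1 + n) / (1 + df))

    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(tf, key=lambda t: (-tfidf(t), t))[:k]


corpus = [
    "program committee members of the conference",
    "the conference program",
    "call for papers of the workshop",
]
page = "VLDB 1996 program committee members of the conference"
top = significant_terms(page, corpus, k=2)
```

On this toy input the terms that are rare in the background pages (the venue name and the year) rank first, which is the kind of value CONTEXT would add to a table as a new column.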

Page 44: The Web’s Many Models

44

Related View Partners

Looks for different “views” of same data

Page 45: The Web’s Many Models

45

CONTEXT Experiments

Page 46: The Web’s Many Models

46

Data Integration as UI

Compelling for db researchers, but will large numbers of people use it?

Page 47: The Web’s Many Models

47

Conclusion

Automatic Web KBs rapidly progressing
  Recall still not good enough for many tasks, but progress is rapid
  Not clear what those tasks should be, and progress is much slower
    Difficult to predict what’s useful
    Sometimes difficult to write a “new app” paper

Omnivore’s approach not wrong, but did not directly address these problems