chapter 15: data integration on the web

ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION


Post on 06-Jan-2016



TRANSCRIPT

Page 1: Chapter 15: Data Integration on the Web

ANHAI DOAN ALON HALEVY ZACHARY IVES

Chapter 15: Data Integration on the Web

PRINCIPLES OF

DATA INTEGRATION

Page 2: Chapter 15: Data Integration on the Web

Outline

Introduction, opportunities and challenges with Web data
The Deep Web
  Vertical search
  Surfacing the Deep Web
Creating topical portals
Lightweight data management on the Web
  Discovery of data sets
  Extracting data from Web pages
  Combining multiple data sets
  Re-using others’ work

Page 3: Chapter 15: Data Integration on the Web

Broad Range of Data on the Web

Page 4: Chapter 15: Data Integration on the Web

Key Characteristics

Scale and heterogeneity
  Data is about everything!
  Overlapping sources, varying levels of quality.
  Multiple formats (tables, lists, cards, etc.)
Data is laid out for visual appeal
  Extracting the data is very tricky!
  Semantics of the data are rarely specified and need to be inferred from text and other clues.

Page 5: Chapter 15: Data Integration on the Web

Different Forms of Structured Data on the Web

Page 6: Chapter 15: Data Integration on the Web

Tables: hundreds of millions of good ones

Page 7: Chapter 15: Data Integration on the Web

Databases Behind Forms: The Deep/Invisible Web

Examples: store locations, used cars, radio stations, patents, recipes

Tens of millions of high-quality forms

Page 8: Chapter 15: Data Integration on the Web

HTML Lists

Every list item is a row in a table, but figuring out cell boundaries is very tricky.

Page 9: Chapter 15: Data Integration on the Web

Structured data embedded more loosely in pages. Extraction is very tricky!

Page 10: Chapter 15: Data Integration on the Web

What Can we do with Structured Web Data?

Integrate:
  Imagine integrating your data with any data on the Web!
  Insights come when independently developed data sets come together (of course, you can also get garbage that way, so you need to be careful).
Improve Web search:
  Find tables & lists when they’re relevant to queries.
  Answer fact-seeking queries with facts rather than links to Web pages.
  Aggregate: answer “total GDP of 10 largest countries” by putting together facts from multiple pages.

Page 11: Chapter 15: Data Integration on the Web

Bigger Vision: create an ecosystem of structured data on the Web

Discover via search
Extract from Web sources
Manage, analyze, visualize, integrate, create compelling stories
Publish back to the Web

Page 12: Chapter 15: Data Integration on the Web

Outline

Introduction, opportunities and challenges with Web data
The Deep Web
  Vertical search
  Surfacing the Deep Web
Creating topical portals
Lightweight data management on the Web
  Discovery of data sets
  Extracting data from Web pages
  Combining multiple data sets
  Re-using others’ work

Page 13: Chapter 15: Data Integration on the Web

What is the Deep Web?

Content hidden behind HTML forms, not accessible to search engines.

Page 14: Chapter 15: Data Integration on the Web

The Deep Web

The collection of databases that are accessed by users entering values into HTML forms.

Search engine crawlers cannot fill in the forms, so the content is invisible to the search engine.

The work on the Deep Web illustrates many of the challenges of managing Web data.

Page 15: Chapter 15: Data Integration on the Web

Two Approaches to the Deep Web

Build a vertical search engine:
  Apply all the data integration techniques we’ve learned so far to a set of data sources such as job sites, airplane reservations, etc.
  The approach is applicable to domains that have thousands of form sites.
Surface the content:
  Try to guess good queries to pose to the forms. Insert the resulting HTML pages into the Web index.
  The approach covers the long tail of content on the Web.

Page 16: Chapter 15: Data Integration on the Web

Approach #1: Vertical Search: Data Integration

Page 17: Chapter 15: Data Integration on the Web

Vertical Search as Data Integration

Mediated schema: the properties of the domain that need to be exposed to the user.
  If you include too many attributes in the mediated schema, you may not be able to query them on many sources.
Source descriptions: relatively simple. Sources are often distinguished by their geographical coverage.
Wrappers:
  Parsing the answers from the resulting HTML is the tricky part.
  Alternate approach: don’t parse the answers. Just show the user the returned Web pages.

Page 18: Chapter 15: Data Integration on the Web

Deep Web: the Long Tail

Example long-tail topics: tree search, Amish quilts, parking tickets in India, horses

Page 19: Chapter 15: Data Integration on the Web

The Surfacing Approach

Crawl & indexing time:
  Pre-compute interesting form submissions.
  Insert the resulting pages into the Web index.
Query time: nothing!
  Deep Web URLs in the index are like any other URL.
Advantages:
  Reuse existing search engine infrastructure.
  Reduced load on target Web sites: users click only on what they deem relevant.
  Approach taken at Google for the long tail.

Page 20: Chapter 15: Data Integration on the Web

Surfacing Challenges

1. Predicting the correct input combinations
   Generating all possible URLs is wasteful and unnecessary.
   Cars.com has ~500K listings, but 250M possible queries.
2. Predicting the appropriate values for text inputs
   Valid input values are required for retrieving data.
   Ingredients in recipes.com and zipcodes in borderstores.com.
3. Don’t do anything bad!
4. Coverage of the crawl: don’t try to cover sites in their entirety; it’s not necessary.
   1. Once you get part of the content, there will be links to the rest.
   2. It’s enough to have part of the content in the index to send it relevant traffic.

Page 21: Chapter 15: Data Integration on the Web

Form Processing 101

GET and POST: two types of HTML forms
  Only GETs can be surfaced.

<form action="http://www.borders.com/locator" method="GET">
  <select name="store"><option …/>… </select>
  …
  <input name="zip" type="text"/>
  <input name="search" type="submit" value="Go"/>
  <input name="site" type="hidden" value="homepage"/>
</form>

On submit, the form produces the URL:
http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
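A minimal sketch of why GET forms can be surfaced: a GET submission is fully described by a URL, which a crawler can construct and store in the index. The helper name `submission_url` is hypothetical; the input names and values are copied from the example URL on this slide.

```python
from urllib.parse import urlencode

def submission_url(action, inputs):
    """Build the URL requested when a GET form is submitted.

    `inputs` is an ordered list of (name, value) pairs, one per form input.
    """
    return action + "?" + urlencode(inputs)

# Input names and values taken from the slide's example URL.
inputs = [
    ("store", "All"), ("city", ""), ("state", ""), ("zip", "94043"),
    ("within", "25"), ("search", "Go"), ("site", "homepage"),
]
url = submission_url("http://www.borders.com/locator", inputs)
print(url)
```

A POST form sends the same name/value pairs in the request body instead, so there is no URL for the index to point to.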

Page 22: Chapter 15: Data Integration on the Web

Google's Deep-Web Crawl (VLDB 2008)

Predicting Input Combinations
  Forms can have multiple inputs.
  Generating all possible URLs is wasteful, and unnecessary!
Goal: minimize URLs while maximizing retrieval!
Other considerations:
  Generated URLs must be good candidates for the index.
  Only need URLs sufficient to drive traffic.
  Only need URLs sufficient to seed the Web crawler.
Solution: discover only informative input combinations.

Page 23: Chapter 15: Data Integration on the Web

Informative Form Fields

Varying the state field:
  http://jobs.shrm.org/search?state=All&kw=&type=All
  http://jobs.shrm.org/search?state=AL&kw=&type=All
  http://jobs.shrm.org/search?state=AK&kw=&type=All
  …
  http://jobs.shrm.org/search?state=WV&kw=&type=All
  Result pages are different: informative.
Varying the type field:
  http://jobs.shrm.org/search?state=All&kw=&type=ALL
  http://jobs.shrm.org/search?state=All&kw=&type=ANY
  http://jobs.shrm.org/search?state=All&kw=&type=EXACT
  Result pages are similar: uninformative.

Varying the state results in qualitatively different content, and hence it is an informative field.
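One way to make "qualitatively different content" testable is to compare crude signatures of the result pages. The sketch below is an illustration under assumptions, not the method from the slides: the word-set signature, the 0.5 threshold, and the sample pages are all made up.

```python
def page_signature(page_text):
    """Crude content signature: the set of lowercased words on the page (assumption)."""
    return frozenset(page_text.lower().split())

def is_informative(result_pages, min_distinct_fraction=0.5):
    """True if varying an input produced enough distinct result pages.

    The threshold value is a hypothetical choice for this sketch.
    """
    if not result_pages:
        return False
    distinct = {page_signature(p) for p in result_pages}
    return len(distinct) / len(result_pages) >= min_distinct_fraction

# Made-up result pages, one per submitted value of the field.
pages_by_state = ["jobs in Alabama: nurse", "jobs in Alaska: pilot", "jobs in West Virginia: miner"]
pages_by_type = ["all jobs: nurse pilot miner"] * 3

state_informative = is_informative(pages_by_state)  # every page differs
type_informative = is_informative(pages_by_type)    # every page is the same
print(state_informative, type_informative)
```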

Page 24: Chapter 15: Data Integration on the Web

Computing Informative Field Combinations

Informative field combinations can be computed bottom up:
  Begin with single fields and find which ones are informative.
  For every informative combination, try to add another field and check if the resulting combination is still informative.
In practice, we rarely need combinations of more than 3 fields.
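The bottom-up computation can be sketched as a level-wise search that only extends combinations already found informative. This is a hedged sketch: `informative_combinations` and the `is_informative` oracle are hypothetical names, and the toy oracle simply looks up a hand-written set instead of probing a real form.

```python
def informative_combinations(fields, is_informative, max_size=3):
    """Level-wise search: extend only combinations that are already informative."""
    frontier = [frozenset([f]) for f in fields if is_informative(frozenset([f]))]
    found = list(frontier)
    for _ in range(max_size - 1):
        # Candidates are informative combinations plus one extra field.
        candidates = {combo | {f} for combo in frontier for f in fields if f not in combo}
        frontier = [c for c in sorted(candidates, key=sorted) if is_informative(c)]
        found.extend(frontier)
    return found

# Toy oracle (assumption): pretend these combinations yield distinct result pages.
known_informative = {frozenset({"state"}), frozenset({"kw"}), frozenset({"state", "kw"})}
found = informative_combinations(["state", "kw", "type"], known_informative.__contains__)
print([sorted(c) for c in found])
```

Because uninformative combinations are never extended, the number of probes stays far below the number of all possible input combinations; `max_size=3` mirrors the observation that larger combinations are rarely needed.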

Page 25: Chapter 15: Data Integration on the Web

Google's Deep-Web Crawl (VLDB 2008)

Challenge 2: Generic and Typed Text boxes

Generic Search Boxes
  Accept any keywords.
  Challenge: selecting the most appropriate values.
Typed Text Boxes
  Only values belonging to specific types, e.g., zipcodes.
  Challenge: selecting the type of the input.

Page 26: Chapter 15: Data Integration on the Web

Google's Deep-Web Crawl (VLDB 2008)

Example: www.wipo.int

Page 27: Chapter 15: Data Integration on the Web

Input Values for Generic Search: Iterative Probing for Search Boxes

1. Select an initial list of candidate keywords.
2. Download pages based on the current set of keywords.
3. Extract more candidate keywords from the result pages.
4. Refine the current set of keywords.
5. Repeat until there are no more new candidate keywords.
6. Prune the list of candidate keywords.
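The probing loop above can be sketched as follows. Everything here is illustrative: `search` stands in for submitting a keyword to the form and returning result pages, keyword extraction is plain whitespace splitting, and pruning just drops keywords that return no results. A real system would use better extraction and ranking.

```python
def iterative_probing(search, seeds, max_rounds=5):
    """Grow a keyword set by probing a search box with keywords found so far."""
    keywords = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        new = set()
        for kw in sorted(frontier):
            for page in search(kw):
                # Extract candidate keywords from the result page (naive tokenizer).
                new.update(w for w in page.lower().split() if w not in keywords)
        if not new:
            break  # no more new candidate keywords
        keywords |= new
        frontier = new
    # Prune: keep only keywords that actually retrieve something (assumption).
    return [kw for kw in sorted(keywords) if search(kw)]

# Toy "deep Web site": three made-up records behind a keyword search box.
docs = ["protein antibody assay", "antibody pyrazole", "viscosity ozonizer"]

def toy_search(keyword):
    return [d for d in docs if keyword in d.split()]

selected = iterative_probing(toy_search, ["protein", "quilt"])
print(selected)
```

In this toy run the third record is never reached from the seed keywords, which illustrates why the choice of initial candidates affects coverage.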

Page 28: Chapter 15: Data Integration on the Web

Example: www.wipo.int

Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …

Page 29: Chapter 15: Data Integration on the Web

Outline

Introduction, opportunities and challenges with Web data
The Deep Web
  Vertical search
  Surfacing the Deep Web
Creating topical portals
Lightweight data management on the Web
  Discovery of data sets
  Extracting data from Web pages
  Combining multiple data sets
  Re-using others’ work

Page 30: Chapter 15: Data Integration on the Web

Topical Portals

An integrated view of a topic:
  E.g., all info about database researchers, or all info about coffees and their growing regions.
Topical portals find different aspects of the same objects on different sources:
  E.g., publications of a person may come from one source, while their job affiliations may come from another.
In contrast, vertical search integrates similar objects from multiple sources:
  E.g., job listings, apartments for rent, …

Page 31: Chapter 15: Data Integration on the Web

Topical Portal Example: Integrated Page for an Entity

Page 32: Chapter 15: Data Integration on the Web

Building a Topical Portal

Approach #1:
  Perform a focused crawl of the Web to find pages on the topic.
    Use word signatures as a method for determining the topic of a page.
  Use information extraction techniques to get the data out of the pages.
  Perform reference resolution and schema matching to create a cleaner set of data.

Page 33: Chapter 15: Data Integration on the Web

Creating a Topical Portal

Approach #2:
  Start with a set of well-known sites in the domain.
  Create an initial schema for the domain (the properties you’re interested in modeling).
  Create extractors for pages on the known sites.
    Note: extractors will be more accurate because they were created for the sites themselves.
  Result: a good basis of entities and relationships to build on.
  Extend the initial data set:
    Follow references from the initial set of chosen pages.
    Use collaboration (of people in the community) to find additional data and to correct extractions.

Page 34: Chapter 15: Data Integration on the Web

Outline

Introduction, opportunities and challenges with Web data
The Deep Web
  Vertical search
  Surfacing the Deep Web
Creating topical portals
Lightweight data management on the Web
  Discovery of data sets
  Extracting data from Web pages
  Combining multiple data sets
  Re-using others’ work

Page 35: Chapter 15: Data Integration on the Web

Lightweight Combination of Web Data

With such a vast collection of data, we would like to enable easy data integration.
  Imagine a school student combining her data about bird species with a country population table found on the Web.
  A journalist creating a news story with data about riots in the UK and needing to combine it with demographic data.
  …
Many data integration tasks are transient: the result will be used for a short period of time only.
  Hence, creating the integrated data must be easy.
  Creating a mediated schema and mappings is too tedious.

Page 36: Chapter 15: Data Integration on the Web

Challenges to Data Integration on the Web

Discovering data on the Web (search engines are optimized for documents, not tables or lists)

Extracting the data from the Web pages into a form that can be processed

Combining multiple data sets

Unique opportunities on the Web: re-use work of others!

Page 37: Chapter 15: Data Integration on the Web

Not a great result!

Page 38: Chapter 15: Data Integration on the Web

But the data does exist out there!

Page 39: Chapter 15: Data Integration on the Web

Discovering Data on the Web

Search engines are optimized for documents.
  E.g., proximity of terms matters in ranking. In tables, the schema applies to all rows: “zambia” is far from “population” in a document containing population data, but should be considered close.
  No special attention is given to schema rows (if they can be detected) or columns closer to the left of the table (that are often the “subject” of the table).
Tables with high-quality data look like ones that are used for formatting.
  Over 99% of the HTML tables on the Web are not high-quality data tables!

Page 40: Chapter 15: Data Integration on the Web

Challenges to Discovering the Semantics of Structured Data on the Web

Page 41: Chapter 15: Data Integration on the Web

Semantics Embedded in Surrounding Text

Topic of table is in the text, and the token “2006” is crucial to understanding the data.

Page 42: Chapter 15: Data Integration on the Web

No schema, but the table is easily understood by people.

Page 43: Chapter 15: Data Integration on the Web

Structured Data can be Plain Complicated!

Page 44: Chapter 15: Data Integration on the Web

HTML Tables used for Formatting

Page 45: Chapter 15: Data Integration on the Web

“Vertical” Tables: one tuple of a bigger table

Page 46: Chapter 15: Data Integration on the Web

Can’t Use Domain Knowledge: Data is about Everything

Example topics: tree search, Amish quilts, parking tickets in India, horses

Page 47: Chapter 15: Data Integration on the Web

Search by Tweaking Traditional Document Search

Consider new cues in ranking:
  Hits on left column
  Hits on schema (where there is one)
  Number of rows, columns
  Hits on table body
  Size of table relative to page

But we can do better: try to recover the underlying semantics of the data.
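A rough sketch of how some of the ranking cues could be combined into a score. The weights, the function name `table_score`, and the sample tables are assumptions for illustration only; the cues about table shape (number of rows and columns, size relative to the page) are left out for brevity.

```python
def table_score(query, header, rows, weights=(3.0, 2.0, 1.0)):
    """Score a table for a keyword query using header, left-column, and body hits.

    The weights (header, left column, body) are hypothetical values.
    """
    terms = set(query.lower().split())
    w_header, w_left, w_body = weights

    def hits(cells):
        # Number of query terms that appear in at least one of the cells.
        return sum(1 for t in terms if any(t in str(c).lower() for c in cells))

    header_hits = hits(header)
    left_hits = hits([r[0] for r in rows if r])
    body_hits = hits([c for r in rows for c in r[1:]])
    return w_header * header_hits + w_left * left_hits + w_body * body_hits

# Made-up tables: a data table and a layout table mentioning the same words.
data_table = (["Country", "Population"], [["Zambia", "x"], ["Kenya", "y"]])
layout_table = ([], [["Home", "Zambia travel"], ["News", "Population trends"]])

score_data = table_score("zambia population", *data_table)
score_layout = table_score("zambia population", *layout_table)
print(score_data, score_layout)
```

Here the data table wins because its hits fall on the schema row and the left ("subject") column, even though both tables contain both query terms.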

Page 48: Chapter 15: Data Integration on the Web

Recovering Table Semantics: cells of Web tables are mentioned in Web text

If we see these patterns enough times, we can infer that Green Ash is a North American species.

Page 49: Chapter 15: Data Integration on the Web

Recovering Table Semantics: cells of Web tables are mentioned in Web text

If we infer that a large fraction of the left column are North American tree species, we can infer that the table is about these tree species, even though that is not mentioned on the page!
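The inference described on these two slides can be sketched as a vote over column cells. This is an illustrative sketch: the `isa` dictionary stands in for class labels mined from patterns in Web text, and the function name and the 0.5 threshold are assumptions.

```python
from collections import Counter

def column_label(cells, isa, min_fraction=0.5):
    """Pick the class label supported by a large enough fraction of the cells."""
    votes = Counter()
    for cell in cells:
        for label in isa.get(cell.lower(), ()):
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count / len(cells) >= min_fraction else None

# Hypothetical facts, as if mined from text patterns on many Web pages.
isa = {
    "green ash": {"north american tree species"},
    "white oak": {"north american tree species"},
    "sugar maple": {"north american tree species"},
}
left_column = ["Green Ash", "White Oak", "Sugar Maple", "Unrecognized entry"]
label = column_label(left_column, isa)
print(label)
```

Not every cell has to be recognized: three of the four cells agree, which is enough to label the column and, by extension, the table.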

Page 50: Chapter 15: Data Integration on the Web

Extracting Data from the Page

In the case of tables, it’s fairly easy.
  Main challenge: decide if there is a row with attribute names.
Lists are tricky: punctuation and formatting do not always provide the right cues for partitioning a list element into cells.
Structured data in cards: in general, it’s an information extraction problem.
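For the table case, deciding whether there is a row with attribute names can be approximated with a type-based heuristic. The sketch below is one possible heuristic under assumptions (rectangular tables, at least one numeric column); it is not presented as the technique used by any particular system.

```python
def looks_numeric(cell):
    """True if the cell parses as a number, ignoring thousands separators."""
    try:
        float(str(cell).replace(",", ""))
        return True
    except ValueError:
        return False

def has_header_row(rows):
    """Guess that the first row holds attribute names.

    Heuristic: the first row has no numeric cells, while some column is
    numeric in every remaining row.
    """
    if len(rows) < 2:
        return False
    first, body = rows[0], rows[1:]
    if any(looks_numeric(c) for c in first):
        return False
    return any(
        all(i < len(r) and looks_numeric(r[i]) for r in body)
        for i in range(len(first))
    )

# Made-up tables for illustration.
with_header = [["Country", "Population"], ["A", "100"], ["B", "2,500"]]
without_header = [["A", "100"], ["B", "2,500"]]
print(has_header_row(with_header), has_header_row(without_header))
```

The heuristic gives no answer for all-text tables, which is one reason header detection remains the main challenge named above.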

Page 51: Chapter 15: Data Integration on the Web

Structured Data in Cards

Page 52: Chapter 15: Data Integration on the Web

Copy & Paste Approach: Extraction by Demonstration

Using the previous slide as an example:
  Start by copying “Four Barrel” into a column of a spreadsheet.
  The system tries to generalize and suggest other café names: Sightglass, Blue Bottle, Ritual.
  Next, the user copies the address of Four Barrel into the next column of the spreadsheet.
  The system generalizes… Etc.

Page 53: Chapter 15: Data Integration on the Web

Combining Multiple Data Sets

First, find related data sets. Depending on the context, you may be looking for:
  Data sets to join with (add new columns)
  Data sets to union with (add new rows)
Specifying the join:
  Again, by demonstration. Drag and drop a cell from one table into another.
Reference reconciliation is a big challenge:
  Use reference data such as Freebase?
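To see why reference reconciliation matters for the join, consider two tables that name the same entity differently. In this sketch the `aliases` dictionary is a hand-written stand-in for reference data, and the function names and table contents are hypothetical.

```python
def normalize(name, aliases):
    """Map a cell value to a canonical key: lowercase, drop periods, apply aliases."""
    key = " ".join(name.lower().replace(".", "").split())
    return aliases.get(key, key)

def join_tables(left, right, aliases):
    """Join two (key, value) tables on their reconciled keys, adding a column."""
    index = {normalize(k, aliases): v for k, v in right}
    joined = []
    for key, left_value in left:
        canonical = normalize(key, aliases)
        if canonical in index:
            joined.append((key, left_value, index[canonical]))
    return joined

# Made-up tables; values are placeholders, not real data.
riots = [("U.K.", "riots-data"), ("France", "other-data")]
demographics = [("United Kingdom", "demo-1"), ("Germany", "demo-2")]
aliases = {"uk": "united kingdom"}

joined = join_tables(riots, demographics, aliases)
print(joined)
```

Without the alias entry an exact-match join would return nothing for these tables, which is the gap that shared reference data is meant to close.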

Page 54: Chapter 15: Data Integration on the Web

Re-Using Work of Others

Most good data sets will get extracted more than once:
  Re-use the work done by other extractors.
Data cleaning can be a collaborative effort.
Data sets that get integrated often are probably high quality: leverage that signal.
With 200M tables on the Web, you can mine their schemas to find attribute synonyms and common schematic patterns.

Page 55: Chapter 15: Data Integration on the Web

Summary of Chapter 15

Structured data on the Web is an incredible collection of data.
  More is coming online because organizations and governments are being encouraged to publish data.
Data comes with little or no semantics.
  Huge challenge when you try to make sense of it.
Key emphasis: create data management tools that anyone can use.
  Data is no longer just for database experts!