chapter 15: data integration on the web

ANHAI DOAN ALON HALEVY ZACHARY IVES

Chapter 15: Data Integration on the Web

PRINCIPLES OF

DATA INTEGRATION

Outline

Introduction, opportunities and challenges with Web data
The Deep Web
  Vertical search
  Surfacing the Deep Web
Creating topical portals
Lightweight data management on the Web
  Discovery of data sets
  Extracting data from Web pages
  Combining multiple data sets
  Re-using others’ work

Broad Range of Data on the Web

Key Characteristics

Scale and heterogeneity
  Data is about everything!
  Overlapping sources, varying levels of quality.
  Multiple formats (tables, lists, cards, etc.)
Data is laid out for visual appeal
  Extracting the data is very tricky!
Semantics of the data are rarely specified and need to be inferred from text and other clues.

Different Forms of Structured Data on the Web

Tables: hundreds of millions of good ones

Databases Behind Forms: The Deep/Invisible Web

Examples: store locations, used cars, radio stations, patents, recipes

Tens of millions of high-quality forms

HTML Lists

Every list item is a row in a table, but figuring out cell boundaries is very tricky.

Structured data embedded more loosely in pages. Extraction is very tricky!

What Can we do with Structured Web Data?

Integrate: Imagine integrating your data with any data on the Web!
  Insights come when independently developed data sets come together (of course, you can also get garbage that way, so you need to be careful).
Improve web search
  Find tables & lists when they’re relevant to queries
  Answer fact-seeking queries with facts rather than links to Web pages.
  Aggregate: answer “total GDP of 10 largest countries” by putting together facts from multiple pages

Bigger Vision: create an ecosystem of structured data on the Web

[Figure labels: Discover via search; Extract from Web sources; Manage, analyze, visualize, integrate, create compelling stories; Publish back to the Web]

What is the Deep Web?

Content hidden behind HTML forms, not accessible to search engines.

The Deep Web

The collection of databases that are accessed by users entering values into HTML forms.

Search engine crawlers cannot fill in the forms, so the content is invisible to the search engine.

The work on the Deep Web illustrates many of the challenges of managing Web data.

Two Approaches to the Deep Web

Build a vertical search engine:
  Apply all the data integration techniques we’ve learned so far to a set of data sources such as job sites, airplane reservations, etc.
  The approach is applicable to domains that have thousands of form sites.
Surface the content:
  Try to guess good queries to pose to the forms. Insert the resulting HTML pages into the Web index.
  The approach covers the long tail of content on the Web.

Approach #1: Vertical Search: Data Integration

Vertical Search as Data Integration

Mediated schema: the properties of the domain that need to be exposed to the user.
  If you include too many attributes in the mediated schema, you may not be able to query them on many sources.
Source descriptions: relatively simple. Sources are often distinguished by their geographical coverage.
Wrappers:
  Parsing the answers from the resulting HTML is the tricky part.
  Alternate approach: don’t parse the answers. Just show the user the returned web pages.

Deep Web: the Long Tail

[Figure labels, examples of long-tail form sites: Tree Search, Amish quilts, Parking tickets in India, Horses]

The Surfacing Approach

Crawl & indexing time
  Pre-compute interesting form submissions
  Insert resulting pages into the Web index
Query time: nothing!
  Deep Web URLs in the index are like any other URL
Advantages
  Reuse existing search engine infrastructure
  Reduced load on target web sites: users click only on what they deem relevant.
Approach taken at Google for the long tail.

Surfacing Challenges

1. Predicting the correct input combinations
  Generating all possible URLs is wasteful and unnecessary
  Cars.com has ~500K listings, but 250M possible queries
2. Predicting the appropriate values for text inputs
  Valid input values are required for retrieving data
  Ingredients in recipes.com and zipcodes in borderstores.com
3. Don’t do anything bad!
4. Coverage of the crawl: don’t try to cover sites in their entirety, it’s not necessary.
  1. Once you get part of the content, there will be links to the rest
  2. It’s enough to have part of the content in the index to send the site relevant traffic.

Form Processing 101

HTML forms are submitted with either GET or POST.
Only GET forms can be surfaced, because a GET submission encodes the input values in the URL itself.

<form action="http://www.borders.com/locator" method="GET">
  <select name="store"><option …/>… </select>
  …
  <input name="zip" type="text"/>
  <input name="search" type="submit" value="Go"/>
  <input name="site" type="hidden" value="homepage"/>
</form>

On submit, the browser requests the URL:
http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
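The mapping from a GET form to its submission URL can be sketched in a few lines of Python. This is only an illustration: the helper name form_to_url is hypothetical, and the input values are the ones visible in the example URL on this slide.

```python
from urllib.parse import urlencode


def form_to_url(action: str, inputs: dict) -> str:
    """Build the URL requested when a GET form is submitted.

    `inputs` maps each form input name to the value chosen for it.
    urlencode keeps empty inputs as "name=" and percent-encodes values.
    """
    return f"{action}?{urlencode(inputs)}"


# Values taken from the example URL on the slide.
inputs = {
    "store": "All",
    "city": "",
    "state": "",
    "zip": "94043",
    "within": "25",
    "search": "Go",
    "site": "homepage",
}
url = form_to_url("http://www.borders.com/locator", inputs)
print(url)
```

Because every choice of input values yields an ordinary URL, a surfacing crawler can generate such URLs offline and hand them to the regular crawl pipeline.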

Google's Deep-Web Crawl (VLDB 2008)

Predicting Input Combinations
  Forms can have multiple inputs
  Generating all possible URLs is wasteful! … and unnecessary!
Goal: minimize URLs while maximizing retrieval!
Other considerations
  Generated URLs must be good candidates for the index
  Only need URLs sufficient to drive traffic
  Only need URLs sufficient to seed the web crawler
Solution: discover only informative input combinations.

Informative Form Fields

Varying the state field:
http://jobs.shrm.org/search?state=All&kw=&type=All
http://jobs.shrm.org/search?state=AL&kw=&type=All
http://jobs.shrm.org/search?state=AK&kw=&type=All
…
http://jobs.shrm.org/search?state=WV&kw=&type=All
Result pages are different: the field is informative.

Varying the type field:
http://jobs.shrm.org/search?state=All&kw=&type=ALL
http://jobs.shrm.org/search?state=All&kw=&type=ANY
http://jobs.shrm.org/search?state=All&kw=&type=EXACT
Result pages are similar: the field is uninformative.

Varying the state results in qualitatively different content, and hence it is an informative field.
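One way to make "qualitatively different content" testable is to compare content signatures of the result pages. The sketch below is an assumption-laden illustration, not the method used by any particular crawler: the signature (a hash over the set of words), the 0.5 threshold, and the toy pages are all made up for the example.

```python
import hashlib
import re


def signature(page_text: str) -> str:
    """Crude content signature: hash of the sorted set of lowercase words."""
    words = sorted(set(re.findall(r"[a-z0-9]+", page_text.lower())))
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def is_informative(result_pages: list, threshold: float = 0.5) -> bool:
    """Treat a field as informative if varying it yields mostly distinct pages.

    `threshold` is an arbitrary illustrative cut-off on the fraction of
    distinct signatures among the submissions.
    """
    if not result_pages:
        return False
    distinct = len({signature(page) for page in result_pages})
    return distinct / len(result_pages) > threshold


# Toy result pages (invented for the example).
state_pages = [
    "Jobs in Alabama: welder, nurse",
    "Jobs in Alaska: pilot",
    "Jobs in West Virginia: miner",
]
type_pages = ["All jobs: welder, nurse, pilot, miner"] * 3

print(is_informative(state_pages), is_informative(type_pages))
```

A real system would need a signature that ignores boilerplate such as ads and timestamps; the word-set hash here is only the simplest stand-in.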

Computing Informative Field Combinations

Informative field combinations can be computed bottom up:
  Begin with single fields and find which ones are informative.
  For every informative combination, try to add another field and check if the resulting combination is still informative.
In practice, we rarely need combinations of more than 3 fields.
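The bottom-up computation can be sketched as a level-by-level search. In this sketch the informativeness check is passed in as a function, since in practice it would require submitting the form and inspecting result pages; the function name, the toy oracle, and the field names are assumptions for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence


def informative_combinations(
    fields: Sequence[str],
    is_informative: Callable[[FrozenSet[str]], bool],
    max_size: int = 3,
) -> List[FrozenSet[str]]:
    """Grow informative field combinations one field at a time."""
    # Level 1: single fields that are informative on their own.
    frontier = [frozenset([f]) for f in fields if is_informative(frozenset([f]))]
    found = list(frontier)
    seen = set(frontier)
    for _ in range(max_size - 1):
        next_frontier = []
        for combo in frontier:
            for f in fields:
                candidate = combo | {f}
                if f in combo or candidate in seen:
                    continue
                seen.add(candidate)  # test each candidate only once
                if is_informative(candidate):
                    next_frontier.append(candidate)
        found.extend(next_frontier)
        frontier = next_frontier
    return found


# Toy oracle (assumed): which combinations yield distinct result pages.
INFORMATIVE = {
    frozenset(["state"]),
    frozenset(["kw"]),
    frozenset(["state", "kw"]),
}
result = informative_combinations(
    ["state", "kw", "type"], lambda combo: combo in INFORMATIVE
)
print([sorted(c) for c in result])
```

Only extensions of combinations already found informative are ever tested, which is what keeps the number of form submissions far below the number of possible input combinations.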


Challenge 2: Generic and Typed Text boxes

Generic Search Boxes
  Accept any keywords
  Challenge: selecting the most appropriate values
Typed Text Boxes
  Accept only values belonging to specific types, e.g., zipcodes
  Challenge: selecting the type of the input


Example: www.wipo.int

Input Values for Generic Search: Iterative Probing for Search Boxes

1. Select an initial list of candidate keywords
2. Download pages based on the current set of keywords
3. Extract more candidate keywords from result pages
4. Refine the current set of keywords
5. Repeat until there are no more new candidate keywords
6. Prune the list of candidate keywords
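The probing loop can be sketched as follows. Everything concrete here is an assumption for illustration: fetch_results stands in for submitting a keyword to the search box, candidate keywords are simply words of four or more letters, and pruning keeps the keywords that appear on the most result pages.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List


def iterative_probing(
    seed_keywords: Iterable[str],
    fetch_results: Callable[[str], List[str]],
    max_rounds: int = 5,
    max_keywords: int = 50,
) -> List[str]:
    """Expand a keyword list by probing a search box with its own results."""
    keywords = set(seed_keywords)
    tried = set()
    page_counts = Counter()  # number of result pages each word appeared on
    for _ in range(max_rounds):
        untried = keywords - tried
        if not untried:  # no new candidate keywords: stop
            break
        for kw in sorted(untried):
            tried.add(kw)
            for page in fetch_results(kw):
                page_counts.update(set(re.findall(r"[a-z]{4,}", page.lower())))
        keywords |= set(page_counts)
    # Prune: keep the keywords seen on the most result pages.
    return [kw for kw, _ in page_counts.most_common(max_keywords)]


# Toy stand-in for a form-backed database (invented for the example).
corpus = [
    "protein antibody assay",
    "antibody pyrazole compound",
    "metalworking lathe",
]


def fetch_results(keyword: str) -> List[str]:
    return [doc for doc in corpus if keyword in doc.split()]


probed = iterative_probing(["protein"], fetch_results)
print(probed)
```

In the toy corpus the third document shares no words with the others, so probing from "protein" never reaches it; this is why the choice of initial keywords matters.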

Example: www.wipo.int

Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …

Topical Portals

An integrated view of a topic:
  E.g., all info about database researchers, or all info about coffee and its growing regions.
Topical portals find different aspects of the same objects on different sources
  E.g., publications of a person may come from one source, while their job affiliations may come from another
In contrast, vertical search integrates similar objects from multiple sources
  E.g., job listings, apartments for rent, …

Topical Portal Example: Integrated Page for an Entity

Building a Topical Portal

Approach #1:
  Perform a focused crawl of the Web to find pages on the topic
    Use word signatures as a method for determining the topic of a page.
  Use information extraction techniques to get the data out of the pages.
  Perform reference resolution and schema matching to create a cleaner set of data.
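A word signature can be as simple as a bag of words that is typical for the topic, with pages scored by cosine similarity against it. The sketch below assumes that representation; the signature words, the sample pages, and the 0.2 threshold are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def on_topic(page_text: str, topic_signature: Counter, threshold: float = 0.2) -> bool:
    """Decide whether a crawled page is close enough to the topic signature."""
    return cosine(bag_of_words(page_text), topic_signature) >= threshold


# Assumed signature for a "database research" portal.
db_signature = bag_of_words("database query index transaction schema sql")
page_a = "We study query optimization and index selection in a database"
page_b = "Coffee beans grow in tropical regions"
print(on_topic(page_a, db_signature), on_topic(page_b, db_signature))
```

A focused crawler would follow links only from pages that pass this test, so the quality of the signature directly controls how far the crawl drifts off topic.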

Creating a Topical Portal

Approach #2:
  Start with a set of well-known sites in the domain
  Create an initial schema for the domain (the properties you’re interested in modeling)
  Create extractors for pages on the known sites
    Note: extractors will be more accurate because they were created for the sites themselves
  Result: a good basis of entities and relationships to build on.
  Extend the initial data set:
    Follow references from the initial set of chosen pages
    Use collaboration (of people in the community) to find additional data and to correct extractions.

Lightweight Combination of Web Data

With such a vast collection of data, we would like to enable easy data integration.
  Imagine a school student combining her data about bird species with a country population table found on the Web
  A journalist creating a news story with data about riots in the UK and needing to combine it with demographic data
  …
Many data integration tasks are transient: the result will be used for a short period of time only
  Hence, creating the integrated data must be easy.
  Creating a mediated schema and mappings is too tedious.

Challenges to Data Integration on the Web

Discovering data on the Web (search engines are optimized for documents, not tables or lists)

Extracting the data from the Web pages into a form that can be processed

Combining multiple data sets

Unique opportunities on the Web: re-use work of others!

Not a great result!

But the data does exist out there!

Discovering Data on the Web

Search engines are optimized for documents
  E.g., proximity of terms matters in ranking. In tables, the schema applies to all rows. “zambia” is far from “population” in a document containing population data, but should be considered close.
  No special attention is given to schema rows (if they can be detected) or columns closer to the left of the table (that are often the “subject” of the table).
Tables with high-quality data look like ones that are used for formatting.
  Over 99% of the HTML tables on the Web are not high-quality data tables!

Challenges to Discovering the Semantics of Structured Data on the Web

Semantics Embedded in Surrounding Text

Topic of table is in the text, and the token “2006” is crucial to understanding the data.

No schema, but the table is easily understood by people.

Structured Data can be Plain Complicated!

HTML Tables used for Formatting

“Vertical” Tables: one tuple of a bigger table

Can’t Use Domain Knowledge: Data is about Everything

[Figure labels, examples of the range of topics: Tree Search, Amish quilts, Parking tickets in India, Horses]

Search by Tweaking Traditional Document Search

Consider new cues in ranking:
  Hits on left column
  Hits on schema (where there is one)
  Number of rows, columns
  Hits on table body
  Size of table relative to page
But we can do better: try to recover the underlying semantics of the data.
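A table-aware ranking function might weight a query hit by where in the table it occurs. The sketch below covers only the three hit-based cues (left column, schema, body) and omits table size and shape; the weights are arbitrary illustrative values, and the tables are toy data.

```python
from typing import Iterable, List, Sequence

# Assumed weights: a hit in the subject (left) column counts most.
WEIGHTS = {"left": 3.0, "schema": 2.0, "body": 1.0}


def count_hits(cells: Iterable[str], terms: Sequence[str]) -> int:
    """Number of (cell, term) pairs where the term occurs in the cell."""
    return sum(1 for cell in cells for term in terms if term in cell.lower())


def table_score(query: str, header: List[str], rows: List[List[str]]) -> float:
    """Score a table for a keyword query using position-sensitive cues."""
    terms = query.lower().split()
    left = count_hits((row[0] for row in rows if row), terms)
    schema = count_hits(header, terms)
    body = count_hits((cell for row in rows for cell in row[1:]), terms)
    return WEIGHTS["left"] * left + WEIGHTS["schema"] * schema + WEIGHTS["body"] * body


# Toy tables; the numeric cells are placeholders, not real statistics.
data_table = (["Country", "Population"], [["Zambia", "1"], ["Zimbabwe", "2"]])
prose_table = (["Name", "Notes"], [["Trip", "Visited Zambia, near population centers"]])

score_data = table_score("zambia population", *data_table)
score_prose = table_score("zambia population", *prose_table)
print(score_data, score_prose)
```

In the data table "zambia" and "population" never share a cell, yet the table scores higher because the hits fall in the left column and the schema row, which is the behavior the cues are meant to capture.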

Recovering Table Semantics: cells of Web tables are mentioned in Web text

If we see these patterns enough times, we can infer that Green Ash is a North American species.

If we infer that a large fraction of the cells in the left column are North American tree species, we can infer that the table is about these tree species, even though that is not mentioned on the page!
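The inference from cells to a column label can be sketched as a vote. The sketch assumes an isa mapping from cell values to class labels that has already been mined from Web text; the mapping contents, the function name, and the 0.5 cut-off are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def infer_column_label(
    cells: List[str],
    isa: Dict[str, Set[str]],
    min_fraction: float = 0.5,
) -> Optional[str]:
    """Return the class label supported by a large fraction of the cells."""
    votes = Counter()
    for cell in cells:
        for label in isa.get(cell.lower(), ()):
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count / len(cells) >= min_fraction else None


# Assumed output of mining text patterns; values are illustrative only.
isa = {
    "green ash": {"north american tree species"},
    "white oak": {"north american tree species", "wood"},
    "red maple": {"north american tree species"},
}
left_column = ["Green Ash", "White Oak", "Red Maple", "Unknownia"]
label = infer_column_label(left_column, isa)
print(label)
```

Not every cell needs to be recognized: three of the four cells agree on a label, which is enough to label the column even though one value is unknown.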

Extracting Data from the Page

In the case of tables, it’s fairly easy
  Main challenge: decide if there is a row with attribute names
Lists are tricky: punctuation and formatting do not always provide the right cues for partitioning a list element into cells.
Structured data in cards: in general, it’s an information extraction problem.
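For tables, the question of whether the first row holds attribute names can be approached with a simple type-based heuristic: a header row tends to be textual even when the column beneath it is numeric. The sketch below is one such heuristic under that assumption; it will miss tables whose columns are all textual.

```python
import re
from typing import List

_NUMBER = re.compile(r"[-+$]?\d[\d.,]*%?")


def looks_numeric(cell: str) -> bool:
    """True for cells such as "100", "2,500", "-3.5%" or "$40"."""
    return _NUMBER.fullmatch(cell.strip()) is not None


def has_header_row(rows: List[List[str]]) -> bool:
    """Guess whether rows[0] contains attribute names."""
    if len(rows) < 2:
        return False
    first, body = rows[0], rows[1:]
    if any(looks_numeric(cell) for cell in first):
        return False  # attribute names are rarely numbers
    for col in range(len(first)):
        values = [row[col] for row in body if col < len(row)]
        numeric = sum(looks_numeric(v) for v in values)
        # A mostly numeric column under a textual first cell suggests a header.
        if values and numeric / len(values) > 0.5:
            return True
    return False


with_header = [["Country", "Population"], ["A", "100"], ["B", "2,500"]]
without_header = [["A", "100"], ["B", "2,500"]]
print(has_header_row(with_header), has_header_row(without_header))
```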

Structured Data in Cards

Copy & Paste Approach: Extraction by Demonstration

Using the previous slide as an example:
  Start by copying “Four Barrel” into a column of a spreadsheet.
  System tries to generalize and suggest other café names: Sightglass, Blue Bottle, Ritual.
  Next, the user copies the address of Four Barrel into the next column of the spreadsheet
  System generalizes…
  Etc.

Combining Multiple Data Sets

First, find related data sets. Depending on the context, you may be looking for:
  Data sets to join with (add new columns)
  Data sets to union with (add new rows)
Specifying the join:
  Again, by demonstration. Drag and drop a cell from one table into another.
  Reference reconciliation is a big challenge:
    Use reference data such as Freebase?
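A join that adds new columns can be sketched with a very light form of reference reconciliation: normalizing the join keys before matching. The normalization rule, the function names, and the placeholder cell values below are assumptions; real reconciliation (for example against reference data) needs far more than case and punctuation folding.

```python
import re
from typing import Dict, List


def normalize(name: str) -> str:
    """Fold case and punctuation so near-identical names compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def join_tables(
    left: List[Dict[str, str]],
    right: List[Dict[str, str]],
    left_key: str,
    right_key: str,
) -> List[Dict[str, str]]:
    """Inner join on normalized keys; columns of `right` are added to `left`."""
    index = {normalize(row[right_key]): row for row in right}
    joined = []
    for row in left:
        match = index.get(normalize(row[left_key]))
        if match is not None:
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != right_key})
            joined.append(merged)
    return joined


# Placeholder values; only the café names come from the slides.
cafes = [
    {"cafe": "Four Barrel", "rating": "r1"},
    {"cafe": "Blue Bottle", "rating": "r2"},
    {"cafe": "Ritual", "rating": "r3"},
]
addresses = [
    {"name": "FOUR BARREL", "address": "addr-1"},
    {"name": "Blue-Bottle", "address": "addr-2"},
]
joined = join_tables(cafes, addresses, "cafe", "name")
print(joined)
```

"Ritual" has no counterpart in the second table and drops out of the inner join; a union (adding new rows) would instead require aligning the two schemas.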

Re-Using Work of Others

Most good data sets will get extracted more than once:
  Re-use the work done by other extractors
Data cleaning can be a collaborative effort
Data sets that get integrated often are probably high quality; leverage that signal
With 200M tables on the Web, you can mine their schemas to find attribute synonyms and common schematic patterns.

Summary of Chapter 15

Structured data on the Web is an incredible collection of data
  More is coming online because organizations and governments are being encouraged to publish data
Data comes with little or no semantics
  Huge challenge when you try to make sense of it
Key emphasis: create data management tools that anyone can use
  Data is no longer just for database experts!
