Deep-Web Crawling and Related Work
Matt Honeycutt, CSC 6400
TRANSCRIPT
Outline
• Basic background information
• Google’s Deep-Web Crawl
• Web Data Extraction Based on Partial Tree Alignment
• Bootstrapping Information Extraction from Semi-structured Web Pages
• Crawling Web Pages with Support for Client-Side Dynamism
• DeepBot: A Focused Crawler for Accessing Hidden Web Content
Background
• Publicly-Indexable Web (PIW)
– Web pages exposed by standard search engines
– Pages link to one another
• Deep-web
– Content behind HTML forms
– Database records
– Estimated to be much larger than PIW
– Estimated to be of higher quality than PIW
Summary
• Describes process implemented by Google
• Goal is to ‘surface’ content for indexing
• Contributions:
– Informativeness test
– Query selection techniques and algorithm for generating appropriate text inputs
About the Google Crawler
• Estimates that there are ~10 million high-quality HTML forms
• Index representative deep-web content across many forms, driving search traffic to the deep-web
• Two problems:
– Which inputs to fill in?
– What values to use?
Query Templates
• Correspond to SQL-like queries: select * from D where P
• First problem is to select the best templates
• Second problem is to select the best values for those templates
• Want to ignore presentation-related fields
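A query template can be thought of as the subset of a form's inputs that will be assigned values (its binding inputs), with the remaining inputs left at their defaults. The sketch below, with invented field names and helper names, shows one way a template could be expanded into concrete form submissions:

```python
# Hypothetical sketch: a template is a tuple of binding input names; every
# combination of candidate values for those inputs yields one submission,
# analogous to one 'select * from D where P' query.
from itertools import product

def generate_submissions(template, candidate_values):
    """Enumerate form submissions for one query template.

    template: tuple of binding input names, e.g. ("make", "zip")
    candidate_values: dict mapping input name -> list of values to try
    """
    value_lists = [candidate_values[name] for name in template]
    for combo in product(*value_lists):
        yield dict(zip(template, combo))

# Example: a 2-dimensional template over a (hypothetical) used-car form.
for query in generate_submissions(("make", "zip"),
                                  {"make": ["honda", "ford"],
                                   "zip": ["10001", "94103"]}):
    print(query)   # e.g. {'make': 'honda', 'zip': '10001'}
```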
Incremental Search for Informative Query Templates
• Classify templates as either informative or uninformative
• Template is informative if it generates pages sufficiently distinct from those of other templates
• Build more complex templates from simpler informative ones
• Signatures computed for each page
Informativeness Test
• T is informative if the pages it generates are sufficiently distinct, i.e. |S| / |G| ≥ τ, where G is the set of pages generated by submissions of T and S is the set of distinct page signatures among them
• Heuristically limit to templates with 10,000 or fewer possible submissions and no more than 3 dimensions
• Can estimate informativeness using a sample of possible queries (e.g., 200)
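A minimal sketch of how this informativeness estimate might look in code, assuming helper functions for fetching a submission's result page and computing its content signature; the 200-submission sample matches the slide, while the 0.25 threshold is an illustrative assumption:

```python
import random

def is_informative(submissions, fetch_page, signature,
                   sample_size=200, tau=0.25):
    """Estimate whether a query template is informative.

    submissions: list of dicts, one per possible form submission of the
    template. fetch_page and signature are assumed helpers (an HTTP fetch
    and a content fingerprint that ignores presentation-related variation).
    The template counts as informative when its distinct page signatures
    cover at least a tau fraction of the sampled submissions.
    """
    sample = submissions
    if len(submissions) > sample_size:
        sample = random.sample(submissions, sample_size)
    distinct = {signature(fetch_page(query)) for query in sample}
    return len(distinct) / max(len(sample), 1) >= tau
```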
Observations
• URLs generated for larger templates are not as useful
• ISIT generates far fewer URLs than the full Cartesian product (CP) of inputs, but still achieves high coverage
• Most common reason for failing to find an informative template: JavaScript
– Ignoring JavaScript errors, informative templates were found for 80% of forms tested
Generating Input Values
• Text boxes may be typed or untyped
• Special rules for a small number of common typed inputs
• Can’t use generic keyword lists; the best keywords are site-specific
• Select seed keywords from form, then iterate and select candidate keywords from results using TF-IDF
• Results are clustered and representative keywords are chosen for each cluster, ranked by page length
• Once candidate keywords have been selected, treat text inputs as select inputs
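One round of the keyword-selection loop can be sketched as follows; tokenization and result clustering are assumed to happen elsewhere, and the function name and exact scoring are simplifications rather than the paper's precise formulation:

```python
import math
from collections import Counter

def tfidf_candidates(result_pages, top_k=25):
    """Pick candidate keywords from result pages by TF-IDF.

    result_pages: list of token lists, one per page retrieved with the
    current seed keywords. Returns the top_k highest-scoring terms, which
    would be fed back in as the next round of probe queries.
    """
    doc_freq = Counter()
    for tokens in result_pages:
        doc_freq.update(set(tokens))
    n_docs = len(result_pages)
    scores = Counter()
    for tokens in result_pages:
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / doc_freq[term])
            scores[term] = max(scores[term], (count / len(tokens)) * idf)
    return [term for term, _ in scores.most_common(top_k)]

pages = [["deep", "web", "crawl", "form"], ["deep", "web", "query", "index"]]
print(tfidf_candidates(pages, top_k=3))   # terms unique to one page rank highest
```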
Conclusions
• Describes the innovations of “the first large-scale deep-web surfacing system”
• Results are already integrated into Google
• Informativeness test is a useful building block
• No need to cover individual sites completely
• Heuristics for common input types are useful
• Future work: support for JavaScript and handling dependencies between inputs
• Limitation: only supports GET requests
Summary
• Novel technique for extracting data from record lists: DEPTA (Data Extraction based on Partial Tree Alignment)
• Automatically identifies records and aligns their fields
• Overcomes limitations of existing techniques
Approach
• Step 1: Build tag tree
• Step 2: Segment page to identify data regions
• Step 3: Identify data records within the regions
• Step 4: Align records to identify fields
• Step 5: Extract fields into common table
Building the Tag Tree and Finding Data Regions
• Computes bounding regions for each element
• Associate items to parents based on containment to build tag tree
• Next, compare tag strings with edit distance to find data regions
• Finally, identify records within regions
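The region-finding comparison can be sketched roughly as below: sibling subtrees are flattened into tag strings and grouped into a data region when their normalized edit distance is small. The 0.3 threshold is an assumption for illustration:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def same_region(tag_string_a, tag_string_b, threshold=0.3):
    """Treat two sibling subtrees as belonging to the same data region when
    the normalized edit distance of their tag strings is small."""
    dist = edit_distance(tag_string_a, tag_string_b)
    return dist / max(len(tag_string_a), len(tag_string_b), 1) <= threshold

# Example: two product rows with nearly identical markup.
row1 = ["tr", "td", "img", "td", "a", "td", "b"]
row2 = ["tr", "td", "img", "td", "a", "td", "i"]
print(same_region(row1, row2))   # True
```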
Partial Tree Alignment
• Tree matching is expensive
• Simple Tree Matching – faster, but not as accurate
• Longest record tree becomes seed
• Fields that don’t match are added to seed
• Finally, field values extracted and inserted into table
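Simple Tree Matching, mentioned above, computes the size of the maximum matching between two ordered, labeled trees by dynamic programming over child sequences; partial tree alignment uses this kind of comparison when growing the seed tree. The node representation below is an assumed minimal one:

```python
class Node:
    """Minimal tree node: a tag name plus ordered children."""
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def simple_tree_matching(a, b):
    """Size of the maximum matching between two ordered, labeled trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # Dynamic program over the two ordered child sequences.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]))
    return 1 + table[m][n]

# Example: two record subtrees that share most of their structure.
seed = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("b")])])
rec  = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("i")])])
print(simple_tree_matching(seed, rec))   # 4 matched nodes: tr, td, a, td
```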
Conclusions
• Surpasses previous work (MDR)
• Capable of extracting data very accurately– Recall: 98.18%– Precision: 99.68%
Summary
• Method for extracting structured records from web pages
• Method requires very little training and achieves good results in two domains
Introduction
• Extracting structured fields enables advanced information retrieval scenarios
• Much previous work has been site-specific or required substantial manual labeling
• Heuristic-based approaches have not had great success
• Uses semi-supervised learning to extract fields from web pages
• User only has to label 2-5 pages for each of 4-6 sites
Technical Approach
• Human specifies domain schema
• Labels training records from representative sites
• Utilizes partial tree alignment to acquire additional records for each site
• New records are automatically labeled
• Learns regression model that predicts mappings from fields to schema columns
Mapping Fields to Columns
• Calculate score between each field and column
• Score based on field contexts and contexts observed in training
• Most probable mapping above a threshold is accepted
Feature Types
• Precontext 3-grams
• Lowercase value tokens
• Lowercase value 3-grams
• Value token type categories
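A sketch of extracting these four feature types for a single extracted field. Tokenization is assumed to happen elsewhere, and treating the 3-grams as character-level (rather than token-level) is an assumption of this sketch:

```python
def char_ngrams(text, n=3):
    """Character 3-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def token_type(token):
    """Coarse token type categories (an assumed category set)."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        return "LOWER" if token.islower() else "CAPITALIZED"
    return "OTHER"

def field_features(value_tokens, precontext_tokens):
    """Collect the four feature types listed above for one field:
    precontext 3-grams, lowercase value tokens, lowercase value 3-grams,
    and value token type categories."""
    features = []
    features += ["pre3:" + g for tok in precontext_tokens for g in char_ngrams(tok)]
    features += ["tok:" + tok.lower() for tok in value_tokens]
    features += ["val3:" + g for tok in value_tokens for g in char_ngrams(tok)]
    features += ["type:" + token_type(tok) for tok in value_tokens]
    return features

print(field_features(["Canon", "EOS"], ["Model", ":"]))
```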
Scoring
• Field mappings based on comparing feature distributions
– Distribution computed from training contexts
– Distribution computed from observed contexts
• Completely dissimilar field/column pairs are fully divergent
– Exact field/column pairs have no divergence
• Feature similarities combined using “stacked” linear regression model
• Weights for the model are learned in training
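The scoring idea can be sketched as: compare the training-context and observed-context feature distributions with a divergence measure (Jensen-Shannon is an assumed stand-in), turn each feature type's divergence into a similarity, and combine the similarities with the learned linear weights:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two feature distributions
    (dicts mapping feature -> probability): 0 for identical distributions,
    1 for completely dissimilar ones."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def mapping_score(similarities, weights, bias=0.0):
    """Combine per-feature-type similarities (1 - divergence) with a linear
    model whose weights would be learned during training."""
    return bias + sum(w * s for w, s in zip(weights, similarities))

# Example with invented numbers for the four feature types.
p = {"pre3:pri": 0.5, "tok:$": 0.5}
q = {"pre3:pri": 0.4, "tok:$": 0.4, "tok:usd": 0.2}
print(1 - js_divergence(p, q))                 # similarity for one feature type
print(mapping_score([0.9, 0.7, 0.8, 0.95],
                    [0.3, 0.2, 0.2, 0.3]))     # combined mapping score
```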
Crawling Web Pages with Support for Client-Side Dynamism
Manuel Álvarez, Alberto Pan, Juan Raposo, Justo Hidalgo
Summary
• Advanced crawler based on browser automation
• NSEQL: a language for specifying browser actions
• Stores each URL along with the navigation route needed to reach it
Limitations of Typical Crawlers
• Built on low-level HTTP APIs
• Limited or no support for client-side scripts
• Limited support for sessions
• Can only see what’s in the HTML
Their Crawler’s Features
• Built on “mini web browsers” – MSIE Browser Control
• Handles client-side JavaScript
• Routes fully support sessions
• Limited form-handling capabilities
Identifying New Routes
• Routes can come from links, forms, and JavaScript
• ‘href’ attributes extracted from normal anchor tags
• Tags with JavaScript click events are identified and “clicked”
• Captures actions and inspects them
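The route-discovery step can be approximated statically with Python's standard-library HTML parser, collecting href attributes and elements that carry JavaScript click handlers; the actual crawler instead "clicks" such elements inside its embedded browser, so this is only an illustration of what it looks for:

```python
from html.parser import HTMLParser

class RouteExtractor(HTMLParser):
    """Collect candidate routes: plain hrefs plus elements with an inline
    JavaScript click handler."""
    def __init__(self):
        super().__init__()
        self.hrefs, self.click_targets = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])
        if "onclick" in attrs:
            self.click_targets.append((tag, attrs["onclick"]))

extractor = RouteExtractor()
extractor.feed('<a href="/results?page=2">next</a>'
               '<span onclick="loadPage(3)">3</span>')
print(extractor.hrefs)          # ['/results?page=2']
print(extractor.click_targets)  # [('span', 'loadPage(3)')]
```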
Results and Conclusions
• Large scale websites are crawler-friendly
• Many medium-scale, deep-web sites aren’t
• Crawlers should handle client-side script
• Presented crawler has been applied to real-world applications
Summary
• Presents a focused deep-web crawler
• Extension of previous work
• Crawls links and handles search forms
Domain Definitions
• Attributes a1…aN
• Each attribute has name, aliases, specificity index
• Queries q1…qN
• Each query contains 1 or more (attribute,value) pairs
• Relevance threshold
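A sketch of how such a domain definition might be represented as data; the class and field names are assumptions and the example values are purely illustrative:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Attribute:
    """One domain attribute: a canonical name, alternative labels that may
    appear next to form fields, and a specificity index indicating how
    strongly the attribute identifies the target domain."""
    name: str
    aliases: List[str] = field(default_factory=list)
    specificity: float = 1.0

@dataclass
class DomainDefinition:
    attributes: List[Attribute]
    queries: List[List[Tuple[str, str]]]   # each query: (attribute, value) pairs
    relevance_threshold: float

# Illustrative definition for a book-shopping task.
books = DomainDefinition(
    attributes=[Attribute("title", ["book title"], 0.8),
                Attribute("author", ["written by"], 1.0),
                Attribute("isbn", [], 1.0)],
    queries=[[("title", "dune"), ("author", "herbert")]],
    relevance_threshold=0.6)
```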
Evaluating Forms
• Obtains bounding coordinates of all form fields and potential labels
• Distances and angles computed between fields and labels
Evaluating Forms
• If label l is within min-distance of field f, l is added to f’s list
– Ties are broken using angle
• Lists are pruned so that labels appear in only one list and all fields have at least one possible label
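The label-assignment geometry can be sketched as below: each candidate label is attached to the nearest field within a distance bound, with ties broken by preferring smaller angles, so that every label ends up in exactly one field's list. Coordinates would come from the rendered page; the distance bound and angle preference are illustrative assumptions:

```python
import math

def assign_labels(fields, labels, max_distance=150.0):
    """Attach each label to the closest field within max_distance, breaking
    distance ties by angle. fields and labels are dicts with a 'name' and
    center coordinates 'x' and 'y'."""
    assignments = {f["name"]: [] for f in fields}
    for label in labels:
        best = None
        for f in fields:
            dx, dy = f["x"] - label["x"], f["y"] - label["y"]
            dist = math.hypot(dx, dy)
            angle = abs(math.atan2(dy, dx))
            if dist <= max_distance and (best is None or (dist, angle) < best[:2]):
                best = (dist, angle, f["name"])
        if best is not None:
            assignments[best[2]].append(label["name"])
    return assignments

fields = [{"name": "title", "x": 200, "y": 100},
          {"name": "author", "x": 200, "y": 140}]
labels = [{"name": "Book title", "x": 120, "y": 100},
          {"name": "Author", "x": 120, "y": 140}]
print(assign_labels(fields, labels))   # {'title': ['Book title'], 'author': ['Author']}
```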
Evaluating Forms
• Text similarity measures used to link domain attributes to fields
• Computes relevance of form
• If form score exceeds relevance threshold, DeepBot executes queries
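A sketch of the relevance computation: text similarity links the labels assigned to form fields with attribute names or aliases, each matched attribute contributes its specificity index, and the resulting score is compared against the relevance threshold. The similarity measure and the weighting scheme here are assumptions:

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Simple string similarity in [0, 1]; a stand-in for whatever text
    similarity measure the real system uses."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def form_relevance(field_labels, attributes, min_similarity=0.7):
    """Score a form against the domain definition.

    attributes: list of (name, aliases, specificity) tuples. An attribute
    counts toward the score, weighted by its specificity index, when some
    field label is similar enough to its name or one of its aliases.
    """
    matched = total = 0.0
    for name, aliases, specificity in attributes:
        total += specificity
        if any(text_similarity(label, candidate) >= min_similarity
               for label in field_labels for candidate in [name] + aliases):
            matched += specificity
    return matched / total if total else 0.0

book_attrs = [("title", ["book title"], 0.8), ("author", ["written by"], 1.0)]
score = form_relevance(["Book title", "Written by"], book_attrs)
print(score >= 0.6)   # True, assuming a relevance threshold of 0.6
```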
Results and Conclusions
• Evaluated on three domain tasks: book, music, and movie shopping
• Achieves very high precision and recall
• Errors due to:
– Missing aliases
– Forms with too few fields to achieve minimum support
– Sources that did not label fields
Summary of Deep Web Crawling
• Several challenges must be addressed:
– Understanding forms
– Handling JavaScript
– Determining optimal queries
– Identifying result links
– Extracting metadata
• Most of the pieces exist
References
• Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., and Halevy, A. 2008. Google's Deep-Web Crawl. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1241-1252.
• Zhai, Y. and Liu, B. 2005. Web data extraction based on partial tree alignment. In Proceedings of the 14th international Conference on World Wide Web (Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM, New York, NY, 76-85
• Carlson, A. and Schafer, C. 2008. Bootstrapping Information Extraction from Semi-structured Web Pages. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I (Antwerp, Belgium, September 15 - 19, 2008).
• Álvarez, M., Pan, A., Raposo, J., and Hidalgo, J. 2006. Crawling Web Pages with Support for Client-Side Dynamism. In Proceedings of the 7th International Conference on Web-Age Information Management (WAIM 2006) (Hong Kong, China, June 17-19, 2006). Lecture Notes in Computer Science, vol. 4016. Springer-Verlag, Berlin, 252-262.
• Álvarez, M., Raposo, J., Pan, A., Cacheda, F., Bellas, F., and Carneiro, V. 2007. DeepBot: a focused crawler for accessing hidden web content. In Proceedings of the 3rd International Workshop on Data Engineering Issues in E-Commerce and Services: in Conjunction with ACM Conference on Electronic Commerce (EC '07) (San Diego, California, June 12 - 12, 2007). M. Hepp, M. Sayal, S. Lee, J. Lee, and J. Shim, Eds. DEECS '07, vol. 236. ACM, New York, NY, 18-25.