Deep-Web Crawling and Related Work
Matt Honeycutt, CSC 6400
TRANSCRIPT
Outline
• Basic background information
• Google’s Deep-Web Crawl
• Web Data Extraction Based on Partial Tree Alignment
• Bootstrapping Information Extraction from Semi-structured Web Pages
• Crawling Web Pages with Support for Client-Side Dynamism
• DeepBot: A Focused Crawler for Accessing Hidden Web Content
Background
• Publicly-Indexable Web (PIW)
– Web pages exposed by standard search engines
– Pages link to one another
• Deep-web
– Content behind HTML forms
– Database records
– Estimated to be much larger than PIW
– Estimated to be of higher quality than PIW
Summary
• Describes process implemented by Google
• Goal is to ‘surface’ content for indexing
• Contributions:
– Informativeness test
– Query selection techniques and algorithm for generating appropriate text inputs
About the Google Crawler
• Estimates that there are ~10 million high-quality HTML forms
• Index representative deep-web content across many forms, driving search traffic to the deep-web
• Two problems:
– Which inputs to fill in?
– What values to use?
Query Templates
• Correspond to SQL-like queries: select * from D where P
• First problem is to select the best templates
• Second problem is to select the best values for those templates
• Want to ignore presentation-related fields
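A query template can be thought of as the subset of a form's inputs that will be assigned values (its binding inputs), with the remaining inputs left at their defaults. The sketch below, with invented field names and helper names, shows one way a template could be expanded into concrete form submissions:

```python
# Hypothetical sketch: a template is a tuple of binding input names; every
# combination of candidate values for those inputs yields one submission,
# analogous to one 'select * from D where P' query.
from itertools import product

def generate_submissions(template, candidate_values):
    """Enumerate form submissions for one query template.

    template: tuple of binding input names, e.g. ("make", "zip")
    candidate_values: dict mapping input name -> list of values to try
    """
    value_lists = [candidate_values[name] for name in template]
    for combo in product(*value_lists):
        yield dict(zip(template, combo))

# Example: a 2-dimensional template over a (hypothetical) used-car form.
for query in generate_submissions(("make", "zip"),
                                  {"make": ["honda", "ford"],
                                   "zip": ["10001", "94103"]}):
    print(query)   # e.g. {'make': 'honda', 'zip': '10001'}
```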
Incremental Search for Informative Query Templates
• Classify templates as either informative or uninformative
• Template is informative if it generates pages sufficiently distinct from those of other templates
• Build more complex templates from simpler informative ones
• Signatures computed for each page
Informativeness Test
• T is informative if the pages it generates are sufficiently distinct, i.e. |S| / |G| ≥ τ, where G is the set of pages generated by submissions of T and S is the set of distinct page signatures among them
• Heuristically limit to templates with 10,000 or fewer possible submissions and no more than 3 dimensions
• Can estimate informativeness using a sample of possible queries (e.g., 200)
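A minimal sketch of how this informativeness estimate might look in code, assuming helper functions for fetching a submission's result page and computing its content signature; the 200-submission sample matches the slide, while the 0.25 threshold is an illustrative assumption:

```python
import random

def is_informative(submissions, fetch_page, signature,
                   sample_size=200, tau=0.25):
    """Estimate whether a query template is informative.

    submissions: list of dicts, one per possible form submission of the
    template. fetch_page and signature are assumed helpers (an HTTP fetch
    and a content fingerprint that ignores presentation-related variation).
    The template counts as informative when its distinct page signatures
    cover at least a tau fraction of the sampled submissions.
    """
    sample = submissions
    if len(submissions) > sample_size:
        sample = random.sample(submissions, sample_size)
    distinct = {signature(fetch_page(query)) for query in sample}
    return len(distinct) / max(len(sample), 1) >= tau
```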
Observations
• URLs generated for larger templates are not as useful
• ISIT generates far fewer URLs than the full Cartesian product (CP) of inputs, but still achieves high coverage
• Most common reason for failing to find an informative template: JavaScript
– Ignoring JavaScript errors, informative templates were found for 80% of forms tested
Generating Input Values
• Text boxes may be typed or untyped
• Special rules for a small number of common typed inputs
• Can’t use generic keyword lists; the best keywords are site-specific
• Select seed keywords from form, then iterate and select candidate keywords from results using TF-IDF
• Results are clustered and representative keywords are chosen for each cluster, ranked by page length
• Once candidate keywords have been selected, treat text inputs as select inputs
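One round of the keyword-selection loop can be sketched as follows; tokenization and result clustering are assumed to happen elsewhere, and the function name and exact scoring are simplifications rather than the paper's precise formulation:

```python
import math
from collections import Counter

def tfidf_candidates(result_pages, top_k=25):
    """Pick candidate keywords from result pages by TF-IDF.

    result_pages: list of token lists, one per page retrieved with the
    current seed keywords. Returns the top_k highest-scoring terms, which
    would be fed back in as the next round of probe queries.
    """
    doc_freq = Counter()
    for tokens in result_pages:
        doc_freq.update(set(tokens))
    n_docs = len(result_pages)
    scores = Counter()
    for tokens in result_pages:
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / doc_freq[term])
            scores[term] = max(scores[term], (count / len(tokens)) * idf)
    return [term for term, _ in scores.most_common(top_k)]

pages = [["deep", "web", "crawl", "form"], ["deep", "web", "query", "index"]]
print(tfidf_candidates(pages, top_k=3))   # terms unique to one page rank highest
```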
Conclusions
• Describes the innovations of “the first large-scale deep-web surfacing system”
• Results are already integrated into Google
• Informativeness test is a useful building block
• No need to cover individual sites completely
• Heuristics for common input types are useful
• Future work: support for JavaScript and handling dependencies between inputs
• Limitation: only supports GET requests
Summary
• Novel technique for extracting data from record lists: DEPTA (Data Extraction based on Partial Tree Alignment)
• Automatically identifies records and aligns their fields
• Overcomes limitations of existing techniques
Approach
• Step 1: Build tag tree
• Step 2: Segment page to identify data regions
• Step 3: Identify data records within the regions
• Step 4: Align records to identify fields
• Step 5: Extract fields into common table
Building the Tag Tree and Finding Data Regions
• Computes bounding regions for each element
• Associate items to parents based on containment to build tag tree
• Next, compare tag strings with edit distance to find data regions
• Finally, identify records within regions
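The region-finding comparison can be sketched roughly as below: sibling subtrees are flattened into tag strings and grouped into a data region when their normalized edit distance is small. The 0.3 threshold is an assumption for illustration:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def same_region(tag_string_a, tag_string_b, threshold=0.3):
    """Treat two sibling subtrees as belonging to the same data region when
    the normalized edit distance of their tag strings is small."""
    dist = edit_distance(tag_string_a, tag_string_b)
    return dist / max(len(tag_string_a), len(tag_string_b), 1) <= threshold

# Example: two product rows with nearly identical markup.
row1 = ["tr", "td", "img", "td", "a", "td", "b"]
row2 = ["tr", "td", "img", "td", "a", "td", "i"]
print(same_region(row1, row2))   # True
```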
Partial Tree Alignment
• Tree matching is expensive
• Simple Tree Matching – faster, but not as accurate
• Longest record tree becomes seed
• Fields that don’t match are added to seed
• Finally, field values extracted and inserted into table
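Simple Tree Matching, mentioned above, computes the size of the maximum matching between two ordered, labeled trees by dynamic programming over child sequences; partial tree alignment uses this kind of comparison when growing the seed tree. The node representation below is an assumed minimal one:

```python
class Node:
    """Minimal tree node: a tag name plus ordered children."""
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def simple_tree_matching(a, b):
    """Size of the maximum matching between two ordered, labeled trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # Dynamic program over the two ordered child sequences.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]))
    return 1 + table[m][n]

# Example: two record subtrees that share most of their structure.
seed = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("b")])])
rec  = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("i")])])
print(simple_tree_matching(seed, rec))   # 4 matched nodes: tr, td, a, td
```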
Conclusions
• Surpasses previous work (MDR)
• Capable of extracting data very accurately– Recall: 98.18%– Precision: 99.68%
Summary
• Method for extracting structured records from web pages
• Method requires very little training and achieves good results in two domains
Introduction
• Extracting structured fields enables advanced information retrieval scenarios
• Much previous work has been site-specific or required substantial manual labeling
• Heuristic-based approaches have not had great success
• Uses semi-supervised learning to extract fields from web pages
• User only has to label 2-5 pages for each of 4-6 sites
Technical Approach
• Human specifies domain schema
• Labels training records from representative sites
• Utilizes partial tree alignment to acquire additional records for each site
• New records are automatically labeled
• Learns regression model that predicts mappings from fields to schema columns
Mapping Fields to Columns
• Calculate score between each field and column
• Score based on field contexts and contexts observed in training
• Most probable mapping above a threshold is accepted
Feature Types
• Precontext 3-grams
• Lowercase value tokens
• Lowercase value 3-grams
• Value token type categories
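A sketch of extracting these four feature types for a single extracted field. Tokenization is assumed to happen elsewhere, and treating the 3-grams as character-level (rather than token-level) is an assumption of this sketch:

```python
def char_ngrams(text, n=3):
    """Character 3-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def token_type(token):
    """Coarse token type categories (an assumed category set)."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        return "LOWER" if token.islower() else "CAPITALIZED"
    return "OTHER"

def field_features(value_tokens, precontext_tokens):
    """Collect the four feature types listed above for one field:
    precontext 3-grams, lowercase value tokens, lowercase value 3-grams,
    and value token type categories."""
    features = []
    features += ["pre3:" + g for tok in precontext_tokens for g in char_ngrams(tok)]
    features += ["tok:" + tok.lower() for tok in value_tokens]
    features += ["val3:" + g for tok in value_tokens for g in char_ngrams(tok)]
    features += ["type:" + token_type(tok) for tok in value_tokens]
    return features

print(field_features(["Canon", "EOS"], ["Model", ":"]))
```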
Scoring
• Field mappings based on comparing feature distributions
– Distribution computed from training contexts
– Distribution computed from observed contexts
• Completely dissimilar field/column pairs are fully divergent
– Exact field/column pairs have no divergence
• Feature similarities combined using “stacked” linear regression model
• Weights for the model are learned in training
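The scoring idea can be sketched as: compare the training-context and observed-context feature distributions with a divergence measure (Jensen-Shannon is an assumed stand-in), turn each feature type's divergence into a similarity, and combine the similarities with the learned linear weights:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two feature distributions
    (dicts mapping feature -> probability): 0 for identical distributions,
    1 for completely dissimilar ones."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def mapping_score(similarities, weights, bias=0.0):
    """Combine per-feature-type similarities (1 - divergence) with a linear
    model whose weights would be learned during training."""
    return bias + sum(w * s for w, s in zip(weights, similarities))

# Example with invented numbers for the four feature types.
p = {"pre3:pri": 0.5, "tok:$": 0.5}
q = {"pre3:pri": 0.4, "tok:$": 0.4, "tok:usd": 0.2}
print(1 - js_divergence(p, q))                 # similarity for one feature type
print(mapping_score([0.9, 0.7, 0.8, 0.95],
                    [0.3, 0.2, 0.2, 0.3]))     # combined mapping score
```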
Crawling Web Pages with Support for Client-Side Dynamism
Manuel Álvarez, Alberto Pan, Juan Raposo, Justo Hidalgo
Summary
• Advanced crawler based on browser automation
• NSEQL: a language for specifying browser actions
• Stores each URL along with the navigation route needed to reach it
Limitations of Typical Crawlers
• Built on low-level HTTP APIs
• Limited or no support for client-side scripts
• Limited support for sessions
• Can only see what’s in the HTML
Their Crawler’s Features
• Built on “mini web browsers” – MSIE Browser Control
• Handles client-side JavaScript
• Routes fully support sessions
• Limited form-handling capabilities
Identifying New Routes
• Routes can come from links, forms, and JavaScript
• ‘href’ attributes extracted from normal anchor tags
• Tags with JavaScript click events are identified and “clicked”
• Captures actions and inspects them
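The route-discovery step can be approximated statically with Python's standard-library HTML parser, collecting href attributes and elements that carry JavaScript click handlers; the actual crawler instead "clicks" such elements inside its embedded browser, so this is only an illustration of what it looks for:

```python
from html.parser import HTMLParser

class RouteExtractor(HTMLParser):
    """Collect candidate routes: plain hrefs plus elements with an inline
    JavaScript click handler."""
    def __init__(self):
        super().__init__()
        self.hrefs, self.click_targets = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])
        if "onclick" in attrs:
            self.click_targets.append((tag, attrs["onclick"]))

extractor = RouteExtractor()
extractor.feed('<a href="/results?page=2">next</a>'
               '<span onclick="loadPage(3)">3</span>')
print(extractor.hrefs)          # ['/results?page=2']
print(extractor.click_targets)  # [('span', 'loadPage(3)')]
```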
Results and Conclusions
• Large scale websites are crawler-friendly
• Many medium-scale, deep-web sites aren’t
• Crawlers should handle client-side script
• Presented crawler has been applied to real-world applications
Summary
• Presents a focused deep-web crawler
• Extension of previous work
• Crawls links and handles search forms
Domain Definitions
• Attributes a1…aN
• Each attribute has name, aliases, specificity index
• Queries q1…qN
• Each query contains 1 or more (attribute,value) pairs
• Relevance threshold
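A sketch of how such a domain definition might be represented as data; the class and field names are assumptions and the example values are purely illustrative:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Attribute:
    """One domain attribute: a canonical name, alternative labels that may
    appear next to form fields, and a specificity index indicating how
    strongly the attribute identifies the target domain."""
    name: str
    aliases: List[str] = field(default_factory=list)
    specificity: float = 1.0

@dataclass
class DomainDefinition:
    attributes: List[Attribute]
    queries: List[List[Tuple[str, str]]]   # each query: (attribute, value) pairs
    relevance_threshold: float

# Illustrative definition for a book-shopping task.
books = DomainDefinition(
    attributes=[Attribute("title", ["book title"], 0.8),
                Attribute("author", ["written by"], 1.0),
                Attribute("isbn", [], 1.0)],
    queries=[[("title", "dune"), ("author", "herbert")]],
    relevance_threshold=0.6)
```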
Evaluating Forms
• Obtains bounding coordinates of all form fields and potential labels
• Distances and angles computed between fields and labels
Evaluating Forms
• If label l is within min-distance of field f, l is added to f’s list
– Ties are broken using angle
• Lists are pruned so that labels appear in only one list and all fields have at least one possible label
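The label-assignment geometry can be sketched as below: each candidate label is attached to the nearest field within a distance bound, with ties broken by preferring smaller angles, so that every label ends up in exactly one field's list. Coordinates would come from the rendered page; the distance bound and angle preference are illustrative assumptions:

```python
import math

def assign_labels(fields, labels, max_distance=150.0):
    """Attach each label to the closest field within max_distance, breaking
    distance ties by angle. fields and labels are dicts with a 'name' and
    center coordinates 'x' and 'y'."""
    assignments = {f["name"]: [] for f in fields}
    for label in labels:
        best = None
        for f in fields:
            dx, dy = f["x"] - label["x"], f["y"] - label["y"]
            dist = math.hypot(dx, dy)
            angle = abs(math.atan2(dy, dx))
            if dist <= max_distance and (best is None or (dist, angle) < best[:2]):
                best = (dist, angle, f["name"])
        if best is not None:
            assignments[best[2]].append(label["name"])
    return assignments

fields = [{"name": "title", "x": 200, "y": 100},
          {"name": "author", "x": 200, "y": 140}]
labels = [{"name": "Book title", "x": 120, "y": 100},
          {"name": "Author", "x": 120, "y": 140}]
print(assign_labels(fields, labels))   # {'title': ['Book title'], 'author': ['Author']}
```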
Evaluating Forms
• Text similarity measures used to link domain attributes to fields
• Computes relevance of form
• If form score exceeds relevance threshold, DeepBot executes queries
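A sketch of the relevance computation: text similarity links the labels assigned to form fields with attribute names or aliases, each matched attribute contributes its specificity index, and the resulting score is compared against the relevance threshold. The similarity measure and the weighting scheme here are assumptions:

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Simple string similarity in [0, 1]; a stand-in for whatever text
    similarity measure the real system uses."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def form_relevance(field_labels, attributes, min_similarity=0.7):
    """Score a form against the domain definition.

    attributes: list of (name, aliases, specificity) tuples. An attribute
    counts toward the score, weighted by its specificity index, when some
    field label is similar enough to its name or one of its aliases.
    """
    matched = total = 0.0
    for name, aliases, specificity in attributes:
        total += specificity
        if any(text_similarity(label, candidate) >= min_similarity
               for label in field_labels for candidate in [name] + aliases):
            matched += specificity
    return matched / total if total else 0.0

book_attrs = [("title", ["book title"], 0.8), ("author", ["written by"], 1.0)]
score = form_relevance(["Book title", "Written by"], book_attrs)
print(score >= 0.6)   # True, assuming a relevance threshold of 0.6
```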
Results and Conclusions
• Evaluated on three domain tasks: book, music, and movie shopping
• Achieves very high precision and recall
• Errors due to:
– Missing aliases
– Forms with too few fields to achieve minimum support
– Sources that did not label fields
Summary of Deep Web Crawling
• Several challenges must be addressed:
– Understanding forms
– Handling JavaScript
– Determining optimal queries
– Identifying result links
– Extracting metadata
• Most of the pieces exist
References
• Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., and Halevy, A. 2008. Google's Deep-Web Crawl. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1241-1252.
• Zhai, Y. and Liu, B. 2005. Web data extraction based on partial tree alignment. In Proceedings of the 14th international Conference on World Wide Web (Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM, New York, NY, 76-85
• Carlson, A. and Schafer, C. 2008. Bootstrapping Information Extraction from Semi-structured Web Pages. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I (Antwerp, Belgium, September 15 - 19, 2008).
• Álvarez, M., Pan, A., Raposo, J., and Hidalgo, J. 2006. Crawling Web Pages with Support for Client-Side Dynamism. In Proceedings of the 7th International Conference on Web-Age Information Management (WAIM 2006) (Hong Kong, China, June 17-19, 2006). Lecture Notes in Computer Science, vol. 4016. Springer-Verlag, Berlin, 252-262.
• Álvarez, M., Raposo, J., Pan, A., Cacheda, F., Bellas, F., and Carneiro, V. 2007. DeepBot: a focused crawler for accessing hidden web content. In Proceedings of the 3rd International Workshop on Data Engineering Issues in E-Commerce and Services: in Conjunction with ACM Conference on Electronic Commerce (EC '07) (San Diego, California, June 12 - 12, 2007). M. Hepp, M. Sayal, S. Lee, J. Lee, and J. Shim, Eds. DEECS '07, vol. 236. ACM, New York, NY, 18-25.