TRANSCRIPT
1
Web Crawling for Search:What’s hard after ten years?
Raymie Stata
Chief Architect, Yahoo! Search and Marketplace
2
Agenda
• Introduction
• What makes crawling hard for “beginners”
• What remains hard for “experts”
3
Introduction
• Web “crawling” is the primary means of obtaining data for Search Engines
  – Tens of billions of pages downloaded
  – Hundreds of billions of pages “known”
  – Average page <10 days old
• Web crawling is as old as the Web
  – “Large-scale” crawling is about ten years old
• Lots published, but “secret sauce” still exists
• Must support RCF
  – Relevance, comprehensiveness, freshness
4
Components of a crawler
[Diagram: crawler components and data flow — Downloaders, Page processing, Page storage, Web DB, Enrichment, Prioritization, Feeds, Click streams, and the Internet (DNS as well as HTTP).]
5
Baseline challenges: overall scale
• 100s of machines dedicated to each component
• Must be good at logistics (purchasing and deployment), operations, distributed programming (fault tolerance included), …
6
Baseline challenges: downloaders
• DNS scaling (multi-threading)
• Bandwidth
  – Async I/O vs. threads
  – Clustering/distribution
• Non-conformance
• Politeness
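The slide contrasts async I/O with threads and lists politeness as a baseline challenge. Below is a minimal sketch of one way to combine the two, assuming Python with the aiohttp library; the 1-second per-host delay and the example URLs are illustrative assumptions, not values from the talk.

```python
import asyncio
from urllib.parse import urlsplit

import aiohttp  # third-party HTTP client, assumed available

POLITENESS_DELAY = 1.0                         # assumed per-host delay, not from the talk
_host_locks: dict[str, asyncio.Lock] = {}
_last_hit: dict[str, float] = {}

async def polite_fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    host = urlsplit(url).netloc
    lock = _host_locks.setdefault(host, asyncio.Lock())
    async with lock:                           # at most one in-flight request per host
        loop = asyncio.get_running_loop()
        wait = POLITENESS_DELAY - (loop.time() - _last_hit.get(host, 0.0))
        if wait > 0:
            await asyncio.sleep(wait)          # enforce the per-host politeness delay
        _last_hit[host] = loop.time()
        async with session.get(url) as resp:   # async I/O: no thread per connection
            return await resp.read()

async def main(urls: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(polite_fetch(session, u) for u in urls))
        print([len(p) for p in pages])

if __name__ == "__main__":
    asyncio.run(main(["https://example.com/", "https://www.iana.org/"]))
```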
7
Baseline challenges: page processing
• File-cracking (sketch below)
  – HTML, Word, PDF, JPG, MPEG, …
• Non-conformance
• Higher-level processing
  – JavaScript, sessions, information extraction, …
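For illustration only, here is a tiny sketch of what “file-cracking” dispatch might look like: downloaded bytes are routed to a format-specific parser by MIME type. The cracker functions are hypothetical stubs, not real parsers, and the mapping is an assumption.

```python
from typing import Callable

def crack_html(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")   # stub: real code would parse the DOM

def crack_pdf(data: bytes) -> str:
    return ""                                        # stub: real code would extract PDF text

CRACKERS: dict[str, Callable[[bytes], str]] = {
    "text/html": crack_html,
    "application/pdf": crack_pdf,
}

def crack(content_type: str, data: bytes) -> str:
    # Non-conformance in practice: servers lie about Content-Type, so real
    # crawlers also sniff magic bytes rather than trusting the header.
    mime = content_type.split(";")[0].strip().lower()
    cracker = CRACKERS.get(mime)
    if cracker is None:
        raise ValueError(f"no cracker for {mime!r}")
    return cracker(data)

print(crack("text/html; charset=utf-8", b"<html><body>hello</body></html>"))
```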
8
Baseline challenges: Web DB and enrichment
• Scale
  – Update rate
  – Extraction rate
• Duplication detection (sketch below)
• Alias detection
• Checkpoints
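One common family of techniques for duplication detection is shingling, which the deck later alludes to as “shingleprints.” The sketch below is a minimal illustration of the idea, not the production method; the shingle size and the similarity threshold you would apply are assumptions.

```python
# Near-duplicate detection via word shingles and Jaccard resemblance.
import hashlib

def shingles(text: str, k: int = 4) -> set[int]:
    words = text.lower().split()
    out = set()
    for i in range(max(len(words) - k + 1, 1)):
        gram = " ".join(words[i:i + k])
        out.add(int(hashlib.md5(gram.encode()).hexdigest()[:16], 16))  # 64-bit shingle hash
    return out

def resemblance(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

# Pages whose resemblance is close to 1.0 would be treated as near-duplicates.
print(resemblance("the quick brown fox jumps over the lazy dog",
                  "the quick brown fox jumped over the lazy dog"))
```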
9
Baseline challenges: prioritization
• Quality ranking
• Spam and crawler traps
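As a rough illustration of what prioritization has to do, here is a sketch of a crawl frontier that orders URLs by an externally supplied quality score and rejects URLs that look like crawler traps. The heuristics and cutoffs are assumptions made up for this example, not the talk’s method.

```python
import heapq
from urllib.parse import urlsplit

MAX_DEPTH = 12   # assumed cutoff on path depth

def looks_like_trap(url: str) -> bool:
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Very deep paths, or the same segment repeated many times, are suspicious.
    return len(segments) > MAX_DEPTH or any(segments.count(s) > 3 for s in set(segments))

class Frontier:
    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def add(self, url: str, quality: float) -> None:
        if not looks_like_trap(url):
            heapq.heappush(self._heap, (-quality, url))  # highest quality first

    def next_url(self) -> str:
        return heapq.heappop(self._heap)[1]

f = Frontier()
f.add("https://example.com/news/today", quality=0.9)
f.add("https://example.com/a/a/a/a/a/a/a/a/a/a/a/a/a/page", quality=0.5)  # dropped as a trap
print(f.next_url())
```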
10
Evergreen problems
• Relevance
  – Page quality, spam
    • Page processing, prioritization techniques
• Comprehensiveness
  – Sheer scale
    • Sheer machine count (expensive)
    • Scaling of the Web DB
  – Deep Web, information extraction
    • Page processing
• Freshness
  – Discovery, frequency, “long tail”
11
Web DB: more details
• For each URL, the Web DB contains:
  – In- and outlinks
  – Anchor text
  – Various dates: last downloaded, last changed, …
  – “Decorations” from various processors
    • Language, topic, spam scores, term-vectors, fingerprints, “shingleprints,” many more…
• Subset of the above stored for several instances
  – That is, we keep track of the history of a page
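The list above translates naturally into a per-URL record. The sketch below is speculative — field names, types, and the use of a history list are illustrative assumptions, not Yahoo!’s schema.

```python
from dataclasses import dataclass, field

@dataclass
class UrlRecord:
    url: str
    inlinks: list[str] = field(default_factory=list)
    outlinks: list[str] = field(default_factory=list)
    anchor_text: list[str] = field(default_factory=list)          # text of links pointing here
    last_downloaded: str | None = None                             # ISO dates in this sketch
    last_changed: str | None = None
    decorations: dict[str, object] = field(default_factory=dict)   # language, topic, spam score, ...
    history: list["UrlRecord"] = field(default_factory=list)       # earlier instances of the page

rec = UrlRecord(url="https://example.com/")
rec.decorations["language"] = "en"
print(rec)
```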
12
Web DB: update volume
• When a page is downloaded, we need to update inlink and anchor-text info for each page it points to
• A page has ~20 outlinks on it
• We download 1,000s of pages per second
• At peak, we need well over 100K updates/sec
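The 100K/sec figure follows directly from the two numbers on the slide; a quick back-of-envelope check (the 5,000 pages/sec peak is an assumed value for “1,000s per second”):

```python
outlinks_per_page = 20          # "~20 outlinks" per page
pages_per_second = 5_000        # assumed peak for "1,000s of pages per second"
updates_per_second = outlinks_per_page * pages_per_second
print(updates_per_second)       # 100,000 inlink/anchor-text updates per second
```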
13
Web DB: scaling techniques
• Perform updates in large batches (sketch below)
• Solves bandwidth problems…
• …but introduces latency problems
  – In particular: time to discover new links
• Solve latency with a “short-circuit” for discovery
  – But this bypasses the full prioritization logic, which introduces quality problems that need yet more special-case solutions, and before long, oy, it’s all getting very complicated…
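A minimal sketch of the batching-plus-short-circuit pattern described above: link updates are buffered and applied to the Web DB in bulk, while newly discovered URLs bypass the batch and go straight to the frontier. The batch size and the webdb/frontier interfaces are illustrative assumptions, not the talk’s actual design.

```python
BATCH_SIZE = 100_000   # assumed batch size

class BatchedLinkUpdates:
    def __init__(self, webdb, frontier):
        self.webdb = webdb        # assumed interface: knows(url), apply_batch(updates)
        self.frontier = frontier  # assumed interface: enqueue(url)
        self.pending = []

    def record_link(self, src: str, dst: str, anchor: str) -> None:
        self.pending.append((src, dst, anchor))
        if not self.webdb.knows(dst):
            # Short-circuit discovery: a brand-new URL goes straight to the
            # frontier instead of waiting for the next batch, bypassing the
            # full prioritization logic (hence the quality problems noted above).
            self.frontier.enqueue(dst)
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self) -> None:
        # One large sequential update amortizes Web DB I/O (solves bandwidth)
        # but everything in the batch waits until now (introduces latency).
        self.webdb.apply_batch(self.pending)
        self.pending = []
```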
14
DHTML: the enemy of crawling
• Increasing use of client-side scripting (a.k.a. DHTML) is making more of the Web opaque to crawlers
  – AJAX: Asynchronous JavaScript and XML
    • (The end of crawling?)
• Not (yet) a major barrier to Web search, but it is a barrier to shopping and other specialized search, where we also have to deal with:
  – Form-filling and sessions
  – Information extraction
15
Conclusions
• Large-scale Web crawling not trivial
• Smart, well-funded people could figure it out from the literature
• But secret sauce remains in:
  – Prioritization
  – Scaling the Web DB
  – JavaScript, form-filling, information extraction
16
The future
• Will life get easier?
  – Ping plus feeds
• Will life get harder?
  – DHTML -> Ajax -> Avalon
• A little bit of both?
  – Publishers regain control
  – But, net, comprehensiveness improves