TRANSCRIPT
1
Web Crawling for Search:What’s hard after ten years?
Raymie Stata
Chief Architect, Yahoo! Search and Marketplace
2
Agenda
• Introduction
• What makes crawling hard for “beginners”
• What remains hard for “experts”
3
Introduction
• Web “crawling” is the primary means of obtaining data for Search Engines
  – Tens of billions of pages downloaded
  – Hundreds of billions of pages “known”
  – Average page <10 days old
• Web crawling is as old as the Web
  – “Large-scale” crawling is about ten years old
• Lots published, but “secret sauce” still exists
• Must support RCF
  – Relevance, comprehensiveness, freshness
4
Components of a crawler
[Diagram: crawler components and data flow — Downloaders, Page processing, Page storage, Web DB, Enrichment, Prioritization, Feeds, Click streams, and the Internet (DNS as well as HTTP).]
5
Baseline challenges: overall scale
• 100s of machines dedicated to each component
• Must be good at logistics (purchasing and deployment), operations, distributed programming (fault tolerance included), …
6
Baseline challenges: downloaders
• DNS scaling (multi-threading)
• Bandwidth
  – Async I/O vs. threads
  – Clustering/distribution
• Non-conformance
• Politeness
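The slide contrasts async I/O with threads and lists politeness as a baseline challenge. Below is a minimal sketch of one way to combine the two, assuming Python with the aiohttp library; the 1-second per-host delay and the example URLs are illustrative assumptions, not values from the talk.

```python
import asyncio
from urllib.parse import urlsplit

import aiohttp  # third-party HTTP client, assumed available

POLITENESS_DELAY = 1.0                         # assumed per-host delay, not from the talk
_host_locks: dict[str, asyncio.Lock] = {}
_last_hit: dict[str, float] = {}

async def polite_fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    host = urlsplit(url).netloc
    lock = _host_locks.setdefault(host, asyncio.Lock())
    async with lock:                           # at most one in-flight request per host
        loop = asyncio.get_running_loop()
        wait = POLITENESS_DELAY - (loop.time() - _last_hit.get(host, 0.0))
        if wait > 0:
            await asyncio.sleep(wait)          # enforce the per-host politeness delay
        _last_hit[host] = loop.time()
        async with session.get(url) as resp:   # async I/O: no thread per connection
            return await resp.read()

async def main(urls: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(polite_fetch(session, u) for u in urls))
        print([len(p) for p in pages])

if __name__ == "__main__":
    asyncio.run(main(["https://example.com/", "https://www.iana.org/"]))
```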
7
Baseline challenges: page processing
• File-cracking (sketch below)
  – HTML, Word, PDF, JPG, MPEG, …
• Non-conformance
• Higher-level processing
  – JavaScript, sessions, information extraction, …
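For illustration only, here is a tiny sketch of what “file-cracking” dispatch might look like: downloaded bytes are routed to a format-specific parser by MIME type. The cracker functions are hypothetical stubs, not real parsers, and the mapping is an assumption.

```python
from typing import Callable

def crack_html(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")   # stub: real code would parse the DOM

def crack_pdf(data: bytes) -> str:
    return ""                                        # stub: real code would extract PDF text

CRACKERS: dict[str, Callable[[bytes], str]] = {
    "text/html": crack_html,
    "application/pdf": crack_pdf,
}

def crack(content_type: str, data: bytes) -> str:
    # Non-conformance in practice: servers lie about Content-Type, so real
    # crawlers also sniff magic bytes rather than trusting the header.
    mime = content_type.split(";")[0].strip().lower()
    cracker = CRACKERS.get(mime)
    if cracker is None:
        raise ValueError(f"no cracker for {mime!r}")
    return cracker(data)

print(crack("text/html; charset=utf-8", b"<html><body>hello</body></html>"))
```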
8
Baseline challenges: Web DB and enrichment
• Scale
  – Update rate
  – Extraction rate
• Duplication detection (sketch below)
• Alias detection
• Checkpoints
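One common family of techniques for duplication detection is shingling, which the deck later alludes to as “shingleprints.” The sketch below is a minimal illustration of the idea, not the production method; the shingle size and the similarity threshold you would apply are assumptions.

```python
# Near-duplicate detection via word shingles and Jaccard resemblance.
import hashlib

def shingles(text: str, k: int = 4) -> set[int]:
    words = text.lower().split()
    out = set()
    for i in range(max(len(words) - k + 1, 1)):
        gram = " ".join(words[i:i + k])
        out.add(int(hashlib.md5(gram.encode()).hexdigest()[:16], 16))  # 64-bit shingle hash
    return out

def resemblance(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

# Pages whose resemblance is close to 1.0 would be treated as near-duplicates.
print(resemblance("the quick brown fox jumps over the lazy dog",
                  "the quick brown fox jumped over the lazy dog"))
```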
9
Baseline challenges: prioritization
• Quality ranking
• Spam and crawler traps
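As a rough illustration of what prioritization has to do, here is a sketch of a crawl frontier that orders URLs by an externally supplied quality score and rejects URLs that look like crawler traps. The heuristics and cutoffs are assumptions made up for this example, not the talk’s method.

```python
import heapq
from urllib.parse import urlsplit

MAX_DEPTH = 12   # assumed cutoff on path depth

def looks_like_trap(url: str) -> bool:
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Very deep paths, or the same segment repeated many times, are suspicious.
    return len(segments) > MAX_DEPTH or any(segments.count(s) > 3 for s in set(segments))

class Frontier:
    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def add(self, url: str, quality: float) -> None:
        if not looks_like_trap(url):
            heapq.heappush(self._heap, (-quality, url))  # highest quality first

    def next_url(self) -> str:
        return heapq.heappop(self._heap)[1]

f = Frontier()
f.add("https://example.com/news/today", quality=0.9)
f.add("https://example.com/a/a/a/a/a/a/a/a/a/a/a/a/a/page", quality=0.5)  # dropped as a trap
print(f.next_url())
```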
10
Evergreen problems
• Relevance
  – Page quality, spam
    • Page processing, prioritization techniques
• Comprehensiveness
  – Sheer scale
    • Sheer machine count (expensive)
    • Scaling of the Web DB
  – Deep Web, information extraction
    • Page processing
• Freshness
  – Discovery, frequency, “long tail”
11
Web DB: more details
• For each URL, the Web DB contains:
  – In- and outlinks
  – Anchor text
  – Various dates: last downloaded, last changed, …
  – “Decorations” from various processors
    • Language, topic, spam scores, term-vectors, fingerprints, “shingleprints,” many more…
• Subset of the above stored for several instances
  – That is, we keep track of the history of a page
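The list above translates naturally into a per-URL record. The sketch below is speculative — field names, types, and the use of a history list are illustrative assumptions, not Yahoo!’s schema.

```python
from dataclasses import dataclass, field

@dataclass
class UrlRecord:
    url: str
    inlinks: list[str] = field(default_factory=list)
    outlinks: list[str] = field(default_factory=list)
    anchor_text: list[str] = field(default_factory=list)          # text of links pointing here
    last_downloaded: str | None = None                             # ISO dates in this sketch
    last_changed: str | None = None
    decorations: dict[str, object] = field(default_factory=dict)   # language, topic, spam score, ...
    history: list["UrlRecord"] = field(default_factory=list)       # earlier instances of the page

rec = UrlRecord(url="https://example.com/")
rec.decorations["language"] = "en"
print(rec)
```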
12
Web DB: update volume
• When a page is downloaded, we need to update inlink and anchor-text info for each page it points to
• A page has ~20 outlinks on it
• We download 1,000s of pages per second
• At peak, we need well over 100K updates/sec
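The 100K/sec figure follows directly from the two numbers on the slide; a quick back-of-envelope check (the 5,000 pages/sec peak is an assumed value for “1,000s per second”):

```python
outlinks_per_page = 20          # "~20 outlinks" per page
pages_per_second = 5_000        # assumed peak for "1,000s of pages per second"
updates_per_second = outlinks_per_page * pages_per_second
print(updates_per_second)       # 100,000 inlink/anchor-text updates per second
```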
13
Web DB: scaling techniques
• Perform updates in large batches (sketch below)
• Solves bandwidth problems…
• …but introduces latency problems
  – In particular: time to discover new links
• Solve latency with a “short-circuit” for discovery
  – But this bypasses the full prioritization logic, which introduces quality problems that need yet more special-case solutions, and before long, oy, it’s all getting very complicated…
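A minimal sketch of the batching-plus-short-circuit pattern described above: link updates are buffered and applied to the Web DB in bulk, while newly discovered URLs bypass the batch and go straight to the frontier. The batch size and the webdb/frontier interfaces are illustrative assumptions, not the talk’s actual design.

```python
BATCH_SIZE = 100_000   # assumed batch size

class BatchedLinkUpdates:
    def __init__(self, webdb, frontier):
        self.webdb = webdb        # assumed interface: knows(url), apply_batch(updates)
        self.frontier = frontier  # assumed interface: enqueue(url)
        self.pending = []

    def record_link(self, src: str, dst: str, anchor: str) -> None:
        self.pending.append((src, dst, anchor))
        if not self.webdb.knows(dst):
            # Short-circuit discovery: a brand-new URL goes straight to the
            # frontier instead of waiting for the next batch, bypassing the
            # full prioritization logic (hence the quality problems noted above).
            self.frontier.enqueue(dst)
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self) -> None:
        # One large sequential update amortizes Web DB I/O (solves bandwidth)
        # but everything in the batch waits until now (introduces latency).
        self.webdb.apply_batch(self.pending)
        self.pending = []
```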
14
DHTML: the enemy of crawling
• Increasing use of client-side scripting (a.k.a. DHTML) is making more of the Web opaque to crawlers
  – AJAX: Asynchronous JavaScript and XML
    • (The end of crawling?)
• Not (yet) a major barrier to Web search, but it is a barrier to shopping and other specialized search, where we also have to deal with:
  – Form-filling and sessions
  – Information extraction
15
Conclusions
• Large-scale Web crawling not trivial
• Smart, well-funded people could figure it out from the literature
• But secret sauce remains in:
  – Prioritization
  – Scaling the Web DB
  – JavaScript, form-filling, information extraction
16
The future
• Will life get easier?
  – Ping plus feeds
• Will life get harder?
  – DHTML -> Ajax -> Avalon
• A little bit of both?
  – Publishers regain control
  – But, net, comprehensiveness improves