
Page 1: Web Crawling for Search: What’s hard after ten years?

Raymie Stata

Chief Architect, Yahoo! Search and Marketplace

Page 2: Agenda

• Introduction
• What makes crawling hard for “beginners”
• What remains hard for “experts”

Page 3: Introduction

• Web “crawling” is the primary means of obtaining data for Search Engines
  – Tens of billions of pages downloaded
  – Hundreds of billions of pages “known”
  – Average page <10 days old
• Web crawling is as old as the Web
  – “Large-scale” crawling is about ten years old
• Lots published, but “secret sauce” still exists
• Must support RCF
  – Relevance, comprehensiveness, freshness

Page 4: Components of a crawler

[Diagram: components of a crawler: Downloaders, Page processing, Page storage, Web DB, Prioritization, and Enrichment, connected to the Internet (DNS as well as HTTP), with Feeds and Click streams as additional inputs]
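
How these pieces fit together is easiest to see in miniature. Below is a deliberately naive single-machine sketch (every name is illustrative; this is not Yahoo!’s design), with comments marking the component each step corresponds to:

    # Minimal, illustrative crawl loop wiring the components above.
    import re
    import urllib.request
    from urllib.parse import urljoin

    def crawl(seed_urls, max_pages=100):
        frontier = list(seed_urls)      # Prioritization (here: naive FIFO)
        web_db = {}                     # Web DB: per-URL link metadata
        storage = {}                    # Page storage: raw page bytes
        while frontier and len(storage) < max_pages:
            url = frontier.pop(0)
            if url in web_db:
                continue
            try:                        # Downloader (DNS as well as HTTP)
                page = urllib.request.urlopen(url, timeout=10).read()
            except (OSError, ValueError):
                continue
            storage[url] = page
            # Page processing: crude outlink extraction from raw HTML
            links = [urljoin(url, h.decode("ascii", "ignore"))
                     for h in re.findall(rb'href="([^"#]+)"', page)]
            web_db[url] = {"outlinks": links}  # Web DB update; enrichment hooks in here
            frontier.extend(links)      # discoveries feed back into prioritization
        return web_db, storage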

Page 5: Baseline challenges: overall scale

• 100s of machines dedicated to each component
• Must be good at logistics (purchasing and deployment), operations, distributed programming (fault tolerance included), …
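
For a flavor of the distributed-programming piece, one common pattern (an assumption here, not a scheme described in the talk) is to shard crawl work across machines by hashing the URL’s host, so each host’s politeness and DNS state lives on exactly one downloader:

    # Hypothetical sketch: route each URL to one of N downloader machines by
    # hashing its host, keeping all per-host state on a single machine.
    import hashlib
    from urllib.parse import urlparse

    def downloader_for(url, num_machines):
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_machines

    # Same host always lands on the same machine:
    assert downloader_for("http://example.com/a", 8) == downloader_for("http://example.com/b", 8)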

Page 6: Baseline challenges: downloaders

• DNS scaling (multi-threading)
• Bandwidth
  – Async I/O vs. threads
  – Clustering/distribution
• Non-conformance
• Politeness
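
To make the politeness point concrete, here is a minimal sketch of a per-host rate limiter built on async I/O. The two-second minimum delay is an assumed policy; a production crawler would also honor robots.txt and per-site crawl-delay settings:

    # Hypothetical sketch: per-host minimum delay between requests, with
    # async I/O so thousands of hosts can proceed concurrently.
    import asyncio
    import time
    from urllib.parse import urlparse

    MIN_DELAY = 2.0        # assumed: seconds between hits to the same host
    _last_hit = {}         # host -> monotonic time of last request
    _locks = {}            # host -> lock serializing requests to that host

    async def polite_fetch(url, fetch):
        host = urlparse(url).netloc
        lock = _locks.setdefault(host, asyncio.Lock())
        async with lock:   # at most one in-flight request per host
            last = _last_hit.get(host)
            if last is not None:
                wait = last + MIN_DELAY - time.monotonic()
                if wait > 0:
                    await asyncio.sleep(wait)
            _last_hit[host] = time.monotonic()
        return await fetch(url)   # fetch: any async HTTP downloader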

Page 7: Baseline challenges: page processing

• File-cracking
  – HTML, Word, PDF, JPG, MPEG, …
• Non-conformance
• Higher-level processing
  – JavaScript, sessions, information extraction, …
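
As an illustration of file-cracking (and of tolerating non-conformant input), a sketch that dispatches on MIME type and runs HTML through a lenient parser; the names are hypothetical:

    # Hypothetical sketch: dispatch a downloaded document to a
    # format-specific "cracker" based on its MIME type.
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def crack(content_type, body):
        if content_type.startswith("text/html"):
            extractor = LinkExtractor()
            # Non-conformance: real HTML is messy, so the parser must be lenient.
            extractor.feed(body.decode("utf-8", errors="replace"))
            return {"links": extractor.links}
        # Word, PDF, JPG, MPEG, ... would each get their own cracker here.
        return {"links": []}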

Page 8: Baseline challenges: Web DB and enrichment

• Scale
  – Update rate
  – Extraction rate
• Duplication detection
• Alias detection
• Checkpoints
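
For the duplication-detection bullet, one well-known family of techniques is shingling with min-hash-style fingerprints (the “shingleprints” mentioned later in this deck). A minimal sketch, with illustrative parameters:

    # Sketch of near-duplicate detection: shingle the text, keep the m
    # smallest shingle hashes as a fingerprint, compare fingerprints.
    import hashlib

    def shingles(text, k=8):
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def shingleprint(text, m=64):
        hashes = sorted(int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
                        for s in shingles(text))
        return frozenset(hashes[:m])   # m smallest hashes; similar pages share many

    def resemblance(a, b):
        return len(a & b) / len(a | b) if (a or b) else 1.0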

Page 9: Baseline challenges: prioritization

• Quality ranking
• Spam and crawler traps
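
A minimal sketch of a prioritized frontier with a crude trap defense; the per-host budget and the externally supplied quality score are assumptions for illustration, and real quality ranking is far richer:

    # Hypothetical sketch: frontier ordered by quality score, with a per-host
    # page budget as a blunt guard against spam sites and crawler traps.
    import heapq
    from urllib.parse import urlparse

    class Frontier:
        def __init__(self, per_host_budget=1000):
            self.heap = []              # (-score, url): highest score pops first
            self.per_host = {}          # host -> URLs admitted so far
            self.budget = per_host_budget

        def push(self, url, score):
            host = urlparse(url).netloc
            n = self.per_host.get(host, 0)
            if n >= self.budget:        # trap guard: stop admitting this host
                return
            self.per_host[host] = n + 1
            heapq.heappush(self.heap, (-score, url))

        def pop(self):
            return heapq.heappop(self.heap)[1] if self.heap else None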

Page 10: Evergreen problems

• Relevance
  – Page quality, spam
    • Page processing, prioritization techniques
• Comprehensiveness
  – Sheer scale
    • Sheer machine count (expensive)
    • Scaling of the Web DB
  – Deep Web, information extraction
    • Page processing
• Freshness
  – Discovery, frequency, “long tail”

Page 11: Web DB: more details

• For each URL, the Web DB contains:
  – In- and outlinks
  – Anchor text
  – Various dates: last downloaded, last changed, …
  – “Decorations” from various processors
    • Language, topic, spam scores, term-vectors, fingerprints, “shingleprints,” many more…
• Subset of the above stored for several instances
  – That is, we keep track of the history of a page
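
A rough sketch of what such a per-URL record might look like; field names are illustrative, not Yahoo!’s schema:

    # Hypothetical per-URL Web DB record mirroring the fields listed above.
    from dataclasses import dataclass, field

    @dataclass
    class UrlRecord:
        url: str
        inlinks: list = field(default_factory=list)      # URLs linking here
        outlinks: list = field(default_factory=list)     # URLs linked from here
        anchor_text: list = field(default_factory=list)  # text of inbound anchors
        last_downloaded: float = 0.0                     # epoch seconds
        last_changed: float = 0.0
        # "Decorations": language, topic, spam scores, term-vectors,
        # fingerprints/"shingleprints", and so on, keyed by processor.
        decorations: dict = field(default_factory=dict)
        history: list = field(default_factory=list)      # prior snapshots (page history)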

Page 12: Web DB: update volume

• When a page is downloaded, we need to update inlink and anchor-text info for each page it points to

• A page has ~20 outlinks on it
• We download thousands of pages per second
• At peak, we need well over 100K updates/sec
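
The arithmetic behind that peak figure, spelled out (the concrete download rate is an assumed illustration; the slide says only “thousands of pages per second”):

    # Back-of-envelope check of the update rate claimed above.
    outlinks_per_page = 20        # from the slide
    pages_per_second = 5_000      # assumed: "thousands of pages per second"
    print(outlinks_per_page * pages_per_second)   # 100000 updates/sec at peak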

Page 13: Web DB: scaling techniques

• Perform updates in large batches
• Solves bandwidth problems…
• …but introduces latency problems
  – In particular: time to discover new links
• Solve latency with a “short-circuit” for discovery
  – But this bypasses the full prioritization logic, which introduces quality problems that need to be solved with more special solutions, and before long, oi, it’s all getting very complicated…
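
A minimal sketch of batching with a discovery short-circuit; apply_batch and enqueue_for_crawl are hypothetical hooks standing in for the bulk merge and the crawl queue:

    # Hypothetical sketch: buffer Web DB updates into large batches for
    # throughput, but push newly discovered URLs to the crawl queue
    # immediately ("short-circuit"), accepting the quality risk noted above.
    class BatchedWebDB:
        def __init__(self, apply_batch, enqueue_for_crawl, batch_size=1_000_000):
            self.pending = []                  # buffered (dst, src, anchor) updates
            self.apply_batch = apply_batch     # bulk-merge callback into the Web DB
            self.enqueue_for_crawl = enqueue_for_crawl
            self.known = set()
            self.batch_size = batch_size

        def record_outlink(self, src, dst, anchor):
            self.pending.append((dst, src, anchor))
            if dst not in self.known:
                self.known.add(dst)
                # Short-circuit: new URL is crawlable right away, bypassing
                # full prioritization (hence the quality problems above).
                self.enqueue_for_crawl(dst)
            if len(self.pending) >= self.batch_size:
                self.apply_batch(self.pending)  # amortize update bandwidth
                self.pending = []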

Page 14: DHTML: the enemy of crawling

• Increasing use of client-side scripting (aka DHTML) is making more of the Web opaque to crawlers
  – AJAX: Asynchronous JavaScript and XML
• (The end of crawling?)
• Not (yet) a major barrier to Web search, but is a barrier to shopping and other specialized search, where we also have to deal with:
  – Form-filling and sessions
  – Information extraction

Page 15: Conclusions

• Large-scale Web crawling not trivial
• Smart, well-funded people could figure it out from the literature
• But secret sauce remains in:
  – Prioritization
  – Scaling the Web DB
  – JavaScript, form-filling, information extraction

Page 16: The future

• Will life get easier?
  – Ping plus feeds
• Will life get harder?
  – DHTML → Ajax → Avalon
• A little bit of both?
  – Publishers regain control
  – But, net, comprehensiveness improves