
Page 1: Advanced Web Crawling

CS6913: Web Search Engines, CSE Department

NYU Tandon School of Engineering

Advanced Web Crawling

Page 2: Advanced Web Crawling

Advanced Web Crawling:

• High-Performance Crawling Systems

• Targeted/Focused Crawling

• Recrawling

• Random Walks

• Crawling Architecture and Challenges in Large Engines

Page 3: Advanced Web Crawling

High-Performance Crawling Systems
• Large engines crawl billions of pages per day

• On hundreds to thousands of machines
• Each machine crawls tens of millions of pages per day
• E.g., 1000 pages/sec means about 86 million pages/day
• 1000 pages/sec == 20-100 MB/s of network traffic

• How to build such systems
• Scaling networking and I/O
• Scaling the data structures
• Scaling over many machines
• Scaling the crawling strategy
• Robustness and manageability
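
To make these throughput numbers concrete, the back-of-the-envelope arithmetic is (assuming an average page size of roughly 20-100 KB, which the bandwidth bullet implies rather than states):

1000 pages/sec × 86,400 sec/day = 86.4 million pages/day
1000 pages/sec × 20-100 KB/page ≈ 20-100 MB/s of network traffic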

Page 4: Advanced Web Crawling

Crawling System vs. Crawling Strategy
• Crawling Strategy: “what to crawl next”

• Which URLs should be crawled?
• Based on topic? (focused crawling)
• Based on likelihood of change? (recrawling)
• Based on importance of page?
• (Obeying robot exclusion and courtesy constraints?)

• Crawling System: “fast system for fetching stuff”
• As simple as possible
• Optimized for pages/second
• Enforces robot exclusion policies and courtesy constraints (?)

Page 5: Advanced Web Crawling

Crawling System vs. Crawling Strategy

Page 6: Advanced Web Crawling

Crawling System vs. Crawling Strategy
• Two components

• Crawling application implements crawling strategy

• More or less integrated based on:
• Complexity of crawling strategy
• Degree of agility/maneuverability required

• Example #1: BFS crawling or recrawling in a large engine:
• Use map-reduce jobs in the DA (data analysis) architecture to determine what to crawl next
• Send files with millions of URLs to the crawling system

• Example #2: smaller, agile focused crawler
• May decide what page to crawl next based on the last few pages

Page 7: Advanced Web Crawling

Scaling Networking and I/O
• Problem: fetching a page “takes” 0.1 to 2s per page

• Round-trip latency over the internet
• Slow responses from web servers
• Tail: some servers are really slow → do not wait for a 30s timeout!

• Solution: need to open many concurrent connections
• Thousands of connections at a time
• 1000 pages/sec/node requires 1000s of connections
• Unresponsive servers may be handled with a backoff scheme

(retry in 1 hour, 1 day, 1 week, etc.)

• Other bottlenecks:
• I/O to store pages, DNS lookups, robot fetches
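
Below is a minimal sketch of an event-based fetcher that opens many concurrent connections, using Python's asyncio together with the third-party aiohttp library; the connection limit, the 10-second timeout, and the backoff comment are illustrative assumptions rather than values from the slides.

import asyncio
import aiohttp

async def fetch_one(session, sem, url):
    # A semaphore caps the number of simultaneously open connections.
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                body = await resp.read()
                return url, resp.status, body
        except Exception:
            # A real crawler would record the failure and retry later
            # with a backoff schedule (e.g., 1 hour, 1 day, 1 week).
            return url, None, None

async def fetch_all(urls, max_connections=1000):
    sem = asyncio.Semaphore(max_connections)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, sem, u) for u in urls]
        return await asyncio.gather(*tasks)

# Example: results = asyncio.run(fetch_all(["https://example.com/"]))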

Page 8: Advanced Web Crawling

Scaling Networking and I/O
• Basic networking architectures:

• Thread-based: one thread per connection
• Process-based: one process per connection
• Event-based: each process handling many connections
• Hybrid approaches

• Compare to web server architectures

• Storing pages: append to large files
• DNS: cache locally
• Robot exclusion: request robots.txt like any other page
• But robots.txt needs to be recrawled frequently (why?)
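
Below is a minimal sketch of robots.txt handling, assuming robots.txt files are fetched and parsed with Python's standard urllib.robotparser and re-fetched after a fixed expiry; the one-day TTL and the in-memory cache layout are illustrative assumptions.

import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 3600      # re-fetch robots.txt after one day (illustrative)
_robots_cache = {}          # host -> (fetch_time, parser)

def allowed(url, user_agent="MyCrawler"):
    host = urlsplit(url).netloc
    entry = _robots_cache.get(host)
    if entry is None or time.time() - entry[0] > ROBOTS_TTL:
        rp = RobotFileParser(f"http://{host}/robots.txt")
        rp.read()                      # robots.txt is fetched like any other page
        _robots_cache[host] = (time.time(), rp)
        entry = _robots_cache[host]
    return entry[1].can_fetch(user_agent, url)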

Page 9: Advanced Web Crawling

Scaling the Data Structures

• Problem: how to keep track of pages already crawled?

• In decoupled case, use map-reduce

• Otherwise, URLs are large, will run out of memory

• BFS: After 10M pages, we have 100M in the queue

• 75 bytes/URL = 7.5GB

• So we cannot just stick this into standard data structs

Page 10: Advanced Web Crawling

Scaling the Data Structures: Solutions (BFS)

• Distribute DS over many nodes (e.g., hash based on host)

• Compress URLs (prefix compression, lz4 on blocks)

• Use Bloom filters (special lossy data struct for membership)

• Lazy filtering (periodic batch filtering of queue, e.g. in DA)

• Caching of recently crawled URLs in memory

• Real systems may use combinations of these

• Check URL before insert into Q or before download?

Page 11: Advanced Web Crawling

Scaling the Data Structures: Solutions

• Distribute over many crawling nodes:

• Hash URLs to nodes?

• Or better, hash sites to nodes?

• Or something more complex (split very large sites)

• Nodes forward URLs they parse to the relevant other node
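
Below is a minimal sketch of assigning sites to crawling nodes by hashing the host; the choice of MD5 and the plain modulo assignment are illustrative assumptions (a real system might use consistent hashing or split very large sites).

import hashlib
from urllib.parse import urlsplit

def node_for_url(url, num_nodes):
    # Hash the host (site) rather than the full URL, so that one node
    # owns a whole site and can enforce per-site courtesy limits.
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# A node that parses a new URL forwards it to node_for_url(url, num_nodes).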

Page 14: Advanced Web Crawling

Scaling the Data Structures: Bloom Filters
• A Bloom filter is a data structure with two operations:

• Insert(x) inserts an item into the BF
• Exists(x) checks if x was previously inserted in the BF
• Exists(x) returns one of two answers: no or probably yes

• A BF can have false positives, but no false negatives
• Goal: minimize the false positive rate given limited space
• BFs achieve a good trade-off between space and fpr
• BFs are based on hashing, can be used on any data type
• A BF uses several hash functions!

Page 17: Advanced Web Crawling

How to Implement Bloom Filters
• To implement Bloom filters, we need:

• An array A[0 ... m-1] of m bits, initially zeroed out
• k random hash functions h_1 to h_k, each mapping to [0 ... m-1]

• Insert(x): FOR i=1 TO k { A[h_i(x)] ← 1 }

• Exists(x): FOR i=1 TO k { IF A[h_i(x)] == 0 RETURN(no) }
            RETURN(probably yes)

• After inserting n items, what is the false positive rate?
• That is, what is the probability of getting probably yes as the answer to Exists(x) even if x was not inserted?
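
Below is a minimal Python sketch of the structure above, using double hashing derived from a single SHA-256 digest to simulate the k random hash functions; the hashing scheme, class name, and example parameters are illustrative assumptions.

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)   # A[0 ... m-1], initially zeroed out

    def _positions(self, x):
        # Derive h_1(x) ... h_k(x) from one digest via double hashing:
        # h_i(x) = (h1 + i * h2) mod m.
        d = hashlib.sha256(x.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):
            self.bits[p // 8] |= 1 << (p % 8)

    def exists(self, x):
        # "no" is certain; True only means "probably yes".
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(x))

bf = BloomFilter(m=8 * 10_000_000, k=6)        # m/n = 8 for 10 million URLs
bf.insert("http://example.com/page1")
print(bf.exists("http://example.com/page1"))   # True (probably yes)
print(bf.exists("http://example.com/page2"))   # False (no), barring a rare false positive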

Page 21: Advanced Web Crawling

Analyzing Bloom Filters
• After inserting n items into a BF of m bits with k hashes
• Focus on one bit A[j] in the BF:

Prob[A[j] == 0] = (1 - 1/m)^(kn)

• Using (1 - 1/m)^m → e^(-1) as m → ∞, this is roughly e^(-kn/m)

• The probability of a false positive for an uninserted x is the probability that all k tested bits are set to 1, which is

(1 - e^(-kn/m))^k

Page 22: Advanced Web Crawling

Analyzing Bloom Filters
• We build a BF for a maximum number of items n
• Formula: trade-off between m/n, k, and the false positive rate
• Increasing m and thus m/n lowers the fpr
• But need to also increase k with m/n to get the best fpr
• Example values (fpr = false positive rate):

    m/n    k    fpr
      8    6    0.0215
     16   11    0.000458

• Achieves a low fpr for a reasonable space budget
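
As a quick check of the table, the sketch below evaluates the approximation (1 - e^(-kn/m))^k from the previous slide; the exact rates differ very slightly from this approximation.

import math

def fpr(m_over_n, k):
    # Approximate false positive rate after inserting n items into m bits,
    # expressed in terms of the bits-per-item ratio m/n.
    return (1 - math.exp(-k / m_over_n)) ** k

print(f"{fpr(8, 6):.4f}")     # ~0.0216, close to the 0.0215 table entry
print(f"{fpr(16, 11):.6f}")   # ~0.000459, close to the 0.000458 table entry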

Page 23: Advanced Web Crawling

Robustness and Manageability
• Courtesy: limit the speed of crawling for sites

• e.g., at most one page every 30s
• or wait 10 times the response time
• or depending on the site
• but what is a site? Or a domain?
• give contact info

• Configurable while running
• E.g., must have a blacklist to exclude domains
• Control panel to see and control speed, 404s, etc.

• Checkpointing and recovery

Page 24: Advanced Web Crawling

Robustness and Manageability
• Example: implementing crawl speed via reordering

(as discussed for HW#1, using priority queues)
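
Below is a minimal sketch of such reordering: a priority queue keyed by the earliest time each host may be contacted again. The fixed 30-second per-host delay and the data layout are illustrative assumptions, not the HW#1 specification.

import heapq
import time
from collections import deque
from urllib.parse import urlsplit

CRAWL_DELAY = 30.0          # e.g., at most one page every 30s per host

ready = []                  # min-heap of (next_allowed_time, host)
per_host = {}               # host -> deque of pending URLs

def enqueue(url):
    host = urlsplit(url).netloc
    if host not in per_host:
        per_host[host] = deque()
        heapq.heappush(ready, (time.time(), host))
    per_host[host].append(url)

def next_url():
    # Pop the host whose courtesy delay expires soonest; if next_time is
    # still in the future, the caller sleeps until then before fetching.
    next_time, host = heapq.heappop(ready)
    url = per_host[host].popleft()
    if per_host[host]:
        heapq.heappush(ready, (max(next_time, time.time()) + CRAWL_DELAY, host))
    else:
        del per_host[host]
    return next_time, url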

Page 25: Advanced Web Crawling

High-Performance Crawling Systems
Industry:
• Googlebot and crawlers for other large engines
• Crawlers to monitor the web for certain content
• Monitor what kind of content?

Research Projects:
• Mercator/Atrax at DEC Labs (1997/2002)
• Polybot at Poly/NYU (2002)
• UbiCrawler at Polit. di Milano (2004)
• IRLBOT at Texas A&M (2009)
• BubiNG at Polit. di Milano (2018)

Page 26: Advanced Web Crawling

Focused Crawling
• Problem: crawl only certain types of pages
• E.g., topics such as sports, law, history
• Or news, languages, extremist content, copyright violations, local area, bulletin boards, stores
• How to specify a topic: classifier, maybe a query
• How to evaluate performance: harvest rate
• How to decide what to crawl next:

• A machine-learning problem
• Crawler can learn “on the job”
• Assumption: after crawling, we know if a page is relevant

Page 27: Advanced Web Crawling

Focused Crawling Strategies
• Depends on how “connected” a topic is
• Small topics may not be well connected (or they may be)

• And on whether topics want to be found
• And on the structure of sites (e.g., store/bulletin board navigation)

• For now, assume the simple case:
• well-connected, unstructured pages, no attempt to hide

• Relevance of a page: determined after downloading it
• Promise of a page: estimate of relevance before download
• Simple greedy focused crawler: download the most promising page next
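
Below is a minimal sketch of the simple greedy focused crawler; the promise() scoring function and the fetch_and_classify() step are left abstract and are illustrative placeholders, not part of the slides.

import heapq

def greedy_focused_crawl(seeds, promise, fetch_and_classify, budget):
    # Max-priority frontier: heapq is a min-heap, so scores are negated.
    frontier = [(-1.0, url) for url in seeds]     # seeds assumed fully promising
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = crawled = 0

    while frontier and crawled < budget:
        neg_score, url = heapq.heappop(frontier)
        is_relevant, outlinks = fetch_and_classify(url)   # relevance known after download
        crawled += 1
        relevant += int(is_relevant)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-promise(link, url, is_relevant), link))

    return relevant / max(crawled, 1)   # harvest rate of this run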

Page 28: Advanced Web Crawling

Estimating Page Promise
• A machine-learning problem
• Features: link structure, URLs, context of links
• High promise: many links from relevant pages
• Or in the same site (or subdirectory) as other relevant pages
• Or the context of links is relevant
• Idea #1: run link analysis (HITS) across the crawl frontier
• Idea #2: Master-apprentice approach (Chakrabarti et al. 2002)

• Idea #3: learning with future reward

Page 29: Advanced Web Crawling

Evaluating a Focused Crawler
• Harvest rate: number of relevant pages / number of pages crawled
• Start pages should be relevant if possible
• An unfocused (BFS) crawler will soon lose focus
• A focused crawler should maintain a high harvest rate
• But greedy may not be best

Page 30: Advanced Web Crawling

Recrawling Strategies
• Problem: BFS is only good for the initial crawl
• Afterwards, need to maintain the index more efficiently
• Some pages change more frequently than others
• Some pages are more important than others
• Some changes are more significant than others
• And also, discover new pages, not just changed ones
• Simplified problem: given a limited amount of crawl resources, try to maximize the freshness of the index
• Omitted: new pages, degree of change, importance of page

Page 31: Advanced Web Crawling

Estimating Change Probabilities
• Suppose you have crawled a page n times
• Sometimes it changed, sometimes it did not
• Match to a Poisson process model

Page 32: Advanced Web Crawling

Estimating Change Probabilities
• What Poisson parameter best matches observations?
• But we may have irregular observation points
• One parameter for every page (or group of pages)
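
Below is a minimal sketch for the simple case of equally spaced observations: assuming changes follow a Poisson process with rate lambda, the probability of observing a change between two visits delta_t apart is 1 - e^(-lambda * delta_t), and the estimator just inverts that formula. This is an illustrative simplification; it ignores irregular observation points and the bias-corrected estimators from the literature.

import math

def estimate_lambda(num_visits, num_changes, delta_t):
    # Fraction of visits at which the page had changed since the previous visit.
    p_change = num_changes / num_visits
    if p_change >= 1.0:
        # The page changed every time we looked; lambda is only bounded from below.
        return float("inf")
    # Invert P(change within delta_t) = 1 - exp(-lambda * delta_t).
    return -math.log(1.0 - p_change) / delta_t

# Example: a page crawled daily 30 times, found changed on 12 of those visits.
lam = estimate_lambda(30, 12, delta_t=1.0)   # about 0.51 changes per day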

Page 33: Advanced Web Crawling

Recrawl Strategy
• How should we use change probability to recrawl?
• Always crawl the page with the highest probability of change next?
• This is not optimal! (Why?)
• Do not crawl the highest-changing pages at all!

• Other approaches: maintain k classes of pages
• Model how important a change is w.r.t. queries
• How to find new pages?

Page 34: Advanced Web Crawling

Summary: Recrawling
• An optimized system for recrawling should allocate crawling resources according to:
• how often each page or site changes
• how important the page or site is
• how likely it is that a change will impact search ranking
• how to best discover newly created pages

• Some challenges in implementation involve:
• identifying templates, boilerplate, and spam
• replicated content and URLs (a “new” URL does not mean new content)
• keeping track of stats efficiently (maybe using the data mining platform)

Page 35: Advanced Web Crawling

Crawling Random Pages
• How to estimate the % of all pages with some property

• Say, pages in Chinese, pages using JavaScript, etc.
• Idea: estimate using a sample of web pages
• But how to get a random sample from the web fast?
• Idea #1: start somewhere, follow random links for some time, and then you are at a random page
• This does not work! (Why?)
• So we need something better …

Page 36: Advanced Web Crawling

Crawling in Large Search Engines

• The role of the data mining architecture

• Browsers and page rendering

Page 37: Advanced Web Crawling

Crawling in Large Search Engines

• Recall crawling application versus crawling system

Page 38: Advanced Web Crawling

Crawling in Large Search Engines

• Crawling application often implemented in the data analysis architecture (e.g., inside MapReduce)

• A periodic analysis job outputs the URLs to be crawled next

• These URLs are then distributed to the nodes of the crawling system for download

• Simplifies the crawler architecture but requires a powerful data analysis architecture

• General trend to move functionality into DA system

Page 39: Advanced Web Crawling

Crawling in Large Search Engines

• Sites may send different pages to different browsers

• Search engine crawlers may request pages for several browsers by using different user agents

• Crawlers may execute JavaScript on pages

• Crawlers may render pages using rendering engines
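
Below is a minimal sketch of requesting the same page with different user-agent strings, using Python's standard urllib; the user-agent strings and URL are illustrative assumptions.

import urllib.request

def fetch_as(url, user_agent):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

url = "https://example.com/"
desktop_page = fetch_as(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
mobile_page = fetch_as(url, "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)")
# Comparing the two responses shows whether the site serves
# different pages to different browsers.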