
CS6913: Web Search Engines
CSE Department

NYU Tandon School of Engineering

Advanced Web Crawling

Advanced Web Crawling:

• High-Performance Crawling Systems

• Targeted/Focused Crawling

• Recrawling

• Random Walks

• Crawling Architectures and Challenges in Large Engines

High-Performance Crawling Systems
• Large engines crawl billions of pages per day
  • On hundreds to thousands of machines
  • Each machine crawls tens of millions of pages per day
  • E.g., 1000 pages/sec means about 86 million pages/day
  • 1000 pages/sec == 20-100 MB/s network traffic
• How to build such systems
  • Scaling networking and I/O
  • Scaling the data structures
  • Scaling over many machines
  • Scaling the crawling strategy
  • Robustness and manageability

Crawling System vs. Crawling Strategy
• Crawling Strategy: "what to crawl next"
  • Which URLs should be crawled?
  • Based on topic? (focused crawling)
  • Based on likelihood of change? (recrawling)
  • Based on importance of page?
  • (Obeying robot exclusion and courtesy constraints?)
• Crawling System: "fast system for fetching stuff"
  • As simple as possible
  • Optimized for pages/second
  • Enforces robot exclusion policies and courtesy constraints (?)


Crawling System vs. Crawling Strategy
• Two components:
  • Crawling application implements the crawling strategy
  • Crawling system fetches the pages
• More or less integrated based on:
  • Complexity of the crawling strategy
  • Degree of agility/maneuverability required
• Example #1: BFS crawling or recrawling in a large engine:
  • Use map-reduce jobs in the data analysis (DA) architecture to determine what to crawl next
  • Send files with millions of URLs to the crawling system
• Example #2: smaller agile focused crawler
  • May decide what page to crawl next based on the last few pages

Scaling Networking and I/O
• Problem: fetching a page "takes" 0.1 to 2s per page
  • Round-trip latency over the internet
  • Slow responses from web servers
  • Tail: some servers are really slow → do not wait for a 30s timeout!
• Solution: need to open many concurrent connections
  • Thousands of connections at a time
  • 1000 pages/sec/node requires 1000s of connections
  • Unresponsive servers may be handled with a backoff scheme (retry in 1 hour, 1 day, 1 week, etc.)
• Other bottlenecks:
  • I/O to store pages, DNS lookups, robot fetches

Scaling Networking and I/O
• Basic networking architectures:
  • Thread-based: one thread per connection
  • Process-based: one process per connection
  • Event-based: each process handles many connections (see the sketch below)
  • Hybrid approaches
• Compare to web server architectures
• Storing pages: append to large files
• DNS: cache lookups locally
• Robot exclusion: request robots.txt like any other page
  • But robots.txt needs to be recrawled frequently (why?)
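The event-based approach can be illustrated with a minimal sketch in Python using asyncio and the third-party aiohttp library (a library choice assumed here, not prescribed by the slides); the URLs, timeout, and concurrency limit are placeholder values.

```python
# Minimal sketch of an event-based fetcher: one process drives many open
# connections, and a short timeout avoids waiting on very slow servers.
import asyncio
import aiohttp

MAX_CONCURRENT = 1000      # thousands of open connections per node
FETCH_TIMEOUT = 10         # seconds; well below a 30s worst-case timeout

async def fetch(session, sem, url):
    async with sem:                       # cap the number of concurrent connections
        try:
            timeout = aiohttp.ClientTimeout(total=FETCH_TIMEOUT)
            async with session.get(url, timeout=timeout) as resp:
                body = await resp.read()
                return url, resp.status, body
        except (asyncio.TimeoutError, aiohttp.ClientError):
            return url, None, None        # caller can schedule a backoff retry (1h, 1d, ...)

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, u) for u in urls]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    seed = ["https://example.com/", "https://example.org/"]   # placeholder URLs
    for url, status, body in asyncio.run(crawl(seed)):
        print(url, status, len(body) if body else "failed")
```

In a full crawler the same event loop would also handle DNS caching, robots.txt fetches, and appending pages to large output files.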

Scaling the Data Structures

• Problem: how to keep track of pages already crawled?

• In decoupled case, use map-reduce

• Otherwise, URLs are large, will run out of memory

• BFS: After 10M pages, we have 100M in the queue

• 75 bytes/URL = 7.5GB

• So we cannot just stick this into standard data structs

Scaling the Data Structures: Solutions (BFS)

• Distribute DS over many nodes (e.g., hash based on host)

• Compress URLs (prefix compression, lz4 on blocks; see the sketch below)

• Use Bloom filters (special lossy data struct for membership)

• Lazy filtering (periodic batch filtering of queue, e.g. in DA)

• Caching of recently crawled URLs in memory

• Real systems may use combinations of these

• Check URL before insert into Q or before download?
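A rough sketch of prefix compression (front-coding) over a sorted block of URLs, one way to shrink per-URL storage; the block contents and function names are illustrative, and a real system might additionally compress the encoded blocks with lz4.

```python
# Front-coding a sorted block of URLs: each entry stores only the length of
# the prefix shared with the previous URL plus the differing suffix.
import os

def prefix_compress(sorted_urls):
    out, prev = [], ""
    for u in sorted_urls:
        common = len(os.path.commonprefix([prev, u]))
        out.append((common, u[common:]))   # (shared-prefix length, suffix)
        prev = u
    return out

def prefix_decompress(entries):
    urls, prev = [], ""
    for common, suffix in entries:
        u = prev[:common] + suffix
        urls.append(u)
        prev = u
    return urls

block = sorted([
    "http://example.com/a.html",
    "http://example.com/a/b.html",
    "http://example.com/index.html",
])
compressed = prefix_compress(block)
assert prefix_decompress(compressed) == block
```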

Scaling the Data Structures: Solutions

• Distribute over many crawling nodes:

• Hash URLs to nodes?

• Or better, hash sites to nodes? (see the sketch below)

• Or something more complex (split very large sites)

• Nodes forward URLs they parse to the relevant other node
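A small sketch, under illustrative assumptions (node count, hash choice), of hashing the host rather than the full URL, so that all pages of a site are handled by the same node (which also simplifies courtesy control and DNS caching):

```python
# Assign each URL to a crawler node by hashing its host.
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 16   # illustrative cluster size

def node_for_url(url, num_nodes=NUM_NODES):
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# A node that parses a link it is not responsible for forwards it:
my_node_id = 3
url = "http://example.com/some/page.html"
target = node_for_url(url)
if target != my_node_id:
    pass  # send url to node `target`, e.g., append it to an outgoing batch
```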

Scaling the Data Structures: Bloom Filters
• A Bloom filter is a data structure with two operations:
  • Insert(x) inserts an item into the BF
  • Exists(x) checks whether x was previously inserted into the BF
  • Exists(x) returns one of two answers: no or probably yes
• A BF can have false positives, but no false negatives
• Goal: minimize the false positive rate given limited space
• BFs achieve a good trade-off between space and fpr
• BFs are based on hashing and can be used on any data type
• A BF uses several hash functions!

How to Implement Bloom Filters
• To implement Bloom filters, we need:
  • An array A[0 ... m-1] of m bits, initially zeroed out
  • k random hash functions h1 to hk, each mapping to [0 ... m-1]
• Insert(x): FOR i=1 TO k { A[hi(x)] ← 1 }
• Exists(x): FOR i=1 TO k { if A[hi(x)] == 0 RETURN(no) }; RETURN(probably yes)
• After inserting n items, what is the false positive rate?
  • That is, what is the probability of getting "probably yes" as the answer to Exists(x) even if x was not inserted?
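The pseudocode above translates almost directly into Python. In this minimal sketch the k hash functions are derived from one MD5 digest via double hashing, which is an implementation choice of the sketch, not part of the slides.

```python
# Straightforward Bloom filter: an m-bit array and k hash functions.
# exists() may return false positives, but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _hashes(self, x):
        # Derive k hash values from two halves of an MD5 digest (double hashing).
        d = hashlib.md5(x.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for h in self._hashes(x):
            self.bits[h // 8] |= 1 << (h % 8)

    def exists(self, x):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(x))

bf = BloomFilter(m=8 * 1_000_000, k=6)        # m/n = 8 bits per URL for 1M URLs
bf.insert("http://example.com/")
assert bf.exists("http://example.com/")        # always true after insert
print(bf.exists("http://example.com/other"))   # usually False; a false positive is possible
```

The choice of m and k relative to the expected number of items n determines the false positive rate, which is exactly what the following analysis quantifies.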

Analyzing Bloom Filters
• After inserting n items into a BF of m bits with k hashes
• Focus on one bit A[j] in the BF:

  prob[A[j] == 0] = (1 - 1/m)^(kn)

• Using (1 - 1/m)^m → e^(-1) for m → infinity, this is roughly e^(-kn/m)
• The probability of a false positive for an uninserted x is the probability that all k tested bits are set to 1, which is

  (1 - e^(-kn/m))^k

Analyzing Bloom Filters
• We build a BF for a maximum number of items n
• Formula: trade-off between m/n, k, and the false positive rate
• Increasing m and thus m/n lowers the fpr
• But we also need to increase k with m/n to get the best fpr
• Example values:

    m/n    k     fpr
      8    6     0.0215
     16   11     0.000458

• Achieves a low fpr for a reasonable space budget (fpr = false positive rate)
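A quick numeric check of the approximation (1 - e^(-kn/m))^k against the example values in the table:

```python
# Evaluate the false positive rate formula for the example table values.
from math import exp

def fpr(bits_per_item, k):
    # bits_per_item is m/n; the formula uses the approximation (1 - e^(-kn/m))^k
    return (1.0 - exp(-k / bits_per_item)) ** k

print(fpr(8, 6))     # ≈ 0.0215
print(fpr(16, 11))   # ≈ 0.00046
```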

Robustness and Manageability
• Courtesy: limit the speed of crawling for sites
  • E.g., at most one page every 30s
  • Or wait 10 times the response time
  • Or depending on the site
  • But what is a site? Or a domain?
  • Give contact info
• Configurable while running
  • E.g., must have a blacklist to exclude domains
  • Control panel to see and control speed, 404s, etc.

• Checkpointing and recovery

Robustness and Manageability
• Example: implementing crawl speed limits via reordering (as discussed for HW#1 using priority queues); see the sketch below
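A sketch of the reordering idea using a priority queue keyed by the earliest time a host may be contacted again; the 30-second delay is the example value from the previous slide, the helper names are made up, and this is not the HW#1 reference solution.

```python
# Per-site courtesy via reordering: the frontier is a heap of (ready_time, url),
# so URLs whose host was contacted recently sink until their delay has passed.
import heapq
import time
from urllib.parse import urlsplit

CRAWL_DELAY = 30.0          # at most one page per site every 30 seconds

next_allowed = {}           # host -> earliest time it may be fetched again
frontier = []               # heap of (ready_time, url)

def schedule(url):
    host = urlsplit(url).hostname or ""
    heapq.heappush(frontier, (next_allowed.get(host, 0.0), url))

def pop_next():
    while True:
        ready, url = heapq.heappop(frontier)
        host = urlsplit(url).hostname or ""
        allowed = next_allowed.get(host, 0.0)
        if ready < allowed:                      # stale entry: host was fetched since
            heapq.heappush(frontier, (allowed, url))
            continue
        now = time.time()
        if ready > now:
            time.sleep(ready - now)              # simplest version; a real crawler would
                                                 # pick a URL from another host instead
        next_allowed[host] = time.time() + CRAWL_DELAY
        return url
```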

High-Performance Crawling Systems
Industry:
• Googlebot and crawlers for other large engines
• Crawlers to monitor the web for certain content
• Monitor what kind of content?

Research Projects:
• Mercator/Atrax at DEC Labs (1997/2002)
• Polybot at Poly/NYU (2002)
• UbiCrawler at Polit. di Milano (2004)
• IRLBOT at Texas A&M (2009)
• BUbiNG at Polit. di Milano (2018)

Focused Crawling
• Problem: crawl only certain types of pages
  • E.g., topics such as sports, law, history
  • Or news, languages, extremist content, copyright violations, local area, bulletin boards, stores
• How to specify a topic: a classifier, maybe a query
• How to evaluate performance: harvest rate
• How to decide what to crawl next:
  • A machine-learning problem
  • The crawler can learn "on the job"
  • Assumption: after crawling, we know if a page is relevant

Focused Crawling Strategies
• Depends on how "connected" a topic is
  • Small topics may not be well connected (or they may be)
• And on whether topics want to be found
• And on the structure of sites (e.g., store/bulletin board navigation)
• For now, assume the simple case:
  • Well-connected, unstructured pages, no attempt to hide
• Relevance of a page: determined after downloading it
• Promise of a page: estimate of relevance before downloading
• Simple greedy FC: download the most promising page next

Estimating Page Promise
• A machine-learning problem
• Features: link structure, URLs, context of links
• High promise: many links from relevant pages
  • Or in the same site (or subdirectory) as other relevant pages
  • Or the context of the links is relevant
• Idea #1: run link analysis (HITS) across the crawl frontier
• Idea #2: master-apprentice approach (Chakrabarti et al. 2002)
• Idea #3: learning with future reward
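As a toy illustration only (not the master-apprentice method), one could score a frontier URL by the fraction of its known in-links that come from relevant pages, plus a crude match of topic words against the URL and anchor text; the topic words and weights below are made up.

```python
# Toy promise score for a frontier URL: link evidence plus URL/anchor text match.
TOPIC_WORDS = {"sports", "football", "league"}   # hypothetical topic terms

def promise(inlinks_relevant, inlinks_total, url, anchor_texts):
    link_score = inlinks_relevant / inlinks_total if inlinks_total else 0.0
    text = (url + " " + " ".join(anchor_texts)).lower()
    text_score = sum(1 for w in TOPIC_WORDS if w in text) / len(TOPIC_WORDS)
    return 0.7 * link_score + 0.3 * text_score   # arbitrary weights

print(promise(3, 4, "http://example.com/sports/scores", ["football scores"]))
```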

Evaluating a Focused Crawler
• Harvest rate: number of relevant pages / number of pages crawled
• Start pages should be relevant if possible
• An unfocused (BFS) crawler will soon lose focus
• A focused crawler should maintain a high harvest rate
• But greedy may not be best

Recrawling Strategies
• Problem: BFS is only good for the initial crawl
• Afterwards, need to maintain the index more efficiently
  • Some pages change more frequently than others
  • Some pages are more important than others
  • Some changes are more significant than others
  • And also, discover new pages, not just changed ones
• Simplified problem: given a limited amount of crawl resources, try to maximize the freshness of the index
  • Omitted: new pages, degree of change, importance of page

Estimating Change Probabilities
• Suppose you have crawled a page n times
• Sometimes it changed, sometimes it did not
• Match to a Poisson process model
• What Poisson parameter best matches the observations?
• But we may have irregular observation points
• One parameter for every page (or group of pages)
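A sketch of fitting a per-page Poisson rate under the simplifying assumption of (roughly) regular revisit intervals, where we only observe whether the page changed since the last visit; irregular intervals would require numerical fitting, and the function names and example numbers are illustrative.

```python
# Under a Poisson model with rate lambda, the probability of observing a change
# in an interval of length I is 1 - exp(-lambda * I), so from n visits of which
# X showed a change the maximum-likelihood estimate is lambda = -ln(1 - X/n) / I.
from math import log, exp

def estimate_lambda(n_visits, n_changed, interval_days):
    if n_changed == 0:
        return 0.0
    if n_changed == n_visits:
        n_changed = n_visits - 0.5   # crude smoothing: "always changed" would give infinity
    return -log(1.0 - n_changed / n_visits) / interval_days

lam = estimate_lambda(20, 8, 7.0)    # revisited weekly 20 times, changed on 8 visits
print(lam)                           # estimated changes per day
print(1.0 - exp(-lam * 7.0))         # ≈ 8/20: implied probability of a change within a week
```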

Recrawl Strategy
• How should we use the change probability to recrawl?
• Always crawl the page with the highest probability of change next?
  • This is not optimal! (Why?)
  • Do not crawl the highest-changing pages at all!
• Other approaches: maintain k classes of pages
• Model how important a change is w.r.t. queries
• How to find new pages?

Summary: Recrawling
• An optimized system for recrawling should allocate crawling resources according to:
  • How often each page or site changes
  • How important the page or site is
  • How likely it is that a change will impact search ranking
  • How to best discover newly created pages
• Some challenges in implementation involve:
  • Identifying templates, boilerplate, and spam
  • Replicated content and URLs ("new" URL does not mean new content)
  • Keeping track of stats efficiently (maybe using the data mining platform)

Crawling Random Pages
• How to estimate the % of all pages with some property
  • Say, pages in Chinese, pages using Javascript, etc.
• Idea: estimate using a sample of web pages
• But how do we get a random sample from the web fast?
• Idea #1: start somewhere, follow random links for some time, and then you are at a random page
  • This does not work! (Why?)
• So we need something better ...

Crawling in Large Search Engines

• The role of the data mining architecture

• Browsers and page rendering

Crawling in Large Search Engines

• Recall crawling application versus crawling system

Crawling in Large Search Engines

• Crawling application often implemented in the data analysis architecture (e.g., inside MapReduce)

• Periodic analysis job outputs the URLs to be crawled next

• These URLs are then distributed to the nodes of the crawling system for download

• Simplifies the crawler architecture but requires a powerful data analysis architecture

• General trend to move functionality into DA system
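A highly simplified, single-machine stand-in for such a periodic "what to crawl next" job; in a real engine this would be a MapReduce/DA job over billions of URLs, and the file names, node count, and seen-set representation here are made up.

```python
# Simplified periodic job: drop URLs that were already crawled, then partition
# the remaining candidates into one batch file per crawler node.
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4    # illustrative

def node_for(url):
    host = urlsplit(url).hostname or ""
    return int(hashlib.md5(host.encode("utf-8")).hexdigest(), 16) % NUM_NODES

def emit_crawl_batches(candidate_urls, already_crawled):
    batches = {i: [] for i in range(NUM_NODES)}
    for url in candidate_urls:
        if url not in already_crawled:        # in practice: Bloom filter or sorted merge
            batches[node_for(url)].append(url)
    for node_id, urls in batches.items():
        with open(f"crawl_batch_node{node_id}.txt", "w") as f:   # hypothetical file names
            f.write("\n".join(urls))
```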

Crawling in Large Search Engines

• Sites may send different pages to different browsers

• Search engine crawlers may request pages for several browsers by using different user agents

• Crawlers may execute Javascript on pages

• Crawlers may render pages using rendering engines
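A minimal sketch of fetching the same URL under different user-agent strings, using the third-party requests library; the UA strings are abbreviated examples, and actual Javascript execution or rendering would require a separate headless-browser component.

```python
# Fetch the same page with different User-Agent headers to see whether the
# site serves different content to different browsers (assumes `requests`).
import requests

USER_AGENTS = {                                   # abbreviated example strings
    "desktop": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "mobile":  "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)",
    "bot":     "ExampleBot/1.0 (+http://example.com/bot.html)",
}

url = "http://example.com/"                       # placeholder URL
for name, ua in USER_AGENTS.items():
    resp = requests.get(url, headers={"User-Agent": ua}, timeout=10)
    print(name, resp.status_code, len(resp.text))
```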
