crawling the web for a search engine or why crawling is cool
TRANSCRIPT
Crawling The Web For a Search Engine, Or Why Crawling is Cool
Talk Outline
What is a crawler?
Some of the interesting problems
RankMass Crawler
As time permits: Refresh Policies, Duplicate Detection
What is a Crawler?
[Diagram: the basic crawler loop. Initialize the "to visit" URL queue with the initial URLs; repeatedly get the next URL, get the page, extract its URLs, add unseen ones to the "to visit" set, move the fetched URL to the "visited" set, and store the web page.]
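The loop in the diagram maps almost directly to code. Below is a minimal sketch in Python (not from the talk); fetch_page, extract_urls, the seed list, and the page limit are assumptions for illustration.

from collections import deque

def crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
    """Minimal crawler loop: seed the queue, fetch, extract links, enqueue new URLs."""
    to_visit = deque(seed_urls)          # "to visit urls"
    visited = set()                      # "visited urls"
    pages = {}                           # stored "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)           # get page
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages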
Applications
Internet Search Engines: Google, Yahoo, MSN, Ask
Comparison Shopping Services: shopping sites
Data mining: Stanford WebBase, IBM WebFountain
Is that it?
Not quite
Crawling the Big Picture
Duplicate Pages
Mirror Sites
Identifying Similar Pages
Templates
Deep Web
When to stop?
Incremental Crawler
Refresh Policies
Evolution of the Web
Crawling the "good" pages first
Focused Crawling
Distributed Crawlers
Crawler Friendly Webservers
Today’s Focus
A crawler which guarantees coverage of the Web
As time permits: Refresh Policies, Duplicate Detection Techniques
RankMass Crawler
A Crawler with a High Personalized PageRank Coverage Guarantee
Motivation
It is impossible to download the entire web. Example: a single calendar script can generate endless pages. So when can we stop? And how do we gain the most benefit from the pages we do download?
Main Issues
Crawler guarantee: a guarantee on how much of the "important" part of the Web is "covered" when the crawler stops.
But if we don't see a page, how do we know how important it is?
Crawler efficiency: download "important" pages early in the crawl, and obtain the coverage guarantee with a minimum number of downloads.
Outline
Formalize the coverage metric
L-Neighbor: crawling with a RankMass guarantee
RankMass: crawling to achieve high RankMass
Windowed RankMass: how greedy do you want to be?
Experimental results
Web Coverage Problem
D: the potentially infinite set of documents on the web
D_C: the finite set of documents in our document collection
Assign importance weights to each page
Web Coverage Problem
What weights? Per query? Topic? Font?
PageRank? Why PageRank? It is useful as an importance measure (the random-surfer model) and effective for ranking.
PageRank a Short Review
r_i = d · Σ_{p_j ∈ I(p_i)} r_j / c_j + (1 − d) · 1/|D|

where I(p_i) is the set of pages linking to p_i, c_j is the number of out-links of p_j, and d is the damping factor.

[Figure: a small example graph over pages p1, p2, p3, p4.]
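To make the review concrete, here is a minimal power-iteration sketch of the formula above in Python (not part of the talk); the four-page example graph and its links are assumptions for illustration.

def pagerank(out_links, d=0.85, iters=50):
    """Power iteration for r_i = d * sum_{p_j in I(p_i)} r_j / c_j + (1 - d) / |D|."""
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}               # start from the uniform distribution
    for _ in range(iters):
        nxt = {p: (1.0 - d) / n for p in pages}   # the (1 - d) / |D| random-jump term
        for j, links in out_links.items():
            if links:
                share = d * r[j] / len(links)     # d * r_j / c_j to each linked page
                for i in links:
                    nxt[i] += share
            else:                                  # dangling page: spread its mass uniformly
                for p in pages:
                    nxt[p] += d * r[j] / n
        r = nxt
    return r

# Example on a hypothetical four-page graph like the one sketched on the slide
print(pagerank({"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1", "p4"], "p4": ["p1"]}))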
Now it’s Personal
Personalized PageRank, TrustRank, and general PageRank all fit the same form with different trust vectors:

r_i = d · Σ_{p_j ∈ I(p_i)} r_j / c_j + (1 − d) · t_i

where T = [t_1, t_2, …] is the trust vector: when interrupted, the random surfer jumps to page p_i with probability t_i.

[Figure: the example graph again, with the trusted pages highlighted.]
RankMass Defined
Using personalized PageRank we formally define the RankMass of a document collection D_C:

RM(D_C) = Σ_{p_i ∈ D_C} r_i

Coverage guarantee: for a given ε, we seek a crawler such that when it stops, the downloaded pages D_C satisfy:

RM(D_C) = Σ_{p_i ∈ D_C} r_i ≥ 1 − ε

Efficient crawling: for a given N, we seek a crawler that downloads a set D_C with |D_C| = N such that RM(D_C) is greater than or equal to RM(D′_C) for any other D′_C ⊆ D with |D′_C| = N.
How to Calculate RankMass Based on PageRank
How do you compute RM(D_C) without downloading the entire web?
We can't compute the exact value, but we can lower-bound it.
Let's start with a simple case.
Single Trusted Page
T(1): t_1 = 1; t_i = 0 for all i ≠ 1
The surfer always jumps to p1 when bored. We can place a lower bound on the PageRank mass of the pages within L links of p1:
N_L(p1) = the set of pages reachable from p1 in at most L links
Single Trusted Page
Lower bound guarantee: Single Trusted
Theorem 1: Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p1 is at least d^(L+1) close to 1. That is:

Σ_{p_i ∈ N_L(p_1)} r_i ≥ 1 − d^(L+1)
Lower bound guarantee: General Case
Theorem 2: The RankMass of the L-neighbors of the set of all trusted pages G, N_L(G), is at least d^(L+1) close to 1. That is:

Σ_{p_i ∈ N_L(G)} r_i ≥ 1 − d^(L+1)
RankMass Lower Bound
Lower bound given a single trusted page:

Σ_{p_i ∈ N_L(p_1)} r_i ≥ 1 − d^(L+1)

Extension: given a set of trusted pages G:

Σ_{p_i ∈ N_L(G)} r_i ≥ 1 − d^(L+1)

That's the basis of the crawling algorithm with a coverage guarantee.
The L-Neighbor Crawler
1. L := 0
2. N[0] = {p_i | t_i > 0}   // start with the trusted pages
3. While (ε < d^(L+1)):
   1. Download all uncrawled pages in N[L]
   2. N[L+1] = {all pages linked to by a page in N[L]}
   3. L = L + 1
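A minimal Python sketch of L-Neighbor (an illustration, not the authors' code); fetch_links, the trusted set, d, and epsilon are assumptions.

def l_neighbor_crawl(trusted, fetch_links, d=0.85, epsilon=0.01):
    """BFS-style crawl: after level L, RankMass coverage is guaranteed to be >= 1 - d^(L+1)."""
    frontier = set(trusted)              # N[0]: the trusted pages
    crawled = {}                         # url -> out-links
    L = 0
    while epsilon < d ** (L + 1):        # guarantee 1 - d^(L+1) not yet good enough
        next_frontier = set()
        for url in frontier:
            if url not in crawled:
                crawled[url] = fetch_links(url)   # download the page, record its links
            next_frontier.update(crawled[url])
        frontier = next_frontier         # N[L+1]: all pages linked from N[L]
        L += 1
    return crawled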
But what about efficiency? L-Neighbor is similar to BFS: simple and efficient. Still, we may wish to prioritize certain neighborhoods first (e.g. the neighborhood of a trusted page with t = 0.99 over one with t = 0.01), which calls for page-level prioritization.
Page Level Prioritizing
We want a more fine-grained, page-level priority. The idea: estimate PageRank on a per-page basis and give high priority to pages with a high PageRank estimate. We cannot calculate exact PageRank for undownloaded pages, but we can calculate a lower bound on it. But how?
Probability of Being at Page P
[Diagram: the random surfer is either interrupted and jumps to a trusted page, or clicks a link on the current page.]
Calculating PageRank Lower Bound
PageRank(p) = the probability that the random surfer is at page p.
Break each path down by "interrupts", i.e. jumps to a trusted page.
Sum up the probabilities of all paths that start with an interrupt and end at p.
Example path: Interrupt → p_j → p_1 → p_2 → p_3 → p_4 → p_5 → p_i
Probability:  (1-d) · t_j · (d·1/3) · (d·1/5) · (d·1/3) · (d·1/3) · (d·1/3) · (d·1/3)
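As a quick numeric illustration (not from the talk), the contribution of that single path, assuming d = 0.85 and t_j = 1:

d, t_j = 0.85, 1.0                      # assumed damping factor and trust value
out_degrees = [3, 5, 3, 3, 3, 3]        # the c values along the path p_j -> ... -> p_i
prob = (1 - d) * t_j                    # interrupt, then jump to the trusted page p_j
for c in out_degrees:
    prob *= d / c                       # each click follows one of c out-links
print(prob)                             # one term of the sum over all such paths ending at p_i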
RankMass Basic Idea
[Figure: a running example with trust values 0.99 on p1 and 0.01 on p2. After p1 is downloaded, its neighbors p3, p4, p5 carry lower bounds of 0.25 each; after p3 is downloaded, p6 and p7 carry 0.09 each. The crawler always picks the page with the largest current lower bound.]
RankMass Crawler: High Level
But that sounds complicated?! Luckily we don't need all of that. The idea: dynamically update the lower bound on the PageRank of each page, update the total RankMass as pages are downloaded, and always download the page with the highest lower bound.
RankMass Crawler (Shorter)
Variables:
  CRM: RankMass lower bound of the crawled pages
  rm_i: lower bound of the PageRank of p_i
RankMassCrawl():
  CRM = 0
  rm_i = (1 − d)·t_i for each t_i > 0
  While (CRM < 1 − ε):
    Pick the p_i with the largest rm_i
    Download p_i if not downloaded yet
    CRM = CRM + rm_i
    For each p_j linked to by p_i:
      rm_j = rm_j + (d/c_i)·rm_i
    rm_i = 0
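A minimal Python sketch of this loop (an interpretation for illustration, not the authors' implementation); fetch_links, the trust vector, d, and epsilon are assumptions. A max-heap makes "pick the p_i with the largest rm_i" cheap; stale heap entries are simply skipped.

import heapq

def rankmass_crawl(trust, fetch_links, d=0.85, epsilon=0.01):
    """Greedy RankMass crawl: always process the page with the largest PageRank lower bound."""
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    heap = [(-v, p) for p, v in rm.items()]         # max-heap via negated bounds
    heapq.heapify(heap)
    links_of, crm = {}, 0.0                          # downloaded pages and collected RankMass
    while crm < 1 - epsilon and heap:
        neg_v, p = heapq.heappop(heap)
        if -neg_v != rm.get(p, 0.0):                 # stale heap entry, bound changed since push
            continue
        if p not in links_of:
            links_of[p] = fetch_links(p)             # download p only once
        crm += rm[p]                                 # credit the collected lower bound
        links = links_of[p]
        if links:
            share = d * rm[p] / len(links)           # (d / c_i) * rm_i to each out-link
            for q in links:
                rm[q] = rm.get(q, 0.0) + share
                heapq.heappush(heap, (-rm[q], q))
        rm[p] = 0.0                                  # mass has been handed on
    return set(links_of), crm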
Greedy vs Simple
L-Neighbor is simple; RankMass is very greedy, and its per-page updates are expensive (random access to the web graph). The compromise: batch downloads together and batch updates together.
Windowed RankMass
Variables:
  CRM: RankMass lower bound of the crawled pages
  rm_i: lower bound of the PageRank of p_i
Crawl():
  rm_i = (1 − d)·t_i for each t_i > 0
  While (CRM < 1 − ε):
    Download the top window% of pages according to rm_i
    For each downloaded page p_i ∈ D_C:
      CRM = CRM + rm_i
      For each p_j linked to by p_i:
        rm_j = rm_j + (d/c_i)·rm_i
      rm_i = 0
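A sketch of the batched variant in the same style (again an interpretation, with the same assumed helpers); the only change from the greedy version is that downloads and bound updates happen one window at a time.

def windowed_rankmass_crawl(trust, fetch_links, d=0.85, epsilon=0.01, window=0.20):
    """Batched variant: download the top window% of pages by rm, then update bounds in bulk."""
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    links_of, crm = {}, 0.0
    while crm < 1 - epsilon:
        candidates = [p for p, v in rm.items() if v > 0]
        if not candidates:
            break
        candidates.sort(key=rm.get, reverse=True)
        batch = candidates[: max(1, int(len(candidates) * window))]
        for p in batch:                              # one batch of downloads ...
            if p not in links_of:
                links_of[p] = fetch_links(p)
        for p in batch:                              # ... then one batch of updates
            crm += rm[p]
            links = links_of[p]
            if links:
                share = d * rm[p] / len(links)
                for q in links:
                    rm[q] = rm.get(q, 0.0) + share
            rm[p] = 0.0
    return set(links_of), crm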
Experimental Setup
HTML files only. The algorithms were simulated over a web graph crawled between December 2003 and January 2004: 141 million URLs spanning 6.9 million host names and 233 top-level domains.
Metrics Of Evaluation
1. How much RankMass is collected during the crawl
2. How much RankMass is “known” to have been collected during the crawl
3. How much computational and performance overhead the algorithm introduces.
[Results charts: RankMass collected vs. number of downloads for L-Neighbor, RankMass, and Windowed RankMass, and the effect of window size.]
Algorithm Efficiency
Algorithm            Downloads for guaranteed RankMass ≥ 0.98    Downloads for actual RankMass ≥ 0.98
L-Neighbor           7 million                                   65,000
RankMass             131,072                                     27,939
Windowed-RankMass    217,918                                     30,826
Optimal              27,101                                      27,101
Algorithm Running Time
Algorithm / Window    Hours    Number of Iterations    Number of Documents
L-Neighbor            1:27     13                      83,638,834
20%-Windowed          4:39     44                      80,622,045
10%-Windowed          10:27    85                      80,291,078
5%-Windowed           17:52    167                     80,139,289
RankMass              25:39    not comparable          10,350,000
Refresh Policies
Refresh Policy: Problem Definition
You have N URLs you want to keep fresh, but limited resources: f downloads per second. In what order should you download them to maximize average freshness? What do you do?
Note: you can't always know how a page currently looks without fetching it.
The Optimal Solution
It depends on the freshness definition. Here freshness is boolean: a page is either fresh or not, and even one small change deems it unfresh.
Understand Freshness Better
A two-page database: Pd changes daily, Pw changes once a week. We can refresh one page per week. How should we visit the pages?
Uniform: Pd, Pw, Pd, Pw, Pd, Pw, …
Proportional: Pd, Pd, Pd, Pd, Pd, Pd, Pw
Other?
Proportional Often Not Good!
Visiting the fast-changing Pd gains about 1/2 day of freshness; visiting the slow-changing Pw gains about 1/2 week of freshness. Visiting Pw is a better deal!
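A quick simulation (not from the talk) makes the 1/2-day vs. 1/2-week comparison concrete; the uniform change-time model, the one-week horizon, and the trial count are illustrative assumptions.

import random

def expected_fresh_days(change_period_days, horizon_days=7, trials=100_000):
    """Monte Carlo estimate of how many fresh days one refresh buys over the next week,
    assuming the page's next change falls uniformly at random within its change period."""
    total = 0.0
    for _ in range(trials):
        next_change = random.uniform(0, change_period_days)
        total += min(next_change, horizon_days)    # fresh until it changes (or the week ends)
    return total / trials

print("refresh Pd (changes daily): ", expected_fresh_days(1))   # about 0.5 days
print("refresh Pw (changes weekly):", expected_fresh_days(7))   # about 3.5 days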
Optimal Refresh Frequency
Problem: given the change frequencies λ_1, λ_2, …, λ_N and the total refresh rate f, find refresh frequencies f_1, f_2, …, f_N with (f_1 + f_2 + … + f_N)/N = f that maximize the average freshness

F̄(S) = (1/N) · Σ_{i=1..N} F̄(e_i)
Optimal Refresh Frequency
The shape of the optimal refresh-frequency curve is the same in all cases, and this holds for any change frequency distribution.
Do Not Crawl In The DUST: Different URLs Similar Text
Ziv Bar-Yossef (Technion and Google)
Idit Keidar (Technion)
Uri Schonfeld (UCLA)
DUST – Different URLs Similar Text. Examples:
Default directory files: "/index.html" → "/"
Domain names and virtual hosts: "news.google.com" → "google.com/news"
Aliases and symbolic links: "~shuri" → "/people/shuri"
Parameters with little effect on content: "?Print=1"
URL transformations: "/story_<num>" → "story?id=<num>"
Even the WWW Gets Dusty
Reduce crawling and indexing: avoid fetching the same document more than once.
Canonization for better ranking: references to a document may be split among its aliases.
Avoid returning duplicate results.
Many algorithms that use URLs as unique ids will benefit.
Why Care about DUST?
Related Work
Similarity detection via document sketches [Broder et al, Hoad-Zobel, Shivakumar et al, Di Iorio et al, Brin et al, Garcia-Molina et al]: requires fetching all document duplicates and cannot be used to find DUST rules.
Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]: not suitable for finding site-specific DUST rules.
Mining association rules [Agrawal and Srikant]: a technically different problem.
So what are we looking for?
DustBuster, an algorithm that discovers site-specific DUST rules from a URL list without examining page content, and requires only a small number of page fetches to validate the rules.
A site-specific URL canonization algorithm.
Experiments on real data from both web access logs and crawl logs.
Our Contributions
Valid DUST rule: a mapping Ψ that maps each valid URL u to a valid URL Ψ(u) with similar content. Examples:
"/index.html" → "/"
"news.google.com" → "google.com/news"
"/story_<num>" → "story?id=<num>"
Invalid DUST rules: either do not preserve similarity or do not produce valid URLs.
DUST Rules
Substring substitution DUST (the focus of this talk):
"story_1259" → "story?id=1259"
"news.google.com" → "google.com/news"
"/index.html" → ""
Parameter DUST: removing a parameter, or replacing its value with a default value:
"Color=pink" → "Color=black"
Types of DUST Rules
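A minimal sketch (not from the paper) of using substring-substitution rules for canonization; the rule list and the URL below are illustrative assumptions.

def canonize(url, rules):
    """Repeatedly apply substring-substitution DUST rules until the URL stops changing.
    Assumes the rule set is acyclic, i.e. applying rules always terminates."""
    changed = True
    while changed:
        changed = False
        for src, dst in rules:
            if src in url:
                url = url.replace(src, dst, 1)    # apply one substitution
                changed = True
    return url

# Illustrative (from, to) rules; real rules would come out of DustBuster's validation step
rules = [("/index.html", "/"), ("story_", "story?id=")]
print(canonize("http://example.com/story_1259/index.html", rules))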
Basic Detection Framework
Input: a list of URLs from a site (a crawl log or a web access log).
1. Detect likely DUST rules
2. Eliminate redundant rules
3. Validate DUST rules using samples
No page fetches are needed for the detect and eliminate steps; only validation fetches pages.
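The "instances and support" idea behind the detect step can be sketched as follows: a candidate rule a → b is supported by pairs of URLs that are identical except that one contains substring a where the other contains b (they share the same prefix/suffix envelope). This is a simplified interpretation for illustration, not the paper's exact algorithm; the URL list, the substring-length cap, and the support threshold are assumptions.

from collections import defaultdict
from itertools import combinations

def likely_dust_rules(urls, max_len=12, min_support=2):
    """Support counting for substring rules a <-> b: URL pairs identical up to a vs. b."""
    envelopes = defaultdict(set)                   # (prefix, suffix) -> substrings seen there
    for url in urls:
        for i in range(len(url) + 1):
            for j in range(i, min(len(url), i + max_len) + 1):
                envelopes[(url[:i], url[j:])].add(url[i:j])
    support = defaultdict(int)
    for subs in envelopes.values():
        for a, b in combinations(sorted(subs), 2):
            support[(a, b)] += 1                   # one instance supporting the rule a <-> b
    return [(rule, s) for rule, s in support.items() if s >= min_support]

urls = ["http://x.com/story_1", "http://x.com/story?id=1",
        "http://x.com/story_2", "http://x.com/story?id=2"]
for rule, s in sorted(likely_dust_rules(urls), key=lambda kv: -kv[1])[:5]:
    print(rule, s)

With these toy URLs many overlapping candidates appear (e.g. both "_" vs "?id=" and "y_" vs "y?id="); trimming such overlaps is what the "eliminate redundant rules" step is for.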
Example: Instances & Support
END