Crawling The Web For a Search Engine Or Why Crawling is Cool

TRANSCRIPT

Page 1: Crawling The Web For a Search Engine Or Why Crawling is Cool

Crawling The Web For a Search Engine Or Why Crawling is Cool

Page 2: Crawling The Web For a Search Engine Or Why Crawling is Cool

Talk Outline

What is a crawler?
Some of the interesting problems
RankMass Crawler
As time permits: Refresh Policies, Duplicate Detection

Page 3: Crawling The Web For a Search Engine Or Why Crawling is Cool

What is a Crawler?

[Diagram: the basic crawler loop. Initialize the to-visit queue with the initial URLs; repeatedly get the next URL, fetch the page from the web, extract its URLs, and add unseen ones to the to-visit queue; keep track of visited URLs and store the downloaded web pages.]
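A minimal sketch of this loop in Python, using only the standard library; the regex-based link extraction and the lack of politeness delays, robots.txt handling, and content deduplication are simplifying assumptions for illustration, not part of the slides.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(initial_urls, max_pages=100):
    to_visit = deque(initial_urls)      # "to visit urls"
    visited = set()                     # "visited urls"
    pages = {}                          # downloaded "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()        # get next url
        if url in visited:
            continue
        visited.add(url)
        try:                            # get page
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                    # skip pages that fail to download
        pages[url] = html
        # extract urls (naive href regex; a real crawler would use an HTML parser)
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute not in visited:
                to_visit.append(absolute)
    return pages
```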

Page 4: Crawling The Web For a Search Engine Or Why Crawling is Cool

Applications

Internet search engines: Google, Yahoo, MSN, Ask
Comparison shopping services
Data mining: Stanford WebBase, IBM WebFountain

Page 5: Crawling The Web For a Search Engine Or Why Crawling is Cool

Is that it?

Not quite

Page 6: Crawling The Web For a Search Engine Or Why Crawling is Cool

Crawling the Big Picture

Duplicate pages
Mirror sites
Identifying similar pages
Templates
Deep Web
When to stop?
Incremental crawlers
Refresh policies
Evolution of the Web
Crawling the "good" pages first
Focused crawling
Distributed crawlers
Crawler-friendly web servers

Page 7: Crawling The Web For a Search Engine Or Why Crawling is Cool

Today’s Focus

A crawler which guarantees coverage of the Web

As time permits: Refresh Policies, Duplicate Detection Techniques

Page 8: Crawling The Web For a Search Engine Or Why Crawling is Cool

RankMass Crawler

A Crawler with a High Personalized PageRank Coverage Guarantee

Page 9: Crawling The Web For a Search Engine Or Why Crawling is Cool

Motivation

Impossible to download the entire web. Example: many pages generated from a single calendar.
When can we stop?
How do we gain the most benefit from the pages we download?

Page 10: Crawling The Web For a Search Engine Or Why Crawling is Cool

Main Issues

Crawler guarantee: a guarantee on how much of the "important" part of the Web is "covered" when the crawl stops. If we don't see the pages, how do we know how important they are?
Crawler efficiency: download "important" pages early during the crawl; obtain coverage with a minimum number of downloads.

Page 11: Crawling The Web For a Search Engine Or Why Crawling is Cool

Outline

Formalize the coverage metric
L-Neighbor: crawling with a RankMass guarantee
RankMass: crawling to achieve high RankMass
Windowed RankMass: how greedy do you want to be?
Experimental results

Page 12: Crawling The Web For a Search Engine Or Why Crawling is Cool

Web Coverage Problem

D – the potentially infinite set of documents of the web
D_C – the finite set of documents in our document collection
Assign importance weights to each page

Page 13: Crawling The Web For a Search Engine Or Why Crawling is Cool

Web Coverage Problem

What weights? Per query? Topic? Font?
PageRank? Why PageRank?
Useful as an importance measure (random surfer model)
Effective for ranking

Page 14: Crawling The Web For a Search Engine Or Why Crawling is Cool

PageRank a Short Review

$$r(p_i) \;=\; d \sum_{p_j \in I(p_i)} \frac{r(p_j)}{c_j} \;+\; (1-d)\,\frac{1}{|D|}$$

where I(p_i) is the set of pages that link to p_i, c_j is the number of out-links of p_j, and d is the damping factor.

[Figure: a small example link graph over pages p1, p2, p3, p4]

Page 15: Crawling The Web For a Search Engine Or Why Crawling is Cool

Now it’s Personal

Personalized PageRank (Personal, TrustRank, General): when the random surfer gets bored, instead of jumping to a uniformly random page it jumps to a trusted page according to a trust vector T = [t_1, t_2, ..., t_|D|]:

$$r(p_i) \;=\; d \sum_{p_j \in I(p_i)} \frac{r(p_j)}{c_j} \;+\; (1-d)\, t_i$$
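For concreteness, a hedged Python sketch of personalized PageRank by power iteration; the function, the dangling-page handling, and the toy graph are illustrative assumptions. Only the damping factor d and the trust vector T come from the slides.

```python
def personalized_pagerank(out_links, t, d=0.85, iters=50):
    """out_links: {page: [pages it links to]}; t: trust vector, non-negative, sums to 1."""
    pages = set(out_links) | {q for links in out_links.values() for q in links} | set(t)
    r = {p: t.get(p, 0.0) for p in pages}              # start from the trust vector
    for _ in range(iters):
        new_r = {p: (1 - d) * t.get(p, 0.0) for p in pages}
        for p in pages:
            links = out_links.get(p, [])
            mass = d * r[p]
            if links:                                  # split p's rank over its out-links
                for q in links:
                    new_r[q] += mass / len(links)
            else:                                      # dangling page: return its mass to the trusted pages
                for q in pages:
                    new_r[q] += mass * t.get(q, 0.0)
        r = new_r
    return r

# Tiny example: p1 is the only trusted page (the single-trusted-page case T(1) used later).
ranks = personalized_pagerank({"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p1"]}, t={"p1": 1.0})
```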

Page 16: Crawling The Web For a Search Engine Or Why Crawling is Cool

RankMass Defined

Using personalized PageRank, we formally define the RankMass of D_C as:

$$RM(D_C) \;=\; \sum_{p_i \in D_C} r(p_i)$$

Coverage guarantee: given ε, we seek a crawler that, when it stops, has downloaded pages D_C with:

$$RM(D_C) \;=\; \sum_{p_i \in D_C} r(p_i) \;\ge\; 1 - \epsilon$$

Efficient crawling: for a given N, we seek a crawler that downloads D_C with |D_C| = N such that RM(D_C) is greater than or equal to RM of any other D'_C ⊆ D with |D'_C| = N.

Page 17: Crawling The Web For a Search Engine Or Why Crawling is Cool

How to Calculate RankMass Based on PageRank

How do you compute RM(D_C) without downloading the entire web?
We can't compute the exact value, but we can lower-bound it.
Let's start with a simple case.

Page 18: Crawling The Web For a Search Engine Or Why Crawling is Cool

Single Trusted Page

T(1): t_1 = 1; t_i = 0 for i ≠ 1
Always jump to p1 when bored.
We can place a lower bound on the PageRank mass within L links of p1.
N_L(p1) = the set of pages reachable from p1 in at most L links

Page 19: Crawling The Web For a Search Engine Or Why Crawling is Cool

Single Trusted Page

Page 20: Crawling The Web For a Search Engine Or Why Crawling is Cool

Lower bound guarantee: Single Trusted

Theorem 1: Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p1 is at least d^(L+1) close to 1. That is:

$$\sum_{p_i \in N_L(p_1)} r_i \;\ge\; 1 - d^{L+1}$$
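A sketch of the intuition behind this bound (my paraphrase, hedged; the slide only states the theorem): the surfer is within L links of p1 unless its last jump to p1 happened more than L steps ago, and L+1 consecutive link-follows occur with probability d^(L+1).

```latex
% Informal sketch under the trust vector T(1) (a paraphrase, not the slide's proof).
% Let K be the number of link-follow steps since the surfer's last jump to p_1.
% Each step follows a link with probability d, so at stationarity P(K >= L+1) = d^{L+1}.
% Whenever K <= L, the surfer sits at some page of N_L(p_1). Hence
\[
  \sum_{p_i \notin N_L(p_1)} r_i \;\le\; \Pr[K \ge L+1] \;=\; d^{L+1}
  \quad\Longrightarrow\quad
  \sum_{p_i \in N_L(p_1)} r_i \;\ge\; 1 - d^{L+1}.
\]
```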

Page 21: Crawling The Web For a Search Engine Or Why Crawling is Cool

Lower Bound Guarantee: General Case

Theorem 2: The RankMass of the L-neighbors of the group of all trusted pages G, N_L(G), is at least d^(L+1) close to 1. That is:

$$\sum_{p_i \in N_L(G)} r_i \;\ge\; 1 - d^{L+1}$$

Page 22: Crawling The Web For a Search Engine Or Why Crawling is Cool

RankMass Lower Bound

Lower bound given a single trusted page:

$$\sum_{p_i \in N_L(p_1)} r_i \;\ge\; 1 - d^{L+1}$$

Extension, given a set of trusted pages G:

$$\sum_{p_i \in N_L(G)} r_i \;\ge\; 1 - d^{L+1}$$

That's the basis of the crawling algorithm with a coverage guarantee.

Page 23: Crawling The Web For a Search Engine Or Why Crawling is Cool

The L-Neighbor Crawler

1. L := 0
2. N[0] = {p_i | t_i > 0}   // start with the trusted pages
3. While (ε < d^(L+1)):
   1. Download all uncrawled pages in N[L]
   2. N[L+1] = {all pages linked to by a page in N[L]}
   3. L = L + 1
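A hedged Python sketch of this algorithm; fetch_and_extract_links(url) is a hypothetical helper (download a page and return its out-links), not something defined in the slides.

```python
def l_neighbor_crawl(trusted_urls, epsilon, d=0.85, fetch_and_extract_links=None):
    crawled = {}                        # url -> list of out-links
    frontier = set(trusted_urls)        # N[L], starting with the trusted pages
    L = 0
    while epsilon < d ** (L + 1):       # the mass outside N[L] can still exceed epsilon
        next_frontier = set()
        for url in frontier:
            if url not in crawled:      # download all uncrawled pages in N[L]
                crawled[url] = fetch_and_extract_links(url)
            next_frontier.update(crawled[url])
        frontier = next_frontier        # N[L+1] = pages linked to by pages in N[L]
        L += 1
    return crawled
```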

Page 24: Crawling The Web For a Search Engine Or Why Crawling is Cool

But what about efficiency?
L-Neighbor is similar to BFS
L-Neighbor is simple and efficient
We may wish to prioritize certain neighborhoods first
Page-level prioritization
[Figure: example with two trusted pages whose trust weights differ sharply, t_0 = 0.99 and t_1 = 0.01]

Page 25: Crawling The Web For a Search Engine Or Why Crawling is Cool

Page Level Prioritizing

We want a more fine-grained, page-level priority
The idea: estimate PageRank on a per-page basis
High priority for pages with a high PageRank estimate
We cannot calculate exact PageRank
Calculate a PageRank lower bound for undownloaded pages
…But how?

Page 26: Crawling The Web For a Search Engine Or Why Crawling is Cool

Probability of being at Page P

[Diagram: at each step the random surfer either clicks a link or is interrupted and jumps to a trusted page; the probability of being at page P decomposes over these events]

Page 27: Crawling The Web For a Search Engine Or Why Crawling is Cool

Calculating PageRank Lower Bound

PageRank(p) = the probability that the random surfer is at p
Break each surfer path down by "interrupts", i.e. jumps to a trusted page
Sum up all paths that start with an interrupt and end at p
[Example path: Interrupt → Pj → P1 → P2 → P3 → P4 → P5 → Pi, with probability (1-d) · t_j · (d·1/3) · (d·1/5) · (d·1/3) · (d·1/3) · (d·1/3) · (d·1/3)]
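One way to write this path decomposition (my notation, hedged; the slides give only the picture above):

```latex
% Each surfer path starts with an interrupt: a jump to a trusted page p_j with
% probability (1-d) t_j, followed by link clicks, each hop through a page with
% c_l out-links contributing a factor d / c_l. Summing over all paths ending at p:
\[
  r(p) \;=\; \sum_{\pi \,:\, p_j \to \cdots \to p} (1-d)\, t_j
             \prod_{p_l \in \pi,\; p_l \ne p} \frac{d}{c_l}.
\]
% Keeping only the paths that stay inside the downloaded pages drops non-negative
% terms, which is why the crawler can maintain a computable lower bound on r(p).
```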

Page 28: Crawling The Web For a Search Engine Or Why Crawling is Cool

RankMass Basic Idea

[Figure (reconstructed): propagating PageRank lower bounds, e.g. trusted page p1 with mass 0.99 and p2 with 0.01; neighboring pages p3, p4, p5 receive 0.25 each, and p6, p7 receive 0.09 each]

Page 29: Crawling The Web For a Search Engine Or Why Crawling is Cool

RankMass Crawler: High Level

But that sounds complicated?! Luckily we don't need all that. The crawler is based on this idea:
Dynamically update the lower bound on PageRank
Update the total RankMass
Download the page with the highest lower bound

Page 30: Crawling The Web For a Search Engine Or Why Crawling is Cool

RankMass Crawler (Shorter)

Variables:
CRM: RankMass lower bound of the crawled pages
rm_i: lower bound of the PageRank of p_i

RankMassCrawl()
  CRM = 0
  rm_i = (1 − d)·t_i for each t_i > 0
  While (CRM < 1 − ε):
    Pick p_i with the largest rm_i
    Download p_i if not downloaded yet
    CRM = CRM + rm_i
    Foreach p_j linked to by p_i:
      rm_j = rm_j + (d / c_i)·rm_i
    rm_i = 0
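A hedged Python sketch of the RankMassCrawl pseudocode above. It assumes the same hypothetical fetch_and_extract_links(url) helper as before and keeps the rm_i lower bounds in a max-heap via negated values.

```python
import heapq

def rankmass_crawl(trust, epsilon, d=0.85, fetch_and_extract_links=None):
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}   # PageRank lower bounds
    heap = [(-v, p) for p, v in rm.items()]
    heapq.heapify(heap)
    crawled, CRM = {}, 0.0
    while CRM < 1 - epsilon and heap:
        neg_v, p = heapq.heappop(heap)
        if rm.get(p, 0.0) != -neg_v:          # stale heap entry, skip it
            continue
        if p not in crawled:                  # download p_i if not downloaded yet
            crawled[p] = fetch_and_extract_links(p)
        mass, rm[p] = rm[p], 0.0              # collect p's bound and reset rm_i
        CRM += mass
        links = crawled[p]
        for q in links:                       # rm_j += (d / c_i) * rm_i
            rm[q] = rm.get(q, 0.0) + d * mass / len(links)
            heapq.heappush(heap, (-rm[q], q))
    return crawled, CRM
```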

Page 31: Crawling The Web For a Search Engine Or Why Crawling is Cool

Greedy vs Simple

L-Neighbor is simple. RankMass is very greedy, and its updates are expensive: they require random access to the web graph. Compromise? Batching:
downloads together
updates together

Page 32: Crawling The Web For a Search Engine Or Why Crawling is Cool

Windowed RankMass

Variables:
CRM: RankMass lower bound of the crawled pages
rm_i: lower bound of the PageRank of p_i

Crawl()
  rm_i = (1 − d)·t_i for each t_i > 0
  While (CRM < 1 − ε):
    Download the top window% of pages according to rm_i
    Foreach downloaded page p_i ∈ D_C:
      CRM = CRM + rm_i
      Foreach p_j linked to by p_i:
        rm_j = rm_j + (d / c_i)·rm_i
      rm_i = 0
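A hedged sketch of the windowed variant: download the top window fraction of pages by rm_i as one batch, then apply the rm updates together. Again, fetch_and_extract_links(url) is a hypothetical helper.

```python
def windowed_rankmass_crawl(trust, epsilon, window=0.1, d=0.85, fetch_and_extract_links=None):
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    crawled, CRM = {}, 0.0
    while CRM < 1 - epsilon and any(v > 0 for v in rm.values()):
        pending = sorted((p for p, v in rm.items() if v > 0), key=rm.get, reverse=True)
        batch = pending[: max(1, int(window * len(pending)))]    # top window% by rm_i
        for p in batch:                      # download the batch, then propagate its mass
            if p not in crawled:
                crawled[p] = fetch_and_extract_links(p)
            mass, rm[p] = rm[p], 0.0
            CRM += mass
            links = crawled[p]
            for q in links:
                rm[q] = rm.get(q, 0.0) + d * mass / len(links)
    return crawled, CRM
```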

Page 33: Crawling The Web For a Search Engine Or Why Crawling is Cool

Experimental Setup

HTML files only
Algorithms simulated over a web graph
Crawled between Dec 2003 and Jan 2004
141 million URLs spanning 6.9 million host names
233 top-level domains

Page 34: Crawling The Web For a Search Engine Or Why Crawling is Cool

Metrics Of Evaluation

1. How much RankMass is collected during the crawl

2. How much RankMass is “known” to have been collected during the crawl

3. How much computational and performance overhead the algorithm introduces.

Page 35: Crawling The Web For a Search Engine Or Why Crawling is Cool

L-Neighbor

Page 36: Crawling The Web For a Search Engine Or Why Crawling is Cool

RankMass

Page 37: Crawling The Web For a Search Engine Or Why Crawling is Cool

Windowed RankMass

Page 38: Crawling The Web For a Search Engine Or Why Crawling is Cool

Window Size

Page 39: Crawling The Web For a Search Engine Or Why Crawling is Cool

Algorithm Efficiency

Algorithm            Downloads required for above    Downloads required for above
                     0.98 guaranteed RankMass        0.98 actual RankMass
L-Neighbor           7 million                       65,000
RankMass             131,072                         27,939
Windowed-RankMass    217,918                         30,826
Optimal              27,101                          27,101

Page 40: Crawling The Web For a Search Engine Or Why Crawling is Cool

Algorithm Running Time

Window          Hours    Number of Iterations    Number of Documents
L-Neighbor      1:27     13                      83,638,834
20%-Windowed    4:39     44                      80,622,045
10%-Windowed    10:27    85                      80,291,078
5%-Windowed     17:52    167                     80,139,289
RankMass        25:39    Not comparable          10,350,000

Page 41: Crawling The Web For a Search Engine Or Why Crawling is Cool

Refresh Policies

Page 42: Crawling The Web For a Search Engine Or Why Crawling is Cool

Refresh Policy: Problem Definition

You have N URLs you want to keep fresh
Limited resources: f documents / second
Choose a download order that maximizes average freshness
What do you do?

Note: Can’t always know how the page really looks

Page 43: Crawling The Web For a Search Engine Or Why Crawling is Cool

The Optimal Solution

Depends on the freshness definition
Boolean freshness: a page is either fresh or not; one small change deems it unfresh

Page 44: Crawling The Web For a Search Engine Or Why Crawling is Cool

Understand Freshness Better

Two-page database: Pd changes daily, Pw changes once a week
We can refresh one page per week
How should we visit the pages?

Uniform: Pd, Pw, Pd, Pw, Pd, Pw, …

Proportional: Pd, Pd, Pd, Pd, Pd, Pd, Pw, …

Other?

Page 45: Crawling The Web For a Search Engine Or Why Crawling is Cool

Proportional Often Not Good!

Visit the fast-changing Pd: gain 1/2 day of freshness on average

Visit the slow-changing Pw: gain 1/2 week of freshness on average

Visiting Pw is a better deal!

Page 46: Crawling The Web For a Search Engine Or Why Crawling is Cool

Optimal Refresh Frequency

Problem: Given λ_1, λ_2, ..., λ_N and f, find f_1, f_2, ..., f_N that maximize

$$\bar{F}(S) \;=\; \frac{1}{N} \sum_{i=1}^{N} \bar{F}(e_i)$$

subject to

$$\frac{1}{N} \sum_{i=1}^{N} f_i \;=\; f$$
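A small numerical sketch of why proportional allocation loses to uniform (hedged: it assumes the standard Poisson change model, which the slides do not state explicitly, and the refresh budget is an illustrative number).

```python
import math

# Under a Poisson change model, a page with change rate lam refreshed at frequency f
# has time-averaged freshness (f / lam) * (1 - exp(-lam / f)).
def avg_freshness(lam, f):
    if f <= 0:
        return 0.0
    return (f / lam) * (1.0 - math.exp(-lam / f))

# Two-page example: Pd changes 7 times/week, Pw once/week, and we can spend a total
# of 2 refreshes/week (illustrative budget). Uniform splits it 1/1, proportional 7/8 vs 1/8.
lam_d, lam_w = 7.0, 1.0
total = 2.0
uniform = 0.5 * (avg_freshness(lam_d, 1.0) + avg_freshness(lam_w, 1.0))
proportional = 0.5 * (avg_freshness(lam_d, total * 7 / 8) + avg_freshness(lam_w, total * 1 / 8))
print(f"uniform ≈ {uniform:.3f}, proportional ≈ {proportional:.3f}")
# Uniform comes out ahead here, matching the slide's point that proportional is often not good.
```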

Page 47: Crawling The Web For a Search Engine Or Why Crawling is Cool

Optimal Refresh Frequency

• Shape of the curve is the same in all cases
• Holds for any change frequency distribution

Page 48: Crawling The Web For a Search Engine Or Why Crawling is Cool

Do Not Crawl In The DUST: Different URLs with Similar Text

Ziv Bar-Yossef (Technion and Google)

Idit Keidar (Technion)

Uri Schonfeld (UCLA)

Page 49: Crawling The Web For a Search Engine Or Why Crawling is Cool

DUST – Different URLs with Similar Text. Examples:

Default directory files: "/index.html" → "/"
Domain names and virtual hosts: "news.google.com" → "google.com/news"
Aliases and symbolic links: "~shuri" → "/people/shuri"
Parameters with little effect on content: ?Print=1
URL transformations: "/story_<num>" → "story?id=<num>"

Even the WWW gets dusty

Page 50: Crawling The Web For a Search Engine Or Why Crawling is Cool

Why Care about DUST?

Reduce crawling and indexing: avoid fetching the same document more than once
Canonization for better ranking: references to a document may be split among its aliases
Avoid returning duplicate results
Many algorithms that use URLs as unique IDs will benefit

Page 51: Crawling The Web For a Search Engine Or Why Crawling is Cool


Related Work Similarity detection via document sketches [Broder et al, Hoad-Zobel, Shivakumar et al, Di Iorio et al, Brin et al, Garcia-Molina et al]

Requires fetching all document duplicates Cannot be used to find "DUST rules"

Mirror detection [Bharat,Broder 99], [Bharat,Broder,Dean,Henzinger 00], [Cho,Shivakumar,Garcia-Molina 00], [Liang 01]

Not suitable for finding site-specific "DUST rules"

Mining association rules [Agrawal and Srikant]

A technically different problem

Page 52: Crawling The Web For a Search Engine Or Why Crawling is Cool


So what are we looking for?

Page 53: Crawling The Web For a Search Engine Or Why Crawling is Cool

Our Contributions

DustBuster, an algorithm that:
Discovers site-specific DUST rules from a URL list, without examining page content
Requires a small number of page fetches to validate the rules

A site-specific URL canonization algorithm
Experimented on real data both from:
Web access logs
Crawl logs

Page 54: Crawling The Web For a Search Engine Or Why Crawling is Cool

DUST Rules

Valid DUST rule: a mapping Ψ that maps each valid URL u to a valid URL Ψ(u) with similar content
"/index.html" → "/"
"news.google.com" → "google.com/news"
"/story_<num>" → "story?id=<num>"

Invalid DUST rules: either do not preserve similarity or do not produce valid URLs

Page 55: Crawling The Web For a Search Engine Or Why Crawling is Cool

Types of DUST Rules

Substring substitution DUST (the focus of this talk):
"story_1259" → "story?id=1259"
"news.google.com" → "google.com/news"
"/index.html" → ""

Parameter DUST: removing a parameter, or replacing its value with a default value
"Color=pink" → "Color=black"
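As an illustration (not DustBuster itself), here is how a crawler might apply already-validated substring-substitution DUST rules to canonize URLs; the rule list and regexes are made-up examples.

```python
import re

# Hypothetical, illustrative rule list: each pattern/replacement pair encodes one
# substring-substitution DUST rule of the kind shown above.
DUST_RULES = [
    (re.compile(r"/story_(\d+)$"), r"/story?id=\1"),   # "/story_<num>" -> "/story?id=<num>"
    (re.compile(r"/index\.html$"), "/"),               # "/index.html"  -> "/"
]

def canonize(url):
    """Apply each DUST rule once, mapping a URL to its canonical form."""
    for pattern, replacement in DUST_RULES:
        url = pattern.sub(replacement, url)
    return url

print(canonize("http://example.com/story_1259"))   # http://example.com/story?id=1259
print(canonize("http://example.com/index.html"))   # http://example.com/
```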

Page 56: Crawling The Web For a Search Engine Or Why Crawling is Cool

Basic Detection Framework

Input: a list of URLs from a site (crawl or web access log)
Detect likely DUST rules
Eliminate redundant rules
Validate DUST rules using samples

[Pipeline: Detect → Eliminate → Validate; no page fetches until the validation step]
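A toy sketch of the "Detect" idea only, not the paper's algorithm: propose substitutions α → β supported by many URL pairs that become identical after the substitution, using nothing but the URL list (no fetches). The token splitting and the support threshold are illustrative assumptions.

```python
from collections import Counter, defaultdict

def likely_dust_rules(urls, min_support=2):
    # Index URLs by the "envelope" left after deleting one token; URLs sharing an
    # envelope differ only in that token, suggesting a substitution rule.
    envelopes = defaultdict(list)
    for url in urls:
        tokens = url.replace("?", "/").split("/")
        for i, tok in enumerate(tokens):
            key = ("/".join(tokens[:i]), "/".join(tokens[i + 1:]))
            envelopes[key].append(tok)
    support = Counter()
    for variants in envelopes.values():
        variants = sorted(set(variants))
        for a in variants:
            for b in variants:
                if a != b:
                    support[(a, b)] += 1     # evidence that a -> b may be a DUST rule
    return [rule for rule, s in support.items() if s >= min_support]

urls = ["site.com/a/index.html", "site.com/a/", "site.com/b/index.html", "site.com/b/"]
# Candidates include ('index.html', ''), i.e. "/index.html" -> "/", along with spurious
# pairs such as ('a', 'b').
print(likely_dust_rules(urls))
```

The spurious candidates are exactly why the Eliminate and Validate steps that follow are needed.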

Page 57: Crawling The Web For a Search Engine Or Why Crawling is Cool


Example: Instances & Support

Page 58: Crawling The Web For a Search Engine Or Why Crawling is Cool

END