crawling the web for a search engine or why crawling is cool
TRANSCRIPT
Crawling The Web For a Search Engine, Or Why Crawling is Cool
Talk Outline
What is a crawler?
Some of the interesting problems
RankMass Crawler
As time permits: Refresh Policies, Duplicate Detection
What is a Crawler?
[Diagram: the basic crawler loop. Initialize the "to visit" URL queue with the initial URLs; repeatedly get the next URL, get the page, extract its URLs, add unseen ones to the "to visit" set, move the fetched URL to the "visited" set, and store the web page.]
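The loop in the diagram maps almost directly to code. Below is a minimal sketch in Python (not from the talk); fetch_page, extract_urls, the seed list, and the page limit are assumptions for illustration.

from collections import deque

def crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
    """Minimal crawler loop: seed the queue, fetch, extract links, enqueue new URLs."""
    to_visit = deque(seed_urls)          # "to visit urls"
    visited = set()                      # "visited urls"
    pages = {}                           # stored "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)           # get page
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages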
Applications
Internet Search Engines: Google, Yahoo, MSN, Ask
Comparison Shopping Services: shopping sites
Data mining: Stanford WebBase, IBM WebFountain
Is that it?
Not quite
Crawling the Big Picture
Duplicate Pages
Mirror Sites
Identifying Similar Pages
Templates
Deep Web
When to stop?
Incremental Crawler
Refresh Policies
Evolution of the Web
Crawling the "good" pages first
Focused Crawling
Distributed Crawlers
Crawler Friendly Webservers
Today’s Focus
A crawler which guarantees coverage of the Web
As time permits: Refresh Policies, Duplicate Detection Techniques
RankMass Crawler
A Crawler with a High Personalized PageRank Coverage Guarantee
Motivation
It is impossible to download the entire web. Example: a single calendar script can generate endless pages. So when can we stop? And how do we gain the most benefit from the pages we do download?
Main Issues
Crawler guarantee: a guarantee on how much of the "important" part of the Web is "covered" when the crawler stops.
But if we don't see a page, how do we know how important it is?
Crawler efficiency: download "important" pages early in the crawl, and obtain the coverage guarantee with a minimum number of downloads.
Outline
Formalize the coverage metric
L-Neighbor: crawling with a RankMass guarantee
RankMass: crawling to achieve high RankMass
Windowed RankMass: how greedy do you want to be?
Experimental results
Web Coverage Problem
D: the potentially infinite set of documents on the web
D_C: the finite set of documents in our document collection
Assign importance weights to each page
Web Coverage Problem
What weights? Per query? Topic? Font?
PageRank? Why PageRank? It is useful as an importance measure (the random-surfer model) and effective for ranking.
PageRank a Short Review
r_i = d · Σ_{p_j ∈ I(p_i)} r_j / c_j + (1 − d) · 1/|D|

where I(p_i) is the set of pages linking to p_i, c_j is the number of out-links of p_j, and d is the damping factor.

[Figure: a small example graph over pages p1, p2, p3, p4.]
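To make the review concrete, here is a minimal power-iteration sketch of the formula above in Python (not part of the talk); the four-page example graph and its links are assumptions for illustration.

def pagerank(out_links, d=0.85, iters=50):
    """Power iteration for r_i = d * sum_{p_j in I(p_i)} r_j / c_j + (1 - d) / |D|."""
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}               # start from the uniform distribution
    for _ in range(iters):
        nxt = {p: (1.0 - d) / n for p in pages}   # the (1 - d) / |D| random-jump term
        for j, links in out_links.items():
            if links:
                share = d * r[j] / len(links)     # d * r_j / c_j to each linked page
                for i in links:
                    nxt[i] += share
            else:                                  # dangling page: spread its mass uniformly
                for p in pages:
                    nxt[p] += d * r[j] / n
        r = nxt
    return r

# Example on a hypothetical four-page graph like the one sketched on the slide
print(pagerank({"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1", "p4"], "p4": ["p1"]}))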
Now it’s Personal
Personalized PageRank, TrustRank, and general PageRank all fit the same form with different trust vectors:

r_i = d · Σ_{p_j ∈ I(p_i)} r_j / c_j + (1 − d) · t_i

where T = [t_1, t_2, …] is the trust vector: when interrupted, the random surfer jumps to page p_i with probability t_i.

[Figure: the example graph again, with the trusted pages highlighted.]
RankMass Defined
Using personalized PageRank we formally define the RankMass of a document collection D_C:

RM(D_C) = Σ_{p_i ∈ D_C} r_i

Coverage guarantee: for a given ε, we seek a crawler such that when it stops, the downloaded pages D_C satisfy:

RM(D_C) = Σ_{p_i ∈ D_C} r_i ≥ 1 − ε

Efficient crawling: for a given N, we seek a crawler that downloads a set D_C with |D_C| = N such that RM(D_C) is greater than or equal to RM(D′_C) for any other D′_C ⊆ D with |D′_C| = N.
How to Calculate RankMass Based on PageRank
How do you compute RM(D_C) without downloading the entire web?
We can't compute the exact value, but we can lower-bound it.
Let's start with a simple case.
Single Trusted Page
T(1): t_1 = 1; t_i = 0 for all i ≠ 1
The surfer always jumps to p1 when bored. We can place a lower bound on the PageRank mass of the pages within L links of p1:
N_L(p1) = the set of pages reachable from p1 in at most L links
Single Trusted Page
Lower bound guarantee: Single Trusted
Theorem 1: Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p1 is at least d^(L+1) close to 1. That is:

Σ_{p_i ∈ N_L(p_1)} r_i ≥ 1 − d^(L+1)
Lower bound guarantee: General Case
Theorem 2: The RankMass of the L-neighbors of the set of all trusted pages G, N_L(G), is at least d^(L+1) close to 1. That is:

Σ_{p_i ∈ N_L(G)} r_i ≥ 1 − d^(L+1)
RankMass Lower Bound
Lower bound given a single trusted page:

Σ_{p_i ∈ N_L(p_1)} r_i ≥ 1 − d^(L+1)

Extension: given a set of trusted pages G:

Σ_{p_i ∈ N_L(G)} r_i ≥ 1 − d^(L+1)

That's the basis of the crawling algorithm with a coverage guarantee.
The L-Neighbor Crawler
1. L := 0
2. N[0] = {p_i | t_i > 0}   // start with the trusted pages
3. While (ε < d^(L+1)):
   1. Download all uncrawled pages in N[L]
   2. N[L+1] = {all pages linked to by a page in N[L]}
   3. L = L + 1
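A minimal Python sketch of L-Neighbor (an illustration, not the authors' code); fetch_links, the trusted set, d, and epsilon are assumptions.

def l_neighbor_crawl(trusted, fetch_links, d=0.85, epsilon=0.01):
    """BFS-style crawl: after level L, RankMass coverage is guaranteed to be >= 1 - d^(L+1)."""
    frontier = set(trusted)              # N[0]: the trusted pages
    crawled = {}                         # url -> out-links
    L = 0
    while epsilon < d ** (L + 1):        # guarantee 1 - d^(L+1) not yet good enough
        next_frontier = set()
        for url in frontier:
            if url not in crawled:
                crawled[url] = fetch_links(url)   # download the page, record its links
            next_frontier.update(crawled[url])
        frontier = next_frontier         # N[L+1]: all pages linked from N[L]
        L += 1
    return crawled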
But what about efficiency? L-Neighbor is similar to BFS: simple and efficient. Still, we may wish to prioritize certain neighborhoods first (e.g. the neighborhood of a trusted page with t = 0.99 over one with t = 0.01), which calls for page-level prioritization.
Page Level Prioritizing
We want a more fine-grained, page-level priority. The idea: estimate PageRank on a per-page basis and give high priority to pages with a high PageRank estimate. We cannot calculate exact PageRank for undownloaded pages, but we can calculate a lower bound on it. But how?
Probability of Being at Page P
[Diagram: the random surfer is either interrupted and jumps to a trusted page, or clicks a link on the current page.]
Calculating PageRank Lower Bound
PageRank(p) = the probability that the random surfer is at page p.
Break each path down by "interrupts", i.e. jumps to a trusted page.
Sum up the probabilities of all paths that start with an interrupt and end at p.
Example path: Interrupt → p_j → p_1 → p_2 → p_3 → p_4 → p_5 → p_i
Probability:  (1-d) · t_j · (d·1/3) · (d·1/5) · (d·1/3) · (d·1/3) · (d·1/3) · (d·1/3)
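As a quick numeric illustration (not from the talk), the contribution of that single path, assuming d = 0.85 and t_j = 1:

d, t_j = 0.85, 1.0                      # assumed damping factor and trust value
out_degrees = [3, 5, 3, 3, 3, 3]        # the c values along the path p_j -> ... -> p_i
prob = (1 - d) * t_j                    # interrupt, then jump to the trusted page p_j
for c in out_degrees:
    prob *= d / c                       # each click follows one of c out-links
print(prob)                             # one term of the sum over all such paths ending at p_i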
RankMass Basic Idea
[Figure: a running example with trust values 0.99 on p1 and 0.01 on p2. After p1 is downloaded, its neighbors p3, p4, p5 carry lower bounds of 0.25 each; after p3 is downloaded, p6 and p7 carry 0.09 each. The crawler always picks the page with the largest current lower bound.]
RankMass Crawler: High Level
But that sounds complicated?! Luckily we don't need all of that. The idea: dynamically update the lower bound on the PageRank of each page, update the total RankMass as pages are downloaded, and always download the page with the highest lower bound.
RankMass Crawler (Shorter)
Variables:
  CRM: RankMass lower bound of the crawled pages
  rm_i: lower bound of the PageRank of p_i
RankMassCrawl():
  CRM = 0
  rm_i = (1 − d)·t_i for each t_i > 0
  While (CRM < 1 − ε):
    Pick the p_i with the largest rm_i
    Download p_i if not downloaded yet
    CRM = CRM + rm_i
    For each p_j linked to by p_i:
      rm_j = rm_j + (d/c_i)·rm_i
    rm_i = 0
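A minimal Python sketch of this loop (an interpretation for illustration, not the authors' implementation); fetch_links, the trust vector, d, and epsilon are assumptions. A max-heap makes "pick the p_i with the largest rm_i" cheap; stale heap entries are simply skipped.

import heapq

def rankmass_crawl(trust, fetch_links, d=0.85, epsilon=0.01):
    """Greedy RankMass crawl: always process the page with the largest PageRank lower bound."""
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    heap = [(-v, p) for p, v in rm.items()]         # max-heap via negated bounds
    heapq.heapify(heap)
    links_of, crm = {}, 0.0                          # downloaded pages and collected RankMass
    while crm < 1 - epsilon and heap:
        neg_v, p = heapq.heappop(heap)
        if -neg_v != rm.get(p, 0.0):                 # stale heap entry, bound changed since push
            continue
        if p not in links_of:
            links_of[p] = fetch_links(p)             # download p only once
        crm += rm[p]                                 # credit the collected lower bound
        links = links_of[p]
        if links:
            share = d * rm[p] / len(links)           # (d / c_i) * rm_i to each out-link
            for q in links:
                rm[q] = rm.get(q, 0.0) + share
                heapq.heappush(heap, (-rm[q], q))
        rm[p] = 0.0                                  # mass has been handed on
    return set(links_of), crm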
Greedy vs Simple
L-Neighbor is simple; RankMass is very greedy, and its per-page updates are expensive (random access to the web graph). The compromise: batch downloads together and batch updates together.
Windowed RankMass
Variables:
  CRM: RankMass lower bound of the crawled pages
  rm_i: lower bound of the PageRank of p_i
Crawl():
  rm_i = (1 − d)·t_i for each t_i > 0
  While (CRM < 1 − ε):
    Download the top window% of pages according to rm_i
    For each downloaded page p_i ∈ D_C:
      CRM = CRM + rm_i
      For each p_j linked to by p_i:
        rm_j = rm_j + (d/c_i)·rm_i
      rm_i = 0
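A sketch of the batched variant in the same style (again an interpretation, with the same assumed helpers); the only change from the greedy version is that downloads and bound updates happen one window at a time.

def windowed_rankmass_crawl(trust, fetch_links, d=0.85, epsilon=0.01, window=0.20):
    """Batched variant: download the top window% of pages by rm, then update bounds in bulk."""
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    links_of, crm = {}, 0.0
    while crm < 1 - epsilon:
        candidates = [p for p, v in rm.items() if v > 0]
        if not candidates:
            break
        candidates.sort(key=rm.get, reverse=True)
        batch = candidates[: max(1, int(len(candidates) * window))]
        for p in batch:                              # one batch of downloads ...
            if p not in links_of:
                links_of[p] = fetch_links(p)
        for p in batch:                              # ... then one batch of updates
            crm += rm[p]
            links = links_of[p]
            if links:
                share = d * rm[p] / len(links)
                for q in links:
                    rm[q] = rm.get(q, 0.0) + share
            rm[p] = 0.0
    return set(links_of), crm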
Experimental Setup
HTML files only. The algorithms were simulated over a web graph crawled between December 2003 and January 2004: 141 million URLs spanning 6.9 million host names and 233 top-level domains.
Metrics Of Evaluation
1. How much RankMass is collected during the crawl
2. How much RankMass is “known” to have been collected during the crawl
3. How much computational and performance overhead the algorithm introduces.
[Results charts: RankMass collected vs. number of downloads for L-Neighbor, RankMass, and Windowed RankMass, and the effect of window size.]
Algorithm Efficiency
Algorithm            Downloads for guaranteed RankMass ≥ 0.98    Downloads for actual RankMass ≥ 0.98
L-Neighbor           7 million                                   65,000
RankMass             131,072                                     27,939
Windowed-RankMass    217,918                                     30,826
Optimal              27,101                                      27,101
Algorithm Running Time
Algorithm / Window    Hours    Number of Iterations    Number of Documents
L-Neighbor            1:27     13                      83,638,834
20%-Windowed          4:39     44                      80,622,045
10%-Windowed          10:27    85                      80,291,078
5%-Windowed           17:52    167                     80,139,289
RankMass              25:39    not comparable          10,350,000
Refresh Policies
Refresh Policy: Problem Definition
You have N URLs you want to keep fresh, but limited resources: f downloads per second. In what order should you download them to maximize average freshness? What do you do?
Note: you can't always know how a page currently looks without fetching it.
The Optimal Solution
It depends on the freshness definition. Here freshness is boolean: a page is either fresh or not, and even one small change deems it unfresh.
Understand Freshness Better
A two-page database: Pd changes daily, Pw changes once a week. We can refresh one page per week. How should we visit the pages?
Uniform: Pd, Pw, Pd, Pw, Pd, Pw, …
Proportional: Pd, Pd, Pd, Pd, Pd, Pd, Pw
Other?
Proportional Often Not Good!
Visiting the fast-changing Pd gains about 1/2 day of freshness; visiting the slow-changing Pw gains about 1/2 week of freshness. Visiting Pw is a better deal!
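A quick simulation (not from the talk) makes the 1/2-day vs. 1/2-week comparison concrete; the uniform change-time model, the one-week horizon, and the trial count are illustrative assumptions.

import random

def expected_fresh_days(change_period_days, horizon_days=7, trials=100_000):
    """Monte Carlo estimate of how many fresh days one refresh buys over the next week,
    assuming the page's next change falls uniformly at random within its change period."""
    total = 0.0
    for _ in range(trials):
        next_change = random.uniform(0, change_period_days)
        total += min(next_change, horizon_days)    # fresh until it changes (or the week ends)
    return total / trials

print("refresh Pd (changes daily): ", expected_fresh_days(1))   # about 0.5 days
print("refresh Pw (changes weekly):", expected_fresh_days(7))   # about 3.5 days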
Optimal Refresh Frequency
Problem: given the change frequencies λ_1, λ_2, …, λ_N and the total refresh rate f, find refresh frequencies f_1, f_2, …, f_N with (f_1 + f_2 + … + f_N)/N = f that maximize the average freshness

F̄(S) = (1/N) · Σ_{i=1..N} F̄(e_i)
Optimal Refresh Frequency
The shape of the optimal refresh-frequency curve is the same in all cases, and this holds for any change frequency distribution.
Do Not Crawl In The DUST: Different URLs Similar Text
Ziv Bar-Yossef (Technion and Google)
Idit Keidar (Technion)
Uri Schonfeld (UCLA)
DUST – Different URLs Similar Text. Examples:
Default directory files: "/index.html" → "/"
Domain names and virtual hosts: "news.google.com" → "google.com/news"
Aliases and symbolic links: "~shuri" → "/people/shuri"
Parameters with little effect on content: "?Print=1"
URL transformations: "/story_<num>" → "story?id=<num>"
Even the WWW Gets Dusty
Reduce crawling and indexing: avoid fetching the same document more than once.
Canonization for better ranking: references to a document may be split among its aliases.
Avoid returning duplicate results.
Many algorithms that use URLs as unique ids will benefit.
Why Care about DUST?
Related Work
Similarity detection via document sketches [Broder et al, Hoad-Zobel, Shivakumar et al, Di Iorio et al, Brin et al, Garcia-Molina et al]: requires fetching all document duplicates and cannot be used to find DUST rules.
Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]: not suitable for finding site-specific DUST rules.
Mining association rules [Agrawal and Srikant]: a technically different problem.
So what are we looking for?
DustBuster, an algorithm that discovers site-specific DUST rules from a URL list without examining page content, and requires only a small number of page fetches to validate the rules.
A site-specific URL canonization algorithm.
Experiments on real data from both web access logs and crawl logs.
Our Contributions
Valid DUST rule: a mapping Ψ that maps each valid URL u to a valid URL Ψ(u) with similar content. Examples:
"/index.html" → "/"
"news.google.com" → "google.com/news"
"/story_<num>" → "story?id=<num>"
Invalid DUST rules: either do not preserve similarity or do not produce valid URLs.
DUST Rules
Substring substitution DUST (the focus of this talk):
"story_1259" → "story?id=1259"
"news.google.com" → "google.com/news"
"/index.html" → ""
Parameter DUST: removing a parameter, or replacing its value with a default value:
"Color=pink" → "Color=black"
Types of DUST Rules
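A minimal sketch (not from the paper) of using substring-substitution rules for canonization; the rule list and the URL below are illustrative assumptions.

def canonize(url, rules):
    """Repeatedly apply substring-substitution DUST rules until the URL stops changing.
    Assumes the rule set is acyclic, i.e. applying rules always terminates."""
    changed = True
    while changed:
        changed = False
        for src, dst in rules:
            if src in url:
                url = url.replace(src, dst, 1)    # apply one substitution
                changed = True
    return url

# Illustrative (from, to) rules; real rules would come out of DustBuster's validation step
rules = [("/index.html", "/"), ("story_", "story?id=")]
print(canonize("http://example.com/story_1259/index.html", rules))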
Basic Detection Framework
Input: a list of URLs from a site (a crawl log or a web access log).
1. Detect likely DUST rules
2. Eliminate redundant rules
3. Validate DUST rules using samples
No page fetches are needed for the detect and eliminate steps; only validation fetches pages.
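The "instances and support" idea behind the detect step can be sketched as follows: a candidate rule a → b is supported by pairs of URLs that are identical except that one contains substring a where the other contains b (they share the same prefix/suffix envelope). This is a simplified interpretation for illustration, not the paper's exact algorithm; the URL list, the substring-length cap, and the support threshold are assumptions.

from collections import defaultdict
from itertools import combinations

def likely_dust_rules(urls, max_len=12, min_support=2):
    """Support counting for substring rules a <-> b: URL pairs identical up to a vs. b."""
    envelopes = defaultdict(set)                   # (prefix, suffix) -> substrings seen there
    for url in urls:
        for i in range(len(url) + 1):
            for j in range(i, min(len(url), i + max_len) + 1):
                envelopes[(url[:i], url[j:])].add(url[i:j])
    support = defaultdict(int)
    for subs in envelopes.values():
        for a, b in combinations(sorted(subs), 2):
            support[(a, b)] += 1                   # one instance supporting the rule a <-> b
    return [(rule, s) for rule, s in support.items() if s >= min_support]

urls = ["http://x.com/story_1", "http://x.com/story?id=1",
        "http://x.com/story_2", "http://x.com/story?id=2"]
for rule, s in sorted(likely_dust_rules(urls), key=lambda kv: -kv[1])[:5]:
    print(rule, s)

With these toy URLs many overlapping candidates appear (e.g. both "_" vs "?id=" and "y_" vs "y?id="); trimming such overlaps is what the "eliminate redundant rules" step is for.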
Example: Instances & Support
END