Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Algoritmi per IR
Web Search

Goal of a Search Engine
Retrieve the docs that are "relevant" for the user query
• Doc: Word or PDF file, web page, email, blog, e-book, ...
• Query: the "bag of words" paradigm
Relevant ?!?
Two main difficulties

The Web:
• Size: more than tens of billions of pages
• Languages and encodings: hundreds...
• Distributed authorship: spam, format-less pages, ...
• Dynamic: after one year, only ~35% of the pages survive and ~20% are untouched
Extracting "significant data" is difficult!!

The User:
• Query composition: short (2.5 terms on average) and imprecise
• Query results: 85% of users look only at the first result page
• Several needs: informational, navigational, transactional
Matching "user needs" is difficult!!
Evolution of Search Engines
• First generation (1995-1997: AltaVista, Excite, Lycos, ...) -- uses only on-page, web-text data
  • Word frequency and language
• Second generation (1998: Google) -- uses off-page, web-graph data
  • Link (or connectivity) analysis
  • Anchor text (how people refer to a page)
• Third generation (Google, Yahoo, MSN, ASK, ...) -- answers "the need behind the query"
  • Focus on the "user need", rather than on the query
  • Integration of multiple data sources
  • Click-through data
• Fourth generation → Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
This is a search engine!!!
Algoritmi per IR
The structure of a Search Engine
The structure
[Architecture diagram: a Crawler, guided by a Control module, fills the Page archive; a Page Analyzer and the Indexer build the text, structure, and auxiliary indexes; at query time, the Query resolver and the Ranker use these indexes to answer the user query.]
Information Retrieval
Crawling
Spidering
• 24h a day, 7 days a week, "walking" over a graph
• What about the graph?
  • Bow-tie structure
  • Directed graph G = (N, E)
  • N changes (insertions, deletions): >> 50·10^9 nodes
  • E changes (insertions, deletions): > 10 links per node → 10 × 50·10^9 = 500·10^9 1-entries in the adjacency matrix
Crawling Issues
• How to crawl?
  • Quality: "best" pages first
  • Efficiency: avoid duplication (or near-duplication)
  • Etiquette: robots.txt, server-load concerns (minimize load; see the robots.txt sketch after this list)
• How much to crawl? How much to index?
  • Coverage: how big is the Web? How much of it do we cover?
  • Relative coverage: how much do the competitors have?
• How often to crawl?
  • Freshness: how much has changed?
• How to parallelize the process?
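On the etiquette point, here is a minimal sketch of how a polite crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser. The user-agent name "MyCrawler" and the URLs are illustrative assumptions, not taken from the slides.

# Hedged sketch: honoring robots.txt before fetching (illustrative names and URLs).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.di.unipi.it/robots.txt")   # robots file of the target host
rp.read()                                          # download and parse it

url = "https://www.di.unipi.it/some-page"          # candidate page to crawl
if rp.can_fetch("MyCrawler", url):
    print("allowed to fetch", url)
else:
    print("robots.txt forbids fetching", url)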
Crawler "cycle of life"

Link Extractor:
  while (<Page Repository is not empty>) {
    <take a page p (check if it is new)>
    <extract links contained in p within href>
    <extract links contained in javascript>
    <extract .....>
    <insert these links into the Priority Queue>
  }

Crawler Manager:
  while (<Priority Queue is not empty>) {
    <extract some URLs u having the highest priority>
    foreach u extracted {
      if ( (u ∉ "Already Seen Pages") ||
           (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>) ) {
        <resolve u wrt DNS>
        <send u to the Assigned Repository>
      }
    }
  }

Downloaders:
  while (<Assigned Repository is not empty>) {
    <extract url u>
    <download page(u)>
    <send page(u) to the Page Repository>
    <store page(u) in a proper archive, possibly compressed>
  }

[Diagram: Link Extractor → PQ (Priority Queue) → Crawler Manager → AR (Assigned Repository) → Downloaders → PR (Page Repository) → Link Extractor]
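To make the cycle concrete, below is a minimal single-process sketch in Python of the three components sharing the three repositories (priority queue PQ, assigned repository AR, page repository PR). The priority function, the link-extraction regex, and the seed URL are illustrative assumptions of this sketch, not part of the original slides; a real crawler runs the components in parallel and keeps the repositories on disk.

# Minimal single-process sketch of the crawler "cycle of life" (illustrative only).
import heapq, re, urllib.request

PQ = []        # priority queue of (priority, url), handled by the Crawler Manager
AR = []        # assigned repository: URLs handed to the Downloaders
PR = []        # page repository: (url, html) pairs produced by the Downloaders
seen = set()   # "Already Seen Pages" test (a plain set here; see Bloom filters below)

def priority(url):
    return len(url)          # placeholder priority: shorter URLs first (assumption)

def crawler_manager():
    while PQ:
        _, u = heapq.heappop(PQ)
        if u not in seen:            # re-crawl of more recent versions omitted
            seen.add(u)
            AR.append(u)             # DNS resolution omitted

def downloaders():
    while AR:
        u = AR.pop()
        try:
            html = urllib.request.urlopen(u, timeout=5).read().decode("utf8", "ignore")
            PR.append((u, html))     # archiving/compression omitted
        except OSError:
            pass

def link_extractor():
    while PR:
        _, html = PR.pop()
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            heapq.heappush(PQ, (priority(link), link))

heapq.heappush(PQ, (0, "https://www.di.unipi.it/"))   # seed URL (assumption)
for _ in range(3):                                     # a few turns of the cycle
    crawler_manager(); downloaders(); link_extractor()
print(len(seen), "URLs seen")

In this sketch the "Already Seen Pages" test is a plain in-memory set; the slides discuss more scalable alternatives (Bloom filters, disk access with caching) further below.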
Page selection
• Given a page P, define how "good" P is.
• Several metrics:
  • BFS, DFS, random
  • Popularity driven (PageRank, full vs. partial)
  • Topic driven, or focused crawling
  • Combined

BFS
• "...BFS-order discovers the highest quality pages during the early stages of the crawl"
• 328 million URLs in the testbed
[Najork 01]
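One way to see the difference between these visit policies is that the crawl order depends only on how the frontier is managed. The sketch below is my own illustration, not from the slides: a FIFO queue yields BFS order, while a heap keyed by an estimated page score yields a popularity-driven order. The toy graph and scores are made-up assumptions.

# Sketch: the crawl order is determined by the frontier data structure (illustrative).
import heapq
from collections import deque

web = {  # toy web graph (assumption): page -> outgoing links
    "A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["A"],
}
score = {"A": 5, "B": 1, "C": 4, "D": 2, "E": 3}   # estimated page quality (assumption)

def bfs_crawl(seed):
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        p = frontier.popleft()            # FIFO -> BFS order
        order.append(p)
        for q in web[p]:
            if q not in seen:
                seen.add(q); frontier.append(q)
    return order

def popularity_crawl(seed):
    frontier, seen, order = [(-score[seed], seed)], {seed}, []
    while frontier:
        _, p = heapq.heappop(frontier)    # best estimated score first
        order.append(p)
        for q in web[p]:
            if q not in seen:
                seen.add(q); heapq.heappush(frontier, (-score[q], q))
    return order

print(bfs_crawl("A"))          # ['A', 'B', 'C', 'D', 'E']
print(popularity_crawl("A"))   # ['A', 'C', 'E', 'D', 'B']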
Is this page a new one?
• Check whether the page has been parsed or downloaded before:
  • after 20 million pages, we have "seen" over 200 million URLs
  • each URL is at least 100 bytes on average
  • overall, about 20 GB of URLs
• Options: compress the URLs in main memory, or use disk
  • Bloom filter (Archive) -- a minimal sketch follows
  • Disk access with caching (Mercator, AltaVista)
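Here is a minimal sketch of how a Bloom filter could support the URL-seen test. The bit-array size, the number of hash functions, and the use of MD5/SHA-1 digests for double hashing are illustrative assumptions of this sketch, not the parameters of any real crawler.

# Minimal Bloom-filter sketch for the URL-seen test (illustrative sizes and hashes).
import hashlib

class BloomFilter:
    def __init__(self, m_bits=8_000_000, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, url):
        # derive k bit positions from two independent digests (double hashing)
        h1 = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # no false negatives; false positives possible (a new URL reported as seen)
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("https://www.di.unipi.it/")
print("https://www.di.unipi.it/" in seen)   # True
print("https://www.unipi.it/" in seen)      # almost surely False

The space saving comes from storing only bit positions instead of the URLs themselves, at the price of occasionally re-downloading a page that was wrongly reported as already seen.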
Parallel Crawlers
The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication.
• Dynamic assignment
  • A central coordinator dynamically assigns URLs to the crawlers
  • Links are handed to the central coordinator
• Static assignment
  • The Web is statically partitioned and assigned to the crawlers
  • Each crawler crawls only its own part of the Web
Two problems

Let D be the number of downloaders. hash(URL) maps a URL to {0, ..., D-1}; downloader x fetches the URLs U such that hash(U) = x.

• Load balancing the number of URLs assigned to the downloaders:
  • Static schemes based on hosts may fail
    • www.geocities.com/….
    • www.di.unipi.it/
  • Dynamic "relocation" schemes may be complicated
• Managing fault tolerance (see the sketch below):
  • What about the death of a downloader? D → D-1, new hash!!!
  • What about a new downloader? D → D+1, new hash!!!
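To see why "new hash" is a problem, here is a small sketch (my own illustration, not from the slides) measuring how many URLs change downloader when D goes from 16 to 17 under plain modular hashing. The URL set is synthetic.

# Sketch: with hash(URL) mod D, changing D reassigns almost every URL (illustrative).
import hashlib

def assign(url, D):
    h = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
    return h % D   # downloader in {0, ..., D-1}

urls = [f"https://example.org/page/{i}" for i in range(100_000)]  # synthetic URLs (assumption)
moved = sum(assign(u, 16) != assign(u, 17) for u in urls)
print(f"{moved / len(urls):.0%} of URLs change downloader")       # roughly 94%, i.e. about 1 - 1/17

Consistent hashing, presented next, keeps this reshuffling proportional to 1/D instead.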
A nice technique: Consistent Hashing
• A tool for:
  • Spidering
  • Web caching
  • P2P
  • Router load balancing
  • Distributed file systems
• Items and servers are mapped to the unit circle
• Item K is assigned to the first server N such that ID(N) ≥ ID(K)
• What if a downloader goes down?
• What if a new downloader appears?
• Each server gets replicated log S times
Properties (S = #servers, I = #items):
[monotone] adding a new server moves items only from some old servers to the new one
[balance] the probability that an item goes to a given server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can replicate each server more times to smooth the load further
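Below is a minimal sketch of a consistent-hashing ring, assuming MD5-based positions on the circle and a fixed number of replicas per downloader; the class name, replica count, and URLs are illustrative assumptions, not a definitive implementation.

# Minimal consistent-hashing ring (illustrative): items and servers live on the same circle,
# an item goes to the first server clockwise from it; each server gets several replicas.
import bisect, hashlib

def ring_pos(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, replicas=100):
        self.replicas = replicas
        self.ring = []                          # sorted list of (position, server)

    def add_server(self, server):
        for r in range(self.replicas):          # log S replicas in theory; a constant here
            bisect.insort(self.ring, (ring_pos(f"{server}#{r}"), server))

    def remove_server(self, server):
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def server_for(self, url):
        p = ring_pos(url)
        i = bisect.bisect_right(self.ring, (p, ""))   # first replica clockwise from the URL
        return self.ring[i % len(self.ring)][1]       # wrap around the circle

ring = ConsistentHashRing()
for d in ["downloader-0", "downloader-1", "downloader-2"]:
    ring.add_server(d)

urls = [f"https://example.org/page/{i}" for i in range(10_000)]
before = {u: ring.server_for(u) for u in urls}
ring.add_server("downloader-3")                       # D -> D+1
moved = sum(before[u] != ring.server_for(u) for u in urls)
print(f"{moved / len(urls):.0%} of URLs moved")       # roughly 1/4, all toward downloader-3

Adding downloader-3 moves only the URLs that now fall in its arcs of the circle (about a 1/D fraction), which is exactly the monotone property claimed above; removing a downloader likewise redistributes only its own URLs to the neighboring servers.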