Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Algoritmi per IR
Web Search

Goal of a Search Engine
Retrieve the docs that are "relevant" for the user query
• Doc: Word or PDF file, web page, email, blog, e-book, ...
• Query: the "bag of words" paradigm
Relevant ?!?
Two main difficulties

The Web:
• Size: more than tens of billions of pages
• Languages and encodings: hundreds...
• Distributed authorship: spam, format-less pages, ...
• Dynamic: after one year, only ~35% of the pages survive and ~20% are untouched
Extracting "significant data" is difficult!!

The User:
• Query composition: short (2.5 terms on average) and imprecise
• Query results: 85% of users look only at the first result page
• Several needs: informational, navigational, transactional
Matching "user needs" is difficult!!
Evolution of Search Engines
• First generation (1995-1997: AltaVista, Excite, Lycos, ...) -- uses only on-page, web-text data
  • Word frequency and language
• Second generation (1998: Google) -- uses off-page, web-graph data
  • Link (or connectivity) analysis
  • Anchor text (how people refer to a page)
• Third generation (Google, Yahoo, MSN, ASK, ...) -- answers "the need behind the query"
  • Focus on the "user need", rather than on the query
  • Integration of multiple data sources
  • Click-through data
• Fourth generation → Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
This is a search engine!!!
Algoritmi per IR
The structure of a Search Engine
The structure
[Architecture diagram: a Crawler, guided by a Control module, fills the Page archive; a Page Analyzer and the Indexer build the text, structure, and auxiliary indexes; at query time, the Query resolver and the Ranker use these indexes to answer the user query.]
Information Retrieval
Crawling
Spidering
• 24h a day, 7 days a week, "walking" over a graph
• What about the graph?
  • Bow-tie structure
  • Directed graph G = (N, E)
  • N changes (insertions, deletions): >> 50·10^9 nodes
  • E changes (insertions, deletions): > 10 links per node → 10 × 50·10^9 = 500·10^9 1-entries in the adjacency matrix
Crawling Issues
• How to crawl?
  • Quality: "best" pages first
  • Efficiency: avoid duplication (or near-duplication)
  • Etiquette: robots.txt, server-load concerns (minimize load; see the robots.txt sketch after this list)
• How much to crawl? How much to index?
  • Coverage: how big is the Web? How much of it do we cover?
  • Relative coverage: how much do the competitors have?
• How often to crawl?
  • Freshness: how much has changed?
• How to parallelize the process?
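On the etiquette point, here is a minimal sketch of how a polite crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser. The user-agent name "MyCrawler" and the URLs are illustrative assumptions, not taken from the slides.

# Hedged sketch: honoring robots.txt before fetching (illustrative names and URLs).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.di.unipi.it/robots.txt")   # robots file of the target host
rp.read()                                          # download and parse it

url = "https://www.di.unipi.it/some-page"          # candidate page to crawl
if rp.can_fetch("MyCrawler", url):
    print("allowed to fetch", url)
else:
    print("robots.txt forbids fetching", url)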
Crawler "cycle of life"

Link Extractor:
  while (<Page Repository is not empty>) {
    <take a page p (check if it is new)>
    <extract links contained in p within href>
    <extract links contained in javascript>
    <extract .....>
    <insert these links into the Priority Queue>
  }

Crawler Manager:
  while (<Priority Queue is not empty>) {
    <extract some URLs u having the highest priority>
    foreach u extracted {
      if ( (u ∉ "Already Seen Pages") ||
           (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>) ) {
        <resolve u wrt DNS>
        <send u to the Assigned Repository>
      }
    }
  }

Downloaders:
  while (<Assigned Repository is not empty>) {
    <extract url u>
    <download page(u)>
    <send page(u) to the Page Repository>
    <store page(u) in a proper archive, possibly compressed>
  }

[Diagram: Link Extractor → PQ (Priority Queue) → Crawler Manager → AR (Assigned Repository) → Downloaders → PR (Page Repository) → Link Extractor]
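To make the cycle concrete, below is a minimal single-process sketch in Python of the three components sharing the three repositories (priority queue PQ, assigned repository AR, page repository PR). The priority function, the link-extraction regex, and the seed URL are illustrative assumptions of this sketch, not part of the original slides; a real crawler runs the components in parallel and keeps the repositories on disk.

# Minimal single-process sketch of the crawler "cycle of life" (illustrative only).
import heapq, re, urllib.request

PQ = []        # priority queue of (priority, url), handled by the Crawler Manager
AR = []        # assigned repository: URLs handed to the Downloaders
PR = []        # page repository: (url, html) pairs produced by the Downloaders
seen = set()   # "Already Seen Pages" test (a plain set here; see Bloom filters below)

def priority(url):
    return len(url)          # placeholder priority: shorter URLs first (assumption)

def crawler_manager():
    while PQ:
        _, u = heapq.heappop(PQ)
        if u not in seen:            # re-crawl of more recent versions omitted
            seen.add(u)
            AR.append(u)             # DNS resolution omitted

def downloaders():
    while AR:
        u = AR.pop()
        try:
            html = urllib.request.urlopen(u, timeout=5).read().decode("utf8", "ignore")
            PR.append((u, html))     # archiving/compression omitted
        except OSError:
            pass

def link_extractor():
    while PR:
        _, html = PR.pop()
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            heapq.heappush(PQ, (priority(link), link))

heapq.heappush(PQ, (0, "https://www.di.unipi.it/"))   # seed URL (assumption)
for _ in range(3):                                     # a few turns of the cycle
    crawler_manager(); downloaders(); link_extractor()
print(len(seen), "URLs seen")

In this sketch the "Already Seen Pages" test is a plain in-memory set; the slides discuss more scalable alternatives (Bloom filters, disk access with caching) further below.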
Page selection
• Given a page P, define how "good" P is.
• Several metrics:
  • BFS, DFS, random
  • Popularity driven (PageRank, full vs. partial)
  • Topic driven, or focused crawling
  • Combined

BFS
• "...BFS-order discovers the highest quality pages during the early stages of the crawl"
• 328 million URLs in the testbed
[Najork 01]
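One way to see the difference between these visit policies is that the crawl order depends only on how the frontier is managed. The sketch below is my own illustration, not from the slides: a FIFO queue yields BFS order, while a heap keyed by an estimated page score yields a popularity-driven order. The toy graph and scores are made-up assumptions.

# Sketch: the crawl order is determined by the frontier data structure (illustrative).
import heapq
from collections import deque

web = {  # toy web graph (assumption): page -> outgoing links
    "A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["A"],
}
score = {"A": 5, "B": 1, "C": 4, "D": 2, "E": 3}   # estimated page quality (assumption)

def bfs_crawl(seed):
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        p = frontier.popleft()            # FIFO -> BFS order
        order.append(p)
        for q in web[p]:
            if q not in seen:
                seen.add(q); frontier.append(q)
    return order

def popularity_crawl(seed):
    frontier, seen, order = [(-score[seed], seed)], {seed}, []
    while frontier:
        _, p = heapq.heappop(frontier)    # best estimated score first
        order.append(p)
        for q in web[p]:
            if q not in seen:
                seen.add(q); heapq.heappush(frontier, (-score[q], q))
    return order

print(bfs_crawl("A"))          # ['A', 'B', 'C', 'D', 'E']
print(popularity_crawl("A"))   # ['A', 'C', 'E', 'D', 'B']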
Is this page a new one?
• Check whether the page has been parsed or downloaded before:
  • after 20 million pages, we have "seen" over 200 million URLs
  • each URL is at least 100 bytes on average
  • overall, about 20 GB of URLs
• Options: compress the URLs in main memory, or use disk
  • Bloom filter (Archive) -- a minimal sketch follows
  • Disk access with caching (Mercator, AltaVista)
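Here is a minimal sketch of how a Bloom filter could support the URL-seen test. The bit-array size, the number of hash functions, and the use of MD5/SHA-1 digests for double hashing are illustrative assumptions of this sketch, not the parameters of any real crawler.

# Minimal Bloom-filter sketch for the URL-seen test (illustrative sizes and hashes).
import hashlib

class BloomFilter:
    def __init__(self, m_bits=8_000_000, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, url):
        # derive k bit positions from two independent digests (double hashing)
        h1 = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # no false negatives; false positives possible (a new URL reported as seen)
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("https://www.di.unipi.it/")
print("https://www.di.unipi.it/" in seen)   # True
print("https://www.unipi.it/" in seen)      # almost surely False

The space saving comes from storing only bit positions instead of the URLs themselves, at the price of occasionally re-downloading a page that was wrongly reported as already seen.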
Parallel Crawlers
The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication.
• Dynamic assignment
  • A central coordinator dynamically assigns URLs to the crawlers
  • Links are handed to the central coordinator
• Static assignment
  • The Web is statically partitioned and assigned to the crawlers
  • Each crawler crawls only its own part of the Web
Two problems

Let D be the number of downloaders. hash(URL) maps a URL to {0, ..., D-1}; downloader x fetches the URLs U such that hash(U) = x.

• Load balancing the number of URLs assigned to the downloaders:
  • Static schemes based on hosts may fail
    • www.geocities.com/….
    • www.di.unipi.it/
  • Dynamic "relocation" schemes may be complicated
• Managing fault tolerance (see the sketch below):
  • What about the death of a downloader? D → D-1, new hash!!!
  • What about a new downloader? D → D+1, new hash!!!
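To see why "new hash" is a problem, here is a small sketch (my own illustration, not from the slides) measuring how many URLs change downloader when D goes from 16 to 17 under plain modular hashing. The URL set is synthetic.

# Sketch: with hash(URL) mod D, changing D reassigns almost every URL (illustrative).
import hashlib

def assign(url, D):
    h = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
    return h % D   # downloader in {0, ..., D-1}

urls = [f"https://example.org/page/{i}" for i in range(100_000)]  # synthetic URLs (assumption)
moved = sum(assign(u, 16) != assign(u, 17) for u in urls)
print(f"{moved / len(urls):.0%} of URLs change downloader")       # roughly 94%, i.e. about 1 - 1/17

Consistent hashing, presented next, keeps this reshuffling proportional to 1/D instead.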
A nice technique: Consistent Hashing
• A tool for:
  • Spidering
  • Web caching
  • P2P
  • Router load balancing
  • Distributed file systems
• Items and servers are mapped to the unit circle
• Item K is assigned to the first server N such that ID(N) ≥ ID(K)
• What if a downloader goes down?
• What if a new downloader appears?
• Each server gets replicated log S times
Properties (S = #servers, I = #items):
[monotone] adding a new server moves items only from some old servers to the new one
[balance] the probability that an item goes to a given server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can replicate each server more times to smooth the load further
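Below is a minimal sketch of a consistent-hashing ring, assuming MD5-based positions on the circle and a fixed number of replicas per downloader; the class name, replica count, and URLs are illustrative assumptions, not a definitive implementation.

# Minimal consistent-hashing ring (illustrative): items and servers live on the same circle,
# an item goes to the first server clockwise from it; each server gets several replicas.
import bisect, hashlib

def ring_pos(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, replicas=100):
        self.replicas = replicas
        self.ring = []                          # sorted list of (position, server)

    def add_server(self, server):
        for r in range(self.replicas):          # log S replicas in theory; a constant here
            bisect.insort(self.ring, (ring_pos(f"{server}#{r}"), server))

    def remove_server(self, server):
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def server_for(self, url):
        p = ring_pos(url)
        i = bisect.bisect_right(self.ring, (p, ""))   # first replica clockwise from the URL
        return self.ring[i % len(self.ring)][1]       # wrap around the circle

ring = ConsistentHashRing()
for d in ["downloader-0", "downloader-1", "downloader-2"]:
    ring.add_server(d)

urls = [f"https://example.org/page/{i}" for i in range(10_000)]
before = {u: ring.server_for(u) for u in urls}
ring.add_server("downloader-3")                       # D -> D+1
moved = sum(before[u] != ring.server_for(u) for u in urls)
print(f"{moved / len(urls):.0%} of URLs moved")       # roughly 1/4, all toward downloader-3

Adding downloader-3 moves only the URLs that now fall in its arcs of the circle (about a 1/D fraction), which is exactly the monotone property claimed above; removing a downloader likewise redistributes only its own URLs to the neighboring servers.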