Search Engines & Question Answering
Giuseppe Attardi
Università di Pisa



Topics

Web Search
– Search engines
– Architecture
– Crawling: parallel/distributed, focused
– Link analysis (Google PageRank)
– Scaling

Top Online Activities (Source: Jupiter Communications, 2000)
– Email: 96%
– Web Search: 88%
– Product Info. Search: 72%

Pew Study (US users, July 2002)

Total Internet users = 111 M
Do a search on any given day = 33 M
Have used Internet to search = 85%

http://www.pewinternet.org/reports/toc.asp?Report=64

Search on the Web

Corpus: the publicly accessible Web: static + dynamic

Goal: retrieve high-quality results relevant to the user's need
– (not docs!)

Need
– Informational – want to learn about something (~40%)
– Navigational – want to go to that page (~25%)
– Transactional – want to do something (web-mediated) (~35%)
  • Access a service
  • Downloads
  • Shop
– Gray areas
  • Find a good hub
  • Exploratory search: "see what's there"

Example queries: Low hemoglobin; United Airlines; Tampere weather; Mars surface images; Nikon CoolPix; Car rental Finland

Results

Static pages (documents)
– text, mp3, images, video, ...

Dynamic pages = generated on request
– database access
– "the invisible web"
– proprietary content, etc.

Terminology

URL = Uniform Resource Locator

http://www.cism.it/cism/hotels_2001.htm
– Access method: http
– Host name: www.cism.it
– Page name: cism/hotels_2001.htm

Scale

Immense amount of content
– 2–10B static pages, doubling every 8–12 months
– Lexicon size: 10s–100s of millions of words

Authors galore (1 in 4 hosts runs a web server)

http://www.netcraft.com/Survey

Diversity

Languages/Encodings
– Hundreds (thousands?) of languages; W3C encodings: 55 (Jul 01) [W3C01]
– Home pages (1997): English 82%, next 15 languages 13% [Babe97]
– Google (mid 2001): English 53%, JGCFSKRIP 30%

Document & query topic

Popular Query Topics (from 1 million Google queries, Apr 2000):

Regional: Europe          1.8%    Business      7.2%
...                        ...    ...            ...
Business: Industries      2.3%    Recreation    7.3%
Computers: Internet       3.2%    Adult         8.0%
Computers: Software       3.4%    Society       8.7%
Adult: Image Galleries    4.4%    Regional     10.3%
Regional: North America   5.3%    Computers    13.8%
Arts: Music               6.1%    Arts         14.6%

Rate of change

[Cho00]: 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999

Mathematically, what does this seem to be?

Web idiosyncrasies

Distributed authorship
– Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods ...
– Not all have the purest motives in providing high-quality information: commercial motives drive "spamming" – 100s of millions of pages
– The open web is largely a marketing tool
  • IBM's home page does not contain "computer"

Other characteristics

Significant duplication
– Syntactic: 30%–40% (near) duplicates [Brod97, Shiv99b]
– Semantic: ???

High linkage
– ~8 links/page on average

Complex graph topology
– Not a small world; bow-tie structure [Brod00]

More on these corpus characteristics later
– how do we measure them?

Web search users

Ill-defined queries
– Short (AV 2001: 2.54 terms avg, 80% < 3 words)
– Imprecise terms
– Sub-optimal syntax (80% of queries without operator)
– Low effort

Wide variance in
– Needs
– Expectations
– Knowledge
– Bandwidth

Specific behavior
– 85% look over one result screen only (mostly above the fold)
– 78% of queries are not modified (one query/session)
– Follow links – "the scent of information" ...

Evolution of search engines

First generation – use only "on page" text data (1995–1997: AV, Excite, Lycos, etc.)
– Word frequency, language

Second generation – use off-page, web-specific data (from 1998; made popular by Google, but everyone uses it now)
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor text (how people refer to this page)

Third generation – answer "the need behind the query" (still experimental)
– Semantic analysis – what is this about?
– Focus on user need, rather than on query
– Context determination
– Helping the user
– Integration of search and text analysis

Third generation search engine: answering "the need behind the query"

Query language determination
– Different ranking (if the query is Japanese, do not return English)

Hard & soft matches
– Personalities (triggered on names)
– Cities (travel info, maps)
– Medical info (triggered on names and/or results)
– Stock quotes, news (triggered on stock symbol)
– Company info, ...

Integration of Search and Text Analysis

Answering "the need behind the query"

Context determination
– spatial (user location/target location)
– query stream (previous queries)
– personal (user profile)
– explicit (vertical search, family friendly)
– implicit (use AltaVista from AltaVista France)

Context use
– Result restriction
– Ranking modulation

The spatial context – geo-search

Two aspects
– Geo-coding: encode geographic coordinates to make search effective
– Geo-parsing: the process of identifying geographic context

Geo-coding
– Geometrical hierarchy (squares)
– Natural hierarchy (country, state, county, city, zip-codes, etc.)

Geo-parsing
– Pages (infer from phone nos., zip, etc.); about 10% feasible
– Queries (use dictionary of place names)
– Users: from IP data
– Mobile phones: in its infancy, many issues (display size, privacy, etc.)

Geo-search examples
– AV: barry bonds
– Lycos: palo alto
– Northern Light (now Divine Inc.)

Helping the user

UI
– spell checking
– query refinement
– query suggestion
– context transfer ...

Context-sensitive spell check

Search Engine Architecture

[Diagram: Crawlers, steered by Crawl Control, fill a Page Repository; an Indexer builds Text and Structure indexes and a Document Store; Link Analysis and Snippet Extraction support a Query Engine and Ranking module that turn Queries into Results.]

Components:
– Crawler
– Crawler control
– Indexes – text, structure, utility
– Page repository
– Indexer
– Collection analysis module
– Query engine
– Ranking module

Repository

"Hidden Treasures"

Storage

The page repository is a scalable storage system for web pages
– Allows the Crawler to store pages
– Allows the Indexer and Collection Analysis to retrieve them
– Similar to other data storage systems – DB or file systems
– Does not have to provide some of the other systems' features: transactions, logging, directory

Storage Issues

Scalability and seamless load distribution

Dual access modes
– Random access (used by the query engine for cached pages)
– Streaming access (used by the Indexer and Collection Analysis)

Large bulk updates – reclaim old space, avoid access/update conflicts

Obsolete pages – remove pages no longer on the web

Designing a Distributed Web Repository

Repository designed to work over a cluster of interconnected nodes
– Page distribution across nodes
– Physical organization within a node
– Update strategy

Page Distribution

How to choose a node to store a page
– Uniform distribution – any page can be sent to any node
– Hash distribution policy – hash page ID space into node ID space
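A minimal sketch of the hash distribution policy, assuming URLs serve as page IDs and the node count is fixed (both assumptions, not from the slides):

```python
import hashlib

def node_for_page(url: str, num_nodes: int) -> int:
    """Hash a page's ID (here: its URL) into the node ID space."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Every process computes the same node for the same page,
# so no central directory of page locations is needed.
node = node_for_page("http://www.cism.it/cism/hotels_2001.htm", 8)
assert 0 <= node < 8
```

Using a cryptographic digest rather than Python's built-in `hash` keeps the mapping stable across processes and runs.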

Organization Within a Node

Several operations required
– Add / remove a page
– High-speed streaming
– Random page access

Hashed organization
– Treat each disk as a hash bucket
– Assign according to a page's ID

Log organization
– Treat the disk as one file, and add pages at the end
– Support random access using a B-tree

Hybrid
– Hash-map a page to an extent and use log structure within an extent

Distribution Performance

                            Log    Hashed    Hashed-Log
Streaming performance       ++     -         +
Random access performance   +-     ++        +-
Page addition               ++     -         +

Update Strategies

Updates are generated by the crawler

Several characteristics
– Time at which the crawl occurs and the repository receives information
– Whether the crawl's information replaces the entire database or modifies parts of it

Batch vs. Steady

Batch mode
– Periodically executed
– Allocated a certain amount of time

Steady mode
– Runs all the time
– Always sends results back to the repository

Partial vs. Complete Crawls

A batch-mode crawler can
– Do a complete crawl every run, and replace the entire collection
– Recrawl only a specific subset, and apply updates to the existing collection – partial crawl

The repository can implement
– In-place update: quickly refresh pages
– Shadowing, update as another stage: avoid refresh-access conflicts

Shadowing resolves the conflicts between updates and reads for the queries

Batch mode suits shadowing well

A steady crawler suits in-place updates

Indexing

The Indexer Module

Creates two indexes:
– Text (content) index: uses "traditional" indexing methods like inverted indexing
– Structure (links) index: uses a directed graph of pages and links; sometimes also creates an inverted graph

The Link Analysis Module

Uses the 2 basic indexes created by the indexer module in order to assemble "Utility Indexes", e.g. a site index.

Inverted Index

A set of inverted lists, one per index term (word)

Inverted list of a term: a sorted list of locations in which the term appeared

Posting: a pair (w, l) where w is a word and l is one of its locations

Lexicon: holds all the index's terms, with statistics about each term (not the postings)
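A minimal sketch of these structures, assuming a location is a (document id, word position) pair and the lexicon statistic is document frequency (both illustrative choices):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a lexicon (term -> document frequency) and inverted lists
    (term -> sorted list of (doc_id, position) postings)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    lexicon = {term: len({d for d, _ in postings})
               for term, postings in index.items()}
    return lexicon, {term: sorted(p) for term, p in index.items()}

docs = {1: "web search engines", 2: "search the web"}
lexicon, index = build_inverted_index(docs)
```

Real engines compress the postings lists heavily (the compression-vs-speed trade-off on the next slide); this sketch keeps them as plain Python lists.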

Challenges

Index build must be:
– Fast
– Economic (unlike traditional index building)

Incremental indexing must be supported

Storage: compression vs. speed

Index Partitioning

Distributed text indexing can be done by:

Local inverted file (IF_L)
– Each node contains disjoint random pages
– Query is broadcast to all nodes
– Result is the join of the per-node answers

Global inverted file (IF_G)
– Each node is responsible only for a subset of terms in the collection
– Query is sent only to the appropriate nodes
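The routing difference between the two schemes can be sketched as follows (the term-hash function and node counts are illustrative assumptions):

```python
def hash_term(term: str) -> int:
    # Simple deterministic term hash (illustrative only).
    return sum(ord(c) for c in term)

def route_query_local(terms, num_nodes):
    """Local inverted files (IF_L): pages are spread over nodes,
    so every query must be broadcast to all nodes."""
    return list(range(num_nodes))

def route_query_global(terms, num_nodes):
    """Global inverted file (IF_G): each node owns a subset of terms,
    so the query goes only to the nodes owning its terms."""
    return sorted({hash_term(t) % num_nodes for t in terms})
```

IF_L touches every node per query but keeps indexing local; IF_G touches few nodes per query but must ship postings between nodes at build time.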

Indexing, Conclusion

Web page indexing is complicated due to its scale (millions of pages, hundreds of gigabytes)

Challenges: incremental indexing and personalization

Scaling

Google (Nov 2002):
– Number of pages: 3 billion
– Refresh interval: 1 month (1200 pages/sec)
– Queries/day: 150 million = 1700 q/s

Avg page size: 10 KB
Avg query size: 40 B
Avg result size: 5 KB
Avg links/page: 8

Size of Dataset

Total raw HTML data size: 3G x 10 KB = 30 TB

Inverted index ~= corpus = 30 TB

Using compression 3:1 → 20 TB data on disk

Single copy of index
– Index: (10 TB) / (100 GB per disk) = 100 disks
– Documents: (10 TB) / (100 GB per disk) = 100 disks

Query Load

1700 queries/sec

Rule of thumb: 20 q/s per CPU
– 85 clusters to answer queries

Cluster: 100 machines
Total = 85 x 100 = 8500 machines

Document servers
– Snippet search: 1000 snippets/s
– (1700 * 10 / 1000) * 100 = 1700 machines
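The back-of-the-envelope figures above can be reproduced directly (the per-query snippet count of 10 follows the slide's arithmetic):

```python
# Capacity estimate from the slides' numbers.
queries_per_sec = 150_000_000 / 86_400   # 150M queries/day ~ 1736 q/s (~1700)

qps_per_cpu = 20                          # rule of thumb
clusters = 1700 / qps_per_cpu             # index replicas needed -> 85
machines_per_cluster = 100                # one index copy spans 100 machines
total_index_machines = clusters * machines_per_cluster   # 8500

# Document servers: ~10 snippets per query, 1000 snippets/s per machine,
# and each document copy also spans 100 machines.
doc_machines = (1700 * 10 / 1000) * 100   # 1700
```

The point of the exercise: serving capacity is dominated by replication (85 copies of the index), not by the size of a single copy.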

Limits

Redirector: 4000 req/sec
Bandwidth: 1100 req/sec
Server: 22 q/s each
Cluster: 50 nodes = 1100 q/s = 95 million q/day

Scaling the Index

[Diagram: queries reach Google Web Servers through a hardware-based load balancer; the web servers fan out to index servers, document servers, the spell checker and the ad server.]

Pooled Shard Architecture

[Diagram: the Web Server sends each query through an index load balancer to index servers 1..K; each index server queries intermediate load balancers 1..N, one per shard pool; each pool holds replicas of one index shard S1..SN. Index server network: 1 Gb/s; pool network: 100 Mb/s.]

Replicated Index Architecture

[Diagram: the Web Server sends each query through an index load balancer to index servers 1..M; each index server holds a full copy of the index (all shards S1..SN locally). Index server network: 1 Gb/s; disks at 100 Mb/s.]

Index Replication

100 Mb/s bandwidth
20 TB x 8 / 100 Mb/s
Copying 20 TB requires one full day

Ranking

First generation ranking

Extended Boolean model
– Matches: exact, prefix, phrase, ...
– Operators: AND, OR, AND NOT, NEAR, ...
– Fields: TITLE:, URL:, HOST:, ...
– AND is somewhat easier to implement, maybe preferable as default for short queries

Ranking
– TF-like factors: TF, explicit keywords, words in title, explicit emphasis (headers), etc.
– IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language
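A minimal sketch of such TF x IDF-style scoring (the exact formula and weights are illustrative assumptions, not any engine's actual ranking function):

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Score a document for a query with a plain TF x IDF sum,
    the flavor of on-page, first-generation ranking."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term in tf and doc_freq.get(term, 0) > 0:
            # Rare terms (low document frequency) get high IDF weight.
            idf = math.log(num_docs / doc_freq[term])
            score += tf[term] * idf
    return score
```

A term like "venture" that occurs in few documents contributes far more than a ubiquitous term like "the", whose IDF is near zero.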

Second generation search engine

Ranking – use off-page, web-specific data
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor text (how people refer to this page)

Crawling
– Algorithms to create the best possible corpus

Connectivity analysis

Idea: mine hyperlink information in the Web

Assumptions:
– Links often connect related pages
– A link between pages is a recommendation: "people vote with their links"

Citation Analysis

Citation frequency

Co-citation coupling frequency
– Co-citations with a given author measure "impact"
– Co-citation analysis [Mcca90]

Bibliographic coupling frequency
– Articles that cite the same articles are related

Citation indexing
– Who is a given author cited by? (Garfield [Garf72])

Pinski and Narin

Query-independent ordering

First generation: using link counts as simple measures of popularity

Two basic suggestions:
– Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (3+2=5)
– Directed popularity: score of a page = number of its in-links (3)
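The two link-count heuristics can be sketched on a small graph (the graph and page names are illustrative):

```python
from collections import defaultdict

def popularity_scores(edges):
    """Compute both first-generation popularity variants from a link graph.
    edges: iterable of (source, target) pairs."""
    in_links = defaultdict(int)
    out_links = defaultdict(int)
    for src, dst in edges:
        out_links[src] += 1
        in_links[dst] += 1
    pages = set(in_links) | set(out_links)
    # Undirected: in-links + out-links; directed: in-links only.
    undirected = {p: in_links[p] + out_links[p] for p in pages}
    directed = {p: in_links[p] for p in pages}
    return undirected, directed

# Page "p" has 3 in-links and 2 out-links: scores 5 (undirected), 3 (directed).
edges = [("a", "p"), ("b", "p"), ("c", "p"), ("p", "x"), ("p", "y")]
undirected, directed = popularity_scores(edges)
```

Both counts are trivially spammable, which is exactly the exercise two slides below.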

Query processing

First retrieve all pages meeting the text query (say venture capital)

Order these by their link popularity (either variant on the previous page)

Spamming simple popularity

Exercise: how do you spam each of the following heuristics so your page gets a high score?

– Each page gets a score = the number of in-links plus the number of out-links
– Score of a page = number of its in-links

PageRank scoring

Imagine a browser doing a random walk on web pages:
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably (e.g., each of 3 out-links with probability 1/3)

"In the steady state" each page has a long-term visit rate – use this as the page's score

Not quite enough

The web is full of dead-ends
– Random walk can get stuck in dead-ends
– Makes no sense to talk about long-term visit rates

Teleporting

At each step, with probability 10%, jump to a random web page

With remaining probability (90%), go out on a random link
– If no out-link, stay put in this case

Result of teleporting

Now cannot get stuck locally

There is a long-term rate at which any page is visited (not obvious, will show this)

How do we compute this visit rate?

PageRank

Tries to capture the notion of "importance of a page"

Uses backlinks for ranking

Avoids trivial spamming: distributes a page's "voting power" among the pages it links to

An "important" page linking to a page will raise its rank more than a "not important" one

Simple PageRank

Given by:

  r(i) = \sum_{j \in B(i)} r(j) / N(j)

where
– B(i): set of pages that link to i
– N(j): number of outgoing links from j

Well defined if the link graph is strongly connected

Based on the "Random Surfer Model" – the rank of a page equals the probability of being at that page

In matrix form: r = A^t r, with r = [r(1), r(2), ..., r(m)] and a_{ij} = 1/N(i) if i points to j, 0 otherwise

Computation of PageRank (1)

Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv

Hence r is an eigenvector of A^t for eigenvalue 1

If G is strongly connected then r is unique

Computation of PageRank (2)

Computation of PageRank (3)

Simple PageRank can be computed by:

1. s ← any random vector
2. r ← A^t s
3. if ||r − s|| < ε goto 5
4. s ← r, goto 2
5. r is the PageRank vector

PageRank Example
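The power iteration from the slides above can be sketched as follows, with a decay/teleport factor folded in to handle non-strongly-connected graphs (the example graph, decay value and tolerance are illustrative assumptions):

```python
def pagerank(links, d=0.9, tol=1e-10):
    """Power iteration for PageRank with teleporting.
    links: dict page -> list of pages it links to (assumed non-empty).
    d: probability of following a link; 1-d is the random-jump probability."""
    pages = list(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    while True:
        # Every page gets the teleport share, plus link shares from in-links.
        nxt = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            share = d * r[p] / len(outs)
            for q in outs:
                nxt[q] += share
        if sum(abs(nxt[p] - r[p]) for p in pages) < tol:
            return nxt
        r = nxt

# Three pages in a cycle are symmetric, so each gets rank 1/3.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

With `d=1` this is exactly the Simple PageRank iteration r ← A^t s; `d<1` is the decay-factor fix introduced two slides below.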

Practical PageRank: Problem

The Web is not a strongly connected graph. It contains:
– "Rank sinks": clusters of pages without outgoing links; pages outside the cluster will be ranked 0
– "Rank leaks": a page without outgoing links; all pages will be ranked 0

Practical PageRank: Solution

Remove all rank leaks

Add a decay factor d to Simple PageRank:

  r(i) = d \sum_{j \in B(i)} r(j) / N(j) + (1 − d) / m

Based on the "Bored Surfer Model"

HITS: Hypertext Induced Topic Search

A query-dependent technique

Produces two scores:
– Authority: a page most likely to be relevant to a given query
– Hub: points to many Authorities

Contains two parts:
– Identifying the focused subgraph
– Link analysis

HITS: Identifying the Focused Subgraph

Subgraph creation from a t-sized page set:

1. R ← t initial pages
2. S ← R
3. for each page p ∈ R:
   (a) include in S all the pages that p points to
   (b) include in S (up to a maximum d) the pages that point to p
4. S holds the focused subgraph

(d reduces the influence of extremely popular pages like yahoo.com)

HITS: Link Analysis

Calculates Authority & Hub scores (a_i and h_i) for each page in S:

1. Initialize a_i, h_i (1 ≤ i ≤ n) arbitrarily
2. Repeat until convergence:
   (a) a_i = \sum_{j \in B(i)} h_j   (for all pages)
   (b) h_i = \sum_{j \in F(i)} a_j   (for all pages)

where B(i) is the set of pages pointing to i and F(i) the set of pages i points to

HITS: Link Analysis Computation

The scores can be computed as eigenvectors:

  a = A^t h
  h = A a

hence

  a = A^t A a
  h = A A^t h

where
– a: vector of Authorities' scores
– h: vector of Hubs' scores
– A: adjacency matrix in which a_{ij} = 1 if i points to j
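The hub/authority iteration can be sketched as follows; the per-step normalization (needed to keep the scores bounded) and the example graph are illustrative assumptions:

```python
def hits(links, iterations=50):
    """Iterative hub/authority computation on the focused subgraph.
    links: dict page -> list of pages it links to."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a_i = sum of hub scores of pages pointing to i
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        # h_i = sum of authority scores of pages i points to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# "p" is pointed to by two hubs, so it gets the top authority score.
auth, hub = hits({"h1": ["p", "q"], "h2": ["p"]})
```

The repeated multiplication is the same power iteration used for PageRank, here converging to the principal eigenvectors of A^t A and A A^t.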

Markov chains

A Markov chain consists of n states, plus an n × n transition probability matrix P

At each step, we are in exactly one of the states

For 1 ≤ i, j ≤ n, the matrix entry P_{ij} tells us the probability of j being the next state, given we are currently in state i (P_{ii} > 0 is OK)

  \sum_{j=1}^{n} P_{ij} = 1

Markov chains

Clearly, for all i, \sum_j P_{ij} = 1

Markov chains are abstractions of random walks

Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain

Ergodic Markov chains

A Markov chain is ergodic if
– you have a path from any state to any other
– you can be in any state at every time step, with non-zero probability

(A chain that strictly alternates between two states is not ergodic: each state is reachable only on even or odd steps.)

Ergodic Markov chains

For any ergodic Markov chain, there is a unique long-term visit rate for each state
– Steady-state distribution

Over a long time period, we visit each state in proportion to this rate

It doesn't matter where we start

Probability vectors

A probability (row) vector x = (x_1, ..., x_n) tells us where the walk is at any point

E.g., (0 0 0 ... 1 ... 0 0 0) means we're in state i

More generally, the vector x = (x_1, ..., x_n) means the walk is in state i with probability x_i, where

  \sum_{i=1}^{n} x_i = 1

Change in probability vector

If the probability vector is x = (x_1, ..., x_n) at this step, what is it at the next step?

Recall that row i of the transition probability matrix P tells us where we go next from state i

So from x, our next state is distributed as xP

Computing the visit rate

The steady state looks like a vector of probabilities a = (a_1, ..., a_n):
– a_i is the probability that we are in state i

Example: a two-state chain where state 1 moves to state 2 with probability 3/4 (stays with 1/4), and state 2 moves to state 1 with probability 1/4 (stays with 3/4). For this example, a_1 = 1/4 and a_2 = 3/4.

How do we compute this vector?

Let a = (a_1, ..., a_n) denote the row vector of steady-state probabilities

If our current position is described by a, then the next step is distributed as aP

But a is the steady state, so a = aP

Solving this matrix equation gives us a
– So a is the (left) eigenvector for P
– (Corresponds to the "principal" eigenvector of P with the largest eigenvalue)

One way of computing a

Recall that, regardless of where we start, we eventually reach the steady state a

Start with any distribution (say x = (1 0 ... 0))

After one step, we're at xP; after two steps at xP^2, then xP^3 and so on

"Eventually" means: for "large" k, xP^k = a

Algorithm: multiply x by increasing powers of P until the product looks stable
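The two-state example from the earlier slide can be checked with exactly this algorithm (the fixed iteration count stands in for a proper convergence test):

```python
def steady_state(P, x, steps=100):
    """Repeatedly apply x <- xP until the distribution stabilizes."""
    n = len(P)
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x

# Two-state chain from the slides: state 1 stays with prob 1/4, moves with 3/4;
# state 2 stays with prob 3/4, moves with 1/4.
P = [[0.25, 0.75],
     [0.25, 0.75]]
a = steady_state(P, [1.0, 0.0])
# a converges to (1/4, 3/4), matching the slide's example
```

Because the chain is ergodic, starting from (0, 1) instead of (1, 0) reaches the same vector.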

Lempel: SALSA

By applying the ergodic theorem, Lempel proved that:
– a_i is proportional to the number of incoming links

PageRank summary

Preprocessing:
– Given the graph of links, build matrix P
– From it compute a
– The entry a_i is a number between 0 and 1: the PageRank of page i

Query processing:
– Retrieve pages meeting the query
– Rank them by their PageRank
– Order is query-independent

The reality

PageRank is used in Google, but so are many other clever heuristics