Search Engines & Question Answering — Giuseppe Attardi, Università di Pisa
Topics
Web Search
– Search engines
– Architecture
– Crawling: parallel/distributed, focused
– Link analysis (Google PageRank)
– Scaling
Top Online Activities
[Chart — Source: Jupiter Communications, 2000: Web Search and Product Info. Search lead, with bars at 72%, 88% and 96% of users]
Pew Study (US users, July 2002)
– Total Internet users = 111 M
– Do a search on any given day = 33 M
– Have used Internet to search = 85%
www.pewinternet.org/reports/toc.asp?Report=64
Search on the Web
Corpus: the publicly accessible Web, static + dynamic
Goal: retrieve high-quality results relevant to the user's need
– (not docs!)
Need
– Informational – want to learn about something (~40%), e.g. "low hemoglobin"
– Navigational – want to go to that page (~25%), e.g. "United Airlines"
– Transactional – want to do something, web-mediated (~35%)
  • Access a service, e.g. "Tampere weather"
  • Downloads, e.g. "Mars surface images"
  • Shop, e.g. "Nikon CoolPix"
– Gray areas
  • Find a good hub, e.g. "car rental Finland"
  • Exploratory search: "see what's there"
Results
Static pages (documents)
– text, mp3, images, video, ...
Dynamic pages = generated on request
– data base access
– "the invisible web"
– proprietary content, etc.
Terminology
URL = Uniform Resource Locator
http://www.cism.it/cism/hotels_2001.htm
– Access method: http
– Host name: www.cism.it
– Page name: cism/hotels_2001.htm
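The split into the three parts above can be checked with Python's standard urllib.parse, using the slide's own example URL (a minimal illustration, not part of the original deck):

```python
from urllib.parse import urlparse

# Split the example URL into access method, host name and page name.
parts = urlparse("http://www.cism.it/cism/hotels_2001.htm")
print(parts.scheme)   # access method: "http"
print(parts.netloc)   # host name: "www.cism.it"
print(parts.path)     # page name: "/cism/hotels_2001.htm"
```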
Scale
Immense amount of content
– 2-10B static pages, doubling every 8-12 months
– Lexicon size: 10s-100s of millions of words
Authors galore (1 in 4 hosts runs a web server)
http://www.netcraft.com/Survey
Diversity
Languages/Encodings
– Hundreds (thousands?) of languages; W3C encodings: 55 (Jul 2001) [W3C01]
– Home pages (1997): English 82%, next 15: 13% [Babe97]
– Google (mid 2001): English 53%, JGCFSKRIP: 30%
Document & query topic
Popular Query Topics (from 1 million Google queries, Apr 2000)

  Arts          14.6%   |  Arts: Music              6.1%
  Computers     13.8%   |  Regional: North America  5.3%
  Regional      10.3%   |  Adult: Image Galleries   4.4%
  Society        8.7%   |  Computers: Software      3.4%
  Adult          8.0%   |  Computers: Internet      3.2%
  Recreation     7.3%   |  Business: Industries     2.3%
  Business       7.2%   |  Regional: Europe         1.8%
  …                     |  …
Rate of change
[Cho00]: 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999
Mathematically, what does this seem to be?
Web idiosyncrasies
Distributed authorship
– Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods …
– Not all have the purest motives in providing high-quality information: commercial motives drive "spamming" – 100s of millions of pages
– The open web is largely a marketing tool
  • IBM's home page does not contain the word "computer"
Other characteristics
Significant duplication
– Syntactic: 30%-40% (near) duplicates [Brod97, Shiv99b]
– Semantic: ???
High linkage
– ~8 links/page on average
Complex graph topology
– Not a small world; bow-tie structure [Brod00]
More on these corpus characteristics later
– how do we measure them?
Web search users
Ill-defined queries
– Short (AV 2001: 2.54 terms avg; 80% < 3 words)
– Imprecise terms
– Sub-optimal syntax (80% of queries without operators)
– Low effort
Wide variance in
– Needs
– Expectations
– Knowledge
– Bandwidth
Specific behavior
– 85% look at one result screen only (mostly above the fold)
– 78% of queries are not modified (one query/session)
– Follow links – "the scent of information" …
Evolution of search engines
First generation – use only "on page" text data
– Word frequency, language
– 1995-1997: AV, Excite, Lycos, etc.
Second generation – use off-page, web-specific data
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)
– From 1998; made popular by Google, but everyone does it now
Third generation – answer "the need behind the query"
– Semantic analysis – what is this about?
– Focus on user need, rather than on query
– Context determination
– Helping the user
– Integration of search and text analysis
– Still experimental
Third generation search engine: answering "the need behind the query"
Query language determination
– Different ranking (if the query is Japanese, do not return English)
Hard & soft matches
– Personalities (triggered on names)
– Cities (travel info, maps)
– Medical info (triggered on names and/or results)
– Stock quotes, news (triggered on stock symbol)
– Company info, …
Integration of search and text analysis
Answering "the need behind the query"
Context determination
– spatial (user location / target location)
– query stream (previous queries)
– personal (user profile)
– explicit (vertical search, family friendly)
– implicit (use AltaVista from AltaVista France)
Context use
– Result restriction
– Ranking modulation
The spatial context – geo-search
Two aspects
– Geo-coding: encode geographic coordinates to make search effective
– Geo-parsing: the process of identifying geographic context
Geo-coding
– Geometrical hierarchy (squares)
– Natural hierarchy (country, state, county, city, zip-codes, etc.)
Geo-parsing
– Pages (infer from phone nos, zip, etc.); about 10% feasible
– Queries (use dictionary of place names)
– Users
  • From IP data
– Mobile phones
  • In its infancy; many issues (display size, privacy, etc.)
Geo-search example: Northern Light (now Divine Inc)
Helping the user
UI
– spell checking
– query refinement
– query suggestion
– context transfer …
Search Engine Architecture
[Diagram: Crawlers, directed by Crawl Control, feed the Page Repository; the Indexer builds the Text and Structure indexes; Link Analysis builds utility indexes; the Query Engine and Ranking module serve Queries and Results, with Snippet Extraction over the Document Store]
Components:
– Crawler, Crawler control
– Page repository
– Indexer
– Indexes: text, structure, utility
– Collection analysis module
– Query engine
– Ranking module
Storage
The page repository is a scalable storage system for web pages
– Allows the Crawler to store pages
– Allows the Indexer and Collection Analysis to retrieve them
– Similar to other data storage systems – DB or file systems
– Does not have to provide some of the other systems' features: transactions, logging, directory
Storage Issues
Scalability and seamless load distribution
Dual access modes
– Random access (used by the query engine for cached pages)
– Streaming access (used by the Indexer and Collection Analysis)
Large bulk updates – reclaim old space, avoid access/update conflicts
Obsolete pages – remove pages no longer on the web
Designing a Distributed Web Repository
Repository designed to work over a cluster of interconnected nodes
– Page distribution across nodes
– Physical organization within a node
– Update strategy
Page Distribution
How to choose a node to store a page
– Uniform distribution – any page can be sent to any node
– Hash distribution policy – hash the page ID space into the node ID space
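A hash distribution policy can be sketched in a few lines. This is an illustrative assumption, not the repository's actual scheme; the function name and the choice of MD5 are hypothetical:

```python
import hashlib

def node_for_page(page_id: str, num_nodes: int) -> int:
    """Hash the page-ID space into the node-ID space."""
    digest = hashlib.md5(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Any node can compute a page's location independently: no central directory.
node = node_for_page("http://www.cism.it/cism/hotels_2001.htm", 8)
print(node)
```

The same idea underlies the global inverted-file partitioning discussed later: any deterministic hash gives every node the same view of where a page (or term) lives.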
Organization Within a Node
Several operations required
– Add / remove a page
– High-speed streaming
– Random page access
Hashed organization
– Treat each disk as a hash bucket
– Assign according to a page's ID
Log organization
– Treat the disk as one file, and add the page at the end
– Support random access using a B-tree
Hybrid
– Hash map a page to an extent and use log structure within an extent
Distribution Performance

                          Log    Hashed   Hashed Log
  Streaming performance   ++     -        +
  Random access perf.     +-     ++       +-
  Page addition           ++     -        +
Update Strategies
Updates are generated by the crawler
Several characteristics
– Time in which the crawl occurs and the repository receives information
– Whether the crawl's information replaces the entire database or modifies parts of it
Batch vs. Steady
Batch mode
– Periodically executed
– Allocated a certain amount of time
Steady mode
– Runs all the time
– Always sends results back to the repository
Partial vs. Complete Crawls
A batch mode crawler can
– Do a complete crawl every run, and replace the entire collection
– Recrawl only a specific subset, and apply updates to the existing collection – partial crawl
The repository can implement
– In-place update
  • Quickly refresh pages
– Shadowing, update as another stage
  • Avoid refresh-access conflicts
Partial vs. Complete Crawls
Shadowing resolves the conflicts between updates and reads for the queries
Batch mode suits shadowing well
A steady crawler suits in-place updates
The Indexer Module
Creates two indexes:
– Text (content) index: uses "traditional" indexing methods like inverted indexing
– Structure (links) index: uses a directed graph of pages and links; sometimes also creates an inverted graph
The Link Analysis Module
Uses the 2 basic indexes created by the indexer module in order to assemble "utility indexes"
– e.g. a site index
Inverted Index
A set of inverted lists, one per index term (word)
Inverted list of a term: a sorted list of locations in which the term appeared
Posting: a pair (w, l) where w is a word and l is one of its locations
Lexicon: holds all the index's terms, with statistics about the term (not the postings)
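The definitions above can be made concrete with a minimal sketch: postings are (doc, position) pairs, and the lexicon stores one simple statistic (document frequency) per term. The function name and toy documents are illustrative, not from the deck:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build sorted posting lists: term -> [(doc_id, position), ...]."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))   # one posting (w, l)
    # Lexicon: each term with a statistic (here: document frequency).
    lexicon = {term: len({d for d, _ in posts}) for term, posts in index.items()}
    return dict(index), lexicon

docs = {1: "search engines index the web", 2: "web search is popular"}
index, lexicon = build_inverted_index(docs)
print(index["web"])      # [(1, 4), (2, 0)]
print(lexicon["search"]) # 2 (appears in 2 documents)
```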
Challenges
Index build must be:
– Fast
– Economic (unlike traditional index building)
Incremental indexing must be supported
Storage: compression vs. speed
Index Partitioning
Distributed text indexing can be done by:
Local inverted file (IF_L)
– Each node contains disjoint random pages
– The query is broadcast to all nodes
– The result is the join of the per-node answers
Global inverted file (IF_G)
– Each node is responsible only for a subset of the terms in the collection
– The query is sent only to the appropriate node
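The two partitioning schemes can be contrasted in a small sketch. The data, the routing function and all names here are toy assumptions chosen only to show the query paths (broadcast-and-merge vs. route-to-owner):

```python
# Toy postings: term -> list of page IDs, to be spread across two nodes.
all_postings = {"web": [1, 2], "search": [1], "ranking": [2]}

# Local inverted file (IF_L): each node indexes a disjoint set of pages,
# so every query term is broadcast to all nodes and the answers are merged.
local_nodes = [
    {"web": [1], "search": [1]},    # node 0 indexes page 1
    {"web": [2], "ranking": [2]},   # node 1 indexes page 2
]

def query_local(nodes, term):
    hits = []
    for index in nodes:             # broadcast, then join the answers
        hits.extend(index.get(term, []))
    return sorted(hits)

# Global inverted file (IF_G): each node owns a subset of the terms,
# so a query is routed only to the node responsible for that term.
def owner(term, num_nodes):
    return sum(map(ord, term)) % num_nodes   # toy routing function

global_nodes = [{} for _ in range(2)]
for term, posting in all_postings.items():
    global_nodes[owner(term, 2)][term] = posting

def query_global(nodes, term):
    return sorted(nodes[owner(term, len(nodes))].get(term, []))

print(query_local(local_nodes, "web"))    # [1, 2]
```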
Indexing: Conclusion
Web page indexing is complicated due to its scale (billions of pages, hundreds of gigabytes)
Challenges: incremental indexing and personalization
Scaling
Google (Nov 2002):
– Number of pages: 3 billion
– Refresh interval: 1 month (1200 pages/sec)
– Queries/day: 150 million = 1700 q/s
– Avg page size: 10 KB
– Avg query size: 40 B
– Avg result size: 5 KB
– Avg links/page: 8
Size of Dataset
Total raw HTML data size: 3 G x 10 KB = 30 TB
Inverted index ~= corpus = 30 TB
Using 3:1 compression: 20 TB of data on disk (10 TB index + 10 TB documents)
Single copy of index
– Index: (10 TB) / (100 GB per disk) = 100 disks
– Documents: (10 TB) / (100 GB per disk) = 100 disks
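The slide's back-of-the-envelope sizing can be replayed directly (using 1 TB = 1000 GB and the figures as quoted):

```python
pages = 3e9                                   # 3 billion pages
raw_tb = pages * 10 / 1e9                     # 10 KB/page -> 30 TB raw HTML
index_tb = raw_tb                             # inverted index ~= corpus size
on_disk_tb = (raw_tb + index_tb) / 3          # 3:1 compression -> 20 TB on disk
index_disks = (on_disk_tb / 2) * 1000 / 100   # 10 TB over 100 GB disks = 100
doc_disks = index_disks                       # same count for the documents
print(raw_tb, on_disk_tb, index_disks)
```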
Query Load
1700 queries/sec
Rule of thumb: 20 q/s per CPU
– 85 clusters to answer queries
– Cluster: 100 machines
– Total = 85 x 100 = 8500
Document servers
– Snippet search: 1000 snippets/s
– (1700 * 10 / 1000) * 100 = 1700
Limits
Redirector: 4000 req/sec
Bandwidth: 1100 req/sec
Server: 22 q/s each
Cluster: 50 nodes = 1100 q/s = 95 million q/day
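The query-load arithmetic of the last two slides checks out numerically:

```python
qps = 150e6 / 86400                # 150 M queries/day ~= 1736 q/s (quoted: 1700)
clusters = 1700 / 20               # 20 q/s per CPU -> 85 clusters
machines = clusters * 100          # 100 machines per cluster -> 8500 machines
doc_servers = (1700 * 10 / 1000) * 100   # snippet servers -> 1700 machines
limit_qps = 50 * 22                # 50-node cluster at 22 q/s -> 1100 q/s
per_day = limit_qps * 86400        # ~95 million queries/day
print(clusters, machines, doc_servers, per_day)
```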
Scaling the Index
[Diagram: queries arrive at a hardware-based load balancer, which forwards them to Google Web Servers; each web server consults the index servers and document servers, plus a spell checker and an ad server]
Pooled Shard Architecture
[Diagram: the Web Server sends a query through an index load balancer to Index Servers 1…K; each index server resolves shards S1…SN through intermediate load balancers 1…N, each fronting a pool of shard index servers (SIS) for its shard; Pool Network at 1 Gb/s, Index Server Network at 100 Mb/s]
Replicated Index Architecture
[Diagram: the Web Server sends a query through an index load balancer to Index Servers 1…M; each index server holds a full copy of the index (shards S1…SN on its own SIS servers); Index Server Network with 1 Gb/s and 100 Mb/s links]
Index Replication
100 Mb/s bandwidth
20 TB x 8 / 100
Copying 20 TB requires one full day
First generation ranking
Extended Boolean model
– Matches: exact, prefix, phrase, …
– Operators: AND, OR, AND NOT, NEAR, …
– Fields: TITLE:, URL:, HOST:, …
– AND is somewhat easier to implement, maybe preferable as default for short queries
Ranking
– TF-like factors: TF, explicit keywords, words in title, explicit emphasis (headers), etc.
– IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language
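A minimal sketch of first-generation "on page" scoring, combining a TF factor with an IDF factor. The function, corpus statistics and weighting (raw TF times log-IDF, ignoring the title/emphasis boosts listed above) are simplifying assumptions:

```python
import math

def tfidf_score(query_terms, doc_terms, corpus_df, num_docs):
    """Score a document as the sum of TF x IDF over the query terms."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)          # term frequency in this document
        df = corpus_df.get(term, 0)         # document frequency in the corpus
        if tf == 0 or df == 0:
            continue
        score += tf * math.log(num_docs / df)   # rarer terms weigh more
    return score

corpus_df = {"venture": 10, "capital": 50, "the": 1000}
doc = "the venture fund raised venture capital".split()
score = tfidf_score(["venture", "capital"], doc, corpus_df, 1000)
print(score)
```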
Second generation search engine
Ranking – use off-page, web-specific data
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)
Crawling
– Algorithms to create the best possible corpus
Connectivity analysis
Idea: mine hyperlink information in the Web
Assumptions:
– Links often connect related pages
– A link between pages is a recommendation: "people vote with their links"
Citation Analysis
Citation frequency
Co-citation coupling frequency
– Co-citations with a given author measure "impact"
– Co-citation analysis [Mcca90]
Bibliographic coupling frequency
– Articles that co-cite the same articles are related
Citation indexing
– Who is a given author cited by? (Garfield [Garf72])
Pinski and Narin
Query-independent ordering
First generation: using link counts as simple measures of popularity
Two basic suggestions:
– Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (e.g. 3+2=5)
– Directed popularity: score of a page = number of its in-links (e.g. 3)
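Both counts fall out of the link graph directly. A small sketch with a toy graph (the edge list is illustrative; page "b" reproduces the slide's 3 in-links + 2 out-links example):

```python
# The web graph as (source, target) link pairs.
links = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "e"), ("b", "f")]

def undirected_popularity(page):
    return sum(1 for s, t in links if page in (s, t))   # in-links + out-links

def directed_popularity(page):
    return sum(1 for _, t in links if t == page)        # in-links only

print(undirected_popularity("b"))  # 3 in-links + 2 out-links = 5
print(directed_popularity("b"))    # 3
```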
Query processing
First retrieve all pages meeting the text query (say, venture capital)
Order these by their link popularity (either variant on the previous page)
Spamming simple popularity
Exercise: How do you spam each of the following heuristics so your page gets a high score?
– Each page gets a score = the number of in-links plus the number of out-links
– Score of a page = number of its in-links
PageRank scoring
Imagine a browser doing a random walk on web pages:
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably (with 3 out-links, each is followed with probability 1/3)
"In the steady state" each page has a long-term visit rate – use this as the page's score
Not quite enough
The web is full of dead-ends
– Random walk can get stuck in dead-ends
– Makes no sense to talk about long-term visit rates
Teleporting
At each step, with probability 10%, jump to a random web page
With remaining probability (90%), go out on a random link
– If no out-link, stay put in this case
Result of teleporting
Now cannot get stuck locally
There is a long-term rate at which any page is visited (not obvious; will show this)
How do we compute this visit rate?
PageRank
Tries to capture the notion of "importance of a page"
Uses backlinks for ranking
Avoids trivial spamming: distributes a page's "voting power" among the pages it links to
An "important" page linking to a page will raise its rank more than a "not important" one will
Simple PageRank
Given by:
    r(i) = Σ_{j ∈ B(i)} r(j) / N(j)
where
– B(i): set of pages linking to i
– N(j): number of outgoing links from j
In matrix form: r = [r(1), r(2), …, r(m)] and r = A^t r, where A_ij = 1/N(i) if i points to j, and 0 otherwise
Well defined if the link graph is strongly connected
Based on the "Random Surfer Model": the rank of a page equals the probability of being at that page
Computation of PageRank (1)
Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
Hence r is an eigenvector of A^t for eigenvalue 1
If G is strongly connected then r is unique
Computation of PageRank (2)
Simple PageRank can be computed by:
1. r ← any random vector
2. s ← A^t r
3. if ||r − s|| < ε goto 5
4. r ← s; goto 2
5. r is the PageRank vector
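The iteration above can be sketched directly, here over an adjacency-list graph rather than an explicit matrix (a minimal illustration; it assumes, as the slide does, a strongly connected graph where every page has out-links):

```python
def simple_pagerank(out_links, eps=1e-10):
    """Iterate r <- A^t r until ||r - s|| < eps."""
    pages = list(out_links)
    m = len(pages)
    r = {p: 1.0 / m for p in pages}            # any starting vector works
    while True:
        s = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = r[j] / len(targets)        # r(j) / N(j)
            for i in targets:
                s[i] += share
        if sum(abs(r[p] - s[p]) for p in pages) < eps:
            return s
        r = s

# A 3-cycle is strongly connected: every page ends up with rank 1/3.
ranks = simple_pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
print(ranks)
```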
Practical PageRank: Problem
The Web is not a strongly connected graph. It contains:
– "Rank sinks": clusters of pages without outgoing links; pages outside the cluster will be ranked 0
– "Rank leaks": a page without outgoing links; all pages will be ranked 0
Practical PageRank: Solution
Remove all page leaks
Add a decay factor d to Simple PageRank:
    r(i) = (1 − d)/m + d · Σ_{j ∈ B(i)} r(j) / N(j)
Based on the "Bored Surfer Model"
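The decayed recurrence can be sketched as a small fixed-iteration solver (the graph, function name and d = 0.85 are illustrative assumptions; like the slide, it presumes leaks have already been removed, i.e. every page has out-links):

```python
def pagerank(out_links, d=0.85, iters=100):
    """PageRank with decay factor d (the bored surfer jumps with prob. 1 - d)."""
    pages = list(out_links)
    m = len(pages)
    r = {p: 1.0 / m for p in pages}
    for _ in range(iters):
        s = {p: (1.0 - d) / m for p in pages}    # teleport share
        for j, targets in out_links.items():
            for i in targets:
                s[i] += d * r[j] / len(targets)  # d * r(j) / N(j)
        r = s
    return r

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(graph)
print(sorted(r, key=r.get, reverse=True))   # "c" collects the most rank
```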
HITS: Hypertext Induced Topic Search
A query-dependent technique
Produces two scores:
– Authority: a page most likely to be relevant to a given query
– Hub: points to many authorities
Contains two parts:
– Identifying the focused subgraph
– Link analysis
HITS: Identifying the Focused Subgraph
Subgraph creation from a t-sized page set:
1. R ← t initial pages
2. S ← R
3. for each page p ∈ R:
   (a) include in S all the pages that p points to
   (b) include in S (up to a maximum d) pages that point to p
4. S holds the focused graph
(d reduces the influence of extremely popular pages like yahoo.com)
HITS: Link Analysis
Calculates authority & hub scores (a_i & h_i) for each page in S:
1. Initialize a_i, h_i (1 ≤ i ≤ n) arbitrarily
2. Repeat until convergence:
   (a) a_i ← Σ_{j ∈ B(i)} h_j   (for all pages)
   (b) h_i ← Σ_{j ∈ F(i)} a_j   (for all pages)
   (normalizing the score vectors after each step)
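The mutual-reinforcement loop can be sketched over an edge list; B(i) are the pages linking to i, F(i) the pages i links to. The toy link set, fixed iteration count and L2 normalization are illustrative assumptions:

```python
def hits(links, iters=50):
    """Iterate authority/hub updates with normalization after each step."""
    pages = {p for edge in links for p in edge}
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iters):
        a = {p: sum(h[s] for s, t in links if t == p) for p in pages}  # B(i)
        h = {p: sum(a[t] for s, t in links if s == p) for p in pages}  # F(i)
        na = sum(v * v for v in a.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in h.values()) ** 0.5 or 1.0
        a = {p: v / na for p, v in a.items()}
        h = {p: v / nh for p, v in h.items()}
    return a, h

# "hub" points at two authorities; "auth1" is also cited by "other".
links = [("hub", "auth1"), ("hub", "auth2"), ("other", "auth1")]
a, h = hits(links)
print(max(a, key=a.get), max(h, key=h.get))  # auth1 hub
```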
HITS: Link Analysis Computation
Eigenvector computation can also be used:
    a = A^t h
    h = A a
hence
    a = A^t A a
    h = A A^t h
where
– a: vector of authority scores
– h: vector of hub scores
– A: adjacency matrix, with A_ij = 1 if i points to j
Markov chains
A Markov chain consists of n states, plus an n×n transition probability matrix P
At each step, we are in exactly one of the states
For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i
P_ii > 0 is OK (self-transitions are allowed)
Markov chains
Clearly, for all i: Σ_{j=1}^{n} P_ij = 1
Markov chains are abstractions of random walks
Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain
Ergodic Markov chains
A Markov chain is ergodic if
– you have a path from any state to any other
– you can be in any state at every time step, with non-zero probability
(A chain that deterministically alternates between states is not ergodic: each state is reachable only at even or odd steps)
Ergodic Markov chains
For any ergodic Markov chain, there is a unique long-term visit rate for each state
– Steady-state distribution
Over a long time-period, we visit each state in proportion to this rate
It doesn't matter where we start
Probability vectors
A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point
E.g., (0 0 0 … 1 … 0 0 0) means we're in state i
More generally, the vector x = (x1, …, xn) means the walk is in state i with probability x_i, so Σ_{i=1}^{n} x_i = 1
Change in probability vector
If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i
So from x, our next state is distributed as xP
Computing the visit rate
The steady state looks like a vector of probabilities a = (a1, …, an):
– a_i is the probability that we are in state i
Example: two states with P11 = 1/4, P12 = 3/4, P21 = 1/4, P22 = 3/4; here a1 = 1/4 and a2 = 3/4
How do we compute this vector?
How do we compute this vector?
Let a = (a1, …, an) denote the row vector of steady-state probabilities
If our current position is described by a, then the next step is distributed as aP
But a is the steady state, so a = aP
Solving this matrix equation gives us a
– So a is the (left) eigenvector for P
– (Corresponds to the "principal" eigenvector of P with the largest eigenvalue)
One way of computing a
Recall: regardless of where we start, we eventually reach the steady state a
Start with any distribution, say x = (1 0 … 0)
After one step we're at xP; after two steps at xP², then xP³, and so on
"Eventually" means: for "large" k, xP^k = a
Algorithm: multiply x by increasing powers of P until the product looks stable
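The power method above can be sketched in a few lines, checked against a two-state chain with P11 = P21 = 1/4 and P12 = P22 = 3/4, whose steady state is (1/4, 3/4). The fixed step count is a simplification; a real implementation would stop once the vector is stable:

```python
def steady_state(P, steps=100):
    """Multiply x by increasing powers of P until the product is stable."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)   # start anywhere, e.g. (1, 0, ..., 0)
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]  # x <- xP
    return x

P = [[0.25, 0.75],
     [0.25, 0.75]]
print(steady_state(P))   # -> [0.25, 0.75]
```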
Lempel: SALSA
By applying the ergodic theorem, Lempel has proved that:
– a_i is proportional to the number of incoming links
PageRank summary
Preprocessing:
– Given the graph of links, build matrix P
– From it compute a
– The entry a_i is a number between 0 and 1: the PageRank of page i
Query processing:
– Retrieve pages meeting the query
– Rank them by their PageRank
– Order is query-independent