page rank


Upload: ronnie-s-delgado

Post on 21-Jan-2015


Page 1: Page Rank

PageRank

- What is PageRank
- Why PageRank
- Related work and problems
- Link Structure of the Web
- Definition of PageRank
- Dangling Links
- Implementation

Page 2: Page Rank

PageRank (cont.)

What is PageRank

PageRank is proposed in order to measure the relative importance of web pages. It is a method for computing a ranking for every web page based on the graph of the web.

Page 3: Page Rank

PageRank (cont.)

Why PageRank

- The World Wide Web is very large and heterogeneous.
- Search engines on the Web must also contend with inexperienced users and pages engineered to manipulate search engine ranking functions.

Unlike “flat” document collections, the World Wide Web is hypertext and provides considerable auxiliary information on top of the text of the web pages, such as link structure and link text.

Page 4: Page Rank

PageRank (cont.)

We can take advantage of the link structure of the web to produce a PageRank of every web page. It helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web.

Page 5: Page Rank

PageRank (cont.)

Related work and problems

- Backlink counts
  Problem: for example, if a web page has a link off the Yahoo home page, it may be just one link, but it is a very important one. This page should be ranked higher than many pages with more backlinks but from obscure places.

- The ranks and numbers of backlinks
  This covers both the case when a page has many backlinks and the case when a page has a few highly ranked backlinks.

Page 6: Page Rank

PageRank (cont.)

Page 7: Page Rank

PageRank (cont.)

Let $u$ be a web page, let $B_u$ be the set of pages that point to $u$, let $N_u$ be the number of links from $u$, and let $c$ be a factor used for normalization. Then a simplified version of PageRank is:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
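The simplified version can be tried out on a small graph. The following is a minimal sketch, not the authors' implementation: the function name, the three-page graph, and the choice to pick c each round so that the ranks sum to 1 are all assumptions made for illustration.

```python
def simplified_pagerank(links, iterations=50):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u).

    links maps each page to the list of pages it links to.
    c is chosen each round so that the ranks sum to 1 (an assumption).
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v in pages:
            # Page v splits its rank evenly among its N_v forward links.
            for u in links[v]:
                new_rank[u] += rank[v] / len(links[v])
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}  # apply c
    return rank


# Hypothetical graph: A and B link to each other and to C; C links back to A.
toy_links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
toy_rank = simplified_pagerank(toy_links)
```

On this graph the ranks settle at 4/9 for A, 2/9 for B, and 1/3 for C: A ranks highest because it collects all of C's rank plus half of B's.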

Page 8: Page Rank

PageRank (cont.)

Problem: the simplified version may form a rank sink. Consider two web pages that point to each other but to no other page, and suppose some other web page points to one of them. Then, during iteration, this loop will accumulate rank but never distribute any rank. The loop forms a sort of trap called a rank sink.
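A rank sink is easy to reproduce numerically. The sketch below uses a hypothetical four-page graph and a helper name chosen for this example, and it fixes c = 1. It applies the simplified update repeatedly and measures how much of the total rank ends up inside the loop.

```python
def iterate_simplified(links, rank, steps):
    """Apply the simplified PageRank update `steps` times with c = 1."""
    for _ in range(steps):
        new_rank = {p: 0.0 for p in links}
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: "a" and "b" point only to each other (the loop),
# "c" points into the loop and to "d", and "d" points back to "c".
sink_links = {"a": ["b"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
start = {p: 0.25 for p in sink_links}
after = iterate_simplified(sink_links, start, 100)
loop_share = after["a"] + after["b"]  # fraction of all rank held by the loop
```

After 100 steps essentially all of the rank sits in the loop between "a" and "b", while "c" and "d" are left with almost nothing.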

Page 9: Page Rank

PageRank (cont.)

Link Structure of the Web

- Pages as nodes
- Links as edges (outedges and inedges)

Every page has some forward links (outedges) and backlinks (inedges). We can never know whether we have found all the backlinks of a particular page, but if we have downloaded it, we know all of its forward links at that time. PageRank handles both cases and everything in between by recursively propagating weights through the link structure of the web.
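One possible in-memory representation of this graph keeps two maps, one for forward links and one for backlinks. The sketch below is only an illustration: the function name and the sample crawl are assumptions.

```python
from collections import defaultdict


def build_link_maps(edges):
    """Return (forward, back) adjacency maps for (source, target) links."""
    forward = defaultdict(set)  # page -> pages it links to (outedges)
    back = defaultdict(set)     # page -> pages that link to it (inedges)
    for source, target in edges:
        forward[source].add(target)
        back[target].add(source)
    return forward, back


# Hypothetical crawl: only pages "x" and "y" have been downloaded so far.
crawled_edges = [("x", "y"), ("x", "z"), ("y", "z")]
forward, back = build_link_maps(crawled_edges)
```

Here `forward["x"]` is complete because "x" has been downloaded, whereas `back["z"]` holds only the backlinks found so far and may grow as more pages are crawled.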

Page 10: Page Rank

PageRank (cont.)

Definition of PageRank

We assume page A has pages T1, …, Tn, which point to it. The parameter d is a damping factor which can be set between 0 and 1 (usually d is set to 0.85). Also, C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

$$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$
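Because each page's rank depends on the ranks of the pages pointing to it, the definition is usually evaluated by repeated substitution. The following is a rough sketch of that idea. The function name, the starting value of 1.0 for every page, the fixed iteration count, and the three-page graph are assumptions for this example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to, so C(T) is
    len(links[T]). Every page is assumed to have at least one outgoing link.
    """
    pr = {page: 1.0 for page in links}  # assumed initial assignment
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in links}
        for t, targets in links.items():
            for a in targets:
                new_pr[a] += d * pr[t] / len(targets)
        pr = new_pr
    return pr


# Hypothetical graph: A -> B, C; B -> C; C -> A.
demo_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
demo_pr = pagerank(demo_links)
```

With this form of the formula the ranks sum to the number of pages (3 here) rather than to 1.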

Page 11: Page Rank

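Example (figure): pages T1, T2, and T3 each link to page A.

- T1: PR = 0.5, C(T1) = 3
- T2: PR = 0.3, C(T2) = 4
- T3: PR = 0.1, C(T3) = 5

PR(A) = (1-d) + d*(PR(T1)/C(T1) + PR(T2)/C(T2) + PR(T3)/C(T3)) = 0.15 + 0.85*(0.5/3 + 0.3/4 + 0.1/5)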
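The arithmetic in the example can be checked in a few lines. The variable names below are chosen for illustration; the numbers are the ones given in the example.

```python
d = 0.85
# (PR(T), C(T)) for T1, T2, T3, as given in the example.
backlinks = [(0.5, 3), (0.3, 4), (0.1, 5)]
pr_a = (1 - d) + d * sum(pr / c for pr, c in backlinks)
print(round(pr_a, 4))  # 0.3724
```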

Page 12: Page Rank

PageRank (cont.)

Let $A$ be a square matrix with the rows and columns corresponding to web pages. Let $A_{u,v} = 1/N_u$ if there is an edge from $u$ to $v$, where $N_u$ is the number of links from $u$, and $A_{u,v} = 0$ if not. If we treat $R$ as a vector over web pages, then we have $R = dA^{T}R + (1-d)E$; the transpose appears because row $u$ of $A$ holds the links out of $u$. Here $E$ is a uniform vector. Since $\|R\|_1 = 1$, we can rewrite this as $R = \left(dA^{T} + (1-d)E\mathbf{1}^{T}\right)R$, where $\mathbf{1}$ is the vector of all ones. So $R$ is an eigenvector of $dA^{T} + (1-d)E\mathbf{1}^{T}$ with eigenvalue 1.
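The eigenvector view can be checked on a tiny example. The sketch below makes several assumptions: a hypothetical 3-page web, E with entries 1/n so that it sums to 1, and R treated as a column vector (hence the transpose of A). It builds M = dAᵀ + (1-d)E1ᵀ in plain Python and applies it repeatedly to a starting vector.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


d = 0.85
n = 3
# Hypothetical 3-page web: page 0 -> 1, 2; page 1 -> 2; page 2 -> 0.
out_links = {0: [1, 2], 1: [2], 2: [0]}
# A[u][v] = 1/N_u if u links to v, else 0.
A = [
    [1.0 / len(out_links[u]) if v in out_links[u] else 0.0 for v in range(n)]
    for u in range(n)
]
E = [1.0 / n] * n  # assumed uniform vector whose entries sum to 1
# M[u][v] = d * A[v][u] + (1 - d) * E[u], i.e. M = d * A^T + (1 - d) * E * 1^T
M = [[d * A[v][u] + (1 - d) * E[u] for v in range(n)] for u in range(n)]

R = [1.0 / n] * n
for _ in range(200):
    R = matvec(M, R)  # power iteration; the entries of R keep summing to 1
```

After the loop, applying M once more leaves R unchanged up to rounding, which is what being an eigenvector with eigenvalue 1 means.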

Page 13: Page Rank

PageRank (cont.)

Dangling Links

Dangling links are simply links that point to any page with no outgoing links. They affect the model because it is not clear where their weights should be distributed, and there are a large number of them. Because they do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated. After all the PageRanks are calculated, they can be added back in without affecting things significantly.
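Removing dangling links could look roughly like the sketch below. The function name and the sample link list are hypothetical. Repeating the removal until no dangling links remain is also an assumption of this sketch, made because dropping a link can leave its source page without outgoing links.

```python
def remove_dangling_links(edges):
    """Drop links whose target has no outgoing links; return (kept, removed).

    The pass is repeated because removing a link can leave its source
    without outgoing links (an assumption of this sketch).
    """
    kept = list(edges)
    removed = []
    while True:
        sources = {s for s, _ in kept}
        dangling = [(s, t) for s, t in kept if t not in sources]
        if not dangling:
            return kept, removed
        removed.extend(dangling)
        kept = [(s, t) for s, t in kept if t in sources]


# Hypothetical link database of (source, target) pairs.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")]
kept, removed = remove_dangling_links(edges)
# kept    -> [("a", "b"), ("b", "a")]
# removed -> [("c", "d"), ("b", "c")]
```

The `removed` list is kept so that the links can be added back once the ranks have been calculated.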

Page 14: Page Rank

PageRank (cont.)

Implementation

1. Sort the link structure by ParentID.
2. Remove dangling links from the link database.
3. Make an initial assignment of the ranks.
4. Allocate memory for the weights of every page.
5. After the weights have converged, add the dangling links back in and recompute the rankings.
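The steps above can be sketched end to end on a small in-memory link list. This is an illustrative outline under stated assumptions, not the original implementation: the function names, the convergence tolerance, the initial rank of 1.0, and the number of recomputation rounds (`extra_rounds`) are all made up for the example.

```python
def update(links, rank, d):
    """One pass of PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the given links."""
    out_count = {}
    for parent, _ in links:
        out_count[parent] = out_count.get(parent, 0) + 1
    new_rank = {page: 1.0 - d for page in rank}
    for parent, child in links:
        new_rank[child] += d * rank[parent] / out_count[parent]
    return new_rank


def rank_pages(link_db, d=0.85, tol=1e-10, extra_rounds=5):
    # 1. Sort the link structure by parent ID.
    links = sorted(link_db)
    # 2. Remove dangling links, repeating until none remain (assumption).
    core = links
    while True:
        parents = {p for p, _ in core}
        pruned = [(p, c) for p, c in core if c in parents]
        if len(pruned) == len(core):
            break
        core = pruned
    # 3. and 4. Initial assignment, with one weight slot per page.
    pages = sorted({page for link in links for page in link})
    rank = {page: 1.0 for page in pages}
    # Iterate on the pruned link set until the weights converge.
    while True:
        new_rank = update(core, rank, d)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    # 5. Add the dangling links back in and recompute for a few rounds.
    for _ in range(extra_rounds):
        rank = update(links, rank, d)
    return rank


# Hypothetical link database of (parent, child) pairs.
link_db = [("b", "a"), ("a", "b"), ("b", "c"), ("c", "d")]
ranks = rank_pages(link_db)
```

Pages reached only through dangling links ("c" and "d" here) stay in the rank table throughout, but they only receive weight in the final recomputation rounds.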