copia de page rank(2)


Upload: david-leon

Post on 04-Jul-2015


TRANSCRIPT

Page 1: Copia De Page Rank(2)

PageRank

What is PageRank
Why PageRank
Related work and problems
Link Structure of the Web
Definition of PageRank
Dangling Links
Implementation

Page 2: Copia De Page Rank(2)

PageRank (cont.)

What is PageRank

PageRank was proposed to measure the relative importance of web pages. It is a method for computing a ranking for every web page based on the graph of the web.

Page 3: Copia De Page Rank(2)

PageRank (cont.)

Why PageRank

The World Wide Web is very large and heterogeneous.
Search engines on the Web must also contend with inexperienced users and pages engineered to manipulate search engine ranking functions.

Page 4: Copia De Page Rank(2)

PageRank (cont.)

Unlike “flat” document collections, the World Wide Web is hypertext and provides considerable auxiliary information on top of the text of the web pages, such as link structure and link text. We can take advantage of the link structure of the web to produce a PageRank of every web page. It helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web.

Page 5: Copia De Page Rank(2)

PageRank (cont.)

Related work and problems

Backlink counts
Problem: for example, if a web page has a link off the Yahoo home page, it may be just one link, but it is a very important one. This page should be ranked higher than many pages with more backlinks but from obscure places.

The ranks and numbers of backlinks
This covers both the case when a page has many backlinks and the case when a page has a few highly ranked backlinks. Let u be a web page.

Page 6: Copia De Page Rank(2)

PageRank (cont.)

Page 7: Copia De Page Rank(2)

PageRank (cont.)

Let $B_u$ be the set of pages that point to u, let $N_u$ be the number of links from u, and let c be a factor used for normalization. A simplified version of PageRank is then:

$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$

Page 8: Copia De Page Rank(2)

PageRank (cont.)

Problem: the simplified ranking may form a rank sink. Consider two web pages that point to each other but to no other page, and suppose some other web page points to one of them. Then, during iteration, this loop will accumulate rank but never distribute any rank, because it has no outgoing links. The loop forms a sort of trap called a rank sink.
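The rank sink can be reproduced with a few lines of Python. This is only a sketch of the simplified formula with c = 1; the three-page graph and the names `simplified_step`, `out_links`, and `loop_share` are made up for illustration.

```python
def simplified_step(ranks, out_links):
    """One pass of R(u) = sum over backlinks v of R(v) / N_v (taking c = 1)."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in out_links.items():
        for target in targets:
            # Each page splits its current rank evenly over its forward links.
            new_ranks[target] += ranks[page] / len(targets)
    return new_ranks


# Hypothetical graph: "x" and "y" point only to each other, "s" points to "x".
out_links = {"s": ["x"], "x": ["y"], "y": ["x"]}
ranks = {page: 1.0 / 3 for page in out_links}

for _ in range(10):
    ranks = simplified_step(ranks, out_links)

# Share of the total rank that ends up inside the x <-> y loop.
loop_share = ranks["x"] + ranks["y"]
```

Under these assumptions the loop starts with two thirds of the rank and ends up holding all of it, while "s" drops to zero, which is the trap described on this slide.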

Page 9: Copia De Page Rank(2)

PageRank (cont.)

Link Structure of the Web

Pages are nodes
Links are edges (outedges and inedges)

Every page has some forward links (outedges) and backlinks (inedges). We can never know whether we have found all the backlinks of a particular page, but if we have downloaded it, we know all of its forward links at that time. PageRank handles both cases and everything in between by recursively propagating weights through the link structure of the web.
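To make the node and edge view concrete, here is a minimal sketch that derives backlinks from the forward links of the pages downloaded so far. The function name `build_backlinks` and the sample pages are hypothetical, and the result is only as complete as the crawl, as the slide points out.

```python
from collections import defaultdict


def build_backlinks(forward_links):
    """Invert a {page: set of pages it links to} map into {page: set of backlinks}."""
    backlinks = defaultdict(set)
    for page, targets in forward_links.items():
        backlinks[page]  # make sure every downloaded page gets an entry
        for target in targets:
            backlinks[target].add(page)
    return dict(backlinks)


# Hypothetical crawl of three pages.
forward_links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
backlinks = build_backlinks(forward_links)
```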

Page 10: Copia De Page Rank(2)

PageRank (cont.)

Definition of PageRank

We assume page A has pages T1,…,Tn which point to it. The parameter d is a damping factor which can be set between 0 and 1 (usually d is set to 0.85). Also, C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

PR(A) = (1-d) + d*(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Page 11: Copia De Page Rank(2)

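Example (figure): pages T1, T2 and T3 point to page A.

T1: PR = 0.5, C(T1) = 3
T2: PR = 0.3, C(T2) = 4
T3: PR = 0.1, C(T3) = 5
A: labeled 2 in the figure, presumably C(A)

PR(A) = (1-d) + d*(PR(T1)/C(T1) + PR(T2)/C(T2) + PR(T3)/C(T3))
      = 0.15 + 0.85*(0.5/3 + 0.3/4 + 0.1/5)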
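The arithmetic of this example can be checked with a short script. It is a sketch only: the helper name `page_rank_of` is made up, and the inputs are the PR and C values shown in the figure.

```python
def page_rank_of(backlinks, d=0.85):
    """PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T that point to A."""
    return (1 - d) + d * sum(pr / c for pr, c in backlinks)


# (PR(T), C(T)) for T1, T2 and T3 as given in the figure.
pr_a = page_rank_of([(0.5, 3), (0.3, 4), (0.1, 5)])
print(round(pr_a, 4))  # 0.3724
```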

Page 12: Copia De Page Rank(2)

PageRank (cont.)

Let A be a square matrix with the rows and columns corresponding to web pages. Let $A_{u,v} = 1/N_u$ if there is an edge from u to v and $A_{u,v} = 0$ if not. If we treat R as a vector over web pages, then we have

$R = d\left(A^{T}R + \left(\frac{1}{d} - 1\right)E\right)$

which is the same as $R = dA^{T}R + (1-d)E$. Here E is a uniform vector. The transpose is needed with this definition of A, because a page receives rank along its backlinks. Since $\|R\|_1 = 1$ and the entries of R are nonnegative, $\mathbf{1}^{T}R = 1$ (with $\mathbf{1}$ the all-ones vector), so we can rewrite this as

$R = d\left(A^{T} + \left(\frac{1}{d} - 1\right)E\,\mathbf{1}^{T}\right)R$

So R is an eigenvector of $A^{T} + \left(\frac{1}{d} - 1\right)E\,\mathbf{1}^{T}$ with eigenvalue $1/d$.
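The matrix form can be checked numerically. The sketch below is pure Python and makes several assumptions not stated on the slide: E has every entry equal to 1/n so that the ranks sum to 1, every page has at least one outgoing link, and the number of iterations is fixed. The names `link_matrix`, `pagerank_vector`, and `apply_m` are hypothetical.

```python
def link_matrix(out_links, pages):
    """Build A with A[u][v] = 1/N_u if u links to v, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    a = [[0.0] * n for _ in range(n)]
    for u, targets in out_links.items():
        for v in targets:
            a[index[u]][index[v]] = 1.0 / len(targets)
    return a


def pagerank_vector(a, d=0.85, iterations=200):
    """Iterate R = d * A^T R + (1 - d) * E with E = (1/n, ..., 1/n) (assumed)."""
    n = len(a)
    e = 1.0 / n
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [d * sum(a[v][u] * r[v] for v in range(n)) + (1.0 - d) * e
             for u in range(n)]
    return r


def apply_m(a, r, d=0.85):
    """Compute (A^T + (1/d - 1) * E * 1^T) R for the same uniform E."""
    n = len(a)
    e = 1.0 / n
    total = sum(r)  # this is 1^T R
    return [sum(a[v][u] * r[v] for v in range(n)) + (1.0 / d - 1.0) * e * total
            for u in range(n)]


# Hypothetical three-page graph with no dangling pages.
pages = ["A", "B", "C"]
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
a = link_matrix(out_links, pages)
r = pagerank_vector(a)
m_r = apply_m(a, r)  # should equal r scaled by 1/d
```

On this graph `m_r` matches `r` divided by d up to rounding, which is the eigenvalue 1/d relation.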

Page 13: Copia De Page Rank(2)

PageRank (cont.)

Dangling Links

Dangling links are simply links that point to any page with no outgoing links. They affect the model because it is not clear where their weights should be distributed, and there are a large number of them. Because they do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated. After all the PageRanks are calculated, they can be added back in without affecting things significantly.
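One possible way to set dangling links aside is sketched below. It assumes links are stored as (source, target) pairs and repeats the removal, since dropping a link can leave its source page without outgoing links. The name `split_dangling` is hypothetical.

```python
def split_dangling(links):
    """Split (source, target) links into (kept, removed).

    A link is dangling when its target has no outgoing links among the
    links still kept, so the check is repeated until nothing changes.
    """
    kept = list(links)
    removed = []
    while True:
        sources = {source for source, _ in kept}
        dangling = [link for link in kept if link[1] not in sources]
        if not dangling:
            return kept, removed
        removed.extend(dangling)
        kept = [link for link in kept if link[1] in sources]


# Hypothetical link list: page 4 has no outgoing links, and page 3 has none
# once the link 3 -> 4 is removed.
kept, removed = split_dangling([(1, 2), (2, 1), (2, 3), (3, 4)])
```

The `removed` list is kept so the links can be added back after the ranks have been computed.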

Page 14: Copia De Page Rank(2)

PageRank (cont.)

Implementation

Sort the link structure by ParentID
Remove dangling links from the link database
Make an initial assignment of the ranks
Allocate memory for the weights of every page
After the weights have converged, add the dangling links back in and recompute the rankings
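The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the implementation the slides describe: links are (ParentID, ChildID) pairs held in memory, the update is PR(A) = (1-d) + d*sum(PR(T)/C(T)), a fixed number of passes stands in for a convergence test, and the initial rank of 1.0 and the names `pagerank_pipeline`, `passes`, and `extra_passes` are made up.

```python
from collections import defaultdict


def remove_dangling(links):
    """Repeatedly drop links whose target has no outgoing links."""
    links = list(links)
    while True:
        parents = {parent for parent, _ in links}
        kept = [(p, c) for p, c in links if c in parents]
        if len(kept) == len(links):
            return kept
        links = kept


def iterate(links, ranks, d, passes):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) for a fixed number of passes."""
    out_count = defaultdict(int)
    for parent, _ in links:
        out_count[parent] += 1
    for _ in range(passes):
        new_ranks = {page: 1.0 - d for page in ranks}
        for parent, child in links:
            new_ranks[child] += d * ranks[parent] / out_count[parent]
        ranks = new_ranks
    return ranks


def pagerank_pipeline(links, d=0.85, passes=50, extra_passes=5):
    links = sorted(links)                       # sort by ParentID
    core = remove_dangling(links)               # remove dangling links
    pages = {page for link in core for page in link}
    ranks = {page: 1.0 for page in pages}       # initial assignment, one weight per page
    ranks = iterate(core, ranks, d, passes)     # iterate (fixed passes, assumed enough)
    for link in links:                          # add the dangling links back in
        for page in link:
            ranks.setdefault(page, 1.0)
    return iterate(links, ranks, d, extra_passes)  # recompute the rankings


# Hypothetical link database: page 3 has no outgoing links.
result = pagerank_pipeline([(2, 1), (1, 3), (1, 2)])
```

In this toy input, pages 2 and 3 each receive half of page 1's weight on every pass, so they end with the same rank once the dangling link is added back.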