
PageRank: Functional Dependencies

PAOLO BOLDI, MASSIMO SANTINI, and SEBASTIANO VIGNA
Università degli Studi di Milano

Categories and Subject Descriptors: H.3.3 [Information Storage And Retrieval]: Information Search and Retrieval; G.2.2 [Discrete Mathematics]: Graph Theory; G.3 [Mathematics of Computing]: Probability and Statistics—Markov processes

General Terms: Theory, Algorithms

Additional Key Words and Phrases: PageRank, damping factor, power method

ACM Reference Format: Boldi, P., Santini, M., and Vigna, S. 2009. PageRank: Functional dependencies. ACM Trans. Inform. Syst. 27, 4, Article 19 (November 2009), 23 pages. DOI = 10.1145/1629096.1629097 http://doi.acm.org/10.1145/1629096.1629097

    1. INTRODUCTION

PageRank [Page et al. 1999] is a ranking technique used by today's search engines. It is query independent and content independent; it can be computed offline using only the Web graph (the Web graph is the directed graph whose nodes are URLs and whose arcs correspond to hyperlinks). These features make it interesting when we need to assign an absolute measure of importance to each Web page.

Originally at the basis of Google's ranking algorithm, PageRank is now just one of the many parameters used by search engines to rank pages. Albeit no public information is available on the current degree of utilization of PageRank in real-world search engines, it is likely that in certain areas, for instance, selective crawling (deciding which pages to crawl) and inverted index reordering

Part of the results of this article were presented at the 14th World Wide Web Conference [Boldi et al. 2005]. This work has been partially supported by the EC Project DELIS and by the MIUR COFIN projects "Automi e linguaggi formali: aspetti matematici e applicativi" and "Grafi del web e ranking". Authors' addresses: P. Boldi, M. Santini, S. Vigna (contact author), Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano MI, Italy; email: [email protected]
© 2009 ACM 1046-8188/2009/11-ART19 $10.00 DOI 10.1145/1629096.1629097 http://doi.acm.org/10.1145/1629096.1629097


(permuting documents so that more important documents are returned first), PageRank (or one of its many variants) is still very useful [Vigna 2007].

PageRank (and more generally link analysis^1) is an interesting mathematical subject that has inspired research in a number of fields. For instance, even basic methods commonly used in numerical analysis for matrix computations become tricky to implement when the matrix size is of order 10^9; moreover, the matrix induced by a Web graph is significantly different from those commonly found in physics or statistics, so many results that are common in those areas are not applicable.

One suggestive way to describe the idea behind PageRank is the following: Consider a random surfer that starts from a random page, and at every time chooses the next page by clicking on one of the links in the current page (selected uniformly at random among the links present in the page). As a first approximation, we could define the rank of a page as the fraction of time that the surfer spent on that page. Clearly, important pages (i.e., pages that happen

to be linked by many other pages, or by few important ones) will be visited more often, which justifies the definition. However, as remarked by Page et al. [1999], this definition would be too simple minded, as certain pages (called therein rank sinks, and, in this article, buckets) would end up entrapping the surfer. To solve this problem, at every step the surfer clicks on a link only with probability α: with probability 1 − α, instead, the surfer will restart from another node chosen (uniformly) at random.
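The process just described can be simulated directly. The following is a minimal sketch (in Python, assuming nothing beyond the standard library) of the random-surfer interpretation on a small hypothetical graph; the adjacency lists, the damping factor, and the number of steps are illustrative choices, not data from the article.

```python
import random

# Small hypothetical directed graph as adjacency lists; node 4 only links to itself.
graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0, 1, 2, 4], 4: [4]}
alpha = 0.85        # damping factor: probability of following a link
steps = 1_000_000

visits = {node: 0 for node in graph}
current = random.choice(list(graph))
for _ in range(steps):
    visits[current] += 1
    out_links = graph[current]
    if out_links and random.random() < alpha:
        current = random.choice(out_links)    # click a link chosen uniformly at random
    else:
        current = random.choice(list(graph))  # restart from a uniformly chosen page

# The fraction of time spent on each page approximates its rank.
for node, count in sorted(visits.items()):
    print(node, count / steps)
```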

A significant part of the current knowledge about PageRank is scattered through the research laboratories of large search engines, and its analysis has remained largely in the realm of trade secrets and economic competition [Eiron et al. 2004]. We believe, however, that a scientific and detailed study of PageRank is essential to our understanding of the Web (independently of its usage in search engines), and we hope that this article can be a contribution to such a program.

PageRank is defined formally as the stationary distribution of a stochastic process whose states are the nodes of the Web graph. The process itself is obtained by mixing the normalized adjacency matrix of the Web graph (with some patches for nodes without outlinks that will be discussed later) with a trivial uniform process that is needed to make the mixture irreducible and aperiodic, so that the stationary distribution is well defined. The combination depends on a damping factor α ∈ [0..1), which will play a major role in this work (and corresponds to the probability that the surfer follows a link of the current page). When α is 0, the Web graph part of the process is annihilated, resulting in the trivial uniform process. As α gets closer to 1, the Web part becomes more and more important.

The problem of choosing α was curiously overlooked in the first papers about PageRank: yet, not only does PageRank change significantly when α is modified [Pretto 2002a, 2002b], but also the relative ordering of nodes determined by PageRank can be radically different [Langville and Meyer 2004].

1 It should be noted that link analysis has much older roots: in Sections 2 and 3 we highlight connections with two papers in sociometrics [Katz 1953] and bibliometrics [Pinski and Narin 1976].


The original value suggested by Brin and Page (α = 0.85) is the most common choice. Intuitively, 1 − α is the fraction of rank that we agree to spread uniformly on all pages. This amount will be then funneled through the outlinks. A common form of link spamming creates a large set of pages that funnel carefully all their rank towards a single page: Even if the set is made of irrelevant pages, they will receive their share of uniformly spread rank, and in the end the page pointed to by the set will be given a preposterously great importance.

It is natural to wonder what is the best value of the damping factor, if such a thing exists. In a way, when α gets close to 1 the Markov process is closer to the ideal one, which would somehow suggest that α should be chosen as close to 1 as possible. This observation is not new [Langville and Meyer 2004], but there is some naivety in it. The first issue is of computational nature: PageRank is traditionally computed using variants of the power method. The number of iterations required for this method to converge grows with α, and in addition more and more numerical precision is required as α gets closer to 1. But there is an even more fundamental reason not to choose a value of α too close to 1: We shall prove in Section 5 that when α goes to 1 PageRank gets concentrated in the recurrent states, which correspond essentially to the buckets: nondangling nodes whose strongly connected components have no path toward other components. This phenomenon gives a null PageRank to all the pages in the core component, something that is difficult to explain and that is in conflict with common sense. In other words, in real-world Web graphs the rank of all important nodes (in particular, all nodes of the core component) goes to 0 as α tends to 1.^2 We also study the precise limit behavior of PageRank as α → 1, thus solving a conjecture that was left open by Boldi et al. [2005]; actually, we derive a much stronger result connecting this behavior to the Cesàro limit of the transition matrix of the Markov chain underlying PageRank.

Thus, PageRank starts, when α = 0, from an uninformative uniform distribution and ends, when α → 1, in a counterintuitive distribution concentrated mostly in irrelevant nodes. As a result, both for choosing the correct damping factor and for detecting link spamming, being able to describe the behavior of PageRank when α changes is essential.

To proceed further in this direction, it is essential that we have at our disposal analytical tools that describe this behavior. To this purpose, we shall provide closed-form formulae for the derivatives of any order of PageRank with respect to α. Moreover, we show that the k-th coefficient of the PageRank power series (in α) can be easily computed during the k-th iteration of the power method. The most surprising consequence, easily derived from our formulae, is that the vectors computed during the PageRank computation for any α ∈ (0..1) can be used to approximate PageRank for every other α ∈ (0..1). Of course, the same coefficients can be used to approximate the derivatives, and we provide

2 We remark that in 2006 a very precise analysis of the distribution of PageRank was obtained by Avrachenkov et al. [2007], corroborating the results described by Boldi et al. [2005]. Using their analysis, the authors conclude that α should be set equal to 1/2.


Let us define d as the characteristic vector^7 of dangling nodes (i.e., the vector with 1 in positions corresponding to such nodes and 0 elsewhere). Let v and u be two distributions, which we will call the preference and the dangling node distribution, respectively. PageRank r_{v,u}(α) is defined (up to a scalar) by the eigenvector equation

r_{v,u}(α) (α(G + d^T u) + (1 − α)1^T v) = r_{v,u}(α),

that is, as the stationary state of the Markov chain α(G + d^T u) + (1 − α)1^T v: Such a chain is indeed unichain [Boldi et al. 2006], so the previous definition is well given. More precisely, we have a Markov chain with restart [Boldi et al. 2006] in which G + d^T u is the Markov chain (that follows the natural random walk on nondangling nodes, and moves to a node at random with distribution u when starting from a dangling node) and v is the restart vector. The damping factor α ∈ [0..1) determines how often the Markov chain follows the graph rather than moving to a random node according to the preference vector v.
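As a concrete illustration of this construction, here is a minimal numerical sketch (Python with NumPy, both assumed) that builds the patched matrix G + d^T u, mixes it with the rank-one restart 1^T v, and iterates a row vector to the stationary distribution. The small adjacency matrix, the uniform choices of u and v, and the iteration count are all hypothetical.

```python
import numpy as np

# Hypothetical 0/1 adjacency matrix; row i lists the out-links of node i. Node 3 is dangling.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

n = A.shape[0]
out_deg = A.sum(axis=1)
d = (out_deg == 0).astype(float)          # characteristic (row) vector of dangling nodes
G = np.divide(A, out_deg[:, None], out=np.zeros_like(A), where=out_deg[:, None] > 0)

u = np.full(n, 1.0 / n)                   # dangling-node distribution
v = np.full(n, 1.0 / n)                   # preference vector
alpha = 0.85

P_u = G + np.outer(d, u)                  # dangling rows patched with u
M = alpha * P_u + (1 - alpha) * np.outer(np.ones(n), v)   # Markov chain with restart

# PageRank is the stationary row vector r with r M = r; iterate r <- r M.
r = v.copy()
for _ in range(200):
    r = r @ M
print(r, r.sum())
```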

The preference vector is used to bias PageRank with respect to a selected set of trusted pages, or might depend on the user's preferences, in which case one speaks of personalized PageRank [Jeh and Widom 2003]. Clearly, the preference vector strongly influences PageRank, but in real-world crawls, which have a large number of dangling nodes (in particular if the graph contains the whole frontier of the crawl [Eiron et al. 2004], rather than just the visited nodes) the dangling node distribution u can also be very important. In the literature one can find several alternatives (e.g., u = v or u = 1/n).

Following Boldi et al. [2008], we distinguish clearly between strongly preferential PageRank, in which the preference and dangling node distributions are identical (i.e., u = v) and that corresponds to a topic or personalization bias, and weakly preferential PageRank, in which the preference and the dangling node distributions are not identical, and, in principle, uncorrelated. The distinction is not irrelevant, as the concordance between weakly and strongly preferential PageRank can be quite low [Boldi et al. 2008]. Both strongly and weakly preferential PageRank (and also pseudoranks defined in Section 7) have been used in the literature to define PageRank, so great care must be exercised when comparing results from different papers.

A related application of dominant eigenvectors is Pinski and Narin's [1976] influence weight: Given a square matrix counting the number of references among journals, they scale it so that elements in the i-th column are divided by the i-th row sum, and they propose to use the dominant left eigenvector of such matrix as a measure of influence. Clearly, however, the different scaling leads to a completely different ranking.

We provide a toy example graph, shown in Figure 1. It will be used in the rest of the article as a guide.

    7 All vectors in this work are row vectors.


Fig. 1. A toy example graph with n = 10 nodes.

    3. THE MANY NATURES OF PAGERANK

We introduced PageRank as the stationary state of a Markov chain. Actually, due to the presence of the damping factor, PageRank can be seen as a rational vector function r_{v,u}(α) associating to each value of α a different rank. As α goes from 0 to 1, the ranks change dramatically, and the main theme of this article is exactly the study of PageRank as a function of the damping factor.

Usually, though, one looks at r_{v,u}(α) only for a specific value of α. All algorithms to compute PageRank actually compute (or, more precisely, provide an estimate for) r_{v,u}(α) for some α that you plug into it, and it is by now an established use to choose α = 0.85. This choice was indeed proposed by Page et al. [1999].

Many authors have tried to devise a more thorough a posteriori justification for 0.85. It is easy to get convinced that choosing a small value for α is not appropriate, because too much weight would be given to the uniform part of M_{v,u}(α). On the other hand, a value of α too close to 1 leads to numerical instability. Using a disciplined approach, based on the assumption that most PageRank mass must belong to the core component, Avrachenkov et al. [2007] claim that α should be 1/2.

    In the rest of the article, we shall use the following matrices.

P_u := G + d^T u
M_{v,u}(α) := αP_u + (1 − α)1^T v

As a mnemonic, P_u is the patched version of G in which rows corresponding to dangling nodes have been patched with u, and M_{v,u}(α) is the actual Markov chain whose stationary distribution is PageRank. Note that, here and elsewhere, when a matrix or a vector is a function of the damping factor α ∈ [0..1), we will use a notation that reflects this fact.

Noting that r_{v,u}(α)1^T = 1, we get

r_{v,u}(α)(αP_u + (1 − α)1^T v) = r_{v,u}(α)
αr_{v,u}(α)P_u + (1 − α)v = r_{v,u}(α)
(1 − α)v = r_{v,u}(α)(I − αP_u),

    which yields the following closed formula for PageRank. 8

8 A particular case of this formula appears in Haveliwala and Kamvar [2003a, Lemma 3], albeit the factor 1 − α is missing, probably due to an oversight.


Fig. 2. The explicit formula of PageRank as a function of α with v = u = 1/10 for the graph shown in Figure 1.

r_{v,u}(α) = (1 − α)v(I − αP_u)^{-1}    (1)

Note that the preceding formula exhibits PageRank as a linear operator applied to the preference vector v. In particular, standard methods for solving linear systems can (and should) be used to compute it much more efficiently than with the power method. For instance, since I − αP_u is strictly diagonally dominant, the Gauss–Seidel method is guaranteed to converge, and in practice it features faster convergence than the power method (see, for example, Del Corso et al. [2006]).
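The following sketch (Python with NumPy, both assumed; the tiny matrix is hypothetical) solves the linear system behind formula (1) directly, and also runs a few Gauss–Seidel sweeps on the same system; for a Web-sized matrix one would of course use a sparse representation rather than dense arrays.

```python
import numpy as np

# Tiny hypothetical patched row-stochastic matrix P_u and uniform preference vector v.
P_u = np.array([[0.0, 0.5, 0.5],
                [1.0, 0.0, 0.0],
                [1/3, 1/3, 1/3]])   # last row: a dangling node patched with u = 1/n
n = P_u.shape[0]
v = np.full(n, 1.0 / n)
alpha = 0.85
I = np.eye(n)

# Formula (1): r = (1 - alpha) v (I - alpha P_u)^{-1}, i.e. solve the (transposed,
# since r is a row vector) linear system (I - alpha P_u)^T r^T = (1 - alpha) v^T.
r = np.linalg.solve((I - alpha * P_u).T, (1 - alpha) * v)
print(r, r.sum())          # sums to 1

# A few Gauss-Seidel sweeps on the same system; the system matrix is strictly
# diagonally dominant (by columns), so the iteration converges.
A = (I - alpha * P_u).T
b = (1 - alpha) * v
x = v.copy()
for _ in range(50):
    for i in range(n):
        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
print(x)
```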

The reader can see the PageRank vector of our worked-out example in Figure 2 (both v and u are set to the uniform vector). PageRank is represented as a function of α in Figure 8.

    The linear operator in (1) can be written as

r_{v,u}(α) = (1 − α)v Σ_{k=0}^∞ (αP_u)^k,    (2)

which makes the dependence of PageRank on incoming paths very explicit: PageRank is computed by diffusing the base preference along all outgoing paths with decay α. From this formulation it is also immediate to derive the combinatorial description of PageRank of a node x in terms of a summation of weights of paths coming into x [Brinkmeier 2006].

Actually, this formulation makes the connection with Katz's index [Katz 1953]^9 immediate: Katz considers a 0-1 matrix G representing choice (as in vote-for) of individuals. He then uses (in our notation) a scalar multiple of the vector

1G Σ_{k=0}^∞ α^k G^k

to establish a measure of authoritativeness of each individual. The attenuation factor α is used to decrease influence of a vote at larger distance. The latter formula is very similar to (2) with v = 1G (and, in view of the results reported

9 We thank D. Gleich for having pointed us to this reference.


of Section 7, also u = v), but of course the lack of normalization radically changes the resulting vector.
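For comparison, here is a minimal sketch (Python with NumPy, both assumed) of a Katz-style score computed from the closed form 1G(I − αG)^{-1} of the summation above; the 0-1 matrix and the attenuation value are illustrative, and the attenuation must stay below the reciprocal of the spectral radius of G for the series to converge.

```python
import numpy as np

# Hypothetical 0-1 "vote" matrix: entry (i, j) = 1 if individual i votes for j.
G = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 0]], dtype=float)
n = G.shape[0]
alpha = 0.2   # attenuation factor, chosen below 1 / spectral_radius(G)

# Katz-style index: 1 G sum_k alpha^k G^k = 1 G (I - alpha G)^{-1}, up to a scalar multiple.
ones = np.ones(n)
katz = (ones @ G) @ np.linalg.inv(np.eye(n) - alpha * G)
print(katz)
```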

    4. POWER SERIES

Eq. (2) can actually be rewritten as follows:

r_{v,u}(α) = v + v Σ_{k=1}^∞ α^k (P_u^k − P_u^{k−1})    (3)

This formula suggests a way to study PageRank as a power series of α. If we want to follow this route, we must overcome two difficulties: First of all, we must compute explicitly the coefficients of the power series^10; and then, we must discuss how good the approximation obtained by truncating the series at a given step is. Both problems will be solved by a surprisingly simple relationship between the power series and the power method that will be proved in this section. To obtain our main result, we will need the following lemma (that can be easily restated in any R-algebra):

LEMMA 1. Let C be a set of square matrices of the same size, and Z ∈ C such that for every A ∈ C we have AZ = Z. Then for all A ∈ C, α ∈ R and for all n we have

(αA + (1 − α)Z)^n = α^n A^n + (1 − α) Σ_{k=0}^{n−1} α^k Z A^k,

or, equivalently,

(αA + (1 − α)Z)^n = (I − Z) α^n A^n + Z ( I + Σ_{k=1}^{n} α^k (A^k − A^{k−1}) ).

PROOF. By an easy induction. The first statement is trivial for n = 0. If we multiply both members by αA + (1 − α)Z on the right-hand side we have

α^{n+1} A^{n+1} + (1 − α) Σ_{k=0}^{n−1} α^{k+1} Z A^{k+1} + α^n (1 − α) Z + (1 − α)^2 Σ_{k=0}^{n−1} α^k Z
= α^{n+1} A^{n+1} + (1 − α) Σ_{k=0}^{n−1} α^{k+1} Z A^{k+1} + α^n (1 − α) Z + (1 − α)^2 (1 − α^n)/(1 − α) Z
= α^{n+1} A^{n+1} + (1 − α) Σ_{k=0}^{n} α^k Z A^k.

The second statement can then be proved by expanding the summation and collecting monomials according to the powers of α.

We are now ready for the main result of this section, which equates analytic approximation (the index at which we truncate the PageRank power series) with computational approximation (the number of iterations of the power method).

10 Note that the coefficients are vectors because we are approximating a vector function.


THEOREM 1. The approximation of PageRank computed at the n-th iteration of the power method with damping factor α and starting vector v coincides with the n-th degree truncation of the power series of PageRank evaluated in α. In other words, for every n,

v M_{v,u}(α)^n = v + v Σ_{k=1}^{n} α^k (P_u^k − P_u^{k−1}).

PROOF. Apply Lemma 1 with A = P_u, Z = 1^T v, and scalar α. We have

M_{v,u}(α)^n = (αP_u + (1 − α)1^T v)^n = (I − 1^T v) α^n P_u^n + 1^T v ( I + Σ_{k=1}^{n} α^k (P_u^k − P_u^{k−1}) ),

hence, noting that v 1^T v = v,

v M_{v,u}(α)^n = v (I − 1^T v) α^n P_u^n + v 1^T v ( I + Σ_{k=1}^{n} α^k (P_u^k − P_u^{k−1}) )
= v + v Σ_{k=1}^{n} α^k (P_u^k − P_u^{k−1}).

    As a consequence, we have the following corollary.

COROLLARY 1. The difference between the k-th and the (k − 1)-th approximation of PageRank (as computed by the power method with starting vector v), divided by α^k, is the k-th coefficient of the power series of PageRank.

The previous corollary is apparently innocuous; however, it has a surprising consequence: The data obtained computing PageRank for a given α, say^{11} α_0, can be used to compute PageRank for any other α_1, obtaining the same result that we would have obtained after the same number of iterations of the power method with α = α_1. Indeed, by saving the coefficients of the power series during the computation of PageRank with a specific α it is possible to study the behavior of PageRank when α varies (this result was used by Brezinski and Redivo-Zaglia [2006] to extrapolate PageRank values when α → 1). Even more is true, of course: Using standard series derivation techniques, one can approximate the k-th derivative. A useful bound for approximating derivatives will be given in Section 6.2.
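A minimal sketch of this consequence (Python with NumPy, both assumed; the small matrix and parameter values are hypothetical): the power-series coefficients are gathered for free while running the power method at one damping factor, and the truncated series is then re-evaluated at a different damping factor, matching the power method run directly at that value.

```python
import numpy as np

# Hypothetical patched row-stochastic matrix P_u and uniform preference vector v.
P_u = np.array([[0.0, 0.5, 0.5],
                [1.0, 0.0, 0.0],
                [1/3, 1/3, 1/3]])
n = P_u.shape[0]
v = np.full(n, 1.0 / n)
alpha0, alpha1, iters = 0.85, 0.5, 60

# Power method at alpha0, saving the power-series coefficients along the way
# (Corollary 1: a_k = (x_k - x_{k-1}) / alpha0^k).
M0 = alpha0 * P_u + (1 - alpha0) * np.outer(np.ones(n), v)
x_prev, coeffs = v.copy(), [v.copy()]            # a_0 = v
for k in range(1, iters + 1):
    x = x_prev @ M0
    coeffs.append((x - x_prev) / alpha0 ** k)    # a_k = v (P_u^k - P_u^{k-1})
    x_prev = x

# Re-evaluate the truncated series at a *different* damping factor alpha1 ...
r1_series = sum(a * alpha1 ** k for k, a in enumerate(coeffs))

# ... and compare with the power method run directly at alpha1 for the same number of steps.
M1 = alpha1 * P_u + (1 - alpha1) * np.outer(np.ones(n), v)
y = v.copy()
for _ in range(iters):
    y = y @ M1
print(np.max(np.abs(r1_series - y)))             # agrees up to floating-point error
```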

The first few coefficients of the power series for our worked-out example are shown in Table I. Figure 3 shows the convergence of the power series toward the actual PageRank behavior for a chosen node. Finally, in Figure 4 we display the approximation obtained by truncating the power series after the first 100 terms. For this experiment we used a 41 291 594-node snapshot of the Italian Web gathered by UbiCrawler [Boldi et al. 2004] and indexed by WebGraph [Boldi and Vigna 2004]. We chose four nodes with different behaviors (monotonic increasing/decreasing, unimodal concave/convex) to show that the

11 Actually, to compute the coefficients one can even use α_0 = 1.


Table I. The Coefficients of the First Terms of the Power Series for r_{v,u}(α)

Coefficient 0: 0.100, 0.100, 0.100, 0.100, 0.100, 0.100, 0.100, 0.100, 0.100, 0.100
Coefficient 1: −0.371, 0.058, 0.028, 0.028, −0.015, 0.034, 0.058, 0.058, 0.058, 0.058
Coefficient 2: 0.253, −0.070, 0.033, 0.018, 0.048, −0.003, −0.070, −0.070, −0.070, −0.070
Coefficient 3: −0.260, 0.055, −0.030, 0.021, −0.032, 0.026, 0.055, 0.055, 0.055, 0.055
Coefficient 4: 0.207, −0.050, 0.029, −0.013, 0.040, −0.012, −0.050, −0.050, −0.050, −0.050

(Each row lists the coefficient vector over nodes 0-9.)

Fig. 3. Approximating r(α) for a specific node (cross-shaped points) using Maclaurin polynomials of different degrees (shown in the legend).

approximation is excellent in all these cases; it is also curious to notice that apparently most nodes have one of these four behaviors, an empirical observation that probably deserves some deeper investigation.

    5. LIMIT BEHAVIOR

Power series offer an easy numerical way to study the behavior of PageRank as a function of α, but as α gets closer to 1 the approximation needs more and more terms to be useful (the other extremal behavior, i.e., α = 0, is trivial, since r_{v,u}(0) = v). Thus, this section is devoted to a formal analysis of the behavior of PageRank when α is in a neighborhood of 1.

When α → 1⁻, the transition matrix M_{v,u}(α) tends to P_u: This fact seems to suggest that choosing α close to 1 should give a truer or better PageRank. This is a widely diffused opinion (as we shall see, most probably a misconception). In any case, as we remarked in the Introduction there are some computational obstacles to choosing a value of α too close to 1. The power method converges more and more slowly [Haveliwala and Kamvar 2003b] as


Fig. 4. Examples of approximations obtained using a Maclaurin polynomial of degree 100, for nodes with different behaviors (the points were tabulated by computing PageRank explicitly with 100 regularly spaced values of α).

α → 1⁻, a fact that also influences the other methods used to compute PageRank (which are often themselves variants of the power method [Page et al. 1999; Haveliwala 1999; Golub and Greif 2006; Kamvar et al. 2003]). Indeed, the number of iterations required could in general be bounded using the separation between the first and the second eigenvalue, but unfortunately the separation can be abysmally small if α ≈ 1, making this technique not applicable. Moreover, if α is large the computation of PageRank may become numerically ill-conditioned (essentially for the same reason [Haveliwala and Kamvar 2003a]).

Even disregarding the problems discussed previously, we shall provide convincing reasons that make it inadvisable to use a value of α close to 1, unless P_u is suitably modified. First observe that, since r_{v,u}(α) is a rational (coordinatewise) bounded function defined on [0..1), it is defined on the whole complex plane except for a finite number of poles, and the limit

r_{v,u}^* = lim_{α→1⁻} r_{v,u}(α)


exists.

Fig. 5. The real and imaginary parts of the PageRank of node 0 of the graph shown in Figure 1, plotted for all complex values α with real and imaginary parts smaller than 10. Poles appear as spikes.

In fact, since the resolvent I/α − P_u has a Laurent expansion^{12} around 1 in the largest disc not containing 1/λ for another eigenvalue λ of P_u, PageRank is analytic in the same disc; a standard computation yields

(1 − α)(I − αP_u)^{-1} = P_u^* − Σ_{n=0}^∞ ((α − 1)/α)^{n+1} Q_u^{n+1},

where Q_u = (I − P_u + P_u^*)^{-1} − P_u^* and

P_u^* = lim_{n→∞} (1/n) Σ_{k=0}^{n−1} P_u^k

is the Cesàro limit of P_u [Iosifescu 1980], hence r_{v,u}^* = v P_u^*.

Figure 5 exhibits PageRank of a node as a complex function.

The preceding equation proves the conjecture left open in Boldi et al. [2005], actually providing a wide generalization of the conjecture itself, as there is no hypothesis (not even aperiodicity) on P_u.^{13}

It is easy to see that r_{v,u}^* is actually one of the invariant distributions of P_u (because lim_{α→1⁻} M_{v,u}(α) = P_u). Can we somehow characterize the properties of r_{v,u}^*? And what makes r_{v,u}^* different from the other (infinitely many, if P_u is reducible) invariant distributions of P_u?

The first question is the most interesting, because it is about what happens to PageRank when α → 1⁻; in a sense, fortunately, it is also the easiest to

12 A different expansion around 1, based on vector extrapolation techniques, has been proposed by Serra-Capizzano [2005].
13 D. Fogaras [Fogaras 2005] provided, immediately after our paper [Boldi et al. 2005] was presented, a proof of the conjecture for the uniform, periodic case using standard analytical tools. Unaware of Fogaras's proof, Bao and Liu [2006] provided later a similar, slightly more general proof for a generic preference vector.


answer. Before doing this, recall some basic definitions and facts about Markov chains.

Given two states x and y, we say that x leads to y iff there is some m > 0 such that there is a nonzero probability to go from x to y in m steps.

A state x is transient iff there is a state y such that x leads to y but y does not lead to x. A state is recurrent iff it is not transient.

In every invariant distribution p of a Markov chain, if p_x > 0 then x is recurrent [Iosifescu 1980].

    Let us now introduce some graph-theoretical notation. Let G be a graph.

Given a node x of G, we write [x]_G for the (strongly connected) component of G containing x.

The component graph of G is a graph whose nodes are the components of G, with an arc from [x]_G to [y]_G iff there are nodes x′ ∈ [x]_G and y′ ∈ [y]_G such that there is an arc from x′ to y′ in G. The component graph is acyclic, apart from the possible presence of loops.

If x, y are two nodes of G, we write x ⇝_G y iff there is a directed (possibly empty) path from x to y in G.

The aforesaid definitions are straightforwardly extended to the situation where G is the transition matrix of a Markov chain; in such a case, we assume as usual that there is an arc from x to y if and only if G_{xy} > 0.

Clearly, a node x is recurrent in P_u iff [x]_{P_u} is terminal; otherwise said, x is recurrent (in the Markov chain P_u) iff x ⇝_{P_u} y implies y ⇝_{P_u} x as well. Note that nodes with just a loop are recurrent.

We now turn to our characterization theorem, which identifies recurrent states on the basis of G, rather than P_u. The essence of the theorem is that, for what concerns recurrent states, the difference between G and P_u is not significant, except for a special case which, however, is as pathological as periodicity in a large Web graph.

To state and prove comfortably the next theorem, we need a definition.

Definition 1. A component is said to be a bucket component if it is terminal in the component graph, but it is not dangling (i.e., if it contains at least one arc, or, equivalently, if the component has a loop in the component graph). A bucket (node) is a node belonging to a bucket component.

Note that given a component [x] of a graph, it is always possible to reach a terminal component starting from [x]; such a component must be either dangling or a bucket. We shall use this fact tacitly in the following proof.

Actually, much more is true.

PROPOSITION 1. Let G and P_u be defined as before. Then, buckets of G are recurrent in P_u.

PROOF. If x is a bucket of G and x ⇝_{P_u} y, a path from x to y cannot traverse a dangling node of G (because x is a bucket), so actually x ⇝_G y, which implies


Fig. 6. Illustration of Theorem 2. The picture represents the component DAG of a graph (gray = nonterminal component; black = dangling component; white = bucket component); the curve indicates the part of the graph reachable from the support of u, and squared components indicate recurrent components. (Left) A situation covered by Theorem 2(1). (Right) The pathological situation covered by Theorem 2(2).

generally, graphs with no bucket reachable from the support of u), the recurrent nodes are exactly the buckets. Buckets are often called rank sinks, as they absorb all the rank circulating through the graph, but we prefer to avoid the term sink as it is already quite overloaded in the graph-theoretical literature. To help the reader understand Theorem 2, we show a pictorial example in Figure 6.

As we remarked, a real-world graph will certainly contain at least one bucket reachable from u, so the first statement of the theorem will hold. This means that most nodes x will be such that (r_{v,u}^*)_x = 0. In particular, this will be true of all the nodes in the core component [Kumar et al. 2000]: This result is somehow surprising, because it means that many important Web pages (that are contained in the core component) will have rank 0 in the limit (see, for instance, node 0 in our worked-out example). A detailed analysis of this limit behavior has been given by Avrachenkov et al. [2007].

This is a rather convincing justification that, contradicting common beliefs, choosing α too close to 1 does not provide any good PageRank. Rather, PageRank becomes sensible somewhere in between 0 and 1. If we are interested in studying PageRank-like phenomena in the neighborhood of 1, PageRank variants such as TruRank [Vigna 2005] should be used instead.

To clarify the preceding discussion, let us apply it to our toy example (always assuming u = v = 1/10). Node 3 is the only dangling node of the graph, but nodes 4 and 5 form a bucket component; all the other nodes are actually in a unique nonterminal component. Thus, the nonzero elements of P_u correspond exactly to the arcs of G and to the arcs connecting node 3 to every node in the graph, as shown in Figure 7 (left), where dotted arcs are those that were not present in G. Figure 7 (right) represents the component graph, and the dotted area encloses the components that are actually merged together by patching the dangling node. We are in the conditions of the first item of Corollary 2, and correspondingly Figure 8 shows that PageRank for nodes 4 and 5 grows, whereas for all other nodes it goes to 0 as α → 1⁻. Note, however, the maximum attained by node 0 at α ≈ 0.7.
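Bucket components as in Definition 1 are easy to detect mechanically: compute the strongly connected components of G and keep those that are terminal in the component graph and contain at least one arc. Below is a minimal sketch (Python with NumPy, both assumed; the small example graph is hypothetical, not the article's toy graph), using a naive reachability closure that is only adequate for tiny graphs.

```python
import numpy as np

def bucket_nodes(A):
    """Bucket nodes of the directed graph with 0/1 adjacency matrix A: nodes whose
    strongly connected component is terminal in the component graph and contains
    at least one arc (Definition 1)."""
    n = A.shape[0]
    R = ((np.eye(n) + A) > 0).astype(int)      # reachable in 0 or 1 steps
    for _ in range(n):
        R = ((R @ R) > 0).astype(int)          # naive transitive closure by repeated squaring
    R = R.astype(bool)
    scc = R & R.T                              # mutual reachability = same component
    buckets = set()
    for x in range(n):
        comp = np.flatnonzero(scc[x])
        terminal = not np.any(R[comp][:, ~scc[x]])    # nothing reachable outside the component
        has_arc = np.any(A[np.ix_(comp, comp)] > 0)   # the component is not a dangling singleton
        if terminal and has_arc:
            buckets.add(x)
    return buckets

# Hypothetical graph: a core cycle 0 -> 1 -> 2 -> 0, a bucket {3, 4}, and a dangling node 5.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3), (2, 5)]:
    A[i, j] = 1
print(bucket_nodes(A))   # {3, 4}: the nodes on which PageRank concentrates as alpha -> 1
```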


Fig. 7. The components of the graph in Figure 1 after the only dangling node has been patched with uniform distribution u = 1/10 (the arcs induced by the patching process are dashed) and the corresponding component graph. The dashed line in the component graph gathers components that are merged by the patching process. The only bucket component is {4, 5}.

Fig. 8. The behavior of the components of r_{1/10, 1/10}(α) (we only show some of them, for the sake of readability). They all go to zero except for nodes 4 and 5: the only nodes belonging to a bucket component. Note, however, the maximum attained by node 0 at α ≈ 0.7.

    6. DERIVATIVES

The reader should by now be convinced that the behavior of PageRank with respect to the damping factor is nonobvious: r_{v,u}(α) should be considered a function of α, and studied as such.

The standard tool for understanding changes in a real-valued function is the analysis of its derivatives. Correspondingly, we are going to provide mathematical support for this analysis.


    6.1 Exact Formulae

The main objective of this section is providing exact formulae for the derivatives of PageRank. Define r′_{v,u}(α), r″_{v,u}(α), ..., r^{(k)}_{v,u}(α) as the first, second, ..., k-th derivative of r_{v,u}(α) with respect to α.

    We start by providing the basic relations between these vector functions.

THEOREM 3. The following identities hold:

(1) r′_{v,u}(α) = (r_{v,u}(α)P_u − v)(I − αP_u)^{-1};
(2) for all k > 0, r^{(k+1)}_{v,u}(α) = (k + 1) r^{(k)}_{v,u}(α) P_u (I − αP_u)^{-1}.

PROOF. Multiplying both sides of (1) by I − αP_u and differentiating memberwise,

r′_{v,u}(α)(I − αP_u) − r_{v,u}(α)P_u = −v    (4)
r′_{v,u}(α)(I − αP_u) = r_{v,u}(α)P_u − v    (5)
r′_{v,u}(α) = (r_{v,u}(α)P_u − v)(I − αP_u)^{-1}.    (6)

This proves the first item; multiplying again both sides by I − αP_u and differentiating memberwise we obtain

r″_{v,u}(α)(I − αP_u) − r′_{v,u}(α)P_u = r′_{v,u}(α)P_u
r″_{v,u}(α)(I − αP_u) = 2 r′_{v,u}(α)P_u
r″_{v,u}(α) = 2 r′_{v,u}(α)P_u(I − αP_u)^{-1},

which accounts for the base case (k = 1) of an induction for the second statement. For the inductive step, multiplying both sides of the inductive hypothesis by I − αP_u and differentiating memberwise,

r^{(k+2)}_{v,u}(α)(I − αP_u) − r^{(k+1)}_{v,u}(α)P_u = (k + 1) r^{(k+1)}_{v,u}(α)P_u
r^{(k+2)}_{v,u}(α)(I − αP_u) = (k + 2) r^{(k+1)}_{v,u}(α)P_u
r^{(k+2)}_{v,u}(α) = (k + 2) r^{(k+1)}_{v,u}(α)P_u(I − αP_u)^{-1},

which accounts for the inductive step.

    Moreover, we can explicitly write a closed formula for the generic derivative.

COROLLARY 3. For every k > 0,

r^{(k)}_{v,u}(α) = k! v (P_u^k − P_u^{k−1}) (I − αP_u)^{−(k+1)}.

PROOF. The formula can be verified by induction on k, using Theorem 3 and (1).
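For a tiny dense matrix, the closed formula of Corollary 3 can be evaluated directly and checked against a finite-difference estimate. The following sketch (Python with NumPy, both assumed; matrix and parameters are hypothetical) does exactly that; for Web-scale matrices the explicit inverse is of course out of reach, which is the point of Section 6.2.

```python
import numpy as np
from math import factorial

# Hypothetical small patched row-stochastic matrix P_u and uniform v (row vectors throughout).
P_u = np.array([[0.0, 0.5, 0.5],
                [1.0, 0.0, 0.0],
                [1/3, 1/3, 1/3]])
n = P_u.shape[0]
v = np.full(n, 1.0 / n)
I = np.eye(n)

def pagerank(alpha):
    return (1 - alpha) * v @ np.linalg.inv(I - alpha * P_u)     # formula (1)

def derivative(alpha, k):
    # Corollary 3: r^(k)(alpha) = k! v (P_u^k - P_u^{k-1}) (I - alpha P_u)^{-(k+1)}
    diff = np.linalg.matrix_power(P_u, k) - np.linalg.matrix_power(P_u, k - 1)
    inv = np.linalg.inv(I - alpha * P_u)
    return factorial(k) * v @ diff @ np.linalg.matrix_power(inv, k + 1)

alpha, h = 0.85, 1e-5
print(derivative(alpha, 1))
print((pagerank(alpha + h) - pagerank(alpha - h)) / (2 * h))    # central difference, for comparison
```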

    6.2 Approximating the Derivatives

The formulae obtained in Section 6 do not lead directly to an effective algorithm that computes derivatives: Even assuming that the exact value of r_{v,u}(α) is


available, to obtain the derivatives one should invert I − αP_u (see Theorem 3), a heavy (in fact, unfeasible) computational task. However, in this section we shall provide a way to obtain simultaneous approximations for PageRank and its derivatives, and we will show how these approximations converge to the desired vectors. The technique we describe is essentially an extension of the power method that infers values of the derivatives by exploiting the connection pointed out in Theorem 1.

First of all, note that the k-th derivative can be obtained by deriving formally (2). To simplify the notation in the following computations, we rewrite (2) with a more compact notation. We have

r_{v,u}(α) = v + v Σ_{n=1}^∞ α^n (P_u^n − P_u^{n−1}) = Σ_{n=0}^∞ a_n α^n,

where a_0 = v and, for n > 0, a_n = v (P_u^n − P_u^{n−1}). By formal derivation, we obtain

r^{(k)}(α) = Σ_{n=0}^∞ n^{\underline{k}} a_n α^{n−k},    (7)

where we dropped the dependency on v and u to make notation less cluttered, and n^{\underline{k}} denotes the falling factorial n^{\underline{k}} = n(n − 1)(n − 2) ··· (n − k + 1). The Maclaurin polynomials of order t (that is, the t-th partial sum of the series (7)) will be denoted by [[r^{(k)}(α)]]_t.

THEOREM 4. If t ≥ k/(1 − α),

‖r^{(k)}(α) − [[r^{(k)}(α)]]_t‖ ≤ (β_t / (1 − β_t)) ‖[[r^{(k)}(α)]]_t − [[r^{(k)}(α)]]_{t−1}‖,

where

1 > β_t = α(t + 1)/(t + 1 − k).

Note that β_t < 1 and that β_t → α monotonically, so the theorem states that the error at step t is ultimately bounded by the difference between the t-th and the (t − 1)-th approximation. The difference between the two approximations is actually the t-th term, so we can also write

‖r^{(k)}(α) − [[r^{(k)}(α)]]_t‖ ≤ (β_t / (1 − β_t)) t^{\underline{k}} α^{t−k} ‖a_t‖.

    As a corollary, we have the following.

COROLLARY 4. ‖r^{(k)}(α) − [[r^{(k)}(α)]]_t‖ = O(t^k α^t) as t → ∞.

Note, however, that in practice Theorem 4 is much more useful than the last corollary, as convergence is usually quicker than O(t^k α^t) (much like the actual, error-estimated convergence of the power method for the computation of PageRank is quicker than the trivial O(α^t) bound would imply).


PROOF (OF THEOREM 4). We have to bound

‖r^{(k)}(α) − [[r^{(k)}(α)]]_t‖ = ‖ Σ_{n=t+1}^∞ n^{\underline{k}} a_n α^{n−k} ‖.

Since

‖v(P_u^{n+1} − P_u^n)‖ ≤ ‖v(P_u^n − P_u^{n−1})‖ ‖P_u‖, and (n + 1)^{\underline{k}} α^{n+1} = α (n + 1)/(n + 1 − k) · n^{\underline{k}} α^n,

the terms of the power series obey the upper bound

(n + 1)^{\underline{k}} α^{n+1} ‖a_{n+1}‖ ≤ α (n + 1)/(n + 1 − k) · n^{\underline{k}} α^n ‖a_n‖.

Thus, for every n ≥ t ≥ 0 we have the bound

n^{\underline{k}} α^n ‖a_n‖ ≤ (α (t + 1)/(t + 1 − k))^{n−t} t^{\underline{k}} α^t ‖a_t‖ = β_t^{n−t} t^{\underline{k}} α^t ‖a_t‖.

Hence, if t is such that β_t < 1, we have

‖r^{(k)}(α) − [[r^{(k)}(α)]]_t‖ ≤ Σ_{n=t+1}^∞ β_t^{n−t} t^{\underline{k}} ‖a_t‖ α^{t−k} = (β_t / (1 − β_t)) t^{\underline{k}} ‖a_t‖ α^{t−k}.

The previous results suggest a very simple way to compute any desired set of derivatives. Just run the power method and, as suggested in Section 4, gather the coefficients of the PageRank power series. Multiplying the n-th coefficient by n^{\underline{k}} is sufficient to get the coefficient for the k-th derivative, and after k/(1 − α) steps it will be possible to estimate the convergence using (4).

The same considerations made before apply: By storing the coefficients of the Maclaurin polynomials it will be possible to approximate every derivative for every value of α, albeit the approximation will be worse as the derivative index rises and as α → 1.
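A minimal sketch of this recipe (Python with NumPy, both assumed; matrix and parameters are hypothetical): the power-series coefficients are accumulated as in Section 4, each is scaled by the falling factorial and the appropriate power of α, and the truncated series (7) yields the derivative estimate.

```python
import numpy as np

# Hypothetical patched row-stochastic matrix P_u and uniform preference vector v.
P_u = np.array([[0.0, 0.5, 0.5],
                [1.0, 0.0, 0.0],
                [1/3, 1/3, 1/3]])
n = P_u.shape[0]
v = np.full(n, 1.0 / n)

def falling(m, k):
    out = 1
    for i in range(k):
        out *= m - i
    return out

def coefficients(t):
    """Power-series coefficients a_0, ..., a_t, with a_m = v (P_u^m - P_u^{m-1})."""
    a, prev = [v.copy()], v.copy()
    for _ in range(t):
        nxt = prev @ P_u
        a.append(nxt - prev)
        prev = nxt
    return a

def derivative_estimate(alpha, k, t):
    # Truncated series (7): r^(k)(alpha) ~ sum_{n <= t} n^(falling k) a_n alpha^(n - k).
    a = coefficients(t)
    return sum(falling(m, k) * a[m] * alpha ** (m - k) for m in range(k, t + 1))

print(derivative_estimate(0.85, 1, 400))   # first derivative at alpha = 0.85
```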

    7. PAGERANK AS A FUNCTION OF THE PREFERENCE VECTOR

The dependence of PageRank on the preference vector v and on the dangling node distribution u is also a topic that deserves some attention. With this aim, let us define the pseudorank [Boldi et al. 2008] (in G) of a distribution x and damping factor α ∈ [0..1) as

ψ_x(α) = (1 − α) x (I − αG)^{-1}.

For every fixed α, the pseudorank is a linear operator and the preceding definition can be extended by continuity to α = 1 even when 1 is an eigenvalue of G, always using the fact that I/α − G has a Laurent expansion around 1; once more,

lim_{α→1⁻} ψ_x(α) = x G^*.

When α < 1 the matrix I − αG is strictly diagonally dominant, so the Gauss–Seidel method can still be used to compute pseudoranks efficiently.


Armed with this definition, we state the main result of Boldi et al. [2008] (an application of the Sherman–Morrison formula to Eq. (2)).

r_{v,u}(α) = ψ_v(α) + ψ_u(α) (α d ψ_v(α)^T) / (1 − α(1 + d ψ_u(α)^T))    (8)

The previous formula makes the dependence on the preference and dangling node distributions very explicit.

In particular, we notice that the dependence on the dangling node distribution is not linear, so we cannot expect strongly preferential PageRank to be linear in v, because in that case v is also used as dangling node distribution. Nonetheless, once the pseudoranks for certain preference vectors have been computed, the preceding formula makes it possible to compute PageRank using any convex combination of such preference vectors.

However, if we let u = v in (8) (getting back the formula obtained by Del Corso et al. [2006])^{14}, we obtain

r_v(α) = ψ_v(α) ( 1 + (α d ψ_v(α)^T) / (1 − α(1 + d ψ_v(α)^T)) ),    (9)

where we used r_v(α) in place of r_{v,v}(α) for brevity. As observed by Del Corso et al. [2006], this formula shows that the strongly preferential PageRank with preference vector v is actually equal, up to normalization, to the pseudorank of v. Hence, in particular, even though strongly preferential PageRank is not linear, if v = βx + (1 − β)y, then

r_v(α) = ψ_v(α) ( 1 + (α d ψ_v(α)^T) / (1 − α(1 + d ψ_v(α)^T)) )
       = β ψ_x(α) ( 1 + (α d ψ_v(α)^T) / (1 − α(1 + d ψ_v(α)^T)) ) + (1 − β) ψ_y(α) ( 1 + (α d ψ_v(α)^T) / (1 − α(1 + d ψ_v(α)^T)) ),

so the two vectors

r_v(α) = r_{βx+(1−β)y}(α) and β ψ_x(α) + (1 − β) ψ_y(α)

are parallel to each other (i.e., they are equal up to normalization) because pseudoranks are linear. This simple connection provides a way to compute the strongly preferential PageRank with respect to the preference vector v = βx + (1 − β)y just by combining in the same way the pseudoranks of x and y, and ℓ1-normalizing the resulting vector. Note that the same process would not work if r_x(α) and r_y(α) were known in lieu of ψ_x(α) and ψ_y(α), as there is no way to recover the (de)normalization factors.
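A small sketch of this combination trick (Python with NumPy, both assumed; the matrix, the preference vectors, and the mixing weight are hypothetical): the pseudoranks of x and y are combined with the same weights as the preference vectors and then ℓ1-normalized, and the result matches the strongly preferential PageRank computed directly for the combined preference vector.

```python
import numpy as np

# Hypothetical row-normalized graph matrix G with a dangling node (zero row 2).
G = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
n = G.shape[0]
d = (G.sum(axis=1) == 0).astype(float)       # characteristic vector of dangling nodes
I = np.eye(n)
alpha, beta = 0.85, 0.3

def pseudorank(x):
    return (1 - alpha) * x @ np.linalg.inv(I - alpha * G)

def strongly_preferential_pagerank(v):       # u = v: dangling rows patched with v itself
    P_v = G + np.outer(d, v)
    return (1 - alpha) * v @ np.linalg.inv(I - alpha * P_v)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 0.5, 0.5])
v = beta * x + (1 - beta) * y

combo = beta * pseudorank(x) + (1 - beta) * pseudorank(y)
combo /= combo.sum()                         # l1-normalize the combined pseudoranks

print(strongly_preferential_pagerank(v))
print(combo)                                 # the two vectors coincide
```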

7.1 Iterated PageRank

A rather obvious question arising from the view of PageRank as an operator on preference vectors is the behavior of PageRank with respect to iteration. What

14 The reader should note that our formula has some difference in signs with respect to the original paper, where it was calculated incorrectly.


happens if we compute again PageRank using a PageRank vector as preference vector? We start by approaching the question using weakly preferential PageRank, as the linear dependence on v makes the analysis much easier. To avoid cluttering too much the notation, let us denote with r^{[k]}_{v,u}(α) the k-th iteration of weakly preferential PageRank, that is,

r^{[0]}_{v,u}(α) = v
r^{[k]}_{v,u}(α) = r_{r^{[k−1]}_{v,u}(α), u}(α).

Clearly,

r^{[k]}_{v,u}(α) = (1 − α)^k v (I − αP_u)^{−k} = (1 − α)^k v Σ_{n=0}^∞ \binom{n+k-1}{k-1} α^n P_u^n.

This shows that iterating PageRank is equivalent to choosing a different damping function in the sense of Baeza-Yates et al. [2006]. The factor (I − αP_u)^{−k} strongly resembles the corresponding term in the derivative as obtained in Corollary 3. And indeed, a simple computation shows that for k > 0

r^{(k)}_{v,u}(α) = (k! / (1 − α)^{k+1}) r^{[k+1]}_{v,u}(α) (P_u^k − P_u^{k−1}),    (10)

so there is a tight algebraic connection between iteration and derivation. One interesting point is that it might be much quicker to iterate a Gauss–Seidel method and apply the previous formula than using the power method and the bounds of Theorem 4, at least for small k (albeit upper bounding numerical errors could be difficult).

The same observations hold for pseudoranks: Indeed, the preceding computations are valid also for pseudoranks just by setting u = 0 (it is easy to check that all results of Section 6 are still valid in this case). However, the situation is completely different for strongly preferential PageRank, where the nonlinear dependency on v makes it difficult to derive similar results: We leave this problem for future work.

There is a final property about Eq. (10) that we want to highlight. Even if this observation can be stated for the derivatives of any order, let us limit ourselves to first-order derivatives only. Consider the following definition: For every distribution x, let us define the gain vector associated to x as

Δ_x = x (P_u − I).

The gain at each node is the difference between the score that the node would obtain from its in-neighbors and the score that the node actually has; this difference is negative if the node has a score higher than the one its in-neighbors would attribute to it (we might say: if the node is overscored), and positive otherwise (i.e., if the node is underscored).

In the case of first-order derivatives, Eq. (10) reduces to

r′_{v,u}(α) = (1 / (1 − α)^2) r^{[2]}_{v,u}(α) (P_u − I).

That is, the derivative vector r′_{v,u}(α) is parallel to r^{[2]}_{v,u}(α)(P_u − I) = Δ_{r^{[2]}_{v,u}(α)}. In other words, the first derivative of PageRank at a node is negative iff the node is


overscored by the second iterated PageRank; note that there is a shift, here, between the order of differentiation and the number of iterations.
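A quick numerical check of this relationship (Python with NumPy, both assumed; matrix and parameters are hypothetical): the gain vector of the second iterated PageRank, rescaled by (1 − α)², matches a finite-difference estimate of the first derivative, so in particular their signs agree node by node.

```python
import numpy as np

# Hypothetical patched row-stochastic matrix P_u and uniform preference vector v (row vectors).
P_u = np.array([[0.0, 0.5, 0.5],
                [1.0, 0.0, 0.0],
                [1/3, 1/3, 1/3]])
n = P_u.shape[0]
v = np.full(n, 1.0 / n)
I = np.eye(n)
alpha = 0.85

def pagerank(pref, a):                        # weakly preferential PageRank, u fixed inside P_u
    return (1 - a) * pref @ np.linalg.inv(I - a * P_u)

r2 = pagerank(pagerank(v, alpha), alpha)      # r^[2]: PageRank computed with a PageRank preference
gain = r2 @ (P_u - I)                         # gain vector of the second iterated PageRank

h = 1e-5
finite_diff = (pagerank(v, alpha + h) - pagerank(v, alpha - h)) / (2 * h)
print(gain / (1 - alpha) ** 2)                # Eq. (10) with k = 1
print(finite_diff)                            # same vector, hence same signs
```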

    8. CONCLUSIONS

We have presented a number of results which outline an analytic study of PageRank with respect to its many parameters. Albeit mainly theoretical in nature, they provide efficient ways to study the global behavior of PageRank, and dispel a few myths (in particular, about the significance of PageRank when α gets close to 1).

    REFERENCES

AVRACHENKOV, K., LITVAK, N., AND PHAM, K. S. 2007. Distribution of PageRank mass among principle components of the Web. In Proceedings of the 5th International Workshop on Algorithms and Models for the Web-Graph (WAW'07), A. Bonato and F. R. K. Chung, Eds. Lecture Notes in Computer Science, vol. 4863. Springer, 16-28.

BAEZA-YATES, R., BOLDI, P., AND CASTILLO, C. 2006. Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'06). ACM Press, New York, 308-315.

BAO, Y. AND LIU, Y. 2006. Limit of PageRank with damping factor. Dynam. Contin. Discr. Impulsive Syst. 13, 497-504.

BOLDI, P., CODENOTTI, B., SANTINI, M., AND VIGNA, S. 2004. UbiCrawler: A scalable fully distributed Web crawler. Softw. Pract. Exper. 34, 8, 711-726.

BOLDI, P., LONATI, V., SANTINI, M., AND VIGNA, S. 2006. Graph fibrations, graph isomorphism, and PageRank. RAIRO Inf. Théor. 40, 227-253.

BOLDI, P., POSENATO, R., SANTINI, M., AND VIGNA, S. 2008. Traps and pitfalls of topic-biased PageRank. In Proceedings of the 4th Workshop on Algorithms and Models for the Web-Graph (WAW'06), W. Aiello, A. Broder, J. Janssen, and E. Milios, Eds. Lecture Notes in Computer Science, vol. 4936. Springer, 107-116.

BOLDI, P., SANTINI, M., AND VIGNA, S. 2005. PageRank as a function of the damping factor. In Proceedings of the 14th International World Wide Web Conference. ACM Press, 557-566.

BOLDI, P. AND VIGNA, S. 2004. The WebGraph framework I: Compression techniques. In Proceedings of the 13th International World Wide Web Conference. ACM Press, 595-601.

BREZINSKI, C. AND REDIVO-ZAGLIA, M. 2006. The PageRank vector: Properties, computation, approximation, and acceleration. SIAM J. Matrix Anal. Appl. 28, 2, 551-575.

BRINKMEIER, M. 2006. PageRank revisited. ACM Trans. Internet Technol. 6, 3, 282-301.

DEL CORSO, G., GULLÌ, A., AND ROMANI, F. 2006. Fast PageRank computation via a sparse linear system. Internet Math. 2, 3, 251-273.

EIRON, N., MCCURLEY, K. S., AND TOMLIN, J. A. 2004. Ranking the Web frontier. In Proceedings of the 13th International World Wide Web Conference. ACM Press, 309-318.

FOGARAS, D. 2005. Personal communication.

GOLUB, G. H. AND GREIF, C. 2006. An Arnoldi-type algorithm for computing PageRank. BIT Numer. Math. 46, 4, 759-771.

HAVELIWALA, T. AND KAMVAR, S. 2003a. The condition number of the PageRank problem. Tech. rep. 36, Stanford University.

HAVELIWALA, T. H. 1999. Efficient computation of PageRank. Tech. rep. 31, Stanford University.

HAVELIWALA, T. H. AND KAMVAR, S. D. 2003b. The second eigenvalue of the Google matrix. Tech. rep. 20, Stanford University.

IOSIFESCU, M. 1980. Finite Markov Processes and Their Applications. John Wiley & Sons.

JEH, G. AND WIDOM, J. 2003. Scaling personalized Web search. In Proceedings of the 12th International World Wide Web Conference. ACM Press, 271-279.

KAMVAR, S. D., HAVELIWALA, T. H., MANNING, C. D., AND GOLUB, G. H. 2003. Extrapolation methods for accelerating PageRank computations. In Proceedings of the 12th International World Wide Web Conference. ACM Press, 261-270.

KATZ, L. 1953. A new status index derived from sociometric analysis. Psychometrika 18, 1, 39-43.

KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., SIVAKUMAR, D., TOMPKINS, A., AND UPFAL, E. 2000. The Web as a graph. In Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'00). ACM Press, 1-10.

LANGVILLE, A. N. AND MEYER, C. D. 2004. Deeper inside PageRank. Internet Math. 1, 3, 355-400.

PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1999. The PageRank citation ranking: Bringing order to the Web. Tech. rep. 66, Stanford University.

PINSKI, G. AND NARIN, F. 1976. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Proces. Manag. 12, 5, 297-312.

PRETTO, L. 2002a. A theoretical analysis of Google's PageRank. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE'02), A. H. F. Laender and A. L. Oliveira, Eds. Lecture Notes in Computer Science, vol. 2476. Springer, 125-136.

PRETTO, L. 2002b. A theoretical approach to link analysis algorithms. Ph.D. thesis.

SERRA-CAPIZZANO, S. 2005. Jordan canonical form of the Google matrix: A potential contribution to the PageRank computation. SIAM J. Matrix Anal. Appl. 27, 2, 305-312.

VIGNA, S. 2005. TruRank: Taking PageRank to the limit. In Proceedings of the 14th International World Wide Web Conference (WWW 2005), Special Interest Tracks & Posters. ACM Press, 976-977.

VIGNA, S. 2007. Stanford matrix considered harmful. In Web Information Retrieval and Linear Algebra Algorithms, A. Frommer, M. W. Mahoney, and D. B. Szyld, Eds. Number 07071 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.

    Received January 2008; revised September 2008; accepted December 2008
