Google and the Page Algorithm

Upload: pablonichy-keysias

Post on 03-Apr-2018


  • 7/28/2019 Google and the Page Algortimo

    Outline:

    Definition of PageRank
    Computation of PageRank
    The effect of new links
    Choosing PageRank damping factor
    Reducing the value of c

    Mathematical Analysis of Google PageRank

    Konstantin Avrachenkov

    INRIA Sophia Antipolis, France


    Ranking Answers to User Query

    How should a search engine sort the retrieved answers?

    Possible solutions: (a) use the frequency of the searched terms in the Web page, (b) analyse the log files, ... These solutions might not be objective.

    An original idea of Google is based on two observations:

    1 The more pages point to a Web page, the more important the page is.

    2 If more important Web pages point to the page, the page is even more important.


    Web Graph

    Consider the Web as a directed graph: pages are the nodes and hyperlinks are the directed edges.


    Random surfer PageRank Definition

    Consider a random surfer who, with probability c (= 0.85), follows a randomly chosen outgoing link and otherwise, with probability $1 - c$, jumps to a completely random page.

    Then the PageRank $\pi_i$ of page i is the long-run fraction of time that the random surfer spends on page i.

    The dynamics of the random surfer can be described using Markov chains (see "Markov Chain Definition" near the end).


    Formal PageRank Definition

    Let n be the total number of pages on the Web ($n \approx 8 \times 10^9$).

    Define the hyperlink matrix $P = \{p_{ij}\}_{i,j=1}^{n}$ as follows:

    $$p_{ij} = \begin{cases} 1/d_i, & \text{if } j \text{ is one of the } d_i \text{ outgoing links of } i, \\ 1/n, & \text{if } d_i = 0 \text{ (dangling node)}, \\ 0, & \text{otherwise.} \end{cases}$$

    The transitions of the easily bored surfer correspond to the following perturbed Google matrix:

    $$\tilde{P} = cP + (1 - c)(1/n)E,$$

    where E is an $n \times n$ matrix consisting of ones and c = 0.85.
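As a minimal sketch of the definition above (not Google's implementation), the following Python builds the hyperlink matrix and its perturbed version for a small graph. The function names `hyperlink_matrix` and `google_matrix`, the `links` dictionary, and the 4-page graph are hypothetical, chosen only for illustration; each page's outgoing links are assumed to be listed once.

```python
def hyperlink_matrix(links, n):
    """Row-stochastic hyperlink matrix P from {page: [outgoing pages]}."""
    P = []
    for i in range(n):
        out = links.get(i, [])
        if out:
            # p_ij = 1/d_i for each of the d_i outgoing links of page i
            row = [0.0] * n
            for j in out:
                row[j] = 1.0 / len(out)
        else:
            # dangling node: p_ij = 1/n for every j
            row = [1.0 / n] * n
        P.append(row)
    return P


def google_matrix(P, c=0.85):
    """Perturbed matrix c*P + (1 - c)*(1/n)*E, with E the all-ones matrix."""
    n = len(P)
    return [[c * P[i][j] + (1.0 - c) / n for j in range(n)] for i in range(n)]


# Hypothetical 4-page web; page 3 has no outgoing links (dangling node).
links = {0: [1, 2], 1: [2], 2: [0]}
P = hyperlink_matrix(links, 4)
G = google_matrix(P)
```

Both matrices are row-stochastic, and every entry of `G` is at least (1 - c)/n, which is what makes the random surfer's chain reach every page.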


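    Then the PageRank vector $\pi$ is a solution of

    $$\pi \tilde{P} = \pi, \qquad \pi \mathbf{1} = 1,$$

    where $\tilde{P}$ is the perturbed Google matrix, or, equivalently, in component form,

    $$\pi_i = \sum_{j \to i} \frac{c}{d_j}\, \pi_j + \frac{1 - c}{n}, \qquad \sum_i \pi_i = 1.$$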


    Example

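    $$P = \begin{pmatrix} 0 & 0.5 & 0.5 & 0 & 0 \\ 0.2 & 0.2 & 0.2 & 0.2 & 0.2 \\ 0.2 & 0.2 & 0.2 & 0.2 & 0.2 \\ 0.5 & 0 & 0 & 0 & 0.5 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}, \qquad \pi = (0.1982,\ 0.1731,\ 0.1731,\ 0.2573,\ 0.1982).$$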
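The example can be checked numerically. The sketch below assumes c = 0.85 (the slide does not state which value was used for this example) and applies the Google matrix once to the printed vector; the variable names `pi_next` and `residual` are illustrative. If the printed vector is stationary, one application should leave it unchanged up to the four-digit rounding.

```python
c = 0.85  # assumed damping factor for this example

P = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.5, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
pi = [0.1982, 0.1731, 0.1731, 0.2573, 0.1982]
n = len(P)

# (pi * G)_j = c * sum_i pi_i p_ij + (1 - c)/n * sum_i pi_i
total = sum(pi)
pi_next = [
    c * sum(pi[i] * P[i][j] for i in range(n)) + (1.0 - c) / n * total
    for j in range(n)
]

# Largest change in any component after one application of the Google matrix
residual = max(abs(a - b) for a, b in zip(pi, pi_next))
print(residual)
```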


    Power Method

    Even though this is a well-kept secret, it seems that Google still uses the simple power iteration method for PageRank computation:

    $$\pi^{(k+1)} = c\,\pi^{(k)} P + (1 - c)\frac{1}{n}\mathbf{1}^T, \qquad \pi^{(0)} = \frac{1}{n}\mathbf{1}^T.$$

    It can easily be estimated that, with the constant c = 0.85, Google achieves a tolerance level (measured by the residual $\pi^{(k+1)} - \pi^{(k)}$) of $10^{-3}$ to $10^{-5}$ in only 50-100 iterations.

    But even this small number of iterations takes Google about a week to update the PageRank...
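The iteration above can be sketched in a few lines of plain Python. This is an illustration on the 5-page example matrix, not a description of Google's production code; the function name `power_iteration`, the L1 norm used for the residual, and the stopping tolerance are assumptions.

```python
def power_iteration(P, c=0.85, tol=1e-5, max_iter=1000):
    """Iterate pi <- c*pi*P + (1 - c)/n * 1^T, starting from the uniform vector.

    Returns the final vector and the number of iterations performed.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        new = [
            c * sum(pi[i] * P[i][j] for i in range(n)) + (1.0 - c) / n
            for j in range(n)
        ]
        # L1 norm of the residual pi^(k+1) - pi^(k)
        residual = sum(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if residual < tol:
            return pi, k
    return pi, max_iter


# The 5-page example matrix (dangling rows already replaced by 1/n).
P = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.5, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
pi, iterations = power_iteration(P)
print(iterations, [round(x, 4) for x in pi])
```

Each step shrinks the L1 residual by at least a factor c, which is why the iteration count depends mainly on c and the tolerance rather than on the size of the graph.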


    Monte Carlo Method

    Run the random surfer process $(X_t)_{t \ge 0}$ m times from each page, terminating at each step with probability $1 - c$. Estimate $\pi_j$ as the fraction of time spent in j.

    It turns out that one iteration of the Monte Carlo method (m = 1) is sufficient to estimate the PageRank of important pages well.
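A minimal sketch of this estimator, assuming the walk counts its starting page as a visit and that dangling rows of P are already uniform. The function name `monte_carlo_pagerank`, the fixed seed, and the choice m = 2000 (large only because the example graph has just five pages) are illustrative assumptions.

```python
import random


def monte_carlo_pagerank(P, c=0.85, m=1, seed=0):
    """Estimate PageRank as the fraction of visits over m runs from each page."""
    rng = random.Random(seed)
    n = len(P)
    pages = list(range(n))
    visits = [0] * n
    for start in pages:
        for _ in range(m):
            state = start
            while True:
                visits[state] += 1
                if rng.random() >= c:
                    break  # the run terminates with probability 1 - c
                # otherwise follow a link chosen according to row `state` of P
                state = rng.choices(pages, weights=P[state])[0]
    total = sum(visits)
    return [v / total for v in visits]


# The 5-page example matrix used earlier.
P = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.5, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
estimate = monte_carlo_pagerank(P, m=2000)
print([round(x, 3) for x in estimate])
```

Because every run is independent, the outer loops can be split across machines and the visit counts simply added, which is the parallelism the slides refer to.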


    Advantages of the Monte Carlo (MC) Method over the Power Iteration (PI) Method

    The Monte Carlo method has a natural parallel implementation;

    the Monte Carlo method provides a good estimate of the PageRank for important pages already after one iteration;

    the Monte Carlo method allows one to perform a continuous update of the PageRank as the structure of the Web changes.


    Decomposition based on SCC

    It is known that the Web graph consists of many disjoint strongly connected components (SCCs). This fact implies that the hyperlink matrix has the following form:

    $$P = \begin{pmatrix} P_1 & & 0 \\ & \ddots & \\ 0 & & P_N \end{pmatrix},$$

    where the elements of the diagonal blocks $P_I$, I = 1, ..., N, correspond to links inside the I-th SCC. Denote by $n_I$ the size of the I-th SCC.


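    For each block I, define the Google matrix $\tilde{P}_I = cP_I + (1 - c)(1/n_I)E$, and let the vector $\pi_I$ be the PageRank of SCC I, such that

    $$\pi_I \tilde{P}_I = \pi_I, \qquad \pi_I \mathbf{1} = 1.$$

    Then the following theorem holds.

    Theorem

    The PageRank $\pi$ is given by

    $$\pi = \left( \frac{n_1}{n}\pi_1,\ \frac{n_2}{n}\pi_2,\ \ldots,\ \frac{n_N}{n}\pi_N \right). \qquad (1)$$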
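The theorem can be illustrated on a tiny graph with two disconnected components. The sketch below is a numerical check, not a proof; the two blocks `P1` and `P2`, the helper `pagerank`, and the fixed number of iterations are hypothetical choices, and neither block contains dangling nodes.

```python
def pagerank(P, c=0.85, iters=200):
    """Power iteration for the Google matrix built from P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [
            c * sum(pi[i] * P[i][j] for i in range(n)) + (1.0 - c) / n
            for j in range(n)
        ]
    return pi


# Two hypothetical strongly connected components with no links between them.
P1 = [[0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0],
      [0.5, 0.5, 0.0]]
P2 = [[0.0, 1.0],
      [1.0, 0.0]]
n1, n2 = len(P1), len(P2)
n = n1 + n2

# Block-diagonal hyperlink matrix of the whole graph.
P = [row + [0.0] * n2 for row in P1] + [[0.0] * n1 + row for row in P2]

pi = pagerank(P)                          # PageRank of the whole graph
pi1, pi2 = pagerank(P1), pagerank(P2)     # PageRank of each block on its own
combined = [n1 / n * x for x in pi1] + [n2 / n * x for x in pi2]

gap = max(abs(a - b) for a, b in zip(pi, combined))
print(gap)
```

The practical consequence is that each component can be processed separately, on a much smaller matrix, and the results rescaled by $n_I/n$.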


    To what extent can a page control its PageRank?

    Let us give a rough estimate of how much a page can control its PageRank by modifying its outgoing links.

    Define a discrete-time absorbing Markov chain $\{X_t, t = 0, 1, \ldots\}$ with the state space {0, 1, ..., n}, where transitions between the states 1, ..., n are governed by the matrix cP, and the state 0 is absorbing.

    Let $N_j$ be the number of visits to state j = 1, ..., n before absorption. Then denote $z_{ij} := E(N_j \mid X_0 = i)$.


    Let $q_{ji}$ be the probability of reaching the state i before absorption if the initial state is j.

    We have the following decomposition result:

    Theorem

    The PageRank of page i = 1, ..., n is given by

    $$\pi_i = \frac{1 - c}{n}\, z_{ii} \Bigl( 1 + \sum_{j=1,\ j \ne i}^{n} q_{ji} \Bigr), \qquad i = 1, \ldots, n. \qquad (2)$$

    The proof is given on the final slide.


    The decomposition formula (2) represents the PageRank of page i as a product of three multipliers, where only the term $z_{ii}$ depends on the outgoing links of page i.

    Hence, by changing the outgoing links, a page can control its PageRank up to a multiplicative factor

    $$z_{ii} = \frac{1}{1 - q_{ii}} \in \left[ 1, \frac{1}{1 - c^2} \right],$$

    where $q_{ii} \in [0, c^2]$ is the probability of returning to i, starting from i, before absorption.

    Note that the upper bound $1/(1 - c^2)$ (approximately 3.6 for c = 0.85) is hard or rather impossible to achieve...
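Formula (2) and the bound on $z_{ii}$ can be checked numerically on a small graph. The sketch below computes the hitting probabilities by fixed-point iteration in the absorbing chain with transitions cP, then compares the right-hand side of (2) with a power-iteration PageRank. The 4-page graph (no dangling nodes, no self-links) and the helper names `pagerank` and `hit_probabilities` are hypothetical.

```python
c = 0.85

# Hypothetical 4-page graph: every page has outgoing links, none links to itself.
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.5, 0.0]]
n = len(P)


def pagerank(P, c, iters=300):
    """Power iteration for the Google matrix built from P."""
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [
            c * sum(pi[a] * P[a][b] for a in range(n)) + (1.0 - c) / n
            for b in range(n)
        ]
    return pi


def hit_probabilities(P, c, i, iters=300):
    """q[j]: probability of reaching i (in at least one step) before absorption,
    starting from j. For j == i this is the return probability q_ii."""
    q = [0.0] * n
    for _ in range(iters):
        q = [
            c * sum(P[j][k] * (1.0 if k == i else q[k]) for k in range(n))
            for j in range(n)
        ]
    return q


pi = pagerank(P, c)
decomposed, z_diag = [], []
for i in range(n):
    q = hit_probabilities(P, c, i)
    z_ii = 1.0 / (1.0 - q[i])
    z_diag.append(z_ii)
    # Right-hand side of formula (2)
    decomposed.append(
        (1.0 - c) / n * z_ii * (1.0 + sum(q[j] for j in range(n) if j != i))
    )

print(max(abs(a - b) for a, b in zip(pi, decomposed)))
print(z_diag, 1.0 / (1.0 - c * c))
```

In this sketch page 3 has no incoming links, so its return probability is zero and its factor $z_{ii}$ sits at the lower end of the interval.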


    We note that even a threefold increase of the PageRank might not be considered a significant improvement, since Google measures the PageRank on a logarithmic scale.

    Next we show how a Web page should use its scarce resources to increase its PageRank.


    Optimal Linking Strategy

    Let us show that there exists in fact an optimal linking strategy.

    Consider a page i = 1, ..., n and assume that i has links to the pages $i_1, \ldots, i_k$, where $i_l \ne i$ for all l = 1, ..., k.

    Then for the mean return time, we have

    $$\mu_{ii} = 1 + \frac{c}{k} \sum_{l=1}^{k} \mu_{i_l i} + \frac{1}{n}(1 - c) \sum_{j=1,\ j \ne i}^{n} \mu_{ji}, \qquad (3)$$

    where $\mu_{ij}$ is the mean first passage time from page i to page j and c is the Google constant.

    Since $\pi_i = 1/\mu_{ii}$, the objective now is to choose k and $i_1, \ldots, i_k$ such that $\mu_{ii}$ becomes as small as possible.


    From (3) one can see that $\mu_{ii}$ is a linear function of the $\mu_{ji}$'s. Moreover, outgoing links from i do not affect the $\mu_{ji}$'s.

    Thus, the best one can do is to link to only one Web page $j^*$ such that

    $$\mu_{j^* i} = \min_j \{\mu_{ji}\}.$$

    Note that (surprisingly) the PageRank of $j^*$ plays no role here.

    Still, as was already mentioned, we need to admit that a Web page owner has very limited control of his/her PageRank.
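The strategy can be tried out by brute force on a small graph. The sketch below computes the mean first passage times to page i in the Google chain, picks the page with the smallest one, and compares the resulting PageRank of i with every other non-empty choice of outgoing links. The 5-page graph in `others`, the controlled page `i = 0`, and the helper names are hypothetical.

```python
from itertools import combinations

c = 0.85
n = 5
i = 0  # the page whose outgoing links we are free to choose

# Hypothetical, fixed outgoing links of the other pages (none is dangling).
others = {1: [0, 2], 2: [3], 3: [0, 4], 4: [1, 2]}


def build_P(own_links):
    """Hyperlink matrix when page i links exactly to the pages in own_links."""
    links = dict(others)
    links[i] = list(own_links)
    P = [[0.0] * n for _ in range(n)]
    for a, outs in links.items():
        for b in outs:
            P[a][b] = 1.0 / len(outs)
    return P


def pagerank(P, iters=300):
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [
            c * sum(pi[a] * P[a][b] for a in range(n)) + (1.0 - c) / n
            for b in range(n)
        ]
    return pi


def passage_times_to_i(P, iters=5000):
    """mu[j]: mean first passage time from j to i in the Google chain (j != i).
    Row i of P is never used, so the result ignores page i's own links."""
    mu = [0.0] * n
    for _ in range(iters):
        mu = [
            0.0 if j == i else
            1.0 + sum((c * P[j][k] + (1.0 - c) / n) * mu[k]
                      for k in range(n) if k != i)
            for j in range(n)
        ]
    return mu


mu = passage_times_to_i(build_P([1]))
best = min((j for j in range(n) if j != i), key=lambda j: mu[j])

# PageRank of page i for every non-empty set of outgoing links.
candidates = [s for k in range(1, n) for s in combinations(range(1, n), k)]
rank_of_i = {s: pagerank(build_P(s))[i] for s in candidates}
print(best, rank_of_i[(best,)], max(rank_of_i.values()))
```

The enumeration grows exponentially with the number of pages, so it only serves as a check here; the point of the slide is that the passage times alone identify the best target.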


    The Bowtie structure of the Web graph

    A. Broder et al. (2000) and R. Kumar et al. (2000) have observed that the Web graph has a Bowtie structure.


    Maximizing the mass of SCC

    One can choose the damping factor c to maximize the total PageRank mass of the SCC.

    However, the factor c then becomes too close to one. Is that good?


    More detailed structure of the Web graph


    By renumbering the nodes, the transition matrix P can then be transformed to the following form:

    $$P = \begin{pmatrix} Q & 0 \\ R & T \end{pmatrix}, \qquad (4)$$

    where

    the block T corresponds to the Extended SCC (ESCC),

    the block Q corresponds to the part of the OUT component without dangling nodes and their predecessors,

    and the block R corresponds to the transitions from ESCC to the nodes in block Q.


    As was observed by Moler (2003), the PageRank vector can be expressed by the following formula:

    $$\pi = \frac{1 - c}{n}\, \mathbf{1}^T [I - cP]^{-1}. \qquad (5)$$

    If we substitute the expression (4) for the transition matrix P into (5), we obtain the following formula for the part of the PageRank vector corresponding to the nodes in ESCC:

    $$\pi_T = \frac{1 - c}{n}\, \mathbf{1}^T [I - cT]^{-1} = (1 - c)\,\alpha\, u_T [I - cT]^{-1}, \qquad (6)$$

    where $\alpha = n_T/n$, $n_T$ is the number of nodes in ESCC, and $u_T$ is the uniform distribution over all ESCC nodes.


    First, we note that since the matrix T is substochastic, the inverse $[I - T]^{-1}$ exists and consequently $\pi_T \to 0$ as $c \to 1$.

    Clearly, it is not good to take the value of c too close to one.
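Formula (6) and the limit $c \to 1$ can be illustrated on a toy graph that already has the block form (4). In the sketch below, pages 0-1 play the role of the Q block and pages 2-3 the role of the T block; this 4-page graph, the helper names, and the expansion of $[I - cT]^{-1}$ as the series $\sum_k (cT)^k$ (valid because T is substochastic and c < 1) are assumptions made for illustration.

```python
def pagerank(P, c, iters=3000):
    """Power iteration for the Google matrix built from P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [
            c * sum(pi[i] * P[i][j] for i in range(n)) + (1.0 - c) / n
            for j in range(n)
        ]
    return pi


def escc_pagerank(T, c, alpha, iters=3000):
    """(1 - c) * alpha * u_T * [I - cT]^{-1}, with the inverse written as
    the truncated series sum_k (cT)^k."""
    m = len(T)
    term = [(1.0 - c) * alpha / m] * m   # (1 - c) * alpha * u_T
    total = [0.0] * m
    for _ in range(iters):
        total = [a + b for a, b in zip(total, term)]
        term = [c * sum(term[i] * T[i][j] for i in range(m)) for j in range(m)]
    return total


# Hypothetical blocks of P = [[Q, 0], [R, T]].
Q = [[0.0, 1.0], [1.0, 0.0]]   # closed OUT part: pages 0 and 1 link to each other
R = [[0.0, 0.0], [0.5, 0.0]]   # page 3 also links out of the ESCC, to page 0
T = [[0.0, 1.0], [0.5, 0.0]]   # links inside the ESCC (pages 2 and 3)
P = [Q[0] + [0.0, 0.0], Q[1] + [0.0, 0.0], R[0] + T[0], R[1] + T[1]]
n, n_T = 4, 2
alpha = n_T / n

masses = {}
max_gap = 0.0
for c in (0.5, 0.85, 0.99):
    full = pagerank(P, c)
    part = escc_pagerank(T, c, alpha)
    # Formula (6) should reproduce the ESCC part of the full PageRank vector.
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(full[2:], part)))
    masses[c] = sum(part)

print(max_gap, masses)
```

In this toy graph the ESCC mass shrinks steadily as c grows, because the surfer who rarely teleports ends up trapped in the closed OUT part.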


    It follows that the value of c should not be chosen in the critical region where the PageRank mass of the ESCC component is rapidly decreasing.

    Luckily, the shape of the function $\|\pi_T(c)\|_1$ is such that it decreases drastically only when c is really close to one, which leaves a lot of freedom for choosing c.

    In particular, the famous Google constant c = 0.85 is small enough to ensure a reasonably large PageRank mass of ESCC.


    However, as we have observed in numerical experiments, even moderately large values of c result in an unfairly large PageRank mass of the Pure OUT component.

    Now, our goal is to find the values of c that lead to a fair distribution of the PageRank mass between the Pure OUT and the ESCC components.


    Let v be some probability vector over ESCC. We would like to choose $c = c^*$ that satisfies the condition

    $$\|\pi_T(c)\| = \|vT\|, \qquad (7)$$

    that is, starting from v, the probability mass preserved in ESCC after one step should be equal to the PageRank of ESCC.

    Reasonable choices of v:

    1 $\hat{\pi}_T$, the quasi-stationary distribution of T,

    2 the uniform vector $u_T$,

    3 the normalized PageRank vector $\pi_T(c)/\|\pi_T(c)\|$.

    All three criteria indicate that c = 1/2 seems to be quite a good choice.


    Experiments with the log files

    Node    c       PR rank w/o link    PR rank with link    rank by no. of clicks
    A       0.5     1648                2307                 2588
    A       0.85    731                 2101                 2588
    A       0.95    226                 2116                 2588
    B       0.5     1648                4009                 3649
    B       0.85    731                 3279                 3649
    B       0.95    226                 3563                 3649

    Table: Comparison between PR and click based rankings.


    Recommendations for Web Page Design

    Now we can suggest the following recommendations for Web page design:

    1 The more pages a Web site has, the better.

    2 Link all pages inside a Web site to the main page. This way the main page will have a significant weight.

    3 Give hyperlinks to the Department and Institution Web sites.

    4 Do not make inappropriate links.

    And, of course, one should not forget that content still matters for Google. There is really no substitute for good original content...


    Thank you!


    Markov Chain Definition

    A discrete-time, discrete-state Markov chain is a stochastic process $\{X_n\}_{n=0}^{\infty}$ on the set of states S = {1, 2, ..., |S|} such that

    $$P\{X_{n+1} = j\} = \sum_{i \in S} P\{X_{n+1} = j \mid X_n = i\}\, P\{X_n = i\}.$$

    We denote $p_{ij} := P\{X_{n+1} = j \mid X_n = i\}$ and call $\{p_{ij}\}_{i,j=1}^{|S|}$ the matrix of transition probabilities.
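In matrix terms, the sum above says that the distribution at time n+1 is the row vector of the distribution at time n multiplied by the transition matrix. A minimal sketch, using a hypothetical two-state chain and an illustrative helper named `step` (states are indexed from 0 in the code):

```python
def step(dist, P):
    """One application of P{X_{n+1} = j} = sum_i p_ij * P{X_n = i}."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical two-state chain; each row of P sums to one.
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist0 = [1.0, 0.0]        # start in the first state with certainty
dist1 = step(dist0, P)    # approximately [0.9, 0.1]
dist2 = step(dist1, P)    # approximately [0.86, 0.14]
print(dist1, dist2)
```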


    To what extent can a page control its PageRank?

    Proof: It follows from (??) that

    $$\pi_i = \frac{1 - c}{n}\, \mathbf{1}^T [I - cP]^{-1} e_i = \frac{1 - c}{n} \sum_{j=1}^{n} z_{ji}. \qquad (8)$$

    Next, we note that for any i, j = 1, ..., n with $i \ne j$, we have

    $$z_{ji} = q_{ji} z_{ii},$$

    and consequently, substituting the last equation in (8), we obtain (2). Q.E.D.
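The final substitution can be written out explicitly. Splitting the sum in (8) into the term j = i and the terms j ≠ i, and then using $z_{ji} = q_{ji} z_{ii}$ for j ≠ i, gives:

```latex
\begin{align*}
\pi_i &= \frac{1-c}{n}\Bigl( z_{ii} + \sum_{j \ne i} z_{ji} \Bigr)
       = \frac{1-c}{n}\Bigl( z_{ii} + \sum_{j \ne i} q_{ji}\, z_{ii} \Bigr) \\
      &= \frac{1-c}{n}\, z_{ii} \Bigl( 1 + \sum_{j \ne i} q_{ji} \Bigr),
\end{align*}
```

which is the decomposition formula (2).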
