The PageRank Citation Ranking: Bringing Order to the Web


Post on 22-Dec-2015


Page 1: The PageRank Citation Ranking: Bringing Order to the Web

Page L., Brin S., Motwani R., Winograd T.
Stanford Digital Library Technologies Project
http://dbpubs.stanford.edu/pub/1999-66

Presented by Zheng Zhao

Originally designed by Soumya Sanyal

http://ranger.uta.edu/~gdas/Courses/Spring2005/DBIR/slides/The%20PageRank%20Citation%20Ranking%20-%20Redone.ppt

Page 2: Outline

University of Texas at Arlington

• Paper Citations and the Web: Motivation
• PageRank: Why it should be considered?
• More PageRank: Nuts and bolts
• PageRank Unleashed: Looking under the hood
• Convergence and Random Walks: Why does it work?
• Implementation: Getting your hands dirty
• Personalized PageRank: The invisible source
• Applications: What wasn’t apparent already
• Conclusions

Page 3: Paper Citations and the Web: Motivation

Academic citations link to other well-known papers

But academic papers are peer reviewed and subject to quality control

The web of academic documents is homogeneous in quality, usage, citation, and length

Most web pages link to other web pages as well

The quality of a web page, though, is subjective to the user

The importance of a page is a quantity that is hard to capture intuitively

Page 4: Contd.

A user wants to see what is most applicable to her needs first.

The job of the retrieval system is to present the more relevant documents up front.

The notion of quality, or relative importance, of a web page becomes all the more significant

The average quality experienced by a user is higher than the quality of the average web page.

Notations used:

• Backlinks (inedges): Links that point to a certain page

• Forward links (outedges): Links that emanate from that page

Page 5: PageRank: Why it should be considered?

Think of a color palette
– Colors are formed by the mixture of one or more colors
– The amount and intensity of each color you mix ultimately governs the color of the final mixture, not the number of colors!

Now think of a Web page
– A number of backlinks (inedges) point to this web page
– Say a certain backlink came from Yahoo! and another came from an obscure home page

Think of the importance of the Yahoo! page as opposed to the importance of the ‘home page’.

Now say the importance of the Yahoo! page was mapped to the amount (intensity) of one color and the ‘home page’ to another color

What matters is the importance of the backlinks rather than their number.

Page 6: More PageRank: Nuts and bolts

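For any Web page $u$, let $F_u$ be the set of pages $u$ points to (its forward links), $B_u$ the set of pages that point to $u$ (its backlinks), and $N_u = |F_u|$ the number of forward links from $u$

$R(u)$ = Rank of page $u$; $c$ = Normalization constant
– The simplified ranking is $R(u) = c \sum_{v \in B_u} R(v) / N_v$
– Note: $c < 1$ to cover for pages with no outgoing links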
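A minimal Python sketch of one update of the simplified ranking from the paper, $R(u) = c \sum_{v \in B_u} R(v) / N_v$. The three-page graph `links` and the function name `simplified_rank_step` are illustrative assumptions, not part of the slides.

```python
def simplified_rank_step(links, rank, c=1.0):
    """One update of R(u) = c * sum over backlinks v of R(v) / N_v.

    links: dict mapping each page to the list of pages it links to (F_u).
    rank:  dict mapping each page to its current rank R(u).
    """
    new_rank = {page: 0.0 for page in links}
    for v, forward in links.items():
        if not forward:
            continue  # a page with no forward links passes on no rank
        share = rank[v] / len(forward)  # R(v) / N_v
        for u in forward:
            new_rank[u] += c * share
    return new_rank


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {page: 1.0 / len(links) for page in links}
rank = simplified_rank_step(links, rank)
print(rank)
```

Each page splits its rank evenly among its forward links, so a backlink from a high-rank page with few outgoing links is worth more than one from a low-rank page with many.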

Page 7: Contd.

So what does the overall picture look like?

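A is a square matrix whose rows and columns correspond to Web pages; u and v index its rows and columns, with $A_{u,v} = 1/N_u$ if there is a link from u to v and $A_{u,v} = 0$ otherwise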
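A small sketch of how such a matrix could be built from a link table, using the paper's convention $A_{u,v} = 1/N_u$ when u links to v. The names `link_matrix`, `pages`, and `links` are hypothetical, and the nested-list representation is only practical for toy graphs.

```python
def link_matrix(pages, links):
    """Return A as nested lists with A[u][v] = 1/N_u if u links to v, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    A = [[0.0] * n for _ in range(n)]
    for u in pages:
        forward = links.get(u, [])
        for v in forward:
            A[index[u]][index[v]] = 1.0 / len(forward)
    return A


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
pages = ["a", "b", "c"]
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
A = link_matrix(pages, links)
for row in A:
    print(row)
```

With this convention each row of a page that has forward links sums to 1, and the rank a page receives is read down its column.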

Page 8: Contd. (Matrices Revisited)

Eigenvectors and eigenvalues

Given that A is a matrix and R is a vector over all the Web pages, the dominant eigenvector is the one associated with the maximal eigenvalue.

It can be found by iterating the previous equation until it converges.
– The eigenvectors associated with a given eigenvalue, together with the zero vector, form what is called its eigenspace.
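The iteration can be sketched as plain power iteration. This is a minimal pure-Python version under stated assumptions: `M` is a small column-stochastic matrix for a hypothetical three-page graph, the vector is rescaled to sum to 1 after every step, and the tolerance `tol` is an arbitrary choice.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Approximate the dominant eigenvector of a square matrix M (nested lists).

    Repeatedly applies x <- M x and rescales x so its entries sum to 1.
    """
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(abs(v) for v in y)
        y = [v / total for v in y]
        if sum(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Column j holds the forward-link weights of page j.
# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
R = power_iteration(M)
print([round(v, 4) for v in R])
```

For this graph the fixed point satisfies R_a = R_c and R_b = R_a / 2, so the ranks settle near 0.4, 0.2, and 0.4.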

Page 9: Contd. (A Walk Through Example)

Let’s take an example

$A^T$ = [matrix omitted in transcript]

Page 10: Contd.

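Matrix notation: $R = c A R = M R$, with $M = cA$

$R$: eigenvector of $A$, with eigenvalue $1/c$ since $A R = (1/c) R$

$A x = \lambda x$, so $(A - \lambda I) x = 0$, which has a nonzero solution only when $|A - \lambda I| = 0$

$A$ = [matrix omitted in transcript]

$R$ = [vector omitted in transcript], normalized = [vector omitted in transcript]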

Page 11: Contd. (Markov Chains)

Random surfer model
– Description of a random walk through the Web graph
– The rank of a page is interpreted as the asymptotic probability that a surfer is currently browsing that page, with the link matrix acting as the transition matrix
– The above notion is fundamental to any Markovian system. For a discrete notion of the above, the following is assumed:

$R_t = M R_{t-1}$

M: transition matrix for a first-order Markov chain (stochastic)

The question is: does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
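The random surfer can also be simulated directly. The sketch below assumes a hypothetical three-page graph in which every page has at least one forward link; `simulate_surfer`, `links`, the seed, and the step count are illustrative choices.

```python
import random


def simulate_surfer(links, steps, seed=0):
    """Follow random forward links and return each page's visit frequency.

    Assumes every page in `links` has at least one forward link.
    """
    rng = random.Random(seed)
    page = next(iter(links))  # start on the first page listed
    visits = {p: 0 for p in links}
    for _ in range(steps):
        page = rng.choice(links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
freq = simulate_surfer(links, steps=200_000)
print(freq)
```

Over a long walk the visit frequencies approach the stationary distribution of the chain, which for this graph is about 0.4, 0.2, and 0.4 for a, b, and c.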

Page 12: Contd. (Issues)

The above equation would converge were it not for a little problem

This problem is called the ‘Rank Sink’ problem.
– The sink accumulates rank, but never distributes it!
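A toy illustration of a rank sink, assuming a hypothetical four-page graph in which pages b and c link only to each other while page a links into the loop. The helper `step` and the iteration count are illustrative.

```python
def step(links, rank):
    """Propagate rank once: each page splits its rank among its forward links."""
    new_rank = {p: 0.0 for p in links}
    for v, forward in links.items():
        for u in forward:
            new_rank[u] += rank[v] / len(forward)
    return new_rank


# a <-> d form the "rest of the web"; a also links into the b <-> c loop.
links = {"a": ["d", "b"], "d": ["a"], "b": ["c"], "c": ["b"]}
rank = {p: 0.25 for p in links}
for _ in range(50):
    rank = step(links, rank)

sink_total = rank["b"] + rank["c"]
print(round(sink_total, 6), rank["a"] + rank["d"])
```

Every pass moves half of a's rank into the loop and nothing comes back, so almost all of the rank ends up in b and c.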

Page 13: Contd.

In general, many Web pages have no backlinks or no forward links.

This results in dangling edges in the graph

No parent → rank 0
– $M^t$ converges to a matrix whose last column is all zero

No children → no solution
– $M^t$ converges to the zero matrix
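A tiny worked case of the "no children" problem, assuming a hypothetical chain a -> b -> c in which c has no forward links. Here `M` is written column-wise (column j holds the forward-link weights of page j) and `matmul` is an illustrative helper.

```python
def matmul(X, Y):
    """Multiply two square matrices stored as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# a -> b and b -> c; page c has no forward links, so its column is all zero.
M = [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

P = M
for _ in range(2):
    P = matmul(P, M)  # after the loop, P is the third power of M

for row in P:
    print(row)
```

Rank that reaches c is never passed on, so after three steps the matrix power is entirely zero and every starting rank vector is mapped to zero.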

Page 14: Contd. (More Random Surfer)

How do we escape from this?
– A: We actually ‘escape’ from it.

Say a surfer is randomly clicking and hopping from one page to the other.

If this surfer keeps going back to the same set of pages, she will get bored (in reality too) and try to ‘escape’ from this set of pages.

Hence, we associate an ‘escape’ factor E to account for this ‘boredom’.
– How do we model this escape probability?

We define E to be a vector over all the web pages that accounts for each page’s escape probability.

Page 15: Contd.

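Given this escape vector, how do we associate it with the original model?

In matrix notation, $R' = c (A R' + E)$, where $\|R'\|_1 = 1$

It can be rewritten as $R' = c (A + E \times \mathbf{1}) R'$, where $\mathbf{1}$ is the vector consisting of all ones

Hence $R'$ is an eigenvector of $(A + E \times \mathbf{1})$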
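The rewriting step on this slide follows from the normalization of the rank vector. A short derivation, assuming the paper's form $R' = c (A R' + E)$ with $\|R'\|_1 = 1$, that the entries of $R'$ are nonnegative, and that $E \times \mathbf{1}$ denotes the outer product of the column vector $E$ with the all-ones row vector:

```latex
\begin{align*}
(E \times \mathbf{1})\, R' &= E \,(\mathbf{1} R') = E \sum_u R'(u) = E\, \|R'\|_1 = E, \\
R' &= c\,(A R' + E) = c\,\bigl(A R' + (E \times \mathbf{1})\, R'\bigr) = c\,(A + E \times \mathbf{1})\, R'.
\end{align*}
```

So adding E to every step is the same as adding the rank-one matrix $E \times \mathbf{1}$ to A, which is why the result is again an eigenvector problem.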

Page 16: PageRank Unleashed: Looking under the hood

• What can we say about d and ε?
• d1 is called the eigengap and it controls the rate of convergence
• ε is the convergence threshold

The main algorithm:
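The algorithm itself did not survive in the transcript. The sketch below follows the loop given in the paper (apply A, measure the lost rank d, add d·E, stop once the change δ drops to ε), with three labeled assumptions: the start vector is uniform, E sums to 1, and an explicit scaling factor `c = 0.85` is applied in the propagation step. The names `pagerank`, `links`, and `E` are illustrative.

```python
def pagerank(links, E, c=0.85, eps=1e-10, max_iter=1000):
    """Iterate R <- c * A R + d * E until the L1 change is at most eps.

    links: dict page -> list of pages it links to.
    E:     dict page -> source-of-rank weight (assumed to sum to 1).
    d:     rank lost in the propagation step, handed back according to E.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # uniform start (assumption)
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for v, forward in links.items():
            for u in forward:
                new_rank[u] += c * rank[v] / len(forward)
        d = sum(rank.values()) - sum(new_rank.values())
        for p in pages:
            new_rank[p] += d * E[p]
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta <= eps:
            break
    return rank


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
E = {p: 1.0 / len(links) for p in links}
ranks = pagerank(links, E)
print({p: round(r, 4) for p, r in ranks.items()})
```

Because d collects everything that was lost in the step, including rank from pages without forward links, the total rank stays at 1 throughout the iteration.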

Page 17: Convergence and Random Walks: Why does it work?

Irreducible aperiodic Markov chains with a primitive transition probability matrix

What is the issue all about?
– We need a transition matrix model that is guaranteed to converge, and that does indeed converge to a unique stationary distribution vector.

Page 18: Contd.

Addition of the escape vector E allows us to make the original matrix A both primitive and stochastic
– This guarantees convergence

What about the addition of new links?
– Are link analysis algorithms based on eigenvectors stable, in the sense that the results don’t change significantly?

Suppose the connectivity of a portion of the graph is changed arbitrarily
– How will it affect the results of the algorithms?

Ng et al. (2001) IJCAI and Bianchini et al. (2002) WWW’02

• It is possible to perturb a symmetric matrix by a quantity that grows as d1 that produces a constant perturbation of the dominant eigenvector

Page 19: Contd.

Convergence experiment(s)
– Expander graphs and d1 (every subset S has a neighborhood larger than some factor times |S|)
– Rapidly mixing random walk: convergence is guaranteed in time logarithmic in the size of the graph

Page 20: Implementation: Getting your hands dirty

In 1998
– 24 million web pages
– Crawler builds an index of links
– To do this in 5 days, 50 Web pages/second need to be crawled
– 11 is the average outdegree, 550 links/second
– 75 million unique URLs to be compared against
– URLs are hashed to unique integer IDs
– No dangling links are kept initially
– Vector E will help in convergence issues also
– Weights were kept for 75 million URLs @ 4 bytes/weight (300MB)
– Access to the link database is linear since it is sorted

’99 – 800 million pages; ’00 – 2 billion; ’01 – 4 billion
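The figures on this slide can be checked with a few lines of arithmetic. The variable names are illustrative, and the megabyte figure assumes decimal megabytes (10^6 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages = 24_000_000
crawl_days = 5
# Exact rate needed to fetch every page in the crawl window.
pages_per_second = pages / (crawl_days * SECONDS_PER_DAY)

avg_outdegree = 11
# The slide rounds the crawl rate to 50 pages/s before multiplying.
links_per_second = 50 * avg_outdegree

urls = 75_000_000
bytes_per_weight = 4
weights_megabytes = urls * bytes_per_weight / 1_000_000

print(f"{pages_per_second:.1f} pages/s, {links_per_second} links/s, "
      f"{weights_megabytes:.0f} MB")
```

The exact crawl rate comes out a little above the rounded 50 pages/second, and one 4-byte weight per URL gives the 300 MB quoted for the rank vector.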

Page 21: Personalized PageRank: The invisible source

$\|E\|_1 = 0.15$, with E uniform over all web pages
– Web pages are valued because they exist!
– Web pages with many related links receive an overly high ranking

The other extreme – E for just one web page
– Netscape home page and John McCarthy’s home page
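A sketch of the two extremes on a hypothetical three-page graph: a uniform E versus an E concentrated on a single page (here page b stands in for a personal home page). The function `pagerank`, the scaling factor `c = 0.85`, the fixed iteration count, and the choice of E summing to 1 are assumptions made for the illustration.

```python
def pagerank(links, E, c=0.85, iterations=100):
    """Iterate R <- c * A R + d * E, where d is the rank lost in each step."""
    rank = {p: 1.0 / len(links) for p in links}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in links}
        for v, forward in links.items():
            for u in forward:
                new_rank[u] += c * rank[v] / len(forward)
        d = sum(rank.values()) - sum(new_rank.values())
        for p in links:
            new_rank[p] += d * E[p]
        rank = new_rank
    return rank


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
uniform = pagerank(links, {p: 1.0 / len(links) for p in links})
personal = pagerank(links, {"a": 0.0, "b": 1.0, "c": 0.0})
print(round(uniform["b"], 4), round(personal["b"], 4))
```

Concentrating E on one page raises that page's rank and, through its forward links, the ranks of the pages it points to.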

Page 22: Applications: What wasn’t apparent already

Estimating Web traffic
– How PageRank corresponds to actual usage
– Internet proxy cache from NLANR compared to PageRank
– 2.6 million pages intersect with PageRank’s indexed 75 million
– Web-based email access is one plausible reason for this disparity
– People look at certain pages but never link to them

Backlink predictor
– PageRank is a better predictor of future citation counts than citation counts themselves.
– Experiment starts out with one URL and no other information
– Goal is to crawl the Web pages in the order of their importance
– Importance is an evaluation function based on citation counts (number of backlinks)
– PageRank escapes local minima; citation counts get stuck in these.

Page 23: Conclusions

In essence, the importance of one page being dependent on the importance of its predecessors is like a ‘peer’ review.

NASDAQ – 17th February, 2005 - $197.41 : Need I say More?