
Page 1: Scaling Personalized Web Search

Scaling Personalized Web Search

Glen Jeh, Jennifer Widom, Stanford University

Presented by Li-Tal Mashiach, Search Engine Technology course (236620)

Technion

Page 2: Scaling Personalized Web Search

Today’s topics

Overview
Motivation
Personal PageRank Vector
Efficient calculation of PPV
Experimental results
Discussion

Page 3: Scaling Personalized Web Search

PageRank Overview

Ranking method of web pages based on the link structure of the web

Important pages are those linked-to by many important pages

Original PageRank has no initial preference for any particular pages

Page 4: Scaling Personalized Web Search

PageRank Overview

The ranking is based on the probability that a random surfer will visit a certain page at a given time

E(p) can be:
Uniformly distributed
Biased (non-uniformly) distributed

PR(p) = (1 - c) \sum_{q: q \to p} \frac{PR(q)}{|Out(q)|} + c\, E(p)

Page 5: Scaling Personalized Web Search

Motivation

We would like to give higher importance to user-selected pages
User may have a set P of preferred pages
Instead of jumping to any random page with probability c, the jump is restricted to P
That way, we increase the probability that the random surfer will stay in the near environment of pages in P
Considering P will create a personalized view of the importance of pages on the web

Page 6: Scaling Personalized Web Search

Personalized PageRank Vector (PPV)

Restrict preference sets P to subsets of a set of hub pages H - set of pages with high PageRank

PPV is a vector of length n, where n is the number of pages on the web

PPV[p] = the importance of page p

P ⊆ H ⊆ V

Page 7: Scaling Personalized Web Search

PPV Equation

u – preference vector, with |u| = 1; u(p) = the amount of preference for page p

A – n×n matrix

c – the probability that the random surfer jumps to a page in P

PPV = (1 - c)\, A \cdot PPV + c\, u

A_{i,j} = \begin{cases} \dfrac{1}{|Out(j)|} & \text{if } j \to i \\[4pt] 0 & \text{otherwise} \end{cases}
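
As a concrete illustration of the equation above (and of the PageRank equation on slide 4), here is a minimal power-iteration sketch. The toy graph, the preference vector, and c = 0.15 are illustrative assumptions; this is not the paper's implementation, and every page is assumed to have at least one out-link.

```python
# A minimal sketch of solving PPV = (1-c) * A * PPV + c * u by power iteration.
import numpy as np

def personalized_pagerank(out_links, u, c=0.15, tol=1e-12, max_iter=1000):
    """out_links[j] lists the pages that page j links to; u is the preference vector."""
    n = len(out_links)
    A = np.zeros((n, n))                     # A[i, j] = 1/|Out(j)| if j -> i, else 0
    for j, outs in enumerate(out_links):
        for i in outs:
            A[i, j] = 1.0 / len(outs)
    u = np.asarray(u, dtype=float)
    ppv = u.copy()                           # start from the preference vector
    for _ in range(max_iter):
        new = (1 - c) * A @ ppv + c * u
        if np.abs(new - ppv).sum() < tol:
            break
        ppv = new
    return ppv

# Toy graph of 4 pages; the user's preference set P = {0}.
out_links = [[1, 2], [2], [0], [0, 2]]
u = [1.0, 0.0, 0.0, 0.0]
print(personalized_pagerank(out_links, u))
```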

Page 8: Scaling Personalized Web Search

PPV – Problem

Not practical to compute PPVs during query time

Not practical to compute and store offline: there are 2^{|H|} preference sets

How to calculate PPV? How to do it efficiently?

Page 9: Scaling Personalized Web Search

Main Steps to solution

Break down preference vectorspreference vectors into commoncommon componentsComputation divided between offlineoffline

(lots of time) and online online (focused computation)

Eliminates redundantredundant computation

Page 10: Scaling Personalized Web Search

Linearity Theorem

The solution to a linear combination of preference vectors is the same linear combination of the corresponding PPV’s .

Let x_i be a unit vector

Let r_i be the PPV corresponding to x_i, called a hub vector

For preference vectors u_1 and u_2 with corresponding PPVs v_1 and v_2:

PPV(\alpha_1 u_1 + \alpha_2 u_2) = \alpha_1\, PPV(u_1) + \alpha_2\, PPV(u_2)

In general, for a preference vector u = \sum_{i=1}^{n} \alpha_i x_i, the corresponding PPV is v = \sum_{i=1}^{n} \alpha_i r_i
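
A small numerical check of the Linearity Theorem on an assumed toy graph. Here ppv solves the PPV equation directly with a linear solver, not with the algorithms described later; the graph and weights are assumptions for illustration.

```python
# Check that PPV(a1*x1 + a2*x2) == a1*PPV(x1) + a2*PPV(x2) on a toy graph.
import numpy as np

def ppv(A, u, c=0.15):
    # Solve the fixed point (I - (1-c)A) v = c u directly.
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - (1 - c) * A, c * u)

# Column-stochastic matrix A for a 4-page toy graph (assumption).
A = np.array([[0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
x1 = np.array([1.0, 0.0, 0.0, 0.0])   # unit preference on page 0
x2 = np.array([0.0, 1.0, 0.0, 0.0])   # unit preference on page 1
r1, r2 = ppv(A, x1), ppv(A, x2)       # hub vectors for pages 0 and 1

u = 0.3 * x1 + 0.7 * x2               # a mixed preference vector
print(np.allclose(ppv(A, u), 0.3 * r1 + 0.7 * r2))  # True
```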

Page 11: Scaling Personalized Web Search

Example

Personal preferences of David: pages x_1, x_2, x_{12}, with corresponding hub vectors r_1, r_2, r_{12}

PPV_{David} = \frac{1}{3}\, r_1 + \frac{1}{3}\, r_2 + \frac{1}{3}\, r_{12}

Page 12: Scaling Personalized Web Search

Good, but not enough…

If the hub vector r_i for each page in H can be computed ahead of time and stored, then computing PPV is easier

The number of pre-computed PPVs decreases from 2^{|H|} to |H|

But… each hub vector computation requires multiple scans of the web graph
Time and space grow linearly with |H|
The solution so far is impractical

Page 13: Scaling Personalized Web Search

Decomposition of Hub Vectors

In order to compute and store the hub vectors efficiently, we can further break them down into…
Partial vector – unique component
Hubs skeleton – encodes interrelationships among hub vectors
Construct the full hub vector during query time
Saves computation time and storage due to sharing of components among hub vectors

Page 14: Scaling Personalized Web Search

Inverse P-distance

The hub vector r_p can be represented as an inverse P-distance vector:

r_p(q) = \sum_{t: p \rightsquigarrow q} P[t]\; c\, (1-c)^{l(t)}

l(t) – the number of edges in path t
P(t) – the probability of traveling on path t:

P[t] = \prod_{i=1}^{k-1} \frac{1}{|Out(w_i)|}, \quad \text{where } t = \langle w_1, \ldots, w_k \rangle
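
To make the definition concrete, the sketch below approximates r_p(q) by summing the contributions of all walks up to a cutoff length, aggregated by endpoint. The toy graph and the cutoff are assumptions; this is a brute-force illustration of the formula, not one of the paper's algorithms.

```python
# Approximate r_p(q) = sum over paths t from p to q of P[t] * c * (1-c)^l(t).
def inverse_p_distance(out_links, p, q, c=0.15, max_len=50):
    n = len(out_links)
    prob = [0.0] * n          # prob[v]: total P[t] over length-L walks from p ending at v
    prob[p] = 1.0
    total = 0.0
    for length in range(max_len + 1):
        total += prob[q] * c * (1 - c) ** length   # walks of this length ending at q
        nxt = [0.0] * n
        for v, mass in enumerate(prob):
            if mass:
                for w in out_links[v]:             # extend every walk by one edge
                    nxt[w] += mass / len(out_links[v])
        prob = nxt
    return total

out_links = [[1, 2], [2], [0], [0, 2]]             # toy graph (assumption)
print(inverse_p_distance(out_links, 0, 2))         # should match the PPV for u = x_0 at page 2
```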

Page 15: Scaling Personalized Web Search

Partial Vectors

Breaking r_p into two components:
Partial vector – computed without using any intermediate nodes from H
The rest – r_p^H, the paths that go through some page h ∈ H

r_p = (r_p - r_p^H) + r_p^H

For well-chosen sets H, it will be true that for many pages p, q:

r_p(q) - r_p^H(q) ≈ 0

Page 16: Scaling Personalized Web Search

Good, but not enough…

Precompute and store the partial vector (r_p - r_p^H)
Cheaper to compute and store than r_p
Decreases as |H| increases
Add r_p^H at query time to compute the full hub vector

But… computing and storing r_p^H could be as expensive as r_p itself

Page 17: Scaling Personalized Web Search

Hubs Skeleton

Breaking down r_p^H:

Hubs skeleton – the set of distances among hub pages, S = \{ r_p(H) : p \in H \}, giving the interrelationships among the partial vectors
For each p, r_p(H) has size at most |H|, much smaller than the full hub vector

r_p^H = \frac{1}{c} \sum_{h \in H} \left( r_p(h) - c\, x_p(h) \right) \left( r_h - r_h^H - c\, x_h \right)

The factors (r_h - r_h^H) are partial vectors; the values r_p(h) come from the hubs skeleton
The c\, x terms handle the case where p or h is itself in H
r_p^H covers the paths that go through some page h ∈ H

Page 18: Scaling Personalized Web Search
Page 19: Scaling Personalized Web Search

Example

Small graph with a hub page h ∈ H and pages a, b, c, d.

r_a(b) = \left( r_a(b) - r_a^H(b) \right) + r_a^H(b)

r_a(b) - r_a^H(b) = \frac{1}{2}\, c\, (1-c)^2

r_a^H(b) = \frac{1}{c} \left( r_a(h) - c\, x_a(h) \right) \left( r_h(b) - r_h^H(b) \right) = \frac{1}{c} \cdot \frac{1}{2}\, c\, (1-c) \cdot c\, (1-c) = \frac{1}{2}\, c\, (1-c)^2

Page 20: Scaling Personalized Web Search

Putting it all together

Given a chosen preference set P:

1. Form a preference vector u

u = \alpha_1 x_{i_1} + \alpha_2 x_{i_2} + \ldots + \alpha_z x_{i_z}

2. Calculate the hub vector r_{i_k} for each i_k

r_{i_k} = \left( r_{i_k} - r_{i_k}^H \right) + \frac{1}{c} \sum_{h \in H} \left( r_{i_k}(h) - c\, x_{i_k}(h) \right) \left( r_h - r_h^H - c\, x_h \right)

The first term uses the pre-computed partial vectors; the hubs-skeleton part may be deferred to query time

3. Combine the hub vectors

v = \alpha_1 r_{i_1} + \alpha_2 r_{i_2} + \ldots + \alpha_z r_{i_z}
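
A sketch of what this query-time assembly could look like, assuming the partial vectors and the hubs skeleton are already available as sparse dictionaries. The data layout and function names are illustrative assumptions; the c·x correction terms follow slide 17.

```python
# Rebuild hub vectors from precomputed partial vectors and the hubs skeleton,
# then mix them according to the preference weights (query-time step, sketch only).
from collections import defaultdict

def assemble_hub_vector(p, partial, skeleton, H, c=0.15):
    """r_p = (r_p - r_p^H) + (1/c) * sum_h (r_p(h) - c*x_p(h)) * (r_h - r_h^H - c*x_h).

    partial[p]  : dict q -> (r_p - r_p^H)(q), precomputed offline
    skeleton[p] : dict h -> r_p(h), the hubs-skeleton entries for p
    """
    r = defaultdict(float, partial[p])            # start from the partial vector
    for h in H:
        weight = skeleton[p].get(h, 0.0) - (c if h == p else 0.0)   # r_p(h) - c*x_p(h)
        if weight == 0.0:
            continue
        for q, val in partial[h].items():         # the (r_h - r_h^H) factor
            r[q] += weight * val / c
        r[h] -= weight                            # the -c*x_h term, divided by c
    return dict(r)

def ppv_at_query_time(preference, partial, skeleton, H, c=0.15):
    """preference: {hub page: alpha}, with the alphas summing to 1."""
    v = defaultdict(float)
    for page, alpha in preference.items():
        for q, val in assemble_hub_vector(page, partial, skeleton, H, c).items():
            v[q] += alpha * val
    return dict(v)
```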

Page 21: Scaling Personalized Web Search

Algorithms

Decomposition theorem - Basic dynamic programming algorithm

Partial vectors - Selective expansion algorithm

Hubs skeleton - Repeated squaring algorithm

Page 22: Scaling Personalized Web Search

Decomposition theorem

The basis vector r_p is the average of the basis vectors of its out-neighbors, plus a compensation factor:

r_p = \frac{1-c}{|Out(p)|} \sum_{q \in Out(p)} r_q + c\, x_p

Defines relationships among basis vectors
Having computed the basis vectors of p's out-neighbors to a certain precision, we can use the theorem to compute r_p to greater precision

Page 23: Scaling Personalized Web Search

Basic dynamic programming algorithm

\hat{r}_p^k – approximation of r_p at iteration k
E_p^k – error at iteration k for the basis vector of p

\hat{r}_p^{k+1} = \frac{1-c}{|Out(p)|} \sum_{q \in Out(p)} \hat{r}_q^{k} + c\, x_p

E_p^{k+1} = \frac{1-c}{|Out(p)|} \sum_{q \in Out(p)} E_q^{k}

Using the decomposition theorem, we can build a dynamic programming algorithm which iteratively improves the precision of the calculation
On iteration k, only paths with length ≤ k-1 are considered
The error is reduced by a factor of 1-c on each iteration
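
A minimal sketch of this iteration using a dense matrix of basis-vector estimates. It is practical only for tiny graphs; the layout and the toy graph are assumptions, not the paper's implementation.

```python
# Basic dynamic programming iteration for all basis vectors r_p:
#   r_p^{k+1} = (1-c)/|Out(p)| * sum_{q in Out(p)} r_q^k + c * x_p
import numpy as np

def basis_vectors(out_links, c=0.15, iterations=50):
    n = len(out_links)
    R = np.zeros((n, n))                 # R[p] is the current estimate of r_p
    for _ in range(iterations):
        new = np.zeros((n, n))
        for p, outs in enumerate(out_links):
            new[p] = (1 - c) / len(outs) * sum(R[q] for q in outs)
            new[p, p] += c               # the c * x_p term
        R = new                          # after k iterations: paths of length <= k-1
    return R

out_links = [[1, 2], [2], [0], [0, 2]]   # toy graph (assumption)
R = basis_vectors(out_links)
print(R[0])                              # approximation of the basis (hub) vector r_0
```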

Page 24: Scaling Personalized Web Search

Computing partial vectors

Selective expansion algorithm
Tours passing through a hub page h ∈ H are never considered
The expansion from p stops when reaching a page from H
The result converges to the partial vector (r_p - r_p^H)
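
A sketch of the selective-expansion idea for one partial vector: residual walk mass is pushed along out-edges but never expanded at a hub page, so hub pages can only appear as endpoints of the counted tours. The data layout, fixed iteration count, and toy graph are assumptions, not the paper's exact algorithm.

```python
# Compute one partial vector (r_p - r_p^H) by residual pushing.
from collections import defaultdict

def partial_vector(out_links, p, H, c=0.15, iterations=50):
    d = defaultdict(float)                # accumulates (r_p - r_p^H)
    e = {p: 1.0}                          # residual walk mass, starts at p
    for k in range(iterations):
        new_e = defaultdict(float)
        for q, mass in e.items():
            d[q] += c * mass              # walks may terminate at q
            if k > 0 and q in H:
                continue                  # reached a hub: stop expanding here
            for w in out_links[q]:        # otherwise follow each out-edge
                new_e[w] += (1 - c) * mass / len(out_links[q])
        e = new_e
    return dict(d)

out_links = [[1, 2], [2], [0], [0, 2]]    # toy graph (assumption)
print(partial_vector(out_links, p=0, H={2}))
```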

Page 25: Scaling Personalized Web Search

Computing hubs skeleton

Repeated squaring algorithm
Uses the intermediate results from the computation of partial vectors
The error is squared on each iteration – reduces the error much faster
Running time and storage depend only on the size of r_p(H), for p ∈ H
This allows the computation to be deferred to query time

Page 26: Scaling Personalized Web Search

Experimental results

Perform experiments using real web data from Stanford’s WebBase, containing 80 million pages after removing leaf pages

Experiments were run using a 1.4 gigahertz CPU on a machine with 3.5 gigabytes of memory

Page 27: Scaling Personalized Web Search

Experimental results

Partial vector approach is much more effective when H contains high-PageRank pages

H was taken from the top 1000 to the top 100,000 pages with the highest PageRank

Page 28: Scaling Personalized Web Search

Experimental results

Compute hubs skeleton for |H|=10,000

Average size is 9021 entries, much smaller than the dimension of a full hub vector

Page 29: Scaling Personalized Web Search

Experimental results

Instead of using the entire set r_p(H), use only the highest m entries

A hub vector containing 14 million nonzero entries can be constructed from partial vectors in 6 seconds

Page 30: Scaling Personalized Web Search

Discussion

Are personalized PageRanks even useful?
What if the personally chosen pages are not representative enough? Too focused?
Even if the overhead scales with the number of pages, do light web users want to accept that overhead?
Performance depends on the choice of personal pages

Page 31: Scaling Personalized Web Search

References

Glen Jeh and Jennifer Widom. Scaling Personalized Web Search. WWW 2003.

Personalized PageRank seminar: Link mining. http://www.informatik.uni-freiburg.de/~ml/teaching/ws04/lm/20041207_PageRank_Alcazar.ppt