
Page 1: Scaling Personalized Web Search

Scaling Personalized Web Search

Glen Jeh, Jennifer Widom, Stanford University

Presented by Li-Tal Mashiach, Search Engine Technology course (236620)

Technion

Page 2: Scaling Personalized Web Search

Today’s topics

Overview
Motivation
Personal PageRank Vector
Efficient calculation of PPV
Experimental results
Discussion

Page 3: Scaling Personalized Web Search

PageRank Overview

Ranking method of web pages based on the link structure of the web

Important pages are those linked-to by many important pages

Original PageRank has no initial preference for any particular pages

Page 4: Scaling Personalized Web Search

PageRank Overview

The ranking is based on the probability that a random surfer will visit a certain page at a given time

E(p) can be:
Uniformly distributed
Biased (non-uniformly) distributed

PR(p) = (1 - c) \sum_{q: q \to p} \frac{PR(q)}{|Out(q)|} + c\, E(p)

Page 5: Scaling Personalized Web Search

Motivation

We would like to give higher importance to user-selected pages
User may have a set P of preferred pages
Instead of jumping to any random page with probability c, the jump is restricted to P
That way, we increase the probability that the random surfer will stay in the near environment of pages in P
Considering P will create a personalized view of the importance of pages on the web

Page 6: Scaling Personalized Web Search

Personalized PageRank Vector (PPV)

Restrict preference sets P to subsets of a set of hub pages H - set of pages with high PageRank

PPV is a vector of length n, where n is the number of pages on the web

PPV[p] = the importance of page p

P ⊆ H ⊆ V

Page 7: Scaling Personalized Web Search

PPV Equation

u – preference vector, with |u| = 1; u(p) = the amount of preference for page p

A – n×n matrix

c – the probability that the random surfer jumps to a page in P

PPV = (1 - c)\, A \cdot PPV + c\, u

A_{i,j} = \begin{cases} \dfrac{1}{|Out(j)|} & \text{if } j \to i \\[4pt] 0 & \text{otherwise} \end{cases}
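
As a concrete illustration of the equation above (and of the PageRank equation on slide 4), here is a minimal power-iteration sketch. The toy graph, the preference vector, and c = 0.15 are illustrative assumptions; this is not the paper's implementation, and every page is assumed to have at least one out-link.

```python
# A minimal sketch of solving PPV = (1-c) * A * PPV + c * u by power iteration.
import numpy as np

def personalized_pagerank(out_links, u, c=0.15, tol=1e-12, max_iter=1000):
    """out_links[j] lists the pages that page j links to; u is the preference vector."""
    n = len(out_links)
    A = np.zeros((n, n))                     # A[i, j] = 1/|Out(j)| if j -> i, else 0
    for j, outs in enumerate(out_links):
        for i in outs:
            A[i, j] = 1.0 / len(outs)
    u = np.asarray(u, dtype=float)
    ppv = u.copy()                           # start from the preference vector
    for _ in range(max_iter):
        new = (1 - c) * A @ ppv + c * u
        if np.abs(new - ppv).sum() < tol:
            break
        ppv = new
    return ppv

# Toy graph of 4 pages; the user's preference set P = {0}.
out_links = [[1, 2], [2], [0], [0, 2]]
u = [1.0, 0.0, 0.0, 0.0]
print(personalized_pagerank(out_links, u))
```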

Page 8: Scaling Personalized Web Search

PPV – Problem

Not practical to compute PPVs during query time

Not practical to compute and store offline: there are 2^{|H|} preference sets

How to calculate PPV? How to do it efficiently?

Page 9: Scaling Personalized Web Search

Main Steps to solution

Break down preference vectorspreference vectors into commoncommon componentsComputation divided between offlineoffline

(lots of time) and online online (focused computation)

Eliminates redundantredundant computation

Page 10: Scaling Personalized Web Search

Linearity Theorem

The solution to a linear combination of preference vectors is the same linear combination of the corresponding PPV’s .

Let x_i be a unit vector

Let r_i be the PPV corresponding to x_i, called a hub vector

For preference vectors u_1 and u_2 with corresponding PPVs v_1 and v_2:

PPV(\alpha_1 u_1 + \alpha_2 u_2) = \alpha_1\, PPV(u_1) + \alpha_2\, PPV(u_2)

In general, for a preference vector u = \sum_{i=1}^{n} \alpha_i x_i, the corresponding PPV is v = \sum_{i=1}^{n} \alpha_i r_i
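
A small numerical check of the Linearity Theorem on an assumed toy graph. Here ppv solves the PPV equation directly with a linear solver, not with the algorithms described later; the graph and weights are assumptions for illustration.

```python
# Check that PPV(a1*x1 + a2*x2) == a1*PPV(x1) + a2*PPV(x2) on a toy graph.
import numpy as np

def ppv(A, u, c=0.15):
    # Solve the fixed point (I - (1-c)A) v = c u directly.
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - (1 - c) * A, c * u)

# Column-stochastic matrix A for a 4-page toy graph (assumption).
A = np.array([[0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
x1 = np.array([1.0, 0.0, 0.0, 0.0])   # unit preference on page 0
x2 = np.array([0.0, 1.0, 0.0, 0.0])   # unit preference on page 1
r1, r2 = ppv(A, x1), ppv(A, x2)       # hub vectors for pages 0 and 1

u = 0.3 * x1 + 0.7 * x2               # a mixed preference vector
print(np.allclose(ppv(A, u), 0.3 * r1 + 0.7 * r2))  # True
```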

Page 11: Scaling Personalized Web Search

Example

Personal preferences of David: pages x_1, x_2, x_{12}, with corresponding hub vectors r_1, r_2, r_{12}

PPV_{David} = \frac{1}{3}\, r_1 + \frac{1}{3}\, r_2 + \frac{1}{3}\, r_{12}

Page 12: Scaling Personalized Web Search

Good, but not enough…

If the hub vector r_i for each page in H can be computed ahead of time and stored, then computing PPV is easier

The number of pre-computed PPVs decreases from 2^{|H|} to |H|

But… each hub vector computation requires multiple scans of the web graph
Time and space grow linearly with |H|
The solution so far is impractical

Page 13: Scaling Personalized Web Search

Decomposition of Hub Vectors

In order to compute and store the hub vectors efficiently, we can further break them down into…
Partial vector – unique component
Hubs skeleton – encodes interrelationships among hub vectors
Construct the full hub vector during query time
Saves computation time and storage due to sharing of components among hub vectors

Page 14: Scaling Personalized Web Search

Inverse P-distance

The hub vector r_p can be represented as an inverse P-distance vector:

r_p(q) = \sum_{t: p \rightsquigarrow q} P[t]\; c\, (1-c)^{l(t)}

l(t) – the number of edges in path t
P(t) – the probability of traveling on path t:

P[t] = \prod_{i=1}^{k-1} \frac{1}{|Out(w_i)|}, \quad \text{where } t = \langle w_1, \ldots, w_k \rangle
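
To make the definition concrete, the sketch below approximates r_p(q) by summing the contributions of all walks up to a cutoff length, aggregated by endpoint. The toy graph and the cutoff are assumptions; this is a brute-force illustration of the formula, not one of the paper's algorithms.

```python
# Approximate r_p(q) = sum over paths t from p to q of P[t] * c * (1-c)^l(t).
def inverse_p_distance(out_links, p, q, c=0.15, max_len=50):
    n = len(out_links)
    prob = [0.0] * n          # prob[v]: total P[t] over length-L walks from p ending at v
    prob[p] = 1.0
    total = 0.0
    for length in range(max_len + 1):
        total += prob[q] * c * (1 - c) ** length   # walks of this length ending at q
        nxt = [0.0] * n
        for v, mass in enumerate(prob):
            if mass:
                for w in out_links[v]:             # extend every walk by one edge
                    nxt[w] += mass / len(out_links[v])
        prob = nxt
    return total

out_links = [[1, 2], [2], [0], [0, 2]]             # toy graph (assumption)
print(inverse_p_distance(out_links, 0, 2))         # should match the PPV for u = x_0 at page 2
```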

Page 15: Scaling Personalized Web Search

Partial Vectors

Breaking r_p into two components:
Partial vector – computed without using any intermediate nodes from H
The rest – r_p^H, the paths that go through some page h ∈ H

r_p = (r_p - r_p^H) + r_p^H

For well-chosen sets H, it will be true that for many pages p, q:

r_p(q) - r_p^H(q) ≈ 0

Page 16: Scaling Personalized Web Search

Good, but not enough…

Precompute and store the partial vector (r_p - r_p^H)
Cheaper to compute and store than r_p
Decreases as |H| increases
Add r_p^H at query time to compute the full hub vector

But… computing and storing r_p^H could be as expensive as r_p itself

Page 17: Scaling Personalized Web Search

Hubs Skeleton

Breaking down r_p^H:

Hubs skeleton – the set of distances among hub pages, S = \{ r_p(H) : p \in H \}, giving the interrelationships among the partial vectors
For each p, r_p(H) has size at most |H|, much smaller than the full hub vector

r_p^H = \frac{1}{c} \sum_{h \in H} \left( r_p(h) - c\, x_p(h) \right) \left( r_h - r_h^H - c\, x_h \right)

The factors (r_h - r_h^H) are partial vectors; the values r_p(h) come from the hubs skeleton
The c\, x terms handle the case where p or h is itself in H
r_p^H covers the paths that go through some page h ∈ H

Page 18: Scaling Personalized Web Search
Page 19: Scaling Personalized Web Search

Example

Small graph with a hub page h ∈ H and pages a, b, c, d.

r_a(b) = \left( r_a(b) - r_a^H(b) \right) + r_a^H(b)

r_a(b) - r_a^H(b) = \frac{1}{2}\, c\, (1-c)^2

r_a^H(b) = \frac{1}{c} \left( r_a(h) - c\, x_a(h) \right) \left( r_h(b) - r_h^H(b) \right) = \frac{1}{c} \cdot \frac{1}{2}\, c\, (1-c) \cdot c\, (1-c) = \frac{1}{2}\, c\, (1-c)^2

Page 20: Scaling Personalized Web Search

Putting it all together

Given a chosen preference set P:

1. Form a preference vector u

u = \alpha_1 x_{i_1} + \alpha_2 x_{i_2} + \ldots + \alpha_z x_{i_z}

2. Calculate the hub vector r_{i_k} for each i_k

r_{i_k} = \left( r_{i_k} - r_{i_k}^H \right) + \frac{1}{c} \sum_{h \in H} \left( r_{i_k}(h) - c\, x_{i_k}(h) \right) \left( r_h - r_h^H - c\, x_h \right)

The first term uses the pre-computed partial vectors; the hubs-skeleton part may be deferred to query time

3. Combine the hub vectors

v = \alpha_1 r_{i_1} + \alpha_2 r_{i_2} + \ldots + \alpha_z r_{i_z}
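
A sketch of what this query-time assembly could look like, assuming the partial vectors and the hubs skeleton are already available as sparse dictionaries. The data layout and function names are illustrative assumptions; the c·x correction terms follow slide 17.

```python
# Rebuild hub vectors from precomputed partial vectors and the hubs skeleton,
# then mix them according to the preference weights (query-time step, sketch only).
from collections import defaultdict

def assemble_hub_vector(p, partial, skeleton, H, c=0.15):
    """r_p = (r_p - r_p^H) + (1/c) * sum_h (r_p(h) - c*x_p(h)) * (r_h - r_h^H - c*x_h).

    partial[p]  : dict q -> (r_p - r_p^H)(q), precomputed offline
    skeleton[p] : dict h -> r_p(h), the hubs-skeleton entries for p
    """
    r = defaultdict(float, partial[p])            # start from the partial vector
    for h in H:
        weight = skeleton[p].get(h, 0.0) - (c if h == p else 0.0)   # r_p(h) - c*x_p(h)
        if weight == 0.0:
            continue
        for q, val in partial[h].items():         # the (r_h - r_h^H) factor
            r[q] += weight * val / c
        r[h] -= weight                            # the -c*x_h term, divided by c
    return dict(r)

def ppv_at_query_time(preference, partial, skeleton, H, c=0.15):
    """preference: {hub page: alpha}, with the alphas summing to 1."""
    v = defaultdict(float)
    for page, alpha in preference.items():
        for q, val in assemble_hub_vector(page, partial, skeleton, H, c).items():
            v[q] += alpha * val
    return dict(v)
```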

Page 21: Scaling Personalized Web Search

Algorithms

Decomposition theorem - Basic dynamic programming algorithm

Partial vectors - Selective expansion algorithm

Hubs skeleton - Repeated squaring algorithm

Page 22: Scaling Personalized Web Search

Decomposition theorem

The basis vector r_p is the average of the basis vectors of its out-neighbors, plus a compensation factor:

r_p = \frac{1-c}{|Out(p)|} \sum_{q \in Out(p)} r_q + c\, x_p

Defines relationships among basis vectors
Having computed the basis vectors of p's out-neighbors to a certain precision, we can use the theorem to compute r_p to greater precision

Page 23: Scaling Personalized Web Search

Basic dynamic programming algorithm

\hat{r}_p^k – approximation of r_p at iteration k
E_p^k – error at iteration k for the basis vector of p

\hat{r}_p^{k+1} = \frac{1-c}{|Out(p)|} \sum_{q \in Out(p)} \hat{r}_q^{k} + c\, x_p

E_p^{k+1} = \frac{1-c}{|Out(p)|} \sum_{q \in Out(p)} E_q^{k}

Using the decomposition theorem, we can build a dynamic programming algorithm which iteratively improves the precision of the calculation
On iteration k, only paths with length ≤ k-1 are considered
The error is reduced by a factor of 1-c on each iteration
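
A minimal sketch of this iteration using a dense matrix of basis-vector estimates. It is practical only for tiny graphs; the layout and the toy graph are assumptions, not the paper's implementation.

```python
# Basic dynamic programming iteration for all basis vectors r_p:
#   r_p^{k+1} = (1-c)/|Out(p)| * sum_{q in Out(p)} r_q^k + c * x_p
import numpy as np

def basis_vectors(out_links, c=0.15, iterations=50):
    n = len(out_links)
    R = np.zeros((n, n))                 # R[p] is the current estimate of r_p
    for _ in range(iterations):
        new = np.zeros((n, n))
        for p, outs in enumerate(out_links):
            new[p] = (1 - c) / len(outs) * sum(R[q] for q in outs)
            new[p, p] += c               # the c * x_p term
        R = new                          # after k iterations: paths of length <= k-1
    return R

out_links = [[1, 2], [2], [0], [0, 2]]   # toy graph (assumption)
R = basis_vectors(out_links)
print(R[0])                              # approximation of the basis (hub) vector r_0
```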

Page 24: Scaling Personalized Web Search

Computing partial vectors

Selective expansion algorithm
Tours passing through a hub page h ∈ H are never considered
The expansion from p stops when reaching a page from H
The result converges to the partial vector (r_p - r_p^H)
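
A sketch of the selective-expansion idea for one partial vector: residual walk mass is pushed along out-edges but never expanded at a hub page, so hub pages can only appear as endpoints of the counted tours. The data layout, fixed iteration count, and toy graph are assumptions, not the paper's exact algorithm.

```python
# Compute one partial vector (r_p - r_p^H) by residual pushing.
from collections import defaultdict

def partial_vector(out_links, p, H, c=0.15, iterations=50):
    d = defaultdict(float)                # accumulates (r_p - r_p^H)
    e = {p: 1.0}                          # residual walk mass, starts at p
    for k in range(iterations):
        new_e = defaultdict(float)
        for q, mass in e.items():
            d[q] += c * mass              # walks may terminate at q
            if k > 0 and q in H:
                continue                  # reached a hub: stop expanding here
            for w in out_links[q]:        # otherwise follow each out-edge
                new_e[w] += (1 - c) * mass / len(out_links[q])
        e = new_e
    return dict(d)

out_links = [[1, 2], [2], [0], [0, 2]]    # toy graph (assumption)
print(partial_vector(out_links, p=0, H={2}))
```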

Page 25: Scaling Personalized Web Search

Computing hubs skeleton

Repeated squaring algorithm
Uses the intermediate results from the computation of partial vectors
The error is squared on each iteration – reduces the error much faster
Running time and storage depend only on the size of r_p(H), for p ∈ H
This allows the computation to be deferred to query time

Page 26: Scaling Personalized Web Search

Experimental results

Perform experiments using real web data from Stanford’s WebBase, containing 80 million pages after removing leaf pages

Experiments were run using a 1.4 gigahertz CPU on a machine with 3.5 gigabytes of memory

Page 27: Scaling Personalized Web Search

Experimental results

Partial vector approach is much more effective when H contains high-PageRank pages

H was taken from the top 1000 to the top 100,000 pages with the highest PageRank

Page 28: Scaling Personalized Web Search

Experimental results

Compute hubs skeleton for |H|=10,000

Average size is 9021 entries, much smaller than the dimension of a full hub vector

Page 29: Scaling Personalized Web Search

Experimental results

Instead of using the entire set r_p(H), use only the highest m entries

A hub vector containing 14 million nonzero entries can be constructed from partial vectors in 6 seconds

Page 30: Scaling Personalized Web Search

Discussion

Are personalized PageRanks even useful?
What if the personally chosen pages are not representative enough? Too focused?
Even if the overhead scales with the number of pages, do light web users want to accept that overhead?
Performance depends on the choice of personal pages

Page 31: Scaling Personalized Web Search

References

Glen Jeh and Jennifer Widom. Scaling Personalized Web Search. WWW 2003.

Personalized PageRank seminar: Link mining. http://www.informatik.uni-freiburg.de/~ml/teaching/ws04/lm/20041207_PageRank_Alcazar.ppt