
Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys

ICDE 2007


Ivana Podnar Žarko*, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer

School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland

*FER, University of Zagreb, Croatia

Contact: [email protected]



Contents

• Motivation
• Indexing and Retrieval model (HDKs)
• Scalability analysis
• Experimental results
• Conclusion



Motivation

• Clustered retrieval engines are reaching scalability limits

– Fast growing public Web

– Immense volume of privately owned content that will never be indexed by search engines like Google or Yahoo

– Dynamically changing content

• P2P retrieval as a scalable alternative

– Involve large number of peer machines (millions)

– Exploit scalable P2P search techniques

– Support community-oriented search



P2P full text retrieval

• Goals
– retrieval performance comparable to state-of-the-art engines
– scalable in terms of generated traffic (indexing and retrieval)

• Two basic approaches
– Document partitioning: unstructured overlay network for search (e.g. Gnutella)
– Term partitioning: structured overlay network for search (e.g. Chord, P-Grid)

• Problem: communication cost for search [Li et al, IPTPS 2003]
– Document partitioning: broadcast search
– Term partitioning: long posting lists transmitted over network, in particular when processing multi-term queries



Approach

• Some facts about web retrieval
– queries are in general short (on average 2 to 3 terms)
– users pose queries containing frequent terms
– users are interested in a few high-precision answers (fast)

• Full-text information retrieval engine built over a structured P2P network specifically considering these observations

• ALVIS PEERS – EU FP6 research project (2004-2006)






P2PIR Architecture

• Structured P2P network with N peers
– logarithmic lookup cost for keys

• Large document collection D

• Each peer
– a) indexes part of the global collection D (Pi) and
– b) maintains part of the global index

[Figure: layers of an IR PEER: Web service IF, Ranking, HDK Indexing/Querying, P2P. Each peer holds an LI and a part of the GKI.]

• LI: Local single-term index
• GKI: Global key index (k, postinglist(k))



Single-term P2P indexing

key = single term

Local index
– Peer1: t1:{d1, d2}, t2:{d1, d3}
– Peer2: t1:{d4, d5}, t2:{d6}
– Peer3: t1:{d7, d8}, t2:{d7}

Global single-term index
– t1:{d1, d2, d4, d5, d7, d8}
– t2:{d1, d3, d6, d7}

Querying peer
– Q = {t1, t2}
– t1,t2:{d1, d4, d7}

• Retrieval traffic is not scalable!
– grows with D as in Heaps' law? (D: collection size in no. of terms)
– experimentally linear, since frequent terms are used frequently in queries
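The traffic problem can be made concrete with the posting lists from this slide. The following is a minimal Python sketch, not the ALVIS PEERS implementation: the function names are hypothetical, and it assumes that the querying peer fetches the complete posting list of every query term.

```python
from collections import defaultdict

# Local single-term indexes from the slide: peer -> term -> documents.
LOCAL_INDEXES = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}


def build_global_index(local_indexes):
    """Merge the per-peer posting lists into one global single-term index."""
    global_index = defaultdict(list)
    for peer_index in local_indexes.values():
        for term, docs in peer_index.items():
            global_index[term].extend(docs)
    return {term: sorted(docs) for term, docs in global_index.items()}


def retrieval_traffic(global_index, query_terms):
    """Count postings shipped to the querying peer (assumed: one full list per term)."""
    return sum(len(global_index.get(term, [])) for term in query_terms)


global_index = build_global_index(LOCAL_INDEXES)
traffic = retrieval_traffic(global_index, ["t1", "t2"])
print(global_index["t1"], global_index["t2"], traffic)
```

For Q = {t1, t2} the querying peer receives 6 + 4 = 10 postings. Nothing caps this number, so it keeps growing as documents containing t1 or t2 are added to the collection.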



HDK-based P2P indexing

key = set of terms

Local index
– Peer1: t1:{d1, d2}, t2:{d1, d3}
– Peer2: t1:{d4, d5}, t2:{d6}, t3:{d5, d6}
– Peer3: t1:{d7, d8}, t2:{d8}, t3:{d7}

Global key index (DFmax = 4)
– t1:{d4, d1, d8, d5} (posting list truncated to top-DFmax postings)
– t2:{d1, d3, d6, d8}
– k13 = (t1, t3):{d5, d7}

Querying peer
– Q = {t1, t2, t3}
– k13, k2:{d5, d7, d1, d3, d6, d8}

• Retrieval traffic is bounded by DFmax and query size!
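The lookup on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' algorithm: the key-selection rule (prefer the largest indexed keys and skip any key whose terms are already covered by a selected key) is a guess chosen because it reproduces the slide's example, and the names `select_keys` and `retrieve` are hypothetical.

```python
from itertools import combinations

DF_MAX = 4  # value used on the slide

# Global key index from the slide: key (sorted tuple of terms) -> posting list.
GLOBAL_KEY_INDEX = {
    ("t1",): ["d4", "d1", "d8", "d5"],  # truncated to the top-DF_MAX postings
    ("t2",): ["d1", "d3", "d6", "d8"],
    ("t1", "t3"): ["d5", "d7"],         # k13
}


def select_keys(query_terms, index, s_max=3):
    """Pick indexed keys for a query, largest first (assumed selection rule)."""
    selected = []
    terms = sorted(set(query_terms))
    for size in range(min(s_max, len(terms)), 0, -1):
        for candidate in combinations(terms, size):
            subsumed = any(set(candidate) <= set(key) for key in selected)
            if candidate in index and not subsumed:
                selected.append(candidate)
    return selected


def retrieve(query_terms, index, s_max=3):
    """Fetch the posting list of each selected key and merge without duplicates."""
    keys = select_keys(query_terms, index, s_max)
    merged, traffic = [], 0
    for key in keys:
        postings = index[key]
        traffic += len(postings)  # postings sent to the querying peer
        for doc in postings:
            if doc not in merged:
                merged.append(doc)
    return keys, merged, traffic


keys, docs, traffic = retrieve(["t1", "t2", "t3"], GLOBAL_KEY_INDEX)
print(keys, docs, traffic)
```

With the slide's data the query is answered from k13 and t2, and 6 postings are transferred. Because every stored posting list holds at most DFmax entries, the traffic can never exceed the number of selected keys times DFmax (here 2 × 4 = 8).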



Single-term vs. HDK-based P2P indexing

• Single-term indexing: small vocabulary, long posting lists
– term 1: posting list 1, term 2: posting list 2, ..., term M-1: posting list M-1, term M: posting list M
– distributed over PEER 1 ... PEER N

• Highly discriminative key indexing: large vocabulary, short posting lists
– PEER 1: key 11: posting list 11, key 12: posting list 12, ..., key 1i: posting list 1i
– ...
– PEER N: key N1: posting list N1, key N2: posting list N2, ..., key Nj: posting list Nj

• HDKs: comparable retrieval quality (extended vocabulary)

• Vocabulary size could grow exponentially!



Keys and key filtering

Non-Discriminative Keys (NDKs)

• e.g. t1 is an NDK iff:

– t1 appears in more than DFmax collection documents

• posting lists truncated to top-DFmax documents

Highly-Discriminative Keys (HDKs)

• e.g. (t1, t2) is an HDK iff:

– t1 and t2 co-occur in fewer than DFmax collection documents (discriminative w.r.t. document collection)

– t1 and t2 are each non-discriminative (redundancy filter)

– t1 and t2 are within a window of size w (proximity filter)

– the no. of terms comprising a key is limited by smax (size filter)

• posting lists by definition contain at most DFmax documents

Key filtering enables scalable indexing!
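The filters can be sketched for keys of size 1 and 2 as follows. This is a simplified Python illustration, not the ALVIS PEERS code: it stops at pairs (as if smax = 2), all function names are hypothetical, and the slide does not say how a key with document frequency exactly DFmax is treated, so the sketch simply follows the slide's wording ("more than" for NDKs, "fewer than" for HDKs).

```python
from collections import defaultdict


def document_frequencies(docs, keys_of):
    """Map each key to the set of document ids in which it occurs."""
    df = defaultdict(set)
    for doc_id, tokens in docs.items():
        for key in keys_of(tokens):
            df[key].add(doc_id)
    return df


def pairs_in_window(tokens, w, allowed):
    """Yield sorted pairs of `allowed` terms that co-occur within w tokens."""
    for i, first in enumerate(tokens):
        if first not in allowed:
            continue
        # Proximity filter: only the w - 1 tokens following position i.
        for second in tokens[i + 1 : i + w]:
            if second in allowed and second != first:
                yield tuple(sorted((first, second)))


def size2_hdks(docs, df_max, w):
    """Return (NDK terms, HDK pairs) for keys of size 1 and 2."""
    term_df = document_frequencies(docs, lambda tokens: set(tokens))
    ndk_terms = {t for t, ids in term_df.items() if len(ids) > df_max}
    # Redundancy filter: pairs are built from non-discriminative terms only.
    pair_df = document_frequencies(
        docs, lambda tokens: set(pairs_in_window(tokens, w, ndk_terms))
    )
    hdk_pairs = {p: ids for p, ids in pair_df.items() if len(ids) < df_max}
    return ndk_terms, hdk_pairs


# Toy collection (made up for illustration).
DOCS = {
    "d1": ["a", "b", "x"],
    "d2": ["b", "a", "y"],
    "d3": ["a", "p", "q", "r", "b"],
    "d4": ["b", "p", "q", "r", "a"],
}
ndk_terms, hdk_pairs = size2_hdks(DOCS, df_max=3, w=3)
print(ndk_terms, hdk_pairs)
```

In the toy collection, a and b each occur in four documents and are therefore NDKs for DFmax = 3. They fall inside a common window only in d1 and d2, so the pair (a, b) has document frequency 2 and qualifies as an HDK.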






Scalability analysis (indexing)

• What is the upper bound on the index size for a very large document collection?

• D – collection size in no. of terms

• s – no. of terms comprising a key

• w – window size

• $IS_s$ – index size associated with keys of size s

• $P_{f,(s-1)}$ – probability of NDK occurrences where NDK size is (s-1)

key size: index size (location index)
– 1: $IS_1(D) \le D$
– 2: $IS_2(D) \le D \cdot P_{f,1}^{2} \cdot (w-1)$
– s: $IS_s(D) \le D \cdot P_{f,(s-1)}^{2} \cdot \binom{w-1}{s-1}$

• Is $P_{f,(s-1)}$ constant?
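To get a feeling for the window-dependent factor, the bound can be evaluated numerically. The Python sketch below reads the bound for keys of size s as $IS_s(D) \le D \cdot P_{f,(s-1)}^{2} \cdot \binom{w-1}{s-1}$ and uses w = 20, the window size reported for the experiments. The probability values are hypothetical placeholders, not figures from the paper, and `relative_index_bound` is an illustrative name.

```python
from math import comb


def relative_index_bound(s, w, p_ndk):
    """Upper bound on IS_s(D) / D, read as P_{f,(s-1)}^2 * C(w-1, s-1)."""
    if s == 1:
        return 1.0  # IS_1(D) <= D
    return p_ndk ** 2 * comb(w - 1, s - 1)


W = 20  # window size used in the experiments

# Hypothetical NDK occurrence probabilities, NOT taken from the paper:
# index s - 1 maps to an assumed value of P_{f,(s-1)}.
ASSUMED_P = {1: 0.5, 2: 0.1}

bound_s2 = relative_index_bound(2, W, ASSUMED_P[1])
bound_s3 = relative_index_bound(3, W, ASSUMED_P[2])
print(comb(W - 1, 1), comb(W - 1, 2), bound_s2, bound_s3)
```

The combinatorial factors are 19 for s = 2 and 171 for s = 3. They depend only on w and s, so if the NDK probability does not change with D, the bound is linear in D.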




Zipf model

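$z(r) = C \cdot r^{-a}$

[Figure: Zipf curve of term frequency z(r) over rank r. The thresholds Ff and Fr split the ranks into very frequent terms, frequent terms and rare terms; Fr = DFmax separates NDKs from HDKs.]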
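The rank at which the Zipf curve crosses a frequency threshold follows from solving $C \cdot r^{-a} = F$ for r, which gives $r = (C/F)^{1/a}$. The Python sketch below evaluates this for the rare-term threshold, taking Fr to be DFmax as the labels on the figure suggest. The values of C and a are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def zipf_frequency(rank, c, a):
    """Zipf model: frequency of the term at the given rank, z(r) = C * r^(-a)."""
    return c * rank ** (-a)


def cutoff_rank(threshold, c, a):
    """Approximate rank at which z(r) drops to the threshold: (C / F)^(1/a)."""
    return int((c / threshold) ** (1.0 / a))


# Hypothetical Zipf parameters, not fitted to any collection.
C = 1_000_000.0
A = 1.0
DF_MAX = 500  # one of the two values used in the experiments

rare_cutoff = cutoff_rank(DF_MAX, C, A)
print(rare_cutoff, zipf_frequency(rare_cutoff, C, A))
```

Under these assumed parameters the 2000 highest-ranked terms have a model frequency of at least DFmax; all terms beyond that rank count as rare.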




C increases for an increasing collection size, a remains const.

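$z(r) = C \cdot r^{-a}$

[Figure: Zipf curves z(r) over rank r with the thresholds Ff and Fr, showing how the curve changes as D increases.]

Theorem: Probability $P_{f,(s-1)}$ of NDK occurrence remains constant!

$IS_s(D) \le D \cdot const$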



Scalability analysis (retrieval)

• Retrieval traffic is bounded by DFmax and the number of keys a query is mapped to (constant)

$RT \le const \cdot DF_{max}$

Scalability theoretically guaranteed, but what are the constants? Experiments!
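A rough worst-case value for the constant can be computed by counting keys. The Python sketch below assumes that a query is mapped to every non-empty subset of its terms with at most smax terms and that each such key returns a full posting list of DFmax entries. Both assumptions are pessimistic simplifications made for illustration, and the function names are hypothetical.

```python
from math import comb


def max_keys_per_query(num_terms, s_max):
    """Number of non-empty term subsets with at most s_max terms."""
    largest = min(s_max, num_terms)
    return sum(comb(num_terms, size) for size in range(1, largest + 1))


def retrieval_traffic_bound(num_terms, s_max, df_max):
    """Worst-case number of postings transferred for one query."""
    return max_keys_per_query(num_terms, s_max) * df_max


# A 3-term query with s_max = 3 and the two DF_max values from the experiments.
bound_500 = retrieval_traffic_bound(3, 3, 500)
bound_400 = retrieval_traffic_bound(3, 3, 400)
print(max_keys_per_query(3, 3), bound_500, bound_400)
```

A 3-term query maps to at most 7 keys, which gives at most 3500 postings for DFmax = 500 and 2800 for DFmax = 400. Neither number depends on the collection size.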






Experiment

• System fully implemented in Java (available on request)

• Document collection
– 20,000, 40,000, ..., 140,000 documents from Wikipedia (www.wikipedia.org)

• Query log
– Wikipedia query log for 2 months (08/2004 and 09/2004)
– 3,000 randomly chosen queries from 2,000,000 unique queries with more than 20 hits

• No. of peers: 4, 8, ..., 28
– PCs running RedHat Linux with 1GB memory
– 100 Mbit Ethernet
– Each peer indexes 5,000 documents

• DFmax = 400 or 500, smax = 3, w = 20



Indexing costs

[Figure: two plots, "Average index size per peer" and "Average indexing traffic per peer". x-axis: #Documents (20,000 to 140,000); y-axis: #Postings (0 to 1.4E+07); series: ST, DFmax=500, DFmax=400.]

• HDK vs single-term (ST) indexing
– experimentally: HDK / ST = 13.9 (for 140,000 documents)
– theoretically: HDK / ST = 40.7 (overestimated upper bound!)



Retrieval costs

Retrieval traffic per query (Wikipedia query log)

• remains constant with a growing collection size for the HDK approach (linear for single-term)

[Figure: retrieval traffic per query. x-axis: #Documents (60,000 to 140,000); y-axis: #Postings (0 to 2.5E+04); series: ST, HDK DFmax=500, HDK DFmax=400.]



Estimated total generated traffic

[Figure: estimated total generated traffic. x-axis: #Documents (0 to 1E+09); y-axis: #Postings (0 to 2.5E+14); series: HDK, single-term.]

• Assumptions
– monthly indexing
– no. of queries per month: 1.5 × 10^6 (true no. of queries from the Wikipedia log, conservative estimate)

• For 1 billion documents, HDK generates 42 times less overall traffic



Retrieval performance

Overlap on top 20 documents

• comparable performance of the HDK-based approach to the centralized single-term engine with BM25

[Figure: overlap on top 20 documents. x-axis: #Documents (60,000 to 140,000); y-axis: Overlap [%] (0 to 100); series: ST, HDK DFmax=500, HDK DFmax=400.]
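An overlap measure of this kind can be computed as shown below. The slides do not spell out the exact definition, so this Python sketch assumes the plain reading: the share of the reference engine's top-k documents that also appear in the top-k of the evaluated engine, ignoring rank order. The function name and the example rankings are made up for illustration.

```python
def top_k_overlap(reference, candidate, k=20):
    """Percentage of the reference top-k documents also in the candidate top-k."""
    ref_top = set(reference[:k])
    if not ref_top:
        return 0.0
    return 100.0 * len(ref_top & set(candidate[:k])) / len(ref_top)


# Made-up rankings: the candidate shares 15 of the reference's top 20 documents.
reference_ranking = [f"d{i}" for i in range(20)]
candidate_ranking = reference_ranking[:15] + [f"x{i}" for i in range(5)]

overlap = top_k_overlap(reference_ranking, candidate_ranking)
print(overlap)
```

Averaging this value over all queries of a log would yield one point per collection size, which is the kind of curve the figure reports.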



Conclusion

• Novel indexing model based on indexing terms and term sets;

• Theoretical scalability model proves the proposed solution scales to large networks in terms of generated traffic both for indexing and retrieval;

• Running P2P prototype that exhibits retrieval performance fully comparable to a centralized term-based retrieval system;

• Associated resource requirements (storage, bandwidth consumption) grow in a scalable way as shown by experiments.



Ongoing work

• Further reduce the number of indexing keys using query-driven indexing to produce and store only profitable keys for query answering



Acknowledgement

• The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP 6 STREP project ALVIS (002068)