Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys

ICDE 2007

Ivana Podnar Žarko*, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer

School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland

*FER, University of Zagreb, Croatia

Contact: [email protected]

Contents

• Motivation
• Indexing and Retrieval model (HDKs)
• Scalability analysis
• Experimental results
• Conclusion

Motivation

• Clustered retrieval engines are reaching scalability limits

– Fast growing public Web

– Immense volume of privately owned content that will never be indexed by search engines like Google or Yahoo

– Dynamically changing content

• P2P retrieval as a scalable alternative

– Involve large number of peer machines (millions)

– Exploit scalable P2P search techniques

– Support community-oriented search

P2P full text retrieval

• Goals
– retrieval performance comparable to state-of-the-art engines
– scalable in terms of generated traffic (indexing and retrieval)

• Two basic approaches
– Document partitioning: unstructured overlay network for search (e.g. Gnutella)
– Term partitioning: structured overlay network for search (e.g. Chord, P-Grid)

• Problem: communication cost for search [Li et al, IPTPS 2003]
– Document partitioning: broadcast search
– Term partitioning: long posting lists transmitted over network, in particular when processing multi-term queries

Approach

• Some facts about web retrieval
– queries are in general short (on average 2 to 3 terms)
– users pose queries containing frequent terms
– users are interested in a few high-precision answers (fast)

• Full-text information retrieval engine built over a structured P2P network specifically considering these observations

• ALVIS PEERS – EU FP6 research project (2004-2006)

P2PIR Architecture

[Figure: IR PEER architecture. Each peer is drawn with the components Web service IF, Ranking, HDK Indexing/Querying and P2P, plus two indexes, LI and GKI.]

LI: local single-term index
GKI: global key index, storing pairs (k, postinglist(k))

• Structured P2P network with N peers
– logarithmic lookup cost for keys

• Large document collection D

• Each peer a) indexes part of the global collection D (Pi) and b) maintains part of the global index

Single-term P2P indexing

[Figure: single-term indexing example, key = single term]

Local index
Peer1: t1:{d1, d2}, t2:{d1, d3}
Peer2: t1:{d4, d5}, t2:{d6}
Peer3: t1:{d7, d8}, t2:{d7}

Global single-term index
t1:{d1, d2, d4, d5, d7, d8}
t2:{d1, d3, d6, d7}

Querying peer
Q = {t1,t2}, answer t1,t2:{d1, d4, d7}

Retrieval traffic is not scalable!
– grows with D (Heaps' law)? D: collection size in no. of terms
– experimentally linear; frequent terms are used frequently in queries
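The traffic problem of single-term indexing can be made concrete with a small sketch. It reuses the posting lists from the example above; the function names and the choice to measure traffic as the number of postings shipped to the querying peer are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# Local single-term indexes of the three peers, as in the example above.
LOCAL_INDEXES = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}


def build_global_index(local_indexes):
    """Merge the per-peer posting lists into one posting list per term."""
    global_index = defaultdict(list)
    for peer in sorted(local_indexes):
        for term, postings in local_indexes[peer].items():
            global_index[term].extend(postings)
    return dict(global_index)


def retrieval_traffic(query_terms, global_index):
    """Postings shipped to the querying peer: one full list per query term."""
    return sum(len(global_index.get(term, [])) for term in query_terms)


GLOBAL_INDEX = build_global_index(LOCAL_INDEXES)
print(retrieval_traffic(["t1", "t2"], GLOBAL_INDEX))
```

Because every posting list is shipped in full, the traffic for a query grows with the lists themselves, which is the effect the slide calls not scalable.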

HDK-based P2P indexing

[Figure: HDK-based indexing example, key = set of terms, DFmax = 4]

Local index
Peer1: t1:{d1, d2}, t2:{d1, d3}
Peer2: t1:{d4, d5}, t2:{d6}, t3:{d5, d6}
Peer3: t1:{d7, d8}, t2:{d8}, t3:{d7}

Global key index
t1:{d4, d1, d8, d5} (posting list truncated to top-DFmax postings)
t2:{d1, d3, d6, d8}
k13 = (t1, t3): {d5, d7}

Querying peer
Q = {t1, t2, t3}, answered with keys k13 and k2: {d5, d7, d1, d3, d6, d8}

Retrieval traffic is bounded by DFmax and query size!
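A minimal sketch of how a query could be mapped to keys in the example above. The selection rule used here (take indexed subsets of the query, largest first, and skip a subset already covered by a selected larger key) is an assumption chosen because it reproduces the keys k13 and k2 from the slide; the slides do not spell out the actual mapping procedure, and all names are hypothetical.

```python
from itertools import combinations

DF_MAX = 4

# Global key index from the example; keys are sets of terms.
GLOBAL_KEY_INDEX = {
    frozenset({"t1"}): ["d4", "d1", "d8", "d5"],  # truncated to top-DFmax
    frozenset({"t2"}): ["d1", "d3", "d6", "d8"],
    frozenset({"t1", "t3"}): ["d5", "d7"],        # k13
}


def map_query_to_keys(query_terms, index):
    """Select indexed subsets of the query, largest first (assumed rule)."""
    selected = []
    terms = sorted(query_terms)
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            covered = any(key < chosen for chosen in selected)
            if key in index and not covered:
                selected.append(key)
    return selected


def retrieve(query_terms, index):
    """Fetch the posting lists of the selected keys and merge them in order."""
    keys = map_query_to_keys(query_terms, index)
    docs, traffic = [], 0
    for key in keys:
        postings = index[key]
        traffic += len(postings)
        docs.extend(d for d in postings if d not in docs)
    return keys, docs, traffic


keys, docs, traffic = retrieve({"t1", "t2", "t3"}, GLOBAL_KEY_INDEX)
print(docs, traffic)
```

Since every fetched list holds at most DFmax postings, the traffic of a query cannot exceed the number of selected keys times DFmax, regardless of the collection size.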

Single-term vs. HDK-based P2P indexing

[Figure: distribution of the global index over PEER 1 ... PEER N under both schemes]

• Single-term indexing: term 1 ... term M, each with its posting list; small vocabulary, long posting lists

• Highly discriminative key (HDK) indexing: each peer holds keys (key 11, key 12, ..., key 1i on PEER 1 up to key N1, key N2, ..., key Nj on PEER N) with their posting lists; large vocabulary, short posting lists

• Comparable retrieval quality (extended vocabulary)

• Vocabulary size could grow exponentially!

Keys and key filtering

Non-Discriminative Keys (NDKs)

• e.g. t1 is an NDK iff:

– t1 appears in more than DFmax collection documents

• posting lists truncated to top-DFmax documents

Highly-Discriminative Keys (HDKs)

• e.g. (t1, t2) is an HDK iff:

– t1 & t2 appear in less than DFmax collection documents (discriminative w.r.t document collection)

– t1 and t2 are non-discriminative (redundancy filter)

– t1 and t2 are within a window of size w (proximity filter)

– the no. of terms comprising a key is limited by smax (size filter)

• posting lists by definition contain at most DFmax documents

Key filtering enables scalable indexing!
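The filters above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: candidate keys are term sets whose terms co-occur within a window of w tokens (proximity filter) and have at most s_max terms (size filter); a key is an NDK if it occurs in more than DFmax documents, and an HDK if it occurs in fewer than DFmax documents and all its sub-keys with one term less are NDKs (redundancy filter). The case of exactly DFmax documents is left unclassified, as on the slide. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def document_frequencies(docs, w, s_max):
    """Document frequency of every term set of size 1..s_max whose terms
    co-occur within a window of w consecutive tokens."""
    containing = defaultdict(set)
    for doc_id, tokens in docs.items():
        for start in range(len(tokens)):
            window = sorted(set(tokens[start:start + w]))
            for size in range(1, s_max + 1):
                for combo in combinations(window, size):
                    containing[frozenset(combo)].add(doc_id)
    return {key: len(ids) for key, ids in containing.items()}


def classify_keys(df, df_max):
    """Split candidate keys into NDKs and HDKs."""
    ndks = {key for key, n in df.items() if n > df_max}
    hdks = set()
    for key, n in df.items():
        if n >= df_max:
            continue  # not discriminative w.r.t. the collection
        # Redundancy filter: every sub-key with one term less must be an NDK.
        sub_keys = [key - {t} for t in key] if len(key) > 1 else []
        if all(sub in ndks for sub in sub_keys):
            hdks.add(key)
    return ndks, hdks
```

For example, with DFmax = 2 and two terms that each occur in four documents but appear next to each other in only one, both single terms come out as NDKs and their pair as an HDK, while a pair containing a rare term is dropped by the redundancy filter.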

Scalability analysis (indexing)

• What is the upper bound on the index size for a very large document collection?

• D – collection size in no. of terms

• s – no. of terms comprising a key

• w – window size

• IS_s – index size associated with keys of size s

• P_{f,(s-1)} – probability of NDK occurrences where NDK size is (s-1)

Key size and index size (location index):

key size 1: $IS_1(D) \le D$

key size 2: $IS_2(D) \le D \cdot P_{f,1}^{2} \cdot (w-1)$

key size s: $IS_s(D) \le D \cdot P_{f,(s-1)}^{2} \cdot \binom{w-1}{s-1}$

Is $P_{f,(s-1)}$ constant?
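To get a feel for the bound, the sketch below evaluates it numerically, assuming it takes the form IS_1(D) <= D and, for s >= 2, IS_s(D) <= D * P_{f,(s-1)}^2 * C(w-1, s-1). The function name and the sample values of D and P_f are illustrative assumptions, not figures from the talk.

```python
from math import comb


def index_size_bound(d, s, w, p_f):
    """Upper bound on the number of postings for keys of size s.

    d   -- collection size in no. of terms
    s   -- no. of terms comprising a key
    w   -- window size
    p_f -- probability of NDK occurrence for NDKs of size s-1 (unused for s=1)
    """
    if s == 1:
        return float(d)
    return d * p_f ** 2 * comb(w - 1, s - 1)


# Illustrative values only: 1000 term occurrences, w = 20, p_f = 0.5.
for s in (1, 2, 3):
    print(s, index_size_bound(1000, s, 20, 0.5))
```

For fixed w, s and p_f the bound is linear in D, which is why the question of whether the NDK occurrence probability stays constant matters.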

Scalability analysis (indexing)

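Zipf model: $z(r) = C \cdot r^{-a}$

[Figure: term frequency z(r) versus rank r. The frequency thresholds Ff and Fr separate very frequent terms, frequent terms and rare terms. Fr corresponds to DFmax: keys above it are NDKs, keys below it are HDKs.]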
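A small numeric sketch of the Zipf split, assuming z(r) = C / r^a and a single frequency threshold that plays the role of Fr = DFmax. The returned share of occurrences belonging to frequent terms is only a rough stand-in for the NDK occurrence probability used in the analysis; the function names and parameter values are illustrative assumptions.

```python
def zipf_frequency(rank, c, a):
    """Frequency of the term at the given rank under z(r) = C / r^a."""
    return c / rank ** a


def split_vocabulary(vocab_size, c, a, threshold):
    """Return (frequent ranks, rare ranks, share of occurrences that
    belong to frequent terms)."""
    frequent, rare = [], []
    frequent_mass = total_mass = 0.0
    for rank in range(1, vocab_size + 1):
        freq = zipf_frequency(rank, c, a)
        total_mass += freq
        if freq > threshold:
            frequent.append(rank)
            frequent_mass += freq
        else:
            rare.append(rank)
    return frequent, rare, frequent_mass / total_mass


frequent, rare, share = split_vocabulary(1000, 1000.0, 1.0, 150.0)
print(len(frequent), len(rare), round(share, 3))
```

Only a handful of top-ranked terms exceed the threshold, yet they account for a large share of all term occurrences, which is why truncating their posting lists and combining them into multi-term keys pays off.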

Scalability analysis (indexing)

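• C increases for an increasing collection size, a remains constant

[Figure: Zipf curves $z(r) = C \cdot r^{-a}$ versus rank r for increasing D, with the thresholds Ff and Fr]

Theorem: the probability $P_{f,(s-1)}$ of NDK occurrence remains constant!

$IS_s(D) \le \mathrm{const} \cdot D$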

Scalability analysis (retrieval)

• Retrieval traffic is bounded by DFmax and the number of keys a query is mapped to (constant)

$RT \le \mathrm{const} \cdot DF_{max}$

Scalability theoretically guaranteed, but what are the constants? Experiments!
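One way to see why the number of keys per query is a constant: assuming a query with q terms can be mapped at most to its term subsets of size 1 to smax, the key count depends only on q and smax, not on the collection. The sketch below computes that count and the resulting traffic bound; the subset-counting assumption and the function names are illustrative, not taken from the talk.

```python
from math import comb


def max_keys_for_query(query_size, s_max):
    """Number of non-empty term subsets of the query with at most s_max terms."""
    largest = min(query_size, s_max)
    return sum(comb(query_size, s) for s in range(1, largest + 1))


def retrieval_traffic_bound(query_size, s_max, df_max):
    """Upper bound on postings transferred: one list of at most DFmax per key."""
    return max_keys_for_query(query_size, s_max) * df_max


print(retrieval_traffic_bound(3, 3, 500))
```

For a three-term query with smax = 3 there are at most 7 keys, so with DFmax = 500 no more than 3500 postings are transferred, whatever the collection size.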

Experiment

• System fully implemented in Java (available on request)

• Document collection
– 20,000, 40,000, ..., 140,000 documents from Wikipedia (www.wikipedia.org)

• Query log
– Wikipedia query log for 2 months (08/2004 and 09/2004)
– 3,000 randomly chosen queries from 2,000,000 unique queries with more than 20 hits

• No. of peers: 4, 8, ..., 28
– PCs running RedHat Linux with 1GB memory
– 100 Mbit Ethernet
– Each peer indexes 5,000 documents

• DFmax = 400 or 500, smax = 3, w = 20

Indexing costs

[Figure: average index size per peer, in #postings (0 to 1.4E+07), versus #documents (20,000 to 140,000), for ST, DFmax=500 and DFmax=400]

[Figure: average indexing traffic per peer, in #postings (0 to 1.4E+07), versus #documents (20,000 to 140,000), for ST, DFmax=500 and DFmax=400]

HDK vs single-term (ST) indexing
– experimentally: HDK / ST = 13.9 (for 140,000 documents)
– theoretically: HDK / ST = 40.7 (overestimated upper bound!)

Retrieval costs

Retrieval traffic per query (Wikipedia query log)

• remains constant with a growing collection size for the HDK approach (linear for single-term)

[Figure: retrieval traffic per query, in #postings (0 to 2.5E+04), versus #documents (60,000 to 140,000), for ST, HDK with DFmax=500 and HDK with DFmax=400]

Estimated total generated traffic

[Figure: estimated total generated traffic, in #postings (0 to 2.5E+14), versus #documents (0 to 1E+09), for HDK and single-term]

Assumptions
• monthly indexing
• no. of queries per month: 1.5 * 10^6 (true no. of queries from the Wikipedia log, conservative estimate)

For 1 billion documents, HDK generates 42 times less overall traffic

Retrieval performance

Overlap on top 20 documents

• comparable performance of the HDK-based approach to the centralized single-term engine with BM25

[Figure: overlap [%] (0 to 100) versus #documents (60,000 to 140,000), for ST, HDK with DFmax=500 and HDK with DFmax=400]
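The overlap measure can be sketched as the percentage of the reference engine's top-k documents that also appear in the top-k of the system under test. The exact definition used in the experiments is not given on the slide, so the normalisation by the size of the reference top-k and the function name are assumptions.

```python
def overlap_at_k(reference_ranking, candidate_ranking, k=20):
    """Percentage of the reference top-k that also appears in the candidate top-k."""
    reference_top = set(reference_ranking[:k])
    candidate_top = set(candidate_ranking[:k])
    if not reference_top:
        return 0.0
    return 100.0 * len(reference_top & candidate_top) / len(reference_top)


# Reference: centralized single-term ranking; candidate: HDK-based ranking.
print(overlap_at_k(["a", "b", "c", "d"], ["a", "c", "e", "f"], k=4))
```

The measure ignores the order inside the top-k, so two rankings with the same documents in different positions count as a full overlap.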

Conclusion

• Novel indexing model based on indexing terms and term sets;

• Theoretical scalability model proves the proposed solution scales to large networks in terms of generated traffic both for indexing and retrieval;

• Running P2P prototype that exhibits retrieval performance fully comparable to a centralized term-based retrieval system;

• Associated resource requirements (storage, bandwidth consumption) grow in a scalable way as shown by experiments.

Ongoing work

• Further reduce the number of indexing keys using query-driven indexing to produce and store only profitable keys for query answering

Acknowledgement

• The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP 6 STREP project ALVIS (002068)