Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys
ICDE 2007
Ivana Podnar Žarko*, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
*FER, University of Zagreb, Croatia
Contact: [email protected]
Contents
• Motivation
• Indexing and Retrieval model (HDKs)
• Scalability analysis
• Experimental results
• Conclusion
Motivation
• Clustered retrieval engines are reaching scalability limits
– Fast growing public Web
– Immense volume of privately owned content that will never be indexed by search engines like Google or Yahoo
– Dynamically changing content
• P2P retrieval as a scalable alternative
– Involve large number of peer machines (millions)
– Exploit scalable P2P search techniques
– Support community-oriented search
P2P full text retrieval
• Goals
– retrieval performance comparable to state-of-the-art engines
– scalable in terms of generated traffic (indexing and retrieval)
• Two basic approaches
– Document partitioning: unstructured overlay network for search (e.g. Gnutella)
– Term partitioning: structured overlay network for search (e.g. Chord, P-Grid)
• Problem: communication cost for search [Li et al., IPTPS 2003]
– Document partitioning: broadcast search
– Term partitioning: long posting lists transmitted over the network, in particular when processing multi-term queries
Approach
• Some facts about web retrieval
– queries are in general short (on average 2 to 3 terms)
– users pose queries containing frequent terms
– users are interested in a few high-precision answers (fast)
• Full-text information retrieval engine built over a structured P2P network specifically considering these observations
• ALVIS PEERS – EU FP6 research project (2004-2006)
P2PIR Architecture
[Figure: IR peer architecture with the components Web service IF, Ranking, HDK Indexing/Querying, P2P, LI and GKI]
LI: Local single-term index
GKI: Global key index (k, postinglist(k))
• Structured P2P network with N peers
– logarithmic lookup cost for keys
• Large document collection D
• Each peer a) indexes part of the global collection D (Pi) and b) maintains part of the global index
Single-term P2P indexing
key = single-term
[Figure: single-term indexing example with three peers and a querying peer]
Local index
• Peer1: t1:{d1, d2}, t2:{d1, d3}
• Peer2: t1:{d4, d5}, t2:{d6}
• Peer3: t1:{d7, d8}, t2:{d7}
Global single-term index
• t1:{d1, d2, d4, d5, d7, d8}
• t2:{d1, d3, d6, d7}
Querying peer
• Q = {t1, t2}, result t1,t2:{d1, d4, d7}
Retrieval traffic is not scalable!
• grows with a function of D (Heaps' law)? D: collection size in no. of terms
• experimentally linear, since frequent terms are used frequently in queries
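The single-term example can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `build_global_index` and `retrieval_traffic` are hypothetical, and the data are the posting lists from the slide. It shows why traffic grows with the collection: the querying peer fetches the full posting list of every query term.

```python
from collections import defaultdict

# Local single-term indexes of the three peers in the example.
local_indexes = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}

def build_global_index(local_indexes):
    """Merge the peers' local posting lists into one list per term."""
    merged = defaultdict(list)
    for peer_index in local_indexes.values():
        for term, postings in peer_index.items():
            merged[term].extend(postings)
    return {term: sorted(set(postings)) for term, postings in merged.items()}

def retrieval_traffic(global_index, query):
    """Postings shipped to the querying peer: full lists, no truncation."""
    return sum(len(global_index.get(term, [])) for term in query)

global_index = build_global_index(local_indexes)
traffic = retrieval_traffic(global_index, ["t1", "t2"])
```

Every document added to the collection that contains t1 or t2 lengthens a list that must cross the network for this query.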
HDK-based P2P indexing
key = set of terms
[Figure: HDK-based indexing example with three peers and a querying peer]
Local index
• Peer1: t1:{d1, d2}, t2:{d1, d3}
• Peer2: t1:{d4, d5}, t2:{d6}, t3:{d5, d6}
• Peer3: t1:{d7, d8}, t2:{d8}, t3:{d7}
Global key index (DFmax = 4, posting list truncated to top-DFmax postings)
• t1:{d4, d1, d8, d5}
• t2:{d1, d3, d6, d8}
• k13 = (t1, t3): k13:{d5, d7}
Querying peer
• Q = {t1, t2, t3}, result k13, k2:{d5, d7, d1, d3, d6, d8}
Retrieval traffic is bounded by DFmax and query size!
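A minimal sketch of the HDK lookup in this example, under stated assumptions: posting lists arrive already ranked best-first, the order of the two postings dropped from t1 is invented for illustration, and `truncate` and `fetch` are hypothetical helper names. The query Q = {t1, t2, t3} is mapped to the keys (t1, t3) and t2, as on the slide.

```python
DF_MAX = 4  # value used in the example

def truncate(postings, df_max=DF_MAX):
    """Keep the top-df_max postings; the input is assumed ranked best-first."""
    return postings[:df_max]

# Global key index; a key is a set of terms.
# The tail ("d2", "d7") of the t1 list is an assumed ranking order.
global_key_index = {
    frozenset({"t1"}): truncate(["d4", "d1", "d8", "d5", "d2", "d7"]),
    frozenset({"t2"}): truncate(["d1", "d3", "d6", "d8"]),
    frozenset({"t1", "t3"}): truncate(["d5", "d7"]),
}

def fetch(query_keys):
    """Fetch the posting list of each key; return merged docs and traffic."""
    docs, traffic = [], 0
    for key in query_keys:
        postings = global_key_index.get(frozenset(key), [])
        traffic += len(postings)
        for doc in postings:
            if doc not in docs:
                docs.append(doc)
    return docs, traffic

docs, traffic = fetch([("t1", "t3"), ("t2",)])
```

Whatever the collection size, each fetched list holds at most `DF_MAX` postings, so the traffic for this query cannot exceed 2 * `DF_MAX`.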
Single-term vs. HDK-based P2P indexing
[Figure: index layout across PEER 1 ... PEER N for both schemes]
• single-term indexing: term 1 ... term M, each with its posting list; long posting lists, small vocabulary
• highly discriminative key indexing (HDKs): key 11 ... key 1i on PEER 1 through key N1 ... key Nj on PEER N, each with its posting list; short posting lists, large vocabulary
• comparable retrieval quality (extended vocabulary)
• voc. size could grow exponentially!
Keys and key filtering
Non-Discriminative Keys (NDKs)
• e.g. t1 is an NDK iff:
– t1 appears in more than DFmax collection documents
• posting lists truncated to top-DFmax documents
Highly-Discriminative Keys (HDKs)
• e.g. (t1, t2) is an HDK iff:
– t1 & t2 appear together in fewer than DFmax collection documents (discriminative w.r.t. document collection)
– t1 and t2 are each non-discriminative (redundancy filter)
– t1 and t2 occur within a window of size w (proximity filter)
– the no. of terms comprising a key is limited by smax (size filter)
• posting lists by definition contain fewer than DFmax documents
Key filtering enables scalable indexing!
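The filters can be illustrated with a small sketch for keys of size 1 and 2. It is not the authors' implementation. Assumptions: documents are given as term lists, "within a window of size w" is read as a position distance below w, and a key with document frequency exactly DFmax (a case the slide leaves open) is treated as discriminative. The function name `classify_keys` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def classify_keys(docs, df_max, w):
    """Return (ndk_terms, hdk_pairs) for keys of size 1 and 2."""
    # Document frequency of single terms.
    term_docs = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            term_docs[term].add(doc_id)
    ndk_terms = {t for t, ds in term_docs.items() if len(ds) > df_max}

    # Candidate pairs: both terms non-discriminative (redundancy filter)
    # and less than w positions apart (proximity filter, assumed reading).
    pair_docs = defaultdict(set)
    for doc_id, terms in docs.items():
        positions = [(i, t) for i, t in enumerate(terms) if t in ndk_terms]
        for (i, a), (j, b) in combinations(positions, 2):
            if a != b and j - i < w:
                pair_docs[frozenset((a, b))].add(doc_id)

    # A pair is kept as an HDK only if it is rare in the collection.
    hdk_pairs = {k for k, ds in pair_docs.items() if len(ds) <= df_max}
    return ndk_terms, hdk_pairs

docs = {
    "d1": ["a", "b", "x"],
    "d2": ["a", "y", "b"],
    "d3": ["a", "z", "b"],
    "d4": ["b", "q", "q", "q", "a"],
}
ndk_terms, hdk_pairs = classify_keys(docs, df_max=2, w=2)
```

Here `a` and `b` each occur in four documents and are NDKs, while the pair (a, b) is adjacent only in `d1` and therefore qualifies as an HDK.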
Scalability analysis (indexing)
• What is the upper bound on the index size for a very large document collection?
• D – collection size in no. of terms
• s – no. of terms comprising a key
• w – window size
• $IS_s$ – index size associated with keys of size s
• $P_{f,(s-1)}$ – probability of NDK occurrences where NDK size is (s-1)
Index size (location index) by key size:
• key size 1: $IS_1(D) \le D$
• key size 2: $IS_2(D) \le D \cdot P_{f,1}^{2} \cdot (w-1)$
• key size s: $IS_s(D) \le D \cdot P_{f,(s-1)}^{2} \cdot \binom{w-1}{s-1}$
Is $P_{f,(s-1)}$ constant?
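To get a feel for the bound, here is a small numeric sketch. It assumes the index-size formula reads $IS_s(D) \le D \cdot P_{f,(s-1)}^{2} \cdot \binom{w-1}{s-1}$ with $IS_1(D) \le D$; the relation sign is an assumed reading of the slide, and the probability value passed in is purely illustrative. The function name `index_size_bound` is hypothetical.

```python
from math import comb

def index_size_bound(D, s, w, p_f):
    """Assumed upper bound on the number of postings for keys of size s.

    D:   collection size in number of terms
    s:   number of terms in a key
    w:   window size
    p_f: probability of an NDK occurrence of size s-1 (illustrative input)
    """
    if s == 1:
        return D
    return D * p_f ** 2 * comb(w - 1, s - 1)

# Illustrative values only; p_f = 0.5 is not a measured quantity.
bound_s2 = index_size_bound(1000, 2, 20, 0.5)
bound_s3 = index_size_bound(1000, 3, 20, 0.5)
```

If `p_f` does not change with D, the bound is linear in D for any fixed s and w: doubling D doubles the bound.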
Scalability analysis (indexing)
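Zipf model: $z(r) = C \cdot r^{-a}$
[Figure: Zipf curve z(r) over rank r. The frequency thresholds Ff and Fr separate very frequent terms, frequent terms and rare terms; Fr is annotated with DFmax, with NDKs on one side and HDKs on the other]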
Scalability analysis (indexing)
• C increases for an increasing collection size, a remains const.
[Figure: Zipf curve $z(r) = C \cdot r^{-a}$ over rank r with the thresholds Ff and Fr, shown as D increases]
• Theorem: Probability $P_{f,(s-1)}$ of NDK occurrence remains constant!
$IS_s(D) \le const \cdot D$
Scalability analysis (retrieval)
• Retrieval traffic is bounded by DFmax and the number of keys a query is mapped to (constant)
$RT \le const \cdot DF_{max}$
Scalability theoretically guaranteed, but what are the constants? Experiments!
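A rough sketch of what the constant could look like, under a worst-case assumption that is not stated on the slide: every subset of the query terms with at most smax terms is an indexed key, and each key returns a full list of DFmax postings. The function names `max_keys` and `retrieval_traffic_bound` are hypothetical.

```python
from math import comb

def max_keys(query_len, s_max):
    """Worst-case number of keys: all term subsets of size 1..s_max."""
    largest = min(query_len, s_max)
    return sum(comb(query_len, s) for s in range(1, largest + 1))

def retrieval_traffic_bound(query_len, s_max, df_max):
    """Worst-case postings fetched for one query; independent of D."""
    return max_keys(query_len, s_max) * df_max

# A 3-term query with smax = 3 and DFmax = 500.
bound = retrieval_traffic_bound(3, 3, 500)
```

The collection size does not appear anywhere in this bound, which is the point of the argument; a real query will usually map to far fewer keys than the worst case counted here.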
Experiment
• System fully implemented in Java (available on request)
• Document collection
– 20,000, 40,000, ..., 140,000 documents from Wikipedia (www.wikipedia.org)
• Query log
– Wikipedia query log for 2 months (08/2004 and 09/2004)
– 3,000 randomly chosen queries from 2,000,000 unique queries with more than 20 hits
• No. of peers: 4, 8, ..., 28
– PCs running RedHat Linux with 1GB memory
– 100 Mbit Ethernet
– Each peer indexes 5,000 documents
• DFmax = 400 or 500, smax = 3, w = 20
Indexing costs
[Figures: "Average index size per peer" and "Average indexing traffic per peer". Axes: #Postings (0 to 1.4E+07) vs. #Documents (20,000 to 140,000); series: ST, DFmax=500, DFmax=400]
HDK vs single-term (ST) indexing
• experimentally: HDK / ST = 13.9 (for 140,000 documents)
• theoretically: HDK / ST = 40.7 (overestimated upper bound!)
Retrieval costs
Retrieval traffic per query (Wikipedia query log)
• remains constant with a growing collection size for the HDK approach (linear for single-term)
[Figure: #Postings (0 to 2.5E+04) vs. #Documents (60,000 to 140,000); series: ST, HDK DFmax=500, HDK DFmax=400]
Estimated total generated traffic
[Figure: #Postings (0 to 2.5E+14) vs. #Documents (0 to 1E+09); series: HDK, single-term]
Assumptions
• monthly indexing
• no. of queries per month: 1.5 * 10^6 (true no. of queries from the Wikipedia log, conservative estimate)
For 1 billion documents, HDK generates 42 times less overall traffic.
Retrieval performance
Overlap on top 20 documents
• comparable performance of the HDK-based approach to the centralized single-term engine with BM25
[Figure: Overlap [%] (0 to 100) vs. #Documents (60,000 to 140,000); series: ST, HDK DFmax=500, HDK DFmax=400]
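One plausible way to compute such an overlap figure is sketched below. The slides do not spell out the exact definition, so this is an assumption: the share of the reference engine's top-k documents that also appear in the candidate engine's top-k. The function name `top_k_overlap` is hypothetical.

```python
def top_k_overlap(reference, candidate, k=20):
    """Percentage of the reference top-k also present in the candidate top-k."""
    ref_top = set(reference[:k])
    if not ref_top:
        return 0.0
    cand_top = set(candidate[:k])
    return 100.0 * len(ref_top & cand_top) / len(ref_top)

# Made-up rankings: the candidate shares 10 of the reference's top 20.
reference = [f"d{i}" for i in range(20)]
candidate = [f"d{i}" for i in range(10)] + [f"x{i}" for i in range(10)]
overlap = top_k_overlap(reference, candidate)
```

The measure ignores rank order inside the top k, so two engines returning the same 20 documents in a different order score 100%.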
Conclusion
• Novel indexing model based on indexing terms and term sets;
• Theoretical scalability model proves the proposed solution scales to large networks in terms of generated traffic both for indexing and retrieval;
• Running P2P prototype that exhibits retrieval performance fully comparable to a centralized term-based retrieval system;
• Associated resource requirements (storage, bandwidth consumption) grow in a scalable way as shown by experiments.
Ongoing work
• Further reduce the number of indexing keys using query-driven indexing to produce and store only profitable keys for query answering
Acknowledgement
• The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP 6 STREP project ALVIS (002068)