www.tu-darmstadt.de www.dfg.de www.quap2p.de
Full-Text Search in P2P Networks
Christof Leng
Databases and Distributed Systems Group
TU Darmstadt
DFG Research Group QuaP2P, Technische Universität Darmstadt
Content
Short Intro to full-text search
Full-Text search on DHTs
Performance Comparison
Conclusion / Outlook
What is full-text search?
Searching for documents containing all of a list of specified words
  Search for “QuaP2P” “Darmstadt” “Research”
Very common operation
  Google
  Filesharing
  Wikis
  Source Code
  Document / Knowledge Management
  …
Can be extended to phrase search
  Search for “TU Darmstadt” “Christof Leng”
Inverted Index
Full-text search is normally solved with inverted indexes
Query result is the intersection of the entries for all searched words
Stemming can reduce the number of word entries
Example documents:
doc1: “Similarity searches accelerate P2P downloads by 30-70 percent.”
doc2: “New P2P system could provide speed increase.”
doc3: “I fail to see how this will make downloads faster.”
Resulting inverted index (word, then the documents containing it):
30 doc1
70 doc1
accelerate doc1
by doc1
could doc2
downloads doc1, doc3
fail doc3
faster doc3
how doc3
i doc3
increase doc2
make doc3
new doc2
p2p doc1, doc2
percent doc1
provide doc2
searches doc1
see doc3
similarity doc1
speed doc2
system doc2
this doc3
to doc3
will doc3
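The index above can be reproduced with a few lines of Python. This is a minimal sketch, not the speaker's implementation: the helper names `tokenize`, `build_index`, and `search` are made up for illustration, the tokenizer simply splits on non-alphanumeric characters, no stemming is applied, and the document labels follow the index entries (doc1 is the "Similarity searches" sentence).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, words):
    """Return the ids of the documents that contain all given words."""
    postings = [index.get(w.lower(), set()) for w in words]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    "doc1": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc2": "New P2P system could provide speed increase.",
    "doc3": "I fail to see how this will make downloads faster.",
}
index = build_index(docs)
print(sorted(search(index, ["p2p", "downloads"])))  # ['doc1']
```

A query for "p2p" and "downloads" intersects the entries {doc1, doc2} and {doc1, doc3}, which leaves doc1 only.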
Overlay Types and Full-Text Search
Peer-to-Peer overlay types and the location of the inverted index:
  Centralized: inverted index on central server
  Pure / Hierarchical: inverted index on each (super-)node
  Structured / DHT: distributed inverted index (the challenge)
Naïve Approach
Map inverted index to DHT
Key lookup for every word
Intersect result lists at client
Pro:
  Simple
  Short latency
Con:
  Result lists may be extremely large!
  Result list sizes may vary extremely!
Example: search for “QuaP2P” “Darmstadt” “Research”
  [Diagram: nodes responsible for QuaP2P, Darmstadt, and Research]
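The naïve approach can be sketched with a toy stand-in for a DHT. Everything here is an assumption for illustration: `ToyDHT` is a hypothetical class that hashes each word to one of a few in-memory nodes, and the document counts per word are invented. The point is only to show how many list entries reach the client before any intersection happens.

```python
import hashlib

class ToyDHT:
    """Stand-in for a DHT: each key lives on the node its hash maps to."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def put(self, word, doc_id):
        self._node_for(word).setdefault(word, set()).add(doc_id)

    def get(self, word):
        return set(self._node_for(word).get(word, set()))

def naive_search(dht, words):
    """Fetch every result list to the client, then intersect locally.

    Returns the hits and the number of list entries transferred.
    """
    lists = [dht.get(w) for w in words]
    transferred = sum(len(entries) for entries in lists)
    hits = set.intersection(*lists) if lists else set()
    return hits, transferred

dht = ToyDHT(num_nodes=8)
for doc_id in range(1000):          # invented: a very common word
    dht.put("research", doc_id)
for doc_id in range(0, 1000, 20):   # invented: a moderately common word
    dht.put("darmstadt", doc_id)
for doc_id in (0, 20, 7):           # invented: a rare word
    dht.put("quap2p", doc_id)

hits, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
print(sorted(hits), transferred)    # [0, 20] 1053
```

In this made-up data set the client receives 1053 entries to find two hits, which is the traffic problem the "Con" items describe.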
Image from http://en.wikipedia.org/wiki/Zipf's_law
Zipf Distributions in Natural Text
Some words are extremely common
Most words are extremely uncommon
Largest word frequency is proportional to number of distinct words
Avoid transferring result lists before intersection!
[Figure: word occurrences plotted against rank]
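To get a feeling for how skewed such a distribution is, the following sketch generates idealized Zipf counts. It is an illustration under stated assumptions, not measured data: the function `zipf_counts` is hypothetical, the exponent is fixed at 1, and the count of the top word is set equal to the number of distinct words.

```python
def zipf_counts(num_words, top_count, exponent=1.0):
    """Idealized occurrences by rank: count(rank) = top_count / rank**exponent."""
    return [top_count / rank ** exponent for rank in range(1, num_words + 1)]

counts = zipf_counts(num_words=10000, top_count=10000)
total = sum(counts)

# Share of all word occurrences that belongs to the ten most common words.
top10_share = sum(counts[:10]) / total

# Number of distinct words that occur fewer than twice.
rare_words = sum(1 for c in counts if c < 2)

print(f"top 10 words: {top10_share:.0%} of all occurrences")
print(f"words occurring fewer than twice: {rare_words} of {len(counts)}")
```

Under these assumptions ten words account for roughly 30% of all occurrences, while half of the vocabulary occurs fewer than twice. Result lists in an inverted index inherit the same skew.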
Intersecting on the way
Query least common word first
Forward result list to next word
Intersect on the way
Pro:
  Reduces traffic
Con:
  High latency
  Knowledge about word frequencies required
  Search for “the” and “who” (7.2 and 2.4 billion hits on Google, respectively)
Example: search for “QuaP2P” “Darmstadt” “Research”
  [Diagram: query forwarded through the nodes for QuaP2P, Darmstadt, and Research]
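The forwarding scheme can be sketched as follows. This is a simplified model rather than a real overlay: `chained_search` is a hypothetical helper, the index is a plain dictionary with invented list sizes, and the list lengths are assumed to be known up front (the "knowledge about word frequencies" the slide asks for).

```python
def chained_search(index, words):
    """Visit the words from rarest to most common, forwarding only the
    running intersection. Returns the hits and the entries forwarded."""
    if not words:
        return set(), 0
    order = sorted(words, key=lambda w: len(index.get(w, ())))
    current = set(index.get(order[0], ()))
    forwarded = 0
    for word in order[1:]:
        forwarded += len(current)           # list sent on to the next node
        current &= index.get(word, set())   # intersect at that node
    forwarded += len(current)               # final result sent to the client
    return current, forwarded

# Invented list sizes: one very common, one moderately common, one rare word.
index = {
    "research": set(range(1000)),
    "darmstadt": set(range(0, 1000, 20)),
    "quap2p": {0, 20, 7},
}
words = ["research", "darmstadt", "quap2p"]
hits, forwarded = chained_search(index, words)
naive = sum(len(index[w]) for w in words)   # all lists shipped to the client
print(sorted(hits), forwarded, naive)       # [0, 20] 7 1053
```

The hops happen one after another instead of in parallel, which is where the higher latency comes from.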
Using Bloom Filters
Bloom Filters reduce result list size
Forward Bloom Filters and return result list recursively
Pro:
  Reduces traffic even more (by up to a factor of 50)
Con:
  Even higher latency
  Getting complicated
Example: search for “QuaP2P” “Darmstadt” “Research”
  [Diagram: Bloom Filters forwarded through the nodes for QuaP2P, Darmstadt, and Research]
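A two-node exchange with a Bloom filter might look like the sketch below. It is a toy version under assumptions: the `BloomFilter` class, its hashing scheme, and the parameters `num_bits` and `num_hashes` are illustrative choices, and the two lists are invented. Node A sends a compact filter instead of its list, node B answers with the entries that pass the filter, and node A removes the false positives.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter that uses a Python int as its bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    """Intersect two remote lists by exchanging a filter and candidates."""
    # Node A: summarize its list in a filter and send only the filter.
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in list_a:
        bloom.add(doc_id)
    # Node B: return only the entries that might be in A's list.
    candidates = {doc_id for doc_id in list_b if doc_id in bloom}
    # Node A: drop false positives by checking against its real list.
    return candidates & set(list_a), len(candidates)

list_a = set(range(0, 1000, 20))   # invented: 50 entries
list_b = set(range(0, 1000, 3))    # invented: 334 entries
hits, sent_back = bloom_intersect(list_a, list_b)
print(len(hits), sent_back, len(list_b))
```

A Bloom filter never produces false negatives, so the final result is exact. The number of candidates sent back depends on the filter size, which is the knob that trades filter traffic against false positives.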
Image from http://www.useit.com/alertbox/traffic_logs.html
Zipf Distributions in Query Terms, too
Query popularity obeys Zipf’s Law (déjà vu!)
This puts high load on nodes with the most popular keys
Even worse, this load scales linearly with the network size and user activity
The responsible nodes are randomly assigned (could be a modem user)
Hotspots will occur
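The hotspot effect can be illustrated with a small calculation. The sketch assumes an idealized Zipf query distribution with exponent 1, hypothetical term names `term1` to `term1000`, and a simple hash-modulo placement instead of a real DHT routing scheme.

```python
import hashlib

def node_for(key, num_nodes):
    """Pick the responsible node by hashing the key (toy placement)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def query_load(num_terms, num_nodes):
    """Fraction of all queries each node receives when term popularity
    follows an idealized Zipf law with weight 1 / rank."""
    weights = [1.0 / rank for rank in range(1, num_terms + 1)]
    total = sum(weights)
    load = [0.0] * num_nodes
    for rank, weight in enumerate(weights, start=1):
        load[node_for(f"term{rank}", num_nodes)] += weight / total
    return load

load = query_load(num_terms=1000, num_nodes=100)
fair_share = 1 / len(load)
print(f"busiest node: {max(load):.1%} of all queries")
print(f"fair share:   {fair_share:.1%}")
```

With 1000 terms the most popular term alone draws more than 13% of the queries, so whichever node happens to be responsible for it carries over ten times its fair share of 1%.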
Caching and Precomputation
Caching
  Keep lists received for intersection
  Keep answers to popular queries
  Traffic reduction: 38%
  But: How to ensure coherence?
Precomputation
  Inverted index for pairs or tuples of words
  Only feasible for the most popular words (but most effective there anyway)
  Traffic reduction: 50%
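Precomputation for word pairs can be sketched like this. The helpers `precompute_pairs` and `search_pair` and the tiny index are assumptions for illustration; how the popular words are chosen is left open here.

```python
from itertools import combinations

def precompute_pairs(index, popular_words):
    """Build an extra index keyed by word pairs, for popular words only."""
    pair_index = {}
    for a, b in combinations(sorted(popular_words), 2):
        pair_index[(a, b)] = index.get(a, set()) & index.get(b, set())
    return pair_index

def search_pair(index, pair_index, a, b):
    """Answer a two-word query, using the precomputed entry if one exists.

    Returns the hits and whether the precomputed index was used.
    """
    key = tuple(sorted((a, b)))
    if key in pair_index:
        return pair_index[key], True
    return index.get(a, set()) & index.get(b, set()), False

# Invented index: two popular words and one rare word.
index = {
    "p2p": {1, 2, 3, 4},
    "search": {2, 3, 4, 5},
    "quap2p": {3},
}
pair_index = precompute_pairs(index, ["p2p", "search"])
print(search_pair(index, pair_index, "search", "p2p"))   # ({2, 3, 4}, True)
print(search_pair(index, pair_index, "quap2p", "p2p"))   # ({3}, False)
```

The pair index grows quadratically with the number of words it covers, which matches the slide's remark that this is only feasible for the most popular words.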
Further Optimizations
Compression of result lists
  Adaptive Set Intersection
  Gap Compression
  Clustering of keys
Incremental Results
  Do not return all results at once
  Should be used in conjunction with ranking algorithm
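Gap compression, one of the techniques listed above, stores the differences between consecutive sorted document ids instead of the ids themselves. The sketch below is one common way to do it, assuming distinct non-negative integer ids and a 7-bit variable-length byte encoding; the function names are made up, and the slides do not say which encoding the cited systems use.

```python
def encode_gaps(doc_ids):
    """Encode sorted integer ids as variable-length gaps (7 bits per byte)."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)                      # last byte, continuation bit clear
    return bytes(out)

def decode_gaps(data):
    """Reverse encode_gaps and return the sorted list of ids."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                       # more bytes belong to this gap
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

ids = [3, 7, 8, 300, 100000]
encoded = encode_gaps(ids)
print(len(encoded), decode_gaps(encoded))    # 8 [3, 7, 8, 300, 100000]
```

Small gaps fit into a single byte, so dense result lists, which are exactly the large ones, compress best.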
Image from Yang et al.: Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems
Comparison of different approaches
Yang et al. compared
  DHT with Bloom Filters
  Supernode with exhaustive flooding
  Unstructured Random Walk w/o replication
Network size 1000
Random data set from WWW
All approaches have strengths
Feasibility of P2P Web Search Engine
Li et al. calculated the bandwidth usage of a P2P-based web search engine
  3 billion documents (10KB each)
  60,000 peers
  Basic DHT was 100x worse than basic Gnutella
  DHT optimizations (e.g. Bloom Filters) made it competitive
  No index creation or maintenance cost included (60TB)
  No replica maintenance cost included
Conclusion
Distributed Inverted Indexes are challenging
Implementation requires a lot of tricks
Performance is not outstanding
No comparison to state-of-the-art unstructured systems available
Maybe even more tricks from information retrieval research will help
Modeling the correct workload is really important for system design
Outlook
Examine robustness of full-text search under Zipf query workloads
Implement DHT full-text search in simulator
Compare state-of-the-art unstructured and structured full-text search overlays
Improve consistency and coherence in DHT full-text search systems
Implement wiki and source code management with full-text search for Scenario B
Phrase search is even more challenging…
Recommended Reading
Performance Comparison
  Li et al. On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS 2003.
  Yang et al. Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems. INFOCOM 2006.
DHT Full-Text Search
  P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. IMC 2003.
  O. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. M.Sc. Thesis, MIT, 2002.
Workload Modeling
  Breslau et al. Web Caching and Zipf-like Distributions: Evidence and Implications. INFOCOM 1999.
  Gummadi et al. Measurement, Modeling and Analysis of a Peer-to-Peer File-Sharing Workload. SOSP 2003.