www.tu-darmstadt.de www.dfg.de www.quap2p.de

Full-Text Search in P2P Networks

Christof Leng

Databases and Distributed Systems Group

TU Darmstadt

DFG Research Group QuaP2P, Technische Universität Darmstadt

Content

Short Intro to full-text search
Full-Text search on DHTs
Performance Comparison
Conclusion / Outlook


What is full-text search?

Searching for documents containing all of a list of specified words
  Search for “QuaP2P” “Darmstadt” “Research”

Very common operation
  Google
  Filesharing
  Wikis
  Source Code
  Document / Knowledge Management
  …

Can be extended to phrase search
  Search for “TU Darmstadt” “Christof Leng”


Inverted Index

Full-text search is normally solved with inverted indexes
Query result is intersection of all searched word entries
Stemming can reduce the number of word entries

doc1: “Similarity searches accelerate P2P downloads by 30-70 percent.”
doc2: “New P2P system could provide speed increase.”
doc3: “I fail to see how this will make downloads faster.”

Word         Documents
30           doc1
70           doc1
accelerate   doc1
by           doc1
could        doc2
downloads    doc1, doc3
fail         doc3
faster       doc3
how          doc3
i            doc3
increase     doc2
make         doc3
new          doc2
p2p          doc1, doc2
percent      doc1
provide      doc2
searches     doc1
see          doc3
similarity   doc1
speed        doc2
system       doc2
this         doc3
to           doc3
will         doc3
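The index on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function names (`tokenize`, `build_index`, `search`) are made up for illustration, the document numbering follows the word entries listed above, and no stemming is applied.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc2": "New P2P system could provide speed increase.",
    "doc3": "I fail to see how this will make downloads faster.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, words):
    """Return the documents that contain all of the given words."""
    postings = [index.get(word.lower(), set()) for word in words]
    if not postings:
        return set()
    return set.intersection(*postings)

INDEX = build_index(DOCS)
print(sorted(search(INDEX, ["p2p", "downloads"])))  # ['doc1']
```

A query for a word that is not in the index yields an empty result, because the intersection with an empty entry is empty.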


Overlay Types and Full-Text Search

Peer-to-Peer overlay type    Full-text search approach
Centralized                  Inverted index on central server
Pure / Hierarchical          Inverted index on each (super-)node
Structured / DHT             Distributed inverted index (challenge)


Naïve Approach

Map inverted index to DHT
Key lookup for every word
Intersect result lists at client

Pro:
  Simple
  Short latency

Con:
  Result lists may be extremely large!
  Result list sizes may vary extremely!

Search for “QuaP2P” “Darmstadt” “Research”

[Figure: nodes responsible for “QuaP2P”, “Darmstadt” and “Research”]
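The naïve approach can be sketched as follows. `ToyDHT` is a hypothetical in-process stand-in for a real DHT (keys are simply hashed onto a fixed list of nodes), and the result list sizes are invented to illustrate the drawback named above; none of this comes from an actual system.

```python
import hashlib

class ToyDHT:
    """Hypothetical stand-in for a DHT: every key lives on exactly one node."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key to pick the responsible node.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def put(self, word, doc_id):
        self._node_for(word).setdefault(word, set()).add(doc_id)

    def get(self, word):
        # Returns a copy, as if the list had been sent over the network.
        return set(self._node_for(word).get(word, set()))

def naive_search(dht, words):
    """Fetch every result list to the client, then intersect locally."""
    lists = [dht.get(word) for word in words]
    transferred = sum(len(entries) for entries in lists)
    result = set.intersection(*lists) if lists else set()
    return result, transferred

dht = ToyDHT(num_nodes=16)
for doc_id in range(1000):          # assumed: a very common word
    dht.put("research", doc_id)
for doc_id in range(50):            # assumed: a moderately common word
    dht.put("darmstadt", doc_id)
for doc_id in (0, 1, 5000):         # assumed: a rare word
    dht.put("quap2p", doc_id)

result, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
print(sorted(result), transferred)  # [0, 1] 1053
```

In this made-up example 1053 list entries travel to the client to produce two hits, almost all of them from the most common word.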

Image from http://en.wikipedia.org/wiki/Zipf's_law


Zipf Distributions in Natural Text

Some words are extremely common
Most words are extremely uncommon
Largest word frequency is proportional to number of distinct words
Avoid transferring result lists before intersection!

[Figure: word occurrences plotted against rank]
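To make the shape of the distribution concrete, here is a small sketch of an idealized Zipf law. It assumes an exponent of exactly 1 and that the rarest word occurs once, so the word of rank r occurs N / r times for N distinct words; real text only approximates this, and the vocabulary size is an arbitrary choice.

```python
def zipf_frequencies(num_distinct_words):
    """Idealized Zipf law (exponent 1): the word of rank r occurs N / r times."""
    n = num_distinct_words
    return [n / rank for rank in range(1, n + 1)]

freqs = zipf_frequencies(10_000)

# Under these assumptions the most frequent word occurs N times,
# i.e. proportionally to the number of distinct words.
print(freqs[0])   # 10000.0
print(freqs[-1])  # 1.0

# Share of all word occurrences that belongs to the 100 most common words.
top_share = sum(freqs[:100]) / sum(freqs)
print(round(top_share, 2))
```

Under this model the top 1% of the words account for roughly half of all occurrences, which is why their result lists dominate the traffic.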


Intersecting on the way

Query least common word first
Forward result list to next word
Intersect on the way

Pro:
  Reduces traffic

Con:
  High latency
  Knowledge about word frequencies required
  Search for “the” and “who” (7.2 and 2.4 billion hits on Google, respectively)

[Figure: nodes responsible for “QuaP2P”, “Darmstadt” and “Research”]
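The three steps above can be sketched like this. The `postings` dictionary is a hypothetical stand-in for result lists that would live on different nodes, and sorting by list length assumes exactly the global knowledge about word frequencies that the slide lists as a drawback. The numbers are invented.

```python
def chained_search(postings, words):
    """Visit the words from rarest to most common, intersecting at each hop."""
    if not words:
        return set(), [], 0
    # Assumption: the list sizes (word frequencies) are known in advance.
    order = sorted(words, key=lambda word: len(postings.get(word, set())))
    current = set(postings.get(order[0], set()))
    transferred = 0
    for word in order[1:]:
        transferred += len(current)            # forward the list to the next node
        current &= postings.get(word, set())   # intersect there
    transferred += len(current)                # return the final result to the client
    return current, order, transferred

postings = {
    "research": set(range(1000)),   # assumed: a very common word
    "darmstadt": set(range(50)),    # assumed: a moderately common word
    "quap2p": {0, 1, 5000},         # assumed: a rare word
}

result, order, transferred = chained_search(
    postings, ["research", "darmstadt", "quap2p"])
print(order)                        # ['quap2p', 'darmstadt', 'research']
print(sorted(result), transferred)  # [0, 1] 7
```

Only 7 entries are transferred here, but the hops happen one after another instead of in parallel, hence the higher latency. If every query word is common, the first list is already large and little is saved.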


Using Bloom Filters

Bloom Filters reduce result list size
Forward Bloom Filters and return result list recursively

Pro:
  Reduces traffic even more (up to factor 50x)

Con:
  Even higher latency
  Getting complicated

[Figure: nodes responsible for “QuaP2P”, “Darmstadt” and “Research”]
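For two words, the Bloom filter exchange might look like the sketch below. The `BloomFilter` class, its parameters (`num_bits`, `num_hashes`) and the SHA-1 based hashing are illustrative assumptions, not the scheme of any particular system. Node A sends a filter instead of its list, node B answers with the entries that might match, and node A removes the false positives.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    """Intersect two result lists that live on different nodes."""
    # Node A: summarize its list in a fixed-size filter and send that.
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in list_a:
        bloom.add(doc_id)
    # Node B: return only entries that may be in A's list.
    # This is a superset of the true intersection (false positives possible).
    candidates = {doc_id for doc_id in list_b if doc_id in bloom}
    # Node A: remove the false positives using its own list.
    return candidates & set(list_a), candidates

list_a = set(range(0, 50))
list_b = set(range(40, 1000))
result, candidates = bloom_intersect(list_a, list_b)
print(sorted(result))  # [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
```

The filter has a fixed size of `num_bits` regardless of how long the list is, which is where the traffic saving comes from. The price is the extra round trip and the tuning of the filter size against the false positive rate.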

Image from http://www.useit.com/alertbox/traffic_logs.html


Zipf Distributions in Query Terms, too

Query popularity obeys Zipf’s Law (déjà vu!)

This puts high load on nodes with the most popular keys

Even worse, this load scales linearly with the network size and user activity

The responsible nodes are randomly assigned (could be a modem user)

Hotspots will occur


Caching and Precomputation

Caching
  Keep lists received for intersection
  Keep answers to popular queries
  Traffic reduction: 38%
  But: How to ensure coherence?

Precomputation
  Inverted index for pairs or tuples of words
  Only feasible for the most popular words (but most effective there anyway)
  Traffic reduction: 50%
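Precomputation for word pairs can be sketched as follows. The function names and the `top_k` cut-off are hypothetical, and the example index is invented. Only pairs of the most popular words are precomputed, as suggested above, and every other query falls back to a normal intersection.

```python
from itertools import combinations

def build_pair_index(index, top_k):
    """Precompute the intersections for all pairs of the top_k most common words."""
    popular = sorted(index, key=lambda word: len(index[word]), reverse=True)[:top_k]
    return {
        frozenset(pair): index[pair[0]] & index[pair[1]]
        for pair in combinations(popular, 2)
    }

def search_pair(index, pair_index, word1, word2):
    """Answer a two-word query, using the precomputed entry if there is one."""
    key = frozenset((word1, word2))
    if key in pair_index:
        return pair_index[key], True   # no result list had to be transferred
    return index.get(word1, set()) & index.get(word2, set()), False

index = {
    "the": set(range(1000)),           # assumed: a very common word
    "who": set(range(0, 1000, 2)),     # assumed: a very common word
    "quap2p": {3, 4},                  # assumed: a rare word
}
pair_index = build_pair_index(index, top_k=2)

hits, precomputed = search_pair(index, pair_index, "who", "the")
print(len(hits), precomputed)          # 500 True
hits, precomputed = search_pair(index, pair_index, "the", "quap2p")
print(sorted(hits), precomputed)       # [3, 4] False
```

The number of pairs grows quadratically with `top_k`, which is one reason to restrict precomputation to the most popular words.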


Further Optimizations

Compression of result lists
  Adaptive Set Intersection
  Gap Compression
  Clustering of keys

Incremental Results
  Do not return all results at once
  Should be used in conjunction with ranking algorithm
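Gap compression stores the differences between consecutive document IDs of a sorted result list instead of the IDs themselves. The sketch below combines it with a variable-length byte encoding (7 payload bits per byte); that pairing is a common choice but an assumption here, as are the function names and the non-negative integer document IDs.

```python
def encode_gaps(doc_ids):
    """Encode sorted, non-negative doc IDs as variable-length gaps."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits per byte, low bits first; the high bit marks "more follows".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_gaps(data):
    """Inverse of encode_gaps: rebuild the sorted list of doc IDs."""
    doc_ids = []
    current, gap, shift = 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # this gap continues in the next byte
        else:
            current += gap      # gap complete: add it to the running ID
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

ids = [1000000, 1000003, 1000010, 1000500]
encoded = encode_gaps(ids)
print(len(encoded))                 # 7
print(decode_gaps(encoded) == ids)  # True
```

The four IDs fit into 7 bytes instead of the 16 bytes that fixed 32-bit integers would need. The gain grows when the IDs in a list are close together, which is what clustering of keys aims for.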

Image from Yang et al.: Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems


Comparison of different approaches

Yang et al. compared
  DHT with Bloom Filters
  Supernode with exhaustive flooding
  Unstructured Random Walk without replication

Network size 1000
Random data set from WWW
All approaches have strengths


Feasibility of P2P Web Search Engine

Li et al. calculated the bandwidth usage of a P2P-based web search engine
  3 billion documents (10KB each)
  60,000 peers
  Basic DHT was 100x worse than basic Gnutella
  DHT optimizations (e.g. Bloom Filters) made it competitive
  No index creation or maintenance cost included (60TB)
  No replica maintenance cost included


Conclusion

Distributed Inverted Indexes are challenging
Implementation requires a lot of tricks
Performance is not outstanding
No comparison to state-of-the-art unstructured systems available
Maybe even more tricks from information retrieval research will help
Modeling the correct workload is really important for system design


Outlook

Examine robustness of full-text search under Zipf query workloads
Implement DHT full-text search in simulator
Compare state-of-the-art unstructured and structured full-text search overlays
Improve consistency and coherence in DHT full-text search systems
Implement wiki and source code management with full-text search for Scenario B
Phrase search is even more challenging…


Recommended Reading

Performance Comparison
  Li et al. On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS 2003.
  Yang et al. Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems. INFOCOM 2006.

DHT Full-Text Search
  P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. IMC 2003.
  O. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. Msc. Thesis, MIT, 2002.

Workload Modeling
  Breslau et al. Web Caching and Zipf-like Distributions: Evidence and Implications. INFOCOM 1999.
  Gummadi et al. Measurement, Modeling and Analysis of a Peer-to-Peer File-Sharing Workload. SOSP 2003.
