Pub/Sub Functionality in IR Environments using Structured Overlay Networks
Christos Tryfonopoulos, Stratos Idreos and Manolis Koubarakis
Intelligent Systems Laboratory, Dept. of Electronic and Computer Engineering, Technical University of Crete
http://www.intelligence.tuc.gr/
Outline
Motivation
Related work
Background
The DHTrie protocols
Protocol presentation
Experimental evaluation
Open problems
Motivation
Publish/subscribe scenario
[Figure: the publish/subscribe scenario. An information provider peer issues publications into a network of peers; peers post continuous queries; a matching publication produces a notification at the subscriber, who may then download the resource.]
Related Work
Information Retrieval (in P2P networks)
Unstructured P2P networks: PlanetP, PIRS, [KJ04], [YF05], Peerware
Hierarchical P2P networks: [LC03], [LC05], P2P-DIET, LibraRing, Odissea
Structured P2P networks: pSearch, Arpeggio, Minerva, Galanx
Related Work
Information Filtering
Centralised: InRoute, SIFT, DIAS
Hierarchical P2P networks: P2P-DIET
Structured P2P networks: pFilter
Content-based pub/sub on top of structured P2P networks: Meghdoot, Hermes, [TBF+03], [TA04]
Background: data model
Publication: a set of attribute-value pairs with at most one value per attribute.
Example:
( AUTHOR = “Manolis Koubarakis”, TITLE = “Peer-to-Peer Publish/Subscribe Networks with
Languages from Information Retrieval”, ABSTRACT = “We study …” )
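The data model above maps directly onto a dictionary. A minimal Python sketch (illustrative only; the names and the validity check are not from the talk):

```python
# Minimal sketch (illustrative, not the system's code): a publication is a
# set of attribute-value pairs with at most one value per attribute, which
# a dict models directly.
publication = {
    "AUTHOR": "Manolis Koubarakis",
    "TITLE": "Peer-to-Peer Publish/Subscribe Networks with "
             "Languages from Information Retrieval",
    "ABSTRACT": "We study ...",
}

def is_valid_publication(pub):
    """Check that every attribute and every value is a (single) string."""
    return all(isinstance(a, str) and isinstance(v, str)
               for a, v in pub.items())

print(is_valid_publication(publication))  # True
```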
Background: query language
Interpret text values under the Boolean or vector space model, depending on their use in queries.
QL: conjunctions of attribute-operator-value atomic formulas (atomic queries). Example:
( AUTHOR = "John Brown" ) ∧ ( TITLE ⊒ (algorithms ≺[0,2] (complexity ∧ filtering)) ) ∧ ( ABSTRACT ~0.6 "Information alert in structured P2P ..." )
Extendable with other data types and richer structure.
Distributed Hash Tables (DHTs)
Second generation structured overlay networks.
Created to solve the object location problem in a distributed and dynamic network of nodes: let K be some data item. Find K!
Distributed version of the hash table data structure.
Main operations:
Put: given a key (for a data item), map the key onto a node.
Get: find the location of a data item with a given key.
Chord
Node IDs: m bits (e.g., produced using SHA-1 on IPs).
Nodes are organised in a circle in the ID space.
Keys are hashed using the same hash function and assigned to the successor node on the ID circle.
For correct routing, nodes need only know their successor. For better routing, Chord maintains a routing table with m entries (fingers) pointing to nodes spaced exponentially in the ID space.
We can locate a node in O(log N) hops, where N is the number of nodes.
The Chord Ring
[Figure: a Chord ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56 and keys K10, K24, K30, K42, K54, each stored at its successor node. N8's finger table: N8+1→N14, N8+2→N14, N8+4→N14, N8+8→N21, N8+16→N32, N8+32→N42. A lookup(54) is routed around the ring to N56.]
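The successor rule and the finger table can be modelled on the slide's 6-bit ring. A toy sketch (illustrative names, not Chord's API):

```python
# Toy model (illustrative) of the Chord ring on the slide: 6-bit ids,
# keys assigned to their successor node, fingers spaced exponentially.
M = 6                                            # id space: 2^6 = 64 ids
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # the ring of the figure

def successor(key):
    """The node responsible for a key: the first node clockwise from it."""
    for n in sorted(NODES):
        if n >= key:
            return n
    return min(NODES)                            # wrap around the ring

def finger_table(node):
    """Finger i points to successor(node + 2^i mod 2^M)."""
    return [successor((node + 2 ** i) % 2 ** M) for i in range(M)]

print(successor(54))    # 56: key K54 is stored at node N56
print(finger_table(8))  # [14, 14, 14, 21, 32, 42], as on the slide
```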
The DHTrie protocols
Publish/subscribe scenario revisited
[Figure: the publish/subscribe scenario diagram again, with publications, continuous queries, notifications and downloads flowing through the peer network.]
The DHTrie protocols
Basic idea: index and store subscriptions in the DHT; make sure publications meet subscriptions.
Two levels of indexing:
Global: use an extension of a DHT.
Local: use what is appropriate for your subscription language; in our case a trie-like structure (BestFitTrie), and other bits.
The DHTrie protocols (cont’d)
Subscribing with Boolean atomic queries
Assume a query q of the form:
(A_1 = s_1) ∧ … ∧ (A_m = s_m) ∧ (A_{m+1} ⊒ wp_{m+1}) ∧ … ∧ (A_n ⊒ wp_n)
1. Select a single word w contained in any of the s_i or wp_j.
2. Index the query at the node responsible for id = H(w).
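The two steps above can be sketched as follows; H uses SHA-1 as in Chord, while `lookup` and `index_query` are stand-ins for the DHT's routing and storage machinery, not the system's actual API:

```python
# Sketch (illustrative) of Boolean query indexing: pick one word of the
# query, hash it into the id space, and store the query at the node
# responsible for H(w).
import hashlib

def H(word, id_space=2 ** 32):
    """Hash a word into the DHT id space (SHA-1, as in Chord)."""
    return int(hashlib.sha1(word.encode()).hexdigest(), 16) % id_space

def subscribe_boolean(query, query_words, lookup, index_query):
    """query_words: words occurring in the s_i / wp_j of the query.
    lookup/index_query stand in for the DHT's get/put machinery."""
    w = min(query_words)          # any deterministic single-word choice
    node = lookup(H(w))           # reached in O(log N) Chord hops
    index_query(node, query)
    return w
```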
The DHTrie protocols (cont’d)
Subscribing with vector space atomic queries
Assume a query q of the form:
(A_1 ~_{k_1} s_1) ∧ … ∧ (A_n ~_{k_n} s_n)
1. Construct D_1, …, D_n, the sets of distinct words in s_1, …, s_n.
2. Construct the list L = { H(w_j) : w_j ∈ D_1 ∪ … ∪ D_n }.
3. Remove duplicates from L.
4. Index the query at all nodes responsible for an id = H(w_j).
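Steps 1 to 4 can be sketched similarly; here the query must reach every node responsible for one of its distinct words (again with stand-in `lookup`/`index_query`, illustrative only):

```python
# Sketch (illustrative) of vector space query indexing: the query is
# stored at all nodes responsible for an id H(w_j), one per distinct word.
import hashlib

def H(word, id_space=2 ** 32):
    return int(hashlib.sha1(word.encode()).hexdigest(), 16) % id_space

def subscribe_vsm(query, text_values, lookup, index_query):
    """text_values: the strings s_1 .. s_n of the query's atomic formulas."""
    distinct = set()
    for s in text_values:                 # D_1 .. D_n
        distinct |= set(s.lower().split())
    ids = {H(w) for w in distinct}        # the list L, duplicates removed
    for node_id in ids:
        index_query(lookup(node_id), query)
    return len(ids)
```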
The DHTrie protocols (cont’d)
Subscribing with mixed type queries
Assume a query q of the form:
(A_1 = s_1) ∧ … ∧ (A_m = s_m) ∧ (A_{m+1} ⊒ wp_{m+1}) ∧ … ∧ (A_n ⊒ wp_n) ∧ (A_{n+1} ~_{k_{n+1}} s_{n+1}) ∧ … ∧ (A_{n+a} ~_{k_{n+a}} s_{n+a})
Index q under its Boolean part only.
The DHTrie protocols (cont’d)
Publishing a resource
Assume a publication p of the form:
{ (A_1, s_1), (A_2, s_2), …, (A_n, s_n) }
1. Construct D_1, …, D_n, the sets of distinct words in s_1, …, s_n.
2. Construct the list L = { H(w_j) : w_j ∈ D_1 ∪ … ∪ D_n }.
3. Remove duplicates from L.
4. Forward the publication to all nodes responsible for an id = H(w_j).
5. These nodes find all matching queries using their local data structures and algorithms, and notify the subscribers.
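At the receiving end (step 5), a node matches the forwarded publication against its locally indexed queries and notifies subscribers. The real system uses BestFitTrie/SQI; in this illustrative sketch a simple word-containment check stands in for them:

```python
# Sketch (illustrative): a node receiving a forwarded publication finds
# the matching local queries and notifies their subscribers. A word-set
# check stands in for the BestFitTrie/SQI matching algorithms here.
def words(text):
    return set(text.lower().split())

def match_and_notify(publication, local_queries, notify):
    """local_queries: list of (subscriber, required_words) pairs."""
    doc_words = set()
    for value in publication.values():    # union of the D_1 .. D_n
        doc_words |= words(value)
    for subscriber, required in local_queries:
        if required <= doc_words:         # every query word occurs
            notify(subscriber, publication)
```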
The DHTrie protocols (cont’d)
Notifying interested subscribers
To find all matching queries, nodes use a local (centralised) algorithm that combines:
BestFitTrie (Boolean part)
SQI (VSM part)
Any other efficient centralised algorithm could be used.
Notifications are sent to:
the IP address associated with the query (for nodes that are online)
the successor of the recipient (for nodes that are offline)
Depending on the application, notifications could be batched and sent when network traffic is low.
The DHTrie protocols (cont’d)
Other protocols for the pub/sub scenario:
Removing a query
Updating a query
Joining or leaving the network
Also designed protocols to support one-time queries (IR).
See the LibraRing system for distributed DLs (ECDL’05).
Message routing
Boolean query indexing can be handled in a straightforward way using Chord’s put primitive.
Resource publication and VSM queries require sending the same message to a group of nodes.
Multicasting techniques are not applicable, since the group of nodes to be contacted is not known a priori.
Two methods: iterative vs. recursive
Iterative: send a publish() message to each recipient.For k recipients we need O(k log(N)) publish messages.
Resource publication
Recursive message forwarding
Method: node P wants to contact a group of nodes with ids L = {id1, …, idn}:
1. Sort the ids in ascending order.
2. Construct the message (msg, L) and send it towards the node responsible for head(L), that is, the peer P’ with id(P’) ≥ head(L).
3. Upon reception by a peer P’, head(L) is checked:
i. If id(P’) < head(L), then P’ forwards the message as above.
ii. If id(P’) ≥ head(L), then P’ copies msg, deletes head(L), and forwards the remaining list as above.
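The forwarding steps can be simulated on the slide's ring. An illustrative sketch (`successor` models Chord's responsibility rule; names are not from the system):

```python
# Sketch (illustrative) of recursive forwarding: one message carries the
# sorted id list; each responsible peer copies the message, deletes the
# head ids it owns, and forwards the rest.
def successor(key, nodes):
    for n in sorted(nodes):
        if n >= key:
            return n
    return min(nodes)                     # wrap around the ring

def recursive_forward(msg, ids, nodes):
    """Return the delivery order [(peer, msg), ...] for one message."""
    L = sorted(set(ids))
    deliveries = []
    while L:
        p = successor(L[0], nodes)        # route towards head(L)
        while L and successor(L[0], nodes) == p:
            L.pop(0)                      # p copies msg, deletes head(L)
        deliveries.append((p, msg))       # remaining L travels onwards
    return deliveries

ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(recursive_forward("MSG", [54, 56, 63], ring))  # [(56, 'MSG'), (1, 'MSG')]
```

With the ids {54, 56, 63} of the slide's example, one message reaches N56 (responsible for 54 and 56) and then wraps to N1, instead of three separate lookups.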
Resource publication (cont’d)
[Figure: iterative vs. recursive forwarding on the Chord ring (nodes N1 to N56). Iterative: three separate messages (MSG, 54), (MSG, 56), (MSG, 63). Recursive: a single message (MSG, {54,56,63}) that shrinks to (MSG, {63}) and finally (MSG, {}) as the responsible peers consume their ids.]
Frequency Cache (FCache)
Extra routing table.
Stores IPs of the nodes responsible for queries containing frequent words → reach nodes we contact often in 1 hop.
No extra message cost:
Uses only local information (published documents).
Contact information is discovered when needed.
FCache misses:
Mainly for infrequent words.
Cost a single lookup() operation (needed anyway …).
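The behaviour described above can be sketched as a small cache class (illustrative names; the eviction policy here is a placeholder, not the system's):

```python
# Sketch (illustrative) of an FCache: cache the addresses of nodes
# responsible for frequent words so they are reached in one hop; a miss
# simply falls back to the lookup() that would be needed anyway.
class FCache:
    def __init__(self, lookup, capacity=1000):
        self.lookup = lookup              # the DHT's lookup() primitive
        self.capacity = capacity
        self.table = {}                   # word -> node address
        self.hits = self.misses = 0

    def node_for(self, word):
        if word in self.table:
            self.hits += 1                # 1 hop instead of O(log N)
            return self.table[word]
        self.misses += 1                  # costs a single lookup() anyway
        node = self.lookup(word)
        if len(self.table) < self.capacity:
            self.table[word] = node       # learn the contact as we go
        return node
```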
Implemented Algorithms

Algorithm name | Iterative / Recursive | FCache utilisation
It  | Iterative | no
ItC | Iterative | yes
Re  | Recursive | no
ReC | Recursive | yes
Experimental Evaluation
Evaluated the algorithms under different parameters:
Network size
Query types
FCache size & level of training
Publication size
Also looked into the load balancing problem (load shedding):
Filtering load
Routing load
Experimental Evaluation
[Chart: performance for various network sizes. Messages per document (0 to 9000) vs. number of nodes (10K to 100K) for It, ItC, Re, ReC.]
Experimental Evaluation
[Chart: total number of messages for a single document. DHT messages vs. FCache messages (0 to 10000 per document) for It, ItC, Re, ReC at 50K and 100K nodes.]

Experimental Evaluation
[Chart: total number of messages for indexing a query of different types. FCache and DHT messages per query type (0 to 120) for It, ItC, Re, ReC. T1: Boolean or mixed type queries; T2: vector space queries.]
Experimental Evaluation
[Chart: performance for different FCache sizes. DHT messages per document (0 to 7000, left axis) and FCache messages per document (300 to 900, right axis) vs. FCache size (1K to 30K), for ItC and ReC at 50K and 100K nodes.]
Experimental Evaluation
[Chart: performance for various FCache training levels. DHT messages per document (400 to 1800) and FCache hits per document (800 to 900) vs. number of documents published per node (200 to 8000), for ItC and ReC at 50K and 100K nodes.]
Experimental Evaluation
[Chart: performance for different document sizes. DHT messages per document (0 to 5000) vs. average document size (1412, 5415 and 20755 words) for It, ItC, Re, ReC, with increase rates.]
Experimental Evaluation
Filtering load and routing requests for the top 10K nodes.
[Charts: filtering requests per peer (0 to 60) and DHT messages per peer (0 to 40) across the top-10K ranked peers, with and without load balancing.]
Open Problems
System development and deployment on PlanetLab.
Distributed computation of word frequencies.
Load balancing.
More expressive data models (XPath/XQuery with full text).
Acknowledgements - Funding
Evergrow http://www.evergrow.org (IST/FET IP on Complex Systems).
Heraclitus (Greek Government PhD Fellowship Programme).