Pub/Sub Functionality in IR Environments using Structured Overlay Networks
Christos Tryfonopoulos, Stratos Idreos and Manolis Koubarakis
Intelligent Systems Laboratory, Dept. of Electronic and Computer Engineering, Technical University of Crete
http://www.intelligence.tuc.gr/
Outline
Motivation
Related work
Background
The DHTrie protocols
Protocol presentation
Experimental evaluation
Open problems
Motivation
Publish/subscribe scenario
[Figure: the publish/subscribe scenario. An information provider peer issues publications into a network of peers; peers post continuous queries; a matching publication produces a notification at the subscriber, who may then download the resource.]
Related Work
Information Retrieval (in P2P networks)
Unstructured P2P networks: PlanetP, PIRS, [KJ04], [YF05], Peerware
Hierarchical P2P networks: [LC03], [LC05], P2P-DIET, LibraRing, Odissea
Structured P2P networks: pSearch, Arpeggio, Minerva, Galanx
Related Work
Information Filtering
Centralised: InRoute, SIFT, DIAS
Hierarchical P2P networks: P2P-DIET
Structured P2P networks: pFilter
Content-based pub/sub on top of structured P2P networks: Meghdoot, Hermes, [TBF+03], [TA04]
Background: data model
Publication: a set of attribute-value pairs with at most one value per attribute.
Example:
( AUTHOR = “Manolis Koubarakis”, TITLE = “Peer-to-Peer Publish/Subscribe Networks with
Languages from Information Retrieval”, ABSTRACT = “We study …” )
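The data model above maps directly onto a dictionary. A minimal Python sketch (illustrative only; the names and the validity check are not from the talk):

```python
# Minimal sketch (illustrative, not the system's code): a publication is a
# set of attribute-value pairs with at most one value per attribute, which
# a dict models directly.
publication = {
    "AUTHOR": "Manolis Koubarakis",
    "TITLE": "Peer-to-Peer Publish/Subscribe Networks with "
             "Languages from Information Retrieval",
    "ABSTRACT": "We study ...",
}

def is_valid_publication(pub):
    """Check that every attribute and every value is a (single) string."""
    return all(isinstance(a, str) and isinstance(v, str)
               for a, v in pub.items())

print(is_valid_publication(publication))  # True
```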
Background: query language
Interpret text values under the Boolean or vector space model, depending on their use in queries.
QL: conjunctions of attribute-operator-value atomic formulas (atomic queries). Example:
( AUTHOR = "John Brown" ) ∧ ( TITLE ⊒ (algorithms ≺[0,2] (complexity ∧ filtering)) ) ∧ ( ABSTRACT ~0.6 "Information alert in structured P2P ..." )
Extendable with other data types and richer structure.
Distributed Hash Tables (DHTs)
Second generation structured overlay networks.
Created to solve the object location problem in a distributed and dynamic network of nodes: let K be some data item. Find K!
Distributed version of the hash table data structure.
Main operations:
Put: given a key (for a data item), map the key onto a node.
Get: find the location of a data item with a given key.
Chord
Node IDs: m bits (e.g., produced using SHA-1 on IPs).
Nodes are organised in a circle in the ID space.
Keys are hashed using the same hash function and assigned to the successor node on the ID circle.
For correct routing, nodes need only know their successor. For better routing, Chord maintains a routing table with m entries (fingers) pointing to nodes spaced exponentially in the ID space.
We can locate a node in O(log N) hops, where N is the number of nodes.
The Chord Ring
[Figure: a Chord ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56 and keys K10, K24, K30, K42, K54, each stored at its successor node. N8's finger table: N8+1→N14, N8+2→N14, N8+4→N14, N8+8→N21, N8+16→N32, N8+32→N42. A lookup(54) is routed around the ring to N56.]
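The successor rule and the finger table can be modelled on the slide's 6-bit ring. A toy sketch (illustrative names, not Chord's API):

```python
# Toy model (illustrative) of the Chord ring on the slide: 6-bit ids,
# keys assigned to their successor node, fingers spaced exponentially.
M = 6                                            # id space: 2^6 = 64 ids
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # the ring of the figure

def successor(key):
    """The node responsible for a key: the first node clockwise from it."""
    for n in sorted(NODES):
        if n >= key:
            return n
    return min(NODES)                            # wrap around the ring

def finger_table(node):
    """Finger i points to successor(node + 2^i mod 2^M)."""
    return [successor((node + 2 ** i) % 2 ** M) for i in range(M)]

print(successor(54))    # 56: key K54 is stored at node N56
print(finger_table(8))  # [14, 14, 14, 21, 32, 42], as on the slide
```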
The DHTrie protocols
Publish/subscribe scenario revisited
[Figure: the publish/subscribe scenario diagram again, with publications, continuous queries, notifications and downloads flowing through the peer network.]
The DHTrie protocols
Basic idea: index and store subscriptions in the DHT; make sure publications meet subscriptions.
Two levels of indexing:
Global: use an extension of a DHT.
Local: use what is appropriate for your subscription language; in our case a trie-like structure (BestFitTrie), and other bits.
The DHTrie protocols (cont’d)
Subscribing with Boolean atomic queries
Assume a query q of the form:
(A_1 = s_1) ∧ … ∧ (A_m = s_m) ∧ (A_{m+1} ⊒ wp_{m+1}) ∧ … ∧ (A_n ⊒ wp_n)
1. Select a single word w contained in any of the s_i or wp_j.
2. Index the query at the node responsible for id = H(w).
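The two steps above can be sketched as follows; H uses SHA-1 as in Chord, while `lookup` and `index_query` are stand-ins for the DHT's routing and storage machinery, not the system's actual API:

```python
# Sketch (illustrative) of Boolean query indexing: pick one word of the
# query, hash it into the id space, and store the query at the node
# responsible for H(w).
import hashlib

def H(word, id_space=2 ** 32):
    """Hash a word into the DHT id space (SHA-1, as in Chord)."""
    return int(hashlib.sha1(word.encode()).hexdigest(), 16) % id_space

def subscribe_boolean(query, query_words, lookup, index_query):
    """query_words: words occurring in the s_i / wp_j of the query.
    lookup/index_query stand in for the DHT's get/put machinery."""
    w = min(query_words)          # any deterministic single-word choice
    node = lookup(H(w))           # reached in O(log N) Chord hops
    index_query(node, query)
    return w
```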
The DHTrie protocols (cont’d)
Subscribing with vector space atomic queries
Assume a query q of the form:
(A_1 ~_{k_1} s_1) ∧ … ∧ (A_n ~_{k_n} s_n)
1. Construct D_1, …, D_n, the sets of distinct words in s_1, …, s_n.
2. Construct the list L = { H(w_j) : w_j ∈ D_1 ∪ … ∪ D_n }.
3. Remove duplicates from L.
4. Index the query at all nodes responsible for an id = H(w_j).
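Steps 1 to 4 can be sketched similarly; here the query must reach every node responsible for one of its distinct words (again with stand-in `lookup`/`index_query`, illustrative only):

```python
# Sketch (illustrative) of vector space query indexing: the query is
# stored at all nodes responsible for an id H(w_j), one per distinct word.
import hashlib

def H(word, id_space=2 ** 32):
    return int(hashlib.sha1(word.encode()).hexdigest(), 16) % id_space

def subscribe_vsm(query, text_values, lookup, index_query):
    """text_values: the strings s_1 .. s_n of the query's atomic formulas."""
    distinct = set()
    for s in text_values:                 # D_1 .. D_n
        distinct |= set(s.lower().split())
    ids = {H(w) for w in distinct}        # the list L, duplicates removed
    for node_id in ids:
        index_query(lookup(node_id), query)
    return len(ids)
```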
The DHTrie protocols (cont’d)
Subscribing with mixed type queries
Assume a query q of the form:
(A_1 = s_1) ∧ … ∧ (A_m = s_m) ∧ (A_{m+1} ⊒ wp_{m+1}) ∧ … ∧ (A_n ⊒ wp_n) ∧ (A_{n+1} ~_{k_{n+1}} s_{n+1}) ∧ … ∧ (A_{n+a} ~_{k_{n+a}} s_{n+a})
Index q under its Boolean part only.
The DHTrie protocols (cont’d)
Publishing a resource
Assume a publication p of the form:
{ (A_1, s_1), (A_2, s_2), …, (A_n, s_n) }
1. Construct D_1, …, D_n, the sets of distinct words in s_1, …, s_n.
2. Construct the list L = { H(w_j) : w_j ∈ D_1 ∪ … ∪ D_n }.
3. Remove duplicates from L.
4. Forward the publication to all nodes responsible for an id = H(w_j).
5. These nodes find all matching queries using their local data structures and algorithms, and notify the subscribers.
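At the receiving end (step 5), a node matches the forwarded publication against its locally indexed queries and notifies subscribers. The real system uses BestFitTrie/SQI; in this illustrative sketch a simple word-containment check stands in for them:

```python
# Sketch (illustrative): a node receiving a forwarded publication finds
# the matching local queries and notifies their subscribers. A word-set
# check stands in for the BestFitTrie/SQI matching algorithms here.
def words(text):
    return set(text.lower().split())

def match_and_notify(publication, local_queries, notify):
    """local_queries: list of (subscriber, required_words) pairs."""
    doc_words = set()
    for value in publication.values():    # union of the D_1 .. D_n
        doc_words |= words(value)
    for subscriber, required in local_queries:
        if required <= doc_words:         # every query word occurs
            notify(subscriber, publication)
```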
The DHTrie protocols (cont’d)
Notifying interested subscribers
To find all matching queries, nodes use a local (centralised) algorithm that combines:
BestFitTrie (Boolean part)
SQI (VSM part)
Any other efficient centralised algorithm could be used.
Notifications are sent to:
the IP address associated with the query (for nodes that are online)
the successor of the recipient (for nodes that are offline)
Depending on the application, notifications could be batched and sent when network traffic is low.
The DHTrie protocols (cont’d)
Other protocols for the pub/sub scenario:
Removing a query
Updating a query
Joining or leaving the network
Also designed protocols to support one-time queries (IR).
See the LibraRing system for distributed DLs (ECDL’05).
Message routing
Boolean query indexing can be handled in a straightforward way using Chord’s put primitive.
Resource publication and VSM queries require sending the same message to a group of nodes.
Multicasting techniques are not applicable, since the group of nodes to be contacted is not known a priori.
Two methods: iterative vs. recursive
Iterative: send a publish() message to each recipient.For k recipients we need O(k log(N)) publish messages.
Resource publication
Recursive message forwarding
Method: node P wants to contact a group of nodes with ids L = {id1, …, idn}:
1. Sort the ids in ascending order.
2. Construct the message (msg, L) and send it towards the node responsible for head(L), that is, the peer P’ with id(P’) ≥ head(L).
3. Upon reception by a peer P’, head(L) is checked:
i. If id(P’) < head(L), then P’ forwards the message as above.
ii. If id(P’) ≥ head(L), then P’ copies msg, deletes head(L), and forwards the remaining list as above.
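The forwarding steps can be simulated on the slide's ring. An illustrative sketch (`successor` models Chord's responsibility rule; names are not from the system):

```python
# Sketch (illustrative) of recursive forwarding: one message carries the
# sorted id list; each responsible peer copies the message, deletes the
# head ids it owns, and forwards the rest.
def successor(key, nodes):
    for n in sorted(nodes):
        if n >= key:
            return n
    return min(nodes)                     # wrap around the ring

def recursive_forward(msg, ids, nodes):
    """Return the delivery order [(peer, msg), ...] for one message."""
    L = sorted(set(ids))
    deliveries = []
    while L:
        p = successor(L[0], nodes)        # route towards head(L)
        while L and successor(L[0], nodes) == p:
            L.pop(0)                      # p copies msg, deletes head(L)
        deliveries.append((p, msg))       # remaining L travels onwards
    return deliveries

ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(recursive_forward("MSG", [54, 56, 63], ring))  # [(56, 'MSG'), (1, 'MSG')]
```

With the ids {54, 56, 63} of the slide's example, one message reaches N56 (responsible for 54 and 56) and then wraps to N1, instead of three separate lookups.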
Resource publication (cont’d)
[Figure: iterative vs. recursive forwarding on the Chord ring (nodes N1 to N56). Iterative: three separate messages (MSG, 54), (MSG, 56), (MSG, 63). Recursive: a single message (MSG, {54,56,63}) that shrinks to (MSG, {63}) and finally (MSG, {}) as the responsible peers consume their ids.]
Frequency Cache (FCache)
Extra routing table.
Stores IPs of the nodes responsible for queries containing frequent words → reach nodes we contact often in 1 hop.
No extra message cost:
Uses only local information (published documents).
Contact information is discovered when needed.
FCache misses:
Mainly for infrequent words.
Cost a single lookup() operation (needed anyway …).
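The behaviour described above can be sketched as a small cache class (illustrative names; the eviction policy here is a placeholder, not the system's):

```python
# Sketch (illustrative) of an FCache: cache the addresses of nodes
# responsible for frequent words so they are reached in one hop; a miss
# simply falls back to the lookup() that would be needed anyway.
class FCache:
    def __init__(self, lookup, capacity=1000):
        self.lookup = lookup              # the DHT's lookup() primitive
        self.capacity = capacity
        self.table = {}                   # word -> node address
        self.hits = self.misses = 0

    def node_for(self, word):
        if word in self.table:
            self.hits += 1                # 1 hop instead of O(log N)
            return self.table[word]
        self.misses += 1                  # costs a single lookup() anyway
        node = self.lookup(word)
        if len(self.table) < self.capacity:
            self.table[word] = node       # learn the contact as we go
        return node
```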
Implemented Algorithms

Algorithm name | Iterative / Recursive | FCache utilisation
It  | Iterative | no
ItC | Iterative | yes
Re  | Recursive | no
ReC | Recursive | yes
Experimental Evaluation
Evaluated the algorithms under different parameters:
Network size
Query types
FCache size & level of training
Publication size
Also looked into the load balancing problem (load shedding):
Filtering load
Routing load
Experimental Evaluation
[Chart: performance for various network sizes. Messages per document (0 to 9000) vs. number of nodes (10K to 100K) for It, ItC, Re, ReC.]
Experimental Evaluation
[Chart: total number of messages for a single document. DHT messages vs. FCache messages (0 to 10000 per document) for It, ItC, Re, ReC at 50K and 100K nodes.]

Experimental Evaluation
[Chart: total number of messages for indexing a query of different types. FCache and DHT messages per query type (0 to 120) for It, ItC, Re, ReC. T1: Boolean or mixed type queries; T2: vector space queries.]
Experimental Evaluation
[Chart: performance for different FCache sizes. DHT messages per document (0 to 7000, left axis) and FCache messages per document (300 to 900, right axis) vs. FCache size (1K to 30K), for ItC and ReC at 50K and 100K nodes.]
Experimental Evaluation
[Chart: performance for various FCache training levels. DHT messages per document (400 to 1800) and FCache hits per document (800 to 900) vs. number of documents published per node (200 to 8000), for ItC and ReC at 50K and 100K nodes.]
Experimental Evaluation
[Chart: performance for different document sizes. DHT messages per document (0 to 5000) vs. average document size (1412, 5415 and 20755 words) for It, ItC, Re, ReC, with increase rates.]
Experimental Evaluation
Filtering load and routing requests for the top 10K nodes.
[Charts: filtering requests per peer (0 to 60) and DHT messages per peer (0 to 40) across the top-10K ranked peers, with and without load balancing.]
Open Problems
System development and deployment on PlanetLab.
Distributed computation of word frequencies.
Load balancing.
More expressive data models (XPath/XQuery with full text).
Acknowledgements - Funding
Evergrow http://www.evergrow.org (IST/FET IP on Complex Systems).
Heraclitus (Greek Government PhD Fellowship Programme).