
PHD PROGRAM IN COMPUTER SCIENCE, XIX CYCLE

SCIENTIFIC DISCIPLINARY SECTOR INF/01 COMPUTER SCIENCE

Emergent Communities for Semantic Collaboration in Multi-Knowledge Environments: Methods and Techniques

PhD Thesis by: Stefano Montanelli

Advisor: Prof. Silvana Castano

PhD Program Coordinator: Prof. Vincenzo Piuri

Academic Year 2006/2007


The need for methods and techniques to foster semantic collaboration is a challenging issue at the current stage of development of P2P networks and open distributed systems in general. In this respect, the emergence of collaboration among peers requires the capability to share data and resources by dynamically selecting the most appropriate partners within the context of a given task. These scenarios are multi-knowledge, in that no centralized authority is defined to manage a comprehensive view of the resources shared by all the nodes in the system, due to the high dynamism and variability of collaboration and sharing requirements.

With respect to this scenario, the thesis work has focused on investigating the peculiar aspects and requirements of knowledge sharing in open distributed systems, and in P2P systems in particular. In this context, ontologies and ontology matching techniques have been identified as a key solution for enabling peers with similar contents to gradually “emerge” and for allowing the network to self-configure by linking them as semantic neighbors. The main contribution of the thesis work concerns the definition of methods and techniques for enforcing semantic collaboration among autonomously emergent semantic neighbors. In particular, two main goals have been achieved. On the one hand, the development of a matching-driven semantic routing mechanism, called H-Link, for scalable distribution of knowledge requests on a semantic basis and for effective sharing of distributed resources. On the other hand, the definition of consensus-driven techniques which exploit ontological resource descriptions and ontology matching in order to form, maintain, and disband semantic communities of peers, with application to the Helios knowledge-sharing P2P system.

et dormiat et exsurgat nocte et die, et semen germinet et increscat, dum nescit ille. [Mc 4,27]

Acknowledgements

The first acknowledgment is dedicated to my advisor, Prof. Silvana Castano. Working under her supervision has been a great opportunity for my professional and scientific growth. I would also like to acknowledge the referees, Prof. Steffen Staab and Prof. Paolo Tiberio, for their comments and attention. The methods and techniques presented in the thesis have been developed in the following research projects.

• FIRB WEB MINDS

• FP6 INTEROP NoE

• MIUR PRIN ESTEEM

The staff involved in these projects has contributed considerably to improving this thesis with their invaluable work.

Special recognition goes to the Balsi staff, namely Alfio Ferrara and Gianpaolo Racca, for the helpful discussions, the support, the friendship, and the fun. Finally, a kind thought is dedicated to the core persons of my life: Manuela, my family, my friends. Thank you for everything.

Bergamo, November 12th, 2006


Contents

1 Introduction
  1.1 Thesis contribution and outline
  1.2 Research projects involved

2 P2P systems: the state of the art
  2.1 Classification of P2P systems
    2.1.1 Architectural classification
      2.1.1.1 Hybrid systems
      2.1.1.2 Pure systems
      2.1.1.3 SuperPeer systems
    2.1.2 Structural classification
      2.1.2.1 Structured systems
      2.1.2.2 Adaptive systems
      2.1.2.3 Non-adaptive systems
  2.2 P2P semantic routing techniques
    2.2.1 Query routing through good peers
    2.2.2 The REMINDIN’ query routing algorithm
    2.2.3 Socialized.Net and the Seers search protocol
    2.2.4 P2P semantic link networks
    2.2.5 The intelligent search mechanism
    2.2.6 Hierarchical semantic routing for Grid resource discovery
    2.2.7 Query routing through semantic mappings
    2.2.8 Other interesting approaches
  2.3 Peer community formation and consensus negotiation techniques
    2.3.1 Semantic overlay networks for P2P systems
    2.3.2 Formation of P2P communities through escalation
    2.3.3 Communities of peers by trust and reputation
    2.3.4 Peer selection in P2P networks with semantic topologies
    2.3.5 Other interesting approaches

3 Towards emergent semantics in P2P systems: critical review of the state of the art
  3.1 Schema-based P2P networks
    3.1.1 Building blocks of schema-based P2P networks
    3.1.2 Open issues in schema-based P2P networks
  3.2 Knowledge-sharing P2P systems
  3.3 Emergent semantics issues in knowledge-sharing P2P systems
  3.4 Emergent semantics requirements
    3.4.1 Knowledge representation
    3.4.2 Matching techniques
    3.4.3 Query routing
    3.4.4 Community support
  3.5 Critical review of the state of the art
    3.5.1 Comparison on architectural and structural properties
    3.5.2 Comparison on emergent semantics requirements

4 The H-Link semantic routing mechanism
  4.1 Main features of H-Link
    4.1.1 Motivating and running example
  4.2 Peer ontology definition
    4.2.1 The content knowledge layer
    4.2.2 The network knowledge layer
    4.2.3 Building the network knowledge
    4.2.4 Example
    4.2.5 Considerations on the computation of confidence values
  4.3 The H-Match semantic matchmaker
    4.3.1 The H-Match matching process
    4.3.2 Linguistic features
    4.3.3 Contextual features
    4.3.4 Basic matching functions of H-Match
    4.3.5 Property and semantic relation closeness function
    4.3.6 The H-Match matching models
    4.3.7 Matching policy
    4.3.8 Example
    4.3.9 Considerations on H-Match
  4.4 The H-Link routing mechanism
    4.4.1 H-Link invocation
    4.4.2 The H-Link algorithm
    4.4.3 Considerations on H-Link
  4.5 Application example

5 H-Link experimental results
  5.1 The Neurogrid P2P simulator
  5.2 Experimentation configuration and goals
    5.2.1 Neurogrid configuration
    5.2.2 Simulation goals
  5.3 H-Link scalability results
  5.4 Comparison with Gnutella
  5.5 Impact of random credit distribution
  5.6 Final considerations

6 Consensus negotiation techniques and application to the Helios system
  6.1 Foundations of semantic communities
  6.2 Semantic community formation and management
    6.2.1 Consensus negotiation
      6.2.1.1 The handshake state transition diagram
      6.2.1.2 Example
    6.2.2 Membership management
    6.2.3 Sharing Policy
    6.2.4 The role of ontology matching for community handshaking: an example
  6.3 Community-aware query propagation
  6.4 Considerations on handshake and semantic communities
  6.5 The Helios knowledge-sharing P2P system
    6.5.1 The probe query model
    6.5.2 The search query model
  6.6 The Helios toolkit
    6.6.1 Enforcing dynamic knowledge discovery in Helios
    6.6.2 Enforcing semantic communities in Helios
      6.6.2.1 The JXTA framework
      6.6.2.2 The JXTA handshake prototype

7 Conclusions and future work
  7.1 Synthesis of the thesis results
  7.2 Future research directions
    7.2.1 Future research work on H-Link
    7.2.2 Future research work on handshake techniques

A Complete simulation results on H-Link
  A.1 H-Link scalability evaluation
  A.2 Comparison of H-Link with Gnutella

B Summary of the H-Match matching techniques

List of Figures

2.1 An example of hybrid P2P network
2.2 An example of pure P2P network
2.3 An example of superPeer network

3.1 Reference architecture for knowledge-sharing P2P networks
3.2 Degree of expressiveness in P2P knowledge representation
3.3 Degree of semantics in matching techniques
3.4 Degree of adaptivity in P2P query routing
3.5 Degree of decentralization in P2P community support

4.1 Example of dynamic knowledge discovery in H-Link
4.2 An example of peer ontology
4.3 An H-Match example
4.4 The H-Link algorithm
4.5 The ComputeMCL function
4.6 The ComputeSNL function
4.7 The ComputeRSNL function
4.8 The ComputeExpertise function
4.9 The ComputeQDL function
4.10 Example of H-Link routing
4.11 Example of H-Link routing schema

5.1 Evaluation of H-Link scalability
5.2 Comparison of H-Link with Gnutella
5.3 Impact of H-Link random credit distribution

6.1 The state transition diagram of the handshake algorithm
6.2 Example of aggregation of a semantic community
6.3 Example of P2P network with peers and associated peer ontologies
6.4 Example of intra-community query propagation
6.5 The Helios architecture
6.6 The reference probe query template
6.7 The reference probe answer template
6.8 Examples of probe query answer
6.9 Example of search query answer
6.10 The Helios toolkit architecture
6.11 The JXTA framework
6.12 A screenshot of the JXTA handshake prototype

A.1 Evaluation of H-Link scalability for #credits = 12
A.2 Evaluation of H-Link scalability for #credits = 16
A.3 Evaluation of H-Link scalability for #credits = 20
A.4 Evaluation of H-Link scalability for #credits = 24
A.5 H-Link vs Gnutella: #connections per node = 3-5
A.6 H-Link vs Gnutella: #connections per node = 4-6
A.7 H-Link vs Gnutella: #connections per node = 5-7
A.8 H-Link vs Gnutella: #connections per node = 6-8
A.9 H-Link vs Gnutella: #connections per node = 7-9
A.10 H-Link vs Gnutella: #connections per node = 8-10

Chapter 1

Introduction

The need to share data and resources to foster semantic collaboration is a key problem at the current stage of development of P2P networks and open distributed systems in general [Nejdl et al., 2002; Broekstra et al., 2003; Halevy et al., 2004; Androutsellis-Theotokis and Spinellis, 2004]. In this respect, the emergence of collaboration among peers requires dynamic capabilities for negotiating agreements on common interpretations within the context of a given task [Aberer et al., 2004]. This is typical, for instance, of peer-based systems, characterized by a set of independent peers without prior reciprocal knowledge or pre-existing relationships, which dynamically need to cooperate by sharing their resources (e.g., data, documents, services). These collaboration scenarios are multi-knowledge, in that no centralized authority is defined to manage a comprehensive view of the resources shared by all the nodes in the system, due to the high dynamism and variability of collaboration and sharing requirements. On the contrary, each peer is responsible for providing the knowledge description of the resources to be shared, usually through an ontology. Furthermore, peers act as independent agents and interact by semantically matching their respective knowledge (i.e., peer ontologies) with the aim of discovering potential collaboration partners with similar contents. As a result, each peer becomes aware of its semantic neighbors (i.e., nodes storing similar contents) that “emerge” from the knowledge discovery process. In this respect, the opportunities that arise for semantic collaboration are twofold. On the one hand, semantic neighbors can be exploited to enforce a semantic query routing protocol and to provide a scalable infrastructure for peer communications. On the other hand, semantic neighbors can be exploited to enforce consensus negotiation techniques and to provide federative structures of homogeneous peers (i.e., semantic communities). In both cases, ontology matching techniques play a crucial role in discovering and maintaining semantic neighbors and in supporting the associated semantic collaboration functionalities, such as semantic routing and consensus negotiation.

The problem of ontology matching has been investigated in the literature, and a number of approaches and tools have been proposed in the area of data and knowledge management [Kalfoglou and Schorlemmer, 2003; Noy, 2004; Shvaiko and Euzenat, 2005; Shvaiko, 2006]. However, a number of additional and peculiar requirements need to be considered for performing ontology matching in open distributed systems. Such requirements originate from the dynamic behavior of peers in multi-knowledge scenarios. For instance, the number and type of ontology features that can be exploited during the matching process are not known in advance and can vary, even for the same ontologies, each time a new matching execution is invoked. Therefore, the design principles of ontology matching techniques must be driven by the necessity of satisfying matching requests that are dynamically posed by peers on the basis of unexpected needs that can vary continuously.

Semantic routing protocols are generally based on the idea of directing query propagation by selecting as query recipients the peers that are most likely to provide relevant results according to the query content. In this respect, some P2P semantic routing protocols have been proposed in the literature with the aim of directing query propagation on the basis of the local context of the requesting peer [Joseph, 2002; Borch and Vognild, 2004; Haase et al., 2004; Staab et al., 2004; Zeinalipour-Yazti et al., 2005]. However, most of the existing approaches rely on the simplifying assumption of a global knowledge repository where mappings among the distributed peer knowledge are maintained. The centralized mapping knowledge is exploited to select the best recipients for new queries. In some other approaches, no centralized repository is assumed, but the supported knowledge model is poorly expressive in order to reduce the complexity of the matching problem to be processed. In particular, syntactic matching techniques (e.g., string- and keyword-based techniques) are generally adopted to compute the similarity among the contents of the different peers, and accuracy in query recipient selection is negatively affected as a result. A challenging issue is therefore the need to advance the existing semantic routing protocols by combining ontology-based peer knowledge models and ontology matching techniques in order to provide query forwarding on a real semantic basis and in a completely decentralized way. Furthermore, additional issues are related to scalability. In particular, semantic routing techniques should switch from the existing and widely used TTL-based propagation strategies to credit-based mechanisms, where the approximate number of desired replies is specified rather than the non-scalable number of hops to cross.
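The contrast between TTL-based and credit-based propagation can be illustrated with a minimal Python sketch. It is not the H-Link algorithm presented in Chapter 4: the toy overlay, the set of relevant peers, the function names, and the rule that an answering peer consumes one credit while the remainder is split among unvisited neighbors are all assumptions made only for illustration.

```python
from collections import deque

# Hypothetical toy overlay (peer -> neighbours); not taken from the thesis.
OVERLAY = {
    "A": ["B", "C"],
    "B": ["A", "D", "E"],
    "C": ["A", "F"],
    "D": ["B"],
    "E": ["B", "F"],
    "F": ["C", "E"],
}
# Hypothetical set of peers that hold content relevant to the query.
RELEVANT = {"D", "E", "F"}


def ttl_flood(start, ttl):
    """TTL-based flooding: forward to every neighbour until the hop budget ends."""
    visited = {start}
    frontier = deque([(start, ttl)])
    messages = 0
    replies = []
    while frontier:
        peer, hops_left = frontier.popleft()
        if peer in RELEVANT:
            replies.append(peer)
        if hops_left == 0:
            continue
        for nxt in OVERLAY[peer]:
            messages += 1  # each forward costs a message, even to a visited peer
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, hops_left - 1))
    return replies, messages


def credit_route(start, credits):
    """Credit-based propagation: the query carries a reply budget, not a hop limit."""
    visited = {start}
    frontier = deque([(start, credits)])
    messages = 0
    replies = []
    while frontier:
        peer, budget = frontier.popleft()
        if peer in RELEVANT and budget > 0:
            replies.append(peer)
            budget -= 1  # assumption: answering consumes one credit
        targets = [n for n in OVERLAY[peer] if n not in visited]
        if budget == 0 or not targets:
            continue
        targets = targets[:budget]  # never contact more peers than credits left
        share, extra = divmod(budget, len(targets))
        for i, nxt in enumerate(targets):
            visited.add(nxt)
            messages += 1
            frontier.append((nxt, share + (1 if i < extra else 0)))
    return replies, messages
```

On this toy overlay, `ttl_flood("A", 2)` sends 7 messages and collects every relevant peer within two hops, whereas `credit_route("A", 2)` sends 4 messages and stops once two replies have been collected. The number of replies in the credit-based variant can never exceed the initial budget, because credits are only split or consumed, never created.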

In this respect, the thesis work addresses the following requirements by defining appropriate techniques to provide semantic routing functionalities in open distributed systems:

• Confidence evaluation. With respect to a given concept of interest, confidence evaluation metrics based on ontology matching techniques are defined to allow each peer to measure the expertise of the semantic neighbors discovered during past interactions.

• Semantic neighbor ranking. Appropriate techniques for semantic neighbor ranking are developed to allow the selection of the query recipients by relying on the confidence values of the semantic neighbors with respect to a given query content.
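The two requirements above can be pictured with a short Python sketch. It assumes that a peer keeps, for each semantic neighbor, one confidence value per concept in the range [0, 1]; the data layout, the peer and concept names, the threshold, and the function `rank_neighbours` are hypothetical and are not the metrics defined later in the thesis.

```python
# Hypothetical network knowledge of one peer: for each semantic neighbour,
# the confidence value recorded per concept during past interactions.
# All names and numbers are illustrative only.
NETWORK_KNOWLEDGE = {
    "peer_b": {"Publication": 0.82, "Author": 0.40},
    "peer_c": {"Publication": 0.55},
    "peer_d": {"Conference": 0.91, "Publication": 0.30},
}


def rank_neighbours(knowledge, concept, threshold=0.5, limit=None):
    """Return (neighbour, confidence) pairs for `concept`, best first.

    Neighbours with no recorded value for the concept, or with a value
    below `threshold`, are not considered as query recipients.
    """
    candidates = [
        (peer, values[concept])
        for peer, values in knowledge.items()
        if concept in values and values[concept] >= threshold
    ]
    # Sort by decreasing confidence; break ties by peer name for determinism.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates if limit is None else candidates[:limit]
```

For a query about `Publication`, this sketch would select `peer_b` and then `peer_c`, and would skip `peer_d`, whose recorded confidence for that concept falls below the assumed threshold.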

Consensus negotiation techniques are generally defined as the capability of the system nodes to self-organize and to aggregate into semantic communities of peers on the basis of their respective contents and interests. The key role of peer independence and self-organization poses additional challenges for existing peer community formation techniques, and some work in this direction has recently appeared in the literature [Khambatti et al., 2002; Bonifacio et al., 2003; Agostini and Moro, 2004; Yolum and Singh, 2003a; Bloehdorn et al., 2005]. In such proposals, the main difficulty lies in providing a semantically rich representation of peer interests while preserving the fully distributed architecture of the system at the same time. Advanced consensus negotiation techniques are required to handle the problem of high network traffic due to single-peer interactions and to provide a coordination mechanism that allows the parties to interact peer-to-peer and to negotiate local agreements. By aggregating local agreements, single peers should be able to build global agreements (and thus communities), making the process scalable and suited to open distributed contexts.


To this end, the techniques developed in the thesis work address the following requirements in order to manage emerging semantic communities throughout their life-cycle:

• Consensus negotiation. Consensus negotiation techniques are defined to enable peers to commit to an agreement on the basis of their personal interests and to define the topic area of interest that characterizes the community by exploiting self-describing information resources and/or ontologies.

• Membership management. Membership management primitives are required to maintain the community and to cope with the events that can occur over time, such as the insertion and deletion of participants or unexpected peer failures.

• Sharing policy. Once a community is defined, a sharing policy is set to specify the conditions under which a peer is available to process incoming queries within the community and to arrange a common behavior of participants as regards the conditional availability of peers and resources.
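As a rough illustration of how the three requirements above relate to one another, the following Python sketch models a community as a small object. It is not the handshake technique described in Chapter 6: the class `Community`, the affinity threshold used as the agreement rule, and the load limit used as the sharing policy are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Community:
    """Toy semantic community; every field and rule here is an assumption."""

    topic: str
    min_affinity: float = 0.6  # hypothetical agreement threshold
    max_load: int = 2          # hypothetical sharing-policy limit
    members: dict = field(default_factory=dict)  # peer -> queries in progress

    def negotiate(self, peer, affinity):
        """Consensus step: admit the peer only if it agrees on the topic."""
        if affinity >= self.min_affinity:
            self.members[peer] = 0
            return True
        return False

    def leave(self, peer):
        """Membership management: voluntary departure or detected failure.

        Returns False when no member is left, i.e. the community disbands.
        """
        self.members.pop(peer, None)
        return len(self.members) > 0

    def accepts_query(self, peer):
        """Sharing policy: a member serves queries only below its load limit."""
        return peer in self.members and self.members[peer] < self.max_load
```

In this sketch a peer whose affinity with the community topic is below the threshold is simply not admitted, and a community whose last member leaves or fails is treated as disbanded.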

1.1 Thesis contribution and outline

With respect to this scenario, the Ph.D. thesis has been devoted to investigating two main issues: i) the development of matching-driven semantic routing techniques for scalable distribution of knowledge requests on a semantic basis and for effective sharing of distributed resources, and ii) the development of consensus-driven techniques which exploit ontological resource descriptions and ontology matching in order to form, maintain, and disband semantic communities in a P2P environment. In particular, the following original contributions are discussed in the thesis:

• The thesis presents the H-Link semantic routing mechanism, which is designed to exploit the matching knowledge acquired during the discovery process in order to provide a semantic overlay network where peers having similar contexts are interlinked as semantic neighbors.

• The thesis defines semantic handshake techniques for autonomous and self-organizing formation of semantic communities of peers, based on ontologies and ontology matching techniques to handle consensus negotiation during the community formation process.


The thesis work has been developed in the context of the Helios framework for knowledge sharing and evolution in peer-based systems [Castano et al., 2006f]. In Helios, ontologies provide a semantically rich representation of the shared resources and enable peers to describe their involvement in one or more concepts of interest. The role of ontology matching techniques is to evaluate the semantic affinity between concepts provided by different peers in order to assess the level of match between nodes with similar interests. In Helios, the H-Match semantic matchmaker is defined to provide the required ontology matching functionalities. A key feature of H-Match is that it has been specifically conceived to work in multi-knowledge environments. In particular, H-Match provides a set of flexible techniques that can be dynamically combined to adapt to the semantic complexity of the considered matching scenario, and thus it is suitable for coping with the inherent dynamism of open systems, such as P2P systems [Castano et al., 2004d, 2006d].

The research methodology applied during the Ph.D. activity is based on the following main phases: i) literature review, with the aim of providing a critical comparison of the state-of-the-art solutions for managing semantic routing and peer communities in P2P systems, ii) conceptual design, where requirements and foundational aspects related to the Ph.D. issues are formally addressed, iii) experimentation by simulation, with the aim of validating the thesis results on a number of real test cases, and iv) prototype implementation, where a P2P prototype tool is developed according to the results and final considerations of the Ph.D. thesis work.

The thesis is organized as follows. In Chapter 2, we illustrate the state of the art on P2P systems by discussing the main existing approaches in the field of P2P semantic routing and consensus negotiation. In Chapter 3, we critically review the considered approaches by defining the main requirements for handling emergent semantics issues in knowledge-sharing P2P networks. In Chapter 4, the H-Link techniques for semantic routing in P2P systems are presented in detail. Experimental results and simulations on H-Link are then discussed in Chapter 5. In Chapter 6, we describe the handshake techniques for consensus negotiation in peer-based systems, and we also illustrate their application to the Helios framework for knowledge discovery and sharing. Finally, in Chapter 7, we give our concluding remarks and discuss the most interesting directions for future work.


1.2 Research projects involved

The PhD activity is strictly related to the following projects, where the methods andtechniques presented in the thesis have been developed and will be adopted:

• FIRB WEB MINDS. The WEB-MINDS (Wide-scalE, Broadband, MIddlewarefor Network Distributed Services) FIRB Project was funded by the Italian Min-istry of Education, University, and Research. The goal of the WEB-MINDSproject is to provide an infrastructure for supporting the access and discoveryof resources in a highly distributed and mobile context. The architecture pro-posed is based on the peer-to-peer paradigm. In the context of the WEB-MINDSproject the Helios framework has been developed by the ISLab Group of the Uni-versity of Milan 1 as a reference framework for knowledge sharing and evolutionin open networked systems.

• FP6 INTEROP NoE. The INTEROP NoE is an association of 47 partners from 15 countries, coming from numerous sectors spanning academic institutions, research centers, industrial stakeholders and standards communities. The Network gathers 204 researchers and 140 doctoral students. INTEROP aims to create the conditions for innovative and competitive research in the domain of interoperability for enterprise applications and software, and to facilitate the emergence of an interoperability research corpus. In the context of INTEROP, the H-Match techniques for ontology matchmaking have been classified and presented as a contribution to the European research on ontology-based semantic interoperability.

• MIUR PRIN ESTEEM. The ESTEEM (Emergent Semantics and cooperaTion in multi-knowledgE EnvironMents) MIUR PRIN Project is funded by the Italian Ministry of Education, University, and Research. The main goal of the ESTEEM project is the development of a platform for addressing emergent semantics in multi-knowledge environments, where semantic communities of peers need to cooperate according to the P2P paradigm. The ESTEEM platform will provide a comprehensive framework for data and service integration in P2P systems, and advanced solutions for consensus-driven construction of semantic communities, trust and quality management, P2P physical infrastructure design, query processing, and dynamic service discovery, composition and matching. Furthermore, since members of a semantic community may be located in different contexts and may need to access data through heterogeneous devices, the ESTEEM platform will provide features for mobile and ubiquitous information access and storage, and multichannel delivery. The methods and techniques presented in the thesis will be adopted as the starting point for supporting emergent semantics issues in the ESTEEM platform.

¹ http://islab.dico.unimi.it


Chapter 2

P2P systems: the state of the art

The popularity of Peer-to-Peer (P2P) technologies has grown considerably in recent years due to the many benefits they offer, such as adaptation, self-organization, load-balancing, and fault-tolerance [Daswani et al., 2003]. Currently, most of the existing P2P systems are file-sharing systems that aim at enabling peer resource sharing by exploiting metadata-based resource descriptions (e.g., audio/video file-sharing systems). On the other hand, a lot of effort is still required for developing effective P2P applications and for making the P2P paradigm a true alternative to the traditional client/server architecture. In particular, there is the need to go beyond traditional metadata-based sharing techniques by introducing more expressive and semantically rich formalisms for resource description. Knowledge-sharing P2P systems are being proposed to this end.

In this chapter, we analyze the main existing P2P systems. According to the thesis goals, we focus on surveying the most interesting approaches concerning P2P semantic routing and community formation. In particular, we first consider the traditional file-sharing P2P systems by providing a classification based on their architectural and structural properties. Furthermore, more recent P2P systems are also considered when they provide relevant elements of discussion regarding P2P semantic routing and community formation. For each system, we present its main features and provide some considerations about the level of expressiveness in resource description and the possible applicability to knowledge-sharing P2P systems.


2.1 Classification of P2P systems

Different surveys have been proposed in the literature for classifying the existing P2P systems and for providing a comprehensive picture of the current stage of development in this field. In this respect, the main effort has been devoted i) to describing the role of P2P technologies for content distribution and sharing in open systems [Androutsellis-Theotokis and Spinellis, 2004], and ii) to organizing the existing P2P approaches in a taxonomy defined on the basis of different classification criteria (e.g., network organization, system architecture) [Aberer and Cudre-Mauroux, 2005; Staab, 2006].

In our opinion, defining a single taxonomy capable of capturing all the relevant aspects of the existing P2P technologies at once is a hard task. In this chapter, two complementary classifications are proposed: the architectural classification and the structural classification. The architectural classification distinguishes between P2P systems that separate a set of stand-alone nodes with indexing capabilities from regular nodes (i.e., Hybrid systems), P2P systems that assign equal role and capabilities to all the system nodes (i.e., Pure systems), and P2P systems that feature a set of interlinked nodes providing indexing services and a set of regular nodes (i.e., SuperPeer systems). The structural classification distinguishes between P2P systems that follow a pre-defined node organization (i.e., Structured systems), P2P systems that dynamically adapt the query routing choices according to the results of peer interactions (i.e., Adaptive systems), and P2P systems with no pre-defined structure and no adaptation (i.e., Non-adaptive systems). Both classifications aim at providing a basic sorting where categories are chosen to highlight the peculiarities of the P2P paradigm and to foster the comparison among different approaches with respect to routing and node interaction capabilities, respectively. The proposed classifications are orthogonal, in that each P2P system can be classified with both the architectural classification and the structural classification, according to the selected classification criterion. For each category in both classifications, a well-known P2P system is provided as a reference example¹.

¹ The presentations are devoted to illustrating the system features regarding the classification-related issues. For a detailed description of the considered systems, the reader can refer to the corresponding work listed in the references.


2.1.1 Architectural classification

When P2P technologies started emerging as a novel interaction paradigm for distributed systems, the distinction between pure and hybrid systems was used to characterize the two main methods that were originally designed to support content sharing operations. Such a distinction is still valid and can be further refined by introducing the SuperPeer category, which has commonalities with both hybrid and pure systems. The architectural classification allows us to analyze the different levels of decentralization supported by existing P2P systems, and clearly introduces the basic foundations of the P2P paradigm.

2.1.1.1 Hybrid systems

In hybrid P2P systems, a number of stand-alone nodes are defined to support indexing functionalities and are accessed by the requesting peers to identify the nodes of the system that are capable of providing a target file. Once the target file is located, the requesting node can download it by directly communicating with one of the nodes storing the file. Such systems are called hybrid as they mix the concept of indexing node, typical of client/server architectures, with the concept of direct communication between equal nodes for sharing purposes, typical of the P2P paradigm. The most popular hybrid P2P system is Napster [Napster], which is also one of the first P2P systems to be developed. The OpenNap protocol (Open Source Napster Server) has also been defined to extend the initial Napster project with an open source version. In the following, we analyze the main features of an OpenNap [OpenNap] implementation of a hybrid file-sharing service [Yang and Garcia-Molina, 2001].

An OpenNap server provides indexing services to a number of user nodes. User nodes join the system by registering with one of the available OpenNap servers. In order to complete the registration, the user node has to upload to the server the metadata describing its user library. The user library is the collection of files that a user is willing to share. The metadata might include file names, creation dates, and copyright information. For each registered user node, the OpenNap server maintains an index on the metadata of all the files in the user library. Moreover, the server also maintains a table of user connection information, describing active connections (e.g., client IP address, line speed). After the registration, a user node can connect to its server and query it by submitting a request that consists of a list of keywords. When the server receives the query, it searches for matches in the indexed metadata and replies with the list of all matching files and the corresponding coordinates (i.e., name of the file and storing node). Only matching files held by active user nodes are returned in the query result. The user node examines the query results, and when an interesting file is found, the node holding the file is directly contacted for download. After a successful download, or when a file is added to or deleted from the user library, the user node notifies the OpenNap server so that the index can be updated. The system may contain multiple stand-alone OpenNap servers, but a user node is registered with and connected to only one server. The servers are not interlinked and their respective indexes are independent. If a user intends to change server, it has to register its user library on a new OpenNap server.
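The registration, query and index-update steps described above can be sketched as a toy in-memory index. This is only an illustration under stated assumptions: the class `IndexServer`, its method names, and the rule that a file matches only when its metadata contains every query keyword are hypothetical choices, not part of the OpenNap protocol.

```python
from dataclasses import dataclass, field


@dataclass
class IndexServer:
    """Toy stand-alone indexing server (hypothetical, not the OpenNap wire protocol)."""
    # node id -> {file name -> set of lowercase metadata keywords}
    libraries: dict = field(default_factory=dict)
    # node ids that currently hold an active connection
    active: set = field(default_factory=set)

    def register(self, node, library):
        """Registration: the user node uploads the metadata of its user library."""
        self.libraries[node] = {
            name: {kw.lower() for kw in keywords}
            for name, keywords in library.items()
        }

    def connect(self, node):
        self.active.add(node)

    def disconnect(self, node):
        self.active.discard(node)

    def update(self, node, name, keywords=None):
        """Index update: add a file (keywords given) or delete it (keywords is None)."""
        if keywords is None:
            self.libraries[node].pop(name, None)
        else:
            self.libraries[node][name] = {kw.lower() for kw in keywords}

    def query(self, keywords):
        """Return (file name, storing node) pairs held by active nodes.

        Assumption: a file matches when its metadata contains every keyword.
        """
        wanted = {kw.lower() for kw in keywords}
        return sorted(
            (name, node)
            for node, files in self.libraries.items()
            if node in self.active
            for name, meta in files.items()
            if wanted <= meta
        )


server = IndexServer()
server.register("D", {"song.mp3": ["rock", "1999"], "talk.avi": ["lecture"]})
server.register("E", {"live.mp3": ["rock", "live"]})
server.connect("D")  # E is registered but not connected
hits = server.query(["rock"])
```

Here `hits` contains only the file stored by node D, because node E is registered but not active; once E connects, the same query also returns its file.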

As an example, we consider the portion of hybrid P2P network shown in Figure 2.1. In this example, a discovery query containing a target request is submitted by peer C to a centralized indexing service (1). The server replies by specifying the coordinates of a list of peers storing files that match the target query (2). Finally, the requesting peer C directly connects to one of the discovered peers (peer D in the example) to download the file of interest (3).

Figure 2.1: An example of hybrid P2P network. [Diagram: peers A to E and an indexing service; arrows: (1) discovery request, (2) discovery reply, (3) file download; legend: requesting peer, replying peer.]

Hybrid P2P systems offer two advantages. The first is that searches are very efficient, as the available user libraries are indexed by metadata. The second is that query results are “locally complete”, in that all the matching results registered on the queried server are returned to the requesting node. On a larger scale, this feature can be considered a limitation: a given query may collect different results when submitted to different servers. Another weak point of hybrid P2P systems is that indexing functionalities are centralized and, thus, the system is not widely fault tolerant. When a server crashes, all the files registered in its index become unavailable for download.

2.1.1.2 Pure systems

In pure P2P systems, nodes are equal in terms of role and responsibilities and interact by relying on a completely decentralized query/answer paradigm. In particular, each peer is called a servent (i.e., SERVer + cliENT) and can play the roles of server and client at the same time. At a given moment, the peer can act as a client by querying the other nodes when searching for a target resource, and as a server by providing matching files in reply to the requests of other peers. Pure systems are the most intuitive and flexible P2P architectures, as no centralized index providers are required and nodes directly interact to satisfy their requests. The most popular pure P2P system is Gnutella [Gnutella], which is also the reference implementation for a number of other pure P2P systems, such as LimeWire [LimeWire]. In the following, we analyze the main features of a Gnutella implementation as described in the corresponding protocol specification document [Gnutella Protocol].

In order to join a Gnutella network, a peer needs to know the address of at least one node already on the network. Once the joining peer is connected to this node, it can acquire the addresses of other nodes. The basic idea is that each node maintains a connection to a number of other nodes (e.g., five connections in a typical configuration). As no centralized index is defined, queries are propagated through the network according to a flooding mechanism until the searched file is found. When a peer p specifies a target query, the directly connected nodes (i.e., neighbors) are selected as query recipients. The receiving peers forward the query to their respective neighbors and, when a matching resource is found, the positive result (i.e., resource name and address) is sent back to the requesting peer p along the query path. Then, peer p can establish a direct connection with the node storing the matching resource for the subsequent download. Query propagation is limited by a time-to-live (TTL) mechanism. The requesting peer p sets an integer TTL value that is associated with the query. The TTL value indicates the maximum number of query hops before its expiration. Each receiving node decreases the TTL value before forwarding the query. When a node receives a query with TTL = 0, the forwarding process is stopped even if no matching resource has been found yet. Furthermore, each query is associated with a unique identifier in order to enable the receiving peers to drop query duplicates.
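The flooding mechanism with TTL expiration and duplicate dropping can be illustrated with a small simulation. This is a minimal sketch, not Gnutella code: the function `flood` is hypothetical, the overlay links are assumed from the peers named in the Figure 2.2 walkthrough, and a single `seen` set stands in for the per-peer check on the unique query identifier.

```python
from collections import deque


def flood(neighbors, holders, origin, ttl):
    """Flood one query from `origin`; return (peers that matched, messages sent).

    neighbors: peer -> list of directly connected peers (assumed toy overlay)
    holders:   set of peers storing a matching resource
    """
    seen = {origin}   # peers that already processed this query identifier
    found = set()
    messages = 0
    queue = deque()
    for n in neighbors[origin]:
        queue.append((n, origin, ttl))
        messages += 1
    while queue:
        peer, sender, remaining = queue.popleft()
        if peer in seen:
            continue              # duplicate query: dropped
        seen.add(peer)
        if peer in holders:
            found.add(peer)       # a reply travels back; no further forwarding
            continue
        remaining -= 1            # each receiving node decreases the TTL
        if remaining <= 0:
            continue              # TTL expired: forwarding stops
        for n in neighbors[peer]:
            if n != sender:
                queue.append((n, peer, remaining))
                messages += 1
    return found, messages


# Links assumed from the worked example: A-B, A-C, B-D, C-F, D-G, D-F.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "F"],
    "D": ["B", "G", "F"],
    "F": ["C", "D"],
    "G": ["D"],
}
found, messages = flood(overlay, holders={"F"}, origin="A", ttl=3)
```

With TTL = 3 the query reaches peer F after six messages, one of which is a duplicate that F drops. With TTL = 1 the query expires at peers B and C and no result is returned, which illustrates why the scope of a flooded search depends on the chosen TTL.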

As an example, we consider the portion of pure P2P network shown in Figure 2.2. In this example, peer A submits to the system a given target query with an associated TTL = 3 value. The query is sent to the neighbors of peer A, namely peer B and peer C (1). As neither peer B nor peer C can provide files matching the target, the query is forwarded to their respective neighbors, namely peer D and peer F, with TTL = 2 (2). Similarly, peer D propagates the query to peer G and peer F, as it does not store any matching file and TTL = 1 (3). On receiving the query from peer C, peer F stops the query forwarding, as a file matching the query is identified and a reply needs to be returned to the requesting peer A. The reply is sent to peer C (3) and then forwarded to peer A (4) along the reverse query path. Finally, peer A and peer F connect directly for the subsequent file download (5). As a final remark, we stress that peer F drops the query duplicate received from peer D, and peer G stops the query propagation as TTL = 0.

The main benefits provided by pure P2P systems are resilience and robustness to failures. In particular, a node joining or leaving the network simply adds or removes its data collection from the available resources, without affecting the overall lookup service capabilities. However, the main issues of pure P2P systems are related to search efficiency. In particular, flooding-based query propagation works well only for small and medium-sized networks; thus, pure P2P networks do not scale well. As an example, it has been shown that the cost of searching on a Gnutella network increases super-linearly as the number of nodes increases [Portmann et al., 2001]. Furthermore, searches are incomplete by definition: a requesting node has no guarantee that the retrieved set of results contains all the results available in the network.

2.1.1.3 SuperPeer systems

In SuperPeer systems, nodes are split into a set of interlinked super-peer nodes with indexing functionalities and a set of regular nodes with querying capabilities. A super-peer acts as an indexing server to a number of regular peers, as in a hybrid system. Moreover, super-peers are also connected to each other as peers in a pure system, and cooperate by submitting and answering queries on behalf of regular nodes and themselves. As a result, we can say that SuperPeer systems combine in a single P2P architecture the main features of both hybrid and pure P2P systems. For this reason, a number of SuperPeer systems have recently been developed with the aim of providing efficient search procedures (as in hybrid systems) and high levels of decentralization (as in pure systems) at the same time. Two well-known examples of SuperPeer systems are Morpheus [Morpheus] and Kazaa [Kazaa]. In the following, we analyze the main features of a generic SuperPeer implementation, as described in Yang and Garcia-Molina [2003].

Figure 2.2: An example of pure P2P network. [Diagram: peers A to G; arrows: (1), (2), (3) discovery requests starting with TTL = 3, (3), (4) discovery replies, (5) file download; legend: requesting peer, replying peer.]

A SuperPeer system can be considered as a pure P2P network composed of a number of super-peers, each of which is connected to a subset of regular nodes. Each regular node is connected to a single super-peer only, and a super-peer with its subset of associated regular nodes (i.e., clients) defines a cluster. In a cluster, the super-peer maintains an index over the data provided by its clients, and this index is updated each time a peer joins or leaves the cluster. Each regular node submits queries to its super-peer only and receives as a result a set of matching documents as well as the address of each client whose collection produced a result. Once the location of the matching documents is discovered, the requesting node can directly connect to the document provider for file download. When a super-peer receives a query from one of its clients, it matches the query against its index and the results are returned to the requesting client. If the retrieved set of results is not sufficient (or empty), the super-peer can propagate the query to the other super-peers according to a TTL-based mechanism. When a super-peer receives a query from another super-peer, it processes the query on behalf of its clients. If some matching documents are found, the results are returned to the requesting super-peer and then relayed to the initial requesting client.
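The cluster index and the TTL-limited propagation among super-peers can be sketched as follows. The class `SuperPeer` and its methods are hypothetical, the topology is assumed from the peers named in the Figure 2.3 walkthrough, and the query is propagated depth-first for brevity, whereas a real system would contact neighboring super-peers in parallel. Here `ttl` counts the remaining super-peer hops.

```python
class SuperPeer:
    """Toy super-peer holding an index over the files of its clients (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.index = {}       # client address -> set of file names
        self.neighbors = []   # other super-peers

    def join(self, client, files):
        self.index[client] = set(files)   # index updated when a client joins

    def leave(self, client):
        self.index.pop(client, None)      # ... and when it leaves

    def local_matches(self, wanted):
        return sorted(c for c, files in self.index.items() if wanted in files)

    def query(self, wanted, ttl=2, seen=None):
        """Return the addresses of clients storing `wanted`.

        The cluster index is searched first; only if it yields nothing is the
        query propagated to neighboring super-peers, for at most `ttl` hops.
        """
        seen = set() if seen is None else seen
        seen.add(self.name)
        hits = self.local_matches(wanted)
        if hits or ttl == 0:
            return hits
        for sp in self.neighbors:
            if sp.name not in seen:
                hits.extend(sp.query(wanted, ttl - 1, seen))
        return hits


# Assumed topology: B is the super-peer of client A and is linked to C and E,
# E is linked to D, and C is the super-peer of client F.
sp_b, sp_c, sp_d, sp_e = (SuperPeer(n) for n in "BCDE")
sp_b.neighbors = [sp_c, sp_e]
sp_c.neighbors = [sp_b]
sp_e.neighbors = [sp_b, sp_d]
sp_d.neighbors = [sp_e]
sp_b.join("A", [])
sp_c.join("F", ["doc.pdf"])
result = sp_b.query("doc.pdf")   # submitted by B on behalf of its client A
```

The result lists client F, which is never contacted during discovery: only the super-peers exchange the query, and peer A would contact peer F directly for the download.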

As an example, we consider the portion of SuperPeer network shown in Figure 2.3. In this example, peer A submits a target query to the super-peer of its cluster, namely peer B (1). As no matching file is identified, peer B defines a TTL = 2 value and forwards the query to its neighbors, namely peer C and peer E, on behalf of peer A (2). Similarly, peer E propagates the query to peer D, as it does not store any matching file and TTL = 1 (3). No query forwarding is performed by peer D as TTL = 0. On receiving the query from peer B, peer C stops the query forwarding, as a file matching the query is identified in its index and the coordinates of the storing peer (i.e., peer F) need to be returned to the requesting peer A. To this end, the discovery reply is propagated back to peer B (3) and then to peer A (4). Finally, peer A and peer F connect directly for the subsequent file download (5). As a final remark, we note that peer F is not involved in the discovery phase and is contacted by peer A only for the final download action.

Figure 2.3: An example of SuperPeer network. [Diagram: peers A to F; arrows: (1), (2), (3) discovery requests, (3), (4) discovery replies, (5) file download; legend: requesting peer, replying peer.]

As a SuperPeer network combines elements of both pure and hybrid systems, it can potentially combine the efficiency of a centralized search with the autonomy, load balancing and robustness to attacks provided by distributed search. Furthermore, connections among super-peers enable clients to access a wider set of collections, thus increasing the number of query results. Although clusters are efficient, we note that a super-peer is still a possible point of failure for its cluster, and a potential bottleneck. When the super-peer fails or simply leaves, all its clients become temporarily disconnected until they can find a new super-peer to connect to. To provide reliability to clusters, redundancy can be introduced in super-peer design. A super-peer is k-redundant if there are k nodes sharing the super-peer load and forming a single “virtual” super-peer. Each partner in the virtual super-peer is connected to every client and maintains a complete index of its client data, as well as the indexes of other partners. Queries from clients and from neighboring super-peers are processed by distributing the requests equally among the k partners. The drawback is the additional cost of super-peer maintenance. The partners of a super-peer are required to provide extra resources to handle the increasing size of indexes and the higher number of connections. Moreover, a traffic overhead arises in order to keep indexes aligned and to coordinate the partner activities.

2.1.2 Structural classification

The structural classification aims at characterizing P2P systems according to the supported peer organization. In particular, we discuss the main existing approaches for inserting a peer in the network when it joins the system and for handling peer connections with other nodes during its sharing activities. In the following, we propose a basic classification where we distinguish between structured and unstructured P2P systems. Moreover, unstructured P2P systems are further divided into adaptive and non-adaptive P2P systems.

2.1.2.1 Structured systems

Structured P2P systems are based on a predefined network topology where peers are organized in a geometrical structure (e.g., a toroid, a hypercube). The idea is that each resource in the system is identified by a unique hash key and resources are deterministically placed in the system according to a Distributed Hash Table (DHT). Such systems are often called CANs (Content-Addressable Networks) to indicate that the hash key of a resource r is used to obtain the address of the peer that is responsible for storing r. Chord [Stoica et al., 2003] and Pastry [Rowstron and Druschel, 2001] are popular examples of structured P2P systems where hashing is used to build different network topologies and associated DHTs. In the following, we describe the main features of a generic CAN, as presented in Ratnasamy et al. [2001].

In a CAN, peers self-organize into an overlay network representing a virtual coordinate space, and a hashing function is defined to associate each resource (i.e., a file) with a corresponding hash key. The association between a file and its hash key is stored in a DHT, which represents the comprehensive search space of the system. The DHT is then split into chunks, also called zones, that are assigned to peers. Being assigned to a zone z, a peer has the responsibility to store the files whose hash keys are contained in z. The network topology is shaped as a balanced geometrical structure, which is preserved during a new peer insertion by pre-assigning the peer neighbors according to the structure requirements. When joining the system, a generic peer p performs the following steps:
i) the peer p receives a zone and the associated files from one of its neighbors, e.g., the peer r;
ii) the peer p uses the hash function to insert its files in the DHT. Each file f is placed in the peer that is responsible for the hash key of f;
iii) the peer p defines a local routing table where the IP address and the corresponding zone of each neighbor of p are stored.

In order to assign a zone to a joining peer p, the neighbor peer r splits its zone into two equal parts. Conversely, a peer p leaving the network needs to ensure that its zone is taken over by the remaining nodes. A merge operation is then performed, and both the zone and the associated files of p are passed to one of its neighbors. Moreover, each peer in the system periodically sends its neighbors an update message advertising the zone it is responsible for.

In a CAN, a query contains the hash key of the target file to search for, and it is routed through intermediate peers towards the CAN node whose zone contains the requested hash key. Consider a requesting peer p which submits a query q containing a target hash key tk. On receiving the query q, a peer r checks whether tk belongs to its zone. In this case, the query q is satisfied and the associated target file is returned to p. Otherwise, a greedy approach is adopted for query forwarding: by exploiting its local routing table, r sends q to the neighbor whose zone is closest to tk. We note that an empty query reply is provided when the target file of the query q does not exist in the CAN, that is, when no corresponding entry is found in the peer that is responsible for the zone of tk.
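The greedy forwarding step can be sketched on a two-dimensional coordinate space. The sketch below rests on several assumptions that are not taken from the source: the space is the unit square without wrap-around, it is statically divided into four equal zones, hash keys are points derived from SHA-1, and the names `Peer`, `route`, `insert` and `lookup` are hypothetical.

```python
import hashlib


def hash_point(key):
    """Map a key to a point of the unit square (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(key.encode()).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32
    y = int.from_bytes(digest[4:8], "big") / 2**32
    return x, y


class Peer:
    def __init__(self, name, zone):
        self.name = name
        self.zone = zone        # (x0, y0, x1, y1), half-open rectangle
        self.neighbors = []     # local routing table: peers owning adjacent zones
        self.store = {}         # files whose hash keys fall in the zone

    def owns(self, p):
        x0, y0, x1, y1 = self.zone
        return x0 <= p[0] < x1 and y0 <= p[1] < y1

    def distance(self, p):
        """Euclidean distance between point p and the zone rectangle."""
        x0, y0, x1, y1 = self.zone
        dx = max(x0 - p[0], 0.0, p[0] - x1)
        dy = max(y0 - p[1], 0.0, p[1] - y1)
        return (dx * dx + dy * dy) ** 0.5


def route(start, point, max_hops=16):
    """Greedy routing: forward to the neighbor whose zone is closest to the point."""
    peer, path = start, [start.name]
    while not peer.owns(point):
        if len(path) > max_hops:
            raise RuntimeError("routing did not converge")
        peer = min(peer.neighbors, key=lambda n: n.distance(point))
        path.append(peer.name)
    return peer, path


def insert(start, key, value):
    owner, _ = route(start, hash_point(key))
    owner.store[key] = value


def lookup(start, key):
    owner, path = route(start, hash_point(key))
    return owner.store.get(key), path   # None models the empty query reply


# Four equal zones; neighbors are the peers whose zones share an edge.
p0 = Peer("P0", (0.0, 0.0, 0.5, 0.5))
p1 = Peer("P1", (0.5, 0.0, 1.0, 0.5))
p2 = Peer("P2", (0.0, 0.5, 0.5, 1.0))
p3 = Peer("P3", (0.5, 0.5, 1.0, 1.0))
p0.neighbors, p1.neighbors = [p1, p2], [p0, p3]
p2.neighbors, p3.neighbors = [p0, p3], [p1, p2]

owner, path = route(p0, (0.9, 0.6))
insert(p0, "song.mp3", "file contents")
```

Any peer can then serve as the entry point for `lookup("song.mp3")`, since the query is always routed to the peer responsible for the zone of the hash key; a key that was never inserted yields an empty reply from that same peer.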

Structured P2P systems enforce an efficient search approach and are characterized by self-organization, fault-tolerance, and scalability. However, a certain traffic overhead due to administrative operations is required for handling peer join/leave operations and the correct assignment of zones.

2.1.2.2 Adaptive systems

We define an adaptive P2P system as an unstructured P2P system where the routing choices of a peer can vary over time due to past peer interactions. Learning mechanisms and routing tables are usually adopted in adaptive P2P systems to support the selection of the best query recipients for a given query. Freenet is a popular example of an adaptive P2P system where resources (i.e., files) are replicated in the system on the basis of their popularity and query routes vary accordingly [Freenet; Clarke et al., 2001].

Freenet is a distributed information retrieval system where files can be anonymously inserted and requested, and privacy protection is defined as a basic system feature. Each peer p provides a local data-store that is used for maintaining both the local files of p and the files provided by other nodes. In particular, a file can be replicated in the data-store of different peers according to its popularity, in order to foster its availability for future requests. In Freenet, hashing is used for file verification and identification. The Content Hash Key (CHK) is used for file encoding in order to prevent corruption due to transmission errors, while the Signed Subspace Key (SSK) is used to certify the file integrity. An asymmetric encryption mechanism is defined to encode/decode files and to detect malicious file alterations.

Each peer is capable of performing query submission and forwarding by relying on a local routing table where information regarding file positions is stored. To define a query q, a requesting peer p has to know the SSK of the target file tf. The query q is then submitted to the system together with a hops-to-live (HTL) parameter that acts as a TTL value by limiting the query scope. During the propagation, assume that the query q is forwarded by a peer t to a peer r. On receiving the query q, the peer r checks whether the target can be provided through its local data-store. Three scenarios can occur:
i) The target file tf is found in the local data-store of the peer r. The target file tf is returned to the peer p along the reverse query path. A peer s on the reverse path updates its routing table to record that the target file tf is available through the peer r. Furthermore, the peer s maintains a copy of tf in its data-store with the aim of speeding up subsequent requests for this target file. Both the peer routing table and the local data-store are managed with a least-recently-requested replacement strategy.
ii) The target file tf is not found in the data-store of the peer r and HTL > 0. The query q is forwarded to a neighbor of the peer r, selected by exploiting its routing table.
iii) The target file tf is not found in the data-store of the peer r and HTL = 0. The peer r returns a no-match message to the peer t. The query q is then retransmitted by the peer t to another of its neighbors, selected by exploiting its routing table.
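The three scenarios can be sketched as a recursive request with backtracking. This is a simplified illustration, not Freenet code: the names `Node` and `request` are hypothetical, the routing table maps exact keys to providers instead of selecting the neighbor with the closest key, the replacement strategy for routing table and data-store is omitted, and a shared `visited` set stands in for loop detection.

```python
class Node:
    def __init__(self, neighbors):
        self.neighbors = list(neighbors)  # names of directly known peers
        self.store = {}                   # local data-store: key -> file
        self.routes = {}                  # routing table: key -> provider name


def request(nodes, name, key, htl, visited=None):
    """Return (file, provider) or None when the query ends with a no-match."""
    visited = set() if visited is None else visited
    visited.add(name)
    node = nodes[name]
    if key in node.store:                 # scenario i: found in the data-store
        return node.store[key], name
    if htl == 0:                          # scenario iii: no-match to the sender
        return None
    # scenario ii: forward, trying the provider learned for this key first;
    # after a no-match the query is retransmitted to another neighbor
    preferred = node.routes.get(key)
    candidates = ([preferred] if preferred else []) + [
        n for n in node.neighbors if n != preferred
    ]
    for nxt in candidates:
        if nxt in visited:
            continue
        reply = request(nodes, nxt, key, htl - 1, visited)
        if reply is not None:
            data, provider = reply
            node.store[key] = data        # replicate the file on the reverse path
            node.routes[key] = provider   # remember who provided it
            return reply
    return None


# Assumed toy network: p knows c and a, a knows b, and only b stores the file.
nodes = {
    "p": Node(["c", "a"]),
    "c": Node(["p"]),
    "a": Node(["p", "b"]),
    "b": Node(["a"]),
}
nodes["b"].store["k1"] = "payload"
first = request(nodes, "p", "k1", htl=2)
second = request(nodes, "c", "k1", htl=1)
```

The first request reaches peer b only after peer c has answered with a no-match, and it leaves copies of the file on peers a and p. The second request, issued by peer c, is then satisfied one hop away by the replica held by peer p.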

In Freenet, and in adaptive P2P systems in general, similar hash keys are stored in neighbor nodes due to the replication of popular files. As a result, the routing effectiveness gradually increases over time through content localization and routing table specialization. However, search performance suffers from long response times due to the no-match/retransmission mechanism. Furthermore, an advertising-like strategy is defined to bypass the problem of knowing the hash key of a file in order to include it in a query as the target file.


2.1.2.3 Non-adaptive systems

Non-adaptive P2P systems are unstructured P2P systems where the routing choices of a peer do not change over time. This means that past peer interactions are not considered and no learning mechanisms are adopted for improving the effectiveness of query routing. A popular example of a non-adaptive P2P system is Gnutella, whose detailed description is provided in Section 2.1.1.2.

The main advantage of non-adaptive P2P systems is their simple routing infrastructure, which reduces the peer workload and thus allows fast query forwarding. On the other hand, non-adaptive P2P systems are affected by i) wasted bandwidth due to unfiltered query retransmission and ii) poor scalability due to the inability to rank peers according to their relevance for the query target.

2.2 P2P semantic routing techniques

The main limits that affect traditional P2P routing protocols derive from two opposite goals that need to be pursued. On one side, queries need to reach the largest number of peers in order to increase the chances of locating the target file. On the other side, scalability requires that the number of generated messages be as low as possible. When the P2P network is targeted to knowledge sharing, query routing becomes even more complex than in traditional file-sharing systems, due to the different levels of semantic complexity that need to be considered in query processing [Ehrig and Sure, 2004]. In this respect, content-based and semantic routing techniques are being proposed to optimize the tradeoff between the need to maximize the effectiveness of the discovery phase and the need to minimize the network traffic in a P2P system.

2.2.1 Query routing through good peers

The concept of “good” peer is defined in Ramanathan et al. [2002] as the key element of an automatic interest-based mechanism for establishing connections between the nodes of a P2P system. In such an approach, each peer maintains a local repository of files to be shared with the other network nodes. Each file is described by a set of metadata (e.g., title, author, expiration date) and a number of keywords. Moreover, the file metadata also include a reputation value which rates the file. The file owner assigns the corresponding reputation value by specifying an integer value in the range [1,5] according to its personal interests: the higher the reputation, the higher the file relevance in the owner preferences. A hashing function is defined to index keywords and associated metadata for efficient file retrieval in the local repository of each peer. The underlying idea of the approach is to build an overlay network of nodes where peers with similar interests are close and the similarity between peer interests decreases as the distance between the peers increases. Thus, the system is characterized by an overlay network of peers placed on top of a traditional pure P2P infrastructure (e.g., Gnutella). The maximum number of direct connections per peer is limited by the available resources of the peer and connections are established according to the peer interests. In particular, nodes with a high degree of similar interests are considered as “good” peers and are directly connected. Interest similarity is measured by observing the number and the type of files that peers maintain and provide during their activity. Two peers are considered to have similar interests if they are able to return relevant files to each other's requests. In particular, each peer learns about the interests of other nodes by monitoring the replies received to its requests. As a consequence, each node decides to add/drop a connection with a given peer according to the observed interests of that peer.

In order to store observations regarding external node interests, each peer i maintains a set of information for each node j it has interacted with. In particular, the importance and percQueryHits metrics are maintained. The importance metric represents the relative importance that the peer j has for the node i and captures the interest overlap of the two peers. Importance measures are locally maintained by each peer, thus the same peer can have different importance for different nodes. The percQueryHits metric is used to support the computation of importance values and it is defined as the number of replies (QueryHits) received by peer i from peer j over the total number of replies (QueryHits) received from all peers as a response to the queries of peer i.
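As a small illustration of the percQueryHits definition, the following sketch computes the metric from the reply counts observed by a peer i. The function name and the dictionary layout are assumptions; how importance is then derived from these values is not specified here and is left out.

```python
def perc_query_hits(hits_by_peer):
    """Map each responding peer j to its share of all QueryHits seen by peer i.

    hits_by_peer: dict mapping a peer identifier to the number of replies
    (QueryHits) received from that peer.
    """
    total = sum(hits_by_peer.values())
    if total == 0:
        # No replies observed yet: every known peer gets a zero share.
        return {peer: 0.0 for peer in hits_by_peer}
    return {peer: hits / total for peer, hits in hits_by_peer.items()}
```

For instance, if peer i received 3 replies from j and 1 from k, the shares are 0.75 and 0.25.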

A node searches in the network by submitting a query message to each of its immediate neighbors (i.e., directly connected peers). The query message contains the target set of keywords featuring the topic of the requested file. When a node receives a query message, target keywords are evaluated against the metadata of the documents in the peer local repository. The evaluation is performed by adopting syntactic matching techniques, such as string-based techniques. If some matching files are identified, they are returned to the requesting peer by following the reverse query path. Provided that TTL > 0, the receiving node also decrements the query TTL and forwards the message to each of its immediate neighbors. The files collected by the requesting peer are displayed according to the file reputations. Moreover, replies are used to update the percQueryHits and importance measures of both immediate and indirect responding peers. Periodically, each peer evaluates the importance of immediate and indirect peers and can decide i) to drop the connection with an immediate neighbor, if its importance measure has decreased, and ii) to add a new direct connection with an indirect node, if its importance measure has increased.

Considerations. We note that the proposed approach is suited for file-sharing P2P networks and only supports string-based matching functionalities when evaluating the relevance of a file with respect to a given target query. Moreover, no check is performed by the requesting node on the real relevance of the collected replies, so importance measures can indicate misleading interest similarity values. The reputation evaluation model provides a subjective and coarse-grained interpretation of the file relevance, which is not related to the semantic affinity existing between files and queries. However, experimental data show that structuring the network according to the peer interest similarity can be promising. In particular, the approach can provide interesting results in terms of traffic load reduction, as a high number of files matching the query target can be discovered also with low TTL values.

2.2.2 The REMINDIN’ query routing algorithm

In Staab et al. [2004], the REMINDIN’ (Routing Enabled by Memorizing INformation about Distributed INformation) algorithm is proposed to provide a query routing mechanism based on confidence ratings for the SWAP (Semantic Web And Peer-to-peer) platform [Broekstra et al., 2003]. In SWAP, each peer provides a local node repository (i.e., an ontology) containing the RDF(S) statements that describe both data and conceptual information of the peer [Brickley and Guha (eds.), 2003]. Furthermore, the SWAP storage model also captures meta-information about the RDF(S) statements in order to specify where the statement came from (i.e., Swabbi.hasPeer) and how much resource-specific confidence and overall confidence is put into these statements and peers, respectively (i.e., Swabbi.Resource-specific Confidence, Peer.Overall Confidence). The SWAP query model is based on the SeRQL language where a query consists of the RDF(S) triple (i.e., subject, predicate, object) [Broekstra and Kampman, 2003].

In the REMINDIN’ query routing algorithm, a peer p maintains in its local node repository a distinct Peer object for representing i) each node that replied to a query submitted by p, and ii) each node that submitted a query to peer p. A Swabbi.hasPeer relation is defined to associate a Peer object, and related peer p, to the resources in the local node repository provided/required by p. A Swabbi.Resource-specific Confidence is defined to specify a confidence value related to a Peer object and the corresponding resource in the local node repository. The Swabbi.Resource-specific Confidence value aims to assess the relevance of a peer p with respect to the associated resource r by measuring the number of relevant answers regarding r provided by p. Such a confidence value is updated when new interactions are issued with the peer p on topics related to r. The confidence value also contributes to load balancing by downgrading the confidence of the nodes with low query responsiveness. A Peer.Overall Confidence value is associated to each Peer object to provide a comprehensive measure of its confidence by considering all the related resources. On receiving a query, a peer replies to the requesting node with the RDF(S) statements that satisfy the SeRQL query and that are contained in its local node repository. A statement selection algorithm and a statement evaluation algorithm are defined to match and to rate the incoming SeRQL query against the statements in the local node repository. Moreover, a query relaxation method is provided to weaken the query constraints when no local RDF(S) statements match the target. After the reply, a peer can forward the query to other peers in the network. The peer selection procedure is invoked to select the set of peers which appear more promising than the others to answer the given query. To this end, the Peer objects in the local node repository are ranked according to their confidence with respect to the resources that match the query triple (i.e., rankPeers algorithm). The query is forwarded to the top-k relevant peers identified. The forwarding mechanism stops when the TTL associated to the query expires.
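The peer selection step can be sketched as follows. This is only an illustration of ranking by resource-specific confidence and taking the top-k peers, not the rankPeers algorithm as published: the `rank_peers` name, the nested dictionary layout, and the use of the maximum confidence over the matching resources as the peer score are all assumptions.

```python
def rank_peers(confidence, matching_resources, k):
    """Return the k peers with the highest confidence on the matching resources.

    confidence: dict peer -> dict resource -> resource-specific confidence.
    matching_resources: resources of the local repository matching the query.
    """
    scores = {}
    for peer, per_resource in confidence.items():
        values = [per_resource[r] for r in matching_resources if r in per_resource]
        if values:
            # Assumed scoring rule: best confidence among the matching resources.
            scores[peer] = max(values)
    # Sort by decreasing score; ties are broken by peer identifier.
    ranked = sorted(scores, key=lambda peer: (-scores[peer], peer))
    return ranked[:k]
```

Peers with no recorded confidence on any matching resource are simply not considered in this sketch.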

Considerations. We note that relying on peer replies to train the routing protocol behavior enforces a passive learning approach where peers become aware of relevant knowledge location without burdening the network with additional traffic overhead. Moreover, the idea that confidence and expertise measures are directly associated to resources allows information about peer relevance to be specified at different levels of granularity. Furthermore, additional metrics (e.g., trust and reputation of nodes) can be considered for flexibly providing a more accurate measure of peer relevance. On the other hand, we observe that semantic resource matching and efficient query processing are crucial issues for providing effective knowledge sharing. In this respect, a possible REMINDIN’ improvement in terms of flexibility can be introduced by combining SeRQL and associated syntactic and structural matching techniques with other existing ontology matching approaches, such as those based on linguistic and contextual comparisons or based on reasoning. A final remark is related to the use of a traditional TTL-based mechanism for query forwarding in REMINDIN’. Given a target SeRQL query, the rankPeers algorithm returns the top-k relevant peers. Based on the assumption that more relevant peers are mutually connected, different values of TTL should be assigned to query recipients according to their ranking. This way, high TTL values can be selectively assigned to the most relevant peers, thus contributing to reduce the overall network traffic.

2.2.3 Socialized.Net and the Seers search protocol

The Seers search protocol is defined in Borch and Vognild [2004] to enforce P2P semantic query routing by creating an overlay network where nodes are organized according to their interests. In the Seers search infrastructure, each shared resource is described through a meta-document that provides a set of corresponding metadata expressed in the XML syntax. When a query is submitted to the system, a target meta-document is provided by specifying the searched metadata and one or more matching meta-documents are returned as a result by the replying peers. In particular, a reply is represented by a meta-document describing an existing resource whose metadata match the target meta-document. In order to limit a sudden burst of replies to the requesting peer, all matching meta-documents are routed back, following the reverse path of the query. After the discovery of the matching meta-documents available in the network, the requesting peer can acquire the corresponding resources through a traditional HTTP connection.

The Seers search protocol is based on three reference policies: the matching policy, the transmission policy, and the life cycle policy. The matching policy defines how tags in different meta-documents should be matched against each other. Each peer uses the matching policy to compare an incoming target meta-document with its own meta-documents. As a result of the matching, a score is assigned to each compared meta-document. The documents are then ranked according to their score. The transmission policy describes how a query with a target meta-document should be propagated in the network. In particular, each peer associates a ranking value to its neighbors. For each neighbor, the ranking value is computed on the basis of i) the neighbor state (i.e., active, inactive, stale), and ii) the neighbor interests that are discovered by observing the meta-documents routed by the neighbor. In the transmission policy, the scope of the query (i.e., host-only, local, global) is also defined in order to specify how/when the query has to be forwarded to its neighbors by a receiving peer. Finally, each peer defines a life cycle policy to set the constraints for storing and forwarding meta-documents (i.e., validity limitations, cacheability).

The Seers search protocol has been refined in Borch [2005] where the Socialized.Net infrastructure is presented to support a mechanism for query forwarding based on preferences and reputation. A social metaphor based on human gossiping is considered as a reference model that is followed by each peer of the network to identify the reliable nodes that can be selected for sharing interaction and to improve semantic routing efficiency. In the Socialized.Net infrastructure, when peers are selected for query forwarding, neighboring nodes are ranked by combining the traditional Seers measures (i.e., state, interests) with preference and reputation information. In particular, preference is a local rating of neighbor nodes based on statistics and user input, while reputation is an asynchronous gossip between neighbor nodes regarding their behavior. For a remote peer r, the preference value computed by a peer l is defined as a combination of i) the ratio indicating the number of replies provided by r to any query submitted to the system (ContributionRatio), ii) the ratio indicating the number of common interests between r and l measured by observing the number of queries/replies exchanged between them (IGot, YouGot ratio), iii) the ratio indicating the connectivity of r measured by comparing the number of messages destined to r with the total number of messages forwarded to r (RelayRatio), and iv) the ratio indicating the reliability of r measured by comparing the number of relevant replies provided by r with the total number of replies provided by r (BogusReplyRatio). Preference values are asynchronously exchanged among neighbor nodes in order to compute reputation. For a remote peer r, the reputation value computed by a peer l is defined as a combination of the local preference value with the preference values received from the neighbor peers.
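The text above states that preference and reputation are combinations of several values but does not give the combination rule. The sketch below therefore uses a plain weighted average purely as a placeholder: the function names, the equal default weights, the `local_weight` parameter, and the assumption that all four ratios are normalized to [0,1] with higher meaning better are hypothetical and not taken from Socialized.Net.

```python
def preference(contribution, shared_interest, relay, reliability,
               weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four ratios observed for a remote peer r into one value.

    All inputs are assumed to lie in [0,1], with higher values being better.
    """
    ratios = (contribution, shared_interest, relay, reliability)
    return sum(w * r for w, r in zip(weights, ratios))


def reputation(local_preference, reported_preferences, local_weight=0.5):
    """Blend the local preference with the preferences gossiped by neighbors."""
    if not reported_preferences:
        # No gossip received yet: fall back to the local view only.
        return local_preference
    gossip = sum(reported_preferences) / len(reported_preferences)
    return local_weight * local_preference + (1 - local_weight) * gossip
```

With these placeholder weights, a peer l that computes a preference of 0.75 for r and hears 0.25 and 0.75 from two neighbors ends up with a reputation of 0.625 for r.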


Considerations. We note that the use of a social metaphor based on human gossiping for defining the Socialized.Net infrastructure enforces an adaptive query routing protocol. In particular, the proposed Seers search mechanism is based on a soft learning mode where logical connections are established according to the observed peer behavior in query replies and where active advertising methods are limited to reputation information exchange. In this respect, we observe that the use of soft or completely passive learning modes should be preferred for improving semantic routing efficiency, as they contribute to reduce the traffic overhead introduced by peers for measuring remote peer expertise. However, the main concern with Socialized.Net is that this system only supports metadata-based peer content representation and thus traditional string-based matching functionalities that are suited to work in file-sharing P2P systems. In order to address knowledge-sharing P2P interactions, the Socialized.Net framework should be improved to provide semantic-based matching techniques capable of dealing with more expressive knowledge representation models (e.g., ontologies).

2.2.4 P2P semantic link networks

A P2P Semantic Link Network (P2PSLN) is a directed network, where nodes are peers or P2PSLNs, and edges are typed semantic links specifying semantic relationships between peers [Zhuge et al., 2004]. Different types of semantic links can be established between two nodes (i.e., equal-to, similar-to, reference, implication, subtype, sequential, empty, and null) according to a set of pre-defined reasoning rules. Each peer defines its own XML Schema (source schema) describing the contents to share and adopts SOAP-based messages to communicate with the other members of the network. When a peer joins a P2PSLN, it first identifies the semantic relationship between itself and the nodes in the network. Moreover, the entering peer acquires the XML Schemas of the identified nodes in the network (target schemas) and analyzes them recursively in depth-first order with the intention to identify the semantic mappings (i.e., semantic node mapping, semantic clique mapping, and semantic path mapping) between the elements in its source schema and the elements in the incoming target schemas. The similarity between the source and the target schemas can be measured by the methods of cycle analysis and functional dependency analysis as proposed in Aberer et al. [2003b].

Upon receiving a query, a peer autonomously forwards the request to relevant peers according to the types of the semantic links as well as the similarity between elements and structures of peer schemas. QoP (Quality of Peer) techniques, such as number of returned results, response time, traffic overhead, precision, and recall, are considered to manage inconsistent data in returned data flows. The P2PSLN approach is compared with the Breadth First Search (BFS) and the Random Walk Search (RWS) by means of simulation techniques. The experiment illustrates that the BFS routing policy achieves the highest recall rate, but P2PSLN can ensure the best performance in terms of generated traffic.

Considerations. We note that the adoption of reasoning-based semantic mappings among nodes constitutes a first step towards an actual semantic query routing protocol. However, the main problem of P2PSLN is that the semantic overlay of semantic mappings is statically defined with the gradual entrance of peers in the system. In particular, an entering node defines semantic mappings only with the initially discovered peers. Such an approach has two main drawbacks. First, the network topology does not change according to the peer contents. This way, the number and the relevance of the semantic mappings can be poor when weakly similar peers are neighbors, thus affecting the query routing capabilities of the entire network. Furthermore, a P2PSLN lacks flexibility, as semantic mappings need to be recomputed when changes occur in peer source schemas. In this respect, moving P2PSLN towards an adaptive approach for peer organization could contribute to increase flexibility by allowing nodes to get closer according to their content similarity.

2.2.5 The intelligent search mechanism

The Intelligent search mechanism (ISM) is defined in Zeinalipour-Yazti et al. [2005] as a novel mechanism for information retrieval in P2P networks with the aim to improve traditional techniques by efficiently finding the most relevant answers to a given query rather than the largest number of answers. The underlying idea of ISM is that a peer that has a document relevant to a given query is also likely to have other documents that are relevant to other similar queries. In this respect, for each query in an ISM-based system, a peer exploits the replies to past queries, estimates which of the known peers are more likely to return relevant results to the current query, and propagates the request to those peers only. ISM techniques are entirely distributed and each node can make local and autonomous decisions without coordinating with any other peers and with no centralized authority. In an ISM-based system, each node maintains a collection of documents (text, audio, video) that it is willing to share with the other members of the system. Moreover, for each document in a node, a set of metadata (e.g., author, title) is maintained along with a number of associated keywords. Searches occur when a node submits to the ISM-based system a query containing one or more target keywords of interest. Each receiving node matches the target keywords in the query with the keywords associated to its own documents and replies by returning those documents that are associated to at least one matching keyword. In order to match queries and document keywords, string-based query similarity metrics can be used, such as the cosine similarity function [Baeza-Yates and Ribeiro-Neto, 1999]. The cosine metric has been extensively used in information retrieval for the similarity evaluation of two vectors (i.e., lists) of terms. Such a metric exploits string-based matching when comparing the query keywords with the keywords of a document, and computes the corresponding similarity evaluation in the range [0,1]. The evaluation results are higher when the number of matching (i.e., identical) elements in the two lists of terms is higher.
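A minimal sketch of the cosine metric on two keyword lists is given below. It treats each list as a vector of term counts and matches terms by exact string equality, as described above; the function name and the handling of empty lists are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two term-count vectors, in the range [0,1]."""
    a, b = Counter(terms_a), Counter(terms_b)
    # Only identical terms contribute to the dot product (string-based matching).
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty keyword list is treated as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Two lists with no term in common score 0, identical lists score 1 (up to floating-point rounding), and two two-keyword lists sharing one keyword score 0.5.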

The intelligent search mechanism for distributed information retrieval consists of four components: the profiling structure, the query similarity function, the RelevanceRank function, and the search mechanism. The profiling structure contains the list of the most recent past queries and, for each query, the list of peers that provided an answer as well as the number of hits returned. In particular, each peer stores in its profiling structure only the answer observations related to the queries that were routed through it. Thus, profiling structures represent an element of locality and provide indications regarding the contents of the neighboring peers. In order to limit the size of the profiling structure, a maximum size limit can be used in combination with a Least Recently Used policy. When the recipients of a given query q1 need to be selected, the query q1 is compared with the queries stored in the profiling structure by means of the query similarity function. Since queries are lists of keywords, traditional similarity metrics can be adopted as query similarity function, such as the cosine similarity [Baeza-Yates and Ribeiro-Neto, 1999]. The choice of the query similarity function is orthogonal to the rest of the ISM techniques; that is, any other numeric similarity metric can be adopted in place of the cosine metric (e.g., Jaccard coefficient, Dice coefficient [Baeza-Yates and Ribeiro-Neto, 1999]). The results of the query similarity function are passed to the RelevanceRank function (RR) together with the number of hits provided by known peers to the past queries. For a given query q1, RR produces a peer ranking as a result, where peers that replied with many relevant results to queries similar to q1 are favored. Finally, the top-k ranked peers are selected as q1 recipients and the query is submitted to them by the search mechanism component. The k parameter is set by the requesting node and indicates the number of query messages to submit to the system. A TTL parameter is associated to the query q1, as occurs for query propagation in the Gnutella protocol [Gnutella Protocol]. On receiving the query q1, a node evaluates whether it can provide relevant documents to the requesting node. Moreover, the query q1 is also forwarded to the known peers by performing the ISM algorithm, provided that the TTL has not expired. Query results are returned to the requesting node on the reverse path. Thus, peers on such a path can sniff the reply to further populate their profiling structure. As a further optimization, a small random subset of peers is added to the set of top-k ranked peers for each query. As a result, ISM explores a larger part of the network with the aim to provide additional results and to include undiscovered peers in the routing mechanism.
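The ranking step can be sketched as follows. The aggregation used here, a sum over past queries of the query similarity raised to a power `alpha` and multiplied by the number of hits, is an assumed form that merely reflects the description above (peers with many hits on similar queries are favored); the profile layout, the `alpha` parameter, and the use of the Jaccard coefficient as query similarity are illustrative choices.

```python
def jaccard(a, b):
    """Jaccard coefficient of two keyword lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance_rank(profile, query, k, alpha=1.0, qsim=jaccard):
    """Return the k peers expected to answer `query` best.

    profile: iterable of (past_query_keywords, peer, hits) observations.
    """
    scores = {}
    for past_query, peer, hits in profile:
        # Hits on past queries count more when the past query resembles this one.
        weight = qsim(past_query, query) ** alpha
        scores[peer] = scores.get(peer, 0.0) + weight * hits
    # Sort by decreasing score; ties are broken by peer identifier.
    ranked = sorted(scores, key=lambda peer: (-scores[peer], peer))
    return ranked[:k]
```

In this sketch a peer that returned many hits only for unrelated queries scores 0 and is ranked after any peer with some hits on similar queries. The random subset of additional peers mentioned above is not modeled.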

Considerations. We note that ISM has been specifically conceived for P2P information retrieval and thus it is suited for document- and file-sharing rather than for knowledge-sharing. In particular, a lack of expressive power characterizes the ISM approach due to the keyword-based mechanism adopted for metadata representation and to the string-based matching techniques. However, the idea to use past interactions to rank known peers and to train the behavior of the routing protocol can have an interesting impact also on more expressive knowledge-sharing infrastructures. In particular, the element of locality introduced with the profiling structure should be taken into account to improve the scalability of the routing mechanism.

2.2.6 Hierarchical semantic routing for Grid resource discovery

In Li and Vuong [2005], a hierarchical semantic routing algorithm for grid networks is proposed with the aim of exploiting query content and peer knowledge to drive routing decisions. In particular, the nodes of the network are grouped into independent clusters that are organized in tree-based structures. In such an approach, each node encodes its available resources and provides the corresponding RDF metadata representation. The adoption of RDF as resource representation language implies that the node metadata repositories can be queried by relying on existing RDF query languages (e.g., SPARQL [Prudhommeaux and Seaborne (eds.), 2006], RDQL [Seaborne, 2004]). For semantic routing purposes, nodes periodically exchange their respective RDF metadata and, thus, a concise representation is required. To this end, each node summarizes the verbose RDF syntax used for resource representation in a space-efficient bitmap according to the hashing-based Bloom filter technique [Bloom, 1970]. It is important to stress that the Bloom filter technique introduces a certain degree of approximation in the node resource representation.

The Bloom filter summarizations are exploited to aggregate nodes in tree-based clusters by relying on the Prinkey method [Prinkey, 2001]. In such a method, a node joining the system is inserted in a tree structure and provides to its parent a summarization (i.e., bitmap) of its RDF metadata. Each non-leaf node n maintains a routing table containing a number of Bloom filter bitmaps: one bitmap for its local resources and one bitmap for each direct child. The node n merges its own bitmap with the child bitmaps through logical OR and produces a new comprehensive bitmap as a result. The merged bitmap is sent to the parent of the node n. This way, each non-leaf node n in the tree maintains a comprehensive bitmap describing the RDF metadata of all the nodes in the branch rooted at n. Consequently, the cluster root provides a summarized view of all the knowledge stored in the cluster.
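The bitmap summarization and the OR-based merge can be sketched as follows. This is a generic Bloom filter over plain string terms, not the encoding used by the authors: the bitmap size `M`, the number of hash functions `K`, the SHA-256 based hashing, and all function names are assumptions.

```python
import hashlib

M = 64  # bitmap size in bits (hypothetical value)
K = 3   # number of hash functions (hypothetical value)


def _positions(term):
    """Yield the K bit positions associated with a term."""
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M


def bitmap(terms):
    """Summarize a collection of terms as an integer used as a bit vector."""
    bits = 0
    for term in terms:
        for pos in _positions(term):
            bits |= 1 << pos
    return bits


def might_contain(bits, term):
    """True if the term may be summarized in the bitmap (false positives possible)."""
    return all((bits >> pos) & 1 for pos in _positions(term))


def merge(own, children):
    """Bitmap a non-leaf node sends to its parent: logical OR of own and child bitmaps."""
    merged = own
    for child in children:
        merged |= child
    return merged
```

Because the merge is a bitwise OR, the merged bitmap is identical to the one obtained by summarizing all the terms of the branch at once, and a term present anywhere in the branch is never missed.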

Bitmaps are exploited for supporting intra- and inter-cluster semantic routing algorithms by forwarding queries only to the nodes lying on branches that can potentially satisfy the query. When a node receives an RDF query, it checks its routing table. If a matching result is found in its local bitmap, the query is processed against the local RDF metadata of the node and the results are sent to the requesting node. If a matching result is found in the bitmap of a child, the query is forwarded to that child. If no matching results are found and the query has not been received from the node parent, the query is forwarded to the parent. Otherwise, the query is stopped. In order to enforce inter-cluster resource sharing, the resource-distance-vector (RDV) routing algorithm is defined. In RDV, the roots of the clusters are interlinked and define an overlay network where the neighbors of a given root node are the roots of the other clusters that are directly connected with it. Each root maintains a resource routing table which contains local and neighbor resource vectors, that is, the bitmaps featuring the local cluster and the neighbor clusters, respectively. Furthermore, the resource vectors also record the distance (in terms of number of hops) to reach the corresponding resource. Periodically, root nodes ping their respective neighbors in order to check whether they are still alive and exchange their resource routing tables to update the distances for each resource. Each query is associated with a TTL value, called radius, that sets the maximum number of query hops. When a root node receives a query from one of its neighbors, the query is matched against its local resource vector. If a matching result is found, the query is forwarded to the child holding the matching resource. Furthermore, the query is also evaluated against the neighbor resource vectors stored in the routing table. If a matching result is found, the query is forwarded to the matching neighbor. If more than one matching neighbor is found, the query is forwarded to the neighbor that can reach the matching node with the lowest distance. A query can be forwarded over several hops until it arrives at the matching root node or the TTL expires.
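The routing decisions described above can be sketched as two small functions. Bitmaps are modeled as integers and a query as the bitmask of the positions it requires, so a bitmap matches when it covers all the query bits; this matching test, the action strings, and the function names are assumptions made for illustration.

```python
def route(query_bits, local_bits, child_bits, from_parent):
    """Intra-cluster decision of a node: list of actions for an incoming query.

    child_bits: dict child name -> bitmap summarizing the branch of that child.
    """
    def matches(bits):
        return (bits & query_bits) == query_bits

    actions = []
    if matches(local_bits):
        actions.append("answer-locally")
    actions += [f"forward-to-child:{name}"
                for name, bits in child_bits.items() if matches(bits)]
    if not actions:
        # No match below this node: go up, unless the query came from above.
        actions.append("stop" if from_parent else "forward-to-parent")
    return actions


def pick_neighbor(query_bits, neighbor_vectors):
    """RDV choice at a cluster root: matching neighbor with the lowest distance.

    neighbor_vectors: dict neighbor root -> (bitmap, distance in hops).
    Returns None when no neighbor resource vector matches the query.
    """
    matching = [(distance, name)
                for name, (bits, distance) in neighbor_vectors.items()
                if (bits & query_bits) == query_bits]
    return min(matching)[1] if matching else None
```

The radius (TTL) check is left out: in a complete sketch, a root would decrement it at each hop and drop the query when it reaches zero.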

Considerations. We note that the proposed architecture has been conceived to work in grids, but can be extended to any other open distributed system, such as P2P systems. Approximation in the Bloom filter bitmaps can lead to “false-positive” indications for routing. We observe that this feature is acceptable for content-based routing in P2P systems, and in open distributed systems in general, where the efficiency requirement can tolerate a certain degree of inaccuracy. In the proposed routing approach, the RDF language is adopted for expressive resource and query representation and for semantically addressing query propagation. However, the expressiveness of the RDF model is only partially exploited for semantic routing purposes, as the Bloom filter bitmaps provide an approximated view of the node resources. Experimental data are provided to show that this approach can contribute to reducing both the network and the peer load, thus providing greater scalability. However, details regarding the RDF peer repositories adopted in the experiments are not specified, and more extensive experimentation is needed to assess whether the approach can effectively work in different scenarios, especially with sparse overlapping in RDF statement distribution. The main concern regards the fact that the authors implicitly assume a certain degree of overlapping (i.e., 35% in the provided example) in the terminology of the different peers. The routing algorithm based on Bloom filter bitmaps uses such an overlapping to detect matchings, and thus query paths, through string-based comparisons. Ontology matching techniques are required to really exploit the expressiveness of the RDF language and to introduce bitmap comparisons based on similarity, with the aim of overcoming the sparse overlapping that typically emerges in real systems. Furthermore, another concern is related to the random construction of the clusters. Nodes are topologically aggregated and no guarantees are provided to ensure that nodes with similar contents are in the same cluster. Similarity-based techniques are required to generate content-based clusters (i.e., communities). This way, most of the queries could be processed at the intra-cluster level, thus lightening the load on the more expensive inter-cluster routing algorithm.
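To make the remark on approximation and string-based comparison concrete, a minimal Bloom filter can be sketched as below. It is a generic textbook construction, not the bitmap encoding of the reviewed system; the sizes, the salted SHA-256 hashing, and the `ex:` terms are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bitmap summarizing a set of strings (e.g., RDF terms)."""

    def __init__(self, size_bits: int = 256, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

Membership is decided on the exact string only, so a peer that stores `ex:Author` gives no hint that it could answer a query phrased with a synonym. This is the sparse-overlap limitation discussed above.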

2.2.7 Query routing through semantic mappings

The routing by mapping mechanism is proposed in Mandreoli et al. [2006] to address semantic query propagation in a Peer Data Management System (PDMS) [Halevy et al., 2004]. The approach is based on the idea of enhancing the sharing functionalities of a peer-based system by combining the semantic richness of PDMSs with the routing capabilities of pure P2P architectures. To this end, each peer is defined as an autonomous and independent agent and provides an OWL ontology describing the informative contents of its sources. Two adjacent peers (i.e., neighbors) are logically connected through a set of semantic mappings that keeps track of the semantic affinity among the ontology concepts of the two peers. As a result, the system is organized as an overlay network where two nodes are connected through a set of semantic correspondences between concepts. Furthermore, semantically rich query models are adopted as in traditional PDMSs. In particular, a query is specified by a requesting peer p in terms of its relation terminology and is submitted to the other peers, which evaluate whether they can reply with relevant knowledge. A receiving peer q evaluates the query by rewriting it according to the semantic mappings between the ontologies of p and q and by executing the rewritten query on its own ontology. When a peer joins the network, it has to discover the semantic mappings between its ontology and the ontologies of its neighbor peers by performing appropriate schema matching operations. In particular, the matching operations are based on terminological and structural comparisons of the elements of the two ontologies and produce a set of semantic mappings between concepts as a result. Thus, each concept in the ontology of the joining peer is associated with the most similar concepts in the neighbor schemas. Furthermore, each mapping is characterized by a numerical score in the range [0,1] that quantifies the level of semantic similarity of the two concepts connected by the mapping. The mapping score is computed according to a fuzzy-based similarity function and represents the level of affinity of the two concepts.

The routing by mapping mechanism is based on Semantic Routing Index (SRI) data structures. The SRI is local to each peer and consists of a matrix where columns are the concepts of the local ontology and rows contain the neighbor peers and a summarized indication of their capability to provide relevant results for each local concept. For each neighbor peer q and for each local concept c, the SRI value is computed by combining the scores of the semantic mappings between c and the concepts in the ontology of peer q with an aggregated measure of the scores that peer q maintains with its respective neighbors. This way, each peer becomes aware of the approximated routing capabilities that are available through its neighbors. The SRI values are periodically updated to align routing information with changes in peer ontologies and peer join/leave operations. As each peer computes its SRI by exploiting its neighborhood within a one-hop radius, the information about mappings is propagated throughout the whole system and each peer can learn about all other peers without being directly connected to or interacting with them. When a query needs to be submitted/forwarded, the local SRI is exploited and the request is sent to the neighbors having the highest routing values for the required concepts, as they have the highest likelihood of returning relevant results. Query forwarding stops according to a TTL-based mechanism.
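A rough sketch of how an SRI could drive neighbor selection is given below. The combination formula in `sri_value` is an assumption made only for illustration (the text above does not specify the actual function of Mandreoli et al.), and the names `sri_value` and `select_neighbors` are hypothetical.

```python
from typing import Dict, List


def sri_value(mapping_scores: List[float], neighbor_aggregate: float) -> float:
    """Combine direct mapping scores with the neighbor's own aggregate.

    Assumed formula: the best direct score in [0,1], scaled by the average of
    1 and the neighbor's aggregated score in [0,1]. The result stays in [0,1].
    """
    direct = max(mapping_scores, default=0.0)
    return direct * (1.0 + neighbor_aggregate) / 2.0


def select_neighbors(sri: Dict[str, Dict[str, float]],
                     concept: str, k: int) -> List[str]:
    """Return the k neighbors with the highest SRI value for `concept`.

    sri maps a neighbor (row) to {local concept (column): routing value}.
    """
    ranked = sorted(sri, key=lambda q: (-sri[q].get(concept, 0.0), q))
    return ranked[:k]
```

With this assumed formula, a neighbor with a slightly weaker direct mapping can still be preferred when it reports good mappings further down the network.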

Considerations. We note that the routing by mapping approach actually defines a semantic-based infrastructure for enforcing query routing through the P2P network. However, mapping definition is an expensive operation in terms of computation time, especially when the peer ontologies are large. A joining peer may be willing to tolerate such a computation effort only if it expects a stable presence in the network. This condition can rule out very dynamic peers. In this respect, the semantic affinity between ontologies of different peers should be computed on the fly in order to preserve the highly dynamic nature of P2P systems. Moreover, the effectiveness of the routing by mapping mechanism highly depends on the level of similarity existing between the ontology of a joining peer and the ontologies of the peers found as neighbors, otherwise mappings and scores risk being useless. As a consequence, SRI values, and thus query routing paths, are affected by the locality of one-hop mapping aggregation, and distant multi-hop nodes are weakly considered for routing unless all peers are quite homogeneous in terms of ontology contents. Furthermore, SRI tables react slowly to changes in topology and ontologies. The semantic overlay network should be adaptive and should be devised to bring peers with similar contents closer in terms of hops.

2.2.8 Other interesting approaches

The idea of combining the self-organization of unstructured P2P systems with the efficiency of DHT-like infrastructures represents the basic motivation of P-Grid [Aberer et al., 2003a], a peer-to-peer lookup system based on a virtual distributed search tree. As in traditional DHT systems, the virtual tree covers the search space and each peer holds part of the overall space (i.e., a tree branch) through a hash-based index. A bit-string mechanism is provided to summarize the overall information for the tree branch the peer is responsible for. The distinguishing feature of P-Grid is that the peer position within the tree is dynamically determined through negotiation with other peers. This way, the network topology autonomously reacts to peer changes for providing query adaptation, thus improving search efficiency. As for many other structured approaches, the main concern with P-Grid is related to the traffic overhead that is required for maintaining the tree structure and for dynamically determining the peer position.

In Sidirourgos et al. [2005], an original DHT-based framework is presented for providing efficient routing of expressive RDF/S queries. In particular, the approach is organized in a classical schema-based infrastructure where both queries and peer knowledge are described through RDF/S statements. As a novel contribution, an original encoding of RDF/S schema fragments is introduced for checking whether a peer schema is subsumed by a query. This way, the system succeeds in recognizing RDF/S schemas with similar semantics, and the associated peers are placed in the DHT topology accordingly. Experimental results show that the main benefits of the approach are obtained in terms of scalability. Conversely, flexibility represents the main drawback, since only RDF/S resource descriptions are supported.

As another example of a P2P semantic routing approach, Neurogrid is proposed in Joseph [2002] as an adaptive and decentralized search system. In such an approach, semantic routing is intended as content-based query forwarding, and a learning mechanism is defined to dynamically adjust the relevance of known peers for each query. In Neurogrid, each node maintains a knowledge base that contains associations between keywords and other nodes. Queries are then forwarded to the nodes that may store matching documents according to the stored relevance measures. Neurogrid is a flexible discovery overlay that can be extended to support expressive resource representations, like ontologies, and related semantic-based matching techniques.

2.3 Peer community formation and consensus negotiation techniques

Consensus negotiation in P2P systems is the capability to identify peers with similar interests and to build agreements, namely communities, among them. The idea of peer community is not explicitly supported in all the existing P2P systems. Furthermore, a widely accepted definition of peer community has not yet been agreed upon, and only basic solutions are currently available.

2.3.1 Semantic overlay networks for P2P systems

Semantic overlay networks (SONs) are proposed in Crespo and Garcia-Molina [2004] as a flexible network organization with the aim of improving the query performance of traditional P2P systems. In particular, a SON-based system is defined on top of a traditional P2P architecture by replacing the traditional membership management primitives with SON-based operations that allow peers to be organized according to their contents. With SONs, the idea is that nodes with semantically similar content are clustered together, and queries are processed by identifying which SON(s) are best suited to answer them. The system is based on a classification hierarchy (i.e., a tree of concepts) that is shared by all the nodes of the system and that is used to define the SONs. In particular, each SON is labeled with a concept of the classification hierarchy. When a peer p joins the network, a classifier function C and a matching function M are used by p to associate the documents it provides for sharing with the most similar concept in the hierarchy. According to the classification results, the peer p joins the SON(s) whose labeling concept is similar to the highest number of documents provided by p. The classifier function C is defined by relying on text matching techniques [Baeza-Yates and Ribeiro-Neto, 1999], Bayesian networks [Rich and Knight, 1991], and clustering algorithms [Witten and Frank, 1999], while traditional string-matching techniques can be adopted to implement the M function.

When a query is issued by a requesting peer, the matching function M is used to determine the most similar SONs, which are selected as query recipients. Query propagation at the intra-/inter-SON level is performed by exploiting the query propagation mechanism of the underlying P2P infrastructure. Such a mechanism has to ensure that the query is forwarded only to the members of the SONs selected as recipients. A number of experiments are provided to illustrate the benefits of the SON-based approach with respect to traditional P2P query performance. In particular, the effectiveness of the SON-based approach is highlighted when homogeneous documents are provided by each peer. In such a case, the classification hierarchy can be chosen to ensure that each peer joins a small number of SONs. A further requirement for improving the benefits of the SON-based approach is to select an efficient and easy-to-implement classifier function.
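The join step described above can be illustrated with a toy example. The keyword-overlap classifier below is only a stand-in for the classifier function C (which, according to the text, relies on text matching, Bayesian networks, and clustering); the concept names and keyword sets are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def classify(document_terms: Set[str],
             concept_keywords: Dict[str, Set[str]]) -> Optional[str]:
    """Toy classifier: the concept sharing the most keywords with the document."""
    best, best_overlap = None, 0
    for concept, keywords in concept_keywords.items():
        overlap = len(keywords & document_terms)
        if overlap > best_overlap:
            best, best_overlap = concept, overlap
    return best  # None when the document matches no concept


def sons_to_join(documents: List[Set[str]],
                 concept_keywords: Dict[str, Set[str]]) -> List[str]:
    """Return the SON(s) whose concept covers the highest number of documents."""
    labels = (classify(doc, concept_keywords) for doc in documents)
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(concept for concept, n in counts.items() if n == top)
```

A peer with homogeneous documents ends up in a single SON, which matches the case where the approach is reported to be most effective.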

Considerations. We note that the approach has been defined to improve the performance of traditional file-sharing P2P systems and only deals with node classifications based on text matching techniques. The approach is characterized by the assumption of a shared classification taxonomy. On one side, the shared taxonomy contributes to reducing the complexity of the node classification task by decreasing the overhead due to SON membership management. On the other side, the approach offers poor flexibility. In particular, taxonomy updates require node reclassification (i.e., SON redefinition), which is expensive in terms of traffic load. Furthermore, moving the approach from a taxonomy-based model to a more expressive ontology-based model is hardly feasible due to the open challenges in maintaining and aligning a shared ontology within a pure P2P system.

2.3.2 Formation of P2P communities through escalation

In Khambatti et al. [2002], the notion of peer community is introduced as a generalization of peer group involving peers that are actively engaged in sharing, communicating, and promoting common interests. Peer communities are formed according to string-based interests that are called attributes. Attributes are used to determine the peer communities in which a particular peer would participate. The full set of attributes of a peer defines its personal attributes, and the subset of personal attributes that is explicitly claimed as public determines the claimed attributes. Moreover, group attributes are defined according to peer location and affiliation and are needed to form a physical basis for communities. Each peer belongs to one pre-defined group and a group attribute is used to identify group membership.

A peer adopts an escalation technique to advertise its interests through claimed attributes. For each claimed attribute, a peer measures the number of nodes claiming the same attribute after at most one indirection (outlink weight). In the same manner, a peer measures the number of incoming connections that arrive directly from other nodes claiming the same attribute (inlink weight). The constraint of at most one indirection is required to limit the message overhead due to advertisement. Simulation results show that probing more than two indirections does not significantly increase the average number of claimed attributes discovered by a peer. The outlink weight combined with a threshold-based mechanism is adopted to define the peer membership in communities. If the outlink weight exceeds the specified threshold for a claimed attribute, the peer becomes a member of the community indicated by the claimed attribute. Moreover, outlink and inlink weights are exploited to evaluate the involvement and the responsibility of a peer within a given community. Finally, a node may belong to many different communities and communities may overlap.
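The threshold-based membership rule can be sketched as follows. The sketch assumes that "at most one indirection" means nodes reachable through direct out-links or through one intermediate node, and it models the overlay as a dictionary of out-links; both choices, as well as the function names, are assumptions for illustration.

```python
from typing import Dict, List, Set


def outlink_weight(peer: str, attribute: str,
                   out_links: Dict[str, List[str]],
                   claimed: Dict[str, Set[str]]) -> int:
    """Count nodes claiming `attribute` within at most one indirection."""
    direct = set(out_links.get(peer, ()))
    reachable = set(direct)
    for neighbor in direct:
        reachable.update(out_links.get(neighbor, ()))
    reachable.discard(peer)  # a peer does not count itself
    return sum(1 for node in reachable if attribute in claimed.get(node, set()))


def communities_of(peer: str, out_links: Dict[str, List[str]],
                   claimed: Dict[str, Set[str]], threshold: int) -> Set[str]:
    """Claimed attributes whose outlink weight exceeds the threshold."""
    return {
        attribute
        for attribute in claimed.get(peer, set())
        if outlink_weight(peer, attribute, out_links, claimed) > threshold
    }
```

Lowering `threshold` lets communities form around interests shared by only a few peers, while raising it requires many peers claiming the same attribute.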

As described in Khambatti et al. [2003], communities of peers can be used to realize an efficient query propagation strategy in a populated P2P space. To this end, a push-pull gossiping technique is defined on top of an overlay network based on oriented links. A peer x can define a link with a peer a if i) the peer a is a special peer chosen by the domain of P2P links; ii) the peer a is a peer trusted by peer x; iii) the peer a is a well-known peer that belongs to many communities that peer x is interested in. Peers are classified on the basis of their involvement with respect to communities. A peer is called a seer for a community if it has a high value of involvement for that community due to its links (direct or indirect) to other peers claiming the attribute(s) featuring the community. In order to discover the seers of a given community c, an initiating peer defines a vector containing its level of involvement with respect to c and sends it to every peer in its neighborhood claiming the c attribute(s). Every peer receiving the vector appends its information to the vector and forwards it by applying the same criterion. An end message is generated by a peer when no neighbors are available for vector forwarding. In this case, the final vector is returned to the initiating peer. By collecting end messages, the initiating peer defines a set of seers for the community by picking the peers with the highest level of involvement. Information to share within the community is sent to the initiator. The initiator multicasts the received information to the identified seers (push phase) in order to ensure the maximum level of advertisement. Experiments by simulation are also provided to confirm that more than 80% of community members have a seer in their neighborhood when seers represent 10% of the community participants and links form a small-world network. Furthermore, experimental results show that all the community members can reach at least one seer when one degree of indirection is admitted. Thus, a peer queries the seer within its direct/indirect neighborhood (pull phase) in order to obtain the searched information. The presented approach is suitable for intra-community query propagation. In case of inter-community searching, communities should be chosen by relying on attribute matching rather than on seers.
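The final step of seer discovery, in which the initiating peer picks the most involved peers from the collected vector, might look like the sketch below. The 10% default mirrors the proportion used in the reported simulations; treating it as a selection parameter, and the name `pick_seers`, are assumptions.

```python
import math
from typing import Dict, List


def pick_seers(involvement: Dict[str, float], fraction: float = 0.1) -> List[str]:
    """Select the peers with the highest involvement as seers.

    involvement maps each peer in the collected vector to its involvement
    level for the community. At least one seer is returned if any peer exists.
    """
    count = max(1, math.ceil(len(involvement) * fraction))
    # Highest involvement first; ties are broken by peer name.
    ranked = sorted(involvement, key=lambda peer: (-involvement[peer], peer))
    return ranked[:count]
```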

Considerations. We note that the threshold-based community formation represents a flexible mechanism that supports the creation of different kinds of P2P communities. For instance, if the threshold is low, a community is formed even when only a few peers share a common interest. Conversely, a higher threshold implies that peer communities are created only when many peers sharing a common interest are available. However, a high traffic overhead is required for attribute advertisement and seer discovery. This is due to the proactive approach that is adopted to ensure decentralization and self-organization. A passive approach is preferable in distributed systems, like P2P networks, where information regarding implicit communities can be derived by analyzing the regular messages and query/reply flow.

2.3.3 Communities of peers by trust and reputation

In Agostini and Moro [2004], a reputation-based approach for semantic query routing in P2P networks is presented. Each peer classifies the documents to share in order to define its local knowledge, which is organized in structures, each called a context. A context is a concept hierarchy 〈C,E〉 such that (i) each node in C is labeled by a tag from the language LC, (ii) each node is a document from a set D, and (iii) E induces a tree structure over C. In other words, a document is positioned in a context according to a set of keywords featuring the document. One document in a context domain can be chosen by a peer (seeker) as the query and becomes the query cluster. The path in the query context between the query cluster and the cluster root is called the focus of the query.

The query is sent by the seeker to a set of target providers. Each target provider adopts both syntactic and structural matching techniques for comparing the query with its local knowledge in order to identify possible common features. The providers send back to the seeker the matching documents, or a suitable representation of them. The seeker defines the list of target providers by selecting the peers that are able to solve the query with the highest probability, or that make the most progress toward solving the query. In this respect, the seeker associates with each known peer a trust value computed on the basis of its reputation. Target providers are then selected according to decreasing trust values. For a given query q (i.e., a document), the reputation of a peer p computed by a peer s is the expectation of peer s about the future behavior of peer p in providing relevant documents for q. Such a reputation is measured on the basis of past interactions between s and p that concern queries related to q. The cluster of peers capable of providing relevant documents with respect to a given query forms a community of peers with similar semantics. The P2P network is self-organizing, and the network clustering is not imposed, but rather gradually emerges by means of point-to-point interaction among the peers with the highest reputation. As a result, communities of peers can be used to enhance the effectiveness of semantic routing. At time t ∈ N, the seeker s first looks at community Ω(Q)t and randomly chooses one peer in the community for sending the query q. Let p be the chosen peer. Then, p either solves the query directly, if it is able to do so, or forwards the query q to a third-party peer p′ in the community Ω(Q)t. Similarly, the propagation of q runs until either q is solved or there are no new members for query forwarding. In the former case, the solution is back-propagated to s through all the peers concerned (propagation chain). The reputation of the peers in the chain is updated to foster winning associations in future interactions.
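The interplay between trust-ordered provider selection and reputation updates along the propagation chain can be sketched as follows. The update rule (an exponential moving average toward a reward) is an assumption made for illustration, since the text above does not give the formula used by Agostini and Moro; the function names are hypothetical.

```python
from typing import Dict, List


def select_providers(trust: Dict[str, float], k: int) -> List[str]:
    """Return the k known peers with the highest trust, in decreasing order."""
    return sorted(trust, key=lambda peer: (-trust[peer], peer))[:k]


def reinforce(trust: Dict[str, float], chain: List[str],
              reward: float = 1.0, rate: float = 0.2) -> None:
    """Move the trust of every peer in a successful chain toward `reward`.

    Assumed rule: new = old + rate * (reward - old), updated in place.
    Peers not seen before start from a trust of 0.0.
    """
    for peer in chain:
        old = trust.get(peer, 0.0)
        trust[peer] = old + rate * (reward - old)
```

Under this rule, peers that repeatedly take part in successful chains rise in the ranking over time, so a community gradually emerges from local interactions without any extra advertisement traffic.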

Considerations. The presented approach is based on trust and reputation for training the behavior of the query routing process and for detecting semantic communities of peers. In this respect, the main contribution is related to the fact that the approach succeeds in providing semantic routing without introducing a traffic overhead for handling peer trust and reputation. As another contribution, communities are not defined as global structures, but each peer is still capable of connecting to other nodes with similar contents. However, the main concern is related to knowledge representation issues. Since a labeled document is used as a query and keyword-based matching techniques are used for comparisons, the adoption of more expressive knowledge representation languages becomes a hard task. As a result, the applicability of the approach to knowledge-sharing P2P networks is strongly limited.

2.3.4 Peer selection in P2P networks with semantic topologies

In Haase et al. [2004], the authors define a peer selection approach based on a shared ontology and an expertise advertisement mechanism. In such an approach, each peer picks out the recipients of a given query by selecting the known peers whose expertise is similar to the query subject. This way, the knowledge about the expertise of the peers defines a semantic overlay topology that is independent of the underlying network infrastructure. The system is characterized by a common ontology that provides a shared conceptualization for the domain of interest of all the system nodes. Each peer provides a semantic description of its expertise that represents the knowledge base of the peer according to the common ontology. The expertise description can be automatically extracted from the common ontology or can be specified in some other formalism. In other words, the knowledge of each peer is mapped onto a portion of the shared ontology. Each query is described by a subject that is expressed in terms of the common ontology and specifies the expertise required to answer the query.

Peers are responsible for advertising their expertise to the other peers of the network. To this end, each peer autonomously decides the recipients of its advertisements. Typically, each peer submits the advertisements to its neighbors only (i.e., directly connected nodes), but the propagation depth for the advertisement messages can be set as a parameter of the system and should be chosen to provide a tradeoff between network traffic and number of advertised nodes. Furthermore, each peer can accept/reject the advertisements received from the other nodes. In particular, advertisements are accepted only if they are found to have a semantic similarity with the expertise of the receiving peer. As a result, each peer acquires and maintains a list of peers with similar expertise. A query is defined by specifying the query subject and by submitting it to the system. A receiving peer evaluates the semantic similarity between the subject and its own knowledge base and possibly replies to the requesting peer with the discovered matching results. When a peer needs to submit/forward a query to the system, it applies the peer selection algorithm that returns a ranked list of peers. Peers are sorted according to the semantic similarity between their expertise and the query subject. The best n peers are selected as query recipients from the ranked list. As an option, all the peers above a certain threshold can be selected as query recipients instead. Queries are propagated in the network according to a traditional TTL-based mechanism. A reference scenario from the bibliographic domain is illustrated to show the main system functionalities. In particular, the ACM topic hierarchy is used as shared ontology for peer expertise definition and the RDF-based query language SeRQL [Broekstra and Kampman, 2003] is adopted for query representation. Similarity evaluation plays a key role in the proposed peer selection approach. In particular, similarity evaluation techniques are required i) to measure the expertise similarity of two peers with the aim of supporting the peer in deciding which advertisements to accept, ii) to evaluate the affinity between the subject of an incoming query and the knowledge base of a peer with the aim of assessing whether matching results can be returned, and iii) to compare the subject of an incoming query with the expertise of a peer in order to select the best query recipients for forwarding. In the reference scenario, the SF function is defined to evaluate the similarity between two ACM topics and relies on a distance-based metric. The idea is that topics which are close according to their position in the topic hierarchy are more similar than topics that have a larger distance. The SF function produces results in the range [0,1]: an increasing value indicates an increasing level of similarity.
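The peer selection step can be illustrated with a small distance-based similarity over a topic tree. The function below is not the SF function of Haase et al., whose formula is not given here; it simply assumes similarity = 1 / (1 + path length) so that values fall in (0,1]. The topic names are hypothetical and are not actual ACM labels.

```python
from typing import Dict, List


def path_to_root(topic: str, parent: Dict[str, str]) -> List[str]:
    """List of topics from `topic` up to the root of the hierarchy."""
    path = [topic]
    while topic in parent:
        topic = parent[topic]
        path.append(topic)
    return path


def topic_distance(a: str, b: str, parent: Dict[str, str]) -> int:
    """Number of edges between a and b through their lowest common ancestor."""
    depth_from_b = {t: i for i, t in enumerate(path_to_root(b, parent))}
    for steps_from_a, topic in enumerate(path_to_root(a, parent)):
        if topic in depth_from_b:
            return steps_from_a + depth_from_b[topic]
    raise ValueError("topics do not share a common ancestor")


def similarity(a: str, b: str, parent: Dict[str, str]) -> float:
    """Assumed distance-based similarity: 1.0 for identical topics."""
    return 1.0 / (1.0 + topic_distance(a, b, parent))


def select_peers(subject: str, expertise: Dict[str, List[str]],
                 parent: Dict[str, str], n: int) -> List[str]:
    """Rank peers by their best-matching expertise topic and keep the best n."""
    scored = [
        (max(similarity(subject, t, parent) for t in topics), peer)
        for peer, topics in expertise.items() if topics
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [peer for _, peer in scored[:n]]
```

Replacing the `[:n]` cut with a filter on the score would give the threshold-based variant mentioned in the text.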

Considerations. Experimental data are provided to show that the proposed peer selection approach improves the performance of a generic P2P network in terms of precision, recall, and number of messages. On the other side, we note that the use of a shared ontology is a very limiting assumption and reduces the impact of the approach in terms of flexibility due to i) the open challenges in updating the ontology according to the peer expertise modifications and ii) the restriction imposed on peers, which are required to express their expertise according to the conceptualization of the common ontology. Expertise advertisement is a proactive task and it implies a certain traffic overhead in order to keep nodes aligned on their respective expertise. In this context, the adoption of passive approaches for expertise advertisement would be preferable. In particular, query reply observations combined with an aging mechanism should be considered for the advertising task in order to reduce network traffic. Furthermore, the proposed distance-based matching techniques are tailored to the considered ACM topic taxonomy, but they are not suited to a general shared ontology. Thus, ontology matching techniques should be introduced to really exploit the expressive power of ontologies and ontology-based query languages.

2.3.5 Other interesting approaches

In Bonifacio et al. [2003], a social collaboration model is proposed as the reference basis to develop the P2P Kex platform (Knowledge Exchange System) for supporting knowledge sharing in federations of peers. In this system, knowledge is organized according to an XML-based syntax and a semantic matching algorithm is adopted to manage the different meanings provided by single peers and federations. Specific Kex implementation details are not provided. However, the model can be considered as a reference framework where the main principles for building federations of independent peers are discussed.

In Yolum and Singh [2003a], the potential uses of communities are discussed and classified into endogenous and exogenous applications. Referral networks based on a sociological metaphor are compared with bipartite communities based on link analysis in order to show the benefits of a collaborative approach for improving local performance in locating service providers. Agents (i.e., peers) adaptively select their neighbors and their query recipients by exploiting sociability and expertise information computed on previous interactions. The choices performed by agents cause communities to emerge. The social metaphor behind community formation in referral networks can be applied to P2P systems where sociability and expertise are measured by observing past sharing interactions among peers. Moreover, the proposed referral framework can be extended to support expressive knowledge representation techniques, like ontologies and ontology-based languages.

In Bloehdorn et al. [2005], communities are defined as groups within or across organizations who share a common set of information, needs, or problems. In this approach, peer interactions (i.e., queries) are exploited to discover the communities and to populate the SWRC+COIN community ontology, which describes the typical structure and the key entities of communities as well as their relationships. The community ontology can be queried in order to identify the most relevant peers with respect to a given request. The main concern with the approach is related to the use of a shared community ontology, which limits the peer autonomy in community formation.

As another interesting approach, the PlanetP system is presented in Cuenca-Acuna et al. [2003] as a content addressable subscribe service for unstructured P2P communities. The approach is based on a membership directory and on a content index that are globally replicated on each peer. Gossiping-based techniques are used by all system members to keep such shared structures up to date. The structured infrastructure of PlanetP provides efficiency in peer communications. However, the main limit of PlanetP regards the use of shared structures, which hamper the applicability of ontology-based approaches for managing communities due to the problem of aligning and evolving a shared ontology in a distributed system.

Chapter 3

Towards emergent semantics in P2P systems: critical review of the state of the art

One of the main research directions in the field of P2P systems is related to semantic interoperability management. By semantic interoperability in P2P systems we mean the capability to handle heterogeneous schemas stored in different peers, while preserving peer-to-peer interactions for supporting effective knowledge sharing and semantic collaboration at the same time.

As a first goal of this chapter, we discuss the main features of schema-based P2P networks, which are widely recognized as a promising solution for addressing semantic interoperability issues in P2P systems. Then, we introduce knowledge-sharing P2P networks and we discuss the role of emergent semantics for dealing with semantic interoperability in multi-knowledge environments. Finally, we critically review the P2P systems surveyed in Chapter 2 by positioning them according to i) architectural and structural properties, and ii) emergent semantics requirements for knowledge-sharing P2P systems.

3.1 Schema-based P2P networks

As discussed in Nejdl et al. [2003], database systems have evolved toward a higher degree of distribution, becoming a viable solution also for application to open systems, such as P2P systems. In this context, schema-based P2P networks aim at relying on the database experience in the field of data management for improving the capabilities of the existing file-sharing P2P systems. As a consequence, schema-based P2P networks represent an evolution of metadata-based P2P systems and emerge as a combination of the P2P approach with database technologies [Gribble et al., 2001]. On one side, the P2P approach is exploited to provide an efficient search infrastructure based on a structured P2P topology (e.g., [Ratnasamy et al., 2001; Schlosser et al., 2002; Stoica et al., 2003]). On the other side, database technologies are employed for introducing more expressiveness as regards peer resource description and query specification (e.g., [Nejdl et al., 2002; Aberer et al., 2003b; Halevy et al., 2004]).

Edutella [Nejdl et al., 2002], Chatty Web [Aberer et al., 2003b], and Piazza [Halevy et al., 2004] are three well-known examples of schema-based P2P networks. In this section, we summarize the building blocks that characterize the design choices of such systems, and of schema-based P2P systems in general.

3.1.1 Building blocks of schema-based P2P networks

As discussed in Nejdl et al. [2003], four building blocks characterize a schema-based P2P network, namely the schema language, the query language, the network topology, and the information integration.

Schema language. The schema language is the language that each peer adopts for providing a semantically rich description of the resources to share. In particular, the schema language has the role of semantically annotating the peer resources on the basis of a reference schema, with the aim of allowing communication and sharing among peers with different vocabularies. To this end, expressive languages, such as the Semantic Web languages RDF [Lassila and Swick (eds.), 1999], RDFS [Brickley and Guha (eds.), 2003], and OWL [Smith et al., 2004], are adopted as schema languages in P2P systems. This is due to the fact that Semantic Web languages have been conceived for the semantic annotation of Web resources, which presents a number of common features with the semantic annotation of peer resources. For instance, each resource, both on the Web and on a P2P network, needs to be uniquely identified, and the URI (Uniform Resource Identifier) mechanism provided by the Semantic Web languages can be suitably adopted to this end. Moreover, Semantic Web languages support the definition of distributed annotations regarding a given resource through the namespace mechanism. As another motivation for using Semantic Web languages in peer schema definition, we stress that RDFS and OWL are ontology-based languages and allow the definition of extensible schemas that can be evolved over time to capture the modifications occurring in peer resource specifications.

Query language. The query language is the language that is used by peers to specify search queries. In this respect, the query language has to be capable of exploiting the expressive power of the schema language to go beyond traditional keyword-based queries. In particular, the semantics of the underlying data needs to be taken into account when performing query evaluation. In the context of Semantic Web languages, different query languages have been proposed to support semantic-based query processing (e.g., RDF-QEL [Nejdl et al., 2002], TRIPLE [Sintek and Decker, 2002], RDQL [Seaborne, 2004]). The existing query languages can be distinguished according to the kind of semantic features that are considered, which affect accuracy and computation time of the overall query evaluation process. On one side, the use of a semantic-based query language contributes to improving the accuracy of the produced results. On the other side, such approaches require a computation time that grows with the number of semantic features considered. The choice of the most appropriate query language depends on the specific requirements that need to be addressed. In particular, only a limited set of semantic features can be considered when computation time is a crucial aspect, as in the case of P2P systems.

Network topology. The network topology describes the peer organization that emerges from the underlying P2P infrastructure. The network topology, and the related P2P infrastructure, has the responsibility to address P2P communications and to provide efficient query routing. Search efficiency is an open issue of P2P systems where no centralized server is defined and relevant peers are identified by flooding queries to all the available nodes. In the context of schema-based P2P networks, structured P2P infrastructures have been proposed as a possible solution to improve the efficiency of traditional P2P routing algorithms (e.g., [Ratnasamy et al., 2001; Schlosser et al., 2002]). Structured P2P topologies are based on two main assumptions: i) resources are deterministically placed in the network according to a hashing mechanism, and ii) peers are organized according to a predefined structure where query distribution can be efficiently performed (i.e., with a logarithmic number of messages). As another possible solution for improving search efficiency in schema-based P2P networks, super-peer topologies have also been proposed [Yang and Garcia-Molina, 2001, 2002]. The idea of super-peer networks is that nodes with the highest capabilities in terms of computational resources and bandwidth are responsible for performing specific tasks such as peer aggregation, query routing, and mediation (see footnote 1).
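As an illustration of the first assumption of structured topologies (deterministic, hash-based placement), the following minimal sketch assigns each resource key to the first peer found clockwise on a hash ring. The names `ring_position`, `HashRing`, and `responsible_peer`, as well as the peer identifiers and the example URN, are hypothetical and are not taken from any of the cited systems. The sketch only shows that placement is deterministic; it keeps a full local table of peers and therefore does not model the logarithmic message routing performed by real structured overlays.

```python
import bisect
import hashlib


def ring_position(name: str, bits: int = 32) -> int:
    """Map a string to a position on a ring of size 2**bits (illustrative helper)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class HashRing:
    """Toy placement table: a key is assigned to the first peer clockwise from it."""

    def __init__(self, peers):
        # Sort peers by their ring position so that lookups can use binary search.
        self._ring = sorted((ring_position(p), p) for p in peers)
        self._positions = [pos for pos, _ in self._ring]

    def responsible_peer(self, key: str) -> str:
        idx = bisect.bisect_left(self._positions, ring_position(key))
        # Wrap around to the first peer when the key falls after the last position.
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["peer-a", "peer-b", "peer-c", "peer-d"])
owner = ring.responsible_peer("urn:example:resource-42")
```

Since the owner depends only on the hash of the key and on the set of peers, any node holding the same peer list computes the same owner, independently of the order in which peers are listed.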

Information integration. Information integration refers to the capability to handle heterogeneity in different peer schemas. Each peer interacts with the other nodes by submitting queries according to its own schema and vocabulary. In this respect, information integration facilities are defined with the aim of allowing a receiving peer to reformulate the query in terms of its schema. As proposed in Nejdl et al. [2002] and Halevy et al. [2004], information integration issues can be addressed by relying on a mediation approach where local transformations (i.e., mappings) and rules are defined to enable schema interoperability among peers. Mediation approaches are commonly employed for addressing query processing in data integration systems (e.g., [Li et al., 1998; Castano et al., 2001; Lenzerini, 2002]). In the context of schema-based P2P networks, mediation approaches are based on the exchange of peer schemas among neighbor nodes [Lenzerini, 2004]. This way, semantic mappings between corresponding elements of two peer schemas are defined for supporting query reformulation and processing.
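The idea of mapping-based query reformulation can be sketched as follows, under strong simplifying assumptions: a query is reduced to a list of schema terms, and a mapping is a plain dictionary from the sender vocabulary to the receiver vocabulary. The function `reformulate` and the example terms are hypothetical; the mediation rules of the cited systems are considerably richer.

```python
def reformulate(query_terms, mapping):
    """Rewrite the terms of an incoming query using the receiver's local mapping.

    Returns the translated terms and the terms for which no mapping is known.
    """
    translated, unmapped = [], []
    for term in query_terms:
        if term in mapping:
            translated.append(mapping[term])
        else:
            unmapped.append(term)
    return translated, unmapped


# Hypothetical mapping held by the receiving peer: sender term -> local term.
mapping_from_sender = {"Author": "Writer", "Title": "Name"}
translated, unmapped = reformulate(["Author", "Title", "Year"], mapping_from_sender)
```

Terms without a mapping are returned separately, which makes explicit that the receiving peer can only answer the part of the query it is able to express in its own schema.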

3.1.2 Open issues in schema-based P2P networks

The recent success of schema-based P2P networks has shown that there is a need for a growing role of semantics in P2P systems for moving towards knowledge-sharing P2P infrastructures where peers can create and share knowledge. In this respect, ontologies and ontology-based languages, such as OWL, are widely recommended for enforcing peer resource sharing and P2P semantic collaboration. To this end, ontologies are employed for semantically describing heterogeneous information resources (e.g., data, documents, and services) provided by each peer and for enabling seamless resource access at different levels of granularity (e.g., single tuple or attribute, selected document parts, single service component).

1 A detailed description of super-peer architectures is provided in Chapter 2.

The building blocks of schema-based P2P networks represent the main aspects to consider for developing an effective knowledge-sharing P2P system. However, most of the existing solutions are borrowed from centralized architectures and are not completely suited for open distributed systems, such as P2P systems, where query response time is a crucial aspect besides accuracy of results. For instance, the existing semantic-based query languages for RDF, RDFS, and OWL are mostly based on description logics and reasoning techniques. This means that the time required for query processing rapidly increases according to the number of semantic features to consider. Moreover, structured network topologies with hashing-based placement of resources cannot be trivially adapted to support deterministic knowledge placement. Apart from some interesting work that is appearing in this field (e.g., [Sidirourgos et al., 2005]), similarity-based placement of RDF(S) and OWL statements is still an open issue that requires further investigation. Finally, the use of mediation approaches for providing information integration is viable only when specific requirements can be ensured. This is due to the fact that semantic mapping definition among neighbor peers is an expensive operation, which requires that i) peer schemas are not frequently changing and ii) peers stably join the system. In a generic P2P system, peers join/leave the network at any moment and the time initially required for semantic mapping definition is a strict requirement that negatively affects peer autonomy.

According to these considerations, we observe that many issues regarding schema-based P2P networks are still open and require further investigation. In particular, the development of more tailored solutions for effective ontology-based knowledge sharing is becoming the next stage of development for P2P systems. Starting from the experience of schema-based P2P networks, the thesis work aims at focusing on the specific features and requirements of knowledge-sharing P2P systems. In particular, knowledge-sharing P2P systems are characterized by a different approach for handling semantic interoperability among peers. While schema-based P2P networks derive from distributed database technologies, knowledge-sharing P2P systems originate from the Semantic Web experience, where peer autonomy and independence are seen as a design principle rather than a difficulty to bypass. In this context, matching, and ontology matching in particular, is getting more and more important for addressing peer semantic interoperability.

3.2 Knowledge-sharing P2P systems

As shown in Figure 3.1, we define a knowledge-sharing P2P system as a network of independent peers with equal role and capabilities. This kind of system is multi-knowledge, in that no centralized authorities are defined to manage a comprehensive view of the knowledge related to the resources shared by all the nodes, and each peer is responsible for providing the knowledge description of the resources to be shared, usually through an ontology. A peer interacts with the other parties with the intention to (i) discover peers containing relevant knowledge with respect to a target request, and (ii) acquire the information resources of interest provided by the other peers. To this end, the knowledge discovery process is based on a query/answer paradigm in which each peer in the system acts both as a client and as a server interacting with other nodes directly, by submitting queries containing a request for one or more concepts of interest and by replying to queries with concepts relevant to (i.e., matching) the target.

Figure 3.1: Reference architecture for knowledge-sharing P2P networks

With respect to the reference scenario of Figure 3.1, the following main features can be recognized:

• Dynamism of the system, in that peers are allowed to join and leave the network at any moment.

• Autonomy of nodes, in that each peer is responsible for its own knowledge management and representation.

• Absence of a-priori agreement about the ontology specification vocabulary and language to be used for knowledge specification.

• Equality of node responsibilities, in that no centralized nodes with coordinating tasks are recognized and each peer directly enforces interaction facilities with other nodes for knowledge sharing.

According to the scenario of Figure 3.1, we note that the knowledge discovery process plays a crucial role since it has the responsibility to enable peer interactions by semantically matching their respective knowledge with the aim of discovering potential collaboration partners with similar contents. By knowledge discovery we mean the capability of each node of finding knowledge in the system about information resources that, at a given moment, best match the requirements of a request for given target resource(s). In this context, ontology matching techniques are required to determine whether and how concepts of different ontologies are semantically related to each other.

Existing ontology matching approaches address a number of general requirements which remain very important in knowledge-sharing P2P systems. A first general requirement is the applicability to different ontology specification languages, with special attention to recent standards of the Semantic Web like OWL. A further general requirement is the capability of coping with different levels of detail and design choices in describing the knowledge of interest using a certain language. In addition, the capability of considering different constructs used in ontology languages is required for matching purposes. Besides these general requirements, new peculiar requirements must be taken into account in conceiving ontology matching techniques for knowledge-sharing P2P systems. These requirements originate from the dynamic behavior of peers in such a scenario. A first peculiar requirement is that the number and type of ontology features that can be exploited during the matching process are not known in advance as they are embedded in the knowledge request. Furthermore, this number can vary, also for the same ontologies, each time a new matching execution comes into play triggered by a knowledge discovery request. Moreover, design principles of ontology matching techniques must be driven by i) the necessity of satisfying matching requests that are dynamically posed by peers on the basis of unexpected needs that can vary continuously, and ii) the necessity of addressing all general and peculiar matching requirements as a whole.

A number of solutions have been proposed for providing ontology matching functionalities [Do and Rahm, 2002; Doan et al., 2002; Giunchiglia and Shvaiko, 2003; Noy and Musen, 2003]. However, only a few approaches and tools have been specifically conceived for addressing ontology matching in the context of open distributed systems, such as knowledge-sharing P2P systems [Euzenat et al., 2004; Ehrig and Staab, 2004; Castano et al., 2006d].

3.3 Emergent semantics issues in knowledge-sharing P2P systems

With respect to the reference scenario presented in Figure 3.1, we observe that autonomy of nodes and absence of a-priori agreement introduce a semantic interoperability problem that needs to be addressed for providing effective knowledge discovery and sharing functionalities. This type of semantic interoperability is called emergent semantics and it is related to the capability of dynamically negotiating agreements among different peer parties in order to obtain common interpretations within the context of a given task [Aberer et al., 2004]. In other words, emergent semantics in knowledge-sharing P2P networks refers to the growing need to develop advanced solutions for allowing peers with similar interests to establish collaborations by addressing possible semantic heterogeneities in knowledge representation languages and vocabularies.

According to the emergent semantics principles discussed in Aberer et al. [2004], a number of opportunities and challenges are arising for knowledge-sharing P2P systems. In the thesis work, we focus on analyzing two main aspects related to emergent semantics issues in knowledge-sharing P2P networks, namely semantic routing and consensus negotiation:

Semantic routing. We define semantic routing as the general problem of addressing query propagation by selecting as query recipients the peers that are most likely to provide relevant results according to the query content. The main goal of semantic routing protocols is to address search efficiency by reducing network traffic and by improving scalability. As a result, peers are organized in a semantic overlay network where peers storing similar contents (i.e., semantic neighbors) get closer and can establish direct connections for sharing purposes.

Consensus negotiation. Consensus negotiation is defined as the capability of the system nodes to aggregate self-organizing semantic communities of peers on the basis of their respective contents and interests. The main goal of consensus negotiation techniques is to provide a coordination mechanism for allowing peers to interact in a peer-to-peer fashion and to commit local agreements. As a result, by aggregation of local agreements, single peers should have the capability to build global agreements (and thus communities).

3.4 Emergent semantics requirements

Regarding knowledge-sharing P2P systems and related emergent semantics issues, we consider the following requirements for which appropriate techniques have been developed in the thesis work.

• Availability of expressive knowledge representation languages for providing semantically rich description of the peer resources to share.

• Use of semantic-based matching techniques for evaluating the level of semantic affinity between resource descriptions provided by different peers.

• Definition of an adaptive semantic routing mechanism for supporting efficient query propagation.

• Support of a decentralized consensus negotiation approach for aggregating peers with similar interests in semantic communities.

According to such requirements, the following classification criteria, namely knowledge representation, matching techniques, query routing, and community support, are defined for assessing the level of penetration of emergent semantics issues in existing P2P systems.

3.4.1 Knowledge representation

The goal of this criterion is to evaluate the existing P2P systems according to the level of expressiveness of the supported knowledge representation formalism. As shown in Figure 3.2, we distinguish four different levels of expressiveness, namely metadata-based, XML-based, RDF-based, and ontology-based.

[Figure: knowledge representation (expressiveness) axis from low (-) to high (+): metadata-based, XML-based, RDF-based, ontology-based]

Figure 3.2: Degree of expressiveness in P2P knowledge representation

• Metadata-based representation techniques allow the main features of a resource to be formalized through a set of descriptive information called metadata. The level of expressiveness is low since the number and the kind of the metadata to use are predefined and no structure/schema is designed on the metadata set.

• XML-based representation techniques support resource description through an XML document. As a result, the use of the XML language ensures portability and an extensible format for resource description. Moreover, the XML Schema language can also be employed to define a reference schema for the XML documents.

• RDF-based representation techniques allow predicates on resources to be specified as triples. An RDF triple is defined in the form 〈subject, property, value〉, where subject is the resource we want to describe (using a URI), property is the property we use, and value is a datatype or another resource. The opportunity to define predicates further extends the expressiveness of traditional property = value representation languages, such as XML.

• Ontology-based representation techniques allow resources to be defined as instances of a corresponding schema description. Ontology-based languages are associated with a formal semantics which supports automatic reasoning and consistency checking through description logics. RDFS and OWL are typical examples of ontology-based representation languages.
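To make the 〈subject, property, value〉 form of RDF-based descriptions concrete, the following sketch stores a few triples as plain Python tuples and retrieves them by pattern. It deliberately avoids any RDF library; the `Triple` type, the `match` function, and all `http://example.org/...` URIs are hypothetical and only serve the illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str  # URI of the resource being described
    prop: str     # URI of the property
    value: str    # literal value or URI of another resource


def match(triples, subject=None, prop=None, value=None):
    """Return the triples that agree with the given pattern (None acts as a wildcard)."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (prop is None or t.prop == prop)
        and (value is None or t.value == value)
    ]


triples = [
    Triple("http://example.org/doc/1", "http://example.org/terms/author",
           "http://example.org/person/7"),
    Triple("http://example.org/doc/1", "http://example.org/terms/title",
           "A sample title"),
    Triple("http://example.org/person/7", "http://example.org/terms/name",
           "Jane Doe"),
]

# All statements whose subject is the document.
about_doc = match(triples, subject="http://example.org/doc/1")
```

Note that the value of the first triple is itself a resource that appears as the subject of the third triple, which is what distinguishes this representation from a flat property = value list.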

3.4.2 Matching techniques

The goal of this criterion is to evaluate the existing P2P systems according to the level of semantics featuring the matching techniques used for measuring the similarity between knowledge portions provided by different peers. As shown in Figure 3.3, we distinguish four different levels of semantics, namely syntactic-based, structural-based, linguistic-based, and reasoning-based.

[Figure: matching techniques (level of semantics) axis from low (-) to high (+): syntactic-based, structural-based, linguistic-based, reasoning-based]

Figure 3.3: Degree of semantics in matching techniques

• Syntactic-based techniques perform matching by considering names as strings of characters and by evaluating the similarity between two names as the similarity of their strings. String distance metrics are typically adopted for computing string matching. Context and semantics of elements are not considered in syntactic-based matching techniques.

• Structural-based techniques compare two elements by evaluating the similarity between their structures. The idea is to measure similarity by counting the number of matching elements in the context of the considered elements. Graph-based metrics are typically adopted for computing structural matching. Formal semantics is not considered in structural-based matching techniques.

• Linguistic-based techniques perform matching by considering names as words of a natural language and by evaluating their similarity according to the similarity of their meanings. Thus, meaning disambiguation techniques are required to distinguish the intended meaning of a word according to the context where it is adopted. Thesauri of terms are typically exploited to support the meaning disambiguation process.

• Reasoning-based techniques consider the matching problem as a deductive question where similarity between elements can be inferred by exploiting an initial set of formal propositions. The main examples of reasoning-based matching techniques rely on propositional satisfiability (SAT) and description logics (DL).
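The first two levels can be illustrated with a small sketch. Here syntactic similarity is computed from the Levenshtein edit distance between two names, and structural similarity from the overlap of the two element contexts. Both choices are assumptions made for the example: the normalization by the longer name, the use of the Jaccard coefficient, and the function names are not prescribed by the classification above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def syntactic_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based only on the characters of the two names."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / max(len(a), len(b))


def structural_similarity(context_a, context_b) -> float:
    """Share of context elements (e.g., property names) common to both elements."""
    a, b = set(context_a), set(context_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


name_score = syntactic_similarity("Publication", "Publications")
context_score = structural_similarity({"title", "author"}, {"title", "year"})
```

Neither function looks at the meaning of the names: two synonyms with different spellings receive a low syntactic score, which is the gap that linguistic-based and reasoning-based techniques are meant to fill.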

3.4.3 Query routing

The goal of this criterion is to evaluate the existing P2P systems according to the level of adaptivity of the query routing mechanism. In particular, we refer to adaptivity as the capability of a peer to react to the system changes by dynamically modifying its neighbors, and thus its routing choices. As shown in Figure 3.4, we distinguish four different levels of adaptivity, namely mapping-based, advertisement-based, history-based, and expertise-based.

[Figure: query routing (adaptivity) axis from low (-) to high (+); labels: advertisement-based, history-based, expertise-based, neighborhood-based]

Figure 3.4: Degree of adaptivity in P2P query routing

• Mapping-based adaptivity refers to those P2P systems where a peer joining the system has to define a set of semantic mappings with its neighbors. The level of adaptivity is low due to the fact that peers, and thus query routing, slowly react to semantic mapping changes since mapping update is expensive and thus rarely (never) performed.

• Advertisement-based adaptivity regards those P2P systems where each peer has to actively advertise its contents to the other nodes. Normally, the advertisement operations involve direct neighbors and indirect neighbors within one hop. The level of adaptivity is affected by the locality of the advertisements which prevents the discovery of multi-hop-distance peers.

• History-based adaptivity refers to those P2P systems where peer connections are established by exploiting previous peer interactions. Query recipients are selected by counting the number of relevant replies provided by known peers.

• Expertise-based adaptivity is related to those P2P systems where peer connections are defined on the basis of similarity measures. Query recipients are selected according to the expected similarity of their contents with respect to the query.
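The difference between the last two levels can be sketched as two recipient-selection strategies. In the history-based one, peers are ranked by the number of relevant replies observed so far; in the expertise-based one, peers are ranked by the similarity between the query and the contents each peer is known for. The class and function names, the term-set representation of contents, and the Jaccard measure are assumptions introduced for this example.

```python
from collections import Counter


class HistoryBasedRouter:
    """Ranks known peers by the number of relevant replies they gave in the past."""

    def __init__(self):
        self._relevant = Counter()

    def record_reply(self, peer, relevant):
        # The peer is registered even when its reply was not relevant.
        self._relevant[peer] += 1 if relevant else 0

    def select_recipients(self, k):
        ranked = sorted(self._relevant.items(), key=lambda kv: (-kv[1], kv[0]))
        return [peer for peer, _ in ranked[:k]]


def expertise_based_recipients(query_terms, known_contents, k):
    """Rank peers by the overlap between the query terms and their known contents."""
    query = set(query_terms)

    def similarity(terms):
        terms = set(terms)
        union = query | terms
        return len(query & terms) / len(union) if union else 0.0

    ranked = sorted(known_contents,
                    key=lambda peer: (-similarity(known_contents[peer]), peer))
    return ranked[:k]


router = HistoryBasedRouter()
for peer, relevant in [("p1", True), ("p2", False), ("p1", True), ("p3", True)]:
    router.record_reply(peer, relevant)
history_choice = router.select_recipients(2)

expertise_choice = expertise_based_recipients(
    {"ontology", "matching"},
    {"p1": {"routing"}, "p2": {"ontology", "matching", "owl"}, "p3": {"ontology"}},
    2,
)
```

In this toy run the two strategies select different recipients: the history-based router prefers the peer that answered well in the past regardless of the current query, while the expertise-based selection depends on the query content.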

3.4.4 Community support

The goal of this criterion is to evaluate the existing P2P systems according to the level of decentralization featuring the formation of peer communities. As shown in Figure 3.5, we distinguish three different levels of decentralization, namely shared knowledge base, local negotiation, and global negotiation. Moreover, we also represent the not supported level for those approaches where the formation of peer communities is not expected.

[Figure: community support (decentralization) axis from low (-) to high (+): not supported, shared knowledge base, local negotiation, global negotiation]

Figure 3.5: Degree of decentralization in P2P community support

• Not supported refers to those P2P systems where the idea of semantic communities of peers is not defined. Peers act as single agents and similar nodes discovered during past interactions are not considered for future applications.

• Shared knowledge base is related to those P2P systems where semantic communities are defined according to a shared knowledge base (i.e., a taxonomy or an ontology). The level of decentralization is low due to the fact that the shared knowledge base is statically defined without an active negotiation phase among peers.

• Local negotiation refers to those P2P systems where a definition of peer community is not explicitly supported, but each peer has a local perception of its semantic communities. In other words, each node discovers similar peers in the network which are considered as members of its "virtual" semantic community.

• Global negotiation regards those P2P systems where semantic communities are explicitly supported and require a global negotiation process for determining community membership. Each community is aware of its peer members and each peer is aware of the communities in which it is involved.
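The local negotiation level can be illustrated by a peer that builds its own "virtual" community from the peers it has discovered, without any shared membership list. In the sketch below, interests are modeled as sets of terms and affinity as their Jaccard overlap. The function `virtual_community`, the threshold value, and the peer names are hypothetical.

```python
def virtual_community(own_interests, discovered_peers, threshold=0.5):
    """Return the peers that this node locally regards as community members.

    discovered_peers maps a peer identifier to the set of interests observed for it.
    """
    own = set(own_interests)
    members = set()
    for peer, interests in discovered_peers.items():
        other = set(interests)
        union = own | other
        affinity = len(own & other) / len(union) if union else 0.0
        if affinity >= threshold:
            members.add(peer)
    return members


discovered = {
    "p1": {"p2p", "ontology"},
    "p2": {"grid", "storage"},
    "p3": {"p2p", "ontology", "matching", "routing"},
}
community = virtual_community({"p2p", "ontology", "matching"}, discovered)
```

The result is only a local view: another peer running the same computation with its own interests may obtain a different set, which is what separates local negotiation from global negotiation, where membership is agreed upon and known to the community as a whole.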

3.5 Critical review of the state of the art

In this section, both the schema-based P2P systems presented in Section 3.1 and the P2P systems surveyed in Chapter 2 are critically reviewed and classified with respect to i) architectural and structural properties, and ii) emergent semantics requirements.

3.5.1 Comparison on architectural and structural properties

In Table 3.1, we summarize the results of the comparison by relying on both architectural and structural classifications presented in Chapter 2.

Structured Adaptive Non-adaptive

Hybrid
• Bloehdorn et al. [2005]

Pure
• Aberer et al. [2003a]
• Li and Vuong [2005]
• Sidirourgos et al. [2005]
• Joseph [2002]
• Ramanathan et al. [2002]
• Cuenca-Acuna et al. [2003]
• Yolum and Singh [2003a]
• Agostini and Moro [2004]
• Staab et al. [2004]
• Borch [2005]
• Zeinalipour-Yazti et al. [2005]
• Aberer et al. [2003b]
• Bonifacio et al. [2003]
• Khambatti et al. [2003]
• Crespo and Garcia-Molina [2004]
• Haase et al. [2004]
• Zhuge et al. [2004]
• Mandreoli et al. [2006]

SuperPeer
• Nejdl et al. [2002]
• Halevy et al. [2004]

Table 3.1: Comparison according to architectural and structural properties

With respect to the architectural classification, we note that most of the considered approaches are based on a pure P2P infrastructure. In our opinion, this is due to the following motivations:

• The hybrid P2P approach is more similar to a client/server approach than to a P2P one. When the P2P paradigm is chosen for implementing a sharing system, load-balancing and fault-tolerance in search operations are expected properties. Such properties are not guaranteed when centralized searching functionalities are used, like in hybrid P2P systems. The approach presented in Bloehdorn et al. [2005] has been classified as a hybrid P2P system due to the use of a centralized community ontology for defining peer communities.

• The pure P2P approach is flexible and very simple to implement. As a consequence, the pure P2P approach is widely used as it is suited for providing the basic communication infrastructure on top of which more complex and customizable overlay structures can be built.

• The superPeer approach requires more complex management operations. In particular, creation and maintenance of super-peer nodes are expensive operations that can be accepted only when a stable super-peer presence can be assumed. Thus, the superPeer approach is adopted in specific domain-dependent scenarios, as in Nejdl et al. [2002] and Halevy et al. [2004].

With respect to the structural classification, we note that most of the considered approaches are characterized by an unstructured peer organization (i.e., adaptive and non-adaptive organizations). The following considerations can be made:

• The structured P2P approach is based on strict requirements that are compatible with a limited number of P2P application scenarios. In particular, hashing and deterministic resource placement are possible only when a peer is willing to separate from its data and to take responsibility for the data of other peers, according to the assigned chunk of hash keys. As a consequence, structured systems are popular in collaborative environments, like Grids, where independence and autonomy of peers are relaxable constraints (e.g., [Aberer et al., 2003a] and [Sidirourgos et al., 2005]).

• The unstructured P2P approach is highly flexible and is suited for generic P2P applications where specific assumptions on peer behavior are not possible. The distinction between adaptive and non-adaptive systems depends on different design choices rather than on different application scenarios. In an adaptive system, routing issues are normally addressed at the overlay level, where application-dependent mechanisms are applied for performing query propagation (e.g., [Staab et al., 2004]). On the other hand, in a non-adaptive system, routing issues are addressed at the infrastructure level, since query propagation is a topology-dependent mechanism (e.g., [Crespo and Garcia-Molina, 2004]).

3.5.2 Comparison on emergent semantics requirements

In Table 3.2, we summarize the results of the comparison by relying on the emergent semantics requirements and related evaluation criteria presented in Section 3.4. For what


System | Knowledge Representation | Matching Techniques | Query Routing | Community Support
Joseph [2002] | metadata-based | syntactic-based | expertise-based | -
Nejdl et al. [2002] | RDF-based | reasoning-based | neighborhood-based | -
Ramanathan et al. [2002] | metadata-based | syntactic-based | history-based | -
Aberer et al. [2003a] | metadata-based | syntactic-based | history-based | -
Aberer et al. [2003b] | XML-based, RDF-based | syntactic-based, reasoning-based | neighborhood-based | -
Bonifacio et al. [2003] | XML-based | structural-based, reasoning-based | neighborhood-based, history-based | global negotiation
Cuenca-Acuna et al. [2003] | XML-based | syntactic-based | advertisement-based | shared knowledge base
Khambatti et al. [2003] | metadata-based | syntactic-based | advertisement-based | local negotiation
Yolum and Singh [2003a] | metadata-based | syntactic-based | expertise-based | local negotiation
Agostini and Moro [2004] | metadata-based | syntactic-based, structural-based | expertise-based | local negotiation
Crespo and Garcia-Molina [2004] | metadata-based | syntactic-based | neighborhood-based | shared knowledge base
Haase et al. [2004] | ontology-based | syntactic-based, structural-based | advertisement-based | shared knowledge base
Halevy et al. [2004] | XML-based, ontology-based | structural-based, reasoning-based | neighborhood-based | -
Staab et al. [2004] | RDF-based | syntactic-based, structural-based | expertise-based | -
Zhuge et al. [2004] | XML-based | structural-based, reasoning-based | neighborhood-based | -
Bloehdorn et al. [2005] | ontology-based | reasoning-based | history-based | local negotiation, shared knowledge base
Borch [2005] | XML-based | syntactic-based | history-based | -
Li and Vuong [2005] | RDF-based | syntactic-based, structural-based | neighborhood-based | -
Sidirourgos et al. [2005] | RDF-based | reasoning-based | advertisement-based | -
Zeinalipour-Yazti et al. [2005] | metadata-based | syntactic-based | history-based | -
Mandreoli et al. [2006] | ontology-based | reasoning-based | neighborhood-based | -

Table 3.2: Comparison according to emergent semantics requirements

concerns knowledge representation, we note that the use of RDF(S) and ontologies is becoming a common solution in P2P systems. Moreover, we observe that such languages are usually combined with semantic-based matching techniques (i.e., linguistic- and reasoning-based techniques). This is mainly due to the recent success of ontologies and Semantic Web technologies, which can be easily extended for application to open distributed systems, like P2P systems. However, the pair 〈metadata-based, syntactic-based〉 still characterizes a number of P2P systems. This is motivated by the fact that most of these approaches derive from the Information Retrieval field. In this context, the focus is not on evaluating the benefits of expressive knowledge representations and related semantic-based matching techniques, but on improving the efficiency of the search procedures. We note that adaptive query routing techniques (i.e., history-based and expertise-based techniques) are usually preferred in these systems. In particular, two main patterns can be observed in Table 3.2:

• Systems deriving from the Semantic Web field, where expressive knowledge representations and semantic-based matching techniques are used, are usually characterized by poorly adaptive routing techniques (e.g., [Nejdl et al., 2002], [Sidirourgos et al., 2005], and [Mandreoli et al., 2006]).

• Systems deriving from the Information Retrieval field, where adaptive query routing techniques are used, are usually characterized by poorly expressive knowledge representations and syntactic-based matching techniques (e.g., [Joseph, 2002], [Borch, 2005], and [Zeinalipour-Yazti et al., 2005]).

In this respect, the thesis work aims at merging the previous patterns by developing methods and techniques where ontologies and ontology matching techniques are exploited for providing an adaptive P2P routing mechanism.

Finally, we observe that community support is not widespread in current P2P systems. Even when communities are supported, peers usually rely on a shared knowledge base to reduce the complexity of community discovery, membership management, and topic overlapping among different communities. As an alternative to the shared knowledge base, local agreements are supported to enable implicit peer community definition. In our opinion, such limited community support is due to the early stage of development of knowledge-sharing P2P systems and to the lack of advanced semantic-based techniques for addressing emergent semantics issues in P2P environments. In this respect, the thesis work aims at filling the gap in emergent semantics support by providing methods and techniques for the formation and self-organization of semantic communities of peers.


Chapter 4

The H-Link semantic routing mechanism

P2P semantic routing protocols are generally based on the idea of addressing query propagation by selecting as query recipients the peers that are most likely to provide relevant results according to the query content. The main goal of semantic routing techniques is to reduce the overall network traffic and, at the same time, to improve the effectiveness of the knowledge discovery phase. In this chapter, we present the main features of the H-Link mechanism for semantically routing queries in a peer-based system. The main contributions of H-Link regard i) the use of ontologies for a semantically rich representation of peer knowledge, ii) the adoption of ontology matching techniques for query recipient selection, and iii) the independence of peers in autonomously managing their own ontologies.

After presenting the foundational aspects of H-Link, we will discuss the role of ontologies and ontology matching techniques for enforcing H-Link by focusing on the H-Match semantic matchmaker. A detailed description of the H-Link semantic routing mechanism is then provided. Finally, the presentation is supported by an application example that further illustrates the H-Link semantic routing functionalities.


4.1 Main features of H-Link

The H-Link mechanism has been conceived to work in a knowledge sharing peer-based system where peers act as independent agents and no centralized authorities are defined to manage a comprehensive view of the resources shared by all the nodes of the system. The key idea of H-Link is to exploit the results of knowledge discovery interactions to train the behavior of the routing mechanism. To this end, peers are connected through matching-based confidence measures that keep track of the semantic affinity among the contents of different peers. As a result, peers are organized in a semantic overlay network where nodes having similar knowledge are interlinked as semantic neighbors. The following main features characterize the H-Link mechanism.

• Use of a dynamic knowledge discovery approach. In H-Link, peers interact by submitting discovery queries with the aim of identifying relevant partners with respect to one or more target concepts of interest. Upon receiving a discovery query, a peer evaluates whether it is capable of providing concepts matching the target request. According to the results of the matching process, the list of concepts found to be relevant is returned to the requesting peer together with a list of associated semantic affinity values. In this respect, affinity values provide a measure of the level of similarity between the target concepts of the query and the discovered matching concepts. In H-Link, the replying nodes are linked to the requesting peer as semantic neighbors, and the returned affinity values are exploited to set the level of confidence of each semantic neighbor with respect to the discovered matching concepts. This way, as a peer learns about the network contents through discovery queries, its network knowledge gradually evolves to reflect its newly acquired semantic neighbors.

• Use of ontologies. In H-Link, both queries and peer resources are expressed in terms of ontological descriptions. In particular, each query contains a list of target concept(s) of interest with possible properties and semantic relations that further specify the request. Furthermore, each peer joining the system provides a peer ontology describing the knowledge the peer brings to the network and the knowledge the peer perceives from the network. In other words, the peer ontology accomplishes two main purposes. On one side, the peer ontology provides a semantically rich description of the local knowledge that the peer intends to share with the other nodes. On the other side, the peer ontology also links the semantic neighbors that have been acquired during past discovery interactions together with the corresponding confidence values.

• Use of ontology matching techniques. In H-Link, each peer is capable of providing ontology matching functionalities through the use of a semantic matchmaker. Ontology matching is employed by a peer during the dynamic knowledge discovery process in order to assess whether it can provide relevant knowledge in reply to an incoming discovery query. Furthermore, ontology matching is also exploited in H-Link in order to select the recipients of a query according to the expected semantic affinity between the query contents and the peer ontology of the receiving peer.

In order to illustrate the H-Link mechanism, we refer to the following definition.

Definition of H-Link. The H-Link mechanism HL is defined as a 4-tuple of the form HL = 〈Q, O, MP, Ncr〉 where

• Q is a discovery query and it is defined as a set of target concepts TCq, each characterized by a name and by (possibly empty) sets of properties Pq and semantic relations SRq.

• O is a peer ontology and it is defined as a collection of concepts C, properties P, semantic relations SR, network concepts NC and location relations LR. As the use of ontologies is a foundational aspect of H-Link, a detailed description of the peer ontology definition will be provided in Section 4.2.

• MP is a matching policy for the configuration of the semantic matchmaker. The structure of the matching policy depends on the requirements of the semantic matchmaker. The H-Match semantic matchmaker is currently used to support H-Link. As ontology matching techniques are a foundational aspect of H-Link, a detailed description of H-Match will be provided in Section 4.3.

• Ncr is the number of credits attached to the query Q and represents the numberof replies that the requesting peer wishes to receive as answers to Q.
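The 4-tuple above can be pictured as a small record. The following Python sketch is only illustrative: the class and field names, the dictionary used for the matching policy, and the credit value are assumptions made for this example and are not part of the H-Link definition. The peer ontology O is left as a placeholder, since its structure is given in Section 4.2.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class TargetConcept:
    """One element of TCq: a name plus optional properties and relations."""
    name: str
    properties: list[str] = field(default_factory=list)             # Pq
    relations: list[tuple[str, str]] = field(default_factory=list)  # SRq as (relation, concept)


@dataclass
class HLink:
    """Hypothetical container for HL = <Q, O, MP, Ncr>."""
    query: list[TargetConcept]   # Q: the set of target concepts
    ontology: dict               # O: placeholder for the peer ontology
    matching_policy: dict        # MP: matchmaker configuration (assumed to be a dict)
    credits: int                 # Ncr: number of replies the requester wishes to receive


# Query Q1 of the running example: Publication(year), Book(author), Book kind-of Publication.
q1 = [
    TargetConcept("Publication", properties=["year"]),
    TargetConcept("Book", properties=["author"], relations=[("kind-of", "Publication")]),
]
# The policy key "model" and the credit value 3 are illustrative assumptions.
hl = HLink(query=q1, ontology={}, matching_policy={"model": "deep"}, credits=3)
```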


4.1.1 Motivating and running example

As an example of dynamic knowledge discovery with H-Link, we consider the scenario of Figure 4.1. In this scenario, each peer is independent and joins the system by providing its own peer ontology. In the following, we assume to work with OWL ontologies, both for specifying peer ontologies and queries¹. Furthermore, the H-Match semantic matchmaker is exploited in this example when ontology matching functionalities are required. In Figure 4.1, we suppose that peer A is interested in discovering peers capable of providing resources semantically related to the publishing domain.

Figure 4.1: Example of dynamic knowledge discovery in H-Link

¹ In Figure 4.1, both peer ontologies and queries are described through a graph-based formalism that provides a representation of the main constructs of the OWL language (see Section 4.2 for further details).

To this end, peer A composes and submits to the system a discovery query Q1 containing the target concepts of interest Publication and Book with the properties year and author, respectively. Moreover, Book is specified as a subclass of Publication. Upon receiving the query Q1, each peer (i.e., peer B, peer C, and peer D) invokes its semantic matchmaker to compare the description of the target concepts in the query against its own peer ontology. According to the results of the matching process, peer B and peer D send back to the requesting peer A a ranked list of concepts matching the target and, for each entry, the calculated semantic affinity value SA. When the H-Match semantic matchmaker is used, the SA value is computed between a target concept in the query and a concept in the peer ontology, and lies in the range [0,1], where SA = 1 and SA = 0 are the maximum and minimum level of semantic affinity, respectively. In particular, peer B replies with the Volume matching concept since SA(Book, Volume) = 0.83 in H-Match, while peer D sends back two matching concepts, namely Newspaper and Magazine, with SA(Publication, Newspaper) = 0.67 and SA(Book, Magazine) = 0.539, respectively. On the other hand, peer C does not reply to peer A as no matching concepts are identified. The query replies represent the discovered knowledge of peer A, which can be exploited to decide whether to further interact with the answering peers in order to access their relevant resources for data sharing.
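The reply step of this example can be sketched in Python as follows. This is a minimal illustration, not the H-Match implementation: the function names are hypothetical, the affinity function is a lookup table holding only the three SA values quoted above (every other pair is assumed to score 0), the threshold of 0.5 is an assumption, and the concept lists assigned to peers B, C and D are a reading of Figure 4.1 that should be treated as illustrative.

```python
def answer_discovery_query(target_concepts, local_concepts, affinity, threshold=0.5):
    """Return (target, match, SA) triples ranked by decreasing SA; empty if no match."""
    matches = []
    for tc in target_concepts:
        for c in local_concepts:
            sa = affinity(tc, c)
            if sa >= threshold:
                matches.append((tc, c, sa))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Stand-in for the matchmaker: only the SA values reported in the example are known.
SA_TABLE = {
    ("Book", "Volume"): 0.83,
    ("Publication", "Newspaper"): 0.67,
    ("Book", "Magazine"): 0.539,
}


def toy_affinity(tc, c):
    return SA_TABLE.get((tc, c), 0.0)  # assumption: unknown pairs do not match


q1 = ["Publication", "Book"]
reply_b = answer_discovery_query(q1, ["Person", "Volume", "Section"], toy_affinity)
reply_c = answer_discovery_query(q1, ["Hotel"], toy_affinity)
reply_d = answer_discovery_query(
    q1, ["PeriodicalPublication", "Newspaper", "Magazine"], toy_affinity
)
```

With these inputs, peer B returns a single entry for Volume, peer D returns Newspaper before Magazine, and peer C returns an empty list, which corresponds to not replying.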

Without H-Link, the discovery process would rely on conventional P2P infrastructures and associated routing protocols for query propagation in the network (e.g., flooding). In the following sections, we show how the discovered knowledge can be further exploited in H-Link for semantic routing purposes by enforcing query forwarding according to peer ontology similarities rather than to the mere network topology.

4.2 Peer ontology definition

As shown in the example of Figure 4.2, the knowledge of a peer is described through a peer ontology that is organized in a two-layer architecture, where the upper layer represents the content knowledge and the lower layer represents the network knowledge of the peer.

4.2.1 The content knowledge layer

The content knowledge layer of a peer ontology describes the knowledge the peer brings to the network and it is described as a collection of concepts C, properties P, and semantic relations SR. For the internal representation of ontology specification languages, and in particular of Semantic Web languages like OWL, we rely on a graph-based representation of peer ontologies. According to the example of Figure 4.2, the graph nodes denote concepts and properties (that in the example represent classes and properties of OWL), while the edges denote the semantic relations between concepts (that in the example represent the subClassOf relation in OWL as well as the property domain and range derived from OWL restrictions). The content knowledge layer of a peer ontology is defined according to the underlying resources that need to be shared and by exploiting the classical ontology engineering methodologies [Gomez-Perez et al., 2003].

Concept. A concept c ∈ C is defined as a pair of the form c = (nc, Pc), where nc is the concept name, and Pc is a set, possibly empty, of properties of c. Each concept c represents a class declaration in OWL.

Property. A property p ∈ P is defined as a pair of the form p = (np, PCp), where np is the property name, and PCp is a set of property constraints. Each property constraint associates a property p with a concept c, by specifying the minimal cardinality and the property value vp of p in c. A property constraint pcp ∈ PCp is a 3-tuple of the form pcp = (c, kp, vp), where c is a concept, kp ∈ {0,1} is the minimal cardinality associated with p when applied to c, and vp is the value associated with p when applied to c, and can be a datatype dtp or a reference name. We call strong properties the properties with kp = 1, and weak properties the ones with kp = 0. Each property p represents a property declaration in OWL. The property cardinality kp as well as the property value vp are enforced by exploiting OWL property restrictions.

Semantic relations. A semantic relation sr ∈ SR is defined as a binary relation of the form sr(c, c′), where c and c′ are concepts and sr is the relation holding between them. Semantic relations are defined according to the constructs of the ontology specification language that is adopted for peer ontology description. With reference to the OWL language, the same-as and kind-of semantic relations are defined, which represent the equivalentClass and the subClassOf relations, respectively.

4.2.2 The network knowledge layer

The network knowledge layer of a peer ontology describes the knowledge that the peer has of the semantic neighbors it has interacted with. This layer is seen as a set of network concepts NC that are connected with the concepts in the content knowledge layer through location relations LR. A network concept nc provides an abstract representation of a semantic neighbor (i.e., a peer) that has been identified during the knowledge discovery process. A location relation is defined to connect a network concept nc with a concept c in the content knowledge layer. A confidence annotation cf is associated with a location relation to keep track of the discovered semantic affinity between c and the peer ontology of the semantic neighbor represented by nc.

Network concept. A network concept nc ∈ NC is defined as a triple of the form nc = (nnc, Pnc, enc), where nnc is a label identifying the semantic neighbor, Pnc is a set of properties describing the network features of the peer (e.g., IP address, bandwidth), and enc is a comprehensive measure of the expertise of nc. Currently, enc is computed as the arithmetic mean of the confidence annotations associated with the location relations connected to nc.

Location relation. A location relation lr ∈ LR is defined as a triple of the form lr(c, nc, cf), where c is a concept in the content layer, nc is a network concept in the network layer, and cf is the confidence annotation associated with the relation.
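The two definitions above map naturally onto simple records. The Python sketch below is a hedged illustration: the class names, the use of strings to identify concepts and neighbors, and the choice of returning 0.0 for a neighbor without location relations are assumptions of this example. Only the rule that the expertise is the mean of the connected confidence annotations comes from the definition.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class LocationRelation:
    """lr(c, nc, cf): links a content concept to a network concept."""
    concept: str        # c, a concept of the content knowledge layer
    neighbor: str       # label n_nc of the network concept nc
    confidence: float   # cf, the confidence annotation


@dataclass
class NetworkConcept:
    """nc = (n_nc, P_nc, e_nc); the expertise e_nc is derived on demand."""
    label: str
    properties: dict = field(default_factory=dict)  # e.g., IP address, bandwidth

    def expertise(self, relations):
        """Mean of the confidence annotations of the relations connected to this neighbor."""
        own = [lr.confidence for lr in relations if lr.neighbor == self.label]
        return mean(own) if own else 0.0  # assumption: no relations means expertise 0


relations = [
    LocationRelation("Book", "peer B", 0.83),
    LocationRelation("Publication", "peer D", 0.67),
    LocationRelation("Book", "peer D", 0.539),
]
peer_b = NetworkConcept("peer B")
peer_d = NetworkConcept("peer D")
```

Computing the expertise on demand, rather than storing it, keeps it consistent with the location relations whenever a confidence annotation changes.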

4.2.3 Building the network knowledge

Upon receiving query replies after a knowledge discovery process, a requesting peer has to populate the network knowledge layer of its peer ontology according to the received matching concepts and associated semantic affinity values. In particular, a network concept is defined in the peer ontology for each peer that replied to a discovery query.

We consider a peer p submitting a discovery query Q containing a single target concept tc. The peer r replies to the query Q by providing a list of matching concepts c1 ... cn and associated semantic affinity values SA(tc, c1) ... SA(tc, cn). Upon receiving such a query reply, a network concept nc is defined in the peer p ontology to denote that peer r is a semantic neighbor, and a location relation is defined to connect nc to the concept tc. For defining a location relation, two different scenarios can occur:

• The target concept tc is in the content knowledge layer of the peer p ontology. In this case, a location relation lr(tc, nc, cf) is defined, and the confidence value cf is computed as the average of the received semantic affinity values for tc, namely $cf = \frac{1}{n}\sum_{i=1}^{n} SA(tc, c_i)$.


• The target concept tc is not in the content knowledge layer of the peer p ontology. In this case, the location relation is not defined. As an alternative, the peer p can decide to evolve its peer ontology by including the definition of the concept tc through ontology evolution techniques [Castano et al., 2006c]. As a result, the location relation is defined according to the previous method.

The same actions are performed by peer p when more than one target concept tc is contained in the query Q. In this case, a different location relation can be defined between the network concept nc and each target concept tc, according to the matching results returned in the query reply by peer r.
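The two scenarios can be sketched as a single update routine. This is a minimal Python illustration under stated assumptions: the function name, the representation of a reply as a dictionary from target concept to (matching concept, SA) pairs, and the storage of location relations in a dictionary keyed by (concept, neighbor) are all hypothetical. The ontology evolution alternative of the second scenario is not modeled; targets outside the content layer are simply skipped.

```python
from statistics import mean


def update_network_knowledge(content_concepts, location_relations, neighbor, reply):
    """Define one location relation per target concept found in the content layer.

    reply maps each target concept to a list of (matching concept, SA) pairs.
    """
    for tc, matches in reply.items():
        if tc not in content_concepts or not matches:
            continue  # second scenario: no relation is defined for this target
        # First scenario: cf is the mean of the SA values received for tc.
        location_relations[(tc, neighbor)] = mean(sa for _, sa in matches)
    return location_relations


# Content layer of peer A, with concept names read from Figure 4.2 (illustrative).
content = {"Publication", "Book", "Journal", "Article"}
lrs = {}
update_network_knowledge(content, lrs, "peer B", {"Book": [("Volume", 0.83)]})
update_network_knowledge(
    content, lrs, "peer D",
    {"Publication": [("Newspaper", 0.67)], "Book": [("Magazine", 0.539)]},
)
```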

4.2.4 Example

As an example, in Figure 4.2, we consider a portion of the peer ontology of peer A after the knowledge discovery interaction described in Figure 4.1.

Figure 4.2: A portion of the peer ontology of peer A

In this example, peer B and peer D have answered query Q1, so the corresponding network concepts are defined in the network knowledge layer. According to the query reply of peer B, a new location relation with a confidence annotation of 0.83 is defined to connect the peer B network concept with the Book concept in the content knowledge layer. As two matching concepts are returned in the query answer of peer D (i.e., Newspaper, Magazine), two location relations are defined by connecting the peer D network concept with the concepts Publication and Book in the content knowledge layer, and by setting a confidence annotation of 0.67 and 0.539, respectively. As a consequence, the expertise measures associated with peer B and peer D are 0.83 and (0.67 + 0.539)/2 ≈ 0.605, respectively.

4.2.5 Considerations on the computation of confidence values

In the network knowledge layer of a peer ontology, confidence and expertise measures make it possible to link peers with similar contents at different levels of granularity (i.e., concept and peer level). Moreover, confidence annotations introduce a flexible and extensible mechanism that can be improved according to the specific peer requirements. For instance, a peer can decide to restrict the creation of new location relations only to a selected set of concepts of interest. This way, only a limited number of concepts are connected to network concepts, thus fostering the efficiency of peers with very focused interests. As another possible option, the creation of location relations can be restricted only to the received matching concepts whose semantic affinity value SA is over a given threshold. This way, only highly expert semantic neighbors are stored in the peer ontology.
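The two restriction options just described can be expressed as a simple predicate. The following Python sketch is only an illustration: the function name, its parameters and the default threshold of 0.0 are assumptions, and a peer may of course combine or replace these checks with its own criteria.

```python
def should_create_relation(target_concept, sa, interests=None, threshold=0.0):
    """Decide whether a received match deserves a location relation.

    interests: optional set of concepts of interest (None means no restriction).
    threshold: the SA value must be strictly over this bound.
    """
    if interests is not None and target_concept not in interests:
        return False
    return sa > threshold


# With a hypothetical threshold of 0.6, only the stronger matches of the
# running example would be stored in the peer ontology.
keep_volume = should_create_relation("Book", 0.83, threshold=0.6)
keep_magazine = should_create_relation("Book", 0.539, threshold=0.6)
keep_newspaper = should_create_relation("Publication", 0.67, interests={"Book"})
```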

In H-Link, confidence values are computed by relying on the semantic affinity evaluations produced by the matchmaker during the knowledge discovery phase. Such a choice is aimed at supporting query propagation by exploiting semantics-based criteria, like semantic affinity, rather than topology-based parameters, like peer distance and bandwidth. However, the computation of confidence measures can be extended to combine additional metrics and to provide a more accurate evaluation of the expertise of each semantic neighbor. For instance, information regarding the network reliability of the semantic neighbors, such as connection stability and granted bandwidth, can also be considered for integration in the expertise computation [Borch and Vognild, 2004]. Furthermore, trust-based techniques and aging mechanisms can be considered for extending the affinity-based confidence computation.

• Trust-based techniques. Many different approaches have been proposed in the literature to handle trust issues in P2P systems, where the inherently open nature of communications may allow malicious peer behaviors [AbdulRahman and Hailes, 1999; Xiong and Liu, 2004]. The main existing approaches are based on the idea of advertising reputation information in order to penalize peers with malicious behavior. Information regarding semantic neighbor reputation can be obtained by exploiting an existing trust-based approach and by integrating it in the computation of confidence values. A possible drawback of introducing trust in confidence computation is the additional traffic overhead required for advertising peer reputations. For this reason, the choice of introducing trust is recommended when peer reliability is a critical factor. In general, the presence of a small number of malicious peers does not significantly affect the routing performance, as discussed in Marti and Garcia-Molina [2004].

• Aging mechanism. Currently, the confidence value associated with a location relation between c and nc is updated when a new semantic affinity value with c is returned by nc in reply to a discovery query. As proposed in Joseph [2002], the confidence value cf associated with a given location relation between c and nc can be periodically updated by observing the ratio between the number of relevant replies provided by nc and the number of queries sent to nc with a target concept related to c. When the ratio has low values, cf can be decreased to denote that the original confidence (i.e., semantic affinity) is no longer current. This way, only confirmed location relations are maintained in the peer ontology, while unreliable confidence values are gradually reduced and finally dropped. We stress that the use of an aging mechanism makes it possible to cut off irrelevant semantic neighbors by relying on a completely passive approach. In other words, no additional traffic is required to periodically contact semantic neighbors with the intention of checking possible changes in their respective peer ontologies.
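A possible shape for the aging step is sketched below in Python. The text only states that cf is decreased when the ratio of relevant replies to sent queries is low and that unreliable relations are eventually dropped; the multiplicative decay, the parameter names and all numeric defaults are assumptions chosen for this illustration.

```python
def age_confidence(cf, relevant_replies, queries_sent,
                   min_ratio=0.5, decay=0.8, drop_below=0.1):
    """Return the aged confidence value, or None when the relation should be dropped.

    min_ratio, decay and drop_below are hypothetical tuning parameters.
    """
    if queries_sent == 0:
        return cf  # nothing observed yet, so the confidence is left unchanged
    ratio = relevant_replies / queries_sent
    if ratio >= min_ratio:
        return cf  # the neighbor keeps confirming its relevance
    aged = cf * decay
    return aged if aged >= drop_below else None


confirmed = age_confidence(0.83, relevant_replies=4, queries_sent=5)
reduced = age_confidence(0.5, relevant_replies=1, queries_sent=5)
dropped = age_confidence(0.1, relevant_replies=0, queries_sent=5)
```

The routine uses only counters that the peer already collects while routing its own queries, which matches the passive nature of the approach.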

4.3 The H-Match semantic matchmaker

In the following, we present the main features of the H-Match semantic matchmaker for matching independent ontologies in open distributed systems [H-Match; Castano et al., 2003a, 2004d, 2006d]. A key feature of H-Match is that it can be dynamically configured for adaptation to the semantic complexity of the ontologies to be compared, using a combination of syntactic and semantic techniques. This feature is achieved by means of four matching models, namely surface, shallow, deep, and intensive, defined with the goal of providing a wide spectrum of metrics suited for dealing with the inherent dynamism of open systems, such as P2P systems.

4.3.1 The H-Match matching process

We define ontology matching in H-Match as a process that takes two ontologies as inputand returns the mappings that identify corresponding concepts in the two ontologies,namely the concepts with the same or the closest intended meaning, by taking into ac-count their descriptions and constraints in terms of properties and semantic relations.We define a H-Match mapping as a correspondence between a concept of the first on-tology and one or more concepts of the second ontology. Ontology mappings areestablished after an analysis of the similarity of the concepts in the compared ontolo-gies. In H-Match, we perform similarity analysis through affinity metrics to determinea measure of semantic affinity SA in the range [0,1]. A threshold-based mechanism isenforced to set the minimum level of semantic affinity required to consider two con-cepts as matching concepts.

Given two concepts c and c′, H-Match calculates a semantic affinity value SA(c,c′) as the linear combination of a linguistic affinity value LA(c,c′) and a contextual affinity value CA(c,c′). Linguistic and contextual affinity functions are computed by exploiting linguistic and contextual features of c and c′, respectively.

4.3.2 Linguistic features

Linguistic features refer to concept names and their corresponding meaning. For the linguistic affinity evaluation, H-Match relies on a thesaurus Th of terms and terminological relationships automatically extracted from the WordNet lexical system [Miller, 1995]. In particular, the thesaurus is created by exploiting a subset of the terminological relationships defined in WordNet, namely synonymy, hypernymy/hyponymy, meronymy, and coordinate terms. Furthermore, an appropriate procedure for handling compound terms that are not included in WordNet is also defined [Castano et al., 2004d].

The thesaurus Th is organized as a graph, where the nodes represent terms and the edges represent terminological relationships. Terminological relationships represented in the thesaurus are SYN, BT, NT, and RT. SYN (synonymy) denotes that two terms have the same meaning. BT (broader term) (resp., NT (narrower term)) denotes that a term has a more (resp., less) general meaning than another term. Finally, RT (related terms) denotes that two terms have a generic positive relationship. These terminological relationships are derived from the relationships extracted from WordNet during the thesaurus construction process. In particular: a WordNet synonymy relation is represented through a SYN terminological relationship; a WordNet hypernymy (resp., hyponymy) relation is represented through a BT (resp., NT) terminological relationship; the meronymy and coordinate terms relations of WordNet are represented through an RT relationship in the thesaurus. Finally, a weight $w_{tr}$ is associated with each terminological relationship $tr \in \{SYN, BT/NT, RT\}$ in the thesaurus. Such a weight expresses the implication of the terminological relationship for semantic affinity. Different types of relationships have different implications for semantic affinity, with $w_{SYN} \ge w_{BT/NT} \ge w_{RT}$. In fact, synonymy is generally considered a more precise indicator of affinity than hierarchical relationships, and consequently $w_{SYN} \ge w_{BT/NT}$. The lowest weight is associated with RT, since it denotes a more generic relationship than the hierarchical relationships BT/NT.

4.3.3 Contextual features

Contextual features of a concept c refer to the properties of c and to the concepts directly related to c through a semantic relation. Contextual information is particularly important for meaning disambiguation, especially in open systems where the same term can be used in different domains by different independent peers.

Given a concept c, we denote by P(c) the set of properties of c, and by C(c) the set of concepts that participate in a semantic relation with c, referred to in the following as adjacents. The context of a concept in H-Match is defined as the union of the properties and of the adjacents of c, that is, Ctx(c) = P(c) ∪ C(c). Like linguistic features, contextual features are also weighted in H-Match. In particular, we associate a weight $w_{sp}$ with strong properties and a weight $w_{wp}$ with weak properties, with $w_{sp} \ge w_{wp}$, to capture the different importance each kind of property has in characterizing the concept. In fact, strong properties are mandatory properties related to a concept, and they are considered more relevant in contributing to the concept description. Weak properties are optional for the concept in describing its structure and, as such, are less important in characterizing the concept than strong properties. Each semantic relation is associated with a weight $w_{sr}$, which expresses the strength of the connection established by the relation between the involved concepts. The greater the weight associated with a semantic relation, the higher the strength of the semantic connection between concepts. For this reason, we define $w_{same\text{-}as} \ge w_{kind\text{-}of}$.

4.3.4 Basic matching functions of H-Match

The H-Match semantic matchmaker relies on some basic matching functions in order to evaluate the similarity/compatibility of terms, datatypes, properties, and semantic relations, respectively.

Term affinity function. The term affinity function A(t,t′) → [0,1] evaluates the affinity between two terms t and t′ by exploiting the thesaurus Th. A(t,t′) is equal to the value of the highest-strength path of terminological relationships between t and t′ in Th if at least one path exists, and is zero otherwise. The strength of a path is computed by multiplying the weights associated with each terminological relationship involved in the path, that is:

$$
A(t,t') =
\begin{cases}
\max_{i=1,\dots,k} \{ w_{t \rightarrow_i^n t'} \} & \text{if } k \ge 1 \\
0 & \text{otherwise}
\end{cases}
\tag{4.1}
$$

where: $k$ is the number of paths between $t$ and $t'$ in $Th$; $t \rightarrow_i^n t'$ denotes the $i$-th path, of length $n \ge 1$; $w_{t \rightarrow_i^n t'} = w_{1_{tr}} \cdot w_{2_{tr}} \cdots w_{n_{tr}}$ is the weight associated with the $i$-th path; and $w_{j_{tr}}$, $j = 1, 2, \dots, n$, denotes the weight associated with the $j$-th terminological relationship in the path.
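The highest-strength path of (4.1) can be found without enumerating all paths. The sketch below is one possible way to do it, not necessarily the H-Match implementation: since every weight lies in (0,1], extending a path never increases its strength, so a Dijkstra-style search that always expands the strongest partial path first is sufficient. The adjacency-list layout of the thesaurus and the function name `term_affinity` are assumptions; identical terms are given affinity 1.0, in line with the example of Section 4.3.8.

```python
import heapq


def term_affinity(thesaurus, t1, t2):
    """Return the strength of the strongest path between t1 and t2, or 0.0.

    thesaurus maps a term to a list of (neighbor, weight) pairs, with every
    weight in (0, 1]. Identical terms are given affinity 1.0 (assumption).
    """
    if t1 == t2:
        return 1.0
    best = {t1: 1.0}
    heap = [(-1.0, t1)]  # max-heap on path strength, via negated values
    while heap:
        neg_strength, term = heapq.heappop(heap)
        strength = -neg_strength
        if term == t2:
            return strength
        if strength < best.get(term, 0.0):
            continue  # stale heap entry, a stronger path was found already
        for neighbor, weight in thesaurus.get(term, []):
            candidate = strength * weight
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return 0.0
```

With a BT/NT weight of 0.8 and an RT weight of 0.5, a two-step hierarchical path (0.8 · 0.8 = 0.64) is preferred over a direct RT edge (0.5).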

Datatype compatibility function. The datatype compatibility function T(dt,dt′) → [0,1] is defined to evaluate the compatibility of the datatypes of two concept properties according to a pre-defined set CR of compatibility rules. T(dt,dt′) returns 1 if dt and dt′ are compatible according to CR, and 0 otherwise, that is:

$$
T(dt,dt') =
\begin{cases}
1 & \text{iff } \exists \text{ a compatibility rule for } dt, dt' \text{ in } CR \\
0 & \text{otherwise}
\end{cases}
\tag{4.2}
$$

For instance, with reference to XML Schema datatypes (which are relevant for OWL ontology matching), examples of compatibility rules that have been defined are: xsd:integer ⇔ xsd:int, xsd:integer ⇔ xsd:float, xsd:decimal ⇔ xsd:float, xsd:short ⇔ xsd:int.
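A minimal sketch of the datatype compatibility function follows, using only the four example rules listed above. Storing each rule as an unordered pair and treating identical datatypes as compatible are assumptions made for illustration; the names `COMPATIBILITY_RULES` and `datatype_compatibility` are hypothetical.

```python
# Compatibility rules CR, stored as unordered pairs since each rule is symmetric.
COMPATIBILITY_RULES = {
    frozenset({"xsd:integer", "xsd:int"}),
    frozenset({"xsd:integer", "xsd:float"}),
    frozenset({"xsd:decimal", "xsd:float"}),
    frozenset({"xsd:short", "xsd:int"}),
}


def datatype_compatibility(dt1, dt2, rules=COMPATIBILITY_RULES):
    """Return 1 if a rule in CR covers the pair (dt1, dt2), else 0.

    Identical datatypes are treated as compatible, which is an assumption:
    the text only lists rules between distinct datatypes.
    """
    if dt1 == dt2:
        return 1
    return 1 if frozenset({dt1, dt2}) in rules else 0
```

Note that the lookup is not transitive: xsd:short and xsd:float are reported as incompatible unless a dedicated rule is added to the set.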

4.3.5 Property and semantic relation closeness function

The closeness function C(e,e′) → [0,1] is defined to calculate a measure of the closeness between two elements e and e′ of concept contexts. Depending on the way concept contexts are defined in each respective ontology, e and e′ can be two properties, two semantic relations, or a semantic relation and a property. C(e,e′) exploits the weights associated with context elements and returns a value in the range [0,1] equal to the complement of the absolute value of the difference between the weights associated with the elements, that is:

$$
C(e,e') = 1 - |W_e - W_{e'}| \tag{4.3}
$$

where $W_e$ and $W_{e'}$ are the weights associated with $e$ and $e'$, respectively. For any pair of elements e and e′, the highest value (i.e., 1.0) is obtained when the weights of e and e′ coincide. The higher the difference between $W_e$ and $W_{e'}$, the lower the closeness value of e and e′.

4.3.6 The H-Match matching models

Linguistic and contextual features can be combined in different ways in H-Match to provide a comprehensive semantic affinity evaluation between two concepts c and c′. To this end, four matching models, namely surface, shallow, deep, and intensive, are defined for dynamically configuring H-Match for its adaptation to the semantic complexity of the ontologies to be compared.

Each model in H-Match calculates a semantic affinity value SA(c,c′) of two concepts c and c′, which expresses their level of matching. SA(c,c′) is produced by considering linguistic and/or contextual features of concept descriptions. In a given matching model, the impact of linguistic and contextual features can be dynamically established by properly setting the linguistic affinity weight $w_{la} \in [0,1]$ in the semantic affinity evaluation process.

Surface matching. The surface matching model is defined to take into account only the linguistic features of concept descriptions. Surface matching addresses the requirement of dealing with high-level, poorly structured ontological descriptions. Given two concepts c and c′, surface matching provides a measure SA(c,c′) of their semantic affinity determined only on the basis of their names using the term affinity function (4.1), that is:

$$
SA(c,c') \equiv A(n_c, n_{c'}) \tag{4.4}
$$

where $n_c$ and $n_{c'}$ are the names of $c$ and $c'$, respectively.

Shallow matching. The shallow matching model is defined to take into account concept names and concept properties. With this model, we want a more accurate level of matching, by taking into account not only the linguistic features but also information about the presence of properties and about their cardinality constraints. For property comparison, each property $p_i \in P(c)$ is matched against all properties $p_j \in P(c')$ using (4.1) and (4.3), and the best matching value $m(p_i)$ is considered for the evaluation of SA(c,c′), as follows:

$$
m(p_i) = \max \{ A(n_{p_i}, n_{p_j}) \cdot C(p_i, p_j) \}, \quad \forall p_j \in P(c') \tag{4.5}
$$

where $n_{p_i}$ and $n_{p_j}$ denote the names of $p_i$ and $p_j$, respectively. SA(c,c′) is evaluated by the shallow matching model as the weighted sum of the linguistic affinity of c and c′, calculated using (4.1), and of their contextual affinity, calculated as the arithmetic mean of the best matching values of the properties computed using (4.5), that is:

$$
SA(c,c') = w_{la} \cdot A(n_c, n_{c'}) + (1 - w_{la}) \cdot \frac{\sum_{i=1}^{|P(c)|} m(p_i)}{|P(c)|} \tag{4.6}
$$
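The shallow model can be sketched in Python as follows, under stated assumptions: properties are represented as a mapping from property name to weight (strong or weak), the term affinity function is passed in as a callable, and a concept without properties contributes a contextual affinity of 0, since (4.6) is undefined for |P(c)| = 0. The function names are hypothetical.

```python
def closeness(weight_e, weight_e2):
    """Closeness function (4.3): 1 - |W_e - W_e'|."""
    return 1.0 - abs(weight_e - weight_e2)


def shallow_sa(name_c, props_c, name_c2, props_c2, term_affinity, w_la=0.5):
    """Semantic affinity of two concepts under the shallow model (4.6).

    props_c and props_c2 map property names to their weights (strong or
    weak). term_affinity is any callable implementing A(t, t').
    """
    linguistic = term_affinity(name_c, name_c2)
    if not props_c:
        # Assumption: no properties means a contextual affinity of 0.
        return w_la * linguistic
    # Best matching value m(p_i) of each property of c, as in (4.5).
    best = [
        max((term_affinity(n1, n2) * closeness(w1, w2)
             for n2, w2 in props_c2.items()), default=0.0)
        for n1, w1 in props_c.items()
    ]
    return w_la * linguistic + (1.0 - w_la) * sum(best) / len(best)
```

The deep model (4.8) follows the same pattern, with the elements of Ctx(c) in place of the properties of P(c).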

Deep matching. The deep matching model is defined to take into account concept names and the whole context of concepts, that is, both properties and semantic relations. Each element $e_i \in Ctx(c)$ (i.e., a property or an adjacent) is compared against all elements $e_j \in Ctx(c')$ using (4.1) and (4.3), and the best matching value $m(e_i)$ is considered for the evaluation of SA(c,c′), as follows:

$$
m(e_i) = \max \{ A(n_{e_i}, n_{e_j}) \cdot C(e_i, e_j) \}, \quad \forall e_j \in Ctx(c') \tag{4.7}
$$

where $n_{e_i}$ and $n_{e_j}$ denote the names of $e_i$ and $e_j$, respectively. With the deep matching model, SA(c,c′) is evaluated as the weighted sum of the linguistic affinity of c and c′, calculated using (4.1), and of their contextual affinity, calculated as the average matching value for the elements of the context of c using (4.7), that is:

$$
SA(c,c') = w_{la} \cdot A(n_c, n_{c'}) + (1 - w_{la}) \cdot \frac{\sum_{i=1}^{|Ctx(c)|} m(e_i)}{|Ctx(c)|} \tag{4.8}
$$

Intensive matching. The intensive matching model is defined to take into account concept names, the whole context of concepts, and also property values, in order to exhibit the highest accuracy in semantic affinity evaluation. In fact, by adopting the intensive model, not only the presence and cardinality of properties, but also their values are considered to produce the resulting semantic affinity value. Given two concepts c and c′, the intensive matching model calculates a comprehensive matching value for the elements of the context of c as in (4.7), as well as a matching value $v(p_i)$ for each property $p_i \in P(c)$. The matching value $v(p_i)$ is calculated as the highest value obtained by composing the affinity of the name $n_{p_i}$ and the value $v_{p_i}$ of $p_i$ with the name $n_{p_j}$ and the value $v_{p_j}$ of each property $p_j \in P(c')$, respectively. For the comparison of property values, we exploit the term affinity function (4.1) if the property value is the name of a referenced concept, and the datatype compatibility function (4.2) if the property value is a datatype, that is:

$$
v(p_i) =
\begin{cases}
\max \{ A(n_{p_i}, n_{p_j}) \cdot A(v_{p_i}, v_{p_j}) \}, \ \forall p_j \in P(c') & \text{iff } v_{p_i} \text{ is a reference name} \\
\max \{ A(n_{p_i}, n_{p_j}) \cdot T(v_{p_i}, v_{p_j}) \}, \ \forall p_j \in P(c') & \text{iff } v_{p_i} \text{ is a datatype}
\end{cases}
\tag{4.9}
$$

SA(c,c′) is evaluated by the intensive matching model as the weighted sum of the linguistic affinity of c and c′, calculated using (4.1), and of their contextual affinity, calculated as the average of the matching values for the elements of the context of c using (4.7) and for the property values calculated using (4.9), that is:

$$
SA(c,c') = w_{la} \cdot A(n_c, n_{c'}) + (1 - w_{la}) \cdot \frac{\sum_{i=1}^{|Ctx(c)|} m(e_i) + \sum_{j=1}^{|P(c)|} v(p_j)}{|Ctx(c)| + |P(c)|} \tag{4.10}
$$

A summary of the H-Match functions and techniques is provided in Appendix B.

4.3.7 Matching policy

The execution of the H-Match semantic matchmaker is configured through a matching policy that is defined as a triple of the form $\langle mm, w_{la}, t \rangle$, where:

• $mm \in \{$surface, shallow, deep, intensive$\}$ denotes the matching model to use;

• $w_{la} \in [0,1]$ denotes the linguistic affinity weight;

• $t \in (0,1]$ denotes the value of the matching threshold that is used to cut off the mappings having a low semantic affinity value (i.e., pairs of concepts that are considered poorly similar).
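The matching policy can be represented as a small immutable record that validates the three ranges stated above. This is only an illustrative sketch; the class name `MatchingPolicy` and the choice of raising `ValueError` on invalid values are assumptions.

```python
from dataclasses import dataclass

MATCHING_MODELS = ("surface", "shallow", "deep", "intensive")


@dataclass(frozen=True)
class MatchingPolicy:
    mm: str     # matching model to use
    wla: float  # linguistic affinity weight, in [0, 1]
    t: float    # matching threshold, in (0, 1]

    def __post_init__(self):
        if self.mm not in MATCHING_MODELS:
            raise ValueError(f"unknown matching model: {self.mm}")
        if not 0.0 <= self.wla <= 1.0:
            raise ValueError("wla must be in [0, 1]")
        if not 0.0 < self.t <= 1.0:
            raise ValueError("t must be in (0, 1]")
```

For instance, `MatchingPolicy("shallow", 0.5, 0.6)` is accepted, while a threshold of 0 is rejected because the interval for t is open on the left.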

4.3.8 Example

As an H-Match example, we consider the scenario of Figure 4.1, where the concept Book in the query Q1 is matched against the concept Volume in the peer B ontology. The result of the H-Match execution with the intensive matching model and $w_{la} = 0.5$ is represented in Figure 4.3. In this example, the following weights are used: $w_{SYN} = 1$, $w_{BT/NT} = 0.8$, $w_{RT} = 0.5$, $w_{sp} = 1.0$, $w_{wp} = 0.5$, $w_{same\text{-}as} = 1.0$, and $w_{kind\text{-}of} = 0.8$. These weights are the default values in H-Match. They have been selected after extensive experimentation on several ontology matching cases, by choosing as default values those that exhibited the best behavior in most cases.

Figure 4.3: An H-Match example (the concepts Book and Volume with their contexts, annotated with LA(Book,Volume) = 1, m(Publication) = 1, m(author) = 1, and v(author) = 0)

By exploiting WordNet, we obtain that the terms Book and Volume are synonyms; the terminological relationship Book SYN Volume is therefore inserted in the thesaurus. As a consequence, we have that LA(Book,Volume) = 1. As for the contextual affinity, we note that Ctx(Book) contains the string property author and the kind-of semantic relation with the concept Publication. Each element in Ctx(Book) is compared with all the elements in Ctx(Volume). As a result, the best matches for the property author ∈ Ctx(Book) and for the semantic relation Publication ∈ Ctx(Book) are the property author ∈ Ctx(Volume) and the semantic relation Publication ∈ Ctx(Volume), respectively. In particular, A(author,author) = 1 and A(Publication,Publication) = 1, as WordNet returns a synonym relationship when a term is compared with itself. Moreover, m(author ∈ Ctx(Book)) = A(author,author) · C(author,author) = 1, as both author ∈ Ctx(Book) and author ∈ Ctx(Volume) are strong properties and thus C(author,author) = 1. Similarly, m(Publication ∈ Ctx(Book)) = A(Publication,Publication) · C(Publication,Publication) = 1, as both Publication ∈ Ctx(Book) and Publication ∈ Ctx(Volume) are kind-of semantic relations and thus C(Publication,Publication) = 1. Finally, v(author ∈ Ctx(Book)) = A(author,author) · T(string,Person) = 0, as string and Person are not compatible datatypes. According to the intensive matching model (Formula 4.10), H-Match returns:

$$
SA(\mathit{Book},\mathit{Volume}) = 0.5 \cdot 1 + 0.5 \cdot \frac{1 + 1 + 0}{2 + 1} = 0.83
$$
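The arithmetic of the example can be checked with a few lines of Python. The helper `intensive_sa` is a hypothetical name; it only combines already computed matching values according to (4.10) and does not perform the matching itself.

```python
def intensive_sa(linguistic, context_matches, value_matches, w_la):
    """Combine affinities according to (4.10).

    linguistic       -- A(n_c, n_c')
    context_matches  -- m(e_i) for each element of Ctx(c)
    value_matches    -- v(p_j) for each property in P(c)
    """
    total = sum(context_matches) + sum(value_matches)
    count = len(context_matches) + len(value_matches)
    contextual = total / count if count else 0.0
    return w_la * linguistic + (1.0 - w_la) * contextual


# Values taken from the Book/Volume example.
sa = intensive_sa(
    linguistic=1.0,               # LA(Book, Volume)
    context_matches=[1.0, 1.0],   # m(author), m(Publication)
    value_matches=[0.0],          # v(author)
    w_la=0.5,
)
print(round(sa, 2))  # 0.83
```

The denominator is 3 because Ctx(Book) has two elements (the property author and the relation to Publication) and P(Book) has one.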

4.3.9 Considerations on H-Match

H-Match has been extensively tested on several real ontology matching cases in order to evaluate the matching models with respect to performance and quality of results [Castano et al., 2006a,d]. By analyzing the obtained results, we note that the most accurate and precise results are achieved with the deep and intensive matching models, provided that the ontology descriptions are detailed enough. On the other hand, the best performance in terms of computation time is achieved with the surface and shallow matching models. For semantic routing purposes, the computation time of the semantic affinity evaluation is a crucial factor, and the evaluation needs to be performed as fast as possible in order to avoid bottlenecks. To this end, some loss of matching precision and accuracy can be accepted in exchange for rapid response time during the semantic affinity evaluation. For this reason, the shallow matching model is selected to work with H-Link for identifying the semantic neighbors that have the highest chance to provide relevant knowledge with respect to a given query (see Section 4.4). We note that a number of ontology matching tools and techniques are available for providing semantic affinity evaluation of two different input ontologies [Giunchiglia and Shvaiko, 2003; Noy and Musen, 2003; Euzenat et al., 2004; Ehrig and Staab, 2004]. However, not all the existing tools are suited to work in open distributed systems, due to their limitations in terms of high computation times and restricted support for different ontology specification languages. As a consequence, the choice of H-Match to enforce the H-Link semantic routing functionalities is motivated by the following considerations: i) H-Match can deal with different ontology specification languages; ii) H-Match has been specifically conceived to work in open distributed systems by relying on its flexible matching models, which make it possible to dynamically configure the tradeoff between performance and accuracy according to the requirements of the considered matching scenario.

Provided that a dynamic and flexible configuration is supported, other existing matching tools can nevertheless be used in place of H-Match to enforce the H-Link semantic routing mechanism [Shvaiko, 2006]. In the remainder of the thesis, we assume to work with H-Match for supporting semantic routing in H-Link.

4.4 The H-Link routing mechanism

The H-Link semantic routing mechanism is based on the idea of exploiting the network knowledge layer of a peer ontology by using the H-Match semantic matchmaker for providing query routing support according to semantic neighbor contents.

We consider a discovery query Q with a target concept tc (footnote 2). Two different roles can be distinguished:

• Requesting peer. The peer p needs to submit to the network a discovery query Q in order to identify relevant partners for subsequent resource sharing. To this end, peer p invokes H-Match to compare the target concept tc against the content knowledge layer of its peer ontology O. A list $MCL = \langle c_1, SA(tc,c_1) \rangle \dots \langle c_n, SA(tc,c_n) \rangle$ of matching concepts $c_1 \dots c_n \in O$ and corresponding semantic affinity values $SA(tc,c_1) \dots SA(tc,c_n)$ is returned as a result. Peer p also sets the number of credits Ncr that are attached to the query Q. H-Link is then invoked by passing the list MCL to select the semantic neighbors for query Q submission.

• Receiving peer. When a peer r receives a discovery query Q together with the number of credits ncr from a requesting peer p, it needs to evaluate whether matching concepts can be provided back to peer p. To this end, H-Match is invoked by peer r and the list MCL of matching concepts is again produced as a result. If MCL ≠ ∅, the peer r sends MCL back to peer p by consuming one credit; otherwise, no reply is sent back to peer p and all the received credits are still available for forwarding. If at least one credit is available, H-Link is invoked by peer r to select the semantic neighbors for query Q forwarding; otherwise, the propagation mechanism stops. Each query is identified through a unique timestamp. The timestamp is exploited for enabling a peer to discard duplicate incoming queries. Query replies are returned to the requesting peer by following the reverse query path in order to avoid a sudden burst of incoming messages, as suggested in Borch and Vognild [2004].

2 For the sake of clarity, we consider the case of a single target concept in the query. The H-Link semantic routing mechanism can be easily extended to consider the case of multiple target concepts as query contents.

4.4.1 H-Link invocation

H-Link is invoked for both query submission and query forwarding, provided that at least one credit is still available. Three main steps define H-Link: selection of semantic neighbors; ranking of semantic neighbors; distribution of credits.

1- Selection of semantic neighbors. The network knowledge layer of the peer ontology is accessed to select the network concepts, together with the associated confidence values, that are connected to the concepts in MCL through a location relation. A list SNL of semantic neighbors is returned as a result. A semantic neighbor sn ∈ SNL is described in the form $sn = \langle n_{nc}, \{c_1, cf_1 \dots c_m, cf_m\} \rangle$, where $n_{nc}$ is the identifying label of the network concept nc featuring sn, while $c_1 \dots c_m \in MCL$ are the concepts of MCL connected to nc through a location relation, and $cf_1 \dots cf_m$ are the corresponding confidence values.

2- Ranking of semantic neighbors. The semantic neighbors in SNL are ranked with respect to their relevance for the query target tc. To this end, the harmonic mean is used to combine the confidence values associated with the semantic neighbors in SNL and the semantic affinity values in MCL. Given a semantic neighbor sn ∈ SNL, the ranking value $r_{sn}$ is given by the following formula:

$$
r_{sn} = \frac{\sum_{i=1}^{m} \frac{2 \cdot cf_i \cdot SA(tc,c_i)}{cf_i + SA(tc,c_i)}}{m} \tag{4.11}
$$

Finally, a ranked list $RSNL = \langle sn_1, r_{sn_1} \rangle \dots \langle sn_k, r_{sn_k} \rangle$ of semantic neighbors with the corresponding ranking values is returned as a result. A threshold mechanism can be used to rule out the semantic neighbors with a ranking value lower than a predefined threshold t.

3- Distribution of credits. The semantic neighbors in RSNL determine the recipients of the query Q. The available credits $A_{cr}$ are distributed to the semantic neighbors in RSNL proportionally to their ranking values. The number of credits $ncr_{sn}$ assigned to the semantic neighbor sn ∈ RSNL is computed as follows:

$$
ncr_{sn} = \frac{A_{cr}}{\sum_{\forall sn_i \in RSNL} r_{sn_i}} \cdot r_{sn} \tag{4.12}
$$

The value $ncr_{sn}$ computed according to (4.12) is approximated to the nearest integer value. Finally, a query distribution list $QDL = \langle sn_1, ncr_{sn_1} \rangle \dots \langle sn_k, ncr_{sn_k} \rangle$ is returned as a result.
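The ranking and credit distribution steps can be sketched as follows. The function names are hypothetical, and two details are assumptions not fixed by the text: a pair with cf = SA = 0 contributes 0 to the ranking, and "nearest integer" is implemented as round-half-up.

```python
import math


def rank(pairs):
    """Ranking value (4.11): mean of the harmonic means of (cf_i, SA_i).

    pairs is a non-empty list of (cf, sa) tuples, one per concept of MCL
    connected to the semantic neighbor through a location relation.
    """
    total = 0.0
    for cf, sa in pairs:
        if cf + sa > 0:  # assumption: a (0, 0) pair contributes 0
            total += 2 * cf * sa / (cf + sa)
    return total / len(pairs)


def distribute_credits(ranked, available):
    """Credits (4.12) for each (neighbor, ranking) pair, rounded half up."""
    ranking_sum = sum(r for _, r in ranked)
    if ranking_sum == 0:
        return []
    return [(sn, math.floor(available * r / ranking_sum + 0.5))
            for sn, r in ranked]
```

Because each share is rounded independently, the assigned credits do not necessarily add up to the available credits: three equally ranked neighbors sharing 10 credits receive 3 credits each.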

4.4.2 The H-Link algorithm

The H-Link algorithm is shown in Figure 4.4, where four parameters, namely the discovery query Q, the peer ontology O, the matching policy MP, and the available credits Acr, are defined as input. In particular, the discovery query Q and the peer ontology O are defined as a list of target concepts $tc_1 \dots tc_k$ and as a list of ontology concepts $c_1 \dots c_n$, respectively, with a possible context composed of properties and/or semantic relations. The matching policy provides the configuration for the H-Match semantic matchmaker (see Section 4.3) and is defined as a triple of the form $\langle mm, w_{la}, t \rangle$, where mm is the H-Match model to use, $w_{la}$ is the linguistic affinity weight, and t is the threshold indicating the minimum level of semantic affinity for considering two concepts as matching concepts. Finally, Acr is defined as an integer value indicating the number of credits available for distribution by H-Link. The first step of the H-Link algorithm is the computation of the matching concept list MCL through the ComputeMCL function (line 1). If MCL ≠ ∅, the query recipients are selected by computing the list of semantic neighbors SNL through the ComputeSNL function (line 3) and the ranked list of semantic neighbors RSNL through the ComputeRSNL function (line 4). Otherwise (i.e., MCL = ∅), the query recipients are selected according to the expertise values of the network concepts (i.e., semantic neighbors) in the network knowledge layer of the peer ontology. In this case, RSNL is computed through the ComputeExpertise function (line 7). Finally, the query distribution list QDL is defined through the ComputeQDL function by assigning the credits Acr according to the ranking values in RSNL.

Input:
  Q: target query containing a list of target concepts tc1...tck with related context
  O: peer ontology
  MP: matching policy containing:
    mm: matching model
    wla: linguistic affinity weight
    t: threshold
  Acr: credits available for distribution

Output:
  QDL: query distribution list (containing pairs in the form <sn, ncrsn>), where
    sn is a semantic neighbor (query recipient)
    ncrsn is the number of credits assigned to sn

Method:
      // Compute the list of matching concepts MCL
(1)   MCL = ComputeMCL(Q, O, MP);
(2)   If MCL ≠ ∅ Then:
        // Compute the list of semantic neighbors SNL
(3)     SNL = ComputeSNL(MCL, O);
        // Compute the ranking of semantic neighbors
(4)     RSNL = ComputeRSNL(SNL, MCL, MP.t);
(5)   Else:
(6)     // Compute the ranking of semantic neighbors according to peer expertise
(7)     RSNL = ComputeExpertise(O, MP.t);
(8)   End If;
      // Distribute credits to query recipients
(9)   QDL = ComputeQDL(RSNL, Acr);
(10)  Return QDL

Figure 4.4: The H-Link algorithm

In the following, we describe the functions that compose the H-Link algorithm, namely ComputeMCL, ComputeSNL, ComputeRSNL, ComputeExpertise, and ComputeQDL.
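The control flow of Figure 4.4 translates almost line by line into Python. In the sketch below, the five component functions are passed in as callables so that the block stays self-contained; this injection style and the name `hlink` are illustrative choices, not part of the original algorithm. The `policy` argument is assumed to expose the threshold as an attribute `t`.

```python
def hlink(query, ontology, policy, credits,
          compute_mcl, compute_snl, compute_rsnl,
          compute_expertise, compute_qdl):
    """Top-level control flow of Figure 4.4 with the five steps injected."""
    mcl = compute_mcl(query, ontology, policy)
    if mcl:
        # Matching concepts found: select and rank the semantic neighbors.
        snl = compute_snl(mcl, ontology)
        rsnl = compute_rsnl(snl, mcl, policy.t)
    else:
        # No matching concept: fall back to the expertise of the neighbors.
        rsnl = compute_expertise(ontology, policy.t)
    return compute_qdl(rsnl, credits)
```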

The ComputeMCL function. As shown in Figure 4.5, the ComputeMCL function receives as input the discovery query Q, the peer ontology O, and the matching policy MP. The matching concept list MCL is returned as output. The H-Match semantic matchmaker has the responsibility of evaluating the level of semantic affinity between the target concepts in Q and the concepts in O according to the matching policy MP. To this end, the hmatch function is invoked for each target concept tc ∈ Q and each concept c ∈ O, returning as a result the semantic affinity evaluation SA between tc and c (lines 2-3; see footnote 3). Finally, the pair (c,SA) is inserted in MCL when the semantic affinity SA returned by H-Match is higher than the threshold MP.t (lines 4-5).

3 Optimization strategies are also defined in the H-Match semantic matchmaker in order to reduce the number of comparisons in case of long lists of concepts. For a detailed description of the H-Match optimization strategies, the reader can refer to Ferrara [2006].

Input:
  Q: target query containing a list of target concepts tc1...tck with related context
  O: peer ontology
  MP: matching policy containing:
    mm: matching model
    wla: linguistic affinity weight
    t: threshold

Output:
  MCL: matching concept list

Method:
(1)   MCL = ∅;
      // Fill MCL with H-Match results
(2)   For Each tc ∈ Q and c ∈ O:
(3)     SA = hmatch(tc, c, MP);
(4)     If SA > MP.t Then:
(5)       MCL.Add(c, SA);
(6)     End If;
(7)   End For Each;
(8)   Return MCL

Figure 4.5: The ComputeMCL function
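A Python rendering of the ComputeMCL function could look as follows. The `hmatch` parameter is a placeholder callable standing in for the H-Match invocation already configured by the matching policy, and the threshold is passed separately; both are simplifications made for this sketch.

```python
def compute_mcl(target_concepts, ontology_concepts, hmatch, threshold):
    """Return the (concept, SA) pairs whose affinity exceeds the threshold.

    hmatch is any callable hmatch(tc, c) -> float in [0, 1]; here it stands
    in for the H-Match invocation configured by the matching policy.
    """
    mcl = []
    for tc in target_concepts:
        for c in ontology_concepts:
            sa = hmatch(tc, c)
            if sa > threshold:  # strict comparison, as in Figure 4.5
                mcl.append((c, sa))
    return mcl
```

The nested loop performs |Q| · |O| comparisons, which is the cost that the optimization strategies mentioned in footnote 3 aim to reduce.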

The ComputeSNL function. As shown in Figure 4.6, the ComputeSNL function receives as input the matching concept list MCL and the peer ontology O. The semantic neighbor list SNL is returned as output. SNL is defined as a list where each semantic neighbor sn corresponds to a network concept in the peer ontology O and is defined as a pair sn = (name, LR). In particular, name is the identifying label of the semantic neighbor and LR is the set of location relations connected to sn. The ComputeSNL function is defined to fill SNL with each network concept nc that is connected with a concept c ∈ MCL through a location relation lr(c, nc, cf), where cf is the confidence value associated with lr. To this end, for each concept c ∈ MCL, the peer ontology O is exploited by the GetNextLR method. Given a concept c, the GetNextLR method is executed for each location relation lr that is connected to c (line 3). Given a location relation lr, if lr.nc has already been inserted in SNL as a semantic neighbor, only the location relation lr is added to sn.LR (line 10). Otherwise, a new semantic neighbor is created and inserted in SNL by assigning sn.name = lr.nc and by adding lr to sn.LR (lines 5-8).

Input:
  MCL: matching concept list
  O: peer ontology

Output:
  SNL: semantic neighbor list

Method:
(1)   SNL = ∅;
      /* Fill SNL with the relevant semantic neighbors by exploiting
         location relations and related confidence values */
(2)   For Each c ∈ MCL:
        /* lr is a location relation where:
           lr.c is the concept in the content layer
           lr.nc is the network concept in the network layer
           lr.cf is the associated confidence value */
(3)     While (lr = O.GetNextLR(c) Is Not Null):
          /* sn is a semantic neighbor where:
             sn.name is the identifying label of the semantic neighbor
             sn.LR is the list of associated location relations */
(4)       If (sn = SNL.Find(lr.nc) Is Null) Then:
            // Create a new semantic neighbor
(5)         new sn;
(6)         sn.name = lr.nc;
(7)         sn.LR.Add(lr);
(8)         SNL.Add(sn);
(9)       Else:
(10)        sn.LR.Add(lr);
(11)      End If;
(12)    End While;
(13)  End For Each;
(14)  Return SNL

Figure 4.6: The ComputeSNL function

The ComputeRSNL function. As shown in Figure 4.7, the ComputeRSNL function receives as input the semantic neighbor list SNL, the matching concept list MCL, and the threshold t. The ranked semantic neighbor list RSNL is returned as output. The ComputeRSNL function is defined to fill RSNL with the semantic neighbors selected as query recipients and their corresponding ranking values. For each semantic neighbor sn ∈ SNL, the rank function is invoked to compute the ranking value rsn (line 3). To this end, the rank function exploits the confidence values associated with the location relations in sn.LR and the semantic affinity values SA ∈ MCL according to formula (4.11). Finally, the pair (sn.name, rsn) is inserted in RSNL when the ranking value rsn is higher than the threshold t received as input (line 5).

Input:
  SNL: semantic neighbor list
  MCL: matching concept list
  t: threshold

Output:
  RSNL: ranked semantic neighbor list

Method:
(1)   RSNL = ∅;
      // Fill RSNL with semantic neighbors and associated ranking values
(2)   For Each sn ∈ SNL:
(3)     rsn = rank(sn.LR, MCL);   // rsn is the ranking value of sn
(4)     If rsn > t Then:
(5)       RSNL.Add(sn.name, rsn);
(6)     End If;
(7)   End For Each;
(8)   Return RSNL

Figure 4.7: The ComputeRSNL function
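The ComputeSNL and ComputeRSNL functions can be sketched together in Python. Instead of the GetNextLR iterator, the sketch receives the location relations of the peer ontology as a plain list and filters it, which yields the same grouping; this representation, the `LocationRelation` tuple, and the use of a dictionary for SNL are assumptions. A single target concept is assumed, so that each concept appears at most once in MCL.

```python
from collections import namedtuple

# lr.c: content-layer concept, lr.nc: network concept, lr.cf: confidence value
LocationRelation = namedtuple("LocationRelation", "c nc cf")


def compute_snl(mcl, location_relations):
    """Group the location relations of the matching concepts by network concept."""
    matching = {c for c, _ in mcl}
    snl = {}  # maps sn.name to sn.LR
    for lr in location_relations:
        if lr.c in matching:
            snl.setdefault(lr.nc, []).append(lr)
    return snl


def compute_rsnl(snl, mcl, threshold):
    """Rank each semantic neighbor with (4.11) and keep those above threshold."""
    affinity = dict(mcl)  # concept -> SA(tc, concept)
    rsnl = []
    for name, relations in snl.items():
        # Harmonic mean of cf and SA for each location relation of the neighbor.
        terms = [2 * lr.cf * affinity[lr.c] / (lr.cf + affinity[lr.c])
                 for lr in relations if lr.cf + affinity[lr.c] > 0]
        r_sn = sum(terms) / len(relations)
        if r_sn > threshold:
            rsnl.append((name, r_sn))
    return rsnl
```

The harmonic mean penalizes unbalanced pairs: a neighbor with high confidence but low semantic affinity for the current target (or the reverse) is ranked below one for which both values are moderately high.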

The ComputeExpertise function. As shown in Figure 4.8, the ComputeExpertise function receives as input the peer ontology O and the threshold t. The ranked semantic neighbor list RSNL is returned as output. The ComputeExpertise function is

Input:
  O: peer ontology
  t: threshold

Output:
  RSNL: ranked semantic neighbor list

Method:
(1)  RSNL = ∅;
     /* Fill RSNL with the expertise values of all the network concepts in O */
     /* sn is a semantic neighbor where:
        sn.name is the semantic neighbor name
        sn.LR is the list of associated location relations */
(2)  While (sn = O.GetNextSN() Is Not Null):
(3)    ∑cf = 0;
       /* lr is a location relation where:
          lr.c is the concept in the content layer
          lr.nc is the network concept in the network layer
          lr.cf is the associated confidence value */
(4)    For Each lr ∈ sn.LR:
         // ∑cf contains the sum of all confidence values in sn.LR
(5)      ∑cf = ∑cf + lr.cf;
(6)    End For Each;
(7)    esn = ∑cf / sn.LR.count(); // esn is the expertise value of sn
(8)    If esn > t Then:
(9)      RSNL.Add(sn.name, esn);
(10)   End If;
(11) End While;
(12) Return RSNL

Figure 4.8: The ComputeExpertise function

defined to fill RSNL with the network concepts (i.e., semantic neighbors) in the network knowledge layer of the peer ontology O and their corresponding expertise values. To this end, the semantic neighbors sn ∈ O are retrieved one at a time from the peer ontology O through the GetNextSN method (line 2). The expertise esn is computed as the arithmetic mean


of the confidence values associated with the location relations lr ∈ sn.LR (lines 3-7). Finally, the pair (sn.name, esn) is inserted in RSNL when the expertise value esn is higher than the threshold t received as input (line 9).
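As a minimal sketch of the expertise computation, the following Python fragment takes the network knowledge layer as a plain dictionary from neighbor names to confidence values. The dictionary layout, the function name, and the threshold value 0.75 are assumptions made for illustration; the confidence values are those of peer B in Section 4.5.

```python
def compute_expertise(neighbors, t):
    """Return (name, expertise) pairs whose mean confidence exceeds t."""
    rsnl = []
    for name, confidences in neighbors.items():
        if not confidences:
            continue  # a neighbor without location relations has no expertise value
        esn = sum(confidences) / len(confidences)  # arithmetic mean
        if esn > t:
            rsnl.append((name, esn))
    return rsnl


# Confidence values of the location relations known by peer B.
network_layer = {
    "peer A": [0.74],
    "peer E": [0.81],
    "peer F": [0.875, 0.62],
}
print(compute_expertise(network_layer, t=0.75))
```

With this hypothetical threshold only peer E is retained, since the mean confidence of peer F (0.7475) and the confidence of peer A (0.74) are both below 0.75.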

The ComputeQDL function. As shown in Figure 4.9, the ComputeQDL function receives as input the ranked semantic neighbor list RSNL and the available credits Acr. The query distribution list QDL is returned as output. The ComputeQDL function is

Input:
  RSNL: ranked semantic neighbor list
  Acr: credits available for distribution

Output:
  QDL: query distribution list

Method:
(1) QDL = ∅; ∑r = 0;
    // ∑r contains the sum of all ranking values rsn ∈ RSNL
(2) For Each sn ∈ RSNL:
(3)   ∑r = ∑r + rsn;
(4) End For Each;
    // Fill QDL with query recipients and associated credits
(5) For Each sn ∈ RSNL:
(6)   ncrsn = distribute(rsn, Acr, ∑r); // ncrsn is the number of sn credits
(7)   QDL.Add(sn.name, ncrsn);
(8) End For Each;
(9) Return QDL

Figure 4.9: The ComputeQDL function

defined to proportionally assign the available credits Acr to each semantic neighbor sn ∈ RSNL. To this end, the ∑r variable is defined to contain the sum of the ranking values of all sn ∈ RSNL (lines 2-4). According to formula (4.12), the distribute function is subsequently exploited to perform credit assignment and to compute the number of credits ncrsn for each semantic neighbor sn (lines 5-6). Finally, the pair (sn.name, ncrsn) is inserted in QDL (line 7).
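The proportional assignment can be illustrated with the short Python sketch below. Formula (4.12) is not reproduced in this section, so the rounding rule used here (largest remainder) is an assumption, and the function name is hypothetical. With the ranking values of Table 4.1 and 5 credits, the sketch yields the same assignment as the table.

```python
import math


def compute_qdl(rsnl, acr):
    """Split acr integer credits among neighbors in proportion to their rank.

    rsnl is a list of (name, rank) pairs with unique names.
    Rounding uses the largest-remainder rule (an assumption, see lead-in).
    """
    total_rank = sum(rank for _name, rank in rsnl)
    if total_rank <= 0 or acr <= 0:
        return []
    shares = [(name, acr * rank / total_rank) for name, rank in rsnl]
    credits = {name: math.floor(share) for name, share in shares}
    leftover = acr - sum(credits.values())
    # Hand the remaining credits to the largest fractional parts first.
    by_remainder = sorted(
        shares, key=lambda item: item[1] - math.floor(item[1]), reverse=True
    )
    for name, _share in by_remainder[:leftover]:
        credits[name] += 1
    # Neighbors that end up with no credit are not query recipients.
    return [(name, n) for name, n in credits.items() if n > 0]


rsnl = [("peer A", 0.764), ("peer E", 0.611), ("peer F", 0.689)]
print(compute_qdl(rsnl, 5))
print(compute_qdl(rsnl, 2))
```

When fewer credits than neighbors are available (second call), the rule gives credits to the highest-ranked neighbors first and leaves the others out of the list.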

4.4.3 Considerations on H-Link

When MCL = ∅, that is, when the peer ontology does not contain concepts relevant to the target query, the default H-Link invocation with selection and ranking of semantic neighbors cannot be executed. In this case, credits are proportionally distributed according to the expertise measure of the semantic neighbors in the network


knowledge layer. The choice of distributing credits according to expertise measures is motivated by the assumption that a high-expertise peer is also highly queried and maintains a populated network knowledge layer in its peer ontology. High-expertise peers can thus be exploited as relays to increase the chances of reaching peers with relevant contents. As another possible solution, we are working on a back-propagation strategy to address query forwarding when MCL = ∅. In the back-propagation strategy, when a peer r receives a discovery query Q and the credits ncr from a requesting peer p and finds that MCL = ∅, the query propagation is stopped and the credits ncr are returned to peer p. This way, peer p can update its network knowledge layer by reducing the confidence measures of peer r for the query target concepts, and the returned credits ncr can be reassigned i) by increasing the number of credits distributed to other query recipients and ii) by distributing credits to those semantic neighbors that were previously excluded due to low ranking values. The back-propagation strategy for credit distribution introduces additional complexity in the query forwarding process. We plan to evaluate the effectiveness of such a strategy in future simulation experiments. The back-propagation strategy is not viable when the query is submitted by the requesting peer itself. If the requesting peer obtains MCL = ∅, only the expertise-based query propagation strategy is feasible.

As another possible scenario, the ncr credits received with a query Q may not be sufficient to assign at least one credit to each sn ∈ RSNL. In this case, credits are assigned starting from the semantic neighbors with the highest ranking values. As a result, the semantic neighbors with low ranking values can be excluded from QDL, thus reducing the number of recipients of query Q. If a duplicate of query Q is received, the attached ncr credits are used to complete the credit distribution phase by considering the previously excluded semantic neighbors. For this reason, each query is cached together with the associated QDL until a predefined deletion timeout expires.

In the H-Link ranking of semantic neighbors, the harmonic mean is used to combine the confidence values associated with the peers in SNL and the semantic affinity values in MCL. The choice of the harmonic mean is motivated by the idea of favoring the ranking value of the semantic neighbors that present similar values of confidence in SNL and semantic affinity in MCL. This way, when calculating the comprehensive ranking value of a semantic neighbor, we succeed in balancing the impact of the semantic affinity computed on the fly for the current query and the impact of the confidence values derived from past interactions.
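The effect of the harmonic mean can be checked with a small Python sketch. Formula (4.11) is not reproduced in this section; the sketch assumes that the ranking value of a neighbor is the average, over its location relations, of the harmonic mean between the confidence value and the semantic affinity of the corresponding concept. Under this assumption, the data of Section 4.5 give the ranking values reported in Table 4.1. The function names and the threshold 0.5 are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def rank(lr_list, mcl):
    """Average harmonic mean of (confidence, affinity) over the relations."""
    values = [harmonic_mean(cf, mcl[c]) for c, cf in lr_list if c in mcl]
    return sum(values) / len(values) if values else 0.0


def compute_rsnl(snl, mcl, t):
    """Keep the neighbors whose ranking value is higher than t."""
    rsnl = []
    for name, lr_list in snl.items():
        rsn = rank(lr_list, mcl)
        if rsn > t:
            rsnl.append((name, rsn))
    return rsnl


mcl = {"Volume": 0.79, "Publication": 0.49}
snl = {
    "peer A": [("Volume", 0.74)],
    "peer E": [("Publication", 0.81)],
    "peer F": [("Volume", 0.875), ("Publication", 0.62)],
}
for name, rsn in compute_rsnl(snl, mcl, t=0.5):
    print(name, round(rsn, 3))
```

For a fixed arithmetic mean, the harmonic mean is higher when the two values are close: the pair (0.7, 0.7) scores 0.7, while the pair (0.9, 0.5) scores about 0.64.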


A possible side effect of the H-Link mechanism is due to the fact that credits are distributed on the basis of the knowledge discovered during past interactions. This means that the knowledge of new peers joining the system is hardly discovered and is not considered for semantic neighbor selection. H-Link deals with this by introducing a perturbation in the credit distribution phase. As proposed in Zeinalipour-Yazti et al. [2005], a small set of random peers is picked and receives a percentage of the credits available for distribution. As a result, a larger part of the network is explored with the aim of discovering additional knowledge and including new peers in the semantic routing process. We will evaluate the impact of random peers in credit distribution with respect to the default H-Link semantic routing protocol in Chapter 5.
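The perturbation can be pictured with the following Python sketch, which reserves a share of the available credits for a few randomly picked peers before the ranking-based distribution takes place. The text does not state the reserved percentage, the number of random peers, or how the reserved credits are split among them; the parameters random_fraction and n_random and the round-robin split are therefore purely hypothetical.

```python
import random


def split_credits(acr, random_fraction, candidate_peers, n_random, rng):
    """Reserve part of the credits for randomly picked peers.

    Returns (credits left for ranking-based distribution,
             {random peer: credits}).
    """
    picked = rng.sample(candidate_peers, min(n_random, len(candidate_peers)))
    reserved = int(acr * random_fraction) if picked else 0
    random_credits = {peer: 0 for peer in picked}
    for i in range(reserved):  # round-robin split among the picked peers
        random_credits[picked[i % len(picked)]] += 1
    random_credits = {p: n for p, n in random_credits.items() if n > 0}
    return acr - reserved, random_credits


rng = random.Random(0)  # fixed seed for a repeatable example
peers = ["p1", "p2", "p3", "p4", "p5", "p6"]
remaining, random_credits = split_credits(24, 0.25, peers, 2, rng)
print(remaining, random_credits)
```

The total number of credits is preserved: the credits that are not reserved remain available for the default distribution among the ranked semantic neighbors.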

As a final consideration on H-Link, we note that the network knowledge layer of a peer can be empty. Such a situation may occur during peer bootstrapping after joining the system and during normal peer activities when all the semantic neighbors are temporarily disconnected. In this scenario, credits are proportionally assigned to each directly connected peer when a query is submitted to the system.

4.5 Application example

As an example of H-Link semantic routing, we consider peer B of Figure 4.1. Peer B intends to submit to the system the query Q2 described in Figure 4.10(a) with a total number of credits to distribute Ncr = 5. Peer B uses H-Match to compare the query Q2 against its peer ontology (see Figure 4.10(b)). As a result, the following semantic affinity values are returned by H-Match:

SA(Book,Volume)=0.79

SA(Book,Publication)=0.49

By invoking H-Link, we find that:

MCL = 〈Volume, 0.79〉, 〈Publication, 0.49〉

SNL = 〈peer A, Volume, 0.74〉, 〈peer E, Publication, 0.81〉, 〈peer F, Volume, 0.875, Publication, 0.62〉

On the basis of such results, H-Link computes the ranking of the semantic neighbors in SNL and assigns the corresponding number of credits, as shown in Table 4.1. The query Q2 is then submitted to the selected semantic neighbors together with the assigned


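[Figure 4.10 content: (a) query Q2 with the concept Book and the properties title and year; (b) portion of the peer B ontology with the concepts Person, Volume, Section, and Publication, the properties author, title, and contains, and location relations to peer A (0.74), peer F (0.875 and 0.62), and peer E (0.81)]

Figure 4.10: (a) The query Q2 example and (b) a portion of the peer B ontology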

Semantic neighbor   Ranking value   Assigned credits
peer A              0.764           2
peer E              0.611           1
peer F              0.689           2

Table 4.1: Example of semantic neighbor ranking and credit distribution

number of credits. As shown in the routing schema of Figure 4.11, peer A receives the query, consumes one credit for replying to peer B, and forwards the query Q2 to peer D by assigning the last remaining credit. Peer E consumes the single credit received and stops the forwarding process, while peer F forwards all the received credits to peer G as no reply is sent back to peer B.

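[Figure 4.11 content: routing diagram among peers B, A, D, E, F, and G; the edges are labeled with the query Q2 and the forwarded credits (B to A: 2 credits, A to D: 1 credit, B to E: 1 credit, B to F: 2 credits, F to G: 2 credits), together with two Q2 reply arrows]

Figure 4.11: The H-Link routing schema for query Q2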


Chapter 5

H-Link experimental results

In order to assess the real contribution of the H-Link semantic routing mechanism with respect to other state-of-the-art P2P routing protocols, extensive experimentation is needed. To this end, two different methods can be followed. As suggested in Schlosser et al. [2003], we can evaluate the properties of a P2P network by modeling it as a graph where nodes are peers and edges are P2P connections among nodes. Such a graph provides an abstract representation of the P2P network and, by studying the formal properties of the graph model, the real properties of the P2P network can be inferred. As another approach, the properties of the P2P network can be verified by simulation on a number of different test cases. Obviously, a formal proof of the network properties is more robust, but it is also more difficult to obtain. On the other hand, proof by simulation allows the rapid verification of properties in real scenarios. Furthermore, simulation can also be used for rapidly comparing the target P2P approach with other similar solutions. For these reasons, experimentation by simulation has been selected as the current reference method for testing the performance of the H-Link mechanism.

In the following, we briefly describe the Neurogrid P2P simulator that has been used for the H-Link experiments and we discuss the goals and configuration of the performed tests. Then, we present the results obtained for three kinds of experiments, devoted i) to measuring H-Link scalability, ii) to comparing H-Link with the Gnutella P2P routing protocol, and iii) to evaluating the impact of random credit distribution on the default behavior of the H-Link algorithm.


5.1 The Neurogrid P2P simulator

Neurogrid is a Java-based P2P simulator that was initially conceived for comparing the search protocols of Gnutella [Gnutella Protocol], Freenet [Freenet], and Neurogrid [Neurogrid]. It is a single-threaded discrete-event simulator. A property file in text format enables users to modify the simulation parameters for each simulation run. The Neurogrid simulator is composed of a small set of abstract classes that can be extended and redefined for implementing the simulation of a specific P2P search protocol. In particular, the following abstract classes are defined in Neurogrid:

• Network. The network abstract class contains the methods for adding nodes and for generating the network topology according to the P2P search protocol to simulate.

• Node. The node abstract class describes a peer of the network. Each node is a collection of data structures devoted to representing the current status of the peer. For instance, the node class maintains information regarding the active peer connections and the list of messages that were received and forwarded by the peer.

• Message. The message abstract class models queries that are propagated in the network. Messages are stored in nodes through ordered lists. When a query is submitted to the system, a message copy is stored in the incoming message list of each receiving node.

• Document. The document abstract class defines the resources that are assigned to each node. A document is defined as a list of keywords.

• Keyword. The keyword abstract class represents a possible query target. As for documents, each query is defined as a list of keywords. In order to assess whether a query is satisfied by a node, the query keywords are compared with the keywords in the node document. For the sake of efficiency, keywords are codified with a hash function.

Due to the flexible configurability and modularity of the software, the Neurogrid P2P simulator can be easily extended for supporting additional P2P search protocols. For this reason, the Neurogrid simulator has been selected as the reference framework


for performing the H-Link experiments. In particular, the H-Link class implementing the H-Link semantic routing protocol has been developed as an extension of the network abstract class. This way, H-Link simulation can be executed by setting simulation type = hlink in the property file.

5.2 Experimentation configuration and goals

The main goal of the experimentation by simulation is to assess the performance of the H-Link semantic routing mechanism in terms of generated traffic and recall. In the following, we describe how the Neurogrid P2P simulator has been configured to this end.

5.2.1 Neurogrid configuration

In order to perform the H-Link experimentation, the Neurogrid P2P simulator has been configured according to the following global settings:

• #nodes. When a simulation run is initialized, the network is generated by creating the number of nodes indicated by the #nodes parameter.

• #connections per node. Each node is connected with a subset of the other nodes. The initial number of node connections is randomly defined in the range indicated by the #connections per node parameter. The number of node connections can vary during the simulation according to the H-Link results.

• #concepts per node. Each peer is equipped with a randomly generated peer ontology. The size of the peer ontology is measured in terms of the number of concepts belonging to the ontology, which is randomly defined in the range indicated by the #concepts per node parameter.

• max network layer size. Each peer maintains a vector for storing location relations and corresponding network concepts. The size of such a vector is defined according to the max network layer size parameter. Since each node has a limited space for storing network concepts, only the location relations with the highest confidence values are maintained.


• #concepts per query. Each query is randomly defined and submitted by peers with equal probability. The number of concepts belonging to the query is randomly defined in the range indicated by the #concepts per query parameter.

• #queries. A simulation run finishes when the number of queries propagated in the system reaches the limit defined in the #queries parameter.

• #credits. Each query is associated with an initial number of credits defined in the #credits parameter.

For peer ontology and query generation, we have considered two reference OWL ontologies from the publishing domain (i.e., Ka¹ and Portal²). An extraction procedure has been developed to pick out a random-size ontology region starting from a given concept of the reference ontologies. This way, we can provide both random generation and semantic homogeneity of peer ontologies. The same procedure is also exploited for query definition.

The H-Match semantic matchmaker is used for providing semantic affinity evaluation among queries and peer ontologies during the simulation process. As far as the role of H-Match in the simulations is concerned, we have produced an hmatch-file containing all the H-Match results and corresponding affinity values obtained from the complete comparison of the two reference ontologies. Such a file is exploited by the Neurogrid simulator in order to decide whether a peer is relevant for a given query. In other words, when a query is propagated in the network, the Neurogrid simulator checks in the hmatch-file whether the concepts in the query have matching entries with the concepts in the ontology of each receiving peer.

5.2.2 Simulation goals

Simulations are devoted to measuring the traffic generated by H-Link and its recall. By generated traffic, we mean the overall number of messages routed during a complete simulation run. Furthermore, we measure recall according to the classical definition of Information Retrieval [Salton, 1989], namely as the ratio of the number of relevant concepts retrieved by an H-Link query to the total number of relevant concepts that were expected

¹ http://protege.stanford.edu/plugins/owl/owl-library/ka.owl
² http://www.aktors.org/ontology/portal


according to the hmatch-file. In the simulation, the generated traffic will be analyzed through the Avg Traffic indicator, which describes the average number of messages required by each query. Moreover, recall will be analyzed through the Min recall, Max recall, and Avg recall indicators, which describe the minimum, maximum, and average recall values obtained by a query, respectively.
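As an illustration of how these indicators can be derived, the Python sketch below summarizes a list of per-query records. The record layout (messages routed, relevant concepts retrieved, relevant concepts expected) and the function names are assumptions made for this example and do not describe the actual output format of the simulator. Queries with no expected relevant concepts are skipped in the recall statistics, which is also an assumption.

```python
def recall(retrieved_relevant, expected_relevant):
    """Recall as retrieved relevant concepts over expected relevant concepts."""
    return retrieved_relevant / expected_relevant


def summarize(records):
    """Compute the Avg Traffic and the Min/Max/Avg recall indicators.

    records is a non-empty list of (messages, retrieved, expected) tuples.
    """
    recalls = [recall(r, e) for _m, r, e in records if e > 0]
    return {
        "avg_traffic": sum(m for m, _r, _e in records) / len(records),
        "min_recall": min(recalls),
        "max_recall": max(recalls),
        "avg_recall": sum(recalls) / len(recalls),
    }


# Three hypothetical queries: (messages, retrieved, expected).
records = [(10, 1, 2), (20, 2, 2), (30, 0, 4)]
summary = summarize(records)
print(summary)
```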

The experiments are devoted to illustrating the results obtained concerning i) H-Link scalability, ii) H-Link performance when compared with the Gnutella protocol, and iii) the impact of random credit distribution on the performance of the default H-Link mechanism. The choice of a comparison with Gnutella is due to the fact that both H-Link and Gnutella define an overlay network on top of an unstructured P2P infrastructure. Apart from the architectural similarities, we have chosen Gnutella since its routing protocol is well known and it is frequently considered as a reference example. Some other P2P routing protocols have also been considered for a comparison with H-Link and we plan to perform additional experiments in future work (see Chapter 7).

The following results are expected and will be verified and discussed after the presentation of the experiments.

• The main contribution of H-Link is expected in terms of scalability and traffic reduction.

• Recall is positively affected by the use of more accurate matching models (i.e., deep, intensive).

• Random credit distribution increases the effectiveness of the H-Link mechanism.

As a final remark, we stress that simulation experiments have been performed by gradually varying the configuration parameters (i.e., #nodes, #connections per node, #queries, #credits). For the test comparing H-Link with Gnutella (see Section 5.4), we have also repeated the simulation by relying on different H-Match models (i.e., surface, shallow, deep, intensive) with a standard matching configuration (i.e., wla = 0.5 and t = 0.5). Unless otherwise specified, the H-Match intensive model is exploited for generating confidence values. In the following, we discuss some selected results where the H-Link properties are more evident. However, the complete simulation results are provided in Appendix A for the sake of completeness.


5.3 H-Link scalability results

H-Link scalability is evaluated by measuring traffic and recall when the number of peers joining the system is a growing variable. The simulation has been performed according to the following configuration:

• #nodes varies in the range [100−1000] with an increment of 180 nodes for each simulation run.

• #connections per node is increased according to the #nodes parameter and varies in the range [3−10].

• #queries = 5000.

• A complete simulation has been performed for each value of #credits varying in the range [12−24] with an increment of 4 credits.
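The parameter sweep described by this configuration can be enumerated with a few lines of Python. This is only a sketch of the run matrix: the dictionary keys are hypothetical names, not Neurogrid property keys, and the mapping from network size to the #connections per node range follows the values reported in Table 5.1 for #credits = 24, which is assumed here to hold for the other credit values as well.

```python
# Network sizes: 100 to 1000 nodes with an increment of 180 nodes.
node_counts = list(range(100, 1001, 180))
# Credit values: 12 to 24 with an increment of 4 credits.
credit_values = list(range(12, 25, 4))
QUERIES = 5000


def connections_range(index):
    """Range of #connections per node for the index-th network size."""
    return (3 + index, 5 + index)


runs = [
    {
        "nodes": n,
        "connections_per_node": connections_range(i),
        "queries": QUERIES,
        "credits": c,
    }
    for c in credit_values
    for i, n in enumerate(node_counts)
]
print(len(runs), runs[0], runs[-1])
```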

In Table 5.1, the simulation results for #credits = 24 are reported. A graphical representation of such results is also provided in Figure 5.1.

#Nodes  #Concepts per node  #Connections per node  #credits  Avg Traffic  Min Recall  Max Recall  Avg Recall
100     1-3                 3-5                    24        372          0.0         1.0         0.71
280     1-3                 4-6                    24        772          0.0         1.0         0.84
460     1-3                 5-7                    24        977          0.0         1.0         0.83
640     1-3                 6-8                    24        1097         0.2         1.0         0.87
820     1-3                 7-9                    24        2068         0.1         1.0         0.76
1000    1-3                 8-10                   24        2247         0.1         1.0         0.73

Table 5.1: The results of the H-Link scalability test for #credits = 24

We observe that the generated traffic follows a sub-linear growth, which is an interesting result. We believe that further improvements in H-Link scalability are possible. The growing traffic is caused by the fact that, when the number of nodes increases, the number of query forwardings without credit consumption also increases. This is due to the completely random generation of queries and peer ontologies. We plan to exploit more realistic distribution models (e.g., the Zipf distribution model [Korfhage, 1997]) in future experiments, with the aim of investigating possible enhancements in terms of H-Link scalability. On the other hand, we observe that the average recall lies in the range [0.71, 0.87] and thus it is not significantly affected by the variation in the #nodes parameter. A similar consistent behavior in both generated traffic and recall values can be observed in the other simulations where different values of the #credits parameter are employed (see Appendix A).

Figure 5.1: Evaluation of H-Link scalability for #credits = 24: (a) Traffic (b) Recall

5.4 Comparison with Gnutella

The comparison of H-Link with Gnutella has been performed by measuring traffic and recall when the query scope (i.e., number of credits for H-Link and TTL for Gnutella) is a growing variable. The simulation has been performed according to the following configuration:

• #nodes = 500

• A complete simulation has been performed for each value of #connections per node varying in the range [3−10].

• #queries = 5000.

• #credits varies in the range [27−54] with an increment of 9 credits for each simulation run. The #credits parameter is proportionally set according to the value in #connections per node. For the Gnutella simulation, the #credits parameter is replaced by the TTL parameter, which varies in the range [3−6].

In Table 5.2, the simulation results for #connections per node = 8−10 are reported. In Table 5.2, we also compare the H-Link results obtained by varying the H-Match model from surface to intensive matching. A graphical representation of such results


H-Link with the Surface matching model

#Nodes  #Concepts per node  #Connections per node  #credits  Avg Traffic  Min Recall  Max Recall  Avg Recall
500     1-3                 8-10                   27        460          0.0         1.0         0.53
500     1-3                 8-10                   36        1188         0.1         1.0         0.62
500     1-3                 8-10                   45        1697         0.1         1.0         0.66
500     1-3                 8-10                   54        2840         0.1         1.0         0.70

H-Link with the Shallow matching model

#Nodes  #Concepts per node  #Connections per node  #credits  Avg Traffic  Min Recall  Max Recall  Avg Recall
500     1-3                 8-10                   27        664          0.0         1.0         0.71
500     1-3                 8-10                   36        1058         0.0         1.0         0.77
500     1-3                 8-10                   45        1449         0.0         1.0         0.80
500     1-3                 8-10                   54        1923         0.0         1.0         0.83

H-Link with the Deep matching model

#Nodes  #Concepts per node  #Connections per node  #credits  Avg Traffic  Min Recall  Max Recall  Avg Recall
500     1-3                 8-10                   27        615          0.1         1.0         0.77
500     1-3                 8-10                   36        1358         0.1         1.0         0.86
500     1-3                 8-10                   45        1896         0.1         1.0         0.89
500     1-3                 8-10                   54        2865         0.2         1.0         0.91

H-Link with the Intensive matching model

#Nodes  #Concepts per node  #Connections per node  #credits  Avg Traffic  Min Recall  Max Recall  Avg Recall
500     1-3                 8-10                   27        703          0.0         1.0         0.82
500     1-3                 8-10                   36        1253         0.1         1.0         0.89
500     1-3                 8-10                   45        1803         0.1         1.0         0.93
500     1-3                 8-10                   54        2671         0.1         1.0         0.96

Gnutella

#Nodes  #Concepts per node  #Connections per node  TTL       Avg Traffic  Min Recall  Max Recall  Avg Recall
500     1-3                 8-10                   3         645          0.2         1.0         0.74
500     1-3                 8-10                   4         3089         0.8         1.0         1.00
500     1-3                 8-10                   5         4240         1.0         1.0         1.00
500     1-3                 8-10                   6         4254         1.0         1.0         1.00

Table 5.2: Comparison of H-Link with Gnutella for #connections per node = 8−10

is also provided in Figure 5.2. We note that excellent results are obtained by H-Link in terms of generated traffic. Moreover, we also note that the variation in the adopted matching model does not significantly affect the performance of H-Link when the generated traffic is considered. As far as the recall values are concerned, the optimal behavior of Gnutella is motivated by the fact that the simulation of Figure 5.2 works with #nodes = 500 and #connections per node = 8−10. Under such a configuration, Gnutella floods the network and succeeds in reaching a large part of both relevant and irrelevant nodes, thus retrieving all the available concepts matching the target query. On the other hand, H-Link presents very interesting results in terms of recall values, which lie in the range [0.82, 0.96] when the intensive matching model is used. Furthermore, we note that the choice of a specific matching model has an impact on the H-Link recall


Figure 5.2: Comparison of H-Link with Gnutella for #connections per node = 8−10: (a) Traffic (b) Recall

values. According to the results of Figure 5.2(b), the deep and the intensive matching models are suggested when a more accurate behavior of H-Link is required. However, we stress that more accurate H-Match models also imply higher computation time, as discussed in Chapter 4. For a generic P2P application, the surface and shallow matching models should be preferred to avoid peer bottlenecks in routing functionalities.

For a comprehensive comparison of H-Link and Gnutella, both generated traffic and recall need to be considered as a whole. This way, we can observe that there is a trade-off between traffic reduction and accuracy of results. In this respect, the results indicate that H-Link considerably reduces traffic while preserving a significant level of accuracy in results.
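The trade-off can be quantified directly from Table 5.2 with the Python sketch below, which compares the H-Link rows for the intensive matching model with the Gnutella rows. Pairing the rows by position (27 credits with TTL 3, up to 54 credits with TTL 6) is an assumption based on the order of the table, and the function name is hypothetical.

```python
# (credits or TTL, Avg Traffic, Avg Recall), copied from Table 5.2.
hlink_intensive = [(27, 703, 0.82), (36, 1253, 0.89), (45, 1803, 0.93), (54, 2671, 0.96)]
gnutella = [(3, 645, 0.74), (4, 3089, 1.00), (5, 4240, 1.00), (6, 4254, 1.00)]


def compare(hlink_rows, gnutella_rows):
    """Pair the rows by position and compute traffic saving and recall gap."""
    report = []
    for (credits, h_traffic, h_recall), (ttl, g_traffic, g_recall) in zip(
        hlink_rows, gnutella_rows
    ):
        traffic_saving = 1 - h_traffic / g_traffic  # fraction of messages saved
        recall_gap = g_recall - h_recall            # recall lost with respect to Gnutella
        report.append((credits, ttl, traffic_saving, recall_gap))
    return report


report = compare(hlink_intensive, gnutella)
for credits, ttl, saving, gap in report:
    print(f"credits={credits} TTL={ttl} traffic saving={saving:+.0%} recall gap={gap:+.2f}")
```

Under this pairing, the smallest query scope is the only case in which H-Link routes slightly more messages than Gnutella (while obtaining a higher average recall); for the larger scopes H-Link saves more than a third of the messages at the cost of a recall gap of at most 0.11.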

5.5 Impact of random credit distribution

H-Link has been implemented in the Neurogrid P2P simulator in two different versions, namely H-Link-D and H-Link-R. The H-Link-D version implements the default H-Link algorithm as presented in Chapter 4. By contrast, the H-Link-R version supports the random distribution of a subset of the available credits. In other words, a portion of the available credits is randomly assigned during the H-Link credit distribution phase in order to discover the knowledge of the new peers joining the system (see Section 4.4). The goal of this test is to measure the impact of random credit distribution on the


default H-Link algorithm. Such an impact has been evaluated by measuring traffic and recall when the number of queries is a growing variable. The simulation has been performed according to the following configuration:

• #nodes = 1000

• #connections per node is randomly chosen in the range [8−10].

• #queries varies in the range [1000−10000] with an increment of 1000 queries for each simulation run.

• #credits = 24.

The simulation results for the considered configuration are reported in Table 5.3.A graphical representation of such results is also provided in Figure 5.3. We note that

H-Link without random credit distribution (H-Link-D)

#Queries  #Concepts per node  #Connections per node  #credits  Avg Traffic  Min Recall  Max Recall  Avg Recall
1000      1-3                 8-10                   24        1190         0.1         1.0         0.68
2000      1-3                 8-10                   24        1198         0.1         1.0         0.70
3000      1-3                 8-10                   24        1186         0.1         1.0         0.73
4000      1-3                 8-10                   24        1243         0.1         1.0         0.76
5000      1-3                 8-10                   24        1247         0.1         1.0         0.75
6000      1-3                 8-10                   24        1343         0.0         1.0         0.73
7000      1-3                 8-10                   24        1350         0.1         1.0         0.75
8000      1-3                 8-10                   24        1334         0.1         1.0         0.74
9000      1-3                 8-10                   24        1480         0.1         1.0         0.73
10000     1-3                 8-10                   24        1453         0.1         1.0         0.73

H-Link with random credit distribution (H-Link-R)

#Queries  #Concepts per node  #Connections per node  #credits  Avg Traffic  Min Recall  Max Recall  Avg Recall
1000      1-3                 8-10                   24        1102         0.1         1.0         0.67
2000      1-3                 8-10                   24        1183         0.1         1.0         0.71
3000      1-3                 8-10                   24        1198         0.1         1.0         0.69
4000      1-3                 8-10                   24        1275         0.1         1.0         0.75
5000      1-3                 8-10                   24        1363         0.1         1.0         0.70
6000      1-3                 8-10                   24        1353         0.0         1.0         0.73
7000      1-3                 8-10                   24        1609         0.1         1.0         0.71
8000      1-3                 8-10                   24        1420         0.1         1.0         0.77
9000      1-3                 8-10                   24        1521         0.1         1.0         0.74
10000     1-3                 8-10                   24        1724         0.1         1.0         0.74

Table 5.3: The impact of random credit distribution on the default H-Link algorithm

the impact of random credit distribution is not particularly prominent in terms


Figure 5.3: Impact of H-Link random credit distribution: (a) Traffic (b) Recall

of generated traffic and recall. However, a slight increase in traffic can be observed when H-Link-R is applied, thus confirming the benefits of the default H-Link mechanism for traffic reduction. In terms of recall, we note that the default H-Link-D algorithm still provides better results in the average case. This suggests that the H-Link-R algorithm is negatively affected by the credits wasted on irrelevant paths. In interpreting the poor results of the H-Link-R algorithm, we stress that the role of random credit distribution is to handle the problem of frequent peer join/leave operations. In the simulation, join and leave operations are not considered, thus random credit distribution only has a lowering effect on the standard H-Link performance. In conclusion, further experiments are required to actually assess the effects of random credit distribution and to define the specific conditions under which the H-Link-R algorithm can be properly adopted.
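The overall difference between the two versions can be summarized by averaging the Avg Traffic and Avg Recall columns of Table 5.3 over the ten simulation runs, as in the Python sketch below. Averaging across runs is a simplification introduced here for illustration and is not an indicator defined in the experiments.

```python
from statistics import mean

# (Avg Traffic, Avg Recall) for #queries = 1000, 2000, ..., 10000 (Table 5.3).
hlink_d = [(1190, 0.68), (1198, 0.70), (1186, 0.73), (1243, 0.76), (1247, 0.75),
           (1343, 0.73), (1350, 0.75), (1334, 0.74), (1480, 0.73), (1453, 0.73)]
hlink_r = [(1102, 0.67), (1183, 0.71), (1198, 0.69), (1275, 0.75), (1363, 0.70),
           (1353, 0.73), (1609, 0.71), (1420, 0.77), (1521, 0.74), (1724, 0.74)]


def averages(rows):
    """Mean traffic and mean recall over all the runs of one version."""
    traffic = mean([t for t, _r in rows])
    recall = mean([r for _t, r in rows])
    return traffic, recall


d_traffic, d_recall = averages(hlink_d)
r_traffic, r_recall = averages(hlink_r)
print(f"H-Link-D: traffic={d_traffic:.1f} recall={d_recall:.3f}")
print(f"H-Link-R: traffic={r_traffic:.1f} recall={r_recall:.3f}")
```

On these data, the version with random credit distribution routes about 72 more messages per query on average and obtains a slightly lower mean recall than the default version.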

5.6 Final considerations

According to the expected results presented in Section 5.2, we note that the H-Link simulations confirm most of the hypotheses. In particular, the main contribution of H-Link regards traffic reduction. Such an interesting result also characterizes some other P2P semantic routing approaches (e.g., [Joseph, 2002; Staab et al., 2004]), confirming that query propagation on a semantic basis is an effective solution when compared with traditional techniques, like Gnutella. In the case of H-Link, the benefits in terms of traffic reduction are also supported by interesting values in terms of recall and scalability. In our opinion, two main reasons explain such a result. On the one hand, the use of ontology matching techniques has a high impact on recall measures, where H-Link is near 52% in the worst case and reaches 100% in the best case. On the other hand, the use of credits represents a first attempt to enable peers, and thus users, to supervise query propagation through an application-oriented parameter rather than network-oriented ones, like TTL.

As regards the H-Match models, the simulations fully confirm the expected results. In particular, more accurate matching models (i.e., deep and intensive) provide higher recall values. However, fairly good results in terms of recall are also obtained with less accurate matching models (i.e., surface and shallow). As discussed in Section 5.4, the choice of the correct H-Match matching model is a key aspect when peer computational resources are a critical factor. In this respect, it is important to note that a peer can switch from one H-Match model to another on the fly according to the current peer status in terms of available bandwidth and workload.

Finally, we stress that some unexpected H-Link behaviors suggest performing further experiments on H-Link. Especially in the context of H-Link optimizations, like the random credit distribution, we note that they can be positively employed only under specific network requirements. H-Link tuning for adaptation to different P2P collaboration scenarios represents an interesting issue and is planned for future experiments (see Chapter 7).


Chapter 6

Consensus negotiation techniques and application to the Helios system

Apart from enforcing query routing in H-Link, semantic neighbors can also be exploited for supporting semantic community formation and management. In this chapter, we illustrate the handshake techniques for consensus negotiation in peer-based systems. In this respect, semantic communities are autonomously emerging and rely on ontology matchmaking for selecting community participants. Furthermore, the application of the handshake techniques to the Helios (Helios EvoLving Interaction-based Ontology knowledge Sharing) system is discussed. Helios has been developed for supporting P2P knowledge discovery and sharing. In the Helios context, consensus negotiation techniques and semantic communities are defined to improve the effectiveness of knowledge sharing and to foster semantic collaborations among similar peers.

6.1 Foundations of semantic communities

In a peer ontology, the network knowledge layer stores the network concepts (i.e., semantic neighbors) that have been discovered during past peer interactions. As described in Chapter 4, semantic neighbors are exploited for enforcing the H-Link semantic routing mechanism. Furthermore, we note that, given a concept $c$ in the content knowledge layer of a peer ontology, the set $NC_c$ of its associated network concepts, that is $NC_c = \{ nc \mid \exists\, lr(c, nc, cf) \}$, represents a set of semantic neighbors capable of providing knowledge similar to the concept $c$. Thus, peers in $NC_c$ can be considered as potential participants of a semantic community $sc_c$ where the topic of interest regards the concept $c$. The semantic neighbors in $NC_c$ can be invited to participate in $sc_c$ and they can forward such an invitation to their own semantic neighbors on the concept $c$, and so on. In other words, the role of semantic communities is related to the capability of dynamically aggregating nodes with similar interests in structured organizations with the aim i) to reduce the network load due to overlapping requests of single peers and ii) to define effective communication mechanisms for sets of nodes which share the same understanding of a domain of interest (i.e., peers that are members of the same community).
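The set $NC_c$ can be illustrated with a minimal sketch, assuming that location relations are stored as (concept, network concept, confidence) triples. The `LocationRelation` type, the `network_concepts` function, and the sample data are hypothetical and only serve the illustration.

```python
from typing import NamedTuple

class LocationRelation(NamedTuple):
    """Hypothetical record for a location relation lr(c, nc, cf)."""
    concept: str          # concept c in the content knowledge layer
    network_concept: str  # semantic neighbor nc in the network knowledge layer
    confidence: float     # confidence value cf

def network_concepts(concept: str, relations: list[LocationRelation]) -> set[str]:
    """Return NC_c: every network concept linked to `concept` by a location relation."""
    return {lr.network_concept for lr in relations if lr.concept == concept}

# Illustrative data only, not taken from the experiments.
relations = [
    LocationRelation("Publication", "peer A", 0.75),
    LocationRelation("Publication", "peer L", 0.63),
    LocationRelation("Holiday", "peer F", 0.40),
]
candidates = network_concepts("Publication", relations)
```

Under these assumptions, the peers in `candidates` are the potential participants of a community on the concept Publication.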

Definition of semantic community. We define a semantic community of peers as a set of nodes which show a common interest in a given topic and are organized in a structured way (e.g., a tree). Semantic communities are autonomously emerging, in that they originate from a declaration of interest of a peer and group those peers which spontaneously agree with the declaration, since they have relevant resources for the community. Formally, a semantic community sc is a 5-tuple of the form: sc = 〈CID, ICard, Members, SPolicy, Status〉, where:

• CID is the unique Community Identifier that characterizes the community sc.

• ICard is the community Identity Card. The ICard represents a subject category or topic area of interest and it is defined as an ontology. The use of an ontology-based ICard provides a semantically rich description of a given topic area of interest and allows the characterization of the common interpretation (i.e., perspective) featuring the community.

• Members is the set of participants that join sc and spontaneously agree with its ICard, since they have semantically relevant resources for the community.

• SPolicy ∈ (strict | soft) defines the behavior that sc members have to observe in terms of resource availability. The strict policy requires that incoming requests are processed by all community members in cooperation. The soft policy defines that each community member can autonomously choose the set of incoming queries to evaluate.

• Status ∈ (potential | emerging | partially committed | committed | disbanded) represents the current status of sc. During the consensus negotiation process, the community passes through the potential, emerging, and partially committed states. The committed and the disbanded states indicate that the community is effective and no longer active in the network, respectively.
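The 5-tuple above could be represented as follows. This is only a sketch: the ICard is reduced to a concept-to-properties mapping instead of a full ontology, and the default values for `spolicy` and `status` are assumptions, not part of the definition.

```python
from dataclasses import dataclass, field
from enum import Enum

class SPolicy(Enum):
    STRICT = "strict"
    SOFT = "soft"

class Status(Enum):
    POTENTIAL = "potential"
    EMERGING = "emerging"
    PARTIALLY_COMMITTED = "partially committed"
    COMMITTED = "committed"
    DISBANDED = "disbanded"

@dataclass
class SemanticCommunity:
    """sc = <CID, ICard, Members, SPolicy, Status>."""
    cid: str
    icard: dict[str, list[str]]              # assumption: concept -> list of properties
    members: set[str] = field(default_factory=set)
    spolicy: SPolicy = SPolicy.SOFT          # assumed default
    status: Status = Status.POTENTIAL        # a community starts at a potential level

sc = SemanticCommunity(cid="Publishing-Community",
                       icard={"Volume": ["author", "title"]})
```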

The following building blocks characterize the formation of semantic communities in knowledge-sharing P2P systems:

• Ontology-based peer description. Each peer exposes to the system a peer ontology which provides a semantically rich representation of the resources that the peer exposes to the network, in terms of concepts, properties, and semantic relations.

• Query-based interactions. Each peer interacts in a peer-to-peer manner with the other members of the system by submitting discovery queries in order to identify the potential members of a given community and by replying to incoming queries whether it can join a given community.

• Semantic matchmaking capabilities. Each peer implements a semantic matchmaker for matching ontologies in order to find which concepts match in different ontologies and at which level.

The community formation process is addressed under the constraints that: i) each peer can be a member of multiple communities and stores the CID and the ICard of each joined community, ii) no centralized authority (e.g., SuperPeer) is expected to coordinate the community discovery and formation process, and iii) the choice of joining an emergent community with a given ICard depends on the semantic matchmaker results. On receiving an incoming community ICard (i.e., an ontology), a peer invokes the semantic matchmaker and compares such an ICard with its peer ontology. A peer joining the community $sc_1$ is required to provide concepts in its peer ontology with a high semantic affinity with the ICard of $sc_1$.

For community formation in peer-based systems, the semantic matchmaker has to be capable of addressing the requirements discussed in Chapter 4 in the context of ontology matching for supporting the H-Link semantic routing mechanism. In particular, efficiency, flexibility, and dynamic configurability are the main requirements for a semantic matchmaker that is invoked during peer community formation. For this reason, the H-Match semantic matchmaker has been selected for supporting consensus negotiation in peer-based systems. In the following, we describe the handshake techniques for semantic community formation and management in peer-based systems.

6.2 Semantic community formation and management

The following three issues are addressed for managing a semantic community during its life-cycle in a peer-based system: consensus negotiation, membership management, and sharing policy.

6.2.1 Consensus negotiation

As shown in Figure 6.1, a semantic community of peers emerges when a node, called community founder, invokes a semantic handshake process which is composed of the following steps:

ICard advertisement. The founder $p_f$ defines a CID and an ICard describing the topic area of interest of the emerging community, along with a set of commitment constraints specifying the conditions required for the community establishment (e.g., minimum number of members required, specific semantic affinity constraints). Then, the founder composes an Invitation Message containing the CID and the ICard created, as well as the TTL parameter defining the maximum number of hops allowed for the invitation propagation, the matching model to be used for affinity evaluation (i.e., surface, shallow, deep, intensive), and the matching threshold t specifying the minimum semantic affinity value required to consider a concept of the ICard and a concept of a peer ontology as matching concepts. Then, the invitation message is sent to all the semantic neighbors of $p_f$ that are relevant for the ICard, with the aim of advertising the new community.¹

¹ In selecting the ICard recipients, we use the H-Link semantic routing mechanism, where the ICard is considered as a normal discovery query.

Member identification. Each invited peer $p_i$ invokes the H-Match semantic matchmaker in order to compare the incoming ICard with its peer ontology. $p_i$ is relevant for the community if H-Match identifies concepts in the peer ontology with a semantic affinity higher than the specified threshold t with respect to the ICard. In this case, $p_i$ replies to $p_f$ with an Interest Message reporting the portion of its peer ontology related to the matching concepts found to be relevant for the community by the semantic matchmaker. Independently of the matchmaker results, and if TTL ≥ 0, $p_i$ forwards the invitation message to all its relevant semantic neighbors with respect to the ICard, except for the peer from which the message has been received. Each invited peer discards any duplicate copies of the same invitation message it receives.

Request approval. On receiving the interest messages, the founder $p_f$ has to evaluate which peers are admitted to the community. For this reason, $p_f$ invokes its semantic matchmaker and compares each peer ontology portion received from the interested peers with its own knowledge (i.e., its peer ontology). For each candidate peer, the goal of this comparison is to evaluate whether the provided knowledge matches the knowledge of the founder, and then to assess whether they share a common perspective of the community interests. If the matchmaker returns matching results higher than t, $p_f$ admits the peer to the community and sends an Approval Message to the admitted peer.

Community commitment. Once the Request approval phase is completed, the founder verifies that the commitment constraints are satisfied. In this case, a Commitment Message is sent to all the admitted peers and the semantic community is effectively established. If the commitment constraints are not satisfied, the founder stops the community formation. In this case, the admitted peers wait for the commitment message until a predefined timeout expires and the community is considered as disbanded.
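The four steps above can be sketched as a single function. This is a simplified illustration, not the Helios implementation: the two affinity dictionaries stand in for H-Match results, the `neighbors` map stands in for the H-Link selection of relevant semantic neighbors, and the only commitment constraint is a hypothetical minimum number of members. The sketch also assumes that a peer holding TTL = 0 does not forward the invitation. The ICard affinities and the topology are those of the example in Section 6.2.4, whereas the `founder_affinity` values and `min_members` are invented for the illustration.

```python
from collections import deque

def handshake(founder, neighbors, icard_affinity, founder_affinity, t, ttl, min_members):
    """Run the four handshake steps and return (admitted peers, final status)."""
    # Steps 1-2: ICard advertisement and member identification.
    seen = {founder}
    interested = []
    queue = deque((peer, ttl, founder) for peer in neighbors.get(founder, []))
    while queue:
        peer, remaining, sender = queue.popleft()
        if peer in seen:
            continue  # duplicate invitation copies are discarded
        seen.add(peer)
        if icard_affinity.get(peer, 0.0) > t:
            interested.append(peer)  # the peer replies with an Interest Message
        if remaining > 0:  # assumption: a peer holding TTL = 0 does not forward
            for nxt in neighbors.get(peer, []):
                if nxt != sender:
                    queue.append((nxt, remaining - 1, peer))
    # Step 3: request approval by the founder.
    admitted = [p for p in interested if founder_affinity.get(p, 0.0) > t]
    # Step 4: community commitment (here: a minimum number of members).
    status = "committed" if len(admitted) >= min_members else "disbanded"
    return admitted, status

neighbors = {"B": ["A", "E", "F"], "A": ["D"], "E": ["G"]}
icard_affinity = {"A": 0.89, "D": 0.52, "E": 0.21, "G": 0.77}  # best SA per peer
founder_affinity = {"A": 0.80, "D": 0.60, "G": 0.70}           # hypothetical values
admitted, status = handshake("B", neighbors, icard_affinity, founder_affinity,
                             t=0.5, ttl=1, min_members=3)
```

Note that peer E forwards the invitation to peer G even though it is not itself relevant, since forwarding is independent of the matchmaker results.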

6.2.1.1 The handshake state transition diagram

In Figure 6.1, we describe the state transition diagram for the handshake process. We observe that the aggregation of a semantic community of peers passes through the following states. At the beginning, the semantic community is expressed at a potential level and lies in the potential community state. When the community founder defines the ICard and CID, the community starts the ICard advertisement transition and moves to the emerging community state, in which the invited peers are called to show their interest in the rising community. The community remains in the emerging state until the invitation message is propagated to all the invited peers and the member identification transition is completed. With the request approval transition, the community moves to the partially committed state, where the accepted peers are notified of their membership. Only after the completion of the commitment transition does the semantic community enter the committed community state and become effective in the network.

[Figure 6.1 (diagram): states Potential Community, Emerging Community, Partially Committed Community, and Committed Community; transitions ICard advertisement, Member identification, Request approval, and Community commitment]

Figure 6.1: The state transition diagram of the handshake algorithm
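The handshake states can be sketched as a small transition table. The event names are hypothetical identifiers for the four transitions of Figure 6.1. Modeling member identification as a transition that keeps the community in the emerging state is an assumption based on the description above, and the disbanded state is left out because it does not appear in the figure.

```python
from enum import Enum

class State(Enum):
    POTENTIAL = "potential"
    EMERGING = "emerging"
    PARTIALLY_COMMITTED = "partially committed"
    COMMITTED = "committed"

# (current state, event) -> next state
TRANSITIONS = {
    (State.POTENTIAL, "icard_advertisement"): State.EMERGING,
    (State.EMERGING, "member_identification"): State.EMERGING,  # assumed self-loop
    (State.EMERGING, "request_approval"): State.PARTIALLY_COMMITTED,
    (State.PARTIALLY_COMMITTED, "community_commitment"): State.COMMITTED,
}

def step(state: State, event: str) -> State:
    """Apply one handshake transition, rejecting events not allowed in `state`."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state.value!r}") from None

state = State.POTENTIAL
for event in ("icard_advertisement", "member_identification",
              "request_approval", "community_commitment"):
    state = step(state, event)
```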

6.2.1.2 Example

As an example of the semantic handshake process, we consider Figure 6.2, where the handshake algorithm is applied to a fragment of a P2P network. In this example, peer B is the community founder and is represented by a double circle. In Figure 6.2, dashed lines represent random P2P connections, while continuous lines indicate the path followed by the invitation message with an initial TTL = 2. We note that the path of the invitation message defines a tree structure where the root is the community founder (i.e., peer B) and the leaves are the invited peers with TTL = 0 (i.e., peer C, peer I, peer J, and peer K). Each invited peer negotiates its participation in the community directly with the community founder. Once it is admitted, the peer exploits the tree structure and communicates within the community through its community neighbors. We define the community neighbors of a community member $p_m$ as the peer that invited $p_m$ to the community (i.e., the $p_m$ predecessor) and the peers that $p_m$ invited to the community (i.e., the $p_m$ successors).

An invited peer that is not interested in the community or is discarded by the founder is to be pruned from the tree structure of the community. For this reason, after the approval phase, each community member $p_m$ notifies its predecessor $p_p$ of its presence in the community. If $p_p$ is not a member of the community, it forwards the $p_m$ notification to its own predecessor $p_g$ and notifies $p_m$ that $p_g$ is its new predecessor.

As an example, we consider peer H in Figure 6.2. The community members peer I and peer K notify peer H of their participation. Peer H has not joined the community and is to be pruned from the community tree. Then, peer H forwards the notification to peer A and notifies peer I and peer K that peer A is their new predecessor.
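The pruning rule can be sketched as follows, assuming the invitation tree is available as a map from each peer to its predecessor. In the actual protocol the same result is reached through notification messages exchanged between peers; the function name and the centralized computation are simplifications for illustration.

```python
def prune_non_members(predecessor, members, founder):
    """Return the predecessor map of the community tree after pruning.

    predecessor: peer -> peer that invited it (invitation tree)
    members:     admitted peers, including the founder
    """
    pruned = {}
    for peer in members:
        if peer == founder:
            continue  # the founder is the root and has no predecessor
        parent = predecessor[peer]
        # A non-member forwards the notification to its own predecessor.
        while parent not in members:
            parent = predecessor[parent]
        pruned[peer] = parent
    return pruned

# Fragment of the example: B invited A, A invited H, H invited I and K.
predecessor = {"A": "B", "H": "A", "I": "H", "K": "H"}
members = {"B", "A", "I", "K"}  # peer H did not join
tree = prune_non_members(predecessor, members, founder="B")
```

After pruning, peer I and peer K have peer A as predecessor, as in the example.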

[Figure 6.2 (network diagram): peers A to M, with peer B as community founder ($p_f$); legend: invitation message path, P2P communication, community founder, invited member, not invited member, accepted member]

Figure 6.2: Example of aggregation of a semantic community

6.2.2 Membership management

Appropriate techniques are defined to address the main events that may occur during the community life-cycle, such as insertion and deletion of participants, unexpected peer failure, and community disband.

Each community member $p_i$ can decide to invite new peers to the community by forwarding an invitation message containing the CID and the ICard of the community. Each invited peer negotiates its participation in the community directly with the community founder, but once it is admitted, it is appended as a successor of $p_i$. As regards deletion and unexpected peer failure, the community has to react in order to re-arrange the tree structure. To this end, the community members are organized in a partially ordered set according to an activation number uniquely assigned by the founder to each admitted community member. When a community member $p_i$ with an activation number $n_i$ is deleted or fails, its immediate successors re-arrange the tree by connecting themselves to the available node with the maximum activation number lower than $n_i$. Finally, a timeout mechanism combined with a keepalive technique is adopted to accomplish the automatic disband of useless communities.
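The re-arrangement rule based on activation numbers can be sketched as follows. The activation numbers in the usage example are hypothetical, and the sketch assumes that the founder holds the lowest activation number, so that a replacement node exists whenever a non-founder member fails.

```python
def reattach(predecessor, activation, failed):
    """Remove `failed` and reconnect its immediate successors.

    predecessor: member -> predecessor in the community tree
    activation:  member -> activation number assigned by the founder
    """
    n_failed = activation[failed]
    alive = {p: n for p, n in activation.items() if p != failed}
    candidates = [p for p, n in alive.items() if n < n_failed]
    if not candidates:
        raise ValueError("no available node with a lower activation number")
    # Available node with the maximum activation number lower than n_failed.
    new_parent = max(candidates, key=alive.get)
    updated = {}
    for child, parent in predecessor.items():
        if child == failed:
            continue
        updated[child] = new_parent if parent == failed else parent
    return updated

# Hypothetical activation numbers; B is the founder.
activation = {"B": 0, "A": 1, "G": 2, "J": 3}
predecessor = {"A": "B", "G": "B", "J": "G"}
new_tree = reattach(predecessor, activation, failed="G")
```

Here peer J, the only successor of the failed peer G, reconnects to peer A, whose activation number 1 is the highest one below 2.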

6.2.3 Sharing Policy

A sharing policy is defined in order to set the behavior that community members have to observe in terms of resource availability. The following levels of severity can be identified:

• Strict. The community acts as a single peer and incoming requests are served by all the participants in cooperation. Queries are decomposed into tasks that are distributed to the community members in order to split the effort among all the community participants.

• Soft. Each community member can autonomously decide the number and the kind of incoming requests to serve according to its current workload and to its computational resource availability.

The sharing policy can be imposed by the community founder as a commitment constraint to be satisfied during the handshake process. Alternatively, the sharing policy can be set on the basis of the results of a voting procedure involving all the community participants.

6.2.4 The role of ontology matching for community handshaking: an example

In order to highlight the role of ontology matching in the community aggregation process, we consider the example of Figure 6.3, where we show a portion of the network of Figure 6.2, and we discuss the role of the H-Match algorithm during the ICard advertisement and the member identification phases. In Figure 6.3, peers are represented together with a portion of their peer ontologies. Peer B acts as community founder $p_f$ and intends to create a community regarding the publishing domain with CID = Publishing-Community. To this end, peer B defines an ICard containing the concept Volume with the properties author and title. Then, peer B composes an invitation message with such an ICard together with TTL = 1, the deep matching model to use, and a matching threshold t = 0.5. According to the peer B ontology shown in Figure 4.10(b), such an invitation message is sent to the semantic neighbors of peer B (i.e., peer A, peer E, peer F) and, according to the TTL, the invitation is also forwarded to peer D and peer G by peer A and peer E, respectively. Each receiving peer invokes the H-Match algorithm with the deep matching model to evaluate the semantic affinity between the incoming ICard and its peer ontology. The peer ontology of peer F is related to the tourism domain and no matching concepts are found for the ICard. As regards the peer E ontology, H-Match returns SA(Volume, Document) = 0.21 < t, which is not sufficient for participating in the community. For this reason, peer E and peer F do not reply to the invitation message and are no longer considered for the community aggregation.² As regards peer A, peer D, and peer G, the H-Match results produced with the deep model are reported in Table 6.1. According to the threshold t = 0.5 specified in the invitation, peer A, peer D, and peer G can provide relevant concepts for the ICard, and reply to peer B with an interest message. Then, peer A, peer D, and peer G will be considered by peer B during the request approval and community commitment phases for the definition of the committed community.

[Figure 6.3 (diagram): peers A, B ($p_f$), D, E, F, and G, each shown with a portion of its peer ontology; concepts appearing in the figure include Document, Article, Publication, Book, Journal, Newspaper, Periodical_Publication, Magazine, Person, Volume, Section, Holiday, Journey, and Conference_Publication]

Figure 6.3: Example of P2P network with peers and associated peer ontologies

Peer      H-Match result
peer A    SA(Volume, Publication) = 0.71
          SA(Volume, Book) = 0.89
peer D    SA(Volume, Periodical_Publication) = 0.52
peer G    SA(Volume, Publication) = 0.77
          SA(Volume, Conference_Publication) = 0.66

Table 6.1: The H-Match results for peer A, peer D, peer G

² The example in Figure 6.3 presents an atypical scenario. Peer E and peer F are indicated as semantic neighbors in the peer B ontology of Figure 4.10(b). This example shows that location relations and the associated confidence values can provide misleading indications when they are not updated.

6.3 Community-aware query propagation

The structured organization of committed communities can be exploited to improve search and discovery capabilities in peer-based systems. The idea is that the set of semantic communities $SC_p = \{ sc_1, \ldots, sc_h \}$ joined by a generic peer p is inserted in the network knowledge layer of the peer p ontology. In particular, a network concept is defined for each semantic community sc ∈ $SC_p$ and appropriate location relations are defined to link semantic communities with the corresponding matching concepts in the content knowledge layer. Confidence values are also associated with the location relations according to the H-Match results obtained by comparing the peer p ontology concepts with the ICard concepts of each sc ∈ $SC_p$. This way, semantic communities are handled as single semantic neighbors in the peer ontology and can be considered by H-Link for query routing. The community-aware H-Link mechanism is defined as an upgraded version of the H-Link semantic routing mechanism in order to deal with communities during the credit distribution phase. When a semantic community sc is included by H-Link in the query recipients, only one credit is assigned to sc and the intra-community query propagation strategy is adopted to disseminate the query among the community members.

In the following, we describe the community-aware H-Link mechanism and the intra-community query propagation strategy.

The community-aware H-Link mechanism. The community-aware H-Link mechanism is an upgraded version of the H-Link semantic routing mechanism described in Section 4.4. In order to illustrate it, we consider a peer p that needs to submit/forward a query Q for discovering relevant peers capable of providing knowledge matching the query Q target. As in the traditional H-Link mechanism, the query is matched against the peer p ontology by relying on the H-Match semantic matchmaker and the matching concept list MCL is returned as a result. The following steps are then performed:

1- Selection of semantic neighbors. The semantic neighbor list SNL is defined by exploiting the network knowledge layer in order to extract the network concepts and the associated confidence values connected to the concepts in MCL through a location relation. We note that SNL = SN ∪ SC, where SN is a set of single semantic neighbors and SC is a set of semantic communities.

2- Ranking of semantic neighbors. Both single semantic neighbors and semantic communities in SNL are ranked by using (4.11) and the ranked list RSNL is defined as a result.

3- Distribution of credits. Available credits $A_{cr}$ are proportionally distributed to the recipients in RSNL. Credits are assigned to each single semantic neighbor sn ∈ RSNL by using (4.12). In the case of a semantic community sc ∈ RSNL, the corresponding number of credits $ncr_{sc}$ is always set to 1 without considering the ranking value $r_{sc}$. Finally, the query distribution list QDL is returned as a result.

By assigning one credit to a semantic community sc ∈ RSNL, we stress that communities require a specific solution for credit distribution. This is due to the fact that, if a community is relevant for a query Q, all the community members can provide relevant concepts with respect to query Q. When a semantic community sc is selected as a query Q recipient (i.e., $ncr_{sc} = 1$), the intra-community query propagation strategy is invoked in order to efficiently disseminate the query Q to all the sc members.
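The credit distribution step can be sketched as follows. Equation (4.12) is not reproduced in this section, so the proportional split among single semantic neighbors is implemented here with a largest-remainder rounding, which is an assumption standing in for the actual formula. The function name and signature are hypothetical, and the sketch does not handle the case in which there are more communities than credits. The ranking values are those of the peer G example discussed in this section.

```python
import math

def distribute_credits(n_credits, ranked, communities):
    """Return a query distribution list as a dict: recipient -> credits.

    ranked:      recipient -> ranking value (the ranked list RSNL)
    communities: recipients in `ranked` that are semantic communities
    """
    # Each semantic community always receives exactly one credit.
    qdl = {r: 1 for r in ranked if r in communities}
    singles = {r: v for r, v in ranked.items() if r not in communities}
    remaining = n_credits - len(qdl)
    total = sum(singles.values())
    if remaining <= 0 or total == 0:
        return qdl
    # Proportional share of the remaining credits (stand-in for (4.12)).
    exact = {r: remaining * v / total for r, v in singles.items()}
    qdl.update({r: math.floor(x) for r, x in exact.items()})
    leftover = remaining - sum(math.floor(x) for x in exact.values())
    by_fraction = sorted(exact, key=lambda r: exact[r] - math.floor(exact[r]),
                         reverse=True)
    for r in by_fraction[:leftover]:
        qdl[r] += 1  # largest fractional parts receive the leftover credits
    return qdl

rsnl = {"peer A": 0.77, "peer L": 0.7, "Publishing Community": 0.78}
qdl = distribute_credits(4, rsnl, communities={"Publishing Community"})
```

With four credits, the community receives one credit despite having the highest ranking value, and the remaining three credits are split between peer A and peer L.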

The intra-community query propagation strategy. We consider a peer p that is a member of a semantic community sc. We assume that i) peer p needs to submit/forward a query Q to the system and ii) sc is selected as query Q recipient according to the results of the community-aware H-Link mechanism. In order to disseminate the query Q to all the members of sc, peer p invokes the intra-community query propagation strategy and sends the query Q to its neighbors in the community tree (i.e., the peer p predecessor and the peer p successors). By exploiting the community tree structure, the forwarding mechanism is iterated until the query Q reaches each community member.

We note that a semantic community sc can be selected as query Q recipient only by an sc member. After the first hop in the intra-community forwarding, each receiving sc member is no longer required to evaluate the affinity between the query Q and the sc ICard, on the assumption that such a semantic affinity evaluation has already been performed by the sc member that started the intra-community propagation. Thus, the query Q forwarding on the sc tree is directly executed.
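The intra-community propagation strategy amounts to flooding the community tree from the submitting member. A minimal sketch follows, assuming the tree is given as a list of predecessor/successor pairs; the edge list reproduces the Publishing Community tree as described in the example of this section, and the function name is hypothetical.

```python
from collections import deque

def propagate(tree_edges, start):
    """Return, for each community member, the hop at which it receives the query."""
    adjacency = {}
    for a, b in tree_edges:
        # Community neighbors are the predecessor and the successors of a peer.
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in sorted(adjacency.get(peer, ())):
            if neighbor not in hops:  # never send the query back along the tree
                hops[neighbor] = hops[peer] + 1
                queue.append(neighbor)
    return hops

# Publishing Community tree: (predecessor, successor) pairs.
edges = [("B", "A"), ("B", "G"), ("G", "J"), ("A", "D"), ("A", "I"), ("A", "K")]
hops = propagate(edges, start="G")
```

Since the community is organized as a tree, each member receives the query exactly once through the community, without any affinity evaluation against the ICard after the first hop.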

Example. As an example of community-aware query propagation, we consider a portion of the peer G ontology (Figure 6.4(a)) and the Publishing Community presented in Figures 6.2 and 6.3. We suppose that peer G needs to submit the query Q2 shown in Figure 4.10(a) and that $N_{cr} = 4$ is the number of credits to assign. By invoking the community-aware H-Link mechanism, we obtain the following results:

MCL = 〈Publication, 0.8〉
SNL = 〈peer A, Publication, 0.75〉, 〈peer L, Publication, 0.63〉, 〈Publishing Community, Publication, 0.77〉
RSNL = 〈peer A, 0.77〉, 〈peer L, 0.7〉, 〈Publishing Community, 0.78〉
QDL = 〈peer A, 2〉, 〈peer L, 1〉, 〈Publishing Community, 1〉

We note that the Publishing Community receives one credit even though $r_{\text{Publishing Community}} = 0.78$ is the highest ranking value in RSNL. Moreover, the remaining three credits are proportionally assigned to peer A and peer L. The query Q2 is then sent to the recipients in QDL with the corresponding credits. As regards the Publishing Community, the query Q2 follows the intra-community propagation strategy as described in Figure 6.4(b). In particular, peer G sends the query to its community neighbors (i.e., peer B and peer J). Subsequently, peer B forwards query Q2 to peer A. Finally, peer D, peer I, and peer K receive the query Q2 from peer A. We note that peer A receives the query Q2 twice: i) as a member of the Publishing Community, and ii) as a single semantic neighbor of peer G. The first time the query Q2 is received, the H-Match semantic matchmaker is invoked to assess whether a relevant reply can be provided to the requesting peer. As regards the forwarding of query Q2, the intra-community propagation strategy is executed in case (i), whereas the H-Link semantic routing mechanism is invoked to assign the available credits in case (ii).

[Figure 6.4 (diagrams): (a) a portion of the peer G ontology with the concepts Conference_Publication and Publication, linked by location relations (confidence values 0.75, 0.66, 0.77, 0.63) to the network concepts peer A, peer L, and Publishing Community; (b) the community tree with peers A, B ($p_f$), D, G, I, J, and K, with edges labeled by hop order (1), (2), (3)]

Figure 6.4: (a) A portion of the peer G ontology and (b) intra-community query propagation in the Publishing Community

6.4 Considerations on handshake and semantic communities

Semantic communities have an impact on H-Link and related semantic routing functionalities. By relying on the community-aware H-Link mechanism, a larger number of relevant peers can be contacted by exploiting semantic communities, thus increasing the number of query replies. Such an intuition needs to be further investigated by defining accurate experiments (see Chapter 7). On the other hand, we note that, with the community-aware H-Link mechanism, the peer submitting the query cannot control the number of query replies. For this reason, the use of the community-aware H-Link mechanism should be specified by the requesting peer on query submission; otherwise, only single semantic neighbors are exploited by H-Link and the number of credits $N_{cr}$ is used to set the number of desired query replies.

As another consideration, communities are expensive in terms of the communication overhead that is required by the semantic handshake process and by ordinary community management operations (e.g., member insertion/deletion, community tree adjustment). Furthermore, we observe that semantic communities have an impact on the reference social metaphor that characterizes knowledge-sharing P2P networks, by affecting the independence requirement of peers. In particular, coordination constraints are imposed on single-peer behavior. For instance, community members are required to execute query processing according to the constraints set in the sharing policy negotiated in the community. Coordination constraints, and thus communities, are desirable in specific application scenarios. For instance, the following scenarios can be considered for semantic community application.

• Networked enterprises. Semantic communities can be adopted in the context of networked enterprises for providing novel forms of semantic collaboration and coordinated access to distributed information resources. We define a networked enterprise as a virtual organization where independent business subjects dynamically agree to participate in order to rapidly respond to opportunities or challenges that cannot be anticipated. In this respect, ontology-based communities contribute to addressing semantic interoperability issues by enabling seamless access to and retrieval of the right information resources while preserving the information representation and management requirements of each single party involved in the networked enterprise [Afsarmanesh et al., 1998; Silva et al., 2003]. Some preliminary results on this topic can be found in Castano et al. [2006e]. Furthermore, the sharing policy defined within a community can be exploited to enforce advanced collaboration schemas among the community members. For instance, the sharing policy can be used to arrange distributed query processing and optimization as discussed in Brunkhorst et al. [2003]; Karnstedt et al. [2004].

• Cross-organization expert collaborations. Semantic communities can be exploited to provide a cross-organizational collaboration platform. In this respect, single expert users belonging to different organizations and characterized by similar interests and common professional skills can rely on a semantic community for enforcing timely communication and information sharing. As an example, the handshake techniques are adopted in the Esteem project for supporting mobile/ubiquitous information access to a community of homologous specialists belonging to different healthcare institutes [Esteem].

6.5 The Helios knowledge-sharing P2P system

The handshake techniques for semantic community formation and management have been implemented in the Helios system for P2P knowledge discovery and sharing [Castano and Montanelli, 2005; Montanelli, 2006]. Helios is conceived as a network of independent peers with equal role and capability which dynamically interact for knowledge discovery and sharing purposes. As described in Figure 6.5, each peer of the Helios network provides a semantically rich representation of the information resources to be shared by means of a peer ontology (see Section 4.2). In Helios, two basic phases are required to enforce a typical knowledge sharing scenario: i) the discovery phase, where a generic peer p defines a probe query Q containing one or more concepts of interest in order to identify the set of peers that contain relevant knowledge (i.e., matching concepts) with respect to Q, and ii) the acquisition phase, where the replies collected after the discovery phase are filtered by the requesting peer p and relevant peers are contacted for acquisition of the discovered knowledge.

To this end, each Helios peer adopts a query-based probe/search approach based on H-Match for enforcing both knowledge discovery and acquisition phases. According to Figure 6.5, each Helios peer is equipped with a toolkit that provides peer ontology management and knowledge discovery/acquisition functionalities (see Section 6.6). For further details regarding Helios, the reader can refer to Helios; Castano et al. [2003e, 2006f].

6.5.1 The probe query model

Figure 6.5: The Helios architecture

The probe query model is used by a peer in the discovery phase to identify potential partners (i.e., peers) that are capable of providing relevant knowledge with respect to a given target request. To this end, the requesting peer defines a probe query providing an expressive representation of one or more target concepts of interest. A probe query provides an ontological description of the target in terms of concepts with possible properties and semantic relations. The Helios probe query template is reported in Figure 6.6 and it is composed of the following clauses:

• Find: list of target concept(s) names.

• With: (optional) list of properties of the target concept(s).

• Where: (optional) list of conditions to be verified by the property values, and/or (optional) list of concepts related to the target by a semantic relation.

• Matching policy: (optional) specification of the H-Match configuration requested for the query evaluation.

The answer to a probe query is a list of concepts that match the target. As described in Figure 6.7, the structure of the Helios answer template contains the following clauses:

• Concept: name of the matching concept.

• Properties: (optional) list of properties of the matching concept.

• Adjacents: (optional) list of concepts related to the matching concept by a semantic relation.


Probe query template
Find Target concept name [, ...]
[With 〈Property name〉 [, ...]]
[Where Condition,
       〈related concept, semantic relation name〉 [, ...]]
[Matching policy 〈model, wla, t〉]

Figure 6.6: The reference probe query template

• Matching: set of pairs 〈target concept, affinity value〉, specifying the target concept with which the matching concept matches, together with the corresponding affinity value.

• Matching policy: (optional) the matching policy adopted for the query evaluation.

Probe answer template
Concept Matching concept name
[Properties 〈Property name〉 [, ...]]
[Adjacents 〈related concept, semantic relation name〉 [, ...]]
Matching 〈Target concept, affinity value〉 [, ...]
[Matching policy 〈model, wla, t〉]

Figure 6.7: The reference probe answer template
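As an illustration only, the two templates of Figure 6.6 and Figure 6.7 can be mirrored by a pair of small data structures. The Python sketch below is not part of the Helios toolkit: the class and field names (MatchingPolicy, ProbeQuery, ProbeAnswer, with_) are assumptions introduced here, the Where clause is simplified to a list of plain strings, and wla is kept as an opaque number because this section does not define it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MatchingPolicy:
    model: str   # matching model name, e.g. "Intensive"
    wla: float   # second template parameter, kept opaque in this sketch
    t: float     # affinity threshold

    def render(self) -> str:
        return f"〈{self.model}, {self.wla}, {self.t}〉"


@dataclass
class ProbeQuery:
    find: List[str]                                  # mandatory clause
    with_: List[str] = field(default_factory=list)   # optional clause
    where: List[str] = field(default_factory=list)   # optional, simplified to strings
    policy: Optional[MatchingPolicy] = None          # optional clause

    def render(self) -> str:
        lines = ["Find " + ", ".join(self.find)]
        if self.with_:
            lines.append("With " + ", ".join(self.with_))
        if self.where:
            lines.append("Where " + ", ".join(self.where))
        if self.policy is not None:
            lines.append("Matching policy " + self.policy.render())
        return "\n".join(lines)


@dataclass
class ProbeAnswer:
    concept: str                                     # mandatory clause
    matching: List[Tuple[str, float]]                # mandatory clause
    properties: List[str] = field(default_factory=list)
    adjacents: List[Tuple[str, str]] = field(default_factory=list)
    policy: Optional[MatchingPolicy] = None

    def render(self) -> str:
        lines = ["Concept " + self.concept]
        if self.properties:
            lines.append("Properties " + ", ".join(self.properties))
        if self.adjacents:
            lines.append("Adjacents " + ", ".join(
                f"〈{c}, {r}〉" for c, r in self.adjacents))
        lines.append("Matching " + ", ".join(
            f"〈{c}, {a}〉" for c, a in self.matching))
        if self.policy is not None:
            lines.append("Matching policy " + self.policy.render())
        return "\n".join(lines)


if __name__ == "__main__":
    policy = MatchingPolicy("Intensive", 0.5, 0.5)
    query = ProbeQuery(find=["Publication", "Book"],
                       with_=["Book.author", "Publication.Year"],
                       policy=policy)
    print(query.render())
```

Optional clauses are simply omitted from the rendered text when they are empty, which follows the square-bracket convention of the two templates.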

The knowledge discovery phase starts with the definition of the probe query Q according to the query template of Figure 6.6. Subsequently, the probe query Q is submitted to the system for processing and semantic affinity evaluation. On receiving a probe query, a peer invokes the H-Match algorithm to compare the request against its peer ontology in order to identify whether there are concepts matching the target. In particular, the Find, With, and Where clauses are used to derive a description of the target concept(s), while the matching policy to apply is derived from the Matching policy clause of the probe query, if specified 3. As a result, for each target concept of the probe query, H-Match returns a (possibly empty) ranked list of matching concepts semantically related to the target (that is, those concepts whose semantic affinity value exceeds the threshold specified in the adopted matching policy). This ranked list can contain either one single concept, which is the best-matching concept for the target, or a set of concepts, which are the best-k matching concepts for the target. Finally, the results of H-Match are organized according to the probe answer template in Figure 6.7, and the answer is sent back to the requesting peer. After collecting query replies from answering peers, the requesting peer evaluates the results and decides whether to further interact with those peers found to be relevant in order to access the specific information resources.

3 If the matching policy is not specified in the query, the receiving peer can decide the policy to apply according to internal criteria (e.g., current workload). Further details regarding the dynamic

Example. As an example of the knowledge discovery phase in Helios, we consider the probe query Q1 defined for the scenario of Figure 4.1. According to the probe query template of Figure 6.6, peer A composes the query Q1 as follows:

Find Publication, Book

With Book.author, Publication.Year

Where 〈Publication, Book, kind-of〉

Matching Policy 〈Intensive, 0.5, 0.5〉

The query Q1 is then submitted to peer B, peer C, and peer D. On receiving the query, each peer invokes the H-Match semantic matchmaker, which performs ontology matching using the intensive model to evaluate the semantic affinity between the incoming query and the respective peer ontology. According to the H-Match results, peer B replies to peer A with the answer shown in Figure 6.8(a). Following the same procedure, peer C does not identify matching concepts over the specified threshold (i.e., t = 0.5) in its peer ontology, so no reply is returned to the requesting peer A. Finally, according to its H-Match results, peer D replies to peer A with the answer shown in Figure 6.8(b).
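The thresholding and ranking step applied to the H-Match results can be sketched as follows. This is a minimal Python illustration, not the H-Match implementation: affinity values are taken as given inputs rather than computed, the function name rank_matches and its parameter k are assumptions, and the concept "Journal" with affinity 0.31 is a made-up entry added only to show a concept being filtered out.

```python
from typing import Dict, List, Optional, Tuple


def rank_matches(affinities: Dict[str, float], threshold: float,
                 k: Optional[int] = None) -> List[Tuple[str, float]]:
    """Keep the concepts whose affinity exceeds the threshold, best first.

    If k is given, only the best-k matching concepts are returned.
    """
    kept = [(c, a) for c, a in affinities.items() if a > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if k is None else kept[:k]


if __name__ == "__main__":
    # Newspaper and Magazine values come from Figure 6.8(b);
    # "Journal" is hypothetical and falls below the threshold t = 0.5.
    peer_d = {"Magazine": 0.539, "Newspaper": 0.67, "Journal": 0.31}
    print(rank_matches(peer_d, threshold=0.5))        # all concepts over t
    print(rank_matches(peer_d, threshold=0.5, k=1))   # best match only

    # An empty result corresponds to a peer that sends no reply,
    # as peer C does in the example.
    print(rank_matches({"Journal": 0.31}, threshold=0.5))
```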

6.5.2 The search query model

The search query model is used by a peer in the acquisition phase to obtain resource data from another node of the network. When a requesting peer has identified a partner containing relevant knowledge during the discovery phase, it can subsequently send a

configurability of the H-Match semantic matchmaker can be found in Castano et al. [2005a].


(a)
Concept Volume
Properties title, author, contains
Adjacents 〈Publication, kind-of〉
Matching 〈Book, 0.83〉
Matching policy 〈Intensive, 0.5, 0.5〉

(b)
Concept Newspaper
Adjacents 〈Periodical Publication, kind-of〉
Matching 〈Publication, 0.67〉
Matching policy 〈Intensive, 0.5, 0.5〉

Concept Magazine
Adjacents 〈Periodical Publication, kind-of〉
Matching 〈Publication, 0.539〉
Matching policy 〈Intensive, 0.5, 0.5〉

Figure 6.8: The probe answer provided by (a) peer B and (b) peer D, respectively

search query in order to access and acquire data about the target information resources. Search queries adopt a SQL-like syntax and contain the Find, With and Where clauses as in the probe query template. The Find clause is used for selecting the concept of interest. The With clause is used for specifying the properties that have to be returned in the answer. Finally, the Where clause is used for specifying the conditions to apply to one or more concept properties. In order to support search query processing, each peer provides appropriate techniques to access its repositories containing resource data.

Web Service-based techniques. A peer provides standard access to its shared information resources by means of a Web Service. Standard protocols and description languages (e.g., SOAP, WSDL) can be adopted in order to interact with the Web Service and provide seamless access to the underlying information resources. The requesting peer interacts with the Web Service by means of the SOAP protocol, which supports well-defined XML-based message communications. The WSDL document provides the specification of the set of public methods as well as the structure of the returned data extracted from the information resources.

Wrapper-based techniques. The access to the shared information resources is provided by means of a wrapper service. Such an approach is similar to the mediator-based services of traditional data integration systems (e.g., [Castano et al., 2001]). The requesting peer interacts with the wrapper service by submitting target queries for information resource access. The wrapper service manages mapping rules relating the concepts of the peer ontology to the underlying information resources in order to reformulate the target query in terms of queries over specific structures of the repository where the information resources are stored (e.g., relational database structure). Finally, the answer to a search query is sent back to the requesting peer as an XML document containing the information about the shared resources and an additional contentURI element that specifies the location where any related files can be downloaded.

Example. As an example of data acquisition, we consider the scenario in Figure 4.1. After collecting the results of the knowledge discovery phase, peer A decides to interact with peer B in order to acquire the data related to the Volume concept. To this end, peer A sends to peer B the following search query.

Find Volume

With title, author

Where author like ’%Montanelli%’

The incoming search query is processed by peer B through a wrapper service and an XML document containing the search query answer is composed for the subsequent reply to peer A. In Figure 6.9, we show a portion of the returned XML document that describes the articles that fit the search query, together with the URIs that peer A can use for downloading the associated files.
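The wrapper-based processing of the search query above can be illustrated with a small, self-contained Python sketch. It is not the Helios wrapper service: the mapping rules in MAPPING, the table and column names, the Answer root element, and the row titled "An Unrelated Paper" are all assumptions made for the example, and an in-memory SQLite database stands in for the peer repository.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping rules: ontology concept -> relational structures.
MAPPING = {
    "Volume": {
        "table": "articles",
        "element": "Article",
        "columns": {"title": "title", "author": "authors"},
        "uri_column": "uri",
    }
}


def answer_search_query(conn, concept, properties, where_property, pattern):
    """Reformulate Find/With/Where as SQL and wrap the rows in XML."""
    rule = MAPPING[concept]
    cols = [rule["columns"][p] for p in properties]
    # Identifiers come from the trusted mapping; the pattern is bound safely.
    sql = "SELECT {}, {} FROM {} WHERE {} LIKE ?".format(
        ", ".join(cols), rule["uri_column"], rule["table"],
        rule["columns"][where_property])
    root = ET.Element("Answer")
    for row in conn.execute(sql, (pattern,)):
        item = ET.SubElement(root, rule["element"])
        for prop, value in zip(properties, row):
            ET.SubElement(item, prop).text = value
        # The last selected column holds the download location.
        ET.SubElement(item, "contentURI").text = row[-1]
    return ET.tostring(root, encoding="unicode")


def demo_repository():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (title TEXT, authors TEXT, uri TEXT)")
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
        ("Semantic Self-Formation of Communities of Peers",
         "S. Castano, S. Montanelli",
         "http://islab.dico.unimi.it/ESWC05.pdf"),
        # Made-up row that must not match the Where condition.
        ("An Unrelated Paper", "A. Author", "http://example.org/other.pdf"),
    ])
    return conn


if __name__ == "__main__":
    # Find Volume / With title, author / Where author like '%Montanelli%'
    print(answer_search_query(demo_repository(), "Volume",
                              ["title", "author"], "author", "%Montanelli%"))
```

Keeping the concept-to-table correspondence in a separate mapping structure is what lets the same query syntax be served by different repository layouts.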

6.6 The Helios toolkit

As shown in Figure 6.10, Helios has been implemented as a toolkit that is characterized by a set of components related to ontology metadata management, dynamic


<Article>
  <title>
    From Surface to Intensive Matching of Semantic Web Ontologies
  </title>
  <author>S. Castano, A. Ferrara, S. Montanelli, G. Racca</author>
  <contentURI>http://islab.dico.unimi.it/WEBS04.pdf</contentURI>
</Article>
<Article>
  <title>Semantic Self-Formation of Communities of Peers</title>
  <author>S. Castano, S. Montanelli</author>
  <contentURI>http://islab.dico.unimi.it/ESWC05.pdf</contentURI>
</Article>
<Article>
  <title>
    Enforcing a Semantic Routing Mechanism
    based on Peer Context Matching
  </title>
  <author>S. Castano, S. Montanelli</author>
  <contentURI>http://islab.dico.unimi.it/ECAI06.pdf</contentURI>
</Article>

Figure 6.9: A portion of the XML document returned to peer A as a search query answer

knowledge discovery, and semantic community management. As far as ontology metadata management is concerned, each Helios peer can define its own peer ontology by acquiring an external, predefined ontology or by composing a new one by applying classical ontology engineering methodologies [Gomez-Perez et al., 2003]. Helios supports the acquisition of ontologies represented in a Semantic Web ontology language (i.e., RDF(S), DAML+OIL, and OWL) through the ontology wrapper manager. External ontological descriptions are stored in a metadata repository implemented as a relational database, and a repository API interface is defined to enable the other toolkit components to access the ontological metadata. In describing the Helios toolkit components, we will see the role of H-Link as network manager (see Section 6.6.1) and the role of the handshake techniques as community manager (see Section 6.6.2).


[Figure 6.10 components: WordNet; Semantic Web ontologies; repository API interface; metadata repository; matching manager; thesaurus manager; ontology wrapping manager; query/answer wrapper; query/answer manager; user interface; network interface; network manager (based on the H-Link mechanism); community manager (based on the handshake techniques)]

Figure 6.10: The Helios toolkit architecture

6.6.1 Enforcing dynamic knowledge discovery in Helios

Dynamic knowledge discovery is supported in Helios by exploiting the toolkit components for probe query processing and ontology matching. The query/answer wrapper and the query/answer manager are exploited for supporting the query processing and the answer composition according to the probe model. Probe query processing is performed by exploiting the matching manager considering the parameters associated with the query (e.g., threshold, matching model). The matching manager performs concept matching against a peer ontology by exploiting the H-Match semantic matchmaker. It exploits the thesaurus manager for acquiring terminological relationships among the terms that constitute the linguistic features of concepts. The thesaurus manager interfaces with WordNet in order to capture the terminological relationships holding between the terms used as concept or property names in a given peer ontology. Moreover, the thesaurus manager supports the inclusion of new, user-defined terminological relationships in the thesaurus. The network manager is based on H-Link and it is responsible for defining semantic query routing in Helios. The network interface is responsible for managing query propagation in the network. Given a query from the query/answer manager, the network interface interacts with the network manager to exploit H-Link and to select the best recipients for each query. Finally, the user interface provides graphical access to the functionalities of the Helios toolkit for supporting user interaction. Helios users can perform queries and visualize answers by exploiting the query/answer wrapper and manager.


6.6.2 Enforcing semantic communities in Helios

According to the Helios toolkit architecture of Figure 6.10, the community manager is defined to implement the handshake techniques for consensus negotiation and semantic community formation in Helios. In particular, the community manager is implemented as a JXTA prototype where each peer is a JXTA node and the handshake techniques are defined as a JXTA application for grouping peers in semantic communities.

6.6.2.1 The JXTA framework

JXTA is a Java-based library of open P2P protocols that allow any connected device on the network (e.g., cell phones, wireless devices, workstations) to communicate, collaborate, and share resources in a P2P manner [Jxta].

Each JXTA peer is identified on the network by a unique URN (Uniform Resource Name) and peer groups are defined to aggregate nodes according to specific sharing goals that need to be pursued. JXTA peers communicate through peer pipes, which are asynchronous communication channels. A pipe allows XML-based message exchange among peers and data transfer in a protocol-independent manner. For further details regarding the JXTA framework, the reader can refer to Jxta; Team [2001]; Gong [2002].

According to Figure 6.11, the JXTA framework is organized in a three-layer architecture as follows:

• JXTA core. The JXTA core provides the basic functionalities for supporting P2P services and applications. In particular, the JXTA core is responsible for providing i) security support for enforcing anonymous peer access, ii) peer group management for handling peer group creation, deletion, advertising, and membership, iii) peer pipe management for supporting peer-to-peer communication among nodes, and iv) peer monitoring control for supervising the peer activity on the network (e.g., access control, traffic metering, and bandwidth balancing).

• JXTA services. The JXTA services are defined to provide higher-level functions and to expand the capabilities of the core. In the JXTA services, we include those applications that are considered as basic services for a generic P2P network, such as searching, sharing, and indexing mechanisms.

• JXTA applications. The JXTA applications are defined on top of the JXTA services in order to provide additional and domain-specific functionalities that can be required in a given peer group. In this respect, developers can exploit the JXTA shell for accessing a set of predefined core-level functions through a command-line interface, and a set of external peer commands for defining more complex functions through pipe combination.

[Figure 6.11 layers, top to bottom: JXTA Applications (JXTA Community Applications, Sun JXTA Applications, JXTA Shell, Peer Commands); JXTA Services (JXTA Community Services, Sun JXTA Services: Indexing, Searching, File Sharing); JXTA Core (Peer Groups, Peer Pipes, Peer Monitoring, Security)]

Figure 6.11: The JXTA framework [Gong, 2002]

The JXTA framework has been conceived to provide a modular architecture where additional services can be quickly prototyped by relying on a basic set of flexible primitives. This way, users can develop specific applications as extensions of an underlying P2P infrastructure where membership and communication functionalities are enabled by default.

6.6.2.2 The JXTA handshake prototype

The JXTA handshake prototype enforces the formation of semantic communities of Helios peers according to the handshake techniques presented in Section 6.2.

Prototype initialization. The JXTA handshake prototype is based on a JXTA P2P network where peers are automatically inserted in a predefined Helios group. The prototype provides a shell interface for enabling a user to interact with the other members of the Helios group in order to negotiate the formation of a semantic community (i.e., a JXTA peer group) on the basis of a specific ICard. On starting the prototype, a new peer with a unique URN is defined. Such a peer joins the predefined Helios group and an XML configuration file is loaded in order to associate the peer with its peer ontology.

Figure 6.12: A screenshot of the JXTA handshake prototype

Prototype description. As shown in the screenshot of Figure 6.12, the /cg (Create Group) command is defined to start the creation of a new semantic community. The following parameters are required by the group creation command:

• Group-Name. The group-name parameter indicates the path of the file containing the ICard of the community that is being created. The file name of the ICard also represents the reference name of the community. Currently, the ICard is represented as an OWL file. In the example of Figure 6.12, the group-name parameter is represented by the string Publication as the community ICard regards the publishing domain.


• Matching-Model. The matching-model parameter indicates the model to use for semantic affinity evaluation of potential community members during the handshake process. Currently, only H-Match models (i.e., surface, shallow, deep, and intensive) are accepted as matching-model parameter. In the example of Figure 6.12, the matching-model parameter is set to deep.

• Threshold. The threshold parameter indicates the minimum level of semantic affinity required for joining the community. In particular, threshold ∈ (0,1]. In the example of Figure 6.12, the threshold parameter is set to 0.6.

• Min-members. The min-members parameter indicates the minimum number of members required for the community commitment. In the example of Figure 6.12, the min-members parameter indicates that at least 100 peers are required for the community commitment.

• Timeout. The timeout parameter indicates the expiration time for community formation. If the community is not committed before the time limit set by the timeout parameter, the creation command fails. The timeout parameter is expressed in milliseconds. In the example of Figure 6.12, the timeout parameter is set to 10000 milliseconds.

The /cg command terminates with a success message indicating that the initial requirements (i.e., minimum number of community members aggregated before the timeout expiration) are satisfied and the community is committed; otherwise, a failure message is displayed and the command fails. In the example of Figure 6.12, the Publication community is committed and the success message (i.e., Group Creation finished with success!) is displayed. After the successful termination of the /cg command, a log file is saved for providing information regarding community members.

On starting the prototype, a node becomes aware of the existing peer groups and related ICards. In addition to the predefined Helios group, a node can apply for joining one or more of the existing peer groups. Currently, the group founder decides whether to accept or reject peer join-requests according to handshake.
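The parameter constraints of the group creation command and its commitment rule can be summarized in a short Python sketch. It is illustrative only and does not reproduce the JXTA prototype: the names CreateGroupParams and community_committed are assumptions, the actual command-line syntax of /cg is not modeled, and the boundary choices (affinity greater than or equal to the threshold, arrival strictly before the timeout, at least one required member) are assumptions where the text leaves them open.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# H-Match models accepted as matching-model parameter.
H_MATCH_MODELS = {"surface", "shallow", "deep", "intensive"}


@dataclass(frozen=True)
class CreateGroupParams:
    group_name: str        # ICard file name, also the community name
    matching_model: str
    threshold: float       # must lie in (0, 1]
    min_members: int
    timeout_ms: int

    def __post_init__(self):
        if self.matching_model not in H_MATCH_MODELS:
            raise ValueError(f"unknown matching model: {self.matching_model}")
        if not 0 < self.threshold <= 1:
            raise ValueError("threshold must be in (0, 1]")
        if self.min_members < 1 or self.timeout_ms <= 0:
            raise ValueError("min-members and timeout must be positive")


def community_committed(params: CreateGroupParams,
                        join_events: Iterable[Tuple[int, float]]) -> bool:
    """Decide commitment from (elapsed_ms, affinity) pairs, one per peer.

    A peer counts as a member if it answers before the timeout and its
    semantic affinity reaches the threshold (assumed boundary choices).
    """
    members = sum(1 for elapsed, affinity in join_events
                  if elapsed < params.timeout_ms
                  and affinity >= params.threshold)
    return members >= params.min_members


if __name__ == "__main__":
    # Parameter values of the example of Figure 6.12.
    params = CreateGroupParams("Publication", "deep", 0.6, 100, 10000)
    on_time = [(2500, 0.7)] * 100   # synthetic join events
    print(community_committed(params, on_time))
```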

Considerations. The JXTA handshake prototype has been conceived to implement the handshake techniques for quickly testing the community formation functionalities within the Helios system. We note that the use of the JXTA framework simplifies the management of the aspects related to peer communication and message propagation, which are automatically addressed by the JXTA infrastructure. For measuring the overhead due to community formation and for comparing community-aware query propagation with the H-Link semantic routing mechanism, we plan to develop specific experiments by simulation (see Chapter 7).


Chapter 7

Conclusions and future work

In this chapter, we present the results of the thesis work and discuss its main contributions with respect to the state of the art. Furthermore, we outline the main research directions that we plan to investigate in the field of P2P semantic routing and peer community formation.

7.1 Synthesis of the thesis results

The thesis work was focused on the development of methods and techniques capable of providing i) semantic query routing in a P2P system and ii) consensus-driven formation of semantic communities of peers. In particular, multi-knowledge P2P systems have been considered as the reference thesis scenario where a set of autonomous and independent peers needs to cooperate by identifying the relevant partners with similar knowledge for a possible semantic collaboration. As a first thesis result, the main existing P2P systems have been surveyed for classification and critical review. Through the survey activity, the lack of semantics in current P2P systems has emerged, thus showing the need to enforce knowledge-sharing P2P systems. In this respect, emergent semantics requirements have been discussed by also defining four classification criteria. These criteria were used for evaluating the level of penetration of emergent semantics issues in current P2P systems (see Chapter 3).

As another main thesis result, the H-Link semantic routing mechanism has been defined. H-Link has been conceived to work in a peer-based system where each node provides its own knowledge. The main contributions of H-Link regard i) the use of ontologies for a semantically rich representation of peer knowledge, and ii) the adoption of ontology matching techniques for query recipient selection. With respect to the emergent semantics requirements discussed in Chapter 3, H-Link is characterized by an expertise-based level of adaptivity for query recipient selection. In particular, H-Link relies on the results of the knowledge discovery interactions among peers to train the query routing behavior and to identify the peers with similar knowledge, which are called semantic neighbors. Query recipients in H-Link are then selected according to confidence and expertise measures associated with the discovered semantic neighbors. As a further contribution of H-Link, semantic affinity values are computed with the H-Match semantic matchmaker and they are exploited for providing semantic neighbor expertise evaluations. This way, H-Link aims at supporting query propagation by exploiting semantics-based criteria, like semantic affinity, rather than topology-based parameters, like peer distance and bandwidth. The results obtained in the experiments show that the H-Link mechanism succeeds in improving the effectiveness of traditional P2P query protocols while also providing interesting results in terms of scalability.

As a final thesis result, the handshake techniques for semantic community formation and management have been developed. The main contributions of the handshake techniques regard i) the formation of autonomous and self-organizing communities of peers, and ii) the use of ontologies and ontology matching techniques for addressing consensus negotiation during the community formation process. With respect to the emergent semantics requirements discussed in Chapter 3, the handshake techniques are characterized by a global-negotiation level of decentralization in community formation. This means that a community is defined as a structured peer organization which originates from an ontology-based declaration of interest and groups those peers which spontaneously agree with this declaration. Semantic communities aim at supporting peer semantic collaboration by aggregating peers with similar interests and by reducing the network load due to overlapping requests of single peers. As a further contribution of semantic communities defined through handshake, the community-aware H-Link mechanism is defined for providing effective communication for those peers which share the same understanding of a domain of interest (i.e., peers that are members of the same community).

Both the H-Link mechanism and the handshake techniques have been developed in the context of the Helios P2P system for ontology-based knowledge discovery and sharing.


7.2 Future research directions

According to the thesis results, two different research directions can be distinguished: one for the H-Link semantic routing mechanism and one for the handshake community formation techniques.

7.2.1 Future research work on H-Link

With respect to H-Link, we plan to finalize its development by concluding the experimental phase and by continuing to work on the algorithm optimization.

Further H-Link experiments. Further experiments are required to identify the causes of the unexpected H-Link behaviors observed in Chapter 5. In this respect, the main action that we are investigating is to exploit another test case from a different domain for assessing whether the H-Link behavior is coherent. Our hypothesis is that most of the irregular measures are due to the use of ontologies that are not sufficiently large for the size of the considered tests. In particular, we need to verify how the H-Link behavior is affected by different levels of peer ontology overlapping.

As a further experiment, we are working on evaluating the H-Link behavior when a different model for peer-knowledge distribution is exploited. In particular, we intend to assess the variation of H-Link performance in terms of scalability when the Zipf distribution is adopted. As discussed in Schlosser et al. [2003], this choice is due to the fact that the Zipf distribution provides a non-uniform distribution and a more realistic model of peer-knowledge assignment, where some concepts are queried more than others and certain concepts have a higher likelihood of being assigned to a peer ontology during the distribution phase.

At the same time, we plan to compare H-Link with other similar P2P semantic routing mechanisms. By considering Table 3.2, we note that H-Link is characterized by a number of similarities with the REMINDIN’ approach [Staab et al., 2004]. In particular, both systems rely on ontologies and expertise-based levels of adaptivity for selecting query recipients. In our opinion, the main difference between H-Link and REMINDIN’ consists in a different method for computing peer confidence. Thus, apart from the performance evaluation based on traffic and recall, the experimental comparison of the two approaches is motivated by the possibility to assess the benefits in adopting a semantic matchmaker like H-Match in computing peer confidence values.
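To make the Zipf-based peer-knowledge assignment mentioned above more concrete, the following is a minimal sketch, not the simulator actually used in the thesis. The function names, the exponent `s`, and the use of concept rank as popularity are assumptions introduced for illustration; the range of 1-3 concepts per peer mirrors the settings reported in Appendix A.

```python
import random


def zipf_weights(n_concepts, s=1.0):
    """Unnormalized Zipf weights: the concept of rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n_concepts + 1)]


def assign_concepts(n_peers, concepts, min_per_peer=1, max_per_peer=3, s=1.0, seed=0):
    """Assign each peer a small set of distinct concepts drawn with Zipf probabilities.

    `concepts` is assumed to be ordered by popularity (most popular first).
    """
    if max_per_peer > len(concepts):
        raise ValueError("not enough concepts to fill a peer ontology")
    rng = random.Random(seed)
    weights = zipf_weights(len(concepts), s)
    assignment = {}
    for peer in range(n_peers):
        target_size = rng.randint(min_per_peer, max_per_peer)
        chosen = set()
        # Redraw until the peer holds `target_size` distinct concepts.
        while len(chosen) < target_size:
            chosen.add(rng.choices(concepts, weights=weights, k=1)[0])
        assignment[peer] = sorted(chosen)
    return assignment
```

With such a generator, popular concepts end up in many peer ontologies while rare concepts appear in only a few, which is the non-uniformity the planned experiment intends to study.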


H-Link optimization. As regards the H-Link development, we are working on some algorithm refinements devoted to improving the current methods for confidence computation and for peer ontology management. In particular, we are investigating the opportunity to modify the credit distribution procedure (currently based on the harmonic mean) by considering the recommendation adjustment techniques developed in the field of document retrieval in distributed environments [Huhns et al., 1986]. Furthermore, we are investigating the adoption of network-sniffing techniques for populating the network knowledge of a peer. During peer bootstrapping, the H-Link mechanism needs a training phase for locating its semantic neighbors. Network sniffing can be used to speed up H-Link training by observing the queries of other peers and the related replies in order to discover the semantic neighbors.
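The network-sniffing idea can be illustrated with a small sketch. This is a hypothetical data structure, not part of H-Link: the class name, its methods, and the choice of ranking peers by the number of observed replies are assumptions made only to show how overheard traffic could seed the list of candidate semantic neighbors.

```python
from collections import defaultdict


class SniffedKnowledge:
    """Records which peers were observed replying to queries about each concept."""

    def __init__(self):
        # concept -> replying peer -> number of observed replies
        self._hits = defaultdict(lambda: defaultdict(int))

    def observe_reply(self, query_concept, replying_peer):
        """Called whenever a reply to someone else's query is overheard."""
        self._hits[query_concept][replying_peer] += 1

    def candidate_neighbors(self, concept, top_k=3):
        """Return up to `top_k` peers, most frequently observed first (ties by name)."""
        peers = self._hits.get(concept, {})
        ranked = sorted(peers.items(), key=lambda item: (-item[1], item[0]))
        return [peer for peer, _ in ranked[:top_k]]
```

A newly bootstrapped peer could consult `candidate_neighbors` before issuing its first queries, instead of starting the training phase with no network knowledge at all.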

7.2.2 Future research work on handshake techniques

With respect to the handshake techniques, we plan to start an experimental evaluation phase for comparison with the H-Link semantic routing mechanism. Furthermore, we plan to enhance the current handshake techniques with advanced negotiation strategies.

Experimentation on semantic communities. As discussed in Chapter 6, we expect that the H-Link effectiveness can be further improved by relying on the semantic communities of peers defined through the handshake techniques. In order to verify this hypothesis, we aim at comparing the default H-Link mechanism with an extended version supporting semantic communities. In particular, we are working on a simulation test based on the community-aware H-Link mechanism. The test will be performed by considering a network composed of both single peers and semantic communities where peer ontologies and community ICards are randomly generated. Currently, we are developing the software procedure for allowing query propagation at the intra-community level. Such a procedure is invoked for generating the community tree structure that is placed on top of the default network topology.

Advanced negotiation techniques. Currently, the handshake techniques assign a central role to the community founder, which is responsible for membership approval and community commitment. In this respect, we plan to refine the current handshake process in order to share the responsibilities between the founder and the community members during the formation process. In particular, we are working on the definition of advanced consensus negotiation techniques where the community ICard is the result of a distributed negotiation process in which the founder and the interested peers interact and discuss changes to the community ICard until an agreement among them is established. In the negotiation phase, each interested peer can propose to modify the ICard ontology to capture its personal perception of the community interests. A voting algorithm is being defined to accept/reject the peer proposals and to guarantee the termination of the negotiation process. ICard modification proposals are defined in terms of insert, update, and delete operations on ontology concepts, and ontology evolution techniques are required to this end. Some preliminary results regarding ontology evolution in open distributed systems are discussed in Castano et al. [2006c].

Finally, we note that semantic communities emerge as a consequence of user-driven events. In future research work, we are interested in developing popularity-driven community aggregation techniques, where a peer founder can advertise a new community on the basis of queries sniffed in the network. When a large number of queries in the network is due to similar requests, a peer can propose to establish a semantic community regarding such a popular topic.
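Since the voting algorithm is still being defined, the following is only an illustrative sketch of how insert, update, and delete proposals on an ICard could be accepted or rejected with a guaranteed end. Everything here is an assumption: the ICard is reduced to a dictionary from concept names to descriptions, voters are plain callables, acceptance requires a strict majority, and termination is obtained by bounding the number of proposals examined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    op: str                 # "insert", "update", or "delete"
    concept: str
    description: str = ""


def apply_proposal(icard, proposal):
    """Return a new ICard with the proposal applied; invalid operations are no-ops."""
    updated = dict(icard)
    if proposal.op == "insert" and proposal.concept not in updated:
        updated[proposal.concept] = proposal.description
    elif proposal.op == "update" and proposal.concept in updated:
        updated[proposal.concept] = proposal.description
    elif proposal.op == "delete":
        updated.pop(proposal.concept, None)
    return updated


def negotiate(icard, proposals, voters, max_rounds=10):
    """Examine at most `max_rounds` proposals; apply those backed by a strict majority.

    `voters` is a list of callables (icard, proposal) -> bool, one per peer
    (founder included). The bound on rounds is what guarantees termination here.
    """
    current = dict(icard)
    for proposal in proposals[:max_rounds]:
        votes_for = sum(1 for vote in voters if vote(current, proposal))
        if votes_for * 2 > len(voters):
            current = apply_proposal(current, proposal)
    return current
```

In this simplified form the founder is just one voter among the others, which reflects the stated goal of sharing responsibilities between the founder and the community members.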


Appendix A

Complete simulation results on H-Link

A.1 H-Link scalability evaluation

A.2 Comparison of H-Link with Gnutella


| #Nodes | #Concepts per node | #Connections per node | #credits | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|
| 100 | 1-3 | 3-5 | 12 | 148 | 0.0 | 1.0 | 0.57 |
| 280 | 1-3 | 4-6 | 12 | 232 | 0.0 | 1.0 | 0.60 |
| 460 | 1-3 | 5-7 | 12 | 295 | 0.0 | 1.0 | 0.53 |
| 640 | 1-3 | 6-8 | 12 | 373 | 0.0 | 1.0 | 0.53 |
| 820 | 1-3 | 7-9 | 12 | 446 | 0.0 | 1.0 | 0.51 |
| 1000 | 1-3 | 8-10 | 12 | 531 | 0.0 | 1.0 | 0.48 |

Figure A.1: Evaluation of H-Link scalability for #credits = 12: (a) Traffic (b) Recall


| #Nodes | #Concepts per node | #Connections per node | #credits | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|
| 100 | 1-3 | 3-5 | 16 | 228 | 0.0 | 1.0 | 0.65 |
| 280 | 1-3 | 4-6 | 16 | 404 | 0.0 | 1.0 | 0.74 |
| 460 | 1-3 | 5-7 | 16 | 491 | 0.0 | 1.0 | 0.69 |
| 640 | 1-3 | 6-8 | 16 | 600 | 0.0 | 1.0 | 0.71 |
| 820 | 1-3 | 7-9 | 16 | 826 | 0.0 | 1.0 | 0.69 |
| 1000 | 1-3 | 8-10 | 16 | 889 | 0.1 | 1.0 | 0.66 |

Figure A.2: Evaluation of H-Link scalability for #credits = 16: (a) Traffic (b) Recall


| #Nodes | #Concepts per node | #Connections per node | #credits | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|
| 100 | 1-3 | 3-5 | 20 | 292 | 0.0 | 1.0 | 0.69 |
| 280 | 1-3 | 4-6 | 20 | 581 | 0.0 | 1.0 | 0.8 |
| 460 | 1-3 | 5-7 | 20 | 768 | 0.0 | 1.0 | 0.78 |
| 640 | 1-3 | 6-8 | 20 | 848 | 0.0 | 1.0 | 0.82 |
| 820 | 1-3 | 7-9 | 20 | 1229 | 0.0 | 1.0 | 0.73 |
| 1000 | 1-3 | 8-10 | 20 | 1348 | 0.1 | 1.0 | 0.69 |

Figure A.3: Evaluation of H-Link scalability for #credits = 20: (a) Traffic (b) Recall


| #Nodes | #Concepts per node | #Connections per node | #credits | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|
| 100 | 1-3 | 3-5 | 24 | 372 | 0.0 | 1.0 | 0.71 |
| 280 | 1-3 | 4-6 | 24 | 772 | 0.0 | 1.0 | 0.84 |
| 460 | 1-3 | 5-7 | 24 | 977 | 0.0 | 1.0 | 0.83 |
| 640 | 1-3 | 6-8 | 24 | 1097 | 0.2 | 1.0 | 0.87 |
| 820 | 1-3 | 7-9 | 24 | 2068 | 0.1 | 1.0 | 0.76 |
| 1000 | 1-3 | 8-10 | 24 | 2247 | 0.1 | 1.0 | 0.73 |

Figure A.4: Evaluation of H-Link scalability for #credits = 24: (a) Traffic (b) Recall


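| Approach | #Nodes | #Concepts per node | #Connections per node | #credits (TTL) | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|---|
| H-Link with the Surface matching model | 500 | 1-3 | 3-5 | 27 | 108 | 0.0 | 1.0 | 0.25 |
| H-Link with the Surface matching model | 500 | 1-3 | 3-5 | 36 | 171 | 0.0 | 1.0 | 0.34 |
| H-Link with the Surface matching model | 500 | 1-3 | 3-5 | 45 | 236 | 0.0 | 1.0 | 0.39 |
| H-Link with the Surface matching model | 500 | 1-3 | 3-5 | 54 | 260 | 0.0 | 1.0 | 0.41 |
| H-Link with the Shallow matching model | 500 | 1-3 | 3-5 | 27 | 146 | 0.0 | 1.0 | 0.26 |
| H-Link with the Shallow matching model | 500 | 1-3 | 3-5 | 36 | 214 | 0.0 | 1.0 | 0.38 |
| H-Link with the Shallow matching model | 500 | 1-3 | 3-5 | 45 | 301 | 0.0 | 1.0 | 0.46 |
| H-Link with the Shallow matching model | 500 | 1-3 | 3-5 | 54 | 381 | 0.0 | 1.0 | 0.51 |
| H-Link with the Deep matching model | 500 | 1-3 | 3-5 | 27 | 133 | 0.0 | 1.0 | 0.33 |
| H-Link with the Deep matching model | 500 | 1-3 | 3-5 | 36 | 204 | 0.0 | 1.0 | 0.47 |
| H-Link with the Deep matching model | 500 | 1-3 | 3-5 | 45 | 296 | 0.0 | 1.0 | 0.56 |
| H-Link with the Deep matching model | 500 | 1-3 | 3-5 | 54 | 366 | 0.0 | 1.0 | 0.63 |
| H-Link with the Intensive matching model | 500 | 1-3 | 3-5 | 27 | 147 | 0.0 | 1.0 | 0.53 |
| H-Link with the Intensive matching model | 500 | 1-3 | 3-5 | 36 | 221 | 0.0 | 1.0 | 0.65 |
| H-Link with the Intensive matching model | 500 | 1-3 | 3-5 | 45 | 308 | 0.0 | 1.0 | 0.72 |
| H-Link with the Intensive matching model | 500 | 1-3 | 3-5 | 54 | 388 | 0.0 | 1.0 | 0.79 |
| Gnutella | 500 | 1-3 | 3-5 | (3) | 59 | 0.0 | 0.6 | 0.10 |
| Gnutella | 500 | 1-3 | 3-5 | (4) | 202 | 0.0 | 1.0 | 0.30 |
| Gnutella | 500 | 1-3 | 3-5 | (5) | 591 | 0.0 | 1.0 | 0.64 |
| Gnutella | 500 | 1-3 | 3-5 | (6) | 1218 | 0.4 | 1.0 | 0.88 |

Figure A.5: Comparison of H-Link with Gnutella for #connections per node = 3−5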
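One possible way to read the comparison in Figure A.5 is to relate average recall to average traffic. The sketch below does this for two groups of rows copied from the table (H-Link with the Intensive matching model, and Gnutella). The "recall per 100 traffic units" ratio is an illustrative measure introduced here, not a metric defined in the thesis, and the function names are hypothetical.

```python
# (avg traffic, avg recall) pairs copied from the Figure A.5 table
# (500 nodes, 1-3 concepts per node, 3-5 connections per node).
H_LINK_INTENSIVE = {27: (147, 0.53), 36: (221, 0.65), 45: (308, 0.72), 54: (388, 0.79)}  # key: #credits
GNUTELLA = {3: (59, 0.10), 4: (202, 0.30), 5: (591, 0.64), 6: (1218, 0.88)}              # key: TTL


def recall_per_traffic(traffic, recall, scale=100):
    """Average recall obtained per `scale` units of average traffic (illustrative ratio)."""
    return scale * recall / traffic


def summarize(rows):
    """Map each budget (credits or TTL) to its rounded recall-per-traffic ratio."""
    return {budget: round(recall_per_traffic(t, r), 3) for budget, (t, r) in rows.items()}


if __name__ == "__main__":
    print("H-Link (Intensive):", summarize(H_LINK_INTENSIVE))
    print("Gnutella:", summarize(GNUTELLA))
```

A ratio like this only summarizes the reported averages; it says nothing about the spread between minimum and maximum recall, which the table reports separately.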


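| Approach | #Nodes | #Concepts per node | #Connections per node | #credits (TTL) | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|---|
| H-Link with the Surface matching model | 500 | 1-3 | 4-6 | 27 | 175 | 0.0 | 1.0 | 0.37 |
| H-Link with the Surface matching model | 500 | 1-3 | 4-6 | 36 | 254 | 0.0 | 1.0 | 0.42 |
| H-Link with the Surface matching model | 500 | 1-3 | 4-6 | 45 | 298 | 0.0 | 1.0 | 0.45 |
| H-Link with the Surface matching model | 500 | 1-3 | 4-6 | 54 | 584 | 0.0 | 1.0 | 0.52 |
| H-Link with the Shallow matching model | 500 | 1-3 | 4-6 | 27 | 225 | 0.0 | 1.0 | 0.45 |
| H-Link with the Shallow matching model | 500 | 1-3 | 4-6 | 36 | 322 | 0.0 | 1.0 | 0.55 |
| H-Link with the Shallow matching model | 500 | 1-3 | 4-6 | 45 | 438 | 0.0 | 1.0 | 0.59 |
| H-Link with the Shallow matching model | 500 | 1-3 | 4-6 | 54 | 584 | 0.0 | 1.0 | 0.65 |
| H-Link with the Deep matching model | 500 | 1-3 | 4-6 | 27 | 206 | 0.0 | 1.0 | 0.48 |
| H-Link with the Deep matching model | 500 | 1-3 | 4-6 | 36 | 324 | 0.0 | 1.0 | 0.57 |
| H-Link with the Deep matching model | 500 | 1-3 | 4-6 | 45 | 428 | 0.0 | 1.0 | 0.64 |
| H-Link with the Deep matching model | 500 | 1-3 | 4-6 | 54 | 834 | 0.0 | 1.0 | 0.70 |
| H-Link with the Intensive matching model | 500 | 1-3 | 4-6 | 27 | 227 | 0.0 | 1.0 | 0.53 |
| H-Link with the Intensive matching model | 500 | 1-3 | 4-6 | 36 | 342 | 0.0 | 1.0 | 0.65 |
| H-Link with the Intensive matching model | 500 | 1-3 | 4-6 | 45 | 454 | 0.0 | 1.0 | 0.72 |
| H-Link with the Intensive matching model | 500 | 1-3 | 4-6 | 54 | 697 | 0.0 | 1.0 | 0.79 |
| Gnutella | 500 | 1-3 | 4-6 | (3) | 113 | 0.0 | 0.8 | 0.23 |
| Gnutella | 500 | 1-3 | 4-6 | (4) | 464 | 0.0 | 1.0 | 0.65 |
| Gnutella | 500 | 1-3 | 4-6 | (5) | 1364 | 0.5 | 1.0 | 0.95 |
| Gnutella | 500 | 1-3 | 4-6 | (6) | 2107 | 0.9 | 1.0 | 0.99 |

Figure A.6: Comparison of H-Link with Gnutella for #connections per node = 4−6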


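| Approach | #Nodes | #Concepts per node | #Connections per node | #credits (TTL) | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|---|
| H-Link with the Surface matching model | 500 | 1-3 | 5-7 | 27 | 235 | 0.0 | 1.0 | 0.41 |
| H-Link with the Surface matching model | 500 | 1-3 | 5-7 | 36 | 361 | 0.0 | 1.0 | 0.46 |
| H-Link with the Surface matching model | 500 | 1-3 | 5-7 | 45 | 674 | 0.0 | 1.0 | 0.51 |
| H-Link with the Surface matching model | 500 | 1-3 | 5-7 | 54 | 951 | 0.0 | 1.0 | 0.56 |
| H-Link with the Shallow matching model | 500 | 1-3 | 5-7 | 27 | 311 | 0.0 | 1.0 | 0.55 |
| H-Link with the Shallow matching model | 500 | 1-3 | 5-7 | 36 | 454 | 0.0 | 1.0 | 0.62 |
| H-Link with the Shallow matching model | 500 | 1-3 | 5-7 | 45 | 641 | 0.0 | 1.0 | 0.70 |
| H-Link with the Shallow matching model | 500 | 1-3 | 5-7 | 54 | 804 | 0.0 | 1.0 | 0.73 |
| H-Link with the Deep matching model | 500 | 1-3 | 5-7 | 27 | 289 | 0.0 | 1.0 | 0.58 |
| H-Link with the Deep matching model | 500 | 1-3 | 5-7 | 36 | 473 | 0.0 | 1.0 | 0.66 |
| H-Link with the Deep matching model | 500 | 1-3 | 5-7 | 45 | 834 | 0.0 | 1.0 | 0.74 |
| H-Link with the Deep matching model | 500 | 1-3 | 5-7 | 54 | 1069 | 0.0 | 1.0 | 0.78 |
| H-Link with the Intensive matching model | 500 | 1-3 | 5-7 | 27 | 330 | 0.0 | 1.0 | 0.62 |
| H-Link with the Intensive matching model | 500 | 1-3 | 5-7 | 36 | 479 | 0.0 | 1.0 | 0.73 |
| H-Link with the Intensive matching model | 500 | 1-3 | 5-7 | 45 | 780 | 0.0 | 1.0 | 0.82 |
| H-Link with the Intensive matching model | 500 | 1-3 | 5-7 | 54 | 984 | 0.0 | 1.0 | 0.85 |
| Gnutella | 500 | 1-3 | 5-7 | (3) | 193 | 0.0 | 1.0 | 0.33 |
| Gnutella | 500 | 1-3 | 5-7 | (4) | 876 | 0.2 | 1.0 | 0.84 |
| Gnutella | 500 | 1-3 | 5-7 | (5) | 2258 | 0.8 | 1.0 | 1.00 |
| Gnutella | 500 | 1-3 | 5-7 | (6) | 2702 | 1.0 | 1.0 | 1.00 |

Figure A.7: Comparison of H-Link with Gnutella for #connections per node = 5−7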


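| Approach | #Nodes | #Concepts per node | #Connections per node | #credits (TTL) | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|---|
| H-Link with the Surface matching model | 500 | 1-3 | 6-8 | 27 | 326 | 0.0 | 1.0 | 0.49 |
| H-Link with the Surface matching model | 500 | 1-3 | 6-8 | 36 | 508 | 0.0 | 1.0 | 0.53 |
| H-Link with the Surface matching model | 500 | 1-3 | 6-8 | 45 | 996 | 0.0 | 1.0 | 0.59 |
| H-Link with the Surface matching model | 500 | 1-3 | 6-8 | 54 | 1295 | 0.0 | 1.0 | 0.62 |
| H-Link with the Shallow matching model | 500 | 1-3 | 6-8 | 27 | 411 | 0.0 | 1.0 | 0.62 |
| H-Link with the Shallow matching model | 500 | 1-3 | 6-8 | 36 | 614 | 0.0 | 1.0 | 0.68 |
| H-Link with the Shallow matching model | 500 | 1-3 | 6-8 | 45 | 846 | 0.0 | 1.0 | 0.74 |
| H-Link with the Shallow matching model | 500 | 1-3 | 6-8 | 54 | 1074 | 0.0 | 1.0 | 0.77 |
| H-Link with the Deep matching model | 500 | 1-3 | 6-8 | 27 | 404 | 0.0 | 1.0 | 0.73 |
| H-Link with the Deep matching model | 500 | 1-3 | 6-8 | 36 | 621 | 0.0 | 1.0 | 0.79 |
| H-Link with the Deep matching model | 500 | 1-3 | 6-8 | 45 | 1138 | 0.0 | 1.0 | 0.85 |
| H-Link with the Deep matching model | 500 | 1-3 | 6-8 | 54 | 1510 | 0.0 | 1.0 | 0.88 |
| H-Link with the Intensive matching model | 500 | 1-3 | 6-8 | 27 | 425 | 0.0 | 1.0 | 0.74 |
| H-Link with the Intensive matching model | 500 | 1-3 | 6-8 | 36 | 632 | 0.0 | 1.0 | 0.84 |
| H-Link with the Intensive matching model | 500 | 1-3 | 6-8 | 45 | 994 | 0.0 | 1.0 | 0.88 |
| H-Link with the Intensive matching model | 500 | 1-3 | 6-8 | 54 | 1348 | 0.0 | 1.0 | 0.91 |
| Gnutella | 500 | 1-3 | 6-8 | (3) | 310 | 0.0 | 1.0 | 0.48 |
| Gnutella | 500 | 1-3 | 6-8 | (4) | 1512 | 0.4 | 1.0 | 0.96 |
| Gnutella | 500 | 1-3 | 6-8 | (5) | 3102 | 1.0 | 1.0 | 1.00 |
| Gnutella | 500 | 1-3 | 6-8 | (6) | 3242 | 1.0 | 1.0 | 1.00 |

Figure A.8: Comparison of H-Link with Gnutella for #connections per node = 6−8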


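| Approach | #Nodes | #Concepts per node | #Connections per node | #credits (TTL) | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|---|
| H-Link with the Surface matching model | 500 | 1-3 | 7-9 | 27 | 430 | 0.0 | 1.0 | 0.50 |
| H-Link with the Surface matching model | 500 | 1-3 | 7-9 | 36 | 958 | 0.0 | 1.0 | 0.58 |
| H-Link with the Surface matching model | 500 | 1-3 | 7-9 | 45 | 1273 | 0.0 | 1.0 | 0.62 |
| H-Link with the Surface matching model | 500 | 1-3 | 7-9 | 54 | 2211 | 0.0 | 1.0 | 0.67 |
| H-Link with the Shallow matching model | 500 | 1-3 | 7-9 | 27 | 544 | 0.0 | 1.0 | 0.69 |
| H-Link with the Shallow matching model | 500 | 1-3 | 7-9 | 36 | 837 | 0.0 | 1.0 | 0.75 |
| H-Link with the Shallow matching model | 500 | 1-3 | 7-9 | 45 | 1126 | 0.0 | 1.0 | 0.78 |
| H-Link with the Shallow matching model | 500 | 1-3 | 7-9 | 54 | 1488 | 0.0 | 1.0 | 0.81 |
| H-Link with the Deep matching model | 500 | 1-3 | 7-9 | 27 | 542 | 0.1 | 1.0 | 0.73 |
| H-Link with the Deep matching model | 500 | 1-3 | 7-9 | 36 | 1079 | 0.0 | 1.0 | 0.81 |
| H-Link with the Deep matching model | 500 | 1-3 | 7-9 | 45 | 1472 | 0.1 | 1.0 | 0.85 |
| H-Link with the Deep matching model | 500 | 1-3 | 7-9 | 54 | 2276 | 0.0 | 1.0 | 0.89 |
| H-Link with the Intensive matching model | 500 | 1-3 | 7-9 | 27 | 564 | 0.0 | 1.0 | 0.79 |
| H-Link with the Intensive matching model | 500 | 1-3 | 7-9 | 36 | 978 | 0.0 | 1.0 | 0.87 |
| H-Link with the Intensive matching model | 500 | 1-3 | 7-9 | 45 | 1347 | 0.0 | 1.0 | 0.90 |
| H-Link with the Intensive matching model | 500 | 1-3 | 7-9 | 54 | 1874 | 0.0 | 1.0 | 0.94 |
| Gnutella | 500 | 1-3 | 7-9 | (3) | 457 | 0.0 | 1.0 | 0.61 |
| Gnutella | 500 | 1-3 | 7-9 | (4) | 2258 | 0.7 | 1.0 | 1.00 |
| Gnutella | 500 | 1-3 | 7-9 | (5) | 3697 | 1.0 | 1.0 | 1.00 |
| Gnutella | 500 | 1-3 | 7-9 | (6) | 3746 | 1.0 | 1.0 | 1.00 |

Figure A.9: Comparison of H-Link with Gnutella for #connections per node = 7−9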


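| Approach | #Nodes | #Concepts per node | #Connections per node | #credits (TTL) | Avg Traffic | Min Recall | Max Recall | Avg Recall |
|---|---|---|---|---|---|---|---|---|
| H-Link with the Surface matching model | 500 | 1-3 | 8-10 | 27 | 460 | 0.0 | 1.0 | 0.53 |
| H-Link with the Surface matching model | 500 | 1-3 | 8-10 | 36 | 1188 | 0.1 | 1.0 | 0.62 |
| H-Link with the Surface matching model | 500 | 1-3 | 8-10 | 45 | 1697 | 0.1 | 1.0 | 0.66 |
| H-Link with the Surface matching model | 500 | 1-3 | 8-10 | 54 | 2840 | 0.1 | 1.0 | 0.70 |
| H-Link with the Shallow matching model | 500 | 1-3 | 8-10 | 27 | 664 | 0.0 | 1.0 | 0.71 |
| H-Link with the Shallow matching model | 500 | 1-3 | 8-10 | 36 | 1058 | 0.0 | 1.0 | 0.77 |
| H-Link with the Shallow matching model | 500 | 1-3 | 8-10 | 45 | 1449 | 0.0 | 1.0 | 0.80 |
| H-Link with the Shallow matching model | 500 | 1-3 | 8-10 | 54 | 1923 | 0.0 | 1.0 | 0.83 |
| H-Link with the Deep matching model | 500 | 1-3 | 8-10 | 27 | 615 | 0.1 | 1.0 | 0.77 |
| H-Link with the Deep matching model | 500 | 1-3 | 8-10 | 36 | 1358 | 0.1 | 1.0 | 0.86 |
| H-Link with the Deep matching model | 500 | 1-3 | 8-10 | 45 | 1896 | 0.1 | 1.0 | 0.89 |
| H-Link with the Deep matching model | 500 | 1-3 | 8-10 | 54 | 2865 | 0.2 | 1.0 | 0.91 |
| H-Link with the Intensive matching model | 500 | 1-3 | 8-10 | 27 | 703 | 0.0 | 1.0 | 0.82 |
| H-Link with the Intensive matching model | 500 | 1-3 | 8-10 | 36 | 1253 | 0.1 | 1.0 | 0.89 |
| H-Link with the Intensive matching model | 500 | 1-3 | 8-10 | 45 | 1803 | 0.1 | 1.0 | 0.93 |
| H-Link with the Intensive matching model | 500 | 1-3 | 8-10 | 54 | 2671 | 0.1 | 1.0 | 0.96 |
| Gnutella | 500 | 1-3 | 8-10 | (3) | 645 | 0.2 | 1.0 | 0.74 |
| Gnutella | 500 | 1-3 | 8-10 | (4) | 3089 | 0.8 | 1.0 | 1.00 |
| Gnutella | 500 | 1-3 | 8-10 | (5) | 4240 | 1.0 | 1.0 | 1.00 |
| Gnutella | 500 | 1-3 | 8-10 | (6) | 4254 | 1.0 | 1.0 | 1.00 |

Figure A.10: Comparison of H-Link with Gnutella for #connections per node = 8−10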


Appendix B

Summary of the H-Match matching techniques


Technique: Affinity function
Description:

$$A(t,t') = \begin{cases} \max_{i=1 \ldots k} w_{t \rightarrow^{n}_{i} t'} & \text{if } k \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

$k$ is the number of paths between $t$ and $t'$ in $Th$; $t \rightarrow^{n}_{i} t'$ denotes the $i$th path of length $n \geq 1$; $w_{t \rightarrow^{n}_{i} t'} = w_{1_{tr}} \cdot w_{2_{tr}} \cdot \ldots \cdot w_{n_{tr}}$ is the weight associated with the $i$th path, and $w_{j_{tr}}$, $j = 1, 2, \ldots, n$ denotes the weight associated with the $j$th terminological relationship in the path.

Technique: Datatype compatibility
Description:

$$T(dt,dt') = \begin{cases} 1 & \text{iff } \exists \text{ a compatibility rule for } dt, dt' \text{ in } CR \\ 0 & \text{otherwise} \end{cases}$$

$CR$ denotes a set of compatibility rules among datatypes.

Technique: Closeness function
Description:

$$C(e,e') = 1 - |W_e - W_{e'}|$$

$W_e$ and $W_{e'}$ are the weights associated with $e$ and $e'$, respectively. For any pair of elements $e$ and $e'$, the highest value (i.e., 1.0) is obtained when the weights of $e$ and $e'$ coincide. The higher the difference between $W_e$ and $W_{e'}$, the lower the closeness value of $e$ and $e'$.

Technique: Property values evaluation
Description:

$$v(p_i) = \begin{cases} \max A(n_{p_i}, n_{p_j}) \cdot A(v_{p_i}, v_{p_j}) & \text{iff } v_{p_i} \text{ is a name} \\ \max A(n_{p_i}, n_{p_j}) \cdot T(v_{p_i}, v_{p_j}) & \text{iff } v_{p_i} \text{ is a datatype} \end{cases}$$

$n_{p_i}$ denotes the name of a property $p_i$, while $v_{p_i}$ denotes its value, which can be a reference name of a concept or an individual as well as a datatype.

Technique: Surface matching
Description:

$$SA_{c,c'} \equiv A(n_c, n_{c'})$$

$n_c$ and $n_{c'}$ are the names of $c$ and $c'$, respectively.

Technique: Shallow matching
Description:

$$SA_{c,c'} = w_{la} \cdot A(n_c, n_{c'}) + (1 - w_{la}) \cdot \frac{\sum_{i=1}^{|P(c)|} m(p_i)}{|P(c)|}$$

$m(p_i)$ denotes the best matching value for a property $p_i$.

Technique: Deep matching
Description:

$$SA_{c,c'} = w_{la} \cdot A(n_c, n_{c'}) + (1 - w_{la}) \cdot \frac{\sum_{i=1}^{|Ctx(c)|} m(e_i)}{|Ctx(c)|}$$

$m(e_i)$ denotes the best matching value for a context element $e_i$.

Technique: Intensive matching
Description:

$$SA_{c,c'} = w_{la} \cdot A(n_c, n_{c'}) + (1 - w_{la}) \cdot \frac{\sum_{i=1}^{|Ctx(c)|} m(e_i) + \sum_{j=1}^{|P(c)|} v(p_j)}{|Ctx(c)| + |P(c)|}$$

$m(e_i)$ denotes the best matching value for a context element $e_i$.

Technique: Matching policy
Description:

$$\langle mm, w_{la}, t \rangle$$

$mm$: the matching model
$w_{la}$: the linguistic affinity weight
$t$: the threshold
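As a reading aid for the formulas summarized above, the following sketch evaluates the affinity function, the closeness function, and the shallow matching formula on plain numbers. It is not the H-Match implementation: the function names and the input representation (each path given as a list of relationship weights) are assumptions, and returning 0 for the structural part when a concept has no properties is a choice made here because the formula is undefined for $|P(c)| = 0$.

```python
from math import prod


def affinity(paths):
    """A(t, t'): maximum over paths of the product of relationship weights; 0 if no path.

    `paths` is a list of paths, each path being a list of weights w_{j_tr}.
    """
    if not paths:
        return 0.0
    return max(prod(path) for path in paths)


def closeness(w_e, w_e_prime):
    """C(e, e') = 1 - |W_e - W_e'|."""
    return 1.0 - abs(w_e - w_e_prime)


def shallow_matching(name_affinity, property_matches, w_la):
    """SA_{c,c'} = w_la * A(n_c, n_c') + (1 - w_la) * mean of m(p_i) over P(c).

    `property_matches` holds the best matching value m(p_i) of each property of c.
    Assumption: with no properties, the structural part is taken as 0.
    """
    if property_matches:
        structural = sum(property_matches) / len(property_matches)
    else:
        structural = 0.0
    return w_la * name_affinity + (1.0 - w_la) * structural
```

For instance, with two paths whose weights are (1.0, 0.8) and (0.5), `affinity` selects the first path because its product is larger; these weight values are illustrative only. Deep and intensive matching follow the same pattern, with the average taken over context elements, or over context elements and property values together.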


Bibliography

A. AbdulRahman and S. Hailes. Relying on Trust to Find Reliable Information. In Proc. of 1999 Int. Symposium on Database, Web and Cooperative Systems (DWACOS’99), Baden-Baden, Germany, 1999.

K. Aberer and P. Cudre-Mauroux. Semantic Overlay Networks. In Proc. of the 31st Int. Conference on Very Large Data Bases (VLDB05), Trondheim, Norway, 2005. Tutorial.

K. Aberer, P. Cudre-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt. P-Grid: a Self-Organizing Structured P2P System. SIGMOD Record, 32(3):29–33, 2003a.

K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. The Chatty Web: Emergent Semantics through Gossiping. In Proc. of the 12th Int. World Wide Web Conference (WWW 2003), Budapest, Hungary, 2003b.

K. Aberer, P. Cudre-Mauroux, A.M. Ouksel, T. Catarci, M.-S. Hacid, A. Illarramendi, V. Kashyap, M. Mecella, E. Mena, E.J. Neuhold, O. De Troyer, T. Risse, M. Scannapieco, F. Saltor, L. De Santis, S. Spaccapietra, S. Staab, and R. Studer. Emergent Semantics Principles and Issues. In Proc. of the 9th Int. Conference on Database Systems for Advanced Applications - DASFAA 2004, pages 25–38, Jeju Island, Korea, 2004.

K. Aberer, A. Datta, M. Hauswirth, and R. Schmidt. Indexing Data-Oriented Overlay Networks. In Proc. of the 31st International Conference on Very Large Databases - VLDB 2005, Trondheim, Norway, 2005.

K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. A Framework for Semantic Gossiping. ACM SIGMOD Record, 31(4):48–53, 2002.


H. Afsarmanesh, C. Garita, and L.O. Hertzberger. Virtual Enterprises and Federated Information Sharing. In Proc. of the 9th Int. Conference on Database and Expert Systems Applications (DEXA 1998), Vienna, Austria, 1998.

A. Agostini and G. Moro. Identification of Communities of Peers by Trust and Reputation. In Proc. of the 11th Int. Conference on Artificial Intelligence: Methodology, Systems, Applications - Semantic Web Challenges (AIMSA 04), Varna, Bulgaria, 2004.

S. Androutsellis-Theotokis and D. Spinellis. A Survey of Peer-to-Peer Content Distribution Technologies. ACM Computing Surveys, 36(4):335–371, 2004.

M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R.J. Miller, and J. Mylopoulos. The Hyperion Project: From Data Integration to Data Coordination. SIGMOD Record, Special Issue on Peer-to-Peer Data Management, 32(3):53–58, 2003.

R.A. Baeza-Yates and B.A. Ribeiro-Neto. Modern Information Retrieval. ACM Press Series/Addison Wesley, New York, 1999.

C. Behrens and V. Kashyap. The “Emergent” Semantic Web: A Consensus Approach for Deriving Semantic Knowledge on the Web. In Proc. of the 1st Semantic Web Working Symposium (SWWS 2001), Stanford University, California, USA, 2001.

D. Bianchini, S. Castano, F. D’Antonio, V. De Antonellis, M. Harzallah, M. Missikoff, and S. Montanelli. Digital Resource Discovery: Semantic Annotation and Matchmaking Techniques. In Proc. of the Interoperability for Enterprise Software and Applications Conference (I-ESA 2006), Bordeaux, France, 2006.

S. Bloehdorn, P. Haase, M. Hefke, Y. Sure, and C. Tempich. Intelligent Community Lifecycle Support. In Proc. of the 5th Int. Conference on Knowledge Management (I-KNOW 05), Graz, Austria, 2005.

B. Bloom. Space/Time Tradeoffs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7):422–426, 1970.

M. Bonifacio, P. Bouquet, G. Mameli, and M. Nori. Peer-Mediated Distributed Knowledge Management. In Proc. of Int. Symposium on Agent Mediated Knowledge Management (AMKM 2003), Stanford, CA, USA, 2003.


N.T. Borch. Improving Semantic Routing Efficiency. In Proc. of the 2nd Int. Workshop on Hot Topics in Peer-to-Peer Systems (Hot P2P), La Jolla, CA, USA, 2005.

N.T. Borch and N.K. Vognild. Searching Variably Connected Networks. In Proc. of the Int. Conference on Pervasive Computing and Communications (PCC-04), Las Vegas, Nevada, USA, 2004.

D. Brickley and R.V. Guha (eds.). RDF Vocabulary Description Language 1.0: RDF Schema, 2003. World Wide Web Consortium (W3C), http://www.w3.org/TR/rdf-schema/.

J. Broekstra, M. Ehrig, P. Haase, F. van Harmelen, A. Kampman, M. Sabou, R. Siebes, S. Staab, H. Stuckenschmidt, and C. Tempich. A Metadata Model for Semantics-Based Peer-to-Peer Systems. In Proc. of the 1st WWW Int. Workshop on Semantics in Peer-to-Peer and Grid Computing (SemPGRID 2003), Budapest, Hungary, 2003.

J. Broekstra and A. Kampman. The SeRQL Query Language. Technical report, Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam, 2003. http://www.openrdf.org/doc/SeRQLmanual.html.

I. Brunkhorst, H. Dhraief, A. Kemper, W. Nejdl, and C. Wiesner. Distributed Queries and Query Optimization in Schema-Based P2P-Systems. In Proc. of the 1st Int. Workshop on Databases, Information Systems, and Peer-to-Peer Computing (DBISP2P-2003), Berlin, Germany, 2003.

S. Castano, V. De Antonellis, and S. De Capitani Di Vimercati. Global Viewing of Heterogeneous Data Sources. IEEE Transactions on Knowledge and Data Engineering, 13(2):277–297, 2001.

S. Castano, A. Ferrara, and G. Messa. ISLab HMatch Results for OAEI 2006. In Proc. of Int. Workshop on Ontology Matching, co-located with the 5th Int. Semantic Web Conference (ISWC 2006), Athens, Georgia, USA, 2006a.

S. Castano, A. Ferrara, and S. Montanelli. H-MATCH: an Algorithm for Dynamically Matching Ontologies in Peer-based Systems. In Proc. of the 1st VLDB Int. Workshop on Semantic Web and Databases (SWDB 2003), Berlin, Germany, 2003a.


S. Castano, A. Ferrara, and S. Montanelli. Ontologies and Matching Techniques for Peer-based Knowledge Sharing. In Short Paper Proc. of the 15th Conference On Advanced Information Systems Engineering (CAISE 2003), Klagenfurt/Velden, Austria, 2003b.

S. Castano, A. Ferrara, and S. Montanelli. The HELIOS Framework for Peer-based Knowledge Sharing and Evolution. In Proc. of the 11th Italian Symposium on Advanced Database Systems (SEBD 2003), Cetraro (CS), Italy, 2003c.

S. Castano, A. Ferrara, and S. Montanelli. Methods and Techniques for Ontology-based Semantic Interoperability in Networked Enterprise Contexts. In Proc. of the 1st CAiSE Workshop on Enterprise Modelling and Ontologies for Interoperability (EMOI-INTEROP 2004), Riga, Latvia, 2004a.

S. Castano, A. Ferrara, and S. Montanelli. Dynamic Configurability of a Semantic Matchmaker for Ontology-based Resource Discovery in Open Distributed Systems. In Proc. of the 2nd CAiSE Workshop on Enterprise Modelling and Ontologies for Interoperability (EMOI-INTEROP 2005), Luxembourg, 2005a.

S. Castano, A. Ferrara, and S. Montanelli. Ontology-based Interoperability Services for Semantic Collaboration in Open Networked Systems. In Proc. of the 1st Int. Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA 2005), Geneva, Switzerland, 2005b.

S. Castano, A. Ferrara, and S. Montanelli. A Matchmaking-based Ontology Evolution Methodology. In Proc. of the 3rd CAiSE Workshop on Enterprise Modelling and Ontologies for Interoperability (EMOI-INTEROP 2006), Porto, Portugal, 2006b.

S. Castano, A. Ferrara, and S. Montanelli. Evolving Open and Independent Ontologies. Journal of Metadata, Semantics and Ontologies, 2006c. To appear.

S. Castano, A. Ferrara, and S. Montanelli. Matching Ontologies in Open Networked Systems: Techniques and Applications. Journal on Data Semantics (JoDS), V, 2006d.

S. Castano, A. Ferrara, and S. Montanelli. Ontology Knowledge Spaces for Semantic Collaboration in Networked Enterprises. In Proc. of the 2nd Int. Workshop on Enterprise and Networked Enterprises Interoperability (ENEI 2006), Vienna, Austria, 2006e.

S. Castano, A. Ferrara, and S. Montanelli. Web Semantics and Ontology, chapter Dynamic Knowledge Discovery in Open, Distributed and Multi-Ontology Systems: Techniques and Applications. Idea Group, 2006f.

S. Castano, A. Ferrara, S. Montanelli, E. Pagani, G.P. Rossi, and S. Tebaldi. On Combining a Semantic Engine and Flexible Network Policies for P2P Knowledge Sharing Networks. In Proc. of the 1st DEXA Workshop on Grid and Peer-to-Peer Computing Impacts on Large Scale Heterogeneous Distributed Database Systems (GLOBE 2004), Zaragoza, Spain, 2004b.

S. Castano, A. Ferrara, S. Montanelli, E. Pagani, and G.P. Rossi. Ontology-Addressable Contents in P2P Networks. In Proc. of the 1st WWW Int. Workshop on Semantics in Peer-to-Peer and Grid Computing (SemPGRID 2003), Budapest, Hungary, 2003d.

S. Castano, A. Ferrara, S. Montanelli, and G. Racca. A Semantic Approach for Resource Discovery in Distributed Systems. In Proc. of 12th Italian Symposium on Advanced Database Systems (SEBD 2004), S. Margherita di Pula (CA), Italy, 2004c.

S. Castano, A. Ferrara, S. Montanelli, and G. Racca. From Surface to Intensive Matching of Semantic Web Ontologies. In Proc. of the 3rd DEXA Int. Workshop on Web Semantics (WEBS 2004), Zaragoza, Spain, 2004d.

S. Castano, A. Ferrara, S. Montanelli, and G. Racca. Matching Techniques for Resource Discovery in Distributed Systems Using Heterogeneous Ontology Descriptions. In Proc. of the Int. Conference on Coding and Computing (ITCC 2004), Las Vegas, Nevada, USA, 2004e.

S. Castano, A. Ferrara, S. Montanelli, and G. Racca. Semantic Information Interoperability in Open Networked Systems. In Proc. of the Int. Conference on Semantics of a Networked World (ICSNW), in cooperation with ACM SIGMOD 2004, Paris, France, 2004f.

S. Castano, A. Ferrara, S. Montanelli, and D. Zucchelli. HELIOS: a General Framework for Ontology-based Knowledge Sharing and Evolution in P2P Systems. In Proc. of the 2nd DEXA Int. Workshop on Web Semantics (WEBS 2003), Prague, Czech Republic, 2003e.

S. Castano and S. Montanelli. Semantic Self-Formation of Communities of Peers. InProc. of the ESWC Workshop on Ontologies in Peer-to-Peer Communities, Herak-lion, Greece, 2005.

S. Castano and S. Montanelli. Enforcing a Semantic Routing Mechanism based onPeer Context Matching. In Proc. of the 2nd Int. ECAI Workshop on Contexts andOntologies: Theory, Practice and Applications (C&O-2006), Riva del Garda (TN),Italy, 2006.

I. Clarke, O. Sandberg, B. Wiley, and T.W. Hong. Freenet: A Distributed AnonymousInformation Storage and Retrieval System. In Proc. of the Int. Workshop on DesignIssues in Anonymity and Unobservability, Berkeley, CA, USA, 2001.

A. Crespo and H. Garcia-Molina. Semantic Overlay Networks for P2P Systems. InProc. of the 3rd Int. Workshop on Agents and Peer-to-Peer Computing (P2PC 2004),New York, NY, USA, 2004.

F.M. Cuenca-Acuna, C. Peery, R.P. Martin, and T.D. Nguyen. PlanetP: Using Gossip-ing to Build Content Addressable Peer-to-Peer Information Sharing Communities.In Proc. of the 12th Int. Symposium on High-Performance Distributed Computing(HPDC-12 2003), Seattle, WA, USA, 2003.

N. Daswani, H. Garcia-Molina, and B. Yang. Open Problems in Data-Sharing Peer-to-Peer Systems. In Proc. of the 9th Int. Conference on Database Theory - ICDT 2003,Siena, Italy, 2003.

H. Do and E. Rahm. COMA - A System for Flexible Combination of Schema MatchingApproaches. In Proc. of 28th Int. Conference on Very Large Databases (VLDB2002), Hong Kong, China, 2002.

A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to Map between On-tologies on the Semantic Web. In Proc. of the 11th Int. World Wide Web Conference(WWW 2002), Honolulu, Hawaii, USA, 2002.

M. Ehrig and S. Staab. QOM - Quick Ontology Mapping. In Proc. of the 3rd Int. Semantic Web Conference (ISWC 2004), Hiroshima, Japan, 2004.

M. Ehrig and Y. Sure. Ontology Mapping - An Integrated Approach. In Proc. of the 1st European Semantic Web Symposium, Heraklion, Greece, 2004.

Esteem. The Esteem Project Website, 2006. http://www.dis.uniroma1.it/~esteem/.

J. Euzenat, D. Loup, M. Touzani, and P. Valtchev. Ontology Alignment with OLA. In Proc. of the 3rd ISWC Workshop on Evaluation of Ontology-based Tools (EON 2004), Hiroshima, Japan, 2004.

A. Ferrara. Matching of Independent Ontologies in Open Networked Systems: Methods and Techniques. PhD thesis, Università degli Studi di Milano, 2006.

G.W. Flake, S. Lawrence, C.L. Giles, and F.M. Coetzee. Self-Organization and Identification of Web Communities. IEEE Computer, 35(3):66–70, 2002.

Freenet. The Freenet Website, 2000. http://freenet.sourceforge.net.

F. Giunchiglia and P. Shvaiko. Semantic Matching. The Knowledge Engineering Review, 18(3):265–280, 2003.

Gnutella. The Gnutella Website, 2001. http://www.gnutella.com.

Gnutella Protocol. The Gnutella Protocol Specification v0.4, 2001. http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf.

A. Gomez-Perez, M. Fernandez-Lopez, and O. Corcho. Ontological Engineering. Springer Verlag, 2003.

L. Gong. Project JXTA: A Technology Overview. Technical report, Sun Microsystems, Inc., 2002.

S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What Can Databases Do for Peer-to-Peer? In Proc. of the 4th Int. Workshop on the Web and Databases (WebDB-2001), in conjunction with ACM PODS/SIGMOD 2001, Santa Barbara, California, USA, 2001.

H-Match. The H-MATCH Website, 2006. http://islab.dico.unimi.it/hmatch/.

P. Haase, R. Siebes, and F. van Harmelen. Peer Selection in Peer-to-Peer Networks with Semantic Topologies. In Proc. of the Int. Conference on Semantics of a Networked World (ICSNW), in cooperation with ACM SIGMOD 2004, Paris, France, 2004.

A. Halevy, Z. Ives, J. Madhavan, P. Mork, D. Suciu, and I. Tatarinov. The Piazza Peer Data Management System. IEEE Transactions on Knowledge and Data Engineering, 16(7):787–798, 2004.

Helios. The HELIOS Project Website, 2003. http://islab.dico.unimi.it/helios/.

R. Huebsch, J.M. Hellerstein, N. Lanham, B. Thau Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In Proc. of the 29th Int. Conference on Very Large Data Bases (VLDB 2003), Berlin, Germany, 2003.

M.N. Huhns, U. Mukhopadhyay, L.M. Stephens, and R.D. Bonnell. An Intelligent System for Document Retrieval in Distributed Office Environments. Journal of the American Society for Information Science, 37(3):123–135, 1986.

A. Iamnitchi, M. Ripeanu, and I.T. Foster. Locating Data in (Small-World?) Peer-to-Peer Scientific Collaborations. In Proc. of the 1st Int. Workshop on Peer-to-Peer Systems IPTPS 2002, pages 232–241, Cambridge, MA, USA, 2002.

M. Jelasity and O. Babaoglu. T-Man: Gossip-Based Overlay Topology Management. In Proc. of the 3rd Int. Workshop on Engineering Self-Organising Systems (ESOA 2005), Utrecht, The Netherlands, 2005.

S. Joseph. NeuroGrid: Semantically Routing Queries in Peer-to-Peer Networks. In Proc. of the Int. Workshop on Peer-to-Peer Computing (co-located with Networking 2002), Pisa, Italy, 2002.

Jxta. The JXTA Website, 2006. http://www.jxta.org.

Y. Kalfoglou and M. Schorlemmer. Ontology Mapping: the State of the Art. The Knowledge Engineering Review, 18(1), 2003.

M. Karnstedt, K. Hose, and K.-U. Sattler. Query Routing and Processing in Schema-Based P2P Systems. In Proc. of the 1st DEXA Workshop on Grid and Peer-to-Peer Computing Impacts on Large Scale Heterogeneous Distributed Database Systems (GLOBE 2004), Zaragoza, Spain, 2004.

Kazaa. The Kazaa Website, 2002. http://www.kazaa.com.

M. Khambatti, K. Dong Ryu, and P. Dasgupta. Efficient Discovery of Implicitly Formed Peer-to-Peer Communities. International Journal of Parallel and Distributed Systems and Networks, 5(4):155–164, 2002.

M. Khambatti, K. Dong Ryu, and P. Dasgupta. Push-Pull Gossiping for Information Sharing in Peer-to-Peer Communities. In Proc. of the Int. Conference on Parallel and Distributed Processing Techniques and Applications - PDPTA, Las Vegas, Nevada, USA, 2003.

R. Korfhage. Information Storage and Retrieval. John Wiley, 1997.

O. Lassila and R. Swick (eds.). W3C Resource Description Framework Model and Syntax Specification, 1999. World Wide Web Consortium (W3C), http://www.w3.org/TR/REC-rdf-syntax/.

M. Lenzerini. Data Integration: A Theoretical Perspective. In Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002), Madison, Wisconsin, USA, 2002. Invited tutorial.

M. Lenzerini. Principles of P2P Data Integration. In Proc. of the 3rd Int. CAiSE Workshop on Data Integration over the Web (DIWeb2004), Riga, Latvia, 2004. Invited speaker.

C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J.D. Ullman, and M. Valiveti. Capability Based Mediation in TSIMMIS. In ACM SIGMOD Int. Conference on Management of Data, Seattle, Washington, USA, 1998.

J. Li and S. Vuong. A Scalable Semantic Routing Architecture for Grid Resource Discovery. In Proc. of the 11th Int. Conference on Parallel and Distributed Systems, pages 29–35, Fukuoka, Japan, 2005.

LimeWire. The LimeWire Website, 2006. http://www.limewire.com/.

Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and Replication in Unstructured Peer-to-peer Networks. In Proc. of the Int. Conference on Supercomputing, New York City, NY, USA, 2002.

F. Mandreoli, R. Martoglia, W. Penzo, and S. Sassatelli. SRI: Exploiting Semantic Information for Effective Query Routing in a PDMS. In Proc. of the 8th ACM Int. Workshop on Web Information and Data Management (WIDM 2006), Arlington, Virginia, USA, 2006.

S. Marti and H. Garcia-Molina. Limited Reputation Sharing in P2P Systems. In Proc. of the 5th ACM Conference on Electronic Commerce ACM-EC 2004, New York, NY, USA, 2004.

P. Mika. Social Networks and the Semantic Web. In Proc. of the IEEE/WIC/ACM Int. Conference Web Intelligence (WI’04), Beijing, China, 2004.

G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM (CACM), 38(11):39–41, 1995.

J. Mitre and L. Navarro-Moldes. P2P Architecture for Scientific Collaboration. In Proc. of the 13th Int. Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises WETICE’04, pages 95–100, Modena, Italy, 2004. IEEE Computer Society.

S. Montanelli. Emergent Communities for Semantic Collaboration in Multi-Knowledge Environments: Methods and Techniques. In Proc. of the CAiSE Doctoral Consortium, Luxembourg, 2006.

Morpheus. The Morpheus Website, 2006. http://www.morpheus.com.

Napster. The Napster Website, 2003. http://www.napster.com.

W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer, and T. Risch. EDUTELLA: a P2P Networking Infrastructure Based on RDF. In Proc. of the 11th Int. World Wide Web Conference (WWW 2002), Honolulu, Hawaii, USA, 2002.

W. Nejdl, W. Siberski, and M. Sintek. Design Issues and Challenges for RDF- and Schema-based Peer-to-Peer Systems. ACM SIGMOD Record, 32(3):41–46, 2003.

Neurogrid. The Neurogrid Website, 2001. http://www.neurogrid.net/.

N. F. Noy. Semantic Integration: a Survey of Ontology-based Approaches. SIGMOD Record Special Issue on Semantic Integration, December 2004.

N. F. Noy and M. A. Musen. The PROMPT Suite: Interactive Tools For Ontology Merging And Mapping. International Journal of Human-Computer Studies, 59(6):983–1024, 2003.

OpenNap. The OpenNap Website, 2001. http://opennap.sourceforge.net/.

M. Portmann, P. Sookavatana, S. Ardon, and A. Seneviratne. The Cost of Peer Discovery and Searching in the Gnutella Peer-to-Peer File Sharing Protocol. In Proc. of the 9th IEEE Int. Conference on Networks (ICON 2001), pages 263–268, Bangkok, Thailand, 2001.

M.T. Prinkey. An Efficient Scheme for Query Processing on Peer-to-Peer Networks. Technical report, Aeolus Research, Inc., 2001.

E. Prud'hommeaux and A. Seaborne (eds.). SPARQL Query Language for RDF, 2006. World Wide Web Consortium (W3C) Recommendation, http://www.w3.org/TR/rdf-sparql-query/.

M.K. Ramanathan, V. Kalogeraki, and J. Pruyne. Finding Good Peers in Peer-to-Peer Networks. In IEEE Int. Parallel and Distributed Processing Symposium, Fort Lauderdale, FL, USA, 2002.

S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content-Addressable Network. In Proc. of the ACM SIGCOMM 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, San Diego, CA, USA, 2001. ACM Press.

E. Rich and K. Knight. Artificial Intelligence. McGraw-Hill Inc., 1991.

A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems. In Proc. of the 18th IFIP/ACM Int. Conference on Distributed Systems Platforms (Middleware 2001), Heidelberg, Germany, 2001.

G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.

M.T. Schlosser, T.E. Condie, and S.D. Kamvar. Simulating a P2P File-Sharing Network. In Proc. of the 1st WWW Int. Workshop on Semantics in Peer-to-Peer and Grid Computing (SemPGRID 2003), Budapest, Hungary, 2003.

M.T. Schlosser, M. Sintek, S. Decker, and W. Nejdl. HyperCuP - Hypercubes, Ontologies, and Efficient Search on Peer-to-Peer Networks. In Proc. of the 1st Int. Workshop on Agents and Peer-to-Peer Computing (AP2PC 2002), Bologna, Italy, 2002.

A. Seaborne. RDQL - A Query Language for RDF, 2004. World Wide Web Consortium (W3C) Member Submission, http://www.w3.org/Submission/RDQL/.

P. Shvaiko. Ontology Matching Projects, 2006. http://www.ontologymatching.org/projects.html.

P. Shvaiko and J. Euzenat. A Survey of Schema-based Matching Approaches. Journal on Data Semantics (JoDS), IV, 2005.

E. Sidirourgos, G. Kokkinidis, and T. Dalamagas. Efficient Query Routing in RDF/S Schema-based P2P System. In Proc. of the 4th Hellenic Data Management Symposium (HDMS’05), Athens, Greece, 2005.

N. Silva, J. Rocha, and J. Cardoso. E-Business Interoperability Through Ontology Semantic Mapping. In Proc. of the PRO-VE Working Conference, Lugano, Switzerland, 2003.

M. Sintek and S. Decker. TRIPLE - A Query, Inference, and Transformation Language for the Semantic Web. In Proc. of the 1st Int. Semantic Web Conference, Chia (CA), Italy, 2002.

W. Siong Ng, B. Chin Ooi, K.-L. Tan, and A. Ying Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. In Proc. of the 19th Int. Conference on Data Engineering (ICDE 2003), Bangalore, India, 2003.

M. K. Smith, C. Welty, and D. L. McGuinness (eds.). OWL Web Ontology Language Guide, 2004. World Wide Web Consortium (W3C), http://www.w3.org/TR/owl-guide/.

K. Sripanidkulchai, B.M. Maggs, and H. Zhang. Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems. In Proc. of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE INFOCOM 2003), San Francisco, CA, USA, 2003.

S. Staab. Topic Communities in Peer to Peer Networks. In Proc. of the 1st ESWC Workshop on Semantic Network Analysis (SNA’06), Budva, Montenegro, 2006. Invited speaker.

S. Staab, C. Tempich, and A. Wranik. REMINDIN’: Semantic Query Routing in Peer-to-Peer Networks based on Social Metaphors. In Proc. of the 13th Int. Conference on World Wide Web (WWW 2004), New York, NY, USA, 2004.

I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a Scalable Peer-to-Peer Lookup Protocol for Internet Applications. IEEE/ACM Transactions on Networking, 11(1):17–32, 2003.

JXTA Developer Team. Project JXTA: An Open, Innovative Collaboration. Technical report, Sun Microsystems, Inc., 2001.

I. Witten and E. Frank. Data Mining. Morgan Kaufmann Publishers, 1999.

L. Xiong and L. Liu. PeerTrust: Supporting Reputation-based Trust for Peer-to-Peer Electronic Communities. IEEE Transactions on Knowledge and Data Engineering, 16(7):843–857, 2004.

B. Yang and H. Garcia-Molina. Comparing Hybrid Peer-to-Peer Systems. In Proc. of the 27th Int. Conference on Very Large Data Bases (VLDB 2001), pages 561–570, Roma, Italy, 2001.

B. Yang and H. Garcia-Molina. Improving Search in Peer-to-Peer Networks. In Proc. of the 22nd Int. Conference on Distributed Computing Systems (ICDCS 2002), Vienna, Austria, 2002.

B. Yang and H. Garcia-Molina. Designing a Super-Peer Network. In Proc. of the 19th Int. Conference on Data Engineering (ICDE 2003), Bangalore, India, 2003.

P. Yolum and M.P. Singh. Dynamic Communities in Referral Networks. Web Intelligence and Agent Systems, 1(2):105–116, 2003a.

P. Yolum and M.P. Singh. Emergent Properties of Referral Systems. In Proc. of the 2nd Int. Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-03), pages 592–599, New York, NY, USA, 2003b.

D. Zeinalipour-Yazti, V. Kalogeraki, and D. Gunopulos. Exploiting Locality for Scalable Information Retrieval in Peer-to-Peer Networks. Information Systems, 30(4):277–298, 2005.

H. Zhuge, J. Liu, L. Feng, and C. He. Semantic-Based Query Routing and Heterogeneous Data Integration in Peer-to-Peer Semantic Link Networks. In Proc. of the Int. Conference on Semantics of a Networked World (ICSNW), in cooperation with ACM SIGMOD 2004, pages 91–107, Paris, France, 2004.
