ABSTRACT

The developed system manages data in peer databases to answer queries

that involve several databases within its scope. Data integration tools can be used for the same

purpose, but they suffer from two problems: they require comprehensive schema design

before they can be used (an upfront overhead), and they are difficult to extend because

extensions typically break backward compatibility. The developed system uses coordination rules

(mappings) and the Query Reformulation Algorithm to manage data in the peers and provide

the required data to the end user. The algorithm and coordination rules also make the

system feasible and easily extensible.

TABLE OF CONTENTS

Abstract

Table of contents

List of Figures

1. Background and Rationale

1.1. Introduction

1.2. Existing Methods

1.2.1. File based P2P Systems

1.2.2. Mediator based Integration - GAV and LAV

1.2.3. Peer to Peer Integration - Introducing PeerDB, Hyperion and Piazza

1.3. Advantages of the New System

2. Narrative

2.1. A Simple Peer to Peer System

2.2. Peer to Peer Data Placement Problem

2.3. Data Placement Design Choices

2.3.1. Scope of Decision Making

2.3.2. Extent of Knowledge Sharing

2.3.3. Heterogeneity of Information Sources

2.3.4. Dynamicity of Participants

2.4. How Piazza Works

2.4.1. Query Optimization Exploiting Commonalities and Available Data

2.4.2. Propagating Information about Materialized Views

2.4.3. Consolidating Query Evaluation and Data Placement

2.4.4. Schema Mediation in Piazza

2.4.5. Schema Mediation in Piazza

2.5. JXTA

2.5.1. JXTA Jorgan

3. System Design

3.1. System Requirements

3.2. Piazza Algorithm

3.3. P2P Database and Coordination Rules

3.4. Workflow

4. Evaluation and Results

4.1. Evaluation

4.2. Results

5. Future Work

6. Conclusion

7. Bibliography and References

8. Appendix

LIST OF FIGURES

Figure 2.1 P2P Architecture: Logical view

Figure 2.2 Piazza System Architecture

Figure 3.2.1 Flow chart of the Query Reformulation Algorithm

Figure 3.2.2 Rule-Goal Tree

Figure 3.3.1 Topology

Figure 3.4.1 Status Window

Figure 3.4.2 Interface for Each Node (batch file)

Figure 3.4.3 Status during Coordination Rules Announcement

Figure 3.4.4 Peers are Ready

Figure 3.4.5 Execution of Query 1

Figure 3.4.5 Execution of Query 2

Table 4.2.1 Results

1. BACKGROUND AND RATIONALE

1.1 Introduction

Users are equipped to access a multitude of data sources that are related in some

way and to combine the returned data to come up with useful information which is not

physically stored in a single place. For instance, a person who has the intention of buying

a car can query several car dealer Web sites and then compare the results. He can further

query a data source which provides information about car reviews to help his decision

about the cars he liked. As another example, imagine a company which has several

branches in different cities. Each branch has its own local database recording its sales.

Whenever global decisions about the company have to be made, each branch database

must be queried and the results must be combined. However, contacting data

sources individually and then combining the results manually every time information is

needed is a very tedious task.

Instead, a service is needed which provides transparent access to a collection of

related data sources as if these sources as a whole constituted a single data source. Such a

service is called a data integration service and the system that integrates multiple sources

to provide this service is usually referred to as a data integration system. The main

contribution of a data integration system is that users can focus on specifying what data

they want rather than on describing how to obtain it. A data integration system relieves

the user from the burden of finding the relevant data sources, interacting with each of

them separately, then combining the data they return. To achieve this, the system

provides an integrated view of the data stored in the underlying data sources. Users can

uniformly access all the data sources as if they were querying a single data source.

Similarly, environmental, hydrographic, meteorological and oceanographic data have

been collected and made available by numerous local, state and federal agencies as well

as by universities. Currently, users have to interact manually with these large collections

of Internet data sources, determine which ones to access and how to access them, and

merge the results by hand, which is a tedious and cumbersome process.

A data integration system is therefore required to answer such queries. Some

examples of areas in which integration is particularly useful are: science and culture

(integrating genomic data, monitoring events in the sky, the Puget Sound Regional Synthesis

Model); enterprise data integration; and the World Wide Web (XML integration, comparison

shopping).

The medical system in India is not well organized when it comes to rendering

services to people who live below the poverty line (BPL). Most people living

in villages are deprived of advanced medical technology, mainly because the required

help is not delivered promptly. Villagers have to travel to

cities for medical checkups or blood tests, which costs money, time and effort. If

the government provides an adequate transportation system, data

integration approaches can help meet people's urgent needs immediately. For

instance, if an emergency station is set up in the city and its number is made known to all,

any serious incident can be reported to it from any village, and the service can take care of

the patient. All the required data, such as data from hospitals, clinical laboratories (for blood tests) and fire

stations, can be made accessible at the emergency station, so that staff can decide

which hospital the patient should be taken to. If a fire breaks out at any location, the

emergency station service can pick up the victims and place them in nearby

hospitals, besides sending firemen to control the fire, simply by consulting the data available

from the different sources at the station. This project has been developed to address this

problem and to ease it at least to a certain extent.

1.2 Existing Methods

A long-standing tenet of distributed systems is that the strength of a distributed

system can grow as more hosts participate in it. Each participant may contribute data and

computing resources (such as unused CPU cycles and storage) to the overall system, and

the wealth of the community can scale with the number of participants. A peer-to-peer

(P2P) distributed system is one in which participants rely on one another for service, rather

than solely relying on dedicated and often centralized infrastructure. Instead of strictly

decomposing the system into clients (which consume services) and servers (which provide

them), peers in the system can elect to provide services as well as consume them. The

membership of a P2P system is relatively unpredictable: service is provided by the peers

that happen to be participating at any given time [Rabinovich 1998]. At first glance, many

of the challenges in designing P2P systems seem to fall clearly under the banner of the

distributed systems community. However, upon closer examination, the fundamental

problem in most P2P systems is the placement and retrieval of data. Indeed, current P2P

systems focus strictly on handling semantics-free, large-granularity requests for objects by

identifier (typically a name), which both limits their utility and restricts the techniques that

might be employed to distribute the data. Most integration techniques in current use

fall into three categories: content- or file-based integration, where

peers communicate through file sharing; mediator-based integration, where

global and local schemas are defined in terms of one another to exchange

data; and P2P integration using query reformulation algorithms, as in

Piazza [Halevy 2003] and PeerDB [Ives 2000], which preserves peer

autonomy while providing data coordination among peers.

1.2.1 File Based P2P Systems

Many examples of P2P systems have emerged recently, most of which are wide-

area, large-scale systems that provide content sharing [Napster 2001], storage services

[Kubiatowicz 2000], or distributed “grid” computation [Legion 2000]. Smaller-scale P2P

systems also exist, such as federated, serverless file systems and collaborative workgroup

tools. The success of these systems has been mixed; some, such as Napster, have enjoyed

enormous popularity and perform well at scale.

Others, including Gnutella, have failed to attract a large community, possibly due

to a combination of weak application semantics and technical flaws that limit their scaling.

Perhaps the most exciting possibility of peer-to-peer computing is that the desirable

properties of the system can become amplified as new peers join: because of its

decentralization, the system’s robustness, availability, and performance might grow with

the number of peers. A more subtle possibility is that the richness and diversity of the

system can similarly scale, since new peers can introduce specialized data or resources that

the system was previously lacking. Decentralization also helps eliminate proprietary

interests in the system’s infrastructure; instead of trust being placed in dedicated servers,

trust is diffused over all participants in the system. The need for administration is

diminished, since there is no dedicated infrastructure to manage. By routing requests

through many peers and replicating content, the system might be able to hide the identity of

content publishers and consumers, making it resilient against censorship.

Although the vision of P2P systems is grand, the technical challenges associated

with them are immense, and as a result the realization of the vision has been elusive.

Because the membership in the system is ad-hoc and dynamic, it is very difficult to predict

or reason about the location and quality of the system’s resources. For example, the

placement of data in content-sharing systems is often naive: data placement is largely

demand driven, with little regard given to network bandwidth, load, or historical

trustworthiness of the peer on which the data is placed. Because the system is

decentralized, any optimizations such as data placement must be done in a completely

distributed manner; the system cannot necessarily presume the existence of a single oracle

that coordinates the activity of all of the systems’ peers [Siong 2003]. Furthermore, the

dynamic nature of the system may impose fundamental limitations on its data consistency

and availability: if the rate at which data changes in the system is high, then the overhead

of maintaining globally accessible indexes may become prohibitive as the number of peers

in the system grows. Because P2P systems designers have to a large extent failed to

overcome these challenges, the semantics provided by these systems is typically quite

weak. In most content sharing systems, only popular content is readily accessible - yet

content popularity seems to follow Zipf-like distributions, in which a large fraction of requests

are directed to unpopular content. Similarly, current content sharing systems ignore

problems such as updates to content, and they typically only support retrieval of objects by

name. These current content sharing systems are largely limited to applications in which

objects are large, opaque, and atomic, and whose content is well-described by their name;

for instance, today’s P2P systems would be highly ineffective at content-based retrieval of

text files or at fetching only the abstracts from a set of LaTeX documents. Moreover, they

are limited to caching, pre-fetching, or pushing of content at the object level, and know

nothing of overlap between objects. These limitations arise because the P2P world is

lacking in the areas of semantics, data transformation, and data relationships, yet these are

some of the core strengths of the data management community.

Queries, views, and integrity constraints can be used to express relationships

between existing objects and to define new objects in terms of old ones. Complex queries

can be posed across multiple sources, and the results of one query can be materialized and

used to answer other queries. Data management techniques such as these can be used to

develop better solutions to the data placement problem at the heart of any P2P system

design: data must be placed in strategic locations and then used to improve query

performance. The database field will benefit from the results, as new query processing

systems can leverage the increased scalability, reliability, and performance of a successful

P2P architecture [Doan 2002].

1.2.2 Mediator Based Integration - GAV and LAV

In recent years, there has been research on developing tools that facilitate the

rapid integration of heterogeneous information sources that may include both structured

and unstructured data. A common problem facing many organizations today is that of

multiple, disparate, object stores, knowledge bases, file systems, digital libraries,

information retrieval systems, and electronic mail systems. Decision makers often need

information from multiple sources, but are unable to get and use the required information in

a timely fashion due to the difficulties of accessing the different systems, and due to the

fact that the information obtained can be inconsistent and contradictory. There are basically

two approaches for designing a data integration system. In the global-as-view approach,

one defines the concepts in the global schema as views over the sources, whereas in the

local-as-view approach, one characterizes the sources as views over the global schema.

The recent trend in data integration has been to loosen the coupling between data.

Here the idea is to provide a uniform query interface over a mediated schema. This query is

then transformed into specialized queries over the original databases. This process can also

be called view-based query answering because we can consider each of the data sources

to be a view over the (nonexistent) mediated schema. Formally such an approach is called

Local As View (LAV) — where "Local" refers to the local sources/databases. An alternate

model of integration is one where the mediated schema is designed to be a view over the

sources. This approach called Global As View (GAV) — where "Global" refers to the

global (mediated) schema — is often used due to the simplicity involved in answering

queries issued over the mediated schema. However, the obvious drawback is the need to

rewrite the view for mediated schema whenever a new source is to be integrated and/or an

existing source changes its schema.

Data integration systems are formally defined as a triple (G, S, M) where G is the

global (or mediated) schema, S is the set of heterogeneous source schemas, and M is the

mapping that maps queries between the source and the global schemas. Both G and S are

expressed in languages over alphabets comprised of symbols for each of their respective

relations. The mapping M consists of assertions between queries over G and queries over S.

In GAV, the global schema is modeled as a set of views over S. In this case M associates to

each element of G a query over S. Query processing becomes a straightforward operation

because the associations between G and S are well-defined. The burden of complexity is

placed on implementing mediator code instructing the data integration system exactly how

to retrieve elements from the source databases. If any new sources are added to the system,

considerable effort may be necessary to update the mediator, thus the GAV approach

should be favored in cases where the sources are not likely to change. In LAV, the source

database is modeled as a set of views over G. In this case M associates to each element of S

a query over G. Here the exact associations between G and S are no longer well-defined.

As is illustrated in the next section, the burden of determining how to retrieve elements

from the sources is placed on the query processor. The benefit of an LAV modeling is that

new sources can be added with far less work than in a GAV system, thus the LAV

approach should be favored in cases where the mediated schema is not likely to change.
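
The GAV case can be illustrated with a minimal sketch. It assumes a toy setting that is not part of the system described in this report: two hypothetical source relations, `s1_cars` and `s2_cars`, and one global relation `Car` defined as a union of views over them. Because M associates each element of G with a query over S, answering a query over G reduces to evaluating the stored view definition (unfolding) and then filtering.

```python
# Toy GAV setting (all names are hypothetical, for illustration only).
# Source schemas S: each source is a list of row dictionaries.
SOURCES = {
    "s1_cars": [{"vin": "A1", "model": "Civic", "price": 18000}],
    "s2_cars": [{"id": "B7", "name": "Corolla", "cost": 17500}],
}


def car_view(sources):
    """Global relation Car(vin, model, price) as a union over both sources."""
    rows = [dict(r) for r in sources["s1_cars"]]
    rows += [
        {"vin": r["id"], "model": r["name"], "price": r["cost"]}
        for r in sources["s2_cars"]
    ]
    return rows


# Mapping M: each global relation in G is associated with a query over S.
GAV_MAPPING = {"Car": car_view}


def answer(global_relation, predicate):
    """Answer a selection over G by unfolding the view, then filtering."""
    rows = GAV_MAPPING[global_relation](SOURCES)
    return [r for r in rows if predicate(r)]


cheap = answer("Car", lambda r: r["price"] < 18000)
```

In this sketch, adding a third source would mean editing `car_view`, which is the maintenance cost of GAV noted above. Under LAV, each source would instead carry its own description as a view over G, and the query processor would have to work out how to combine them.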

Modeling websites often requires the combined expressive power of GAV and LAV. Hence

GLAV was developed, a language for source description that is more expressive than

GAV and LAV combined. Query answering for GLAV sources is no harder than it is for

LAV sources. GLAV reaches the limits on the expressive power of a data source

description language. GLAV is also of interest for data integration independent of data

webs, because of the flexibility it provides in integrating diverse sources.

1.2.3 Peer to Peer Integration - Introducing PeerDB, Hyperion and Piazza

In current data sharing P2P systems, only file-system-like capabilities are

provided while the semantics of data is largely ignored. For example, in Gnutella, queries

are restricted to strings that can be contained in a filename and directory path, that is,

only simple value searches on file names are supported. Although a peer-based data management

system can be seen as a distributed and heterogeneous database system, the scale of the

system and its dynamism as nodes join and leave the network pose several major

challenges [Napster 2001]. First, there is no predefined global schema. With nodes

joining and leaving the network at any time, assuming a global schema in such a dynamic

environment is not practical, scalable, or extensible. One possible approach is

to perform “mapping” on-the-fly during querying. Second, realizing efficient query

processing becomes more difficult. Initial response time is expected to be high as relevant

data have to be identified before any optimization and query processing can be

performed. Third, much information redundancy exists in the network, which inevitably

brings about data and computation redundancy. Unfortunately, information redundancy

cannot be avoided unless some control over data placement is taken. Finally, the notions

of correctness and completeness of query results cannot be used in their pure meaning as

in traditional database systems.

PeerDB is a P2P based system for distributed data sharing. PeerDB has several

distinguishing features. First, each participating node is a full-fledged object management

system that supports content-based search. Second, in PeerDB, users can share data

without a shared global schema. Third, PeerDB adopts mobile agents to assist in query

processing. Since agents can perform operations at the peers’ sites, the network

bandwidth is better utilized. More importantly, agents can be coded to perform a wide

variety of tasks, making it easy to extend the capabilities of a PeerDB node [Ives 2000].

Another architecture for peer database management systems (PDBMS), called

Hyperion, instantiates the vision of logical P2P data coordination.

A PDBMS is a conventional DBMS augmented with a P2P interoperability

layer. This layer implements the functionality required for peers to share and coordinate

data without compromising their own autonomy. The P2P layer allows a PDBMS to

establish or abolish an acquaintance (semi-)automatically at runtime, thereby inducing a

logical peer-to-peer network. The two important aspects of this system are data

coordination, in which each source behaves as an access point for both local and shared

data, and data sharing both within and across domains. Views and GLAV

(global-and-local-as-view) mappings are used to integrate and exchange data within a common

domain.

Piazza is another peer system; it is predominantly ontology-based and has not been

completely implemented. Piazza paves the way for a fruitful combination of

data management and knowledge representation techniques in the construction of the

semantic web [Halevy 2003]. The techniques offered in Piazza are not a

replacement for rich ontologies and languages for mapping between ontologies; rather, they

provide the missing link between data described using rich ontologies and the wealth of

data that is currently managed by a variety of tools. In order to exploit data from other

sites, there must be semantic glue between the sites, in the form of semantic mappings.

Mappings in Piazza are specified between small numbers of sites, usually pairs. In this

way, it is possible to support the two rather different methods for semantic mediation -

mediated mapping, where data sources are related through a mediated schema or

ontology, and point-to-point mappings, where data is described by how it can be

translated to conform to the schema of another site.

1.3 Advantages of the New System

The ultimate goal of Piazza is to provide query answering and translation across the

full range of data. Logically, a Piazza system consists of a network of different sites (also

referred to as peers or nodes), each of which contributes resources to the overall system.

The resources contributed by a site include one or more of the following: (1) ground or

extensional data, (2) models of data. In addition, nodes may supply computed data, i.e.,

cached answers to queries posed over other nodes. When a new site (with data instance or

schema) is added to the system, it is semantically related to some portion of the existing

network. Queries in Piazza are always posed from the perspective of a given site's schema,

which defines the preferred terminology of the user. When a query is posed, Piazza

provides answers that utilize all semantically related data within the system [Halevy 2003].

In order to exploit data from other sites, there must be coordination rules between

the sites, in the form of mappings. Mappings in Piazza are specified between small

numbers of sites, usually pairs. In this way, it is possible to support the two rather different

methods for schema mediation mentioned earlier: mediated mapping, where data sources

are related through a mediated schema or ontology, and point-to-point mappings, where

data is described by how it can be translated to conform to the schema of another site.

Admittedly, from a formal perspective, there is little difference between these two kinds of

mappings, but in practice, content providers may have strong preferences for one or the

other.
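
To illustrate why mappings between pairs of sites can be enough to relate distant peers, the sketch below treats each mapping as a plain attribute renaming between two peers and finds a chain of mappings by breadth-first search. The peer names, the attribute names, and the restriction to simple renamings are assumptions made for illustration; actual Piazza mappings are richer query expressions.

```python
from collections import deque

# Hypothetical pairwise mappings: (from_peer, to_peer) -> attribute renaming.
MAPPINGS = {
    ("hospital", "lab"): {"patient_id": "pid", "blood_group": "btype"},
    ("lab", "emergency"): {"pid": "case_id", "btype": "blood"},
}


def find_path(src, dst):
    """Breadth-first search for a chain of pairwise mappings from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for (a, b) in MAPPINGS:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


def translate(attrs, src, dst):
    """Rewrite attribute names from src's schema into dst's schema."""
    path = find_path(src, dst)
    if path is None:
        raise ValueError(f"no mapping chain from {src} to {dst}")
    # Apply each pairwise renaming along the chain in order.
    for a, b in zip(path, path[1:]):
        renaming = MAPPINGS[(a, b)]
        attrs = [renaming[x] for x in attrs]
    return attrs


result = translate(["patient_id", "blood_group"], "hospital", "emergency")
```

No mapping is declared directly between `hospital` and `emergency` here; the translation is obtained by composing the two pairwise mappings through `lab`.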

2. NARRATIVE

The goal of peer-to-peer data integration using semantic rules is to address this

need: the use of a decentralized, easily extensible data management architecture in which

any user can contribute new data, schema information, or even mappings between other

peers’ schemas.

2.1 A Simple Peer to Peer System

A peer to peer (or P2P) computer network uses diverse connectivity between

participants in a network and the cumulative bandwidth of network participants rather than

conventional centralized resources where a relatively low number of servers provide the

core value to a service or application. P2P networks are typically used for connecting nodes

via largely ad hoc connections. Such networks are useful for many purposes. Sharing

content files containing audio, video, data or anything in digital format is very common,

and realtime data, such as telephony traffic, is also passed using P2P technology. The

concept of P2P is increasingly evolving to an expanded usage as the relational dynamic

active in distributed networks, i.e. not just computer to computer, but human to human.

Yochai Benkler has coined the term "commons-based peer production" to denote

collaborative projects such as free software. Associated with peer production are the

concepts of peer governance (referring to the manner in which peer production projects are

managed) and peer property [Franconi 2003]. A logical view of P2P architecture is shown

in Figure 2.1.

Figure 2.1 P2P architecture: logical view.

2.2 Peer to Peer Data Placement Problem

The data placement problem for a P2P system is as follows. Assume we are given

a set of cooperating nodes connected by a network (typically, but not necessarily, the

Internet) that has limited bandwidth on each link. Nodes know about and exchange data

with a collection of participating peers, and they may serve any or all of four roles [Suciu

2003]. The first of these is a data origin, which provides original content to the system and

is the authoritative source of that data. As a storage provider, a peer stores materialized

views (consuming disk resources, and perhaps replacing previously materialized views if

there is insufficient space), and as a query evaluator, it uses a portion of its CPU resources

to evaluate the set of queries forming its workload. As query initiators, peers act as clients

in the system and pose new queries. (A node may initiate new queries on behalf of a query

it is attempting to evaluate.) The overall cost of answering a query includes the transfer

cost from the storage provider or data origin to the query evaluator, the cost of resources

utilized at the query evaluator and other nodes, and the cost to transfer the results to the

query initiator. The data placement problem is to distribute data and work so the full query

workload is answered with lowest cost under the existing resource and bandwidth

constraints. While a cursory glance at the data placement problem suggests many

similarities with multi-query optimization in a distributed database, there are substantial

and fundamental differences. For example, in the general case, a P2P system has no

centralized schema and no central administration.
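
The cost decomposition described above can be written out directly. The sketch below assumes additive costs, a hypothetical per-megabyte link cost table, and made-up peer names. It picks the evaluator that minimizes the total for a single query; it does not attempt to solve the full workload placement problem.

```python
# Hypothetical per-MB transfer costs between peers (looked up in either direction).
LINK_COST = {
    ("origin", "A"): 4.0,
    ("origin", "B"): 1.0,
    ("A", "client"): 1.0,
    ("B", "client"): 3.0,
}
# Hypothetical CPU cost of evaluating the query at each candidate peer.
CPU_COST = {"A": 2.0, "B": 5.0}


def link(a, b):
    """Look up the cost of a link in either direction."""
    return LINK_COST.get((a, b), LINK_COST.get((b, a)))


def query_cost(provider, evaluator, initiator, input_mb, result_mb):
    """Transfer to the evaluator + evaluation + transfer of results to the initiator."""
    return (
        link(provider, evaluator) * input_mb
        + CPU_COST[evaluator]
        + link(evaluator, initiator) * result_mb
    )


def cheapest_evaluator(provider, initiator, input_mb, result_mb):
    """Pick the candidate evaluator with the lowest total cost."""
    return min(
        CPU_COST,
        key=lambda e: query_cost(provider, e, initiator, input_mb, result_mb),
    )


best = cheapest_evaluator("origin", "client", input_mb=10, result_mb=1)
```

With these made-up numbers, peer B wins despite its higher CPU cost, because the input transfer from the origin dominates the total.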

2.3 Data Placement Design Choices

While a globally optimal data placement is conceptually simple to define

for an ideal environment, in practice any P2P system will have certain limitations. These

compromises are due to factors such as constrained bandwidth and resources, message

propagation delays, and so on. Some important dimensions that affect the data placement

problem include:

2.3.1 Scope of Decision-Making

A major factor is the scale at which query processing and view materialization

decisions are made. At one extreme, all queries in the entire system are optimized together,

using complete knowledge of the available materialized views, resources, and network

bandwidth constraints — this poses all of the challenges of multi-query optimization plus a

number of additional difficulties. In particular, work must be distributed globally across

many peers, and decisions must be made about when and where to materialize results for

future use. At the other end of the spectrum, every decision is made on a single-node,

single-query basis — this is the familiar problem of query optimization for distributed data.

Clearly, a good query optimization and data placement strategy will be much more

beneficial at the global scale than at the local one; yet decisions are likely to be much more

expensive to make on the global scale, so any real system will likely be forced to work

within a smaller scope.

2.3.2 Extent of Knowledge Sharing

Related to the above problem is the question of how much knowledge is available

to the system during its query optimization process. In particular, the first step in choosing

a query evaluation strategy is likely to be identifying which nodes have materialized views

that can speed query processing. A simple technique would be to use a centralized catalog

of all available views and their locations, analogous to the central directory used by

Napster.

2.3.3 Heterogeneity of Information Sources

Data may originate at a few authoritative sources, or alternatively, every

participant might be allowed (or expected) to contribute data to the community. The level

of heterogeneity of the data influences the degree to which a system can ensure uniform,

global semantics for the data. A P2P system might impose a single schema on all

participants to enforce uniform, global semantics, but for some applications this will be too

restrictive. Alternatively, a limited number of data sources and schemas may be allowed, so

traditional schema and data integration techniques will likely apply (with the restriction

that there is no central authority). The case of fully heterogeneous data makes global

semantic integration extremely challenging.

2.3.4 Dynamicity of Participants

Some P2P systems, such as [Napster 2001], assume a fixed set of nodes in the

system. However, one of the greatest potential strengths of P2P systems arises when they

eschew reliance on dedicated infrastructure and allow peers to leave the system at will.

Even under these conditions, participants typically have broadly varying availability

characteristics. Some peers are akin to servers: their membership in the system stays

largely static. Others have much more dynamic membership, joining and leaving the

system at will. In a configuration where original data is distributed uniformly across the

network, including on nodes that frequently disappear, it may become impossible to

reliably access certain items. At the other extreme, if all data is placed or cached only on

the set of static “servers,” the system will have greatly reduced flexibility and performance

(this configuration is equivalent to yesterday’s web, prior to proxy caches and content

distributors such as Akamai). An intermediate approach places all original content on the

consistently available nodes to provide availability, but replicates or caches data at the

dynamic peers.

2.4 How Piazza Works

The Piazza algorithm focuses on the dynamic data placement problem described

above, with the goal of scalability even with large numbers of nodes and moderately frequent

updates. Figure 2.2 shows data origin as an entity distinct from the peers in the system

(though a peer can actually serve both roles) — Piazza can only guarantee availability of

data while its origin is a member of the network, and only the origin may update its data.

All peer nodes belong to spheres of cooperation, in which they pool their resources and

make cooperative decisions. Each sphere of cooperation may in turn be nested within a

successively larger sphere, with which it cooperates to a lesser extent. These spheres of

cooperation will often mirror particular administrative boundaries (e.g. those within a

corporation or local ISP), and in many ways resemble a cooperative cache. Given this

configuration, Piazza focuses on the following aspects of the data placement problem:

2.4.1 Query Optimization Exploiting Commonalities and Available Data

At the heart of our problem lies a variation of traditional multi-query

optimization. Ideally, the Piazza system will take the current query workload, find

commonalities among the queries, exploit materialized views whenever cost-effective,

distribute work under resource and bandwidth constraints, and determine whether certain

results should be materialized for future use (while considering the likelihood of updates to

the data). For scalability reasons, these decisions are taken at the level of a sphere of

cooperation rather than on a global basis. In order to perform this optimization, Piazza must

address two important sub-problems [Halevy 2003][Suciu 2003].

2.4.2 Propagating Information about Materialized Views

When a query is posed, the first step is to consider whether it can be answered

using the data at “nearby” storage providers, and to evaluate the costs of doing so. This

requires the query initiator to be aware of existing materialized views and properties such

as location and data freshness. One direction we are exploring is to propagate information

about materialized views using techniques derived from routing protocols [Tanenbaum

1996]. In particular, a node advertises its materialized views to its neighbors. Each node

consolidates the advertisements it receives and propagates them to its neighbors. Under

constrained resources, any node can arbitrarily drop advertisements without jeopardizing

system correctness— a query can always be posed in terms of the data origins. This routing

protocol avoids the scalability problems caused by broadcasting every view materialization

and those caused by broadcasting every query request.
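The advertise, consolidate, and propagate cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `ViewAdvertSketch`, its `Node` type, the view name, and the `maxHops` cut-off (standing in for a node dropping advertisements when resources are constrained) are all hypothetical and are not part of Piazza itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: distance-vector style propagation of view advertisements. */
final class ViewAdvertSketch {

    /** A peer that remembers, per view name, the fewest hops to a copy of it. */
    static final class Node {
        final String name;
        final List<Node> neighbors = new ArrayList<>();
        final Map<String, Integer> knownViews = new HashMap<>();

        Node(String name) { this.name = name; }

        /** Materialize a view locally and advertise it to the neighbors. */
        void materialize(String view, int maxHops) {
            receive(view, 0, maxHops);
        }

        /** Consolidate: keep only the shortest distance, then forward it on. */
        void receive(String view, int hops, int maxHops) {
            Integer best = knownViews.get(view);
            if (best != null && best <= hops) {
                return; // nothing new, so nothing to re-advertise
            }
            knownViews.put(view, hops);
            if (hops >= maxHops) {
                return; // dropping is safe: the data origin can still answer
            }
            for (Node n : neighbors) {
                n.receive(view, hops + 1, maxHops);
            }
        }
    }

    static void link(Node a, Node b) {
        a.neighbors.add(b);
        b.neighbors.add(a);
    }

    /** Builds the chain A - B - C - D and advertises one view from A. */
    static Map<String, Integer> demo(int maxHops) {
        Node a = new Node("A"), b = new Node("B"), c = new Node("C"), d = new Node("D");
        link(a, b);
        link(b, c);
        link(c, d);
        a.materialize("hospitals_by_zip", maxHops);
        Map<String, Integer> hopsSeen = new HashMap<>();
        for (Node n : new Node[] {a, b, c, d}) {
            Integer h = n.knownViews.get("hospitals_by_zip");
            if (h != null) {
                hopsSeen.put(n.name, h);
            }
        }
        return hopsSeen;
    }
}
```

With a cut-off of two hops, node D never learns about the view in this sketch and would fall back to the data origin, which mirrors the point that dropped advertisements affect cost but not correctness.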

2.4.3 Consolidating Query Evaluation and Data Placement

A node may pose a query that cannot be evaluated with the data available from

known peers. In this case, the data must be retrieved directly from the data origins.

However, at any given point, there may be many similar un-evaluable queries within the

same sphere of cooperation, and the sphere should choose an optimal strategy for

evaluating all of them. Therefore, all un-evaluable queries are broadcast within the cluster;

the cluster identifies commonalities among this query set, then assigns roles (evaluation of

a query or sub query and/or materialization of results) to specific nodes based on cost.

Figure 2.2 Piazza System Architecture [Doan 2002].

Data Origins serve original content, peer nodes (A-E) cooperate to answer queries

but have limited disk and CPU resources. Nodes are connected by bandwidth-constrained

links and advertise their materialized views. Nodes belong to spheres of cooperation with

which they share resources; these spheres may be nested within successively larger spheres

(see Figure 2.2).

2.4.4 Schema Mediation in Piazza

In contrast to a data integration environment, which has a tree-based hierarchy

with data source schemas at the leaf nodes and one or more mediated schemas as

intermediate nodes, a peer data management system (PDMS) can support an arbitrary

graph of interconnected schemas. Some of these schemas are defined virtually for

purposes of querying and mapping. These are called peer schemas, and generally their

relations (peer relations) will have an open-world assumption (i.e., the data returned by

querying these relations may be incomplete). Queries in the PDMS will be posed over the

relations from a specific peer schema. A peer schema represents the peer’s “view of the

world” that is unlikely to be the same at different peers. Peers may also contribute data to

the system in the form of stored relations. Stored relations are analogous to data sources

in a data integration system: all queries in a PDMS will be reformulated strictly in terms

of stored relations that may be stored locally or at other peers [Suciu 2003].

There are two types of schema mappings in Piazza. A mapping that relates two or

more peer schemas is called a peer description, whereas a mapping that relates a stored

schema to a peer schema is called a storage description. Peer descriptions define the

correspondences between the “views of the world” at different peers. Storage

descriptions, on the other hand, map the data stored at a peer into the peer’s view of the

world. Thus, storage descriptions are similar to data source descriptions in a data

integration system.

Two main formalisms have been proposed for schema mediation in data

integration systems. In the first, called global-as-view (GAV), the relations in the

mediated schema are defined as views over the relations in the sources. In the second,

called local-as-view (LAV), the relations in the sources are specified as views over the

mediated schema. For example, let us assume there are two data sources: two car dealer

databases that both became part of the Acme Cars company. Each of the car dealers has a

separate schema for storing information about cars. Dealer 1 stores it in the relation:

Cars(vin, make, model, color, price)

Dealer 2 stores information about his cars for sale in the relation:

CarsForSale(vehicleID, carMake, carModel, carColor, carPrice).

Acme Cars uses a mediated architecture to integrate these two dealers' databases.

It does this by providing a mediated schema of the two schemas above. The mediated

schema consists of just one relation:

Automobiles(vin, autoMake, autoModel, autoColor, autoPrice).

In the GAV approach, for each relation R in the mediated schema, a view in terms of the

source relations is written which specifies how to obtain R's tuples from the sources.

The following simple example shows how mediated schema relations CAR and REVIEW

can be obtained from the source relations S1, S2 and S3.

S1(vin, status, model, year) => CAR(vin, status)

S2(vin, status, make, price) => CAR(vin, status)

S1(vin, status, model, year) ∧ S3(vin, review) => REVIEW(vin, review)

S2(vin, status, make, price) ∧ S3(vin, review) => REVIEW(vin, review)
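As an illustration of what these GAV rules compute, the following sketch evaluates them over in-memory tuples: CAR is the union of the (vin, status) projections of S1 and S2, and REVIEW is S3 joined on vin with the cars found in S1 or S2. The class name `GavViewsSketch`, the tuple encoding as string arrays, and any sample data passed to it are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: evaluating the GAV views for CAR and REVIEW. */
final class GavViewsSketch {

    /** CAR(vin, status): union of the (vin, status) projections of S1 and S2. */
    static Set<String> car(List<String[]> s1, List<String[]> s2) {
        Set<String> out = new LinkedHashSet<>();
        for (String[] t : s1) {
            out.add(t[0] + "," + t[1]);
        }
        for (String[] t : s2) {
            out.add(t[0] + "," + t[1]);
        }
        return out;
    }

    /** REVIEW(vin, review): S3 joined on vin with the cars known to S1 or S2. */
    static Set<String> review(List<String[]> s1, List<String[]> s2, List<String[]> s3) {
        Set<String> vins = new LinkedHashSet<>();
        for (String[] t : s1) {
            vins.add(t[0]);
        }
        for (String[] t : s2) {
            vins.add(t[0]);
        }
        Set<String> out = new LinkedHashSet<>();
        for (String[] r : s3) {
            if (vins.contains(r[0])) {
                out.add(r[0] + "," + r[1]);
            }
        }
        return out;
    }

    /** Small helper that packs tuples into a list; used only to build test data. */
    static List<String[]> rows(String[]... tuples) {
        List<String[]> list = new ArrayList<>();
        for (String[] t : tuples) {
            list.add(t);
        }
        return list;
    }
}
```

Because each mediated relation is defined directly over the sources, answering a query in GAV amounts to substituting these definitions into the query, which is why adding a new source means revisiting the view definitions.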

In the LAV approach, for each data source S, a view in terms of the mediated schema

relations is written that describes which tuples of the mediated schema relations are found

in S. In LAV, we take an opposite approach to GAV and we describe each source in

terms of the mediated schema relations. Assume that source S1 contains cars produced

in 1990 or later and source S2 contains cars sold by the dealer "ACME".

S1(vin, status, model, year) :- CAR(vin, status), MODEL(vin, model, year), year ≥ 1990

S2(vin, status, make, price) :- CAR(vin, status), MODEL(vin, make, year), SELLS(dealer name, vin, price), dealer name = "ACME"

S3(vin, review) :- REVIEW(vin, review)

Query processing using the LAV approach is an application of a much broader problem

called "Answering Queries using Views" [Franconi 2003].

Piazza combines and generalizes the two data integration formalisms, and it

extends them to the XML world in a way that keeps evaluation tractable. Two kinds of

peer descriptions are supported: equality and inclusion descriptions. Peer descriptions

have the following form: Q1(P1) = Q2(P2), (or Q1(P1) subset Q2(P2) for inclusions)

where Q1 and Q2 are conjunctive queries with the same arity and P1 and P2 are sets of

peers. Intuitively, the mapping statement specifies a semantic mapping by stating that

evaluating Q1 over the peers P1 will always produce the same answer (or a subset in the

case of inclusions) as evaluating Q2 over P2. The set of mappings of a PDMS defines its

semantic network (or topology). Optimizing the topology of a PDMS is an interesting

research problem. Some of the possible optimization criteria include: eliminating

redundant mappings, reducing the diameter of a PDMS (to reduce information loss in

query reformulation), and identifying semantically unreachable peers.

2.4.5 Querying in Piazza

Query reformulation is perhaps the single most important aspect of query

processing in a PDMS, since it is crucial to a PDMS’s ability to answer user queries. The

input of the algorithm is a set of peer mappings and storage descriptions and a query Q.

The output of the algorithm is a query expression Q’ that refers to stored relations only.

To answer Q we need to evaluate Q’ over the stored relations.

When all peer mappings are definitional (similar to GAV mappings in data integration), the algorithm simply constructs a rule-goal tree: goal nodes are

labeled with atoms of the peer relations, and rule nodes are labeled with peer mappings. It

begins by expanding each query subgoal according to the relevant peer mappings in the

PDMS. When none of the leaves of the tree can be expanded any further, it uses the

storage descriptions for the final step of reformulation in terms of the stored relations.

Suppose all peer mappings in the PDMS are of the form V subset Q(P). In this case (that

is similar to LAV mappings in data integration), we begin with the query subgoals and

apply an algorithm for answering queries using views. The algorithm is applied to the

result until it cannot proceed further, and as in the previous case, it uses the storage

descriptions for the last step of reformulation [Halevy 2003].

A major challenge of the reformulation algorithm is to combine and interleave the

two types of reformulation techniques. One type of reformulation (unfolding) replaces a

sub goal with a set of sub goals, while the other (rewriting) replaces a set of sub goals

with a single sub goal. As a result, the output of the algorithm can be an acyclic graph

rather than a tree.

2.5 JXTA

JXTA is a set of open, generalized peer-to-peer protocols that allows any connected

device (cell phone, PDA, PC, or server) on the network to communicate and collaborate in

a P2P manner. It is an open source product. JXTA technology enables developers to create

innovative distributed services and applications. JXTA technology is used to create

applications and services that enable people to:

• Collaborate on projects from anywhere using any connected device

• Share compute services, such as processor cycles or storage systems, regardless of

where the systems or the users are physically located

• Communicate with colleagues across the world using a peer-to-peer network

• Share files and information to distributed locations on the network, not just to local

hard drives

• Connect game systems so that multiple people in multiple locations can play the

same game interactively.

There are the obvious uses, such as messaging and resource sharing, but as the

deployments prove, collaboration, content delivery, and decentralization are also ripe for P2P

applications. JXTA technology provides developers with tools to build network

applications that thrive in highly dynamic environments. "There's not so much one classic,

killer application," Soto says. "But there are killer characteristics that make an application

suitable for JXTA technology." [Berners 2000]. These characteristics include situations:

• where centralization is not required or not possible

• where resilience is needed--in case a piece of the network is lopped off, for example

• where massive scalability is important--peers could pick up large pieces of the load

on the network--the more peers in the network the more valuable the P2P solution is

• where relationships are transient or ad hoc

• where resources are highly distributed

The JXTA Release 2.0 adds new features that enhance scalability and

performance. All are aimed at making JXTA technology more and more enterprise ready.

"JXTA greatly reduces the complexity required to build and deploy P2P solutions and

services," says Soto. "Businesses benefit greatly as a result: improved collaboration and

sharing, greater security and resilience because there's no single point of failure, up-to-the-

second data currency, and better control. And that means lower costs and faster time to

market for improved competitiveness." JXTA strives to provide a base P2P infrastructure

over which other P2P applications can be built. This base consists of a set of protocols that

are language independent, platform independent, and network agnostic (that is, they do not

assume anything about the underlying network). These protocols address the bare

necessities for building generic P2P applications. Designed to be simple with low

overhead, the protocols target, to quote the JXTA vision statement, "every device with a

digital heartbeat." [Bolosky 2000]. JXTA currently defines six protocols, but not all JXTA

peers are required to implement all six of them. The number of protocols that a peer

implements depends on that peer's capabilities; conceivably, a peer could use just one

protocol. Peers can also extend or replace any protocol, depending on their particular

requirements. It is important to note that JXTA protocols by themselves do not promise

interoperability. Here, you can draw parallels between JXTA and TCP/IP. Though both

FTP and HTTP are built over TCP/IP, you cannot use an FTP client to access web pages.

The same is the case with JXTA. Just because two applications are built on top of JXTA

doesn't mean that they can magically interoperate. Developers must design applications to

be interoperable. However, developers can use JXTA, which provides an interoperable

base layer, to further reduce interoperability concerns.

2.5.1 The JXTA Jargon

Before proceeding any further, let's quickly look at the various concepts in JXTA

[Bolosky 2000].

Peers

Any entity on the network implementing one or more JXTA protocols. A peer

could be anything from a mainframe to a mobile phone or even just a motion sensor. A

peer exists independently and communicates with other peers asynchronously.

Peer groups

Peers with common interests can aggregate and form peer groups. Peer groups

can span multiple physical network domains.

Messages

All communication in the JXTA network is achieved by sending and receiving

messages. These messages, called JXTA messages, adhere to a standard format, which is

key to interoperability.

Pipes

Pipes establish virtual communication channels in the JXTA environment. Peers

use them for sending and receiving JXTA messages. Pipes are deemed virtual because

peers don't need to know their actual network addresses to use them. That is an important

abstraction.

Services

Both peers and peer groups can offer services. A service offered by a peer

individually, at a personal level, is called a peer service, a concept equivalent to

centralization. No other peer needs to offer that service; if the peer is not active, the service

might become unavailable.

Peer groups offer services called peer group services. Unlike peer services, these

services are not specific to a single peer but available from multiple peers in the group.

Peer group services are more readily available, because even if one peer is unavailable,

other peers offer the same services.

Advertisements

Advertisements are used to publish and discover any JXTA resource such as a peer, a

peer group, a pipe, or a codat. Advertisements are represented as XML documents.

Identifiers

Identifiers play a key role in the JXTA environment. Identifiers specify

resources, not physical network addresses. The JXTA identifier is defined as a URN

(Uniform Resource Name). A URN is nothing but a URI (Uniform Resource Identifier)

that has to remain globally unique and persistent even when the resource ceases to exist.

Endpoints

Endpoints are destinations on the network and can be represented by a network

address. Peers don't generally use endpoints directly; they use them indirectly through

pipes, which are built on top of endpoints.

Routers

Anything that moves packets around the JXTA network is called a JXTA router.

Not all peers need to be routers. Peers that are not routers must find a router to route their

messages.

The JXTA protocols

The key to JXTA lies in a set of common protocols defined by the JXTA

community. These protocols can be used as a foundation to build applications. Designed

with a low overhead, the protocols assume nothing about the underlying network topology

over which an application that uses them is built [Berners 2000].

Peer Discovery Protocol (PDP)

Peers use this protocol to discover all published JXTA resources. Since

advertisements represent published resources, PDP essentially helps a peer discover an

advertisement on other peers. As the lowest-level discovery protocol, PDP provides a basic

mechanism for discovery. Applications might choose to use higher-level discovery

mechanisms. PDP serves as a low-level protocol over which higher-level discovery

mechanisms can be built.

Peer Resolver Protocol (PRP)

Often in the network, peers send queries to other peers to locate some service or

content. The Peer Resolver Protocol intends to standardize these queries' formats. With this

protocol, peers can send generic queries and receive responses.

Peer Information Protocol (PIP)

PIP can be used to ping a peer in the JXTA environment. A peer receiving a ping

message has several options: It can give a simple acknowledgment, consisting only of its

uptime. It can send a full response, which includes its advertisement. Or it can ignore the

ping. Thus, there can be peers capable of receiving messages but not sending responses.

Peer Membership Protocol (PMP)

Peers use the Peer Membership Protocol for joining and leaving peer groups. This

protocol recognizes four discrete steps used by peers and thus defines JXTA messages for

each of these actions:

• Apply: A peer interested in entering a group can apply for a membership to the

group membership authenticator. The authenticator responds by sending back an

acknowledgment message to the peer.

• Join: After an apply, the peer can choose to join the peer group.

• Renew: To update their membership information in the group, peers use the renew

message.

• Cancel: Peers can choose to cancel their peer group memberships.
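The four membership steps can be read as a small state machine. The sketch below is not the JXTA API: `MembershipSketch`, its `State` and `Message` enums, and the ordering constraints it enforces (for example, that a join must follow an apply, and that an applicant may cancel) are assumptions chosen to illustrate the life cycle.

```java
/** Hypothetical sketch of the apply / join / renew / cancel life cycle. */
final class MembershipSketch {

    enum State { OUTSIDE, APPLIED, MEMBER }

    enum Message { APPLY, JOIN, RENEW, CANCEL }

    /** Returns the state after a message, or throws if it is out of order. */
    static State next(State current, Message msg) {
        switch (msg) {
            case APPLY:
                if (current == State.OUTSIDE) return State.APPLIED;
                break;
            case JOIN:
                if (current == State.APPLIED) return State.MEMBER;
                break;
            case RENEW:
                if (current == State.MEMBER) return State.MEMBER;
                break;
            case CANCEL:
                if (current != State.OUTSIDE) return State.OUTSIDE;
                break;
        }
        throw new IllegalStateException(msg + " is not valid in state " + current);
    }

    /** Applies a sequence of messages starting from OUTSIDE. */
    static State run(Message... msgs) {
        State s = State.OUTSIDE;
        for (Message m : msgs) {
            s = next(s, m);
        }
        return s;
    }
}
```

For instance, `MembershipSketch.run(Message.APPLY, Message.JOIN, Message.RENEW)` ends in `MEMBER`, while sending `JOIN` without a prior `APPLY` is rejected in this sketch.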

The JXTA Java Binding

The best way to see the above protocols in action is to explore the JXTA Java

Binding, the JXTA reference implementation in Java. Developers can build on the existing

implementation or choose to implement their own version of the protocols in the languages

and platforms of their choice. Though the reference uses the HTTP and TCP/IP transports

because of their simplicity and popularity, you can implement the JXTA protocols on any

transport protocol, depending on the network topology.

The Class Organization

The JXTA Java Binding consists of two main class hierarchies:

• The net.jxta.* classes

• The net.jxta.impl.* classes

The first package contains all the JXTA interfaces, which are the Java interfaces

for the JXTA protocols and core building blocks. The second package contains these

interfaces' implementations. The interfaces and their implementations must be clearly

separated. Let's dive into these packages.

Where is My Peer?

A peer is an independent, asynchronous entity in the network associated with a

peer ID. You might consider an instance of running code as a peer. Currently, a boot class

(net.jxta.impl.peergroup.Boot), which provides a main() method, starts a

peer.

A peer's capabilities depend on the groups to which it belongs. But, by virtue of

just being a peer, every peer exhibits some minimum capability -- having an ID, for

instance. That means that there must exist at least one peer group that every peer must be a

member of: the world peer group. Also called the platform peer group, the world peer

group is represented by the class net.jxta.impl.peergroup.Platform, an

implementation of the PeerGroup class (net.jxta.peergroup.PeerGroup).

Peer Groups as Applications

An important abstraction in the binding, an application

(net.jxta.platform.Application) is anything that a peer group can initialize,

start, and stop. It is interesting to note that one peer group

(net.jxta.peergroup.PeerGroup) usually starts another peer group (refer to the

discussion on peer group nesting) and is hence an application. An exception is the platform

(or world) peer group. It is not started by any other peer group and forms the base of the

peer group hierarchy. An application defines three methods: init(), startApp(), and

stopApp(). The methods in the Application class are as follows:

public void init(PeerGroup group, Advertisement adv);

public int startApp(String[] args);

public void stopApp();
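To make the life cycle concrete, here is a minimal sketch of a class that implements these three methods. So that it compiles without the JXTA jars, `PeerGroup`, `Advertisement`, and `Application` are declared locally as simplified stand-ins rather than imported from the real binding; `LifecycleSketch`, `getName()`, and the meaning of the integer returned by `startApp` (zero for success) are assumptions for illustration only.

```java
/** Hypothetical stand-ins so the sketch compiles without the JXTA jars. */
interface PeerGroup { String getName(); }

interface Advertisement { }

/** Mirrors the three life-cycle methods quoted above. */
interface Application {
    void init(PeerGroup group, Advertisement adv);
    int startApp(String[] args);
    void stopApp();
}

/** A toy application that only records which life-cycle calls it has seen. */
final class LifecycleSketch implements Application {
    private final StringBuilder log = new StringBuilder();
    private PeerGroup group;

    @Override
    public void init(PeerGroup group, Advertisement adv) {
        this.group = group;
        log.append("init(").append(group.getName()).append(");");
    }

    @Override
    public int startApp(String[] args) {
        if (group == null) {
            return 1; // assumed convention: non-zero means "could not start"
        }
        log.append("start;");
        return 0; // assumed convention: zero means "started"
    }

    @Override
    public void stopApp() {
        log.append("stop;");
    }

    String log() {
        return log.toString();
    }

    /** Plays the role of the parent peer group driving the application. */
    static String demo() {
        LifecycleSketch app = new LifecycleSketch();
        PeerGroup world = new PeerGroup() {
            @Override public String getName() { return "world"; }
        };
        app.init(world, new Advertisement() { });
        app.startApp(new String[0]);
        app.stopApp();
        return app.log();
    }
}
```

The parent group calls `init` first, then `startApp`, and finally `stopApp`; in this sketch, starting an application that was never initialized is reported through the return code rather than an exception.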

3. SYSTEM DESIGN

3.1 System Requirements

A peer-to-peer database integration system was developed using Java and MySQL

with JXTA protocols and the Piazza algorithm. The different data sources for the project

would be on a standalone system.

Different data sources are used to collect the required information and display the

result to the end user in the required format. This project has three kinds of data sources:

hospitals, clinical laboratories, and fire stations. There may be one or many hospitals

and one or many laboratories. Hospitals would have the data of wards and capacity of each

department and general information that any user can view. Laboratories would have the

information on the patients’ blood samples. If a user of the system wants to see

information on all the hospitals and fire stations in a particular area with a particular

department, two queries to two different databases would normally be needed. But as this project provides a

global view of all the data from different sources, it collects the data from the peer

databases and results are given accordingly with only one query.

The process does not require any schema changes for the peers as the query is

executed at runtime, but the data sources should be consistent with respect to integrity of

data. Since the queries may range from simple to complex, there should be a particular

procedure that generates the query that can pull the required data from sources. The Query

Reformulation (Piazza) algorithm is used to create a rule-goal tree, which is used to

break down the global query and generate queries that can extract the required data [Suciu

2003].

3.2 Piazza Algorithm

The algorithm takes as input a conjunctive query $Q(\bar{X})$ that is posed at some peer,

and a set of peer mappings and storage descriptions. The following are the steps involved

in the algorithm:

1. Each equality description is transformed into two inclusion mappings.

2. Each inclusion of the form Q1 ⊆ Q2 is then transformed to V ⊆ Q2 and

V :- Q1, where V is a new predicate name. (‘⊆’ says that Q1 is

contained in Q2.)

3. Each node n in the rule-goal tree has a label l(n), which is an atom whose

arguments are variables or constants. unc(n) records the uncle nodes of n that n also covers.

4. The root of the tree is labeled $Q(\bar{X})$, and its children are the sub goals of the

global query.

5. Choose an arbitrary leaf node n and expand it following the steps specified below,

repeating until no leaf node can be expanded further:

i. If the predicate of l(n) appears in the head of a definitional

description r, expand n with that definition. Create a child rule node nr

labeled with r, with one child goal node for every sub goal in the body of r.

This type of expansion applies to GAV-style mappings.

ii. If the predicate p of l(n) appears in the right-hand side of an

inclusion description or storage description r of the form V ⊆ Q1

(or V = Q1), we do the following.

a. Let n1, . . . , nm be the children of the father node of n, and

p1, . . . , pm be their corresponding labels.

b. The MCD (MiniCon description) contains an atom of the form

V and the set of atoms in p1, . . . , pm that it covers.

An MCD is a mapping from a subset of the variables in the query to variables in

one of the views. Intuitively, an MCD represents a fragment of a containment mapping

from the query to the rewriting of the query [Halevy 2001]. An MCD $C$ for a query $Q$

over a view $V$ is a tuple of the form $(h_C, V(\bar{Y})_C, \psi_C, G_C)$ where: $h_C$ is a head

homomorphism on $V$; $V(\bar{Y})_C$ is the result of applying $h_C$ to $V$, i.e., $\bar{Y} = h_C(\bar{A})$, where $\bar{A}$

are the head variables of $V$; $\psi_C$ is a partial mapping from $Vars(Q)$ to $h_C(Vars(V))$; and $G_C$ is

a subset of the subgoals in $Q$ which are covered by some sub goal in $h_C(V)$ using the

mapping $\psi_C$.

c. A child rule node nr is created for n, labeled with r, along with a child

goal node ng for nr, labeled with V.

6. The solution is constructed from the rule-goal tree T. The result of the global query

is a union of conjunctive queries over the stored relations.

7. The body of the conjunctive query is the conjunction of all the leaves of T.

A user enters the query, Q, on the interface, and the algorithm creates a bucket

for each sub goal in Q, holding what is relevant to answering that particular sub goal. The sub queries

collect data from different data sources and provide results in the format of global schema

[Halevy 2003]. The flow chart of the above algorithm is as shown below in Figure 3.2.1:

Figure 3.2.1 Flow chart of the query reformulation algorithm.

Consider a P2P system in which all peer mappings are definitional (similar to

GAV mappings in data integration). In this case, the algorithm is a simple construction of a

rule goal tree: goal nodes are labeled with atoms of the peer relations, and rule nodes are

labeled with peer mappings. It begins by expanding each query sub goal according to the

relevant definitional peer mappings in the PDMS. When none of the leaves of the tree can

be expanded any further, the storage descriptions are used for the final step of

reformulation in terms of the stored relations.
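The definitional case described above can be sketched as a recursive unfolding. This is a deliberately simplified illustration, not the algorithm as implemented in the developed system: atoms are reduced to bare relation names (variables and unification are ignored), mappings are assumed to be acyclic, and every name in it (`UnfoldSketch`, `define`, `reformulate`, and the relations `care`, `rescue`, `hospitals`, `wards`, `stationsNorth`, `stationsSouth`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: unfolding a query through definitional (GAV-style) mappings. */
final class UnfoldSketch {

    /** Peer relation -> alternative rule bodies; names with no entry are stored relations. */
    final Map<String, List<List<String>>> rules = new HashMap<>();

    void define(String head, String... body) {
        rules.computeIfAbsent(head, k -> new ArrayList<>()).add(Arrays.asList(body));
    }

    /**
     * Expands every goal until only stored relations remain. Each returned list
     * is one conjunctive query; the whole result is their union.
     */
    List<List<String>> reformulate(List<String> goals) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (String goal : goals) {
            List<List<String>> expansions = expand(goal);
            List<List<String>> combined = new ArrayList<>();
            for (List<String> prefix : result) {
                for (List<String> e : expansions) {
                    List<String> cq = new ArrayList<>(prefix);
                    cq.addAll(e);
                    combined.add(cq);
                }
            }
            result = combined;
        }
        return result;
    }

    /** A goal node: a leaf if stored, otherwise one rule-node child per mapping. */
    private List<List<String>> expand(String goal) {
        List<List<String>> bodies = rules.get(goal);
        if (bodies == null) {
            List<List<String>> leaf = new ArrayList<>();
            leaf.add(new ArrayList<>(Arrays.asList(goal)));
            return leaf;
        }
        List<List<String>> out = new ArrayList<>();
        for (List<String> body : bodies) {
            out.addAll(reformulate(body));
        }
        return out;
    }

    static List<List<String>> demo() {
        UnfoldSketch s = new UnfoldSketch();
        s.define("care", "hospitals", "wards"); // one mapping with two subgoals
        s.define("rescue", "stationsNorth");    // two alternative mappings
        s.define("rescue", "stationsSouth");
        return s.reformulate(Arrays.asList("care", "rescue"));
    }
}
```

In this sketch a goal with several mappings multiplies the number of conjunctive queries in the union, which is one reason pruning redundant rewritings matters in practice.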

Suppose all peer mappings in the PDMS are inclusions in which the left-hand side

has a single atom (similar to LAV mappings in data integration). In this case, we begin

with the query sub goals and apply an algorithm for answering queries using views. The

algorithm is applied to the result until it cannot proceed further, and as in the previous case,

the storage descriptions are used for the last step of reformulation.

The first challenge of the complete algorithm is to combine and interleave the two

types of reformulation techniques. One type of reformulation replaces a sub goal with a set

of sub goals, while the other replaces a set of sub goals with a single sub goal. The

algorithm will achieve this by building a rule-goal tree, while it simultaneously marks

certain nodes as covering not only their parent node, but also their uncle nodes.

Before illustrating with an example, the schema of the data sources used is shown in Table 3.2.1.

Table 3.2.1 Database schema

For example, if we have to find out the hospitals or departments and fire stations

in the zip area = 7010, the query that we put to the system is Q(h,f,e,a = 7010) where h is

hospital id, f is fire station id, e is equipment and a is area. The query needs to get

records from two different peers with two different schemas. The concept of using peer

descriptions comes in handy here, as stated in the coordination rules, which match the peer

schemas and let the communication happen. The storage descriptions aid in combining

results from different databases within the same peer.

The main query can be decomposed into two sub goals as follows:

(q) Q(h,f,e,a=7010) :- sameareahospitals(h,f,e,q,’7010’), sameareafirestations(f,h,e,q,a=7010)

Peer descriptions:

(r0) sameareahospitals(h,a,d,b,p,n) proper subset of bfs(h,d,b,p,n,a=7010) and cs(h,d,b,p,n,a=7010)

(r2) sameareafirestations(f,h,e,q,a) proper subset of bs(f,h,e,q,a=7010) and pfs(f,h,e,q,a=7010)

Storage descriptions:

(r1) bfs(h,d,b,p,n,a=7010) is a subset of bfs(h,d,b,p,n,a)

(r1) cs(h,d,b,p,n,a=7010) is a subset of cs(h,d,b,p,n,a)

(r3) bs(f,h,e,q,a=7010) is a subset of bs(f,h,e,q,a)

(r3) pfs(f,h,e,q,a=7010) is a subset of pfs(f,h,e,q,a)

Reformulated Query:

Q’ :- bs(h,f,e,a=7010), pfs(f,h,e,a=7010), bfs(h,d,b,p,n,a=7010), cs(h,d,b,p,n,a=7010)

Figure 3.2.2 Rule-goal tree for the query Q.

A query can be evaluated in a PDMS by sending it (reformulated appropriately) to

all the peers that might have answers. In such a scheme, it is absolutely vital that every

query not flood the entire network. The query reformulation algorithm devotes

considerable effort towards pruning rewritings that are guaranteed to return no results (or

redundant results). However, reformulation can only exploit information contained in the

schema mappings, whereas it would be desirable to exploit information about the actual

data stored at the peers in order to identify the peers relevant to the user query.

3.3 P2P Database and Coordination Rules

The databases have been created in MySQL. Three different databases,

one for each peer, have been created, namely Clinical laboratories, Hospitals and

Fire stations, and each database corresponds to one node. The details from the three

databases should be fetched at the 911 center according to the need. The schemas of the three

databases are as shown in ‘Database Schema’ in section 3.2. Mapping of attributes is

possible with the declaration of coordination rules. A coordination rule has certain

specifications in its declaration.

Coordination Rule: A coordination rule allows a node i to fetch data from its

neighbor nodes j1, ..., jk. It is defined formally as follows [Franconi 2003].

Let I be a nonempty finite set of indices {1,2,3,..,n}, and C be a set of constants. For each

pair of distinct i, j ∈ I, let Li be a first order function-free language with signature disjoint

from Lj, but for the shared constants C. A local database DBi is a theory on the first order

language Li. A coordination rule is an expression of the form:

$j_1 : b_1(x_1, y_1) \wedge \dots \wedge j_k : b_k(x_k, y_k) \Rightarrow i : h(x)$

where $j_1, \dots, j_k, i$ are distinct indices, each $b_l(x_l, y_l)$ is a formula of $L_{j_l}$,

$h(x)$ is a formula of $L_i$, and $x = x_1 \cup \dots \cup x_k$.

The coordination rule for the node N0 according to the above schema is shown in the following snippet:

Coordination rule snippet.

The above code corresponds to the following topology:

Figure 3.3.1 Topology.

In the above topology (Figure 3.3.1), we can query from node-0 and retrieve the results

from node-1 to node-3.

3.4 Workflow

After setting up the Java environment, the batch file in each node is run, which

opens a window known as the status window, as shown in Figure 3.4.1, and an interface, as

shown in Figure 3.4.2. Together they show that the peer's engine has started, registers in the

topology, and discovers any other peers in the sphere.

Figure 3.4.1 Status window.

Figure 3.4.2 Interface for each node (batch file).

Once all the peers have started, click on “Read Coordination rules from file” on

the node 0 interface, which reads the XML file of rules that defines the relationships among

peers. Figure 3.4.3 shows the status of the peers during this step.

Figure 3.4.3 Status during coordination rules announcement.

Now click on the “Publish Topology Advertisement” button, which creates a one-way

channel from node 0 to all the other nodes, followed by the “Initialize connections in the

network” button, which enables all the peers to participate in the network. Figure 3.4.4 shows

the status of the peers during this step.

Figure 3.4.4 Peers are ready.

Now switch to the Queries tab and type in the required query to get the results

from all the participating peers.

Example: All the hospitals and department names with number of beds ‘700’

Q(h,a,d,b):-h1(h,a,d,b,p,n);b=’700’;

The corresponding screen is as shown in Figure 3.4.5.

Figure 3.4.5 Execution of Query 1.

Example: Find all the fire stations and hospitals in the city of Corpus Christi

Q(h,f,e,a) :- fs1(f,h,e,q,a),h1(h,a,d,b,p,n).

The corresponding screen is as shown in Figure 3.4.6.

Figure 3.4.6 Execution of Query 2.

4. EVALUATION AND RESULTS

The system has been tested in two phases: during the development phase and after

the development phase. Since the system is basically built through integration, the process of

testing each unit and the system as a whole was the main aspect of evaluation. Debugging has

also been given due importance, as undetected bugs might lead to integration errors. The coordination

rules were checked more than once to make sure that all the peers are participating actively

in the sphere.

4.1 Evaluation

Testing the system during construction made it easy to figure out possible bugs

and eliminate them. Since the system is built using Java and JXTA, all the core classes

were tested using JUnit cases. Each peer was tested separately by posing queries and

validating the results obtained.

The difficult part was integrating the system across different peers. The communication channel between two peers is established using pipes (provided by JXTA). The following problems occurred with the configuration of pipes:

a. The channel is sometimes established in only one direction, even though it is properly configured for two-way communication. This problem was reported to the JXTA forum and has been taken care of.

b. Intermittent disconnections occur while two peers communicate because peer addresses are not stored persistently. This problem was addressed by increasing the size of the address table.

Redundant or duplicate tuples were observed when a query involved more than 3 peers or when its execution generated a rule-goal tree of level 4 or more. This problem can be attributed to the structure of the JXTA communication architecture, which needs to be corrected by the JXTA developers in the future.

The execution time of the queries falls in the range 0.01 to 1.1 seconds. Queries involving anything from one peer to a collection of peers are supported by the Piazza system.

In theory, the Piazza algorithm is well structured to accommodate any number of peers and to answer any kind of query. In practice, however, creating such a complex topology or rule-goal tree is hardly possible given the condition of JXTA in its present form.

Another problem, encountered when executing a query that involves more than 3 peers with a rule-goal tree of level 1 or 2, is that results are sometimes either repeated or discarded.

4.2 Results

Different queries were posed at node 0, and the results obtained from all the peers were compared with the corresponding results obtained by querying each database directly and compiling the result set.
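This kind of comparison can be sketched as follows. The snippet is an illustration in Python, not part of the system; the function name `compare_results` and the sample rows are hypothetical. It treats both result sets as multisets, so tuples that are repeated show up as "extra" and tuples that are discarded show up as "missing".

```python
from collections import Counter


def compare_results(peer_rows, baseline_rows):
    """Compare tuples returned through the peers with a directly queried baseline."""
    peer, baseline = Counter(peer_rows), Counter(baseline_rows)
    return {
        # In the baseline but not (or not often enough) in the peer result.
        "missing": sorted((baseline - peer).elements()),
        # In the peer result more often than in the baseline, e.g. duplicates.
        "extra": sorted((peer - baseline).elements()),
    }


# Invented example rows (assumption): one tuple repeated, one tuple lost.
baseline = [("Hospital A", "700"), ("Hospital B", "300")]
from_peers = [("Hospital A", "700"), ("Hospital A", "700")]
report = compare_results(from_peers, baseline)
```

An empty "missing" list together with an empty "extra" list means that the peer result matches the directly compiled result set exactly.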

Table 4.2.1 Results of different query executions

Queries should be written in the following recommended syntax:

Q(fields from the n0 schema) :- schema name(fields from corresponding schema)[, schema name(fields from corresponding schema)]; condition

*condition: field name = value
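As an illustration of this syntax, the following Python sketch splits a query string into its head, body atoms, and conditions. It is not the parser used by the system; the names `parse_atom` and `parse_query` are hypothetical, and the sketch assumes that conditions are simple equalities separated by semicolons and that values contain no semicolons.

```python
import re

# One atom such as h1(h,a,d,b,p,n): a name followed by a parenthesised field list.
ATOM = re.compile(r"(\w+)\s*\(([^()]*)\)")


def parse_atom(text):
    """Return (name, [fields]) for a single atom."""
    match = ATOM.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid atom: {text!r}")
    name, fields = match.groups()
    return name, [f.strip() for f in fields.split(",") if f.strip()]


def parse_query(query):
    """Split a query string into head, body atoms and equality conditions."""
    head_text, separator, rest = query.partition(":-")
    if not separator:
        raise ValueError("query must contain ':-'")
    parts = [p.strip() for p in rest.split(";") if p.strip()]
    if not parts:
        raise ValueError("query has no body")
    # The first part holds the comma-separated body atoms.
    body = [parse_atom(m.group(0)) for m in ATOM.finditer(parts[0])]
    conditions = []
    for text in parts[1:]:
        field, equals, value = text.partition("=")
        if not equals:
            raise ValueError(f"not a valid condition: {text!r}")
        # Accept straight or typographic quotes around the value.
        conditions.append((field.strip(), value.strip().strip("'\"‘’")))
    return {"head": parse_atom(head_text), "body": body, "conditions": conditions}


parsed = parse_query("Q(h,a,d,b):-h1(h,a,d,b,p,n);b='700';")
```

For the first example query of Section 3.4, `parsed["conditions"]` holds the single pair of field b and value 700, and `parsed["body"]` holds the one atom over h1.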

5. FUTURE WORK

Future research includes reconciling peers with inconsistent integrity constraints and considering richer constraint languages at the peers. More generally, peer data management is a very rich domain that creates a wealth of new problems, such as how to replicate data, how to reconcile inconsistent data, and how to optimize across multiple peers.

Although the prototype application is still somewhat preliminary, it already

suggests that the architecture provides useful and effective mediation for heterogeneous

structured data, and that adding new sources is easier than in a traditional two-tier

environment. Furthermore, the overall Piazza system gives a strong research platform for

uncovering and exploring issues in building a semantic web.

A key aspect of the system is that there may be many alternate mapping paths between any two nodes. An important problem is identifying how to prioritize the paths that preserve the most information while avoiding paths that are too diluted to be useful. A related problem at the systems level is determining an optimal strategy for evaluating the rewritten query.

6. CONCLUSION

The concept of peer data management emphasizes not only an ad hoc, scalable, distributed peer-to-peer computing environment (which is compelling from a distributed systems perspective) but also an easily extensible, decentralized environment for sharing data with rich semantics.

The primary contribution of the query reformulation algorithm is that it combines

both LAV- and GAV-style reformulation in a uniform fashion, and it is able to chain

through multiple peer descriptions to reformulate a query.

This project makes it easier to extract results from several databases at once without the need to change the schema design of any individual database.

7. BIBLIOGRAPHY

[Anderson 1995] T. E. Anderson, M. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli, and R. Wang. Serverless network file systems. In SOSP 1995, volume 29(5), pages 109–126, December 1995.

[Berners 2001] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific

American, May 2001.

[Bolosky 2000] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a

serverless distributed file system deployed on an existing set of desktop pcs. In Proc.

Measurement and Modeling of Computer Systems, 2000, pages 34–43, June 2000.

[Cao 1998] P. Cao, J. Zhang, and K. Beach. Active cache: Caching dynamic contents on

the web. In Middleware ’98, Sept. 1998.

[Chen 1976] Chen, P.P., “The entity-relationship model: towards a unified view of data,”

ACM Transactions on Database Systems, vol. 1, no. 1, pp.9-36, 1976.

[David 1983] David Maier, “Null Values Partial Information and Database Semantics,” pp. 371-438 in The Theory of Relational Databases (1983).

[Doan 2002] A. Doan and A. Halevy. Efficiently ordering query plans for data integration.

In Proc. of ICDE, 2002.

[Fan 1998] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: A scalable wide-area web cache sharing protocol. In Proc. of ACM SIGCOMM ’98, August 1998.

[Gray 1996] J. Gray, P. Helland, P. E. O’Neil, and D. Shasha. The dangers of replication

and a solution. In SIGMOD ’96, pages 173–182, 1996.

[Franconi 2003] E. Franconi, G. Kuper, A. Lopatenko, and L. Serafini. A Robust Logical

and Computational Characterisation of Peer-to-Peer Database Systems, in International

Workshop On Databases, Information Systems and Peer-to-Peer Computing, 2003.

(Slides)

[Halevy 2003] A. Halevy, Z. Ives, P. Mork, and I. Tatarinov. Piazza: Data Management

Infrastructure for Semantic Web Applications. In WWW 2003.

[Ives 2000] Z. G. Ives, A. Y. Levy, J. Madhavan, R. Pottinger, S. Saroiu, I. Tatarinov, S.

Betzler, Q. Chen, E. Jaslikowska, J. Su, and W. T. T. Yeung. Self-organizing data sharing

communities with SAGRES. In SIGMOD ’00, page 582, 2000.

[Kubiatowicz 2000] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels,

R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer,About LEGION – the Grid OS.

World-wide web: www.appliedmeta.com/legion/about.html., 2000.

[Leonid 1998] Leonid Stoimenov, Aleksander Stanimirovic, Slobodanka Djordjevic-Kajan, “Discovering mappings between ontologies in semantic integration process.”

[Rodriguez 2003] Rodriguez, M.A, Egenhofer M., Determining Semantic Similarity

Among Entity Classes from Different Ontologies, IEEE Transaction on Knowledge and

Data Engineering, 2003.

[Napster 2001] Napster. World-wide web: www.napster.com, 2001.

[Rabinovich 1998] M. Rabinovich, J. Chase, and S. Gadde. Not all hits are created equal:

Cooperative proxy caching over a wide area network. In Proc. of the 3rd Int. WWW

Caching Workshop, June 1998.

[Siong 2003] W. Siong Ng, B. Chin Ooi, K. L. Tan, and A. Ying Zhou. PeerDB: A P2P-based system for distributed data sharing. In International Conference On Data Engineering (ICDE), 2003.

[Suciu 2003] A. Halevy and Z. Ives and D. Suciu and I. Tatarinov. Schema Mediation in

Peer Data Management Systems. In ICDE 2003.

[Tanenbaum 1996] A. S. Tanenbaum. Computer Networks. Prentice Hall PTR, 3rd edition,

1996.

[Wikipedia 2007]

http://www.conceptdraw.com/products/img/ScreenShots/cd5/software/Chen_ERD.gif

8. APPENDIX

1. GAV - In the Global-As-View (GAV) approach, one defines the concepts in the

global schema as views over the data sources.

2. LAV – In the Local-As-View (LAV), one characterizes the data sources as views

over the global schema.

3. Materialized Views - A materialized view takes a different approach in which the

query result is cached as a concrete table that may be updated from the original

base tables from time to time. This enables much more efficient access, at the cost

of some data being potentially out-of-date.

4. Mediated Schema – A mediated schema allows a user to access multiple databases through mappings between the source schemas and the mediated schema.

5. Mobile Agent - A Mobile Agent is a composition of computer software and data

which is able to migrate (move) from one computer to another autonomously and

continue its execution on the destination computer.

6. Node - A node is a critical element of any computer network. It can be defined as

a point in a network at which lines intersect or branch, a device attached to a

network, or a terminal or other point in a computer network where messages can

be transmitted, received or forwarded.

7. Ontology – It is a branch of metaphysics, often considered the most fundamental.

It is the study of the nature of being, existence, or reality in general and of its

basic categories and their relations, with particular emphasis on determining what

entities exist or can be said to exist, and how these can be grouped and related

within an ontology (typically, a hierarchy subdivided according to similarities and

differences).

8. P2P - Peer to peer (P2P) is a network architecture in which computers (peers) connect directly with each other, rather than through a central server, to search for and exchange content. It is commonly used for file sharing, where spreading the load across peers makes downloading large files efficient.

9. Schema - The schema of a database system is its structure described in a formal

language supported by the database management system (DBMS). In a relational

database, the schema defines the tables, the fields in each table, and the

relationships between fields and tables.

10. Topology - It is the study of the arrangement or mapping of the elements (links,

nodes, etc.) of a network, especially the physical (real) and logical (virtual)

interconnections between nodes.

11. View - A view is a stored query accessible as a virtual table composed of the

result set of a query. Unlike ordinary tables (base tables) in a relational database, a

view is not part of the physical schema: it is a dynamic, virtual table computed or

collated from data in the database. Changing the data in a table alters the data

shown in the view.
