Search on Centralized Networks


Peter S.B. Dushkin

Efficient Search Methods

In Centralized Systems

Diploma in Computer Science

Queens' College, Cambridge, 2003



Proforma

Peter B Dushkin

Queens' College

Peer to Peer Content Sharing

Diploma, Computer Science, 2003

Word Count: 9,039

Project Originator: Mr. Peter Dushkin

Project Supervisor: Mr. Meng How Lim

Original Aims:

The original aim of this project was to build an example of a peer-to-peer content search and distribution system. Each user of the system can enter keywords (such as getFile or getPeers) to find out information about attached nodes on the network. The project follows Napster's example by making a central server available to the connected clients. The server contains a registry file that keeps current information about connected clients and the content each is advertising.

Work Completed:

An RMI client and server were designed and implemented to test various search and

retrieval methods in a centralized network environment. Class libraries were built for

both the local and remote software and data was collected to test the project.

Special Difficulties:

There were no difficulties.

Declaration:

I, Peter Dushkin of Queens' College, being a candidate for the Diploma in Computer Science, hereby declare that this dissertation and the work described in it are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose.

Signed


Date


Table of Contents

List of Figures
1 Introduction
1.1 Abstract
1.2 Motivation
1.3 Subject Overview and Terminology
2 Preparation
2.1 Resources
2.2 Planning & Documentation
3 Implementation
3.2 Classes and Methods
3.2 The Client
3.3. The Server
3.4 Security
4 Evaluation
4.1 Data Collection
3.3 Node Discovery
3.4 Content Discovery
3.5 Content Delivery
3.6 Observing results
5 Conclusions
5.1 Further Development
5.2 Final Conclusion
Appendices
A. Key Code Samples
B. Bibliography
C. Project Proposal
Supervision Requirements


List of Figures

Figure 1. An example of the question and answer exercises in the planning and documentation period. A number of key outputs of this exercise helped us get a better sense of what needed to be addressed in sequence diagrams and use cases.

Figure 2. User view of the system. This use case describes the main inputs required of a user on the system. Later, these inputs will be translated into classes and methods.

Figure 3. The pseudo code for how the client is intended to interact with the system. A number of such pseudo code examples were designed and updated later, informing sequence diagrams and class and method design.

Figure 4. Sequence flow of the central indexing architecture. The local host queries the remote directory server for the location of a given piece of content. The server replies with the IP location of the node containing the text file. Local host then handshakes with the remote host to receive the content.

Figure 5. Design of the system's core class diagrams.

Figure 6. The public interface for the client.

Figure 7. Design of the system's core class diagrams.

Figure 8. The above class example remotely queries the server and returns a time stamp.

Figure 9. The startPeers method on the client side invokes the remote givePeers.

Figure 10. Functioning of the timing sequences. Commands issued on the local client, the execution of remote methods, and returned results are all time stamped to issue data points for further exploration.

Figure 11. Wait time for locating remote connected nodes.

Figure 12. Wait time for locating remote connected nodes. One client.

Figure 13. Wait time for multiple peers requesting multiple files.

Figure 14. Wait time for multiple peers requesting the same file.

Figure 15. Wait time for multiple peers requesting small files.

Figure 16. Wait time for multiple peers requesting a large file.

1 Introduction

1.1 Abstract

The digital exchange of information over peer-to-peer networks is not a new topic. Applications such as classroom educational tools, chat services, multicast applications, and, more commonly, electronic mail could all be categorized as variations on this design. The now infamous birth of Napster as a quick and dirty way to access music files has reinvigorated industrial and academic activity in the p2p arena. Core to such development activities have been issues such as disk space utilization, bandwidth constraints, effectiveness of peer discovery and group management, single points of failure, quality of data location, and reliable exchange of information. The fruit of these development efforts has been second-generation services such as Gnutella, Kazaa, JXTA, and Pastry, all of which move away from a single central index toward decentralized or partly decentralized file sharing.

1.2 Motivation

A common trait of all such peer-to-peer systems is that they establish their own particular foundation, leaving the application development community much flexibility to build products and services on top. The assumption, of course, is that the foundation laid by their predecessors is both efficient and reliable. This simple quandary is the driving motivation behind this diploma project: how do I test the effectiveness of the underlying architecture of a file-sharing system? Given the popularity of this area of development, there seem to be as many protocols as there are claims to a solid solution. For example, Gnutella's flexible network design dramatically decreases the possibility of a single point of failure. But, as the system scales, Gnutella's method of flooding the network with queries eventually does more harm than good. Another example is Pixie, a peer-to-peer architecture that uses the concept of content scheduling to ease the limitations imposed by network utilization. In this case, rather than flood the network with requests, the application schedules content based on the efficient use of resources. Finally, there are the recently popular hybrid networks, a combination of the decentralized and centralized architectures of systems such as Gnutella and Napster respectively.

With this latest evolution in networking approaches in mind, the goal of this project is to design a small centralized network and test how efficiently that network performs search, discovery, and retrieval under varying conditions. Since Napster's introduction, central indexing services, as a scalable solution for content delivery, have been largely replaced by decentralized systems that are less vulnerable to a single point of failure. However, as both research and industrial efforts continue to offer solutions to the shortcomings of both centralized and decentralized networks, it is becoming increasingly apparent that a combination of the two is (currently) the best solution. This is the approach taken by the KaZaA file-sharing network. In the KaZaA solution, centralized servers are located throughout decentralized peer groups, combining the fast access of a central index with the request propagation strengths of a decentralized solution.

In large part due to the renewed utility of centralized search methods within hybrid solutions, this project sets out to explore how centralized search and retrieval is accomplished and, if need be, where it can be improved. The four primary system features considered in this project are:

Node Discovery. One of the major challenges in peer-to-peer systems design is the discovery of nodes on the network. Each individual node needs a reliable method for discovering and handshaking with every other node. Additionally, information about the various nodes on the network needs to be stored in some manner. In an indexing system such as this one, a central server is used to maintain information about the nodes on the network. Each individual node must log onto the network, register with the central server, and query the server's registry to discover other nodes on the network.

Content Discovery. Content discovery is the location of files on the network. In decentralized systems, this can happen directly from one peer to another. In our system, the directory server acts as the adjudicating element in the network, directing the local client's requests. Specifically, requesting peers are given IP references to the nodes on the network that hold the requested files. Of interest to me in this project is the time it takes between content request and node discovery.

Content Delivery. Once the server, remote hosts, and content location are discovered, there needs to be an efficient mechanism for delivering files over the network. In many peer-to-peer systems, this piece of the puzzle is key to the overall success of the design. For centralized architectures in particular, content delivery can become a bottleneck as the number of nodes on the network grows. The more hosts requesting content, the more likely it is that the single directory server is unavailable to reply to requests.

Security. Security should never be overlooked when designing any networked system. Security is especially important in peer-to-peer networks, where both the volume of content and the number of network nodes can be quite large. Common security problems such as viruses, encryption cracking, bandwidth clogging, internal and external network attacks, and eavesdropping are all concerns when designing such a system. Additionally, as the number of nodes on the peer-to-peer network increases, so does the system's overall vulnerability to security breaches. For our centralized application of peer to peer, I have decided to implement a basic example of Secure Sockets Layer (SSL) from the standard Java SDK.
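The node and content discovery features described above can be sketched as a small in-memory index. This is a minimal illustration under stated assumptions, not the project's own code: the class name CentralDirectory and its methods (register, advertise, deregister, nodes, locate) are hypothetical, nodes are identified by plain address strings, and the addresses in main are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory central index: which node advertises which files.
final class CentralDirectory {
    // node address -> names of the files that node advertises
    private final Map<String, Set<String>> filesByNode = new LinkedHashMap<>();

    // A node logs on and registers with the directory.
    synchronized void register(String nodeAddress) {
        filesByNode.putIfAbsent(nodeAddress, new LinkedHashSet<>());
    }

    // A node advertises a file it is willing to share (registering it if needed).
    synchronized void advertise(String nodeAddress, String fileName) {
        register(nodeAddress);
        filesByNode.get(nodeAddress).add(fileName);
    }

    // A node leaves the network; its files are no longer discoverable.
    synchronized void deregister(String nodeAddress) {
        filesByNode.remove(nodeAddress);
    }

    // Node discovery: every node currently registered.
    synchronized List<String> nodes() {
        return new ArrayList<>(filesByNode.keySet());
    }

    // Content discovery: addresses of the nodes advertising the given file.
    synchronized List<String> locate(String fileName) {
        List<String> holders = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : filesByNode.entrySet()) {
            if (entry.getValue().contains(fileName)) {
                holders.add(entry.getKey());
            }
        }
        return holders;
    }

    public static void main(String[] args) {
        CentralDirectory directory = new CentralDirectory();
        // Made-up addresses, for illustration only.
        directory.advertise("192.168.0.2", "A_Tale_of_Two_Cities.txt");
        directory.advertise("192.168.0.3", "A_Tale_of_Two_Cities.txt");
        directory.deregister("192.168.0.3");
        System.out.println(directory.nodes());
        System.out.println(directory.locate("A_Tale_of_Two_Cities.txt"));
    }
}
```

Because every lookup goes through this one object, the sketch also makes the single point of failure visible: if the directory is unavailable, no node or file can be discovered.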


1.3 Subject Overview and Terminology

At the heart of any discussion of efficient network design is overall topology, or how to best connect the nodes within a group. For a centralized system such as the one I considered, the topology is considered in terms of the information flow of the network as a whole. The nodes in the graph are the peers, and links (or edges) between peers indicate a regular sharing of information. For the network to be truly effective, the nodes should be able to use the edges to share information without unnecessarily loading the network.

How the network is designed determines how information is shared. Below are

definitions of the most common network models in use today:

1. Centralized. The architecture considered in this diploma project. Centralized client/server systems are currently the most popular form of network, with a central server adjudicating among its client peers. Examples include web servers, databases, SETI@Home, and Napster.

2. Ring. A common method for scaling centralized services is to use a cluster of machines arranged in a ring to act as a distributed server. Communication between servers coordinates the sharing of the system state. This establishes a group of nodes that provide identical function to a single server but incorporate redundancy and load balancing capabilities. Typically, ring systems consist of machines that are nearby on the network and owned by a single organization.

3. Hierarchical. DNS is an example of such a system. In hierarchical systems, authority flows from the root name servers to the server for the registered name and downward. Usenet is another example of a large hierarchical system.

4. Decentralized. Popularly regarded as true peer-to-peer computing, decentralized systems communicate symmetrically, with each node taking on the responsibilities of both client and server. Popular examples are Gnutella and FreeNet.

Each of the above system architectures dramatically affects the overall success and usability of the network. In our centralized example, numerous environmental factors can influence the overall stability and effectiveness of the system. For example, resources may suddenly become unavailable if a user decides to disconnect from the network or power off a machine. The number of users and their volume of use is, of course, a key concern. There are also random events such as connectivity failures, hackers, and viruses (not considered in this project) that can influence the system's performance.


2 Preparation

This chapter details the requirements gathering and design phase prior to algorithm design and data collection. In this phase, a waterfall-type method was employed, consisting of evaluating the project's requirements, building and updating use cases, planning the final system design, and finally implementing, debugging, and testing it.

2.1 Resources

2.1.1 Hardware

The only significant hardware requirement of this project was access to a small network. After considering setting up a number of Linux boxes in Concroft, it was decided to opt for the convenience of setting up a small network environment within my college room. This was done by networking my laptop directly to a rented PC.

2.1.2 Protocol

There are a great many file-sharing protocols on the market today. Each reflects its own particular flavor of file sharing. For example, the Gnutella network's decentralized approach is, in fact, its own protocol and can be developed for accordingly. When I originally considered protocols with my supervisor, we opted to go with the relatively new and much-discussed JXTA API provided by Sun Microsystems. This decision was based entirely on our desire to explore a new application of peer-to-peer networking with a large development community supporting it. Later in the project, after learning a good deal about the JXTA API, we decided to move to Remote Method Invocation (RMI) as the protocol of choice. We did this to be able to dig deeper than the JXTA solution would have allowed.

2.2 Planning & Documentation

A good deal of design-time effort went into this project. Requirements analysis was made up of several sections, each defining a particular functionality of the system. The project used the following planning constructs: Analysis (the textual what, how, and why of planning for development), Use Case Diagrams (graphical descriptions of how the user interacts with the system), Sequence Diagrams (the step-by-step process of the system), and Class Diagrams (the classes and their methods). Other diagrams such as State, Activity, Collaboration, and Deployment were considered but deemed overkill and, in many cases, redundant.


2.2.1 Analysis

The analysis phase is used as a top-level exploration of the project as a whole. It follows a simple question and answer format and is intended to touch on every aspect of the system, from hardware and software requirements to how users collaborate. Below is a piece of the initial Analysis phase Q and A.

Q: What is the intended purpose of this project?

A: To build a simple example of a peer-to-peer system. In this case, a central directory server will be used (similar to Napster's model).

Q: What are the particular inputs of the system?

A: The user should have a way of connecting to the central server and registering with the network. Additionally, the user should be able to retrieve information on network nodes and file contents. Finally, methods for searching for and retrieving files should be designed.

Q: What software is required?

A: For this project, the j2sdk1.4.2 was used in addition to Sparx Systems UML tools. Additionally, RMI was used for remote networking. RMI is included in the Java Development Kit.

Q: What hardware is required?

A: For realistic client/server interaction, two computers would be nice but much can be done on a single computer.

Q: How many users of the system will there be?

A: For this system, a single directory server and three remote clients.

Q:

Figure 1: An example of the question and answer exercises in the planning and documentation period. A number of key outputs of this exercise helped us get a better sense of what needed to be addressed in sequence diagrams and use cases.

2.2.2 Use Case Diagrams

Use case diagrams were designed to describe how the individual user interacts with the system. This exercise helped us to answer questions raised in the analysis phase, such as "what are the desired inputs?", "how is a file advertised?", and "what is being accomplished?". It became clear early on that one of the key issues was going to be how Content Delivery would occur. Mechanisms such as I/O streams, TCP/IP, and UDP would have made the job easier but, since this is an RMI project, I had to rely on remote object calls. This issue is further explored in the Implementation section below.


Figure 2: User view of the system. This use case describes the main inputs required of a user on the system. Later, these inputs will be translated into classes and methods.

2.2.3 Sequence Diagrams

At this stage of the project's documentation, the overall architecture of the project started to take shape. When I discussed different design alternatives for this project with my supervisor, we concentrated on determining the factors most important to peer location, content location, content delivery, and security. Several early architecture ideas were designed and eventually dropped. The main issue we initially struggled with was the choice of protocol. Sun Microsystems's JXTA seemed like a good option initially but, once we decided to move away from a decentralized architecture to a directory server, it was no longer relevant.

Pseudo code was designed in this stage for both the client and server implementations. The pseudo code underwent numerous revisions as the project progressed. Below is a version of the client implementation:

Pseudo Code for Client Implementation

if the directory server accepts a new connection
    register client with the server/network;
    update the directory;
    initialize the peer group;
else if the server refuses to connect
    attempt to establish a connection n times;
    else fail;

if there is input into the user interface
    search clients for files matching entered string;
else
    download files from clients;


Figure 3: The pseudo code for how the client is intended to interact with the system. A number of such pseudo code examples were designed and updated later, informing sequence diagrams and class and method design.
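The connection branch of the pseudo code (attempt to connect n times, else fail) could be written along the following lines. This is a sketch only: RetryingConnector, connectWithRetries, and the pause between attempts are hypothetical names and choices, and the "server" in main is a stand-in that refuses the first two attempts.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: try an operation up to maxAttempts times, then fail.
final class RetryingConnector {

    static <T> T connectWithRetries(Callable<T> attempt, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();          // the server accepted the connection
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw e;                        // do not retry after an interrupt
            } catch (Exception e) {
                last = e;                       // the server refused; remember why
                if (i < maxAttempts) {
                    Thread.sleep(pauseMillis);  // brief pause before the next attempt
                }
            }
        }
        throw last;                             // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for a directory server that refuses the first two attempts.
        String result = connectWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("connection refused");
            }
            return "registered";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Rethrowing the last failure, rather than a generic error, keeps the reason for the refused connection available to the caller.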

Both the Use Case Diagrams and the pseudo code laid a good foundation for the development of the several sequence diagrams that were built to help us describe the steps that would be taken during the exchange between client and server. A good example of the flow of events can be seen in Figure 4.


Figure 4. Sequence flow of the central indexing architecture. The local host queries the remote directory server for the location of a given piece of content. The server replies with the IP location of the node containing the text file. Local host then handshakes with the remote host to receive the content.


3 Implementation

The work completed is outlined in the following sections. The implementation of the RMI client/server was informed by the use cases set up in the requirements. Classes and their associated attributes and procedures were designed based on the information gathered during the planning and requirements gathering phase. A first draft of the Server and Client implementations was designed. Because RMI transparently accomplishes a good deal of the work involved in setting up a file-sharing system, much of the design-time effort was straightforward.

I make a few assumptions during implementation. I assume that only one peer is accessing the server at a time and that queries to remote methods are not happening concurrently. I also assume that the same is true for similar activity on the system such as file requests and downloads. Finally, I have designed a small network and assume the reader understands that the results will not be the same should the number of nodes or overall usage increase.

3.2 Classes and Methods

Classes and methods were designed during the implementation phase to transport method calls from the local GUI to the remote server. When I designed the methods in my client and server classes, they essentially followed the conventions below:

1. Derive an interface from java.rmi.Remote that contains the methods to be made available to RMI clients.

2. Define a class that extends the appropriate subclass of java.rmi.server.RemoteServer. In our case, this class is UnicastRemoteObject.

3. Implement the derived interface in the derived class.

4. Use javac to create class files.

5. Create stub and skeleton classes with the JDK rmic utility, and make the stub classes available to the client and servers.

6. Start the RMI registry on the local machine.

7. Start the main application, which should instantiate the RMI server class and register it with the local registry.
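The conventions above can be illustrated with a minimal, single-process sketch. It is not the project's code: Greeter, GreeterImpl, and RmiDemo are hypothetical names, the registry is created inside the same JVM with LocateRegistry.createRegistry instead of running the separate rmiregistry tool, and the stub is bound explicitly because the registry is local. On current JDKs stubs are generated at run time, so the rmic step is not needed.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.RemoteObject;
import java.rmi.server.UnicastRemoteObject;

// A remote interface listing the methods offered to RMI clients.
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

// An implementation class that extends UnicastRemoteObject.
class GreeterImpl extends UnicastRemoteObject implements Greeter {
    GreeterImpl() throws RemoteException {
        super(); // exports this object on an anonymous port
    }

    @Override
    public String greet(String name) {
        return "hello, " + name;
    }
}

final class RmiDemo {
    // Starts a registry in this JVM, binds the server, and calls it through its stub.
    static String roundTrip(String name) throws Exception {
        // Assumption: everything runs on one machine, so advertise the loopback address.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");
        Registry registry = LocateRegistry.createRegistry(0); // 0 = any free port
        GreeterImpl server = new GreeterImpl();
        try {
            // Bind the stub explicitly because this registry lives in the same JVM.
            registry.rebind("greeter", RemoteObject.toStub(server));
            Greeter stub = (Greeter) registry.lookup("greeter");
            return stub.greet(name); // travels over a local socket to the server object
        } finally {
            // Unexport both objects so the JVM can exit.
            UnicastRemoteObject.unexportObject(server, true);
            UnicastRemoteObject.unexportObject(registry, true);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("peer"));
    }
}
```

With a separate registry process, a client on another machine would instead obtain the registry with LocateRegistry.getRegistry(host, port) and then call lookup in the same way.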

Originally, the system was designed with a command-line interface, but a GUI was added later in the development process. On the server side, Start() and Stop() methods were designed to allow clients to register and deregister with the server. As information flows to and from the server, an updateDir() method updates the directory of files.


Figure 5. Design of the system's core class diagrams.

The startPeers() method is provided as a callback from the server to the requesting client. The server calls this method back on the client and passes it references to the other peers on the network. Once this happens, the receiving peer holds the node information in an array and searches this information when necessary.

The generic actionPerformed() method is used here to handle all the inputs from the GUI. The possible inputs are:

Search Nodes on the Network. To do this, the client user enters the getPeers string in the command line. This puts a call to the server to get a new set of peers for the client.

Search Files on the Network. To do this, the client calls the remote file search method on all peers in the client's peer array and puts the results in the list object of the graphical interface.

Download a Located File. To do this, the client user calls a getFile method on the peer it needs to receive the file from. The results are written to a byte array on the local client.
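The file search over the peer array can be sketched as follows. The FilePeer interface and the PeerSearch class are hypothetical stand-ins: in the project each peer would be an RMI stub, whereas here a peer is a small in-memory object so that the example runs on its own. Matching is assumed to be a case-insensitive substring test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a remote peer; in the project this would be an RMI stub.
interface FilePeer {
    String host();
    List<String> listFiles();
}

final class PeerSearch {
    // Asks every peer in the array for its file list and collects "host:file" matches.
    static List<String> search(FilePeer[] peers, String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (FilePeer peer : peers) {
            if (peer == null) {
                continue; // unused slot in the peer array
            }
            for (String name : peer.listFiles()) {
                if (name.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(peer.host() + ":" + name);
                }
            }
        }
        return hits;
    }

    // Small in-memory peer used only for the demonstration.
    static FilePeer peer(String host, String... files) {
        List<String> list = Arrays.asList(files);
        return new FilePeer() {
            public String host() { return host; }
            public List<String> listFiles() { return list; }
        };
    }

    public static void main(String[] args) {
        FilePeer[] peers = {
            peer("clientA", "A_Tale_of_Two_Cities.txt", "notes.txt"),
            null,
            peer("clientC", "tale_summary.txt")
        };
        System.out.println(search(peers, "tale"));
    }
}
```

The returned strings could then be placed in the list object of the graphical interface, as described above.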


Additional methods are used to return basic information to the requesting client. They include writeFile, which writes the contents of the byte array to a file. The getFile method is called on the remote client to return the contents of the requested file as a byte array. In other words, if a client wants to download a file called A_Tale_of_Two_Cities.txt, it must call that method on the providing client.

3.2 The Client

After the classes were designed and revised, the next step was to implement the core code for the local client. The client design is declared in a public interface for remote objects: the interface extends java.rmi.Remote and its methods are declared to throw RemoteException.

As each client joins and leaves the network, it calls the methods declared in the Client interface. Additionally, the filenames and fileSizes arrays were added to store the names and sizes of the files located on the other clients on the network. The interface takes the form:

package network;

import java.rmi.Remote;
import java.rmi.RemoteException;

public interface Client extends Remote {

    // Names and sizes of known files (interface fields are implicitly static and final).
    String[] filenames = new String[99];
    long[] fileSizes = new long[99];

    // Receives references to the other peers on the network.
    void initPeers(Client clientA, Client clientB, Client clientC) throws RemoteException;

    // Node information.
    void getHost() throws RemoteException;
    void getIP(String searchString) throws RemoteException;

    // File listing information.
    void listFile(int i) throws RemoteException;
    void getNumFiles() throws RemoteException;

    // File transfer: getFile returns the file's bytes, writeFile stores them locally.
    void writeFile(String filename) throws RemoteException;
    byte[] getFile(String filename) throws RemoteException;
}

Figure 6. The public interface for the client.

The purpose of the Remote interface is to mark derived interfaces, such as Client, whose methods are to be exported through RMI. These method calls were designed to find out as much as possible about the surrounding network environment, most importantly information about the location of other nodes (getIP() and getHost()) and information about the files every node has in its local directory. The latter is obtained via the listFile and getNumFiles methods. Finally, the methods used to download a file are getFile and writeFile. These last two methods later proved to be somewhat problematic, in particular the byte array returned by getFile. I will discuss the reasons and the solution further down in the paper.

Once the public Client interface was designed and tested, the main client implementation class ClientImpl was coded. This class contains the actual logic of the designed methods and procedures in addition to the relationship to the graphical user interface. The core functionality of this class is its ability to register each new client with the network and ultimately make its information available to the other clients. Other key algorithms involve the deregistering of the client and the method that updates the array of file names and sizes on the current listing. This is the updateDir() method.

This file is too large to reproduce here, but one method of note is the getFile method, which reads the file named by filename into an array of bytes. The contents of the file are held in memory in a temporary byte array, which is later written to a new file on the local disk of the requesting node. The essence of this process is detailed in the steps below:

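// Requires java.io.* (File, FileInputStream, DataInputStream, IOException).
public byte[] getFile(String filename) throws java.rmi.RemoteException {
    File inputFile = new File(filename);
    // Allocate a buffer the size of the file (assumes the file fits in one array).
    byte[] contents1 = new byte[(int) inputFile.length()];
    try (DataInputStream in = new DataInputStream(new FileInputStream(inputFile))) {
        // Read the whole file into the buffer.
        in.readFully(contents1);
    } catch (IOException e) {
        throw new java.rmi.RemoteException("Could not read " + filename, e);
    }
    return contents1;
}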

Figure 7. Design of the system's core class diagrams.
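The two halves of a download, reading a file into a byte array on the providing node and writing that array to disk on the requesting node, can be tried locally with the sketch below. It is an illustration under assumptions rather than the project's implementation: FileTransfer is a hypothetical class, the methods take java.nio.file.Path arguments instead of the filename strings used in the Client interface, and no network call is involved.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileTransfer {
    // Providing side: load the whole file into a byte array.
    static byte[] getFile(Path source) throws IOException {
        return Files.readAllBytes(source);
    }

    // Requesting side: write the received bytes to a local file.
    static void writeFile(Path target, byte[] contents) throws IOException {
        Files.write(target, contents);
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("shared", ".txt");
        Path copy = Files.createTempFile("received", ".txt");
        try {
            Files.write(source, "It was the best of times".getBytes(StandardCharsets.UTF_8));
            writeFile(copy, getFile(source));
            System.out.println(Files.size(copy) + " bytes copied");
        } finally {
            // Clean up the temporary files created for the demonstration.
            Files.deleteIfExists(source);
            Files.deleteIfExists(copy);
        }
    }
}
```

Holding an entire file in one array means that memory use on both nodes grows with the size of the file, which is worth keeping in mind when large files are transferred.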

3.3. The Server

The remote extension of the server class is, by comparison, not as complex as the design of the client classes. Essentially, the server was designed to do the following things: register and deregister clients, store information about client activity, and give information to requesting clients as need be. Accordingly, the Server class's methods are register, deregister, and givePeers.

In the implementation of the Server class, a Vector was provided that keeps information about client activity. When local clients query the server, the server searches its Vector to provide information about which clients are currently active on the network. The code for this functionality takes the form:

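// Requires java.util.Vector and java.rmi.RemoteException.
// Holds a reference to every client currently registered with the server.
protected Vector<Client> clients;

public ServerImpl() throws RemoteException {
    clients = new Vector<>();
}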
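How register, deregister, and givePeers might operate on such a Vector is sketched below. The PeerRegistry class is hypothetical and is kept free of RMI so that it runs on its own; the type parameter C stands for whatever handle the server keeps for a client (a Client stub in the project, a plain string in the demonstration).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical, RMI-free sketch of the server's bookkeeping.
final class PeerRegistry<C> {
    private final Vector<C> clients = new Vector<>();

    // Adds a client; returns false if it was already registered.
    synchronized boolean register(C client) {
        if (clients.contains(client)) {
            return false;
        }
        clients.add(client);
        return true;
    }

    // Removes a client; returns false if it was not registered.
    synchronized boolean deregister(C client) {
        return clients.remove(client);
    }

    // Returns every registered client except the one asking.
    synchronized List<C> givePeers(C requester) {
        List<C> peers = new ArrayList<>(clients);
        peers.remove(requester);
        return peers;
    }

    public static void main(String[] args) {
        PeerRegistry<String> registry = new PeerRegistry<>();
        registry.register("clientA");
        registry.register("clientB");
        registry.register("clientC");
        registry.deregister("clientB");
        System.out.println(registry.givePeers("clientA"));
    }
}
```

The methods are synchronized so that the check in register and the copy in givePeers each see a consistent list, even though the project itself assumes one client at a time.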

Additional efforts were made to determine how long objects were taking to execute on the server side. The two key operations for the remote timing of events were the initialization of a timer at the local node when a call is made on the server, and the termination of that timer once the result is returned from the remote server. The data from this operation helped me gather more information about how the network was behaving under different conditions, such as increased traffic or the transmission of different file sizes.

The server code that tests the start and stop time of a remote call should be distinguished from the Timer package, which attempts to gauge the length of remote execution. To do this, the local Java code makes a call on the remote object being implemented and returns a date and time in association with the object being acted on. An example of one of the classes in this package takes the form:

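import java.rmi.RemoteException;
import java.util.Date;

public class Timer implements Runnable {

    // Remote object that receives the time stamps.
    TimeMonitor tm;

    public Timer(TimeMonitor tm) {
        this.tm = tm;
    }

    public void start() {
        // start(), not run(), so that the loop executes on its own thread.
        new Thread(this).start();
    }

    public void run() {
        while (true) {
            try {
                Thread.sleep(10000); // wait ten seconds between reports
            } catch (InterruptedException x) {
                Thread.currentThread().interrupt();
                return; // stop reporting when interrupted
            }
            if (this.tm != null) {
                try {
                    this.tm.tellMeTheTime(new Date());
                } catch (RemoteException x) {
                    // The monitor is unreachable; try again on the next tick.
                }
            }
        }
    }
}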

Figure 8. The above class example remotely queries the server and returns

a time stamp.

The purpose of this code (together with its related classes) is to remotely start a timer for each new thread that is executing on the server. When the thread has completed its execution, the current system time is returned as a result. In this way I get a sense of how long the various methods take to execute on the server.
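The round-trip measurement described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's Timer package: the class CallTimer and its methods timeMillis and averageMillis are hypothetical names, and a plain Callable stands in for the remote method invocation.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of the project): times a call in milliseconds.
final class CallTimer {

    // Runs the call once and returns the elapsed wall-clock time.
    // In the project the call would be a remote invocation; here any
    // Callable can stand in for it.
    static double timeMillis(Callable<?> call) throws Exception {
        long start = System.nanoTime();  // timer starts on the local node
        call.call();                     // stand-in for the remote call
        long end = System.nanoTime();    // timer stops once the result is back
        return (end - start) / 1_000_000.0;
    }

    // Averages the elapsed time over several repetitions of the same call.
    static double averageMillis(Callable<?> call, int runs) throws Exception {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        double total = 0.0;
        for (int i = 0; i < runs; i++) {
            total += timeMillis(call);
        }
        return total / runs;
    }

    public static void main(String[] args) throws Exception {
        // A 50 ms sleep stands in for a slow remote method.
        double avg = averageMillis(() -> { Thread.sleep(50); return null; }, 5);
        System.out.println("average wait (ms): " + avg);
    }
}
```

Timing on the calling side includes network latency as well as server execution time, which is the quantity a waiting user actually experiences.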

3.4 Security

Security in networking, and particularly in large peer-to-peer applications, is an important topic. Because this diploma project is about the effective sharing of resources, there is a lot of potential to deliver harmful material to any of the nodes on the network. In larger networks this topic of research is crucial to the vitality of the system.

In my particular RMI client/server program, the intention was to decrease the flexibility enjoyed by local clients when invoking remote classes. Otherwise, any client program could run any server object, some of which could be potentially harmful to the network. Of the solutions provided by RMI, the one I decided to use in my implementation was to install a security manager. Without a security manager, no restrictions are placed on how remote objects are accessed and by whom.

I used the java.rmi.RMISecurityManager class to instantiate the security manager with the statement below:

if(System.getSecurityManager() == null) {

System.setSecurityManager(new RMISecurityManager());

}

In addition to the above lines of code, the Java SDK that I was using for this project required that a security policy file be specified at runtime. This is done by defining the java.security.policy property (with no spaces around the equals sign):

java -Djava.security.policy=mypolicy

In order to access remote objects on the system, Java looks for a system-wide policy file in its runtime library. It also looks for a local policy file in the home directory of each requesting client. A sample policy file that grants full access permissions to everyone looks like:

grant {

permission java.security.AllPermission;

};

Through the use of its local policy file, each client on the network can grant permissions to each other node on the network. This exchange is made possible by the Permission classes in the java.security package, which provide access grants to specific resources.


4 Evaluation

4.1 Data Collection

Presented in this project are the results of three experiments. In all cases, the intent is to learn something about the strengths and weaknesses of centralized search methods. To do this effectively, I use the algorithms designed in a Time package that help to gauge how long an action takes from the time it is initiated to the time a response is returned from the server. Data is collected from performance tests on each of the three areas of interest mentioned above: node discovery, content discovery, and content download. In each case, the load placed on the system is that of 50 simultaneous queries issued by each of the three clients.

System tests were completed over a period of one week. The tests were not run back to back, as research has suggested that gathering data consecutively can cause results to vary significantly.

Later in the testing phase, results were compiled and explored.

4.2 Node Discovery

Of primary interest when implementing the Timer class is the location of additional peers on the network. As each new node comes onto the network, it registers with the server. The server's chief task is to keep track of all the nodes currently logged onto the system and give references to requesting peers. To keep information updated, any given node can, at any time, contact the server to request an updated list of information about the other nodes.

When a new client comes onto the network, it calls the startPeers method and remotely invokes the server's givePeers method. Information about the other nodes on the network is then loaded into a local array. The initialization method on the client side takes the form:

public void getPeers (Client clientA, Client clientB, Client clientC)
{
    // store the host of each peer handed back by the server
    clients[0] = clientA.getHost();
    clients[1] = clientB.getHost();
    clients[2] = clientC.getHost();
}

Figure 9. The startPeers method on the client side invokes the remote givePeers method on the server. A resulting list of connected nodes is delivered to the client.


It is the first objective of this project to study the effectiveness of this peer discovery interplay between client and server. Data is collected by timing the initialization of the getPeers method, the remote invocation of the givePeers method, and the final response from the server. The sequence of events takes the form:

> getPeers

client sent time

Time: Tues Aug 02 12:11:09 CST 2003

server received time


client returned time


Figure 10. Functioning of the timing sequences. Commands issued on the local client, the execution of remote methods, and returned results are all time stamped to issue data points for further exploration.

As Figures 11 and 12 indicate, I tested the getPeers method call in an environment where only one client was querying the server and then in an environment where three clients were simultaneously querying the server. In the case of Figure 11, the overall average of the combined data points for each client was 1.562. In Figure 12, the single client's average was 1.262. While it wasn't surprising that the increased load in Figure 11 resulted in a larger overall average, I did expect the numbers to be further apart.

A second observation was the difference in fluidity between the two figures. In Figure 11, a somewhat erratic behavior is observed that is much less (or not at all) present in its counterpart. My guess is that this can be attributed to the three clients competing for the same method invocations on the remote server. This touches on the issues inherent in concurrent systems programming mentioned earlier.

For me, the question that Figure 11 raises is whether or not RMI is thread-safe. There are a lot of possibilities here. One such possibility is that the connections are being pooled in such a way that only one is being used by an outstanding remote call at a time. Just because the stub never modifies any instance data does not mean that concurrent calls writing to the same socket will marshal correctly. Another possible explanation is the actual activation of the remote objects. In Sun's documentation it was unclear to me how to tell whether a remote object is in an active or passive state when being accessed. Without clarity here, it is possible that the graph below reflects multiple threads trying to spawn multiple processes for the same activation group, in this case the givePeers method.

Figure 11: Wait time for locating remote connected nodes.

Figure 12: Wait time for locating remote connected nodes. One client.

4.3 Content Discovery

The way in which content is stored and advertised on a network can dramatically influence the effectiveness of its associated search methods. The advantage of a system with a centralized directory is that it is possible to quickly gain access to information about which nodes contain which files. Systems such as KaZaA use this fact to their advantage by combining a fast directory lookup node, or supernode, with the propagation power of decentralized systems.

Fast access to content references initially happens when the client registers with the remote server. As each new node registers and deregisters with the remote server, the updateDir() method updates the arrays of file names and sizes with the current information. The local client then stores that information as arrays in its local directory:

String[] fileNames = new String[99]; //stores file list
long[] fileSizes = new long[99]; //stores file sizes

The metrics I used to explore the content location qualities of centralized systems are outlined in the charts below. Two approaches were designed. In the first test, each peer simultaneously searches for a different file on the network. In the second test, each peer simultaneously searches for the same file on the network. The focus of these two tests is, in general, to gauge the overall time it takes to locate a file on the network and how well file location performs under increased load conditions.
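To make the idea of a centralized directory concrete, here is a minimal sketch of a lookup table from file names to the hosts advertising them. It is an illustration only: the class CentralDirectory and its methods register, deregister and locate are hypothetical and are not the project's ServerImpl or updateDir() code, and the host and file names in main are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical central directory: maps each file name to the hosts advertising it.
final class CentralDirectory {

    private final Map<String, List<String>> hostsByFile = new HashMap<>();

    // Called when a client registers: record every file it advertises.
    synchronized void register(String host, List<String> fileNames) {
        for (String name : fileNames) {
            hostsByFile.computeIfAbsent(name, k -> new ArrayList<>()).add(host);
        }
    }

    // Called when a client leaves: drop it from every file's host list.
    synchronized void deregister(String host) {
        for (List<String> hosts : hostsByFile.values()) {
            hosts.removeIf(h -> h.equals(host));
        }
        hostsByFile.values().removeIf(List::isEmpty);
    }

    // One map lookup answers "which nodes hold this file?".
    synchronized List<String> locate(String fileName) {
        List<String> hosts = hostsByFile.get(fileName);
        if (hosts == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hosts); // copy, so callers cannot alter the directory
    }

    public static void main(String[] args) {
        CentralDirectory directory = new CentralDirectory();
        directory.register("clientA", Arrays.asList("notes.txt", "shared.txt"));
        directory.register("clientB", Arrays.asList("shared.txt"));
        System.out.println(directory.locate("shared.txt"));
    }
}
```

Because the directory answers from a single in-memory map, the cost of a lookup does not depend on how many peers are connected, which is the property the two tests above set out to measure.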

Figure 13: Wait time for multiple peers requesting the same file.

Figure 14: Wait time for multiple peers requesting multiple files.

As Figures 13 and 14 show, there is no significant difference between multiple clients accessing the same file and multiple clients accessing multiple files. I did expect to see some variation in the results. There are, however, some notable spikes in Figure 13's activity. Whether or not I can attribute these small increases to issues such as thread safety or the remote activation of objects is hard to say, although I doubt it. It is more likely that these spikes are due to slight variations in the results. As a side note, it is important to point out that, of the files searched, the majority of the desirable files (likely the ones I was querying most often) were located on a single client. In terms of wait time resulting from competition for resources, this observation (on a small scale, anyway) doesn't seem to have much effect on the system's activity.

A second test was conducted to rank the expected results from content searches based on the popularity of the content. Unfortunately, I don't have the luxury of a large network used by a diverse group of users with varied interests to get a truly random sampling of how popularity may influence usage of the network. As a next best solution, I gave each file on the network a popularity ranking. This was done by assigning values from 1 to 10 (10 being the highest) to each of the ten files on each of the three clients being tested. As the remote method invocations request files at random, the final results are intended to give us an idea of where traffic might be directed within the network. The results of this test were:

Client A     Rank  Hits   Client B       Rank  Hits   Client C      Rank  Hits
Timer.java   10    x      Crossley.txt   10           Hayden.txt    10
Server.java  9     xxx    Hello.c        9     x      Client.java   9
Bio.txt      8     xxxxx  Resume.txt     8     xxx    Letter.txt    8     x
Memo.doc     7     xxx    Dad.doc        7     xx     Crypt.java    7
Summary.txt  6     xxx    Itinerary.txt  6     xx     ToDo.txt      6     xx
Funny.txt    5     xx     FindIt.html    5     xx     Flight.txt    5     xx
RMI.html     4     xx     NIHA.txt       4     x      Monitor.java  4     x
NMH.html     3     x      Dickens.txt    3     xxx    eCOS.html     3     x
Sam.doc      2            Stream.java    2     x      JXTA.txt      2     xx
Mom.doc      1            Jill.doc       1            Columbia.txt  1     x


4.4 Content Delivery

This section of my diploma project explores how content delivery behaves in a centralized network environment. Namely, I evaluate how information flows from one node to the next. To get a sense of how efficient this type of information exchange is in our small network, the two tests I used evaluated 1) multiple downloads of various small files (under 2 MB in size) happening at the same time and 2) multiple downloads of a single large file (20 MB in size).

Figure 13: Wait time for multiple peers requesting small files.

Figure 14: Wait time for multiple peers requesting a large file.

Once a file was located on the network, the actual transmission from one node to the next proved to be a bit more complicated than I anticipated. After researching solutions such as TCP/IP and IO Streams, it seemed that the best method for RMI file transfer was to read the file's contents into an array of bytes on the remote client. To do this, I followed this sequence of events:

1) Instantiate the remote object.
2) Open the file and get its size.
3) Allocate the byte array and read the file into that array.
4) Copy the file name.

Once these steps were accomplished, the remote file object could be transferred by calling the getFile method on the remote client. This method call fills the local client's array of bytes with the bytes from the remote file. Then, the local client calls the writeFile method, which writes the contents of the byte array to a file.

As the large-file chart above shows, initial large-file transfer tests yielded poor results. In most cases, the transfer of a large file proved too memory intensive and the system simply hung. After exploring this problem, I discovered that this wasn't a shortcoming of the centralized architecture design but of how the writeFile method call was writing bytes into the local array. The (short-term) remedy was to cut the file into 5 MB byte arrays and transfer them as a sequential series, reassembling the file on the receiving end.
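The split-and-reassemble remedy can be sketched as below. This is an illustration under stated assumptions rather than the project's getFile and writeFile code: the class ChunkedTransfer and its methods split and reassemble are hypothetical, the chunk size is a parameter (the project used 5 MB), and the data is held in memory purely to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: cuts a byte array into fixed-size pieces and rebuilds it.
final class ChunkedTransfer {

    // Splits data into consecutive chunks of at most chunkSize bytes.
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int length = Math.min(chunkSize, data.length - offset);
            chunks.add(Arrays.copyOfRange(data, offset, offset + length));
        }
        return chunks;
    }

    // Reassembles the chunks in the order in which they were received.
    static byte[] reassemble(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = new byte[12];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i;
        }
        // A 5-byte chunk size stands in for the 5 MB used in the project.
        List<byte[]> chunks = split(original, 5);
        System.out.println("chunks sent: " + chunks.size());
        System.out.println("intact: " + Arrays.equals(original, reassemble(chunks)));
    }
}
```

In a real transfer each chunk would be read from the file and sent as its own remote call, so that neither side ever needs to hold the whole file in memory at once.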

4.5 Observing Results

The network characteristics of centralized systems were studied with peer location, content location, and file download effectiveness in mind. In the first test, I tested a single client's ability to invoke remote method calls on the server to get a listing of connected peers on the network. The metric used is the wait time between the execution of the command and the returned results. I found that the average wait time is roughly 1.562 in an environment where load is being placed on the server. In the case where one node is accessing the remote server to get peer information, the wait time is comparatively lower (1.262), as might be expected.

The ability of a local client to quickly find the location of files on the network was shown in the second exploration. In both cases (multiple nodes querying the same file and multiple nodes querying different files) the response time was immediate. As an added observation, it is interesting to note that the percentage of responses to requests maintains a high and predictable level. At no time was a request for a file rejected by the remote server. A second test addressed the popularity of content. The popularity of a particular file (or group of files on a particular node) on the network can dramatically influence how activity is distributed. In our test, a sampling taken at random shows that the overall load of the network was weighted towards Client A. In such a case, especially if the popularity of files on the surrounding peers is proportionally small, a bottleneck could occur. As the weighted results for Client A show, an effective way of evenly distributing how and when remote objects are invoked by connected clients is an important consideration when dealing with a centralized system.

Finally, the effect that file retrieval has on the system was tested. In the two cases, all three peers in the network were transferring first a relatively small file (under 2 MB) and then a larger file of 20 MB.


5 Conclusions

5.1 Further Development

There are a number of improvements I would likely make to this project, given more time. As mentioned earlier, one of the primary problems of centralized systems is that they are not as efficient in propagating requests throughout the network as their decentralized counterparts. The problem is that a remote peer cannot send unrequested data to a client doing a search. The remote peer can only send data when it is explicitly called for by the requesting client. This inflexibility of the centralized RMI approach makes it all but impossible to seamlessly share information throughout the network. To illustrate the point, consider Client A. When Client A wants a file, it makes a call to the remote server, requesting information about the other connected nodes. If Client B has the requested file, then Client A makes a direct connection. But what if the file does not exist within Client A's peer group, but is available somewhere else on the network? In an ideal situation, Client B would be able to refer the requesting client to another node on the network that does have the file. This is the essence of the JXTA API. JXTA uses the notion of Advertisements and a Peer Discovery Protocol (PDP) to fluidly locate references to information throughout the network. Advertisements are essentially messages represented as XML that make available information stored in a given peer's cache, such as other peers, peer groups, or available local or remote content. When a peer attempts to discover a particular piece of content, it searches the referring advertisements until a reference to the correct node is found. The efficient propagation methods of the JXTA protocol are not possible in an RMI environment where information exchange is a one-to-one dynamic.
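The referral idea can be illustrated with a small sketch in which each peer knows a few other peers and a request follows those references until a holder of the file is found. This is not JXTA's API or its discovery protocol; the class ReferralPeer, its fields, and the findHolder method are hypothetical names chosen for the illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical peer that can refer a requester to the peers it knows about.
final class ReferralPeer {

    final String name;
    final Set<String> files = new HashSet<>();
    final List<ReferralPeer> knownPeers = new ArrayList<>();

    ReferralPeer(String name) {
        this.name = name;
    }

    // Follows referrals breadth-first from the starting peer.
    // Returns the name of the first peer found holding the file, or null.
    static String findHolder(ReferralPeer start, String fileName) {
        Set<ReferralPeer> visited = new HashSet<>();
        Queue<ReferralPeer> toAsk = new ArrayDeque<>();
        toAsk.add(start);
        visited.add(start);
        while (!toAsk.isEmpty()) {
            ReferralPeer peer = toAsk.remove();
            if (peer.files.contains(fileName)) {
                return peer.name;
            }
            for (ReferralPeer next : peer.knownPeers) {
                if (visited.add(next)) { // add() is false for peers already asked
                    toAsk.add(next);
                }
            }
        }
        return null; // no reachable peer advertises the file
    }

    public static void main(String[] args) {
        ReferralPeer a = new ReferralPeer("A");
        ReferralPeer b = new ReferralPeer("B");
        ReferralPeer c = new ReferralPeer("C");
        a.knownPeers.add(b);   // A only knows B
        b.knownPeers.add(c);   // B can refer requests on to C
        c.files.add("JXTA.txt");
        System.out.println(findHolder(a, "JXTA.txt"));
    }
}
```

The visited set keeps a request from circulating forever when peers refer to each other in a loop, a concern that a one-to-one client/server exchange never has to address.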

A second improvement would address the problem of concurrency within distributed systems. The way RMI currently works, a method dispatched by the RMI runtime to a remote object implementation may or may not execute in a separate thread. The RMI runtime makes no guarantees with respect to mapping remote object invocations to threads. As a result, when an RMI server is written, any coordination between separate threads must be hand coded. This introduces a degree of complexity that, although I did not have time to address it in this diploma project, is crucial to any system that entertains the possibility of numerous simultaneous client requests (such as a multi-user file-sharing system).
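As a sketch of what such hand-coded thread safety might look like, the class below guards a client registry so that adding a client and reading back its index happen as one atomic step. SafeRegistry and its methods are hypothetical and are not the project's ServerImpl; plain strings stand in for remote Client references.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical registry whose methods are safe to call from several threads at once.
final class SafeRegistry {

    private final List<String> clients = new ArrayList<>();

    // synchronized makes "add, then read the size" a single atomic step,
    // so two simultaneous registrations can never receive the same index.
    synchronized int register(String host) {
        clients.add(host);
        return clients.size() - 1;
    }

    // Returns a copy so callers can iterate without holding the lock.
    synchronized List<String> snapshot() {
        return new ArrayList<>(clients);
    }

    public static void main(String[] args) throws InterruptedException {
        SafeRegistry registry = new SafeRegistry();
        Thread[] workers = new Thread[3];
        for (int i = 0; i < workers.length; i++) {
            final String host = "client" + i;
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 50; j++) {
                    registry.register(host);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(registry.snapshot().size()); // prints 150
    }
}
```

A Vector synchronizes each individual call, but a sequence such as addElement followed by size() can still be interleaved with another thread's calls, which is why the whole method is locked here.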

Another improvement I would like to add, time allowing, would be the inclusion of additional tests for each of the three subject groups. I feel that additional variations could be done on system load testing. For example, in the case of the content discovery tests, it might be interesting to explore how the system behaves when content is evenly distributed throughout the network versus unevenly located in only a few nodes. Another such improvement might be a decent stab at building a system that manages to propagate requests from node to node. Given the focus of this project, I was not able to invest much effort in finding a good solution to this problem. Nonetheless, propagation is the shortcoming of centralized systems (and the advantage of their decentralized kin), and the design of an RMI system that intelligently tackles this problem would certainly be interesting. Finally, it would definitely be useful to gauge how each of the search, discovery, and retrieval subject areas behaves as the size of the system scales. While a three-node network is useful for the purposes of an academic exploration, translation into day-to-day environments would require a more robust architecture.

5.2 Final Conclusion

Overall, the goals of this project have been accomplished. I have spent a good deal of time testing the various strengths of searching a centralized network and have found that such a network can be both powerful and powerless depending on what you are demanding of it. Centralized directory servers are a very powerful tool for providing fast references to remote locations on the network. This is certainly a valuable commodity in large, multi-node environments where multiple files are being shared. The data points in the peer discovery and content discovery sections certainly back up this finding. On the other hand, I found content delivery to be problematic for the reasons stated above. I don't feel this is the result of a centralized environment but, rather, of the shortcomings of RMI. Certainly, my earlier solution of cutting up files into segments of bytes could be replaced by sockets or some other such solution, but the issue of propagation makes RMI a poor solution for large-scale file sharing environments, centralized or decentralized.


Appendices

A. Key Code Samples

//The group of Time classes in the Time package act as an
//aid for data collection by remotely invoking the stub objects
//on the server, returning the times that objects were
//invoked.

TimeMonitorImpl.java

package Time;

import java.rmi.*;
import java.util.Date;
import java.io.Serializable;

public class TimeMonitorImpl implements TimeMonitor, Serializable
{
    // prints the time stamp handed over by the remote caller
    public void tellMeTheTime( Date d ) throws RemoteException
    {
        System.out.println("Time: " + d.toString() + "\n");
    }
}

//The below method helps register the clients with the server.

register() in ServerImpl.java

public int register (Client client) {
    String chost = "";
    try {
        chost = getClientHost();
    } catch (ServerNotActiveException ignored) {
    }
    clients.addElement (client);
    try {
        System.out.println(chost + " has registered - sharing "
            + client.getNumFiles() + " files");
        givePeers(client);
    } catch (RemoteException ignored) {
    }
    return clients.size()-1;
}

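public void givePeers (Client client) throws RemoteException {
    if (clients.size () > 1) { // Only give random clients if there is more than one
        try {
            // Indices 0..size-2 are drawn, so the most recently added
            // client is never chosen.
            int randNumber = (int)(Math.random()*(clients.size()-1));
            System.out.println("Giving new Peer "+randNumber+" to remote.");
            Client temp = (Client) clients.elementAt (randNumber);
            randNumber = (int)(Math.random()*(clients.size()-1));
            System.out.println("Giving new Peer "+randNumber+" to remote.");
            Client temp2 = (Client) clients.elementAt (randNumber);
            randNumber = (int)(Math.random()*(clients.size()-1));
            System.out.println("Giving new Peer "+randNumber+" to remote.");
            Client temp3 = (Client) clients.elementAt (randNumber);
            client.initPeers(temp, temp2, temp3);
        } catch (RemoteException ignored) {
            // catch clause lost in the source listing; reconstructed from context
        }
    } else { // First client, register with itself three times
        Client temp = (Client) clients.elementAt (0);
        try {
            client.initPeers(temp, temp, temp);
        } catch (RemoteException ignored) {
            // catch clause lost in the source listing; reconstructed from context
        }
    }
}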

//The main method of the ClientImpl class. Key methods from the
//Client class are passed via this method.

public static void main (String[] args) throws RemoteException,
        NotBoundException {
    if (args.length != 1) throw new IllegalArgumentException
        ("please enter host name");
    ClientImpl client = new ClientImpl (args[0]);
    client.start ();
}

public String listFile(int i) throws java.rmi.RemoteException {
    return this.fileNames[i];
}

public long listSize(int i) throws java.rmi.RemoteException {
    return this.fileSizes[i];
}

public int getNumFiles() throws java.rmi.RemoteException {
    return numFiles;
}

public int getTime() throws java.rmi.RemoteException {
    return getTime;
}

public String getIP() throws java.rmi.RemoteException {
    return myip;
}


B. Bibliography

[1] Brookshier, Govoni, Krishnan, & Soto, JXTA: Java P2P Programming.

[2] Gradecki, Mastering JXTA.

[3] Rosenberg & Scott, Applying Use Case Driven Object Modeling with UML.

[4] Fowler & Scott, UML Distilled, Second Edition.

[5] Kolenikov & Hatch, Building Linux Virtual Private Networks (VPNs).

[6] Nelson Minar: Distributed Systems Topologies, Parts 1 and 2

http://www.openp2p.com/pub/a/p2p/2001/12/14/topologies_one.html


C. Project Proposal

Peter Dushkin

Queens

pbd22

Diploma in Computer Science Project Proposal

A File Discovery Scheme in Decentralized Computing

December 6th, 2002

Project Originator: Peter Dushkin

Project Supervisors: Meng How Lim

Signature:

Director of Studies: Dr. Robin Walker

Signature:

Overseers: Dr. Larry Paulson & Dr. Tim Harris


Table of Contents

Introduction
Project Proposal
    I. Front end application
    II. Node software
Resources
Supervision Requirements
Phases of Development
Timetable and Milestones
    Weeks 1 and 2: Proposal Definition
    Weeks 3 to 6: Paper Network Design
    Weeks 7 to 10: Paper Software Design
    Weeks 11 to 13: Physical Network Implementation
    Weeks 14 to 20: Application Coding
    Weeks 21 to 27: Evaluation & Debugging
    Weeks 28 to 35: Evaluation & Debugging/Dissertation
    Week 36: Final Form


Introduction

The ways in which a network of computers shares files have come a long way over the past decade. The days of a small office or university network sharing documents over a dedicated LAN or WAN have rapidly given way to radically new areas of network computing. The two architectures most commonly used today are centralized and decentralized.

Centralized, or client/server, networks rely on one central server to adjudicate activity. The central computer maintains a database of files owned by computers on the network. When a computer requests a file, the request is checked by the central server against the database and, if acknowledged, a direct connection can be established between the requesting and sending computers.

The problem with a centralized network architecture is that a lot of demand is placed on the central server. As a result, the network can become quite slow due to bottlenecks. Also, should the central server experience problems or go down, the whole network is affected.

A response to these problems is decentralized computing. In this model, all of the nodes on the network act in both a client and server capacity, removing the need for a central server. This project will use a file location scheme to show the comparative advantages of decentralized over centralized computing.


Project Proposal

This diploma project will use the TCP and RMI protocols to search for files on a small decentralized network. A Java-based GUI application will be developed to serve as the primary interface to the network. It will allow the end user to ping the nodes on the network and discover information about the various computers, with the end goal of file location. Each node on the network will provide a simple query interface that enables it to receive requests and respond accordingly.

Outlined below are some possible features of the requesting and responding nodes:

I. Front end application

The front-end application will be the GUI interface to the decentralized network. Possible networking protocols involved are RMI, TCP, IP and UDP. The application's core objective is to serve as an interface for the location and discovery of files. Additionally, it should return information about individual nodes. Some of the returned information might be:

a) The IP address of each computer.
b) Network bandwidth used by nodes.
c) Network status of each computer.
d) Time in milliseconds between ping and pong.
e) The geographic location of the computers.

II. Node software

The node software will directly relate and respond to the incoming packets sent by the requesting computer. As a result, the primary responsibility of the node software is to return the appropriate information or pass the request along to the next computer. Some of the resulting class definitions should be:

Pong (to return a packet request with related information)

Retrieval of information

Download of file(s)


Figure 1: Intended Network Build

Below is the intended network set-up. I will be logging onto the suggested network through SSH.

[Network diagram. Labels: College Server; Cambridge Network; 131.111.128.110; Router; three PWF Server/Client machines (linux2.pwf.cl.cam.ac.uk, linux3.pwf.cl.cam.ac.uk, linux4.pwf.cl.cam.ac.uk); L. 1, L. 2, L. 3.]

Possible Extension of the Project

Depending on the overall development of the project, ways of decreasing bandwidth utilization may be considered. A number of peer-to-peer networking protocols have been wrestling with possible solutions to the problem of excessive network traffic. Below are some suggested solutions.

I. Pong Limiting

Pong limiting reduces the amount of traffic on the network by only returning a pong with its own address if the host is not restricted by a firewall. Moreover, only a fixed number of pongs should be returned in response to a given ping.

II. Pong Caching

A drawback of pong limiting is that it is inefficient if too many pings are being sent. A possible solution to this problem is to cache the most recent pongs and avoid the broadcast of pings. In other words, if the appropriate reply is cached, then the distance that the matching request has to travel can be significantly reduced.

III. Ping Multiplexing

The idea behind Ping Multiplexing is that when a single incoming ping reaches a node, it is "multiplexed" into numerous outgoing pings. The reverse is true for pongs (numerous pongs can be "demultiplexed" into one pong).
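Of the extensions listed above, pong caching (item II) is the easiest to illustrate in code. The sketch below keeps only the most recently used pongs and evicts the oldest when the cache is full. It is an assumption-laden illustration: the class PongCache, its methods store and lookup, and the representation of a pong as a plain string are all hypothetical, and no particular peer-to-peer protocol's message format is implied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache of the most recent pongs, keyed by the responding host.
final class PongCache {

    private final Map<String, String> recent;

    PongCache(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used
        this.recent = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // evict the least recently used pong
            }
        };
    }

    // Remembers the latest pong received from a host.
    synchronized void store(String host, String pong) {
        recent.put(host, pong);
    }

    // Returns the cached pong, or null if a fresh ping would be needed.
    synchronized String lookup(String host) {
        return recent.get(host);
    }

    public static void main(String[] args) {
        PongCache cache = new PongCache(2);
        cache.store("hostA", "pong from hostA");
        cache.store("hostB", "pong from hostB");
        cache.lookup("hostA");                    // hostA becomes most recently used
        cache.store("hostC", "pong from hostC");  // evicts hostB
        System.out.println(cache.lookup("hostB")); // prints null
    }
}
```

A node consulting such a cache before forwarding a ping only generates network traffic on a cache miss, which is the saving the proposal is after.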

Resources

I am planning on using the following:

1. Operating Systems: Linux
2. Programming Language: Java, UNIX, possibly Perl
3. Networking Protocols: TCP, IP, RMI, UDP
4. Hardware: C4 computers via SSH.
5. Additional Software: Visio for UML
6. Storage: My ADP Tape drive, possibly Penguin.

Supervision Requirements

I will be sitting down with Mr. How Lim once every two weeks to provide a project update and discuss milestones. Otherwise, we will be corresponding via email as needed.

Phases of Development

1. Research and Information Gathering
2. Paper Planning
3. Physical Network Design
4. Physical Software Design
5. Dissertation and Completion

Timetable and Milestones

Weeks 1 and 2: Proposal Definition

* Meetings with Supervisor, Overseer, Director of Studies.
* Set up schedule of meetings with overseers and supervisor.
* Consolidation of project plan and overall direction.
* Acknowledgement of resource availability.
* Supporting research and data collection.
* Project fine tuned per overseer/supervisor comments and finalized.
* Network hardware availability sorted out.
* All appropriate signatures prior to final project plan.
* Milestone: Final project plan.

Weeks 3 to 6: Paper Network Design

* Study Linux and how it is going to be used for this project.
* Study particularly relevant aspects of Unix/Linux documentation.
* Study particularly relevant aspects of P2P documentation.
* Start designing sketches of network diagram and related functions.
* Software Modules
* Start building UML diagrams of approved sketches.
* Milestone: Final Network Design.

Weeks 7 to 10: Paper Software Design

* Study related networking Java source code.
* Study documentation on ping/pong schemes.
* Create sketches of the application's GUI.
* Create pseudocode for all of the items in "A" to "I" above.
* Design methods, classes, procedures, functions, etc.
* Apply work to UML diagrams.
* Milestone: Final Application Design

Weeks 11 to 13: Physical Network Implementation

* Implement Network Design
* Start to think about dissertation.

* Milestone: Running Network

Weeks 14 to 20: Application Coding

* Implement Application Design
* Successful Implementation of all "core items".
* Start to work on the development of dissertation.
* Milestone: Prototype Application

Weeks 21 to 27: Evaluation & Debugging

* Check consistency, logic, etc.
* Evaluate the overall protocol
* Check against original schematics/specification
* Evaluate network efficiency/functioning
* Evaluate bandwidth usage

Weeks 28 to 35: Evaluation & Debugging/Dissertation

* These weeks are a repetition of the previous weeks.
* Check consistency, logic, etc.
* Evaluate the overall protocol
* Check against original schematics/specification
* Evaluate network efficiency/functioning
* Continuation of work and development on dissertation.
* Milestone: Fully fleshed out dissertation

Week 36: Final Form

* Completed application and dissertation.

* Milestone: Completed Dissertation and Application
