1 Introduction

With the advances in Internet and Web technologies, along with the increased availability of tools and content, we are seeing an exponential growth of online resources. However, a variety of obstacles (such as dispersion over the Web and lack of metadata) hamper discovery of such materials and hinder their widespread use. Digital libraries (DLs) address these problems by providing an infrastructure for publishing and managing content so it is discovered easily and effectively. Digital libraries are also viewed as a means for dissolving inequities in access to scientific information for researchers and students alike. A number of digital libraries exist today, in the commercial world as well as in the government and educational domains. Some examples from the commercial world are the American Physical Society [Aps], the IEEE Digital Library [Iee], and the ACM Digital Library [Acm]; a few examples from the other domains are the Los Alamos Digital Library [Lan], PubMed [Pub], and CERN [Cer].

Despite the creation of many digital libraries, considerable untapped content remains with individuals in diverse communities. One major hindrance is the centralized framework under which digital libraries are organized and deployed, which limits their accessibility, particularly for publishing. The centralized framework, by its very nature, requires an organization in a domain/community to take the lead not only in providing the hardware and software infrastructure to support a digital library, but also in establishing processes to develop and maintain the content. A number of organizations and communities are struggling to define business models that would financially sustain these digital libraries. Another problem faced by today's domain-specific digital libraries is the evolving nature of a community's interests. With time, a community's interests seem to outgrow their original scope and move to different topics; however, the processes and interfaces that have been put in place are not flexible enough to accommodate this change. Finally, there is the problem of building a community whose members are distributed, do not know of other members, do not know of available interest areas, yet still want to come together in a community of common interest.

Our solution moves away from traditional digital libraries, which rely on some central organization to develop, support, and maintain collections and services. In our vision there are on the order of a million universal clients, each client able to perform all the activities: searching, contributing, and maintaining collections. Here we address these issues by proposing a digital library framework that is based on peer-to-peer (P2P) networks and that leverages existing digital library work such as the OAI and Kepler projects [Oai, Kep]. A P2P network is a distributed network with no central control that consists of nodes running identical software. These networks do not require a central server, which is typically expensive and requires technical personnel to maintain. Gnutella [Gnu], Napster [Nap], and Kazaa [Kaz] are examples of file-sharing applications using a P2P network. There has been interest in using P2P networks for building digital libraries [Mal01, Baw03]. The Kepler project [Kep] has some components of a P2P network, but it is not a pure P2P network: it requires a central service to locate target nodes and is closer to Napster, a broker-based P2P network.
In [Baw03], the authors propose a topic-segmented network based on a pure P2P network, which has the nice property of clustering nodes with similar content while at the same time reducing the query processing cost. However, this approach has an inherent problem in handling flexible and evolving communities. In particular, it becomes difficult to identify communities that are multidisciplinary in nature and hard to categorize under a single discipline. Also, in topic-segmented networks it is not straightforward to handle new communities in the future or communities with changing interests.


The intellectual merit of the proposed research is to develop new models and investigate research issues in building digital libraries that are self-sustainable and support evolving communities with diverse interests. The proposed research builds upon existing work on the Open Archive Initiative (OAI), Kepler, and peer-to-peer and social networks. The key to our vision of supporting evolving communities is to build the network based on access patterns. We use concepts from SETS [Baw03] and Symphony [Man03] to build the network and to ensure that it exhibits the small-world property [Wat98], which in turn leads to efficient search. The major challenges in building such libraries are: (a) keeping the network connected in the presence of frequent joining, re-joining, and leaving of participants; (b) maintaining a low barrier for building communities with diverse interests; and (c) providing support for new and evolving communities with changing interests.

The main objective of the proposed research is to investigate an alternate model for digital libraries (we shall refer to it as Freelib) that is scalable and supports communities and domains evolving from the bottom up. As part of this project we plan to: (i) research system design issues in building software to support such libraries as they grow and evolve with time; (ii) build a universal client that every member node will run; (iii) build a test bed (faculty, staff, and students at four departments at ODU) to demonstrate the viability of our approach; and (iv) evaluate the effectiveness of the proposed digital library in terms of end-user services, support for evolving domains/communities, scalability, and sustainability. If Freelib indeed proves to have the desired sustainability property, the project will have a potentially broad impact on the education community, as there are many target communities that can benefit from such a system. For example:

• High school teachers nation-wide can build their own community to publish and exchange school project ideas, class syllabi, classroom materials, etc.

• Faculty members of universities nation-wide can use such a system to publish and make available their own publications, project reports, and course material, and to search for publications of interest to them.

• Students can create collections of material related to their courses and also to their private life.

Old Dominion University (ODU) is actively involved in the OAI as a member of the technical working group and as an alpha tester. We developed the first OAI-compliant service provider, Arc [Arc]. Besides Arc, ODU is working with the Phillips Air Force Research Laboratory (AFRL), Los Alamos National Laboratory, and NASA Langley Research Center in building a Technical Report Interchange (TRI) federation [Tri] of the report collections available at the three organizations in their native digital libraries. Recently, ODU demonstrated how OAI could be used to build Kepler [Kep], a framework for individual publishers; Archon [Arch], an NSDL project; and a new OAI-based NCSTRL [Ncs]. ODU will leverage its experience in building OAI-compliant data and service providers for this project. As the nodes of Freelib will be OAI compliant, they can be integrated with any OAI service provider. In particular, we will integrate Freelib with the NSDL architecture through collaboration with Archon (an NSDL-funded project), broadening the exposure of content available in Freelib.

2 Background

2.1 Digital Libraries

Today digital resources exist that include a large variety of objects: pre-prints, including technical reports; tutorials, posters, and demonstrations from conferences; student project reports, theses, and dissertations; working papers; courseware; and tool descriptions. However, a variety of obstacles (such as dispersion over the Web and lack of metadata) hamper discovery of such materials and hinder their widespread use. Digital libraries (DLs) address these problems by providing an infrastructure for publishing and managing content so it is discovered easily and effectively. Researchers have been addressing a variety of issues ranging from theoretical models to system building, for example: [Rag03, Dlit, Fox02, Eli, Inf, Lev, Dml, Dlib]. There is also interest in tracking user accesses to support high-level services, for example recommendation systems [Bol02, Sar01, Ale, Vtd]. Approaches to resource discovery are based either on harvesting, as in OAI (described below), or on distributed real-time searches such as [Shi02, Pae00].

OAI: In the last decade literally thousands of digital libraries have emerged. One of the biggest obstacles to dissemination of information to a user community is that many digital libraries use different, proprietary technologies that inhibit interoperability. Building interoperable digital libraries allows communities to share information that cuts across artificial institutional and geographic borders. The Open Archive Initiative [Oai, Lag01, Liu01] is one major effort to address technical interoperability among distributed archives. The OAI framework defines two roles, the data provider (archive) and the service provider. The data provider exposes its metadata and makes it available to service providers through the Open Archive Initiative Protocol for Metadata Harvesting (OAI-PMH).

Kepler: Kepler [Mal01] is a system for providing digital library services to the individual. Kepler provides the individual user with a client (called an archivelet) that contains an OAI 2.0 compliant data provider and a user interface to enter and edit metadata. Kepler supports community building by providing a group package. The group harvests metadata from the individual archivelets (and optionally caches full-text documents) into a centralized service provider where any user can search the whole collection. The Kepler framework is similar to a broker-based P2P network such as Napster. The system supports two types of users: individual publishers using the archivelet publishing tool and general users interested in retrieving published documents. The individual publishers interact with the publishing tool, and the general users interact with a service provider and archivelets using a browser.

2.2 Peer-to-Peer

A peer-to-peer network is a distributed network with no central control that consists of nodes running identical software. There are many applications of the P2P model, with file sharing being one of the most popular: it enables users to share and exchange files, for example, Napster [Nap], Gnutella [Gnu], Freenet [Fre, Cla00], and Kazaa [Kaz]. Some examples of other applications supported by a P2P model are Groove Networks [Gro], SETI@home [Set], OpenCola [Ope], and JXTA Search [Jxt]. One of the problems most P2P networks face is that of increased search latencies as the network grows and network degradation in the presence of frequent leaves and re-joins [Pan01]. Recently, there has been interest in using these networks for building digital libraries [Mal01, Kep, Baw03]. The Kepler project has some components of P2P networks but is not a pure P2P network. It requires a central service to locate the right node, and is closer to Napster, which is a broker-based P2P network. In [Baw03], the authors propose a topic-segmented network based on pure P2P networks, which has the nice property of clustering nodes with similar content while at the same time reducing the query processing cost.
Also, there is interest in building P2P networks that exhibit the small-world property [Man03]. A small-world network is one that has a small bounded diameter and a high clustering coefficient [Wat98].

3 Approach and Objectives

The objective of the proposed work is to research the key issues in the design, implementation, deployment, and evaluation of a sustainable digital library that supports dynamic evolution of communities. We propose a digital library that builds upon existing work on the Open Archive Initiative (OAI), Kepler, peer-to-peer networks, and social networks. Our solution moves away from traditional digital libraries, which rely on a central organization to develop, support, and maintain collections and services. In our vision there are on the order of a million universal clients, each client able to perform all the activities: searching, contributing, and maintaining collections. We distribute the cost of evolving the code through open-source participation and the cost of running the DL across millions of participants, admittedly incurring a trade-off of less efficient search methods. Whereas in the OAI world data providers and service providers are always drawn on opposite sides of a line, in Freelib each node is both. We also envision that the current NSDL framework will be able to integrate with Freelib and thus have access to its content. Note that all nodes of Freelib will be OAI compliant. The three key features of the proposed library are sustainability, dynamic evolution of communities, and support of diverse communities.

Sustainability. The underlying architecture of the proposed library is based on P2P, which is decentralized and does not rely on any expensive centralized hardware to sustain the library. In addition, there is no need for any centralized administrative control for maintaining and enforcing policies.

Dynamic Evolution of Communities. The key to our approach here is to characterize communities based on users' access patterns and to build the network topology to reflect this structure. The cluster of nodes formed by a common access pattern identifies a community. Freelib allows communities to form, grow, mature, dwindle, and disappear as the users' interests change.

Support of Diverse Communities. We want to architect our universal client in such a way that different requirements of different communities, typically in terms of metadata and interfaces, can be integrated with the core client using plug-ins. In our current Kepler project we are addressing some of these issues, and we plan to leverage that work for this proposal.

As part of our objectives, we plan to demonstrate that the Freelib approach will produce a viable DL that is of benefit to NSDL and its community. We should emphasize that this is a targeted research proposal and as such, we will only demonstrate in a rather small test bed that Freelib has potential. The specific objectives we plan to address are:

• Develop an alternate model for building digital libraries that is sustainable and supports communities and domains with diverse interests.

• Research how a P2P network model can be utilized for building such digital libraries; in particular, investigate the effectiveness of user-access-driven P2P network topologies and develop a framework.

• Build a test bed (faculty, staff, and students at four departments at ODU) that demonstrates the network of DLs and their interactions.

• Evaluate the effectiveness of the proposed digital library in terms of end-user services, support for evolving domains/communities, scalability, and sustainability.

• Research system design issues in building software to support the proposed library as it grows and evolves with time.

• Analyze the community patterns that develop in the test bed.

4 Proposed Work

In the Approach section we described the specific objectives of the overall vision of a scalable, sustainable DL that supports evolving communities. In this section we organize the work for achieving these objectives into the network architecture, the universal client architecture, and evaluation.

C - 5

4.1 The Network Architecture

A node in Freelib is an OAI service provider as well as an OAI data provider. Recall that a service provider in the OAI framework is the one that harvests metadata and provides end-user services such as indexing and searching. An OAI data provider, on the other hand, is typically an archive that holds the published records. We will need to endow a node with additional methods beyond the OAI protocol to support the P2P infrastructure. The network architecture of Freelib consists of two overlaid networks: the access network and the support network. The access network topology is characterized by the user access pattern, and the support network topology is determined by an adaptation of the Symphony protocol [Man03]. The Symphony protocol preserves the connectivity and small-world properties of the network to support efficient search. A node linked to other nodes in the support network is called a "contact", either short- or long-distance depending on the distance between the nodes. For the sake of clarity, in our discussion we view the two networks separately, although in the implementation both protocols reside in the universal client.

In the access network, a node is connected to the nodes it interacts with. Interaction is defined both ways: a node can search for objects and, upon discovering them, "access" them, or a node's objects can be discovered and "accessed" by other nodes. Interacting nodes are linked to each other by "friend" links in the P2P terminology. The network contains only active nodes that are on-line. A node is not connected to all interacting nodes, only a subset, as described later.

For completeness, we briefly describe the Symphony protocol and how we adapt it in our context; for details on the Symphony protocol please refer to [Man03]. Symphony is a protocol developed to maintain a distributed hash table of identities in such a way that no node has global knowledge, yet any object in the network can be discovered. It does so by having each node maintain k (a system parameter) friend links besides knowing its predecessor and successor. The latter two relations are obtained by organizing all nodes on a virtual, directed ring. The ring is of unit perimeter, and IDs of objects are real numbers in the interval [0,1); each node x manages the sub-interval of the ring between its own ID and that of its clockwise predecessor. Nodes can join the network by acquiring an ID (drawn from a uniform distribution) and creating and adjusting links, and can also leave (in which case friend links are adjusted). The reachability property is maintained by drawing k "links" to distant objects from a harmonic probability distribution, with the constraint that no node can have more than 2k incoming links. Though we are not using the protocol for maintaining a distributed hash table, we still maintain the concept of a manager of a sub-interval, which helps us in inserting a new node into the network, as explained later. We also expand the concept of short-range friends for a node, which in the original protocol is defined as links to the two adjacent nodes on the ring. As these are links in the support network additional to those defined by the original Symphony protocol, we still maintain the network characteristics that are ensured by the Symphony protocol.

4.1.1 Freelib Network Protocol

Rule for creating friend links (Zf). These links are in the access network (see Figure 1) and are identified by the list of friends1 (interacting nodes). The nodes in the friend list are ranked, and links exist only to the first Zf nodes of this ranked list; it is this list that will be used to search. In the access pattern information repository of a node (see Figure 3) we maintain a list of all friends a node has ever had (up to a set limit), tagged as active or off-line. Consider m entries in the friend list of a node. The rank Ri, i = 1 to m, of a friend is given by

Ri = α (ni / N) + (1 − α) (pi / P)    (1)

where
N is the total number of outgoing accesses to all nodes during a time interval t;
ni is the total number of outgoing accesses to the ith friend in the list;
P is the total number of incoming accesses from all nodes during a time interval t;
pi is the total number of incoming accesses from the ith friend in the list;
α is a weighting factor in the range 0 to 1, where α = 1 ignores incoming accesses in the ranking calculation and α = 0 ignores outgoing accesses.

The optimal value of α is an open question that requires further research and experiment, as it affects the formation of communities. Note that Ri forms a probability distribution over the friend list, so the sum of the ranks over all friends is one; Ri indicates the probability with which a friend will be accessed. Access here refers to access of the digital object, typically a full-text document. Access to metadata is not counted in the above rank calculation. The number of friend links, Zf, is a system parameter and is bounded by the following constraint:

Zf + Zs + Zl ≤ Z    (2)

Here, Z is the bound on the total number of links a node can have, and Zs and Zl are the numbers of short-range and long-range contact links, which are defined below. The optimal values for these parameters will be determined in part by an analytical model, simulation, and experiments.

Rules for creating long-range contact links (Zl). These links are in the support network (see Figure 1) and are created according to the Symphony protocol. They are key in maintaining a low diameter of the network. The number of long-range contact links, Zl, is a system parameter to be determined by simulation and experiment, and is also bounded by constraint (2). The long-range contact links are utilized for executing searches, as described later in the search section.

Rules for creating short-range contact links (Zs). These links are in the support network and are created by considering Zs nodes close on the ring (support network) and adding links to those nodes. These links are defined when a node ci joins the network or when the node migrates. In either case, the node ci identifies all nodes within a distance d to form short-range contacts. The distance metric

ci,j = |ci − cj|    (3)

is the absolute value of the distance between the two nodes on the unit-perimeter ring. The threshold d is a system parameter we shall determine through experiments and simulation. Note that we have extended the Symphony protocol here, which defines only two links as short-range links; see our earlier discussion of the Symphony protocol.

Rule for when to migrate. The decision of when to migrate is not a black-and-white one but rather a fuzzy one, as is the definition of a community. Constraint (2) relates the number of links to the steady-state ideal of Z links per node. When a node first joins there will be few friends, and almost all links will be synthetic links, which nevertheless ensure that the node will be found during searches. As the number of friends increases we can reduce the number of short-range contacts while maintaining the number of long-range contacts. Using (2) and (3) we develop heuristics that relate the location of the friend links to those of the short-range contacts to determine whether or not a node is physically in the wrong interval on the support ring.

1 Unless otherwise stated, friends are the active friends; see also the discussion on "Rules for re-joining" and "Rules for dropping friends".
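To make the ranking and truncation rules concrete, the following minimal Python sketch computes equation (1) and keeps only the top Zf friends; the data structure, field names, and the sample values of α and Zf are illustrative assumptions, not part of the protocol:

def rank_friends(friends, alpha=0.7):
    """Rank friends by equation (1): Ri = alpha*(ni/N) + (1 - alpha)*(pi/P).
    `friends` maps a friend id to counts of outgoing accesses to it ("out")
    and incoming accesses from it ("in") during the current interval t."""
    N = sum(f["out"] for f in friends.values()) or 1   # total outgoing accesses
    P = sum(f["in"] for f in friends.values()) or 1    # total incoming accesses
    ranks = {fid: alpha * (f["out"] / N) + (1 - alpha) * (f["in"] / P)
             for fid, f in friends.items()}
    return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)

def active_friend_links(friends, zf=8):
    """Keep links only to the top-Zf ranked friends; lower-ranked friends stay
    in the access pattern repository in case the list needs refilling."""
    return [fid for fid, _ in rank_friends(friends)[:zf]]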

C - 7

Rule for how to migrate. At some time, as described above, a node may decide to migrate to a new position on the support network. When a node makes the decision to migrate, it simply picks its top ranked friend and sends a migrate request to it. This target node behaves as if this is a new node that is trying to join and inserts it between itself and its clockwise predecessor on the support network. The migrating node must notify its own neighbors before it leaves so that they update their links. After moving to the new position, the migrating node updates its short-range and long-range contacts on the support network. The friend list in the access network does not change for a node.
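The long-range contacts that a node (re)creates on joining or migrating follow the Symphony construction. As an illustration only, the sketch below draws Zl link distances from Symphony's harmonic distribution (density proportional to 1/(x ln n) on [1/n, 1]) using inverse-CDF sampling; the network-size estimate n and the routing step that turns a target position into an actual contact are assumed to be provided elsewhere:

import random

def draw_long_range_distances(n_estimate, zl):
    """Draw Zl link distances on the unit ring from the harmonic pdf
    p(x) = 1 / (x ln n), x in [1/n, 1]; inverse-CDF sampling gives x = n**(u - 1)."""
    return [n_estimate ** (random.random() - 1) for _ in range(zl)]

def long_range_targets(my_id, n_estimate, zl):
    """Target positions clockwise from my_id; the node then routes to the
    manager of each target sub-interval and requests a long-range link
    (subject to Symphony's bound of 2k incoming links per node)."""
    return [(my_id + d) % 1.0 for d in draw_long_range_distances(n_estimate, zl)]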

Figure 1. Network architecture, showing the access and the support network

Rules for joining and leaving the network. To join the network, a new node simply needs to know the address (IP) of any existing node on the network. This is usually obtained by some offline means, e.g., a user getting the IP of a friend or getting some IP from a website. The joining node sends a join request to the existing node. The existing node picks an id randomly from the sub-range it is managing and returns the id, along with the short-range contacts (based on the picked id), to the joining node. The short-range contacts are built using the distance metric (3) on the support network. In contrast to the Symphony protocol, we select Zs nodes at random from the interval (id − d, id + d). Note that if the joining node happens to belong to the community of interest of the existing node, this approach to joining will put the new node in the right cluster. The Zl long-distance links are chosen in accordance with the Symphony protocol. The new node can now start building its list of friends by accessing and interacting with other nodes. When a node leaves the network, it informs its contacts so that they can update their links accordingly. (Note that a link in the support network is bi-directional, as opposed to the unidirectional links in the access network; in Figure 1 we show only the unidirectional links to avoid crowding the picture.)

Rules for re-joining. When a node leaves the network temporarily, such as a client on a home computer with an intermittent Internet connection, one would not want to destroy all the access information the node has built up during its existence. On the other hand, the support network needs to delete all the links relating to the leaving node so as to keep the network connected and not violate the bound on the diameter. We will maintain the links for a node in the access pattern information repository with a flag indicating that it is off-line (there will be a time threshold for removing a node permanently). Any node in either the access or support network periodically pings its friends or contacts to determine whether they are active; if they are not, the links are adjusted accordingly. Upon re-joining, the friend list is restored from the repository and the contacts are created as if the node were doing a regular join.

Rules for dropping friends. As a node interacts with other nodes, accessing their objects and being accessed by them, the friend list may grow beyond Zf. In that case we drop the lowest-ranked (active) friends (see equation (1)) from the list but maintain the information in the access information repository in case the list needs filling up again. That may happen through friends leaving the network (either temporarily or permanently).

Rules for searching. There are various search protocols used in the P2P area, mostly variations of breadth-first search and depth-first search [Gnu, Cla00, Das03]. We will adapt these searches to our context, utilizing access friends, short-range contacts, and long-range contacts. On one end of the spectrum we may utilize only friend links, and on the other end we use all types of links: friends and contacts. We need to limit the number of links the search method uses, as it may otherwise flood the network and also lead to a low-precision result set. The latency of searches will vary depending on various parameters such as the TTL (time to live) and the fraction of total links used. For example, a friend-only search with a TTL of 1 would be fast and have good precision and recall. This is the fastest search mode; after the initial startup, when the node has enough access statistics and access friends are ranked based on these statistics, this search mode would return a large portion of the results the user is interested in. We expect this to be the dominant search mode, especially after the initial startup. In the local community search mode, any node that receives the request does not route it to any of its long-range synthetic friends; it routes it only to short-range synthetic friends and access friends. This is the second-fastest search mode and is expected to return more search results than the previous one. In the global search mode, there is no restriction on which nodes to forward the request to. This mode is the default and should be used in the beginning, after a node joins, until it discovers its access friends in the community.

Rules for Replication. Figure 2 shows the structure and contents of the repositories of the nodes. Each node has the standard repository that contains the objects and metadata records the owner has published. The figure also shows supernodes (to be discussed in the next section), which store aggregated metadata of friends (to improve search) and replicas of metadata records and objects that the node owner has accessed. Content is replicated at the time the node owner accesses an object and keeps it (maintaining provenance), so that the node can serve up the metadata (and the object) should an appropriate query reach it. The intuitive idea behind this approach to replication is that an object that is in great demand will be replicated in many places and will be retrieved early on in the search, whereas an object that is scarcely retrieved will not be replicated that often.
This is similar in spirit to Freenet's replication strategy, though Freenet keeps a copy of the document at intermediate nodes as well. When a node replicates an object, the original owner of the document needs to be notified (we skip some details of notification, particularly when the owner node is not active) so that the owner can adjust its ranking calculations. The number ni in the rank calculation includes these indirect accesses. On the other hand, it is not clear whether the servicing node (keeping the replica) should consider the indirect access in its own rank calculation. We suspect it may be desirable to consider it, though perhaps treated as a fractional access. As part of this project we plan to evaluate this theoretically, by simulation, and experimentally.

Duplicate Detection. The search methods all involve processing the query locally for "hits" and, if the algorithm dictates, forwarding the query to other nodes. Each node therefore has to merge results from the local repository and the returns from the other nodes before it sends these merged results back to the node that originated the query (or to the user if the query came directly from the application). Because of the replication mechanism and the supernode mechanism (see below), it is quite likely that duplicates will be present in the lists to be merged. In the Archon [Arch] and TRI [Tri] projects we have developed methods for detecting duplicates based on a number of heuristics, which we will import for this purpose.
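A minimal sketch of how a node might combine the searching, replication, and duplicate-detection rules above: it answers from its local repository (owned and replicated records), forwards the query along the links permitted by the chosen mode while decrementing the TTL, and de-duplicates the merged results. All names and the simple identifier-based de-duplication are illustrative; the actual heuristics come from Archon/TRI:

def handle_search(node, query, mode, ttl):
    """Process a search locally and forward it per the chosen mode:
    'friend' uses only access-network friend links, 'local' adds short-range
    contacts, and 'global' also uses long-range contacts."""
    results = node.repository.match(query)            # owned and replicated records
    if ttl > 0:
        targets = list(node.friends)
        if mode in ("local", "global"):
            targets += node.short_contacts
        if mode == "global":
            targets += node.long_contacts
        for peer in targets:
            results += peer.search(query, mode, ttl - 1)   # forwarded request
    # Replicas and supernode copies can return the same record via different
    # paths, so merge and de-duplicate before returning.
    seen, merged = set(), []
    for record in results:
        if record["identifier"] not in seen:
            seen.add(record["identifier"])
            merged.append(record)
    return merged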

Figure 2. Collection structure

4.1.2 Network Topology and Communities

Recall from the previous section that in Figure 1 a physical node appears twice, once in the access network and a second time in the support network. In general, nodes that are clustered in the access network may not be clustered in the support network. However, our protocol for migrating nodes tries to form clusters on the support network corresponding to the communities in the access network. The main objective of maintaining this correspondence between communities and clusters is to help a joining node find contacts that are most likely going to be its friends as well.

Super Nodes. A node has the option to become a supernode at any time. Once a node becomes a supernode, it harvests all metadata from all its friends using OAI-PMH. It also detects and removes duplicates and normalizes and indexes the metadata. In other words, a supernode can act as an indexer (search engine) for the community. The user of the node has an incentive to become a supernode because it will improve her search performance. (It is not clear whether awareness of supernodes is an advantage; at this time we will keep a supernode's existence known only to its owner.) At the discretion of the user, a supernode may decide to keep a copy of the full-text documents as well. In this case, it would add alternate URLs in the metadata records pointing to the copies of the full text.

4.1.3 OAI Extensions

It is essential that the nodes in our network be OAI compliant, so that they retain the flexibility to work with centralized service providers as well; it will also help us in integrating Freelib with NSDL. For this reason we add methods to OAI-PMH that are required for supporting the access and support networks and for performing search. We understand that there are trade-offs in doing so, as some of the network operations may not be efficient over HTTP as required by OAI. The proposed extensions, like the base OAI methods, are submitted using either the HTTP GET or POST method. We now briefly describe a few of the extensions to illustrate how we plan to develop them.

Joining the network: submitted to an existing node by a new node that wants to join the network. An example of this method is:

http://somenode.odu.edu/?verb=join&id=0.524&ip=128.82.8.76

The existing node routes the request to the node responsible for the given id, the target node. The target node replies by telling the joining node about its short-range contacts. The response is encoded in XML, just like OAI-PMH responses; for lack of space, we skip the details. The joining node then establishes short-range connections to these nodes and starts accessing the network. It also starts building its friend list based on the access information.

Leaving the network: submitted by an existing node to its contacts before leaving the network. Note that this operation is not always explicitly initiated by the user (the owner of the existing node; e.g., the machine crashes). An example of this method is:

http://somenode.odu.edu/?verb=leave&id=0.524

On receiving this request, the contacts adjust their links as defined by the protocol.

Migration: a node, on making the decision to migrate, first leaves the network and then submits a migration request to the first member in its friend list. An example of a migration request is:

http://somenode.odu.edu/?verb=migrate

When the target node receives the migrate request, it allocates an id for the requesting node that is close to its own id and replies with information similar to that of the join request.

Search: sent or forwarded by a node to some other node requesting a search. Two sample requests for a breadth-first search six levels deep, with the received TTL being 4:

http://somenode.odu.edu/?verb=listIdentifiers&metadataprefix=oai_dc&author=maly&keyword=network&search=BF6&ttl=3

http://somenode.odu.edu/?verb=listRecord&metadataprefix=oai_dc&author=maly&subject=digital library&search=BF6&ttl=3

Upon receiving a request, the node searches its collection, returns any matches found, and forwards the request to its neighbors in a P2P fashion, updating the TTL parameter. The additional search parameters can be made optional. When the XML records return, the node processes them as described in the section on duplicate detection.
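As an illustration of how a client might assemble these HTTP GET requests (for the base OAI-PMH verbs and the proposed extensions alike), here is a small Python sketch using standard URL encoding; the host name and parameter values are placeholders taken from the examples above:

from urllib.parse import urlencode
from urllib.request import urlopen

def oai_request(base_url, verb, **params):
    """Build and submit an OAI-PMH-style GET request; responses are XML."""
    query = urlencode({"verb": verb, **params})
    with urlopen(f"{base_url}?{query}") as response:
        return response.read()          # raw XML, to be parsed elsewhere

# Standard OAI-PMH harvest of Dublin Core metadata from a (hypothetical) node:
# oai_request("http://somenode.odu.edu/", "ListRecords", metadataPrefix="oai_dc")

# Proposed search extension, forwarding a breadth-first search with TTL 3:
# oai_request("http://somenode.odu.edu/", "listIdentifiers",
#             metadataprefix="oai_dc", author="maly", keyword="network",
#             search="BF6", ttl=3)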

4.1.4 Formative Simulation Study

The objective of the simulation is to (i) test and validate the protocol, (ii) obtain ranges for the different system parameters, (iii) evaluate the effectiveness of the support network, and (iv) evaluate trade-offs between different variations of the protocol rules. In (i), we will check whether there are any bottlenecks and whether there are special boundary conditions that we need to adjust. We need to avoid any infinite loops of operations; for example, migration may have a tendency to create infinite loops. In (ii), we will obtain optimal system parameter ranges by conducting various experiments. These system parameters have been identified throughout the network protocol sections. For (iii), we will simulate the access network first and see how it evolves for a given set of access patterns. Next we will add the support network along with contacts and see how it evolves for the same access patterns. In (iv), we want to decide on protocol options where we are not clear on the right strategy, for example, how to handle a new join.


Here, the first preferred option, as described in the rules, is to insert the node in the support network and use all nodes within a short distance as the short-range contacts. These contacts are initially used for executing searches for this node, and with time the node develops its friends. The alternate option is for the node to take a copy of the short-range contacts from the initial contact node.

4.2 Universal Client Architecture

In the preceding section we presented our model of the network and how nodes interact, mostly from the network perspective. In this section we present the architecture of the universal client as seen from the perspective of the users. To start things off, the user downloads the code for the universal client, which is maintained in OpenSource.

Customization. At installation time a user will be able to configure various components. In Figure 3 we show two components that can be configured: which metadata schemas should be available for publishing documents (objects), and, within each schema, which of the fields are mandatory or optional (UI schema). The default will be Dublin Core with the fields shown in Figure 4 being mandatory (we have taken the screenshots from Kepler, where such configurability is a key component). The idea behind this customization is to allow users to be part of several communities, each preferring its own metadata schema. The search interface presented to the user will similarly be configurable to a chosen metadata schema. Finally, we allow a node to decide whether or not it wants to serve as a supernode (periodically aggregating metadata records of friends).

Search. The core services of the network protocol, shown in the right part of Figure 3, provide the set of basic search mechanisms described in the search section of the network architecture. One of the key questions we have to answer is how, or whether, to give the user some control over the search methods used by the system. In traditional DLs there is no question that the user has no control over what algorithms the system uses to satisfy a request; the only control the user has is to provide more or fewer details about the constraints of the search. However, in our environment search efficiency can vary dramatically depending on what kind of search is used. We shall investigate how the system can interact with the user to optimize the search. Specifically, we will design an interface that lets the user express sentiments such as:

• Needs to broaden the search because the search results are not satisfactory.

• Wants more friends in the community; wants to see an explicit buddy list and to browse.

• Wants to search all members of a community only; is happy with the community and for a specific search wants the fastest results, but not necessarily all of them.

• Needs to broaden out; knows very little about another community but wants to find out about it.

• Is interested in more than one community; has friends in two or more communities but wants to search only one particular community (we need ways to group friend links).

• Wants to browse by communities; we need to support accessing other communities and getting to the supernodes.

This user input is taken at the application-level search module at the left of Figure 3. This module interacts with the core search service module, which in turn sends extended OAI requests to the network and obtains OAI-compliant XML records back. The core search module then sends the search results back to the application search module and also updates the access pattern information repository. Accessing of documents is also routed through these two modules in order to update the two repositories with regard to the supernode attribute, replication of objects, and the access pattern. The application-level search module also handles the periodic supernode harvesting and the incoming OAI requests for information about its local collection repository.
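One possible, purely illustrative way for the application-level search module to translate these user sentiments into parameters for the core search service; the mode names correspond to the friend-only, local community, and global searches of Section 4.1, and the TTL values and flags are placeholder assumptions to be tuned by simulation:

# Map user intent to (search mode, TTL, restrict-to-own-community) parameters.
SEARCH_STRATEGIES = {
    "fast results from my community":  ("friend", 1, True),
    "broaden within my community":     ("local",  3, True),
    "search one particular community": ("local",  3, True),   # with grouped friend links
    "explore another community":       ("global", 5, False),
    "browse by communities":           ("global", 5, False),  # routed toward supernodes
}

def plan_search(intent):
    """Return the parameters the core search module should use for a user intent."""
    return SEARCH_STRATEGIES.get(intent, ("global", 5, False))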


Figure 3. Universal client architecture

Figure 4. Publication interface

In addition to the interface issues, we need to implement a number of application-level algorithms that use the routing protocol to implement, say, the browsing feature in the above list. The basic idea would be to do a breadth-first search of degree j, where j is the community diameter (which could be either kept in each node or estimated through the support network), to locate all supernodes, perform a browse on these, and merge the results.

Assessment. For the evaluation study to be described in Section 4.3, we will need assessment tools that provide statistics about the network and its performance. As all nodes support OAI, we can use OAI to harvest logs and analyze them. For this purpose we shall develop a special harvester that can locate all the nodes in the network and harvest their logs, statistics about the nodes' collections (number of objects, size), and access information repositories (searches: number, type; query results: number of hits, community distribution, wall time to obtain the query; accesses by other nodes: who, what, how often).

Publishing. This service is probably the least contentious problem. From our various DL projects in the past we have extensive experience in building customizable publication services for a variety of digital libraries. The one most appropriate for our environment is the Kepler [Mal01] publication service, because it provides software that lets the user select which metadata schema should govern the publication process and, further, how to enforce parts of the metadata fields, e.g., whether the creator field in DC should be mandatory. For a sample of the services available in our Kepler project see [Kep]. In Figure 4 we show a typical screenshot of such a service. It will interact with the various metadata handlers the user chose during the installation process. Each metadata handler will interact with the repository through a standard API to deposit or retrieve an object to or from the repository. The repository contains metadata records and objects of the owner, friends, and contacts, as described in Figure 2.

Network maintenance. The module shown in the core service part of Figure 3 handles all operations that maintain the connectivity of the node to the other nodes in the network. These operations have been described in Section 4.1.

Connection Obstacles. There are a number of well-known obstacles to establishing and maintaining connections within a P2P network. Firewalls, NAT, PAT, SOCKS, and proxies all may make it difficult either for someone to connect to the client or for the client to connect to the outside world. This is a difficult subject that occupies many researchers in its own right; we will employ a simple solution from the Kepler project [Kep] that works for proxies, NAT, and limited firewalls.

4.3 Evaluation

In the previous sections we described the architecture and services of the proposed DL and identified the major open research issues we will have to resolve. In this section we describe both the process of validating the correctness of our approach and design and that of evaluating, or rather providing the groundwork for estimating, the performance of any such system. The approach we shall take is to implement a prototype universal client and deploy it in a test bed here at ODU. We will then perform formative and summative evaluations of the prototype with an eye towards developing models with predictive capabilities. We will also use the test bed to analyze the behavior of the network with regard to community formation and evolution.

Performance estimation. For evaluation purposes we plan to build a test bed at ODU consisting of three different categories of users, faculty, staff, and students, in four departments: Physics, Computer Science, Biology, and Teacher Education. We have contacts in these departments, and in the past we have collaborated with them on various projects. We now describe some of the experiments that we will do to see whether we meet our objectives. Note that to support collecting data for our experiments we will provide hooks in the universal client architecture.
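As an illustration of the assessment harvester described above, the sketch below walks the network from a seed node and gathers each node's collection and access statistics; the node interface (friends, contacts, log summaries) is hypothetical and stands in for the OAI-based log harvesting we will actually implement:

def harvest_assessment_data(seed_node):
    """Walk the Freelib network and gather per-node statistics for the
    evaluation study (collection sizes, search and access logs)."""
    visited, stats = set(), {}
    frontier = [seed_node]
    while frontier:
        node = frontier.pop()
        if node.id in visited:
            continue
        visited.add(node.id)
        stats[node.id] = {
            "objects": node.collection_size(),         # number and size of objects
            "searches": node.log_summary("searches"),  # number, type, hits, wall time
            "accesses": node.log_summary("accesses"),  # who, what, how often
        }
        frontier.extend(node.friends + node.short_contacts + node.long_contacts)
    return stats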
Discovery of communities. This experiment assumes that Freelib has been deployed and has been running for some time, a week or so. We conduct this experiment by inserting a new node, a faculty member from the Physics department, into the Freelib network at a position close to Biology nodes in the support network (see the join protocol description in Section 4.1). We will monitor the movement of the node's position with time as the new node starts publishing and searching. We will observe the impact of the number of accesses on the speed at which the node moves to the right community, in this case the Physics community. We will repeat this experiment for different insertion points and for new nodes selected from different communities.

How communities evolve. We conduct this experiment by asking a few faculty in Computer Science and in Biology to start searching for and publishing papers in the area of bioinformatics (the assumption here is that initially these faculty members have been requested to be active only in their respective fields). Next we will observe the number of accesses and the time it takes for these faculty members to come together and form a new community. The proximity of the members will be measured on the support network.

Network Characteristics. With this experiment we will observe the impact of frequent leaving and re-joining on the characteristics of the network. In particular, we will observe whether the network remains connected at all times and whether the diameter of the network is as predicted for small-world networks (our sample may be too small to be conclusive).

Search Performance. The objective of this experiment is to evaluate the performance of search in terms of latency, precision, and recall for the different search strategies outlined in Sections 4.1 and 4.2. For this experiment, we will develop a standard set of queries covering all the domains in the test bed. These queries will be executed at different specified points in the network. The results for the different search strategies will be collected and evaluated.

Evolution towards what community. In the test bed section we described the participants of the evaluation study and emphasized that they come from four different departments. We are interested in how these participants will evolve into communities. Will it be along the lines of subject, as given by the departmental organization, or will it be something entirely different? It may be that communities evolve along the lines of organizational grouping such as staff, faculty, and students, or grouping by interest such as music, literature, and sport may dominate. Or it may be that age or gender will be the community-driving factor. The sample is small but should provide us some insight into what the system is capable of and whether it will serve the needs of the participants: creating resources they are interested in and finding resources they need. The study will be summative in the sense that we will gather statistics as the system is deployed and periodically take snapshots and produce reports on the composition of communities and the nature of their evolution, whether it is continuous, oscillatory, and/or converging to a stable state.

5 Impact and Dissemination

Potentially the impact of this project could be very large if we are completely successful in demonstrating that the Freelib concept is not only feasible but also performs comparably to traditional digital libraries while providing a service that is not available under the current model, one that scales to all the members of the community and is sustainable. The impact is potential only, because as part of this project we will not supervise the actual deployment of the universal client in the K-12 and higher education environment but only in the test bed.
We will most definitely complete the project, place the resulting source code in OpenSource, and encourage the development of additional service features that will make the client acceptable to the community and encourage adoption by many. We will also pursue the traditional means of dissemination: publications, education of graduate and undergraduate students through seminars, and talks at other universities. One aspect that makes this proposal stand out is that its low-cost instantiation should make it a tool amenable to wide dissemination among groups that are disadvantaged in the current education system, and we will attempt to bear this out by encouraging participation in the test bed by students at historically black universities in the Hampton Roads area and at high schools in Norfolk that have a predominantly African American population.

6 Schedule and Deliverables

In the table below we have grouped the activities of the proposed work in terms of architecture design and code development, test bed development and deployment, and evaluation. Whereas the "Proposed Work" section was organized in terms of required functionality, the table is presented as a flow of tasks.

7 Results from Prior NSF Support

The PIs of this proposal have had a number of NSF grants in the area of digital libraries. Since the collaboration with the Archon project is described in this proposal, we chose it as the representative grant for this section: NSF 0121656, An OAI-Compliant Federated Physics Digital Library for the NSDL. This is an ongoing NSDL project that is nearing completion; results, better than expected in some respects and worse than expected in others, are available. Archon is a collaborative project between Old Dominion University, the American Physical Society, and Los Alamos National Laboratory. The project is building an Open Archives Initiative compliant federated digital library with an emphasis on physics for the National Science Digital Library. NSDL is the comprehensive source for science, technology, engineering, and mathematics education and is funded by the National Science Foundation. This physics digital library federates holdings from the physics e-print server arXiv, Physical Review D from the American Physical Society, CERN, and a number of smaller holdings. We have developed high-level services such as cross-reference linking, which is based on OpenURL and leverages the Citebase research at Southampton. A number of services unique to the physics community have been developed, equation-based search being one of them; others are author and subject similarity. We have also developed a strategy that relies on simple search followed by result-set processing (recursive search), which helps counter the well-known aversion of users to advanced search. Finally, all these services have been integrated with the OAI-based dynamic harvesting of the contributing collections.

Schedule. Tasks are grouped by area and phased over Years 1 and 2 (quarterly milestones at months 3, 6, 9, 12, 15, 18, 21, and 24):

Network Architecture
• Protocol design
• Protocol simulation and validation

Universal Client Architecture
• Server component with HTTP support
• OAI harvester and normalization component
• Duplicate detection support
• Protocol implementation
• Client implementation with DC plug-in
• Develop client plug-ins for two communities

Test Bed
• Deploy the test bed at Computer Science
• Test, refine, and debug the architecture
• Include Physics and test the system
• Include Biology and Teacher Education; system test
• Integrate harvesting of Freelib from Archon

Evaluation
• Experiments for discovering communities
• Experiments for evolving communities
• Experiments for network evaluation and scalability prediction
• Experiments for search performance evaluation