U-P2P
A Peer-to-peer System for Description and Discovery of Resource-sharing
Communities
Aloke Mukherjee, Carleton University
August 28, 2003
Peer-to-peer File-sharing
Exploit storage capability of the edge
Balance load
Robustness to failure
Weaknesses: Search and Communities
Search Problem
Lack of structured metadata
  Filenames, keyword matching
  Opaque identifiers
  Support for popular formats
Ignoring structured metadata
  Implicit indicators
  Collaborative filtering
State of the Art: Search
Metadata: Napster, Kazaa, Limewire, JxtaSearch
Query routing: Gnutella, Routing Indices, Limewire, Neurogrid
Communities: JxtaSearch, Alpine, Associative P2P
Search in DHTs: PIER, FASD, Inverted Indices
Community Problem
Not simple to create a community for sharing a new file format
Current state
  Different protocols/apps (gnutella, fasttrack, jxtasearch)
  Inadequate metadata (filename matching, limited schemas)
  Ad-hoc attempts aimed at specific domains
Scattered and isolated – there is no easy way to discover communities
State of the Art: Communities
Opaque: no existing rich metadata search, and no way to add it
Limited: rich metadata search for some formats, but no way to support new formats
Implicit: implicit indicators are used to identify communities, with no way to specify them explicitly
Partial: users can explicitly form groups, but each grouping is in the eye of the beholder
Unshared: users can explicitly direct rich metadata queries to a community, but the response format is not specified
Improving Search
Standard metadata layer
Explicit structured metadata
All resources are XML files
XML Schema used to describe format (e.g. MP3, design pattern)
Schema instantiates resource
<schema>
  <element name="designpattern">
    <complexType>
      <sequence>
        <element name="name" type="string"/>
        <element name="author" type="string"/>
        <element name="context" type="string"/>
        <element name="problem" type="string"/>
        <element name="design" type="string"/>
        <element name="diagram" type="anyURI"/>
      </sequence>
    </complexType>
  </element>
</schema>
<designpattern>
<name>singleton</name>
<author>gang of four</author>
<context>when creating a new class…</context>
<problem>ensure a class only has…</problem>
<design>make the class itself responsible…</design>
<diagram>http://example.com/singleton.jpg</diagram>
</designpattern>
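As a rough illustration of how a schema constrains a resource, the sketch below checks the design pattern instance above against the element names listed in its schema. It uses only Python's standard library and a hand-written field list, so it is not a real XML Schema validator; `REQUIRED_FIELDS` and `check_resource` are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# Field names copied from the designpattern schema (a sequence, so order matters).
REQUIRED_FIELDS = ["name", "author", "context", "problem", "design", "diagram"]

RESOURCE = """
<designpattern>
  <name>singleton</name>
  <author>gang of four</author>
  <context>when creating a new class...</context>
  <problem>ensure a class only has...</problem>
  <design>make the class itself responsible...</design>
  <diagram>http://example.com/singleton.jpg</diagram>
</designpattern>
"""


def check_resource(xml_text, root_name="designpattern", fields=REQUIRED_FIELDS):
    """Return True if the root element and its children match the expected sequence."""
    root = ET.fromstring(xml_text)
    child_names = [child.tag for child in root]
    return root.tag == root_name and child_names == list(fields)


print(check_resource(RESOURCE))
```

A resource that omits or reorders a field fails the check, which is the property the schema is meant to guarantee for every file shared in a community.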
Automated interface generation
[Diagram: schema, resource XML, resource create form, resource search form and resource view; the schema and the resource are linked by "instantiates", and the forms and view are produced by XSLT]
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="stamps">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="description" type="xsd:string"/>
        <xsd:element name="country" type="xsd:string"/>
        <xsd:element name="dateOfIssue" type="xsd:date"/>
        <xsd:element minOccurs="0" name="lastDayOfSale" type="xsd:date"/>
        <xsd:element minOccurs="0" name="denomination" type="xsd:string"/>
        ...
XSL Transform
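The idea of deriving a create form from the schema can be sketched as follows. U-P2P does this with XSLT stylesheets; the Python below is only a stand-in that walks a shortened copy of the stamps schema and emits one input per declared element. The function name `create_form` and the HTML layout are assumptions made for this example, not the project's actual output.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
ELEMENT = "{%s}element" % XSD_NS  # qualified tag name of xsd:element

# Shortened copy of the stamps schema, closed so that it parses on its own.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="stamps">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="country" type="xsd:string"/>
        <xsd:element name="dateOfIssue" type="xsd:date"/>
        <xsd:element minOccurs="0" name="lastDayOfSale" type="xsd:date"/>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""


def create_form(schema_text):
    """Build a plain HTML form with one input per element declared in the schema."""
    schema = ET.fromstring(schema_text)
    top = schema.find(ELEMENT)  # declaration of the resource's root element
    lines = ['<form name="%s">' % top.get("name")]
    for field in top.iter(ELEMENT):
        if field is top:
            continue  # skip the root declaration itself
        name = field.get("name")
        input_type = "date" if field.get("type") == "xsd:date" else "text"
        required = "" if field.get("minOccurs") == "0" else " required"
        lines.append('  <label>%s <input name="%s" type="%s"%s></label>'
                     % (name, name, input_type, required))
    lines.append("</form>")
    return "\n".join(lines)


print(create_form(SCHEMA))
```

Because the form is computed from the schema alone, a new community gets a working create form without any format-specific user interface code.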
<?xml version="1.0" encoding="UTF-8"?>
<stamps title="2002 Olympic Winter Games">
  <name>2002 Olympic Winter Games</name>
  <description>To celebrate the spirit of the 2002 Winter Games that took place February 8-24, 2002 in Salt Lake City, Canada Post issued four stamps featuring some of the most exciting events of the games.</description>
  <country>Canada</country>
  <dateOfIssue>2002-01-25</dateOfIssue>
  <lastDayOfSale>2003-01-24</lastDayOfSale>
  <denomination>4 x 48¢</denomination>
  <design>Bhandari and Plater Inc.</design>
  <dimensions>30 mm x 40 mm (vertical)</dimensions>
  <gumType>P.V.A.</gumType>
  <paperType>Tullis Russell Coatings</paperType>
XSL Transform
Community Creation and Discovery: What is a Community?
Concrete object with defined tuple of attributes
Simplest form: (format, protocol, …)
Known examples: (mp3, napster), (video, kazaa)
Examples that don’t exist: (design patterns, gnutella), (p2p papers, jxtasearch)
Tuple is specified as an XML file
Simplifying Community Creation
<community>
<name>designpatterns</name>
<schema>designpattern.xsd</schema>
<protocol>gnutella</protocol>
<display>designpattern.stylesheet</display>
</community>
User-designed communities
  Compose schema to describe format
  Compose community XML file
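To make the "community as a tuple" idea concrete, the sketch below reads a community XML file like the one above into an immutable record. The `Community` dataclass and `parse_community` function are hypothetical names chosen for this example; only the four element names come from the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

COMMUNITY_XML = """
<community>
  <name>designpatterns</name>
  <schema>designpattern.xsd</schema>
  <protocol>gnutella</protocol>
  <display>designpattern.stylesheet</display>
</community>
"""


@dataclass(frozen=True)
class Community:
    """The community tuple: what is shared, how, and how it is displayed."""
    name: str
    schema: str
    protocol: str
    display: str


def parse_community(xml_text):
    """Turn a community descriptor into a Community record."""
    root = ET.fromstring(xml_text)
    fields = ("name", "schema", "protocol", "display")
    return Community(**{field: root.findtext(field) for field in fields})


community = parse_community(COMMUNITY_XML)
print(community)
```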
Community as class
[Diagram: an mp3 file in the mp3 community, alongside an mp3 object in the mp3 class]
Metaclass analogy
[Diagram: an mp3 file in the mp3 community, itself in the "community community", alongside an mp3 object in the mp3 class, itself in the "class class"]
Community discovery is file discovery
  MP3 community shares MP3 files
  Community community shares communities
[Diagram: mp3 files inside the mp3 community; communities inside the community community]
Simplifying Community Discovery
A Community for Communities: The Root Community
Communities are files shared in a real community
Root Community includes schema for communities
(format, protocol) = (community, centralized db)
Schema for Communities
<schema>
<element name="community">
<complexType>
<sequence>
<element name="name" type="xsd:string"/>
<element name="protocol" type="protocolTypes"/>
<element name="schema" type="xsd:anyURI"/>
<element name="display" type="xsd:anyURI"/>
</sequence>
</complexType>
</element>
</schema>
<community>
<name>root community</name>
<schema>community.xsd</schema>
<protocol>central-db</protocol>
<display>community.stylesheet</display>
</community>
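Since a community descriptor is just another shared file, discovering a community is an ordinary search in the root community. The sketch below shows that two-step lookup with a toy in-memory index standing in for the central database; `CentralIndex` and its methods are hypothetical, and the sketch ignores the `protocol` field by serving every community from the same index.

```python
from collections import defaultdict


class CentralIndex:
    """Toy in-memory stand-in for a central database, keyed by community name."""

    def __init__(self):
        self._resources = defaultdict(list)

    def publish(self, community, resource):
        self._resources[community].append(resource)

    def search(self, community, **criteria):
        """Return resources in a community whose fields equal all given criteria."""
        return [r for r in self._resources[community]
                if all(r.get(key) == value for key, value in criteria.items())]


index = CentralIndex()

# A community descriptor is published in the root community like any other file.
index.publish("root community", {
    "name": "designpatterns",
    "schema": "designpattern.xsd",
    "protocol": "gnutella",
    "display": "designpattern.stylesheet",
})
# A design pattern is published in the community that descriptor defines.
index.publish("designpatterns", {"name": "singleton", "author": "gang of four"})

# Step 1: discover the community by searching the root community.
found = index.search("root community", name="designpatterns")
# Step 2: search inside the discovered community.
patterns = index.search(found[0]["name"], author="gang of four")
print(patterns)
```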
The Root Community
What is U-P2P?
A framework that breathes life into these ideas
Explicit metadata search and creation for every Community
Creation of Community tuples (format, protocol etc…)
Discovery of Community tuples
Design
[Diagram: several peers, each made up of User, WebAdapter, NetworkAdapter and Repository, connected through the network layer]
Technologies
Java
Tomcat Servlet Container
Java Server Pages (JSP) + Servlets
XSLT (transforms), XPath (queries)
Java components for XSLT, XPath (Xerces, Xalan)
eXist XML Database
Log4j (logging infrastructure), JUnit (unit testing)
Evaluation and Validation: Areas of Interest
Publish and Search times as Community size increases
Breaking down Publish and Search operations
Community effect
Multiple central servers
Publish
[Chart: Time to publish a file. x-axis: Number of Files (1 to 991); y-axis: Milliseconds (0 to 2500). Series: Publish time, with linear trend y = 0.4473x + 260.84]
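The fitted trend line y = 0.4473x + 260.84 can be read as a rough predictor of publish time. The sketch below evaluates it at a few points, assuming x is the number of files already in the community and y is the time in milliseconds to publish one more file; that reading of the axes is an assumption, and `publish_time_ms` is a hypothetical helper.

```python
def publish_time_ms(n_files, slope=0.4473, intercept=260.84):
    """Publish time in ms predicted by the linear trend at a given community size."""
    return slope * n_files + intercept


for n in (1, 500, 1000):
    print(n, round(publish_time_ms(n), 1))
```

Under this reading, each additional file adds a little under half a millisecond to the cost of the next publish, on top of a fixed overhead of roughly a quarter of a second.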
Search
[Chart: Search time vs. Number of Files. x-axis: 100s of files (1 to 15); y-axis: Milliseconds (0 to 2000). Series: Search time]
Community Effect
[Chart: Time to Publish With Communities Present. x-axis: Number of Files (1 to 751); y-axis: Milliseconds (0 to 1400). Series: Time to add files (250 file groups)]
[Chart, shown for comparison: Time to publish a file. x-axis: Number of Files; y-axis: Milliseconds. Series: Publish time, with linear trend y = 0.4473x + 260.84]
Average Publish Time
  Multiple communities: 356 ms
  Single community: 485 ms
Multiple Central Servers
[Diagram: Single Server Deployment vs. Multiple Server Deployment. Single: one Central Server hosts both Root and Community A; the Client retrieves Community A info and searches/publishes in Community A on that server. Multiple: the Client retrieves Community A info from Central Server (Root), then searches/publishes in Community A on Central Server (A)]
Publish with Multiple Servers
Server  Processor   Speed    OS
1       Pentium 4   1.8 GHz  Windows 2000
2       Pentium II  250 MHz  Linux (RH7)
3       Celeron     1 GHz    Windows XP
[Chart: Time to Publish Files with Three Servers. x-axis: Number of files (250 / server), 1 to 501; y-axis: Milliseconds (0 to 3500). Series: Time to Publish (avg: 517.732 ms), across Server 1, Server 2 and Server 3]
Vs. Without Multiple Central Servers
Server                      Avg. time to publish a file (750 files published)
S1                          455 ms
S2                          1355 ms
S3                          645 ms
S1, S2, S3 (load-balanced)  517 ms
Contributions
Standard Metadata Layer
  All communities include support for explicit metadata search and creation
User-designed Communities
  Users can easily share new formats with full support for metadata
Community for Communities
  Prevents fragmented, isolated communities by providing metadata about communities and a standard method for discovering them
Performance and Scalability Gains
  Communities can improve performance and scalability vs. systems where resources are undifferentiated
Future Work
Performance improvements
Protocol independence (adapters for Gnutella, Freenet, etc.)
Community-aware Gnutella routing
More Community parameters (security, authentication, etc.)
Future Work continued
Trust metrics (to differentiate between communities, metadata quality)
Community evolution
Inheritance and multiple inheritance for Communities
U-P2P Publications
A. Mukherjee, B. Esfandiari, N. Arthorne, "U-P2P: A Peer-to-peer System for Description and Discovery of Resource-sharing Communities", ICDCS Workshops 2002: 701-705, July 2002.
N. Arthorne, B. Esfandiari, A. Mukherjee, "U-P2P: A Peer-to-peer Framework for Universal Resource Sharing and Discovery", Proceedings of Freenix track of Usenix 2003, 29-38, June 2003.
http://u-p2p.sourceforge.net
Backup slides
WebAdapter: User Interaction Model
[Diagram: Standard user interaction model for peer-to-peer applications: User, Application/UI, network layer. User interaction model for U-P2P: User, Web Browser, Web service, network layer]
Repository Design
[Diagram: Community, Resource Collection, Resource (Id: 1, Id: 2, Id: 3) and Attachment entities]
Repository Design: Resource IDs
Traditional model
  User designates /music as the root directory beneath which all files and directories are shared.
U-P2P model
  Resource IDs act as indirect references to files anywhere in the file system.
[Diagram: file system tree with Root (/), /genes, /molecules, /music and /music/rock, holding file1 to file12]
Resource IDs
  4d5e…f7  /molecules/file1
  82db…0a  /genes/file5
  9e40…f9  /music/rock/file10
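The indirection that resource IDs provide can be sketched as a small lookup table from ID to file path. The slides do not say how IDs are generated; hashing the file content with MD5 is purely an assumption for this example, as are the `ResourceRegistry` class and its method names.

```python
import hashlib


class ResourceRegistry:
    """Maps opaque resource IDs to file paths anywhere in the file system."""

    def __init__(self):
        self._paths = {}

    def register(self, path, content):
        """Record a shared file and return its resource ID."""
        # Assumption: the ID is a hash of the content; the real scheme may differ.
        resource_id = hashlib.md5(content).hexdigest()
        self._paths[resource_id] = path
        return resource_id

    def resolve(self, resource_id):
        """Return the local path behind a resource ID."""
        return self._paths[resource_id]


registry = ResourceRegistry()
gene_id = registry.register("/genes/file5", b"gene annotation data")
song_id = registry.register("/music/rock/file10", b"rock song data")

print(gene_id, "->", registry.resolve(gene_id))
print(song_id, "->", registry.resolve(song_id))
```

Because peers exchange only the ID, files from unrelated directories can be shared without exposing a common root directory.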
Repository Design: XML Database
Requirements
  Flexibility to store wide variety of formats
  Handle powerful queries over all metadata
XML database better suited than RDBMS
  Difficult to map fields to rows and columns
Chose eXist XML database
  Open source
  Written in Java
  Support for XML:DB API
Network Adapter Design
Abstract interface to peer-to-peer network
  Routing search requests, handling results, handling incoming search requests, etc.
Only implemented hybrid model (Napster model)
All peers can act as client and/or server
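An abstract network interface of this kind might look like the sketch below: the rest of the application calls `publish` and `search` without knowing which protocol sits underneath. U-P2P itself is written in Java, so this Python version only illustrates the shape; the method names, the `InMemoryAdapter` class and the use of a predicate as the query are all assumptions (the slides mention XPath for queries).

```python
from abc import ABC, abstractmethod


class NetworkAdapter(ABC):
    """Protocol-independent view of the peer-to-peer network."""

    @abstractmethod
    def publish(self, community, resource_id, metadata):
        """Announce a resource to the network."""

    @abstractmethod
    def search(self, community, query):
        """Return the IDs of resources in a community that match the query."""


class InMemoryAdapter(NetworkAdapter):
    """Trivial adapter keeping everything in one process, useful for testing."""

    def __init__(self):
        self._index = {}

    def publish(self, community, resource_id, metadata):
        self._index.setdefault(community, {})[resource_id] = metadata

    def search(self, community, query):
        # Here a query is a predicate over the metadata dictionary.
        resources = self._index.get(community, {})
        return [rid for rid, meta in resources.items() if query(meta)]


adapter = InMemoryAdapter()
adapter.publish("stamps", "id1", {"country": "Canada"})
adapter.publish("stamps", "id2", {"country": "France"})
print(adapter.search("stamps", lambda meta: meta["country"] == "Canada"))
```

An adapter for another protocol would implement the same two methods, which is what makes protocol independence a matter of adding classes rather than changing callers.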
Network Adapter: Protocol
Register (Peer and Central Server)
  1. RegisterRequest( community, resource id )
  2. RegisterResponse( is resource known? )
  3. RegisterRequest( community, resource id, metadata )
Search (Peer and Central Server)
  1. SearchRequest( community, query )
  2. SearchResponse( results )
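The register exchange can be sketched as follows, under the assumption that the peer sends the metadata (step 3) only when the server answers that the resource is not yet known. The slides do not state that condition, and the class and function names below are hypothetical; message transport is replaced by direct method calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisterRequest:
    community: str
    resource_id: str
    metadata: Optional[dict] = None  # only present in step 3


@dataclass
class RegisterResponse:
    known: bool  # "is resource known?"


class CentralServer:
    def __init__(self):
        self.index = {}  # (community, resource id) -> metadata

    def handle_register(self, request):
        key = (request.community, request.resource_id)
        if request.metadata is not None:
            self.index[key] = request.metadata
        return RegisterResponse(known=key in self.index)


def publish(server, community, resource_id, metadata):
    """Run the register exchange; return how many requests the peer sent."""
    reply = server.handle_register(RegisterRequest(community, resource_id))  # steps 1 and 2
    if reply.known:
        return 1  # assumed shortcut: the server already holds the metadata
    server.handle_register(RegisterRequest(community, resource_id, metadata))  # step 3
    return 2


server = CentralServer()
print(publish(server, "stamps", "id1", {"country": "Canada"}))
print(publish(server, "stamps", "id1", {"country": "Canada"}))
```

Under that assumption, the first request saves the peer from uploading metadata the server already has.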
Evaluation and Validation: Challenges
Finding large XML collections
  Berkeley Drosophila Genome Project: genome annotations
  Other sources: DBLP (CS papers), EDGAR (SEC filings), GeneOntology (gene-related concepts)
Transforming DTDs to XML Schema (DTDXS package)
Automation
  XML-RPC interface for publish and search
Publish: Breakdown of Operations
[Diagram: publish path between User, browser, Client (data structures, database) and Server (data structures, database), with steps labelled 1, 2, 3 and 3b]
Publish: Client Timings
[Chart: Time to Store File in Client DB. x-axis: Number of files (1 to 991); y-axis: Milliseconds (0 to 600). Series: Time to store in client db, with linear trend y = 0.1232x + 48.702]
Publish: Server Timings
[Chart: Comparison of Total Publish Time vs. Server Operations. x-axis: Number of Files (1 to 991); y-axis: Milliseconds (0 to 1000). Series: Publish to Server; Resource Lookups + Stored in Db; each with a linear trend (y = 0.3622x + 114.72 and y = 0.3476x + 47.319)]
Search: Breakdown of Operations
[Diagram: search path between User, browser, Client and Server (data structures, database), with steps labelled 1 and 2]
Search: Total vs. Server Timings
[Chart: Components of Search Operation. x-axis: Number of Searches (58 / 100 files), 1 to 701; y-axis: Milliseconds (0 to 2500). Series: Time to Search Server Database; Total Search Time]