Comparing Hybrid Peer-to-Peer Systems
Beverly Yang and Hector Garcia-Molina
Presented by Marco Barreno, November 3, 2003
CS 294-4: Peer-to-peer systems
Hybrid peer-to-peer systems
Pure peer-to-peer systems are hard to scale
Gnutella
Look at hybrids between p2p and server-client
Servers will index files, clients download from each other directly
Searching can be done more efficiently on a server
Napster (but Napster had its own problems...)
Several other architectures
Questions for hybrid systems
Best way to organize servers?
Index replication policy?
What queries are submitted often?
How do we deal with churn?
How do query patterns affect performance?
Contributions of this paper
Presents several architectures for hybrid systems
Presents and evaluates a probabilistic model for queries
Compares architectures quantitatively, based on their models and the music sharing domain
Compares strategies in non-music-sharing domains (a bit)
General concepts: basic actions
Login
A client connects to a server and uploads metadata about the files it offers
The client is a local user to that server and a remote user to the others
Query
A list of words to search on
Satisfied if preset maximum number of results found
Download
Contact peer directly after getting info from server
Batch vs. incremental logins
Batch: on login/logout, user's entire metadata set is added/removed
Allows index to remain small, but login/logout is expensive
Incremental: metadata kept in index at all times, and only deltas are sent at login
Saves much effort on login/logout
Queries become more expensive, as server must filter for online users
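The batch/incremental tradeoff above can be sketched as two toy index classes (a minimal illustration; class and method names are my own, and metadata is modeled as plain strings):

```python
class BatchIndex:
    """Batch logins: the full metadata set is uploaded on login and
    removed on logout, so the index holds only online users."""
    def __init__(self):
        self.index = {}  # user -> list of file-metadata strings

    def login(self, user, files):
        self.index[user] = list(files)   # expensive: everything travels

    def logout(self, user):
        del self.index[user]             # expensive: everything removed

    def query(self, word):
        # cheap: every indexed user is known to be online
        return [f for files in self.index.values()
                for f in files if word in f]


class IncrementalIndex:
    """Incremental logins: metadata stays in the index across sessions;
    a login sends only the delta since the last session."""
    def __init__(self):
        self.index = {}    # user -> list of file-metadata strings (persists)
        self.online = set()

    def login(self, user, added=(), removed=()):
        files = self.index.setdefault(user, [])
        files.extend(added)              # cheap: only the delta travels
        for f in removed:
            files.remove(f)
        self.online.add(user)

    def logout(self, user):
        self.online.discard(user)        # metadata is kept in the index

    def query(self, word):
        # more expensive: must filter matches down to online users
        return [f for u in self.online
                for f in self.index[u] if word in f]
```

The query methods make the cost shift concrete: `BatchIndex.query` scans a small, all-online index, while `IncrementalIndex.query` must consult the online set for every match.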
Architectures (1)
Chained architecture
Servers are arranged in a linear chain (ring?)
Each server keeps metadata for local users
Unsatisfied queries sent along chain
Logins and downloads scalable; queries potentially expensive
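A minimal sketch of chained query forwarding (function names and the string-matching model are illustrative, not from the paper): each server indexes only its local users, and the query walks the chain until MaxResults hits are found or the chain is exhausted.

```python
def chained_query(servers, word, max_results=3):
    """servers: list of per-server indexes, each mapping
    filename -> metadata string. Returns (results, servers_used)."""
    results = []
    servers_used = 0
    for index in servers:                  # walk the chain from the entry server
        servers_used += 1
        results += [name for name, meta in index.items() if word in meta]
        if len(results) >= max_results:    # query satisfied: stop forwarding
            break
    return results[:max_results], servers_used
```

This is why logins stay cheap (purely local) while an unpopular query may touch every server in the chain.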
Architectures (2)
Full replication architecture
Each server keeps metadata about all users
Logins expensive
Queries cheap
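The cost asymmetry can be sketched in a few lines (an illustrative toy, not the paper's protocol): a login is broadcast to every server, so any single server can answer any query locally.

```python
def full_repl_login(indexes, user, files):
    # expensive: the metadata is replicated to every server
    for index in indexes:
        index[user] = list(files)

def full_repl_query(indexes, entry, word, max_results=3):
    # cheap: the entry server alone has a complete view of all users
    index = indexes[entry]
    hits = [f for files in index.values() for f in files if word in f]
    return hits[:max_results]
```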
Architectures (3)
Hash architecture
Metadata words hashed so a particular server is responsible for a particular subset of them
Queries sent to relevant servers
On login, metadata sent to all relevant servers
Limited number of servers need to see each query, but sending the lists may be expensive
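A minimal sketch of the hash partitioning (the hash choice and function names are my own; a multi-word query intersects the hits from each word's home server):

```python
import hashlib

def home_server(word, num_servers=4):
    """Deterministically map a metadata word to the server that owns it."""
    digest = hashlib.md5(word.encode()).digest()
    return digest[0] % num_servers

def hash_login(indexes, user, files):
    # each metadata word is sent to the server responsible for it,
    # which is why logins may need to contact many servers
    for fname, words in files.items():
        for w in words:
            indexes[home_server(w)].setdefault(w, set()).add((user, fname))

def hash_query(indexes, words):
    # only the servers owning the query's words are contacted
    hits = [indexes[home_server(w)].get(w, set()) for w in words]
    return set.intersection(*hits) if hits else set()
```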
Architectures (4)
Unchained architecture
Servers are independent and don't communicate
A user can only search files on the server he/she connects to
Napster
Disadvantage: user's views are limited
Advantage: scales very well (as servers, users increase together)
Query model
Universe of queries: q1, q2, q3, ...; densities f, g
g(i) is the probability that a submitted query is query qi (query popularity)
f(i) is the probability that any given file will match query qi (selection power)
g tells us what queries users like to submit, while f tells us which files users like to store
Expected results for chained
ExServ = Expected number of servers needed to obtain R results (MaxResults)
If P(s) is the probability that exactly s servers are needed to return R or more results, then ExServ = Σ_s s · P(s)
ExLocalResults based on (UsersPerServer * FilesPerUser) files
ExTotalResults based on (ExLocalResults * k) files
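As a worked sketch of ExServ for the chained case, assume each file matches the query independently with probability p = f(i) (a simplifying assumption of mine, not necessarily the paper's exact derivation). Then Q(n), the probability of not yet having R results after n files, is a binomial tail, and P(s) follows from differencing Q at server boundaries:

```python
from math import comb

def prob_at_least(n, p, r):
    """P(at least r matches among n independent files)."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m)
               for m in range(r, n + 1))

def ex_serv(k, files_per_server, p, r):
    """Expected servers visited on a chain of k servers; if the whole
    chain cannot produce R results, all k servers are still visited."""
    Q = lambda n: 1 - prob_at_least(n, p, r)   # not satisfied after n files
    expected = 0.0
    for s in range(1, k + 1):
        if s < k:
            # exactly s servers: satisfied after s but not after s - 1
            P_s = Q((s - 1) * files_per_server) - Q(s * files_per_server)
        else:
            # the last server handles every query that got this far
            P_s = Q((k - 1) * files_per_server)
        expected += s * P_s
    return expected
```

Sanity checks: with p = 1 one server always suffices (ExServ = 1), and with p = 0 every query traverses all k servers (ExServ = k).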
Expected values for others
ExServ trivially 1 for full replication and unchained
ExServ is equivalent to balls-in-bins for hash
Distributions for f() and g()
Exponential distributions work well for music domain:
Monotonically decreasing
Popularity and selection power are correlated
Most popular has highest selection power, and so on
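The exponential model above can be sketched directly (scale parameters here are illustrative, not the paper's fitted values): both f and g are normalized decreasing exponentials over the query ranks, so the most popular query also has the highest selection power.

```python
import math

def normalized_exp(n, scale):
    """Normalized, monotonically decreasing exponential density
    over query ranks i = 1..n."""
    weights = [math.exp(-i / scale) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

N = 1000
g = normalized_exp(N, scale=20.0)    # query popularity
f = normalized_exp(N, scale=100.0)   # selection power
# same rank order in both densities => perfect correlation between
# popularity and selection power, as assumed for the music domain
```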
Validation of query model
M(n) = expected # results from n files
Q(n) = probability we don't get R results
These data gathered from OpenNap
Performance model
CPU cycles
Cost estimates based on examination and guesswork, plus some experiments
Matched OpenNap relatively well for batch logins
Inter-server bandwidth
Varies among architectures
Server-client bandwidth
Napster protocol: Login, AddFile, RemoveFile
Take min over resources (iterative estimation)
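The "min over resources" idea can be sketched as follows (the capacities and per-action costs are made-up numbers, and this one-shot version omits the iterative refinement): each resource alone bounds the sustainable action rate, and the system throughput is the smallest of those bounds.

```python
# capacity of each resource, in abstract units per second
capacities = {"cpu": 1e9, "server_bw": 1e8, "client_bw": 1e7}

# units of each resource consumed per user action (illustrative)
cost_per_action = {"cpu": 5e4, "server_bw": 2e3, "client_bw": 4e3}

# maximum action throughput each resource alone could sustain
per_resource = {r: capacities[r] / cost_per_action[r] for r in capacities}

# system throughput is limited by the bottleneck resource
throughput = min(per_resource.values())
bottleneck = min(per_resource, key=per_resource.get)
```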
Beyond music
f() and g() could be different
May be no or negative correlation
e.g. Adding “price > 0” to a query makes it less popular but doesn't change size of result set
e.g. Archive system will return more results from farther in the past (queries presumably rarer)
No or negative correlation can be modeled by adjusting the ratio of the parameters to f and g
No correlation: r = 1
Negative correlation: r >> 1