Comparing Hybrid Peer-to-Peer Systems
Beverly Yang and Hector Garcia-Molina
Presented by Marco Barreno, November 3, 2003
CS 294-4: Peer-to-peer systems
Hybrid peer-to-peer systems
Pure peer-to-peer systems are hard to scale
Gnutella
Look at hybrids between p2p and server-client
Servers will index files, clients download from each other directly
Searching can be done more efficiently on a server
Napster (but Napster had its own problems...)
Several other architectures
Questions for hybrid systems
Best way to organize servers?
Index replication policy?
What queries are submitted often?
How do we deal with churn?
How do query patterns affect performance?
Contributions of this paper
Presents several architectures for hybrid systems
Presents and evaluates a probabilistic model for queries
Compares architectures quantitatively, based on their models and the music sharing domain
Compares strategies in non-music-sharing domains (a bit)
General concepts: basic actions
Login
A client connects to a server and uploads metadata about the files it offers
The client is a local user to that server and a remote user to the others
Query
A list of words to search on
Satisfied if preset maximum number of results found
Download
Contact peer directly after getting info from server
Batch vs. incremental logins
Batch: on login/logout, user's entire metadata set is added/removed
Allows index to remain small, but login/logout is expensive
Incremental: metadata kept in index at all times, and only deltas are sent at login
Saves much effort on login/logout
Queries become more expensive, as server must filter for online users
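The batch/incremental tradeoff above can be sketched as two toy index classes (a minimal illustration; class and method names are my own, and metadata is modeled as plain strings):

```python
class BatchIndex:
    """Batch logins: the full metadata set is uploaded on login and
    removed on logout, so the index holds only online users."""
    def __init__(self):
        self.index = {}  # user -> list of file-metadata strings

    def login(self, user, files):
        self.index[user] = list(files)   # expensive: everything travels

    def logout(self, user):
        del self.index[user]             # expensive: everything removed

    def query(self, word):
        # cheap: every indexed user is known to be online
        return [f for files in self.index.values()
                for f in files if word in f]


class IncrementalIndex:
    """Incremental logins: metadata stays in the index across sessions;
    a login sends only the delta since the last session."""
    def __init__(self):
        self.index = {}    # user -> list of file-metadata strings (persists)
        self.online = set()

    def login(self, user, added=(), removed=()):
        files = self.index.setdefault(user, [])
        files.extend(added)              # cheap: only the delta travels
        for f in removed:
            files.remove(f)
        self.online.add(user)

    def logout(self, user):
        self.online.discard(user)        # metadata is kept in the index

    def query(self, word):
        # more expensive: must filter matches down to online users
        return [f for u in self.online
                for f in self.index[u] if word in f]
```

The query methods make the cost shift concrete: `BatchIndex.query` scans a small, all-online index, while `IncrementalIndex.query` must consult the online set for every match.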
Architectures (1)
Chained architecture
Servers are arranged in a linear chain (ring?)
Each server keeps metadata for local users
Unsatisfied queries sent along chain
Logins and downloads scalable; queries potentially expensive
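A minimal sketch of chained query forwarding (function names and the string-matching model are illustrative, not from the paper): each server indexes only its local users, and the query walks the chain until MaxResults hits are found or the chain is exhausted.

```python
def chained_query(servers, word, max_results=3):
    """servers: list of per-server indexes, each mapping
    filename -> metadata string. Returns (results, servers_used)."""
    results = []
    servers_used = 0
    for index in servers:                  # walk the chain from the entry server
        servers_used += 1
        results += [name for name, meta in index.items() if word in meta]
        if len(results) >= max_results:    # query satisfied: stop forwarding
            break
    return results[:max_results], servers_used
```

This is why logins stay cheap (purely local) while an unpopular query may touch every server in the chain.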
Architectures (2)
Full replication architecture
Each server keeps metadata about all users
Logins expensive
Queries cheap
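The cost asymmetry can be sketched in a few lines (an illustrative toy, not the paper's protocol): a login is broadcast to every server, so any single server can answer any query locally.

```python
def full_repl_login(indexes, user, files):
    # expensive: the metadata is replicated to every server
    for index in indexes:
        index[user] = list(files)

def full_repl_query(indexes, entry, word, max_results=3):
    # cheap: the entry server alone has a complete view of all users
    index = indexes[entry]
    hits = [f for files in index.values() for f in files if word in f]
    return hits[:max_results]
```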
Architectures (3)
Hash architecture
Metadata words hashed so a particular server is responsible for a particular subset of them
Queries sent to relevant servers
On login, metadata sent to all relevant servers
Limited number of servers need to see each query, but sending the lists may be expensive
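A minimal sketch of the hash partitioning (the hash choice and function names are my own; a multi-word query intersects the hits from each word's home server):

```python
import hashlib

def home_server(word, num_servers=4):
    """Deterministically map a metadata word to the server that owns it."""
    digest = hashlib.md5(word.encode()).digest()
    return digest[0] % num_servers

def hash_login(indexes, user, files):
    # each metadata word is sent to the server responsible for it,
    # which is why logins may need to contact many servers
    for fname, words in files.items():
        for w in words:
            indexes[home_server(w)].setdefault(w, set()).add((user, fname))

def hash_query(indexes, words):
    # only the servers owning the query's words are contacted
    hits = [indexes[home_server(w)].get(w, set()) for w in words]
    return set.intersection(*hits) if hits else set()
```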
Architectures (4)
Unchained architecture
Servers are independent and don't communicate
A user can only search files on the server he/she connects to
Napster
Disadvantage: user's views are limited
Advantage: scales very well (as servers, users increase together)
Query model
Universe of queries: q1, q2, q3, ...; densities f, g
g(i) is the probability that a submitted query is query qi (query popularity)
f(i) is the probability that any given file will match query qi (selection power)
g tells us what queries users like to submit, while f tells us which files users like to store
Expected results for chained
ExServ = Expected number of servers needed to obtain R results (MaxResults)
If P(s) is the probability that exactly s servers are needed to return R or more results, then ExServ = Σ_s s · P(s)
ExLocalResults based on (UsersPerServer * FilesPerUser) files
ExTotalResults based on (ExLocalResults * k) files
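As a worked sketch of ExServ for the chained case, assume each file matches the query independently with probability p = f(i) (a simplifying assumption of mine, not necessarily the paper's exact derivation). Then Q(n), the probability of not yet having R results after n files, is a binomial tail, and P(s) follows from differencing Q at server boundaries:

```python
from math import comb

def prob_at_least(n, p, r):
    """P(at least r matches among n independent files)."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m)
               for m in range(r, n + 1))

def ex_serv(k, files_per_server, p, r):
    """Expected servers visited on a chain of k servers; if the whole
    chain cannot produce R results, all k servers are still visited."""
    Q = lambda n: 1 - prob_at_least(n, p, r)   # not satisfied after n files
    expected = 0.0
    for s in range(1, k + 1):
        if s < k:
            # exactly s servers: satisfied after s but not after s - 1
            P_s = Q((s - 1) * files_per_server) - Q(s * files_per_server)
        else:
            # the last server handles every query that got this far
            P_s = Q((k - 1) * files_per_server)
        expected += s * P_s
    return expected
```

Sanity checks: with p = 1 one server always suffices (ExServ = 1), and with p = 0 every query traverses all k servers (ExServ = k).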
Expected values for others
ExServ trivially 1 for full replication and unchained
ExServ is equivalent to balls-in-bins for hash
Distributions for f() and g()
Exponential distributions work well for music domain:
Monotonically decreasing
Popularity and selection power are correlated
Most popular has highest selection power, and so on
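The exponential model above can be sketched directly (scale parameters here are illustrative, not the paper's fitted values): both f and g are normalized decreasing exponentials over the query ranks, so the most popular query also has the highest selection power.

```python
import math

def normalized_exp(n, scale):
    """Normalized, monotonically decreasing exponential density
    over query ranks i = 1..n."""
    weights = [math.exp(-i / scale) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

N = 1000
g = normalized_exp(N, scale=20.0)    # query popularity
f = normalized_exp(N, scale=100.0)   # selection power
# same rank order in both densities => perfect correlation between
# popularity and selection power, as assumed for the music domain
```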
Validation of query model
M(n) = expected # results from n files
Q(n) = probability we don't get R results
These data gathered from OpenNap
Performance model
CPU cycles
Cost estimates based on examination and guesswork, plus some experiments
Matched OpenNap relatively well for batch logins
Inter-server bandwidth
Varies among architectures
Server-client bandwidth
Napster protocol: Login, AddFile, RemoveFile
Take min over resources (iterative estimation)
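The "min over resources" idea can be sketched as follows (the capacities and per-action costs are made-up numbers, and this one-shot version omits the iterative refinement): each resource alone bounds the sustainable action rate, and the system throughput is the smallest of those bounds.

```python
# capacity of each resource, in abstract units per second
capacities = {"cpu": 1e9, "server_bw": 1e8, "client_bw": 1e7}

# units of each resource consumed per user action (illustrative)
cost_per_action = {"cpu": 5e4, "server_bw": 2e3, "client_bw": 4e3}

# maximum action throughput each resource alone could sustain
per_resource = {r: capacities[r] / cost_per_action[r] for r in capacities}

# system throughput is limited by the bottleneck resource
throughput = min(per_resource.values())
bottleneck = min(per_resource, key=per_resource.get)
```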
Beyond music
f() and g() could be different
May be no or negative correlation
e.g. Adding “price > 0” to a query makes it less popular but doesn't change size of result set
e.g. Archive system will return more results from farther in the past (queries presumably rarer)
No or negative correlation can be modeled by adjusting the ratio of the parameters to f and g
No correlation: r = 1
Negative correlation: r >> 1