NeXtworking’03, June 23-25, 2003, Chania, Crete, Greece
The First COST-IST(EU)-NSF(USA) Workshop on EXCHANGES & TRENDS IN NETWORKING
Recent Problems in Peer-to-peer Content Retrieval
Brian Neil Levine
Dept. of Computer Science
UMass Amherst

The work by BNL and his students presented here was supported in part by National Science Foundation awards ANI-033055 and EIA-0080199.
Motivation
• Peer-to-peer content sharing is one of the largest portions of traffic on the network.
• Whether the sharing is illegal (gnutella, kazaa) or not (Apple iTunes), understanding the characteristics of such traffic is important to a well-performing Internet.
• This talk:
  – What’s being done in p2p content & retrieval.
  – Overview of research in p2p traffic measurement.
  – How such measurements can affect p2p design.
What is a p2p architecture?
[Figure: design space of architectures. Axes: resources out of your pocket to make it work (= money), from little to lots; peers required to make it work, up to many; chance you’ll be held accountable. Labels on the plot: Centralized, Distributed, P2P, Robust P2P, successful, unsuccessful, robust/fault-tolerant, over-budgeted.]
Overview of P2P research problems
• Content search
  – P2P designs are not one-size-fits-all.
  – Different applications require different solutions.
• Peer selection
  – Finding the best peer of many serving a file…
• Incentives for peers to participate
• Security and privacy
• Evaluation against measurement traces
  – What does real p2p traffic look like?
  – What’s the real performance of these protocols?
Circular Pegs, square holes…
• DHTs work great when:
  – Each node is associated with a unique keyword (e.g., SOS).
  – The keywords stored are well-known
    • e.g., DNS lookup using a DHT
  – Hashes of keywords ensure work is evenly distributed
• Libraries of content?
• Real measurements show:
  – Nodes store more than one file; each file brings at least one keyword
    • h(“The Red Hot Chili Peppers”, “Breaking the girl”)
  – Content search is difficult: index each term? Or index whole title? Or part?
    • h(“red”), h(“hot”), h(“chili”),…
    • h(“let”), h(“there”), h(“be”), h(“light”)…
  – Some stored keywords are more popular than others.
  – Some queried keywords are more popular than others.
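The indexing choice above (one key for the whole title versus one key per term) can be sketched in a few lines. This is a minimal illustration, not the design of any particular DHT: the SHA-1 hash, the lowercase normalization, and the modulo placement in `responsible_node` are all assumptions made for the example.

```python
import hashlib

def h(*parts: str) -> int:
    """Hash one or more strings to a 160-bit key (SHA-1 is an assumption)."""
    joined = "\x00".join(p.lower() for p in parts)
    return int.from_bytes(hashlib.sha1(joined.encode("utf-8")).digest(), "big")

def responsible_node(key: int, num_nodes: int) -> int:
    """Toy placement rule: key modulo node count (not a real DHT's rule)."""
    return key % num_nodes

def term_keys(title: str) -> dict[str, int]:
    """One key per distinct term in the title."""
    return {term: h(term) for term in sorted(set(title.lower().split()))}

artist, song = "The Red Hot Chili Peppers", "Breaking the girl"

# Whole-title indexing: a single key, so only exact-match lookups succeed.
whole_key = h(artist, song)

# Per-term indexing: keyword search works, at the price of one index entry
# (and one peer to contact) per distinct term.
per_term = term_keys(f"{artist} {song}")

print(responsible_node(whole_key, 100))
print({term: responsible_node(key, 100) for term, key in per_term.items()})
```

A single file here already produces seven distinct term keys, which is the gap between a single-key system such as DNS and a library of content.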
How many keys per new user in your app?
[Figure: number of unique keys (y-axis) versus number of files in user library (x-axis); the plot is annotated “806.0 xy”.]
• DNS: 1-2 keys per authoritative domain.
• [Left]: Unique terms in real collections of shared files (based on file names only, not ID3 tags).
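One way to count the unique terms a library contributes, using file names only, is sketched below. The tokenization rule (split on anything that is not a letter or digit, drop the extension) and the example file names are assumptions for illustration; the measurement study may have tokenized differently.

```python
import re

def unique_terms(file_names):
    """Return the set of distinct lowercase terms across a list of file names."""
    terms = set()
    for name in file_names:
        stem = name.rsplit(".", 1)[0]  # drop the extension, if any
        terms.update(t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t)
    return terms

# Hypothetical library; the names are made up for the example.
library = [
    "Red Hot Chili Peppers - Breaking the Girl.mp3",
    "Red Hot Chili Peppers - Under the Bridge.mp3",
    "let_there_be_light.avi",
]
keys = unique_terms(library)
print(len(library), "files ->", len(keys), "unique keys")
```

Terms shared between files are counted once, so the number of keys grows more slowly than the number of files.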
Cost of indexing files in DHTs
[Figure: percentage of peers contacted to index files (y-axis, 0% to 100%) versus cumulative percentage of peers, ranked (x-axis).]
e.g., in a 100-node network, 40% of the nodes must contact 100% of the peers to index filenames for each join and leave.
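The cost in this example can be estimated with a rough sketch: hash each term a node shares to the peer that indexes it, then count the distinct peers. The hash-modulo placement in `peer_for_key` is an assumption standing in for a real DHT's key-to-node mapping, and the term lists are synthetic.

```python
import hashlib

def peer_for_key(term: str, num_peers: int) -> int:
    """Toy mapping of a term to the peer that indexes it (hash modulo N)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_peers

def fraction_contacted(terms, num_peers: int) -> float:
    """Fraction of all peers a node must contact to publish its terms."""
    contacted = {peer_for_key(t, num_peers) for t in set(terms)}
    return len(contacted) / num_peers

small_library = ["red", "hot", "chili"]
large_library = [f"term{i}" for i in range(5000)]  # synthetic terms
print(fraction_contacted(small_library, 100))
print(fraction_contacted(large_library, 100))
```

Once a library holds many more distinct terms than there are peers, nearly every peer ends up holding part of its index, and each join or leave touches all of them.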
Methods of p2p search
• Distributed Hash Tables
  – CAN, Chord, Pastry, etc…
• Distribute the index
• Cost: updating pointers to content
• Flooded search over
  – Random graphs
  – Small-world networks
  – Power-law degree networks
• Return results only on the content you have stored
• Make it easy for searches to traverse the graph
• Cost: updating the graph; group similar nodes together
• Links represent
  – Nothing
  – Relational autocorrelation
• “Heat-seeking search” over an organized network.
[Slide annotations: “Much focus”, “Not enough focus”]
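A flooded search can be sketched as a breadth-first traversal with a hop limit, where each node answers only from its own stored content. The function name, the adjacency-list graph, and the `ttl` parameter are assumptions for this sketch; it is not Gnutella's wire protocol.

```python
from collections import deque

def flood_search(graph, start, query, library, ttl):
    """Flood `query` from `start` for at most `ttl` hops.

    Returns (hits, nodes_reached); nodes_reached includes the origin.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    hits = []
    while frontier:
        node, hops = frontier.popleft()
        # Each node reports only on the content it stores itself.
        if query in library.get(node, ()):
            hits.append(node)
        if hops == ttl:
            continue  # hop budget spent: do not forward further
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return hits, len(seen)

# Hypothetical five-node overlay and content placement.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
library = {"c": {"salsa"}, "e": {"salsa"}}
print(flood_search(graph, "a", "salsa", library, ttl=1))
print(flood_search(graph, "a", "salsa", library, ttl=3))
```

Raising the hop limit finds more copies but reaches more nodes, which is the cost that organizing the graph tries to reduce.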
Searching for Topics not files…
• Information Retrieval searches:
  – Show me all documents that are related to “salsa dancing” (as Google does)
• You can’t index every word of every document
  – It’s hard enough to handle file names.
• One approach: place nodes with similar content together.
Arranging topology to match content
[Figure: recall (y-axis, 0 to 1) versus nodes contacted by BFS of the graph (x-axis, 0 to 100). Curves: Optimal, Per-query Arrangement, Arrangement, Random (gnutella).]
• Arrange topology so that we increase the amount of relevant information returned to peers for limited BFS of the graph.
• Tough problem!
• Can you find answers without flooding? Can you route queries towards content?
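The metric on this slide, recall for a limited BFS, can be illustrated on two toy topologies over the same nodes. The graphs, node names, and the convention that the querying node counts toward the budget are assumptions for the example, not the arrangement algorithm evaluated in the talk.

```python
from collections import deque

def recall_after_bfs(graph, start, relevant, budget):
    """Recall after contacting at most `budget` nodes in BFS order from `start`."""
    if not relevant:
        return 0.0
    seen = {start}
    queue = deque([start])
    contacted, found = 0, 0
    while queue and contacted < budget:
        node = queue.popleft()
        contacted += 1  # the origin counts as the first contacted node
        if node in relevant:
            found += 1
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return found / len(relevant)

relevant = {"r1", "r2"}
# Nodes holding relevant content sit next to each other.
clustered = {"q": ["r1", "x1"], "r1": ["r2"], "x1": ["x2"]}
# The same relevant nodes placed behind unrelated ones.
scattered = {"q": ["x1", "x2"], "x1": ["r1"], "x2": ["r2"]}

for budget in (3, 4):
    print(budget,
          recall_after_bfs(clustered, "q", relevant, budget),
          recall_after_bfs(scattered, "q", relevant, budget))
```

With the same budget, the topology that groups similar content returns more of the relevant nodes.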
Retrieval (briefly)
• Content is likely to be available from several peers.
• From which peer do you download?
  – Random (current approach)
  – Heuristics (ping, hop count, dl time)
    • (but, most peers you’ve never seen before)
  – Learned/Adaptive methods (e.g., MDPs)
    • See [BZLS; IPTPS’03]
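The three selection styles can be contrasted in a short sketch. The learned variant below is an epsilon-greedy rule over observed throughput, a much simpler stand-in for the MDP-based methods cited above and not the method of that paper; all names and parameters are hypothetical.

```python
import random

def pick_random(peers, rng):
    """Random selection: any peer serving the file."""
    return rng.choice(peers)

def pick_lowest_ping(peers, ping_ms):
    """Heuristic selection: peers with no measurement sort last."""
    return min(peers, key=lambda p: ping_ms.get(p, float("inf")))

class EpsilonGreedySelector:
    """Learns average throughput per peer; explores with probability epsilon."""

    def __init__(self, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.total = {}  # summed observed throughput per peer
        self.count = {}  # number of downloads observed per peer

    def choose(self, peers):
        untried = [p for p in peers if p not in self.count]
        if untried:
            return untried[0]  # try every peer once before exploiting
        if self.rng.random() < self.epsilon:
            return self.rng.choice(peers)
        return max(peers, key=lambda p: self.total[p] / self.count[p])

    def update(self, peer, throughput):
        self.total[peer] = self.total.get(peer, 0.0) + throughput
        self.count[peer] = self.count.get(peer, 0) + 1

selector = EpsilonGreedySelector(epsilon=0.0)
for observed in (10.0, 50.0):  # made-up throughput samples
    peer = selector.choose(["a", "b"])
    selector.update(peer, observed)
print(selector.choose(["a", "b"]))
```

The heuristic needs a prior measurement for each peer, which is exactly what is missing when most peers have never been seen before; the learned rule pays for that with exploratory downloads.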
Selecting for both accuracy and speed
• Of the set of 100, IR techniques will choose the servers they believe are most accurate (red).
• Selecting nodes for best transfer times picks a different set (green).
• Trivial composition doesn’t work.
[Figure: a client and its set of candidate servers.]
Some other lessons learned from measurement (openNap)
Ratio of audio:video

              Shared    Transferred
# of files    20:1      1:1
# of bytes    1:1       0.06:1
• What happened to content delivery on the Internet?
• What happened to serving video on the Internet?
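The audio:video ratios above imply an average file-size gap that is easy to work out: dividing the file-count ratio by the byte ratio gives the average video size relative to the average audio size. The function name is made up for this sketch, and the inputs are the rounded ratios from the table.

```python
def video_to_audio_size_ratio(files_audio_per_video, bytes_audio_per_video):
    """Average video file size divided by average audio file size.

    With N = file counts and B = byte counts:
    (B_v / N_v) / (B_a / N_a) = (N_a / N_v) / (B_a / B_v)
    """
    return files_audio_per_video / bytes_audio_per_video

shared = video_to_audio_size_ratio(20, 1)
transferred = video_to_audio_size_ratio(1, 0.06)
print(shared, round(transferred, 1))
```

Under these ratios the average video file is roughly 17 to 20 times the size of the average audio file, so video dominates the bytes transferred even when the file counts are equal.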
Who’s transferring/serving files? (openNap)
[Figure: percentage of all down/uploads (y-axis) versus percentage of users down/uploading (x-axis).]
Session Lengths (gnutella)
[Figure: percentage of all sessions > x (y-axis) versus length of node availability in 10-minute increments (x-axis).]
Balance of work in Chord (simulation based on real traces)
[Figure: cumulative percentage of work doing “x” performed (y-axis, 0% to 100%) versus percentage of all nodes, ranked (x-axis). Curves: Equal work, Keys indexed, Queries Resolved, Msgs rcvd, Msgs sent.]
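A curve of this kind can be computed from per-node work counts as sketched below. The sketch assumes nodes are ranked from busiest to least busy, which the figure does not state, and the counts are made up rather than taken from the traces.

```python
def cumulative_share(work):
    """Cumulative fraction of total work, nodes ranked from busiest to idlest."""
    ranked = sorted(work, reverse=True)
    total = sum(ranked)
    shares, running = [], 0
    for w in ranked:
        running += w
        shares.append(running / total if total else 0.0)
    return shares

def top_fraction_share(work, fraction):
    """Share of all work done by the top `fraction` of nodes.

    `work` must be non-empty; at least one node is always counted.
    """
    shares = cumulative_share(work)
    k = max(1, round(fraction * len(work)))
    return shares[k - 1]

queries_resolved = [50, 20, 15, 10, 5]  # hypothetical per-node counts
equal_work = [20] * 5

print(top_fraction_share(queries_resolved, 0.2))
print(top_fraction_share(equal_work, 0.2))
```

Under equal work the top 20% of nodes do 20% of the work; the further a curve rises above that line, the less balanced the load.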
Does caching queries balance load? (simulation based on real traces)
• Normal: 20% of nodes answer 84% of the queries.
• Cached (infinite buffer): 20% of nodes answer 55% of the queries.
• Answer: yes, but still a problem.
Some Measurements of P2P
• Ripeanu et al. – Gnutella topology does not match underlying network topology. MMCN’02
• Markatos – A simple query caching scheme can reduce query traffic by a factor of two. CCGrid 2002
• Saroiu et al. – Gnutella bandwidth, latency, and node availability over a 60-hour period. Multimedia Systems Journal v8n6
• Adar and Huberman – A free-rider study, using Gnutella’s QueryHit messages to infer peer downloads.
• Chu, Labonte, Levine – Measurements of Napster and Gnutella file popularity and session lengths. Proc. ITCom 2002
• Bhagwan et al. – Effects of DHCP on availability of nodes in p2p, TOD, joins and leaves. IPTPS 2003
• Chu, Labonte, Levine – Measurements of all transfers and most libraries in a large p2p system (openNap); evaluation of Chord
Summary and Open Issues
• Applications of p2p are broad.
• Methods other than DHT are possible.
• Measurement studies have revealed the skewed distributions of p2p systems.
  – Can these be modeled?
• DHTs are limited in their application to content sharing.
  – Work well for single-key systems
• Stronger efforts are needed to match research designs to real characteristics of systems.
• Thanks to Jacky Chu and Kevin Labonte for doing the balance of the work.