NeXtworking’03, June 23-25, 2003, Chania, Crete, Greece. The First COST-IST(EU)-NSF(USA) Workshop on EXCHANGES & TRENDS IN NETWORKING
Recent Problems in Peer-to-peer Content Retrieval
Brian Neil Levine
Dept. of Computer Science
UMass Amherst
The work by BNL and his students presented here was supported in part by National Science Foundation awards ANI-033055 and EIA-0080199.
Motivation
• Peer-to-peer content sharing is one of the largest portions of traffic on the network.
• Whether the sharing is illegal (Gnutella, KaZaA) or not (Apple iTunes), understanding the characteristics of such traffic is important to a well-performing Internet.
• This talk:
  – What’s being done in p2p content & retrieval.
  – Overview of research in p2p traffic measurement.
  – How such measurements can affect p2p design.
What is a p2p architecture?
[Figure: design space of architectures. Axis labels: “Resources out of your pocket to make it work (= money)”, “Peers required to make it work”, “Chance you’ll be held accountable”. Scale marks: Little, Lots, Many. Labeled regions: Centralized; Distributed; P2P; Robust P2P; robust, fault-tolerant; over-budgeted; successful; unsuccessful.]
Overview of P2P research problems
• Content search
  – P2P designs are not one-size-fits-all.
  – Different applications require different solutions.
• Peer selection
  – Finding the best peer of many serving a file…
• Incentives for peers to participate
• Security and privacy
• Evaluation against measurement traces
  – What does real p2p traffic look like?
  – What’s the real performance of these protocols?
Circular Pegs, square holes…
• DHTs work great when:
  – Each node is associated with a unique keyword (e.g., SOS).
  – The keywords stored are well-known
    • e.g., DNS lookup using a DHT
  – Hashes of keywords ensure work is evenly distributed
• Libraries of content?
• Real measurements show:
  – Nodes store more than one file, each file brings at least one keyword
    • h(“The Red Hot Chili Peppers”, “Breaking the girl”)
  – Content search is difficult: index each term? Or index whole title? Or part?
    • h(“red”), h(“hot”), h(“chili”),…
    • h(“let”), h(“there”), h(“be”), h(“light”)…
  – Some stored keywords are more popular than others.
  – Some queried keywords are more popular than others.
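
The choice between indexing a whole title and indexing each term can be made concrete with a small sketch. It is only an illustration: the function names `h`, `title_key`, and `term_keys`, the use of SHA-1, the `|` separator, and the second library entry ("Example Artist") are assumptions, not anything specified in the talk.

```python
import hashlib
from collections import Counter

def h(key):
    """Map a string key to an integer identifier (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(key.lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

def title_key(artist, title):
    """One key per file: hash of the whole (artist, title) pair."""
    return h(artist + "|" + title)

def term_keys(artist, title):
    """One key per distinct term appearing in the artist and title."""
    return {h(term) for term in (artist + " " + title).lower().split()}

# Toy library; the second artist name is hypothetical.
library = [
    ("The Red Hot Chili Peppers", "Breaking the girl"),
    ("Example Artist", "Let there be light"),
]

whole = {title_key(a, t) for a, t in library}
per_term = set().union(*(term_keys(a, t) for a, t in library))

# How often each term occurs: common terms such as "the" map to a single key,
# so the node responsible for that key holds more index entries than others.
term_counts = Counter(
    term for a, t in library for term in (a + " " + t).lower().split()
)

print(len(whole), len(per_term), term_counts["the"])
```

Whole-title hashing gives one key per file but only supports exact-match lookup; per-term hashing supports keyword search at the cost of many more keys per library and uneven popularity across keys.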
How many keys per new user in your app?
[Figure: number of unique keys (y-axis) versus number of files in user library (x-axis); curve annotation as extracted: “806.0 xy”.]
• DNS: 1-2 keys per authoritative domain.
• [Left]: Unique terms in real collections of shared files (based on file names only, not ID3 tags).
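
As a rough sketch of how the number of unique keys for one user could be counted from file names alone, the following splits names on non-alphanumeric characters. The tokenization rule, the function name `unique_terms`, and the sample file names are assumptions; the measurement study may have tokenized differently.

```python
import re

def unique_terms(file_names):
    """Return the set of distinct lowercase terms found in the file names."""
    terms = set()
    for name in file_names:
        stem = name.rsplit(".", 1)[0]  # drop the extension, if there is one
        terms.update(t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t)
    return terms

# Hypothetical shared library of two files.
library = [
    "Red Hot Chili Peppers - Breaking the Girl.mp3",
    "red_hot_chili_peppers-under_the_bridge.mp3",
]
keys = unique_terms(library)
print(len(keys))
```

Terms shared between files are counted once, which is why the number of unique keys can grow more slowly than the number of files.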
Cost of indexing files in DHTs
[Figure: percentage of peers contacted to index files (y-axis, 0% to 100%) versus cumulative percentage of peers, ranked (x-axis).]
e.g., in a 100-node network, 40% of the nodes must contact 100% of the peers to index filenames for each join and leave.
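
The effect can be approximated with a short sketch: each key of a joining peer is placed on some node by hashing, and the peer must contact every distinct node that owns one of its keys. Placement by `hash mod N` is a simplifying assumption standing in for a real DHT's key assignment, and the names `node_for` and `fraction_contacted` are illustrative.

```python
import hashlib

def node_for(key, num_nodes):
    """Assign a key to a node by hashing (a stand-in for DHT key placement)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def fraction_contacted(keys, num_nodes):
    """Fraction of all nodes that must be contacted to publish every key."""
    owners = {node_for(k, num_nodes) for k in keys}
    return len(owners) / num_nodes

# Synthetic key sets: a small library and a large one.
small_library = ["term%d" % i for i in range(5)]
large_library = ["term%d" % i for i in range(2000)]

print(fraction_contacted(small_library, 100))
print(fraction_contacted(large_library, 100))
```

A library with only a few keys touches a few nodes, while a library with far more keys than there are nodes ends up contacting nearly all of them on every join and leave.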
Methods of p2p search
• Distributed Hash Tables
  – CAN, Chord, Pastry, etc…
• Distribute the index
• Cost: updating pointers to content
• Flooded search over
  – Random graphs
  – Small-world networks
  – Power-law degree networks
• Return results only on the content you have stored
• Make it easy for searches to traverse the graph
• Cost: updating the graph; group similar nodes together
• Links represent
  – Nothing
  – Relational autocorrelation
• “Heat-seeking search” over an organized network.
[Slide annotations: “Much focus”; “Not enough focus”.]
Searching for Topics not files…
• Information Retrieval searches:
  – Show me all documents that are related to “salsa dancing” (as Google does)
• You can’t index every word of every document
  – It’s hard enough to handle file names.
• One approach: place nodes with similar content together.
Arranging topology to match content
[Figure: recall (y-axis, 0 to 1) versus nodes contacted by BFS of the graph (x-axis, up to 100). Curves: Optimal, Per-query Arrangement, Arrangement, Random (gnutella).]
• Arrange topology so that we increase the amount of relevant information returned to peers for limited BFS of the graph.
• Tough problem!
• Can you find answers without flooding? Can you route queries towards content?
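
A minimal sketch of the quantity being measured, recall under a limited breadth-first search, may help. The two toy graphs, the set of relevant nodes, and the function name `bfs_recall` are assumptions made for illustration; they are not the topologies or traces used in the talk.

```python
from collections import deque

def bfs_recall(graph, relevant, start, budget):
    """Recall after contacting at most `budget` nodes by BFS from `start`.

    graph:    dict mapping each node to a list of its neighbors
    relevant: set of nodes that hold content relevant to the query
    """
    contacted = []
    seen = {start}
    queue = deque([start])
    while queue and len(contacted) < budget:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr in seen:
                continue
            if len(contacted) >= budget:
                break
            seen.add(nbr)
            contacted.append(nbr)
            queue.append(nbr)
    hits = relevant.intersection(contacted)
    return len(hits) / len(relevant) if relevant else 0.0

relevant = {1, 2}

# Topology arranged so the relevant nodes sit next to the querying node 0.
arranged = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
# Topology with no relation between links and content.
unarranged = {0: [4, 5], 4: [0, 1], 5: [0, 2], 1: [4], 2: [5, 3], 3: [2]}

print(bfs_recall(arranged, relevant, 0, 2))    # 1.0
print(bfs_recall(unarranged, relevant, 0, 2))  # 0.0
print(bfs_recall(unarranged, relevant, 0, 4))  # 1.0
```

With the same budget of two contacted nodes, the arranged graph reaches all relevant content while the unarranged graph reaches none and needs a larger flood to catch up.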
Retrieval (briefly)
• Content is likely to be available from several peers.
• From which peer do you download?
  – Random (current approach)
  – Heuristics (ping, hop count, dl time)
    • (but, most peers you’ve never seen before)
  – Learned/Adaptive methods (e.g., MDPs)
    • See [BZLS; IPTPS’03]
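
The first two options can be sketched as follows. This covers only random selection and a ping heuristic, not the learned MDP approach of the cited paper; the function names, the peer names, and the policy of ranking unmeasured peers last are assumptions.

```python
import random

def pick_random(peers, rng=random):
    """Baseline: choose any peer that serves the file."""
    return rng.choice(sorted(peers))

def pick_by_ping(peers, ping_ms, default_ms=float("inf")):
    """Heuristic: choose the peer with the lowest measured ping.

    Peers with no measurement get `default_ms`, so with the default value
    they are chosen only when no candidate has been measured at all.
    """
    return min(sorted(peers), key=lambda p: ping_ms.get(p, default_ms))

peers = {"peer-a", "peer-b", "peer-c"}       # hypothetical peers serving a file
ping_ms = {"peer-b": 40.0, "peer-c": 120.0}  # peer-a has never been seen

print(pick_by_ping(peers, ping_ms))  # peer-b
```

The sketch makes the slide's caveat visible: when most candidates have no history, the heuristic has nothing to rank them by and falls back to an arbitrary choice.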
Selecting for both accuracy and speed
• Of the set of 100, IR techniques will choose the servers they believe are most accurate (red).
• Selecting nodes for best transfer times picks a different set (green).
• Trivial composition doesn’t work.
[Figure: Client and the set of servers, with the red and green selections marked.]
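
One way to see why trivial composition fails is to pick the top servers under each criterion and intersect the two sets. The scores below are invented for illustration and the helper `top_k` is hypothetical; the point is only that the two rankings need not overlap.

```python
def top_k(scores, k, reverse):
    """Return the k best keys of `scores` (highest first if reverse=True)."""
    return set(sorted(scores, key=scores.get, reverse=reverse)[:k])

# Hypothetical per-server estimates: relevance in [0, 1], transfer time in seconds.
relevance = {"s1": 0.9, "s2": 0.8, "s3": 0.3, "s4": 0.2}
transfer_s = {"s1": 50.0, "s2": 40.0, "s3": 5.0, "s4": 4.0}

most_accurate = top_k(relevance, 2, reverse=True)  # the "red" set
fastest = top_k(transfer_s, 2, reverse=False)      # the "green" set
both = most_accurate & fastest

print(sorted(most_accurate), sorted(fastest), sorted(both))
```

Here the intersection is empty, so selecting for accuracy and then for speed (or the reverse) leaves nothing that satisfies both.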
Some other lessons learned from measurement (openNap)
Ratio of audio:video
              Shared    Transferred
# of files    20:1      1:1
# of bytes    1:1       0.06:1
• What happened to content delivery on the Internet?
• What happened to serving video on the Internet?
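
The ratios in the table imply a relative mean file size, which a few lines of arithmetic make explicit. This is a derived illustration, not a figure reported in the talk; the function name is an assumption.

```python
def video_to_audio_mean_size(files_ratio, bytes_ratio):
    """Mean video file size divided by mean audio file size.

    Both arguments are audio:video ratios expressed as a single number
    (e.g. 20:1 -> 20.0). Mean size is bytes / files for each media type,
    so the video/audio quotient reduces to files_ratio / bytes_ratio.
    """
    return files_ratio / bytes_ratio

shared = video_to_audio_mean_size(20.0, 1.0)        # 20.0
transferred = video_to_audio_mean_size(1.0, 0.06)   # about 16.7

print(shared, round(transferred, 1))
```

Under this reading, video files are roughly 17 to 20 times larger than audio files on average, and although they are a small share of what is stored, they account for the overwhelming share of bytes transferred.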
Who’s transferring/serving files? (openNap)
[Figure: percentage of all down/uploads (y-axis) versus percentage of users down/uploading (x-axis).]
Session Lengths (gnutella)
[Figure: percentage of all sessions > x (y-axis) versus length of node availability in 10 min. increments (x-axis).]
Balance of work in Chord (simulation based on real traces)
[Figure: cumulative percentage of work doing “x” performed (y-axis, 0% to 100%) versus percentage of all nodes, ranked (x-axis). Curves: Equal work, Keys indexed, Queries Resolved, Msgs rcvd, Msgs sent.]
Does caching queries balance load? (simulation based on real traces)
• Normal: 20% of nodes answer 84% of the queries.
• Cached (infinite buffer): 20% of nodes answer 55% of the queries.
• Answer: yes, but still a problem.
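
A statistic of the form "20% answer 84% of the queries" can be computed from per-node load as sketched here. The function name `top_share` and the query counts are made up for illustration and are not taken from the traces.

```python
def top_share(load_per_node, fraction):
    """Share of total load handled by the busiest `fraction` of nodes."""
    ranked = sorted(load_per_node, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Hypothetical query counts for 10 nodes, chosen to be heavily skewed.
skewed = [50, 34, 4, 3, 3, 2, 1, 1, 1, 1]
# Perfectly balanced load for comparison (the "equal work" case).
balanced = [10] * 10

print(top_share(skewed, 0.2))    # 0.84
print(top_share(balanced, 0.2))  # 0.2
```

Perfect balance would give the busiest 20% of nodes exactly 20% of the work, so both the 84% and the 55% figures indicate skew, with caching reducing it but not removing it.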
Some Measurements of P2P
• Ripeanu et al. – Gnutella topology does not match underlying network topology. MMCN'02
• Markatos – A simple query caching scheme can reduce query traffic by a factor of two. CCGrid 2002
• Saroiu et al. – Gnutella bandwidth, latency, and node availability over a 60-hour period. Multimedia Systems Journal v8n6
• Adar and Huberman – A free-rider study, using Gnutella’s QueryHit messages to infer peer downloads.
• Chu, Labonte, Levine – Measurements of Napster and Gnutella file popularity and session lengths. Proc. ITCom 2002
• Bhagwan et al. – Effects of DHCP on availability of nodes in p2p, TOD, joins and leaves. IPTPS 2003
• Chu, Labonte, Levine – Measurements of all transfers and most libraries in a large p2p system (openNap); evaluation of Chord
Summary Open Issues
• Applications of p2p are broad.
• Methods other than DHT are possible.
• Measurement studies have revealed the skewed distributions of p2p systems.
  – Can these be modeled?
• DHTs are limited in their application to content sharing.
  – Work well for single-key systems
• Stronger efforts are needed to match research designs to real characteristics of systems.
• Thanks to Jacky Chu and Kevin Labonte for doing the balance of the work.