Multimedia Computing & Networking 2006
Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
Multimedia & Internetworking Research Group (Mirage)
Computer & Information Science Department
University of Oregon
http://mirage.cs.uoregon.edu
Characterizing Files in the Modern Gnutella Network: A Measurement Study
Introduction
P2P applications are very popular over the Internet
• File-sharing: Gnutella, Kazaa, eDonkey
• Content distribution: BitTorrent
• IP telephony: Skype
P2P applications remain popular because of
• Ease of deployment, self-scaling, infrastructure-less
Significant impact on the Internet
Characterizing P2P applications is essential for
• Evaluating their performance and improving their designs
• Conducting meaningful simulations and analytical study
• Examining their impact on the network
Characteristics of large scale P2P applications are not well understood!
P2P Systems: An Overview (I)
Theme: enabling a group of peers (computers) to share their resources (e.g. files, bandwidth, storage, CPU)
As participating peers arbitrarily join & leave, they form an (application level) overlay topology.
• Overlay is inherently dynamic
• No special support from the network (e.g. multicast)
Overlay is used for resource discovery and management
P2P Systems – Overview (II)
Inherent properties:
• Scalability: available resources organically grow with the number of peers
• Churn: peers voluntarily join/leave
• Heterogeneity: peers have different capabilities
Two basic architectures:
1) Unstructured: peers form a randomly connected overlay
2) Structured: peers form an overlay with certain properties (ring, tree)
Effect on the Internet
P2P accounts for 60% of all Internet traffic [CacheLogic Research 2005]
Some P2P apps have millions of simultaneous users.
Geographically distributed.
[Figure: Gnutella population (Oct 04 – Jan 06)]
[Figure: Gnutella overlay in 2002]
Research on P2P Networking
Active area of research since 2001
Mostly focusing on new architectures, new resource discovery/management techniques
• Evaluation is only feasible through simulation or small scale experiments with synthetic workloads.
Few empirical studies on P2P systems
Characteristics of widely-deployed P2P systems are not well understood.
• Peer dynamics: e.g. distribution of peer uptime
• Overlay properties: e.g. distribution of peer degree
• Resource properties: e.g. popularity distribution of files
Methodology
Characterizing P2P applications requires capturing system "snapshots".
• A snapshot is a graph that represents the state of the system at a given point in time (peers = nodes, connections = edges).
• Individual snapshots reveal instantaneous properties.
• Consecutive snapshots reveal dynamics.
Ideally, a snapshot is captured instantaneously.
In practice, a snapshot is iteratively discovered by a P2P crawler.
• P2P apps should provide support for the crawler, e.g. query a peer for its list of neighbors, files.
It is difficult to characterize proprietary P2P applications.
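As a minimal sketch of how a crawler can iteratively discover a snapshot, the Python below runs a breadth-first crawl over a toy in-memory overlay. The `query_neighbors` callback and the `overlay` dictionary are hypothetical stand-ins for contacting real peers; this is an illustration of the idea, not Cruiser's implementation.

```python
from collections import deque

def crawl_snapshot(seed_peers, query_neighbors):
    """Breadth-first crawl; returns (nodes, edges) of the discovered overlay.

    query_neighbors(peer) is a hypothetical callback that returns the peer's
    neighbor list, or None if the peer is unreachable.
    """
    nodes = set(seed_peers)
    edges = set()
    queue = deque(seed_peers)
    while queue:
        peer = queue.popleft()
        neighbors = query_neighbors(peer)
        if neighbors is None:  # unreachable: departed, firewalled, overloaded
            continue
        for n in neighbors:
            edges.add(frozenset((peer, n)))  # undirected connection
            if n not in nodes:
                nodes.add(n)
                queue.append(n)
    return nodes, edges

# Toy overlay (assumed data): peer "D" never answers.
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": None}
nodes, edges = crawl_snapshot(["A"], overlay.get)
```

Note that the unreachable peer "D" still appears as a node, because a reachable neighbor reported it; only its own edges are missing from the snapshot.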
Cruiser: a Fast P2P Crawler
We developed a parallel crawler, called Cruiser.
Features:
• Master-slave architecture: the master coordinates among slaves, each slave crawls hundreds of peers simultaneously
• Dynamic adaptation to bandwidth & CPU constraints
• Generic crawler, accommodates plug-ins
Orders of magnitude faster than other P2P crawlers:
• Captures one million Gnutella nodes in around 7 minutes
• 140K peers/min (visiting 22K peers/min) >> 2.5 peers/min
Lots of important implementation issues:
• Setting timeouts, number of file descriptors per process, dealing with the local NAT box
Evaluating Snapshot Accuracy
No reference snapshot to compare against
Completeness of captured snapshots: edges, nodes
Tradeoff between granularity & completeness of snapshots
• Node distortion > 4%
• Edge distortion > 15%
30% of peers are unreachable
• 3% departed peers
• 17% behind firewall (NAT)
• 10% overloaded !!
[Figure: peers discovered (×10,000)]
Previous Studies
Captured a small population of peers
• Partial snapshot through a short crawl
• Periodic probe of a fixed group of peers
Have not verified whether the captured population is representative
Conducted more than 3 years ago (outdated)
• Population of these apps has significantly grown
• New features & two-tier architecture were incorporated
Measurement Methodology
Characterizing files requires file snapshots.
• Obtaining the list of shared files & neighbor info. from individual peers: a content crawl + a topology crawl
• Individual snapshots enable static & topological analysis.
• Consecutive snapshots enable dynamic analysis.
Topology crawl is much faster than content crawl (minutes vs hours)
Other challenges: NAT, DHCP, fileID, …(see paper).
Minimizing the distortion in file snapshots by
• Capturing a complete snapshot with a high-speed crawler
• Decoupling topology crawl from content crawl
[Figure: crawl timeline: topology crawl (15 min), content crawl (5.5 hours), topology crawl (15 min)]
[Figure: two-tier overlay: top-level overlay of ultrapeers, with leaf peers attached]
Dataset
Captured around 50 snapshots
• Average log size/snapshot: 10 GByte
• Each snapshot represents
  • 800 Terabytes of content
  • 100 million unique files
  • 0.5 million reachable peers, 20% of identified peers
Available content in Gnutella = 4,000 Terabytes
Reported results were consistent across multiple snapshots
Post processing
• e.g. Removed duplicate files reported by individual peers (9% of all captured files)
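The 4,000 Terabyte figure is consistent with a simple extrapolation from the reachable peers. The slide does not state how it was derived; the line below assumes that unreachable peers share, on average, as much content as reachable ones.

```latex
\[
\text{available content} \approx \frac{800\ \text{TB}}{0.20} = 4{,}000\ \text{TB}
\]
```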
Summary of Characterizations
1) Static analysis: characteristics of files at a given point of time
2) Topological analysis: correlation between file distribution and overlay topology
3) Dynamic analysis: changes in file characteristics over time
Free Riding
Free Riders (June 13, 2005; rounded numbers)
                   Peers   None   Files
Ultra              159K    12%    352
Leaf               235K    15%    332
Long-lived Ultra   125K    12%    349
Short-lived Ultra  34K     12%    363
Long-lived Leaf    156K    16%    350
Short-lived Leaf   79K     14%    297
Total              394K    14%    340
% of free riders reported in previous studies
• 66% in 2000 [Adar]
• 25% in 2002 [Saroiu]
The percentage of free riders has dropped
Resource Sharing
How many resources (files, storage) do peers contribute?
Distribution of peers contributing:
• x files follows a power law
• x MBytes follows a power law
Most peers contribute little, but a few contribute a lot
Shared files vs storage
• Correlation is not as strong as reported by Saroiu et al. 2002
File Popularity
Representing availability of individual files.
Follows Zipf distribution
Popularity distribution remains stable over time
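One way to check a Zipf-like shape is to count how many peers hold each file, rank the files by that count, and fit a line to log(count) against log(rank). The sketch below does this on a toy snapshot; the function names and the data are assumptions for illustration, and the fit is an ordinary least-squares line rather than whatever estimator the study used.

```python
import math
from collections import Counter

def popularity_by_rank(peer_files):
    """Count how many peers hold each file; return counts in rank order."""
    counts = Counter()
    for files in peer_files.values():
        counts.update(set(files))  # count each file once per peer
    return sorted(counts.values(), reverse=True)

def zipf_slope(counts):
    """Least-squares slope of log(count) vs log(rank); needs >= 2 files."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Toy snapshot (assumed data): peer -> list of file identifiers.
snapshot = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b"],
    "p3": ["a", "c"],
    "p4": ["a"],
}
counts = popularity_by_rank(snapshot)
slope = zipf_slope(counts)
```

A slope near -1 on the log-log plot corresponds to the classic Zipf case where the count is inversely proportional to the rank.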
File Types
In 2001, Chu et al. reported
• Audio: 67% of files, 79% of bytes
• Video: 2% of files, 19% of bytes
mp3 files are very popular!
Multimedia (mm) files make up: 73% of files, 93% of bytes
Non-mm: jpg, gif, htm, exe, txt
Video files have become more popular
Major Audio Types
Type   File%  Byte%
mp3    61%    37%
wma    2.7%   1.3%
wave   1.9%   0.7%
m4a    1.4%   0.7%
total  67%    40%
Major Video Types
Type   File%  Byte%
wmv    2.3%   3.4%
mpg    2.4%   23.3%
avi    0.8%   24.5%
asf    0.14%  0.64%
total  5.6%   52%
Topological Analysis
Is there any correlation between locations of a file and overlay topology?
• i.e. Are copies of a file topologically clustered?
File locations are affected by two factors:
1) Scoped search => topological clustering
2) Churn => random distribution
Which factor is dominant?
Examining from two angles:
• Per-file perspective
• Per-peer perspective
Topological Analysis
Simulate flood-based query from 100 random peers
• Number of messages to find 5 copies
• Files with different popularity
• Random vs realistic file distribution
Average similarity of content between 100 random peers and their one/two/three-hop neighbors.
No topological clustering exists
Churn is the dominant factor
Use random file distribution for simulation
Select random peers to characterize files (non-trivial)
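The message-count experiment can be sketched as a hop-by-hop flood that stops once enough copies are found. This is a simplified model under stated assumptions: the overlay is a plain adjacency dictionary, every forwarded query costs one message (duplicates included), the stopping condition is checked once per hop, and the source's own copy is not counted. The function name and the toy overlay are hypothetical.

```python
def flood_messages(adjacency, holders, source, copies_needed):
    """Flood hop by hop from source; return messages sent until enough
    copies are found, or None if the flood runs out of peers first."""
    visited = {source}
    found = 0
    frontier = [(source, None)]  # (peer, peer it received the query from)
    messages = 0
    while frontier and found < copies_needed:
        next_frontier = []
        for peer, sender in frontier:
            for neighbor in adjacency[peer]:
                if neighbor == sender:
                    continue  # do not send the query back to its sender
                messages += 1  # duplicate deliveries still cost a message
                if neighbor not in visited:
                    visited.add(neighbor)
                    if neighbor in holders:
                        found += 1
                    next_frontier.append((neighbor, peer))
        frontier = next_frontier
    return messages if found >= copies_needed else None

# Toy line-shaped overlay (assumed data); peers 2 and 3 hold the file.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
holders = {2, 3}
one_copy = flood_messages(adjacency, holders, source=0, copies_needed=1)
two_copies = flood_messages(adjacency, holders, source=0, copies_needed=2)
```

To compare random and realistic file placement, the same function can be run twice on one overlay: once with `holders` taken from a captured snapshot and once with the same number of holders drawn uniformly at random.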
Dynamic Analysis
How do various characteristics of available files change over different timescales?
• Peers add/download or remove files
• Peers join/leave the system
1) Variations in shared files by individual peers
• Dynamic IP addresses introduce error
2) Variations in popularity of individual files
3) Trend in popularity changes
Variations of files at individual peers
Ratio of added/removed files to total files (degree of change)
• 3000 random peers
• Timescales: 2hr, 6hr, 1day, 1wk
More change over longer timescales seems intuitive
Change in popularity of 50K files over one-day interval
• More changes for more popular files
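The degree of change for one peer can be computed from its file lists in two snapshots. The slide does not define "total files" precisely; the sketch below assumes it means the union of the files seen in either snapshot, and the function name is hypothetical.

```python
def degree_of_change(files_before, files_after):
    """Return (added_ratio, removed_ratio) for one peer across two snapshots.

    Assumption: ratios are relative to the union of both file sets.
    """
    before, after = set(files_before), set(files_after)
    total = len(before | after)
    if total == 0:
        return 0.0, 0.0  # peer shared nothing in either snapshot
    added = len(after - before) / total
    removed = len(before - after) / total
    return added, removed

# Toy file lists (assumed data) for the same peer at two points in time.
added, removed = degree_of_change(["a", "b", "c"], ["b", "c", "d", "e"])
```

Matching "the same peer" across snapshots is itself error-prone when peers have dynamic IP addresses, so a real analysis would need a more stable peer identifier than the address alone.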
Change in file popularity
[Figure panels: Top 100 files; Top 1000 files]
Change in popularity
• For top 100 and 1000 files
• Over different timescales
For any timescale, more popular files
• exhibit larger changes
• change more rapidly
Caching references is useful
These all seem intuitive but one needs to quantify the rate of changes
Trends in Popularity Changes
Goal: to predict popularity of a file in the future
No major change in popularity over several days
Larger changes over a few months
The key is to quantify the rate and pattern of changes.
Significantly more snapshots are required to derive any reliable conclusion