
Characterizing Files in the Modern Gnutella Network: A Measurement Study

Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
Multimedia & Internetworking Research Group (Mirage)
Computer & Information Science Department, University of Oregon
http://mirage.cs.uoregon.edu

Multimedia Computing & Networking 2006


Introduction

P2P applications are very popular over the Internet:
• File-sharing: Gnutella, Kazaa, eDonkey
• Content distribution: BitTorrent
• IP telephony: Skype

P2P applications remain popular because of:
• Ease of deployment, self-scaling, no need for infrastructure

P2P applications have a significant impact on the Internet.

Characterizing P2P applications is essential for:
• Evaluating their performance and improving their designs
• Conducting meaningful simulations and analytical studies
• Examining their impact on the network

Characteristics of large-scale P2P applications are not well understood!


P2P Systems: An Overview (I)

Theme: enabling a group of peers (computers) to share their resources (e.g. files, bandwidth, storage, CPU).

As participating peers arbitrarily join and leave, they form an application-level overlay topology.
• The overlay is inherently dynamic
• No special support from the network (e.g. multicast)

The overlay is used for resource discovery and management.


P2P Systems: An Overview (II)

Inherent properties:
• Scalability: available resources grow organically with the number of peers
• Churn: peers voluntarily join/leave
• Heterogeneity: peers have different capabilities

Two basic architectures:
1) Unstructured: peers form a randomly connected overlay
2) Structured: peers form an overlay with certain properties (ring, tree)


Effect on the Internet

P2P accounts for 60% of all Internet traffic [CacheLogic Research 2005].

Some P2P apps have millions of simultaneous users, geographically distributed.

[Figure: Gnutella population (Oct 04 – Jan 06)]

[Figure: Gnutella overlay in 2002]


Research on P2P Networking

Active area of research since 2001.

Mostly focusing on new architectures and new resource discovery/management techniques.
• Evaluation is only feasible through simulation or small-scale experiments with synthetic workloads.

Few empirical studies on P2P systems.

Characteristics of widely deployed P2P systems are not well understood:
• Peer dynamics: e.g. distribution of peer uptime
• Overlay properties: e.g. distribution of peer degree
• Resource properties: e.g. popularity distribution of files


Methodology

Characterizing P2P applications requires capturing system “snapshots”.
• A snapshot is a graph that represents the state of the system at a given point in time (peers = nodes, connections = edges).
• Individual snapshots reveal instantaneous properties.
• Consecutive snapshots reveal dynamics.

Ideally, a snapshot is captured instantaneously.

In practice, a snapshot is iteratively discovered by a P2P crawler.
• P2P apps should provide support for the crawler, e.g. querying a peer for its list of neighbors and files.

It is difficult to characterize proprietary P2P applications.
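The iterative discovery described above can be sketched as a breadth-first crawl. This is a minimal illustration, not the crawler used in the study: `query_neighbors` is a hypothetical stand-in for whatever request a real P2P client answers with its neighbor list, and the toy overlay is made up.

```python
from collections import deque

def crawl_snapshot(seed_peers, query_neighbors):
    """Breadth-first crawl that returns a snapshot as (nodes, edges).

    query_neighbors(peer) is a hypothetical callback returning the peer's
    neighbor list, or None if the peer could not be contacted.
    """
    nodes = set(seed_peers)
    edges = set()
    queue = deque(seed_peers)
    while queue:
        peer = queue.popleft()
        neighbors = query_neighbors(peer)
        if neighbors is None:
            # Unreachable peer: it stays in the node set but adds no edges.
            continue
        for neighbor in neighbors:
            edges.add(frozenset((peer, neighbor)))  # undirected connection
            if neighbor not in nodes:
                nodes.add(neighbor)
                queue.append(neighbor)
    return nodes, edges

# Toy overlay used only for illustration; "d" does not answer the crawler.
TOY_OVERLAY = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": None}

nodes, edges = crawl_snapshot(["a"], TOY_OVERLAY.get)
```

Because peers keep joining and leaving while such a crawl runs, the slower the crawl, the further the result drifts from a true instantaneous snapshot.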


Cruiser: a Fast P2P Crawler

We developed a parallel crawler, called Cruiser.

Features:
• Master-slave architecture: the master coordinates among slaves, and each slave crawls hundreds of peers simultaneously
• Dynamic adaptation to bandwidth & CPU constraints
• Generic crawler, accommodates plug-ins

Orders of magnitude faster than other P2P crawlers:
• Captures one million Gnutella nodes in around 7 minutes
• 140K peers/min (visiting 22K peers/min) >> 2.5 peers/min

Lots of important implementation issues:
• Setting timeouts, number of file descriptors per process, dealing with the local NAT box
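As a quick sanity check on the quoted rates, the arithmetic can be written out. This only restates the numbers on the slide; reading 2.5 peers/min as the rate of an earlier crawler is an assumption.

```python
nodes_captured = 1_000_000   # "one million Gnutella nodes"
crawl_minutes = 7            # "around 7 minutes"

# Peers discovered per minute; consistent with the quoted ~140K peers/min.
discovery_rate = nodes_captured / crawl_minutes

baseline_rate = 2.5          # peers/min, assumed to describe an earlier crawler
speedup = 140_000 / baseline_rate

print(f"{discovery_rate:,.0f} peers/min")  # 142,857 peers/min
print(f"{speedup:,.0f}x faster")           # 56,000x faster
```

A factor of 56,000 is between four and five orders of magnitude.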


Evaluating Snapshot Accuracy

No reference snapshot to compare against.

Completeness of captured snapshots: edges, nodes.

Tradeoff between granularity & completeness of snapshots:
• Node distortion > 4%
• Edge distortion > 15%

30% of peers are unreachable:
• 3% departed peers
• 17% behind a firewall (NAT)
• 10% overloaded!

[Figure: peers discovered (×10,000)]


Previous Studies

Captured a small population of peers:
• Partial snapshot through a short crawl
• Periodic probe of a fixed group of peers

Did not verify whether the captured population is representative.

Conducted more than 3 years ago (outdated):
• Population of these apps has grown significantly
• New features & a two-tier architecture were incorporated


Measurement Methodology

Characterizing files requires file snapshots.
• Obtaining the list of shared files & neighbor info from individual peers: a content crawl + a topology crawl
• Individual snapshots enable static & topological analysis.
• Consecutive snapshots enable dynamic analysis.

A topology crawl is much faster than a content crawl (minutes vs hours).

Other challenges: NAT, DHCP, fileID, … (see paper).

Minimizing the distortion in file snapshots by:
• Capturing a complete snapshot with a high-speed crawler
• Decoupling the topology crawl from the content crawl

[Figure: crawl timeline with two topology crawls (15 min each) and one content crawl (5.5 hours)]

[Figure: Gnutella overlay showing the top-level overlay, ultrapeers, and leaves]


Dataset

Captured around 50 snapshots.
• Average log size per snapshot: 10 GByte
• Each snapshot represents:
  • 800 Terabytes of content
  • 100 million unique files
  • 0.5 million reachable peers, 20% of identified peers

Available content in Gnutella = 4,000 Terabytes.

Reported results were consistent across multiple snapshots.

Post-processing:
• e.g. removed duplicate files reported by individual peers (9% of all captured files)
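The 4,000 Terabyte figure appears to be an extrapolation from the crawled sample. The slide does not spell out the calculation, so the sketch below is an assumption: it scales the 800 Terabytes observed on reachable peers by the 20% reachable fraction, which implicitly treats unreachable peers as sharing as much as reachable ones.

```python
content_per_snapshot_tb = 800   # content observed on reachable peers
reachable_peers = 500_000       # "0.5 million reachable peers"
reachable_percent = 20          # reachable peers as a share of identified peers

# Scale the observed sample up to the whole identified population.
estimated_total_tb = content_per_snapshot_tb * 100 / reachable_percent
identified_peers = reachable_peers * 100 / reachable_percent

print(estimated_total_tb)   # 4000.0
print(identified_peers)     # 2500000.0
```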


Summary of Characterizations

1) Static analysis: characteristics of files at a given point of time

2) Topological analysis: correlation between file distribution and overlay topology

3) Dynamic analysis: changes in file characteristics over time


Free Riding

Free Riders, June 13, 2005 [rounded numbers]

Peer group           Peers   None   Files
Ultra                159K    12%    352
Leaf                 235K    15%    332
Long-lived Ultra     125K    12%    349
Short-lived Ultra     34K    12%    363
Long-lived Leaf      156K    16%    350
Short-lived Leaf      79K    14%    297
Total                394K    14%    340

% of free riders reported in previous studies:
• 66% in 2000 [Adar]
• 25% in 2002 [Saroiu]

The percentage of free riders has dropped.
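The rounded figures on this slide can be cross-checked with a peer-weighted average. The row assignment used below is an assumption about how the table reads, as is interpreting "None" as the percentage of peers sharing no files and "Files" as a per-peer average.

```python
# (peers in thousands, % sharing nothing, files column) per group.
# Assumed row assignment; values are the rounded numbers from the slide.
groups = {
    "Ultra": (159, 12, 352),
    "Leaf": (235, 15, 332),
}
total_peers = sum(row[0] for row in groups.values())

def weighted(column):
    """Peer-weighted average of one column (1 = free-rider %, 2 = files)."""
    return sum(row[0] * row[column] for row in groups.values()) / total_peers

print(total_peers)          # 394
print(round(weighted(1)))   # 14
print(round(weighted(2)))   # 340
```

Under these assumptions the ultrapeer and leaf rows reproduce the total row (394K peers, 14%, 340), which supports this reading of the table.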


Resource Sharing

How many resources (files, storage) do peers contribute?

Distribution of peers contributing:
• x files follows a power law
• x MBytes follows a power law

Most peers contribute little, but a few contribute a lot.

Correlation between shared files and storage:
• Not as strong as reported by Saroiu et al. 2002
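One common way to check a claim that a distribution follows a power law is to fit a straight line to it on log-log axes. The sketch below is illustrative only: the data are synthetic and follow an exact power law, and least-squares fitting in log space is a simple stand-in, not necessarily the method used in the study.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic example: number of peers sharing x files, proportional to x**-1.5.
files_shared = [1, 10, 100, 1000]
peer_counts = [1e6 * x ** -1.5 for x in files_shared]

slope = loglog_slope(files_shared, peer_counts)  # close to -1.5
```

On measured data the points scatter around the line, and the quality of the fit indicates how well the power-law description holds.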


File Popularity

Popularity represents the availability of individual files.
• Follows a Zipf distribution
• Popularity distribution remains stable over time
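Popularity in this sense can be computed by counting, for each file, how many peers share it, and then ranking files by that count. The following is a minimal sketch with made-up peers and file identifiers; the function name and data layout are hypothetical.

```python
from collections import Counter

def popularity_ranking(shared_files_by_peer):
    """Return [(file_id, peer_count), ...] sorted from most to least popular."""
    counts = Counter()
    for files in shared_files_by_peer.values():
        counts.update(set(files))  # count each file once per peer
    # Sort by descending count, breaking ties by file ID for determinism.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical snapshot: peer -> list of shared file IDs.
snapshot = {
    "peer1": ["f1", "f2", "f2"],  # duplicate entry reported by the same peer
    "peer2": ["f1", "f3"],
    "peer3": ["f1", "f2"],
}
ranking = popularity_ranking(snapshot)
print(ranking)  # [('f1', 3), ('f2', 2), ('f3', 1)]
```

Under a Zipf distribution, the count at rank r falls off roughly in proportion to a negative power of r, so the ranked counts form an approximately straight line on log-log axes.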


File Types

In 2001, Chu et al. reported:
• Audio: 67% of files, 79% of bytes
• Video: 2% of files, 19% of bytes

mp3 files are very popular!
Multimedia (mm) files make up 73% of files and 93% of bytes.
Non-mm: jpg, gif, htm, exe, txt
Video files have become more popular.

Major Audio Types

Type    File%   Byte%
mp3     61%     37%
wma     2.7%    1.3%
wave    1.9%    0.7%
m4a     1.4%    0.7%
total   67%     40%

Major Video Types

Type    File%   Byte%
wmv     2.3%    3.4%
mpg     2.4%    23.3%
avi     0.8%    24.5%
asf     0.14%   0.64%
total   5.6%    52%


Topological Analysis

Is there any correlation between the locations of a file and the overlay topology?
• i.e. are copies of a file topologically clustered?

File locations are affected by two factors:
1) Scoped search => topological clustering
2) Churn => random distribution

Which factor is dominant?

Examining from two angles:
• Per-file perspective
• Per-peer perspective


Topological Analysis

Simulate flood-based queries from 100 random peers:
• Number of messages to find 5 copies
• Files with different popularity
• Random vs realistic file distribution

Average similarity of content between 100 random peers and their one/two/three-hop neighbors.

• No topological clustering exists
• Churn is the dominant factor
• Use a random file distribution for simulations
• Select random peers to characterize files (non-trivial)
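A flood-based query of the kind simulated here can be sketched as a hop-by-hop expansion over the overlay graph that counts forwarded messages until enough copies are found. This is a simplified illustration, not the simulator used in the study: the overlay, the set of holders, the `max_hops` limit, and the message-counting rule are all assumptions.

```python
def flood_query(overlay, holders, source, copies_wanted=5, max_hops=7):
    """Flood a query hop by hop; return (messages_sent, copies_found).

    Simplification: a peer forwards to every neighbor, including the one the
    query came from, and every forwarded copy counts as one message. The
    source's own files are not counted.
    """
    visited = {source}
    frontier = [source]
    messages = 0
    found = 0
    hops = 0
    while frontier and found < copies_wanted and hops < max_hops:
        next_frontier = []
        for peer in frontier:
            for neighbor in overlay[peer]:
                messages += 1
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
                    if neighbor in holders:
                        found += 1
        frontier = next_frontier
        hops += 1
    return messages, found

# Toy overlay (adjacency lists) and the peers that hold the file.
overlay = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
holders = {"d", "e"}

print(flood_query(overlay, holders, "a", copies_wanted=2))  # (9, 2)
```

Running the same query against a random placement of copies and against the measured placement, and comparing message counts, is one way to test whether copies are topologically clustered.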


Dynamic Analysis

How do various characteristics of available files change over different timescales?
• Peers add/download or remove files
• Peers join/leave the system

1) Variations in shared files by individual peers
• Dynamic IP addresses introduce error
2) Variations in popularity of individual files
3) Trend in popularity changes


Variations of files at individual peers

Ratio of added/removed files to total files (degree of change):
• 3000 random peers
• Timescales: 2 hr, 6 hr, 1 day, 1 wk

More change over longer timescales, which seems intuitive.

Change in popularity of 50K files over a one-day interval:
• More changes for more popular files
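The degree of change for one peer can be computed from its file lists in two snapshots. The slide does not say what "total files" means exactly, so the sketch below assumes it is the number of distinct files seen in either snapshot; the function name and data are hypothetical.

```python
def degree_of_change(files_before, files_after):
    """Ratio of added plus removed files to all files seen in either snapshot."""
    before, after = set(files_before), set(files_after)
    added = after - before
    removed = before - after
    total = len(before | after)  # assumption: denominator is the union
    if total == 0:
        return 0.0
    return (len(added) + len(removed)) / total

# One file removed ("a") and one added ("d") out of four distinct files.
print(degree_of_change(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```

With this denominator the ratio stays between 0 (no change) and 1 (completely different file sets), which makes peers with very different library sizes comparable.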


Change in file popularity

[Figures: change in popularity for the top 100 files and the top 1000 files]

Change in popularity:
• For the top 100 and top 1000 files
• Over different timescales

For any timescale, more popular files:
• Exhibit larger changes
• Change more rapidly

Caching references is useful.

These all seem intuitive, but one needs to quantify the rate of change.


Trends in Popularity Changes

Goal: to predict the future popularity of a file.
• No major change in popularity over several days
• Larger changes over a few months

The key is to quantify the rate and pattern of changes.

Significantly more snapshots are required to derive any reliable conclusion.