TECHNIQUES AND DATA STRUCTURES FOR EFFICIENT INFORMATION ACCESS AND
RETRIEVAL IN DISTRIBUTED NETWORKS
By
Ryan LaFortune
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Computer Science
Approved by the Examining Committee:
Christopher Carothers, Thesis Adviser
Konstantin Busch, Thesis Adviser
Boleslaw Szymanski, Member
Bulent Yener, Member
Srikanta Tirthapura, Member
Rensselaer Polytechnic Institute
Troy, New York
March 2008 (For Graduation May 2008)
TECHNIQUES AND DATA STRUCTURES FOR EFFICIENT INFORMATION ACCESS AND
RETRIEVAL IN DISTRIBUTED NETWORKS
By
Ryan LaFortune
An Abstract of a Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Computer Science
The original of the complete thesis is on file in the Rensselaer Polytechnic Institute Library
Examining Committee:
Christopher Carothers, Thesis Adviser
Konstantin Busch, Thesis Adviser
Boleslaw Szymanski, Member
Bulent Yener, Member
Srikanta Tirthapura, Member
Rensselaer Polytechnic Institute
Troy, New York
March 2008 (For Graduation May 2008)
© Copyright 2008
by
Ryan LaFortune
All Rights Reserved
CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT
1. Introduction and Historical Review
   1.1 Information Access
   1.2 Information Retrieval
   1.3 Summary of Contributions
       1.3.1 Information Access Contributions
       1.3.2 Information Retrieval Contributions
2. Information Access: Tracking Mobile Objects in Wireless Sensor Networks
   2.1 Introduction
       2.1.1 Querying Schemes
       2.1.2 Sparse Covers and Motivations
             2.1.2.1 Name-Independent Compact Routing
             2.1.2.2 Synchronizers
   2.2 Contributions
   2.3 Related Work
   2.4 Definitions and Preliminaries
       2.4.1 Graph Basics
       2.4.2 Covers
       2.4.3 Path Separators
       2.4.4 Graph Minors
   2.5 A Structural Lower Bound for Sparse Covers
       2.5.1 Graph Construction
       2.5.2 Set Cardinalities
       2.5.3 Proving a Lower Bound
   2.6 Shortest Path Clustering
   2.7 Cover for k-Path Separable Graphs
   2.8 Cover for Planar Graphs
       2.8.1 Basic Results for Planar Graphs
       2.8.2 High Level Description of the Algorithm
       2.8.3 Algorithm Depth-Cover
       2.8.4 Analysis
       2.8.5 General Planar Cover
   2.9 Cover for Unit Disk Graphs
   2.10 Summary
3. Information Retrieval: P2P Content Delivery
   3.1 Introduction
       3.1.1 Users Happy ISPs Not
   3.2 Contributions
   3.3 Related Work
   3.4 The BitTorrent Protocol
       3.4.1 Message Protocol
       3.4.2 Choker Algorithms
       3.4.3 Piece-Picker
       3.4.4 Implications for Network Model Design
   3.5 Simulator
       3.5.1 BitTorrent Model Data Structure
       3.5.2 Tuning Parameters
   3.6 Topology Model
       3.6.1 Related Work
             3.6.1.1 Internet Mapping Projects
             3.6.1.2 Abstractions
       3.6.2 Internet Connectivity Model
             3.6.2.1 Backbone
             3.6.2.2 Neighborhood-Level
       3.6.3 Population Model
       3.6.4 Delay Model
       3.6.5 Technology Model
       3.6.6 Bandwidth Model
       3.6.7 Results
   3.7 Experimental Results
       3.7.1 Model Validation
       3.7.2 Model Performance
   3.8 BitTorrent as a Streaming Protocol
       3.8.1 Related Work
       3.8.2 BiToS
       3.8.3 BASS
             3.8.3.1 Simulation Results
             3.8.3.2 QoS
             3.8.3.3 CDN Utilization
             3.8.3.4 Video Bit Rate
   3.9 Summary
4. Discussion and Conclusions
LITERATURE CITED
LIST OF TABLES
3.1 Approximate memory required for simulation runs, and technique lookup complexity.
3.2 Number of events generated in the slice-level simulations and lower bound on the number of events generated in the packet-level simulations.
3.3 Number of messages received per type per simulation scenario.
3.4 For a flash crowd of 2,048 peers and a look-ahead buffer of 1,000 slices, this table shows the performance of several large swarms (16,384 peers to 131,072 peers).
3.5 This table shows how many streaming peers received each grade of the StreamQ performance rating system. Note that a grade of F is given when a peer's adjusted frustration time is 27 seconds or more.
3.6 For a flash crowd of 2,048 peers and a look-ahead buffer of 1,000 slices, this table shows the 95th percentiles and approximate flat-rate distribution costs for several large swarms (16,384 peers to 131,072 peers), at $0.10 per GB delivered.
3.7 For a flash crowd of 2,048 peers, this table shows the appropriate size of the look-ahead buffer to achieve a similar QoS for different bit rates (700 Kbps and 1.5 Mbps) and swarm sizes (16,384 peers and 32,768 peers).
3.8 This table shows the differences in P2P contribution and CDN utilization for the scenarios presented in Table 3.7.
LIST OF FIGURES
2.1 This figure shows S0 and S1. In S1, S0 is replicated c + 1 times and connected using the gadget G^0_1.
2.2 This figure demonstrates the structure of a general Si graph.
2.3 This figure demonstrates one way path lengths may grow as described in Lemma 2.5.1.
2.4 A demonstration of the proof of property iii of Lemma 2.6.1.
2.5 For Lemma 2.8.2: the figure on the left shows a configuration of removed edges that are external in C and span from A to B (note, if the lemma were not true, B would be disconnected), the figure in the middle demonstrates the walk options from lu to lv, and the figure on the right demonstrates the walk options from lu to ru.
2.6 For Case 2 of Lemma 2.8.3: the figure on the left demonstrates a possible setup, and the figure on the right demonstrates one of the two possible path configurations.
2.7 This figure demonstrates the subgraphs and paths described in Lemma 2.8.4.
2.8 Execution example of Algorithm Subgraph-Clustering.
3.1 This figure shows our BitTorrent model data structure.
3.2 This figure is the connectivity graph of the backbone of the connectivity model. The nodes represent sources, sinks, intermediate backbone routers, and identified low-tiered ISP routers. The edges represent links between respective nodes.
3.3 This figure shows the distribution of shortest path lengths for distinct paths in the backbone of the connectivity model. This curve is typical of the Internet, demonstrating that we have preserved the required path properties.
3.4 This figure is the connectivity graph resulting from one set of traces to a popular cable ISP.
3.5 This figure shows the average delays experienced at each of the first 18 links along a packet's path from our traces.
3.6 This figure shows the national technology distribution for home high-speed Internet connections for March of 2003 and March of 2006.
3.7 This figure shows the download completion times of the modified INRIA/PlanetLab scenario taken from [67]. In our case, we varied the random number seed-sets across 10 separate runs of the 40 peer, 1 seeder scenario, providing us with 400 peer data points.
3.8 Simulated download completion times (seconds) for the 1,024 peer, 1,024 piece scenario.
3.9 Model execution time as a function of the number of pieces and the number of peers.
3.10 Model event rate as a function of the number of pieces and the number of peers.
3.11 Model memory usage in MB as a function of the number of pieces and the number of peers.
3.12 Simulated download completion times (seconds) for the 16,384 peer, 4,096 piece scenario (this simulation run required 15.14 GB of RAM and 59.66 hours to execute with an event rate of 35,179 events per second).
3.13 This figure demonstrates our double-buffering scheme. In this example, the playback buffer is 5 slices and the look-ahead buffer is 15 slices.
3.14 This graph demonstrates the average buffer times (seconds) experienced by streaming peers in the simulation runs.
3.15 This graph demonstrates the average number of buffer events experienced by streaming peers in the simulation runs. Note that the first buffer event is mandatory for all streaming peers.
3.16 This graph demonstrates the percent of bandwidth contributed by the P2P network for the file distribution in the simulation runs.
3.17 This histogram of the buffering times demonstrates that most streaming peers experience a buffering time of under 3 seconds.
3.18 This histogram of the number of buffering events demonstrates that most streaming peers experience few re-buffers.
3.19 This histogram of the adjusted frustration times demonstrates that most streaming peers experience a high QoS.
3.20 This graph demonstrates the average server utilizations over the simulation runs.
3.21 This graph demonstrates the peak server utilizations over the simulation runs.
3.22 This graph shows the distribution of slices delivered by the CDN throughout the file, for the 16,384 peer, 1,024 flash crowd, and 2,000 slice look-ahead scenario.
ACKNOWLEDGMENT
Pursuing a doctoral degree is an extensive and intricate process. I would like to
thank my parents for the support and encouragement they have provided me with
throughout my entire life. I would like to thank my brother Erik for challenging
me and for helping initiate my pursuit of mathematical superiority at an early age.
I would also like to thank my fiancée Christina for her support and understanding
throughout this challenging feat.
Further, I would like to thank my doctoral committee, and my advisors Chris
Carothers and Costas Busch for their guidance, for sharing their knowledge, and for
helping me throughout graduate school.
ABSTRACT
Networks were designed and continue to exist to allow for fast and convenient access
to remote data. With data scattered across a large network, there exists a
fundamental challenge to efficiently find any sought data. This challenge is further
complicated when the data is periodically relocated in the network, as is the case
with wireless sensor networks. Thus, a solution to the problem necessitates a data
structure with the ability to update in response to object relocations. A trivial solu-
tion to the problem uses a centralized directory responsible for knowing the location
of all data at all times, and directing all querying nodes to the location of the data
they seek. Dependence on one node to provide directions results in a single point of
failure, and may cause some queries to be unnecessarily long, especially when the
sought data lies at a node topologically close to the querying node. A better solution
to the problem uses a distributed directory, where all queries are answered quickly
regardless of the whereabouts of the querying and storing nodes. In this thesis, we
provide significant improvements to previous distributed directory solutions by
creating innovative algorithms that improve the structural properties of sparse covers,
the underlying data structure from which a directory is built. Specifically, we
improve directory performance to Stretch_find = O(log n) and Stretch_move = O(log n)
for H-minor free graphs (a savings of log n in each measure), and Stretch_find = O(1)
and Stretch_move = O(log n) for planar graphs, unit disk graphs, and other graphs
with a constant-stretch planar spanner (an additional savings of log n in Stretch_find).
Once data is located in the network, it must then be retrieved (delivered). In a
simple world, this delivery would be between a single source and a single destination.
The possibilities for delivery techniques increase greatly when there are many sources
and destinations, like in peer-to-peer (P2P) networks. P2P networks have gained
much attention in recent years due to their scalability and fault-tolerance, and also
their potential to drastically reduce distributor transit costs. In order to study the
dynamics and causal relationships between peer entities in these complex overlay
networks, we have developed a flow-based discrete-event simulator and abstract
Internet topology model that accurately and realistically model today’s broadband
service, at a scale larger than previous efforts. Specifically, our model can scale to
hundreds of thousands of peers, where prior efforts peak at only a few thousand.
Using detailed simulations, we have improved the efficiency of data dissemination
and reduced distributor transit costs for both the time-insensitive mass-download
scenario and the real-time streaming scenario.
CHAPTER 1
Introduction and Historical Review
This thesis discusses two fundamental problems in distributed networking. The first
problem deals with efficiently locating sought data and objects in wireless sensor
networks. We provide a distributed directory solution with substantial performance
improvements over previous results. The second problem deals with techniques for
data retrieval and delivery in peer-to-peer (P2P) networks. Here, we show through
detailed simulation how to improve network throughput and user performance, and
reduce distributor transit costs.
1.1 Information Access
Networks were designed and continue to exist to allow for fast and convenient
access to remote data. With data scattered across a large network, there exists a
fundamental challenge to efficiently find any sought data. There are many lookup
systems for P2P networks, such as Chord [16], CAN [17], Pastry [18], Tapestry [19],
Symphony [20], randomized hypercubes [21, 22], and randomized Chord [21, 23], to
name a few. The challenge of efficiently finding data is further complicated when
the sought data/objects are mobile.
Consider a wireless sensor network responsible for tracking objects with the
ability to relocate frequently and at will. Finding data in such a network necessitates
a data structure with the ability to update in response to object relocations. The
current state of the art solution is a directory service for mobile objects. This service
is responsible for establishing and maintaining paths to objects in the network, so
that it may provide directions to navigate a user to the location of a desired object.
A trivial solution to the problem uses a centralized directory responsible for knowing
the location of all data at all times, and directing all querying nodes to the location of
the data they seek. Dependence on one node to provide directions results in a single
point of failure, and may cause some queries to be unnecessarily long, especially
when the sought data lies at a node topologically close to the querying node. A
better solution to the problem uses a distributed directory [35], where all queries are
answered quickly regardless of the whereabouts of the querying and storing nodes.
A distributed directory provides two operations: find, to locate an object
given its name, and move, to move an object from one node to another. There is an
inherent tradeoff between the cost of implementing the find and move operations.
The performance of a directory is measured by the Stretch_find, the Stretch_move, and
the memory overhead of the directory, where stretch is defined as the ratio between
the cost of performing an operation and the optimal cost.
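As a toy illustration of these measures (the helper names here are ours, invented for this sketch, not the thesis's), the stretch of an operation is just a cost ratio, and Stretch_find or Stretch_move over a workload is the worst such ratio:

```python
def stretch(actual_cost: float, optimal_cost: float) -> float:
    """Stretch of one operation: the cost actually paid (e.g., hops
    traveled by a find) divided by the optimal cost (e.g., the
    shortest-path distance between querier and object)."""
    if optimal_cost <= 0:
        raise ValueError("optimal cost must be positive")
    return actual_cost / optimal_cost

def worst_case_stretch(costs) -> float:
    """Stretch_find (or Stretch_move) over a workload:
    the worst ratio across all find (or move) operations."""
    return max(stretch(a, o) for a, o in costs)

# A find that travels 12 hops to an object 3 hops away has stretch 4;
# over this workload the worst case is 10/2 = 5.
finds = [(12, 3), (10, 2), (4, 4)]
```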
A sparse cover is the underlying data structure from which a directory is built,
consisting of a set of connected components called clusters, where every node in the
graph (network) belongs to some cluster containing its entire γ-neighborhood (where
γ is some desired locality parameter). Structurally, a cover is characterized by two
locality metrics, its radius (the maximum cluster radius, which is the minimum
eccentricity (maximum shortest path distance to any cluster node) of a node in any
cluster) and degree (the maximum number of clusters a node participates in). The
radius often translates into latency, and the degree translates into the load imposed
on a node by the data structure.
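As a concrete sketch of these definitions (our own illustration, with the simplifying assumption that distances are measured inside each cluster's induced subgraph and that clusters are connected), the helper below checks the γ-cover property and reports a cover's radius and degree:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted graph {node: [neighbors]}."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def cover_metrics(adj, clusters, gamma):
    """Verify a gamma-cover and return (radius, degree).

    clusters: list of node sets. Every node's gamma-neighborhood must lie
    entirely inside at least one cluster. Radius is the maximum over
    clusters of the minimum eccentricity of any node in the cluster;
    degree is the maximum number of clusters any node belongs to."""
    dists = {u: bfs_dist(adj, u) for u in adj}
    for u in adj:
        ball = {v for v, d in dists[u].items() if d <= gamma}
        assert any(ball <= c for c in clusters), f"{u}'s ball is not covered"
    radius = 0
    for c in clusters:
        # Distances inside the cluster's induced subgraph (assumed connected).
        sub = {u: [v for v in adj[u] if v in c] for u in c}
        ecc = {u: max(bfs_dist(sub, u).values()) for u in c}
        radius = max(radius, min(ecc.values()))
    degree = max(sum(u in c for c in clusters) for u in adj)
    return radius, degree
```

For the path 0-1-2-3 with γ = 1 and clusters {0,1,2} and {1,2,3}, every node's 1-neighborhood fits in some cluster, the radius is 1, and the degree is 2 (nodes 1 and 2 each belong to both clusters).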
In Chapter 2 of this thesis, we prove that there exists a network with n nodes,
and constrained by the locality parameter γ and the maximum tolerable degree c,
such that when clustered, there must exist a cluster whose radius is Ω(γ log log_c n).
This proves that for arbitrary graphs, we cannot simultaneously optimize both
metrics, limiting the quality of the distributed directories we can create for these graphs
using sparse covers.
In light of the above tradeoff for arbitrary graphs, it is natural to ask whether
better sparse covers can be obtained for special classes of graphs. We answer this
question in the affirmative for the class of graphs that exclude a fixed minor. This
includes many popular graph families, such as planar graphs, which exclude K5
and K3,3, outerplanar graphs, which exclude K4 and K2,3, series-parallel graphs,
which exclude K4, and trees, which exclude K3. For any graph G that excludes
a fixed minor graph H, we present an algorithm for computing a sparse cover Z
such that rad(Z) ≤ 4γ and deg(Z) = O(log n), where n is the number of nodes in
G (rad(Z) refers to the radius of Z, and deg(Z) refers to the degree of Z). The
constants in the degree bound depend on the size of H. For any planar graph G,
we present an algorithm for computing a sparse cover Z with rad(Z) ≤ 24γ − 8
and deg(Z) ≤ 18. This cover is optimal (modulo constant factors) with respect
to both the degree and the radius. To our knowledge, this is the first optimal
construction for planar graphs. Finally, for any unit disk graph G (or other graph
with a constant-stretch planar spanner), we present a technique for computing a
sparse cover Z with rad(Z) ≤ 24γt−8 and deg(Z) ≤ 18 (for some constant t). This
cover is also optimal (modulo constant factors) with respect to both the degree and
the radius. Once again, this is the first optimal construction for unit disk graphs.
Using our innovative algorithms that improve the structural properties of
sparse covers, we significantly improve the performance of both the find and move
operations of distributed directories for the studied families of graphs. Using the
general algorithm of Awerbuch and Peleg [36], we can build a directory with
Stretch_find = O(log^2 n) and Stretch_move = O(log^2 n). Using our improved algorithm
for H-minor free graphs, we achieve Stretch_find = O(log n) and Stretch_move = O(log n).
Using our improved algorithm for planar graphs (and similarly unit disk graphs), we
achieve Stretch_find = O(1) and Stretch_move = O(log n).
Our contributions significantly improve the performance of distributed directories
as well as other well-studied distributed computing problems, including network
synchronizers and name-independent compact routing schemes.
1.2 Information Retrieval
Once data is located in the network, it must then be retrieved (delivered).
In the past, the data would typically be delivered directly from some dedicated
server or content distribution network (CDN), and the distributor would pay all
associated transit costs. An attractive alternative to this architecture is using a
P2P overlay network, such as BitTorrent [64]. In such a network, all participants
act as both clients and servers, by downloading content for themselves, and also
uploading it to other users. This architecture can alleviate the costs required for
distribution, as most content is delivered using the swarm’s aggregate bandwidth,
rather than bandwidth purchased by the distributor. Further, if the demand for
some particular data is extremely high, even the most powerful single server would
be quickly overwhelmed. This is not a concern in P2P networks, as requests would
be distributed across a very large set of nodes.
The Internet is evolving in ways unforeseen upon its conception. In recent
years, we have seen the Internet used as a phone service. Vonage [127] is a company
providing voice over Internet protocol (VoIP), and Skype [128] provides a P2P-based
phone service. We are beginning to see companies offer television programs over the
Internet. If Internet protocol television (IPTV) succeeds, we will see an explosion
in the amount of data transferred over the Internet. Specifically, it is clear that
BitTorrent-like networks and other P2P networks will be used for the bulk of this
delivery, making it economically feasible for content distributors.
BitTorrent is known to be very scalable and robust, and to provide high performance
to its users. Unfortunately, the protocol is based almost entirely on heuristics,
making it all but impossible to analyze through theoretical measures. Further,
there are no statistics available for television-scale swarms (as none have existed in
real life), and even for the smaller swarms that have existed, statistics are scarce, as
the sessions were likely distributing data illegally. Thus, we turn to simulation to
help us study the problem at hand. To date, simulators of the protocol have tended
to examine small-scale swarms consisting of at most a few thousand peers.
Through careful design and the use of abstractions, we have developed a
BitTorrent simulator and Internet topology model capable of scaling to hundreds of
thousands of users (television-size audiences), on commodity hardware, while
maintaining a high level of accuracy. In Chapter 3, we discuss our model, provide
experimental results and validation, and discuss how we have used it to study data
dissemination in both the time-insensitive mass-download scenario, and the real-
time streaming scenario.
1.3 Summary of Contributions
The following is a summary of all contributions in the areas of information
access (discussed in Chapter 2) and information retrieval (discussed in Chapter 3)
presented in this thesis.
1.3.1 Information Access Contributions
1. For any planar graph G, we present an algorithm for computing a sparse
cover Z with rad(Z) ≤ 24γ − 8 and deg(Z) ≤ 18. This cover is optimal
(modulo constant factors) with respect to both the degree and the radius. To
our knowledge, this is the first optimal construction for planar graphs (see
Section 2.8).
2. Using our sparse cover construction algorithm for planar graphs, we improve
distributed directory performance for networks that can be represented by
these graphs, achieving Stretch_find = O(1) and Stretch_move = O(log n).
3. For any unit disk graph G (or other graph with a constant-stretch planar
spanner), we present a technique for computing a sparse cover Z with
rad(Z) ≤ 24γt − 8 and deg(Z) ≤ 18 (for some constant t). This cover is optimal
(modulo constant factors) with respect to both the degree and the radius. To
our knowledge, this is the first optimal construction for unit disk graphs (see
Section 2.9).
4. Using our sparse cover construction technique for unit disk graphs (and other
graphs with a constant-stretch planar spanner), we improve distributed
directory performance for networks that can be represented by these graphs,
achieving Stretch_find = O(1) and Stretch_move = O(log n).
5. For any graph G that excludes a fixed minor graph H, we present an algorithm
for computing a sparse cover Z such that rad(Z) ≤ 4γ and deg(Z) = O(log n),
where n is the number of nodes in G. The constants in the degree bound
depend on the size of H (see Section 2.7).
6. Using our sparse cover construction algorithm for H-minor free graphs, we
improve distributed directory performance for networks that can be represented
by these graphs, achieving Stretch_find = O(log n) and Stretch_move = O(log n).
7. We prove there exists a network with n nodes, and constrained by the locality
parameter γ and the maximum tolerable degree c, such that when clustered,
there must exist a cluster whose radius is Ω(γ log log_c n), proving the inherent
tradeoff between radius and degree, and that for arbitrary graphs these metrics
cannot be simultaneously optimized (see Section 2.5).
1.3.2 Information Retrieval Contributions
1. A memory-efficient model of the BitTorrent protocol built on the ROSS
discrete-event simulation system [88, 89]. The memory consumed by a single
BitTorrent client can be upwards of 70 MB. The memory consumed by a client in
our model is between 67 KB and 2.3 MB (see Section 3.5).
2. A slice-level data model that ensures protocol accuracy while avoiding the
event explosion problem characteristic of typical packet-level models, such as
those employed in NS [70]. As a result, we achieve tremendous sequential
processor speedups (up to 180 times) (see Sections 3.5 and 3.6).
3. A realistic Internet topology model that preserves geographic market
relationships, is massively scalable, and accurately models the in-home consumer
broadband Internet (see Section 3.6).
4. Validation of our BitTorrent model against instrumented BitTorrent
operational software as well as previous measurement studies (see Section 3.7.1).
5. Model performance results and analysis for a large number of BitTorrent
swarm scenarios (see Section 3.7.2).
6. Analysis of techniques for streaming content using BitTorrent. We show
acceptable quality of service (QoS) can be achieved when only a small fraction of
a BitTorrent swarm is streaming. Further, we show how the use of BitTorrent
along with a CDN can significantly reduce transit costs while providing an
excellent QoS (see Section 3.8).
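The first two contributions above can be illustrated with a minimal sketch. This is our own toy code under invented constants, not the thesis's ROSS-based implementation: per-peer piece state is packed one bit per piece instead of buffering payloads, and a flow-based transfer schedules one completion event per slice rather than one per packet.

```python
class PeerState:
    """Per-peer piece state packed one bit per piece, rather than
    buffering piece payloads; e.g., 4,096 pieces cost only 512 bytes."""
    def __init__(self, num_pieces: int):
        self.num_pieces = num_pieces
        self.bits = bytearray((num_pieces + 7) // 8)

    def mark_have(self, piece: int) -> None:
        self.bits[piece // 8] |= 1 << (piece % 8)

    def has_piece(self, piece: int) -> bool:
        return bool(self.bits[piece // 8] & (1 << (piece % 8)))


def event_counts(file_bytes: int, slice_bytes: int, packet_bytes: int):
    """Events needed to simulate one full transfer: a flow-based model
    schedules a single completion event per slice (completion time =
    slice size / available bandwidth), while a packet-level model
    generates at least one event per packet."""
    slices = -(-file_bytes // slice_bytes)      # ceiling division
    packets = -(-file_bytes // packet_bytes)
    return slices, packets
```

For example, a 700 MB file with 256 KB slices and 1,500-byte packets needs roughly 2,800 slice events versus roughly 490,000 packet events per transfer, which illustrates the kind of event-count reduction behind the sequential speedups reported above.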
CHAPTER 2
Information Access: Tracking Mobile Objects in Wireless
Sensor Networks
2.1 Introduction
Networks of wireless sensors provide unprecedented opportunities for distributed
sensing and monitoring of the physical environment. Many applications of sensor
networks, such as distributed surveillance and habitat monitoring, deal with mobile
objects such as people, animals, or vehicles. Fundamental tasks of the sensor
network are to track these objects, navigate users to them, answer queries about them,
and route data to/from them.
Consider a sensor network responsible for the surveillance of vehicles. It should
be able to track the whereabouts of a vehicle, or even reach a vehicle by navigating
to its current location. Such a network would be capable of warning vehicles of
impending danger, or informing users of vehicles in the area.
An example of habitat monitoring is Zebranet [1, 2, 3]. This system monitors
animals in a region, and can be used to navigate users (such as photographers) to a
specific animal, herd of animals, or to the animal closest to the user. It can also be
used to answer aggregate queries about the habitat, such as distances traveled over
time.
The above problems can be rephrased in terms of building a directory service
for mobile objects in a wireless sensor network. A directory service in a distributed
system must establish and maintain paths to objects in the network so that it may
provide directions to navigate a user to the current location of any desired object.
Sensor networks are inherently resource-constrained [4]; each sensor node has
only limited energy, processing power, memory, storage capacity, and communica-
tion bandwidth. The combination of the large-scale nature and resource constraints
of sensor networks make the task of building a scalable directory service for them
extremely challenging.
Consider the conventional centralized implementation of a directory service.
Centralized nodes record location estimates of all interesting objects that are being
sensed. Users then communicate with these central nodes in order to find the objects
they seek. There are three drawbacks to this type of implementation. First, it
is expensive in terms of communication and energy consumed to keep the central
nodes up-to-date every time objects move, and for them to be involved in all queries.
Second, if a central node fails, many objects may be unreachable. The final drawback
is that the centralized scheme is inherently global. If a user is close to the sought
object, it must still communicate with a central node, which may be very far away.
In contrast, the ideal solution would take advantage of this user-object proximity
and would only involve local communication among nodes that are nearby to the
user and the object.
Such an ideal solution can be approached via a carefully designed distributed
data structure for the directory. In a distributed directory, no single node serves
as a “home” for the object, constantly knowing its current coordinates. Instead,
the directory information is spread out through the network in a way that makes
it possible to easily reach the object using local queries. Yet, the directory can be
updated locally whenever the object moves. The directory must also be lightweight
since it must operate within the energy, memory, and processing constraints of the
sensor nodes.
A distributed directory provides two operations: find, to locate an object
given its name, and move, to move an object from one node to another. There is an
inherent tradeoff between the cost of implementing the find and move operations.
The performance of a directory is measured by Stretch_find, Stretch_move, and the memory overhead of the directory, where stretch is defined as the ratio between the cost of performing an operation and the optimal cost.
2.1.1 Querying Schemes
There exist many schemes for storing data and querying wireless sensor net-
works. Some designs aim to reduce communication complexity by using named-data,
replication, and other methods of avoiding or controlling flooding through the net-
work. Another popular design feature is to reduce the amount of data sent by
performing filtering or aggregation at intermediate nodes.
Directed diffusion [7, 8] is a data-centric method in which a network of application-aware nodes implements data-naming. A user runs a query by disseminating a task as an interest for named data, then awaits the events that flow back. Along the path back to the user, intermediate nodes may choose to locally cache or aggregate the results before forwarding them. TAG [14] is a generic aggregation service
that operates in a similar fashion. However, this service is specifically designed to
run on ad hoc networks comprised of motes running the TinyOS operating system.
ACQUIRE [12] is another data-centric querying mechanism, which treats the net-
work like a distributed database. When required, an active query packet is injected,
and follows some trajectory through the network. This path can be random, pre-
determined, or guided, and helps avoid flooding. When a node receives the active
query, it performs an on-demand update for which it obtains information from all its
neighbors within its lookahead parameter. As the active query moves through the
network, it is progressively resolved until at some point it is completely solved, at which time it is returned to the querying node.
Introduced in [9] is a data-centric storage mechanism built upon the GPSR
geographic routing algorithm and a P2P lookup system such as Chord [16], CAN
[17], Pastry [18], or Tapestry [19] (some others include: Symphony [20], randomized
hypercubes [21, 22], and randomized Chord [21, 23]). This technique is based on
hashing, and stores data in different locations of the network. Therefore, queries
can be directed to certain locations rather than flooding the network. This mech-
anism assumes the availability of geographic information regarding the network.
Geographical routing is also discussed in [6]. The rumor routing algorithm [5] is another method that decouples data from nodes and stores it in regions. It is
intended to be used when geographic routing is inapplicable. This method does not
guarantee delivery, but is highly configurable for different network topologies, query
rates, and event rates. Configuring appropriately is a compromise between flooding
queries and flooding event notifications.
In [15], data is proactively pushed to select nodes, and later pulled when
queries are requested. This technique is orthogonal to data-centric storage, as data
is stored at the push-pull boundary. Carefully defining the line between push and
pull areas can result in significant communication savings. A comb-needle model
is proposed in [13]. In this model, the push component features data duplication
in a linear neighborhood of each node, and the pull component features a dynamic
formation of an on-demand routing structure that resembles a comb. Queries need
only go to a subset of the network, avoiding flooding.
In [10], a controlled flooding TTL-based model is discussed. The idea is to
flood the network with a query, but control it using a TTL (time to live) value. When
a query has reached its time to live, it does not progress any further. If at this time
the query is not solved, the user can give up or increase the TTL value (expanding
ring search [11] is one TTL strategy). A dynamic programming formulation with
search strategies that minimize the expected cost is given, which can be used when
the probability distribution of the location of an object is known. It is also shown
that given any deterministic TTL sequence, there exists a randomized version with
a better worst case expected search cost. This strategy can be used when the
probability distribution of the location of an object is not known.
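The TTL-controlled flooding and expanding ring search described above can be sketched as follows. This is an illustrative sketch only, not the formulation of [10] or [11]: the adjacency-list graph, the TTL-doubling schedule, and the message-count accounting are assumptions made for the example.

```python
from collections import deque

def ttl_limited_query(adj, source, target, ttl):
    """Flood a query from source, but let it travel at most `ttl` hops."""
    seen = {source}
    frontier = deque([(source, 0)])
    cost = 0  # number of messages sent during this flood
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return True, cost
        if depth == ttl:
            continue  # the query expires here and progresses no further
        for nbr in adj[node]:
            cost += 1
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return False, cost

def expanding_ring_search(adj, source, target, max_ttl):
    """One TTL strategy: retry with a doubled TTL until the target is found."""
    ttl, total = 1, 0
    while ttl <= max_ttl:
        found, cost = ttl_limited_query(adj, source, target, ttl)
        total += cost
        if found:
            return ttl, total
        ttl *= 2
    return None, total
```

On a path graph 0-1-2-3-4, a query from node 0 for node 3 fails at TTL 1 and 2 and succeeds once the ring expands to TTL 4, which illustrates the tradeoff the section describes: small TTLs avoid flooding but may need repeated, progressively wider attempts.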
2.1.2 Sparse Covers and Motivations
Awerbuch and Peleg have created distributed directories using hierarchies of
regional matchings [35], which are constructed from sparse covers [34] (both defined in Section 2.4.2). Their directory scheme features find and move stretches of Stretch_find = O(log^2 n) and Stretch_move = O(log^2 n). Thus, improvements to sparse cover construction techniques can improve the quality of regional matchings, and therefore improve the performance of distributed directories.
A cover Z of a graph G is a set of connected components called clusters, such
that the union of all clusters is the vertex set of G. A cover is defined with respect
to a locality parameter γ > 0. It is required that for each node v ∈ G, there is some
cluster in Z that contains the entire γ-neighborhood of v. Two locality metrics
characterize the cover: the radius, denoted rad(Z), which is the maximum radius of any of its clusters, and the degree, denoted deg(Z), which is the maximum number of clusters that a node in G is a part of. (The radius of a cluster C is defined with respect to the subgraph G′ it induces in G: it is the minimum eccentricity of any node in G′, where the eccentricity of a node v ∈ G′ is the maximum distance from v to any other node in G′.)
In addition to the construction of distance-dependent distributed directories [35, 46, 47], covers play a key role in the design of several other locality preserving
distributed data structures, including compact routing schemes [26, 27, 32, 47, 48,
53], network synchronizers [30, 33, 45, 47], and transformers for certain classes of
distributed algorithms [32]. In the design of these data structures, the degree of the
cover often translates into the load on a vertex imposed by the data structure, and
the radius of the cover translates into the latency. Thus, it is desirable to have a
sparse cover, whose radius is close to its locality parameter γ, and whose degree is
small.
2.1.2.1 Name-Independent Compact Routing
Consider a distributed system where nodes have arbitrary identifiers. A routing
scheme is a method that delivers a message to a destination given the identifier of
the destination. A name-independent routing scheme does not alter the identifiers of
the nodes, which are assumed to be in the range 1, 2, . . . , n. The stretch of a routing
scheme is the worst case ratio between the total cost of messages sent between a
source and destination pair, and the length of the respective shortest path. The
memory overhead is the number of bits (per node) used to store the routing table.
A routing scheme is compact if its stretch and memory overhead are small.
There is a tradeoff between stretch and memory overhead. For example, a
routing scheme that stores the next hop along the shortest path to every destination
has stretch 1, but a very high memory overhead of O(n log n), and hence is not
compact. The other extreme of flooding a message through the network has very
little memory overhead, but is not compact either since the stretch can be as much
as the total weight of all edges in the network. There has been much work on
deriving interesting tradeoffs between the stretch and memory overhead of routing,
including [26, 27, 29, 43, 44, 48, 53].
Sparse covers can be used to provide efficient name-independent routing schemes
(for example, see [30]). A hierarchy of regional routing schemes is created based on a
hierarchy of covers Z_1, Z_2, . . . , Z_δ, where the locality parameter of cover Z_i is γ_i = 2^i, and δ = ⌈log D⌉, where D is the diameter of the graph. (The diameter D of a graph G is the maximum shortest-path distance between any two nodes in the graph; it holds that rad(G) ≤ D ≤ 2 · rad(G).) Henceforth, we assume that log D = O(log n), i.e., the diameter of the graph is polynomial in the number of nodes.
Using the covers of Awerbuch and Peleg [34], the resulting routing scheme has stretch O(k) and average memory of O(n^{1/k} log^2 n) bits per node, for some parameter k. When k = log n, the stretch is O(log n) and the average memory overhead is O(log^2 n) bits per node.
On the other hand, using our covers we obtain routing schemes with optimal stretch (within constant factors) for planar and H-minor free graphs. For any planar graph G with n nodes, our covers give a name-independent routing scheme with O(1) stretch and O(log^2 n) average memory overhead per node. For any graph that excludes a fixed minor, our covers give a name-independent routing scheme with O(1) stretch and O(log^3 n) average memory overhead per node.
For planar graphs, to our knowledge, this is the first name-independent routing scheme that achieves constant stretch with O(log^2 n) space per node on average. For H-minor free graphs, Abraham, Gavoille, and Malkhi [26] present name-independent compact routing schemes with O(1) stretch and Õ(1) maximum space per node (the Õ notation hides polylogarithmic factors). However, their paper does not provide the explicit power of log n inside the Õ; hence, we cannot directly compare our results with those in [26]. It is noted in [26], though, that constructing efficient sparse covers for planar graphs with O(γ) radius and O(1) degree is an open problem, which we have solved.
There are also efficient routing schemes known for a weaker version of the routing problem called labeled routing, where the designer of the routing scheme is given the flexibility to assign names to nodes. Thorup [52] gives a labeled routing scheme for planar graphs with stretch (1 + ε) and memory overhead of O((1/ε) log^2 n) maximum bits per node. Name-independent routing is clearly less restrictive for the user than labeled routing, and hence a harder problem.
2.1.2.2 Synchronizers
Many distributed algorithms are designed assuming a synchronous model where
the processors execute and communicate in time synchronized rounds [30, 45]. How-
ever, synchrony is not always feasible in real systems due to physical limitations
such as different processing speeds or geographical dispersal. Synchronizers are
distributed programs that enable the execution of synchronized algorithms in asyn-
chronous systems [30, 31, 45, 47]. A synchronizer uses logical rounds to simulate
the time rounds of the synchronous algorithm.
One of the most efficient synchronizers is called ZETA [51]. This synchronizer is based on a sparse cover with locality parameter γ = 1, radius O(log_k n), and average degree O(k), for some parameter k. ZETA simulates a round in O(log_k n) time steps and uses O(k) messages per node on average. In contrast, using our covers, we obtain a better time to simulate a round. For planar graphs, our covers give a synchronizer with O(1) time and O(1) average messages per node. For H-minor free graphs, the synchronizer takes O(1) time and uses O(log n) messages per node on average.
Awerbuch and Peleg [34] present an algorithm for constructing a sparse cover
of a general graph based on the idea of coarsening. Starting from an initial cover
S consisting of the n clusters formed by taking the γ-neighborhoods of each of
the n nodes in G, their algorithm constructs a coarsening cover Z by repeatedly
merging clusters in S. For a parameter k ≥ 1, their algorithm returns a cover Z
with rad(Z) = O(kγ) and deg(Z) = O(k n^{1/k}) (the average degree is O(n^{1/k})). By choosing k = log n, the radius is O(γ log n) and the degree is O(log n). This is the
best known result for general graphs. For these graphs, there exists an inherent
tradeoff between the radius of a cover and its degree: a small degree may require a
large radius, and vice versa.
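A much-simplified sketch of the coarsening idea may help fix intuition. This is not the full algorithm of Awerbuch and Peleg [34], whose layered merging structure is what yields the stated radius and degree bounds; the greedy growth threshold n^(1/k), the unweighted hop-count neighborhoods, and all helper names here are simplifying assumptions. The sketch does, however, produce a valid γ-cover: every γ-neighborhood is absorbed wholesale into exactly one output cluster.

```python
from collections import deque

def gamma_neighborhood(adj, v, gamma):
    """Nodes within gamma hops of v (unweighted BFS)."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == gamma:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return seen

def coarsening_cover(adj, gamma, k):
    """Greedy sketch of coarsening: start from the gamma-neighborhoods and
    merge overlapping ones while a cluster keeps growing by factor n^(1/k)."""
    n = len(adj)
    pending = {v: gamma_neighborhood(adj, v, gamma) for v in adj}
    cover = []
    while pending:
        seed = next(iter(pending))
        cluster = set(pending.pop(seed))
        while True:
            prev = len(cluster)
            # absorb every remaining neighborhood that touches the cluster
            touching = [v for v, nb in pending.items() if nb & cluster]
            for v in touching:
                cluster |= pending.pop(v)
            # stop a cluster once growth drops below the sparsity threshold
            if not touching or len(cluster) < prev * n ** (1.0 / k):
                break
        cover.append(cluster)
    return cover
```

Because each γ-neighborhood is merged into a single cluster, every node is γ-satisfied by the result; bounding the radius and degree of the output is exactly the part that requires the careful construction of [34] rather than this greedy sketch.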
It is known ([47, Theorem 16.2.4]) that for every k ≥ 3, there exist graphs and values of γ (e.g., γ = 1) such that for every cover Z, if rad(Z) ≤ kγ, then deg(Z) = Ω(n^{1/k}). Thus, in these graphs, if rad(Z) = O(γ), then deg(Z) is polynomial in n.
2.2 Contributions
In light of the above tradeoff for arbitrary graphs, it is natural to ask whether
better sparse covers can be obtained for special classes of graphs. We answer this
question in the affirmative for the class of graphs that exclude a fixed minor. This
includes many popular graph families, such as: planar graphs, which exclude K5 and
K3,3, outerplanar graphs, which exclude K4 and K2,3, series-parallel graphs, which
exclude K4, and trees, which exclude K3.
We give improved bounds for planar graphs, unit disk graphs, and other graphs
excluding fixed minors (and improvements to distributed directory performance for
networks modeled by these graphs), and also a structural lower bound for sparse
covers in arbitrary graphs.
1. For any planar graph G, we present an algorithm for computing a sparse
cover Z with rad(Z) ≤ 24γ − 8 and deg(Z) ≤ 18. This cover is optimal
(modulo constant factors) with respect to both the degree and the radius. To
our knowledge, this is the first optimal construction for planar graphs (see
Section 2.8).
2. Using our sparse cover construction algorithm for planar graphs, we improve
distributed directory performance for networks that can be represented by
these graphs, achieving Stretch_find = O(1) and Stretch_move = O(log n).
3. For any unit disk graph G (or other graph with a constant-stretch planar spanner), we present a technique for computing a sparse cover Z with rad(Z) ≤ 24γt − 8 and deg(Z) ≤ 18 (for some constant t). This cover is optimal (modulo constant factors) with respect to both the degree and the radius. To our knowledge, this is the first optimal construction for unit disk graphs (see Section 2.9).
4. Using our sparse cover construction technique for unit disk graphs (and other
graphs with a constant-stretch planar spanner), we improve distributed di-
rectory performance for networks that can be represented by these graphs,
achieving Stretch_find = O(1) and Stretch_move = O(log n).
5. For any graph G that excludes a fixed minor graph H, we present an algorithm
for computing a sparse cover Z such that rad(Z) ≤ 4γ and deg(Z) = O(log n),
where n is the number of nodes in G. The constants in the degree bound
depend on the size of H (see Section 2.7).
6. Using our sparse cover construction algorithm for H-minor free graphs, we im-
prove distributed directory performance for networks that can be represented
by these graphs, achieving Stretch_find = O(log n) and Stretch_move = O(log n).
7. There exists a network with n nodes, and constrained by the locality parameter
γ and the maximum tolerable degree c, such that when clustered, there must
exist a cluster whose radius is Ω(γ log logc n) (see Section 2.5).
In each case the graphs are weighted, and the algorithms run in time polynomial in the size of G. For the class of H-minor free graphs, our construction
improves upon the previous work of Awerbuch and Peleg [34] by providing a smaller
radius. For planar graphs and unit disk graphs, our construction simultaneously
improves both the degree and the radius.
Related work in the area of sparse covers is presented next, in Section 2.3. Definitions and preliminaries can be found in Section 2.4. A structural lower bound is presented in Section 2.5. A technique for clustering shortest paths is described in Section 2.6.
Our sparse cover construction algorithm for graphs excluding a fixed minor can be
found in Section 2.7, for planar graphs in Section 2.8, and for unit disk graphs in
Section 2.9. We summarize the chapter in Section 2.10.
2.3 Related Work
Concurrently with our work, we became aware of a closely related work by Abraham, Gavoille, Malkhi, and Wieder [28] that gives an algorithm for constructing a sparse cover of diameter 4(r + 1)^2 γ and degree O(1) for any graph excluding Kr,r, for a fixed r > 1. Though the goal of both works is the same, ours yields different tradeoffs than [28]. For graphs excluding a fixed minor H, our algorithm returns a cover with radius at most 4γ, while their cover has a radius of 4(r + 1)^2 γ, which is clearly greater. On the other hand, their degree of O(1) is smaller than
ours of O(log n). We note that the constants for the degree are exponential in the
size of the excluded minor for both algorithms.
For planar graphs, our algorithm yields a much better tradeoff than [28]: we give a radius of no more than 24γ − 8 and a degree of no more than 18, while their cover (using r = 3, since a planar graph must exclude K3,3) gives a diameter of 64γ (which translates to a radius of at least 32γ) and a degree of 840 (this can be derived from the proof of Theorem 1.2 on page 6 of the technical report [28]).
Klein, Plotkin, and Rao [42] obtain sparse covers for H-minor free graphs
with degree O(1) but with a weak diameter O(γ), where the O(γ) length shortest
path between two nodes in the same cluster may not necessarily lie in the cluster
itself. For many applications of covers, such as compact routing and distributed
directories, this is not sufficient. In contrast, our construction yields clusters with a
strong diameter of O(γ) where the shortest path lies completely within the cluster.
For graphs with doubling dimension α, Abraham, Gavoille, Goldberg, and Malkhi [25] present a sparse cover with degree 4^α and radius O(γ). However, since
planar graphs and H-minor free graphs can have large doubling dimensions, this
does not yield efficient sparse covers for these graphs.
2.4 Definitions and Preliminaries
Some of the following definitions are borrowed from Awerbuch and Peleg [34]
and from Abraham and Gavoille [24].
2.4.1 Graph Basics
Consider a weighted graph G = (V,E, ω), where V is the set of nodes, E is
the set of edges, and ω is a weight function E → Z+ that assigns a weight ω(e) > 0
to every edge e ∈ E. An “unweighted” graph is a special case where ω(e) = 1 for all
e ∈ E. For simplicity, we will also write G = (V, E) and sometimes use the notation
v ∈ G to denote v ∈ V and e ∈ G to denote e ∈ E. For a graph H, we use the
notation V (H) and E(H) to denote the nodes and edges of H respectively.
A walk q is a sequence of nodes q = v1, v2, . . . , vk where nodes may be repeated.
The length of q is defined as length(q) = Σ_{i=1}^{k−1} ω(v_i, v_{i+1}). We also use walks with one node, q = v, where v ∈ V, for which length(q) = 0. If v_1 = v_k, the walk is
closed. A path is a walk with no repeated nodes.
Graph G is connected if there is a path between every pair of nodes. G′ = (V′, E′) is a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. If V′ ≠ V or E′ ≠ E, then G′ is said to be a proper subgraph of G. In the case where graph G is not connected, it consists of connected components G_1, G_2, . . . , G_k, where each G_i is a connected subgraph that is not a proper subgraph of any other connected subgraph of G. For any set of nodes V′ ⊆ V, the subgraph induced by V′ is G(V′) = (V′, E′), where E′ = {(u, v) ∈ E : u, v ∈ V′}. Let G − V′ = G(V − V′) denote the subgraph obtained by removing the vertex set V′ from G. For any subgraph G′ = (V′, E′), G − G′ = G − V′. For any two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), their union graph is G_1 ∪ G_2 = (V_1 ∪ V_2, E_1 ∪ E_2).
The distance between two nodes u, v in G, denoted dist_G(u, v), is the length of the shortest path between u and v in G. If there is no path connecting the nodes, then dist_G(u, v) = ∞. The j-neighborhood of a node v in G is N_j(v, G) = {w ∈ V : dist_G(v, w) ≤ j}. For V′ ⊆ V, the j-neighborhood of V′ in G is N_j(V′, G) = ⋃_{v∈V′} N_j(v, G).
If G is connected, the radius of a node v ∈ V with respect to G is rad(v, G) = max_{w∈V} dist_G(v, w). The radius of G is defined as rad(G) = min_{v∈V} rad(v, G). If G is not connected, then rad(G) = ∞. For every connected graph G, rad(G) ≤ diam(G) ≤ 2 · rad(G).
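The definitions above translate directly into code. The following sketch is illustrative only: the adjacency-list representation, the use of Dijkstra's algorithm, and the function names are assumptions for the example, not part of the thesis.

```python
import heapq
import math

def distances(adj, source):
    """Single-source shortest-path distances (Dijkstra) in a weighted graph.
    adj maps a node to a list of (neighbor, weight) pairs."""
    dist = {v: math.inf for v in adj}  # inf encodes "no path"
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def j_neighborhood(adj, v, j):
    """N_j(v, G) = { w : dist_G(v, w) <= j }."""
    d = distances(adj, v)
    return {w for w in adj if d[w] <= j}

def radius(adj):
    """rad(G) = min over v of the eccentricity max_w dist_G(v, w)."""
    return min(max(distances(adj, v).values()) for v in adj)
```

On the unit-weight path 0-1-2-3, for instance, the radius is 2 (attained at either center node) and N_1(1) is {0, 1, 2}, matching the definitions.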
2.4.2 Covers
Consider a set of vertices C ⊆ V in graph G = (V, E). The set C is called a cluster if the induced subgraph G(C) is connected. When the context is clear, we will use C to refer to G(C). Let Z = {C_1, C_2, . . . , C_k} be a set of clusters in G. For every node v ∈ G, let Z(v) ⊆ Z denote the set of clusters that contain v. The degree of v in Z is defined as deg(v, Z) = |Z(v)|. The degree of Z is defined as deg(Z) = max_{v∈V} deg(v, Z). The radius of Z is defined as rad(Z) = max_{C∈Z} rad(C).
For γ > 0, a set of clusters Z is said to γ-satisfy a node v in G, if there is a
cluster C ∈ Z, such that Nγ(v, G) ⊆ C. A set of clusters Z is said to be a γ-cover
for G, if every node of G is γ-satisfied by Z in G. We also say that Z γ-satisfies
a set of nodes X in G, if every node in X is γ-satisfied by Z in G (note that the
γ-neighborhood of the nodes in X is taken with respect to G).
An m-regional matching is a collection of read and write sets such that for any
pair of nodes u and v where distG(u, v) ≤ m, Read(u) and Write(v) intersect. The
radius of a regional matching is the furthest distance between any pair of nodes in
a read or write set, and the degree is the maximum number of nodes in such a set.
Given any sparse cover, a regional matching with the same radius and degree can
be easily constructed [35].
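To make the cover metrics of this subsection concrete, the following sketch checks deg(Z), rad(Z), and the γ-cover property on a small unweighted graph. All helper names are hypothetical, it measures distances in hops, and cluster_radius assumes each cluster induces a connected subgraph, as the definition of a cluster requires.

```python
from collections import deque

def hop_distances(adj, source, allowed=None):
    """BFS hop distances in the subgraph induced by `allowed` (all nodes if None)."""
    nodes = set(adj) if allowed is None else set(allowed)
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v in nodes and v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

def cover_degree(Z):
    """deg(Z): the maximum number of clusters any single node belongs to."""
    counts = {}
    for C in Z:
        for v in C:
            counts[v] = counts.get(v, 0) + 1
    return max(counts.values())

def cover_radius(adj, Z):
    """rad(Z): maximum cluster radius, measured inside the induced subgraph."""
    def cluster_radius(C):
        return min(max(hop_distances(adj, v, C).values()) for v in C)
    return max(cluster_radius(C) for C in Z)

def is_gamma_cover(adj, Z, gamma):
    """Every node's gamma-neighborhood (taken in G) must fit inside some cluster."""
    for v in adj:
        nb = {w for w, d in hop_distances(adj, v).items() if d <= gamma}
        if not any(nb <= set(C) for C in Z):
            return False
    return True
```

On a 6-cycle, the six overlapping triples {i−1, i, i+1} form a 1-cover with radius 1 and degree 3, whereas a partition into two disjoint halves fails the 1-cover test: partitions generally cannot γ-satisfy boundary nodes, which is why covers must overlap.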
2.4.3 Path Separators
A graph G with n nodes is k-path separable [24] if there exists a subgraph S, called the k-path separator, such that:
(i) S = P_1 ∪ P_2 ∪ · · · ∪ P_ℓ, where for each 1 ≤ i ≤ ℓ, subgraph P_i is the union of k_i paths, each of which is a shortest path in G − ⋃_{1≤j<i} P_j with respect to its end points,
(ii) Σ_i k_i ≤ k, and
(iii) either G − S is empty, or each connected component of G − S is k-path separable and has at most n/2 nodes.
For instance, any rectangular grid of nodes (2-dimensional mesh) is 1-path separable
by taking S to be the middle row path. Trees are also 1-path separable by taking S
to be the center node whose subtrees have at most n/2 nodes. Thorup [52] shows how
to compute in polynomial time a 3-path separator for planar graphs, in particular,
the 3-path separator is S = P1. That is, S consists of three paths each of which is
a shortest path in the original graph.
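The tree case mentioned above can be sketched directly: a standard centroid search (the code and its names are assumptions for illustration, not from the thesis) finds a node whose removal leaves components of at most n/2 nodes, witnessing that trees are 1-path separable.

```python
def tree_centroid(adj):
    """Find a node of a tree whose removal leaves components of <= n/2 nodes,
    making the tree 1-path separable with S = {that node}."""
    n = len(adj)
    # compute subtree sizes via an iterative DFS from an arbitrary root
    root = next(iter(adj))
    order, parent = [], {root: None}
    stack = [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)
    size = {u: 1 for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    # the centroid is the node whose largest hanging component has <= n/2 nodes
    for u in adj:
        below = [size[v] for v in adj[u] if v != parent[u]]
        above = n - size[u]
        if max(below + [above]) <= n // 2:
            return u
```

On the 5-node path 0-1-2-3-4, the middle node 2 is returned: removing it leaves two components of two nodes each, within the n/2 bound.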
2.4.4 Graph Minors
The contraction of edge e = (u, v) in G is the replacement of vertices u and v
by a single vertex whose incident edges are all the edges incident to u or to v except
for e. A graph H is said to be a minor of graph G, if H is a subgraph of a graph
obtained by a series of edge contractions starting from G. Graph G is said to be
H-minor free, if H is not a minor of G. Abraham and Gavoille [24] generalize the
result of Thorup [52] for the class of H-minor free graphs:
Theorem 2.4.1 (Abraham and Gavoille [24]) Every H-minor free connected graph
is k-path separable, for some k = k(H), and a k-path separator can be computed in
polynomial time.
The proof of Theorem 2.4.1 is based on the structure theorems for graphs
excluding minors of Robertson and Seymour [49, 50]. We note that in Theorem 2.4.1,
the parameter k is exponential in the size of the minor. Some interesting classes of
H-minor free graphs are: planar graphs, which exclude K5 and K3,3, outerplanar
graphs, which exclude K4 and K2,3, series-parallel graphs, which exclude K4, and
trees, which exclude K3.
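The contraction operation defined above can be sketched as follows; the set-based adjacency representation and the convention of naming the merged vertex u are assumptions made for the example.

```python
def contract_edge(adj, u, v):
    """Contract edge (u, v): replace u and v by a single vertex (named u here)
    incident to every edge formerly incident to u or v, except (u, v) itself."""
    merged = (adj[u] | adj[v]) - {u, v}
    new_adj = {}
    for w, nbrs in adj.items():
        if w in (u, v):
            continue
        new_nbrs = set(nbrs)
        # edges that pointed at u or v now point at the merged vertex u
        if u in new_nbrs or v in new_nbrs:
            new_nbrs -= {u, v}
            new_nbrs.add(u)
        new_adj[w] = new_nbrs
    new_adj[u] = merged
    return new_adj
```

For example, contracting one edge of a 4-cycle yields a triangle, which illustrates why K3 is a minor of every cycle (and hence why trees, which exclude K3, contain no cycle).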
2.5 A Structural Lower Bound for Sparse Covers
We now present a structural lower bound for sparse covers of arbitrary graphs
(Sparse Cover Contribution 7). As previously mentioned, the best known sparse
cover result for general graphs is due to Awerbuch and Peleg, and has a radius of
O(γ log n) and a degree of O(log n). In this section, we provide a lower bound on the
radius of clusters in an arbitrary graph, given the locality parameter γ and c, the
maximum tolerable degree. This is done using a recursive construction that allows
us to force a radius increase in certain clusters of the graph. That is, we provide a graph such that, no matter how it is clustered, at least one cluster's radius is greater than or equal to the lower bound.
For any possible cover S, the optimal radius is rad(S) = O(γ), and the optimal
degree is deg(S) = O(1). The lower bound given in this section proves that the ra-
dius and degree cannot both be optimized simultaneously when clustering arbitrary
graphs.
Theorem 2.5.1 There exists a network with n nodes, and constrained by param-
eters γ and c, such that when clustered, there must exist a cluster whose radius is
Ω(γ log logc n).
The lower bound is obtained from a recursive construction that, at each level, guarantees an increase in some cluster's radius. We start from an initial graph for which the radius of any cover S, rad(S), is known. We then replicate the graph many times and create connections between specific nodes. The manner in which this is done guarantees that when the new graph is clustered into a cover Z, rad(Z) is at least rad(S) + 2γ. This behavior is independent of the clustering algorithm, and thus proves a lower bound for arbitrary graphs.
2.5.1 Graph Construction
The construction takes in user-specified parameters γ (the locality parameter),
c (the maximum tolerable degree, c ≥ 1), and n (the number of nodes in the graph).
S_i = (V_i, E_i) will refer to the graph at level i of the construction. At each such level, we also have a set A_i ⊆ V_i containing all the anchor nodes of S_i. When we move from level i−1 to level i of the construction, we replicate S_{i−1} a total of (c + 1)|A_{i−1}| times, and connect the replicas together with gadgets. A gadget is a star network with rays of length γ, such that the tip of each ray is an anchor node of a replicated S_{i−1} graph (one ray connects each replica). A_i contains the anchor nodes from all the S_{i−1} replicas. Each replicated node is associated with its original, and when gadgets are added to the graph, their rays connect all nodes sharing the same association. We continue the recursion until all nodes in the network have been used. This process continuously increases the radius of some cluster in the graph, yielding a lower bound on the radius.
The recursion basis is at level 0. We initialize the graph to a single node v (which is also an anchor node): S_0 = ({v}, ∅) and A_0 = {v}. For each level i > 0, we create a set R_i comprised of δ = (c + 1)|A_{i−1}| replicas of S_{i−1}: R_i = {S^0_{i−1}, S^1_{i−1}, . . . , S^{δ−1}_{i−1}}. For each S^x_{i−1} ∈ R_i, there is a set of anchor nodes β_{i,x} = {α_0, α_1, . . . , α_{|A_{i−1}|−1}}. For any anchor node α_t and S_{i−1} replicas x and y, the nodes β^{α_t}_{i,x} and β^{α_t}_{i,y} are associated, since both x and y represent the same graph from the previous level of recursion.

Figure 2.1: S_0 and S_1. In S_1, S_0 is replicated c + 1 times and connected using the gadget G^0_1.

We complete the construction of the current level by connecting the S_{i−1} replicas using |A_{i−1}| gadgets (star networks with δ rays, each of length γ) in the following manner: for all j < |A_{i−1}|, create gadget G^j_i whose rays connect β^{α_j}_{i,0}, β^{α_j}_{i,1}, . . . , β^{α_j}_{i,δ−1}, and add it to the set G_i. Then S_i = R^0_i ∪ R^1_i ∪ · · · ∪ R^{δ−1}_i ∪ G^0_i ∪ G^1_i ∪ · · · ∪ G^{|A_{i−1}|−1}_i and A_i = β_{i,0} ∪ β_{i,1} ∪ · · · ∪ β_{i,δ−1}. While there still exist unused nodes in the network, increment i and continue to the next level.
2.5.2 Set Cardinalities
Based on the construction, it is clear that |A_0| = 1 and |A_1| = c + 1, and also that |V_0| = 1 and |V_1| = cγ + γ + 1. To calculate |A_i| and |V_i| for general i, we must account for several quantities. For |A_i|, we know that S_{i−1} has |A_{i−1}| anchor nodes, and that we replicated it a total of (c + 1)|A_{i−1}| times. Therefore, |A_i| = (c + 1)|A_{i−1}|^2. For |V_i|, we know we have (c + 1)|A_{i−1}| replicas, and that each replica has |V_{i−1}| nodes. We must also count gadget nodes. A gadget has (c + 1)|A_{i−1}| rays (as per the construction), and since each ray must be of length γ (one node of each ray has already been accounted for in the replicas, and we must add the center node r of each gadget), each gadget contains (c + 1)|A_{i−1}|(γ − 1) + 1 nodes. Lastly, since the previous level had |A_{i−1}| anchor nodes, we must have a total of |A_{i−1}| gadgets. Therefore, |V_i| = |V_{i−1}|(c + 1)|A_{i−1}| + |A_{i−1}|[(c + 1)|A_{i−1}|(γ − 1) + 1].

We first solve the recurrence for A: |A_i| = (c + 1)|A_{i−1}|^2. Observe that |A_0| = 1, |A_1| = (c + 1), |A_2| = (c + 1)^3, |A_3| = (c + 1)^7, |A_4| = (c + 1)^15, and so on.
Figure 2.2: The structure of a general S_i graph.
From this, we see that |A_i| = (c + 1)^{2^i − 1}, and that |A_i| = Θ((c + 1)^{2^i − 1}). We can now solve the recurrence for V, first finding an upper bound, then a lower bound. |V_i| = |V_{i−1}|(c + 1)|A_{i−1}| + |A_{i−1}|[(c + 1)|A_{i−1}|(γ − 1) + 1]. To simplify the algebra, let p = c + 1, so |A_{i−1}| = (c + 1)^{2^{i−1} − 1} = p^{2^{i−1} − 1}.

|V_i| = |V_{i−1}| · p · p^{2^{i−1} − 1} + p^{2^{i−1} − 1} · p · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^{i−1}} · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^i − 1} · (γ − 1) + p^{2^{i−1} − 1}
     ≤ |V_{i−1}| · p^{2^i} + p^{2^i} · (γ − 1) + p^{2^i}
     ≤ |V_{i−1}| · p^{2^i} + γ p^{2^i}
     ≤ 2|V_{i−1}| · p^{2^i}, since we know ∀i > 0, |V_i| > γ

From the above, observe that |V_0| = 1, |V_1| ≤ 2p^{2^i}, |V_2| ≤ 4p^{2·2^i}, |V_3| ≤ 8p^{3·2^i}, |V_4| ≤ 16p^{4·2^i}, and so on. From this, we see that |V_i| ≤ 2^i p^{i·2^i} = 2^i (c + 1)^{i·2^i}, and that |V_i| = O(2^i (c + 1)^{i·2^i}).
|V_i| = |V_{i−1}| · p · p^{2^{i−1} − 1} + p^{2^{i−1} − 1} · p · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^{i−1}} · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^i − 1} · (γ − 1) + p^{2^{i−1} − 1}
     ≥ |V_{i−1}| · p^{2^{i−1}} + p^{2^i − 1} · (γ − 1)
     ≥ |V_{i−1}| · p^{2^{i−1}}

From the above, observe that |V_0| = 1, |V_1| ≥ p^{2^{i−1}}, |V_2| ≥ p^{2·2^{i−1}}, |V_3| ≥ p^{3·2^{i−1}}, |V_4| ≥ p^{4·2^{i−1}}, and so on. From this, we see that |V_i| ≥ p^{i·2^{i−1}} = (c + 1)^{i·2^{i−1}}, and that |V_i| = Ω((c + 1)^{i·2^{i−1}}).
Since |Vi| represents the total number of nodes at level i of the construction,
we can now solve for how many levels of the construction must be possible for a
network containing n nodes.
(c+1)^{i·2^{i-1}} ≤ n
i·2^{i-1} ≤ log_{c+1} n
2^{i-1} ≤ log_{c+1} n
i − 1 ≤ log_2 log_{c+1} n
i ≤ log_2 log_{c+1} n + 1

2^i(c+1)^{i·2^i} ≥ n
(c+1)^{2i·2^i} ≥ n, since we know c ≥ 1
2i·2^i ≥ log_{c+1} n
2^{2i} ≥ log_{c+1} n
2i ≥ log_2 log_{c+1} n
i ≥ (1/2)·log_2 log_{c+1} n
From this analysis, we see that i = O(log logc n) and also i = Ω(log logc n).
Therefore, i = Θ(log logc n).
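These recurrences are easy to check numerically. Below is a small sanity script (not part of the thesis; the parameter choices c = 2 and γ = 3 are illustrative) that unrolls A and V and compares them against the closed form for |A_i| and against conservative doubly exponential brackets for |V_i|; the asserted lower bracket p^{2^i−1} is the one that follows directly from multiplying out |V_i| ≥ |V_{i-1}|·p^{2^{i-1}}.

```python
# Sanity check (illustrative, not from the thesis) of the recurrences
#   A_i = (c+1) * A_{i-1}^2
#   V_i = V_{i-1}*(c+1)*A_{i-1} + A_{i-1}*((c+1)*A_{i-1}*(gamma-1) + 1)

def unroll(c, gamma, max_level):
    A, V = 1, 1                          # level 0: a single anchor node
    table = []
    for i in range(1, max_level + 1):
        V = V * (c + 1) * A + A * ((c + 1) * A * (gamma - 1) + 1)
        A = (c + 1) * A * A              # V_i uses A_{i-1}, so update V first
        table.append((i, A, V))
    return table

c, gamma = 2, 3                          # illustrative parameters
for i, A, V in unroll(c, gamma, 6):
    assert A == (c + 1) ** (2 ** i - 1)              # closed form for |A_i|
    assert V >= (c + 1) ** (2 ** i - 1)              # lower bracket p^(2^i - 1)
    assert V <= (2 ** i) * (c + 1) ** (i * 2 ** i)   # upper bound from the text
```

Both brackets are doubly exponential in i, which is what forces i = Θ(log log_c n) once the construction is limited to n nodes.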
2.5.3 Proving a Lower Bound
Let Γ be the graph at level i of the construction, consisting of the S_{i-1} subgraphs
R_i^0, R_i^1, …, R_i^{δ−1}, as well as the connecting gadgets. Let ∆ be a copy of Γ, except
without the gadgets (that is, ∆ contains only the S_{i-1} subgraphs).
Lemma 2.5.1 Suppose we have two graphs from the construction L and M (at level
i), such that M is a replica of L; ∃ nodes l1, l2 ∈ L that are a minimum of x hops
apart; ∃ nodes m1,m2 ∈ M that are a minimum of x hops apart; and both l1 and
m1 share the same replica association. When L and M are combined by a gadget g,
nodes l1 and m2 are a minimum of x + 2γ hops apart.
Proof: To create a path from l1 to m2, we must cross through the 2γ hops of
some gadget at least once (since these nodes lie in different S_{i-1} replicas and gadgets
are the only inter-replica connections). It is possible that our path will cross
through many gadgets; however, this behavior does not affect the following argument:
the new path must contain at least x + 2γ hops. The 2γ hops come from the crossover,
and the x hops come from the sum of the subpaths in L plus the sum of the subpaths in
M. This sum must be at least x hops, since l1 and m1 are both in the same
position of S_{i-1}, and we know that the distance from m1 to m2 is at least x hops.
If there existed a path shorter than x + 2γ hops from l1 to m2 through some gadget,
then there would exist a path shorter than x hops from m1 to m2 within M, a
contradiction. Therefore, including the required nodes to cross through some gadget
g, the minimum path from l1 to m2 contains at least x + 2γ hops.
Lemma 2.5.2 If two nodes of some R_i^j are at least x hops apart in ∆, then they
are at least x hops apart in Γ as well.

Proof: Suppose two nodes v1 and v2 are at least x hops apart in ∆ (this path must
be in the same R_i^j subgraph). For the sake of contradiction, suppose ∃ a path from
v1 to v2 in Γ that is less than x hops. Since Γ is the same as ∆ with the addition
of gadgets, this path must include nodes of some gadget (at least 4γ of them, since
the path must eventually come back). Since all R_i^j subgraphs are the same, the
Figure 2.3: This figure demonstrates one way path lengths may grow as described in Lemma 2.5.1.
sum of the subpaths in each used R_i^j must be less than or equal to x hops, or there
would exist a path from v1 to v2 in ∆ of less than x hops. Therefore, the path in Γ is
at least x + 4γ hops (under the assumption), a contradiction. Therefore, the path
from v1 to v2 in Γ is also at least x hops (and contains no gadget nodes).
Property 1: Every cover of R_i^j has at least two nodes in R_i^j, at least x
hops apart, that are satisfied in the same cluster.

Property 2: Every cover of Γ contains a cluster C that includes two nodes of R_i^j,
at least x hops apart, that are satisfied in C.
Lemma 2.5.3 If Property 1 is true, then Property 2 is true.
Proof: For the sake of contradiction, suppose that Property 2 is false. Then there
exists an R_i^j subgraph and a cover F_Γ of Γ, such that F_Γ does not contain any
cluster that satisfies any two nodes of R_i^j that are at least x hops apart (in R_i^j). We will
now transform the cover F_Γ to a cover F_∆ in graph ∆ as follows. Let Z be a cluster
in F_Γ, of the graph Γ. By removing the nodes and edges that belong to the gadgets
(of level i), Z is transformed to one or more clusters in ∆. Take a node v of Z, so
that it is in both Γ and ∆, and is satisfied in Z. Then one of the clusters in the
transformation of Z will satisfy v in graph ∆ too. The reason for this is as follows.
Let Q be the set of nodes that are at most γ hops away from v, and belong to ∆.
Note that the nodes in Q must belong to the same subgraph R_i^k that v belongs
to. Let Q′ denote the smallest connected subgraph of Z that contains Q (Q′ exists,
since v is satisfied in Z). We now show that Q = Q′. Suppose that Q ≠ Q′. Then
v connects to some node v′ of Q using gadget nodes. However, this is impossible,
since this would imply that the distance between v and v′ is at least 2γ. Therefore,
Q = Q′. Since Q is entirely in R_i^k and connected, Q will be completely within a
cluster of ∆ after the transformation of Z from Γ to ∆. Therefore, v will be satisfied
in one of the clusters of Z in the transformation.
Let F_∆ be the set of clusters containing the transformations of the clusters in
F_Γ from Γ to ∆. Then F_∆ is a cover of ∆, since every node of ∆ is satisfied in one
of the clusters of F_∆.
Let E_i be the set of clusters of F_∆ that belong to R_i^j. Clearly, E_i is a cover
for R_i^j in ∆. Therefore, Property 1 must hold for E_i, so at least two nodes
v1 and v2 that are a distance of at least x hops apart are satisfied in the same cluster of
E_i. These two nodes must then be satisfied in the same cluster in F_Γ, since E_i
is obtained by splitting clusters of F_Γ.

By Lemma 2.5.2, the distance between v1 and v2 in Γ is at least x hops. At
the same time, they are satisfied in the same cluster of F_Γ, a contradiction of our
assumption that F_Γ contains no such cluster.
Lemma 2.5.4 At each level i of the construction (except for level 0), ∃ nodes v1
and v2, such that the minimum distance between them is at least 2γi hops, and v1
and v2 are satisfied in the same cluster.
Proof: The base case is at level 1 of the construction. There are c + 1 anchor
nodes connected by the gadget G_1^0. Each such anchor node is 2γ hops away from
any other. If each were satisfied individually, the node at the center of the gadget
would be overlapped c+1 times. This exceeds the degree threshold and is therefore
impossible. So, two anchor nodes must be satisfied in the same cluster, and are 2γ
hops apart.
Suppose the statement is true at level i−1 of the construction. That is, there
exist two nodes v1 and v2, a distance of 2γ(i−1) hops apart, that are satisfied in
the same cluster.
Now consider moving to level i. There will be (c+1)|A_{i-1}| replicas of S_{i-1}.
So for each replica x, there exist nodes v_1^x and v_2^x that are 2γ(i−1) hops apart
and are satisfied in the same cluster. Now consider adding the gadgets (to complete
the construction of the level). From lemma 2.5.3, we know that each replica x must
still have two nodes that are 2γ(i− 1) hops apart and satisfied in the same cluster.
Since there are (c+1)|A_{i-1}| replicas, there must exist some gadget that connects at
least c+1 of these nodes, by the pigeonhole principle. If all these nodes were
satisfied in different clusters, the center of this gadget is overlapped c+1 times. This
exceeds the degree threshold and is therefore impossible. So at least two of these
nodes are satisfied in the same cluster, call them v1 and v2. Further, the distance
between these nodes is guaranteed to be 2γ(i−1)+2γ = 2γi hops from lemma 2.5.1.
Therefore, at level i of the construction, nodes v1 and v2 are 2γi hops apart, and
are satisfied in the same cluster.
Theorem 2.5.2 There exists a network with n nodes, and constrained by param-
eters γ and c, such that when clustered, there must exist a cluster whose radius is
Ω(γ log logc n).
Proof: We obtain the lower bound by determining how many levels of recursion
are possible when limited to the use of at most n nodes. Because of the nature of
the construction, this is done by solving the recurrence relation V (and A in the
process), and then determining how many levels of recursion are guaranteed to occur
(i). From Section 2.5.2, we see that i = Θ(log logc n). From lemma 2.5.4, at level i
of the construction, the minimum radius of some cluster is at least 2γi. Constrained
to n nodes, there can be log logc n levels of the construction. From this, we see the
minimum radius of some cluster is Ω(γ log logc n).
2.6 Shortest Path Clustering
Our algorithms for cover construction are based on a recursive application of
a basic routine called shortest-path clustering. We observe that it is easy to cluster
the γ-neighborhood of all nodes along a shortest path in the graph using clusters of
radius O(γ) and degree O(1). For a graph G, we first identify an appropriate set of
shortest paths P in G. We cluster the cγ-neighborhood (for a constant c) of every
path p ∈ P using shortest-path clustering, and then remove P together with its c′γ-
neighborhood from G, for some c′ < c. This gives residual connected components
G′_1, G′_2, …, G′_r that contain the remaining unclustered nodes as a subset. We apply
the same procedure recursively to each component G′_i by identifying appropriate
shortest paths in them. The algorithm terminates when there are no remaining
nodes.
Consider an arbitrary weighted graph G, and a shortest path p between a pair
of nodes in G. For any β > 0, we construct a set of clusters R, which β-satisfies
every node of p in G. The returned set R has a small radius, 2β, and a small degree,
3. Algorithm Shortest-Path-Cluster contains the details of the construction of R.
Lemma 2.6.1 establishes the correctness of the algorithm.
Algorithm 1: Shortest-Path-Cluster(G, p, β)
Input: Graph G; shortest path p ∈ G; parameter β > 0;
Output: A set of clusters that β-satisfies p;
Suppose p = v_1, v_2, …, v_ℓ;
// partition p into subpaths p_1, p_2, …, p_s of length at most β
i ← 1; j ← 1;
while i ≠ ℓ + 1 do
    Let p_j consist of all nodes v_k such that i ≤ k ≤ ℓ and dist_G(v_i, v_k) ≤ β;
    j ← j + 1;
    Let i be the smallest index such that i ≤ ℓ and v_i is not contained in any p_k for k < j; if no such i exists, then i ← ℓ + 1;
Let s denote the total number of subpaths p_1, p_2, …, p_s of p generated;
// cluster the subpaths
for i = 1 to s do
    A_i ← N_β(p_i, G);
R ← ⋃_{1≤i≤s} A_i;
return R;
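The routine can be rendered in executable form. The sketch below is my own rendering (not the thesis code) under simplifying assumptions: the graph is unweighted and undirected, given as an adjacency dict, so hop counts play the role of dist_G; on a shortest path, the nodes within β of the pivot form a contiguous block, which simplifies the subpath partition.

```python
from collections import deque

def bfs_dist(adj, sources):
    """Hop distances from a set of source nodes (multi-source BFS)."""
    dist = {s: 0 for s in sources}
    q = deque(sources)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def shortest_path_cluster(adj, path, beta):
    """Partition `path` into subpaths of radius <= beta around a pivot,
    then return the beta-neighborhood of each subpath as a cluster."""
    clusters, i = [], 0
    while i < len(path):
        d = bfs_dist(adj, [path[i]])           # distances from pivot v_i
        j = i
        while j < len(path) and d[path[j]] <= beta:
            j += 1                             # subpath p_t = path[i:j]
        nd = bfs_dist(adj, path[i:j])          # N_beta(p_t, G)
        clusters.append({u for u, x in nd.items() if x <= beta})
        i = j
    return clusters

# A 10-node path graph: each node should land in at most 3 clusters,
# matching the degree bound of Lemma 2.6.1.
adj = {v: [u for u in (v - 1, v + 1) if 0 <= u <= 9] for v in range(10)}
clusters = shortest_path_cluster(adj, list(range(10)), beta=2)
assert max(sum(v in C for C in clusters) for v in range(10)) <= 3
```

Each cluster is the β-neighborhood of a subpath of radius at most β, so its radius is at most 2β, as property ii of Lemma 2.6.1 states.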
Lemma 2.6.1 For any graph G, shortest path p ∈ G, and β > 0, the set R returned
by Algorithm Shortest-Path-Cluster(G, p, β) has the following properties: (i) R is a
set of clusters that β-satisfies p in G; (ii) rad(R) ≤ 2β; (iii) deg(R) ≤ 3.
Proof: For property i, it is easy to see that R is a set of clusters, since each A_i
is a connected subgraph of G consisting of the β-neighborhood of a subpath p_i of
Figure 2.4: A demonstration of the proof of property iii of Lemma 2.6.1.
p. For each node v ∈ p_i, A_i β-satisfies v in G, since it contains N_β(v, G). Thus, R
β-satisfies p in G.
For property ii, we show that each cluster A_i has radius no more than 2β. Let
v_i be an arbitrary vertex in p_i. By the construction, for any node v ∈ p_i, it must be
true that dist_G(v_i, v) ≤ β. Since any node u ∈ A_i is at a distance of no more than
β from some node in p_i, there is a path of length at most 2β from v_i to u. Thus,
rad(R) ≤ 2β.

For property iii, suppose for the sake of contradiction that deg(R) ≥ 4 (see
Figure 2.4). Let v be a node with degree deg(v, R) = deg(R). Then v belongs to at
least 4 clusters, say A_i, A_j, A_k, and A_l, with i < j < k < l. Since v belongs to A_i,
there is a path q_i of length at most β between v and some node v_i ∈ p_i. Similarly,
there exists a path q_l of length at most β between v and some node v_l ∈ p_l. By
concatenating q_i and q_l, we obtain a path of length at most 2β connecting v_i and
v_l. On the other hand, both v_i and v_l lie on p, which is a shortest path in G, and
hence the path from v_i to v_l on p must be a shortest path from v_i to v_l. Let v_j
and v_k denote the nodes on p_j and p_k respectively that are closest to v_i. By the
construction, dist_G(v_j, v_k) > β, since otherwise v_k would have been included in
p_j. Similarly, dist_G(v_k, v_l) > β. Since dist_G(v_i, v_l) > dist_G(v_j, v_k) + dist_G(v_k, v_l), it
follows that dist_G(v_i, v_l) > 2β, a contradiction. Thus, deg(R) ≤ 3.
2.7 Cover for k-Path Separable Graphs
We now present Algorithm Separator-Cover, which returns a cover with a small
radius and degree for any graph that has a k-path separator (Sparse Cover Contribution 5).
Theorem 2.7.1 establishes the correctness and properties of the algorithm,
and uses Lemma 2.7.1, which gives some useful properties about clusters.
Algorithm 2: Separator-Cover(G, γ)
Input: Connected graph G that is k-path separable; locality parameter γ > 0;
Output: γ-cover for G;
// base case
if G consists of a single vertex v then
    Z ← v;
    return Z;
// main case
Let S = P_1 ∪ P_2 ∪ · · · ∪ P_l be a k-path separator of G;
for i = 1 to l do
    foreach p ∈ P_i do
        A_i ← Shortest-Path-Cluster(G − ⋃_{1≤j<i} P_j, p, 2γ);
A ← ⋃_{1≤i≤l} A_i;
G′ ← G − ⋃_{1≤j≤l} P_j;
// recursively cluster each connected component
Let G′_1, G′_2, …, G′_r denote the connected components of G′;
B ← ⋃_{1≤i≤r} Separator-Cover(G′_i, γ);
Z ← A ∪ B;
return Z;
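The recursion can be sketched as follows. This is a schematic rendering of my own, not the thesis code: computing a k-path separator is itself nontrivial, so the separator routine and the path-clustering routine are passed in as functions (both names are hypothetical), and the graph is an adjacency dict restricted on the fly to the surviving node set.

```python
from collections import deque

def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    left, comps = set(nodes), []
    while left:
        s = left.pop()
        comp, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v in left:
                    left.discard(v)
                    comp.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

def separator_cover(nodes, adj, gamma, find_separator, cluster_path):
    """Skeleton of Separator-Cover. `find_separator(nodes, adj)` yields the
    shortest paths of a k-path separator of the induced subgraph;
    `cluster_path(nodes, adj, p, beta)` stands in for Shortest-Path-Cluster
    restricted to `nodes`."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return [nodes] if nodes else []        # base case: single vertex
    cover, remaining = [], set(nodes)
    for p in find_separator(nodes, adj):       # S = P_1 u ... u P_l
        cover += cluster_path(remaining, adj, p, 2 * gamma)
        remaining -= set(p)                    # earlier paths are removed
    for comp in components(remaining, adj):    # recurse on components of G'
        cover += separator_cover(comp, adj, gamma,
                                 find_separator, cluster_path)
    return cover
```

Because the separator halves the vertex count at each level, the recursion depth is at most lg n, which is where the 3k(lg n + 1) degree bound of Theorem 2.7.1 comes from.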
Lemma 2.7.1 Let C be a set of clusters that 2γ-satisfies a set of nodes W in graph
G. If some set of clusters D is a γ-cover for G−W , then C ∪D is a γ-cover for G.
Proof: Since C 2γ-satisfies W in G, C also γ-satisfies N_γ(W,G) in G. Thus, C
γ-satisfies W ∪ N_γ(W,G) in G. Next, consider a vertex u ∈ G − (W ∪ N_γ(W,G)).
For any vertex u′ ∈ W, it must be true that u′ ∉ N_γ(u,G), since u ∉ N_γ(W,G),
implying that u ∉ N_γ(u′,G). Thus, N_γ(u,G) lies completely in G − W. Since D is
a γ-cover for G − W, for every vertex u ∈ (G − W) − N_γ(W,G), D γ-satisfies u in
G − W, and hence in G. For any u′ ∈ W ∪ N_γ(W,G), C γ-satisfies u′ in G. Thus,
for any v ∈ G, C ∪ D γ-satisfies v in G, and is therefore a γ-cover for G.
Theorem 2.7.1 For any connected k-path separable graph G with n nodes, and
locality parameter γ > 0, Algorithm Separator-Cover(G, γ) returns a set Z with the
following properties: (i) Z is a γ-cover for G; (ii) rad(Z) ≤ 4γ; (iii) deg(Z) ≤ 3k(lg n + 1).
Proof: For property i, the proof is by induction on the number of vertices in G.
The base case is when G has only one vertex, in which case the algorithm is clearly
correct. For the inductive case, suppose that for every k-path separable graph with
less than n vertices, the algorithm returns a γ-cover for the graph. Let G be a
k-path separable graph with n vertices.
The last part of the algorithm recursively calls Separator-Cover on every con-
nected component in G′. Since the number of vertices in G′ is less than n, the
number of vertices in each component G′_i is less than n. By the inductive assumption,
for each i = 1, 2, …, r, Separator-Cover(G′_i, γ) returns a γ-cover for G′_i. The
union of the γ-covers for the connected components of G′ is clearly a γ-cover for G′,
hence B is a γ-cover for G′.
For i = 1, 2, …, l+1, define G_i = G − ⋃_{1≤j<i} P_j. Clearly, G_1 = G and
G_{l+1} = G′. We will prove that for all i such that 1 ≤ i ≤ l+1, the set ⋃_{i≤j≤l} A_j ∪ B
is a γ-cover for G_i. The proof is through reverse induction on i, starting from i = l+1
and going down to i = 1. The base case i = l+1 is clear, since B is a γ-cover for
G′ = G_{l+1}. Suppose the above statement is true for i = ν, i.e., A_ν ∪ A_{ν+1} ∪ … ∪ A_l ∪ B
is a γ-cover for G_ν. Consider G_{ν−1} = G_ν ∪ P_{ν−1}. From the correctness of Algorithm
Shortest-Path-Cluster (proven in Lemma 2.6.1), we have that A_{ν−1} 2γ-satisfies P_{ν−1}
in G_{ν−1}. Since A_ν ∪ A_{ν+1} ∪ … ∪ A_l ∪ B is a γ-cover for G_{ν−1} − P_{ν−1}, using Lemma 2.7.1
we have that A_{ν−1} ∪ A_ν ∪ … ∪ A_l ∪ B is a γ-cover for G_{ν−1}, thereby proving the inductive
step. Thus, ⋃_{1≤j≤l} A_j ∪ B is a γ-cover for G_1 = G, proving the correctness
of the algorithm for graph G with n vertices.
For property ii, we note that each cluster is obtained from an invocation of
Algorithm Shortest-Path-Cluster with input argument β = 2γ. From Lemma 2.6.1,
the radius of each cluster is at most 2β = 4γ. Thus, rad(Z) ≤ 4γ.
For property iii, we visualize the recursive invocations of the algorithm as a
tree T , where each node is associated with an input graph on an invocation of the
recursive algorithm. For each node v ∈ T , let G(v) denote the associated input
graph and N(v) denote the number of vertices in G(v). Let r denote the root, thus
G(r) = G. Clearly, for each vertex v ∈ T , G(v) is a connected subgraph in G, and
the leaves represent components that require no further recursive calls. The depth
of any node in T is defined as the distance from the root. The depth of the tree is
defined as the maximum depth of any node.
For any node v ∈ T , by the property of the path separator, we have for each
child v′ of v, N(v′) ≤ N(v)/2. Since N(r) = n, any node at a depth of i has at most
n/2i vertices. Since every leaf has at least 1 vertex, the depth of the tree is no more
than lg n.
Consider any node u ∈ G. Suppose u belongs to G(v) for some node v in
T . At v, clusters are formed by calling Shortest-Path-Cluster no more than k times.
From Lemma 2.6.1, u appears in no more than 3 clusters returned by each call of
Shortest-Path-Cluster. Thus, due to all clusters formed at any node v, u appears in
no more than 3k clusters. Further, if v1, v2, . . . , vx are the children of v, it is clear
that G(v1), G(v2), . . . , G(vx) are all disjoint from each other. Thus, u can belong
to at most one component among G(v1), G(v2), . . . , G(vx). Since the depth of T
is no more than lg n, node u can belong to G(v) for no more than lg n + 1 nodes
v ∈ T . Thus, u can belong to at most 3k(lg n + 1) clusters in total, implying that
deg(Z) ≤ 3k(lg n + 1).
Upon combining Theorem 2.7.1 with Theorem 2.4.1, we get the following.
Theorem 2.7.2 For any graph G that excludes a fixed size minor H, given a pa-
rameter γ > 0, there is an algorithm that returns in polynomial time a set of clusters
Z with the following properties: (i) Z is a γ-cover for G; (ii) rad(Z) ≤ 4γ; (iii)
deg(Z) ≤ 3k(lg n + 1); where k = k(H) is a parameter that depends on the size of
the excluded minor H.
2.8 Cover for Planar Graphs
Since every planar graph is 3-path separable [52], Theorem 2.7.1 immediately
yields a γ-cover for a planar graph with radius O(γ) and degree O(log n). In this
section, we present an improved cover for planar graphs whose radius is O(γ) and
degree O(1), both of which are optimal up to constant factors (Sparse Cover Contribution 1).
Consider a connected and weighted planar graph G = (V,E). If G is not con-
nected, then it can be handled by clustering each connected component separately.
Consider also an embedding of G in the Euclidean plane where no two edges cross
each other. In the following discussion, we use G to refer to the planar embedding
of the graph. Clearly, any subgraph of G is also planar.
The edges of G divide the Euclidean plane into closed geometric regions called
faces. The external face is a special face that surrounds the whole graph; the other
faces are internal. A node may belong to multiple faces, while an edge belongs to at
most two faces. A node or edge that belongs to the external face will be called external.
For any node v ∈ G, we denote by depth(v, G) the shortest distance between
v and an external node of G. We also define depth(G) = max_{v∈V} depth(v, G); note
that depth(G) ≥ 0.
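For unweighted graphs, depth(·, G) is exactly a multi-source BFS from the external nodes. The sketch below is mine, not the thesis code; the adjacency-dict representation and the externally supplied set of external nodes (which would come from the planar embedding) are assumptions.

```python
from collections import deque

def depths(adj, external):
    """depth(v, G) for every node: hop distance to the nearest external
    node, via multi-source BFS started from all external nodes at once."""
    dist = {v: 0 for v in external}
    q = deque(external)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def graph_depth(adj, external):
    return max(depths(adj, external).values())   # depth(G)

# 3x3 grid: the eight boundary nodes are external, the center has depth 1.
grid = {(r, c): [(r + dr, c + dc)
                 for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= r + dr <= 2 and 0 <= c + dc <= 2]
        for r in range(3) for c in range(3)}
boundary = [v for v in grid if v != (1, 1)]
assert graph_depth(grid, boundary) == 1
```

This is the quantity that Algorithm Depth-Cover, below, requires to be at most γ.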
2.8.1 Basic Results for Planar Graphs
Here we prove some basic properties of planar graphs that will be used in the
correctness and performance analysis of our algorithms for planar graphs.
For any planar graph G, it holds that the subgraph consisting of the edges
of a face is connected. This observation also holds for the subgraph induced by
the edges of the external face. The intersection of any two graphs G1 and G2 is
denoted G1 ∩ G2 = (V1 ∩ V2, E1 ∩ E2). The following lemma can be easily verified
as a property of all planar graphs.
Lemma 2.8.1 Let G′ be a subgraph of a planar graph G. If v ∈ G ∩ G′, and v is
external in G, then v is external in G′ too.
Consider now a connected planar graph C that consists of two connected
subgraphs A and B that are node-disjoint, and a set of edges Y , which is an edge-
cut between A and B (the removal of Y partitions C into A and B). Further, each
of A and B contain at least one node external to C. Let Y ′ denote the edges of Y
that are external in C.
Lemma 2.8.2 For any two nodes u, v ∈ A ∩ C that are external in C, there exists
a walk w = u, x1, x2, . . . , xk, v, with k ≥ 0, such that xi ∈ A, and each edge of the
walk is external in C.
Proof: Suppose for the sake of contradiction that there exist two nodes u, v ∈ A ∩ C
that are external in C, such that there does not exist a walk w = u, x_1, x_2, …, x_k, v,
with k ≥ 0, such that x_i ∈ A, and each edge of the walk is external in C. Let f_A
be the external face of A, and f_C be the external face of C. Let S be the set of
connected components (we will refer to them as segments) in f_A ∩ f_C (all the nodes
and edges that are external in both A and C). Let s_u ∈ S be the segment that
contains u. Similarly, let s_v ∈ S be the segment that contains v.
We know that in C, there exists a walk of external edges that connects s_u to
s_v. Thus, in A, external edges have been removed (edges from Y′). All removed
edges span from A to B. Let l_u, r_u, l_v, r_v ∈ B and e_{l_u}, e_{r_u}, e_{l_v}, e_{r_v} ∈ Y′ be removed
edges (see Figure 2.5) (note it is possible that r_u = r_v, and that l_u = l_v).
Since B is connected, there exists a walk from l_u to l_v residing entirely in B. This
walk cannot go through A since V(A) ∩ V(B) = ∅, so it can go directly from l_u to
l_v, or all the way around A (see Figure 2.5). If it goes all the way around A, it must
enclose e_{r_u} and e_{r_v}, since this walk cannot include the end nodes of s_u or s_v because
they are in A. Hence, e_{r_u} and e_{r_v} are not in the external face of C, and could not
have been in Y′, a contradiction. Therefore, the walk must go directly from l_u to l_v.
Similarly, since B is connected, there exists a walk from r_u to r_v residing
entirely in B. By symmetry, this walk goes directly from r_u to r_v as well, without
going all the way around A. Once again, we know B is connected, so there must
exist a walk from l_u to r_u residing entirely in B (see Figure 2.5). If the walk goes
directly from l_u to r_u, it must enclose the external segment s_u, a contradiction. So
the walk must go all the way around A, and therefore encloses the external segment
s_v, a contradiction.
Therefore, for any two nodes u, v ∈ A ∩ C that are external in C, there exists
a walk w = u, x_1, x_2, …, x_k, v, with k ≥ 0, such that x_i ∈ A, and each edge of the
walk is external in C.
Lemma 2.8.3 1 ≤ |Y ′| ≤ 2.
Proof: First, we show that |Y′| ≥ 1. Let f_C be the external face of C. Let u ∈ A
and v ∈ B be two external nodes in C. Clearly, u, v ∈ f_C. Since f_C is connected,
Figure 2.5: For Lemma 2.8.2: the figure on the left shows a configuration of removed edges that are external in C and span from A to B (note, if the lemma were not true, B would be disconnected), the figure in the middle demonstrates the walk options from l_u to l_v, and the figure on the right demonstrates the walk options from l_u to r_u.
Figure 2.6: For Case 2 of Lemma 2.8.3: the figure on the left demonstrates a possible setup, and the figure on the right demonstrates one of the two possible path configurations.
there is a path p connecting u and v. Since Y is an edge-cut for A and B, p contains
an edge in Y . Thus, one of the edges of Y is external in C, which implies that
|Y ′| ≥ 1.
We now show that |Y′| ≤ 2. Suppose for the sake of contradiction that |Y′| > 2.
Choose two edges e_1 = (u_1, v_1) and e_2 = (u_2, v_2), where e_1, e_2 ∈ Y′, u_1, u_2 ∈ A,
and v_1, v_2 ∈ B. Let p be the walk from u_1 to u_2 consisting only of edges external
in A and in C. Similarly, let q be the walk from v_1 to v_2 consisting only of edges
external in B and in C. We know these walks exist from Lemma 2.8.2. Construct
a closed walk w using the edges in p ∪ q ∪ {e_1, e_2}.
There are two cases to examine:
Case 1: w is an external face of C.
There exists an external edge e ∈ Y′ such that e ≠ e_1 and e ≠ e_2. w does not
contain e, since e ∉ A and e ∉ B. Therefore, e is not in the external face of C,
a contradiction.
Case 2: w is not an external face of C.
That is, there exists an external edge e = (u_e, v_e), such that e ∈ Y′, u_e ∈ A,
v_e ∈ B, and e is not contained within w. Since u_e ∈ A, there must exist a walk
of external edges p_e from u_e to some node w_a belonging to w within A, such
that E(p_e) ∩ E(w) = ∅, V(p_e) ∩ V(w) = {w_a}, and p_e is the shortest such
walk. Similarly, since v_e ∈ B, there must exist a walk of external edges q_e
from v_e to some node w_b belonging to w within B, such that E(q_e) ∩ E(w) =
∅, V(q_e) ∩ V(w) = {w_b}, and q_e is the shortest such walk (see Figure
2.6). These walks exist from Lemma 2.8.2. Within w, there exist two walks
consisting entirely of external edges from w_a to w_b: one goes through the edge
e_1, and the other through the edge e_2 (from Lemma 2.8.2). Take the shortest
such walks and call them w_1 and w_2 respectively. It is clear that w = w_1 ∪ w_2.
Let w_n be the walk consisting of the walk from w_a to u_e (p_e), the edge e, and
the walk from v_e to w_b (q_e). We now have three walks, w_1, w_2, and w_n, that
connect w_a to w_b. The subpaths belonging to A may have common nodes and
edges, and the subpaths belonging to B may have common nodes and edges.
However, each walk has a unique external edge (w_1 has e_1, w_2 has e_2, and w_n
has e). In any possible configuration, one of these external edges (either e_1 or
e_2) is completely enclosed by the other two walks (see Figure 2.6), and is
therefore not in the external face of C, a contradiction.
Since in both cases we obtained a contradiction, |Y ′| ≤ 2.
Let V_B be the nodes adjacent to the edges in Y′ that are in the graph B.
From Lemma 2.8.3, 1 ≤ |V_B| ≤ 2. Let p_B ∈ B be a shortest path connecting the
nodes in V_B. Let q = v_1, v_2, …, v_k be any path in B with the following properties:
p_B and q do not intersect (they have no nodes in common), and v_1 is adjacent to an
edge in Y.
Figure 2.7: This figure demonstrates the subgraphs and paths described in Lemma 2.8.4.
Lemma 2.8.4 Node v_k belongs to a connected component of B − p_B that does not
contain any external nodes of C.

Proof: Let V_A denote the nodes of A adjacent to Y′. From Lemma 2.8.3, 1 ≤
|V_A| ≤ 2. Let p_A denote a shortest path between the nodes in V_A. The union of the
edges of Y′, p_A, and p_B induces a connected subgraph Ĉ of C. Let W denote the
set of nodes of C that are contained inside the internal faces (if they exist) of Ĉ.
Finally, let D denote the subgraph of C that is induced by the union of the nodes
in W and Ĉ.
Now, we show that all the edges of Y are members of D. Suppose for the sake
of contradiction that there exists some edge e = (u, v), where e ∈ Y, u ∈ A, v ∈ B,
and e ∉ D. Consider first the case where |Y′| = 1, say Y′ = {e′} with e′ = (u′, v′),
u′ ∈ A and v′ ∈ B. We have that p_B = v′. Thus, q intersects p_B, a contradiction.
Consider now the case where |Y′| = 2. Suppose that Y′ = {e_1, e_2}. Since A is connected,
there is a path α ∈ A that connects edge e to a node in p_A; similarly, there is a
path β ∈ B that connects edge e to a node in p_B (see Figure 2.7). This implies that
either e_1 or e_2 is not in the external face of C, a contradiction. Therefore, all the
edges of Y are members of D.
Since v_1 is adjacent to an edge in Y, we have that v_1 ∈ D. Since q does not
intersect p_B, each node of q is a member of D, that is, q ∈ D. Let W_B denote the
nodes of W that are members of B. The nodes of q are actually members of W_B,
since none of the nodes of q are external in D. Since the nodes of W_B are separated
by the path p_B from the remaining nodes of B, in B − p_B the nodes of W_B are in
connected components consisting only of nodes of W_B. These connected components
do not contain any external nodes of C, since W does not contain external nodes of
C. Therefore, v_k will belong to such a connected component in B − p_B.
2.8.2 High Level Description of the Algorithm
At a high level, our cover algorithm breaks up a planar graph G into many
overlapping planar subgraphs called zones, such that: (i) the depth of each zone is
not much greater than γ, (ii) each zone overlaps with a small number of other zones,
and (iii) clustering each zone separately is sufficient to cluster the whole graph. This
way, we can focus on clustering only planar graphs whose depth is not much more
than γ. Thus, our algorithm is divided into two main parts:
• Algorithm Depth-Cover, which clusters graph G with depth(G) ≤ γ, and
• Algorithm Planar-Cover, which clusters arbitrary planar graphs using Depth-
Cover as a subroutine.
We now proceed to describe Algorithms Depth-Cover and Planar-Cover in Sections
2.8.3 and 2.8.5 respectively.
2.8.3 Algorithm Depth-Cover
We now present Algorithm Depth-Cover, which constructs a γ-cover for a pla-
nar graph G where γ ≥ max(depth(G), 1). The resulting cover has radius no more
than 8γ and degree no more than 6. We describe the intuition here, and the
algorithm is formally described in Algorithm 3, which uses Algorithm Subgraph-
Clustering as a subroutine to do most of the work.
Depth-Cover allows us to focus on satisfying only the external nodes in G.
Since depth(G) ≤ γ, if a set of clusters S 2γ-satisfies every external node in the
graph, then S is a γ-cover for G. The reason is that every internal node u is within
a distance of γ from some external node v, and the cluster that contains the 2γ-
neighborhood of v will also contain the γ-neighborhood of u, and will γ-satisfy u.
We now focus on constructing a set of clusters that 2γ-satisfies each external node
of G.
The algorithm begins by selecting an arbitrary external node of G, which is
also trivially a shortest path p in G. Through shortest-path clustering, it constructs
a set of clusters I that 4γ-satisfies p in G, and deletes A, the 2γ-neighborhood of
p in G. Let the resulting connected components in G − A be B = {B_1, B_2, …, B_x}.
By Lemma 2.7.1, the union of 2γ-covers of the B_i components with I results in a
2γ-cover of G. Further, since we are only interested in 2γ-satisfying every external
node of G, we need not further consider any component in B that does not contain
an external node of G. Thus, the algorithm proceeds by recursively clustering every
component in B that contains at least one external node of graph G.
Let B ∈ B be a component with at least one external node of G. The recursive
invocation of the algorithm in B requires the selection of a shortest path p_B ∈ B
(the path is shortest with respect to its end points). The path p_B is selected as
follows. Suppose Y is an edge-cut between A and B (see Figure 2.8.a). Let Y ′ be
the external edges of Y with respect to G. From Lemma 2.8.3, 1 ≤ |Y ′| ≤ 2. Let
V_B be the set of nodes in B that are endpoints of edges in Y′; we have 1 ≤ |V_B| ≤ 2.
Path p_B is selected to be a shortest path in B between nodes in V_B (if V_B has
only one vertex, then p_B consists of a single node). For example, in Figure 2.8.a,
V_{B_1} = {v_2, v_3}.

Lemma 2.8.4 proves that for every node v ∈ I where v ∉ A, it holds that either:
(i) v appears in the 2γ-neighborhood of p_B for one of the connected components
B = B_i, or (ii) v is in a connected component B′ that does not contain any external
nodes of G (for example, see component B′_2 in Figure 2.8.c). In either case, node v
will be removed in the next recursive call, which deletes the 2γ-neighborhood of p_B.
Thus, v participates in at most two shortest-path clusterings (of p and pB) and is
satisfied by at least one of these two clusterings. Since each instance of shortest-path
clustering contributed at most 3 to the degree of v, the total degree of v is bounded
by 6.
It is useful to compare the algorithm for clustering a planar graph with shortest-
path clustering using path separators, as in Section 2.7. When separators are used,
the graph is decomposed into small pieces upon the removal of the separator (which
is a set of shortest paths), and the depth of this recursion is bounded by lg n. How-
ever, a vertex of the graph may be involved in clusters due to lg n such separators.
In the planar graph case, the resulting components Bi are not necessarily much
smaller than G, but the shortest paths are chosen so that the resulting clusters have
little overlap.
Figure 2.8 depicts an example execution of Algorithm Depth-Cover with the
first invocation (Figures 2.8.a and 2.8.b) and the second invocation (Figures 2.8.c
and 2.8.d) of the subroutine Subgraph-Clustering.
Figure 2.8: Execution example of Algorithm Subgraph-Clustering. Panels (a) and (b) show the first invocation, Subgraph-Clustering(G, G, v1, γ); panels (c) and (d) show the second invocation, Subgraph-Clustering(G, B1, pB1, γ).
Algorithm Subgraph-Clustering(G,H, p, γ) is recursive, and parameters G and
γ remain unchanged at each recursive invocation, while H and p change. Parameter
H is the subgraph of G with at least one external node of G, and it is required to
2γ-satisfy all nodes in H that are external nodes of G. Parameter p is a shortest
path in H that will be used for clustering in the current invocation. Initially, H = G
and p = v1, where v1 is an arbitrary external node of G.
Algorithm 3: Depth-Cover(G, γ)
Input: Connected planar graph G; locality parameter γ ≥ max(depth(G), 1);
Output: A γ-cover for G;
1: Let v be an external node of G;
2: Z ← Subgraph-Clustering(G, G, v, γ);
3: return Z;
2.8.4 Analysis
We continue by proving Theorem 2.8.1, which bounds the radius and degree of the resulting covers from Algorithm Depth-Cover. Similar to the analysis of
Algorithm Separator-Cover, it is convenient to represent the execution of Algorithm
Depth-Cover as a tree T , where each node in T corresponds to some invocation of the
Algorithm 4: Subgraph-Clustering(G, H, p, γ)
Input: Connected planar graph G; connected subgraph H of G (consisting of vertices that are still unsatisfied); shortest path p ∈ H whose end nodes are external in H; locality parameter γ ≥ max(depth(G), 1);
1: I ← Shortest-Path-Cluster(H, p, 4γ);
2: A ← N2γ(p, H); H′ ← H − A;
3: J ← ∅;
4: foreach connected component B of H′ that contains at least one external node of G do
5:   Let Y be the edge-cut between A and B in subgraph H;
6:   Let Y′ ⊆ Y be the external edges of Y in subgraph H;
7:   Let VB be the nodes of B adjacent to the edges of Y′;
8:   Let pB be a shortest path in B that connects all the nodes in VB;
9:   J ← J ∪ Subgraph-Clustering(G, B, pB, γ);
10: return I ∪ J;
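To make the recursive control flow of Algorithms 3 and 4 concrete, the following Python sketch (our own rendering, not the thesis implementation) runs the recursion on an unweighted adjacency-list graph. Shortest-Path-Cluster (line 1) is simplified to a single cluster, the 4γ-ball of the path in H, and the shortest-path selection of lines 5-8 is reduced to a one-node path of B bordering A; all function names are ours.

```python
from collections import deque

def neighborhood(adj, sources, radius, allowed):
    """BFS ball: nodes of `allowed` within `radius` hops of `sources`,
    using only paths inside `allowed` (unit edge weights assumed)."""
    dist = {s: 0 for s in sources if s in allowed}
    q = deque(dist)
    while q:
        u = q.popleft()
        if dist[u] == radius:
            continue
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

def subgraph_clustering(adj, h_nodes, path, gamma, external):
    """Recursive skeleton of Algorithm 4. Shortest-Path-Cluster is
    simplified to one cluster: the 4γ-ball of the path in H."""
    I = [neighborhood(adj, path, 4 * gamma, h_nodes)]    # line 1 (simplified)
    A = neighborhood(adj, path, 2 * gamma, h_nodes)      # line 2: A = N_2γ(p, H)
    J = []
    for B in components(adj, h_nodes - A):               # line 4
        if not (B & external):                           # discard B without external nodes
            continue
        # lines 5-8, simplified: use one node of B bordering A as pB
        pB = [next(v for v in B if any(u in A for u in adj[v]))]
        J += subgraph_clustering(adj, B, pB, gamma, external)  # line 9
    return I + J                                         # line 10
```

On a path graph with every node external, for example, the recursion repeatedly peels off 2γ-neighborhoods until the graph is exhausted, and the returned clusters together cover all nodes.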
subroutine Subgraph-Clustering. The root r of T corresponds to the first invocation
with parameters (G,G, v, γ). Suppose, for example, that in the first invocation the
removal of A creates two components H1 and H2 in G, for which the algorithm is
invoked recursively with parameters (G,H1, p1, γ) and (G,H2, p2, γ). Then, these
two invocations will correspond in T to the two children of the root. The leaf nodes
correspond to subgraphs Hi that cannot be decomposed further. Suppose that node
w ∈ T corresponds to invocation (G,H, p, γ). We will denote by H(w) the respective
input graph H, and we will use a similar notation to denote the remaining param-
eters and variables used in this invocation; for example, p(w) is the input shortest
path while A(w) is the respective 2γ-neighborhood of p(w) in H(w). As another
example, using this notation, the resulting set of clusters is Z = ⋃_{w∈T} I(w).
Lemma 2.8.5 For any node v ∈ G, there is a node w ∈ T such that Nγ(v,G) =
Nγ(v, H(w)) and v ∈ Nγ(A(w), H(w)).
Proof: By the construction of T , there is a path s = w1, w2, . . . , wk, such that:
s ∈ T , k ≥ 1, v ∈ H(wi) for 1 ≤ i ≤ k, w1 = r (the root of T ), wi is the parent of
wi+1 for 1 ≤ i ≤ k − 1, and wk does not have any child w′ with v ∈ H(w′).
By the construction of T and s, H(wi+1) ⊆ H(wi) for 1 ≤ i ≤ k − 1. Since
H(w1) = H(r) = G, Nγ(v, G) = Nγ(v, H(w1)). Let s′ = w1, w2, . . . , wk′, where 1 ≤ k′ ≤ k, be the longest subpath of s with the property that Nγ(v, G) = Nγ(v, H(wi))
for 1 ≤ i ≤ k′.
We examine two cases:
Case 1: k′ < k
It holds that v ∈ H(wk′), v ∈ H(wk′+1), Nγ(v, G) = Nγ(v, H(wk′)), and
Nγ(v, G) ≠ Nγ(v, H(wk′+1)). According to Algorithm Subgraph-Clustering, v
belongs to a connected component B of H ′(wk′), such that B contains an
external node of G. Note that B = H(wk′+1) and H ′(wk′) = H(wk′)−A(wk′).
Clearly, v ∉ A(wk′), or else k = k′. Since the γ-neighborhood of v changes
between H(wk′) and B = H(wk′+1), some node u ∈ Nγ(v, H(wk′)) must be
a member of A(wk′) (note that only the nodes of A(wk′) are removed from
H(wk′)). Thus, v ∈ Nγ(A(wk′), H(wk′)). Therefore, wk′ is the desired node of
T .
Case 2: k′ = k
In this case, it holds that v ∈ H(wk), no child w′ of wk has v ∈ H(w′), and
Nγ(v, G) = Nγ(v, H(wk)). According to Algorithm Subgraph-Clustering, there
are two possible scenarios:
Case 2.1: v ∈ A(wk)
This case trivially implies that v ∈ Nγ(A(wk), H(wk)). Thus, wk is the
desired node of T .
Case 2.2: v ∉ A(wk)
In this case, it holds that v belongs to a connected component X of
H ′(wk) = H(wk)−A(wk), such that X does not contain any external node
of G. Since depth(G) ≤ γ, there is a node x ∈ G that is external in G
and x ∈ Nγ(v, G). Since X does not contain any external node of G, x ∉ Nγ(v, X). Therefore, Nγ(v, X) ≠ Nγ(v, G) = Nγ(v, H(wk)). Thus, the
γ-neighborhood of v changes between H(wk) and X. Hence, some node
u ∈ Nγ(v,H(wk)) is also a member of A(wk) (note that only the nodes of
A(wk) are removed from H(wk)), which implies v ∈ Nγ(A(wk), H(wk)).
Therefore, wk is the desired node of T .
Consequently, wk′ is the desired node of T in all cases.
Lemma 2.8.6 Z is a γ-cover for G.
Proof: From Lemma 2.8.5, for each node v ∈ G there is a node w ∈ T such that
Nγ(v, G) = Nγ(v,H(w)) and v ∈ Nγ(A(w), H(w)). By Lemma 2.6.1, p(w) is 4γ-
satisfied by I(w) in H(w). Since A(w) = N2γ(p(w), H(w)), A(w) is 2γ-satisfied by
I(w) in H(w), which implies that v is γ-satisfied by I(w) in H(w). Since Nγ(v, G) =
Nγ(v, H(w)), I(w) also γ-satisfies v in G. Since Z = ⋃_{w∈T} I(w), Z is a γ-cover for G.
Lemma 2.8.7 rad(Z) ≤ 8γ.
Proof: We have that Z = ⋃_{w∈T} I(w), where each I(w) is obtained by an invocation
of Algorithm Shortest-Path-Cluster, with parameter β = 4γ. Therefore, by Lemma
2.6.1, for any w ∈ T , rad(I(w)) ≤ 2β = 8γ, which implies that rad(Z) ≤ 8γ.
Lemma 2.8.8 deg(Z) ≤ 6.
Proof: Consider an arbitrary node v ∈ G. We only need to show that deg(v, Z) ≤ 6. Let s = w1, w2, . . . , wk be the path in T as described in Lemma 2.8.5. According
to Algorithm Subgraph-Clustering, the only possible clusters that v can participate
in are I(w1), I(w2), . . . , I(wk). Let i denote the smallest index such that v ∈ I(wi).
We will show that i ∈ {k − 1, k}. We examine two cases:
Case 1: v ∈ A(wi)
In this case, v will be removed with A(wi), and therefore, v will not appear in any
child of wi. Consequently, wi = wk, hence, i = k.
Case 2: v ∉ A(wi)
In this case, v is a member of a connected component B of H ′(wi) = H(wi)−A(wi).
There are two subcases:
Case 2.1: B does not contain any external node of G
In this case, B is discarded, and therefore, v will not appear in any child of
wi. Consequently, wi = wk, hence, i = k.
Case 2.2: B contains an external node of G
If wi = wk, the situation is similar as above, with i = k. So suppose
that i < k. According to Algorithm Subgraph-Clustering, B = H(wi+1).
We will show that v ∈ A(wi+1), which implies that wi+1 = wk (the reason is similar to the case where v ∈ A(wi) above). Since v ∈ I(wi), v ∈ N4γ(p, H(wi)) = N2γ(A(wi), H(wi)). Thus, there is a node u ∈ A(wi) such
that v ∈ N2γ(u,H(wi)). Let g = u, x1, x2, . . . , xk, v be a shortest path be-
tween u and v in H(wi). Clearly, length(g) ≤ 2γ. Since u ∈ A(wi) and v
is a member of a connected component B of H ′(wi) = H(wi) − A(wi) with
an external node of G, the path g must contain an edge of Y (or else H(wi)
is disconnected). Choose the node xy such that xy ∈ g, xy ∈ B, and xy is
adjacent to some edge of Y . Now, let g′ = xy, xy+1, . . . , xk, v be a subpath of
g in B. Clearly, length(g′) ≤ 2γ as well.
Case 2.2.1: pB and g′ intersect
Then v ∈ N2γ(pB, B) = N2γ(pB, H(wi+1)). Thus, v ∈ A(wi+1). There-
fore, wi+1 = wk, which implies that i = k − 1.
Case 2.2.2: pB and g′ do not intersect
By Lemma 2.8.4, in B−pB, node v belongs to a connected component B′
that has no external nodes of C. Since C is a subgraph of G, Lemma 2.8.1
implies that B′ has no external nodes of G either. Thus, B′ is discarded
at the recursive invocation of the algorithm that corresponds to the node
wi+1. Consequently, wk = wi+1, which implies that i = k − 1.
Consequently, i ∈ {k − 1, k}. Thus, the only clusters that v could possibly
belong to are I(wk−1) and I(wk). Since for each x ∈ T , I(x) is the result of an
invocation of Algorithm Shortest-Path-Cluster, from Lemma 2.6.1, deg(I(x)) ≤ 3.
Therefore, deg(v, Z) ≤ deg(I(wk−1)) + deg(I(wk)) ≤ 6.
It is easy to verify that Algorithm Depth-Cover computes the cover Z in poly-
nomial time with respect to the size of G. Therefore, the main result in this section
follows from Lemmas 2.8.6, 2.8.7, and 2.8.8.
Theorem 2.8.1 For any connected planar graph G and γ ≥ max(depth(G), 1),
Algorithm Depth-Cover returns in polynomial time a γ-cover Z with rad(Z) ≤ 8γ
and deg(Z) ≤ 6.
2.8.5 General Planar Cover
We now describe the main algorithm, Algorithm Planar-Cover, which given a
planar graph G, constructs a γ-cover with radius O(γ) and degree O(1), for any
γ ≥ 1. In the algorithm, we do the following. If γ ≥ depth(G), then we invoke
Algorithm Depth-Cover(G, γ). However, if γ < depth(G), we first divide G into
zones, and then cluster each zone with Algorithm Depth-Cover. The union of the
zone clusters gives the resulting cover for G.
Algorithm 5: Planar-Cover(G, γ)
Input: Connected planar graph G; locality parameter γ ≥ 1;
Output: A γ-cover for G;
1: Z ← ∅;
2: if γ ≥ depth(G) then
3:   Z ← Depth-Cover(G, γ);
4: else
5:   Introduce artificial nodes to G;
6:   Let S1, S2, . . . , Sκ be the 3γ-zones of G, where κ = ⌈(depth(G) + 1)/γ⌉;
7:   foreach connected component S of each Si do
8:     Z ← Z ∪ Depth-Cover(S, 3γ − 1);
9:   Remove artificial nodes from Z;
10: return Z;
We now describe how to construct the zones. Suppose that γ < depth(G).
For all edges e ∈ G such that ω(e) > 1, place artificial nodes along e as needed,
reducing ω(e) by 1 each time, until all edges have weight 1 (we are simulating an
unweighted graph). Clearly, artificial nodes do not alter the planarity of the graph.
These nodes can later be removed from all clusters without affecting the cover, since
they do not alter the actual nodes in any neighborhood. Next, we will divide the
graph into bands, Wj = {v ∈ G : jγ ≤ depth(v, G) < (j + 1)γ}, for j ≥ 0. Our main goal is to γ-satisfy the nodes in each band Wi. However, in
and Wi+1. For this reason, we form the 3γ-zone Si consisting of bands Wi−1, Wi,
and Wi+1 (in particular, Si = G(Wi−1 ∪ Wi ∪ Wi+1), where W0 = Wκ+1 = ∅). Si
contains the whole γ-neighborhood of Wi.
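The band-and-zone construction translates directly from these definitions. The sketch below (our own rendering) computes depth(v, G) by multi-source BFS from the external nodes, assuming unit edge weights, i.e. after the artificial-node transformation; zones are indexed from 0 here for simplicity, rather than 1 as in the text.

```python
from collections import deque

def depth_of_nodes(adj, external):
    """depth(v, G): hop distance from v to the nearest external node
    (multi-source BFS over an unweighted adjacency-list graph)."""
    dist = {v: 0 for v in external}
    q = deque(external)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def zones(adj, external, gamma):
    """Bands W_j = {v : j*gamma <= depth(v, G) < (j+1)*gamma} and
    3γ-zones S_i = W_{i-1} ∪ W_i ∪ W_{i+1} (missing bands are empty)."""
    depth = depth_of_nodes(adj, external)
    bands = {}
    for v, d in depth.items():
        bands.setdefault(d // gamma, set()).add(v)
    kmax = max(bands)
    return [bands.get(i - 1, set()) | bands.get(i, set()) | bands.get(i + 1, set())
            for i in range(kmax + 1)]
```

Since a node of band Wj appears only in zones Sj−1, Sj, and Sj+1, every node participates in at most three zones, which is the observation used later for the degree bound.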
Lemma 2.8.9 For γ < depth(G), it holds that: (i) depth(Si) ≤ 3γ − 1; (ii)
Nγ(Wi, G) = Nγ(Wi, Si).
Proof: Consider a zone Si = G(Wi−1 ∪ Wi ∪ Wi+1). We first prove property (i).
Consider the outermost nodes of Wi−1 to be external. Consider a generic node
u ∈ Wi+1. Since all edges are of weight 1, u must be within γ of some node v ∈ Wi,
or else it is in the wrong depth band. Similarly, any node in Wi is within γ of some
node in Wi−1, which is less than γ from some external node in Wi−1. Thus, any
node in Si is within 3γ − 1 of an external node, therefore depth(Si) ≤ 3γ − 1.
For property (ii), suppose that u ∈ Wi and v ∈ Nγ(Wi, G). We will show that
v ∈ Si. By the construction of Wi, we know that iγ ≤ depth(u,G) < (i + 1)γ.
Suppose for the sake of contradiction that v 6∈ Si. Thus, either depth(v, G) <
(i− 1)γ, or depth(v, G) ≥ (i + 2)γ.
Case 1: depth(v, G) < (i− 1)γ
Since v ∈ Nγ(Wi, G), depth(u,G) ≤ depth(v, G) + γ < (i − 1)γ + γ. Thus,
depth(u,G) < iγ, a contradiction.
Case 2: depth(v, G) ≥ (i + 2)γ
Since v ∈ Nγ(Wi, G), depth(v,G)−γ ≤ depth(u,G) < (i+1)γ. Thus, depth(v, G) <
(i + 2)γ, a contradiction.
Therefore, v ∈ Si, proving that Nγ(Wi, G) = Nγ(Wi, Si).
In this way, we have reduced the problem of satisfying band Wi to the problem
of producing a cover for zone Si, which can be solved with Algorithm Depth-Cover.
As proved in Lemma 2.8.9, each zone Si satisfies depth(Si) ≤ 3γ − 1. We invoke Algorithm Depth-Cover(Si, 3γ − 1) with locality parameter 3γ − 1,
since in Algorithm Depth-Cover the locality parameter has to be at least as much
as the depth of the input graph. The resulting cover for G is the union of all the
covers for the zones.
Using Theorem 2.8.1 and the observation that every node participates in at
most three zones, we obtain the main result for planar graphs.
Theorem 2.8.2 For any connected planar graph G and parameter γ ≥ 1, Algorithm
Planar-Cover returns in polynomial time a γ-cover Z with rad(Z) ≤ 24γ − 8 and
deg(Z) ≤ 18.
2.9 Cover for Unit Disk Graphs
Unit disk graphs are often used to model wireless network topologies. In a unit
disk graph G, there exists an edge between two vertices u, v ∈ G if and only if the
Euclidean distance between u and v is at most 1. That is, each node u is surrounded
by a disk of radius 1, and has a link to all other nodes that appear within the disk.
In a multi-hop radio network, u can communicate directly to these nodes, and only
these nodes.
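The adjacency rule for unit disk graphs translates directly into code. The following sketch builds the graph by brute-force pairwise distance checks, which is O(n²) but adequate for illustration:

```python
import math

def unit_disk_graph(points):
    """Build the unit disk graph on 2-D points: an edge (u, v) exists
    iff the Euclidean distance between u and v is at most 1."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= 1.0:
                adj[i].append(j)
                adj[j].append(i)
    return adj
```

For example, nodes at (0, 0) and (0.5, 0) are connected, while a node at (2, 0) is isolated from both.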
Using Algorithm Planar-Cover, we can construct optimal covers for unit disk
graphs (Sparse Cover Contribution 3). Consider a connected unit disk graph G, and
a spanner G′ ⊆ G such that G′ is planar and there is a positive constant t such that
for any two nodes u and v, distG′(u, v) ≤ t ·distG(u, v) (such a spanner exists for all
unit disk graphs [54]). Since G′ is a connected planar graph, Algorithm Planar-Cover returns a γ-cover Z for G′ with rad(Z) ≤ 24γ − 8 and deg(Z) ≤ 18 (Theorem 2.8.2). Consider calling Planar-Cover(G′, γt). Clearly, Z is then a γt-cover for G′ with rad(Z) ≤ 24γt − 8 and deg(Z) ≤ 18.
Theorem 2.9.1 Z is a γ-cover for G.
Proof: Let v ∈ G′, so it is also true that v ∈ G, since G′ ⊆ G. Since Z is a γt-cover
for G′, there exists a cluster C that γt-satisfies v in G′. That is, ∀u ∈ Nγt(v,G′),
u ∈ C. Since distG′(u, v) ≤ t · distG(u, v), ∀u ∈ Nγ(v, G), u ∈ Nγt(v,G′), thus
u ∈ C. Therefore, Z is a γ-cover for G.
2.10 Summary
In this chapter, we have shown how improved sparse covers can be used to
construct better distributed directories, used for locating mobile data/objects in
wireless sensor networks. We have provided a structural lower bound for sparse
covers of arbitrary graphs, and improved construction algorithms for special well-
studied types of graphs.
We show that using a simple centralized directory is a poor solution to the
problem because it is not locality-sensitive. A better solution uses a distributed di-
rectory, where data/objects do not have a static home. This allows queries to be
answered quickly regardless of the whereabouts of the querying and storing nodes.
This is done through the use of efficient find and move operations. A sparse cover
is the underlying data structure from which a distributed directory is built. Specifi-
cally, a hierarchy of increasing-radius covers is used to construct regional matchings,
which contain read and write sets for all network nodes (refer to Section 2.4 for
formal definitions). As a directory contains only two operations (find and move), its
performance is measured by the Stretchfind and Stretchmove, which are determined
by the structural quality (radius and degree) of the sparse covers used to construct
it.
We first proved a structural lower bound for sparse covers of arbitrary graphs
in Section 2.5. Specifically, there exists a network with n nodes, and constrained
by the locality parameter γ and the maximum tolerable degree c, such that when
clustered, there must exist a cluster whose radius is Ω(γ log logc n), regardless of
the clustering technique (see Theorem 2.5.2). This proves that for arbitrary graphs,
there is an inherent tradeoff in the radius and degree, and these metrics cannot be
simultaneously optimized. The best known construction algorithm for these graphs
can achieve a radius of O(γ log n) and a degree of O(log n) [34], which translates into
a distributed directory with Stretchfind = O(log² n) and Stretchmove = O(log² n).
In light of the above tradeoff, we studied construction techniques for special types
of graphs including planar, unit disk, and H-minor free graphs.
In Section 2.7, we presented an algorithm for clustering κ-path separable
graphs that achieves a radius of O(γ) and degree of O(log n). This translates into
a distributed directory with Stretchfind = O(log n) and Stretchmove = O(log n), a
savings of a logarithmic term in each metric. In Section 2.8, we presented an opti-
mal algorithm for clustering planar graphs that achieves a radius of O(γ) and degree
of O(1). This translates into a distributed directory with Stretchfind = O(1) and
Stretchmove = O(log n), a savings of log² n in Stretchfind and log n in Stretchmove.
Finally, in Section 2.9, we showed how our planar algorithm can be used to con-
struct optimal covers for unit disk graphs (and other graphs with constant-stretch
planar spanners) with a radius of O(γ) and degree of O(1), once again saving log² n
in Stretchfind and log n in Stretchmove, for the distributed directory operations.
Our work has immediate implications on the efficiency of other important data
structures used to solve fundamental distributed problems such as the construction
of compact routing schemes and synchronizers.
CHAPTER 3
Information Retrieval: P2P Content Delivery
3.1 Introduction
P2P file-sharing is the “killer application” for the consumer broadband Inter-
net. CacheLogic’s [61] monitoring of tier 1 and 2 Internet service providers (ISPs)
in June 2004 reported that between 50% and 80% of all traffic was attributable to P2P
file-sharing. In 2005, those numbers appear to have held steady at 60% of
all network traffic on the reporting consumer-oriented ISPs [61]. At any given time,
over 8 million users are sharing 10 Petabytes of data using major P2P networks.
This accounts for nearly 10% of the broadband connections worldwide, and this
trend is expected to grow [100].
Much of the data being exchanged on P2P networks consists of large video files.
For example, a typical DIVX format movie is 700 MB, a complete single-layer DVD
movie can be more than 4 GB, and the latest high definition movies may require
10 GB or more. With high definition movies (HD-DVD and Blu-Ray formats) just
entering the home theater market, one can expect downloadable content sizes to
grow by a factor of 100 to 1,000, thus pushing the network traffic loads to much
higher levels.
So then, one might ask, what is the driving force behind these trends? In
addition to the attraction to “free” and/or “pirated” content, a key driving force
is the content distribution economics. From both a content publisher’s as well as
content consumer’s point of view, P2P makes good economic sense, especially in the
context of the flash crowd effect. Here, a single piece of data, such as a new online
movie release, is so popular, that the number of people attempting to download it
will overload the capacity of the most powerful single site web server. However,
in a current generation P2P network such as BitTorrent, a single low-bandwidth
host will seed content to a massive swarm of peers. The hosts within the swarm
will then disseminate parts of the content to each other in a peer exchange fashion.
This is the heart of how a BitTorrent swarm operates. As one peer is obtaining
new content, it is simultaneously sharing its content with other peers. Unlike the
client-server approach, as the swarm grows, the aggregate network bandwidth of
the swarm grows. Thus, from the view point of each node, the data rates are much
faster, there is no denial of service on the part of the content source, and the content
source provider’s computation and network load remain relatively low.
3.1.1 Users Happy, ISPs Not
There appears to be a wrinkle in this nirvana of efficient data exchange.
Consumer-oriented ISPs are not pleased with how their networks are being used
by these peering overlay networks. The cost to them is prohibitive, on the order of US$1 billion [61], and the ISPs are not making any additional revenue
from these network intensive applications. ISPs have begun to use packet-shaping
technology to throttle the delivery of P2P data, ultimately reducing the load on their
networks. In effect, current ISP networks were never provisioned for P2P overlay
protocols. So if you ask, "Is P2P good for the Internet?", the answer depends greatly on whom you ask.
Based on the above motivations, the grand goal of our research is to better
understand the real impact P2P overlay software has on Internet network resources
from the distributor, ISP, and end user point of views. In particular, we focus our
research on the BitTorrent protocol. BitTorrent has been one of the most popular
P2P file-sharing technologies, with a number of different client implementations
[74] and an estimated user population on the order of 60 million [102]. In 2004,
BitTorrent traffic single-handedly accounted for 50% of all Internet traffic on U.S.
cable networks [100]. More recently, the usage of BitTorrent has waned to 18%
due to content owners shutting down illegal tracker servers because of copyright
infringements [97]. We believe that the centralized tracker gives “BitTorrent-like”
applications great promise for the legal distribution of legitimate content.
3.2 Contributions
We model the BitTorrent protocol in full detail based on the mainline client source code [91], using our Internet topology model. Our contributions are the following:
1. A memory efficient model of the BitTorrent protocol built on the ROSS discrete-
event simulation system [88, 89]. The memory consumed by a single BitTor-
rent client can be upwards of 70 MB. The memory consumed by a client in
our model is between 67 KB and 2.3 MB (see Section 3.5).
2. A slice-level data model that ensures protocol accuracy while avoiding the
event explosion problem characteristic of typical packet-level models, such
as employed with NS [70]. As a result, we achieve tremendous sequential
processor speedups (up to 180 times) (see Sections 3.5 and 3.6).
3. A realistic Internet topology model that preserves geographic market rela-
tionships, is massively scalable, and accurately models the in-home consumer
broadband Internet (see Section 3.6).
4. Validation of our BitTorrent model against instrumented BitTorrent opera-
tional software as well as previous measurement studies (see Section 3.7.1).
5. Model performance results and analysis for a large number of BitTorrent
swarm scenarios (see Section 3.7.2).
6. Analysis of techniques for streaming content using BitTorrent. We show ac-
ceptable quality of service (QoS) can be achieved when only a small fraction of
a BitTorrent swarm is streaming. Further, we show how the use of BitTorrent
along with a CDN can significantly reduce transit costs while providing an
excellent QoS (see Section 3.8).
Our advancements have allowed us to study large-scale swarms that have been
previously computationally infeasible to simulate. Through this ongoing investiga-
tion, we hope to gain insights that will enable better P2P systems that are considered
both fair and efficient by not only the users and distributors, but the ISPs as well.
We now present related work in the area of BitTorrent studies. In Section 3.4,
we give an overview of the BitTorrent protocol. We discuss our simulator and
topology model in Sections 3.5 and 3.6 respectively. We present our model validation
and some experimental results in Section 3.7. In Section 3.8, we analyze the QoS and
transit savings for different streaming modifications of BitTorrent. We summarize
the chapter in Section 3.9.
3.3 Related Work
The current approaches to studying this specific protocol are either through
direct measurement of operational BitTorrent “swarms” during a file-sharing session
[66, 105], or by real experimentation on a physical closed network, such as PlanetLab
[103]. The problem with using PlanetLab as a P2P testbed is that the usage policies can limit our ability to explore network behaviors under extreme conditions. That is, an application cannot interfere with other participants in the research network [104], and PlanetLab itself lacks the resources needed to examine swarm behaviors at the scale we
would like to investigate. Additionally, real P2P Internet measurement studies are
either limited in terms of data that they are able to collect because of network
blocking issues related to network address translations (NATs) as in the case of
[105], or limited to active “torrents” as in the case of [66]. Another technique, not
necessarily specific to BitTorrent, is the use of complex queuing network models,
such as [95].
While both measurement and queuing network models are highly valuable
analytic tools, neither allow precise control over the configuration for the system
under test, which is necessary to understand the cause and effect relationships among
all aspects of a specific protocol like BitTorrent. For this level of understanding, a
detailed simulation model is required. However, the reality of any simulation is that, by definition, it is a "falsehood" from which we are trying to extract some "truths".
Thus, the modeler must take extreme care in determining factors that can and
cannot be ignored. In the case of BitTorrent, there have been some attempts by
Microsoft to model the protocol in detail [86, 96]. However, these models have been
dismissed by the creator of BitTorrent, Bram Cohen, as not accurately modeling
the true “tit-for-tat”, non-cooperative gaming nature of the protocol as well as other
aspects [99].
A third approach is direct emulation of the operational BitTorrent software
such as done by [87]. Here, the peer code is "fork-lifted" from the original implementation. Results are presented using only 700 peers (cable users). It is unclear
which parts of the BitTorrent implementation were left intact, so comparisons be-
tween this approach and ours in terms of memory and computational efficiency are
not possible.
3.4 The BitTorrent Protocol
The BitTorrent protocol creates a virtual P2P overlay network using five major
components: (i) a torrent file, (ii) a web site, (iii) a tracker server, (iv) client seeders,
and (v) client leechers.
A torrent file is composed of a header plus a number of SHA-1 block hashes of
the original file, where each block or piece of the file is a 256 KB chunk of the whole
file. These chunks are further broken down into 16 KB sub-chunks called slices. The
header information denotes the IP address or URL of the tracker for this torrent file.
Once created, the torrent file is then stored on a publicly accessible web site, from
which anyone can download. Next, the original content owner/distributor will start
a BitTorrent client that already has a complete copy of the file along with a copy
of the torrent file. The torrent file is read, and because this BitTorrent client has
a complete copy of the file, it registers itself with the tracker as a seeder. A client
without a complete copy of the file registers itself as a leecher. Upon registering,
the tracker will provide a leecher with a randomly generated list of peers. Because
of the size of the peer-set and the random peer selection, the probability of creating
an isolated clique in the overlay network graph is extremely low, which ensures
robust network routes for piece distribution. The downside to this approach is
that topological locality is completely ignored, resulting in much higher network
utilization (i.e. more network hops and consumption of more link bandwidth).
Thus, the protocol trades locality for robustness.
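The piece and slice arithmetic of the torrent file is easy to make concrete. The following sketch computes how many SHA-1 piece hashes the torrent file carries, and how many 16 KB slices are transferred, for a given file size:

```python
import math

PIECE = 256 * 1024   # 256 KB pieces, each hashed with SHA-1 in the torrent file
SLICE = 16 * 1024    # 16 KB slices, the unit of REQUEST/PIECE transfer

def torrent_layout(file_size_bytes):
    """Number of pieces and slices for a file of the given size
    (the last piece/slice may be shorter, hence the ceilings)."""
    pieces = math.ceil(file_size_bytes / PIECE)
    slices = math.ceil(file_size_bytes / SLICE)
    return pieces, slices

# e.g. a typical 700 MB DivX movie: 2,800 pieces, 44,800 slices
pieces, slices = torrent_layout(700 * 1024 * 1024)
```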
The seeder and other leechers will begin to transfer pieces of the file amongst
themselves using a complex, non-cooperative, tit-for-tat algorithm. After a piece is
downloaded, the BitTorrent client will validate that piece against the SHA-1 hash
value for that piece. Again, the hash for that piece is contained in the torrent file.
When a piece is validated, the client is able to share it with other peers who have
not yet obtained it. Pieces within a peer-set are exchanged using a rarest piece first
policy, which is used exclusively after the first few randomly selected pieces have
been obtained (typically four pieces, but this is a configuration parameter). Because
each peer announces to all peers in its peer-set every piece it obtains (via a HAVE
message), all peers are able to keep copy counts on each piece and determine within
their peer-set which piece or pieces are rarest (i.e. lowest copy count). When a
leecher has obtained all pieces of the file, it then switches to being a pure seeder of
the content. At any point during the piece/file exchange process, clients may join
or leave the swarm (peering network). Because of the highly volatile nature of these
swarms, a peer will re-request an updated list of peers from the tracker periodically
(typically every 300 seconds). This ensures the survival of the swarm, assuming the
tracker remains operational.
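The rarest-piece-first policy, with its initial random phase, can be sketched as follows. This is our own rendering: `random_threshold` stands in for the configuration parameter mentioned above (typically four pieces), and the bitfield-based data layout is an illustration, not the client's actual structures.

```python
import random
from collections import Counter

def pick_piece(have, peer_bitfields, pieces_done, random_threshold=4):
    """Rarest-first piece selection. Until `random_threshold` pieces are
    complete, pick a random needed piece; afterwards pick a needed piece
    with the lowest copy count within the peer-set (copy counts are kept
    from BITFIELD and HAVE messages)."""
    counts = Counter()
    for bitfield in peer_bitfields:        # copy counts within the peer-set
        counts.update(i for i, b in enumerate(bitfield) if b)
    needed = [i for i in counts if not have[i]]
    if not needed:
        return None
    if pieces_done < random_threshold:     # random-first phase
        return random.choice(needed)
    rarest = min(counts[i] for i in needed)
    return random.choice([i for i in needed if counts[i] == rarest])
```

With copy counts of 2, 3, and 1 for the three pieces a peer still needs, for instance, the picker requests the piece held by only one peer.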
More recently, BitTorrent has added a distributed hash table (DHT) based
tracker mechanism. This approach increases swarm robustness even in the face of
tracker failures. However, DHTs are beyond the scope of our current investigation.
3.4.1 Message Protocol
The BitTorrent message protocol consists of 11 distinct messages as of version
4.4.0, with additional messages being added to the new 4.9 version. All intra-peer
messages are sent using TCP, whereas peer-tracker messages are sent using HTTP.
Once a peer has obtained its initial peer-set from the tracker, it will initiate a
HANDSHAKE message to 40 peers by default. The upper bound on the number of
peer connections is 80. Thus, each peer keeps a number of connection slots available
for peers who are not in its immediate peer-set. This reduces the probability that
a clique will be created. The connections are maintained by periodically sending
KEEP ALIVE messages.
Once two-way handshaking between peers is complete, each peer will send the
other a BITFIELD message that contains an encoding of the pieces the peer has.
If a peer has no pieces, no BITFIELD message is sent. Upon getting a BITFIELD
message, a peer will determine if the remote peer has pieces it needs, if so, it will
schedule an INTERESTED message. The remote peer will process the INTER-
ESTED message by invoking its choker algorithm, which is described next. The
output from the remote peer’s choker (upload side) is an UNCHOKE or CHOKE
message. The response to an INTERESTED message is typically nothing or an UN-
CHOKE message. Once the peer receives an UNCHOKE message, the piece-picker
algorithm (described below) is invoked, and a REQUEST message will be generated
for a piece and 16 KB offset within that piece. The remote peer will respond with
a PIECE message containing the 16 KB chunk of data. This response will in turn
result in additional REQUESTS being sent.
When all 16 KB chunks within a piece have been obtained, the peer will send
a HAVE message to all other peers to which it is connected. With receipt of the
HAVE message, a remote peer may decide to schedule an INTERESTED message for
that peer, which results in an UNCHOKE message, and then REQUEST and PIECE
messages being exchanged. Thus, the protocol ensures continued downloading of
data among all connected peers. Should a peer have completely downloaded all
content available at a remote peer, it will send a NOT INTERESTED message.
The remote peer will then schedule a CHOKE message if the peer was currently in
the unchoked state. Likewise, the remote peer will periodically choke and unchoke
peers via the choker algorithm. Lastly, when a peer has made a request for all pieces
of content, it will enter endgame mode. Here, requests to multiple peers for the same
piece can occur. Thus, a peer will send a CANCEL message for that piece to other
peers once one has responded with the requested 16 KB chunk.
In order to reduce the complexity of our model, we do not include either KEEP
ALIVE or CANCEL messages. In the case of CANCEL messages, they are very few
and do not impact the overall swarm dynamics [66].
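The message flow above can be sketched as a minimal exchange between two peers. The class and method names below are our own illustrative placeholders, not identifiers from the model's source; the sketch assumes a single seeder and omits choking for brevity:

```python
# Minimal sketch of the BITFIELD -> INTERESTED -> UNCHOKE -> REQUEST flow
# described above. All names are illustrative, not from the model's source.

PIECE_SIZE = 256 * 1024   # pieces are 256 KB
CHUNK_SIZE = 16 * 1024    # REQUESTs address 16 KB chunks within a piece

class Peer:
    def __init__(self, name, have=None):
        self.name = name
        self.bitfield = set(have or [])   # indices of completed pieces

    def on_bitfield(self, remote):
        """After handshaking, decide whether the remote peer is interesting."""
        needed = remote.bitfield - self.bitfield
        return "INTERESTED" if needed else None

    def on_unchoke(self, remote):
        """Once unchoked, request a 16 KB chunk of a piece the remote has."""
        needed = sorted(remote.bitfield - self.bitfield)
        if not needed:
            return None
        piece = needed[0]              # a real client invokes the piece-picker
        return ("REQUEST", piece, 0)   # offset in bytes, advancing by CHUNK_SIZE

a = Peer("a", have=[])
b = Peer("b", have=[0, 1, 2, 3])   # b is a seeder

assert a.on_bitfield(b) == "INTERESTED"
assert b.on_bitfield(a) is None    # b needs nothing, so no message is scheduled
assert a.on_unchoke(b) == ("REQUEST", 0, 0)
```

Note that, as in the protocol, a peer with an empty bitfield generates no INTERESTED traffic toward peers that hold nothing it needs.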
3.4.2 Choker Algorithms
There are two distinct choker algorithms, each with very different goals. The
first is the choker algorithm used by a seeder peer. The goal is not to select the peer
whose upload data transfer rate is best, but instead to maximize the distribution of
pieces. The choker used by leecher peers, in contrast, uses a sorted list of peers
based on upload rates as the key determining factor. That is, it seeks the set of
peers with whom it can best exchange data. Both choker algorithms are scheduled to run every
10 seconds and can be invoked in response to INTERESTED/NOT INTERESTED
messages. Each invocation of the choker algorithm counts as a round. There are
three distinct rounds that both choker algorithms cycle through. We begin with the
details for the seeder choker algorithm (SCA).
SCA only considers peers that have expressed interest and have been unchoked
by this peer. First, the SCA orders peers according to the time they were last
unchoked, with the most recently unchoked peers (within a 20-second window) listed first.
All other peers outside that window are ordered by their upload rate. In both cases,
the fastest upload rate is used to break ties between peers. During two of the three
rounds, the algorithm leaves the first three peers unchoked, and unchokes another
randomly selected peer. This peer is known as the optimistic unchoked peer (OUP).
During the third round, the first four peers are left unchoked and the remaining peers
are sent CHOKE messages if they are currently in the unchoked state.
For the leecher choker algorithm (LCA), at the start of round 1 (i.e. every 30
seconds), the algorithm chooses one peer at random that is choked and interested.
As in the SCA, this is the OUP. Next, the LCA orders all peers that are interested and
have sent at least one data block within the last 30-second interval; all
other peers are considered to be snubbed. Snubbed peers are excluded from being
unchoked to prevent free-riders and ensure that peers share data in a relatively
fair way. From that ordered list, the three fastest peers along with the OUP are
unchoked. If the OUP is one of the three fastest, a new OUP is determined and
unchoked. If the OUP is not interested, the choker algorithm will later be invoked
as part of INTERESTED message processing.
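The LCA's round logic can be sketched as follows. Field names (`interested`, `recent`, `rate`) are illustrative, not from the model's source, and the re-draw of the OUP when it falls among the three fastest peers is omitted for brevity:

```python
import random

def leecher_choker(peers, round_num, oup=None):
    """Sketch of the leecher choker algorithm (LCA) described above.
    peers maps a peer id to a dict with keys 'interested', 'recent'
    (sent a block in the last 30 s), and 'rate' (measured upload rate).
    All field names are illustrative placeholders."""
    if round_num % 3 == 0:            # start of round 1, i.e. every 30 seconds
        candidates = [p for p, s in peers.items() if s["interested"]]
        if candidates:
            oup = random.choice(candidates)   # optimistic unchoke
    # Snubbed peers (no block in the last 30 s) cannot be unchoked.
    eligible = [p for p, s in peers.items() if s["interested"] and s["recent"]]
    fastest = sorted(eligible, key=lambda p: peers[p]["rate"], reverse=True)[:3]
    unchoked = set(fastest) | ({oup} if oup is not None else set())
    return unchoked, oup

peers = {
    "a": {"interested": True, "recent": True,  "rate": 50},
    "b": {"interested": True, "recent": True,  "rate": 40},
    "c": {"interested": True, "recent": True,  "rate": 30},
    "d": {"interested": True, "recent": False, "rate": 100},  # snubbed
    "e": {"interested": True, "recent": True,  "rate": 10},
}
# Mid-cycle round: the OUP carries over, the three fastest non-snubbed
# interested peers are unchoked, and the snubbed peer is excluded even
# though it advertises the highest rate.
unchoked, oup = leecher_choker(peers, round_num=1, oup="e")
assert unchoked == {"a", "b", "c", "e"}
assert "d" not in unchoked
```

The exclusion of peer "d" illustrates the free-rider defense: a high advertised rate does not help a peer that has not actually delivered data recently.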
3.4.3 Piece-Picker
The piece-picker is a two phase algorithm. The first phase is random. When a
leecher peer has no content, it selects four pieces at random to download from peers
that have those particular pieces. Once a peer has those four pieces, it shifts to a
second phase of the algorithm that is based on a rarest piece first policy. Here, each
piece’s count is incremented based on HAVE and BITFIELD messages. The piece
with the lowest count (but not zero) is selected as the next piece.
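The two phases can be sketched in a few lines. The function and parameter names are illustrative, not the model's identifiers:

```python
import random

def pick_piece(availability, have, pieces_completed):
    """Sketch of the two-phase piece-picker described above.
    availability[i] is the number of connected peers holding piece i
    (maintained from HAVE and BITFIELD messages); have is the set of
    pieces this peer already holds. Names are illustrative."""
    candidates = [i for i, count in enumerate(availability)
                  if count > 0 and i not in have]
    if not candidates:
        return None
    if pieces_completed < 4:                    # phase one: random selection
        return random.choice(candidates)
    return min(candidates, key=lambda i: availability[i])   # rarest first

# Piece 2 has a zero count, so it is never selected; piece 1 is rarest.
assert pick_piece([3, 1, 0, 2], have=set(), pieces_completed=4) == 1
assert pick_piece([3, 1, 0, 2], have={1}, pieces_completed=4) == 3
# In phase one, any available, unheld piece may be chosen at random.
assert pick_piece([3, 1, 0, 2], have=set(), pieces_completed=0) in {0, 1, 3}
```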
3.4.4 Implications for Network Model Design
As one can see, the dynamics and causal relationships among peers are extremely
complex. Consequently, we are limited in the extent to which we can abstract
away such interactions without incurring losses with respect to peer-protocol inter-
actions. For example, a peer need not receive a full 256 KB piece from a single peer,
nor is it guaranteed to receive blocks within a piece in-order, or pieces themselves in
any particular order. Additionally, the pattern with which pieces are received im-
pacts the “rarest piece” within a peer-set. This rarest piece will vary among peer-sets
as their view of the available pieces changes over time. This in turn impacts which
pieces a peer will request, and ultimately determines the download completion time
along with other network effects. This point is especially critical if we attempt to
make any sort of cross-P2P model performance comparisons. Thus, it is impera-
tive that any abstraction preserve the dynamics between peers, peer-sets, available
pieces, and rarest pieces. Because of this, we are forced to model this protocol at the
level of a slice. However, as we will show, this level affords a 180x event reduction
over a pure packet-level model.
3.5 Simulator
Our model [57] of the BitTorrent protocol (P2P Contributions 1 and 2) is
written on top of ROSS [88, 89], which is an optimistically synchronized parallel
simulator based on the Time Warp protocol [98]. In this modeling framework, sim-
ulation objects, such as peers, are realized as logical processes (LPs) that exchange
time-stamped event messages in order to communicate. Each message within the
BitTorrent protocol is realized as a time-stamped event message, where the time
stamps are generated by delays from our network topology model [58], which real-
istically approximates today’s home broadband Internet service.
The simulator is flow-based and operates at the slice-level. In addition, our
topology model allows us to abstract away details that are non-pertinent to Internet
simulations, where delays experienced in the core are negligible compared to those
in the last mile [59]. As a result, we have a realistic model that achieves significant
sequential processor speedups and reductions in required memory, allowing us to
simulate extremely large-scale swarms of hundreds of thousands of peers.
3.5.1 BitTorrent Model Data Structure
The data structure layout for our BitTorrent model is shown in Figure 3.1. At
the core of the model is the peer state, denoted by bt_peer_lp_state_t. Inside
each peer, there are three core components. First is the peer_list, followed by
the picker and the choker. The peer list captures all the upload and download
state for each peer connection, as denoted by the bt_peer_list_t structure. A
peer list can be up to 80 in length. The picker contains all the necessary state for
a peer’s piece-picker. The choker manages all the required data for a peer’s choker
algorithm. This algorithm makes extensive use of the download data structure
for each peer connection. Finally, each peer contains data structures to manage
statistical information as well as simulated CPU usage that are used in protocol
analysis.
Next, each peer connection has upload and download states associated with
it. The upload, denoted by the bt_upload_t structure, contains upload-side status
flags, such as choked, interested, etc. The download, denoted by the
bt_download_t structure, is significantly more complex from a modeling perspective.
In particular, this structure contains a list of active requests made by the owning
peer. The bt_request_t data structure contains a pointer to the destination peer
in the peer list along with the piece and offset information. Recall that each 256
KB piece is further divided into partial chunks of 16 KB each. The offset indicates
which 16 KB chunk this request is for within an overarching piece.
Now, inside the piece-picker data structure, denoted by bt_picker_t, is an
array of piece structures along with a rarest piece priority queue. Inside each piece
array element, denoted by the bt_piece_t, is the current download status of that
particular piece. Two key lists inside of the data structure are the lost_request
and peer_list. The lost_request is a queue for requests that need to be remade
because the connection to the original destination peer was closed/terminated as per
the BitTorrent protocol. The peer_list is the list of peers that have this particular
piece (determined by receipt of a HAVE message). This list is used by the piece-
picker algorithm to select which peer to send the request to for this particular piece.
We observe here that this piece_peer list is different from the previous ones
in that it points to a container structure bt_piece_peer_t, which is just a list
with a peer list pointer contained within it. It is this data design that results in
significant memory savings over a static allocation of peer arrays. This enables us
to manage our own piece-peer memory and reuse memory once a piece has been
fully obtained. Similarly, we also manage bt_request_t memory. As a leecher-peer
becomes a seeder-peer, it no longer issues piece download requests. Thus, those
memory buffers can be re-used within the simulation model for other download
requests, enabling greater scalability in terms of the number of peer-clients that can
be modeled.
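The buffer-reuse scheme can be sketched as a simple free list. This is an illustrative sketch, not the model's C implementation; the class and field names are our own:

```python
class RequestPool:
    """Minimal free-list sketch of the memory reuse described above:
    completed request objects are recycled instead of allocating a new
    one for every REQUEST message. Names are illustrative."""
    def __init__(self):
        self._free = []
        self.allocations = 0   # how many objects were ever created

    def _new(self):
        self.allocations += 1
        return {"piece": None, "offset": None}

    def acquire(self, piece, offset):
        req = self._free.pop() if self._free else self._new()
        req["piece"], req["offset"] = piece, offset
        return req

    def release(self, req):
        self._free.append(req)   # the buffer becomes reusable

pool = RequestPool()
r1 = pool.acquire(piece=0, offset=0)
pool.release(r1)                          # chunk received; recycle the buffer
r2 = pool.acquire(piece=1, offset=16384)  # reuses r1's storage
assert r2 is r1
assert pool.allocations == 1
```

In the simulation model this matters because a leecher that becomes a seeder stops issuing requests, so its buffers can serve other peers' downloads rather than sitting idle.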
The final key data structure within the piece-picker is a Splay Tree priority
queue [101], which is used to keep the rarest piece at the top of the priority queue.
Our selection of this data structure over others, such as a Calendar Queue [93], is
because of its low memory usage and high performance for small queue lengths (i.e.,
fewer than 100). The key sorting criterion for this queue is based on counts of peers that
have each piece. The lowest count piece will be at the top of the queue. Each peer
LP manages its own rarest piece priority queue.
3.5.2 Tuning Parameters
In terms of tuning parameters, BitTorrent has on the order of 20 or more, which
are beyond full consideration here. However, we do focus on two key parameters
that have a profound impact on simulator performance. The first is max_allow_in.
This parameter determines the maximum number of peers that a peer will make
connections to, or accept requests from; it thus bounds the length of a
peer's peer_list, which impacts the complexity of the piece-picker and choker
algorithms. Another key parameter is max_backlog, which sets a threshold on the
number of outstanding requests that can be made on any single peer connection.
Figure 3.1: This figure shows our BitTorrent model data structure.

In the BitTorrent implementation, max_backlog is hard-coded to be 50 unless the
data transfer rate is extremely high (greater than 3 MBps), in which case it can
go beyond that value. So, an approximate upper bound on the number of request
events that can be scheduled is the product of max_allow_in, max_backlog, and
the number of peers. A consequence of this product is that memory usage in the
simulation model grows very quickly.
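As a concrete illustration of how quickly this bound grows (the swarm size here is hypothetical; max_backlog = 50 and the 80-connection peer-set cap come from the text):

```python
def max_request_events(num_peers, max_allow_in, max_backlog=50):
    """Approximate upper bound on the number of request events that can
    be scheduled, per the product described above. The default
    max_backlog of 50 is the hard-coded value given in the text."""
    return num_peers * max_allow_in * max_backlog

# With the 80-connection peer-set cap and a hypothetical 100,000-peer
# swarm, the bound is already 400 million request events.
assert max_request_events(100_000, 80) == 400_000_000
```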
3.6 Topology Model
The Internet’s inherent heterogeneity and constantly changing nature make it
difficult to construct a realistic, yet computationally feasible model. In the construc-
tion of any model, one must take into consideration flexibility, accuracy, required
resources, execution time, and realism. In this section, we discuss the methodology
and creation of our model used to simulate Internet content distribution, and the
rationale behind its design. In particular, we are interested in modeling the in-home
consumer broadband Internet, while preserving geographic market relationships. In
our performance study, our simulations experience tremendous sequential processor
speedups, and require a fraction of the memory of other models, without sacrificing
the accuracy of our findings. Specifically, our slice-level model achieves the accu-
racy of a packet-level model, while requiring the processing of 180 times fewer events
(P2P Contributions 2 and 3).
Our topology model comprises several components that are largely independent
of each other. The components include the Internet connectivity model
(Section 3.6.2), the population model (Section 3.6.3), the delay model (Section 3.6.4),
the technology model (Section 3.6.5), and the bandwidth model (Section 3.6.6).
In the design of these models, many decisions were made whether or not to
include certain features. Considered in this process are the model’s overall realism,
its accuracy, the data collection and maintenance required, the execution time of
the simulations, and the required system resources. In some cases, a model can be
unnecessarily complex, and produce results that either cannot be analyzed, or are
no better than those of a simpler version [71]. In Section 3.6.7, we demonstrate the
efficacy of our model, and discuss some of the benefits that we reap as a result of
our decisions.
3.6.1 Related Work
3.6.1.1 Internet Mapping Projects
CAIDA [65] used skitter data that it collected over time to construct a connectivity
model between registered Autonomous Systems (ASes) on the Internet. This
model captures the connectivity between groups of networks; however, it leaves
internal network structure unknown. This model is not suitable for our simulations
because realistic hop counts cannot be determined (an AS can be disconnected, or even
span the country, in one hop). Further, we have no data regarding the location of
nodes or their corresponding bandwidths.
Lumeta [68] also created an Internet map using trace data. The map is very
large-scale, and does give a notion of location. However, probes were initiated
from a single source; thus, the map is very tree-like, and hop counts cannot be
accurately inferred. Rocketfuel [78] is an Internet mapping tool that allows for
direct measurements of router-level ISP topologies. The number of required traces
is significantly reduced by exploiting BGP routing tables, using properties of IP
routing to eliminate redundant measurements, performing alias resolution, and using
DNS to divide maps into POPs and the backbone. Using 300 sources and 800 sinks, Rocketfuel
creates extremely detailed maps of specific ISPs [79]. We use a similar technique to
map parts of the backbone and POPs; however, we abstract out specific ISPs. This
allows us to scale to larger simulations while keeping realistic ISP properties.
Mercator [81] is a similar tool that uses informed random-access hop-limited
probes to explore the IP address space. Targets are informed by the results of earlier
probes as well as IP address allocation policies. Mercator is deployable anywhere
because it makes no assumptions about the availability of external information to
direct the probes. It uses alias resolution and a technique called source-routing to
direct probes in non-radial directions from the source in order to discover cross-links
that would not have otherwise been found. In our model, we use carefully chosen
addresses and ranges to probe in order to guarantee the coverage of certain key
geographic regions.
The work in [80] describes a model of the U.S. Internet backbone constructed using merged
data sets from the existing Internet mapping efforts Rocketfuel and Mercator, and
identifies areas where the research community lacks data, such as link bandwidth
and link delay data.
3.6.1.2 Abstractions
Presented in [85] are fluid models used to study the scalability, performance,
and efficiency of BitTorrent-like file-sharing mechanisms. The idea is to approximate
a system through theoretical analysis, rather than a detailed simulation.
In [73], NIx-Vector routing (short for Neighbor-Index Vector) is introduced.
Typically, routing of packets on the Internet consists of a series of independent
routing decisions made at each router along the path between any source and des-
tination. Hence, when many packets are sent between the same pair of nodes, the
same decisions are made repeatedly and independently, without knowledge of any
previous decisions. A NIx-Vector is a compact representation of a routing path that
is small enough to be included in a packet header. Once this vector exists, routing
decisions can be made at each router in constant time, without requiring caching or
state saving. This technique can significantly reduce the burden on routers.
Staged Simulation [76] is a technique for improving the runtime performance
and scale of discrete-event simulators. It works by restructuring discrete-event sim-
ulators to operate in stages that pre-compute, cache, and reuse partial results to
drastically reduce the amount of redundant computation within a simulation. Like
all abstraction techniques, there are advantages and tradeoffs. Experiments show
that this technique can improve the execution time of the NS2 simulator consider-
ably.
One of the first flow-based network models was reported in [63]. Here, a two
order of magnitude speedup is achieved over a pure packet-level model by coarsening
the representation of the traffic from a packet-basis to a “cluster” of closely spaced
packets called a train. Narses [82] and GPS [83] are other flow-based network sim-
ulators that approximate the low-level details such as the physical, link, network,
and transport layers. A similar framework is presented in [84]. Our simulator is also
flow-based, operating at the slice level without neglecting low-level details,
allowing us to analyze application-layer behavior as well as the effects on the underlying
network.
Most recently, the work in [72] reports a new method for periodically computing traffic at
a time scale larger than that typically used for detailed packet simulations. This is
especially useful for large-scale simulations where the execution cost is exceedingly
expensive. Results suggest huge speedups are possible when comparing background
flows to those simulated in pure packet simulators. In addition, comparing the
foreground interactions verifies the accuracy of the technique.
The work in [75] discusses a novel approach to scalable and efficient network simulation,
which partitions the network into domains and the simulation time into intervals.
Each domain is simulated concurrently and independently of the others, using only
local information for the interval. At the end of each interval, simulation data is
exchanged between domains. When the exchanged information converges to a value
within a prescribed precision, all simulators progress to the next time interval. This
approach results in speedups due to the parallelization with infrequent synchroniza-
tion.
Common to all of these approaches is the tradeoff of accuracy for a decrease
in computational complexity. In many cases, that tradeoff must be made in or-
der to make the model computationally tractable. Our plight is no different here.
Large-scale P2P protocol sessions exist for many hours to days. Capturing the
larger-scale session dynamics within a tractable computational budget on common
hardware is not possible at the packet level. What makes our approach different
are the constraints that P2P protocols, and BitTorrent in particular, place on our
network abstraction, coupled with the in-home broadband usage model.
3.6.2 Internet Connectivity Model
The Internet connectivity model defines all the nodes and links present in the
simulated network. As the Internet is constantly changing, a true-to-life connectiv-
ity graph of the Internet does not exist. Our model features two key components:
the Internet backbone, and the neighborhood-level networks of lower-tiered ISPs.
The Internet backbone contains many of the key links that glue the Internet to-
gether. The backbone is very non-uniform, and has evolved slowly over time. The
neighborhood-level networks on the other hand, are very uniform, and have evolved
based on the current Internet connection technology trends (i.e. cable or DSL).
In particular, these two device technologies have different performance character-
istics that need to be considered when distributing large video content to in-home
audiences via the Internet.
In order to preserve realism and accuracy in our simulations, the model must
capture many properties of the Internet, especially those in the “last mile” where
most of the delay and congestion for in-home broadband networks is likely to occur.
Additionally, our model must allow for a configurable number of nodes. Thus, we
have developed a hybrid abstraction connectivity model to do just that.
3.6.2.1 Backbone
The importance of the Internet backbone is obvious, but because of its non-uniformity,
it cannot be generated easily. Our model therefore uses a subset
of the actual backbone. These nodes and connections were obtained by performing
thousands of traces from 15 sources to 99 sinks all over the U.S. (see Figure 3.2). We
reached 3,331 distinct nodes, and covered 6,239 edges. The maximum experienced
degree (number of links connecting a single node) was 36, and the average degree was
3.746. When data is sent across the backbone in the simulation, we can use typical
delays based on the path length to estimate its total backbone delay. Figure 3.3
shows the lengths of distinct shortest paths in the simulated backbone. Within
the modeled backbone, low-tiered ISPs were located in many of the designated
market areas defined by Nielsen Media Research [60]. These markets are driven
by the Nielsen Rating System, which is used to determine viewing rates of cable
and broadcast television shows by location. We use the Nielsen market data to
provide a distribution of potential home viewers of content received over the Internet.
This aspect is discussed in the sections below. By design, these nodes border the
backbone, and can therefore be used to expand to the particular ISP’s neighborhood-
level networks.
Figure 3.2: This figure is the connectivity graph of the backbone of the connectivity model. The nodes represent sources, sinks, intermediate backbone routers, and identified low-tiered ISP routers. The edges represent links between respective nodes.

Figure 3.3: This figure shows the distribution of shortest path lengths for distinct paths in the backbone of the connectivity model. This curve is typical of the Internet, demonstrating that we have preserved the required path properties.
3.6.2.2 Neighborhood-Level
Having up-to-date trace results for all ISPs would allow for maximum realism
in our simulations. However, this would require constant data gathering, and the
memory required to store such data (typically an adjacency matrix or adjacency list)
can be on the order of gigabytes for large simulations like the ones we study. For ex-
ample, a 100,000 by 100,000 matrix with 32-bit entries would require approximately
37 GB of memory (see Table 3.1 for memory comparisons). Luckily, network design
theory implies, and traces confirm, that neighborhood-level networks have similar
structures regardless of the particular ISP (in particular we looked at cable and DSL
ISPs). Because of this, specific ISPs have been abstracted out of the model, and we
can dynamically generate these types of networks in a realistic manner. In this case,
the speedup, the reduction in required system resources, and the elimination of the
need to maintain an up-to-date connectivity model are worth the slight degradation
of system realism.
Figure 3.4 shows the connectivity graph resulting from one set of traces to a
popular cable ISP. From the figure, we can see how the routers are interconnecting at
the different network levels, and also the fan-outs at the network’s edge connecting
to home computers/networks. This figure includes a total of 21,146 nodes, resulting
from responses from 21,037 homes and 109 intermediate routers. From this set of
traces, the average fan-out size is approximately 540 nodes.

Figure 3.4: This figure is the connectivity graph resulting from one set of traces to a popular cable ISP.
Our market/neighborhood-level model allows us to take advantage of symme-
tries that exist at the consumer broadband level of the Internet. This allows us to
route without using any adjacency-storing data structures. For example, all peers
in the same neighborhood have common routers (usually a few hops away) used
to route within the neighborhood. Similarly, peers within the same market area
have common routers used to route between neighborhoods and the ISP’s back-
bone. Thus, an individual peer’s adjacencies are unimportant. Whether a message
is being sent to the same neighborhood, a different neighborhood within the same
market, or to a completely different market, hops along the paths to common routers
can be accounted for, and the message can be forwarded to the appropriate peer
or backbone router. Although asymptotically the same, this technique provides a
space and computational complexity improvement over the popular adjacency list
data structure, while providing the same routes.
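The adjacency-free routing scheme can be sketched as follows. The per-level hop constants and the tuple-based peer addressing are illustrative placeholders, not measured values or identifiers from the model:

```python
def hop_count(src, dst, nbhd_hops=2, market_hops=4, backbone_hops=6):
    """Sketch of adjacency-free routing: a peer is identified by a
    (market, neighborhood, host) tuple, and hops are charged to the
    common routers described above. The per-level hop constants are
    illustrative placeholders, not measured values."""
    if src[:2] == dst[:2]:     # same neighborhood: via the neighborhood router
        return 2 * nbhd_hops
    if src[0] == dst[0]:       # same market: via the market's common router
        return 2 * market_hops
    # Different markets: down/up through each market's router, plus a
    # backbone crossing (in the full model, taken from the traced graph).
    return 2 * market_hops + backbone_hops

# No adjacency list or matrix is consulted; only the address tuples matter.
assert hop_count(("nyc", "n1", "h1"), ("nyc", "n1", "h2")) == 4
assert hop_count(("nyc", "n1", "h1"), ("nyc", "n2", "h9")) == 8
assert hop_count(("nyc", "n1", "h1"), ("la",  "n3", "h7")) == 14
```

Because the route to any destination is determined entirely by comparing address components, per-peer adjacency storage is unnecessary, which is the source of the memory savings shown in Table 3.1.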
3.6.3 Population Model
As previously mentioned, the backbone portion of the connectivity model has
identified ISPs by location. Using our current population statistics of the given
designated market areas [60], we can generate realistic neighborhood-level networks.
For example, if a city has 1 million cable Internet subscribers, it is unrealistic to
generate 5 million such nodes within the neighborhood-level networks of that city.
In terms of abstraction, the population data for specific cities allows us to take into
consideration time zones and the targeting of certain populations. Since we are
mostly concerned with media distribution, including streaming, this level of fidelity
is required to give realistic simulation results.
3.6.4 Delay Model
Our Internet connectivity model (discussed in Section 3.6.2) provides our sim-
ulations with realistic hop counts. As previously mentioned, we are simulating
Internet content distribution, thus, we must measure time in order for our exper-
iments to be useful for analysis. Therefore, we must have appropriate delay and
bandwidth models. In this section, we describe our delay model.
Research has shown that, compared to the delay at the first and last links on
a packet's path, the delay through the Internet's core is negligible [59]. Because
of this, we can use estimates of the core delays without significantly impacting our
results. Our estimates come from live measurements. Figure 3.5 shows the average
delay experienced at each of the first 18 hops along a packet’s trajectory for roughly
100,000 performed traces. The curve suggests that even though many factors, both
predictable and unpredictable, contribute to delay, it generally increases at later
hops (and decreases closer to the destination). Because of this, we believe that
using average delays and distributions around core hops is realistic. Further, the
traces were performed on the live Internet over several days. Thus, the averages
inherently capture the effects of background traffic, while reducing computational
costs and data-gathering needs. For the first and last links, the delays and available
bandwidths are specified by the technology model.
Figure 3.5: This figure shows the average delays experienced at each of the first 18 links along a packet's path from our traces.
3.6.5 Technology Model
The technology model describes what type of device a home user uses to
connect to the Internet. Research has shown that long delays exist at the first and
last links along a packet's path. Thus, the technology model affects the delay model.
According to [77], depending on the DSL provider, service levels can range from 128
Kbps to 7 Mbps downstream from the Internet to the user, while upstream service
levels from the user to the Internet can range from 128 Kbps to 1 Mbps. Cable
service levels can range from 400 Kbps to 10 Mbps downstream and 128 Kbps to 10
Mbps upstream. Service levels depend on service agreements offered by each cable
system operator per market, and depend on whether the access is for residential
or commercial use. But typically, cable (hybrid-fiber coax) has more bandwidth
available than DSL.
Since we are interested in simulating cable and DSL users, we will generate the
nodes in our connectivity model according to the national percentages of home users
that connect using the two technologies. The delay model can therefore include the
delays at the first and last links based on the device being used. These delays have
been observed in our traces. Figure 3.6 shows the national averages of cable and
DSL users from 2003 and 2006 [62].
Depending on the simulation needs, more devices can be used in the technology
model, and the other topology components should be updated appropriately.
Figure 3.6: This figure shows the national technology distribution for home high-speed Internet connections for March of 2003 and March of 2006.
3.6.6 Bandwidth Model
The last major component of our topology model is the bandwidth model.
Equation 3.1 [69] provides an upper bound estimate on the bandwidth for delivering
a 16 KB block. In Equation 3.1, BW is the bandwidth; MSS is the maximum
segment size (which is 1,460 bytes in default TCP, and 1,380 bytes in BitTorrent);
RTT is the round trip time; and p is the probability of packet loss.
To calculate the delay of a path, we apply a truncated (values below zero are
not used) normal distribution to the observed average delay. We use this distribution
because we observed a Gaussian curve in the real trace delays similar to Figure 3.3.
From here, RTT is set to twice the overall path delay (accounting for the forward
path and the return path).
The probability of packet loss is set to 0.05, which is a conservative estimate
based on the loss rates of the ISPs we observed. The final bandwidth is rate-shaped
based on the available bandwidth remaining along the pipe.
BW < (MSS / RTT) · (1 / √p)    (3.1)
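The delay procedure and Equation 3.1 can be combined into a small estimator. The function names and the example delay are illustrative; the model's final rate-shaping step is omitted:

```python
import math
import random

MSS = 1380  # bytes; the maximum segment size BitTorrent uses, per the text

def sample_path_delay(avg_delay, std_dev):
    """Truncated normal around the observed average delay; values
    below zero are not used (clipped here for simplicity)."""
    return max(0.0, random.gauss(avg_delay, std_dev))

def bandwidth_upper_bound(one_way_delay, p=0.05):
    """Upper bound on bandwidth in bytes/s from Equation 3.1, with RTT
    set to twice the one-way path delay and p the packet-loss rate
    (0.05 is the conservative estimate used in the text)."""
    rtt = 2.0 * one_way_delay
    return (MSS / rtt) * (1.0 / math.sqrt(p))

# For example, a hypothetical 50 ms one-way delay with 5% loss bounds a
# flow at roughly 61.7 KB/s (before rate-shaping).
bw = bandwidth_upper_bound(0.050)
assert abs(bw - MSS / 0.1 / math.sqrt(0.05)) < 1e-9
```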
3.6.7 Results
In this section, we provide experimental results that defend our topology model
design. In Table 3.1, we compare the amount of memory required by our connectivity
model versus a model with all real nodes stored in an adjacency list and an adjacency
matrix, for several simulation runs with a varying number of nodes. The results are
based on a memory footprint of 67 KB per peer (which has been achieved in [57]). It
72
Simulated Memory Required Memory Required Memory Required
Peers (MB) (MB) With List (MB) With Matrix
10,000 654 656 1,03520,000 1,308 1,311 2,83350,000 3,270 3,278 12,806100,000 6,540 6,555 44,686Lookup O(1) O(degree) O(1)
Table 3.1: Approximate memory required for simulation runs, and tech-nique lookup complexity.
is obvious that the savings are drastic, and it is not shown, but the memory accesses
may also increase the simulation time significantly for the adjacency storing models.
In Table 3.2, we revisit some simulation runs published in [57]. In particular,
we look at a swarm of 1,000 peers, and files with the following number of 256 KB
pieces: 128, 256, 512, and 1,024. We compare the number of events processed in
each simulation to a lower bound estimate on the number of events required in the
equivalent packet-level simulations. The lower bound takes into consideration the
actual data broken up into packets and forwarded at each hop. TCP control mes-
sages are ignored, and BitTorrent protocol messages are not broken up (both would
increase the number of events further for the packet-level simulator). As shown,
the number of events increases by a factor of up to 180 for the given simulation
runs. Note that this event increase significantly increases the execution time of
the simulations, as the event-rate is likely to remain roughly the same, hence, our
simulations experience a tremendous speedup as a result of the reduced number of
events processed.
Figure 3.7 shows the download completion times across all peers for a modified
version of the INRIA/PlanetLab test-bed scenario [67]. Here, 40 peers are divided
into 2 groups, fast peers and slow peers. The fast peers have a 200 KBps upload
capacity, while the slow peers have only a 20 KBps upload capacity. The download
capacity is set in our simulation to 100 MBps. The primordial seeder has the same
upload capacity as a fast peer. There are a few key differences between our scenario
and the INRIA/PlanetLab scenario. First, the PlanetLab network topology is less
complex than our topology. Our 40 peer scenario is distributed across a network that
Pieces   Slice-Level Events   Packet-Level Events
128      19M                  3.42B
256      36M                  6.48B
512      66M                  11.88B
1,024    122M                 21.96B

Table 3.2: Number of events generated in the slice-level simulations and lower bound on the number of events generated in the packet-level simulations.
Figure 3.7: This figure shows the download completion times of the modi-fied INRIA/PlanetLab scenario taken from [67]. In our case,we varied the random number seed-sets across 10 separateruns of the 40 peer, 1 seeder scenario. Thus providing uswith 400 peer data points.
spans the top 31 television markets in the U.S. Next, because our model is currently
only able to support cable and DSL devices, we only have two speed classes of users
at this time. Lastly, because of the radically different random number generation
seed-sets used across the 10 experiments, our range of different peer-sets and piece
selection is much greater. However, despite these differences, we observe that our
download completion times, in terms of shape, are similar to what they report – i.e.,
Figure 3.8: Simulated download completion times (seconds) for the 1,024
peer, 1,024 piece scenario.
the conical S-shape. This shape has also been reported by [87] in their emulation
of BitTorrent for a 700 peer scenario. This result provides confidence that both our
network and BitTorrent models are behaving as expected.
3.7 Experimental Results
The experiments in this section were conducted on a 16 processor, 2.6 GHz
Opteron system with 64 GB of RAM running Novell SuSe 10.1 Linux. These exper-
iments were conducted sequentially.
3.7.1 Model Validation
In order to validate our BitTorrent model, we created three tests: (i) download
completion test, (ii) download time test, and (iii) message count test. While there
is no consensus in the BitTorrent community on a valid BitTorrent implementation
because of variability that is acceptable within the protocol, we believe these tests
provide us with some confidence in the behavioral accuracy of our model (P2P
Contribution 4).
The first test asks the most basic question: did all leecher peers obtain a complete
copy of the file? To conduct this test, we executed our model in 16 different
Table 3.3: Number of messages received per type per simulation scenario.

Scenario              Choke    Unchoke  Interested  Not Interested  Have        Request
128 peer, 128 pieces  10,402   10,402   30,693      26,539          1,247,294   333,719
256 peer, 128 pieces  21,181   21,181   66,213      57,520          2,552,040   669,542
512 peer, 128 pieces  42,964   42,964   144,466     122,469         5,129,973   1,319,830
1K peer, 128 pieces   86,240   86,240   287,397     240,759         10,271,737  2,668,919
128 peer, 256 pieces  10,584   10,584   42,933      38,899          2,494,655   622,895
256 peer, 256 pieces  21,399   21,399   96,786      86,870          5,104,101   1,236,486
512 peer, 256 pieces  43,013   43,013   208,328     186,197         10,258,933  2,517,261
1K peer, 256 pieces   85,753   85,753   389,975     348,770         20,543,479  5,028,309
128 peer, 512 pieces  10,950   10,950   68,810      63,851          4,989,376   1,177,809
256 peer, 512 pieces  21,661   21,661   137,811     128,039         10,208,231  2,350,252
512 peer, 512 pieces  43,258   43,258   294,295     271,613         20,517,877  4,706,605
1K peer, 512 pieces   86,051   86,051   581,193     537,537         41,087,991  9,521,465
128 peer, 1K pieces   11,240   11,240   104,061     99,097          9,978,815   2,258,058
256 peer, 1K pieces   21,979   21,979   206,557     196,053         20,416,487  4,531,166
512 peer, 1K pieces   43,340   43,340   396,434     373,540         41,033,714  9,000,343
1K peer, 1K pieces    10,402   10,402   30,693      26,539          82,178,039  18,316,898
configurations based on the number of peers and the number of pieces. The number
of peers ranged from 128 to 1,024 by a power of two. Similarly, the number of pieces
also ranged from 128 to 1,024 by a power of two. At the end of each simulation run,
we collected statistics on each peer. In particular, we noted how many remaining
pieces a peer had, which was zero in all cases. Furthermore, all pending requests
should have been satisfied. Thus, the active request list for each connection should
be empty. We confirmed for all 16 cases that no requests were pending, and in fact,
all request memory buffers had been returned to the request memory pool, thus en-
suring no memory leaks existed. Finally, we ensured that as a piece is downloaded,
we correctly free the peer_list structures that have that piece, and remove it from
our rarest piece priority queue. Again, this verifies that we do not have any memory
leaks in the management of the piece-peer list structures, and serves as a cross-check
that all pieces have been correctly obtained by a peer.
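The rarest-piece bookkeeping above can be illustrated with a minimal selection function. This is a sketch only: the simulator itself maintains the counts in a priority queue keyed on availability, and the names below are ours, not the simulator's.

```python
def rarest_first(piece_peers, needed):
    """Rarest-piece-first selection.  `piece_peers` maps each piece to the
    set of connected peers that have it (the role of the peer_list
    structures); pick the needed piece available from the fewest peers,
    ignoring pieces no neighbor has yet."""
    candidates = [(len(peers), piece)
                  for piece, peers in piece_peers.items()
                  if piece in needed and peers]
    return min(candidates)[1] if candidates else None
```

As a piece completes its download everywhere, it is dropped from `needed` and its entry removed, mirroring how we free the peer_list structures and prune the rarest piece priority queue.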
In the second test, we want to know the distribution in time for when a peer
completes the download of the file. We then verify the shape of our download times
curve against those most recently published in [87]. For this test, we use the 1K
peer, 1K piece scenario. We observe that at each milestone, 25%, 50%, 75%, and
100% (download complete), most of the peers reach it at the same point in time,
as shown in Figure 3.8. This trend is attributed to the rarest piece first policy used
to govern piece selection, coupled with fair tit-for-tat trading. For the most part,
this prevents any peer from “getting ahead” in the overall download process. We
do however note, there does appear to be some “early winners” and “late losers”
in the process. This phenomenon occurs because not all peer-sets have access to
all pieces at the same time. Some peer-sets are losers (i.e., many hops away from
the original seeder), and peer-sets that contain the seeder have rare pieces more
readily available. The shape of our download completion times curve is confirmed
by the emulated results presented in [87]. Additionally, we find the real measurement
data in [66] reports a similar shaped download time distribution curve. However,
a key difference is that its variance is much greater, leading not to a relatively flat
line as we have, but to a positive-sloped line. We attribute this difference to the
measurement data only covering an extremely small “swarm” (only 30 leechers with
9 peers in each peer-set). Thus, the network parallelism is not available because
of fewer connections. Therefore, downloads will be more serialized, yielding longer,
more staggered download completion times.
In the last test, we validate our message count data as shown in Table 3.3,
against the real measurement data reported in [66]. There are two key trends that
appear to point to proper BitTorrent operation. The first is that the number of
choke and unchoke messages should be equal. In all 16 configurations, we find this
assertion to be true. This is because the choke algorithm forces these messages
to operate in pairs. Second, the number of interested messages should be slightly
higher, but almost equal, to the number of not interested messages. We observe this
phenomenon across all 16 model configurations. Finally, we observe that the number
of have and request messages meet our expectations. In the case of have messages,
they are approximated by the number of peers, times the number of pieces, times the
number of peer connections per peer. In the case of the 1K peer, 1K piece scenario,
this is bounded by 80 × 1,024 × 1,024 = 83,886,080. Likewise, the number of
Figure 3.9: Model execution time as a function of the number of pieces and
the number of peers.
requests has a lower bound of the number of pieces, times 16 slices per piece, times
the number of peers. The reason this is a lower bound is because of endgame mode,
which allows for the same piece/offset to be requested many times across different
peers.
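These expectations can be written down directly. The following small sketch uses the 80 connections per peer figure from the calculation above; the function names are ours.

```python
def have_message_bound(connections_per_peer, num_peers, num_pieces):
    # Each peer announces each piece it obtains to every connected peer,
    # so the total have-message count is approximated by this product.
    return connections_per_peer * num_peers * num_pieces

def request_message_lower_bound(num_peers, num_pieces, slices_per_piece=16):
    # Every slice of every piece must be requested at least once per peer;
    # endgame mode only adds duplicate requests on top of this.
    return num_peers * num_pieces * slices_per_piece
```

For the 1K peer, 1K piece scenario, the request count in Table 3.3 (18,316,898) indeed exceeds the lower bound of 1,024 × 1,024 × 16 = 16,777,216.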
3.7.2 Model Performance
To better understand how our BitTorrent model scales and affects simulator
performance, we conducted the following series of experiments (P2P Contribution 5).
In the first set shown in Figure 3.9, we plot the simulation execution time as a
function of the number of peers and the number of pieces. The number of peers
and pieces range from 128 to 1,024 by a power of 2, yielding 4 sets of 4 data points
each. We observe that increasing the number of peers for small (128) piece files
does not impact simulator performance significantly. However, as the number of
pieces grows, the slope of the execution time increases tremendously as the number
of peers increases. We attribute this behavior to the increased complexity in the
download side of a peer connection as a consequence of a large number of pieces
to consider. Additionally, by increasing the number of peers and pieces, the overall
event population increases, which leads to larger event list management overheads.
Figure 3.10: Model event rate as a function of the number of pieces and the
number of peers.

Figure 3.11: Model memory usage in MB as a function of the number of
pieces and the number of peers.
To verify this view, we plot the event rate as a function of the number of peers
and pieces in Figure 3.10. We observe that the event rate is highest (close to 100K
events per second) when both peers and pieces are small in number. However, in
the 1K peer, 1K piece case, we observe that the event rate has decreased to only
40K events per second because of more work per event, as well as higher event list
Figure 3.12: Simulated download completion times (seconds) for the 16,384
peer 4,096 piece scenario (this simulation run required 15.14 GB of RAM
and 59.66 hours to execute with an event rate of 35,179 events per second).
management overheads.
The memory usage across the number of peers and pieces is shown in Fig-
ure 3.11. Here, we observe that memory usage grows much slower for smaller piece
torrents than for larger ones. For example, the 1K peer, 128 piece scenario only
consumes 452 MB of RAM or 441 KB per peer, whereas the 128 peer, 1K piece sce-
nario consumes 289 MB or 2.3 MB per peer. This change in memory consumption
is attributed to two reasons. First, as the number of pieces grows, the availability of
pieces to select from grows as well. This in turn allows more requests to be simul-
taneously scheduled, which results in a larger peak event population, and increases
the overall event memory required to execute the simulation model. Increasing the
number of peers has a similar impact, in that it will also increase the event popula-
tion (and demand for request memory buffers), which raises the amount of memory
necessary to execute the simulation model.
In the last performance curve, we show the download completion times for
a very large 16K peer, 4K piece scenario. We note here that we observe a larger
population of “late” downloaders at each milestone. Observe that endgame and
download completion occur extremely close to each other. Thus, peers do not spend
a great deal of time in endgame mode overall. For simulator performance, this model
consumed 15.14 GB of RAM and required almost 60 hours to complete. At first
pass, 60 hours does appear to be a significant amount of time, but we observe that
it is orders of magnitude smaller than the measurement studies that require many
weeks or even months of peer/torrent data collection.
Finally, we report the completion of a simulation scenario with 128K peers, 128
pieces, 32 connections per peer, and a max_backlog of 8. The impact on memory
usage was significant. This scenario only consumes 8.15 GB of RAM or 67 KB
per peer, which points to how the interplay between pieces, peers, and requests
dramatically affects the underlying memory demands of the simulation model.
3.8 BitTorrent as a Streaming Protocol
With so much digital media content being transferred, it is natural to examine
P2P’s potential for streaming delivery. If content is streamed, the user can begin to
enjoy it sooner, and can evaluate its quality early on in order to preserve valuable
resources [108]. While many P2P protocols exist for streaming, none have achieved
the degree of performance, scalability, user-fairness, and popularity as BitTorrent
has for accomplishing time-insensitive mass-downloads. For this reason, we explore
modifying BitTorrent for streaming downloads (P2P Contribution 6).
In this section, we determine the potential of BitTorrent modifications BiToS
[108] and BASS [107]. BiToS uses a modified piece-picker algorithm, while BASS
augments the system with a dedicated streaming server. We simulate the two tech-
niques over a wide range of scenarios (using the simulator mentioned above). We
then analyze peer completion times from the simulation results to determine which
techniques are viable for streaming content with reasonable quality playback. We
then present the cost of these techniques in terms of total data delivered, and server
utilization at any point in time.
3.8.1 Related Work
P2P streaming is often accomplished using application layer multicast, where
an overlay network is constructed containing the participating nodes. The content
owner injects the stream into the overlay, where the nodes may consume it and
forward it to their children. The structure of the overlay is typically a tree, forest,
or mesh.
A multicast tree is the simplest structure. Each node receives content from
a single parent, and forwards it to its children. The height of this tree translates
into its latency, and the width translates into the number of bandwidth bottlenecks.
ZIGZAG [110] is an architecture composed of a clustering hierarchy and a multicast
tree of logarithmic height, and constant node degree. Overcast [113] also builds
a tree, but attempts to optimize a metric, such as bandwidth or latency, from all
nodes to the root. A common problem characteristic of multicast trees is their lack
of fault-tolerance. If a single node fails, it may disconnect portions of the tree,
rendering them useless until the failing node’s children can recognize the failure and
reconnect. Bayeux [115] (an architecture that leverages Tapestry [19]) attempts to
solve this problem using secondary pointers.
Traditional tree-based multicast is not well suited for P2P, as the burden of
duplicating and forwarding traffic is carried by a small subset of peers that are
interior nodes of the tree. This conflicts with the expectation that all nodes will
share this burden equally. SplitStream [109] splits the stream into stripes, each
delivered with a separate multicast tree. It attempts to create a structure where
interior nodes of one tree are leaf nodes in all the remaining trees, in order to fairly
distribute the forwarding burden. Other systems that use forests include Narada
[116] and PALS [117].
Fundamentally, forest overlays suffer from the same problems as tree overlays
[112], since a node in any stripe may fail. Like a forest, a mesh overlay allows for si-
multaneous downloads, but also allows parts of the file to come from perpendicular
nodes. If a node fails, other nodes can continue to receive content while recon-
necting to the overlay. However, a protocol is needed to locate missing content in
the network. Bullet [114], CollectCast [111], and DONet/CoolStreaming [106] are
examples of systems that use mesh overlays.
Under ideal conditions, application layer multicast works well for streaming
media. But, even with clever techniques to ensure performance, scalability, and
fault-tolerance, all these schemes lack user incentives. Users upload in good-faith
[118], and are not penalized if they choose not to contribute their resources to the
P2P network. This has sparked some interest in using BitTorrent for streaming,
since BitTorrent employs an incentive mechanism. Studied in this section are BiToS
[108] and BASS [107], both of which are streaming systems built around the Bit-
Torrent protocol.
3.8.2 BiToS
BiToS [108] is a BitTorrent derivative that imposes minimal changes to the
protocol’s piece-picker to allow for streaming. Since changes are only made to the
piece-picker, the modified client can still participate in swarms with unmodified
clients.
BiToS organizes needed pieces into two queues, the high-priority pieces and
the remaining pieces. Any piece that misses its playback deadline will be removed
from the queues and will no longer be considered for download, thus degrading the
video quality. With a probability of p, the earliest deadline piece of the high-priority
piece set is requested, and with a probability of 1− p, the rarest remaining piece is
requested (p can be fixed or dynamically assigned). The goal is to download pieces
in order as they are needed for playback, and occasionally download rare pieces to
make the peer an attractive trading partner as per BitTorrent’s incentive mechanism.
Thus, the different values of p affect the content quality (and the download time
always remains the same).
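A minimal sketch of this piece-picker logic follows; the data-structure and parameter names are ours, not those of the BiToS implementation.

```python
import random

def bitos_pick(high_priority, remaining, deadline, rarity, p=0.8, rng=random):
    """With probability p, request the earliest-deadline piece from the
    high-priority set; otherwise request the rarest piece from the
    remaining set.  `deadline` and `rarity` map pieces to their playback
    deadline and peer availability (lower = rarer)."""
    if high_priority and (not remaining or rng.random() < p):
        return min(high_priority, key=deadline.__getitem__)
    if remaining:
        return min(remaining, key=rarity.__getitem__)
    return None
```

Pieces that miss their playback deadline would simply be dropped from both sets before the next call.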
In [108], BiToS is simulated for a flash crowd of 400 peers downloading a 147
piece file (10 minutes at 500 Kbps), using a synthetic symmetric-bandwidth network.
It is shown that if p = 0.8 in this scenario, tolerable quality streaming in terms of
continuity index [106] can be achieved.
Through our experiments with BiToS, we can confirm its performance in the
above scenario. Further, if only a small percentage of the swarm is streaming, then
both their performance and the overall swarm performance remain very good. The
reason is that there is still good entropy throughout the swarm, and streaming
peers can still have non-streaming peers interested in them, since they all have
different perspectives regarding rare pieces. However, in general, for larger swarms,
larger files, higher bit rates, or different values of p, the number of piece deadlines
missed by BiToS increases astronomically. As a result, the playback quality dete-
riorates. Due to its inability to scale and lack of robustness, BiToS is ill-suited for
streaming delivery when high-quality playback is desired.
3.8.3 BASS
BitTorrent Assisted Streaming System (BASS) [107] is a hybrid server/P2P
streaming system for large-scale video-on-demand (VoD). In BASS, clients can
stream via BitTorrent connections and media servers simultaneously. File pieces
are downloaded from a server sequentially, with the exception of pieces already ob-
tained using BitTorrent. Similarly, the BitTorrent piece-picker will not choose to
download pieces scheduled prior to the current playback point, as they have already
been obtained from a server. In [107], a P2P contribution rate of 34% has been
reported for a scenario of 350 peers distributing a 692 piece file (at 1,024 Kbps).
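The server-side fill rule can be sketched as follows. This is a simplification at slice granularity, and the helper name is illustrative rather than taken from BASS.

```python
def next_server_slice(have, playback_point, total_slices):
    """The server streams the file sequentially from the current playback
    point onward, skipping slices the client already obtained via
    BitTorrent; returns None once nothing past the point is missing."""
    for s in range(playback_point, total_slices):
        if s not in have:
            return s
    return None
```

Symmetrically, the BitTorrent piece-picker never selects data scheduled before the playback point, since that data has already arrived from a server.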
For our model to simulate BASS, two new entities needed to be added. The
first is a streaming server. A streaming server is an LP that represents a highly-
capable peer, which answers all requests in FIFO order. With the exception of using
the same slice/piece scheme, a server does not run the BitTorrent protocol (it does
not choke peers). The implementation allows for any number of streaming servers
in the system. It should be noted that having these servers in an environment with
malicious or selfish peers would require new security considerations. The second
entity added to the model is a streaming peer. This peer is a BitTorrent peer, but is
also responsible for keeping track of buffer state and playback deadlines. Streaming
peers may co-exist and cooperate with non-streaming peers in a simulation run.
Algorithm 6 demonstrates the minor modifications required at birth to accommodate
streaming peers.
A streaming peer employs a double-buffering scheme consisting of a playback
Algorithm 6: BIRTH Event
if streaming peer then
    Initialize buffers;
    Schedule PLAYBACK now;
// All peers
Initiate BitTorrent protocol;
buffer and a look-ahead buffer (see Figure 3.13). The playback buffer contains all
video data that will be played next. If this buffer is not full when required, a re-
buffer is triggered, causing the remaining buffer slices to be requested from a server.
In addition, all remaining slices in the look-ahead buffer are also requested (see
Algorithm 7). The purpose of the look-ahead buffer is to download pieces that have
not missed their deadlines yet, but are needed soon. The chance that these pieces
will be downloaded via BitTorrent is small, and the goal is to reduce the amount
of re-buffers necessary, and the total buffering time. Data coming from a server
is treated in the same way as data coming from a BitTorrent peer. This forces a
peer to send out HAVE messages for pieces downloaded from a server. Further, the
BitTorrent piece-picker does not need to be modified, since a peer will not request
data that it already has, regardless of where it came from.
In the event of a re-buffer, the client will play the content as soon as it becomes
available. Whenever content is successfully played, the next playback is scheduled
for the point in the future when the current playback finishes. To initiate the first
buffering and playback, a streaming peer has a playback event scheduled at its birth
time. Clearly, the size of the buffers impacts both the QoS and the distribution
costs (larger buffers result in less P2P contribution but better QoS).
Figure 3.13: This figure demonstrates our double-buffering scheme. In this
example, the playback buffer is 5 slices and the look-ahead buffer is 15
slices.
Algorithm 7: PLAYBACK Event
// Check playback buffer
for i = 1 to playback buffer.size do
    if playback buffer[i].missing then
        Request playback buffer[i].piece;
        Set missed playback;
// Check look-ahead buffer
for j = 1 to lookahead buffer.size do
    if lookahead buffer[j].missing then
        Request lookahead buffer[j].piece;
// Handle the successful or unsuccessful playback
if missed playback then
    if Last playback was successful then
        Store first unsuccessful time;
else
    if Last playback was unsuccessful then
        Increment rebuffers;
        Update buffer time;
        Clear first unsuccessful time;
    // Playback was successful
    Update both buffers;
    Schedule next PLAYBACK event;
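Algorithm 7 can be rendered as runnable code roughly as follows. This is a sketch: the state names are ours, and buffer shifting and event scheduling are elided.

```python
from dataclasses import dataclass, field

@dataclass
class StreamingPeer:
    playback: list                 # True = slice already obtained
    lookahead: list
    now: float = 0.0
    last_ok: bool = True
    first_fail: float = 0.0
    rebuffers: int = 0
    buffer_time: float = 0.0
    requests: list = field(default_factory=list)

def playback_event(p):
    # Request every missing slice in both buffers (a stand-in for the
    # server requests in Algorithm 7).
    missing = [i for i, s in enumerate(p.playback) if not s]
    p.requests += missing
    p.requests += [len(p.playback) + j
                   for j, s in enumerate(p.lookahead) if not s]
    if missing:
        if p.last_ok:
            p.first_fail = p.now   # entering a re-buffer
        p.last_ok = False
    else:
        if not p.last_ok:          # re-buffer just ended
            p.rebuffers += 1
            p.buffer_time += p.now - p.first_fail
        p.last_ok = True
        # On success the real model also shifts both buffers and
        # schedules the next PLAYBACK event.
```

Running two events back to back, one with a missing playback slice and one after it arrives, accumulates exactly one re-buffer and the elapsed buffering time.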
3.8.3.1 Simulation Results
In this section, we demonstrate system performance, cost savings, and server
utilization (average and peak) for several streaming scenarios using our model of
BASS.
In the following experiments, the distributed video consists of 512 pieces at
700 Kbps (approximately 24 minutes). We test swarms of 16,384 (flash crowds of
1,024 and 2,048) and 32,768 (flash crowds of 2,048 and 4,096) peers, where the flash
crowds consist of peers that are not streaming, consistent with the subscription
model for new content. Through prior experimentation, we determined the flash
crowds should be at least 1,024 peers. If the flash crowd is too small, there is not
enough content in the network to satisfy the streamers, and buffering times are
much higher. Further, our results suggest that the flash crowd does not need to be
very large, since larger flash crowds do not impact the buffering times, the number
of re-buffering events, or the P2P contribution. The peers in each swarm arrive for
approximately two hours.
Figure 3.14: This graph demonstrates the average buffer times (seconds)
experienced by streaming peers in the simulation runs.

Figure 3.15: This graph demonstrates the average number of buffer events
experienced by streaming peers in the simulation runs. Note that the first
buffer event is mandatory for all streaming peers.
As mentioned above, the main discriminating variable is the size of the look-
ahead buffer. In our experiments, we range it from 500 to 3,000 slices by increments
of 500 (note that the entire file consists of 8,192 slices). If the buffer is too small,
the P2P contribution is very high; however, the QoS (buffering time) is very poor.
Further, if the buffer is very large, we see a drastic decrease in P2P contribution, with
Figure 3.16: This graph demonstrates the percent of bandwidth contributed
by the P2P network for the file distribution in the simulation runs.
only a marginal increase in quality. In each run, we see that a look-ahead buffer of size
500 results in an average buffering time of 3.3 seconds (1.5 average buffer events per
user) with a P2P contribution of 78% (see Figures 3.14, 3.15, and 3.16 respectively).
While this sounds good, 3.3 seconds may be too long for some applications. We
can achieve an average buffering time of under 2 seconds (1.3 average buffer events
per user) with up to a 73% contribution from the P2P network. We can lower this
time by 0.4 seconds, but we lower the P2P contribution by 16%. Depending on the
distributor’s budget and needs, the size of the look-ahead buffer can be established.
For an average buffering time of 1.1 seconds, we can still achieve a P2P contribution
of 53%. Thus, using BitTorrent to assist a CDN or streaming architecture can
significantly lower transit costs, while achieving an excellent QoS for users.
3.8.3.2 QoS
Although the average user buffer times and number of buffer events are very
good, we would like to know how all peers fare. Consider the case of 16,384 peers, a
flash crowd of 1,024, and a look-ahead buffer of 1,500 slices. The average buffering
time is 1.3 seconds (with a standard deviation of 2.6 seconds). The histogram
in Figure 3.17 shows that most peers do indeed experience good performance, with
99.76% of peers experiencing a buffer time of under 3 seconds. Similarly, on average,
Simulated Peers   Avg. Buffer Time (s)   Avg. Buffers   P2P Contribution   Avg. CDN Util. (MBps)   Peak CDN Util. (MBps)
16,384            1.8                    1.3            73%                104                     145
32,768            1.5                    1.3            73%                158                     312
65,536            1.4                    1.3            73%                314                     617
131,072           1.4                    1.3            73%                633                     1,228

Table 3.4: For a flash crowd of 2,048 peers and a look-ahead buffer of
1,000 slices, this table shows the performance of several large swarms
(16,384 peers to 131,072 peers).
each peer requires only 1.3 buffering events (with a standard deviation of 0.6 buffer
events), with 99.15% of peers requiring at most 2 buffer events (including the initial
mandatory buffering, see Figure 3.18).
Figure 3.17: This histogram of the buffering times demonstrates that most
streaming peers experience a buffering time of under 3 seconds.
Overall, streaming quality can be measured by a QoS metric called adjusted
frustration time [129]. Adjusted frustration time is defined as the total sum of
buffering times, plus a 2 second penalty for every re-buffering event (this metric
is used as part of the StreamQ user experience rating system, where any adjusted
frustration time of under 6 seconds is given a grade of A+). Figure 3.19 is a his-
togram of the swarm’s adjusted frustration times (the average is 1.8 seconds, with
a standard deviation of 3.1 seconds, and 99.12% of users experience an adjusted
Figure 3.18: This histogram of the number of buffering events demonstrates
that most streaming peers experience few re-buffers.

Figure 3.19: This histogram of the adjusted frustration times demonstrates
that most streaming peers experience a high QoS.
frustration time of under 3.6 seconds), indicating that the overall QoS is very good.
The StreamQ user performance ratings for this run can be found in Table 3.5.
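The metric and the two grade thresholds stated above can be expressed directly. This is a sketch; the intermediate StreamQ grade boundaries are not specified here, so only the A+ and F thresholds are encoded.

```python
def adjusted_frustration(total_buffer_time_s, rebuffer_events):
    # Sum of all buffering times plus a 2 second penalty per
    # re-buffering event, per the StreamQ metric [129].
    return total_buffer_time_s + 2.0 * rebuffer_events

def streamq_extremes(aft_seconds):
    # Only the thresholds given in the text: A+ below 6 seconds, F at
    # 27 seconds or more; grades in between are left unresolved.
    if aft_seconds < 6.0:
        return "A+"
    if aft_seconds >= 27.0:
        return "F"
    return None
```

The swarm average of 1.8 seconds, for instance, sits comfortably inside the A+ band even after a 2 second re-buffer penalty.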
3.8.3.3 CDN Utilization
Figures 3.20 and 3.21 show the average and peak server utilizations for each
scenario. We see that the peak never exceeds 390 MBps, and the averages are usually
around half the peaks. Further, for the simulated scenarios, the server utilization
scales roughly linearly with the size of the swarm. Thus, if a content provider can
Grade   Frequency
A+      16,245
A       21
B+      15
B       13
C+      18
C       15
D+      5
D       10
F       42

Table 3.5: This table shows how many streaming peers received each grade
of the StreamQ performance rating system. Note that a grade of F is given
when a peer's adjusted frustration time is 27 seconds or more.
estimate swarm sizes, an excellent estimate of server requirements can be made.
Figure 3.20: This graph demonstrates the average server utilizations over
the simulation runs.
Most CDNs and ISPs use a method called burstable billing [130] to charge their
customers. This method charges based on a regular sustained utilization, allowing
brief usage peaks to occasionally exceed the threshold without penalty. Typically,
customers are billed at the 95th percentile of their usage. This method is beneficial
for customers whose usage is fairly steady. If usage is bursty or unpredictable, a
flat-rate system that charges per byte (or GB) delivered may be the best option.
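The two billing schemes can be contrasted with a small sketch. The nearest-rank percentile method below is one common convention; actual service contracts vary, as noted above.

```python
import math

def burstable_bill_mbps(samples_mbps, percentile=95):
    """Nearest-rank percentile billing: sort the periodic usage samples
    and bill at the value below which `percentile` percent fall, so the
    top (100 - percentile)% of usage peaks are forgiven."""
    ordered = sorted(samples_mbps)
    rank = math.ceil(len(ordered) * percentile / 100)
    return ordered[rank - 1]

def flat_rate_cost(total_gb_delivered, dollars_per_gb=0.10):
    # Flat per-GB alternative, at the $0.10 per GB figure used in Table 3.6.
    return total_gb_delivered * dollars_per_gb
```

For steady usage the 95th-percentile figure tracks the typical rate closely, whereas bursty usage can push it toward the peak, which is when the flat per-GB scheme becomes attractive.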
Figure 3.21: This graph demonstrates the peak server utilizations over the
simulation runs.

Simulated Peers   95th Percentile CDN Util. (MBps)   Approximate Distribution Cost (flat)   Cost Without P2P Network (flat)
16,384            112                                $56.62                                 $209.72
32,768            224                                $113.25                                $419.43
65,536            442                                $226.49                                $838.86
131,072           922                                $452.99                                $1,677.72

Table 3.6: For a flash crowd of 2,048 peers and a look-ahead buffer of
1,000 slices, this table shows the 95th percentiles and approximate
flat-rate distribution costs for several large swarms (16,384 peers to
131,072 peers), at $0.10 per GB delivered.
Table 3.6 shows the 95th percentiles and approximate flat-rate distribution costs for
the scenarios presented in Table 3.4. The 95th percentile costs are not in the table
since service contracts vary from customer to customer.
We confirm the claims published in [107] that most CDN contribution occurs
in the first pieces of the file, with very little towards the end. This is due to the
fact that BitTorrent (which usually employs a rarest piece first algorithm) does not
have a chance to obtain early pieces because they are needed too soon, while the
later pieces have more time before playback, and thus more opportunities to be
downloaded via BitTorrent. Figure 3.22 shows how many times a slice (16 KB) is
downloaded from the server, for each piece (256 KB) of the file.
Figure 3.22: This graph shows the distribution of slices delivered by the
CDN throughout the file, for the 16,384 peer, 1,024 flash crowd, and 2,000
slice look-ahead scenario.
3.8.3.4 Video Bit Rate
To this point, all results have been for a bit rate of 700 Kbps. Now, we
present results from simulations at 1.5 Mbps (a 12 minute video), and show that
with a nominal increase to the look-ahead parameter, we can achieve a similar QoS
and CDN requirements. Table 3.7 shows that an increase of 500 slices to the look-
ahead buffer will allow us to achieve the same QoS as the 700 Kbps scenario (this is
the point where both curves begin to converge to 1.1 seconds and 1.3 buffer events).
Table 3.8 shows the P2P contribution and the server utilizations for these same
scenarios. We see that to achieve the same QoS, we require more CDN involvement
for the higher bit rate, which is what we expected since the only difference is that
now all playback deadlines occur sooner.
When the look-ahead buffer is 500 slices for all scenarios (the worst case for
both bit rates), the average buffering time for the 1.5 Mbps video is approximately
twice that of the 700 Kbps video (a maximum of 7.2 seconds compared to 3.6 sec-
onds), and there are on average 0.6 more buffer events per user. For any look-ahead
buffer size, the P2P contribution is consistently 10 to 13% less for the higher bit
rate. While the average and peak server utilizations appear to go down occasionally,
they typically increase for the higher bit rate by 30 Mbps and 60 Mbps respectively
Simulated Peers   Bit Rate (Kbps)   Look-Ahead (slices)   Avg. Buffer Time (s)   Avg. Buffers
16,384            700               1,500                 1.4                    1.3
16,384            1,500             2,000                 1.4                    1.3
32,768            700               1,500                 1.3                    1.3
32,768            1,500             2,000                 1.4                    1.3

Table 3.7: For a flash crowd of 2,048 peers, this table shows the
appropriate size of the look-ahead buffer to achieve a similar QoS for
different bit rates (700 Kbps and 1.5 Mbps) and swarm sizes (16,384 peers
and 32,768 peers).
Simulated Peers   Bit Rate (Kbps)   Look-Ahead (slices)   P2P Contribution   Avg. CDN Util. (MBps)   Peak CDN Util. (MBps)
16,384            700               1,500                 68%                104                     155
16,384            1,500             2,000                 50%                145                     239
32,768            700               1,500                 68%                188                     332
32,768            1,500             2,000                 51%                143                     225

Table 3.8: This table shows the differences in P2P contribution and CDN
utilization for the scenarios presented in Table 3.7.
for the 16,384 peer swarms, and by 60 Mbps and 90 Mbps respectively for the 32,768
peer swarms.
These results show that while P2P contribution decreases and server utilization
increases (an increase in overall CDN involvement), we can achieve the same QoS
at higher bit rates as with lower ones.
3.9 Summary
In this chapter, we have discussed using P2P overlay networks for the delivery
of data. P2P networks show great promise in their ability to distribute content
to extremely large audiences without overwhelming origin servers, while significantly
reducing distributor transit costs. Specifically, we studied the BitTorrent protocol
because of its performance, scalability, user-fairness, popularity, and potential for
legal content distribution. We are interested in studying large, television-size
audiences, for which measurement data does not exist. Swarms of this size have never
existed in the wild, and data is not available even for the largest swarms that have
existed, because that data is either proprietary or belongs to swarms that were
distributing content illegally and do not publish records. We must therefore
simulate these swarms, at a scale much larger than in any previous simulation effort.
We have constructed a discrete-event simulator (see Section 3.5) and Internet
topology model (see Section 3.6) that realistically capture the characteristics of home
broadband Internet service. We carefully abstract away details not pertinent to
Internet simulations in order to achieve our desired scale and degree of accuracy.
This is mostly done by estimating low-level details, and routing traffic based on
Internet structures rather than complete Internet adjacencies.
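As one illustration of this kind of low-level estimation (an assumption on our part, not necessarily the exact model the simulator uses), steady-state TCP throughput can be approximated in closed form from path properties rather than simulated packet by packet, following the Mathis model [69]:

```python
import math

def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state TCP throughput in bits per second,
    following the Mathis model: throughput ~ (MSS / RTT) * (C / sqrt(p)),
    with C ~ sqrt(3/2). Replaces per-packet simulation with a closed form."""
    c = math.sqrt(1.5)
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

# Example: a 1460-byte MSS over a 100 ms path with 1% loss yields roughly
# 1.4 Mbps, on the order of a home broadband connection.
estimate = tcp_throughput_bps(1460, 0.1, 0.01)
```

An abstraction like this is what allows a flow-level simulator to scale to hundreds of thousands of peers: per-flow rates come from a formula, not from simulating every packet.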
Lastly, we have shown BitTorrent’s capabilities as a protocol for streaming
(real-time) data delivery (see Section 3.8) in an emerging market (some companies
include: PPLive [119], PPStream [120], MySee [121], Roxbeam [122], UUSEE [123],
BitTorrent.com [131], Verisign Kontiki [133], ITIVA [132], Joost [124], Pando [125],
and Red Swoosh [126]). Specifically, if a small fraction of the swarm wishes to stream
the content, it can do so with a high QoS by simply modifying the piece-picker to
request data in order. However, piece-picker modifications do not scale well. When
many peers are streaming, we show that the distributor can save significantly on
data transit costs by using BitTorrent along with a server or CDN infrastructure.
Specifically, we have shown that the distributor can save 73% of its transit costs
while providing users with a viewing experience requiring under 2 seconds of total
buffering on average (an A+ using the StreamQ rating system [129]).
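The in-order piece-picker modification described above can be sketched as follows. This is a minimal illustration of the idea, not BitTorrent's actual source; the function and field names are our own assumptions.

```python
def pick_piece_streaming(have, neighbor_bitfields, playback_position):
    """In-order piece picker for streaming: instead of rarest-first,
    return the lowest-index missing piece at or after the playback
    position that at least one neighbor holds, or None if nothing
    useful is available.

    have               -- list of bools, pieces this peer already has
    neighbor_bitfields -- list of bool lists, one per connected neighbor
    playback_position  -- index of the next piece the player needs
    """
    total = len(have)
    for idx in range(playback_position, total):
        if not have[idx] and any(bf[idx] for bf in neighbor_bitfields):
            return idx  # earliest-deadline piece someone can serve
    return None
```

Because every streaming peer asks for the same early pieces, this policy destroys the piece diversity that rarest-first creates, which is exactly why such modifications do not scale to many simultaneous streamers without server or CDN support.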
With our advancements, we can study other protocols and create new ones
that are more efficient for users and distributors, and impose less of a burden on
ISPs.
CHAPTER 4
Discussion and Conclusions
In this thesis, we have discussed two fundamental problems of distributed networks.
The first problem deals with locating mobile data/objects in a wireless sensor net-
work (see Chapter 2). We show that using a simple centralized directory is a poor
solution to the problem because it is not locality-sensitive. A better solution uses
a distributed directory, where data/objects do not have a static home. This allows
queries to be answered quickly regardless of the whereabouts of the querying and
storing nodes. This is done through the use of efficient find and move operations.
A sparse cover is the underlying data structure from which a distributed directory
is built. Specifically, a hierarchy of increasing-radius covers is used to construct
regional matchings, which contain read and write sets for all network nodes (refer
to Section 2.4 for formal definitions). As a directory contains only two operations
(find and move), its performance is measured by Stretch_find and Stretch_move,
which are determined by the structural quality (radius and degree) of the sparse
covers used to construct it.
We first proved a structural lower bound for sparse covers of arbitrary graphs
in Section 2.5. Specifically, there exists a network with n nodes, and constrained
by the locality parameter γ and the maximum tolerable degree c, such that when
clustered, there must exist a cluster whose radius is Ω(γ log log_c n), regardless of
the clustering technique (see Theorem 2.5.2). This proves that for arbitrary graphs,
there is an inherent tradeoff in the radius and degree, and these metrics cannot be
simultaneously optimized. The best known construction algorithm for these graphs
can achieve a radius of O(γ log n) and a degree of O(log n) [34], which translates into
a distributed directory with Stretch_find = O(log^2 n) and Stretch_move = O(log^2 n).
In light of the above tradeoff, we studied construction techniques for special types
of graphs including planar, unit disk, and H-minor free graphs.
In Section 2.7, we presented an algorithm for clustering κ-path separable
graphs that achieves a radius of O(γ) and degree of O(log n). This translates into
a distributed directory with Stretch_find = O(log n) and Stretch_move = O(log n), a
savings of a logarithmic factor in each metric. In Section 2.8, we presented an optimal
algorithm for clustering planar graphs that achieves a radius of O(γ) and a degree
of O(1). This translates into a distributed directory with Stretch_find = O(1) and
Stretch_move = O(log n), a savings of log^2 n in Stretch_find and log n in Stretch_move.
Finally, in Section 2.9, we showed how our planar algorithm can be used to construct
optimal covers for unit disk graphs (and other graphs with constant-stretch
planar spanners) with a radius of O(γ) and a degree of O(1), once again saving log^2 n
in Stretch_find and log n in Stretch_move for the distributed directory operations.
Our work has immediate implications for the efficiency of other important data
structures used to solve fundamental distributed problems, such as the construction
of compact routing schemes and synchronizers.
The second problem deals with the retrieval (delivery) of digital content in a
complex P2P overlay network (see Chapter 3). P2P networks show great promise in
their ability to distribute content to extremely large audiences without overwhelming
origin servers, while significantly reducing distributor transit costs. Specifically, we
studied the BitTorrent protocol because of its performance, scalability, user-fairness,
popularity, and potential for legal content distribution. We are interested in studying
large television-size audiences, for which measurement data does not exist. Swarms
of this size have never existed in the wild, and data is not available even for the
largest swarms that have existed, because that data is either proprietary or belongs
to swarms that were distributing content illegally and do not publish records. We
must therefore simulate these swarms, at a scale much larger than in any previous
simulation effort.
We have constructed a discrete-event simulator (see Section 3.5) and Internet
topology model (see Section 3.6) that realistically capture the characteristics of home
broadband Internet service. We carefully abstract away details not pertinent to
Internet simulations in order to achieve our desired scale and degree of accuracy.
This is mostly done by estimating low-level details, and routing traffic based on
Internet structures rather than complete Internet adjacencies.
Lastly, we have shown BitTorrent’s capabilities as a protocol for streaming
(real-time) data delivery. Specifically, if a small fraction of the swarm wishes to
stream the content, it can do so with a high QoS by simply modifying the piece-picker
to request data in order. However, piece-picker modifications do not scale well.
When many peers are streaming, we show that the distributor can save significantly
on data transit costs by using BitTorrent along with a server or CDN infrastructure.
Specifically, we have shown that the distributor can save 73% of its transit costs
while providing users with a viewing experience requiring under 2 seconds of total
buffering on average (an A+ using the StreamQ rating system [129]).
The contributions presented in this thesis are innovative and significantly im-
prove data structures and techniques for data access and retrieval in distributed
networks.
LITERATURE CITED
[1] P. Zhang, C. Sadler, S. Lyon, and M. Martonosi. Hardware Design Experiences in ZebraNet. In Proc. ACM Conference on Embedded Networked Sensor Systems, Baltimore, MD, November 2004.
[2] T. Liu, C. Sadler, P. Zhang, and M. Martonosi. Implementing Software on Resource-Constrained Mobile Sensors: Experiences with Impala and ZebraNet. In Proc. International Conference on Mobile Systems, Applications, and Services, Boston, MA, June 2004.
[3] P. Juang, H. Oki, Y. Wang, M. Martonosi, L.S. Peh, and D. Rubenstein. Energy-Efficient Computing for Wildlife Tracking: Design Tradeoffs and Early Experiences with ZebraNet. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[4] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A Survey on Sensor Networks. In IEEE Communications Magazine, 37(8):102–114, August 2002.
[5] D. Braginsky and D. Estrin. Rumor Routing Algorithm for Sensor Networks. In Proc. ACM International Workshop on Wireless Sensor Networks and Applications, Atlanta, Georgia, 2002.
[6] Y. Yu, R. Govindan, and D. Estrin. Geographical and Energy Aware Routing: A Recursive Data Dissemination Protocol for Wireless Sensor Networks. Technical Report, UCLA/CSD-TR-01-0023, May 2001.
[7] D. Estrin, R. Govindan, J.S. Heidemann, and S. Kumar. Next Century Challenges: Scalable Coordination in Sensor Networks. In Mobile Computing and Networking, pages 263–270, 1999.
[8] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks. In Mobile Computing and Networking, pages 56–67, 2000.
[9] S. Ratnasamy, D. Estrin, R. Govindan, B. Karp, S. Shenker, L. Yin, and F. Yu. Data-Centric Storage in Sensornets. In ACM SIGCOMM Computer Communication Review, 33(1):137–142, January 2003.
[10] N. Chang and M. Liu. Revisiting the TTL-Based Controlled Flooding Search: Optimality and Randomization. In Proc. International Conference on Mobile Computing and Networking, pages 85–99, New York, NY, 2004. ACM Press.
[11] B. Krishnamachari and J. Ahn. Optimizing Data Replication for Expanding Ring-Based Queries in Wireless Sensor Networks. Technical Report, USC Computer Engineering, October 2005.
[12] N. Sadagopan, B. Krishnamachari, and A. Helmy. Active Query Forwarding in Sensor Networks. IEEE SNPA Workshop, 2003.
[13] X. Liu, Q. Huang, and Y. Zhang. Combs, Needles, Haystacks: Balancing Push and Pull for Discovery in Large-Scale Sensor Networks. In Proc. International Conference on Embedded Networked Sensor Systems, Baltimore, MD, 2004.
[14] S. Madden, M. Franklin, J. Hellerstein, and W. Hong. TAG: A Tiny Aggregation Service for Ad-Hoc Sensor Networks. In ACM SIGOPS Operating Systems Review, pages 131–146, 2002.
[15] N. Trigoni, Y. Yao, A.J. Demers, J. Gehrke, and R. Rajaraman. Hybrid Push-Pull Query Processing for Sensor Networks. In GI Jahrestagung (2), pages 370–374, 2004.
[16] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proc. ACM SIGCOMM, pages 149–160, San Diego, CA, 2001.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A Scalable Content-Addressable Network. In Proc. ACM SIGCOMM, pages 161–172, San Diego, CA, 2001.
[18] A. Rowstron and P. Druschel. Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-To-Peer Systems. In Proc. IFIP/ACM International Conference on Distributed Systems Platforms, pages 329–350, Heidelberg, Germany, November 2001.
[19] B.Y. Zhao, J.D. Kubiatowicz, and A.D. Joseph. Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing. Technical Report, UCB/CSD-01-1141, April 2001.
[20] G. S. Manku, M. Bawa, and P. Raghavan. Symphony: Distributed Hashing in a Small World. In Proc. of Symposium on Internet Topologies and Systems, pages 127–140, 2003.
[21] K. P. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica. The Impact of DHT Routing Geometry on Resilience and Proximity. In Proc. of ACM SIGCOMM, pages 381–394, 2003.
[22] M. Castro, P. Druschel, Y. C. Hu, and A. I. T. Rowstron. Topology-Aware Routing in Structured Peer-to-Peer Overlay Networks. In Proc. of International Workshop on Future Directions in Distributed Computing, pages 103–107, 2003.
[23] H. Zhang, A. Goel, and R. Govindan. Incrementally Improving Lookup Latency in Distributed Hash Table Systems. In Proc. of ACM SIGMETRICS, pages 114–125, June 2003.
[24] Ittai Abraham and Cyril Gavoille. Object Location Using Path Separators. In Proc. ACM Symposium on Principles of Distributed Computing (PODC), pages 188–197, 2006.
[25] Ittai Abraham, Cyril Gavoille, Andrew Goldberg, and Dahlia Malkhi. Routing in Networks with Low Doubling Dimension. In Proc. International Conference on Distributed Computing Systems (ICDCS), 2006.
[26] Ittai Abraham, Cyril Gavoille, and Dahlia Malkhi. Compact Routing for Graphs Excluding a Fixed Minor. In Proc. International Conference on Distributed Computing (DISC), pages 442–456, 2005.
[27] Ittai Abraham, Cyril Gavoille, Dahlia Malkhi, Noam Nisan, and Mikkel Thorup. Compact Name-Independent Routing with Minimum Stretch. In Proc. SPAA, pages 20–24, 2004.
[28] Ittai Abraham, Cyril Gavoille, Dahlia Malkhi, and Udi Wieder. Strongly-Bounded Sparse Decompositions of Minor Free Graphs. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'07), San Diego, California, June 2007. Also appears as Technical Report MSR-TR-2006-192, Microsoft Research, December 2006.
[29] M. Arias, L. Cowen, K. Laing, R. Rajaraman, and O. Taka. Compact Routing with Name Independence. In Proc. ACM Symposium on Parallel Algorithms and Architectures, pages 184–192, 2003.
[30] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics. McGraw-Hill, 1st edition, 1998.
[31] Baruch Awerbuch. Complexity of Network Synchronization. Journal of the ACM, 32(4), 1985.
[32] Baruch Awerbuch, Shay Kutten, and David Peleg. On Buffer-Economical Store-and-Forward Deadlock Prevention. In INFOCOM, pages 410–414, 1991.
[33] Baruch Awerbuch and David Peleg. Network Synchronization with Polylogarithmic Overhead. In Proc. IEEE Symposium on Foundations of Computer Science, pages 514–522, 1990.
[34] Baruch Awerbuch and David Peleg. Sparse Partitions (extended abstract). In IEEE Symposium on Foundations of Computer Science, pages 503–513, 1990.
[35] Baruch Awerbuch and David Peleg. Online Tracking of Mobile Users. In Proc. ACM SIGCOMM Symposium on Communication Architectures and Protocols, 1991.
[36] Baruch Awerbuch and David Peleg. Online Tracking of Mobile Users. Journal of the ACM, 42(5):1021–1058, 1995.
[37] Brenda S. Baker. Approximation Algorithms for NP-Complete Problems on Planar Graphs. Journal of the ACM, 41(1):153–180, 1994.
[38] Costas Busch, Ryan LaFortune, and Srikanta Tirthapura. Improved Sparse Covers for Graphs Excluding a Fixed Minor. Technical Report TR 06-16, Department of Computer Science, Rensselaer Polytechnic Institute, November 2006.
[39] Greg N. Frederickson and Ravi Janardan. Efficient Message Routing in Planar Networks. SIAM Journal on Computing, 18(4):843–857, 1989.
[40] Cyril Gavoille. Routing in Distributed Networks: Overview and Open Problems. SIGACT News, 32(1):36–52, 2001.
[41] Cyril Gavoille and David Peleg. Compact and Localized Distributed Data Structures. Distributed Computing, 16(2-3):111–120, 2003.
[42] Philip Klein, Serge A. Plotkin, and Satish Rao. Excluded Minors, Network Decomposition, and Multicommodity Flow. In Proc. 25th Annual ACM Symposium on Theory of Computing (STOC), pages 682–690, 1993.
[43] Goran Konjevod, Andrea W. Richa, and Donglin Xia. Optimal-Stretch Name-Independent Compact Routing in Doubling Metrics. In PODC '06: Proceedings of the Twenty-Fifth Annual ACM Symposium on Principles of Distributed Computing, pages 198–207, Denver, Colorado, USA, 2006.
[44] Goran Konjevod, Andrea W. Richa, and Donglin Xia. Optimal Scale-Free Compact Routing Schemes in Doubling Networks. In SODA '07: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, 2007.
[45] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers Inc., 1996.
[46] David Peleg. Distance-Dependent Distributed Directories. Information and Computation, 103(2), 1993.
[47] David Peleg. Distributed Computing: A Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[48] David Peleg and Eli Upfal. A Trade-Off Between Space and Efficiency for Routing Tables. Journal of the ACM, 36(3), 1989.
[49] Neil Robertson and Paul D. Seymour. Graph Minors. V. Excluding a Planar Graph. Journal of Combinatorial Theory, Series B, 41:92–114, 1986.
[50] Neil Robertson and Paul D. Seymour. Graph Minors. XVI. Excluding a Non-Planar Graph. Journal of Combinatorial Theory, Series B, 89(1):43–76, 2003.
[51] Lior Shabtay and Adrian Segall. Low Complexity Network Synchronization. In WDAG '94: Proceedings of the 8th International Workshop on Distributed Algorithms, pages 223–237, London, UK, 1994.
[52] Mikkel Thorup. Compact Oracles for Reachability and Approximate Distances in Planar Digraphs. Journal of the ACM, 51(6):993–1024, 2004.
[53] Mikkel Thorup and Uri Zwick. Compact Routing Schemes. In Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–10, 2001.
[54] K. Alzoubi, X. Li, Y. Wang, P. Wan, and O. Frieder. Geometric Spanners for Wireless Ad Hoc Networks. In IEEE Transactions on Parallel and Distributed Systems, 14(4):408–421, 2003.
[55] Costas Busch, Ryan LaFortune, and Srikanta Tirthapura. Improved Sparse Covers for Graphs Excluding a Fixed Minor. In Proc. of ACM Symposium on Principles of Distributed Computing, Portland, OR, August 2007.
[56] Ryan LaFortune. A Structural Lower Bound for Sparse Covers. Rensselaer Polytechnic Institute, 2006.
[57] C. D. Carothers, R. LaFortune, W.D. Smith, and M. R. Gilder. A Case Study in Modeling Large-Scale Peer-to-Peer File-Sharing Networks using Discrete-Event Simulation. In Proceedings of the International Mediterranean Modeling Multiconference, pages 617–624, Barcelona, Spain, October 2006.
[58] Ryan LaFortune, Christopher Carothers, William Smith, and Michael Hartman. An Abstract Internet Topology Model for Simulating Peer-to-Peer Content Distribution. In Principles of Advanced and Distributed Simulation, San Diego, CA, June 2007.
[59] TJ Giuli and Mary Baker. Narses: A Scalable Flow-Based Network Simulator. ArXiv Computer Science e-prints, CS0211024, November 2002.
[60] Nielsen Media – Home Page, 2006. http://www.nielsenmedia.com/dmas.html.
[61] A. Parker. P2P in 2005. http://www.cachelogic.com/research/2005_slide01.php.
[62] John B. Horrigan. "Home Broadband Adoption 2006". Pew Internet & American Life Project, May 2006.
[63] Jong S. Ahn and Peter B. Danzig. Packet Network Simulation: Speedup and Accuracy Versus Timing Granularity. ACM Transactions on Networking (TON), Volume 4, Number 5, October 1996.
[64] BitTorrent – Home Page, 2006. http://www.bittorrent.org.
[65] CAIDA – Home Page, 2006. http://www.caida.org.
[66] A. Legout, G. Urvoy-Keller, and P. Michiardi. Understanding BitTorrent: An Experimental Perspective. Technical Report, INRIA, Eurecom, France, November 2005.
[67] A. Legout, N. Liogkas, E. Kohler, and L. Zhang. Cluster and Sharing Incentive in BitTorrent Systems. Technical Report #inria-00112066, version 1, INRIA, Eurecom, France, November 21, 2006.
[68] Lumeta – Research Mapping Home Page, 2006. http://www.lumeta.com/research/mapping.asp.
[69] M. Mathis, J. Semke, and J. Mahdavi. The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm. Computer Communications Review, 27(3), 1997.
[70] Network Simulator (NS) – Home Page, 2006. http://www.isi.edu/nsnam/ns/ns.html.
[71] D. Nicol. Tradeoffs Between Model Abstraction, Execution Speed, and Behavioral Accuracy. In European Modeling and Simulation Symposium, 2006.
[72] David M. Nicol and Guanhua Yan. Simulation of Network Traffic at Coarse Timescales. In PADS '05: Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation, pages 141–150, Washington, DC, USA, 2005.
[73] G. Riley, E. Zegura, and M. Ammar. Efficient Routing Using Nix-Vectors. Technical Report, GIT-CC-00-13, March 2000.
[74] Slyck – Home Page, 2006. http://slyck.com/bt.php?page=21.
[75] B.K. Szymanski, Y. Liu, and R. Gupta. Parallel Network Simulation Under Distributed Genesis. In Proceedings of the 17th Workshop on Parallel and Distributed Simulation, pages 61–68, San Diego, CA, June 2003.
[76] K. Walsh and E. Sirer. Staged Simulation: A General Technique for Improving Simulation Scale and Performance. ACM TOMACS, 14(2):170–195, April 2004.
[77] Time Warner Cable. "Cable Vs. DSL". http://raleigh.twcbc.com/about/cable_vs_dsl.cfm.
[78] Rocketfuel – Home Page. http://www.cs.washington.edu/research/networking/rocketfuel.
[79] N. Spring, R. Mahajan, and D. Wetherall. Measuring ISP Topologies with Rocketfuel. In Proceedings of ACM/SIGCOMM '02, August 2002.
[80] M. Liljenstam, J. Liu, and D. Nicol. Development of an Internet Backbone Topology for Large-Scale Network Simulations. In Proceedings of the 35th Conference on Winter Simulation: Driving Innovation, pages 694–702, New Orleans, Louisiana, December 2003.
[81] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet Map Discovery. In Proceedings of IEEE INFOCOM, pages 1371–1380, Tel Aviv, Israel, March 2000.
[82] Narses Network Simulator – Home Page, 2006. http://sourceforge.net/projects/narses.
[83] Weishuai Yang and Nael Abu-Ghazaleh. GPS: A General Peer-to-Peer Simulator and its Use for Modeling BitTorrent. In Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 425–434, Washington, DC, USA, 2005.
[84] Hannes Birck, Oliver Heckmann, Andreas Mauthe, and Ralf Steinmetz. Analysis of Overlay Networks at Message- and Packet-Level. Technical Report, KOM-TR-2004-03, Darmstadt University of Technology, June 2004.
[85] Dongyu Qiu and R. Srikant. Modeling and Performance Analysis of BitTorrent-Like Peer-to-Peer Networks. SIGCOMM Comput. Commun. Rev., pages 367–378, October 2004.
[86] Ashwin R. Bharambe, Cormac Herley, and Venkata N. Padmanabhan. Analyzing and Improving BitTorrent Performance. Technical Report, MSR-TR-2005-03, Microsoft Research, February 2005.
[87] R. Bindal, P. Cao, W. Chan, J. Medval, G. Suwala, T. Bates, and A. Zhang. Improving Traffic Locality in BitTorrent via Biased Neighbor Selection. In Proceedings of the 2006 International Conference on Distributed Computing Systems, Spain, July 2006.
[88] C. D. Carothers, D. Bauer, and S. Pearce. ROSS: Rensselaer's Optimistic Simulation System User's Guide. Technical Report #02-12, Department of Computer Science, Rensselaer Polytechnic Institute, 2002. http://www.cs.rpi.edu/tr/02-12.pdf.
[89] C. Carothers, D. Bauer, and S. Pearce. ROSS: A High-Performance, Low Memory, Modular Time Warp System. In Proceedings of the 14th Workshop on Parallel and Distributed Simulation, pages 53–60, May 2000.
[90] Abilene/Internet II Usage Policy. http://abilene.internet2.edu/policies/cou.html.
[91] BitTorrent Source Code, ver. 4.4.0, Linux Release. http://www.bittorrent.com/download.myt.
[92] BitTorrent, News Release: Partnership with Warner Brothers, 2006. http://www.bittorrent.com/2006-05-09-Warner-Bros.html.
[93] R. Brown. Calendar Queues: A Fast O(1) Priority Queue Implementation for the Simulation Event Set Problem. Communications of the ACM (CACM), vol. 31, pp. 1220–1227, 1988.
[94] J. Cowie, A. Ogielski, and B.J. Premore. Internet Worms and Global Routing Instabilities. In Proceedings of the Annual SPIE 2002 Conference, July 2002.
[95] Z. Ge, D. R. Figueredo, S. Jaswal, J. Kurose, and D. Towsley. Modeling Peer-Peer File-Sharing Systems. In Proceedings of IEEE INFOCOM, 2003.
[96] C. Gkantsidis and P. Rodriguez. Network Coding for Large Scale Content Distribution. In Proceedings of IEEE INFOCOM, Miami, March 2005.
[97] P. Grant and J. Drucker. Phone, Cable Firms Rein In Consumers' Internet Use: Big Operators See Threat to Service as Web Calls, Videos Clog Up Networks. The Wall Street Journal, October 21, 2005, page A1.
[98] D. R. Jefferson. Virtual Time. ACM Transactions on Programming Languages and Systems, 7(3):404–425, July 1985.
[99] R. LeMay. BitTorrent Creator Slams Microsoft's Methods, June 21, 2005. ZDNet Australia. http://www.zdnet.com.au/news/software/0,2000061733,39198116,00.htm.
[100] A. Parker. The True Picture of Peer-to-Peer File-Sharing. http://www.cachelogic.com/research/slide1.php.
[101] R. Ronngren and Rassul Ayani. A Comparative Study of Parallel and Sequential Priority Queue Algorithms. ACM Transactions on Modeling and Computer Simulation, vol. 7, no. 2, pp. 157–209, April 1997.
[102] R. Shaw. BitTorrent Users, Ignore Opera at Your Inconvenience. ZDNet Blogs, February 17, 2006. http://blogs.zdnet.com/ip-telephony/?p=918.
[103] L. Peterson, T. Anderson, D. Culler, and T. Roscoe. A Blueprint for Introducing Disruptive Technology into the Internet. In Proceedings of the First Workshop on Hot Topics in Networking (HotNets-I), October 2002.
[104] PlanetLab Acceptable Use Policy, 2006. http://www.planet-lab.org/php/aup.
[105] J. A. Pouwelse, P. Garbacki, D. H. J. Epema, and H. J. Sips. The BitTorrent P2P File-Sharing System: Measurements and Analysis. In Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS '05), February 2005.
[106] Xinyan Zhang, Jiangchuan Liu, Bo Li, and Tak-Shing Peter Yum. CoolStreaming/DONet: A Data-Driven Overlay Network for Efficient Live Media Streaming. In Proceedings of IEEE INFOCOM, Miami, FL, March 2005.
[107] C. Dana, D. Li, D. Harrison, and C. Chuah. BASS: BitTorrent Assisted Streaming System for Video-on-Demand. In International Workshop on Multimedia Signal Processing, IEEE Press, 2005.
[108] Aggelos Vlavianos, Marios Iliofotou, and Michalis Faloutsos. BiToS: Enhancing BitTorrent for Supporting Streaming Applications. IEEE INFOCOM 2006 Global Internet Workshop, April 2006.
[109] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi, Antony Rowstron, and Atul Singh. SplitStream: High-Bandwidth Multicast in a Cooperative Environment. In SOSP '03, Lake Bolton, New York, October 2003.
[110] Duc A. Tran, Kien A. Hua, and Tai T. Do. A Peer-to-Peer Architecture for Media Streaming. Journal on Selected Areas in Communications, Special Issue on Advances in Service Overlay Networks.
[111] M. Hefeeda, A. Habib, D. Xu, B. Bhargava, and B. Botev. CollectCast: A Peer-to-Peer Service for Media Streaming. ACM/Springer Multimedia Systems Journal, October 2003.
[112] G. Wen, H. Longshe, and F. Qiang. Recent Advances in Peer-to-Peer Media Streaming Systems. In China Communications, October 2006.
[113] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, and James W. O'Toole Jr. Overcast: Reliable Multicasting with an Overlay Network.
[114] D. Kostic, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In Proceedings of ACM SOSP, 2003.
[115] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz. Bayeux: An Architecture for Scalable and Fault-Tolerant Wide-Area Data Dissemination. In Proceedings of the Eleventh International Workshop on Network and Operating System Support for Digital Audio and Video, June 2001.
[116] Y. H. Chu, S. G. Rao, and H. Zhang. A Case for End System Multicast. In Measurement and Modeling of Computer Systems, pages 1–12, 2000.
[117] R. Rejaie and A. Ortega. PALS: Peer-to-Peer Adaptive Layered Streaming. In Proceedings of ACM NOSSDAV, pages 153–161, June 2003.
[118] S. Tewari and L. Kleinrock. Analytical Model for BitTorrent-Based Live Video Streaming. In Proceedings of IEEE NIME Workshop, Las Vegas, NV, January 2007.
[119] PPLive. http://www.pplive.com/en/index.html.
[120] PPStream. http://www.ppstream.com.
[121] MySee. http://www.mysee.com.
[122] Roxbeam. http://www.roxbeam.com.
[123] UUSEE. http://www.uusee.com.
[124] Joost. http://www.joost.com.
[125] Pando. http://www.pando.com.
[126] Red Swoosh, an Akamai Company. http://www.akamai.com/redswoosh.
[127] Vonage – Home Page. http://www.vonage.com/index.php?ic=1.
[128] Skype – Home Page. http://www.skype.com.
[129] Keynote Systems – Hosted Streaming Quality Measurement, 2006. http://www.keynote.com/products/voip_and_streaming/streaming_performance/streaming_perspective_stremq.html.
[130] Burstable Billing – Wikipedia, November 2007. http://en.wikipedia.org/wiki/Burstable_billing.
[131] BitTorrent. http://www.bittorrent.com.
[132] ITIVA Networks. http://www.itiva.com.
[133] Kontiki Delivery Management System. http://www.verisign.com/products-services/content-messaging/broadband-delivery/kontiki-delivery-management.