TECHNIQUES AND DATA STRUCTURES FOR EFFICIENT INFORMATION ACCESS AND
RETRIEVAL IN DISTRIBUTED NETWORKS
By
Ryan LaFortune
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Computer Science
Approved by the Examining Committee:
Christopher Carothers, Thesis Adviser
Konstantin Busch, Thesis Adviser
Boleslaw Szymanski, Member
Bulent Yener, Member
Srikanta Tirthapura, Member
Rensselaer Polytechnic Institute
Troy, New York
March 2008 (For Graduation May 2008)
TECHNIQUES AND DATA STRUCTURES FOR EFFICIENT INFORMATION ACCESS AND
RETRIEVAL IN DISTRIBUTED NETWORKS
By
Ryan LaFortune
An Abstract of a Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Computer Science
The original of the complete thesis is on file in the Rensselaer Polytechnic Institute Library
Examining Committee:
Christopher Carothers, Thesis Adviser
Konstantin Busch, Thesis Adviser
Boleslaw Szymanski, Member
Bulent Yener, Member
Srikanta Tirthapura, Member
Rensselaer Polytechnic Institute
Troy, New York
March 2008 (For Graduation May 2008)
© Copyright 2008
by
Ryan LaFortune
All Rights Reserved
CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT
1. Introduction and Historical Review
   1.1 Information Access
   1.2 Information Retrieval
   1.3 Summary of Contributions
       1.3.1 Information Access Contributions
       1.3.2 Information Retrieval Contributions
2. Information Access: Tracking Mobile Objects in Wireless Sensor Networks
   2.1 Introduction
       2.1.1 Querying Schemes
       2.1.2 Sparse Covers and Motivations
             2.1.2.1 Name-Independent Compact Routing
             2.1.2.2 Synchronizers
   2.2 Contributions
   2.3 Related Work
   2.4 Definitions and Preliminaries
       2.4.1 Graph Basics
       2.4.2 Covers
       2.4.3 Path Separators
       2.4.4 Graph Minors
   2.5 A Structural Lower Bound for Sparse Covers
       2.5.1 Graph Construction
       2.5.2 Set Cardinalities
       2.5.3 Proving a Lower Bound
   2.6 Shortest Path Clustering
   2.7 Cover for k-Path Separable Graphs
   2.8 Cover for Planar Graphs
       2.8.1 Basic Results for Planar Graphs
       2.8.2 High Level Description of the Algorithm
       2.8.3 Algorithm Depth-Cover
       2.8.4 Analysis
       2.8.5 General Planar Cover
   2.9 Cover for Unit Disk Graphs
   2.10 Summary
3. Information Retrieval: P2P Content Delivery
   3.1 Introduction
       3.1.1 Users Happy ISPs Not
   3.2 Contributions
   3.3 Related Work
   3.4 The BitTorrent Protocol
       3.4.1 Message Protocol
       3.4.2 Choker Algorithms
       3.4.3 Piece-Picker
       3.4.4 Implications for Network Model Design
   3.5 Simulator
       3.5.1 BitTorrent Model Data Structure
       3.5.2 Tuning Parameters
   3.6 Topology Model
       3.6.1 Related Work
             3.6.1.1 Internet Mapping Projects
             3.6.1.2 Abstractions
       3.6.2 Internet Connectivity Model
             3.6.2.1 Backbone
             3.6.2.2 Neighborhood-Level
       3.6.3 Population Model
       3.6.4 Delay Model
       3.6.5 Technology Model
       3.6.6 Bandwidth Model
       3.6.7 Results
   3.7 Experimental Results
       3.7.1 Model Validation
       3.7.2 Model Performance
   3.8 BitTorrent as a Streaming Protocol
       3.8.1 Related Work
       3.8.2 BiToS
       3.8.3 BASS
             3.8.3.1 Simulation Results
             3.8.3.2 QoS
             3.8.3.3 CDN Utilization
             3.8.3.4 Video Bit Rate
   3.9 Summary
4. Discussion and Conclusions
LITERATURE CITED
LIST OF TABLES
3.1 Approximate memory required for simulation runs, and technique lookup complexity.
3.2 Number of events generated in the slice-level simulations and lower bound on the number of events generated in the packet-level simulations.
3.3 Number of messages received per type per simulation scenario.
3.4 For a flash crowd of 2,048 peers and a look-ahead buffer of 1,000 slices, this table shows the performance of several large swarms (16,384 peers to 131,072 peers).
3.5 This table shows how many streaming peers received each grade of the StreamQ performance rating system. Note that a grade of F is given when a peer's adjusted frustration time is 27 seconds or more.
3.6 For a flash crowd of 2,048 peers and a look-ahead buffer of 1,000 slices, this table shows the 95th percentiles and approximate flat-rate distribution costs for several large swarms (16,384 peers to 131,072 peers), at $0.10 per GB delivered.
3.7 For a flash crowd of 2,048 peers, this table shows the appropriate size of the look-ahead buffer to achieve a similar QoS for different bit rates (700 Kbps and 1.5 Mbps) and swarm sizes (16,384 peers and 32,768 peers).
3.8 This table shows the differences in P2P contribution and CDN utilization for the scenarios presented in Table 3.7.
LIST OF FIGURES
2.1 This figure shows S0 and S1. In S1, S0 is replicated c + 1 times and connected using the gadget G^0_1.
2.2 This figure demonstrates the structure of a general Si graph.
2.3 This figure demonstrates one way path lengths may grow as described in Lemma 2.5.1.
2.4 A demonstration of the proof of property iii of Lemma 2.6.1.
2.5 For Lemma 2.8.2: the figure on the left shows a configuration of removed edges that are external in C and span from A to B (note, if the lemma were not true, B would be disconnected), the figure in the middle demonstrates the walk options from lu to lv, and the figure on the right demonstrates the walk options from lu to ru.
2.6 For Case 2 of Lemma 2.8.3: the figure on the left demonstrates a possible setup, and the figure on the right demonstrates one of the two possible path configurations.
2.7 This figure demonstrates the subgraphs and paths described in Lemma 2.8.4.
2.8 Execution example of Algorithm Subgraph-Clustering.
3.1 This figure shows our BitTorrent model data structure.
3.2 This figure is the connectivity graph of the backbone of the connectivity model. The nodes represent sources, sinks, intermediate backbone routers, and identified low-tiered ISP routers. The edges represent links between respective nodes.
3.3 This figure shows the distribution of shortest path lengths for distinct paths in the backbone of the connectivity model. This curve is typical of the Internet, demonstrating that we have preserved the required path properties.
3.4 This figure is the connectivity graph resulting from one set of traces to a popular cable ISP.
3.5 This figure shows the average delays experienced at each of the first 18 links along a packet's path from our traces.
3.6 This figure shows the national technology distribution for home high-speed Internet connections for March of 2003 and March of 2006.
3.7 This figure shows the download completion times of the modified INRIA/PlanetLab scenario taken from [67]. In our case, we varied the random number seed-sets across 10 separate runs of the 40 peer, 1 seeder scenario, providing us with 400 peer data points.
3.8 Simulated download completion times (seconds) for the 1,024 peer, 1,024 piece scenario.
3.9 Model execution time as a function of the number of pieces and the number of peers.
3.10 Model event rate as a function of the number of pieces and the number of peers.
3.11 Model memory usage in MB as a function of the number of pieces and the number of peers.
3.12 Simulated download completion times (seconds) for the 16,384 peer, 4,096 piece scenario (this simulation run required 15.14 GB of RAM and 59.66 hours to execute with an event rate of 35,179 events per second).
3.13 This figure demonstrates our double-buffering scheme. In this example, the playback buffer is 5 slices and the look-ahead buffer is 15 slices.
3.14 This graph demonstrates the average buffer times (seconds) experienced by streaming peers in the simulation runs.
3.15 This graph demonstrates the average number of buffer events experienced by streaming peers in the simulation runs. Note that the first buffer event is mandatory for all streaming peers.
3.16 This graph demonstrates the percent of bandwidth contributed by the P2P network for the file distribution in the simulation runs.
3.17 This histogram of the buffering times demonstrates that most streaming peers experience a buffering time of under 3 seconds.
3.18 This histogram of the number of buffering events demonstrates that most streaming peers experience few re-buffers.
3.19 This histogram of the adjusted frustration times demonstrates that most streaming peers experience a high QoS.
3.20 This graph demonstrates the average server utilizations over the simulation runs.
3.21 This graph demonstrates the peak server utilizations over the simulation runs.
3.22 This graph shows the distribution of slices delivered by the CDN throughout the file, for the 16,384 peer, 1,024 flash crowd, and 2,000 slice look-ahead scenario.
ACKNOWLEDGMENT
Pursuing a doctoral degree is an extensive and intricate process. I would like to
thank my parents for the support and encouragement they have provided me with
throughout my entire life. I would like to thank my brother Erik for challenging
me and for helping initiate my pursuit of mathematical superiority at an early age.
I would also like to thank my fiancée Christina for her support and understanding
throughout this challenging feat.
Further, I would like to thank my doctoral committee, and my advisors Chris
Carothers and Costas Busch for their guidance, for sharing their knowledge, and for
helping me throughout graduate school.
ABSTRACT
Networks were designed and continue to exist to allow for fast and convenient access
to remote data. With data scattered across a large network, there exists a
fundamental challenge to efficiently find any sought data. This challenge is further
complicated when the data is periodically relocated in the network, as is the case
with wireless sensor networks. Thus, a solution to the problem necessitates a data
structure with the ability to update in response to object relocations. A trivial solu-
tion to the problem uses a centralized directory responsible for knowing the location
of all data at all times, and directing all querying nodes to the location of the data
they seek. Dependence on one node to provide directions results in a single point of
failure, and may cause some queries to be unnecessarily long, especially when the
sought data lies at a node topologically close to the querying node. A better solution
to the problem uses a distributed directory, where all queries are answered quickly
regardless of the whereabouts of the querying and storing nodes. In this thesis, we
provide significant improvements to previous distributed directory solutions by
creating innovative algorithms that improve the structural properties of sparse covers,
the underlying data structure from which a directory is built. Specifically, we
improve directory performance to Stretch_find = O(log n) and Stretch_move = O(log n)
for H-minor free graphs (a savings of log n in each measure), and Stretch_find = O(1)
and Stretch_move = O(log n) for planar graphs, unit disk graphs, and other graphs
with a constant-stretch planar spanner (an additional savings of log n in Stretch_find).
Once data is located in the network, it must then be retrieved (delivered). In a
simple world, this delivery would be between a single source and a single destination.
The possibilities for delivery techniques increase greatly when there are many sources
and destinations, like in peer-to-peer (P2P) networks. P2P networks have gained
much attention in recent years due to their scalability and fault-tolerance, and also
their potential to drastically reduce distributor transit costs. In order to study the
dynamics and causal relationships between peer entities in these complex overlay
networks, we have developed a flow-based discrete-event simulator and abstract
Internet topology model that accurately and realistically model today’s broadband
service, at a scale larger than previous efforts. Specifically, our model can scale to
hundreds of thousands of peers, where prior efforts peak at only a few thousand.
Using detailed simulations, we have improved the efficiency of data dissemination
and reduced distributor transit costs for both the time-insensitive mass-download
scenario and the real-time streaming scenario.
CHAPTER 1
Introduction and Historical Review
This thesis discusses two fundamental problems in distributed networking. The first
problem deals with efficiently locating sought data and objects in wireless sensor
networks. We provide a distributed directory solution with substantial performance
improvements over previous results. The second problem deals with techniques for
data retrieval and delivery in peer-to-peer (P2P) networks. Here, we show through
detailed simulation how to improve network throughput and user performance, and
reduce distributor transit costs.
1.1 Information Access
Networks were designed and continue to exist to allow for fast and convenient
access to remote data. With data scattered across a large network, there exists a
fundamental challenge to efficiently find any sought data. There are many lookup
systems for P2P networks, such as Chord [16], CAN [17], Pastry [18], Tapestry [19],
Symphony [20], randomized hypercubes [21, 22], and randomized Chord [21, 23], to
name a few. The challenge of efficiently finding data is further complicated when
the sought data/objects are mobile.
Consider a wireless sensor network responsible for tracking objects with the
ability to relocate frequently and at will. Finding data in such a network necessitates
a data structure with the ability to update in response to object relocations. The
current state of the art solution is a directory service for mobile objects. This service
is responsible for establishing and maintaining paths to objects in the network, so
that it may provide directions to navigate a user to the location of a desired object.
A trivial solution to the problem uses a centralized directory responsible for knowing
the location of all data at all times, and directing all querying nodes to the location of
the data they seek. Dependence on one node to provide directions results in a single
point of failure, and may cause some queries to be unnecessarily long, especially
when the sought data lies at a node topologically close to the querying node. A
better solution to the problem uses a distributed directory [35], where all queries are
answered quickly regardless of the whereabouts of the querying and storing nodes.
A distributed directory provides two operations: find, to locate an object
given its name, and move, to move an object from one node to another. There is an
inherent tradeoff between the cost of implementing the find and move operations.
The performance of a directory is measured by the Stretch_find, the Stretch_move, and
the memory overhead of the directory, where stretch is defined as the ratio between
the cost of performing an operation and the optimal cost.
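As a toy illustration of these measures (the helper names here are ours, invented for this sketch, not the thesis's), the stretch of an operation is just a cost ratio, and Stretch_find or Stretch_move over a workload is the worst such ratio:

```python
def stretch(actual_cost: float, optimal_cost: float) -> float:
    """Stretch of one operation: the cost actually paid (e.g., hops
    traveled by a find) divided by the optimal cost (e.g., the
    shortest-path distance between querier and object)."""
    if optimal_cost <= 0:
        raise ValueError("optimal cost must be positive")
    return actual_cost / optimal_cost

def worst_case_stretch(costs) -> float:
    """Stretch_find (or Stretch_move) over a workload:
    the worst ratio across all find (or move) operations."""
    return max(stretch(a, o) for a, o in costs)

# A find that travels 12 hops to an object 3 hops away has stretch 4;
# over this workload the worst case is 10/2 = 5.
finds = [(12, 3), (10, 2), (4, 4)]
```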
A sparse cover is the underlying data structure from which a directory is built,
consisting of a set of connected components called clusters, where every node in the
graph (network) belongs to some cluster containing its entire γ-neighborhood (where
γ is some desired locality parameter). Structurally, a cover is characterized by two
locality metrics, its radius (the maximum cluster radius, which is the minimum
eccentricity (maximum shortest path distance to any cluster node) of a node in any
cluster) and degree (the maximum number of clusters a node participates in). The
radius often translates into latency, and the degree translates into the load imposed
on a node by the data structure.
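As a concrete sketch of these definitions (our own illustration, with the simplifying assumption that distances are measured inside each cluster's induced subgraph and that clusters are connected), the helper below checks the γ-cover property and reports a cover's radius and degree:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted graph {node: [neighbors]}."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def cover_metrics(adj, clusters, gamma):
    """Verify a gamma-cover and return (radius, degree).

    clusters: list of node sets. Every node's gamma-neighborhood must lie
    entirely inside at least one cluster. Radius is the maximum over
    clusters of the minimum eccentricity of any node in the cluster;
    degree is the maximum number of clusters any node belongs to."""
    dists = {u: bfs_dist(adj, u) for u in adj}
    for u in adj:
        ball = {v for v, d in dists[u].items() if d <= gamma}
        assert any(ball <= c for c in clusters), f"{u}'s ball is not covered"
    radius = 0
    for c in clusters:
        # Distances inside the cluster's induced subgraph (assumed connected).
        sub = {u: [v for v in adj[u] if v in c] for u in c}
        ecc = {u: max(bfs_dist(sub, u).values()) for u in c}
        radius = max(radius, min(ecc.values()))
    degree = max(sum(u in c for c in clusters) for u in adj)
    return radius, degree
```

For the path 0-1-2-3 with γ = 1 and clusters {0,1,2} and {1,2,3}, every node's 1-neighborhood fits in some cluster, the radius is 1, and the degree is 2 (nodes 1 and 2 each belong to both clusters).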
In Chapter 2 of this thesis, we prove that there exists a network with n nodes,
and constrained by the locality parameter γ and the maximum tolerable degree c,
such that when clustered, there must exist a cluster whose radius is Ω(γ log log_c n).
This proves that for arbitrary graphs, we cannot simultaneously optimize both
metrics, limiting the quality of the distributed directories we can create for these graphs
using sparse covers.
In light of the above tradeoff for arbitrary graphs, it is natural to ask whether
better sparse covers can be obtained for special classes of graphs. We answer this
question in the affirmative for the class of graphs that exclude a fixed minor. This
includes many popular graph families, such as planar graphs, which exclude K5
and K3,3, outerplanar graphs, which exclude K4 and K2,3, series-parallel graphs,
which exclude K4, and trees, which exclude K3. For any graph G that excludes
a fixed minor graph H, we present an algorithm for computing a sparse cover Z
such that rad(Z) ≤ 4γ and deg(Z) = O(log n), where n is the number of nodes in
G (rad(Z) refers to the radius of Z, and deg(Z) refers to the degree of Z). The
constants in the degree bound depend on the size of H. For any planar graph G,
we present an algorithm for computing a sparse cover Z with rad(Z) ≤ 24γ − 8
and deg(Z) ≤ 18. This cover is optimal (modulo constant factors) with respect
to both the degree and the radius. To our knowledge, this is the first optimal
construction for planar graphs. Finally, for any unit disk graph G (or other graph
with a constant-stretch planar spanner), we present a technique for computing a
sparse cover Z with rad(Z) ≤ 24γt−8 and deg(Z) ≤ 18 (for some constant t). This
cover is also optimal (modulo constant factors) with respect to both the degree and
the radius. Once again, this is the first optimal construction for unit disk graphs.
Using our innovative algorithms that improve the structural properties of
sparse covers, we significantly improve the performance of both the find and move
operations of distributed directories for the studied families of graphs. Using the
general algorithm of Awerbuch and Peleg [36], we can build a directory with
Stretch_find = O(log^2 n) and Stretch_move = O(log^2 n). Using our improved algorithm
for H-minor free graphs, we achieve Stretch_find = O(log n) and Stretch_move = O(log n).
Using our improved algorithm for planar graphs (and similarly unit disk graphs), we
achieve Stretch_find = O(1) and Stretch_move = O(log n).
Our contributions significantly improve the performance of distributed directories
as well as other well-studied distributed computing problems, including network
synchronizers and name-independent compact routing schemes.
1.2 Information Retrieval
Once data is located in the network, it must then be retrieved (delivered).
In the past, the data would typically be delivered directly from some dedicated
server or content distribution network (CDN), and the distributor would pay all
associated transit costs. An attractive alternative to this architecture is using a
P2P overlay network, such as BitTorrent [64]. In such a network, all participants
act as both clients and servers, by downloading content for themselves, and also
uploading it to other users. This architecture can alleviate the costs required for
distribution, as most content is delivered using the swarm’s aggregate bandwidth,
rather than bandwidth purchased by the distributor. Further, if the demand for
some particular data is extremely high, even the most powerful single server would
be quickly overwhelmed. This is not a concern in P2P networks, as requests would
be distributed across a very large set of nodes.
The Internet is evolving in ways unforeseen upon its conception. In recent
years, we have seen the Internet used as a phone service. Vonage [127] is a company
providing voice over Internet protocol (VoIP), and Skype [128] provides a P2P-based
phone service. We are beginning to see companies offer television programs over the
Internet. If Internet protocol television (IPTV) succeeds, we will see an explosion
in the amount of data transferred over the Internet. Specifically, it is clear that
BitTorrent-like networks and other P2P networks will be used for the bulk of this
delivery, making it economically feasible for content distributors.
BitTorrent is known to be very scalable and robust, and to provide high performance
to its users. Unfortunately, the protocol is based almost entirely on heuristics,
making it all but impossible to analyze through theoretical measures. Further,
there are no statistics available for television-scale swarms (as none have existed in
real life), and even for the smaller swarms that have existed, statistics are scarce, as
the sessions were likely distributing data illegally. Thus, we turn to simulation to
help us study the problem at hand. To date, simulators of the protocol have tended
to examine small-scale swarms consisting of at most a few thousand peers.
Through careful design and the use of abstractions, we have developed a
BitTorrent simulator and Internet topology model capable of scaling to hundreds of
thousands of users (television-size audiences), on commodity hardware, while
maintaining a high level of accuracy. In Chapter 3, we discuss our model, provide
experimental results and validation, and discuss how we have used it to study data
dissemination in both the time-insensitive mass-download scenario, and the real-
time streaming scenario.
1.3 Summary of Contributions
The following is a summary of all contributions in the areas of information
access (discussed in Chapter 2) and information retrieval (discussed in Chapter 3)
presented in this thesis.
1.3.1 Information Access Contributions
1. For any planar graph G, we present an algorithm for computing a sparse
cover Z with rad(Z) ≤ 24γ − 8 and deg(Z) ≤ 18. This cover is optimal
(modulo constant factors) with respect to both the degree and the radius. To
our knowledge, this is the first optimal construction for planar graphs (see
Section 2.8).
2. Using our sparse cover construction algorithm for planar graphs, we improve
distributed directory performance for networks that can be represented by
these graphs, achieving Stretch_find = O(1) and Stretch_move = O(log n).
3. For any unit disk graph G (or other graph with a constant-stretch planar
spanner), we present a technique for computing a sparse cover Z with
rad(Z) ≤ 24γt − 8 and deg(Z) ≤ 18 (for some constant t). This cover is optimal
(modulo constant factors) with respect to both the degree and the radius. To
our knowledge, this is the first optimal construction for unit disk graphs (see
Section 2.9).
4. Using our sparse cover construction technique for unit disk graphs (and other
graphs with a constant-stretch planar spanner), we improve distributed
directory performance for networks that can be represented by these graphs,
achieving Stretch_find = O(1) and Stretch_move = O(log n).
5. For any graph G that excludes a fixed minor graph H, we present an algorithm
for computing a sparse cover Z such that rad(Z) ≤ 4γ and deg(Z) = O(log n),
where n is the number of nodes in G. The constants in the degree bound
depend on the size of H (see Section 2.7).
6. Using our sparse cover construction algorithm for H-minor free graphs, we
improve distributed directory performance for networks that can be represented
by these graphs, achieving Stretch_find = O(log n) and Stretch_move = O(log n).
7. We prove there exists a network with n nodes, and constrained by the locality
parameter γ and the maximum tolerable degree c, such that when clustered,
there must exist a cluster whose radius is Ω(γ log log_c n), proving the inherent
tradeoff between radius and degree, and that for arbitrary graphs these metrics
cannot be simultaneously optimized (see Section 2.5).
1.3.2 Information Retrieval Contributions
1. A memory-efficient model of the BitTorrent protocol built on the ROSS
discrete-event simulation system [88, 89]. The memory consumed by a single
BitTorrent client can be upwards of 70 MB. The memory consumed by a client in
our model is between 67 KB and 2.3 MB (see Section 3.5).
2. A slice-level data model that ensures protocol accuracy while avoiding the
event explosion problem characteristic of typical packet-level models, such as
those employed in NS [70]. As a result, we achieve tremendous sequential
processor speedups (up to 180 times) (see Sections 3.5 and 3.6).
3. A realistic Internet topology model that preserves geographic market
relationships, is massively scalable, and accurately models the in-home consumer
broadband Internet (see Section 3.6).
4. Validation of our BitTorrent model against instrumented BitTorrent
operational software as well as previous measurement studies (see Section 3.7.1).
5. Model performance results and analysis for a large number of BitTorrent
swarm scenarios (see Section 3.7.2).
6. Analysis of techniques for streaming content using BitTorrent. We show
acceptable quality of service (QoS) can be achieved when only a small fraction of
a BitTorrent swarm is streaming. Further, we show how the use of BitTorrent
along with a CDN can significantly reduce transit costs while providing an
excellent QoS (see Section 3.8).
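The first two contributions above can be illustrated with a minimal sketch. This is our own toy code under invented constants, not the thesis's ROSS-based implementation: per-peer piece state is packed one bit per piece instead of buffering payloads, and a flow-based transfer schedules one completion event per slice rather than one per packet.

```python
class PeerState:
    """Per-peer piece state packed one bit per piece, rather than
    buffering piece payloads; e.g., 4,096 pieces cost only 512 bytes."""
    def __init__(self, num_pieces: int):
        self.num_pieces = num_pieces
        self.bits = bytearray((num_pieces + 7) // 8)

    def mark_have(self, piece: int) -> None:
        self.bits[piece // 8] |= 1 << (piece % 8)

    def has_piece(self, piece: int) -> bool:
        return bool(self.bits[piece // 8] & (1 << (piece % 8)))


def event_counts(file_bytes: int, slice_bytes: int, packet_bytes: int):
    """Events needed to simulate one full transfer: a flow-based model
    schedules a single completion event per slice (completion time =
    slice size / available bandwidth), while a packet-level model
    generates at least one event per packet."""
    slices = -(-file_bytes // slice_bytes)      # ceiling division
    packets = -(-file_bytes // packet_bytes)
    return slices, packets
```

For example, a 700 MB file with 256 KB slices and 1,500-byte packets needs roughly 2,800 slice events versus roughly 490,000 packet events per transfer, which illustrates the kind of event-count reduction behind the sequential speedups reported above.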
CHAPTER 2
Information Access: Tracking Mobile Objects in Wireless
Sensor Networks
2.1 Introduction
Networks of wireless sensors provide unprecedented opportunities for distributed
sensing and monitoring of the physical environment. Many applications of sensor
networks, such as distributed surveillance and habitat monitoring, deal with mobile
objects such as people, animals, or vehicles. Fundamental tasks of the sensor
network are to track these objects, navigate users to them, answer queries about them,
and route data to/from them.
Consider a sensor network responsible for the surveillance of vehicles. It should
be able to track the whereabouts of a vehicle, or even reach a vehicle by navigating
to its current location. Such a network would be capable of warning vehicles of
impending danger, or informing users of vehicles in the area.
An example of habitat monitoring is Zebranet [1, 2, 3]. This system monitors
animals in a region, and can be used to navigate users (such as photographers) to a
specific animal, herd of animals, or to the animal closest to the user. It can also be
used to answer aggregate queries about the habitat, such as distances traveled over
time.
The above problems can be rephrased in terms of building a directory service
for mobile objects in a wireless sensor network. A directory service in a distributed
system must establish and maintain paths to objects in the network so that it may
provide directions to navigate a user to the current location of any desired object.
Sensor networks are inherently resource-constrained [4]; each sensor node has
only limited energy, processing power, memory, storage capacity, and communica-
tion bandwidth. The combination of the large-scale nature and resource constraints
of sensor networks make the task of building a scalable directory service for them
extremely challenging.
Consider the conventional centralized implementation of a directory service.
Centralized nodes record location estimates of all interesting objects that are being
sensed. Users then communicate with these central nodes in order to find the objects
they seek. There are three drawbacks to this type of implementation. First, it
is expensive in terms of communication and energy consumed to keep the central
nodes up-to-date every time objects move, and for them to be involved in all queries.
Second, if a central node fails, many objects may be unreachable. The final drawback
is that the centralized scheme is inherently global. If a user is close to the sought
object, it must still communicate with a central node, which may be very far away.
In contrast, the ideal solution would take advantage of this user-object proximity
and would only involve local communication among nodes that are nearby to the
user and the object.
Such an ideal solution can be approached via a carefully designed distributed
data structure for the directory. In a distributed directory, no single node serves
as a “home” for the object, constantly knowing its current coordinates. Instead,
the directory information is spread out through the network in a way that makes
it possible to easily reach the object using local queries. Yet, the directory can be
updated locally whenever the object moves. The directory must also be lightweight
since it must operate within the energy, memory, and processing constraints of the
sensor nodes.
A distributed directory provides two operations: find, to locate an object
given its name, and move, to move an object from one node to another. There is an
inherent tradeoff between the cost of implementing the find and move operations.
The performance of a directory is measured by Stretch_find, Stretch_move, and the memory overhead of the directory, where stretch is defined as the ratio between the cost of performing an operation and the optimal cost.
2.1.1 Querying Schemes
There exist many schemes for storing data and querying wireless sensor net-
works. Some designs aim to reduce communication complexity by using named-data,
replication, and other methods of avoiding or controlling flooding through the net-
work. Another popular design feature is to reduce the amount of data sent by
performing filtering or aggregation at intermediate nodes.
Directed diffusion [7, 8] is a data-centric method in which a network of application-aware nodes implements data-naming. A user runs a query by disseminating a task as an interest for named data, then awaits the events that flow back. Along the path back to the user, intermediate nodes may choose to locally cache or aggregate the results before forwarding them. TAG [14] is a generic aggregation service
that operates in a similar fashion. However, this service is specifically designed to
run on ad hoc networks comprised of motes running the TinyOS operating system.
ACQUIRE [12] is another data-centric querying mechanism, which treats the net-
work like a distributed database. When required, an active query packet is injected,
and follows some trajectory through the network. This path can be random, pre-
determined, or guided, and helps avoid flooding. When a node receives the active
query, it performs an on-demand update for which it obtains information from all its
neighbors within its lookahead parameter. As the active query moves through the
network, it is progressively resolved until at some point it is completely solved, at which time it is returned to the querying node.
Introduced in [9] is a data-centric storage mechanism built upon the GPSR
geographic routing algorithm and a P2P lookup system such as Chord [16], CAN
[17], Pastry [18], or Tapestry [19] (some others include: Symphony [20], randomized
hypercubes [21, 22], and randomized Chord [21, 23]). This technique is based on
hashing, and stores data in different locations of the network. Therefore, queries
can be directed to certain locations rather than flooding the network. This mech-
anism assumes the availability of geographic information regarding the network.
Geographical routing is also discussed in [6]. The rumor routing algorithm [5] is another method that decouples data from nodes and stores it in regions. It is
intended to be used when geographic routing is inapplicable. This method does not
guarantee delivery, but is highly configurable for different network topologies, query
rates, and event rates. Configuring appropriately is a compromise between flooding
queries and flooding event notifications.
In [15], data is proactively pushed to select nodes, and later pulled when
queries are requested. This technique is orthogonal to data-centric storage, as data
is stored at the push-pull boundary. Carefully defining the line between push and
pull areas can result in significant communication savings. A comb-needle model
is proposed in [13]. In this model, the push component features data duplication
in a linear neighborhood of each node, and the pull component features a dynamic
formation of an on-demand routing structure that resembles a comb. Queries need
only go to a subset of the network, avoiding flooding.
In [10], a controlled flooding TTL-based model is discussed. The idea is to
flood the network with a query, but control it using a TTL (time to live) value. When
a query has reached its time to live, it does not progress any further. If at this time
the query is not solved, the user can give up or increase the TTL value (expanding
ring search [11] is one TTL strategy). A dynamic programming formulation with
search strategies that minimize the expected cost is given, which can be used when
the probability distribution of the location of an object is known. It is also shown
that given any deterministic TTL sequence, there exists a randomized version with
a better worst case expected search cost. This strategy can be used when the
probability distribution of the location of an object is not known.
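The TTL-controlled flooding and expanding ring search described above can be sketched as follows. This is an illustrative sketch only, not the formulation of [10] or [11]: the adjacency-list graph, the TTL-doubling schedule, and the message-count accounting are assumptions made for the example.

```python
from collections import deque

def ttl_limited_query(adj, source, target, ttl):
    """Flood a query from source, but let it travel at most `ttl` hops."""
    seen = {source}
    frontier = deque([(source, 0)])
    cost = 0  # number of messages sent during this flood
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return True, cost
        if depth == ttl:
            continue  # the query expires here and progresses no further
        for nbr in adj[node]:
            cost += 1
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return False, cost

def expanding_ring_search(adj, source, target, max_ttl):
    """One TTL strategy: retry with a doubled TTL until the target is found."""
    ttl, total = 1, 0
    while ttl <= max_ttl:
        found, cost = ttl_limited_query(adj, source, target, ttl)
        total += cost
        if found:
            return ttl, total
        ttl *= 2
    return None, total
```

On a path graph 0-1-2-3-4, a query from node 0 for node 3 fails at TTL 1 and 2 and succeeds once the ring expands to TTL 4, which illustrates the tradeoff the section describes: small TTLs avoid flooding but may need repeated, progressively wider attempts.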
2.1.2 Sparse Covers and Motivations
Awerbuch and Peleg have created distributed directories using hierarchies of
regional matchings [35], which are constructed from sparse covers [34] (both defined in Section 2.4.2). Their directory scheme features find and move stretches of Stretch_find = O(log^2 n) and Stretch_move = O(log^2 n). Thus, improvements to sparse cover construction techniques can improve the quality of regional matchings, and therefore improve the performance of distributed directories.
A cover Z of a graph G is a set of connected components called clusters, such
that the union of all clusters is the vertex set of G. A cover is defined with respect
to a locality parameter γ > 0. It is required that for each node v ∈ G, there is some
cluster in Z that contains the entire γ-neighborhood of v. Two locality metrics
characterize the cover: the radius, denoted rad(Z), which is the maximum radius of any of its clusters, and the degree, denoted deg(Z), which is the maximum number of clusters that a node in G is a part of. (The radius of a cluster C is defined with respect to the subgraph G′ it induces in G: it is the minimum eccentricity of any node in G′, where the eccentricity of a node v ∈ G′ is the maximum distance from v to any other node in G′.)
In addition to the construction of distance-dependent distributed directories [35, 46, 47], covers play a key role in the design of several other locality preserving
distributed data structures, including compact routing schemes [26, 27, 32, 47, 48,
53], network synchronizers [30, 33, 45, 47], and transformers for certain classes of
distributed algorithms [32]. In the design of these data structures, the degree of the
cover often translates into the load on a vertex imposed by the data structure, and
the radius of the cover translates into the latency. Thus, it is desirable to have a
sparse cover, whose radius is close to its locality parameter γ, and whose degree is
small.
2.1.2.1 Name-Independent Compact Routing
Consider a distributed system where nodes have arbitrary identifiers. A routing
scheme is a method that delivers a message to a destination given the identifier of
the destination. A name-independent routing scheme does not alter the identifiers of
the nodes, which are assumed to be in the range 1, 2, . . . , n. The stretch of a routing
scheme is the worst case ratio between the total cost of messages sent between a
source and destination pair, and the length of the respective shortest path. The
memory overhead is the number of bits (per node) used to store the routing table.
A routing scheme is compact if its stretch and memory overhead are small.
There is a tradeoff between stretch and memory overhead. For example, a
routing scheme that stores the next hop along the shortest path to every destination
has stretch 1, but a very high memory overhead of O(n log n), and hence is not
compact. The other extreme of flooding a message through the network has very
little memory overhead, but is not compact either since the stretch can be as much
as the total weight of all edges in the network. There has been much work on
deriving interesting tradeoffs between the stretch and memory overhead of routing,
including [26, 27, 29, 43, 44, 48, 53].
Sparse covers can be used to provide efficient name-independent routing schemes
(for example, see [30]). A hierarchy of regional routing schemes is created based on a
hierarchy of covers Z_1, Z_2, . . . , Z_δ, where the locality parameter of cover Z_i is γ_i = 2^i, and δ = ⌈log D⌉, where D is the diameter of the graph. (The diameter D of a graph G is the maximum shortest-path distance between any two nodes in the graph; it holds that rad(G) ≤ D ≤ 2 · rad(G).) Henceforth, we assume that log D = O(log n), i.e., the diameter of the graph is polynomial in the number of nodes.
Using the covers of Awerbuch and Peleg [34], the resulting routing scheme has stretch O(k) and average memory of O(n^{1/k} log^2 n) bits per node, for some parameter k. When k = log n, the stretch is O(log n) and the average memory overhead is O(log^2 n) bits per node.
On the other hand, using our covers we obtain routing schemes with optimal stretch (within constant factors) for planar and H-minor free graphs. For any planar graph G with n nodes, our covers give a name-independent routing scheme with O(1) stretch and O(log^2 n) average memory overhead per node. For any graph that excludes a fixed minor, our covers give a name-independent routing scheme with O(1) stretch and O(log^3 n) average memory overhead per node.
For planar graphs, to our knowledge, this is the first name-independent routing scheme that achieves constant stretch with O(log^2 n) space per node on average. For H-minor free graphs, Abraham, Gavoille, and Malkhi [26] present name-independent compact routing schemes with O(1) stretch and Õ(1) maximum space per node (the Õ notation hides polylogarithmic factors). However, their paper does not provide the explicit power of log n inside the Õ; hence, we cannot directly compare our results with those in [26]. It is noted in [26], though, that constructing efficient sparse covers for planar graphs with O(γ) radius and O(1) degree is an open problem, which we have solved.
There are also efficient routing schemes known for a weaker version of the routing problem called labeled routing, where the designer of the routing scheme is given the flexibility to assign names to nodes. Thorup [52] gives a labeled routing scheme for planar graphs with stretch (1 + ε) and memory overhead of O((1/ε) log^2 n) maximum bits per node. Name-independent routing is clearly less restrictive for the user than labeled routing, and hence a harder problem.
2.1.2.2 Synchronizers
Many distributed algorithms are designed assuming a synchronous model where
the processors execute and communicate in time synchronized rounds [30, 45]. How-
ever, synchrony is not always feasible in real systems due to physical limitations
such as different processing speeds or geographical dispersal. Synchronizers are
distributed programs that enable the execution of synchronized algorithms in asyn-
chronous systems [30, 31, 45, 47]. A synchronizer uses logical rounds to simulate
the time rounds of the synchronous algorithm.
One of the most efficient synchronizers is called ZETA [51]. This synchronizer is based on a sparse cover with locality parameter γ = 1, radius O(log_k n), and average degree O(k), for some parameter k. ZETA simulates a round in O(log_k n) time steps and uses O(k) messages per node on average. In contrast, using our covers, we obtain a better time to simulate a round. For planar graphs, our covers give a synchronizer with O(1) time and O(1) average messages per node. For H-minor free graphs, the synchronizer takes O(1) time and uses O(log n) messages per node on average.
Awerbuch and Peleg [34] present an algorithm for constructing a sparse cover
of a general graph based on the idea of coarsening. Starting from an initial cover
S consisting of the n clusters formed by taking the γ-neighborhoods of each of
the n nodes in G, their algorithm constructs a coarsening cover Z by repeatedly
merging clusters in S. For a parameter k ≥ 1, their algorithm returns a cover Z
with rad(Z) = O(kγ) and deg(Z) = O(k n^{1/k}) (the average degree is O(n^{1/k})). By choosing k = log n, the radius is O(γ log n) and the degree is O(log n). This is the
best known result for general graphs. For these graphs, there exists an inherent
tradeoff between the radius of a cover and its degree: a small degree may require a
large radius, and vice versa.
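A much-simplified sketch of the coarsening idea may help fix intuition. This is not the full algorithm of Awerbuch and Peleg [34], whose layered merging structure is what yields the stated radius and degree bounds; the greedy growth threshold n^(1/k), the unweighted hop-count neighborhoods, and all helper names here are simplifying assumptions. The sketch does, however, produce a valid γ-cover: every γ-neighborhood is absorbed wholesale into exactly one output cluster.

```python
from collections import deque

def gamma_neighborhood(adj, v, gamma):
    """Nodes within gamma hops of v (unweighted BFS)."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == gamma:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return seen

def coarsening_cover(adj, gamma, k):
    """Greedy sketch of coarsening: start from the gamma-neighborhoods and
    merge overlapping ones while a cluster keeps growing by factor n^(1/k)."""
    n = len(adj)
    pending = {v: gamma_neighborhood(adj, v, gamma) for v in adj}
    cover = []
    while pending:
        seed = next(iter(pending))
        cluster = set(pending.pop(seed))
        while True:
            prev = len(cluster)
            # absorb every remaining neighborhood that touches the cluster
            touching = [v for v, nb in pending.items() if nb & cluster]
            for v in touching:
                cluster |= pending.pop(v)
            # stop a cluster once growth drops below the sparsity threshold
            if not touching or len(cluster) < prev * n ** (1.0 / k):
                break
        cover.append(cluster)
    return cover
```

Because each γ-neighborhood is merged into a single cluster, every node is γ-satisfied by the result; bounding the radius and degree of the output is exactly the part that requires the careful construction of [34] rather than this greedy sketch.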
It is known ([47, Theorem 16.2.4]) that for every k ≥ 3, there exist graphs and values of γ (e.g., γ = 1) such that for every cover Z, if rad(Z) ≤ kγ, then deg(Z) = Ω(n^{1/k}). Thus, in these graphs, if rad(Z) = O(γ), then deg(Z) is polynomial in n.
2.2 Contributions
In light of the above tradeoff for arbitrary graphs, it is natural to ask whether
better sparse covers can be obtained for special classes of graphs. We answer this
question in the affirmative for the class of graphs that exclude a fixed minor. This
includes many popular graph families, such as: planar graphs, which exclude K5 and
K3,3, outerplanar graphs, which exclude K4 and K2,3, series-parallel graphs, which
exclude K4, and trees, which exclude K3.
We give improved bounds for planar graphs, unit disk graphs, and other graphs
excluding fixed minors (and improvements to distributed directory performance for
networks modeled by these graphs), and also a structural lower bound for sparse
covers in arbitrary graphs.
1. For any planar graph G, we present an algorithm for computing a sparse
cover Z with rad(Z) ≤ 24γ − 8 and deg(Z) ≤ 18. This cover is optimal
(modulo constant factors) with respect to both the degree and the radius. To
our knowledge, this is the first optimal construction for planar graphs (see
Section 2.8).
2. Using our sparse cover construction algorithm for planar graphs, we improve
distributed directory performance for networks that can be represented by
these graphs, achieving Stretch_find = O(1) and Stretch_move = O(log n).
3. For any unit disk graph G (or other graph with a constant-stretch planar spanner), we present a technique for computing a sparse cover Z with rad(Z) ≤ 24γt − 8 and deg(Z) ≤ 18 (for some constant t). This cover is optimal (modulo constant factors) with respect to both the degree and the radius. To our knowledge, this is the first optimal construction for unit disk graphs (see Section 2.9).
4. Using our sparse cover construction technique for unit disk graphs (and other
graphs with a constant-stretch planar spanner), we improve distributed di-
rectory performance for networks that can be represented by these graphs,
achieving Stretch_find = O(1) and Stretch_move = O(log n).
5. For any graph G that excludes a fixed minor graph H, we present an algorithm
for computing a sparse cover Z such that rad(Z) ≤ 4γ and deg(Z) = O(log n),
where n is the number of nodes in G. The constants in the degree bound
depend on the size of H (see Section 2.7).
6. Using our sparse cover construction algorithm for H-minor free graphs, we im-
prove distributed directory performance for networks that can be represented
by these graphs, achieving Stretch_find = O(log n) and Stretch_move = O(log n).
7. There exists a network with n nodes, and constrained by the locality parameter
γ and the maximum tolerable degree c, such that when clustered, there must
exist a cluster whose radius is Ω(γ log logc n) (see Section 2.5).
In each case the graphs are weighted, and the algorithms run in time polynomial in the size of G. For the class of H-minor free graphs, our construction
improves upon the previous work of Awerbuch and Peleg [34] by providing a smaller
radius. For planar graphs and unit disk graphs, our construction simultaneously
improves both the degree and the radius.
Related work in the area of sparse covers is presented next, in Section 2.3. Definitions and preliminaries can be found in Section 2.4. A structural lower bound is presented in Section 2.5. A technique for clustering shortest paths is described in Section 2.6.
Our sparse cover construction algorithm for graphs excluding a fixed minor can be
found in Section 2.7, for planar graphs in Section 2.8, and for unit disk graphs in
Section 2.9. We summarize the chapter in Section 2.10.
2.3 Related Work
Concurrently with our work, we became aware of a closely related work by Abraham, Gavoille, Malkhi, and Wieder [28] that gives an algorithm for constructing a sparse cover of diameter 4(r + 1)^2 γ and degree O(1) for any graph excluding Kr,r, for a fixed r > 1. Though the goal of both works is the same, ours yields different tradeoffs than [28]. For graphs excluding a fixed minor H, our algorithm returns a cover with radius at most 4γ, while their cover has a radius of 4(r + 1)^2 γ, which is clearly greater. On the other hand, their degree of O(1) is smaller than
ours of O(log n). We note that the constants for the degree are exponential in the
size of the excluded minor for both algorithms.
For planar graphs, our algorithm yields a much better tradeoff than [28]: we give a radius of no more than 24γ − 8 and a degree of no more than 18, while their cover (using r = 3, since a planar graph must exclude K3,3) gives a diameter of 64γ (which translates to a radius of at least 32γ) and a degree of 840 (this can be derived from the proof of Theorem 1.2 on page 6 of the technical report [28]).
Klein, Plotkin, and Rao [42] obtain sparse covers for H-minor free graphs
with degree O(1) but with a weak diameter O(γ), where the O(γ) length shortest
path between two nodes in the same cluster may not necessarily lie in the cluster
itself. For many applications of covers, such as compact routing and distributed
directories, this is not sufficient. In contrast, our construction yields clusters with a
strong diameter of O(γ) where the shortest path lies completely within the cluster.
For graphs with doubling dimension α, Abraham, Gavoille, Goldberg, and Malkhi [25] present a sparse cover with degree 4^α and radius O(γ). However, since
planar graphs and H-minor free graphs can have large doubling dimensions, this
does not yield efficient sparse covers for these graphs.
2.4 Definitions and Preliminaries
Some of the following definitions are borrowed from Awerbuch and Peleg [34]
and from Abraham and Gavoille [24].
2.4.1 Graph Basics
Consider a weighted graph G = (V,E, ω), where V is the set of nodes, E is
the set of edges, and ω is a weight function E → Z+ that assigns a weight ω(e) > 0
to every edge e ∈ E. An “unweighted” graph is a special case where ω(e) = 1 for all
e ∈ E. For simplicity, we will also write G = (V, E) and sometimes use the notation
v ∈ G to denote v ∈ V and e ∈ G to denote e ∈ E. For a graph H, we use the
notation V (H) and E(H) to denote the nodes and edges of H respectively.
A walk q is a sequence of nodes q = v1, v2, . . . , vk where nodes may be repeated.
The length of q is defined as length(q) = Σ_{i=1}^{k−1} ω(v_i, v_{i+1}). We also use walks with one node, q = v, where v ∈ V, for which length(q) = 0. If v_1 = v_k, the walk is
closed. A path is a walk with no repeated nodes.
Graph G is connected if there is a path between every pair of nodes. G′ = (V′, E′) is a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. If V′ ≠ V or E′ ≠ E, then G′ is said to be a proper subgraph of G. In the case where graph G is not connected, it consists of connected components G_1, G_2, . . . , G_k, where each G_i is a connected subgraph that is not a proper subgraph of any other connected subgraph of G. For any set of nodes V′ ⊆ V, the subgraph induced by V′ is G(V′) = (V′, E′), where E′ = {(u, v) ∈ E : u, v ∈ V′}. Let G − V′ = G(V − V′) denote the subgraph obtained by removing the vertex set V′ from G. For any subgraph G′ = (V′, E′), G − G′ = G − V′. For any two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), their union graph is G_1 ∪ G_2 = (V_1 ∪ V_2, E_1 ∪ E_2).
The distance between two nodes u, v in G, denoted dist_G(u, v), is the length of the shortest path between u and v in G. If there is no path connecting the nodes, then dist_G(u, v) = ∞. The j-neighborhood of a node v in G is N_j(v, G) = {w ∈ V : dist_G(v, w) ≤ j}. For V′ ⊆ V, the j-neighborhood of V′ in G is N_j(V′, G) = ⋃_{v∈V′} N_j(v, G).
If G is connected, the radius of a node v ∈ V with respect to G is rad(v, G) = max_{w∈V} dist_G(v, w). The radius of G is defined as rad(G) = min_{v∈V} rad(v, G). If G is not connected, then rad(G) = ∞. For every connected graph G, rad(G) ≤ diam(G) ≤ 2 · rad(G).
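The definitions above translate directly into code. The following sketch is illustrative only: the adjacency-list representation, the use of Dijkstra's algorithm, and the function names are assumptions for the example, not part of the thesis.

```python
import heapq
import math

def distances(adj, source):
    """Single-source shortest-path distances (Dijkstra) in a weighted graph.
    adj maps a node to a list of (neighbor, weight) pairs."""
    dist = {v: math.inf for v in adj}  # inf encodes "no path"
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def j_neighborhood(adj, v, j):
    """N_j(v, G) = { w : dist_G(v, w) <= j }."""
    d = distances(adj, v)
    return {w for w in adj if d[w] <= j}

def radius(adj):
    """rad(G) = min over v of the eccentricity max_w dist_G(v, w)."""
    return min(max(distances(adj, v).values()) for v in adj)
```

On the unit-weight path 0-1-2-3, for instance, the radius is 2 (attained at either center node) and N_1(1) is {0, 1, 2}, matching the definitions.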
2.4.2 Covers
Consider a set of vertices C ⊆ V in graph G = (V, E). The set C is called a cluster if the induced subgraph G(C) is connected. When the context is clear, we will use C to refer to G(C). Let Z = {C_1, C_2, . . . , C_k} be a set of clusters in G. For every node v ∈ G, let Z(v) ⊆ Z denote the set of clusters that contain v. The degree of v in Z is defined as deg(v, Z) = |Z(v)|. The degree of Z is defined as deg(Z) = max_{v∈V} deg(v, Z). The radius of Z is defined as rad(Z) = max_{C∈Z} rad(C).
For γ > 0, a set of clusters Z is said to γ-satisfy a node v in G, if there is a
cluster C ∈ Z, such that Nγ(v, G) ⊆ C. A set of clusters Z is said to be a γ-cover
for G, if every node of G is γ-satisfied by Z in G. We also say that Z γ-satisfies
a set of nodes X in G, if every node in X is γ-satisfied by Z in G (note that the
γ-neighborhood of the nodes in X is taken with respect to G).
An m-regional matching is a collection of read and write sets such that for any
pair of nodes u and v where distG(u, v) ≤ m, Read(u) and Write(v) intersect. The
radius of a regional matching is the furthest distance between any pair of nodes in
a read or write set, and the degree is the maximum number of nodes in such a set.
Given any sparse cover, a regional matching with the same radius and degree can
be easily constructed [35].
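To make the cover metrics of this subsection concrete, the following sketch checks deg(Z), rad(Z), and the γ-cover property on a small unweighted graph. All helper names are hypothetical, it measures distances in hops, and cluster_radius assumes each cluster induces a connected subgraph, as the definition of a cluster requires.

```python
from collections import deque

def hop_distances(adj, source, allowed=None):
    """BFS hop distances in the subgraph induced by `allowed` (all nodes if None)."""
    nodes = set(adj) if allowed is None else set(allowed)
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v in nodes and v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

def cover_degree(Z):
    """deg(Z): the maximum number of clusters any single node belongs to."""
    counts = {}
    for C in Z:
        for v in C:
            counts[v] = counts.get(v, 0) + 1
    return max(counts.values())

def cover_radius(adj, Z):
    """rad(Z): maximum cluster radius, measured inside the induced subgraph."""
    def cluster_radius(C):
        return min(max(hop_distances(adj, v, C).values()) for v in C)
    return max(cluster_radius(C) for C in Z)

def is_gamma_cover(adj, Z, gamma):
    """Every node's gamma-neighborhood (taken in G) must fit inside some cluster."""
    for v in adj:
        nb = {w for w, d in hop_distances(adj, v).items() if d <= gamma}
        if not any(nb <= set(C) for C in Z):
            return False
    return True
```

On a 6-cycle, the six overlapping triples {i−1, i, i+1} form a 1-cover with radius 1 and degree 3, whereas a partition into two disjoint halves fails the 1-cover test: partitions generally cannot γ-satisfy boundary nodes, which is why covers must overlap.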
2.4.3 Path Separators
A graph G with n nodes is k-path separable [24] if there exists a subgraph S, called the k-path separator, such that:
(i) S = P_1 ∪ P_2 ∪ · · · ∪ P_ℓ, where for each 1 ≤ i ≤ ℓ, subgraph P_i is the union of k_i paths, each of which is a shortest path in G − ⋃_{1≤j<i} P_j with respect to its end points,
(ii) Σ_i k_i ≤ k, and
(iii) either G − S is empty, or each connected component of G − S is k-path separable and has at most n/2 nodes.
For instance, any rectangular grid of nodes (2-dimensional mesh) is 1-path separable
by taking S to be the middle row path. Trees are also 1-path separable by taking S
to be the center node whose subtrees have at most n/2 nodes. Thorup [52] shows how
to compute in polynomial time a 3-path separator for planar graphs, in particular,
the 3-path separator is S = P1. That is, S consists of three paths each of which is
a shortest path in the original graph.
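The tree case mentioned above can be sketched directly: a standard centroid search (the code and its names are assumptions for illustration, not from the thesis) finds a node whose removal leaves components of at most n/2 nodes, witnessing that trees are 1-path separable.

```python
def tree_centroid(adj):
    """Find a node of a tree whose removal leaves components of <= n/2 nodes,
    making the tree 1-path separable with S = {that node}."""
    n = len(adj)
    # compute subtree sizes via an iterative DFS from an arbitrary root
    root = next(iter(adj))
    order, parent = [], {root: None}
    stack = [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)
    size = {u: 1 for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    # the centroid is the node whose largest hanging component has <= n/2 nodes
    for u in adj:
        below = [size[v] for v in adj[u] if v != parent[u]]
        above = n - size[u]
        if max(below + [above]) <= n // 2:
            return u
```

On the 5-node path 0-1-2-3-4, the middle node 2 is returned: removing it leaves two components of two nodes each, within the n/2 bound.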
2.4.4 Graph Minors
The contraction of edge e = (u, v) in G is the replacement of vertices u and v
by a single vertex whose incident edges are all the edges incident to u or to v except
for e. A graph H is said to be a minor of graph G, if H is a subgraph of a graph
obtained by a series of edge contractions starting from G. Graph G is said to be
H-minor free, if H is not a minor of G. Abraham and Gavoille [24] generalize the
result of Thorup [52] for the class of H-minor free graphs:
Theorem 2.4.1 (Abraham and Gavoille [24]) Every H-minor free connected graph
is k-path separable, for some k = k(H), and a k-path separator can be computed in
polynomial time.
The proof of Theorem 2.4.1 is based on the structure theorems for graphs
excluding minors of Robertson and Seymour [49, 50]. We note that in Theorem 2.4.1,
the parameter k is exponential in the size of the minor. Some interesting classes of
H-minor free graphs are: planar graphs, which exclude K5 and K3,3, outerplanar
graphs, which exclude K4 and K2,3, series-parallel graphs, which exclude K4, and
trees, which exclude K3.
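The contraction operation defined above can be sketched as follows; the set-based adjacency representation and the convention of naming the merged vertex u are assumptions made for the example.

```python
def contract_edge(adj, u, v):
    """Contract edge (u, v): replace u and v by a single vertex (named u here)
    incident to every edge formerly incident to u or v, except (u, v) itself."""
    merged = (adj[u] | adj[v]) - {u, v}
    new_adj = {}
    for w, nbrs in adj.items():
        if w in (u, v):
            continue
        new_nbrs = set(nbrs)
        # edges that pointed at u or v now point at the merged vertex u
        if u in new_nbrs or v in new_nbrs:
            new_nbrs -= {u, v}
            new_nbrs.add(u)
        new_adj[w] = new_nbrs
    new_adj[u] = merged
    return new_adj
```

For example, contracting one edge of a 4-cycle yields a triangle, which illustrates why K3 is a minor of every cycle (and hence why trees, which exclude K3, contain no cycle).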
2.5 A Structural Lower Bound for Sparse Covers
We now present a structural lower bound for sparse covers of arbitrary graphs
(Sparse Cover Contribution 7). As previously mentioned, the best known sparse
cover result for general graphs is due to Awerbuch and Peleg, and has a radius of
O(γ log n) and a degree of O(log n). In this section, we provide a lower bound on the
radius of clusters in an arbitrary graph, given the locality parameter γ and c, the
maximum tolerable degree. This is done using a recursive construction that allows
us to force a radius increase in certain clusters of the graph. That is, we provide a graph such that, no matter how it is clustered, at least one cluster's radius is greater than or equal to the lower bound.
For any possible cover S, the optimal radius is rad(S) = O(γ), and the optimal
degree is deg(S) = O(1). The lower bound given in this section proves that the ra-
dius and degree cannot both be optimized simultaneously when clustering arbitrary
graphs.
Theorem 2.5.1 There exists a network with n nodes, and constrained by param-
eters γ and c, such that when clustered, there must exist a cluster whose radius is
Ω(γ log logc n).
The lower bound is obtained from a recursive construction that, at each level, guarantees an increase in some cluster's radius. We start from an initial graph for which the radius of any cover S, rad(S), is known. We then replicate the graph many times and create connections between specific nodes. The manner in which this is done guarantees that when the new graph is clustered into a cover Z, rad(Z) is at least rad(S) + 2γ. This behavior is independent of the clustering algorithm, and thus proves a lower bound for arbitrary graphs.
2.5.1 Graph Construction
The construction takes in user-specified parameters γ (the locality parameter),
c (the maximum tolerable degree, c ≥ 1), and n (the number of nodes in the graph).
S_i = (V_i, E_i) will refer to the graph at level i of the construction. At each such level, we also have a set A_i ⊆ V_i containing all the anchor nodes of S_i. When we move from level i−1 to level i of the construction, we replicate S_{i−1} a total of (c + 1)|A_{i−1}| times, and connect the replicas together with gadgets. A gadget is a star network with rays of length γ, such that the tip of each ray is an anchor node of a replicated S_{i−1} graph (one ray connects each replica). A_i contains the anchor nodes from all the S_{i−1} replicas. Each replicated node is associated with its original, and when gadgets are added to the graph, their rays connect all nodes sharing the same association. We continue the recursion until all nodes in the network have been used. This process continuously increases the radius of some cluster in the graph, yielding a lower bound on the radius.
The recursion basis is at level 0. We initialize the graph to a single node v (which is also an anchor node): S_0 = ({v}, ∅) and A_0 = {v}. For each level i > 0, we create a set R_i comprised of δ = (c + 1)|A_{i−1}| replicas of S_{i−1}: R_i = {S^0_{i−1}, S^1_{i−1}, . . . , S^{δ−1}_{i−1}}. For each S^x_{i−1} ∈ R_i, there is a set of anchor nodes β_{i,x} = {α_0, α_1, . . . , α_{|A_{i−1}|−1}}. For any anchor node α_t and S_{i−1} replicas x and y, the nodes β^{α_t}_{i,x} and β^{α_t}_{i,y} are associated, since both x and y represent the same graph from the previous level of recursion.

Figure 2.1: S_0 and S_1. In S_1, S_0 is replicated c + 1 times and connected using the gadget G^0_1.

We complete the construction of the current level by connecting the S_{i−1} replicas using |A_{i−1}| gadgets (star networks with δ rays, each of length γ) in the following manner: for all j < |A_{i−1}|, create gadget G^j_i whose rays connect β^{α_j}_{i,0}, β^{α_j}_{i,1}, . . . , β^{α_j}_{i,δ−1}, and add it to the set G_i. Then S_i = R^0_i ∪ R^1_i ∪ · · · ∪ R^{δ−1}_i ∪ G^0_i ∪ G^1_i ∪ · · · ∪ G^{|A_{i−1}|−1}_i and A_i = β_{i,0} ∪ β_{i,1} ∪ · · · ∪ β_{i,δ−1}. While there still exist unused nodes in the network, increment i and continue to the next level.
2.5.2 Set Cardinalities
Based on the construction, it is clear that |A_0| = 1 and |A_1| = c + 1, and also that |V_0| = 1 and |V_1| = cγ + γ + 1. To calculate |A_i| and |V_i| for general i, we must account for several quantities. For |A_i|, we know that S_{i−1} has |A_{i−1}| anchor nodes, and that we replicated it a total of (c + 1)|A_{i−1}| times. Therefore, |A_i| = (c + 1)|A_{i−1}|^2. For |V_i|, we know we have (c + 1)|A_{i−1}| replicas, and that each replica has |V_{i−1}| nodes. We must also count gadget nodes. A gadget has (c + 1)|A_{i−1}| rays (as per the construction), and since each ray must be of length γ (one node of each ray has already been accounted for in the replicas, and we must add the center node r of each gadget), each gadget contains (c + 1)|A_{i−1}|(γ − 1) + 1 nodes. Lastly, since the previous level had |A_{i−1}| anchor nodes, we must have a total of |A_{i−1}| gadgets. Therefore, |V_i| = |V_{i−1}|(c + 1)|A_{i−1}| + |A_{i−1}|[(c + 1)|A_{i−1}|(γ − 1) + 1].

We first solve the recurrence for A: |A_i| = (c + 1)|A_{i−1}|^2. Observe that |A_0| = 1, |A_1| = (c + 1), |A_2| = (c + 1)^3, |A_3| = (c + 1)^7, |A_4| = (c + 1)^15, and so on.
Figure 2.2: The structure of a general S_i graph.
From this, we see that |A_i| = (c + 1)^{2^i − 1}, and that |A_i| = Θ((c + 1)^{2^i − 1}). We can now solve the recurrence for V, first finding an upper bound, then a lower bound. |V_i| = |V_{i−1}|(c + 1)|A_{i−1}| + |A_{i−1}|[(c + 1)|A_{i−1}|(γ − 1) + 1]. To simplify the algebra, let p = c + 1, so |A_{i−1}| = (c + 1)^{2^{i−1} − 1} = p^{2^{i−1} − 1}.

|V_i| = |V_{i−1}| · p · p^{2^{i−1} − 1} + p^{2^{i−1} − 1} · p · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^{i−1}} · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^i − 1} · (γ − 1) + p^{2^{i−1} − 1}
     ≤ |V_{i−1}| · p^{2^i} + p^{2^i} · (γ − 1) + p^{2^i}
     ≤ |V_{i−1}| · p^{2^i} + γ p^{2^i}
     ≤ 2|V_{i−1}| · p^{2^i}, since we know ∀i > 0, |V_i| > γ

From the above, observe that |V_0| = 1, |V_1| ≤ 2p^{2^i}, |V_2| ≤ 4p^{2·2^i}, |V_3| ≤ 8p^{3·2^i}, |V_4| ≤ 16p^{4·2^i}, and so on. From this, we see that |V_i| ≤ 2^i p^{i·2^i} = 2^i (c + 1)^{i·2^i}, and that |V_i| = O(2^i (c + 1)^{i·2^i}).
|V_i| = |V_{i−1}| · p · p^{2^{i−1} − 1} + p^{2^{i−1} − 1} · p · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^{i−1}} · p^{2^{i−1} − 1} · (γ − 1) + p^{2^{i−1} − 1}
     = |V_{i−1}| · p^{2^{i−1}} + p^{2^i − 1} · (γ − 1) + p^{2^{i−1} − 1}
     ≥ |V_{i−1}| · p^{2^{i−1}} + p^{2^i − 1} · (γ − 1)
     ≥ |V_{i−1}| · p^{2^{i−1}}

From the above, observe that |V_0| = 1, |V_1| ≥ p^{2^{i−1}}, |V_2| ≥ p^{2·2^{i−1}}, |V_3| ≥ p^{3·2^{i−1}}, |V_4| ≥ p^{4·2^{i−1}}, and so on. From this, we see that |V_i| ≥ p^{i·2^{i−1}} = (c + 1)^{i·2^{i−1}}, and that |V_i| = Ω((c + 1)^{i·2^{i−1}}).
Since |Vi| represents the total number of nodes at level i of the construction,
we can now solve for how many levels of the construction must be possible for a
network containing n nodes.
(c+1)^{i·2^{i-1}} ≤ n
i·2^{i-1} ≤ log_{c+1} n
2^{i-1} ≤ log_{c+1} n
i − 1 ≤ log_2 log_{c+1} n
i ≤ log_2 log_{c+1} n + 1

2^i(c+1)^{i·2^i} ≥ n
(c+1)^{2i·2^i} ≥ n, since we know c ≥ 1
2i·2^i ≥ log_{c+1} n
2^{2i} ≥ log_{c+1} n
2i ≥ log_2 log_{c+1} n
i ≥ (1/2)·log_2 log_{c+1} n
From this analysis, we see that i = O(log logc n) and also i = Ω(log logc n).
Therefore, i = Θ(log logc n).
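These recurrences are easy to check numerically. Below is a small sanity script (not part of the thesis; the parameter choices c = 2 and γ = 3 are illustrative) that unrolls A and V and compares them against the closed form for |A_i| and against conservative doubly exponential brackets for |V_i|; the asserted lower bracket p^{2^i−1} is the one that follows directly from multiplying out |V_i| ≥ |V_{i-1}|·p^{2^{i-1}}.

```python
# Sanity check (illustrative, not from the thesis) of the recurrences
#   A_i = (c+1) * A_{i-1}^2
#   V_i = V_{i-1}*(c+1)*A_{i-1} + A_{i-1}*((c+1)*A_{i-1}*(gamma-1) + 1)

def unroll(c, gamma, max_level):
    A, V = 1, 1                          # level 0: a single anchor node
    table = []
    for i in range(1, max_level + 1):
        V = V * (c + 1) * A + A * ((c + 1) * A * (gamma - 1) + 1)
        A = (c + 1) * A * A              # V_i uses A_{i-1}, so update V first
        table.append((i, A, V))
    return table

c, gamma = 2, 3                          # illustrative parameters
for i, A, V in unroll(c, gamma, 6):
    assert A == (c + 1) ** (2 ** i - 1)              # closed form for |A_i|
    assert V >= (c + 1) ** (2 ** i - 1)              # lower bracket p^(2^i - 1)
    assert V <= (2 ** i) * (c + 1) ** (i * 2 ** i)   # upper bound from the text
```

Both brackets are doubly exponential in i, which is what forces i = Θ(log log_c n) once the construction is limited to n nodes.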
2.5.3 Proving a Lower Bound
Let Γ be the graph at level i of the construction, consisting of the S_{i-1} subgraphs
R_i^0, R_i^1, …, R_i^{δ−1}, as well as the connecting gadgets. Let ∆ be a copy of Γ, except
without the gadgets (that is, ∆ contains only the S_{i-1} subgraphs).
Lemma 2.5.1 Suppose we have two graphs from the construction L and M (at level
i), such that M is a replica of L; ∃ nodes l1, l2 ∈ L that are a minimum of x hops
apart; ∃ nodes m1,m2 ∈ M that are a minimum of x hops apart; and both l1 and
m1 share the same replica association. When L and M are combined by a gadget g,
nodes l1 and m2 are a minimum of x + 2γ hops apart.
Proof: To create a path from l1 to m2, we must cross through the 2γ hops of
some gadget at least once (since these nodes lie in different S_{i-1} replicas and gadgets
are the only inter-replica connections). It is possible that our path will cross
through many gadgets; however, this behavior does not affect the following argument:
the new path must contain at least x + 2γ hops. The 2γ hops come from the crossover,
and the x hops come from the sum of the subpaths in L plus the sum of the subpaths in
M. This sum must be at least x hops, since l1 and m1 are both in the same
position of S_{i-1}, and we know that the distance from m1 to m2 is at least x hops.
If there existed a path shorter than x + 2γ hops from l1 to m2 through some gadget,
then there would exist a path shorter than x hops from m1 to m2 within M, a
contradiction. Therefore, including the required nodes to cross through some gadget
g, the minimum path from l1 to m2 contains at least x + 2γ hops.
Lemma 2.5.2 If two nodes of some R_i^j are at least x hops apart in ∆, then they
are at least x hops apart in Γ as well.

Proof: Suppose two nodes v1 and v2 are at least x hops apart in ∆ (this path must
be in the same R_i^j subgraph). For the sake of contradiction, suppose ∃ a path from
v1 to v2 in Γ that is less than x hops. Since Γ is the same as ∆ with the addition
of gadgets, this path must include nodes of some gadget (at least 4γ of them, since
the path must eventually come back). Since all R_i^j subgraphs are the same, the
Figure 2.3: This figure demonstrates one way path lengths may grow as described in Lemma 2.5.1.
sum of the subpaths in each used R_i^j must be less than or equal to x hops, or there
would exist a path from v1 to v2 in ∆ of less than x hops. Therefore, the path in Γ is
at least x + 4γ hops (under the assumption), a contradiction. Therefore, the path
from v1 to v2 in Γ is also at least x hops (and contains no gadget nodes).
Property 1: Every cover of R_i^j has at least two nodes in R_i^j, at least x
hops apart, that are satisfied in the same cluster.

Property 2: Every cover of Γ contains a cluster C that includes two nodes of R_i^j,
at least x hops apart, that are satisfied in C.
Lemma 2.5.3 If Property 1 is true, then Property 2 is true.
Proof: For the sake of contradiction, suppose that Property 2 is false. Then there
exists an R_i^j subgraph and a cover F_Γ of Γ, such that F_Γ does not contain any
cluster that satisfies any two nodes of R_i^j that are at least x hops apart (in R_i^j). We will
now transform the cover F_Γ to a cover F_∆ in graph ∆ as follows. Let Z be a cluster
in F_Γ, of the graph Γ. By removing the nodes and edges that belong to the gadgets
(of level i), Z is transformed to one or more clusters in ∆. Take a node v of Z, so
that it is in both Γ and ∆, and is satisfied in Z. Then one of the clusters in the
transformation of Z will satisfy v in graph ∆ too. The reason for this is as follows.
Let Q be the set of nodes that are at most γ hops away from v, and belong to ∆.
Note that the nodes in Q must belong to the same subgraph R_i^k that v belongs
to. Let Q′ denote the smallest connected subgraph of Z that contains Q (Q′ exists,
since v is satisfied in Z). We now show that Q = Q′. Suppose that Q ≠ Q′. Then
v connects to some node v′ of Q using gadget nodes. However, this is impossible,
since this would imply that the distance between v and v′ is at least 2γ. Therefore,
Q = Q′. Since Q is entirely in R_i^k and connected, Q will be completely within a
cluster of ∆ after the transformation of Z from Γ to ∆. Therefore, v will be satisfied
in one of the clusters of Z in the transformation.
Let F_∆ be the set of clusters containing the transformations of the clusters in
F_Γ from Γ to ∆. Then F_∆ is a cover of ∆, since every node of ∆ is satisfied in one
of the clusters of F_∆.
Let E_i be the set of clusters of F_∆ that belong to R_i^j. Clearly, E_i is a cover
for R_i^j in ∆. Therefore, Property 1 must hold for E_i, so at least two nodes
v1 and v2 that are a distance of at least x hops apart are satisfied in the same cluster of
E_i. These two nodes must then be satisfied in the same cluster in F_Γ, since E_i
is obtained by splitting clusters of F_Γ.

By Lemma 2.5.2, the distance between v1 and v2 in Γ is at least x hops. At
the same time, they are satisfied in the same cluster of F_Γ, a contradiction of our
assumption that F_Γ contains no such cluster.
Lemma 2.5.4 At each level i of the construction (except for level 0), ∃ nodes v1
and v2, such that the minimum distance between them is at least 2γi hops, and v1
and v2 are satisfied in the same cluster.
Proof: The base case is at level 1 of the construction. There are c + 1 anchor
nodes connected by the gadget G_1^0. Each such anchor node is 2γ hops away from
any other. If each were satisfied individually, the node at the center of the gadget
would be overlapped c+1 times. This exceeds the degree threshold and is therefore
impossible. So, two anchor nodes must be satisfied in the same cluster, and are 2γ
hops apart.
Suppose the statement is true at level i−1 of the construction. That is, there
exist two nodes v1 and v2, a distance of 2γ(i−1) hops apart, that are satisfied in
the same cluster.
Now consider moving to level i. There will be (c+1)|A_{i-1}| replicas of S_{i-1}.
So for each replica x, there exist nodes v_1^x and v_2^x that are 2γ(i−1) hops apart
and are satisfied in the same cluster. Now consider adding the gadgets (to complete
the construction of the level). From lemma 2.5.3, we know that each replica x must
still have two nodes that are 2γ(i− 1) hops apart and satisfied in the same cluster.
Since there are (c+1)|A_{i-1}| replicas, there must exist some gadget that connects at
least c+1 of these nodes, by the pigeonhole principle. If all these nodes were
satisfied in different clusters, the center of this gadget is overlapped c+1 times. This
exceeds the degree threshold and is therefore impossible. So at least two of these
nodes are satisfied in the same cluster, call them v1 and v2. Further, the distance
between these nodes is guaranteed to be 2γ(i−1)+2γ = 2γi hops from lemma 2.5.1.
Therefore, at level i of the construction, nodes v1 and v2 are 2γi hops apart, and
are satisfied in the same cluster.
Theorem 2.5.2 There exists a network with n nodes, and constrained by param-
eters γ and c, such that when clustered, there must exist a cluster whose radius is
Ω(γ log logc n).
Proof: We obtain the lower bound by determining how many levels of recursion
are possible when limited to the use of at most n nodes. Because of the nature of
the construction, this is done by solving the recurrence relation V (and A in the
process), and then determining how many levels of recursion are guaranteed to occur
(i). From Section 2.5.2, we see that i = Θ(log logc n). From lemma 2.5.4, at level i
of the construction, the minimum radius of some cluster is at least 2γi. Constrained
to n nodes, there can be log logc n levels of the construction. From this, we see the
minimum radius of some cluster is Ω(γ log logc n).
2.6 Shortest Path Clustering
Our algorithms for cover construction are based on a recursive application of
a basic routine called shortest-path clustering. We observe that it is easy to cluster
the γ-neighborhood of all nodes along a shortest path in the graph using clusters of
radius O(γ) and degree O(1). For a graph G, we first identify an appropriate set of
shortest paths P in G. We cluster the cγ-neighborhood (for a constant c) of every
path p ∈ P using shortest-path clustering, and then remove P together with its c′γ-
neighborhood from G, for some c′ < c. This gives residual connected components
G′_1, G′_2, …, G′_r that contain the remaining unclustered nodes as a subset. We apply
the same procedure recursively to each component G′_i by identifying appropriate
shortest paths in them. The algorithm terminates when there are no remaining
nodes.
Consider an arbitrary weighted graph G, and a shortest path p between a pair
of nodes in G. For any β > 0, we construct a set of clusters R, which β-satisfies
every node of p in G. The returned set R has a small radius, 2β, and a small degree,
3. Algorithm Shortest-Path-Cluster contains the details of the construction of R.
Lemma 2.6.1 establishes the correctness of the algorithm.
Algorithm 1: Shortest-Path-Cluster(G, p, β)
Input: Graph G; shortest path p ∈ G; parameter β > 0;
Output: A set of clusters that β-satisfies p;
Suppose p = v_1, v_2, …, v_ℓ;
// partition p into subpaths p_1, p_2, …, p_s of length at most β
i ← 1; j ← 1;
while i ≠ ℓ + 1 do
    Let p_j consist of all nodes v_k such that i ≤ k ≤ ℓ and dist_G(v_i, v_k) ≤ β;
    j ← j + 1;
    Let i be the smallest index such that i ≤ ℓ and v_i is not contained in any p_k for k < j; if no such i exists, then i ← ℓ + 1;
Let s denote the total number of subpaths p_1, p_2, …, p_s of p generated;
// cluster the subpaths
for i = 1 to s do
    A_i ← N_β(p_i, G);
R ← ⋃_{1≤i≤s} A_i;
return R;
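The routine can be rendered in executable form. The sketch below is my own rendering (not the thesis code) under simplifying assumptions: the graph is unweighted and undirected, given as an adjacency dict, so hop counts play the role of dist_G; on a shortest path, the nodes within β of the pivot form a contiguous block, which simplifies the subpath partition.

```python
from collections import deque

def bfs_dist(adj, sources):
    """Hop distances from a set of source nodes (multi-source BFS)."""
    dist = {s: 0 for s in sources}
    q = deque(sources)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def shortest_path_cluster(adj, path, beta):
    """Partition `path` into subpaths of radius <= beta around a pivot,
    then return the beta-neighborhood of each subpath as a cluster."""
    clusters, i = [], 0
    while i < len(path):
        d = bfs_dist(adj, [path[i]])           # distances from pivot v_i
        j = i
        while j < len(path) and d[path[j]] <= beta:
            j += 1                             # subpath p_t = path[i:j]
        nd = bfs_dist(adj, path[i:j])          # N_beta(p_t, G)
        clusters.append({u for u, x in nd.items() if x <= beta})
        i = j
    return clusters

# A 10-node path graph: each node should land in at most 3 clusters,
# matching the degree bound of Lemma 2.6.1.
adj = {v: [u for u in (v - 1, v + 1) if 0 <= u <= 9] for v in range(10)}
clusters = shortest_path_cluster(adj, list(range(10)), beta=2)
assert max(sum(v in C for C in clusters) for v in range(10)) <= 3
```

Each cluster is the β-neighborhood of a subpath of radius at most β, so its radius is at most 2β, as property ii of Lemma 2.6.1 states.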
Lemma 2.6.1 For any graph G, shortest path p ∈ G, and β > 0, the set R returned
by Algorithm Shortest-Path-Cluster(G, p, β) has the following properties: (i) R is a
set of clusters that β-satisfies p in G; (ii) rad(R) ≤ 2β; (iii) deg(R) ≤ 3.
Proof: For property i, it is easy to see that R is a set of clusters, since each A_i
is a connected subgraph of G consisting of the β-neighborhood of a subpath p_i of
Figure 2.4: A demonstration of the proof of property iii of Lemma 2.6.1.
p. For each node v ∈ p_i, A_i β-satisfies v in G, since it contains N_β(v, G). Thus, R
β-satisfies p in G.
For property ii, we show that each cluster A_i has radius no more than 2β. Let
v_i be an arbitrary vertex in p_i. By the construction, for any node v ∈ p_i, it must be
true that dist_G(v_i, v) ≤ β. Since any node u ∈ A_i is at a distance of no more than
β from some node in p_i, there is a path of length at most 2β from v_i to u. Thus,
rad(R) ≤ 2β.

For property iii, suppose for the sake of contradiction that deg(R) ≥ 4 (see
Figure 2.4). Let v be a node with degree deg(v, R) = deg(R). Then v belongs to at
least 4 clusters, say A_i, A_j, A_k, and A_l, with i < j < k < l. Since v belongs to A_i,
there is a path q_i of length at most β between v and some node v_i ∈ p_i. Similarly,
there exists a path q_l of length at most β between v and some node v_l ∈ p_l. By
concatenating q_i and q_l, we obtain a path of length at most 2β connecting v_i and
v_l. On the other hand, both v_i and v_l lie on p, which is a shortest path in G, and
hence the path from v_i to v_l on p must be a shortest path from v_i to v_l. Let v_j
and v_k denote the nodes on p_j and p_k respectively that are closest to v_i. By the
construction, dist_G(v_j, v_k) > β, since otherwise v_k would have been included in
p_j. Similarly, dist_G(v_k, v_l) > β. Since dist_G(v_i, v_l) > dist_G(v_j, v_k) + dist_G(v_k, v_l), it
follows that dist_G(v_i, v_l) > 2β, a contradiction. Thus, deg(R) ≤ 3.
2.7 Cover for k-Path Separable Graphs
We now present Algorithm Separator-Cover, which returns a cover with a small
radius and degree for any graph that has a k-path separator (Sparse Cover Contribution 5).
Theorem 2.7.1 establishes the correctness and properties of the algorithm,
and uses Lemma 2.7.1, which gives some useful properties about clusters.
Algorithm 2: Separator-Cover(G, γ)
Input: Connected graph G that is k-path separable; locality parameter γ > 0;
Output: γ-cover for G;
// base case
if G consists of a single vertex v then
    Z ← v;
    return Z;
// main case
Let S = P_1 ∪ P_2 ∪ · · · ∪ P_l be a k-path separator of G;
for i = 1 to l do
    foreach p ∈ P_i do
        A_i ← Shortest-Path-Cluster(G − ⋃_{1≤j<i} P_j, p, 2γ);
A ← ⋃_{1≤i≤l} A_i;
G′ ← G − ⋃_{1≤j≤l} P_j;
// recursively cluster each connected component
Let G′_1, G′_2, …, G′_r denote the connected components of G′;
B ← ⋃_{1≤i≤r} Separator-Cover(G′_i, γ);
Z ← A ∪ B;
return Z;
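The recursion can be sketched as follows. This is a schematic rendering of my own, not the thesis code: computing a k-path separator is itself nontrivial, so the separator routine and the path-clustering routine are passed in as functions (both names are hypothetical), and the graph is an adjacency dict restricted on the fly to the surviving node set.

```python
from collections import deque

def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    left, comps = set(nodes), []
    while left:
        s = left.pop()
        comp, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v in left:
                    left.discard(v)
                    comp.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

def separator_cover(nodes, adj, gamma, find_separator, cluster_path):
    """Skeleton of Separator-Cover. `find_separator(nodes, adj)` yields the
    shortest paths of a k-path separator of the induced subgraph;
    `cluster_path(nodes, adj, p, beta)` stands in for Shortest-Path-Cluster
    restricted to `nodes`."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return [nodes] if nodes else []        # base case: single vertex
    cover, remaining = [], set(nodes)
    for p in find_separator(nodes, adj):       # S = P_1 u ... u P_l
        cover += cluster_path(remaining, adj, p, 2 * gamma)
        remaining -= set(p)                    # earlier paths are removed
    for comp in components(remaining, adj):    # recurse on components of G'
        cover += separator_cover(comp, adj, gamma,
                                 find_separator, cluster_path)
    return cover
```

Because the separator halves the vertex count at each level, the recursion depth is at most lg n, which is where the 3k(lg n + 1) degree bound of Theorem 2.7.1 comes from.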
Lemma 2.7.1 Let C be a set of clusters that 2γ-satisfies a set of nodes W in graph
G. If some set of clusters D is a γ-cover for G−W , then C ∪D is a γ-cover for G.
Proof: Since C 2γ-satisfies W in G, C also γ-satisfies N_γ(W,G) in G. Thus, C
γ-satisfies W ∪ N_γ(W,G) in G. Next, consider a vertex u ∈ G − (W ∪ N_γ(W,G)).
For any vertex u′ ∈ W, it must be true that u′ ∉ N_γ(u,G), since u ∉ N_γ(W,G),
implying that u ∉ N_γ(u′,G). Thus, N_γ(u,G) lies completely in G − W. Since D is
a γ-cover for G − W, for every vertex u ∈ (G − W) − N_γ(W,G), D γ-satisfies u in
G − W, and hence in G. For any u′ ∈ W ∪ N_γ(W,G), C γ-satisfies u′ in G. Thus,
for any v ∈ G, C ∪ D γ-satisfies v in G, and is therefore a γ-cover for G.
Theorem 2.7.1 For any connected k-path separable graph G with n nodes, and
locality parameter γ > 0, Algorithm Separator-Cover(G, γ) returns a set Z with the
following properties: (i) Z is a γ-cover for G; (ii) rad(Z) ≤ 4γ; (iii) deg(Z) ≤ 3k(lg n + 1).
Proof: For property i, the proof is by induction on the number of vertices in G.
The base case is when G has only one vertex, in which case the algorithm is clearly
correct. For the inductive case, suppose that for every k-path separable graph with
less than n vertices, the algorithm returns a γ-cover for the graph. Let G be a
k-path separable graph with n vertices.
The last part of the algorithm recursively calls Separator-Cover on every con-
nected component in G′. Since the number of vertices in G′ is less than n, the
number of vertices in each component G′_i is less than n. By the inductive assumption,
for each i = 1, 2, …, r, Separator-Cover(G′_i, γ) returns a γ-cover for G′_i. The
union of the γ-covers for the connected components of G′ is clearly a γ-cover for G′,
hence B is a γ-cover for G′.
For i = 1, 2, …, l+1, define G_i = G − ⋃_{1≤j<i} P_j. Clearly, G_1 = G and
G_{l+1} = G′. We will prove that for all i such that 1 ≤ i ≤ l+1, the set ⋃_{i≤j≤l} A_j ∪ B
is a γ-cover for G_i. The proof is through reverse induction on i, starting from i = l+1
and going down to i = 1. The base case i = l+1 is clear, since B is a γ-cover for
G′ = G_{l+1}. Suppose the above statement is true for i = ν, i.e., A_ν ∪ A_{ν+1} ∪ … ∪ A_l ∪ B
is a γ-cover for G_ν. Consider G_{ν−1} = G_ν ∪ P_{ν−1}. From the correctness of Algorithm
Shortest-Path-Cluster (proven in Lemma 2.6.1), we have that A_{ν−1} 2γ-satisfies P_{ν−1}
in G_{ν−1}. Since A_ν ∪ A_{ν+1} ∪ … ∪ A_l ∪ B is a γ-cover for G_{ν−1} − P_{ν−1}, using Lemma 2.7.1
we have that A_{ν−1} ∪ A_ν ∪ … ∪ A_l ∪ B is a γ-cover for G_{ν−1}, thereby proving the inductive
step. Thus, ⋃_{1≤j≤l} A_j ∪ B is a γ-cover for G_1 = G, proving the correctness
of the algorithm for graph G with n vertices.
For property ii, we note that each cluster is obtained from an invocation of
Algorithm Shortest-Path-Cluster with input argument β = 2γ. From Lemma 2.6.1,
the radius of each cluster is at most 2β = 4γ. Thus, rad(Z) ≤ 4γ.
For property iii, we visualize the recursive invocations of the algorithm as a
tree T , where each node is associated with an input graph on an invocation of the
recursive algorithm. For each node v ∈ T , let G(v) denote the associated input
graph and N(v) denote the number of vertices in G(v). Let r denote the root, thus
G(r) = G. Clearly, for each vertex v ∈ T , G(v) is a connected subgraph in G, and
the leaves represent components that require no further recursive calls. The depth
of any node in T is defined as the distance from the root. The depth of the tree is
defined as the maximum depth of any node.
For any node v ∈ T , by the property of the path separator, we have for each
child v′ of v, N(v′) ≤ N(v)/2. Since N(r) = n, any node at a depth of i has at most
n/2i vertices. Since every leaf has at least 1 vertex, the depth of the tree is no more
than lg n.
Consider any node u ∈ G. Suppose u belongs to G(v) for some node v in
T . At v, clusters are formed by calling Shortest-Path-Cluster no more than k times.
From Lemma 2.6.1, u appears in no more than 3 clusters returned by each call of
Shortest-Path-Cluster. Thus, due to all clusters formed at any node v, u appears in
no more than 3k clusters. Further, if v1, v2, . . . , vx are the children of v, it is clear
that G(v1), G(v2), . . . , G(vx) are all disjoint from each other. Thus, u can belong
to at most one component among G(v1), G(v2), . . . , G(vx). Since the depth of T
is no more than lg n, node u can belong to G(v) for no more than lg n + 1 nodes
v ∈ T . Thus, u can belong to at most 3k(lg n + 1) clusters in total, implying that
deg(Z) ≤ 3k(lg n + 1).
Upon combining Theorem 2.7.1 with Theorem 2.4.1, we get the following.
Theorem 2.7.2 For any graph G that excludes a fixed size minor H, given a pa-
rameter γ > 0, there is an algorithm that returns in polynomial time a set of clusters
Z with the following properties: (i) Z is a γ-cover for G; (ii) rad(Z) ≤ 4γ; (iii)
deg(Z) ≤ 3k(lg n + 1); where k = k(H) is a parameter that depends on the size of
the excluded minor H.
2.8 Cover for Planar Graphs
Since every planar graph is 3-path separable [52], Theorem 2.7.1 immediately
yields a γ-cover for a planar graph with radius O(γ) and degree O(log n). In this
section, we present an improved cover for planar graphs whose radius is O(γ) and
degree O(1), both of which are optimal up to constant factors (Sparse Cover Contribution 1).
Consider a connected and weighted planar graph G = (V,E). If G is not con-
nected, then it can be handled by clustering each connected component separately.
Consider also an embedding of G in the Euclidean plane where no two edges cross
each other. In the following discussion, we use G to refer to the planar embedding
of the graph. Clearly, any subgraph of G is also planar.
The edges of G divide the Euclidean plane into closed geometric regions called
faces. The external face is a special face that surrounds the whole graph; the other
faces are internal. A node may belong to multiple faces, while an edge belongs to at
most two faces. A node or edge that belongs to the external face will be called external.
For any node v ∈ G, we denote by depth(v, G) the shortest distance between
v and an external node of G. We also define depth(G) = max_{v∈V} depth(v, G); note
that depth(G) ≥ 0.
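For unweighted graphs, depth(·, G) is exactly a multi-source BFS from the external nodes. The sketch below is mine, not the thesis code; the adjacency-dict representation and the externally supplied set of external nodes (which would come from the planar embedding) are assumptions.

```python
from collections import deque

def depths(adj, external):
    """depth(v, G) for every node: hop distance to the nearest external
    node, via multi-source BFS started from all external nodes at once."""
    dist = {v: 0 for v in external}
    q = deque(external)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def graph_depth(adj, external):
    return max(depths(adj, external).values())   # depth(G)

# 3x3 grid: the eight boundary nodes are external, the center has depth 1.
grid = {(r, c): [(r + dr, c + dc)
                 for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= r + dr <= 2 and 0 <= c + dc <= 2]
        for r in range(3) for c in range(3)}
boundary = [v for v in grid if v != (1, 1)]
assert graph_depth(grid, boundary) == 1
```

This is the quantity that Algorithm Depth-Cover, below, requires to be at most γ.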
2.8.1 Basic Results for Planar Graphs
Here we prove some basic properties of planar graphs that will be used in the
correctness and performance analysis of our algorithms for planar graphs.
For any planar graph G, it holds that the subgraph consisting of the edges
of a face is connected. This observation also holds for the subgraph induced by
the edges of the external face. The intersection of any two graphs G1 and G2 is
denoted G1 ∩ G2 = (V1 ∩ V2, E1 ∩ E2). The following lemma can be easily verified
as a property of all planar graphs.
Lemma 2.8.1 Let G′ be a subgraph of a planar graph G. If v ∈ G ∩ G′, and v is
external in G, then v is external in G′ too.
Consider now a connected planar graph C that consists of two connected
subgraphs A and B that are node-disjoint, and a set of edges Y , which is an edge-
cut between A and B (the removal of Y partitions C into A and B). Further, each
of A and B contain at least one node external to C. Let Y ′ denote the edges of Y
that are external in C.
Lemma 2.8.2 For any two nodes u, v ∈ A ∩ C that are external in C, there exists
a walk w = u, x1, x2, . . . , xk, v, with k ≥ 0, such that xi ∈ A, and each edge of the
walk is external in C.
Proof: Suppose for the sake of contradiction that there exist two nodes u, v ∈ A ∩ C
that are external in C, such that there does not exist a walk w = u, x_1, x_2, …, x_k, v,
with k ≥ 0, such that x_i ∈ A, and each edge of the walk is external in C. Let f_A
be the external face of A, and f_C be the external face of C. Let S be the set of
connected components (we will refer to them as segments) in f_A ∩ f_C (all the nodes
and edges that are external in both A and C). Let s_u ∈ S be the segment that
contains u. Similarly, let s_v ∈ S be the segment that contains v.
We know that in C, there exists a walk of external edges that connects s_u to
s_v. Thus, in A, external edges have been removed (edges from Y′). All removed
edges span from A to B. Let l_u, r_u, l_v, r_v ∈ B and e_{l_u}, e_{r_u}, e_{l_v}, e_{r_v} ∈ Y′ be removed
edges (see Figure 2.5) (note it is possible that r_u = r_v, and that l_u = l_v).
Since B is connected, there exists a walk from l_u to l_v residing entirely in B. This
walk cannot go through A since V(A) ∩ V(B) = ∅, so it can go directly from l_u to
l_v, or all the way around A (see Figure 2.5). If it goes all the way around A, it must
enclose e_{r_u} and e_{r_v}, since this walk cannot include the end nodes of s_u or s_v because
they are in A. Hence, e_{r_u} and e_{r_v} are not in the external face of C, and could not
have been in Y′, a contradiction. Therefore, the walk must go directly from l_u to l_v.
Similarly, since B is connected, there exists a walk from r_u to r_v residing
entirely in B. By symmetry, this walk goes directly from r_u to r_v as well, without
going all the way around A. Once again, we know B is connected, so there must
exist a walk from l_u to r_u residing entirely in B (see Figure 2.5). If the walk goes
directly from l_u to r_u, it must enclose the external segment s_u, a contradiction. So
the walk must go all the way around A, and therefore encloses the external segment
s_v, a contradiction.
Therefore, for any two nodes u, v ∈ A ∩ C that are external in C, there exists
a walk w = u, x_1, x_2, …, x_k, v, with k ≥ 0, such that x_i ∈ A, and each edge of the
walk is external in C.
Lemma 2.8.3 1 ≤ |Y ′| ≤ 2.
Proof: First, we show that |Y′| ≥ 1. Let f_C be the external face of C. Let u ∈ A
and v ∈ B be two external nodes in C. Clearly, u, v ∈ f_C. Since f_C is connected,
Figure 2.5: For Lemma 2.8.2: the figure on the left shows a configuration of removed edges that are external in C and span from A to B (note, if the lemma were not true, B would be disconnected), the figure in the middle demonstrates the walk options from l_u to l_v, and the figure on the right demonstrates the walk options from l_u to r_u.
Figure 2.6: For Case 2 of Lemma 2.8.3: the figure on the left demonstrates a possible setup, and the figure on the right demonstrates one of the two possible path configurations.
there is a path p connecting u and v. Since Y is an edge-cut for A and B, p contains
an edge in Y . Thus, one of the edges of Y is external in C, which implies that
|Y ′| ≥ 1.
We now show that |Y′| ≤ 2. Suppose for the sake of contradiction that |Y′| > 2.
Choose two edges e_1 = (u_1, v_1) and e_2 = (u_2, v_2), where e_1, e_2 ∈ Y′, u_1, u_2 ∈ A,
and v_1, v_2 ∈ B. Let p be the walk from u_1 to u_2 consisting only of edges external
in A and in C. Similarly, let q be the walk from v_1 to v_2 consisting only of edges
external in B and in C. We know these walks exist from Lemma 2.8.2. Construct
a closed walk w using the edges in p ∪ q ∪ {e_1, e_2}.
There are two cases to examine:
Case 1: w is an external face of C.
There exists an external edge e ∈ Y′ such that e ≠ e_1 and e ≠ e_2. w does not
contain e, since e ∉ A and e ∉ B. Therefore, e is not in the external face of C,
a contradiction.
Case 2: w is not an external face of C.
That is, there exists an external edge e = (u_e, v_e), such that e ∈ Y′, u_e ∈ A,
v_e ∈ B, and e is not contained within w. Since u_e ∈ A, there must exist a walk
of external edges p_e from u_e to some node w_a belonging to w within A, such
that E(p_e) ∩ E(w) = ∅, V(p_e) ∩ V(w) = {w_a}, and p_e is the shortest such
walk. Similarly, since v_e ∈ B, there must exist a walk of external edges q_e
from v_e to some node w_b belonging to w within B, such that E(q_e) ∩ E(w) =
∅, V(q_e) ∩ V(w) = {w_b}, and q_e is the shortest such walk (see Figure
2.6). These walks exist from Lemma 2.8.2. Within w, there exist two walks
consisting entirely of external edges from w_a to w_b: one goes through the edge
e_1, and the other through the edge e_2 (from Lemma 2.8.2). Take the shortest
such walks and call them w_1 and w_2 respectively. It is clear that w = w_1 ∪ w_2.
Let w_n be the walk consisting of the walk from w_a to u_e (p_e), the edge e, and
the walk from v_e to w_b (q_e). We now have three walks, w_1, w_2, and w_n, that
connect w_a to w_b. The subpaths belonging to A may have common nodes and
edges, and the subpaths belonging to B may have common nodes and edges.
However, each walk has a unique external edge (w_1 has e_1, w_2 has e_2, and w_n
has e). In any possible configuration, one of these external edges (either e_1 or
e_2) is completely enclosed by the other two walks (see Figure 2.6), and is
therefore not in the external face of C, a contradiction.
Since in both cases we obtained a contradiction, |Y ′| ≤ 2.
Let V_B be the nodes adjacent to the edges in Y′ that are in the graph B.
From Lemma 2.8.3, 1 ≤ |V_B| ≤ 2. Let p_B ∈ B be a shortest path connecting the
nodes in V_B. Let q = v_1, v_2, …, v_k be any path in B with the following properties:
p_B and q do not intersect (they have no nodes in common), and v_1 is adjacent to an
edge in Y.
Figure 2.7: This figure demonstrates the subgraphs and paths described in Lemma 2.8.4.
Lemma 2.8.4 Node v_k belongs to a connected component of B − p_B that does not
contain any external nodes of C.

Proof: Let V_A denote the nodes of A adjacent to Y′. From Lemma 2.8.3, 1 ≤
|V_A| ≤ 2. Let p_A denote a shortest path between the nodes in V_A. The union of the
edges of Y′, p_A, and p_B induces a connected subgraph Ĉ of C. Let W denote the
set of nodes of C that are contained inside the internal faces (if they exist) of Ĉ.
Finally, let D denote the subgraph of C that is induced by the union of the nodes
in W and Ĉ.
Now, we show that all the edges of Y are members of D. Suppose for the sake
of contradiction that there exists some edge e = (u, v), where e ∈ Y, u ∈ A, v ∈ B,
and e ∉ D. Consider first the case where |Y′| = 1, say Y′ = {e′} with e′ = (u′, v′),
u′ ∈ A and v′ ∈ B. We have that p_B = v′. Thus, q intersects p_B, a contradiction.
Consider now the case where |Y′| = 2. Suppose that Y′ = {e_1, e_2}. Since A is connected,
there is a path α ∈ A that connects edge e to a node in p_A; similarly, there is a
path β ∈ B that connects edge e to a node in p_B (see Figure 2.7). This implies that
either e_1 or e_2 is not in the external face of C, a contradiction. Therefore, all the
edges of Y are members of D.
Since v_1 is adjacent to an edge in Y, we have that v_1 ∈ D. Since q does not
intersect p_B, each node of q is a member of D, that is, q ∈ D. Let W_B denote the
nodes of W that are members of B. The nodes of q are actually members of W_B,
since none of the nodes of q are external in D. Since the nodes of W_B are separated
by the path p_B from the remaining nodes of B, in B − p_B the nodes of W_B are in
connected components consisting only of nodes of W_B. These connected components
do not contain any external nodes of C, since W does not contain external nodes of
C. Therefore, v_k will belong to such a connected component in B − p_B.
2.8.2 High Level Description of the Algorithm
At a high level, our cover algorithm breaks up a planar graph G into many
overlapping planar subgraphs called zones, such that: (i) the depth of each zone is
not much greater than γ, (ii) each zone overlaps with a small number of other zones,
and (iii) clustering each zone separately is sufficient to cluster the whole graph. This
way, we can focus on clustering only planar graphs whose depth is not much more
than γ. Thus, our algorithm is divided into two main parts:
• Algorithm Depth-Cover, which clusters graph G with depth(G) ≤ γ, and
• Algorithm Planar-Cover, which clusters arbitrary planar graphs using Depth-
Cover as a subroutine.
We now proceed to describe Algorithms Depth-Cover and Planar-Cover in Sections
2.8.3 and 2.8.5 respectively.
2.8.3 Algorithm Depth-Cover
We now present Algorithm Depth-Cover, which constructs a γ-cover for a pla-
nar graph G where γ ≥ max(depth(G), 1). The resulting cover has radius no more
than 8γ and degree no more than 6. We describe the intuition here, and the
algorithm is formally described in Algorithm 3, which uses Algorithm Subgraph-
Clustering as a subroutine to do most of the work.
Depth-Cover allows us to focus on satisfying only the external nodes in G.
Since depth(G) ≤ γ, if a set of clusters S 2γ-satisfies every external node in the
graph, then S is a γ-cover for G. The reason is that every internal node u is within
a distance of γ from some external node v, and the cluster that contains the 2γ-
neighborhood of v will also contain the γ-neighborhood of u, and will γ-satisfy u.
We now focus on constructing a set of clusters that 2γ-satisfies each external node
of G.
The algorithm begins by selecting an arbitrary external node of G, which is
also trivially a shortest path p in G. Through shortest-path clustering, it constructs
a set of clusters I that 4γ-satisfies p in G, and deletes A, the 2γ-neighborhood of
p in G. Let the resulting connected components in G − A be B = {B_1, B_2, …, B_x}.
By Lemma 2.7.1, the union of 2γ-covers of the B_i components with I results in a
2γ-cover of G. Further, since we are only interested in 2γ-satisfying every external
node of G, we need not further consider any component in B that does not contain
an external node of G. Thus, the algorithm proceeds by recursively clustering every
component in B that contains at least one external node of graph G.
Let B ∈ B be a component with at least one external node of G. The recursive
invocation of the algorithm in B requires the selection of a shortest path p_B ∈ B
(the path is shortest with respect to its end points). The path p_B is selected as
follows. Suppose Y is an edge-cut between A and B (see Figure 2.8.a). Let Y ′ be
the external edges of Y with respect to G. From Lemma 2.8.3, 1 ≤ |Y ′| ≤ 2. Let
V_B be the set of nodes in B that are endpoints of edges in Y′; we have 1 ≤ |V_B| ≤ 2.
Path p_B is selected to be a shortest path in B between nodes in V_B (if V_B has
only one vertex, then p_B consists of a single node). For example, in Figure 2.8.a,
V_{B_1} = {v_2, v_3}.

Lemma 2.8.4 proves that for every node v ∈ I where v ∉ A, it holds that either:
(i) v appears in the 2γ-neighborhood of p_B for one of the connected components
B = B_i, or (ii) v is in a connected component B′ that does not contain any external
nodes of G (for example, see component B′_2 in Figure 2.8.c). In either case, node v
will be removed in the next recursive call, which deletes the 2γ-neighborhood of p_B.
Thus, v participates in at most two shortest-path clusterings (of p and pB) and is
satisfied by at least one of these two clusterings. Since each instance of shortest-path
clustering contributed at most 3 to the degree of v, the total degree of v is bounded
by 6.
It is useful to compare the algorithm for clustering a planar graph with shortest-
path clustering using path separators, as in Section 2.7. When separators are used,
the graph is decomposed into small pieces upon the removal of the separator (which
is a set of shortest paths), and the depth of this recursion is bounded by lg n. How-
ever, a vertex of the graph may be involved in clusters due to lg n such separators.
In the planar graph case, the resulting components Bi are not necessarily much
smaller than G, but the shortest paths are chosen so that the resulting clusters have
little overlap.
Figure 2.8 depicts an example execution of Algorithm Depth-Cover with the
first invocation (Figures 2.8.a and 2.8.b) and the second invocation (Figures 2.8.c
and 2.8.d) of the subroutine Subgraph-Clustering.
Figure 2.8: Execution example of Algorithm Subgraph-Clustering. Panels (a) and (b) show the first invocation, Subgraph-Clustering(G, G, v1, γ); panels (c) and (d) show the second invocation, Subgraph-Clustering(G, B1, pB1, γ).
Algorithm Subgraph-Clustering(G,H, p, γ) is recursive, and parameters G and
γ remain unchanged at each recursive invocation, while H and p change. Parameter
H is the subgraph of G with at least one external node of G, and it is required to
2γ-satisfy all nodes in H that are external nodes of G. Parameter p is a shortest
path in H that will be used for clustering in the current invocation. Initially, H = G
and p = v1, where v1 is an arbitrary external node of G.
Algorithm 3: Depth-Cover(G, γ)
Input: Connected planar graph G; locality parameter γ ≥ max(depth(G), 1);
Output: A γ-cover for G;
1: Let v be an external node of G;
2: Z ← Subgraph-Clustering(G, G, v, γ);
3: return Z;
2.8.4 Analysis
We continue by proving Theorem 2.8.1, which bounds the radius and degree of the resulting covers from Algorithm Depth-Cover. Similar to the analysis of
Algorithm Separator-Cover, it is convenient to represent the execution of Algorithm
Depth-Cover as a tree T , where each node in T corresponds to some invocation of the
Algorithm 4: Subgraph-Clustering(G, H, p, γ)
Input: Connected planar graph G; connected subgraph H of G (consisting of vertices that are still unsatisfied); shortest path p ∈ H whose end nodes are external in H; locality parameter γ ≥ max(depth(G), 1);
1: I ← Shortest-Path-Cluster(H, p, 4γ);
2: A ← N2γ(p, H); H′ ← H − A;
3: J ← ∅;
4: foreach connected component B of H′ that contains at least one external node of G do
5:   Let Y be the edge-cut between A and B in subgraph H;
6:   Let Y′ ⊆ Y be the external edges of Y in subgraph H;
7:   Let VB be the nodes of B adjacent to the edges of Y′;
8:   Let pB be a shortest path in B that connects all the nodes in VB;
9:   J ← J ∪ Subgraph-Clustering(G, B, pB, γ);
10: return I ∪ J;
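To make the recursive control flow of Algorithms 3 and 4 concrete, the following Python sketch (our own rendering, not the thesis implementation) runs the recursion on an unweighted adjacency-list graph. Shortest-Path-Cluster (line 1) is simplified to a single cluster, the 4γ-ball of the path in H, and the shortest-path selection of lines 5-8 is reduced to a one-node path of B bordering A; all function names are ours.

```python
from collections import deque

def neighborhood(adj, sources, radius, allowed):
    """BFS ball: nodes of `allowed` within `radius` hops of `sources`,
    using only paths inside `allowed` (unit edge weights assumed)."""
    dist = {s: 0 for s in sources if s in allowed}
    q = deque(dist)
    while q:
        u = q.popleft()
        if dist[u] == radius:
            continue
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

def subgraph_clustering(adj, h_nodes, path, gamma, external):
    """Recursive skeleton of Algorithm 4. Shortest-Path-Cluster is
    simplified to one cluster: the 4γ-ball of the path in H."""
    I = [neighborhood(adj, path, 4 * gamma, h_nodes)]    # line 1 (simplified)
    A = neighborhood(adj, path, 2 * gamma, h_nodes)      # line 2: A = N_2γ(p, H)
    J = []
    for B in components(adj, h_nodes - A):               # line 4
        if not (B & external):                           # discard B without external nodes
            continue
        # lines 5-8, simplified: use one node of B bordering A as pB
        pB = [next(v for v in B if any(u in A for u in adj[v]))]
        J += subgraph_clustering(adj, B, pB, gamma, external)  # line 9
    return I + J                                         # line 10
```

On a path graph with every node external, for example, the recursion repeatedly peels off 2γ-neighborhoods until the graph is exhausted, and the returned clusters together cover all nodes.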
subroutine Subgraph-Clustering. The root r of T corresponds to the first invocation
with parameters (G,G, v, γ). Suppose, for example, that in the first invocation the
removal of A creates two components H1 and H2 in G, for which the algorithm is
invoked recursively with parameters (G,H1, p1, γ) and (G,H2, p2, γ). Then, these
two invocations will correspond in T to the two children of the root. The leaf nodes
correspond to subgraphs Hi that cannot be decomposed further. Suppose that node
w ∈ T corresponds to invocation (G,H, p, γ). We will denote by H(w) the respective
input graph H, and we will use a similar notation to denote the remaining param-
eters and variables used in this invocation; for example, p(w) is the input shortest
path while A(w) is the respective 2γ-neighborhood of p(w) in H(w). As another
example, using this notation, the resulting set of clusters is Z = ⋃_{w∈T} I(w).
Lemma 2.8.5 For any node v ∈ G, there is a node w ∈ T such that Nγ(v,G) =
Nγ(v, H(w)) and v ∈ Nγ(A(w), H(w)).
Proof: By the construction of T , there is a path s = w1, w2, . . . , wk, such that:
s ∈ T , k ≥ 1, v ∈ H(wi) for 1 ≤ i ≤ k, w1 = r (the root of T ), wi is the parent of
wi+1 for 1 ≤ i ≤ k − 1, and wk does not have any child w′ with v ∈ H(w′).
By the construction of T and s, H(wi+1) ⊆ H(wi) for 1 ≤ i ≤ k − 1. Since
H(w1) = H(r) = G, Nγ(v, G) = Nγ(v, H(w1)). Let s′ = w1, w2, . . . , wk′, where 1 ≤ k′ ≤ k, be the longest subpath of s with the property that Nγ(v, G) = Nγ(v, H(wi))
for 1 ≤ i ≤ k′.
We examine two cases:
Case 1: k′ < k
It holds that v ∈ H(wk′), v ∈ H(wk′+1), Nγ(v, G) = Nγ(v, H(wk′)), and
Nγ(v, G) ≠ Nγ(v, H(wk′+1)). According to Algorithm Subgraph-Clustering, v
belongs to a connected component B of H ′(wk′), such that B contains an
external node of G. Note that B = H(wk′+1) and H ′(wk′) = H(wk′)−A(wk′).
Clearly, v ∉ A(wk′), or else k = k′. Since the γ-neighborhood of v changes
between H(wk′) and B = H(wk′+1), some node u ∈ Nγ(v, H(wk′)) must be
a member of A(wk′) (note that only the nodes of A(wk′) are removed from
H(wk′)). Thus, v ∈ Nγ(A(wk′), H(wk′)). Therefore, wk′ is the desired node of
T .
Case 2: k′ = k
In this case, it holds that v ∈ H(wk), no child w′ of wk has v ∈ H(w′), and
Nγ(v, G) = Nγ(v, H(wk)). According to Algorithm Subgraph-Clustering, there
are two possible scenarios:
Case 2.1: v ∈ A(wk)
This case trivially implies that v ∈ Nγ(A(wk), H(wk)). Thus, wk is the
desired node of T .
Case 2.2: v ∉ A(wk)
In this case, it holds that v belongs to a connected component X of
H ′(wk) = H(wk)−A(wk), such that X does not contain any external node
of G. Since depth(G) ≤ γ, there is a node x ∈ G that is external in G
and x ∈ Nγ(v, G). Since X does not contain any external node of G, x ∉ Nγ(v, X). Therefore, Nγ(v, X) ≠ Nγ(v, G) = Nγ(v, H(wk)). Thus, the
γ-neighborhood of v changes between H(wk) and X. Hence, some node
u ∈ Nγ(v,H(wk)) is also a member of A(wk) (note that only the nodes of
A(wk) are removed from H(wk)), which implies v ∈ Nγ(A(wk), H(wk)).
Therefore, wk is the desired node of T .
Consequently, wk′ is the desired node of T in all cases.
Lemma 2.8.6 Z is a γ-cover for G.
Proof: From Lemma 2.8.5, for each node v ∈ G there is a node w ∈ T such that
Nγ(v, G) = Nγ(v,H(w)) and v ∈ Nγ(A(w), H(w)). By Lemma 2.6.1, p(w) is 4γ-
satisfied by I(w) in H(w). Since A(w) = N2γ(p(w), H(w)), A(w) is 2γ-satisfied by
I(w) in H(w), which implies that v is γ-satisfied by I(w) in H(w). Since Nγ(v, G) =
Nγ(v, H(w)), I(w) also γ-satisfies v in G. Since Z = ⋃_{w∈T} I(w), Z is a γ-cover for G.
Lemma 2.8.7 rad(Z) ≤ 8γ.
Proof: We have that Z = ⋃_{w∈T} I(w), where each I(w) is obtained by an invocation
of Algorithm Shortest-Path-Cluster, with parameter β = 4γ. Therefore, by Lemma
2.6.1, for any w ∈ T , rad(I(w)) ≤ 2β = 8γ, which implies that rad(Z) ≤ 8γ.
Lemma 2.8.8 deg(Z) ≤ 6.
Proof: Consider an arbitrary node v ∈ G. We only need to show that deg(v, Z) ≤ 6. Let s = w1, w2, . . . , wk be the path in T as described in Lemma 2.8.5. According
to Algorithm Subgraph-Clustering, the only possible clusters that v can participate
in are I(w1), I(w2), . . . , I(wk). Let i denote the smallest index such that v ∈ I(wi).
We will show that i ∈ {k − 1, k}. We examine two cases:
Case 1: v ∈ A(wi)
In this case, v will be removed with A(wi), and therefore, v will not appear in any
child of wi. Consequently, wi = wk, hence, i = k.
Case 2: v ∉ A(wi)
In this case, v is a member of a connected component B of H ′(wi) = H(wi)−A(wi).
There are two subcases:
Case 2.1: B does not contain any external node of G
In this case, B is discarded, and therefore, v will not appear in any child of
wi. Consequently, wi = wk, hence, i = k.
Case 2.2: B contains an external node of G
If wi = wk, the situation is similar as above, with i = k. So suppose
that i < k. According to Algorithm Subgraph-Clustering, B = H(wi+1).
We will show that v ∈ A(wi+1), which implies that wi+1 = wk (the reason is similar to the case where v ∈ A(wi) above). Since v ∈ I(wi), v ∈ N4γ(p, H(wi)) = N2γ(A(wi), H(wi)). Thus, there is a node u ∈ A(wi) such
that v ∈ N2γ(u,H(wi)). Let g = u, x1, x2, . . . , xk, v be a shortest path be-
tween u and v in H(wi). Clearly, length(g) ≤ 2γ. Since u ∈ A(wi) and v
is a member of a connected component B of H ′(wi) = H(wi) − A(wi) with
an external node of G, the path g must contain an edge of Y (or else H(wi)
is disconnected). Choose the node xy such that xy ∈ g, xy ∈ B, and xy is
adjacent to some edge of Y . Now, let g′ = xy, xy+1, . . . , xk, v be a subpath of
g in B. Clearly, length(g′) ≤ 2γ as well.
Case 2.2.1: pB and g′ intersect
Then v ∈ N2γ(pB, B) = N2γ(pB, H(wi+1)). Thus, v ∈ A(wi+1). There-
fore, wi+1 = wk, which implies that i = k − 1.
Case 2.2.2: pB and g′ do not intersect
By Lemma 2.8.4, in B−pB, node v belongs to a connected component B′
that has no external nodes of C. Since C is a subgraph of G, Lemma 2.8.1
implies that B′ has no external nodes of G either. Thus, B′ is discarded
at the recursive invocation of the algorithm that corresponds to the node
wi+1. Consequently, wk = wi+1, which implies that i = k − 1.
Consequently, i ∈ {k − 1, k}. Thus, the only clusters that v could possibly
belong to are I(wk−1) and I(wk). Since for each x ∈ T , I(x) is the result of an
invocation of Algorithm Shortest-Path-Cluster, from Lemma 2.6.1, deg(I(x)) ≤ 3.
Therefore, deg(v, Z) ≤ deg(I(wk−1)) + deg(I(wk)) ≤ 6.
It is easy to verify that Algorithm Depth-Cover computes the cover Z in poly-
nomial time with respect to the size of G. Therefore, the main result in this section
follows from Lemmas 2.8.6, 2.8.7, and 2.8.8.
Theorem 2.8.1 For any connected planar graph G and γ ≥ max(depth(G), 1),
Algorithm Depth-Cover returns in polynomial time a γ-cover Z with rad(Z) ≤ 8γ
and deg(Z) ≤ 6.
2.8.5 General Planar Cover
We now describe the main algorithm, Algorithm Planar-Cover, which given a
planar graph G, constructs a γ-cover with radius O(γ) and degree O(1), for any
γ ≥ 1. In the algorithm, we do the following. If γ ≥ depth(G), then we invoke
Algorithm Depth-Cover(G, γ). However, if γ < depth(G), we first divide G into
zones, and then cluster each zone with Algorithm Depth-Cover. The union of the
zone clusters gives the resulting cover for G.
Algorithm 5: Planar-Cover(G, γ)
Input: Connected planar graph G; locality parameter γ ≥ 1;
Output: A γ-cover for G;
1: Z ← ∅;
2: if γ ≥ depth(G) then
3:   Z ← Depth-Cover(G, γ);
4: else
5:   Introduce artificial nodes to G;
6:   Let S1, S2, . . . , Sκ be the 3γ-zones of G, where κ = ⌈(depth(G) + 1)/γ⌉;
7:   foreach connected component S of each Si do
8:     Z ← Z ∪ Depth-Cover(S, 3γ − 1);
9:   Remove artificial nodes from Z;
10: return Z;
We now describe how to construct the zones. Suppose that γ < depth(G).
For all edges e ∈ G such that ω(e) > 1, place artificial nodes along e as needed,
reducing ω(e) by 1 each time, until all edges have weight 1 (we are simulating an
unweighted graph). Clearly, artificial nodes do not alter the planarity of the graph.
These nodes can later be removed from all clusters without affecting the cover, since
they do not alter the actual nodes in any neighborhood. Next, we will divide the
graph into bands, Wj = {v ∈ G : jγ ≤ depth(v, G) < (j + 1)γ}, for j ≥ 0. Our main goal is to γ-satisfy the nodes in each band Wi. However, in
and Wi+1. For this reason, we form the 3γ-zone Si consisting of bands Wi−1, Wi,
and Wi+1 (in particular, Si = G(Wi−1 ∪ Wi ∪ Wi+1), where W0 = Wκ+1 = ∅). Si
contains the whole γ-neighborhood of Wi.
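The band-and-zone construction translates directly from these definitions. The sketch below (our own rendering) computes depth(v, G) by multi-source BFS from the external nodes, assuming unit edge weights, i.e. after the artificial-node transformation; zones are indexed from 0 here for simplicity, rather than 1 as in the text.

```python
from collections import deque

def depth_of_nodes(adj, external):
    """depth(v, G): hop distance from v to the nearest external node
    (multi-source BFS over an unweighted adjacency-list graph)."""
    dist = {v: 0 for v in external}
    q = deque(external)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def zones(adj, external, gamma):
    """Bands W_j = {v : j*gamma <= depth(v, G) < (j+1)*gamma} and
    3γ-zones S_i = W_{i-1} ∪ W_i ∪ W_{i+1} (missing bands are empty)."""
    depth = depth_of_nodes(adj, external)
    bands = {}
    for v, d in depth.items():
        bands.setdefault(d // gamma, set()).add(v)
    kmax = max(bands)
    return [bands.get(i - 1, set()) | bands.get(i, set()) | bands.get(i + 1, set())
            for i in range(kmax + 1)]
```

Since a node of band Wj appears only in zones Sj−1, Sj, and Sj+1, every node participates in at most three zones, which is the observation used later for the degree bound.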
Lemma 2.8.9 For γ < depth(G), it holds that: (i) depth(Si) ≤ 3γ − 1; (ii)
Nγ(Wi, G) = Nγ(Wi, Si).
Proof: Consider a zone Si = G(Wi−1 ∪ Wi ∪ Wi+1). We first prove property (i).
Consider the outermost nodes of Wi−1 to be external. Consider a generic node
u ∈ Wi+1. Since all edges are of weight 1, u must be within γ of some node v ∈ Wi,
or else it is in the wrong depth band. Similarly, any node in Wi is within γ of some
node in Wi−1, which is less than γ from some external node in Wi−1. Thus, any
node in Si is within 3γ − 1 of an external node, therefore depth(Si) ≤ 3γ − 1.
For property (ii), suppose that u ∈ Wi and v ∈ Nγ(Wi, G). We will show that
v ∈ Si. By the construction of Wi, we know that iγ ≤ depth(u,G) < (i + 1)γ.
Suppose for the sake of contradiction that v 6∈ Si. Thus, either depth(v, G) <
(i− 1)γ, or depth(v, G) ≥ (i + 2)γ.
Case 1: depth(v, G) < (i− 1)γ
Since v ∈ Nγ(Wi, G), depth(u,G) ≤ depth(v, G) + γ < (i − 1)γ + γ. Thus,
depth(u,G) < iγ, a contradiction.
Case 2: depth(v, G) ≥ (i + 2)γ
Since v ∈ Nγ(Wi, G), depth(v,G)−γ ≤ depth(u,G) < (i+1)γ. Thus, depth(v, G) <
(i + 2)γ, a contradiction.
Therefore, v ∈ Si, proving that Nγ(Wi, G) = Nγ(Wi, Si).
In this way, we have reduced the problem of satisfying band Wi to the problem
of producing a cover for zone Si, which can be solved with Algorithm Depth-Cover.
As proved in Lemma 2.8.9, each zone Si satisfies depth(Si) ≤ 3γ − 1. We invoke Algorithm Depth-Cover(Si, 3γ − 1) with locality parameter 3γ − 1,
since in Algorithm Depth-Cover the locality parameter has to be at least as much
as the depth of the input graph. The resulting cover for G is the union of all the
covers for the zones.
Using Theorem 2.8.1 and the observation that every node participates in at
most three zones, we obtain the main result for planar graphs.
Theorem 2.8.2 For any connected planar graph G and parameter γ ≥ 1, Algorithm
Planar-Cover returns in polynomial time a γ-cover Z with rad(Z) ≤ 24γ − 8 and
deg(Z) ≤ 18.
2.9 Cover for Unit Disk Graphs
Unit disk graphs are often used to model wireless network topologies. In a unit
disk graph G, there exists an edge between two vertices u, v ∈ G if and only if the
Euclidean distance between u and v is at most 1. That is, each node u is surrounded
by a disk of radius 1, and has a link to all other nodes that appear within the disk.
In a multi-hop radio network, u can communicate directly to these nodes, and only
these nodes.
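The adjacency rule for unit disk graphs translates directly into code. The following sketch builds the graph by brute-force pairwise distance checks, which is O(n²) but adequate for illustration:

```python
import math

def unit_disk_graph(points):
    """Build the unit disk graph on 2-D points: an edge (u, v) exists
    iff the Euclidean distance between u and v is at most 1."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= 1.0:
                adj[i].append(j)
                adj[j].append(i)
    return adj
```

For example, nodes at (0, 0) and (0.5, 0) are connected, while a node at (2, 0) is isolated from both.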
Using Algorithm Planar-Cover, we can construct optimal covers for unit disk
graphs (Sparse Cover Contribution 3). Consider a connected unit disk graph G, and
a spanner G′ ⊆ G such that G′ is planar and there is a positive constant t such that
for any two nodes u and v, distG′(u, v) ≤ t ·distG(u, v) (such a spanner exists for all
unit disk graphs [54]). Since G′ is a connected planar graph, Algorithm Planar-Cover returns a γ-cover Z for G′ with rad(Z) ≤ 24γ − 8 and deg(Z) ≤ 18 (Theorem 2.8.2). Consider calling Planar-Cover(G′, γt). Clearly, Z is then a γt-cover for G′ with rad(Z) ≤ 24γt − 8 and deg(Z) ≤ 18.
Theorem 2.9.1 Z is a γ-cover for G.
Proof: Let v ∈ G′, so it is also true that v ∈ G, since G′ ⊆ G. Since Z is a γt-cover
for G′, there exists a cluster C that γt-satisfies v in G′. That is, ∀u ∈ Nγt(v,G′),
u ∈ C. Since distG′(u, v) ≤ t · distG(u, v), ∀u ∈ Nγ(v, G), u ∈ Nγt(v,G′), thus
u ∈ C. Therefore, Z is a γ-cover for G.
2.10 Summary
In this chapter, we have shown how improved sparse covers can be used to
construct better distributed directories, used for locating mobile data/objects in
wireless sensor networks. We have provided a structural lower bound for sparse
covers of arbitrary graphs, and improved construction algorithms for special well-
studied types of graphs.
We show that using a simple centralized directory is a poor solution to the
problem because it is not locality-sensitive. A better solution uses a distributed di-
rectory, where data/objects do not have a static home. This allows queries to be
answered quickly regardless of the whereabouts of the querying and storing nodes.
This is done through the use of efficient find and move operations. A sparse cover
is the underlying data structure from which a distributed directory is built. Specifi-
cally, a hierarchy of increasing-radius covers is used to construct regional matchings,
which contain read and write sets for all network nodes (refer to Section 2.4 for
formal definitions). As a directory contains only two operations (find and move), its
performance is measured by the Stretchfind and Stretchmove, which are determined
by the structural quality (radius and degree) of the sparse covers used to construct
it.
We first proved a structural lower bound for sparse covers of arbitrary graphs
in Section 2.5. Specifically, there exists a network with n nodes, and constrained
by the locality parameter γ and the maximum tolerable degree c, such that when
clustered, there must exist a cluster whose radius is Ω(γ log logc n), regardless of
the clustering technique (see Theorem 2.5.2). This proves that for arbitrary graphs,
there is an inherent tradeoff in the radius and degree, and these metrics cannot be
simultaneously optimized. The best known construction algorithm for these graphs
can achieve a radius of O(γ log n) and a degree of O(log n) [34], which translates into
a distributed directory with Stretchfind = O(log² n) and Stretchmove = O(log² n).
In light of the above tradeoff, we studied construction techniques for special types
of graphs including planar, unit disk, and H-minor free graphs.
In Section 2.7, we presented an algorithm for clustering κ-path separable
graphs that achieves a radius of O(γ) and degree of O(log n). This translates into
a distributed directory with Stretchfind = O(log n) and Stretchmove = O(log n), a
savings of a logarithmic term in each metric. In Section 2.8, we presented an opti-
mal algorithm for clustering planar graphs that achieves a radius of O(γ) and degree
of O(1). This translates into a distributed directory with Stretchfind = O(1) and
Stretchmove = O(log n), a savings of log² n in Stretchfind and log n in Stretchmove.
Finally, in Section 2.9, we showed how our planar algorithm can be used to con-
struct optimal covers for unit disk graphs (and other graphs with constant-stretch
planar spanners) with a radius of O(γ) and degree of O(1), once again saving log² n
in Stretchfind and log n in Stretchmove, for the distributed directory operations.
Our work has immediate implications on the efficiency of other important data
structures used to solve fundamental distributed problems such as the construction
of compact routing schemes and synchronizers.
CHAPTER 3
Information Retrieval: P2P Content Delivery
3.1 Introduction
P2P file-sharing is the “killer application” for the consumer broadband Inter-
net. CacheLogic’s [61] monitoring of tier 1 and 2 Internet service providers (ISPs)
in June 2004 reported that between 50% and 80% of all traffic was attributable to P2P
file-sharing. In 2005, those numbers appear to have held steady at 60% of
all network traffic on the reporting consumer-oriented ISPs [61]. At any given time,
over 8 million users are sharing 10 Petabytes of data using major P2P networks.
This accounts for nearly 10% of the broadband connections worldwide, and this
trend is expected to grow [100].
Much of the data being exchanged on P2P networks consists of large video files.
For example, a typical DIVX format movie is 700 MB, a complete single-layer DVD
movie can be more than 4 GB, and the latest high definition movies may require
10 GB or more. With high definition movies (HD-DVD and Blu-Ray formats) just
entering the home theater market, one can expect downloadable content sizes to
grow by a factor of 100 to 1,000, thus pushing the network traffic loads to much
higher levels.
So then, one might ask, what is the driving force behind these trends? In
addition to the attraction to “free” and/or “pirated” content, a key driving force
is the content distribution economics. From both a content publisher’s as well as
content consumer’s point of view, P2P makes good economic sense, especially in the
context of the flash crowd effect. Here, a single piece of data, such as a new online
movie release, is so popular, that the number of people attempting to download it
will overload the capacity of the most powerful single site web server. However,
in a current generation P2P network such as BitTorrent, a single low-bandwidth
host will seed content to a massive swarm of peers. The hosts within the swarm
will then disseminate parts of the content to each other in a peer exchange fashion.
This is the heart of how a BitTorrent swarm operates. As one peer is obtaining
new content, it is simultaneously sharing its content with other peers. Unlike the
client-server approach, as the swarm grows, the aggregate network bandwidth of
the swarm grows. Thus, from the view point of each node, the data rates are much
faster, there is no denial of service on the part of the content source, and the content
source provider’s computation and network load remain relatively low.
3.1.1 Users Happy, ISPs Not
There appears to be a wrinkle in this nirvana of efficient data exchange.
Consumer-oriented ISPs are not pleased with how their networks are being used
by these peering overlay networks. The cost to them is prohibitive, on the order of US$1 billion [61], and the ISPs are not making any additional revenue
from these network intensive applications. ISPs have begun to use packet-shaping
technology to throttle the delivery of P2P data, ultimately reducing the load on their
networks. In effect, current ISP networks were never provisioned for P2P overlay
protocols. So if you ask, "Is P2P good for the Internet?", the answer depends greatly on whom you ask.
Based on the above motivations, the grand goal of our research is to better
understand the real impact P2P overlay software has on Internet network resources
from the distributor, ISP, and end user point of views. In particular, we focus our
research on the BitTorrent protocol. BitTorrent has been one of the most popular
P2P file-sharing technologies, with a number of different client implementations
[74] and an estimated user population on the order of 60 million [102]. In 2004,
BitTorrent traffic single-handedly accounted for 50% of all Internet traffic on U.S.
cable networks [100]. More recently, the usage of BitTorrent has waned to 18%
due to content owners shutting down illegal tracker servers because of copyright
infringements [97]. We believe that the centralized tracker gives “BitTorrent-like”
applications great promise for the legal distribution of legitimate content.
3.2 Contributions
We model the BitTorrent protocol in full detail based on the mainline client source code [91], using our Internet topology model. Our contributions are the following:
1. A memory efficient model of the BitTorrent protocol built on the ROSS discrete-
event simulation system [88, 89]. The memory consumed by a single BitTor-
rent client can be upwards of 70 MB. The memory consumed by a client in
our model is between 67 KB and 2.3 MB (see Section 3.5).
2. A slice-level data model that ensures protocol accuracy while avoiding the
event explosion problem characteristic of typical packet-level models, such
as employed with NS [70]. As a result, we achieve tremendous sequential
processor speedups (up to 180 times) (see Sections 3.5 and 3.6).
3. A realistic Internet topology model that preserves geographic market rela-
tionships, is massively scalable, and accurately models the in-home consumer
broadband Internet (see Section 3.6).
4. Validation of our BitTorrent model against instrumented BitTorrent opera-
tional software as well as previous measurement studies (see Section 3.7.1).
5. Model performance results and analysis for a large number of BitTorrent
swarm scenarios (see Section 3.7.2).
6. Analysis of techniques for streaming content using BitTorrent. We show ac-
ceptable quality of service (QoS) can be achieved when only a small fraction of
a BitTorrent swarm is streaming. Further, we show how the use of BitTorrent
along with a CDN can significantly reduce transit costs while providing an
excellent QoS (see Section 3.8).
Our advancements have allowed us to study large-scale swarms that have been
previously computationally infeasible to simulate. Through this ongoing investiga-
tion, we hope to gain insights that will enable better P2P systems that are considered
both fair and efficient by not only the users and distributors, but the ISPs as well.
We now present related work in the area of BitTorrent studies. In Section 3.4,
we give an overview of the BitTorrent protocol. We discuss our simulator and
topology model in Sections 3.5 and 3.6 respectively. We present our model validation
and some experimental results in Section 3.7. In Section 3.8, we analyze the QoS and
transit savings for different streaming modifications of BitTorrent. We summarize
the chapter in Section 3.9.
3.3 Related Work
The current approaches to studying this specific protocol are either through
direct measurement of operational BitTorrent “swarms” during a file-sharing session
[66, 105], or by real experimentation on a physical closed network, such as PlanetLab
[103]. The problem with using PlanetLab as a P2P testbed is that the usage policies can limit our ability to explore network behaviors under extreme conditions. That is, an application cannot interfere with other participants in the research network [104], and PlanetLab itself lacks the resources needed to examine swarm behaviors at the scale we
would like to investigate. Additionally, real P2P Internet measurement studies are
either limited in terms of data that they are able to collect because of network
blocking issues related to network address translations (NATs) as in the case of
[105], or limited to active “torrents” as in the case of [66]. Another technique, not
necessarily specific to BitTorrent, is the use of complex queuing network models,
such as [95].
While both measurement and queuing network models are highly valuable
analytic tools, neither allow precise control over the configuration for the system
under test, which is necessary to understand the cause and effect relationships among
all aspects of a specific protocol like BitTorrent. For this level of understanding, a
detailed simulation model is required. However, the reality of any simulation is that, by definition, it is a "falsehood" from which we are trying to extract some "truths".
Thus, the modeler must take extreme care in determining factors that can and
cannot be ignored. In the case of BitTorrent, there have been some attempts by
Microsoft to model the protocol in detail [86, 96]. However, these models have been
dismissed by the creator of BitTorrent, Bram Cohen, as not accurately modeling
the true “tit-for-tat”, non-cooperative gaming nature of the protocol as well as other
aspects [99].
A third approach is direct emulation of the operational BitTorrent software
such as done by [87]. Here, the peer code is "fork-lifted" from the original implementation. Results are presented using only 700 peers (cable users). It is unclear
which parts of the BitTorrent implementation were left intact, so comparisons be-
tween this approach and ours in terms of memory and computational efficiency are
not possible.
3.4 The BitTorrent Protocol
The BitTorrent protocol creates a virtual P2P overlay network using five major
components: (i) a torrent file, (ii) a web site, (iii) a tracker server, (iv) client seeders,
and (v) client leechers.
A torrent file is composed of a header plus a number of SHA-1 block hashes of
the original file, where each block or piece of the file is a 256 KB chunk of the whole
file. These chunks are further broken down into 16 KB sub-chunks called slices. The
header information denotes the IP address or URL of the tracker for this torrent file.
Once created, the torrent file is then stored on a publicly accessible web site, from
which anyone can download. Next, the original content owner/distributor will start
a BitTorrent client that already has a complete copy of the file along with a copy
of the torrent file. The torrent file is read, and because this BitTorrent client has
a complete copy of the file, it registers itself with the tracker as a seeder. A client
without a complete copy of the file registers itself as a leecher. Upon registering,
the tracker will provide a leecher with a randomly generated list of peers. Because
of the size of the peer-set and the random peer selection, the probability of creating
an isolated clique in the overlay network graph is extremely low, which ensures
robust network routes for piece distribution. The downside to this approach is
that topological locality is completely ignored, resulting in much higher network
utilization (i.e. more network hops and consumption of more link bandwidth).
Thus, the protocol trades locality for robustness.
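The piece and slice arithmetic of the torrent file is easy to make concrete. The following sketch computes how many SHA-1 piece hashes the torrent file carries, and how many 16 KB slices are transferred, for a given file size:

```python
import math

PIECE = 256 * 1024   # 256 KB pieces, each hashed with SHA-1 in the torrent file
SLICE = 16 * 1024    # 16 KB slices, the unit of REQUEST/PIECE transfer

def torrent_layout(file_size_bytes):
    """Number of pieces and slices for a file of the given size
    (the last piece/slice may be shorter, hence the ceilings)."""
    pieces = math.ceil(file_size_bytes / PIECE)
    slices = math.ceil(file_size_bytes / SLICE)
    return pieces, slices

# e.g. a typical 700 MB DivX movie: 2,800 pieces, 44,800 slices
pieces, slices = torrent_layout(700 * 1024 * 1024)
```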
The seeder and other leechers will begin to transfer pieces of the file amongst
themselves using a complex, non-cooperative, tit-for-tat algorithm. After a piece is
downloaded, the BitTorrent client will validate that piece against the SHA-1 hash
value for that piece. Again, the hash for that piece is contained in the torrent file.
When a piece is validated, the client is able to share it with other peers who have
not yet obtained it. Pieces within a peer-set are exchanged using a rarest piece first
policy, which is used exclusively after the first few randomly selected pieces have
been obtained (typically four pieces, but this is a configuration parameter). Because
each peer announces to all peers in its peer-set every piece it obtains (via a HAVE
message), all peers are able to keep copy counts on each piece and determine within
their peer-set which piece or pieces are rarest (i.e. lowest copy count). When a
leecher has obtained all pieces of the file, it then switches to being a pure seeder of
the content. At any point during the piece/file exchange process, clients may join
or leave the swarm (peering network). Because of the highly volatile nature of these
swarms, a peer will re-request an updated list of peers from the tracker periodically
(typically every 300 seconds). This ensures the survival of the swarm, assuming the
tracker remains operational.
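The rarest-piece-first policy, with its initial random phase, can be sketched as follows. This is our own rendering: `random_threshold` stands in for the configuration parameter mentioned above (typically four pieces), and the bitfield-based data layout is an illustration, not the client's actual structures.

```python
import random
from collections import Counter

def pick_piece(have, peer_bitfields, pieces_done, random_threshold=4):
    """Rarest-first piece selection. Until `random_threshold` pieces are
    complete, pick a random needed piece; afterwards pick a needed piece
    with the lowest copy count within the peer-set (copy counts are kept
    from BITFIELD and HAVE messages)."""
    counts = Counter()
    for bitfield in peer_bitfields:        # copy counts within the peer-set
        counts.update(i for i, b in enumerate(bitfield) if b)
    needed = [i for i in counts if not have[i]]
    if not needed:
        return None
    if pieces_done < random_threshold:     # random-first phase
        return random.choice(needed)
    rarest = min(counts[i] for i in needed)
    return random.choice([i for i in needed if counts[i] == rarest])
```

With copy counts of 2, 3, and 1 for the three pieces a peer still needs, for instance, the picker requests the piece held by only one peer.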
More recently, BitTorrent has added a distributed hash table (DHT) based
tracker mechanism. This approach increases swarm robustness even in the face of
tracker failures. However, DHTs are beyond the scope of our current investigation.
3.4.1 Message Protocol
The BitTorrent message protocol consists of 11 distinct messages as of version
4.4.0, with additional messages being added to the new 4.9 version. All intra-peer
messages are sent using TCP, whereas peer-tracker messages are sent using HTTP.
Once a peer has obtained its initial peer-set from the tracker, it will initiate a
HANDSHAKE message to 40 peers by default. The upper bound on the number of
peer connections is 80. Thus, each peer keeps a number of connection slots available
for peers who are not in its immediate peer-set. This reduces the probability that
a clique will be created. The connections are maintained by periodically sending
KEEP ALIVE messages.
Once two-way handshaking between peers is complete, each peer will send the
other a BITFIELD message that contains an encoding of the pieces the peer has.
If a peer has no pieces, no BITFIELD message is sent. Upon getting a BITFIELD
message, a peer will determine if the remote peer has pieces it needs, if so, it will
schedule an INTERESTED message. The remote peer will process the INTER-
ESTED message by invoking its choker algorithm, which is described next. The
output from the remote peer’s choker (upload side) is an UNCHOKE or CHOKE
message. The response to an INTERESTED message is typically nothing or an UN-
CHOKE message. Once the peer receives an UNCHOKE message, the piece-picker
algorithm (described below) is invoked, and a REQUEST message will be generated
for a piece and 16 KB offset within that piece. The remote peer will respond with
a PIECE message containing the 16 KB chunk of data. This response will in turn
result in additional REQUESTS being sent.
When all 16 KB chunks within a piece have been obtained, the peer will send
a HAVE message to all other peers to which it is connected. With receipt of the
HAVE message, a remote peer may decide to schedule an INTERESTED message for
that peer, which results in an UNCHOKE message, and then REQUEST and PIECE
messages being exchanged. Thus, the protocol ensures continued downloading of
data among all connected peers. Should a peer have completely downloaded all
content available at a remote peer, it will send a NOT INTERESTED message.
The remote peer will then schedule a CHOKE message if the peer was currently in
the unchoked state. Likewise, the remote peer will periodically choke and unchoke
peers via the choker algorithm. Lastly, when a peer has made a request for all pieces
of content, it will enter endgame mode. Here, requests to multiple peers for the same
piece can occur. Thus, a peer will send a CANCEL message for that piece to other
peers once one has responded with the requested 16 KB chunk.
In order to reduce the complexity of our model, we do not include either KEEP
ALIVE or CANCEL messages. In the case of CANCEL messages, they are very few
and do not impact the overall swarm dynamics [66].
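The message flow above can be sketched as a minimal exchange between two peers. The class and method names below are our own illustrative placeholders, not identifiers from the model's source; the sketch assumes a single seeder and omits choking for brevity:

```python
# Minimal sketch of the BITFIELD -> INTERESTED -> UNCHOKE -> REQUEST flow
# described above. All names are illustrative, not from the model's source.

PIECE_SIZE = 256 * 1024   # pieces are 256 KB
CHUNK_SIZE = 16 * 1024    # REQUESTs address 16 KB chunks within a piece

class Peer:
    def __init__(self, name, have=None):
        self.name = name
        self.bitfield = set(have or [])   # indices of completed pieces

    def on_bitfield(self, remote):
        """After handshaking, decide whether the remote peer is interesting."""
        needed = remote.bitfield - self.bitfield
        return "INTERESTED" if needed else None

    def on_unchoke(self, remote):
        """Once unchoked, request a 16 KB chunk of a piece the remote has."""
        needed = sorted(remote.bitfield - self.bitfield)
        if not needed:
            return None
        piece = needed[0]              # a real client invokes the piece-picker
        return ("REQUEST", piece, 0)   # offset in bytes, advancing by CHUNK_SIZE

a = Peer("a", have=[])
b = Peer("b", have=[0, 1, 2, 3])   # b is a seeder

assert a.on_bitfield(b) == "INTERESTED"
assert b.on_bitfield(a) is None    # b needs nothing, so no message is scheduled
assert a.on_unchoke(b) == ("REQUEST", 0, 0)
```

Note that, as in the protocol, a peer with an empty bitfield generates no INTERESTED traffic toward peers that hold nothing it needs.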
3.4.2 Choker Algorithms
There are two distinct choker algorithms, each with very different goals. The
first is the choker algorithm used by a seeder peer. The goal is not to select the peer
whose upload data transfer rate is best, but instead to maximize the distribution of
pieces. The choker used by leecher peers, in contrast, uses a sorted list of peers
based on upload rates as the key determining factor. That is, it seeks the set of
peers with whom it can best exchange data. Both choker algorithms are scheduled to run every
10 seconds and can be invoked in response to INTERESTED/NOT INTERESTED
messages. Each invocation of the choker algorithm counts as a round. There are
three distinct rounds that both choker algorithms cycle through. We begin with the
details for the seeder choker algorithm (SCA).
SCA only considers peers that have expressed interest and have been unchoked
by this peer. First, the SCA orders peers according to the time they were last
unchoked, with the most recently unchoked peers (within a 20-second window) listed first.
All other peers outside that window are ordered by their upload rate. In both cases,
the fastest upload rate is used to break ties between peers. During two of the three
rounds, the algorithm leaves the first three peers unchoked, and unchokes another
randomly selected peer. This peer is known as the optimistic unchoked peer (OUP).
During the third round, the first four peers are left unchoked and the remaining peers
are sent CHOKE messages if they are currently in the unchoked state.
For the leecher choker algorithm (LCA), at the start of round 1 (i.e. every 30
seconds), the algorithm chooses one peer at random that is choked and interested.
As in the SCA, this is the OUP. Next, the LCA orders all peers that are interested and
have sent at least one data block within the last 30-second interval; all
other peers are considered to be snubbed. Snubbed peers are excluded from being
unchoked to prevent free-riders and ensure that peers share data in a relatively
fair way. From that ordered list, the three fastest peers along with the OUP are
unchoked. If the OUP is one of the three fastest, a new OUP is determined and
unchoked. If the OUP is not interested, the choker algorithm will later be invoked
as part of INTERESTED message processing.
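The LCA's round logic can be sketched as follows. Field names (`interested`, `recent`, `rate`) are illustrative, not from the model's source, and the re-draw of the OUP when it falls among the three fastest peers is omitted for brevity:

```python
import random

def leecher_choker(peers, round_num, oup=None):
    """Sketch of the leecher choker algorithm (LCA) described above.
    peers maps a peer id to a dict with keys 'interested', 'recent'
    (sent a block in the last 30 s), and 'rate' (measured upload rate).
    All field names are illustrative placeholders."""
    if round_num % 3 == 0:            # start of round 1, i.e. every 30 seconds
        candidates = [p for p, s in peers.items() if s["interested"]]
        if candidates:
            oup = random.choice(candidates)   # optimistic unchoke
    # Snubbed peers (no block in the last 30 s) cannot be unchoked.
    eligible = [p for p, s in peers.items() if s["interested"] and s["recent"]]
    fastest = sorted(eligible, key=lambda p: peers[p]["rate"], reverse=True)[:3]
    unchoked = set(fastest) | ({oup} if oup is not None else set())
    return unchoked, oup

peers = {
    "a": {"interested": True, "recent": True,  "rate": 50},
    "b": {"interested": True, "recent": True,  "rate": 40},
    "c": {"interested": True, "recent": True,  "rate": 30},
    "d": {"interested": True, "recent": False, "rate": 100},  # snubbed
    "e": {"interested": True, "recent": True,  "rate": 10},
}
# Mid-cycle round: the OUP carries over, the three fastest non-snubbed
# interested peers are unchoked, and the snubbed peer is excluded even
# though it advertises the highest rate.
unchoked, oup = leecher_choker(peers, round_num=1, oup="e")
assert unchoked == {"a", "b", "c", "e"}
assert "d" not in unchoked
```

The exclusion of peer "d" illustrates the free-rider defense: a high advertised rate does not help a peer that has not actually delivered data recently.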
3.4.3 Piece-Picker
The piece-picker is a two phase algorithm. The first phase is random. When a
leecher peer has no content, it selects four pieces at random to download from peers
that have those particular pieces. Once a peer has those four pieces, it shifts to a
second phase of the algorithm that is based on a rarest piece first policy. Here, each
piece’s count is incremented based on HAVE and BITFIELD messages. The piece
with the lowest count (but not zero) is selected as the next piece.
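The two phases can be sketched in a few lines. The function and parameter names are illustrative, not the model's identifiers:

```python
import random

def pick_piece(availability, have, pieces_completed):
    """Sketch of the two-phase piece-picker described above.
    availability[i] is the number of connected peers holding piece i
    (maintained from HAVE and BITFIELD messages); have is the set of
    pieces this peer already holds. Names are illustrative."""
    candidates = [i for i, count in enumerate(availability)
                  if count > 0 and i not in have]
    if not candidates:
        return None
    if pieces_completed < 4:                    # phase one: random selection
        return random.choice(candidates)
    return min(candidates, key=lambda i: availability[i])   # rarest first

# Piece 2 has a zero count, so it is never selected; piece 1 is rarest.
assert pick_piece([3, 1, 0, 2], have=set(), pieces_completed=4) == 1
assert pick_piece([3, 1, 0, 2], have={1}, pieces_completed=4) == 3
# In phase one, any available, unheld piece may be chosen at random.
assert pick_piece([3, 1, 0, 2], have=set(), pieces_completed=0) in {0, 1, 3}
```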
3.4.4 Implications for Network Model Design
As one can see, the dynamics and causal relationships among peers are extremely
complex. Consequently, we are limited in the extent to which we can abstract
away such interactions without incurring losses with respect to peer-protocol inter-
actions. For example, a peer need not receive a full 256 KB piece from a single peer,
nor is it guaranteed to receive blocks within a piece in-order, or pieces themselves in
any particular order. Additionally, the pattern with which pieces are received im-
pacts the “rarest piece” within a peer-set. This rarest piece will vary among peer-sets
as their view of the available pieces changes over time. This in turn impacts which
pieces a peer will request, and ultimately determines the download completion time
along with other network effects. This point is especially critical if we attempt to
make any sort of cross-P2P model performance comparisons. Thus, it is impera-
tive that any abstraction preserve the dynamics between peers, peer-sets, available
pieces, and rarest pieces. Because of this, we are forced to model this protocol at the
level of a slice. However, as we will show, this level affords a 180x event reduction
over a pure packet-level model.
3.5 Simulator
Our model [57] of the BitTorrent protocol (P2P Contributions 1 and 2) is
written on top of ROSS [88, 89], which is an optimistically synchronized parallel
simulator based on the Time Warp protocol [98]. In this modeling framework, sim-
ulation objects, such as peers, are realized as logical processes (LPs) that exchange
time-stamped event messages in order to communicate. Each message within the
BitTorrent protocol is realized as a time-stamped event message, where the time
stamps are generated by delays from our network topology model [58], which real-
istically approximates today’s home broadband Internet service.
The simulator is flow-based and operates at the slice-level. In addition, our
topology model allows us to abstract away details that are non-pertinent to Internet
simulations, where delays experienced in the core are negligible compared to those
in the last mile [59]. As a result, we have a realistic model that achieves significant
sequential processor speedups and reductions in required memory, allowing us to
simulate extremely large-scale swarms of hundreds of thousands of peers.
3.5.1 BitTorrent Model Data Structure
The data structure layout for our BitTorrent model is shown in Figure 3.1. At
the core of the model is the peer state, denoted by bt_peer_lp_state_t. Inside
each peer, there are three core components. First is the peer_list, followed by
the picker and the choker. The peer list captures all the upload and download
state for each peer connection, as denoted by the bt_peer_list_t structure. A
peer list can be up to 80 in length. The picker contains all the necessary state for
a peer’s piece-picker. The choker manages all the required data for a peer’s choker
algorithm. This algorithm makes extensive use of the download data structure
for each peer connection. Finally, each peer contains data structures to manage
statistical information as well as simulated CPU usage that are used in protocol
analysis.
Next, each peer connection has upload and download states associated with
it. The upload, denoted by the bt_upload_t structure, contains upload-side status
flags, such as choked, interested, etc. The download, denoted by the
bt_download_t structure, is significantly more complex from a modeling perspective.
In particular, this structure contains a list of active requests made by the owning
peer. The bt_request_t data structure contains a pointer to the destination peer
in the peer list along with the piece and offset information. Recall that each 256
KB piece is further divided into partial chunks of 16 KB each. The offset indicates
which 16 KB chunk this request is for within an overarching piece.
Now, inside the piece-picker data structure, denoted by bt_picker_t, is an
array of piece structures along with a rarest piece priority queue. Inside each piece
array element, denoted by the bt_piece_t, is the current download status of that
particular piece. Two key lists inside of the data structure are the lost_request
and peer_list. The lost_request is a queue for requests that need to be remade
because the connection to the original destination peer was closed/terminated as per
the BitTorrent protocol. The peer_list is the list of peers that have this particular
piece (determined by receipt of a HAVE message). This list is used by the piece-
picker algorithm to select which peer to send the request to for this particular piece.
We observe here that this piece_peer list is different from the previous ones
in that it points to a container structure bt_piece_peer_t, which is just a list
with a peer list pointer contained within it. It is this data design that results in
significant memory savings over a static allocation of peer arrays. This enables us
to manage our own piece-peer memory and reuse memory once a piece has been
fully obtained. Similarly, we also manage bt_request_t memory. As a leecher-peer
becomes a seeder-peer, it no longer issues piece download requests. Thus, those
memory buffers can be re-used within the simulation model for other download
requests, enabling greater scalability in terms of the number of peer-clients that can
be modeled.
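The buffer-reuse scheme can be sketched as a simple free list. This is an illustrative sketch, not the model's C implementation; the class and field names are our own:

```python
class RequestPool:
    """Minimal free-list sketch of the memory reuse described above:
    completed request objects are recycled instead of allocating a new
    one for every REQUEST message. Names are illustrative."""
    def __init__(self):
        self._free = []
        self.allocations = 0   # how many objects were ever created

    def _new(self):
        self.allocations += 1
        return {"piece": None, "offset": None}

    def acquire(self, piece, offset):
        req = self._free.pop() if self._free else self._new()
        req["piece"], req["offset"] = piece, offset
        return req

    def release(self, req):
        self._free.append(req)   # the buffer becomes reusable

pool = RequestPool()
r1 = pool.acquire(piece=0, offset=0)
pool.release(r1)                          # chunk received; recycle the buffer
r2 = pool.acquire(piece=1, offset=16384)  # reuses r1's storage
assert r2 is r1
assert pool.allocations == 1
```

In the simulation model this matters because a leecher that becomes a seeder stops issuing requests, so its buffers can serve other peers' downloads rather than sitting idle.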
The final key data structure within the piece-picker is a Splay Tree priority
queue [101], which is used to keep the rarest piece at the top of the priority queue.
Our selection of this data structure over others, such as a Calendar Queue [93], is
because of its low memory usage and high performance for small queue lengths (i.e.,
fewer than 100). The key sorting criterion for this queue is based on counts of peers that
have each piece. The lowest count piece will be at the top of the queue. Each peer
LP manages its own rarest piece priority queue.
3.5.2 Tuning Parameters
In terms of tuning parameters, BitTorrent has on the order of 20 or more, which
are beyond full consideration here. However, we do focus on two key parameters
that have a profound impact on simulator performance. The first is max_allow_in.
This parameter determines the maximum number of peers that a peer will make
connections to, or accept requests from; it thus bounds the length of a
peer's peer_list, which impacts the complexity of the piece-picker and choker
algorithms. Another key parameter is max_backlog, which sets a threshold on the
number of outstanding requests that can be made on any single peer connection.
Figure 3.1: This figure shows our BitTorrent model data structure.

In the BitTorrent implementation, max_backlog is hard-coded to be 50 unless the
data transfer rate is extremely high (greater than 3 MBps), in which case it can
go beyond that value. So, an approximate upper bound on the number of request
events that can be scheduled is the product of max_allow_in, max_backlog, and
the number of peers. A consequence of this product is that memory usage in the
simulation model grows very quickly.
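As a concrete illustration of how quickly this bound grows (the swarm size here is hypothetical; max_backlog = 50 and the 80-connection peer-set cap come from the text):

```python
def max_request_events(num_peers, max_allow_in, max_backlog=50):
    """Approximate upper bound on the number of request events that can
    be scheduled, per the product described above. The default
    max_backlog of 50 is the hard-coded value given in the text."""
    return num_peers * max_allow_in * max_backlog

# With the 80-connection peer-set cap and a hypothetical 100,000-peer
# swarm, the bound is already 400 million request events.
assert max_request_events(100_000, 80) == 400_000_000
```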
3.6 Topology Model
The Internet’s inherent heterogeneity and constantly changing nature make it
difficult to construct a realistic, yet computationally feasible model. In the construc-
tion of any model, one must take into consideration flexibility, accuracy, required
resources, execution time, and realism. In this section, we discuss the methodology
and creation of our model used to simulate Internet content distribution, and the
rationale behind its design. In particular, we are interested in modeling the in-home
consumer broadband Internet, while preserving geographic market relationships. In
our performance study, our simulations experience tremendous sequential processor
speedups, and require a fraction of the memory of other models, without sacrificing
the accuracy of our findings. Specifically, our slice-level model achieves the accu-
racy of a packet-level model, while requiring the processing of 180 times fewer events
(P2P Contributions 2 and 3).
Our topology model comprises several components that are largely independent
of each other. The components include the Internet connectivity model
(Section 3.6.2), the population model (Section 3.6.3), the delay model (Section 3.6.4),
the technology model (Section 3.6.5), and the bandwidth model (Section 3.6.6).
In the design of these models, many decisions were made whether or not to
include certain features. Considered in this process are the model’s overall realism,
its accuracy, the data collection and maintenance required, the execution time of
the simulations, and the required system resources. In some cases, a model can be
unnecessarily complex, and produce results that either cannot be analyzed, or are
no better than those of a simpler version [71]. In Section 3.6.7, we demonstrate the
efficacy of our model, and discuss some of the benefits that we reap as a result of
our decisions.
3.6.1 Related Work
3.6.1.1 Internet Mapping Projects
CAIDA [65] used skitter data that it collected over time to construct a connectivity
model between registered Autonomous Systems (ASes) on the Internet. This
model captures the connectivity between groups of networks; however, it leaves
internal network structure unknown. This model is not suitable for our simulations
because realistic hop counts cannot be determined (an AS can be disconnected, or even
span the country, in one hop). Further, we have no data regarding the location of
nodes or their corresponding bandwidths.
Lumeta [68] also created an Internet map using trace data. The map is very
large-scale, and does give a notion of location. However, probes were initiated
from a single source; thus, the map is very tree-like, and hop counts cannot be
accurately inferred. Rocketfuel [78] is an Internet mapping tool that allows for
direct measurements of router-level ISP topologies. The number of required traces
is significantly reduced by exploiting BGP routing tables, using properties of IP
routing to eliminate redundant measurements, performing alias resolution, and using
DNS to divide maps into POPs and the backbone. Using 300 sources and 800 sinks, Rocketfuel
creates extremely detailed maps of specific ISPs [79]. We use a similar technique to
map parts of the backbone and POPs; however, we abstract out specific ISPs. This
allows us to scale to larger simulations while keeping realistic ISP properties.
Mercator [81] is a similar tool that uses informed random-access hop-limited
probes to explore the IP address space. Targets are informed by the results of earlier
probes as well as IP address allocation policies. Mercator is deployable anywhere
because it makes no assumptions about the availability of external information to
direct the probes. It uses alias resolution and a technique called source-routing to
direct probes in non-radial directions from the source in order to discover cross-links
that would not have otherwise been found. In our model, we use carefully chosen
addresses and ranges to probe in order to guarantee the coverage of certain key
geographic regions.
The work in [80] describes a model of the U.S. Internet backbone constructed using merged
data sets from the existing Internet mapping efforts Rocketfuel and Mercator, and
identifies areas where the research community lacks data, such as link bandwidth
and link delay data.
3.6.1.2 Abstractions
Presented in [85] are fluid models used to study the scalability, performance,
and efficiency of BitTorrent-like file-sharing mechanisms. The idea is to approximate
a system through theoretical analysis, rather than a detailed simulation.
In [73], NIx-Vector routing (short for Neighbor-Index Vector) is introduced.
Typically, routing of packets on the Internet consists of a series of independent
routing decisions made at each router along the path between any source and des-
tination. Hence, when many packets are sent between the same pair of nodes, the
same decisions are made repeatedly and independently, without knowledge of any
previous decisions. A NIx-Vector is a compact representation of a routing path that
is small enough to be included in a packet header. Once this vector exists, routing
decisions can be made at each router in constant time, without requiring caching or
state saving. This technique can significantly reduce the burden on routers.
Staged Simulation [76] is a technique for improving the runtime performance
and scale of discrete-event simulators. It works by restructuring discrete-event sim-
ulators to operate in stages that pre-compute, cache, and reuse partial results to
drastically reduce the amount of redundant computation within a simulation. Like
all abstraction techniques, there are advantages and tradeoffs. Experiments show
that this technique can improve the execution time of the NS2 simulator consider-
ably.
One of the first flow-based network models was reported in [63]. Here, a two
order of magnitude speedup is achieved over a pure packet-level model by coarsening
the representation of the traffic from a packet-basis to a “cluster” of closely spaced
packets called a train. Narses [82] and GPS [83] are other flow-based network sim-
ulators that approximate the low-level details such as the physical, link, network,
and transport layers. A similar framework is presented in [84]. Our simulator is also
flow-based, operating at the slice level without neglecting low-level details,
allowing us to analyze application-layer behavior as well as the effects on the underlying
network.
Most recently, the work in [72] reports a new method for periodically computing traffic at
a time scale larger than that typically used for detailed packet simulations. This is
especially useful for large-scale simulations where the execution cost is exceedingly
expensive. Results suggest huge speedups are possible when comparing background
flows to those simulated in pure packet simulators. In addition, comparing the
foreground interactions verifies the accuracy of the technique.
The work in [75] discusses a novel approach to scalable and efficient network simulation,
which partitions the network into domains and the simulation time into intervals.
Each domain is simulated concurrently and independently of the others, using only
local information for the interval. At the end of each interval, simulation data is
exchanged between domains. When the exchanged information converges to a value
within a prescribed precision, all simulators progress to the next time interval. This
approach results in speedups due to the parallelization with infrequent synchroniza-
tion.
Common to all of these approaches is the tradeoff of accuracy for a decrease
in computational complexity. In many cases, that tradeoff must be made in or-
der to make the model computationally tractable. Our plight is no different here.
Large-scale P2P protocol sessions exist for many hours to days. Capturing the
larger-scale session dynamics within a tractable computational budget on common
hardware is not possible at the packet level. What makes our approach different
are the constraints that P2P protocols, and BitTorrent in particular, place on our
network abstraction, coupled with the in-home broadband usage model.
3.6.2 Internet Connectivity Model
The Internet connectivity model defines all the nodes and links present in the
simulated network. As the Internet is constantly changing, a true-to-life connectiv-
ity graph of the Internet does not exist. Our model features two key components:
the Internet backbone, and the neighborhood-level networks of lower-tiered ISPs.
The Internet backbone contains many of the key links that glue the Internet to-
gether. The backbone is very non-uniform, and has evolved slowly over time. The
neighborhood-level networks on the other hand, are very uniform, and have evolved
based on the current Internet connection technology trends (i.e. cable or DSL).
In particular, these two device technologies have different performance character-
istics that need to be considered when distributing large video content to in-home
audiences via the Internet.
In order to preserve realism and accuracy in our simulations, the model must
capture many properties of the Internet, especially those in the “last mile” where
most of the delay and congestion for in-home broadband networks is likely to occur.
Additionally, our model must allow for a configurable number of nodes. Thus, we
have developed a hybrid abstraction connectivity model to do just that.
3.6.2.1 Backbone
The importance of the Internet backbone is obvious, but because of its non-uniformity,
it cannot be generated easily. Our model therefore uses a subset
of the actual backbone. These nodes and connections were obtained by performing
thousands of traces from 15 sources to 99 sinks all over the U.S. (see Figure 3.2). We
reached 3,331 distinct nodes, and covered 6,239 edges. The maximum experienced
degree (number of links connecting a single node) was 36, and the average degree was
3.746. When data is sent across the backbone in the simulation, we can use typical
delays based on the path length to estimate its total backbone delay. Figure 3.3
shows the lengths of distinct shortest paths in the simulated backbone. Within
the modeled backbone, low-tiered ISPs were located in many of the designated
market areas defined by Nielsen Media Research [60]. These markets are driven
by the Nielsen Rating System, which is used to determine viewing rates of cable
and broadcast television shows by location. We use the Nielsen market data to
provide a distribution of potential home viewers of content received over the Internet.
This aspect is discussed in the sections below. By design, these nodes border the
backbone, and can therefore be used to expand to the particular ISP’s neighborhood-
level networks.
Figure 3.2: This figure is the connectivity graph of the backbone of the connectivity model. The nodes represent sources, sinks, intermediate backbone routers, and identified low-tiered ISP routers. The edges represent links between respective nodes.

Figure 3.3: This figure shows the distribution of shortest path lengths for distinct paths in the backbone of the connectivity model. This curve is typical of the Internet, demonstrating that we have preserved the required path properties.
3.6.2.2 Neighborhood-Level
Having up-to-date trace results for all ISPs would allow for maximum realism
in our simulations. However, this would require constant data gathering, and the
memory required to store such data (typically an adjacency matrix or adjacency list)
can be on the order of gigabytes for large simulations like the ones we study. For ex-
ample, a 100,000 by 100,000 matrix with 32-bit entries would require approximately
37 GB of memory (see Table 3.1 for memory comparisons). Luckily, network design
theory implies, and traces confirm, that neighborhood-level networks have similar
structures regardless of the particular ISP (in particular we looked at cable and DSL
ISPs). Because of this, specific ISPs have been abstracted out of the model, and we
can dynamically generate these types of networks in a realistic manner. In this case,
the speedup, the reduction in required system resources, and the elimination of the
need to maintain an up-to-date connectivity model are worth the slight degradation
of system realism.
Figure 3.4 shows the connectivity graph resulting from one set of traces to a
popular cable ISP. From the figure, we can see how the routers are interconnecting at
the different network levels, and also the fan-outs at the network’s edge connecting
to home computers/networks. This figure includes a total of 21,146 nodes, resulting
from responses from 21,037 homes and 109 intermediate routers. From this set of
traces, the average fan-out size is approximately 540 nodes.

Figure 3.4: This figure is the connectivity graph resulting from one set of traces to a popular cable ISP.
Our market/neighborhood-level model allows us to take advantage of symme-
tries that exist at the consumer broadband level of the Internet. This allows us to
route without using any adjacency-storing data structures. For example, all peers
in the same neighborhood have common routers (usually a few hops away) used
to route within the neighborhood. Similarly, peers within the same market area
have common routers used to route between neighborhoods and the ISP’s back-
bone. Thus, an individual peer’s adjacencies are unimportant. Whether a message
is being sent to the same neighborhood, a different neighborhood within the same
market, or to a completely different market, hops along the paths to common routers
can be accounted for, and the message can be forwarded to the appropriate peer
or backbone router. Although asymptotically the same, this technique provides a
space and computational complexity improvement over the popular adjacency list
data structure, while providing the same routes.
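The adjacency-free routing scheme can be sketched as follows. The per-level hop constants and the tuple-based peer addressing are illustrative placeholders, not measured values or identifiers from the model:

```python
def hop_count(src, dst, nbhd_hops=2, market_hops=4, backbone_hops=6):
    """Sketch of adjacency-free routing: a peer is identified by a
    (market, neighborhood, host) tuple, and hops are charged to the
    common routers described above. The per-level hop constants are
    illustrative placeholders, not measured values."""
    if src[:2] == dst[:2]:     # same neighborhood: via the neighborhood router
        return 2 * nbhd_hops
    if src[0] == dst[0]:       # same market: via the market's common router
        return 2 * market_hops
    # Different markets: down/up through each market's router, plus a
    # backbone crossing (in the full model, taken from the traced graph).
    return 2 * market_hops + backbone_hops

# No adjacency list or matrix is consulted; only the address tuples matter.
assert hop_count(("nyc", "n1", "h1"), ("nyc", "n1", "h2")) == 4
assert hop_count(("nyc", "n1", "h1"), ("nyc", "n2", "h9")) == 8
assert hop_count(("nyc", "n1", "h1"), ("la",  "n3", "h7")) == 14
```

Because the route to any destination is determined entirely by comparing address components, per-peer adjacency storage is unnecessary, which is the source of the memory savings shown in Table 3.1.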
3.6.3 Population Model
As previously mentioned, the backbone portion of the connectivity model has
identified ISPs by location. Using our current population statistics of the given
designated market areas [60], we can generate realistic neighborhood-level networks.
For example, if a city has 1 million cable Internet subscribers, it is unrealistic to
generate 5 million such nodes within the neighborhood-level networks of that city.
In terms of abstraction, the population data for specific cities allows us to take into
consideration time zones and the targeting of certain populations. Since we are
mostly concerned with media distribution, including streaming, this level of fidelity
is required to give realistic simulation results.
3.6.4 Delay Model
Our Internet connectivity model (discussed in Section 3.6.2) provides our sim-
ulations with realistic hop counts. As previously mentioned, we are simulating
Internet content distribution, thus, we must measure time in order for our exper-
iments to be useful for analysis. Therefore, we must have appropriate delay and
bandwidth models. In this section, we describe our delay model.
Research has shown that, compared to the delay at the first and last links on
a packet's path, the delay through the Internet's core is negligible [59]. Because
of this, we can use estimates of the core delays without significantly impacting our
results. Our estimates come from live measurements. Figure 3.5 shows the average
delay experienced at each of the first 18 hops along a packet’s trajectory for roughly
100,000 performed traces. The curve suggests that even though many factors, both
predictable and unpredictable, contribute to delay, it generally increases at later
hops (and decreases closer to the destination). Because of this, we believe that
using average delays and distributions around core hops is realistic. Further, the
traces were performed on the live Internet over several days. Thus, the averages
inherently capture the effects of background traffic, while reducing computational
costs and data-gathering needs. For the first and last links, the delays and available
bandwidths are specified by the technology model.
Figure 3.5: This figure shows the average delays experienced at each of the first 18 links along a packet's path from our traces.
3.6.5 Technology Model
The technology model describes what type of device a home user uses to
connect to the Internet. Research has shown that long delays exist at the first and
last links along a packet's path. Thus, the technology model affects the delay model.
According to [77], depending on the DSL provider, service levels can range from 128
Kbps to 7 Mbps downstream from the Internet to the user, while upstream service
levels from the user to the Internet can range from 128 Kbps to 1 Mbps. Cable
service levels can range from 400 Kbps to 10 Mbps downstream and 128 Kbps to 10
Mbps upstream. Service levels depend on service agreements offered by each cable
system operator per market, and depend on whether the access is for residential
or commercial use. But typically, cable (hybrid-fiber coax) has more bandwidth
available than DSL.
Since we are interested in simulating cable and DSL users, we will generate the
nodes in our connectivity model according to the national percentages of home users
that connect using the two technologies. The delay model can therefore include the
delays at the first and last links based on the device being used. These delays have
been observed in our traces. Figure 3.6 shows the national averages of cable and
DSL users from 2003 and 2006 [62].
Depending on the simulation needs, more devices can be used in the technology
model, and the other topology components should be updated appropriately.
Figure 3.6: This figure shows the national technology distribution for home high-speed Internet connections for March of 2003 and March of 2006.
3.6.6 Bandwidth Model
The last major component of our topology model is the bandwidth model.
Equation 3.1 [69] provides an upper bound estimate on the bandwidth for delivering
a 16 KB block. In Equation 3.1, BW is the bandwidth; MSS is the maximum
segment size (which is 1,460 bytes in default TCP, and 1,380 bytes in BitTorrent);
RTT is the round trip time; and p is the probability of packet loss.
To calculate the delay of a path, we apply a truncated (values below zero are
not used) normal distribution to the observed average delay. We use this distribution
because we observed a Gaussian curve in the real trace delays similar to Figure 3.3.
From here, RTT is set to twice the overall path delay (accounting for the forward
path and the return path).
The probability of packet loss is set to 0.05, which is a conservative estimate
based on the loss rates of the ISPs we observed. The final bandwidth is rate-shaped
based on the available bandwidth remaining along the pipe.
BW < (MSS / RTT) · (1 / √p)    (3.1)
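The delay procedure and Equation 3.1 can be combined into a small estimator. The function names and the example delay are illustrative; the model's final rate-shaping step is omitted:

```python
import math
import random

MSS = 1380  # bytes; the maximum segment size BitTorrent uses, per the text

def sample_path_delay(avg_delay, std_dev):
    """Truncated normal around the observed average delay; values
    below zero are not used (clipped here for simplicity)."""
    return max(0.0, random.gauss(avg_delay, std_dev))

def bandwidth_upper_bound(one_way_delay, p=0.05):
    """Upper bound on bandwidth in bytes/s from Equation 3.1, with RTT
    set to twice the one-way path delay and p the packet-loss rate
    (0.05 is the conservative estimate used in the text)."""
    rtt = 2.0 * one_way_delay
    return (MSS / rtt) * (1.0 / math.sqrt(p))

# For example, a hypothetical 50 ms one-way delay with 5% loss bounds a
# flow at roughly 61.7 KB/s (before rate-shaping).
bw = bandwidth_upper_bound(0.050)
assert abs(bw - MSS / 0.1 / math.sqrt(0.05)) < 1e-9
```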
3.6.7 Results
In this section, we provide experimental results that defend our topology model
design. In Table 3.1, we compare the amount of memory required by our connectivity
model versus a model with all real nodes stored in an adjacency list and an adjacency
matrix, for several simulation runs with a varying number of nodes. The results are
based on a memory footprint of 67 KB per peer (which has been achieved in [57]). It
72
Simulated Memory Required Memory Required Memory Required
Peers (MB) (MB) With List (MB) With Matrix
10,000 654 656 1,03520,000 1,308 1,311 2,83350,000 3,270 3,278 12,806100,000 6,540 6,555 44,686Lookup O(1) O(degree) O(1)
Table 3.1: Approximate memory required for simulation runs, and tech-nique lookup complexity.
is obvious that the savings are drastic, and it is not shown, but the memory accesses
may also increase the simulation time significantly for the adjacency storing models.
In Table 3.2, we revisit some simulation runs published in [57]. In particular,
we look at a swarm of 1,000 peers, and files with the following number of 256 KB
pieces: 128, 256, 512, and 1,024. We compare the number of events processed in
each simulation to a lower bound estimate on the number of events required in the
equivalent packet-level simulations. The lower bound takes into consideration the
actual data broken up into packets and forwarded at each hop. TCP control mes-
sages are ignored, and BitTorrent protocol messages are not broken up (both would
increase the number of events further for the packet-level simulator). As shown,
the number of events increases by a factor of up to 180 for the given simulation
runs. Note that this event increase significantly increases the execution time of
the simulations, as the event-rate is likely to remain roughly the same, hence, our
simulations experience a tremendous speedup as a result of the reduced number of
events processed.
Figure 3.7 shows the download completion times across all peers for a modified
version of the INRIA/PlanetLab test-bed scenario [67]. Here, 40 peers are divided
into 2 groups, fast peers and slow peers. The fast peers have a 200 KBps upload
capacity, while the slow peers have only a 20 KBps upload capacity. The download
capacity is set in our simulation to 100 MBps. The primordial seeder has the same
upload capacity as a fast peer. There are a few key differences between our scenario
and the INRIA/PlanetLab scenario. First, the PlanetLab network topology is less
complex than our topology. Our 40 peer scenario is distributed across a network that
Pieces   Slice-Level Events   Packet-Level Events
128      19M                  3.42B
256      36M                  6.48B
512      66M                  11.88B
1,024    122M                 21.96B

Table 3.2: Number of events generated in the slice-level simulations and lower bound on the number of events generated in the packet-level simulations.
Figure 3.7: This figure shows the download completion times of the modi-fied INRIA/PlanetLab scenario taken from [67]. In our case,we varied the random number seed-sets across 10 separateruns of the 40 peer, 1 seeder scenario. Thus providing uswith 400 peer data points.
spans the top 31 television markets in the U.S. Next, because our model is currently
only able to support cable and DSL devices, we only have two speed classes of users
at this time. Lastly, because of the radically different random number generation
seed-sets used across the 10 experiments, our range of different peer-sets and piece
selection is much greater. However, despite these differences, we observe that our
download completion times, in terms of shape, are similar to what they report – i.e.,
Figure 3.8: Simulated download completion times (seconds) for the 1,024
peer, 1,024 piece scenario.
the conical S-shape. This shape has also been reported by [87] in their emulation
of BitTorrent for a 700 peer scenario. This result provides confidence that both our
network and BitTorrent models are behaving as expected.
3.7 Experimental Results
The experiments in this section were conducted on a 16 processor, 2.6 GHz
Opteron system with 64 GB of RAM running Novell SuSe 10.1 Linux. These exper-
iments were conducted sequentially.
3.7.1 Model Validation
In order to validate our BitTorrent model, we created three tests: (i) download
completion test, (ii) download time test, and (iii) message count test. While there
is no consensus in the BitTorrent community on a valid BitTorrent implementation
because of variability that is acceptable within the protocol, we believe these tests
provide us with some confidence in the behavioral accuracy of our model (P2P
Contribution 4).
The first test asks the most basic question: did all leecher peers obtain a complete
copy of the file? To conduct this test, we executed our model in 16 different
Table 3.3: Number of messages received per type per simulation scenario.

Scenario              Choke    Unchoke  Interested  Not Interested  Have        Request
128 peer, 128 pieces  10,402   10,402   30,693      26,539          1,247,294   333,719
256 peer, 128 pieces  21,181   21,181   66,213      57,520          2,552,040   669,542
512 peer, 128 pieces  42,964   42,964   144,466     122,469         5,129,973   1,319,830
1K peer, 128 pieces   86,240   86,240   287,397     240,759         10,271,737  2,668,919
128 peer, 256 pieces  10,584   10,584   42,933      38,899          2,494,655   622,895
256 peer, 256 pieces  21,399   21,399   96,786      86,870          5,104,101   1,236,486
512 peer, 256 pieces  43,013   43,013   208,328     186,197         10,258,933  2,517,261
1K peer, 256 pieces   85,753   85,753   389,975     348,770         20,543,479  5,028,309
128 peer, 512 pieces  10,950   10,950   68,810      63,851          4,989,376   1,177,809
256 peer, 512 pieces  21,661   21,661   137,811     128,039         10,208,231  2,350,252
512 peer, 512 pieces  43,258   43,258   294,295     271,613         20,517,877  4,706,605
1K peer, 512 pieces   86,051   86,051   581,193     537,537         41,087,991  9,521,465
128 peer, 1K pieces   11,240   11,240   104,061     99,097          9,978,815   2,258,058
256 peer, 1K pieces   21,979   21,979   206,557     196,053         20,416,487  4,531,166
512 peer, 1K pieces   43,340   43,340   396,434     373,540         41,033,714  9,000,343
1K peer, 1K pieces    10,402   10,402   30,693      26,539          82,178,039  18,316,898
configurations based on the number of peers and the number of pieces. The number
of peers ranged from 128 to 1,024 by a power of two. Similarly, the number of pieces
also ranged from 128 to 1,024 by a power of two. At the end of each simulation run,
we collected statistics on each peer. In particular, we noted how many remaining
pieces a peer had, which was zero in all cases. Furthermore, all pending requests
should have been satisfied. Thus, the active request list for each connection should
be empty. We confirmed for all 16 cases that no requests were pending, and in fact,
all request memory buffers had been returned to the request memory pool, thus en-
suring no memory leaks existed. Finally, we ensured that as a piece is downloaded,
we correctly free the peer_list structures that have that piece, and remove it from
our rarest piece priority queue. Again, this verifies that we do not have any memory
leaks in the management of the piece-peer list structures, and serves as a cross-check
that all pieces have been correctly obtained by a peer.
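The rarest-piece bookkeeping above can be illustrated with a minimal selection function. This is a sketch only: the simulator itself maintains the counts in a priority queue keyed on availability, and the names below are ours, not the simulator's.

```python
def rarest_first(piece_peers, needed):
    """Rarest-piece-first selection.  `piece_peers` maps each piece to the
    set of connected peers that have it (the role of the peer_list
    structures); pick the needed piece available from the fewest peers,
    ignoring pieces no neighbor has yet."""
    candidates = [(len(peers), piece)
                  for piece, peers in piece_peers.items()
                  if piece in needed and peers]
    return min(candidates)[1] if candidates else None
```

As a piece completes its download everywhere, it is dropped from `needed` and its entry removed, mirroring how we free the peer_list structures and prune the rarest piece priority queue.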
In the second test, we want to know the distribution in time for when a peer
completes the download of the file. We then verify the shape of our download times
curve against those most recently published in [87]. For this test, we use the 1K
peer, 1K piece scenario. We observe that at each milestone, 25%, 50%, 75%, and
100% (download complete), most of the peers reach it at the same point in time,
as shown in Figure 3.8. This trend is attributed to the rarest piece first policy used
to govern piece selection, coupled with fair tit-for-tat trading. For the most part,
this prevents any peer from “getting ahead” in the overall download process. We
do however note, there does appear to be some “early winners” and “late losers”
in the process. This phenomenon occurs because not all peer-sets have access to
all pieces at the same time. Some peer-sets are losers (i.e., many hops away from
the original seeder), and peer-sets that contain the seeder have rare pieces more
readily available. The shape of our download completion times curve is confirmed
by the emulated results presented in [87]. Additionally, we find the real measurement
data in [66] reports a similar shaped download time distribution curve. However,
a key difference is that its variance is much greater, leading not to a relatively flat
line as we have, but to a positive-sloped line. We attribute this difference to the
measurement data only covering an extremely small “swarm” (only 30 leechers with
9 peers in each peer-set). Thus, the network parallelism is not available because
of fewer connections. Therefore, downloads will be more serialized, yielding longer,
more staggered download completion times.
In the last test, we validate our message count data as shown in Table 3.3,
against the real measurement data reported in [66]. There are two key trends that
appear to point to proper BitTorrent operation. The first is that the number of
choke and unchoke messages should be equal. In all 16 configurations, we find this
assertion to be true. This is because the choke algorithm forces these messages
to operate in pairs. Second, the number of interested messages should be slightly
higher, but almost equal, to the number of not interested messages. We observe this
phenomenon across all 16 model configurations. Finally, we observe that the number
of have and request messages meet our expectations. In the case of have messages,
they are approximated by the number of peers, times the number of pieces, times the
number of peer connections per peer. In the case of the 1K peer, 1K piece scenario,
this is bounded by 80 × 1,024 × 1,024 = 83,886,080. Likewise, the number of
Figure 3.9: Model execution time as a function of the number of pieces and
the number of peers.
requests has a lower bound of the number of pieces, times 16 slices per piece, times
the number of peers. The reason this is a lower bound is because of endgame mode,
which allows for the same piece/offset to be requested many times across different
peers.
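These expectations can be written down directly. The following small sketch uses the 80 connections per peer figure from the calculation above; the function names are ours.

```python
def have_message_bound(connections_per_peer, num_peers, num_pieces):
    # Each peer announces each piece it obtains to every connected peer,
    # so the total have-message count is approximated by this product.
    return connections_per_peer * num_peers * num_pieces

def request_message_lower_bound(num_peers, num_pieces, slices_per_piece=16):
    # Every slice of every piece must be requested at least once per peer;
    # endgame mode only adds duplicate requests on top of this.
    return num_peers * num_pieces * slices_per_piece
```

For the 1K peer, 1K piece scenario, the request count in Table 3.3 (18,316,898) indeed exceeds the lower bound of 1,024 × 1,024 × 16 = 16,777,216.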
3.7.2 Model Performance
To better understand how our BitTorrent model scales and affects simulator
performance, we conducted the following series of experiments (P2P Contribution 5).
In the first set shown in Figure 3.9, we plot the simulation execution time as a
function of the number of peers and the number of pieces. The number of peers
and pieces range from 128 to 1,024 by a power of 2, yielding 4 sets of 4 data points
each. We observe that increasing the number of peers for small (128) piece files
does not impact simulator performance significantly. However, as the number of
pieces grows, the slope of the execution time increases tremendously as the number
of peers increases. We attribute this behavior to the increased complexity in the
download side of a peer connection as a consequence of a large number of pieces
to consider. Additionally, by increasing the number of peers and pieces, the overall
event population increases, which leads to larger event list management overheads.
Figure 3.10: Model event rate as a function of the number of pieces and the
number of peers.

Figure 3.11: Model memory usage in MB as a function of the number of
pieces and the number of peers.
To verify this view, we plot the event rate as a function of the number of peers
and pieces in Figure 3.10. We observe that the event rate is highest (close to 100K
events per second) when both peers and pieces are small in number. However, in
the 1K peer, 1K piece case, we observe that the event rate has decreased to only
40K events per second because of more work per event, as well as higher event list
Figure 3.12: Simulated download completion times (seconds) for the 16,384
peer 4,096 piece scenario (this simulation run required 15.14 GB of RAM
and 59.66 hours to execute with an event rate of 35,179 events per second).
management overheads.
The memory usage across the number of peers and pieces is shown in Fig-
ure 3.11. Here, we observe that memory usage grows much slower for smaller piece
torrents than for larger ones. For example, the 1K peer, 128 piece scenario only
consumes 452 MB of RAM or 441 KB per peer, whereas the 128 peer, 1K piece sce-
nario consumes 289 MB or 2.3 MB per peer. This change in memory consumption
is attributed to two reasons. First, as the number of pieces grows, the availability of
pieces to select from grows as well. This in turn allows more requests to be simul-
taneously scheduled, which results in a larger peak event population, and increases
the overall event memory required to execute the simulation model. Increasing the
number of peers has a similar impact, in that it will also increase the event popula-
tion (and demand for request memory buffers), which raises the amount of memory
necessary to execute the simulation model.
In the last performance curve, we show the download completion times for
a very large 16K peer, 4K piece scenario. We note here that we observe a larger
population of “late” downloaders at each milestone. Observe that endgame and
download completion occur extremely close to each other. Thus, peers do not spend
a great deal of time in endgame mode overall. For simulator performance, this model
consumed 15.14 GB of RAM and required almost 60 hours to complete. At first
pass, 60 hours does appear to be a significant amount of time, but we observe that
it is orders of magnitude smaller than the measurement studies that require many
weeks or even months of peer/torrent data collection.
Finally, we report the completion of a simulation scenario with 128K peers, 128
pieces, 32 connections per peer, and a max_backlog of 8. The impact on memory
usage was significant. This scenario only consumes 8.15 GB of RAM or 67 KB
per peer, which points to how the interplay between pieces, peers, and requests
dramatically affects the underlying memory demands of the simulation model.
3.8 BitTorrent as a Streaming Protocol
With so much digital media content being transferred, it is natural to examine
P2P’s potential for streaming delivery. If content is streamed, the user can begin to
enjoy it sooner, and can evaluate its quality early on in order to preserve valuable
resources [108]. While many P2P protocols exist for streaming, none have achieved
the degree of performance, scalability, user-fairness, and popularity as BitTorrent
has for accomplishing time-insensitive mass-downloads. For this reason, we explore
modifying BitTorrent for streaming downloads (P2P Contribution 6).
In this section, we determine the potential of BitTorrent modifications BiToS
[108] and BASS [107]. BiToS uses a modified piece-picker algorithm, while BASS
augments the system with a dedicated streaming server. We simulate the two tech-
niques over a wide range of scenarios (using the simulator mentioned above). We
then analyze peer completion times from the simulation results to determine which
techniques are viable for streaming content with reasonable quality playback. We
then present the cost of these techniques in terms of total data delivered, and server
utilization at any point in time.
3.8.1 Related Work
P2P streaming is often accomplished using application layer multicast, where
an overlay network is constructed containing the participating nodes. The content
owner injects the stream into the overlay, where the nodes may consume it and
forward it to their children. The structure of the overlay is typically a tree, forest,
or mesh.
A multicast tree is the simplest structure. Each node receives content from
a single parent, and forwards it to its children. The height of this tree translates
into its latency, and the width translates into the number of bandwidth bottlenecks.
ZIGZAG [110] is an architecture composed of a clustering hierarchy and a multicast
tree of logarithmic height, and constant node degree. Overcast [113] also builds
a tree, but attempts to optimize a metric, such as bandwidth or latency, from all
nodes to the root. A common problem characteristic of multicast trees is their lack
of fault-tolerance. If a single node fails, it may disconnect portions of the tree,
rendering them useless until the failing node’s children can recognize the failure and
reconnect. Bayeux [115] (an architecture that leverages Tapestry [19]) attempts to
solve this problem using secondary pointers.
Traditional tree-based multicast is not well suited for P2P, as the burden of
duplicating and forwarding traffic is carried by a small subset of peers that are
interior nodes of the tree. This conflicts with the expectation that all nodes will
share this burden equally. SplitStream [109] splits the stream into stripes, each
delivered with a separate multicast tree. It attempts to create a structure where
interior nodes of one tree are leaf nodes in all the remaining trees, in order to fairly
distribute the forwarding burden. Other systems that use forests include Narada
[116] and PALS [117].
Fundamentally, forest overlays suffer from the same problems as tree overlays
[112], since a node in any stripe may fail. Like a forest, a mesh overlay allows for si-
multaneous downloads, but also allows parts of the file to come from perpendicular
nodes. If a node fails, other nodes can continue to receive content while recon-
necting to the overlay. However, a protocol is needed to locate missing content in
the network. Bullet [114], CollectCast [111], and DONet/CoolStreaming [106] are
examples of systems that use mesh overlays.
Under ideal conditions, application layer multicast works well for streaming
media. But, even with clever techniques to ensure performance, scalability, and
fault-tolerance, all these schemes lack user incentives. Users upload in good-faith
[118], and are not penalized if they choose not to contribute their resources to the
P2P network. This has sparked some interest in using BitTorrent for streaming,
since BitTorrent employs an incentive mechanism. Studied in this section are BiToS
[108] and BASS [107], both of which are streaming systems built around the Bit-
Torrent protocol.
3.8.2 BiToS
BiToS [108] is a BitTorrent derivative that imposes minimal changes to the
protocol’s piece-picker to allow for streaming. Since changes are only made to the
piece-picker, the modified client can still participate in swarms with unmodified
clients.
BiToS organizes needed pieces into two queues, the high-priority pieces and
the remaining pieces. Any piece that misses its playback deadline will be removed
from the queues and will no longer be considered for download, thus degrading the
video quality. With a probability of p, the earliest deadline piece of the high-priority
piece set is requested, and with a probability of 1− p, the rarest remaining piece is
requested (p can be fixed or dynamically assigned). The goal is to download pieces
in order as they are needed for playback, and occasionally download rare pieces to
make the peer an attractive trading partner as per BitTorrent’s incentive mechanism.
Thus, the different values of p affect the content quality (and the download time
always remains the same).
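A minimal sketch of this piece-picker logic follows; the data-structure and parameter names are ours, not those of the BiToS implementation.

```python
import random

def bitos_pick(high_priority, remaining, deadline, rarity, p=0.8, rng=random):
    """With probability p, request the earliest-deadline piece from the
    high-priority set; otherwise request the rarest piece from the
    remaining set.  `deadline` and `rarity` map pieces to their playback
    deadline and peer availability (lower = rarer)."""
    if high_priority and (not remaining or rng.random() < p):
        return min(high_priority, key=deadline.__getitem__)
    if remaining:
        return min(remaining, key=rarity.__getitem__)
    return None
```

Pieces that miss their playback deadline would simply be dropped from both sets before the next call.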
In [108], BiToS is simulated for a flash crowd of 400 peers downloading a 147
piece file (10 minutes at 500 Kbps), using a synthetic symmetric-bandwidth network.
It is shown that if p = 0.8 in this scenario, tolerable quality streaming in terms of
continuity index [106] can be achieved.
Through our experiments with BiToS, we can confirm its performance in the
above scenario. Further, if only a small percentage of the swarm is streaming, then
both their performance and the overall swarm performance remain very good. The
reason is that there is still good entropy throughout the swarm, and streaming
peers can still have non-streaming peers interested in them, since they all have
different perspectives regarding rare pieces. However, in general, for larger swarms,
larger files, higher bit rates, or different values of p, the number of piece deadlines
missed by BiToS increases astronomically. As a result, the playback quality dete-
riorates. Due to its inability to scale and lack of robustness, BiToS is ill-suited for
streaming delivery when high-quality playback is desired.
3.8.3 BASS
BitTorrent Assisted Streaming System (BASS) [107] is a hybrid server/P2P
streaming system for large-scale video-on-demand (VoD). In BASS, clients can
stream via BitTorrent connections and media servers simultaneously. File pieces
are downloaded from a server sequentially, with the exception of pieces already ob-
tained using BitTorrent. Similarly, the BitTorrent piece-picker will not choose to
download pieces scheduled prior to the current playback point, as they have already
been obtained from a server. In [107], a P2P contribution rate of 34% has been
reported for a scenario of 350 peers distributing a 692 piece file (at 1,024 Kbps).
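The server-side fill rule can be sketched as follows. This is a simplification at slice granularity, and the helper name is illustrative rather than taken from BASS.

```python
def next_server_slice(have, playback_point, total_slices):
    """The server streams the file sequentially from the current playback
    point onward, skipping slices the client already obtained via
    BitTorrent; returns None once nothing past the point is missing."""
    for s in range(playback_point, total_slices):
        if s not in have:
            return s
    return None
```

Symmetrically, the BitTorrent piece-picker never selects data scheduled before the playback point, since that data has already arrived from a server.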
For our model to simulate BASS, two new entities needed to be added. The
first is a streaming server. A streaming server is an LP that represents a highly-
capable peer, which answers all requests in FIFO order. With the exception of using
the same slice/piece scheme, a server does not run the BitTorrent protocol (it does
not choke peers). The implementation allows for any number of streaming servers
in the system. It should be noted that having these servers in an environment with
malicious or selfish peers would require new security considerations. The second
entity added to the model is a streaming peer. This peer is a BitTorrent peer, but is
also responsible for keeping track of buffer state and playback deadlines. Streaming
peers may co-exist and cooperate with non-streaming peers in a simulation run.
Algorithm 6 demonstrates the minor modifications required at birth to accommodate
streaming peers.
A streaming peer employs a double-buffering scheme consisting of a playback
Algorithm 6: BIRTH Event
if streaming peer then
    Initialize buffers;
    Schedule PLAYBACK now;
// All peers
Initiate BitTorrent protocol;
buffer and a look-ahead buffer (see Figure 3.13). The playback buffer contains all
video data that will be played next. If this buffer is not full when required, a re-
buffer is triggered, causing the remaining buffer slices to be requested from a server.
In addition, all remaining slices in the look-ahead buffer are also requested (see
Algorithm 7). The purpose of the look-ahead buffer is to download pieces that have
not missed their deadlines yet, but are needed soon. The chance that these pieces
will be downloaded via BitTorrent is small, and the goal is to reduce the amount
of re-buffers necessary, and the total buffering time. Data coming from a server
is treated in the same way as data coming from a BitTorrent peer. This forces a
peer to send out HAVE messages for pieces downloaded from a server. Further, the
BitTorrent piece-picker does not need to be modified, since a peer will not request
data that it already has, regardless of where it came from.
In the event of a re-buffer, the client will play the content as soon as it becomes
available. Whenever content is successfully played, the next playback is scheduled
for the point in the future when the current playback finishes. To initiate the first
buffering and playback, a streaming peer has a playback event scheduled at its birth
time. Clearly, the size of the buffers impacts both the QoS and the distribution
costs (larger buffers result in less P2P contribution but better QoS).
Figure 3.13: This figure demonstrates our double-buffering scheme. In this
example, the playback buffer is 5 slices and the look-ahead buffer is 15
slices.
Algorithm 7: PLAYBACK Event
// Check playback buffer
for i = 1 to playback buffer.size do
    if playback buffer[i].missing then
        Request playback buffer[i].piece;
        Set missed playback;
// Check look-ahead buffer
for j = 1 to lookahead buffer.size do
    if lookahead buffer[j].missing then
        Request lookahead buffer[j].piece;
// Handle the successful or unsuccessful playback
if missed playback then
    if Last playback was successful then
        Store first unsuccessful time;
else
    if Last playback was unsuccessful then
        Increment rebuffers;
        Update buffer time;
        Clear first unsuccessful time;
    // Playback was successful
    Update both buffers;
    Schedule next PLAYBACK event;
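Algorithm 7 can be rendered as runnable code roughly as follows. This is a sketch: the state names are ours, and buffer shifting and event scheduling are elided.

```python
from dataclasses import dataclass, field

@dataclass
class StreamingPeer:
    playback: list                 # True = slice already obtained
    lookahead: list
    now: float = 0.0
    last_ok: bool = True
    first_fail: float = 0.0
    rebuffers: int = 0
    buffer_time: float = 0.0
    requests: list = field(default_factory=list)

def playback_event(p):
    # Request every missing slice in both buffers (a stand-in for the
    # server requests in Algorithm 7).
    missing = [i for i, s in enumerate(p.playback) if not s]
    p.requests += missing
    p.requests += [len(p.playback) + j
                   for j, s in enumerate(p.lookahead) if not s]
    if missing:
        if p.last_ok:
            p.first_fail = p.now   # entering a re-buffer
        p.last_ok = False
    else:
        if not p.last_ok:          # re-buffer just ended
            p.rebuffers += 1
            p.buffer_time += p.now - p.first_fail
        p.last_ok = True
        # On success the real model also shifts both buffers and
        # schedules the next PLAYBACK event.
```

Running two events back to back, one with a missing playback slice and one after it arrives, accumulates exactly one re-buffer and the elapsed buffering time.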
3.8.3.1 Simulation Results
In this section, we demonstrate system performance, cost savings, and server
utilization (average and peak) for several streaming scenarios using our model of
BASS.
In the following experiments, the distributed video consists of 512 pieces at
700 Kbps (approximately 24 minutes). We test swarms of 16,384 (flash crowds of
1,024 and 2,048) and 32,768 (flash crowds of 2,048 and 4,096) peers, where the flash
crowds consist of peers that are not streaming, consistent with the subscription
model for new content. Through prior experimentation, we determined the flash
crowds should be at least 1,024 peers. If the flash crowd is too small, there is not
enough content in the network to satisfy the streamers, and buffering times are
much higher. Further, our results suggest that the flash crowd does not need to be
very large, since larger flash crowds do not impact the buffering times, the number
of re-buffering events, or the P2P contribution. The peers in each swarm arrive for
approximately two hours.
Figure 3.14: This graph demonstrates the average buffer times (seconds)
experienced by streaming peers in the simulation runs.

Figure 3.15: This graph demonstrates the average number of buffer events
experienced by streaming peers in the simulation runs. Note that the first
buffer event is mandatory for all streaming peers.
As mentioned above, the main discriminating variable is the size of the look-
ahead buffer. In our experiments, we range it from 500 to 3,000 slices by increments
of 500 (note that the entire file consists of 8,192 slices). If the buffer is too small,
the P2P contribution is very high; however, the QoS (buffering time) is very poor.
Further, if the buffer is very large, we see a drastic decrease in P2P contribution, with
Figure 3.16: This graph demonstrates the percent of bandwidth contributed
by the P2P network for the file distribution in the simulation runs.
only a marginal increase in quality. In each run, we see that a look-ahead buffer of size
500 results in an average buffering time of 3.3 seconds (1.5 average buffer events per
user) with a P2P contribution of 78% (see Figures 3.14, 3.15, and 3.16 respectively).
While this sounds good, 3.3 seconds may be too long for some applications. We
can achieve an average buffering time of under 2 seconds (1.3 average buffer events
per user) with up to a 73% contribution from the P2P network. We can lower this
time by 0.4 seconds, but we lower the P2P contribution by 16%. Depending on the
distributor’s budget and needs, the size of the look-ahead buffer can be established.
For an average buffering time of 1.1 seconds, we can still achieve a P2P contribution
of 53%. Thus, using BitTorrent to assist a CDN or streaming architecture can
significantly lower transit costs, while achieving an excellent QoS for users.
3.8.3.2 QoS
Although the average user buffer times and number of buffer events are very
good, we would like to know how all peers fare. Consider the case of 16,384 peers, a
flash crowd of 1,024, and a look-ahead buffer of 1,500 slices. The average buffering
time is 1.3 seconds (with a standard deviation of 2.6 seconds). The histogram
in Figure 3.17 shows that most peers do indeed experience good performance, with
99.76% of peers experiencing a buffer time of under 3 seconds. Similarly, on average,
Simulated Peers   Avg. Buffer Time (s)   Avg. Buffers   P2P Contribution   Avg. CDN Util. (MBps)   Peak CDN Util. (MBps)
16,384            1.8                    1.3            73%                104                     145
32,768            1.5                    1.3            73%                158                     312
65,536            1.4                    1.3            73%                314                     617
131,072           1.4                    1.3            73%                633                     1,228

Table 3.4: For a flash crowd of 2,048 peers and a look-ahead buffer of
1,000 slices, this table shows the performance of several large swarms
(16,384 peers to 131,072 peers).
each peer requires only 1.3 buffering events (with a standard deviation of 0.6 buffer
events), with 99.15% of peers requiring at most 2 buffer events (including the initial
mandatory buffering, see Figure 3.18).
Figure 3.17: This histogram of the buffering times demonstrates that most
streaming peers experience a buffering time of under 3 seconds.
Overall, streaming quality can be measured by a QoS metric called adjusted
frustration time [129]. Adjusted frustration time is defined as the total sum of
buffering times, plus a 2 second penalty for every re-buffering event (this metric
is used as part of the StreamQ user experience rating system, where any adjusted
frustration time of under 6 seconds is given a grade of A+). Figure 3.19 is a his-
togram of the swarm’s adjusted frustration times (the average is 1.8 seconds, with
a standard deviation of 3.1 seconds, and 99.12% of users experience an adjusted
Figure 3.18: This histogram of the number of buffering events demonstrates
that most streaming peers experience few re-buffers.

Figure 3.19: This histogram of the adjusted frustration times demonstrates
that most streaming peers experience a high QoS.
frustration time of under 3.6 seconds), indicating that the overall QoS is very good.
The StreamQ user performance ratings for this run can be found in Table 3.5.
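The metric and the two grade thresholds stated above can be expressed directly. This is a sketch; the intermediate StreamQ grade boundaries are not specified here, so only the A+ and F thresholds are encoded.

```python
def adjusted_frustration(total_buffer_time_s, rebuffer_events):
    # Sum of all buffering times plus a 2 second penalty per
    # re-buffering event, per the StreamQ metric [129].
    return total_buffer_time_s + 2.0 * rebuffer_events

def streamq_extremes(aft_seconds):
    # Only the thresholds given in the text: A+ below 6 seconds, F at
    # 27 seconds or more; grades in between are left unresolved.
    if aft_seconds < 6.0:
        return "A+"
    if aft_seconds >= 27.0:
        return "F"
    return None
```

The swarm average of 1.8 seconds, for instance, sits comfortably inside the A+ band even after a 2 second re-buffer penalty.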
3.8.3.3 CDN Utilization
Figures 3.20 and 3.21 show the average and peak server utilizations for each
scenario. We see that the peak never exceeds 390 MBps, and the averages are usually
around half the peaks. Further, for the simulated scenarios, the server utilization
scales roughly linearly with the size of the swarm. Thus, if a content provider can
Grade   Frequency
A+      16,245
A       21
B+      15
B       13
C+      18
C       15
D+      5
D       10
F       42

Table 3.5: This table shows how many streaming peers received each grade
of the StreamQ performance rating system. Note that a grade of F is given
when a peer's adjusted frustration time is 27 seconds or more.
estimate swarm sizes, an excellent estimate of server requirements can be made.
Figure 3.20: This graph demonstrates the average server utilizations over
the simulation runs.
Most CDNs and ISPs use a method called burstable billing [130] to charge their
customers. This method charges based on a regular sustained utilization, allowing
brief usage peaks to occasionally exceed the threshold without penalty. Typically,
customers are billed at the 95th percentile of their usage. This method is beneficial
for customers whose usage is fairly steady. If usage is bursty or unpredictable, a
flat-rate system that charges per byte (or GB) delivered may be the best option.
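The two billing schemes can be contrasted with a small sketch. The nearest-rank percentile method below is one common convention; actual service contracts vary, as noted above.

```python
import math

def burstable_bill_mbps(samples_mbps, percentile=95):
    """Nearest-rank percentile billing: sort the periodic usage samples
    and bill at the value below which `percentile` percent fall, so the
    top (100 - percentile)% of usage peaks are forgiven."""
    ordered = sorted(samples_mbps)
    rank = math.ceil(len(ordered) * percentile / 100)
    return ordered[rank - 1]

def flat_rate_cost(total_gb_delivered, dollars_per_gb=0.10):
    # Flat per-GB alternative, at the $0.10 per GB figure used in Table 3.6.
    return total_gb_delivered * dollars_per_gb
```

For steady usage the 95th-percentile figure tracks the typical rate closely, whereas bursty usage can push it toward the peak, which is when the flat per-GB scheme becomes attractive.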
Figure 3.21: This graph demonstrates the peak server utilizations over the
simulation runs.

Simulated Peers   95th Percentile CDN Util. (MBps)   Approximate Distribution Cost (flat)   Cost Without P2P Network (flat)
16,384            112                                $56.62                                 $209.72
32,768            224                                $113.25                                $419.43
65,536            442                                $226.49                                $838.86
131,072           922                                $452.99                                $1,677.72

Table 3.6: For a flash crowd of 2,048 peers and a look-ahead buffer of
1,000 slices, this table shows the 95th percentiles and approximate
flat-rate distribution costs for several large swarms (16,384 peers to
131,072 peers), at $0.10 per GB delivered.
Table 3.6 shows the 95th percentiles and approximate flat-rate distribution costs for
the scenarios presented in Table 3.4. The 95th percentile costs are not in the table
since service contracts vary from customer to customer.
We confirm the claims published in [107] that most CDN contribution occurs
in the first pieces of the file, with very little towards the end. This is due to the
fact that BitTorrent (which usually employs a rarest piece first algorithm) does not
have a chance to obtain early pieces because they are needed too soon, while the
later pieces have more time before playback, and thus more opportunities to be
downloaded via BitTorrent. Figure 3.22 shows how many times a slice (16 KB) is
downloaded from the server, for each piece (256 KB) of the file.
Figure 3.22: This graph shows the distribution of slices delivered by the
CDN throughout the file, for the 16,384 peer, 1,024 flash crowd, and 2,000
slice look-ahead scenario.
3.8.3.4 Video Bit Rate
To this point, all results have been for a bit rate of 700 Kbps. Now, we
present results from simulations at 1.5 Mbps (a 12 minute video), and show that
with a nominal increase to the look-ahead parameter, we can achieve a similar QoS
and CDN requirements. Table 3.7 shows that an increase of 500 slices to the look-
ahead buffer will allow us to achieve the same QoS as the 700 Kbps scenario (this is
the point where both curves begin to converge to 1.1 seconds and 1.3 buffer events).
Table 3.8 shows the P2P contribution and the server utilizations for these same
scenarios. We see that to achieve the same QoS, we require more CDN involvement
for the higher bit rate, which is what we expected since the only difference is that
now all playback deadlines occur sooner.
When the look-ahead buffer is 500 slices for all scenarios (the worst case for
both bit rates), the average buffering time for the 1.5 Mbps video is approximately
twice that of the 700 Kbps video (a maximum of 7.2 seconds compared to 3.6 sec-
onds), and there are on average 0.6 more buffer events per user. For any look-ahead
buffer size, the P2P contribution is consistently 10 to 13% less for the higher bit
rate. While the average and peak server utilizations appear to go down occasionally,
they typically increase for the higher bit rate by 30 Mbps and 60 Mbps respectively
Simulated Peers   Bit Rate (Kbps)   Look-Ahead (slices)   Avg. Buffer Time (s)   Avg. Buffers
16,384            700               1,500                 1.4                    1.3
16,384            1,500             2,000                 1.4                    1.3
32,768            700               1,500                 1.3                    1.3
32,768            1,500             2,000                 1.4                    1.3

Table 3.7: For a flash crowd of 2,048 peers, this table shows the
appropriate size of the look-ahead buffer to achieve a similar QoS for
different bit rates (700 Kbps and 1.5 Mbps) and swarm sizes (16,384 peers
and 32,768 peers).
Simulated Peers   Bit Rate (Kbps)   Look-Ahead (slices)   P2P Contribution   Avg. CDN Util. (MBps)   Peak CDN Util. (MBps)
16,384            700               1,500                 68%                104                     155
16,384            1,500             2,000                 50%                145                     239
32,768            700               1,500                 68%                188                     332
32,768            1,500             2,000                 51%                143                     225

Table 3.8: This table shows the differences in P2P contribution and CDN
utilization for the scenarios presented in Table 3.7.
for the 16,384 peer swarms, and by 60 Mbps and 90 Mbps respectively for the 32,768
peer swarms.
These results show that while P2P contribution decreases and server utilization
increases (an increase in overall CDN involvement), we can achieve the same QoS
at higher bit rates as with lower ones.
3.9 Summary
In this chapter, we have discussed using P2P overlay networks for the delivery
of data. P2P networks show great promise in their ability to distribute content
to extremely large audiences without overwhelming origin servers, while significantly
reducing distributor transit costs. Specifically, we studied the BitTorrent protocol
because of its performance, scalability, user-fairness, popularity, and potential for
legal content distribution. We are interested in studying large, television-size
audiences, for which measurement data does not exist. Swarms of this size have never
existed in the wild, and data is not available even for the largest swarms that have
existed, because that data is either proprietary or belongs to swarms that were
distributing content illegally and do not publish records. We must therefore
simulate these swarms, at a scale much larger than in any previous simulation effort.
We have constructed a discrete-event simulator (see Section 3.5) and Internet
topology model (see Section 3.6) that realistically capture the characteristics of home
broadband Internet service. We carefully abstract away details not pertinent to
Internet simulations in order to achieve our desired scale and degree of accuracy.
This is mostly done by estimating low-level details, and routing traffic based on
Internet structures rather than complete Internet adjacencies.
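As one illustration of this kind of low-level estimation (an assumption on our part, not necessarily the exact model the simulator uses), steady-state TCP throughput can be approximated in closed form from path properties rather than simulated packet by packet, following the Mathis model [69]:

```python
import math

def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state TCP throughput in bits per second,
    following the Mathis model: throughput ~ (MSS / RTT) * (C / sqrt(p)),
    with C ~ sqrt(3/2). Replaces per-packet simulation with a closed form."""
    c = math.sqrt(1.5)
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

# Example: a 1460-byte MSS over a 100 ms path with 1% loss yields roughly
# 1.4 Mbps, on the order of a home broadband connection.
estimate = tcp_throughput_bps(1460, 0.1, 0.01)
```

An abstraction like this is what allows a flow-level simulator to scale to hundreds of thousands of peers: per-flow rates come from a formula, not from simulating every packet.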
Lastly, we have shown BitTorrent’s capabilities as a protocol for streaming
(real-time) data delivery (see Section 3.8) in an emerging market (some companies
include: PPLive [119], PPStream [120], MySee [121], Roxbeam [122], UUSEE [123],
BitTorrent.com [131], Verisign Kontiki [133], ITIVA [132], Joost [124], Pando [125],
and Red Swoosh [126]). Specifically, if a small fraction of the swarm wishes to stream
the content, it can do so with a high QoS by simply modifying the piece-picker to
request data in order. However, piece-picker modifications do not scale well. When
many peers are streaming, we show that the distributor can save significantly on
data transit costs by using BitTorrent along with a server or CDN infrastructure.
Specifically, we have shown that the distributor can save 73% of its transit costs
while providing users with a viewing experience requiring under 2 seconds of total
buffering on average (an A+ using the StreamQ rating system [129]).
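The in-order piece-picker modification described above can be sketched as follows. This is a minimal illustration of the idea, not BitTorrent's actual source; the function and field names are our own assumptions.

```python
def pick_piece_streaming(have, neighbor_bitfields, playback_position):
    """In-order piece picker for streaming: instead of rarest-first,
    return the lowest-index missing piece at or after the playback
    position that at least one neighbor holds, or None if nothing
    useful is available.

    have               -- list of bools, pieces this peer already has
    neighbor_bitfields -- list of bool lists, one per connected neighbor
    playback_position  -- index of the next piece the player needs
    """
    total = len(have)
    for idx in range(playback_position, total):
        if not have[idx] and any(bf[idx] for bf in neighbor_bitfields):
            return idx  # earliest-deadline piece someone can serve
    return None
```

Because every streaming peer asks for the same early pieces, this policy destroys the piece diversity that rarest-first creates, which is exactly why such modifications do not scale to many simultaneous streamers without server or CDN support.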
With our advancements, we can study other protocols and create new ones
that are more efficient for users and distributors, and impose less of a burden on
ISPs.
CHAPTER 4
Discussion and Conclusions
In this thesis, we have discussed two fundamental problems of distributed networks.
The first problem deals with locating mobile data/objects in a wireless sensor net-
work (see Chapter 2). We show that using a simple centralized directory is a poor
solution to the problem because it is not locality-sensitive. A better solution uses
a distributed directory, where data/objects do not have a static home. This allows
queries to be answered quickly regardless of the whereabouts of the querying and
storing nodes. This is done through the use of efficient find and move operations.
A sparse cover is the underlying data structure from which a distributed directory
is built. Specifically, a hierarchy of increasing-radius covers is used to construct
regional matchings, which contain read and write sets for all network nodes (refer
to Section 2.4 for formal definitions). As a directory contains only two operations
(find and move), its performance is measured by Stretch_find and Stretch_move,
which are determined by the structural quality (radius and degree) of the sparse
covers used to construct it.
We first proved a structural lower bound for sparse covers of arbitrary graphs
in Section 2.5. Specifically, there exists a network with n nodes, and constrained
by the locality parameter γ and the maximum tolerable degree c, such that when
clustered, there must exist a cluster whose radius is Ω(γ log log_c n), regardless of
the clustering technique (see Theorem 2.5.2). This proves that for arbitrary graphs,
there is an inherent tradeoff in the radius and degree, and these metrics cannot be
simultaneously optimized. The best known construction algorithm for these graphs
can achieve a radius of O(γ log n) and a degree of O(log n) [34], which translates into
a distributed directory with Stretch_find = O(log^2 n) and Stretch_move = O(log^2 n).
In light of the above tradeoff, we studied construction techniques for special types
of graphs including planar, unit disk, and H-minor free graphs.
In Section 2.7, we presented an algorithm for clustering κ-path separable
graphs that achieves a radius of O(γ) and degree of O(log n). This translates into
a distributed directory with Stretch_find = O(log n) and Stretch_move = O(log n), a
savings of a logarithmic factor in each metric. In Section 2.8, we presented an optimal
algorithm for clustering planar graphs that achieves a radius of O(γ) and a degree
of O(1). This translates into a distributed directory with Stretch_find = O(1) and
Stretch_move = O(log n), a savings of log^2 n in Stretch_find and log n in Stretch_move.
Finally, in Section 2.9, we showed how our planar algorithm can be used to construct
optimal covers for unit disk graphs (and other graphs with constant-stretch
planar spanners) with a radius of O(γ) and a degree of O(1), once again saving log^2 n
in Stretch_find and log n in Stretch_move for the distributed directory operations.
Our work has immediate implications for the efficiency of other important data
structures used to solve fundamental distributed problems, such as the construction
of compact routing schemes and synchronizers.
The second problem deals with the retrieval (delivery) of digital content in a
complex P2P overlay network (see Chapter 3). P2P networks show great promise in
their ability to distribute content to extremely large audiences without overwhelming
origin servers, while significantly reducing distributor transit costs. Specifically, we
studied the BitTorrent protocol because of its performance, scalability, user-fairness,
popularity, and potential for legal content distribution. We are interested in studying
large television-size audiences, for which measurement data does not exist. Swarms
of this size have never existed in the wild, and data is not available even for the
largest swarms that have existed, because that data is either proprietary or belongs
to swarms that were distributing content illegally and do not publish records. We
must therefore simulate these swarms, at a scale much larger than in any previous
simulation effort.
We have constructed a discrete-event simulator (see Section 3.5) and Internet
topology model (see Section 3.6) that realistically capture the characteristics of home
broadband Internet service. We carefully abstract away details not pertinent to
Internet simulations in order to achieve our desired scale and degree of accuracy.
This is mostly done by estimating low-level details, and routing traffic based on
Internet structures rather than complete Internet adjacencies.
Lastly, we have shown BitTorrent’s capabilities as a protocol for streaming
(real-time) data delivery. Specifically, if a small fraction of the swarm wishes to
stream the content, it can do so with a high QoS by simply modifying the piece-picker
to request data in order. However, piece-picker modifications do not scale well.
When many peers are streaming, we show that the distributor can save significantly
on data transit costs by using BitTorrent along with a server or CDN infrastructure.
Specifically, we have shown that the distributor can save 73% of its transit costs
while providing users with a viewing experience requiring under 2 seconds of total
buffering on average (an A+ using the StreamQ rating system [129]).
The contributions presented in this thesis are innovative and significantly im-
prove data structures and techniques for data access and retrieval in distributed
networks.
LITERATURE CITED
[1] P. Zhang, C. Sadler, S. Lyon, and M. Martonosi. Hardware Design Experiences in ZebraNet. In Proc. ACM Conference on Embedded Networked Sensor Systems, Baltimore, MD, November 2004.
[2] T. Liu, C. Sadler, P. Zhang, and M. Martonosi. Implementing Software on Resource-Constrained Mobile Sensors: Experiences with Impala and ZebraNet. In Proc. International Conference on Mobile Systems, Applications, and Services, Boston, MA, June 2004.
[3] P. Juang, H. Oki, Y. Wang, M. Martonosi, L.S. Peh, and D. Rubenstein. Energy-Efficient Computing for Wildlife Tracking: Design Tradeoffs and Early Experiences with ZebraNet. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[4] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A Survey on Sensor Networks. In IEEE Communications Magazine, 37(8):102–114, August 2002.
[5] D. Braginsky and D. Estrin. Rumor Routing Algorithm for Sensor Networks. In Proc. ACM International Workshop on Wireless Sensor Networks and Applications, Atlanta, Georgia, 2002.
[6] Y. Yu, R. Govindan, and D. Estrin. Geographical and Energy Aware Routing: A Recursive Data Dissemination Protocol for Wireless Sensor Networks. Technical Report, UCLA/CSD-TR-01-0023, May 2001.
[7] D. Estrin, R. Govindan, J.S. Heidemann, and S. Kumar. Next Century Challenges: Scalable Coordination in Sensor Networks. In Mobile Computing and Networking, pages 263–270, 1999.
[8] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks. In Mobile Computing and Networking, pages 56–67, 2000.
[9] S. Ratnasamy, D. Estrin, R. Govindan, B. Karp, S. Shenker, L. Yin, and F. Yu. Data-Centric Storage in Sensornets. In ACM SIGCOMM Computer Communication Review, 33(1):137–142, January 2003.
[10] N. Chang and M. Liu. Revisiting the TTL-Based Controlled Flooding Search: Optimality and Randomization. In Proc. International Conference on Mobile Computing and Networking, pages 85–99, New York, NY, 2004. ACM Press.
[11] B. Krishnamachari and J. Ahn. Optimizing Data Replication for Expanding Ring-Based Queries in Wireless Sensor Networks. Technical Report, USC Computer Engineering, October 2005.
[12] N. Sadagopan, B. Krishnamachari, and A. Helmy. Active Query Forwarding in Sensor Networks. IEEE SNPA Workshop, 2003.
[13] X. Liu, Q. Huang, and Y. Zhang. Combs, Needles, Haystacks: Balancing Push and Pull for Discovery in Large-Scale Sensor Networks. In Proc. International Conference on Embedded Networked Sensor Systems, Baltimore, MD, 2004.
[14] S. Madden, M. Franklin, J. Hellerstein, and W. Hong. TAG: A Tiny Aggregation Service for Ad-Hoc Sensor Networks. In ACM SIGOPS Operating Systems Review, pages 131–146, 2002.
[15] N. Trigoni, Y. Yao, A.J. Demers, J. Gehrke, and R. Rajaraman. Hybrid Push-Pull Query Processing for Sensor Networks. In GI Jahrestagung (2), pages 370–374, 2004.
[16] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proc. ACM SIGCOMM, pages 149–160, San Diego, CA, 2001.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A Scalable Content-Addressable Network. In Proc. ACM SIGCOMM, pages 161–172, San Diego, CA, 2001.
[18] A. Rowstron and P. Druschel. Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-To-Peer Systems. In Proc. IFIP/ACM International Conference on Distributed Systems Platforms, pages 329–350, Heidelberg, Germany, November 2001.
[19] B.Y. Zhao, J.D. Kubiatowicz, and A.D. Joseph. Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing. Technical Report, UCB/CSD-01-1141, April 2001.
[20] G. S. Manku, M. Bawa, and P. Raghavan. Symphony: Distributed Hashing in a Small World. In Proc. of Symposium on Internet Topologies and Systems, pages 127–140, 2003.
[21] K. P. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica. The Impact of DHT Routing Geometry on Resilience and Proximity. In Proc. of ACM SIGCOMM, pages 381–394, 2003.
[22] M. Castro, P. Druschel, Y. C. Hu, and A. I. T. Rowstron. Topology-Aware Routing in Structured Peer-to-Peer Overlay Networks. In Proc. of International Workshop on Future Directions in Distributed Computing, pages 103–107, 2003.
[23] H. Zhang, A. Goel, and R. Govindan. Incrementally Improving Lookup Latency in Distributed Hash Table Systems. In Proc. of ACM SIGMETRICS, pages 114–125, June 2003.
[24] Ittai Abraham and Cyril Gavoille. Object Location Using Path Separators. In Proc. ACM Symposium on Principles of Distributed Computing (PODC), pages 188–197, 2006.
[25] Ittai Abraham, Cyril Gavoille, Andrew Goldberg, and Dahlia Malkhi. Routing in Networks with Low Doubling Dimension. In Proc. International Conference on Distributed Computing Systems (ICDCS), 2006.
[26] Ittai Abraham, Cyril Gavoille, and Dahlia Malkhi. Compact Routing for Graphs Excluding a Fixed Minor. In Proc. International Conference on Distributed Computing (DISC), pages 442–456, 2005.
[27] Ittai Abraham, Cyril Gavoille, Dahlia Malkhi, Noam Nisan, and Mikkel Thorup. Compact Name-Independent Routing with Minimum Stretch. In Proc. SPAA, pages 20–24, 2004.
[28] Ittai Abraham, Cyril Gavoille, Dahlia Malkhi, and Udi Wieder. Strongly-Bounded Sparse Decompositions of Minor Free Graphs. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'07), San Diego, California, June 2007. Also appears as Technical Report MSR-TR-2006-192, Microsoft Research, December 2006.
[29] M. Arias, L. Cowen, K. Laing, R. Rajaraman, and O. Taka. Compact Routing with Name Independence. In Proc. ACM Symposium on Parallel Algorithms and Architectures, pages 184–192, 2003.
[30] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics. McGraw-Hill, 1st edition, 1998.
[31] Baruch Awerbuch. Complexity of Network Synchronization. Journal of the ACM, 32(4), 1985.
[32] Baruch Awerbuch, Shay Kutten, and David Peleg. On Buffer-Economical Store-and-Forward Deadlock Prevention. In INFOCOM, pages 410–414, 1991.
[33] Baruch Awerbuch and David Peleg. Network Synchronization with Polylogarithmic Overhead. In Proc. IEEE Symposium on Foundations of Computer Science, pages 514–522, 1990.
[34] Baruch Awerbuch and David Peleg. Sparse Partitions (extended abstract). In IEEE Symposium on Foundations of Computer Science, pages 503–513, 1990.
[35] Baruch Awerbuch and David Peleg. Online Tracking of Mobile Users. In Proc. ACM SIGCOMM Symposium on Communication Architectures and Protocols, 1991.
[36] Baruch Awerbuch and David Peleg. Online Tracking of Mobile Users. Journal of the ACM, 42(5):1021–1058, 1995.
[37] Brenda S. Baker. Approximation Algorithms for NP-Complete Problems on Planar Graphs. Journal of the ACM, 41(1):153–180, 1994.
[38] Costas Busch, Ryan LaFortune, and Srikanta Tirthapura. Improved Sparse Covers for Graphs Excluding a Fixed Minor. Technical Report TR 06-16, Department of Computer Science, Rensselaer Polytechnic Institute, November 2006.
[39] Greg N. Frederickson and Ravi Janardan. Efficient Message Routing in Planar Networks. SIAM Journal on Computing, 18(4):843–857, 1989.
[40] Cyril Gavoille. Routing in Distributed Networks: Overview and Open Problems. SIGACT News, 32(1):36–52, 2001.
[41] Cyril Gavoille and David Peleg. Compact and Localized Distributed Data Structures. Distributed Computing, 16(2-3):111–120, 2003.
[42] Philip Klein, Serge A. Plotkin, and Satish Rao. Excluded Minors, Network Decomposition, and Multicommodity Flow. In Proc. 25th Annual ACM Symposium on Theory of Computing (STOC), pages 682–690, 1993.
[43] Goran Konjevod, Andrea W. Richa, and Donglin Xia. Optimal-Stretch Name-Independent Compact Routing in Doubling Metrics. In PODC '06: Proceedings of the Twenty-Fifth Annual ACM Symposium on Principles of Distributed Computing, pages 198–207, Denver, Colorado, USA, 2006.
[44] Goran Konjevod, Andrea W. Richa, and Donglin Xia. Optimal Scale-Free Compact Routing Schemes in Doubling Networks. In SODA '07: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, 2007.
[45] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers Inc., 1996.
[46] David Peleg. Distance-Dependent Distributed Directories. Information and Computation, 103(2), 1993.
[47] David Peleg. Distributed Computing: A Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[48] David Peleg and Eli Upfal. A Trade-Off Between Space and Efficiency for Routing Tables. Journal of the ACM, 36(3), 1989.
[49] Neil Robertson and Paul D. Seymour. Graph Minors. V. Excluding a Planar Graph. Journal of Combinatorial Theory, Series B, 41:92–114, 1986.
[50] Neil Robertson and Paul D. Seymour. Graph Minors. XVI. Excluding a Non-Planar Graph. Journal of Combinatorial Theory, Series B, 89(1):43–76, 2003.
[51] Lior Shabtay and Adrian Segall. Low Complexity Network Synchronization. In WDAG '94: Proceedings of the 8th International Workshop on Distributed Algorithms, pages 223–237, London, UK, 1994.
[52] Mikkel Thorup. Compact Oracles for Reachability and Approximate Distances in Planar Digraphs. Journal of the ACM, 51(6):993–1024, 2004.
[53] Mikkel Thorup and Uri Zwick. Compact Routing Schemes. In Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–10, 2001.
[54] K. Alzoubi, X. Li, Y. Wang, P. Wan, and O. Frieder. Geometric Spanners for Wireless Ad Hoc Networks. In IEEE Transactions on Parallel and Distributed Systems, 14(4):408–421, 2003.
[55] Costas Busch, Ryan LaFortune, and Srikanta Tirthapura. Improved Sparse Covers for Graphs Excluding a Fixed Minor. In Proc. of ACM Symposium on Principles of Distributed Computing, Portland, OR, August 2007.
[56] Ryan LaFortune. A Structural Lower Bound for Sparse Covers. Rensselaer Polytechnic Institute, 2006.
[57] C. D. Carothers, R. LaFortune, W.D. Smith, and M. R. Gilder. A Case Study in Modeling Large-Scale Peer-to-Peer File-Sharing Networks using Discrete-Event Simulation. In Proceedings of the International Mediterranean Modeling Multiconference, pages 617–624, Barcelona, Spain, October 2006.
[58] Ryan LaFortune, Christopher Carothers, William Smith, and Michael Hartman. An Abstract Internet Topology Model for Simulating Peer-to-Peer Content Distribution. In Principles of Advanced and Distributed Simulation, San Diego, CA, June 2007.
[59] TJ Giuli and Mary Baker. Narses: A Scalable Flow-Based Network Simulator. ArXiv Computer Science e-prints, CS0211024, November 2002.
[60] Nielsen Media – Home Page, 2006. http://www.nielsenmedia.com/dmas.html.
[61] A. Parker. P2P in 2005. http://www.cachelogic.com/research/2005_slide01.php.
[62] John B. Horrigan. "Home Broadband Adoption 2006". Pew Internet & American Life Project, May 2006.
[63] Jong S. Ahn and Peter B. Danzig. Packet Network Simulation: Speedup and Accuracy Versus Timing Granularity. ACM Transactions on Networking (TON), Volume 4, Number 5, October 1996.
[64] BitTorrent – Home Page, 2006. http://www.bittorrent.org.
[65] CAIDA – Home Page, 2006. http://www.caida.org.
[66] A. Legout, G. Urvoy-Keller, and P. Michiardi. Understanding BitTorrent: An Experimental Perspective. Technical Report, INRIA, Eurecom, France, November 2005.
[67] A. Legout, N. Liogkas, E. Kohler, and L. Zhang. Cluster and Sharing Incentive in BitTorrent Systems. Technical Report #inria-00112066, version 1, INRIA, Eurecom, France, November 21, 2006.
[68] Lumeta – Research Mapping Home Page, 2006. http://www.lumeta.com/research/mapping.asp.
[69] M. Mathis, J. Semke, and J. Mahdavi. The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm. Computer Communications Review, 27(3), 1997.
[70] Network Simulator (NS) – Home Page, 2006. http://www.isi.edu/nsnam/ns/ns.html.
[71] D. Nicol. Tradeoffs Between Model Abstraction, Execution Speed, and Behavioral Accuracy. In European Modeling and Simulation Symposium, 2006.
[72] David M. Nicol and Guanhua Yan. Simulation of Network Traffic at Coarse Timescales. In PADS '05: Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation, pages 141–150, Washington, DC, USA, 2005.
[73] G. Riley, E. Zegura, and M. Ammar. Efficient Routing Using Nix-Vectors. Technical Report, GIT-CC-00-13, March 2000.
[74] Slyck – Home Page, 2006. http://slyck.com/bt.php?page=21.
[75] B.K. Szymanski, Y. Liu, and R. Gupta. Parallel Network Simulation Under Distributed Genesis. In Proceedings of the 17th Workshop on Parallel and Distributed Simulation, pages 61–68, San Diego, CA, June 2003.
[76] K. Walsh and E. Sirer. Staged Simulation: A General Technique for Improving Simulation Scale and Performance. ACM TOMACS, 14(2):170–195, April 2004.
[77] Time Warner Cable. "Cable Vs. DSL". http://raleigh.twcbc.com/about/cable_vs_dsl.cfm.
[78] Rocketfuel – Home Page. http://www.cs.washington.edu/research/networking/rocketfuel.
[79] N. Spring, R. Mahajan, and D. Wetherall. Measuring ISP Topologies with Rocketfuel. In Proceedings of ACM/SIGCOMM '02, August 2002.
[80] M. Liljenstam, J. Liu, and D. Nicol. Development of an Internet Backbone Topology for Large-Scale Network Simulations. In Proceedings of the 35th Conference on Winter Simulation: Driving Innovation, pages 694–702, New Orleans, Louisiana, December 2003.
[81] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet Map Discovery. In Proceedings of IEEE INFOCOM, pages 1371–1380, Tel Aviv, Israel, March 2000.
[82] Narses Network Simulator – Home Page, 2006. http://sourceforge.net/projects/narses.
[83] Weishuai Yang and Nael Abu-Ghazaleh. GPS: A General Peer-to-Peer Simulator and its Use for Modeling BitTorrent. In Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 425–434, Washington, DC, USA, 2005.
[84] Hannes Birck, Oliver Heckmann, Andreas Mauthe, and Ralf Steinmetz. Analysis of Overlay Networks at Message- and Packet-Level. Technical Report, KOM-TR-2004-03, Darmstadt University of Technology, June 2004.
[85] Dongyu Qiu and R. Srikant. Modeling and Performance Analysis of BitTorrent-Like Peer-to-Peer Networks. SIGCOMM Comput. Commun. Rev., pages 367–378, October 2004.
[86] Ashwin R. Bharambe, Cormac Herley, and Venkata N. Padmanabhan. Analyzing and Improving BitTorrent Performance. Technical Report, MSR-TR-2005-03, Microsoft Research, February 2005.
[87] R. Bindal, P. Cao, W. Chan, J. Medval, G. Suwala, T. Bates, and A. Zhang. Improving Traffic Locality in BitTorrent via Biased Neighbor Selection. In Proceedings of the 2006 International Conference on Distributed Computing Systems, Spain, July 2006.
[88] C. D. Carothers, D. Bauer, and S. Pearce. ROSS: Rensselaer's Optimistic Simulation System User's Guide. Technical Report #02-12, Department of Computer Science, Rensselaer Polytechnic Institute, 2002. http://www.cs.rpi.edu/tr/02-12.pdf.
[89] C. Carothers, D. Bauer, and S. Pearce. ROSS: A High-Performance, Low Memory, Modular Time Warp System. In Proceedings of the 14th Workshop on Parallel and Distributed Simulation, pages 53–60, May 2000.
[90] Abilene/Internet II Usage Policy. http://abilene.internet2.edu/policies/cou.html.
[91] BitTorrent Source Code, ver. 4.4.0, Linux Release. http://www.bittorrent.com/download.myt.
[92] BitTorrent, News Release: Partnership with Warner Brothers, 2006. http://www.bittorrent.com/2006-05-09-Warner-Bros.html.
[93] R. Brown. Calendar Queues: A Fast O(1) Priority Queue Implementation for the Simulation Event Set Problem. Communications of the ACM (CACM), vol. 31, pp. 1220–1227, 1988.
[94] J. Cowie, A. Ogielski, and B.J. Premore. Internet Worms and Global Routing Instabilities. In Proceedings of the Annual SPIE 2002 Conference, July 2002.
[95] Z. Ge, D. R. Figueredo, S. Jaswal, J. Kurose, and D. Towsley. Modeling Peer-Peer File-Sharing Systems. In Proceedings of IEEE INFOCOM, 2003.
[96] C. Gkantsidis and P. Rodriguez. Network Coding for Large Scale Content Distribution. In Proceedings of IEEE INFOCOM, Miami, March 2005.
[97] P. Grant and J. Drucker. Phone, Cable Firms Rein In Consumers' Internet Use: Big Operators See Threat to Service as Web Calls, Videos Clog Up Networks. The Wall Street Journal, October 21, 2005, page A1.
[98] D. R. Jefferson. Virtual Time. ACM Transactions on Programming Languages and Systems, 7(3):404–425, July 1985.
[99] R. LeMay. BitTorrent Creator Slams Microsoft's Methods, June 21, 2005. ZDNet Australia. http://www.zdnet.com.au/news/software/0,2000061733,39198116,00.htm.
[100] A. Parker. The True Picture of Peer-to-Peer File-Sharing. http://www.cachelogic.com/research/slide1.php.
[101] R. Ronngren and Rassul Ayani. A Comparative Study of Parallel and Sequential Priority Queue Algorithms. ACM Transactions on Modeling and Computer Simulation, vol. 7, no. 2, pp. 157–209, April 1997.
[102] R. Shaw. BitTorrent Users, Ignore Opera at Your Inconvenience. ZDNet Blogs, February 17, 2006. http://blogs.zdnet.com/ip-telephony/?p=918.
[103] L. Peterson, T. Anderson, D. Culler, and T. Roscoe. A Blueprint for Introducing Disruptive Technology into the Internet. In Proceedings of the First Workshop on Hot Topics in Networking (HotNets-I), October 2002.
[104] PlanetLab Acceptable Use Policy, 2006. http://www.planet-lab.org/php/aup.
[105] J. A. Pouwelse, P. Garbacki, D. H. J. Epema, and H. J. Sips. The BitTorrent P2P File-Sharing System: Measurements and Analysis. In Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS '05), February 2005.
[106] Xinyan Zhang, Jiangchuan Liu, Bo Li, and Tak-Shing Peter Yum. CoolStreaming/DONet: A Data-Driven Overlay Network for Efficient Live Media Streaming. In Proceedings of IEEE INFOCOM, Miami, FL, March 2005.
[107] C. Dana, D. Li, D. Harrison, and C. Chuah. BASS: BitTorrent Assisted Streaming System for Video-on-Demand. In International Workshop on Multimedia Signal Processing, IEEE Press, 2005.
[108] Aggelos Vlavianos, Marios Iliofotou, and Michalis Faloutsos. BiToS: Enhancing BitTorrent for Supporting Streaming Applications. IEEE INFOCOM 2006 Global Internet Workshop, April 2006.
[109] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi, Antony Rowstron, and Atul Singh. SplitStream: High-Bandwidth Multicast in a Cooperative Environment. In SOSP '03, Lake Bolton, New York, October 2003.
[110] Duc A. Tran, Kien A. Hua, and Tai T. Do. A Peer-to-Peer Architecture for Media Streaming. Journal on Selected Areas in Communications, Special Issue on Advances in Service Overlay Networks.
[111] M. Hefeeda, A. Habib, D. Xu, B. Bhargava, and B. Botev. CollectCast: A Peer-to-Peer Service for Media Streaming. ACM/Springer Multimedia Systems Journal, October 2003.
[112] G. Wen, H. Longshe, and F. Qiang. Recent Advances in Peer-to-Peer Media Streaming Systems. In China Communications, October 2006.
[113] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, and James W. O'Toole Jr. Overcast: Reliable Multicasting with an Overlay Network.
[114] D. Kostic, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In Proceedings of ACM SOSP, 2003.
[115] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz. Bayeux: An Architecture for Scalable and Fault-Tolerant Wide-Area Data Dissemination. In Proceedings of the Eleventh International Workshop on Network and Operating System Support for Digital Audio and Video, June 2001.
[116] Y. H. Chu, S. G. Rao, and H. Zhang. A Case for End System Multicast. In Measurement and Modeling of Computer Systems, pages 1–12, 2000.
[117] R. Rejaie and A. Ortega. PALS: Peer-to-Peer Adaptive Layered Streaming. In Proceedings of ACM NOSSDAV, pages 153–161, June 2003.
[118] S. Tewari and L. Kleinrock. Analytical Model for BitTorrent-Based Live Video Streaming. In Proceedings of IEEE NIME Workshop, Las Vegas, NV, January 2007.
[119] PPLive. http://www.pplive.com/en/index.html.
[120] PPStream. http://www.ppstream.com.
[121] MySee. http://www.mysee.com.
[122] Roxbeam. http://www.roxbeam.com.
[123] UUSEE. http://www.uusee.com.
[124] Joost. http://www.joost.com.
[125] Pando. http://www.pando.com.
[126] Red Swoosh, an Akamai Company. http://www.akamai.com/redswoosh.
[127] Vonage – Home Page. http://www.vonage.com/index.php?ic=1.
[128] Skype – Home Page. http://www.skype.com.
[129] Keynote Systems – Hosted Streaming Quality Measurement, 2006. http://www.keynote.com/products/voip_and_streaming/streaming_performance/streaming_perspective_stremq.html.
[130] Burstable Billing – Wikipedia, November 2007. http://en.wikipedia.org/wiki/Burstable_billing.
[131] BitTorrent. http://www.bittorrent.com.
[132] ITIVA Networks. http://www.itiva.com.
[133] Kontiki Delivery Management System. http://www.verisign.com/products-services/content-messaging/broadband-delivery/kontiki-delivery-management.