A Survey and Comparison of
Application-level Multicast Protocols
Xing Jin Wan-Ching Wong S.-H. Gary Chan Hoi-Lun Ngan
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon
Hong Kong
Email: {csvenus, gchan}@cs.ust.hk
Tel: +852 2358-6990
Fax: +852 2358-1477
Abstract
Applications such as Internet-TV and software distribution often require multicast for their delivery. As of
today, IP multicast still experiences much difficulty in commercial deployment. In order to overcome this, researchers
have recently shifted their focus to application-level multicast (ALM), where the multicast-related functionalities
are moved to end-hosts. ALM builds an overlay topology consisting of end-to-end unicast connections between
end-hosts. The major concern is how to route data along the topology efficiently. Depending on the method of
building the data delivery tree, existing ALM protocols may be classified as mesh-based or tree-based. We have
chosen Narada, Delaunay Triangulation and Scribe as representatives of the mesh-based approach, while NICE
and Overcast are chosen for the tree-based approach. We describe their basic mechanisms in detail and illustrate
them with examples. Using Internet-like topologies, we also compare their performance in terms of network stress,
stretch, and overhead incurred.
Index Terms
Application-level multicast, multicast, overlay network, performance evaluation
This work is supported, in part, by the Competitive Earmarked Research Grant from the Research Grant Council in Hong Kong
(HKUST6014/01E).
I. INTRODUCTION
Applications such as Internet-TV, multi-party conferencing, and software distribution require point-to-multipoint
Internet communication. In IP multicast, the routers take the responsibility to replicate packets [1]. In Figure 1(a),
we show the mechanism of IP multicast, where the end-hosts are denoted by circles and the routers by squares,
with the cost of a link as indicated. Source S needs to deliver data to recipients H1, H2 and H3. Routers R1 to R5
replicate and forward data packets along the physical links of the spanning tree rooted at source S. Despite its
proposal more than a decade ago, IP multicast still has not been widely deployed, mainly due to the following reasons:
• Multicast management issues — The IP multicast standard does not provide solutions for many important
multicast management issues in large-scale commercial deployment, such as group management,
multicast address allocation and support for network management [2].
• Stateful routers — An IP multicast router needs to maintain per-group state for packet replication.
This state-keeping makes routers unscalable in terms of the number of multicast groups that can be
supported.
• Lack of support for higher-level functionalities — IP multicast is based on best-effort data delivery,
and hence is not reliable. This is not desirable for some applications such as multi-party games and
software distribution. Although a number of reliable multicast protocols such as RMTP [3], SRM
and PGM have been proposed, many challenges such as the heterogeneity of receivers and feedback
issues have to be addressed before they can be extensively deployed.
Due to the above problems, researchers have recently shifted their attention from the network layer to
the application level, i.e., the multicast-related functionalities are moved from routers to end-hosts. This
is the so-called application-level multicast (ALM). We show an example of the ALM delivery mechanism in
Figure 1(b). As opposed to Figure 1(a), S establishes unicast connections with H1 and H3, while H1
in turn delivers data to H2 via unicast. Multicast is hence achieved via piece-wise unicast connections.
Clearly, end-hosts, instead of routers, are responsible for replicating and forwarding multicast packets.
In this way, the spanning tree of ALM forms an overlay topology which consists of only end-hosts and
the unicast connections between them. In discussing ALM, we hence often abstract the network to
consist of end-hosts only. In the rest of the paper, we will focus on such overlay topology and data
delivery.
Implementing multicast functionalities at the application level addresses many issues associated with
IP multicast, and comes with a number of strengths:
• Absence of multicast routers— ALM is built on an overlay topology without the need of multicast
routers. In other words, routers do not need to maintain any multicast group information and hence
are stateless.

[Figure 1 omitted: (a) Packet flows for IP multicast. (b) Packet flows for application-level multicast.]
Fig. 1. A comparison between IP multicast and application-level multicast. In IP multicast, packets are duplicated and forwarded by routers;
in application-level multicast, this is done by the end-hosts.
• Leverage of current unicast protocols— Since the connections are based on unicast, the existing
solutions and functionalities of unicast protocols in the transport layer (such as TCP and UDP) can
be straightforwardly applied in ALM.
When designing an ALM protocol, the following metrics are usually used to evaluate its performance:
• Link and node stress
Because multiple overlay edges may span the same underlying physical link, a packet may traverse
a physical link multiple times. Moreover, a host may need to forward packets to many hosts.
In order to quantify these effects, link stress and node stress are often used, defined as the number of copies
of a packet transmitted over a given physical link and forwarded by a given end-host, respectively. The
distribution of stresses indicates whether an ALM protocol balances load fairly. If it does not, network
traffic will be concentrated at a few physical links or hosts.
As an example, refer to Figure 1(b) again. Since both overlay edges S−H1 and S−H3 span the
same physical link S−R1, the link stress of S−R1 and the node stress of S are both 2.
• Stretch
Since a packet may take multiple unicast segments before reaching its destination, it experiences
higher delay as compared to the path formed by multicast-capable routers. Stretch is a general term
used to refer to such increase in latency. Clearly, ALM protocols designed for latency-sensitive
applications (such as conferencing applications) should have a low stretch. In order to quantify the
stretch, the following two metrics are often used:
– Relative Delay Penalty (RDP), defined as the ratio of the overlay latency from the source to a
given host to the latency along the shortest unicast path.
– Path Length, defined as the number of unicast segments taken before a packet reaches a given
host.
For example, in Figure 1(b), the path of ALM packets going from S to H2 is S → R1 → R2 → H1 → R2 → R3 → H2,
instead of the shortest path S → R1 → R4 → R3 → H2. Therefore the
RDP of H2 is (1 + 4 + 1 + 1 + 2 + 1)/(1 + 2 + 1 + 1) = 10/5 = 2, while the path length is 2.
• Control message overhead
In ALM protocols, control messages in general serve two purposes:
– Connectivity maintenance: Since the multicast group is generally not a static environment (due
to failure of links and nodes and joining and leaving of group members), periodic message
exchange among hosts is essential to maintain the connectivity of the overlay topology.
– Network condition measurement: Most ALM protocols continually improve their connectivity by
measuring the round-trip time and available bandwidth between hosts in order to reduce the
stress and stretch.
The overhead ratio of a protocol is often used to measure the control overhead; it is defined as the
ratio of the amount of non-data traffic to that of data traffic. Non-data traffic consists of the control packets for
connectivity maintenance and network condition measurement. Clearly, a high ratio indicates high
overhead.
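To make the stretch definitions above concrete, the Figure 1(b) numbers can be reproduced with a short sketch (Python used purely for illustration; the per-link delay lists are transcribed from the worked example in the text):

```python
# Relative Delay Penalty (RDP) and path length for host H2 in Figure 1(b).
# The delay lists below are the per-physical-link delays given in the text.

def rdp(overlay_link_delays, shortest_link_delays):
    """RDP: overlay latency divided by the shortest unicast latency."""
    return sum(overlay_link_delays) / sum(shortest_link_delays)

def path_length(overlay_segments):
    """Path length: number of unicast segments traversed."""
    return len(overlay_segments)

overlay = [1, 4, 1, 1, 2, 1]    # S -> R1 -> R2 -> H1 -> R2 -> R3 -> H2
shortest = [1, 2, 1, 1]         # S -> R1 -> R4 -> R3 -> H2
print(rdp(overlay, shortest))               # 2.0
print(path_length(["S->H1", "H1->H2"]))     # 2
```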
Since delivering packets in ALM is less efficient than IP multicast, how to build an efficient overlay
topology for data delivery becomes a major concern. Moreover, since the multicast group is not a static
environment, the overlay topology formed should be able to recover from a partition when it occurs. Apart from
this, the average control-messaging overhead of an ALM protocol should scale well with the size of a
multicast group.
There are in general two approaches to multicasting packets in ALM: mesh-based or tree-based. In the mesh-based
approach, joining members first form an overlay mesh, and build the multicast tree(s) on top of
it. Examples of this approach are Narada [4], DT [5], Scribe [6], ALMI [7], MCAN, GOSSAMER and
Bayeux. On the other hand, in the tree-based approach, the multicast tree is built directly without the need for a pre-constructed
mesh. Examples of this approach are NICE [8], Overcast [9], Yoid and HMTP. Our goals in this
paper are to educate readers on the basic mechanisms of some representative protocols in each category and
to present detailed quantitative comparisons among them in the same setting (as opposed to some previous
comparisons, which are mostly qualitative in nature).
II. MESH-BASED PROTOCOLS
In a mesh-based protocol, a mesh-like topology is first built, in which a host connects to a number of
other hosts termed neighbors. There can be multiple paths connecting a pair of hosts on the mesh.
Based on the mesh, single or multiple data delivery tree(s) are built, according to the specific protocol.
ALGORITHM I: EVALUATE UTILITY(u, v)
1  utility ← 0
2  for each m in multicast group − {u}
3      do current latency ← latency between u and m along the mesh
4         new latency ← latency between u and m along the mesh if edge uv were added
5         if new latency < current latency
6            then utility ← utility + (current latency − new latency) / current latency
7  return utility

ALGORITHM II: EVALUATE CONSENSUS COST(uv)
1  Cost_uv ← number of group members for which u uses v as the next hop for forwarding packets
2  Cost_vu ← number of group members for which v uses u as the next hop for forwarding packets
3  return max(Cost_uv, Cost_vu)
Because delivery trees are embedded in the overlay mesh, delivery efficiency depends on how well the overlay
mesh is constructed. On the one hand, more overlay edges mean more direct connections between hosts,
and hence lower latency. On the other hand, fewer mesh edges mean lower node stress. Therefore,
selecting overlay edges that balance low latency against low node stress is important. Another issue
is that a mesh partition leads to a tree partition; therefore partition avoidance and detection mechanisms are
important. In order to illustrate the basic mechanisms involved in a more concrete manner,
we have chosen three representative schemes, namely Narada, Delaunay Triangulation and Scribe, and
discuss their operations in detail in the following.
A. Narada
Narada is suitable for Internet conferencing applications, where participants can be both sources and
receivers at the same time. In Narada, an overlay mesh is constructed by periodically adding and dropping
connections between hosts. A host periodically exchanges control messages with its neighbors to maintain
connectivity and builds its own routing table with the shortest-widest path algorithm, in which bandwidth
and latency are considered the primary and secondary metrics, respectively.
A joining host first obtains a list of already-joined hosts from a rendezvous point. The joining host then
randomly selects a few of them to connect to. Through periodic exchange of refresh messages among
[Figure 2 omitted: an overlay mesh on hosts A–F with link delays.
(a) The utility of adding edge BD is 118/63. (b) The consensus cost of edge AF is 1.]
Fig. 2. a) By adding edge BD, the delay from B to D is reduced from 9 (along the path B → A → F → D) to 2 (along the path
B → D), the delay from B to F is reduced from 7 (along the path B → A → F) to 4 (along the path B → D → F), and the delay
from B to E is reduced from 9 (along the path B → A → F → E) to 3 (along the path B → D → E). Therefore the utility of BD is
(9−2)/9 + (7−4)/7 + (9−3)/9 = 118/63. b) Since only the shortest path between A and F passes through the connection AF, the consensus cost of
AF is only one.
neighbors, changes in membership due to the joining or leaving of hosts are eventually propagated to
all hosts. In Narada, a reverse-path algorithm is used to multicast packets: when host u receives a multicast
packet from source s, it forwards the packet to those neighbors for which u is on their shortest path to
s [10].
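The reverse-path rule can be sketched as follows (Python for illustration; the next-hop tables are hypothetical and would in practice come from each host's shortest-widest path computation):

```python
# Reverse-path forwarding: on receiving a packet from source s, host u
# forwards it to each neighbor n whose own shortest path to s goes
# through u (i.e., n's next hop toward s is u).

def rpf_targets(u, s, neighbor_next_hop):
    """neighbor_next_hop[n][s] -> neighbor n's next hop toward source s."""
    return sorted(n for n in neighbor_next_hop if neighbor_next_hop[n][s] == u)

# Hypothetical next-hop tables for A's neighbors B, C and D:
tables = {"B": {"S": "A"}, "C": {"S": "A"}, "D": {"S": "B"}}
print(rpf_targets("A", "S", tables))   # ['B', 'C']
```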
Narada uses a periodic mesh refinement algorithm to add or drop connections so as to incrementally
improve the overlay mesh. A host u periodically probes some other hosts to evaluate the overall delay reduction
that a connection to each of them would bring. If the reduction for a host is above a certain threshold (the adding
threshold), u connects to it.
Meanwhile, u also computes the consensus cost of each edge to a neighbor, say the edge uv. Among
all the shortest paths from u to the other nodes in the network, u counts the number of them that include
uv as one of their edges. v computes likewise. The maximum of these two numbers is the cost. If
it is below a certain threshold (the dropping threshold), the edge uv is disconnected.
Note that the adding and dropping thresholds are not fixed values, but are functions of the maximum and
minimum fanout of all the nodes in the overlay mesh. Narada thereby controls the fanout to prevent
a host from having too many connections and becoming a bottleneck.
We show the algorithm to compute the overall delay reduction (termed the utility) for two nodes u and
v if a connection were made in ALGORITHM I, and show an example in Figure 2(a), where hosts B and
D are not connected at first. When B probes D, it finds that by connecting to D the delay from itself to
D can be reduced from 9 (along the path B → A → F → D) to 2 (along the path B → D), the delay
from B to F can be reduced from 7 (along the path B → A → F) to 4 (along the path B → D → F),
and the delay from B to E can be reduced from 9 (along the path B → A → F → E) to 3 (along the
path B → D → E). Therefore the utility of BD is (9−2)/9 + (7−4)/7 + (9−3)/9 = 118/63. If the value is above the
adding threshold, B will connect to D.

[Figure 3 omitted: two triangulations of a convex quadrilateral abcd.]
Fig. 3. a) Two adjacent triangles △abc and △acd forming a convex quadrilateral and violating the DT property. b) Restoration of the DT property
by disconnecting a from c and connecting b and d.
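ALGORITHM I can be sketched directly in code, using the Figure 2(a) latencies from B before and after adding edge BD (Python for illustration; the B–A latency is illustrative and unchanged by the new edge):

```python
# Narada utility (ALGORITHM I): sum, over every member whose latency
# improves, the fractional reduction in latency the candidate edge brings.

def evaluate_utility(current_latency, new_latency):
    utility = 0.0
    for member, cur in current_latency.items():
        new = new_latency[member]
        if new < cur:
            utility += (cur - new) / cur
    return utility

# Latencies from B along the mesh, before/after adding BD (Figure 2(a)).
before = {"A": 2, "D": 9, "F": 7, "E": 9}
after  = {"A": 2, "D": 2, "F": 4, "E": 3}
print(evaluate_utility(before, after))   # 118/63 ≈ 1.873
```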
We also show how to evaluate the consensus cost of a given edge uv in ALGORITHM II, with an example
in Figure 2(b). Consider the edge AF (i.e., A connects to host F initially). For both A and F, because
there is only one shortest path which includes the edge AF (the direct path AF), the consensus cost of
AF is one. Therefore, if the value of the dropping threshold is higher than one, AF will be dropped.
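ALGORITHM II can likewise be sketched. The next-hop tables below are hypothetical but consistent with Figure 2(b), where only the direct path between A and F crosses the edge AF:

```python
# Narada consensus cost (ALGORITHM II): for edge uv, count the group
# members each endpoint reaches via the other, and take the maximum.

def consensus_cost(u, v, next_hop):
    """next_hop[x][m] -> x's next hop toward member m."""
    cost_uv = sum(1 for nh in next_hop[u].values() if nh == v)
    cost_vu = sum(1 for nh in next_hop[v].values() if nh == u)
    return max(cost_uv, cost_vu)

# Hypothetical next hops: A reaches only F itself through F, and vice versa.
next_hop = {
    "A": {"B": "B", "C": "C", "D": "B", "E": "B", "F": "F"},
    "F": {"A": "A", "B": "D", "C": "D", "D": "D", "E": "E"},
}
print(consensus_cost("A", "F", next_hop))   # 1 -> dropped if threshold > 1
```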
Narada detects mesh partitions through the periodic exchange of refresh messages. Basically, if
host u misses a sequence of refresh messages from host v, u suspects that v is dead and probes v. If v
turns out to be still alive, u connects to v in order to recover the partition. Since this algorithm does not involve
the rendezvous point, it is robust to the failure of any host.
Narada is not scalable in terms of the number of hosts. This is because:
• As a flat routing algorithm, the shortest-widest path algorithm requires a routing table as large as
the group size. Therefore the state maintained by a host is of order O(N).
• A control message contains the states of all hosts in the group; therefore, the total control overhead
in the system is high for large groups (O(N^2)).
Another potential problem is that Narada’s convergence time can be very long. Simulation results show
that it is difficult to form a stable mesh, even after a very long time. This is due to the inconsistent criteria
for adding and dropping edges.
B. Delaunay Triangulation (DT)
In DT, each host has ageographicalcoordinate and hosts first form an overlay mesh based on these
coordinates. Compass routing is used to route a packet from one point to another. DT protocol connects
the nodes together in a triangular manner so that the mesh satisfies Delaunay Triangulation property,
i.e., the minimum internal angle of the adjacent triangles in the mesh are maximized [11]. To illustrate
the triangulation process, consider that pointsa, b, c andd form a convex quadrilateralabcd. There are
two possible ways to triangulate it as shown in Figure 3. Since the minimum internal angle of4abc
and △acd (Figure 3(a)) is less than that of △abd and △bcd (Figure 3(b)), the DT protocol transforms the
former configuration into the latter. The mesh formed this way connects geographically close nodes
together.

[Figure 4 omitted: host u with neighbors n0–n3 and an external host t.]
Fig. 4. When u receives a unicast packet with destination t, it forwards the packet to n0. When u receives a multicast packet with source t, it
forwards the packet to n1 and n2.
In the DT protocol, the joining process is bootstrapped by a DT server, which caches a list of joined hosts.
A joining host u first queries the DT server for some already-joined hosts. u then sends unicast join
requests to these hosts, which in turn forward the requests along the DT mesh towards u until
they reach a set of hosts near u. These nearby hosts then connect to u. Note that the newly established
connections may violate the DT property. To restore it, every host periodically tests its connections against
the property, and drops those that fail the test. A host also discovers nearby hosts through the periodic
exchange of control messages with its neighbors, and connects to any nearby host if this does not violate
the DT property.
To illustrate this process, suppose that initially hosts a, b, c and d are connected as in Figure 3(a).
Because the connection ac violates the DT property, it is dropped when host a (or c) tests it. After that,
host b discovers d through control messaging, and connects to d since bd satisfies the DT property. The
resultant overlay mesh (as shown in Figure 3(b)) then satisfies the DT property.
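The periodic DT test can be sketched numerically: for a convex quadrilateral, compare the minimum internal angle produced by each diagonal and keep the better one (coordinates below are hypothetical; Python for illustration):

```python
import math

# Edge-flip test: for a convex quadrilateral a, b, c, d (in order), keep the
# diagonal whose two triangles have the larger minimum internal angle.

def min_angle(p, q, r):
    """Smallest internal angle (radians) of triangle pqr."""
    def ang(a, b, c):   # angle at vertex b of triangle abc
        v1 = (a[0] - b[0], a[1] - b[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        return math.acos(dot / (math.hypot(*v1) * math.hypot(*v2)))
    return min(ang(q, p, r), ang(p, q, r), ang(p, r, q))

def better_diagonal(a, b, c, d):
    """Return 'ac' or 'bd' for the diagonal satisfying the DT property."""
    ac = min(min_angle(a, b, c), min_angle(a, c, d))
    bd = min(min_angle(a, b, d), min_angle(b, c, d))
    return "ac" if ac >= bd else "bd"

# A flat quadrilateral where diagonal ac yields two skinny triangles:
print(better_diagonal((0, 0), (4, 1), (8, 0), (4, -1)))   # bd
```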
To route a packet from one point to another in the DT protocol, compass routing can be used. When host u
receives a unicast packet with destination v, it first computes the slope between the two coordinate points
u and v; let this slope be s. u then computes the slopes of all its N neighbors with respect to itself. Let these
be si, 1 ≤ i ≤ N. u forwards the packet to the neighbor which has the closest slope to s, i.e., u forwards
to neighbor j where |sj − s| is the minimum over all si's. We illustrate compass routing in Figure 4,
where u needs to route a packet destined to t to one of its neighbors. Because the slope between u and
n0 is the closest to the slope between u and t, host u forwards the unicast packet to host n0.
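This can be sketched as follows. For robustness the sketch compares direction angles (via atan2) rather than raw slopes, which matches the spirit of the description while handling vertical directions; the coordinates are hypothetical:

```python
import math

# Compass routing sketch: forward to the neighbor whose direction from u
# is angularly closest to the direction of the destination t.

def compass_next_hop(u, t, neighbors):
    """u, t: (x, y) points; neighbors: dict name -> (x, y)."""
    def angle(p, q):
        return math.atan2(q[1] - p[1], q[0] - p[0])
    target = angle(u, t)
    def ang_diff(name):
        d = abs(angle(u, neighbors[name]) - target)
        return min(d, 2 * math.pi - d)   # wrap around the circle
    return min(neighbors, key=ang_diff)

u = (0.0, 0.0)
t = (10.0, 1.0)
nbrs = {"n0": (2.0, 0.0), "n1": (0.0, 2.0),
        "n2": (-2.0, 0.0), "n3": (0.0, -2.0)}
print(compass_next_hop(u, t, nbrs))   # n0
```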
In DT, similar to Narada, a reverse-path forwarding algorithm is used to multicast packets. When host
u receives a multicast packet from source s, it forwards the packet to a neighbor if u is on the path
from that neighbor to s.
Refer back to Figure 4 again. When host u receives a multicast packet with source t, it forwards the
packet to n1 and n2. This is because the slope of t → n1 is the closest to the slope of u → n1 among all the
slopes of u → n1, n0 → n1 and n2 → n1, while the slope of t → n2 is the closest to the slope of u → n2
among the slopes of u → n2, n1 → n2 and n3 → n2.
In the DT protocol, mesh partitioning is detected with the help of the DT server. For each connected mesh, one and
only one host is elected as leader (usually the one with the greatest coordinate1). The leader
periodically exchanges control messages with the server. When the mesh is partitioned, there would
be more than one leader communicating with the server. The server can then recover the partition by
requesting one of them to connect to the others.
Compared to Narada, the DT protocol is much more scalable in terms of the number of hosts. This
is because compass routing is a form of local routing, based only on the coordinates of the hosts
directly connected to a node. Therefore the size of the routing table maintained by a host depends only on its
number of neighbors, which is on average less than or equal to six.2 However, the DT protocol
comes with the following two weaknesses:
• Inaccurate host location estimation — The geographic locations of hosts in general do not correlate
well with the latencies between hosts in the Internet; therefore, the end-to-end delay along a DT
overlay may be quite large.
• Single point of failure — The partition detection and recovery scheme relies on a DT server, which
forms a single point of failure.
C. Scribe
In Scribe, there are many hosts in the network, but a multicast group only covers a subset of
them. Hosts which are not group members also take part in packet forwarding in the Scribe network. A
possible application for Scribe is an Internet chatroom, where usually a small set of users out of a possibly
large pool belongs to the same multicast group. The large pool of hosts jointly takes part in forwarding
packets for the group members in the system.
Scribe provides multicast group management for data delivery. It builds on top of Pastry, which provides
the actual host-to-host routing and content-delivery mechanisms. Scribe first connects hosts together as a
Pastry overlay mesh, i.e., the larger group. Then it constructs an overlay tree for each multicast group on
1 The coordinate of host A is greater than that of host B iff the y-coordinate of A is greater than that of B, or, if their y-coordinates are the
same, the x-coordinate of A is greater than that of B.
2 A Delaunay Triangulation is a planar graph, which has at most 3N edges. Therefore the average number of neighbors is less than or equal
to 2 × 3N/N = 6.
TABLE I
AN EXAMPLE ROUTING TABLE OF HOST 3321 FOR PASTRY.

        Column 1   Column 2   Column 3   Column 4
Row 1   0xxx       1xxx       2xxx       ∅
Row 2   30xx       31xx       32xx      ∅
Row 3   330x       331x       ∅          333x
Row 4   3320       ∅          3322       3323
top of the mesh such that the tree branches are embedded in the mesh edges [12]. Clearly, a host can
be a tree node of multiple overlay trees. When a host receives a packet of a multicast group, it simply
forwards the packet to all of its children in the corresponding multicast overlay tree. In Scribe, the non-leaf
nodes are referred to as forwarders, since they forward data to their children for data dissemination.
In Pastry, a host is identified by a random key termed a NodeId, with value between 0 and M (one may think
of the key as the host address). (NodeIds can be made unique with high probability by using common
message digest functions.) Each key is expressed in base B. A host constructs its own routing table based
on the leading prefix of the destination NodeId. The routing table of a host with NodeId u = [u1 u2 . . . ul]
(in base B) has l = ⌈log_B(M + 1)⌉ rows and B columns. The entry at the rth row and cth column
(1 ≤ r ≤ l and 1 ≤ c ≤ B) is for routing to a destination whose NodeId matches the first r − 1 digits of u
and has a digit of value c − 1 at the rth position. More formally, the (r, c) entry is for
forwarding to a host with NodeId v = [v1 v2 . . . vl], where v1 = u1, v2 = u2, . . . , v_{r−1} = u_{r−1}, and
v_r = c − 1, while v_{r+1} . . . vl are don't-care digits.
The routing table shows which entry to look into based on the leading prefix of the destination NodeID.
In the lookup, maximum prefix-matching is used. Upon a match, the entry would indicate the next-hop
NodeID for the packet to be forwarded to. It should be noted that the routing table is constructed in such
a way that the next-hop NodeID also has the same leading prefix as the entry. Each entry indicates at most
one next-hop host although there may be a number of destination hosts matching the prefix. To reduce
the stretch, a host periodically probes those possible next-hop hosts to select the one with the shortest
round-trip time.
We show in Table I an example of the routing table for a host with NodeId 3321, with M = 255 and B = 4,
where x denotes any digit between 0 and B − 1 (i.e., a don't-care digit). Note that the next-hop host for the
∅ entries (at the (1, 4), (2, 4), (3, 3), and (4, 2) entries) is the host 3321 itself. Each entry indicates the
next-hop NodeId with the same prefix (not shown) to be forwarded to. When the host receives a packet
with destination 3310, it forwards the packet to the next-hop host as indicated by the (3, 2) entry.
Clearly, in the Pastry overlay, at each hop the packet is one step closer to its destination, as the length of the
prefix match between the current NodeId and the destination NodeId increases by 1. Therefore, the
path length is O(log_B M).3
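The prefix-matching lookup can be sketched as follows (Python for illustration; rows and columns here are 0-indexed, whereas the text's (r, c) entries are 1-indexed, and the concrete table entries are hypothetical):

```python
# Pastry-style next-hop selection for host 3321 (base B = 4, 4-digit keys).
# table[(r, c)]: a known host sharing r leading digits with us whose
# (r+1)th digit is c.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id, dest, table):
    r = shared_prefix_len(my_id, dest)
    if r == len(my_id):
        return my_id                        # packet has arrived
    return table.get((r, dest[r]), my_id)   # fall back to ourselves

# Hypothetical concrete hosts behind Table I's "331x" and "30xx" slots:
table_3321 = {(2, "1"): "3310", (1, "0"): "3012"}
print(next_hop("3321", "3310", table_3321))   # 3310
print(next_hop("3321", "3033", table_3321))   # 3012
```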
A multicast group in Scribe is assigned a key (the group identifier), which is in the same key space
as NodeIds. A joining host first sends a join request with the group identifier as the destination key along
the Pastry overlay until the request reaches a host already receiving data of the same multicast group. The join
request turns all hosts along the path into forwarders, even if they are not members of the multicast
group. Therefore, the overlay tree is an aggregation of the Pastry paths from the interested hosts to the host
whose key is numerically closest to the group identifier. This tree is free of loops because the distance to
the destination progressively reduces upon each hop.4 Therefore, an overlay tree is topologically sorted
by the distance from the destination.
We show an example of multicast routing in Pastry in Figure 5, with each digit being either 0 or 1
(i.e., B = 2) and the group identifier being 0000. There are 8 hosts in the network, and only the hosts
(unshaded nodes) 0100, 0101, 0011 and 0110 are group members; they join in that order. We arrange
the hosts into concentric circles based on the length of their matched prefix with the group identifier,
with the circles separated by a distance of 1. In this example, the topological order of the multicast
tree is [0000], [0001], [0010, 0011], [0100, 0101, 0110, 0111], where hosts within the same square brackets are
interchangeable. When host 0100 joins the multicast group (whose key has only one matched prefix), it
sends a join request to host 0001 (two matched prefixes), which in turn forwards the request to host 0000.
Similarly, the joining host 0101 sends a join request to host 0001, which has a longer matching prefix
than 0010. However, since host 0001 has already become a forwarder, it suppresses the request. Clearly, the
tree is without any loop.
To maintain the connectivity of the overlay tree, each host periodically sends refresh messages to its
children. Any host that fails to receive refresh messages assumes that its parent is dead and rejoins the
multicast group to recover the partition. To reduce the refresh overhead, multicast packets can serve as
implicit refresh messages. There is also an algorithm to remove bottlenecks in the data delivery tree by
limiting the number of children of a host through delegation of its children to other nodes [12]. When a
host is overloaded, it first identifies the multicast group which consumes the most resources, and sends a
control message listing its children's NodeIds to the farthest child within that group. Upon receiving
the control message, that child chooses a new parent among the children listed in the
3 There are some special cases which need to be considered, such as the case when a host cannot find any neighbor with the same
leading prefix as its table entry. This is taken care of by comparing the numerical difference, rather than the prefix, between the destination
NodeId and some other set of neighbor NodeIds of the host. In this case, the routing is less efficient but the distance to the destination still
monotonically decreases upon each hop. The details of the routing mechanism of Pastry are discussed in [12].
4 Suppose P = {h0, h1, . . .} is the Pastry path, with h0 being the joining host, h1 the first hop, etc. The distance of hi from the destination
is longer than that of hi+1.
[Figure 5 omitted: hosts 0000–0111 arranged in concentric circles by matched prefix length (1, 2 and 3 prefix digits matched).]
Fig. 5. The loop-free tree built by Scribe.
message. (Note that this algorithm may break the loop-free property of the Pastry tree, and hence has to
rely on other mechanisms for recovery.)
The routing table at a host in Pastry has ⌈log_B(M + 1)⌉ rows of B entries each, and hence Scribe is scalable in terms of
group size. However, the tree-building algorithm requires a host to serve groups not of its interest.
Moreover, the performance of Pastry routes depends on the key distribution. There may be cases where,
even if two hosts are very close in location, they are separated by many hops on the Pastry
overlay due to a poor match in their prefixes. As a result, a high stretch value results.
III. TREE-BASED PROTOCOLS
In tree-based protocols, a multicast tree is built directly over all joining members, and no mesh topology
is needed. The tree structure does not change with the source host. Because of that, a single node failure
or a loop can destroy the tree, thereby making it more fragile than a mesh. Loop avoidance is
hence an important issue in tree-based protocols. We have chosen NICE and Overcast as two representative
schemes in this category and discuss their basic operations in detail in the following.
A. NICE
NICE is suitable for low-bandwidth streaming applications with a large number of receivers. It organizes
hosts into a multi-layer hierarchical structure, with the highest layer consisting of only one host and the
lowest layer consisting of all the hosts in the group. A host joins a number of layers in a bottom-up manner,
and hosts of the same layer are grouped into a number of clusters.

[Figure 6 omitted: the NICE layer hierarchy L0–L3 over hosts A–H with clusters C^0_0–C^0_3, C^1_0, C^1_1 and C^2_0.]
Fig. 6. NICE organizes hosts into a multi-layer hierarchical structure, where hosts on each layer are grouped into a number of clusters.
We show an example of NICE in Figure 6, where shaded boxes denote clusters and white circles
denote end-hosts. (The arrows will be explained later.) All hosts (i.e., A to H) join the bottom layer (L0),
where hosts are grouped into a number of clusters (namely, C^0_0, C^0_1, C^0_2 and C^0_3). The leader of each cluster
(B, D, F and H) joins the layer one level up (L1). The grouping of hosts into clusters and the selection of
cluster leaders are then repeated.
In NICE, the clusters are organized into a hierarchical tree, with each upper-layer cluster branching out to a
number of child clusters in the layer below. A host may join multiple layers, and belong to a different
cluster in each layer. We denote a host's cluster peers as the set of all the nodes sharing a cluster with
the host.
A joining host u first selects a cluster from layer L0 to join by successive probing from the highest
layer to the lowest: it first queries the rendezvous point for the host on the highest layer. That host tells u
the leaders of the layer one level down. By measuring the round-trip time to these leaders, u selects
the closest one and queries it for the leaders it covers on the layer one level down. The process is
repeated until u finds the closest leader of a cluster on the lowest layer; u then joins that cluster. We show
an example in Figure 7. Joining host I first queries the rendezvous point for the host on the highest layer,
i.e., D, and then queries D for the hosts covered on the lower layer, i.e., D and H. I selects the
closest one, H, to repeat the process. Eventually, joining host I finds the closest cluster C^0_3 on layer L0,
which it joins.
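The top-down probing can be sketched with a toy hierarchy (the class, the cluster names and the RTT values are all hypothetical; Python for illustration):

```python
# NICE join sketch: starting from the top-layer host, repeatedly pick the
# closest leader one layer down until a layer-L0 cluster is reached.

class Leader:
    def __init__(self, name, cluster=None, children=None):
        self.name = name
        self.cluster = cluster          # set only for layer-L0 leaders
        self.children = children or []  # leaders one layer down

def find_cluster(top, rtt):
    current = top
    while current.children:
        current = min(current.children, key=lambda n: rtt[n.name])
    return current.cluster

# Toy hierarchy mirroring Figure 7: D heads the top layer; D and H lead
# the layer below; the joining host I measures a smaller RTT to H.
H = Leader("H", cluster="C^0_3")
D_low = Leader("D'", cluster="C^0_1")
top = Leader("D", children=[D_low, H])
rtt_from_I = {"D'": 40, "H": 10}
print(find_cluster(top, rtt_from_I))   # C^0_3
```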
Unlike Scribe, the overlay tree for data delivery in NICE is not pre-constructed. When a host receives
a multicast packet from a host of cluster c, it simply forwards the packet to all its cluster peers except
14
L2
L0
L1
L3 RP
E ID
C
C
1
BA
B D HF
C 03
0
H
10C 1
1C
02C0
0C 0C 03C
G
D
D
F
H
2
(a)
L2
L0
L1
L3 RP
E ID
C
C
1
BA
B D HF
C 03
0
H
10C 1
1C
02C0
0C 0C 03C
G
D
D
F
H
2
(b)
L0
L1
L3
L2
D E
RP
IF
C
CBA
B D HF
C 03
C
10C 1
1C
02C0
0C 01
H
03C
D
G
D H
20
(c)
L0
L1
L3
L2
D E
RP
IF
C
CBA
B D HF
C 03
C
10C 1
1C
02C0
0C 01
H
03C
D
G
D H
20
(d)
L2
L0
L1
L3
C D I
RP
E
C
1
BA
B D HF
C 03
2
H
10C 1
1C
02C0
0C 0
G
C 03C
D
F
D H
0
(e)
Fig. 7. In NICE, the joining host selects a cluster on layerL0 to join with by successively probing from highest layer to lowest layer.
those in clusterc. Referring back to Figure 6 again. When hostD receives a multicast packet from hostB
of clusterC10 , it forwards the packet to its cluster peers inC2
0 andC01 , i.e., nodesH andC, respectively.
Therefore, the maximum path length is twice the number of layers, i.e., O(log_k N), where k is the cluster size, and the maximum node stress is equal to the product of the cluster size and the number of layers, i.e., O(k log_k N).
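The forwarding rule can be sketched directly from its description. The cluster names and memberships below are illustrative stand-ins for the Figure 6 example (C01 for C_0^1, and so on), not data structures from the paper.

```python
def forward_targets(host, sender_cluster, clusters, member_of):
    """NICE forwarding sketch: relay a packet that arrived from a host in
    `sender_cluster` to all cluster peers except those in that cluster."""
    targets = set()
    for cname in member_of[host]:         # every cluster the host belongs to
        if cname == sender_cluster:
            continue                      # never send back into the sender's cluster
        targets.update(clusters[cname])
    targets.discard(host)                 # a host does not forward to itself
    return targets

# D belongs to C_0^1, C_1^0 and C_2^0 in the Figure 6 example.
clusters = {"C01": {"C", "D"}, "C10": {"B", "D", "F", "H"}, "C20": {"D", "H"}}
member_of = {"D": ["C01", "C10", "C20"]}

print(forward_targets("D", "C10", clusters, member_of))   # D forwards to C and H
```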
To maintain the overlay topology, a host periodically sends heartbeat messages to its cluster peers, so the sudden departure of any member can be detected through the loss of heartbeat messages. NICE also bounds the size of a cluster between k and 3k − 1: a cluster splits into two clusters if its size exceeds the upper bound, and merges with another cluster if its size falls below the lower bound.
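The maintenance decision reduces to a bounds check on the cluster size; a minimal sketch (how the split partitions members and picks new leaders is protocol detail not shown here):

```python
def cluster_action(size, k):
    """Decide NICE cluster maintenance from the size bounds [k, 3k - 1]."""
    if size > 3 * k - 1:
        return "split"     # too large: break into two clusters
    if size < k:
        return "merge"     # too small: merge with another cluster
    return "ok"

print(cluster_action(9, 3))   # with k = 3 the upper bound is 8, so: split
```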
NICE is efficient in terms of end-to-end delay, since the path length for forwarding a data packet is of order O(log_k N). However, NICE may create bottlenecks at the top-layer and higher-layer nodes, since all joining members have to query one node at each layer of the hierarchy.
B. Overcast
Overcast is designed for single-source applications, e.g., TV-broadcasting. It tries to maximize each
host’s bandwidth from the source. Latency, on the other hand, is not the major concern.
A new member joins the multicast tree by contacting its potential parents, with the root node as every new node's default potential parent. The new node estimates its available bandwidth to this potential parent. It also estimates the bandwidth to the potential parent through each of the potential parent's children. If the bandwidth through any of the children approximates the direct bandwidth to the potential parent, the closest (in terms of network hops) of all the qualified children becomes the new potential parent and a new round commences. If there are no qualified children, the procedure stops and the current potential parent becomes the new node's parent.
To estimate the bandwidth, the node measures the download time of 10 Kbytes, which includes all the service costs. If the measured bandwidths to two nodes are within 10% of each other, the two nodes are considered equally good and the closer one is selected.
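The join rounds above can be sketched as a descent down the tree. This is a hypothetical simplification: `measure(x)` stands for the 10-Kbyte download test to candidate x, `hops[x]` for its distance in network hops, and all values below are made up.

```python
def overcast_join(root, children, measure, hops, tolerance=0.10):
    """Descend from the root: move to the closest child whose measured
    bandwidth is within `tolerance` of the direct bandwidth to the
    current potential parent; stop when no child qualifies."""
    parent = root
    while True:
        direct = measure(parent)                        # bandwidth to the potential parent
        qualified = [c for c in children.get(parent, [])
                     if measure(c) >= (1 - tolerance) * direct]
        if not qualified:                               # no child is "equally good"
            return parent
        parent = min(qualified, key=lambda c: hops[c])  # closest qualified child

bw = {"R": 10.0, "A": 7.0, "B": 9.6, "C": 9.5, "F": 8.0}   # Mbit/s, made up
children = {"R": ["A", "B", "C"], "C": ["F"], "F": []}
hops = {"A": 2, "B": 3, "C": 2, "F": 4}
print(overcast_join("R", children, bw.get, hops))          # settles under C
```

In the toy run, B and C both qualify in the first round (within 10% of the 10 Mbit/s direct bandwidth); C wins on hop count, and C's only child F falls outside the 10% band, so the new node attaches to C.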
A node periodically reevaluates its position in the tree by measuring the bandwidth to its current siblings, parent and grandparent. It will move below a sibling if that does not decrease its bandwidth back to the root, and it will move one level up if that yields higher bandwidth.
An example is shown in Figure 8, where node R is the root and node H is the new member. Before H joins, all other nodes have formed a multicast tree; the thickness of the arrows indicates the overlay link bandwidth. Initially, H contacts root R. Since its direct bandwidth to R approximates the bandwidth through node B, it switches to B. This process is repeated, and H then switches to E. Since switching to G would lead to lower end-to-end bandwidth, H stays at that level thereafter.
In Overcast, each member maintains its ancestor list for partition avoidance and recovery. A member
rejects any connection requests initiated by its ancestor(s) to avoid looping. When a member detects that
its parent has left the multicast group, it connects to its ancestors one by one, from its grandparent to the
root, until a live member is found. Therefore, the loading is distributed along the path to the root, and
the root is not easily overloaded.
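The ancestor-list recovery is a simple linear scan up the chain; a minimal sketch with made-up liveness data (the list ordering is an assumption that matches the description: grandparent first, root last):

```python
def reconnect(ancestors, is_alive):
    """Overcast recovery sketch: after a parent failure, try ancestors in
    order, from the grandparent up to the root, until a live one is found."""
    for a in ancestors:
        if is_alive(a):
            return a          # new parent
    return None               # entire ancestor chain is gone

alive = {"G": False, "P": True, "R": True}             # made-up liveness
print(reconnect(["G", "P", "R"], lambda a: alive[a]))  # reconnects to P
```

Because each orphaned member starts from its own grandparent rather than from the root, recovery traffic is spread along the path to the root.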
Overcast also includes an “up/down” protocol for information exchange. Each node, including the root,
maintains a table of information about all its descendants and a log of all changes to the table. Each
node periodically checks in with its parent. If a child fails to contact its parent within a given interval,
the parent will assume the child and all its descendants have “died” and will modify the table accordingly. A node also modifies the table when new children arrive. During these periodic check-ins, a node reports new
information that it has observed or been informed of. By this protocol, the root can maintain up-to-date
information about all the other nodes.
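The up/down bookkeeping can be sketched as follows. The class and field names are hypothetical, not from the Overcast implementation; the point is only the rule that a missed check-in prunes the child's entire subtree from the table.

```python
import time

class UpDownNode:
    """Up/down protocol sketch: a node records which child reported each
    descendant, and drops a child's whole subtree when the child misses
    its check-in deadline."""

    def __init__(self, interval=10.0):
        self.interval = interval
        self.table = {}       # descendant -> direct child through which it was reported
        self.last_seen = {}   # direct child -> time of last check-in

    def check_in(self, child, reported_descendants):
        self.last_seen[child] = time.monotonic()
        for desc in reported_descendants:
            self.table[desc] = child

    def sweep(self):
        now = time.monotonic()
        for child in [c for c, t in self.last_seen.items()
                      if now - t > self.interval]:
            del self.last_seen[child]     # child and its subtree assumed dead
            for desc in [d for d, via in self.table.items() if via == child]:
                del self.table[desc]

node = UpDownNode()
node.check_in("B", ["B", "D", "E"])   # B reports itself and its subtree
node.last_seen["B"] -= 100            # simulate a missed deadline
node.sweep()
print(node.table)                     # B's subtree has been pruned
```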
Fig. 8. The joining procedure of Overcast, where node H is the new member: (a) initial stage; (b) intermediate stage; (c) final stage.
The root node has to handle every new member's joining request and is likely to become a system bottleneck. The same problem arises in the up/down protocol. To overcome this single point of failure, Overcast proposes a linear structure at the top of the multicast tree. Figure 9 shows an example, where the black node is the root. The root and the two grey nodes are configured linearly, and each grey node has enough information to act as the new root in case of root failure. However, this technique increases latency.
Overcast concentrates on bandwidth allocation, which differentiates it from all the above protocols. One key issue in its tree building is bandwidth estimation. The current estimation technique is not accurate enough, and testing results may not match the conditions during actual data transmission, which affects the efficiency of the tree in terms of bandwidth. In the worst case of building a tree, every newly joining node has to contact all the existing nodes. This leads to a complexity of O(Σ_{i=1}^{N} i) = O(N²), where N is the number of nodes. In the average case, the time complexity is much lower. We will present simulation results in the next section.
IV. COMPARISONS
In this section, we compare the performance of the protocols discussed. We summarize in Table II
the comparison of the protocols in terms of the mechanism of the overlay construction and maintenance,
partition detection, and the recovery scheme. We also show in Table III the performance of these protocols in terms of path length, node stress, and routing table size with respect to the group size N, the number of possible keys in the network M (for Scribe), and the cluster size K (for NICE).

Fig. 9. Linear structure at the top of an Overcast tree, which helps overcome the single point of failure problem.

From Table III, we see that the maximum path lengths of Scribe and NICE increase only logarithmically with the group size, while Narada and Overcast may have very long paths in the worst case. Although the worst-case path length of DT is O(N), its average path length is only O(√N). Note that Overcast's tree topology depends on path bandwidths and cannot easily be determined in advance. On average, the node stresses of NICE, DT, Narada and Overcast are independent of the group size, but only Narada can guarantee a constant node stress in the worst case. The maximum node stress of NICE grows logarithmically with the group size, since the host on the topmost layer is also the leader of clusters on many lower layers and may become the bottleneck of the system; DT and Overcast provide no guarantee on the maximum node stress. The routing table sizes of DT and NICE are independent of the group size, because only neighborhood information needs to be maintained. On the other hand, the table sizes of Scribe and Narada grow logarithmically and linearly with the group size, respectively, while the table size of an Overcast node cannot be determined since the tree topology is not known in advance. Therefore, the scalability of DT is much better than that of Narada and Scribe.
Besides the above analysis, we have also run simulations on Internet-like topologies to compare the performance of these protocols in terms of relative delay penalty (RDP) and physical link stress. In our simulations, we first generate ten Transit-Stub topologies with Georgia Tech's random graph generator [13]. The generated topologies are a two-layer hierarchy of transit networks (four transit domains, each with 16 routers randomly distributed on a 1024 × 1024 grid) and stub networks (64 stub domains, each with 15 routers randomly distributed on a 32 × 32 grid). A host is connected to a stub router via a LAN (on a 4 × 4 grid). The delay of a LAN link is 1 ms, while the delays of core links are given by the topology generator. For each protocol, we randomly select a number of hosts (16 to 1024) as group members and choose one of them as the source. We measure the RDP and link stress when the source sends packets to all hosts. For DT, we take the geographical coordinates of a host as its location. For Scribe, we randomly assign key values to the hosts; there are 2^128 possible keys (corresponding to the IPv6 address space), and B = 16. For Narada, the add-link/drop-link thresholds are set according to the group size and the nodes' current
TABLE II
A COMPARISON BASED ON PROTOCOL DESIGN.

Narada
  Overlay topology construction: adds edges of high utility and drops edges of low consensus cost.
  Partition detection and recovery: every host maintains a complete list of members and probes silent members.
  Packet routing: based on the routing table constructed with a shortest-widest algorithm, reverse path forwarding is applied for multicast packet routing.

DT
  Overlay topology construction: the overlay mesh constructed satisfies the DT property.
  Partition detection and recovery: if more than one leader communicates with the DT server, the server recovers the partition by connecting the leaders together.
  Packet routing: based on compass routing, reverse path forwarding is applied for multicast packet routing.

Scribe
  Overlay topology construction: the data delivery tree is an aggregation of Pastry routes from interested hosts to the rendezvous point.
  Partition detection and recovery: a host rejoins the system if its parent is silent for a long time.
  Packet routing: data is disseminated along the tree from the rendezvous point.

NICE
  Overlay topology construction: organizes hosts in a multiple-layer hierarchical structure, with the highest layer consisting of only one host and the lowest layer consisting of all the hosts.
  Partition detection and recovery: a host periodically probes its cluster peers on different layers.
  Packet routing: a host forwards data to all cluster peers except those in the cluster from which the data arrived.

Overcast
  Overlay topology construction: incrementally builds a tree, trying to maximize the bandwidth to the root for all nodes.
  Partition detection and recovery: each node periodically contacts its parent; if the parent is unreachable, the node connects to its ancestors one by one, from its grandparent to the root.
  Packet routing: data is disseminated along the tree from the root.
TABLE III
COMPARISON OF PATH LENGTH, NODE STRESS AND SIZE OF ROUTING TABLE.

Protocol   Path length (avg / max)     Node stress (avg / max)   Routing table size
Narada     O(N) / O(N)                 O(1) / O(1)               O(N)
DT         O(√N) [5] / O(N)            ≤ 6 / O(N)                O(1)
Scribe*    O(log M) / O(log M)         O(N) / O(N)               O(log M)
NICE       O(log N) / O(log N)         O(K) / O(K log N)         O(K) [8]
Overcast   undetermined / O(N)         O(1) / O(N)               undetermined

* Since the bottleneck removal algorithm may introduce looping, we did not implement it.
and maximum fanout values. Each node's fanout range (the minimum and maximum number of neighbors each member strives to maintain in the mesh) is 3 to 6. For NICE, the cluster size parameter k is set to 3. For Overcast, the bandwidths of links internal to transit domains are uniformly distributed between 0 and 45 Mbit/s, those of links connecting stub domains and transit domains between 0 and 1.5 Mbit/s, and those of links internal to stub domains between 0 and 100 Mbit/s.
We show in Figures 10(a) and 10(b) the average RDP versus group size and the cumulative distribution of RDP (with 256 group members), respectively, for the different protocols. In general, RDP increases with group size as packets take more hops to reach all the end-hosts. The RDP of Scribe and DT is substantially higher than that of Narada, NICE and Overcast. The high RDP of Scribe is due to its random key distribution, which adversely affects the Pastry routes. In DT, since the geographical locations of the hosts do not correlate well with their Internet locations, the overlay mesh built may not reflect the inter-host distances in the underlying network, leading to a high average RDP. Because NICE and Narada continuously measure the round-trip times between hosts to improve their overlay topologies, their RDPs are on average much lower. The distribution of RDP confirms our observation that DT and Scribe have much higher RDP than NICE, Narada and Overcast.
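For concreteness, the metric can be computed per receiver as the ratio of its overlay-path delay to its direct unicast delay from the source (this is the usual definition of RDP; the delay values below are made up):

```python
def average_rdp(overlay_delay, unicast_delay, members, source):
    """Average relative delay penalty over all receivers."""
    ratios = [overlay_delay[m] / unicast_delay[m]
              for m in members if m != source]
    return sum(ratios) / len(ratios)

# Made-up delays (ms) from source s to receivers a and b.
overlay = {"a": 30.0, "b": 80.0}
unicast = {"a": 20.0, "b": 40.0}
print(average_rdp(overlay, unicast, ["s", "a", "b"], "s"))   # 1.75
```

An RDP of 1 for every receiver would mean the overlay adds no delay over direct unicast, which is why a star-like topology drives the average RDP toward 1.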
Note that our simulation results on Scribe differ from what is observed in [6], mainly because of differences in the simulation environment. In their simulation, 100,000 hosts are attached to only 5050 routers, i.e., about 20 hosts per router on average, as opposed to less than one host per router in our case. The resulting Pastry mesh is hence much denser in their experiments. Since it is more likely to find short or direct paths in a dense mesh, the performance is better. Overcast's RDP is surprisingly low here, since its tree construction mainly relates to bandwidth rather than delay.
We investigate its tree topologies and find that node stresses are unevenly distributed among the members: a few nodes have degrees over 20, while most of the others have degrees of 1 to 5. That is, some nodes have plenty of bandwidth and can support many more children than others. Since the average RDP of a star topology is equal to 1, this partially star-like topology in the Overcast tree helps reduce the average RDP. We note, however, that although this low average RDP is desirable, the system load of Overcast is not evenly distributed. The real capacity of a node (in bandwidth or processing capability) may not match its number of children, and nodes with a large node stress are likely to become system bottlenecks. Correspondingly, we will see below that the average link stress of Overcast is much higher, which implies that a few links, namely those with high bandwidth, are used very frequently.

Fig. 10. A performance comparison among the protocols in terms of RDP: (a) average RDP versus group size; (b) cumulative distribution of RDP for a group with 256 members.
We next show in Figures 11(a) and 11(b) the average link stress versus group size and the stress distribution, respectively, for the different protocols. In general, the link stress increases with group size as packets take more hops to reach all the end-hosts. The link stress of Scribe is the highest, while those of DT and NICE are among the lowest. The high link stress of Scribe is due to its unbounded node stress, while the low link stresses of DT and NICE are due to their rather uniform and low node stress.
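Link stress counts how many identical copies of a packet cross each physical link. A minimal sketch, assuming each overlay hop is given as its router-level path (toy paths below):

```python
from collections import Counter

def link_stress(overlay_paths):
    """Count packet copies per physical link across all overlay hops."""
    stress = Counter()
    for path in overlay_paths:
        for link in zip(path, path[1:]):   # consecutive routers form a link
            stress[link] += 1
    return stress

# Two overlay hops from source s that share the access link (s, r1).
paths = [["s", "r1", "a"], ["s", "r1", "r2", "b"]]
print(link_stress(paths)[("s", "r1")])     # the shared link carries 2 copies
```

IP multicast would deliver exactly one copy per link, so link stress measures the redundancy that ALM pays for moving replication to the end-hosts.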
Fig. 11. A performance comparison among the protocols in terms of physical link stress: (a) average physical link stress versus group size; (b) cumulative distribution of physical link stress for a group with 256 members.

V. CONCLUSION

Application-level multicast (ALM) implements multicast-related functionalities at the application level. This technique promises to overcome the deployment problems associated with IP multicast. In ALM, since packets take more hops to reach all members, the delay and stress are higher than in IP multicast. In this paper, we have reviewed a number of application-level multicast protocols.

In general, ALM protocols can be classified as mesh-based or tree-based, depending on how the data delivery tree is built. We have chosen Narada, DT and Scribe to represent mesh-based protocols, and NICE and Overcast to represent tree-based protocols. We describe their delivery mechanisms in detail with illustrative examples. Using Internet-like topologies, we also simulate and compare their performance in terms of stress and relative delay penalty (RDP).
Narada, though not scalable due to its flat routing protocol, is robust in terms of fault tolerance, since mesh partitioning can be detected and recovered from without the need for a rendezvous point. In contrast, DT is more scalable due to its local routing protocol, although the DT server may be a single point of failure. Scribe supports applications where the tree spans only a subset of hosts; however, a host in Scribe may need to forward packets for other multicast groups, which raises an incentive issue in its deployment. In NICE, the maximum path length and node stress grow only logarithmically with the group size. Overcast targets optimal bandwidth allocation and treats latency as a secondary concern. Our simulations show that NICE's performance in terms of RDP and stress is the best among all the schemes, while DT does not perform well since it does not conduct network measurements.
REFERENCES

[1] S. E. Deering, “Multicast routing in internetworks and extended LANs,” ACM SIGCOMM Computer Communication Review, vol. 18, no. 4, pp. 55–64, Aug. 1988.
[2] K. Sripanidkulchai, A. Myers, and H. Zhang, “A third-party value-added network service approach to reliable multicast,” in Proceedings of ACM SIGMETRICS, Aug. 1999.
[3] S. Paul, K. K. Sabnani, J. C. Lin, and S. Bhattacharyya, “Reliable multicast transport protocol (RMTP),” IEEE Journal on Selected Areas in Communications, vol. 15, no. 3, pp. 407–421, Apr. 1997.
[4] Y.-H. Chu, S. G. Rao, S. Seshan, and H. Zhang, “A case for end system multicast,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 8, pp. 1456–1471, Oct. 2002.
[5] J. Liebeherr, M. Nahas, and W. Si, “Application-layer multicasting with Delaunay triangulation overlays,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 8, pp. 1472–1488, Oct. 2002.
[6] M. Castro, P. Druschel, A.-M. Kermarrec, and A. I. T. Rowstron, “Scribe: a large-scale and decentralized application-level multicast infrastructure,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 8, pp. 1489–1499, Oct. 2002.
[7] D. Pendarakis, S. Shi, D. Verma, and M. Waldvogel, “ALMI: an application level multicast infrastructure,” in Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS ’01), Berkeley, CA, USA, 2001, pp. 49–60.
[8] S. Banerjee, B. Bhattacharjee, and C. Kommareddy, “Scalable application-layer multicast,” ACM SIGCOMM Computer Communication Review, vol. 32, no. 4, pp. 205–217, Oct. 2002.
[9] J. Jannotti, D. K. Gifford, K. L. Johnson, M. F. Kaashoek, and J. W. O’Toole, “Overcast: reliable multicasting with an overlay network,” in Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI 2000), Oct. 2000, pp. 197–212.
[10] Y. K. Dalal and R. M. Metcalfe, “Reverse path forwarding of broadcast packets,” Communications of the ACM, vol. 21, no. 12, pp. 1040–1048, Dec. 1978.
[11] R. Sibson, “Locally equiangular triangulations,” The Computer Journal, vol. 21, no. 3, pp. 243–245, 1978.
[12] A. Rowstron and P. Druschel, “Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems,” in Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, Nov. 2001, pp. 329–350.
[13] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee, “How to model an internetwork,” in Proceedings of IEEE INFOCOM ’96, 1996, pp. 594–602.