A Survey and Comparison of
Application-level Multicast Protocols
Xing Jin Wan-Ching Wong S.-H. Gary Chan Hoi-Lun Ngan
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon
Hong Kong
Email: {csvenus, gchan}@cs.ust.hk
Tel: +852 2358-6990
Fax: +852 2358-1477
Abstract
Applications such as Internet-TV and software distribution often require multicast for their delivery. As of
today, IP multicast still experiences much difficulty in commercial deployment. In order to overcome this, researchers
have recently shifted their focus to application-level multicast (ALM), where the multicast-related functionalities
are moved to end-hosts. ALM builds an overlay topology consisting of end-to-end unicast connections between
end-hosts. The major concern is how to route data along the topology efficiently. Depending on the method of
building the data delivery tree, existing ALM protocols may be classified as mesh-based or tree-based. We have
chosen Narada, Delaunay Triangulation and Scribe as representatives of the mesh-based approach, while NICE
and Overcast are chosen for the tree-based approach. We describe their basic mechanisms in detail and illustrate
them with examples. Using Internet-like topologies, we also compare their performance in terms of network stress,
stretch, and overhead incurred.
Index Terms
Application-level multicast, multicast, overlay network, performance evaluation
This work is supported, in part, by the Competitive Earmarked Research Grant from the Research Grant Council in Hong Kong
(HKUST6014/01E).
I. INTRODUCTION
Applications such as Internet-TV, multi-party conferencing, and software distribution require point-to-multipoint
Internet communication. In IP multicast, the routers take the responsibility to replicate packets [1]. In Figure 1(a),
we show the mechanism of IP multicast, where the end-hosts are denoted by circles and the routers by squares,
with the cost of a link as indicated. Source S needs to deliver data to recipients H1, H2 and H3. Routers R1 to R5
replicate and forward data packets along the physical links of the spanning tree rooted at source S. Despite its
proposal more than a decade ago, IP multicast still has not been widely deployed, mainly due to the following reasons:
• Multicast management issues — The IP multicast standard does not provide solutions for many important
multicast management issues in large-scale commercial deployment, such as group management,
multicast address allocation and support for network management [2].
• Stateful routers — An IP multicast router needs to maintain per-group state for packet replication.
This state-keeping makes routers unscalable in terms of the number of multicast groups that can be
supported.
• Lack of support for higher-level functionalities — IP multicast is based on best-effort data delivery,
and hence is not reliable. This is not desirable for some applications such as multi-party games and
software distribution. Although a number of reliable multicast protocols such as RMTP [3], SRM
and PGM have been proposed, many challenges such as the heterogeneity of receivers and feedback
issues have to be addressed before they can be extensively deployed.
Due to the above problems, researchers have recently shifted their attention from the network layer to
the application level, i.e., the multicast-related functionalities are moved from routers to end-hosts. This
is the so-called application-level multicast (ALM). We show an example of the ALM delivery mechanism in
Figure 1(b). As opposed to Figure 1(a), S establishes unicast connections with H1 and H3, while H1
in turn delivers data to H2 via unicast. Multicast is hence achieved via piece-wise unicast connections.
Clearly, end-hosts, instead of routers, are responsible for replicating and forwarding multicast packets.
In this way, the spanning tree of ALM forms an overlay topology which consists of only end-hosts and
the unicast connections between them. In discussing ALM, we hence often abstract the network to
consist of end-hosts only. In the rest of the paper, we will focus on such overlay topology and data
delivery.
Implementing multicast functionalities at the application level addresses many issues associated with
IP multicast, and comes with a number of strengths:
• Absence of multicast routers— ALM is built on an overlay topology without the need of multicast
routers. In other words, routers do not need to maintain any multicast group information and hence
are stateless.

[Figure 1 omitted: (a) Packet flows for IP multicast. (b) Packet flows for application-level multicast.]
Fig. 1. A comparison between IP multicast and application-level multicast. In IP multicast, packets are duplicated and forwarded by routers;
in application-level multicast, this is done by the end-hosts.
• Leverage of current unicast protocols— Since the connections are based on unicast, the existing
solutions and functionalities of unicast protocols in the transport layer (such as TCP and UDP) can
be straightforwardly applied in ALM.
When designing an ALM protocol, the following metrics are usually used to evaluate its performance:
• Link and node stress
Because multiple overlay edges may span the same underlying physical link, a packet may traverse
a physical link multiple times. Moreover, a host may need to forward packets to many hosts.
In order to quantify these effects, link stress and node stress are often used, defined as the number of copies
of a packet transmitted over a given physical link and forwarded by a given end-host, respectively. The
distribution of stresses indicates whether an ALM protocol balances load fairly. If it does not, network
traffic will be concentrated at a few physical links or hosts.
As an example, refer to Figure 1(b) again. Since both overlay edges S−H1 and S−H3 span the
same physical link S−R1, the link stress of S−R1 and the node stress of S are both 2.
• Stretch
Since a packet may take multiple unicast segments before reaching its destination, it experiences
higher delay as compared to the path formed by multicast-capable routers. Stretch is a general term
used to refer to such increase in latency. Clearly, ALM protocols designed for latency-sensitive
applications (such as conferencing applications) should have a low stretch. In order to quantify the
stretch, the following two metrics are often used:
– Relative Delay Penalty (RDP), defined as the ratio of the overlay latency from the source to a
given host to the latency along the shortest unicast path.
– Path Length, defined as the number of unicast segments taken before a packet reaches a given
host.
For example, in Figure 1(b), the path of ALM packets going from S to H2 is S → R1 → R2 → H1 → R2 → R3 → H2,
instead of the shortest path S → R1 → R4 → R3 → H2. Therefore the
RDP of H2 is (1 + 4 + 1 + 1 + 2 + 1)/(1 + 2 + 1 + 1) = 10/5 = 2, while the path length is 2.
• Control message overhead
In ALM protocols, control messages in general serve two purposes:
– Connectivity maintenance: Since the multicast group is generally not a static environment (due
to failure of links and nodes and joining and leaving of group members), periodic message
exchange among hosts is essential to maintain the connectivity of the overlay topology.
– Network condition measurement: Most ALM protocols continually improve their connectivity by
measuring the round-trip time and available bandwidth between hosts in order to reduce the
stress and stretch.
The overhead ratio of a protocol is often used to measure the control overhead; it is defined as the
ratio of the amount of non-data traffic to that of data traffic. Non-data traffic consists of the control packets for
connectivity maintenance and network condition measurement. Clearly, a high ratio indicates high
overhead.
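To make the stretch definitions above concrete, the Figure 1(b) numbers can be reproduced with a short sketch (Python used purely for illustration; the per-link delay lists are transcribed from the worked example in the text):

```python
# Relative Delay Penalty (RDP) and path length for host H2 in Figure 1(b).
# The delay lists below are the per-physical-link delays given in the text.

def rdp(overlay_link_delays, shortest_link_delays):
    """RDP: overlay latency divided by the shortest unicast latency."""
    return sum(overlay_link_delays) / sum(shortest_link_delays)

def path_length(overlay_segments):
    """Path length: number of unicast segments traversed."""
    return len(overlay_segments)

overlay = [1, 4, 1, 1, 2, 1]    # S -> R1 -> R2 -> H1 -> R2 -> R3 -> H2
shortest = [1, 2, 1, 1]         # S -> R1 -> R4 -> R3 -> H2
print(rdp(overlay, shortest))               # 2.0
print(path_length(["S->H1", "H1->H2"]))     # 2
```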
Since delivering packets in ALM is less efficient than IP multicast, how to build an efficient overlay
topology for data delivery becomes a major concern. Moreover, since the multicast group is not a static
environment, the overlay topology formed should be able to recover from a partition when it occurs. Apart from
this, the average control-messaging overhead of an ALM protocol should scale well with the size of a
multicast group.
There are in general two approaches to multicasting packets in ALM: mesh-based or tree-based. In the mesh-based
approach, joining members first form an overlay mesh, and build the multicast tree(s) on top of
it. Examples of this approach are Narada [4], DT [5], Scribe [6], ALMI [7], MCAN, GOSSAMER and
Bayeux. On the other hand, in the tree-based approach, the multicast tree is built directly without the need for a pre-constructed
mesh. Examples of this approach are NICE [8], Overcast [9], Yoid and HMTP. Our goals in this
paper are to educate readers on the basic mechanisms of some representative protocols in each category and
to present detailed quantitative comparisons among them in the same setting (as opposed to some previous
comparisons, which are mostly qualitative in nature).
II. MESH-BASED PROTOCOLS
In a mesh-based protocol, a mesh-like topology is first built, in which a host connects to a number of
other hosts termed neighbors. There can be multiple paths connecting a pair of hosts on the mesh.
Based on the mesh, single or multiple data delivery tree(s) are built, according to the specific protocol.
ALGORITHM I: EVALUATE UTILITY(u, v)
1  utility ← 0
2  for each m in multicast group − {u}
3      do current latency ← latency between u and m along the mesh
4         new latency ← latency between u and m along the mesh if edge uv were added
5         if new latency < current latency
6            then utility ← utility + (current latency − new latency) / current latency
7  return utility

ALGORITHM II: EVALUATE CONSENSUS COST(uv)
1  Cost_uv ← number of group members for which u uses v as the next hop for forwarding packets
2  Cost_vu ← number of group members for which v uses u as the next hop for forwarding packets
3  return max(Cost_uv, Cost_vu)
Because delivery trees are embedded in the overlay mesh, delivery efficiency depends on how well the overlay
mesh is constructed. On the one hand, more overlay edges mean more direct connections between hosts,
and hence lower latency. On the other hand, fewer mesh edges mean lower node stress. Therefore,
selecting overlay edges that balance low latency against low node stress is important. Another issue
is that a mesh partition leads to a tree partition; therefore partition avoidance and detection mechanisms are
important. In order to illustrate the basic mechanisms involved in a more concrete manner,
we have chosen three representative schemes, namely Narada, Delaunay Triangulation and Scribe, and
discuss their operations in detail in the following.
A. Narada
Narada is suitable for Internet conferencing applications, where participants can be both sources and
receivers at the same time. In Narada, an overlay mesh is constructed by periodically adding and dropping
connections between hosts. A host periodically exchanges control messages with its neighbors to maintain
connectivity and builds its own routing table with the shortest-widest path algorithm, in which bandwidth
and latency are considered the primary and secondary metrics, respectively.
A joining host first obtains a list of already-joined hosts from a rendezvous point. The joining host then
randomly selects a few of them to connect to. Through periodic exchange of refresh messages among
[Figure 2 omitted: an overlay mesh on hosts A–F with link delays.
(a) The utility of adding edge BD is 118/63. (b) The consensus cost of edge AF is 1.]
Fig. 2. a) By adding edge BD, the delay from B to D is reduced from 9 (along the path B → A → F → D) to 2 (along the path
B → D), the delay from B to F is reduced from 7 (along the path B → A → F) to 4 (along the path B → D → F), and the delay
from B to E is reduced from 9 (along the path B → A → F → E) to 3 (along the path B → D → E). Therefore the utility of BD is
(9−2)/9 + (7−4)/7 + (9−3)/9 = 118/63. b) Since only the shortest path between A and F passes through the connection AF, the consensus cost of
AF is only one.
neighbors, changes in membership due to the joining or leaving of hosts are eventually propagated to
all hosts. In Narada, a reverse-path algorithm is used to multicast packets: when host u receives a multicast
packet from source s, it forwards the packet to those neighbors for which u is on their shortest path to
s [10].
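The reverse-path rule can be sketched as follows (Python for illustration; the next-hop tables are hypothetical and would in practice come from each host's shortest-widest path computation):

```python
# Reverse-path forwarding: on receiving a packet from source s, host u
# forwards it to each neighbor n whose own shortest path to s goes
# through u (i.e., n's next hop toward s is u).

def rpf_targets(u, s, neighbor_next_hop):
    """neighbor_next_hop[n][s] -> neighbor n's next hop toward source s."""
    return sorted(n for n in neighbor_next_hop if neighbor_next_hop[n][s] == u)

# Hypothetical next-hop tables for A's neighbors B, C and D:
tables = {"B": {"S": "A"}, "C": {"S": "A"}, "D": {"S": "B"}}
print(rpf_targets("A", "S", tables))   # ['B', 'C']
```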
Narada uses a periodic mesh refinement algorithm to add or drop connections so as to incrementally
improve the overlay mesh. A host u periodically probes some other hosts to evaluate the overall delay reduction
that a connection to each of them would bring. If the reduction for a host is above a certain threshold (the adding
threshold), u connects to it.
Meanwhile, u also computes the consensus cost of each edge to a neighbor, say the edge uv. Among
all the shortest paths from u to the other nodes in the network, u counts the number of them that include
uv as one of their edges. v computes likewise. The maximum of these two numbers is the cost. If
it is below a certain threshold (the dropping threshold), the edge uv is disconnected.
Note that the adding and dropping thresholds are not fixed values, but are functions of the maximum and
minimum fanout of all the nodes in the overlay mesh. Narada thereby controls the fanout to prevent
a host from having too many connections and becoming a bottleneck.
We show the algorithm to compute the overall delay reduction (termed the utility) for two nodes u and
v if a connection were made in ALGORITHM I, and show an example in Figure 2(a), where hosts B and
D are not connected at first. When B probes D, it finds that by connecting to D the delay from itself to
D can be reduced from 9 (along the path B → A → F → D) to 2 (along the path B → D), the delay
from B to F can be reduced from 7 (along the path B → A → F) to 4 (along the path B → D → F),
and the delay from B to E can be reduced from 9 (along the path B → A → F → E) to 3 (along the
path B → D → E). Therefore the utility of BD is (9−2)/9 + (7−4)/7 + (9−3)/9 = 118/63. If the value is above the
adding threshold, B will connect to D.

[Figure 3 omitted: two triangulations of a convex quadrilateral abcd.]
Fig. 3. a) Two adjacent triangles △abc and △acd forming a convex quadrilateral and violating the DT property. b) Restoration of the DT property
by disconnecting a from c and connecting b and d.
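ALGORITHM I can be sketched directly in code, using the Figure 2(a) latencies from B before and after adding edge BD (Python for illustration; the B–A latency is illustrative and unchanged by the new edge):

```python
# Narada utility (ALGORITHM I): sum, over every member whose latency
# improves, the fractional reduction in latency the candidate edge brings.

def evaluate_utility(current_latency, new_latency):
    utility = 0.0
    for member, cur in current_latency.items():
        new = new_latency[member]
        if new < cur:
            utility += (cur - new) / cur
    return utility

# Latencies from B along the mesh, before/after adding BD (Figure 2(a)).
before = {"A": 2, "D": 9, "F": 7, "E": 9}
after  = {"A": 2, "D": 2, "F": 4, "E": 3}
print(evaluate_utility(before, after))   # 118/63 ≈ 1.873
```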
We also show how to evaluate the consensus cost of a given edge uv in ALGORITHM II, with an example
in Figure 2(b). Consider the edge AF (i.e., A connects to host F initially). For both A and F, because
there is only one shortest path which includes the edge AF (the direct path AF), the consensus cost of
AF is one. Therefore, if the value of the dropping threshold is higher than one, AF will be dropped.
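ALGORITHM II can likewise be sketched. The next-hop tables below are hypothetical but consistent with Figure 2(b), where only the direct path between A and F crosses the edge AF:

```python
# Narada consensus cost (ALGORITHM II): for edge uv, count the group
# members each endpoint reaches via the other, and take the maximum.

def consensus_cost(u, v, next_hop):
    """next_hop[x][m] -> x's next hop toward member m."""
    cost_uv = sum(1 for nh in next_hop[u].values() if nh == v)
    cost_vu = sum(1 for nh in next_hop[v].values() if nh == u)
    return max(cost_uv, cost_vu)

# Hypothetical next hops: A reaches only F itself through F, and vice versa.
next_hop = {
    "A": {"B": "B", "C": "C", "D": "B", "E": "B", "F": "F"},
    "F": {"A": "A", "B": "D", "C": "D", "D": "D", "E": "E"},
}
print(consensus_cost("A", "F", next_hop))   # 1 -> dropped if threshold > 1
```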
Narada detects mesh partitions through the periodic exchange of refresh messages. Basically, if
host u misses a sequence of refresh messages from host v, u suspects that v is dead and probes v. If v
turns out to be still alive, u connects to v in order to recover the partition. Since this algorithm does not involve
the rendezvous point, it is robust to the failure of any host.
Narada is not scalable in terms of the number of hosts. This is because:
• As a flat routing algorithm, the shortest-widest path algorithm requires a routing table as large as
the group size. Therefore the state maintained by a host is of order O(N).
• A control message contains the states of all hosts in the group; therefore, the total control overhead
in the system is high for large groups (O(N^2)).
Another potential problem is that Narada’s convergence time can be very long. Simulation results show
that it is difficult to form a stable mesh, even after a very long time. This is due to the inconsistent criteria
for adding and dropping edges.
B. Delaunay Triangulation (DT)
In DT, each host has ageographicalcoordinate and hosts first form an overlay mesh based on these
coordinates. Compass routing is used to route a packet from one point to another. DT protocol connects
the nodes together in a triangular manner so that the mesh satisfies Delaunay Triangulation property,
i.e., the minimum internal angle of the adjacent triangles in the mesh are maximized [11]. To illustrate
the triangulation process, consider that pointsa, b, c andd form a convex quadrilateralabcd. There are
two possible ways to triangulate it as shown in Figure 3. Since the minimum internal angle of4abc
and △acd (Figure 3(a)) is less than that of △abd and △bcd (Figure 3(b)), the DT protocol transforms the
former configuration into the latter. The mesh formed this way connects geographically close nodes
together.

[Figure 4 omitted: host u with neighbors n0–n3 and an external host t.]
Fig. 4. When u receives a unicast packet with destination t, it forwards the packet to n0. When u receives a multicast packet with source t, it
forwards the packet to n1 and n2.
In the DT protocol, the joining process is bootstrapped by a DT server, which caches a list of joined hosts.
A joining host u first queries the DT server for some already-joined hosts. u then sends unicast join
requests to these hosts, which in turn forward the requests along the DT mesh towards u until
they reach a set of hosts near u. These nearby hosts then connect to u. Note that the newly established
connections may violate the DT property. To restore it, every host periodically tests its connections against
the property, and drops those that fail the test. A host also discovers nearby hosts through the periodic
exchange of control messages with its neighbors, and connects to any nearby host if this does not violate
the DT property.
To illustrate this process, suppose that initially hosts a, b, c and d are connected as in Figure 3(a).
Because the connection ac violates the DT property, it is dropped when host a (or c) tests it. After that,
host b discovers d through control messaging, and connects to d since bd satisfies the DT property. The
resultant overlay mesh (as shown in Figure 3(b)) then satisfies the DT property.
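The periodic DT test can be sketched numerically: for a convex quadrilateral, compare the minimum internal angle produced by each diagonal and keep the better one (coordinates below are hypothetical; Python for illustration):

```python
import math

# Edge-flip test: for a convex quadrilateral a, b, c, d (in order), keep the
# diagonal whose two triangles have the larger minimum internal angle.

def min_angle(p, q, r):
    """Smallest internal angle (radians) of triangle pqr."""
    def ang(a, b, c):   # angle at vertex b of triangle abc
        v1 = (a[0] - b[0], a[1] - b[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        return math.acos(dot / (math.hypot(*v1) * math.hypot(*v2)))
    return min(ang(q, p, r), ang(p, q, r), ang(p, r, q))

def better_diagonal(a, b, c, d):
    """Return 'ac' or 'bd' for the diagonal satisfying the DT property."""
    ac = min(min_angle(a, b, c), min_angle(a, c, d))
    bd = min(min_angle(a, b, d), min_angle(b, c, d))
    return "ac" if ac >= bd else "bd"

# A flat quadrilateral where diagonal ac yields two skinny triangles:
print(better_diagonal((0, 0), (4, 1), (8, 0), (4, -1)))   # bd
```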
To route a packet from one point to another in the DT protocol, compass routing can be used. When host u
receives a unicast packet with destination v, it first computes the slope between the two coordinate points
u and v; let this slope be s. u then computes the slopes of all its N neighbors with respect to itself. Let these
be si, 1 ≤ i ≤ N. u forwards the packet to the neighbor which has the closest slope to s, i.e., u forwards
to neighbor j where |sj − s| is the minimum over all si's. We illustrate compass routing in Figure 4,
where u needs to route a packet destined to t to one of its neighbors. Because the slope between u and
n0 is the closest to the slope between u and t, host u forwards the unicast packet to host n0.
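This can be sketched as follows. For robustness the sketch compares direction angles (via atan2) rather than raw slopes, which matches the spirit of the description while handling vertical directions; the coordinates are hypothetical:

```python
import math

# Compass routing sketch: forward to the neighbor whose direction from u
# is angularly closest to the direction of the destination t.

def compass_next_hop(u, t, neighbors):
    """u, t: (x, y) points; neighbors: dict name -> (x, y)."""
    def angle(p, q):
        return math.atan2(q[1] - p[1], q[0] - p[0])
    target = angle(u, t)
    def ang_diff(name):
        d = abs(angle(u, neighbors[name]) - target)
        return min(d, 2 * math.pi - d)   # wrap around the circle
    return min(neighbors, key=ang_diff)

u = (0.0, 0.0)
t = (10.0, 1.0)
nbrs = {"n0": (2.0, 0.0), "n1": (0.0, 2.0),
        "n2": (-2.0, 0.0), "n3": (0.0, -2.0)}
print(compass_next_hop(u, t, nbrs))   # n0
```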
In DT, similar to Narada, a reverse-path forwarding algorithm is used to multicast packets. When host
u receives a multicast packet from source s, it forwards the packet to a neighbor if u is on the path
from that neighbor to s.
Refer back to Figure 4 again. When host u receives a multicast packet with source t, it forwards the
packet to n1 and n2. This is because the slope of t → n1 is the closest to the slope of u → n1 among all the
slopes of u → n1, n0 → n1 and n2 → n1, while the slope of t → n2 is the closest to the slope of u → n2
among the slopes of u → n2, n1 → n2 and n3 → n2.
In the DT protocol, mesh partitioning is detected with the help of the DT server. For each connected mesh, one and
only one host is elected as leader (usually the one with the greatest coordinate1). The leader
periodically exchanges control messages with the server. When the mesh is partitioned, there would
be more than one leader communicating with the server. The server can then recover the partition by
requesting one of them to connect to the others.
Compared to Narada, the DT protocol is much more scalable in terms of the number of hosts. This
is because compass routing is a form of local routing, based only on the coordinates of the hosts
directly connected to a node. Therefore the size of the routing table maintained by a host depends only on its
number of neighbors, which is on average less than or equal to six.2 However, the DT protocol
comes with the following two weaknesses:
• Inaccurate host location estimation — The geographic locations of hosts in general do not correlate
well with the latencies between hosts in the Internet; therefore, the end-to-end delay along a DT
overlay may be quite large.
• Single point of failure — The partition detection and recovery scheme relies on a DT server, which
forms a single point of failure.
C. Scribe
In Scribe, there are many hosts in the network, but a multicast group only covers a subset of
them. Hosts which are not group members also take part in packet forwarding in the Scribe network. A
possible application for Scribe is an Internet chatroom, where usually a small set of users out of a possibly
large pool belongs to the same multicast group. The large pool of hosts jointly takes part in forwarding
packets for the group members in the system.
Scribe provides multicast group management for data delivery. It builds on top of Pastry, which provides
the actual host-to-host routing and content-delivery mechanisms. Scribe first connects hosts together as a
Pastry overlay mesh, i.e., the larger group. Then it constructs an overlay tree for each multicast group on
1 The coordinate of host A is greater than that of host B iff the y-coordinate of A is greater than that of B, or, if their y-coordinates are the
same, the x-coordinate of A is greater than that of B.
2 A Delaunay Triangulation is a planar graph, which has at most 3N edges. Therefore the average number of neighbors is less than or equal
to 2 × 3N/N = 6.
TABLE I
AN EXAMPLE ROUTING TABLE OF HOST 3321 FOR PASTRY.

        Column 1   Column 2   Column 3   Column 4
Row 1   0xxx       1xxx       2xxx       ∅
Row 2   30xx       31xx       32xx      ∅
Row 3   330x       331x       ∅          333x
Row 4   3320       ∅          3322       3323
top of the mesh such that the tree branches are embedded in the mesh edges [12]. Clearly, a host can
be a tree node of multiple overlay trees. When a host receives a packet of a multicast group, it simply
forwards the packet to all of its children in the corresponding multicast overlay tree. In Scribe, the non-leaf
nodes are referred to as forwarders, since they forward data to their children for data dissemination.
In Pastry, a host is identified by a random key termed a NodeId, with value between 0 and M (one may think
of the key as the host address). (NodeIds can be made unique with high probability by using common
message digest functions.) Each key is expressed in base B. A host constructs its own routing table based
on the leading prefix of the destination NodeId. The routing table of a host with NodeId u = [u1 u2 . . . ul]
(in base B) has l = ⌈log_B(M + 1)⌉ rows and B columns. The entry at the rth row and cth column
(1 ≤ r ≤ l and 1 ≤ c ≤ B) is for routing to a destination whose NodeId matches the first r − 1 digits of u
and has a digit of value c − 1 at the rth position. More formally, the (r, c) entry is for
forwarding to a host with NodeId v = [v1 v2 . . . vl], where v1 = u1, v2 = u2, . . . , v_{r−1} = u_{r−1}, and
v_r = c − 1, while v_{r+1} . . . vl are don't-care digits.
The routing table shows which entry to look into based on the leading prefix of the destination NodeID.
In the lookup, maximum prefix-matching is used. Upon a match, the entry would indicate the next-hop
NodeID for the packet to be forwarded to. It should be noted that the routing table is constructed in such
a way that the next-hop NodeID also has the same leading prefix as the entry. Each entry indicates at most
one next-hop host although there may be a number of destination hosts matching the prefix. To reduce
the stretch, a host periodically probes those possible next-hop hosts to select the one with the shortest
round-trip time.
We show in Table I an example of the routing table for a host with NodeId 3321, with M = 255 and B = 4,
where x denotes any digit between 0 and B − 1 (i.e., a don't-care digit). Note that the next-hop host for the
∅ entries (at the (1, 4), (2, 4), (3, 3), and (4, 2) entries) is the host 3321 itself. Each entry indicates the
next-hop NodeId with the same prefix (not shown) to be forwarded to. When the host receives a packet
with destination 3310, it forwards the packet to the next-hop host as indicated by the (3, 2) entry.
Clearly, in the Pastry overlay, at each hop the packet is one step closer to its destination, as the length of the
prefix match between the current NodeId and the destination NodeId increases by 1. Therefore, the
path length is O(log_B M).3
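The prefix-matching lookup can be sketched as follows (Python for illustration; rows and columns here are 0-indexed, whereas the text's (r, c) entries are 1-indexed, and the concrete table entries are hypothetical):

```python
# Pastry-style next-hop selection for host 3321 (base B = 4, 4-digit keys).
# table[(r, c)]: a known host sharing r leading digits with us whose
# (r+1)th digit is c.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id, dest, table):
    r = shared_prefix_len(my_id, dest)
    if r == len(my_id):
        return my_id                        # packet has arrived
    return table.get((r, dest[r]), my_id)   # fall back to ourselves

# Hypothetical concrete hosts behind Table I's "331x" and "30xx" slots:
table_3321 = {(2, "1"): "3310", (1, "0"): "3012"}
print(next_hop("3321", "3310", table_3321))   # 3310
print(next_hop("3321", "3033", table_3321))   # 3012
```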
A multicast group in Scribe is assigned a key (the group identifier), which is in the same key space
as NodeIds. A joining host first sends a join request with the group identifier as the destination key along
the Pastry overlay until the request reaches a host already receiving data of the same multicast group. The join
request turns all hosts along the path into forwarders, even if they are not members of the multicast
group. Therefore, the overlay tree is an aggregation of the Pastry paths from the interested hosts to the host
whose key is numerically closest to the group identifier. This tree is free of loops because the distance to
the destination progressively reduces upon each hop.4 Therefore, an overlay tree is topologically sorted
by the distance from the destination.
We show an example of multicast routing in Pastry in Figure 5, with each digit being either 0 or 1
(i.e., B = 2) and the group identifier being 0000. There are 8 hosts in the network, and only the hosts
(unshaded nodes) 0100, 0101, 0011 and 0110 are group members; they join in that order. We arrange
the hosts into concentric circles based on the length of their matched prefix with the group identifier,
with the circles separated by a distance of 1. In this example, the topological order of the multicast
tree is [0000], [0001], [0010, 0011], [0100, 0101, 0110, 0111], where hosts within the same square brackets are
interchangeable. When host 0100 joins the multicast group (whose key has only one matched prefix), it
sends a join request to host 0001 (two matched prefixes), which in turn forwards the request to host 0000.
Similarly, the joining host 0101 sends a join request to host 0001, which has a longer matching prefix
than 0010. However, since host 0001 has already become a forwarder, it suppresses the request. Clearly, the
tree is without any loop.
To maintain the connectivity of the overlay tree, each host periodically sends refresh messages to its
children. Any host that fails to receive refresh messages assumes that its parent is dead and rejoins the
multicast group to recover the partition. To reduce the refresh overhead, multicast packets can serve as
implicit refresh messages. There is also an algorithm to remove bottlenecks in the data delivery tree by
limiting the number of children of a host through delegation of its children to other nodes [12]. When a
host is overloaded, it first identifies the multicast group which consumes the most resources, and sends a
control message listing its children's NodeIds to the farthest child within that group. Upon receiving
the control message, that child chooses a new parent among the children listed in the
3 There are some special cases which need to be considered, such as the case when a host cannot find any neighbor with the same
leading prefix as its table entry. This is taken care of by comparing the numerical difference, rather than the prefix, between the destination
NodeId and some other set of neighbor NodeIds of the host. In this case, the routing is less efficient but the distance to the destination still
monotonically decreases upon each hop. The details of the routing mechanism of Pastry are discussed in [12].
4 Suppose P = {h0, h1, . . .} is the Pastry path, with h0 being the joining host, h1 the first hop, etc. The distance of hi from the destination
is longer than that of hi+1.
[Figure 5 omitted: hosts 0000–0111 arranged in concentric circles by matched prefix length (1, 2 and 3 prefix digits matched).]
Fig. 5. The loop-free tree built by Scribe.
message. (Note that this algorithm may break the loop-free property of the Pastry tree, and hence has to
rely on other mechanisms for recovery.)
The routing table at a host in Pastry has ⌈log_B(M + 1)⌉ rows of B entries each, and hence Scribe is scalable in terms of
group size. However, the tree-building algorithm requires a host to serve groups not of its interest.
Moreover, the performance of Pastry routes depends on the key distribution. There may be cases where,
even if two hosts are very close in location, they are separated by many hops on the Pastry
overlay due to a poor match in their prefixes. As a result, a high stretch value results.
III. TREE-BASED PROTOCOLS
In tree-based protocols, a multicast tree is built directly over all joining members, and no mesh topology
is needed. The tree structure does not change with the source host. Because of that, a single node failure
or a loop can destroy the tree, thereby making it more fragile than a mesh. Loop avoidance is
hence an important issue in tree-based protocols. We have chosen NICE and Overcast as two representative
schemes in this category and discuss their basic operations in detail in the following.
A. NICE
NICE is suitable for low-bandwidth streaming applications with a large number of receivers. It organizes
hosts into a multi-layer hierarchical structure, with the highest layer consisting of only one host and the
lowest layer consisting of all the hosts in the group. A host joins a number of layers in a bottom-up manner,
and hosts of the same layer are grouped into a number of clusters.

[Figure 6 omitted: the NICE layer hierarchy L0–L3 over hosts A–H with clusters C^0_0–C^0_3, C^1_0, C^1_1 and C^2_0.]
Fig. 6. NICE organizes hosts into a multi-layer hierarchical structure, where hosts on each layer are grouped into a number of clusters.
We show an example of NICE in Figure 6, where shaded boxes denote clusters and white circles
denote end-hosts. (The arrows will be explained later.) All hosts (i.e., A to H) join the bottom layer (L0),
where hosts are grouped into a number of clusters (namely, C^0_0, C^0_1, C^0_2 and C^0_3). The leader of each cluster
(B, D, F and H) joins the layer one level up (L1). The grouping of hosts into clusters and the selection of
cluster leaders are then repeated.
In NICE, the clusters are organized into a hierarchical tree, with each upper-layer cluster branching out to a
number of child clusters in the layer below. A host may join multiple layers, and belong to a different
cluster in each layer. We denote a host's cluster peers as the set of all the nodes sharing a cluster with
the host.
A joining host u first selects a cluster from layer L0 to join by successive probing from the highest
layer to the lowest: it first queries the rendezvous point for the host on the highest layer. That host tells u
the leaders of the layer one level down. By measuring the round-trip time to these leaders, u selects
the closest one and queries it for the leaders it covers on the layer one level down. The process is
repeated until u finds the closest leader of a cluster on the lowest layer; u then joins that cluster. We show
an example in Figure 7. Joining host I first queries the rendezvous point for the host on the highest layer,
i.e., D, and then queries D for the hosts covered on the lower layer, i.e., D and H. I selects the
closest one, H, to repeat the process. Eventually, joining host I finds the closest cluster C^0_3 on layer L0,
which it joins.
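The top-down probing can be sketched with a toy hierarchy (the class, the cluster names and the RTT values are all hypothetical; Python for illustration):

```python
# NICE join sketch: starting from the top-layer host, repeatedly pick the
# closest leader one layer down until a layer-L0 cluster is reached.

class Leader:
    def __init__(self, name, cluster=None, children=None):
        self.name = name
        self.cluster = cluster          # set only for layer-L0 leaders
        self.children = children or []  # leaders one layer down

def find_cluster(top, rtt):
    current = top
    while current.children:
        current = min(current.children, key=lambda n: rtt[n.name])
    return current.cluster

# Toy hierarchy mirroring Figure 7: D heads the top layer; D and H lead
# the layer below; the joining host I measures a smaller RTT to H.
H = Leader("H", cluster="C^0_3")
D_low = Leader("D'", cluster="C^0_1")
top = Leader("D", children=[D_low, H])
rtt_from_I = {"D'": 40, "H": 10}
print(find_cluster(top, rtt_from_I))   # C^0_3
```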
Unlike Scribe, the overlay tree for data delivery in NICE is not pre-constructed. When a host receives
a multicast packet from a host of cluster c, it simply forwards the packet to all its cluster peers except
14
L2
L0
L1
L3 RP
E ID
C
C
1
BA
B D HF
C 03
0
H
10C 1
1C
02C0
0C 0C 03C
G
D
D
F
H
2
(a)
L2
L0
L1
L3 RP
E ID
C
C
1
BA
B D HF
C 03
0
H
10C 1
1C
02C0
0C 0C 03C
G
D
D
F
H
2
(b)
L0
L1
L3
L2
D E
RP
IF
C
CBA
B D HF
C 03
C
10C 1
1C
02C0
0C 01
H
03C
D
G
D H
20
(c)
L0
L1
L3
L2
D E
RP
IF
C
CBA
B D HF
C 03
C
10C 1
1C
02C0
0C 01
H
03C
D
G
D H
20
(d)
L2
L0
L1
L3
C D I
RP
E
C
1
BA
B D HF
C 03
2
H
10C 1
1C
02C0
0C 0
G
C 03C
D
F
D H
0
(e)
Fig. 7. In NICE, the joining host selects a cluster on layerL0 to join with by successively probing from highest layer to lowest layer.
those in clusterc. Referring back to Figure 6 again. When hostD receives a multicast packet from hostB
of clusterC10 , it forwards the packet to its cluster peers inC2
0 andC01 , i.e., nodesH andC, respectively.
Therefore, the maximum path length is twice the number of layers, i.e., O(log_k N), where k is the cluster size, and the maximum node stress is equal to the product of the cluster size and the number of layers, i.e., O(k log_k N).
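The forwarding rule can be sketched directly from its description. The cluster names and memberships below are illustrative stand-ins for the Figure 6 example (C01 for C_0^1, and so on), not data structures from the paper.

```python
def forward_targets(host, sender_cluster, clusters, member_of):
    """NICE forwarding sketch: relay a packet that arrived from a host in
    `sender_cluster` to all cluster peers except those in that cluster."""
    targets = set()
    for cname in member_of[host]:         # every cluster the host belongs to
        if cname == sender_cluster:
            continue                      # never send back into the sender's cluster
        targets.update(clusters[cname])
    targets.discard(host)                 # a host does not forward to itself
    return targets

# D belongs to C_0^1, C_1^0 and C_2^0 in the Figure 6 example.
clusters = {"C01": {"C", "D"}, "C10": {"B", "D", "F", "H"}, "C20": {"D", "H"}}
member_of = {"D": ["C01", "C10", "C20"]}

print(forward_targets("D", "C10", clusters, member_of))   # D forwards to C and H
```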
To maintain the overlay topology, a host periodically sends heartbeat messages to its cluster peers, so the sudden departure of any member can be detected through the loss of heartbeat messages. NICE also bounds the size of a cluster between k and 3k − 1: a cluster splits into two clusters if its size exceeds the upper bound, and merges with another cluster if its size falls below the lower bound.
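The maintenance decision reduces to a bounds check on the cluster size; a minimal sketch (how the split partitions members and picks new leaders is protocol detail not shown here):

```python
def cluster_action(size, k):
    """Decide NICE cluster maintenance from the size bounds [k, 3k - 1]."""
    if size > 3 * k - 1:
        return "split"     # too large: break into two clusters
    if size < k:
        return "merge"     # too small: merge with another cluster
    return "ok"

print(cluster_action(9, 3))   # with k = 3 the upper bound is 8, so: split
```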
NICE is efficient in terms of end-to-end delay, since the path length for forwarding a data packet is of order O(log_k N). However, NICE may create bottlenecks at the top-layer and higher-layer nodes, since all joining members have to query one node at each layer of the hierarchy.
B. Overcast
Overcast is designed for single-source applications, e.g., TV-broadcasting. It tries to maximize each
host’s bandwidth from the source. Latency, on the other hand, is not the major concern.
A new member joins the multicast tree by contacting its potential parents, with the root node as every new node's default potential parent. The new node estimates its available bandwidth to this potential parent. It also estimates the bandwidth to the potential parent through each of the potential parent's children. If the bandwidth through any of the children approximates the direct bandwidth to the potential parent, the closest (in terms of network hops) of all the qualified children becomes the new potential parent and a new round commences. If there are no qualified children, the procedure stops and the current potential parent becomes the new node's parent.
To estimate the bandwidth, the node measures the download time of 10 Kbytes, which includes all the service costs. If the measured bandwidths to two nodes are within 10% of each other, the two nodes are considered equally good and the closer one is selected.
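The join rounds above can be sketched as a descent down the tree. This is a hypothetical simplification: `measure(x)` stands for the 10-Kbyte download test to candidate x, `hops[x]` for its distance in network hops, and all values below are made up.

```python
def overcast_join(root, children, measure, hops, tolerance=0.10):
    """Descend from the root: move to the closest child whose measured
    bandwidth is within `tolerance` of the direct bandwidth to the
    current potential parent; stop when no child qualifies."""
    parent = root
    while True:
        direct = measure(parent)                        # bandwidth to the potential parent
        qualified = [c for c in children.get(parent, [])
                     if measure(c) >= (1 - tolerance) * direct]
        if not qualified:                               # no child is "equally good"
            return parent
        parent = min(qualified, key=lambda c: hops[c])  # closest qualified child

bw = {"R": 10.0, "A": 7.0, "B": 9.6, "C": 9.5, "F": 8.0}   # Mbit/s, made up
children = {"R": ["A", "B", "C"], "C": ["F"], "F": []}
hops = {"A": 2, "B": 3, "C": 2, "F": 4}
print(overcast_join("R", children, bw.get, hops))          # settles under C
```

In the toy run, B and C both qualify in the first round (within 10% of the 10 Mbit/s direct bandwidth); C wins on hop count, and C's only child F falls outside the 10% band, so the new node attaches to C.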
A node periodically reevaluates its position in the tree by measuring the bandwidth to its current siblings, parent and grandparent. It will move below a sibling if that does not decrease its bandwidth back to the root, and it will move one level up if that yields higher bandwidth.
An example is shown in Figure 8, where node R is the root and node H is the new member. Before H joins, all other nodes have formed a multicast tree; the thickness of the arrows indicates the overlay link bandwidth. Initially, H contacts root R. Since its direct bandwidth to R approximates the bandwidth through node B, it switches to B. This process is repeated, and H then switches to E. Since switching to G would lead to lower end-to-end bandwidth, H stays at that level thereafter.
In Overcast, each member maintains its ancestor list for partition avoidance and recovery. A member
rejects any connection requests initiated by its ancestor(s) to avoid looping. When a member detects that
its parent has left the multicast group, it connects to its ancestors one by one, from its grandparent to the
root, until a live member is found. Therefore, the loading is distributed along the path to the root, and
the root is not easily overloaded.
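The ancestor-list recovery is a simple linear scan up the chain; a minimal sketch with made-up liveness data (the list ordering is an assumption that matches the description: grandparent first, root last):

```python
def reconnect(ancestors, is_alive):
    """Overcast recovery sketch: after a parent failure, try ancestors in
    order, from the grandparent up to the root, until a live one is found."""
    for a in ancestors:
        if is_alive(a):
            return a          # new parent
    return None               # entire ancestor chain is gone

alive = {"G": False, "P": True, "R": True}             # made-up liveness
print(reconnect(["G", "P", "R"], lambda a: alive[a]))  # reconnects to P
```

Because each orphaned member starts from its own grandparent rather than from the root, recovery traffic is spread along the path to the root.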
Overcast also includes an “up/down” protocol for information exchange. Each node, including the root,
maintains a table of information about all its descendants and a log of all changes to the table. Each
node periodically checks in with its parent. If a child fails to contact its parent within a given interval,
the parent will assume the child and all its descendants have “died” and will modify the table accordingly. A node also modifies the table when new children arrive. During these periodic check-ins, a node reports new
information that it has observed or been informed of. By this protocol, the root can maintain up-to-date
information about all the other nodes.
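The up/down bookkeeping can be sketched as follows. The class and field names are hypothetical, not from the Overcast implementation; the point is only the rule that a missed check-in prunes the child's entire subtree from the table.

```python
import time

class UpDownNode:
    """Up/down protocol sketch: a node records which child reported each
    descendant, and drops a child's whole subtree when the child misses
    its check-in deadline."""

    def __init__(self, interval=10.0):
        self.interval = interval
        self.table = {}       # descendant -> direct child through which it was reported
        self.last_seen = {}   # direct child -> time of last check-in

    def check_in(self, child, reported_descendants):
        self.last_seen[child] = time.monotonic()
        for desc in reported_descendants:
            self.table[desc] = child

    def sweep(self):
        now = time.monotonic()
        for child in [c for c, t in self.last_seen.items()
                      if now - t > self.interval]:
            del self.last_seen[child]     # child and its subtree assumed dead
            for desc in [d for d, via in self.table.items() if via == child]:
                del self.table[desc]

node = UpDownNode()
node.check_in("B", ["B", "D", "E"])   # B reports itself and its subtree
node.last_seen["B"] -= 100            # simulate a missed deadline
node.sweep()
print(node.table)                     # B's subtree has been pruned
```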
Fig. 8. The joining procedure of Overcast, where node H is the new member: (a) initial stage; (b) intermediate stage; (c) final stage.
The root node has to handle every new member's joining request and is likely to become a system bottleneck. The same problem arises in the up/down protocol. To overcome this single point of failure, Overcast proposes a linear structure at the top of the multicast tree. Figure 9 shows an example, where the black node is the root. The root and the two grey nodes are configured linearly, and each grey node has enough information to act as the new root in case of root failure. However, this technique increases latency.
Overcast concentrates on bandwidth allocation, which differentiates it from all the above protocols. One key issue in its tree building is bandwidth estimation. The current estimation technique is not accurate enough, and testing results may not match the conditions during actual data transmission, which affects the efficiency of the tree in terms of bandwidth. In the worst case of building a tree, every newly joining node has to contact all the existing nodes. This leads to a complexity of O(Σ_{i=1}^{N} i) = O(N²), where N is the number of nodes. In the average case, the time complexity is much lower. We will present simulation results in the next section.
IV. COMPARISONS
In this section, we compare the performance of the protocols discussed. We summarize in Table II
the comparison of the protocols in terms of the mechanism of the overlay construction and maintenance,
partition detection, and the recovery scheme. We also show in Table III the performance of these protocols in terms of path length, node stress, and routing table size with respect to the group size N, the number of possible keys in the network M (for Scribe), and the cluster size K (for NICE).

Fig. 9. Linear structure at the top of an Overcast tree, which helps overcome the single point of failure problem.

From Table III, we see that the maximum path lengths of Scribe and NICE increase only logarithmically with the group size, while Narada and Overcast may have very long paths in the worst case. Although the worst-case path length of DT is O(N), its average path length is only O(√N). Note that Overcast's tree topology depends on path bandwidths and cannot easily be determined in advance. On average, the node stresses of NICE, DT, Narada and Overcast are independent of the group size, but only Narada can guarantee a constant node stress in the worst case. The maximum node stress of NICE grows logarithmically with the group size, since the host on the topmost layer is also the leader of clusters on many lower layers and may become the bottleneck of the system; DT and Overcast provide no guarantee on the maximum node stress. The routing table sizes of DT and NICE are independent of the group size, because only neighborhood information needs to be maintained. On the other hand, the table sizes of Scribe and Narada grow logarithmically and linearly with the group size, respectively, while the table size of an Overcast node cannot be determined since the tree topology is not known in advance. Therefore, the scalability of DT is much better than that of Narada and Scribe.
Besides the above analysis, we have also run simulations on Internet-like topologies to compare the performance of these protocols in terms of relative delay penalty (RDP) and physical link stress. In our simulations, we first generate ten Transit-Stub topologies with Georgia Tech's random graph generator [13]. The generated topologies are a two-layer hierarchy of transit networks (four transit domains, each with 16 routers randomly distributed on a 1024 × 1024 grid) and stub networks (64 stub domains, each with 15 routers randomly distributed on a 32 × 32 grid). A host is connected to a stub router via a LAN (on a 4 × 4 grid). The delay of a LAN link is 1 ms, while the delays of core links are given by the topology generator. For each protocol, we randomly select a number of hosts (16 to 1024) as group members and choose one of them as the source. We measure the RDP and link stress when the source sends packets to all hosts. For DT, we take the geographical coordinates of a host as its location. For Scribe, we randomly assign key values to the hosts; there are 2^128 possible keys (corresponding to the IPv6 address space), and B = 16. For Narada, the add-link/drop-link thresholds are set according to the group size and the nodes' current
TABLE II
A COMPARISON BASED ON PROTOCOL DESIGN.

Narada
  Overlay topology construction: adds edges of high utility and drops edges of low consensus cost.
  Partition detection and recovery: every host maintains a complete list of members and probes silent members.
  Packet routing: based on the routing table constructed with a shortest-widest algorithm, reverse path forwarding is applied for multicast packet routing.

DT
  Overlay topology construction: the overlay mesh constructed satisfies the DT property.
  Partition detection and recovery: if more than one leader communicates with the DT server, the server recovers the partition by connecting the leaders together.
  Packet routing: based on compass routing, reverse path forwarding is applied for multicast packet routing.

Scribe
  Overlay topology construction: the data delivery tree is an aggregation of Pastry routes from interested hosts to the rendezvous point.
  Partition detection and recovery: a host rejoins the system if its parent is silent for a long time.
  Packet routing: data is disseminated along the tree from the rendezvous point.

NICE
  Overlay topology construction: organizes hosts in a multiple-layer hierarchical structure, with the highest layer consisting of only one host and the lowest layer consisting of all the hosts.
  Partition detection and recovery: a host periodically probes its cluster peers on different layers.
  Packet routing: a host forwards data to all cluster peers except those in the cluster from which the data arrived.

Overcast
  Overlay topology construction: incrementally builds a tree, trying to maximize the bandwidth to the root for all nodes.
  Partition detection and recovery: each node periodically contacts its parent; if the parent is unreachable, the node connects to its ancestors one by one, from its grandparent to the root.
  Packet routing: data is disseminated along the tree from the root.
TABLE III
COMPARISON OF PATH LENGTH, NODE STRESS AND SIZE OF ROUTING TABLE.

Protocol   Path length (avg / max)     Node stress (avg / max)   Routing table size
Narada     O(N) / O(N)                 O(1) / O(1)               O(N)
DT         O(√N) [5] / O(N)            ≤ 6 / O(N)                O(1)
Scribe*    O(log M) / O(log M)         O(N) / O(N)               O(log M)
NICE       O(log N) / O(log N)         O(K) / O(K log N)         O(K) [8]
Overcast   undetermined / O(N)         O(1) / O(N)               undetermined

* Since the bottleneck removal algorithm may introduce looping, we did not implement it.
and maximum fanout values. Each node's fanout range (the minimum and maximum number of neighbors each member strives to maintain in the mesh) is 3 to 6. For NICE, the cluster size parameter k is set to 3. For Overcast, the bandwidths of links internal to transit domains are uniformly distributed between 0 and 45 Mbit/s, those of links connecting stub domains and transit domains between 0 and 1.5 Mbit/s, and those of links internal to stub domains between 0 and 100 Mbit/s.
We show in Figures 10(a) and 10(b) the average RDP versus group size and the cumulative distribution of RDP (with 256 group members), respectively, for the different protocols. In general, RDP increases with group size as packets take more hops to reach all the end-hosts. The RDP of Scribe and DT is substantially higher than that of Narada, NICE and Overcast. The high RDP of Scribe is due to its random key distribution, which adversely affects the Pastry routes. In DT, since the geographical locations of the hosts do not correlate well with their Internet locations, the overlay mesh built may not reflect the inter-host distances in the underlying network, leading to a high average RDP. Because NICE and Narada continuously measure the round-trip times between hosts to improve their overlay topologies, their RDPs are on average much lower. The distribution of RDP confirms our observation that DT and Scribe have much higher RDP than NICE, Narada and Overcast.
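For concreteness, the metric can be computed per receiver as the ratio of its overlay-path delay to its direct unicast delay from the source (this is the usual definition of RDP; the delay values below are made up):

```python
def average_rdp(overlay_delay, unicast_delay, members, source):
    """Average relative delay penalty over all receivers."""
    ratios = [overlay_delay[m] / unicast_delay[m]
              for m in members if m != source]
    return sum(ratios) / len(ratios)

# Made-up delays (ms) from source s to receivers a and b.
overlay = {"a": 30.0, "b": 80.0}
unicast = {"a": 20.0, "b": 40.0}
print(average_rdp(overlay, unicast, ["s", "a", "b"], "s"))   # 1.75
```

An RDP of 1 for every receiver would mean the overlay adds no delay over direct unicast, which is why a star-like topology drives the average RDP toward 1.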
Note that our simulation results on Scribe differ from what is observed in [6], mainly because of differences in the simulation environment. In their simulation, 100,000 hosts are attached to only 5050 routers, i.e., about 20 hosts per router on average, as opposed to less than one host per router in our case. The resulting Pastry mesh is hence much denser in their experiments. Since it is more likely to find short or direct paths in a dense mesh, the performance is better. Overcast's RDP is surprisingly low here, since its tree construction mainly relates to bandwidth rather than delay.
We investigate its tree topologies and find that node stresses are unevenly distributed among the members: a few nodes have degrees over 20, while most of the others have degrees of 1 to 5. That is, some nodes have plenty of bandwidth and can support many more children than others. Since the average RDP of a star topology is equal to 1, this partially star-like topology in the Overcast tree helps reduce the average RDP. We note, however, that although this low average RDP is desirable, the system load of Overcast is not evenly distributed. The real capacity of a node (in bandwidth or processing capability) may not match its number of children, and nodes with a large node stress are likely to become system bottlenecks. Correspondingly, we will see below that the average link stress of Overcast is much higher, which implies that a few links, namely those with high bandwidth, are used very frequently.

Fig. 10. A performance comparison among the protocols in terms of RDP: (a) average RDP versus group size; (b) cumulative distribution of RDP for a group with 256 members.
We next show in Figures 11(a) and 11(b) the average link stress versus group size and the stress distribution, respectively, for the different protocols. In general, the link stress increases with group size as packets take more hops to reach all the end-hosts. The link stress of Scribe is the highest, while those of DT and NICE are among the lowest. The high link stress of Scribe is due to its unbounded node stress, while the low link stresses of DT and NICE are due to their rather uniform and low node stress.
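Link stress counts how many identical copies of a packet cross each physical link. A minimal sketch, assuming each overlay hop is given as its router-level path (toy paths below):

```python
from collections import Counter

def link_stress(overlay_paths):
    """Count packet copies per physical link across all overlay hops."""
    stress = Counter()
    for path in overlay_paths:
        for link in zip(path, path[1:]):   # consecutive routers form a link
            stress[link] += 1
    return stress

# Two overlay hops from source s that share the access link (s, r1).
paths = [["s", "r1", "a"], ["s", "r1", "r2", "b"]]
print(link_stress(paths)[("s", "r1")])     # the shared link carries 2 copies
```

IP multicast would deliver exactly one copy per link, so link stress measures the redundancy that ALM pays for moving replication to the end-hosts.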
Fig. 11. A performance comparison among the protocols in terms of physical link stress: (a) average physical link stress versus group size; (b) cumulative distribution of physical link stress for a group with 256 members.

V. CONCLUSION

Application-level multicast (ALM) implements multicast-related functionalities at the application level. This technique promises to overcome the deployment problems associated with IP multicast. In ALM, since packets take more hops to reach all members, the delay and stress are higher than in IP multicast. In this paper, we have reviewed a number of application-level multicast protocols.

In general, ALM protocols can be classified as mesh-based or tree-based, depending on how the data delivery tree is built. We have chosen Narada, DT and Scribe to represent mesh-based protocols, and NICE and Overcast to represent tree-based protocols. We describe their delivery mechanisms in detail with illustrative examples. Using Internet-like topologies, we also simulate and compare their performance in terms of stress and relative delay penalty (RDP).
Narada, though not scalable due to its flat routing protocol, is robust in terms of fault tolerance, since mesh partitioning can be detected and recovered from without the need for a rendezvous point. In contrast, DT is more scalable due to its local routing protocol, although the DT server may be a single point of failure. Scribe supports applications where the tree spans only a subset of hosts; however, a host in Scribe may need to forward packets for other multicast groups, which raises an incentive issue in its deployment. In NICE, the maximum path length and node stress grow only logarithmically with the group size. Overcast targets optimal bandwidth allocation and treats latency as a secondary concern. Our simulations show that NICE's performance in terms of RDP and stress is the best among all the schemes, while DT does not perform well since it does not conduct network measurements.
REFERENCES

[1] S. E. Deering, “Multicast routing in internetworks and extended LANs,” ACM SIGCOMM Computer Communication Review, vol. 18, no. 4, pp. 55–64, Aug. 1988.
[2] K. Sripanidkulchai, A. Myers, and H. Zhang, “A third-party value-added network service approach to reliable multicast,” in Proceedings of ACM SIGMETRICS, Aug. 1999.
[3] S. Paul, K. K. Sabnani, J. C. Lin, and S. Bhattacharyya, “Reliable multicast transport protocol (RMTP),” IEEE Journal on Selected Areas in Communications, vol. 15, no. 3, pp. 407–421, Apr. 1997.
[4] Y.-H. Chu, S. G. Rao, S. Seshan, and H. Zhang, “A case for end system multicast,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 8, pp. 1456–1471, Oct. 2002.
[5] J. Liebeherr, M. Nahas, and W. Si, “Application-layer multicasting with Delaunay triangulation overlays,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 8, pp. 1472–1488, Oct. 2002.
[6] M. Castro, P. Druschel, A.-M. Kermarrec, and A. I. T. Rowstron, “Scribe: a large-scale and decentralized application-level multicast infrastructure,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 8, pp. 1489–1499, Oct. 2002.
[7] D. Pendarakis, S. Shi, D. Verma, and M. Waldvogel, “ALMI: an application level multicast infrastructure,” in Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS ’01), Berkeley, CA, USA, 2001, pp. 49–60.
[8] S. Banerjee, B. Bhattacharjee, and C. Kommareddy, “Scalable application-layer multicast,” ACM SIGCOMM Computer Communication Review, vol. 32, no. 4, pp. 205–217, Oct. 2002.
[9] J. Jannotti, D. K. Gifford, K. L. Johnson, M. F. Kaashoek, and J. W. O’Toole, “Overcast: reliable multicasting with an overlay network,” in Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI 2000), Oct. 2000, pp. 197–212.
[10] Y. K. Dalal and R. M. Metcalfe, “Reverse path forwarding of broadcast packets,” Communications of the ACM, vol. 21, no. 12, pp. 1040–1048, Dec. 1978.
[11] R. Sibson, “Locally equiangular triangulations,” The Computer Journal, vol. 21, no. 3, pp. 243–245, 1978.
[12] A. Rowstron and P. Druschel, “Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems,” in Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, Nov. 2001, pp. 329–350.
[13] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee, “How to model an internetwork,” in Proceedings of IEEE INFOCOM ’96, 1996, pp. 594–602.