

On the benefit of network coding for timely and reliable event dissemination in WAN

Christian Esposito, Stefano Russo
Dipartimento di Informatica e Sistemistica (DIS)
Università degli studi di Napoli “Federico II”
via Claudio 21, 80125 - Napoli, Italy
{christian.esposito, sterusso}@unina.it

Roberto Beraldi, Marco Platania
Dipartimento di Informatica e Sistemistica (DIS)
Università degli studi di Roma “La Sapienza”
via Ariosto 25, 00185 - Roma, Italy
{beraldi, platania}@dis.uniroma.it

Abstract—Many interoperable software systems atop large-scale critical infrastructures are based on the publish/subscribe paradigm. They are developed using data dissemination middleware services, which are required to provide reliability and timeliness in multicast communications. The literature on event dissemination and the market of publish/subscribe middleware technologies are rich in solutions; however, they hardly achieve the goal of providing fault-tolerance without violating the timeliness requirements.

In this paper we present an analysis of the related work on this topic and propose a strategy that combines two different approaches, namely coding and gossiping, able to satisfy timeliness and reliability requirements, respectively. We evaluate the potential benefit of coding on the information delivery performance, even when the sender introduces redundancy to improve reliability.

Keywords-Publish/Subscribe Middleware; Forward Error Correction; Gossiping

I. INTRODUCTION

With the increasing use of the Internet, we are noticing a growing trend in developing applications over Wide Area Networks (WAN). Typical examples are Large-scale Complex Critical Infrastructures (LCCIs), such as Air Traffic Control (ATC) systems, financial infrastructures, stock markets and business intelligence, that federate geographically-distributed systems in a “system of systems” manner. A key element in the design of such infrastructures is the event dissemination service used to convey data over a WAN. Due to its implicit decoupling properties, which are suitable for satisfying the scalability requirements exhibited by large-scale infrastructures, the publish/subscribe paradigm is attracting growing interest in the design of LCCIs. However, the non-functional requirements of such infrastructures are not limited to scalability: due to their critical nature, LCCIs also demand reliable and timely communications. This means that exchanged data have to be guaranteed to reach the interested destinations despite possible failures, and that latency always has to be within strict temporal constraints.

The research community on publish/subscribe services has investigated proper methods to implement reliable event dissemination, and several approaches, such as [1], [2], [3], have already been proposed in the last decade. Most of these solutions only focus on techniques that provide fault-tolerance by means of retransmissions, and any reliability improvement is always gained at the expense of severe performance fluctuations. Therefore, such techniques are not suitable for critical infrastructures where timeliness matters as much as fault-tolerance.

The work presented in this paper aims to fill this gap by proposing a strategy to achieve both reliability and timeliness in event dissemination performed by publish/subscribe services over the Internet. Since several works, such as [4], have shown that the Internet exhibits a non-negligible probability of losing several messages in a row (bursty losses), and the analysis presented in [5] has shown that available reliable publish/subscribe services have paid less attention to handling losses, we have decided to focus our work on tolerating message losses. To this aim, we have combined coding and gossiping, known to be timely and reliable, respectively, so as to achieve the best of both. We consider a scenario where a source sends coded information to a set of destinations over Internet links that exhibit a non-negligible probability of bursty losses. Then, destinations apply a gossiping strategy to recover from possible lost packets, with the possibility to exchange coded information, too. We evaluate the potential benefit of coding on the information delivery performance, even when the sender introduces redundancy to improve reliability. In particular, we evaluate the trade-off between redundancy overhead and gossiping rounds in order to fully reconstruct information while preserving timeliness.

The remainder of this paper is organized as follows: in Section II we discuss approaches available in the current literature on reliable publish/subscribe and multicasting, noticing that none of them is able to jointly provide fault-tolerance and timeliness. In Section III we motivate the use of coding as a means to mitigate the performance fluctuations implied by the massive use of retransmissions in gossiping protocols. Section IV describes the proposed strategy to combine gossiping and coding, and explains why coding is useful in data dissemination. In Section V we evaluate information delivery performance by means of a custom time-driven simulator; we also compare the obtained results on reliability improvement in the coding and no-coding cases. Final remarks are discussed in Section VI, which concludes the paper.

II. RECOVERY STRATEGIES

In our work, we have analyzed several recovery strategies for multicast services that can be used within the context of publish/subscribe services; Figure 1 provides a taxonomy of the described solutions. Due to limited space, in this paper we only provide a quick overview of such strategies and discuss their pros and cons (the interested reader can refer to [6] for a more detailed description). Approaches for reliable publish/subscribe services can be briefly grouped as follows: (i) Automatic Repeat reQuest (ARQ); (ii) Forward Error Correction (FEC); and (iii) Path Redundancy.

Figure 1: Classification of all the approaches available in the literature to tolerate dropped messages.

ARQ schemes comprise approaches that detect any message omission caused by some kind of fault and ask for a retransmission. Depending on the node contacted for retransmissions, ARQ schemes can be further classified [7] as: sender-based, i.e., all the nodes always contact the multicaster; parent-based, i.e., a node always contacts its parent within the multicasting tree; and neighbor-based, i.e., retransmission duties are distributed among neighboring nodes. The last class is further divided into the following techniques:

1) Lateral Error Recovery (LER) [8]: all the nodes are randomly grouped in distinct planes, and a node selects as its neighbors one node from each plane other than its own;

2) Cooperative Error Recovery (CER) [9]: nodes are clustered in groups whose members are characterized by a negligible loss correlation (i.e., if a node experiences a message loss, it is unlikely that all the nodes lose the same message), and a node selects its neighbors among the members of its group;

3) Gossiping [10]: a node stores a received message in a buffer of size b, and forwards it a limited number of times t (called fanin) to a randomly-selected set of nodes of size f (called fanout). Many variants of gossiping algorithms exist [11]:

a) Push Approaches: messages are forwarded to the other nodes as soon as they are received;

b) Pull Approaches: nodes periodically send to other nodes a list of recently-received messages; if a lost message is detected by comparing the received list with the history of messages received by the given node, then a retransmission is requested;

c) Push/Pull Approach: a node forwards to the other nodes only the identifier of the last received message; if one of the receivers does not have such a message, then it makes an explicit pull request.
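The push variant described above can be illustrated with a minimal simulation sketch. It assumes a single message (so the buffer size b plays no role), synchronous rounds, and no losses; the function name `push_gossip` and its parameters are hypothetical, with `fanout` standing for f and `max_forwards` for t.

```python
import random

def push_gossip(num_nodes, fanout, max_forwards, seed=0):
    """Simulate push gossip of one message.

    Returns the number of informed nodes after each round.
    Node 0 is the source; every informed node forwards the message
    `max_forwards` times, each time to `fanout` random peers.
    """
    rng = random.Random(seed)
    informed = {0}
    forwards_left = {0: max_forwards}   # remaining forwarding rounds per node
    history = [len(informed)]
    while any(forwards_left.values()):
        newly_informed = set()
        for node, left in list(forwards_left.items()):
            if left == 0:
                continue
            forwards_left[node] = left - 1
            candidates = [p for p in range(num_nodes) if p != node]
            for peer in rng.sample(candidates, fanout):
                if peer not in informed:
                    newly_informed.add(peer)
        for peer in newly_informed:
            informed.add(peer)
            forwards_left[peer] = max_forwards
        history.append(len(informed))
    return history

history = push_gossip(num_nodes=50, fanout=3, max_forwards=2, seed=1)
```

The returned list never decreases, and it stops growing once every informed node has used up its t forwarding rounds.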

FEC schemes contain approaches based on the generation of additional information by encoding message content, so that lost data can be recovered by decoding all the packets received by the node. There are three different approaches, which differ in where the encoding is performed [12], [13]:

1) End-to-End FEC: encoding is done by the multicaster;

2) Link-by-Link FEC: every node in the overlay performs encoding and decoding, in order to protect the delivery on each link;

3) Selective Network-Embedded FEC: only a subset of nodes, which includes the multicaster, performs encoding.

The last group includes approaches based on path redundancy, where there is more than one link interconnecting two nodes. Such redundancy can be used in two different manners. In the first case, there is a primary link to forward messages to another node, and back-up links to use in case the primary has experienced a loss. In the second case, several copies of a given message are sent using the redundant links. Since the first one is simply a form of distributed ARQ scheme, we have focused on the second kind of path redundancy, which can be realized by different approaches, depending on the organization of the nodes in a mesh or a tree. Mesh routing can implement path redundancy by choosing more than one neighbor node to communicate with, or by using swarming content delivery that combines push content reporting and pull content requesting [14]. On the other hand, in tree-based multicast services there are three alternative approaches [15]:

1) Cross-link: it consists of connecting random peers via extra cross-cutting links;

2) In-tree: extra links among different “families” of nodes are established in addition to the one between the parent and its children;

3) Multi-trees: multiple disjoint trees, namely a forest, are formed among the members of a multicast group.

Figure 2 provides a graphical representation of the outcome of our study of the achievable quality with respect to scalability, reliability and timeliness. In general, we can characterize the previous strategies as proactive or reactive. The former (indicated by purple boxes in the figure) actively send out redundant packets by using spatial redundancy, which helps nodes recover dropped packets as soon as the drop is detected, without requiring any external recovery action. Concrete examples are FEC schemes, which reconstruct lost information by using the received one. On the other hand, reactive approaches (indicated by green boxes in the figure) adopt temporal redundancy: a node that detects a loss is unable to resolve it on its own and asks the other nodes for the lost packets. Practical examples are ARQ schemes, where the help provided by the other nodes consists of retransmissions. From a close scrutiny of Figure 2, we can notice that no approach exhibits the highest level in all three quality aspects. If we consider only reliability and timeliness (respectively, lines in red and yellow), we notice that there is a clear separation between proactive and reactive approaches. Proactive approaches exhibit a low and more stable recovery latency, since losses are resolved at detection time with information already held by the node; however, the amount of redundancy needed to recover messages under any kind of failure load is hard to tune perfectly, so the probability of failing to provide perfect reliability is not negligible. On the other hand, reactive approaches are easy to implement and guarantee strong reliability properties due to the closed-loop control between receivers and sources; however, they also present serious performance fluctuations when failures occur. In fact, the delivery time is directly related to the number of retransmissions needed to successfully recover a dropped message, which depends on the network dynamics, known to be unpredictable and highly variable over time [16]. With respect to scalability, distributed approaches obtain the highest score, with the exception of LER, which requires a detailed knowledge of the overall system topology, which is impractical for large-scale systems.

Figure 2: Analysis of the quality of loss-tolerance for multicast services in terms of scalability, reliability and timeliness.

Since none of the investigated approaches is jointly reliable and timely, we propose to apply a combination of proactive and reactive strategies to take advantage of the best of both worlds, in a similar way as done for transport-level multicast by means of hybrid ARQ and FEC schemes [17]. Specifically, we have considered the use of gossiping as the reactive approach, since it is the one exhibiting the highest quality, and FEC as the proactive one (we will focus on end-to-end FEC, leaving the application of more advanced approaches to future work).

Throughout the paper, we will consider three kinds of protocols: (i) gossip; (ii) coding; and (iii) gossip + coding.

III. NETWORK CODING: MOTIVATIONS

Network coding [18] is a technique that conveys the information content of $n$ original packets, $x_1, x_2, \ldots, x_n$, as a set of $n$ linearly independent combinations (encoded packets), and makes it easy to generate redundancy for virtually any lost packet by means of linear combinations of business data. Each linear combination is given by

$$y_j = \sum_{i=1}^{n} c_{j,i}\, x_i$$

where the coefficients $c_{j,i}$ are taken uniformly at random over $\{0, \ldots, q-1\}$, the all-zero case being excluded. All operations are performed over the Galois Field $GF(2^w)$, with $2^w = q$. Each coded packet $y_j$ is equipped with the coefficients $c_{j,1}, \ldots, c_{j,n}$ used to produce it, which destinations will use in the decoding phase.

Figure 3: Probability that an additional random chunk of data is useful as a function of the number of missed chunks, for n = 10. (Axes: number of missed chunks, 1 to 9; probability a packet is useful, 0.1 to 1. Curves: no coding and q = 2, 4, 8, 16, 32, 64, 128, 256.)
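The encoding step can be sketched as follows. This is only an illustration under stated assumptions: it fixes w = 8 (q = 256) and represents GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1, a choice the paper does not specify; the names `gf256_mul` and `encode` are hypothetical.

```python
import random

def gf256_mul(a, b):
    """Multiply two elements of GF(2^8), reducing modulo
    x^8 + x^4 + x^3 + x + 1 (0x11B); the polynomial is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^w) is XOR
        a <<= 1
        if a & 0x100:
            a ^= 0x11B           # reduce when the degree reaches 8
        b >>= 1
    return result

def encode(packets, rng):
    """Return (coefficients, payload) of one random linear combination
    of equally sized packets (bytes objects)."""
    n = len(packets)
    coeffs = [0] * n
    while not any(coeffs):       # exclude the all-zero coefficient vector
        coeffs = [rng.randrange(256) for _ in range(n)]
    payload = bytearray(len(packets[0]))
    for c, packet in zip(coeffs, packets):
        for pos, byte in enumerate(packet):
            payload[pos] ^= gf256_mul(c, byte)
    return coeffs, bytes(payload)

chunks = [b"\x01\x02", b"\x03\x04"]
coeffs, coded = encode(chunks, random.Random(0))
```

The coefficient vector is returned together with the payload because, as stated above, each coded packet must carry the coefficients used to produce it.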

Our proposed protocol exploits a distributed recovery mechanism that allows a destination node to pull the missed information content from other nodes. To get a better insight into the benefit of coding for this recovery operation, let us compare a recovery protocol that uses coding against a plain protocol.

First of all, the power of network coding with respect to plain data dissemination is particularly evident when a node introduces redundancy. Let us consider two different cases:

• a node sends $n + a$ encoded packets, with $a < n$, generated as previously described;

• a node sends $n + a$ plain packets, with $a < n$.

Now let us consider a receiver that receives just $n'$ out of the $n$ packets (i.e., $n - n'$ packets have been dropped by the network); in addition, it receives all the redundant packets, such that the total number of received packets is $l = n' + a \geq n$. In the case of coding, if the coefficient vectors are linearly independent, then any of the $a$ additional packets can be useful to fully reconstruct the information. On the contrary, without coding, the $a$ packets could be just copies of the $n'$ packets already received; in this case, the redundancy is not useful to reconstruct the whole information. This characteristic of network coding also allows a node to generate redundancy (i.e., linear combinations) even if it has received only partial information.
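The contrast between the two cases can be checked by computing the rank of the coefficient vectors a receiver holds: the information is decodable only when the rank reaches n. The sketch below assumes, purely for brevity, arithmetic modulo the prime 257 rather than the $GF(2^w)$ field used in the paper; `rank_mod_p` and the example vectors are hypothetical.

```python
P = 257  # prime modulus, an assumption for illustration (the paper uses GF(2^w))

def rank_mod_p(rows, p=P):
    """Rank of a matrix over the integers modulo a prime p,
    computed by Gauss-Jordan elimination."""
    rows = [list(r) for r in rows]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)   # modular inverse (Python 3.8+)
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                factor = rows[i][col]
                rows[i] = [(a - factor * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# n = 4 chunks; the receiver got only x1 and x2 (n' = 2) plus a = 2 redundant packets.
received = [[1, 0, 0, 0], [0, 1, 0, 0]]
plain_redundancy = [[1, 0, 0, 0], [0, 1, 0, 0]]   # copies of x1 and x2
coded_redundancy = [[1, 1, 1, 1], [1, 2, 3, 4]]   # two linear combinations

rank_plain = rank_mod_p(received + plain_redundancy)   # stays below n
rank_coded = rank_mod_p(received + coded_redundancy)   # reaches n
```

In this example the received chunks are plain and only the redundancy is coded, which matches the protocol of Section IV; the same rank test applies when every packet is a combination.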

Hence, the reason why coding is expected to greatly improve a gossip-based recovery protocol has its root in the following simple property. Consider the $n$-dimensional vector space $V$ over a Galois Field of order $q$, namely $GF(q)$. The number of nonzero elements in $V$ is $q^n - 1$, i.e., it is given by the number of ways we can multiply a basis of $V$ by $n$ coefficients, excluding the all-zero case. For the same reason, a $k$-dimensional subspace $V_k$ of $V$ has $q^k - 1$ nonzero elements. Hence the ratio between the number of elements in $V_k$ and the number of elements in $V$ is

$$\frac{q^k - 1}{q^n - 1} \approx \frac{1}{q^{n-k}}.$$

This is also the probability that a random vector of $V$ belongs to $V_k$.

Figure 4: Simplified ALM topology that directly connects the sender to all destinations. (Labels in the figure: Dissemination, Recovery, Plain Data, Coded Data.)

Now, consider the case when a single event $E$ is divided into $n$ chunks of data, which are transmitted to a set of destinations using different overlay channels. Suppose that, due to packet losses, a destination gets $k$ out of the $n$ packets, i.e., the destination has missed $m = n - k$ packets. To reconstruct the whole event, the destination starts to retrieve the $m$ missed packets through a simple gossip mechanism that consists of contacting another destination and pulling one random packet from it.

If the packets are sent without coding, and assuming the contacted destination experienced an independent packet loss pattern, the probability that the pulled packet is useful is clearly $m/n = 1 - k/n$. However, if the source node sends random linear combinations of the original packets, then, due to the aforementioned property, the probability that the pulled linear combination is independent of the already received ones is the probability that it does not belong to $V_k$, i.e., $\approx 1 - q^{-m}$. Figure 3 shows this probability as a function of $m$ for $n = 10$, in the no-coding case and in the coding case for several values of $q$.
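The two probabilities can be tabulated directly from the expressions above. The following sketch uses exact rational arithmetic; the function names are hypothetical, and the coded case uses the exact ratio $(q^k - 1)/(q^n - 1)$ rather than its approximation $q^{-m}$.

```python
from fractions import Fraction

def p_useful_plain(n, m):
    """Probability that a randomly pulled plain packet is one of the m missing ones."""
    return Fraction(m, n)

def p_useful_coded(n, m, q):
    """Probability that a random nonzero combination lies outside the
    k-dimensional subspace already held, with k = n - m."""
    k = n - m
    return 1 - Fraction(q**k - 1, q**n - 1)

n, q = 10, 8
table = [
    (m, float(p_useful_plain(n, m)), float(p_useful_coded(n, m, q)), 1 - q**-m)
    for m in range(1, n)
]
# Each row holds: m, plain probability, exact coded probability, approximation 1 - q^-m.
```

With a single missing chunk (m = 1) the plain probability is 1/10, while the coded probability is already slightly above 7/8 for q = 8, which is the gap Figure 3 illustrates.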

While in the no-coding case this probability increases linearly with the number of missed chunks $m$, under a coding scheme it approaches one exponentially fast in $m$. In addition, in the coding case the probability of retrieving the last chunk, which is the lowest, is still about $1 - q^{-1}$, and can thus be made very close to one by setting $q$ to some convenient value, e.g., $q = 1024$. It is also worth noticing that, under our independence assumption, the problem of retrieving the $m$ missed chunks without coding is the classical Coupon Collector's Problem [19]; hence, the expected number of gossip rounds required to retrieve all the chunks is $\Theta(m \log m)$, whereas with coding it is easy to see that the average number of rounds is $\Theta(m)$.
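The growth of the expected number of pulls can be illustrated numerically. The sketch below rests on a simplified model that is an assumption rather than the paper's analysis: the contacted node holds all the information, each pull is independent, and a pull succeeds with probability $j/n$ (plain) or $1 - q^{-j}$ (coded) when $j$ chunks are still missing. The number of pulls needed for each chunk is then geometric, with mean equal to the inverse of the success probability.

```python
def expected_rounds_plain(n, m):
    """Expected pulls to collect m missing chunks out of n without coding:
    with j chunks still missing a pull succeeds with probability j / n."""
    return sum(n / j for j in range(1, m + 1))

def expected_rounds_coded(m, q):
    """Expected pulls with coding: with j chunks still missing a pull is
    innovative with probability about 1 - q**-j."""
    return sum(1 / (1 - q ** -j) for j in range(1, m + 1))

n = 10
plain = [expected_rounds_plain(n, m) for m in range(1, n + 1)]
coded = [expected_rounds_coded(m, 256) for m in range(1, n + 1)]
```

In the coded case every term of the sum is close to one for large q, so the total stays close to m; in the plain case the terms for the last few missing chunks dominate.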

Finally, we remark that, due to the linearity of the operations, an encoded packet can also be obtained by linearly combining already encoded packets. This allows intermediate nodes in a multicasting tree to add redundancy by exploiting the partial information content received so far, thus reducing the end-to-end delay.
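This recoding property can be verified on a small example. The sketch again assumes arithmetic modulo the prime 257 instead of $GF(2^w)$, for brevity only; the helper `combine` and all the numeric values are hypothetical.

```python
P = 257  # prime modulus, an assumption for illustration (the paper uses GF(2^w))

def combine(vectors, weights, p=P):
    """Weighted sum of equal-length vectors, component-wise modulo p."""
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) % p
        for i in range(len(vectors[0]))
    ]

# Three original packets, each a list of symbols modulo P.
originals = [[10, 20, 30], [1, 2, 3], [7, 7, 7]]

# Two coded packets produced by the source, with their coefficient vectors.
coeffs_a, coeffs_b = [1, 2, 3], [4, 0, 5]
coded_a = combine(originals, coeffs_a)
coded_b = combine(originals, coeffs_b)

# An intermediate node recodes using only what it received, without decoding.
weights = [6, 9]
recoded_payload = combine([coded_a, coded_b], weights)
recoded_coeffs = combine([coeffs_a, coeffs_b], weights)
```

The recoded payload equals the combination of the originals under `recoded_coeffs`, so a destination can treat a recoded packet exactly like one generated by the source.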

Figure 5: Protocol schema. Note that at the second Complete operation (the one with *), if the data has not been completely recovered, another gossiping round has to be issued. (Diagram elements: publisher and subscriber applications; Event(i); Encoding; notifications Not(1) ... Not(N); LOSS; Decoding; Complete?; Gossip(i); STEP I; STEP II.)

IV. PROPOSED PROTOCOL

We consider the system model illustrated in Figure 4, where a multicaster is directly connected to several destinations. Such interconnection can be concretely implemented by means of an Application Level Multicast (ALM), by building a tree rooted at the multicaster, where each destination can be either a leaf node or an interior node within the tree.

The protocol we propose combines coding and gossiping for timely and reliable data dissemination in WAN. It consists of two steps, as depicted in Figure 5:

1) the sender application passes to the publisher the event (indicated in the figure as Event(i)) to be disseminated. The publisher fragments the event into K packets and sends them to all destinations as notifications (indicated in the figure as Not(i)). In particular, as illustrated in the figure, the publisher does not only forward the packets with the business data received from the application, but it also constructs N − K additional coded packets as linear combinations of the previous K packets, which represent the redundancy of the original message. This redundancy is also sent to all destinations, as shown in the figure.

2) at the receiver side, the subscriber collects all the notifications related to a given event, and then applies a decoding action to obtain the original event. If the original event has not been completely reconstructed (i.e., there are some missing packets that prevent the decoding of that event), then the subscriber triggers a gossiping round.

Figure 6: Comparison on reliability improvement between coding and no coding cases by varying the redundancy degree. (Axes: redundancy degree, 0 to 16; success rate, 0 to 1.)

Figure 7: Impact of the number of gossiping rounds on success rate in coding and no coding cases by varying the network conditions. (Axes: number of rounds, 0 to 5; success rate, 0 to 1. Curves: coding and no coding for PLR=0.2, ABL=2 and for PLR=0.3, ABL=3.)

Figure 8: Effect of fanout on success rate in coding and no coding cases. (Axes: number of rounds, 0 to 5; success rate, 0 to 1. Curves: coding and no coding for fanout 1, 2 and 3.)

In detail, the receiver tries to reconstruct the original event with the collected packets. If all the first K packets are received, then the information can be fully reconstructed; however, if some of these packets are lost, the received additional ones can be used to reconstruct the dropped ones. In general, if enough additional packets are received, then all the dropped business data can be successfully reconstructed and the original event is obtained. If the event is available, it is passed to the application; otherwise, we have a loss that coding failed to tolerate and gossiping has to be used, as illustrated in the lower part of Figure 5. There are different gossiping modes, as described in Section II; however, without loss of generality, we describe here a very simple one. Specifically, the node randomly selects another node, or even more than one, and asks it to retransmit the data necessary to obtain the original event; to this end, we assume the presence of a peer sampling service [20], [21] that a node uses to obtain a random sample from the set of all destinations. If the contacted node has the requested information (as illustrated in the figure), then it forwards the requested packets and their combinations; otherwise, it does not reply to the request (not illustrated in the figure). The gossiper then receives the additional packets provided by the contacted node and retries decoding by putting together these packets and the previous ones. Such a decoding attempt can return the original notification (as illustrated in the figure), which is passed to the application; otherwise, another gossip round has to be started (not illustrated in the figure).
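The recovery loop of the second step can be sketched as follows. To keep the example short, it assumes the smallest field, GF(2), where a coefficient vector is an integer bitmask and a linear combination is an XOR; the paper works over $GF(2^w)$ with larger q. Payloads are omitted, since decodability depends only on the coefficient vectors, and the peer sampling service is replaced by `rng.sample`. All function names are hypothetical.

```python
import random

def reduce_vec(basis, vec):
    """Reduce a GF(2) vector (bitmask) against the stored basis."""
    for b in basis:
        vec = min(vec, vec ^ b)
    return vec

def receive(basis, vec):
    """Store the vector only if it is innovative (linearly independent)."""
    vec = reduce_vec(basis, vec)
    if vec:
        basis.append(vec)

def random_combination(basis, rng):
    """A contacted peer answers with the XOR of a random non-empty
    subset of the independent vectors it holds."""
    out = 0
    while out == 0:
        out = 0
        for b in basis:
            if rng.random() < 0.5:
                out ^= b
    return out

def recover(k, own, peers, rng, fanout=1, max_rounds=50):
    """Run gossip rounds until `own` spans all k chunks.
    Returns the number of rounds used, or None if max_rounds is hit."""
    rounds = 0
    while len(own) < k:
        if rounds == max_rounds:
            return None
        rounds += 1
        for peer in rng.sample(peers, fanout):
            if peer:                      # a peer holding nothing stays silent
                receive(own, random_combination(peer, rng))
    return rounds

rng = random.Random(0)
subscriber = []
for vec in (0b0001, 0b0010):              # only 2 of K = 4 chunks arrived
    receive(subscriber, vec)
complete_peer = []
for vec in (0b0001, 0b0010, 0b0100, 0b1000):
    receive(complete_peer, vec)
rounds_used = recover(4, subscriber, [complete_peer], rng, max_rounds=200)
```

Decoding succeeds as soon as the stored basis has K elements, which is the "Complete?" test of Figure 5; each failed test triggers one more round.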

V. EXPERIMENTAL RESULTS

In this section, we show a preliminary experimental analysis performed through a custom time-driven simulator. We evaluate the success rate, i.e., the fraction of system nodes that fully recover a message, where the system size is 100 nodes. This value is the mean over 100 different publications, averaged over all nodes. The network failure model is defined in terms of: (i) Packet Loss Rate (PLR), i.e., the rate at which packets are dropped by the network; (ii) Average Burst Length (ABL), i.e., the average number of consecutive packets dropped by the network.
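The paper does not state which loss process the simulator uses to obtain a given PLR and ABL. One common way to generate bursty losses with these two parameters is a two-state Gilbert model, sketched below as an assumption: every packet sent in the bad state is dropped, the probability of leaving the bad state is 1/ABL, and the probability of entering it is chosen so that the stationary loss rate equals PLR. The function names are hypothetical.

```python
import random

def gilbert_losses(num_packets, plr, abl, rng):
    """Return a list of booleans (True = packet dropped) drawn from a
    two-state Gilbert model with loss rate `plr` and mean burst length `abl`."""
    p_bad_to_good = 1.0 / abl                            # mean burst length = 1 / p_bad_to_good
    p_good_to_bad = p_bad_to_good * plr / (1.0 - plr)    # stationary P(bad) = plr
    bad = rng.random() < plr                             # start in the stationary distribution
    losses = []
    for _ in range(num_packets):
        losses.append(bad)
        if bad:
            bad = rng.random() >= p_bad_to_good
        else:
            bad = rng.random() < p_good_to_bad
    return losses

def mean_burst_length(losses):
    """Average length of the runs of consecutive dropped packets."""
    bursts, current = [], 0
    for lost in losses:
        if lost:
            current += 1
        elif current:
            bursts.append(current)
            current = 0
    if current:
        bursts.append(current)
    return sum(bursts) / len(bursts) if bursts else 0.0

trace = gilbert_losses(200_000, plr=0.2, abl=2, rng=random.Random(1))
observed_plr = sum(trace) / len(trace)
observed_abl = mean_burst_length(trace)
```

For PLR = 0.2 and ABL = 2, the setting used in most of the experiments, the transition probabilities are 0.5 (bad to good) and 0.125 (good to bad).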

Figure 6 compares the success rate in the coding and no-coding cases by varying the number of redundant packets. PLR and ABL are set to 0.2 and 2, respectively. We can see how coding exponentially improves reliability as the number of redundant packets grows. This is a direct consequence of the high probability for a linear combination to be independent of all the others.

Figure 7, instead, shows how reliability increases in the presence of gossip, with the PLR varying in the set {0.2; 0.3} and ABL in the set {2; 3}. In the case of coding, 2 rounds are enough to fully recover the whole event. The figure thus confirms that coding is expected to improve the delay performance as a consequence of the smaller number of rounds required to get the missed packets in case of loss.

Finally, Figure 8 shows the effect of fan-out for PLR=0.2 and ABL=2. By increasing the fan-out, a node pulls a larger amount of data per round, thus making a round more effective.

VI. FINAL REMARKS

The inherent scalability provided by the publish/subscribe paradigm makes it appealing for the design of geographically distributed systems over wide area networks. However, the critical nature of LCCIs requires that all information reach the interested destinations within strict temporal constraints. Recently, researchers started to investigate how to improve reliability in publish/subscribe systems, but this improvement, often obtained through retransmission techniques, is typically gained at the expense of performance penalties.

A. Lessons learnt

In this paper, we provided a strategy that combines coding and gossiping for reliable and timely data distribution over WAN. We conducted a simulation-based analysis to evaluate the potential benefit of coding on delivery performance, even when the sender introduces redundant information to improve reliability. We have observed that the introduction of coding lowers the number of retransmissions required to achieve a successful delivery of a given notification. This improves the timeliness guarantees of the dissemination protocol and reduces the dependency of the dissemination latency on the experienced network behaviour.

B. Future work

We plan to perform a deep evaluation of our approach in terms of reliability and timeliness. To this aim, we are developing an analytical model of our approach and a behavioural model based on the OMNeT++ event simulator. We also plan to experimentally assess the different styles of gossiping with our approach by varying the network conditions and using network topologies similar to those deployed in the Internet. In addition, we are studying enhancements to our approach by introducing various adaptation means, i.e., customizing the redundancy between root and destinations and among destinations to the experienced loss patterns, or even introducing a biased gossip partner selection (destinations of a gossiping message are not randomly selected, but a proper selection is introduced to improve the success rate).

ACKNOWLEDGMENT

This work has been partially supported by the Italian Ministry for Education, University, and Research (MIUR) in the framework of the Project of National Research Interest (PRIN) “DOTS-LCCI: Dependable Off-The-Shelf based middleware systems for Large-scale Complex Critical Infrastructures”, and by the BLEND Eurostar European Project.

REFERENCES

[1] P. Pietzuch and J. Bacon, “Hermes: A Distributed Event-Based Middleware Architecture,” Proceedings of 22nd International Conference on Distributed Computing Systems Workshops (ICDCSW ’02), pp. 611–618, July 2002.

[2] S. Bhola, R. Strom, S. Bagchi, Z. Yuanyuan, and J. Auerbach, “Exactly-once Delivery in a Content-based Publish-Subscribe System,” Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN 02), pp. 7–16, June 2002.

[3] R. Kazemzadeh and H.-A. Jacobsen, “Reliable and Highly Available Distributed Publish/Subscribe Service,” Proceedings of the 28th IEEE International Symposium on Reliable Distributed Systems (SRDS 09), pp. 41–50, September 2009.

[4] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, Y. Ganjali, and C. Diot, “Characterization of Failures in an Operational IP Backbone Network,” IEEE/ACM Transactions on Networking (TON), vol. 16, no. 4, pp. 749–762, August 2008.

[5] C. Esposito, D. Cotroneo, and A. Gokhale, “Reliable Publish/Subscribe Middleware for Time-sensitive Internet-scale Applications,” Proceedings of the 3rd ACM International Conference on Distributed Event-Based Systems (DEBS 09), July 2009.

[6] C. Esposito, “Reliable Event Dissemination for Time-Sensible Applications over Wide-Area Networks,” PhD Thesis, available at www.mobilab.unina.it/tesiDottorato.html, December 2009.

[7] X. Jin, W. P. K. Yiu, and S. H. G. Chan, “Loss Recovery in Application-Layer Multicast,” IEEE MultiMedia, vol. 15, no. 1, pp. 18–27, January 2008.

[8] W.-P. K. Yiu, K.-F. S. Wong, S.-H. G. Chan, W.-C. Wong, Q. Zhang, W.-W. Zhu, and Y.-Q. Zhang, “Lateral Error Correction for Media Streaming in Application-Level Multicast,” IEEE/ACM Transactions on Multimedia (T-MM), vol. 8, no. 2, pp. 219–232, April 2006.

[9] G. Tan, S. A. Jarvis, and D. P. Spooner, “Improving the Fault Resilience of Overlay Multicast for Media Streaming,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 18, no. 6, pp. 721–734, June 2007.

[10] A.-M. Kermarrec, L. Massoulie, and A. J. Ganesh, “Probabilistic Reliable Dissemination in Large-Scale Systems,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 14, no. 2, pp. 1–11, February 2003.

[11] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, “Epidemic algorithms for replicated database maintenance,” in Proceedings of the sixth annual ACM Symposium on Principles of distributed computing. ACM, 1987, pp. 1–12.

[12] M. Ghaderi, D. Towsley, and J. Kurose, “Reliability Gain of Network Coding in Lossy Wireless Networks,” Proceedings of the 27th Conference on Computer Communications (INFOCOM 08), pp. 2171–2179, April 2008.

[13] M. Wu, S. S. Karande, and H. Radha, “Network-embedded FEC for Optimum Throughput of Multicast Packet Video,” Journal on Signal Processing: Image Communication, vol. 20, no. 8, pp. 728–742, September 2005.

[14] N. Magharei and R. Rejaie, “PRIME: Peer-to-Peer Receiver-driven Mesh-based Streaming,” Proceedings of the 26th IEEE International Conference on Computer Communications (INFOCOM 07), pp. 1415–1423, May 2007.

[15] S. Birrer and F. Bustamante, “A Comparison of Resilient Overlay Multicast Approaches,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 25, no. 9, pp. 1695–1705, December 2007.

[16] V. Paxson, “End-to-End Routing Behavior in the Internet,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 5, pp. 41–56, October 2006.

[17] J. Nonnenmacher, E. W. Biersack, and D. Towsley, “Parity-Based Loss Recovery for Reliable Multicast Transmission,” IEEE/ACM Transactions on Networking (TON), vol. 6, no. 4, pp. 349–361, August 1998.

[18] C. Fragouli, J. Le Boudec, and J. Widmer, “Network coding: an instant primer,” Computer Communication Review, vol. 36, no. 1, p. 63, 2006.

[19] P. Flajolet, D. Gardy, and L. Thimonier, “Birthday paradox, coupon collectors, caching algorithms and self-organizing search,” Journal Discrete Applied Mathematics, vol. 39, no. 3, November 1992.

[20] M. Jelasity, S. Voulgaris, R. Guerraoui, A. Kermarrec, and M. Van Steen, “Gossip-based peer sampling,” ACM Transactions on Computer Systems, vol. 25, no. 3, p. 8, 2007.

[21] L. Massoulie, E. Le Merrer, A. Kermarrec, and A. Ganesh, “Peer counting and sampling in overlay networks: random walk methods,” in Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing. ACM, 2006, pp. 123–132.