Exactly-Once Semantics in a Replicated Messaging System

Upload: deepankar-patra

Post on 02-Apr-2018


  • 7/27/2019 Exactly-Once Semantics in a Replicated Messaging System


    Exactly-once semantics in a replicated messaging system

    Abstract

    A distributed message delivery system can use replication to improve performance and availability. However, without safeguards, replicated messages may be delivered to a mobile device more than once, making the device's user repeat actions (e.g., making unnecessary phone calls, firing weapons repeatedly). In this paper we address the problem of exactly-once delivery to mobile clients when messages are replicated. We define exactly-once semantics and propose algorithms to guarantee it. We also propose and define a relaxed version of exactly-once semantics which is appropriate for limited-capability mobile devices. We study the relative performance of our algorithms compared to weaker at-least-once semantics, and find that the performance overhead of exactly-once can be minimized in most cases by careful design of the system.

    1 Introduction

    In a replicated messaging system, messages are replicated at various servers, pending delivery to mobile clients. When the client makes a connection to the network, it contacts any of the servers and downloads its pending messages. If messages were not replicated, the client could not get its messages when its only server was down. Furthermore, the connection to its server might be slow, making it impossible to download all the messages in a reasonable time. Both of these problems are solved with replication: the client has a choice of servers, and can contact one that is nearby at the time of connection and that offers good response time.

    Message replication is useful in critical applications where it is important to deliver messages as soon as a client makes any connection. For instance, in a military scenario, it is critical to get messages to units in the field, no matter at what point they connect, and no matter how short their connection is. Even for non-critical applications, replication can be very useful. For example, when an American traveler visits Europe, it would be much more convenient to access his or her emails through a local server, rather than relying on a slow trans-Atlantic connection.

    Replication, of course, introduces the problem of duplicate delivery. A client can connect to one server and download message M1, and later connect to a different server and receive M1 a second time. Duplicate delivery can range from disastrous to simply annoying. For example, a message "fire missile" or "buy stock" may cause an action to be unintentionally performed twice, with clearly undesirable consequences. In other cases, a duplicate message to "call Fred back" can be simply confusing, since the recipient does not know whether Fred wants to talk to him or her a second time.

    In this paper we study mechanisms for ensuring that messages are delivered exactly once (both non-delivery and duplicate delivery must be ruled out). While this problem has been extensively studied in the past (see our related work section), we believe it is important to revisit the problem in the context of a replicated message system for mobile clients. In particular, we will

    • Study how servers with replicas can coordinate to eliminate or reduce the likelihood of duplicate delivery.

    • Explore duplicate elimination mechanisms that may be appropriate for clients with no or limited stable storage.

    • Study weaker exactly-once semantics that may be useful in a mobile environment.

    • Present an evaluation metric for comparing exactly-once mechanisms, useful for highlighting their weaknesses or strengths in a mobile environment.

    We stress that we are not discovering new mechanisms for exactly-once delivery. All of the ideas we present (e.g., sequence numbers to identify duplicates) are well known. Our goal is instead to adapt these well-known building blocks into mechanisms that are well suited for a mobile environment, and to evaluate the tradeoffs.

    We also stress that not all applications require strict exactly-once semantics. However, if we can find a mechanism that guarantees exactly-once delivery at a reasonable performance and complexity cost, then


    we may still want to use it. Thus, we believe that regardless of the application, it is important to understand how exactly-once delivery can be achieved with replicated messages, so that an informed decision can be made regarding message delivery options.

    Finally, note that some applications may not require exactly-once delivery because they implement their own safeguards. For example, the message to fire a missile may specify which missile to fire, so that a duplicate message will not cause a second firing. However, these applications' mechanisms (e.g., missile numbers, application sequence numbers) are identical to the mechanisms we describe here. Thus, whether exactly-once is guaranteed by the messaging system or by the application, the issues are the same, and the results we present here are relevant.

    The structure of our paper is as follows. We start in Section 2 by defining more precisely our framework and the notion of exactly-once delivery. The algorithms for guaranteeing exactly-once semantics are given in Section 3. Section 4 describes our performance model, while Section 5 presents selected comparisons of the strategies. We discuss related work in Section 6 and conclude in Section 7.

    2 Delivery semantics

    Figure 1 illustrates a replicated messaging system. Servers are located at various locations around the globe. The servers are connected by a fixed wide area network, e.g., the Internet. These servers are responsible for storing and delivering messages to clients. We call these messages notes to distinguish them from other network traffic that does not require delivery guarantees (e.g., exactly-once semantics). When a new note is generated, it is replicated to all servers for delivery to the client. The replication happens under the guidance of a separate replication protocol, which we do not consider in detail here.

    A client, which is typically a wireless information access device like a PDA, stays disconnected from the server network most of the time. It occasionally establishes a wireless connection to a network access point, which then lets it communicate with one of the servers to get its notes. Because client connections can be short-lived, or may not provide good connectivity to all servers, it is important to replicate notes at multiple servers. This way, when a connection is made, note delivery can be made from the server that offers the best service. However, without any safeguards, a note can be delivered more than once to a client by different servers.

    Figure 1: Replicated message delivery system.

    The semantics of a note specify the required delivery properties, e.g., whether duplicates are allowed, or whether the note needs to be delivered by a deadline. In this paper, we focus on the following semantics:

    Exactly-once: a note should eventually be delivered to its target once, and only once.

    As mentioned earlier, notes that lead to irrevocable changes in the physical world often need exactly-once guarantees. We focus on notes destined for a mobile, wireless device, for instance, asking someone on the shipping floor to send some goods, instructing a soldier in the field to initiate some action, asking a traveling manager to change the asking price for an acquisition, or telling a nurse in the emergency department to administer some drugs. The mobile device may have limited or no storage capacity to remember what notes have been received in the past, making it more challenging to guarantee exactly-once delivery. Also, as we will see, some mechanisms for duplicate elimination work better than others when the client is mobile and is only connected for short periods.

    There are other possible delivery semantics, some of which are limited notions of exactly-once. For example, with at-least-once semantics, a note should be delivered at least once to the target, and duplicate delivery is not a concern. (For instance, a note stating that the building is on fire may require at-least-once semantics.) As another example, limited-lifetime semantics may indicate that duplicate deliveries are not allowed up to a maximum note lifetime


    (expiration time), but that beyond that there is no guarantee (presumably because the note is of no significance after the expiration time). Although we do not study these limited semantics here, we believe that solutions for those semantics can be obtained by relaxing the exactly-once solutions. Thus, the exactly-once algorithms we will present here can form the basis for many other algorithms.

    3 Exactly-once algorithms

    In this section we describe algorithms that guarantee exactly-once delivery semantics. Additional implementation details (pseudo-code) for selected algorithms are given in Appendix A. The algorithms are divided into two major groups: those (Section 3.2) that guarantee strict exactly-once semantics, and those that guarantee a relaxed version which will be defined in Section 3.3. The former group assumes that the clients have some stable storage, while the latter group does not have such a restriction.

    Our description of the algorithms assumes N servers and one client for simplicity. The algorithms can easily be extended to service multiple clients.

    3.1 Delivery procedures

    Algorithms that guarantee exactly-once delivery use a procedure Deliver(T, m) to send a message m to target node T (server or client). Deliver(T, m) is guaranteed to return by a certain predefined time. If the call is successful, then T executed a matching Receive call and successfully received m, and the sender received an acknowledgment confirming receipt of m. (Note that T could have received m more than once.) If Deliver(T, m) fails, the sender was unable to receive the acknowledgment from T before timing out. Note that it is possible for a call to Deliver(T, m) to time out while the matching Receive(m) completes successfully, because the acknowledgment was not properly received by the sender.

    3.2 Strict exactly-once algorithms

    The strict exactly-once algorithms must cope with the different failure modes of Deliver(T, m). The key in all algorithms is to assign some form of identifier (id) to each note, so that duplicate or missing notes can be handled. The algorithms vary in how they identify notes, how note ids can be recycled, and how long used note ids have to be remembered by servers and clients. These small details have significant impact on the potential performance and the usability of the algorithms.

    Strategy 1 (Sequenced Streams) Client remembers the latest sequence number from each server. Sequence numbers are taken from an infinite domain.

    Our first algorithm resembles the well-known sliding window protocol [5] with unbounded sequence numbers. However, our algorithm is a multiple-stream version of that protocol with a sliding window width of one.

    When a new note m is generated, it is first stored in any one of the servers, for example, server B. Each server keeps a sequence number counter, which is incremented for each new note generated at that server. The globally unique id of note m is the unique node id of server B prepended to m's sequence number at B, e.g., B.5. After this unique id is assigned, m will be replicated to all the other servers, where it will have the same id.

    In this algorithm, servers deliver (using procedure Deliver) all notes with the same server id sequentially. In other words, the notes are divided into N separate streams, by server id. To enforce sequentiality, the client c keeps an array R of sequence numbers, one for each server. Array R is kept in stable storage so it survives client failures. The client will only accept note B.n if R[B] + 1 = n. If the note is accepted, R[B] is incremented; otherwise note B.n is discarded. Notice that a note is not considered delivered (as far as exactly-once semantics is concerned) until it is accepted and processed by the client. Also note that the acceptance of a note and the corresponding counter increment should be an atomic operation at the client.
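    The acceptance rule above can be sketched as follows (a minimal illustration in Python; the function and variable names are ours, not the paper's, and sequence numbers are assumed to start at 1):

```python
def accept_note(R, server_id, seq, process):
    """Accept note server_id.seq only if it is the next one in that
    server's stream; otherwise discard it as a duplicate or a gap.

    R maps server ids to the latest accepted sequence number. On a real
    client, R lives in stable storage, and processing the note plus the
    counter increment must be a single atomic step."""
    if R.get(server_id, 0) + 1 == seq:
        process()               # hand the note to the application
        R[server_id] = seq      # remember the latest sequence number
        return True
    return False                # duplicate or out-of-order: discard
```

    Re-delivering the same note is then harmless: the second attempt fails the `R[B] + 1 = n` test and is silently dropped.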

    Each server A maintains a list K of to-be-delivered notes, sorted by increasing sequence number for each stream. A note m can be removed from K after it is successfully delivered to the client. Furthermore, the servers periodically synchronize with each other to purge notes that have been delivered. (See Appendix A for more details.)

    Example 1 The system consists of two servers, A and B, plus a client c. Note m, with id A.3, is waiting to be delivered on both A and B. In other words, KA = KB = {A.3}. Client c first connects to server A and gets m. At this point R[A] = 3. If A's call to Deliver(c, m) returns success, A will remove A.3 from KA. A may also synchronize with B to inform it of m's delivery. This synchronization can be done immediately after m's delivery, or periodically.

    Later when c connects to either A or B, the server may attempt to send note m again, perhaps because


    A's call to Deliver failed, or because server synchronization has not yet been performed between A and B. But c will not process m again, because m's sequence number (3) is not what is currently expected (R[A] + 1 = 3 + 1 = 4).

    It is easy to see why this algorithm guarantees exactly-once delivery. Because notes in a stream are delivered sequentially, and the client remembers the most recent sequence number, in effect the client remembers all the notes it has ever received in the past. Therefore, duplicate delivery can never happen. A note m will eventually be delivered to T because T will eventually request notes from a server, and that server will have m (servers only delete a note after the note has been delivered successfully to the client).

    Strategy 1 assumes that sequence numbers never wrap around. One can either assume that counters are large enough so that wrap-around is never a problem (just like Y2K wrap-around was never supposed to happen), or one can extend the algorithm as we describe next.

    Strategy 2 (Sequenced Wrap-around Streams) Client remembers the latest sequence number from each server. Sequence numbers wrap.

    The problem with reusing a particular sequence number is that a message bearing the first use of such a number may get delayed in the network and resurface during the number's second use, causing re-delivery of the original message. This problem has been studied extensively in the networking literature [15, 3, 12]. The solutions proposed are primarily based on making sure that enough time has elapsed between possible reuses of the same sequence number. In other words, let Te denote the maximum lifetime of a message (for example, Te can be derived from an IP packet's time-to-live field). Then a sequence number cannot be reused until at least Te time has passed since a message bearing this sequence number was sent out.

    In our context, the problem is slightly more complex because a note may linger not just in the network, but also at some other server. This complication is best illustrated by an example.

    Example 2 In a two-server (A and B) system, the client c gets note m1 (id = A.5) from A. After time Te, server A thinks it is safe to reuse sequence number 5 because any previous messages delivering m1 will have expired in the network by now. Hence a new note m2 is generated on A, also with id A.5.

    Let's assume that at this time R[A] = 4 on the client. Moreover, assume that, because of some difficulty in server synchronization, server B is still not aware that m1 has been delivered by A. If c connects to B at this time, B will attempt to deliver m1 to c again. Because A.5 is indeed the expected id, c will accept m1, thus resulting in duplicate delivery. Moreover, when c connects to A later, it will not accept m2 because it thinks that it has seen the sequence number before, thus resulting in m2's non-delivery.

    To avoid problems, we cannot reuse id x.s until all servers have removed x.s. Furthermore, we must also ensure that the following id, x.(s + 1), is not in use either! To see why, suppose in our example that A.5 is reused when B has removed A.5 but not some message m3 with id A.6. When the client receives the new m2 with A.5 there is no problem, but R[A] is advanced to 5. This makes it possible for c to receive the old m3 again, instead of a new A.6 that follows m2. Thus, the following condition is the one that guarantees the correct reuse of sequence numbers in our replicated delivery system.

    Condition 1 An id x.s can only be reused (i.e., a new note generated on server x with s as its sequence number) Te time after server x has received confirmation from all other servers that note x.(s + 1) has been removed from their respective K lists.

    To implement this condition, when server x synchronizes with server y, y will reply with the current state of its K list. If the reply says that y has already removed note x.(s + 1) from K, x has effectively received a promise from y that it will not attempt to deliver note x.(s + 1) in the future.

    It is easy to see how this synchronization avoids the problem illustrated in Example 2. If server A does not know that m1 with id A.5 has been removed from B, then it will not be able to reuse number 5. Only after A knows that m1 does not exist at any server will it start counting the required Te seconds to make sure m1 is also gone from the network.
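    Condition 1's reuse check can be sketched as follows (an illustrative Python fragment with names of our choosing; it assumes each peer's confirmation that x.(s + 1) was removed carries a timestamp):

```python
def can_reuse_id(confirmed_removal_times, Te, now):
    """Sketch of Condition 1 (names are illustrative, not the paper's).

    confirmed_removal_times maps each *other* server to the time at which
    it confirmed removing note x.(s + 1) from its K list, or None if no
    confirmation has arrived yet. Id x.s may be reused only once Te has
    elapsed since the last such confirmation."""
    times = list(confirmed_removal_times.values())
    if not times or any(t is None for t in times):
        return False    # some server might still deliver the old note
    return now - max(times) >= Te
```

    A server would run a check like this before assigning a wrapped sequence number to a new note.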

    Although Strategy 2 guarantees exactly-once semantics, it still has the following potential shortcomings:

    1. The client needs a minimum amount of stable storage (N times the size of a sequence number counter) to store its R array. The strategy will not work if, for example, the client can only remember one sequence number in its limited stable storage space.

    2. It requires that notes with the same server id be delivered in strict sequential order. The problem can be alleviated if the client is willing to remember k sequence numbers per server id. This


    allows up to k consecutive notes in a stream to be delivered out of order. Unfortunately, the solution further increases the storage requirement on the client k-fold.

    3. Since the size of R is proportional to the total number of servers (N), the strategy does not scale well if there are potentially many servers in the system. As an example, assume a 16-bit integer is used for server ids, but not all possible ids are in use at the same time. The client either needs to remember 2^16 sequence numbers, or it must be informed whenever servers join or leave the system, which significantly complicates the dynamic server reconfiguration protocols.

    4. The strategy also does not scale well with many clients. Because not every note is destined for all the clients, a note must have one sequence number per client in order to keep the sequence numbers consecutive. To cope, a server must now have as many sequence number counters as there are clients, and the protocol is significantly more expensive as a result.

    The following strategy is designed to address the above problems, although it has its own drawbacks, as we will see in the performance section.

    Strategy 3 (Id List) Client remembers the individual note ids it has received.

    For this strategy we simply assume that notes have globally unique ids, but the ids need not form sequenced streams. For example, each note may be assigned a server/date/time id. The client keeps in stable storage an explicit id list R of the individual notes it has received. For each new note received, the client first checks to see if its note id appears in R. If so, the client simply ignores the note; otherwise, it processes the note and inserts the note id into R. As before, processing the note and updating the data structure that records the processing must be an atomic operation.
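    The client-side logic of Strategy 3 can be sketched as follows (an illustrative Python fragment; the names and the behavior when R is full are our assumptions, following the overflow handling described later in this section):

```python
def receive_note(R, note_id, process, q):
    """Strategy 3 client sketch (illustrative names, not the paper's).

    R is the stable-storage id set with q slots. A note is processed only
    if its id is unseen; on a real client, checking, processing, and
    recording the id must be a single atomic step. Returns True if
    processed, False for a duplicate, None when R is full (the client
    withholds the acknowledgment until a purge frees a slot)."""
    if note_id in R:
        return False                 # duplicate: silently ignore
    if len(R) >= q:
        return None                  # R full: refuse the note for now
    process()
    R.add(note_id)
    return True
```
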

    If the client had infinite storage space for R, such a strategy would trivially guard against duplicate delivery, since the client would be able to remember the ids of all the notes it has ever received. If the client has only a fixed number of slots, q, to store R, we need to purge entries from R while still guaranteeing exactly-once semantics.

    The client can purge a note id m only after it is certain that no server will ever try to deliver m in the future, and only when it is sure that m is not lingering in the network itself. For the latter, we assume a maximum network message lifetime, and use it as in Strategy 2 (Sequenced Wrap-around Streams). Here we only focus on ensuring that m is not re-delivered by a server.

    A note m can be in one of three possible delivery states on any server:

    1. Undelivered: the server thinks that m has not been delivered to its target yet;

    2. Delivered: the server knows that m has been delivered, but it cannot yet forget about m completely because other servers might not know about this delivery yet;

    3. Purged: the server knows that all other servers are aware that m has been delivered.

    To keep track, each server keeps the following two data structures:

    K: list of notes that have not been purged, including both Undelivered and Delivered notes;

    D: the subset of notes in K that are Delivered.

    When a note m is created, it is entered into list K on every server. When a server B delivers m (successful completion of Deliver(c, m)), m is put into list D on that server. Periodically, or immediately upon m's delivery, server B will attempt to synchronize its state with the other servers by exchanging the contents of their K and D lists. During the synchronization, the other servers will realize that m has been delivered, and thus will put m into their respective D lists as well. Moreover, once B confirms that m has been put into every server's D, it can authorize the client to purge m from its R list. The servers can also purge m, by removing m from both K and D at the same time.

    Example 3 We have two servers A and B plus one client c, as before. Suppose that two notes, m1 and m2, are waiting to be delivered. Client c first connects to server A and gets note m1. At this point, the data structures have the following values:

    KA = KB = {m1, m2}
    DA = {m1}
    DB = ∅
    Rc = {m1}

    At this point (or periodically), A tries to synchronize with B. However, let us assume that c initiates a connection to server B before A can successfully inform B about m1's delivery. In this case, B attempts


    to deliver note m1 again, but c will simply ignore it because m1 is already in Rc.

    Assume that B proceeds to deliver m2. The data structures are now as follows:

    KA = KB = {m1, m2}
    DA = {m1}
    DB = {m1, m2}
    Rc = {m1, m2}

    When B synchronizes with A, A will also add m2 to DA. Moreover, after B gets A's state information, B will realize that DA = DB = {m1, m2}. Hence it is safe to purge both m1 and m2. In particular, B will authorize c to remove these two note ids from Rc. They will also be purged from KB and DB, and from KA and DA as well during the next synchronization.

    To see why this algorithm guarantees exactly-once semantics, we observe that when c connects to a server B at any point after it has received note m, there are only two possible states:

    1. m is still in c's list R.

    2. m has been purged from R. Since c only purges m when a server tells it to, and a server can only authorize the purging after m has been put in D on all servers, m must be marked either Delivered or Purged on all servers.

    Because neither of the above cases will result in note m being delivered again to client c, we can see that this strategy indeed guards against duplicate delivery.

    When the client's R list fills up, the client must refuse to process further notes (perhaps by not sending acknowledgments to the server) until it receives a purging authorization, which will free up one or more R slots. Delays due to R overflow can have a significant performance penalty, especially if the client has very limited stable storage. The impact will be studied in detail in Section 5.

    3.3 Relaxed exactly-once algorithms

    A common theme of all the strategies presented so far is that the client has to carry stable state (R) across connections. Because in these algorithms the client ultimately decides whether a note should be processed, we name them client-centric approaches.

    In some cases, a client-centric approach is not appropriate because the mobile client has no stable storage, or has limited storage, insufficient for an array of counters (one per potential server) or for a full queue of ids. In other cases, the client may have sufficient stable storage, but we may not wish to rely on it because the handheld device is vulnerable to theft, physical damage, or loss of power.

    Therefore, in this subsection, we will develop a new suite of algorithms where servers assume the primary responsibility of remembering which notes have been delivered. The client need not remember state information, so the device it runs on can be safely turned off, or even replaced by a new device (e.g., if the previous one was stolen). When a client connects to a server, the server is responsible for guaranteeing that no duplicate note will be delivered. We will call these algorithms server-centric, to contrast with the algorithms in the previous subsection.1

    3.3.1 Atomic delivery

    It is not hard to see that exactly-once semantics cannot be achieved with procedure Deliver(T, m) of Section 3.1 when the client keeps no state. In particular, suppose that a Deliver call fails (times out) while the matching Receive returns success. If the client discards its state, all traces of that message's delivery will be gone from the system, potentially leading to duplicate delivery.

    Thus, server-centric algorithms cannot achieve strict exactly-once semantics. Instead, we will strive to achieve a weaker notion of correctness. Intuitively, we recognize that the handoff of a note from a server to a client represents an unavoidable window of vulnerability. That is, during this operation there will be some chance that the note is delivered without the server learning about it. If this happens, there is not much we can do. However, we would still like to have algorithms that do not fail in other ways. For example, if the delivering server A did find out that note m1 was received by the client, then it would be unacceptable for some other server B to try to deliver m1.

    Intuitively, our weaker notion of correctness says that the algorithm must guarantee exactly-once semantics, except for the unavoidable note-handoff problems during the window of vulnerability. We capture this notion by assuming an atomic delivery procedure, denoted by aDeliver(T, m). To define its properties, we first define a function δ(A, m).

    1Incidentally, notice that even though the device on which the client application runs does not rely on stable storage, exactly-once delivery is still desired. Even though the device itself will not keep a record of an accepted note, the note will affect the real world where the device runs, e.g., firing a missile. Accepting the same note again is unacceptable, even if the device does not know that the missile has been fired already.


    Function δ(A, m) represents the number of times that node A has received and processed the same message m.2 Assume that before the sender S calls the procedure aDeliver(T, m), we have δ(T, m) = 0.

    Then an atomic delivery procedure guarantees the following: when aDeliver(T, m) returns success, δ(T, m) = 1; and when aDeliver(T, m) returns error, δ(T, m) = 0. Moreover, in the former case, δ(T, m) remains 1 unless aDeliver(T, m) is called again, in which case the outcome will be undefined. In other words, it is up to our exactly-once algorithms to guarantee that aDeliver is never called again if the first call was successful. In the case where aDeliver(T, m) returns error, δ(T, m) will remain 0 until the next time aDeliver(T, m) is called. We further assume that the return value of aDeliver(T, m) is always logged in permanent storage. Consequently, even if the server crashes right at the moment when the function returns, it is still able to figure out later the outcome of the call from the log.

    In summary, a server-centric algorithm has to perform risky operations when it delivers a note to a client. The aDeliver procedure encapsulates this risky operation. By saying that aDeliver exists, we are conceptually saying that there is no risk, simply so we can focus on other potentially dangerous operations, ones that can be avoided with a good algorithm. In other words, our algorithms should not perform any more risky operations than absolutely necessary. We will say that an algorithm guarantees relaxed exactly-once semantics when it guarantees exactly-once delivery under the assumption that aDeliver exists.

    When we compare an algorithm that guarantees strict exactly-once semantics to one that offers relaxed semantics, we must keep in mind that the latter has some probability of failure. (This probability can be reduced by increasing the timeout period a server waits before declaring a delivery failure.) Of course, the relaxed algorithm has an important advantage in that it does not require stable storage.

    3.3.2 Algorithms

    The strategies presented in this section differ in the amount of participation required of the client. Notes can have any type of globally unique identifier.

    Strategy 4 (Server Synchronization) Client has no state.

    In this strategy the client is not required to remember anything when it is turned off. For a client with volatile memory, this may be the only option. The servers assume full responsibility for guaranteeing exactly-once semantics.

    2As before, the successful reception and the acceptance of a note need to be atomic.

    As before, each server keeps a list K of undelivered and delivered-but-not-yet-purged notes, and a list D of just the latter. Before a server can start delivering notes, it first gathers the list D from all of the other servers to merge into its own. This step ensures that the server is aware of all the notes that have previously been delivered to the client, so that it will not deliver them again.

    Exactly as in Strategy 3 (Id List), the servers periodically synchronize to reclaim space in their K and D lists. In particular, messages whose delivery is known to all servers are purged.

    Example 4 Server A delivers note m. Hence DA contains m. Later, when B wants to deliver notes, it obtains DA and merges it with DB. This will guarantee that any previously delivered note, m in particular, will be included in DB, and thus will not be delivered again.
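    The merge step of Strategy 4 can be sketched as follows (an illustrative Python fragment; the names and the dict-of-sets representation are ours, not the paper's):

```python
def start_session(server, peer_delivered_lists):
    """Strategy 4 sketch: before delivering, a server merges every other
    server's D list into its own, then offers only the notes that are
    not known to have been delivered already."""
    for d in peer_delivered_lists:
        server["D"] |= d              # learn of all previous deliveries
    return server["K"] - server["D"]  # notes still safe to deliver
```

    In Example 4's terms, B would call this with DA among the peer lists, so m is excluded from the notes B offers to c.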

    Strategy 5 (Delay Reconnect) Client does not reconnect within time Tr.

    In the two algorithms that follow, we use time to prevent duplicate deliveries. We assume a clock synchronization protocol, such as NTP [13]. We also assume a bounded maximum clock drift, which, if not zero, can be added to the appropriate parameters in our algorithms.

    Suppose that the client promises to refrain from connecting to any server until time Tr has passed since its last connection to any server.3 (How the selection of parameter Tr affects the overall performance of the strategy will be investigated in Section 5.) The reconnect delay can be achieved by the client remembering the timestamp of its last connection, or the client can use a timer that starts ticking at the end of a connection. Alternatively, the human user of the client device may be trusted to discipline him- or herself.

    When Tr has passed, and the client connects again,

    the new server checks to make sure that, for each ofthe other servers s, the most recent synchronizationmessage received from s is within Tr time. If not,the server will attempt to synchronize with the prob-lematic server(s) while letting the client wait. If allservers have synchronized within Tr time, then it is

    3To be more precise, Tr is measured from the end of lastconnection (when the client sent out the last message) to theinitiation of the next connection (when the client sends thefirst request message).


    safe to deliver notes to the client, because the notes delivered to the client during its most recent connections are known to the current server.

    To be more precise, each server maintains an array TS[s], containing the timestamp of the most recently received synchronization message from each server s. The servers periodically send these synchronization messages to each other, which include the originating server's list D, as in Strategy 4 (Server Synchronization).

    Example 5 Client c connects to server A and receives note m1. Now DA includes m1. After time Tr has passed, c makes another connection with server B. The first thing B does is to check when the last synchronization message from A was received.

    Case (i): if the synchronization was sent within Tr time, it must have been generated at server A when DA contained m1.4 Because B incorporates DA into its own list D, m1 must be in D when B starts delivery.

    Case (ii): if the A synchronization was more than Tr seconds old, B is not allowed to start delivery until it has contacted A and received an up-to-date DA, which should contain m1.

    Thus, because client connections must be separated by Tr time units, the union of all servers' synchronization messages less than Tr old will include all notes the client has received.

    Compared to Strategy 4 (Server Synchronization), this one has a better start-up time because, if all goes well, a server only needs to check local information before it is allowed to deliver notes. However, when there is a long communication blackout between two servers, this scheme blocks delivery as does Strategy 4. The overhead of the Delay Reconnect strategy is the additional requirement on the client not to connect too frequently.

    Another way of looking at this difference is that we have switched from a lazy propagation scheme to an eager one. In the Server Synchronization strategy, the list of delivered notes is propagated only when needed, while the client waits. In Strategy 5 (Delay Reconnect), the list of delivered notes is propagated eagerly by a server, potentially reducing the overhead when the client connects.

    Strategy 6 (Connect History) Client remembers its complete connection history (when and to which server it has connected before).

    4 Or, if DA didn't include m1, that means m1 must have been purged from A, which in turn implies that m1's delivery was known to all servers.

    Each server maintains a table TS of the most recent synchronization timestamps from the other servers, just as in Strategy 5 (Delay Reconnect). However, the client is no longer subject to the Tr reconnection time limit. Moreover, the servers are no longer required to synchronize with each other periodically, but only when there is new information.

    The client now needs to remember the last time it connected to each server. This is recorded in a history array H, where H(A) is the time when the last connection to server A ended. When the client makes a new connection, it first uploads its H array to the server. The server must ensure that it has received synchronization messages from each other server that are fresher than the client's last connect time.

    Example 6 Note m1 is replicated at servers A and B. Server A delivers m1 to c at time t1. At a later time t2, c connects to B, giving it the value H(A) = t1. Server B makes sure it has synchronized with A after time t1, thus ensuring that B knows about m1's delivery.
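    The history check performed by the current server can be sketched as follows; this is an illustrative sketch with hypothetical names (servers_to_contact, h, ts), not the paper's implementation.

    ```python
    # Sketch of the Strategy 6 (Connect History) check; names hypothetical.
    # h maps each server to the time the client's last connection to it
    # ended; ts maps each server to the timestamp of the most recent
    # synchronization message received from it.

    def servers_to_contact(h, ts):
        """Before delivering, the current server must hold a synchronization
        from every server s that is fresher than the client's last
        connection to s; return the servers still needing a sync."""
        return [s for s, last_conn in h.items()
                if ts.get(s, float("-inf")) <= last_conn]

    h = {"A": 50.0}            # client last left server A at time t1 = 50
    ts = {"A": 40.0}           # B's last sync from A predates t1
    print(servers_to_contact(h, ts))   # B must sync with A before delivering
    ```
    
    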

    Compared to Strategy 5 (Delay Reconnect), the advantages of the Connect History strategy are threefold. First, it eliminates the client restriction not to reconnect too soon. Second, it saves network bandwidth by avoiding unnecessary synchronizations. Last, unlike the Server Synchronization and Delay Reconnect strategies, Connect History will not block delivery because of an irrelevant network partition. For example, suppose the client first connects to server A and then to B, but it has never connected to C. If B cannot talk to C (but can talk to A), it can still go ahead and deliver notes to the client. All these advantages come at the price of the client carrying a little more information than it is required to in the Delay Reconnect strategy.

    It is important to note that, although some of the server-centric algorithms require the client to store some information (e.g., its connection history), there is a fundamental difference between this information and that of the client-centric algorithms. In the client-centric approaches, the client information ultimately determines whether a note delivery can result in duplicate delivery or not. In the server-centric schemes, the client information is used solely for improving performance rather than for correctness. In the event that the client loses its information, the servers can always perform a full synchronization among themselves, thereby still guaranteeing relaxed exactly-once semantics.


    4 Evaluation

    The relative merits and shortcomings of our algorithms can be evaluated along several dimensions:

    - Semantic guarantees. In particular, server-centric strategies provide a less strict delivery guarantee than the client-centric ones.

    - Reliability. For instance, how likely is it that a client connection may be broken by network problems?

    - Availability. For example, what is the likelihood that the client will be able to connect to a server and get notes when it wants to?

    - Performance. For example, what is the overall network traffic incurred, or how fast can notes be delivered to the client?

    We have studied several metrics that quantify the reliability, availability, or performance of the algorithms. Due to space limitations, in this paper we only present one metric, which we have found to be especially useful. Our metric is the expected throughput, defined as the expected number of notes that a client can successfully receive during a connection that lasts Tc seconds.

    We have chosen this metric because it captures a number of factors. The metric captures the synchronization overhead of the algorithms, because algorithms that require more synchronization before or while notes are delivered will have less overall throughput. The metric also captures the benefits of replicating notes at servers: as we add more servers, clients will find closer or better connected servers for download, achieving better delivery throughput. The expected throughput is also a good indirect measure of system reliability, because a higher throughput means that the client can get the same number of notes faster, and is therefore less prone to disconnects. Moreover, our metric focuses on the connection-time performance as observed by the client, which we feel is the most critical component of the overall system performance in a wireless communication scenario.

    Having selected a metric, the biggest challenge is finding a good system model. The model should be rich enough to capture the main tradeoffs we want to study, while not being so detailed that we are overwhelmed with parameters and components. Our goal is not to predict the exact performance of one algorithm versus another in a particular scenario. Rather, our goal is to clearly demonstrate and quantify the important features of the various strategies, so we strive for the simplest model that allows us to do so.

    Figure 2: System model.

    The global network is represented by a circle, as shown in Figure 2. Each node (e.g., a server or a mobile access point) of the network resides on the circle. The network distance between any two nodes, measured in network hops, is proportional to their distance on the circumference of the circle. For instance, if the circumference is 30 network hops, then the distance between two nodes directly across from each other is 15 hops. In Figure 2, Si is the ith server. The mobile client is represented by c, and the Network Access Point (NAP) is its entry into the global network. A wireless link (dotted line) connects c and the NAP.

    We choose a circular representation of the network because it allows us to easily model the fundamental characteristics of a server network. For example, the effect of placing more servers around the globe to increase access locality is reflected by the reduced distance to the nearest server from a random point on the circle. (Of course, a sphere may have been a more accurate model for the Earth's networks, but we do not believe we can gain more insights from such a substantially more complex model.)

    We assume that the servers are uniformly distributed on the circle. Consequently, the more servers there are, the shorter the distance between neighboring servers. For example, with 3 servers and 30 total hops, neighboring servers are separated by 30/3 = 10 hops. If the number of servers is increased to 10, that distance is only 30/10 = 3 hops.

    When a client c wants to get its notes, it first connects to a random access point (NAP) on the circle. In real life, this access point could be an ISP for mobile units, or a LAN segment that takes in the traveling client as a guest. From this access point, the client figures out where the nearest server is on the


    Figure 3: fd(t) for α = 0.05 and β = 0.1, 0.25, 0.5.

    circle (Si+1 in the figure), and connects to it. In an actual implementation, technologies such as dynamic DNS translation and service redirection [9, 4, 1] can be used to identify the nearest server. Once connected, one of the algorithms described in Section 3 will be used to guarantee exactly-once delivery of notes.
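    The geometry of the circular model can be made concrete with a short sketch. The function names (ring_distance, nearest_server) are hypothetical; the distance rule is simply the shorter arc between two positions on the circle.

    ```python
    # Sketch of the circular network model; names are hypothetical. Servers
    # sit at hop positions on a circle of a given circumference, and the
    # distance between two nodes is the shorter arc between them.

    def ring_distance(a, b, circumference):
        """Network distance in hops between positions a and b on the circle."""
        arc = abs(a - b) % circumference
        return min(arc, circumference - arc)

    def nearest_server(nap, servers, circumference):
        """Server closest to the client's network access point (NAP)."""
        return min(servers, key=lambda s: ring_distance(nap, s, circumference))

    # 3 servers uniformly spaced on a 30-hop circle: positions 0, 10, 20.
    servers = [0, 10, 20]
    print(ring_distance(0, 15, 30))        # nodes directly across: 15 hops
    print(nearest_server(14, servers, 30)) # a NAP at hop 14 is closest to 10
    ```
    
    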

    We model the characteristics of the fixed network by a distribution function fd(t) (Figure 3), which gives the probability that node A can deliver a message to another node B, distance d away on the circle, within time t. To simplify our analysis in this paper, we assume that all messages (including notes) are relatively short, so that the major component in message delay is the length-invariant part. Thus, for measuring delays we assume that all messages are of the same size. Function fd(t) is given by the following formula:

        fd(t) = 0                              if t < αd
        fd(t) = 1 − e^(−(t − αd)/(βd))         otherwise

    Here α and β are scaling constants. Intuitively, α gives the minimum time per hop it takes to deliver a message.5 Thus, α captures the maximum speed of the network subject to the physical limitations of the networking medium and the routing infrastructure. For example, α = 0.05 means that the shortest time it takes to deliver a message over one hop is 50 ms. However, due to unreliability and fluctuations in the network, not all messages are delivered in the minimum possible time. We assume that, after time αd, the probability of a message still not being delivered by time t decays exponentially with t. We use β to measure how fast this probability changes with t. In a sense, β reflects the faultiness of the network by measuring the percentage of messages that are delivered by a certain time period. For instance, β = 0.25 says that 1 − e^(−α/β) ≈ 18% of messages are delivered within twice the minimum time it takes to deliver such a message over the same distance. Examples of fd(t) with α = 0.05 and different values of β are given in Figure 3.

    5 The assumption that the minimum time is proportional to the network distance is approximately correct for relatively short messages.

    Therefore, our formulation of the function fd(t) is capable of representing a large variety of networks, from slow and lossy networks to fast and reliable ones. Since it is not tied to the values of a specific network, it allows us to vary the parameters and observe how the performance of the various algorithms changes as the underlying network characteristics change.

    Function f models message delivery in the fixed network (i.e., the circle). For the link between the client (c) and its network access point (NAP), we assume a constant message delay tcap. We do not model the faultiness of this link, because we are going to evaluate expected throughput only when the client is connected for Tc seconds. That is, our metric will measure how many notes can be delivered, given that the client has connected and remains connected for Tc seconds.

    Table 1 summarizes our model parameters. The table also lists the base values used for obtaining the results of Section 5. We experimented with many parameter values before settling on these as good initial values. As part of our experiments, we have of course studied the sensitivity of each parameter as it is varied from its base value.

    Although one can easily argue that our base values should be larger or smaller than what we have chosen, we believe that our base values are still quite representative. For example, it seems reasonable to say that a message can go around the world in 30 hops (D = 30). We have also selected numbers for α and β (α = 0.05, β = 0.25) that we believe are reasonable for a typical wide area network.

    5 Results

    Our expected throughput numbers are obtained analytically from our performance model. The derivations are not presented in the main body of the paper, but are given in Appendix B.

    Our performance numbers are plotted as the ratio of our exactly-once strategies over a base strategy, Strategy 0, that provides the weaker at-least-once semantics. Thus, if pi is the expected throughput of Strategy i, and p0 is that of Strategy 0, then we plot pi/p0. Concretely, when pi/p0 = 0.5, a client is only expected to receive half the notes it would receive under Strategy 0 during the same connection.


    Figure 5: p4/p0 versus Tc, for D = 10, 30, 50.

    If stable storage on the client is too expensive (or if we cannot trust that storage, as discussed earlier), then the server-centric strategies should be selected.

    5.2 Strategy Server Synchronization

    In Figure 5, we study the throughput performance of Strategy 4. The x-axis is now the connection duration, Tc. The three curves represent different values of D, the total length of the circular global network.

    For very short connections (small Tc), Strategy 4 incurs a very high overhead because an initial server synchronization phase is needed. No note can be delivered during this period, and the throughput is practically zero. However, once the setup is finished, delivery proceeds at the same rate as in Strategy 0. Hence, for longer connections, the effect of the setup period is amortized, and the performance ratio gradually approaches 1 as Tc increases.

    Another trend we observe from Figure 5 is that performance decreases with increasing D. This is because, as the network grows larger (more hops around the globe), the cost of a server synchronization is greater. Consequently, the initial setup phase takes longer, and its impact on overall throughput is more marked.

    In conclusion, Strategy 4 works well for longer connections or small networks. If neither of the above applies, Strategy 4 may still be selected because it does not rely on any client state information.

    5.3 Strategy Delay Reconnect

    Figure 6 illustrates the effect of the reconnect limit Tr on Strategy 5. It also shows α and β variations to demonstrate the impact of network characteristics. The middle three curves have a common β (0.25), but different αs. The outer two curves and the middle one share the same α (0.05), but differ in their β

    Figure 6: p5/p0 versus Tr. The curves show β = 0.1, 0.25, 0.5 with α = 0.05, and α = 0.01, 0.05, 0.1 with β = 0.25.

    values. We remind the reader that α measures the overall speed of the network while β reflects its faultiness.

    As a general principle, the longer the client is willing to wait in between connections (bigger Tr), the better throughput it will get. This is because a bigger Tr gives the servers a longer time to perform a successful server synchronization, thus reducing the possibility of needing a synchronization during client connection setup. At Tr = 0, the Delay Reconnect strategy degenerates into the Server Synchronization strategy, because the client can reconnect at any time. Performance then improves as Tr increases, to the point where the overhead is practically unnoticeable when the client is willing to wait at least one server synchronization period, Tu (which is 300 seconds in our setup). Intuitively, when Tr > Tu, at least one synchronization should have been attempted in between the client's two connections.

    Looking at the various curves in Figure 6, we see that as we move towards a slower (larger α) or a more faulty (larger β) network, the performance penalty becomes bigger. This is simply because in a less efficient network, server synchronizations take longer and are more likely to fail. Furthermore, we see from the figure that β has a relatively significant impact on performance.

    5.4 Strategy Connect History

    Figure 7 has the same axes as Figure 5. However, the curves in the figure represent different values of d, which measures the distance from the current server to the server of the previous connection. For example, d = 10 implies that the client's previous connection was made to a server that is 10 hops away.

    The graphs for Strategy 6 have the same general shape as those for Strategy 4 (Server Synchronization). In particular, the performance overhead is higher for shorter connections, but it drops as connection time gets longer. However, Strategy 6 is effective in reducing the overhead of Strategy 4. As evidence, we notice that the range of Tc plotted is much smaller in Figure 7 (0-5 sec) than in Figure 5 (0-50 sec).

    Figure 7: p6/p0 versus Tc, for d = 0, 5, 10, 15.

    If the client consistently connects to the same server (d = 0), then naturally Strategy 6 does not have any overhead (p6/p0 ≈ 1). As d increases, the performance overhead increases as well. The reasons are two-fold. First, as the distance between the two servers gets larger, there is a higher chance that a synchronization may be needed, because the current server has not been able to perform a successful synchronization since the previous connection. Second, such a synchronization is also likely to take longer because the previous server is farther away.

    Another parameter, not shown in the figure, which also has an impact on Strategy 6's performance is t, the time since the previous connection. A bigger t means that the servers have had a longer time since the last connection to perform a successful synchronization. Consequently, the performance is also better, because a makeup synchronization is less likely to be needed. To summarize, Strategy 6 works best when the client primarily makes infrequent connections (big t) to servers that are close together (small d).

    5.5 Comparison of client- and server-centric algorithms

    Lastly, we plot a client-centric algorithm (Strategy 3) and a server-centric one (Strategy 5) side by side to see how they scale with the size of the network (D). We have to keep in mind the fundamental differences between them. For example, one guarantees strict exactly-once semantics while the other only guarantees the relaxed version, and one requires stable storage on the client while the other does not. However, it is still informative to compare them side by side.

    Figure 8: Comparison of p5/p0 and p3/p0 versus D (p5/p0 for Tr = 0, 150, 300; p3/p0 for q = 1, 10, 20).

    In Figure 8, each strategy is represented by three curves, which give the performance overhead when the size of the global network (D) is changed. The three p3/p0 curves differ in the amount of stable storage (q) allowed on the client, while the p5/p0 ones differ in the amount of time the client is willing to wait between connections (Tr).

    The important lesson to take away from this figure is that both strategies can perform fairly well under favorable conditions, and both can perform badly otherwise. For Strategy Id List, it is important to have enough storage space on the client. When the storage is severely limited, e.g., q = 1, the performance overhead can be quite large even for small D. On the other hand, with enough storage, e.g., q = 20, the performance penalty can be avoided altogether. Similarly, the key factor for Strategy Delay Reconnect seems to be how long the client is committed to wait between connections (Tr).

    Although both strategies perform less well as D increases, the downward slope is actually due to very different reasons. In the case of Strategy 3, a bigger D means a longer time is needed to purge a note id. Consequently, if the client does not have enough R slots to avoid an overflow, it will eventually need to slow down to wait for the purging to catch up. On the other hand, a bigger D increases both the server synchronization time and the likelihood of needing a synchronization for Strategy 5. The last point may also account for the fact that the curves for Strategy 5 tend to drop more steeply than those for Strategy 3.


    5.6 Summary of strategies

    In Table 2 we compare the features of all the strategies. The comparison is divided into three sections. The first section categorizes the strategies based on whether they provide strict or relaxed exactly-once semantics. The second section inspects the system assumptions made by the algorithms. Specifically, the first row looks at whether a strategy requires stable storage on the client. For those that do, the next row gives whether the minimum size of that storage is proportional to the number of potential servers in the system. The third row indicates whether each note has to potentially have a different unique id for each client. The fourth row states whether there is an inherent order imposed on note delivery. Finally, the fifth row looks at whether the client is free to connect into the system at any time it wishes. The third section deals with algorithm performance and looks at which parameters have an impact on the expected throughput for the different strategies. The rows in this section should be self-explanatory.

    6 Related work

    This paper builds upon previous work on exactly-once semantics and on replication. However, our work is unique in that we consider all of the following factors: 1) wireless connectivity, 2) limited or lacking stable storage on the client, and 3) different notions of exactly-once.

    The networking literature has dealt extensively with the problem of exactly-once delivery of network packets. For example, the well-known sliding window protocol and its many variants [5, 15, 3] are used to provide delivery guarantees when the underlying network is faulty and unreliable. Our notion of exactly-once, however, is different. The window protocols worry about exactly-once delivery only during one TCP session, not across longer periods of time. The protocols start over when the client disconnects from the server and reconnects later. Consequently, the window protocols only have to deal with one delivering server, whereas our algorithms have to worry about state synchronization among several servers. Moreover, most window protocols rely on the client remembering unique ids. Some of our algorithms, namely the server-centric ones, can cope with situations where the client has no stable storage.

    The sequence number wrap-around problem has been studied thoroughly [12, 3] in the context of window protocols. However, as explained earlier in Section 3.2, the same problem is more involved in Strategy Sequenced Wrap-around Streams, because a server has to take into account multiple servers.

    Distributed systems research has studied exactly-once execution of remote procedure calls [2, 16]. Solutions also rely on the client's ability to remember unique ids. To the best of our knowledge, there has been no attempt to deal with exactly-once semantics

    with multiple servers or across multiple connections.

    The Usenet News infrastructure [11, 8] avoids duplicate delivery by requiring a news reader to always connect to the same news server. The system solves the duplication problem among the servers (when messages are first replicated) using globally unique ids. Naturally, the challenge posed by limited-capability wireless clients, which we address in this paper, does not come up in the Usenet system.

    There has been a great amount of work on replicated data management and replicated transactions. For example, the work on cache consistency [14, 6, 10] is mostly concerned with multiple clients seeing a consistent picture as the data are being changed. As another example, Gray et al. [7] look at transaction serializability in a replicated environment. Our work also deals with the problems resulting from replication. However, our work addresses a mobile scenario, where, for instance, delivery throughput should be maximized in order to increase the number of notes delivered before losing a connection. Limited client storage is also a critical issue. We have considered different exactly-once notions to accommodate mobile devices with no storage.

    Incidentally, the Server Synchronization strategy can be reformulated as a transactional update problem on message delivery status. Specifically, data replication algorithms can be used to make sure that each server sees an up-to-date version of the delivery state of each message. Although this approach may lead to additional algorithms, we believe that it will not cover the full range of solutions we have proposed here.

    7 Conclusion

    In this paper we have studied the message delivery problem in mission-critical mobile scenarios where exactly-once semantics is imperative. The algorithms we have proposed differ in the guarantees they make, in the assumptions they make about the system, in the requirements they place on the servers and the clients, and in their complexity and performance.

    We have developed a metric, expected note throughput, that is useful in comparing delivery algorithms. An algorithm with a high throughput can deliver more notes in a shorter period of time, in-


    Strategy                                                        1  2  3  4  5  6
    Strict exactly-once
    Relaxed exactly-once
    Requires client stable storage
    Client storage requirement proportional to number of servers
    One unique id per client for each note
    Requires in-order delivery
    Limit on client reconnects
    Performance sensitive to connection duration Tc
    Sensitive to pattern of past connections
    Sensitive to size of client storage
    Sensitive to frequency of periodic server synchronizations

    Table 2: Comparison of strategies. Strictly speaking, in Strategy Connect History, stable storage is needed for performance rather than required for correctness. Although Strategy Connect History requires the client to remember timestamps in the history array H, the client only has to remember servers that it ever connects to. Moreover, there can also be protocols to allow the client to purge old entries.

    creasing the chances of a successful download to a mobile device. We have found that the strategies differ in their performance, but every strategy can be tuned to minimize its overhead through careful selection of parameters. Overall, in many cases, the cost of exactly-once delivery is no higher than that of at-least-once delivery. Thus, it makes sense to implement exactly-once even if the application does not absolutely require it.

    References

    [1] Akamai Technologies, Inc. http://www.akamai.com.

    [2] K. P. Birman. Building Secure and Reliable Network Applications. Manning Publications Co., 1996.

    [3] G. M. Brown, M. G. Gouda, and R. E. Miller. Block acknowledgment: Redesigning the window protocol. IEEE Transactions on Communications, 39(4):524-532, 1991.

    [4] V. Cardellini, M. Colajanni, and P. S. Yu. Redirection algorithms for load sharing in distributed web-server systems. In The 19th IEEE International Conference on Distributed Computing Systems, 1999.

    [5] V. Cerf and R. Kahn. A protocol for packet network intercommunication. IEEE Transactions on Communications, 22:637-648, 1974.

    [6] C. G. Gray and D. R. Cheriton. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 202-210, 1989.

    [7] J. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In ACM SIGMOD Conference, pages 173-182, 1996.

    [8] T. Gschwind and M. Hauswirth. A cache architecture for modernizing the Usenet infrastructure. In Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences, pages 1-9, 1999.

    [9] E. Katz, M. Butler, and R. McGrath. A scalable HTTP server: The NCSA prototype. Computer Networks and ISDN Systems, 27:155-164, 1994.

    [10] R. Katz, S. Eggers, D. Wood, C. Perkins, and R. Sheldon. Implementing a cache consistency protocol. In Proceedings of the 12th International Symposium on Computer Architecture, pages 276-283, 1985.

    [11] D. Lawrence and H. Spencer. Managing USENET. O'Reilly & Associates, Incorporated, 1998.

    [12] W. S. Lloyd and P. Kearns. Bounding sequence numbers in distributed systems: A general approach. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 312-319, 1990.

    [13] D. L. Mills. Internet time synchronization: The network time protocol. IEEE Transactions on Communications, 39(10):1482-1493, 1991.

    [14] M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite network file system. ACM Transactions on Computer Systems, 6(1):134-154, 1988.

    [15] A. Shankar. Verified data-transfer protocols with variable flow control. ACM Transactions on Computer Systems, 7(3):281-316, 1989.

    [16] A. Vaysburd and S. Yajnik. Exactly-once end-to-end semantics in CORBA invocations across heterogeneous fault-tolerant ORBs. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems, pages 296-297, 1999.


    A Algorithm details

    The pseudo-code in this section uses the following conventions:

    - The code enclosed in braces following a Deliver or Receive procedure call is executed when that call returns timeout. If the call returns success, execution continues from the next statement.

    - If m denotes a message that is sent by node A, then m.sender contains the id of A, and m.body is the content of the message. In some algorithms, m.timestamp will contain the local timestamp when m was sent.

    - For algorithms that assume a ServerID.SeqNo note id format, if i is a note id, then i.server and i.seqno are its two parts, respectively.

    - A call to abort will terminate execution of the current procedure.

    Below we give the pseudo-code for the server and client connection routines of Strategy Sequenced Streams. The server contains a main message dispatching thread, which executes the following loop forever:

    loop forever {
        Receive(m); {continue; // we keep listening.}
        if (m.body == "Request my notes")
            fork(HandleClient(m.sender));
    }

    HandleClient(c) {
        foreach i in K {
            Deliver(c, "Here is note i: ..."); {abort;}
            K = K - {i};
        }
    }

When a client wants to get its notes, it connects to a server and then executes the following procedure:

    Deliver(s, "Request my notes"); {abort;}
    loop {
        Receive(m); {abort;}
        if (m.body == "Here is note i: ..." &&
            i.seqno == R[i.server] + 1) {
            ProcessMessage(m);
            R[i.server] = i.seqno;
            // remember that we have
            // received this message.
        }
    }

Remark: The last two statements in the client's code, which process a message and remember its receipt, need to be atomic.
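The sequence-number check above can be exercised in a short, runnable sketch (Python used for illustration; the `on_receive` helper and the message values are ours, not the paper's):

```python
# Sketch of the Sequenced Streams client-side duplicate check.
# A note id is (server, seqno); R[server] is the highest seqno processed
# so far from that server's stream. Names here are illustrative.
from collections import defaultdict

R = defaultdict(int)      # R[server] = last seqno processed (0 = none yet)
processed = []

def on_receive(server, seqno, body):
    # Accept only the next message in the server's sequence; a replayed
    # or out-of-order message fails the check and is silently dropped.
    if seqno == R[server] + 1:
        processed.append(body)    # ProcessMessage(m) ...
        R[server] = seqno         # ... and recording receipt must be atomic

on_receive("S1", 1, "note a")
on_receive("S1", 1, "note a")     # replay from another replica: dropped
on_receive("S1", 2, "note b")
on_receive("S2", 1, "note c")     # independent stream from server S2
```

In a real client, the two statements inside the `if` would be wrapped in a transaction, in line with the atomicity remark above.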

In Strategy Id List, the server's main loop now looks like the following:

    loop forever {
        Receive(m); {continue; // we keep listening.}
        if (m.body == "Request my notes")
            fork(HandleClient(m.sender));
        if (m.body == "Server synchronization: Ks,Ds") {
            P = K - Ks;
            // The other server has purged
            // these already; we can too.
            K = K - P;
            D = D - P;
            P = Ks - K;
            // We have purged these already;
            // the other server should too.
            Ks = Ks - P;
            Ds = Ds - P;
            D = D + Ds;
            // Merge in notes marked delivered
            // on the other server.
            Deliver(m.sender, "Synchronization reply: K,D"); {continue;}
        }
    }

Here we assume that server synchronizations are performed right after each message's delivery:

    HandleClient(c) {
        // Start delivering undelivered messages
        U = K - D;
        foreach i in U {
            Deliver(c, "Here is note i: ..."); {abort;}
            D = D + {i};
            fork(ServerSynch(i,c));
        }
    }


    ServerSynch(i,c) {
        foreach s in S, in parallel {
            // S is the set of all other servers
            Deliver(s, "Server synchronization: K,D"); {continue;}
            Receive(m); {continue;}
            // get s's reply
            if (m.body == "Synchronization reply: K[s],D[s]") {
                P = K - K[s];
                // The other server has purged
                // these already; we can too.
                K = K - P;
                D = D - P;
                P = K[s] - K;
                // We have purged these already;
                // the other server should too.
                K[s] = K[s] - P;
                D[s] = D[s] - P;
                D = D + D[s];
                // Merge in notes marked delivered
                // on the other server.
            }
        }
        if (we have received reply from all servers in S) {
            // Purge those notes that everybody
            // knows are delivered.
            P = D[1] intersect D[2] intersect ... intersect D[N];
            foreach i in P {
                Deliver(c, "Please purge i"); {continue;}
                K = K - {i};
                D = D - {i};
            }
        }
    }

Finally, the client now accepts two kinds of messages: note delivery messages and purging authorizations.

    Deliver(s, "Request my notes"); {abort;}
    loop {
        Receive(m); {abort;}
        if (m.body == "Here is note i: ..." &&
            i not in R) {
            ProcessMessage(m);
            R = R + {i};
            // remember that we have
            // received this message.
        }
        if (m.body == "Please purge i") {
            R = R - {i};
        }
    }
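The client logic above can be sketched as runnable code (Python for illustration; the helper names and note ids are ours):

```python
# Sketch of the Id List client: R is the set of received note ids;
# a purge authorization lets the client forget an id, because no
# server will ever resend that note.
R = set()
processed = []

def on_note(note_id, body):
    if note_id not in R:          # duplicate suppression
        processed.append(body)    # processing and recording must be atomic
        R.add(note_id)

def on_purge(note_id):
    R.discard(note_id)            # safe: delivery is known at every server

on_note("S1.1", "note a")
on_note("S1.1", "note a")         # duplicate from another replica: dropped
on_purge("S1.1")                  # every server knows S1.1 was delivered
on_note("S1.2", "note b")
```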

Switching to the server-centric algorithm Server Synchronization, the server function becomes:

    HandleClient(c) {
        foreach s in S, in parallel {
            Deliver(s, "Request for list D"); {abort;}
            Receive(m); {abort;}
            D[s] = m.body;
            D[s] = D[s] - (D[s] - K);
            // s may not know that some messages
            // are already purged.
            D = D + D[s];
        }
        // Got response from each s. Purge those
        // messages that everybody knows about.
        P = D[1] intersect D[2] intersect ... intersect D[N];
        D = D - P;
        K = K - P;
        // Can start delivery now.
        U = K - D;
        foreach i in U {
            Deliver(c, "Here is note i: ..."); {abort;}
            D = D + {i};
        }
    }

Note that, in order to simplify the code, we have the server abort as soon as it fails to contact one of the other servers. As a possible optimization, it may want to retry the problematic servers a few times while letting the client hold.

We omit the rest of the server code and the client code for this strategy because they should be rather straightforward to construct.
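The purge step of Server Synchronization amounts to set arithmetic over the delivered-lists; here is a minimal sketch (Python; variable names mirror the pseudo-code, the sample sets are ours):

```python
# A note may be purged only when *every* server knows it was delivered.
K = {"n1", "n2", "n3"}             # notes this server still keeps
D = {"n1"}                         # notes this server knows are delivered
replies = [{"n1", "n2"}, {"n1"}]   # delivered-lists D[s] from other servers

for Ds in replies:
    Ds &= K       # s may not know some notes were already purged here
    D |= Ds       # merge in s's delivery knowledge

P = set.intersection(D, *replies)  # delivered according to everybody
D -= P
K -= P
U = K - D                          # what still needs delivering to the client
```

In this example "n1" is purgeable (every set contains it), "n2" is delivered somewhere but not yet purgeable, and "n3" still awaits delivery.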

Our next strategy, Delay Reconnect, also has a different HandleClient:


    HandleClient(c) {
        foreach s in S, in parallel {
            if (TS[s] < GetCurrentTime() - Tr) {
                // "<" means earlier than
                Deliver(s, "Request for list D"); {abort;}
                Receive(m); {abort;}
                D[s] = m.body;
                TS[s] = m.timestamp;
                D[s] = D[s] - (D[s] - K);
                D = D + D[s];
            }
        }
        // We are safe. Can start delivery now.
        U = K - D;
        foreach i in U {
            Deliver(c, "Here is note i: ..."); {abort;}
            D = D + {i};
        }
    }

Lastly, Strategy Connect History looks like the following:

    HandleClient(c) {
        // get connection history
        Receive(H); {abort;}
        foreach s in S, in parallel {
            if (TS[s] < H(s)) {
                Deliver(s, "Request for list D"); {abort;}
                Receive(m); {abort;}
                D[s] = m.body;
                TS[s] = m.timestamp;
                D[s] = D[s] - (D[s] - K);
                D = D + D[s];
            }
        }
        // We are safe. Can start delivery now.
        U = K - D;
        foreach i in U {
            Deliver(c, "Here is note i: ..."); {abort;}
            D = D + {i};
        }
    }
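The TS[s] < H(s) test above is the entire "smarter check"; a tiny sketch makes it concrete (Python; the timestamp values are invented for illustration):

```python
# Server-side check for Connect History: synchronize with server s only
# if our last word from s (TS[s]) predates the client's last connection
# to s (H[s]); otherwise s can have nothing new about this client.
TS = {"A": 100, "B": 250}   # when we last heard from each other server
H  = {"A": 180, "B": 200}   # the client's reported connection history

needs_sync = [s for s in TS if TS[s] < H[s]]   # only "A" is stale here
```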

B Expected throughput analysis

We will first derive a few constructs which will be needed by the analysis that follows. We refer the reader back to Figure 2. To begin, we give the formula for E(d), the expected time it takes to deliver a message from node A to node B on the circle given the distance d between them. Simply, E(d) is just the expected value of the delivery time t based on the distribution f_d(t):

    E(d) = E_{f_d}[t]
         = ∫_0^∞ t df_d(t)
         = ∫_0^∞ t f_d'(t) dt
         = ∫_0^∞ t (1/d) e^(-t/d) dt
         = [ -e^(-t/d) (t + d) ]_0^∞
         = 0 + d
         = d
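As a numeric sanity check (ours, not the paper's): sampling the exponential delay with density (1/d)·e^(-t/d) should give a sample mean close to d.

```python
# Monte Carlo check that E(d) = d for the exponential delivery delay.
import random

random.seed(0)                      # deterministic run
d = 5.0
n = 200_000
# random.expovariate takes the rate 1/d, giving a mean of d.
mean = sum(random.expovariate(1.0 / d) for _ in range(n)) / n
# mean should land close to d = 5.0
```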

Next we derive the communication delay between the client and its nearest server (between c and S_{i+1} in Figure 2). Because the circle has D hops, and N servers are spread evenly on it, the network distance between two adjacent servers, e.g., S_i and S_{i+1}, is simply D/N.[6] Let d_sap denote the average distance between the access point NAP (a random point on the circle) and the nearest server S_{i+1}. This distance ranges from 0 (when the access point falls on a server itself) to half the distance between adjacent servers (when the point falls in the middle of two servers). Taking the average, we get d_sap = (0 + (1/2)(D/N)) / 2 = D/(4N). Finally, t_cs, the expected time it takes to deliver a message between the client and the nearest server, is simply the sum of the expected delay between S_{i+1} and NAP plus the constant delay between NAP and c. That is,

    t_cs = E(d_sap) + t_cap
         = d_sap + t_cap
         = D/(4N) + t_cap

Another quantity we will be using often is t_ss, the total expected time for a server to deliver in parallel one message each to the other N − 1 servers in an N-server system. For simplicity's sake we assume that t_ss is equal to the longest expected time among the N − 1 messages.[7] Let d_i denote the distance from a server to its i-th nearest neighbor on a half circle. Then d_i = (D/N) i. Thus we get:

    t_ss = max_{i=1..⌊N/2⌋} E(d_i)
         = max_{i=1..⌊N/2⌋} (D/N) i
         = (D/N) ⌊N/2⌋

[6] For simplicity, we assume that the number of hops (D) is divisible by the number of servers (N).
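The two delay constants derived above can be captured as small helpers, which the later formulas reuse (a sketch under the paper's assumptions: D hops, N evenly spaced servers, constant access delay t_cap; example numbers are ours):

```python
# t_cs: expected client <-> nearest-server delay, D/(4N) + t_cap.
def t_cs(D, N, t_cap):
    return D / (4 * N) + t_cap

# t_ss: expected time to reach all other servers in parallel,
# i.e. the delay to the farthest one: (D/N) * floor(N/2).
def t_ss(D, N):
    return (D / N) * (N // 2)
```

For example, with D = 100 hops, N = 5 servers, and t_cap = 1, we get t_cs = 6 and t_ss = 40 (in the same time units as the per-hop delay).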

Lastly, we make two observations. First, the average time needed to deliver a message and receive an acknowledgment is simply twice the expected message delay. For example, it takes the server 2 t_cs = 2 (D/(4N) + t_cap) time to deliver a note to the client and receive an acknowledgment. Second, assuming that notes are delivered sequentially, i.e., one after the acknowledgment of another, the expected number of notes that can be delivered sequentially in time T is simply T divided by the time to deliver a single note. As an example, the expected number of notes a client can receive in time T_c is T_c / (2 t_cs) = T_c / (2 (D/(4N) + t_cap)).

    B.1 Analysis

We derive the formulas for p_i, the expected number of notes received by the client in T_c time under Strategy i. We propose a reference Strategy 0, which is an at-least-once algorithm that can potentially result in duplicate deliveries. Because of its relaxed semantics, we expect Strategy 0 to perform well in our throughput study. Hence we will use p_0 as our base of comparison.

In Strategy 0, because the server does not need to be concerned with exactly-once delivery, note delivery can start right after the client requests a connection. Therefore, the expected number of notes the client will receive is simply the total time of the connection (T_c) divided by the expected time it takes to deliver a single note (2 t_cs). That is,

    p_0 = T_c / (2 t_cs)
        = T_c / (2 (D/(4N) + t_cap))

[7] This is an optimistic estimate, because it is saying that delivering several messages in parallel takes only as long as delivering to the farthest one among them. This is clearly only an approximation.

Recall that in Strategy 1 (Sequenced Streams), notes are delivered in sequence-number order, and the client remembers the latest received sequence numbers to prevent duplicate delivery. Strategy 2 (Sequenced Wrap-around Streams) is identical to Strategy 1 except that it provides the ability to reuse note ids. In both these strategies, once a client connects to a server, note delivery can begin. Consequently, except for the restriction on in-order delivery, these two strategies are capable of the same expected throughput as our base case, p_0. In other words,

    p_1 = p_2 = p_0
        = T_c / (2 (D/(4N) + t_cap))
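The baseline throughput is then a one-liner (sketch; the parameter values in the example are invented):

```python
# p0 (and p1, p2): Tc / (2 * t_cs), with t_cs = D/(4N) + t_cap.
def p0(Tc, D, N, t_cap):
    return Tc / (2 * (D / (4 * N) + t_cap))
```

For instance, p0(120, 100, 5, 1.0) = 120 / 12 = 10 notes per connection.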

In Strategy 3 (Id List), the client keeps a list R of received note ids to guard against duplicates. As shown in the timing diagram (Figure B-1(a)), the server sends a note to the client at time T1. After receiving the acknowledgment (T2 = T1 + 2 t_cs), the server immediately initiates the purging protocol,[8] which requires it to synchronize with all the other servers (T3 = T2 + 2 t_ss), and then inform the client (T4 = T3 + t_cs) to purge. However, the delivery of the second note m2 need not wait for the purging of m1, but can start as early as point T2 in the figure.

Figure B-1: Delivery timing diagram for Strategy 3. (a) Without R overflow. (b) With R overflow (q = 2).

From Figure B-1(a), we see that a new note is delivered every 2 t_cs time starting from time T1. Moreover, after point T4, entries in list R are also purged at the same rate. Consequently, as long as the R slots do not fill up by T4, a steady state is reached where note ids are put into and taken out of R at the same rate of once every 2 t_cs. As such, the expected throughput is simply T_c / (2 t_cs). To not fill up R within time T4 − T1 = 3 t_cs + 2 t_ss, the client needs to have at least (2 t_cs + 2 t_ss) / (2 t_cs) = 1 + t_ss/t_cs slots in its R. (The reason the numerator is 2 t_cs + 2 t_ss rather than 3 t_cs + 2 t_ss is that the delivery of m_k can potentially overlap with the purging authorization of m1 between T3 and T4, as shown in Figure B-1(a).)

If, on the other hand, R fills up before T4, we have the more complicated situation depicted in Figure B-1(b). Assume q, the number of slots in R, is 2. At time T3, the server sends note m3. However, because both slots in R are used, the client cannot process m3 until point T4, when the slot for m1 is freed. Similarly for later notes. Note that if the client chooses to discard m3 instead of holding it until T4, additional delays may be needed to get the server to retransmit m3.

[8] This eager mode of purging can potentially lead to heavy traffic between the servers because it requires a full server synchronization for each note delivered. A real implementation will likely optimize by batching several synchronizations into one, for example. However, the eager model gives the same throughput as the optimized versions under normal circumstances, and is much easier to study.

A closer examination of Figure B-1(b) tells us that the note delivery rate obtainable in a limited-storage system is only q (in this case, 2) notes per 2 t_cs + 2 t_ss time. That is to say, the expected number of notes received in T_c time is (T_c / (2 t_cs + 2 t_ss)) q. Putting the two cases together, we get:

    p_3 = { T_c / (2 t_cs)                 if q > 1 + t_ss/t_cs
          { (T_c / (2 t_cs + 2 t_ss)) q    otherwise

        = { T_c / (2 (D/(4N) + t_cap))                        if q > 1 + t_ss/t_cs
          { T_c q / (2 (D/(4N) + t_cap + (D/N) ⌊N/2⌋))        otherwise
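The two regimes of p3 translate directly into code (sketch; the example numbers below are ours):

```python
# p3: the base rate when the id list R has enough slots
# (q > 1 + tss/tcs); otherwise q notes per purge cycle of
# length 2*(tcs + tss).
def p3(Tc, D, N, t_cap, q):
    tcs = D / (4 * N) + t_cap
    tss = (D / N) * (N // 2)
    if q > 1 + tss / tcs:
        return Tc / (2 * tcs)
    return Tc * q / (2 * (tcs + tss))
```

With D = 100, N = 5, t_cap = 1 (so t_cs = 6, t_ss = 40), the client needs q > 7.67 slots to keep the base rate; with q = 2, throughput drops to roughly a quarter of it.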

We now move on to the server-centric algorithms. In Strategy 4 (Server Synchronization), the server first needs to synchronize with each of the other servers in parallel. Because the synchronization takes 2 t_ss time on average, we will deduct that time from the total time T_c. After the synchronization is complete, delivery starts and proceeds at the rate of 2 t_cs time per note.

    p_4 = (T_c - 2 t_ss) / (2 t_cs)
        = (T_c - 2 (D/N) ⌊N/2⌋) / (2 (D/(4N) + t_cap))
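The up-front synchronization cost is easy to see in code (sketch; example values ours):

```python
# p4: pay the 2*tss synchronization up front, then deliver at base rate.
def p4(Tc, D, N, t_cap):
    tcs = D / (4 * N) + t_cap
    tss = (D / N) * (N // 2)
    return (Tc - 2 * tss) / (2 * tcs)
```

With D = 100, N = 5, t_cap = 1 and T_c = 120, the setup eats 80 of the 120 time units, cutting throughput from 10 to about 3.3 notes.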

In Strategy 5 (Delay Reconnect), the client promises not to reconnect into the system within time T_r. Each server initiates a synchronization with the other servers every T_u time. When a client connects to a server, the server first verifies that the last synchronization message received from each of the other servers was sent out at most T_r ago. If so, actual note delivery can start right away. Otherwise, the server requires a setup period in which it needs to synchronize with the problematic server(s).

Let p_s denote the probability that the latter is true, that is, that a server synchronization period is needed before note delivery can start. Further assume that this synchronization takes an average of t_s time. Hence on average the setup period takes p_s t_s time, which we will deduct from the total connection time T_c. Thus, p_5 = (T_c - p_s t_s) / (2 t_cs).

As shown in Figure B-2, the client first connects to server A at time T1, and then to server B at T4. We assume that the distance between A and B is the farthest possible, i.e., d_AB = (D/N) ⌊N/2⌋. During the period of T_r time before the second connection (from T2 to T4), A could potentially have attempted multiple synchronizations to B (s2 and s3), the receipt of any of which by B would have allowed B to skip the setup with A. To first order of approximation, p_s is equal to the probability that the first such synchronization (i.e., s2) did not make it to B. The message s2 could have been sent at any time between point T2 and point T3 with equal probability. (Note that although our graph shows the situation where T_r > T_u, our derivation is valid for T_r ≤ T_u as well.)


    Figure B-2: Delivery timing diagram for Strategy 5.

Integrating with respect to t gives us p_s ≈ 1 - ∫_0^{T_u} (dt/T_u) f_{d_AB}(T_r - t). Since f_{d_AB}(T_r - t) is zero when t > T_r - d_AB, simplifying further, we have

    p_s ≈ 1 - ∫_0^M (1 - e^(-(T_r - t)/d_AB)) / T_u dt
        = 1 - (1/T_u) [ t - d_AB e^(-(T_r - t)/d_AB) ]_0^M
        = 1 - (1/T_u) ( M - d_AB e^(-T_r/d_AB) (e^(M/d_AB) - 1) )
        = 1 - (1/T_u) ( M - (D/N)⌊N/2⌋ e^(-T_r / ((D/N)⌊N/2⌋)) (e^(M / ((D/N)⌊N/2⌋)) - 1) )

where M is a shorthand for max(0, min(T_r - d_AB, T_u)).

Similarly, we use the time it takes B to synchronize with server A to approximate t_s. Thus, the synchronization time t_s ≈ 2 t_ss. Finally, we get:

    p_5 = (T_c - p_s t_s) / (2 t_cs)
        ≈ (T_c - p_s 2 t_ss) / (2 t_cs)
        = 2N (T_c - 2 (D/N) ⌊N/2⌋ p_s) / (D + 4N t_cap)
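Putting the p_s approximation and the setup penalty together (sketch; example values ours, and the formula follows the derivation above):

```python
# p5: setup penalty is ps * 2*tss, where ps is the probability that the
# last periodic synchronization did not arrive in time, with
# M = max(0, min(Tr - dAB, Tu)) and dAB = (D/N) * floor(N/2).
import math

def p5(Tc, D, N, t_cap, Tr, Tu):
    tcs = D / (4 * N) + t_cap
    tss = (D / N) * (N // 2)
    dAB = tss                              # farthest inter-server distance
    M = max(0.0, min(Tr - dAB, Tu))
    ps = 1 - (M - dAB * math.exp(-Tr / dAB)
              * (math.exp(M / dAB) - 1)) / Tu
    return (Tc - ps * 2 * tss) / (2 * tcs)
```

With D = 100, N = 5, t_cap = 1, T_r = 100, T_u = 30 and T_c = 120, p_s is small (about 0.12), so p5 stays close to the p0 baseline of 10.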

Finally, we look at Strategy 6 (Connect History). Recall that this strategy is very similar to the previous one in that, if the server has heard recently enough from the other servers, then note delivery can start right away; otherwise, a synchronization period is needed. The difference, however, is that here the server has a list of the client's past connection history, and hence can perform a smarter check.

We assume that when the client makes a connection to server B at point T2 (Figure B-3), there has been only one past connection, which was made time t ago to server A. Right after that connection, at point T1, server A sent out a synchronization s1. In other words, unlike the periodic synchronizations of Strategy Delay Reconnect, here we assume that synchronizations are initiated at the end of each client connection. The distance between A and B is d. The probability p_s that s1 has not arrived at B by T2 is 1 - f_d(t). If that happens, the expected time t_s it takes B to perform a synchronization with A is 2 E(d).

Figure B-3: Delivery timing diagram for Strategy 6.

Therefore, we have:

    p_6 = (T_c - p_s t_s) / (2 t_cs)
        = (T_c - (1 - f_d(t)) 2 E(d)) / (2 t_cs)
        = (T_c - e^(-t/d) 2 E(d)) / (2 t_cs)
        = (T_c - e^(-t/d) 2 d) / (2 (D/(4N) + t_cap))
        = 2N (T_c - 2 d e^(-t/d)) / (D + 4N t_cap)
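The closed form for p6 is short enough to check numerically (sketch; parameter values in the example are ours):

```python
# p6: the expected setup cost e^(-t/d) * 2d shrinks exponentially in t,
# the time since the client's previous connection; for large t, p6
# approaches the p0 baseline.
import math

def p6(Tc, D, N, t_cap, t, d):
    tcs = D / (4 * N) + t_cap
    return (Tc - math.exp(-t / d) * 2 * d) / (2 * tcs)
```

With D = 100, N = 5, t_cap = 1 and d = 40: an immediate reconnect (t = 0) pays the full 2d = 80 setup, while a long gap recovers the baseline throughput of 10.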