
Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab

Robbert van Renesse, Nicolas Schiper, Fred B. Schneider
Department of Computer Science, Cornell University

Abstract

Paxos, Viewstamped Replication, and Zab are replication protocols that ensure high-availability in asynchronous environments with crash failures. Various claims have been made about similarities and differences between these protocols. But how does one determine whether two protocols are the same, and if not, how significant the differences are?

We propose to address these questions using refinement mappings, where protocols are expressed as succinct specifications that are progressively refined to executable implementations. Doing so enables a principled understanding of the correctness of the different design decisions that went into implementing the various protocols. Additionally, it allowed us to identify key differences that have a significant impact on performance.

1 Introduction

A protocol expressed in terms of a state transition specification Σ refines another specification Σ′ if there exists a mapping of the state space of Σ to the state space of Σ′ and each state transition in Σ can be mapped to a state transition in Σ′ or to a no-op. This mapping between specifications is called refinement [17] or backward simulation [26]. If two protocols refine one another then we might argue that they are alike. But if they don't, how does one qualify the similarities and differences between two protocols?

We became interested in this question while comparing three replication protocols for high availability in asynchronous environments with crash failures:

• Paxos [18] is a state machine replication protocol [15, 30]. We consider a version of Paxos that uses the multi-decree Synod consensus algorithm described in [18], sometimes called Multi-Paxos. Many implementations have been deployed, including in Google's Chubby service [2, 4], in Microsoft's Autopilot service [11] (used by Bing), and in the popular Ceph distributed file system [33], with interfaces now part of the standard Linux kernel.

• Viewstamped Replication (VSR) [27, 23] is a replication protocol originally targeted at replicating participants in a Two-Phase Commit (2PC) [20] protocol. VSR has also been used in the implementation of the Harp File System [24];

• Zab [12] (ZooKeeper Atomic Broadcast) is a replication protocol used for the popular ZooKeeper [10] configuration service. ZooKeeper has been in active use at Yahoo! and is now a popular open source product distributed by Apache.

Many claims have been made about the similarities and differences of these protocols. For example, citations [21, 3] claim that Paxos and Viewstamped Replication are "the same algorithm independently invented," "equivalent," or that "the view management protocols seem to be equivalent" [18].

In this paper, we approach the question of similarities and differences between these protocols using refinements, as illustrated in Fig. 1. Refinement mappings induce an ordering relation on specifications, and the figure shows a Hasse diagram of a set of eight specifications of interest ordered by refinement. In this figure, we write Σ′ → Σ if Σ refines Σ′, that is, if there exists a refinement mapping of Σ to Σ′.

At the same time, we have indicated informal levels of abstraction in this figure, ranging from a highly abstract specification of a linearizable service [9] to concrete, executable specifications. Active and passive replication are common approaches to replicate a service and ensure that behaviors are still linearizable. Multi-Consensus protocols use a form of rounds in order to refine active and passive replication. Finally, we obtain protocols such as Paxos, VSR, and Zab.

Each refinement corresponds to a design decision, and as can be seen from the figure it is possible to arrive at the same specification following different paths of such design decisions.


[Figure 1: An ordering of refinement mappings and informal levels of abstraction. From abstract to concrete: Linearizable Service; Active Replication and Passive Replication; Multi-Consensus and Multi-Consensus-PO; Paxos, VSR, and Zab.]

There is a qualitative difference between refinements that cross abstraction boundaries and those that do not. When crossing an abstraction boundary, a refinement takes an abstract concept and replaces it with a more concrete one. For example, it may take an abstract decision and replace it by a majority of votes. When staying within the same abstraction, a refinement restricts behaviors. For example, one specification might decide commands out of order while a more restricted specification might decide them in order.

Using these refinements, we can identify and analyze commonalities and differences between the replication protocols. They also enable consideration of new variants of these protocols and what conditions they must satisfy to be correct refinements of a linearizable service. Other papers have used refinements to specify a replication protocol [22, 19]. To the best of our knowledge, this paper is the first to employ refinement for comparing different replication protocols.

This paper is organized as follows: Section 2 introduces state transition specifications of linearizable replication as well as active and passive replication. Section 3 presents Multi-Consensus, a canonical protocol that generalizes Paxos, VSR, and Zab and forms a basis for comparison. Progress indicators, a new class of invariants, enable constructive reasoning about why and how these protocols work. We show how passive replication protocols refine Multi-Consensus by adding prefix order (aka primary order), rendering Multi-Consensus-PO. Section 4 presents implementation details of Paxos, VSR, and Zab that constitute the final refinement steps to executable protocols. Section 5 discusses the implications of identified differences on performance. Section 6 gives a short overview of the history of concepts used in these replication protocols, and Section 7 is a conclusion.

2 Masking Failures

To improve the availability of a service, a common technique is to replicate it onto a set of servers. A consistency criterion defines expected responses to clients for concurrent operations. Ideally, the replication protocol ensures linearizability [9]—the execution of concurrent client operations is equivalent to a sequential execution, where each operation is atomically performed at some point in time between its invocation and response.

2.1 Specification

We characterize linearizability by giving a state transition specification (see Specification 1). A specification defines states and gives legal transitions between states. A state is defined by a collection of variables and their current values. Transitions can involve parameters (listed in parentheses) that are bound within the defined scope. A transition definition gives a precondition and an action. If the precondition holds in a given state, then the transition is enabled in that state. The action relates the state after the transition to the state before. A transition is performed indivisibly starting in a state satisfying the precondition. No two transitions are performed concurrently, and if multiple transitions are enabled simultaneously, then the choice of which transition to perform is unspecified.
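To make this execution model concrete, the following is a minimal sketch of ours (not part of the paper) of how such a specification can be animated in Python: transitions are precondition/action pairs over a state, and one enabled transition is chosen nondeterministically.

import random

def step(state, transitions):
    # transitions: list of (precondition, action) pairs of functions over state.
    enabled = [(pre, act) for (pre, act) in transitions if pre(state)]
    if not enabled:
        return False                    # no transition is currently enabled
    _pre, act = random.choice(enabled)  # the choice among enabled transitions is unspecified
    act(state)                          # the action is performed indivisibly
    return True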

There are interface variables and internal variables. Interface variables are subscripted with the location of the variable, which is either a process name or the network, ν. Internal variables have no subscripts, and their value will be determined by a function on the state of the underlying implementation. Specification Linearizable Service has the following variables:

• inputsν: a set that contains (clt, op) messages sent by process clt. Here op is an operation invoked by clt;

• outputsν: a set of (clt, op, result) messages sent by the service, containing the results of client operations that have been executed;

• appState: an internal variable containing the state of the application;

• invokedclt: the set of operations invoked by process clt. This is a variable maintained by clt itself;

• respondedclt: the set of completed operations, also maintained by clt.

Similarly, there are interface transitions and internal transitions. Interface transitions model interactions with the environment, which consists of a collection of processes connected by a network.


Specification 1 Linearizable Service

var inputsν, outputsν, appState, invokedclt, respondedclt

initially: appState = ⊥ ∧ inputsν = outputsν = ∅ ∧
  ∀clt : invokedclt = respondedclt = ∅

interface transition invoke(clt, op):
  precondition:
    op ∉ invokedclt
  action:
    invokedclt := invokedclt ∪ {op}
    inputsν := inputsν ∪ {(clt, op)}

internal transition execute(clt, op, result, newState):
  precondition:
    (clt, op) ∈ inputsν ∧
    (result, newState) = nextState(appState, (clt, op))
  action:
    appState := newState
    outputsν := outputsν ∪ {((clt, op), result)}

interface transition response(clt, op, result):
  precondition:
    ((clt, op), result) ∈ outputsν ∧ op ∉ respondedclt
  action:
    respondedclt := respondedclt ∪ {op}

An interface transition is performed by the process that is identified by the first parameter to the transition. Interface transitions are not allowed to access internal variables. Internal transitions are performed by the service, and we will have to demonstrate how this is done by implementing those transitions. The transitions of Specification Linearizable Service are:

• Interface transition invoke(clt, op) is performed when clt invokes operation op. Each operation is uniquely identified and can be invoked at most once by a client (enforced by the precondition). Adding (clt, op) to inputsν models clt sending a message containing op to the service. The client maintains what operations it has invoked in invokedclt;

• Transition execute(clt, op, result, newState) is an internal transition that is performed when the replicated service executes op for client clt. The application-dependent deterministic function nextState relates an application state and an operation from a client to a new application state and a result. Adding ((clt, op), result) to outputsν models the service sending a response to clt.

• Interface transition response(clt, op, result) is performed when clt receives the response. The client keeps track of which operations have completed in respondedclt to prevent this transition being performed more than once per operation.

From Specification Linearizable Service it is clear that it is not possible for a client to receive a response to an operation before it has been invoked and executed. However, the specification does allow each client operation to be executed an unbounded number of times. In an implementation, multiple executions could happen if the response to the client operation got lost by the network and the client retransmits its operation to the service. The client will only learn about at most one of these executions. In practice, replicated services will try to reduce or eliminate the probability of a client operation being executed more than once by keeping state about which operations have been executed. For example, a service could keep track of all its clients and eliminate duplicate operations using sequence numbers on client operations. In all the specifications that follow, we omit the logic to avoid operations from being executed multiple times to simplify the presentation.
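As a concrete illustration of the duplicate suppression mentioned above, here is a small sketch of ours (illustrative names, not from the paper) in which the service remembers, per client, the highest sequence number executed and caches the result, so a retransmitted operation is answered without being re-executed.

class DedupService:
    def __init__(self, next_state, initial_state=None):
        self.app_state = initial_state
        self.next_state = next_state   # deterministic (state, (clt, op)) -> (result, new_state)
        self.last_seq = {}             # client id -> highest executed sequence number
        self.last_result = {}          # client id -> cached result of that operation

    def execute(self, clt, seq, op):
        if self.last_seq.get(clt, -1) >= seq:
            return self.last_result[clt]            # duplicate: reply with cached result
        result, self.app_state = self.next_state(self.app_state, (clt, op))
        self.last_seq[clt] = seq
        self.last_result[clt] = result
        return result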

We make the following assumptions about interface transitions:

• Crash Failures: A process follows its specification until it fails by crashing. Thereafter, it executes no transitions. Processes that never crash are called correct. A process that "shuts down" and later recovers using state from stable storage is considered correct, albeit temporarily slow. Processes are assumed to fail independently.

• Failure Threshold: There is a bound f on the maximum number of replica processes that may crash; the number of client processes that fail is unbounded.

• Fairness: Except for interface transitions at a crashed process, a transition that becomes continuously enabled is eventually executed.

• Asynchrony: There is no bound on the time before a continuously enabled transition is executed.

We will use refinement only to show that the design decisions that we introduce are safe. The intermediate specifications that we will produce will not necessarily guarantee liveness. It is important that the final executable implementations support liveness properties such as "an operation issued by a correct client is eventually executed," but it is not necessary for our purposes that intermediate specifications have such liveness properties.1

1 State transition specifications may also include supplementary liveness conditions. If so, a specification Σ that refines a specification Σ′ preserves both the safety and liveness properties of Σ′.


2.2 Active and Passive Replication

Specification 1 has internal variables and transitions that have to be implemented. There are two well-known approaches to replication:

• With active replication, also known as state machine replication [15, 30], each replica implements a deterministic state machine. All replicas process the same operations in the same order.

• With passive replication, also known as primary backup [1], a primary replica runs a deterministic state machine, while backups only store states. The primary computes a sequence of new application states by processing operations and forwards these states to each backup in order of generation.

Fig. 2a illustrates a failure-free execution of a service implemented using active replication:

1. Clients submit operations to the service (op1 and op2 in Fig. 2a).

2. Replicas, starting out in the same state, execute received client operations in the same order.

3. Replicas send responses to the clients. Clients ignore all but the first response they receive.

The tricky part of active replication is ensuring that replicas execute operations in the same order, despite replica failures, message loss, and unpredictable delivery and processing delays. A fault-tolerant consensus protocol [28] is typically employed for replicas to agree on the ith operation for each index i in a sequence of operations. Specifically, each replica proposes an operation that was received from one of the clients in instance i of the consensus protocol. Only one of the proposed operations can be decided. The service remains available as long as each instance of consensus eventually terminates.

Fig. 2b depicts a failure-free execution of passive replication:

1. Clients submit operations only to the primary.

2. The primary orders the operations and computes new states and responses.

3. The primary forwards the new states (so-called state updates) to each backup in the order generated.

4. The primary sends the response of an operation only after the corresponding state update has been successfully decided (this is made precise in Section 3).

Because two primaries may be competing to have their state updates applied at the backups, it is important that replicas apply a state update u on the same state used by the primary to compute u. This is sometimes called the prefix order or primary order property [12, 13].

For example, consider a replicated integer variable with initial value 3. One client wants to increment the variable, while the other wants to double it. One primary receives both operations and submits state updates 4 followed by 8. Another primary receives the operations in the opposite order and submits updates 6 followed by 7. Without prefix ordering, it may happen that the decided states are 4 followed by 7, not corresponding to any sequential history of the two operations.
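The anomaly can be checked mechanically; the following small sketch of ours (assumed names) replays the scenario above.

initial = 3
increment = lambda x: x + 1
double = lambda x: x * 2

# Primary A executes increment then double; primary B executes double then increment.
primary_a_updates = [increment(initial), double(increment(initial))]   # [4, 8]
primary_b_updates = [double(initial), increment(double(initial))]      # [6, 7]

# Without prefix ordering, slot 1 may decide A's first update and slot 2 B's second update.
decided = [primary_a_updates[0], primary_b_updates[1]]                  # [4, 7]
assert decided != [4, 8] and decided != [6, 7]   # matches neither sequential history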

VSR and Zab employ passive replication; Paxos employs active replication. However, it is possible to implement one style on the other. The Harp file system [24], for example, uses VSR to implement a replicated message queue containing client operations—the Harp primary proposes state updates that backups apply to the state of the message queue. Replicas, running deterministic NFS state machines, then execute NFS operations in queue order. In other words, Harp uses an active replication protocol built using a message queue that is passively replicated using VSR.

2.3 Refinement

Below, we present a refinement of a linearizable service (Specification 1) using the active replication approach. We then further refine active replication to obtain passive replication. The refinement of a linearizable service to passive replication follows transitively.

2.3.1 Active Replication

We omit interface transitions invoke(clt, op) and response(clt, op, result), which are the same as in Specification 1. Hereafter, a command cmd denotes a tuple (clt, op).

Specification 2 uses a sequence of slots. A replica executes transition propose(replica, slot, cmd) to propose a command cmd for the given slot. We call the command a proposal. Transition decide(slot, cmd) guarantees that at most one proposal is decided for each slot. Transition learn(replica, slot) models a replica learning a decision and assigning the decision to the corresponding slot of the learnedreplica array. Replicas update their state by executing a learned operation in increasing order of slot number with transition update(replica, cmd, res, newState). The slot of the next operation to execute is denoted by versionreplica.


[Figure 2: A failure-free execution of active and passive replication. (a) Active replication: replicas order and execute client operations op1 and op2 and respond to the clients. (b) Passive replication: the primary executes the operations and forwards ordered state updates to the backups.]

Specification 2 Active Replication

var proposalsreplica[1...], decisions[1...], learnedreplica[1...],
  appStatereplica, versionreplica, inputsν, outputsν

initially:
  ∀s ∈ N+ : decisions[s] = ⊥ ∧
  ∀replica :
    appStatereplica = ⊥ ∧ versionreplica = 1 ∧
    ∀s ∈ N+ : proposalsreplica[s] = ∅ ∧ learnedreplica[s] = ⊥

interface transition propose(replica, slot, cmd):
  precondition:
    cmd ∈ inputsν ∧ learnedreplica[slot] = ⊥
  action:
    proposalsreplica[slot] := proposalsreplica[slot] ∪ {cmd}

internal transition decide(slot, cmd):
  precondition:
    decisions[slot] = ⊥ ∧ ∃r : cmd ∈ proposalsr[slot]
  action:
    decisions[slot] := cmd

internal transition learn(replica, slot):
  precondition:
    learnedreplica[slot] = ⊥ ∧ decisions[slot] ≠ ⊥
  action:
    learnedreplica[slot] := decisions[slot]

interface transition update(replica, cmd, res, newState):
  precondition:
    cmd = learnedreplica[versionreplica] ∧ cmd ≠ ⊥ ∧
    (res, newState) = nextState(appStatereplica, cmd)
  action:
    outputsν := outputsν ∪ {(cmd, res)}
    appStatereplica := newState
    versionreplica := versionreplica + 1

Note that propose(replica, slot, cmd) requires that replica has not yet learned a decision in slot. While not a necessary requirement for safety, proposing a command for a slot that is known to be decided is wasted effort. It would make sense to require that replicas propose in the first slot for which both proposalsreplica[slot] = ∅ and learnedreplica[slot] = ⊥. We do not require this at this level of specification to simplify the refinement mapping between active and passive replication.

To show that active replication refines Specification 1, we first show how the internal state of Specification 1 is derived from the state of Specification 2. The only internal state in Specification 1 is the appState variable. For our refinement mapping, its value is the application state maintained by the replica (or one of the replicas) with the highest version number.

To complete the refinement mapping we also have to show how transitions of active replication map onto enabled transitions of Specification 1, or onto stutters (no-ops with respect to Specification 1). The propose, decide, and learn transitions are always stutters because they do not update appStatereplica of any replica. An update(replica, cmd, res, newState) transition corresponds to execute(clt, op, res, newState), where cmd = (clt, op), in Specification 1 if replica is the first replica to apply op, thus leading to appState being updated. Transition update is a stutter if the replica is not the first replica to apply the update.
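As a small illustration of this mapping, the following sketch of ours (assumed data layout) computes the abstract appState of Specification 1 from the per-replica state of Specification 2.

def abstract_app_state(app_state, version):
    # app_state and version map each replica id to its local application state
    # and its version number, respectively.
    leader = max(version, key=version.get)   # any replica with the highest version
    return app_state[leader]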

2.3.2 Passive Replication

Passive replication (Specification 3) also uses slots, and proposals are tuples (oldState, (cmd, res, newState)) consisting of the state prior to executing a command, a command, the output of executing the command, and a new state that results from applying the command. In an actual implementation, the old state and new state would respectively be represented by an identifier and a state update rather than the entire value of the state.

Any replica can act as primary. Primaries act speculatively, computing a sequence of states before they are decided. Because of this, primaries have to maintain a separate version of the application state. We call this the shadow state. Primaries may propose to apply different state updates for the same slot.

Transition propose(replica, slot, cmd, res, newState) is performed when a primary replica proposes applying cmd to shadowStatereplica in a certain slot, resulting in output res. State shadowStatereplica is what the primary calculated for the previous slot (even though that state is not necessarily decided as of yet, and may never be decided).


Specification 3 Passive Replication

var proposalsreplica[1...], decisions[1...], learnedreplica[1...],
  appStatereplica, versionreplica, inputsν, outputsν,
  shadowStatereplica, shadowVersionreplica

initially:
  ∀s ∈ N+ : decisions[s] = ⊥ ∧
  ∀replica :
    appStatereplica = ⊥ ∧ versionreplica = 1 ∧
    shadowStatereplica = ⊥ ∧ shadowVersionreplica = 1 ∧
    ∀s ∈ N+ : proposalsreplica[s] = ∅ ∧ learnedreplica[s] = ⊥

interface transition propose(replica, slot, cmd, res, newState):
  precondition:
    cmd ∈ inputsν ∧ slot = shadowVersionreplica ∧
    learnedreplica[slot] = ⊥ ∧
    (res, newState) = nextState(shadowStatereplica, cmd)
  action:
    proposalsreplica[slot] := proposalsreplica[slot] ∪
        {(shadowStatereplica, (cmd, res, newState))}
    shadowStatereplica := newState
    shadowVersionreplica := slot + 1

internal transition decide(slot, cmd, res, newState):
  precondition:
    decisions[slot] = ⊥ ∧
    ∃r, s : (s, (cmd, res, newState)) ∈ proposalsr[slot] ∧
            (slot > 1 ⇒ decisions[slot − 1] = (−, −, s))
  action:
    decisions[slot] := (cmd, res, newState)

internal transition learn(replica, slot):
  precondition:
    learnedreplica[slot] = ⊥ ∧ decisions[slot] ≠ ⊥
  action:
    learnedreplica[slot] := decisions[slot]

interface transition update(replica, cmd, res, newState):
  precondition:
    (cmd, res, newState) = learnedreplica[versionreplica]
  action:
    outputsν := outputsν ∪ {(cmd, res)}
    appStatereplica := newState
    versionreplica := versionreplica + 1

interface transition resetShadow(replica, version, state):
  precondition:
    version ≥ versionreplica ∧
    (version = versionreplica ⇒ state = appStatereplica)
  action:
    shadowStatereplica := state
    shadowVersionreplica := version

Proposals for a slot are stored in a set since a primary may propose to apply different commands for the same slot due to repeated change of primaries.

Transition decide(slot, cmd, res, newState) specifies that only one of the proposed new states can be decided. Because cmd was performed speculatively, the decide transition checks that the state decided in the prior slot, if any, matches state s to which replica r applied cmd, thus ensuring prefix ordering.

Similarly to active replication, transition learn models a replica learning the decision of a slot. With the update transition, a replica updates its state based on what was learned for the slot. With active replication, each replica performs each client operation, while in passive replication only the primary performs client operations and backups simply obtain the resulting states.

Replicas wishing to act as primary perform transition resetShadow to update their speculative state and version, respectively denoted by variables shadowStatereplica and shadowVersionreplica. The new shadow state may itself be speculative and must be set to a version at least as recent as the latest learned state.

To show that passive replication refines active replication, we first note that all variables in passive replication have a one-to-one mapping with the ones in active replication, apart from variables shadowVersion and shadowState which do not appear in active replication.

Transition resetShadow corresponds to a stutter transition. Transitions propose, decide, and learn of passive replication have preconditions that are either equal to the corresponding ones in active replication or more restrictive. If these transitions are enabled in passive replication, they are therefore also enabled in active replication. We now show that this is also the case for transition update. For this to be true, (i) cmd must be different than ⊥, and (ii) (res, newState) must be the result of applying command cmd on appStatereplica. Condition (i) is trivially satisfied from Specification 3. Condition (ii) holds for the following reasons. Since transition update is enabled, cmd is in learnedreplica[versionreplica]. From transition learn, cmd was decided for slot versionreplica. From transition decide, cmd was proposed in slot versionreplica and either versionreplica = 1, or versionreplica > 1 and the state update cmd was applied on the state decided in slot versionreplica − 1, that is, appStatereplica. If versionreplica = 1, (res, newState) = nextState(appState, cmd) trivially holds; otherwise it holds from the precondition of transition propose.

3 A Generic Protocol

Specifications 2 and 3 contain internal variables and transitions that need to be refined for an executable implementation. We start with refining active replication. Multi-Consensus (Specification 4) refines active replication and contains no internal variables or transitions. As previously, the invoke and response transitions (and corresponding variables) have been omitted—they are the same as in Specification 1.


Our term    | Paxos [18]    | VSR [27]     | Zab [12]           | Meaning
replica     | learner       | cohort       | server/observer    | stores copy of application state
certifier   | acceptor      | cohort       | server/participant | maintains consensus state
sequencer   | leader        | primary      | leader             | certifier that proposes orderings
round       | ballot        | view         | epoch              | round of certification
round-id    | ballot number | view-id      | epoch number       | uniquely identifies a round
normal case | phase 2       | normal case  | normal case        | processing in the absence of failures
recovery    | phase 1       | view change  | recovery           | protocol to establish a new round
command     | proposal      | event record | transaction        | a pair of a client id and an operation to be performed
round-stamp | N/A           | viewstamp    | zxid               | uniquely identifies a sequence of proposals

Table 1: Translation between terms used in this paper and in the various replication protocols under consideration.

Transitions propose and update of Specification 2 have been omitted for the same reason. In this section we explain how and why Multi-Consensus works.

3.1 Certifiers and rounds

Multi-Consensus has two basic building blocks:

• A static set of n processes called certifiers. A minority of these may crash. So for tolerating at most f failures, we require that n ≥ 2f + 1 holds.

• An unbounded number of rounds.

For ease of reference, Table 1 contains a translation between terminology used in this paper and the terms found in the papers describing the protocols under consideration.

In each round, a consensus protocol assigns to at most one certifier the role of sequencer. The sequencer of a round can certify at most one command for each slot. The other certifiers can copy the sequencer, certifying the same command for the same slot and round. Note that if two certifiers certify a command in the same slot and the same round, it must be the same command. Moreover, a certifier cannot retract a certification. Once a majority of certifiers certify the command within a round, the command is decided (and because certifications cannot be retracted the command will remain decided thereafter). In Section 3.4 we show why two rounds cannot decide different commands for the same slot.

Each round has a round-id that uniquely identifies the round. Rounds are totally ordered by their round-ids. A round is in one of three modes: pending, operational, or wedged. One round is the first round (it has the smallest round-id), and initially only that round is operational. Other rounds start out pending. The two possible transitions on the mode of a round are as follows:

1. A pending round can become operational only if all rounds with lower round-id are wedged;

2. A pending or operational round can become wedged under any circumstance.

This implies that at any time at most one round is operational and that wedged rounds can never become unwedged.
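These mode transitions can be summarized in a few lines of Python; the sketch below is ours (illustrative representation), not part of the protocol specifications.

PENDING, OPERATIONAL, WEDGED = "pending", "operational", "wedged"

def make_operational(modes, rid):
    # modes maps round-ids to modes; round-ids are totally ordered.
    assert modes[rid] == PENDING
    assert all(modes[r] == WEDGED for r in modes if r < rid)   # all lower rounds are wedged
    modes[rid] = OPERATIONAL

def wedge(modes, rid):
    if modes[rid] in (PENDING, OPERATIONAL):
        modes[rid] = WEDGED        # wedging is always allowed and is irreversible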

3.2 Tracking Progress

In Specification 4, each certifier cert maintains a progress indicator progresscert[slot] for each slot, defined as:

Progress Indicator: A progress indicator is a pair 〈rid, cmd〉 where rid is the identifier of a round and cmd is a proposed command or ⊥, satisfying:

• If cmd = ⊥, then the progress indicator guarantees that no round with an id less than rid can ever decide, or have decided, a proposal for the slot.

• If cmd ≠ ⊥, then the progress indicator guarantees that if a round with id rid′ such that rid′ ≤ rid decides (or has decided) a proposal cmd′ for the slot, then cmd = cmd′.

• Given two progress indicators 〈rid, cmd〉 and 〈rid, cmd′〉 for the same slot, if neither cmd nor cmd′ equals ⊥, then cmd = cmd′.

We define a total ordering on progress indicators for the same slot as follows: 〈rid′, cmd′〉 ≻ 〈rid, cmd〉 iff

• rid′ > rid; or

• rid′ = rid ∧ cmd′ ≠ ⊥ ∧ cmd = ⊥.

At any certifier, the progress indicator for a slot is monotonically non-decreasing.
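The ordering is easy to state in code; the comparator below is our sketch (illustrative names, with None standing for ⊥).

BOTTOM = None

def greater(q, p):
    # True iff q = 〈rid', cmd'〉 ≻ p = 〈rid, cmd〉 for indicators of the same slot.
    rid_q, cmd_q = q
    rid_p, cmd_p = p
    return rid_q > rid_p or (rid_q == rid_p and cmd_q is not BOTTOM and cmd_p is BOTTOM)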

3.3 Normal case processing

Each certifier cert supports exactly one round-id ridcert, initially 0, the round-id of the first round. The normal case holds when a majority of certifiers support the same round-id, and one of these certifiers is sequencer (its isSeqcert flag is set to true).


Specification 4 Multi-Consensus

var ridcert, isSeqcert, progresscert[1...], certificsν, snapshotsν

initially: certificsν = snapshotsν = ∅ ∧
  ∀cert : ridcert = 0 ∧ isSeqcert = false ∧
    ∀slot ∈ N+ : progresscert[slot] = 〈0, ⊥〉

interface transition certifySeq(cert, slot, 〈rid, cmd〉):
  precondition:
    isSeqcert ∧ rid = ridcert ∧ progresscert[slot] = 〈rid, ⊥〉 ∧
    (∀s ∈ N+ : progresscert[s] = 〈rid, ⊥〉 ⇒ s ≥ slot) ∧
    ∃replica : cmd ∈ proposalsreplica[slot]
  action:
    progresscert[slot] := 〈rid, cmd〉
    certificsν := certificsν ∪ {(cert, slot, 〈rid, cmd〉)}

interface transition certify(cert, slot, 〈rid, cmd〉):
  precondition:
    ∃cert′ : (cert′, slot, 〈rid, cmd〉) ∈ certificsν ∧
    ridcert = rid ∧ 〈rid, cmd〉 ≻ progresscert[slot]
  action:
    progresscert[slot] := 〈rid, cmd〉
    certificsν := certificsν ∪ {(cert, slot, 〈rid, cmd〉)}

interface transition observeDecision(replica, slot, cmd):
  precondition:
    ∃rid : |{cert | (cert, slot, 〈rid, cmd〉) ∈ certificsν}| > n/2 ∧
    learnedreplica[slot] = ⊥
  action:
    learnedreplica[slot] := cmd

interface transition supportRound(cert, rid, proseq):
  precondition:
    rid > ridcert
  action:
    ridcert := rid; isSeqcert := false
    snapshotsν := snapshotsν ∪ {(cert, rid, proseq, progresscert)}

interface transition recover(cert, rid, S):
  precondition:
    ridcert = rid ∧ ¬isSeqcert ∧ |S| > n/2 ∧
    S ⊆ {(id, prog) | (id, rid, cert, prog) ∈ snapshotsν}
  action:
    ∀s ∈ N+ :
      〈r, cmd〉 := max{prog[s] | (id, prog) ∈ S}
      progresscert[s] := 〈rid, cmd〉
      if cmd ≠ ⊥ then
        certificsν := certificsν ∪ {(cert, s, 〈ridcert, cmd〉)}
    isSeqcert := true


Transition certifySeq(cert, slot, 〈rid, cmd〉) is performed when sequencer cert certifies command cmd for the given slot and round. The condition progresscert[slot] = 〈rid, ⊥〉 holds only if no command can be decided in this slot by a round with an id lower than ridcert. The transition requires that slot is the lowest empty slot of the sequencer. If the transition is performed, cert updates progresscert[slot] to reflect that if a command is decided in its round, then it must be command cmd. Sequencer cert also notifies all other certifiers by adding (cert, slot, 〈rid, cmd〉) to set certificsν (modeling a broadcast to the certifiers).

A certifier that receives such a message checks if the message contains the same round-id that it is currently supporting and that the progress indicator in the message exceeds its own progress indicator for the same slot. If so, then the certifier updates its own progress indicator and certifies the proposed command (transition certify(cert, slot, 〈rid, cmd〉)).
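The check just described amounts to a few comparisons; the following is our sketch (illustrative names, None for ⊥) of what a certifier does upon receiving a certification message.

BOTTOM = None

def on_certification(my_rid, progress, slot, rid, cmd):
    # progress maps slots to (round_id, cmd) pairs for this certifier.
    cur_rid, cur_cmd = progress.get(slot, (0, BOTTOM))
    exceeds = rid > cur_rid or (rid == cur_rid and cmd is not BOTTOM and cur_cmd is BOTTOM)
    if my_rid == rid and exceeds:
        progress[slot] = (rid, cmd)    # adopt the indicator and certify the command
        return True                    # in the protocol, this certification is broadcast
    return False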

The observeDecision(replica, slot, cmd) transition at replica is enabled if a majority of certifiers in the same round have certified cmd in slot. If so, the command is decided and, as explained in the next section, all replicas that undergo the observeDecision transition for this slot will decide on the same command.

3.4 Recovery

In this section, we show how Multi-Consensus deals with failures. The reason for having an unbounded number of rounds is to achieve liveness. When an operational round is no longer certifying proposals, perhaps because its sequencer has crashed or is slow, the round can be wedged and a round with a higher round-id can become operational.

Modes of rounds are implemented as follows: A certifier cert can transition to supporting a new round-id rid and prospective sequencer proseq (transition supportRound(cert, rid, proseq)). This transition can only increase ridcert. Precondition rid > ridcert ensures that a certifier supports a round and a prospective sequencer for the round at most once. The transition sends the certifier's snapshot by adding it to the set snapshotsν. A snapshot is a four-tuple (cert, rid, proseq, progresscert) containing the certifier's identifier, its current round-id, the identifier of proseq, and the certifier's list of progress indicators. Note that a certifier can send at most one snapshot for each round.

Round rid with sequencer proseq is operational, by definition, if a majority of certifiers support rid and added (cert, rid, proseq, progresscert) to the set snapshotsν. Clearly, the majority requirement guarantees that there cannot be two rounds that are simultaneously operational, nor can there be operational rounds that do not have exactly one sequencer. Certifiers that support rid can no longer certify commands in rounds prior to rid. Consequently, if a majority of certifiers support a round-id larger than x, then all rounds with an id of x or lower are wedged.


decisions[slot] = cmd   if ∃rid : |{cert | (cert, slot, 〈rid, cmd〉) ∈ certificsν}| > n/2
decisions[slot] = ⊥     otherwise

Figure 3: Relation between the certificsν variable of Specification 4 and the decisions variable of Specification 2. Here n is the number of certifiers.


Transition recover(cert, rid, S) is enabled at cert if the set S contains snapshots for rid and sequencer cert from a majority of certifiers. The sequencer helps to ensure that the round does not decide commands inconsistent with prior rounds using the snapshots it has collected. For each slot, sequencer cert determines the maximum progress indicator 〈r, cmd〉 for the slot in the snapshots contained in S. It then sets its own progress indicator for the slot to 〈rid, cmd〉. It is easy to see that rid ≥ r.

We argue that 〈rid, cmd〉 satisfies the definition of progress indicator in Section 3.2. All certifiers in S support rid and form a majority. Thus, it is not possible for any round between r and rid to decide a command because none of these certifiers can certify a command in those rounds. There are two cases:

• If cmd = ⊥, no command can be decided before r, so no command can be decided before rid. Hence, 〈rid, ⊥〉 is a correct progress indicator.

• If cmd ≠ ⊥, then if a command is decided by r or a round prior to r, it must be cmd. Since no command can be decided by rounds between r and rid, 〈rid, cmd〉 is a correct progress indicator.

The sequencer sets its isSeqcert flag upon recovery. As a result, it is enabled to propose new commands. Normal case for the round begins and holds as long as a majority of certifiers support the corresponding round-id.

3.5 Refinement Mapping

Multi-Consensus refines active replication. We first show how the internal variables of Specification 2 are derived from the variables in Specification 4. Predicate decisions[slot] = cmd holds if there exists any round that has decided the command (Section 3.4 argues why all rounds of a given slot can only decide the same command); otherwise decisions[slot] = ⊥. This is captured formally in Fig. 3, where cmd is a tuple (clt, op).
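The mapping of Fig. 3 can be phrased as a simple counting function; the sketch below is ours (illustrative names), not part of the specifications.

from collections import Counter

def decision(certifics, slot, n):
    # certifics: set of (cert, slot, (rid, cmd)) tuples; n: number of certifiers.
    counts = Counter((rid, cmd) for (cert, s, (rid, cmd)) in certifics if s == slot)
    for (rid, cmd), votes in counts.items():
        if votes > n / 2:              # some round has a majority of certifications
            return cmd
    return None                        # ⊥: nothing decided yet for this slot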

The transition certifySeq(cert, slot, 〈rid, cmd〉) of Multi-Consensus always corresponds to a stutter in active replication. The decide(slot, cmd) transition of Specification 2 is performed when, for the first time, a majority of certifiers in some round rid have certified command cmd in slot slot, that is, when the last certifier cert in the majority performs transition certify(cert, slot, 〈rid, cmd〉). The observeDecision transition of Multi-Consensus corresponds exactly to the learn transition of active replication. Both the supportRound and recover transitions are stutters with respect to Specification 2 as they do not affect any of its state variables.

3.6 Passive Replication

Section 2.3 showed how in passive replication a state update from a particular primary can only be decided in a slot if it corresponds to applying an operation on the state decided in the previous slot. We called this property prefix ordering. However, Specification 4 does not satisfy prefix ordering because any proposal can be decided in a slot, in particular one that does not correspond to a state decided in the prior slot. Thus Multi-Consensus does not refine Passive Replication. One way of implementing prefix ordering would be for the primary to wait with proposing a command for a slot until it knows the decisions for all prior slots. Doing so would be slow.

A better solution is to refine Multi-Consensus to obtain a specification that also refines Passive Replication and satisfies prefix ordering. We call this specification Multi-Consensus-PO. Multi-Consensus-PO guarantees that each decision is the result of an operation applied to the state decided in the prior slot (except for the first slot). We complete the refinement by adding two preconditions:

(i) In Multi-Consensus-PO, slots have to be decided in order. To guarantee this, we have each certifier certify commands in order by adding the following precondition to transition certify: slot > 1 ⇒ ∃c ≠ ⊥ : progresscert[slot − 1] = 〈rid, c〉. Thus if, in some round, there exists a majority of certifiers that have certified a command in slot, there also exists a majority of certifiers that have certified a command in the prior slot (a sketch of both added preconditions follows this list).

(ii) To guarantee that a decision in slot is based on the state decided in the prior slot, we add the following precondition to transition certifySeq: slot > 1 ⇒ ∃oldState : cmd = (oldState, −) ∧ progresscert[slot − 1] = 〈rid, (−, (−, −, oldState))〉. This works because, by the properties of progress indicators, if a command has been or will be decided in round rid or a prior round, it is the command in progresscert[slot − 1]. Therefore, if the sequencer's proposal for slot is decided in round rid, it is the command in progresscert[slot − 1]. If the primary and the sequencer are co-located, as they usually are, this condition is satisfied automatically as the primary computes states in order.

Also, Multi-Consensus-PO inherits transitions invoke and response from Specification 1 as well as transitions propose, update, and resetShadow from Specification 3. The variables contained in these transitions are inherited as well.

The passive replication protocols that we consider in this paper, VSR and Zab, share the following design decision in the recovery procedure: The sequencer broadcasts a single message containing its entire snapshot rather than sending separate certifications for each slot. Certifiers wait for this comprehensive snapshot, and overwrite their own snapshot with it, before they certify new commands in this round. As a result, at a certifier all progresscert slots have the same round identifier, which can thus be maintained as a separate variable.

3.6.1 Refinement Mappings

Below, we present a refinement between Multi-Consensus-PO and Multi-Consensus and then show that Multi-Consensus-PO refines Passive Replication.

3.6.2 Refining Multi-Consensus

Showing the existence of a refinement between Multi-Consensus-PO and Multi-Consensus is straightforward. Transitions and variables inherited from passive replication are mapped to variables and transitions of Multi-Consensus in the same way as they are mapped from passive replication to active replication (note that these variables and transitions are mapped to variables and transitions of Multi-Consensus that are themselves inherited from active replication). Since Multi-Consensus-PO only adds constraints to the transitions that are specific to Multi-Consensus, the refinement exists.

3.6.3 Refining Passive Replication

Passive replication has a single internal variable, decisions, that is derived from certificsν as in Fig. 3, that is, for any slot s, decisions[s] = cmd holds if a majority of certifiers have certified cmd for slot s.

Transitions certifySeq, supportRound, and recover are stutters in passive replication, while an observeDecision transition of Multi-Consensus-PO corresponds to a learn transition in Specification 3.

Similarly to the refinement between active replication and Multi-Consensus, transition decide(slot, cmd, res, newState) is performed when, for the first time, a majority of certifiers in some round rid have certified command cmd in slot slot, that is, when the last certifier cert in the majority performs transition certify(cert, slot, 〈rid, cmd〉).

In this case, however, we must additionally show that when the last certifier of the majority undergoes the certify transition, the following condition holds in Specification 3: ∃r, oldState : (oldState, (cmd, res, newState)) ∈ proposalsr[slot] and (slot > 1 ⇒ decisions[slot − 1] = (−, −, oldState)).

The first part of the condition holds since certifiers only certify commands that have been proposed. The second part of the condition holds for the following reason. From constraint (ii) of Multi-Consensus-PO, the sequencer only certifies a command if its corresponding oldState equals the newState of the command it stores in the previous slot. From the properties of progress indicators, if progresscert[slot − 1] = 〈rid, cmd〉, then if a command is decided in an earlier round than rid, then it must be cmd. From constraint (i) (commands are certified in order), if a command is decided for slot, then slot − 1 decided on a command previously. Consequently, (slot > 1 ⇒ decisions[slot − 1] = (−, −, oldState)).

4 Implementation

Specifications Multi-Consensus and Multi-Consensus-PO do not contain internal variables or transitions. However, they only specify which transitions are safe, not which transitions to perform and at what time. We show, informally, final refinements of these specifications to obtain Paxos, VSR, and Zab.

4.1 Normal Case

We first turn to implementing the state and transitions of Multi-Consensus and Multi-Consensus-PO. The first question is how to implement the variables. Variables inputsν, outputsν, certificsν, and snapshotsν are not per-process but global. They model messages that have been sent. In actual protocols, these are implemented by the network: a value in either set is implemented by a message on the network tagged with the appropriate type, such as snapshot.

The remaining variables are all local to a process such as a client, a replica, or a certifier, and can be implemented as ordinary program variables.


[Figure 4: Normal case processing at three certifiers. Dots indicate transitions, and arrows between certifiers are messages. The sequencer (certifier 1) performs certifySeq and sends (slot, rid, cmd) to the other certifiers, which perform certify and reply; replicas then learn the decision. In VSR, certification is restricted to a designated majority.]

In Zab, the progresscert variable is implemented by a queue of commands. In VSR, the progresscert variable is replaced by the application state and a counter that counts the number of updates made to the state in this round. In Paxos, a progress indicator is simply a pair consisting of a round identifier and a command.

Fig. 4 illustrates the steps of normal case processing in the protocols. The figure shows three certifiers (f = 1). Upon receiving an operation from a client (not shown):

1. The sequencer cert proposes a command for the next open slot and sends a message to the other certifiers (maps to certifySeq(cert, slot, 〈rid, cmd〉)). In the case of VSR and Zab, the command is a state update that results from executing the client operation; with Paxos, the command is the operation itself.

2. Upon receipt by a certifier cert, if cert supports the round-id rid in the message, then cert updates its slot and replies to the sequencer (transition certify(cert, slot, 〈rid, cmd〉)). With VSR and Zab, prefix-ordering must be ensured, and cert only replies to the sequencer if its progress indicator for slot − 1 contains a non-empty command for rid.

3. If the sequencer receives successful responses from a majority of certifiers (transition observeDecision), then the sequencer learns the decision and broadcasts a decide message for the command to the replicas, resulting in learn transitions that update the replicas (see Specifications 2 and 3).

The various protocols make additional design decisions:

• In the case of VSR, a specific majority of certifiers is determined a priori and fixed for each round. We call this a designated majority. In Paxos and Zab, any certifier can certify proposals.

• In VSR, replicas are co-located with certifiers, and certifiers speculatively update their local replica as part of certification. A replica may well be updated before some proposed command is decided, so if another command is decided the state of the replica must be rolled back, as we shall see later. Upon learning that the command has been decided (Step 3), the sequencer responds to the client.

• Optionally, Paxos uses leases [8, 18] for read-only operations. Leases have the advantage that read-only operations can be served at a single replica while still guaranteeing linearizability. This method assumes synchronized clocks (or clocks with bounded drift) and has the sequencer obtain a lease for a certain time period. A sequencer that is holding a lease can thus forward read-only operations to any replica, inserting the operation in the ordered stream of commands sent to that replica.

• Zab (or rather ZooKeeper) offers the option to use leasing or to have any replica handle read-only operations individually, circumventing Zab. The latter is efficient, but a replica may not have learned the latest decided proposals and its clients receive results based on stale state (such reads satisfy sequential consistency [16]).

For replicas to learn about decisions, two options exist:

• Certifiers can respond back to the sequencer. The sequencer learns that its proposed command has been decided if the sequencer receives responses from a majority (counting itself). The sequencer then notifies the replicas.

• Certifiers can broadcast notifications to all replicas, and each replica can individually determine if a majority of the certifiers have certified a particular command.

There is a trade-off between the two options: with n certifiers and m replicas, the first approach requires n + m messages and two network latencies. The second approach requires n × m messages but involves only one network latency. All implementations we know of use the first approach.
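For a concrete feel of the trade-off, here is a tiny sketch of ours (illustrative numbers) comparing the message counts of the two options.

n, m = 5, 5                      # number of certifiers and replicas
msgs_via_sequencer = n + m       # certifiers reply to the sequencer, which notifies replicas (two hops)
msgs_broadcast = n * m           # every certifier notifies every replica directly (one hop)
print(msgs_via_sequencer, msgs_broadcast)   # 10 versus 25 messages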

4.2 Recovery

Fig. 5 illustrates the recovery steps in the protocols.

With Paxos, a certifier proseq unhappy with progress starts the following process to try to become sequencer itself (see Fig. 5a):

• Step 1: Prospective sequencer proseq supports a new round rid proposing itself as sequencer (transition supportRound(proseq, rid, proseq)), and queries at least a majority of certifiers.


[Figure 5: Depiction of the recovery phase for (a) Paxos, (b) Zab, and (c) VSR. Dots represent transitions and are labeled with s and r, respectively denoting a supportRound and a recover transition. Messages of the form (x, y)* contain multiple (x, y) tuples.]

• Step 2: Upon receipt, a certifier cert that transitions to supporting round rid and certifier proseq as sequencer (supportRound(cert, rid, proseq)) responds with its snapshot.

• Step 3: Upon receiving responses from a majority, certifier proseq learns that it is sequencer of round rid (transition recover(proseq, rid, S)). Normal case operation resumes after proseq broadcasts, for each slot not known to be decided, the command with the highest round-id (sketched below).
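The selection in Step 3 amounts to a per-slot maximum over the snapshots received. A minimal Python sketch follows; the function and parameter names are ours, not from the Paxos papers.

```python
def commands_to_repropose(snapshots, decided_slots):
    """Sketch of the selection performed in Step 3.

    snapshots     -- per-certifier mappings slot -> (round_id, cmd), as received
                     from a majority of certifiers (including proseq's own state)
    decided_slots -- slots already known to be decided

    For every other slot appearing in some snapshot, pick the command certified
    with the highest round-id; these are re-proposed in the new round.
    """
    best = {}
    for snapshot in snapshots:
        for slot, (round_id, cmd) in snapshot.items():
            if slot in decided_slots:
                continue
            if slot not in best or round_id > best[slot][0]:
                best[slot] = (round_id, cmd)
    return {slot: cmd for slot, (round_id, cmd) in best.items()}
```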

Prospective sequencers in Zab are determined by a weak leader election protocol Ω [29]. When Ω determines that a sequencer has become unresponsive, it initiates a protocol (see Fig. 5b) in which a new sequencer is elected:

• Step 0: Ω proposes a prospective sequencer proseq and notifies the certifiers. Upon receipt, a certifier sends a message containing the round-id it supports to proseq.

• Step 1: Upon receiving such messages from a majority of certifiers, prospective sequencer proseq selects a round-id rid that is one larger than the maximum it received, transitions to supporting it (transition supportRound(proseq, rid, proseq)), and broadcasts this to the other certifiers for approval.

• Step 2: Upon receipt, if certifier cert can support rid and has not agreed to a certifier other than proseq becoming sequencer of the round, it performs transition supportRound(cert, rid, proseq). Zab exploits prefix ordering to optimize the recovery protocol. Instead of sending its entire snapshot to the prospective sequencer, a certifier that transitions to supporting round rid sends a round-stamp. A round-stamp is a lexicographically ordered pair consisting of the round-id in the snapshot (the same for all slots) and the number of slots in the round for which it has certified a command.

• Step 2.5: If proseq receives responses from a majority, proseq computes the maximum round-stamp and determines whether it is missing commands. If so, proseq retrieves them from the certifier certmax with the highest received round-stamp (see the sketch after this list). If proseq is missing too many commands (e.g., if proseq did not participate in the last round certmax participated in), certmax sends its entire snapshot to proseq.

• Step 3: After receiving the missing commands, proseq broadcasts its snapshot, in practice a checkpoint of its state with a sequence of state updates, to the certifiers (transition recover(proseq, rid, S)). Certifiers acknowledge the reception of this snapshot and, upon receiving acknowledgments from a majority, proseq learns that it is now the sequencer of rid and broadcasts a commit message before resuming the normal case protocol (not shown in the picture).
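Because round-stamps are lexicographically ordered pairs, the computation in Step 2.5 reduces to taking a maximum. A minimal Python sketch, with the tuple representation and function name as our assumptions:

```python
def highest_round_stamp(round_stamps):
    """round_stamps maps each responding certifier to its round-stamp, a pair
    (round_id, certified_slots).  Python compares tuples lexicographically,
    which matches the ordering of round-stamps described above.  Returns the
    certifier holding the highest round-stamp; the prospective sequencer pulls
    any commands it is missing from that certifier."""
    return max(round_stamps, key=round_stamps.get)
```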

In VSR, each round-id has a pre-assigned view manager v that is not necessarily the sequencer. A round-id is a lexicographically ordered pair comprising a number and the process identifier of the view manager.

If unhappy with progress, the view manager v of round-id rid starts the following recovery procedure (see Fig. 5c):

• Step 1: v starts supporting round rid (transition supportRound(v, rid, v)), and queries at least a majority of certifiers.

• Step 2: Upon receipt of such a query, a certifier cert starts supporting rid (transition supportRound(cert, rid, v)). Similarly to Zab, cert sends its round-stamp to v.

• Step 2.5: Upon receiving round-stamps from a majority of certifiers, view manager v uses the set of certifiers that responded as the designated majority for the round and assigns the sequencer role to the certifier p that reported the highest round-stamp (see the sketch after this list). The view manager then notifies certifier p, requesting it to become sequencer.


• Step 3: Sequencer p, having the latest state, broadcasts its snapshot (transition recover(p, rid, S)). In the case of VSR, the state that the new sequencer sends is its application state rather than a snapshot.
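Step 2.5 can be summarized by a small sketch. The following Python fragment is illustrative only; the function name and data representation are our assumptions, not VSR's.

```python
def designate_and_assign(round_stamps):
    """View-manager side of Step 2.5.  round_stamps maps the certifiers that
    responded (at least a majority) to their round-stamps, represented as
    lexicographically ordered pairs.  The responders become the designated
    majority for the round, and the certifier reporting the highest
    round-stamp is asked to become sequencer."""
    designated_majority = frozenset(round_stamps)
    sequencer = max(round_stamps, key=round_stamps.get)
    return sequencer, designated_majority
```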

4.3 Garbage Collection

Multi-Consensus has each certifier building up state about all slots, which does not scale. Unfortunately, little is written about this issue in the Paxos and Zab papers. In VSR, no garbage collection is required. Certifiers and replicas are co-located, and only store the most recent round-id they adopted and the application state that is updated upon certification of a command. During recovery the sequencer simply sends the application state to the replicas, and consequently, there is no need to replay any decided commands.

4.4 Liveness

All of the protocols require that at most a minority of certifiers experience crash failures in order to make progress. If the current round is no longer making progress, a new round must become operational. If certifiers are slow at this in the face of an actual failure, then performance may suffer. However, if certifiers are too aggressive, rounds will become wedged before being able to decide commands, even in the absence of failures.

To guarantee progress, some round with a correct sequencer must eventually not get preempted by a higher round [6, 5]. Such a guarantee is difficult or even impossible to make [7], but with careful failure detection a good trade-off can be achieved between rapid failure recovery and spurious wedging of rounds [14].

In this section, we will look at how the various protocols try to optimize progress.

4.4.1 Partial Memory Loss

If certifiers keep their state on stable storage (say, a disk), then a crash followed by a recovery is not treated as a failure but instead as the affected certifier being slow. Stable storage allows protocols like Paxos, VSR, and Zab to deal with such transients. Even if all machines crash, as long as a majority eventually recovers their state from before the crash, the service can continue operating.

4.4.2 Total Memory Loss

In Section 5.1 of “Paxos Made Live” [4], the developers of Google’s Chubby service describe a way for Paxos to deal with permanent memory loss of a certifier (due to disk corruption). The memory loss is total, so the recovering certifier starts in an initial state. It copies its state from another certifier and then waits until it has seen one decision before starting to participate fully in the Paxos protocol again. This optimization is flawed since it breaks the invariant that a certifier’s round-id can only increase over time (confirmed by the authors of [4]). By copying the state from another certifier, it may, as it were, go back in time, which can cause divergence.
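The invariant in question can be stated as a one-line check. The following Python fragment is only an illustration of the invariant and of how state copying can violate it; the class and method names are ours, and this is not the Chubby mechanism itself.

```python
class CertifierState:
    """Minimal illustration of the invariant mentioned above: the round-id a
    certifier supports may only increase over time."""
    def __init__(self):
        self.round_id = 0   # highest round-id this certifier has ever supported

    def support_round(self, rid):
        # Normal operation: the supported round-id only moves forward.
        assert rid >= self.round_id
        self.round_id = rid

    def adopt_copied_state(self, copied_round_id):
        # Copying state from another certifier is safe only if it does not
        # roll the round-id back; a lower copied round-id would move this
        # certifier "back in time" and break earlier promises.
        assert copied_round_id >= self.round_id, "round-id would decrease"
        self.round_id = copied_round_id
```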

Nonetheless, total memory loss can be tolerated by extending the protocols. The original Paxos paper [18] shows how the set of certifiers can be reconfigured to tolerate total memory loss, and this has been worked out in greater detail in Microsoft’s SMART project [25] and later for Viewstamped Replication as well [23]. Zab also supports reconfiguration [31].

5 Discussion

Table 2 summarizes differences between Paxos, VSR, and Zab. We believe these differences are important both because they demonstrate that the protocols do not refine one another and because they have pragmatic consequences, as discussed below. The comparisons are based on published algorithms; actual implementations may vary. We organize the discussion around normal case processing and recovery overheads.

5.1 Normal Case

Passive vs. Active Replication In active replication, there are at least f + 1 replicas that each have to execute operations. In passive replication, only the sequencer executes operations, but it has to propagate state updates to the backups. Depending on the overheads of executing operations and the size of state update messages, one or the other may perform better. Passive replication has the advantage that execution at the sequencer does not have to be deterministic and can take advantage of parallel processing on multiple cores.

Read-only Optimizations Paxos and ZooKeeper support leasing for read-only operations, but there is no reason why leasing could not be added to VSR as well. Indeed, Harp (built on VSR) uses leasing. A lease improves the latency of read-only operations in the normal case, but delays recovery in case of a failure. ZooKeeper makes leasing optional. Without leases, clients read any replica at any time. Doing so compromises consistency since replicas may have stale state.


What                    Section       Paxos                            VSR                      Zab
replication style       2.2           active                           passive                  passive
read-only operations    4.1           leasing                          certification            read any replica / leasing
designated majority     4.1, 4.2      no                               yes                      no
time of execution       4.1           upon decision                    upon certification       depends on role
sequencer selection     4.2           majority vote or deterministic   view manager assigned    majority vote
recovery direction      4.2           two-way                          from sequencer           two-way / from sequencer
recovery granularity    4.2           slot-at-a-time                   application state        command prefix
tolerates memory loss   4.4.1, 4.4.2  reconfigure                      partial                  reconfigure

Table 2: Overview of important differences between the various protocols.

Designated Majority VSR uses designated majorities. This has the advantage that the other (typically f) certifiers and replicas are not employed during normal operation, and they play only a small role during recovery, saving almost half the overhead. There are two disadvantages: (1) if the designated majority contains the slowest certifier, the protocol will run at the rate of that slowest certifier, as opposed to the “median” certifier; and (2) if one of the certifiers in the designated majority crashes or becomes unresponsive or slow, then a recovery is necessary. In Paxos and Zab, recovery is necessary only if the sequencer crashes. A middle ground can be achieved by using 2f + 1 certifiers and f + 1 replicas.

Time of Command Execution In VSR, replicas apply state updates speculatively at the same time that they are certified, possibly before they are decided. Commands are forgotten as soon as they are applied to the state. This means that no garbage collection is necessary. A disadvantage is that the response to a client operation must be delayed until all replicas in the designated majority have updated their application state. In other protocols, only one replica has to have updated its state and computed a response, because in case the replica fails another deterministic replica is guaranteed to compute the same response. Note that at this time each command that led to this state and response has been certified by a majority, and therefore the state and response are recoverable even if this one replica crashes.

In Zab, the primary also speculatively applies client operations to compute state updates before they are decided. However, replicas apply those state updates only after they have been decided. In Paxos, replicas only execute a command after it has been decided and there is no speculative execution.

5.2 Recovery

Sequencer Selection VSR provides an advantage by selecting a sequencer that already has the most up-to-date state (taking advantage of prefix ordering); the sequencer thus does not have to recover this state from the other certifiers, simplifying and streamlining recovery.

Recovery Direction Paxos allows the prospective sequencer to recover the state of previous slots and, at the same time, propose new commands for slots for which it has already retrieved sufficient state. However, all certifiers must send their certification state to the prospective sequencer before it re-proposes commands for slots. With VSR, the sequencer is the certifier with the highest round-stamp and it does not need to recover state from the other certifiers. A similar optimization is sketched in the description of the Zab protocol (and implemented in the ZooKeeper service).

With Paxos and Zab, garbage collection of the certification state is important to ensure that the amount of state that has to be exchanged on recovery does not become too large. It is often faster to bring a recovering replica up to date by replaying decided commands that it missed rather than by copying state.

With VSR, the selected sequencer pushes its snapshot to the other certifiers. The snapshot has to be transferred and processed before new certification requests can be handled, possibly resulting in a performance hiccup.

Recovery Granularity In VSR, the state sent from the sequencer to the backups is the entire application state. For VSR, this state is transaction manager state and is small, but in general such an approach does not scale. However, in some cases that cost is unavoidable, even in the other protocols. For example, if a replica has a disk failure, replaying all commands from day 0 is not scalable either, and the recovering replica instead will have to seed its state from another one. In such a case, the replica will load a checkpoint and then replay missing commands to bring the checkpoint up to date; this technique is used in Zab (and in Harp as well). With passive replication protocols, replaying missing commands simply means applying state updates; with active replication protocols, replaying commands entails re-executing them. Depending on the overheads of executing operations and the size of state update messages, one or the other approach may perform better.

Tolerating Memory Loss An option suggested by VSR is to keep only a round-id on disk; the rest of the state is kept in memory. This technique works only in restricted situations where at least one certifier has the most up-to-date state in memory.

6 A Bit of History

Based on the discussion so far, one may think that rounds and sequencers were first introduced by protocols that implement Multi-Consensus. However, we believe the first consensus protocol to use rounds and sequencers is due to Dwork, Lynch, and Stockmeyer (DLS) [6]. Rounds in DLS are countable, and round b + 1 cannot start until round b has run its course. Thus, DLS does not refine Multi-Consensus.

Chandra and Toueg’s work on consensus [5] formalized the conditions under which consensus protocols terminate by encapsulating synchrony assumptions in the form of failure detectors. Their consensus protocol resembles the Paxos single-decree Synod protocol and refines Multi-Consensus.

To the best of our knowledge, the idea of using majority intersection to avoid potential inconsistencies first appears in Thomas [32]. Quorum replication [32] supports only storage objects with read and write operations (or, equivalently, get and put operations in the case of a Key-Value Store).

7 Conclusion

Paxos, VSR, and Zab are well-known replication protocols for an asynchronous environment that admits a bounded number of crash failures. This paper describes Multi-Consensus, a generic specification that captures important design features the protocols share. These features include an unbounded number of totally ordered rounds, a static set of certifiers, and at most one sequencer per round.

The protocols differ in how they refine Multi-Consensus. We were able to disentangle fundamentally different design decisions in the three protocols and consider their impact on performance. Most importantly, compute-intensive services are better off with a passive replication strategy such as the one used in VSR and Zab (provided that state updates are of a reasonable size). To achieve predictable low-delay performance for short operations during both normal case execution and recovery, an active replication strategy without designated majorities, such as the one used in Paxos, is the best option.

References

[1] P.A. Alsberg and J.D. Day. A principle for resilient sharing of distributed resources. In Proc. of the 2nd Int. Conf. on Software Engineering (ICSE’76), pages 627–644, San Francisco, CA, October 1976. IEEE.

[2] M. Burrows. The Chubby Lock Service for loosely-coupled distributed systems. In 7th Symposium on Operating System Design and Implementation, Seattle, WA, November 2006.

[3] C. Cachin. Yet another visit to Paxos. Technical Report RZ3754, IBM Research, Zurich, Switzerland, 2009.

[4] T.D. Chandra, R. Griesemer, and J. Redstone. Paxos made live: an engineering perspective. In Proc. of the 26th ACM Symp. on Principles of Distributed Computing, pages 398–407, Portland, OR, May 2007. ACM.

[5] T.D. Chandra and S. Toueg. Unreliable failure detectors for asynchronous systems. In Proc. of the 11th ACM Symp. on Principles of Distributed Computing, pages 325–340, Montreal, Quebec, August 1991. ACM SIGOPS-SIGACT.

[6] C. Dwork, N. Lynch, and L. Stockmeyer. Consensus in the presence of partial synchrony. In Proc. of the 3rd ACM Symp. on Principles of Distributed Computing, pages 103–118, Vancouver, BC, August 1984. ACM SIGOPS-SIGACT.

[7] M.J. Fischer, N.A. Lynch, and M.S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, April 1985.

[8] C. Gray and D. Cheriton. Leases: an efficient fault-tolerant mechanism for distributed file cache consistency. In Proc. of the Twelfth ACM Symp. on Operating Systems Principles, Litchfield Park, AZ, November 1989.

[9] M.P. Herlihy and J.M. Wing. Linearizability: A correctness condition for concurrent objects. Trans. on Programming Languages and Systems, 12(3):463–492, July 1990.

[10] P. Hunt, M. Konar, F.P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for Internet-scale systems. In USENIX Annual Technical Conference, 2010.

[11] M. Isard. Autopilot: Automatic data center management. Operating Systems Review, 41(2):60–67, April 2007.

[12] F. Junqueira, B. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. In Int’l Conf. on Dependable Systems and Networks (DSN-DCCS’11). IEEE, 2011.

[13] F.P. Junqueira and M. Serafini. On barriers and the gap between active and passive replication. In Distributed Computing, volume 8205, pages 299–313. Springer, 2013.

[14] J. Kirsch and Y. Amir. Paxos for system builders: an overview. In Proc. of the 2nd Workshop on Large-Scale Distributed Systems and Middleware (LADIS’08), pages 1–6, 2008.

[15] L. Lamport. Time, clocks, and the ordering of events in a distributed system. CACM, 21(7):558–565, July 1978.

[16] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, September 1979.

[17] L. Lamport. Specifying concurrent program modules. Trans. on Programming Languages and Systems, 5(2):190–222, April 1983.

[18] L. Lamport. The part-time parliament. Trans. on Computer Systems, 16(2):133–169, 1998.

[19] L. Lamport. Byzantizing Paxos by refinement. In Proceedings of the 25th International Conference on Distributed Computing, DISC’11, pages 211–224, Berlin, Heidelberg, 2011. Springer-Verlag.

[20] B. Lampson and H. Sturgis. Crash recovery in a distributed data storage system. Technical report, Xerox PARC, Palo Alto, CA, 1976.

[21] B.W. Lampson. How to build a highly available system using consensus. In S. Mullender, editor, Distributed Systems (2nd Ed.), pages 1–17. ACM Press/Addison-Wesley Publishing Co., New York, NY, 1993.

[22] B.W. Lampson. The ABCDs of Paxos. In Proc. of the 20th ACM Symp. on Principles of Distributed Computing, Newport, RI, 2001. ACM Press.

[23] B. Liskov and J. Cowling. Viewstamped Replication revisited. Technical Report MIT-CSAIL-TR-2012-021, MIT, July 2012.

[24] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams. Replication in the Harp file system. In Proc. of the Thirteenth ACM Symp. on Operating Systems Principles, Pacific Grove, CA, October 1991.

[25] J.R. Lorch, A. Adya, W.J. Bolosky, R. Chaiken, J.R. Douceur, and J. Howell. The SMART way to migrate replicated stateful services. In Proc. of the 1st Eurosys Conference, pages 103–115, Leuven, Belgium, April 2006. ACM.

[26] N.A. Lynch and F.W. Vaandrager. Forward and backward simulations, II: Timing-based systems. Inf. Comput., 128(1):1–25, 1996.

[27] B.M. Oki and B.H. Liskov. Viewstamped Replication: A general primary-copy method to support highly-available distributed systems. In Proc. of the 7th ACM Symp. on Principles of Distributed Computing, pages 8–17, Toronto, Ontario, August 1988. ACM SIGOPS-SIGACT.

[28] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. J. ACM, 27(2):228–234, April 1980.

[29] N. Schiper and S. Toueg. A robust and lightweight stable leader election service for dynamic systems. In Int’l Conf. on Dependable Systems and Networks (DSN’08), pages 207–216. IEEE, 2008.

[30] F.B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.

[31] A. Shraer, B. Reed, D. Malkhi, and F. Junqueira. Dynamic reconfiguration of primary/backup clusters. In Annual Technical Conference (ATC’12). USENIX, June 2012.

[32] R.H. Thomas. A solution to the concurrency control problem for multiple copy databases. In Proc. of COMPCON 78 Spring, pages 88–93, Washington, D.C., February 1978. IEEE Computer Society.

[33] S. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI’06), November 2006.