
Information Processing Letters 67 (1998) 289-293

Reducing Ω to ◊W

Francis Chu 1,2
Cornell University, Computer Science Department, Upson Hall, Ithaca, NY 14853, USA

Received 6 February 1998; received in revised form 12 June 1998 Communicated by D. Gries

Keywords: Distributed computing; Failure detector; Fault tolerance

1. Introduction

One of the most important problems in fault-tolerant distributed computing is Consensus. Unfortunately, Fischer et al. [3] showed that Consensus is not solvable in an asynchronous system with even a single crash failure. Ever since, there have been efforts to circumvent this impossibility result. One approach is to augment the system with failure detectors. A failure detector D is essentially a distributed oracle that gives processes (possibly incorrect) hints about which processes have crashed. Each process p has access to a local module, denoted D_p, which it can query to get hints about failures. Most of the literature focuses on crash failures; we will be dealing only with crash failures in this paper. Note that, in general, p will see only a subsequence of the outputs of D_p (even if p does not crash), due to the asynchronous nature of the system. We say that p suspects q at time t if at time t the output of D_p suggests that q has crashed; if, in addition, p sees this output (as a result of a query) at time t, then we say that p doubts q at time t.

Failure detectors are characterized by the properties they satisfy. The properties of interest typically relate actual crashes with the outputs of the local modules.

1 Email: [email protected].

2 Supported by AF grant F49620-96-1-0323 and NSF grant IRI-9625901.


(For example, "every process that crashes is eventually permanently suspected by some correct process".) A natural question to ask is whether two failure detectors satisfying different properties are really distinct in their power to solve problems. That is, given two failure detectors D and D', are there problems one could solve using D but not D', or vice versa? 3 To answer this question, Chandra and Toueg [2] define a notion of reduction between failure detectors. A reduction from D' to D is a (distributed) algorithm that maps the outputs of D to the outputs of D'. We say a failure detector D' reduces to another failure detector D iff there is a reduction from D' to D; that is, roughly speaking, D can emulate D'. Note that the reducibility relation is transitive, since reductions can be composed. If each of D and D' reduces to the other, we say that they are equivalent. It is easy to see that this notion of equivalence is an equivalence relation. If D' reduces to D but not the other way around, we say that D' is weaker than D, since any problem solvable using D' can be solved using D but not vice versa. Chandra and Toueg [2] define several failure detectors and show that some pairs are equivalent while others are distinct; thus the equivalence relation is nontrivial.
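To make the notion of reduction concrete, here is a minimal sketch in Python (not from the paper; the names compose, r1 and r2 are illustrative assumptions). It records only the observation used above, namely that reductions compose, which is why reducibility is transitive; actual reductions are distributed algorithms and may exchange messages, as in the algorithms of Sections 3 and 4.

    from typing import Callable, TypeVar

    A = TypeVar("A")  # output type of detector D
    B = TypeVar("B")  # output type of detector D'
    C = TypeVar("C")  # output type of detector D''

    # In this simplified view a reduction from D' to D is just a function from
    # D-outputs to D'-outputs; the distributed, message-passing aspect is elided.
    def compose(r1: Callable[[B], C], r2: Callable[[A], B]) -> Callable[[A], C]:
        # If D'' reduces to D' via r1 and D' reduces to D via r2,
        # then D'' reduces to D via "r1 after r2".
        return lambda d_output: r1(r2(d_output))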

3 Note that it is common in the literature to use ◊W to denote both a set of failure detectors that satisfy some properties and an arbitrary member of that set.


All the failure detectors presented in [2] are sufficient to solve Consensus. Thus not all failure detectors that solve Consensus are equivalent. Chandra et al. [1] identified the weakest failure detector needed to solve Consensus: ◊W. As an intermediate step in the proof, Chandra et al. [1] introduce a failure detector called Ω. The proof reduces Ω to any failure detector that can solve Consensus. (◊W and Ω will be defined in Section 2.) There is a trivial reduction from ◊W to Ω; so by transitivity of reducibility, ◊W reduces to any D that can solve Consensus. Since ◊W can solve Consensus, we have that Ω reduces to ◊W, by the generic reduction in [1]. Thus Ω and ◊W are equivalent.

While there is a trivial reduction from ◊W to Ω, the only reduction from Ω to ◊W in the literature, to the best of our knowledge, is the generic reduction in the proof in [1]. The generic reduction is much more complicated than necessary in this case and is very inefficient; this is the price for generality. This paper presents two reductions from Ω to ◊W: an unbounded reduction and a bounded reduction. The rest of the paper is organized as follows. In Section 2 we will cover the basic model. We give an unbounded reduction in Section 3 and a bounded reduction in Section 4. It turns out that a slight modification to the bounded reduction gives us a quiescent bounded reduction (a distributed algorithm is quiescent if, in every run, the processes eventually stop sending messages). The modification is briefly described in the concluding remarks.

2. The model

The system consists of a set of n processes Π = {p1, ..., pn} connected by a completely connected network. We assume communication channels are reliable. That is, if q is correct and p sends m to q, then q eventually receives m. We do not assume that the channels are FIFO, so it is possible for messages to arrive out of order. The system is asynchronous, so there are no bounds on the speeds of the processes and message delays. Processes may fail by crashing. We say p crashes if p stops taking steps and every step that p took was in accordance with the protocol. A process is correct if it follows the protocol and does not crash. This means that correct processes never stop taking steps. A failure detector consists of n local modules, one for each process in the system. A process may query its local module from time to time (as specified by the protocol).

The interpretation of the outputs from the local modules depends on the specification of the failure detector. In particular, ◊W outputs a suspect list, which consists of a list of processes suspected to have crashed, 4 and Ω outputs a trusted process, which is a single process that is considered to be correct. We now turn to the definitions of Ω and ◊W (taken from [1]).

Definition 1. A failure detector is in Ω iff there is a time after which all the correct processes always trust the same correct process.

Definition 2. A failure detector is in ◊W iff there is a time after which
(1) every crashed process is permanently suspected by some correct process, and
(2) there is a correct process that is never again suspected by any correct process.

Note that the properties are to hold eventually, so there are no guarantees on the initial behaviors of the local modules. Note also that the processes do not know when the guarantees are honored. The first property of ◊W is called weak completeness; the second property is called eventual weak accuracy. Eventual weak accuracy implies that there is a correct process p and a time t such that for all t' ≥ t, no correct process doubts p at time t'. Call such a correct process a chief. 5

There is a trivial reduction from ◊W to Ω: Each process p simply queries its local module and suspects everyone the module does not trust. It is easy to verify that the outputs indeed satisfy both properties of ◊W.
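As an illustration, the trivial reduction just described can be written in a few lines of Python (a sketch only; OmegaModule, emulated_suspects and PROCESSES are assumed names, not from the paper):

    PROCESSES = frozenset({1, 2, 3, 4})   # toy stand-in for the process set Pi

    class OmegaModule:
        # Stand-in for a local module of a failure detector in Omega:
        # its output is a single trusted process.
        def __init__(self, trusted: int):
            self.trusted = trusted

        def query(self) -> int:
            return self.trusted

    def emulated_suspects(omega_p: OmegaModule) -> frozenset:
        # Emulated local output of a detector in ◊W at process p:
        # suspect everyone the Omega module does not trust.
        return PROCESSES - {omega_p.query()}

    # Example: if Omega trusts process 2, the emulated module suspects 1, 3 and 4.
    assert emulated_suspects(OmegaModule(trusted=2)) == frozenset({1, 3, 4})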

Now let us turn to reductions in the other direction. (Note that reductions from Ω to ◊W are mappings from the outputs of ◊W to the outputs of Ω; so all the subsequent algorithms in this paper map a suspect list to a trusted process.)

4 The output is actually a set of processes, but it is common to refer to this set as the suspect list.
5 Readers familiar with the notion of kings should note that kings are chiefs but, in general, not vice versa. The reason is that a correct process, in general, only sees an infinite subsequence of the output of its local module due to asynchrony.


Every process p does the following:

    trust_p, count_p ← p, (0, ..., 0)
    cobegin
    || Task 1: repeat forever
        {p queries its local failure detector module D_p}
        suspects_p ← D_p
        foreach q ∈ suspects_p do
            count_p[q] ← count_p[q] + 1
        od
        send count_p to all
    || Task 2: when receive count_q for some q
        count_p ← MAX(count_p, count_q)
        trust_p ← MIN(count_p)
    coend

Fig. 1. The Slander Algorithm.

3. The Slander Algorithm

The algorithm considered in this section is called the "slander" algorithm. The basic idea is that each process gossips about what it heard from its local module and keeps track of the number of "slanders" it heard regarding each process. The process with the least count is deemed the most trustworthy. See Fig. 1 for a description of the algorithm. Note that MAX in the algorithm computes the component-wise maximum of two vectors and MIN gives the first index that realizes the minimum. We now prove that the algorithm is correct.
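Before the proof, the following Python sketch gives a per-process view of the state updates in Fig. 1 (illustrative only; the class and method names are assumptions, and the scheduler that queries D_p and delivers messages is left abstract):

    class SlanderProcess:
        def __init__(self, pid: int, n: int):
            self.pid = pid
            self.count = [0] * n      # count[q]: number of "slanders" heard about q
            self.trust = pid          # current emulated Omega output

        def on_query(self, suspects: set) -> list:
            # Task 1: after querying D_p, bump the count of every suspect;
            # the returned vector is what p sends to all processes.
            for q in suspects:
                self.count[q] += 1
            return list(self.count)

        def on_receive(self, other_count: list) -> None:
            # Task 2: merge a received vector component-wise (MAX) and trust the
            # process with the smallest count (first index realizing the minimum).
            self.count = [max(a, b) for a, b in zip(self.count, other_count)]
            self.trust = self.count.index(min(self.count))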

Lemma 3. Process r is a chief iff count_p[r] is bounded for all correct p. Moreover, if r is a chief then there is a c_r such that eventually, for all correct p, count_p[r] = c_r permanently.

Proof. Suppose r is a chief, so eventually no correct p ever doubts r again. Let t be a time such that all the faulty processes have crashed, all the messages with a faulty sender and a correct receiver have been received, 6 and correct processes never doubt r again. Let p be a correct process. The only way count_p[r] will ever change after time t is in Task 2. Task 2, however, does not introduce new counts for r into the system (in the sense that it takes the maximum of two existing counts). Let m_t be the maximum of count_p[r] at time t among the correct processes. Suppose q is correct and count_q[r] = m_t at time t. Eventually all correct processes will hear from q and we will have count_p[r] = m_t for all correct p. We see that count_p[r] will never change again. The second claim in the lemma follows if we set c_r = m_t.

Now suppose r is not a chief. Then some correct p must doubt r infinitely often (since the number of processes is finite). Note that count_p[r] is at least n when p sees r ∈ D_p for the nth time. Thus count_p[r] is unbounded. Since p sends count_p to all processes, all correct q will have unbounded count_q[r]. □

6 The point of this is to ensure that correct processes will not receive messages from faulty processes after time t. This is possible since a faulty process sends only finitely many messages.

Theorem 4. The algorithm in Fig. 1 is a reduction of Ω to ◊W.

Proof. Suppose r is a chief. From Lemma 3 we see that there is a constant c_r such that eventually, for all correct p, count_p[r] = c_r permanently. Consider a time t such that for all t' > t, for all correct p and chief r, count_p[r] = c_r. Let

    c* = max{c_r : r a chief}.

Let t' > t be a time such that for all correct p and non-chief q, count_p[q] > c*. Such a t' exists since count_p[q] is unbounded. Since count_p[q] never decreases, we see that from t' onward, all correct p will trust the same chief, which must be correct by weak completeness. Thus the output satisfies the specification of Ω. □

4. The Flush Algorithm

The algorithm in the previous section uses unbounded space in every run with failures. (In fact, it uses unbounded space in every run with a non-chief, since the counts for non-chiefs are unbounded.) We now present an algorithm that only uses a finite amount of space in each run, although there is no global upper bound on the space used.

To motivate the new algorithm, let us observe what happens if we periodically reset the counters in the previous section. By "reset" we mean that every process in the system resets its vector to (0, ..., 0). Chiefs will eventually have a count of 0, but a 0 does not necessarily correspond to a chief. However, if we only reset when all the entries are non-0 in some count_p, then eventually only chiefs will have 0 entries. Having made this observation, we see that we do not really need the count; just a binary value will do. Taking this one step further, we see that a bit vector is simply a subset of Π; the component-wise maximum is simply set union. We now describe the new algorithm informally. Each p periodically sends its suspect list to all processes, and when p receives a suspect list from another process, it merges that list into its own (by set union). When the merged list grows to all of Π, p "flushes": it resets its list to ∅ and starts accumulating suspicions afresh.

One difficulty has to be addressed: Since the system is asynchronous, a flush does not get rid of old suspicions as it should. A suspicion is old if it was taken from a local module's output (causally) before the last flush, which means it should have been discarded by the flush (if not before). A simple way to fix the problem is to add a sequence number to the suspect list. This way, an old suspicion will be recognized as such. Note that once eventual weak accuracy takes effect, there will be at most one more flush. This means that in each run the sequence number will be bounded, and so the algorithm uses only a finite amount of space in each run. There is, however, no upper bound on the sequence number in the system, since ◊W can behave badly for an arbitrary finite amount of time. See Fig. 2 for the algorithm. Note that here MIN gives the process with the least process number. We now argue that the algorithm is correct.

    cobegin
    || Task 1: repeat forever
        {p queries its local failure detector module D_p}
        suspects_p ← suspects_p ∪ D_p
        send (suspects_p, seqnum_p) to all
    || Task 2: when receive (suspects_q, seqnum_q) for some q
        if seqnum_q = seqnum_p then suspects_p ← suspects_p ∪ suspects_q fi
        if seqnum_q > seqnum_p then suspects_p, seqnum_p ← suspects_q, seqnum_q fi
        if suspects_p = Π then
            seqnum_p, suspects_p ← seqnum_p + 1, ∅
            send (suspects_p, seqnum_p) to all
        fi
        trust_p ← MIN(Π − suspects_p)
    coend

Fig. 2. The Flush Algorithm.
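As with Fig. 1, the following Python sketch gives a per-process view of Fig. 2 (illustrative only; the names are assumptions, and sending and the D_p query are driven externally):

    class FlushProcess:
        def __init__(self, pid: int, all_pids: frozenset):
            self.pid = pid
            self.all_pids = all_pids      # the process set Pi
            self.suspects = set()
            self.seqnum = 0
            self.trust = pid              # current emulated Omega output

        def on_query(self, from_dp: set) -> tuple:
            # Task 1: accumulate D_p's suspicions; the returned pair
            # (suspects, seqnum) is what p sends to all processes.
            self.suspects |= from_dp
            return (set(self.suspects), self.seqnum)

        def on_receive(self, suspects_q: set, seqnum_q: int):
            # Task 2: merge same-epoch suspicions, adopt a newer epoch, flush when
            # every process is suspected, and trust the least unsuspected process.
            # Returns the flush message to broadcast, if any.
            msg = None
            if seqnum_q == self.seqnum:
                self.suspects |= suspects_q
            if seqnum_q > self.seqnum:
                self.suspects, self.seqnum = set(suspects_q), seqnum_q
            if self.suspects == set(self.all_pids):
                self.seqnum, self.suspects = self.seqnum + 1, set()
                msg = (set(self.suspects), self.seqnum)
            self.trust = min(self.all_pids - self.suspects)
            return msg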

Lemma 5. For all runs r, there is an s_r such that eventually, for all correct p, seqnum_p = s_r permanently.

Proof. Consider a run r in which not all processes are faulty. Consider a time t when all the faulty processes have crashed, all the messages with a faulty sender and a correct receiver have been received, and the second property of ◊W holds. Thus there is a chief; call it p. Let s* be the maximum sequence number in the system (among the correct processes) at time t. We claim that s_r = s* or s_r = s* + 1. Let q be a correct process with seqnum_q = s* at time t. Note that eventually all the correct processes will have seqnum ≥ s*, since they will receive messages from q. Let us say that a suspect list ℓ is associated with a sequence number s if some correct process has suspects = ℓ while seqnum = s. If Π is never associated with s*, then we see that s* + 1 will never be generated by the correct processes and so s_r = s*. On the other hand, if eventually some correct process ends up with seqnum = s* and suspects = Π, it will increment its sequence number to s* + 1 and send out a flush message. It is not possible for s* + 1 to be associated with Π, since no correct process will doubt p after time t. Thus s_r = s* + 1. □

Theorem 6. The algorithm in Fig. 2 is correct.

Proof. Let r be a run in which not all processes are faulty. Let s_r be the same as in Lemma 5. Let t be a time such that all faulty processes have crashed, all correct processes have s_r as their sequence number, and both properties of ◊W hold. Since s_r is the maximum sequence number (among the correct processes), after it becomes the sequence number of all correct processes, suspects_p grows monotonically and never becomes Π, for all correct p. Note that if q' ∈ suspects_p for some correct p then eventually q' ∈ suspects_q for all correct q, since all correct q will receive messages from p. Thus there is a time t' > t and a subset of Π (call it ℓ) such that for all t'' > t' and all correct p, suspects_p = ℓ. We see that Π − ℓ ≠ ∅ (since Π is not associated with s_r) and all processes in Π − ℓ are correct (by weak completeness). Thus after time t' all the correct processes will trust the same correct process: the least process in Π − ℓ. Thus the output satisfies the specification of Ω. □

5. Concluding remarks

The algorithm in the previous section uses finite memory in each run. However, it sends out infinitely many messages in each run, even though after a while the suspect lists agree. It seems that we should be able to do better by sending only finitely many messages; that is, we should be able to give a quiescent reduction. This is in fact possible and it involves a minor modification to the algorithm of the previous section.

Modify the algorithm so that p sends out (suspects_p, seqnum_p) if and only if it is changed by an assignment. Essentially the same proofs will work; we just need to be careful to check that by sending fewer messages, we are not depriving processes of any information. Note that once the correct processes agree on (suspects_p, seqnum_p), there will be at most one more broadcast from each correct process, since (suspects_p, seqnum_p) will no longer change. 7 Again, while only finitely many messages are sent in each run, there is no global upper bound. An easy adversary argument shows that any algorithm converting ◊W to Ω cannot have a global upper bound on the number of messages sent.
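A sketch of the quiescent variant, again in Python with assumed names (it mirrors the FlushProcess sketch above and shows only the receive handler; Task 1 is modified in the same way):

    class QuiescentFlushProcess:
        def __init__(self, pid: int, all_pids: frozenset):
            self.pid, self.all_pids = pid, all_pids
            self.suspects, self.seqnum, self.trust = set(), 0, pid

        def on_receive(self, suspects_q: set, seqnum_q: int):
            # Same state updates as in Fig. 2 ...
            before = (frozenset(self.suspects), self.seqnum)
            if seqnum_q == self.seqnum:
                self.suspects |= suspects_q
            if seqnum_q > self.seqnum:
                self.suspects, self.seqnum = set(suspects_q), seqnum_q
            if self.suspects == set(self.all_pids):
                self.seqnum, self.suspects = self.seqnum + 1, set()
            self.trust = min(self.all_pids - self.suspects)
            # ... but broadcast (suspects, seqnum) only if an assignment changed it.
            after = (frozenset(self.suspects), self.seqnum)
            return (set(self.suspects), self.seqnum) if after != before else None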

Acknowledgements

We would like to thank David Gries, Joe Halpern, and Sam Toueg for their helpful comments. We would also like to thank the anonymous referee who brought to our attention the subtle difference between kings and chiefs.

References

[1] T.D. Chandra, V. Hadzilacos, S. Toueg, The weakest failure detector for solving consensus, J. ACM 43 (4) (1996) 685-722.
[2] T.D. Chandra, S. Toueg, Unreliable failure detectors for reliable distributed systems, J. ACM 43 (2) (1996) 225-267.
[3] M.J. Fischer, N.A. Lynch, M.S. Paterson, Impossibility of distributed consensus with one faulty process, J. ACM 32 (2) (1985) 374-382.

7 Readers who know about ◊S (see [2]) can verify that the flush algorithm is also a transformation of ◊W to ◊S (if we take suspects_p as the output instead of computing trust_p). When we apply our modification in this section we get a quiescent transformation of ◊W to ◊S.