error detection and diagnosis for fault tolerance in distributed systems



Information and Software Technology 39 (1998) 975-983

Error detection and diagnosis for fault tolerance in distributed systems

Kassem Saleh*, Khaled Al-Saqabi

Kuwait University, Department of Electrical and Computer Engineering, P.O. Box 5969, 13060 Safat, Kuwait

Received 26 April 1997; revised 5 October 1997; accepted 14 November 1997

Abstract

Early error detection and an understanding of the nature and conditions of an error occurrence can be useful for making an effective and efficient recovery in distributed systems. Various distributed system extensions were introduced for the implementation of fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application-specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis and recovery should a transient failure occur later during the distributed program execution. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems, and we show its detection capabilities. Our extension is based on the execution of a message validity test prior to the transmission of messages and on the piggybacking of contextual information to facilitate the detection and diagnosis of transient faults in the distributed system. © 1998 Elsevier Science B.V.

Keywords: Communications software; Detection diagnosis; Distributed systems; Fault tolerance

1. Introduction

Software fault tolerance is becoming an increasingly important issue in software systems development since software is playing an integral role in safety-critical and real-time systems. Most of these systems encompass distributed applications where complex features such as communication, synchronization and timing aspects are manifested. Therefore, achieving fault tolerance and stabilization in distributed systems software is, in general, a more complicated task than in non-distributed software [1]. Distributed and real-time software fault tolerance can be achieved in three interrelated phases: (a) fault detection (or the detection of an illegal system state); (b) fault diagnosis and localization; and (c) fault elimination and recovery. Fault detection and error recovery can be facilitated using a checkpointing mechanism. The detection of an illegal (global) system state must enable some mechanisms for error localization and diagnosis. These mechanisms use the information recorded locally at each process at the most recent recovery point in order to determine where and why the error occurred. Finally, once the error has been detected and diagnosed, fault recovery procedures must be applied. Many approaches dealing with fault recovery in distributed systems have been studied. Among them are backward recovery, which consists of restoring the most recent legal global state, and forward recovery, which deals with the anticipation of the error by changing the global state and allowing the computation to progress from a (forward) state reachable from the most recent recovery point. Certainly, the results of the error localization phase are useful in determining candidates for such a forward legal state. An evaluation of the different mechanisms for recovery (forward or backward) in the context of stabilizing protocols and services is needed.

* Corresponding author. Fax: 00965 481745 I; e-mail: [email protected]. edu.kw

0950-5849/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0950-5849(97)00058-X

Most of the published research on distributed software fault tolerance concentrates on checkpointing and recovery in distributed systems by assuming that a mechanism exists at each process site to detect the occurrence of an error. Various checkpointing and recovery procedures were introduced; however, no formal research was done on error detection.

In this paper, we introduce a procedure for the detection of errors in a distributed system. This procedure is based on the exchange of contextual information appended to the messages to transmit, and on the execution of a message validity test prior to the transmission of any message. The contextual information contains elements already introduced in two methods for checkpointing and recovery in distributed systems [2-4]. By exploiting the power of the two methods, a process in the system will be able to detect, diagnose and recover from any transient fault, therefore increasing the robustness and reliability of the system. We believe that the overhead incurred using our procedure is justifiable, especially for the case of safety-critical distributed software systems.

The rest of this paper is organized as follows. Section 2 gives some basic definitions of distributed systems and fault models in such systems. Section 3 introduces a generalized distributed extension and shows its capabilities. Section 4 gives the overhead and possible optimization of this method. Section 5 contains the proof of correctness and, finally, Section 6 concludes the paper.

2. Distributed systems model and assumptions

Our distributed system model consists of a collection of loosely coupled processes which exchange messages over communication links. These processes form the nodes of a strongly connected network. Each node has a stable and fault-tolerant storage used by a process to save critical contextual information. A process in the distributed system can be modeled by a communicating finite state machine (CFSM). The CFSM model is a natural and intuitive formalism for distributed, real-time and reactive systems that can be characterized by event-driven processes communicating with each other by exchanging messages through a communication medium modeled by unidirectional first-in-first-out (FIFO) channels of unbounded capacities. The communication medium is assumed to be reliable, meaning that it does not duplicate, eliminate or corrupt messages. Processes are considered to be deterministic, meaning that replaying a sequence of events from a state will consistently reach the same final state. We assume that there are no permanent errors in our model.

A CFSM in a system of $n$ CFSMs can be formally defined by the quadruple $CFSM_i = (S_i, s_{0i}, M_i, T_i)$, where $S_i$ is the set of internal states of process $P_i$, $s_{0i} \in S_i$ is the initial state of $P_i$, $M_i$ consists of the union of the set of messages sent by $P_i$ ($MS_i$) and the set of messages received by $P_i$ ($MR_i$), and finally, $T_i$ is a partial transition function $T_i: S_i \times M_i \rightarrow S_i$. We say that a message $m$ belonging to $M_i$ is a label for a transition. A unidirectional FIFO channel $c_{ij}$ carries messages belonging to $M_i$ sent from $P_i$ to $P_j$. As an example, consider the protocol given in Fig. 1 consisting of three processes $P_1$, $P_2$ and $P_3$ modeled by three CFSMs. A minus (plus) sign '-' ('+') prefixing the label of a transition denotes a sending (receiving) transition. The label of a transmitted (received) message also contains the identification of the process to (from) which the message is sent (received). At any point in time, the processes will be in one of the states defined by their respective CFSMs and the channels may contain messages destined to these processes. Fig. 2 shows execution sequences from the three CFSMs of Fig. 1. This example will be used throughout the paper to illustrate the different distributed system extensions.
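The CFSM quadruple defined above maps naturally onto a small data structure. The following is a minimal sketch only: the class name `CFSM`, the `(sign, message, peer)` label encoding and the two-state machine `p1` are illustrative assumptions, not the protocol of Fig. 1.

```python
from dataclasses import dataclass, field


@dataclass
class CFSM:
    """One process modeled as the quadruple (S, s0, M, T)."""
    name: str
    states: set
    initial: str
    # Partial transition function T: (state, label) -> next state.
    # A label is (sign, message, peer): "-" sends to peer, "+" receives from peer.
    transitions: dict = field(default_factory=dict)

    def messages(self):
        """M: every message that labels some transition."""
        return {label[1] for (_, label) in self.transitions}

    def step(self, state, label):
        """Return the next state, or None when T is undefined for this pair."""
        return self.transitions.get((state, label))


# Hypothetical two-state machine used only for illustration.
p1 = CFSM(
    name="P1",
    states={"s1", "s2"},
    initial="s1",
    transitions={
        ("s1", ("-", "m1", "P2")): "s2",
        ("s2", ("+", "m2", "P2")): "s1",
    },
)
```

Because the transition function is partial, `step` returns `None` for an unspecified pair, which is how an unspecified reception would surface in such a model.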

A global state of a system of $n$ processes is a pair $(S, C)$, where $S = (s_1, s_2, \ldots, s_n)$ represents the current states of processes $P_1, P_2, \ldots, P_n$, respectively, and $C = (c_{ij},\ \forall i, j,\ 0 < i, j \le n \text{ and } i \ne j)$ represents the current contents of the channels $c_{ij}$ linking the processes. The global state of a system is changed by the occurrence of a send or a receive event in a process. The initial global state of a system is a pair $(S_0, C_0)$ in which each component state of $S_0$ is the initial state of its respective process and all channels are empty.

A global state $(S, C)$ is said to be legal or safe (illegal or unsafe) if there exists (does not exist) an execution path consisting of an interleaving of message transmissions and receptions that takes the system from the initial global state $(S_0, C_0)$ to $(S, C)$. This implies that the initial global state is a legal state and any further execution from a legal global state will always lead to a global state which is also legal.
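As a rough illustration of these definitions, the sketch below represents a global state as a pair of local states and FIFO channel contents, and changes it only through send and receive events, so every state it produces is legal by construction. The function names and the two-process example are assumptions made for illustration.

```python
from collections import deque


def initial_global_state(initial_states):
    """Build (S0, C0): given initial local states and empty channels c_ij, i != j."""
    procs = list(initial_states)
    channels = {(i, j): deque() for i in procs for j in procs if i != j}
    return dict(initial_states), channels


def send(state, i, j, msg, new_local):
    """Send event: P_i appends msg to channel c_ij and moves to new_local."""
    local, channels = state
    local[i] = new_local
    channels[(i, j)].append(msg)


def receive(state, i, j, new_local):
    """Receive event: P_j consumes the head of c_ij and moves to new_local."""
    local, channels = state
    if not channels[(i, j)]:
        raise ValueError("nothing to receive on this channel")
    msg = channels[(i, j)].popleft()
    local[j] = new_local
    return msg


# Hypothetical two-process run: P1 sends m1, P2 receives it.
state = initial_global_state({1: "s1", 2: "s1"})
send(state, 1, 2, "m1", "s2")
got = receive(state, 1, 2, "s2")
```

A transient fault, in contrast, would correspond to overwriting an entry of the local-state dictionary directly, without going through `send` or `receive`.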

A global state is said to be consistent if for every reception transition in one process there is a corresponding sending transition in another process and vice versa. Otherwise the global state is said to be inconsistent.

Fig. 1. A distributed system modeled by three CFSMs.

Fig. 2. Execution sequences from the three CFSMs.

A cut consists of joining $n$ process states, each from one of the $n$ processes of the distributed system, hence representing a global system state which may or may not be consistent. A cut is said to be consistent if for every reception transition in one process there is a corresponding sending transition in another process prior to the respective states of these processes in the cut. Otherwise the cut is said to be inconsistent.

The system is said to have failed if it cannot perform its designed functions due to a variety of faults. A fault is an abnormal physical state that is manifested by an error. The types of faults that occur in a system depend on the chosen representation of the system.

In this work, we are mainly interested in detecting and diagnosing transient failures in distributed systems. A transient failure is an event that may corrupt the global system state but not the system's behavior (i.e. code). Such failures affect the global state of a system by corrupting the local state of a process as represented within the system, for example, as represented in the memory. They include: (1) incorrect initialization of one or more processes in the system to local states that are inconsistent with one another, leading to an illegal initial global state; (2) side effects or a local memory overwrite that may corrupt the local process state, leading to an illegal global state; (3) transmission errors that are manifested by the loss, corruption, duplication or reordering of messages by the communication medium; and finally, (4) process or processor failures that cause a process to restart from a local state inconsistent with the other states of the peer processes.

A scenario for a transient (i.e. non-permanent) error can be described as follows. Suppose that $P_1$, $P_2$ and $P_3$ are at states $s_3$, $s_4$ and $s_3$, respectively, and all channels are empty. A transient error such as a state value memory overwrite may occur at $P_2$ and lead to the immediate transmission of $m_6$ to $P_3$. A possible scenario of events is illustrated in Fig. 3. The figure shows that the error will be detected later by $P_2$ as an unspecified reception error at state $s_5$. In these sequences of executions, only message $m_5$ transmitted by $P_1$ is not contaminated, i.e. not induced by the transient error. An error detection mechanism should aim at: (i) minimizing the number of contaminated messages by detecting the error as early as possible; and (ii) diagnosing the exact reason for the error. In this example, without incorporating any extension in the system, the error is detected after the transmission of four contaminated messages (namely $m_6$, $m_7$ and the two $m_1$ messages) and the error is manifested as an unspecified reception error (design error) and not as a transient non-permanent error. This paper concentrates on detecting such errors, which can be defined as the result of a spontaneous transition fault. A spontaneous transition fault occurs when the system changes from one state to another without any event occurrence.

Fig. 3. Scenario for a transient error.

3. A generalized distributed extension

In this section, we present a generalized extension which is shown to be useful for the implementation of error detection and diagnosis capabilities in distributed software. The MREIT (maximally reachable event index tuple) approach as defined in Ref. [5] has the potential of being used efficiently for error detection, checkpointing and recovery. However, it is not as powerful with respect to the diagnosis of errors, i.e. message loss or out-of-order message reception. The latter feature is supported by the CC (causal communication) method in Refs. [2,3]. Consequently, we propose an extension which combines both MREIT and CC. The requirements and formal definition of this new generalized technique are described below. Moreover, the capabilities of the technique for achieving fault tolerance are also shown.

3.1. Maximally reachable event stamp (MRES)

In order to facilitate the detection of state changes without the occurrence of an event, the following contextual information is maintained at each process $P_i$: the maximally reachable event stamp ($MRES_i$) and a delivery vector ($deliv_i$). The MRES at process $P_i$ consists of: (i) an event index vector ($EIV_i$) containing one pair of information for each process in the distributed system, as defined in the MREIT method; and (ii) $sent_i$, a two-dimensional $n \times n$ matrix, where $n$ is the number of processes in the system. $sent_i[k, l]$ denotes $P_i$'s knowledge of the number of messages sent by $P_k$ to $P_l$, $\forall i, k, l,\ 0 < i, k, l \le n$. $deliv_i$ is an $n$-vector, where $n$ is the number of processes. $deliv_i[j]$ denotes the number of messages delivered to $P_i$ from $P_j$, $\forall i, j,\ 0 < i, j \le n$. Each pair in $EIV_i$ contains the current event index ($EI$) local to the process (each time an event occurs in a process, its event index is incremented by 1) and state transition information related to the occurrence of the last event in that process, i.e. $(EI, s_s s_d)$, where $s_s$ and $s_d$ are the source and destination states, respectively. The algorithm below shows how to obtain the maximally reachable event stamp, $MRES_i$, and the delivery vector, $deliv_i$, at each process $P_i$.
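One possible in-memory layout for this contextual information is sketched below. It is an assumption-laden illustration: the class name `Context`, the use of 0-based process indices and the representation of a transition as `None` before any event has occurred are choices made here, not part of the method's definition.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Contextual information kept at one process in a system of n processes."""
    n: int

    def __post_init__(self):
        # EIV: one (event index, last state transition) pair per process.
        self.eiv = [(0, None)] * self.n
        # sent[k][l]: known number of messages sent by process k to process l.
        self.sent = [[0] * self.n for _ in range(self.n)]
        # deliv[j]: number of messages delivered to this process from process j.
        self.deliv = [0] * self.n

    def mres(self):
        """Return a copy of the stamp (EIV, sent) that is piggybacked on messages."""
        return list(self.eiv), [row[:] for row in self.sent]


ctx = Context(3)
eiv_copy, sent_copy = ctx.mres()
```

Returning copies from `mres` keeps a stamp attached to an in-flight message from changing when the sender later updates its own state.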

3.2. The procedure

When process $P_i$ is ready to send a message $m$ to process $P_j$ and moves from state $s_s$ to state $s_d$, the event index value in the $i$th tuple of $EIV_i$ is incremented by one and the state transition information in that tuple is changed to $s_s s_d$. Also, the value of $sent_i[i, j]$ is incremented by one, and the new $MRES_i$ of process $P_i$ is stored locally on stable storage. Then a message validity test (MVT) procedure is performed on message $m$ to ensure that $m$ is not contaminated. If the MVT indicates that $m$ is contaminated, then an error is reported and appropriate recovery procedures (out of the scope of this paper) have to be executed; otherwise, $m$ is transmitted along with $MRES_i$ (i.e. $EIV_i$ and $sent_i$).

When process $P_j$ receives message $m$ at state $s_j$, along with $MRES_i$ from process $P_i$, it performs two checks. The first check ensures that messages causally preceding $m$ and sent from $P_i$ to $P_j$ are delivered before $m$. The second check ensures that messages known by $P_i$ to have been sent prior to $m$ from other processes to process $P_j$ are delivered before $m$. Once these two conditions are satisfied, $m$ is consumed by $P_j$, and $deliv_j$, $sent_j$ and $EIV_j$, stored on stable storage at process $P_j$, are updated as follows to reflect $P_j$'s up-to-date local knowledge of the global state of the system: (i) the $i$th value of the array $deliv_j$ (i.e. $deliv_j[i]$) is incremented by one to denote that a new message from $P_i$ is delivered to $P_j$; (ii) every value in $sent_j$ is compared with the corresponding value in $sent_i$ of $MRES_i$ and the original value in $sent_j$ is replaced by the larger of these two values. This is done to update process $P_j$'s knowledge of the number of messages sent by other processes; and (iii) if process $P_j$ has moved to $s_k$ after receiving $m$, then the $j$th tuple in $EIV_j$ of process $P_j$ is updated by incrementing the event index value in that tuple by one and changing the state transition information to $s_j s_k$. Also, if the event index value in any tuple of $EIV_i$ is larger than the event index value in the corresponding tuple of $EIV_j$, then that tuple in $EIV_i$ replaces the corresponding tuple in $EIV_j$.

The procedure is formally described in the following steps at both the sender and receiver processes.

3.2.1. Sender process

When at state $s_m$, $P_i$ decides to transmit a message $m$ to process $P_j$ and moves to state $s_n$, it performs the following steps.

1. Updates the $i$th pair corresponding to $P_i$ in $EIV_i$ to $(EI_i + 1, s_m s_n)$. Updates $sent_i[i, j] = sent_i[i, j] + 1$, and saves the new $MRES_i$ on stable storage.
2. Performs the message validity test (MVT), described below in Section 3.3, to check if the message is contaminated. If it is contaminated, then an error is reported and a recovery procedure should be executed. If it is not contaminated, then proceed as follows.
3. Appends the updated $MRES_i$ to the message to transmit.
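The sender steps can be sketched as follows. This is an illustrative outline under several assumptions: the context is a plain dictionary, a Python list stands in for stable storage, process indices are 0-based, and the MVT is passed in as a callable (here a stub that always accepts) because it is only defined in Section 3.3.

```python
import copy


def new_context(n):
    """Zero-initialised EIV and sent matrix for a system of n processes."""
    return {"eiv": [(0, None)] * n, "sent": [[0] * n for _ in range(n)]}


def send_message(ctx, i, j, src, dst, payload, mvt, stable_log):
    """Sender steps at process i for a message to process j."""
    # Step 1: update the sender's own EIV pair and sent[i][j], then save the MRES.
    ei, _ = ctx["eiv"][i]
    ctx["eiv"][i] = (ei + 1, (src, dst))
    ctx["sent"][i][j] += 1
    stamp = copy.deepcopy(ctx)
    stable_log.append(stamp)  # stand-in for stable storage
    # Step 2: run the message validity test before anything leaves the process.
    if not mvt(stamp):
        raise RuntimeError("contaminated message, recovery required")
    # Step 3: piggyback the updated MRES on the outgoing message.
    return {"from": i, "to": j, "payload": payload, "mres": stamp}


log = []
ctx1 = new_context(3)
msg = send_message(ctx1, 0, 1, "s1", "s2", "m1", lambda stamp: True, log)
```

Raising an exception on a failed test is one way to guarantee that a contaminated message is never handed to the channel; how recovery then proceeds is outside the scope of this sketch.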

3.2.2. Receiver process

When process $P_j$ receives the message $m$ from process $P_i$ along with $MRES_i$, it performs the following steps.

1. Waits until the following conditions are satisfied:

   $$deliv_j[i] + 1 = sent_i[i, j] \qquad (p_1)$$

   $$\forall k \ne i:\ deliv_j[k] \ge sent_i[k, j] \qquad (p_2)$$

   Predicates $p_1$ and $p_2$ ensure that the messages that causally precede $m$ and are destined to $P_j$ are delivered before $m$.
2. Saves the message $m$ and the sender process identification $P_i$.
3. Updates $deliv_j[i]$ to $deliv_j[i] + 1$.
4. $\forall k, l$: updates $sent_j[k, l]$ to $\max(sent_j[k, l], sent_i[k, l])$.
5. (i) Updates the $j$th pair in $EIV_j$ by incrementing its local event index $EI_j$ by one and replacing the previous state transition by the new reception transition.
   (ii) Updates the other pairs in $EIV_j$ such that $EIV_j = \mathrm{MAX}(EIV_i, EIV_j)$. The MAX function is defined as follows: if $EI_n$ in the $n$th pair in $EIV_i$ is greater than $EI_n$ in the $n$th pair in $EIV_j$, then the $n$th pair in $EIV_i$ replaces the $n$th pair in $EIV_j$.
6. Saves its updated $MRES_j$ and $deliv_j$ on stable storage.
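The receiver steps can be sketched in the same style. The function names, the dictionary layout and the 0-based indices are assumptions; waiting (step 1) is reduced to a predicate that the caller would re-evaluate, and saving to stable storage is omitted. The example values follow the reception of $m_1$ by $P_2$ described in the text.

```python
def can_deliver(deliv_j, sent_i, i, j):
    """Predicates p1 and p2 for a message from process i arriving at process j."""
    p1 = deliv_j[i] + 1 == sent_i[i][j]
    p2 = all(deliv_j[k] >= sent_i[k][j]
             for k in range(len(deliv_j)) if k != i)
    return p1 and p2


def deliver(ctx_j, j, i, stamp, src, dst):
    """Steps 3 to 5: update deliv, sent and EIV at the receiver j."""
    n = len(ctx_j["deliv"])
    ctx_j["deliv"][i] += 1                                   # step 3
    for k in range(n):                                       # step 4
        for l in range(n):
            ctx_j["sent"][k][l] = max(ctx_j["sent"][k][l], stamp["sent"][k][l])
    ei, _ = ctx_j["eiv"][j]                                  # step 5 (i)
    ctx_j["eiv"][j] = (ei + 1, (src, dst))
    for k in range(n):                                       # step 5 (ii)
        if k != j and stamp["eiv"][k][0] > ctx_j["eiv"][k][0]:
            ctx_j["eiv"][k] = stamp["eiv"][k]


# Stamp carried by m1 from P1 (index 0) to P2 (index 1).
stamp = {"eiv": [(1, ("s1", "s2")), (0, None), (0, None)],
         "sent": [[0, 1, 0], [0, 0, 0], [0, 0, 0]]}
ctx2 = {"eiv": [(0, None)] * 3,
        "sent": [[0] * 3 for _ in range(3)],
        "deliv": [0, 0, 0]}
ready = can_deliver(ctx2["deliv"], stamp["sent"], 0, 1)
if ready:
    deliver(ctx2, 1, 0, stamp, "s1", "s2")
```

If a stamp arrived claiming that two messages had already been sent from the same sender, `can_deliver` would return `False` until the earlier message had been delivered.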

The EIVs of the MRES saved at each process at time $t$ (in Fig. 2) are listed below:

The saved MRES vectors at $P_i$ provide a partial execution history of each other process, as known at $P_i$. Fig. 4 gives the time sequence diagram for the three CFSMs. The horizontal lines represent time and the vertical lines between them represent message transfers between the processes. Initially, for all values of $i$ and $j$, the event index $EI_j$ of all the tuples in $EIV_i$, the matrix $sent_i$ and the vector $deliv_i$ are set to zero. In our example, $EIV_1$ at $P_1$ after the transmission of $m_1$ is $((1, s_1 s_2), (0, -), (0, -))$. The arc from $P_1$ to $P_2$ is labeled by the message $m_1$, $EIV_1$ and the matrix $sent_1$. With this information, $P_1$ informs $P_2$ about its own state and its up-to-date knowledge of the global system state. Upon reception of $m_1$ by $P_2$, $P_2$ updates its local copy of $EIV_2$, which becomes $((1, s_1 s_2), (1, s_1 s_2), (0, -))$, in addition to its local copy of $sent_2$, i.e. it updates its local knowledge about the global system state. When $P_3$ sends $m_3$, $EIV_3$ becomes $((1, s_1 s_2), (2, s_2 s_3), (2, s_5 s_3))$.

3.3. Message validity test (MVT)

Our strategy for the early detection of synchronization and destabilization symptoms in the distributed system requires periodic checkpointing to obtain a consistent global state; the latest checkpointed global state is stored in the local memory of each process. In order to detect an error, a message validity test (MVT) is performed before the transmission of a message by any process. Such a test attempts to block, as early as possible, the propagation of contaminated messages in the system. The idea of the MVT is similar to the recovery block (RB) approach [6] for non-distributed software fault tolerance. However, in the RB approach, acceptance tests are logical expressions used for determining the acceptability of the execution results of a block of statements, whereas the main goal of the MVT is to check whether the message to be transmitted is contaminated or is a result of a synchronization error in the system. This can be achieved primarily by using the local MRES associated with the message to be transmitted to check whether or not an inconsistent cut is formed. An inconsistent cut implies basically that a message is received by one process, but was never transmitted by any other process in the system.

The MVT procedure given below checks whether an inconsistent cut is formed during the execution of the system. We assume that each process maintains in its memory a representation of each of the CFSMs existing in the distributed system. One pointer for each CFSM points to the state which corresponds to the local process knowledge of the state of other processes.

Fig. 4. Time sequence diagram for the three CFSMs.

3.3.1. MVT executed at $P_i$ before transmitting a message $m$ to $P_j$

Let $s_{ck}$ be the checkpointed state of process $P_k$, for $k = 1$ to $n$ in a system of $n$ processes, and let $s_k$ be the current state of process $P_k$ as known by $P_i$ from the $MRES_i$ that is to be transmitted with $m$.

For each process $P_k$ (including $P_k = P_i$)

    If there is a reception transition (of a message from $P_x$) along any sequence of events from the checkpointed state $s_{ck}$ to the current state $s_k$ of $P_k$

    then there should be a corresponding sending transition in $P_x$ along any sequence of events from the checkpointed state $s_{cx}$ to the current state $s_x$ in $P_x$; otherwise there is an inconsistency in process $P_k$ and the message $m$, ready for transmission, is contaminated. The history of $MRES_k$ stored at $P_k$ is then used to reason about the error.

    Endif

EndFor
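The core of this test, matching every known reception against a known sending, can be sketched as below. This is a simplification made under stated assumptions: instead of walking the stored CFSM representations, the function takes, for each process, the list of transition labels the sender believes that process has executed since the checkpoint. The function name, the label encoding and the example paths (a reduced rendering of the $m_5$ situation discussed in the text) are illustrative.

```python
from collections import Counter


def mvt(known_paths):
    """Return (True, None) if every known reception has a matching send.

    known_paths maps a process name to a list of (sign, message, peer) labels:
    "-" means the process sent message to peer, "+" means it received it from peer.
    On failure, returns (False, (sender, receiver, message)) for the first
    reception that lacks a corresponding sending transition.
    """
    sends, receives = Counter(), Counter()
    for proc, path in known_paths.items():
        for sign, msg, peer in path:
            if sign == "-":
                sends[(proc, peer, msg)] += 1
            else:
                receives[(peer, proc, msg)] += 1
    for key, count in receives.items():
        if count > sends[key]:
            return False, key  # inconsistent cut: received but never sent
    return True, None


# P2 believes it has received m5 from P1, but only knows of P1 sending m1.
paths = {
    "P1": [("-", "m1", "P2")],
    "P2": [("+", "m1", "P1"), ("+", "m5", "P1")],
    "P3": [],
}
ok, culprit = mvt(paths)
```

Counting occurrences rather than testing membership lets the same message name appear more than once along a path without masking a missing send.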

The MVT approach can be used to find the earliest state in any process at which the error can be detected. Furthermore, as the contaminated message is detected at the source itself, the distribution of such contaminated messages is prevented at the initial stage and the recovery process is required locally only, thus decreasing the complexity and the cost of the recovery process. For example, in the transient error scenario presented in Section 2, when $m_6$ is ready for transmission at $P_2$, the MVT procedure is executed.

At $P_2$, the recorded MRESs include the following EIVs:

$((1, s_1 s_2), (1, s_1 s_2), (0, -))$

$((1, s_1 s_2), (2, s_2 s_3), (0, -))$

$((1, s_1 s_2), (3, s_3 s_4), (2, s_5 s_3))$

and $m_6$ is now ready for transmission with the latest MRES, namely $((1, s_1 s_2), (4, s_6 s_1), (2, s_5 s_3))$.

The MVT should be able to recognize that $P_2$ is not aware of the reception of $m_4$ by $P_1$. In fact, for $P_2$ to transmit $m_6$, it must have received $m_5$ from $P_1$, and $P_1$ must be at state $s_4$. However, $P_2$ still remembers $P_1$ as being only at $s_2$. Therefore, the MVT should detect a synchronization problem at $P_2$, since an inconsistent cut is formed as shown in Fig. 5. In this figure, the transitions whose occurrence $P_2$ is surely aware of are marked with an asterisk.

As $P_2$ is the current transmitter, the check is made for any reception transitions in $P_1$, $P_2$ and $P_3$ preceding their known current states, namely $s_2$, $s_1$ and $s_3$ respectively, with no corresponding sending transitions. In $P_1$, there are no reception transitions before state $s_2$. However, in $P_2$, a message $m_5$ is supposed to have been received from $P_1$, but there is no corresponding sending of message $m_5$ from $P_1$ before state $s_2$. Therefore, the new message $m_6$ ready for transmission at $P_2$ is contaminated.

Fig. 5. Inconsistent cut due to a transient error. (MRESs at $P_2$; an asterisk denotes transitions which $P_2$ is aware of.)

The MVT procedure can thus be used for the early detection of an error. Also, the reason for the error can be deduced using the MRES collected at each process. In the above example, $P_2$ detects that the message $m_6$ to be transmitted by itself is contaminated because of a reception in its prior state without a corresponding sending transition in another process. Now consider the stored MRES of $P_2$, which is:

$((1, s_1 s_2), (1, s_1 s_2), (0, -))$

$((1, s_1 s_2), (2, s_2 s_3), (0, -))$

$((1, s_1 s_2), (3, s_3 s_4), (2, s_5 s_3))$ (1)

$((1, s_1 s_2), (4, s_6 s_1), (2, s_5 s_3))$ (2)

According to Lemma 1 (Section 5), if the state transition information in the $i$th pair of $EIV_i$ of a process $P_i$ for the current state is $s_w s_x$, then the state transition information for the immediately succeeding event in the same process $P_i$ should be $s_x s_y$. However, if we look at the state transition information of $P_2$ (2nd tuple in Eqs. (1) and (2)), the transition information for the event of index 3 is $s_3 s_4$, but the transition information for the immediately succeeding event of index 4 is $s_6 s_1$ (i.e. $s_4 \ne s_6$). Therefore, the error is localized at this point and can be judged to be the result of a spontaneous transition fault.
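The continuity check used in this diagnosis is easy to mechanize. The sketch below scans the EIVs stored at a process and reports the first pair of consecutive local events whose states do not chain; the function name and the 0-based process index are assumptions, while the history values are the four EIVs recorded at $P_2$ above.

```python
def find_spontaneous_transition(history, i):
    """Scan the EIVs stored at process i for a break in state continuity.

    Returns the two consecutive (event index, transition) pairs of process i
    where the destination state of one differs from the source state of the
    next, or None if the recorded transitions chain correctly.
    """
    prev = None
    for eiv in history:
        index, trans = eiv[i]
        if trans is None:
            continue  # no event recorded yet for this process
        if prev is not None and index == prev[0] + 1 and trans[0] != prev[1][1]:
            return prev, (index, trans)
        prev = (index, trans)
    return None


# The four EIVs recorded at P2 (index 1) in the example.
history_p2 = [
    [(1, ("s1", "s2")), (1, ("s1", "s2")), (0, None)],
    [(1, ("s1", "s2")), (2, ("s2", "s3")), (0, None)],
    [(1, ("s1", "s2")), (3, ("s3", "s4")), (2, ("s5", "s3"))],
    [(1, ("s1", "s2")), (4, ("s6", "s1")), (2, ("s5", "s3"))],
]
fault = find_spontaneous_transition(history_p2, 1)
```

Here the scan stops between events 3 and 4 of $P_2$, where the destination state $s_4$ is followed by a transition that starts from $s_6$.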

4. Overhead and optimization

The overhead incurred by this procedure is in the form of additional storage, processor time and communication bandwidth required for the storage, processing and transmission of the contextual information used in this method. The contextual information consists of the event index vector of size $n$, a sent matrix of size $n \times n$ and a delivery vector of size $n$ for a system of $n$ processes. The processing operation consists of checking for the validity of the new message to be transmitted (MVT procedure). This is to guarantee that the new message was produced from a consistent state. Once the validity is confirmed, the message is transmitted along with the latest MRES of the process transmitting the message. When the message reaches the destination process, after checking that the delivery condition is satisfied, it is delivered to the receiver process. The MRES has to be sent for every transmitted message and hence additional bandwidth is used in the communication channels, which could otherwise be utilized for sending non-control messages. However, these additional requirements are trivial when we consider the importance of fault tolerance in a distributed system in which the outcome of undetected propagation of erroneous messages may be critical, costly and undesirable.

We can optimize the bandwidth utilization by adopting incremental piggybacking instead of sending the entire event index vector and the sent matrix with every message. This technique is particularly convenient in systems with a large number of processes, although a process needs to store information regarding the EIV and sent matrix at the time of the last interaction with other processes.

If the $i_1$th, $i_2$th, ..., $i_t$th pairs of $P_i$'s EIV have changed to $v_1, v_2, \ldots, v_t$, respectively, since the last message to $P_j$, $P_i$ piggybacks a compressed EIV $\{(i_1, v_1), (i_2, v_2), \ldots, (i_t, v_t)\}$ in its next message to $P_j$. When $P_j$ receives this message, it updates its EIV as follows: for each received pair $(i_k, v_k)$, if the EI of $v_k$ is greater than the EI in the $i_k$th pair of $EIV_j$, then $v_k$ replaces the $i_k$th pair of $EIV_j$. The information of the sent matrix is also piggybacked in a similarly incremental fashion. If the entries $\{[r_1, c_1], [r_2, c_2], \ldots, [r_u, c_u]\}$ of the $sent_i$ matrix have changed to $x_1, x_2, \ldots, x_u$, respectively, since the last message to $P_j$, $P_i$ piggybacks a vector $\{(r_1, c_1, x_1), (r_2, c_2, x_2), \ldots, (r_u, c_u, x_u)\}$ in its next message to $P_j$. When $P_j$ receives this message, it updates its $sent_j$ matrix as follows: $sent_j[k, l] = \max(sent_j[k, l], x)$ for every received triple $(k, l, x)$, with $k, l = 1, 2, \ldots, n$. Fig. 6 shows the incremental piggybacking associated with messages $m_1$ and $m_2$.
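The sender-side compression and the receiver-side merge can be sketched as follows. This is a minimal Python sketch under stated assumptions: process and slot indices are 0-based, an EIV pair is encoded as `(EI, (source state, destination state))`, and the four function names are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, Tuple[str, str]]  # (event index EI, state transition); assumed encoding


def eiv_delta(last_sent: List[Pair], current: List[Pair]) -> Dict[int, Pair]:
    """Pairs of the sender's EIV that changed since the last message to a peer."""
    return {k: cur for k, (old, cur) in enumerate(zip(last_sent, current)) if old != cur}


def apply_eiv_delta(eiv: List[Pair], delta: Dict[int, Pair]) -> None:
    """Receiver side: adopt a piggybacked pair only if its EI is larger."""
    for k, pair in delta.items():
        if pair[0] > eiv[k][0]:
            eiv[k] = pair


def sent_delta(last_sent: List[List[int]],
               current: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Changed sent-matrix entries as (row, column, new value) triples."""
    n = len(current)
    return [(r, c, current[r][c])
            for r in range(n) for c in range(n)
            if current[r][c] != last_sent[r][c]]


def apply_sent_delta(sent: List[List[int]], delta: List[Tuple[int, int, int]]) -> None:
    """Receiver side: keep the entry-wise maximum."""
    for r, c, x in delta:
        sent[r][c] = max(sent[r][c], x)


# Hypothetical three-process system; the sender's view changed in slot 0 only.
old_eiv = [(1, ("s0", "s1")), (0, ("s0", "s0")), (0, ("s0", "s0"))]
new_eiv = [(2, ("s1", "s2")), (0, ("s0", "s0")), (0, ("s0", "s0"))]
delta = eiv_delta(old_eiv, new_eiv)

receiver_eiv = [(1, ("s0", "s1")), (3, ("s2", "s3")), (0, ("s0", "s0"))]
apply_eiv_delta(receiver_eiv, delta)
print(delta)          # only slot 0 is transmitted
print(receiver_eiv)   # slot 0 updated, slot 1 left untouched
```

Note that the sender must remember, per destination, the copy of its EIV and sent matrix that was current at the last transmission (`last_sent` above); this is the extra storage the incremental scheme trades for bandwidth.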

To minimize the amount of storage required for a given process $P_i$, we should find, for every message transmission, say to $P_j$, the number of events that have to be stored at $P_i$ before the receipt of an event acknowledging the proper reception of the transmitted message. This number will vary for each process. The storage at $P_i$ should be sufficient to record the events along the longest path between any two states in $P_i$. Such a path can be found using existing path algorithms, and the upper bound on its length is $(m_i - 1)$, where $m_i$ is the number of states in $P_i$ [4]. Therefore, in the worst case, the latest $(m_i - 1)$ $MRES_i$ records may have to be retained at each process $P_i$ at any point in time, and the older $MRES_i$ records can be deleted.
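One simple way to enforce this worst-case retention bound is a fixed-capacity queue. The sketch below uses Python's `collections.deque` with `maxlen`; the helper name `make_mres_log` and the dictionary used as a stand-in for an MRES record are assumptions made for illustration only.

```python
from collections import deque


def make_mres_log(num_states: int) -> deque:
    """Bounded log for one process: keeps at most (num_states - 1) records,
    silently discarding the oldest one when a new record is appended."""
    if num_states < 2:
        raise ValueError("need at least two states to have a transition")
    return deque(maxlen=num_states - 1)


log = make_mres_log(num_states=4)        # hypothetical process with m_i = 4
for event_index in range(1, 7):
    log.append({"ei": event_index})      # placeholder for a full MRES record
print([rec["ei"] for rec in log])        # [4, 5, 6]
```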

5. Proof of correctness

In order to prove the correctness of our algorithm, we make use of Lamport's 'happened before' relation, denoted by '$\rightarrow$' [7]. $a \rightarrow c$ indicates that 'event $a$ happens before event $c$' if one of the following holds:

1. $a$ and $c$ are events in the same process and $a$ comes before $c$; or

2. $a$ is the sending of a message in one process and $c$ is the receipt of the same message in another process; or

3. $a \rightarrow b$ and $b \rightarrow c$ for some event $b$.

Also, if $a \rightarrow b$ then $a$ is assumed 'smaller' (in the order of occurrence) than $b$ [2,3].
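The three rules defining the relation can be made concrete with a small Python sketch that computes it for a finite run. The function name `happened_before`, the use of string labels for events and the input encoding are all assumptions made for this illustration.

```python
from typing import Dict, Hashable, List, Set, Tuple

Event = Hashable


def happened_before(process_events: List[List[Event]],
                    messages: List[Tuple[Event, Event]]) -> Set[Tuple[Event, Event]]:
    """Return all pairs (a, c) such that a happened before c.

    process_events: one list per process, events in local order (rule 1).
    messages: (send event, receive event) pairs (rule 2).
    Rule 3 (transitivity) is applied by a depth-first search from each event.
    """
    succ: Dict[Event, Set[Event]] = {}
    for events in process_events:
        for e in events:
            succ.setdefault(e, set())
        for a, c in zip(events, events[1:]):
            succ[a].add(c)
    for send, recv in messages:
        succ.setdefault(send, set()).add(recv)
        succ.setdefault(recv, set())

    relation: Set[Tuple[Event, Event]] = set()
    for start in succ:
        stack, seen = list(succ[start]), set()
        while stack:
            e = stack.pop()
            if e not in seen:
                seen.add(e)
                stack.extend(succ[e])
        relation.update((start, e) for e in seen)
    return relation


# Hypothetical run: P1 performs a1 then a2 (a send); P2 performs b1 (the
# matching receive) then b2.
hb = happened_before([["a1", "a2"], ["b1", "b2"]], [("a2", "b1")])
print(("a1", "b2") in hb)  # True
print(("b1", "a1") in hb)  # False
```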

Lemma 1: Let the current value of the state transition information in the $i$th pair of the event index vector in $MRES_i$ be $(s_w s_x)$. If the occurrence of an event in $P_i$ causes the state transition information of the $i$th pair of the event index vector of $MRES_i$ to become $(s_z s_y)$, and if there is no error in the system, then $s_z$ should be the same as $s_x$.

Fig. 6. Incremental piggybacking associated with messages $m_1$ and $m_2$.


Proof: The communicating system under consideration is characterized by the distributed execution of a collection of $n$ processes. Within a single process, however, the execution is sequential, with the process switching from one state to the next on the occurrence of a single event. The state transition information $(s_w s_x)$ in the EIV of $MRES_i$ describes the most recent transition: $P_i$ has moved from state $s_w$ to state $s_x$ and is currently in state $s_x$. Therefore, when a new event occurs in $P_i$, the process should move from its current state $s_x$ to a new state. Hence, if $(s_w s_x)$ is the current state transition information and there are no errors, the immediately succeeding state transition information should be $(s_x s_y)$ for some value of $y$. It follows that $s_z = s_x$, where $s_z$ is the source state recorded for the new transition.

Lemma 2: If a process $P_i$ transmits a message $m$ to $P_j$ along with $MRES_i$, then

$$\sum_{j=1}^{n} sent_i[i, j] \le EI_i$$

where $EI_i$ is the event index of $P_i$ in the EIV of $MRES_i$.

Proof: The event index value of a process in the event index vector is incremented by one for every occurrence of a send or a receive event in that process, whereas the value of $sent_i[i, j]$, for any $j$, is incremented only when a sending event to $P_j$ occurs in process $P_i$. Therefore, the total number of sending events in process $P_i$ is always less than or equal to the total number of events occurring in $P_i$.
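The counting argument behind Lemma 2 can be checked mechanically. The following Python sketch is a minimal model under stated assumptions: the class `ProcessCounters` and its method names are hypothetical, process identifiers are 0-based, and only the sender's own row of the sent matrix is modelled.

```python
class ProcessCounters:
    """Counters kept by process i in a system of n processes (0-based ids)."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.event_index = 0                        # EI_i
        self.sent = [[0] * n for _ in range(n)]     # sent_i

    def on_send(self, dest: int) -> None:
        # A send event advances both the event index and sent_i[i][dest].
        self.event_index += 1
        self.sent[self.pid][dest] += 1

    def on_receive(self) -> None:
        # A receive event advances only the event index.
        self.event_index += 1

    def lemma2_holds(self) -> bool:
        return sum(self.sent[self.pid]) <= self.event_index


p = ProcessCounters(pid=0, n=3)
for step in ("send1", "recv", "send2", "send1", "recv"):
    if step == "recv":
        p.on_receive()
    else:
        p.on_send(dest=int(step[-1]))
    assert p.lemma2_holds()
print(sum(p.sent[0]), p.event_index)  # 3 5
```

A received MRES in which the row sum exceeds the sender's event index therefore indicates corrupted contextual information.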

Proposition 1: Consider a message $M_1$ and the set of messages $M_n$ sent from a process such that $Send(M_1) \rightarrow Send(M_n)$. If message $M_1$ is contaminated, then all the messages in the set $M_n$ are also contaminated.

Proof: $M_1$ is contaminated if there is some other message $M_k$ such that $Receive(M_k) \rightarrow Send(M_1)$ and the event $Receive(M_k)$ has not occurred. By the transitivity of the '$\rightarrow$' relation, and since $Send(M_1) \rightarrow Send(M_n)$, it follows that $Receive(M_k) \rightarrow Send(M_n)$. As the event $Receive(M_k)$ has not occurred, the messages in the set $M_n$ are also contaminated.

Proposition 2: Consider a message $M_1$ and the set of messages $M_n$ sent from $P_j$ to $P_i$ such that $Send(M_n) \rightarrow Send(M_1)$. If $M_1$ is delivered, then the messages in the set $M_n$ are not contaminated and were also delivered.

Proof: A message is delivered to $P_i$ from $P_j$ once it is confirmed that the message is not contaminated and the conditions

$$deliv_i[j] + 1 = sent_j[j, i]$$

and

$$\forall k \neq j,\ deliv_i[k] \ge sent_j[k, i]$$

are satisfied. A message will be found contaminated if there is some fault in the preceding events. Since $M_1$ is delivered, it is not contaminated; therefore, there are no faults in the preceding events either, and the messages in the set $M_n$ are also not contaminated. Moreover, the fact that message $M_1$ is delivered implies that its delivery condition was satisfied, i.e. the messages that 'happened before' $M_1$ were delivered to $P_i$ before $M_1$. Hence the messages in the set $M_n$ were also delivered.
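The delivery test used in this proof can be sketched as a small predicate. The sketch rests on explicit assumptions: the second condition is read as $deliv_i[k] \ge sent_j[k, i]$ for every $k \neq j$ (the form whose negation, $deliv_i[k] < sent_j[k, i]$, is used in the liveness argument), the piggybacked $sent_j$ already counts the message being tested, process identifiers are 0-based, and `can_deliver` is a hypothetical name.

```python
from typing import List


def can_deliver(deliv_i: List[int], sent_j: List[List[int]], i: int, j: int) -> bool:
    """Delivery test at receiver i for a message from sender j (0-based ids).

    deliv_i[k]: number of messages from process k already delivered at i.
    sent_j[k][i]: number of messages from k to i known to the sender, taken
    from the piggybacked matrix (assumed to already count this message).
    """
    if deliv_i[j] + 1 != sent_j[j][i]:
        return False            # this is not the next message expected from j
    return all(deliv_i[k] >= sent_j[k][i]
               for k in range(len(deliv_i)) if k != j)


# Hypothetical 3-process example: receiver i = 0, sender j = 1.
sent_j = [[0, 0, 0],
          [1, 0, 0],   # P1 has sent one message to P0 (this one)
          [1, 0, 0]]   # P1 knows that P2 sent one message to P0
print(can_deliver([0, 0, 0], sent_j, i=0, j=1))  # False
print(can_deliver([0, 0, 1], sent_j, i=0, j=1))  # True
```

In the first call the message from $P_2$ that causally precedes the tested message has not yet been delivered at $P_0$, so delivery is postponed; once it has been delivered, the second call succeeds.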

Let us now prove the correctness of the procedure by proving that the safety and liveness properties are satisfied.

Safety corresponds to proving that something bad never happens. In our system, it is equivalent to proving that a contaminated message is never delivered to the system.

Proof: We prove this by contradiction. Assume that a contaminated message $M_x$ from a process $P_j$ has been delivered to another process $P_i$ in the system. By contaminated, we mean that there is some message $M_y$ such that $Receive(M_y) \rightarrow Send(M_x)$, and the message $M_x$ was sent by the process $P_j$ before the receipt of $M_y$. Let $S_i$, $S_j$ and $S_k$ be the current states of processes $P_i$, $P_j$ and $P_k$ as claimed by $P_j$ (obtained from $MRES_j$ transmitted with $M_x$). As per the procedure, the current state information of the sender is updated immediately at the receiver upon the reception of that message. Moreover, before transmitting $M_x$, $P_j$ must have performed the message validity test to confirm that, if there is any reception event in any process between the checkpointed state and the corresponding current state of that process, then the corresponding send event has also occurred between the checkpointed state and the current state of some other process. Otherwise, the message $M_x$ would have been treated as contaminated and would not have been transmitted, and thus could not have been received. This contradicts our assumption that the contaminated message is received and delivered to the system.

The liveness property ensures that something good will eventually happen. In order to be sure that the system will make progress, it is necessary to prove that all non-contaminated messages will eventually be delivered.

Proof: A received message is not delivered to a system if it is contaminated or if the delivery condition is not satisfied. Let us assume that $M_y$, a message sent from $P_j$ to $P_i$, is the smallest of a set of messages that are not contaminated and still not delivered. This means that the delivery condition is not satisfied, i.e. $\exists k \mid deliv_i[k] < sent_j[k, i]$. Let $sent_j[k, i] = x$ and consider $M_x$, the $x$th message sent from $P_k$ to $P_i$. From $deliv_i[k] < x$, it follows that fewer than $x$ messages have so far been delivered from $P_k$ to $P_i$, so $M_x$ is not yet delivered. Since $sent_j[k, i] = x$, it follows that $M_x$ must have been sent before $M_y$, and

$$Send(M_x) \rightarrow Send(M_y) \tag{3}$$

otherwise the sending of $M_x$ could not have been known at the instant of sending $M_y$. Now there are two cases to consider. Since $M_x$ is not delivered, either it is contaminated or its delivery condition was not met.


1. If $M_x$ is contaminated, then it follows from Eq. (3) and Proposition 1 that $M_y$ is also contaminated, which contradicts our assumption that $M_y$ is not contaminated.

2. If $M_x$ was not delivered because its delivery condition was not true, then this contradicts our assumption that $M_y$ was the smallest of the non-contaminated messages that were still not delivered: since $M_x$ is not delivered and $Send(M_x) \rightarrow Send(M_y)$, $M_x$ is smaller than $M_y$, and hence $M_x$ would be the smallest of the non-contaminated messages still not delivered.

Therefore, all non-contaminated messages will be delivered as soon as their delivery conditions are met.

6. Conclusions

In this paper, we introduced a generalized distributed extension that combines the advantages of two previously proposed extensions: the maximally reachable event index tuple (MREIT) [4] and the causal communication method [2,3]. Our proposed extension relies on the exchange of contextual information piggybacked on every transmitted message, and on the execution of a message validity test before transmitting a message to preempt any erroneous message propagation in the distributed system. In the future, we will study the application of the generalized extension to the debugging and tracing of distributed software executions.

Acknowledgements

The authors thank the anonymous referees for their feedback that helped improve the quality of the paper. We also acknowledge the support of this work by a Kuwait University Research Administration grant no. EE059, and the fruitful discussions we had with I. Manonmani on the first draft of this paper.

References

[1] E. Dijkstra, Self-stabilizing systems in spite of distributed control, Commun. ACM 17 (1974) 643-644.

[2] M. Raynal, La communication causale dans les systemes repartis: Protocoles fondes sur le comptage, Networking Distributed Comput. 1 (1) (1991) 87-99.

[3] M. Raynal, A. Schiper, S. Toueg, The causal ordering abstraction and a simple way to implement it, Inf. Process. Lett. 39 (1991) 343-350.

[4] K. Saleh, I. Ahmed, K. Al-Saqabi, A. Agarwal, A recovery approach to the design of stabilizing communication protocols, Computer Communications 18 (4) (1995) 276-287.

[5] K. Saleh et al., Dynamic checkpointing procedure for stabilizing communications protocols, Inf. Software Technol. 18 (8) (1993) 479-485.

[6] B. Randell, System structure for software fault tolerance, IEEE Trans. Software Engng. 1 (1975) 226-232.

[7] L. Lamport, Time, clocks and the ordering of events in a distributed system, Commun. ACM 21 (7) (1978) 558-565.