

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 7, JULY 1982

Concurrent Error Detection in ALU's by Recomputing with Shifted Operands

JANAK H. PATEL, MEMBER, IEEE, AND LEONA Y. FUNG, STUDENT MEMBER, IEEE

Abstract: A new method of concurrent error detection in Arithmetic and Logic Units (ALU's) is proposed. This method, called "Recomputing with Shifted Operands" (RESO), can detect errors in both arithmetic and logic operations. RESO uses the principle of time redundancy to detect errors and achieves its error detection capability through the use of the already existing replicated hardware in the form of identical bit-slices. It is shown that for most practical ALU implementations, including carry-lookahead adders, the RESO technique will detect all errors caused by faults in a bit-slice or a specific subcircuit of the bit-slice. The fault model used is more general than the commonly assumed stuck-at fault model. Our fault model assumes that the faults are confined to a small area of the circuit and that the precise nature of the faults is not known. This model is very appropriate for VLSI circuits.

Index Terms: ALU, bit-sliced ALU, concurrent error detection, fault detection, time redundancy, VLSI circuits, VLSI faults.

I. INTRODUCTION

It has been known for some time that no low-cost and efficient technique has been available that can check both arithmetic and logic operations. The AN code, residue code, and inverse residue code [1]-[5] are error-detecting codes that were developed earlier for checking arithmetic operations. However, these methods are unable to detect some single errors in group carry-lookahead structures. Furthermore, they cannot be used for checking logical operations.

Utilizing a fully duplicated logic unit has been recognized as the most effective method for checking logical operations. In fact, most machines that have been built with an error detection scheme used duplication to check logical operations. For example, the fault-tolerant STAR computer used inverse residue codes to check the arithmetic unit but duplication for the logic unit [2]. Several other machines, such as EDVAC and IBM System/3, have duplicated the entire ALU. Nevertheless, besides the full duplication scheme, a few designs, such as the residue-checked ALU [6] and a partially self-checking ALU [4], [7], were introduced for checking the entire ALU. However, both schemes require a large increase in the complexity of the circuitry, and therefore in the area on the chip.

Manuscript received October 6, 1981; revised January 12, 1982. This work was supported by the U.S. Navy under VHSIC Contract N00039-80-C-0556.

J. H. Patel is with the Coordinated Science Laboratory, and the Department of Electrical Engineering, University of Illinois, Urbana, IL 61801.

L. Y. Fung was with the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801. She is now with STC Computer Research Corporation, Santa Clara, CA 95051.

Furthermore, they are based on the traditional stuck-at fault model, which is not appropriate for VLSI technology. A more appropriate assumption is that a failure in a VLSI circuit affects a small area of the chip and that the nature of the failure is not precisely known. For example, an imperfection in the semiconductor material may be confined to a small area of the chip and thus affect several neighboring gates. Similarly, the incidence of a high-energy particle may also affect a small area of the chip. The underlying failure mechanisms are not well understood as yet. Therefore, it is unwise to assume that under failure the output of a specific gate is stuck at some logical value. It is true, however, that at some higher functional level the effect of failures will be felt as changes in logical values. This functional level, for example, can be the level of a bit-slice of an ALU. This is the level we choose for our functional fault model in this paper. Throughout this paper, a failure in a circuit means some physical malfunction, and an error means an incorrect value of the function under consideration.

In this paper we present an error detection scheme based on time redundancy. In the next section, some systematic approaches to devising error detection by time redundancy are described. Later we describe the proposed error detection method, called Recomputing with Shifted Operands (RESO). The error detection capabilities of RESO are described in Sections IV and V. It is shown that in a typical ALU, RESO detects all functional errors resulting from failures confined to a certain area of the chip, for example, a bit-slice. In Section VI some extensions of RESO, as related to error correction in logic operations and error detection in multiply-divide circuits, are discussed.

II. ERROR DETECTION USING TIME REDUNDANCY

In this section we introduce a systematic way of exploiting time redundancy for error detection. Let x be the input to a computation unit f and let f(x) be the desired output. A space redundancy technique, as in Fig. 1, with two identical function units f, will detect any error in one of the two computation units. Now consider computing f(x) twice in time, on the same hardware box f, as illustrated in Fig. 2. The result of the first computation step is stored in a register and then compared with the result of the second computation step. An intermittent error occurring during either of the computation steps, but not both, will be detected; however, no permanent error can be detected. Thus, the error detection capability of Fig. 2 is much worse than that of Fig. 1. Consider modifying the second step of Fig. 2. Suppose that we have two functions c and d such that d(f(c(x))) = f(x) for all x. Now we compute f(x) in the first step, and then during the second step we compute d(f(c(x))) as shown in Fig. 3. The functions c and d may be called the coding and decoding functions, respectively. If c and d are properly chosen, then a failure in the unit f will affect f(x) and f(c(x)) differently. Therefore, the outputs of the first step and the second step will not match, producing an error signal. Quite often the functions c and d are inverses of each other, that is, d(c(x)) = x for all x. In this case we write $c^{-1}(f(c(x))) = f(x)$. Fig. 4(a) shows this organization. A somewhat equivalent organization, Fig. 4(b), results from the fact that $c^{-1}(f(c(x))) = f(x)$ implies $f(c(x)) = c(f(x))$ for all x. Figs. 3 and 4 are the basis of most error detection methods using time redundancy.

Fig. 1. Error detection with space redundancy.

Fig. 2. Intermittent error detection with time redundancy.

Fig. 3. Error detection with time redundancy.

0018-9340/82/0700-0589$00.75 © 1982 IEEE

As an example of Fig. 4, consider Boolean functions f(x) which are self-duals. Let $x = (x_1, x_2, \ldots, x_n)$ be an n-bit vector; f(x) is a self-dual if for all x, $f(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n) = \overline{f(x_1, x_2, \ldots, x_n)}$. Let c be a function which complements each bit of a vector, e.g., $c(x_1, \ldots, x_n) = (\bar{x}_1, \ldots, \bar{x}_n)$. It is clear that $c^{-1} = c$ and $c(f(c(x))) = f(x)$; alternatively, $c(f(x)) = f(c(x))$. This property of self-duals was the basis of one of the first error detection methods devised using the time redundancy technique. It is called alternating logic design and was introduced by Reynolds and Metze [8]. Time redundancy has also been used for error correction [9]. Here, we address ourselves specifically to error detection in ALU's with organizations like Fig. 4.
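As a minimal sketch of the Fig. 4 organization for a self-dual function, the following Python fragment uses the 3-input majority function, which is self-dual, with bitwise complement serving as both c and $c^{-1}$. The helper names (`majority`, `complement`, `detects_error`, `stuck_at_zero`) and the choice of a stuck-at-0 output fault are illustrative assumptions, not part of the original method description.

```python
from itertools import product

def majority(bits):
    """3-input majority, a self-dual function: maj(~x) == ~maj(x)."""
    return int(sum(bits) >= 2)

def complement(bits):
    """Coding function c: complement every bit of the input vector."""
    return tuple(1 - b for b in bits)

def detects_error(unit, bits):
    """Step 1 computes f(x); step 2 computes c(f(c(x))). A mismatch signals an error."""
    first = unit(bits)
    second = 1 - unit(complement(bits))
    return first != second

def stuck_at_zero(bits):
    """Hypothetical faulty unit whose output is stuck at 0 (illustration only)."""
    return 0

inputs = list(product((0, 1), repeat=3))
fault_free_ok = all(not detects_error(majority, x) for x in inputs)
fault_flagged = all(detects_error(stuck_at_zero, x) for x in inputs)
```

With the fault-free unit the two steps always agree; with the stuck output the complemented second step always disagrees with the first, so the fault is flagged on every input.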

There are several problems to consider in the design of an error detection method using time redundancy. The first problem is to find a function c for the given function f such that $c^{-1}(f(c(x))) = f(x)$. The second problem is that the mere existence of the function c does not provide the desired error detection capability. Error detection capability varies widely with different c's; in addition, the functional properties of f and c are not sufficient by themselves to determine the error detection capability. Different circuit implementations of the same function f will have different capabilities. Even the organizations of Fig. 4(a) and (b) may have slightly different error coverage. One final problem is the complexity of the hardware implementation of the coding and decoding functions c and $c^{-1}$. If the hardware required to implement c is comparable to that of the function unit f, then the space redundancy approach of Fig. 1 is clearly more cost-effective. In short, for a cost-effective design, the function c must provide very good error coverage and be far less complex than the function f. We present here a cost-effective method of error detection called Recomputing with Shifted Operands (RESO). An overview of this method follows.

Fig. 4. Two approaches to error detection using time redundancy.

III. DESCRIPTION OF RESO

The RESO method is based on Fig. 4. Let the function unit f be an ALU, the coding function c be a left shift operation, and the decoding function $c^{-1}$ be a right shift operation. Thus, c(x) = left shift of x and $c^{-1}(x)$ = right shift of x. With a more precise description of c (e.g., logical or arithmetic shift, by how many bits, what gets shifted in, and so forth), it can be shown that for most typical operations of an ALU, $c^{-1}(f(c(x))) = f(x)$. The details about shifts are discussed later in this section, but for now we shall only give a less precise but intuitively clear picture of the RESO method. A schematic of the implementation appears in Fig. 5. Depending on whether we follow the principle of Fig. 4(a) or (b), the operation of the error detection scheme in the ALU of Fig. 5 can be described as below.

Following the principle of Fig. 4(a), during the initial computation step the operands are passed through the shifter unshifted and then input to the ALU. The result is then stored in a register unshifted. During the recomputation step, the operands are shifted left and then input to the ALU. The result is right shifted and then compared with the contents of the register. A mismatch indicates an error in computation.

Fig. 5. Concurrent error detection in an ALU using RESO.

If we follow the principle of Fig. 4(b), then the operation changes slightly. Here we input the unshifted operands during the first step as before, but we left shift the result and then store it in the register. In the second step, we input the left-shifted operands to the ALU and then compare the output directly with the register. An inequality signals an error.

When an n-bit operand is shifted left by k bits, its leftmost k bits move out. To preserve these bits during the recomputation step, an (n + k)-bit ALU is needed. For certain logical operations, the operands can be shifted left circularly, in which case only an n-bit ALU is required. Use of rotations rather than shifts is discussed later in Section VI. For now, we shall just assume an (n + k)-bit ALU for error detection in n-bit operations. The leftmost k bits of the (n + k)-bit operands for the first computation step and the rightmost k bits for the recomputation step are determined depending on the operations and the number system. For all bit-wise logical operations, it does not matter what they are. For arithmetic operations, the leftmost k bits of the unshifted operands have to be zeros for unsigned binary integers and extensions of the sign bit for the one's and two's complement number representations. Consider the left shift by k bits as a multiplication, that is, the original operands are multiplied by $2^k$ for the recomputation step. In order to be consistent, the carry-in also has to be multiplied by $2^k$. This can be done by shifting the carry-in into the right of one of the operands k times. Now, $(2^k - 1) \times$ (carry-in) has been added to one of the operands and one more carry-in will be added to the sum during the recomputation step. For the other operand, k zeros should be shifted into its right. Under error-free conditions, the rightmost k bits of the result from the recomputation step should be all zeros. So these k bits are essential to indicate errors resulting from faults in the rightmost k bit-slices. To preserve these bits for equality checking, we follow the principle of Fig. 4(b). Thus, k zeros are shifted into the right of the result from the first computation, and this shifted result is compared with the unshifted result from the recomputation step.
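The carry-in handling described above can be checked numerically. The sketch below assumes unsigned operands and uses Python's unbounded integers to stand in for the (n + k)-bit ALU together with its carry-out; the function name `reso_add_pair` and its parameters are hypothetical, chosen only for this illustration.

```python
def reso_add_pair(x, y, carry_in, n, k):
    """Return (shifted first result, recomputed result) for an n-bit unsigned add.

    Follows the Fig. 4(b) arrangement: the first result is shifted left by k
    and compared with the unshifted result of the recomputation step.
    """
    assert 0 <= x < 2**n and 0 <= y < 2**n and carry_in in (0, 1)
    first = x + y + carry_in                        # first computation step
    x_shifted = (x << k) | ((2**k - 1) * carry_in)  # carry-in shifted into the low k bits
    y_shifted = y << k                              # zeros shifted into the low k bits
    second = x_shifted + y_shifted + carry_in       # recomputation step
    return first << k, second

# Fault-free behaviour: the two values agree and the low k bits are zero.
n, k = 4, 2
agree = all(
    reso_add_pair(x, y, c, n, k)[0] == reso_add_pair(x, y, c, n, k)[1]
    for x in range(2**n) for y in range(2**n) for c in (0, 1)
)
```

The agreement follows from $(2^k x + (2^k - 1)c) + 2^k y + c = 2^k (x + y + c)$, which is exactly the multiplication-by-$2^k$ argument in the text.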

Let RESO-k be the name of the error detection scheme achieved by recomputing with k-bit shifted operands. The determination of k depends on three factors: the implementation of f, the space redundancy allowed, and the set of errors defined. The hardware that RESO-k needs to detect errors in an ALU for n-bit operations includes two shifters, a register, an (n + k)-bit ALU, and an equality checker. A totally self-checking equality checker can be implemented based on 1-out-of-2 code checkers [10]. The errors in the shifter can be detected using any suitable parity codes. For the smallest circuitry, k has to be one. If k is one, then the faults covered are confined to a bit-slice or a subnetwork of a bit-slice, depending on the implementation of f. The next section gives the details of RESO-1. It will be shown later that RESO-2 guarantees the detection of all functional errors resulting from failures confined to a bit-slice, independent of the implementation. The error detection capabilities of RESO-k will be presented in a later section.

At first glance, it appears that the above method of error detection cuts the execution rate in half. However, if one views the ALU in the global context involving the entire computer, then the time penalty is only a small portion of the entire instruction cycle. If the ALU is pipelined, then the recomputation step can be started immediately after one segment time unit or one clock cycle of the pipeline. Therefore, the computation and recomputation are overlapped in the ALU, and the resulting penalty in performance is exactly one segment delay per ALU operation.

IV. ERROR DETECTION CAPABILITY OF RESO-1

In this section we describe the error detection capability of RESO-1, where RESO-1 refers to recomputing with operands shifted by one bit. First, we consider logical operations and then arithmetic operations. All theorem statements and proofs for arithmetic operations refer only to add operations and can be trivially extended to include subtract, increment, decrement, and negate operations.

Theorem 1: RESO-1 detects all errors in an ALU for all bitwise operations AND, OR, NOT, and their derivations (e.g., XOR, NOR, etc.) when the failure is confined to a single bit-slice.

Proof: Let bit-slice i be faulty. If the fault produces an error, then bit i of the result during the first computation step is incorrect. During the second computation step, bit i of the result is computed by the nonfaulty slice i + 1, so bit i of the recomputation step is correct, while the faulty slice i computes bit i - 1. Thus, if the failure produces an error affecting bit i of the first result or bit i - 1 of the second result, or both, then the two results will not match. The error, therefore, is detected.
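The argument of Theorem 1 can be exercised with a small fault-injection sketch. It assumes an (n + 1)-slice bitwise ALU organized as in Fig. 4(a), and it models a failed slice as an arbitrary truth table over its two input bits; the names `sliced_alu`, `reso1`, and `all_errors_detected` are hypothetical and the model is only one possible reading of the fault assumption.

```python
from itertools import product

def sliced_alu(x, y, width, op, faulty_slice=None, fault_table=None):
    """Bitwise ALU built from `width` identical slices; one slice may be faulty.

    fault_table maps (a_bit, b_bit) to the output bit of the faulty slice.
    """
    result = 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        bit = fault_table[(a, b)] if i == faulty_slice else op(a, b)
        result |= bit << i
    return result

def reso1(x, y, n, op, faulty_slice, fault_table):
    """RESO-1 in the style of Fig. 4(a) on an (n+1)-slice ALU; returns (first, second)."""
    mask = (1 << n) - 1
    first = sliced_alu(x, y, n + 1, op, faulty_slice, fault_table) & mask
    second = sliced_alu(x << 1, y << 1, n + 1, op, faulty_slice, fault_table) >> 1
    return first, second

def all_errors_detected(n, op):
    """Try every slice, every faulty truth table, and every operand pair."""
    pairs = list(product((0, 1), repeat=2))
    for slice_i in range(n + 1):
        for outputs in product((0, 1), repeat=4):
            table = dict(zip(pairs, outputs))
            for x, y in product(range(1 << n), repeat=2):
                correct = sliced_alu(x, y, n, op)
                first, second = reso1(x, y, n, op, slice_i, table)
                if (first != correct or second != correct) and first == second:
                    return False
    return True

and_ok = all_errors_detected(3, lambda a, b: a & b)
```

For this small width, every case in which either step is wrong produces a mismatch, as the proof predicts: the faulty slice corrupts bit i in the first step and bit i - 1 in the second.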

Theorem 2: In a bit-sliced ALU whose sum and carry functions of the adders are represented by two disjoint networks as in Fig. 6, any errors will be detected by RESO-1 if the failure is confined to either the sum network or the carry network, but not both.

Proof:

Case 1: Only the sum circuit is affected. Suppose the failure is confined to the sum circuit of bit-slice i. Then bit i of the result may be wrong, so the result is either correct or off by $\pm 2^i$ during the first computation step. During the recomputation step, bit-slice i operates on bit (i - 1) of the original operands. Thus, the failure will cause the result to be either correct or off by $\pm 2^{i-1}$. The equality checker will find the two results identical if and only if both results are correct, since no nonzero error of the first computation equals a nonzero error of the recomputation. Therefore, all errors are detected.

Case 2: Only the carry circuit is affected. Suppose that the carry circuit of bit-slice i is faulty. Then during the first computation step, the result is either correct or off by $\pm 2^{i+1}$. During the recomputation step, the result is either correct or off by $\pm 2^i$. Thus, the errors in the two results cannot be equal, and therefore any errors will be detected.

Fig. 6. Full adder with disjoint sum and carry networks.

Fig. 7. Full adder with sum and carry networks sharing a subnetwork.

In certain implementations, the carry and sum circuits are not disjoint and share a common subnetwork. If this sharing is done in a particular manner, then it is still possible to detect errors. The specific nature of this sharing is described below.

A function $f(x_1, \ldots, x_n)$ is said to be strongly dependent on the variable $x_i$ if, for every pair of input vectors which differ only in $x_i$, the values of f for these vectors differ.

Theorem 3: In an adder whose sum and carry functions are represented by two networks which share a common subnetwork, and in which the sum function is strongly dependent on the function that the shared subnetwork represents, any errors caused by failures confined to only one of the three subnetworks will be detected by RESO-1.

Proof: Let the adder implementation be as in Fig. 7.

Case 1: Only the circuit f is affected. This is the same as Case 1 of Theorem 2.

Case 2: Only the circuit g is affected. This is the same as Case 2 of Theorem 2.

Case 3: Only the circuit k is affected. Let the circuit k of bit-slice i be faulty. During the first computation step, let only the output of $k_i$ be wrong; then the sum bit i must also be wrong, since the sum function strongly depends on the function k, and the carry $c_{i+1}$ can be correct or incorrect. Hence, the result of the first computation step will be off the expected correct result by $\pm 2^i$ if only the sum bit is incorrect, or by $\pm 2^i \pm 2^{i+1}$ if the carry is also incorrect. If the failure in unit k does not produce an error, then the result is correct. Therefore, due to a failure in unit k, the result can be off by one of $\{0, \pm 2^i, \pm 3 \times 2^i\}$. During the recomputation step, again since the sum function strongly depends on the function k, if the output of $k_i$ is wrong, then the sum bit i - 1 must also be wrong and the carry bit i can be wrong or correct. Of course, if the output of circuit k is correct, then no error has occurred during the second step. Hence, the result of the recomputation step will be off by one of the values $\{0, \pm 2^{i-1}, \pm 3 \times 2^{i-1}\}$. Since the erroneous results from the two computation steps will never have the same value, the errors can always be detected.

Fig. 8. A typical full adder circuit satisfying the structure of Fig. 7.

Theorem 4: Any Boolean algebraic function $f(x_1, x_2, \ldots, x_n)$ which is strongly dependent on $x_1$ must be of the form $x_1 \oplus g(x_2, \ldots, x_n)$ for some function g of $x_2, \ldots, x_n$.

Proof: By definition, $f(x_1, x_2, x_3, \ldots)$ is strongly dependent on $x_1$ if for every pair of input vectors which differ only in $x_1$, the values of f for these vectors differ. That is,

$$f(1, x_2, x_3, \ldots) = \overline{f(0, x_2, x_3, \ldots)}. \qquad (1)$$

By Shannon's expansion theorem,

$$f(x_1, x_2, x_3, \ldots) = x_1 \cdot f(1, x_2, x_3, \ldots) + \bar{x}_1 \cdot f(0, x_2, x_3, \ldots).$$

Substituting (1) into it,

$$f(x_1, x_2, x_3, \ldots) = x_1 \cdot \overline{f(0, x_2, x_3, \ldots)} + \bar{x}_1 \cdot f(0, x_2, x_3, \ldots).$$

Let $g(x_2, x_3, \ldots) = f(0, x_2, x_3, \ldots)$; then

$$f(x_1, x_2, x_3, \ldots) = x_1 \cdot \overline{g(x_2, x_3, \ldots)} + \bar{x}_1 \cdot g(x_2, x_3, \ldots),$$

that is,

$$f(x_1, x_2, x_3, \ldots) = x_1 \oplus g(x_2, x_3, \ldots).$$
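Theorem 4 can also be confirmed by brute force for a small number of variables. The sketch below enumerates every Boolean function of n variables as a truth table and checks that strong dependence on $x_1$ coincides with the form $x_1 \oplus g$, taking $g(x_2, \ldots, x_n) = f(0, x_2, \ldots, x_n)$ as in the proof. The function names are hypothetical; the check covers both directions, although the theorem states only one.

```python
from itertools import product

def is_strongly_dependent_on_x1(table, n):
    """True if flipping x1 always flips f; `table` maps n-bit tuples to 0/1."""
    return all(table[(0,) + rest] != table[(1,) + rest]
               for rest in product((0, 1), repeat=n - 1))

def has_xor_form(table, n):
    """True if f(x1, rest) == x1 XOR g(rest), with g(rest) = f(0, rest)."""
    return all(table[(x1,) + rest] == x1 ^ table[(0,) + rest]
               for x1 in (0, 1)
               for rest in product((0, 1), repeat=n - 1))

def check_theorem4(n):
    """Compare the two properties over all 2**(2**n) functions of n variables."""
    inputs = list(product((0, 1), repeat=n))
    for outputs in product((0, 1), repeat=len(inputs)):
        table = dict(zip(inputs, outputs))
        if is_strongly_dependent_on_x1(table, n) != has_xor_form(table, n):
            return False
    return True

theorem4_holds_for_3 = check_theorem4(3)
```

For n = 3 this visits all 256 functions; larger n grows doubly exponentially, so the enumeration is only practical as a sanity check.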

We are now ready to look at some specific implementations of adders whose errors can be detected by RESO-1. The most commonly used implementation of a full adder is shown in Fig. 8. This is a typical network that satisfies the structure of Fig. 7. By Theorem 3, RESO-1 can detect all errors resulting from any faults confined to one of the subnetworks f, g, or k. Any faults on the lines which are not in the dotted boxes are also included in the subnetwork from which the line originates.


TABLE I
EFFECT OF FAULTY VALUES OF $G_i$ AND $P_i$ ON THE SUM

Correct values    Faulty values    Change in the sum
$G_i$  $P_i$      $G_i$  $P_i$
0      0          0      1         0, $+2^i$
0      0          1      d         $+2^{i+1}$
0      1          0      0         0, $-2^i$
0      1          1      d         0, $+2^{i+1}$
1      d          0      0         $-2^{i+1}$
1      d          0      1         0, $-2^{i+1}$

(d denotes a don't-care value.)

Fig. 9. Typical implementation of a carry-lookahead adder.

V. ERROR DETECTION CAPABILITIES OF RESO-k

In the previous sections we presented the error detection capabilities of RESO-1, that is, recomputing with operands shifted by 1 bit. We assumed in that case that failures were confined to a well-defined cluster of logic elements. These clusters include a small number of elements. With the ever-increasing density of components in the fast-changing VLSI technology, one may consider the assumptions of the previous section somewhat restrictive. For this reason, in this section we present the generalized RESO error detection method for a less restrictive fault model.

The fault model we assume here is also a functional fault model. It is the same as that of the previous section except that we allow more area of the chip or more components to be included in the affected cluster. For example, one can assume that failures affect a complete bit-slice or several adjacent bit-slices. To understand why RESO-1 may not be adequate for error detection when a large area of the chip is affected, consider the following example, which also leads us to the next theorem.

Consider a bit-sliced ripple-carry adder. Suppose that bit-slice i is faulty. Then the sum and the carry bits can have any logical values at any time. The functional nature of the error is to change the arithmetic value of the final result. During the first computation step, bit-slice i computes a sum bit with weight $2^i$ and a carry-out bit with weight $2^{i+1}$. Thus, it is possible that the result of the first computation step is off by $\pm 2^i$ or $\pm 2^{i+1}$ or $\pm 2^i \pm 2^{i+1}$ or 0. In other words, the result is off by one of $\{0, \pm 2^i, \pm 2^{i+1}, \pm 3 \times 2^i\}$. During the second step, the operands are shifted left by one bit, and therefore bit-slice i computes the sum bit with weight $2^{i-1}$ and the carry-out bit with weight $2^i$. Again, not knowing the exact nature of the fault, the sum and carry-out bits can take any logical values independent of the input. Reasoning as before, the result is off by one of $\{0, \pm 2^{i-1}, \pm 2^i, \pm 3 \times 2^{i-1}\}$. From this, it is clear that the results of the two steps can be identical not only when there is no error, but also when the errors happen to be $+2^i$ or $-2^i$ in both steps. This suggests that the second computation step should be changed so that no two errors are the same. This leads us to the following theorem.
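The overlap between the two error sets in this example can be computed directly. The sketch below assumes, as in the text, that the sum bit and the carry-out bit of the faulty slice may each be wrong in either direction or correct; the helper names `error_set` and `undetectable_offsets` are hypothetical, and the slice index is chosen with i at least k so that all weights are integers.

```python
def error_set(sum_exp, carry_exp):
    """Possible result offsets when the sum bit (weight 2**sum_exp) and the
    carry-out bit (weight 2**carry_exp) may each be off by -1, 0, or +1 weight."""
    s, c = 2 ** sum_exp, 2 ** carry_exp
    return {ds * s + dc * c for ds in (-1, 0, 1) for dc in (-1, 0, 1)}

def undetectable_offsets(i, k):
    """Nonzero offsets that can occur in both steps when slice i fails and the
    recomputation uses operands shifted left by k bits (requires i >= k)."""
    first = error_set(i, i + 1)
    second = error_set(i - k, i - k + 1)
    return sorted((first & second) - {0})

shared_k1 = undetectable_offsets(3, 1)   # shift by one bit, faulty slice i = 3
```

For i = 3 and a one-bit shift, the shared nonzero offsets are -8 and +8, that is, $\pm 2^i$, matching the case identified in the paragraph above. Calling the same helper with k = 2 gives an empty list for this slice.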

Theorem 5: In a bit-sliced ripple-carry adder, RESO-2 detects all errors resulting from failures confined to any one bit-slice.

Proof: Let bit-slice i be faulty. During the first computation step, the sum and the carry outputs of slice i have weights $2^i$ and $2^{i+1}$, respectively. Therefore, the result from the adder can be off by one of the values $\{0, \pm 2^i, \pm 2^{i+1}, \pm 3 \times 2^i\}$.

Now consider the recomputation after the operands have been shifted left by two bits. The sum and carry-out of bit-slice i now have weights $2^{i-2}$ and $2^{i-1}$, respectively. Therefore, the result of the recomputation step can be off by one of the values $\{0, \pm 2^{i-2}, \pm 2^{i-1}, \pm 3 \times 2^{i-2}\}$.

No single error (nonzero value) appears in both sets. Therefore, any error in either step will cause a mismatch of the results of the two computation steps, and the error is detected.
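A small fault-injection experiment is consistent with this proof. The sketch below is an idealized model and not the authors' circuit: it assumes unsigned operands, an (n + k)-slice ripple-carry adder whose result includes the final carry-out, the Fig. 4(b) comparison, and a faulty slice whose sum and carry-out are forced to arbitrary values independently in each step. The names `ripple_add` and `reso_detects_all` are hypothetical.

```python
from itertools import product

def ripple_add(x, y, width, faulty_slice=None, forced=None):
    """Ripple-carry add of two `width`-bit values; returns sum bits plus carry-out.

    If faulty_slice is set, that slice outputs forced = (sum_bit, carry_out)
    regardless of its inputs.
    """
    result, carry = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        total = a + b + carry
        s, carry_out = total & 1, total >> 1
        if i == faulty_slice:
            s, carry_out = forced
        result |= s << i
        carry = carry_out
    return result | (carry << width)

def reso_detects_all(n, k):
    """True if every erroneous outcome causes a mismatch for n-bit operands."""
    width = n + k
    for i in range(width):
        for s1, c1, s2, c2 in product((0, 1), repeat=4):
            for x, y in product(range(1 << n), repeat=2):
                first = ripple_add(x, y, width, i, (s1, c1))
                second = ripple_add(x << k, y << k, width, i, (s2, c2))
                erroneous = first != x + y or second != (x + y) << k
                if erroneous and (first << k) == second:
                    return False
    return True

reso2_ok = reso_detects_all(3, 2)
reso1_ok = reso_detects_all(3, 1)
```

Under these assumptions the two-bit shift leaves no undetected error for 3-bit operands, while the one-bit shift does (for example, a wrong sum bit in the first step matched by a wrong carry-out in the second).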

For carry-lookahead adders, RESO-2 is also effective, as proven in the following theorem. Since a bit-slice is not well defined in a carry-lookahead (CLA) adder, we first describe the exact implementation of a CLA adder and then define what we mean by a "bit-slice." Fig. 9 describes a typical implementation of a CLA adder. The function unit $f_i$ computes the sum bit $s_i$, the carry propagate signal $P_i$, and the carry generate signal $G_i$. The function unit $g_i$ computes the carry-in to stage i; all function units $f_i$ are identical. However, the function units $g_i$ get more complex as i grows. It is only for the sake of convenience that we define a "bit-slice" i to consist of units $f_i$ and $g_i$.

Theorem 6: In a bit-sliced carry-lookahead adder, RESO-2 detects all errors resulting from failures confined to any one bit-slice.

Proof: Let bit-slice i, which includes function units $f_i$ and $g_i$, be faulty (see Fig. 9). Then by assumption only the sum bit $s_i$, carry generate $G_i$, and carry propagate $P_i$ are affected. Since the consequence of faults in $g_i$ is to affect only the sum bit $s_i$, this case is already included in the above assumption. Sum bit $s_i$ has an arithmetic weight of $2^i$, and $G_i$ has a weight of $2^{i+1}$. When $G_i = 1$, the propagate signal $P_i$ has no effect on sum bits i + 1 and higher. Thus, when $G_i = 1$, $P_i$ has a weight of 0. Depending on the implementation, the sum bit $s_i$ may or may not be a function of $P_i$. Since we have already considered $s_i$ to be erroneous, the effect of $P_i$ on $s_i$ can be ignored. When $G_i = 0$, $P_i$ has a weight of $2^i$. With this information, we can enumerate all possible changes in the sum contributed by the errors in $G_i$ and $P_i$, as shown in Table I.

Combining these values with the possible errors contributed by the sum bit $s_i$, we conclude that due to the failures in slice i, the result can be off by one of the values $\{0, \pm 2^i, \pm 2^{i+1}, \pm 3 \times 2^i\}$. After the operands are shifted left by two bits, the result of the recomputation step can be off by one of the values $\{0, \pm 2^{i-2}, \pm 2^{i-1}, \pm 3 \times 2^{i-2}\}$. Thus, no single nonzero error value appears in both computation steps, and any error can be detected.

The above results can be generalized to include the failure of more than one bit-slice of the ALU. The generalized result is stated below:

Theorem 7: RESO-k has the following error detection capabilities in an ALU:

1) detects all errors in all bit-wise logical operations when the failures are confined to $k$ adjacent bit-slices,

2) detects all errors in arithmetic operations in a ripple-carry adder when the failures are confined to $(k - 1)$ adjacent bit-slices for $k > 1$ (for $k = 1$ see RESO-1 in the last section),

3) detects all errors in arithmetic operations in a full carry-lookahead adder when the failures are confined to $(k - 1)$ adjacent bit-slices, where a bit-slice $i$ consists of functional units $f_i$ and $g_i$ (see Fig. 9),

4) detects all errors in arithmetic operations in a group carry-lookahead adder, each group $i$ consisting of a $(k - 1)$-bit adder and circuits for the group-carry generate $G_i$, the group-carry propagate $P_i$, and the group carry-in $C_i$ (similar to Fig. 9), when the failures are confined to a single group.

Proof: 1) Let the $k$ adjacent bit-slices $i, i + 1, \ldots, i + k - 1$ be faulty. During the first computation step, bits $i, i + 1, \ldots, i + k - 1$ of the result are affected by faults. The recomputation step is performed after $k$-bit left-shifts of the input operands. Therefore, bit-slice $i$ affects bit $i - k$ of the result, slice $i + 1$ affects bit $i - k + 1$ of the result, and so on. Thus, the result bits that are affected by the faults are bits $i - k, i - k + 1, \ldots, i - 1$. Therefore, no single bit of the result is affected in both computation steps. Hence, an error guarantees a mismatch between the results of the two computation steps.

2) Let us assume that the $(k - 1)$ adjacent bit-slices $i, i + 1, \ldots, i + k - 2$ are faulty. During the first computation step, the faults can cause errors in the sum bits $i, i + 1, \ldots, i + k - 2$ and in the carry-outs of bits $i, i + 1, \ldots, i + k - 2$. The smallest nonzero magnitude of the change in the result due to faults occurs when the sum bit $i$ is in error. The magnitude of this error is $2^i$. The largest magnitude of the error occurs when the sum bits $i, i + 1, \ldots, i + k - 2$ and the carry-out of bit $i + k - 2$ are all wrong in one direction. (That is, they all change from 0 to 1 or all from 1 to 0.) The magnitude of this error is $2^i + 2^{i+1} + \cdots + 2^{i+k-2} + 2^{i+k-1}$, which is the same as $2^i(2^k - 1)$.

After the input operands are shifted left by $k$ bits, the recomputation step is performed. The bits affected are the sum bits $i - k, i - k + 1, \ldots, i - 2$ and the carry-out of bit $i - 2$. The smallest and the largest magnitudes of the nonzero errors are $2^{i-k}$ and $2^{i-k}(2^k - 1)$, respectively.

The largest error of the recomputation step is less than the smallest error of the first step, since $2^{i-k}(2^k - 1) < 2^i$. Therefore, no single error value can occur in both of the computation steps, and thus an error always causes a mismatch of the two results and, hence, is detected.

3) Again assume that the $(k - 1)$ adjacent bit-slices $i, i + 1, \ldots, i + k - 2$ are faulty. Using reasoning similar to that in the proof of Theorem 6 and in the proof of part 2) above, it can be shown that the smallest nonzero magnitude of the error during the first step of computation is $2^i$, and the largest magnitude of the error during the recomputation step is $2^{i-k}(2^k - 1)$, which is less than $2^i$. Thus, no single error value occurs in both steps.

4) The group of $(k - 1)$ bit-slices is the same as the $(k - 1)$ adjacent bit-slices of 3). □
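As an illustration of part 2) of Theorem 7, the following Python sketch simulates a ripple-carry adder built from $n + k$ full-adder slices, replaces one slice by an arbitrary truth table, and counts the cases in which an erroneous first result is not flagged by the comparison with the shifted recomputation. The function names (`ripple_add`, `reso_add`, `undetected_errors`), the truth-table representation of a fault, and the random sampling of fault tables are assumptions made for this sketch and are not taken from the paper. It is a sanity check for small word lengths, not a substitute for the proof.

```python
import itertools
import random


def ripple_add(a, b, width, faulty=None, table=None):
    """Add a and b on `width` full-adder slices; return the (width+1)-bit result.

    If `faulty` is a slice index, that slice ignores the full-adder equations
    and reads (sum, carry_out) from `table`, keyed by (a_bit, b_bit, carry_in).
    """
    carry, result = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if i == faulty:
            s, carry = table[(ai, bi, carry)]
        else:
            total = ai + bi + carry
            s, carry = total & 1, total >> 1
        result |= s << i
    return result | (carry << width)


def reso_add(a, b, n, k, faulty=None, table=None):
    """Return (first_result, mismatch_flag) for n-bit operands under RESO-k."""
    width = n + k  # k extra slices hold the bits shifted to the left
    first = ripple_add(a, b, width, faulty, table)
    second = ripple_add(a << k, b << k, width, faulty, table)
    return first, first != (second >> k)


def undetected_errors(n=3, k=2, samples=200, seed=0):
    """Count wrong first results that escape detection (single faulty slice)."""
    rng = random.Random(seed)
    inputs = list(itertools.product((0, 1), repeat=3))
    missed = 0
    for _ in range(samples):
        # One randomly chosen (hypothetical) faulty behavior of a slice.
        table = {x: (rng.randint(0, 1), rng.randint(0, 1)) for x in inputs}
        for slice_index in range(n + k):
            for a, b in itertools.product(range(2 ** n), repeat=2):
                first, flagged = reso_add(a, b, n, k, slice_index, table)
                if first != a + b and not flagged:
                    missed += 1
    return missed
```

For $k = 2$ and a single faulty slice, which is the case $(k - 1) = 1$ of the theorem, `undetected_errors()` is expected to return 0.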

VI. EXTENSIONS OF RESO

The method of error detection presented in the previous section can be modified and/or extended in several different directions. Among these are: use of rotations rather than shifts in certain applications, error correction using RESO, and extending RESO to more complex arithmetic functions, such as multiply, divide, and floating-point operations. We discuss them briefly in this section.

Recomputing with Rotated Operands: Since a rotation is the same as a circular shift, some of our results derived for RESO are also valid under rotation. For bit-wise logical operations in an ALU, no two bits of the same operand interact, and therefore the positioning of a bit with respect to other bits does not affect the outcome of the result as long as the bits of the second operand are similarly positioned. Thus, rotations can be substituted for logical shifts in RESO, and the same error detection capability is achieved for logical operations. Rotations have a clear advantage over shifts because no additional bit-slices are needed. It is not clear, however, that rotations can be used in a straightforward manner when arithmetic operations are involved. With additional hardware, it is possible to ensure a correct add operation after a rotation, so that the carry-in is applied to the appropriate bit-slice and the carry-out is extracted from the proper bit-slice. Thus, there is a tradeoff between the cost of the additional bit-slices needed for shifts and the cost of the additional control hardware needed for rotations. Furthermore, we must also consider the effects of faults on the additional hardware, which is different from a bit-slice. Rotations in a carry-lookahead adder require even more complex control, since the carry-lookahead unit cannot be divided into identical bit-slices.
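The substitution of rotations for shifts in bit-wise logical operations can be sketched as follows. The code models an $n$-slice logic unit in which one slice may be replaced by an arbitrary two-input truth table, recomputes with both operands rotated left by $k$ bits, rotates the second result back, and compares. The names `rotl`, `rotr`, `sliced_logic`, `recompute_rotated`, and `missed_by_rotation`, as well as the truth-table fault representation, are hypothetical and chosen only for this illustration.

```python
import itertools
import operator


def rotl(x, k, n):
    """Rotate the n-bit word x left by k positions."""
    k %= n
    mask = (1 << n) - 1
    return ((x << k) | (x >> (n - k))) & mask


def rotr(x, k, n):
    """Rotate the n-bit word x right by k positions."""
    return rotl(x, n - (k % n), n)


def sliced_logic(op, a, b, n, faulty=None, table=None):
    """Apply the bit-wise operation `op` on n slices, one bit per slice.

    If `faulty` is a slice index, that slice outputs table[(a_bit, b_bit)].
    """
    result = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        bit = table[(ai, bi)] if i == faulty else op(ai, bi) & 1
        result |= bit << i
    return result


def recompute_rotated(op, a, b, n, k=1, faulty=None, table=None):
    """Return (first_result, mismatch_flag); no extra bit-slices are used."""
    first = sliced_logic(op, a, b, n, faulty, table)
    second = sliced_logic(op, rotl(a, k, n), rotl(b, k, n), n, faulty, table)
    return first, first != rotr(second, k, n)


def missed_by_rotation(n=4, k=1):
    """Exhaustively count wrong first results that are not flagged."""
    keys = list(itertools.product((0, 1), repeat=2))
    missed = 0
    for op in (operator.and_, operator.or_, operator.xor):
        for outputs in itertools.product((0, 1), repeat=4):
            table = dict(zip(keys, outputs))  # one possible faulty slice
            for slice_index in range(n):
                for a, b in itertools.product(range(2 ** n), repeat=2):
                    first, flagged = recompute_rotated(
                        op, a, b, n, k, slice_index, table)
                    if first != op(a, b) and not flagged:
                        missed += 1
    return missed
```

In this model the word length stays at $n$ slices, which reflects the advantage of rotations noted above. The sketch does not address arithmetic operations, for which the carry path would need the additional control hardware discussed in the text.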

Error Correction Using RESO: First, let us discuss the bit-wise logical operations in an ALU. Suppose that bit-slice $i$ is faulty. Then bit $i$ of the result may or may not be correct. For the first recomputation step, the operands are shifted left by one bit. Now bit $i - 1$ of the result is computed by the faulty bit-slice. If bit $i$ of the first step and bit $i - 1$ of the second step are both incorrect, then the two results disagree in two bit positions. From this it is clear that slice $i$ produced an incorrect output during both steps. Hence, the result can be corrected by complementing bit $i$ of the first result or bit $i - 1$ of the second result. However, if the incorrect output was produced during only one of the two steps, then the disagreement between the two steps occurs in only one bit position, and it cannot be determined which of the two results is correct. For this reason, we need more information, which can be obtained by doing a third computation step after the operands have been shifted left one more bit, that is, a total of two bits off from the original operands. Now each bit of the result is computed by at least two nonfaulty bit-slices. Therefore, a 2-out-of-3 majority vote on each bit will decide the correct value. This is very similar to Triple Modular Redundancy (TMR), the difference being that TMR is redundant in space, while RESO uses redundancy in time.
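The three-step correction scheme for bit-wise logical operations can be illustrated with a short Python sketch. It computes the result with the operands shifted left by 0, 1, and 2 bits on $n + 2$ slices, realigns the three results, and takes a bit-wise 2-out-of-3 majority. The names `sliced_logic`, `reso_correct`, and `correction_failures`, and the modeling of a faulty slice as an arbitrary two-input truth table, are assumptions of this sketch rather than details given in the paper.

```python
import itertools
import operator


def sliced_logic(op, a, b, width, faulty=None, table=None):
    """Apply the bit-wise operation `op` on `width` slices, one bit per slice.

    If `faulty` is a slice index, that slice outputs table[(a_bit, b_bit)].
    """
    result = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        bit = table[(ai, bi)] if i == faulty else op(ai, bi) & 1
        result |= bit << i
    return result


def reso_correct(op, a, b, n, faulty=None, table=None):
    """Return the majority-voted n-bit result of three shifted computations."""
    width = n + 2  # two extra slices for the shifts by one and two bits
    mask = (1 << n) - 1
    r0, r1, r2 = (
        (sliced_logic(op, a << s, b << s, width, faulty, table) >> s) & mask
        for s in (0, 1, 2)
    )
    # Bit j is computed by slices j, j+1, and j+2 in the three steps, so a
    # single faulty slice can corrupt at most one of the three copies.
    return (r0 & r1) | (r0 & r2) | (r1 & r2)


def correction_failures(n=3):
    """Exhaustively count cases in which the voted result is still wrong."""
    keys = list(itertools.product((0, 1), repeat=2))
    failures = 0
    for op in (operator.and_, operator.or_, operator.xor):
        for outputs in itertools.product((0, 1), repeat=4):
            table = dict(zip(keys, outputs))  # one possible faulty slice
            for slice_index in range(n + 2):
                for a, b in itertools.product(range(2 ** n), repeat=2):
                    if reso_correct(op, a, b, n, slice_index, table) != op(a, b):
                        failures += 1
    return failures
```

Under the single-faulty-slice assumption, `correction_failures()` is expected to return 0. The vote costs two recomputation steps instead of one, which is the time-redundancy counterpart of the tripled hardware in TMR.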

Correcting errors in arithmetic operations with RESO is not generally straightforward or even possible, although under very restrictive fault models (such as the single stuck-at fault) one may be able to correct errors in arithmetic operations with additional hardware.

RESO for Complex Arithmetic Functions: We have so far described the error detection capabilities of RESO as applied to logical and simple arithmetic operations (add, subtract). For arithmetic operations such as integer multiply and divide, one can determine the error detection capabilities of RESO for specific hardware implementations. There are also different ways of applying RESO to multiplication or division. For example, if the multiplication is performed using the add-and-shift method on an ALU, then we can apply RESO to the individual add and shift operations and thus check each step of the multiplication algorithm. If the multiplication is done using an array multiplier, then one can use the shifted operands for the recomputation step. Thus, the first step computes, say, $x \times y$ and the recomputation step computes $2x \times 2y$, which is then compared with $x \times y$ shifted left by two bits. Since there are many different array multipliers, we shall not give here the error detection capabilities of RESO-k for any particular multiplier. However, the methods described in the previous section can be used in determining the error detection capabilities of RESO-k for an assumed fault model. RESO is especially suitable for array multipliers and dividers because most such array structures are very regular, so that they can be divided into identical bit-slices.
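The comparison of $x \times y$ with $2x \times 2y$ can be sketched as a wrapper around a multiplier function. Here `reso_multiply`, `healthy_multiplier`, and `stuck_bit3_multiplier` are hypothetical names, and the injected fault (product bit 3 stuck at 1) is only one illustrative failure. The sketch makes no claim about the coverage of RESO for any real array multiplier, which depends on the implementation and the assumed fault model.

```python
def reso_multiply(multiply, x, y, k=1):
    """Return (first_result, mismatch_flag) for a multiplier checked by RESO.

    The recomputation multiplies the operands shifted left by k bits each,
    so its fault-free result equals the first product shifted left by 2k bits.
    """
    first = multiply(x, y)
    second = multiply(x << k, y << k)
    return first, (first << (2 * k)) != second


def healthy_multiplier(x, y):
    """Fault-free reference multiplier."""
    return x * y


def stuck_bit3_multiplier(x, y):
    """Hypothetical faulty multiplier: product bit 3 is stuck at 1."""
    return (x * y) | 0b1000


# Fault-free unit: the two steps agree.
print(reso_multiply(healthy_multiplier, 2, 2))     # (4, False)
# Faulty unit: the first result is wrong and the mismatch flag is raised.
print(reso_multiply(stuck_bit3_multiplier, 2, 2))  # (12, True)
```

For this particular stuck-at fault, every wrong first product over 4-bit operands is flagged, because the stuck bit has weight $2^3$ in the first step but corresponds to weight $2^1$ of the product in the recomputation step.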

For floating-point operations, one can apply the already established RESO techniques for integers. Thus, the exponent and the mantissa can be handled separately as integers, each with its own error detection mechanism.

VII. CONCLUDING REMARKS

We have presented a time redundancy technique for concurrent error detection in arithmetic and logic units. The method, Recomputing with Shifted Operands (RESO), exploits the bit-slice structure of ALU's. The fault model used is more general than the commonly assumed stuck-at fault models. Our model assumes that the physical failures are confined to a small area of the chip or, equivalently, to a cluster of components, and that the precise nature of the resulting faults is unknown. This model is very appropriate for VLSI technology, since the failure modes in VLSI circuits are not well understood. Furthermore, we have shown that RESO has the capability of detecting errors not only in logical operations, but also in arithmetic operations in ripple-carry adders, full carry-lookahead adders, and group carry-lookahead adders. The universality of error detection and the low cost of implementation make RESO more attractive than other methods of error detection in ALU's.

REFERENCES

[1] A. Avizienis, "Arithmetic codes: Cost and effectiveness studies for application in digital systems design," IEEE Trans. Comput., vol. C-20, pp. 1322-1331, Nov. 1971.

[2] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and D. K. Rubin, "The STAR computer: An investigation of the theory and practice of fault-tolerant computer design," IEEE Trans. Comput., vol. C-20, pp. 1312-1322, Nov. 1971.

[3] H. L. Garner, "Error codes for arithmetic operations," IEEE Trans. Electron. Comput., vol. EC-15, pp. 763-770, May 1966.

[4] J. F. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications. New York: North-Holland, 1978.

[5] F. F. Sellers, M. Y. Hsiao, and L. W. Bearnson, Error Detecting Logic for Digital Computers. New York: McGraw-Hill, 1968.

[6] T. R. N. Rao and P. M. Monteiro, "A residue checker for arithmetic and logical operations," in Proc. 2nd Int. Symp. on Fault-Tolerant Comput., June 1971.

[7] J. F. Wakerly, "Partially self-checking circuits and their use in performing logical operations," IEEE Trans. Comput., vol. C-23, pp. 658-666, Dec. 1974.

[8] D. Reynolds and G. Metze, "Fault detection capabilities of alternating logic," IEEE Trans. Comput., vol. C-27, pp. 1093-1098, Dec. 1978.

[9] J. J. Shedletsky, "Error correction by alternate-data retry," IEEE Trans. Comput., vol. C-27, pp. 106-112, Feb. 1978.

[10] D. A. Anderson, "Design of self-checking digital networks using coding techniques," Coord. Sci. Lab., Univ. of Illinois, Urbana, Tech. Rep. R-527, Sept. 1971.

Janak H. Patel (S'73-M'76), for a photograph and biography, see page 304 of the April 1982 issue of this TRANSACTIONS.

Leona Y. Fung (S'80) was born in Hong Kong on July 24, 1958. She received the B.S. degree in computer science and the M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign.

In 1981 she was a Research Assistant in the Coordinated Science Laboratory at the University of Illinois. Her research interests are fault-tolerant computing, computer architecture, and VLSI systems.