
JOURNAL OF ALGORITHMS 10, 120-139 (1989)

Approximation Algorithms for Scheduling Arithmetic Expressions on Pipelined Machines

DAVID BERNSTEIN*

IBM T. J. Watson Research Center, P.O.B. 704, Yorktown Heights, New York 10598

MICHAEL RODEH

IBM Israel Scientific Center, Technion City, Haifa 32000, Israel

AND

IZIDOR GERTNER

Department of Electrical Engineering, Technion-Israel Institute of Technology, Haifa 32000, Israel

Received July 19, 1987; accepted February 19, 1988

Consider a processor which can issue one instruction every machine cycle, but can use its result only d + 1 machine cycles after it has been issued. It is shown that an upper bound for the completion time of an arbitrary list schedule for arbitrary expressions, with possibly common subexpressions, on such machines is greater than the optimum by a factor of 2 − 1/(d + 1). Then a class of scheduling algorithms, called level algorithms, is defined and analyzed. These algorithms sometimes yield bad schedules which can be made arbitrarily close to the upper bound of list schedules. By extending the leveling algorithm, using a lexicographic order criterion similar to that of Coffman-Graham's algorithm, a better result of 2 − 2/(d + 1)

is derived. This bound is asymptotically tight. © 1989 Academic Press, Inc.

1. INTRODUCTION

Pipelining is a common technique for building fast processors. In contrast to parallel processing, in which computational jobs can be initiated simultaneously, only one instruction can be issued every machine cycle in a

*This work was done while the author was a graduate student at the Technion-Israel Institute of Technology.

0196-6774/89 $3.00 Copyright © 1989 by Academic Press, Inc. All rights of reproduction in any form reserved.


pipelined machine; several instructions may be executed concurrently, one in every stage of the pipe. In general, recently designed computer architectures [7, 9] include both pipelining and parallelism. In this paper we concentrate on the effect of pipelining.

Pipelining may cause the insertion of NOPs (No Operations) into the sequence of machine instructions either by hardware or software. In both cases a certain penalty is paid in increased execution time. Minimizing the number of NOPs increases the effective speed of the machine. Li assumed identical delays of all the instructions in the pipeline [10]. In this case, assuming that the input is limited to tree expressions, an optimal computation can be constructed by executing first the instructions furthest from the root of the tree. Directed acyclic graphs (dags) were considered by Bruno et al. [2]. They showed that if the delays of all the instructions are equal to one time unit, then Coffman-Graham's algorithm [4] can be used to produce an optimal solution.

We allow different delay times for instructions, as if expressions were computed on several pipelined functional units with different numbers of internal stages. Cray-1 [13] and certain RISC architectures [12] exhibit such a behavior. If no bound is put on the maximal delay d then the problem of finding an optimal computation turns out to be NP-complete [8]. However, when d ≤ 1, an optimal algorithm can be derived from Coffman-Graham's algorithm [1]. Since many architectures have delays greater than 1 (e.g., MIPS [8] has delays of 2 machine cycles, and in Cray-1 the delay times vary from 1 to 13 machine cycles), we study the case of arbitrary bounded delays and prove that an upper bound for list schedules [3] is 2 − 1/(d + 1).

Hennessy and Gross [8] showed that finding an optimal schedule for expressions (dags) on a pipelined processor with a maximal delay of d machine cycles is at least as hard as finding an optimal schedule for m = d parallel processors. If m is part of the problem instance, then the problem is NP-complete; for any fixed m ≥ 3 the complexity question is open, while the scheduling problem for m = 2 machines is polynomially solvable by Coffman-Graham's algorithm [4]. However, the difficulties that arise when d = 2 are similar to those of m = 3, and therefore, the status of this special case is unclear.

We propose an algorithm that follows the critical path approach by assigning a level to each instruction. Then, a computation is constructed in such a way that instructions at higher levels are computed first. This has been a natural heuristic approach to multiprocessor scheduling [5, 11, 14]. Unfortunately, there are examples in which our algorithm can perform on dags as badly as the upper bound of list schedules. By extending the leveling algorithm in a way similar to Coffman-Graham's algorithm [4], we reduce the worst case ratio to 2 − 2/(d + 1) when all delays are equal to


either d cycles or 0 cycles. Finally, we construct a family of examples that approaches this worst case ratio arbitrarily closely.

The rest of the paper is organized as follows. In the next section we start with some preliminary definitions. Then, in Section 3 we discuss list schedules, and in Section 4 the leveling algorithm is presented along with its worst case example. In Section 5 we analyze the worst case behavior of Coffman-Graham’s algorithm, and in Section 6 we discuss possible future research.

2. PRELIMINARIES

Our scheduling model consists of a single processor P and a job system T = (J, D, G). T comprises a set of unit-execution-time jobs J = {J1, …, Jn}, a set of delays D = {D1, …, Dn}, where Di ∈ {0, …, d} for some fixed integer d, and a directed graph G = (J, E) of precedence constraints. (The delays model the pipelined structure of P.) Given a job Ji, we sometimes use the notation D(Ji) to denote Di.

A legal schedule is defined as a one-to-one mapping S from the elements of J into the set N of positive integers (interpreted as time slots) such that for all (Ji, Jj) ∈ E, S(Jj) − S(Ji) > Di. A time slot of S in which no job can be executed because of delay limitations is called a NOP.
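The legality condition above is mechanical to check. A minimal Python sketch (our own illustration, with hypothetical names; not from the paper):

```python
def is_legal(schedule, edges, delay):
    """A legal schedule: a one-to-one map from jobs to time slots with
    S(Jj) - S(Ji) > Di for every precedence edge (Ji, Jj)."""
    slots = list(schedule.values())
    if len(set(slots)) != len(slots):          # must be one-to-one
        return False
    return all(schedule[j] - schedule[i] > delay[i]
               for (i, j) in edges)
```

Here `schedule` maps job identifiers to time slots, `edges` lists the precedence pairs (Ji, Jj), and `delay` gives Di for each job.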

We assume that G has no transitive edges since they do not impose additional restrictions on a schedule S of T. Also, Coffman-Graham's algorithm, which will be presented in Section 5, requires that the transitive and non-transitive edges of G be distinguished. However, in a more general model of a pipelined machine, in which a delay is assigned to an edge rather than to a vertex of the precedence graph, the transitive edges cannot be neglected. We expand on this case a bit in Section 6.

As an example, consider the job system of Fig. 1(a). The jobs are represented by circles, and their indices appear inside the circles. The integers near the edges denote the corresponding delay times. Three legal schedules for the job system are shown in Fig. 1(b), where i in column j means that Ji is executed in time slot j. Notice that time slots 5 and 6 of S1 are NOPs since D1 = 3 and D6 = 2.

The completion (maximum finishing) time c(S) of a schedule S is defined by max_i S(Ji). For example, in Fig. 1(b), c(S1) = 11, c(S2) = 10, and c(S3) = 9. Throughout this work we will be interested in minimizing the completion time, which is equivalent to minimizing the number of NOPs. An optimal schedule S is a legal schedule for which c(S) is smallest. S3 of Fig. 1(b) is clearly an optimal schedule for the job system of Fig. 1(a) since it has no NOPs.


N    1  2  3  4  5  6  7  8  9  10  11

S1   4  5  1  6  -  -  2  7  3  8   9
S2   4  1  5  6  -  2  7  3  8  9
S3   1  4  5  6  2  3  7  8  9

(b)

FIG. 1. A job system and three legal schedules (dashes denote NOPs).

3. LIST SCHEDULES

Let T = (J, D, G) be a job system with n jobs where G = (J, E). If (Ji, Jj) ∈ E we say that Jj is an immediate successor of Ji, and Ji is an immediate predecessor of Jj. Also, if there exists a directed path in G from Ji to Jj we say that Jj is a successor of Ji, and Ji is a predecessor of Jj. In the sequel, we denote by slot(S, X) the time slot of S in which the job X is executed; by Si we denote the job which is scheduled in S in time slot i. Given a schedule S for T, Jj is ready in time slot k if for each of its immediate predecessors Ji, slot(S, Ji) ≤ k − 1 − Di.

Now we consider an important class of schedules, called list schedules [3]. Informally, given a priority list L of the jobs of J, the list schedule S that corresponds to L can be constructed by the following procedure:

1. Iteratively schedule the jobs of J starting in time slot 1 such that during the i-th step, L is scanned from left to right, and the first ready job not yet scheduled is chosen to be executed in time slot i.

2. If no such job is found, a NOP is inserted into S in time slot i.
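The two steps above translate directly into code. A minimal sketch (an illustrative reconstruction, not the authors' implementation), reusing the readiness condition slot(S, Ji) ≤ k − 1 − Di:

```python
def list_schedule(jobs, edges, delay, priority):
    """List scheduling: in each time slot run the first ready,
    not-yet-scheduled job on the priority list L; if none is ready,
    insert a NOP (Steps 1 and 2 above)."""
    preds = {j: [] for j in jobs}
    for (u, v) in edges:
        preds[v].append(u)
    slot_of = {}                  # job -> time slot (1-based)
    sched = []                    # sched[i-1] is the job in slot i
    while len(slot_of) < len(jobs):
        t = len(sched) + 1
        ready = [j for j in priority if j not in slot_of
                 and all(p in slot_of and slot_of[p] <= t - 1 - delay[p]
                         for p in preds[j])]
        if ready:
            slot_of[ready[0]] = t
            sched.append(ready[0])
        else:
            sched.append('NOP')
    return sched
```

On a toy instance with d = 2 (a chain J1 → J2 with D1 = 2 plus two isolated jobs), a good priority list yields no NOPs while a bad one wastes two, illustrating how much the choice of L matters.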

Consider a class of optimal schedules for T. Since all the jobs in T have unit execution times, there is no reason in optimal schedules to leave the processor P idle whenever a ready job exists. Therefore, for our problem,

124 BERNSTEIN, RODEH, AND GERTNER

an optimal schedule can always be found among list schedules. The obvious question is how to obtain the right priority list L.

Analyzing a class of list schedules, we estimate how far from the optimum an arbitrary list schedule can be. In the sequel, we denote optimal schedules by SOPT and arbitrary list schedules by SLIST. Also, we use copt and clist as shorthand notation for c(SOPT) and c(SLIST), respectively. To determine the upper bound for the ratio R = clist/copt we need the following result.

LEMMA 1. Let T = (J, D, G) be a job system with n jobs, and let copt = n + k, i.e., there are k ≥ 0 NOPs in an optimal schedule. Let SLIST be a list schedule of T such that R = clist/copt is maximal. If k > 0 then there exists a job system T′ whose optimal schedules contain no NOPs and a list schedule SLIST′ for it such that R′ = clist′/copt′ > R.

Proof. To prove the lemma, we show how to build T′ from T. We add to T k isolated jobs (with no immediate successors, no immediate predecessors, and zero delays) to get T′. Let SOPT be an optimal schedule of T. SOPT has k NOPs. An optimal schedule SOPT′ of T′ can be obtained from SOPT by replacing the NOPs of SOPT by the k isolated jobs. Therefore, copt′ = copt. SLIST′ is built from SLIST by scheduling the k isolated jobs before all the other jobs of T′. Therefore, clist′ = clist + k. Thus, R′ = clist′/copt′ = (clist + k)/copt > clist/copt = R. □

Let SLIST be a list schedule of T and assume that SLISTj = NOP. We say that SLISTj is induced by Jp and Jq if:

1. slot(SLIST, Jp) < j < slot(SLIST, Jq).

2. (Jp, Jq) ∈ E.

Notice that every NOP of a list schedule must be induced by at least one pair of jobs. For example, the NOP of time slot 5 in S1 of Fig. 1(b) is induced by J1 and J2, by J1 and J3, and by J6 and J7. Let C = C1, …, Cm be a sequence of jobs in J such that for all 2 ≤ i ≤ m, (Ci−1, Ci) ∈ E. Thus, C is a directed path in G. We say that C = C1, …, Cm covers a NOP of SLIST if the NOP is induced by a pair of jobs of C.

LEMMA 2. Let T = (J, D, G) be a job system, and let SLIST be a list schedule of T. Then there exists a directed path C = C1, …, Cm in G that covers all the NOPs of SLIST.

Proof. To prove the lemma, it is sufficient to prove the following claim.

CLAIM. Let Jq be scheduled in time slot j of SLIST. Then there exists a directed path C = C1, …, Cm with Cm = Jq which covers all the NOPs of SLIST that are earlier than time slot j.

ARITHMETIC ON PIPELINED MACHINES 125

Proof of Claim. We prove the claim by scanning SLIST backwards starting in time slot j. Let the rightmost NOP of SLIST scheduled before time slot j be in time slot i. Let the leftmost job of J which is a predecessor of Jq and is scheduled in SLIST after time slot i be Jv. (Notice that Jv can be Jq itself.) Since SLIST is a list schedule, there must exist a job Ju which is scheduled in SLIST before time slot i such that (Ju, Jv) ∈ E; otherwise Jv could be scheduled in SLIST in time slot i. Thus, the NOP of time slot i is induced by Ju and Jv. Notice that in SLIST there might be more NOPs which are located after Ju and before time slot i, and which are induced by Ju and Jv. Therefore, there exists a directed path Ju, Jv, …, Jq that covers all the NOPs of SLIST located after Ju and before time slot i, including the NOP in time slot i. By iterating this argument until we reach time slot 1, the claim is proved. □

THEOREM 3. Let T be a job system with n jobs, and let SOPT and SLIST be an optimal schedule and an arbitrary list schedule of T, respectively. Then R = clist/copt ≤ 2 − 1/(d + 1) and the bound is asymptotically tight.

Proof. By Lemma 1, we may assume that SOPT has no NOPs. Therefore, copt = n. Let k be the number of NOPs in SLIST. Thus, clist = n + k and R = 1 + k/n. By Lemma 2, there exists a directed path C = C1, …, Cm which covers all the NOPs of SLIST. Therefore, D(C1) + … + D(Cm−1) ≥ k. Notice that since Dj ≤ d for all j, we have m − 1 ≥ k/d. Since SOPT has no NOPs, there must be k jobs in SOPT that fill in the delays of C in SLIST. Therefore, we get n ≥ k + m ≥ k + k/d + 1. Thus, R = 1 + k/n ≤ 1 + k/(k + k/d + 1) = 2 − 1/(1 + d/(1 + d/k)) ≤ 2 − 1/(d + 1).

To prove that the bound cannot be improved, consider the job system of Fig. 2. It consists of a chain of k + 1 jobs with delay times of d (Fig. 2(a)) and kd isolated jobs (Fig. 2(b)). Thus, n = kd + k + 1. An optimal schedule with no NOPs can be obtained by executing the kd isolated jobs in the delay times of the jobs in the chain. The worst list schedule is obtained by executing all kd isolated jobs first, and then computing the jobs in the chain. This worst list schedule has kd NOPs, and clist = 2kd + k + 1.

(a) a chain of k + 1 jobs, each with delay time d; (b) kd isolated jobs.

FIG. 2. An upper bound for list schedules.

Thus,

R = clist/copt = (2kd + k + 1)/(kd + k + 1) = 2 − 1/(1 + d/(1 + 1/k)).

By increasing k, R can be made arbitrarily close to 2 − 1/(d + 1). □
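The family of Fig. 2 is easy to tabulate. A small sketch (the helper name is ours) that evaluates the ratio derived above:

```python
def list_bound_ratio(k, d):
    """Fig. 2 family: a chain of k + 1 jobs with delay d plus kd
    isolated jobs.  The worst list schedule runs the isolated jobs
    first and then pays kd NOPs along the chain."""
    n = k * d + k + 1                # total number of jobs; copt = n
    clist_worst = 2 * k * d + k + 1  # kd NOPs remain in the chain
    return clist_worst / n
```

As k grows, the returned ratio approaches 2 − 1/(d + 1) from below, matching the theorem.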

4. A LEVELING ALGORITHM

4.1. Leveled Schedules

In this section, a subclass of list schedules, called leveled schedules, is considered. If Ji ∈ J has no immediate successors, we say that Ji is a sink of G. Also, let IS(Ji) (or ISi for short) be the set of immediate successors of Ji.

The level l(Y) of a job Y is defined by

l(Y) = 0 if Y is a sink of G,
l(Y) = max{l(X) : X ∈ IS(Y)} + D(Y) otherwise.

Notice that the total execution time of the jobs does not affect the levels as defined above.

Let L be a priority list of the jobs in J constructed in a non-increasing order of their levels (the order among the jobs of the same level is arbitrary). A schedule S corresponding to such an L is called a leveled schedule. We denote leveled schedules by SLEV and use clev as a shorthand notation for c(SLEV). Intuitively, in leveled schedules we first schedule jobs whose delays are maximal, hoping that the NOPs induced by these jobs will be replaced by other jobs.
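Computing levels and the corresponding priority list can be sketched as follows (an illustrative reconstruction; the tie-breaking rule is our own choice, since the paper leaves the order among jobs of equal level arbitrary):

```python
def levels(jobs, edges, delay):
    """l(Y) = 0 if Y is a sink of G, otherwise
    l(Y) = max over immediate successors X of l(X), plus D(Y)."""
    succ = {j: [] for j in jobs}
    for (u, v) in edges:
        succ[u].append(v)
    memo = {}
    def l(y):
        if y not in memo:
            memo[y] = (0 if not succ[y]
                       else max(l(x) for x in succ[y]) + delay[y])
        return memo[y]
    return {j: l(j) for j in jobs}

def leveled_priority(jobs, edges, delay):
    """Priority list in non-increasing order of level; ties broken
    by the order of `jobs` (sorted() is stable)."""
    lv = levels(jobs, edges, delay)
    return sorted(jobs, key=lambda j: -lv[j])
```

Feeding the resulting list to a list scheduler yields a leveled schedule SLEV.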

The leveling strategy defined above is somewhat different from the conventional critical path (CP) algorithm [3], where execution times are taken into account. Consider the job system of Fig. 1(a). The levels computed for the jobs by the leveling algorithm appear near the vertices. The list 1, 4, 5, 6, 2, 7, 3, 8, 9 yields the optimal schedule S3 of Fig. 1(b). On the other hand, priority lists obtainable from the CP algorithm are 4, 5, 1, 6, 2, 7, 3, 8, 9 or 4, 1, 5, 6, 2, 7, 3, 8, 9, and their corresponding suboptimal schedules are shown in Fig. 1(b) as S1 and S2, respectively.

4.2. A Worst Case Example

Unfortunately, the ratio R = clev/copt can be made arbitrarily close to the upper bound of Section 3. In Fig. 3 a worst case example is presented. The job system T = (J, D, G) there consists of k + 1 groups of jobs. A group


FIG. 3. A worst case example for the leveling algorithm.

t, 0 ≤ t ≤ k, consists of a single type-C job Ct and d type-A jobs At1, …, Atd. For all X ∈ J except the sinks of G, D(X) = d. For 0 ≤ t ≤ k, l(Ct) = l(At1) = … = l(Atd) = td. Executing the jobs of level t in the order Ct, At1, …, Atd results in an optimal schedule with no NOPs. Thus, copt = n = (k + 1)(d + 1). The worst leveled schedule SLEV executes the jobs of level t in the order At1, …, Atd, Ct and assigns d NOPs after each type-C job except C0. Therefore, clev = n + kd = kd + (k + 1)(d + 1). We get:

R = clev/copt = (kd + (k + 1)(d + 1))/((k + 1)(d + 1))

= 2 − (k + d + 1)/(kd + k + d + 1).

By increasing k, R can be made arbitrarily close to 2 − 1/(d + 1).
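The count above can be checked numerically. A small sketch (the helper name is ours) following the expressions for copt and clev:

```python
def leveling_ratio(k, d):
    """Fig. 3 family: k + 1 groups, each a type-C job plus d type-A
    jobs; the worst leveled schedule pays d NOPs after every C_t
    except the last one executed."""
    copt = (k + 1) * (d + 1)   # n jobs, no NOPs in the optimum
    clev = copt + k * d        # kd NOPs in the worst leveled schedule
    return clev / copt
```

The result agrees with the closed form 2 − (k + d + 1)/(kd + k + d + 1) and tends to 2 − 1/(d + 1) as k grows.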

5. COFFMAN AND GRAHAM'S ALGORITHM

5.1. The Labeling Algorithm

The leveling algorithm of Section 4 provides an optimal solution to the scheduling problem when G is a tree [10]. However, for directed acyclic graphs, the worst case ratio of the leveling algorithm can be made arbitrarily close to the upper bound of list schedules (Section 4.2). In the sequel, we describe an algorithm that extends the leveling scheme of Section 4 to obtain an algorithm similar to that of Coffman and Graham [4]; the algorithm is referred to as CG.

Let T = (J, D, G) be a job system with n jobs where G = (J, E). The algorithm uses the lexicographic order among sequences of positive integers. It assigns to each job Ji of J a label λ(Ji) ∈ {1, 2, …, n}. The one-to-one mapping λ is defined as follows:

1. Compute l(Ji) for all i. This partitions J into subsets J^s, s ≥ 0, such that for all X ∈ J^s, l(X) = s. (Notice that (X, Y) ∈ E does not imply l(X) > l(Y) because D(X) may be 0.)

2. Suppose that the labels 1, 2, …, k − 1 have been assigned to the jobs in J^s, 0 ≤ s ≤ m − 1. Assign the labels k, …, k + |J^m| − 1 to the jobs in J^m by iterating through the following procedure: For each job Ji ∈ J^m for which λ has been computed for all elements of IS(Ji), let Ωi = {λ(Jj) | Jj ∈ IS(Ji)}. Sort the elements of Ωi in a decreasing order to get ωi. There exists at least one job X ∈ J^m such that ω(X) ≤ ωi for all such Ji ∈ J^m. Choose such an X and fix λ(X) = k.

Finally, the priority list L is determined by ordering jobs with the higher labels first. In this paper we concentrate on the worst case analysis of CG.
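The labeling procedure can be sketched as follows (an illustrative reconstruction; candidate order and tie-breaking among equal ω sequences are implementation choices that the paper leaves arbitrary):

```python
def cg_labels(jobs, edges, delay):
    """Assign labels level by level, lowest level first; within the
    current candidate set, label next the job whose decreasingly
    sorted sequence of successor labels (omega) is lexicographically
    smallest (cf. the Coffman-Graham criterion)."""
    succ = {j: [] for j in jobs}
    for (u, v) in edges:
        succ[u].append(v)
    def level(y, memo={}):
        if y not in memo:
            memo[y] = (0 if not succ[y]
                       else max(level(x) for x in succ[y]) + delay[y])
        return memo[y]
    label, k = {}, 1
    for s in sorted({level(j) for j in jobs}):
        remaining = [j for j in jobs if level(j) == s]
        while remaining:
            # candidates: all immediate successors already labeled
            cand = [j for j in remaining
                    if all(x in label for x in succ[j])]
            omega = {j: sorted((label[x] for x in succ[j]), reverse=True)
                     for j in cand}
            pick = min(cand, key=lambda j: omega[j])
            label[pick] = k
            k += 1
            remaining.remove(pick)
    return label
```

Python compares the ω lists lexicographically, exactly the order the criterion needs; the priority list is then the jobs in decreasing label order.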

Let us demonstrate how CG works by applying it to the job system of Fig. 4(a). First, the levels of the jobs are computed, resulting in l(J6) = l(J3) = 0, l(J4) = l(J5) = 1, and l(J1) = l(J2) = 2. Then, the labels are computed by processing jobs level by level. First, λ(J6) = 1 and λ(J3) = 2. For the jobs of level 1, Ω4 = Ω5 = {1}. Thus, we decide arbitrarily that λ(J5) = 3 and λ(J4) = 4. Now, we construct Ω1 = {4, 2} and Ω2 = {4, 3}. Since ω1 < ω2, λ(J1) = 5 and λ(J2) = 6. The resulting priority list L is

N    1  2  3  4  5  6  7

S1   1  2  3  4  5  -  6
S2   2  1  5  4  3  6

(b)

FIG. 4. A job system with d = 1 and its two schedules (the dash denotes a NOP).

ARITHMETIC ON PIPELINED MACHINES 129

2, 1, 4, 5, 3, 6, and the schedule S2 of Fig. 4(b) which corresponds to L is an optimal schedule of the job system of Fig. 4(a). Notice that by scheduling the job system of Fig. 4(a) such that J1 appears before J2 (as might happen in a leveled non-CG schedule), we can get a non-optimal schedule (S1 of Fig. 4(b)).

5.2. Worst Case Analysis of Coffman and Graham’s Algorithm

We analyze the worst behavior of CG when Di ∈ {0, d} for all i. Let the maximal level of G be defined by lmax = (max_{X ∈ J} l(X))/d. (Notice that lmax is an integer since for every job X, D(X) ∈ {0, d}.) For convenience, in the sequel, let G^s denote the set of jobs in G whose level is equal to sd (notice that G^s = J^{sd}).

Let G be a graph whose maximal level is m. Let L be a priority list of G in decreasing label order. The CG schedule SCG that corresponds to L is determined by the following procedure:

1. First, schedule the jobs of G^m in the first |G^m| time slots according to their labels.

2. Then, iteratively process the jobs of G^i, 0 ≤ i ≤ m − 1, as follows:

a. Insert (d - 1) NOPs after the last scheduled job of the previous level.

b. Then, schedule a ready job (not yet scheduled) with the highest label. If no such job exists, insert one additional NOP.

c. Schedule the rest of the jobs in G^i according to their labels. (Notice that some of the jobs of G^i could have been scheduled previously in Step 2b of higher levels.)
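Steps 1 and 2a-2c can be sketched as follows (an illustrative reconstruction under the assumption Di ∈ {0, d} with d ≥ 1; as in the text, Step 2c places the remaining jobs of the level without a readiness check):

```python
def cg_schedule(jobs, edges, delay, level, label, d):
    """Assemble an SCG schedule: the top level by label, then per
    lower level insert d-1 NOPs, place the highest-labeled ready job
    (or one more NOP), and fill in the rest of the level by label."""
    preds = {j: [] for j in jobs}
    for (u, v) in edges:
        preds[v].append(u)
    sched, slot_of = [], {}

    def place(j):
        sched.append(j)
        slot_of[j] = len(sched)            # 1-based time slot

    def ready(j):                          # slot(S, Ji) <= t - 1 - Di
        t = len(sched) + 1
        return all(p in slot_of and slot_of[p] <= t - 1 - delay[p]
                   for p in preds[j])

    def by_label(js):
        return sorted(js, key=lambda j: -label[j])

    m = max(level.values()) // d
    for j in by_label([x for x in jobs if level[x] == m * d]):  # Step 1
        place(j)
    for i in range(m - 1, -1, -1):                              # Step 2
        sched.extend(['NOP'] * (d - 1))                         # 2a
        cand = [j for j in by_label(jobs)
                if j not in slot_of and ready(j)]
        if cand:
            place(cand[0])                                      # 2b
        else:
            sched.append('NOP')
        for j in by_label([x for x in jobs                      # 2c
                           if level[x] == i * d and x not in slot_of]):
            place(j)
    return sched
```

On a small instance satisfying P1, the construction produces exactly lmax(d − 1) NOPs, as Theorem 6 below asserts.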

Let c(SCG) = ccg. Notice that the only NOPs of SCG are those inserted into CG schedules in Steps 2a and 2b of the above process. Also, notice that CG schedules of G have at least lmax(d - 1) NOPs. Finally, notice that CG schedules are not necessarily list schedules. Alternatively, we could determine a schedule SLIST of G by applying the list scheduling process of Section 3 to L. Since sometimes we deliberately insert unnecessary NOPs into CG schedules, there may be a situation in which SLIST has fewer NOPs than a CG schedule corresponding to L. However, in Theorem 6 we prove that if there exists a schedule S of G without d consecutive NOPs then there are exactly lmax(d - 1) NOPs in a CG schedule of G. This allows us to prove that CG schedules are worse than the optimal by at most a factor of 2 - 2/(d + 1). Moreover, there is a family of examples (one of which will be presented in Section 5.3) in which every schedule constructed from L by a process in which, whenever there are two (or more) ready jobs, a job with the highest label is chosen first (this includes CG schedules and


list scheduling), achieves the bound of 2 - 2/(d + 1) asymptotically. Thus, in trying to derive a 2 - 2/(d + 1) bound, it suffices to concentrate on CG schedules. The following lemma is a counterpart of Lemma 1 in the context of CG schedules.

LEMMA 4. Let T = (J, D, G) be a job system with n jobs, and let copt = n + k. Let SCG be a schedule of T which is generated by CG such that R = ccg/copt is maximal. Then there exists a job system T′ whose optimal schedules contain less than k NOPs and a CG schedule SCG′ of T′ such that R′ = ccg′/copt′ ≥ R.

Proof. To prove the lemma, we show how to build T′ from T. We add to T an isolated job A (with no immediate successors, no immediate predecessors, and zero delay) to get T′. Let SOPT be an optimal schedule of T. SOPT has k NOPs. An optimal schedule SOPT′ of T′ can be obtained from SOPT by replacing one of the NOPs of SOPT by A. Therefore, copt′ = copt. Notice that A is at level 0, and it does not affect the labels of the other jobs in T′. Also, ccg ≥ copt. Therefore, SCG has at least k NOPs. First, notice that if the level-0 jobs of J are scheduled in SCG′ in the same time slots as in SCG then A can be scheduled either in one of the NOPs of SCG or after all the jobs of SCG have been scheduled. In both cases this leads to ccg′ ≥ ccg. Also, in every CG schedule, exchanging two jobs at level 0 does not change the completion time. Therefore, ccg′ ≥ ccg, which leads to R′ ≥ R. □

Since we are interested only in the worst case analysis of CG, Lemma 4 justifies the following assumption:

P1. There exists a schedule S of T with no instance of d consecutive NOPs. (S is not necessarily a list schedule.)

Let T = (J, D, G) be a job system with n jobs. Assume, by P1, that there exists a schedule S of T with no instance of d consecutive NOPs. In the sequel, we prove that the number of NOPs in CG schedules of T is exactly lmax(d − 1). Before we proceed with the main theorem, we need the following result.

Let F be a set of the vertices in G. We define G’ = G - F to be a subgraph of G obtained by removing from G all the vertices that appear in F and all the edges that have at least one of their endpoints in F.

LEMMA 5. Let T = (J, D, G) be a job system with n jobs where G is a graph whose maximal level is m > 0, and assume that there exists a schedule S of T which has no d consecutive NOPs. Let G′ = G − G^m, let T′ = (J′, D′, G′) be the job system obtained from T by removing all the jobs included in G^m, and let W be a job in J′ such that slot(S, W) is minimal. Then


there exists a schedule S′ of T′ with no instance of d consecutive NOPs such that S′1 = W.

Proof. By assumption, there exists a schedule S of T with no instance of d consecutive NOPs. If all the jobs of G^m are scheduled in S prior to all the jobs of G′ then S′ is obtained from S by removing all the jobs of G^m, and clearly then S′1 = W. Otherwise, assume that there exists an X ∈ G^m such that slot(S, B) < slot(S, X) < slot(S, B̄) where B, B̄ ∈ G′. The removal of X may (potentially) create d or more consecutive NOPs in S′. Squeezing S′ to get rid of unnecessary NOPs does not necessarily help, and we may be left with d NOPs nevertheless. If this is the case then the d consecutive NOPs must be surrounded by some B and B̄ such that (B, B̄) ∈ G′ and D(B) = d. Thus, l(B) ≥ l(B̄) + d. Notice that since X ∈ G^m, l(X) ≥ l(B) + d. Also, since m > 0, there exists a (not necessarily immediate) successor X̄ of X such that X̄ ∈ G^{m−1}, i.e., l(X̄) = l(X) − d.

Without loss of generality we can assume that slot(S, X̄) > slot(S, B̄); otherwise we must have slot(S, X) < slot(S, X̄) < slot(S, B̄), and no d consecutive NOPs are created as a result of the deletion of X.

Choose a job V ∈ G′ such that:

1. slot(S, V) > slot(S, B̄).

2. l(V) is maximal.

3. slot(S, V) is minimal.

Since l(V) ≥ l(X̄) ≥ l(B) and D(B) = d (notice that V may be X̄ itself), V is a successor of neither B nor B̄. Also, since slot(S, V) > slot(S, B̄), V is not a predecessor of B̄. Thus, since slot(S, V) is minimal, nothing prevents us from scheduling V in time slot slot(S, B̄) − 1 of S.

To complete the proof we have to show that by rescheduling V in time slot slot(S, B̄) − 1 we do not create new d consecutive NOPs. Let A and Ā be the jobs which are scheduled in S immediately before and after V, respectively. (Notice that there might be a few NOPs separating A and Ā from V.) If by rescheduling V, d consecutive NOPs are created, then (A, Ā) ∈ G and D(A) = d. In this case, slot(S, A) < slot(S, V). If A = B̄ then l(V) ≥ l(X̄) ≥ l(B̄) + d = l(A) + d. Otherwise, slot(S, A) > slot(S, B̄). Thus, by the choice of V, l(V) ≥ l(A) + d; otherwise A could have been chosen to be scheduled in time slot slot(S, B̄) − 1 instead of V. Now we can apply the above argument to V (and repeatedly to its successors) instead of X, proving that the d consecutive NOPs are removed. Notice that all the transformations applied to S do not affect W. Therefore, S′1 = W, and the lemma is proved. □


THEOREM 6. Let T = (J, D, G) be a job system with n jobs where G is a graph whose maximal level is lmax, and assume that there exists a schedule S of T with no instance of d consecutive NOPs. Then there exists a CG schedule SCG of T which has k NOPs such that k = lmax(d - 1).

Proof. By induction on lmax.

Basis. If lmax = 0 we construct the SCG schedule of T by scheduling the jobs with the higher labels first. Since such an SCG has no NOPs, we are done.

Induction hypothesis. Assume that the theorem holds for all job systems whose graph G has lmax < m.

Induction step. Let T be a job system whose graph G has lmax = m > 0 and assume that there exists a schedule S of T that has no d consecutive NOPs. Let G′ = G − G^m and let T′ = (J′, D′, G′) be the job system obtained from T by removing all the jobs included in G^m. By Lemma 5, there exists a schedule S′ of T′ which has no d consecutive NOPs. Thus, by the induction hypothesis, there exists a CG schedule SCG′ of T′ such that k′ = (m − 1)(d − 1), where k′ is the number of NOPs in SCG′.

Now we discuss the construction of SCG from SCG′. First, notice that for every X ∈ G′, λ(X) computed in G′ is identical to that computed in G. Let Y be the job in G′ with a maximal label. Thus, Y is the first job scheduled in SCG′. Since l(Y) ≤ (m − 1)d, for every X ∈ G^m, λ(X) > λ(Y). Thus, in the first |G^m| time slots of SCG all the jobs of G^m are scheduled in the order of their labels, and for |G^m| + 1 ≤ i ≤ |G^m| + d − 1, SCGi = NOP.

Case 1. There exists a job X ∈ G^m such that X is not a predecessor of Y.

CLAIM 1. Y is ready in time slot |G^m| + d of SCG.

Proof of Claim 1. Since all the jobs of G^m are of level md, they are labeled after all the jobs of lower levels have already been labeled. Consider the subset Ḡ^m of G^m which contains all the jobs whose immediate successors are of level at most (m − 1)d. There exists at least one job in Ḡ^m which is not a predecessor of Y; otherwise all the jobs of G^m would be predecessors of Y. Notice that λ(Y) appears in the ω sequences constructed for the jobs in Ḡ^m only if they are Y's immediate predecessors. Let X be a job with a smallest label in Ḡ^m. (Notice that X has a smallest label among all the jobs in G^m.) Thus, since Y has a highest label in G′, λ(Y) does not appear in ω(X). Therefore, X is not a predecessor of Y. Notice that since X has a smallest label in G^m, slot(SCG, X) = |G^m|. Therefore, Y is ready in time slot |G^m| + d of SCG. □

Since by Claim 1, Y is ready in time slot |G^m| + d, it must be scheduled there in SCG. Thus, SCG is constructed from SCG′ as follows: for |G^m| + d ≤ i ≤ |G^m| + d − 1 + |SCG′|, SCGi = SCG′j, where j = i − (|G^m| + d − 1). Since SCG′1 = Y, SCG is a legal schedule, and since k′ = (m − 1)(d − 1), k = m(d − 1).

Case 2. Every X ∈ G^m is a predecessor of Y.

Since S has no d consecutive NOPs, there exists a job in G′ which is not a successor of all X ∈ G^m. Let U be such a job with a maximal label. Notice that U is a leaf of G′.

CLAIM 2. U is ready in time slot |G^m| + d of SCG.

Proof of Claim 2. Consider again the subset Ḡ^m of G^m which contains all the jobs whose immediate successors are of level at most (m − 1)d. There exists at least one job in Ḡ^m which is not a predecessor of U; otherwise all the jobs of G^m are predecessors of U. Let X be a job with a smallest label in Ḡ^m. (Notice that X has a smallest label among all the jobs in G^m.) To prove the claim, it is sufficient to show that X is not a predecessor of U. Then, since slot(SCG, X) = |G^m|, U is ready in time slot |G^m| + d of SCG.

Assume, by contradiction, that X is a predecessor of U. Thus, since X ∈ Ḡ^m and U is a leaf of G′, U ∈ IS(X). Therefore, λ(U) appears in ω(X). Since U is not a successor of all the jobs in G^m, there exists a job W in Ḡ^m such that U ∉ IS(W). Thus, λ(U) does not appear in ω(W). By the choice of X, λ(W) > λ(X). Since both X and W belong to Ḡ^m, they were candidates for labeling at the same time. Thus, the only reason for λ(W) > λ(X) is ω(W) ≥ ω(X). Since λ(U) appears in ω(X) and does not appear in ω(W), we conclude that ω(W) ≠ ω(X). Thus, ω(W) > ω(X). Therefore, there must exist a job K in IS(W) − IS(X) such that λ(K) > λ(U). Since λ(K) > λ(U), by the choice of U, K must be a successor of all the jobs in G^m, and in particular of X. Since K ∉ IS(X), there exists a job V ∈ IS(X) which is a predecessor of K. Notice that λ(V) > λ(K) > λ(U). Again, by the choice of U, V is a successor of all the jobs in G^m, and in particular of W. Thus, we have a situation in which V is a successor of W, K is a successor of V, and K ∈ IS(W). Hence, the edge (W, K) is transitive in G, a contradiction. □

By Claim 2, U must be scheduled in time slot |G^m| + d of SCG. Let G̃ = G′ − {U} and let T̃ = (J̃, D̃, G̃) be the job system obtained from T′ by removing U. To complete the proof of the theorem, we must show that T̃ can be computed in SCG starting in time slot |G^m| + d + 1 in the form dictated by the structure of CG schedules.

CLAIM 3. There exists a schedule S̃ of T̃ which has no d consecutive NOPs.


Proof of Claim 3. Consider again a schedule S′ of T′ which has no d consecutive NOPs. First, assume that U is a job in T′ such that slot(S′, U) is minimal. Thus, by Lemma 5, there exists a schedule S″ of T′ which has no d consecutive NOPs such that S″_1 = U. By removing U from S″, we get S̃, proving the claim. Now let Z ≠ U be a job in G′ such that slot(S′, Z) is minimal. Notice that Z is not a successor of all the jobs in G^m; otherwise S′ must have d consecutive NOPs. Thus, by the choice of U, λ(U) > λ(Z). Notice that Z is a leaf of G′. Let Ĝ = G′ − {Z} and let T̂ = (Ĵ, D̂, Ĝ) be the job system obtained from T′ by removing Z. Again, by Lemma 5, there exists a schedule S″ of T′ which has no d consecutive NOPs such that S″_1 = Z. By removing Z from S″, we get a schedule Ŝ of T̂ which has no d consecutive NOPs. Thus, by the induction hypothesis, there exists a CG schedule ŜCG of T̂ which has no d consecutive NOPs. Below we show how to modify ŜCG to get S̃.

Let us remove U from ŜCG and insert Z instead to get S̃. Clearly, the resultant schedule has no d consecutive NOPs; however, we still must show that it is a legal schedule. First, notice that, by the structure of CG schedules, since U is a leaf of G′ and λ(U) > λ(Z), there does not exist a successor Z′ of Z such that slot(ŜCG, Z′) < slot(ŜCG, U). Thus, since Z is a leaf of G′, nothing prevents us from scheduling it in time slot slot(ŜCG, U) of ŜCG. If m = 1 then l(Z) = 0, and the resultant schedule S̃ is legal. Thus, assume that m > 1. Also, if Z has no immediate successors or D(Z) = 0, similarly we are done. Otherwise, assume that D(Z) = d. What we still have to prove is that there is no successor Z′ of Z such that slot(S̃, Z′) ≤ slot(ŜCG, U) + d. Let Z′ be a successor of Z such that slot(S̃, Z′) is minimal. Notice that since λ(U) > λ(Z), l(U) ≥ l(Z) ≥ l(Z′) + d. Let l(U) = sd (s ≥ 1). Thus, U ∈ G^s, and Z′ belongs to G^i for some 0 ≤ i ≤ s − 1.

Subcase 3.1. ŜCG_{slot(ŜCG, U)+1} ≠ NOP. Let K = ŜCG_{slot(ŜCG, U)+1}. By the structure of CG schedules, K belongs to G^i, i ≥ s. Since Z′ belongs to G^i for some 0 ≤ i ≤ s − 1, and there are (d − 1) consecutive NOPs between levels in CG schedules, taking into account K itself, we get slot(S̃, Z′) > slot(ŜCG, U) + d.

Subcase 3.2. ŜCG_{slot(ŜCG, U)+1} = NOP. By the structure of CG schedules, there are (d − 1) consecutive NOPs in time slots slot(ŜCG, U) + 1, ..., slot(ŜCG, U) + d − 1. Let Z′ = ŜCG_{slot(ŜCG, U)+d}. If Z′ ∉ IS(Z), we are done. Therefore, assume that Z′ ∈ IS(Z). Now consider SCG′. Since U is a leaf of G′ and λ(U) > λ(Z), notice that, starting in time slot 1 and up to (and including) time slot slot(ŜCG, U), both ŜCG and SCG′ are identical. Thus, since SCG′_{slot(SCG′, U)+d} = Z′ and Z′ ∈ IS(Z), either slot(SCG′, Z) = slot(SCG′, U) + 1 or slot(SCG′, Z) = slot(SCG′, U) + d. In both cases, since Z is a leaf of G′, by removing U from SCG′ and advancing the part of SCG′ which follows slot(SCG′, U) by one time unit, we get a schedule S̃ of T̃ which has no d consecutive NOPs. □

By Claim 3, we can apply the induction hypothesis to T̃. Thus, there exists a CG schedule S̃CG of T̃ such that k̃ = (m − 1)(d − 1), where k̃ is the number of NOPs in S̃CG. Notice that S̃CG_1 = Y. Thus, SCG is constructed from S̃CG as follows: for |G^m| + d + 1 ≤ i ≤ |G^m| + d + |S̃CG|, SCG_i = S̃CG_{i−(|G^m|+d)}. If m = 1 then l(U) = 0, and clearly SCG is a legal schedule. Thus, assume that m > 1. If U has no successors or D(U) = 0 then again SCG is a legal schedule. Finally, assume that D(U) = d, and let U′ be an immediate successor of U in G̃ such that slot(S̃CG, U′) is minimal. Notice that since λ(Y) > λ(U), l(Y) ≥ l(U) ≥ l(U′) + d. Therefore, U′ ∉ G^{m−1}. Since there are exactly (d − 1) NOPs in S̃CG between the last job of G^{m−1} and the first job which is scheduled afterwards, and taking into account Y itself, we get slot(S̃CG, U′) > d. Thus, SCG is a legal schedule, and since k̃ = (m − 1)(d − 1), k = m(d − 1), and the proof is completed. □

THEOREM 7. Let T = (J, D, G) be a job system with n jobs and let SCG be a schedule generated by the CG algorithm for T. Then R = ccg/copt ≤ 2 − 2/(d + 1).

Proof. Let the maximal level of G be lmax = md. First, notice that copt ≥ m(d + 1) + 1. Also, notice that copt ≥ n. Let k be the number of NOPs in SCG. By Lemma 4, we may assume that there exists a schedule S of T which has no d consecutive NOPs. Thus, by Theorem 6, k = m(d − 1), and ccg = n + k = n + m(d − 1). Therefore, R = ccg/copt = (n + m(d − 1))/copt ≤ (n/copt) + m(d − 1)/(m(d + 1) + 1) < 1 + (d − 1)/(d + 1) = 2d/(d + 1) = 2 − 2/(d + 1). □
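As an illustrative sanity check (our own sketch, not part of the paper; the names `bound` and `ratio_upper` are ours), the inequality chain of the proof can be verified numerically over a grid of parameters. Here `ratio_upper` combines Theorem 6 with the two lower bounds on copt used above:

```python
from fractions import Fraction

def bound(d):
    # the worst case ratio of Theorem 7: 2 - 2/(d + 1)
    return 2 - Fraction(2, d + 1)

def ratio_upper(n, m, d):
    # ccg = n + k with k = m(d - 1) NOPs (Theorem 6, maximal level md)
    ccg = n + m * (d - 1)
    # copt >= n (one instruction issued per cycle) and copt >= m(d + 1) + 1
    # (a chain of m + 1 jobs, each result usable d + 1 cycles after issue)
    copt_lower = max(n, m * (d + 1) + 1)
    return Fraction(ccg, copt_lower)

# exhaustive check over a small grid of parameters
assert all(ratio_upper(n, m, d) <= bound(d)
           for d in range(1, 8)
           for m in range(1, 20)
           for n in range(1, 100))
```

Note that for d = 1 the ratio can equal the bound (R = 1, i.e., optimality), so the check uses ≤; for d ≥ 2 the inequality is strict.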

COROLLARY 8. Let T = (J, D, G) be a job system such that for all i, D_i ≤ 1. Then CG is an optimal algorithm.

Proof. Immediate from Theorem 7: for d = 1 the bound becomes R ≤ 2 − 2/2 = 1. □

Finally, let us consider the relationship between a CG schedule SCG and a list schedule SLIST, both of which correspond to a priority list L determined by CG for T = (J, D, G) whose graph G has maximal level md. First, notice that for 1 ≤ i ≤ |G^m|, SLIST_i = SCG_i. Furthermore, the only difference between SLIST and SCG might be that in SLIST several jobs are advanced (in accordance with the list scheduling policy) to replace the NOPs of SCG. Each such job replaces one NOP of SCG and creates at most one new NOP. Therefore, the number of NOPs in SLIST is at most that of SCG. On average, list schedules have an advantage over CG schedules; however, in the worst case both classes of schedules achieve the same asymptotic bound.

5.3. A Worst Case Example

In Fig. 5 we present a worst case example for CG. We are going to show that every scheduling process which uses the labeling algorithm of Section 5.1 to determine the relative priority of jobs in a job system cannot improve on the 2 - 2/(d + 1) bound.

The job system T = (J, D, G) of Fig. 5 consists of k + 1 groups of jobs. The tth group, 0 ≤ t ≤ k, consists of one type-C job C_t, (d − 1) type-A jobs A_{t1}, ..., A_{t,d−1}, and one type-B job B_t. For all X ∈ J except for the sinks of G, D(X) = d. Executing the jobs of group t in the order C_t, A_{t1}, ..., A_{t,d−1}, B_t results in an optimal schedule with no NOPs. Thus, copt = n = (k + 1)(d + 1).

Now apply the labeling algorithm to T. For 0 ≤ t ≤ k, l(C_t) = l(A_{t1}) = ··· = l(A_{t,d−1}) = l(B_t) = td. One of the possible labelings of the jobs of level 0 is one in which λ(B_0) > λ(C_0), λ(A_{01}), ..., λ(A_{0,d−1}). This implies that C_1 gets a label which is lower than the labels of the other jobs of level d. Also, B_1 might get a label which is higher than the labels of the other jobs of level d. By a similar argument, we conclude that there exists a labeling for the graph of Fig. 5 such that for all 0 ≤ t ≤ k, λ(B_t) > λ(A_{t1}), ..., λ(A_{t,d−1}) and λ(C_t) < λ(A_{t1}), ..., λ(A_{t,d−1}). Finally, in the resultant priority list L the jobs of group t, 0 ≤ t ≤ k, appear in the order B_t, A_{t1}, ..., A_{t,d−1}, C_t.

To get a schedule S, apply to L any scheduling process which, whenever there are two (or more) ready jobs, chooses a job with the highest label to be scheduled next. First, all the jobs of level kd are scheduled in S in time slots 1, ..., d + 1, in an order such that slot(S, C_k) = d + 1. The following (d − 1) time slots of S are NOPs, since every job of group t, 0 ≤ t ≤ k − 1, is a successor either of C_k or of B_k. Subsequently, B_{k−1} is scheduled in time slot 2d + 1 of S, since only B_{k−1} is ready at this time slot. This procedure is iterated, resulting in a schedule S which has (d − 1) NOPs after each type-C job except C_0. Notice that S is a CG schedule which corresponds to L, and also S is a list schedule which corresponds to L. Thus, ccg = n + k(d − 1) and R = ccg/copt = 1 + k(d − 1)/((k + 1)(d + 1)). By increasing k, R can be made arbitrarily close to 2 − 2/(d + 1). We compare the upper bound of list schedules and the worst case ratio of CG in the table of Fig. 6.

FIG. 5. A worst case example for CG.

    d              1     2     3     4     5     10
    2 − 1/(d + 1)  1.5   1.67  1.75  1.8   1.83  1.91
    2 − 2/(d + 1)  1     1.33  1.5   1.6   1.67  1.82

FIG. 6. Comparison of the worst case ratios of the leveling algorithm and CG.
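The arithmetic of the worst case example can be replayed mechanically. The following is our own illustrative sketch (the name `worst_case_ratio` is not from the paper); it also reproduces the 2 − 2/(d + 1) row of Fig. 6:

```python
from fractions import Fraction

def worst_case_ratio(k, d):
    # Fig. 5 example: k + 1 groups of d + 1 jobs, copt = n = (k + 1)(d + 1),
    # and CG produces k(d - 1) NOPs, so R = 1 + k(d - 1)/((k + 1)(d + 1)).
    n = (k + 1) * (d + 1)
    return Fraction(n + k * (d - 1), n)

d = 3
limit = 2 - Fraction(2, d + 1)              # asymptotic bound 2 - 2/(d + 1)
assert worst_case_ratio(10**6, d) < limit   # always strictly below the bound
assert limit - worst_case_ratio(10**6, d) < Fraction(1, 10**5)  # but arbitrarily close

# reproduce the second row of Fig. 6
row = [round(float(2 - Fraction(2, d + 1)), 2) for d in (1, 2, 3, 4, 5, 10)]
assert row == [1.0, 1.33, 1.5, 1.6, 1.67, 1.82]
```

Exact rational arithmetic (`Fraction`) avoids any floating-point doubt about how close R gets to the bound.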

6. CONCLUSIONS

In this paper we presented a leveling algorithm to schedule tasks under pipelined constraints. However, its worst case ratio, which is identical to the upper bound of list schedules, is achieved when an arbitrary (worst possible) choice among jobs of the same level is made. By applying a lexicographic order criterion similar to that of Coffman-Graham's algorithm to jobs of the same level, we are able to improve the worst case ratio if all the delays are either d cycles or 0. The case in which the delays are allowed to be any integer between 0 and d was not treated.

It is interesting to mention how the CG algorithm applies to pipelined machines in which a different delay may be assigned to each edge of the precedence graph, rather than to a vertex as was defined in Section 2. It turns out that in this case transitive edges cannot be omitted from the precedence graph, since they may impose additional constraints on the legal schedules of a job system. For such a model, the CG algorithm should be modified as follows:

1. Compute the levels (l) for all the jobs as defined in Section 4, taking into consideration all the edges (including the transitive ones) of the precedence graph.


2. Compute the labels (λ) for all the jobs as defined in Section 5, taking into consideration only the non-transitive edges of the precedence graph.

3. Build the priority list L by ordering jobs with higher labels first.

The worst case ratio R that was proved in Section 5 for CG can be shown to hold for the modified algorithm described above when all the delays are either d cycles or 0.
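As a sketch only (our code, not the authors'; the input maps `im_succs`, `all_succs`, and `delay` are assumed representations of the non-transitive edges, all edges, and the per-edge delays), the three steps above might look as follows:

```python
def compute_levels(jobs, all_succs, delay):
    # Step 1: l(X) = 0 for sinks; otherwise the maximum over all successors Y
    # (including transitive edges) of l(Y) + delay of edge (X, Y).
    memo = {}
    def l(x):
        if x not in memo:
            memo[x] = max((l(y) + delay[(x, y)] for y in all_succs.get(x, ())),
                          default=0)
        return memo[x]
    return {x: l(x) for x in jobs}

def cg_priority_list(jobs, im_succs, all_succs, delay):
    level = compute_levels(jobs, all_succs, delay)
    # Step 2: Coffman-Graham-style labels over the non-transitive edges only:
    # jobs of lower level are labeled first; within a level, the next (smaller)
    # label goes to a job whose decreasing sequence of immediate-successor
    # labels is lexicographically smallest.
    label = {}
    unlabeled = set(jobs)
    for next_label in range(1, len(jobs) + 1):
        ready = [x for x in unlabeled
                 if all(y in label for y in im_succs.get(x, ()))]
        x = min(ready, key=lambda x: (level[x],
                                      sorted((label[y] for y in im_succs.get(x, ())),
                                             reverse=True)))
        label[x] = next_label
        unlabeled.remove(x)
    # Step 3: the priority list orders jobs with higher labels first.
    return sorted(jobs, key=lambda x: -label[x])
```

For a three-job chain a → b → c with an extra transitive edge (a, c) and all delays equal to 2, the levels come out l(c) = 0, l(b) = 2, l(a) = 4, and the resulting priority list is [a, b, c].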

Another direction for further research is to improve CG. As we mentioned in the Introduction, there is a certain similarity between scheduling with pipelined constraints and multiprocessor scheduling. The same worst case bound which we proved here for CG was shown for CG in multiprocessor scheduling [11]. Moreover, in [11] an example is given which proves that as long as CG uses the current leveling strategy, its worst case ratio cannot be improved (asymptotically) beyond 2 − 2/(d + 1), even if an "optimal" criterion is used for jobs of the same level. Therefore, the only way to do better than CG is to refine the leveling criterion by taking into consideration the number of immediate successors of a job and their levels. It is an open question whether the worst case ratio of CG can be improved at all.

ACKNOWLEDGMENTS

We want to thank the referees for valuable suggestions that improved the presentation of the paper.

REFERENCES

1. D. BERNSTEIN AND I. GERTNER, "Computing Expressions on a Pipelined Processor with a Small Number of Stages," EE Pub. No. 594, Dept. of Elec. Eng., Technion, Haifa, Israel, June 1986.

2. J. BRUNO, J. W. JONES, AND K. SO, Deterministic scheduling with pipelined processors, IEEE Trans. Comput. C-29, No. 4 (1980), 308-316.

3. E. G. COFFMAN, "Computer and Job-Shop Scheduling Theory," Wiley, New York, 1976.

4. E. G. COFFMAN AND R. L. GRAHAM, Optimal scheduling for two-processor systems, Acta Inform. 1 (1972), 200-213.

5. N. F. CHEN AND C. L. LIU, On a class of scheduling algorithms for multiprocessors computing systems, in "Lecture Notes in Computer Science, Vol. 24," pp. 1-16, Springer-Verlag, New York, 1975.

6. T. C. HU, Parallel sequencing and assembly line problems, Oper. Res. 9, No. 6 (1961), 841-848.

7. K. HWANG AND F. A. BRIGGS, "Computer Architecture and Parallel Processing," McGraw-Hill, New York, 1984.

8. J. L. HENNESSY AND T. R. GROSS, Postpass code optimization of pipeline constraints, ACM Trans. Programming Lang. Systems 5, No. 3 (1983), 422-448.

9. P. M. KOGGE, "The Architecture of Pipelined Computers," McGraw-Hill, New York, 1981.

10. H. F. LI, Scheduling trees in parallel/pipelined processing environments, IEEE Trans. Comput. C-26, No. 11 (1977), 1101-1112.

11. S. LAM AND R. SETHI, Worst case analysis of two scheduling algorithms, SIAM J. Comput. 6, No. 3 (1977), 518-536.

12. D. A. PATTERSON, Reduced instruction set computers, Comm. ACM 28, No. 1 (1985), 8-21.

13. R. M. RUSSELL, The Cray-1 computer system, Comm. ACM 21, No. 1 (1978), 63-72.

14. R. SETHI, Algorithms for minimal length schedules, in "Computer and Job-Shop Scheduling Theory" (E. G. Coffman, Ed.), pp. 51-99, Wiley, New York, 1976.