


IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 3, NO. 5, SEPTEMBER 1992 573

Requirements for Optimal Execution of Loops with Tests

Augustus K. Uht, Member, IEEE

Abstract-Both the efficient execution of branch intensive code, and knowing the bounds on same, are important issues in computing in general and supercomputing in particular. In prior work, it has been suggested, implied, or left as a possible maximum, that the hardware needed to execute code with branches optimally, as with an oracle, is exponentially dependent on the total number of dynamic branches executed, this number of branches being proportional at least to the number of iterations of the loop. For classes of code taking at least one cycle per iteration to execute, this is not the case. For loops containing one test (normally in the form of a Boolean recurrence of order 1), it is shown that the hardware necessary varies from exponential to polynomial in the length of the dependence cycle L, while execution time varies from one time cycle per iteration to less than L time cycles per iteration; the variation depends on specific code dependences. These results bring the eager evaluation of imperative code closer to fruition.

Index Terms-Branch or control dependences, branch prediction, concurrency, control-flow, eager evaluation, FORTEST loops, general purpose computers, optimal loop execution, parallelism, supercomputers.

I. INTRODUCTION

As supercomputers and parallel machines are targeted at more general purpose and nonnumeric applications, their ability to execute code containing a high degree of dynamically varying control flow becomes more and more crucial. It is desirable not only to execute all code as quickly as possible, but also to know when code is being executed as quickly as possible. Put another way, one would like to know how fast code can be executed, and what the hardware requirements are to achieve that execution. Basically, knowing both resource and time bounds is useful.

With a variety of hardware or software techniques [1], [4], [7], [10], [13]-[15], [18] it is possible to execute certain classes of loops in saturation, meaning that it takes one cycle to execute a complete iteration of a loop, with the instruction instances (an instance corresponds to execution of an instruction in one iteration) possibly coming from different iterations, and each instruction taking one cycle to execute. Put another way, one instance of every static instruction in the loop is executed every cycle; the instances are likely from

Manuscript received November 14, 1990; revised June 13, 1991. This work was supported in part by the Academic Senate of the University of California at San Diego, and in part by the National Science Foundation under Grant CCR-8910586.

The author is with the Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093-0114.

IEEE Log Number 9200947.

Code:

0.       I = 0
1. Loop: I = I + 1
2.       A[I] = A[I-1] + C
3.       D[I] = A[I] + E
4.       IF (I < 10) GOTO Loop

Timing:

Cycle 0: IT(eration instance) 1 of I0
Cycle 1: IT 1 of I1
Cycle 2: IT 2 of I1, IT 1 of I2, IT 1 of I4
Cycle 3: IT 3 of I1, IT 2 of I2, IT 1 of I3, IT 2 of I4  <- saturation
Cycle n: IT n of I1, IT n-1 of I2, IT n-2 of I3, IT n-1 of I4

Fig. 1. Example of loop saturation.

different iterations of the loop. This is the upper performance bound of the kind of code execution we seek to achieve.

For an example, see Fig. 1. In cycle 3, saturation has been achieved. In essence, a pipeline is formed between instructions I1-I3. IT(eration instance) 1 of I3 uses the result of I2 in the previous cycle (IT 1 in cycle 2), which in turn used the result of I1 in the cycle previous to that (IT 1 in cycle 1). Since every instruction in the loop is executed once every cycle, we say that the loop is executing at a rate of one cycle per iteration. In other words, a loop of 1000 iterations will execute in approximately 1000 cycles at this rate. This is a case of there being a dependence cycle of length L = 1 in the loop; specifically, for both instructions I1 and I2, each instruction data-depends on the immediately preceding iteration's instance of the same instruction. If the data dependences in the loop are more constraining, the loop may execute in multiple cycles per iteration; this is shown below. Dependence cycles limit the performance of loops, and are thus key elements of our discussion.
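The saturation schedule of Fig. 1 can be checked with a small earliest-start scheduler. The sketch below is not from the paper; it assumes unit-latency instructions and the dependences described in the text (I1 on the previous iteration's I1, I2 on this iteration's I1 and the previous I2, I3 on this iteration's I2, I4 on this iteration's I1):

```python
# Earliest-start schedule for the loop of Fig. 1 (a sketch, not the
# paper's code). t[instr][n] is the cycle in which instance n of instr
# executes; I0 (the initialization, I = 0) plays the role of instance 0
# of the index chain, executing in cycle 0.

def schedule(iterations):
    t = {"I1": {0: 0}, "I2": {0: 0}, "I3": {}, "I4": {}}
    for n in range(1, iterations + 1):
        # I1 (I = I + 1): uses the previous iteration's index
        t["I1"][n] = t["I1"][n - 1] + 1
        # I2 (A[I] = A[I-1] + C): uses this I1's result and the previous I2
        t["I2"][n] = max(t["I1"][n] + 1, t["I2"][n - 1] + 1)
        # I3 (D[I] = A[I] + E): uses this iteration's I2
        t["I3"][n] = t["I2"][n] + 1
        # I4 (the branch): uses this iteration's index
        t["I4"][n] = t["I1"][n] + 1
    return t
```

Running this reproduces the timing table: IT 1 of I1 in cycle 1, IT 1 of I2 and IT 2 of I1 in cycle 2, and so on, with each instruction's instances thereafter spaced one cycle apart, i.e., a rate of one cycle per iteration.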

A dependence cycle occurs in a loop when one instance of an instruction directly or indirectly uses the result of a previous instance of the same instruction; see Fig. 2. The top of the figure graphically illustrates the dependence cycle between I3 and I6. The bottom part depicts what happens upon execution of the loop; it shows the dynamic execution of the code in the cycle. The first instance of I6 is dependent on the first instance of I3; and the second instance of I3 is dependent on the first instance of I6, completing the dependence cycle. Since there are two instructions in the cycle, the cycle has length L = 2. Notice that an implication of this is that the loop executes at

1045-9219/92$03.00 © 1992 IEEE



Static code:

1. <initialize>
2. LOOP:  <update TEST; I = I + 1>
3.        B = A[I-1]
4.        <other unconditional code>
5.        <other unconditional code>
6.        A[I] = B OP Q[I]
7.        IF f2(TEST) GOTO LOOP

I6 is flow dependent on I3 in the same iteration, and I3 is flow dependent on I6 in the previous iteration.

Dynamic code (execution of I3 and I6); two cycles per iteration:

Cycle 0: I = 1
Cycle 1: B = A[0]
Cycle 2: A[1] = B   ...
Cycle 3: B = A[1]
Cycle 4: A[2] = B   ...
Cycle 5: B = A[2]

Fig. 2. Dependence cycle illustration.

a rate of 2 time cycles per iteration; it can go no faster in its current form.
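The rate argument generalizes: a dependence cycle of L unit-latency instructions forces a steady-state rate of L cycles per iteration. The sketch below (my illustration, not the paper's) computes the earliest-start schedule for a ring of L instructions arranged as I3/I6 are in Fig. 2:

```python
# Earliest-start schedule for a ring of L unit-latency instructions:
# instruction k of an iteration uses instruction k-1 of the same
# iteration, and instruction 0 uses the last instruction of the
# previous iteration (as I3 uses I6 of the previous iteration).

def ring_schedule(L, iterations):
    starts = []  # starts[n][k] = cycle in which instr k of iteration n runs
    for n in range(iterations):
        row = []
        for k in range(L):
            if k > 0:
                row.append(row[k - 1] + 1)            # same-iteration chain
            elif n == 0:
                row.append(1)                          # first instance, cycle 1
            else:
                row.append(starts[n - 1][L - 1] + 1)   # cross-iteration link
        starts.append(row)
    return starts
```

For L = 2 this yields exactly the Fig. 2 timing (cycle 1: B = A[0]; cycle 2: A[1] = B; cycle 3: B = A[1], ...), with consecutive iterations starting 2 cycles apart.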

The optimality of loop executions in saturation has been demonstrated for loops containing no tests, for example IF-THENs. When tests are present, it is not clear what the optimal conditions are. Some of the new methods may give optimal results, but for unbounded hardware resources.

In this paper we derive resource and time bounds for certain classes of singly nested loops containing tests, or IF-THEN statements. In prior work, it was suggested, implied, or left as a possible maximum, that the hardware needed to execute code with branches optimally is exponentially dependent on the total number of dynamic branches executed, this number of branches being proportional at least to the number of iterations of the loop. For classes of code taking at least one cycle per iteration to execute, this is not the case. For loops containing one test (normally in the form of a Boolean recurrence of order 1, meaning that a branch is involved in the dependence cycle, and that iteration n depends on the results of iteration n - 1), we show that the hardware necessary varies from exponential to polynomial in the length of the dependence cycle L, while execution time varies from one time cycle per iteration to less than L time cycles per iteration; the variation depends on specific code dependences from conditionally executed instructions on instructions in the dependence cycle. Optimal performance is the same as oracular performance, in which an oracle knows which way a branch in the code executes, by the time each branch is encountered, for all branches; hence, there is no branch penalty, and the concurrency of the code is

enhanced. Hardware can mimic this by going down both paths of a branch. The model used in [9] goes beyond an oracle in achieving performance better than that shown here; but the way it achieves this is by unrealistically requiring code to be actually executed before the super-optimal code schedule is determined.

We demonstrate that under certain conditions, it is possible to obtain optimal, oracle, performance with bounded hardware. The basic strategy is to first assume a machine with unlimited resources, able to go down multiple branch paths at a time [8], executing code that takes at least one cycle per iteration (we address this assumption later); and then to determine how many resources the machine actually uses.

An equivalent to oracle execution of the code is assumed, based on the analytic method of Riseman and Foster [8]. Their terminology is used here. The basic stratagem is that as soon as a conditional branch is encountered in the concurrent execution stream, two paths are generated, one for each possible evaluation of the branch; execution proceeds down both paths. The branch is thus bypassed and is unresolved until its test source data become available. When the data are generated and the branch is evaluated, it is resolved, and the unnecessary path previously generated is discarded, or pruned, along with any subsequently generated branch bypasses on the same path. This is equivalent to oracular performance.

Given a dependence cycle containing a forward branch, the cycle being of length L above the branch, then the other instructions in the cycle can get at most L iterations ahead of the forward branch, in execution of instances, before the earlier



forward branch instances begin to be resolved (the branch is actually executed). Therefore, only at most L branch bypasses are active (unresolved) at any given time, and hence (ignoring backward branches), the control flow tree need be only, at most, L branches deep. After such a depth, the outstanding part of the tree (before the earliest unresolved branch) is pruned and the resources are reused.

Put another way: assuming the code consists of a single unnested loop with some number of internal (forward) branches, we note that as unresolved branches become resolved, their corresponding paths are discarded, or pruned. As old paths are pruned, new paths are created, due to the occurrence of new branch instances. The number of old paths deleted and new paths created is the same; thus, the number of levels of pending or unresolved paths is a constant, and is proportional to the length of the dependence cycle within the loop. (Note that the pruning method was given in [8], but the effects were not investigated.)

For a loop with a single test (other than the loop termination test), the number of pending or unresolved paths is found to be of order 2^L in the worst case, and of order L in the best case, where L is the dependence cycle length of the critical (longest) path cycle in the loop. This leads to a bound on the number of resources (registers and processing elements) needed to achieve optimal performance, since, in fact, oracle performance is bounded. The resources needed may still be excessive; however, those needed can now be estimated.

These results are applicable to both static and dynamic instruction stream executions, and both software and hardware based concurrency extraction methods. The results are likely most applicable to software methods.

In the remainder of the paper, we will describe our reasoning and extensions thereof. Specifically, in Section II the assumptions made are stated and discussed. In Section III a generic loop is developed for use in the following (main) section. Section IV presents the derivation of the resource and time bounds. A summary is given in Section V. The Appendix contains supporting math for part of Section IV.

II. ASSUMPTIONS AND CONSTRAINTS

In this section the basic assumptions and constraints are given and discussed. First, two definitions are made.

Definition 1: Eager evaluation is the execution of code before it is guaranteed to be needed, as in Riseman and Foster’s method.

Definition 2: A Branch Domain [12], [17] for forward branches is the section of code between a branch and its target, exclusive of the branch and the target. For backward branches, the domain consists of those instructions between the branch and its target, including the branch and target. A branch that has an instruction in its domain is said to contain the instruction.

Assumptions:

1) Eager evaluation is used to execute the code.

2) There is one unnested loop.

3) The loop contains one Forward Branch (FB), which may form a Boolean recurrence. See Fig. 3.

4) The loop is not a DOALL construct. If a loop is constructed as a DOALL loop, in which all of its iterations are independent and are executed concurrently, then the only limits on concurrent execution are those of resource allocation and intra-iteration concurrency extraction. For these loops, then, achievable performance is readily obtained from oracle performance; in fact, an oracle is of little use here. Therefore, we need not consider these loops further, as their performance limits and resource constraints are well understood and readily obtained.

1. <initialize>
2. LOOP:  <update TEST, I>
3.        B = A[I-1]
4.        <other unconditional code>
5.        IF f(B) THEN GOTO ENDIF
6.        A[I] = Q[I]
7. ENDIF: IF f2(TEST) GOTO LOOP

Fig. 3. FORTEST loop with one conditional branch, one Boolean recurrence.

5) Each instruction takes one cycle to execute. The discussion is readily extended to a multicycle or pipelined case; see Section IV-D2.

6) Only flow dependences, or Read-After-Write dependences, exist in the code. It is assumed that the others (WAW or output, and WAR or anti-) have been removed explicitly [5] or implicitly [11], [12].

The loops we wish to concentrate our attention on are singly nested loops including DOACROSS [3] and DOWHILE loops, but with control-dependent iteration execution times. DOACROSS loops have dependence cycles of length L >= 1.

Restricting our attention to these loops is reasonable, as all loops have at least one loop-terminating test, which implies at least one dependence cycle of length 1 or more, for example a loop index or exit condition. If this cycle is removed [5], the loop becomes a DOALL loop; otherwise, a cycle remains, and our arguments pertain. Essentially we are assuming that the loop terminating or index generating functions are arbitrary and not computable until run-time.

For the non-DOALL loops, which we shall call FOROTHER loops, if they contain no internal tests or branches (other than EXITs), then they are executed optimally in time per iteration equal to the length of the dependence cycle (see [3], [15]). The resources needed are readily bounded and do not depend on the execution of any conditional branches.

We are interested in the remaining classes of loops, FOROTHERs with internal conditional branching, or FORTEST loops. These include Boolean recurrences [2], as well as other classes of control flow. Our discussion will center on FORTEST loops containing one conditional (nonloop-terminating) branch. See Fig. 3.

III. CREATION OF A GENERIC LOOP

For the purposes of our discussion, the loop of Fig. 4 is used initially. Note that it is a Boolean recurrence. The dotted



1.        B = A[I]
2. LOOP:  I = I + 1
3.        C = B op T1
4.        T2 = Y op Z
5.        D = C op T2
6.        IF D GOTO ENDIF
7.        E = F op T3
8.        A[I] = E
9.        G = A[I]
10. ENDIF: B = A[I]
11.       IF I < n GOTO LOOP

(Graph key, dependence graph not reproduced: dotted lines mark dependences broken by eager evaluation, where which instance is used for a source is dependent on the branch; also marked are the major (critical) cycle and the cycle due to the backward branch.)

Fig. 4. Loop example for the preliminary discussion.

lines in the dependence graphs indicate dependences that are removed by eager evaluation. Without loss of generality, it is also assumed that the dependence cycle within the loop is longer than that of the loop-forming backward branch and its terminating condition computations; the discussion may be used to describe other cases. Note that there are also flow dependences from I2 to I9 and I10 (not shown), which result in limiting the execution of I9 and I10 to one instance of each of these instructions per cycle. Although it is possible in some cases to precompute the indexes, it is not possible in general. In the course of this section, the loop is transformed into a generic loop sufficient to cover the classes of loops we are interested in.

The loop of Fig. 4 is transformed to that of Fig. 5, and then to Fig. 6 to indicate the critical components of the loop. The backward branch and its dependence cycle are removed, as the cycle is reduced by eager evaluation¹ to a cycle of length 1; it thereby no longer limits the execution time of the loop. The original instructions I1, I4, I7, and I9 are also removed, as they are not in the critical path [6] of the dependence cycle.² In other words, threads of code which do not contribute to the critical cycle are removed.

Lastly, the component of the critical dependence cycle below the forward branch domain, I10, is moved to the top of

¹As achieved with the Super Advanced Execution of [13]; for example, [14], [18].

²With eager evaluation, they do not add to the steady-state execution time of the loop, as their execution is overlapped with other instructions' execution or is part of initialization. In the case of I7, it would be executed before I8 has to be computed.

10. B = A[I-1]       1'
3.  C = B op T1      2'
5.  D = C op T2      3'
6.  IF D GOTO ENDIF  4'
8.  A[I] = E         5'

(Note that the forward branch enables or disables I5'.)

Fig. 5. Reduced loop of Fig. 4; only primary elements of the cycle are shown.

the cycle, for ease of discussion. The reduced “loop” of Fig. 5 is thus obtained, containing only the primary components of the original loop. We will refer to elements of this loop by the new “primed” numbers shown. (One of the key characteristics of this loop is that it contains no data dependences from elements in the forward branch domain to instructions outside of it. Assuming such dependences actually improves the resource bounds; this is discussed in a later section, after the main result has been demonstrated.)

The last refinement is to effectively remove I8 (or I5'). Although it is part of the nominal dependence cycle, it is removed from the cycle via eager evaluation. Like I7, it is computed ahead of time. Its result may or may not be used, depending on how its containing branch is resolved. The generic loop is now as shown in Fig. 6; it is representative of



10. B = A[I-1]       1'
3.  C = B op T1      2'
5.  D = C op T2      3'
6.  IF D GOTO ENDIF  4'

(Note that the branch selects which eagerly evaluated path is to be used.)

Fig. 6. Further reduced loop of Fig. 4; only critical elements in the cycle are shown.

the classes of loops under consideration (see Section II). Having reduced FORTEST loops (with one test) to their minimal components, the main result is obtained in the next section.

IV. BOUNDING THE RESOURCE AND TIME REQUIREMENTS OF THE EAGER EVALUATION TREE

The execution of the generic loop just mentioned is now described for three possible cases:

1) as given, in that the instructions within the forward branch domain are not flow dependent on (do not use a sink of) instructions in the dependence cycle (Section IV-A);

2) modified, such that the critical instruction in the domain is dependent on the last assignment instruction in the cycle (Section IV-B);

3) between the prior two cases (Section IV-C).

For all cases, new bounds are shown for the resources required for optimal execution. For the second case, the bound is not exponential. For the third case, the bound is exponential, but with bases less than 2. Time bounds are also obtained.

A. Minimally Flow Dependent Domain Instructions

In this section the execution of the generic loop presented in the prior section (see Fig. 6) is presented and discussed, using eager evaluation. A steady-state depth of the bypassed branch tree is demonstrated, leading to bounds on the resources needed to execute such code optimally.

Referring to Fig. 7, Iteration 1 of the loop begins at time t = 1. Instruction I5' is eagerly evaluated at the same time as I1' is executed. At t = 2, Iteration 1 continues with I2' executing, since it is the next instruction in the dependence cycle.

Also at time t = 2, two versions of Iteration 2 begin execution, each corresponding to a particular branch direction of branch B1 (instance 1 of I5). One version uses the value of A[I] existing before the program executed, the other version uses the value generated by I5 of Iteration 1. Likewise, at t = 3, Iteration 3 begins, with two versions for each path of each version of B2; therefore, there are four versions of Iteration 3 being executed.

Each subsequent cycle, a factor of two more iteration versions are issued for execution; thus, the number of pending iteration versions is 2^(t-1), and doubles each cycle. However, at t = 4, along with the usual doubling, B1 is resolved; therefore the path not needed is pruned (discarded). Say this is Iteration 2 (B1=F). Then it and all subsequently generated iteration versions on the same path are eliminated; thus, half of the outstanding iteration versions are pruned at t = 4+. This halving completely offsets the prior version doubling, resulting in a stable number of pending iteration versions, N_V - 1. By observation, the number of active iteration versions, including the root iteration, is

N_V = 2^L - 1

where L is the length of the dependence cycle including the branch (in this case, L = 4).

(If the branch is resolved before issuance of new iteration versions, then N_V = 2^(L-1) - 1; this is somewhat less.)
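The double-and-prune argument can be checked with a small simulation. This is my sketch, not the paper's: each cycle every pending leaf version forks into two (one per branch outcome), and the oldest branch instance resolves once L levels are outstanding, pruning the half of the tree under the wrong outcome:

```python
from collections import deque

def steady_state_versions(L, cycles=40):
    # levels holds the widths of the unresolved tree levels, oldest first;
    # the root iteration starts as a single version.
    levels = deque([1])
    for _ in range(cycles):
        levels.append(levels[-1] * 2)       # fork: new level, double width
        if len(levels) > L:                 # oldest branch resolves...
            levels.popleft()                # ...its level leaves the pending set
            levels = deque(w // 2 for w in levels)  # wrong halves pruned
    return sum(levels)                      # active versions, incl. the root
```

After the fill phase the level widths settle at 1, 2, ..., 2^(L-1), whose sum is 2^L - 1, matching the closed form above.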

1) Resources: The resources necessary to compute all required N_V versions are now computed. We assume no necessary computations other than those of I1'-I4'. (Those resources necessary to compute threads of computations not in the critical path would be added; since the exact thread pattern is program dependent, we do not consider such threads here.)

The number of processing elements needed is

N_PE = N_V = 2^L - 1

The number of registers needed, assuming one register per instruction instance version, is

N_R = 2^(L+1) - L - 2 = 2 N_V - L.

With I5', another 2^L - 1 registers would be needed. Although these resources are exponential, they are exponential in the dependence cycle length L, not in the maximum number of iterations executed of the loop, nor in the maximum number of branches executed. These results are for optimal execution.

2) Execution Time: The execution time TE is one cycle per iteration, the optimal.

3) Comments: New resource bounds for the optimal execution of the target FORTEST loops have been achieved. These results represent one endpoint, for the case of instructions in the forward branch domain not flow dependent on instructions earlier in the dependence cycle. We now consider the other endpoint, when instructions in the domain are dependent on the last instruction in the dependence cycle.

B. Maximally Flow Dependent Domain Instructions

We now consider a modification of the code in Fig. 5 to that in Fig. 8. The difference arises due to the change in instruction I5'. It now has as its source the output of I3', the last assignment instruction in the dependence cycle.

The execution of this code via eager evaluation is shown in Fig. 9. The effect of the modification is to skew the eager evaluation tree such that all of the right-hand children have edges 3 time units greater than before. This occurs when the branch evaluates False, requiring I5' to be part of the cycle, thereby lengthening the execution time when the branch is not taken. Looking again at time t = 4, we note that evaluating B1 causes all but three (four including Iteration 1) pending versions to be discarded, as all but three (four) are skewed or



Note: not all I5' instances shown. A circled number gives the iteration number. (123) contains the branch taken (T) or not taken (F) information for instances 1, 2, and 3 of the forward branch B. Ex.: at t = 4, B1 evaluating true adds 8 new versions to B4; also, at t = 4+, all versions (F**) would be eliminated.

Fig. 7. Eager evaluation tree of the execution of the code of Fig. 6.

The remaining iterations are delayed from Iteration 1 by greater than or equal to L = 4 cycles, or have not yet begun execution.

1) Resources: Therefore, for the case of maximum dependence:

    N_V = L
    N_PE = L
    N_R = L(L + 1)/2.

2) Execution Time: T_E = 1 + f(L - 1) cycles per iteration, where f is the proportion of the time the branch evaluates False.

3) Comments: For the case of maximum dependence of instructions within a domain on those outside of it and in the cycle, polynomial, and in some cases linear, resources are needed for optimal execution.

C. Intermediate Cases

In the last two sections endpoints for resource bounds and execution times have been derived for the generic loop example assuming minimum and maximum dependences. In this section we consider the intermediate cases, and establish bounds.

    I                      I'
    1.  B = A[I-1]         1'
    3.  C = B op T1        2'
    5.  D = C op T2        3'
    6.  IF D GOTO ENDIF    4'
    8.  A[I] = D           5'

(Note that the forward branch enables or disables I5'.)

Fig. 8. Modified loop of Fig. 5.

Going from the loop of Fig. 5 to that of Fig. 8 effectively moves I5 later in time, due to the dependence, as shown in the progression from Fig. 7 to Fig. 9. If I5 were dependent on I2' rather than on I3', the skew of the tree would be less than in the maximum case, and the resource bounds begin to appear to be exponential again, although not as large as in the minimum dependence case.

The important criteria are L and the distance D_DC from the domain instruction (I5') to the instruction in the dependence cycle it obtains its source from (D_DC <= L - 1, though; also, D_DC does not include the branch).

By observation, the number of new iterations added every cycle during the execution is a function of Fibonacci-like sequences. We define a Generalized Fibonacci series G_i(n) as follows:

    G_i(n) ≡ x_n = x_{n-1} + x_{n-(i+1)},   n > i.

For the standard Fibonacci series, i = 1. With respect to this case of eager evaluation, i is dependent on the difference between the dependence cycle length and the distance D_DC; specifically,

    i = δ ≡ L - D_DC.
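The recurrence above is easy to check with a short script. The sketch below is illustrative only: the function name `gen_fib` and the boundary conditions (x_1 = 1, x_m = 0 for m <= 0) are assumptions chosen so that i = 1 reproduces the standard Fibonacci series, not details taken from the paper.

```python
def gen_fib(i, n):
    """Generalized Fibonacci series G_i(n): x_n = x_(n-1) + x_(n-(i+1)) for n > i.

    Boundary conditions are an assumption of this sketch:
    x_1 = 1 and x_m = 0 for m <= 0.
    """
    if n < 1:
        return 0
    x = {1: 1}
    for m in range(2, n + 1):
        # terms with a non-positive index contribute 0
        x[m] = x[m - 1] + x.get(m - (i + 1), 0)
    return x[n]

# i = 1 gives the standard Fibonacci series.
print([gen_fib(1, n) for n in range(1, 8)])   # [1, 1, 2, 3, 5, 8, 13]
# i = 0 doubles every cycle: the exponential minimum-dependence behavior.
print([gen_fib(0, n) for n in range(1, 5)])   # [1, 2, 4, 8]
```

Note how larger i (larger skew δ) slows the growth of the series, which is exactly why stronger dependences reduce the resource requirements.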

The notes of Fig. 7 still pertain. At t = 4, if B1 evaluates True, all versions (FXX) are discarded; half of the versions (TXX) have not yet started; new versions for B4 will be issued, but mainly at times later than t = 5.

Fig. 9. Eager evaluation tree of the execution of the code of Fig. 8.

and thus the number of pending iterations (a summation of generalized Fibonacci series) is

    N_V = Σ_{j=1}^{L} G_δ(j)

with

    N_PE = N_V = Σ_{j=1}^{L} G_δ(j)

and

    N_R = Σ_{j=1}^{L} (L + 1 - j) G_δ(j).

The latter expression sums up the active lengths, or number of eagerly evaluated instructions, of the pending iteration versions. These expressions match the endpoints of the minimally and maximally dependent cases, taking δ = 0 for the minimally flow-dependent case, and δ = L - 1 for the maximally dependent case [this gives x_n = 1 for all relevant G_δ(n)].

To get a better indication of how the necessary resources vary with δ and L, we note that an approximation to the Generalized Fibonacci series is

    G_i(n) ≈ C_i (B_i)^{n-1}
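The three sums can be evaluated directly and checked against the endpoint cases derived earlier: δ = L - 1 should give N_V = L and N_R = L(L + 1)/2, while δ = 0 should give the exponential minimum-dependence counts. A minimal sketch, under the same assumed boundary conditions as before (x_1 = 1, x_m = 0 for m <= 0; the name `pending_resources` is illustrative):

```python
def gen_fib(i, n):
    # Generalized Fibonacci G_i(n); assumed boundary conditions: x_1 = 1, x_m = 0 for m <= 0.
    if n < 1:
        return 0
    x = {1: 1}
    for m in range(2, n + 1):
        x[m] = x[m - 1] + x.get(m - (i + 1), 0)
    return x[n]

def pending_resources(L, delta):
    """N_V (pending iteration versions) and N_R (registers) for cycle length L and skew delta."""
    NV = sum(gen_fib(delta, j) for j in range(1, L + 1))
    NR = sum((L + 1 - j) * gen_fib(delta, j) for j in range(1, L + 1))
    return NV, NR

# Maximum dependence endpoint (delta = L - 1): N_V = L, N_R = L(L+1)/2.
print(pending_resources(4, 3))   # (4, 10)
# Minimum dependence endpoint (delta = 0): N_V = 2^L - 1.
print(pending_resources(4, 0))   # (15, 26)
```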

where B_i is the ratio of neighboring values of the Generalized Fibonacci series i (see the Appendix), and is the single positive root of the equation

    B^{i+1} = B^i + 1                                    (1)

and the factor C_i is a correction for using a continuous function to represent a discrete function; C_1 ≈ 0.724. We tabulate some of the roots below, determined numerically:

    i      B_i
    1      1.618
    2      1.466
    4      1.325
    8      1.213
    100    1.0344
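The tabulated roots can be reproduced numerically. The sketch below (the function name `root_B` is an illustrative choice) solves (1), rewritten as B^i (B - 1) = 1, by bisection on [1, 2], where the left-hand side is increasing; it also checks the correction factor C_1, which for the standard Fibonacci series is consistent with the closed-form asymptotic constant φ/√5 ≈ 0.724.

```python
def root_B(i, tol=1e-12):
    """Positive root of B^(i+1) = B^i + 1, i.e. B^i * (B - 1) = 1, by bisection.

    On [1, 2] the left-hand side increases from 0 to 2^i, so the root is bracketed.
    """
    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid ** i * (mid - 1) < 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for i in (1, 2, 4, 8, 100):
    print(i, round(root_B(i), 4))

# Consistency check: for i = 1 (standard Fibonacci), C_1 = phi / sqrt(5),
# and sqrt(5) = 2*phi - 1.
B1 = root_B(1)
print(round(B1 / (2 * B1 - 1), 3))   # 0.724
```

The i = 1 case yields the golden ratio, and the roots decrease toward 1 as i grows, which is the quantitative sense in which stronger dependences shrink the exponential base.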

1) Resources: Therefore, substituting the approximation into the expression for N_R,

    N_R ≈ Σ_{j=1}^{L} (L + 1 - j) C_δ (B_δ)^{j-1}.

This is a polynomial in B_δ of degree L - 1.


2) Execution Time:

    T_E = 1 + fδ cycles per iteration.

This also satisfies the endpoints.

3) Comments: For the intermediate cases, the necessary resources for optimal execution are exponential in L, but exponential with bases significantly less than 2.

D. Extensions

1) Multiple Branches: The main effect of allowing multiple branches is to potentially increase the base of the exponential terms heretofore used from 2 or B to less than or equal to

    2^(number of branches) or B^(number of branches).

In many cases it is less than this, for example in IF-THEN-ELSEs, in which one path excludes the other.

Adding multiple branches also complicates the analysis. Now, the left-child paths of the eager evaluation tree also lengthen; for some code the length of the left side equals that of the right side (and is greater than 1), eliminating the skew.

2) Pipelining/Multiple Cycle Instructions: The major effect of increasing the duration of instruction execution is to effectively lengthen L by the appropriate stage delays occurring in the (equivalent) pipelined execution of any instruction in the critical dependence cycle.

3) Boolean Recurrences of Greater Degree: In this paper, it was assumed that the degree of the Boolean (and other) recurrences was 1, meaning that iteration n used results from iteration n - 1. The effect of increasing the degree from 1 to k is to decrease the skew of the right-children of the maximum dependence case, improving performance and increasing the resource bounds. For a minimum D_DC = 1, when k > L, the maximum dependence case is the same as the minimum dependence case.

V. SUMMARY

New lower resource upper bounds were demonstrated for optimal execution of certain important classes of code containing tests, including Boolean recurrences. The bounds depend on the major dependence cycle length of a loop, not the number of iterations executed of the loop. In some cases the resource bounds are polynomial.

As the importance of effectively executing branch-rich code grows, as it is likely to do in nonnumeric supercomputer and parallel machine applications, the results presented here will be able to guide the computer architect and programmer in the development of new systems. Software-based concurrency extraction methods may be the greatest beneficiaries of these results.

APPENDIX
DERIVATION OF AND COMMENTS ON (1)

In this Appendix (1) is derived, and then the uniqueness of its positive root is shown.

A. Derivation of (1)

As stated in Section IV-C, B is the ratio between neighboring values of the Generalized Fibonacci series, or

    B = x_{n+1} / x_n.

Substituting the recurrence x_{n+1} = x_n + x_{n-i} gives

    B = (x_n + x_{n-i}) / x_n = 1 + x_{n-i} / x_n.

Noting that x_n / x_{n-i} = B^i,

    B = 1 + 1 / B^i

and multiplying through by B^i,

    B^{i+1} = B^i + 1.                                   (1)

Q.E.D.

E. Execution of the Code

The execution of code employing eager evaluation as described above is accomplished in software-based techniques, such as [4], by assigning spare registers and processing elements to the pending iterations, and reassigning them upon branch resolution; only valid (not-pending) results are written to memory. If particular loops may not be fully eagerly evaluated, due to limited resources, the work presented here will aid in determining this fact and potentially guide the compiler to an adequate suboptimal schedule. Also, the methods presented here may be used to indicate to a compiler when further optimizations are not useful, due to limited resources.

Eager evaluation of code via hardware methods is more difficult, and is currently being devised [19]. However, the same basic stratagem can be followed as in the software methods, namely assigning spare PE's and registers (or variable copies) to pending instruction iterations, and reassigning them upon branch resolution. The methods presented here could also be used by a computer architect who has knowledge of the characteristics of target application code for a new machine design to determine both how much eager evaluation of code is possible, given the resource constraints, and what the expected performance may be.
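The derivation can be sanity-checked numerically: the ratio of neighboring terms of a Generalized Fibonacci series should converge to the positive root of (1). A minimal sketch, again under the assumed boundary conditions x_1 = 1, x_m = 0 for m <= 0:

```python
def gen_fib(i, n):
    # Generalized Fibonacci G_i(n): x_n = x_(n-1) + x_(n-(i+1));
    # assumed boundary conditions: x_1 = 1, x_m = 0 for m <= 0.
    if n < 1:
        return 0
    x = {1: 1}
    for m in range(2, n + 1):
        x[m] = x[m - 1] + x.get(m - (i + 1), 0)
    return x[n]

# For i = 2, the neighboring-term ratio should approach the positive
# root of B^3 = B^2 + 1, approximately 1.4656.
ratio = gen_fib(2, 41) / gen_fib(2, 40)
print(round(ratio, 4))   # 1.4656
```

The convergence is rapid because the remaining roots of (1) have modulus less than 1 relative to the dominant root, so their contribution dies out geometrically.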

B. Proof of Root Uniqueness

Rewriting (1) as

    B^i (B - 1) = 1

it is seen that there are no roots for B > 2, since B^i > 1 and B - 1 > 1 for B > 2. Also, there are no roots between 0 and 1, since in that case the left-hand side is negative.

Therefore, if B is positive, it must be between 1 and 2. Assuming there is at least one positive root of (1), B = r, we divide (B - r) into (1) and get a polynomial quotient of the form

    B^i + p_{i-1} B^{i-1} + ... + p_1 B + p_0

in which all p_j are positive. Since this polynomial has all positive coefficients, it has no positive roots; therefore, r is the unique positive root of (1). Q.E.D.

ACKNOWLEDGMENT

My thanks to L. Bradley for much good discussion and critiquing. She greatly helped in firming up both the ideas of the paper in general, and the intermediate dependence case bound in particular. M. Paturi also helped greatly with the latter. M. Saks helped with Appendix B. LD, you're always there.

REFERENCES

[1] A. Aiken and A. Nicolau, "Perfect pipelining: A new loop parallelization technique," in Proc. 1988 Euro. Symp. Programming, 1988. Also available as Dep. Comput. Sci. Tech. Rep. 87-873, Cornell Univ., Ithaca, NY 14853.
[2] U. Banerjee and D. Gajski, "Fast execution of loops with IF statements," IEEE Trans. Comput., vol. C-33, no. 11, pp. 1030-1033, Nov. 1984.
[3] R. G. Cytron, "Doacross: Beyond vectorization for multiprocessors (extended abstract)," in Proc. 1986 Int. Conf. Parallel Processing, Pennsylvania State Univ. and the IEEE Computer Society, Aug. 1986, pp. 836-844.
[4] K. Ebcioglu, "A compilation technique for software pipelining of loops with conditional jumps," in Proc. Twentieth Annu. Workshop Microprogramming (MICRO-20), Association for Computing Machinery, Dec. 1987, pp. 69-79.
[5] D. A. Padua and M. J. Wolfe, "Advanced compiler optimizations for supercomputers," Commun. ACM, vol. 29, no. 12, pp. 1184-1201, Dec. 1986.
[6] C. D. Polychronopoulos, "On program restructuring, scheduling, and communication for parallel processor systems," Ph.D. dissertation, Univ. Illinois at Urbana-Champaign, Aug. 1986. Available as Center for Supercomputing Research and Development Tech. Rep. CSRD 595.
[7] B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle, "The Cydra 5 departmental supercomputer," IEEE Computer Mag., vol. 22, no. 1, pp. 12-35, Jan. 1989.
[8] E. M. Riseman and C. C. Foster, "The inhibition of potential parallelism by conditional jumps," IEEE Trans. Comput., pp. 1405-1411, Dec. 1972.
[9] U. Schwiegelshohn, F. Gasperoni, and K. Ebcioglu, "On optimal parallelization of arbitrary loops," J. Parallel Distributed Comput., vol. 11, pp. 130-134, 1991.
[10] B. Su, S. Ding, J. Wang, and J. Xia, "GURPR - A method for global software pipelining," in Proc. Twentieth Annu. Workshop Microprogramming (MICRO-20), Association for Computing Machinery, Dec. 1987, pp. 88-96.
[11] R. M. Tomasulo, "An efficient algorithm for exploiting multiple arithmetic units," IBM J., pp. 25-33, Jan. 1967.
[12] A. K. Uht, "Hardware extraction of low-level concurrency from sequential instruction streams," Ph.D. dissertation, Carnegie-Mellon Univ., Dec. 1985. Available from University Microfilms International, Ann Arbor, MI.
[13] A. K. Uht and R. G. Wedig, "Hardware extraction of low-level concurrency from serial instruction streams," in Proc. Int. Conf. Parallel Processing, IEEE Computer Society and the Association for Computing Machinery, Aug. 1986, pp. 729-736.
[14] A. K. Uht, "Incremental performance contributions of hardware concurrency extraction techniques," in Proc. Int. Conf. Supercomput., Athens, Greece, Computer Technology Institute, Greece, in cooperation with the Association for Computing Machinery, IFIP et al., June 1987. Springer-Verlag Lecture Notes Series.
[15] A. K. Uht, C. D. Polychronopoulos, and J. F. Kolen, "On the combination of hardware and software concurrency extraction methods," in Proc. Twentieth Annu. Workshop Microprogramming (MICRO-20), Association for Computing Machinery, Dec. 1987.
[16] A. K. Uht, "Requirements for optimal execution of loops with tests," in Proc. Int. Conf. Supercomput., St. Malo, France, Association for Computing Machinery, July 4-8, 1988. An earlier version appeared with the same title as UCSD Comput. Sci. and Eng. Tech. Rep. CS88-116.
[17] A. K. Uht, "A theory of reduced and minimal procedural dependencies," IEEE Trans. Comput., vol. 40, pp. 681-692, June 1991.
[18] A. K. Uht, "Concurrency extraction via hardware methods executing the static instruction stream," IEEE Trans. Comput., to be published.
[19] S. S. Wang, "Enhancing concurrent program execution with eager evaluation," Ph.D. dissertation, Univ. California at San Diego, June 1991. Available as Dep. Comput. Sci. and Eng. Tech. Rep. CS91-203.

Augustus K. Uht (S'74 - S'78 - M'78 - S'82 - M'85) was born in New York City on July 19, 1955. He received the B.S. degree in electrical engineering and the M.Eng. (Elect.) degree in 1977 and 1978, respectively, from Cornell University, Ithaca, NY, and the Ph.D. degree in electrical engineering (specialization in computer engineering) from Carnegie-Mellon University, Pittsburgh, PA, in 1985.

He was an Associate Engineer and later a Senior Associate Engineer at the International Business Machines Corporation East Fishkill, NY, facility from 1978 to 1982. After obtaining his doctorate, he was a Visiting Assistant Professor in the Department of Electrical and Computer Engineering, Carnegie-Mellon University, in 1986. He joined the faculty of the University of California, San Diego, in 1986, and is currently an Assistant Professor in the Department of Computer Science and Engineering. His research interests include parallel processing, concurrency, reduction of branch effects, eager evaluation, computer implementation, computer architecture, memory systems, and digital logic.

Dr. Uht is a member of the Association for Computing Machinery and the National Society of Professional Engineers. He became a registered Professional Engineer in 1982.