Improving Register Usage Chapter 8, Section 8.5 End. Omer Yehezkely

Post on 14-Dec-2015

Improving Register Usage

Chapter 8, Section 8.5 End.

Omer Yehezkely

Agenda

•Last Lecture at a Glance

•Loop Interchange for Register Reuse

•Loop Fusion for Register Reuse

•Putting It All Together

•Complex Loop Nests

•Summary

Last lecture at a glance (1)

Assumption 1: Most compilers can handle register allocation for scalars (using a graph coloring algorithm). However, they do not know how to handle vectors, that is, array references.

Assumption 2: We are dealing with RISC processors. All CPU operations except load and store need their data in registers.

Assumption 3: Memory Hierarchy: Accessing the registers is much faster than a cache hit, which is much faster than a cache miss and accessing the main memory, which is much faster than accessing the virtual memory (swap file)…

Last lecture at a glance (2)

Therefore our strategy will be: apply transformations that “expose” vector entries as scalars, and then let the good old compiler do the register allocation.

The benefit comes from avoiding unnecessary load and store operations.

Last lecture at a glance (3)

Example: (Scalar Replacement)

Original Code

    DO I = 1, N
      DO J = 1, M
        A(I) = A(I) + B(J)
      ENDDO
    ENDDO

Scalar Replacement

    DO I = 1, N
      T = A(I)
      DO J = 1, M
        T = T + B(J)
      ENDDO
      A(I) = T
    ENDDO
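To make the saving concrete, here is a minimal sketch that counts the loads and stores of A in both versions. Python is used only as executable pseudocode, and `CountingArray` and `run` are hypothetical helpers introduced for this illustration, not part of the lecture.

```python
class CountingArray:
    """List wrapper that counts element reads (loads) and writes (stores)."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0
        self.stores = 0

    def __getitem__(self, i):
        self.loads += 1
        return self.values[i]

    def __setitem__(self, i, v):
        self.stores += 1
        self.values[i] = v


def original(a, b):
    # A(I) = A(I) + B(J): A is loaded and stored on every inner iteration.
    for i in range(len(a.values)):
        for j in range(len(b)):
            a[i] = a[i] + b[j]


def scalar_replaced(a, b):
    # t stands in for A(I) and can stay in a register across the J loop.
    for i in range(len(a.values)):
        t = a[i]
        for j in range(len(b)):
            t = t + b[j]
        a[i] = t


def run(kernel, n=4, m=5):
    """Run a kernel on small test data; return (result, loads of A, stores of A)."""
    a = CountingArray(range(n))
    b = list(range(1, m + 1))
    kernel(a, b)
    return a.values, a.loads, a.stores
```

With N = 4 and M = 5 the original performs N*M = 20 loads and 20 stores of A, while the scalar-replaced version performs N = 4 of each and produces the same result.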

Last lecture at a glance (4)

Dependences to consider:

True dependence (write, then read):      A(I) = …    … = A(I)

Output dependence (write, then write):   A(I) = …    A(I) = …

Antidependence (read, then write):       … = A(I)    A(I) = …

Input dependence (read, then read):      … = A(I)    … = A(I)

Last lecture at a glance (5)

•We should also consider loop carried and loop independent dependences.

•In general, the more dependences the merrier, because more dependences usually mean more opportunities for register reuse.

•We will use the dependences to decide if and how to “expose” the vectors as scalars.

Last lecture at a glance (6)

We saw:

•Scalar Replacement (see first example) – this is the actual “exposure”.

•Unroll and Jam – Unrolling of loops in order to bring dependences that are carried by an outer loop into the inner loop. This can benefit register reuse if we apply Scalar Replacement afterwards.

Last lecture at a glance (7)

Example: (Unroll and Jam)

Original Code

    DO I = 1, N*2
      DO J = 1, M
        A(I) = A(I) + B(J)
      ENDDO
    ENDDO

Unroll and Jam

    DO I = 1, N*2, 2
      DO J = 1, M
        A(I) = A(I) + B(J)
        A(I+1) = A(I+1) + B(J)
      ENDDO
    ENDDO

Scalar Replacement

    DO I = 1, N*2, 2
      s0 = A(I)
      s1 = A(I+1)
      DO J = 1, M
        t = B(J)
        s0 = s0 + t
        s1 = s1 + t
      ENDDO
      A(I) = s0
      A(I+1) = s1
    ENDDO

Loop Interchange (1)

Loop nesting is not always optimal in regard to register reuse. For example, on CPUs with no vector engines, the following code (matrix initialization):

    DO I = 2, N
      A(1:M, I) = A(1:M, I-1)
    ENDDO

will be converted into:

    DO I = 2, N
      DO J = 1, M
        A(J, I) = A(J, I-1)
      ENDDO
    ENDDO

Loop Interchange (2)

Which will be implemented in the following way:

    DO I = 2, N
      DO J = 1, M
        R1 = A(J, I-1)
        A(J, I) = R1
      ENDDO
    ENDDO

This is not too clever, since it performs (N-1)*M load operations and (N-1)*M store operations.

If we change the order of the loops, we can get a better implementation.

Loop Interchange (3)

Original Code

    DO I = 2, N
      DO J = 1, M
        A(J, I) = A(J, I-1)
      ENDDO
    ENDDO

Loop Interchange

    DO J = 1, M
      DO I = 2, N
        A(J, I) = A(J, I-1)
      ENDDO
    ENDDO

Scalar Replacement

    DO J = 1, M
      R1 = A(J, 1)
      DO I = 2, N
        A(J, I) = R1
      ENDDO
    ENDDO

This implementation still requires (N-1)*M store operations (we can’t escape that), but it requires only M load operations, which can make the running time considerably shorter.
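The load counts can be checked with a small simulation. The sketch below is written in Python as executable pseudocode; the function name `count_loads` and the test sizes are assumptions made for this illustration. It runs the original nest and the interchanged, scalar-replaced nest on the same data and counts the loads of A.

```python
def count_loads(n=5, m=3):
    """Return (results_equal, loads_original, loads_interchanged)."""

    def fresh():
        # Fortran's A(J, I) is a[j][i] here; index 0 is unused so that the
        # subscripts match the slides. Column 1 holds the values to propagate.
        return [[(j if i == 1 else 0) for i in range(n + 1)]
                for j in range(m + 1)]

    # Original nest: one load per inner iteration.
    a = fresh()
    loads_original = 0
    for i in range(2, n + 1):
        for j in range(1, m + 1):
            r1 = a[j][i - 1]
            loads_original += 1
            a[j][i] = r1

    # Interchanged and scalar-replaced nest: one load per J iteration.
    b = fresh()
    loads_interchanged = 0
    for j in range(1, m + 1):
        r1 = b[j][1]
        loads_interchanged += 1
        for i in range(2, n + 1):
            b[j][i] = r1

    return a == b, loads_original, loads_interchanged
```

For N = 5 and M = 3 this gives (N-1)*M = 12 loads for the original nest and M = 3 loads after the transformation, with identical array contents.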

Loop Interchange (4)

Considerations for Loop Interchange

The basic idea is to get the loop that carries the most dependences to the innermost position.

Register reuse for an outer loop usually cannot be achieved, due to limited register resources.

We use the conventional direction matrix for the loop nest.

Loop Interchange (5)

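Example:

    DO J = 1, N
      DO K = 1, N
        DO I = 1, 256
          A(I, J, K) = A(I, J-1, K) + A(I, J-1, K-1) + A(I, J, K-1)
        ENDDO
      ENDDO
    ENDDO

There are 3 true dependences, which result in the following direction matrix (columns in loop order J, K, I):

      J  K  I
      <  =  =     from A(I, J-1, K)
      <  <  =     from A(I, J-1, K-1)
      =  <  =     from A(I, J, K-1)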

Loop Interchange (6)

Example (cont.):

If we select the J loop to be the innermost we get:

    DO K = 1, N
      DO I = 1, 256
        DO J = 1, N
          A(I, J, K) = A(I, J-1, K) + &
                       A(I, J-1, K-1) + A(I, J, K-1)
        ENDDO
      ENDDO
    ENDDO

Scalar Replacement

    DO K = 1, N
      DO I = 1, 256
        R1 = A(I, 0, K)
        DO J = 1, N
          R1 = R1 + A(I, J-1, K-1) + &
               A(I, J, K-1)
          A(I, J, K) = R1
        ENDDO
      ENDDO
    ENDDO

We saved a load operation in each iteration. It is possible to interchange the 2 outer loops and get further optimization.

Loop Interchange (7)

Loop Interchange Algorithm:

1. Form the direction matrix for the loop nest and use it to identify the loops, other than the scalarization loop, that can legally be moved to the innermost position.

2. For each such loop L, let count(L) be the number of rows of the direction matrix that have “<” in the position corresponding to L and “=” in every other position.

3. Pick the loop L that maximizes the product of count(L) and the iteration count of loop L.

• Some assumptions must be made when the bounds of a loop are unknown at compile time.

• Loop interchange should be weighed against cache efficiency (next chapter).
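The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the book's implementation: Python is used as executable pseudocode, the function name `innermost_candidates` and its parameters are hypothetical, and legality is tested by checking that every row of the direction matrix still has “<” as its leftmost non-“=” entry once the chosen column is moved to the end.

```python
def innermost_candidates(direction_matrix, loops, iteration_counts, excluded=()):
    """Rank the loops that may legally be moved to the innermost position.

    direction_matrix: one row per dependence, one entry per loop,
                      each entry one of "<", "=", ">".
    loops:            loop names, outermost first.
    iteration_counts: mapping loop name -> (estimated) iteration count.
    excluded:         loops that must not be considered (for example the
                      scalarization loop).
    Returns (score, loop) pairs, best first.
    """
    rows = [list(row) for row in direction_matrix]

    def legal_innermost(col):
        # Step 1: after moving column `col` to the end, no row may have ">"
        # as its leftmost non-"=" entry.
        for row in rows:
            moved = row[:col] + row[col + 1:] + [row[col]]
            first = next((d for d in moved if d != "="), "=")
            if first == ">":
                return False
        return True

    ranked = []
    for col, loop in enumerate(loops):
        if loop in excluded or not legal_innermost(col):
            continue
        # Step 2: count(L) = rows with "<" for L and "=" everywhere else.
        count = sum(
            1 for row in rows
            if row[col] == "<"
            and all(d == "=" for k, d in enumerate(row) if k != col)
        )
        # Step 3: weight count(L) by the iteration count of L.
        ranked.append((count * iteration_counts[loop], loop))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

For the three-dependence example with loops J, K, I, assuming N = 100 and treating I as the excluded loop, J and K each have count 1 and tie with a score of 100, so either may be chosen; the slides pick J.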

Loop Interchange (8)

Example

    Loop (outermost first)       1      2      3      4
    # of loop iterations         100    65     150    1,000
    count(L)                     2      3      1      0
    iterations * count(L)        200    195    150    0

The outermost loop (100 * 2 = 200) should be the innermost loop.

Loop Fusion (1)

Example:

On CPUs with no vector engines the following code:

    A(1:N) = C(1:N) + D(1:N)
    B(1:N) = C(1:N) - D(1:N)

will be transformed into:

    DO I = 1, N
      A(I) = C(I) + D(I)
    ENDDO
    DO I = 1, N
      B(I) = C(I) - D(I)
    ENDDO

Loop Fusion (2)

Using Loop Fusion (chapter 6) we get:

    DO I = 1, N
      A(I) = C(I) + D(I)
      B(I) = C(I) - D(I)
    ENDDO

Using Scalar Replacement we can save on the fetching time of C(I) and D(I):

    DO I = 1, N
      R1 = C(I)
      R2 = D(I)
      A(I) = R1 + R2
      B(I) = R1 - R2
    ENDDO

Loop Fusion (3)

Profitable Loop Fusion for Register Reuse

Just because a loop fusion is safe does not mean it is profitable.

There are 2 cases where the fusion may be profitable:

•The fusion results in a loop independent dependence (as we just saw).

•The fusion results in a forward loop carried dependence.

Loop Fusion (4)

Example: (forward loop carried dependence)

    DO J = 1, N
      DO I = 1, M
        A(I,J) = C(I,J)+D(I,J)
      ENDDO
      DO I = 1, M
        B(I,J) = A(I,J-1)-E(I,J)
      ENDDO
    ENDDO

Fusion:

    DO J = 1, N
      DO I = 1, M
        A(I,J) = C(I,J)+D(I,J)
        B(I,J) = A(I,J-1)-E(I,J)
      ENDDO
    ENDDO

Loop Fusion (5)

Fusion:

    DO J = 1, N
      DO I = 1, M
        A(I,J) = C(I,J)+D(I,J)
        B(I,J) = A(I,J-1)-E(I,J)
      ENDDO
    ENDDO

Loop Interchange:

    DO I = 1, M
      DO J = 1, N
        A(I,J) = C(I,J)+D(I,J)
        B(I,J) = A(I,J-1)-E(I,J)
      ENDDO
    ENDDO

Statement Order Reversing:

    DO I = 1, M
      DO J = 1, N
        B(I,J) = A(I,J-1)-E(I,J)
        A(I,J) = C(I,J)+D(I,J)
      ENDDO
    ENDDO

Scalar Replacement:

    DO I = 1, M
      R1 = A(I, 0)
      DO J = 1, N
        B(I,J) = R1 - E(I,J)
        R1 = C(I,J)+D(I,J)
        A(I,J) = R1
      ENDDO
    ENDDO
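As a sanity check on this chain of transformations, the following sketch compares the original two-loop nest with the final fused, interchanged, reordered, and scalar-replaced version on random integer data. Python is used as executable pseudocode; the helper names and the test sizes are assumptions made for this illustration.

```python
import random


def make_inputs(m, n, seed=0):
    """Random integer arrays A, C, D, E indexed [i][j] with Fortran subscripts.

    Index 0 is allocated so that A(I, 0) exists.
    """
    rng = random.Random(seed)

    def grid():
        return [[rng.randint(-9, 9) for _ in range(n + 1)]
                for _ in range(m + 1)]

    return grid(), grid(), grid(), grid()


def original(a, c, d, e, m, n):
    a = [row[:] for row in a]
    b = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            a[i][j] = c[i][j] + d[i][j]
        for i in range(1, m + 1):
            b[i][j] = a[i][j - 1] - e[i][j]
    return a, b


def transformed(a, c, d, e, m, n):
    a = [row[:] for row in a]
    b = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        r1 = a[i][0]                    # A(I, 0), loaded once per I
        for j in range(1, n + 1):
            b[i][j] = r1 - e[i][j]      # r1 still holds A(I, J-1)
            r1 = c[i][j] + d[i][j]      # r1 now holds A(I, J)
            a[i][j] = r1
    return a, b
```

Reversing the statement order is what lets a single scalar carry A(I, J-1) into the next iteration: the use comes before the redefinition.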

Loop Fusion (6)

Loop Alignment for Fusion

Reminder: Blocking dependences cause problems for loop fusion.

    DO I = 1, M
      DO J = 1, N
        A(J,I) = B(J,I) + 1.0
      ENDDO
      DO J = 1, N
        C(J,I) = A(J+1,I) + 2.0
      ENDDO
    ENDDO

We cannot simply fuse the two loops, because doing so would introduce a backward carried antidependence.

Loop Fusion (7)

We can overcome this problem by aligning the loops:

    DO I = 1, M
      DO J = 0, N-1
        A(J+1,I) = B(J+1,I) + 1.0
      ENDDO
      DO J = 1, N
        C(J,I) = A(J+1,I) + 2.0
      ENDDO
    ENDDO

We can now fuse the two loops on their common iteration range, peeling a single iteration from the beginning of the first loop and a single iteration from the end of the second loop.

Loop Fusion (8)

Hence we get:

    DO I = 1, M
      A(1,I) = B(1,I) + 1.0
      DO J = 1, N-1
        A(J+1,I) = B(J+1,I) + 1.0
        C(J,I) = A(J+1,I) + 2.0
      ENDDO
      C(N,I) = A(N+1,I) + 2.0
    ENDDO

Scalar Replacement

    DO I = 1, M
      A(1,I) = B(1,I) + 1.0
      DO J = 1, N-1
        R1 = B(J+1,I) + 1.0
        A(J+1,I) = R1
        C(J,I) = R1 + 2.0
      ENDDO
      C(N,I) = A(N+1,I) + 2.0
    ENDDO

Loop Fusion (9)

Definition:

Let δ be a dependence between loops.

The Alignment Threshold of δ is defined as follows:

•If δ is loop independent after merging, threshold(δ) = 0.

•If δ is forward carried after merging, threshold(δ) is the negative of the resulting dependence threshold.

•If δ is fusion preventing, threshold(δ) is the threshold of the merged dependence.

Aligning by the largest threshold allows fusion.

Loop Fusion (10)

Example:

    DO I = 1, N
      A(I) = B(I) + 1.0
    ENDDO
    DO I = 1, N
      C(I) = A(I+1) + A(I-1)
    ENDDO

We have 2 dependences:

1. Forward carried with a threshold of 1, because of the reference A(I-1). This gives an alignment threshold of -1.

2. Backward carried with a threshold of 1, because of the reference A(I+1). This gives an alignment threshold of +1.

Loop Fusion (11)

Since (+1) > (-1), we should align by the alignment threshold (+1).

And so we get:

    DO I = 0, N-1
      A(I+1) = B(I+1) + 1.0
    ENDDO
    DO I = 1, N
      C(I) = A(I+1) + A(I-1)
    ENDDO

From here we can proceed to fuse the loops and then “Scalar Replace” A(I+1).
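The sketch below works through this example. It rests on one observation that is an assumption of this illustration rather than a statement from the slides: for a write A(I + a) in the first loop and a read A(I + b) in the second, all three cases of the alignment threshold definition reduce to the difference b - a. The code then checks that the aligned, fused, and scalar-replaced loop computes the same values as the original pair of loops. Python is used as executable pseudocode and all function names are hypothetical.

```python
def alignment_threshold(write_offset, read_offset):
    """Alignment threshold for a write A(I + write_offset) in the first loop
    and a read A(I + read_offset) in the second loop.

    After a naive merge, the value read in iteration i is written in
    iteration i + (read_offset - write_offset):
      zero     -> loop independent, threshold 0
      negative -> forward carried, negative of the dependence threshold
      positive -> fusion preventing, the dependence threshold itself
    """
    return read_offset - write_offset


def original(a, b, n):
    # a and b are dicts keyed by the Fortran subscript.
    a = dict(a)
    c = {}
    for i in range(1, n + 1):
        a[i] = b[i] + 1.0
    for i in range(1, n + 1):
        c[i] = a[i + 1] + a[i - 1]
    return a, c


def aligned_and_fused(a, b, n):
    a = dict(a)
    c = {}
    a[1] = b[1] + 1.0               # peeled first iteration of the aligned loop
    for i in range(1, n):           # common range 1..N-1
        r1 = b[i + 1] + 1.0         # scalar replacement of A(I+1)
        a[i + 1] = r1
        c[i] = r1 + a[i - 1]
    c[n] = a[n + 1] + a[n - 1]      # peeled last iteration of the second loop
    return a, c


def check(n=6):
    a = {i: float(-i) for i in range(0, n + 2)}
    b = {i: float(i * i) for i in range(1, n + 1)}
    return original(a, b, n) == aligned_and_fused(a, b, n)
```

The two references give thresholds of +1 and -1, so the first loop is shifted by one iteration, which is the alignment used above.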

Loop Fusion (12)

Fusion Mechanics

Assuming we have a collection of aligned loops how do we fuse them?

1. Sort the lower bounds of the loops into a nondecreasing sequence {L1, L2, …, Ln} and sort the upper bounds of the loops into a nondecreasing sequence {H1, H2, …, Hn}.

2. Produce a sequence of fusion loops with lower bounds of L1, L2, …, L(n-1) and respective upper bounds of L2 - 1, L3 - 1, …, Ln - 1.

3. Produce the central fused loop with a lower bound of Ln and an upper bound of H1.

4. Produce a sequence of fusion loops with lower bounds of H1 + 1, H2 + 1, …, H(n-1) + 1 and respective upper bounds of H2, H3, …, Hn.
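A minimal sketch of these four steps, assuming constant integer bounds and a nonempty common range. Python is used as executable pseudocode, and the function name `fusion_loops` and the example bounds are hypothetical. Besides the iteration range of each fusion loop, the sketch also reports which original loop bodies are active in it.

```python
def fusion_loops(bounds):
    """bounds: list of (lower, upper) pairs, one per aligned loop.

    Returns (lower, upper, active) triples in execution order, where
    `active` lists the indices of the original loops whose bodies are
    executed in that fusion loop.
    """
    lows = sorted(lo for lo, _ in bounds)
    highs = sorted(hi for _, hi in bounds)
    n = len(bounds)

    ranges = []
    # Step 2: loops before the central one, [L1, L2-1] ... [L(n-1), Ln-1].
    for k in range(n - 1):
        ranges.append((lows[k], lows[k + 1] - 1))
    # Step 3: central fused loop [Ln, H1].
    ranges.append((lows[-1], highs[0]))
    # Step 4: loops after the central one, [H1+1, H2] ... [H(n-1)+1, Hn].
    for k in range(n - 1):
        ranges.append((highs[k] + 1, highs[k + 1]))

    result = []
    for lo, hi in ranges:
        if lo > hi:
            continue  # empty range, e.g. when two loops share a bound
        active = [idx for idx, (l, h) in enumerate(bounds)
                  if l <= lo and hi <= h]
        result.append((lo, hi, active))
    return result
```

For three loops with the hypothetical bounds 1..10, 3..12, and 2..8, this yields the ranges 1..1, 2..2, 3..8, 9..10, and 11..12, with all three bodies active only in the central loop 3..8.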

Loop Fusion (13)

Example

[Figure: the iteration ranges of Loop 1, Loop 2, and Loop 3 after alignment. Each color represents a fusion loop.]

Loop Fusion (14)

The Weighted Fusion Problem

The last thing to do is to form the collections of the loops to be fused. We need to do it in a profitable manner.

Example

    L1  DO I = 1, 1000
          A(I) = B(I) + X(I)
        ENDDO
    L2  DO I = 1, 1000
          C(I) = A(I) + Y(I)
        ENDDO
    S   Z = FOO(A(1:1000))
    L3  DO I = 1, 500
          A(I) = C(I) + Z
        ENDDO

[Figure: weighted graph on the vertices L1, L2, S, and L3, with edge weights 1,000, 500, 500, 1,000, and 1,000.]

Loop Fusion (15)

Definition

A mixed-directed graph is a graph G = (V, E = Ed ∪ Eu) where (V, Ed) forms a directed graph, (V, Eu) forms an undirected graph, and Ed and Eu are disjoint.

•G is acyclic if (V, Ed) is acyclic.

•w is a successor or predecessor of v if it is such in (V, Ed).

•w is a neighbor of v if it is such in (V, Eu).

Loop Fusion (16)

Problem Definition

Let G be an acyclic mixed-directed graph, W a weight function on E, B a set of bad vertices, and Eb a set of bad edges. The weighted loop fusion problem is the problem of finding vertex sets {V1,V2,…,Vn} such that:

•{V1,V2,…,Vn} partitions V.

•Each vertex set Vi either contains no bad vertices, or consists of a single bad vertex.

•Given two vertices v and w in Vi, there is no path from v to w (in Ed) that leaves Vi.

•Given v and w in Vi, there is no bad edge between v and w.

•The induced graph on the vertex sets is acyclic.

The Target: To maximize the total weight of the edges between vertices in the same vertex set.

Loop Fusion (17)

Unfortunately, the Weighted Fusion Problem is NP-hard. Therefore we have to resort to heuristic-based algorithms.

A fast and simple one is the Fast Greedy algorithm for Weighted Fusion, which was developed by Kennedy.

The Algorithm

1. Initialize all the quantities and compute initial successor, predecessor, and neighbor sets.

2. Topologically sort the vertices of the directed acyclic graph.

Continued…

Loop Fusion (18)

The Algorithm (continued)

3. Process the vertices in V to compute for each vertex the set pathFrom[v], which contains all vertices that can be reached by a path from vertex v, and the set badPathFrom[v], a subset of pathFrom[v] that includes the set of vertices that can be reached from v by a path that contains a bad vertex or a bad edge.

4. Invert the sets pathFrom and badPathFrom, respectively, to produce the sets pathTo[v] and badPathTo[v] for each vertex v in the graph. The set pathTo[v] contains the vertices from which there is a path to v; the set badPathTo[v] contains the vertices from which v can be reached via a bad path.

Continued…

Loop Fusion (19)

5. Insert each of the edges into a priority queue edgeHeap by weight.

6. While edgeHeap is nonempty, select and remove the heaviest edge (v,w) from it. If w is in badPathFrom[v] then do not fuse – repeat step 6. Otherwise do the following:

• Collapse v, w, and every edge on the directed path between them.

• After each collapse, adjust the sets pathFrom, badPathFrom, pathTo, and badPathTo to reflect the new graph. That is, the composite node will now be reached from every vertex that reached a vertex in the composite, and it will reach any vertex that is reached by a vertex in the composite.

• After each vertex collapse, recompute successor, predecessor, and neighbor sets for the composite vertex, and recompute weights between the composite vertex and other vertices as appropriate.

The running time of the algorithm is O(EV + V^2).
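The greedy idea can be sketched in a deliberately simplified form. This is not the Fast Greedy algorithm itself: the sketch recomputes reachability from scratch instead of maintaining pathFrom and badPathFrom incrementally, it does not recompute edge weights after a collapse, and it makes no attempt to reach the stated running time. Python is used as executable pseudocode, the function name is hypothetical, and the edge set and weights used in the usage note are assumptions chosen for illustration.

```python
def greedy_weighted_fusion(vertices, directed, undirected=None,
                           bad_vertices=(), bad_edges=()):
    """directed / undirected: dicts mapping (v, w) -> weight.
    Returns the partition as a list of vertex sets."""
    undirected = undirected or {}
    group = {v: v for v in vertices}   # vertex -> representative of its set

    def members(rep):
        return {v for v in vertices if group[v] == rep}

    def reachable(src):
        # Representatives reachable from `src` in the collapsed directed graph.
        seen, stack = set(), [src]
        while stack:
            g = stack.pop()
            for (v, w) in directed:
                if group[v] == g and group[w] != g and group[w] not in seen:
                    seen.add(group[w])
                    stack.append(group[w])
        return seen

    # Heaviest edge first (a sorted list stands in for the priority queue).
    edges = sorted({**directed, **undirected}.items(),
                   key=lambda kv: kv[1], reverse=True)
    for (v, w), _weight in edges:
        gv, gw = group[v], group[w]
        if gv == gw:
            continue
        if gv in reachable(gw):
            gv, gw = gw, gv            # orient so that any path runs gv -> gw
        from_v = reachable(gv)
        merged = {gv, gw}
        if gw in from_v:
            # Everything on a directed path between the two must collapse too.
            merged |= {g for g in from_v if gw in reachable(g)}
        fused = set().union(*(members(g) for g in merged))
        if any(x in bad_vertices for x in fused):
            continue                   # a bad vertex may not be fused
        if any(p in fused and q in fused for p, q in bad_edges):
            continue                   # a bad edge may not be internal
        for x in fused:
            group[x] = gv

    parts = {}
    for v in vertices:
        parts.setdefault(group[v], set()).add(v)
    return list(parts.values())
```

As a usage example, take the vertices L1, L2, S, L3 with S as a bad vertex and the assumed directed edges L1→L2 (1000), L1→S (1000), S→L3 (1000), L2→L3 (500), and L1→L3 (500). The sketch fuses L1 with L2 and leaves S and L3 alone, because every path from the fused pair to L3 would pull in the bad vertex S.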

Loop Fusion (20)

In the previous example the greedy algorithm will fuse L1 and L2, which is the optimal solution.

Loop Fusion (21)

However, the algorithm is not optimal. Consider the following example:

[Figure: weighted fusion graph with the vertices a, b, c, d, e, f and a bad vertex.]

Loop Fusion (22)

Since the edge (a,f) is the heaviest, the greedy algorithm will fuse the vertices a, b, c, d, f together.

The weight of this solution is 16.

Loop Fusion (23)

However, fusing c, d, e, f and a, b produces a better result.

The weight of this solution is 23.

Loop Fusion (24)

Multilevel Loop Fusion

When dealing with multiple levels of loop nesting, the strategy is simple: first align and fuse the outermost loops, then recursively repeat the process for the bodies of the resulting loops.

Starting with the inner loops is inefficient at best, since we will not be able to fuse all of them. If we insist on fusing them anyway, we might get wrong code, because the outer loops might still need alignment, which changes the references in the inner loops.

Putting It All Together (1)

In which order should the transformations be applied?

The recommended order is as follows:

1. Loop Interchange.

2. Loop Alignment and Fusion.

3. Unroll and Jam.

4. Scalar Replacement.

But Why?

Putting It All Together (2)

1. Loop Interchange: Fusion might interfere with loop interchange, therefore interchange should be done first.

2. Loop Alignment and Fusion: This can achieve extra reuse across loops.

3. Unroll and Jam: This can achieve outer loop reuse when, after interchange is finished, there are dependences carried by a loop other than the innermost loop.

4. Scalar Replacement: As we already noted, this is the actual “exposure”, so this must be the last transformation.

Complex Loop Nests (1)

Loops with If Statements

Consider the following example:

    DO I = 1, N
      IF (M(I).LT.0) THEN
        A(I) = B(I) + C
      ENDIF
      D(I) = A(I) + E
    ENDDO

Scalar Replacement

    DO I = 1, N
      IF (M(I).LT.0) THEN
        a0 = B(I) + C
        A(I) = a0
      ENDIF
      D(I) = a0 + E
    ENDDO

Error: a0 may not be initialized, because it is assigned only when the IF branch is taken.

Complex Loop Nests (2)

We can overcome this problem in the following way:

    DO I = 1, N
      IF (M(I).LT.0) THEN
        a0 = B(I) + C
        A(I) = a0
      ELSE
        a0 = A(I)
      ENDIF
      D(I) = a0 + E
    ENDDO

Note: We didn’t increase the running time.
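A short check of the repaired version, written in Python as executable pseudocode with hypothetical function names and test data. The `else` branch is the inserted initialization: it performs the one load of A(I) that the original statement D(I) = A(I) + E would have performed anyway.

```python
def reference(m, a, b, c, e):
    """Direct transcription of the original loop."""
    a = list(a)
    d = [None] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a[i] = b[i] + c
        d[i] = a[i] + e
    return a, d


def scalar_replaced(m, a, b, c, e):
    """Scalar replacement with the initialization on the ELSE path."""
    a = list(a)
    d = [None] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a0 = b[i] + c
            a[i] = a0
        else:
            a0 = a[i]   # inserted initialization: a0 is now defined on both paths
        d[i] = a0 + e
    return a, d
```

Both functions return the same A and D for any mask M, including the case where the condition is false on the very first iteration.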

Complex Loop Nests (3)

Given a control flow graph of the loop, and assuming that each If statement has (possibly empty) Else branch:

•We insert an initialization at the beginning of block b if the variable is used in b but not initialized on any path to b.

•We insert an initialization at the end of block b if the variable has not been initialized on any path to the block, it is live on exit from the block, and at some successor to the block it is used. (as done in the example).

Complex Loop Nests (4)

Triangular Unroll and Jam

Consider the following example:

    DO I = 2, 99
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
      ENDDO
    ENDDO

Naïve Unroll and Jam

    DO I = 2, 99, 2
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
        A(I+1,J) = A(I+1,I+1) + A(J,J)
      ENDDO
    ENDDO

Error: We miss an assignment. The second copy of the body should also run for J = I, so A(I+1,I) is never assigned.

We can solve the problem by applying Unroll and Jam step by step and using the loop fusion mechanics.

Complex Loop Nests (5)

Original Code

    DO I = 2, 99
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
      ENDDO
    ENDDO

Unroll

    DO I = 2, 99, 2
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
      ENDDO
      DO J = 1, I
        A(I+1,J) = A(I+1,I+1) + A(J,J)
      ENDDO
    ENDDO

Jam (Fusion)

    DO I = 2, 99, 2
      DO J = 1, I-1
        A(I,J) = A(I,I) + A(J,J)
        A(I+1,J) = A(I+1,I+1) + A(J,J)
      ENDDO
      A(I+1,I) = A(I+1,I+1) + A(I,I)
    ENDDO

Scalar Replacement

    DO I = 2, 99, 2
      tI = A(I,I)
      tI1 = A(I+1,I+1)
      DO J = 1, I-1
        tJ = A(J,J)
        A(I,J) = tI + tJ
        A(I+1,J) = tI1 + tJ
      ENDDO
      A(I+1,I) = tI1 + tI
    ENDDO
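The difference between the naïve and the step-by-step version can be checked directly. The sketch below is written in Python as executable pseudocode; the function names and the initial array contents are assumptions made for this illustration. It relies on the upper bound 99 giving an even number of I iterations, so the unrolled loop needs no cleanup iteration.

```python
def fresh(n):
    # A(I, J) is a[i][j]; index 0 and index n+1 are allocated but unused.
    return [[1000 * i + j for j in range(n + 2)] for i in range(n + 2)]


def original(n=99):
    a = fresh(n)
    for i in range(2, n + 1):
        for j in range(1, i):
            a[i][j] = a[i][i] + a[j][j]
    return a


def naive_unroll_and_jam(n=99):
    a = fresh(n)
    for i in range(2, n + 1, 2):
        for j in range(1, i):
            a[i][j] = a[i][i] + a[j][j]
            a[i + 1][j] = a[i + 1][i + 1] + a[j][j]
        # The J = I iteration of the second copy is missing here.
    return a


def triangular_unroll_and_jam(n=99):
    a = fresh(n)
    for i in range(2, n + 1, 2):
        t_i = a[i][i]
        t_i1 = a[i + 1][i + 1]
        for j in range(1, i):
            t_j = a[j][j]
            a[i][j] = t_i + t_j
            a[i + 1][j] = t_i1 + t_j
        a[i + 1][i] = t_i1 + t_i   # peeled J = I iteration of the second copy
    return a
```

The diagonal elements are only read, never written, which is why A(I,I) and A(I+1,I+1) can be held in scalars for the whole inner loop.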

Complex Loop Nests (6)

Note: It is also possible to Unroll using a factor bigger than 2, using the same techniques.

Complex Loop Nests (7)

Trapezoidal Unroll and Jam

The same technique can be used for general trapezoidal loops, for example (a part of a convolution code):

    DO I = 0, N
      DO J = I, I+N2
        F3(I) = F3(I) + F1(J)*W(I-J)
      ENDDO
      F3(I) = F3(I)*DT
    ENDDO

Unroll

    DO I = 0, N, 2
      DO J = I, I+N2
        F3(I) = F3(I) + F1(J)*W(I-J)
      ENDDO
      F3(I) = F3(I)*DT
      DO J = I+1, I+N2+1
        F3(I+1) = F3(I+1) + F1(J)*W(I-J+1)
      ENDDO
      F3(I+1) = F3(I+1)*DT
    ENDDO

Complex Loop Nests (8)

Jam (Fusion)

    DO I = 0, N, 2
      F3(I) = F3(I) + F1(I)*W(0)
      DO J = I+1, I+N2
        F3(I) = F3(I) + F1(J)*W(I-J)
        F3(I+1) = F3(I+1) + F1(J)*W(I-J+1)
      ENDDO
      F3(I+1) = F3(I+1) + F1(I+N2+1)*W(-N2)
      F3(I) = F3(I)*DT
      F3(I+1) = F3(I+1)*DT
    ENDDO

Applying Scalar Replacement gave a speedup of 2.22 on a MIPS M120…
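The jammed form follows from the fusion mechanics: the two inner loops cover J = I..I+N2 and J = I+1..I+N2+1, so the central fused loop runs over the common range I+1..I+N2, with one peeled iteration before it and one after. The sketch below checks this against the original nest. It is written in Python as executable pseudocode; the function names and test data are assumptions for this illustration, F3 is assumed to start at zero, N is assumed odd so that the unrolled loop needs no cleanup iteration, and W is a dictionary so that the negative subscripts of the slides can be kept.

```python
def original(f1, w, n, n2, dt):
    f3 = [0] * (n + 1)
    for i in range(n + 1):
        for j in range(i, i + n2 + 1):
            f3[i] = f3[i] + f1[j] * w[i - j]
        f3[i] = f3[i] * dt
    return f3


def unrolled_and_jammed(f1, w, n, n2, dt):
    f3 = [0] * (n + 1)
    for i in range(0, n + 1, 2):
        # Peeled J = I iteration of the first copy.
        f3[i] = f3[i] + f1[i] * w[0]
        # Central fused loop over the common range I+1 .. I+N2.
        for j in range(i + 1, i + n2 + 1):
            f3[i] = f3[i] + f1[j] * w[i - j]
            f3[i + 1] = f3[i + 1] + f1[j] * w[i - j + 1]
        # Peeled J = I+N2+1 iteration of the second copy.
        f3[i + 1] = f3[i + 1] + f1[i + n2 + 1] * w[-n2]
        f3[i] = f3[i] * dt
        f3[i + 1] = f3[i + 1] * dt
    return f3


def check(n=5, n2=3, dt=2):
    f1 = list(range(1, n + n2 + 2))            # F1(0 .. N+N2)
    w = {k: 10 + k for k in range(-n2, 1)}     # W(-N2 .. 0)
    return original(f1, w, n, n2, dt) == unrolled_and_jammed(f1, w, n, n2, dt)
```

After this step, F3(I), F3(I+1), and F1(J) are natural candidates for scalar replacement, since each is referenced repeatedly inside the fused inner loop.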

Summary (1)

In this lecture we covered:

1. Loop Interchange – This gives us more dependences in the innermost loop which we can utilize for more register reuse.

2. Loop Fusion and Alignment – Bring uses together so they can share registers.

3. Complex Loops – How to overcome some of the problems in real-world programs.
