optimizing compilers for modern architectures allen and kennedy, chapter 13 compiling array...

40
Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Upload: carson-memmott

Post on 14-Dec-2015

227 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Allen and Kennedy, Chapter 13

Compiling Array Assignments

Page 2: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Fortran 90

• Fortran 90: successor to Fortran 77

• Slow to gain acceptance:—Need better/smarter compiler techniques to achieve same

level of performance as Fortran 77 compilers

• This chapter focuses on a single new feature - the array assignment statement: A[1:100] = 2.0—Intended to provide direct mechanism to specify

parallel/vector execution

• This statement must be implemented for the specific available hardware. In an uniprocessor, the statement must be converted to a scalar loop: Scalarization

Page 3: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Fortran 90

• Range of a vector operation in Fortran 90 denoted by a triplet: <lower bound: upper bound: increment>

A[1:100:2] = B[2:51] + 3.0

• Semantics of Fortran 90 require that for vector statements, all inputs to the statement are fetched before any results are stored

Page 4: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Outline

• Simple scalarization

• Safe scalarization

• Techniques to improve on safe scalarization—Loop reversal—Input prefetching—Loop splitting

• Multidimensional scalarization

• A framework for analyzing multidimensional scalarization

Page 5: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Scalarization

• Replace each array assignment by a corresponding DO loop

• Is it really that easy?

• Two key issues:— Wish to avoid generating large array temporaries— Wish to optimize loops to exhibit good memory hierarchy — performance

Page 6: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Simple Scalarization

• Consider the vector statement:A(1:200) = 2.0 * A(1:200)

• A scalar implementation:S1 DO I = 1, 200

S2 A(I) = 2.0 * A(I)

ENDDO

• However, some statements cause problems:A(2:201) = 2.0 * A(1:200)

• If we naively scalarize:DO i = 1, 200

A(i+1) = 2.0 * A(i)

ENDDO

Page 7: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Simple Scalarization

• Meaning of statements changed by the scalarization process: scalarization faults

• Naive algorithm which ignores such scalarization faults:

procedure SimpleScalarize(S)

let V0 = (L0: U0: D0 ) be the vector iteration specifier on left side of S;

// Generate the scalarizing loop

let I be a new loop index variable;

generate the statement DO I = L0 ,U0 ,D ;

for each vector specifier V = (L: U: D) in S do

replace V with (I+L-L0 );

generate an ENDDO statement;

end SimpleScalarize

Page 8: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Scalarization Faults

• Why do scalarization faults occur?

• Vector operation semantics: All values from the RHS of the assignment should be fetched before storing into the result

• If a scalar operation stores into a location fetched by a later operation, we get a scalarization fault

• Principle 13.1: A vector assignment generates a scalarization fault if and only if the scalarized loop carries a true dependence.

• These dependences are known as scalarization dependences

• To preserve correctness, compiler should never produce a scalarization dependence

Page 9: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Safe Scalarization

• Naive algorithm for safe scalarization: Use temporary storage to make sure scalarization dependences are not created

• Consider:A(2:201) = 2.0 * A(1:200)

• can be split up into:T(1:200) = 2.0 * A(1:200)

A(2:201) = T(1:200)

• Then scalarize using SimpleScalarizeDO I = 1, 200

T(I) = 2.0 * A(I)

ENDDO

DO I = 2, 201

A(I) = T(I-1)

ENDDO

Page 10: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Safe Scalarization

• Procedure SafeScalarize implements this method of scalarization

• Good news: —Scalarization always possible by using temporaries

• Bad News:—Substantial increase in memory use due to temporaries—More memory operations per array element

• We shall look at a number of techniques to reduce the effects of these disadvantages

Page 11: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Reversal

A(2:256) = A(1:255) + 1.0

• SimpleScalarize will produce a scalarization fault

• Solution: Loop reversal

DO I = 256, 2, -1

A(I) = A(I+1) + 1.0

ENDDO

Page 12: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Reversal

• When can we use loop reversal?—Loop reversal maps dependences into antidependences—But also maps antidependences into dependences

A(2:257) = ( A(1:256) + A(3:258) ) / 2.0

• After scalarization:DO I = 2, 257

A(I) = ( A(I-1) + A(I+1) ) / 2.0

ENDDO

• Loop Reversal gets us:DO I = 257, 2

A(I) = ( A(I-1) + A(I+1) ) / 2.0

ENDDO

• Thus, cannot use loop reversal in presence of antidependences

Page 13: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Input Prefetching

A(2:257) = ( A(1:256) + A(3:258) ) / 2.0

• Causes a scalarization fault when naively scalarized to:

DO I = 2, 257

A(I) = ( A(I-1) + A(I+1) ) / 2.0

ENDDO

• Problem: Stores into first element of the LHS in the previous iteration

• Input prefetching: Use scalar temporaries to store elements of input and output arrays

Page 14: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Input Prefetching

• A first-cut at using temporaries:

DO I = 2, 257

T1 = A(I-1)

T2 = ( T1 + A(I+1) ) / 2.0

A(I) = T2

ENDDO

• T1 holds element of input array, T2 holds element of output array

• But this faces the same problem. Can correct by moving assignment to T1 into previous iteration...

Page 15: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Input Prefetching

T1 = A(1)

DO I = 2, 256

T2 = ( T1 + A(I+1) ) / 2.0

T1 = A(I)

A(I) = T2

ENDDO

T2 = ( T1 + A(257) ) / 2.0

A(I) = T2

• Note: We are using scalar replacement, but the motivation for doing so is different than in Chapter 8

Page 16: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Input Prefetching

• Already seen in Chapter 8, we need as many temporaries as the dependence threshold + 1.

• Example:

DO I = 2, 257

A(I+2) = A(I) + 1.0

ENDDO

• Can be changed to:T1 = A(1)T2 = A(2)DO I = 2, 255 T3 = T1 + 1.0 T1 = T2 T2 = A(I+2) A(I+2) = T3 ENDDOT3 = T1 + 1.0T1 = T2A(258) = T3T3 = T1 + 1.0A(259) = T3

Page 17: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Input Prefetching

• Can also unroll the loop and eliminate register to register copies

• Principle 13.2: Any scalarization dependence with a threshold known at compile time can be corrected by input prefetching.

Page 18: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Input Prefetching

• Sometimes, even when a scalarization dependence does not have a constant threshold, input prefetching can be used effectively

A(1:N) = A(1:N) / A(1)

• which can be naively scalarized as:DO i = 1, N

A(i) = A(i) / A(1)

ENDDO

• true dependence from first iteration to every other iteration

• antidependence from first iteration to itself

• Via input prefetching, we get:tA1 = A(1)

DO i = 1, N

A(i) = A(i) / tA1

ENDDO

Page 19: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Splitting

• Problem with using input prefetching with thresholds > 1: Temporaries must be saved for each iteration of the loop up to the threshold

• Can potentially use loop splitting to solve this problem

A(3:6) = ( A(1:4) + A(5:8) ) / 2.0

• Scalarization loop:

DO I = 3, 6

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

• True dependence and antidependence with threshold of 2

• Split up into two independent loops which do not interact with each other...

Page 20: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Splitting

DO I = 3, 6

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

• Both true and anti dependence has threshold of 2

• With loop splitting becomes: DO I = 3, 5, 2

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

DO I = 4, 6, 2

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

• Note that the threshold becomes 1 for each loop. Could have produced incorrect results if threshold of antidependence was not divisible by 2

Page 21: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Splitting

• Can write the splitting as a nested pair of loops:DO i1 = 3, 4

DO i2 = i1, 6, 2

A(i2) = (A(i2-2) + A(i2+2) ) / 2.0

ENDDO

ENDDO

• The inner loop carries a scalarization dependence with threshold 1 and the outer loop carries no dependence. Can apply input prefetching:

DO i1 = 3, 4 T1 = A(i1-2) DO i2 = i1, 6, 2 T2 = (T1 + A(i2 + 2)) / 2.0 T1 = A(i2) A(i2) = T2 ENDDOENDDO

Page 22: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Splitting

• Principle 13.3: Any scalarization loop in which all true dependences have the same constant threshold T and all antidependences have a threshold that is divisible by T can be transformed, using input prefetching and loop splitting, so that all scalarization dependences are eliminated.

Page 23: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Scalarization Algorithm

• Revised scalarization Algorithm:

procedure FullScalarize

for each vector statement S do begin

compute the dependences of S on itself as though S had been scalarized;

if S has no scalarization dependences upon itself then SimpleScalarize(S);

else if S has scalarization dependences, but no self antidependences

then begin

SimpleScalarize(S);

reverse the scalarization loop;

end

Page 24: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Scalarization Algorithmelse if all scalarization dependences have a threshold of 1 then begin

SimpleScalarize(S);

InputPrefetch(S);

end

else if all scalarization dependences for S have the same constant threshold T

and all antidependences have thresholds that are divisible by T

then SplitLoop(S);

else if all antidependences for S have the same constant threshold T

and all true dependences have thresholds that are divisible by T then begin

reverse the loop;

SplitLoop(S);

end

else SafeScalarize(S, SL);

end

end FullScalarize

Page 25: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Multidimensional Scalarization

• Vector statements in Fortran 90 in more than 1 dimension:

A(1:100, 1:100) = B(1:100, 1, 1:100)

• corresponds to:DO J = 1, 100

A(1:100, J) = B(1:100, 1, J)

ENDDO

• Scalarization in multiple dimensions:A(1:100, 1:100) = 2.0 * A(1:100, 1:100)

• Obvious Strategy: convert each vector iterator into a loop:

DO J = 1, 100, 1

DO I = 1, 100

A(I,J) = 2.0 * A(I,J)

ENDDO

ENDDO

Page 26: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Multidimensional Scalarization

• What should the order of the loops be after scalarization?—Familiar question: We dealt with this issue in Loop

Selection/Interchange in Chapter 5

• Profitability of a particular configuration depends on target architecture—For simplicity, we shall assume shorter strides through

memory are better—Thus, optimal choice for innermost loop is the leftmost

vector iterator

Page 27: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Multidimensional Scalarization

• Extending previous results to multiple dimensions:— Each vector iterator is scalarized separately, starting from the

leftmost vector iterator in the innermost loop and the rest of the iterators from left to right

• Once the ordering is available:1. Test to see if the loop carries a scalarization dependence. If not,

then proceed to the next loop.2. If the scalarization loop carries only true dependences, reverse

the loop and proceed to the next loop.3. Apply input prefetching, with loop splitting where appropriate, to

eliminate dependences to which it applies. Observe, however, that in outer loops, prefetching is done for a single submatrix (the remaining dimensions).

4. Otherwise, the loop carries a scalarization fault that requires temporary storage. Generate a scalarization that utilizes temporary storage and terminate the scalarization test for this loop, since temporary storage will eliminate all scalarization faults.

Page 28: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Outer Loop Prefetching

A(1:N, 1:N) =

(A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0

• If we try to scalarize this (keeping the column iterator in the innermost loop) we get a true scalarization dependence (<, >) involving the second input and an antidependence (>, <) involving the first input

• Cannot use loop reversal...

Page 29: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Outer Loop PrefetchingA(1:N, 1:N) =

(A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0

• We can use input prefetching on the outer loop. The temporaries will be arrays:

T0(1:N) = A(2:N+1, 0)

DO j = 1, N-1

T1(1:N)=( A(0:N-1, j+1) + T0(1:N) ) / 2.0

T0(1:N) = A(2:N+1, j)

A(1:N, j) = T1(1:N)

ENDDO

T1(1:N) = ( A(0:N-1, N) + T0(1:N) ) / 2.0

A(1:N, N) = T1(1:N)

• Total temporary space required = 2 rows of original matrix

• Better than storage required for copy of the result matrix

Page 30: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Interchange

• Sometimes, there is a tradeoff between scalarization and optimal memory hierarchy usage

A(2:100, 3:101) = A(3:101, 1:201:2)

• If we scalarize this using the prescribed order:

DO I = 3, 101

DO 100 J = 2, 100

A(J,I) = A(J+1,2*I-5)

ENDDO

ENDDO

• Dependences (<, >) (I = 3, 4) and (>, >) (I = 6, 7)

• Cannot use loop reversal, input prefetching

• Can use temporaries

Page 31: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Loop Interchange

• However, we can use loop interchange to get:

DO J = 2, 100

DO I = 3, 101

A(J,I) = A(J+1,2*I-5)

ENDDO

ENDDO

• Not optimal memory hierarchy usage, but reduction of temporary storage

• Loop interchange is useful to reduce size of temporaries

• It can also eliminate scalarization dependences

Page 32: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

General Multidimensional Scalarization

• Goal: To vectorize a single statement which has m vector dimensions —Given an ideal order of scalarization (l1, l2, ..., lm)

—(d1, d2, ..., dn) be direction vectors for all true and antidependences of the statement upon itself

—The scalarization matrix is a n m matrix of these direction vectors

• For instance:

A(1:N, 1:N, 1:N) = A(0:N-1, 1:N, 2:N+1) + A(1:N, 2:N+1, 0:N-1)

> = < < > =

Page 33: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

General Multidimensional Scalarization

• If we examine any column of the direction matrix, we can immediately see if the corresponding loop can be safely scalarized as the outermost loop of the nest:—If all entries of the column are = or >, it can be safely

scalarized as the outermost loop without loop reversal.—If all entries are = or <, it can be safely scalarized with

loop reversal.—If it contains a mixture of < and >, it cannot be scalarized

by simple means.

Page 34: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

General Multidimensional Scalarization

• Once a loop has been selected for scalarization, the dependences carried by that loop, any dependence whose direction vector does not contain a = in the position corresponding to the selected loop may be eliminated from further consideration.

• In our example, if we move the second column to the outside, we get:

> = < = > <

< > = > < =

• Scalarization in this way will reduce the matrix to:

< >

Page 35: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

A Complete Scalarization Algorithmprocedure CompleteScalarize(S, loop_list)

let M be the scalarization direction matrix resulting from scalarization S to loop_list;

while there are more loops to be scalarized do begin

let l be the first loop in loop list that can be simply scalarized with or without loop reversal (determine by examining the columns of M from left

to right);

if there is no such l then begin

let l be the first loop on loop_list;

section l by input prefetching

if the previous step fails then

section S using the naive temporary method and exit;

end

Page 36: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

A Complete Scalarization Algorithm

else // make l the outermost loop

section l directly or with loop reversal, depending on the entries in the column of M corresponding to l (if l is the last loop, use

hardware section length);

remove l from loop_list;

let M' be M with the column corresponding to l and the rows corresponding to non “=“ entries in that column eliminated;

M := M';

end //while

end CompleteScalarize

• Time Complexity = O(m2 n) where—m = number of loops—n = number of dependences

Page 37: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

A Complete Scalarization Algorithm

• Correctness follows from the definition of the scalarization matrix and the method to remove dependences from the matrix

• For a given statement S and loop list, Scalarize produces a correct scalarization with the following properties:1 input prefetching is applied to the innermost loop possible the order of scalarization loops is the closest possible to

the order specified on input among scalarizations with property (1).

Page 38: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Scalarization ExampleDO J = 2, N-1

A(2:N-1,J) = A(1:N-2,J) + A(3:N,J) + &

A(2:N-1,J-1) + A(2:N-1,J+1)/4.

ENDDO

• Loop carried true dependence, antidependence

• Naive compiler could generate:DO J = 2, N-1

DO i = 2, N-1

T(i-1) = (A(i-1,J) + A(i+1,J) + &

A(i,J-1) + A(i,J+1) )/4

ENDDO

DO i = 2, N-1

A(i,J) = T(i-1)

ENDDO

ENDDO

• 2 (N-2)2 accesses to memory due to array T

Page 39: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Scalarization Example

• However, can use input prefetching to get:DO J = 2, N-1

tA0 = A(1, J)

DO i = 2, N-2

tA1 = (tA0+A(i+1,J)+A(i,J-1)+A(i,J+1))/4

tA0 = A(i-1, J)

A(i,J) = tA1

ENDDO

tA1 = (tA0+A(N,J)+A(N-1,J-1)+A(N-1,J+1))/4

A(N-1,J) = tA1

ENDDO

• If temporaries are allocated to registers, no more memory accesses than original Fortran 90 program

Page 40: Optimizing Compilers for Modern Architectures Allen and Kennedy, Chapter 13 Compiling Array Assignments

Optimizing Compilers for Modern Architectures

Post Scalarization Issues

• Issues due to scalarization:—Generates many individual loops—These loops carry no dependences. So reuse of quantities

in registers is not common

• Solution: Use loop interchange, loop fusion, unroll-and-jam, and scalar replacement