Creating Coarse-grained Parallelism for Loop Nests
Chapter 6, Sections 6.3 through 6.9
Yaniv Carmeli

Post on 21-Dec-2015


Page 1: Creating Coarse-grained Parallelism for Loop Nests Chapter 6, Sections 6.3 through 6.9 Yaniv Carmeli


Page 2:

Last time… single-loop methods:

Privatization

Loop distribution

Alignment

Loop fusion

Page 3:

This time… perfect loop nests:

Loop Interchange

Loop Selection

Loop Reversal

Loop Skewing

Profitability-Based Methods

Page 4:

This time… imperfectly nested loops:

Multilevel Loop Fusion

Parallel Code Generation

Packaging Parallelism: Strip Mining, Pipeline Parallelism, Guided Self-Scheduling

Page 5:

Loop Interchange

DO I = 1, N
  DO J = 1, M
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , = )
Vectorization: OK. Parallelization: problematic (the outer I loop carries the dependence).

After interchange:

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO

Vectorization: bad. Parallelization: good.

Page 6:

Loop Interchange (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , < )

Loop interchange doesn't help here, as both loops carry the dependence. The best we can do:

DO I = 1, N
  PARALLEL DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  END PARALLEL DO
ENDDO

When can a loop be moved to the outermost position in the nest, and be guaranteed to be parallel?

Page 7:

Loop Interchange (Cont.)

Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that nest corresponding to the loop contains only "=" entries.

Proof sketch.
If: a column with only "=" entries represents a loop that can be interchanged to the outermost position and carries no dependence there.
Only if: suppose the column contains a non-"=" entry. If it is ">", the loop cannot be interchanged to the outermost position (a dependence would be reversed). If it is "<", the loop can be interchanged but carries a dependence, so it cannot be parallelized anyway.

Page 8:

Loop Interchange (Cont.)

Working with the direction matrix:

1. Move each loop whose column contains only "=" entries into the outermost position, parallelize it, and remove its column from the matrix.

2. Move the loop with the most "<" entries into the next outermost position, sequentialize it, and eliminate its column along with any rows representing the dependences it carries.

3. Repeat from step 1.
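The three-step procedure above can be sketched in Python (a toy helper, not code from the book; each row of the direction matrix is a tuple of '<', '=', '>' entries, one column per loop, outermost first):

```python
def select_loops(matrix, loops):
    """Return (parallel, sequential) loops chosen by the strategy above."""
    matrix = [list(row) for row in matrix]
    loops = list(loops)
    parallel, sequential = [], []
    while loops:
        # Step 1: a loop whose column is all '=' goes outermost, parallel.
        for j in range(len(loops)):
            if all(row[j] == '=' for row in matrix):
                parallel.append(loops.pop(j))
                for row in matrix:
                    row.pop(j)
                break
        else:
            # Step 2: sequentialize the loop with the most '<' entries and
            # drop its column plus the rows (dependences) it carries.
            j = max(range(len(loops)),
                    key=lambda c: sum(row[c] == '<' for row in matrix))
            sequential.append(loops.pop(j))
            matrix = [row[:j] + row[j + 1:] for row in matrix if row[j] != '<']
    return parallel, sequential

# Direction matrix of the three-statement I/J/K example on the next slide.
m = [('<', '=', '='), ('=', '=', '<'), ('<', '<', '<')]
print(select_loops(m, ['I', 'J', 'K']))  # (['J'], ['I', 'K'])
```

Sequentializing I removes the dependences it carries, which exposes J as parallel; K remains sequential.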

Page 9:

Loop Interchange (Cont.) Example:

DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):

< = =
= = <
< < <

DO I = 1, N
  PARALLEL DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  END PARALLEL DO
ENDDO

Page 10:

Loop Selection – Optimal?

Is the approach of selecting the loop with the most "<" directions optimal? For the direction matrix

< < = =
< = < =
< = = <
= < = =
= = < =
= = = <

it will result in NO parallelization, while other selections may allow parallelization.

Is it possible to derive a selection heuristic that provides optimal code?

Page 11:

Loop Selection

The problem of loop selection is NP-complete, so loop selection is best done by a heuristic: favor the selection of loops that must be sequentialized before parallelism can be uncovered.

Page 12:

Heuristic Loop Selection (Cont.)

Example of the principles involved in heuristic loop selection:

DO I = 2, N
  DO J = 2, M
    DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):

= < =
< = <
= < <
< = >

The I loop must be sequentialized because of the fourth dependence; the J loop must be sequentialized because of the first:

DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO

Page 13:

Loop Reversal

Using loop reversal to create coarse-grained parallelism. Consider:

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):

= < >
< = >

Reversing the K loop (DO K = L, 1, -1) turns the ">" entries in its column into "<":

= < <
< = <

Now K can be interchanged to the outermost position, where it carries all dependences, leaving I and J parallel:

DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  END PARALLEL DO
ENDDO

Page 14:

Loop Skewing

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K) and distance matrix:

= < =    0 1 0
< = =    1 0 0
= = <    0 0 1
= = =    0 0 0

Skewing with k = K + I + J yields:

DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO

whose direction matrix is:

= < <
< = <
= = <
= = =
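The effect of skewing on the dependence vectors can be checked mechanically. This Python sketch (hypothetical helpers) adds the I and J distances into the K column, using the distance vectors (0,1,0), (1,0,0), (0,0,1), (0,0,0) of this example:

```python
def skew_column(distances, target, by):
    """New distance rows after skewing column `target` by the columns `by`."""
    out = []
    for row in distances:
        row = list(row)
        row[target] = row[target] + sum(row[c] for c in by)
        out.append(row)
    return out

def directions(distances):
    """Map each distance to its direction: positive '<', negative '>', zero '='."""
    return [['<' if d > 0 else '>' if d < 0 else '=' for d in row]
            for row in distances]

dist = [(0, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0)]  # columns I, J, K
skewed = skew_column(dist, target=2, by=[0, 1])      # k = K + I + J
print(directions(skewed))
```

Every nonzero row now has "<" in the k column, so after interchange the k loop carries all the dependences.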

Page 15:

Loop Skewing - Main Benefits

Eliminates ">" signs in the matrix.

Transforms the skewed loop in such a way that, after outward interchange, it carries all dependences formerly carried by the loop with respect to which it is skewed.

Page 16:

Loop Skewing - Drawback

The resulting parallelism is usually unbalanced (the resulting loop executes a varying number of iterations each time). As we shall see, this is not really a problem for asynchronous parallelism (unlike vectorization).

Page 17:

Loop Skewing (Cont.)

Updated strategy:

1. Parallelize the outermost loop if possible.
2. Sequentialize at most one outer loop to find parallelism in the next loop.
3. If 1 and 2 fail, try skewing.
4. If 3 fails, sequentialize the loop that can be moved to the outermost position and covers the most other loops.

Page 18:

In Practice – Sometimes we get much worse execution times than we would have gotten by parallelizing fewer or different loops.

Page 19:

Profitability-Based Methods

A static performance estimation function is used. It need not be accurate, just good at selecting the better of two alternatives.

Key considerations: the cost of memory references, and sufficiency of granularity.

Page 20:

Profitability-Based Methods (Cont.)

It is impractical to choose from all possible code arrangements, so consider only a subset of them, based on properties of the cost function. In our case: consider only which loop is innermost.

Page 21:

Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

1. Subdivide all the references in the loop body into reference groups. Two references are in the same group if there is a loop-independent dependence between them, or a constant-distance loop-carried dependence between them.

Page 22:

Profitability-Based Methods (Cont.)

2. Determine whether subsequent accesses to the same reference are:

Loop-invariant: cost = 1
Unit stride: cost = number of iterations / cache line size
Non-unit stride: cost = number of iterations

Page 23:

Profitability-Based Methods (Cont.)

3. Compute the loop cost. For a candidate innermost loop (reconstructing the garbled formula from the example that follows):

loop_cost = (sum of the reference-group costs) x (aggregate number of iterations of the other loops)

Page 24:

Profitability-Based Methods: Example

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Page 25:

Profitability-Based Methods: Example

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Cost per reference and total loop cost for each candidate innermost loop (L = cache line size; Fortran arrays are column-major):

Innermost loop   C     A     B     Cost
I (best)         N/L   N/L   1     2N^3/L + N^2
J (worst)        N     1     N     2N^3 + N^2
K                1     N     N/L   N^3(1 + 1/L) + N^2
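A numeric sketch of this cost model for the matrix-multiply example (assumed sizes N and L; the per-reference classifications follow Fortran's column-major layout; all names are hypothetical):

```python
N, L = 512, 16   # assumed matrix size and cache-line length, in elements

# For each candidate innermost loop, each reference is loop-invariant (1),
# unit-stride (N/L), or non-unit-stride (N); the result is then scaled by
# the N*N iterations of the two outer loops.
per_iter_cost = {
    'I': (N / L) + (N / L) + 1,   # C unit stride, A unit stride, B invariant
    'J': N + 1 + N,               # C stride-N, A invariant, B stride-N
    'K': 1 + N + (N / L),         # C invariant, A stride-N, B unit stride
}
loop_cost = {name: N * N * c for name, c in per_iter_cost.items()}

order = sorted(loop_cost, key=loop_cost.get)  # cheapest innermost loop first
print(order)  # ['I', 'K', 'J']
```

The ordering by increasing cost, I then K then J, matches the reordering chosen on the next slide.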

Page 26:

Profitability-Based Methods: Example

Reorder the loops from innermost to outermost by increasing loop cost: I, K, J.

We can't always get the desired loop order (some permutations are illegal); try to find the legal permutation closest to the desired one.

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Page 27:

Profitability-Based Methods (Cont.)

Goal: given a desired loop order and a direction matrix for a loop nest, find the legal permutation closest to the desired one.

Method: until there are no more loops, choose, from all the loops that can be interchanged to the outermost position, the one that is outermost in the desired permutation; then drop that loop. It can be shown that if a legal permutation with the desired loop in the innermost position exists, this algorithm will find such a permutation.
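The greedy method above can be sketched as follows (a hypothetical helper; matrix rows are direction vectors over the original loop order, and `desired` lists loop names outermost-first):

```python
def closest_legal_permutation(matrix, desired, cols):
    """Build the nest outermost-first, preferring the desired order."""
    matrix = [list(row) for row in matrix]
    remaining = list(desired)
    perm = []
    while remaining:
        for name in remaining:            # try loops in desired order
            j = cols[name]
            # Legal outermost iff no dependence would get a leading '>'.
            if all(row[j] != '>' for row in matrix):
                perm.append(name)
                remaining.remove(name)
                # Rows carried by this loop no longer constrain inner loops.
                matrix = [row for row in matrix if row[j] != '<']
                break
        else:
            raise ValueError("no legal permutation from here")
    return perm

# Matrix multiply: the only dependences are the K-carried ones on C(I, J),
# so the desired order J, K outermost with I innermost is fully legal.
rows = [('=', '=', '<')]
print(closest_legal_permutation(rows, ['J', 'K', 'I'],
                                {'I': 0, 'J': 1, 'K': 2}))  # ['J', 'K', 'I']
```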

Page 28:

Profitability-Based Methods (Cont.)

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

For performance reasons the compiler may mark the inner loop as "not meant for parallelization": sequential execution of the innermost loop exploits locality in memory accesses.

Page 29:

Multilevel Loop Fusion

Commonly used for imperfect loop nests

Used after maximal loop distribution

Page 30:

Multilevel Loop Fusion

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

After maximal distribution and parallelization:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = B(I, J) + D
  ENDDO
END PARALLEL DO

After distribution each nest is better with a different outer loop – can't fuse!

Page 31:

Multilevel Loop Fusion (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
    C(I, J+1) = A(I, J) + C(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

After maximal distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

Which loop should be fused into the A loop?

(Fusion graph: A is parallel in i or j; B only in j; C only in i; D only in j.)

Page 32:

Multilevel Loop Fusion (Cont.)

Fusing the A loop with the B loop (2 barriers):

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

(Fusion graph: AB [j]; C [i]; D [j].)

Page 33:

Multilevel Loop Fusion (Cont.)

Fusing the A loop with the C loop instead (1 barrier):

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

Now we can also fuse B with D (both parallel in j):

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

(Fusion graph: AC [i]; BD [j].)

Page 34:

Multilevel Loop Fusion (Cont.)

Decision making needs look-ahead. Strategy: fuse with the loop that cannot be fused with one of its successors. Rationale: if a loop can't be fused with its successors, a barrier will be formed after it anyway.

(Fusion graph: A [i,j]; B [j]; C [i]; D [j] – a barrier is inevitable!)
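A toy encoding of this look-ahead rule (hypothetical representation: each nest maps to the set of loops it can run parallel in, with successor edges from the dependence graph):

```python
def pick_fusion_partner(node, types, succs):
    """Fuse `node` with a compatible successor that cannot itself fuse
    with any of its own successors (a barrier after it is inevitable)."""
    candidates = [s for s in succs[node] if types[node] & types[s]]
    for s in candidates:
        if all(not (types[s] & types[t]) for t in succs[s]):
            return s
    return candidates[0] if candidates else None

# The A/B/C/D example: A is parallel in i or j, B and D only in j, C only in i.
types = {'A': {'i', 'j'}, 'B': {'j'}, 'C': {'i'}, 'D': {'j'}}
succs = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(pick_fusion_partner('A', types, succs))  # 'C'
```

B can still fuse with D later, but C cannot fuse with D at all, so A is fused with C, exactly the choice that ends with a single barrier.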

Page 35:

Parallel Code Generation

Code generation scheme, Parallelize(l, D):

1. Try methods for perfect nests (loop interchange, loop skewing, loop reversal), and stop if parallelism is found.
2. If the nest can be distributed: distribute, recurse on the distributed nests, and merge.
3. Otherwise sequentialize the outer loop, eliminate the dependences it carries, and recurse on each of the loops nested in it.

Page 36:

Parallel Code Generation

procedure Parallelize(l, Dl);
  ParallelizeNest(l, success);   // try methods for perfect nests
  if ¬success then begin
    if l can be distributed then begin
      distribute l into loop nests l1, l2, …, ln;
      for i := 1 to n do
        Parallelize(li, Di);
      Merge({l1, l2, …, ln});
    end

Page 37:

Parallel Code Generation (Cont.)

    else begin   // l cannot be distributed
      for each outer loop l0 nested in l do begin
        let D0 be the set of dependences between statements in l0,
            less the dependences carried by l;
        Parallelize(l0, D0);
      end
      let S be the set of outer loops and statements left in l;
      if ||S|| > 1 then Merge(S);
    end
  end
end Parallelize

Page 38:

Parallel Code Generation (Cont.)

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

Both loops carry dependences, so loop interchange will not find sufficient parallelism. Try distribution:

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
ENDDO

DO J = 1, M
  DO I = 1, N
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

In the first nest the I loop can be parallelized; in the second, both loops can be:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N   !Left sequential for memory hierarchy
    X(I, J) = A(I, J) + C
  ENDDO
END PARALLEL DO

Now fusing: the first nest has type (I-loop, parallel), the second has type (J-loop, parallel). Different types – can't fuse.

Page 39:

Parallel Code Generation (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
    C(I, J+1) = A(I, J) + C(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

After distribution and parallelization the nests have types A: (J loop, parallel), B: (J loop, parallel), C: (I loop, parallel), D: (J loop, parallel):

PARALLEL DO J = 1, M
  DO I = 1, N   !Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

Merging fuses the A and B nests (both J-parallel):

PARALLEL DO J = 1, M
  DO I = 1, N   !Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

Page 40:

Erlebacher

DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
ENDDO

Each J loop can be parallelized in place (PARALLEL DO J = 1, JMAXD); the K loops stay sequential because of the F(I, J, K-1) recurrence.

Page 41:

Erlebacher

PARALLEL DO J = 1, JMAXD
L1:   DO I = 1, IMAXD
        F(I, J, 1) = F(I, J, 1) * B(1)
      ENDDO
L2:   DO K = 2, N-1
        DO I = 1, IMAXD
          F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
        ENDDO
      ENDDO
L3:   DO I = 1, IMAXD
        TOT(I, J) = 0.0
      ENDDO
L4:   DO I = 1, IMAXD
        TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
      ENDDO
L5:   DO K = 2, N-1
        DO I = 1, IMAXD
          TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
        ENDDO
      ENDDO
END PARALLEL DO

(Fusion graph over the nests L1–L5.)

Page 42:

Erlebacher

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO

Page 43:

Packaging of Parallelism

There is a trade-off between parallelism and granularity of synchronization: larger work units mean synchronization is needed less frequently, but at the cost of less parallelism and poorer load balance.

Page 44:

Strip Mining

Converts available parallelism into a form more suitable for the hardware:

DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO

k = CEIL(N / P)
PARALLEL DO I = 1, N, k
  DO i = I, MIN(I + k - 1, N)
    A(i) = A(i) + B(i)
  ENDDO
END PARALLEL DO

The value of P (the number of processors) is unknown until runtime, so strip mining is often handled by special hardware (Convex C2 and C3). Interruptions may be disastrous.
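The strip bounds computed by the transformed loop can be sketched as follows (a toy helper mirroring k = CEIL(N / P) and MIN(I + k - 1, N)):

```python
import math

def strips(n, p):
    """1-based (first, last) iteration bounds of each strip."""
    k = math.ceil(n / p)                              # k = CEIL(N / P)
    return [(i, min(i + k - 1, n)) for i in range(1, n + 1, k)]

print(strips(10, 4))  # [(1, 3), (4, 6), (7, 9), (10, 10)]
```

Note that when P does not divide N evenly, the last strip is shorter, which is exactly the MIN in the inner loop's upper bound.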

Page 45:

Strip Mining (Cont.)

What if the execution time varies among iterations?

PARALLEL DO I = 1, N
  DO J = 2, I
    A(J, I) = A(J-1, I) * 2.0
  ENDDO
END PARALLEL DO

(Figure: iteration I does I-1 units of work, so equal-sized strips carry unequal amounts of work.)

Solution: a smaller unit size allows a more balanced distribution.

Page 46:

Pipeline Parallelism

The Fortran DOACROSS construct pipelines parallel loop iterations with cross-iteration synchronization.

Useful where parallelization is not available; has high synchronization costs.

DOACROSS I = 2, N
S1:  A(I) = B(I) + C(I)
     POST(EV(I))
     IF (I > 2) WAIT(EV(I-1))
S2:  C(I) = A(I-1) + A(I)
ENDDO
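The POST/WAIT pipeline above can be simulated with one thread per iteration and a threading.Event per EV slot (toy sizes and data, not from the slides; the result is deterministic because S2 waits for the previous iteration's S1 to have written A(I-1)):

```python
import threading

N = 8                                    # assumed toy problem size
B = [float(i) for i in range(N + 1)]     # B(i) = i
C = [1.0] * (N + 1)
A = [0.0] * (N + 1)
ev = [threading.Event() for _ in range(N + 1)]

def body(i):
    A[i] = B[i] + C[i]                   # S1
    ev[i].set()                          # POST(EV(I))
    if i > 2:
        ev[i - 1].wait()                 # WAIT(EV(I-1))
    C[i] = A[i - 1] + A[i]               # S2: needs A(I-1) from iteration I-1

threads = [threading.Thread(target=body, args=(i,)) for i in range(2, N + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(C[2:])  # each C(I) = A(I-1) + A(I)
```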

Page 47:

Scheduling Parallel Work

(Figure: trade-off between load balance and little synchronization.)

Page 48:

Scheduling Parallel Work

With N iterations of duration B each, p processors, and a constant overhead of σ0 per processor, parallel execution is slower than serial execution when (reconstructing the garbled formula)

  N·B/p + σ0·p ≥ N·B

Bakery-counter scheduling: moderate synchronization overhead.

Page 49:

Guided Self-Scheduling

Incorporates some level of static scheduling to guide dynamic self-scheduling: groups of iterations are scheduled, going from large to small chunks of work.

The number of iterations dispensed at dispatch t follows (reconstructing the garbled formula): x_t = ⌈N_t / p⌉, where N_t is the number of iterations remaining, N_1 = N and N_{t+1} = N_t − x_t.

Page 50:

Guided Self-Scheduling (Cont.)

GSS (20 iterations, 4 processors) dispenses blocks of 5, 4, 3, 2, 2, 1, 1, 1, 1 iterations. Not completely balanced.

Required synchronizations: 9. With a bakery counter: 20.

Page 51:

Guided Self-Scheduling (Cont.)

In the example the last 4 allocations are for a single iteration each. Coincidence? The last p-1 blocks will always consist of a single iteration.

GSS(2): no block of iterations is smaller than 2.

GSS(k): no block is smaller than k, i.e. (reconstructing the garbled formula) x_t = max(k, ⌈N_t / p⌉).
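The dispatch sizes can be sketched with the commonly cited GSS rule, taking ⌈R/p⌉ of the R remaining iterations but never less than k (a hypothetical helper):

```python
import math

def gss_chunks(n, p, k=1):
    """Block sizes dispensed by GSS(k) for n iterations on p processors."""
    chunks, remaining = [], n
    while remaining > 0:
        x = min(remaining, max(k, math.ceil(remaining / p)))
        chunks.append(x)
        remaining -= x
    return chunks

print(gss_chunks(20, 4))       # [5, 4, 3, 2, 2, 1, 1, 1, 1] -- 9 dispatches
print(gss_chunks(20, 4, k=2))  # GSS(2): no block smaller than 2
```

For 20 iterations on 4 processors this gives 9 dispatches instead of the bakery counter's 20, with the tail of single-iteration blocks that GSS(k) is designed to avoid.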

Page 52:

Yaniv Carmeli
B.A. in CS

Thanks for your attention!