Creating Coarse-grained Parallelism for Loop Nests
Chapter 6, Sections 6.3 through 6.9
Yaniv Carmeli

Post on 21-Dec-2015


Page 1: Creating Coarse-grained Parallelism for Loop Nests Chapter 6, Sections 6.3 through 6.9 Yaniv Carmeli


Page 2:

Last time… single-loop methods:

Privatization

Loop distribution

Alignment

Loop fusion

Page 3:

This time… perfect loop nests:

Loop Interchange

Loop Selection

Loop Reversal

Loop Skewing

Profitability-Based Methods

Page 4:

This time… imperfectly nested loops:

Multilevel Loop Fusion

Parallel Code Generation

Packaging Parallelism: Strip Mining, Pipeline Parallelism, Guided Self-Scheduling

Page 5:

Loop Interchange

DO I = 1, N
  DO J = 1, M
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , = )
Vectorization: OK. Parallelization: problematic (the outer I loop carries the dependence).

After interchange:

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO

Vectorization: bad. Parallelization: good.

Page 6:

Loop Interchange (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , < )

Loop interchange doesn't help here, as both loops carry the dependence. The best we can do:

DO I = 1, N
  PARALLEL DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  END PARALLEL DO
ENDDO

When can a loop be moved to the outermost position in the nest, and be guaranteed to be parallel?

Page 7:

Loop Interchange (Cont.)

Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that nest corresponding to the loop contains only "=" entries.

Proof sketch.
If: a column with only "=" entries represents a loop that can be interchanged to the outermost position and carries no dependence there.
Only if: suppose the column contains a non-"=" entry. If it is ">", the loop cannot be interchanged to the outermost position (a dependence would be reversed). If it is "<", the loop can be interchanged but carries a dependence, so it cannot be parallelized anyway.

Page 8:

Loop Interchange (Cont.)

Working with the direction matrix:

1. Move each loop whose column contains only "=" entries into the outermost position, parallelize it, and remove its column from the matrix.

2. Move the loop with the most "<" entries into the next outermost position, sequentialize it, and eliminate its column along with any rows representing the dependences it carries.

3. Repeat from step 1.
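The three-step procedure above can be sketched in Python (a toy helper, not code from the book; each row of the direction matrix is a tuple of '<', '=', '>' entries, one column per loop, outermost first):

```python
def select_loops(matrix, loops):
    """Return (parallel, sequential) loops chosen by the strategy above."""
    matrix = [list(row) for row in matrix]
    loops = list(loops)
    parallel, sequential = [], []
    while loops:
        # Step 1: a loop whose column is all '=' goes outermost, parallel.
        for j in range(len(loops)):
            if all(row[j] == '=' for row in matrix):
                parallel.append(loops.pop(j))
                for row in matrix:
                    row.pop(j)
                break
        else:
            # Step 2: sequentialize the loop with the most '<' entries and
            # drop its column plus the rows (dependences) it carries.
            j = max(range(len(loops)),
                    key=lambda c: sum(row[c] == '<' for row in matrix))
            sequential.append(loops.pop(j))
            matrix = [row[:j] + row[j + 1:] for row in matrix if row[j] != '<']
    return parallel, sequential

# Direction matrix of the three-statement I/J/K example on the next slide.
m = [('<', '=', '='), ('=', '=', '<'), ('<', '<', '<')]
print(select_loops(m, ['I', 'J', 'K']))  # (['J'], ['I', 'K'])
```

Sequentializing I removes the dependences it carries, which exposes J as parallel; K remains sequential.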

Page 9:

Loop Interchange (Cont.) Example:

DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):

< = =
= = <
< < <

DO I = 1, N
  PARALLEL DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  END PARALLEL DO
ENDDO

Page 10:

Loop Selection – Optimal?

Is the approach of selecting the loop with the most "<" directions optimal? For the direction matrix

< < = =
< = < =
< = = <
= < = =
= = < =
= = = <

it will result in NO parallelization, while other selections may allow parallelization.

Is it possible to derive a selection heuristic that provides optimal code?

Page 11:

Loop Selection

The problem of loop selection is NP-complete, so loop selection is best done by a heuristic: favor the selection of loops that must be sequentialized before parallelism can be uncovered.

Page 12:

Heuristic Loop Selection (Cont.)

Example of the principles involved in heuristic loop selection:

DO I = 2, N
  DO J = 2, M
    DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):

= < =
< = <
= < <
< = >

The I loop must be sequentialized because of the fourth dependence; the J loop must be sequentialized because of the first:

DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO

Page 13:

Loop Reversal

Using loop reversal to create coarse-grained parallelism. Consider:

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):

= < >
< = >

Reversing the K loop (DO K = L, 1, -1) turns the ">" entries in its column into "<":

= < <
< = <

Now K can be interchanged to the outermost position, where it carries all dependences, leaving I and J parallel:

DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  END PARALLEL DO
ENDDO

Page 14:

Loop Skewing

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K) and distance matrix:

= < =    0 1 0
< = =    1 0 0
= = <    0 0 1
= = =    0 0 0

Skewing with k = K + I + J yields:

DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO

whose direction matrix is:

= < <
< = <
= = <
= = =
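The effect of skewing on the dependence vectors can be checked mechanically. This Python sketch (hypothetical helpers) adds the I and J distances into the K column, using the distance vectors (0,1,0), (1,0,0), (0,0,1), (0,0,0) of this example:

```python
def skew_column(distances, target, by):
    """New distance rows after skewing column `target` by the columns `by`."""
    out = []
    for row in distances:
        row = list(row)
        row[target] = row[target] + sum(row[c] for c in by)
        out.append(row)
    return out

def directions(distances):
    """Map each distance to its direction: positive '<', negative '>', zero '='."""
    return [['<' if d > 0 else '>' if d < 0 else '=' for d in row]
            for row in distances]

dist = [(0, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0)]  # columns I, J, K
skewed = skew_column(dist, target=2, by=[0, 1])      # k = K + I + J
print(directions(skewed))
```

Every nonzero row now has "<" in the k column, so after interchange the k loop carries all the dependences.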

Page 15:

Loop Skewing - Main Benefits

Eliminates ">" signs in the matrix.

Transforms the skewed loop in such a way that, after outward interchange, it carries all dependences formerly carried by the loop with respect to which it is skewed.

Page 16:

Loop Skewing - Drawback

The resulting parallelism is usually unbalanced (the resulting loop executes a varying number of iterations each time). As we shall see, this is not really a problem for asynchronous parallelism (unlike vectorization).

Page 17:

Loop Skewing (Cont.)

Updated strategy:

1. Parallelize the outermost loop if possible.
2. Sequentialize at most one outer loop to find parallelism in the next loop.
3. If 1 and 2 fail, try skewing.
4. If 3 fails, sequentialize the loop that can be moved to the outermost position and covers the most other loops.

Page 18:

In Practice – Sometimes we get much worse execution times than we would have gotten by parallelizing fewer or different loops.

Page 19:

Profitability-Based Methods

A static performance estimation function is used. It need not be accurate, just good at selecting the better of two alternatives.

Key considerations: the cost of memory references, and sufficiency of granularity.

Page 20:

Profitability-Based Methods (Cont.)

It is impractical to choose from all possible code arrangements, so consider only a subset of them, based on properties of the cost function. In our case: consider only which loop is innermost.

Page 21:

Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

1. Subdivide all the references in the loop body into reference groups. Two references are in the same group if there is a loop-independent dependence between them, or a constant-distance loop-carried dependence between them.

Page 22:

Profitability-Based Methods (Cont.)

2. Determine whether subsequent accesses to the same reference are:

Loop-invariant: cost = 1
Unit stride: cost = number of iterations / cache line size
Non-unit stride: cost = number of iterations

Page 23:

Profitability-Based Methods (Cont.)

3. Compute the loop cost. For a candidate innermost loop (reconstructing the garbled formula from the example that follows):

loop_cost = (sum of the reference-group costs) x (aggregate number of iterations of the other loops)

Page 24:

Profitability-Based Methods: Example

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Page 25:

Profitability-Based Methods: Example

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Cost per reference and total loop cost for each candidate innermost loop (L = cache line size; Fortran arrays are column-major):

Innermost loop   C     A     B     Cost
I (best)         N/L   N/L   1     2N^3/L + N^2
J (worst)        N     1     N     2N^3 + N^2
K                1     N     N/L   N^3(1 + 1/L) + N^2
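A numeric sketch of this cost model for the matrix-multiply example (assumed sizes N and L; the per-reference classifications follow Fortran's column-major layout; all names are hypothetical):

```python
N, L = 512, 16   # assumed matrix size and cache-line length, in elements

# For each candidate innermost loop, each reference is loop-invariant (1),
# unit-stride (N/L), or non-unit-stride (N); the result is then scaled by
# the N*N iterations of the two outer loops.
per_iter_cost = {
    'I': (N / L) + (N / L) + 1,   # C unit stride, A unit stride, B invariant
    'J': N + 1 + N,               # C stride-N, A invariant, B stride-N
    'K': 1 + N + (N / L),         # C invariant, A stride-N, B unit stride
}
loop_cost = {name: N * N * c for name, c in per_iter_cost.items()}

order = sorted(loop_cost, key=loop_cost.get)  # cheapest innermost loop first
print(order)  # ['I', 'K', 'J']
```

The ordering by increasing cost, I then K then J, matches the reordering chosen on the next slide.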

Page 26:

Profitability-Based Methods: Example

Reorder the loops from innermost to outermost by increasing loop cost: I, K, J.

We can't always get the desired loop order (some permutations are illegal); try to find the legal permutation closest to the desired one.

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Page 27:

Profitability-Based Methods (Cont.)

Goal: given a desired loop order and a direction matrix for a loop nest, find the legal permutation closest to the desired one.

Method: until there are no more loops, choose, from all the loops that can be interchanged to the outermost position, the one that is outermost in the desired permutation; then drop that loop. It can be shown that if a legal permutation with the desired loop in the innermost position exists, this algorithm will find such a permutation.
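The greedy method above can be sketched as follows (a hypothetical helper; matrix rows are direction vectors over the original loop order, and `desired` lists loop names outermost-first):

```python
def closest_legal_permutation(matrix, desired, cols):
    """Build the nest outermost-first, preferring the desired order."""
    matrix = [list(row) for row in matrix]
    remaining = list(desired)
    perm = []
    while remaining:
        for name in remaining:            # try loops in desired order
            j = cols[name]
            # Legal outermost iff no dependence would get a leading '>'.
            if all(row[j] != '>' for row in matrix):
                perm.append(name)
                remaining.remove(name)
                # Rows carried by this loop no longer constrain inner loops.
                matrix = [row for row in matrix if row[j] != '<']
                break
        else:
            raise ValueError("no legal permutation from here")
    return perm

# Matrix multiply: the only dependences are the K-carried ones on C(I, J),
# so the desired order J, K outermost with I innermost is fully legal.
rows = [('=', '=', '<')]
print(closest_legal_permutation(rows, ['J', 'K', 'I'],
                                {'I': 0, 'J': 1, 'K': 2}))  # ['J', 'K', 'I']
```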

Page 28:

Profitability-Based Methods (Cont.)

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

For performance reasons the compiler may mark the inner loop as "not meant for parallelization": sequential execution of the innermost loop exploits locality in memory accesses.

Page 29:

Multilevel Loop Fusion

Commonly used for imperfect loop nests

Used after maximal loop distribution

Page 30:

Multilevel Loop Fusion

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

After maximal distribution and parallelization:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = B(I, J) + D
  ENDDO
END PARALLEL DO

After distribution each nest is better with a different outer loop – can't fuse!

Page 31:

Multilevel Loop Fusion (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
    C(I, J+1) = A(I, J) + C(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

After maximal distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

Which loop should be fused into the A loop?

(Fusion graph: A is parallel in i or j; B only in j; C only in i; D only in j.)

Page 32:

Multilevel Loop Fusion (Cont.)

Fusing the A loop with the B loop (2 barriers):

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

(Fusion graph: AB [j]; C [i]; D [j].)

Page 33:

Multilevel Loop Fusion (Cont.)

Fusing the A loop with the C loop instead (1 barrier):

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

Now we can also fuse B with D (both parallel in j):

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

(Fusion graph: AC [i]; BD [j].)

Page 34:

Multilevel Loop Fusion (Cont.)

Decision making needs look-ahead. Strategy: fuse with the loop that cannot be fused with one of its successors. Rationale: if a loop can't be fused with its successors, a barrier will be formed after it anyway.

(Fusion graph: A [i,j]; B [j]; C [i]; D [j] – a barrier is inevitable!)
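A toy encoding of this look-ahead rule (hypothetical representation: each nest maps to the set of loops it can run parallel in, with successor edges from the dependence graph):

```python
def pick_fusion_partner(node, types, succs):
    """Fuse `node` with a compatible successor that cannot itself fuse
    with any of its own successors (a barrier after it is inevitable)."""
    candidates = [s for s in succs[node] if types[node] & types[s]]
    for s in candidates:
        if all(not (types[s] & types[t]) for t in succs[s]):
            return s
    return candidates[0] if candidates else None

# The A/B/C/D example: A is parallel in i or j, B and D only in j, C only in i.
types = {'A': {'i', 'j'}, 'B': {'j'}, 'C': {'i'}, 'D': {'j'}}
succs = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(pick_fusion_partner('A', types, succs))  # 'C'
```

B can still fuse with D later, but C cannot fuse with D at all, so A is fused with C, exactly the choice that ends with a single barrier.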

Page 35:

Parallel Code Generation

Code generation scheme, Parallelize(l, D):

1. Try methods for perfect nests (loop interchange, loop skewing, loop reversal), and stop if parallelism is found.
2. If the nest can be distributed: distribute, recurse on the distributed nests, and merge.
3. Otherwise sequentialize the outer loop, eliminate the dependences it carries, and recurse on each of the loops nested in it.

Page 36:

Parallel Code Generation

procedure Parallelize(l, Dl);
  ParallelizeNest(l, success);   // try methods for perfect nests
  if ¬success then begin
    if l can be distributed then begin
      distribute l into loop nests l1, l2, …, ln;
      for i := 1 to n do
        Parallelize(li, Di);
      Merge({l1, l2, …, ln});
    end

Page 37:

Parallel Code Generation (Cont.)

    else begin   // l cannot be distributed
      for each outer loop l0 nested in l do begin
        let D0 be the set of dependences between statements in l0,
            less the dependences carried by l;
        Parallelize(l0, D0);
      end
      let S be the set of outer loops and statements left in l;
      if ||S|| > 1 then Merge(S);
    end
  end
end Parallelize

Page 38:

Parallel Code Generation (Cont.)

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

Both loops carry dependences, so loop interchange will not find sufficient parallelism. Try distribution:

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
ENDDO

DO J = 1, M
  DO I = 1, N
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

In the first nest the I loop can be parallelized; in the second, both loops can be:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N   !Left sequential for memory hierarchy
    X(I, J) = A(I, J) + C
  ENDDO
END PARALLEL DO

Now fusing: the first nest has type (I-loop, parallel), the second has type (J-loop, parallel). Different types – can't fuse.

Page 39:

Parallel Code Generation (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
    C(I, J+1) = A(I, J) + C(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

After distribution and parallelization the nests have types A: (J loop, parallel), B: (J loop, parallel), C: (I loop, parallel), D: (J loop, parallel):

PARALLEL DO J = 1, M
  DO I = 1, N   !Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

Merging fuses the A and B nests (both J-parallel):

PARALLEL DO J = 1, M
  DO I = 1, N   !Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

Page 40:

Erlebacher

DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
ENDDO

Each J loop can be parallelized in place (PARALLEL DO J = 1, JMAXD); the K loops stay sequential because of the F(I, J, K-1) recurrence.

Page 41:

Erlebacher

PARALLEL DO J = 1, JMAXD
L1:   DO I = 1, IMAXD
        F(I, J, 1) = F(I, J, 1) * B(1)
      ENDDO
L2:   DO K = 2, N-1
        DO I = 1, IMAXD
          F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
        ENDDO
      ENDDO
L3:   DO I = 1, IMAXD
        TOT(I, J) = 0.0
      ENDDO
L4:   DO I = 1, IMAXD
        TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
      ENDDO
L5:   DO K = 2, N-1
        DO I = 1, IMAXD
          TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
        ENDDO
      ENDDO
END PARALLEL DO

(Fusion graph over the nests L1–L5.)

Page 42:

Erlebacher

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO

Page 43:

Packaging of Parallelism

There is a trade-off between parallelism and granularity of synchronization: larger work units mean synchronization is needed less frequently, but at the cost of less parallelism and poorer load balance.

Page 44:

Strip Mining

Converts available parallelism into a form more suitable for the hardware:

DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO

k = CEIL(N / P)
PARALLEL DO I = 1, N, k
  DO i = I, MIN(I + k - 1, N)
    A(i) = A(i) + B(i)
  ENDDO
END PARALLEL DO

The value of P (the number of processors) is unknown until runtime, so strip mining is often handled by special hardware (Convex C2 and C3). Interruptions may be disastrous.
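The strip bounds computed by the transformed loop can be sketched as follows (a toy helper mirroring k = CEIL(N / P) and MIN(I + k - 1, N)):

```python
import math

def strips(n, p):
    """1-based (first, last) iteration bounds of each strip."""
    k = math.ceil(n / p)                              # k = CEIL(N / P)
    return [(i, min(i + k - 1, n)) for i in range(1, n + 1, k)]

print(strips(10, 4))  # [(1, 3), (4, 6), (7, 9), (10, 10)]
```

Note that when P does not divide N evenly, the last strip is shorter, which is exactly the MIN in the inner loop's upper bound.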

Page 45:

Strip Mining (Cont.)

What if the execution time varies among iterations?

PARALLEL DO I = 1, N
  DO J = 2, I
    A(J, I) = A(J-1, I) * 2.0
  ENDDO
END PARALLEL DO

(Figure: iteration I does I-1 units of work, so equal-sized strips carry unequal amounts of work.)

Solution: a smaller unit size allows a more balanced distribution.

Page 46:

Pipeline Parallelism

The Fortran DOACROSS construct pipelines parallel loop iterations with cross-iteration synchronization.

Useful where parallelization is not available; has high synchronization costs.

DOACROSS I = 2, N
S1:  A(I) = B(I) + C(I)
     POST(EV(I))
     IF (I > 2) WAIT(EV(I-1))
S2:  C(I) = A(I-1) + A(I)
ENDDO
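The POST/WAIT pipeline above can be simulated with one thread per iteration and a threading.Event per EV slot (toy sizes and data, not from the slides; the result is deterministic because S2 waits for the previous iteration's S1 to have written A(I-1)):

```python
import threading

N = 8                                    # assumed toy problem size
B = [float(i) for i in range(N + 1)]     # B(i) = i
C = [1.0] * (N + 1)
A = [0.0] * (N + 1)
ev = [threading.Event() for _ in range(N + 1)]

def body(i):
    A[i] = B[i] + C[i]                   # S1
    ev[i].set()                          # POST(EV(I))
    if i > 2:
        ev[i - 1].wait()                 # WAIT(EV(I-1))
    C[i] = A[i - 1] + A[i]               # S2: needs A(I-1) from iteration I-1

threads = [threading.Thread(target=body, args=(i,)) for i in range(2, N + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(C[2:])  # each C(I) = A(I-1) + A(I)
```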

Page 47:

Scheduling Parallel Work

(Figure: trade-off between load balance and little synchronization.)

Page 48:

Scheduling Parallel Work

With N iterations of duration B each, p processors, and a constant overhead of σ0 per processor, parallel execution is slower than serial execution when (reconstructing the garbled formula)

  N·B/p + σ0·p ≥ N·B

Bakery-counter scheduling: moderate synchronization overhead.

Page 49:

Guided Self-Scheduling

Incorporates some level of static scheduling to guide dynamic self-scheduling: groups of iterations are scheduled, going from large to small chunks of work.

The number of iterations dispensed at dispatch t follows (reconstructing the garbled formula): x_t = ⌈N_t / p⌉, where N_t is the number of iterations remaining, N_1 = N and N_{t+1} = N_t − x_t.

Page 50:

Guided Self-Scheduling (Cont.)

GSS (20 iterations, 4 processors) dispenses blocks of 5, 4, 3, 2, 2, 1, 1, 1, 1 iterations. Not completely balanced.

Required synchronizations: 9. With a bakery counter: 20.

Page 51:

Guided Self-Scheduling (Cont.)

In the example the last 4 allocations are for a single iteration each. Coincidence? The last p-1 blocks will always consist of a single iteration.

GSS(2): no block of iterations is smaller than 2.

GSS(k): no block is smaller than k, i.e. (reconstructing the garbled formula) x_t = max(k, ⌈N_t / p⌉).
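The dispatch sizes can be sketched with the commonly cited GSS rule, taking ⌈R/p⌉ of the R remaining iterations but never less than k (a hypothetical helper):

```python
import math

def gss_chunks(n, p, k=1):
    """Block sizes dispensed by GSS(k) for n iterations on p processors."""
    chunks, remaining = [], n
    while remaining > 0:
        x = min(remaining, max(k, math.ceil(remaining / p)))
        chunks.append(x)
        remaining -= x
    return chunks

print(gss_chunks(20, 4))       # [5, 4, 3, 2, 2, 1, 1, 1, 1] -- 9 dispatches
print(gss_chunks(20, 4, k=2))  # GSS(2): no block smaller than 2
```

For 20 iterations on 4 processors this gives 9 dispatches instead of the bakery counter's 20, with the tail of single-iteration blocks that GSS(k) is designed to avoid.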

Page 52:

Yaniv Carmeli
B.A. in CS

Thanks for your attention!