
Page 1: Compiling High Performance Fortran

Optimizing Compilers for Modern Architectures

Compiling High Performance Fortran

Allen and Kennedy, Chapter 14

Page 2: Compiling High Performance Fortran

Overview

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

Page 3: Compiling High Performance Fortran

Motivation for HPF

• Require “Message Passing” to communicate data between processors

• Approach 1: Use MPI calls in Fortran/C code

Scalable Distributed Memory Multiprocessor

Page 4: Compiling High Performance Fortran

MPI implementation

Motivation for HPF

• Consider the following sum reduction

PROGRAM SUM

REAL A(10000)

READ (9) A

SUM = 0.0

DO I = 1, 10000

SUM = SUM + A(I)

ENDDO

PRINT SUM

END

PROGRAM SUM

REAL A(100), BUFF(100)

IF (PID == 0) THEN

DO IP = 0, 99

READ (9) BUFF(1:100)

IF (IP == 0) THEN

A(1:100) = BUFF(1:100)

ELSE

SEND(IP,BUFF,100)

ENDIF

ENDDO

ELSE

RECV(0,A,100)

ENDIF

! actual sum reduction of the local A(1:100) into SUM here

IF (PID == 0) SEND(1,SUM,1)

IF (PID > 0) THEN

RECV(PID-1,T,1)

SUM = SUM + T

IF (PID < 99) THEN

SEND(PID+1,SUM,1)

ELSE

SEND(0,SUM,1)

ENDIF

ENDIF

IF (PID == 0) RECV(99,SUM,1)

IF (PID == 0) PRINT SUM

END
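As a cross-check, the chained reduction can be simulated sequentially (an illustrative Python stand-in for the 100 SPMD processes; not part of the Fortran source):

```python
# Sequential Python stand-in for the 100-process chained sum:
# process 0 scatters 100-element chunks, every process computes a
# local partial sum, and the running total is passed down the chain.
P = 100
A = [float(k) for k in range(1, 10001)]          # data read on process 0
chunks = [A[p * 100:(p + 1) * 100] for p in range(P)]
partial = [sum(c) for c in chunks]               # per-process local SUM

running = partial[0]                             # process 0 starts the chain
for pid in range(1, P):
    running += partial[pid]                      # RECV, add, SEND onward
# process 99 sends the total back to process 0, which prints it
assert running == sum(A) == 50005000.0
```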

Page 5: Compiling High Performance Fortran

Motivation for HPF

• Disadvantages of the MPI approach
—User has to rewrite the program in SPMD [Single Program Multiple Data] form
—User has to manage data movement [send & receive], data placement, and synchronization
—Too messy and not easy to master

Page 6: Compiling High Performance Fortran

Motivation for HPF

• Approach 2: Use HPF
—HPF is an extended version of Fortran 90
—HPF has Fortran 90 features and a few directives

• Directives
—Tell how data is laid out in processor memories in the parallel machine configuration. For example,
– !HPF$ DISTRIBUTE A(BLOCK)
—Assist in identifying parallelism. For example,
– !HPF$ INDEPENDENT

Page 7: Compiling High Performance Fortran

Motivation for HPF

• The same sum reduction code

PROGRAM SUM

REAL A(10000)

READ (9) A

SUM = 0.0

DO I = 1, 10000

SUM = SUM + A(I)

ENDDO

PRINT SUM

END

• When written in HPF...

PROGRAM SUM

REAL A(10000)

!HPF$ DISTRIBUTE A(BLOCK)

READ (9) A

SUM = 0.0

DO I = 1, 10000

SUM = SUM + A(I)

ENDDO

PRINT SUM

END

• Minimum modification

• Easy to write

• Now compiler has to do more work

Page 8: Compiling High Performance Fortran

Motivation for HPF

• Advantages of HPF
—User needs only to write some easy directives; need not write the whole program in SPMD form
—User does not need to manage data movement [send & receive] and synchronization
—Simple and easy to master

Page 9: Compiling High Performance Fortran

Overview

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

Page 10: Compiling High Performance Fortran

HPF Compilation Overview

• Dependence Analysis
—Used for communication analysis
—Fact used: no dependence carried by the I loop

• Running example:

• Running example:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

DO I = 2, 10000

S1: A(I) = B(I-1) + C

ENDDO

DO I = 1, 10000

S2: B(I) = A(I)

ENDDO

ENDDO

Page 11: Compiling High Performance Fortran

HPF Compilation Overview

• Running example:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

DO I = 2, 10000

S1: A(I) = B(I-1) + C

ENDDO

DO I = 1, 10000

S2: B(I) = A(I)

ENDDO

ENDDO

• Dependence Analysis

• Distribution Analysis

Page 12: Compiling High Performance Fortran

HPF Compilation Overview

• Running example:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

DO I = 2, 10000

S1: A(I) = B(I-1) + C

ENDDO

DO I = 1, 10000

S2: B(I) = A(I)

ENDDO

ENDDO

• Dependence Analysis

• Distribution Analysis

• Computation Partitioning
—Partition so as to distribute the work of the I loops

Page 13: Compiling High Performance Fortran

HPF Compilation Overview

REAL A(1:100), B(0:100)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

I1: IF (PID /= 99) SEND(PID+1,B(100),1)

I2: IF (PID /= 0) THEN

RECV(PID-1,B(0),1)

A(1) = B(0) + C

ENDIF

DO I = 2, 100

S1: A(I) = B(I-1)+C

ENDDO

DO I = 1, 100

S2: B(I) = A(I)

ENDDO

ENDDO

• Dependence Analysis

• Distribution Analysis

• Computation Partitioning

• Communication Analysis and placement
—Communication required for B(0) for each iteration
—Shadow region B(0)

Page 14: Compiling High Performance Fortran

HPF Compilation Overview

REAL A(1:100), B(0:100)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

I1: IF (PID /= 99) SEND(PID+1,B(100),1)

DO I = 2, 100

S1: A(I) = B(I-1)+C

ENDDO

I2: IF (PID /= 0) THEN

RECV(PID-1,B(0),1)

A(1) = B(0) + C

ENDIF

DO I = 1, 100

S2: B(I) = A(I)

ENDDO

ENDDO

• Dependence Analysis

• Distribution Analysis

• Computation Partitioning

• Communication Analysis and placement

• Optimization
—Aggregation
—Overlap communication and computation
—Recognition of reductions

Page 15: Compiling High Performance Fortran

Overview

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

Page 16: Compiling High Performance Fortran

Basic Loop Compilation

• Distribution Propagation and analysis
—Analyze what distribution holds for a given array at a given point in the program
—Difficult due to
– REALIGN and REDISTRIBUTE directives
– Distribution of formal parameters inherited from the calling procedure
—Use “Reaching Decompositions” data-flow analysis and its interprocedural version

Page 17: Compiling High Performance Fortran

Basic Loop Compilation

• For simplicity assume single distribution for an array at all points in a subprogram

• Define μA(i) = (ρA(i), δA(i)) = (p, j): global element i of A lives on processor p at local index j

• For example, suppose array A of size N is block-distributed over p processors

—Block size: BA = ceiling(N/p)

—ρA(i) = ceiling(i/BA) − 1

—δA(i) = ((i−1) mod BA) + 1
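These definitions can be checked directly (a Python sketch of the block-distribution maps; `math.ceil` plays the role of the ceiling function above):

```python
import math

# Block distribution of an N-element array over p processors:
# B_A = ceiling(N/p); rho(i) is the 0-based owning processor of global
# element i; delta(i) is its 1-based index within that processor's block.
def block_size(N, p):
    return math.ceil(N / p)

def rho(i, N, p):
    return math.ceil(i / block_size(N, p)) - 1

def delta(i, N, p):
    return (i - 1) % block_size(N, p) + 1

N, p = 10000, 100                      # block size B_A = 100
assert block_size(N, p) == 100
assert (rho(1, N, p), delta(1, N, p)) == (0, 1)
assert (rho(100, N, p), delta(100, N, p)) == (0, 100)
assert (rho(101, N, p), delta(101, N, p)) == (1, 1)
assert (rho(10000, N, p), delta(10000, N, p)) == (99, 100)
```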

Page 18: Compiling High Performance Fortran

Basic Loop Compilation

• Iteration Partitioning
—Dividing work among processors
– Computation partitioning
—Determine which iterations of a loop will be executed on which processor
—Owner-computes rule

REAL A(10000)

!HPF$ DISTRIBUTE A(BLOCK)

DO I = 1, 10000

A(I) = A(I) + C

ENDDO

• Iteration I is executed on owner of A(I)

• 100 processors: 1st 100 iterations on processor 0, the next 100 on processor 1 and so on

Page 19: Compiling High Performance Fortran

Iteration Partitioning

• Multiple statements in a loop in a recurrence: choose a partitioning reference A(α(I))

• The processor responsible for performing the computation of iteration I is θL(I) = ρA(α(I))

• The set of indices executed on processor p is

{I | 1 ≤ I ≤ N; θL(I) = p} = α−1(ρA−1({p})) ∩ [1..N]

Page 20: Compiling High Performance Fortran

Iteration Partitioning

• Have to map the global loop index to a local loop index:

ΔL : Global_Loop_Index → Local_Loop_Index

• The smallest value in α−1(ρA−1({p})) maps to 1

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO I = 1, N

A(I+1) = B(I) + C

ENDDO

Page 21: Compiling High Performance Fortran

Iteration Partitioning

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO I = 1, N

A(I+1) = B(I) + C

ENDDO

• Map the global iteration space I to the local iteration space i as follows:

ρA−1({p}) = [100p+1 : 100p+100]

α−1(ρA−1({p})) = [100p : 100p+99]

ΔL(I,p) = I − min(α−1(ρA−1({p}))) + 1 = I − 100p + 1

Page 22: Compiling High Performance Fortran

Iteration Partitioning

• Adjust array subscripts for local iterations: B(β(I)) → B(γ(i)), where

γ(i) = δB(β(ΔL−1(i,p)))

ΔL−1(i,PID) = i + 100·PID − 1

δB(k) = k − 100·PID

γ(i) = i + 100·PID − 1 − 100·PID = i − 1
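The index algebra above can be verified mechanically (a Python stand-in; block size 100, with α(I) = I+1 and β(I) = I taken from the running example):

```python
# For A(I+1) = B(I) + C with 100-element blocks, processor p executes
# global iterations [100p : 100p+99]; Delta_L maps them to local
# 1..100, and the adjusted local subscript of B is gamma(i) = i - 1.
BS = 100
def delta_L(I, p):      return I - BS * p + 1       # global -> local
def delta_L_inv(i, p):  return i + BS * p - 1       # local  -> global
def delta_B(k, p):      return k - BS * p           # local index of B(k) on p
def gamma(i, p):        return delta_B(delta_L_inv(i, p), p)

p = 7
owned_iters = range(BS * p, BS * p + BS)            # [100p : 100p+99]
assert [delta_L(I, p) for I in owned_iters] == list(range(1, BS + 1))
assert all(gamma(i, p) == i - 1 for i in range(1, BS + 1))
```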

Page 23: Compiling High Performance Fortran

Iteration Partitioning

• For interior processors the code becomes..

DO i = 1, 100

A(i) = B(i-1) + C

ENDDO

• Adjust lower bound for 1st processor and upper bound of last processor to take care of boundary conditions..

lo = 1

IF (PID==0) lo = 2

hi = 100

IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1

DO i = lo, hi

A(i) = B(i-1) + C

ENDDO
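One way to convince yourself the lo/hi adjustments are right is to check that, over all processors, the local iterations cover the global loop exactly once (a Python sketch; global I = i + 100·PID − 1 as derived earlier):

```python
import math

# Bounds from the code above: lo = 2 on processor 0, hi is trimmed on
# the last processor lastP = CEIL((N+1)/100) - 1; local iteration i on
# processor pid corresponds to global iteration I = i + 100*pid - 1.
BS = 100
def local_iters(pid, N):
    lastP = math.ceil((N + 1) / BS) - 1
    if pid > lastP:
        return range(0)                      # processor does nothing
    lo = 2 if pid == 0 else 1
    hi = (N % BS) + 1 if pid == lastP else BS
    return range(lo, hi + 1)

def covered(N):                              # all global iterations executed
    return sorted(i + BS * pid - 1
                  for pid in range(2 * BS) for i in local_iters(pid, N))

assert covered(10000) == list(range(1, 10001))
assert covered(250) == list(range(1, 251))   # N not a multiple of the block
```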

Page 24: Compiling High Performance Fortran

Communication Generation

• For our example no communication is required for iterations in

α−1(ρA−1({p})) ∩ β−1(ρB−1({p})) ∩ [1..N]

• Iterations which require receiving data are

(α−1(ρA−1({p})) − β−1(ρB−1({p}))) ∩ [1..N]

• Iterations which require sending data are

(β−1(ρB−1({p})) − α−1(ρA−1({p}))) ∩ [1..N]

Page 25: Compiling High Performance Fortran

Communication Generation

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

...

DO I = 1, N

A(I+1) = B(I) + C

ENDDO

• A receive is required only for the single iteration I = 100p

• A send is required only for the single iteration I = 100p+100

• No communication is required for iterations in [100p+1 : 100p+99]
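These three sets can be computed literally from the formulas on the previous page (a Python sketch; α(I) = I+1, β(I) = I, block size 100):

```python
# Receive/send/no-communication iteration sets on processor p for
# A(I+1) = B(I) + C, computed from the set formulas:
#   recv = (alpha^-1(rho_A^-1({p})) - beta^-1(rho_B^-1({p}))) ∩ [1..N]
BS, N = 100, 10000
def owned(p):                                # global elements owned by p
    return set(range(BS * p + 1, BS * p + BS + 1))

def alpha_inv(s): return {k - 1 for k in s}  # alpha(I) = I + 1
def beta_inv(s):  return set(s)              # beta(I)  = I

p = 3
iters   = set(range(1, N + 1))
recv    = (alpha_inv(owned(p)) - beta_inv(owned(p))) & iters
send    = (beta_inv(owned(p)) - alpha_inv(owned(p))) & iters
no_comm = alpha_inv(owned(p)) & beta_inv(owned(p)) & iters

assert recv == {BS * p}                      # the single iteration I = 100p
assert send == {BS * p + BS}                 # the single iteration I = 100p+100
assert no_comm == set(range(BS * p + 1, BS * p + BS))
```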

Page 26: Compiling High Performance Fortran

Communication Generation

• After inserting receive

lo = 1

IF (PID==0) lo = 2

hi = 100

IF (PID==CEIL((N+1)/100)-1)

hi = MOD(N,100) + 1

DO i = lo, hi

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

A(i) = B(i-1) + C

ENDDO

• Send must happen in the 101st iteration

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

DO i = lo, hi+1

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

IF (i <= hi) THEN

A(i) = B(i-1) + C

ENDIF

IF (i == hi+1 && PID /= lastP)

SEND(PID+1, B(100), 1)

ENDDO

Page 27: Compiling High Performance Fortran

Communication Generation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

DO i = lo, hi+1

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

IF (i <= hi) THEN

A(i) = B(i-1) + C

ENDIF

IF (i == hi+1 && PID /= lastP)

SEND(PID+1, B(100), 1)

ENDDO

• Move SEND outside the loop

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO i = lo, hi

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF

Page 28: Compiling High Performance Fortran

Communication Generation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO i = lo, hi

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF

• Move the receive outside the loop by loop peeling

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

! lo = MAX(lo,1+1) == 2

DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF

Page 29: Compiling High Performance Fortran

Communication Generation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

! lo = MAX(lo,1+1) == 2

DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

ENDIF

Page 30: Compiling High Performance Fortran

Communication Generation

• When is such rearrangement legal?

• Receive: copy from global to local location

• Send: copy local to global location

IF (PID <= lastP) THEN

S1: IF (lo == 1 && PID /= 0) THEN

B(0) = Bg(0) ! RECV

A(1) = B(0) + C

ENDIF

DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND

ENDIF

No chain of dependences from S1 to S2

Page 31: Compiling High Performance Fortran

Communication Generation

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK)

...

DO I = 1, N

A(I+1) = A(I) + C

ENDDO

Would be rewritten as ..

IF (PID <= lastP) THEN

S1: IF (lo == 1 && PID /= 0) THEN

A(0) = Ag(0) ! RECV

A(1) = A(0) + C

ENDIF

DO i = 2, hi

A(i) = A(i-1) + C

ENDDO

S2: IF (PID /= lastP)

Ag(100) = A(100) ! SEND

ENDIF

• Rearrangement would not be correct here: the recurrence on A creates a chain of dependences from S1 to S2
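The difference is easy to demonstrate concretely. In a small local picture (a Python stand-in with block size 4 and invented initial values), sending A(100) before the loop would ship the stale value instead of the one produced this sweep:

```python
# For the recurrence A(I+1) = A(I) + C, the value S2 sends to the next
# processor must be the A(100) computed *this* sweep. Hoisting the send
# above the loop (as was legal for B) would transmit the stale value.
BS, C = 4, 1.0
A0 = [10.0] + [0.0] * BS            # local A(0..BS); A(0) just received

def value_sent(send_before_loop):
    a = list(A0)
    stale = a[BS]                   # A(100) before the sweep
    a[1] = a[0] + C                 # peeled iteration
    for i in range(2, BS + 1):
        a[i] = a[i - 1] + C
    return stale if send_before_loop else a[BS]

assert value_sent(False) == 14.0    # freshly computed A(100)
assert value_sent(True) == 0.0      # stale value: rearrangement is wrong
```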

Page 32: Compiling High Performance Fortran

Overview

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

Page 33: Compiling High Performance Fortran

Communication Vectorization

REAL A(10000,100), B(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J) = B(I,J) + C

ENDDO

ENDDO

• Using Basic Loop compilation gives..

DO J = 1, M

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDIF

ENDDO

Page 34: Compiling High Performance Fortran

Communication Vectorization

DO J = 1, M

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDIF

ENDDO

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

ENDDO

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

ENDDO

DO J = 1, M

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDDO

ENDIF

• Distribute the J loop

Page 35: Compiling High Performance Fortran

Communication Vectorization

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

ENDDO

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

ENDDO

DO J = 1, M

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDDO

ENDIF

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (lo == 1) THEN

RECV (PID-1, B(0,1:M), M)

DO J = 1, M

A(1,J) = B(0,J) + C

ENDDO

ENDIF

DO J = 1, M

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100,1:M), M)

ENDIF

Page 36: Compiling High Performance Fortran

Communication Vectorization

DO J = 1, M

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

S1: IF (PID /= lastP)

Bg(100,J)=B(100,J)

IF (lo == 1) THEN

S2: B(0,J)=Bg(0,J)

S3: A(1,J) = B(0,J) + C

ENDIF

DO i = 2, hi

S4: A(i,J) = B(i-1,J) + C

ENDDO

ENDIF

ENDDO

• Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if the communication statements are not involved in a recurrence carried by the outer loop
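As a correctness check of the transformation (a two-processor Python simulation with block size 4 and invented data): computing A(i,J) = B(i-1,J) + C with one message per J iteration and with a single aggregated message yields the same A, since the communication is not in a J-carried recurrence:

```python
# Two-processor simulation of A(I+1,J) = B(I,J) + C with 4-row blocks:
# element-wise communication vs one vectorized transfer of B(BS, 1:M)
# produce identical results.
BS, M, C = 4, 3, 1.0
N = 2 * BS
B = [[float(i + 10 * j) for j in range(M)] for i in range(1, N + 1)]

def compute(vectorized):
    A = [[[0.0] * M for _ in range(BS + 1)] for _ in range(2)]
    shadow = [[0.0] * M for _ in range(2)]       # B(0, J) shadow rows
    if vectorized:                               # one M-element message
        shadow[1] = list(B[BS - 1])
    for J in range(M):
        if not vectorized:                       # one message per J
            shadow[1][J] = B[BS - 1][J]
        for p in range(2):
            for i in range(1, BS + 1):
                if p == 0 and i == 1:
                    continue                     # global A(1) is not written
                b = shadow[p][J] if i == 1 else B[p * BS + i - 2][J]
                A[p][i][J] = b + C
    return A

assert compute(False) == compute(True)
```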

Page 37: Compiling High Performance Fortran

Communication Vectorization

REAL A(10000,100), B(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J) = A(I,J) + B(I,J)

ENDDO

ENDDO

• Can sends be done before the receives?

• Can communication be vectorized?

REAL A(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J+1) = A(I,J) + C

ENDDO

ENDDO

• Can sends be done before the receives?

• Can communication be fully vectorized?

Page 38: Compiling High Performance Fortran

Overlapping Communication and Computation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

S0:IF (PID /= lastP)

SEND(PID+1, B(100), 1)

S1:IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

L1:DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

ENDIF

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

S0:IF (PID /= lastP)

SEND(PID+1, B(100), 1)

L1:DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

S1:IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

ENDIF

Page 39: Compiling High Performance Fortran

Pipelining

REAL A(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J) = A(I,J) + C

ENDDO

ENDDO

• Initial code generation for the I loop gives..

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, A(0,J), 1)

A(1,J) = A(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = A(i-1,J) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, A(100,J), 1)

ENDDO

ENDIF

• Can be vectorized, but gives up parallelism

Page 40: Compiling High Performance Fortran

Pipelining

• Pipelined parallelism with communication

Page 41: Compiling High Performance Fortran

Pipelining

• Pipelined parallelism with communication overhead

Page 42: Compiling High Performance Fortran

Pipelining: Blocking

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, A(0,J), 1)

A(1,J) = A(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = A(i-1,J) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, A(100,J), 1)

ENDDO

ENDIF

...

IF (PID <= lastP) THEN

DO J = 1, M, K

IF (lo == 1) THEN

RECV (PID-1, A(0,J:J+K-1), K)

DO j = J, J+K-1

A(1,j) = A(0,j) + C

ENDDO

ENDIF

DO j = J, J+K-1

DO i = 2, hi

A(i,j) = A(i-1,j) + C

ENDDO

ENDDO

IF (PID /= lastP)

SEND(PID+1, A(100,J:J+K-1),K)

ENDDO

ENDIF
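The blocking factor K trades message count against pipeline delay. Under a simple cost model (assumed here for illustration, not taken from the text): with P processors, M columns, per-element compute cost t_c, and per-message startup cost t_s, the pipeline runs P + ceil(M/K) − 1 stages of cost K·rows·t_c + t_s each, and an intermediate K beats both extremes:

```python
import math

# Assumed pipeline cost model for the blocked code: each of the
# P + ceil(M/K) - 1 pipeline stages costs K*rows*t_c (compute) plus
# t_s (message startup). Constants below are illustrative only.
def pipeline_time(P, M, K, rows, t_c, t_s):
    stages = P + math.ceil(M / K) - 1
    return stages * (K * rows * t_c + t_s)

P, M, rows = 100, 100, 100
t_c, t_s = 1e-8, 1e-4                  # per-element compute, per-message startup
times = {K: pipeline_time(P, M, K, rows, t_c, t_s) for K in (1, 10, 100)}
# K = 1: maximum overlap, but 199 message startups dominate;
# K = M: a single huge block per processor, no pipelining; K = 10 wins here.
assert times[10] < times[1] and times[10] < times[100]
```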

Page 43: Compiling High Performance Fortran

Other Optimizations

• Alignment and Replication

• Identification of common recurrences

• Storage Management
—Minimize temporary storage used for communication
—Space taken for temporary storage should be at most equal to the space taken by the arrays

• Interprocedural Optimizations

Page 44: Compiling High Performance Fortran

Results

Page 45: Compiling High Performance Fortran

Summary

• HPF is easy to code
—But hard to compile

• Steps required to compile HPF programs
—Basic loop compilation
– Communication generation
—Optimizations
– Communication vectorization
– Overlapping communication with computation
– Pipelining