High Performance on the J90 Systems
David Turner & Tom DeBoni
NERSC User Services Group
April 1999
13 April, 1999 High Performance on the J90 Systems 2
Philosophical Ramblings
Design for optimization?
Where to start?
When to stop?
J90 Potential
STREAM benchmark results
Sustainable memory bandwidth
(http://www.cs.virginia.edu/stream)
John McCalpin, SGI
Kernel  Operation         bytes/iter  FLOPS/iter
COPY    a(i)=b(i)         16          0
TRIAD   a(i)=b(i)+q*c(i)  24          2
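The two kernels above can be sketched as a small stand-alone program. This is an illustrative sketch only, not the official STREAM code: the array size, the use of SYSTEM_CLOCK for timing, and the program name are assumptions. It also shows how an MFLOPS figure can be derived from TRIAD bandwidth using the 24 bytes and 2 FLOPS per iteration listed above (MFLOPS = MB/s * 2 / 24).

```fortran
program stream_sketch
  ! Minimal sketch of the STREAM COPY and TRIAD kernels.
  ! Array size and timing method are illustrative assumptions,
  ! not the official benchmark settings.
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! 8-byte reals
  integer, parameter :: n = 2000000        ! assumed array length
  real(dp), allocatable :: a(:), b(:), c(:)
  real(dp) :: q, seconds, mbytes_per_s
  integer :: i, t0, t1, rate

  allocate(a(n), b(n), c(n))
  b = 1.0_dp
  c = 2.0_dp
  q = 3.0_dp

  ! COPY: one load + one store = 16 bytes, 0 FLOPS per iteration
  do i = 1, n
     a(i) = b(i)
  end do

  ! TRIAD: two loads + one store = 24 bytes, 2 FLOPS per iteration
  call system_clock(t0, rate)
  do i = 1, n
     a(i) = b(i) + q*c(i)
  end do
  call system_clock(t1)

  seconds = real(t1 - t0, dp) / real(rate, dp)
  if (seconds > 0.0_dp) then
     mbytes_per_s = 24.0_dp * real(n, dp) / seconds / 1.0e6_dp
     print *, 'TRIAD MB/s  :', mbytes_per_s
     ! 2 FLOPS for every 24 bytes moved
     print *, 'TRIAD MFLOPS:', mbytes_per_s * 2.0_dp / 24.0_dp
  end if

  ! Every element of a should now equal 1 + 3*2 = 7
  print *, 'check a(n) =', a(n)
end program stream_sketch
```

A single timed pass like this is noisy; the real benchmark repeats each kernel several times and reports the best result.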
STREAM Results
Machine        ncpus  COPY (MB/s)  TRIAD (MB/s)  MFLOPS
Cray_C90       16     105497.0     103812.0      8651.0
Cray_C90       8      55071.9      63229.6       5269.1
Cray_C90       1      6965.4       9500.7        791.7
Cray_J932      16     16298.2      14995.9       1249.7
Cray_J932      8      9995.2       8941.3        745.1
Cray_J932      1      1433.6       1270.0        105.8
Cray_T3E-900   16     7497.0       8828.0        735.7
Cray_T3E-900   8      3747.0       4471.0        372.6
Cray_T3E-900   1      484.0        568.0         47.3
SGI_Origin_2K  16     5560.0       5240.0        436.7
SGI_Origin_2K  8      2570.0       2740.0        228.3
SGI_Origin_2K  1      332.0        358.0         29.8
Sun_UE_10000   16     2371.0       2905.0        242.1
Sun_UE_10000   8      1271.0       1546.0        128.8
Sun_UE_10000   1      164.0        202.0         16.8
STREAM Results (cont.)
Machine COPY TRIAD MFLOPS
Cray_C90 6965.4 9500.7 791.7
Cray_J932 1433.6 1270.0 105.8
Compaq_AlphaServer_DS20 1077.0 1323.0 110.2
IBM_RS6000-397 778.8 882.4 73.5
Cray_T3E-900 484.0 568.0 47.3
SGI_Origin_2K 332.0 358.0 29.8
Generic_440BX_400 304.0 315.4 26.3
Sun_Ultra2-2200 228.5 189.9 25.9
Sun_UE_10000 164.0 202.0 16.8
Apple_Mac_G3_266 137.1 137.1 11.4
Tools
F90 (with lots of options)
ja
  ja
  ./name
  ja -cst -n name
hpm
prof
flowview
atexpert
Program “SLOW”
PROGRAM SLOW

  IMPLICIT NONE
  INTEGER, PARAMETER :: DIMSIZE=8000000
  REAL, DIMENSION(DIMSIZE) :: X, Y, Z
  INTEGER :: I, J

  X = RANF()
  Y = RANF()
  DO J = 1, 10
    DO I = 1, DIMSIZE
      Z(I)=LOG(SIN(X(I))**2+COS(Y(I))**4)
    END DO
    PRINT *, Z(DIMSIZE-1)
  ENDDO
  STOP
END PROGRAM SLOW
No Optimization
f90 -O0 -r6 -O,msgs,negmsgs -o slow slow.f90
x = RANF()
cf90-6204 f90:VECTOR SLOW,File = slow.f90, Line=8
A loop starting at line 8 was vectorized.
y = RANF()
cf90-6204 f90:VECTOR SLOW,File = slow.f90, Line=9
A loop starting at line 9 was vectorized.
Moderate Optimization
f90 -O1 -r6 -O,msgs,negmsgs -o slow slow.f90
do j = 1, 10
cf90-6286 f90:VECTOR SLOW,File = slow.f90,Line=10
A loop starting at line 10 was not vectorized because it contains input/output operations at line 14.
DO i = 1, DIMSIZE
cf90-6204 f90:VECTOR SLOW,File = slow.f90,Line=11
A loop starting at line 11 was vectorized.
z(i) = LOG(SIN(x(i))**2 + COS(y(i))**4)
cf90-6001 f90:SCALAR SLOW,File=slow.f90,Line=12
An exponentiation was replaced by optimization. This may cause numerical differences.
High Optimization
f90 -O3 -r6 -O,msgs,negmsgs -o slow slow.f90
cf90-6502 f90:TASKING SLOW,File=slow.f90,Line=10
A loop starting at line 10 was not tasked because it contains input/output operations at line 14.
cf90-6403 f90:TASKING SLOW,File=slow.f90,Line=11
A loop starting at line 11 was tasked.
Optimization Results
Opt  NCPUS  Elapsed   User      Sys
0    -      768.7530  583.6793  7.1886
1    -      89.0162   82.1009   1.1936
2    -      104.7003  81.5687   1.0003
3    1      107.0177  81.6185   1.2994
3    2      44.6562   81.7050   1.4069
3    3      41.3401   81.5320   1.3099
3    4      24.8146   81.8099   1.2968
2 CPU Speedup
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 5.4300 = 5.4300
2 * 38.1300 = 76.2600
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
--------------- -------------- -----------
1.88 * 43.5600 = 81.6900
3 CPU Speedup
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 9.2200 = 9.2200
2 * 13.5500 = 27.1000
3 * 15.0700 = 45.2100
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
--------------- -------------- -----------
2.15 * 37.8400 = 81.5300
4 CPU Speedup
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 2.0400 = 2.0400
2 * 1.7700 = 3.5400
3 * 5.3200 = 15.9600
4 * 15.0700 = 60.2800
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
--------------- -------------- ----------
3.38 * 24.2000 = 81.8200
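The "Avg." column in the three speedup summaries above is simply total CPU seconds divided by total connect seconds. The following sketch recomputes it from the totals shown on the slides; the program and variable names are illustrative, and the "fraction of ideal" column is an added convenience, not something the accounting report prints.

```fortran
program concurrency_check
  ! Recompute average concurrent CPUs from the totals on the
  ! 2-, 3-, and 4-CPU speedup slides.
  implicit none
  integer :: ncpus(3), k
  real :: cpu(3), connect(3), avg

  ncpus   = (/ 2, 3, 4 /)
  cpu     = (/ 81.69, 81.53, 81.82 /)   ! total CPU seconds
  connect = (/ 43.56, 37.84, 24.20 /)   ! total connect seconds

  do k = 1, 3
     avg = cpu(k) / connect(k)
     print '(a,i1,a,f5.2,a,f6.1,a)', 'NCPUS=', ncpus(k), &
          '  avg concurrent CPUs =', avg, &
          '  (', 100.0*avg/real(ncpus(k)), '% of ideal)'
  end do
end program concurrency_check
```

The averages come out as 1.88, 2.15, and 3.38, matching the slides. Dividing by NCPUS shows that the 3-CPU run made noticeably poorer use of its processors than the 2- and 4-CPU runs.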
Useful F90 Options
-e (0 or i) - initializes storage or flags use of uninitialized vars
-e n - flags nonstandard Fortran usage
-e v - make all variables static
-g - same as -G0
-G (0 or 1) - sets debugging level to statement or block
-m (0 - 4) - message verbosity (0 gives most output)
-N (72, 80, or 132) - source line length
-O - Optimization levels
  0,1,2,3, aggress, fastint, msgs, negmsgs, inline(0-3), scalar(0-3), task(0-3), vector(0-3)
-r (0-6, …) - listing levels (6 is EVERYthing)
-R (a, b, c) - runtime checking: args, array bounds, indexing
Using flowtrace/flowview
f90 -O1 -ef -o slow slow.f90
./slow
flowview -Luch > slow.flow
Routine Tot Time Percentage Accum%
------------ -------- ---------- -------
SUB2 5.66E+01 69.02 69.02
SUB1 2.43E+01 29.63 98.65
SLOW 1.11E+00 1.35 100.00
Using prof
f90 -O1 -l prof -o slow slow.f90
./slow
prof -x ./slow > slow.prof
profview slow.prof
profview Output
Optimization Strategies
• First, let the compiler do it
• Vectorize and scalar optimize, then parallelize
  • Vectorization can give you a factor of 10 speedup
  • Scalar optimization can improve performance by 10-50%
  • Parallelism will give you at most a linear speedup
  • Memory contention inhibits gains from parallelism
• Let the compiler advise you
• Add directives where appropriate
  • Be sure you tell the truth
  • Check your answers
Scalar Optimization
Subroutine or function inlining
Fast (32-bit) integers
-Oallfastint
-Ofastint
Use INTERFACE specifications if passing array sections
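As a sketch of the last point: with an explicit INTERFACE and an assumed-shape dummy argument, a strided array section can be described to the callee directly, whereas an explicit-shape dummy generally forces the compiler to copy the section into a contiguous temporary and back. That is presumably the motivation behind the advice. The routine name scale_section and the sizes are hypothetical.

```fortran
program interface_sketch
  implicit none
  ! Explicit interface: required for the assumed-shape dummy v(:)
  interface
     subroutine scale_section(v, factor)
       real, intent(inout) :: v(:)
       real, intent(in)    :: factor
     end subroutine scale_section
  end interface
  real :: a(10)
  integer :: i

  a = (/ (real(i), i = 1, 10) /)
  ! Pass a strided section: elements 1, 3, 5, 7, 9
  call scale_section(a(1:10:2), 2.0)
  print *, a
end program interface_sketch

! Hypothetical helper: scales whatever section it is given
subroutine scale_section(v, factor)
  implicit none
  real, intent(inout) :: v(:)
  real, intent(in)    :: factor
  v = v * factor
end subroutine scale_section
```

After the call, the odd-numbered elements of a are doubled and the even-numbered ones are unchanged.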
Vectorization
Inhibitors to Vectorization
Function or subroutine references
  Inline
  Push loop
  Split loop
Backwards data dependencies
  Reorder loop, use temporary vector
I/O statements
Character or bit manipulations
Branches into loop or backward out of loop
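To illustrate the "reorder loop" remedy for a backward dependency, here is a small sketch with made-up array names and sizes (it does not show the temporary-vector variant). In the first loop, statement 1 reads B(I-1), which statement 2 wrote one iteration earlier, so running all of statement 1 as a vector operation before statement 2 would read stale values. Swapping the two statements turns this into a forward dependency without changing the results.

```fortran
program reorder_sketch
  implicit none
  integer, parameter :: n = 8
  real :: a1(n), b1(n), a2(n), b2(n), c(n)
  integer :: i

  c = (/ (real(i), i = 1, n) /)
  a1 = 0.0; b1 = -1.0
  a2 = 0.0; b2 = -1.0

  ! Original order: backward dependency on B
  do i = 2, n
     a1(i) = b1(i-1)
     b1(i) = c(i)
  end do

  ! Reordered: B(I) is written before any later statement reads it
  do i = 2, n
     b2(i) = c(i)
     a2(i) = b2(i-1)
  end do

  ! Both orderings must produce identical arrays
  print *, 'same results: ', all(a1 == a2) .and. all(b1 == b2)
end program reorder_sketch
```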
Nonvectorizable Code
DO I = 1, N
CALL CALC(X(I), Y(I), Z(I))
ENDDO
...
SUBROUTINE CALC(X, Y, Z)
Z = ALOG(SQRT((SIN(X) * COS(Y)) ** X))
RETURN
END
Inlining
DO I = 1, N
Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
ENDDO
Pushing
CALL CALC(X, Y, Z, N)
...
SUBROUTINE CALC(X, Y, Z, N)
DIMENSION X(N), Y(N), Z(N)
DO I = 1, N
Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
ENDDO
RETURN
END
Splitting
DO I = 1, N
A(I) = ABS(CALC(C(I)))
B(I) = A(I) ** T * SQRT(C(I))
A(I) = SIN(ALOG(C(I)))
ENDDO
Splitting (cont.)
EXTERNAL CALC
DO I = 1, N
A(I) = ABS(CALC(C(I)))
ENDDO
DO I = 1, N
B(I) = A(I) ** T * SQRT(C(I))
A(I) = SIN(ALOG(C(I)))
ENDDO
Scalar Recurrence
DIMENSION A(1000), C(1000)
DO J = 1, M
S = BB
DO I = 1, N
S = S * C(I)
A(I) = A(I) + S
ENDDO
ENDDO
<cf90-8135,Scalar,Line=7> Loop starting at line 7 was unrolled 16 times.
Scalar Recurrence (cont.)
DIMENSION A(1000), C(1000), S(1000)
DO I = 1, M
  S(I) = BB
ENDDO
DO I = 1, N
  DO J = 1, M
    S(J) = S(J) * C(I)
    A(I) = A(I) + S(J)
  ENDDO
ENDDO
Loop starting at line 5 was unrolled 2 times.
A loop starting at line 5 was vectorized.
A loop starting at line 9 was vectorized.
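The rewrite works by expanding the scalar S into a vector S(J) and interchanging the loops, so that the inner loop no longer carries a recurrence. A quick way to "check your answers" is to run both forms on small data and compare. The sketch below does that; the sizes and initial values are arbitrary assumptions.

```fortran
program recurrence_check
  implicit none
  integer, parameter :: n = 6, m = 4
  real, parameter :: bb = 1.5
  real :: a1(n), a2(n), c(n), s(m), s0
  integer :: i, j

  c = (/ (1.0 + 0.1*real(i), i = 1, n) /)
  a1 = 0.0
  a2 = 0.0

  ! Original form: scalar recurrence on s0 in the inner loop
  do j = 1, m
     s0 = bb
     do i = 1, n
        s0 = s0 * c(i)
        a1(i) = a1(i) + s0
     end do
  end do

  ! Rewritten form: s expanded to a vector, loops interchanged
  s = bb
  do i = 1, n
     do j = 1, m
        s(j) = s(j) * c(i)
        a2(i) = a2(i) + s(j)
     end do
  end do

  ! Expect zero, or at worst a rounding-level difference
  print *, 'max difference:', maxval(abs(a1 - a2))
end program recurrence_check
```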
Compiler Vector Directives
CDIR$ directive
!DIR$ directive
VECTOR, NOVECTOR
Turn vectorization on or off until end of program unit.
IVDEP
Ignore vector dependencies in next loop.
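A sketch of where IVDEP might be appropriate: a loop with indirect addressing, where the compiler cannot prove that no index repeats but the programmer knows it. The array names and the permutation used here are assumptions for illustration. Compilers that do not recognize the !DIR$ form treat it as an ordinary comment.

```fortran
program ivdep_sketch
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n)
  integer :: ix(n), i

  ! ix is a permutation (here, a reversal): no index occurs twice,
  ! so the loop iterations really are independent
  ix = (/ (n + 1 - i, i = 1, n) /)
  a = 1.0
  b = (/ (real(i), i = 1, n) /)

  ! Promise the compiler that there are no vector dependencies
!DIR$ IVDEP
  do i = 1, n
     a(ix(i)) = a(ix(i)) + b(i)
  end do

  print *, a
end program ivdep_sketch
```

If ix contained repeated values, the directive would be a false promise and a vectorized loop could silently drop updates.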
Parallel Computing
Multitasking, microtasking, autotasking, parallel processing, multiprocessing, etc.
This is “fine-grained” parallelism
parallelism mostly comes from loop slicing
One possible goal: parallelize outer loop(s),
vectorize inner loop(s)
F90 is capable of autotasking, but it can always
benefit from help
Parallelism
Parallelism, cont.
Data “Scoping”
DIMENSION A(N)
SUM = 0.0
DO I = 1, N
TEMP = DEEP_THOUGHT(A,I)
SUM = SUM + TEMP * A(I)
ENDDO
A, N Shared, read-only everywhere
I, TEMP Private, read-write everywhere
SUM Shared, read-write everywhere
Compiler Tasking Directives
DIMENSION A(N)
SUM = 0.0
!MIC$ DOALL SHARED(A,N),PRIVATE(I,TEMP)
DO I = 1, N
TEMP = DEEP_THOUGHT(A,I) * A(I)
!MIC$ GUARD
SUM = SUM + TEMP
!MIC$ ENDGUARD
ENDDO
Threshold Test
DIMENSION A(N)
SUM = 0.0
!MIC$ DOALL VECTOR
!MIC$ IF(N.GT.1000)
!MIC$ SHARED(A,N),PRIVATE(I,TEMP)
DO I = 1, N
TEMP = DEEP_THOUGHT(A,I)
!MIC$ GUARD
SUM = SUM + TEMP * A(I)
!MIC$ ENDGUARD
ENDDO
Helping F90 with Parallelism
DIMENSION A(N), SUM(NumTasks)
!MIC$ DOALL SHARED(A,N),PRIVATE(J,I,TEMP)
DO J = 1, NumTasks
SUM(J) = 0.0
!MIC$ CNCALL
DO I = 1, N
SUM(J) = SUM(J) + DEEP_THOUGHT(A,I,J) * A(I)
ENDDO
ENDDO
TSUM = 0.0
DO J = 1, NumTasks
TSUM = TSUM + SUM(J)
ENDDO
Helping F90 with Directives
• Useful compiler directives for tasking
  • CASE, ENDCASE
  • CNCALL
  • DOALL
  • DOPARALLEL, ENDDO
  • GUARD, ENDGUARD
  • MAXCPUS
  • NUMCPUS
  • PERMUTATION
  • PARALLEL, ENDPARALLEL
• These all begin with !MIC$
• NOTE: There are also OpenMP directives...
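For comparison, the guarded summation loop from the earlier slides might look roughly like this with OpenMP directives. This is a sketch, not taken from the slides: the deep_thought function here is a hypothetical stand-in, the accumulator is named total to avoid shadowing the SUM intrinsic, and a REDUCTION clause is used in place of the GUARD/ENDGUARD critical section. Built without OpenMP support, the !$OMP lines are plain comments and the loop runs serially.

```fortran
program omp_sum_sketch
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), total, temp
  integer :: i

  a = 1.0
  total = 0.0

  ! Each thread accumulates a private partial sum; OpenMP combines
  ! the partial sums when the loop ends
!$OMP PARALLEL DO SHARED(a) PRIVATE(i, temp) REDUCTION(+:total)
  do i = 1, n
     temp = deep_thought(a, i) * a(i)
     total = total + temp
  end do
!$OMP END PARALLEL DO

  print *, 'total =', total

contains

  ! Hypothetical stand-in for the slides' DEEP_THOUGHT routine
  pure function deep_thought(v, k) result(r)
    real, intent(in)    :: v(:)
    integer, intent(in) :: k
    real :: r
    r = 2.0 * v(k)
  end function deep_thought

end program omp_sum_sketch
```

With the stand-in function and all elements of a set to 1.0, the total is 2000.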
Helping F90 with Directives, cont.
Directive Parameters
AUTOSCOPE
IF
MAXCPUS
PRIVATE
SAVELAST
SHARED
Directive Work Distribution
CHUNKSIZE
GUIDED
NCPUS_CHUNKS
NUMCHUNKS
SINGLE
VECTOR
These all augment !MIC$ directives
NOTE: There are also OpenMP directive parameters...
atexpert
f90 -eX -O3 -r6 -o slow slow.f90
setenv NCPUS 1
./slow
atexpert
atexpert Output
atexpert Output, cont.
atexpert Output, cont.