Applying Automated Memory Analysis to Improve the Iterative Solver in the Parallel Ocean Program
John M. Dennis: [email protected]
Elizabeth R. Jessup: [email protected]
April 5, 2006
April 5, 2006 Petascale Computation for the Geosciences Workshop
Motivation
- Outgrowth of PhD thesis on memory-efficient iterative solvers:
  - Data movement is expensive
  - Developed techniques to improve memory efficiency
- Apply Automated Memory Analysis to POP
- Parallel Ocean Program (POP) solver:
  - Large % of total time
  - Scalability issues
Outline:
- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions
Automated Memory Analysis?
- Analyzes an algorithm written in Matlab
- Predicts the data movement if the algorithm were written in C/C++ or Fortran -> the minimum required
- Predictions allow you to:
  - Evaluate design choices
  - Guide performance tuning
POP using 20x24 blocks (gx1v3)
- POP data structure:
  - Flexible block structure
  - Land 'block' elimination
- Small blocks:
  - Better load balance and land-block elimination
  - Larger halo overhead
- Larger blocks:
  - Smaller halo overhead
  - Load imbalance
  - No land-block elimination
- Grid resolutions: test (128x192), gx1v3 (320x384)
Alternate Data Structure
- 2D data structure
  - Advantages:
    - Regular stride-1 access
    - Compact form of the stencil operator
  - Disadvantages:
    - Includes land points
    - Problem-specific data structure
- 1D data structure
  - Advantages:
    - No more land points
    - General data structure
  - Disadvantages:
    - Indirect addressing
    - Larger stencil operator
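The trade-off between the two layouts can be sketched in NumPy (illustrative only: POP itself is Fortran, and the names `stencil_2d`/`stencil_1d`, the mask, and the neighbor table are invented here). The 2D form gets regular stride-1 access but computes land points it then throws away; the 1D form stores only ocean points at the cost of an indirect neighbor lookup.

```python
import numpy as np

# 2D form: apply the 5-point stencil everywhere, then mask out land.
def stencil_2d(x, mask):
    y = np.zeros_like(x)
    y[1:-1, 1:-1] = (4 * x[1:-1, 1:-1]
                     - x[:-2, 1:-1] - x[2:, 1:-1]
                     - x[1:-1, :-2] - x[1:-1, 2:])
    return y * mask  # land points are computed, then discarded

# 1D form: store only ocean points; find neighbors by indirect addressing.
def stencil_1d(x1d, nbr):
    # x1d: ocean-point values followed by one padding zero;
    # nbr[k]: the four neighbor indices of ocean point k
    # (neighbors that fall on land point at the padding entry).
    npts = nbr.shape[0]
    return 4 * x1d[:npts] - x1d[nbr].sum(axis=1)
```

Both forms produce identical results on the ocean points; the difference is purely in which memory is touched.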
Outline: Motivation, Background, Data movement, Serial Performance, Parallel Performance, Space-Filling Curves, Conclusions
Data movement
- Working-set load size (WSL): data moved from main memory (MM) to L1 cache
- Measured using PAPI (WSLM)
- Compute platforms:
  - Sun Ultra II (400 MHz)
  - IBM POWER4 (1.3 GHz)
  - SGI R14K (500 MHz)
- Compare with prediction (WSLP)
Predicting Data Movement

  solver w/2D (Matlab):  4902 Kbytes
  solver w/1D (Matlab):  3218 Kbytes

-> The predicted WSLP shows a 34% reduction in data movement for the 1D data structure.
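As a quick sanity check on the quoted figure, the 34% follows directly from the two predicted working-set sizes (a trivial sketch; the variable names are mine):

```python
wsl_2d_kb = 4902  # predicted working-set load, 2D data structure
wsl_1d_kb = 3218  # predicted working-set load, 1D data structure
reduction = 1 - wsl_1d_kb / wsl_2d_kb
print(f"{reduction:.0%}")  # → 34%
```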
Measured versus Predicted data movement

  Solver       WSLP   Ultra II       POWER4         R14K
                      WSLM    err    WSLM    err    WSLM    err
  PCG2+2D v1   4902   5163    5%     5068    3%     5728    17%
  PCG2+2D v2   4902   4905    0%     4865    -1%    4854    -1%
  PCG2+1D      3218   3164    -2%    3335    4%     3473    8%
Measured versus Predicted data movement

  Solver       WSLP   Ultra II       POWER4         R14K
                      WSLM    err    WSLM    err    WSLM    err
  PCG2+2D v1   4902   5163    5%     5068    3%     5728    17%
  PCG2+2D v2   4902   4905    0%     4865    -1%    4854    -1%
  PCG2+1D      3218   3164    -2%    3335    4%     3473    8%

-> PCG2+2D v1 shows excessive data movement
Two blocks of source code

PCG2+2D v1:

   do i=1,nblocks
      p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
      q(:,:,i) = A*p(:,:,i)
      w0(:,:,i) = q(:,:,i)*p(:,:,i)
   enddo
   delta = gsum(w0,lmask)

PCG2+2D v2:

   ldelta = 0
   do i=1,nblocks
      p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
      q(:,:,i) = A*p(:,:,i)
      w0 = q(:,:,i)*p(:,:,i)
      ldelta = ldelta + lsum(w0,lmask)
   enddo
   delta = gsum(ldelta)

In v1 the full w0 array is accessed again after the loop; in v2 the partial sum is accumulated inside the loop, so the extra access of w0 is eliminated.
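The same transformation can be illustrated in NumPy (a sketch, not POP's actual Fortran): fusing the reduction into the block loop removes the second pass over `w0`, so the product never has to round-trip through main memory.

```python
import numpy as np

def dot_v1(q, p):
    # v1: store the full elementwise product, then reduce it in a second pass
    w0 = np.empty_like(q)
    for i in range(q.shape[0]):   # loop over blocks
        w0[i] = q[i] * p[i]       # writes all of w0 out to memory
    return w0.sum()               # reads all of w0 back in

def dot_v2(q, p):
    # v2: reduce each block's product inside the loop; w0 stays a
    # block-sized temporary that never leaves cache
    delta = 0.0
    for i in range(q.shape[0]):
        delta += (q[i] * p[i]).sum()
    return delta
```

Both return the same inner product; only the data movement differs.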
Measured versus Predicted data movement

  Solver       WSLP   Ultra II       POWER4         R14K
                      WSLM    err    WSLM    err    WSLM    err
  PCG2+2D v1   4902   5163    5%     5068    3%     5728    17%
  PCG2+2D v2   4902   4905    0%     4865    -1%    4854    -1%
  PCG2+1D      3218   3164    -2%    3335    4%     3473    8%

-> After the fix, measured data movement matches the prediction!
Outline: Motivation, Background, Data movement, Serial Performance, Parallel Performance, Space-Filling Curves, Conclusions
Using 1D data structures in POP2 solver (serial)
- Replace solvers.F90
- Measure execution time on cache-based microprocessors
- Examine two CG algorithms with diagonal preconditioning:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: test [128x192 grid points] with 16x16 blocks
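For reference, a minimal sketch of diagonally preconditioned CG with the two inner products that give PCG2 its name (this is the textbook algorithm, not POP's solvers.F90; PCG1 restructures it so that only one inner product, and hence one global reduction, is needed per iteration):

```python
import numpy as np

def pcg2(A, b, tol=1e-10, maxit=500):
    """Preconditioned CG with a diagonal (Jacobi) preconditioner.

    The standard formulation needs two inner products per iteration
    (p.q and r.z) -- the 'PCG2' variant named on the slide."""
    Minv = 1.0 / np.diag(A)          # diagonal preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        q = A @ p
        alpha = rz / (p @ q)         # inner product #1: p.q
        x += alpha * p
        r -= alpha * q
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z               # inner product #2: r.z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

In the parallel setting each inner product is a global sum, which is why reducing their count matters at scale.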
Serial execution time on IBM POWER4 (test)

[Figure: bar chart of seconds for 20 timesteps on the POWER4 (1.3 GHz), comparing PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D; y-axis 0-6 seconds]

-> 56% reduction in cost/iteration
Outline: Motivation, Background, Data movement, Serial Performance, Parallel Performance, Space-Filling Curves, Conclusions
Using 1D data structure in POP2 solver (parallel)
- New parallel halo update
- Examine several CG algorithms with diagonal preconditioning:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology: Hypre (LLNL)
  - http://www.llnl.gov/CASC/linear_solvers
  - PCG solver
  - Preconditioners: diagonal
  - Hypre integration -> work in progress
Solver execution time for POP2 (20x24) on BG/L (gx1v3)

[Figure: bar chart of seconds for 200 timesteps on 64 processors, comparing PCG2+2D, PCG1+2D, PCG2+1D, PCG1+1D, and Hypre (PCG+Diag); y-axis 0-40 seconds]

-> 48% cost/iteration
-> 27% cost/iteration
64 processors != PetaScale
Outline: Motivation, Background, Data movement, Serial Performance, Parallel Performance, Space-Filling Curves, Conclusions
0.1 degree POP0.1 degree POP
Global eddy-resolving Computational grid:
3600 x 2400 x 40Land creates problems:
load imbalancesscalability
Alternative partitioning algorithm:Space-filling curves
Evaluate using Benchmark:1 day/ Internal grid / 7 minute timestep
Global eddy-resolving Computational grid:
3600 x 2400 x 40Land creates problems:
load imbalancesscalability
Alternative partitioning algorithm:Space-filling curves
Evaluate using Benchmark:1 day/ Internal grid / 7 minute timestep
April 5, 2006 Petascale Computation for the Geosciences Workshop
23
Partitioning with Space-filling Curves
- Map 2D -> 1D
- Curves come in a variety of sizes Nb:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p) [new]
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [new]
- Partitioning then reduces to splitting a 1D array
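A sketch of the Hilbert 2D -> 1D mapping and the resulting 1D partitioning (this is the standard distance-to-coordinate recurrence; the deck's Peano, Cinco, and hybrid curves are not shown, and the land-block set below is invented for illustration):

```python
import numpy as np

def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2^order x 2^order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Walk the 2D block grid in curve order, drop land blocks, then cut the
# remaining 1D sequence into contiguous, nearly equal pieces per processor.
blocks = [hilbert_d2xy(2, d) for d in range(16)]   # 4x4 block grid (Nb = 2^2)
land = {(0, 0), (3, 3)}                            # pretend land blocks
ocean = [xy for xy in blocks if xy not in land]
parts = np.array_split(np.arange(len(ocean)), 3)   # 3 contiguous chunks
```

Because consecutive curve positions are adjacent in 2D, contiguous 1D chunks stay spatially compact, which keeps halo-exchange partners local.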
Partitioning with SFC

[Figure: partition for 3 processors]
[Figure: POP using 20x24 blocks (gx1v3)]

[Figure: POP (gx1v3) + space-filling curve]

[Figure: Space-filling curve (Hilbert Nb=2^4)]

[Figure: Remove land blocks]

[Figure: Space-filling curve partition for 8 processors]
[Figure: POP 0.1 degree benchmark on Blue Gene/L]

[Figure: POP 0.1 degree benchmark (courtesy of Y. Yoshida, M. Taylor, P. Worley)]
32
ConclusionsConclusions
1D data structures in Barotropic SolverNo more land pointsReduces execution time vs 2D data structure
48% reduction in Solver time! (64 procs BG/L) 9.5% reduction in Total time! (64 procs POWER4)
Allows use of solver/preconditioner packagesImplementation quality critical!
Automated Memory Analysis (SLAMM)Evaluate design choicesGuide performance tuning
1D data structures in Barotropic SolverNo more land pointsReduces execution time vs 2D data structure
48% reduction in Solver time! (64 procs BG/L) 9.5% reduction in Total time! (64 procs POWER4)
Allows use of solver/preconditioner packagesImplementation quality critical!
Automated Memory Analysis (SLAMM)Evaluate design choicesGuide performance tuning
April 5, 2006 Petascale Computation for the Geosciences Workshop
Conclusions (cont.)
- Good scalability to 32K processors on BG/L
- Increased simulation rate by 2x on 32K processors:
  - SFC partitioning
  - 1D data structure in the solver
  - Modified 7 source files
- Future work:
  - Improve scalability (55% efficiency from 1K to 32K processors)
    - Better preconditioners
  - Improve load balance:
    - Different block sizes
    - Improved partitioning algorithm
Acknowledgements/Questions?
- Thanks to: F. Bryan (NCAR), J. Edwards (IBM), P. Jones (LANL), K. Lindsay (NCAR), M. Taylor (SNL), H. Tufo (NCAR), W. Waite (CU), S. Weese (NCAR)
- Blue Gene/L time:
  - NSF MRI Grant
  - NCAR
  - University of Colorado
  - IBM (SUR) program
  - BGW Consortium Days
  - IBM Research (Watson)
Serial execution time on multiple platforms (test)

[Figure: bar chart of seconds for 20 timesteps of PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D on IBM POWER4 (1.3 GHz), IBM POWER5 (1.9 GHz), IBM PPC 440 (700 MHz), AMD Opteron (2.2 GHz), and Intel P4 (2.0 GHz); y-axis 0-10 seconds]
Total execution time for POP2 (40x48) on POWER4 (gx1v3)

[Figure: bar chart of seconds for 200 timesteps on 64 processors, comparing PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D; y-axis 66-88 seconds]

-> 9.5% reduction
-> Eliminates the need for ~216,000 CPU hours per year @ NCAR
POP 0.1 degree

  blocksize   Nb    Nb^2    Max ||
  36x24       100   10000    7545
  30x20       120   14400   10705
  24x16       150   22500   16528
  18x12       200   40000   28972
  15x10       240   57600   41352
  12x8        300   90000   64074

  Increasing || -->   Decreasing overhead -->
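The Nb and Nb^2 columns follow directly from the 3600 x 2400 horizontal grid (Nb = 3600/bx = 2400/by blocks per side, Nb^2 total blocks); the Max || column depends on the land mask, so it cannot be reproduced from the grid alone. A quick check:

```python
grid_x, grid_y = 3600, 2400  # 0.1-degree POP horizontal grid
for bx, by in [(36, 24), (30, 20), (24, 16), (18, 12), (15, 10), (12, 8)]:
    nb = grid_x // bx
    assert nb == grid_y // by       # blocks per side agree in both directions
    print(f"{bx}x{by}: Nb={nb}, Nb^2={nb * nb}")
```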