-
Flat MPI and Hybrid Parallel Programming Models for FEM Applications on SMP Cluster Architectures
Kengo Nakajima
The 21st Century Earth Science COE Program, Department of Earth & Planetary Science, The University of Tokyo
First International Workshop on OpenMP (IWOMP 05) June 1-4, 2005, Eugene, Oregon, USA.
-
IWOMP, JUN05
2
Overview
• Background
  – GeoFEM Project & Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG for Simple 3D Linear Elastic Applications
    • Effect of reordering on various types of architectures
  – Contact Problems
  – PGA (Pin-Grid Array)
  – Multigrid
• Summary & Future Works
-
IWOMP, JUN05
3
GeoFEM: FY.1998-2002
http://geofem.tokyo.rist.or.jp/
• Parallel FEM platform for solid earth simulation.
  – parallel I/O, parallel linear solvers, parallel visualization
  – solid earth: earthquake, plate deformation, mantle/core convection, etc.
• Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator.
• Strong collaboration between the HPC and natural science (solid earth) communities.
• Current Activity
  – Research Consortium for Solid Earth Simulations on the Earth Simulator (9 sub-groups, >80 members)
-
IWOMP, JUN05
4
System Configuration of GeoFEM
[System diagram: Utilities (Partitioner: one-domain mesh → partitioned mesh on PEs; GPPView: visualization data) + Platform (equation solvers, visualizer, parallel I/O; solver / communication / visualization interfaces) + pluggable analysis modules (structural analysis: static linear, dynamic linear, contact; structure / fluid / wave).]
-
IWOMP, JUN05
5
Results on Solid Earth Simulation
• Magnetic Field of the Earth: MHD code
• Complicated Plate Model around Japan Islands
• Simulation of Earthquake Generation Cycle in Southwestern Japan
• TSUNAMI !!
• Transport by Groundwater Flow through Heterogeneous Porous Media
[Figure panels: Δh=5.00 and Δh=1.25 grids; snapshots at T=100, 200, 300, 400, 500.]
-
IWOMP, JUN05
6
Results by GeoFEM
-
IWOMP, JUN05
7
Motivations
• GeoFEM Project (FY.1998-2002)
• Performance evaluation of FEM-type applications with complicated unstructured grids (not LINPACK, FDM, …) on the Earth Simulator (ES)
  – Implicit linear solvers
  – Preconditioned iterative solvers
  – Hybrid vs. Flat MPI parallel programming model
-
IWOMP, JUN05
8
Earth Simulator (ES)http://www.es.jamstec.go.jp/
• 640 × 8 = 5,120 vector processors
  – SMP cluster-type architecture
  – 8 GFLOPS/PE, 64 GFLOPS/node, 40 TFLOPS/ES
• 16 GB memory/node, 10 TB/ES
• 640 × 640 crossbar network
  – 12.3 GB/sec × 2
• Memory bandwidth: 32 GB/sec
• 35.6 TFLOPS for LINPACK (March 2002)
• 26 TFLOPS for AFES (climate simulation)
-
IWOMP, JUN05
9
Flat MPI vs. Hybrid
[Diagram — Hybrid: hierarchical structure; each SMP node (8 PEs sharing one memory) is one unit.
Flat MPI: each PE is an independent MPI process with its own partition.]
-
IWOMP, JUN05
10
Hybrid vs. Flat MPI
• “Initial” motivation for Hybrid
  – Block Jacobi-type localized parallel preconditioning
  – Relatively low convergence rate for many domains (or processors)
  – “Hybrid” may provide a better convergence rate (i.e., faster convergence) because the number of domains is 1/8 of that of the Flat MPI programming model (on the ES).
-
IWOMP, JUN05
11
Local Data Structure: Node-based Partitioning
internal nodes – elements – external nodes
[Figure: a 5×5-node mesh partitioned into PE#0–PE#3; each partition stores its internal nodes, the elements that contain them, and the external (halo) nodes imported from neighboring partitions.]
-
IWOMP, JUN05
12
Block-Jacobi Type Localized ILU(0) Preconditioning [Nakajima et al. 1999]
• Ignore the effects of elements which belong to a different partition / processor.
• Not as strong as the original ILU.
  – How bad is it with many PEs?
• The global data dependency of the ILU preconditioner is not suitable for parallel computing.
Forward substitution:
$$y_k = b_k - \sum_{j=1}^{k-1} l_{kj}\, y_j \qquad (k = 2, \cdots, N)$$
Backward substitution:
$$x_k = \tilde{d}_k \left( y_k - \sum_{j=k+1}^{N} u_{kj}\, x_j \right) \qquad (k = N-1, \cdots, 1)$$
-
IWOMP, JUN05
13
Block-Jacobi Type Localized ILU(0) Preconditioning [Nakajima et al. 1999]
Matrix components whose column numbers are outside the processor are ignored (set equal to 0) in the forward/backward substitution.
[Figure: the rows of matrix A (columns 1–6) distributed over PE#1–PE#4; blocks within the processor are considered, off-processor blocks are ignored.]
-
IWOMP, JUN05
14
Overlapped Additive Schwartz Domain Decomposition Method
Effect of additive Schwartz domain decomposition for a solid mechanics example with 3×44³ DOF on Hitachi SR2201; ASDD cycles/iteration = 1, ε = 10⁻⁸.

     |  NO Additive Schwartz      |  WITH Additive Schwartz
PE#  |  Iter.#   sec.   Speed-Up  |  Iter.#   sec.   Speed-Up
  1  |   204    233.7     -       |   144    325.6     -
  2  |   253    143.6    1.63     |   144    163.1    1.99
  4  |   259     74.3    3.15     |   145     82.4    3.95
  8  |   264     36.8    6.36     |   146     39.7    8.21
 16  |   262     17.4   13.52     |   144     18.7   17.33
 32  |   268      9.6   24.24     |   147     10.2   31.80
 64  |   274      6.6   35.68     |   150      6.5   50.07
-
IWOMP, JUN05
16
Overlapped Additive Schwartz Domain Decomposition Method
for Stabilizing Localized Preconditioning
Global operation (Ω):
$$ M z = r $$
Local operation (Ω₁, Ω₂):
$$ M_{\Omega_1} z_{\Omega_1} = r_{\Omega_1}, \qquad M_{\Omega_2} z_{\Omega_2} = r_{\Omega_2} $$
Global nesting correction (with overlapped interface regions Γ₂→₁, Γ₁→₂):
$$ z^{n}_{\Omega_1} \leftarrow z^{n}_{\Omega_1} + M^{-1}_{\Omega_1}\left( r_{\Omega_1} - M_{\Omega_1} z^{n}_{\Omega_1} - M_{\Gamma_{1\to2}} z^{n}_{\Gamma_{2\to1}} \right) $$
$$ z^{n}_{\Omega_2} \leftarrow z^{n}_{\Omega_2} + M^{-1}_{\Omega_2}\left( r_{\Omega_2} - M_{\Omega_2} z^{n}_{\Omega_2} - M_{\Gamma_{2\to1}} z^{n}_{\Gamma_{1\to2}} \right) $$
-
IWOMP, JUN05
17
Overlapped Additive Schwartz Domain Decomposition Method
Effect of additive Schwartz domain decomposition for a solid mechanics example with 3×44³ DOF on Hitachi SR2201; ASDD cycles/iteration = 1, ε = 10⁻⁸.

     |  NO Additive Schwartz      |  WITH Additive Schwartz
PE#  |  Iter.#   sec.   Speed-Up  |  Iter.#   sec.   Speed-Up
  1  |   204    233.7     -       |   144    325.6     -
  2  |   253    143.6    1.63     |   144    163.1    1.99
  4  |   259     74.3    3.15     |   145     82.4    3.95
  8  |   264     36.8    6.36     |   146     39.7    8.21
 16  |   262     17.4   13.52     |   144     18.7   17.33
 32  |   268      9.6   24.24     |   147     10.2   31.80
 64  |   274      6.6   35.68     |   150      6.5   50.07
-
IWOMP, JUN05
18
Hybrid vs. Flat MPI
• “Initial” motivation for Hybrid
  – Block Jacobi-type localized parallel preconditioning
  – Relatively low convergence rate for many domains (or processors)
  – “Hybrid” may provide a better convergence rate (i.e., faster convergence) because the number of domains is 1/8 of that of the Flat MPI programming model (on the ES).
• Factors for performance
  – Type of application, problem size
  – Balance of H/W parameters

R. Rabenseifner: Communication Bandwidth of Parallel Programming Models on Hybrid Architectures, International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2002), Lecture Notes in Computer Science 2327, pp. 437-448, 2002.
-
IWOMP, JUN05
19
• Background
  – GeoFEM Project & Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG for Simple 3D Linear Elastic Applications
    • Effect of reordering on various types of architectures
  – PGA (Pin-Grid Array)
  – Contact Problems
  – Multigrid
• Summary & Future Works
-
IWOMP, JUN05
20
Block IC(0)-CG Solver on the Earth Simulator
• 3D linear elastic problems (SPD)
• Parallel iterative linear solver
  – Node-based local data structure
  – Conjugate Gradient method (CG): SPD
  – Localized Block IC(0) preconditioning (Block Jacobi)
  – Additive Schwartz domain decomposition (ASDD)
  – http://geofem.tokyo.rist.or.jp/
• Hybrid parallel programming model
  – OpenMP + MPI
  – Re-ordering for vector/parallel performance
  – Comparison with Flat MPI
-
IWOMP, JUN05
21
Flat MPI vs. Hybrid
[Diagram — Hybrid: hierarchical structure; each SMP node (8 PEs sharing one memory) is one unit.
Flat MPI: each PE is an independent MPI process with its own partition.]
-
IWOMP, JUN05
22
Local Data Structure: Node-based Partitioning
internal nodes – elements – external nodes
[Figure: a 5×5-node mesh partitioned into PE#0–PE#3; each partition stores its internal nodes, the elements that contain them, and the external (halo) nodes imported from neighboring partitions.]
-
IWOMP, JUN05
23
1 SMP node ⇒ 1 domain for the Hybrid programming model
MPI communication among domains
[Diagram: four SMP nodes (Node-0 … Node-3), each with 8 PEs sharing one memory; each SMP node forms a single domain, and domains communicate via MPI.]
-
IWOMP, JUN05
24
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

M = (L+D)D⁻¹(D+U),  Mz = r  ← need to solve this equation

Forward substitution:  (L+D)z₁ = r  :  z₁ = D⁻¹(r − Lz₁)
Backward substitution: D⁻¹(D+U)z = z₁, i.e. (I + D⁻¹U)zⁿᵉʷ = z₁  :  z = z − D⁻¹Uz
-
IWOMP, JUN05
25
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.

M = (L+D)D⁻¹(D+U),  L, D, U: from A

Forward substitution  (L+D)z = r :  z = D⁻¹(r − Lz)

  do i= 1, N
    WVAL= R(i)
    do j= 1, INL(i)
      WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
    enddo
    Z(i)= WVAL / D(i)
  enddo

Backward substitution  (I + D⁻¹U)zⁿᵉʷ = zᵒˡᵈ :  z = z − D⁻¹Uz

  do i= N, 1, -1
    SW = 0.0d0
    do j= 1, INU(i)
      SW= SW + AU(i,j) * Z(IAU(i,j))
    enddo
    Z(i)= Z(i) - SW / D(i)
  enddo

Dependency: you need the most recent value of “z” of connected nodes, so vectorization/parallelization is difficult.
Reordering: directly connected nodes do not appear on the RHS.
-
IWOMP, JUN05
26
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – Local operation and no global dependency
  – Continuous memory access
  – Sufficiently long loops for vectorization
-
IWOMP, JUN05
27
Basic Strategy for Parallel Programming on the Earth Simulator
• Hypothesis
  – Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP nodes and vector processors.
• Re-ordering for highly parallel/vector performance
  – Local operation and no global dependency
  – Continuous memory access
  – Sufficiently long loops for vectorization
• 3-way parallelism for hybrid parallel programming
  – Inter-node: MPI
  – Intra-node: OpenMP
  – Individual PE: vectorization
-
IWOMP, JUN05
28
Re-Ordering Technique for Vector/Parallel Architectures
Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))
1. RCM (Reverse Cuthill-McKee)
2. CMC (Cyclic Multicolor)
3. DJDS re-ordering (vector)
4. Cyclic DJDS for the SMP unit (SMP parallel)
These processes can be replaced by traditional multi-coloring (MC).
-
IWOMP, JUN05
29
Re-Ordering Technique for Vector/Parallel Architectures
Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC))

K. Nakajima, “Parallel Iterative Solvers of GeoFEM with Selective Blocking Preconditioning for Nonlinear Contact Problems on the Earth Simulator,” SC2003 Technical Session, Phoenix, Arizona, 2003.
http://www.sc-conference.org/sc2003/paperpdfs/pap155.pdf
http://www.sc-conference.org/sc2003/inter_cal/inter_cal_detail.php?eventid=10703#1
-
IWOMP, JUN05
30
Reordering = Coloring
• COLOR: a unit of independent sets.
• Elements grouped in the same “color” are independent of each other, so parallel/vector operation is possible (a minimal coloring sketch follows below).
• Many colors provide faster convergence, but a shorter vector length: trade-off!!
[Figure: Red-Black (2 colors), 4 colors, Multicolor orderings of a small grid.]
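A minimal greedy-coloring sketch, not from the slides: given a node adjacency graph in CRS form (the index/item arrays mirror the naming used later in the PDCRS loops, but are otherwise assumed), each node takes the smallest color not used by an already-colored neighbor, so nodes sharing a color form an independent set.

  subroutine greedy_coloring (N, index, item, color, NCOLORS)
  ! N          : number of nodes
  ! index(0:N) : CRS pointer into the neighbor list (assumed layout)
  ! item(:)    : neighbor node numbers
  ! color(i)   : color assigned to node i (output)
  implicit none
  integer, intent(in)  :: N, index(0:N), item(*)
  integer, intent(out) :: color(N), NCOLORS
  logical :: used(N)
  integer :: i, k, c

  color(1:N)= 0
  NCOLORS   = 0
  do i= 1, N
    used(1:N)= .false.
    do k= index(i-1)+1, index(i)          ! mark colors of already-colored neighbors
      if (color(item(k)) > 0) used(color(item(k)))= .true.
    enddo
    c= 1
    do while (used(c))                    ! smallest free color
      c= c + 1
    enddo
    color(i)= c
    NCOLORS = max(NCOLORS, c)
  enddo
  end subroutine greedy_coloring

Nodes of the same color can then be processed in one vectorized/parallel sweep: fewer colors mean longer loops, while more colors usually mean fewer incompatible nodes and faster convergence.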
-
IWOMP, JUN05
31
Colors & Iterations: Incompatible Nodes
Doi, S. (NEC) et al.
[Figure: 5×5 grid, nodes numbered 1–25 in natural order.]
“Incompatible nodes” are not affected by other nodes. A system with fewer incompatible nodes provides faster convergence.
-
IWOMP, JUN05
32
RCM Ordering
[Figure: 5×5 grid with RCM node numbering.]
-
IWOMP, JUN05
33
Red-Black Ordering
Highly parallelized: large loop length
Many incompatible nodes: slow convergence
[Figure: 5×5 grid with red-black node numbering.]
-
IWOMP, JUN05
34
Four-Color Ordering
Fairly parallelized: shorter loop length
Fewer incompatible nodes: faster convergence
[Figure: 5×5 grid with four-color node numbering.]
-
IWOMP, JUN05
35
Cyclic DJDS (RCM+CMC) for Forward/Backward Substitution in BILU Factorization

  do iv= 1, NCOLORS
!$omp parallel do private (iv0,j,iS,iE,i,k,kk, etc.)
    do ip= 1, PEsmpTOT                                 ! SMP parallel
      iv0= STACKmc(PEsmpTOT*(iv-1)+ip- 1)
      do j= 1, NLhyp(iv)
        iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
        iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
        do i= iv0+1, iv0+iE-iS                         ! vectorized
          k= i+iS - iv0
          kk= IAL(k)
          (Important Computations)
        enddo
      enddo
    enddo
  enddo
-
IWOMP, JUN05
36
Simple 3D Cubic Model
[Figure: cube with (Nx−1)×(Ny−1)×(Nz−1) elements along x, y, z.
Boundary conditions: Uz=0 @ z=Zmin, Ux=0 @ x=Xmin, Uy=0 @ y=Ymin; uniform distributed force in z-direction @ z=Zmin.]
-
IWOMP, JUN05
37
Effect of Ordering
-
IWOMP, JUN05
38
Effect of Re-Ordering
PDJDS/CM-RCM (proposed method): long loops, continuous access
PDCRS/CM-RCM (short innermost loops): continuous access
CRS (no re-ordering): short loops, irregular access
-
IWOMP, JUN05
39
Effect of Re-Ordering: Results on 1 SMP node
Color #: 99 (fixed). Re-ordering is REALLY required!!!
[Figure: GFLOPS (10⁻²–10²) vs. DOF (10⁴–10⁷). ● PDJDS/CM-RCM reaches 22 GFLOPS, 34% of peak (ideal performance: 40–45% for a single CPU); re-ordering gives roughly ×10 over ▲ CRS without re-ordering, and the long vector length another ×10 (×100 in total). ■ PDCRS/CM-RCM has short innermost loops.]
-
IWOMP, JUN05
40
Effect of Re-Ordering: Results on 1 SMP node
Color #: 99 (fixed). Re-ordering is REALLY required!!!
[Figure: GFLOPS (10⁻²–10²) vs. DOF (10⁴–10⁷); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS (no re-ordering).]
80×80×80 case (1.5M DOF):
● 212 iterations, 11.2 sec.
■ 212 iterations, 143.6 sec.
▲ 203 iterations, 674.2 sec.
-
IWOMP, JUN05
41
Ideal Performance: 48.6% of Peak
-
IWOMP, JUN05
42
SMP node # > 10, up to 176 nodes (1,408 PEs)
Problem size for each SMP node is fixed.
PDJDS-CMC/RCM, Color #: 99
-
IWOMP, JUN05
43
3D Elastic Model (Large Case)
256×128×128 / SMP node, up to 2,214,592,512 DOF
●: Flat MPI, ○: Hybrid
[Figures: GFLOPS rate (0–4000) and parallel work ratio (40–100%) vs. number of SMP nodes (0–192).]
3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).
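A quick arithmetic check (mine, not on the slide): the quoted total is exactly

\[
3 \times (256 \times 128 \times 128) \times 176 = 3 \times 4{,}194{,}304 \times 176 = 2{,}214{,}592{,}512 ,
\]

i.e., 3 DOF per grid point, the stated per-node grid, and the maximum of 176 SMP nodes.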
-
IWOMP, JUN05
44
3D Elastic Model (Small Case)
64×64×64 / SMP node, up to 125,829,120 DOF
●: Flat MPI, ○: Hybrid
[Figures: GFLOPS rate (0–4000) and parallel work ratio (40–100%) vs. number of SMP nodes (0–192).]
-
IWOMP, JUN05
45
3D Elastic Model: Problem Size and Parallel Performance
8–176 SMP nodes of the Earth Simulator
ー: Flat MPI, ---: Hybrid
[Figures: peak performance ratio (15–40%) and GFLOPS (10²–10⁴) vs. DOF (10⁶–10¹⁰), for problem sizes per SMP node of 3×64³, 3×80³, 3×100³, 3×128³, and 3×256×128×128; curves run from 8 up to 176 SMP nodes, improving with larger SMP node counts.]
-
IWOMP, JUN05
46
3D Elastic Model: Problem Size and Parallel Performance
8–176 SMP nodes of the Earth Simulator
ー: Flat MPI, ---: Hybrid
[Figures: iterations (0–2500) vs. DOF (10⁶–10¹⁰), and Hybrid/Flat MPI time ratio (80–160%) vs. number of SMP nodes (0–192), for problem sizes per SMP node of 3×64³, 3×100³, and 3×256×128×128. Ratios above 100% mean Flat MPI is faster; below 100%, Hybrid is faster.]
-
IWOMP, JUN05
47
Hybrid outperforms Flat MPI
• when …
  – the number of SMP nodes (PEs) is large
  – the problem size per node is small
• because Flat MPI has … (see the sketch below)
  – 8 times as many communicating processes
  – twice as large a communication/computation ratio
• The effect of communication becomes significant when the number of SMP nodes (or PEs) is large.
• Performance estimation by D. Kerbyson (LANL), LA-UR-02-5222
  – relatively large communication latency of the ES
[Figure: one cubic domain of edge N per SMP node (Hybrid) vs. eight cubic subdomains of edge N/2 (Flat MPI).]
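A back-of-the-envelope check of the factor of two (my own sketch, not on the slide): for a cubic subdomain of edge $n$, computation scales with the volume $n^3$ and communication with the surface $6n^2$, so

\[
\left.\frac{\mathrm{comm.}}{\mathrm{comp.}}\right|_{\mathrm{Hybrid}} \propto \frac{6N^2}{N^3}=\frac{6}{N},
\qquad
\left.\frac{\mathrm{comm.}}{\mathrm{comp.}}\right|_{\mathrm{Flat\ MPI}} \propto \frac{6(N/2)^2}{(N/2)^3}=\frac{12}{N},
\]

i.e., splitting each SMP-node domain of edge $N$ into eight MPI subdomains of edge $N/2$ doubles the surface-to-volume (communication-to-computation) ratio per process.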
-
IWOMP, JUN05
48
Comm. latency is relatively large on the ES
D. J. Kerbyson et al., LA-UR-02-5222

                     |  ES                               |  ASCI-Q (HP ES45)
Proc. speed          |  500 MHz                          |  1.25 GHz
SMP nodes            |  8 × 640 = 5,120                  |  4 × 3,072 = 12,288
Peak speed           |  8 × 5,120 = 40 TFLOPS            |  2.5 × 12,288 = 30 TFLOPS
Memory               |  2 × 5,120 = 10 TB                |  4 × 12,288 = 48 TB
Peak memory BWTH     |  32 GB/s                          |  2 GB/s (L1/L2: 20/30 GB/s)
Inter-node MPI comm. |  latency 5.6 μsec, BWTH 11.8 GB/s |  latency 5 μsec, BWTH 300 MB/s
-
IWOMP, JUN05
49
Performance Estimation for a Finite-Element-Type Application on the ES, using a performance model (not real computations)
D. J. Kerbyson et al., LA-UR-02-5222 (LANL)
[Figure: predicted time breakdown (%) vs. number of processors (1–5,120), split into computation/memory, communication latency, and communication bandwidth.]
-
IWOMP, JUN05
50
Hybrid outperforms Flat-MPI (cont.)
• The difference in the number of iterations for convergence between Hybrid and Flat MPI is not so significant … thanks to the additive Schwartz domain decomposition.
ー: Flat MPI, ---: Hybrid
[Figure: iterations (0–2500) vs. DOF (10⁶–10¹⁰); the Flat MPI and Hybrid curves nearly coincide, even for large SMP node counts.]
-
IWOMP, JUN05
51
Various Platforms
• Earth Simulator (ES)
  – Earth Simulator Center, JAMSTEC
  – Vector processors
• Hitachi SR8000
  – The University of Tokyo
  – PowerPC-based architecture
  – Pseudo-vector capability with preload/prefetch
• IBM SP-3
  – NERSC / Lawrence Berkeley National Laboratory: Seaborg
  – Power3
  – 8 MB L2 cache for each PE
-
IWOMP, JUN05
52
Platforms
ES = Earth Simulator; SP3 = IBM SP3 “Seaborg”, NERSC/LBNL; SR8k = Hitachi SR8000, Univ. Tokyo / Hitachi Ltd.

                       |  ES   |  SP3   |  SR8k
PE#/node               |  8    |  16    |  8
Peak GF/PE             |  8.0  |  1.5   |  1.8
Mem. BW GB/s/node      |  256  |  16    |  32
MPI lat. μsec          |  5.6  |  16.3* |  6-20**
Network BW GB/s/node   |  12.3 |  1.00  |  1.60

*) Oliker et al.   **) HLRS
-
IWOMP, JUN05 53
Reordering for the Earth Simulator: Matrix Storage for ILU Factorization
PDJDS/CM-RCM: long loops, continuous access
PDCRS/CM-RCM (short innermost loops): continuous access
CRS (no re-ordering): short loops, irregular access
-
IWOMP, JUN05 54
3D Elastic Simulation: Problem Size ~ GFLOPS
Earth Simulator, 1 SMP node (8 PEs)
Flat MPI: 23.4 GFLOPS, 36.6% of peak
OpenMP (Hybrid): 21.9 GFLOPS, 34.3% of peak
[Figures: GFLOPS (10⁻¹–10²) vs. DOF (10⁴–10⁷); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS (no re-ordering).]
-
IWOMP, JUN05 55
3D Elastic Simulation: Problem Size ~ GFLOPS
Hitachi SR8000-MPP with pseudo-vectorization, 1 SMP node (8 PEs)
Flat MPI: 2.17 GFLOPS, 15.0% of peak
OpenMP (Hybrid): 2.68 GFLOPS, 18.6% of peak
[Figures: GFLOPS (0–3) vs. DOF (10⁴–10⁷); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS (no re-ordering).]
-
IWOMP, JUN05 56
3D Elastic Simulation: Problem Size ~ GFLOPS
IBM SP3 (NERSC), 1 SMP node (8 PEs)
Flat MPI / OpenMP (Hybrid)
Cache is well utilized in Flat MPI.
[Figures: GFLOPS (0–3) vs. DOF (10⁴–10⁷); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS (no re-ordering).]
-
IWOMP, JUN05 57
3D Elastic Simulation: Single PE Performance
GFLOPS, 104 colors (DJDS = PDJDS/CM-RCM ●, CRS = PDCRS/CM-RCM ■)

                            98,304 DOF (=3×32³)   331,776 DOF (=3×48³)   786,432 DOF (=3×64³)
                            DJDS     CRS          DJDS     CRS           DJDS     CRS
Earth Simulator   Hybrid    1.67     -            2.06     -             3.01     -
                  Flat MPI  2.75     -            3.02     -             3.19     -
Hitachi SR8000    Hybrid    0.404    0.127        0.437    0.140         0.423    0.146
                  Flat MPI  0.366    0.128        0.343    0.138         0.322    0.146
IBM SP3           Hybrid    0.145    0.167        0.142    0.165         0.137    0.145
                  Flat MPI  0.138    0.165        0.135    0.159         0.126    0.141
-
IWOMP, JUN05 58
Performance of Intra-Node MPI
8 PEs: measurement using SEND/RECV subroutines in GeoFEM
[Figures for ES, Hitachi SR8k, IBM SP3: vs. message size (10²–10⁵), elapsed time for 1,000 iterations (sec.), elapsed time for 1,000 iterations without the estimated latency (sec.), and estimated communication bandwidth (GB/sec).]
-
IWOMP, JUN05 59
Performance of Intra-Node MPI
8 PEs: measurement using inter-domain SEND/RECV subroutines in GeoFEM

                             |  ES    |  SP3   |  SR8k
PE#/node                     |  8     |  16    |  8
Peak GF/PE                   |  8.0   |  1.5   |  1.8
Mem. BW GB/s/node            |  256   |  16    |  32
MPI lat. μsec                |  5.6   |  16.3* |  6-20**
Network BW GB/s/node         |  12.3  |  1.00  |  1.60
Intra-node MPI lat. μsec     |  5.29  |  8.32  |  9.17
Intra-node MPI BW MB/s/PE    |  1018. |  52.5  |  95.8

Intra-node communication performance seems reasonable given the memory bandwidth and the inter-node communication performance.
-
IWOMP, JUN05
60
Summary: Single Node Performance, Flat MPI vs. Hybrid
• Earth Simulator (ES)
  – Flat MPI is slightly better.
• Hitachi SR8000
  – Similar features to the ES.
  – Hybrid is better for PDJDS/CM-RCM ordering.
  – The difference in single-PE performance between Hybrid and Flat MPI for PDJDS/CM-RCM is significant. Compiler?
• IBM SP-3
  – The effect of cache (8 MB/PE) is significant.
  – Cache is well utilized in Flat MPI.
  – Performance on larger problems is similar.
-
IWOMP, JUN05
61
Speed-Up for Fixed Problem Size/PE
● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)
Earth Simulator, 12.6M DOF/node: 2.2G DOF, 3.80 TFLOPS, 33.7% of peak (176 nodes)
IBM SP-3 (Seaborg at NERSC), 3.0M DOF/node: 0.38G DOF, 110 GFLOPS, 7.16% of peak (128 nodes, 8 PEs/node)
[Figures: GFLOPS vs. SMP node # (8 PEs/node, 0–192); 0–4000 GFLOPS for the ES, 0–150 GFLOPS for the SP-3.]
-
IWOMP, JUN05
62
Speed-Up for Fixed Problem Size/PE
● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)
Earth Simulator, 786.4K DOF/node
IBM SP-3 (Seaborg at NERSC), 786.4K DOF/node: 0.10G DOF, 115 GFLOPS, 7.51% of peak (128 nodes, 8 PEs/node)
[Figures: GFLOPS vs. SMP node # (8 PEs/node, 0–192); 0–4000 GFLOPS for the ES, 0–150 GFLOPS for the SP-3.]
-
IWOMP, JUN05
63
Speed-Up for Fixed Problem Size/PE
● Hybrid, ● Flat MPI (lines: ideal speed-up extrapolated from 1-node performance)
Hitachi SR8000
6.29M DOF/node: 8.05G DOF on 128 nodes — ● 335 GFLOPS, 18.2% of peak; ● 257 GFLOPS, 14.7%
786.4K DOF/node: 0.10G DOF on 128 nodes — ● 271 GFLOPS, 14.7% of peak; ● 276 GFLOPS, 15.0%
[Figures: GFLOPS (0–400) vs. SMP node # (0–128).]
-
IWOMP, JUN05
64
Platforms
ES = Earth Simulator; SP3 = IBM SP3 “Seaborg”, NERSC/LBNL; SR8k = Hitachi SR8000, Univ. Tokyo / Hitachi Ltd.

                       |  ES   |  SP3   |  SR8k
PE#/node               |  8    |  16    |  8
Peak GF/PE             |  8.0  |  1.5   |  1.8
Mem. BW GB/s/node      |  256  |  16    |  32
MPI lat. μsec          |  5.6  |  16.3* |  6-20**
Network BW GB/s/node   |  12.3 |  1.00  |  1.60   →  12 : 1 : 1.6

Annotated ratio (ES : SP3 : SR8k): 2.9 : 1 : 1.2-2.7
*) Oliker et al.   **) HLRS
-
IWOMP, JUN05
65
Summary: Performance for Many Nodes, Flat MPI vs. Hybrid
• Earth Simulator (ES)
  – The effect of MPI latency is significant for:
    • Flat MPI
    • many nodes
    • small problem size per node
• Hitachi SR8000, IBM SP-3
  – The effect of MPI latency is hidden by their relatively low communication bandwidth.
  – Performance with 128 SMP nodes matches the extrapolation of single-node performance, especially when the problem size per node is sufficiently large.
-
IWOMP, JUN05
66
• Background
  – GeoFEM Project & Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG for Simple 3D Linear Elastic Applications
    • Effect of reordering on various types of architectures
  – PGA (Pin-Grid Array)
  – Contact Problems
  – Multigrid
• Summary & Future Works
-
IWOMP, JUN05 67
Elastic Problems with Cubic Model
-
IWOMP, JUN05 68
Effect of Loop Length (Color #)
100³ nodes = 3×10⁶ DOF, PDCRS/CM-RCM re-ordering
Iterations for convergence
[Figure: iterations (200–400) vs. number of colors (10–1000) for Flat MPI and Hybrid.]
-
IWOMP, JUN05 69
Effect of Loop Length (Color #)
100³ nodes = 3×10⁶ DOF, PDJDS/CM-RCM re-ordering
Earth Simulator (single node), Flat MPI and Hybrid
[Figures: GFLOPS (0–30) and elapsed time (0–100 sec.) vs. number of colors (10–1000); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS (no re-ordering).]
-
IWOMP, JUN05 70
Effect of Loop Length (Color #)
100³ nodes = 3×10⁶ DOF, PDJDS/CM-RCM re-ordering
Earth Simulator (single node), Flat MPI and Hybrid
[Figures: GFLOPS (0–30) and elapsed time (0–100 sec.) vs. number of colors (10–1000). Annotation: smaller vector length.]
-
IWOMP, JUN05 71
Effect of Loop Length (Color #)
100³ nodes = 3×10⁶ DOF, PDJDS/CM-RCM re-ordering
Earth Simulator (single node), Flat MPI and Hybrid
[Figures: GFLOPS (0–30) and elapsed time (0–100 sec.) vs. number of colors (10–1000). Annotation: smaller vector length + OpenMP overhead.]
-
IWOMP, JUN05 72
Many Colors … Trade-Off
• Fast convergence in terms of the iteration count (S. Doi et al.)
• Smaller vector length: the FLOPS rate decreases.
• Hybrid is much more sensitive to the color number, especially on the Earth Simulator.
-
IWOMP, JUN05 73
Hybrid is much more sensitive to color numbers !
  do iv= 1, NCOLORS
!$omp parallel do private (iv0,j,iS,iE,i,k,kk, etc.)
    do ip= 1, PEsmpTOT                                 ! SMP parallel
      iv0= STACKmc(PEsmpTOT*(iv-1)+ip- 1)
      do j= 1, NLhyp(iv)
        iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
        iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
!CDIR NODEP
        do i= iv0+1, iv0+iE-iS                         ! vectorized
          k= i+iS - iv0
          kk= IAL(k)
          y(i)= y(i) + A(k)*X(kk)                      ! (important computations)
        enddo
      enddo
    enddo
  enddo
-
IWOMP, JUN05 74
Effect of Loop Length (Color #)
100³ nodes = 3×10⁶ DOF, PDJDS/CM-RCM re-ordering
Hitachi SR8k (single node), Flat MPI and Hybrid
[Figures: GFLOPS (0–3) and elapsed time (0–500 sec.) vs. number of colors (10–1000).]
Performance is not as sensitive to the color # as on the Earth Simulator. In Flat MPI, performance is actually improved.
-
IWOMP, JUN05 75
Effect of Loop Length (Color #)
100³ nodes = 3×10⁶ DOF, PDCRS/CM-RCM re-ordering
IBM SP3 (single node, 8 PEs), Flat MPI and Hybrid
[Figures: GFLOPS (0–1.5) and elapsed time (0–1000 sec.) vs. number of colors (10–1000).]
Performance is not as sensitive to the color # as on the Earth Simulator.
-
IWOMP, JUN05 76
Effect of Loop Length (Color #)
100³ nodes = 3×10⁶ DOF
IBM SP3 (single node, 8 PEs), Flat MPI and Hybrid
[Figures: GFLOPS (0–1.5) vs. number of colors (10–1000); ● PDJDS/CM-RCM, ■ PDCRS/CM-RCM (short innermost loop), ▲ CRS (no re-ordering).]
-
IWOMP, JUN05 77
South-West Japan Model: Contact Problems
Selective Blocking Preconditioning
-
IWOMP, JUN05 78
South-West Japan
[Figure: plate configuration around southwest Japan — Eurasia, Philippine, and Pacific plates.]
-
IWOMP, JUN05 79
South-West Japan Model
800K nodes, 2.4M DOF, 1 SMP node (8 PEs)
● Flat MPI, ○ Hybrid (OpenMP)
[Figure: iterations (400–650) vs. number of colors (1–1000).]
-
IWOMP, JUN05 80
South-West Japan Model
800K nodes, 2.4M DOF, ES, 1 SMP node (8 PEs)
● Flat MPI, ○ Hybrid (OpenMP)
[Figures: iterations (400–650), GFLOPS (5–25), and elapsed time (20–40 sec.) vs. number of colors (1–1000). Annotations: smaller vector length; smaller vector length + OpenMP overhead.]
-
IWOMP, JUN05 81
South-West Japan Model
800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
● Flat MPI, ○ Hybrid (OpenMP)
[Figures: iterations (400–650), GFLOPS (0–3), and elapsed time (0–300 sec.) vs. number of colors (1–1000).]
-
IWOMP, JUN05 83
South-West Japan Model
800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
● Flat MPI, ○ Hybrid (OpenMP)
[Figures: iterations (400–650), GFLOPS (0–3), and elapsed time (0–300 sec.) vs. number of colors. Annotation: smaller vector length + OpenMP overhead.]
-
IWOMP, JUN05 84
South-West Japan Model
800K nodes, 2.4M DOF, Hitachi SR8000, 1 SMP node (8 PEs)
● Flat MPI, ○ Hybrid (OpenMP)
[Figures: iterations (400–650), GFLOPS (0–3), and elapsed time (0–300 sec.) vs. number of colors. Annotation: WHY?]
-
IWOMP, JUN05 85
South-West Japan Model
800K nodes, 2.4M DOF, IBM SP3, 1 SMP node (8 PEs)
● Flat MPI, ○ Hybrid (OpenMP)
[Figures: iterations (400–650), GFLOPS (0–1.0), and elapsed time (0–1000 sec.) vs. number of colors.]
-
IWOMP, JUN05 87
South-West Japan Model
7,767,002 nodes, 23,301,006 DOF, ES, 10 SMP nodes (peak = 640 GFLOPS)
● Flat MPI, ○ Hybrid
[Figure: iterations (3000–3500) vs. number of colors (10–1000).]
-
IWOMP, JUN05 88
South-West Japan Model
7,767,002 nodes, 23,301,006 DOF, ES, 10 SMP nodes (peak = 640 GFLOPS)
● Flat MPI, ○ Hybrid
[Figures: GFLOPS (75–200) and elapsed time (100–250 sec.) vs. number of colors (10–1000).]
-
Complicated Geometries: PGA for Mobile Pentium III
-
IWOMP, JUN05 90
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF)
-
IWOMP, JUN05 91
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
PDJDS/MC re-ordering
● Flat MPI, ○ OpenMP
[Figure: iterations (400–500) vs. number of colors (10–1000).]
-
IWOMP, JUN05 92
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
PDJDS/MC re-ordering, Earth Simulator
● Flat MPI, ○ OpenMP
[Figures: GFLOPS (0–30) and elapsed time (0–150 sec.) vs. number of colors (10–1000). Annotations: smaller vector length; smaller vector length + OpenMP overhead.]
-
IWOMP, JUN05 93
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
PDJDS/MC re-ordering, Hitachi SR8000
● Flat MPI, ○ OpenMP
[Figures: GFLOPS (0–3) and elapsed time (0–800 sec.) vs. number of colors (10–1000). Annotation: WHY?]
-
IWOMP, JUN05 94
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
PDJDS/MC re-ordering, IBM SP3
● Flat MPI, ○ OpenMP
[Figures: GFLOPS (0–1.0) and elapsed time (0–3000 sec.) vs. number of colors (10–1000).]
-
IWOMP, JUN05 95
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
Hitachi SR8000
●■ Flat MPI, ○□ OpenMP
[Figures: GFLOPS (0–3) vs. number of colors (10–1000) for PDJDS/MC and PDCRS/MC. Annotation: WHY?]
-
IWOMP, JUN05 96
Single PE test for a simple cubic model
786,432 DOF = 3×64³ nodes, Hitachi SR8000
● Flat MPI (PDJDS), ■ Flat MPI (PDCRS), ○ OpenMP (PDJDS), □ OpenMP (PDCRS)
[Figure: GFLOPS (0–0.5) vs. number of colors (10–1000).]
The behavior of Flat MPI with PDJDS looks like that of a scalar processor, while OpenMP/PDJDS looks like a vector processor.
A feature (or problem) of the compiler? The pseudo-vector option does not work well with Flat MPI.
-
IWOMP, JUN05 97
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
IBM SP3
●■ Flat MPI, ○□ OpenMP
[Figures: GFLOPS (0–1.5) vs. number of colors (10–1000) for PDJDS/MC and PDCRS/MC.]
-
IWOMP, JUN05 98
PDCRS is more suitable for scalar processors
The reduction-type loop of CRS is more suitable for cache-based scalar processors because of its localized operation. PDJDS provides data locality as the color number increases, so on scalar processors performance may improve as the color number increases.

PDCRS (row-wise reduction loop):
  do i= 1, N
    k1= index(i-1)+1
    k2= index(i)
    do k= k1, k2
      kk= item(k)
      Y(i)= Y(i)+A(k)*X(kk)
    enddo
  enddo

PDJDS (long innermost loop over each jagged diagonal):
  do j= 1, NCON
    do i= 1, NN(j)
      k = index(j-1) + i
      kk= item(k)
      Y(i)= Y(i)+A(k)*X(kk)
    enddo
  enddo
-
IWOMP, JUN05 99
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
IBM SP3
●■ Flat MPI, ○□ OpenMP
[Figures: GFLOPS (0–1.5) vs. number of colors (10–1000) for PDJDS/MC and PDCRS/MC. Annotation: performance is slightly improved due to data locality.]
-
IWOMP, JUN05 101
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
IBM SP3
●■ Flat MPI, ○□ OpenMP
[Figures: GFLOPS (0–1.5) vs. number of colors (10–1000) for PDJDS/MC and PDCRS/MC. Annotation: OpenMP overhead.]
-
IWOMP, JUN05 102
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
Itanium 2 (SGI Altix), 8 PEs
● Flat MPI (PDJDS), ■ Flat MPI (PDCRS)
[Figures: GFLOPS (0–3) and elapsed time (0–1000 sec.) vs. number of colors (10–1000).]
-
IWOMP, JUN05 103
PGA model: 61 pins, 956,128 elements, 1,012,354 nodes (3,037,062 DOF), 1 SMP node
AMD Opteron, 8 PEs
● Flat MPI (PDJDS), ■ Flat MPI (PDCRS)
[Figures: GFLOPS (0–3) and elapsed time (0–1000 sec.) vs. number of colors (10–1000).]
-
IWOMP, JUN05 104
Multigrids
-
IWOMP, JUN05
105
Multigrid: Multigrid is scalable!!!
[Figure (based on LLNL): time to solution vs. number of processors / problem size (10⁰–10³). MGCG: scalable (flat curve); ICCG: unscalable; J2CG: unscalable.]
-
IWOMP, JUN05
106
Target Application: Thermal Convection between Two Spherical Surfaces
• Typical geometry in earth science simulations
  – Mantle, core, etc.
  – Also applicable to external flow
• Boundary conditions
  – r = Rin (= 0.50): u=v=w=0, T=1
  – r = Rout (= 1.00): u=v=w=0, T=0
  – Heat generation
-
IWOMP, JUN05
107
Semi-Implicit Pressure Correction Scheme
• Governing equations
  – Momentum (Navier-Stokes)
  – Continuity
  – Energy
• Poisson equations for the pressure correction, derived from the continuity constraint (a standard form is sketched below).
• Parallel MGCG iterative method for the Poisson equations
  – V-cycle, geometric multigrid
  – Gauss-Seidel smoothing
  – Semi-coarsening
  – Multilevel communication tables
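For reference (my sketch, not on the slide), a standard pressure-correction Poisson equation of the kind such semi-implicit schemes solve, assuming an intermediate velocity $\mathbf{u}^{*}$ and time step $\Delta t$:

\[
\nabla^2 p' = \frac{\rho}{\Delta t}\, \nabla\!\cdot\mathbf{u}^{*},
\qquad
\mathbf{u}^{n+1} = \mathbf{u}^{*} - \frac{\Delta t}{\rho}\, \nabla p' ,
\]

so enforcing the continuity constraint $\nabla\!\cdot\mathbf{u}^{n+1}=0$ reduces to one Poisson solve per time step, which is where the MGCG solver is applied.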
-
IWOMP, JUN05
108
Start from an Icosahedron: Surface Grid by Adaptive Mesh Refinement (AMR)
Level 0: 12 nodes, 20 triangles
Level 1: 42 nodes, 80 triangles
Level 2: 162 nodes, 320 triangles
Level 3: 642 nodes, 1,280 triangles
Level 4: 2,562 nodes, 5,120 triangles
(the counts follow the simple refinement rule noted below)
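Each refinement step splits every surface triangle into four, so the counts above satisfy (my observation, consistent with the listed numbers):

\[
N_{\mathrm{tri}}(L) = 20 \cdot 4^{L},
\qquad
N_{\mathrm{node}}(L) = 10 \cdot 4^{L} + 2 ,
\]

e.g., $L=4$: $20\cdot256 = 5{,}120$ triangles and $10\cdot256+2 = 2{,}562$ nodes.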
-
IWOMP, JUN05
109
Semi-Unstructured Prismatic Grids
• generated from unstructured surface triangles
• structured in the normal-to-surface direction
• flexible
• suitable for near-wall boundary-layer computation
-
IWOMP, JUN05
110
Results on Hitachi SR2201
320×900 = 288,000 cells/PE, up to 37M DOF on 128 PEs
[Figures: elapsed time (sec.) and parallel performance (60–110%) vs. PE# (0–128) for ICCG, MGCG/FGS, and MGCG/ILU.]
-
IWOMP, JUN05
111
How about on the ES?
• 1 CPU
  – 768,000 elements (1280×600)
  – ICCG: 29 sec. (Xeon 2.8 GHz) ⇒ 169 sec. (ES)
  – MGCG: 58 sec. (Xeon 2.8 GHz) ⇒ 345 sec. (ES)
• Why?
  – CRS: the loop length is small
• Remedy
  – Reordering, as done in BILU(0)
    • Gauss-Seidel smoothing: dependency
  – Traditional multicoloring
-
IWOMP, JUN05
112
Critical Issue of Multigrid for Vector Computers
• Shorter loop length at coarse grid levels.
-
IWOMP, JUN05
113
Vectorization: Re-Ordering — Earth Simulator
• RCM (Reverse Cuthill-McKee), Multicolor
768,000 elements, ES, 1 PE
○ MGCG(MC), △ MGCG(RCM), ● ICCG(MC), ▲ ICCG(RCM)
MGCG (original ordering): 345 sec.;  ICCG (original ordering): 169 sec.
[Figure: elapsed time (0–6 sec.) vs. color # (0–1500); inset: colored grid diagrams.]
-
IWOMP, JUN05
114
GFLOPS, Iterations: 768,000 elements, ES, 1 PE
○ MGCG(MC), △ MGCG(RCM), ● ICCG(MC), ▲ ICCG(RCM)
[Figures: iterations (10–1000, log scale) and GFLOPS (1.0–2.5) vs. color # (0–1500).]
Many colors …
• fewer iterations for convergence
• lower performance due to shorter loops (especially in MG)
-
IWOMP, JUN05
115
Scalability: Flat MPI
768,000 elements/PE, up to 80 PEs
○ MGCG(RCM), ● ICCG(MC=1500)
[Figure: elapsed time (0–50 sec.) vs. DOF (0–6×10⁷).]
-
IWOMP, JUN05
116
Next step: Hybrid
Many colors are required for reasonably fast convergence. Overhead of OpenMP.
(Same cyclic DJDS loop structure as before: outer loop over colors, OpenMP-parallel loop over the PEs of the SMP node, vectorized innermost loop.)
-
IWOMP, JUN05
117
Hybrid vs. Flat MPI: 8 PEs on ES (1 SMP node)
MGCG (10 iterations), ICCG (100 iterations)
○ Hybrid, ● Flat MPI
[Figures: sec./10 iterations (MGCG) and sec./100 iterations (ICCG) vs. color # (10–10000), log scales.]
-
IWOMP, JUN05
118
MGCG on ES
• Excellent performance by reordering
  – More than 20% of peak even in MGCG: nice for Poisson solvers (about 75% of the ICCG rate)
  – Excellent scalability for Flat MPI
• Hybrid
  – Low performance for many colors, significant for MGCG
  – But many colors are needed for reasonably fast convergence: not recommended
-
IWOMP, JUN05
119
Hybrid vs. Flat MPI: 8 PEs on ES (1 SMP node)
1280×600×8 = 6,144,000 elements
MGCG (10 iterations), ICCG (100 iterations)
○ Hybrid, ● Flat MPI
[Figures: sec./10 iterations (MGCG) and sec./100 iterations (ICCG) vs. color # (10–10000), log scales.]
-
IWOMP, JUN05
120
Hybrid vs. Flat MPI: 8 PEs on Hitachi SR8000 (1 SMP node)
1280×600×8 = 6,144,000 elements
MGCG (10 iterations), ICCG (100 iterations)
○ Hybrid, ● Flat MPI
[Figures: sec./10 iterations (MGCG, 0–80) and sec./100 iterations (ICCG, 0–80) vs. color # (10–10000).]
-
IWOMP, JUN05
121
Hybrid vs. Flat MPI: 8 PEs on IBM SP-3 (1 SMP node)
1280×600×8 = 6,144,000 elements
MGCG (10 iterations), ICCG (100 iterations)
○ Hybrid, ● Flat MPI
[Figures: sec./10 iterations (MGCG) and sec./100 iterations (ICCG) vs. color # (10–10000), log scales.]
-
IWOMP, JUN05
122
• Background
  – GeoFEM Project & Earth Simulator
  – Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
  – BIC(0)-CG for Simple 3D Linear Elastic Applications
    • Effect of reordering on various types of architectures
  – PGA (Pin-Grid Array)
  – Contact Problems
  – Multigrid
• Summary & Future Works
-
IWOMP, JUN05 123
Summary (Earth Simulator)
Hybrid parallel programming model on an SMP cluster architecture with vector processors, for unstructured grids:
• Nice parallel performance both across and within SMP nodes on the ES: 3.8 TFLOPS for 2.2G DOF on 176 nodes (33.8% of peak).
• Re-ordering is really required.
• Simple multicoloring is better for complicated geometries.
-
IWOMP, JUN05 124
Summary (cont.) (Earth Simulator): Hybrid vs. Flat MPI
• Flat MPI is better for a small number of SMP nodes.
• Hybrid is better for a large number of SMP nodes, especially when the problem size is rather small.
• Flat MPI stresses communication; Hybrid stresses memory: which wins depends on the application, problem size, etc.
• Hybrid is much more sensitive to color numbers than Flat MPI, due to the synchronization overhead of OpenMP, especially on the Earth Simulator.
  – This is not so significant on the Hitachi SR8k and IBM SP3.
-
IWOMP, JUN05 125
Summary (cont.) (General): Hybrid vs. Flat MPI
• On the IBM SP-3, the difference between Flat MPI and Hybrid is not so significant, although Flat MPI is slightly better.
• On the Hitachi SR8000, pseudo-vectorization works better for the hybrid cases.
Effect of Color
• On scalar processors (especially with Flat MPI), performance gets better as the color number increases, due to more efficient utilization of cache.
-
IWOMP, JUN05 126
Hybrid vs. Flat MPI
• Generally speaking, they are competitive.
• Flat MPI is better for multigrid-type applications with multicolor ordering.
• In ill-conditioned problems, Flat MPI may require more iterations.
• It depends on the implementation of the compiler.
-
IWOMP, JUN05 127
Hybrid vs. Flat MPI (cont.)
• Earth Simulator
  – Flat MPI is better.
  – Hybrid is better for large PE numbers with a small problem size.
• IBM SP-3
  – Flat MPI is better for small problem sizes because cache is utilized more efficiently.
• Hitachi SR8000
  – Hybrid is better (mainly because of a feature of the compiler).
• Careful treatment of the color number is required for the hybrid programming model.
-
IWOMP, JUN05 128
My current feeling is …
Flat MPI is slightly better on the Earth Simulator in FEM-type applications with multicoloring.
-
IWOMP, JUN05 129
Can Hybrid survive? SAI (Sparse Approximate Inverse) preconditioning
• Suitable for parallel & vector computation, especially for the hybrid parallel programming model:
  – only matrix-vector products
  – reordering for avoiding dependency is not required (corresponds to a single color)
• Preconditioning for contact problems
  – J. Zhang, K. Nakajima et al. (2003) for scalar processors
  – K. Nakajima (2005) for the ES, at WOMPEI 05
-
IWOMP, JUN05 130
Matrix-Vector Product ONLY
!$omp parallel do private (iv0,iS,iE)
  do ip= 1, PEsmpTOT
    iv0= STACKmc(ip)
    do j= 1, NONdiagTOT
      iS= indexS(ip, j)
      iE= indexE(ip, j)
!CDIR NODEP
      do i= iv0+1, iv0+iE-iS
        (computations)
      enddo
    enddo
  enddo
!$omp end parallel do
-
IWOMP, JUN05 131
Earth Simulator, Mat-Vec
1 SMP node (64 GFLOPS peak), 500 iterations. Hybrid is rather faster!!
●: Flat MPI, ○: Hybrid
[Figure: GFLOPS (10–30) vs. DOF (10⁴–10⁷).]
-
IWOMP, JUN05
132
Results: ES (1/2)
● Large (2.47M DOF), ○ Small (0.35M DOF)
[Figures: elapsed time for the large problem, split into set-up and solver (0–300 sec.), vs. thread # (1–8); speed-up (0–8) vs. thread # (0–8).]
Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.
-
IWOMP, JUN05
133
Results: ES (2/2)
●: SB-BIC(0), ▲: SAI — Large (2.47M DOF); ○: SB-BIC(0), △: SAI — Small (0.35M DOF)
[Figures: elapsed time (1–100 sec., log scale) and GFLOPS (0–30) vs. color # (1–1000).]
Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.
-
IWOMP, JUN05
135
Results: Hitachi SR8000 (1/2)
● Large (2.47M DOF), ○ Small (0.35M DOF)
[Figures: elapsed time for the large problem, split into set-up and solver (0–2000 sec.), vs. thread # (1–8); speed-up (0–8) vs. thread # (0–8).]
Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.
-
IWOMP, JUN05
136
Comparison: ES vs. Hitachi SR8k
[Figures: for the ES and the SR8k, elapsed time split into set-up and solver vs. thread # (1–8), and speed-up (0–8) vs. thread # (0–8).]
-
IWOMP, JUN05
137
Results: Hitachi SR8000 (2/2)
●: SB-BIC(0), ▲: SAI — Large (2.47M DOF); ○: SB-BIC(0), △: SAI — Small (0.35M DOF)
[Figures: elapsed time (1–1000 sec., log scale) and GFLOPS (0–3) vs. color # (1–1000).]
Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.
-
IWOMP, JUN05
138
Results: IBM SP-3 (1/2)
● Large (2.47M DOF), ○ Small (0.35M DOF)
[Figures: elapsed time for the large problem, split into set-up and solver (0–7500 sec.), vs. thread # (1–16); speed-up (0–16) vs. thread # (0–16).]
Nakajima, K. (2005), "Sparse Approximate Inverse Preconditioner for Contact Problems on the Earth Simulator using OpenMP", International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2005), Tsukuba, Japan.
-
IWOMP, JUN05 139
Acknowledgements
• Sponsor: MEXT, Japan.
• Computational resources
  – Earth Simulator Center, Japan.
  – NERSC / Lawrence Berkeley National Laboratory, USA.
    • Dr. Horst D. Simon
    • Dr. Esmond G. Ng
  – Information Technology Center, The University of Tokyo, Japan.
• Colleagues of the GeoFEM Project
• IWOMP Committee