T. Aoki: Beyond Peta-scale on Stencil and Particle-based GPU Applications
TRANSCRIPT
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
Beyond Peta-scale on Stencil and Particle-based GPU Applications
Takayuki Aoki
Global Scientific Information and Computing Center, Tokyo Institute of Technology
SPPEXA annual meeting, January 26, 2016, TUM
Life Time
[Figure: timeline (1950–2000) comparing lifetimes across the HPC stack — supercomputers (CPU architecture, accelerator): 4~6 years; HPC applications (implementation): 10~20 years; numerical methods for CFD (1st-order upwind, FDTD, WENO, LBM, DG, FEM, FVM, 3rd-order upwind, SPH): more than 10 years; HPC algorithms (FFT, Krylov subspace methods, FMM, Monte Carlo, QR, stencil) live the longest.]
Numerical Methods for CFD
Computational Fluid Dynamics: solving the Navier-Stokes equations.

Method              | Algorithm                              | Time Integration
FEM (unstructured)  | Sparse matrix, indirect memory access  | Explicit / Implicit
Spectral method     | FFT, stencil                           | Semi-implicit
FVM (unstructured)  | Sparse matrix, indirect memory access  | Semi-implicit
FVM (structured)    | Stencil, sparse matrix, WENO           | Explicit / Semi-implicit
FDM                 | Higher-order stencil                   | Explicit
LBM (D3Q15/19/27)   | Stencil                                | Explicit (SRT/MRT)
SPH, DEM            | Particle-to-particle interaction       | Explicit / Semi-implicit

Explicit time integration: 1st-order Euler, Runge-Kutta, leap-frog, predictor-corrector, Taylor expansion (Lax-Wendroff).
Implicit time integration: 1st-order implicit, implicit Runge-Kutta, Crank-Nicolson.
Roofline Model

P = min( F_peak, (FLOP/Byte) × B_peak )

Equivalently, performance relative to the peak is min(1, (FLOP/Byte) / (F_peak/B_peak)).

System parameter (machine balance):
K computer: B/F = 0.5
TSUBAME 2.5: B/F = 0.19 (NVIDIA K20X GPU)

Application parameter:
Arithmetic intensity, FLOP/Byte
[Figure: performance relative to the peak vs. arithmetic intensity (FLOP/Byte), with applications placed along the axis — SpMV, FDTD, LBM (SRT), FFT, FMM, LBM (MRT), WENO]
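As a hedged illustration in Python (the K20X figures below are its public double-precision peak and memory bandwidth, not values taken from the slide), the model is essentially a one-liner:

```python
def roofline_gflops(flop_per_byte, peak_gflops, peak_gbs):
    """Attainable performance: min(compute peak, intensity x bandwidth)."""
    return min(peak_gflops, flop_per_byte * peak_gbs)

# NVIDIA K20X: ~1310 GFLOPS double-precision peak, ~250 GB/s memory
# bandwidth, hence B/F ~ 0.19 as on the slide.
for intensity in (0.25, 1.0, 10.0):        # FLOP/Byte of a kernel
    print(intensity, roofline_gflops(intensity, 1310.0, 250.0))
# Low-intensity stencil kernels sit on the bandwidth-limited slope.
```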
Communication Required by Numerical Method
[Figure: methods placed by amount of communication data vs. communication distance — stencil-type methods (LBM, FDTD, FEM, DG, FVM(WENO), FVM(SMAC), DEM, explicit SPH) need short-distance neighbor exchange, while FMM, MG, implicit SPH, and FFT (spectral) require longer-distance or global communication]
Performance Model for Multi-node Computing
Domain decomposition for a structured mesh: boundary data is exchanged between distributed GPUs with MPI APIs such as MPI_Isend/MPI_Irecv, and copied between host and device with CUDA APIs such as cudaMemcpy.
The model has a similarity with the roofline model. For stencil applications with explicit time integration, the relevant parameters are:
• CPU peak performance
• Node total performance
• Memory bandwidth
• Interconnect bandwidth
Overlapping Performance Model
Non-overlapping: the elapsed time per step is the computational time plus the communication time; the application's FP operations run only during the computation phase.
Overlapping: the longer of the computation time (Tcomp) and the communication time (Tcomm) is observed as the actual elapsed time:
T = max(Tcomp, Tcomm)
Tcomp > Tcomm: communication is completely hidden behind computation.
Tcomp < Tcomm: the interconnect becomes the bottleneck.
Strong scaling case: since Tcomp shrinks with the subdomain volume while Tcomm shrinks only with its surface, roughly as (volume)^((d-1)/d) (d = dimension of decomposition, i.e., 1D, 2D, or 3D), communication eventually dominates; see the sketch below.
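A minimal sketch of this model in Python, with assumed (not measured) parameters, shows how strong scaling behaves:

```python
def elapsed_time(n_gpus, cells_total=1024**3, flops_per_cell=100.0,
                 gpu_flops=1.3e12, halo_bytes_per_cell=8.0,
                 link_bytes_per_s=5e9, d=3):
    """Overlapped elapsed time per step: max(T_comp, T_comm).
    T_comp scales with the subdomain volume, T_comm with its surface,
    i.e. with (cells per GPU)**((d-1)/d) for a d-D decomposition."""
    cells = cells_total / n_gpus
    t_comp = cells * flops_per_cell / gpu_flops
    t_comm = cells ** ((d - 1) / d) * halo_bytes_per_cell / link_bytes_per_s
    return max(t_comp, t_comm)

for n in (8, 64, 512, 4096):
    print(n, elapsed_time(n))
```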
Phase-Field Model for Dendritic Solidification
TSUBAME 2.0 at Tokyo Tech (each node: 3 × NVIDIA Tesla M2050 GPUs, 2 × Intel Xeon X5670 CPUs)
Gordon Bell Prize 2011: Special Achievements in Scalability and Time-to-Solution
Implementation Difficulties
• Complex algorithms: FMM, H-matrices, ...
• Implementation on heterogeneous systems (GPU, MIC): memory hierarchy, communication hierarchy, fine-grained parallelism, programming.
• Many branch divergences: SPH, DEM, and AMR involve memory access with many branch divergences and heavy data movement.
• Conservative domain researchers: some domain researchers do not accept any changes to their source codes.
TSUBAME 2.5
Compute node (3 Tesla K20X GPUs): performance 4.08 TFLOPS, memory 58.0 GB (CPU) + 18 GB (GPU)
Rack (30 nodes): performance 122 TFLOPS, memory 2.28 TB
System (58 racks): 1442 nodes, 2952 CPU sockets, 4264 GPUs
Performance: 224.7 TFLOPS (CPU, with Turbo Boost), 5.562 PFLOPS (GPU); total 17.1 PFLOPS (single precision)
A Tsunami Simulation for a Wide Area (A Typical Stencil Application)
Shallow Water Solver

∂U/∂t + ∂F/∂x + ∂G/∂y = S

U = (h, hu, hv)ᵀ,  F = (hu, hu² + gh²/2, huv)ᵀ,  G = (hv, huv, hv² + gh²/2)ᵀ

Quasi-linear form (1-D): ∂U/∂t + A ∂U/∂x = 0, with A = ∂F/∂U.

Factorization into characteristic variables: ∂W/∂t + Λ ∂W/∂x = 0, where the Riemann invariants
w± = u ± 2√(gh)
are advected along the characteristics dx/dt = u ± √(gh).

Semi-Lagrangian scheme: trace each characteristic back over one time step and interpolate,
w±ⁿ⁺¹(x) = w±ⁿ(x − (u ± √(gh)) Δt),
then recover (h, hu)ⁿ⁺¹ from the updated invariants.

The solver is divided into 2 kernels: a characteristics method for the deep ocean, and FVM for the area including coastlines.
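The following 1-D Python toy (assumed grid, depth profile, and time step; not the production kernel) sketches how the characteristics method advances the invariants:

```python
import numpy as np

# The Riemann invariants w± = u ± 2*sqrt(g*h) are traced back along
# the characteristics dx/dt = u ± sqrt(g*h) and interpolated there.
g, dx, dt = 9.81, 1000.0, 2.0
x = np.arange(0.0, 1.0e6, dx)
h = 4000.0 + 10.0 * np.exp(-((x - 5.0e5) / 2.0e4) ** 2)  # depth + hump
u = np.zeros_like(x)

c = np.sqrt(g * h)
wp = np.interp(x - (u + c) * dt, x, u + 2 * c)   # w+ along dx/dt = u + c
wm = np.interp(x - (u - c) * dt, x, u - 2 * c)   # w- along dx/dt = u - c
u = 0.5 * (wp + wm)                              # recover velocity
h = (wp - wm) ** 2 / (16 * g)                    # recover water depth
```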
Inundation at Coastlines
Thin-film technique: cells at the land-sea boundary with small water thickness are treated as a thin film with zero velocity, which requires many exceptional treatments.
[Figure: sea surface profile across the land-sea boundary]
Global Tsunami Simulation
Inundation Validation: Comoro Island, Sri Lanka
Time-step Control (Sub-cycling)
[Figure: advancing from step n to step n+1 across levels L1–L4, with finer levels taking more sub-steps; see the sketch below]
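A toy sketch of the scheduling idea, assuming each finer level halves the time step:

```python
# Level L_k advances with dt / 2**k and takes 2**k sub-steps per
# global step, so every level lands exactly on t + dt at step n+1.
def global_step(levels, dt):
    for k, level in enumerate(levels):
        nsub = 2 ** k
        for _ in range(nsub):
            level["t"] += dt / nsub        # stand-in for the real update

levels = [{"t": 0.0} for _ in range(4)]    # L1..L4
global_step(levels, 1.0)
print([lv["t"] for lv in levels])          # all levels reach t = 1.0
```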
Gas-Liquid Two-Phase Flows
Two-phase flows, mesh method (surface capture), solving the Navier-Stokes equations. Fine meshes should be assigned around the gas-liquid interfaces.
■ Navier-Stokes solver: fractional step
■ Time integration: 3rd-order TVD Runge-Kutta
■ Advection term: 5th-order WENO
■ Diffusion term: 4th-order FD
■ Poisson: MG-BiCGStab
■ Surface tension: CSF model
■ Surface capture: CLSVOF (THINC + Level-Set)
Level-Set Method (LSM)
The Level-Set method uses the signed distance function φ (the Level-Set function) to capture the interface; the interface is represented by the zero-level set (zero contour), and a Heaviside function H(φ) marks the phases. The Level-Set function must be periodically re-initialized to remain a distance function.
Advantages: curvature calculation, interface boundary treatment. Drawback: volume conservation.
Fig.: Takehiro Himeno et al., JSME 65-635 B (1999), pp. 2333-2340.
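A small Python illustration of the representation, assuming a circular interface on a uniform grid (not from the talk):

```python
import numpy as np

# phi is a signed distance; the interface is the zero contour, and a
# smeared Heaviside marks the liquid phase for the CLSVOF coupling.
n = 128
eps = 1.5 / n                                     # interface half-width
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
phi = np.hypot(x - 0.5, y - 0.5) - 0.25           # signed distance
H = np.clip((phi + eps) / (2 * eps), 0.0, 1.0)    # smoothed Heaviside
```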
Sparse Matrix Solver
The pressure Poisson equation of the fractional-step method, ∇²p = (1/Δt) ∇·u*, is solved as Ax = b using BiCGStab with an MG preconditioner.
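For illustration only, a SciPy sketch of the same kind of linear solve (the talk's solver is a custom multi-GPU implementation; a real MG preconditioner would be supplied via bicgstab's M argument):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A 2-D 5-point Laplacian stands in for the pressure Poisson matrix.
n = 64
A = sp.diags([-1.0, -1.0, 4.0, -1.0, -1.0], [-n, -1, 0, 1, n],
             shape=(n * n, n * n), format="csr")
b = np.ones(n * n)
x, info = spla.bicgstab(A, b)
assert info == 0                      # converged
```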
A drop on the dry floor
Air Separator Device
Which is the experiment, and which is the simulation?
LBM Simulations for Turbulent Flow (Stencil Application)
LBM (Lattice Boltzmann Method)

Collision step: f_i*(x, t) = f_i(x, t) − (1/τ) [ f_i(x, t) − f_i^eq(x, t) ]
Streaming step: f_i(x + e_i Δt, t + Δt) = f_i*(x, t)

Equilibrium distribution:
f_i^eq = w_i ρ [ 1 + 3 (e_i·u)/c² + 9 (e_i·u)²/(2c⁴) − 3 |u|²/(2c²) ]

f_i is the value in the direction of the i-th discrete velocity; e_i is the discrete velocity set; w_i is the weighting factor; u is the macroscopic velocity.
[Figure: D3Q19 discrete velocity set, directions 1–18 plus the rest particle]
SRT: memory bound. MRT: compute intensive.
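A minimal D2Q9 SRT sketch in Python (the talk uses the 3-D D3Q15/19/27 sets; 2-D is shown here for brevity, with assumed toy parameters):

```python
import numpy as np

w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)
e = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])

def equilibrium(rho, u):
    eu = np.einsum('qd,xyd->qxy', e, u)              # e_i . u
    usq = np.einsum('xyd,xyd->xy', u, u)
    return w[:, None, None] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*usq)

def step(f, tau):
    rho = f.sum(axis=0)                              # density
    u = np.einsum('qd,qxy->xyd', e, f) / rho[..., None]
    f = f - (f - equilibrium(rho, u)) / tau          # collision (SRT)
    for q in range(9):                               # streaming (periodic)
        f[q] = np.roll(f[q], e[q], axis=(0, 1))
    return f

f = equilibrium(np.ones((64, 64)), np.zeros((64, 64, 2)))
f = step(f, tau=0.6)
```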
LES (Large-Eddy Simulation)
Molecular viscosity and eddy viscosity.
[Figure: energy spectrum divided into grid-scale (GS) and subgrid-scale (SGS) ranges]
The SGS model enters through the relaxation time for the LES model.
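A hedged sketch of one common way to do this (a standard Smagorinsky closure in lattice units; the constant c_s and filter width are assumptions, not values from the talk):

```python
# The eddy viscosity is added to the molecular one, then folded back
# into the SRT relaxation time via nu_total = (tau - 1/2) / 3.
def les_relaxation_time(nu_molecular, strain_rate_mag, c_s=0.1, delta=1.0):
    nu_eddy = (c_s * delta) ** 2 * strain_rate_mag   # Smagorinsky
    return 3.0 * (nu_molecular + nu_eddy) + 0.5
```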
Performance
[Figure: performance in TFLOPS vs. number of GPUs — strong and weak scalability on TSUBAME 2.0 and TSUBAME 2.5]
Granular Material Simulations (Particle Simulation)
DEM (Discrete Element Method)
Contact interaction:
• Normal direction: spring + viscosity (dashpot)
• Tangential direction: spring + viscosity + friction
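A minimal sketch of the normal-direction contact force (a linear spring-dashpot; the constants k and c are placeholders, not the talk's values):

```python
import numpy as np

def normal_contact_force(xi, xj, vi, vj, radius, k=1e4, c=5.0):
    """Force on particle i from contact with particle j."""
    d = xj - xi
    dist = np.linalg.norm(d)
    overlap = 2 * radius - dist
    if overlap <= 0:
        return np.zeros(3)                 # no contact
    n = d / dist                           # contact normal (i -> j)
    vn = np.dot(vi - vj, n)                # approach speed along n
    return -(k * overlap + c * vn) * n     # spring repulsion + damping
```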
Particle Distribution
• The load imbalance problem among subdomains: some subdomains contain many particles, others no particles at all.
Bunker Shot Simulation
Golf bunker shot using 16.7 million particles with 64 GPUs
Complex Shape Objects
Non-spherical model described by rigidly connected sphere particles.
[Equations: explicit updates of each body's translational velocity, angular velocity, position, and orientation from the contact forces and torques summed over its constituent spheres]
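A hedged Python sketch of the update step such a model implies (the names and the explicit scheme are assumptions, not the talk's exact formulas):

```python
import numpy as np

def advance_body(xc, v, omega, member_pos, member_forces, m, I_inv, dt):
    """Reduce member-sphere contact forces to a net force/torque,
    then advance the rigid body state with an explicit step."""
    F = member_forces.sum(axis=0)                             # net force
    T = np.cross(member_pos - xc, member_forces).sum(axis=0)  # net torque
    v = v + dt * F / m                    # translational velocity
    omega = omega + dt * (I_inv @ T)      # angular velocity
    xc = xc + dt * v                      # center-of-mass position
    return xc, v, omega
```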
SPH (Smoothed Particle Hydrodynamics)
Short-range interaction: each particle carries its physical properties, and interactions among particles within the kernel radius are calculated at every time step.
Kernel function W with kernel radius h, normalized so that ∫ W(x − x′, h) dV′ = 1.
Discretization of a physical property A:
A(x) ≈ Σ_j (m_j / ρ_j) A_j W(x − x_j, h)
(m_j: mass, ρ_j: density of particle j)
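A minimal Python sketch of this interpolation, assuming the standard 3-D cubic-spline kernel (the talk does not specify which kernel is used):

```python
import numpy as np

def kernel(r, h):
    """3-D cubic spline with support radius 2h."""
    q = r / h
    sigma = 1.0 / (np.pi * h ** 3)
    return np.where(q < 1.0, sigma * (1 - 1.5 * q**2 + 0.75 * q**3),
           np.where(q < 2.0, sigma * 0.25 * (2.0 - q) ** 3, 0.0))

def interpolate(x, positions, A, m, rho, h):
    """A(x) ~= sum_j (m_j / rho_j) A_j W(|x - x_j|, h)."""
    r = np.linalg.norm(positions - x, axis=1)
    return np.sum(m / rho * A * kernel(r, h))
```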
Improved SPH: Generalization of Finite Difference Operators (Imoto, Tagami 2014)
First derivatives are approximated by generalized finite-difference operators built on the kernel function with kernel radius h; about 100 particles lie within the kernel.
A Dam Break Simulation
72 M particles with 80 GPUs
Multiple GPU Scalability
• Conditions
Particles: 2 × 10^6, 1.6 × 10^7, 1.29 × 10^8
Domain decomposition: dynamic load balance using the slice-grid method
Time integration: 2-stage Runge-Kutta
Sub-domains with High Aspect Ratio
• Passive particles advected by a vortex velocity field using 64 GPUs, decomposed with the slice-grid method; a sketch of the slicing follows.
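A minimal sketch of the slice-grid idea: cut positions along one axis are chosen so each slice holds an equal share of the particles, and dynamic rebalancing recomputes the cuts as particles move.

```python
import numpy as np

def slice_boundaries(x_positions, n_dom):
    """Interior cut positions giving n_dom equal-count slices."""
    q = np.linspace(0, 100, n_dom + 1)[1:-1]   # interior percentiles
    return np.percentile(x_positions, q)

x = np.random.rand(100_000) ** 2               # clustered distribution
print(slice_boundaries(x, 8))                  # 7 interior cut positions
```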
Space-Filling Curves
A tree is constructed so that each leaf contains a minimum number of particles (tree-based cell refinement), and each leaf is numbered by following the rules of a space-filling curve.
[Figure: Peano (9 cells), Hilbert (4 cells), and Morton (4 cells) refinements]
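As a small illustration, Morton (Z-order) keys are produced by interleaving the bits of the cell coordinates, so nearby cells tend to get nearby keys:

```python
def morton2d(ix, iy, bits=16):
    """Interleave the bits of (ix, iy) into one Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

print(morton2d(3, 5))   # -> 39
```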
Space-Filling Curves
[Figure: the domain decomposition by the Peano curve, and the connections of leaves along the Peano curve]
Scalability on TSUBAME 2.5
[Figure: scalability curves comparing 3 different space-filling curves (Peano, Hilbert, Morton) with the slice-grid method]
Fluid-Structure Interaction (rigid body)
All the objects and the fluid are represented by particles of the same size, and 4 types of interactions among particles are calculated.
1. Interactions among fluid particles: improved SPH (Imoto, Tagami 2014), with kernel radius h, kernel function w_h, and a first-derivative operator.
Tsunami Debris Flow
Including a lot of floating objects; object shapes are represented by converting CAD data into particles.
Computational conditions:
• Simulation area: 180 m × 160 m × 20 m
• Total particles: 117,561,285
• Particles / object: 19 ~ 472
• Objects: 10,368
• Physical time: 10.0 s
• Steps: 20,000
• GPUs: 256
• Simulation time: 100 h
• Initial wave height: 10.0 m
[Figure: initial state of each object type, with per-type object counts]
Large-Scale Computation on TSUBAME 2.5
Dynamic Domain Decomposition
• Using the improved SPH with 87 million particles on 256 GPUs
Hybrid Fortran Framework (directive-based)
1. Current main target is the JMA production weather model ASUCA.
2. Code generation by Python for accelerator programming:
a. Fixed storage order (GPGPUs need warp-first order)
b. Coarse-grained parallelism (accelerators need fine-grained)
Approach   | Examples        | Pros                          | Cons                                   | Effect
Libraries  | ArrayFire, BLAS | Optimization comes 'for free' | Limited control, limited use           | High porting effort; relearning of practices
Directives | OpenMP, OpenACC | Generally applicable          | No satisfying standard to solve 2a, 2b | High porting effort; relearning of practices; no performance portability
The idea behind Hybrid Fortran: a new set of directives supports flexible storage and loop orders (solving 2a and 2b). Goal: a unified codebase that is performant on both CPUs and accelerators.
Code transformation with different backend implementations for different architectures and kernels:
• OpenMP Fortran backend: coarse-grained parallelization for CPU
• OpenACC Fortran backend: fine-grained parallelization for GPU
• CUDA Fortran backend: kernels rewritten in 3D, fine-grained parallelization for GPU
A toy sketch of the storage/loop-order generation idea follows.
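For flavor only (this is not Hybrid Fortran's actual generator), a toy Python emitter that produces the same Fortran loop nest in two storage orders:

```python
# One kernel body, two loop/storage orders.
BODY = "a({idx}) = b({idx}) + c({idx})"

def emit(order):
    idx = ",".join(order)
    opens = "\n".join(f"do {v} = 1, n{v}" for v in reversed(order))
    closes = "\n".join("end do" for _ in order)
    return f"{opens}\n  {BODY.format(idx=idx)}\n{closes}"

print(emit(("i", "j", "k")))   # IJK order: i innermost, stride-1 on CPU
print(emit(("k", "i", "j")))   # KIJ order
```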
ASUCA on Hybrid Fortran, now:
[Figure: ASUCA call tree (arashi; long-dt path: diag, physics, Max/Min/Ave output, Runge-Kutta, diag adjust, physics adjust, physics RK, dynamics RK; short-dt path: sediment diagnose RK, dynamics RK; physics components: monitflux, radiation, convection, PBL/surface, microphysics), with ported and not-yet-ported modules marked]
Tests passed: the Rad, Gabls3, and Warmbubble cases, each on CPU and GPU, in both KIJ and IJK storage orders.
Integration is currently ongoing.
Summary
• Stencil applications with explicit time integration are scalable beyond the peta-scale.
• Particle simulations based on short-range interactions with explicit time integration are also scalable beyond the peta-scale.
• Interconnect performance at the exa-scale remains an open question.
• Sparse matrix solvers for stiff problems are strongly needed.
• Temporal blocking, communication avoiding, and similar techniques should be included, but they will not bring drastic improvements.
• For frameworks and DSLs, considerable effort is required before they are accepted by research communities.