T. Aoki: Beyond Peta-scale on Stencil and Particle-based GPU Applications
TRANSCRIPT
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
Beyond Peta-scale on Stencil and Particle-based GPU Applications
Takayuki Aoki
Global Scientific Information and Computing Center, Tokyo Institute of Technology
SPPEXA annual meeting, January 26, 2016, TUM
Life Time
[Figure: timeline (1950–2000) comparing lifetimes across the HPC stack — supercomputers (CPU architecture, accelerator): 4~6 years; HPC applications (implementation): 10~20 years; numerical methods for CFD (1st-order upwind, FDTD, WENO, LBM, DG, FEM, FVM, 3rd-order upwind, SPH): more than 10 years; HPC algorithms (FFT, Krylov subspace methods, FMM, Monte Carlo, QR, stencil) live the longest.]
Numerical Methods for CFD
Computational Fluid Dynamics: solving the Navier-Stokes equations.

Method              | Algorithm                              | Time Integration
FEM (unstructured)  | Sparse matrix, indirect memory access  | Explicit / Implicit
Spectral method     | FFT, stencil                           | Semi-implicit
FVM (unstructured)  | Sparse matrix, indirect memory access  | Semi-implicit
FVM (structured)    | Stencil, sparse matrix, WENO           | Explicit / Semi-implicit
FDM                 | Higher-order stencil                   | Explicit
LBM (D3Q15/19/27)   | Stencil                                | Explicit (SRT/MRT)
SPH, DEM            | Particle-to-particle interaction       | Explicit / Semi-implicit

Explicit time integration: 1st-order Euler, Runge-Kutta, leap-frog, predictor-corrector, Taylor expansion (Lax-Wendroff).
Implicit time integration: 1st-order implicit, implicit Runge-Kutta, Crank-Nicolson.
Roofline Model

P = min( F_peak, (FLOP/Byte) × B_peak )

Equivalently, performance relative to the peak is min(1, (FLOP/Byte) / (F_peak/B_peak)).

System parameter (machine balance):
K computer: B/F = 0.5
TSUBAME 2.5: B/F = 0.19 (NVIDIA K20X GPU)

Application parameter:
Arithmetic intensity, FLOP/Byte
[Figure: performance relative to the peak vs. arithmetic intensity (FLOP/Byte), with applications placed along the axis — SpMV, FDTD, LBM (SRT), FFT, FMM, LBM (MRT), WENO]
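As a hedged illustration in Python (the K20X figures below are its public double-precision peak and memory bandwidth, not values taken from the slide), the model is essentially a one-liner:

```python
def roofline_gflops(flop_per_byte, peak_gflops, peak_gbs):
    """Attainable performance: min(compute peak, intensity x bandwidth)."""
    return min(peak_gflops, flop_per_byte * peak_gbs)

# NVIDIA K20X: ~1310 GFLOPS double-precision peak, ~250 GB/s memory
# bandwidth, hence B/F ~ 0.19 as on the slide.
for intensity in (0.25, 1.0, 10.0):        # FLOP/Byte of a kernel
    print(intensity, roofline_gflops(intensity, 1310.0, 250.0))
# Low-intensity stencil kernels sit on the bandwidth-limited slope.
```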
Communication Required by Numerical Method
[Figure: methods placed by amount of communication data vs. communication distance — stencil-type methods (LBM, FDTD, FEM, DG, FVM(WENO), FVM(SMAC), DEM, explicit SPH) need short-distance neighbor exchange, while FMM, MG, implicit SPH, and FFT (spectral) require longer-distance or global communication]
Performance Model for Multi-node Computing
Domain decomposition for a structured mesh: boundary data is exchanged between distributed GPUs with MPI APIs such as MPI_Isend/MPI_Irecv, and copied between host and device with CUDA APIs such as cudaMemcpy.
The model has a similarity with the roofline model. For stencil applications with explicit time integration, the relevant parameters are:
• CPU peak performance
• Node total performance
• Memory bandwidth
• Interconnect bandwidth
Overlapping Performance Model
Non-overlapping: the elapsed time per step is the computational time plus the communication time; the application's FP operations run only during the computation phase.
Overlapping: the longer of the computation time (Tcomp) and the communication time (Tcomm) is observed as the actual elapsed time:
T = max(Tcomp, Tcomm)
Tcomp > Tcomm: communication is completely hidden behind computation.
Tcomp < Tcomm: the interconnect becomes the bottleneck.
Strong scaling case: since Tcomp shrinks with the subdomain volume while Tcomm shrinks only with its surface, roughly as (volume)^((d-1)/d) (d = dimension of decomposition, i.e., 1D, 2D, or 3D), communication eventually dominates; see the sketch below.
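A minimal sketch of this model in Python, with assumed (not measured) parameters, shows how strong scaling behaves:

```python
def elapsed_time(n_gpus, cells_total=1024**3, flops_per_cell=100.0,
                 gpu_flops=1.3e12, halo_bytes_per_cell=8.0,
                 link_bytes_per_s=5e9, d=3):
    """Overlapped elapsed time per step: max(T_comp, T_comm).
    T_comp scales with the subdomain volume, T_comm with its surface,
    i.e. with (cells per GPU)**((d-1)/d) for a d-D decomposition."""
    cells = cells_total / n_gpus
    t_comp = cells * flops_per_cell / gpu_flops
    t_comm = cells ** ((d - 1) / d) * halo_bytes_per_cell / link_bytes_per_s
    return max(t_comp, t_comm)

for n in (8, 64, 512, 4096):
    print(n, elapsed_time(n))
```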
Phase-Field Model for Dendritic Solidification
TSUBAME 2.0 at Tokyo Tech (each node: 3 × NVIDIA Tesla M2050 GPUs, 2 × Intel Xeon X5670 CPUs)
Gordon Bell Prize 2011: Special Achievements in Scalability and Time-to-Solution
Implementation Difficulties
• Complex algorithms: FMM, H-matrices, ...
• Implementation on heterogeneous systems (GPU, MIC): memory hierarchy, communication hierarchy, fine-grained parallelism, programming.
• Many branch divergences: SPH, DEM, and AMR involve memory access with many branch divergences and heavy data movement.
• Conservative domain researchers: some domain researchers do not accept any changes to their source codes.
TSUBAME 2.5
Compute node (3 Tesla K20X GPUs): performance 4.08 TFLOPS, memory 58.0 GB (CPU) + 18 GB (GPU)
Rack (30 nodes): performance 122 TFLOPS, memory 2.28 TB
System (58 racks): 1442 nodes, 2952 CPU sockets, 4264 GPUs
Performance: 224.7 TFLOPS (CPU, with Turbo Boost), 5.562 PFLOPS (GPU); total 17.1 PFLOPS (single precision)
A Tsunami Simulation for a Wide Area (A Typical Stencil Application)
Shallow Water Solver

∂U/∂t + ∂F/∂x + ∂G/∂y = S

U = (h, hu, hv)ᵀ,  F = (hu, hu² + gh²/2, huv)ᵀ,  G = (hv, huv, hv² + gh²/2)ᵀ

Quasi-linear form (1-D): ∂U/∂t + A ∂U/∂x = 0, with A = ∂F/∂U.

Factorization into characteristic variables: ∂W/∂t + Λ ∂W/∂x = 0, where the Riemann invariants
w± = u ± 2√(gh)
are advected along the characteristics dx/dt = u ± √(gh).

Semi-Lagrangian scheme: trace each characteristic back over one time step and interpolate,
w±ⁿ⁺¹(x) = w±ⁿ(x − (u ± √(gh)) Δt),
then recover (h, hu)ⁿ⁺¹ from the updated invariants.

The solver is divided into 2 kernels: a characteristics method for the deep ocean, and FVM for the area including coastlines.
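The following 1-D Python toy (assumed grid, depth profile, and time step; not the production kernel) sketches how the characteristics method advances the invariants:

```python
import numpy as np

# The Riemann invariants w± = u ± 2*sqrt(g*h) are traced back along
# the characteristics dx/dt = u ± sqrt(g*h) and interpolated there.
g, dx, dt = 9.81, 1000.0, 2.0
x = np.arange(0.0, 1.0e6, dx)
h = 4000.0 + 10.0 * np.exp(-((x - 5.0e5) / 2.0e4) ** 2)  # depth + hump
u = np.zeros_like(x)

c = np.sqrt(g * h)
wp = np.interp(x - (u + c) * dt, x, u + 2 * c)   # w+ along dx/dt = u + c
wm = np.interp(x - (u - c) * dt, x, u - 2 * c)   # w- along dx/dt = u - c
u = 0.5 * (wp + wm)                              # recover velocity
h = (wp - wm) ** 2 / (16 * g)                    # recover water depth
```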
Inundation at Coastlines
Thin-film technique: cells at the land-sea boundary with small water thickness are treated as a thin film with zero velocity, which requires many exceptional treatments.
[Figure: sea surface profile across the land-sea boundary]
Global Tsunami Simulation
Inundation Validation: Comoro Island, Sri Lanka
Time-step Control (Sub-cycling)
[Figure: advancing from step n to step n+1 across levels L1–L4, with finer levels taking more sub-steps; see the sketch below]
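A toy sketch of the scheduling idea, assuming each finer level halves the time step:

```python
# Level L_k advances with dt / 2**k and takes 2**k sub-steps per
# global step, so every level lands exactly on t + dt at step n+1.
def global_step(levels, dt):
    for k, level in enumerate(levels):
        nsub = 2 ** k
        for _ in range(nsub):
            level["t"] += dt / nsub        # stand-in for the real update

levels = [{"t": 0.0} for _ in range(4)]    # L1..L4
global_step(levels, 1.0)
print([lv["t"] for lv in levels])          # all levels reach t = 1.0
```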
Gas-Liquid Two-Phase Flows
Two-phase flows, mesh method (surface capture), solving the Navier-Stokes equations. Fine meshes should be assigned around the gas-liquid interfaces.
■ Navier-Stokes solver: fractional step
■ Time integration: 3rd-order TVD Runge-Kutta
■ Advection term: 5th-order WENO
■ Diffusion term: 4th-order FD
■ Poisson: MG-BiCGStab
■ Surface tension: CSF model
■ Surface capture: CLSVOF (THINC + Level-Set)
Level-Set Method (LSM)
The Level-Set method uses the signed distance function φ (the Level-Set function) to capture the interface; the interface is represented by the zero-level set (zero contour), and a Heaviside function H(φ) marks the phases. The Level-Set function must be periodically re-initialized to remain a distance function.
Advantages: curvature calculation, interface boundary treatment. Drawback: volume conservation.
Fig.: Takehiro Himeno et al., JSME 65-635 B (1999), pp. 2333-2340.
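A small Python illustration of the representation, assuming a circular interface on a uniform grid (not from the talk):

```python
import numpy as np

# phi is a signed distance; the interface is the zero contour, and a
# smeared Heaviside marks the liquid phase for the CLSVOF coupling.
n = 128
eps = 1.5 / n                                     # interface half-width
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
phi = np.hypot(x - 0.5, y - 0.5) - 0.25           # signed distance
H = np.clip((phi + eps) / (2 * eps), 0.0, 1.0)    # smoothed Heaviside
```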
Sparse Matrix Solver
The pressure Poisson equation of the fractional-step method, ∇²p = (1/Δt) ∇·u*, is solved as Ax = b using BiCGStab with an MG preconditioner.
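For illustration only, a SciPy sketch of the same kind of linear solve (the talk's solver is a custom multi-GPU implementation; a real MG preconditioner would be supplied via bicgstab's M argument):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A 2-D 5-point Laplacian stands in for the pressure Poisson matrix.
n = 64
A = sp.diags([-1.0, -1.0, 4.0, -1.0, -1.0], [-n, -1, 0, 1, n],
             shape=(n * n, n * n), format="csr")
b = np.ones(n * n)
x, info = spla.bicgstab(A, b)
assert info == 0                      # converged
```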
A drop on the dry floor
Air Separator Device
Which is the experiment, and which is the simulation?
LBM Simulations for Turbulent Flow (Stencil Application)
LBM (Lattice Boltzmann Method)

Collision step: f_i*(x, t) = f_i(x, t) − (1/τ) [ f_i(x, t) − f_i^eq(x, t) ]
Streaming step: f_i(x + e_i Δt, t + Δt) = f_i*(x, t)

Equilibrium distribution:
f_i^eq = w_i ρ [ 1 + 3 (e_i·u)/c² + 9 (e_i·u)²/(2c⁴) − 3 |u|²/(2c²) ]

f_i is the value in the direction of the i-th discrete velocity; e_i is the discrete velocity set; w_i is the weighting factor; u is the macroscopic velocity.
[Figure: D3Q19 discrete velocity set, directions 1–18 plus the rest particle]
SRT: memory bound. MRT: compute intensive.
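A minimal D2Q9 SRT sketch in Python (the talk uses the 3-D D3Q15/19/27 sets; 2-D is shown here for brevity, with assumed toy parameters):

```python
import numpy as np

w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)
e = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])

def equilibrium(rho, u):
    eu = np.einsum('qd,xyd->qxy', e, u)              # e_i . u
    usq = np.einsum('xyd,xyd->xy', u, u)
    return w[:, None, None] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*usq)

def step(f, tau):
    rho = f.sum(axis=0)                              # density
    u = np.einsum('qd,qxy->xyd', e, f) / rho[..., None]
    f = f - (f - equilibrium(rho, u)) / tau          # collision (SRT)
    for q in range(9):                               # streaming (periodic)
        f[q] = np.roll(f[q], e[q], axis=(0, 1))
    return f

f = equilibrium(np.ones((64, 64)), np.zeros((64, 64, 2)))
f = step(f, tau=0.6)
```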
LES (Large-Eddy Simulation)
Molecular viscosity and eddy viscosity.
[Figure: energy spectrum divided into grid-scale (GS) and subgrid-scale (SGS) ranges]
The SGS model enters through the relaxation time for the LES model.
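A hedged sketch of one common way to do this (a standard Smagorinsky closure in lattice units; the constant c_s and filter width are assumptions, not values from the talk):

```python
# The eddy viscosity is added to the molecular one, then folded back
# into the SRT relaxation time via nu_total = (tau - 1/2) / 3.
def les_relaxation_time(nu_molecular, strain_rate_mag, c_s=0.1, delta=1.0):
    nu_eddy = (c_s * delta) ** 2 * strain_rate_mag   # Smagorinsky
    return 3.0 * (nu_molecular + nu_eddy) + 0.5
```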
Performance
[Figure: performance in TFLOPS vs. number of GPUs — strong and weak scalability on TSUBAME 2.0 and TSUBAME 2.5]
Granular Material Simulations (Particle Simulation)
DEM (Discrete Element Method)
Contact interaction:
• Normal direction: spring + viscosity (dashpot)
• Tangential direction: spring + viscosity + friction
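A minimal sketch of the normal-direction contact force (a linear spring-dashpot; the constants k and c are placeholders, not the talk's values):

```python
import numpy as np

def normal_contact_force(xi, xj, vi, vj, radius, k=1e4, c=5.0):
    """Force on particle i from contact with particle j."""
    d = xj - xi
    dist = np.linalg.norm(d)
    overlap = 2 * radius - dist
    if overlap <= 0:
        return np.zeros(3)                 # no contact
    n = d / dist                           # contact normal (i -> j)
    vn = np.dot(vi - vj, n)                # approach speed along n
    return -(k * overlap + c * vn) * n     # spring repulsion + damping
```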
Particle Distribution
• The load imbalance problem among subdomains: some subdomains contain many particles, others no particles at all.
Bunker Shot Simulation
Golf bunker shot using 16.7 million particles with 64 GPUs
Complex Shape Objects
Non-spherical model described by rigidly connected sphere particles.
[Equations: explicit updates of each body's translational velocity, angular velocity, position, and orientation from the contact forces and torques summed over its constituent spheres]
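A hedged Python sketch of the update step such a model implies (the names and the explicit scheme are assumptions, not the talk's exact formulas):

```python
import numpy as np

def advance_body(xc, v, omega, member_pos, member_forces, m, I_inv, dt):
    """Reduce member-sphere contact forces to a net force/torque,
    then advance the rigid body state with an explicit step."""
    F = member_forces.sum(axis=0)                             # net force
    T = np.cross(member_pos - xc, member_forces).sum(axis=0)  # net torque
    v = v + dt * F / m                    # translational velocity
    omega = omega + dt * (I_inv @ T)      # angular velocity
    xc = xc + dt * v                      # center-of-mass position
    return xc, v, omega
```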
SPH (Smoothed Particle Hydrodynamics)
Short-range interaction: each particle carries its physical properties, and interactions among particles within the kernel radius are calculated at every time step.
Kernel function W with kernel radius h, normalized so that ∫ W(x − x′, h) dV′ = 1.
Discretization of a physical property A:
A(x) ≈ Σ_j (m_j / ρ_j) A_j W(x − x_j, h)
(m_j: mass, ρ_j: density of particle j)
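A minimal Python sketch of this interpolation, assuming the standard 3-D cubic-spline kernel (the talk does not specify which kernel is used):

```python
import numpy as np

def kernel(r, h):
    """3-D cubic spline with support radius 2h."""
    q = r / h
    sigma = 1.0 / (np.pi * h ** 3)
    return np.where(q < 1.0, sigma * (1 - 1.5 * q**2 + 0.75 * q**3),
           np.where(q < 2.0, sigma * 0.25 * (2.0 - q) ** 3, 0.0))

def interpolate(x, positions, A, m, rho, h):
    """A(x) ~= sum_j (m_j / rho_j) A_j W(|x - x_j|, h)."""
    r = np.linalg.norm(positions - x, axis=1)
    return np.sum(m / rho * A * kernel(r, h))
```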
Improved SPH: Generalization of Finite Difference Operators (Imoto, Tagami 2014)
First derivatives are approximated by generalized finite-difference operators built on the kernel function with kernel radius h; about 100 particles lie within the kernel.
A Dam Break Simulation
72 M particles with 80 GPUs
Multiple GPU Scalability
• Conditions
Particles: 2 × 10^6, 1.6 × 10^7, 1.29 × 10^8
Domain decomposition: dynamic load balance using the slice-grid method
Time integration: 2-stage Runge-Kutta
Sub-domains with High Aspect Ratio
• Passive particles advected by a vortex velocity field using 64 GPUs, decomposed with the slice-grid method; a sketch of the slicing follows.
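A minimal sketch of the slice-grid idea: cut positions along one axis are chosen so each slice holds an equal share of the particles, and dynamic rebalancing recomputes the cuts as particles move.

```python
import numpy as np

def slice_boundaries(x_positions, n_dom):
    """Interior cut positions giving n_dom equal-count slices."""
    q = np.linspace(0, 100, n_dom + 1)[1:-1]   # interior percentiles
    return np.percentile(x_positions, q)

x = np.random.rand(100_000) ** 2               # clustered distribution
print(slice_boundaries(x, 8))                  # 7 interior cut positions
```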
Space-Filling Curves
A tree is constructed so that each leaf contains a minimum number of particles (tree-based cell refinement), and each leaf is numbered by following the rules of a space-filling curve.
[Figure: Peano (9 cells), Hilbert (4 cells), and Morton (4 cells) refinements]
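As a small illustration, Morton (Z-order) keys are produced by interleaving the bits of the cell coordinates, so nearby cells tend to get nearby keys:

```python
def morton2d(ix, iy, bits=16):
    """Interleave the bits of (ix, iy) into one Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

print(morton2d(3, 5))   # -> 39
```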
Space-Filling Curves
[Figure: the domain decomposition by the Peano curve, and the connections of leaves along the Peano curve]
Scalability on TSUBAME 2.5
[Figure: scalability curves comparing 3 different space-filling curves (Peano, Hilbert, Morton) with the slice-grid method]
Fluid-Structure Interaction (rigid body)
All the objects and the fluid are represented by particles of the same size, and 4 types of interactions among particles are calculated.
1. Interactions among fluid particles: improved SPH (Imoto, Tagami 2014), with kernel radius h, kernel function w_h, and a first-derivative operator.
Tsunami Debris Flow
Including a lot of floating objects; object shapes are represented by converting CAD data into particles.
Computational conditions:
• Simulation area: 180 m × 160 m × 20 m
• Total particles: 117,561,285
• Particles / object: 19 ~ 472
• Objects: 10,368
• Physical time: 10.0 s
• Steps: 20,000
• GPUs: 256
• Simulation time: 100 h
• Initial wave height: 10.0 m
[Figure: initial state of each object type, with per-type object counts]
Large-Scale Computation on TSUBAME 2.5
Dynamic Domain Decomposition
• Using the improved SPH with 87 million particles on 256 GPUs
Hybrid Fortran Framework (directive-based)
1. Current main target is the JMA production weather model ASUCA.
2. Code generation by Python for accelerator programming:
a. Fixed storage order (GPGPUs need warp-first order)
b. Coarse-grained parallelism (accelerators need fine-grained)
Approach   | Examples        | Pros                          | Cons                                   | Effect
Libraries  | ArrayFire, BLAS | Optimization comes 'for free' | Limited control, limited use           | High porting effort; relearning of practices
Directives | OpenMP, OpenACC | Generally applicable          | No satisfying standard to solve 2a, 2b | High porting effort; relearning of practices; no performance portability
The idea behind Hybrid Fortran: a new set of directives supports flexible storage and loop orders (solving 2a and 2b). Goal: a unified codebase that is performant on both CPUs and accelerators.
Code transformation with different backend implementations for different architectures and kernels:
• OpenMP Fortran backend: coarse-grained parallelization for CPU
• OpenACC Fortran backend: fine-grained parallelization for GPU
• CUDA Fortran backend: kernels rewritten in 3D, fine-grained parallelization for GPU
A toy sketch of the storage/loop-order generation idea follows.
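For flavor only (this is not Hybrid Fortran's actual generator), a toy Python emitter that produces the same Fortran loop nest in two storage orders:

```python
# One kernel body, two loop/storage orders.
BODY = "a({idx}) = b({idx}) + c({idx})"

def emit(order):
    idx = ",".join(order)
    opens = "\n".join(f"do {v} = 1, n{v}" for v in reversed(order))
    closes = "\n".join("end do" for _ in order)
    return f"{opens}\n  {BODY.format(idx=idx)}\n{closes}"

print(emit(("i", "j", "k")))   # IJK order: i innermost, stride-1 on CPU
print(emit(("k", "i", "j")))   # KIJ order
```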
ASUCA on Hybrid Fortran, now:
[Figure: ASUCA call tree (arashi; long-dt path: diag, physics, Max/Min/Ave output, Runge-Kutta, diag adjust, physics adjust, physics RK, dynamics RK; short-dt path: sediment diagnose RK, dynamics RK; physics components: monitflux, radiation, convection, PBL/surface, microphysics), with ported and not-yet-ported modules marked]
Tests passed: the Rad, Gabls3, and Warmbubble cases, each on CPU and GPU, in both KIJ and IJK storage orders.
Integration is currently ongoing.
Summary
• Stencil applications with explicit time integration are scalable beyond the peta-scale.
• Particle simulations based on short-range interactions with explicit time integration are also scalable beyond the peta-scale.
• Interconnect performance at the exa-scale remains an open question.
• Sparse matrix solvers for stiff problems are strongly needed.
• Temporal blocking, communication avoiding, and similar techniques should be included, but they will not bring drastic improvements.
• For frameworks and DSLs, considerable effort is required before they are accepted by research communities.