Accelerating Gyrokinetic Tokamak Simulation (GTS) Code using OpenACC
M.G. Yoo, C.H. Ma, S. Ethier, Jin Chen, W.X. Wang, E. Startsev
Princeton Plasma Physics Laboratory, Princeton, U.S.A.
OpenACC Summit 2020
Sep. 01. 2020
Overview of porting GTS to GPU

GTS (Gyrokinetic Tokamak Simulation)
• A global gyrokinetic particle simulation code for micro-turbulence studies in tokamaks
• Particle-In-Cell algorithm (particles + grid-based field solve)
• Recently upgraded for physics studies associated with thermal-quench transport

Porting to the GPU machine
• OpenACC directives port the code to the GPU while keeping compatibility with CPU machines
- Significant speed-up (>20x) for the particle parts
• The field-solve part (Poisson equation) will be ported to GPU via libraries (e.g. PETSc, Hypre, AmgX)
• GTS is now running production simulations on Traverse with significant acceleration, making efficient use of the Traverse computational resources
• Attended the GPU Hackathon in 2019 at Princeton University (Mentor: Reuben Budiardja (ORNL))
GTS (Gyrokinetic Tokamak Simulation)
❑ A global gyrokinetic particle simulation code to study the micro-turbulence physics of fusion plasmas in tokamaks
- Mainly written in Fortran, partly in C
- Parallelized using MPI + OpenMP (previously); now using MPI + OpenACC
- 𝛿𝑓 particle-in-cell code in 3-dimensional curvilinear coordinates
Gyrokinetic Equation
▪ The gyrokinetic equation for the particle distribution in 5-dimensional phase space
- a (or ρ): radial coordinate (a flux-surface label)
- θ and ϕ: poloidal and toroidal angles
- v∥: parallel velocity
- μ = mₛv⊥²/(2B): magnetic moment
- B* = B + (mₛv∥/eₛ) b · ∇×b
- fₛ: gyro-center distribution function
[G. Hunt, Ph.D. Thesis, University of Leicester, 2016]
Gyrokinetic Poisson Equation
▪ Electron and ion densities from the distribution function
▪ Quasi-neutrality and the gyrokinetic Poisson equation
[Dubin et al., Phys. Fluids 26, 3524 (1983)]
Particle-In-Cell Method

▪ The distribution function $f_s$ is represented by marker particles:

$$f_s \approx \sum_{i=1}^{N_M} w_i \, \frac{\delta(x - x_i)\,\delta(v - v_i)}{J(x_i)}$$

▪ Equation of motion:

$$\mathbf{v} = \frac{1}{\mathbf{B}_0 \cdot (\mathbf{B}_0^* + \delta\mathbf{B})} \left[ \frac{1}{q_s}\frac{\partial H_0}{\partial \rho_\parallel}\,(\mathbf{B}_0^* + \delta\mathbf{B}) + \frac{1}{q_s}\,\mathbf{B}_0 \times \nabla H_0 \right]$$

$$\frac{d\rho_\parallel}{dt} = \frac{\mathbf{B}_0^* + \delta\mathbf{B}}{\mathbf{B}_0 \cdot (\mathbf{B}_0^* + \delta\mathbf{B})} \cdot \left( -\frac{1}{q_s}\nabla H_0 \right)$$

with

$$\frac{1}{q_s}\frac{\partial H_0}{\partial \rho_\parallel} = \mathbf{B}_0 \cdot \mathbf{v} = v_\parallel^0 B_0, \qquad \nabla H_0 = \left( \frac{m_s (v_\parallel^0)^2}{B_0} + \mu \right) \nabla B_0 + q_s \nabla \bar{\Phi}, \qquad \rho_\parallel = \frac{m_s v_\parallel^0}{q_s B_0}$$

▪ $\delta f$ weight evolution equation:

$$f_{\mathrm{tot}} = f_0 + \delta f, \qquad \frac{d f_{\mathrm{tot}}}{dt} = \frac{\partial f_{\mathrm{tot}}}{\partial t} + \dot{\mathbf{Z}} \cdot \nabla_{\mathbf{Z}} f_{\mathrm{tot}} = 0$$

$$\frac{d\,\delta f}{dt} = -\frac{d f_0}{dt} = -\dot{\mathbf{Z}} \cdot \nabla_{\mathbf{Z}} f_0 = -(\dot{\mathbf{Z}}_0 + \dot{\mathbf{Z}}_1) \cdot \nabla_{\mathbf{Z}} f_0$$
Particle-In-Cell Method

▪ Move particles to new positions (particles only)
▪ Charge deposition to grid nodes (particles to grid)
▪ Assigning fields to particles (grid to particles)
▪ Update potential & electric fields (grid only)
[Diagram: the four PIC stages in a cycle, with the five-point stencil of potentials $\phi_{i,j}^{n+1}$, $\phi_{i\pm1,j}^{n+1}$, $\phi_{i,j\pm1}^{n+1}$ illustrating the grid-based field update]
Acceleration of Particle Parts by OpenACC

▪ Assigning fields to particles (grid to particles)
➢ Particle arrays are global variables and are created on the device side to reduce communication time
➢ Particles are independent of each other, and the work is a single-level loop
➢ A simple acc parallel directive is very efficient for a huge number of marker particles (𝑁𝑀 > 10⁶)
▪ Particle parts are easily and efficiently parallelized by OpenACC
Acceleration of Particle Parts by OpenACC

▪ Charge deposition to grid nodes (particles to grid)
▪ Move particles to new positions (particles only)
Gyrokinetic Poisson Equation

$$-\nabla \cdot \left[ \left( \epsilon_0\,\overline{\overline{\mathbf{g}}} + \sum_s \frac{n_s m_s}{B^2} \left( \overline{\overline{\mathbf{g}}} - \frac{\mathbf{B}\mathbf{B}}{B^2} \right) \right) \cdot \nabla \Phi \right] = e\left( \delta \bar{n}_i - \delta n_e \right)$$
▪ Coupled (2D+1D) field equations on an unstructured mesh (FEM), solved iteratively
▪ A direct 3D field equation on a structured mesh (FDM)
▪ An Algebraic Multigrid solver is used to invert the linearized equations 𝑨𝒙 = 𝒃
- BoomerAMG method from the Hypre library, through the PETSc interface
- (CPU version) vs (CUDA version)
Traverse Cluster at Princeton University

Traverse consists of:
• 46 IBM AC922 POWER9 nodes, each node having
– 2 IBM POWER9 processors (sockets)
• 16 cores per processor
• 4 hardware threads per core
– 32 cores per node
– 256 GB of RAM per node
– 4 NVIDIA V100 GPUs (2 per socket) with 32 GB of memory each
– 3.2 TB NVMe (solid-state) local storage (not shared between nodes)
– EDR InfiniBand, 1:1 per rack, 2:1 rack-to-rack interconnect
– GPFS high-performance parallel scratch storage: 2.9 PB raw
– Globus transfer node (10 GbE external, EDR to storage)
• InfiniBand network
– EDR InfiniBand (100 Gb/s)
– Fully non-blocking (1:1) within a chassis, 2:1 oversubscription between chassis
Elapsed Time

[Bar chart: elapsed time in seconds on a log scale (0.1–1000) for four configurations: 4 CPU (mz4np1), 32 CPU (mz4np8), 4 CPU + 4 GPU (mz4np1), 32 CPU + 4 GPU (mz4np8). Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200. Lower is better.]
Speed-Up Factor

[Bar chart: speed-up for 4 CPU (mz4np1), 32 CPU (mz4np8), 4 CPU + 4 GPU (mz4np1), 32 CPU + 4 GPU (mz4np8); annotated factors of x18.8, x6.2, and x5.5. Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200. Higher is better.]
Full Power of 1 Node on Traverse

[Two pie charts of per-routine time share:]

32 CPU: push_ion 11%, shifti 1%, charge_ion 12%, poisson 24%, smooth 0%, field 1%, collision_ion 8%, collision_eon 7%, push_eon 22%, shifte 2%, charge_eon 9%, check_tracers 2%, snapshot 0%, diagnostics 1%

32 CPU + 4 GPU: push_ion 2%, shifti 1%, charge_ion 4%, poisson 66%, smooth 1%, field 2%, collision_ion 2%, collision_eon 2%, push_eon 5%, shifte 4%, charge_eon 3%, check_tracers 6%, snapshot 0%, diagnostics 2%
Performance Comparison (original Poisson vs new 3D Poisson)

[Bar chart: elapsed time per step (0–7 s) for three cases: ori. Tol=1e-3, ori. Tol=1e-5, New Poisson.]
Breakdown of Elapsed Time

[Three pie charts of per-routine time share: Ori. Poisson (Tol=1e-3), Ori. Poisson (Tol=1e-5), New Poisson 3D. Labeled breakdown of the first panel: push_ion 3%, shifti 2%, charge_ion 4%, poisson 44%, smooth 1%, field 3%, collision_ion 3%, collision_eon 4%, push_eon 8%, shifte 7%, charge_eon 4%, check_tracers 2%, snapshot 10%, diagnostics 5%. Slice labels of the other two panels are not legible; their largest slices are 71% and 29%.]
Algebraic Multigrid Solver: (HYPRE) vs (GAMG)

[Bar chart: Poisson time per istep in seconds (0–16) for HYPRE, HYPRE cuda, GAMG, GAMG cuda. Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200; 1 node, 4 MPI ranks, mzetamax=4, npartdom=1; 1 CPU / 1 GPU.]
HYPRE vs GAMG in Production Run case

[Bar chart: time in seconds (0–45) for Pure CPU + Hypre, GPU + Hypre, GPU + cuda Hypre, GPU + GAMG, GPU + cuda GAMG. Case: micell=100, mpsi=150, MGRID=113,185×16, MISUM=180,854,400; 4 nodes, 128 MPI ranks, mzetamax=16, npartdom=8; 4 CPU / 1 GPU.]
Summary

GTS (Gyrokinetic Tokamak Simulation)
• A global gyrokinetic particle simulation code for micro-turbulence studies in tokamaks
• Particle-In-Cell algorithm (particles + grid-based field solve)
• Recently upgraded for physics studies associated with thermal-quench transport

Porting to the GPU machine
• OpenACC directives port the code to the GPU while keeping compatibility with CPU machines
- Significant speed-up (>20x) for the particle parts
• The field-solve part (Poisson equation) will be ported to GPU via libraries (e.g. PETSc, Hypre, AmgX)
• GTS is now running production simulations on Traverse with significant acceleration, making efficient use of the Traverse computational resources
• Attended the GPU Hackathon in 2019 at Princeton University (Mentor: Reuben Budiardja (ORNL))