Accelerating Gyrokinetic Tokamak Simulation (GTS) Code using OpenACC
M.G. Yoo, C.H. Ma, S. Ethier, Jin Chen, W.X. Wang, E. Startsev
Princeton Plasma Physics Laboratory, Princeton, U.S.A.
OpenACC Summit 2020
Sep. 01. 2020
Overview of porting GTS to GPU

GTS (Gyrokinetic Tokamak Simulation)
• A global gyrokinetic particle simulation code for micro-turbulence studies in tokamaks
• Particle-In-Cell algorithm (particles + grid-based field solve)
• Recently upgraded for physics studies associated with thermal-quench transport

Porting to the GPU machine
• OpenACC directives port the code to the GPU while keeping compatibility with CPU machines
- Significant speed-up (>20x) for the particle parts
• The field-solve part (Poisson equation) will be ported to GPU via libraries (e.g. PETSc, Hypre, AmgX)
• GTS is now running production simulations on Traverse with significant acceleration, making efficient use of the Traverse computational resources
• Attended the GPU Hackathon in 2019 at Princeton University (Mentor: Reuben Budiardja (ORNL))
GTS (Gyrokinetic Tokamak Simulation)
❑ A global gyrokinetic particle simulation code to study the micro-turbulence physics of fusion plasmas in tokamaks
- Mainly written in Fortran, partly in C
- Parallelized using MPI + OpenMP (previously); now using MPI + OpenACC
- 𝛿𝑓 particle-in-cell code in 3-dimensional curvilinear coordinates
Gyrokinetic Equation
▪ The gyrokinetic equation for the particle distribution in 5-dimensional phase space
- a (or ρ): radial coordinate (a flux-surface label)
- θ and ϕ: poloidal and toroidal angles
- v∥: parallel velocity
- μ = mₛv⊥²/(2B): magnetic moment
- B* = B + (mₛv∥/eₛ) b · ∇×b
- fₛ: gyro-center distribution function
[G. Hunt, Ph.D. Thesis, University of Leicester, 2016]
Gyrokinetic Poisson Equation
▪ Electron and ion densities from the distribution function
▪ Quasi-neutrality and the gyrokinetic Poisson equation
[Dubin et al., Phys. Fluids 26, 3524 (1983)]
Particle-In-Cell Method

▪ The distribution function $f_s$ is represented by marker particles:

$$f_s \approx \sum_{i=1}^{N_M} w_i \, \frac{\delta(x - x_i)\,\delta(v - v_i)}{J(x_i)}$$

▪ Equation of motion:

$$\mathbf{v} = \frac{1}{\mathbf{B}_0 \cdot (\mathbf{B}_0^* + \delta\mathbf{B})} \left[ \frac{1}{q_s}\frac{\partial H_0}{\partial \rho_\parallel}\,(\mathbf{B}_0^* + \delta\mathbf{B}) + \frac{1}{q_s}\,\mathbf{B}_0 \times \nabla H_0 \right]$$

$$\frac{d\rho_\parallel}{dt} = \frac{\mathbf{B}_0^* + \delta\mathbf{B}}{\mathbf{B}_0 \cdot (\mathbf{B}_0^* + \delta\mathbf{B})} \cdot \left( -\frac{1}{q_s}\nabla H_0 \right)$$

with

$$\frac{1}{q_s}\frac{\partial H_0}{\partial \rho_\parallel} = \mathbf{B}_0 \cdot \mathbf{v} = v_\parallel^0 B_0, \qquad \nabla H_0 = \left( \frac{m_s (v_\parallel^0)^2}{B_0} + \mu \right) \nabla B_0 + q_s \nabla \bar{\Phi}, \qquad \rho_\parallel = \frac{m_s v_\parallel^0}{q_s B_0}$$

▪ $\delta f$ weight evolution equation:

$$f_{\mathrm{tot}} = f_0 + \delta f, \qquad \frac{d f_{\mathrm{tot}}}{dt} = \frac{\partial f_{\mathrm{tot}}}{\partial t} + \dot{\mathbf{Z}} \cdot \nabla_{\mathbf{Z}} f_{\mathrm{tot}} = 0$$

$$\frac{d\,\delta f}{dt} = -\frac{d f_0}{dt} = -\dot{\mathbf{Z}} \cdot \nabla_{\mathbf{Z}} f_0 = -(\dot{\mathbf{Z}}_0 + \dot{\mathbf{Z}}_1) \cdot \nabla_{\mathbf{Z}} f_0$$
Particle-In-Cell Method

▪ Move particles to new positions (particles only)
▪ Charge deposition to grid nodes (particles to grid)
▪ Assigning fields to particles (grid to particles)
▪ Update potential & electric fields (grid only)
[Diagram: the four PIC stages in a cycle, with the five-point stencil of potentials $\phi_{i,j}^{n+1}$, $\phi_{i\pm1,j}^{n+1}$, $\phi_{i,j\pm1}^{n+1}$ illustrating the grid-based field update]
Acceleration of Particle Parts by OpenACC

▪ Assigning fields to particles (grid to particles)
➢ Particle arrays are global variables and are created on the device side to reduce communication time
➢ Particles are independent of each other, and the work is a single-level loop
➢ A simple acc parallel directive is very efficient for a huge number of marker particles (𝑁𝑀 > 10⁶)
▪ Particle parts are easily and efficiently parallelized by OpenACC
Acceleration of Particle Parts by OpenACC

▪ Charge deposition to grid nodes (particles to grid)
▪ Move particles to new positions (particles only)
Gyrokinetic Poisson Equation

$$-\nabla \cdot \left[ \left( \epsilon_0\,\overline{\overline{\mathbf{g}}} + \sum_s \frac{n_s m_s}{B^2} \left( \overline{\overline{\mathbf{g}}} - \frac{\mathbf{B}\mathbf{B}}{B^2} \right) \right) \cdot \nabla \Phi \right] = e\left( \delta \bar{n}_i - \delta n_e \right)$$
▪ Coupled (2D+1D) field equations on an unstructured mesh (FEM), solved iteratively
▪ A direct 3D field equation on a structured mesh (FDM)
▪ An Algebraic Multigrid solver is used to invert the linearized equations 𝑨𝒙 = 𝒃
- BoomerAMG method from the Hypre library, through the PETSc interface
- (CPU version) vs (CUDA version)
Traverse Cluster at Princeton University

Traverse consists of:
• 46 IBM AC922 POWER9 nodes, each node having
– 2 IBM POWER9 processors (sockets)
• 16 cores per processor
• 4 hardware threads per core
– 32 cores per node
– 256 GB of RAM per node
– 4 NVIDIA V100 GPUs (2 per socket) with 32 GB of memory each
– 3.2 TB NVMe (solid-state) local storage (not shared between nodes)
– EDR InfiniBand, 1:1 per rack, 2:1 rack-to-rack interconnect
– GPFS high-performance parallel scratch storage: 2.9 PB raw
– Globus transfer node (10 GbE external, EDR to storage)
• InfiniBand network
– EDR InfiniBand (100 Gb/s)
– Fully non-blocking (1:1) within a chassis, 2:1 oversubscription between chassis
Elapsed Time

[Bar chart: elapsed time in seconds on a log scale (0.1–1000) for four configurations: 4 CPU (mz4np1), 32 CPU (mz4np8), 4 CPU + 4 GPU (mz4np1), 32 CPU + 4 GPU (mz4np8). Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200. Lower is better.]
Speed-Up Factor

[Bar chart: speed-up for 4 CPU (mz4np1), 32 CPU (mz4np8), 4 CPU + 4 GPU (mz4np1), 32 CPU + 4 GPU (mz4np8); annotated factors of x18.8, x6.2, and x5.5. Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200. Higher is better.]
Full Power of 1 Node on Traverse

[Two pie charts of per-routine time share:]

32 CPU: push_ion 11%, shifti 1%, charge_ion 12%, poisson 24%, smooth 0%, field 1%, collision_ion 8%, collision_eon 7%, push_eon 22%, shifte 2%, charge_eon 9%, check_tracers 2%, snapshot 0%, diagnostics 1%

32 CPU + 4 GPU: push_ion 2%, shifti 1%, charge_ion 4%, poisson 66%, smooth 1%, field 2%, collision_ion 2%, collision_eon 2%, push_eon 5%, shifte 4%, charge_eon 3%, check_tracers 6%, snapshot 0%, diagnostics 2%
Performance Comparison (original Poisson vs new 3D Poisson)

[Bar chart: elapsed time per step (0–7 s) for three cases: ori. Tol=1e-3, ori. Tol=1e-5, New Poisson.]
Breakdown of Elapsed Time

[Three pie charts of per-routine time share: Ori. Poisson (Tol=1e-3), Ori. Poisson (Tol=1e-5), New Poisson 3D. Labeled breakdown of the first panel: push_ion 3%, shifti 2%, charge_ion 4%, poisson 44%, smooth 1%, field 3%, collision_ion 3%, collision_eon 4%, push_eon 8%, shifte 7%, charge_eon 4%, check_tracers 2%, snapshot 10%, diagnostics 5%. Slice labels of the other two panels are not legible; their largest slices are 71% and 29%.]
Algebraic Multigrid Solver: (HYPRE) vs (GAMG)

[Bar chart: Poisson time per istep in seconds (0–16) for HYPRE, HYPRE cuda, GAMG, GAMG cuda. Case: micell=50, mpsi=100, MGRID=45,137, MISUM=9,007,200; 1 node, 4 MPI ranks, mzetamax=4, npartdom=1; 1 CPU / 1 GPU.]
HYPRE vs GAMG in Production Run case

[Bar chart: time in seconds (0–45) for Pure CPU + Hypre, GPU + Hypre, GPU + cuda Hypre, GPU + GAMG, GPU + cuda GAMG. Case: micell=100, mpsi=150, MGRID=113,185×16, MISUM=180,854,400; 4 nodes, 128 MPI ranks, mzetamax=16, npartdom=8; 4 CPU / 1 GPU.]
Summary

GTS (Gyrokinetic Tokamak Simulation)
• A global gyrokinetic particle simulation code for micro-turbulence studies in tokamaks
• Particle-In-Cell algorithm (particles + grid-based field solve)
• Recently upgraded for physics studies associated with thermal-quench transport

Porting to the GPU machine
• OpenACC directives port the code to the GPU while keeping compatibility with CPU machines
- Significant speed-up (>20x) for the particle parts
• The field-solve part (Poisson equation) will be ported to GPU via libraries (e.g. PETSc, Hypre, AmgX)
• GTS is now running production simulations on Traverse with significant acceleration, making efficient use of the Traverse computational resources
• Attended the GPU Hackathon in 2019 at Princeton University (Mentor: Reuben Budiardja (ORNL))