TRANSCRIPT
Peta-Scale Simulations with the HPC Software Framework waLBerla: Massively Parallel AMR for the Lattice Boltzmann Method
SIAM PP 2016, Paris
April 15, 2016
Florian Schornbaum, Christian Godenschwager, Martin Bauer, Ulrich Rüde
Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
• Introduction
  • The waLBerla Simulation Framework
  • An Example Using the Lattice Boltzmann Method
• Parallelization Concepts
  • Domain Partitioning & Data Handling
• Dynamic Domain Repartitioning
  • AMR Challenges
  • Distributed Repartitioning Procedure
  • Dynamic Load Balancing
  • Benchmarks / Performance Evaluation
• Conclusion
Introduction
• The waLBerla Simulation Framework
• An Example Using the Lattice Boltzmann Method
• waLBerla (widely applicable Lattice Boltzmann framework from Erlangen):
  • main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM) (now also implementations of other methods, e.g., phase field)
  • at its very core designed as an HPC software framework:
    • scales from laptops to current petascale supercomputers
    • largest simulation: 1,835,008 processes (IBM Blue Gene/Q @ Jülich)
    • hybrid parallelization: MPI + OpenMP
    • vectorization of compute kernels
  • written in C++(11), growing Python interface
  • support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, LLVM/Clang, IBM XL)
  • automated build and test system
• AMR for the LBM – example (vocal fold phantom geometry)
  • DNS (direct numerical simulation)
  • Reynolds number: 2500 / D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells / 1 ↔ 5 levels
  • without refinement: 311 times more memory … and 701 times the workload
Parallelization Concepts
• Domain Partitioning & Data Handling
[Figure: the simulation domain occupies only part of the bounding box; the domain is partitioned into blocks, empty blocks are discarded, and static block-level refinement is applied]
octree partitioning within every block of the initial partitioning
(→ forest of octrees)
static block-level refinement (→ forest of octrees) → static load balancing → allocation of block data (→ grids)
load balancing can be based on either space-filling curves (Morton or Hilbert order) using the underlying forest of octrees, or graph partitioning (METIS, …), whatever best fits the needs of the simulation (a minimal curve-based sketch follows below)
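The curve-based variant can be pictured as follows: blocks are ordered along the space-filling curve and the curve is cut into one contiguous chunk per process. A minimal C++ sketch of this idea (the names Block, morton3D, and balanceAlongCurve are hypothetical and not part of the waLBerla API; per-block weights are assumed to be given):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Interleave the bits of three 21-bit block coordinates into a Morton code.
std::uint64_t morton3D(std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    auto spread = [](std::uint64_t v) {            // insert two 0 bits after every bit
        v &= 0x1FFFFFULL;
        v = (v | (v << 32)) & 0x001F00000000FFFFULL;
        v = (v | (v << 16)) & 0x001F0000FF0000FFULL;
        v = (v | (v << 8))  & 0x100F00F00F00F00FULL;
        v = (v | (v << 4))  & 0x10C30C30C30C30C3ULL;
        v = (v | (v << 2))  & 0x1249249249249249ULL;
        return v;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}

struct Block { std::uint32_t x, y, z; double weight; int targetProcess; };

// Sort blocks along the Morton curve, then cut the curve into numProcesses
// contiguous chunks of (approximately) equal accumulated weight.
void balanceAlongCurve(std::vector<Block>& blocks, int numProcesses)
{
    std::sort(blocks.begin(), blocks.end(), [](const Block& a, const Block& b) {
        return morton3D(a.x, a.y, a.z) < morton3D(b.x, b.y, b.z);
    });
    double total = 0.0;
    for (const Block& b : blocks) total += b.weight;
    if (total <= 0.0) total = 1.0;                 // guard for the sketch
    double accumulated = 0.0;
    for (Block& b : blocks) {
        b.targetProcess = std::min(numProcesses - 1,
                                   static_cast<int>(accumulated / total * numProcesses));
        accumulated += b.weight;
    }
}
```

With equal weights this simply assigns equally sized, curve-contiguous ranges of blocks to the processes; a Hilbert ordering would only change the comparison key.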
separation of domain partitioning from simulation (optional): the partitioning produced by static refinement and load balancing can be written to DISK and read back via compact (KiB/MiB) binary MPI IO
data & data structure stored perfectly distributed
→ no replication of (meta) data!
all parts customizable via callback functions in order to adapt to the underlying simulation (a hypothetical sketch follows below):
1) discarding of blocks
2) (iterative) refinement of blocks
3) load balancing
4) block data allocation†
† support for arbitrary number of block data items (each of arbitrary type)
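To illustrate what "customizable via callback functions" can mean in practice, here is a hypothetical sketch (type and member names are made up for illustration and do not reflect the actual waLBerla setup API): each of the four steps above is a user-supplied function that the setup pipeline invokes per block.

```cpp
#include <functional>
#include <vector>

// Hypothetical handles (illustration only).
struct SetupBlock { int level; /* bounding box, block ID, ... */ };
struct Block      { /* a "real" block holding the attached data items (grids, ...) */ };

struct SetupCallbacks
{
    // 1) discard blocks that lie outside the actual simulation domain
    std::function<bool(const SetupBlock&)> discardBlock;
    // 2) decide (possibly iteratively) which blocks to refine further
    std::function<bool(const SetupBlock&)> refineBlock;
    // 3) assign every block to a process (space-filling curve, METIS, ...)
    std::function<void(std::vector<SetupBlock>&, int /*numProcesses*/)> balance;
    // 4) allocate the block data once the partitioning is fixed
    //    (one callback per registered block data item in the real framework)
    std::function<void(Block&)> allocateBlockData;
};

// usage sketch: refine everything up to (hypothetical) level 3
// SetupCallbacks callbacks;
// callbacks.refineBlock = [](const SetupBlock& b) { return b.level < 3; };
```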
different "views" on / representations of the domain partitioning:
• forest of octrees: octrees are not explicitly stored, but implicitly defined via block IDs (a small encoding sketch follows this list)
• 2:1 balanced grid (used for the LBM on refined grids)
• distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
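The implicit encoding can be sketched as follows (a simplified illustration: a marker bit followed by three bits per descent step; the actual waLBerla block IDs additionally encode the root block of the forest and may differ in detail):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical ID layout: a leading marker bit, then 3 bits per descent step
// that select one of the eight children chosen on the way down the octree.
using BlockId = std::uint64_t;

constexpr BlockId kRoot = 1;                         // marker bit only: a root block

inline BlockId child(BlockId id, unsigned octant)    // descend one level
{
    assert(octant < 8);
    return (id << 3) | octant;
}

inline BlockId parent(BlockId id)                    // ascend one level
{
    assert(id != kRoot);
    return id >> 3;
}

inline unsigned level(BlockId id)                    // number of descent steps
{
    unsigned l = 0;
    while (id != kRoot) { id >>= 3; ++l; }
    return l;
}

// Example: child(child(kRoot, 2), 5) == 0b1'010'101 with level 2,
// and parent(0b1'010'101) == 0b1'010 -- no octree nodes are ever stored.
```

Because parent and child are simple shift operations on the ID, refinement relations can be evaluated without ever materializing the octree.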
our parallel implementation [1] of local grid refinement for the LBM based on [2] shows excellent performance:
• simulations with in total close to one trillion cells
• close to one trillion cells updated per second (with 1.8 million threads)
• strong scaling: more than 1000 time steps / sec. → 1 ms per time step
[1] F. Schornbaum and U. Rüde, Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-Uniform Grids, SIAM Journal on Scientific Computing (accepted for publication) [http://arxiv.org/abs/1508.07982]
[2] M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, International Journal for Numerical Methods in Fluids
Dynamic Domain Repartitioning
• AMR Challenges
• Distributed Repartitioning Procedure
• Dynamic Load Balancing
• Benchmarks / Performance Evaluation
AMR Challenges
• challenges because of block-structured partitioning:
  • only entire blocks split/merge (only few blocks per process)
    ⇒ sudden increase/decrease of memory consumption by a factor of 8 (in 3D)
      (→ octree partitioning & same number of cells for every block)
    ⇒ "split first, balance afterwards" probably won't work
  • for the LBM, all levels must be load-balanced separately
⇒ for good scalability, the entire pipeline should rely on perfectly distributed algorithms and data structures
→ no replication of (meta) data of any sort!
Distributed Repartitioning Procedure
[Figure: split and merge of blocks; different colors (green/blue) illustrate the process assignment]
1) split/merge decision: callback function to determine which blocks must split and which blocks may merge (a hypothetical sketch of such a callback follows below)
2) skeleton data structure creation: lightweight blocks (few KiB) with no actual data; 2:1 balance is automatically preserved (forced split to maintain 2:1 balance)
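As an illustration of step 1, a hedged sketch of such a decision callback (the names, the error indicator, and the thresholds are hypothetical; the real callback interface is not shown here): each process inspects only its own blocks and assigns every block a target level.

```cpp
#include <vector>

// Hypothetical per-block view handed to the decision callback.
struct BlockInfo
{
    int    level;           // current refinement level of the block
    double errorIndicator;  // e.g., a velocity-gradient based criterion
    int    targetLevel;     // result: the level this block wants to be on
};

// Mark blocks that must split (targetLevel = level + 1) and blocks that may
// merge (targetLevel = level - 1); everything else keeps its level.
void markBlocks(std::vector<BlockInfo>& localBlocks,
                double refineThreshold, double coarsenThreshold)
{
    for (BlockInfo& b : localBlocks)
    {
        if (b.errorIndicator > refineThreshold)
            b.targetLevel = b.level + 1;   // must split into 8 children
        else if (b.errorIndicator < coarsenThreshold)
            b.targetLevel = b.level - 1;   // may merge with its 7 siblings
        else
            b.targetLevel = b.level;
    }
    // A merge only happens if all 8 siblings agree, and additional splits may
    // still be forced afterwards to maintain the 2:1 balance (step 2).
}
```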
3) load balancing: callback function to decide to which process blocks must migrate (skeleton blocks actually move to this process)
   • lightweight skeleton blocks allow multiple migration steps to different processes (→ enables balancing based on diffusion)
   • links between skeleton blocks and their corresponding real blocks are kept intact when skeleton blocks migrate
   • for global load balancing algorithms, balance is achieved in one step → skeleton blocks immediately migrate to their final processes
   (an illustrative sketch of a skeleton block follows below)
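An illustrative picture of a skeleton block (an assumed data layout, not waLBerla's actual one): a few bytes of metadata plus the link back to the process that still owns the real block, which is what makes repeated, cheap migration during balancing possible.

```cpp
#include <cstdint>

// Illustrative skeleton block: a few bytes of metadata, no grid data.
struct SkeletonBlock
{
    std::uint64_t blockId;     // implicit octree position (cf. the encoding sketch above)
    int           level;       // refinement level after the pending split/merge
    double        weight;      // workload estimate used by the load balancer
    int           ownerRank;   // process that still holds the corresponding real block
};
// Only SkeletonBlock instances travel during load balancing (possibly through
// several intermediate processes). Once balancing has converged, ownerRank is
// the link that tells each final holder where to fetch the real block data
// from in step 4.
```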
4) data migration: links between skeleton blocks and corresponding real blocks are used to perform the actual data migration (includes refinement and coarsening of block data)
   implementation for grid data:
   • coarsening → senders coarsen data before sending to the target process
   • refinement → receivers refine on the target process(es)
   (a minimal coarsen/refine sketch for a scalar field follows below)
key parts customizable via callback functions in order to adapt to the underlying simulation:
1) decision which blocks split/merge
2) dynamic load balancing
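A minimal sketch of the coarsen/refine operations for a scalar cell field (an assumed data layout; the real block data are LBM PDF fields in waLBerla's own field classes): coarsening averages every 2×2×2 group of fine cells before the data are sent, refinement injects each received coarse value into the 2×2×2 fine cells it covers.

```cpp
#include <cstddef>
#include <vector>

// Minimal scalar grid of nx*ny*nz cells, x running fastest.
struct Grid
{
    std::size_t nx, ny, nz;
    std::vector<double> cells;
    double& at(std::size_t x, std::size_t y, std::size_t z)
    { return cells[(z * ny + y) * nx + x]; }
    double  at(std::size_t x, std::size_t y, std::size_t z) const
    { return cells[(z * ny + y) * nx + x]; }
};

// Sender side: average 2x2x2 fine cells into one coarse cell.
Grid coarsen(const Grid& fine)
{
    Grid coarse{fine.nx / 2, fine.ny / 2, fine.nz / 2, {}};
    coarse.cells.assign(coarse.nx * coarse.ny * coarse.nz, 0.0);
    for (std::size_t z = 0; z < coarse.nz; ++z)
        for (std::size_t y = 0; y < coarse.ny; ++y)
            for (std::size_t x = 0; x < coarse.nx; ++x)
            {
                double sum = 0.0;
                for (std::size_t dz = 0; dz < 2; ++dz)
                    for (std::size_t dy = 0; dy < 2; ++dy)
                        for (std::size_t dx = 0; dx < 2; ++dx)
                            sum += fine.at(2 * x + dx, 2 * y + dy, 2 * z + dz);
                coarse.at(x, y, z) = sum / 8.0;
            }
    return coarse;
}

// Receiver side: inject every coarse value into the 2x2x2 fine cells it covers.
Grid refine(const Grid& coarse)
{
    Grid fine{coarse.nx * 2, coarse.ny * 2, coarse.nz * 2, {}};
    fine.cells.assign(fine.nx * fine.ny * fine.nz, 0.0);
    for (std::size_t z = 0; z < fine.nz; ++z)
        for (std::size_t y = 0; y < fine.ny; ++y)
            for (std::size_t x = 0; x < fine.nx; ++x)
                fine.at(x, y, z) = coarse.at(x / 2, y / 2, z / 2);
    return fine;
}
```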
Dynamic Load Balancing
1) space filling curves (Morton or Hilbert):
   • every process needs global knowledge (→ all gather)
     ⇒ scaling issues (even if it's just a few bytes from every process)
2) load balancing based on diffusion (a sketch of one iteration follows this list):
   • iterative procedure (= repeat the following multiple times)
   • communication with neighboring processes only
     ⇒ calculate "flow" for every process-process connection
     ⇒ use this "flow" as a guideline to decide where blocks need to migrate in order to achieve balance
     ⇒ runtime & memory independent of the number of processes (true in practice? → benchmarks)
   • useful extension (benefits outweigh the costs): all reduce to check for early abort & to adapt the "flow"
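A hedged sketch of one diffusion iteration (simplified to a single level; the neighbor lists, weights, and the first-order flow formula alpha · (myLoad − neighborLoad) are assumptions for illustration, not necessarily the exact scheme used in waLBerla):

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// One diffusion iteration for one level (sketch).
// neighbors:    ranks this process shares block faces with
// blockWeights: workload of every local block on this level
// Returns, per neighbor, how much "load" should be shipped there (if positive).
std::vector<double> diffusionFlow(MPI_Comm comm,
                                  const std::vector<int>& neighbors,
                                  const std::vector<double>& blockWeights,
                                  double alpha /* e.g. 1 / (maxDegree + 1) */)
{
    double myLoad = 0.0;
    for (double w : blockWeights) myLoad += w;

    // exchange the current load with every neighboring process
    std::vector<double> neighborLoad(neighbors.size(), 0.0);
    std::vector<MPI_Request> requests(2 * neighbors.size());
    for (std::size_t i = 0; i < neighbors.size(); ++i)
    {
        MPI_Isend(&myLoad, 1, MPI_DOUBLE, neighbors[i], 0, comm, &requests[2 * i]);
        MPI_Irecv(&neighborLoad[i], 1, MPI_DOUBLE, neighbors[i], 0, comm, &requests[2 * i + 1]);
    }
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);

    // positive flow = this much load should leave towards that neighbor
    std::vector<double> flow(neighbors.size(), 0.0);
    for (std::size_t i = 0; i < neighbors.size(); ++i)
        flow[i] = alpha * (myLoad - neighborLoad[i]);
    return flow;
}
// Afterwards, (skeleton) blocks are selected for migration until the positive
// flows are roughly satisfied; repeating the iteration lets the load spread
// out like a diffusion process, using neighbor communication only.
```

The optional all reduce mentioned above can then detect that all flows are (nearly) zero and abort the iteration early.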
LBM AMR - Performance
• Benchmark Environments:
  • JUQUEEN (5.0 PFLOP/s)
    • Blue Gene/Q, 459K cores, 1 GB/core
    • compiler: IBM XL / IBM MPI
  • SuperMUC (2.9 PFLOP/s)
    • Intel Xeon, 147K cores, 2 GB/core
    • compiler: Intel XE / IBM MPI
• Benchmark (LBM D3Q19 TRT):
  [Figure: domain partitioning of the lid-driven cavity benchmark with 4 grid levels]
[Figure: refresh in the lid-driven cavity benchmark, shown step by step: coarsen, refine, uphold 2:1 balance]
during this refresh process …
… all cells on the finest level are coarsened and the same number of fine cells is created by splitting coarser cells
→ 72 % of all cells change their size
avg. blocks/process (max. blocks/process) per level:

level | initially | after refresh | after load balance
  0   | 0.383 (1) | 0.328 (1)     | 0.328 (1)
  1   | 0.656 (1) | 0.875 (9)     | 0.875 (1)
  2   | 1.313 (2) | 3.063 (11)    | 3.063 (4)
  3   | 3.500 (4) | 3.500 (16)    | 3.500 (4)
• SuperMUC – space filling curve: Morton
[Plot: time in seconds (0 to 3.5) required for the entire refresh cycle (uphold 2:1 balance, dynamic load balancing, split/merge blocks, migrate data) vs. number of cores (1024, 8192, 65,536) for 209,671 / 497,000 / 970,703 cells per core, corresponding to 14 / 33 / 64 billion cells at 65,536 cores]
• SuperMUC – diffusion load balancing
[Plot: refresh cycle time in seconds (0 to 3.5) vs. number of cores (1024, 8192, 65,536) for the same three problem sizes (209,671 / 497,000 / 970,703 cells per core, i.e., 14 / 33 / 64 billion cells)]
time almost independent of #processes!
• JUQUEEN – space filling curve: Morton
[Plot: refresh cycle time in seconds (0 to 12) vs. number of cores (256, 4096, 32,768, 458,752) for 31,062 / 127,232 / 429,408 cells per core, corresponding to 14 / 58 / 197 billion cells at 458,752 cores]
hybrid MPI+OpenMP version with SMP: 1 process ⇔ 2 cores ⇔ 8 threads
• JUQUEEN – diffusion load balancing
[Plot: refresh cycle time in seconds (0 to 12) vs. number of cores (256, 4096, 32,768, 458,752) for the same three problem sizes (31,062 / 127,232 / 429,408 cells per core)]
time almost independent of #processes!
• JUQUEEN – diffusion load balancing
[Plot: number of diffusion iterations (0 to 12) until the load is perfectly balanced vs. number of cores (256, 4096, 32,768, 458,752)]
• impact on performance / overhead of the entire dynamic repartitioning procedure?
• depends …
  • … on the number of cells per core
  • … on the actual runtime of the compute kernels (D3Q19 vs. D3Q27, additional force models, etc.)
  • … on how often dynamic repartitioning happens
• previous lid-driven cavity benchmark:
  • overhead ≙ 1 to 3 (diffusion) or 1.5 to 10 (space filling curve) time steps
⇒ In practice, a lot of time† is spent just to determine whether or not the grid must be adapted, i.e., whether or not refinement must take place.
† often the entire overhead of AMR
• AMR for the LBM – example (vocal fold phantom geometry)
  • DNS (direct numerical simulation)
  • Reynolds number: 2500 / D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells / 1 ↔ 5 levels
  • processes: 3584 (on SuperMUC phase 2)
  • runtime: c. 24 h (3 × c. 8 h)
  • load balancer: space filling curve (Hilbert order)
  • time steps: 180,000 / 2,880,000 (finest grid)
  • refresh cycles: 537 (→ refresh every 335 time steps)
  • without refinement: 311 times more memory … and 701 times the workload
Conclusion
Conclusion & Outlook
• the approach for massively parallel grid repartitioning by
  … using a block-structured domain partitioning and
  … employing a lightweight "copy" of the data structure during dynamic load balancing
  is paying off and working extremely well:
  we can handle 10¹¹ cells (> 10¹² unknowns) …
  … with 10⁷ blocks and 1.83 million threads
• resilience (using ULFM): store redundant, in-memory "snapshots" → one/multiple process(es) fail → restore data on different processes → perform dynamic repartitioning → continue :-)
THANK YOU FOR YOUR ATTENTION!
QUESTIONS?