
Page 1:

Peta-Scale Simulations with the HPC Software Framework waLBerla: Massively Parallel AMR for the Lattice Boltzmann Method

SIAM PP 2016, Paris, April 15, 2016

Florian Schornbaum, Christian Godenschwager, Martin Bauer, Ulrich Rüde

Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Page 2:

Outline

• Introduction
  • The waLBerla Simulation Framework
  • An Example Using the Lattice Boltzmann Method

• Parallelization Concepts
  • Domain Partitioning & Data Handling

• Dynamic Domain Repartitioning
  • AMR Challenges
  • Distributed Repartitioning Procedure
  • Dynamic Load Balancing
  • Benchmarks / Performance Evaluation

• Conclusion

Page 3:

Introduction

• The waLBerla Simulation Framework
• An Example Using the Lattice Boltzmann Method

Page 4:

Introduction

• waLBerla (widely applicable Lattice Boltzmann framework from Erlangen):

  • main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM); now also implementations of other methods, e.g., phase field

  • designed at its very core as an HPC software framework:
    • scales from laptops to current petascale supercomputers
    • largest simulation: 1,835,008 processes (IBM Blue Gene/Q at Jülich)
    • hybrid parallelization: MPI + OpenMP
    • vectorization of compute kernels

  • written in C++(11), growing Python interface

  • support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, llvm/clang, IBM XL)

  • automated build and test system

Page 5:

Introduction

• AMR for the LBM – example (vocal fold phantom geometry)

  • DNS (direct numerical simulation)
  • Reynolds number: 2500, D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells, 1 ↔ 5 levels
  • without refinement: 311 times more memory and 701 times the workload

Page 6:

Parallelization Concepts

• Domain Partitioning & Data Handling

Page 7:

Parallelization Concepts

[Figure] Domain partitioning into blocks: the simulation domain occupies only part of the bounding box, empty blocks are discarded, followed by static block-level refinement.

Page 8:

Parallelization Concepts

[Figure] The same partitioning as before, now with octree partitioning within every block of the initial partitioning (→ forest of octrees).

Page 9:

Parallelization Concepts

[Figure] Static block-level refinement (→ forest of octrees) → static load balancing → allocation of block data (→ grids).

Load balancing can be based either on space-filling curves (Morton or Hilbert order) using the underlying forest of octrees, or on graph partitioning (METIS, …) – whatever fits the needs of the simulation best (a minimal Morton-order sketch follows below).
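
To make the space-filling-curve option concrete, here is a minimal, self-contained sketch (my own illustration, not waLBerla code) of how a 3D Morton key is obtained by interleaving the bits of integer block coordinates; sorting blocks by this key yields the curve order that a static load balancer can cut into equal pieces, one piece per process.

```cpp
// Minimal sketch (illustration only, not waLBerla code): 3D Morton keys by
// bit interleaving. Sorting blocks by their key orders them along the
// Morton space-filling curve.
#include <algorithm>
#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

// spread the lower 21 bits of v so that two zero bits separate consecutive bits
std::uint64_t spreadBits(std::uint64_t v) {
    v &= 0x1fffffULL;
    v = (v | (v << 32)) & 0x001f00000000ffffULL;
    v = (v | (v << 16)) & 0x001f0000ff0000ffULL;
    v = (v | (v <<  8)) & 0x100f00f00f00f00fULL;
    v = (v | (v <<  4)) & 0x10c30c30c30c30c3ULL;
    v = (v | (v <<  2)) & 0x1249249249249249ULL;
    return v;
}

std::uint64_t mortonKey(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return spreadBits(x) | (spreadBits(y) << 1) | (spreadBits(z) << 2);
}

int main() {
    std::vector<std::array<std::uint32_t, 3> > blocks = {
        {{3, 1, 0}}, {{0, 0, 0}}, {{1, 2, 3}}, {{2, 2, 2}}
    };
    std::sort(blocks.begin(), blocks.end(),
              [](const std::array<std::uint32_t, 3>& a, const std::array<std::uint32_t, 3>& b) {
                  return mortonKey(a[0], a[1], a[2]) < mortonKey(b[0], b[1], b[2]);
              });
    for (std::size_t i = 0; i < blocks.size(); ++i)                    // print blocks in curve order
        std::cout << blocks[i][0] << " " << blocks[i][1] << " " << blocks[i][2] << "\n";
}
```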

Page 10:

Parallelization Concepts

[Figure] The partitioning pipeline (static block-level refinement → static load balancing → allocation of block data) can optionally be separated from the simulation itself: the partitioning is written to and later read from disk using compact (KiB/MiB) binary MPI IO.

Page 11:

Parallelization Concepts

The data and the data structure are stored perfectly distributed at every stage → no replication of (meta)data!

Page 12:

Parallelization Concepts

All parts of this setup pipeline are customizable via callback functions in order to adapt to the underlying simulation (a hypothetical sketch of such a callback interface follows below):

1) discarding of blocks
2) (iterative) refinement of blocks
3) load balancing
4) block data allocation†

† support for an arbitrary number of block data items (each of arbitrary type)
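
The callback idea can be pictured with the following sketch. This is a hypothetical interface of my own (std::function members standing in for the four customization points), not the actual waLBerla API.

```cpp
// Hypothetical sketch (my own illustration, not the waLBerla API): a setup
// pipeline exposing its four customization points as callbacks that the
// application fills in to adapt the generic machinery to its simulation.
#include <cstdint>
#include <functional>
#include <iostream>

struct Block { std::uint64_t id; int level; int process; };

struct SetupPipeline {
    std::function<bool(const Block&)>     discardBlock;   // 1) discard empty blocks?
    std::function<bool(const Block&)>     refineBlock;    // 2) refine this block further?
    std::function<int(const Block&, int)> assignProcess;  // 3) load balancing: target process
    std::function<void(Block&)>           allocateData;   // 4) block data allocation
};

int main() {
    SetupPipeline pipeline;
    pipeline.discardBlock  = [](const Block&)   { return false; };         // keep every block
    pipeline.refineBlock   = [](const Block& b) { return b.level < 2; };   // refine up to level 2
    pipeline.assignProcess = [](const Block& b, int numProcesses) {
        return static_cast<int>(b.id % static_cast<std::uint64_t>(numProcesses));
    };
    pipeline.allocateData  = [](Block&) { /* allocate grids etc. here */ };

    Block b = {42, 0, -1};
    if (!pipeline.discardBlock(b)) {
        while (pipeline.refineBlock(b)) ++b.level;       // stand-in for real octree refinement
        b.process = pipeline.assignProcess(b, 8);
        pipeline.allocateData(b);
        std::cout << "block " << b.id << " -> level " << b.level
                  << ", process " << b.process << "\n";  // block 42 -> level 2, process 2
    }
}
```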

Page 13:

Parallelization Concepts

Different “views” on / representations of the domain partitioning:

• forest of octrees: the octrees are not explicitly stored, but implicitly defined via block IDs
• 2:1 balanced grid (used for the LBM on refined grids)
• distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs

A sketch of how a block ID can encode an octree position follows below.
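
As an illustration of how octrees can be defined implicitly via block IDs, the following sketch uses an assumed encoding (a leading marker bit followed by three bits per level for the octant path); it conveys the idea, not waLBerla's exact ID layout.

```cpp
// Sketch of an implicit octree via block IDs (assumed encoding, not
// necessarily waLBerla's exact layout): a leading marker bit followed by
// 3 bits per level storing the octant taken at each level. Parent/child
// relations are then pure bit operations, no tree needs to be stored.
#include <cstdint>
#include <iostream>

typedef std::uint64_t BlockId;

BlockId rootId()                          { return 1; }                // marker bit only
BlockId childId(BlockId id, unsigned oct) { return (id << 3) | oct; }  // oct in [0,7]
BlockId parentId(BlockId id)              { return id >> 3; }

unsigned level(BlockId id) {
    unsigned l = 0;
    while (id > 1) { id >>= 3; ++l; }
    return l;
}

int main() {
    BlockId b = childId(childId(rootId(), 5), 2);   // root -> octant 5 -> octant 2
    std::cout << "id = " << b
              << ", level = " << level(b)                    // 2
              << ", parent level = " << level(parentId(b))   // 1
              << "\n";
}
```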

Page 14:

Parallelization Concepts

Our parallel implementation [1] of local grid refinement for the LBM, based on [2], shows excellent performance:

• simulations with, in total, close to one trillion cells
• close to one trillion cell updates per second (with 1.8 million threads)
• strong scaling: more than 1000 time steps per second → 1 ms per time step

[1] F. Schornbaum and U. Rüde, Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-Uniform Grids, SIAM Journal on Scientific Computing (accepted for publication), http://arxiv.org/abs/1508.07982
[2] M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, International Journal for Numerical Methods in Fluids

Page 15:

Dynamic Domain Repartitioning

• AMR Challenges
• Distributed Repartitioning Procedure
• Dynamic Load Balancing
• Benchmarks / Performance Evaluation

Page 16:

AMR Challenges

• challenges because of the block-structured partitioning:

  • only entire blocks split/merge (and there are only a few blocks per process)
    ⇒ sudden increase/decrease of memory consumption by a factor of 8 (in 3D)
       (→ octree partitioning & same number of cells for every block)
    ⇒ “split first, balance afterwards” probably won’t work

  • for the LBM, all levels must be load-balanced separately

  ⇒ for good scalability, the entire pipeline should rely on perfectly distributed algorithms and data structures

Page 17:

AMR Challenges

(continued) ⇒ the entire pipeline should rely on perfectly distributed algorithms and data structures → no replication of (meta)data of any sort!

Page 18:

Dynamic Domain Repartitioning

1) split/merge decision: a callback function determines which blocks must split and which blocks may merge

2) skeleton data structure creation: lightweight blocks (few KiB) with no actual data; 2:1 balance is automatically preserved, which can force additional splits (a small sketch of this 2:1 rule follows below)

[Figure] Split and merge of blocks; different colors (green/blue) illustrate the process assignment; one block is force-split to maintain 2:1 balance.
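
The forced splits can be illustrated with a small sketch (simplified data layout of my own, not waLBerla's structures): whenever two neighboring blocks would end up more than one level apart, the coarser block's target level is raised until the 2:1 criterion holds everywhere.

```cpp
// Sketch of 2:1 level balance (illustration only, simplified data layout):
// neighboring blocks may differ by at most one refinement level, so a block
// that is too coarse relative to a neighbor is forced to split, i.e. its
// target level is raised. Repeat until no violation remains.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

void enforceTwoToOne(std::vector<int>& level,
                     const std::vector<std::pair<int, int> >& neighbors) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < neighbors.size(); ++i) {
            int a = neighbors[i].first, b = neighbors[i].second;
            if (level[a] + 1 < level[b]) { level[a] = level[b] - 1; changed = true; } // forced split of a
            if (level[b] + 1 < level[a]) { level[b] = level[a] - 1; changed = true; } // forced split of b
        }
    }
}

int main() {
    std::vector<int> level = {0, 0, 2};                          // block 2 is two levels finer than block 1
    std::vector<std::pair<int, int> > neighbors = {{0, 1}, {1, 2}};
    enforceTwoToOne(level, neighbors);
    for (std::size_t i = 0; i < level.size(); ++i)               // block 1 is forced from level 0 to 1
        std::cout << "block " << i << " -> level " << level[i] << "\n";
}
```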


Page 20:

Dynamic Domain Repartitioning

3) load balancing: a callback function decides to which process each block must migrate; the skeleton blocks actually move to this process

Page 21:

Dynamic Domain Repartitioning

3) load balancing (cont.): the lightweight skeleton blocks allow multiple migration steps to different processes (→ enables balancing based on diffusion)

Page 22:

Dynamic Domain Repartitioning

3) load balancing (cont.): links between skeleton blocks and the corresponding real blocks are kept intact when skeleton blocks migrate

Page 23:

Dynamic Domain Repartitioning

3) load balancing (cont.): for global load balancing algorithms, balance is achieved in one step → the skeleton blocks immediately migrate to their final processes

Page 24:

Dynamic Domain Repartitioning

4) data migration: the links between skeleton blocks and the corresponding real blocks are used to perform the actual data migration (includes refinement and coarsening of block data)

Page 25:

Dynamic Domain Repartitioning

4) data migration – implementation for grid data (a sketch of both steps follows below):

   • coarsening → senders coarsen the data before sending it to the target process
   • refinement → receivers refine on the target process(es)
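
As a concrete picture of these two steps, here is a sketch assuming one scalar value per cell in a plain contiguous 3D array (the real LBM block data transfer is more involved): the sender averages every 2×2×2 group of fine cells into one coarse cell, the receiver expands every coarse cell into a 2×2×2 group of fine cells.

```cpp
// Sketch (illustration only; assumes one scalar per cell in a contiguous,
// x-fastest 3D array):
// coarsen = average each 2x2x2 group of fine cells on the sender,
// refine  = inject each coarse value into a 2x2x2 group on the receiver.
#include <cstddef>
#include <iostream>
#include <vector>

typedef std::vector<double> Field;

std::size_t idx(std::size_t x, std::size_t y, std::size_t z, std::size_t nx, std::size_t ny) {
    return x + nx * (y + ny * z);
}

// fine field of size nx*ny*nz -> coarse field of size (nx/2)*(ny/2)*(nz/2)
Field coarsen(const Field& fine, std::size_t nx, std::size_t ny, std::size_t nz) {
    Field coarse((nx / 2) * (ny / 2) * (nz / 2), 0.0);
    for (std::size_t z = 0; z < nz; ++z)
        for (std::size_t y = 0; y < ny; ++y)
            for (std::size_t x = 0; x < nx; ++x)
                coarse[idx(x / 2, y / 2, z / 2, nx / 2, ny / 2)] += fine[idx(x, y, z, nx, ny)] / 8.0;
    return coarse;
}

// coarse field of size nx*ny*nz -> fine field twice as large in every direction
Field refine(const Field& coarse, std::size_t nx, std::size_t ny, std::size_t nz) {
    Field fine(8 * nx * ny * nz, 0.0);
    for (std::size_t z = 0; z < 2 * nz; ++z)
        for (std::size_t y = 0; y < 2 * ny; ++y)
            for (std::size_t x = 0; x < 2 * nx; ++x)
                fine[idx(x, y, z, 2 * nx, 2 * ny)] = coarse[idx(x / 2, y / 2, z / 2, nx, ny)];
    return fine;
}

int main() {
    Field fine(4 * 4 * 4, 1.0);                      // a 4x4x4 block with constant data
    Field coarse = coarsen(fine, 4, 4, 4);           // 2x2x2, every value 1.0
    Field again  = refine(coarse, 2, 2, 2);          // back to 4x4x4, every value 1.0
    std::cout << coarse.size() << " " << again.size() << "\n";  // prints: 8 64
}
```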

Page 26:

Dynamic Domain Repartitioning

The key parts are customizable via callback functions in order to adapt to the underlying simulation:

1) decision which blocks split/merge
2) dynamic load balancing

Page 27:

Dynamic Load Balancing

1) space-filling curves (Morton or Hilbert):
   • every process needs global knowledge (→ all-gather)
   ⇒ scaling issues (even if it is just a few bytes from every process)

2) load balancing based on diffusion:
   • iterative procedure (= the following is repeated multiple times)
   • communication with neighboring processes only
   ⇒ calculate a “flow” for every process-process connection
   ⇒ use this “flow” as a guideline to decide where blocks need to migrate in order to achieve balance
   ⇒ runtime & memory independent of the number of processes (true in practice? → benchmarks)
   • useful extension (the benefits outweigh the costs): an all-reduce to check for early abort & to adapt the “flow”

A sketch of one diffusion iteration follows below.
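
To make the diffusion idea concrete, here is a minimal sketch of one iteration (my own illustration with an assumed 1D chain of neighbor ranks and one scalar load per process; the real scheme works per level and on the actual process graph): each process exchanges its load with its neighbors and interprets a fraction of each load difference as the "flow" along that connection.

```cpp
// Sketch of one diffusion iteration (illustration only: one scalar load per
// process, neighbors form a simple 1D chain): exchange loads with all
// neighbors and interpret a fraction of each load difference as the "flow"
// along that process-process connection. Positive flow: send work there.
#include <mpi.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double myLoad = 1.0 + rank % 3;              // stand-in for the block weights on this process
    std::vector<int> neighbors;
    if (rank > 0)        neighbors.push_back(rank - 1);
    if (rank + 1 < size) neighbors.push_back(rank + 1);

    const double alpha = 0.5;                    // diffusion constant (assumed)
    for (std::size_t i = 0; i < neighbors.size(); ++i) {
        int n = neighbors[i];
        double theirLoad = 0.0;
        MPI_Sendrecv(&myLoad, 1, MPI_DOUBLE, n, 0,
                     &theirLoad, 1, MPI_DOUBLE, n, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double flow = alpha * (myLoad - theirLoad);   // > 0: send this much work to n
        std::printf("rank %d -> rank %d : flow %+.2f\n", rank, n, flow);
    }
    MPI_Finalize();
    return 0;
}
```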

Page 28:

LBM AMR – Performance

• Benchmark environments:
  • JUQUEEN (5.0 PFLOP/s): Blue Gene/Q, 459K cores, 1 GB/core, compiler: IBM XL / IBM MPI
  • SuperMUC (2.9 PFLOP/s): Intel Xeon, 147K cores, 2 GB/core, compiler: Intel XE / IBM MPI

• Benchmark (LBM D3Q19 TRT): lid-driven cavity, domain partitioning with 4 grid levels

Page 29:

LBM AMR – Performance

[Figure] Benchmark domain during the refresh: regions marked for coarsening.

Page 30:

LBM AMR – Performance

[Figure] Benchmark domain during the refresh: regions marked for coarsening and refinement.

Page 31:

LBM AMR – Performance

[Figure] Benchmark domain during the refresh: regions marked for coarsening and refinement, plus the adjustments required to maintain 2:1 balance.

Page 32:

LBM AMR – Performance

During this refresh process, all cells on the finest level are coarsened and the same amount of fine cells is created by splitting coarser cells → 72 % of all cells change their size.

Page 33:

LBM AMR – Performance

avg. blocks/process (max. blocks/process) per level:

  level   initially    after refresh   after load balancing
    0     0.383 (1)     0.328 (1)       0.328 (1)
    1     0.656 (1)     0.875 (9)       0.875 (1)
    2     1.313 (2)     3.063 (11)      3.063 (4)
    3     3.500 (4)     3.500 (16)      3.500 (4)

Page 34:

LBM AMR – Performance

• SuperMUC – space-filling curve: Morton

[Figure] Time (in seconds, 0–3.5) required for the entire refresh cycle (uphold 2:1 balance, dynamic load balancing, split/merge blocks, migrate data) on 1024, 8192, and 65,536 cores, for 209,671 / 497,000 / 970,703 cells per core.

Page 35:

LBM AMR – Performance

• SuperMUC – space-filling curve: Morton

[Figure] The same plot as before, annotated with the total problem sizes at the largest scale: 14, 33, and 64 billion cells.

Page 36:

LBM AMR – Performance

• SuperMUC – diffusion load balancing

[Figure] Time (in seconds, 0–3.5) for the entire refresh cycle on 1024, 8192, and 65,536 cores, for 209,671 / 497,000 / 970,703 cells per core (14, 33, and 64 billion cells at the largest scale). The time is almost independent of the number of processes!

Page 37:

LBM AMR – Performance

• JUQUEEN – space-filling curve: Morton

[Figure] Time (in seconds, 0–12) for the entire refresh cycle on 256, 4096, 32,768, and 458,752 cores, for 31,062 / 127,232 / 429,408 cells per core (14, 58, and 197 billion cells at the largest scale). Hybrid MPI+OpenMP version with SMP: 1 process ⇔ 2 cores ⇔ 8 threads.

Page 38:

LBM AMR – Performance

• JUQUEEN – diffusion load balancing

[Figure] Time (in seconds, 0–12) for the entire refresh cycle on 256, 4096, 32,768, and 458,752 cores, for 31,062 / 127,232 / 429,408 cells per core (14, 58, and 197 billion cells at the largest scale). The time is almost independent of the number of processes!

Page 39:

LBM AMR – Performance

• JUQUEEN – diffusion load balancing

[Figure] Number of diffusion iterations (0–12) until the load is perfectly balanced, on 256, 4096, 32,768, and 458,752 cores.

Page 40:

LBM AMR – Performance

• What is the impact on performance / the overhead of the entire dynamic repartitioning procedure?

• It depends …
  • … on the number of cells per core
  • … on the actual runtime of the compute kernels (D3Q19 vs. D3Q27, additional force models, etc.)
  • … on how often dynamic repartitioning happens

• previous lid-driven cavity benchmark:
  • overhead ≙ 1 to 3 (diffusion) or 1.5 to 10 (space-filling curve) time steps (a back-of-the-envelope calculation follows below)

⇒ In practice, a lot of time† is spent just to determine whether or not the grid must be adapted, i.e., whether or not refinement must take place.

† often the entire overhead of AMR
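
As a rough worked example (my own combination of the per-refresh overhead above with the refresh interval of the application example two slides later), the relative cost of repartitioning is simply the per-refresh overhead, expressed in time steps, divided by the number of time steps between two refreshes:

```cpp
// Back-of-the-envelope sketch (my own illustration; numbers combined from
// this slide and the later application example): relative AMR overhead =
// per-refresh cost in time steps / time steps between refreshes.
#include <cstdio>

int main() {
    const double overheadInTimeSteps = 3.0;    // e.g., upper end of the diffusion-based refresh
    const double stepsBetweenRefresh = 335.0;  // a refresh roughly every 335 time steps
    std::printf("relative overhead: %.1f %%\n",
                100.0 * overheadInTimeSteps / stepsBetweenRefresh);  // prints: 0.9 %
}
```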

Page 41:

LBM AMR – Performance

• AMR for the LBM – example (vocal fold phantom geometry)

  • DNS (direct numerical simulation)
  • Reynolds number: 2500, D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells, 1 ↔ 5 levels
  • processes: 3584 (on SuperMUC Phase 2)
  • runtime: c. 24 h (3 × c. 8 h)

Page 42:

LBM AMR – Performance

• AMR for the LBM – example (vocal fold phantom geometry), continued

  • load balancer: space-filling curve (Hilbert order)
  • time steps: 180,000 / 2,880,000 (finest grid)
  • refresh cycles: 537 (→ a refresh every ≈ 335 time steps)
  • without refinement: 311 times more memory and 701 times the workload

Page 43:

Conclusion

Page 44:

Conclusion & Outlook

• The approach to massively parallel grid repartitioning,
  … using a block-structured domain partitioning and
  … employing a lightweight “copy” of the data structure during dynamic load balancing,
  is paying off and working extremely well:

  we can handle 10¹¹ cells (> 10¹² unknowns) with 10⁷ blocks and 1.83 million threads.

Page 45:

Conclusion & Outlook

• Outlook – resilience (using ULFM): store redundant, in-memory “snapshots” → if one or multiple processes fail → restore the data on different processes → perform dynamic repartitioning → continue :-)

A hedged sketch of such an in-memory snapshot exchange follows below.
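
A minimal sketch of the snapshot idea (an assumed buddy scheme of my own, using plain MPI; ULFM itself would only be needed for the actual failure detection and communicator repair): each process keeps an in-memory copy of a partner's serialized block data, so after a failure the surviving copy can be restored on a different process, followed by dynamic repartitioning.

```cpp
// Sketch of redundant in-memory snapshots (illustration only, assumed buddy
// scheme): every process keeps a serialized copy of its partner's block
// data, so that after a failure the surviving copy can be handed to another
// process and the simulation continues after dynamic repartitioning.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<char> myData(1024, static_cast<char>(rank));  // stand-in for serialized block data
    std::vector<char> partnerCopy(myData.size(), 0);

    int partner = rank ^ 1;                                   // pair up ranks: 0<->1, 2<->3, ...
    if (partner < size) {
        MPI_Sendrecv(myData.data(), static_cast<int>(myData.size()), MPI_CHAR, partner, 0,
                     partnerCopy.data(), static_cast<int>(partnerCopy.size()), MPI_CHAR, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // partnerCopy now holds the redundant snapshot of the partner's data
    }
    MPI_Finalize();
    return 0;
}
```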

Page 46:

THANK YOU FOR YOUR ATTENTION!

Page 47:

THANK YOU FOR YOUR ATTENTION!

QUESTIONS?