TRANSCRIPT
Peta-Scale Simulations with the HPC Software Framework waLBerla: Massively Parallel AMR for the Lattice Boltzmann Method
SIAM PP 2016, Paris
April 15, 2016
Florian Schornbaum, Christian Godenschwager, Martin Bauer, Ulrich Rüde
Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
• Introduction
  • The waLBerla Simulation Framework
  • An Example Using the Lattice Boltzmann Method
• Parallelization Concepts
  • Domain Partitioning & Data Handling
• Dynamic Domain Repartitioning
  • AMR Challenges
  • Distributed Repartitioning Procedure
  • Dynamic Load Balancing
  • Benchmarks / Performance Evaluation
• Conclusion
Introduction
• The waLBerla Simulation Framework
• An Example Using the Lattice Boltzmann Method
• waLBerla (widely applicable Lattice Boltzmann framework from Erlangen):
  • main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM) (now also implementations of other methods, e.g., phase field)
  • at its very core designed as an HPC software framework:
    • scales from laptops to current petascale supercomputers
    • largest simulation: 1,835,008 processes (IBM Blue Gene/Q @ Jülich)
    • hybrid parallelization: MPI + OpenMP
    • vectorization of compute kernels
  • written in C++(11), growing Python interface
  • support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, LLVM/Clang, IBM XL)
  • automated build and test system
• AMR for the LBM – example (vocal fold phantom geometry)
  • DNS (direct numerical simulation)
  • Reynolds number: 2500 / D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells / 1 ↔ 5 levels
  • without refinement: 311 times more memory … and 701 times the workload
Parallelization Concepts
• Domain Partitioning & Data Handling
[Figure: the simulation domain occupies only part of the bounding box; the domain is partitioned into blocks, empty blocks are discarded, and static block-level refinement is applied]
octree partitioning within every block of the initial partitioning
(→ forest of octrees)
static block-level refinement (→ forest of octrees) → static load balancing → allocation of block data (→ grids)
load balancing can be based on either space-filling curves (Morton or Hilbert order) using the underlying forest of octrees, or graph partitioning (METIS, …), whatever best fits the needs of the simulation (a minimal curve-based sketch follows below)
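The curve-based variant can be pictured as follows: blocks are ordered along the space-filling curve and the curve is cut into one contiguous chunk per process. A minimal C++ sketch of this idea (the names Block, morton3D, and balanceAlongCurve are hypothetical and not part of the waLBerla API; per-block weights are assumed to be given):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Interleave the bits of three 21-bit block coordinates into a Morton code.
std::uint64_t morton3D(std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    auto spread = [](std::uint64_t v) {            // insert two 0 bits after every bit
        v &= 0x1FFFFFULL;
        v = (v | (v << 32)) & 0x001F00000000FFFFULL;
        v = (v | (v << 16)) & 0x001F0000FF0000FFULL;
        v = (v | (v << 8))  & 0x100F00F00F00F00FULL;
        v = (v | (v << 4))  & 0x10C30C30C30C30C3ULL;
        v = (v | (v << 2))  & 0x1249249249249249ULL;
        return v;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}

struct Block { std::uint32_t x, y, z; double weight; int targetProcess; };

// Sort blocks along the Morton curve, then cut the curve into numProcesses
// contiguous chunks of (approximately) equal accumulated weight.
void balanceAlongCurve(std::vector<Block>& blocks, int numProcesses)
{
    std::sort(blocks.begin(), blocks.end(), [](const Block& a, const Block& b) {
        return morton3D(a.x, a.y, a.z) < morton3D(b.x, b.y, b.z);
    });
    double total = 0.0;
    for (const Block& b : blocks) total += b.weight;
    if (total <= 0.0) total = 1.0;                 // guard for the sketch
    double accumulated = 0.0;
    for (Block& b : blocks) {
        b.targetProcess = std::min(numProcesses - 1,
                                   static_cast<int>(accumulated / total * numProcesses));
        accumulated += b.weight;
    }
}
```

With equal weights this simply assigns equally sized, curve-contiguous ranges of blocks to the processes; a Hilbert ordering would only change the comparison key.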
separation of domain partitioning from simulation (optional): the partitioning produced by static refinement and load balancing can be written to DISK and read back via compact (KiB/MiB) binary MPI IO
data & data structure stored perfectly distributed
→ no replication of (meta) data!
all parts customizable via callback functions in order to adapt to the underlying simulation (a hypothetical sketch follows below):
1) discarding of blocks
2) (iterative) refinement of blocks
3) load balancing
4) block data allocation†
† support for arbitrary number of block data items (each of arbitrary type)
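To illustrate what "customizable via callback functions" can mean in practice, here is a hypothetical sketch (type and member names are made up for illustration and do not reflect the actual waLBerla setup API): each of the four steps above is a user-supplied function that the setup pipeline invokes per block.

```cpp
#include <functional>
#include <vector>

// Hypothetical handles (illustration only).
struct SetupBlock { int level; /* bounding box, block ID, ... */ };
struct Block      { /* a "real" block holding the attached data items (grids, ...) */ };

struct SetupCallbacks
{
    // 1) discard blocks that lie outside the actual simulation domain
    std::function<bool(const SetupBlock&)> discardBlock;
    // 2) decide (possibly iteratively) which blocks to refine further
    std::function<bool(const SetupBlock&)> refineBlock;
    // 3) assign every block to a process (space-filling curve, METIS, ...)
    std::function<void(std::vector<SetupBlock>&, int /*numProcesses*/)> balance;
    // 4) allocate the block data once the partitioning is fixed
    //    (one callback per registered block data item in the real framework)
    std::function<void(Block&)> allocateBlockData;
};

// usage sketch: refine everything up to (hypothetical) level 3
// SetupCallbacks callbacks;
// callbacks.refineBlock = [](const SetupBlock& b) { return b.level < 3; };
```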
different "views" on / representations of the domain partitioning:
• forest of octrees: octrees are not explicitly stored, but implicitly defined via block IDs (a small encoding sketch follows this list)
• 2:1 balanced grid (used for the LBM on refined grids)
• distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
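The implicit encoding can be sketched as follows (a simplified illustration: a marker bit followed by three bits per descent step; the actual waLBerla block IDs additionally encode the root block of the forest and may differ in detail):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical ID layout: a leading marker bit, then 3 bits per descent step
// that select one of the eight children chosen on the way down the octree.
using BlockId = std::uint64_t;

constexpr BlockId kRoot = 1;                         // marker bit only: a root block

inline BlockId child(BlockId id, unsigned octant)    // descend one level
{
    assert(octant < 8);
    return (id << 3) | octant;
}

inline BlockId parent(BlockId id)                    // ascend one level
{
    assert(id != kRoot);
    return id >> 3;
}

inline unsigned level(BlockId id)                    // number of descent steps
{
    unsigned l = 0;
    while (id != kRoot) { id >>= 3; ++l; }
    return l;
}

// Example: child(child(kRoot, 2), 5) == 0b1'010'101 with level 2,
// and parent(0b1'010'101) == 0b1'010 -- no octree nodes are ever stored.
```

Because parent and child are simple shift operations on the ID, refinement relations can be evaluated without ever materializing the octree.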
our parallel implementation [1] of local grid refinement for the LBM based on [2] shows excellent performance:
• simulations with in total close to one trillion cells
• close to one trillion cells updated per second (with 1.8 million threads)
• strong scaling: more than 1000 time steps / sec. → 1 ms per time step
[1] F. Schornbaum and U. Rüde, Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-Uniform Grids, SIAM Journal on Scientific Computing (accepted for publication) [http://arxiv.org/abs/1508.07982]
[2] M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, International Journal for Numerical Methods in Fluids
Dynamic Domain Repartitioning
• AMR Challenges
• Distributed Repartitioning Procedure
• Dynamic Load Balancing
• Benchmarks / Performance Evaluation
AMR Challenges
• challenges because of block-structured partitioning:
  • only entire blocks split/merge (only few blocks per process)
    ⇒ sudden increase/decrease of memory consumption by a factor of 8 (in 3D)
      (→ octree partitioning & same number of cells for every block)
    ⇒ "split first, balance afterwards" probably won't work
  • for the LBM, all levels must be load-balanced separately
⇒ for good scalability, the entire pipeline should rely on perfectly distributed algorithms and data structures
→ no replication of (meta) data of any sort!
Distributed Repartitioning Procedure
[Figure: split and merge of blocks; different colors (green/blue) illustrate the process assignment]
1) split/merge decision: callback function to determine which blocks must split and which blocks may merge (a hypothetical sketch of such a callback follows below)
2) skeleton data structure creation: lightweight blocks (few KiB) with no actual data; 2:1 balance is automatically preserved (forced split to maintain 2:1 balance)
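As an illustration of step 1, a hedged sketch of such a decision callback (the names, the error indicator, and the thresholds are hypothetical; the real callback interface is not shown here): each process inspects only its own blocks and assigns every block a target level.

```cpp
#include <vector>

// Hypothetical per-block view handed to the decision callback.
struct BlockInfo
{
    int    level;           // current refinement level of the block
    double errorIndicator;  // e.g., a velocity-gradient based criterion
    int    targetLevel;     // result: the level this block wants to be on
};

// Mark blocks that must split (targetLevel = level + 1) and blocks that may
// merge (targetLevel = level - 1); everything else keeps its level.
void markBlocks(std::vector<BlockInfo>& localBlocks,
                double refineThreshold, double coarsenThreshold)
{
    for (BlockInfo& b : localBlocks)
    {
        if (b.errorIndicator > refineThreshold)
            b.targetLevel = b.level + 1;   // must split into 8 children
        else if (b.errorIndicator < coarsenThreshold)
            b.targetLevel = b.level - 1;   // may merge with its 7 siblings
        else
            b.targetLevel = b.level;
    }
    // A merge only happens if all 8 siblings agree, and additional splits may
    // still be forced afterwards to maintain the 2:1 balance (step 2).
}
```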
3) load balancing: callback function to decide to which process blocks must migrate (skeleton blocks actually move to this process)
   • lightweight skeleton blocks allow multiple migration steps to different processes (→ enables balancing based on diffusion)
   • links between skeleton blocks and their corresponding real blocks are kept intact when skeleton blocks migrate
   • for global load balancing algorithms, balance is achieved in one step → skeleton blocks immediately migrate to their final processes
   (an illustrative sketch of a skeleton block follows below)
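An illustrative picture of a skeleton block (an assumed data layout, not waLBerla's actual one): a few bytes of metadata plus the link back to the process that still owns the real block, which is what makes repeated, cheap migration during balancing possible.

```cpp
#include <cstdint>

// Illustrative skeleton block: a few bytes of metadata, no grid data.
struct SkeletonBlock
{
    std::uint64_t blockId;     // implicit octree position (cf. the encoding sketch above)
    int           level;       // refinement level after the pending split/merge
    double        weight;      // workload estimate used by the load balancer
    int           ownerRank;   // process that still holds the corresponding real block
};
// Only SkeletonBlock instances travel during load balancing (possibly through
// several intermediate processes). Once balancing has converged, ownerRank is
// the link that tells each final holder where to fetch the real block data
// from in step 4.
```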
4) data migration: links between skeleton blocks and corresponding real blocks are used to perform the actual data migration (includes refinement and coarsening of block data)
   implementation for grid data:
   • coarsening → senders coarsen data before sending to the target process
   • refinement → receivers refine on the target process(es)
   (a minimal coarsen/refine sketch for a scalar field follows below)
key parts customizable via callback functions in order to adapt to the underlying simulation:
1) decision which blocks split/merge
2) dynamic load balancing
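A minimal sketch of the coarsen/refine operations for a scalar cell field (an assumed data layout; the real block data are LBM PDF fields in waLBerla's own field classes): coarsening averages every 2×2×2 group of fine cells before the data are sent, refinement injects each received coarse value into the 2×2×2 fine cells it covers.

```cpp
#include <cstddef>
#include <vector>

// Minimal scalar grid of nx*ny*nz cells, x running fastest.
struct Grid
{
    std::size_t nx, ny, nz;
    std::vector<double> cells;
    double& at(std::size_t x, std::size_t y, std::size_t z)
    { return cells[(z * ny + y) * nx + x]; }
    double  at(std::size_t x, std::size_t y, std::size_t z) const
    { return cells[(z * ny + y) * nx + x]; }
};

// Sender side: average 2x2x2 fine cells into one coarse cell.
Grid coarsen(const Grid& fine)
{
    Grid coarse{fine.nx / 2, fine.ny / 2, fine.nz / 2, {}};
    coarse.cells.assign(coarse.nx * coarse.ny * coarse.nz, 0.0);
    for (std::size_t z = 0; z < coarse.nz; ++z)
        for (std::size_t y = 0; y < coarse.ny; ++y)
            for (std::size_t x = 0; x < coarse.nx; ++x)
            {
                double sum = 0.0;
                for (std::size_t dz = 0; dz < 2; ++dz)
                    for (std::size_t dy = 0; dy < 2; ++dy)
                        for (std::size_t dx = 0; dx < 2; ++dx)
                            sum += fine.at(2 * x + dx, 2 * y + dy, 2 * z + dz);
                coarse.at(x, y, z) = sum / 8.0;
            }
    return coarse;
}

// Receiver side: inject every coarse value into the 2x2x2 fine cells it covers.
Grid refine(const Grid& coarse)
{
    Grid fine{coarse.nx * 2, coarse.ny * 2, coarse.nz * 2, {}};
    fine.cells.assign(fine.nx * fine.ny * fine.nz, 0.0);
    for (std::size_t z = 0; z < fine.nz; ++z)
        for (std::size_t y = 0; y < fine.ny; ++y)
            for (std::size_t x = 0; x < fine.nx; ++x)
                fine.at(x, y, z) = coarse.at(x / 2, y / 2, z / 2);
    return fine;
}
```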
Dynamic Load Balancing
1) space filling curves (Morton or Hilbert):
   • every process needs global knowledge (→ all gather)
     ⇒ scaling issues (even if it's just a few bytes from every process)
2) load balancing based on diffusion (a sketch of one iteration follows this list):
   • iterative procedure (= repeat the following multiple times)
   • communication with neighboring processes only
     ⇒ calculate "flow" for every process-process connection
     ⇒ use this "flow" as a guideline to decide where blocks need to migrate in order to achieve balance
     ⇒ runtime & memory independent of the number of processes (true in practice? → benchmarks)
   • useful extension (benefits outweigh the costs): all reduce to check for early abort & to adapt the "flow"
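A hedged sketch of one diffusion iteration (simplified to a single level; the neighbor lists, weights, and the first-order flow formula alpha · (myLoad − neighborLoad) are assumptions for illustration, not necessarily the exact scheme used in waLBerla):

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// One diffusion iteration for one level (sketch).
// neighbors:    ranks this process shares block faces with
// blockWeights: workload of every local block on this level
// Returns, per neighbor, how much "load" should be shipped there (if positive).
std::vector<double> diffusionFlow(MPI_Comm comm,
                                  const std::vector<int>& neighbors,
                                  const std::vector<double>& blockWeights,
                                  double alpha /* e.g. 1 / (maxDegree + 1) */)
{
    double myLoad = 0.0;
    for (double w : blockWeights) myLoad += w;

    // exchange the current load with every neighboring process
    std::vector<double> neighborLoad(neighbors.size(), 0.0);
    std::vector<MPI_Request> requests(2 * neighbors.size());
    for (std::size_t i = 0; i < neighbors.size(); ++i)
    {
        MPI_Isend(&myLoad, 1, MPI_DOUBLE, neighbors[i], 0, comm, &requests[2 * i]);
        MPI_Irecv(&neighborLoad[i], 1, MPI_DOUBLE, neighbors[i], 0, comm, &requests[2 * i + 1]);
    }
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);

    // positive flow = this much load should leave towards that neighbor
    std::vector<double> flow(neighbors.size(), 0.0);
    for (std::size_t i = 0; i < neighbors.size(); ++i)
        flow[i] = alpha * (myLoad - neighborLoad[i]);
    return flow;
}
// Afterwards, (skeleton) blocks are selected for migration until the positive
// flows are roughly satisfied; repeating the iteration lets the load spread
// out like a diffusion process, using neighbor communication only.
```

The optional all reduce mentioned above can then detect that all flows are (nearly) zero and abort the iteration early.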
LBM AMR - Performance
• Benchmark Environments:
  • JUQUEEN (5.0 PFLOP/s)
    • Blue Gene/Q, 459K cores, 1 GB/core
    • compiler: IBM XL / IBM MPI
  • SuperMUC (2.9 PFLOP/s)
    • Intel Xeon, 147K cores, 2 GB/core
    • compiler: Intel XE / IBM MPI
• Benchmark (LBM D3Q19 TRT):
  [Figure: domain partitioning of the lid-driven cavity benchmark with 4 grid levels]
[Figure: refresh in the lid-driven cavity benchmark, shown step by step: coarsen, refine, uphold 2:1 balance]
during this refresh process …
… all cells on the finest level are coarsened and the same number of fine cells is created by splitting coarser cells
→ 72 % of all cells change their size
avg. blocks/process (max. blocks/process) per level:

level | initially | after refresh | after load balance
  0   | 0.383 (1) | 0.328 (1)     | 0.328 (1)
  1   | 0.656 (1) | 0.875 (9)     | 0.875 (1)
  2   | 1.313 (2) | 3.063 (11)    | 3.063 (4)
  3   | 3.500 (4) | 3.500 (16)    | 3.500 (4)
• SuperMUC – space filling curve: Morton
[Plot: time in seconds (0 to 3.5) required for the entire refresh cycle (uphold 2:1 balance, dynamic load balancing, split/merge blocks, migrate data) vs. number of cores (1024, 8192, 65,536) for 209,671 / 497,000 / 970,703 cells per core, corresponding to 14 / 33 / 64 billion cells at 65,536 cores]
• SuperMUC – diffusion load balancing
[Plot: refresh cycle time in seconds (0 to 3.5) vs. number of cores (1024, 8192, 65,536) for the same three problem sizes (209,671 / 497,000 / 970,703 cells per core, i.e., 14 / 33 / 64 billion cells)]
time almost independent of #processes!
• JUQUEEN – space filling curve: Morton
[Plot: refresh cycle time in seconds (0 to 12) vs. number of cores (256, 4096, 32,768, 458,752) for 31,062 / 127,232 / 429,408 cells per core, corresponding to 14 / 58 / 197 billion cells at 458,752 cores]
hybrid MPI+OpenMP version with SMP: 1 process ⇔ 2 cores ⇔ 8 threads
• JUQUEEN – diffusion load balancing
[Plot: refresh cycle time in seconds (0 to 12) vs. number of cores (256, 4096, 32,768, 458,752) for the same three problem sizes (31,062 / 127,232 / 429,408 cells per core)]
time almost independent of #processes!
• JUQUEEN – diffusion load balancing
[Plot: number of diffusion iterations (0 to 12) until the load is perfectly balanced vs. number of cores (256, 4096, 32,768, 458,752)]
• impact on performance / overhead of the entire dynamic repartitioning procedure?
• depends …
  • … on the number of cells per core
  • … on the actual runtime of the compute kernels (D3Q19 vs. D3Q27, additional force models, etc.)
  • … on how often dynamic repartitioning happens
• previous lid-driven cavity benchmark:
  • overhead ≙ 1 to 3 (diffusion) or 1.5 to 10 (space filling curve) time steps
⇒ In practice, a lot of time† is spent just to determine whether or not the grid must be adapted, i.e., whether or not refinement must take place.
† often the entire overhead of AMR
• AMR for the LBM – example (vocal fold phantom geometry)
  • DNS (direct numerical simulation)
  • Reynolds number: 2500 / D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells / 1 ↔ 5 levels
  • processes: 3584 (on SuperMUC phase 2)
  • runtime: c. 24 h (3 × c. 8 h)
  • load balancer: space filling curve (Hilbert order)
  • time steps: 180,000 / 2,880,000 (finest grid)
  • refresh cycles: 537 (→ refresh every 335 time steps)
  • without refinement: 311 times more memory … and 701 times the workload
Conclusion
Conclusion & Outlook
• the approach for massively parallel grid repartitioning by
  … using a block-structured domain partitioning and
  … employing a lightweight "copy" of the data structure during dynamic load balancing
  is paying off and working extremely well:
  we can handle 10¹¹ cells (> 10¹² unknowns) …
  … with 10⁷ blocks and 1.83 million threads
• resilience (using ULFM): store redundant, in-memory "snapshots" → one/multiple process(es) fail → restore data on different processes → perform dynamic repartitioning → continue :-)
THANK YOU FOR YOUR ATTENTION!
QUESTIONS?