scalability and algorithm-based fault tolerance for plasma physics simulations...
TRANSCRIPT
Scalability and Algorithm-Based Fault Tolerance forPlasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM
Dirk PflugerSimulation of Large Systems, IPVS/SimTech, Universitat Stuttgart
(joint work with M. Heene, A. Hinojosa)
February 2, 2017
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 1
PDE: Turbulence simulations of hot fusion plasmas
[source: ITER project]
Idea: new, CO2-free source of energy for the generations to come
EXAHD with H.-J. Bungartz (TUM), M. Griebel (Bonn), T. Dannert(RZG), F. Jenko (UCLA)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 2
Practically Unlimited Ressources
Contents:
Deuterium in bath tub full of water and Lithium in used laptop batterysuffice for family over 50 years
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 3
Behind the Scenes
Dilute/hot plasmas are (almost) collisionlessNot magneto-hydrodynamic, but kinect description (Vlasov):[
∂
∂t+ ~v
∂
∂~x+
qm
(E +
~vc× B
)∂
∂~v
]f (~x , ~v , t) = 0
Distribution function f (~x , ~v , t)6D in state spaceCoupled to Maxwell equations
Gyrokinetics: remove fast gyromotion(smallest scale)[
∂
∂t+ ~v · ∂
∂~x+ F
∂
∂v||
]f (~x , v||, µ, t) = ∆(f )
5D~v and F are complex expressions, contain evaluation of E and B
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 4
Behind the Scenes
Dilute/hot plasmas are (almost) collisionlessNot magneto-hydrodynamic, but kinect description (Vlasov):[
∂
∂t+ ~v
∂
∂~x+
qm
(E +
~vc× B
)∂
∂~v
]f (~x , ~v , t) = 0
Distribution function f (~x , ~v , t)6D in state spaceCoupled to Maxwell equations
Gyrokinetics: remove fast gyromotion(smallest scale)[
∂
∂t+ ~v · ∂
∂~x+ F
∂
∂v||
]f (~x , v||, µ, t) = ∆(f )
5D~v and F are complex expressions, contain evaluation of E and B
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 4
Numerical Simulations for Actual Tokamaks with GENE
Aim: global simulations of ITER
http://www.genecode.orgDirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 5
Numerical Simulations for Actual Tokamaks with GENE
Goal: global simulation with physical realism
Szenario for simulation of “numerical ITER”Global, non-linear runsAt least 1011 grid points, 106 time steps>1 TB just to store single result in memory (complex)
Possible at all?
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 5
Sparse Grids – Hierarchical ApproachHigh-dimensional problems suffer “curse of dimensionality”
Effort O((2n)d )⇒ too Big Data
Therefore: hierarchical discretizationSparse grids: O(2n · nd−1) [Zenger 91]Makes high-dimensional discretizations possible
full grid
sparse grid sg combination technique
5d, level 10 > 1015
25,416,705 1,876 × 82,000
l1=1 l1=2 l1=3 l1
l2=1
l2=2
l2=3
l2
l1=4
l2=4
–+
Combination technique (multivariate extrapolation-style scheme)Multiple, but smaller grids: O(d · nd−1) problems of size O(2n)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 6
Sparse Grids – Hierarchical ApproachHigh-dimensional problems suffer “curse of dimensionality”
Effort O((2n)d )⇒ too Big DataTherefore: hierarchical discretization
Sparse grids: O(2n · nd−1) [Zenger 91]Makes high-dimensional discretizations possible
full grid sparse grid
sg combination technique
5d, level 10 > 1015 25,416,705
1,876 × 82,000
l1=1 l1=2 l1=3 l1
l2=1
l2=2
l2=3
l2
l1=4
l2=4
–+
Combination technique (multivariate extrapolation-style scheme)Multiple, but smaller grids: O(d · nd−1) problems of size O(2n)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 6
Sparse Grids – Hierarchical ApproachHigh-dimensional problems suffer “curse of dimensionality”
Effort O((2n)d )⇒ too Big DataTherefore: hierarchical discretization
Sparse grids: O(2n · nd−1) [Zenger 91]Makes high-dimensional discretizations possible
full grid sparse grid sg combination technique
5d, level 10 > 1015 25,416,705 1,876 × 82,000
l1=1 l1=2 l1=3 l1
l2=1
l2=2
l2=3
l2
l1=4
l2=4
–+
Combination technique (multivariate extrapolation-style scheme)Multiple, but smaller grids: O(d · nd−1) problems of size O(2n)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 6
Sparse Grid vs. Combination Technique
l1=1 l1=2 l1=3 l1
l2=1
l2=2
l2=3
l2
l1=4
l2=4
l1=1 l1=2 l1=3 l1
l2=1
l2=2
l2=3
l2
l1=4
l2=4
–+
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 7
Overview
1 Motivation and Numerics
2 Scalability
3 Algorithm-Based Fault ToleranceHard FaultsSilent/Soft Faults
4 Summary
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 8
Scalability
Problem of standard solver: global communication within each time-step
Use hierarchical ansatzTwo-level approach
Numerics: decoupling into locally coupled problems
Algorithms: second level of parallelism
First level: no need to scale to exascale
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 9
Scalability
Problem of standard solver: global communication within each time-step
Use hierarchical ansatzTwo-level approach
Numerics: decoupling into locally coupled problems
Algorithms: second level of parallelism
First level: no need to scale to exascale
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 9
Time-Dependent PDEs
Sparse Grid
compute combine compute
Sparse Grid
combine
component grid
component grid
component grid
component grid
component grid
component grid
component grid
component grid
component grid
component grid
component grid
component grid
Gather-scatter steps every time-interval
Remaining reduced global communication
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 10
Global Communication
Optimal communication schemes
hierarchize add
dehierarchize extract
global reduce
distributed
full grid
distributed sparse grid
each component grid
each component grid
each process group
each process group
distributed
hierarchized full grid
distributed
full grid
distributed
hierarchized full grid
global communication
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 11
Global Communication
Minimize number of communications (Range Query Trees):
O(log(dnd−1))×O(2nnd−1)
Minimize package size
O(2n · nd−1)×O(2n−1)
Derivation in BSP/PEM model
1e-05
0.0001
0.001
0.01
0.1
1
10
100
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Tim
e [s
]
Sparse Grid Level n
Hermit: d = 3, no boundary, min. level fixedSG Red.SG Red.
Subsp. Red.Subsp. Red.
Non-block. Subsp. Red.Parallel Subsp. Red.Parallel Subsp. Red.
Non-block. Parallel Subsp. Red.
[joint work with R. Jacob (ITU, Algorithm Engineering)]
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 11
Runtimes on Hazel Hen
104 105
total #processes
100
101
102
runti
me
[s]
hierarchizationnprocs 1024
nprocs 2048
nprocs 4096
nprocs 8192
104 105
total #processes
10−2
10−1
100local reduction
nprocs 1024
nprocs 2048
nprocs 4096
nprocs 8192
4096 8192 16384 32768 65536 180224
total #processes
10−1
100
101
runti
me
[s]
global reduction
nprocs 1024
nprocs 2048
nprocs 4096
nprocs 8192
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 12
Runtimes on Hazel Hen
Total time
4096 8192 16384 32768 65536 180224
total #processes
100
101
102
103
runti
me
[s]
hierarchization + loc. reduction + glob. reduction
nprocs 1024
nprocs 2048
nprocs 4096
nprocs 8192
GENE 1 time step
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 12
Overview
1 Motivation and Numerics
2 Scalability
3 Algorithm-Based Fault ToleranceHard FaultsSilent/Soft Faults
4 Summary
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 13
Resilience for the Exa-Age
Ever decreasing mean time between failureMassive replication of hardware
Smaller scales (higher integration)
Hardware possibly with less checks
. . .
Two categories:1 Hard faults2 Soft/silent faults
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 14
Resilience for the Exa-Age
Ever decreasing mean time between failureMassive replication of hardware
Smaller scales (higher integration)
Hardware possibly with less checks
. . .
Two categories:1 Hard faults2 Soft/silent faults
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 14
Hard Faults
Errors that trigger signals to the userNode, OS, network or process failureSoftware crashes
⇒ Default MPI response: abort application
SolutionsRecompute (checkpoint-restart)
Checkpoint on HD / RAMLosslessExpensive storage/communication operationsRestart even more expensive
Continue w/o recomputationRequires adapted numerical schemesNo/minor extra computational effortLossy
⇒ algorithm-based fault-tolerance (ABFT)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 15
Hard Faults
Errors that trigger signals to the userNode, OS, network or process failureSoftware crashes
⇒ Default MPI response: abort application
SolutionsRecompute (checkpoint-restart)
Checkpoint on HD / RAMLosslessExpensive storage/communication operationsRestart even more expensive
Continue w/o recomputationRequires adapted numerical schemesNo/minor extra computational effortLossy
⇒ algorithm-based fault-tolerance (ABFT)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 15
Hard Faults
Errors that trigger signals to the userNode, OS, network or process failureSoftware crashes
⇒ Default MPI response: abort application
SolutionsRecompute (checkpoint-restart)
Checkpoint on HD / RAMLosslessExpensive storage/communication operationsRestart even more expensive
Continue w/o recomputationRequires adapted numerical schemesNo/minor extra computational effortLossy
⇒ algorithm-based fault-tolerance (ABFT)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 15
Communication Scheme
Master-worker model
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 16
Software Stack
Fault simulation layer
Implements interface of ULFMplus kill_me() functionality
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 17
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions
while not converged do
for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)
u(c)n ← reduce(ciui); // Combine solutions
for all i ∈ In,q,τ doui← scatter(u(c)
n ); // Sample each uifrom new u(c)n
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 18
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions
while not converged do
for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)
u(c)n ← reduce(ciui); // Combine solutions
for all i ∈ In,q,τ doui← scatter(u(c)
n ); // Sample each uifrom new u(c)n
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 18
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions
while not converged do
for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)
mitigateFaults(); // Mitigate faults
u(c)n ← reduce(ciui); // Combine solutions
for all i ∈ In,q,τ doui← scatter(u(c)
n ); // Sample each uifrom new u(c)n
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 18
2D Example
1
2
3
61 2 3 4 5
6
5
4
2
1
l
l
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 19
2D Example
1
2
3
4
5
1 2 3 4 5 6
6
1
2i
i
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 19
ABFT: Fault-Tolerant Combination Technique
Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients
uc~n(~x) =
d−1∑q=0
(−1)q(
d − 1q
) ∑~l∈I~n,q
u~l (~x)
In case of failure: use inclusion-exclusion principle to determine adaptedcombination
1 Solve generalized coefficient problem (GCP):
maxw
Q′(w), Q′(w) :=∑l∈I↓
4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓
2 Obtain new combination coefficients:
cl = (M−1w)l
Extra computations only on lower scales required
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 20
ABFT: Fault-Tolerant Combination Technique
Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients
uc~n(~x) =
d−1∑q=0
(−1)q(
d − 1q
) ∑~l∈I~n,q
u~l (~x)
In case of failure: use inclusion-exclusion principle to determine adaptedcombination
1 Solve generalized coefficient problem (GCP):
maxw
Q′(w), Q′(w) :=∑l∈I↓
4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓
2 Obtain new combination coefficients:
cl = (M−1w)l
Extra computations only on lower scales required
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 20
ABFT: Fault-Tolerant Combination Technique
Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients
uc~n(~x) =
d−1∑q=0
(−1)q(
d − 1q
) ∑~l∈I~n,q
u~l (~x)
In case of failure: use inclusion-exclusion principle to determine adaptedcombination
1 Solve generalized coefficient problem (GCP):
maxw
Q′(w), Q′(w) :=∑l∈I↓
4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓
2 Obtain new combination coefficients:
cl = (M−1w)l
Extra computations only on lower scales required
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 20
ABFT: Fault-Tolerant Combination Technique
Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients
uc~n(~x) =
d−1∑q=0
(−1)q(
d − 1q
) ∑~l∈I~n,q
u~l (~x)
In case of failure: use inclusion-exclusion principle to determine adaptedcombination
1 Solve generalized coefficient problem (GCP):
maxw
Q′(w), Q′(w) :=∑l∈I↓
4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓
2 Obtain new combination coefficients:
cl = (M−1w)l
Extra computations only on lower scales required
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 20
GCP: 2D Example
1
2
3
4
5
1 2 3 4 5 6
6
1
2i
i
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 21
GCP: 2D Example
1
2
3
4
5
6
61 2 3 4 5 1
2
i
i
1
2
3
4
5
1 2 3 4 5 6
6
1
2i
i
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 21
GCP: 2D Example
1
2
3
4
5
6
61 2 3 4 5 1
2
i
i
1
2
3
4
5
1 2 3 4 5 6
6
1
2
i
i
1
2
3
4
5
1 2 3 4 5 6
6
1
2i
i
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 21
GCP: 2D Example
1
2
3
4
5
6
61 2 3 4 5 1
2
i
i
1
2
3
4
5
1 2 3 4 5 6
6
1
2
i
i
1
2
3
4
5
1 2 3 4 5 6
6
1
2
i
i
1
2
3
4
5
1 2 3 4 5 6
6
1
2i
i
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 21
GCP: Example 3D
No faults
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 22
GCP: Example 3D
2 faults
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 22
GCP: Example 3D
For arbitrary faults: GCP prohibitively expensive
Fast solution possible if enough extra grid layers available
Only fraction of computational effort⇒ faults in lower layers unlikely
Precompute some extra layers in advance
0 1 2 3 4
q0
5
10
15
20
25
#gr
ids
21
15
10
6
3
CT
Extra
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 23
Results Using GENE
Example:
Good reconstruction (visual inspection)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 24
Results Using GENE (2)
Small (reduced) problem4D: x , z, µ, v‖~lmin = [2, 3, 2, 4],~l = [6, 7, 6, 8]⇒ 69 combination grids
5.0% 10.0% 15.0% 20.0% 25.0%
Faults
10−6
10−5
10−4
10−3
10−2
L2
erro
rn
orm
Excellent recovery properties!
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 25
Computational Effort
Accumulated timeto compute partial grids
0 1 2 3 4
q
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
tim
e(s
)
101
CT
Extra
×
Gain by ABFT
1 2 3 4 5
Faults
0
1
2
3
4
5
6
tim
e(s
)
Restarting from checkpoint
With recombination
Significant savings in runtime
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 26
Silent/Soft Faults
No signal to userFaults unnoticed unless searched forMost common type: Silent Data Corruption (SDC)Errors in arithmetic operations, memory corruption, bit flips
1 0 1 0 1 1 1 0 1
1 0 1 0 0 1 1 0 1
Common solutionsChecksumsReplication (process/data)
⇒ Significant overhead (effort, resources)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 27
Silent/Soft Faults
No signal to userFaults unnoticed unless searched forMost common type: Silent Data Corruption (SDC)Errors in arithmetic operations, memory corruption, bit flips
1 0 1 0 1 1 1 0 1
1 0 1 0 0 1 1 0 1
Common solutionsChecksumsReplication (process/data)
⇒ Significant overhead (effort, resources)
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 27
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions
while not converged do
for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)
mitigateFaults(); // Mitigate faults
u(c)n ← reduce(ciui); // Combine solutions
for all i ∈ In,q,τ doui← scatter(u(c)
n ); // Sample each uifrom new u(c)n
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 28
Selective Reliability
Focus on critical parts
Algorithm: The Combination Technique in Parallel
for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions
while not converged do
for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)
checkForSDC(); // Cheap sanity check
mitigateFaults(); // Mitigate faults
u(c)n ← reduce(ciui); // Combine solutions
for all i ∈ In,q,τ doui← scatter(u(c)
n ); // Sample each uifrom new u(c)n
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 28
Silent/Soft Faults
Exploit hierarchical approachSimilar discretizations lead to similar results
Exploit redundancy and hierarchical representation to check for faults
Detection of outliers possible
Direct integration into communication schemes possible (SubspaceReduce)
−
+
Component
Grids
Sparse
GridHierarchical Increment Spaces
of the Sparse Grid
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 29
SDC Check: Compare Pairs of Solutions
Similar discretizations should lead to similar results
Technische Universitat Munchen • Department of Informatics • Chair of Scientific Computing
SDC Check 1: Compare pairs of solutions
1
2
3
4
1 2 3 4
Pair β(s,t)
(2, 4) (2, 3) 3.98e-01(3, 2) (2, 4) 1.11e+00(4, 2) (2, 3) 1.11e+00(3, 2) (2, 3) 6.32e-01(3, 3) (4, 2) 9.85e+05(3, 3) (2, 3) 1.07e+06(3, 2) (4, 2) 3.98e-01(2, 4) (4, 2) 1.27e+00(3, 2) (3, 3) 1.07e+06(2, 4) (3, 3) 9.85e+05
β(s,t) := maxl≤s∧t
maxj∈Il
|α(t)l,j − α
(s)l,j |
min|α(t)
l,j |, |α(s)l,j |
A. Parra Hinojosa: SDC-Resilient Algorithms Using the Sparse Grid Combination Technique
SPPEXA 2016 12
Pair β(s,t)
(2, 4) (2, 3) 3.98e-01(3, 2) (2, 4) 1.11e+00(4, 2) (2, 3) 1.11e+00(3, 2) (2, 3) 6.32e-01(3, 3) (4, 2) 9.85e+05(3, 3) (2, 3) 1.07e+06(3, 2) (4, 2) 3.98e-01(2, 4) (4, 2) 1.27e+00(3, 2) (3, 3) 1.07e+06(2, 4) (3, 3) 9.85e+05
β(s,t) := maxl≤s∧t
maxj∈Il
|α(t)l,j − α
(s)l,j |
min|α(t)
l,j |, |α(s)l,j |
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 30
SDC Check: Outlier detection
Technische Universitat Munchen • Department of Informatics • Chair of Scientific Computing
SDC Check 2: Outlier detection
u(0,0) = [1.002,5.356,0.998,1.002,1.001,1.001, .999]
A. Parra Hinojosa: SDC-Resilient Algorithms Using the Sparse Grid Combination Technique
SPPEXA 2016 13
u(0, 0) = [1.002, 5.356, 0.998, 1.002, 1.001, 1.001, .999]
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 31
2D Example
Advection equation
∂u∂t
+ cx∂u∂x
+ cy∂u∂x
= 0 Ω = [0, 1]2
Periodic boundary conditions
Constant advection velocities cx , cy
Initial condition u(x , y , t = 0) = sin(2πx) sin(2πy)
Lax-Wendroff scheme (2nd order space + time)
Error/solution at t = 0.5 compared to analytical solution
u(x , y , t) = sin(2π(x − cx t)) sin(2π(y − cY t))
Corruption of one single data point in initial condition
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 32
2D Example
0 5 10 15 20 25 300
5
10
15
20
25
30
Exact solution
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
0 5 10 15 20 25 300
5
10
15
20
25
30
Full Grid
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
0 5 10 15 20 25 300
5
10
15
20
25
30
Combined grid
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 33
2D Example
0 5 10 15 20 25 300
5
10
15
20
25
30
Exact solution
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
0 5 10 15 20 25 300
5
10
15
20
25
30
Full Grid
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
0 5 10 15 20 25 300
5
10
15
20
25
30
Combined grid
−2.4
−1.8
−1.2
−0.6
0.0
0.6
1.2
1.8
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 33
2D Example
0 5 10 15 20 25 300
5
10
15
20
25
30
Exact solution
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
0 5 10 15 20 25 300
5
10
15
20
25
30
Full Grid
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
0 5 10 15 20 25 300
5
10
15
20
25
30
Combined grid
−2.4
−1.8
−1.2
−0.6
0.0
0.6
1.2
1.8
0 5 10 15 20 25 300
5
10
15
20
25
30
Recovered
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 33
2D Example: Simulated Soft Faults
Inserting one soft faultMeasuring L2-error at the end
0 100 200 300 400 50010−510−410−310−210−1
100101102103104105106
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 105
Full Grid CT, no SDC CT, with SDC CT, recovered
0 100 200 300 400 50010−5
10−4
10−3
10−2
10−1
100
101
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 10−0.5
0 100 200 300 400 500
Timestep where fault occurs, lowest hierarchical subspace
10−5
10−4
10−3
10−2
10−1
100
101
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 10−300
0 100 200 300 400 50010−510−410−310−210−1
100101102103104105106
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 105
0 100 200 300 400 50010−5
10−4
10−3
10−2
10−1
100
101
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 10−0.5
0 100 200 300 400 500
Timestep where fault occurs, highest hierarchical subspace
10−5
10−4
10−3
10−2
10−1
100
101
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 10−300
0 100 200 300 400 50010−510−410−310−210−1
100101102103104105106
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 105
0 100 200 300 400 50010−5
10−4
10−3
10−2
10−1
100
101
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 10−0.5
0 100 200 300 400 500
Timestep where fault occurs, highest hierarchical subspace
10−5
10−4
10−3
10−2
10−1
100
101
ui(xl1,j1, xl2,j2
) = ui(xl1,j1, xl2,j2
)× 10−300
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 34
Higher-D: Advection-Diffusion Equation
∂tu −∆u + ~a · ∇u = f in Ω× [0,T )
u(·, t) = 0 in ∂Ω
u(·, 0) = u0 in Ω
Ω = [0, 1]d ,~a = (1, . . . , 1)T , u0 = e−100∑d
i=1(xi−0.5)2
Implemented in DUNE-pdelab
FVM, explicit time integration
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 35
Results
Fault in second time stepRelative error w.r.t. full-grid solution (n = 11 in 2D, n = 7 in 5D)Computations on Hazel Hen (HLRS)2D, 5D:
(6,6) (7,7) (8,8) (9,9) (10,10)
n
10-2
10-1
l 2-e
rror
no faults
2 groups (~50% faults)
4 groups (~25% faults)
8 groups (~12% faults)
16 groups (~6% faults)
(4,4,4,4,4) (5,5,5,5,5) (6,6,6,6,6)
n
10-1
100
l 2-e
rror
no faults
2 groups (~50% faults)
4 groups (~25% faults)
Again: excellent recovery properties!
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 36
Overview
1 Motivation and Numerics
2 Scalability
3 Algorithm-Based Fault ToleranceHard FaultsSilent/Soft Faults
4 Summary
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 37
Summary
GyrokineticsHigh-dimensional problem with urgent need for compute resources
Sparse grids: ”Too Big Data”⇒ Big Data
Hierarchical multilevel splitting provides novel handles on exa-challenges
Scalability2nd level of parallelismNumerical decoupling, extrapolationExploit hierarchical splitting for optimal communication
ABFT at low costExploit hierarchical schemeRecombination rather than recomputation
Silent faultsExploit underlying hierarchical basisDetection and treatment of silent faults possible
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 38
Summary
GyrokineticsHigh-dimensional problem with urgent need for compute resources
Sparse grids: ”Too Big Data”⇒ Big Data
Hierarchical multilevel splitting provides novel handles on exa-challenges
Scalability2nd level of parallelismNumerical decoupling, extrapolationExploit hierarchical splitting for optimal communication
ABFT at low costExploit hierarchical schemeRecombination rather than recomputation
Silent faultsExploit underlying hierarchical basisDetection and treatment of silent faults possible
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 38
Summary
GyrokineticsHigh-dimensional problem with urgent need for compute resources
Sparse grids: ”Too Big Data”⇒ Big Data
Hierarchical multilevel splitting provides novel handles on exa-challenges
Scalability reduce data in communication2nd level of parallelismNumerical decoupling, extrapolationExploit hierarchical splitting for optimal communication
ABFT at low cost avoid data storage and I/OExploit hierarchical schemeRecombination rather than recomputation
Silent faults limit communicated dataExploit underlying hierarchical basisDetection and treatment of silent faults possible
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 38
Thanks to:
. . . and all others!
Thank you for your interest!
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 39
Thanks to:
. . . and all others!
Thank you for your interest!
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 39
Mario Heene, Alfredo Parra Hinojosa, Hans-Joachim Bungartz, and Dirk Pfluger.A massively-parallel, fault-tolerant solver for time-dependent pdes in high dimensions.In Euro-Par 2016, Grenoble, June 2016.Accepted.
Philipp Hupp, Mario Heene, Riko Jacob, and Dirk Pfluger.Global communication schemes for the numerical solution of high-dimensional PDEs.Parallel Computing, 52:78 – 105, 2016.
Mario Heene and Dirk Pfluger.Scalable algorithms for the solution of higher-dimensional PDEs.In Software for Exascale Computing-SPPEXA 2013-2015, pages 165–186. Springer InternationalPublishing, 2016.
Alfredo Parra Hinojosa, Christoph Kowitz, Mario Heene, Dirk Pfluger, and Hans-Joachim Bungartz.Towards a fault-tolerant, scalable implementation of GENE.In Recent Trends in Computational Engineering-CE2014, pages 47–65. Springer International Publishing,2015.
Alfredo Parra Hinojosa, Brendan Harding, Hegland Markus, and Hans-Joachim Bungartz.Handling silent data corruption with the sparse grid combination technique.In Proceedings of the SPPEXA Symposium, Lecture Notes in Computational Science and Engineering.Springer-Verlag, February 2016.
Dirk Pfluger, Hans-Joachim Bungartz, Michael Griebel, Frank Jenko, Tilman Dannert, Mario Heene,Alfredo Parra Hinojosa, Christoph Kowitz, and Peter Zaspel.EXAHD: An exa-scalable two-level sparse grid approach for higher-dimensional problems in plasmaphysics and beyond.In Euro-Par 2014 Workshop, Part II, volume 8806 of Lecture Notes in Computer Science, pages 566–577.Springer-Verlag, December 2014.
Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE
Big Data Meets Computation @ IPAM, February 2, 2017 40