scalability and algorithm-based fault tolerance for plasma physics simulations...

67
Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE Big Data Meets Computation @ IPAM Dirk Pfl ¨ uger Simulation of Large Systems, IPVS/SimTech, Universit¨ at Stuttgart (joint work with M. Heene, A. Hinojosa) February 2, 2017 Dirk Pfl ¨ uger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE Big Data Meets Computation @ IPAM, February 2, 2017 1

Upload: others

Post on 12-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Scalability and Algorithm-Based Fault Tolerance forPlasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM

Dirk PflugerSimulation of Large Systems, IPVS/SimTech, Universitat Stuttgart

(joint work with M. Heene, A. Hinojosa)

February 2, 2017

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 1

Page 2: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

PDE: Turbulence simulations of hot fusion plasmas

[source: ITER project]

Idea: new, CO2-free source of energy for the generations to come

EXAHD with H.-J. Bungartz (TUM), M. Griebel (Bonn), T. Dannert(RZG), F. Jenko (UCLA)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 2

Page 3: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Practically Unlimited Ressources

Contents:

Deuterium in bath tub full of water and Lithium in used laptop batterysuffice for family over 50 years

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 3

Page 4: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Behind the Scenes

Dilute/hot plasmas are (almost) collisionlessNot magneto-hydrodynamic, but kinect description (Vlasov):[

∂t+ ~v

∂~x+

qm

(E +

~vc× B

)∂

∂~v

]f (~x , ~v , t) = 0

Distribution function f (~x , ~v , t)6D in state spaceCoupled to Maxwell equations

Gyrokinetics: remove fast gyromotion(smallest scale)[

∂t+ ~v · ∂

∂~x+ F

∂v||

]f (~x , v||, µ, t) = ∆(f )

5D~v and F are complex expressions, contain evaluation of E and B

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 4

Page 5: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Behind the Scenes

Dilute/hot plasmas are (almost) collisionlessNot magneto-hydrodynamic, but kinect description (Vlasov):[

∂t+ ~v

∂~x+

qm

(E +

~vc× B

)∂

∂~v

]f (~x , ~v , t) = 0

Distribution function f (~x , ~v , t)6D in state spaceCoupled to Maxwell equations

Gyrokinetics: remove fast gyromotion(smallest scale)[

∂t+ ~v · ∂

∂~x+ F

∂v||

]f (~x , v||, µ, t) = ∆(f )

5D~v and F are complex expressions, contain evaluation of E and B

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 4

Page 6: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Numerical Simulations for Actual Tokamaks with GENE

Aim: global simulations of ITER

http://www.genecode.orgDirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 5

Page 7: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Numerical Simulations for Actual Tokamaks with GENE

Goal: global simulation with physical realism

Szenario for simulation of “numerical ITER”Global, non-linear runsAt least 1011 grid points, 106 time steps>1 TB just to store single result in memory (complex)

Possible at all?

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 5

Page 8: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Sparse Grids – Hierarchical ApproachHigh-dimensional problems suffer “curse of dimensionality”

Effort O((2n)d )⇒ too Big Data

Therefore: hierarchical discretizationSparse grids: O(2n · nd−1) [Zenger 91]Makes high-dimensional discretizations possible

full grid

sparse grid sg combination technique

5d, level 10 > 1015

25,416,705 1,876 × 82,000

l1=1 l1=2 l1=3 l1

l2=1

l2=2

l2=3

l2

l1=4

l2=4

–+

Combination technique (multivariate extrapolation-style scheme)Multiple, but smaller grids: O(d · nd−1) problems of size O(2n)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 6

Page 9: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Sparse Grids – Hierarchical ApproachHigh-dimensional problems suffer “curse of dimensionality”

Effort O((2n)d )⇒ too Big DataTherefore: hierarchical discretization

Sparse grids: O(2n · nd−1) [Zenger 91]Makes high-dimensional discretizations possible

full grid sparse grid

sg combination technique

5d, level 10 > 1015 25,416,705

1,876 × 82,000

l1=1 l1=2 l1=3 l1

l2=1

l2=2

l2=3

l2

l1=4

l2=4

–+

Combination technique (multivariate extrapolation-style scheme)Multiple, but smaller grids: O(d · nd−1) problems of size O(2n)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 6

Page 10: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Sparse Grids – Hierarchical ApproachHigh-dimensional problems suffer “curse of dimensionality”

Effort O((2n)d )⇒ too Big DataTherefore: hierarchical discretization

Sparse grids: O(2n · nd−1) [Zenger 91]Makes high-dimensional discretizations possible

full grid sparse grid sg combination technique

5d, level 10 > 1015 25,416,705 1,876 × 82,000

l1=1 l1=2 l1=3 l1

l2=1

l2=2

l2=3

l2

l1=4

l2=4

–+

Combination technique (multivariate extrapolation-style scheme)Multiple, but smaller grids: O(d · nd−1) problems of size O(2n)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 6

Page 11: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Sparse Grid vs. Combination Technique

l1=1 l1=2 l1=3 l1

l2=1

l2=2

l2=3

l2

l1=4

l2=4

l1=1 l1=2 l1=3 l1

l2=1

l2=2

l2=3

l2

l1=4

l2=4

–+

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 7

Page 12: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Overview

1 Motivation and Numerics

2 Scalability

3 Algorithm-Based Fault ToleranceHard FaultsSilent/Soft Faults

4 Summary

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 8

Page 13: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Scalability

Problem of standard solver: global communication within each time-step

Use hierarchical ansatzTwo-level approach

Numerics: decoupling into locally coupled problems

Algorithms: second level of parallelism

First level: no need to scale to exascale

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 9

Page 14: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Scalability

Problem of standard solver: global communication within each time-step

Use hierarchical ansatzTwo-level approach

Numerics: decoupling into locally coupled problems

Algorithms: second level of parallelism

First level: no need to scale to exascale

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 9

Page 15: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Time-Dependent PDEs

Sparse Grid

compute combine compute

Sparse Grid

combine

component grid

component grid

component grid

component grid

component grid

component grid

component grid

component grid

component grid

component grid

component grid

component grid

Gather-scatter steps every time-interval

Remaining reduced global communication

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 10

Page 16: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Global Communication

Optimal communication schemes

hierarchize add

dehierarchize extract

global reduce

distributed

full grid

distributed sparse grid

each component grid

each component grid

each process group

each process group

distributed

hierarchized full grid

distributed

full grid

distributed

hierarchized full grid

global communication

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 11

Page 17: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Global Communication

Minimize number of communications (Range Query Trees):

O(log(dnd−1))×O(2nnd−1)

Minimize package size

O(2n · nd−1)×O(2n−1)

Derivation in BSP/PEM model

1e-05

0.0001

0.001

0.01

0.1

1

10

100

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

Tim

e [s

]

Sparse Grid Level n

Hermit: d = 3, no boundary, min. level fixedSG Red.SG Red.

Subsp. Red.Subsp. Red.

Non-block. Subsp. Red.Parallel Subsp. Red.Parallel Subsp. Red.

Non-block. Parallel Subsp. Red.

[joint work with R. Jacob (ITU, Algorithm Engineering)]

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 11

Page 18: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Runtimes on Hazel Hen

104 105

total #processes

100

101

102

runti

me

[s]

hierarchizationnprocs 1024

nprocs 2048

nprocs 4096

nprocs 8192

104 105

total #processes

10−2

10−1

100local reduction

nprocs 1024

nprocs 2048

nprocs 4096

nprocs 8192

4096 8192 16384 32768 65536 180224

total #processes

10−1

100

101

runti

me

[s]

global reduction

nprocs 1024

nprocs 2048

nprocs 4096

nprocs 8192

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 12

Page 19: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Runtimes on Hazel Hen

Total time

4096 8192 16384 32768 65536 180224

total #processes

100

101

102

103

runti

me

[s]

hierarchization + loc. reduction + glob. reduction

nprocs 1024

nprocs 2048

nprocs 4096

nprocs 8192

GENE 1 time step

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 12

Page 20: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Overview

1 Motivation and Numerics

2 Scalability

3 Algorithm-Based Fault ToleranceHard FaultsSilent/Soft Faults

4 Summary

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 13

Page 21: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Resilience for the Exa-Age

Ever decreasing mean time between failureMassive replication of hardware

Smaller scales (higher integration)

Hardware possibly with less checks

. . .

Two categories:1 Hard faults2 Soft/silent faults

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 14

Page 22: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Resilience for the Exa-Age

Ever decreasing mean time between failureMassive replication of hardware

Smaller scales (higher integration)

Hardware possibly with less checks

. . .

Two categories:1 Hard faults2 Soft/silent faults

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 14

Page 23: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Hard Faults

Errors that trigger signals to the userNode, OS, network or process failureSoftware crashes

⇒ Default MPI response: abort application

SolutionsRecompute (checkpoint-restart)

Checkpoint on HD / RAMLosslessExpensive storage/communication operationsRestart even more expensive

Continue w/o recomputationRequires adapted numerical schemesNo/minor extra computational effortLossy

⇒ algorithm-based fault-tolerance (ABFT)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 15

Page 24: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Hard Faults

Errors that trigger signals to the userNode, OS, network or process failureSoftware crashes

⇒ Default MPI response: abort application

SolutionsRecompute (checkpoint-restart)

Checkpoint on HD / RAMLosslessExpensive storage/communication operationsRestart even more expensive

Continue w/o recomputationRequires adapted numerical schemesNo/minor extra computational effortLossy

⇒ algorithm-based fault-tolerance (ABFT)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 15

Page 25: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Hard Faults

Errors that trigger signals to the userNode, OS, network or process failureSoftware crashes

⇒ Default MPI response: abort application

SolutionsRecompute (checkpoint-restart)

Checkpoint on HD / RAMLosslessExpensive storage/communication operationsRestart even more expensive

Continue w/o recomputationRequires adapted numerical schemesNo/minor extra computational effortLossy

⇒ algorithm-based fault-tolerance (ABFT)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 15

Page 26: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Communication Scheme

Master-worker model

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 16

Page 27: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Software Stack

Fault simulation layer

Implements interface of ULFMplus kill_me() functionality

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 17

Page 28: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Selective Reliability

Focus on critical parts

Algorithm: The Combination Technique in Parallel

for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions

while not converged do

for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)

u(c)n ← reduce(ciui); // Combine solutions

for all i ∈ In,q,τ doui← scatter(u(c)

n ); // Sample each uifrom new u(c)n

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 18

Page 29: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Selective Reliability

Focus on critical parts

Algorithm: The Combination Technique in Parallel

for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions

while not converged do

for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)

u(c)n ← reduce(ciui); // Combine solutions

for all i ∈ In,q,τ doui← scatter(u(c)

n ); // Sample each uifrom new u(c)n

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 18

Page 30: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Selective Reliability

Focus on critical parts

Algorithm: The Combination Technique in Parallel

for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions

while not converged do

for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)

mitigateFaults(); // Mitigate faults

u(c)n ← reduce(ciui); // Combine solutions

for all i ∈ In,q,τ doui← scatter(u(c)

n ); // Sample each uifrom new u(c)n

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 18

Page 31: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

2D Example

1

2

3

61 2 3 4 5

6

5

4

2

1

l

l

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 19

Page 32: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

2D Example

1

2

3

4

5

1 2 3 4 5 6

6

1

2i

i

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 19

Page 33: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

ABFT: Fault-Tolerant Combination Technique

Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients

uc~n(~x) =

d−1∑q=0

(−1)q(

d − 1q

) ∑~l∈I~n,q

u~l (~x)

In case of failure: use inclusion-exclusion principle to determine adaptedcombination

1 Solve generalized coefficient problem (GCP):

maxw

Q′(w), Q′(w) :=∑l∈I↓

4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓

2 Obtain new combination coefficients:

cl = (M−1w)l

Extra computations only on lower scales required

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 20

Page 34: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

ABFT: Fault-Tolerant Combination Technique

Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients

uc~n(~x) =

d−1∑q=0

(−1)q(

d − 1q

) ∑~l∈I~n,q

u~l (~x)

In case of failure: use inclusion-exclusion principle to determine adaptedcombination

1 Solve generalized coefficient problem (GCP):

maxw

Q′(w), Q′(w) :=∑l∈I↓

4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓

2 Obtain new combination coefficients:

cl = (M−1w)l

Extra computations only on lower scales required

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 20

Page 35: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

ABFT: Fault-Tolerant Combination Technique

Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients

uc~n(~x) =

d−1∑q=0

(−1)q(

d − 1q

) ∑~l∈I~n,q

u~l (~x)

In case of failure: use inclusion-exclusion principle to determine adaptedcombination

1 Solve generalized coefficient problem (GCP):

maxw

Q′(w), Q′(w) :=∑l∈I↓

4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓

2 Obtain new combination coefficients:

cl = (M−1w)l

Extra computations only on lower scales required

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 20

Page 36: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

ABFT: Fault-Tolerant Combination Technique

Find alternative combination, exclude missing solutionsStarting point: standard CT coefficients

uc~n(~x) =

d−1∑q=0

(−1)q(

d − 1q

) ∑~l∈I~n,q

u~l (~x)

In case of failure: use inclusion-exclusion principle to determine adaptedcombination

1 Solve generalized coefficient problem (GCP):

maxw

Q′(w), Q′(w) :=∑l∈I↓

4−‖i‖1 wl , s.t. wl ∈ 0, 1 ∀l ∈ I ↓

2 Obtain new combination coefficients:

cl = (M−1w)l

Extra computations only on lower scales required

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 20

Page 37: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

GCP: 2D Example

1

2

3

4

5

1 2 3 4 5 6

6

1

2i

i

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 21

Page 38: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

GCP: 2D Example

1

2

3

4

5

6

61 2 3 4 5 1

2

i

i

1

2

3

4

5

1 2 3 4 5 6

6

1

2i

i

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 21

Page 39: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

GCP: 2D Example

1

2

3

4

5

6

61 2 3 4 5 1

2

i

i

1

2

3

4

5

1 2 3 4 5 6

6

1

2

i

i

1

2

3

4

5

1 2 3 4 5 6

6

1

2i

i

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 21

Page 40: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

GCP: 2D Example

1

2

3

4

5

6

61 2 3 4 5 1

2

i

i

1

2

3

4

5

1 2 3 4 5 6

6

1

2

i

i

1

2

3

4

5

1 2 3 4 5 6

6

1

2

i

i

1

2

3

4

5

1 2 3 4 5 6

6

1

2i

i

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 21

Page 41: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

GCP: Example 3D

No faults

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 22

Page 42: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

GCP: Example 3D

2 faults

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 22

Page 43: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

GCP: Example 3D

For arbitrary faults: GCP prohibitively expensive

Fast solution possible if enough extra grid layers available

Only fraction of computational effort⇒ faults in lower layers unlikely

Precompute some extra layers in advance

0 1 2 3 4

q0

5

10

15

20

25

#gr

ids

21

15

10

6

3

CT

Extra

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 23

Page 44: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Results Using GENE

Example:

Good reconstruction (visual inspection)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 24

Page 45: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Results Using GENE (2)

Small (reduced) problem4D: x , z, µ, v‖~lmin = [2, 3, 2, 4],~l = [6, 7, 6, 8]⇒ 69 combination grids

5.0% 10.0% 15.0% 20.0% 25.0%

Faults

10−6

10−5

10−4

10−3

10−2

L2

erro

rn

orm

Excellent recovery properties!

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 25

Page 46: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Computational Effort

Accumulated timeto compute partial grids

0 1 2 3 4

q

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

tim

e(s

)

101

CT

Extra

×

Gain by ABFT

1 2 3 4 5

Faults

0

1

2

3

4

5

6

tim

e(s

)

Restarting from checkpoint

With recombination

Significant savings in runtime

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 26

Page 47: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Silent/Soft Faults

No signal to userFaults unnoticed unless searched forMost common type: Silent Data Corruption (SDC)Errors in arithmetic operations, memory corruption, bit flips

1 0 1 0 1 1 1 0 1

1 0 1 0 0 1 1 0 1

Common solutionsChecksumsReplication (process/data)

⇒ Significant overhead (effort, resources)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 27

Page 48: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Silent/Soft Faults

No signal to userFaults unnoticed unless searched forMost common type: Silent Data Corruption (SDC)Errors in arithmetic operations, memory corruption, bit flips

1 0 1 0 1 1 1 0 1

1 0 1 0 0 1 1 0 1

Common solutionsChecksumsReplication (process/data)

⇒ Significant overhead (effort, resources)

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 27

Page 49: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Selective Reliability

Focus on critical parts

Algorithm: The Combination Technique in Parallel

for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions

while not converged do

for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)

mitigateFaults(); // Mitigate faults

u(c)n ← reduce(ciui); // Combine solutions

for all i ∈ In,q,τ doui← scatter(u(c)

n ); // Sample each uifrom new u(c)n

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 28

Page 50: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Selective Reliability

Focus on critical parts

Algorithm: The Combination Technique in Parallel

for all combination grids Ωi do in parallelui← u(x , t = 0) ; // Set initial conditions

while not converged do

for all combination grids Ωi do in parallelui← solver(ui ,Nt); // Solve the PDE on grid Ωi (Nt timesteps)

checkForSDC(); // Cheap sanity check

mitigateFaults(); // Mitigate faults

u(c)n ← reduce(ciui); // Combine solutions

for all i ∈ In,q,τ doui← scatter(u(c)

n ); // Sample each uifrom new u(c)n

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 28

Page 51: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Silent/Soft Faults

Exploit hierarchical approachSimilar discretizations lead to similar results

Exploit redundancy and hierarchical representation to check for faults

Detection of outliers possible

Direct integration into communication schemes possible (SubspaceReduce)

+

Component

Grids

Sparse

GridHierarchical Increment Spaces

of the Sparse Grid

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 29

Page 52: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

SDC Check: Compare Pairs of Solutions

Similar discretizations should lead to similar results

Technische Universitat Munchen • Department of Informatics • Chair of Scientific Computing

SDC Check 1: Compare pairs of solutions

1

2

3

4

1 2 3 4

Pair β(s,t)

(2, 4) (2, 3) 3.98e-01(3, 2) (2, 4) 1.11e+00(4, 2) (2, 3) 1.11e+00(3, 2) (2, 3) 6.32e-01(3, 3) (4, 2) 9.85e+05(3, 3) (2, 3) 1.07e+06(3, 2) (4, 2) 3.98e-01(2, 4) (4, 2) 1.27e+00(3, 2) (3, 3) 1.07e+06(2, 4) (3, 3) 9.85e+05

β(s,t) := maxl≤s∧t

maxj∈Il

|α(t)l,j − α

(s)l,j |

min|α(t)

l,j |, |α(s)l,j |

A. Parra Hinojosa: SDC-Resilient Algorithms Using the Sparse Grid Combination Technique

SPPEXA 2016 12

Pair β(s,t)

(2, 4) (2, 3) 3.98e-01(3, 2) (2, 4) 1.11e+00(4, 2) (2, 3) 1.11e+00(3, 2) (2, 3) 6.32e-01(3, 3) (4, 2) 9.85e+05(3, 3) (2, 3) 1.07e+06(3, 2) (4, 2) 3.98e-01(2, 4) (4, 2) 1.27e+00(3, 2) (3, 3) 1.07e+06(2, 4) (3, 3) 9.85e+05

β(s,t) := maxl≤s∧t

maxj∈Il

|α(t)l,j − α

(s)l,j |

min|α(t)

l,j |, |α(s)l,j |

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 30

Page 53: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

SDC Check: Outlier detection

Technische Universitat Munchen • Department of Informatics • Chair of Scientific Computing

SDC Check 2: Outlier detection

u(0,0) = [1.002,5.356,0.998,1.002,1.001,1.001, .999]

A. Parra Hinojosa: SDC-Resilient Algorithms Using the Sparse Grid Combination Technique

SPPEXA 2016 13

u(0, 0) = [1.002, 5.356, 0.998, 1.002, 1.001, 1.001, .999]

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 31

Page 54: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

2D Example

Advection equation

∂u∂t

+ cx∂u∂x

+ cy∂u∂x

= 0 Ω = [0, 1]2

Periodic boundary conditions

Constant advection velocities cx , cy

Initial condition u(x , y , t = 0) = sin(2πx) sin(2πy)

Lax-Wendroff scheme (2nd order space + time)

Error/solution at t = 0.5 compared to analytical solution

u(x , y , t) = sin(2π(x − cx t)) sin(2π(y − cY t))

Corruption of one single data point in initial condition

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 32

Page 55: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

2D Example

0 5 10 15 20 25 300

5

10

15

20

25

30

Exact solution

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

0 5 10 15 20 25 300

5

10

15

20

25

30

Full Grid

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

0 5 10 15 20 25 300

5

10

15

20

25

30

Combined grid

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 33

Page 56: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

2D Example

0 5 10 15 20 25 300

5

10

15

20

25

30

Exact solution

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

0 5 10 15 20 25 300

5

10

15

20

25

30

Full Grid

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

0 5 10 15 20 25 300

5

10

15

20

25

30

Combined grid

−2.4

−1.8

−1.2

−0.6

0.0

0.6

1.2

1.8

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 33

Page 57: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

2D Example

0 5 10 15 20 25 300

5

10

15

20

25

30

Exact solution

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

0 5 10 15 20 25 300

5

10

15

20

25

30

Full Grid

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

0 5 10 15 20 25 300

5

10

15

20

25

30

Combined grid

−2.4

−1.8

−1.2

−0.6

0.0

0.6

1.2

1.8

0 5 10 15 20 25 300

5

10

15

20

25

30

Recovered

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

0.8

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 33

Page 58: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

2D Example: Simulated Soft Faults

Inserting one soft faultMeasuring L2-error at the end

0 100 200 300 400 50010−510−410−310−210−1

100101102103104105106

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 105

Full Grid CT, no SDC CT, with SDC CT, recovered

0 100 200 300 400 50010−5

10−4

10−3

10−2

10−1

100

101

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 10−0.5

0 100 200 300 400 500

Timestep where fault occurs, lowest hierarchical subspace

10−5

10−4

10−3

10−2

10−1

100

101

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 10−300

0 100 200 300 400 50010−510−410−310−210−1

100101102103104105106

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 105

0 100 200 300 400 50010−5

10−4

10−3

10−2

10−1

100

101

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 10−0.5

0 100 200 300 400 500

Timestep where fault occurs, highest hierarchical subspace

10−5

10−4

10−3

10−2

10−1

100

101

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 10−300

0 100 200 300 400 50010−510−410−310−210−1

100101102103104105106

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 105

0 100 200 300 400 50010−5

10−4

10−3

10−2

10−1

100

101

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 10−0.5

0 100 200 300 400 500

Timestep where fault occurs, highest hierarchical subspace

10−5

10−4

10−3

10−2

10−1

100

101

ui(xl1,j1, xl2,j2

) = ui(xl1,j1, xl2,j2

)× 10−300

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 34

Page 59: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Higher-D: Advection-Diffusion Equation

∂tu −∆u + ~a · ∇u = f in Ω× [0,T )

u(·, t) = 0 in ∂Ω

u(·, 0) = u0 in Ω

Ω = [0, 1]d ,~a = (1, . . . , 1)T , u0 = e−100∑d

i=1(xi−0.5)2

Implemented in DUNE-pdelab

FVM, explicit time integration

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 35

Page 60: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Results

Fault in second time stepRelative error w.r.t. full-grid solution (n = 11 in 2D, n = 7 in 5D)Computations on Hazel Hen (HLRS)2D, 5D:

(6,6) (7,7) (8,8) (9,9) (10,10)

n

10-2

10-1

l 2-e

rror

no faults

2 groups (~50% faults)

4 groups (~25% faults)

8 groups (~12% faults)

16 groups (~6% faults)

(4,4,4,4,4) (5,5,5,5,5) (6,6,6,6,6)

n

10-1

100

l 2-e

rror

no faults

2 groups (~50% faults)

4 groups (~25% faults)

Again: excellent recovery properties!

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 36

Page 61: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Overview

1 Motivation and Numerics

2 Scalability

3 Algorithm-Based Fault ToleranceHard FaultsSilent/Soft Faults

4 Summary

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 37

Page 62: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Summary

GyrokineticsHigh-dimensional problem with urgent need for compute resources

Sparse grids: ”Too Big Data”⇒ Big Data

Hierarchical multilevel splitting provides novel handles on exa-challenges

Scalability2nd level of parallelismNumerical decoupling, extrapolationExploit hierarchical splitting for optimal communication

ABFT at low costExploit hierarchical schemeRecombination rather than recomputation

Silent faultsExploit underlying hierarchical basisDetection and treatment of silent faults possible

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 38

Page 63: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Summary

GyrokineticsHigh-dimensional problem with urgent need for compute resources

Sparse grids: ”Too Big Data”⇒ Big Data

Hierarchical multilevel splitting provides novel handles on exa-challenges

Scalability2nd level of parallelismNumerical decoupling, extrapolationExploit hierarchical splitting for optimal communication

ABFT at low costExploit hierarchical schemeRecombination rather than recomputation

Silent faultsExploit underlying hierarchical basisDetection and treatment of silent faults possible

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 38

Page 64: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Summary

GyrokineticsHigh-dimensional problem with urgent need for compute resources

Sparse grids: ”Too Big Data”⇒ Big Data

Hierarchical multilevel splitting provides novel handles on exa-challenges

Scalability reduce data in communication2nd level of parallelismNumerical decoupling, extrapolationExploit hierarchical splitting for optimal communication

ABFT at low cost avoid data storage and I/OExploit hierarchical schemeRecombination rather than recomputation

Silent faults limit communicated dataExploit underlying hierarchical basisDetection and treatment of silent faults possible

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 38

Page 65: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Thanks to:

. . . and all others!

Thank you for your interest!

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 39

Page 66: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Thanks to:

. . . and all others!

Thank you for your interest!

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 39

Page 67: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations …helper.ipam.ucla.edu/publications/dmc2017/dmc2017_14139.pdf · 2017-02-03 · Big Data Meets Computation

Mario Heene, Alfredo Parra Hinojosa, Hans-Joachim Bungartz, and Dirk Pfluger.A massively-parallel, fault-tolerant solver for time-dependent pdes in high dimensions.In Euro-Par 2016, Grenoble, June 2016.Accepted.

Philipp Hupp, Mario Heene, Riko Jacob, and Dirk Pfluger.Global communication schemes for the numerical solution of high-dimensional PDEs.Parallel Computing, 52:78 – 105, 2016.

Mario Heene and Dirk Pfluger.Scalable algorithms for the solution of higher-dimensional PDEs.In Software for Exascale Computing-SPPEXA 2013-2015, pages 165–186. Springer InternationalPublishing, 2016.

Alfredo Parra Hinojosa, Christoph Kowitz, Mario Heene, Dirk Pfluger, and Hans-Joachim Bungartz.Towards a fault-tolerant, scalable implementation of GENE.In Recent Trends in Computational Engineering-CE2014, pages 47–65. Springer International Publishing,2015.

Alfredo Parra Hinojosa, Brendan Harding, Hegland Markus, and Hans-Joachim Bungartz.Handling silent data corruption with the sparse grid combination technique.In Proceedings of the SPPEXA Symposium, Lecture Notes in Computational Science and Engineering.Springer-Verlag, February 2016.

Dirk Pfluger, Hans-Joachim Bungartz, Michael Griebel, Frank Jenko, Tilman Dannert, Mario Heene,Alfredo Parra Hinojosa, Christoph Kowitz, and Peter Zaspel.EXAHD: An exa-scalable two-level sparse grid approach for higher-dimensional problems in plasmaphysics and beyond.In Euro-Par 2014 Workshop, Part II, volume 8806 of Lecture Notes in Computer Science, pages 566–577.Springer-Verlag, December 2014.

Dirk Pfluger: Scalability and Algorithm-Based Fault Tolerance for Plasma Physics Simulations with GENE

Big Data Meets Computation @ IPAM, February 2, 2017 40