preliminary investigations on resilient parallel numerical linear algebra … · 2018. 5. 25. ·...

Post on 16-Oct-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

SIAM EX14 WorkshopJuly 7, Chicago - IL

Preliminary Investigations on ResilientParallel Numerical Linear Algebra Solvers

Luc Giraud

joint work withE. Agullo, P. Salas, E. F. Yetkin, M. Zounonfunded by ANR RESCUE and G8-ECS

HiePACS Inria ProjectJoint Inria-CERFACS labINRIA Bordeaux Sud-Ouest

Context

L. Giraud - Resilient numerical linear algebra solvers 2/ 25

Resilience: Ability to compute a correct output in presence of faults

I Context: Numerical linear algebraI Goal: Keep converging in presence of faultI Method: Recover-restart strategy without Checkpoint

I HPC systems are not fault-freeI A faulty components (node, core, memory) loses

all its dataI Simulations at exascale have to be resilient

Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives

L. Giraud - Resilient numerical linear algebra solvers 3/ 25

Faults in HPC Systems

Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives

L. Giraud - Resilient numerical linear algebra solvers 4/ 25

Faults in HPC Systems

Framework

Forecast for extreme scale systemsI Mean Time Between Failure (MTBF): less than one hourI Checkpoint time might be larger than MTBF

ObjectivesI Explore fault-tolerant schemes with less/no overheadI Numerical algorithms to deal with overhead issue

Faults in this presentationI Detected corrupted memory space (node crashes, damaged

memory pages, uncorrected bit-flip, . . . )

L. Giraud - Resilient numerical linear algebra solvers 5/ 25

Faults in HPC Systems

Framework

Forecast for extreme scale systemsI Mean Time Between Failure (MTBF): less than one hourI Checkpoint time might be larger than MTBF

ObjectivesI Explore fault-tolerant schemes with less/no overheadI Numerical algorithms to deal with overhead issue

Faults in this presentationI Detected corrupted memory space (node crashes, damaged

memory pages, uncorrected bit-flip, . . . )

L. Giraud - Resilient numerical linear algebra solvers 5/ 25

Faults in HPC Systems

Framework

Forecast for extreme scale systemsI Mean Time Between Failure (MTBF): less than one hourI Checkpoint time might be larger than MTBF

ObjectivesI Explore fault-tolerant schemes with less/no overheadI Numerical algorithms to deal with overhead issue

Faults in this presentationI Detected corrupted memory space (node crashes, damaged

memory pages, uncorrected bit-flip, . . . )

L. Giraud - Resilient numerical linear algebra solvers 5/ 25

Sparse linear systems

Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives

L. Giraud - Resilient numerical linear algebra solvers 6/ 25

Sparse linear systems

L. Giraud - Resilient numerical linear algebra solvers 7/ 25

x bA

=

Ax = bWe attempt to design fault tolerant solversfor sparse linear system

Two classes of iterative methodsI Stationary methods (Jacobi, Gauss-Seidel, . . . )I Krylov subspace methods (CG, GMRES, Bi-CGStab, . . . )

I Krylov methods have attractive potential for Extreme-scale

Interpolation methods

Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives

L. Giraud - Resilient numerical linear algebra solvers 8/ 25

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

Block row distributionx bA

P

P

P

P

1

2

3

4

=

We distinguish two categories of data:I Static dataI Dynamic data

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

Block row distributionx bA

P

P

P

P

1

2

3

4

=

We distinguish two categories of data:I Static dataI Dynamic data

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

Block row distributionx bA

P

P

P

P

1

2

3

4

=

We distinguish two categories of data:I Static dataI Dynamic data

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

����������������

����������������

x bA

P

P

P

P

1

2

3

4

Static data Dynamic data

=

We distinguish two categories of data:I Static dataI Dynamic data

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

����������������

����������������

x bA

P

P

P

P

1

2

3

4

Static data Dynamic data

=

We distinguish two categories of data:I Static dataI Dynamic data

Let’s assume that P1 fails

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

����������������

����������������

��������

��������

x bA

P

P

P

P

1

2

3

4

Static data Dynamic data Lost data

=

We distinguish two categories of data:I Static dataI Dynamic data

Let’s assume that P1 fails

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

����������������

����������������

������������

������������

x bA

P

P

P

P

1

2

3

4

Static data Dynamic data Lost data

=

We distinguish two categories of data:I Static dataI Dynamic data

Let’s assume that P1 failsI Failed processor is replacedI Static data are restored

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

������������

������������

x bA

P

P

P

P

1

2

3

4

Static data Dynamic data Lost data

0

=

We distinguish two categories of data:I Static dataI Dynamic data

Let’s assume that P1 failsI Failed processor is replacedI Static data are restored

Reset: Set (x1) to initial value

Interpolation methods

L. Giraud - Resilient numerical linear algebra solvers 9/ 25

��������

��������

������������������������������

������������������������������

x bA

P

P

P

P

1

2

3

4

Static data Dynamic data Lost data Interpolatedv data

=

We distinguish two categories of data:I Static dataI Dynamic data

Let’s assume that P1 failsI Failed processor is replacedI Static data are restored

Our algorithms aim at recovering x1and restart

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

I Sequential simulationsI Simulation of parallel

environment

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration Failed iteration

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration Failed iteration Interpolation

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration Failed iteration Interpolation

Restart

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration Failed iteration Interpolation

Restart

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration Failed iteration Interpolation

Restart

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration Failed iteration Interpolation

Restart

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Overview of our fault tolerant algorithm

L. Giraud - Resilient numerical linear algebra solvers 10/ 25

P

P

P

P

1

2

3

4

Time

Fault Successful iteration Failed iteration Interpolation

Restart

I Sequential simulationsI Simulation of parallel

environment

I Generation of fault traceI Realistic probability distribution

Interpolation methods

Interpolation methods

Fault in linear system(A11 A12A21 A22

)(x1x2

)=

(b1b2

)

Linear Interpolation (LI) [Langou, Chen, Bosilca, Dongarra, SISC, 2007]

Solve A11x1 = b1 − A12x2

Least Squares Interpolation (LSI)

(A11A21

)x1 +

(A21A22

)x2 =

(b1b2

)x1 = argmin

x

∥∥∥∥(b1b2

)−(

A11A21

)x −

(A12A22

)x2

∥∥∥∥2

L. Giraud - Resilient numerical linear algebra solvers 11/ 25

Interpolation methods

Interpolation methods

Fault in linear system(A11 A12A21 A22

)(?x2

)=

(b1b2

)How to recover x1?

Linear Interpolation (LI) [Langou, Chen, Bosilca, Dongarra, SISC, 2007]

Solve A11x1 = b1 − A12x2

Least Squares Interpolation (LSI)

(A11A21

)x1 +

(A21A22

)x2 =

(b1b2

)x1 = argmin

x

∥∥∥∥(b1b2

)−(

A11A21

)x −

(A12A22

)x2

∥∥∥∥2

L. Giraud - Resilient numerical linear algebra solvers 11/ 25

Interpolation methods

Interpolation methods

Fault in linear system(A11 A12A21 A22

)(?x2

)=

(b1b2

)How to recover x1?

Linear Interpolation (LI) [Langou, Chen, Bosilca, Dongarra, SISC, 2007]

Solve A11x1 = b1 − A12x2

Least Squares Interpolation (LSI)

(A11A21

)x1 +

(A21A22

)x2 =

(b1b2

)x1 = argmin

x

∥∥∥∥(b1b2

)−(

A11A21

)x −

(A12A22

)x2

∥∥∥∥2

L. Giraud - Resilient numerical linear algebra solvers 11/ 25

Interpolation methods

Interpolation methods

Fault in linear system(A11 A12A21 A22

)(?x2

)=

(b1b2

)How to recover x1?

Linear Interpolation (LI) [Langou, Chen, Bosilca, Dongarra, SISC, 2007]

Solve A11x1 = b1 − A12x2

Least Squares Interpolation (LSI)(A11A21

)x1 +

(A21A22

)x2 =

(b1b2

)x1 = argmin

x

∥∥∥∥(b1b2

)−

(A11A21

)x −

(A12A22

)x2

∥∥∥∥2

L. Giraud - Resilient numerical linear algebra solvers 11/ 25

Interpolation methods

Main properties - basic linear algebra

PropositionThe initial guess generated by LI after a fault does ensure that theA-norm of the forward error associated with the iterates computedby restarted CG or PCG is monotonically decreasing

PropositionThe initial guess generated by LSI after a fault does ensure themonotonic decrease of the residual norm of minimal residualKrylov subspace methods such as GMRES and MinRES after arestarting due to a failure

L. Giraud - Resilient numerical linear algebra solvers 12/ 25

Interpolation methods

Main properties - basic linear algebra

PropositionThe initial guess generated by LI after a fault does ensure that theA-norm of the forward error associated with the iterates computedby restarted CG or PCG is monotonically decreasing[LI might not be defined for non-SPD matrices as diagonal blocksmight be singular]

PropositionThe initial guess generated by LSI after a fault does ensure themonotonic decrease of the residual norm of minimal residualKrylov subspace methods such as GMRES and MinRES after arestarting due to a failure

L. Giraud - Resilient numerical linear algebra solvers 12/ 25

Interpolation methods

Main properties - basic linear algebra

PropositionThe initial guess generated by LI after a fault does ensure that theA-norm of the forward error associated with the iterates computedby restarted CG or PCG is monotonically decreasing[LI might not be defined for non-SPD matrices as diagonal blocksmight be singular]

PropositionThe initial guess generated by LSI after a fault does ensure themonotonic decrease of the residual norm of minimal residualKrylov subspace methods such as GMRES and MinRES after arestarting due to a failure

L. Giraud - Resilient numerical linear algebra solvers 12/ 25

Numerical experiments

Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives

L. Giraud - Resilient numerical linear algebra solvers 13/ 25

Numerical experiments

Impact of fault ratePreconditioned GMRES (Kim1 - 2 % data lost)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 140 280 420 560 700 840 980 1120 1260 1400

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 4 faults

L. Giraud - Resilient numerical linear algebra solvers 14/ 25

Numerical experiments

Impact of fault ratePreconditioned GMRES (Kim1 - 2 % data lost)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 140 280 420 560 700 840 980 1120 1260 1400

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 8 faults

L. Giraud - Resilient numerical linear algebra solvers 14/ 25

Numerical experiments

Impact of fault ratePreconditioned GMRES (Kim1 - 2 % data lost)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 140 280 420 560 700 840 980 1120 1260 1400

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 17 faults

L. Giraud - Resilient numerical linear algebra solvers 14/ 25

Numerical experiments

Impact of fault ratePreconditioned GMRES (Kim1 - 2 % data lost)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 140 280 420 560 700 840 980 1120 1260 1400

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 40 faults

L. Giraud - Resilient numerical linear algebra solvers 14/ 25

Numerical experiments

Impact of lost data volumePreconditioned GMRES(100) (Averous/epb3 - 10 faults)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 98 196 294 392 490 588 686 784 882 980

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 3 % data lost

L. Giraud - Resilient numerical linear algebra solvers 15/ 25

Numerical experiments

Impact of lost data volumePreconditioned GMRES(100) (Averous/epb3 - 10 faults)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 98 196 294 392 490 588 686 784 882 980

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 0.8 % data lost

L. Giraud - Resilient numerical linear algebra solvers 15/ 25

Numerical experiments

Impact of lost data volumePreconditioned GMRES(100) (Averous/epb3 - 10 faults)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 98 196 294 392 490 588 686 784 882 980

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 0.2 % data lost

L. Giraud - Resilient numerical linear algebra solvers 15/ 25

Numerical experiments

Impact of lost data volumePreconditioned GMRES(100) (Averous/epb3 - 10 faults)

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 98 196 294 392 490 588 686 784 882 980

||(b

-Ax)|

|/||b||

Iteration

Reset

LI

LSI

SC

REF

Figure: 0.001 % data lost

L. Giraud - Resilient numerical linear algebra solvers 15/ 25

Numerical experiments

Penalty of restart strategy

I Recover-restart strategyI When restarting, we lose the Krylov subspace built before the

faultI Consequence: delay of convergence due to restartI Restarting mechanism is naturally implemented in GMRES to

reduce the computational resource consumptionI CG does not need to be restarted

L. Giraud - Resilient numerical linear algebra solvers 16/ 25

Numerical experiments

Penality of restart strategy on PCG

1e-13

1e-12

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 83 166 249 332 415 498 581 664 747 830

A-n

orm

(err

or)

Iterations

Reset

LI

LSI

SC

Figure: PCG on a 7-point stencil 3D Poisson equation with 70 faults -5 % data lost

L. Giraud - Resilient numerical linear algebra solvers 17/ 25

Numerical experiments

Penality of restart strategy on PCG

1e-13

1e-12

1e-11

1e-10

1e-09

1e-08

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 83 166 249 332 415 498 581 664 747 830

A-n

orm

(err

or)

Iterations

Reset

LI

LSI

SC

REF

Figure: PCG on a 7-point stencil 3D Poisson equation with 70 faults -5 % data lost

L. Giraud - Resilient numerical linear algebra solvers 17/ 25

Resilience in eigensolvers

Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives

L. Giraud - Resilient numerical linear algebra solvers 18/ 25

Resilience in eigensolvers

Recovery-restart for eigensolvers

Fault in eigenproblem(A11 A12A21 A22

)(x1x2

)= λ

(x1x2

)

Linear Interpolation (LI)Solve the linear system

(A11 − λI1

)x1 = −A12x2

Least Squares Interpolation (LSI)

(A11A21

)x1 +

(A21A22

)x2 = λ

(x1x2

)x1 = argmin

x

∥∥∥∥(A11 − λI1A21

)x +

(A12

A22 − λI2

)x2

∥∥∥∥2

L. Giraud - Resilient numerical linear algebra solvers 19/ 25

Resilience in eigensolvers

Recovery-restart for eigensolvers

Fault in eigenproblem(A11 A12A21 A22

)(?x2

)= λ

(?x2

)How to recover x1?

Linear Interpolation (LI)Solve the linear system

(A11 − λI1

)x1 = −A12x2

Least Squares Interpolation (LSI)(A11A21

)x1 +

(A21A22

)x2 = λ

(x1x2

)x1 = argmin

x

∥∥∥∥(A11 − λI1A21

)x +

(A12

A22 − λI2

)x2

∥∥∥∥2

L. Giraud - Resilient numerical linear algebra solvers 19/ 25

Resilience in eigensolvers

L. Giraud - Resilient numerical linear algebra solvers 20/ 25

xA

=

x If Ax = λx with x 6= 0, where A ∈ Cn×n,x ∈ Cn, and λ ∈ C , then,

I λ : eigenvalueI x : eigenvectorI (λ, x) : eigenpair

Two classes of methodsI Fixed Point Methods (Power Method, Subspace iteration)I Subpace Methods (Jacobi-Davidson, Arnoldi, IRA/Krylov

Schur)

Resilience in eigensolvers

Thermo-acoustic test example

(a few smallest eigenvalues)

L. Giraud - Resilient numerical linear algebra solvers 21/ 25

Resilience in eigensolvers

Jacobi-Davidson method

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

1e+01

0 24 48 72 96 120 144 168 192 216 240

||(A

x -

lam

bda*x

)||/||la

mbda||

Iteration

0 1 2 2 3 4

LSI

REF

Figure: Jacobi-Davidson method with 5 faults - 1 % lost data.Convergence history using LSI and Checkpoint of current iterate

L. Giraud - Resilient numerical linear algebra solvers 22/ 25

Concluding remarks and perspectives

Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives

L. Giraud - Resilient numerical linear algebra solvers 23/ 25

Concluding remarks and perspectives

Concluding remarks

SummaryI We have designed techniques to interpolate meaningfull lost

data based on simple linear algebra toolsI Our techniques preserve some of the key monotonicy of Krylov

solvers but lack of robustness of LI for non-SPD problemsI The restarting effect remains reasonable within the GMRES

contextI No fault, no overheadI These techniques can be adpated to multiple faultsI What about silent soft-error - CGPOP preliminary

experiments ?

L. Giraud - Resilient numerical linear algebra solvers 24/ 25

Merci for your attentionQuestions ?

https://team.inria.fr/hiepacs/

top related