SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers Luc Giraud joint work with E. Agullo, P. Salas, E. F. Yetkin, M. Zounon funded by ANR RESCUE and G8-ECS HiePACS Inria Project Joint Inria-CERFACS lab INRIA Bordeaux Sud-Ouest




Context

L. Giraud - Resilient numerical linear algebra solvers 2/ 25

Resilience: the ability to compute a correct output in the presence of faults

- Context: numerical linear algebra
- Goal: keep converging in the presence of faults
- Method: recover-restart strategy without checkpointing

- HPC systems are not fault-free
- A faulty component (node, core, memory) loses all its data
- Simulations at exascale have to be resilient


Outline

Faults in HPC Systems

Sparse linear systems

Interpolation methods

Numerical experiments

Resilience in eigensolvers

Concluding remarks and perspectives


Faults in HPC Systems

Framework

Forecast for extreme-scale systems
- Mean Time Between Failures (MTBF): less than one hour
- Checkpoint time might be larger than the MTBF

Objectives
- Explore fault-tolerant schemes with little or no overhead
- Numerical algorithms to deal with the overhead issue

Faults in this presentation
- Detected corrupted memory space (node crashes, damaged memory pages, uncorrected bit-flips, ...)


Sparse linear systems

$Ax = b$

We attempt to design fault-tolerant solvers for sparse linear systems.

Two classes of iterative methods
- Stationary methods (Jacobi, Gauss-Seidel, ...)
- Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)

Krylov methods have attractive potential at extreme scale.


Interpolation methods

Block row distribution

Figure: the system $Ax = b$ distributed by block rows over processes P1 to P4 (legend: static data, dynamic data, lost data, interpolated data)

We distinguish two categories of data:
- Static data
- Dynamic data

Let's assume that P1 fails:
- The failed processor is replaced
- Static data are restored

Reset: set $x_1$ to its initial value

Our algorithms aim at recovering $x_1$ and restarting
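The failure scenario above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `block_rows` and `reset_recovery`, the problem size, and the use of a zero initial guess are all hypothetical and do not come from the slides. It shows a contiguous block row distribution over four processes and the Reset policy, where the entries owned by the failed process go back to their initial value.

```python
import numpy as np

def block_rows(n, nprocs):
    """Contiguous block row distribution: index set owned by each process."""
    bounds = np.linspace(0, n, nprocs + 1).astype(int)
    return [np.arange(bounds[p], bounds[p + 1]) for p in range(nprocs)]

def reset_recovery(x, x0, lost):
    """Reset policy: lost entries of the iterate return to the initial guess."""
    x_new = x.copy()
    x_new[lost] = x0[lost]
    return x_new

# Hypothetical toy setting: 8 unknowns over 4 processes, P1 (index 0) fails.
n, nprocs = 8, 4
rows = block_rows(n, nprocs)
x0 = np.zeros(n)                 # assumed initial guess
x = np.arange(1.0, n + 1.0)      # stand-in for the current iterate
x_after = reset_recovery(x, x0, rows[0])
```

Only the block owned by P1 changes; the blocks held by the surviving processes are left untouched, which is what the interpolation policies presented later exploit.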

Overview of our fault tolerant algorithm

Figure: timeline of processes P1 to P4 (legend: fault, successful iteration, failed iteration, interpolation, restart)

- Sequential simulations
- Simulation of a parallel environment
- Generation of a fault trace
- Realistic probability distribution
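A fault trace for such a sequential simulation could be generated along the following lines. The slides do not say which probability distribution was used, so the Weibull inter-arrival law, the shape value, and the function name `fault_trace` are assumptions made purely for illustration; time is measured in solver iterations for simplicity.

```python
import math
import numpy as np

def fault_trace(n_iters, nprocs, mtbf_iters, shape=0.7, seed=0):
    """Return a list of (iteration, failed_process) pairs.

    Inter-arrival times follow a Weibull law (an assumption, not taken from
    the slides) rescaled so that their mean equals mtbf_iters.
    """
    rng = np.random.default_rng(seed)
    # Mean of a Weibull(shape, scale) variable is scale * Gamma(1 + 1/shape).
    scale = mtbf_iters / math.gamma(1.0 + 1.0 / shape)
    trace, t = [], 0.0
    while True:
        t += scale * rng.weibull(shape)
        iteration = int(t)
        if iteration >= n_iters:
            break
        # The failed process is drawn uniformly among the simulated processes.
        trace.append((iteration, int(rng.integers(nprocs))))
    return trace

trace = fault_trace(n_iters=1400, nprocs=4, mtbf_iters=100.0, seed=1)
```

Fixing the seed makes the trace reproducible, so the same sequence of faults can be replayed against each recovery policy.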

Interpolation methods

Fault in linear system

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]

After the fault, $x_1$ is lost and only $x_2$ is available. How to recover $x_1$?

Linear Interpolation (LI) [Langou, Chen, Bosilca, Dongarra, SISC, 2007]

Solve $A_{11} x_1 = b_1 - A_{12} x_2$

Least Squares Interpolation (LSI)

\[
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]

\[
x_1 = \operatorname*{argmin}_{x}
\left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
- \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x
- \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 \right\|_2
\]
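Both recovery formulas translate directly into dense NumPy calls. The sketch below is a minimal illustration, not the authors' implementation: the function names, the index-set argument `lost`, and the use of dense arrays are assumptions (a real solver would work with sparse, distributed blocks).

```python
import numpy as np

def linear_interpolation(A, b, x, lost):
    """LI: solve A11 x1 = b1 - A12 x2 for the lost block x1."""
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

def least_squares_interpolation(A, b, x, lost):
    """LSI: x1 = argmin_x || b - A[:, lost] x - A[:, kept] x2 ||_2."""
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b - A[:, kept] @ x[kept]
    sol, *_ = np.linalg.lstsq(A[:, lost], rhs, rcond=None)
    x_new = x.copy()
    x_new[lost] = sol
    return x_new

# Toy SPD system (1D Poisson matrix); the entries held by P1 are lost.
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x_exact = np.linspace(1.0, 2.0, n)
b = A @ x_exact
lost = np.array([0, 1])
x_damaged = x_exact.copy()
x_damaged[lost] = np.nan         # lost entries are never read by either policy
x_li = linear_interpolation(A, b, x_damaged, lost)
x_lsi = least_squares_interpolation(A, b, x_damaged, lost)
```

LI only needs the diagonal block $A_{11}$ and the local right-hand side, whereas LSI uses the full block column and therefore all rows of the system. In this toy case the surviving entries are exact, so both policies return the exact lost block.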

Main properties - basic linear algebra

Proposition. The initial guess generated by LI after a fault ensures that the A-norm of the forward error associated with the iterates computed by restarted CG or PCG is monotonically decreasing.
[LI might not be defined for non-SPD matrices, as diagonal blocks might be singular.]

Proposition. The initial guess generated by LSI after a fault ensures the monotonic decrease of the residual norm of minimal residual Krylov subspace methods such as GMRES and MINRES after a restart due to a failure.
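One way to see why both propositions are plausible is sketched below. The notation is an assumption introduced here, not taken from the slides: $x^{(k)} = (x_1^{(k)}, x_2^{(k)})$ is the iterate just before the fault, $x^\ast$ is the exact solution, and only $x_1^{(k)}$ is lost. Each interpolation minimizes a norm over the lost block with the surviving block held fixed, so the pre-fault block is always an admissible candidate and the minimized quantity cannot increase across the fault.

```latex
% LSI: the pre-fault block x_1^{(k)} is admissible in the least squares problem.
\[
\left\| b - A \begin{pmatrix} x_1^{\mathrm{LSI}} \\ x_2^{(k)} \end{pmatrix} \right\|_2
= \min_{x} \left\| b - A \begin{pmatrix} x \\ x_2^{(k)} \end{pmatrix} \right\|_2
\le \left\| b - A x^{(k)} \right\|_2 .
\]

% LI with A SPD: the gradient of the squared A-norm error with respect to the
% lost block vanishes exactly when the LI system holds (using A_{11} x_1^* + A_{12} x_2^* = b_1).
\[
\nabla_{x}\, \tfrac{1}{2}
\left\| \begin{pmatrix} x \\ x_2^{(k)} \end{pmatrix} - x^\ast \right\|_A^2
= A_{11} (x - x_1^\ast) + A_{12} (x_2^{(k)} - x_2^\ast)
= A_{11} x - \left( b_1 - A_{12} x_2^{(k)} \right) .
\]

% Since A_{11} is SPD, this stationary point is the minimizer, hence
\[
\left\| \begin{pmatrix} x_1^{\mathrm{LI}} \\ x_2^{(k)} \end{pmatrix} - x^\ast \right\|_A
\le \left\| x^{(k)} - x^\ast \right\|_A .
\]
```

Combined with the fact that CG does not increase the A-norm of the error from its initial guess, and that GMRES and MINRES do not increase the residual norm from theirs, these bounds give monotonicity across a fault followed by a restart.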


Numerical experiments

Impact of fault rate
Preconditioned GMRES (Kim1 - 2 % data lost)

Figure (four panels: 4 faults, 8 faults, 17 faults, 40 faults): relative residual norm ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for Reset, LI, LSI, SC and REF

Impact of lost data volume
Preconditioned GMRES(100) (Averous/epb3 - 10 faults)

Figure (four panels: 3 %, 0.8 %, 0.2 % and 0.001 % data lost): relative residual norm ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 980) for Reset, LI, LSI, SC and REF

Penalty of restart strategy

- Recover-restart strategy
- When restarting, we lose the Krylov subspace built before the fault
- Consequence: convergence is delayed by the restart
- A restarting mechanism is naturally implemented in GMRES to reduce computational resource consumption
- CG does not need to be restarted

Penalty of restart strategy on PCG

Figure: PCG on a 7-point stencil 3D Poisson equation with 70 faults - 5 % data lost. A-norm of the error (log scale, 1e-13 to 1) versus iteration (0 to 830) for Reset, LI, LSI, SC and REF


Resilience in eigensolvers

Recovery-restart for eigensolvers

Fault in eigenproblem

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \lambda
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
\]

After the fault, $x_1$ is lost and only $x_2$ is available. How to recover $x_1$?

Linear Interpolation (LI)

Solve the linear system $(A_{11} - \lambda I_1)\, x_1 = -A_{12} x_2$

Least Squares Interpolation (LSI)

\[
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
= \lambda \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
\]

\[
x_1 = \operatorname*{argmin}_{x}
\left\| \begin{pmatrix} A_{11} - \lambda I_1 \\ A_{21} \end{pmatrix} x
+ \begin{pmatrix} A_{12} \\ A_{22} - \lambda I_2 \end{pmatrix} x_2 \right\|_2
\]
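The eigenvector variants differ from the linear-system ones only by the shift $\lambda$ and the zero right-hand side. The sketch below is an illustration under assumptions: the function names are hypothetical, the matrix is dense and real symmetric (the slides allow complex matrices), and $\lambda$ is taken as known, for instance the current eigenvalue estimate.

```python
import numpy as np

def eigen_linear_interpolation(A, lam, x, lost):
    """LI: solve (A11 - lam*I) x1 = -A12 x2 for the lost block x1."""
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    shifted = A[np.ix_(lost, lost)] - lam * np.eye(len(lost))
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(shifted, -A[np.ix_(lost, kept)] @ x[kept])
    return x_new

def eigen_least_squares_interpolation(A, lam, x, lost):
    """LSI: x1 = argmin_x || (A - lam*I)[:, lost] x + (A - lam*I)[:, kept] x2 ||_2."""
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    S = A - lam * np.eye(A.shape[0])
    sol, *_ = np.linalg.lstsq(S[:, lost], -S[:, kept] @ x[kept], rcond=None)
    x_new = x.copy()
    x_new[lost] = sol
    return x_new

# Toy symmetric matrix; recover the lost part of its smallest eigenvector.
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
eigenvalues, eigenvectors = np.linalg.eigh(A)
lam, v = eigenvalues[0], eigenvectors[:, 0]
lost = np.array([0, 1])
v_damaged = v.copy()
v_damaged[lost] = np.nan         # lost entries are never read by either policy
v_li = eigen_linear_interpolation(A, lam, v_damaged, lost)
v_lsi = eigen_least_squares_interpolation(A, lam, v_damaged, lost)
```

As for linear systems, LI requires the shifted diagonal block $A_{11} - \lambda I_1$ to be nonsingular, while LSI only needs the shifted block column to have full column rank.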


If $Ax = \lambda x$ with $x \neq 0$, where $A \in \mathbb{C}^{n \times n}$, $x \in \mathbb{C}^{n}$, and $\lambda \in \mathbb{C}$, then
- $\lambda$: eigenvalue
- $x$: eigenvector
- $(\lambda, x)$: eigenpair

Two classes of methods
- Fixed point methods (power method, subspace iteration)
- Subspace methods (Jacobi-Davidson, Arnoldi, IRA/Krylov-Schur)


Thermo-acoustic test example

(a few smallest eigenvalues)


Jacobi-Davidson method

Figure: Jacobi-Davidson method with 5 faults - 1 % lost data. Convergence history using LSI and checkpoint of the current iterate: ||Ax - lambda*x|| / ||lambda|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240); curves: LSI, REF


Concluding remarks and perspectives

Concluding remarks

Summary
- We have designed techniques to interpolate meaningful lost data based on simple linear algebra tools
- Our techniques preserve some of the key monotonicity properties of Krylov solvers, but LI lacks robustness for non-SPD problems
- The restarting effect remains reasonable within the GMRES context
- No fault, no overhead
- These techniques can be adapted to multiple faults
- What about silent soft errors (CGPOP preliminary experiments)?


Thank you for your attention. Questions?

https://team.inria.fr/hiepacs/