bronevetsky - accurate prediction of soft error ......extending vulnerability characterization to...

46
Accurate Prediction of Soft Error Vulnerability of Scientific Applications Greg Bronevetsky Post-doctoral Fellow Lawrence Livermore National Lab

Upload: others

Post on 19-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Accurate Prediction of Soft Error Vulnerability of

Scientific Applications

Greg BronevetskyPost-doctoral Fellow

Lawrence Livermore National Lab

Page 2: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Soft error: one-time corruption of system state

• Examples: Memory bit-flips, erroneous computations

• Caused by – Chip variability– Charged particles passing through transistors

• Decay of packaging materials (Lead208, Boron10)• Fission due to cosmic neutrons

– Temperature, power fluctuations

Page 3: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Soft errors are a critical reliability challenge for supercomputers

• Real Machines:– ASCI Q: 26 radiation-induced errors/week– Similar-size Cray XD1: 109 errors/week

(estimated)– BlueGene/L: 3-4 L1 cache bit flips/day

• Problem grows worse with time– Larger machines ⇒ larger error probability– SRAMs growing exponentially more

vulnerable per chip

Page 4: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

We must understand the impact of soft errors on applications

• Soft errors corrupt application state• May lead to crashes or• Need to detect/tolerate soft errors

– State of the art: checkers/correctors for individual algorithms

– No general solution• Must first understand how errors affect

applications– Identify problem– Focus efforts

corrupt output

Page 5: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Prior work says very little about most applications

• Prior fault analysis work focuses on injecting errors into individual applications– [Lu and Reed, SC04]: Linux + MPICH +

Cactus, NAMD, CAM– [Messer et al, ICSDN00]: Linux + Apache and

Linux + Java (Jess, DB, Javac, Jack)– [Some et al, AC02]: Lynx + Mars texture

segmentation application…

• Where’s my application?

Page 6: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Extending vulnerability characterization to more applications• Goal: general purpose vulnerability

characterization– Same accuracy as per-application fault

injection– Much cheaper

• Initial steps– Fault injection iterative linear algebra methods– Library-based fault vulnerability analysis

Page 7: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Step 1: Analyzing fault vulnerability of iterative methods

• Target domain: solvers for sparse linear problem Ax=b

• Goal:understand error vulnerability of class of algorithms– Raw error rates– Effectiveness of potential solutions

• Error model: memory bit-flips

Page 8: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Possible run outcomes

• Success: <10% error

• Silent Data Corruption (SDC): ≥10% error

• Hang: method doesn’t reach target tolerance

• Abort: SegFault or failed SparseLib check

Page 9: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Errors cause SDCs, Hangs, Aborts in ~8-10%, each

Page 10: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Large scale applications vulnerable to silent data corruptions

• Scaled to 1-day, 1,000-processor run of application that only calls iterative method

10FIT/MB DRAM (1,000-5,000 Raw FIT/MB, 90%-98% effective error correction)

Page 11: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Larger scale applications even more vulnerable to silent data corruptions

• Scaled to 10-day, 100,000-processor run of application that only calls iterative method

10FIT/MB DRAM (1,000-5,000 Raw FIT/MB, 90%-98% effective error correction)

Page 12: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Error Detectors

Base

Page 13: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Convergence detectors reduce SDC at <20% overhead

Base

Page 14: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Convergence detectors reduce SDC at <20% overhead

Base

Page 15: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Native detectors have little effect at little cost

Base

Page 16: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Encoding-based detectors significantly reduce SDC at high cost

Base

Page 17: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Encoding-based detectors significantly reduce SDC at high cost

Base

Page 18: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

First general analysis of error vulnerability of algorithm class

• Vulnerability analysis for class of common subroutines

• Described raw error vulnerability

• Analyzed various detection/tolerance techniques– No clear winner, rules of thumb

Page 19: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Step 2: Vulnerability analysis of library-based applications

• Many applications mostly composed of calls to library routines

• If error hits some routine, output will be corrupted

• Later routines: corrupted inputs ⇒ corrupted outputs

Inputs Outputs

(Work in progress)

Page 20: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Idea: predict application vulnerability from routine profiles

• Library implementors provide vulnerability profile for each routine:– Error pattern in routine’s output after errors– Function that maps input error patterns to

output error patterns

Inputs Outputs

Page 21: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Idea: predict application vulnerability from routine profiles

• Given application’s dependence graph– Simulate effect of error in each routine– Average over all error locations to produce

error pattern at outputs

Inputs Outputs

Page 22: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Examined applications that use BLAS and LAPACK

• 12 routines ≥O(n2), double precision real numbers– Matrix-vector multiplication – DGEMV– Matrix-matrix multiplication – DGEMM– Rank-1 update – DGER– Linear least squares – DGESV, DGELS– SVD factorization – DGESVD, DGGSVD,

DGESDD– Eigenvectors: DGEEV, DGGEV, DGEES,

DGGES

Page 23: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Examined applications that use BLAS and LAPACK

• 12 routines ≥O(n2), double precision real numbers

• Executed on randomly-generated nxnmatrixes(n=62, 125, 250, 500)

• BLAS/LAPACK from Intel’s Math Kernel Library on Opteron(MLK10) and Itanium2(MKL8)– Same results on both

• Error model: memory bit-flips

Page 24: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Error patterns: multiplicative error histograms

DGEMM

Page 25: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Output error patterns fall into few major categories

1.E‐08

1.E‐06

1.E‐04

1.E‐02

1.E+00

1.E‐08

1.E‐06

1.E‐04

1.E‐02

1.E+00

DGGESOutput beta - 62x1

DGESVOutput L - 62x62

DGGESOutput vsr - 62x62

DGEMMOutput C - 62x62

Page 26: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Error patterns may vary with matrix size

1.E‐07

1.E‐05

1.E‐03

1.E‐01

1.E‐07

1.E‐05

1.E‐03

1.E‐01

DGGSVDOutput beta

DGGSVDOutput V

62 125 250 500

Page 27: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Input-Output error transition functions

• Input-Output error transition functions: trained predictors– Linear Least Squares– Support Vector Machines

(linear, 2nd degree polynomial, rbf kernels)– Artificial Neural Nets

(3,10,100 layers,; linear, gaussian, gaussiansymmetric and sigmoid transfer functions)

Page 28: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Trained on multiple input error patterns

• DataInj: single bit errors

• DataInj-R: output errors of routines with DataInjinputs

• UniInj: uniform multiplicative errors ∈[-100,100]

• UniInj-R: output errors of routines with UniInjinputs

• Inj-R: output errors of error injected routines

Page 29: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Input-Output error transition functions

• Input-Output error transition functions: trained predictors– Linear Least Squares– Support Vector Machines– Artificial Neural Nets

• Trained on sample input error patterns

uniform multiplicative errors∈[-100,100]UniInj:

outputs of routines with UniInj inputsUniInj-R:

single bit errorsDataInj:

outputs of error injected routinesInj-R:

outputs of routines with DataInj inputsDataInj-R:

Page 30: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Output errors depend on input errors

• Equivalence classes– DataInj, DataInj-R | Inj-R– DataUni, DataUni-R

Page 31: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Evaluated accuracy of all predictors on all training sets

• Error metric: – probability of error ≥δ– δ∈{1e-14, 1e-13, …, 2, 10, 100)

1E‐10

1E‐09

1E‐08

1E‐07

1E‐06

1E‐05

0.0001

0.001

0.01

0.1

1

Recorded

Predicted

Page 32: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Evaluated accuracy of all predictors on all training sets

1E‐10

1E‐09

1E‐08

1E‐07

1E‐06

1E‐05

0.0001

0.001

0.01

0.1

1

Recorded

Predicted

1E‐10

1E‐09

1E‐08

1E‐07

1E‐06

1E‐05

0.0001

0.001

0.01

0.1

1

Recorded

Predicted

0%10%20%30%40%50%60%70%80%90%100%

Recorded Predicted Error

0%10%20%30%40%50%60%70%80%90%100%

Recorded Predicted Error

Page 33: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Linear Least Squares has best accuracy, Neural nets worstEvaluation set: union of all training sets

Page 34: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Linear Least Squares has best accuracy, Neural nets worst

Page 35: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Accuracy varies among predictors

DGEES, output wr

Page 36: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Linear Least Squares has best accuracy, Neural nets worst

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

e‐HeapInj.none.All uni0‐1All inj0‐1All e‐uni0‐1All e‐inj0‐1All

1.E‐14

2.E‐14

4.E‐14

9.E‐14

2.E‐13

3.E‐13

7.E‐13

2.E‐11

7.E‐10

2.E‐08

7.E‐07

2.E‐05

7.E‐04

2.E‐02

8.E‐01

2.E+00

1.E+01

Page 37: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Linear Least Squares has best accuracy, Neural nets worst

Inj-R

DataInj

DataInj-R

DataUni

DataUni-R

Page 38: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Evaluated predictors on randomly-generated applications

• Application has constant number of levels• Constant number of operations per level• Operations use as input data from prior

level(s)

Inputs Outputs

Page 39: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Neural Nets: Poor accuracy for application vulnerability prediction

1E‐10

1E‐09

1E‐08

1E‐07

1E‐06

1E‐05

0.0001

0.001

0.01

0.1

1 Recorded

Predicted

Function=sigmoid, 3 hidden layers

Page 40: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

1E‐10

1E‐09

1E‐08

1E‐07

1E‐06

1E‐05

0.0001

0.001

0.01

0.1

1 Recorded

Predicted

Linear Least Squares: Good accuracy, restricted

Page 41: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

1E‐10

1E‐09

1E‐08

1E‐07

1E‐06

1E‐05

0.0001

0.001

0.01

0.1

1

Recorded

Predicted

SVMs: Good accuracy, general

Function=rbf, gamma=1.0

Page 42: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Work is still in progress

• Correlating accuracy of input/output predictors to accuracy of application prediction

• More detailed fault injection

• Applications with loops

• Real applications

Page 43: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Step 3: Compiler analyses

• No need to focus on external libraries

• Can use compiler analysis to – Do fault injection/propagation on per-function

basis– Propagate error profiles through more data

structures (matrix, scalar, tree, etc.)

Page 44: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Step 4: Scalable analysis of parallel applications

• Cannot do fault injection on 1,000-process application

• Can modularize fault injection– Inject into individual processes

Page 45: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Step 4: Scalable analysis of parallel applications

• Cannot do fault injection on 1,000-process application

• Can modularize fault injection– Inject into single-process runs– Propagate through small-scale runs

Page 46: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same

Working toward understanding application vulnerability to errors

• Soft errors becoming increasing problem on HPC systems

• Must understand how applications react to soft errors

• Traditional approaches inefficient for realistic applications

• Developing tools to cheaply understand vulnerability of real scientific applications