Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Santos, Caio Lunardi, José Maria Cela, Philippe Navaux, Luigi Carro, Paolo Rech

WMC 2017

Daniel Oliveira – WMC 2017

HPC reliability importance

Available Accelerators

Modern parallel accelerators (Kepler K40, Xeon-Phi) offer:

- Low cost
- Flexible platform
- High efficiency (low per-thread consumption)
- High computational power and frequency
- Huge amount of resources
- Reliability? (error rate)

Titan

Titan (Oak Ridge National Lab): 18,688 GPUs

High probability of having a GPU corrupted.

Titan's MTBF for Detected Uncorrectable Errors is ~44 h*

*(field and experimental data from HPCA’15)
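As a rough worked example of what the ~44 h figure implies for a single device, the sketch below assumes that errors are independent and spread evenly across all 18,688 GPUs. That assumption and the function names are illustrative; they are not taken from the slides.

```python
# Back-of-the-envelope conversion: system MTBF -> per-GPU MTBF and FIT.
# Assumption: failures are independent and identically distributed across GPUs.

HOURS_PER_YEAR = 24 * 365


def per_device_mtbf(system_mtbf_h: float, n_devices: int) -> float:
    """With n independent devices the failure rates add up, so MTBF scales by n."""
    return system_mtbf_h * n_devices


def fit_from_mtbf(mtbf_h: float) -> float:
    """FIT = expected number of failures in 10^9 device-hours."""
    return 1e9 / mtbf_h


titan_gpus = 18_688      # from the slide
titan_mtbf_h = 44.0      # from the slide

gpu_mtbf_h = per_device_mtbf(titan_mtbf_h, titan_gpus)
print(f"per-GPU MTBF: {gpu_mtbf_h:,.0f} h (~{gpu_mtbf_h / HOURS_PER_YEAR:.0f} years)")
print(f"per-GPU DUE rate: {fit_from_mtbf(gpu_mtbf_h):,.0f} FIT")
```

Under this assumption a single GPU fails rarely; it is the size of the machine that brings the system-level MTBF down to hours.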

Outline

Radiation Effects Essentials

Error Criticality in HPC

Experimental Procedure

K40 vs Xeon Phi FIT rates

Qualify SDCs for HPC applications

What’s the Plan?

Terrestrial Radiation Environment

The interaction of galactic cosmic rays with the atmosphere generates neutrons.

Flux: 13 n/(cm²·h) at sea level

[Figure: bit values (0, 1) stored in a flip-flop (FF) and logic hit by a particle]

Soft Errors: the device is not permanently damaged, but the particle may generate bit-flips or logic errors
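To illustrate why the effect of a single bit-flip depends on where it lands, the sketch below flips one bit of the IEEE-754 binary64 encoding of a value. The helper `flip_bit` is a hypothetical name introduced for this example only.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    raw ^= 1 << bit
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw))
    return flipped


x = 1.0
for bit in (0, 51, 52, 62, 63):
    print(f"bit {bit:2d} flipped: {x!r} -> {flip_bit(x, bit)!r}")
```

A flip in the low mantissa bits changes the value only marginally, while a flip in the exponent or sign changes it drastically.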

Silent Data Corruption vs Crash

Silent Data Corruption (SDC): soft errors in

- data cache
- register files
- logic gates (ALU)
- scheduler

DUE (Crash): soft errors in

- instruction cache
- scheduler / dispatcher
- PCI-e bus controller

Output Correctness in HPC

A single fault on shared resources or on the scheduler affects several parallel threads, producing multiple corrupted elements.

- The error can fall within the intrinsic variance of floating-point arithmetic
- Values in a given range are accepted as correct in physical simulations
- Imprecise computation is being applied to HPC

Not all SDCs are critical for HPC applications.

Goal: quantify and qualify SDCs in NVIDIA and Intel architectures.

Radiation Test Facilities

Irradiation of Chips Electronics

GPU Radiation Test Setup

GPU power control circuitry is out of beam.

[Figure: test setup; labels: NVIDIA K40 (two boards), Intel Xeon-Phi (two boards), desktop PCs]

Neutrons Spectrum

- @LANSCE: 1.8×10^6 n/(cm²·h)
- @NYC: 13 n/(cm²·h)

We test each architecture for 800 h, simulating 9.2×10^8 h of natural radiation (~91,000 years).

All the collected SDCs are publicly available: https://github.com/UFRGS-CAROL/HPCA2017-log-data
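The slides do not spell out how beam results are turned into FIT rates. A common approach, assumed here, is to divide the number of observed errors by the received fluence to get a cross section, then multiply by the natural flux and by 10^9 hours. The two fluxes come from the slide; the error count and beam time in the example are hypothetical.

```python
LANSCE_FLUX = 1.8e6   # n/(cm^2 h), from the slide
NYC_FLUX = 13.0       # n/(cm^2 h), from the slide


def acceleration_factor(beam_flux: float = LANSCE_FLUX,
                        natural_flux: float = NYC_FLUX) -> float:
    """How many times more intense the beam is than the natural environment."""
    return beam_flux / natural_flux


def cross_section(n_errors: int, beam_hours: float,
                  beam_flux: float = LANSCE_FLUX) -> float:
    """Errors per unit fluence, in cm^2."""
    fluence = beam_flux * beam_hours          # n/cm^2 received by the device
    return n_errors / fluence


def fit_rate(n_errors: int, beam_hours: float,
             beam_flux: float = LANSCE_FLUX,
             natural_flux: float = NYC_FLUX) -> float:
    """Expected errors per 10^9 hours of operation under the natural flux."""
    return cross_section(n_errors, beam_hours, beam_flux) * natural_flux * 1e9


print(f"acceleration factor: {acceleration_factor():,.0f}x")
# Hypothetical experiment: 100 SDCs observed in 10 h of beam time.
print(f"FIT: {fit_rate(n_errors=100, beam_hours=10.0):,.0f}")
```

The plots in this deck report relative FIT in arbitrary units, so only the ratios between configurations matter there, not the absolute values this sketch would produce.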

Selected Algorithms

We select a set of benchmarks that:

- stimulate different resources
- are representative of HPC applications
- minimize error masking (high AVF)

Benchmarks:

- DGEMM: matrix multiplication
- lavaMD: particle interactions
- Hotspot: heat simulation
- CLAMR: DOE's workload

Xeon Phi vs K40 FIT rate

[Figure: relative FIT (a.u., log scale from 1 to 1000) for Xeon Phi and K40 on lavaMD (15, 19, 23), DGEMM (2^10, 2^11, 2^12), Hotspot and CLAMR; one entry is N/A]

Xeon Phi error rate seems lower than Kepler, but:

- Xeon Phi is built in 3D Tri-Gate, Kepler in planar CMOS
- Xeon Phi and K40 have different throughput

Parallelism Management Reliability

What about parallel thread management?

[Figure: relative FIT (a.u.) vs. input size for lavaMD (15, 19, 23) and DGEMM (2^10, 2^11, 2^12); series: K40, Xeon Phi]

Increasing the input size (and the number of threads):

- Xeon-Phi error rate remains constant (<20% variation)
- K40 SDC error rate increases with input size

Parallelism Management Reliability

K40:

- FIT increases with input size: the HW scheduler is prone to corruption!
- Data of 2048 active threads is maintained in the register file

Xeon-Phi:

- Constant FIT rate: the embedded OS is OK!
- Only 4 threads/core are maintained; other threads' data is in main memory (not exposed)

Parallelism Management Reliability

[Figure: DGEMM GFlops (0 to 1.2×10^3) vs. input size (2^9×2^9 to 2^13×2^13) for Xeon Phi and K40. Xeon-Phi GFlops are almost constant; K40 GFlops rapidly increase]

K40 throughput increases with input size.

The reliability vs. performance trade-off should be considered (in the paper: Mean Workload Between Failures).
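The slide only names the metric. One plausible reading, assumed here, is the amount of work completed on average between two failures, i.e. throughput multiplied by MTBF. The sketch below follows that reading; the two devices and all their numbers are hypothetical.

```python
def mtbf_hours(fit: float) -> float:
    """MTBF in hours from a FIT rate (failures per 10^9 hours)."""
    return 1e9 / fit


def mwbf_gflop(throughput_gflops: float, fit: float) -> float:
    """Work (in GFLOP) completed on average between two failures.

    Assumed definition: throughput [GFLOP/s] * 3600 [s/h] * MTBF [h].
    """
    return throughput_gflops * 3600.0 * mtbf_hours(fit)


# Hypothetical devices: one slower with a low error rate, one faster with a higher one.
slow_but_robust = mwbf_gflop(throughput_gflops=200.0, fit=100.0)
fast_but_fragile = mwbf_gflop(throughput_gflops=1000.0, fit=300.0)

print(f"slow but robust : {slow_but_robust:.2e} GFLOP between failures")
print(f"fast but fragile: {fast_but_fragile:.2e} GFLOP between failures")
```

With these made-up numbers the faster device completes more work between failures even though its FIT rate is three times higher, which is why FIT alone does not settle the comparison.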

Quantify and Qualify SDCs

- Number of incorrect elements
- Relative Error: how different the corrupted value is from the expected value
- Potentially Masked Errors: relative error < 2% is tolerable
- Spatial Locality (line, square, random): in the paper

[Figure: examples of line, square and random error patterns]
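The metrics above can be sketched as a comparison between a fault-free ("golden") output and the output of an irradiated run. This is a minimal illustration, not the authors' analysis code: the function names, the "none" and "single" categories, the fallback used when the expected value is zero, and the rule used to tell "square" from "random" are all assumptions of this example.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Error = Tuple[int, int, float]  # (row, column, relative error)


def find_errors(expected: Matrix, observed: Matrix) -> List[Error]:
    """List every element that differs from the golden output."""
    errors = []
    for i, (exp_row, obs_row) in enumerate(zip(expected, observed)):
        for j, (e, o) in enumerate(zip(exp_row, obs_row)):
            if o != e:
                scale = abs(e) if e != 0 else 1.0  # assumption: absolute error if expected is 0
                errors.append((i, j, abs(o - e) / scale))
    return errors


def potentially_masked(errors: List[Error], tolerance: float = 0.02) -> bool:
    """True if every corrupted element is within the tolerance (2% on the slide)."""
    return all(rel < tolerance for _, _, rel in errors)


def spatial_locality(errors: List[Error]) -> str:
    """Classify the error pattern; the 'square' rule is a simplification."""
    if not errors:
        return "none"
    if len(errors) == 1:
        return "single"
    rows = {i for i, _, _ in errors}
    cols = {j for _, j, _ in errors}
    if len(rows) == 1 or len(cols) == 1:
        return "line"
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return "square" if len(errors) == box_area else "random"


golden = [[1.0, 2.0], [3.0, 4.0]]
faulty = [[1.0, 2.02], [3.0, 8.0]]   # hypothetical corrupted output

errs = find_errors(golden, faulty)
print("incorrect elements:", len(errs))
print("max relative error:", max(rel for _, _, rel in errs))
print("spatial locality  :", spatial_locality(errs))
print("potentially masked:", potentially_masked(errs))
```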

Number of Incorrect Elements vs Relative Error

[Figure: number of incorrect elements vs. relative error for DGEMM and lavaMD; series: K40, Xeon Phi]

Plot annotations:

- Higher number of corrupted elements
- Greater difference from the expected value
- BAD: a high number of corrupted elements, which are very different from the expected output

DGEMM: K40 has few corrupted elements, with values similar to the expected ones. Xeon Phi has a lot of corrupted elements, which are very different from the expected value.

lavaMD: both K40 and Xeon Phi have few corrupted elements. K40 corruptions are very different from the expected values.

Purely arithmetic operations are more reliable (and faster) on the K40 (GPUs have shorter and faster pipelines).

Xeon Phi is more reliable for Finite Difference Methods (lavaMD), which are based on transcendental functions (exp).

Potentially Masked Errors

Relative error < 2% is tolerable.

[Figure: relative FIT (a.u., log scale from 1 to 1000) for K40 and Xeon Phi on lavaMD (15, 19, 23), DGEMM (2^10, 2^11, 2^12), Hotspot and CLAMR (one entry is N/A), with the share of errors < 2% highlighted]

- lavaMD: at most 5% of errors are potentially masked. Exponentiation exacerbates the error magnitude.
- DGEMM: ~64% of K40 errors are potentially masked, 0% for the Xeon Phi! K40's short and fast pipelines are reliable for arithmetic operations.

Hotspot

[Figure: relative FIT (a.u., log scale from 1 to 100) for K40 and Xeon Phi on Hotspot, with the share of errors < 2% highlighted]

Stencil-like code: a lot of elements are corrupted, but the error is small.

Most errors are potentially masked: 97% for K40, 81% for Xeon Phi.

Temperature is calculated considering nearby cells. The error dissipates (and spreads) as equilibrium is reached.
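The "dissipates and spreads" behaviour can be reproduced with a toy model. The sketch below is not Hotspot itself: it is a one-dimensional three-point averaging stencil with fixed boundaries, with a hypothetical uniform temperature field and a hypothetical corruption of a single cell by 10%.

```python
def step(grid):
    """One stencil update: each interior cell becomes the mean of itself and its two neighbours."""
    new = list(grid)
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def compare(golden, faulty):
    """Return (number of corrupted cells, largest relative error)."""
    rel = [abs(f - g) / abs(g) for g, f in zip(golden, faulty) if f != g]
    return len(rel), max(rel, default=0.0)


golden = [300.0] * 41      # hypothetical uniform temperature field
faulty = list(golden)
faulty[20] = 330.0         # hypothetical SDC: one cell is off by 10%

history = []
for iteration in range(7):
    history.append(compare(golden, faulty))
    golden, faulty = step(golden), step(faulty)

for iteration, (count, worst) in enumerate(history):
    print(f"iteration {iteration}: {count:2d} corrupted cells, "
          f"max relative error {worst:.4f}")
```

In this toy setting the number of corrupted cells grows by two per iteration while the largest relative error falls from 10% to below the 2% tolerance after six iterations: many corrupted elements, each with a small error.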

What’s the Plan?

Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.
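To put the 55x figure in perspective, the sketch below scales the ~44 h MTBF quoted earlier for Titan. It assumes the error rate grows linearly with system size, which is a simplification made for this example.

```python
def scaled_mtbf(mtbf_h: float, scale: float) -> float:
    """MTBF of a system `scale` times larger, assuming the error rate grows linearly with size."""
    return mtbf_h / scale


titan_mtbf_h = 44.0   # detected uncorrectable errors, from the Titan slide
exascale_mtbf_h = scaled_mtbf(titan_mtbf_h, 55.0)

print(f"{exascale_mtbf_h:.2f} h, i.e. about {exascale_mtbf_h * 60:.0f} minutes between errors")
```

Under that assumption the interval between detected uncorrectable errors drops from almost two days to under an hour.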

- We can show how SDC appears at the output, to ease detection

- Understand SDC criticality. Not all errors significantly affect output: there are “acceptable” SDC

- Fault injection to better understand error propagation
  - SASSIFI: NVIDIA architectural-level fault injector
  - CAROL-FI: UFRGS fault injector for Xeon Phi and x86

- Propose selective-hardening solutions (duplicate only what matters, what REALLY matters)
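The slides do not describe a concrete hardening scheme. As a generic illustration of "duplicate only what matters", the sketch below applies duplication with comparison to a single critical function and leaves the rest of a program unprotected. All names are hypothetical, and the fault is injected artificially by a test helper.

```python
import math


class MismatchError(RuntimeError):
    """Raised when the two redundant executions disagree."""


def run_duplicated(compute):
    """Execute `compute` twice and accept the result only if both runs agree."""
    first = compute()
    second = compute()
    if first != second:
        raise MismatchError(f"redundant runs disagree: {first!r} vs {second!r}")
    return first


def with_injected_fault(compute, faulty_call):
    """Hypothetical test helper: corrupt the result of the `faulty_call`-th invocation."""
    calls = 0

    def wrapped():
        nonlocal calls
        calls += 1
        result = compute()
        return result * 2.0 if calls == faulty_call else result

    return wrapped


def critical_kernel():
    # Stand-in for a part whose errors are rarely masked (e.g. an exponentiation).
    return math.exp(-1.5)


print(run_duplicated(critical_kernel))   # fault-free: both runs agree
try:
    run_duplicated(with_injected_fault(critical_kernel, faulty_call=2))
except MismatchError as err:
    print("detected:", err)
```

Duplication doubles the cost of whatever it wraps, so restricting it to the parts whose errors are rarely masked keeps the overhead down.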

Sponsors

Research has received funding from the EU H2020 Programme and from MCTI/RNP-Brazil under the HPC4E Project, grant agreement 689772.

Acknowledgments

Caio Lunardi, Caroline Aguiar, Daniel Oliveira, Fernando Santos, Vinicius Fratin, Paolo Rech, Philippe Navaux, Luigi Carro

Chris Frost

Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender

Timothy Tsai, Siva Hari, Steve Keckler

David Kaeli, NUCAR group

Matteo Sonza Reorda, Luca Sterpone

Laercio Pilla

Israel Koren, Sandip Kundu