silent error detection in numerical time stepping schemes (siam pp 2014)

Silent error resilience in numerical time-stepping schemes

Austin Benson*, Institute for Computational and Mathematical Engineering

Stanford University

Sven Schmit* (ICME) and Rob Schreiber (HP Labs)

SIAM PP 2014

* work done while interning at HP Labs

February 19, 2014

Illustrative example 2

[Figure, two panels: Crank-Nicolson solution (axes x and i∆t); differences D_i versus time step i (log scale) for the Richardson / Crank-Nicolson and forward / backward Euler pairs]

$u_t = \frac{1}{100} u_{xx} + 0.1\,(\sin(2\pi t) + \cos(2\pi x))$

$t \in [0, 2]$, $x \in [0, 1]$

$u(x, 0) = x(x - 1)$

$\Delta x = 1/160$, $\Delta t = 1/100$

Illustrative example: what’s at fault? 3


[Figure, two panels: solution (axes x and i∆t); differences D_i versus time step i (log scale)]



- At step 120, a single entry in the RHS of the Crank-Nicolson and backward Euler linear solves was multiplied by 0.995

Main idea 4

[Figure: RK 4/5 differences, D_i versus iteration (i), log scale]

- At each time step, base method B generates $B_1, B_2, \ldots$

- Auxiliary method A "checks" with $A_1, A_2, \ldots$

- $D_i = \|B_i - A_i\|$ abnormal → possible error

What are these things? 5

[Figure: RK 4/5 differences, D_i versus iteration (i), log scale]

- Base method B: higher-order scheme (Runge-Kutta 5)

- Auxiliary method A "checks": lower-order scheme (Runge-Kutta 4)

- A needs to be cheap: use embedded pairs

[Fehlberg, 1969], [Dormand and Prince, 1980]

RK 1/2 A/B scheme 6

ODE: $u' = f(t, u)$.

$k_1^B = f(t_n, u_n^B)$

$u_{n+1}^B = u_n^B + h f\left(t_n + h/2,\ u_n^B + h k_1^B / 2\right)$

Re-use data!

$u_{n+1}^A = u_n^B + h k_1^B$

$D_{n+1} = \|u_{n+1}^A - u_{n+1}^B\|$
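The RK 1/2 pair can be sketched in a few lines. This is only an illustration under stated assumptions: a scalar ODE, a fixed step size, and the hypothetical helper names `rk12_step` and `integrate`; the test problem u' = -u is not from the slides. The base method is the explicit midpoint rule, and the forward Euler check re-uses the stage `k1` that the base method already computed.

```python
def rk12_step(f, t, u, h):
    """One step of the midpoint rule (base B) plus a forward Euler check (auxiliary A)."""
    k1 = f(t, u)                                 # stage shared by both methods
    u_b = u + h * f(t + h / 2, u + h * k1 / 2)   # base: explicit midpoint, order 2
    u_a = u + h * k1                             # auxiliary: forward Euler, order 1
    return u_b, abs(u_a - u_b)                   # new state and difference D_{n+1}


def integrate(f, u0, h, steps):
    """Advance with the base method and record the A/B difference at every step."""
    t, u, diffs = 0.0, u0, []
    for _ in range(steps):
        u, d = rk12_step(f, t, u, h)
        diffs.append(d)
        t += h
    return u, diffs


# Assumed test problem: u' = -u, u(0) = 1, integrated to t = 1.
u_end, diffs = integrate(lambda t, u: -u, 1.0, 0.01, 100)
```

The only extra work for the check is one multiply-add per step; no additional evaluation of `f` is needed.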

Forward / Backward Euler A/B scheme 7

Want to solve: $u_t = k u_{xx}$ (1D)

$A U_{n+1}^B = U_n^B$

Re-use data!

$U_{n+1}^A = B U_n^B$

$D_{n+1} = \|U_{n+1}^B - U_{n+1}^A\|$
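A minimal sketch of the forward / backward Euler pair follows. It rests on several assumptions that the slides do not state: homogeneous Dirichlet boundaries, the matrix in the backward Euler solve read as I - rT and the forward Euler matrix as I + rT (T the standard second-difference stencil, r = k∆t/∆x²), a pure-Python Thomas solver, and illustrative grid parameters. The names `thomas_solve`, `fe_be_step` and `run` are hypothetical. The optional fault multiplies one RHS entry of the linear solve by 0.995, mimicking the illustrative example.

```python
import math


def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system (Thomas algorithm, no pivoting)."""
    n = len(rhs)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def fe_be_step(u, r, fault_index=None, fault_factor=1.0):
    """Return the backward Euler update and D = max-norm of (BE - FE)."""
    n = len(u)
    rhs = list(u)
    if fault_index is not None:
        rhs[fault_index] *= fault_factor  # assumed model of a silent error in the solve
    # Base method: backward Euler, (I - r T) U_{n+1} = U_n.
    u_b = thomas_solve([-r] * n, [1 + 2 * r] * n, [-r] * n, rhs)
    # Auxiliary method: forward Euler on the same (uncorrupted) U_n.
    u_a = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        u_a.append(u[i] + r * (left - 2 * u[i] + right))
    return u_b, max(abs(a - b) for a, b in zip(u_a, u_b))


def run(steps=100, fault_step=None):
    k, dx, dt = 0.01, 0.02, 0.01          # assumed values, giving r = 0.25
    r = k * dt / dx ** 2
    u = [math.sin(math.pi * (i + 1) * dx) for i in range(49)]
    diffs = []
    for step in range(steps):
        idx = 24 if step == fault_step else None
        u, d = fe_be_step(u, r, fault_index=idx, fault_factor=0.995)
        diffs.append(d)
    return u, diffs


u_clean, d_clean = run()
u_fault, d_fault = run(fault_step=60)
```

In this setup the difference at the faulty step is several orders of magnitude larger than in the fault-free run, which is the kind of jump the detector looks for.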

Lots of these schemes 8

- Backward / Forward Euler, Richardson / Crank-Nicolson

- Runge-Kutta 2/3, 4/5

- Adams-Bashforth linear multistep method 2/3, 4/5

- Explicit check on implicit scheme

- Extrapolation

- Key idea: auxiliary method A re-uses data and communication from base method B

Detecting errors 9


[Figure, two panels: solution (axes x and i∆t); differences D_i versus time step i (log scale)]



- Exercise in step detection

Detecting errors 10


[Figure, two panels: solution (axes x and i∆t); differences D_i versus time step i (log scale)]



$D_{n+1} = \|A_{n+1} - B_{n+1}\|_\infty$

$J_{n+1} = \dfrac{D_{n+1} - D_n}{D_n}$, relative jump

$V_{n+1} = \dfrac{\mathrm{Var}(D_{n-p+1}, \ldots, D_{n+1})}{\mathrm{Var}(D_{n-p}, \ldots, D_n)}$, variance change

- $p = 10$ is usually good
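The two statistics can be computed from a running history of differences. The sketch below is one possible reading: the function names are hypothetical, and population variance is an assumption (the ratio is the same with sample variance, since both windows hold p + 1 values). It needs at least p + 2 stored differences and a previous window that is not constant.

```python
from statistics import pvariance


def relative_jump(d_hist):
    """J_{n+1} = (D_{n+1} - D_n) / D_n, from the two newest differences."""
    return (d_hist[-1] - d_hist[-2]) / d_hist[-2]


def variance_change(d_hist, p=10):
    """V_{n+1}: variance of D_{n-p+1..n+1} divided by variance of D_{n-p..n}."""
    new_window = d_hist[-(p + 1):]        # D_{n-p+1}, ..., D_{n+1}
    old_window = d_hist[-(p + 2):-1]      # D_{n-p},   ..., D_n
    return pvariance(new_window) / pvariance(old_window)


# Made-up history: slow drift followed by one abrupt jump.
history = [1.0 + 0.01 * i for i in range(12)] + [2.0]
jump = relative_jump(history)
ratio = variance_change(history, p=10)
```

Both quantities stay near a baseline while the differences drift smoothly and rise sharply when a single new value departs from the recent window.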

Error detection algorithm 11

input: tolerances τ_J and τ_V, scaling parameters Γ > 1, γ < 1

for n = 1, 2, . . . do
    D_{n+1} := ‖A_{n+1} − B_{n+1}‖
    if J_{n+1} > τ_J and V_{n+1} > τ_V then
        FlagError()
        Move back in time
    end
    if J_{n+1} > τ_J then τ_J := Γτ_J else τ_J := γτ_J
    if V_{n+1} > τ_V then τ_V := Γτ_V else τ_V := γτ_V
end

- Γ = 1.4, γ = 0.95
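The detection loop can be sketched as a function over a precomputed list of differences. Several choices here are assumptions rather than the authors' implementation: the initial tolerances of 1.0, the hypothetical name `detect`, starting only once a full variance window is available, and recording the flagged step instead of actually moving back in time. The made-up test sequence decays geometrically with one spike.

```python
from statistics import pvariance


def detect(diffs, tau_j=1.0, tau_v=1.0, big_gamma=1.4, small_gamma=0.95, p=10):
    """Return the indices n at which both the jump and variance tests fire."""
    flagged = []
    for n in range(p + 1, len(diffs)):
        # Relative jump of the newest difference.
        jump = (diffs[n] - diffs[n - 1]) / diffs[n - 1]
        # Variance of the newest p+1 values over the previous window's variance.
        var_ratio = pvariance(diffs[n - p:n + 1]) / pvariance(diffs[n - p - 1:n])
        if jump > tau_j and var_ratio > tau_v:
            flagged.append(n)  # a real solver would also roll back here
        # Adapt each tolerance: loosen after an exceedance, tighten otherwise.
        tau_j = big_gamma * tau_j if jump > tau_j else small_gamma * tau_j
        tau_v = big_gamma * tau_v if var_ratio > tau_v else small_gamma * tau_v
    return flagged


clean = [1e-4 * 0.99 ** i for i in range(100)]   # smooth, fault-free differences
faulty = list(clean)
faulty[60] *= 5.0                                # assumed effect of one silent error

clean_flags = detect(clean)
faulty_flags = detect(faulty)
```

Requiring both tests to fire is what keeps the smooth sequence unflagged: its variance ratio repeatedly crosses the shrinking τ_V, but its relative jump never crosses τ_J.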

Which errors matter? 12

- $\hat{B}_n$ and $\hat{A}_n$ are the outputs of B and A when a fault is injected

- $B_n$ and $A_n$ are the outputs when no fault is injected

Local truncation error-normalized error:

$L_n = \dfrac{\|\hat{B}_n - B_n\|}{\|B_n - A_n\|} \approx \dfrac{\text{difference caused by error}}{\text{local truncation error}}$
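As a small worked example, the normalized error is the distance between the faulty and fault-free base outputs divided by the fault-free A/B difference. The sketch assumes the max norm (the slide leaves the norm unspecified here) and uses made-up vectors; `lte_normalized_error` is a hypothetical name.

```python
def lte_normalized_error(b_fault, b_clean, a_clean):
    """Ratio of the fault-induced change in B to the fault-free A/B difference."""
    caused_by_error = max(abs(x - y) for x, y in zip(b_fault, b_clean))
    truncation_estimate = max(abs(x - y) for x, y in zip(b_clean, a_clean))
    return caused_by_error / truncation_estimate


# Made-up outputs of one step: the fault moves one entry of B by 1.0,
# while the fault-free methods differ by at most 0.5.
ratio = lte_normalized_error([1.0, 3.0], [1.0, 2.0], [1.0, 2.5])
```

A value well above 1 means the fault perturbs the solution by more than the method's own local truncation error, so it is the kind of error worth detecting.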

Experimental setup 13

[Figure, two panels: Crank-Nicolson solution (axes x and i∆t); differences D_i versus time step i (log scale)]



- Solve equation and artificially inject error at one time step

- Do this for many trials with different types of errors

- True positive rate: #(real errors detected) / #(trials)

- False positive rate: #(non-errors "detected") / #(time steps)
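The two rates can be tallied from per-trial logs as follows. The record layout `(fault_step, flagged_steps, n_steps)`, the name `detection_rates`, and the `slack` parameter (counting a flag one step after the fault as a detection) are assumptions made for this sketch, and the trial data is invented.

```python
def detection_rates(trials, slack=1):
    """Return (true positive rate, false positive rate) over a list of trials."""
    detected = false_flags = total_steps = 0
    for fault_step, flagged_steps, n_steps in trials:
        # Flags at the fault step, or up to `slack` steps later, count as hits.
        hits = [s for s in flagged_steps if fault_step <= s <= fault_step + slack]
        detected += 1 if hits else 0
        false_flags += len(flagged_steps) - len(hits)
        total_steps += n_steps
    return detected / len(trials), false_flags / total_steps


# Invented logs: four trials of 200 steps, each with a fault injected at step 120.
trials = [
    (120, [120], 200),       # detected at the step of the fault
    (120, [121, 30], 200),   # detected one step late, plus one false flag
    (120, [], 200),          # missed
    (120, [45], 200),        # missed, with one false flag
]
tpr, fpr = detection_rates(trials)
```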

Heat equation 14

- $u_t = 0.001\,u_{xx} + \left(1 - \sqrt{1 - 4(t - t^2)}\right) / (2 - 2t)$

- $u(x, 0) = 6|x - 1/2| - 3$

- Error: multiply an entry of the RHS in the linear solves by $z \sim N(1, 5 \times 10^{-5})$ at a single time step

[Figure, two panels: true positive rate versus LTE-normalized error. FE/BE, ∆x = 1/200, ∆t = 1/100: FPR = 0.000. R/CN, ∆x = 1/200, ∆t = 1/100: FPR = 0.012. Legend: detected at step of fault; detected at step or step after fault.]

Heat equation 15

- $u_t = 0.01\,u_{xx} + q(x, t)$, $q(x, t) = x e^{-t/2}$

- $u(x, 0) = 4x(x - 1)(x - 2)$

- Error: multiply $q(x, t)$ at one discrete $x$ by $z \sim N(1, 0.1)$ at a single time step

[Figure, two panels: true positive rate. FE/BE, ∆x = 1/100, ∆t = 1/60: FPR = 0.000. R/CN, ∆x = 1/100, ∆t = 1/60: FPR = 0.000.]

Adams-Bashforth 16

- $u''(t) - b(1 - u(t)^2)\,u'(t) + u(t) = 0$

- $u'(0) = 1$, $u(0) = 0$

- Error: multiply one derivative evaluation by $z \sim N(1, 0.1)$
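To make this experiment concrete, here is a sketch of an Adams-Bashforth 2/3 pair on the Van der Pol equation written as a first-order system. It is not the authors' code: the RK4 start-up steps, the max norm, the fixed multiplier `z` (instead of a random draw), and the names `van_der_pol`, `rk4_step` and `ab23_differences` are all assumptions. AB3 is the base method; AB2 re-uses the two most recent stored derivative evaluations.

```python
def van_der_pol(y, b=2.0):
    """Right-hand side for y = (u, u'): u'' = b (1 - u^2) u' - u."""
    u, v = y
    return (v, b * (1.0 - u * u) * v - u)


def rk4_step(y, h):
    """Classical RK4 step, used only to start the multistep methods (assumed)."""
    def shifted(k, s):
        return tuple(a + s * c for a, c in zip(y, k))
    k1 = van_der_pol(y)
    k2 = van_der_pol(shifted(k1, h / 2))
    k3 = van_der_pol(shifted(k2, h / 2))
    k4 = van_der_pol(shifted(k3, h))
    return tuple(a + h / 6 * (p + 2 * q + 2 * r + s)
                 for a, p, q, r, s in zip(y, k1, k2, k3, k4))


def ab23_differences(h=0.05, steps=200, fault_step=None, z=1.0):
    """Advance with AB3, check with AB2, and return the list of differences."""
    y = (0.0, 1.0)                         # u(0) = 0, u'(0) = 1
    f_hist = [van_der_pol(y)]
    for _ in range(2):                     # two start-up steps
        y = rk4_step(y, h)
        f_hist.append(van_der_pol(y))
    diffs = []
    for n in range(2, steps):
        f2, f1, f0 = f_hist[-3], f_hist[-2], f_hist[-1]   # f_{n-2}, f_{n-1}, f_n
        y_b = tuple(yi + h * (23 * c - 16 * m + 5 * a) / 12
                    for yi, a, m, c in zip(y, f2, f1, f0))  # base: AB3
        y_a = tuple(yi + h * (3 * c - m) / 2
                    for yi, m, c in zip(y, f1, f0))         # auxiliary: AB2
        diffs.append(max(abs(p - q) for p, q in zip(y_a, y_b)))
        y = y_b
        f_new = van_der_pol(y)
        if n + 1 == fault_step:
            f_new = tuple(z * c for c in f_new)  # corrupt one derivative evaluation
        f_hist.append(f_new)
    return diffs


clean = ab23_differences()
faulty = ab23_differences(fault_step=100, z=1.5)
```

Because the corrupted evaluation enters both formulas with different weights, the A/B difference changes at the first step that uses it, while all earlier differences are identical to the fault-free run.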

[Figure, two panels: true positive rate, horizontal axis on a log scale from 10^0 to 10^3. AB23 on Van der Pol with h = 1/20, b = 2: FPR = 0.037. AB45 on Van der Pol with h = 1/20, b = 2: FPR = 0.052.]

Runge-Kutta 17

- $u''(t) - b(1 - u(t)^2)\,u'(t) + u(t) = 0$

- $u'(0) = 1$, $u(0) = 0$

- Error: multiply one derivative evaluation by $z \sim N(1, 0.1)$

[Figure, two panels: true positive rate, horizontal axis on a log scale from 10^0 to 10^3. RK23 on Van der Pol with h = 1/10, b = 2: FPR = 0.066. RK45 on Van der Pol with h = 1/10, b = 2: FPR = 0.098.]

Key ideas 18

Key ideas:

- Take advantage of "paired" solvers to check solutions

- High-impact error → easier to detect

- Simple detection schemes work pretty well

End 19

Questions? Samples:

- What is the performance penalty?

- Why does detection occur one step after the fault?

Information:

- Austin Benson: [email protected]

- Pre-print: see http://stanford.edu/~arbenson

Tardy error detection 20

[Figure, two panels. Tardy error detection on heat equation: FE/BE difference D_i versus time step (i), steps 128 to 140, with the step of the fault marked. Component-wise absolute difference BE/FE: |BE(i) − FE(i)| versus vector component i, for the step before the fault, the step of the fault, and the step after the fault.]