practical and efficient time integration and ... - jed brown · why implicit is silly for waves i...

27
Practical and Efficient Time Integration and Kronecker Product Solvers Jed Brown [email protected] (CU Boulder and ANL) Collaborators: Debojyoti Ghosh (LLNL), Matt Normile (CU), Martin Schreiber (Exeter) SIAM Central, 2017-09-30 This talk: https: //jedbrown.org/files/20170930-FastKronecker.pdf

Upload: others

Post on 03-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Practical and Efficient Time Integration andKronecker Product Solvers

Jed Brown [email protected] (CU Boulder and ANL)Collaborators: Debojyoti Ghosh (LLNL), Matt Normile (CU), Martin

Schreiber (Exeter)

SIAM Central, 2017-09-30This talk: https:

//jedbrown.org/files/20170930-FastKronecker.pdf

Page 2: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

1

10

2008 2010 2012 2014 2016

FL

OP

pe

r B

yte

End of Year

Theoretical Peak Floating Point Operations per Byte, Double Precision

HD 3870

HD 4870

HD 5870HD 6970

HD 6970

HD 7970 GHz Ed.

HD 8970

FirePro W

9100

FirePro S9150

X5482X5492

W5590

X5680 X5690

E5-2690

E5-2697 v2

E5-2699 v3

E5-2699 v3

E5-2699 v4

Tesla C1060

Tesla C1060

Tesla C2050

Tesla M2090

Tesla K20Tesla K20X

Tesla K40

Tesla K40

Tesla P100

Xeon Phi 7120 (KNC)

Xeon Phi 7290 (K

NL)

INTEL Xeon CPUs

NVIDIA Tesla GPUs

AMD Radeon GPUs

INTEL Xeon Phis

[c/o Karl Rupp]

Page 3: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

2017 HPGMG performance spectra

Page 4: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Motivation

I Hardware trendsI Memory bandwidth a precious commodity (8+ flops/byte)I Vectorization necessary for floating point performanceI Conflicting demands of cache reuse and vectorizationI Can deliver bandwidth, but latency is hard

I Assembled sparse linear algebra is doomed!I Limited by memory bandwidth (1 flop/6 bytes)I No vectorization without blocking, return of ELLPACK

I Spatial-domain vectorization is intrusiveI Must be unassembled to avoid bandwidth bottleneckI Whether it is “hard” depends on discretizationI Geometry, boundary conditions, and adaptivity

Page 5: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Sparse linear algebra is dead (long live sparse . . . )

I Arithmetic intensity < 1/4

I Idea: multiple right hand sides

(2k flops)(bandwidth)

sizeof(Scalar)+sizeof(Int), k � avg. nz/row

I Problem: popular algorithms have nested data dependenciesI Time step

Nonlinear solveKrylov solve

Preconditioner/sparse matrix

I Cannot parallelize/vectorize these nested loopsI Can we create new algorithms to reorder/fuse loops?

I Reduce latency-sensitivity for communicationI Reduce memory bandwidth (reuse matrix while in cache)

Page 6: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Sparse linear algebra is dead (long live sparse . . . )

I Arithmetic intensity < 1/4

I Idea: multiple right hand sides

(2k flops)(bandwidth)

sizeof(Scalar)+sizeof(Int), k � avg. nz/row

I Problem: popular algorithms have nested data dependenciesI Time step

Nonlinear solveKrylov solve

Preconditioner/sparse matrix

I Cannot parallelize/vectorize these nested loopsI Can we create new algorithms to reorder/fuse loops?

I Reduce latency-sensitivity for communicationI Reduce memory bandwidth (reuse matrix while in cache)

Page 7: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Attempt: s-step methods in 3D

102 103 104 105

dofs/process

102

103

104

105

106work

p = 1

p = 1, s = 2

p = 2, s = 3

p = 2, s = 5

102 103 104 105

dofs/process

102

103

104

105

106memory

p = 1

p = 2

p = 2, s = 3

p = 2, s = 5

102 103 104 105

dofs/process

102

103

104

105

106communication

p = 1

p = 2

p = 2, s = 3

p = 2, s = 5

I Limited choice of preconditioners (none optimal, surface/volume)I Amortizing message latency is most important for strong-scalingI s-step methods have high overhead for small subdomains

Page 8: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Attempt: distribute in time (multilevel SDC/Parareal)

I PFASST algorithm (Emmett and Minion, 2012)I Zero-latency messages (cf. performance model of s-step)I Spectral Deferred Correction: iterative, converges to IRK (Gauss,

Radau, . . . )I Stiff problems use implicit basic integrator (synchronizing on

spatial communicator)

Page 9: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Problems with SDC and time-parallel

256 512 1K 2K 4K 8K 16K 32K 64K 112K 224K 448Knumber of cores

1248

163264

128256448896

1792sp

eedu

pidealPFASST (theory)PMG+SDC (fine)PMG+SDC (coarse)PMG+PFASST

256 512 1K 2K 4K 8K 16K 32K 64K 112K 224K 448Knumber of cores

0.00.20.40.60.81.0

rel.

effic

ienc

y PMGPMG+PFASST

c/o Matthew Emmett, parallel compared to sequential SDC

I Iteration count not uniform in s; efficiency starts low

I Low arithmetic intensity; tight error tolerance (cf. Crank-Nicolson)

I Parabolic space-time also works, but comparison flawed

Page 10: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Runge-Kutta methods

u = F(u)y1...

ys

︸ ︷︷ ︸

Y

= un + h

a11 · · · a1s...

. . ....

as1 · · · ass

︸ ︷︷ ︸

A

F

y1...

ys

un+1 = un + hbT F(Y )

I General framework for one-step methods

I Diagonally implicit: A lower triangular, stage order 1 (or 2 withexplicit first stage)

I Singly diagonally implicit: all Aii equal, reuse solver setup, stageorder 1

I If A is a general full matrix, all stages are coupled, “implicit RK”

Page 11: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Implicit Runge-Kutta

12 −

√15

105

3629 −

√15

15536 −

√15

3012

536 +

√15

2429

536 −

√15

2412 −

√15

10536 +

√15

3029 +

√15

15536

518

49

518

I Excellent accuracy and stability propertiesI Gauss methods with s stages

I order 2s, (s,s) Padé approximation to the exponentialI A-stable, symplectic

I Radau (IIA) methods with s stagesI order 2s−1, A-stable, L-stable

I Lobatto (IIIC) methods with s stagesI order 2s−2, A-stable, L-stable, self-adjoint

I Stage order s or s + 1

Page 12: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Method of Butcher (1976) and Bickart (1977)

I Newton linearize Runge-Kutta system at u∗

Y = un + hAF(Y )[Is⊗ In + hA⊗ J(u∗)

]δY = RHS

I Solve linear system with tensor product operator

G = S⊗ In + Is⊗ J

where S = (hA)−1 is s× s dense, J =−∂F(u)/∂u sparse

I SDC (2000) is Gauss-Seidel with low-order correctorI Butcher/Bickart method: diagonalize S = VΛV−1

I Λ⊗ In + Is⊗ JI s decoupled solvesI Complex eigenvalues (overhead for real problem)

Page 13: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Ill conditioning

A = VΛV−1

Page 14: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Skip the diagonalization

[s11 + J s12 + Js21 + J s22 + J

]︸ ︷︷ ︸

S⊗In+Is⊗J

S + j11I j12Ij21I S + j22I j23I

j32I S + j33I

︸ ︷︷ ︸

In⊗S+J⊗Is

I Accessing memory for J dominates cost

I Irregular vector access in application of J limits vectorization

I Permute Kronecker product to reuse J and make fine-grainedstructure regular

I Stages coupled via register transpose at spatial-point granularity

I Same convergence properties as Butcher/Bickart

Page 15: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

PETSc MatKAIJ: “sparse” Kronecker product matrices

G = In⊗S + J⊗T

I J is parallel and sparse, S and T are small and dense

I More general than multiple RHS (multivectors)

I Compare J⊗ Is to multiple right hand sides in row-major

I Runge-Kutta systems have T = Is (permuted from Butchermethod)

I Stream J through cache once, same efficiency as multiple RHS

I Unintrusive compared to spatial-domain vectorization or s-step

Page 16: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Convergence with point-block Jacobi preconditioning

I 3D centered-difference diffusion problem

Method order nsteps Krylov its. (Average)

Gauss 1 2 16 130 (8.1)Gauss 2 4 8 122 (15.2)Gauss 4 8 4 100 (25)Gauss 8 16 2 78 (39)

Page 17: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

We really want multigrid

I Prolongation: P⊗ IsI Coarse operator: In⊗S + (RJP)⊗ IsI Larger time steps

I GMRES(2)/point-block Jacobi smoothing

I FGMRES outer

Method order nsteps Krylov its. (Average)

Gauss 1 2 16 82 (5.1)Gauss 2 4 8 64 (8)Gauss 4 8 4 44 (11)Gauss 8 16 2 42 (21)

Page 18: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Toward a better AMG for IRK/tensor-product systems

I Start with R = R⊗ Is, P = P⊗ Is

Gcoarse = R(In⊗S + J⊗ Is)P

I Imaginary component slows convergence

I Can we use a Kronecker productinterpolation?

I Rotation on coarse grids (connections toshifted Laplacian)

Page 19: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Why implicit is silly for waves

I Implicit methods require an implicit solve in each stage.

I Time step size proportional to CFL for accuracy reasons.I Methods higher than first order are not unconditionally strong

stability preserving (SSP; Spijker 1983).I Empirically, ceff ≤ 2, Ketcheson, Macdonald, Gottlieb (2008) and

othersI Downwind methods offer to bypass, but so far not practical

I Time step size chosen for stabilityI Increase order if more accuracy neededI Large errors from spatial discretization, modest accuracy

I My goal: need less memory motion per stageI Better accuracy, symplecticity nice bonus onlyI Cannot sell method without efficiency

Page 20: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Implicit Runge-Kutta for advection

Table: Total number of iterations (communications or accesses of J) to solvelinear advection to t = 1 on a 1024-point grid using point-block Jacobipreconditioning of implicit Runge-Kutta matrix. The relative algebraic solvertolerance is 10−8.

Method order nsteps Krylov its. (Average)

Gauss 1 2 1024 3627 (3.5)Gauss 2 4 512 2560 (5)Gauss 4 8 256 1735 (6.8)Gauss 8 16 128 1442 (11.2)

I Naive centered-difference discretization

I Leapfrog requires 1024 iterations at CFL=1

I This is A-stable (can handle dissipation)

Page 21: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Diagonalization revisited

(I⊗ I−hA⊗L)Y = (1⊗ I)un (1)

un+1 = un + h(bT ⊗L)Y (2)

I eigendecomposition A = VΛV−1

(V ⊗ I)(I⊗ I−hΛ⊗L)(V−1⊗ I)Y = (1⊗ I)un.

I Find diagonal W such that W−11 = V−11I Commute diagonal matrices

(I⊗ I−hΛ⊗L)(WV−1⊗ I)Y︸ ︷︷ ︸Z

= (1⊗ I)un.

I Using bT = bT VW−1, we have the completion formula

un+1 = un + h(bT ⊗L)Z .

I Λ, b is new diagonal Butcher tableI Compute coefficients offline using extended precision to handle

ill-conditioning of VI Equivalent for linear problems, usually fails nonlinear stability

Page 22: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Exploiting realnessI Eigenvalues come in conjugate pairs

A = VΛV−1

I For each conjugate pair, create unitary transformation

T =1√2

[1 1i −i

]I Real 2×2 block diagonal D; real V (with appropriate phase)

A = (VT ∗)(T ΛT ∗)(TV−1) = V DV−1

I Yields new block-diagonal Butcher table D, b.I Halve number of stages using identity

(α + J)−1u = (α + J)−1u

Solve one complex problem per conjugate pair, then take twicethe real part.

Page 23: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Conditioning

2 4 6 8 10 12 14Number of stages

100

101

102

103

104

105

106

107

Conditioning of diagonalization: gauss(V)

||b||1||bre||1

I Diagonalization in extended precision helps somewhat, as doesreal formulation

I Neither makes arbitrarily large number of stages viable

Page 24: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

REXI: Rational approximation of exponential

u(t) = eLtu(0)

I Haut, Babb, Martinsson, Wingate; Schreiber and Loft

(α⊗ I + hI⊗L)Y = (1⊗ I)un

un+1 = (βT ⊗ I)Y .

I α is complex-valued diagonal, β is complex

I Constructs rational approximations of Gaussian basis functions,target (real part of) eit

I REXI is a Runge-Kutta method: can convert via “modifiedShu-Osher form”

I Developed for SSP (strong stability preserving) methodsI Ferracina, Spijker (2005), Higueras (2005)I Yields diagonal Butcher table A =−α−1,b =−α−2β

Page 25: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Abscissa for RK and REXI methods

0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.20.6

0.4

0.2

0.0

0.2

0.4

0.6Abscissa

GaussGauss diagGauss rbdiagREXI 33Cauchy 16

Page 26: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Stability regions

10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.020

15

10

5

0

5

10

15

20 0.500

1.000

1.500

Stability region, Gauss, #poles=4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.020

15

10

5

0

5

10

15

20Order star, Gauss, #poles=4

1.0

0.5

0.1

0.0

0.0

0.0

0.1

0.5

1.0

10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.020

15

10

5

0

5

10

15

20

0.500

1.0001.500

Stability region, Cauchy (16,6)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.020

15

10

5

0

5

10

15

20Order star, Cauchy (16,6)

1.0

0.5

0.1

0.0

0.0

0.0

0.1

0.5

1.0

Page 27: Practical and Efficient Time Integration and ... - Jed Brown · Why implicit is silly for waves I Implicit methods require an implicit solve in each stage. I Time step size proportional

Outlook on Kronecker product solvers

I⊗S + J⊗ I

I (Block) diagonal S is usually sufficientI Best opportunity for “time parallel” (for linear problems)

I Is it possible to beat explicit wave propagation with high efficiency?

I Same structure for stochastic Galerkin and other UQ methodsI IRK unintrusively offers bandwidth reuse and vectorizationI Need polynomial smoothers for IRK spectraI Change number of stages on spatially-coarse grids (p-MG, or

even increase)?I Experiment with SOR-type smoothers

I Prefer point-block Jacobi in smoothers for spatial parallelism

I Possible IRK correction for IMEX (non-smooth explicit function)I PETSc implementation (works in parallel, hardening in progress)I Thanks to DOE ASCR