algebraic multigrid methods on gpu-accelerated hybrid ...€¦ · algebraic multigrid methods on...

46
Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit¨ at M¨ unchen Chair of Optimal Control Center for Mathematical Sciences, M17 [email protected] October 13, 2015

Upload: others

Post on 21-Nov-2019

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Algebraic Multigrid Methods on GPU-AcceleratedHybrid Architectures

Manfred LiebmannTechnische Universitat Munchen

Chair of Optimal Control

Center for Mathematical Sciences, M17

[email protected]

October 13, 2015

Page 2: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

SFB MOBIS: Principal Investigators

• Medical University of Graz, Austria

– Gernot Plank– Christoph Augustin– Caroline Costa

• University of Graz, Austria

– Gundolf Haase– Manfred Liebmann– Aurel Neic

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 1

Page 3: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

(1) CARP Virtual Heart Project

The virtual heart model is based on the bidomain equations, a set of coupled partialdifferential equations, which describe the current flow in the myocardium. The bidomainequations in the elliptic-parabolic form the equations are given by[

−∇ · (σi + σe)∇φe−∇ · σb∇φe

]=

[∇ · σi∇Vm + Iei

Ieb

](1)

βCm∂Vm∂t

= (∇ · σi∇φi)− β (Iion(Vm,η)− Ii) (2)

Vm = φi − φe (3)

where φi and φe are the intracellular and extracellular potentials, Vm = φi − φe is thetransmembrane voltage, σi and σe are the intracellular and extracellular conductivity tensorsand Iion is the membrane ionic current density which depends on Vm and a set of statevariables η.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 2

Page 4: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

The PDEs given in (1) to (2) are spatially discretized using linear tetrahedral FEs.Parabolic and elliptic PDE are decoupled and operator splitting based on a Strang scheme isemployed to solve the parabolic PDE. The inner kernel of the compute scheme is then givenby

Ki+eφet+1 = −PKivm

t + MeIet (4)

∂η

∂t= f(vm

t,ηt+1/2) (5)

Iiont+1/2 = g(vm

t,ηt+1/2)

vt+1/2m = vt

m −∆t

CmIt+1/2ion

Mi∂v

t+1/2m

∂t= − 1

βCmKi

(vm

t+1/2 + P−Tφt+1e

)(6)

where Mζ and Kζ are FEM mass and stiffness matrices. P is a projection operator whichtransfers data between intracellular and extracellular grid, and η is a discrete representationof the set of state variables of the ionic model.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 3

Page 5: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

For time discretization of the diffusion part of the parabolic PDE, the fully implicitCrank-Nicholson scheme is employed, with κ defined as βCm

∆t :

(κMi +

Ki

2

)vt+1m = −Ki

(vt+1/2m

2+ P−Tφt+1

e

)+ κMiv

t+1/2m (7)

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 4

Page 6: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

(2) CARP Virtual Heart Solver Benchmark on CPU and GPU Clusters

Figure 1: A) Image-based high-resolution FE mesh of rabbit ventricles. Clipping planeexposes view on trabeculation and papillary muscles in the LV cavity. B) Shown is theactivation sequence used for benchmarking. Wavefront propagation is initiated by deliveringa transmembrane stimulus at the LV apex. C) Time traces are shown for Vm and Φe at threedifferent sites, close to the stimulus site (blue), halfway between apex and base (green) andat the latest activating site (red) at the base of the ventricles. D) Convergence history as afunction of time.

Figure 2: Realtime lag ξr as a function of Np for the solution of elliptic PDE, parabolicPDE, ODE and overall compute time. Red traces were measured with the UK nationalsupercomputing facility HECToR. Number of cores required to match fastest GPU setup, Np,are indicated in each panel.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 5

Page 7: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 6

Page 8: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 7

Page 9: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

(3) Parallel Toolbox Software

• Parallel Toolbox Solver Library

– The toolbox is implemented as C++ template library– The central piece of the library is the communicator class– The communicator class handles all data exchange for parallel linear algebra kernels– This design enables the reuse of sequential linear algebra kernels– The library includes optimized parallel CPU/GPU solver components: PCG, AMG– A flexible and modular design for building complex parallel solvers

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 8

Page 10: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Communicator Class

The communicator is derived from a domain decomposition based parallelization approach.

1 2 34

8

4

3

9

10

15

16

20

17

21

22

1

2

5

11

76

12

18

2325

14

13

19

24

26

Figure 1: Simple finite element mesh distributed to four processors with global node numbersand color-coded processor domains.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 9

Page 11: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel communication is handled by a communicator object using MPI all-to-allcommunication patterns. Basic parallel linear algebra routines can be build with thesequential routines and the communicator object.

• Parallel linear algebra basics

– Accumulated vector: r, s (fraktur font)– Distributed vector: r, s (sans-serif font)– Accumulated matrix: A,B– Distributed matrix: A,B

• Matrix-vector multiplication and scalar product

– Multiplication: r← As, s← Br– Scalar product: σ ← S(r, s) ≡ S(r, s)

• Accumulation and distribution

– Accumulation: r⇐ r Communication!– Distribution: r⇐ r

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 10

Page 12: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Essential Communication Routines

Accumulation r⇐ r is the most important communication routine in the Parallel Toolbox.This is the only place where MPI all-to-all communication takes place within linear algebracalculations. The accumulation routine provides a single point to optimize communicationperformance.

Furthermore, distribution of a vector r ⇐ r does not require any communication and is alocal operation.

Calculating the global value of a scalar product σ ← S(r, s) requires a simple MPI all-gather operation and accumulation of a single value. Scalar products are expensive becausethey enforce a synchronization point in the parallel code path.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 11

Page 13: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

(4) Sequential Algebraic Multigrid Algorithm

• Main ingredients of the algebraic multigrid setup

– Coarse and fine node selection process: I = C ∪ F– Construction of prolongation P and restriction R operators– Triple matrix product: Ac = RAP

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 12

Page 14: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Coarsening Algorithm

Simplified Ruge-Stuben based coarsening algorithm using the strong connection concept.C ← ∅, F ← ∅, T ← I

while T 6= ∅ doFind next node i ∈ TC ← C ∪ {i}F ← F ∪ {j ∈ I|j /∈ C ∪ F ∧ i 6= j ∧ |Aij| > ε|Aii|}T ← T\(C ∪ F )

end while

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 13

Page 15: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Prolongation Operator

P =

(1CCPFC

)(8)

Define the number of strongly coupled coarse grid nodes with respect to the fine grid nodei ∈ F :

ni := #{j ∈ C||Aij| > ε|Aii|} (9)

The matrix PFC is then defined as:

(PFC)ij :=

{1/ni, |Aij| > ε|Aii|0, else

(10)

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 14

Page 16: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Matrix Graph Representations

A =

2 −1 0 0 0 0

−1 2 −1 0 0 0

0 −1 2 −1 0 0

0 0 −1 2 −1 0

0 0 0 −1 2 −1

0 0 0 0 −1 2

(11)

1 2 3 4 5 6

Figure 2: The undirected matrix graph of the 1D Laplace operator.

1 2 3 4 5 6

Figure 3: The directed graph of the prolongation operator.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 15

Page 17: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Sequential Classic Multigrid Algorithm

Require: fl, ul, rl, sl, Al, Rl, Pl, Sl, Tl, 0 ≤ l < L

f0 ← f

if l < L thenul ← 0 {Initial guess}ul ← Sl(ul, fl) {Presmoothing}rl ← fl − Alul {Calculation of residual}fl+1 ← Rlrl {Restriction of residual}multigrid(l+1) {Multigrid recursion}sl ← Plul+1 {Prolongation of correction}ul ← ul + sl {Addition of correction}ul ← Tl(ul, fl) {Postsmoothing}

elseuL ← A−1

L fL {Coarse grid solver}end ifu← u0

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 16

Page 18: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

ω-Jacobi and Forward/Backward Gauss-Seidel Smoother

ω-Jacobi SmootherRequire: Al, D ← diag(Al), ω ∈ (0, 1]

return ul ← ul + ωD−1(fl − Alul)

Forward/Backward Gauss-Seidel SmootherRequire: Al, D ← diag(Al)

return ulforward←− ul +D−1(fl − Alul)

Require: Al, D ← diag(Al)

return ulbackward←− ul +D−1(fl − Alul)

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 17

Page 19: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Preconditioned Conjugate Gradient Algorithm

Require: m ∈ N, α, β, σ, σf , σl, τ ∈ R, r, s, v, wr ← f − Au {Calculation of residual}s← Pr {Apply preconditioner}σ ← S(s, r) {Scalar product}σf ← σl ← σ, m← 0

while m < M ∧ σ/σf > ε2 dom← m+ 1, v ← As

τ ← S(s, v) {Scalar product}α← σ/τ

u← u+ αs {Update solution}r ← r − αv {Update residual}w ← Pr {Apply preconditioner}σ ← S(w, r) {Scalar product}β ← σ/σl, σl ← σ

s← βs+ w {Update search direction}end while

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 18

Page 20: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

(5) Parallel Algebraic Multigrid Algorithm

• Challenges of the parallel algebraic multigrid setup

– Parallel coarse and fine node selection must be globally consistent– Skeleton based coarsening equivalent to sequential algorithm with reordering– Local prolongation and restriction operators without communication requirements

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 19

Page 21: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel Coarsening Algorithm

Construct the boundary node set B and the submatrix ABB

(a)

CpB ← ∅, F

pB ← ∅, T

pB ← B

while T pB 6= ∅ doFind next node i ∈ T pBCpB ← Cp

B ∪ {i}F pB ← F p

B ∪ {j ∈ B|j /∈ CpB ∪ F

pB ∧ i 6= j ∧ |ABB;ij| > ε|ABB;ii|}

T pB ← T pB\(CpB ∪ F

pB)

end while(b)

Cp ← Ip ∩ CpB, F

p ← Ip ∩ F pB, T

p ← Ip ∩ CpB

while T p 6= ∅ doFind next node i ∈ T pF p ← F p ∪ {j ∈ Ip|j /∈ Cp ∪ F p ∧ i 6= j ∧ |Ap

ij| > ε|Apii|}

T p ← T p\{i}end while(c)

T p ← Ip\(Cp ∪ F p)

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 20

Page 22: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

while T p 6= ∅ doFind next node i ∈ T pCp ← Cp ∪ {i}F p ← F p ∪ {j ∈ Ip|j /∈ Cp ∪ F p ∧ i 6= j ∧ |Ap

ij| > ε|Apii|}

T p ← T p\(Cp ∪ F p)

end while

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 21

Page 23: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel Coarsening: Step (a)

1 2 34

8

4

3

9

10

15

16

20

17

21

22

1

2

5

11

76

12

18

2325

14

13

19

24

26

Figure 4: Parallel Coarsening Step (a): The boundary nodes are assigned either red coarsegrid nodes or white fine grid nodes. The unassigned nodes in the inner of the processordomains are depicted in yellow color.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 22

Page 24: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel Coarsening: Step (b)

1 2 34

8

4

3

9

10

15

16

20

17

21

22

1

2

5

11

76

12

18

2325

14

13

19

24

26

Figure 5: Parallel Coarsening Step (b): Fine grid nodes in the inner of the processor domainsdepending on the coarse boundary nodes are assigned.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 23

Page 25: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel Coarsening: Step (c)

1 2 34

8

4

3

9

10

15

16

20

17

21

22

1

2

5

11

76

12

18

2325

14

13

19

24

26

Figure 6: Parallel Coarsening Step (c): The coarsening is completed on the remaining nodesin the inner of the processor domains.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 24

Page 26: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel Prolongation Operator

The local prolongation operator on processor p requires foreign coarse grid nodes C∗p tointerpolate the boundary fine grid nodes F ∗p.F ∗p ← F p ∩ Bp

Construct the submatrix AF∗pCC∗p ← {j ∈ C|∃i ∈ F ∗p : |AF∗pC;ij| > ε|AF∗pC;ii|}\Cp

Jp ← Cp ∪ C∗pConstruct P p := PIpJp using AF∗pC and Ap

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 25

Page 27: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Interpolation on the Finite Element Mesh

1 2 34

8

4

3

9

10

15

16

20

17

21

22

1

2

5

11

76

12

18

2325

14

13

19

24

26

Figure 7: White fine grid nodes are interpolated by red coarse grid nodes. The black arrowsshow the boundary crossing interpolation for the three fine grid nodes 5, 18, and 19.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 26

Page 28: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Global Prolongation Operator

8

4

3

9

10

15

1620

17

21

22

1

2

5

11

7

6

12

18

2325

1413

19

24

26

Figure 8: The global prolongation operator on the whole simulation domain represented as adirected graph.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 27

Page 29: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Interpolation on the Color-Coded Finite Element Mesh

1 2 34

8

4

3

9

10

15

16

20

17

21

22

1

2

5

11

76

12

18

2325

14

13

19

24

26

Figure 9: The processor domains embedded in the finite element mesh are shown in color.The cross-boundary interpolation is depicted for the fine grid nodes 5, 18, and 19.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 28

Page 30: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Color-Coded Global Prolongation Operator

8

4

3

9

10

15

1620

17

21

22

1

2

5

11

7

6

12

18

23

25

1413

19

24

26

Figure 10: The global prolongation operator on the whole simulation domain represented asa color-coded directed graph.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 29

Page 31: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Local Prolongation Operator on Processor 1

4

8

39

10

15

1

2

5

12

Figure 11: The local prolongation operator on processor one represented as colorcodeddirected graph. Note the foreign coarse grid node 12 and the crossboundary interpolation ofnode 5.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 30

Page 32: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Local Prolongation Operator on Processor 2

15

10

16

20

17

5

1

11

12

18

19

24

Figure 12: The local prolongation operator on processor two represented as colorcodeddirected graph. Note the foreign coarse grid nodes 1 and 24 and the cross-boundaryinterpolation of the nodes 5, 18, and 19.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 31

Page 33: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Local Prolongation Operator on Processor 3

20

1721

22

18

12

23

25

19

24

26

Figure 13: The local prolongation operator on processor three represented as colorcodeddirected graph. Note the foreign coarse grid node 12 and the crossboundary interpolation ofthe nodes 18 and 19.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 32

Page 34: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Local Prolongation Operator on Processor 4

2

1

5

10

7

6

12

1413

1924

Figure 14: The local prolongation operator on processor four represented as colorcodeddirected graph. Note the foreign coarse grid nodes 10 and 24 and the cross-boundaryinterpolation of the nodes 5 and 19.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 33

Page 35: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel Classic Multigrid Algorithm

Require: fl, ul, rl, sl, Rl, Pl, Sl, Tl, 0 ≤ l < L

f0 ← f

if l < L thenul ← 0 {Initial guess}ul ← Sl(ul, fl) {Presmoothing}rl ← fl − Alul {Calculation of residual}fl+1 ← Rlrl {Restriction of residual}multigrid(l+1) {Multigrid recursion}sl ← Plul+1 {Prolongation of correction}ul ← ul + sl {Addition of correction}ul ← Tl(ul, fl) {Postsmoothing}

elseuL ⇐ (AL)−1fL {Coarse grid solver}

end ifu← u0

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 34

Page 36: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel ω-Jacobi and Forward/Backward Gauss-Seidel Smoother

Parallel ω-Jacobi SmootherRequire: Ap

l ,D⇐ diag(Apl ), ω ∈ (0, 1]

vpl ⇐ (fpl − Apl upl )

return upl ← upl + ωD−1vpl

Parallel Forward/Backward Gauss-Seidel SmootherRequire: Ap

l ,D⇐ diag(Apl ), ω ∈ (0, 1]

vpl ⇐ (fpl − Apl upl ) {On boundary nodes}

upl ← upl + ωD−1vpl {On boundary nodes}upl

forward←− upl + D−1(fpl − Apl upl ) {On inner nodes}

return upl

Require: Apl ,D⇐ diag(Ap

l ), ω ∈ (0, 1]

uplbackward←− upl + D−1(fpl − A

pl upl ) {On inner nodes}

vpl ⇐ (fpl − Apl upl ) {On boundary nodes}

upl ← upl + ωD−1vpl {On boundary nodes}return upl

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 35

Page 37: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Parallel Conjugate Gradient Algorithm

Require: m ∈ N, α, β, σ, σf , σl, τ ∈ R, v, r, s,w

r← f − Au {Calculation of residual}s← Pr {Apply preconditioner}σ ← S(s, r) {Scalar product}σf ← σl ← σ, m← 0

while m < M ∧ σ/σf > ε2 dom← m+ 1, v← As

τ ← S(s, v) {Scalar product}α← σ/τ

u← u + αs {Update solution}r← r − αv {Update residual}w← Pr {Apply preconditioner}σ ← S(w, r) {Scalar product}β ← σ/σl, σl ← σ

s← βs + w {Update search direction}end while

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 36

Page 38: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

(6) Parallelization on GPU-Accelerated Hybrid Architectures

• Parallelization Concepts on CPU and GPU clusters for PCG-AMG

– Coarse grained MPI parallelization for CPU and GPU nodes– Additional fine grained parallelization on GPUs– Many-core implementation of sparse matrix-vector multiplication– Interleaved CRS data format for coalesced memory access on GPU– Nvidia CUDA Technology for C/C++ GPU programming

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 37

Page 39: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Sparse Matrix-Vector Multiplication Kernel

Let A ∈ RN×N be a matrix in compressed row storage format and u, b ∈ RN .

• CRS Sparse Matrix-Vector Multiplication Kernel

– Schedule a thread for every sparse scalar product!– Thread i calculates ui =

∑Nj=1Aijbj

Looks nice! Performs very badly!

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 38

Page 40: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Problems and Solutions

• Non-coalesced memory access!

– Rearrange CRS data structure for coalesced access– Interleave the sparse matrix rows for at least 16 consecutive rows– Holes in the data structure: Not critical! Typical 5-10% increase in storage

• Random access to b vector!

– Use the texture unit of the GPU for random access to b vector– Texturing is optimized for spacial locality: Small read-only cache

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 39

Page 41: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

GPU-Accelerated Sparse Matrix-Vector Multiplication

An efficient sparse matrix-vector multiplication is key to the PCG-AMG solver performance.

0

-1

0

-1

0

0

-1

4

0

0

0

0

0

0

4

0

-1

0

-1

0

0

3

0

-2

0

0

0

0

4

-1

0

0

0

0

0

2

0

0

0

0

-1

3

4

0

-1

-1

0

0

0

-1

0

-1

0

0

0

-1

4

0

0

0

0

0

-1

0

Figure 15: A sample matrix with the rows colored in different hues.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 40

Page 42: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Compressed Row Storage Data Format (CRS)

A flexible data format for sparse matrices.

4 -2 -1 -1 4 -1 3 -1 -1 4 -1 -1 2 -1 -1 4 -1 3 -1 -1 -1 4

1 3 7 1 2 8 3 4 6 4 6 1 5 7 3 6 1 6 7 3 6 8

0 3 6 9 11 14 16 19

3 3 3 2 3 2 3 3

Figure 16: CRS data structure for the sample matrix with the count and displacement vectoron top and the column indices and matrix entries below.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 41

Page 43: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

Interleaved Compressed Row Storage Data Format (ICRS)

Coalesced memory access patterns on GPUs are required to achieve high performance.

4 -2 -1-1 4 -13 -1 -14 -1-1 2 -1-1 4-1 3 -1-1 -1 4

1 3 71 2 83 4 64 61 5 73 61 6 73 6 8

0 1 2 3 4 5 6 7

3 3 3 2 3 2 3 3

Figure 17: ICRS data structure for the sample matrix with the count and displacement vectoron top and the interleaved column indices and matrix entries below. The eight interleavedmatrix rows create holes in the data structure represented by the white boxes.

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 42

Page 44: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

CUDA Code Sample for Sparse Matrix-Vector Multiplication

#define L 256

struct linear_operator_params {

const int *cnt; //ICRS count vector

const int *dsp; //ICRS displacement vector

const int *col; //ICRS column indices

const double *ele; //ICRS matrix entries

const double *u; //Input vector

double *v; //Output vector

int n; //Matrix dimension

};

texture<int2> tex_u;

void _device_linear_operator(int *cnt, int *dsp, int *col, double *ele,

int m, int n, double *u, double *v)

{

cudaBindTexture(0, tex_u, (int2*)u, sizeof(double) * m); //Bind the texture

struct linear_operator_params parms;

parms.cnt = cnt; parms.dsp = dsp; parms.col = col; parms.ele = ele;

parms.u = u; parms.v = v; parms.n = n;

__device_linear_operator<<< (n + N - 1)/N, N >>>(parms); //GPU kernel launch

cudaUnbindTexture(tex_u); //Unbind the texture

}

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 43

Page 45: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

__global__ void __device_linear_operator(struct linear_operator_params parms)

{

unsigned int j = N * blockIdx.x + threadIdx.x;

if(j < parms.n)

{

unsigned int blkStart = parms.dsp[j];

unsigned int blkStop = blkStart + L * parms.cnt[j];

double s = 0.0;

for(unsigned int i = blkStart; i < blkStop; i += L)

{

unsigned int q = parms.col[i]; //Load column index

double a = parms.ele[i]; //Load matrix entry

int2 c = tex1Dfetch(tex_u, q); //Load vector entry using texture mapping

double b = __hiloint2double(c.y, c.x); //Convert texture entries to double number

s += a * b; //Calculate the sparse scalar product

}

parms.v[j] = s; //Store the sparse scalar product

}

}

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 44

Page 46: Algebraic Multigrid Methods on GPU-Accelerated Hybrid ...€¦ · Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures Manfred Liebmann Technische Universit at Munchen

Manfred Liebmann October 13, 2015

rail4284rdist1

orani678rim

rma10para-7

e40r0100olafu

bcsstk35venkat01

nasasrbex11

raefsky3interp512

interp1024

0

2

4

6

8

10

12

14

16

IBM SpMV IMT SpMV

Sparse matrices

SpM

V p

erfo

rman

ce (

GF

LOP

S)

Figure 18: Performance comparison of IBM SpMV and IMT SpMV based on ICRS dataformat on GeForce 8800 GTX

Algebraic Multigrid Methods on GPU-Accelerated Hybrid Architectures 45