
Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs
Hartwig Anzt
Joint work with Goran Flegar, Enrique Quintana-Orti
Workshop on Batched, Reproducible, and Reduced Precision BLAS, Atlanta, GA, 02/25/2017
http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf



Pitch from the sparse community:

• Small size (<30), "somewhat" flexible
• Add inversion routine to Batched-BLAS
• Go for Gauss-Jordan elimination instead of LU


Iterative linear system solvers

• Goal: solve linear system Ax = b, A ∈ ℝ^{n×n}
• Use iterative Krylov method for generating the solution approximation
• Convergence typically benefits from using a preconditioner
• We focus on Jacobi preconditioners based on diagonal scaling: P = (diag(A))^{-1}
• Provide a high level of parallelism
• Attractive for many-core accelerators like GPUs
• Scalar Jacobi acts on the main diagonal
• Block-Jacobi acts on the block-diagonal
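As a concrete CPU-side illustration of the diagonal-scaling idea, a scalar Jacobi preconditioner can be built and applied as follows (pure-Python sketch; the dense list-of-lists matrix format and function names are illustrative only, not the GPU implementation):

```python
def jacobi_preconditioner(A):
    """Scalar Jacobi: P = (diag(A))^{-1}, stored as the factors 1/a_ii."""
    n = len(A)
    return [1.0 / A[i][i] for i in range(n)]

def apply_preconditioner(p, r):
    """z = P r: element-wise scaling, trivially parallel."""
    return [p_i * r_i for p_i, r_i in zip(p, r)]

A = [[4.0, 1.0], [1.0, 3.0]]
p = jacobi_preconditioner(A)            # [0.25, 1/3]
z = apply_preconditioner(p, [8.0, 9.0])
```

Inside a Krylov method, `apply_preconditioner` would be called once per iteration on the current residual, which is why the setup cost (building `p`) is paid only once.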


Block-Jacobi preconditioner

• Can reflect the block structure of sparse matrices (e.g. higher-order FEM)
• Often superior to the scalar Jacobi preconditioner
• Lightweight preconditioner (compared to ILU)
• Parallel preconditioner application (batched trsv / SpMV + vector scaling)
• Inversion of diagonal blocks in preconditioner setup + SpMV in preconditioner application
• Factorization of diagonal blocks in preconditioner setup + trsv in preconditioner application



• Preconditioner setup (inversion of diagonal blocks) can become a challenge

• Batched inversion of many small systems

• Diagonal blocks are typically of small size


Block-Jacobi preconditioner generation

1. Extract diagonal block from the sparse data structure.
2. Invert the diagonal block via Gauss-Jordan elimination.
3. Insert the solution as a diagonal block into the preconditioner matrix.
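The three steps can be sketched in pure Python as a dense stand-in for the batched GPU kernels (block boundaries passed as `starts`; all names illustrative; the Gauss-Jordan routine here assumes nonzero pivots and does no pivoting):

```python
def invert(D):
    """Gauss-Jordan inversion of a small dense block.
    Assumes nonzero pivots (no pivoting in this sketch)."""
    m = len(D)
    # augment with the identity: [D | I]
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
         for i, row in enumerate(D)]
    for k in range(m):
        p = M[k][k]
        M[k] = [v / p for v in M[k]]          # normalize pivot row
        for i in range(m):
            if i != k and M[i][k] != 0.0:
                f = M[i][k]
                M[i] = [v - f * w for v, w in zip(M[i], M[k])]
    return [row[m:] for row in M]

def block_jacobi(A, starts):
    """Extract -> invert -> insert for every diagonal block."""
    n = len(A)
    P = [[0.0] * n for _ in range(n)]
    bounds = starts + [n]
    for s, e in zip(bounds, bounds[1:]):
        D = [[A[i][j] for j in range(s, e)] for i in range(s, e)]  # 1. extract
        Dinv = invert(D)                                            # 2. invert
        for i in range(s, e):                                       # 3. insert
            for j in range(s, e):
                P[i][j] = Dinv[i - s][j - s]
    return P
```

On the GPU each block is handled by its own warp, so the loop over blocks here corresponds to the batch dimension.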


Batched Gauss-Jordan elimination (BGJE) on GPUs

• Flexible size up to 32 x 32
• One warp per system
• Inversion in registers
• Communication via __shfl()
• Implicit pivoting:
  • accumulate row/col swaps
  • realize them when writing results

Advantages over LU-based approach:
• No operations on triangular systems
• Implicit pivoting
• Implementation can use static system size
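A sequential sketch of Gauss-Jordan inversion with implicit pivoting: pivot rows are only marked, never physically swapped, and the accumulated permutation is realized when the result is written out. This mirrors the idea above but is a plain CPU version, not the warp-level, register-based kernel:

```python
def gje_inverse_implicit_pivot(A):
    """Invert A via Gauss-Jordan elimination with implicit row pivoting."""
    n = len(A)
    # augment with the identity: [A | I]
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    piv, used = [0] * n, [False] * n
    for k in range(n):
        # choose pivot row for column k among not-yet-used rows (no swap!)
        r = max((i for i in range(n) if not used[i]),
                key=lambda i: abs(M[i][k]))
        used[r], piv[k] = True, r
        p = M[r][k]
        M[r] = [v / p for v in M[r]]
        for i in range(n):
            if i != r and M[i][k] != 0.0:
                f = M[i][k]
                M[i] = [v - f * w for v, w in zip(M[i], M[r])]
    # realize the accumulated swaps only when writing the result
    return [M[piv[k]][n:] for k in range(n)]
```

Because no rows move during elimination, a GPU thread can keep "its" row in registers for the whole inversion; the permutation costs nothing extra at the final write.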


NVIDIA Pascal architecture (P100)

Batched Gauss-Jordan elimination (BGJE) achieves >10% of DP peak!
• NVIDIA whitepaper: 56 SMs, 5.3 TFlop/s DP, 768 GB/s (6 : 1)
• 2n³ operations for inversion
• Flexible-size inversion up to 32 x 32, one warp per system

[Figures: BGJE performance (GFlop/s) vs. batch size (up to 4×10^4) and vs. matrix size (5-30), reaching roughly 600 GFlop/s.]

Anzt et al. “Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs”. PMAM'17, 8th International Workshop on Programming Models and Applications for Multicores and Manycores, pp 1-10, 2017.
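The ">10% of DP peak" claim can be sanity-checked from the slide's own numbers (5.3 TFlop/s DP peak from the whitepaper; roughly 600 GFlop/s read off the BGJE performance plots at the larger sizes):

```python
peak_gflops = 5300.0   # P100 double-precision peak (NVIDIA whitepaper)
bgje_gflops = 600.0    # approximate BGJE rate at the larger block sizes
fraction = bgje_gflops / peak_gflops
print(f"BGJE reaches {fraction:.1%} of DP peak")  # about 11.3%
```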


Challenge: diagonal block extraction from sparse data structures


• Matrix in CSR format
• Diagonal block extraction is costly
• Assign one thread per row
• Unbalanced nonzero distribution results in load imbalance
• "Dense" rows become the bottleneck
• Non-coalescent memory access
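A hedged sketch of the extraction step: pull one diagonal block (rows/columns s..e-1) out of the CSR arrays (`row_ptr`, `col_idx`, `vals`). The one-loop-per-row structure mirrors the "one thread per row" assignment; rows with many nonzeros do proportionally more work, which is exactly the load-imbalance problem above:

```python
def extract_diag_block(row_ptr, col_idx, vals, s, e):
    """Extract the dense (e-s) x (e-s) diagonal block from a CSR matrix."""
    m = e - s
    D = [[0.0] * m for _ in range(m)]
    for i in range(s, e):                       # one "thread" per row
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if s <= j < e:                      # keep entries inside the block
                D[i - s][j - s] = vals[k]
    return D

# 3x3 CSR matrix [[4,1,0],[1,3,0],[0,0,2]]
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 0, 1, 2]
vals    = [4.0, 1.0, 1.0, 3.0, 2.0]
D = extract_diag_block(row_ptr, col_idx, vals, 0, 2)  # [[4.0,1.0],[1.0,3.0]]
```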


[Figure: shared-memory extraction vs. cached extraction of a diagonal block from CSR (row-ptr, col-indices), showing the memory requests issued in each of steps 1-5 and the shared_mem staging buffer.]


• Add intermediate step:
  • Extract diagonal blocks to shared memory
  • Multiple threads for each row
  • Coalescent access to data in CSR
  • Column-major storage




• Merged kernel vs. 3 separate kernels:
  [ extraction + inversion + insertion ] vs. [ extraction ] + [ inversion ] + [ insertion ]


Diagonal block extraction from sparse data structures (K40)

Results show:
• Runtime of block-Jacobi generation relative to BGJE time
• 56 test matrices (SuiteSparse)
• Data arranged w.r.t. performance of the (on average) best strategy

US: unmerged + shared extraction
UC: unmerged + cached extraction
MS: merged + shared extraction
MC: merged + cached extraction

[Figures: BJ generation time / BGJE time for block sizes 32 and 8 across the 56 test matrices, and as a function of matrix size (up to 2×10^4), comparing the US/UC/MS/MC strategies.]


Diagonal block extraction from sparse data structures

Tridiagonal-type matrix structure:
• balanced nnz/row

Arrow-type matrix structure:
• few rows with many elements
• the corresponding threads become the bottleneck for cached extraction


Block-Jacobi preconditioner setup cost

K40

[Figure: block-Jacobi preconditioner generation time [s] for block sizes 4, 8, 16, 24, and 32 across the SuiteSparse test matrices (Chebyshev2, dw2048, dw1024, ..., ML_Geer, Flan_1565); generation times range from about 10^-5 s to 10^-1 s.]

• Preconditioner setup needs at most about 0.01 s on a P100 GPU.
• Negligible relative to the overall iterative solver execution time.


Factorization-based block-Jacobi

Preconditioner setup:
1. Extract diagonal block from the sparse data structure.
2. Generate the factorization and pivoting.
3. Insert the factorization as a block into the preconditioner matrix and store the pivoting (*ipiv).

Preconditioner application:
Apply trsv using the pivoting information to turn the right-hand side sub-vector into the solution of the diagonal-block linear problem.
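The application step can be sketched in pure Python: given an in-place LU factorization of a diagonal block (unit lower triangle L and upper triangle U stored in one matrix, plus a LAPACK-style `ipiv` of sequential row swaps, 0-based here), two triangular solves turn the right-hand side sub-vector into the block solution. Names and storage layout are illustrative:

```python
def apply_lu_trsv(LU, ipiv, b):
    """Solve A x = b given the in-place LU factors of A and row pivots."""
    n = len(b)
    x = b[:]
    for k in range(n):                  # apply the recorded row swaps: P b
        x[k], x[ipiv[k]] = x[ipiv[k]], x[k]
    for i in range(n):                  # forward solve L y = P b (unit diag)
        x[i] -= sum(LU[i][j] * x[j] for j in range(i))
    for i in reversed(range(n)):        # backward solve U x = y
        x[i] = (x[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x
```

Unlike the inversion-based variant (a single SpMV-like product per application), this pays two triangular solves per iteration but a cheaper setup.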


• Batched routines: Small-Size LU, Gauss-Huard, Gauss-Huard-T.

• Gauss-Huard-T writes the factors in “access-friendly” fashion.

• Flexible size up to limit of 32 x 32.

• Use one warp per system, factorization in registers.
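A back-of-the-envelope flop comparison (standard dense operation counts, not numbers from the slides) shows why the two approaches trade setup cost against application cost:

```python
# Standard dense operation counts (assumed, not taken from the slides):
def setup_flops_inversion(n):
    return 2 * n ** 3          # Gauss-Jordan inversion of an n x n block

def setup_flops_factorization(n):
    return 2 * n ** 3 // 3     # LU / Gauss-Huard factorization

def apply_flops(n):
    return 2 * n ** 2          # dense MV (inverse) or two trsv: same order

n = 24                          # a typical block size from the slides
print(setup_flops_inversion(n), setup_flops_factorization(n), apply_flops(n))
```

Setup is roughly 3x cheaper for the factorization-based variants, while both applications are O(n²); the practical difference in the application step comes from memory-access patterns (SpMV-like product vs. trsv), which is what the Gauss-Huard-T storage targets.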

Factorization-based block-Jacobi

[Figures: performance (GFlop/s) of batched dgetrf-style (factorization) and dgetri-style (inversion) routines vs. matrix size (5-30) on an NVIDIA P100 GPU, comparing Small-Size LU, Gauss-Huard, Gauss-Huard-T, and cuBLAS LU.]


NVIDIA Pascal architecture (P100)

• BGJE and BLU invert the diagonal blocks, whereas BGH / BGHT only factorize them.
• BGHT writes the factors in non-coalescent fashion to enable faster preconditioner application.

[Figure: time per matrix [ms] vs. batch size (up to 5×10^4) on an NVIDIA P100 GPU for Gauss-Jordan elimination (BGJE), MAGMA dgetrf-inv (BLU), Gauss-Huard (BGH), and Gauss-Huard + transpose (BGHT).]


Inversion-based block-Jacobi vs. factorization-based block-Jacobi

[Figure: histogram of test cases over the iteration-count difference between BGH and BGJE (left: BGH needs more iterations, explicit inversion better; right: BGJE needs more iterations, factorization better), for block sizes 4, 8, 16, 24, and 32.]

Convergence of BiCGSTAB with block-Jacobi preconditioner for 300 matrices (NVIDIA P100 GPU).


Block-Jacobi preconditioning: Performance goal

[Figures: total runtime [s] across the 300 test matrices for BGJE, BGH, and BGHT; and setup vs. solve time breakdown for 10 selected matrices with BGJE and with BGH.]

• Almost no difference in total execution time.
• BGH is better if the setup time is significant relative to the iterative solver time.
• BGJE is better for long solver runs.

Execution time of BiCGSTAB with Jacobi(24) preconditioner

NVIDIA P100 GPU


This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar, and Enrique Quintana-Orti (Universitat Jaume I), and is partly funded by the Department of Energy.

http://icl.cs.utk.edu/magma/

http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf
