TRANSCRIPT
Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs
Hartwig Anzt
Joint work with Goran Flegar, Enrique Quintana-Orti

Workshop on Batched, Reproducible, and Reduced Precision BLAS, Atlanta, GA, 02/25/2017
http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf
Pitch from the sparse community:
• Small size (<30), “somewhat” flexible
• Add an inversion routine to Batched BLAS
• Go for Gauss-Jordan elimination instead of LU
Iterative linear system solvers
• Goal: solve a linear system Ax = b, with A ∈ R^(n×n)
• Use an iterative Krylov method to generate a solution approximation
• Convergence typically benefits from using a preconditioner
• We focus on Jacobi preconditioners based on diagonal scaling, P = (diag(A))^(-1)
• They provide a high level of parallelism
• Attractive for many-core accelerators like GPUs
• Scalar Jacobi acts on the main diagonal
• Block-Jacobi acts on the block diagonal
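For the scalar case, the preconditioner is just the inverted diagonal, and applying it is an element-wise scaling. A minimal CUDA sketch (not from the talk; it assumes 0-based CSR storage and that every diagonal entry is present and nonzero; the kernel names jacobi_setup / jacobi_apply are illustrative):

// build d_inv[i] = 1 / A(i,i) from the CSR matrix
__global__ void jacobi_setup(const int* __restrict__ row_ptr,
                             const int* __restrict__ col_idx,
                             const double* __restrict__ val,
                             double* __restrict__ dinv, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        if (col_idx[k] == i) { dinv[i] = 1.0 / val[k]; break; }   // pick the diagonal entry
}

// z = P * r with P = diag(A)^(-1)
__global__ void jacobi_apply(const double* __restrict__ dinv,
                             const double* __restrict__ r,
                             double* __restrict__ z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = dinv[i] * r[i];
}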
Block-Jacobi preconditioner
• Can reflect the block structure of sparse matrices (e.g., higher-order FEM)
• Often superior to the scalar Jacobi preconditioner
• Light-weight preconditioner (compared to ILU)
• Parallel preconditioner application (batched trsv / SpMV + vector scaling)
• Either: inversion of the diagonal blocks in the preconditioner setup + SpMV in the preconditioner application (a sketch of this application follows below)
• Or: factorization of the diagonal blocks in the preconditioner setup + trsv in the preconditioner application
• Diagonal blocks are typically of small size
• Preconditioner setup (inversion of the diagonal blocks) can become a challenge
• Batched inversion of many small systems
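With the inverted diagonal blocks at hand, the inversion-based application z = P·r is one small dense matrix-vector product per block. A minimal sketch, assuming a uniform block size bs that divides n and the inverted blocks stored consecutively in row-major order (the kernel name is illustrative):

__global__ void block_jacobi_apply(const double* __restrict__ prec,  // nblocks * bs * bs inverted blocks
                                   const double* __restrict__ r,
                                   double* __restrict__ z,
                                   int n, int bs)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per global row
    if (row >= n) return;
    int b  = row / bs;                                  // diagonal block this row belongs to
    int lr = row % bs;                                  // local row inside the block
    const double* Dinv = prec + (size_t)b * bs * bs + (size_t)lr * bs;
    const double* rb   = r + (size_t)b * bs;
    double sum = 0.0;
    for (int j = 0; j < bs; ++j)
        sum += Dinv[j] * rb[j];                         // row of D_b^(-1) times the sub-vector
    z[row] = sum;
}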
Block-Jacobi preconditioner generation
1. Extract the diagonal block from the sparse data structure.
2. Invert the diagonal block via Gauss-Jordan elimination.
3. Insert the inverted block as a diagonal block into the preconditioner matrix.
(A host-side reference of these three steps is sketched below.)
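As a point of reference, the three steps can be written as a straightforward host-side loop. This is only a sketch under simplifying assumptions (uniform block size bs dividing n, 0-based CSR arrays, no pivoting in the inversion); on the GPU the talk replaces all of this with batched kernels:

#include <algorithm>
#include <cstddef>
#include <vector>

// in-place Gauss-Jordan inversion of a dense bs x bs block (row-major);
// assumes nonzero pivots -- the real kernel uses implicit pivoting
static void gje_invert(std::vector<double>& M, int bs) {
    for (int k = 0; k < bs; ++k) {
        double pivinv = 1.0 / M[k * bs + k];
        M[k * bs + k] = 1.0;
        for (int j = 0; j < bs; ++j) M[k * bs + j] *= pivinv;      // scale pivot row
        for (int i = 0; i < bs; ++i) {
            if (i == k) continue;
            double f = M[i * bs + k];
            M[i * bs + k] = 0.0;
            for (int j = 0; j < bs; ++j) M[i * bs + j] -= f * M[k * bs + j];
        }
    }
}

// prec receives the inverted diagonal blocks, one bs x bs row-major block per block row
void block_jacobi_setup(int n, int bs,
                        const int* row_ptr, const int* col_idx, const double* val,
                        std::vector<double>& prec) {
    int nblocks = n / bs;
    prec.assign((size_t)nblocks * bs * bs, 0.0);
    std::vector<double> blk(bs * bs);
    for (int b = 0; b < nblocks; ++b) {
        int r0 = b * bs;
        // 1. extract the diagonal block from the CSR structure (missing entries stay zero)
        std::fill(blk.begin(), blk.end(), 0.0);
        for (int r = r0; r < r0 + bs; ++r)
            for (int idx = row_ptr[r]; idx < row_ptr[r + 1]; ++idx)
                if (col_idx[idx] >= r0 && col_idx[idx] < r0 + bs)
                    blk[(r - r0) * bs + (col_idx[idx] - r0)] = val[idx];
        // 2. invert it via Gauss-Jordan elimination
        gje_invert(blk, bs);
        // 3. insert the inverse as a diagonal block of the preconditioner
        std::copy(blk.begin(), blk.end(), prec.begin() + (size_t)b * bs * bs);
    }
}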
Batched Gauss-Jordan elimination (BGJE) on GPUs
• Flexible size up to 32 x 32
• One warp per system
• Inversion in registers
• Communication via __shfl()
• Implicit pivoting: accumulate row/column swaps, realize them when writing the results

Advantages over an LU-based approach:
• No operations on triangular systems
• Implicit pivoting
• Implementation can use a static system size
(A warp-level sketch of the inversion, without the pivoting, follows below.)
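A warp-level sketch of this idea, with one 32-thread block per system, each lane's working row kept in registers/local memory, and the pivot row broadcast via __shfl_sync. Pivoting is omitted here for brevity, whereas the actual BGJE kernel accumulates the row/column swaps implicitly and applies them when writing the result; the kernel name is illustrative:

// launch: batched_gje_invert<<<nbatch, 32>>>(dA, dAinv, n, nbatch);  with n <= 32
__global__ void batched_gje_invert(const double* __restrict__ A,
                                   double* __restrict__ Ainv,
                                   int n, int nbatch)
{
    int sys  = blockIdx.x;                 // one small system per warp
    int lane = threadIdx.x;                // lane i owns row i of its system
    if (sys >= nbatch || lane >= n) return;

    unsigned mask = (n == 32) ? 0xffffffffu : ((1u << n) - 1u);
    double r[32];                          // this lane's row, row-major dense input

    const double* Asys = A + (size_t)sys * n * n;
    for (int j = 0; j < n; ++j) r[j] = Asys[lane * n + j];

    for (int k = 0; k < n; ++k) {
        double piv    = __shfl_sync(mask, r[k], k);   // current pivot element A(k,k)
        double pivinv = 1.0 / piv;                    // sketch assumes it is nonzero
        double f      = (lane == k) ? 0.0 : r[k];     // multiplier for this row
        for (int j = 0; j < n; ++j) {
            double p = __shfl_sync(mask, r[j], k);    // pivot-row entry A(k,j)
            double s = (j == k) ? pivinv : p * pivinv;
            if (lane == k) r[j] = s;                  // pivot row is replaced by its scaled version
            else           r[j] = ((j == k) ? 0.0 : r[j]) - s * f;
        }
    }

    double* Ai = Ainv + (size_t)sys * n * n;
    for (int j = 0; j < n; ++j) Ai[lane * n + j] = r[j];
}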
NVIDIA Pascal architecture (P100)

Batched Gauss-Jordan elimination (BGJE) achieves >10% of double-precision peak!
• NVIDIA whitepaper: 56 SMs, 5.3 TFlop/s DP, 768 GB/s (6 : 1)
• 2n^3 operations for the inversion
• Flexible-size inversion up to 32 x 32, one warp per system

[Plots: BGJE performance in GFlop/s over the batch size (up to 4·10^4) and over the matrix size (up to 32)]

Anzt et al. “Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs”. PMAM'17, 8th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 1-10, 2017.
Challenge: diagonal block extraction from sparse data structures
• Matrix in CSR format
• Diagonal block extraction is costly
• Assign one thread per row
• Unbalanced nonzero distribution results in load imbalance
• “Dense” rows become the bottleneck
• Non-coalescent memory access
• Add an intermediate step:
  • Extract diagonal blocks to shared memory
  • Multiple threads for each row
  • Coalescent access to the data in CSR
  • Column-major storage
[Figure: memory requests for shared-memory extraction vs. cached extraction, working on the CSR row-ptr and col-indices arrays and a shared_mem staging area]
• Merged kernel vs. three separate kernels:
  [ extraction + inversion + insertion ]  vs.  [ extraction ] + [ inversion ] + [ insertion ]
  (a sketch of the shared-memory extraction step follows below)
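A sketch of the shared-memory extraction step, assuming a uniform block size bs ≤ 32, one warp (= one 32-thread block) per diagonal block, and 0-based CSR arrays; the kernel name is illustrative. All 32 lanes scan the nonzeros of each row together, so the accesses to the col-indices and values are coalescent even for “dense” rows, and the block is staged column-major in shared memory. In the merged variant, the inversion and insertion would happen in this same kernel instead of the final copy to global memory:

// launch: extract_diag_block<<<nblocks, 32, bs * bs * sizeof(double)>>>(...);
__global__ void extract_diag_block(const int* __restrict__ row_ptr,
                                   const int* __restrict__ col_idx,
                                   const double* __restrict__ val,
                                   double* __restrict__ blocks,   // nblocks * bs * bs, col-major
                                   int bs)
{
    extern __shared__ double sblk[];       // bs x bs staging area, column-major
    int b    = blockIdx.x;                 // diagonal block handled by this warp
    int lane = threadIdx.x;
    int r0   = b * bs;                     // first row/column of the block

    for (int i = lane; i < bs * bs; i += 32) sblk[i] = 0.0;        // zero the staging area
    __syncwarp();

    for (int lr = 0; lr < bs; ++lr) {                              // one block row after another
        int row = r0 + lr;
        for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += 32) {
            int c = col_idx[k];
            if (c >= r0 && c < r0 + bs)
                sblk[(c - r0) * bs + lr] = val[k];                 // column-major within the block
        }
    }
    __syncwarp();

    // hand the staged block on (here: copy to global memory; the merged kernel
    // would invert it right away and write the inverse instead)
    for (int i = lane; i < bs * bs; i += 32)
        blocks[(size_t)b * bs * bs + i] = sblk[i];
}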
Diagonal block extraction from sparse data structures (K40)

Results show:
• Runtime of block-Jacobi generation relative to the BGJE time
• 56 test matrices (SuiteSparse)
• Data arranged w.r.t. the performance of the (on average) best strategy

US: unmerged + shared extraction
UC: unmerged + cached extraction
MS: merged + shared extraction
MC: merged + cached extraction

[Plots: BJ generation time / BGJE time for block sizes 8 and 32, over the 56 test matrices and over the matrix size (up to 2·10^4)]
Diagonal block extraction from sparse data structures

Tridiagonal-type matrix structure: balanced nnz/row.

Arrow-type matrix structure: few rows with many elements; the corresponding threads become the bottleneck for cached extraction.
Block-Jacobi preconditioner setup cost

[Plot (K40): block-Jacobi generation time [s], from roughly 10^-5 s to 10^-1 s, for block sizes 4, 8, 16, 24, and 32 across the test matrices]

• Preconditioner setup needs up to 0.01 s on the P100 GPU.
• Irrelevant for the overall iterative solver execution time.
Factorization-based block-Jacobi

Preconditioner setup:
1. Extract the diagonal block from the sparse data structure.
2. Generate the factorization and pivoting (*ipiv).
3. Insert the factorization as a block into the preconditioner matrix and store the pivoting.

Preconditioner application:
Apply trsv using the pivoting information to turn the right-hand-side sub-vector into the solution of the diagonal-block linear problem (a host-side sketch follows below).
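A host-side sketch of the application step for one diagonal block, assuming LAPACK-style getrf factors (row-major, unit-diagonal L stored below the diagonal, U on and above it) and a 0-based pivot vector ipiv; the function name is illustrative. The right-hand-side sub-vector x is overwritten with the solution of the diagonal-block system:

static void apply_lu_block(const double* LU, const int* ipiv, double* x, int bs)
{
    // apply the row interchanges recorded during the factorization
    for (int i = 0; i < bs; ++i) {
        int p = ipiv[i];
        double t = x[i]; x[i] = x[p]; x[p] = t;
    }
    // forward substitution with the unit lower-triangular factor L
    for (int i = 1; i < bs; ++i)
        for (int j = 0; j < i; ++j)
            x[i] -= LU[i * bs + j] * x[j];
    // backward substitution with the upper-triangular factor U
    for (int i = bs - 1; i >= 0; --i) {
        for (int j = i + 1; j < bs; ++j)
            x[i] -= LU[i * bs + j] * x[j];
        x[i] /= LU[i * bs + i];
    }
}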
Factorization-based block-Jacobi
• Batched routines: Small-Size LU, Gauss-Huard, Gauss-Huard-T.
• Gauss-Huard-T writes the factors in “access-friendly” fashion.
• Flexible size up to the limit of 32 x 32.
• Use one warp per system, factorization in registers.

[Plots (NVIDIA P100 GPU): GFlop/s over the matrix size, panels dgetrf and dgetri, comparing Small-Size LU, Gauss-Huard, Gauss-Huard-T, and cuBLAS LU]
NVIDIA Pascal architecture (P100)
• BGJE and BLU invert the diagonal blocks, whereas BGH / BGHT only factorize.
• BGHT writes the factors in non-coalescent fashion for a faster preconditioner application.

[Plot (NVIDIA P100 GPU): time per matrix [ms] over the batch size (up to 5·10^4) for Gauss-Jordan elimination (BGJE), MAGMA dgetrf-inv (BLU), Gauss-Huard (BGH), and Gauss-Huard + transpose (BGHT)]
Inversion-based block-Jacobi vs. factorization-based block-Jacobi

Convergence of BiCGSTAB with block-Jacobi preconditioner for 300 matrices (NVIDIA P100 GPU):

[Histogram: number of test cases over the iteration-count difference (-100 to +100) for block sizes 4, 8, 16, 24, and 32; left side: BGH needs more iterations (explicit inversion better), right side: BGJE needs more iterations (factorization better)]
Block-Jacobi preconditioning: performance goal

Execution time of BiCGSTAB with a Jacobi(24) preconditioner (NVIDIA P100 GPU):

[Plots: total runtime [s] of BGJE, BGH, and BGHT over 300 test matrices; setup vs. solve breakdown of the runtime for 10 test matrices, for BGJE and BGH]

• Almost no difference in total execution time.
• BGH is better if the setup time is significant relative to the iterative solver time.
• BGJE is better for long solver runs.
This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar and Enrique Quintana-Orti (Universitat Jaume I), and is partly funded by the Department of Energy.

http://icl.cs.utk.edu/magma/
http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf

Pitch from the sparse community:
• Small size (<30), “somewhat” flexible
• Add an inversion routine to Batched BLAS
• Go for Gauss-Jordan elimination instead of LU