ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU · 2015. 3. 20.
TRANSCRIPT
[Page 1]
STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY
DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO
TIM DAVIS, PROFESSOR, CSE, TEXAS A&M UNIVERSITY
ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU
[Page 2]
SPARSE MATRIX FACTORIZATION ON GPUS
Objective: find methods for GPU acceleration of sparse Cholesky factorization.
Experiments use SuiteSparse 4.4.3 / CHOLMOD.
Outline:
- Sparse Cholesky factorization
- Previous work / issues
- 'Branches' approach
[Page 3]
DIRECT SPARSE FACTORIZATION
Supernodes (stored compressed-column)
Dense block Cholesky:
  L11 L11^T = A11                 POTRF  (dense Cholesky)
  L11 L21^T = A21^T               TRSM   (triangular solve)
  A*22 = A22 - L21 L21^T          GEMM   (matrix multiplication; Schur complement)
In matrix form:
  [ A11  A21^T ]   [ L11  0 ] [ I  0    ] [ L11^T  L21^T ]
  [ A21  A22   ] = [ L21  I ] [ 0  A*22 ] [ 0      I     ]
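The three block operations above can be sketched with NumPy on a small dense SPD matrix (an illustrative example of the math only, not CHOLMOD's implementation):

```python
import numpy as np

# Build a small SPD matrix and partition it into blocks.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = 3  # block size
A11, A21, A22 = A[:b, :b], A[b:, :b], A[b:, b:]

# POTRF: dense Cholesky of the diagonal block, L11 L11^T = A11
L11 = np.linalg.cholesky(A11)

# TRSM: triangular solve for the off-diagonal block, L21 = A21 L11^-T
L21 = np.linalg.solve(L11, A21.T).T

# GEMM/SYRK: Schur complement update, A*22 = A22 - L21 L21^T
S = A22 - L21 @ L21.T

# Finishing the factorization of S recovers the full factor L of A.
L = np.vstack([np.hstack([L11, np.zeros((b, b))]),
               np.hstack([L21, np.linalg.cholesky(S)])])
assert np.allclose(L @ L.T, A)
```

In the sparse supernodal setting each of these three calls operates on one supernode's dense blocks, which is why POTRF, TRSM, and GEMM dominate the work.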
[Page 4]
DIRECT SPARSE FACTORIZATION
'Left-looking supernodal': apply block Cholesky to supernodes.
Bulk of work is in assembling supernodes (wide range of descendant sizes).
(figure: elimination tree over supernodes 1-7)
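Supernodes at the same elimination-tree level are independent and can be processed together; a minimal sketch of grouping by level follows (the `parent` map is a hypothetical reading of the slide's 7-supernode tree, not data from CHOLMOD):

```python
# Hypothetical elimination tree for the slide's 7 supernodes:
# 1,2 feed into 3; 4,5 feed into 6; 3,6 feed into the root 7.
parent = {1: 3, 2: 3, 4: 6, 5: 6, 3: 7, 6: 7, 7: None}

def levels(parent):
    """Group supernodes by distance from the leaves, so each level's
    factorizations are mutually independent and can be batched."""
    depth = {}
    def d(n):
        if n not in depth:
            kids = [c for c, p in parent.items() if p == n]
            depth[n] = 0 if not kids else 1 + max(d(c) for c in kids)
        return depth[n]
    for n in parent:
        d(n)
    out = {}
    for n, lv in depth.items():
        out.setdefault(lv, []).append(n)
    return {lv: sorted(ns) for lv, ns in out.items()}

print(levels(parent))  # → {0: [1, 2, 4, 5], 1: [3, 6], 2: [7]}
```

Level 0 holds the four leaves, which is where the 'branches' scheme described later finds abundant batchable work.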
[Page 5]
DIRECT SPARSE FACTORIZATION (animation of the Page 4 slide; operations so far: POTRF)
[Page 6]
DIRECT SPARSE FACTORIZATION (animation continued; operations so far: POTRF, TRSM)
[Page 7]
DIRECT SPARSE FACTORIZATION (animation continued; operations so far: POTRF, TRSM, GEMM; fill appears)
[Page 8]
DIRECT SPARSE FACTORIZATION (animation continued; operations so far: POTRF, TRSM, GEMM, POTRF)
[Page 9]
DIRECT SPARSE FACTORIZATION (animation continued; operations so far: POTRF, TRSM, GEMM, POTRF, TRSM)
[Page 10]
DIRECT SPARSE FACTORIZATION (animation continued; operations so far: POTRF, TRSM, GEMM, POTRF, TRSM, GEMM)
[Page 11]
DIRECT SPARSE FACTORIZATION (animation complete; full sequence: POTRF, TRSM, GEMM, POTRF, TRSM, GEMM, POTRF)
[Page 12]
DIRECT SPARSE FACTORIZATION
Lots of 'small' math
Irregular access patterns
Larger matrices -> more dense math
Greater connectivity -> more dense math
Factors can be large (> 128 GB)
[Page 13]
PREVIOUS WORK
Just send large BLAS-3 calls to the GPU. WORKS! for large, dense matrices.
Not so good for:
- small matrices
- large matrices with low connectivity (shells / beams in FEA)
Goal: find methods for further GPU acceleration of sparse factorization.
[Page 14]
PREVIOUS WORK
Hybrid computing:
- Send appropriately-sized BLAS calls to the GPU
- 'Hide' PCIe communication
- Assemble supernodes on the GPU
Row/column threshold: ndrow >= 256, ndcol >= 32.
Supernodes, ordered by decreasing cost to assemble, are split between GPU and CPU by supernode score.
Result: ~1.5x speedup.
(chart: GFlop/s, 0 to 800, over the Florida Sparse Matrix Collection; CPU vs CPU+GPU; annotations "why not higher?" and "why so low?")
SuiteSparse (CHOLMOD) 4.4.3
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); http://faculty.cse.tamu.edu/davis/suitesparse.html
[Page 15]
ISSUES
PCIe communication: limits which BLAS operations can be accelerated on the GPU.
Small BLAS: low occupancy, launch overhead.
Most BLAS calls don't get sent to the GPU.
Seek methods which better accelerate factorization of small / minimally-connected matrices.
(chart: audikw_1.mtx, % of calls left on CPU)
[Page 16]
PROPOSED SOLUTION
Factor branches on the GPU; use previous methods for the root.
- No use of CPU: eliminates PCIe communication; requires POTRF, TRSM & GEMM on the GPU
- Batch and stream BLAS operations within levels: amortizes launch overhead; streaming improves occupancy; no size restriction
- Maps well to multi-GPU / hybrid computing
(figure: elimination tree partitioned into branch 1 through branch 4, with root levels 0-2)
[Page 17]
BATCHED / STREAMED BLAS
Batch all BLAS calls to amortize kernel launch latency.
Stream multiple batches to increase occupancy.
Simply wrap the cuBLAS subroutine with a batch loop.
DGEMM w/ m,n,k=16 -> 40 GF
(timeline figure: DGEMM example, m,n,k=16; host <-> device transfers vs kernels, per stream; 100 Mflops : 500 Mflops; batched: 1.2 Gflops; streamed: 4.8 Gflops)
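The "wrap with a batch loop" idea can be sketched in Python as a CPU stand-in (the queues model CUDA streams; `batched_gemm` and its shape are illustrative, not the cuBLAS batched API):

```python
import numpy as np

def batched_gemm(As, Bs, Cs, n_streams=4):
    """CPU stand-in for batching many small GEMMs: assign calls to queues
    round-robin; on the GPU each queue would be a CUDA stream, so the
    independent kernels overlap and occupancy rises."""
    queues = [[] for _ in range(n_streams)]
    for i, (A, B) in enumerate(zip(As, Bs)):
        queues[i % n_streams].append((i, A, B))
    for q in queues:                  # one pass per "stream"
        for i, A, B in q:
            Cs[i] = Cs[i] + A @ B     # C += A * B, as in DGEMM with beta=1
    return Cs

# Many independent 16x16 multiplies, as in the slide's m,n,k=16 example.
rng = np.random.default_rng(1)
As = [rng.standard_normal((16, 16)) for _ in range(8)]
Bs = [rng.standard_normal((16, 16)) for _ in range(8)]
Cs = batched_gemm(As, Bs, [np.zeros((16, 16)) for _ in range(8)])
assert all(np.allclose(C, A @ B) for C, A, B in zip(Cs, As, Bs))
```

The point of the design is that one launch per batch replaces one launch per call, so the fixed launch latency is paid once per queue rather than once per tiny GEMM.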
[Page 18]
BATCHED / STREAMED DGEMM
Square DGEMM, 64 streams/threads.
Batched / streamed cuBLAS performance matches MKL for small sizes.
Created by wrapping existing, non-batched routines and passing lists.
(chart: Gflop/s, 0 to 1400, vs DGEMM m,n,k, 0 to 500; series: GPU batched/streamed, GPU streamed, CPU)
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)
[Page 19]
PLENTY OF PARALLELISM
Lower levels: many supernodes, few descendants.
Upper levels: few supernodes, many descendants.
(chart: audikw_1.mtx; # of supernodes or GEMM + SYRK ops per level; series: supernodes, GEMM)
[Page 20]
BRANCHES
Matrix       #branches  #levels  #supernodes    #root levels  #root supernodes
Fault_639        2      18-19    14931-15794         1               1
nd24k            2      11       302-325             1               1
inline_1         4      16-17    3909-10633          1               1
Emilia_923       4      17-18    10314-11570         3               4
boneS10          4      18-23    7045-26182          1               1
ldoor            3      19-20    17413-35704         1               1
bone010          6      16-20    1957-23610          1               1
Hook_1498        9      1-18     1-33608             3               5
Geo_1438         8      17-18    8102-9335           5               9
Serena          60      10-17    189-4910           10              60
audikw_1         4      17-19    5631-22300          1               1
Flan_1564        8      15-17    3937-16309          2               2
(figure: elimination tree split into branch 1 through branch 4)
[Page 21]
CHOLMOD RESULTS
1.38x average speedup vs. previous CPU+GPU
2x average speedup vs. CPU
Poorly performing matrices see the greatest speedup.
(chart: GFlop/s, 0 to 900, over the Florida Sparse Matrix Collection; series: CPU, CPU + GPU, GPU Branches)
CHOLMOD 4.4.3
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); http://faculty.cse.tamu.edu/davis/suitesparse.html
[Page 22]
PCIE DEPENDENCE
PCIe gen3 -> gen1: 12 GB/s -> 3 GB/s (75% bandwidth loss)
CPU+GPU: 23% performance loss
Branches: 17% performance loss
(chart: Gflop/s, 0 to 900, Florida Sparse Matrix Collection; 4.4.3 CPU+GPU and GPU Branches, each at PCIe gen1 and gen3)
1 x i7 3930K + K40 (max boost, ECC=on)
[Page 23]
SHELL MODEL PERFORMANCE
(chart: numerical factorization rate, GF/s, 0 to 450, vs million degrees of freedom, 0 to 12; series: 4.4.3 CPU, 4.4.3 CPU+GPU, Branches 1xK40)
PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
- 506,082 supernodes; 640 branches
- 114-1,730 supernodes and 8-20 levels per branch
- root branch: 49 levels, 637 supernodes
2-socket x 16-core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=on, full boost)
[Page 24]
SHELL MODEL PERFORMANCE
We've ported the previous algorithm to multi-GPU; the 'Branches' algorithm is well-suited for it.
4 x K40: overall 1.5x speedup; branches portion 3.1x speedup.
- 506,082 supernodes; 640 branches
- 114-1,730 supernodes and 8-20 levels per branch
- root branch: 49 levels, 637 supernodes
(timeline figure: 1 x K40 vs 4 x K40; host <-> device transfers and compute kernels)
[Page 25]
SHELL MODEL PERFORMANCE
(chart: numerical factorization rate, GF/s, 0 to 1400, vs million degrees of freedom, 0 to 12; series: 4.4.3 CPU, 4.4.3 CPU+GPU, Branches 1xK40, Branches 2xK40, Branches 2xK40-Proj., Branches 4xK40, Branches 4xK40-Proj.; projections assume 87.5% parallel efficiency)
PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
2-socket x 16-core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=on, full boost)
[Page 26]
CONCLUSIONS
Factoring 'branches' on the GPU avoids the PCIe bottleneck.
Batching and streaming permit higher performance on small matrices.
Universally beneficial; aspects apply to other factorization methods.
Future work:
- Improved performance of batched routines
- Support for hybrid computing
- Complete multi-GPU support
[Page 27]
RELATED WORK
S5232 - GPU Acceleration of WSMP (Watson Sparse Matrix Package) (Natalia Gimelshein, Anshul Gupta)
S5316 - DAG-Scheduled Linear Algebra Using Template-Based Building Blocks (Jonathan Hogg)
S5476 - Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs (Azzam Haidar, Stanimire Tomov)
S5424 - Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement (Tim Davis)
S5237 - Jacobi-Davidson Eigensolver in cuSOLVER Library (Lung-Sheng Chien)
[Page 28]
THANK YOU