
    Performance of PETSc GPU Implementation with Sparse

    Matrix Storage Schemes

    Pramod Kumbhar

    August 19, 2011

    MSc in High Performance Computing

    The University of Edinburgh

    Year of Presentation: 2011


    Abstract

PETSc is a scalable solver library developed at Argonne National Laboratory (ANL). It is widely used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). GPU support has recently been added to PETSc to exploit the performance of GPUs. This support is quite new and currently only available in the PETSc development release. The goal of this MSc project is to evaluate the performance of the current GPU implementation, especially of the iterative solvers, on the HECToR GPU cluster. In the current implementation, a new sub-class of matrix was added which stores the matrix in Compressed Sparse Row (CSR) format. We have extended the current PETSc GPU implementation to improve performance using different sparse matrix storage schemes such as ELL, Diagonal and Hybrid.

For structured matrices, the current GPU implementation shows a 4x speedup compared to an Intel Xeon quad-core CPU. For multi-GPU applications, the speedup starts decreasing due to high communication costs on the HECToR GPU cluster. Our implementation with the new storage schemes shows a 50% performance improvement on sparse matrix-vector operations. For structured matrices, the new implementation shows a 7x speedup and significantly improves the performance of vector operations on the GPU.


    Contents

Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Related Work
1.4 Contributions and Outline
1.5 Change in Project Plan
Chapter 2 Background
2.1 GPGPU
2.2 GPU Programming Models
2.2.1 CUDA
2.3 CUSP and Thrust
2.3.1 Thrust
2.3.2 CUSP
2.4 PDEs: Source of Sparse Matrices
2.5 Iterative Methods for Sparse Linear Systems
Chapter 3 PETSc GPU Implementation
3.1 PETSc
3.1.1 PETSc Kernels
3.1.2 PETSc Components
3.1.3 PETSc Object Design
3.2 PETSc GPU Implementation
3.2.1 Sequential Implementation
3.2.2 Parallel Implementation
3.3 Applications running with PETSc GPU support
Chapter 4 Sparse Matrices


4.1 Sparse Matrix Representation
4.2 Sparse Matrix Storage Schemes
4.2.1 Coordinate List
4.2.2 Compressed Sparse Row
4.2.3 Diagonal
4.2.4 ELL or Padded ITPACK
4.2.5 Hybrid
4.2.6 Jagged Diagonal Storage (JDS)
4.2.7 Skyline or Variable Band
4.3 Performance of Storage Schemes
Chapter 5 Implementation of Sparse Storage Support in PETSc
5.1 Design Approach
5.2 Implementation Details
5.2.1 New Matrix types for GPU
5.2.2 PETSc Mat Object
5.2.3 New User level API
5.2.4 PETSc Mat objects on GPU
5.2.5 Conversion of PETSc MatAIJ to CUSP CSR
5.2.6 Conversion of PETSc MatAIJ to CUSP DIA/ELL/HYB/COO
5.2.7 Matrix-Vector multiplication for different sparse formats
5.2.8 Other Important notes
5.3 Sample Use Case and Validation
Chapter 6 Wrapper Codes and Benchmarks
6.1 Testing Codes
6.2 Matrix Market to PETSc binary format
6.3 Benchmarking codes
6.4 Benchmarking Approach
Chapter 7 Performance Analysis
7.1 Benchmarking System
7.2 Single GPU Performance
7.2.1 Structured Matrices
7.2.2 Semi-Structured Matrices
7.2.3 Unstructured Matrices


7.3 Multi-GPU Performance
7.4 Comparing multi-GPU performance with HECToR
7.5 CUSP Matrix Conversion Cost
Chapter 8 Discussion
8.1 Challenges for multi-GPU parallelisation
8.1.1 CPU-GPU and GPU-GDRAM Memory transfer
8.1.2 GPU-GPU Communication
8.2 Future Work
Chapter 9 Conclusion
Bibliography


List of Figures

Figure 1.1: Main stages involved in single iterations of Fluidity framework [7]
Figure 1.2: Profiling results of Burgers' equation 1-D model problem with mesh spacing of 0.002 and domain [-10, 10]
Figure 3.1: PETSc Kernels (implementation: petsc/src/sys)
Figure 3.2: PETSc Library Organisation [28]
Figure 3.4: PETSc Objects and Application Level Interface
Figure 3.5: VecAXPY implementation in PETSc using CUSP & CuBLAS [11]
Figure 3.6: Parallel Matrix with on-diagonal and off-diagonal elements for two MPI processes
Figure 3.7: Parallel Matrix-Vector multiplication in PETSc GPU implementation [11]
Figure 4.1: MxN: 4,690,002x4,690,002 NNZ: 20,316,253 id: 1398
Figure 4.2: MxN: 1,391,349x1,391,349 NNZ: 64,531,701 id: 2541
Figure 4.3: MxN: 3,542,400x3,542,400 NNZ: 96,845,792 id: 1902
Figure 4.4: MxN: 4,802,000x4,802,000 NNZ: 85,362,744 id: 2496
Figure 4.5: MxN: 16,614x16,614 NNZ: 1,096,948 id: 409
Figure 4.6: MxN: 999,999x999,999 NNZ: 4,995,991 id: 1883
Figure 4.7: MxN: 1,489,752x1,489,752 NNZ: 10,319,760 id: 2267
Figure 4.8: MxN: 1,971,281x1,971,281 NNZ: 5,533,214 id: 374
Figure 4.9: MxN: 1,61,070x1,61,070 NNZ: 8,185,136 id: 2336
Figure 4.10: MxN: 1, 20,216x1, 20,216 NNZ: 3,121,160 id: 2228
Figure 5.1: PETSc Objects creation in current GPU implementation
Figure 5.2: PETSc Object creation and new sparse matrix support in new GPU implementation
Figure 5.3: New User level API registration with Mat class (petsc-src/mat/interface/matreg.c)
Figure 5.4: Modified Mat_SeqAIJCUSP class with ELL, DIA, HYB and COO storage support using the CUSP library
Figure 5.5: Converting PETSc AIJ Matrix to CUSP CSR matrix
Figure 5.6: Transparent conversion between different sparse formats with CUSP
Figure 5.7: Converting PETSc MatAIJ to CUSP ELL matrix (algorithmic details)
Figure 5.8: Converting PETSc MatAIJ to CUSP ELL format (Algorithmic Implementation)


Figure 5.9: Sparse Matrix-Vector operation support for different matrix formats using CUSP
Figure 5.10: Simple example of KSP with the use of new sparse matrix format
Figure 5.11: Convergence of KSP CG solver with different sparse matrix formats on CPU & GPUs for simple example of 2-D Laplacian from PETSc
Figure 6.1: Converting Matrix Market format to PETSc binary format (Algorithmic Implementation)
Figure 7.1: HECToR GPGPU Testbed System consisting of NVIDIA and AMD GPUs connected by Infiniband Network
Figure 7.2: Tesla C2050/C2070 Specification [58]
Figure 7.3: Total execution time with different sparse matrix formats on GPU (using GMRES method)
Figure 7.4: Performance with different sparse matrix formats on GPU (using GMRES method)
Figure 7.5: Execution time of SpMV with different sparse matrix formats
Figure 7.6: Execution time of SpMV+VecMDot+VecMAXPY with different sparse matrix formats on GPU
Figure 7.7: Performance on CPU with CSR, GPU with CSR and GPU with DIA
Figure 7.8: Achieved speedup compared to Intel Xeon quad-core
Figure 7.9: Execution time of different sparse matrix formats for semi-structured matrix on GPU
Figure 7.10: Performance for Semi-Structured Matrices
Figure 7.11: Sparse Matrix-Vector execution time for different sparse matrix formats
Figure 7.12: Unstructured matrix of size 503712 x 503712 with 36,816,170 non zero elements
Figure 7.13: Total execution time on GPU with CSR and HYB format
Figure 7.14: Performance of CSR and HYB on the GPU
Figure 7.15: Execution time of SpMV on CPU (CSR), GPU (CSR) and GPU (HYB)
Figure 7.16: Performance on HECToR GPU cluster with CSR and DIA matrix format
Figure 7.17: Performance with CSR and DIA matrix formats with different number of GPUs
Figure 7.18: Execution time for SpMV using CSR and DIA matrix formats on HECToR GPU
Figure 7.19: Performance comparison between HECToR GPU system
Figure 8.1: Overall system architecture considering bandwidth of different sub-systems (Pre Sandy-Bridge Architecture)


Figure 8.2: HECToR GPU: Infiniband Network with Switched fibre topology (schematic layout)
Figure 8.3: Speedup using the default Block Jacobi Preconditioner on CPU and GPU with CSR, ELL
Figure 8.4: Diagonal matrix with few independent nonzero numbers
Figure 8.5: User Implemented SpMV in PETSc Using MatShell (Design)


    Acknowledgements

    I am very grateful to Dr. Michele Weiland and Dr. Chris Maynard for their advice

    and supervision during this dissertation. I would also like to thank Dr. Lawrence

    Mitchell for providing valuable advice during project discussions. I am also indebted

    to my friends and family for their continued support during my study.


    Chapter 1

    Introduction

1.1 Background

PETSc (Portable, Extensible Toolkit for Scientific Computation) is an open source, scalable solver library developed over the past twenty years at Argonne National Laboratory (ANL). It is used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). Developing parallel, nontrivial PDE solvers for high-end computing systems that scale over thousands of processors is still a difficult and time-consuming task. PETSc is designed to ease this task and reduce the development time. It provides parallel algorithms, debugging support and a low-overhead profiling interface that help in the development of large and complex applications. PETSc has been used to solve large linear systems with 500 billion unknowns on supercomputers like Jaguar and Jugene with more than 200K processors [1]. It is used in the modelling of many scientific applications in the areas of geosciences, computational fluid dynamics, weather modelling, seismology, surface water flow, polymer injection modelling etc. We will discuss one such application of PETSc, called Fluidity, that we have analysed.
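As a minimal illustrative sketch of this solver interface (not taken from any particular application, and with error checking omitted), a small program can assemble a 1-D Laplacian in PETSc's AIJ (CSR) format and hand the system Ax = b to a Krylov solver; note that the exact KSPSetOperators signature varies slightly between PETSc releases.

    /* Illustrative PETSc sketch: solve a small tridiagonal system A x = b
       with a run-time-selectable Krylov solver. Error checking omitted. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
        PetscInitialize(&argc, &argv, NULL, NULL);

        const PetscInt n = 100;
        Mat A; Vec x, b; KSP ksp;

        /* Assemble a 1-D Laplacian (tridiagonal) matrix in AIJ (CSR) format */
        MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 3, NULL, &A);
        for (PetscInt i = 0; i < n; i++) {
            PetscScalar v = 2.0;
            MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);
            v = -1.0;
            if (i > 0)     { PetscInt j = i - 1; MatSetValues(A, 1, &i, 1, &j, &v, INSERT_VALUES); }
            if (i < n - 1) { PetscInt j = i + 1; MatSetValues(A, 1, &i, 1, &j, &v, INSERT_VALUES); }
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        VecCreateSeq(PETSC_COMM_SELF, n, &b);
        VecDuplicate(b, &x);
        VecSet(b, 1.0);

        /* Krylov solver: type and preconditioner are chosen at run time,
           e.g. -ksp_type gmres -pc_type jacobi */
        KSPCreate(PETSC_COMM_SELF, &ksp);
        KSPSetOperators(ksp, A, A);   /* older releases take an extra MatStructure flag */
        KSPSetFromOptions(ksp);
        KSPSolve(ksp, b, x);

        KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&b); VecDestroy(&x);
        PetscFinalize();
        return 0;
    }

The same run-time options mechanism is what allows an application to switch between solvers and preconditioners without recompilation.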

    Fluidity is an open source, general purpose computational fluid dynamics framework

    [2] developed by the Applied Modelling and Computational Group (AMCG) at

    Imperial College London. This framework is used in many scientific simulations in

    the areas of fluid dynamics, ocean modelling, atmospheric modelling etc. It solves the

    Navier-Stokes equations [3] on arbitrary unstructured, adaptive meshes using finite

element methods. While solving this system, we impose a grid on the problem domain to calculate the numerical solution of the partial differential equations (PDEs). The accuracy and computational cost of the solution depend on the grid spacing. To compute accurate solutions, one has to use a finer grid, but this increases the computational cost. Hence the Adaptive Mesh Refinement (AMR) technique is used to reduce the computational costs. AMR uses a coarse grid at the start of the simulation and, as the solution progresses, it identifies areas of interest (i.e. parts of the grid which exhibit a large change in the solution) where the grid needs to be refined.

    These methods are discussed in more detail in [4], [5], [6].

    Figure 1.1 shows the main stages involved in simulations using the Fluidity

    framework:


    During simulation, a new mesh may need to be generated by using AMR techniques

    to maintain the accuracy of the solution. During the assembly stage, a system of

    simultaneous equations is assembled using the finite element mesh. In the solver

    stage, the system of equations assembled in the assembly stage is solved using

iterative methods. Fluidity uses iterative solvers from PETSc to solve the large sparse systems. The sparse matrices that arise are positive definite and non-symmetric, hence the Generalised Minimum Residual (GMRES) algorithm is normally used [7]. The update stage involves updating the solution variables, calculating new time steps and estimating the error. Finally, the current solution can be written to disk. All these stages are discussed in more detail in [7] and [8]. Currently, Fluidity uses various

    libraries like MPI (Message Passing Interface), ParMETIS and PETSc to support

    parallelisation [9].

1.2 Motivation

Despite various parallelisation and optimisation techniques, simulations of complex phenomena like tidal modelling or tsunami simulation take from hours to a few days on modern supercomputers. For Fluidity, the main computationally expensive stages are the assembly phase and the solver phase. The initial idea of the project was to improve the performance of the Fluidity framework using directive-based GPU programming models like HMPP (Hybrid Multicore Parallel Programming) and PGI Accelerators. For initial profiling and performance optimisation, we decided to use a 1-D non-linear model problem of Burgers' equation [10]. The Burgers equation is a fundamental PDE which occurs in various applications of fluid dynamics and takes the form

∂u/∂t + u ∂u/∂x = ν ∂²u/∂x²

where u is the velocity and ν is the viscosity coefficient. This is basically the 1-D Navier-Stokes equation (discussed in Section 2.4) without the pressure and volume force terms. Figure 1.2 shows the profiling result of this problem on a single Intel Xeon processor.

[Figure 1.1: Main stages involved in single iterations of Fluidity framework [7] — a repeated loop: Mesh Generation → Assembly Phase → Solver Phase → Update Solution → Output Solution]


The profiling results show that most of the execution time (84%) is spent in the PETSc solver library. For this model problem, when we increase the resolution (i.e. use a finer mesh spacing), the assembly phase time increases proportionally and the solver time increases exponentially. As this is a simple 1-D example, the assembly time is relatively small; but the condition number of the matrix is very high and hence the solver takes a long time to converge. Hence, we decided to deviate from our original plan and optimise the solver phase of Fluidity, which is ultimately the PETSc solvers.

    Graphics Processing Units (GPUs) are becoming more popular due to their

    performance to cost ratio and potential performance gains compared to CPUs. To

    improve the performance of the solver phase, we decided to use the newly implemented

    GPU support in PETSc. Also we identified potential performance improvements in

    PETSc using different sparse matrix storage schemes, which are more suitable for

    GPUs.

1.3 Related Work

In the last year, basic GPU support has been added to PETSc, which is currently available in the PETSc development release [11]. To the best of our knowledge, there are no published benchmarking results for this GPU implementation. Also, the current implementation only supports the CSR (Compressed Sparse Row) matrix storage format and there is no development effort to support other matrix storage schemes. Nathan Bell and Michael Garland from NVIDIA Research have published performance results [12] of sparse matrix-vector operations on GPUs using different sparse matrix storage schemes. These results show that storage schemes like DIA (Diagonal), ELL (ELLPACK) and HYB (Hybrid) are well suited for GPUs. Compared to the CSR storage scheme, the DIA and ELL formats can achieve a 4-6x speedup.

Figure 1.2: Profiling results of Burgers' equation 1-D model problem with mesh spacing of 0.002 and domain [-10, 10]


There are two main goals of this MSc project. The first goal is to improve the performance of the current PETSc GPU implementation using the DIA, ELL and HYB sparse matrix storage formats. The second goal is to evaluate the performance of the PETSc GPU implementation on the HECToR GPU cluster. Specifically, we are looking at the performance of Krylov subspace solvers for solving large sparse linear systems.

1.4 Contributions and Outline

The contributions of this project report are as follows:

• To discuss the performance of the CUSP and Thrust libraries with large sparse matrices from real world applications;

• To present an initial implementation to support different sparse matrix storage schemes in PETSc using CUSP and Thrust;

• To evaluate the performance of the PETSc GPU implementation for solving large sparse linear systems;

• To evaluate the performance benefits of the new implementation on single and multi-GPU applications;

• To compare the overall performance of the PETSc GPU implementation on the HECToR GPU cluster and the HECToR (Phase 2b) system.

Chapter 2 presents background information, which includes GPGPU, the CUDA programming model, the CUSP and Thrust libraries, PDEs and iterative methods for sparse linear algebra. Chapter 3 discusses the design of the PETSc library and the implementation of GPU support in PETSc. Chapter 4 presents different sparse matrix storage schemes suitable for vector processors which are available in the CUSP library. We have also evaluated the performance of CUSP with large sparse matrices from real world applications. Chapter 5 presents our initial implementation to support sparse matrix storage schemes in PETSc using the CUSP library. Chapter 6 discusses the wrapper codes developed for matrix conversion, performance analysis and benchmarking. Chapter 7 presents performance results of the PETSc GPU implementation with different sparse matrix formats on the HECToR GPU cluster as well as the main HECToR (Phase 2b) system. Chapter 8 discusses the challenges of multi-GPU parallelisation encountered during performance analysis and outlines the future work in this area. Chapter 9 presents the conclusion of this project and summarises the results.

1.5 Change in Project Plan

During the project preparation phase (Semester II), the idea of the MSc project was to extend the HMPP programming model for the C++ language. Specifically, our aim was to implement a generic meta-programming framework for HMPP using C++ templates. HMPP is now an open standard developed by CAPS Enterprise and PathScale Inc. HMPP provides a directive-based GPU programming model similar to PGI Accelerators. For this MSc project, an external organisation was expected to provide the ENZO compiler suite with HMPP C++ support by May 2011. However, this compiler was not available until the first week of June 2011 due to the complexity of the C++ compiler implementation, so we decided to change our project plan. With the great support of Dr. Michele Weiland and Dr. Chris Maynard we were quickly able to work out an alternative project plan and


decided to work on the Fluidity project, especially the PETSc GPU implementation. This change affected the planned schedule of the project, but with the continuous advice and support of my supervisors, I was able to complete this project successfully.


    Chapter 2

    Background

2.1 GPGPU

GPUs (Graphics Processing Units) have a distinct architecture specifically designed for high floating-point throughput and fine-grained concurrency. In the past, GPUs were mostly used for improving the performance of graphics operations like pixel shading, texture mapping and rendering. But in the last few years, GPUs have been effectively used to speed up the performance of non-graphics applications from different areas of science like computational fluid dynamics, molecular dynamics, medical imaging, climate modelling etc. The term GPGPU (General Purpose Computing on GPUs) is normally used to refer to the use of GPUs for accelerating non-graphics applications traditionally executed on CPUs.

The primary reason for the popularity of GPUs in the area of scientific computing is their performance to cost ratio. For example, the NVIDIA Tesla C2070 GPU has 448 cores capable of achieving a theoretical peak performance of 515 GFlops, which is 50 times more than the Intel Xeon (E5620) quad-core processor. If we compare prices, however, the Tesla GPU is only five times more expensive than the Intel Xeon processor. Various applications ported to GPUs show significant performance benefits (10-50x speedup) compared to CPUs. More details about these applications can be found in [13].

2.2 GPU Programming Models

Programming models like ARB, OpenGL, Direct3D and Cg were commonly used for the development of graphics applications, but these programming models do not fit well for the development of HPC applications. Research in GPU technology helped to understand the use of GPUs in general purpose computing. Various programming models like CUDA, OpenCL, AMD Stream and PGI directives are available for programming these special purpose devices, and CUDA is the most popular programming model among them.

2.2.1 CUDA

NVIDIA introduced the CUDA programming model, which enables a large developer community to exploit GPUs for general purpose computing. The programming interfaces are exposed through the C, C++ and Fortran languages, and through third party wrappers for other languages like Python, Ruby etc. A CUDA application consists of code that runs on the CPU as well as on the GPU. The compute-intensive functions of the program, which execute on the GPU, are called kernels. The nvcc compiler translates this kernel code to PTX assembly code, which is executed on the GPU. More details about the CUDA programming model can be found in [14].
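As a minimal illustrative sketch of this host/kernel split (a standard vector addition, not specific to PETSc or Fluidity), a kernel and its launch might look as follows.

    // Illustrative CUDA sketch: element-wise vector addition.
    // The kernel runs on the GPU; the host code allocates device memory,
    // copies data, launches the kernel and copies the result back.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 256 threads per block; enough blocks to cover all n elements
        int threads = 256, blocks = (n + threads - 1) / threads;
        vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        delete[] h_a; delete[] h_b; delete[] h_c;
        return 0;
    }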

CUDA Architecture: We will discuss the CUDA architecture considering a Tesla 10-series device. A Tesla C1060 GPU device consists of 30 multithreaded streaming multiprocessors (SMs) and each SM consists of 8 streaming processors (SPs), two special function units, on-chip shared memory and an instruction unit. Figure 2.1 shows the organisation of SMs, SPs, registers and shared memory in a Tesla device. The SM creates, manages and schedules a group of 32 threads in a batch called a warp. A single SM has hardware resources that can hold the state of three warps at a time [15]. For the C1060 device there can be 23,040 threads (30 SM * 8 SP * 32 threads * 3 warps) available for execution. Out of these, only 960 threads (30 SM * 32 threads) can be executed concurrently at a given time. All threads within a warp execute in SIMT (Single Instruction Multiple Threads) fashion.

Figure 2.1: CUDA Architecture and Memory Hierarchy (adapted from [15] and [56])


    There are different memory types: register, shared, local, global and caches. Each of

    these types has different sizes, latencies, bandwidths and performance characteristics.

    Each SM has on-chip registers and shared memory. These memories are small in size

and have very low latency. Local and global memory are the largest in size and have very high latency. Data access from global or local memory is very costly and requires 400-500 cycles. Texture and constant memory have similar latency, but these can be automatically cached by hardware and hence can be used effectively if the kernel exhibits

    temporal locality. L1 and L2 caches are introduced in the newer Fermi architecture

    giving benefits similar to CPU caches. More detailed descriptions on memory

    organisation and performance can be found in [16].

Whenever a CUDA kernel is launched on a GPU, thousands of threads are created, which are organised into a grid. The grid is a 1-D or 2-D array of thread blocks and each thread block is a 1-D, 2-D or 3-D array of threads. The thread blocks are assigned to the available SMs. All threads within a thread block execute in a time-multiplexed fashion on a single SM. The grid and block dimensions largely depend on the hardware resource requirements of the executing kernel. More information on this can be found in [14].

2.3 CUSP and Thrust

CUSP and Thrust are open source C++ template libraries developed using CUDA that provide high-level interfaces for GPU programming. We have used these libraries for implementing sparse matrix storage scheme support in PETSc.

2.3.1 Thrust

Thrust is an open source template library [17] developed on top of CUDA. The main advantage of Thrust is that it provides a high-level interface for GPU programming and enables rapid development of complex HPC applications. Another important benefit is that Thrust, being a C++ template library, supports generic programming and Object Oriented (OO) paradigms. The three main components of Thrust are: Containers, Iterators and Algorithms.

Containers: A container stores a collection of objects. Containers are usually implemented as template classes so that they can be used with different data types. For example, common data structures like linked lists, stacks, queues, heaps and arrays are implemented as containers. In Thrust, there are two main containers: thrust::host_vector and thrust::device_vector. A host_vector and a device_vector represent an array of elements in CPU (host) and GPU (device) memory respectively. The major benefit of containers is that they handle memory management for the underlying objects. For example, whenever we create a host_vector, it automatically allocates memory on the CPU. Similarly, the device_vector container handles memory allocation and deallocation on the GPU. Whenever we assign a host_vector to a device_vector, Thrust automatically makes a


vector copy from CPU to GPU memory. So lower level APIs like cudaMalloc, cudaMemcpy, cudaFree etc. are completely hidden from application developers.

Iterators: An iterator is a generalisation of pointers in C and can be thought of as an object in C++ which can point to other objects. Iterators are usually used for traversing over container objects and are similar to C pointers, so we can perform pointer arithmetic on them. There are different types of Thrust iterators like input, output, constant, permutation or transform iterators [17]. For example, the input iterator provides the functionality of accessing the value of a container object, but we cannot change the value of that object. It is possible to write generic algorithms using templates parameterised by iterators.

Algorithms: Thrust implements more than sixty basic algorithms like merge sort, radix sort, inclusive scan, reduce or parallel prefix. These algorithms are implemented as templates so that they can work with all basic data types. With the help of iterators, the algorithmic implementation does not have to worry about the underlying object type or object access methods. Algorithms do not directly access the container data, but use iterators to access the underlying data elements. For example, there is a single implementation of the radix sort for all data types; depending on the data type, an iterator provides a way to access the data elements.

The mechanism of using containers, iterators and algorithms together can be explained with the following simple example:

    /* Thrust headers */
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
        /* allocate storage for one million numbers using a host container */
        thrust::host_vector<float> vec_h(1000000);

        /* generate one million random numbers on the host using iterators */
        thrust::generate(vec_h.begin(), vec_h.end(), rand);

        /* transparent copy of the host vector to a device vector */
        thrust::device_vector<float> vec_d = vec_h;

        /* use of Thrust algorithms: passing iterators as parameters */
        thrust::sort(vec_d.begin(), vec_d.end());

        /* transparent copy from device to host memory */
        thrust::copy(vec_d.begin(), vec_d.end(), vec_h.begin());

        return 0;
    }

Figure 2.3: Simple example to sort one million float elements on GPU using Thrust


    calls cudaMemcpy() to make a host-to-device memory copy. In this example, we are

using the thrust::sort method to invoke the default sorting algorithm (Merrill's radix sort [18]) on the GPU. Finally, we use the thrust::copy method to copy back vector data

    from GPU to CPU memory.

2.3.2 CUSP

CUSP is also an open-source C++ template library [19] developed on top of CUDA, but it specifically targets sparse linear algebra and sparse matrix computations. Similar to Thrust, this library provides a high-level programming interface and internally uses the functionality of Thrust and CUBLAS. CUSP provides the following five sparse matrix storage schemes:

- Compressed Sparse Row (CSR)
- Coordinate (COO)
- ELLPACK (ELL)
- Diagonal (DIA)
- Hybrid (HYB)

    We will discuss these storage formats in detail in Section 4.2. CUSP provides an easy

    interface for building different sparse matrix formats and a transparent conversion

    between these formats. This is explained in the following example:

    In the above example, we create a sparse matrix object in COO format. CUSP provides

    the cusp::gallery interface for generating sample matrices for a Poisson or

    Diffusion problem on a 2-D mesh. When we assign the COO matrix object on the host

    to the ELL matrix object on the device, CUSP automatically allocates memory on

    GPU, performs the COO to ELL conversion, and copies the matrix data from CPU to

    GPU. We have discussed this mechanism in Section 5.2.

/* CUSP headers */
#include <cusp/coo_matrix.h>
#include <cusp/ell_matrix.h>
#include <cusp/gallery/poisson.h>

int main()
{
    /* sparse matrix in COO format on the host */
    cusp::coo_matrix<int, float, cusp::host_memory> coo_mat;

    /* matrix corresponding to a 2-D Poisson problem on a 15x15 mesh */
    cusp::gallery::poisson5pt(coo_mat, 15, 15);

    /* sparse matrix in ELL format on the device */
    cusp::ell_matrix<int, float, cusp::device_memory> ell_mat;

    /* performs memory allocation on the device, conversion from COO to ELL,
       and the copy from host to device */
    ell_mat = coo_mat;

    return 0;
}

    Figure 2.4: Sparse matrix construction and transparent conversion using CUSP


In addition to sparse matrix storage and operations, CUSP provides the following features:

- A file I/O interface for reading and writing large sparse matrices to/from Matrix Market files.
- Krylov subspace solvers like Conjugate Gradient (CG), Multi-mass Conjugate Gradient (CG-M), Biconjugate Gradient (BiCG) and Generalized Minimum Residual (GMRES) on GPUs.
- Preconditioners like Algebraic Multigrid (AMG), Diagonal and Approximate Inverse (AINV).

We have used the file I/O interface for converting matrices stored in Matrix Market format (ASCII) to PETSc binary format. For implementing sparse storage schemes in PETSc, we have used CUSP and Thrust extensively. We have also developed a small benchmark to measure the performance of the CUSP linear solvers on GPUs.
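As an illustration of how little host code such a benchmark needs, the sketch below solves a 2-D Poisson system on the GPU with CUSP's Conjugate Gradient solver. The mesh size, iteration limit and tolerance are illustrative choices, not the exact settings of the benchmark developed in this project.

#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/gallery/poisson.h>
#include <cusp/krylov/cg.h>
#include <cusp/monitor.h>

int main()
{
    /* 5-point Poisson matrix on a 512x512 mesh, stored in CSR on the device */
    cusp::csr_matrix<int, double, cusp::device_memory> A;
    cusp::gallery::poisson5pt(A, 512, 512);

    /* initial guess x = 0 and right hand side b = 1 */
    cusp::array1d<double, cusp::device_memory> x(A.num_rows, 0.0);
    cusp::array1d<double, cusp::device_memory> b(A.num_rows, 1.0);

    /* stop after 1000 iterations or when the relative residual drops below 1e-6 */
    cusp::default_monitor<double> monitor(b, 1000, 1e-6);

    /* run Conjugate Gradient entirely on the GPU */
    cusp::krylov::cg(A, x, b, monitor);

    return 0;
}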

2.4 PDEs: Source of Sparse Matrices

Partial Differential Equations (PDEs) provide a mathematical model for many scientific

    and engineering applications. These equations relate partial derivatives of physical

    quantities like force, velocity, momentum, temperature etc. In fluid dynamics, the

    Navier-Stokes equations [20] are a set of nonlinear PDEs, which can be used to

describe the flow of incompressible fluids as

    \rho \left( \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} \right) = -\nabla P + \mu \nabla^2 \mathbf{u}, \qquad \nabla \cdot \mathbf{u} = 0,

where u is the flow velocity, μ is the viscosity, P is the pressure, ρ is the density of the fluid and ∇ is the vector differential operator. Most commonly, we solve these PDEs by approximating them with equations with a finite number of unknowns. This process of

approximation is called discretisation. There are two commonly used techniques available, Finite Difference Methods (FDM) and Finite Element Methods (FEM), explained in [21].

We will illustrate the process of discretisation by using the common example of a PDE that appears in many engineering areas, i.e. Poisson's equation:

    -\nabla^2 u = -\left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) = f(x, y),

where u is a real-valued function of the two space variables x and y in a domain Ω. Consider a simple problem where we want to find a function u such that

    -\nabla^2 u = 1


in the solution domain Ω and u = 0 on the boundary ∂Ω. To find a numerical approximation of u, we discretise the PDE using finite differences and sub-divide the domain into a grid of points (x_i, y_j) = (ih, jh), where i, j = 0, 1, 2, ..., N+1. In this case, the grid spacing is given by h = 1/(N+1), and we solve for the unknowns u_{i,j} ≈ u(x_i, y_j) at the interior grid points.

On the 2-D grid, we can write the discretised second derivatives (obtained by applying a forward difference followed by a backward difference to the first derivative) as

    \frac{\partial^2 u}{\partial x^2} \approx \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2}, \qquad \frac{\partial^2 u}{\partial y^2} \approx \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{h^2}.

The right hand side of the above expression is called a five-point stencil because every point on the lattice is averaged with its four nearest neighbours, as shown in Figure 2.2. A finite difference approximation to the above equation is then given by

    \frac{4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}}{h^2} = 1.

This results in N^2 linear equations with N^2 unknowns u_{i,j}. The resulting matrix A from the linear system is very large, sparse and with a banded structure. For example, for N = 4 the matrix shown in Figure 2.3 is of order 16x16 and contains only 25% non-zero elements.

    Figure 2.2: 2-D Grid and Five-point stencil
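To make the structure of this linear system concrete, the following sketch assembles the N^2 x N^2 five-point Poisson matrix as (row, column, value) triplets, with the unknowns u_{i,j} numbered row by row. The Triplet type and the poisson5pt name are illustrative helpers only, not part of any library.

#include <vector>

struct Triplet { int row, col; double val; };

/* assemble the N^2 x N^2 five-point Poisson matrix, h = 1/(N+1) */
std::vector<Triplet> poisson5pt(int N)
{
    std::vector<Triplet> A;
    const double h2 = 1.0 / ((N + 1.0) * (N + 1.0));
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            int r = j * N + i;                                   /* row index of u_{i,j} */
            A.push_back({r, r, 4.0 / h2});                       /* centre point         */
            if (i > 0)     A.push_back({r, r - 1, -1.0 / h2});   /* west neighbour       */
            if (i < N - 1) A.push_back({r, r + 1, -1.0 / h2});   /* east neighbour       */
            if (j > 0)     A.push_back({r, r - N, -1.0 / h2});   /* south neighbour      */
            if (j < N - 1) A.push_back({r, r + N, -1.0 / h2});   /* north neighbour      */
        }
    return A;   /* for N = 4 this gives a 16x16 matrix with 64 non-zeros (25%) */
}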


    There are different storage schemes available to store these sparse matrices. Some

    formats like ELL or Diagonal are better suited for GPUs. We will discuss these formats

    in Section 4.2, considering their performance on GPUs.

2.5 Iterative Methods for Sparse Linear Systems

Iterative methods are commonly used for solving large linear systems. These methods try to find the solution of a linear system of equations Ax = b by generating a sequence of improving approximate solutions (here, iterative means the repetitive application of operations to improve the approximate solution). These methods use an initial guess as the first approximate solution and then improve this solution over successive iterations. There are two main classes of iterative methods: Stationary Iterative Methods and Krylov Subspace Methods. The Jacobi, Gauss-Seidel and Successive Over-Relaxation (SOR) methods are examples of stationary methods; they are easy to implement and analyse, but their convergence is not guaranteed for all classes of matrices.

Krylov Subspace Methods are a class of iterative methods which are considered among the most important iterative techniques currently available for solving linear and non-linear systems of equations. These methods are widely adopted because they are efficient and reliable. Examples of Krylov Subspace Methods are Conjugate Gradient, Biconjugate Gradient and GMRES (Generalized Minimal Residual). These methods are based on the Krylov subspace. The m-order Krylov subspace is defined as

    \mathcal{K}_m(A, b) = \mathrm{span}\{ b, Ab, A^2 b, \ldots, A^{m-1} b \},

where A is an n x n matrix and b is a vector of length n.

Figure 2.3: Sparse matrix for 5x5 grid (Poisson problem, 25% non-zero elements)


Research in Krylov subspace techniques has produced various new methods. A detailed explanation of all of them is beyond the scope of this project. We will discuss one Krylov subspace solver, GMRES, that we have used in our performance analysis example.

GMRES Method: GMRES is an iterative method which approximates the solution by the vector in the Krylov subspace with minimal residual [22]. GMRES approximates the solution by minimising the Euclidean norm of the residual Ax - b over the Krylov subspace. This method is designed to solve non-symmetric linear systems. The most popular form of GMRES is based on the Gram-Schmidt orthogonalisation process. The Gram-Schmidt process takes a set of linearly independent vectors S = { s_1, s_2, ..., s_n } in Euclidean space and computes a set of orthogonal vectors Q = { q_1, q_2, ..., q_n } such that { q_1, ..., q_k } spans the same subspace as { s_1, ..., s_k } for each k ≤ n.
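The following sketch shows the classical Gram-Schmidt procedure on plain std::vector data, purely to illustrate the orthogonalisation step described above; production GMRES codes such as PETSc's use the numerically more stable modified Gram-Schmidt within the Arnoldi process.

#include <cstddef>
#include <vector>

static double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

/* classical Gram-Schmidt: orthogonalise s_1..s_n into q_1..q_n */
std::vector<std::vector<double>> gram_schmidt(const std::vector<std::vector<double>>& S)
{
    std::vector<std::vector<double>> Q;
    for (const std::vector<double>& s : S) {
        std::vector<double> q = s;
        for (const std::vector<double>& p : Q) {
            double c = dot(s, p) / dot(p, p);          /* projection of s onto p */
            for (std::size_t i = 0; i < q.size(); ++i)
                q[i] -= c * p[i];                      /* remove that component  */
        }
        Q.push_back(q);   /* q is orthogonal to every previously computed vector */
    }
    return Q;
}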


    Chapter 3

    PETSc GPU Implementation

3.1 PETSc

PETSc is a scalable solver library which has been used in the development of a large

    number of HPC applications [26]. It provides infrastructure for rapid prototyping and

    algorithmic design, which eases the development of scientific applications while

    maintaining the scalability on large numbers of processors. The design of PETSc

    allows transparent use of different linear/non-linear solvers and preconditioners in the

    applications. The programming interface is provided through C, C++, FORTRAN and

    Python.

In this section we will discuss the PETSc design and architecture, which will help in understanding our later implementation of additional sparse matrix storage schemes.

3.1.1 PETSc Kernels

PETSc kernels are the basic sets of services on top of which the scalable solver library is

    built. These kernels are shown in Figure 3.1.

    These kernels have a modular structure and are

    designed to maintain portability across different

    architectures and platforms. For example, instead

    of float or integer data types, PETSc provides

    new data types like PetscInt, PetscScalar or

    PetscMPIInt. These data types are internally

    mapped to corresponding int, float, float64 or

    double data types supported on the underlying

    platform. For our implementation, if we want to

    add new memory management routines, we can

    implement those in the corresponding kernel and

    make them available to applications and other

    kernels. These PETSc kernels are explained

in more detail in [27].

Figure 3.1: PETSc Kernels
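As a small, hedged illustration of these kernel-level services (not taken from the PETSc sources), the snippet below allocates a portable array with the PETSc memory routines and data types rather than raw malloc and double:

#include <petscsys.h>

PetscErrorCode fill_array(void)
{
    PetscErrorCode ierr;
    PetscInt       i, n = 100;
    PetscScalar   *vals;

    /* kernel-level memory management instead of malloc/free */
    ierr = PetscMalloc(n * sizeof(PetscScalar), &vals);CHKERRQ(ierr);
    for (i = 0; i < n; i++) vals[i] = (PetscScalar)i;
    ierr = PetscFree(vals);CHKERRQ(ierr);
    return 0;
}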


3.1.2 PETSc Components

PETSc is developed using object-oriented paradigms and its architecture allows easy

    integration of new features from external developer communities. PETSc consists of

    various sub-components listed below:

- Vectors
- Matrices
- Distributed Arrays
- Preconditioners
- Krylov Subspace Solvers
- Non-linear Solvers
- Index Sets
- Timesteppers

    PETSc allows easy customisation and extensions to these components. For example,

    we can implement a new matrix subclass or preconditioner that can be transparently

used by all KSP solvers without any modifications. The algorithmic implementation is separated from the parallel library layer, which allows code reuse and the easy addition of new solvers, preconditioners and data structures. Figure 3.2 shows the

    organisation of different PETSc libraries and the levels of abstraction at which they

    are exposed.

    Figure 3.2: PETSc Library Organisation [54]
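A minimal sketch of this transparency, assuming a Mat A and Vecs b, x have already been assembled and ierr is declared in the surrounding code (the call sequence matches the PETSc 3.x API of the time; error checking is abbreviated):

KSP ksp;
ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
/* solver and preconditioner are chosen from the option database at run time,
   e.g. -ksp_type gmres -pc_type jacobi, with no change to this code */
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
ierr = KSPDestroy(&ksp);CHKERRQ(ierr);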


    PETSc internally uses a number of libraries like BLAS, ParMetis, MPI and HDF5 to

    provide the infrastructure required for large HPC applications. PETSc provides much

    flexibility for users to choose among different libraries for different classes of

applications, but most of the functionality of the underlying libraries is hidden from application developers by the parallel library layer.

3.1.3 PETSc Object Design

In PETSc, classes like Vector, Matrix and Distributed Array represent data objects.

    These objects define various methods for data manipulation in sequential or parallel

    implementation. The internal representation of an object i.e. the data structure, is not

    exposed to applications and is only available through exposed APIs. This is shown in

    Figure 3.3. For example, the Vector class can be used for representing the right hand

side of a linear system Ax = b or discrete solutions of PDEs, and stores values in a simple array format similar to the C or FORTRAN array convention. This class defines various

    methods for vector operations like the dot product, the vector norm, scaling, scatter or

    gather operations. For parallel applications, PETSc automatically distributes these

    vector elements within the communicators and uses the functionality of the underlying

    MPI library to perform collective or point-to-point MPI operations.

In the parallel implementation, the Matrix or Preconditioner objects do not have access to the internal data structures directly. Instead, they just call APIs exposed through the PETSc interface, and the internal object representation manages communication within an MPI communicator. For example, for parallel vectors, a VecScatter object is created internally to manage data communication across MPI processes. The VecScatterBegin() and VecScatterEnd() routines are used to perform vector scatter operations across the communicator. To access internal Vector data, the application uses subroutines like VecGetArray(). Only Preconditioner (PC) objects are implemented in a data-structure-specific way, so they access and manipulate Vector or Matrix data structures directly.

Figure 3.3: PETSc Objects and Application Level Interface
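For example, a sequential application accesses and fills a Vector only through the exposed API, never through the data structure itself. A minimal sketch, with error checking abbreviated and ierr declared in the surrounding code:

Vec          x;
PetscScalar *a;
PetscInt     i, n;

ierr = VecCreateSeq(PETSC_COMM_SELF, 100, &x);CHKERRQ(ierr);
ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
ierr = VecGetArray(x, &a);          /* borrow direct access to the underlying array */
for (i = 0; i < n; i++) a[i] = (PetscScalar)i;
ierr = VecRestoreArray(x, &a);      /* hand the data back to the Vector object */
ierr = VecDestroy(&x);CHKERRQ(ierr);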


3.2 PETSc GPU Implementation

Recently, GPU support has been added to the PETSc solver library. Currently it is

under development and available in the PETSc development release [11]. The initial implementation allows for transparent use of GPUs without modifying the existing

    application source code. Instead of writing completely new CUDA code, PETSc uses

    the open source CUSP and Thrust libraries discussed in Section 2.3. This helps to keep

    the GPU implementation separate from the existing PETSc code.

    We will discuss this new implementation in more detail, as our development work will

be an extension of this implementation.

3.2.1 Sequential Implementation

The current implementation assumes that every MPI process has access to a single

GPU. A new GPU-specific Vector class called VecCUSP has been implemented. It uses

    CUBLAS, CUSP, as well as Thrust library routines to perform vector operations on a

    GPU. The idea behind using these libraries is to use already developed, fine tuned

    CUDA implementations with PETSc instead of developing new ones. The PETSc

    implementation acts as an interface between PETSc data structures and external CUDA

    libraries, i.e. Thrust and CUSP.

    Whenever we execute a program with GPU support, two copies of any vector are

created, one on the CPU and another on the GPU. In the existing Vec class, a new flag

    is added called valid_GPU_array. This flag has the following four possible values and

    corresponding meaning:

    PETSC_CUSP_UNALLOCATED : Object is not yet allocated on GPU

    PETSC_CUSP_CPU : Object is allocated and a valid copy is available on CPU only

    PETSC_CUSP_GPU : Object is allocated and a valid copy is available on GPU only

    PETSC_CUSP_BOTH : Object is allocated and valid copies are on CPU & GPU (both)

Initially this flag has the value PETSC_CUSP_UNALLOCATED. When an application creates a Vector object, the VecCUSPCopyToGPU() subroutine creates a new vector copy on the GPU and sets the valid_GPU_array flag to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. Now all vector operations can be performed on the GPU. Whenever the VecCUSPCopyToGPU() function gets called, it makes a copy to the GPU only if the vector object has been modified on the CPU, i.e. the value of the valid_GPU_array flag has changed. Memory copies between host and device are managed through the subroutines VecCUDACopyToGPU() and VecCUDACopyFromGPU(). For example, when an application calls VecGetArrayRead() to access vector data, internally it first calls VecCUDACopyFromGPU() to copy recent vector values from the GPU and then sets valid_GPU_array to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. This mechanism can be illustrated by the implementation of the simple vector operation AXPY, i.e. y = alpha*x + y:


For the above vector operation, the VecCUDACopyToGPU() subroutine allocates memory and copies the vector data onto the GPU if the flag value is PETSC_CUSP_UNALLOCATED. If the flag value is PETSC_CUSP_CPU, the memory is already allocated on the GPU but the copy on the CPU has recently been modified, so it performs a CPU-to-GPU vector copy. It then calls the CUBLAS library routines and sets valid_GPU_array to PETSC_CUDA_GPU.

3.2.2 Parallel Implementation

In the parallel implementation, the parallel Vector and Matrix objects are implemented

on top of the sequential implementation. The rows of a matrix are partitioned among

    the processes in a communicator. This is shown in Figure 3.5. In the PETSc

    implementation, a sparse matrix is stored in two parts: the on-diagonal part and the off-

    diagonal part. The on-diagonal portion of the matrix, say Ad, stores values of the

columns associated with the rows owned by that process. These matrix elements are shown in red. All remaining entries, i.e. the off-diagonal portion, are stored in another

    component, say Ao.

VecAXPY()
{
    /* copy vector from CPU to GPU: if modified */
    ierr = VecCUDACopyToGPU(xin);
    /* copy vector from CPU to GPU: if modified */
    ierr = VecCUDACopyToGPU(yin);

    try {
        /* perform AXPY using CuBLAS library routine */
        cusp::blas::axpy(*((Vec_CUDA*)xin->spptr)->GPUarray,
                         *((Vec_CUDA*)yin->spptr)->GPUarray,
                         alpha);
        /* now updated copy is present on GPU */
        yin->valid_GPU_array = PETSC_CUDA_GPU;
        /* wait until all threads finish */
        ierr = WaitForGPU();
    } catch (char *ex) {
        .........
    }
}

    Figure 3.4: VecAXPY implementation in PETSc using CUSP & CuBLAS [11]

    Figure 3.5: Parallel Matrix with on-diagonal and off-diagonal elements

for two MPI processes


The sparse matrix-vector product is calculated in two steps: first, we calculate the product related to the on-diagonal entries of the matrix, i.e. Ad, using the associated entries of the vector x, i.e. xd. Then we calculate the product of the off-diagonal matrix entries and the associated vector entries xo, which gets added to the previous result yd as

    yd = Ad * xd

    yd += Ao * xo

For this operation, updated entries of the vector xo must be communicated within the communicator. As only a few Ao elements are non-zero, we do not have to communicate all xo entries. This communication is managed through the VecScatter object, which handles parallel gather and scatter operations using non-blocking MPI calls. The VecScatter object stores two arrays of indices: one array stores the global indices of the vector elements that will be received as updated entries from other processes in the communicator (these received vector elements are stored in a local array); the second array stores the mapping between the global index of a vector element and its position in the local array.

Communication starts with the VecScatterBegin() call, which copies data into message buffers. For the GPU implementation, the updated vector entries are first copied from GPU memory to the CPU using the VecCUDACopyFromGPU() function. The communication completes after the VecScatterEnd() call, which waits for the completion of the non-blocking MPI calls posted by VecScatterBegin(). This implementation of the parallel matrix-vector operation is shown below:

VecScatterBegin(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
MatMult(Ad, xd, yd);
VecScatterEnd(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
MatMultAdd(hatAo, hatxo, yd, yd);

Figure 3.6: Parallel Matrix-Vector multiplication in PETSc GPU implementation [11]

More information about this implementation can be found in [11] [28].

3.3 Applications running with PETSc GPU support

PETSc allows transparent use of GPUs without any changes in the application source code. Most existing PETSc applications can run on GPUs. New Vector and Matrix classes, i.e. VecCUSP and MatCUSP, have been added to PETSc which perform all matrix-vector operations on GPUs. To run any existing application on the GPU, the user has to


set the Vector type to VECCUSP and the Matrix type to MATCUSP using the VecSetType() and MatSetType() routines respectively. The user can also set these Vector and Matrix types using the option database keys vec_type seqcusp and mat_type seqaijcusp.
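A sketch of the two alternatives (the exact type macros follow the seqcusp/seqaijcusp keys mentioned above; only the single-process case is shown and error checking is abbreviated):

/* in the application source ... */
ierr = VecSetType(x, VECSEQCUSP);CHKERRQ(ierr);       /* vector data lives on the GPU    */
ierr = MatSetType(A, MATSEQAIJCUSP);CHKERRQ(ierr);    /* matrix operations run on the GPU */

/* ... or, without touching the source, from the option database:
       ./app -vec_type seqcusp -mat_type seqaijcusp                  */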

    All of the Krylov Subspace methods except KSPIBCGS (Improved Stabilized version

    of BiConjugate Gradient Squared) are supported on the GPU. Currently, Jacobi, AMG

    (Algebraic Multigrid) and AINV (Approximate Inverse) preconditioners are supported

    on the GPUs.


    Chapter 4

    Sparse Matrices

    As discussed in Section 2.4, the discretisation of PDEs results in large sparse matrices.

A matrix with only a few non-zero elements can be considered sparse. In a practical sense, a matrix can be considered sparse if specialised techniques can be used to take

    advantage of the sparsity and the sparsity pattern of the matrix. Depending on the

    sparsity pattern, we can divide matrices into two broad categories: structured and

    unstructured. A matrix with non-zero elements in a specific regular pattern is called a

structured sparse matrix. For example, the non-zero elements may all lie along a few diagonals of the matrix or in small dense sub-blocks, which results in regular patterns. The application of FDM or linear FEM on rectangular grids results in structured sparse matrices. On the other hand, for irregular meshes this results in

    unstructured sparse matrices with no specific structure or pattern of non-zero elements.

Figure 4.1 and Figure 4.2 show examples of structured and unstructured matrices respectively.

    Depending on the sparsity pattern, different storage schemes or data structures can be

    used. Importantly, the performance of the matrix operations depends on these storage

    schemes and processor architecture. This becomes more apparent for vector processors

    and GPUs. In this section we will discuss sparse matrix representation and different

    storage schemes with their storage efficiency and performance.

Figure 4.1: Example of a structured matrix from a structured problem

Figure 4.2: Example of an unstructured matrix from a bipartite graph


4.1 Sparse Matrix Representation

The structure of sparse matrices can be ideally represented by adjacency graphs. Graph theory techniques have been used effectively for parallelising various iterative methods and implementing preconditioners [29]. A graph G = (V, E) is defined by a set of vertices V = { v_1, v_2, ..., v_n } and a set of edges E, where each edge (v_i, v_j) connects two elements of V. In the 2-D plane, the graph G is represented by a set of points which are connected by edges between these points. In the case of the adjacency graph of a sparse matrix, the n vertices in V represent the n unknown variables, and the edges in E represent a binary relation between those vertices: there is an edge from node i to node j when the matrix element a_ij ≠ 0. An adjacency graph can be directed or undirected depending on the symmetry of the non-zeros. When a sparse matrix has a symmetric non-zero pattern (i.e. a_ij ≠ 0 implies a_ji ≠ 0), the adjacency graph is undirected; otherwise it is directed.

This adjacency graph representation can be used for parallelisation. In the case of parallelising Gaussian elimination, at a given stage of the elimination we can find unknowns which are independent of each other from the above binary relation. For

    example, in the case of a diagonal matrix all unknowns are independent of each other,

    which is not true for dense matrices. More information about sparse matrix

    representation and parallelisation strategies can be found in [29].
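A short sketch of this construction for a dense n x n array a (purely illustrative): the directed edge set E is simply the list of index pairs with a non-zero entry off the diagonal.

#include <utility>
#include <vector>

/* build the directed adjacency graph edges of an n x n matrix a */
std::vector<std::pair<int, int>> adjacency_edges(const std::vector<std::vector<double>>& a)
{
    std::vector<std::pair<int, int>> E;
    const int n = (int)a.size();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j && a[i][j] != 0.0)
                E.push_back(std::make_pair(i, j));   /* edge i -> j because a_ij != 0 */
    return E;
}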

4.2 Sparse Matrix Storage Schemes

There are two main reasons for different sparse matrix storage formats: memory

    requirements and computational efficiency. It may not be feasible to store a large sparse

    matrix in main memory. Importantly, it is not necessary to store zero matrix elements.

    Various storage schemes (i.e. data structures) have been proposed to effectively utilise

sparsity and sparsity patterns of the matrices. There is no single 'best' storage scheme for all sparse matrices; but a few are suitable for matrices with structured sparsity

    patterns, some are general purpose and others are storage schemes for matrices with

    arbitrary nonzero patterns. Each storage scheme has different storage costs,

    computational costs and performance characteristics. In this section we will discuss

    various storage schemes and their performance on GPUs.

Figure 4.3: Sparse matrix representation with a directed adjacency graph


4.2.1 Coordinate List

The coordinate list (COO) is a simple and the most flexible storage format, where we store every non-zero element of a matrix with three vectors: data, row and indices. The data vector stores the non-zero elements of the matrix in row-major order. The row and indices vectors explicitly store the associated row and column index of every element in the data vector. This is explained in the following Figure:

    Figure 4.4: Sparse matrix and corresponding COO storage representation
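A sequential sketch of the sparse matrix-vector product y = Ax computed directly from the three COO arrays (not CUSP's implementation; nnz is the number of stored non-zeros and y is assumed to be zero-initialised):

/* COO SpMV: one multiply-add per stored non-zero */
for (int n = 0; n < nnz; ++n)
    y[row[n]] += data[n] * x[indices[n]];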

This is a general purpose and robust storage scheme, which can be used for matrices with arbitrary sparsity patterns. The above example shows that the storage cost of the COO format is proportional to the number of non-zero elements: for an MxN sparse matrix with k non-zero elements, it requires k(s_v + 2 s_i) bytes, where s_v is the size of a matrix value and s_i the size of an index.

4.2.2 Compressed Sparse Row

Compressed Sparse Row (CSR) is a popular and the most general purpose storage

    format. This can be used for storing matrices with arbitrary sparsity patterns as it makes

    no assumptions about the structure of the nonzero elements. Like COO, this format also

stores only non-zero elements. These elements are stored using three vectors: data, indices and row_ptr. The data and indices vectors are the same as for the COO format. For an MxN sparse matrix, the row_ptr vector has length M+1 and stores the index at which each row of the matrix starts in the data vector. The last entry of row_ptr corresponds to the number of non-zero elements in the matrix. This storage scheme is explained in the

    Figure below:

    Figure 4.4: Sparse matrix and corresponding CSR storage representation

The example matrices and their storage arrays from the two figures above are:

COO example (first matrix):

    4 0 1 0 0
    0 5 0 2 0
    0 0 6 0 3
    9 0 0 7 0
    0 0 0 0 8

    data    = [4 1 5 2 6 3 9 7 8]
    row     = [0 0 1 1 2 2 3 3 4]
    indices = [0 2 1 3 2 4 0 3 4]

CSR example (second matrix):

    4 0 0 1 0
    0 5 0 2 0
    0 0 0 0 3
    7 0 0 6 0
    0 8 0 0 9

    data    = [4 1 5 2 3 7 6 8 9]
    indices = [0 3 1 3 4 0 3 1 4]
    row_ptr = [0 2 4 5 7 9]
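For comparison, a sequential sketch of y = Ax from the CSR arrays; row_ptr delimits the slice of data and indices belonging to each row, so each output element is produced independently:

/* CSR SpMV: one independent dot product per matrix row */
for (int i = 0; i < M; ++i) {
    double sum = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        sum += data[k] * x[indices[k]];
    y[i] = sum;
}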


There are some advantages of using CSR over COO. The CSR format takes less storage than COO due to the compression of the row indices explained in the above Figure. Also, with the row_ptr vector we can easily compute the number of non-zero elements in row i as row_ptr[i+1] - row_ptr[i]. In parallel algorithms, the row_ptr values allow fast row slicing operations and fast access to matrix elements using pointer indirection. This is a commonly used sparse matrix storage scheme on CPUs. For an MxN sparse matrix with k non-zero elements, it requires k(s_v + s_i) + (M+1) s_i bytes.

4.2.3 Diagonal

The application of stencils to regular grids results in banded sparse matrices, where non-zero elements are restricted to a few sub-diagonals of the matrix. For these matrices, the

diagonal (DIA) format can be used effectively. The DIA format uses only two vectors, data and offsets. The data vector stores the non-zero elements of the sub-diagonals of the matrix. The offsets vector stores the offset of every sub-diagonal from the main diagonal of the matrix. By convention, the main diagonal has offset 0, the diagonals below the main diagonal have negative offsets and those above the main diagonal have positive offsets. This is illustrated with an example in the Figure below:

Figure 4.5: Sparse matrix and corresponding DIA storage scheme

Unlike CSR and COO, this storage format stores a few zero (padding) elements explicitly. As we can see in the above Figure, the diagonal with offset -3 has only two non-zero elements, but to store it in diagonal format the remaining elements of this diagonal are padded with an arbitrary value (shown as * in the Figure). So there is some extra storage overhead associated with it, but there are greater storage benefits due to the fact that we do not have to store column or row indices explicitly. Usually, the data vector stores the non-zero elements in column-major order, which ensures memory coalescing on GPU devices. We will discuss this in more detail in the performance analysis in Section 4.3. For an MxN matrix with d sub-diagonals having at least one non-zero element, it requires M d s_v + d s_i bytes. This is not a general purpose storage scheme like CSR and COO. It is very sensitive to

    the sparsity pattern and is useful for matrices with an ordered banded structure. For

example, consider the matrix in Figure 4.6. This matrix has a banded structure, but it is not suitable for the DIA storage scheme: the non-zero structure is in exactly the opposite order of what is ideally suited to the DIA format.

The example from Figure 4.5 is:

    3 0 8 0  0
    0 4 0 9  0
    0 0 5 0 10
    1 0 0 6  0
    0 2 0 0  7

    data (one row per matrix row, one column per stored diagonal):
        * 3  8
        * 4  9
        * 5 10
        1 6  *
        2 7  *

    offsets = [-3 0 2]


When we store such a matrix in the diagonal storage format, we end up storing all of its sub-diagonals, each containing a single non-zero element and four padding elements.
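A sequential sketch of y = Ax from the DIA arrays, assuming the data array is stored column-major as an M x num_diags block (one column per stored diagonal); padding entries are skipped because their column index falls outside the matrix:

/* DIA SpMV: walk the stored diagonals row by row */
for (int i = 0; i < M; ++i) {
    double sum = 0.0;
    for (int d = 0; d < num_diags; ++d) {
        int j = i + offsets[d];                  /* column touched by diagonal d in row i */
        if (j >= 0 && j < N)
            sum += data[i + d * M] * x[j];       /* column-major access into data         */
    }
    y[i] = sum;
}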

4.2.4 ELL or Padded ITPACK

Like DIA, the ELL format is also suitable for vector architectures. This format can be used for storing sparse matrices arising from semi-structured meshes where the number of non-zero elements per row is nearly the same. For an MxN sparse matrix with a maximum of k non-zeros per row, we store the matrix in an M x k dense data array. If a particular row has fewer than k non-zeros, that row is padded with zeros. The indices array stores the column index of every element in the data array. These elements are stored in column-major order. Figure 4.6 illustrates an example of the ELL storage scheme:

Figure 4.6: Sparse matrix and corresponding ELL storage scheme

Compared to DIA, ELL is a more general storage format and it is not necessary to have a banded structure of non-zero elements. But the number of non-zeros must be nearly the same across all rows of the matrix, otherwise we end up padding with large numbers of zero elements. For an MxN sparse matrix with a maximum of NNZ_PER_ROW non-zeros per row, it requires M * NNZ_PER_ROW * (s_v + s_i) bytes of storage.
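A sequential sketch of y = Ax from the ELL arrays, with data and indices stored column-major as M x NNZ_PER_ROW blocks; padded entries hold the value zero, so they add nothing to the sum:

/* ELL SpMV: every row reads exactly NNZ_PER_ROW (value, column) pairs */
for (int i = 0; i < M; ++i) {
    double sum = 0.0;
    for (int k = 0; k < NNZ_PER_ROW; ++k)
        sum += data[i + k * M] * x[indices[i + k * M]];   /* zero padding is harmless */
    y[i] = sum;
}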

4.2.5 Hybrid

Although the ELL format is well suited to vector architectures, sparse matrices arising from complex geometries most of the time do not have the same number of non-zeros per row [12]. As the number of non-zero elements per row starts to vary to a larger extent, we

    end up storing a large number of padding elements. Consider the example of the sparse

    matrix shown in Figure 4.7. In this case, except for the first row all other rows have

    Figure 4.5: Banded nonzero pattern which is not suitable for DIA format
