
    Performance of PETSc GPU Implementation with Sparse

    Matrix Storage Schemes

    Pramod Kumbhar

    August 19, 2011

    MSc in High Performance Computing

    The University of Edinburgh

    Year of Presentation: 2011


    Abstract

PETSc is a scalable solver library developed at Argonne National Laboratory (ANL). It is widely used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). GPU support has recently been added to PETSc to exploit the performance of GPUs. This support is quite new and currently only available in the PETSc development release. The goal of this MSc project is to evaluate the performance of the current GPU implementation, especially of the iterative solvers, on the HECToR GPU cluster. In the current implementation, a new sub-class of matrix was added which stores the matrix in Compressed Sparse Row (CSR) format. We have extended the current PETSc GPU implementation to improve performance using different sparse matrix storage schemes such as ELL, Diagonal and Hybrid.

For structured matrices, the current GPU implementation shows a 4x speedup compared to an Intel Xeon quad-core CPU. For multi-GPU applications, the speedup starts decreasing due to high communication costs on the HECToR GPU cluster. Our implementation with the new storage schemes shows a 50% performance improvement on sparse matrix-vector operations. For structured matrices, the new implementation shows a 7x speedup and significantly improves the performance of vector operations on the GPU.


    Contents

Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Related Work
1.4 Contributions and Outline
1.5 Change in Project Plan
Chapter 2 Background
2.1 GPGPU
2.2 GPU Programming Models
2.2.1 CUDA
2.3 CUSP and Thrust
2.3.1 Thrust
2.3.2 CUSP
2.4 PDEs: Source of Sparse Matrices
2.5 Iterative Methods for Sparse Linear Systems
Chapter 3 PETSc GPU Implementation
3.1 PETSc
3.1.1 PETSc Kernels
3.1.2 PETSc Components
3.1.3 PETSc Object Design
3.2 PETSc GPU Implementation
3.2.1 Sequential Implementation
3.2.2 Parallel Implementation
3.3 Applications running with PETSc GPU support
Chapter 4 Sparse Matrices


4.1 Sparse Matrix Representation
4.2 Sparse Matrix Storage Schemes
4.2.1 Coordinate List
4.2.2 Compressed Sparse Row
4.2.3 Diagonal
4.2.4 ELL or Padded ITPACK
4.2.5 Hybrid
4.2.6 Jagged Diagonal Storage (JDS)
4.2.7 Skyline or Variable Band
4.3 Performance of Storage Schemes
Chapter 5 Implementation of Sparse Storage Support in PETSc
5.1 Design Approach
5.2 Implementation Details
5.2.1 New Matrix types for GPU
5.2.2 PETSc Mat Object
5.2.3 New User level API
5.2.4 PETSc Mat objects on GPU
5.2.5 Conversion of PETSc MatAIJ to CUSP CSR
5.2.6 Conversion of PETSc MatAIJ to CUSP DIA/ELL/HYB/COO
5.2.7 Matrix-Vector multiplication for different sparse formats
5.2.8 Other Important notes
5.3 Sample Use Case and Validation
Chapter 6 Wrapper Codes and Benchmarks
6.1 Testing Codes
6.2 Matrix Market to PETSc binary format
6.3 Benchmarking codes
6.4 Benchmarking Approach
Chapter 7 Performance Analysis
7.1 Benchmarking System
7.2 Single GPU Performance
7.2.1 Structured Matrices
7.2.2 Semi-Structured Matrices
7.2.3 Unstructured Matrices


7.3 Multi-GPU Performance
7.4 Comparing multi-GPU performance with HECToR
7.5 CUSP Matrix Conversion Cost
Chapter 8 Discussion
8.1 Challenges for multi-GPU parallelisation
8.1.1 CPU-GPU and GPU-GDRAM Memory transfer
8.1.2 GPU-GPU Communication
8.2 Future Work
Chapter 9 Conclusion
Bibliography


List of Figures

Figure 1.1: Main stages involved in single iterations of Fluidity framework [7]
Figure 1.2: Profiling results of Burgers' equation 1-D model problem with mesh spacing of 0.002 and domain [-10, 10]
Figure 3.1: PETSc Kernels (implementation: petsc/src/sys)
Figure 3.2: PETSc Library Organisation [28]
Figure 3.4: PETSc Objects and Application Level Interface
Figure 3.5: VecAXPY implementation in PETSc using CUSP & CuBLAS [11]
Figure 3.6: Parallel Matrix with on-diagonal and off-diagonal elements for two MPI processes
Figure 3.7: Parallel Matrix-Vector multiplication in PETSc GPU implementation [11]
Figure 4.1: MxN: 4,690,002x4,690,002 NNZ: 20,316,253 id: 1398
Figure 4.2: MxN: 1,391,349x1,391,349 NNZ: 64,531,701 id: 2541
Figure 4.3: MxN: 3,542,400x3,542,400 NNZ: 96,845,792 id: 1902
Figure 4.4: MxN: 4,802,000x4,802,000 NNZ: 85,362,744 id: 2496
Figure 4.5: MxN: 16,614x16,614 NNZ: 1,096,948 id: 409
Figure 4.6: MxN: 999,999x999,999 NNZ: 4,995,991 id: 1883
Figure 4.7: MxN: 1,489,752x1,489,752 NNZ: 10,319,760 id: 2267
Figure 4.8: MxN: 1,971,281x1,971,281 NNZ: 5,533,214 id: 374
Figure 4.9: MxN: 1,61,070x1,61,070 NNZ: 8,185,136 id: 2336
Figure 4.10: MxN: 1, 20,216x1, 20,216 NNZ: 3,121,160 id: 2228
Figure 5.1: PETSc Objects creation in current GPU implementation
Figure 5.2: PETSc Object creation and new sparse matrix support in new GPU implementation
Figure 5.3: New User level API registration with Mat class (petsc-src/mat/interface/matreg.c)
Figure 5.4: Modified Mat_SeqAIJCUSP class with ELL, DIA, HYB and COO storage support using the CUSP library
Figure 5.5: Converting PETSc AIJ Matrix to CUSP CSR matrix
Figure 5.6: Transparent conversion between different sparse formats with CUSP
Figure 5.7: Converting PETSc MatAIJ to CUSP ELL matrix (algorithmic details)
Figure 5.8: Converting PETSc MatAIJ to CUSP ELL format (Algorithmic Implementation)


Figure 5.9: Sparse Matrix-Vector operation support for different matrix formats using CUSP
Figure 5.10: Simple example of KSP with the use of new sparse matrix format
Figure 5.11: Convergence of KSP CG solver with different sparse matrix formats on CPU & GPUs for simple example of 2-D Laplacian from PETSc
Figure 6.1: Converting Matrix Market format to PETSc binary format (Algorithmic Implementation)
Figure 7.1: HECToR GPGPU Testbed System consisting of NVIDIA and AMD GPUs connected by Infiniband Network
Figure 7.2: Tesla C2050/C2070 Specification [58]
Figure 7.3: Total execution time with different sparse matrix formats on GPU (using GMRES method)
Figure 7.4: Performance with different sparse matrix formats on GPU (using GMRES method)
Figure 7.5: Execution time of SpMV with different sparse matrix formats
Figure 7.6: Execution time of SpMV+VecMDot+VecMAXPY with different sparse matrix formats on GPU
Figure 7.7: Performance on CPU with CSR, GPU with CSR and GPU with DIA
Figure 7.8: Achieved speedup compared to Intel Xeon quad-core
Figure 7.9: Execution time of different sparse matrix formats for semi-structured matrix on GPU
Figure 7.10: Performance for Semi-Structured Matrices
Figure 7.11: Sparse Matrix-Vector execution time for different sparse matrix formats
Figure 7.12: Unstructured matrix of size 503712 x 503712 with 36,816,170 non zero elements
Figure 7.13: Total execution time on GPU with CSR and HYB format
Figure 7.14: Performance of CSR and HYB on the GPU
Figure 7.15: Execution time of SpMV on CPU (CSR), GPU (CSR) and GPU (HYB)
Figure 7.16: Performance on HECToR GPU cluster with CSR and DIA matrix format
Figure 7.17: Performance with CSR and DIA matrix formats with different number of GPUs
Figure 7.18: Execution time for SpMV using CSR and DIA matrix formats on HECToR GPU
Figure 7.19: Performance comparison between HECToR GPU system
Figure 8.1: Overall system architecture considering bandwidth of different sub-systems (Pre Sandy-Bridge Architecture)


Figure 8.2: HECToR GPU: Infiniband Network with Switched fibre topology (schematic layout)
Figure 8.3: Speedup using the default Block Jacobi Preconditioner on CPU and GPU with CSR, ELL
Figure 8.4: Diagonal matrix with few independent nonzero numbers
Figure 8.5: User Implemented SpMV in PETSc Using MatShell (Design)


    Acknowledgements

    I am very grateful to Dr. Michele Weiland and Dr. Chris Maynard for their advice

    and supervision during this dissertation. I would also like to thank Dr. Lawrence

    Mitchell for providing valuable advice during project discussions. I am also indebted

    to my friends and family for their continued support during my study.


    Chapter 1

    Introduction

1.1 Background

PETSc (Portable, Extensible Toolkit for Scientific Computation) is an open source, scalable solver library developed over the past twenty years at Argonne National Laboratory (ANL). It is used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). Developing parallel, nontrivial PDE solvers for high-end computing systems that scale over thousands of processors is still a difficult and time-consuming task. PETSc is designed to ease this task and reduce the development time. It provides parallel algorithms, debugging support and a low-overhead profiling interface that help in the development of large and complex applications. PETSc has been used to solve large linear systems with 500 billion unknowns on supercomputers like Jaguar and Jugene with more than 200K processors [1]. It is used in the modelling of many scientific applications in the areas of geosciences, computational fluid dynamics, weather modelling, seismology, surface water flow, polymer injection modelling etc. We will discuss one such application of PETSc, called Fluidity, that we have analysed.
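As a minimal illustrative sketch of this solver interface (not taken from any particular application, and with error checking omitted), a small program can assemble a 1-D Laplacian in PETSc's AIJ (CSR) format and hand the system Ax = b to a Krylov solver; note that the exact KSPSetOperators signature varies slightly between PETSc releases.

    /* Illustrative PETSc sketch: solve a small tridiagonal system A x = b
       with a run-time-selectable Krylov solver. Error checking omitted. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
        PetscInitialize(&argc, &argv, NULL, NULL);

        const PetscInt n = 100;
        Mat A; Vec x, b; KSP ksp;

        /* Assemble a 1-D Laplacian (tridiagonal) matrix in AIJ (CSR) format */
        MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 3, NULL, &A);
        for (PetscInt i = 0; i < n; i++) {
            PetscScalar v = 2.0;
            MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);
            v = -1.0;
            if (i > 0)     { PetscInt j = i - 1; MatSetValues(A, 1, &i, 1, &j, &v, INSERT_VALUES); }
            if (i < n - 1) { PetscInt j = i + 1; MatSetValues(A, 1, &i, 1, &j, &v, INSERT_VALUES); }
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        VecCreateSeq(PETSC_COMM_SELF, n, &b);
        VecDuplicate(b, &x);
        VecSet(b, 1.0);

        /* Krylov solver: type and preconditioner are chosen at run time,
           e.g. -ksp_type gmres -pc_type jacobi */
        KSPCreate(PETSC_COMM_SELF, &ksp);
        KSPSetOperators(ksp, A, A);   /* older releases take an extra MatStructure flag */
        KSPSetFromOptions(ksp);
        KSPSolve(ksp, b, x);

        KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&b); VecDestroy(&x);
        PetscFinalize();
        return 0;
    }

The same run-time options mechanism is what allows an application to switch between solvers and preconditioners without recompilation.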

    Fluidity is an open source, general purpose computational fluid dynamics framework

    [2] developed by the Applied Modelling and Computational Group (AMCG) at

    Imperial College London. This framework is used in many scientific simulations in

    the areas of fluid dynamics, ocean modelling, atmospheric modelling etc. It solves the

    Navier-Stokes equations [3] on arbitrary unstructured, adaptive meshes using finite

element methods. While solving this system, we impose a grid on the problem domain to calculate the numerical solution of the partial differential equations (PDEs). The accuracy and computational cost of the solution depend on the grid spacing. To compute accurate solutions, one has to use a finer grid, but this increases the computational cost. Hence the Adaptive Mesh Refinement (AMR) technique is used to reduce the computational costs. AMR uses a coarse grid at the start of the simulation and, as the solution progresses, it identifies areas of interest (i.e. parts of the grid which exhibit a large change in the solution) where the grid needs to be refined.

    These methods are discussed in more detail in [4], [5], [6].

    Figure 1.1 shows the main stages involved in simulations using the Fluidity

    framework:


    During simulation, a new mesh may need to be generated by using AMR techniques

    to maintain the accuracy of the solution. During the assembly stage, a system of

    simultaneous equations is assembled using the finite element mesh. In the solver

    stage, the system of equations assembled in the assembly stage is solved using

iterative methods. Fluidity uses iterative solvers from PETSc to solve the large sparse systems. The sparse matrices that arise are positive definite and non-symmetric, hence the Generalised Minimum Residual (GMRES) algorithm is normally used [7]. The update stage involves updating the solution variables, calculating new time steps and estimating the error. Finally, the current solution can be written to disk. All these stages are discussed in more detail in [7] and [8]. Currently, Fluidity uses various

    libraries like MPI (Message Passing Interface), ParMETIS and PETSc to support

    parallelisation [9].

1.2 Motivation

Despite various parallelisation and optimisation techniques, simulations of complex phenomena like tidal modelling or tsunami simulation take from hours to a few days on modern supercomputers. For Fluidity, the main computationally expensive stages are the assembly phase and the solver phase. The initial idea of the project was to improve the performance of the Fluidity framework using directive-based GPU programming models like HMPP (Hybrid Multicore Parallel Programming) and PGI Accelerators. For initial profiling and performance optimisation, we decided to use a 1-D non-linear model problem of Burgers' equation [10]. The Burgers equation is a fundamental PDE which occurs in various applications of fluid dynamics and takes the form

∂u/∂t + u ∂u/∂x = ν ∂²u/∂x²

where u is the velocity and ν is the viscosity coefficient. This is basically the 1-D Navier-Stokes equation (discussed in Section 2.4) without the pressure and volume force terms. Figure 1.2 shows the profiling result of this problem on a single Intel Xeon processor.

[Figure 1.1: Main stages involved in single iterations of Fluidity framework [7] — a repeated loop: Mesh Generation → Assembly Phase → Solver Phase → Update Solution → Output Solution]


The profiling results show that most of the execution time (84%) is spent in the PETSc solver library. For this model problem, when we increase the resolution (i.e. use a finer mesh spacing), the assembly phase time increases proportionally and the solver time increases exponentially. As this is a simple 1-D example, the assembly time is relatively small; but the condition number of the matrix is very high and hence the solver takes a long time to converge. Hence, we decided to deviate from our original plan and optimise the solver phase of Fluidity, which is ultimately the PETSc solvers.

    Graphics Processing Units (GPUs) are becoming more popular due to their

    performance to cost ratio and potential performance gains compared to CPUs. To

    improve the performance of the solver phase, we decided to use the newly implemented

    GPU support in PETSc. Also we identified potential performance improvements in

    PETSc using different sparse matrix storage schemes, which are more suitable for

    GPUs.

1.3 Related Work

In the last year, basic GPU support has been added to PETSc, which is currently available in the PETSc development release [11]. To the best of our knowledge, there are no published benchmarking results for this GPU implementation. Also, the current implementation only supports the CSR (Compressed Sparse Row) matrix storage format and there is no development effort to support other matrix storage schemes. Nathan Bell and Michael Garland from NVIDIA Research have published performance results [12] of sparse matrix-vector operations on GPUs using different sparse matrix storage schemes. These results show that storage schemes like DIA (Diagonal), ELL (ELLPACK) and HYB (Hybrid) are well suited for GPUs. Compared to the CSR storage scheme, the DIA and ELL formats can achieve a 4-6x speedup.

Figure 1.2: Profiling results of Burgers' equation 1-D model problem with mesh spacing of 0.002 and domain [-10, 10]


There are two main goals of this MSc project. The first goal is to improve the performance of the current PETSc GPU implementation using the DIA, ELL and HYB sparse matrix storage formats. The second goal is to evaluate the performance of the PETSc GPU implementation on the HECToR GPU cluster. Specifically, we are looking at the performance of Krylov subspace solvers for solving large sparse linear systems.

1.4 Contributions and Outline

The contributions of this project report are as follows:

• To discuss the performance of the CUSP and Thrust libraries with large sparse matrices from real world applications;

• To present an initial implementation to support different sparse matrix storage schemes in PETSc using CUSP and Thrust;

• To evaluate the performance of the PETSc GPU implementation for solving large sparse linear systems;

• To evaluate the performance benefits of the new implementation on single and multi-GPU applications;

• To compare the overall performance of the PETSc GPU implementation on the HECToR GPU cluster and the HECToR (Phase 2b) system.

Chapter 2 presents background information, which includes GPGPU, the CUDA programming model, the CUSP and Thrust libraries, PDEs and iterative methods for sparse linear algebra. Chapter 3 discusses the design of the PETSc library and the implementation of GPU support in PETSc. Chapter 4 presents different sparse matrix storage schemes suitable for vector processors which are available in the CUSP library. We have also evaluated the performance of CUSP with large sparse matrices from real world applications. Chapter 5 presents our initial implementation to support sparse matrix storage schemes in PETSc using the CUSP library. Chapter 6 discusses the wrapper codes developed for matrix conversion, performance analysis and benchmarking. Chapter 7 presents performance results of the PETSc GPU implementation with different sparse matrix formats on the HECToR GPU cluster as well as the main HECToR (Phase 2b) system. Chapter 8 discusses the challenges of multi-GPU parallelisation encountered during performance analysis and outlines the future work in this area. Chapter 9 presents the conclusion of this project and summarises the results.

1.5 Change in Project Plan

During the project preparation phase (Semester II), the idea of the MSc project was to extend the HMPP programming model for the C++ language. Specifically, our aim was to implement a generic meta-programming framework for HMPP using C++ templates. HMPP is now an open standard developed by CAPS Enterprise and PathScale Inc. HMPP provides a directive-based GPU programming model similar to PGI Accelerators. For this MSc project, an external organisation was expected to provide the ENZO compiler suite with HMPP C++ support by May 2011. However, this compiler was not available until the first week of June 2011 due to the complexity of the C++ compiler implementation, so we decided to change our project plan. With the great support of Dr. Michele Weiland and Dr. Chris Maynard we were quickly able to work out an alternative project plan and


decided to work on the Fluidity project, especially the PETSc GPU implementation. This change affected the planned schedule of the project, but with the continuous advice and support of my supervisors, I was able to complete this project successfully.


    Chapter 2

    Background

2.1 GPGPU

GPUs (Graphics Processing Units) have a distinct architecture specifically designed for high floating-point throughput and fine-grained concurrency. In the past, GPUs were mostly used for improving the performance of graphics operations like pixel shading, texture mapping and rendering. But in the last few years, GPUs have been effectively used to speed up the performance of non-graphics applications from different areas of science like computational fluid dynamics, molecular dynamics, medical imaging, climate modelling etc. The term GPGPU (General Purpose Computing on GPUs) is normally used to refer to the use of GPUs for accelerating non-graphics applications traditionally executed on CPUs.

The primary reason for the popularity of GPUs in the area of scientific computing is their performance to cost ratio. For example, the NVIDIA Tesla C2070 GPU has 448 cores capable of achieving a theoretical peak performance of 515 GFlops, which is 50 times more than the Intel Xeon (E5620) quad-core processor. If we compare prices, however, the Tesla GPU is only five times more expensive than the Intel Xeon processor. Various applications ported to GPUs show significant performance benefits (10-50x speedup) compared to CPUs. More details about these applications can be found in [13].

2.2 GPU Programming Models

Programming models like ARB, OpenGL, Direct3D and Cg were commonly used for the development of graphics applications, but these programming models do not fit well for the development of HPC applications. Research in GPU technology helped to understand the use of GPUs in general purpose computing. Various programming models like CUDA, OpenCL, AMD Stream and PGI directives are available for programming these special purpose devices, and CUDA is the most popular programming model among them.

2.2.1 CUDA

NVIDIA introduced the CUDA programming model, which enables a large developer community to exploit GPUs for general purpose computing. The programming interfaces are exposed through the C, C++ and Fortran languages, and through third party wrappers for other languages like Python, Ruby etc. A CUDA application consists of code that runs on the CPU as well as on the GPU. The compute-intensive functions of the program, which execute on the GPU, are called kernels. The nvcc compiler translates this kernel code to PTX assembly code, which is executed on the GPU. More details about the CUDA programming model can be found in [14].
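As a minimal illustrative sketch of this host/kernel split (a standard vector addition, not specific to PETSc or Fluidity), a kernel and its launch might look as follows.

    // Illustrative CUDA sketch: element-wise vector addition.
    // The kernel runs on the GPU; the host code allocates device memory,
    // copies data, launches the kernel and copies the result back.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 256 threads per block; enough blocks to cover all n elements
        int threads = 256, blocks = (n + threads - 1) / threads;
        vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        delete[] h_a; delete[] h_b; delete[] h_c;
        return 0;
    }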

CUDA Architecture: We will discuss the CUDA architecture considering a Tesla 10-series device. A Tesla C1060 GPU device consists of 30 multithreaded streaming multiprocessors (SMs) and each SM consists of 8 streaming processors (SPs), two special function units, on-chip shared memory and an instruction unit. Figure 2.1 shows the organisation of SMs, SPs, registers and shared memory in a Tesla device. The SM creates, manages and schedules a group of 32 threads in a batch called a warp. A single SM has hardware resources that can hold the state of three warps at a time [15]. For the C1060 device there can be 23,040 threads (30 SM * 8 SP * 32 threads * 3 warps) available for execution. Out of these, only 960 threads (30 SM * 32 threads) can be executed concurrently at a given time. All threads within a warp execute in SIMT (Single Instruction Multiple Threads) fashion.

Figure 2.1: CUDA Architecture and Memory Hierarchy (adapted from [15] and [56])


    There are different memory types: register, shared, local, global and caches. Each of

    these types has different sizes, latencies, bandwidths and performance characteristics.

    Each SM has on-chip registers and shared memory. These memories are small in size

and have very low latency. Local and global memory are the largest in size and have very high latency. Data access from global or local memory is very costly and requires 400-500 cycles. Texture and constant memory have similar latency, but these can be automatically cached by hardware and hence can be used effectively if the kernel exhibits

    temporal locality. L1 and L2 caches are introduced in the newer Fermi architecture

    giving benefits similar to CPU caches. More detailed descriptions on memory

    organisation and performance can be found in [16].

Whenever a CUDA kernel is launched on a GPU, thousands of threads are created, which are organised into a grid. The grid is a 1-D or 2-D array of thread blocks and each thread block is a 1-D, 2-D or 3-D array of threads. The thread blocks are assigned to the available SMs. All threads within a thread block execute in a time-multiplexed fashion on a single SM. The grid and block dimensions largely depend on the hardware resource requirements of the executing kernel. More information on this can be found in [14].

2.3 CUSP and Thrust

CUSP and Thrust are open source C++ template libraries developed using CUDA that provide high-level interfaces for GPU programming. We have used these libraries for implementing sparse matrix storage scheme support in PETSc.

2.3.1 Thrust

Thrust is an open source template library [17] developed on top of CUDA. The main advantage of Thrust is that it provides a high-level interface for GPU programming and enables rapid development of complex HPC applications. Another important benefit is that Thrust, being a C++ template library, supports generic programming and Object Oriented (OO) paradigms. The three main components of Thrust are: Containers, Iterators and Algorithms.

Containers: A container stores a collection of objects. Containers are usually implemented as template classes so that they can be used with different data types. For example, common data structures like linked lists, stacks, queues, heaps and arrays are implemented as containers. In Thrust, there are two main containers: thrust::host_vector and thrust::device_vector. A host_vector and a device_vector represent an array of elements in CPU (host) and GPU (device) memory respectively. The major benefit of containers is that they handle memory management for the underlying objects. For example, whenever we create a host_vector, it automatically allocates memory on the CPU. Similarly, the device_vector container handles memory allocation and deallocation on the GPU. Whenever we assign a host_vector to a device_vector, Thrust automatically makes a


vector copy from CPU to GPU memory. So lower level APIs like cudaMalloc, cudaMemcpy, cudaFree etc. are completely hidden from application developers.

Iterators: An iterator is a generalisation of pointers in C and can be thought of as an object in C++ which can point to other objects. Iterators are usually used for traversing over container objects and are similar to C pointers, so we can perform pointer arithmetic on them. There are different types of Thrust iterators like input, output, constant, permutation or transform iterators [17]. For example, the input iterator provides the functionality of accessing the value of a container object, but we cannot change the value of that object. It is possible to write generic algorithms using templates parameterised by iterators.

Algorithms: Thrust implements more than sixty basic algorithms like merge sort, radix sort, inclusive scan, reduce or parallel prefix. These algorithms are implemented as templates so that they can work with all basic data types. With the help of iterators, the algorithmic implementation does not have to worry about the underlying object type or object access methods. Algorithms do not directly access the container data, but use iterators to access the underlying data elements. For example, there is a single implementation of the radix sort for all data types; depending on the data type, an iterator provides a way to access the data elements.

The mechanism of using containers, iterators and algorithms together can be explained with the following simple example:

    /* Thrust headers */
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
        /* allocate storage for one million numbers using a host container */
        thrust::host_vector<float> vec_h(1000000);

        /* generate one million random numbers on the host using iterators */
        thrust::generate(vec_h.begin(), vec_h.end(), rand);

        /* transparent copy of the host vector to a device vector */
        thrust::device_vector<float> vec_d = vec_h;

        /* use of Thrust algorithms: passing iterators as parameters */
        thrust::sort(vec_d.begin(), vec_d.end());

        /* transparent copy from device to host memory */
        thrust::copy(vec_d.begin(), vec_d.end(), vec_h.begin());

        return 0;
    }

Figure 2.3: Simple example to sort one million float elements on GPU using Thrust


    calls cudaMemcpy() to make a host-to-device memory copy. In this example, we are

using the thrust::sort method to invoke the default sorting algorithm (Merrill's radix sort [18]) on the GPU. Finally, we use the thrust::copy method to copy back vector data

    from GPU to CPU memory.

2.3.2 CUSP

CUSP is also an open-source C++ template library [19] developed on top of CUDA, but it specifically targets sparse linear algebra and sparse matrix computations. Similar to Thrust, this library provides a high-level programming interface and internally uses the functionality of Thrust and CUBLAS. CUSP provides the following five sparse matrix storage schemes:

- Compressed Sparse Row (CSR)
- Coordinate (COO)
- ELLPACK (ELL)
- Diagonal (DIA)
- Hybrid (HYB)

    We will discuss these storage formats in detail in Section 4.2. CUSP provides an easy

    interface for building different sparse matrix formats and a transparent conversion

    between these formats. This is explained in the following example:

    In the above example, we create a sparse matrix object in COO format. CUSP provides

    the cusp::gallery interface for generating sample matrices for a Poisson or

    Diffusion problem on a 2-D mesh. When we assign the COO matrix object on the host

    to the ELL matrix object on the device, CUSP automatically allocates memory on

    GPU, performs the COO to ELL conversion, and copies the matrix data from CPU to

    GPU. We have discussed this mechanism in Section 5.2.

/* CUSP headers */
#include <cusp/coo_matrix.h>
#include <cusp/ell_matrix.h>
#include <cusp/gallery/poisson.h>

int main()
{
    /* sparse matrix in COO format on the host */
    cusp::coo_matrix<int, float, cusp::host_memory> coo_mat;

    /* matrix corresponding to a 2-D Poisson problem on a 15x15 mesh */
    cusp::gallery::poisson5pt(coo_mat, 15, 15);

    /* sparse matrix in ELL format on the device */
    cusp::ell_matrix<int, float, cusp::device_memory> ell_mat;

    /* performs memory allocation on the device, conversion from COO to ELL,
       and the copy from host to device */
    ell_mat = coo_mat;

    return 0;
}

    Figure 2.4: Sparse matrix construction and transparent conversion using CUSP


In addition to sparse matrix storage and operations, CUSP provides the following features:

- A file I/O interface for reading and writing large sparse matrices to/from Matrix Market files.
- Krylov subspace solvers like Conjugate Gradient (CG), Multi-mass Conjugate Gradient (CG-M), Biconjugate Gradient (BiCG) and Generalized Minimum Residual (GMRES) on GPUs.
- Preconditioners like Algebraic Multigrid (AMG), Diagonal and Approximate Inverse (AINV).

We have used the file I/O interface for converting matrices stored in Matrix Market format (ASCII) to PETSc binary format. For implementing sparse storage schemes in PETSc, we have used CUSP and Thrust extensively. We have also developed a small benchmark to measure the performance of the CUSP linear solvers on GPUs.
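As an illustration of how little host code such a benchmark needs, the sketch below solves a 2-D Poisson system on the GPU with CUSP's Conjugate Gradient solver. The mesh size, iteration limit and tolerance are illustrative choices, not the exact settings of the benchmark developed in this project.

#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/gallery/poisson.h>
#include <cusp/krylov/cg.h>
#include <cusp/monitor.h>

int main()
{
    /* 5-point Poisson matrix on a 512x512 mesh, stored in CSR on the device */
    cusp::csr_matrix<int, double, cusp::device_memory> A;
    cusp::gallery::poisson5pt(A, 512, 512);

    /* initial guess x = 0 and right hand side b = 1 */
    cusp::array1d<double, cusp::device_memory> x(A.num_rows, 0.0);
    cusp::array1d<double, cusp::device_memory> b(A.num_rows, 1.0);

    /* stop after 1000 iterations or when the relative residual drops below 1e-6 */
    cusp::default_monitor<double> monitor(b, 1000, 1e-6);

    /* run Conjugate Gradient entirely on the GPU */
    cusp::krylov::cg(A, x, b, monitor);

    return 0;
}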

2.4 PDEs: Source of Sparse Matrices

Partial Differential Equations (PDEs) provide a mathematical model for many scientific

    and engineering applications. These equations relate partial derivatives of physical

    quantities like force, velocity, momentum, temperature etc. In fluid dynamics, the

    Navier-Stokes equations [20] are a set of nonlinear PDEs, which can be used to

describe the flow of incompressible fluids as

    \rho \left( \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} \right) = -\nabla P + \mu \nabla^2 \mathbf{u}, \qquad \nabla \cdot \mathbf{u} = 0,

where u is the flow velocity, μ is the viscosity, P is the pressure, ρ is the density of the fluid and ∇ is the vector differential operator. Most commonly, we solve these PDEs by approximating them with equations with a finite number of unknowns. This process of

approximation is called discretisation. There are two commonly used techniques available, Finite Difference Methods (FDM) and Finite Element Methods (FEM), explained in [21].

We will illustrate the process of discretisation by using the common example of a PDE that appears in many engineering areas, i.e. Poisson's equation:

    -\nabla^2 u = -\left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) = f(x, y),

where u is a real-valued function of the two space variables x and y in a domain Ω. Consider a simple problem where we want to find a function u such that

    -\nabla^2 u = 1


in the solution domain Ω and u = 0 on the boundary ∂Ω. To find a numerical approximation of u, we discretise the PDE using finite differences and sub-divide the domain into a grid of points (x_i, y_j) = (ih, jh), where i, j = 0, 1, 2, ..., N+1. In this case, the grid spacing is given by h = 1/(N+1), and we solve for the unknowns u_{i,j} ≈ u(x_i, y_j) at the interior grid points.

On the 2-D grid, we can write the discretised second derivatives (obtained by applying a forward difference followed by a backward difference to the first derivative) as

    \frac{\partial^2 u}{\partial x^2} \approx \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2}, \qquad \frac{\partial^2 u}{\partial y^2} \approx \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{h^2}.

The right hand side of the above expression is called a five-point stencil because every point on the lattice is averaged with its four nearest neighbours, as shown in Figure 2.2. A finite difference approximation to the above equation is then given by

    \frac{4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}}{h^2} = 1.

This results in N^2 linear equations with N^2 unknowns u_{i,j}. The resulting matrix A from the linear system is very large, sparse and with a banded structure. For example, for N = 4 the matrix shown in Figure 2.3 is of order 16x16 and contains only 25% non-zero elements.

    Figure 2.2: 2-D Grid and Five-point stencil
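To make the structure of this linear system concrete, the following sketch assembles the N^2 x N^2 five-point Poisson matrix as (row, column, value) triplets, with the unknowns u_{i,j} numbered row by row. The Triplet type and the poisson5pt name are illustrative helpers only, not part of any library.

#include <vector>

struct Triplet { int row, col; double val; };

/* assemble the N^2 x N^2 five-point Poisson matrix, h = 1/(N+1) */
std::vector<Triplet> poisson5pt(int N)
{
    std::vector<Triplet> A;
    const double h2 = 1.0 / ((N + 1.0) * (N + 1.0));
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            int r = j * N + i;                                   /* row index of u_{i,j} */
            A.push_back({r, r, 4.0 / h2});                       /* centre point         */
            if (i > 0)     A.push_back({r, r - 1, -1.0 / h2});   /* west neighbour       */
            if (i < N - 1) A.push_back({r, r + 1, -1.0 / h2});   /* east neighbour       */
            if (j > 0)     A.push_back({r, r - N, -1.0 / h2});   /* south neighbour      */
            if (j < N - 1) A.push_back({r, r + N, -1.0 / h2});   /* north neighbour      */
        }
    return A;   /* for N = 4 this gives a 16x16 matrix with 64 non-zeros (25%) */
}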


    There are different storage schemes available to store these sparse matrices. Some

    formats like ELL or Diagonal are better suited for GPUs. We will discuss these formats

    in Section 4.2, considering their performance on GPUs.

2.5 Iterative Methods for Sparse Linear Systems

Iterative methods are commonly used for solving large linear systems. These methods try to find the solution of a linear system of equations Ax = b by generating a sequence of improving approximate solutions (here, iterative means the repetitive application of operations to improve the approximate solution). These methods use an initial guess as the first approximate solution and then improve this solution over successive iterations. There are two main classes of iterative methods: Stationary Iterative Methods and Krylov Subspace Methods. The Jacobi, Gauss-Seidel and Successive Over-Relaxation (SOR) methods are examples of stationary methods; they are easy to implement and analyse, but their convergence is not guaranteed for all classes of matrices.

Krylov Subspace Methods are a class of iterative methods which are considered among the most important iterative techniques currently available for solving linear and non-linear systems of equations. These methods are widely adopted because they are efficient and reliable. Examples of Krylov Subspace Methods are Conjugate Gradient, Biconjugate Gradient and GMRES (Generalized Minimal Residual). These methods are based on the Krylov subspace. The m-order Krylov subspace is defined as

    \mathcal{K}_m(A, b) = \mathrm{span}\{ b, Ab, A^2 b, \ldots, A^{m-1} b \},

where A is an n x n matrix and b is a vector of length n.

Figure 2.3: Sparse matrix for 5x5 grid (Poisson problem, 25% non-zero elements)


Research in Krylov subspace techniques has produced various new methods. A detailed explanation of all of them is beyond the scope of this project. We will discuss one Krylov subspace solver, GMRES, that we have used in our performance analysis example.

GMRES Method: GMRES is an iterative method which approximates the solution by the vector in the Krylov subspace with minimal residual [22]. GMRES approximates the solution by minimising the Euclidean norm of the residual Ax - b over the Krylov subspace. This method is designed to solve non-symmetric linear systems. The most popular form of GMRES is based on the Gram-Schmidt orthogonalisation process. The Gram-Schmidt process takes a set of linearly independent vectors S = { s_1, s_2, ..., s_n } in Euclidean space and computes a set of orthogonal vectors Q = { q_1, q_2, ..., q_n } such that { q_1, ..., q_k } spans the same subspace as { s_1, ..., s_k } for each k ≤ n.
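The following sketch shows the classical Gram-Schmidt procedure on plain std::vector data, purely to illustrate the orthogonalisation step described above; production GMRES codes such as PETSc's use the numerically more stable modified Gram-Schmidt within the Arnoldi process.

#include <cstddef>
#include <vector>

static double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

/* classical Gram-Schmidt: orthogonalise s_1..s_n into q_1..q_n */
std::vector<std::vector<double>> gram_schmidt(const std::vector<std::vector<double>>& S)
{
    std::vector<std::vector<double>> Q;
    for (const std::vector<double>& s : S) {
        std::vector<double> q = s;
        for (const std::vector<double>& p : Q) {
            double c = dot(s, p) / dot(p, p);          /* projection of s onto p */
            for (std::size_t i = 0; i < q.size(); ++i)
                q[i] -= c * p[i];                      /* remove that component  */
        }
        Q.push_back(q);   /* q is orthogonal to every previously computed vector */
    }
    return Q;
}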


    Chapter 3

    PETSc GPU Implementation

3.1 PETSc

PETSc is a scalable solver library which has been used in the development of a large

    number of HPC applications [26]. It provides infrastructure for rapid prototyping and

    algorithmic design, which eases the development of scientific applications while

    maintaining the scalability on large numbers of processors. The design of PETSc

    allows transparent use of different linear/non-linear solvers and preconditioners in the

    applications. The programming interface is provided through C, C++, FORTRAN and

    Python.

In this section we will discuss the PETSc design and architecture, which will help in understanding our later implementation of additional sparse matrix storage schemes.

3.1.1 PETSc Kernels

PETSc kernels are the basic sets of services on top of which the scalable solver library is

    built. These kernels are shown in Figure 3.1.

    These kernels have a modular structure and are

    designed to maintain portability across different

    architectures and platforms. For example, instead

    of float or integer data types, PETSc provides

    new data types like PetscInt, PetscScalar or

    PetscMPIInt. These data types are internally

    mapped to corresponding int, float, float64 or

    double data types supported on the underlying

    platform. For our implementation, if we want to

    add new memory management routines, we can

    implement those in the corresponding kernel and

    make them available to applications and other

    kernels. These PETSc kernels are explained

in more detail in [27].

Figure 3.1: PETSc Kernels
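As a small, hedged illustration of these kernel-level services (not taken from the PETSc sources), the snippet below allocates a portable array with the PETSc memory routines and data types rather than raw malloc and double:

#include <petscsys.h>

PetscErrorCode fill_array(void)
{
    PetscErrorCode ierr;
    PetscInt       i, n = 100;
    PetscScalar   *vals;

    /* kernel-level memory management instead of malloc/free */
    ierr = PetscMalloc(n * sizeof(PetscScalar), &vals);CHKERRQ(ierr);
    for (i = 0; i < n; i++) vals[i] = (PetscScalar)i;
    ierr = PetscFree(vals);CHKERRQ(ierr);
    return 0;
}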


3.1.2 PETSc Components

PETSc is developed using object-oriented paradigms and its architecture allows easy

    integration of new features from external developer communities. PETSc consists of

    various sub-components listed below:

- Vectors
- Matrices
- Distributed Arrays
- Preconditioners
- Krylov Subspace Solvers
- Non-linear Solvers
- Index Sets
- Timesteppers

    PETSc allows easy customisation and extensions to these components. For example,

    we can implement a new matrix subclass or preconditioner that can be transparently

used by all KSP solvers without any modifications. The algorithmic implementation is separated from the parallel library layer, which allows code reuse and the easy addition of new solvers, preconditioners and data structures. Figure 3.2 shows the

    organisation of different PETSc libraries and the levels of abstraction at which they

    are exposed.

    Figure 3.2: PETSc Library Organisation [54]
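A minimal sketch of this transparency, assuming a Mat A and Vecs b, x have already been assembled and ierr is declared in the surrounding code (the call sequence matches the PETSc 3.x API of the time; error checking is abbreviated):

KSP ksp;
ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
/* solver and preconditioner are chosen from the option database at run time,
   e.g. -ksp_type gmres -pc_type jacobi, with no change to this code */
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
ierr = KSPDestroy(&ksp);CHKERRQ(ierr);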


    PETSc internally uses a number of libraries like BLAS, ParMetis, MPI and HDF5 to

    provide the infrastructure required for large HPC applications. PETSc provides much

    flexibility for users to choose among different libraries for different classes of

applications, but most of the functionality of the underlying libraries is hidden from application developers by the parallel library layer.

3.1.3 PETSc Object Design

In PETSc, classes like Vector, Matrix and Distributed Array represent data objects.

    These objects define various methods for data manipulation in sequential or parallel

    implementation. The internal representation of an object i.e. the data structure, is not

    exposed to applications and is only available through exposed APIs. This is shown in

    Figure 3.3. For example, the Vector class can be used for representing the right hand

side of a linear system Ax = b or discrete solutions of PDEs, and stores values in a simple array format similar to the C or FORTRAN array convention. This class defines various

    methods for vector operations like the dot product, the vector norm, scaling, scatter or

    gather operations. For parallel applications, PETSc automatically distributes these

    vector elements within the communicators and uses the functionality of the underlying

    MPI library to perform collective or point-to-point MPI operations.

In the parallel implementation, the Matrix or Preconditioner objects do not have access to the internal data structures directly. Instead, they just call APIs exposed through the PETSc interface, and the internal object representation manages communication within an MPI communicator. For example, for parallel vectors, a VecScatter object is created internally to manage data communication across MPI processes. The VecScatterBegin() and VecScatterEnd() routines are used to perform vector scatter operations across the communicator. To access internal Vector data, the application uses subroutines like VecGetArray(). Only Preconditioner (PC) objects are implemented in a data-structure-specific way, so they access and manipulate Vector or Matrix data structures directly.

Figure 3.3: PETSc Objects and Application Level Interface
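For example, a sequential application accesses and fills a Vector only through the exposed API, never through the data structure itself. A minimal sketch, with error checking abbreviated and ierr declared in the surrounding code:

Vec          x;
PetscScalar *a;
PetscInt     i, n;

ierr = VecCreateSeq(PETSC_COMM_SELF, 100, &x);CHKERRQ(ierr);
ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
ierr = VecGetArray(x, &a);          /* borrow direct access to the underlying array */
for (i = 0; i < n; i++) a[i] = (PetscScalar)i;
ierr = VecRestoreArray(x, &a);      /* hand the data back to the Vector object */
ierr = VecDestroy(&x);CHKERRQ(ierr);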


3.2 PETSc GPU Implementation

Recently, GPU support has been added to the PETSc solver library. Currently it is

under development and available in the PETSc development release [11]. The initial implementation allows for transparent use of GPUs without modifying the existing

    application source code. Instead of writing completely new CUDA code, PETSc uses

    the open source CUSP and Thrust libraries discussed in Section 2.3. This helps to keep

    the GPU implementation separate from the existing PETSc code.

    We will discuss this new implementation in more detail, as our development work will

be an extension of this implementation.

3.2.1 Sequential Implementation

The current implementation assumes that every MPI process has access to a single

GPU. A new GPU-specific Vector class called VecCUSP has been implemented. It uses

    CUBLAS, CUSP, as well as Thrust library routines to perform vector operations on a

    GPU. The idea behind using these libraries is to use already developed, fine tuned

    CUDA implementations with PETSc instead of developing new ones. The PETSc

    implementation acts as an interface between PETSc data structures and external CUDA

    libraries, i.e. Thrust and CUSP.

    Whenever we execute a program with GPU support, two copies of any vector are

created, one on the CPU and another on the GPU. In the existing Vec class, a new flag

    is added called valid_GPU_array. This flag has the following four possible values and

    corresponding meaning:

    PETSC_CUSP_UNALLOCATED : Object is not yet allocated on GPU

    PETSC_CUSP_CPU : Object is allocated and a valid copy is available on CPU only

    PETSC_CUSP_GPU : Object is allocated and a valid copy is available on GPU only

    PETSC_CUSP_BOTH : Object is allocated and valid copies are on CPU & GPU (both)

Initially this flag has the value PETSC_CUSP_UNALLOCATED. When an application creates a Vector object, the VecCUSPCopyToGPU() subroutine creates a new vector copy on the GPU and sets the valid_GPU_array flag to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. Now all vector operations can be performed on the GPU. Whenever the VecCUSPCopyToGPU() function gets called, it makes a copy to the GPU only if the vector object has been modified on the CPU, i.e. the value of the valid_GPU_array flag has changed. Memory copies between host and device are managed through the subroutines VecCUDACopyToGPU() and VecCUDACopyFromGPU(). For example, when an application calls VecGetArrayRead() to access vector data, internally it first calls VecCUDACopyFromGPU() to copy recent vector values from the GPU and then sets valid_GPU_array to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. This mechanism can be illustrated by the implementation of the simple vector operation AXPY, i.e. y = alpha*x + y:


For the above vector operation, the VecCUDACopyToGPU() subroutine allocates memory and copies the vector data onto the GPU if the flag value is PETSC_CUSP_UNALLOCATED. If the flag value is PETSC_CUSP_CPU, the memory is already allocated on the GPU but the copy on the CPU has recently been modified, so it performs a CPU-to-GPU vector copy. It then calls the CUBLAS library routines and sets valid_GPU_array to PETSC_CUDA_GPU.

3.2.2 Parallel Implementation

In the parallel implementation, the parallel Vector and Matrix objects are implemented

on top of the sequential implementation. The rows of a matrix are partitioned among

    the processes in a communicator. This is shown in Figure 3.5. In the PETSc

    implementation, a sparse matrix is stored in two parts: the on-diagonal part and the off-

    diagonal part. The on-diagonal portion of the matrix, say Ad, stores values of the

columns associated with the rows owned by that process. These matrix elements are shown in red. All remaining entries, i.e. the off-diagonal portion, are stored in another

    component, say Ao.

VecAXPY()
{
    /* copy vector from CPU to GPU: if modified */
    ierr = VecCUDACopyToGPU(xin);
    /* copy vector from CPU to GPU: if modified */
    ierr = VecCUDACopyToGPU(yin);

    try {
        /* perform AXPY using CuBLAS library routine */
        cusp::blas::axpy(*((Vec_CUDA*)xin->spptr)->GPUarray,
                         *((Vec_CUDA*)yin->spptr)->GPUarray,
                         alpha);
        /* now updated copy is present on GPU */
        yin->valid_GPU_array = PETSC_CUDA_GPU;
        /* wait until all threads finish */
        ierr = WaitForGPU();
    } catch (char *ex) {
        .........
    }
}

    Figure 3.4: VecAXPY implementation in PETSc using CUSP & CuBLAS [11]

    Figure 3.5: Parallel Matrix with on-diagonal and off-diagonal elements

for two MPI processes


The sparse matrix-vector product is calculated in two steps: first, we calculate the product related to the on-diagonal entries of the matrix, i.e. Ad, using the associated entries of the vector x, i.e. xd. Then we calculate the product of the off-diagonal matrix entries and the associated vector entries xo, which gets added to the previous result yd as

    yd = Ad * xd

    yd += Ao * xo

For this operation, updated entries of the vector xo must be communicated within the communicator. As only a few Ao elements are non-zero, we do not have to communicate all xo entries. This communication is managed through the VecScatter object, which handles parallel gather and scatter operations using non-blocking MPI calls. The VecScatter object stores two arrays of indices: one array stores the global indices of the vector elements that will be received as updated entries from other processes in the communicator (these received vector elements are stored in a local array); the second array stores the mapping between the global index of a vector element and its position in the local array.

Communication starts with the VecScatterBegin() call, which copies data into message buffers. For the GPU implementation, the updated vector entries are first copied from GPU memory to the CPU using the VecCUDACopyFromGPU() function. The communication completes after the VecScatterEnd() call, which waits for the completion of the non-blocking MPI calls posted by VecScatterBegin(). This implementation of the parallel matrix-vector operation is shown below:

VecScatterBegin(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
MatMult(Ad, xd, yd);
VecScatterEnd(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
MatMultAdd(hatAo, hatxo, yd, yd);

Figure 3.6: Parallel Matrix-Vector multiplication in PETSc GPU implementation [11]

More information about this implementation can be found in [11] [28].

3.3 Applications running with PETSc GPU support

PETSc allows transparent use of GPUs without any changes in the application source code. Most existing PETSc applications can run on GPUs. New Vector and Matrix classes, i.e. VecCUSP and MatCUSP, have been added to PETSc which perform all matrix-vector operations on GPUs. To run any existing application on the GPU, the user has to


set the Vector type to VECCUSP and the Matrix type to MATCUSP using the VecSetType() and MatSetType() routines respectively. The user can also set these Vector and Matrix types using the option database keys vec_type seqcusp and mat_type seqaijcusp.
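A sketch of the two alternatives (the exact type macros follow the seqcusp/seqaijcusp keys mentioned above; only the single-process case is shown and error checking is abbreviated):

/* in the application source ... */
ierr = VecSetType(x, VECSEQCUSP);CHKERRQ(ierr);       /* vector data lives on the GPU    */
ierr = MatSetType(A, MATSEQAIJCUSP);CHKERRQ(ierr);    /* matrix operations run on the GPU */

/* ... or, without touching the source, from the option database:
       ./app -vec_type seqcusp -mat_type seqaijcusp                  */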

    All of the Krylov Subspace methods except KSPIBCGS (Improved Stabilized version

    of BiConjugate Gradient Squared) are supported on the GPU. Currently, Jacobi, AMG

    (Algebraic Multigrid) and AINV (Approximate Inverse) preconditioners are supported

    on the GPUs.


    Chapter 4

    Sparse Matrices

    As discussed in Section 2.4, the discretisation of PDEs results in large sparse matrices.

A matrix with only a few non-zero elements can be considered sparse. In a practical sense, a matrix can be considered sparse if specialised techniques can be used to take

    advantage of the sparsity and the sparsity pattern of the matrix. Depending on the

    sparsity pattern, we can divide matrices into two broad categories: structured and

    unstructured. A matrix with non-zero elements in a specific regular pattern is called a

structured sparse matrix. For example, the non-zero elements may all lie along a few diagonals of the matrix or in small dense sub-blocks, which results in regular patterns. The application of FDM or linear FEM on rectangular grids results in structured sparse matrices. On the other hand, for irregular meshes this results in

    unstructured sparse matrices with no specific structure or pattern of non-zero elements.

Figure 4.1 and Figure 4.2 show examples of structured and unstructured matrices respectively.

    Depending on the sparsity pattern, different storage schemes or data structures can be

    used. Importantly, the performance of the matrix operations depends on these storage

    schemes and processor architecture. This becomes more apparent for vector processors

    and GPUs. In this section we will discuss sparse matrix representation and different

    storage schemes with their storage efficiency and performance.

Figure 4.1: Example of a structured matrix from a structured problem

Figure 4.2: Example of an unstructured matrix from a bipartite graph


4.1 Sparse Matrix Representation

The structure of sparse matrices can be ideally represented by adjacency graphs. Graph theory techniques have been used effectively for parallelising various iterative methods and implementing preconditioners [29]. A graph G = (V, E) is defined by a set of vertices V = { v_1, v_2, ..., v_n } and a set of edges E, where each edge (v_i, v_j) connects two elements of V. In the 2-D plane, the graph G is represented by a set of points which are connected by edges between these points. In the case of the adjacency graph of a sparse matrix, the n vertices in V represent the n unknown variables, and the edges in E represent a binary relation between those vertices: there is an edge from node i to node j when the matrix element a_ij ≠ 0. An adjacency graph can be directed or undirected depending on the symmetry of the non-zeros. When a sparse matrix has a symmetric non-zero pattern (i.e. a_ij ≠ 0 implies a_ji ≠ 0), the adjacency graph is undirected; otherwise it is directed.

This adjacency graph representation can be used for parallelisation. In the case of parallelising Gaussian elimination, at a given stage of the elimination we can find unknowns which are independent of each other from the above binary relation. For

    example, in the case of a diagonal matrix all unknowns are independent of each other,

    which is not true for dense matrices. More information about sparse matrix

    representation and parallelisation strategies can be found in [29].
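A short sketch of this construction for a dense n x n array a (purely illustrative): the directed edge set E is simply the list of index pairs with a non-zero entry off the diagonal.

#include <utility>
#include <vector>

/* build the directed adjacency graph edges of an n x n matrix a */
std::vector<std::pair<int, int>> adjacency_edges(const std::vector<std::vector<double>>& a)
{
    std::vector<std::pair<int, int>> E;
    const int n = (int)a.size();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j && a[i][j] != 0.0)
                E.push_back(std::make_pair(i, j));   /* edge i -> j because a_ij != 0 */
    return E;
}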

4.2 Sparse Matrix Storage Schemes

There are two main reasons for different sparse matrix storage formats: memory

    requirements and computational efficiency. It may not be feasible to store a large sparse

    matrix in main memory. Importantly, it is not necessary to store zero matrix elements.

    Various storage schemes (i.e. data structures) have been proposed to effectively utilise

sparsity and sparsity patterns of the matrices. There is no single 'best' storage scheme for all sparse matrices; but a few are suitable for matrices with structured sparsity

    patterns, some are general purpose and others are storage schemes for matrices with

    arbitrary nonzero patterns. Each storage scheme has different storage costs,

    computational costs and performance characteristics. In this section we will discuss

    various storage schemes and their performance on GPUs.

Figure 4.3: Sparse matrix representation with a directed adjacency graph


4.2.1 Coordinate List

The coordinate list (COO) is a simple and the most flexible storage format, where we store every non-zero element of a matrix with three vectors: data, row and indices. The data vector stores the non-zero elements of the matrix in row-major order. The row and indices vectors explicitly store the associated row and column index of every element in the data vector. This is explained in the following Figure:

    Figure 4.4: Sparse matrix and corresponding COO storage representation
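A sequential sketch of the sparse matrix-vector product y = Ax computed directly from the three COO arrays (not CUSP's implementation; nnz is the number of stored non-zeros and y is assumed to be zero-initialised):

/* COO SpMV: one multiply-add per stored non-zero */
for (int n = 0; n < nnz; ++n)
    y[row[n]] += data[n] * x[indices[n]];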

This is a general purpose and robust storage scheme, which can be used for matrices with arbitrary sparsity patterns. The above example shows that the storage cost of the COO format is proportional to the number of non-zero elements: for an MxN sparse matrix with k non-zero elements, it requires k(s_v + 2 s_i) bytes, where s_v is the size of a matrix value and s_i the size of an index.

4.2.2 Compressed Sparse Row

Compressed Sparse Row (CSR) is a popular and the most general purpose storage

    format. This can be used for storing matrices with arbitrary sparsity patterns as it makes

    no assumptions about the structure of the nonzero elements. Like COO, this format also

stores only non-zero elements. These elements are stored using three vectors: data, indices and row_ptr. The data and indices vectors are the same as for the COO format. For an MxN sparse matrix, the row_ptr vector has length M+1 and stores the index at which each row of the matrix starts in the data vector. The last entry of row_ptr corresponds to the number of non-zero elements in the matrix. This storage scheme is explained in the

    Figure below:

    Figure 4.4: Sparse matrix and corresponding CSR storage representation

The example matrices and their storage arrays from the two figures above are:

COO example (first matrix):

    4 0 1 0 0
    0 5 0 2 0
    0 0 6 0 3
    9 0 0 7 0
    0 0 0 0 8

    data    = [4 1 5 2 6 3 9 7 8]
    row     = [0 0 1 1 2 2 3 3 4]
    indices = [0 2 1 3 2 4 0 3 4]

CSR example (second matrix):

    4 0 0 1 0
    0 5 0 2 0
    0 0 0 0 3
    7 0 0 6 0
    0 8 0 0 9

    data    = [4 1 5 2 3 7 6 8 9]
    indices = [0 3 1 3 4 0 3 1 4]
    row_ptr = [0 2 4 5 7 9]
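For comparison, a sequential sketch of y = Ax from the CSR arrays; row_ptr delimits the slice of data and indices belonging to each row, so each output element is produced independently:

/* CSR SpMV: one independent dot product per matrix row */
for (int i = 0; i < M; ++i) {
    double sum = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        sum += data[k] * x[indices[k]];
    y[i] = sum;
}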


There are some advantages of using CSR over COO. The CSR format takes less storage than COO due to the compression of the row indices explained in the above Figure. Also, with the row_ptr vector we can easily compute the number of non-zero elements in row i as row_ptr[i+1] - row_ptr[i]. In parallel algorithms, the row_ptr values allow fast row slicing operations and fast access to matrix elements using pointer indirection. This is a commonly used sparse matrix storage scheme on CPUs. For an MxN sparse matrix with k non-zero elements, it requires k(s_v + s_i) + (M+1) s_i bytes.

4.2.3 Diagonal

The application of stencils to regular grids results in banded sparse matrices, where non-zero elements are restricted to a few sub-diagonals of the matrix. For these matrices, the

diagonal (DIA) format can be used effectively. The DIA format uses only two vectors, data and offsets. The data vector stores the non-zero elements of the sub-diagonals of the matrix. The offsets vector stores the offset of every sub-diagonal from the main diagonal of the matrix. By convention, the main diagonal has offset 0, the diagonals below the main diagonal have negative offsets and those above the main diagonal have positive offsets. This is illustrated with an example in the Figure below:

Figure 4.5: Sparse matrix and corresponding DIA storage scheme

Unlike CSR and COO, this storage format stores a few zero (padding) elements explicitly. As we can see in the above Figure, the diagonal with offset -3 has only two non-zero elements, but to store it in diagonal format the remaining elements of this diagonal are padded with an arbitrary value (shown as * in the Figure). So there is some extra storage overhead associated with it, but there are greater storage benefits due to the fact that we do not have to store column or row indices explicitly. Usually, the data vector stores the non-zero elements in column-major order, which ensures memory coalescing on GPU devices. We will discuss this in more detail in the performance analysis in Section 4.3. For an MxN matrix with d sub-diagonals having at least one non-zero element, it requires M d s_v + d s_i bytes. This is not a general purpose storage scheme like CSR and COO. It is very sensitive to

    the sparsity pattern and is useful for matrices with an ordered banded structure. For

example, consider the matrix in Figure 4.6. This matrix has a banded structure, but it is not suitable for the DIA storage scheme: the non-zero structure is in exactly the opposite order of what is ideally suited to the DIA format.

The example from Figure 4.5 is:

    3 0 8 0  0
    0 4 0 9  0
    0 0 5 0 10
    1 0 0 6  0
    0 2 0 0  7

    data (one row per matrix row, one column per stored diagonal):
        * 3  8
        * 4  9
        * 5 10
        1 6  *
        2 7  *

    offsets = [-3 0 2]


When we store such a matrix in the diagonal storage format, we end up storing all of its sub-diagonals, each containing a single non-zero element and four padding elements.
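A sequential sketch of y = Ax from the DIA arrays, assuming the data array is stored column-major as an M x num_diags block (one column per stored diagonal); padding entries are skipped because their column index falls outside the matrix:

/* DIA SpMV: walk the stored diagonals row by row */
for (int i = 0; i < M; ++i) {
    double sum = 0.0;
    for (int d = 0; d < num_diags; ++d) {
        int j = i + offsets[d];                  /* column touched by diagonal d in row i */
        if (j >= 0 && j < N)
            sum += data[i + d * M] * x[j];       /* column-major access into data         */
    }
    y[i] = sum;
}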

4.2.4 ELL or Padded ITPACK

Like DIA, the ELL format is also suitable for vector architectures. This format can be used for storing sparse matrices arising from semi-structured meshes where the number of non-zero elements per row is nearly the same. For an MxN sparse matrix with a maximum of k non-zeros per row, we store the matrix in an M x k dense data array. If a particular row has fewer than k non-zeros, that row is padded with zeros. The indices array stores the column index of every element in the data array. These elements are stored in column-major order. Figure 4.6 illustrates an example of the ELL storage scheme:

Figure 4.6: Sparse matrix and corresponding ELL storage scheme

Compared to DIA, ELL is a more general storage format and it is not necessary to have a banded structure of non-zero elements. But the number of non-zeros must be nearly the same across all rows of the matrix, otherwise we end up padding with large numbers of zero elements. For an MxN sparse matrix with a maximum of NNZ_PER_ROW non-zeros per row, it requires M * NNZ_PER_ROW * (s_v + s_i) bytes of storage.
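A sequential sketch of y = Ax from the ELL arrays, with data and indices stored column-major as M x NNZ_PER_ROW blocks; padded entries hold the value zero, so they add nothing to the sum:

/* ELL SpMV: every row reads exactly NNZ_PER_ROW (value, column) pairs */
for (int i = 0; i < M; ++i) {
    double sum = 0.0;
    for (int k = 0; k < NNZ_PER_ROW; ++k)
        sum += data[i + k * M] * x[indices[i + k * M]];   /* zero padding is harmless */
    y[i] = sum;
}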

4.2.5 Hybrid

Although the ELL format is well suited to vector architectures, sparse matrices arising from complex geometries most of the time do not have the same number of non-zeros per row [12]. As the number of non-zero elements per row starts to vary to a larger extent, we

    end up storing a large number of padding elements. Consider the example of the sparse

    matrix shown in Figure 4.7. In this case, except for the first row all other rows have

    Figure 4.5: Banded nonzero pattern which is not suitable for DIA format
