
  • 8/10/2019 UCB Sparse Tutorial 1

    1/18

Sparse Matrix Techniques (Tutorial)

X. Sherry Li, Lawrence Berkeley National Lab

    Math 290 / CS 298, UCB

    Jan. 31, 2007


    01/31/07 Math 290 / CS 298 2

Outline

    Part I

    Computer representations of sparse matrices

    Sparse matrix-vector multiply with various storages

    Performance optimizations

    Part II

    Techniques for sparse factorizations

    (e.g., SuperLU solver)


Sparse Storage Schemes

    Notation

N: matrix dimension

NNZ: number of nonzeros

    Assume arbitrary sparsity pattern

    triplets format ({i, j, val}) is not sufficient . . .

    Storage: 2*NNZ integers, NNZ reals

    Not easy to randomly access one row or column

Linked list format provides flexibility, but is not friendly on modern architectures . . .

    Cannot call BLAS directly


Compressed Row Storage (CRS)

    Store nonzeros row by row contiguously

    Example: N = 7, NNZ = 19

3 arrays: nzval, colind, rowptr

    [ 1  .  .  a  .  .  . ]
    [ .  2  .  .  b  .  . ]
    [ c  d  3  .  .  .  . ]
    [ .  e  .  4  f  .  . ]
    [ .  .  .  .  5  .  g ]
    [ .  .  .  h  i  6  j ]
    [ .  .  k  .  l  .  7 ]

    nzval   1 a 2 b c d 3 e 4 f 5 g h i 6 j k l 7
    colind  1 4 2 5 1 2 3 2 4 5 5 7 4 5 6 7 3 5 7
    rowptr  1 3 5 8 11 13 17 20

Storage: NNZ reals, NNZ+N+1 integers


SpMV (y = Ax) with CRS

    dot product

    No locality for x

    Vector length usually short

    Memory-bound: 3 reads, 2 flops

    nzval   1 a 2 b c d 3 e 4 f 5 g h i 6 j k l 7
    colind  1 4 2 5 1 2 3 2 4 5 5 7 4 5 6 7 3 5 7
    rowptr  1 3 5 8 11 13 17 20

    do i = 1, N                          . . . row i of A
       sum = 0.0
       do j = rowptr(i), rowptr(i+1) - 1
          sum = sum + nzval(j) * x(colind(j))
       enddo
       y(i) = sum
    enddo


Compressed Column Storage (CCS)

Also known as the Harwell-Boeing format

Store nonzeros column by column contiguously; same example matrix:

    [ 1  .  .  a  .  .  . ]
    [ .  2  .  .  b  .  . ]
    [ c  d  3  .  .  .  . ]
    [ .  e  .  4  f  .  . ]
    [ .  .  .  .  5  .  g ]
    [ .  .  .  h  i  6  j ]
    [ .  .  k  .  l  .  7 ]

3 arrays:

    nzval   1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
    rowind  1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
    colptr  1 3 6 8 11 16 17 20

Storage: NNZ reals, NNZ+N+1 integers


SpMV (y = Ax) with CCS

    SAXPY

    No locality for y

    Vector length usually short

    Memory-bound: 3 reads, 1 write, 2 flops

    y(i) = 0.0, i = 1, N

    do j = 1, N                          . . . column j of A
       t = x(j)
       do i = colptr(j), colptr(j+1) - 1
          y(rowind(i)) = y(rowind(i)) + nzval(i) * t
       enddo
    enddo

    nzval   1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
    rowind  1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
    colptr  1 3 6 8 11 16 17 20


Jagged Diagonal Storage (JDS)

    Also known as ITPACK, or Ellpack storage [Saad, Kincaid et al.]

    Force all rows to have the same length as the longest row,

    then columns are stored contiguously

    2 arrays: nzval(N,L) and colind(N,L), where L = max row length

    N*L reals, N*L integers

Usually L . . .


SpMV with JDS

    Neither dot nor SAXPY

    Good for vector processor: long vector length (N)

Extra memory and flops for the padded zeros; especially bad if row lengths vary a lot

    y(i) = 0.0, i = 1, N

    do j = 1, L
       do i = 1, N
          y(i) = y(i) + nzval(i, j) * x(colind(i, j))
       enddo
    enddo

Padded matrix (L = 4):

    1 a 0 0
    2 b 0 0
    c d 3 0
    e 4 f 0
    5 g 0 0
    h i 6 j
    k l 7 0


Segmented-Sum [Blelloch et al.]

    Data structure is an augmented form of CRS

    Computational structure is similar to JDS

Each row is treated as a segment in a long vector

    Underlined elements denote the beginning of each segment

    (i.e., a row in A)

Dimension: S * L ~ NNZ, where L is chosen to approximate the hardware vector length

    1 a
    2 b
    c d 3
    e 4 f
    5 g
    h i 6 j
    k l 7

stored column by column as an S-by-L array (S = 5, L = 4):

    1 d 5 j
    a 3 g k
    2 e h l
    b 4 i 7
    c f 6


SpMV with Segmented-Sum

    2 arrays: nzval(S, L) and colind(S, L), where S*L ~ NNZ

    NNZ reals, NNZ integers

    Good for vector processors

SpMV is performed bottom-up, with each row-sum (dot) of Ax stored in the beginning of each segment

    Similar to JDS, but with more control logic in inner-loop

    1 a

    2 b

    c d 3

    e 4 f

    5 g

    h i 6 j

    k l 7

    1 d 5 j

    a 3 g k

    2 e h l

    b 4 i 7

    c f 6

    do i = S, 1, -1
       do j = 1, L
          . . .
       enddo
    enddo


Performance (megaflop rate) [Gaeke et al.]

    Test matrix: N = 10000, NNZ = 177782, random pattern

~18 nonzeros per row on average; JDS does 4.6x more operations

    machine           Ultra 2i     Pentium 4    VIRAM
    Clock rate        333 MHz      1.5 GHz      200 MHz
    Peak flop rate    667 Mflops   1.5 Gflops   1.6 Gflops
    CRS               29           209          110
    JDS               27           17           632
    JDS (effective)   6            4            137
    Seg-Sum           5            29           165

(SpMV rates in Mflops)


Optimization Techniques

    Matrix reordering

For CRS SpMV, can improve x-vector locality by reducing the bandwidth of matrix A

    Example: reverse Cuthill-McKee (breadth-first search)

    Observed 2-3x improvement [Toledo, et al.]


Optimization Techniques

    Register blocking

    Find dense blocks of size r-by-c in A

    (If needed, allow some zeros to be filled in)

A*x proceeds block by block

Keep c elements of x and r elements of y in registers: each x element is re-used r times, each y element c times

The amount of indexed loads and stores is reduced

    Obtained up to 2.5x improvement [Vuduc et al.]


SPARSITY [Im, Yelick]


Performance Improvement [Vuduc et al.]


Other Representations

Block entry formats (e.g., multiple degrees of freedom are associated with a single physical location)

    Constant block size (BCRS)

    Varying block sizes (VBCRS)

    Skyline (or profile) storage (SKS)

Lower triangle stored row by row; upper triangle stored column by column

In each row (column), the first nonzero defines a profile

All entries within the profile (some may be zeros) are stored


References

Templates for the Solution of Linear Systems, Barrett et al., SIAM, 1994

BeBOP: http://bebop.cs.berkeley.edu/

Sparse BLAS standard: http://www.netlib.org/blas/blast-forum