
  • 8/10/2019 UCB Sparse Tutorial 1

    1/18

Sparse Matrix Techniques (Tutorial)

X. Sherry Li, Lawrence Berkeley National Lab

    Math 290 / CS 298, UCB

    Jan. 31, 2007


    01/31/07 Math 290 / CS 298 2

Outline

    Part I

    Computer representations of sparse matrices

    Sparse matrix-vector multiply with various storages

    Performance optimizations

    Part II

    Techniques for sparse factorizations

    (e.g., SuperLU solver)


Sparse Storage Schemes

    Notation

N: matrix dimension

NNZ: number of nonzeros

    Assume arbitrary sparsity pattern

    triplets format ({i, j, val}) is not sufficient . . .

    Storage: 2*NNZ integers, NNZ reals

    Not easy to randomly access one row or column

Linked list format provides flexibility, but is not friendly on modern architectures . . .

    Cannot call BLAS directly


Compressed Row Storage (CRS)

    Store nonzeros row by row contiguously

    Example: N = 7, NNZ = 19

3 arrays: nzval, colind, rowptr

    [ 1  .  .  a  .  .  . ]
    [ .  2  .  .  b  .  . ]
    [ c  d  3  .  .  .  . ]
    [ .  e  .  4  f  .  . ]
    [ .  .  .  .  5  .  g ]
    [ .  .  .  h  i  6  j ]
    [ .  .  k  .  l  .  7 ]

    nzval   1 a 2 b c d 3 e 4 f 5 g h i 6 j k l 7
    colind  1 4 2 5 1 2 3 2 4 5 5 7 4 5 6 7 3 5 7
    rowptr  1 3 5 8 11 13 17 20

Storage: NNZ reals, NNZ+N+1 integers


SpMV (y = Ax) with CRS

    dot product

    No locality for x

    Vector length usually short

    Memory-bound: 3 reads, 2 flops

    nzval   1 a 2 b c d 3 e 4 f 5 g h i 6 j k l 7
    colind  1 4 2 5 1 2 3 2 4 5 5 7 4 5 6 7 3 5 7
    rowptr  1 3 5 8 11 13 17 20

    do i = 1, N                          . . . row i of A
       sum = 0.0
       do j = rowptr(i), rowptr(i+1) - 1
          sum = sum + nzval(j) * x(colind(j))
       enddo
       y(i) = sum
    enddo


Compressed Column Storage (CCS)

Also known as the Harwell-Boeing format

Store nonzeros column by column contiguously; same example matrix:

    [ 1  .  .  a  .  .  . ]
    [ .  2  .  .  b  .  . ]
    [ c  d  3  .  .  .  . ]
    [ .  e  .  4  f  .  . ]
    [ .  .  .  .  5  .  g ]
    [ .  .  .  h  i  6  j ]
    [ .  .  k  .  l  .  7 ]

3 arrays:

    nzval   1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
    rowind  1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
    colptr  1 3 6 8 11 16 17 20

Storage: NNZ reals, NNZ+N+1 integers


SpMV (y = Ax) with CCS

    SAXPY

    No locality for y

    Vector length usually short

    Memory-bound: 3 reads, 1 write, 2 flops

    y(i) = 0.0, i = 1, N

    do j = 1, N                          . . . column j of A
       t = x(j)
       do i = colptr(j), colptr(j+1) - 1
          y(rowind(i)) = y(rowind(i)) + nzval(i) * t
       enddo
    enddo

    nzval   1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
    rowind  1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
    colptr  1 3 6 8 11 16 17 20


Jagged Diagonal Storage (JDS)

    Also known as ITPACK, or Ellpack storage [Saad, Kincaid et al.]

    Force all rows to have the same length as the longest row,

    then columns are stored contiguously

    2 arrays: nzval(N,L) and colind(N,L), where L = max row length

    N*L reals, N*L integers

Usually L . . .


SpMV with JDS

    Neither dot nor SAXPY

    Good for vector processor: long vector length (N)

Extra memory and flops for the padded zeros; especially bad if row lengths vary a lot

    y(i) = 0.0, i = 1, N

    do j = 1, L
       do i = 1, N
          y(i) = y(i) + nzval(i, j) * x(colind(i, j))
       enddo
    enddo

Padded matrix (L = 4):

    1 a 0 0
    2 b 0 0
    c d 3 0
    e 4 f 0
    5 g 0 0
    h i 6 j
    k l 7 0


Segmented-Sum [Blelloch et al.]

    Data structure is an augmented form of CRS

    Computational structure is similar to JDS

Each row is treated as a segment in a long vector

    Underlined elements denote the beginning of each segment

    (i.e., a row in A)

Dimension: S * L ~ NNZ, where L is chosen to approximate the hardware vector length

    1 a
    2 b
    c d 3
    e 4 f
    5 g
    h i 6 j
    k l 7

stored column by column as an S-by-L array (S = 5, L = 4):

    1 d 5 j
    a 3 g k
    2 e h l
    b 4 i 7
    c f 6


SpMV with Segmented-Sum

    2 arrays: nzval(S, L) and colind(S, L), where S*L ~ NNZ

    NNZ reals, NNZ integers

    Good for vector processors

SpMV is performed bottom-up, with each row-sum (dot) of Ax stored in the beginning of each segment

    Similar to JDS, but with more control logic in inner-loop

    1 a

    2 b

    c d 3

    e 4 f

    5 g

    h i 6 j

    k l 7

    1 d 5 j

    a 3 g k

    2 e h l

    b 4 i 7

    c f 6

    do i = S, 1, -1
       do j = 1, L
          . . .
       enddo
    enddo


Performance (megaflop rate) [Gaeke et al.]

    Test matrix: N = 10000, NNZ = 177782, random pattern

~18 nonzeros per row on average; JDS does 4.6x more operations

    machine           Ultra 2i     Pentium 4    VIRAM
    Clock rate        333 MHz      1.5 GHz      200 MHz
    Peak flop rate    667 Mflops   1.5 Gflops   1.6 Gflops
    CRS               29           209          110
    JDS               27           17           632
    JDS (effective)   6            4            137
    Seg-Sum           5            29           165

(SpMV rates in Mflops)


Optimization Techniques

    Matrix reordering

For CRS SpMV, can improve x-vector locality by reducing the bandwidth of matrix A

    Example: reverse Cuthill-McKee (breadth-first search)

    Observed 2-3x improvement [Toledo, et al.]


Optimization Techniques

    Register blocking

    Find dense blocks of size r-by-c in A

    (If needed, allow some zeros to be filled in)

A*x proceeds block by block

Keep c elements of x and r elements of y in registers: each x element is re-used r times, each y element c times

The amount of indexed loads and stores is reduced

    Obtained up to 2.5x improvement [Vuduc et al.]


SPARSITY [Im, Yelick]


Performance Improvement [Vuduc et al.]


Other Representations

Block entry formats (e.g., multiple degrees of freedom are associated with a single physical location)

    Constant block size (BCRS)

    Varying block sizes (VBCRS)

    Skyline (or profile) storage (SKS)

Lower triangle stored row by row; upper triangle stored column by column

In each row (column), the first nonzero defines a profile

All entries within the profile (some may be zeros) are stored


References

Templates for the Solution of Linear Systems, Barrett et al., SIAM, 1994

BeBOP: http://bebop.cs.berkeley.edu/

Sparse BLAS standard: http://www.netlib.org/blas/blast-forum