

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Zhekai Zhang∗, Hanrui Wang∗, Song Han
EECS
Massachusetts Institute of Technology
Cambridge, MA, US
{zhangzk, hanrui, songhan}@mit.edu

William J. Dally
Electrical Engineering
Stanford University / NVIDIA
Stanford, CA, US
[email protected]

Abstract—Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner product based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer product based approach [1] suffers from poor output locality due to numerous partial product matrices. Inefficiency in the reuse of either input or output data leads to extensive and expensive DRAM access.

To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality for both input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stages of partial matrices so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4×. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces DRAM access by another 1.8×. We also resolve the increased input matrix reads induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing DRAM access by 1.5×. Evaluated on 20 benchmarks, SpArch reduces total DRAM access by 2.8× over the previous state-of-the-art. On average, SpArch achieves 4×, 19×, 18×, 17×, and 1285× speedup and 6×, 164×, 435×, 307×, and 62× energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.

Keywords-Sparse Matrix Multiplication; Domain-Specific Architecture; Specialized Accelerators; Huffman-Tree; Data Reuse

I. INTRODUCTION

Generalized sparse matrix-matrix multiplication (SpGEMM) is the key computing kernel for many algorithms such as compressed deep neural networks [2], [3], [4], [5], triangle counting [6], Markov clustering [7], searching algorithms [8], [9], and matching algorithms [10]. It is also ubiquitous in scientific and engineering applications such as grammar parsing [11], chemical molecule dynamics [12], color intersection search [13], linear solvers [14], and many other applications [15], [16], [17], [18], [19].

However, the performance of SpGEMM is memory bounded on traditional general-purpose computing platforms such as CPUs and GPUs because of the irregular memory access pattern and poor locality caused by low-density matrices [20], [21], [22]. For instance, the density of Twitter's [23] adjacency matrix is as low as 0.000214%. As Moore's Law [24] slows down, domain-specific architecture [25] becomes a promising way to improve performance and energy efficiency. OuterSPACE [1] proposed outer product based SpGEMM, which has perfect input reuse compared to the inner product based method. However, the outer product approach has poor output reuse because it produces a considerable number of partial matrices (F, G, H in Figure 1). Those partial matrices need to be stored to DRAM before merging, which incurs extensive DRAM access and cancels the benefits of good input reuse. As a result, the performance of OuterSPACE is only 10.4% of its theoretical peak.

∗Equal Contributions.


Figure 1. The proposed outer product based SpGEMM architecture jointly optimizes input and output data reuse. It achieves good output reuse with pipelined multiply and merge, matrix condensing, and the Huffman tree scheduler, and good input reuse with the row prefetcher.




Figure 2. Four innovations in SpArch. The first step is a slowdown because of the excessive memory access of partially merged results when the number of partial matrices is much larger than the merge tree parallelism, but it enables the next three optimizations. Together, they save DRAM bandwidth and achieve higher memory utilization and speedup.

In this paper, we propose SpArch1, a domain-specific accelerator that jointly optimizes input and output data reuse. We obtain input reuse with the outer product, and output reuse with on-chip partial matrix merging.

To achieve this, we design a highly parallelized merger to pipeline the two computing stages, multiply and merge. The multiply stage produces partial matrices, and the merge stage merges partial matrices into final results. For large matrices, however, the number of partial matrices exceeds the merger's parallelism. Merging only a part of the partial matrices at a time over multiple rounds leads to an increased amount of memory access for the partially merged results, which neutralizes the performance gain of pipelined multiply and merge and makes DRAM access even larger (see the first step of Figure 2). However, it is a prerequisite for the next three optimizations. Therefore, we propose a condensed matrix representation for the first input matrix, where all non-zero elements are pushed to the left, forming much denser columns. As shown in Figure 1, we condense the first input matrix D to the condensed D′. The condensed matrix D′ is loaded by condensed column. Implementation-wise, this is equivalent to storing the matrix in CSR format and fetching the elements with the same index from all rows. The second input matrix E is stored in CSR format. The original left matrix D has three columns and produces three partial matrices. After condensing, D′ has only two columns and produces only two partial matrices. On real-world benchmarks, we reduce the number of partial matrices by three orders of magnitude.

Unfortunately, the condensed representation can still produce more partial matrices than the merger's parallelism. The merge order impacts the amount of DRAM access: partial matrices merged early have to be loaded from and stored to DRAM for every future merge. Therefore, we should merge matrices with fewer non-zeros first. We design a Huffman tree scheduler to decide a near-optimal merge order. We represent each column of a sparse matrix as a leaf node, whose weight is the number of non-zeros in that column. For very sparse matrices, the number of non-zeros after a merge is approximately the sum of the numbers of non-zeros before the merge.

1 SpArch is a homophone of the word "spark", which means striking together two hard surfaces such as stone or metal; we consider this analogous to matrix condensing, a critical technique in our architecture.

Therefore, the weight of a parent node is the sum of its children's weights, exactly following the convention of a Huffman tree. A Huffman tree minimizes the sum of all nodes' weights, and we apply it to minimize the total DRAM traffic. Scheduled by a Huffman tree, we merge matrices with fewer non-zeros first and larger ones later (see the example in Figure 8). The three techniques discussed above reduce output DRAM access on 20 benchmarks by 1/5.7 × 5.4 × 1.8 = 1.7× compared to OuterSPACE (Figure 2).

However, the matrix condensing step ruins the perfect reuse of the second matrix (E in Figure 1), because one condensed column of D′ now needs all three (green, black, and pink) rows of E, whereas one column of D needs only one row of E. Therefore, we propose a row prefetcher that prefetches rows of the second matrix E and stores them in a row buffer. The buffer replacement policy is near-optimal because we see the sequence of required rows ahead of time while streaming in the first matrix, and thus we can replace the line with the farthest next use. The row buffer achieves a 62% hit rate, reducing DRAM access to the second matrix by 2.6× and largely recovering the input reuse. With the four techniques together, we reduce DRAM access by 2.8× over OuterSPACE.
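To make the replacement policy concrete, here is a minimal software sketch of farthest-next-use (Belady) replacement, which the prefetcher approximates by looking ahead in the stream of left-matrix elements. The buffer capacity and access trace below are hypothetical, and the sketch models only the eviction decision, not the hardware around it.

    # Sketch: farthest-next-use (Belady) replacement for the MatB row buffer.
    # The future access sequence is assumed known, which SpArch approximates
    # with a look-ahead FIFO over the streamed left-matrix elements.

    def simulate_row_buffer(accesses, capacity):
        """Return the hit rate of a row buffer with optimal replacement."""
        resident, hits = set(), 0
        for t, row in enumerate(accesses):
            if row in resident:
                hits += 1
                continue
            if len(resident) == capacity:
                def next_use(r):
                    # Distance to the next request of row r (inf if never used again).
                    for d, future in enumerate(accesses[t + 1:]):
                        if future == r:
                            return d
                    return float("inf")
                resident.remove(max(resident, key=next_use))
            resident.add(row)
        return hits / len(accesses)

    # Hypothetical trace of matrix-B row ids requested by the multipliers.
    trace = [0, 1, 2, 0, 3, 0, 4, 1, 2, 0, 1]
    print(simulate_row_buffer(trace, capacity=3))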

In summary, SpArch jointly optimizes the input and output reuse of SpGEMM and makes five contributions:

• Pipeline the multiply stage and the merge stage with a comparator array based, highly parallelized merger.

• Use matrix condensing to condense the first input matrix, reducing the number of partial matrices by three orders of magnitude.

• A Huffman tree scheduler that provides a near-optimal merge order for partial matrices and reduces the DRAM traffic.

• A row prefetcher that achieves a near-optimal buffer replacement policy for the second input matrix and resolves the increased DRAM reads induced by matrix condensing.

• We evaluate SpArch on real-world datasets from SuiteSparse [26], SNAP [27], and rMAT [28], achieving 4×, 19×, 18×, 17×, and 1285× speedup, and 6×, 164×, 435×, 307×, and 62× energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo.



Figure 3. Comparator Array Based Merger. Each diagonal group outputs the data on the boundary between "≥" and "<" to form a sorted array. The adder and zero eliminator will further merge the duplicated elements.

II. PROPOSED ARCHITECTURE

A. Comparator Array Based Merger

The previous state-of-the-art SpGEMM accelerator [1] processes the multiply and merge stages separately, which requires storing all partial matrices in DRAM. A better way is to pipeline the multiply and merge stages and perform the merge on chip. The partial matrix is represented in COO format as [row index, column index, value], sorted by row index and then by column index. The merger combines multiple sorted arrays into a new sorted array. A simple merger puts each array into a FIFO and selects the smallest element from the tops of all FIFOs. This method suffers from low parallelism and thus cannot fully utilize the DRAM bandwidth. Therefore, we design a parallel merge unit.

1) Parallel Merge Unit: A naïve merger walks two pointers over two arrays and compares the corresponding elements; the pointer with the smaller element moves forward. This outputs only one element per cycle. To increase the parallelism, we replace each pointer with a sliding window of size N: all elements in window A are compared with all elements in window B by an array of comparators. After the comparison, we move one of the windows forward by N, improving the throughput by N times.
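As a minimal software analogue of this idea (not the hardware itself), the sketch below merges two sorted arrays a window at a time: every element no larger than both window maxima is guaranteed final, so each step emits at least N elements instead of one.

    # Sketch: window-based merge, a software model of the parallel merge
    # unit. The window size n plays the role of the comparator array width.

    def windowed_merge(a, b, n=4):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            # One "cycle": consider windows a[i:i+n] and b[j:j+n].
            bound = min(a[min(i + n, len(a)) - 1],
                        b[min(j + n, len(b)) - 1])
            # Elements <= bound precede everything still outside the windows,
            # so the comparator array can emit them all in this step.
            while (i < len(a) and a[i] <= bound) or (j < len(b) and b[j] <= bound):
                if j >= len(b) or (i < len(a) and a[i] <= b[j]):
                    out.append(a[i]); i += 1
                else:
                    out.append(b[j]); j += 1
        out.extend(a[i:]); out.extend(b[j:])
        return out

    x, y = [1, 3, 4, 13, 19, 22], [3, 5, 10, 12, 15, 29]
    assert windowed_merge(x, y) == sorted(x + y)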

A comparator array is the core of our merge unit. As shown in Figure 3, the merger contains 4×4 comparators. The two input matrices are stored in the format [coordinate, value]. We use the comparator array to compare the coordinates of all non-zero elements in partial matrix A with those of partial matrix B.


Figure 4. Hierarchical Comparator Array. We break the input array into multiple chunks and use a top level comparator array to decide which pairs of chunks will be fed to the low level comparator arrays.

If the element from A is smaller than the element from B, the entry is '<'; otherwise it is '≥'. We pad one dummy column of '<' on the right and one dummy row of '≥' at the bottom. Then we detect the boundary between the '≥' and '<' tiles, defined as follows: 1. The top-left corner tile is a boundary. 2. The '≥' tiles in the first row are boundaries. 3. If a tile is '≥' and its top neighbor is '<', it is a boundary. 4. If a tile is '<' and its left neighbor is '≥', it is a boundary. With these rules, we mark the boundary tiles in red in Figure 3. We further divide all the tiles diagonally into eight groups; the tiles in the same group share the group index marked at the top-left of each tile. Each group has exactly one output. The output of a boundary tile is the smaller coordinate and its corresponding value among the tile's two inputs. For example, for the top-left tile, 1 < 3, so (1, 0.1) is the output of that tile and also the output of group 0.

The output of the whole comparator array is the sorted result of the two input arrays. Let the left input array be a and the top array be b. Correctness follows from the fact that a boundary tile (i, j) in the k-th group always has '<' above and '≥' on the left. If the tile is '≥', it outputs the corresponding top element bj. The '<' above indicates that a0 to ai−1 from the left array are smaller than bj; together with b0 to bj−1, there are exactly i + j = k items smaller than it. Thus the output of that tile is the k-th item in the merged result. The '<' case is similar.

All the results are generated in one clock cycle because there is no data dependency between inputs. The 4×4 comparator array can process inputs of arbitrary length: in each clock cycle, it merges eight inputs, four from each matrix, and then we shift in the following inputs to replace the previous ones, so new merged results are produced in each cycle.
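The boundary rule can be checked in software. The following sketch, a functional model of one comparator-array evaluation rather than a hardware description, builds the ≥/< grid for the two 4-element windows of Figure 3 (with the dummy row and column), finds the boundary tiles, and confirms that diagonal group k yields the k-th element of the merged result.

    # Sketch: functional model of the comparator array in Figure 3.
    # a is the left input window and b the top one; both are sorted.

    def comparator_array_merge(a, b):
        n = len(a)
        # ge[i][j] is True where a[i] >= b[j]; pad a dummy '<' column on the
        # right and a dummy '>=' row at the bottom, as described in the text.
        ge = [[a[i] >= b[j] for j in range(n)] + [False] for i in range(n)]
        ge.append([True] * (n + 1))

        out = [None] * (2 * n)
        for i in range(n + 1):
            for j in range(n + 1):
                top_lt = (i == 0) or not ge[i - 1][j]   # '<' above (rules 1-3)
                left_ge = (j == 0) or ge[i][j - 1]      # '>=' on the left (rule 4)
                boundary = (ge[i][j] and top_lt) or (not ge[i][j] and left_ge)
                if boundary and i + j < 2 * n:
                    # Group k = i + j outputs the smaller of its two inputs,
                    # which is the k-th smallest element overall.
                    out[i + j] = b[j] if ge[i][j] else a[i]
        return out

    a = [1, 3, 4, 13]   # coordinates of partial matrix A in Figure 3
    b = [3, 5, 10, 12]  # coordinates of partial matrix B in Figure 3
    assert comparator_array_merge(a, b) == sorted(a + b)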



Figure 5. A merge tree that merges four partial product matrices. We only show the coordinates of the elements here.

2) Hierarchical Parallel Merge Unit: To increase the merger parallelism, we can increase the size of the comparator array, for example from 4×4 to 12×12. Nevertheless, the number of comparators is O(n^2), where n is the side length of the comparator array, which consumes a large amount of hardware resources. Therefore, we further propose the hierarchical merger in Figure 4, which uses two levels of comparator arrays to reduce the total number of comparators.

The hierarchical merge unit contains high-level and low-level comparator arrays. We divide the input into chunks of 4, as shown at the top of Figure 4; the length of each chunk equals the input length of a low level comparator array (4 in the example). The top level array decides which chunks need to be compared by the low level arrays. The intuition is that if the largest element of chunk A is smaller than all elements of chunk B, we can skip the comparison of the two chunks. We use the top level comparator array to compare the last element of each chunk and find the boundary tiles (marked in red); since each chunk is already sorted, the last element is the largest. The boundary tiles indicate the chunk pairs for the low level comparator arrays. For example, in Figure 4, the bottom-right tile is a boundary tile, so chunks A2 and B2 form a chunk pair. We also divide the top level comparator array into diagonal groups, and each group generates exactly one chunk pair. Then we use five low level comparator arrays to process the chunk pairs in parallel.

The output of each low level comparator array is limited by a min-max bound to avoid duplicated elements; results outside the bounds are set to 0.


Figure 6. Zero Eliminator. Bit 0 of the zero count is checked to determine whether to shift by 1 in the first layer; bit 1 is checked to determine whether to shift by 2 in the second layer. Processing an input of length N takes log N cycles of latency.

If a top level boundary tile has a left/top boundary-tile neighbor, the min bound is the first element of the top/left input chunk. The min bound of the first low level array is the smallest element, the max bound of each low level array is the min bound of the next one, and the max bound of the last low level array is +∞.

The outputs of the low level arrays are then concatenated to form the overall output. In this way, we save the low level comparators for the non-boundary tiles of the high level array. Mathematically, the number of comparators is reduced to O(n^{4/3}): choosing an n^{2/3} × n^{2/3} top level comparator array and n^{1/3} × n^{1/3} low level comparator arrays, we can process n elements at a time using only (2n^{2/3} − 1) · (n^{1/3})^2 + (n^{2/3})^2 = O(n^{4/3}) comparators.
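For a concrete sense of scale (our arithmetic, applying the formula above): with n = 64, this gives a 16×16 top level array plus 31 low level 4×4 arrays, i.e., (2 · 16 − 1) · 4^2 + 16^2 = 752 comparators, versus 64^2 = 4096 for a flat comparator array.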

3) Merge Tree: Using the hierarchical merger, we get a highly parallelized binary streaming merger that can merge up to 16 elements per cycle. It merges two arrays into one. To merge more arrays into one, we stack multiple binary mergers to form a merge tree. As shown in Figure 5, the merge tree is a full binary tree in which each node is a FIFO in hardware. Input arrays are fed to the leaf nodes, and the output array is collected from the root node. A binary merger merges the elements from the child FIFOs and stores the results into the parent FIFO. The throughput of the whole tree is bounded by the root node, which has only one merger; therefore, each layer shares one merger to balance the throughput.

Figure 5 illustrates a merge tree of 2 layers, where each layer is equipped with a 1×1 merger; only the coordinates of elements are shown. Four arrays A, B, C, and D are to be merged. They are loaded into the leaf FIFOs. In the first 4 cycles, the lower merger merges data from the leaf FIFOs and stores the results into the middle FIFOs. Then the upper merger notices there is enough data in the middle FIFOs, so it merges them and stores the result into the topmost FIFO. When the root FIFO is full, data is written to DRAM. The merge tree continues working in this streaming fashion until all data fed to the lowest FIFOs reaches the root, and the four arrays are merged entirely on-chip.
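Below is a minimal functional model of the merge tree. For clarity it drains each level completely before moving up, whereas the hardware streams all levels concurrently with bounded FIFOs; the input values are the coordinates from Figure 5.

    # Sketch: merge tree as levels of FIFOs. Siblings are merged pairwise
    # into their parent FIFO until one sorted array remains at the root.
    from collections import deque

    def merge_tree(arrays):
        assert (len(arrays) & (len(arrays) - 1)) == 0, "needs a full binary tree"
        level = [deque(a) for a in arrays]          # leaf FIFOs
        while len(level) > 1:
            parents = []
            for left, right in zip(level[::2], level[1::2]):
                parent = deque()                    # the FIFO one level up
                while left or right:                # binary streaming merge
                    if not right or (left and left[0] <= right[0]):
                        parent.append(left.popleft())
                    else:
                        parent.append(right.popleft())
                parents.append(parent)
            level = parents
        return list(level[0])

    A, B = [24, 26, 31, 52], [22, 28, 42, 44]
    C, D = [11, 13, 15, 21], [12, 14, 16, 17]
    assert merge_tree([A, B, C, D]) == sorted(A + B + C + D)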

4) Adders and Zero Eliminator: The merger described above only merges the elements and leaves same-location elements alone, i.e., elements with the same row and column index. In SpGEMM, however, we need to add those elements. Therefore, we connect a slice of adders right after the merger, which adds adjacent same-location elements and sets one of them to zero. Then we use a Zero Eliminator to compress these zeros and output the dense results.



Figure 7. Matrix Condensing. The sparse matrix is condensed to the left, reducing the number of columns and thus the number of partial matrices. It can be stored naturally in CSR format.

The Zero Eliminator consists of two parts. The first part is a prefix sum module that computes the number of zeros (zero count) before each element. The second part is a modified log N-layer shifter: each layer contains N MUXs that shift the input array and zero count by 1, 2, 4, ... positions. Unlike a traditional shifter in which the MUXs share a common control signal, the MUXs in the Zero Eliminator are controlled by the zero count signal of each element. Figure 6 shows an example of the Zero Eliminator.
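A functional sketch of this structure follows, reproducing the worked example from Figure 6: a prefix sum counts the zeros before each element, then layer l shifts an element left by 2^l whenever bit l of its zero count is set.

    # Sketch: Zero Eliminator = prefix sum + log N shift layers. Nonzero
    # elements are compacted to the left; their order is preserved.

    def zero_eliminate(data):
        n = len(data)
        # Prefix sum: number of zeros strictly before each element.
        zero_count, zeros = [], 0
        for x in data:
            zero_count.append(zeros)
            zeros += (x == 0)
        out, cnt = list(data), list(zero_count)
        layer = 0
        while (1 << layer) < n:
            shift = 1 << layer
            nxt, ncnt = [0] * n, [0] * n
            for i in range(n):
                if out[i] == 0:
                    continue  # zeros are dropped; survivors fill the gaps
                # Each element's own zero_count bit controls its MUX.
                j = i - shift if (cnt[i] >> layer) & 1 else i
                nxt[j], ncnt[j] = out[i], cnt[i]
            out, cnt = nxt, ncnt
            layer += 1
        return out

    # The example from Figure 6: two layers shift by 1, then by 2.
    assert zero_eliminate([1, 0, 0, 2, 3, 0, 4, 0]) == [1, 2, 3, 4, 0, 0, 0, 0]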

B. Matrix Condensing

Due to the limitation of hardware resources, we can only afford to merge 64 arrays on chip. However, the number of intermediate partial matrices generated by the multiplier can be as large as 10,000 to 1,000,000, which requires multiple rounds of 64-way merging and still consumes a considerable amount of DRAM bandwidth. Therefore, we propose matrix condensing, which merges multiple columns into a single column and reduces the number of partial matrices to 100 to 1,000 on our benchmarks.

The motivation of matrix condensing is that if we have two columns a1, a2 from the left matrix that have no elements sharing the same row index, we can merge them (i.e., combine them into a new array sorted by row index) while keeping the original column indices. When we process the merged column in the multiply phase, we multiply each element by its corresponding row (the same as its original column index) in the second input matrix B. The result is the same as the merge result of a1 × B and a2 × B. We thus use a cheap merge of the left matrix to replace an expensive merge of the much longer multiplied results, reducing the number of partial matrices.

Furthermore, since exchanging elements between different columns does not affect the final result, we condense all elements in a row to the leftmost columns. In this way, the number of columns of the condensed left matrix is far less than that of the original one, and the number of partial product matrices, which equals the number of condensed columns, is reduced by three orders of magnitude.

Figure 7 shows a matrix in condensed format. The number of columns is reduced from 16 to 12, which equals the length of the longest row in the original matrix. In real-world datasets, we can reduce it from 100,000 to 100–1,000, which is much closer to the size of the merge tree.

We store the left matrix in CSR format. The elements in CSR directly map to those in the condensed format: the i-th element in a CSR row is in the i-th column of the condensed format. The CSR format and our condensed format are two different views of the same data. In the view of DRAM, the matrix is loaded by rows, but in the view of a port of the merge tree, the matrix is loaded by condensed columns. This is achieved by the data loader, which dispatches data to different ports according to their condensed columns. As in Figure 7, if the merger has a parallelism of 4, we load four condensed columns together. The dashed frames show the load sequence. After loading an element from the left matrix, we use its original column index to fetch a row from the right matrix and feed them to the multipliers. The multiplied results are fed to the port whose index equals the condensed column index (not the original column index).
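To make the two views concrete, here is a rough sketch, in the spirit of our C++ simulator, of how a loader could walk the CSR arrays and route each nonzero to a merger port by its condensed column; fetch_row and send_to_port are hypothetical stand-ins for the MatB row prefetcher and the merge-tree input FIFOs.

```cpp
#include <vector>

struct CsrMatrix {                 // left matrix stored in CSR
    std::vector<int>    row_ptr;   // size: rows + 1
    std::vector<int>    col_idx;   // original column index of each nonzero
    std::vector<double> val;
};

// Dispatch sketch: the i-th nonzero of a CSR row lives in condensed
// column i, so with a P-way merger we stream the condensed columns
// [base, base + P) together and route each element to port
// (condensed_col - base).
void load_condensed_columns(const CsrMatrix& A, int base, int P) {
    const int rows = static_cast<int>(A.row_ptr.size()) - 1;
    for (int r = 0; r < rows; ++r) {                 // DRAM view: by rows
        for (int i = A.row_ptr[r]; i < A.row_ptr[r + 1]; ++i) {
            int condensed_col = i - A.row_ptr[r];    // position within row
            if (condensed_col < base || condensed_col >= base + P)
                continue;                            // not in this round
            int b_row = A.col_idx[i];  // original column -> row of B
            // multiply A.val[i] with the prefetched row of B and feed
            // the products to merger port (condensed_col - base):
            // send_to_port(condensed_col - base, A.val[i], fetch_row(b_row));
        }
    }
}
```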

C. Huffman Tree Scheduler

The merge unit can merge 64 matrices on-chip. After matrix condensing, the number of partial matrices can still exceed 64, so the excess must be written to DRAM and merged later. The order of the merge matters: the earlier a matrix is merged, the more rounds of DRAM read and write it needs. Intuitively, we should merge sparser partial product matrices first: since they have fewer non-zeros, even if they travel in and out of DRAM many times, the number of accesses remains small. We should leave denser matrices to be merged later. To choose the order wisely, we use a Huffman tree scheduler to minimize memory access over the whole task.

Our scheduler abstracts the entire merge process as a tree. In Figure 8, we show the 2-way and 4-way Huffman tree schedulers when the merger can merge 2 and 4 matrices at the same time. We also show a 2-way sequential scheduler without the Huffman tree scheduler for comparison.

The leaf nodes in the tree represent the initial multiplied results of a column of the left matrix with the right matrix. The internal nodes represent the partially merged results. The weights of the nodes represent the sizes of the partially merged results or of the initial multiplied results. For internal nodes, we estimate the weight by adding up the weights of the children, because the matrix is very sparse and additions during the merge stage are relatively rare. The arrows pointing from nodes to their parent node represent a round of multiply-and-merge operation.

The memory access amount of all partially merged results equals the sum of all internal node weights. Since the root node and leaf node weights do not depend on the tree shape, the optimization goal is to minimize the total weight of all nodes, which also equals the sum of the weights of the leaf nodes multiplied by their depths in the tree.



Figure 8. The Huffman tree scheduler gives a near-optimal order for merging partial matrices, minimizing the total memory access. The weights of the nodes represent the numbers of non-zeros in the partial matrices; the total DRAM access is proportional to the sum of the node weights.


A k-ary Huffman tree is proven to be the optimal solution for minimizing this sum. It puts larger leaf nodes nearer to the root and smaller ones farther away, so that the accumulated products of weights and depths are minimized. This is exactly what we need to minimize the total DRAM traffic. In each round of the Huffman tree construction, we select the k un-merged nodes with minimal weights and merge them into an internal node, except in the first round. In the first round, the number of nodes to choose is calculated with Equation 1; this guarantees that the last round always merges k nodes, so the root node of the tree is always full.

$$k_{\text{init}} = \big((\mathit{num}_{\text{condensed col}} - 2) \bmod (\mathit{num}_{\text{merger way}} - 1)\big) + 2 \qquad (1)$$

In the example (Figure 8), a 2-way Huffman scheduler reduces the total weight from 365 to 354, and a 4-way Huffman scheduler further reduces it to 228. The weight is proportional to the DRAM traffic; therefore the total weight shrinks as the parallelism of the merger increases. However, high parallelism usually comes at the cost of high power consumption and large area. We choose a 64-way Huffman tree scheduler and will discuss the trade-offs in Section III.

In our real implementation, the Huffman tree is built on the fly with a priority queue, reflecting the standard algorithm for building a Huffman tree. We first add the weights of the leaf nodes to the queue and sort them. For an m-way merger, in each iteration the first m partial matrices are merged, the weight of the merged matrix is added to the queue, and the queue is sorted again.
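A minimal C++ sketch of this scheduler is shown below, assuming weights are estimated non-zero counts; it uses a min-heap in place of the re-sorted queue and applies Equation 1 to pad the first round, returning the sum of internal-node weights (the tree-shape-dependent portion of the DRAM traffic).

```cpp
#include <functional>
#include <queue>
#include <vector>

// Sketch of the on-the-fly Huffman scheduler. Assumes ways >= 2 and a
// non-empty weight list. Each weight is the estimated non-zero count
// of an un-merged partial matrix.
long long huffman_schedule(const std::vector<long long>& leaves, int ways) {
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>>
        q(leaves.begin(), leaves.end());
    int n = static_cast<int>(leaves.size());
    int k = (n - 2) % (ways - 1) + 2;    // Equation 1: first-round fan-in
    long long internal_weight_sum = 0;
    while (q.size() > 1) {
        long long merged = 0;
        for (int i = 0; i < k && !q.empty(); ++i) {
            merged += q.top();           // take the k lightest matrices
            q.pop();
        }
        internal_weight_sum += merged;   // this round's merged result
        q.push(merged);                  // it becomes a candidate again
        k = ways;                        // later rounds are always full
    }
    return internal_weight_sum;
}
```

On the twelve leaf weights of Figure 8 (15, 15, 13, 12, 9, 7, 3, 2, 2, 2, 2, 2), this returns 270 for a 2-way merger and 144 for a 4-way merger; adding the leaf total of 84 reproduces the figure's node-weight totals of 354 and 228. For the 4-way case, Equation 1 gives k_init = (12 − 2) mod (4 − 1) + 2 = 3, so the first round merges only the three smallest matrices and every later round is full.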

D. Row Prefetcher

Matrix condensing reduces the number of partial matrices but ruins the input reuse of the second operand matrix, since one column after condensing requires elements from different rows of the second matrix. We solve the problem with a row prefetcher that uses a near-optimal buffer replacement policy.

The prefetcher is motivated by the fact that we can predict the access order of the right matrix ahead of time. As mentioned in Section II-B, we access the left matrix from top to bottom. Using the column indices of the left matrix, we can deduce the access order of the right matrix, which the prefetcher uses to load data before they are consumed by the multipliers.
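As a small sketch of this prediction, assuming the upcoming sequence of required B-row indices has already been extracted from the left matrix, a single backward pass yields each access's next-use time, which is what the replacement policy described below consumes.

```cpp
#include <unordered_map>
#include <vector>

// Sketch of the Distance List Builder: given the predicted sequence of
// required rows of B (from the left matrix's column indices), compute
// for each access the time of the *next* access to the same row. The
// prefetcher uses these distances to pick which buffer line to spill.
std::vector<long> next_use_times(const std::vector<int>& required_rows) {
    const long NEVER = static_cast<long>(required_rows.size());
    std::unordered_map<int, long> next_seen;  // row -> next access time
    std::vector<long> next_use(required_rows.size(), NEVER);
    for (long t = static_cast<long>(required_rows.size()) - 1; t >= 0; --t) {
        auto it = next_seen.find(required_rows[t]);
        if (it != next_seen.end()) next_use[t] = it->second;
        next_seen[required_rows[t]] = t;
    }
    return next_use;
}
```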

The prefetcher has two functions: first, fetching the data used by the multipliers ahead of time to hide the DRAM latency; second, caching the fetched data in a buffer for future reuse to reduce the amount of DRAM reading.

The first function is achieved by multiple data fetchers. We use one data fetcher for each DRAM channel. Accesses to different DRAM channels and banks are overlapped; thus, the DRAM latency can be hidden.

The second function is achieved by an on-chip buffer, which stores rows of the right matrix. When new data is fetched, using the predicted access order, we replace the line with the furthest next use, which achieves near-optimal reuse if the prediction is accurate.
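A minimal sketch of this replacement policy follows, assuming the next-use times computed above and tracking whole rows rather than individual buffer lines; the hash map stands in for the hardware hash table, and a linear scan stands in for the reduction tree.

```cpp
#include <algorithm>
#include <unordered_map>

// Sketch of the near-Belady buffer policy: every buffered row of B
// carries the time of its next use; on a miss with a full buffer, we
// evict the row whose next use is furthest in the future.
struct RowBuffer {
    size_t capacity;
    std::unordered_map<int, long> next_use;  // B-row index -> next use

    bool access(int row, long next_use_time) {
        bool hit = next_use.count(row) > 0;
        if (!hit && next_use.size() == capacity) {
            auto victim = std::max_element(
                next_use.begin(), next_use.end(),
                [](const auto& a, const auto& b) {
                    return a.second < b.second;
                });
            next_use.erase(victim);          // spill furthest-next-use row
        }
        next_use[row] = next_use_time;       // insert or refresh
        return hit;
    }
};
```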

In Figure 9, we show a simplified example. The left column is one condensed column from the condensed first matrix. Each element is marked with its original column; the first three come from original columns 1, 0, and 2. On the top, we show rows 0 to 4 of the second matrix and divide them according to the length of a buffer line. The central part of the figure shows the buffer at different time steps. The red frames highlight the new contents after each time step.

In time step 0, we load Row 1. In time step 1, we load Row 0. In time step 2, we first store R2-0. Then we need to spill some buffer lines to continue loading R2. We spill R0 first, because R0 will be used 7 time steps later while R1 will be used 3 time steps later. However, after spilling all R0 lines, R2 still has remaining parts, so we have to spill R1.



Figure 9. The row prefetcher and the buffer replacement policy. The prefetcher fetches rows of the second matrix in advance to hide the DRAM latency. Buffer lines are replaced according to the pre-calculated sequence of required rows and can thus achieve near-Belady-optimal reuse.


Figure 10. System architecture of the proposed SpArch.

Spilling a row line by line instead of spilling it as a whole brings benefits. For example, from time step 7 to time step 8, we only need to load R3-0 and do not need to reload R3-1 and R3-2, because they are already in the buffer.

E. System Architecture

The system architecture of SpArch is shown in Figure 10. The left matrix A is stored in CSR format and fetched by condensed column; the right matrix B is stored in CSR format in HBM. The MatA Column Fetcher receives control instructions from the software scheduler, calculates the addresses of the data in the selected columns, and fetches the elements from the left matrix. The fetched elements are then sent to a look-ahead FIFO. The Distance List Builder processes the look-ahead FIFO and calculates the next use time of each row. The row index and next use time are provided to the MatB Row Prefetcher to prefetch rows and decide which buffer line to spill, as illustrated in Section II-D. To perform the associative search, we use a hash table to map row indexes to positions in the buffer and a reduction tree over next use times to decide which line to spill. As the width of the hash table is much smaller than the buffer itself, and the next use times change rarely (at most one line per cycle), the power consumption of this logic is minimal.

After prefetching rows, the multiplier array conducts the outer product, generates partial matrices in the COO format, and sends them to the merge tree. There is a MUX before each lowest-level FIFO of the merge tree, which can be configured to accept data either from the Multiplier Array or from the Partial Matrix Fetcher. If the MUXs are configured to the Partial Matrix Fetcher, the scheduler sends the addresses of the corresponding partially merged results, and the fetcher fetches the requested matrix once the FIFO is near empty. The output of the merge tree is connected to the Partial Matrix Writer, which buffers partial matrices before they are written to DRAM and also converts the internal COO format to the CSR format if it is the final result.
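Because the merge tree emits its output already sorted in row-major order, the COO-to-CSR conversion in the Partial Matrix Writer reduces to a single streaming pass; the following is a minimal sketch under that assumption, with illustrative names rather than the hardware's.

```cpp
#include <vector>

struct CooElem { int row, col; double val; };

// Sketch of the writer's final conversion: COO input assumed sorted in
// row-major order, so CSR row pointers are built in one pass, no sort.
void coo_to_csr(const std::vector<CooElem>& coo, int num_rows,
                std::vector<int>& row_ptr, std::vector<int>& col_idx,
                std::vector<double>& val) {
    row_ptr.assign(num_rows + 1, 0);
    col_idx.clear(); val.clear();
    for (const CooElem& e : coo) {
        row_ptr[e.row + 1]++;           // count nonzeros per row
        col_idx.push_back(e.col);       // already in row-major order
        val.push_back(e.val);
    }
    for (int r = 0; r < num_rows; ++r)  // prefix-sum into row offsets
        row_ptr[r + 1] += row_ptr[r];
}
```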

III. EVALUATION

A. Methodology

To evaluate the performance of our design, we built a cycle-accurate simulator in C++ to model the exact behavior of the hardware. Each module is abstracted as a class with a clock update method, which updates the internal state of the module in each cycle, and a clock apply method, which simulates the flip-flops in the circuit to make sure signals are updated correctly. The parameters used in the simulation are listed in Table I.
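As a rough sketch of this two-phase pattern (not the simulator's actual class hierarchy), each module computes its next state from the current signals and only commits it after every module has been updated, so evaluation order within a cycle does not matter.

```cpp
#include <vector>

// Two-phase cycle-accurate simulation: clock_update computes next
// state from current inputs; clock_apply latches it, emulating
// flip-flop behavior.
struct Module {
    virtual void clock_update() = 0;  // compute next state
    virtual void clock_apply()  = 0;  // latch next state
    virtual ~Module() = default;
};

struct Counter : Module {            // trivial example module
    int state = 0, next = 0;
    void clock_update() override { next = state + 1; }
    void clock_apply()  override { state = next; }
};

void simulate(std::vector<Module*>& modules, long cycles) {
    for (long c = 0; c < cycles; ++c) {
        for (Module* m : modules) m->clock_update();
        for (Module* m : modules) m->clock_apply();
    }
}
```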


Array Merger: 16×16 hierarchical merger (4×4 top level + 4×4 low level) with 64-bit index (32 bits for row and 32 bits for column) and 64-bit value, running at a 1 GHz clock frequency.
Merge Tree: 6 layers of array mergers, able to merge at most 64 arrays simultaneously.
Multiplier: 2 groups, each consisting of 8 double-precision floating-point multipliers.
MatA Column Fetcher: look-ahead buffer of 8192 elements; 64 fetchers support 64 columns of the left matrix.
MatB Row Prefetcher: prefetch buffer of 1024 lines × 48 elements per line × 12 bytes per element; 16 fetchers load data from 16 DRAM channels; each can prefetch up to 48 rows before use.
Partial Matrix Fetcher: supports fetching up to 64 partial results.
Partial Matrix Writer: a FIFO of 1024 elements before they are stored into DRAM.
Main Memory: 16 × 64-bit HBM channels, each providing 8 GB/s of bandwidth.

Table I. Architectural setup of SpArch.

                          SpArch           OuterSPACE
Technology                40 nm            32 nm
Area                      28.49 mm²        87 mm²
Power                     9.26 W           12.39 W
DRAM                      HBM @ 128 GB/s   HBM @ 128 GB/s
Bandwidth Utilization     68.6%            48.3%

Table II. Comparison with prior state-of-the-art OuterSPACE on area, power, and memory bandwidth utilization.


To measure the area and power consumption, we model all the logic on the data path, which consists of the array mergers, arithmetic units, FIFOs, the row prefetcher, and DRAM. We implement the array merger in Verilog and synthesize it using Synopsys Design Compiler (DC) under the TSMC 40 nm library. We generate the annotated toggle rate using the xsim RTL simulator with real input data extracted from the simulator and dump it into the Switching Activity Interchange Format (SAIF) to estimate the power consumption with Design Compiler. We then use the area and power data from [29], together with the numbers of multiplications and additions from the simulator, to model the arithmetic units (floating-point multipliers and adders). We also dump the size, width, and amount of reading/writing of each SRAM/FIFO from the simulator and use CACTI to estimate the energy and area of the FIFOs and prefetch buffers in the circuit. From the simulator, we also obtain the exact amounts of DRAM reads and writes, which we use to estimate our DRAM power consumption according to [30], [31].

We use multiple platforms as our baselines, including a desktop CPU, a GPU, a mobile CPU, and the state-of-the-art ASIC accelerator OuterSPACE. For the desktop CPU, we use Intel MKL [32] on a 6-core Intel Core i7-5930K CPU and measure its power with the pcm-power tool. MKL provides math routines parallelized by OpenMP for various computational workloads.

              Energy (nJ/FLOP)        Area (mm²)
              OuterSPACE   SpArch     OuterSPACE   SpArch
Computation   3.19         0.26       49.1         4.1
SRAM          0.35         0.34       37.5         24.4
DRAM          1.20         0.29       N/A          N/A
Crossbar      0.21         N/A        0.1          N/A
Overall       4.95         0.89       86.7         28.5

Table III. Energy and power breakdown. Savings of SpArch are from less DRAM access and more efficient logic.

For the GPU, we use both cuSPARSE [33] and CUSP [34], [35] and run them on an NVIDIA TITAN Xp GPU. cuSPARSE parallelizes the computation across matrix rows and then merges the partial results of each row with a hash table. CUSP also computes matrix rows in parallel and then sorts and merges the different rows. Power consumption is measured with nvidia-smi. For the mobile CPU, we use the Armadillo library [36], [37] on a 4-core ARM A53 CPU and measure its dynamic power using a power meter.

In order to provide fair comparison baselines, the data type of all baselines and of our architecture is double. For MKL, cuSPARSE, CUSP and Armadillo, we discard memory allocation and transfer time and only measure the core execution time of the SpGEMM functions, i.e., mkl_sparse_spmm, cusparseDcsrgemm, cusp::generalized_spgemm and the overloaded "*" operator, respectively. For all power measurements, we first measure the idle power of the system, then repeatedly run the workloads and measure the total power; the dynamic power is the total power minus the idle power.

B. Experimental Results

We obtain the area and power consumption under TSMC 40 nm technology and compare them with the state-of-the-art SpGEMM accelerator OuterSPACE (Table II and Table III). For a fair comparison, we use the same DRAM power


[Chart and table data recovered from the figures on this page: starting from the OuterSPACE baseline, adding the streaming merger alone gives a 17.7× slowdown; adding matrix condensing turns this into a 30.0× speedup; the Huffman tree scheduler adds another 1.5× and the row prefetcher another 1.8×. Design-space sweeps: growing the comparator array from 1×1 to 16×16 raises throughput from 1.28 to 10.45 GFLOPS (DRAM access drops from 216.1 MB to 208.2 MB); growing the prefetch buffer from 1024×24 to 1024×96 raises throughput from 10.21 to 10.57 GFLOPS (216.5 MB down to 203.4 MB); the look-ahead FIFO peaks at 10.45 GFLOPS at 8192 entries; deepening the merge tree from 2 to 7 layers raises throughput from 4.13 to 10.45 GFLOPS (645.3 MB down to 204.2 MB); reshaping the merger FIFO from 1024×48 to 256×192 lowers throughput from 10.45 to 9.26 GFLOPS. The per-matrix tables behind Figures 11 and 12 give geometric-mean speedups of 4.15×, 18.67×, 17.56×, 16.55× and 1284.83×, and geometric-mean energy savings of 6.07×, 163.89×, 435.27×, 306.71× and 62.20×, over OuterSPACE, MKL, cuSPARSE, CUSP and Armadillo, respectively. An rmat sweep against MKL shows SpArch's throughput degrading only 2.7× as density falls from 6e-3 to 5e-5, versus 5.9× for MKL.]

Figure 11. Speedup of SpArch over OuterSPACE, MKL, cuSPARSE, CUSP and ARM Armadillo on real-world matrices. We get a consistent speedup over the state-of-the-art accelerator OuterSPACE.


Figure 12. Energy saving of SpArch over OuterSPACE, MKL, cuSPARSE, CUSP and ARM Armadillo.

[Figure 12 data] Energy saving of SpArch over each baseline:

Benchmark       | Over OuterSPACE | Over MKL | Over cuSPARSE | Over CUSP | Over Armadillo
2cubes_sphere   | 5.23            | 164.48   | 248.99        | 224.74    | 48.88
amazon0312      | 6.89            | 207.71   | 1036.30       | 395.90    | 96.99
ca-CondMat      | 4.55            | 127.28   | 789.94        | 296.42    | 65.64
cage12          | 6.13            | 215.03   | 258.45        | 275.94    | 51.71
cit-Patents     | 7.26            | 842.58   | 6043.03       | 1517.88   | 240.97
cop20k_A        | 5.33            | 164.58   | 191.66        | 408.49    | 48.31
email-Enron     | 4.08            | 81.48    | 398.83        | 422.51    | 51.21
facebook        | 5.58            | 196.37   | 139.15        | 360.03    | 32.76
filter3D        | 5.07            | 77.74    | 207.62        | 178.78    | 47.83
m133-b3         | 9.98            | 116.18   | 275.77        | 200.07    | 59.34
mario002        | 5.71            | 148.45   | 204.01        | 127.08    | 69.38
offshore        | 5.12            | 139.59   | 427.80        | 161.76    | 61.49
p2p-Gnutella31  | 7.49            | 143.21   | 693.41        | 292.29    | 69.60
patents_main    | 8.78            | 581.29   | 912.70        | 216.98    | 139.47
poisson3Da      | 5.38            | 94.69    | 326.91        | 251.09    | 45.36
roadNet-CA      | 6.99            | 116.30   | 105.48        | 78.17     | 42.56
scircuit        | 8.05            | 126.66   | 165.84        | 174.49    | 48.01
web-Google      | 6.09            | 338.37   | 3619.67       | 2499.11   | 123.45
webbase-1M      | 5.64            | 144.97   | 570.07        | 403.92    | 43.76
wiki-Vote       | 5.18            | 88.41    | 471.55        | 490.29    | 40.68
Geo Mean        | 6.07            | 163.89   | 435.27        | 306.71    | 62.20

[Figure 13 data] Area and power breakdown of SpArch per module:

Module             | Area (mm²) | Area share | Power (mW) | Power share
Column Fetcher     | 2.64       | 9.3%       | 101.39     | 1.2%
Row Prefetcher     | 5.80       | 20.4%      | 1155.72    | 13.5%
Multiplier Array   | 0.45       | 1.6%       | 73.10      | 0.9%
Merge Tree         | 17.27      | 60.6%      | 4738.47    | 55.4%
Partial Mat Writer | 2.34       | 8.2%       | 243.04     | 2.8%
HBM                | -          | -          | 2240.4     | 26.2%

Figure 13. Area (a) and power (b) breakdown. The merge tree, as the core of SpArch, takes most of the area and power.

The DRAM power is estimated with the same model as OuterSPACE, which assumes 42.6 GB/s/W. We also analyze the power and area per module: Figure 13 shows the area and power breakdown of SpArch. Most of the energy and area is consumed by the merge tree, which is the core part of SpArch. We evaluate the performance of SpArch on the SuiteSparse dataset [26] and the SNAP dataset [27], the same as [1].

Figure 11 compares SpArch with the baselines. On average, SpArch is 4×, 19×, 18×, 17× and 1285× faster than OuterSPACE, MKL, cuSPARSE, CUSP and ARM Armadillo, respectively, and it outperforms the baselines on every matrix. This mainly comes from reducing the redundant memory access for partial matrices, with 2.8× less DRAM access compared to OuterSPACE. We also achieve higher memory bandwidth utilization than OuterSPACE, by virtue of hiding the DRAM latency with the row prefetcher and the regular write pattern of the streaming merge tree. Figure 12 shows the relative energy savings: SpArch achieves 6×, 164×, 435×, 307× and 62× energy saving compared to OuterSPACE, MKL, cuSPARSE, CUSP and ARM Armadillo. This is because our architecture is specialized for SpGEMM: it removes components that general-purpose processors need, such as caches and instruction decoders, and uses a carefully designed computational unit that is more efficient on this specific task.
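To make the multiply-then-merge flow concrete, the following is a minimal software sketch (our illustration of the outer-product dataflow, not the hardware datapath): each column k of A times row k of B yields one sorted partial matrix, and a multi-way merge then accumulates all partial matrices into C.

import heapq
from collections import defaultdict

def outer_product_spgemm(A_cols, B_rows):
    # A_cols[k]: nonzeros (i, val) of column k of A;
    # B_rows[k]: nonzeros (j, val) of row k of B.
    partials = []
    for k in A_cols:                        # multiply stage: one partial
        if k not in B_rows:                 # matrix per outer product
            continue
        pm = sorted(((i, j), av * bv)
                    for (i, av) in A_cols[k]
                    for (j, bv) in B_rows[k])
        partials.append(pm)                 # each partial is (row, col)-sorted
    C = defaultdict(float)
    for key, v in heapq.merge(*partials):   # merge stage: combine partials,
        C[key] += v                         # summing duplicate coordinates
    return dict(C)

A_cols = {0: [(0, 1.0), (2, 3.0)], 1: [(1, 2.0)]}
B_rows = {0: [(0, 4.0)], 1: [(0, 5.0), (2, 6.0)]}
print(outer_product_spgemm(A_cols, B_rows))
# {(0, 0): 4.0, (1, 0): 10.0, (1, 2): 12.0, (2, 0): 12.0}

In SpArch the merge happens on chip, in a streaming fashion, right after multiplication; that pipelining is what removes most of the partial-matrix DRAM traffic.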

[Figure 14 data] Performance on rMAT benchmarks (GFLOPS); density ranges from 6×10⁻³ (rmat-5k-x32) down to 5×10⁻⁵ (rmat-80k-x4). Degradation from densest to sparsest: 2.7× for SpArch vs 5.9× for MKL.

Benchmark     | SpArch | MKL
rmat-5k-x32   | 13.87  | 1.36
rmat-5k-x16   | 12.46  | 0.98
rmat-10k-x32  | 10.09  | 1.18
rmat-5k-x8    | 9.39   | 0.76
rmat-10k-x16  | 9.54   | 0.76
rmat-20k-x32  | 7.87   | 0.94
rmat-5k-x4    | 6.58   | 0.35
rmat-10k-x8   | 8.59   | 0.68
rmat-20k-x16  | 7.37   | 0.74
rmat-40k-x32  | 6.44   | 0.78
rmat-10k-x4   | 6.72   | 0.29
rmat-20k-x8   | 7.46   | 0.60
rmat-40k-x16  | 6.08   | 0.63
rmat-20k-x4   | 6.85   | 0.29
rmat-40k-x8   | 6.38   | 0.52
rmat-80k-x16  | 5.22   | 0.52
rmat-40k-x4   | 6.37   | 0.23
rmat-80k-x8   | 5.55   | 0.38
rmat-80k-x4   | 5.84   | 0.24
Geo Mean      | 7.54   | 0.57

Figure 14. Performance on rMAT benchmarks. Our performance is not only higher than MKL's, but also more stable when the matrices get sparser.

Figure 14 shows the performance comparison on syn-thesized rMAT [28] data with Intel MKL. The density ofmatrices ranges from 6×10−3 to 5×10−5. SpArch achievesover 10× speedup while keeping a relatively stable perfor-mance: the performance only drops by 2.7× when matricesare sparser, while MKL suffers from larger performancedegradation. The lower impact on matrix density partiallyderives from the outer product algorithm, which is notsensitive to the density. It also relates to our architecturedesign which uses a single but large merge tree rather thanmany small processing elements. The latter one will bemore likely to suffer from the problem of load balancing,especially when matrices become sparser.
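The rMAT naming appears to encode size and density: rmat-Nk-xE reads as roughly N·1000 rows with an average of E non-zeros per row, which matches the quoted densities (32/5000 ≈ 6×10⁻³, 4/80000 = 5×10⁻⁵). As a hedged illustration of how such inputs are produced, here is a minimal R-MAT generator following [28]; the quadrant probabilities (0.57, 0.19, 0.19, 0.05) are a common default and an assumption on our part, not a parameter stated in the text.

import random

def rmat_edges(n_log2, n_edges, p=(0.57, 0.19, 0.19, 0.05), seed=0):
    # Yield (row, col) pairs for a 2**n_log2 square R-MAT matrix by
    # recursively choosing one of four quadrants per bit of the index.
    rng = random.Random(seed)
    for _ in range(n_edges):
        r = c = 0
        for _ in range(n_log2):
            q = rng.choices(range(4), weights=p)[0]
            r = (r << 1) | (q >> 1)   # quadrant high bit -> row bit
            c = (c << 1) | (q & 1)    # quadrant low bit  -> col bit
        yield r, c

edges = list(rmat_edges(13, 8192 * 4))   # roughly an "rmat-8k-x4" instance
print(len(edges), edges[:3])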

To understand how far our design is from the theoretically optimal solution, we use the roofline model to analyze our performance. Figure 15 shows the result of the roofline analysis. We calculate the theoretical operational intensity of the outer product on our dataset, which equals the number of operations divided by the size of the two input matrices plus the size of the merged final results; it comes to 0.19 Flops/Byte. The theoretical computation roof is 32 GFLOPS/s in our design: we use 16 multipliers running at 1 GHz, so the peak multiplication performance is 16 GFLOPS/s, and the overall peak performance (multiplication + addition) is 32 GFLOPS/s.


Figure 15. Roofline model for SpArch and OuterSPACE. The theoretical operational intensity is 0.19 Flops/Byte. The theoretical computation roof is 32 GFLOPS/s. SpArch achieves 10.4 GFLOPS/s while OuterSPACE achieves 2.5 GFLOPS/s. Our performance is 4× higher than OuterSPACE by virtue of less redundant DRAM traffic for partial matrices.

The bandwidth roof at our operational intensity is 23.9 GFLOPS/s, 2.3× above our performance and 9.6× above OuterSPACE's. We are much nearer to the roof than OuterSPACE is.
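For concreteness, the roofline numbers above can be reproduced with a few lines. Note that the implied memory bandwidth (roof divided by intensity, about 125.8 GB/s) is our back-calculation from the stated 23.9 GFLOPS/s roof, not a figure given in the text.

def roofline(intensity_flops_per_byte, peak_gflops, bandwidth_gbs):
    # Attainable performance = min(compute roof, bandwidth * intensity).
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

PEAK = 32.0            # 16 multipliers plus adders at 1 GHz
BW = 23.9 / 0.19       # ~125.8 GB/s, back-solved from the stated roof
for name, perf in [("SpArch", 10.4), ("OuterSPACE", 2.5)]:
    roof = roofline(0.19, PEAK, BW)
    print(f"{name}: {perf} GFLOPS/s, roof {roof:.1f}, gap {roof / perf:.1f}x")
# SpArch: 10.4 GFLOPS/s, roof 23.9, gap 2.3x
# OuterSPACE: 2.5 GFLOPS/s, roof 23.9, gap 9.6x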

C. Interpreting the Performance Gain

To analyze where our speedup comes from and to demonstrate the effectiveness of the proposed ideas separately, we conduct breakdown experiments. Figure 16 shows the performance contribution of each technique. In the breakdown experiments, we first remove our prefetcher and scheduler and change back to the CSC/CSR matrix format, but preserve the pipeline of the multiply and merge phases, using a random order to select initial columns and partially merged results to merge. The performance slows down by 5.7× compared to OuterSPACE.

The DRAM traffic in OuterSPACE is dominated by the intermediate results generated by the multipliers. Assume the matrix dimension is N × N and there are M multiplications, hence M intermediate results. The datasets average 0.5M final results, so the memory access is roughly 2M for writing and re-reading the intermediates plus 0.5M for the outputs, i.e., 2.5M in total. In our multiply-merge pipeline, ideally the partial matrices are never written to DRAM. However, since N is much larger than the size of the merge tree (64), we still need to spill many partially merged results to DRAM. Assuming that merging does not change the length of arrays, we can estimate the expected number of times one multiplied result α is read as follows:

E = \sum_{k=0}^{t-1} \Pr[\alpha \text{ is used in round } k]            (2)
  = \sum_{k=0}^{t-1} \frac{w}{N - k(w-1)}                               (3)
  = \frac{w}{w-1} \sum_{k=0}^{t-1} \frac{1}{t + \frac{1}{w-1} - k}      (4)
  = \frac{w}{w-1} \sum_{i=1}^{t} \frac{1}{\frac{1}{w-1} + i}            (5)

where t = (N-1)/(w-1) is the number of rounds and w = 64 is the size of the merge tree. Since 1/(w-1) is very small compared to i, we can ignore it:

E \approx \frac{w}{w-1} \sum_{i=1}^{t} \frac{1}{i}                      (6)
  \approx \frac{w}{w-1} \ln t                                           (7)

[Figure 16 data] GFLOPS after each step: OuterSPACE baseline 2.5 → 0.44 (5.7× slowdown with pipelined multiply and merge) → 3.8 (8.8× speedup with matrix condensing) → 5.9 (1.5× speedup with Huffman tree scheduler) → 10.4 (1.8× speedup with row prefetcher).

Figure 16. Dissecting the performance gain. The performance first drops due to the increased memory access of partially merged results, but this enables the next three optimizations, which reduce memory access more effectively: matrix condensing and the Huffman tree scheduler for the output matrix, together with the row prefetcher for the input matrix. Overall, SpArch achieves 4.2× speedup over OuterSPACE.

On average, N ≈ 140,000 in the datasets, so we read and write the partially merged results roughly ln(140000/63) − 1 ≈ 6.7 times (minus 1 because we do not need to load/store the multiplied results in the first round). The DRAM access for partial results is thus roughly 6.7 × 2M + 0.5M = 13.9M. Compared to the 2.5M access of OuterSPACE, the 5.7× slowdown is reasonable. We then apply matrix condensing to the left matrix, which reduces the number of columns from 140,000 to about 100, both on average. That means we can finish the merging in 2 rounds on average with a 64-way merger. However, the right matrix is now read more than once: the amount of right-matrix access equals the number of multiplications M. Adding the partially merged and final results, roughly ((1 + 1/2) − 1) × 2M + 0.5M = 1.5M (using Formula 6), our total memory access is 2.5M, 5.5× less than without condensing. The measured speedup from this step is 8.8×, higher than 5.5×, because with matrix condensing each round takes many more clock cycles than before, so the startup overhead is amortized, leading to higher memory utilization and higher performance.
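The traffic arithmetic above is easy to misread in prose, so here is the same model as a short script; the constants (w = 64, N ≈ 140,000, 0.5M final results) are taken directly from the text.

import math

w = 64                                      # merge tree width
# Without condensing: N ≈ 140,000 columns to merge.
reads = math.log(140_000 / (w - 1)) - 1     # ~6.7 re-reads per multiplied
                                            # result (first round is free)
partial_traffic = 2 * reads + 0.5           # read+write partials, plus finals
print(f"{reads:.1f} reads, {partial_traffic:.1f}M traffic")   # 6.7, 13.9M

# With matrix condensing: ~100 condensed columns -> 2 merge rounds.
reads_c = (1 + 1/2) - 1                     # truncated harmonic sum, 2 rounds
total_c = 2 * reads_c + 0.5 + 1.0           # + right matrix read once per mult
print(f"condensed traffic: {total_c:.1f}M") # 2.5M, matching OuterSPACE's total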

Furthermore, we add back the Huffman tree scheduler, which makes the memory access of partially merged results negligible. This is because the average number of non-zeros per row is less than 64: the long columns can be scheduled near the root node of the Huffman tree, so they do not generate partially merged results. This reduces the memory access by roughly 1.7× (from 2.5M to 1.5M), close to the actual improvement we measure.
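A minimal sketch of the scheduling idea, assuming a simple w-ary Huffman-style greedy (our illustration; the hardware scheduler is more involved): merging the shortest arrays first keeps long columns near the root, so their elements are re-written the fewest times.

import heapq

def schedule_merges(lengths, w=4):
    # Greedy w-ary Huffman schedule over partial-array lengths.
    # Returns (order, cost): the groups merged at each step and the total
    # elements flowing through merge outputs (the last merge emits the
    # final result itself).
    heap = list(lengths)
    heapq.heapify(heap)
    order, cost = [], 0
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(min(w, len(heap)))]
        merged = sum(group)        # one pass over all merged elements
        order.append(group)
        cost += merged
        heapq.heappush(heap, merged)
    return order, cost

# One long column and several short ones: the long one joins only at the
# root, so it is streamed once instead of being re-merged repeatedly.
print(schedule_merges([1000, 3, 5, 2, 8, 1], w=4))
# ([[1, 2, 3, 5], [8, 11, 1000]], 1030)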

[Figure 17 data] Design space exploration:

(a) Prefetcher buffer total size (1024 lines, varying line size):
  Buffer size | 1024x24 | 1024x36 | 1024x48 | 1024x60 | 1024x72 | 1024x84 | 1024x96
  GFLOPS      | 10.21   | 10.30   | 10.45   | 10.47   | 10.53   | 10.55   | 10.57
  DRAM (MB)   | 216.5   | 211.5   | 208.2   | 206.0   | 204.6   | 203.8   | 203.4

(b) Prefetcher buffer line size (total fixed at 1024×48 entries):
  Shape       | 2048x24 | 1024x48 | 512x96 | 256x192
  GFLOPS      | 10.41   | 10.45   | 9.91   | 9.26
  DRAM (MB)   | 202.1   | 208.2   | 226.6  | 245.7

(c) Comparator array size:
  Size        | 1x1   | 2x2   | 4x4   | 8x8   | 16x16
  GFLOPS      | 1.28  | 2.53  | 4.94  | 8.18  | 10.45
  DRAM (MB)   | 216.1 | 216.1 | 215.8 | 212.3 | 208.2

(d) Look-ahead FIFO size:
  Size        | 1024  | 2048  | 4096  | 8192  | 16384
  GFLOPS      | 10.04 | 10.25 | 10.43 | 10.45 | 10.39
  DRAM (MB)   | 228.0 | 220.1 | 212.3 | 208.2 | 206.5

Figure 17. Design space exploration on various buffer sizes and array sizes.

We finally add back the row prefetcher. It buffers the data from the right matrix and achieves a hit rate of 62%. Therefore, we reduce the total memory access from 1.5M to 0.38M + 0.5M = 0.88M, cutting DRAM access by another 1.7×, which matches the experiments.
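A sketch of the near-optimal replacement that the look-ahead FIFO enables, assuming a Belady-style policy within the visible window (our reading of the design, not the exact hardware): evict the buffered row whose next use is farthest in the future.

def lookahead_evict(buffered_rows, upcoming, window=8192):
    # Pick a victim among buffered_rows given the next `window` requests
    # (8192 is the look-ahead FIFO size chosen in Section D).
    horizon = upcoming[:window]
    def next_use(row):
        try:
            return horizon.index(row)
        except ValueError:
            return len(horizon)      # not needed inside the window: best victim
    return max(buffered_rows, key=next_use)

buf = [3, 7, 42]
future = [7, 7, 3, 9, 7, 3]
print(lookahead_evict(buf, future))  # 42 -> never reused within the window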

D. Design Space Exploration

The high performance of SpArch comes from two factors: higher memory bandwidth utilization and lower memory access. To achieve higher bandwidth utilization, we must not be bounded by the computational units, mainly the comparator arrays in the merge tree. To achieve lower memory access, we need high reuse of the right matrix, which is mainly affected by the prefetch buffer and the look-ahead FIFO, and low access to partially merged results, which is mainly determined by the size of the merge tree. However, higher performance usually comes at the cost of larger area and power, so we must balance performance against power and area carefully. We conduct a design space exploration to find the optimal parameters for our design.

Figure 17 (a)(b) shows the relationship between performance and the size of the prefetch buffer (m×n means there are m lines and each line contains n elements). When the number of lines is fixed at 1024, a longer line yields better performance, but the area increases linearly. The difference between 1024×60 and 1024×48 is much smaller than that between 1024×48 and 1024×36, so we choose 48 as our line size. As shown in Figure 17 (b), when the overall buffer size is fixed at 1024×48 = 49152 entries, more lines lead to less memory access. However, beyond 1024 lines (e.g., at 2048 lines), the benefit from less DRAM access is minor, while the increased latency of the replacement logic has a greater impact on performance, leading to lower overall FLOPS/s. Thus we choose 1024 as the number of lines.

[Figure 18 data] Merge tree size:
  Layers      | 2     | 3     | 4     | 5     | 6     | 7
  GFLOPS      | 4.13  | 6.39  | 8.80  | 10.00 | 10.45 | 10.45
  DRAM (MB)   | 645.3 | 403.3 | 274.7 | 223.4 | 208.2 | 204.2

Figure 18. Design space exploration on the merge tree size. 6 layers are enough to achieve good performance; more layers give diminishing improvement.

Figure 17 (c) shows the relationship between the size of the comparator arrays in our merge tree and the overall performance. When the array size is below 8×8, the performance increases linearly, meaning we are computationally bounded. When the size increases to 16×16, the performance gain flattens, meaning we are bounded by memory bandwidth. We therefore choose 16×16 as our comparator size, which consumes as much of the memory bandwidth as possible.

Figure 17 (d) explores the size of the look-ahead FIFO. A larger FIFO makes the replacement of the prefetch buffer more accurate and improves data reuse. However, a larger FIFO also takes more time to fill at the start of each round, so an excessive size degrades performance. We choose 8192, which yields the highest performance, as our look-ahead FIFO configuration.

Figure 18 helps us determine the size of the merge tree. A larger merge tree can process more columns at the same time but requires more FIFOs and comparator arrays. A merge tree of 6 layers and 64 ports is good enough; a larger one does not contribute to the speedup. Therefore, we choose 6 layers to achieve the best performance without wasting area and power.
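The parameter choices above all follow the same knee-point logic; a small helper makes the selection rule explicit (the 1% tolerance is our illustrative threshold, not a criterion stated in the text).

def knee(settings, tol=0.01):
    # settings: list of (label, gflops) ordered small -> large hardware.
    # Pick the smallest setting within `tol` of the best performance.
    best = max(g for _, g in settings)
    for label, g in settings:
        if g >= (1 - tol) * best:
            return label

merge_layers = [("2", 4.13), ("3", 6.39), ("4", 8.80),
                ("5", 10.00), ("6", 10.45), ("7", 10.45)]
comparator = [("1x1", 1.28), ("2x2", 2.53), ("4x4", 4.94),
              ("8x8", 8.18), ("16x16", 10.45)]
print(knee(merge_layers))   # '6'     -> 6 merge tree layers
print(knee(comparator))     # '16x16' -> 16x16 comparator array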

IV. RELATED WORK

Sparse Matrix-Matrix Multiplication Acceleration. Besides the ASIC SpGEMM accelerator OuterSPACE [1], which we mainly discuss in this paper, there are previous explorations of SpMM on FPGA, such as [38]. Contrary to our fully specialized merge tree, these designs use general-purpose PEs and are not fully specialized for sparse data.

There are also many works focusing on SpMM on general-purpose platforms like CPUs and GPUs. The algorithms are mostly based on the multiply-and-insert style of Gustavson's algorithm [39], whose variants differ in how an element is inserted into the output matrix. We list several examples below. cuSPARSE [33] uses a hash table; however, an on-chip hash table limits scalability, and an off-chip hash table suffers from low memory utilization. CUSP [34] uses a sorting algorithm, which suffers from higher complexity (a sorting network) and excessive DRAM access when on-chip resources are limited. HeapSpGEMM [40] inserts via a heap. Since a heap is hard to parallelize, the parallelism only comes from processing multiple rows simultaneously, which suffers from load-balance problems. BHSPARSE [41] and the proposed SpArch choose merging as the insertion method. It is less complex than sorting and remains parallelizable, and it can process the whole partial matrix rather than a single row, eliminating the load-balance issue. Compared to BHSPARSE, our design further reduces DRAM access thanks to the specialized multiply-and-merge pipeline, which can hardly be implemented on general-purpose platforms.
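For reference, here is a minimal sketch of Gustavson's row-wise algorithm with a hash-map accumulator, the pattern behind several of the variants above (a software illustration, not any library's actual implementation): row i of C is a sparse combination of the rows of B selected by row i of A.

from collections import defaultdict

def gustavson(A_rows, B_rows):
    # A_rows/B_rows: {row: [(col, val), ...]}; returns C in the same form.
    C = {}
    for i, a_row in A_rows.items():
        acc = defaultdict(float)            # hash-table accumulator
        for k, a_val in a_row:              # scatter scaled rows of B
            for j, b_val in B_rows.get(k, []):
                acc[j] += a_val * b_val
        C[i] = sorted(acc.items())
    return C

A = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
B = {0: [(0, 4.0)], 1: [(0, 5.0), (2, 6.0)]}
print(gustavson(A, B))
# {0: [(0, 14.0), (2, 12.0)], 1: [(0, 15.0), (2, 18.0)]}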

Hardware Accelerators for Merge and Intersection. [42] designed an efficient parallel GPU-based method for intersecting two unsorted sets. [43] used a Bloom filter to parallelize intersections on GPUs. [44] implemented parallel list intersection and index compression on GPUs to speed up web search. [45] used SIMD techniques to enable a high level of parallelism in set intersection. ExTensor [46] also accelerates various sparse linear algebra kernels by hierarchically eliminating computations through the intersection operation.

Sparse Linear Algebra. A number of designs have been presented to accelerate sparse linear algebra on FPGA platforms. [47] introduced a tree of binary operators to increase the energy efficiency of SpMV on FPGAs. [48] showed that separating the index comparison from the computing operations can increase throughput; index comparison and computing are also separated in our SpArch design. [49] proposed a new sparse matrix storage format called ”BVCSR” to compress the indices of non-zero elements, thus increasing the effective bandwidth of the FPGA. [50] and [51] proposed architectures for large-scale SpMV in FEM problems. [50] co-designed an FPGA SpMV architecture with a matrix stripping and partitioning algorithm that enables the architecture to process arbitrarily large matrices without changing the number of PEs.

For sparse operations in deep learning algorithms, [52] proposed a specialized architecture that leverages not only the static sparse weights but also the dynamic sparse activations. [53] further extended the efficient SpMV architecture to LSTM [54], a kind of recurrent neural network. Besides SpMV, SCNN [55] proposed an architecture for sparse convolution operations, and SparTen [56] further solved the overhead and load-balance issues in SCNN. Instead of designing a pure hardware architecture, [57] focused on algorithm-hardware co-design. Different from those works, SpArch is based on the outer product algorithm and aims to maximize both input and output data reuse.

Comparison between Different Hardware. The performance and energy efficiency of different platforms have been compared by many researchers. [58], [21], and [59] conducted comparisons by running the same benchmarks on different platforms and measuring the performance. We use a similar methodology in the evaluation of SpArch.

Systolic Array. A systolic array [60] is a homogeneous architecture of coupled processing units. Although the proposed comparator array shares a similar spatial structure with systolic arrays, there is a fundamental difference between them: the comparator array has no long data dependencies. Each comparator block depends only on the comparison results from its top and left neighbor blocks, which are computed from the input arrays rather than from other comparator blocks. Therefore, all the outputs of the comparator matrix can be calculated in one cycle.
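As a software analogy for this property, the sketch below merges two sorted lists by computing each element's final position purely from pairwise comparisons against the other list, so no output depends on another output; this mirrors why the comparator matrix can settle combinationally in one cycle. It is an illustrative model, not the exact SpArch datapath, which also combines values for matching indices.

    def comparator_merge(a, b):
        """Merge two sorted lists using only pairwise input comparisons.
        Element a[i] lands at position i + #{j : b[j] <  a[i]};
        element b[j] lands at position j + #{i : a[i] <= b[j]}.
        Every position is a pure function of the inputs, mirroring a
        comparator array whose outputs settle in a single cycle."""
        out = [None] * (len(a) + len(b))
        for i, x in enumerate(a):
            pos = i + sum(1 for y in b if y < x)   # ties: a goes first
            out[pos] = x
        for j, y in enumerate(b):
            pos = j + sum(1 for x in a if x <= y)
            out[pos] = y
        return out

The asymmetric tie-breaking (strict < on one side, <= on the other) guarantees that the two lists map to disjoint output positions, so the writes never collide.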

V. CONCLUSION

In this paper, we propose SpArch, which optimizes the data reuse of both the input and output matrices to accelerate sparse matrix-matrix multiplication (SpGEMM). We design a streaming-based merger to pipeline the multiply and merge phases of the outer product on chip. We then propose a condensed matrix representation to reduce the number of partial matrices produced by the multiply phase. We further develop a Huffman tree scheduler to make our merger scalable to larger or power-law matrices and to reduce the memory access of partially merged results. Finally, we use a row prefetcher that achieves near-optimal buffer replacement to reduce reads of the right input matrix. Evaluated on 20 real-world benchmarks, SpArch reduces memory access by 2.8×, increases performance by 4×, and improves energy efficiency by 6× over the previous state of the art. We also provide a detailed breakdown of the savings from each step, offering insights for specialized accelerator designs.

ACKNOWLEDGEMENT

Part of this work was supported under NSF HDR Award #1934700 and the DARPA SDH program. We thank the MIT Data Science and AI Lab (DSAIL) for supporting this research. We thank Joel Emer, Saman Amarasinghe, Julian Shun, Stephen Keckler, and Christopher Fletcher for inspiring discussions. This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

REFERENCES

[1] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “OuterSPACE: An outer product based sparse matrix multiplication accelerator,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 724–736, IEEE, 2018.


[2] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.

[3] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.

[4] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.

[5] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model compression and acceleration on mobile devices,” in ECCV, pp. 784–800, 2018.

[6] A. Azad, A. Buluc, and J. Gilbert, “Parallel triangle counting and enumeration using matrix algebra,” in 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 804–811, IEEE, 2015.

[7] S. M. Van Dongen, Graph Clustering by Flow Simulation. PhD thesis, 2000.

[8] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “A unified framework for numerical and combinatorial computing,” Computing in Science & Engineering, vol. 10, no. 2, pp. 20–25, 2008.

[9] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “High-performance graph algorithms from parallel sparse matrices,” in International Workshop on Applied Parallel Computing, pp. 260–269, Springer, 2006.

[10] M. O. Rabin and V. V. Vazirani, “Maximum matchings in general graphs through randomization,” Journal of Algorithms, vol. 10, no. 4, pp. 557–567, 1989.

[11] G. Penn, “Efficient transitive closure of sparse matrices over closed semirings,” Theoretical Computer Science, vol. 354, no. 1, pp. 72–81, 2006.

[12] S. Itoh, P. Ordejon, and R. M. Martin, “Order-N tight-binding molecular dynamics on parallel computers,” Computer Physics Communications, pp. 173–185, 1995.

[13] T. M. Chan, “More algorithms for all-pairs shortest paths in weighted graphs,” SIAM Journal on Computing, vol. 39, no. 5, pp. 2075–2089, 2010.

[14] I. Yamazaki and X. S. Li, “On techniques to improve robustness and scalability of a parallel hybrid linear solver,” in International Conference on High Performance Computing for Computational Science, pp. 421–434, Springer, 2010.

[15] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained parallelism in algebraic multigrid methods,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C123–C152, 2012.

[16] G. Karypis, A. Gupta, and V. Kumar, “A parallel formulation of interior point algorithms,” in Supercomputing ’94: Proceedings of the 1994 ACM/IEEE Conference on Supercomputing, pp. 204–213, IEEE, 1994.

[17] S. Badia, A. F. Martín, and J. Principe, “A highly scalable parallel implementation of balancing domain decomposition by constraints,” SIAM Journal on Scientific Computing, vol. 36, no. 2, pp. C190–C218, 2014.

[18] H. Wang, J. Yang, H.-S. Lee, and S. Han, “Learning to design circuits,” NeurIPS 2018 Machine Learning for Systems Workshop, 2018.

[19] H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus, R. Addanki, M. Khani Shirkoohi, S. He, V. Nathan, F. Cangialosi, S. Venkatakrishnan, W.-H. Weng, S. Han, T. Kraska, and M. Alizadeh, “Park: An open platform for learning-augmented computer systems,” in Advances in Neural Information Processing Systems 32, pp. 2490–2502, Curran Associates, Inc., 2019.

[20] G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris, “Understanding the performance of sparse matrix-vector multiplication,” in 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), pp. 283–292, IEEE, 2008.

[21] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, “Understanding performance differences of FPGAs and GPUs,” in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 93–96, IEEE, 2018.

[22] K. Matam, S. R. K. B. Indarapu, and K. Kothapalli, “Sparse matrix-matrix multiplication on modern architectures,” in 2012 19th International Conference on High Performance Computing, pp. 1–10, IEEE, 2012.

[23] Twitter, “Number of monthly active Twitter users worldwide from 1st quarter 2010 to 1st quarter 2019 (in millions),” 2019.

[24] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, pp. 114–117, 1965.

[25] J. L. Hennessy and D. A. Patterson, “A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced security, open instruction sets, and agile chip development,” Turing Lecture, 2018.

[26] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011.

[27] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.” http://snap.stanford.edu/data, June 2014.

[28] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introducing the Graph 500,” Cray Users Group (CUG), vol. 19, pp. 45–74, 2010.

[29] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE Transactions on Computers, vol. 60, no. 7, pp. 913–922, 2010.

[30] A. Shilov, “JEDEC publishes HBM2 specification as Samsung begins mass production of chips,” 2016.


[31] M. Alfano, B. Black, J. Rearick, J. Siegel, M. Su, and J. Din, “Unleashing Fury: A new paradigm for 3-D design and test,” IEEE Design & Test, vol. 34, no. 1, pp. 8–15, 2016.

[32] Intel, “Intel Math Kernel Library.”

[33] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, “cuSPARSE library.”

[34] S. Dalton, N. Bell, L. Olson, and M. Garland, “CUSP: Generic parallel algorithms for sparse matrix and graph computations,” 2014. Version 0.5.0.

[35] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained parallelism in algebraic multigrid methods,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C123–C152, 2012.

[36] C. Sanderson and R. Curtin, “Armadillo: A template-based C++ library for linear algebra,” Journal of Open Source Software, vol. 1, no. 2, p. 26, 2016.

[37] C. Sanderson and R. Curtin, “Practical sparse matrices in C++ with hybrid storage and template-based expression optimisation,” Mathematical and Computational Applications, vol. 24, no. 3, 2019.

[38] C. Y. Lin, N. Wong, and H. K.-H. So, “Design space exploration for sparse matrix-matrix multiplication on FPGAs,” International Journal of Circuit Theory and Applications, vol. 41, no. 2, pp. 205–219, 2013.

[39] F. G. Gustavson, “Two fast algorithms for sparse matrices: Multiplication and permuted transposition,” ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250–269, 1978.

[40] A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, and S. Williams, “Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication,” SIAM Journal on Scientific Computing, vol. 38, no. 6, pp. C624–C651, 2016.

[41] W. Liu and B. Vinter, “An efficient GPU general sparse matrix-matrix multiplication for irregular data,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 370–381, IEEE, 2014.

[42] M. Fort, J. A. Sellares, and N. Valladares, “Intersecting two families of sets on the GPU,” Journal of Parallel and Distributed Computing, vol. 104, pp. 167–178, 2017.

[43] F. Zhang, D. Wu, N. Ao, G. Wang, X. Liu, and J. Liu, “Fast lists intersection with Bloom filter using graphics processing units,” in Proceedings of the 2011 ACM Symposium on Applied Computing, pp. 825–826, ACM, 2011.

[44] N. Ao, F. Zhang, D. Wu, D. S. Stones, G. Wang, X. Liu, J. Liu, and S. Lin, “Efficient parallel lists intersection and index compression algorithms using graphics processing units,” Proceedings of the VLDB Endowment, vol. 4, no. 8, pp. 470–481, 2011.

[45] S. Han, L. Zou, and J. X. Yu, “Speeding up set intersections in graph algorithms using SIMD instructions,” in Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, (New York, NY, USA), pp. 1587–1602, ACM, 2018.

[46] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “ExTensor: An accelerator for sparse tensor algebra,” in MICRO-52, pp. 319–333, ACM, 2019.

[47] L. Zhuo and V. K. Prasanna, “Sparse matrix-vector multiplication on FPGAs,” in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, FPGA ’05, (New York, NY, USA), pp. 63–74, ACM, 2005.

[48] E. Jamro, T. Pabis, P. Russek, and K. Wiatr, “The algorithms for FPGA implementation of sparse matrices multiplication,” Computing and Informatics, vol. 33, no. 3, pp. 667–684, 2015.

[49] D. Zou, Y. Dou, S. Guo, and S. Ni, “High performance sparse matrix-vector multiplication on FPGA,” IEICE Electronics Express, vol. 10, no. 17, p. 20130529, 2013.

[50] Y. Elkurdi, D. Fernandez, E. Souleimanov, D. Giannacopoulos, and W. J. Gross, “FPGA architecture and implementation of sparse matrix–vector multiplication for the finite element method,” Computer Physics Communications, vol. 178, no. 8, pp. 558–570, 2008.

[51] P. Grigoras, P. Burovskiy, W. Luk, and S. Sherwin, “Optimising sparse matrix vector multiplication for large scale FEM problems on FPGA,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9, IEEE, 2016.

[52] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in ISCA, pp. 243–254, 2016.

[53] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al., “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84, ACM, 2017.

[54] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[55] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in ISCA, pp. 27–40, June 2017.

[56] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, “SparTen: A sparse tensor accelerator for convolutional neural networks,” in MICRO-52, pp. 151–165, 2019.

[57] M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs,” in MICRO-52, pp. 359–371, ACM, 2019.

[58] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance comparison of FPGA, GPU and CPU in image processing,” in 2009 International Conference on Field Programmable Logic and Applications (FPL), pp. 126–131, IEEE, 2009.

[59] C. Cullinan, C. Wyant, T. Frattesi, and X. Huang, “Computing performance benchmarks among CPU, GPU, and FPGA.”

[60] H.-T. Kung, “Why systolic architectures?,” IEEE Computer, vol. 15, no. 1, pp. 37–46, 1982.