


OuterSPACE: An Outer Product based Sparse Matrix Multiplication Accelerator

Subhankar Pal* Jonathan Beaumont* Dong-Hyeon Park* Aporva Amarnath* Siying Feng*
Chaitali Chakrabarti† Hun-Seok Kim* David Blaauw* Trevor Mudge* Ronald Dreslinski*

*University of Michigan, Ann Arbor, MI    †Arizona State University, Tempe, AZ
*{subh, jbbeau, dohypark, aporvaa, fengsy, hunseok, blaauw, tnm, rdreslin}@umich.edu
†[email protected]

ABSTRACT

Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel SPMD-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM).

We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant data accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and their ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overhead costs. OuterSPACE is designed to specifically overcome these challenges.

We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel's Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².

KEYWORDS

Sparse matrix processing, application-specific hardware, parallel computer architecture, hardware-software co-design, hardware accelerator

1. INTRODUCTION

Generalized sparse matrix-matrix multiplication (SpGEMM) and sparse matrix-vector multiplication (SpMV) are two key kernels of complex operations in domains such as graph analytics, machine learning, and scientific computation, as we elaborate in Section 2. The percentage of non-zero elements in the matrices involved can be very small. For example, the number of active daily Facebook users is currently 1.08 billion and the average number of friends per user is 338 [56]. A graph representing Facebook users as vertices and "friendships" between users as edges results in an adjacency matrix of dimension 1.08 billion with a density of just 0.0003%.

Sparse matrix-based computations are becoming an increasingly important problem. These applications are typically bottlenecked by memory rather than computation, primarily due to irregular, data-dependent memory accesses in existing matrix libraries, leading to poor throughput and performance [23, 41].

The demise of Moore's Law has led to renewed interest in accelerators to improve performance and reduce energy and cost. To address these issues for the increasingly important problem of sparse matrix computations, we propose a custom accelerator for sparse matrix applications, OuterSPACE, which consists of asynchronous Single Program, Multiple Data (SPMD)-style processing units with dynamically-reconfigurable non-coherent caches and crossbars. OuterSPACE is designed to work with the unconventional outer product-based matrix multiplication approach [14, 60]. This involves multiplying the i-th column of the first matrix (A) with the i-th row of the second matrix (B), for all i. Each multiplication generates a partial product matrix, and all the generated matrices are then accumulated element-wise to form the final result. The seemingly obvious drawback of this approach is the maintenance and storage of these partial product matrices. In the case of sparse matrices, however, this is much less of a concern and other considerations dominate. We identify contrasting data-sharing patterns in the two distinct, highly parallel compute phases: multiply and merge. The multiply phase involves data-sharing across parallel computation streams, while the merge phase involves strictly independent processing with little-to-no communication or synchronization between streams. This discrepancy leads to sub-optimal execution of the outer product method on mainstream architectures, namely GPUs and multi-core/many-core CPUs (Section 4.4).

The reconfigurability of OuterSPACE enables us to meet the contrasting computational needs of the outer product method's two compute phases.


Employing asynchronous SPMD-style processing elements allows control-divergent code to operate fully in parallel, while SIMD architectures would need to at least partially serialize it. Software-controlled scratchpads coupled with hardware-controlled caches help prevent wasted data accesses to the main memory. Further, allowing non-coherence relieves pressure on the memory system associated with excess broadcasts and writebacks (which can contribute up to 30-70% of the total write traffic [36]), providing a fraction of the performance benefits.

While our main focus is sparse matrix-matrix multiplication, we also present results for sparse matrix-vector multiplication and describe how element-wise operations can be performed on the OuterSPACE system.

We use a server-class multi-core CPU and GPU as baselines to compare our architecture against. For sparse matrix-matrix multiplication on the CPU, we use state-of-the-art sparse BLAS functions from Intel's Math Kernel Library (MKL). MKL provides math routines for applications that solve large computational problems; these routines are extensively parallelized using OpenMP threading and use vector operations provided by the AVX instruction set. For the GPU, we compare our architecture against the CUSP and cuSPARSE libraries. cuSPARSE [46] applies row-by-row parallelism and uses a hash table to merge partial products for each row of the output matrix. CUSP [8, 17] provides fine-grained parallelism by accessing the input matrices row-by-row and storing the partial result, with possible duplicates, in an intermediate coordinate format. The intermediate structure is then sorted and compressed into the output matrix.

The rest of the paper is organized as follows. Section 2 discusses a wide spectrum of applications that utilize sparse matrix operation kernels. Section 3 provides a brief background on our implementation of outer product multiplication and the sparse matrix storage format. Section 4 discusses our outer product implementation for sparse matrix-matrix and matrix-vector multiplication. Section 5 provides details of OuterSPACE and how the outer product algorithm efficiently maps to it. Section 6 presents our experimental setup and Section 7 presents the results. Section 8 briefly discusses how OuterSPACE can be scaled up. Lastly, Section 9 discusses related work and Section 10 provides some concluding remarks.

2. MOTIVATION

Sparse matrices are ubiquitous in most modern applications that operate on big data. It is the general consensus that a selection of linear algebra routines optimized over years of research can be used to accelerate a wide range of graph and scientific algorithms [20].

Sparse matrix-matrix multiplication, in particular, is a significant building block of multiple algorithms prevalent in graph analytics, such as breadth-first search [21, 22], matching [49], graph contraction [13], peer pressure clustering [54], cycle detection [65], Markov clustering [61], and triangle counting [7]. It is also a key kernel in many scientific-computing applications. For example, fast sparse matrix-matrix multiplication is a performance bottleneck in the hybrid linear solver applying the Schur complement method [63] and in algebraic multigrid methods [8]. Other computing applications, such as color intersection searching [31], context-free grammar parsing [48], finite element simulations based on domain decomposition [25], molecular dynamics [28], and interior point methods [32] also rely heavily on sparse matrix-matrix multiplication.

Sparse matrix-vector multiplication is also predominant across diverse applications, such as PageRank [12], minimal spanning tree, single-source shortest path and vertex/edge-betweenness centrality calculations. It serves as a dominant compute primitive in machine learning (ML) algorithms such as support vector machines [45] and ML-based text analytics applications [42].

While GPUs demonstrate satisfactory compute efficiency on sparse matrix-vector multiplication and sufficiently dense matrix-matrix multiplication [9], we show that compute units are significantly underutilized when the density drops below 0.1%, often achieving fewer than 1 GFLOPS, despite a peak theoretical throughput of over 4 TFLOPS [59]. This is further supported by the fact that rankings such as the Green Graph 500 list [2] are dominated by CPU-based systems.

For large dense matrices, inner product multiplication, block partitioning and tiling techniques are used to take advantage of data locality. However, when the density of the input matrices is decreased, the run-time is dominated by irregular memory accesses and index-matching in order to perform the inner products. Moreover, while tiling techniques can reduce redundant reads to main memory in the short term, on-chip storage constraints still necessitate that many data elements be redundantly fetched multiple times across tiles [29]. These stark inefficiencies on both the hardware and algorithm fronts motivate our work to formulate a new approach for sparse matrix multiplication acceleration.

3. BACKGROUND

This section outlines a few fundamental concepts behind matrix-matrix multiplication and storage formats for representing sparse matrices in memory.

3.1 Matrix Multiplication

3.1.1 The Inner Product Approach

Traditionally, generalized matrix-matrix multiplication (GEMM) is performed using the inner product approach. This is computed using a series of dot product operations between rows of the first matrix (A) and columns of the second matrix (B), resulting in the elements of the final product (C):

$c_{i,j} = \sum_{k=0}^{N-1} a_{i,k} \times b_{k,j}$

Here, N is the number of columns in A (or rows in B), while i and j are the row and column indices, respectively, of an element in the final matrix. Thus, each element of the final matrix is computed through a series of multiply-and-accumulate (MAC) operations. This is generally optimized using block partitioning and tiling techniques [62].
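The following minimal Python sketch simply restates the inner-product formulation above for dense N×N matrices; it is purely illustrative and not taken from the paper's implementation.

```python
# Inner-product GEMM: one dot product (a series of MAC operations) per output element.
def inner_product_gemm(A, B):
    N = len(A)
    C = [[0.0] * N for _ in range(N)]
    for i in range(N):                  # row of A
        for j in range(N):              # column of B
            acc = 0.0
            for k in range(N):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C
```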


Figure 1: Outer product multiplication of matrices A and B. Each column-of-A and the corresponding row-of-B are multiplied with each other to produce N partial product matrices, Ci. These are then summed together to produce the final result matrix C.

3.1.2 The Outer Product Approach

The outer product method [14, 60] multiplies two matrices A and B by decomposing the operation into outer product multiplications of pairs of columns-of-A and rows-of-B, as illustrated in Figure 1. Mathematically,

$C = \sum_{i=0}^{N-1} C_i = \sum_{i=0}^{N-1} a_i b_i$

where $a_i$ is the i-th column-of-A and $b_i$ is the i-th row-of-B. Thus, the computation is divided into two sets of operations: a set of multiply operations to generate the outer products, followed by a set of merge operations to accumulate the outer products into the final result. In Section 4, we propose an outer product based sparse matrix multiplication paradigm based on this approach.
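As a point of contrast with the inner-product loop shown earlier, the sketch below computes the same result by summing N rank-1 outer products, mirroring the equation above; the dense representation and the function name are illustrative only.

```python
# Outer-product GEMM: C = sum over i of (column i of A) x (row i of B).
# Each iteration forms one partial product matrix C_i and accumulates it into C.
def outer_product_gemm(A, B):
    N = len(A)
    C = [[0.0] * N for _ in range(N)]
    for i in range(N):
        col_a = [A[r][i] for r in range(N)]   # i-th column of A
        row_b = B[i]                          # i-th row of B
        for r in range(N):
            if col_a[r] == 0.0:
                continue                      # zeros produce no useful work
            for c in range(N):
                C[r][c] += col_a[r] * row_b[c]
    return C
```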

3.2 Compressed Storage Formats

An M×N matrix is often represented in the dense format as a 2-D array laid out in memory as an M×N contiguous block. For sparse matrices, however, most of the elements are zeros and hence there is little merit in storing such matrices in the dense format.

The Compressed Sparse Row (CSR) format represents a matrix in a compressed manner using three arrays. The vals array consists of the non-zero elements of the matrix in row-major order, the cols array contains the column indices of the elements in vals, and the row-ptrs array contains pointers to the start of each row of the matrix in the cols and vals arrays. The dual of the CSR format is the Compressed Sparse Column (CSC) format, which comprises the vals, rows and col-ptrs arrays.

In our implementation, while not being restrictive, we employ a similar storage scheme, consisting of a contiguous block of row pointers, each pointing to a contiguous array of column-index/value pairs. We henceforth refer to this as the Compressed Row (CR) format. The complementary format, the Compressed Column (CC) format, consists of column pointers pointing to arrays of row-index/value pairs.
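A minimal Python sketch of the CR and CC layouts described above, using plain lists in place of the hardware pointer arrays; the helper names are ours, not from the paper.

```python
# CR: one list of (column-index, value) pairs per row.
def dense_to_cr(dense):
    return [[(j, v) for j, v in enumerate(row) if v != 0.0] for row in dense]

# CC: one list of (row-index, value) pairs per column.
def dense_to_cc(dense):
    n_cols = len(dense[0])
    cols = [[] for _ in range(n_cols)]
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0.0:
                cols[j].append((i, v))
    return cols
```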

4. OUTER PRODUCT IMPLEMENTATION

This section details the highly parallel outer product algorithm alluded to in Section 3.1.2. The algorithm maximizes memory reuse and avoids redundant reads of non-zero elements.

A major inefficiency of sparse inner products (i.e., row-of-A × column-of-B) is that multiplications are selectively performed only on matched non-zero indices. The outer product technique benefits over other conventional methods of multiplying matrices through:

• Elimination of index-matching: Each pair of non-zero elements from a column-of-A and the corresponding row-of-B produces a meaningful output. This is in contrast to inner-product-like algorithms, where the indices need to be matched before multiplication, leading to inefficient utilization of memory bandwidth to fetch elements redundantly.

• Maximized reuse of non-zeros: All elements in a row-of-B are shared for all elements in a column-of-A within an outer product. This maximizes the amount of reuse within a particular outer product calculation to its theoretical maximum, as we illustrate later in Figure 2.

• Minimized loads of a column and row: As a result of maximized data reuse within an outer product calculation, there is no available data reuse across different outer products. Thus, once the computation between a column-of-A and the corresponding row-of-B is completed, they are never used again and can be evicted from local memory.

In the rest of this section, we present details about the two phases of outer product multiplication: multiply and merge. Since our algorithm requires that A be in CC format and B be in CR, we describe in Section 4.3 how we convert a matrix into its complementary format.

4.1 Multiply Phase

Figure 2 shows an example multiplication of two 4×4 sparse matrices, given three parallel processing units in the system. For clarity of understanding, the dense representations of matrices A and B are shown on the top-left of the figure. These matrices are decomposed into pairs of columns-of-A and rows-of-B.

Figure 2: Outer product multiplication of matrices A (in CC) and B (in CR), illustrated in dense format, using 3 processing elements, and the layout of the partial products in memory. Both the CR and the CC modes of operation are shown here. Note that row 2 of B is empty and hence no outer product is formed corresponding to it. The blue blocks represent the row/column pointers; the orange and green blocks are the partial product rows/columns containing index-value pairs.


An outer product operation between each pair generates a full 4×4 compressed matrix, and the generated matrices are summed to produce the final result. In the CR mode of operation, each processing unit multiplies one non-zero element from a column-of-A with all the non-zeros in the corresponding row-of-B. The processing units are greedily scheduled in this example.

In our implementation, we store the intermediate partial products as a set of linked lists corresponding to each row (pointed to by Ri), where each node of the list contains a contiguous set of values representing a partial row of an outer product (Figure 2).

The CC mode of operation is analogous to the CR mode, where we re-program the processing units to multiply an element of a row-of-B with all the non-zeros in the corresponding column of A. The row pointers, Ri, are replaced by column pointers, Ci. This is illustrated in the bottom-right part of Figure 2.
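A compact Python sketch of the multiply phase in CR mode, under the assumption that A is already in CC form and B in CR form (per-column and per-row lists of index/value pairs, as in the helpers sketched in Section 3.2); plain Python lists stand in for the linked lists pointed to by Ri, and the function name is ours.

```python
# Multiply phase (CR mode): every non-zero of column i of A is multiplied with the
# whole row i of B; each resulting partial row is appended to the list for output row r.
def multiply_phase(A_cc, B_cr):
    n = len(A_cc)                     # square matrices assumed, for simplicity
    R = [[] for _ in range(n)]        # R[r]: list of partial rows of output row r
    for i, col_a in enumerate(A_cc):  # one outer product per column/row pair
        row_b = B_cr[i]
        if not row_b:
            continue                  # empty row of B: no outer product is formed
        for r, a_val in col_a:
            R[r].append([(c, a_val * b_val) for c, b_val in row_b])
    return R
```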

4.2 Merge Phase

The outer products pointed to by a row/column pointer need to be merged to form the final result. We assign the processing units to walk through the linked list pointed to by Ri/Ci and merge its nodes to form a complete final row/column. In the event that multiple data values from different outer products correspond to the same index, they must be summed together. However, this gets increasingly rare with sparser matrices.

In Section 5, we elaborate on the merging scheme that maps efficiently to the architecture of OuterSPACE. The hardware can be programmed to produce the resultant matrix in either the CR or the CC format. For brevity, we assume CR mode operation in the rest of the paper.
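Functionally, the merge phase reduces each per-row list of partial rows into one sorted row, summing values that collide on the same column index. The sketch below captures only that functional behavior (the hardware-friendly streaming variant is described in Section 5.4.2); chaining it after the multiply-phase sketch above yields the final result in CR form.

```python
# Merge phase: accumulate the partial rows of each output row by column index.
def merge_phase(R):
    C_cr = []
    for partial_rows in R:
        acc = {}
        for partial_row in partial_rows:
            for c, v in partial_row:
                acc[c] = acc.get(c, 0.0) + v   # collisions are summed
        C_cr.append(sorted(acc.items()))       # row as (column, value) pairs in order
    return C_cr
```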

4.3 Matrix Format Conversion

When matrices A and B are not available in the CC and CR formats, respectively, either one or both will have to be converted to the complementary format. This is a one-time requirement for chained multiplication operations of the type A×B×C..., since OuterSPACE can output the result in either CR or CC. However, computations such as A^N can be decomposed into a logarithmic number of operations (A^2 = A×A, A^4 = A^2×A^2, and so on), where each operation would consist of conversion followed by the actual computation. The requirement of conversion is obviated for symmetric matrices, since the CR and CC forms are equivalent.
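The logarithmic decomposition of A^N mentioned above is ordinary exponentiation by squaring; a small sketch follows, where `spgemm` is a placeholder for one conversion-plus-multiplication step rather than a function defined in the paper.

```python
# Compute A^n using O(log n) sparse multiplications; spgemm(X, Y) stands for one
# format conversion followed by one outer-product matrix multiplication.
def matrix_power(A, n, spgemm):
    assert n >= 1
    result, base = None, A
    while n > 0:
        if n & 1:                                 # this bit contributes base to the product
            result = base if result is None else spgemm(result, base)
        if n > 1:
            base = spgemm(base, base)             # square for the next bit
        n >>= 1
    return result
```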

In our evaluations, we assume that both inputs, A and B, are in the CR format, such that A must be converted. We partition the conversion operation into conversion-load and conversion-merge phases, similar to the multiply and merge phases. The processing elements stream through matrix A and store it into the intermediate data structure (Figure 2) in parallel. Conceptually, this is similar to multiplying the matrix A with an Identity Matrix of the same dimension:

$I_{CC} \times A_{CR} \rightarrow A_{CC}$ (CC mode)

where $I_{CC}$ is the Identity Matrix and the subscripts represent the respective storage formats.
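Functionally, the conversion amounts to scattering each CR row into per-column buckets; a minimal sketch (names ours) is shown below.

```python
# Convert a matrix from CR to CC by streaming through its rows and scattering
# each (column, value) pair into the bucket of that column.
def cr_to_cc(A_cr, n_cols):
    cols = [[] for _ in range(n_cols)]
    for r, row in enumerate(A_cr):
        for c, v in row:
            cols[c].append((r, v))    # (row-index, value) pairs, already in row order
    return cols
```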

4.4 Performance on Traditional Hardware

Outer product multiplication is a well-established linear algebra routine. Yet, it is not widely implemented in mainstream hardware, as these designs are not well-suited to such an algorithm. We demonstrate the inefficiencies of this paradigm on traditional hardware quantitatively by comparing our own outer product implementations against state-of-the-art libraries, since there are no readily available outer product libraries.

4.4.1 Multi-Core CPU (Intel Xeon)

Figure 3 compares the run-times of our CPU implementation of the outer product algorithm, using the POSIX threads library, against the Intel MKL SpGEMM library on an Intel Xeon processor with 6 threads.

Figure 3: Comparison of our outer product implementation against Intel MKL on a Xeon multi-core CPU. The matrices are uniformly random with increasing dimension and decreasing density, keeping the number of non-zeros constant at 1 million. Format conversion and memory allocation times are not considered.

While the run-time of MKL drops exponentially with decreasing matrix density, the outer product algorithm must overcome two overheads with increasing matrix dimension (N): the decreasing number of useful operations performed on each matrix datum, and the increasing number of book-keeping operations due to the growing size of the data structure in Figure 2. Thus, the price of no index-matching and minimized redundant reads of non-zeros in the outer product technique is paid by additional pressure on the memory system, as N partial product matrices of size N×N are streamed out during the multiply phase and back in during the merge phase.

This necessitates larger memory bandwidth and more cores to churn through the data streams than are available on our 6-core CPU. It is further exacerbated by the absence of software-controlled scratchpad memories and the ineffectiveness of CPU caching for the merge phase, which does not exhibit any data sharing between caches and thus leads to thrashing. This is substantiated by our studies of cache performance for the matrices in Figure 3, which show mean L2 hit rates of 0.14 and 0.12 during the multiply and merge phases, respectively. In comparison, the MKL SpGEMM routines are vectorized and heavily optimized for the multi-core architecture.

Table 1 presents data generated using Intel VTune Amplifier for a Core i7 CPU running MKL on the same matrices as in Figure 3. The under-utilization of bandwidth (average of 62%) suggests that bandwidth is not the primary bottleneck and that increasing it will likely provide only sub-linear speedups.


Table 1: Bandwidth utilization of the MKL sparse GEMM on an Intel Core i7 running 4 threads. Each matrix has a uniform random distribution of 10 million non-zeros.

  Matrix Dimension | Peak Bandwidth Utilization (%) | Avg. Bandwidth Utilization (%)
  1,048,576        | 62.5                           | 44.2
  2,097,152        | 67.5                           | 58.4
  4,194,304        | 67.5                           | 62.0
  8,388,608        | 85.0                           | 62.4

4.4.2 GPU (NVIDIA Tesla)

NVIDIA's implementations for matrix multiplication provided in their CUSP and cuSPARSE libraries perform very poorly at density levels below 0.01%. Running synthetic workloads at this level of sparsity achieves fewer than 1 GFLOPS. The L1 cache hit rate and utilized memory bandwidth drop from 86% and 30 GB/s at 10% density to under 40% and 10 GB/s, respectively. We compare the performance of our custom outer product implementation, written in CUDA, against CUSP. Our implementation makes use of the GPU's available scratchpad storage. Figure 4 compares the run-times when run on an NVIDIA K40 GPU.

Figure 4: Comparison of a GPU outer product implementation against CUSP. The matrices are uniformly random with increasing dimension while density is decreased, keeping the number of non-zeros constant at 1 million.

A comparison of the results from Figure 3 and Figure 4 shows that the GPU makes better use of available processing power and bandwidth than the CPU. The multiplication phase streams and processes the data much faster than the CPU implementation, scaling roughly linearly with decreasing density. However, latency is quickly dominated by the merge phase. Despite both phases achieving similarly high L1 hit rates (> 80%) and low data dependency stalls (< 5%), the merge phase suffers from a much lower total throughput. This is a result of numerous conditional branches within the code to handle different relative column indices as they are read in and sorted. Because there is little correlation between adjacent threads as they process these branches, many threads within a given warp diverge and must be executed serially. Thus, while the high degree of parallelism available is attractive, the SIMD nature of the GPU's processing elements prevents an overall win of the algorithm over traditional libraries.

4.4.3 Many-Core CPU (Intel Xeon Phi)

Our experiments with the CPU outer product code on an Intel Xeon Phi Knights Corner system show an average slowdown of 14.7× compared to the CPU, for uniformly random matrices of size 32K to 524K with 1 million non-zeros. We also note that denser matrices incur a significant amount of memory allocation overhead, which worsens the overall run-time. Although the outer product approach has high degrees of parallelism, it lacks an equivalent vectorizability. Moreover, Akbudak and Aykanat show in [4] that memory latency, rather than bandwidth, is the performance bottleneck for a many-core Xeon Phi system.

Intel MKL's SpGEMM function also shows a 1.1× to 8.9× increase in execution time with respect to that on the Xeon CPU with decreasing density. The large caches of CPUs result in significantly better sparse matrix-matrix multiplication performance compared to the Xeon Phi, because repeatedly accessed rows of B may already be available in cache. The throughput-oriented Xeon Phi architecture has much smaller caches, resulting in many inefficient reloads of data from global memory. This trend is similar to what is observed in [50] for both the CPU and the Xeon Phi.

To combat the inefficiencies of existing architectures, we designed a many-core architecture using an SPMD paradigm to exploit the massive parallelism available in the algorithm. We allow simple, dynamic resource allocation using asynchronous tiles and a non-coherent memory hierarchy. The SPMD paradigm allows computation to drift among the cores when work is imbalanced, leading to better compute utilization than the SIMT programming model of GPUs. Furthermore, to address the performance discrepancy between the multiply and merge phases due to their different memory access patterns, we employ a reconfigurable cache system that allows sharing data across multiple cores when needed and segmenting storage to isolated processing units when it is not.

5. THE OUTERSPACE ARCHITECTURE

Qualitative analysis and the results from the previous section reveal two key reasons why outer product multiplication does not perform well on conventional hardware:

• Outside of a particular outer product calculation during the multiply phase, there is no reuse of elements, as explained in Section 4. During this phase, the corresponding columns-of-A and rows-of-B can be shared across processing elements, whereas there is no data sharing between the processing units during the merge phase. Traditional hardware lacks the support to optimize for both of these phases while dealing with variable memory allocation.

• While the multiply phase has consistent control flow between adjacent threads, dynamic execution paths in the merge phase necessitate fine-grained asynchrony across processing elements to fully utilize the available parallelism. An ideal architecture will allow for fully decoupled processing without sacrificing the ability to effectively share data.


Figure 5: The memory hierarchy (left) and the architectures of the Processing Tile (center) and the Processing Element (right). The solid dark lines represent 64-bit bidirectional links.

To harness the massive parallelism and data reuse through fine-grained control over what data is fetched from memory, we propose our custom architecture, OuterSPACE. Figure 5 shows the microarchitecture of the OuterSPACE system. Our architecture is a system-on-chip consisting of SPMD-style parallel processing elements arranged as tiles, with two levels of shared, reconfigurable caches, and a set of control processors for scheduling and coordination, all connected to an HBM. In this work, we implement simple cache-to-scratchpad reconfiguration by switching off the tag arrays, although recent work such as coherent scratchpads [6] and Stash [35] has the potential to make OuterSPACE more general.

The following is a summary of the key elements of our proposed architecture:

• Processing Element (PE): A data-streaming and compute engine comprising an ALU with a floating point unit, scratchpad memory, control unit, work queue and outstanding request queue (16 PEs are grouped to form a tile)

• Local Control Processor (LCP): A small in-order core that coordinates the 16 PEs in a tile, streaming instructions for the PEs to execute

• Central Control Processor (CCP): A power-efficient core that is responsible for scheduling work and allocating memory for the intermediate data structures

• High-speed crossbars and coalescing caches that can be reconfigured as scratchpad memories

• High Bandwidth Memory that stores the input matrices and the intermediate partial products

In the following subsections, we present a detailed description of each element of the OuterSPACE node. We model 16 PEs per tile and 16 tiles, based on scalability studies [53], to ensure that crossbar sizes do not bottleneck our architecture.

5.1 Processing Element

A block diagram of a Processing Element (PE) is shown on the right side of Figure 5. At the core of the PE is a floating-point capable ALU for multiplication and summation of the matrix elements. These elements are streamed in from the memory hierarchy (Section 5.3), orchestrated by the Control Unit, which generates loads and stores to memory. An outstanding request queue keeps track of the loads and stores that are in flight. Lastly, there is a small private scratchpad and a FIFO work queue for decoded instructions and bookkeeping data, which are supplied to the PE by the LCP.

5.2 Processing Tile

A processing tile consists of 16 PEs, an LCP and a reconfigurable cache with 16 processor-side read/write ports and 4 memory-side ports. These caches internally consist of 16 single-ported cache banks and a controller (not shown) that interfaces with the LCP to reconfigure the cache into a scratchpad. As mentioned in Section 4, this is the key structure that reconfigures our system from a shared memory architecture into a non-shared one. The PEs within a tile communicate with the lower memory levels through a 4-ported crossbar.

5.3 Memory Hierarchy

Figure 5 shows 16 processing tiles that interface with the main memory through 4 L1 caches. These caches act like victim caches [30] and thus are smaller than their L0 counterparts, in order to minimize undesired eviction of data that is going to be reused. They cache in elements that are evicted from the L0 caches when some PEs start drifting away in their execution flow from others within a tile, which occurs when the PEs are operating on multiple rows simultaneously. The L0 caches also contain a 16×16 crossbar (not shown).

Our baseline main memory features an HBM 2.0 x64 interface [55], with 16 memory channels and a total memory bandwidth of 128 GB/s. With a PE clocked at 1.5 GHz, the total bandwidth demand of all the PEs is 9.2 TB/s (256 PEs × 1.5 giga-operations per second × 12 B per access for a double-precision value and index pair × read + write channels). We overcome this bandwidth gap with a multi-level cache-crossbar hierarchy connecting the PEs to the memory. This hierarchy provides extensive reuse of temporally correlated data within an outer product.
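The 9.2 TB/s figure follows directly from the quoted factors; the check below is just arithmetic, with the per-access size assumed to be an 8 B double-precision value plus a 4 B index.

```python
# Aggregate PE-side bandwidth demand: PEs x ops/s x bytes/access x (read + write).
pes, ops_per_sec, bytes_per_access, channels = 256, 1.5e9, 12, 2
print(pes * ops_per_sec * bytes_per_access * channels / 1e12, "TB/s")  # ~9.2 TB/s vs 128 GB/s of HBM
```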

The PEs in our system execute in an SPMD fashion, often drifting apart from each other and only synchronizing at the end of the multiply and merge phases, as outlined in Section 4. This contrasts with GPUs, which traditionally employ a SIMT model of execution where compute units operate in lockstep through a dataset. This also opens up avenues for various circuit-level techniques to improve energy efficiency, such as voltage throttling and dynamic bandwidth allocation.


Furthermore, the PEs only share read-only data, which allows the crossbars to be non-coherent structures without breaking correctness. Incorporating techniques such as SARC [33], VIPS [34] or DeNovo [16] would help expand OuterSPACE to algorithms demanding coherence, which we defer to future work.

5.4 Mapping the Outer Product Algorithm

This subsection illustrates how the outer product algorithm described in Section 4 maps to OuterSPACE.

5.4.1 Multiply Phase

As mentioned in Section 4.1 and illustrated in Figure 2, processing units (PEs) multiply an element of a column-of-A with the entire corresponding row-of-B. The L0 caches retain the rows-of-B until all the PEs within a tile are finished with this set of multiplications. The PEs store the multiplied results in contiguous memory chunks, which are part of the linked list corresponding to a row pointer (Ri), using a write-no-allocate policy to avoid the results evicting elements of B. The memory layout described in Figure 2, where the chunks of memory in the different nodes of a list are discontiguous, allows each PE to work independently without any synchronization. Thus, the only data that the PEs share throughout the multiply phase is read-only data (the values and column indices of the rows-of-B).

5.4.2 Merge Phase

A subset of the PEs within each tile is assigned to merge all the partial products corresponding to a single row of the resultant final matrix at the start of the merge phase. The values to be merged are distributed across the partial products, as indicated by Ri in Figure 2.

For minimum computational complexity, a parallel merge-sort algorithm would be optimal for merging rows across partial products in rN·log(rN) time, where rN is the total number of elements in the row. However, this would result in multiple re-fetches of the same data when the entire row cannot be contained within the upper memory hierarchy, which would dominate the execution time. Instead, we focus on minimizing memory traffic. Our algorithm operates as follows (assuming the number of rows to merge is rN, each of which contains rN elements, where r and N are the density and dimension of the matrix, respectively):

1. Fetch the head of each row and sort by column index into a linked list (O(r²N²) operations).

2. Store the smallest-indexed element from the list into its final location, load the next element from the corresponding row and sort it into the list (O(rN) operations).

3. Repeat step 2 until all elements of each row have been sorted and shipped to memory (r²N² iterations).

The overall complexity is O(r³N³). While less efficient algorithmically, the number of elements stored in local memory is only on the order of rN. A local buffer of the next elements to sort can help hide the latency of inserting elements into the list under the latency of grabbing a new element from main memory. Given our target workloads and system specifications, the time to sort the values is expected to be on the order of one-tenth to one-millionth of the time for the memory to supply the values, with sparser matrices having a smaller discrepancy.
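The three steps above can be sketched in a few lines of Python; a sorted list stands in for the linked list of row heads, and only one head per partial row is resident at a time, mirroring the bounded local storage. This is a functional sketch, not the hardware implementation.

```python
import bisect

# Streaming merge of one output row: keep only the current head of each partial row,
# sorted by column index; repeatedly emit the smallest and refill from that row.
def streaming_merge(partial_rows):
    heads = []                                   # sorted list of (col, val, row_id, pos)
    for s, row in enumerate(partial_rows):
        if row:
            bisect.insort(heads, (row[0][0], row[0][1], s, 0))
    merged = []
    while heads:
        col, val, s, pos = heads.pop(0)          # step 2: smallest-indexed element
        if merged and merged[-1][0] == col:
            merged[-1] = (col, merged[-1][1] + val)   # same index: sum the values
        else:
            merged.append((col, val))
        if pos + 1 < len(partial_rows[s]):       # load the next element of that row
            nxt = partial_rows[s][pos + 1]
            bisect.insort(heads, (nxt[0], nxt[1], s, pos + 1))
    return merged
```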

Because this phase requires no data sharing across tiles (each set of rows that is being merged reads independent data), the shared cache can be reconfigured into private scratchpads. This minimizes memory transactions by eliminating conflict cache misses within the processing tiles and saves energy by eliminating tag bank lookups. If a scratchpad bank is too small to buffer the entire merge operation of each row, we recursively merge a subset of the rows into a single row until the number of rows is sufficiently small.

Figure 5 illustrates the reconfigurability of the tiles to handle the different phases of computation. Specifically:

• A batch of the PEs within a tile is disabled to throttle per-PE bandwidth and conserve power. Half of the PEs load the row buffers, while the remainder sort the incoming values and store them to the result matrix.

• A subset of the cache banks within a tile is reconfigured as private scratchpad memories for the PEs (Figure 5). The reconfiguration can be achieved simply by power-gating the tag array of the cache. The address range of each bank is remapped by software, with the assistance of the LCP. Thus, the reconfigured private cache-scratchpad structure maximizes memory-level parallelism while minimizing stalls due to computation. This technique has been employed on a smaller scale in GPUs [47].

5.5 Memory Management

Precise memory pre-allocation for the intermediate partial products is impossible, as the sizes of the outer products depend on the specific row/column sizes [39]. However, due to the predictable nature of the two phases, we can greatly reduce the overhead of dynamic memory allocation compared to general schemes.

For the multiply phase, we statically assign each partial product enough storage to handle the average case (the average number of non-zero elements can be quickly calculated from the compressed format before computation begins), as well as a large spillover stack to be used dynamically for larger products. As a statically assigned PE (one per row/column pair) begins computation of a given product, it can evaluate from the row pointers exactly how much spillover space is needed. The PE sends a single atomic instruction to increment a global stack pointer by the appropriate amount, and writes the current value to a location visible to the other PEs. As long as the PEs do not consume the entirety of their static allocation before the atomic load returns, the latency of memory allocation can be entirely hidden.

In the merge phase, we perform a single memory allocation before computation by maintaining a set of counters for the size of each row in the multiply phase.


Table 2: Simulation parameters of OuterSPACE.

  Processing Element  | 1.5 GHz clock speed, 64-entry outstanding requests queue, 1 kB scratchpad memory. Multiply phase: all 16 PEs per tile active. Merge phase: 8 PEs per tile active, rest disabled
  L0 cache/scratchpad | Multiply phase: 16 kB, 4-way set-associative, 16-ported, shared, non-coherent cache with 32 MSHRs and 64 B block size per tile. Merge phase: 2 kB, 4-way set-associative, single-ported, private cache with 8 MSHRs and 64 B block size + 2 kB scratchpad per active PE-pair
  L1 cache            | 4 kB, 2-way set-associative, 16-ported, shared, non-coherent with 32 MSHRs and 64 B blocks
  Crossbar            | 16×16 & 4×4 non-coherent, swizzle-switch based
  Main Memory         | HBM 2.0 with 16 64-bit pseudo-channels each @ 8000 MB/s, with 80-150 ns average access latency

As matrices get sparser, the amount of space wasted due to merge collisions becomes negligibly small. We discuss the estimated allocation overheads in Section 7.3.
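The static-plus-spillover scheme of this subsection can be summarized in a short sketch; the class below is a conceptual model only (the alpha factor and sizing are illustrative, and the single integer stands in for the hardware's atomic fetch-and-add on the global spillover pointer).

```python
# Conceptual model of the allocation scheme: each outer product gets a statically
# sized region for the average case; larger products reserve extra space by bumping
# a shared spillover pointer (one atomic fetch-and-add per oversized product).
class SpilloverAllocator:
    def __init__(self, avg_slots_per_product, alpha=2):
        self.static_slots = alpha * avg_slots_per_product
        self.spill_ptr = 0                        # global spillover stack pointer

    def reserve(self, needed_slots):
        """Return (static_slots, spill_region) for one outer product."""
        if needed_slots <= self.static_slots:
            return self.static_slots, None        # fits in the static allocation
        extra = needed_slots - self.static_slots
        offset, self.spill_ptr = self.spill_ptr, self.spill_ptr + extra
        return self.static_slots, (offset, extra) # start and length in the spillover area
```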

The memory footprint of the outer product approach can be represented as $(\alpha \cdot N + \beta \cdot N^2 \cdot r + \gamma \cdot N^3 \cdot r^2)$, where $\alpha$, $\beta$ and $\gamma$ are small, implementation-dependent constants, for uniformly random sparse matrices with dimension N and density r. For non-uniform matrices, this metric is not easily quantifiable, as the sparsity patterns of the two matrices heavily influence the number of collisions between non-zeros.

5.6 Other Matrix Operations

We evaluate a sparse-matrix sparse-vector multiplication algorithm similar to our matrix-matrix implementation, with a few simplifications. In particular, the amount of work assigned to each PE is reduced and no scratchpad is needed in the merge phase, as partial products do not need to be sorted.

Element-wise matrix operations follow a similar procedure to the merge phase of the matrix-matrix multiplication algorithm described in Section 4.2. Given N matrices A1, A2, ..., AN with the same dimensions, the data can be reorganized into a data structure similar to the one illustrated in Figure 2 and element-wise operations (+, -, ×, /, ==) can be performed on it.

The performance of element-wise matrix addition, subtraction, multiplication and comparison operations on OuterSPACE is expected to be very similar to that of the merge phase of sparse matrix-matrix multiplication, which is primarily bottlenecked by memory throughput. There is close to a one-to-one correspondence between the data operations in each of the typical element-wise matrix routines (addition, subtraction, multiplication and comparison) and the merge phase of our outer-product sparse matrix-matrix multiplication. Thus, they are expected to have similar complexities and we do not evaluate them separately in this work.

6. EXPERIMENTAL SETUP

To evaluate the performance of the outer product algorithm on OuterSPACE, we modeled the key elements, namely the PEs, the cache-crossbar hierarchy and the HBM, using the gem5 simulator [11]. We created two separate models pertaining to the two phases. We ignored the start-up time, since it can be easily hidden by the latency of a few memory accesses and by scheduling delays.

Table 3: CPU and GPU configurations.

  CPU | 3.6 GHz Intel Xeon E5-1650 V4, 6 cores/12 threads, 128 GB RAM, solid state drives
  GPU | NVIDIA Tesla K40, 2880 CUDA cores @ 745 MHz, 12 GB GDDR5 at 288 GB/s

We also assumed that the PEs are greedily scheduled for the multiply phase. We used an outstanding request queue size of 64 per PE, which is on par with the number of load-store units in modern GPUs [1] and CPUs [27], to hide the latencies of memory accesses. The simulation parameters used are shown in Table 2.

We chose our parameters to optimize for performance and power efficiency. For the merge phase, we enabled only 8 of the 16 PEs per tile and reconfigured a proportional number of cache banks into private scratchpad memories. In fact, we observed that enabling a greater number of PEs results in a slight performance degradation due to thrashing in the L1 cache. The scratchpad size is chosen to be large enough to hide the load latency during sorting operations.

CACTI 6.5 [43] was used for modeling the cache latencies, area and power. For the power dissipated by the cores, we derive static and dynamic power consumption for an ARM Cortex-A5 with VFPv4 in 32 nm from [53]. We pessimistically use the same aggressive core model for the PEs in addition to the LCPs and CCP. Dynamic power consumption is calculated by capturing activity factors of the caches and cores from simulation. The HBM power numbers are derived from the JEDEC specification document [55] and [5]. The parameters for modeling the crossbars are also obtained from [53].

We built an instruction trace generator for the PEs and ran the generated traces through our gem5 model to process large matrices. Due to the practical limitations of this approach, we did not model the dynamic memory allocation overhead in OuterSPACE, and we did not consider this overhead on any other platform in our evaluation. However, in Section 7.3, we provide an analysis of this overhead by quantifying the number of dynamic allocation requests using the allocation approach of Section 5.5.

7. EVALUATION

We evaluate the OuterSPACE architecture by comparing it against state-of-the-art library packages on commercial systems, namely Intel's MKL (Version 2017 Initial Release) on the CPU, and NVIDIA's cuSPARSE (Version 8.0) and CUSP (Version 0.5.1) on the GPU. The specifications of these platforms are summarized in Table 3. We show the performance of OuterSPACE on two important classes of matrix operations, sparse matrix-matrix and sparse matrix-vector multiplication.

We collected the simulation times reported by gem5 for the multiply and merge models running instruction traces that excluded memory-allocation code. In order to provide fair bases for comparison against the CPU and the GPU, we discarded the memory-allocation time and only considered the run-times of the compute functions for the MKL, cuSPARSE and CUSP implementations (memory allocation and computation are discrete functions in these libraries).


Table 4: Matrices from the University of Florida SuiteSparse (SS) [19] and Stanford Network Analysis Project (SNAP) [37] collections, with their dimension, number of non-zeros (nnz), average number of non-zeros per row/column (nnz_av) and problem domain.

  Matrix         | Dim.  | nnz   | nnz_av | Kind
  2cubes-sphere  | 101K  | 1.6M  | 16.2   | EM problem
  amazon-0312    | 401K  | 3.2M  | 8.0    | Co-purchase network
  ca-CondMat     | 23K   | 187K  | 8.1    | Condensed matter
  cage12         | 130K  | 2.0M  | 15.6   | Directed weighted graph
  cit-Patents    | 3.8M  | 16.5M | 4.4    | Patent citation network
  cop20k-A       | 121K  | 2.6M  | 21.7   | Accelerator design
  email-Enron    | 36.7K | 368K  | 10.0   | Enron email network
  facebook       | 4K    | 176K  | 43.7   | Friendship network
  filter3D       | 106K  | 2.7M  | 25.4   | Reduction problem
  m133-b3        | 200K  | 801K  | 4.0    | Combinatorial problem
  mario-002      | 390K  | 2.1M  | 5.4    | 2D/3D problem
  offshore       | 260K  | 4.2M  | 16.3   | EM problem
  p2p-Gnutella31 | 63K   | 148K  | 2.4    | p2p network
  patents-main   | 241K  | 561K  | 2.3    | Directed weighted graph
  poisson3Da     | 14K   | 353K  | 26.1   | Fluid dynamics
  roadNet-CA     | 2.0M  | 5.5M  | 2.8    | Road network
  scircuit       | 171K  | 959K  | 5.6    | Circuit simulation
  webbase-1M     | 1M    | 3.1M  | 3.1    | Directed weighted graph
  web-Google     | 916K  | 5.1M  | 5.6    | Google web graph
  wiki-Vote      | 8.3K  | 104K  | 12.5   | Wikipedia network

While reporting throughput in GFLOPS, we only counted operations associated with multiplication and accumulation, in order to maintain consistency across algorithms and avoid artificially inflating performance by accounting for additional bookkeeping. We verified that the raw GFLOPS reported by our GPU performance counters on the Florida benchmark suite (Section 7.1.2) are very similar to those reported by Liu and Vinter [39], but we omit those results for brevity.

7.1 Sparse Matrix-Matrix Multiplication

Without loss of generality, we evaluate the performance of sparse matrix-matrix multiplication on our platform by multiplying a sparse matrix with itself (C = A×A, with A in CR format to begin with, generating C in CR), in order to generate a result equivalent to that of multiplying two matrices of similar densities. We use two different sets of matrices, synthetic and real-world, as benchmarks to evaluate our system. We include the format conversion overhead for non-symmetric matrices (Section 4.3) while reporting performance results for OuterSPACE, to model the worst case.

7.1.1 Synthetic Matrices

In Figure 6, we compare the performance scaling of the outer product algorithm on OuterSPACE against the CPU and GPU libraries, using a set of synthetic datasets obtained from the Graph500 R-MAT data generator [44]. The R-MAT parameters were set to their default values (A = 0.57, B = C = 0.19), as used for Graph500 to generate undirected power-law graphs and as employed in recent work on graph analytics [51, 58]. We present data for medium-sized matrices corresponding to nEdges = 100,000, with nVertices swept between 5,000 and 80,000. To illustrate the impact of sparsity pattern on performance, we also provide comparisons against uniformly random matrices of the same size and density.

OuterSPACE performs consistently well with respect to the other platforms. It outperforms MKL and CUSP by a greater margin for the R-MATs than for the uniformly random matrices, with the execution time changing only slightly across matrix densities. cuSPARSE, however, performs better with increasing density. OuterSPACE exhibits a slight performance degradation with increasing density for uniformly random matrices, but achieves starker speedups over the GPU for the power-law graphs. This data also substantiates that the outer product algorithm is much less sensitive to a change in size for a given number of non-zeros than the MKL functions, which correlates with our observations on the CPU (Figure 3).

7.1.2 Real-World Matrices

The real-world matrices we evaluate were derived from the University of Florida SuiteSparse Matrix Collection [19] and the Stanford Network Analysis Project (SNAP) [37], which contain a wide spectrum of real-world sparse matrices from diverse domains such as structural engineering, computational fluid dynamics, model reduction, social networks, web graphs and citation networks. These matrices have been widely used for performance evaluation in prior work [18, 24, 40, 52] in this area. We choose matrices from these collections that have both regular and irregular structures. Table 4 presents a summary of the structure and properties of these matrices. They have dimensions varying from 4,096 to 3,774,768 and densities ranging between 0.001% and 1.082%.

In Figure 7, we present the speedups of OuterSPACE over MKL, cuSPARSE and CUSP.

Figure 6: Performance-scaling comparison of OuterSPACE with change in matrix dimension and density. The first set of data is R-MATs with parameters (A = 0.57, B = C = 0.19) for undirected graphs. The next set is for uniformly random matrices of the same size and density as the R-MAT dataset.


Figure 7: Speedups of OuterSPACE over the CPU running Intel MKL and the GPU running cuSPARSE and CUSP.

OuterSPACE steadily achieves a speedup across all the matrices identified in Table 4, with an average speedup of 7.9× over the Intel MKL, 13.0× over cuSPARSE and 14.0× over CUSP. Some of the matrices that show relatively smaller speedups over MKL and cuSPARSE are filter3D and roadNet-CA. These matrices are regular (they have most of their non-zeros along their diagonals) and the multiplication algorithms used by the libraries work better with such matrices, because they incur fewer comparisons while multiplying two regular matrices. OuterSPACE performs only 3.9× better than MKL for the m133-b3 matrix, due to the uneven distribution of non-zeros along its columns, which leads to load imbalances during the merge phase and uneven data-sharing patterns during the multiply phase. MKL performs particularly badly on email-Enron, a real-world email dataset with the characteristics of a power-law graph [15], substantiating the observation made in Section 7.1.1. OuterSPACE also achieves its highest speedups over cuSPARSE for matrices that have a more smeared (irregular) distribution of non-zeros, such as ca-CondMat, cit-Patents, p2p-Gnutella31 and web-Google.

OuterSPACE running the outer product algorithm over this suite of matrices achieves an average throughput of 2.9 GFLOPS, accounting only for useful operations (multiplications and summations). We also observe a memory bandwidth utilization of 59.5-68.9% for the multiply phase and a slightly lower 46.5-64.8% for the merge phase, due to fewer active PEs and greater use of local memory during the merge phase. This could be further improved if the matrices were strategically laid out in memory such that every memory channel is accessed more evenly, but that would require compiler support and is beyond the scope of our work.

7.2 Sparse Matrix-Vector Multiplication

In this section, we evaluate the performance of sparse matrix-vector multiplication on OuterSPACE. Table 5 shows the speedups of OuterSPACE over the CPU running MKL and the GPU running cuSPARSE, with vector densities ranging from 0.01 to 1.0 (fully dense). The performance of MKL is constant across different vector densities for a given matrix dimension and density, because MKL's sparse matrix-vector method performs best when the vector is treated as a dense vector, regardless of the number of zeros in the vector. cuSPARSE, however, scales with the change in vector density.

Table 5: Speedups of OuterSPACE over the CPU (MKL) and the GPU (cuSPARSE) for sparse matrix-vector multiplication. The density of the vector (r) is varied from 0.01 to 1.0. The sparse matrices contain a uniformly random distribution of one million non-zeros.

  Matrix Dim. | Speedup over CPU (r=0.01 / r=0.1 / r=1.0) | Speedup over GPU (r=0.01 / r=0.1 / r=1.0)
  65,536      | 93.2 / 8.7 / 0.8                          | 92.5 / 11.2 / 3.8
  131,072     | 107.5 / 9.9 / 1.0                         | 98.2 / 11.2 / 2.8
  262,144     | 152.1 / 12.6 / 1.2                        | 126.0 / 12.5 / 2.3
  524,287     | 196.3 / 17.2 / 1.7                        | 154.4 / 17.4 / 2.2

Across all matrix sizes, the speedup of OuterSPACE scales linearly with the vector density, with a 10× reduction in density resulting in approximately a 10× gain in speedup. Such performance scaling is possible because the outer product algorithm only accesses the specific columns in the input matrix that match the indices of the non-zero elements of the vector. This eliminates all redundant accesses to the matrix that are present in a conventional inner product algorithm. Thus, the number of memory accesses to the sparse matrix is directly proportional to the number of non-zeros in the vector.
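The column-selective access pattern described above can be sketched as follows, assuming the matrix is held as per-column lists of (row, value) pairs and the vector as a list of (index, value) pairs; only columns matching non-zeros of the vector are ever touched. The helper name is ours.

```python
# Outer-product style SpMV: y = A * x, touching only the columns of A whose
# indices appear among the non-zeros of x.
def spmv_outer(A_cc, x_sparse):
    y = {}
    for j, xv in x_sparse:            # one scaled column per non-zero of the vector
        for i, av in A_cc[j]:
            y[i] = y.get(i, 0.0) + av * xv
    return sorted(y.items())          # sparse result as (row, value) pairs
```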

We identify two striking differences between the outer product algorithm on OuterSPACE and existing matrix-vector multiplication on CPUs and GPUs. First, unlike conventional algorithms, whose performance depends heavily on the dimensions of the matrix regardless of its density, the performance of the outer product algorithm scales with the number of non-zero elements in the matrix while remaining independent of the matrix size, for uniform matrices. Second, the performance of the sparse matrix-vector algorithm on OuterSPACE also scales linearly with the density of the vector, which allows OuterSPACE to outperform traditional algorithms for sparse vectors. Even when working on small, dense vectors, OuterSPACE achieves within 80% of MKL's performance, as reflected in the fourth column of Table 5.

7.3 Dynamic Memory Allocation

As detailed in Section 5.5, the latency of dynamic allocation requests by the PEs can typically be hidden by sending the atomic "increment spill-over pointer" operation before the PE begins multiplication. Statically assigning more space for partial products improves the run-time by decreasing the number of accesses for dynamic allocation, at the expense of wasted storage. We assume α·nnz²/N elements are allocated statically, where nnz²/N is the amount of storage needed for an average row and α is a parameter. Our analysis of the total number of dynamic requests to increment the spill-over pointer, while sweeping α, shows that the count of these requests drops to less than 10,000 for α ≥ 2 for almost all the matrices in Table 4. m133-b3 is an outlier, with zero dynamic requests, as it has exactly 4 non-zeros per row, which fits within the statically allocated space even for α = 1. Our strategy works best for matrices whose non-zeros are more uniformly distributed, since a suitable value of α can then eliminate most of the dynamic allocation requests.
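The following is a minimal sketch of that allocation policy, using our own names and a software atomic in place of the hardware operation. The chunk size and bookkeeping fields are assumptions, and the real design issues the atomic request ahead of the multiply phase to hide its latency, rather than on demand as shown here.

    /* Sketch of static-plus-spill-over allocation for partial products.
     * Each row gets a static budget derived from alpha * nnz^2 / N; overflow
     * space is claimed by atomically bumping a shared spill-over pointer.
     * Names, chunk size, and the on-demand request are illustrative only.   */
    #include <stdatomic.h>
    #include <stdint.h>

    #define OVERFLOW_CHUNK 64u              /* slots claimed per spill-over request */

    typedef struct {
        uint64_t static_base;               /* start of this row's static region    */
        uint64_t static_slots;              /* this row's share of alpha*nnz^2/N    */
        uint64_t used;                      /* slots written so far in this row     */
        uint64_t ovf_base, ovf_left;        /* current overflow chunk, if any       */
    } RowAlloc;

    static _Atomic uint64_t spillover_ptr;  /* shared index into the overflow heap  */

    /* Return the slot index for the next partial product of this row. */
    uint64_t alloc_slot(RowAlloc *r)
    {
        if (r->used < r->static_slots)                /* common case: static space  */
            return r->static_base + r->used++;
        if (r->ovf_left == 0) {                       /* rare case: grab a new chunk */
            r->ovf_base = atomic_fetch_add(&spillover_ptr, OVERFLOW_CHUNK);
            r->ovf_left = OVERFLOW_CHUNK;
        }
        r->used++;
        return r->ovf_base + (OVERFLOW_CHUNK - r->ovf_left--);
    }

Raising α in this scheme trades wasted static storage for fewer atomic spill-over requests, which is the trade-off swept in the analysis above.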

7.4 Power and Area Analysis

Table 6 presents area and power estimates for OuterSPACE in 32 nm. The total chip area, excluding the HBM controllers, is calculated to be 87 mm². The power consumption of the system is calculated to be 24 W, using the parameters presented in Section 6. This yields, on average, 0.12 GFLOPS/W for OuterSPACE.

Table 6: Power and area estimates for OuterSPACE (Figure 5).

Component                    Area (mm²)   Power (W)
All PEs, LCPs, CCP                49.14        7.98
All L0 caches/scratchpads         34.40        0.82
All L1 caches                      3.13        0.06
All crossbars                      0.07        0.53
Main memory                         N/A       14.60
Total                             86.74       23.99

For comparison, the mean measured power consumption of the K40 GPU while running the workloads was 85 W. With the GPU achieving only 0.067 GFLOPS on average, this yields 0.8 MFLOPS/W. The OuterSPACE system is thus approximately 150× better than the GPU on a performance/power metric.
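For reference, the efficiency figures quoted above follow directly from the reported throughput and power numbers (our arithmetic, rounded to the reported precision):

    OuterSPACE:  2.9 GFLOPS / 24 W    ≈ 0.12 GFLOPS/W  (120 MFLOPS/W)
    K40 GPU:     0.067 GFLOPS / 85 W  ≈ 0.8 MFLOPS/W
    Ratio:       120 MFLOPS/W / 0.8 MFLOPS/W ≈ 150×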

8. OUTERSPACE SCALING

Our current architecture is still well below current reticle sizes and power limitations. To handle matrices larger than a few million non-zeros, a silicon-interposed system with 4 HBM stacks and 4× the PEs on-chip could be realized. This extended configuration would support input matrices containing tens to hundreds of millions of non-zero elements, limited by the capacity of the HBM. To process still larger matrices, we envision equipping our architecture with node-to-node serializer-deserializer (SerDes) channels, allowing multiple OuterSPACE nodes to be connected in a torus topology that minimizes system latency and maximizes throughput. Such a system would be able to process matrices with billions of non-zeros. To scale to matrices containing trillions of non-zeros, we envision interconnecting many such 16-node clusters.

9. RELATED WORK

With the prevalence of sparse matrix kernels in big-data applications and their rising significance in a multitude of areas, there has been abundant work on accelerating sparse matrix-dense vector/matrix multiplication [3, 10, 26, 41]. However, relatively little work has been done to accelerate sparse matrix-sparse vector/matrix multiplication; most efforts instead target software frameworks for existing state-of-the-art architectures such as multi-core and many-core CPUs [4, 52, 57], GPUs [18, 24, 39, 41] and heterogeneous (CPU-GPU) systems [39, 40, 41]. In contrast, our work demonstrates an efficient co-design of the outer product based sparse matrix multiplication technique with our custom, scalable, reconfigurable architecture, achieving significant speedups over state-of-the-art CPU and GPU libraries.

There has been some work on enhancing the underlying hardware for sparse matrix-matrix multiplication. Lin et al. [38] propose an FPGA-based architecture for sparse matrix-matrix multiplication that uses on-chip dedicated DSP blocks and reconfigurable logic as processing elements (PEs). However, their design is significantly limited by scarce on-chip FPGA resources, including the number of PEs and the size of the on-chip memory. Yavits and Ginosar [64] explore a sparse matrix-matrix and sparse matrix-vector accelerator based on juxtaposed resistive CAM and RAM; it applies a row-by-row algorithm to efficiently match the indices of the multiplier and multiplicand and to select the ReRAM row in which the corresponding non-zero element of the sparse multiplicand matrix/vector is stored. Zhu et al. [66] introduce a 3D-stacked logic-in-memory system that places logic layers between DRAM dies to accelerate sparse data accesses in a 3D-DRAM system, and build a custom CAM architecture that speeds up the index-alignment step of column-by-column matrix multiplication by exploiting its parallel matching characteristics.

However, the aforementioned hardware solutions accelerate SpGEMM algorithms that incur a large amount of redundant memory accesses, which we have identified as a key performance bottleneck in sparse matrix-matrix multiplication. With our custom architecture, OuterSPACE, tailored for the outer product algorithm that eliminates most of these redundant accesses, we achieve significant speedups as well as high energy efficiency (Section 7).

There is also prior work on accelerating sparse matrix-vector multiplication. Mishra et al. [42] add blocking to the baseline software and design fine-grained accelerators that augment each core for sparse matrix-vector multiplication. Nurvitadhi et al. [45] propose a SpMSpV algorithm using column-based SpMV, row blocking, column skipping, unique value compression (UVC), and bit-level packing of matrix data, along with a hardware accelerator for it composed of a data management unit and PEs. In our work, we address both sparse matrix-matrix multiplication and sparse matrix-vector multiplication.

10. CONCLUSION

The Intel MKL and NVIDIA CUSP and cuSPARSE libraries are successful state-of-the-art libraries for sparse linear algebra. However, our experiments show that MKL and cuSPARSE work optimally only for regular sparse matrices. While CUSP is insensitive to the irregularity of sparse matrices, it introduces extra memory overhead for intermediate storage [39].

In this work, we explore an outer product based matrix multiplication approach and implement it on traditional CPUs and GPUs. We discover inefficiencies in these architectures, which lead to sub-optimal performance of the outer product algorithm, and resolve them by building new custom reconfigurable hardware. Our novelty lies in the efficient co-design of the algorithm implementation and the hardware. We demonstrate OuterSPACE's efficiency for two key kernels, sparse matrix-matrix multiplication and sparse matrix-vector multiplication, which form the building blocks of many major applications.

The reconfigurable memory hierarchy of OuterSPACE, which adapts to the contrasting data-sharing patterns of the outer product algorithm, helps reduce the number of main memory accesses. This, coupled with the increased flexibility across PEs afforded by SPMD-style processing (owing to the lack of synchronization and coherence overheads), enables OuterSPACE to achieve good throughput and high speedups over traditional hardware. The energy savings are attributed to the bare-bones PE and energy-efficient swizzle-switch crossbar designs [53].

In essence, our work demonstrates that a massively-parallel architecture consisting of asynchronous workers, coupled with memory hierarchies tailored to retain reusable data, uncovers an enormous potential to accelerate kernels that are heavily memory-bound.

11. REFERENCES

[1] NVIDIA Kepler GK110 architecture whitepaper. Technical report, NVIDIA, 2012.

[2] Seventh Green Graph 500 list (http://green.graph500.org/lists.php), 2016.

[3] S. Acer, O. Selvitopi, and C. Aykanat. Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems. Parallel Computing, 59, 2016.

[4] K. Akbudak and C. Aykanat. Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. http://www.prace-ri.eu/IMG/pdf/wp144.pdf, 2017.

[5] M. Alfano, B. Black, J. Rearick, J. Siegel, M. Su, and J. Din. Unleashing fury: A new paradigm for 3-D design and test. IEEE Design & Test, 34(1):8–15, 2017.

[6] L. Alvarez, L. Vilanova, M. Moreto, M. Casas, M. Gonzalez, X. Martorell, N. Navarro, E. Ayguade, and M. Valero. Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. In Proceedings of the 42nd Annual Int'l Symposium on Computer Architecture, ISCA '15, pages 720–732, New York, NY, USA, 2015. ACM.

[7] A. Azad, A. Buluc, and J. R. Gilbert. Parallel triangle counting and enumeration using matrix algebra. 2015 IEEE Int'l Parallel and Distributed Processing Symposium Workshop, pages 804–811, 2015.

[8] N. Bell, S. Dalton, and L. N. Olson. Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J. Scientific Comput., 2011.

[9] N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.

[10] M. A. Bender, G. S. Brodal, R. Fagerberg, R. Jacob, and E. Vicari. Optimal sparse matrix dense vector multiplication in the I/O-model. Theory of Computing Systems, 47(4):934–962, Nov 2010.

[11] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Aug. 2011.

[12] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. 7th Int'l WWW Conference, 1998.

[13] A. Buluc and J. R. Gilbert. The Combinatorial BLAS: Design, implementation and applications. The Int'l Journal of High Performance Computing Applications.

[14] A. Buluc and J. R. Gilbert. On the representation and multiplication of hypersparse matrices. In IEEE Int'l Symposium on Parallel and Distributed Processing, IPDPS '08, pages 1–11. IEEE, 2008.

[15] A. Chapanond, M. S. Krishnamoorthy, and B. Yener. Graph theoretic and spectral analysis of Enron email data. Computational & Mathematical Organization Theory, 11(3):265–281, Oct 2005.

[16] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C. T. Chou. DeNovo: Rethinking the memory hierarchy for disciplined parallelism. In 2011 Int'l Conference on Parallel Architectures and Compilation Techniques, pages 155–166, Oct 2011.

[17] S. Dalton, N. Bell, L. Olson, and M. Garland. CUSP: Generic parallel algorithms for sparse matrix and graph computations. http://cusplibrary.github.io, 2015. Version 0.5.1.

[18] S. Dalton, L. Olson, and N. Bell. Optimizing sparse matrix-matrix multiplication for the GPU. ACM Trans. Math. Softw., 41(4):25:1–25:20, Oct. 2015.

[19] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, Dec. 2011.

[20] I. S. Duff, M. A. Heroux, and R. Pozo. An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum. ACM Transactions on Mathematical Software (TOMS), 28(2):239–267, 2002.

[21] J. R. Gilbert, S. Reinhardt, and V. B. Shah. A unified framework for numerical and combinatorial computing. Computing in Science & Engineering, 10(2):20–25.

[22] J. R. Gilbert, S. Reinhardt, and V. B. Shah. High-performance graph algorithms from parallel sparse matrices. Proc. of the Int'l Workshop on Applied Parallel Computing, 2006.

[23] G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris. Understanding the performance of sparse matrix-vector multiplication. In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP '08. IEEE, 2008.

[24] F. Gremse, A. Hofter, L. O. Schwen, F. Kiessling, and U. Naumann. GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SIAM Journal on Scientific Computing, 37(1):C54–C71, 2015.

[25] V. Hapla, D. Horak, and M. Merta. Use of Direct Solvers in TFETI Massively Parallel Implementation, pages 192–205. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[26] A. F. Heinecke. Cache Optimised Data Structures and Algorithms for Sparse Matrices. B.S. thesis, Technical University of Munich, April 2008.

[27] Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. Number 248966-025, June 2011.

[28] S. Itoh, P. Ordejón, and R. M. Martin. Order-N tight-binding molecular dynamics on parallel computers. Computer Physics Communications, 88(2):173–185, 1995.

[29] R. W. Johnson, C. H. Huang, and J. R. Johnson. Multilinear algebra and parallel programming. The Journal of Supercomputing, 5(2):189–217, 1991.


[30] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual Int'l Symposium on Computer Architecture, ISCA '90. IEEE, 1990.

[31] H. Kaplan, M. Sharir, and E. Verbin. Colored intersection searching via sparse rectangular matrix multiplication. In SCG '06: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, pages 52–60, 2006.

[32] G. Karypis, A. Gupta, and V. Kumar. A parallel formulation of interior point algorithms. In Proceedings of the 1994 ACM/IEEE Conference on Supercomputing, Supercomputing '94, pages 204–213, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.

[33] S. Kaxiras and G. Keramidas. SARC coherence: Scaling directory cache coherence in performance and power. IEEE Micro, 30(5):54–65, Sept 2010.

[34] S. Kaxiras and A. Ros. Efficient, snoopless, system-on-chip coherence. In 2012 IEEE Int'l SOC Conference, pages 230–235, Sept 2012.

[35] R. Komuravelli, M. D. Sinclair, J. Alsop, M. Huzaifa, M. Kotsifakou, P. Srivastava, S. V. Adve, and V. S. Adve. Stash: Have your scratchpad and cache it too. In Proceedings of the 42nd Annual Int'l Symposium on Computer Architecture, ISCA '15, pages 707–719, June 2015.

[36] E. Lee, H. Kang, H. Bahn, and K. G. Shin. Eliminating periodic flush overhead of file I/O with non-volatile buffer cache. IEEE Transactions on Computers, 65(4):1145–1157, April 2016.

[37] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[38] C. Y. Lin, Z. Zhang, N. Wong, and H. K. H. So. Design space exploration for sparse matrix-matrix multiplication on FPGAs. In 2010 Int'l Conference on Field-Programmable Technology, pages 369–372, Dec 2010.

[39] W. Liu and B. Vinter. An efficient GPU general sparse matrix-matrix multiplication for irregular data. In 2014 IEEE 28th Int'l Parallel and Distributed Processing Symposium, pages 370–381, May 2014.

[40] W. Liu and B. Vinter. A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors. CoRR, abs/1504.05022, 2015.

[41] K. Matam, S. R. K. B. Indarapu, and K. Kothapalli. Sparse matrix-matrix multiplication on modern architectures. In 2012 19th Int'l Conference on High Performance Computing, pages 1–10, Dec 2012.

[42] A. K. Mishra, E. Nurvitadhi, G. Venkatesh, J. Pearce, and D. Marr. Fine-grained accelerators for sparse machine learning workloads. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pages 635–640, Jan 2017.

[43] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A tool to model large caches.

[44] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang. Introducing the Graph 500. 2010.

[45] E. Nurvitadhi, A. Mishra, and D. Marr. A sparse matrix vector multiply accelerator for support vector machine. In 2015 Int'l Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), pages 109–116, Oct 2015.

[46] NVIDIA. cuSPARSE library. https://developer.nvidia.com/cuSPARSE, 2014.

[47] NVIDIA. NVIDIA's next generation CUDA compute architecture: Fermi. http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf, 2015.

[48] G. Penn. Efficient transitive closure of sparse matrices over closed semirings. Theoretical Computer Science, 354(1):72–81, 2006.

[49] M. O. Rabin and V. V. Vazirani. Maximum matchings in general graphs through randomization. Journal of Algorithms, 10(4):557–567, 1989.

[50] K. Rupp, P. Tillet, F. Rudolf, J. Weinbub, A. Morhammer, T. Grasser, A. Jungel, and S. Selberherr. ViennaCL: Linear algebra library for multi- and many-core architectures. SIAM Journal on Scientific Computing, 38(5):S412–S439, 2016.

[51] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In Proceedings of the 2014 ACM SIGMOD Int'l Conference on Management of Data, pages 979–990. ACM, 2014.

[52] E. Saule, K. Kaya, and U. V. Catalyurek. Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi. CoRR, abs/1302.1078, 2013.

[53] K. Sewell, R. G. Dreslinski, T. Manville, S. Satpathy, N. Pinckney, G. Blake, M. Cieslak, R. Das, T. F. Wenisch, D. Sylvester, D. Blaauw, and T. Mudge. Swizzle-switch networks for many-core systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2(2):278–294, June 2012.

[54] V. B. Shah. An Interactive System for Combinatorial Scientific Computing with an Emphasis on Programmer Productivity. PhD thesis, June 2007.

[55] A. Shilov. JEDEC publishes HBM2 specification. http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification, 2016.

[56] A. Smith. 6 new facts about Facebook, Feb 2014.

[57] P. D. Sulatycke and K. Ghose. Caching-efficient multithreaded fast multiplication of sparse matrices. In Proceedings of the First Merged Int'l Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pages 117–123, Mar 1998.

[58] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey. GraphMat: High performance graph analytics made productive. Proceedings of the VLDB Endowment, 8(11):1214–1225, 2015.

[59] AnandTech. NVIDIA launches Tesla K40. http://www.anandtech.com/show/7521/nvidia-launches-tesla-k40, 2013.

[60] R. A. Van De Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm.

[61] S. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, 2000.

[62] R. W. Vuduc and H. J. Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. Springer, 2005.

[63] I. Yamazaki and X. S. Li. On techniques to improve robustness and scalability of a parallel hybrid linear solver. Proceedings of the 9th Int'l Meeting on High Performance Computing for Computational Science, pages 421–434, 2010.

[64] L. Yavits and R. Ginosar. Sparse matrix multiplication on CAM based accelerator. CoRR, abs/1705.09937, 2017.

[65] R. Yuster and U. Zwick. Detecting short directed cycles using rectangular matrix multiplication and dynamic programming. In SODA '04: Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, 2004.

[66] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti. Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware. In 2013 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, Sept 2013.