HPCA 2020 Unofficial Author’s Copy
ALRESCHA: A Lightweight Reconfigurable Sparse-Computation Accelerator
Bahar Asgari, Ramyad Hadidi, Tushar Krishna, Hyesoon Kim, Sudhakar Yalamanchili
Georgia Institute of Technology, Atlanta, GA
{bahar.asgari, rhadidi, tushar, hyesoon.kim, sudha}@gatech.edu
ABSTRACT
Sparse problems that dominate a wide range of applications fail to effectively benefit from the high memory bandwidth and concurrent computations of modern high-performance computer systems. Therefore, hardware accelerators have been proposed to capture a high degree of parallelism in sparse problems. However, the unexplored challenge for sparse problems is the limited opportunity for parallelism because of data dependencies, a common computation pattern in scientific sparse problems. Our key insight is to extract parallelism by mathematically transforming the computations into equivalent forms. The transformation breaks down the sparse kernels into a majority of independent parts and a minority of data-dependent ones and reorders these parts to gain performance. To implement the key insight, we propose a lightweight reconfigurable sparse-computation accelerator (Alrescha). To efficiently run the data-dependent and parallel parts and to enable fast switching between them, Alrescha makes two contributions. First, it implements a compute engine with a fixed compute unit for the parallel parts and a lightweight reconfigurable engine for the execution of the data-dependent parts. Second, Alrescha benefits from a locally-dense storage format, with the right order of non-zero values to yield the order of computations dictated by the transformation. The combination of the lightweight reconfigurable hardware and the storage format enables uninterrupted streaming from memory. Our simulation results show that compared to a GPU, Alrescha achieves an average speedup of 15.6× for scientific sparse problems, and 8× for graph algorithms. Moreover, compared to the GPU, Alrescha consumes 14× less energy.
1. INTRODUCTION
Sparse problems have become popular in a wide range of applications, from scientific problems to graph analytics. Since traditional high-performance computing systems fail to effectively provide high bandwidth for sparse problems, several studies have advocated software optimizations for CPUs [1, 2, 3], GPUs [4, 5, 6, 7, 8], and CPU-GPU systems [9]. As the effectiveness issue has coupled with approaching the end of Moore's law, specialized hardware for sparse problems has become attractive. For instance, hardware accelerators have been proposed for sparse matrix-matrix multiplication [10, 11, 12, 13], matrix-vector multiplication [14, 15, 16, 17], or both [18, 19, 20]. In addition, accelerators for sparse problems have been proposed to reduce memory-access latency or improve energy efficiency [21, 22, 23, 24, 25]. Further, approaches such as blocking [26, 25, 13] have been used to reduce indirect memory accesses.
The aforementioned hardware accelerators and software-optimization techniques for sparse problems often focus on a specific application domain and take advantage of a specific computation pattern to improve performance (more details in Table 2). However, flexibility in the range of target applications is an important feature for a hardware accelerator. Such flexibility is not just for creating more generic accelerators, but for accelerating all the different kernels in a program to effectively improve the overall performance. In fact, unlike the assumptions of prior work, sparse problems may comprise two groups of kernels with contradictory features: (i) highly parallelizable kernels and (ii) data-dependent kernels. In such a case, the challenge is that while a sparse problem requires high bandwidth and a high level of concurrency, the dependent computations prevent benefiting from the available memory bandwidth and the high level of concurrency. In other words, the need for high bandwidth and the limited opportunity for parallelism are two contradictory attributes, which challenge performance optimization. Such sparse problems have become a major computation pattern in many fields of science. For instance, the high-performance conjugate gradient (HPCG) [27] benchmark is now a complement to the high-performance Linpack (HPL).
To clarify the challenge, Figure 1a shows the particular data-dependency pattern in scientific sparse problems that causes a performance bottleneck (details in §2). As the pseudo code shows, for col < row, the computation of x[row] must wait for the previous elements of x to be done. As a result, processing each row of the matrix depends on the result of processing the previous row, and the rows cannot be processed in parallel. Therefore, as Figure 1b shows, typically, the rows of the matrix are processed sequentially, even though processing an individual row (i.e., a dot product) can be parallelized. To extract more parallelism, code optimizations such as row reordering or matrix coloring [8] have been proposed to capture and run independent rows in parallel. However, the effectiveness of such high-level methods (i.e., at the granularity of instructions) depends on the distribution of non-zero values in a matrix (e.g., a specific problem may not have independent rows).
The Key Insight: Our observation to resolve the challenge is that the data-dependent operations rely only on a fraction of the results from the previous operations. Thus, the key insight is to extract parallelism by mathematically transforming the computation into equivalent forms. Such transformations allow us to break down the
[Figure 1 spans this area.
(a) An example of a sparse problem with challenging parallelism:
  for row = 0 to number_of_rows
    for j = 0 to nonzeros_in_row[row]
      if (col != row)
        sum += A_val[j] * x[A_col[j]]
    x[row] = sum / A_val[row]
  // if col < row, the computation must wait for x[col] to be updated.
(b) Approach 1: rows must be processed sequentially — a sequence of data-dependent computations (dot product 1, update; dot product 2, update; dot product 3; ...). By excluding the dependency-prone part, the rest can run in parallel.
(c) Approach 2 (ALRESCHA): rows can be processed in parallel. (1) Parallelizable dot products; a locally-dense storage format is required to provide the right blocks of matrix A in the right order. (2) Parallelizable dot products on small various-size operands. (3) Sequential operations on small operands; their latency can be optimized by fast reconfigurable hardware to avoid a latency bottleneck. (2) and (3) also require fine-grained ordering of values in the blocks of the storage format.]
Figure 1: An example of sparse problems with data-dependency patterns in its computations.

traditionally dependent parts into a majority of parallel and a minority of dependent operations. To do so, we exclude the dependency-prone section of the matrix and run the rest in parallel (i.e., we convert Figure 1b to Figure 1c). To implement the key insight, we propose a lightweight reconfigurable sparse-computation accelerator (Alrescha¹), which makes three key contributions:
• Lightweight reconfigurability: After transforming the computations, while the majority of operations will be highly parallelizable ((1) and (2)), we still need to execute a few data-dependent computations (3), albeit on small operands. During runtime, the application repetitively switches between the three mentioned groups of operations (i.e., between (1) and (2)/(3)). The switching itself must be fast enough to prevent a bottleneck. To this end, the compute engine of Alrescha implements quick reconfiguration during runtime by integrating a fixed computation unit for (1) and a small reconfigurable computation unit for (2) and (3).
• Locally-dense storage format: Since Alrescha reorders the computations to process the parallel parts together (1), we propose a storage format in which the order of non-zero values matches the order of computations. The storage format enables streaming the ordered blocks of matrix A with the right timing. Besides, such a storage format facilitates (2) and (3) by dictating the required element-wise ordering. Alrescha integrates the proposed storage format with a data-driven execution model to eliminate the transferring and decoding of meta-data.
• Generic sparse accelerator: The ability of Alrescha to accelerate distinct kernels makes it the
¹ A binary star, the two stars of which orbit one another.
Preconditioned conjugate gradient (PCG):
p(0) = x; r(0) = b - A p(0);
for i = 1 to m
  z = MG(r(i-1));
  a(i) = dot_prod(r, z);
  if i = 1 then p = z;
  else p = a(i)/a(i-1) * p + z;
  a(i) = a(i) / dot_prod(p, Ap);
  x(i+1) = x(i) + a(i) p(i);
  r(i) = r(i-1) - a(i) Ap(i);
MG(depth):
if (depth < 3)
  SymGS(); SpMV(); MG(depth++); SymGS();
else
  SymGS();
for row = 0 to rows
  sum = b[row]
  for j = 0 to nnz_in_row[row]
    col = A_col[j]; val = A_val[j];
    if (col != row)
      sum = sum - val * x[col];
  x[row] = sum / A_diag[row];
Figure 2: An example of the PCG algorithm [27] for solving a sparse linear system of equations (i.e., Ax = b), including SpMV and SymGS.
first generic hardware accelerator for both scientific and graph applications, regardless of whether they have data-dependent compute patterns. For applications with no distinct kernels, Alrescha still sustains high performance by integrating the proposed storage format and the execution model, hence avoiding meta-data transfer.
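The key insight above can be made concrete with a small sketch. The code below is our own illustration (not the paper's implementation): a row-by-row triangular-solve-like update, first written naively so that every row waits on all previous rows, and then decomposed into independent block GEMVs plus small sequential diagonal parts, as in Figure 1c. The block width `w` and all helper names are ours.

```python
import numpy as np

def naive_sequential(A, b):
    """Row-by-row update: x[row] depends on all earlier entries of x."""
    n = len(b)
    x = np.zeros(n)
    for row in range(n):
        s = b[row] - A[row, :row] @ x[:row]   # waits on previous rows
        x[row] = s / A[row, row]
    return x

def block_decomposed(A, b, w):
    """Same result: off-diagonal blocks become independent GEMVs;
    only the small w-by-w diagonal blocks stay sequential."""
    n = len(b)
    x = np.zeros(n)
    for bi in range(0, n, w):
        s = b[bi:bi + w].copy()
        # (1) parallelizable part: block GEMVs on already-computed x
        for bj in range(0, bi, w):
            s -= A[bi:bi + w, bj:bj + w] @ x[bj:bj + w]
        # (2)/(3) small data-dependent part, confined to one block
        for r in range(w):
            row = bi + r
            s[r] -= A[row, bi:row] @ x[bi:row]
            x[row] = s[r] / A[row, row]
    return x
```

The off-diagonal block products in the outer loop have no mutual dependencies, so they can run concurrently; only the short inner loop per diagonal block remains serial.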
Alrescha accelerates various sparse kernels such as sparse matrix-vector multiplication (SpMV), the symmetric Gauss-Seidel (SymGS) smoother, PageRank (PR), breadth-first search (BFS), and single-source shortest path (SSSP). To evaluate Alrescha, we target a wide range of data sets with various sizes. Comparing Alrescha with a CPU and a GPU platform shows that Alrescha achieves an average speedup of 15.6× for scientific sparse problems and 8× for graph algorithms. Moreover, compared to a GPU, Alrescha consumes 14× less energy. We also compare Alrescha with state-of-the-art hardware accelerators for sparse problems, namely OuterSPACE [18], an accelerator for SpMV; GraphR [24]; and a memristive accelerator for scientific problems [25]. Our experimental results on various data sets show that, compared to the state of the art, Alrescha achieves an average speedup of 2.1×, 1.87×, and 1.7× for scientific algorithms, graph analytics, and SpMV, respectively. The performance gain on data sets with various distributions of non-zero values indicates that the benefits of Alrescha are independent of specific patterns in sparse data.
2. BACKGROUND
This section reviews the background and the characteristics of two key sparse problems as follows.
Scientific Problems: Many physical-world phenomena, such as sound, heat, elasticity, fluid dynamics, and quantum mechanics, are modeled with partial differential equations (PDEs). To numerically process and solve them via digital computers, PDEs are discretized to a 3D grid (e.g., using 27-stencil discretization), which is then converted to a linear system of algebraic equations, Ax = b, in which A is the coefficient matrix, often very large for two- or higher-dimensional problems (e.g., elliptic, parabolic, or hyperbolic PDEs). Such a system of linear equations, with a symmetric positive-definite matrix, can be solved by iterative algorithms such as conjugate gradient (CG) methods (e.g., preconditioned CG (PCG), which ensures fast convergence by preconditioning). These methods are specifically useful for
solving sparse systems that are too large to be solvedby direct methods.
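As a minimal sketch of the iterative method described above, the following is a plain (unpreconditioned) conjugate gradient solver in numpy. It is our own illustration of the CG family, not the paper's PCG, which additionally applies a multigrid preconditioner (the MG step of Figure 2); all names are ours.

```python
import numpy as np

def cg(A, b, iters=50, tol=1e-10):
    """Unpreconditioned CG for Ax = b, A symmetric positive-definite."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:             # squared residual norm small enough
            break
        p = r + (rs_new / rs) * p    # new A-conjugate direction
        rs = rs_new
    return x
```

Each iteration is dominated by one matrix-vector product (`A @ p`) plus a few dot products, which is why SpMV-like kernels dominate the runtime of CG-family solvers.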
Figure 3: The breakdown of the execution time of PCG on NVIDIA K20: SymGS 63%, SpMV 31%, with the dot product and other kernels accounting for the remainder.
An example of the PCG algorithm for solving Ax = b is shown in Figure 2 [27]. The algorithm updates the vector x in m iterations. As Figure 3 shows, the execution time of the algorithm is dominated by two kernels, SpMV and SymGS [28, 8, 29]. The remaining kernels, such as the dot product, consume only a tiny fraction of the execution time and are so ubiquitous that they are executed using special hardware in some supercomputers.
To explore the characteristics of SpMV and SymGS,
we use an example of applying them to two operands, a vector (b_{1×m}) and a matrix (A_{m×n})¹. Applying SpMV to the two operands results in a vector (x_{1×n}), each element of which can be calculated as:
x_j = Σ_{i=1}^{k} b[AT_ind_i] × AT_val_{ij},   (1)
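Equation 1 can be sketched in plain Python: each output element is a dot product between the nonzeros of one column and the elements of b they index. The CSC-style arrays (`ptr`, `ind`, `val`) below are our own framing of the per-column storage, not the paper's format.

```python
import numpy as np

def spmv_columnwise(n_cols, ind, val, ptr, b):
    """Equation 1 sketch: for output element j, val[ptr[j]:ptr[j+1]]
    holds the nonzeros and ind[...] the positions of b they pair with
    (our own naming). Each x_j is independent, hence parallelizable."""
    x = np.zeros(n_cols)
    for j in range(n_cols):
        lo, hi = ptr[j], ptr[j + 1]
        x[j] = b[ind[lo:hi]] @ val[lo:hi]   # gather, then dot product
    return x
```

Because no x_j reads any other x_j, all n iterations of the loop could run concurrently, which is the parallelism potential of SpMV noted below.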
in which k, AT_val, and AT_ind are the number of non-zero values, the non-zero values themselves, and the row indices of the jth column of A^T, respectively. Figure 4a shows a visualization of Equation 1. Since the elements of the output vector can be calculated independently, SpMV has the potential for parallelism. On the other hand, each element of the vector that results from applying SymGS to the same two operands (i.e., vector b_{1×m} and matrix A_{m×n}) is calculated as follows, based on the Gauss-Seidel method [30]:
x_j^t = (1 / A^T_{jj}) (b_j − Σ_{i=1}^{j−1} A^T_{ij} × x_i^t − Σ_{i=j+1}^{n} A^T_{ij} × x_i^{t−1}).   (2)
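A direct transliteration of Equation 2 makes the dependency visible: updating x_j^t reads the freshly updated entries x^t_{1..j−1} of the same sweep, so the rows cannot be processed independently. This is a dense sketch with our own variable names, not the accelerator's implementation.

```python
import numpy as np

def symgs_sweep(AT, b, x_prev):
    """One forward Gauss-Seidel sweep per Equation 2: x_t[j] uses the
    already-updated entries x_t[:j] and the previous-iteration entries
    x_prev[j+1:], so the j-loop is inherently sequential."""
    n = len(b)
    x_t = x_prev.copy()   # entries j+1..n still hold iteration t-1 values
    for j in range(n):
        s = (AT[:j, j] @ x_t[:j]) + (AT[j + 1:, j] @ x_t[j + 1:])
        x_t[j] = (b[j] - s) / AT[j, j]
    return x_t
```

Repeating the sweep on a diagonally dominant system drives the residual toward zero, which is how the smoother is used inside multigrid.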
Figure 4: Calculation of (a) x_j in SpMV, and (b) x_j^t in SymGS.
Figure 4b illustrates a visualization of Equation 2 (i.e., the blue vectors correspond to Σ_{i=1}^{j−1} A^T_{ij} × x_i^t and the red vectors correspond to Σ_{i=j+1}^{n} A^T_{ij} × x_i^{t−1}). In fact, calculating the jth element of x at iteration t (i.e., the orange element of x^t in Figure 4b) depends not only on the values of x at iteration t−1 (i.e., the red elements of x^{t−1}), but also on the values of x^t that are being calculated in the current iteration (i.e., the blue elements of x^t). Such dependencies in the SymGS kernel limit the parallelism opportunity. Although some optimization strategies have been proposed for employing parallelism [8], the SymGS kernel can still be a performance bottleneck.
Graph Analytics: A common approach to represent
graphs is to use an adjacency matrix, each element of
¹ In the rest of the paper, all matrices A refer to this matrix.
Figure 5: An example graph, its adjacency sparse matrices, and the vector operands of two graph algorithms: (a) SSSP and (b) PR. (For SSSP, the path-length vector from A is initialized to [0, ∞, ∞, ∞] and finalizes to [0, 1, 2, 3]; for PR, the initial rank vector [0.25, 0.25, 0.25, 0.25], with out-degree vector [3, 1, 1, 2], finalizes to [0.83, 0.83, 0.37, 0.67].)
which represents an edge in the graph (Figure 5 illustrates an example of such a representation). Since in many applications the graph is sparse, the equivalent adjacency matrix includes several zeros. Graph algorithms traverse vertices and edges to compute a set of properties based on the connectivity relationships. Traversing is implemented as a form of dense-vector sparse-matrix operation. Such implementations are suited to the vertex-centric programming model [31], which is preferred to the edge-centric model. The vertex-centric model divides a graph algorithm into three phases. In the first phase, all the edges from a vertex (i.e., a row of the adjacency matrix) are processed. This process is a vector-vector operation between the row of the matrix and a property vector, which varies based on the algorithm. In the second phase, the output vector from the first phase is reduced by a reduction operation (e.g., sum). In the final phase, the result is assigned to its destination.
Three widely used graph algorithms are BFS, SSSP, and PR. Figure 5 shows an example graph, the adjacency matrices, and the vector operands for the SSSP and PR algorithms. For SSSP (Figure 5a), the vector containing the path lengths from node A is updated iteratively by multiplying a row of the matrix by the path-length vector and then choosing the minimum of the result vector. After traversing all the nodes, the final values of the vector indicate the shortest paths from node A to all the other nodes. PR (Figure 5b) iteratively updates the rank vector, initialized with equal values. At each iteration, the elements of the rank vector are divided by the elements of the out-degree vector (i.e., the number of out-going edges for each vertex), chosen by a row of the matrix, and the result vector is reduced to a single rank by adding the elements of the vector.
Common Features: While the sparse kernels used
in both scientific and graph applications are similar in having sparse matrix operands, some kernels (e.g., SpMV) exhibit more concurrency, whereas others (e.g., SymGS) have several data dependencies in their computations. Regardless of this difference, a common property of sparse kernels is that the reuse distance of accesses to the sparse matrix is high, while the input and output vectors of these kernels are reused frequently. Moreover, the accesses to at least one of the vectors are often irregular. The other, and more important, common feature of these kernels is that they follow three phases of operations iteratively (i.e., vector operation, reduce, and assign). Table 1 summarizes these phases for the main sparse kernels, as well as the operands and the operations at each phase. The sparse kernels calculate an element of their result by accessing a row/column of
Table 1: The properties of sparse kernels and corresponding dense data paths, implemented in Alrescha. Depending on the type of kernel, the operation in phase 1 can use the three vector operands at the same time or use just two of them.

Sparse Kernel | Application | Dense Data Paths | Phase 1 operand1 | Phase 1 operand2 | Phase 1 operand3 | Phase 1 operation | Phase 2 (reduce) | Phase 3 (assign)
SymGS | PDE solving | D-SymGS/GEMV | a row of the coefficient matrix | the vector from iteration (i-1) | the vector at iteration (i) | multiplication | sum | apply operation with A^T_{jj} and b_j and update vector
SpMV | PDE solving and graph | GEMV | a row of the coefficient matrix | the vector from iteration (i-1) | N/A | multiplication | sum | sum and update the vector
PageRank | Graph | D-PR | a column of the adjacency matrix | the out-degree vector of vertices | the rank vector at iteration (i-1) | AND/division | sum | rank vector update
BFS | Graph | D-BFS | a column of the adjacency matrix | frontier vector | N/A | sum | min | compare and update distance vector
SSSP | Graph | D-SSSP | a column of the adjacency matrix | frontier vector | N/A | sum | min | compare and update distance vector
the large sparse matrix only once and then reusing one or two vectors for the calculation of all output vector elements. We benefit from these common features to design an accelerator that is flexible enough to run all the mentioned sparse kernels without significant overhead. Alrescha converts the sparse kernels into the dense data paths listed in the second column of Table 1 (details in §4.1).
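The three-phase pattern of Table 1 can be sketched generically: phase 1 applies an element-wise operation between a matrix row/column and a property vector, phase 2 reduces the result, and phase 3 assigns it. Below is our own plain-Python framing, instantiated for SSSP; swapping the three callables (e.g., multiply/sum/update) would give the other kernels. All names and the example graph are ours, not the paper's.

```python
def three_phase_step(slice_, prop, vec_op, reduce_op, assign, j):
    """Phase 1: vec_op over nonzeros; phase 2: reduce_op; phase 3: assign."""
    candidates = [vec_op(a, p) for a, p in zip(slice_, prop) if a != 0]
    if candidates:
        assign(j, reduce_op(candidates))

def sssp_iteration(adj, dist):
    """One relaxation pass: dist[j] = min over in-edges of (dist[i] + w)."""
    new = dist[:]
    def assign(j, val):
        new[j] = min(new[j], val)          # phase 3: compare and update
    n = len(dist)
    for j in range(n):
        col = [adj[i][j] for i in range(n)]  # edges into vertex j
        three_phase_step(col, dist, lambda w, d: w + d, min, assign, j)
    return new
```

Repeating `sssp_iteration` until the distance vector stops changing yields the shortest paths from the source, mirroring the SSSP description above.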
3. MOTIVATION AND RELATED WORK
Sparse problems in either sparse (i.e., non-compressed) or compressed representation face many performance challenges. The sparse formats are less efficient because they require storing, transferring, and processing zero values. On the other hand, even the highly preferred compressed representations still rely on transferring meta-data and random memory accesses that limit performance by memory-access latency. Besides, as the ratio of compute per data transfer of many sparse problems (i.e., those including vector-matrix operations) is low, we would normally expect their performance to be directly related to the memory bandwidth. However, gaining higher performance for sparse problems is not as straightforward as adding more memory bandwidth. This section sheds more light on the challenges that sparse problems – specifically those with data-dependent computations – face when being executed on modern CPUs, GPUs, or even on hardware accelerators.
Ineffectiveness of CPUs and GPUs: To date, many software-level optimizations for CPUs [1, 2, 3], GPUs [4, 5, 6, 7, 8], and CPU-GPU systems [9] have been proposed. However, software optimizations alone cannot effectively handle irregular data-dependent memory references, a main characteristic of sparse problems. Irregular data-dependent memory references limit the
Figure 6: The performance of modern computing platforms ranked by the standard metric of the HPCG [27] benchmark on GPUs and CPUs (Pflops/sec and fraction of peak (%), for platforms including NVIDIA Volta V100 (Summit), NVIDIA Tesla V100 (Sierra), NVIDIA Tesla P100, K80, K40, and K20x, Intel Xeon Phi 7250 68C, Intel Xeon Platinum 8160 24C, Intel Xeon Platinum 8174 24C, Intel Xeon E5-2670 12C, and Intel Xeon E5-2680-V3).
reach of software schemes and thus lead to poor performance due to degraded bandwidth utilization. Furthermore, optimizations for extracting more parallelism and bandwidth, such as matrix coloring [8] and blocking [26], have not been effective enough for the aforementioned reason. Figure 6 summarizes the performance of running sparse scientific applications on a range of CPUs and GPUs. As the figure shows, they utilize only a tiny fraction of the peak performance.
Prior Hardware Accelerators: The ineffectiveness
of CPUs and GPUs, along with approaching the end of Moore's law, has motivated the migration to specialized hardware for sparse problems. For instance, hardware accelerators have targeted sparse matrix-matrix multiplication [10, 11, 12, 13], matrix-vector multiplication [14, 15, 16, 17], or both [18, 19, 20], which are the main sparse kernels in many sparse problems. A state-of-the-art SpMV accelerator, OuterSPACE [18], employs an outer-product algorithm to minimize redundant accesses to the non-zero values of the sparse matrix. Despite the speedup of OuterSPACE over traditional SpMV, achieved by increasing the data-reuse rate and reducing memory accesses, it produces random accesses to a local cache. To efficiently utilize memory bandwidth, Hegde et al. proposed ExTensor [19], a novel fetching mechanism that avoids the memory-latency overhead associated with sparse kernels. Song et al. proposed GraphR [24], a graph accelerator, and Feinberg et al. proposed a scientific-problem accelerator [25], both of which process blocks of non-zero values instead of individual ones. Besides, Huang et al. have proposed analog [32] and hybrid (analog-digital) [33] accelerators for solving PDEs.
The prior specialized hardware designs often have not focused on resolving the challenge of data-dependent computations in sparse problems, which prevents benefiting from the available memory bandwidth. Sparse problems may involve a combination of contradictory parallelizable and data-dependent kernels. The flexibility to accelerate both types of kernels is necessary not only to improve the overall performance but also to be more generic. Table 2 compares the most relevant hardware approaches and techniques for accelerating sparse problems with Alrescha. As the table lists, Alrescha is the first reconfigurable sparse-problem accelerator for both scientific and graph applications that supports multi-kernel execution and resolves the limited parallelism at fine granularity. Besides, unlike prior work, Alrescha reduces the number of accesses to the vector operand of the sparse matrix-vector operations.
Table 2: Comparing the state-of-the-art accelerators for sparse kernels.

                                | GraphR [24] | OuterSPACE [18] | Memristive-Based Accelerator [25] | Row Reordering / Matrix Coloring [8] | Alrescha (our work)
Application Domain              | Graph | Graph (only SpMV) | PDE solver | PDE solver | Graph and PDE solver
Multi-Kernel Support            | No | No | No | No | Yes
BW Utilization                  | Low | Moderate | Low | Moderate | High
NOT Transferring Meta-data      | No | No | No | No | Yes
Processing Type                 | ReRAM crossbar | PEs connected in a high-speed crossbar | heterogeneous memristive crossbar | GPU instruction | Fixed vector processor and a small reconfigurable switch
Cache Optimizations for Frequently-Used Vectors | N/A | No | N/A | No | Yes
Reconfigurability               | No | Only for cache hierarchy | No | N/A | Yes
Storage Format                  | 4×4 COO | CSR | multi-size blocks (64×64, 128×128, 256×256, 512×512) | ELL | 8×8 blocking with fine-grained in-block ordering
Resolving Limited Parallelism   | N/A | N/A | No | Yes (instruction-level, limited by sparsity pattern) | Yes
4. ALRESCHA
Alrescha is a memory-mapped accelerator. The memory dedicated to Alrescha is also accessible by a host for programming. Figure 7 shows an overview of Alrescha, the host, and the connections for programming and transferring data. The programming model of Alrescha is similar to offloading computations from a CPU to a GPU (the programming model is beyond the scope of this paper). To program the accelerator, the host launches the sparse kernels of sparse algorithms (i.e., PCG and graph algorithms) to the accelerator. To do so, the host first converts the sparse kernels into a sequence of dense data paths and generates a binary file. Then, the host writes the binary file to a configuration table of the accelerator through the program interface. The details of the conversion mechanism and the dense data path implementation are explained in §4.1 and §4.2.
During the execution of an algorithm, repetitive switching between the dense data paths is required. The key feature of Alrescha that enables fast switching among those dense data paths is its runtime reconfigurability. The details of the reconfigurable microarchitecture of Alrescha and the mechanism of real-time reconfiguration are explained in §4.3 and §4.4. Besides switching among the data paths during runtime, Alrescha also reorders the dense data paths to reduce the number of switches. Such a reordering necessitates a new storage format, which is introduced in §4.5. Therefore, the other task of the host is to reformat the sparse matrix operands into a storage format consisting of blocks, each of which corresponds to a dense data path. The formatted data is written into the physical memory space of the accelerator through the data interface (Figure 7).
Since the target algorithms are iterative, the preprocessing (i.e., conversion and reformatting) is a one-time overhead. Besides, the complexity and effort of preprocessing depend on the previous format, the data source, and the host platform. For instance, the conversion complexity from frequently-used storage formats (e.g., CSR and
Figure 7: The overview of Alrescha and the host. The host holds a program including the SymGS(), SpMV(), BFS(), SSSP(), and PR() sparse kernels and produces (i) a binary file with a sequence of dense data paths (i.e., GEMV, D-SymGS, D-BFS, D-SSSP, and D-PR), written to the configuration table of Alrescha through the program interface, and (ii) the matrix operand in the Alrescha storage format, written to the memory of Alrescha through the data interface.
BCSR) is linear in time and requires constant space. Since the preprocessing complexity is linear, it can be done while data streams from memory. Moreover, if data is generated in the system (e.g., through sensors), it is initially formatted in the Alrescha format and reformatting is not required.
4.1 Kernel to Data Path Conversion
Algorithm 1 shows the procedure for converting a sparse matrix to dense data paths. Based on the kernel type, the sparse matrix operand (A_{n×n}), and the dimension of the non-zero blocks in the matrix operand (ω), the algorithm generates the configuration table, each row of which specifies the type of data path and information about its operands. More specifically, for each dense data path, besides its type (i.e., DP), denoted by one bit, the index of the input vector operand (Inx_in), the index of the output vector (Inx_out), the access order, which can be left to right (l2r) or right to left (r2l), and the source of the operand (Op) are stored. In other words, Inx_in and Inx_out indicate the read and write addresses of a local cache, respectively; Op selects the local cache containing the vector operand. As a result, all the mentioned meta-data in a row of the configuration table takes 2⌈log2(n/ω)⌉ + 3 bits. The three bits are for the data path type, access order, and operand source, respectively. The configuration table is used for configuring a configuration switch (details in §4.3).
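The entry-width formula above can be checked with a few lines of arithmetic. This is our own sketch of the bit-count calculation; the function name is ours.

```python
import math

def config_entry_bits(n, w):
    """Width of one configuration-table row: two block indices of
    ceil(log2(n/w)) bits each, plus one bit each for the data-path
    type, the access order, and the operand source."""
    index_bits = math.ceil(math.log2(n // w))
    return 2 * index_bits + 3
```

For instance, with n = 9 and ω = 3 (the Figure 8 example) there are 3 block rows/columns, so each index needs 2 bits and an entry takes 7 bits; an 8×8 blocking of a 64-wide matrix would take 9 bits per entry.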
The general procedure of the conversion algorithm is as follows: (i) As lines 8 to 12 show, the sparse kernels with no (or straightforward) data dependencies, including SpMV, BFS, SSSP, and PR, are broken down into a sequence of general matrix-vector multiplication (GEMV), dense BFS (D-BFS), dense SSSP (D-SSSP), and dense PR (D-PR) data paths, respectively. These dense data paths have the same functionality as their corresponding sparse kernels; however, they work on non-overlapping locally-dense blocks of the sparse matrix operand and overlapping sub-vectors of the dense vector operand of the original sparse kernel. (ii) As lines 13 to 26 show, the sparse kernels with data dependencies (e.g., the SymGS kernel, the key computation of the PCG algorithm) are broken down into a majority of parallelizable GEMV (lines 14 to 21) and a minority of sequential dense SymGS (D-SymGS) data paths (lines 23 to 26).
The conversion for SymGS assigns GEMVs to
For accelerating SymGS, the key insight of Alrescha is to separate the GEMV from the D-SymGS data paths to prevent the performance from being limited by the sequential nature of the SymGS kernel. To this end, Alrescha reduces switching between the two data paths (GEMV and D-SymGS) by reordering them so that Alrescha first executes all the GEMVs in a row successively and then switches to a D-SymGS. The distributive property of inner products in Equation 2 guarantees the correctness of such reordering. As an example of the outcome of Algorithm 1, Figure 8 shows the state machine of PCG, equivalent to the algorithm in Figure 2, which comprises three sparse kernels, two of which are the focus of this paper and are launched to the accelerator by the host. The configuration table for a SymGS example is shown in Figure 8. Based on Equation 2, and as lines 19 and 21 of Algorithm 1 indicate, all the non-zero blocks in the upper triangle of A have to be multiplied by x^t, and all of those in the lower triangle have to be multiplied by x^{t−1}. §4.2 explains the reasoning for the listed access orders in Figure 8.
Algorithm 1 Convert Algorithm
1:  function Convert(KernelType, A_{n×n}, ω)
    // A_{n×n}: sparse matrix, ω: block width
    // DP: data path type
    // l2r: left to right, r2l: right to left
2:    Inx_in := 0, Inx_out := 0
3:    Blocks[] = Split(A, ω)  // partitions A into ω × ω blocks
4:    m = n/ω
5:    for (i = 1; i < m; i++) do
6:      for (j = 1; j < m; j++) do
7:        if (nnz(Blocks[i, j]) > 0) then
8:          if KernelType != SymGS then
9:            DP = KernelType.DataPath
10:           Inx_in = i.ω, Inx_out = j.ω
11:           Order = l2r
12:           Op = port1  // the operand vector
13:         else
14:           if (i != j) then
15:             DP = GEMV
16:             Inx_in = j.ω
17:             Inx_out = −1  // no write to cache
18:             Order = l2r
19:             if (i > j) then
20:               Op = port2  // which is x^{t−1}
21:             else
22:               Op = port1  // which is x^t
23:           else
24:             DP = D-SymGS
25:             Inx_in = j.ω, Inx_out = (i + 1).ω
26:             Order = r2l
27:             Op = port2  // which is x^{t−1}
28:       Add2Table(DP, Inx_in, Inx_out, Order, Op)
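Algorithm 1 can be transliterated into plain Python to make the block-classification logic concrete. This is our own sketch: the block-sparsity pattern is passed in as a nonzero-count grid, we iterate over the full 0-based block range, and the table is a list of tuples rather than a hardware structure.

```python
def convert(kernel_type, nnz_per_block, n, w):
    """Sketch of Algorithm 1: nnz_per_block[i][j] is the nonzero count
    of block (i, j); returns a list of configuration-table rows
    (DP, inx_in, inx_out, order, op)."""
    table = []
    m = n // w
    for i in range(m):
        for j in range(m):
            if nnz_per_block[i][j] == 0:
                continue                     # empty block: no data path
            if kernel_type != "SymGS":
                table.append((kernel_type, i * w, j * w, "l2r", "port1"))
            elif i != j:
                # off-diagonal block -> GEMV; lower triangle reads
                # x^(t-1) (port2), upper triangle reads x^t (port1)
                op = "port2" if i > j else "port1"
                table.append(("GEMV", j * w, -1, "l2r", op))
            else:
                # diagonal block -> sequential D-SymGS
                table.append(("D-SymGS", j * w, (i + 1) * w, "r2l", "port2"))
    return table
```

Running it on a small 3×3 block pattern reproduces the structure of the Figure 8 table: every diagonal nonzero block becomes a D-SymGS row and every off-diagonal nonzero block becomes a GEMV row with the operand port chosen by its triangle.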
4.2 Dense Data Path Implementation
Alrescha implements two classes of dense data paths. The first class consists of D-BFS, D-SSSP, D-PR, and GEMV, with straightforward patterns of data dependencies. The second class (e.g., D-SymGS) captures more complicated sequential patterns. This section describes the implementations of the two classes of data paths.
(a) Straightforward data paths: These data paths, such as D-BFS, D-SSSP, D-PR, and GEMV, work on locally-dense blocks of A to capture locality in accesses to the dense vector operand. If the sparse matrix operand includes blocks of non-zero values of size ω, Alrescha splits the vector operand into chunks of size ω, and at
Figure 8: An example of the configuration table for a SymGS kernel, in which n = 9, ω = 3. Each row of the table stores the data path type (GEMV or D-SymGS), Inx_in, Inx_out, the access order (l2r or r2l), and the operand source (x^t or x^{t−1}).
each cycle, it fetches a chunk of the vector from a local cache instead of fetching an individual element of the vector operand. More specifically, a vector operation is applied to ω elements of the vector operand and all the non-zero blocks of A in a row. The approach of Alrescha for running the BFS, SSSP, PR, and SpMV sparse kernels provides two advantages: (i) it guarantees locality of cache accesses (i.e., the values in a cache line are used in succeeding cycles), and (ii) each element of the vector operand is fetched from the cache only once per n/ω, in which n is the number of columns of A.
(b) Data-dependent data paths: SymGS, an example of a sparse kernel with such dependencies, includes a matrix-vector operation with a sparse matrix operand A and three vector operands: A_{jj}, b, and a combination of x^{t−1} and x^t (Equation 2). The three vectors are the diagonal of A, the right-hand side of the Ax = b equations, and a combination of variables of the equations computed in previous iterations (i.e., the initial values) and the variables being computed in the current iteration. Based on Equation 2, calculating an element of x^t includes two inner products, the sizes of which change as we move across the rows of A. To enable using one compute engine for both inner products, Alrescha merges them by factoring out the subtraction:
x_j^t = \frac{1}{A_{jj}^T}\left(b_j - \left(\sum_{i=1}^{j-1} A_{ij}^T \, x_i^t + \sum_{i=j+1}^{n} A_{ij}^T \, x_i^{t-1}\right)\right) \qquad (3)
which consists of a unified dot product of size n, the common operation in both classes of the dense data paths. As a result of the common operation, by transforming Equation 2 into Equation 3, the D-SymGS kernel can utilize the same dot-product engine used by the other dense data paths (i.e., the first class of data paths). The only difference between the dot product of D-SymGS and that in the other data paths is the sources of the inputs and the destination of the output. In detail, one operand of the dot product of D-SymGS is made by shifting x^t into x^{t-1}. This can be implemented by simply rotating the inputs of the multipliers and pushing x_j^t into the first multiplier to be used in calculating x_{j+1}^t.

To clarify the functionality of Alrescha for D-SymGS,
we use a simple example in Figure 10, for which we assume that the size of the problem fits the size of the hardware (i.e., the number of multipliers). As Figure 10 illustrates, Alrescha stores the diagonal of A in a separate vector and does the subtraction with b in parallel with the dot product. Note that separately storing the diagonal of A helps utilize the memory bandwidth only for streaming the operand of the dot-product engine. At the first step, one row of the non-diagonal elements of A is multiplied by x_1^{t-1} to x_3^{t-1}. In the second step, we need x_0^t instead of x_1^{t-1}. However, the newly calculated x_0^t does not take the place of the kicked-out x_1^{t-1}. As a result, we insert the new variables by shifting the old ones to the right. As Figure 10 shows, while reading a row from A at each cycle, the elements belonging to the upper triangle of A are reordered, while the elements in the lower triangle keep their original order. Such an ordering, which is listed in the configuration table of Figure 8, is also reflected in the storage format (details in Figure 13 in §4.5).

Figure 9: (a) The microarchitecture of Alrescha, including the FCU for implementing common computations and the RCU for providing specific configurations for distinct dense data paths; and three example configurations for supporting (b) D-SymGS, (c) GEMV, and (d) PR.
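The merged update of Equation 3 that this example walks through can be sketched in NumPy. This is our illustrative software model of one forward D-SymGS sweep, not the paper's implementation; it ignores blocking and the hardware's operand rotation:

```python
import numpy as np

def symgs_forward_sweep(A, b, x_prev):
    """One forward sweep in the merged form of Equation 3: a single dot
    product over the whole row (diagonal excluded), followed by the
    subtraction from b and the division by the diagonal element."""
    x = x_prev.copy().astype(float)
    n = A.shape[0]
    for j in range(n):
        # already-updated entries contribute x^t, the remaining ones x^{t-1}
        s = A[j, :j] @ x[:j] + A[j, j + 1:] @ x[j + 1:]
        x[j] = (b[j] - s) / A[j, j]
    return x
```

Iterating the sweep on a diagonally dominant system drives Ax toward b, which is how SymGS is used as a smoother inside PCG.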
Figure 10: An example of D-SymGS. At each step, one x_j^t is calculated based on Equation 3. The x_j^t is then used in calculating the next element x_{j+1}^t at the next step. The arrows among the operands of the dot product indicate shifting them one to the right at each step.

4.3 Reconfigurable Microarchitecture

This section introduces the microarchitecture of Alrescha, which accelerates the sparse kernels by executing their fundamental data paths (i.e., D-SymGS, GEMV, D-BFS, D-SSSP, and D-PR). To deliver close-to-ideal performance, the main idea behind the hardware of Alrescha is to have a separate fixed computation unit (FCU) and a reconfigurable computation unit (RCU), and to reconfigure only the latter when switching between data paths (Figure 9). Reconfiguring only a fraction of the entire data path reduces the configuration time.
The FCU streams only the dense blocks of the sparse matrix (i.e., no meta-data) from memory and applies the required vector operation (i.e., phase 1 in Table 1). The FCU includes ALUs and reduce engines (REs) connected in a tree topology, as shown in Figure 9, forming the reduction tree. The interconnections between the REs of the FCU are fixed for all data paths and do not require reconfiguration. The reduction tree is fully pipelined to keep up with the speed of the data streaming from memory. One of the inputs of the ALUs (i.e., the matrix operand) is always streamed from memory, and the other inputs (i.e., the vector operands) come from the RCU. The latter input of the ALU requires a multiplexer because its source might need to change at runtime. For example, only during the initialization of D-SymGS does that input come from the cache (i.e., x^{t-1}); after the initialization, it comes from a processing element in the RCU and a forwarding connection between the inputs of the multipliers. For GEMV, on the other hand, the ALU requires the multiplexers to choose between x^{t-1} and x^t at runtime.

The responsibility of the RCU is to handle the specific data dependencies in different kernels. The RCU includes a local cache, buffers, processing elements (PEs), and a configurable switch, which determines the interconnection between the units in the RCU. The configurable switch is not a novel contribution here; it is implemented similarly to those of FPGAs and is directly controlled by the content of the configuration table (§4.4). The local cache stores the vector operands that require addressable accesses (e.g., x^{t-1}, x^t, and b), whereas the buffers handle vector operands that require deterministic accesses. For instance, we employ first-in-first-out (FIFO) buffers for A_jj and b, and use a last-in-first-out (LIFO) stack for the link buffer. The link buffer establishes transitions between the dense data paths (details in §4.4). For a data-path transition, the reduction tree has to be drained, during which the switch is reconfigured to prepare it for the next data path. Therefore, the latency of configuration is hidden by the latency of draining the adder tree. The PEs of the RCU are implemented with look-up tables (LUTs) to provide multiplication, division, summation, and subtraction. Figures 9b, c, and d illustrate the configuration of the RCU for performing the D-SymGS, GEMV, and D-PR data paths.
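As an illustration of the FCU's fixed tree, here is a tiny software model of a pairwise reduction. It is our sketch, assuming a power-of-two number of inputs (matching the eight ALUs feeding seven REs in Figure 9); the REs support sum and min, as in Table 5:

```python
def reduce_tree(values, op=lambda a, b: a + b):
    """Model of the FCU reduction tree: combine inputs pairwise, level by
    level, in log2(len(values)) steps (sum for GEMV/D-SymGS, min for
    shortest-path-style reductions)."""
    level = list(values)
    assert len(level) and len(level) & (len(level) - 1) == 0, "power-of-two width"
    while len(level) > 1:
        level = [op(level[k], level[k + 1]) for k in range(0, len(level), 2)]
    return level[0]
```

Because every level is independent of the others, the tree pipelines naturally: a new set of ALU outputs can enter level 0 while earlier sets move through the deeper levels.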
4.4 Real-Time Reconfiguration

Algorithms with a sequence of distinct sparse kernels benefit from reconfigurability because they frequently switch between the kernels. As a result, to satisfy high-speed, low-cost reconfigurability, Alrescha implements a lightweight reconfiguration, meaning that only the configuration of a small fraction of the compute engine is changed frequently. Switching between two data-dependent data paths is performed using the link buffer (i.e., implemented as a stack). During runtime, the intermediate results generated by GEMV are pushed onto the stack. Then, the successive D-SymGS pops them off. Figure 11 presents a numerical example of switching from the GEMV to the D-SymGS data path. During steps 1 to 3, the RCU is configured for GEMV, and the results are pushed onto the link stack. At step 3, while the reduction tree (adder tree) is drained, the interconnect between the cache and the FIFOs in the RCU is reconfigured for D-SymGS. This does not affect the pushes onto the stack, so reconfiguration proceeds in parallel and its latency is hidden. In steps 4 to 6, D-SymGS pops the results of the GEMV from the link stack and consumes them.
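A toy model of this link-buffer handoff follows. The function name and the exact division of work between the two phases are our simplifications of Figure 11, not the paper's code; the LIFO pop order loosely mirrors the r2l ordering the configuration table assigns to D-SymGS:

```python
def gemv_then_symgs(gemv_rows, x_prev, diag, b):
    """Handoff through the link buffer: the GEMV data path pushes each
    partial dot product onto a LIFO stack; after the adder tree drains and
    the RCU switch is reprogrammed, D-SymGS pops and consumes them."""
    link_stack = []
    for row in gemv_rows:                  # GEMV phase (steps 1-3)
        link_stack.append(sum(a * x for a, x in zip(row, x_prev)))
    x_new = []
    while link_stack:                      # D-SymGS phase (steps 4-6)
        partial = link_stack.pop()         # LIFO: reverse of push order
        j = len(x_new)
        x_new.append((b[j] - partial) / diag[j])
    return x_new
```

The point of the stack is that the producer never waits for the consumer: all GEMV results are buffered before the data path is switched once, rather than switching per element.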
Figure 11: Switching between the GEMV and D-SymGS data paths using the link stack.

4.5 Storage Format

Depending on the distribution of non-zeros in a sparse matrix, various storage formats may suit it. Figure 12 compares four well-known formats based on meta-data per non-zero. The compressed sparse row (CSR) format, which stores a vector of column indices and a vector of row pointers, locates every non-zero independently. Therefore, CSR is the right choice when the non-zeros do not exhibit any spatial locality. On the other hand, when all the non-zeros are located on diagonals, the diagonal format (DIA) [35], which stores the non-zeros sequentially, can be the best option. An extension of the DIA format, Ellpack-Itpack (ELL) [36], is more flexible when the matrix has a combination of non-diagonal and diagonal elements. For instance, ELL is used for implementing SymGS on GPUs. However, such a format does not provide enough flexibility for parallelizing rows because it does not sustain locality across rows.

Figure 12: The spectrum of storage formats for various types of sparse matrices (low is better).

The choice of storage format should be compatible with the common pattern in a wide range of sparse applications. In such a case, blocked CSR (BCSR) [26], an extension of CSR that assigns the column indices and row pointers to blocks of non-zero values, is the right format. Although BCSR is an appropriate format for scientific applications and graph analytics in terms of storage overhead, the strategy of BCSR for assigning indices and pointers, and the order of its values, is not a good match for smoothly streaming data. In other words, the main requirement for fast computation is the order of operations, which in turn dictates that the data structures be streamed in the same order. Thus, we adapt BCSR and propose a new storage format with the same meta-data overhead but compatible with the reconfigurable microarchitecture of Alrescha.

Figure 13 illustrates our proposed locally-dense format for mapping the example sparse matrix to the physical memory addresses of the accelerator. The Order of Blocks: all the non-diagonal non-zero blocks in a row of blocks are stored together, followed by a diagonal block. The Order of Values: the non-zero values belonging to the upper triangle of the non-diagonal blocks are stored in the opposite order of their original locations in the matrix (see the order of A, B, and C in Figure 13). Accordingly, the difference between the column indices of BCSR and the input indices (i.e., Inxin) of the Alrescha format is shown in Figure 13. The Diagonal Elements: for SymGS, the diagonal of A is excluded and stored separately in a local cache. Therefore, we consider non-square blocks on the diagonal (e.g., 3×4 instead of 3×3) so that the mapping of the non-diagonal elements of that block to the physical memory is adjusted. Meta Data: the indices of the input and output (i.e., Inxin and Inxout) are not streamed from memory during runtime. Instead, they are stored in the configuration table during the one-time programming phase and are used for reconfiguration purposes. As a result, during the iterative execution of the algorithms, the whole available memory bandwidth is utilized only for streaming payload.

Figure 13: Comparing Alrescha with BCSR (for the running example: BCSR col_index {1,7,4,3,7} and row_pointer {1,3,4,6}; Alrescha input_index {7,4,7,3,9} and output_index {1,4,7}). The col_index of BCSR and input_index (i.e., Inxin) of Alrescha are color-coded to show their corresponding blocks in the matrix. Alrescha uses the index of the last column for the input index of diagonal blocks.

Figure 14: Scientific application data sets attained from the University of Florida SuiteSparse collection [34]:

Name | Kind | Rows/Cols | NNZ
2cubes_sphere | Electromagnetic | 101,492 | 1,647,264
atmosmodm | Fluid Dynamics | 1,489,752 | 10,319,760
CoupCons3D | Structural Prob. | 416,800 | 17,277,420
dielFilterV3real | Electromagnetic | 1,102,824 | 89,306,020
HV15R | Fluid Dynamics | 2,017,169 | 283,073,458
ifiss_mat | Fluid Dynamics | 96,307 | 3,599,932
light_in_tissue | Electromagnetics | 29,282 | 406,084
scircuit | Circuit Simul. | 170,998 | 958,936
StocF-1465 | Fluid Dynamics | 1,465,137 | 21,005,389
TEM152078 | Electromagnetics | 152,078 | 6,459,326
Transport | Structural Prob. | 1,602,111 | 23,487,281
windtunnel_evap3d | Fluid Dynamics | 40,816 | 803,978
epb3 | Thermal Prob. | 84,617 | 463,625
Si34H36 | Chemistry Prob. | 97,569 | 5,156,379
GaAsH6 | Chemistry Prob. | 61,349 | 3,381,809
thermomech_TC | Thermal Prob. | 102,158 | 711,558
xenon1 | Materials Prob. | 48,600 | 1,181,120
poisson3Db | Fluid Dynamics | 85,623 | 2,374,949
crystm03 | Materials Prob. | 24,696 | 583,770
ASIC_100k | Circuit Simul. | 99,340 | 954,163
finan512 | Economic Prob. | 74,752 | 596,992
qa8fm | Acoustics Prob. | 66,127 | 1,660,579
mono_500Hz | Acoustics Prob. | 169,410 | 5,036,288
offshore | Electromagnetic | 259,789 | 4,242,673

Table 3: Graph datasets.

Dataset | Rows/Cols | NNZ
com-orkut | 3,072,441 | 234,370,166
hollywood-2009 | 1,139,905 | 1,139,905
kron-g500-logn21 | 2,097,152 | 182,082,942
roadnet-CA | 1,971,281 | 5,533,214
LiveJournal | 4,847,571 | 68,993,773
Youtube | 1,134,890 | 5,975,248
Pokec | 1,632,803 | 30,622,564
sx-stackoverflow | 2,601,977 | 36,233,450
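The block and value ordering of §4.5 can be sketched in NumPy. This is our illustrative encoder; the exact index conventions (last-column input index for diagonal and reversed upper-triangle blocks) are inferred from the Figure 13 example, and it keeps square diagonal blocks for simplicity:

```python
import numpy as np

def to_alrescha_blocks(A, omega):
    """Lay out non-zero omega x omega blocks in the locally-dense order:
    per block row, non-diagonal blocks first (upper-triangle blocks with
    their columns reversed, so values stream in computation order),
    then the diagonal block."""
    m = A.shape[0] // omega
    payload, input_index, output_index = [], [], []
    for i in range(m):
        output_index.append(i * omega)
        row_blocks, diag = [], None
        for j in range(m):
            blk = A[i*omega:(i+1)*omega, j*omega:(j+1)*omega]
            if not np.count_nonzero(blk):
                continue
            if i == j:                 # diagonal block: stored last,
                diag = (blk, (j + 1) * omega - 1)   # indexed by last column
            elif j > i:                # upper triangle: reverse value order
                row_blocks.append((blk[:, ::-1], (j + 1) * omega - 1))
            else:                      # lower triangle: original order
                row_blocks.append((blk, j * omega))
        if diag is not None:
            row_blocks.append(diag)
        for blk, inx in row_blocks:
            payload.append(blk)
            input_index.append(inx)
    return payload, input_index, output_index
```

Because the indices end up in the configuration table rather than interleaved with the values, the payload list can be streamed from memory back-to-back with no meta-data decoding on the critical path.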
5. PERFORMANCE EVALUATION

This section explores the performance of Alrescha by comparing it with the CPU, GPU, and state-of-the-art accelerators for sparse problems. We evaluate Alrescha for both scientific applications and graph analytics. This section first introduces the data sets, algorithms, and experimental setup, and then analyzes execution time and energy consumption.

5.1 Datasets, Algorithms, and Baselines

We pick real-world matrices with applications in scientific and graph problems from the University of Florida SuiteSparse Matrix Collection [34]. The matrices, shown in Figure 14, are our datasets representing scientific applications, including circuit-simulation, electromagnetics, fluid-dynamics, structural, materials, acoustics, economics, and chemistry problems, all of which can be modeled by PDEs. As the figure shows, the non-zero values have various distributions across the matrices. For graph analytics, we choose the eight datasets listed in Table 3, along with their dimensions and numbers of non-zeros (NNZ). We run PCG, which includes the SymGS and SpMV kernels, on the matrices listed in Figure 14, and run the graph algorithms (i.e., BFS, SSSP, and PR) on the matrices in Table 3. We also run SpMV on both categories of datasets.

We compare Alrescha with the CPU and GPU platforms. The configurations of the baseline platforms are listed in Table 4. For the CPU and GPU, we exclude disk-access time. For fair comparisons, we include the necessary optimizations, such as row reordering and suitable storage formats (e.g., ELL), proposed for the CPU and GPU implementations. The PCG algorithm and the graph algorithms running on the GPU are respectively based on the cuSPARSE and Gunrock [37] libraries. The graph algorithms running on the CPU are based on the GridGraph [38] and/or CuSha [39] platforms, whichever achieves better performance for a specific algorithm.

Table 4: Baseline configurations.
GPU baseline — Graphics card: NVIDIA Tesla K40c, 2880 CUDA cores; Architecture: Kepler; Clock frequency: 745 MHz; Memory: 12 GB GDDR5, 288 GB/s; Libraries: Gunrock [37] and cuSPARSE; Optimizations: row reordering (coloring) [8], ELL format.
CPU baseline — Processor: Intel Xeon E5-2630 v3, 8 cores; Clock frequency: 2.4 GHz; Cache: 64 KB L1, 256 KB L2, 20 MB L3; Memory: 128 GB DDR4, 59 GB/s; Platforms: CuSha [39], GridGraph [38].

Table 5: Alrescha configuration.
Floating point: double precision (64 bits); Clock frequency: 2.5 GHz; Cache: 1 KB, 64-byte lines, 4-cycle access latency; RE latency: 3 cycles (sum: 3, min: 1); ALU latency: 3 cycles; Memory: 12 GB GDDR5, 288 GB/s.
Besides the comparison with the CPU and GPU, this section compares Alrescha with state-of-the-art hardware accelerators, including OuterSPACE [18], an accelerator for SpMV; GraphR [24], a ReRAM-based graph accelerator; and a memristive accelerator for scientific problems [25]. To reproduce their latency and power-consumption numbers, we modeled the behavior of the preceding accelerators based on the information provided in the published papers (e.g., the latency of read and write operations for GraphR and the memristive accelerator). We validate our model against the numbers reported for their configurations to make sure our reproduced numbers are never worse than their reported ones. Moreover, for a fair comparison, we assign all the accelerators the same computation and memory-bandwidth budget; this assumption does not harm the performance of our peers.
5.2 Experimental Setup

We convert the raw matrices using Algorithm 1, implemented in Matlab. To do so, we examine block sizes of 8, 16, and 32 for the range of data sets and choose a block size of eight because, unlike the other two, it provides a balance between the opportunity for parallelism and the number of non-zero values. We model the hardware of Alrescha using a cycle-level simulator with the configurations listed in Table 5. The clock frequency is chosen to enable the compute logic to follow the speed of streaming from memory (i.e., each 64-bit operand of an ALU is delivered from memory in 0.4 ns through the 32-bit, 5 Gbps links). To measure energy consumption, we model all the components of the microarchitecture using a TSMC 28 nm standard-cell and SRAM library at 200 MHz. The reported numbers include programming the accelerator.
5.3 Execution Time

Scientific Problems: The primary axis of Figure 15 (i.e., the bars) illustrates the speedup of running PCG on Alrescha over the GPU implementation optimized by row reordering [8] for extracting a high level of parallelism; the secondary axis of Figure 15 shows the bandwidth utilization. The figure also captures the speedup of the memristive-based hardware accelerator [25]. On average, Alrescha provides a 15.6× speedup compared to the optimized implementation on the GPU. The speedup of Alrescha is approximately twice that of the most recent accelerator for solving PDEs. To investigate the reasons behind this observation, we plot memory-bandwidth utilization in Figure 15. As the figure shows, the performance of Alrescha and the other hardware accelerator on all scientific datasets is directly related to memory-bandwidth utilization, mainly because of the nature of sparsity. Moreover, neither fully utilizes the available memory bandwidth because both approaches use blocked storage formats, in which the percentage of non-zero values in a block rarely reaches one hundred percent. Nevertheless, we see that Alrescha better utilizes the bandwidth because it resolves the dependencies in computations, which otherwise limit bandwidth utilization.

To clarify the impact of resolving dependencies on overall performance, Figure 16 presents the percentage of sequential computations in the GPU implementation,
versus that in Alrescha, which has an average of 23.1% sequential operations. As the figure suggests, even in the GPU implementation, which extracts independent parallel operations using row reordering and graph coloring, on average 60.9% of operations are still sequential. This fraction is more than 60% for highly-diagonal matrices and less than 60% for matrices with a greater opportunity for in-row parallelism. Such a trend identifies the distribution of locally-dense blocks as another rationale for the observed speedups. More specifically, when the distribution of non-zero values in the rows of a matrix offers the opportunity for parallelism, the speedup over the GPU is smaller than when the non-zeros are mostly distributed on the diagonal.

Figure 15: Speedup for the PCG algorithm on scientific datasets, normalized to GPU (bars), and bandwidth utilization (lines). The memristive-based accelerator [25] is the state-of-the-art accelerator for scientific problems. (Geometric-mean speedups: 15.6× for Alrescha, 7.3× for the memristive-based accelerator.)

Figure 16: The reduction of the sequential part of the PCG algorithm by applying Alrescha. The baseline shows the percentage of sequential operations with the row-reordering optimization.

Insights: For multi-kernel sparse algorithms with data-dependent computations, Alrescha improves performance by (i) extracting parallelizable data paths, (ii) reordering the dense data paths and the elements in the blocks to maximize reuse of data, and (iii) implementing them in lightweight reconfigurable hardware, all of which result in fast switching not only between the distinct data paths of a single kernel, but also among the sparse kernels.

Graph Analytics and SpMV: This section explores the performance of algorithms consisting of a single type of sparse kernel, which naturally have fewer data-dependency patterns in their computations. This study demonstrates that Alrescha is not optimized solely for a specific domain and is applicable to accelerating a wide range of sparse applications. First, we analyze the performance of graph applications. Figure 17 illustrates the speedup of running BFS, SSSP, and PR on Alrescha, a recent graph hardware accelerator (i.e., based on GraphR [24]), and the GPU, all normalized to the CPU.
Figure 17: Speedup for graph algorithms on graph data sets over the CPU. GraphR [24] is the state-of-the-art graph accelerator.

Figure 18: Speedup for running SpMV on scientific and graph datasets normalized to GPU (bars), and the percentage of execution time devoted to cache accesses (lines). OuterSPACE [18] is the state-of-the-art SpMV accelerator.

Figure 19: Energy-consumption improvement of Alrescha normalized to that of the CPU and GPU.
As the figure shows, Alrescha offers average speedups of 15.7×, 7.7×, and 27.6× for the BFS, SSSP, and PR algorithms, respectively. We achieve this speedup by avoiding the transfer of meta-data, reordering the blocks to increase data reuse, and improving locality.

The primary axis of Figure 18 (i.e., the bars) illustrates the speedup of SpMV, a common algorithm across various sparse applications, on Alrescha and OuterSPACE [18] (i.e., the most recent hardware accelerator for SpMV), normalized to the GPU baseline. As the figure shows, Alrescha offers average speedups of 6.9× and 13.6× for the scientific and graph datasets, respectively. When running SpMV, all the data paths are GEMV; therefore, no transition between data paths is required. However, the optimizations of Alrescha still help achieve greater performance. The key optimization here concerns accesses to the cache, which serve the frequent accesses to the vector operand of SpMV. To show this, the secondary axis of Figure 18 (i.e., the lines) plots the percentage of the whole execution time spent on accesses to the local cache. To perform an SpMV, OuterSPACE reads each element of the vector operand, multiplies it with all the elements in a row of the matrix, accumulates the partial products, and writes them to their corresponding elements of the output vector. Hence, unlike Alrescha, the computation engine of OuterSPACE has to put the partial products in their right locations in the output vector, which may lead to a lack of locality in accesses to the cache.
Insights: For single-kernel sparse problems (e.g., SpMV),Alrescha gains speedup by (i) improving the locality ofcache accesses (i.e., consuming the values in a cacheline in succeeding cycles), (ii) increasing the data reuserate of not only the sparse-matrix operands, but alsothe dense-vector operands, and (iii) avoiding meta-datatransfer and decoding.
5.4 Energy Consumption

A primary motive for using hardware accelerators rather than software optimizations is to reduce energy consumption. To achieve this goal, the techniques integrated into the hardware accelerators have to be efficient. Since a major source of energy consumption is accesses to local SRAM-based buffers or caches, reducing the number of reads and writes from and to the local caches by substituting them with computation is beneficial. Figure 19 illustrates the energy consumption of Alrescha for executing SpMV, normalized to that of the CPU and GPU baselines. As Figure 19 shows, on average, the total energy consumption improves by 74× compared to the CPU and 14× compared to the GPU. Note that the activity of the compute units, defined by the density of the locally-dense blocks, impacts energy but not performance.

Insights: The main reasons for the low energy consumption are the small reconfigurable hardware of Alrescha in combination with a locally-dense storage format whose order of blocks and values matches the order of computation, thus avoiding the decoding of meta-data and reducing the number of accesses to the cache and the memory.
6. CONCLUSION

We proposed Alrescha, a sparse-problem accelerator that quickly reconfigures its data path at runtime to dynamically tune the hardware for distinct computation patterns, enabling the use of high-bandwidth memory at low cost for fast acceleration of sparse problems.
Acknowledgment

We would like to thank the anonymous reviewers and gratefully acknowledge the support of NSF CSR 1526798.
7. REFERENCES
[1] K. Akbudak and C. Aykanat, “Exploiting locality in sparsematrix-matrix multiplication on many-core architectures,”IEEE Transactions on Parallel and Distributed Systems,vol. 28, 2017.
[2] E. Saule, K. Kaya, and Ü. V. Çatalyürek, “Performanceevaluation of sparse matrix multiplication kernels on intelxeon phi,” in International Conference on ParallelProcessing and Applied Mathematics. Springer, 2013.
[3] P. D. Sulatycke and K. Ghose, “Caching-efficientmultithreaded fast multiplication of sparse matrices,” inParallel Processing Symposium, 1998. IPPS/SPDP 1998.Proceedings of the First Merged International... andSymposium on Parallel and Distributed Processing 1998.IEEE, 1998.
[4] S. Dalton, L. Olson, and N. Bell, “Optimizing sparsematrixâĂŤmatrix multiplication for the gpu,” ACMTransactions on Mathematical Software (TOMS), vol. 41,2015.
[5] F. Gremse, A. Hofter, L. O. Schwen, F. Kiessling, andU. Naumann, “Gpu-accelerated sparse matrix-matrixmultiplication by iterative row merging,” SIAM Journal onScientific Computing, vol. 37, 2015.
[6] W. Liu and B. Vinter, “An efficient gpu general sparsematrix-matrix multiplication for irregular data,” in Paralleland Distributed Processing Symposium, 2014 IEEE 28thInternational. IEEE, 2014.
[7] K. Matam, S. R. K. B. Indarapu, and K. Kothapalli, “Sparsematrix-matrix multiplication on modern architectures,” inHiPC, 2012 19th International Conference on. IEEE, 2012.
[8] E. Phillips and M. Fatica, “A cuda implementation of thehigh performance conjugate gradient benchmark,” in PMBS.Springer, 2014.
[9] W. Liu and B. Vinter, “A framework for general sparsematrix–matrix multiplication on gpus and heterogeneousprocessors,” Journal of Parallel and Distributed Computing,vol. 85, 2015.
[10] B. Asgari, R. Hadidi, H. Kim, and S. Yalamanchili,“Lodestar: Creating locally-dense cnns for efficient inferenceon systolic arrays,” in DAC. ACM, 2019.
[11] C. Y. Lin, N. Wong, and H. K.-H. So, “Design spaceexploration for sparse matrix-matrix multiplication onfpgas,” International Journal of Circuit Theory andApplications, vol. 41, 2013.
[12] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, andF. Franchetti, “Accelerating sparse matrix-matrixmultiplication with 3d-stacked logic-in-memory hardware,”in HPEC, 2013 IEEE. IEEE, 2013.
[13] B. Asgari, R. Hadidi, H. Kim, and S. Yalamanchili,“Eridanus: Efficiently running inference of dnns usingsystolic arrays,” IEEE Micro, 2019.
[14] B. Asgari, R. Hadidi, and H. Kim, “Ascella: Acceleratingsparse computation byenabling stream accesses to memory,”in DATE. IEEE, 2020.
[15] A. K. Mishra, E. Nurvitadhi, G. Venkatesh, J. Pearce, andD. Marr, “Fine-grained accelerators for sparse machinelearning workloads,” in ASP-DAC. IEEE, 2017.
[16] E. Nurvitadhi, A. Mishra, and D. Marr, “A sparse matrixvector multiply accelerator for support vector machine,” inProceedings of the 2015 International Conference onCompilers, Architecture and Synthesis for EmbeddedSystems. IEEE Press, 2015.
[17] U. Gupta, B. Reagen, L. Pentecost, M. Donato, T. Tambe,A. M. Rush, G.-Y. Wei, and D. Brooks, “Masr: A modularaccelerator for sparse rnns,” in PACT. IEEE, 2019.
[18] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng,C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, andR. Dreslinski, “Outerspace: An outer product based sparsematrix multiplication accelerator,” in HPCA. IEEE, 2018.
[19] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “Extensor: An accelerator for sparse tensor algebra,” in MICRO. ACM, 2019.
[20] K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi, S. Koppula, N. M. Ghiasi, T. Shahroodi, J. G. Luna, and O. Mutlu, “Smash: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations,” in MICRO. ACM, 2019.
[21] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics,” in MICRO. IEEE, 2016.
[22] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” ACM SIGARCH Computer Architecture News, vol. 43, 2016.
[23] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “Graphpim: Enabling instruction-level pim offloading in graph computing frameworks,” in HPCA. IEEE, 2017.
[24] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Graphr: Accelerating graph processing using reram,” in HPCA. IEEE, 2018.
[25] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in ISCA. IEEE, 2018.
[26] R. W. Vuduc and H.-J. Moon, “Fast sparse matrix-vector multiplication by exploiting variable block structure,” in HPCC. Springer, 2005.
[27] J. Dongarra, M. A. Heroux, and P. Luszczek, “Hpcg benchmark: a new metric for ranking high performance computing systems,” Knoxville, Tennessee, 2015.
[28] D. Ruiz, F. Mantovani, M. Casas, J. Labarta, and F. Spiga, “The hpcg benchmark: analysis, shared memory preliminary improvements and evaluation on an arm-based platform,” 2018.
[29] V. Marjanović, J. Gracia, and C. W. Glass, “Performance modeling of the hpcg benchmark,” in PMBS. Springer, 2014.
[30] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3.
[31] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010.
[32] Y. Huang, N. Guo, M. Seok, Y. Tsividis, and S. Sethumadhavan, “Analog computing in a modern context: A linear algebra accelerator case study,” IEEE Micro, vol. 37, 2017.
[33] Y. Huang, N. Guo, M. Seok, Y. Tsividis, K. Mandli, and S. Sethumadhavan, “Hybrid analog-digital solution of nonlinear partial differential equations,” in MICRO. IEEE, 2017.
[34] T. A. Davis and Y. Hu, “The university of florida sparse matrix collection,” ACM TOMS, vol. 38, 2011.
[35] Y. Saad, Iterative methods for sparse linear systems. SIAM, 2003, vol. 82.
[36] D. R. Kincaid, T. C. Oppe, and D. M. Young, “Itpackv 2d user’s guide,” Texas Univ., Austin, TX (USA). Center for Numerical Analysis, Tech. Rep., 1989.
[37] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, “Gunrock: A high-performance graph processing library on the gpu,” in ACM SIGPLAN Notices, vol. 51, no. 8. ACM, 2016.
[38] X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning,” in USENIX ATC, 2015.
[39] F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan, “Cusha: vertex-centric graph processing on gpus,” in HPDC. ACM, 2014.