HPCA 2020 Unofficial Author’s Copy
ALRESCHA: A Lightweight Reconfigurable Sparse-Computation Accelerator
Bahar Asgari, Ramyad Hadidi, Tushar Krishna, Hyesoon Kim, Sudhakar Yalamanchili
Georgia Institute of Technology, Atlanta, GA
{bahar.asgari, rhadidi, tushar, hyesoon.kim, sudha}@gatech.edu
ABSTRACT
Sparse problems that dominate a wide range of applications fail to effectively benefit from the high memory bandwidth and concurrent computations of modern high-performance computer systems. Therefore, hardware accelerators have been proposed to capture a high degree of parallelism in sparse problems. However, the unexplored challenge for sparse problems is the limited opportunity for parallelism because of data dependencies, a common computation pattern in scientific sparse problems. Our key insight is to extract parallelism by mathematically transforming the computations into equivalent forms. The transformation breaks down the sparse kernels into a majority of independent parts and a minority of data-dependent ones and reorders these parts to gain performance. To implement the key insight, we propose a lightweight reconfigurable sparse-computation accelerator (Alrescha). To efficiently run the data-dependent and parallel parts and to enable fast switching between them, Alrescha makes two contributions. First, it implements a compute engine with a fixed compute unit for the parallel parts and a lightweight reconfigurable engine for the execution of the data-dependent parts. Second, Alrescha benefits from a locally-dense storage format, with the right order of non-zero values to yield the order of computations dictated by the transformation. The combination of the lightweight reconfigurable hardware and the storage format enables uninterrupted streaming from memory. Our simulation results show that compared to a GPU, Alrescha achieves an average speedup of 15.6× for scientific sparse problems, and 8× for graph algorithms. Moreover, compared to the GPU, Alrescha consumes 14× less energy.
1. INTRODUCTION
Sparse problems have become popular in a wide range of applications, from scientific problems to graph analytics. Since traditional high-performance computing systems fail to effectively provide high bandwidth for sparse problems, several studies have advocated software optimizations for CPUs [1, 2, 3], GPUs [4, 5, 6, 7, 8], and CPU-GPU systems [9]. As the effectiveness issue has coupled with approaching the end of Moore's law, specialized hardware for sparse problems has become attractive. For instance, hardware accelerators have been proposed for sparse matrix-matrix multiplication [10, 11, 12, 13], matrix-vector multiplication [14, 15, 16, 17], or both [18, 19, 20]. In addition, accelerators for sparse problems have been proposed to reduce memory-access latency or improve energy efficiency [21, 22, 23, 24, 25]. Further, approaches such as blocking [26, 25, 13] have been used to reduce indirect memory accesses.
The aforementioned hardware accelerators and software-optimization techniques for sparse problems often focus on a specific application domain and take advantage of a specific computation pattern to improve performance (more details in Table 2). However, flexibility in the range of target applications is an important feature for a hardware accelerator. Such flexibility is not just for creating more generic accelerators, but for accelerating all the different kernels in a program to effectively improve the overall performance. In fact, unlike the assumptions of prior work, sparse problems may comprise two groups of kernels with contradictory features: (i) highly parallelizable kernels and (ii) data-dependent kernels. In such a case, the challenge is that while a sparse problem requires high bandwidth and a high level of concurrency, the dependent computations prevent benefiting from the available memory bandwidth and the high level of concurrency. In other words, the need for high bandwidth and the limited opportunity for parallelism are two contradictory attributes, which challenge performance optimization. Such sparse problems have become a major computation pattern in many fields of science. For instance, the high-performance conjugate gradient (HPCG) [27] benchmark is now a complement to the high-performance Linpack (HPL).
To clarify the challenge, Figure 1a shows the particular data-dependency pattern in scientific sparse problems that causes a performance bottleneck (details in §2). As the pseudo code shows, for col < row, the computation of x[row] must wait for the previous elements of x to be done. As a result, processing each row of the matrix depends on the result of processing the previous row, and the rows cannot be processed in parallel. Therefore, as Figure 1b shows, typically, the rows of the matrix are processed sequentially, even though processing an individual row (i.e., a dot product) can be parallelized. To extract more parallelism, code optimizations such as row reordering or matrix coloring [8] have been proposed to capture and run independent rows in parallel. However, the effectiveness of such high-level methods (i.e., at the granularity of instructions) depends on the distribution of non-zero values in a matrix (e.g., a specific problem may not have independent rows).
The Key Insight: Our observation to resolve the challenge is that the data-dependent operations rely only on a fraction of the results from the previous operations. Thus, the key insight is to extract parallelism by mathematically transforming the computation into equivalent forms. Such transformations allow us to break down the
[Figure 1 spans this area.
(a) An example of a sparse problem with challenging parallelism:
  for row = 0 to number_of_rows
    for j = 0 to nonzeros_in_row[row]
      if (col != row)
        sum += A_val[j] * x[A_col[j]]
    x[row] = sum / A_val[row]
  // if col < row, the computation must wait for x[col] to be updated.
(b) Approach 1: rows must be processed sequentially — a sequence of data-dependent computations (dot product 1, update; dot product 2, update; dot product 3; ...). By excluding the dependency-prone part, the rest can run in parallel.
(c) Approach 2 (ALRESCHA): rows can be processed in parallel. (1) Parallelizable dot products; a locally-dense storage format is required to provide the right blocks of matrix A in the right order. (2) Parallelizable dot products on small various-size operands. (3) Sequential operations on small operands; their latency can be optimized by fast reconfigurable hardware to avoid a latency bottleneck. (2) and (3) also require fine-grained ordering of values in the blocks of the storage format.]
Figure 1: An example of sparse problems with data-dependency patterns in its computations.

traditionally dependent parts into a majority of parallel and a minority of dependent operations. To do so, we exclude the dependency-prone section of the matrix and run the rest in parallel (i.e., we convert Figure 1b to Figure 1c). To implement the key insight, we propose a lightweight reconfigurable sparse-computation accelerator (Alrescha¹), which makes three key contributions:
• Lightweight reconfigurability: After transforming the computations, while the majority of operations will be highly parallelizable ((1) and (2)), we still need to execute a few data-dependent computations (3), albeit on small operands. During runtime, the application repetitively switches between the three mentioned groups of operations (i.e., between (1) and (2)/(3)). The switching itself must be fast enough to prevent a bottleneck. To this end, the compute engine of Alrescha implements quick reconfiguration during runtime by integrating a fixed computation unit for (1) and a small reconfigurable computation unit for (2) and (3).
• Locally-dense storage format: Since Alrescha reorders the computations to process the parallel parts together (1), we propose a storage format in which the order of non-zero values matches the order of computations. The storage format enables streaming the ordered blocks of matrix A with the right timing. Besides, such a storage format facilitates (2) and (3) by dictating the required element-wise ordering. Alrescha integrates the proposed storage format with a data-driven execution model to eliminate the transferring and decoding of meta-data.
• Generic sparse accelerator: The ability of Alrescha to accelerate distinct kernels makes it the
¹ A binary star, the two stars of which orbit one another.
Preconditioned conjugate gradient (PCG):
p(0) = x; r(0) = b - A p(0);
for i = 1 to m
  z = MG(r(i-1));
  a(i) = dot_prod(r, z);
  if i = 1 then p = z;
  else p = a(i)/a(i-1) * p + z;
  a(i) = a(i) / dot_prod(p, Ap);
  x(i+1) = x(i) + a(i) p(i);
  r(i) = r(i-1) - a(i) Ap(i);
MG(depth):
if (depth < 3)
  SymGS(); SpMV(); MG(depth++); SymGS();
else
  SymGS();
for row = 0 to rows
  sum = b[row]
  for j = 0 to nnz_in_row[row]
    col = A_col[j]; val = A_val[j];
    if (col != row)
      sum = sum - val * x[col];
  x[row] = sum / A_diag[row];
Figure 2: An example of the PCG algorithm [27] for solving a sparse linear system of equations (i.e., Ax = b), including SpMV and SymGS.
first generic hardware accelerator for both scientific and graph applications, regardless of whether they have data-dependent compute patterns. For applications with no distinct kernels, Alrescha still sustains high performance by integrating the proposed storage format and the execution model, hence avoiding meta-data transfer.
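The key insight above can be made concrete with a small sketch. The code below is our own illustration (not the paper's implementation): a row-by-row triangular-solve-like update, first written naively so that every row waits on all previous rows, and then decomposed into independent block GEMVs plus small sequential diagonal parts, as in Figure 1c. The block width `w` and all helper names are ours.

```python
import numpy as np

def naive_sequential(A, b):
    """Row-by-row update: x[row] depends on all earlier entries of x."""
    n = len(b)
    x = np.zeros(n)
    for row in range(n):
        s = b[row] - A[row, :row] @ x[:row]   # waits on previous rows
        x[row] = s / A[row, row]
    return x

def block_decomposed(A, b, w):
    """Same result: off-diagonal blocks become independent GEMVs;
    only the small w-by-w diagonal blocks stay sequential."""
    n = len(b)
    x = np.zeros(n)
    for bi in range(0, n, w):
        s = b[bi:bi + w].copy()
        # (1) parallelizable part: block GEMVs on already-computed x
        for bj in range(0, bi, w):
            s -= A[bi:bi + w, bj:bj + w] @ x[bj:bj + w]
        # (2)/(3) small data-dependent part, confined to one block
        for r in range(w):
            row = bi + r
            s[r] -= A[row, bi:row] @ x[bi:row]
            x[row] = s[r] / A[row, row]
    return x
```

The off-diagonal block products in the outer loop have no mutual dependencies, so they can run concurrently; only the short inner loop per diagonal block remains serial.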
Alrescha accelerates various sparse kernels such as sparse matrix-vector multiplication (SpMV), the symmetric Gauss-Seidel (SymGS) smoother, PageRank (PR), breadth-first search (BFS), and single-source shortest path (SSSP). To evaluate Alrescha, we target a wide range of data sets with various sizes. Comparing Alrescha with a CPU and a GPU platform shows that Alrescha achieves an average speedup of 15.6× for scientific sparse problems and 8× for graph algorithms. Moreover, compared to a GPU, Alrescha consumes 14× less energy. We also compare Alrescha with state-of-the-art hardware accelerators for sparse problems, namely OuterSPACE [18], an accelerator for SpMV; GraphR [24]; and a memristive accelerator for scientific problems [25]. Our experimental results on various data sets show that, compared to the state of the art, Alrescha achieves an average speedup of 2.1×, 1.87×, and 1.7× for scientific algorithms, graph analytics, and SpMV, respectively. The performance gain on data sets with various distributions of non-zero values indicates that the benefits of Alrescha are independent of specific patterns in sparse data.
2. BACKGROUND
This section reviews the background and the characteristics of two key sparse problems as follows.
Scientific Problems: Many physical-world phenomena, such as sound, heat, elasticity, fluid dynamics, and quantum mechanics, are modeled with partial differential equations (PDEs). To numerically process and solve them via digital computers, PDEs are discretized to a 3D grid (e.g., using 27-stencil discretization), which is then converted to a linear system of algebraic equations, Ax = b, in which A is the coefficient matrix, often very large for two- or higher-dimensional problems (e.g., elliptic, parabolic, or hyperbolic PDEs). Such a system of linear equations, with a symmetric positive-definite matrix, can be solved by iterative algorithms such as conjugate gradient (CG) methods (e.g., preconditioned CG (PCG), which ensures fast convergence by preconditioning). These methods are specifically useful for
solving sparse systems that are too large to be solvedby direct methods.
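As a minimal sketch of the iterative method described above, the following is a plain (unpreconditioned) conjugate gradient solver in numpy. It is our own illustration of the CG family, not the paper's PCG, which additionally applies a multigrid preconditioner (the MG step of Figure 2); all names are ours.

```python
import numpy as np

def cg(A, b, iters=50, tol=1e-10):
    """Unpreconditioned CG for Ax = b, A symmetric positive-definite."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:             # squared residual norm small enough
            break
        p = r + (rs_new / rs) * p    # new A-conjugate direction
        rs = rs_new
    return x
```

Each iteration is dominated by one matrix-vector product (`A @ p`) plus a few dot products, which is why SpMV-like kernels dominate the runtime of CG-family solvers.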
Figure 3: The breakdown of the execution time of PCG on NVIDIA K20: SymGS 63%, SpMV 31%, with the dot product and other kernels accounting for the remainder.
An example of the PCG algorithm for solving Ax = b is shown in Figure 2 [27]. The algorithm updates the vector x in m iterations. As Figure 3 shows, the execution time of the algorithm is dominated by two kernels, SpMV and SymGS [28, 8, 29]. The remaining kernels, such as the dot product, consume only a tiny fraction of the execution time and are so ubiquitous that they are executed using special hardware in some supercomputers.
To explore the characteristics of SpMV and SymGS,
we use an example of applying them to two operands, a vector (b_{1×m}) and a matrix (A_{m×n})¹. Applying SpMV to the two operands results in a vector (x_{1×n}), each element of which can be calculated as:
x_j = Σ_{i=1}^{k} b[AT_ind_i] × AT_val_{ij},   (1)
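Equation 1 can be sketched in plain Python: each output element is a dot product between the nonzeros of one column and the elements of b they index. The CSC-style arrays (`ptr`, `ind`, `val`) below are our own framing of the per-column storage, not the paper's format.

```python
import numpy as np

def spmv_columnwise(n_cols, ind, val, ptr, b):
    """Equation 1 sketch: for output element j, val[ptr[j]:ptr[j+1]]
    holds the nonzeros and ind[...] the positions of b they pair with
    (our own naming). Each x_j is independent, hence parallelizable."""
    x = np.zeros(n_cols)
    for j in range(n_cols):
        lo, hi = ptr[j], ptr[j + 1]
        x[j] = b[ind[lo:hi]] @ val[lo:hi]   # gather, then dot product
    return x
```

Because no x_j reads any other x_j, all n iterations of the loop could run concurrently, which is the parallelism potential of SpMV noted below.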
in which k, AT_val, and AT_ind are the number of non-zero values, the non-zero values themselves, and the row indices of the jth column of A^T, respectively. Figure 4a shows a visualization of Equation 1. Since the elements of the output vector can be calculated independently, SpMV has the potential for parallelism. On the other hand, each element of the vector that results from applying SymGS to the same two operands (i.e., vector b_{1×m} and matrix A_{m×n}) is calculated as follows, based on the Gauss-Seidel method [30]:
x_j^t = (1 / A^T_{jj}) (b_j − Σ_{i=1}^{j−1} A^T_{ij} × x_i^t − Σ_{i=j+1}^{n} A^T_{ij} × x_i^{t−1}).   (2)
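A direct transliteration of Equation 2 makes the dependency visible: updating x_j^t reads the freshly updated entries x^t_{1..j−1} of the same sweep, so the rows cannot be processed independently. This is a dense sketch with our own variable names, not the accelerator's implementation.

```python
import numpy as np

def symgs_sweep(AT, b, x_prev):
    """One forward Gauss-Seidel sweep per Equation 2: x_t[j] uses the
    already-updated entries x_t[:j] and the previous-iteration entries
    x_prev[j+1:], so the j-loop is inherently sequential."""
    n = len(b)
    x_t = x_prev.copy()   # entries j+1..n still hold iteration t-1 values
    for j in range(n):
        s = (AT[:j, j] @ x_t[:j]) + (AT[j + 1:, j] @ x_t[j + 1:])
        x_t[j] = (b[j] - s) / AT[j, j]
    return x_t
```

Repeating the sweep on a diagonally dominant system drives the residual toward zero, which is how the smoother is used inside multigrid.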
Figure 4: Calculation of (a) x_j in SpMV, and (b) x_j^t in SymGS.
Figure 4b illustrates a visualization of Equation 2 (i.e., the blue vectors correspond to Σ_{i=1}^{j−1} A^T_{ij} × x_i^t and the red vectors correspond to Σ_{i=j+1}^{n} A^T_{ij} × x_i^{t−1}). In fact, calculating the jth element of x at iteration t (i.e., the orange element of x^t in Figure 4b) depends not only on the values of x at iteration t−1 (i.e., the red elements of x^{t−1}), but also on the values of x^t that are being calculated in the current iteration (i.e., the blue elements of x^t). Such dependencies in the SymGS kernel limit the parallelism opportunity. Although some optimization strategies have been proposed for employing parallelism [8], the SymGS kernel can still be a performance bottleneck.
Graph Analytics: A common approach to represent
graphs is to use an adjacency matrix, each element of
¹ In the rest of the paper, all matrices A refer to this matrix.
Figure 5: An example graph, its adjacency sparse matrices, and the vector operands of two graph algorithms: (a) SSSP and (b) PR. (For SSSP, the path-length vector from A is initialized to [0, ∞, ∞, ∞] and finalizes to [0, 1, 2, 3]; for PR, the initial rank vector [0.25, 0.25, 0.25, 0.25], with out-degree vector [3, 1, 1, 2], finalizes to [0.83, 0.83, 0.37, 0.67].)
which represents an edge in the graph (Figure 5 illustrates an example of such a representation). Since in many applications the graph is sparse, the equivalent adjacency matrix includes several zeros. Graph algorithms traverse vertices and edges to compute a set of properties based on the connectivity relationships. Traversing is implemented as a form of dense-vector sparse-matrix operation. Such implementations are suited to the vertex-centric programming model [31], which is preferred to the edge-centric model. The vertex-centric model divides a graph algorithm into three phases. In the first phase, all the edges from a vertex (i.e., a row of the adjacency matrix) are processed. This process is a vector-vector operation between the row of the matrix and a property vector, which varies based on the algorithm. In the second phase, the output vector from the first phase is reduced by a reduction operation (e.g., sum). In the final phase, the result is assigned to its destination.
Three widely used graph algorithms are BFS, SSSP, and PR. Figure 5 shows an example graph, the adjacency matrices, and the vector operands for the SSSP and PR algorithms. For SSSP (Figure 5a), the vector containing the path lengths from node A is updated iteratively by multiplying a row of the matrix by the path-length vector and then choosing the minimum of the result vector. After traversing all the nodes, the final values of the vector indicate the shortest paths from node A to all the other nodes. PR (Figure 5b) iteratively updates the rank vector, initialized with equal values. At each iteration, the elements of the rank vector are divided by the elements of the out-degree vector (i.e., the number of out-going edges for each vertex), chosen by a row of the matrix, and the result vector is reduced to a single rank by adding the elements of the vector.
Common Features: While the sparse kernels used
in both scientific and graph applications are similar in having sparse matrix operands, some kernels (e.g., SpMV) exhibit more concurrency, whereas others (e.g., SymGS) have several data dependencies in their computations. Regardless of this difference, a common property of sparse kernels is that the reuse distance of accesses to the sparse matrix is high, while the input and output vectors of these kernels are reused frequently. Moreover, the accesses to at least one of the vectors are often irregular. The other, and more important, common feature of these kernels is that they follow three phases of operations iteratively (i.e., vector operation, reduce, and assign). Table 1 summarizes these phases for the main sparse kernels, as well as the operands and the operations at each phase. The sparse kernels calculate an element of their result by accessing a row/column of
Table 1: The properties of sparse kernels and corresponding dense data paths, implemented in Alrescha. Depending on the type of kernel, the operation in phase 1 can use the three vector operands at the same time or use just two of them.

Sparse Kernel | Application | Dense Data Paths | Phase 1 operand1 | Phase 1 operand2 | Phase 1 operand3 | Phase 1 operation | Phase 2 (reduce) | Phase 3 (assign)
SymGS | PDE solving | D-SymGS/GEMV | a row of the coefficient matrix | the vector from iteration (i-1) | the vector at iteration (i) | multiplication | sum | apply operation with A^T_{jj} and b_j and update vector
SpMV | PDE solving and graph | GEMV | a row of the coefficient matrix | the vector from iteration (i-1) | N/A | multiplication | sum | sum and update the vector
PageRank | Graph | D-PR | a column of the adjacency matrix | the out-degree vector of vertices | the rank vector at iteration (i-1) | AND/division | sum | rank vector update
BFS | Graph | D-BFS | a column of the adjacency matrix | frontier vector | N/A | sum | min | compare and update distance vector
SSSP | Graph | D-SSSP | a column of the adjacency matrix | frontier vector | N/A | sum | min | compare and update distance vector
the large sparse matrix only once and then reusing one or two vectors for the calculation of all output vector elements. We benefit from these common features to design an accelerator that is flexible enough to run all the mentioned sparse kernels without significant overhead. Alrescha converts the sparse kernels into the dense data paths listed in the second column of Table 1 (details in §4.1).
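The three-phase pattern of Table 1 can be sketched generically: phase 1 applies an element-wise operation between a matrix row/column and a property vector, phase 2 reduces the result, and phase 3 assigns it. Below is our own plain-Python framing, instantiated for SSSP; swapping the three callables (e.g., multiply/sum/update) would give the other kernels. All names and the example graph are ours, not the paper's.

```python
def three_phase_step(slice_, prop, vec_op, reduce_op, assign, j):
    """Phase 1: vec_op over nonzeros; phase 2: reduce_op; phase 3: assign."""
    candidates = [vec_op(a, p) for a, p in zip(slice_, prop) if a != 0]
    if candidates:
        assign(j, reduce_op(candidates))

def sssp_iteration(adj, dist):
    """One relaxation pass: dist[j] = min over in-edges of (dist[i] + w)."""
    new = dist[:]
    def assign(j, val):
        new[j] = min(new[j], val)          # phase 3: compare and update
    n = len(dist)
    for j in range(n):
        col = [adj[i][j] for i in range(n)]  # edges into vertex j
        three_phase_step(col, dist, lambda w, d: w + d, min, assign, j)
    return new
```

Repeating `sssp_iteration` until the distance vector stops changing yields the shortest paths from the source, mirroring the SSSP description above.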
3. MOTIVATION AND RELATED WORK
Sparse problems in either sparse (i.e., non-compressed) or compressed representation face many performance challenges. The sparse formats are less efficient because they require storing, transferring, and processing zero values. On the other hand, even the highly preferred compressed representations still rely on transferring meta-data and random memory accesses that limit performance by memory-access latency. Besides, as the ratio of compute per data transfer of many sparse problems (i.e., those including vector-matrix operations) is low, we would normally expect their performance to be directly related to the memory bandwidth. However, gaining higher performance for sparse problems is not as straightforward as adding more memory bandwidth. This section sheds more light on the challenges that sparse problems – specifically those with data-dependent computations – face when being executed on modern CPUs, GPUs, or even on hardware accelerators.
Ineffectiveness of CPUs and GPUs: To date, many software-level optimizations for CPUs [1, 2, 3], GPUs [4, 5, 6, 7, 8], and CPU-GPU systems [9] have been proposed. However, software optimizations alone cannot effectively handle irregular data-dependent memory references, a main characteristic of sparse problems. Irregular data-dependent memory references limit the
Figure 6: The performance of modern computing platforms ranked by the standard metric of the HPCG [27] benchmark on GPUs and CPUs (Pflops/sec and fraction of peak (%), for platforms including NVIDIA Volta V100 (Summit), NVIDIA Tesla V100 (Sierra), NVIDIA Tesla P100, K80, K40, and K20x, Intel Xeon Phi 7250 68C, Intel Xeon Platinum 8160 24C, Intel Xeon Platinum 8174 24C, Intel Xeon E5-2670 12C, and Intel Xeon E5-2680-V3).
reach of software schemes and thus lead to poor performance due to degraded bandwidth utilization. Furthermore, optimizations for extracting more parallelism and bandwidth, such as matrix coloring [8] and blocking [26], have not been effective enough for the aforementioned reason. Figure 6 summarizes the performance of running sparse scientific applications on a range of CPUs and GPUs. As the figure shows, they utilize only a tiny fraction of the peak performance.
Prior Hardware Accelerators: The ineffectiveness
of CPUs and GPUs, along with approaching the end of Moore's law, has motivated the migration to specialized hardware for sparse problems. For instance, hardware accelerators have targeted sparse matrix-matrix multiplication [10, 11, 12, 13], matrix-vector multiplication [14, 15, 16, 17], or both [18, 19, 20], which are the main sparse kernels in many sparse problems. A state-of-the-art SpMV accelerator, OuterSPACE [18], employs an outer-product algorithm to minimize redundant accesses to the non-zero values of the sparse matrix. Despite the speedup of OuterSPACE over traditional SpMV, achieved by increasing the data-reuse rate and reducing memory accesses, it produces random accesses to a local cache. To efficiently utilize memory bandwidth, Hegde et al. proposed ExTensor [19], a novel fetching mechanism that avoids the memory-latency overhead associated with sparse kernels. Song et al. proposed GraphR [24], a graph accelerator, and Feinberg et al. proposed a scientific-problem accelerator [25], both of which process blocks of non-zero values instead of individual ones. Besides, Huang et al. have proposed analog [32] and hybrid (analog-digital) [33] accelerators for solving PDEs.
The prior specialized hardware designs often have not focused on resolving the challenge of data-dependent computations in sparse problems, which prevents benefiting from the available memory bandwidth. Sparse problems may involve a combination of contradictory parallelizable and data-dependent kernels. The flexibility to accelerate both types of kernels is necessary not only to improve the overall performance but also to be more generic. Table 2 compares the most relevant hardware approaches and techniques for accelerating sparse problems with Alrescha. As the table lists, Alrescha is the first reconfigurable sparse-problem accelerator for both scientific and graph applications that supports multi-kernel execution and resolves the limited parallelism at fine granularity. Besides, unlike prior work, Alrescha reduces the number of accesses to the vector operand of the sparse matrix-vector operations.
Table 2: Comparing the state-of-the-art accelerators for sparse kernels.

                                | GraphR [24] | OuterSPACE [18] | Memristive-Based Accelerator [25] | Row Reordering / Matrix Coloring [8] | Alrescha (our work)
Application Domain              | Graph | Graph (only SpMV) | PDE solver | PDE solver | Graph and PDE solver
Multi-Kernel Support            | No | No | No | No | Yes
BW Utilization                  | Low | Moderate | Low | Moderate | High
NOT Transferring Meta-data      | No | No | No | No | Yes
Processing Type                 | ReRAM crossbar | PEs connected in a high-speed crossbar | heterogeneous memristive crossbar | GPU instruction | Fixed vector processor and a small reconfigurable switch
Cache Optimizations for Frequently-Used Vectors | N/A | No | N/A | No | Yes
Reconfigurability               | No | Only for cache hierarchy | No | N/A | Yes
Storage Format                  | 4×4 COO | CSR | multi-size blocks (64×64, 128×128, 256×256, 512×512) | ELL | 8×8 blocking with fine-grained in-block ordering
Resolving Limited Parallelism   | N/A | N/A | No | Yes (instruction-level, limited by sparsity pattern) | Yes
4. ALRESCHA
Alrescha is a memory-mapped accelerator. The memory dedicated to Alrescha is also accessible by a host for programming. Figure 7 shows an overview of Alrescha, the host, and the connections for programming and transferring data. The programming model of Alrescha is similar to offloading computations from a CPU to a GPU (the programming model is beyond the scope of this paper). To program the accelerator, the host launches the sparse kernels of sparse algorithms (i.e., PCG and graph algorithms) to the accelerator. To do so, the host first converts the sparse kernels into a sequence of dense data paths and generates a binary file. Then, the host writes the binary file to a configuration table of the accelerator through the program interface. The details of the conversion mechanism and the dense data path implementation are explained in §4.1 and §4.2.
During the execution of an algorithm, repetitive switching between the dense data paths is required. The key feature of Alrescha that enables fast switching among those dense data paths is its runtime reconfigurability. The details of the reconfigurable microarchitecture of Alrescha and the mechanism of real-time reconfiguration are explained in §4.3 and §4.4. Besides switching among the data paths during runtime, Alrescha also reorders the dense data paths to reduce the number of switches. Such a reordering necessitates a new storage format, which is introduced in §4.5. Therefore, the other task of the host is to reformat the sparse matrix operands into a storage format consisting of blocks, each of which corresponds to a dense data path. The formatted data is written into the physical memory space of the accelerator through the data interface (Figure 7).
Since the target algorithms are iterative, the preprocessing (i.e., conversion and reformatting) is a one-time overhead. Besides, the complexity and effort of preprocessing depend on the previous format, the data source, and the host platform. For instance, the conversion complexity from frequently-used storage formats (e.g., CSR and
Figure 7: The overview of Alrescha and the host. The host holds a program including the SymGS(), SpMV(), BFS(), SSSP(), and PR() sparse kernels and produces (i) a binary file with a sequence of dense data paths (i.e., GEMV, D-SymGS, D-BFS, D-SSSP, and D-PR), written to the configuration table of Alrescha through the program interface, and (ii) the matrix operand in the Alrescha storage format, written to the memory of Alrescha through the data interface.
BCSR) is linear in time and requires constant space. Since the preprocessing complexity is linear, it can be done while data streams from memory. Moreover, if data is generated in the system (e.g., through sensors), it is initially formatted in the Alrescha format and reformatting is not required.
4.1 Kernel to Data Path Conversion
Algorithm 1 shows the procedure for converting a sparse matrix to dense data paths. Based on the kernel type, the sparse matrix operand (A_{n×n}), and the dimension of the non-zero blocks in the matrix operand (ω), the algorithm generates the configuration table, each row of which specifies the type of data path and information about its operands. More specifically, for each dense data path, besides its type (i.e., DP), denoted by one bit, the index of the input vector operand (Inx_in), the index of the output vector (Inx_out), the access order, which can be left to right (l2r) or right to left (r2l), and the source of the operand (Op) are stored. In other words, Inx_in and Inx_out indicate the read and write addresses of a local cache, respectively; Op selects the local cache containing the vector operand. As a result, all the mentioned meta-data in a row of the configuration table takes 2⌈log2(n/ω)⌉ + 3 bits. The three bits are for the data path type, access order, and operand source, respectively. The configuration table is used for configuring a configuration switch (details in §4.3).
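The entry-width formula above can be checked with a few lines of arithmetic. This is our own sketch of the bit-count calculation; the function name is ours.

```python
import math

def config_entry_bits(n, w):
    """Width of one configuration-table row: two block indices of
    ceil(log2(n/w)) bits each, plus one bit each for the data-path
    type, the access order, and the operand source."""
    index_bits = math.ceil(math.log2(n // w))
    return 2 * index_bits + 3
```

For instance, with n = 9 and ω = 3 (the Figure 8 example) there are 3 block rows/columns, so each index needs 2 bits and an entry takes 7 bits; an 8×8 blocking of a 64-wide matrix would take 9 bits per entry.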
The general procedure of the conversion algorithm is as follows: (i) As lines 8 to 12 show, the sparse kernels with no (or straightforward) data dependencies, including SpMV, BFS, SSSP, and PR, are broken down into a sequence of general matrix-vector multiplication (GEMV), dense BFS (D-BFS), dense SSSP (D-SSSP), and dense PR (D-PR) data paths, respectively. These dense data paths have the same functionality as their corresponding sparse kernels; however, they work on non-overlapping locally-dense blocks of the sparse matrix operand and overlapping sub-vectors of the dense vector operand of the original sparse kernel. (ii) As lines 13 to 26 show, the sparse kernels with data dependencies (e.g., the SymGS kernel, the key computation of the PCG algorithm) are broken down into a majority of parallelizable GEMV (lines 14 to 21) and a minority of sequential dense SymGS (D-SymGS) data paths (lines 23 to 26).
The conversion for SymGS assigns GEMVs to
For accelerating SymGS, the key insight of Alrescha is to separate the GEMV from the D-SymGS data paths to prevent the performance from being limited by the sequential nature of the SymGS kernel. To this end, Alrescha reduces switching between the two data paths (GEMV and D-SymGS) by reordering them so that Alrescha first executes all the GEMVs in a row successively and then switches to a D-SymGS. The distributive property of inner products in Equation 2 guarantees the correctness of such reordering. As an example of the outcome of Algorithm 1, Figure 8 shows the state machine of PCG, equivalent to the algorithm in Figure 2, which comprises three sparse kernels, two of which are the focus of this paper and are launched to the accelerator by the host. The configuration table for a SymGS example is shown in Figure 8. Based on Equation 2, and as lines 19 and 21 of Algorithm 1 indicate, all the non-zero blocks in the upper triangle of A have to be multiplied by x^t, and all of those in the lower triangle have to be multiplied by x^{t−1}. §4.2 explains the reasoning for the listed access orders in Figure 8.
Algorithm 1 Convert Algorithm
1:  function Convert(KernelType, A_{n×n}, ω)
    // A_{n×n}: sparse matrix, ω: block width
    // DP: data path type
    // l2r: left to right, r2l: right to left
2:    Inx_in := 0, Inx_out := 0
3:    Blocks[] = Split(A, ω)  // partitions A into ω × ω blocks
4:    m = n/ω
5:    for (i = 1; i < m; i++) do
6:      for (j = 1; j < m; j++) do
7:        if (nnz(Blocks[i, j]) > 0) then
8:          if KernelType != SymGS then
9:            DP = KernelType.DataPath
10:           Inx_in = i.ω, Inx_out = j.ω
11:           Order = l2r
12:           Op = port1  // the operand vector
13:         else
14:           if (i != j) then
15:             DP = GEMV
16:             Inx_in = j.ω
17:             Inx_out = −1  // no write to cache
18:             Order = l2r
19:             if (i > j) then
20:               Op = port2  // which is x^{t−1}
21:             else
22:               Op = port1  // which is x^t
23:           else
24:             DP = D-SymGS
25:             Inx_in = j.ω, Inx_out = (i + 1).ω
26:             Order = r2l
27:             Op = port2  // which is x^{t−1}
28:       Add2Table(DP, Inx_in, Inx_out, Order, Op)
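Algorithm 1 can be transliterated into plain Python to make the block-classification logic concrete. This is our own sketch: the block-sparsity pattern is passed in as a nonzero-count grid, we iterate over the full 0-based block range, and the table is a list of tuples rather than a hardware structure.

```python
def convert(kernel_type, nnz_per_block, n, w):
    """Sketch of Algorithm 1: nnz_per_block[i][j] is the nonzero count
    of block (i, j); returns a list of configuration-table rows
    (DP, inx_in, inx_out, order, op)."""
    table = []
    m = n // w
    for i in range(m):
        for j in range(m):
            if nnz_per_block[i][j] == 0:
                continue                     # empty block: no data path
            if kernel_type != "SymGS":
                table.append((kernel_type, i * w, j * w, "l2r", "port1"))
            elif i != j:
                # off-diagonal block -> GEMV; lower triangle reads
                # x^(t-1) (port2), upper triangle reads x^t (port1)
                op = "port2" if i > j else "port1"
                table.append(("GEMV", j * w, -1, "l2r", op))
            else:
                # diagonal block -> sequential D-SymGS
                table.append(("D-SymGS", j * w, (i + 1) * w, "r2l", "port2"))
    return table
```

Running it on a small 3×3 block pattern reproduces the structure of the Figure 8 table: every diagonal nonzero block becomes a D-SymGS row and every off-diagonal nonzero block becomes a GEMV row with the operand port chosen by its triangle.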
4.2 Dense Data Path Implementation
Alrescha implements two classes of dense data paths. The first class consists of D-BFS, D-SSSP, D-PR, and GEMV, with straightforward patterns of data dependencies. The second class (e.g., D-SymGS) captures more complicated sequential patterns. This section describes the implementations of the two classes of data paths.
(a) Straightforward data paths: These data paths, such as D-BFS, D-SSSP, D-PR, and GEMV, work on locally-dense blocks of A to capture locality in accesses to the dense vector operand. If the sparse matrix operand includes blocks of non-zero values of size ω, Alrescha splits the vector operand into chunks of size ω, and at
Figure 8: An example of the configuration table for a SymGS kernel, in which n = 9, ω = 3. Each row of the table stores the data path type (GEMV or D-SymGS), Inx_in, Inx_out, the access order (l2r or r2l), and the operand source (x^t or x^{t−1}).
each cycle, it fetches a chunk of the vector from a local cache instead of fetching an individual element of the vector operand. More specifically, a vector operation is applied to ω elements of the vector operand and all the non-zero blocks of A in a row. The approach of Alrescha for running the BFS, SSSP, PR, and SpMV sparse kernels provides two advantages: (i) it guarantees locality of cache accesses (i.e., the values in a cache line are used in succeeding cycles), and (ii) each element of the vector operand is fetched from the cache only once per n/ω, in which n is the number of columns of A.
(b) Data-dependent data paths: SymGS, an example of a sparse kernel with such dependencies, includes a matrix-vector operation with a sparse matrix operand A and three vector operands: A_{jj}, b, and a combination of x^{t−1} and x^t (Equation 2). The three vectors are the diagonal of A, the right-hand side of the Ax = b equations, and a combination of variables of the equations computed in previous iterations (i.e., the initial values) and the variables being computed in the current iteration. Based on Equation 2, calculating an element of x^t includes two inner products, the sizes of which change as we move across the rows of A. To enable using one compute engine for both inner products, Alrescha merges them by factoring out the subtraction:
x_j^t = \frac{1}{A_{jj}^T}\left(b_j - \left(\sum_{i=1}^{j-1} A_{ij}^T \, x_i^t + \sum_{i=j+1}^{n} A_{ij}^T \, x_i^{t-1}\right)\right) \qquad (3)
which consists of a unified dot product of size n, the common operation in both classes of the dense data paths. As a result of the common operation, by transforming Equation 2 into Equation 3, the D-SymGS kernel can utilize the same dot-product engine used by the other dense data paths (i.e., the first class of data paths). The only difference between the dot product of D-SymGS and that in the other data paths is the sources of the inputs and the destination of the output. In detail, one operand of the dot product of D-SymGS is made by shifting x^t into x^{t-1}. This can be implemented by simply rotating the inputs of the multipliers and pushing x_j^t into the first multiplier to be used in calculating x_{j+1}^t.

To clarify the functionality of Alrescha for D-SymGS,
we use a simple example in Figure 10, for which we assume that the size of the problem fits the size of the hardware (i.e., the number of multipliers). As Figure 10 illustrates, Alrescha stores the diagonal of A in a separate vector and does the subtraction with b in parallel with the dot product. Note that separately storing the diagonal of A helps utilize the memory bandwidth only for streaming the operand of the dot-product engine. At the first step, one row of the non-diagonal elements of A is multiplied by x_1^{t-1} to x_3^{t-1}. In the second step, we need x_0^t instead of x_1^{t-1}. However, the newly calculated x_0^t does not take the place of the kicked-out x_1^{t-1}. As a result, we insert the new variables by shifting the old ones to the right. As Figure 10 shows, while reading a row from A at each cycle, the elements belonging to the upper triangle of A are reordered, while the elements in the lower triangle keep their original order. Such an ordering, which is listed in the configuration table of Figure 8, is also reflected in the storage format (details in Figure 13 in §4.5).

Figure 9: (a) The microarchitecture of Alrescha, including the FCU for implementing common computations and the RCU for providing specific configurations for distinct dense data paths; and three example configurations for supporting (b) D-SymGS, (c) GEMV, and (d) PR.
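The merged update of Equation 3 that this example walks through can be sketched in NumPy. This is our illustrative software model of one forward D-SymGS sweep, not the paper's implementation; it ignores blocking and the hardware's operand rotation:

```python
import numpy as np

def symgs_forward_sweep(A, b, x_prev):
    """One forward sweep in the merged form of Equation 3: a single dot
    product over the whole row (diagonal excluded), followed by the
    subtraction from b and the division by the diagonal element."""
    x = x_prev.copy().astype(float)
    n = A.shape[0]
    for j in range(n):
        # already-updated entries contribute x^t, the remaining ones x^{t-1}
        s = A[j, :j] @ x[:j] + A[j, j + 1:] @ x[j + 1:]
        x[j] = (b[j] - s) / A[j, j]
    return x
```

Iterating the sweep on a diagonally dominant system drives Ax toward b, which is how SymGS is used as a smoother inside PCG.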
Figure 10: An example of D-SymGS. At each step, one x_j^t is calculated based on Equation 3. The x_j^t is then used in calculating the next element x_{j+1}^t at the next step. The arrows among the operands of the dot product indicate shifting them one to the right at each step.

4.3 Reconfigurable Microarchitecture

This section introduces the microarchitecture of Alrescha, which accelerates the sparse kernels by executing their fundamental data paths (i.e., D-SymGS, GEMV, D-BFS, D-SSSP, and D-PR). To deliver close-to-ideal performance, the main idea behind the hardware of Alrescha is to have a separate fixed computation unit (FCU) and a reconfigurable computation unit (RCU), and to reconfigure only the latter when switching between data paths (Figure 9). Reconfiguring only a fraction of the entire data path reduces the configuration time.
The FCU streams only the dense blocks of the sparse matrix (i.e., no meta-data) from memory and applies the required vector operation (i.e., phase 1 in Table 1). The FCU includes ALUs and reduce engines (REs) connected in a tree topology, as shown in Figure 9, forming the reduction tree. The interconnections between the REs of the FCU are fixed for all data paths and do not require reconfiguration. The reduction tree is fully pipelined to keep up with the speed of the data streaming from memory. One of the inputs of the ALUs (i.e., the matrix operand) is always streamed from memory, and the other inputs (i.e., the vector operands) come from the RCU. The latter input of the ALU requires a multiplexer because its source might need to change at runtime. For example, only during the initialization of D-SymGS does that input come from the cache (i.e., x^{t-1}); after the initialization, it comes from a processing element in the RCU and a forwarding connection between the inputs of the multipliers. For GEMV, on the other hand, the ALU requires the multiplexers to choose between x^{t-1} and x^t at runtime.

The responsibility of the RCU is to handle the specific data dependencies in different kernels. The RCU includes a local cache, buffers, processing elements (PEs), and a configurable switch, which determines the interconnection between the units in the RCU. The configurable switch is not a novel contribution here; it is implemented similarly to those of FPGAs and is directly controlled by the content of the configuration table (§4.4). The local cache stores the vector operands that require addressable accesses (e.g., x^{t-1}, x^t, and b), whereas the buffers handle vector operands that require deterministic accesses. For instance, we employ first-in-first-out (FIFO) buffers for A_jj and b, and use a last-in-first-out (LIFO) stack for the link buffer. The link buffer establishes transitions between the dense data paths (details in §4.4). For a data-path transition, the reduction tree has to be drained, during which the switch is reconfigured to prepare it for the next data path. Therefore, the latency of configuration is hidden by the latency of draining the adder tree. The PEs of the RCU are implemented with look-up tables (LUTs) to provide multiplication, division, summation, and subtraction. Figures 9b, c, and d illustrate the configuration of the RCU for performing the D-SymGS, GEMV, and D-PR data paths.
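As an illustration of the FCU's fixed tree, here is a tiny software model of a pairwise reduction. It is our sketch, assuming a power-of-two number of inputs (matching the eight ALUs feeding seven REs in Figure 9); the REs support sum and min, as in Table 5:

```python
def reduce_tree(values, op=lambda a, b: a + b):
    """Model of the FCU reduction tree: combine inputs pairwise, level by
    level, in log2(len(values)) steps (sum for GEMV/D-SymGS, min for
    shortest-path-style reductions)."""
    level = list(values)
    assert len(level) and len(level) & (len(level) - 1) == 0, "power-of-two width"
    while len(level) > 1:
        level = [op(level[k], level[k + 1]) for k in range(0, len(level), 2)]
    return level[0]
```

Because every level is independent of the others, the tree pipelines naturally: a new set of ALU outputs can enter level 0 while earlier sets move through the deeper levels.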
4.4 Real-Time Reconfiguration

Algorithms with a sequence of distinct sparse kernels benefit from reconfigurability because they frequently switch between the kernels. As a result, to satisfy high-speed, low-cost reconfigurability, Alrescha implements a lightweight reconfiguration, meaning that only the configuration of a small fraction of the compute engine is changed frequently. Switching between two data-dependent data paths is performed using the link buffer (i.e., implemented as a stack). During runtime, the intermediate results generated by GEMV are pushed onto the stack. Then, the successive D-SymGS pops them off. Figure 11 presents a numerical example of switching from the GEMV to the D-SymGS data path. During steps 1 to 3, the RCU is configured for GEMV, and the results are pushed onto the link stack. At step 3, while the reduction tree (adder tree) is drained, the interconnect between the cache and the FIFOs in the RCU is reconfigured for D-SymGS. This does not affect the pushes onto the stack, so reconfiguration proceeds in parallel and its latency is hidden. In steps 4 to 6, D-SymGS pops the results of the GEMV from the link stack and consumes them.
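A toy model of this link-buffer handoff follows. The function name and the exact division of work between the two phases are our simplifications of Figure 11, not the paper's code; the LIFO pop order loosely mirrors the r2l ordering the configuration table assigns to D-SymGS:

```python
def gemv_then_symgs(gemv_rows, x_prev, diag, b):
    """Handoff through the link buffer: the GEMV data path pushes each
    partial dot product onto a LIFO stack; after the adder tree drains and
    the RCU switch is reprogrammed, D-SymGS pops and consumes them."""
    link_stack = []
    for row in gemv_rows:                  # GEMV phase (steps 1-3)
        link_stack.append(sum(a * x for a, x in zip(row, x_prev)))
    x_new = []
    while link_stack:                      # D-SymGS phase (steps 4-6)
        partial = link_stack.pop()         # LIFO: reverse of push order
        j = len(x_new)
        x_new.append((b[j] - partial) / diag[j])
    return x_new
```

The point of the stack is that the producer never waits for the consumer: all GEMV results are buffered before the data path is switched once, rather than switching per element.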
Figure 11: Switching between the GEMV and D-SymGS data paths using the link stack.

4.5 Storage Format

Depending on the distribution of non-zeros in a sparse matrix, various storage formats may suit it. Figure 12 compares four well-known formats based on meta-data per non-zero. The compressed sparse row (CSR) format, which stores a vector of column indices and a vector of row pointers, locates every non-zero independently. Therefore, CSR is the right choice when the non-zeros do not exhibit any spatial locality. On the other hand, when all the non-zeros are located on diagonals, the diagonal format (DIA) [35], which stores the non-zeros sequentially, can be the best option. An extension of the DIA format, Ellpack-Itpack (ELL) [36], is more flexible when the matrix has a combination of non-diagonal and diagonal elements. For instance, ELL is used for implementing SymGS on GPUs. However, such a format does not provide enough flexibility for parallelizing rows because it does not sustain locality across rows.

Figure 12: The spectrum of storage formats for various types of sparse matrices (low is better).

The choice of storage format should be compatible with the common pattern in a wide range of sparse applications. In such a case, blocked CSR (BCSR) [26], an extension of CSR that assigns the column indices and row pointers to blocks of non-zero values, is the right format. Although BCSR is an appropriate format for scientific applications and graph analytics in terms of storage overhead, the strategy of BCSR for assigning indices and pointers, and the order of its values, is not a good match for smoothly streaming data. In other words, the main requirement for fast computation is the order of operations, which in turn dictates that the data structures be streamed in the same order. Thus, we adapt BCSR and propose a new storage format with the same meta-data overhead but compatible with the reconfigurable microarchitecture of Alrescha.

Figure 13 illustrates our proposed locally-dense format for mapping the example sparse matrix to the physical memory addresses of the accelerator. The Order of Blocks: all the non-diagonal non-zero blocks in a row of blocks are stored together, followed by a diagonal block. The Order of Values: the non-zero values belonging to the upper triangle of the non-diagonal blocks are stored in the opposite order of their original locations in the matrix (see the order of A, B, and C in Figure 13). Accordingly, the difference between the column indices of BCSR and the input indices (i.e., Inxin) of the Alrescha format is shown in Figure 13. The Diagonal Elements: for SymGS, the diagonal of A is excluded and stored separately in a local cache. Therefore, we consider non-square blocks on the diagonal (e.g., 3×4 instead of 3×3) so that the mapping of the non-diagonal elements of that block to the physical memory is adjusted. Meta Data: the indices of the input and output (i.e., Inxin and Inxout) are not streamed from memory during runtime. Instead, they are stored in the configuration table during the one-time programming phase and are used for reconfiguration purposes. As a result, during the iterative execution of the algorithms, the whole available memory bandwidth is utilized only for streaming payload.

Figure 13: Comparing Alrescha with BCSR (for the running example: BCSR col_index {1,7,4,3,7} and row_pointer {1,3,4,6}; Alrescha input_index {7,4,7,3,9} and output_index {1,4,7}). The col_index of BCSR and input_index (i.e., Inxin) of Alrescha are color-coded to show their corresponding blocks in the matrix. Alrescha uses the index of the last column for the input index of diagonal blocks.

Figure 14: Scientific application data sets attained from the University of Florida SuiteSparse collection [34]:

Name | Kind | Rows/Cols | NNZ
2cubes_sphere | Electromagnetic | 101,492 | 1,647,264
atmosmodm | Fluid Dynamics | 1,489,752 | 10,319,760
CoupCons3D | Structural Prob. | 416,800 | 17,277,420
dielFilterV3real | Electromagnetic | 1,102,824 | 89,306,020
HV15R | Fluid Dynamics | 2,017,169 | 283,073,458
ifiss_mat | Fluid Dynamics | 96,307 | 3,599,932
light_in_tissue | Electromagnetics | 29,282 | 406,084
scircuit | Circuit Simul. | 170,998 | 958,936
StocF-1465 | Fluid Dynamics | 1,465,137 | 21,005,389
TEM152078 | Electromagnetics | 152,078 | 6,459,326
Transport | Structural Prob. | 1,602,111 | 23,487,281
windtunnel_evap3d | Fluid Dynamics | 40,816 | 803,978
epb3 | Thermal Prob. | 84,617 | 463,625
Si34H36 | Chemistry Prob. | 97,569 | 5,156,379
GaAsH6 | Chemistry Prob. | 61,349 | 3,381,809
thermomech_TC | Thermal Prob. | 102,158 | 711,558
xenon1 | Materials Prob. | 48,600 | 1,181,120
poisson3Db | Fluid Dynamics | 85,623 | 2,374,949
crystm03 | Materials Prob. | 24,696 | 583,770
ASIC_100k | Circuit Simul. | 99,340 | 954,163
finan512 | Economic Prob. | 74,752 | 596,992
qa8fm | Acoustics Prob. | 66,127 | 1,660,579
mono_500Hz | Acoustics Prob. | 169,410 | 5,036,288
offshore | Electromagnetic | 259,789 | 4,242,673

Table 3: Graph datasets.

Dataset | Rows/Cols | NNZ
com-orkut | 3,072,441 | 234,370,166
hollywood-2009 | 1,139,905 | 1,139,905
kron-g500-logn21 | 2,097,152 | 182,082,942
roadnet-CA | 1,971,281 | 5,533,214
LiveJournal | 4,847,571 | 68,993,773
Youtube | 1,134,890 | 5,975,248
Pokec | 1,632,803 | 30,622,564
sx-stackoverflow | 2,601,977 | 36,233,450
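The block and value ordering of §4.5 can be sketched in NumPy. This is our illustrative encoder; the exact index conventions (last-column input index for diagonal and reversed upper-triangle blocks) are inferred from the Figure 13 example, and it keeps square diagonal blocks for simplicity:

```python
import numpy as np

def to_alrescha_blocks(A, omega):
    """Lay out non-zero omega x omega blocks in the locally-dense order:
    per block row, non-diagonal blocks first (upper-triangle blocks with
    their columns reversed, so values stream in computation order),
    then the diagonal block."""
    m = A.shape[0] // omega
    payload, input_index, output_index = [], [], []
    for i in range(m):
        output_index.append(i * omega)
        row_blocks, diag = [], None
        for j in range(m):
            blk = A[i*omega:(i+1)*omega, j*omega:(j+1)*omega]
            if not np.count_nonzero(blk):
                continue
            if i == j:                 # diagonal block: stored last,
                diag = (blk, (j + 1) * omega - 1)   # indexed by last column
            elif j > i:                # upper triangle: reverse value order
                row_blocks.append((blk[:, ::-1], (j + 1) * omega - 1))
            else:                      # lower triangle: original order
                row_blocks.append((blk, j * omega))
        if diag is not None:
            row_blocks.append(diag)
        for blk, inx in row_blocks:
            payload.append(blk)
            input_index.append(inx)
    return payload, input_index, output_index
```

Because the indices end up in the configuration table rather than interleaved with the values, the payload list can be streamed from memory back-to-back with no meta-data decoding on the critical path.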
5. PERFORMANCE EVALUATION

This section explores the performance of Alrescha by comparing it with the CPU, GPU, and state-of-the-art accelerators for sparse problems. We evaluate Alrescha for both scientific applications and graph analytics. This section first introduces the data sets, algorithms, and experimental setup, and then analyzes execution time and energy consumption.

5.1 Datasets, Algorithms, and Baselines

We pick real-world matrices with applications in scientific and graph problems from the University of Florida SuiteSparse Matrix Collection [34]. The matrices, shown in Figure 14, are our datasets representing scientific applications, including circuit-simulation, electromagnetics, fluid-dynamics, structural, materials, acoustics, economics, and chemistry problems, all of which can be modeled by PDEs. As the figure shows, the non-zero values have various distributions across the matrices. For graph analytics, we choose the eight datasets listed in Table 3, along with their dimensions and numbers of non-zeros (NNZ). We run PCG, which includes the SymGS and SpMV kernels, on the matrices listed in Figure 14, and run the graph algorithms (i.e., BFS, SSSP, and PR) on the matrices in Table 3. We also run SpMV on both categories of datasets.

We compare Alrescha with the CPU and GPU platforms. The configurations of the baseline platforms are listed in Table 4. For the CPU and GPU, we exclude disk-access time. For fair comparisons, we include the necessary optimizations, such as row reordering and suitable storage formats (e.g., ELL), proposed for the CPU and GPU implementations. The PCG algorithm and the graph algorithms running on the GPU are respectively based on the cuSPARSE and Gunrock [37] libraries. The graph algorithms running on the CPU are based on the GridGraph [38] and/or CuSha [39] platforms, whichever achieves better performance for a specific algorithm.

Table 4: Baseline configurations.
GPU baseline — Graphics card: NVIDIA Tesla K40c, 2880 CUDA cores; Architecture: Kepler; Clock frequency: 745 MHz; Memory: 12 GB GDDR5, 288 GB/s; Libraries: Gunrock [37] and cuSPARSE; Optimizations: row reordering (coloring) [8], ELL format.
CPU baseline — Processor: Intel Xeon E5-2630 v3, 8 cores; Clock frequency: 2.4 GHz; Cache: 64 KB L1, 256 KB L2, 20 MB L3; Memory: 128 GB DDR4, 59 GB/s; Platforms: CuSha [39], GridGraph [38].

Table 5: Alrescha configuration.
Floating point: double precision (64 bits); Clock frequency: 2.5 GHz; Cache: 1 KB, 64-byte lines, 4-cycle access latency; RE latency: 3 cycles (sum: 3, min: 1); ALU latency: 3 cycles; Memory: 12 GB GDDR5, 288 GB/s.
Besides the comparison with the CPU and GPU, this section compares Alrescha with state-of-the-art hardware accelerators, including OuterSPACE [18], an accelerator for SpMV; GraphR [24], a ReRAM-based graph accelerator; and a memristive accelerator for scientific problems [25]. To reproduce their latency and power-consumption numbers, we modeled the behavior of the preceding accelerators based on the information provided in the published papers (e.g., the latency of read and write operations for GraphR and the memristive accelerator). We validate our model against the numbers reported for their configurations to make sure our reproduced numbers are never worse than their reported ones. Moreover, for a fair comparison, we assign all the accelerators the same computation and memory-bandwidth budget; this assumption does not harm the performance of our peers.
5.2 Experimental Setup

We convert the raw matrices using Algorithm 1, implemented in Matlab. To do so, we examine block sizes of 8, 16, and 32 for the range of data sets and choose a block size of eight because, unlike the other two, it provides a balance between the opportunity for parallelism and the number of non-zero values. We model the hardware of Alrescha using a cycle-level simulator with the configurations listed in Table 5. The clock frequency is chosen to enable the compute logic to follow the speed of streaming from memory (i.e., each 64-bit operand of an ALU is delivered from memory in 0.4 ns through the 32-bit, 5 Gbps links). To measure energy consumption, we model all the components of the microarchitecture using a TSMC 28 nm standard-cell and SRAM library at 200 MHz. The reported numbers include programming the accelerator.
5.3 Execution Time

Scientific Problems: The primary axis of Figure 15 (i.e., the bars) illustrates the speedup of running PCG on Alrescha over the GPU implementation optimized by row reordering [8] for extracting a high level of parallelism; the secondary axis of Figure 15 shows the bandwidth utilization. The figure also captures the speedup of the memristive-based hardware accelerator [25]. On average, Alrescha provides a 15.6× speedup compared to the optimized implementation on the GPU. The speedup of Alrescha is approximately twice that of the most recent accelerator for solving PDEs. To investigate the reasons behind this observation, we plot memory-bandwidth utilization in Figure 15. As the figure shows, the performance of Alrescha and the other hardware accelerator on all scientific datasets is directly related to memory-bandwidth utilization, mainly because of the nature of sparsity. Moreover, neither fully utilizes the available memory bandwidth because both approaches use blocked storage formats, in which the percentage of non-zero values in a block rarely reaches one hundred percent. Nevertheless, we see that Alrescha better utilizes the bandwidth because it resolves the dependencies in computations, which otherwise limit bandwidth utilization.

To clarify the impact of resolving dependencies on overall performance, Figure 16 presents the percentage of sequential computations in the GPU implementation,
versus that in Alrescha, which has an average of 23.1% sequential operations. As the figure suggests, even in the GPU implementation, which extracts independent parallel operations using row reordering and graph coloring, on average 60.9% of operations are still sequential. This fraction is more than 60% for highly-diagonal matrices and less than 60% for matrices with a greater opportunity for in-row parallelism. Such a trend identifies the distribution of locally-dense blocks as another rationale for the observed speedups. More specifically, when the distribution of non-zero values in the rows of a matrix offers the opportunity for parallelism, the speedup over the GPU is smaller than when the non-zeros are mostly distributed on the diagonal.

Figure 15: Speedup for the PCG algorithm on scientific datasets, normalized to GPU (bars), and bandwidth utilization (lines). The memristive-based accelerator [25] is the state-of-the-art accelerator for scientific problems. (Geometric-mean speedups: 15.6× for Alrescha, 7.3× for the memristive-based accelerator.)

Figure 16: The reduction of the sequential part of the PCG algorithm by applying Alrescha. The baseline shows the percentage of sequential operations with the row-reordering optimization.

Insights: For multi-kernel sparse algorithms with data-dependent computations, Alrescha improves performance by (i) extracting parallelizable data paths, (ii) reordering the dense data paths and the elements in the blocks to maximize reuse of data, and (iii) implementing them in lightweight reconfigurable hardware, all of which result in fast switching not only between the distinct data paths of a single kernel, but also among the sparse kernels.

Graph Analytics and SpMV: This section explores the performance of algorithms consisting of a single type of sparse kernel, which naturally have fewer data-dependency patterns in their computations. This study demonstrates that Alrescha is not optimized solely for a specific domain and is applicable to accelerating a wide range of sparse applications. First, we analyze the performance of graph applications. Figure 17 illustrates the speedup of running BFS, SSSP, and PR on Alrescha, a recent graph hardware accelerator (i.e., based on GraphR [24]), and the GPU, all normalized to the CPU.
Figure 17: Speedup for graph algorithms on graph data sets over the CPU. GraphR [24] is the state-of-the-art graph accelerator.

Figure 18: Speedup for running SpMV on scientific and graph datasets normalized to GPU (bars), and the percentage of execution time devoted to cache accesses (lines). OuterSPACE [18] is the state-of-the-art SpMV accelerator.

Figure 19: Energy-consumption improvement of Alrescha normalized to that of the CPU and GPU.
As the figure shows, Alrescha offers average speedups of 15.7×, 7.7×, and 27.6× for the BFS, SSSP, and PR algorithms, respectively. We achieve this speedup by avoiding the transfer of meta-data, reordering the blocks to increase data reuse, and improving locality.

The primary axis of Figure 18 (i.e., the bars) illustrates the speedup of SpMV, a common algorithm across various sparse applications, on Alrescha and OuterSPACE [18] (i.e., the most recent hardware accelerator for SpMV), normalized to the GPU baseline. As the figure shows, Alrescha offers average speedups of 6.9× and 13.6× for the scientific and graph datasets, respectively. When running SpMV, all the data paths are GEMV; therefore, no transition between data paths is required. However, the optimizations of Alrescha still help achieve greater performance. The key optimization here concerns accesses to the cache, which serve the frequent accesses to the vector operand of SpMV. To show this, the secondary axis of Figure 18 (i.e., the lines) plots the percentage of the whole execution time spent on accesses to the local cache. To perform an SpMV, OuterSPACE reads each element of the vector operand, multiplies it with all the elements in a row of the matrix, accumulates the partial products, and writes them to their corresponding elements of the output vector. Hence, unlike Alrescha, the computation engine of OuterSPACE has to put the partial products in their right locations in the output vector, which may lead to a lack of locality in accesses to the cache.
Insights: For single-kernel sparse problems (e.g., SpMV),Alrescha gains speedup by (i) improving the locality ofcache accesses (i.e., consuming the values in a cacheline in succeeding cycles), (ii) increasing the data reuserate of not only the sparse-matrix operands, but alsothe dense-vector operands, and (iii) avoiding meta-datatransfer and decoding.
5.4 Energy Consumption

A primary motive for using hardware accelerators rather than software optimizations is to reduce energy consumption. To achieve this goal, the techniques integrated into the hardware accelerators have to be efficient. Since a major source of energy consumption is accesses to local SRAM-based buffers or caches, reducing the number of reads and writes from and to the local caches by substituting them with computation is beneficial. Figure 19 illustrates the energy consumption of Alrescha for executing SpMV, normalized to that of the CPU and GPU baselines. As Figure 19 shows, on average, the total energy consumption improves by 74× compared to the CPU and 14× compared to the GPU. Note that the activity of the compute units, defined by the density of the locally-dense blocks, impacts energy but not performance.

Insights: The main reasons for the low energy consumption are the small reconfigurable hardware of Alrescha in combination with a locally-dense storage format whose order of blocks and values matches the order of computation, thus avoiding the decoding of meta-data and reducing the number of accesses to the cache and the memory.
6. CONCLUSION

We proposed Alrescha, a sparse-problem accelerator that quickly reconfigures its data path at runtime to dynamically tune the hardware for distinct computation patterns, enabling the use of high-bandwidth memory at low cost for fast acceleration of sparse problems.
Acknowledgment

We would like to thank the anonymous reviewers and gratefully acknowledge the support of NSF CSR 1526798.
7. REFERENCES
[1] K. Akbudak and C. Aykanat, “Exploiting locality in sparsematrix-matrix multiplication on many-core architectures,”IEEE Transactions on Parallel and Distributed Systems,vol. 28, 2017.
[2] E. Saule, K. Kaya, and Ü. V. Çatalyürek, “Performanceevaluation of sparse matrix multiplication kernels on intelxeon phi,” in International Conference on ParallelProcessing and Applied Mathematics. Springer, 2013.
[3] P. D. Sulatycke and K. Ghose, “Caching-efficientmultithreaded fast multiplication of sparse matrices,” inParallel Processing Symposium, 1998. IPPS/SPDP 1998.Proceedings of the First Merged International... andSymposium on Parallel and Distributed Processing 1998.IEEE, 1998.
[4] S. Dalton, L. Olson, and N. Bell, “Optimizing sparsematrixâĂŤmatrix multiplication for the gpu,” ACMTransactions on Mathematical Software (TOMS), vol. 41,2015.
[5] F. Gremse, A. Hofter, L. O. Schwen, F. Kiessling, andU. Naumann, “Gpu-accelerated sparse matrix-matrixmultiplication by iterative row merging,” SIAM Journal onScientific Computing, vol. 37, 2015.
[6] W. Liu and B. Vinter, “An efficient gpu general sparsematrix-matrix multiplication for irregular data,” in Paralleland Distributed Processing Symposium, 2014 IEEE 28thInternational. IEEE, 2014.
[7] K. Matam, S. R. K. B. Indarapu, and K. Kothapalli, “Sparsematrix-matrix multiplication on modern architectures,” inHiPC, 2012 19th International Conference on. IEEE, 2012.
[8] E. Phillips and M. Fatica, “A cuda implementation of thehigh performance conjugate gradient benchmark,” in PMBS.Springer, 2014.
[9] W. Liu and B. Vinter, “A framework for general sparsematrix–matrix multiplication on gpus and heterogeneousprocessors,” Journal of Parallel and Distributed Computing,vol. 85, 2015.
[10] B. Asgari, R. Hadidi, H. Kim, and S. Yalamanchili,“Lodestar: Creating locally-dense cnns for efficient inferenceon systolic arrays,” in DAC. ACM, 2019.
[11] C. Y. Lin, N. Wong, and H. K.-H. So, “Design spaceexploration for sparse matrix-matrix multiplication onfpgas,” International Journal of Circuit Theory andApplications, vol. 41, 2013.
[12] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, andF. Franchetti, “Accelerating sparse matrix-matrixmultiplication with 3d-stacked logic-in-memory hardware,”in HPEC, 2013 IEEE. IEEE, 2013.
[13] B. Asgari, R. Hadidi, H. Kim, and S. Yalamanchili,“Eridanus: Efficiently running inference of dnns usingsystolic arrays,” IEEE Micro, 2019.
[14] B. Asgari, R. Hadidi, and H. Kim, “Ascella: Acceleratingsparse computation byenabling stream accesses to memory,”in DATE. IEEE, 2020.
[15] A. K. Mishra, E. Nurvitadhi, G. Venkatesh, J. Pearce, andD. Marr, “Fine-grained accelerators for sparse machinelearning workloads,” in ASP-DAC. IEEE, 2017.
[16] E. Nurvitadhi, A. Mishra, and D. Marr, “A sparse matrixvector multiply accelerator for support vector machine,” inProceedings of the 2015 International Conference onCompilers, Architecture and Synthesis for EmbeddedSystems. IEEE Press, 2015.
[17] U. Gupta, B. Reagen, L. Pentecost, M. Donato, T. Tambe,A. M. Rush, G.-Y. Wei, and D. Brooks, “Masr: A modularaccelerator for sparse rnns,” in PACT. IEEE, 2019.
[18] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng,C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, andR. Dreslinski, “Outerspace: An outer product based sparsematrix multiplication accelerator,” in HPCA. IEEE, 2018.
[19] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “Extensor: An accelerator for sparse tensor algebra,” in MICRO. ACM, 2019.
[20] K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi, S. Koppula, N. M. Ghiasi, T. Shahroodi, J. G. Luna, and O. Mutlu, “Smash: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations,” in MICRO. ACM, 2019.
[21] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics,” in MICRO. IEEE, 2016.
[22] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” ACM SIGARCH Computer Architecture News, vol. 43, 2016.
[23] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “Graphpim: Enabling instruction-level pim offloading in graph computing frameworks,” in HPCA. IEEE, 2017.
[24] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Graphr: Accelerating graph processing using reram,” in HPCA. IEEE, 2018.
[25] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in ISCA. IEEE, 2018.
[26] R. W. Vuduc and H.-J. Moon, “Fast sparse matrix-vector multiplication by exploiting variable block structure,” in HPCC. Springer, 2005.
[27] J. Dongarra, M. A. Heroux, and P. Luszczek, “Hpcg benchmark: a new metric for ranking high performance computing systems,” Knoxville, Tennessee, 2015.
[28] D. Ruiz, F. Mantovani, M. Casas, J. Labarta, and F. Spiga, “The hpcg benchmark: analysis, shared memory preliminary improvements and evaluation on an arm-based platform,” 2018.
[29] V. Marjanović, J. Gracia, and C. W. Glass, “Performance modeling of the hpcg benchmark,” in PMBS. Springer, 2014.
[30] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3.
[31] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010.
[32] Y. Huang, N. Guo, M. Seok, Y. Tsividis, and S. Sethumadhavan, “Analog computing in a modern context: A linear algebra accelerator case study,” IEEE Micro, vol. 37, 2017.
[33] Y. Huang, N. Guo, M. Seok, Y. Tsividis, K. Mandli, and S. Sethumadhavan, “Hybrid analog-digital solution of nonlinear partial differential equations,” in MICRO. IEEE, 2017.
[34] T. A. Davis and Y. Hu, “The university of florida sparse matrix collection,” ACM TOMS, vol. 38, 2011.
[35] Y. Saad, Iterative methods for sparse linear systems. SIAM, 2003, vol. 82.
[36] D. R. Kincaid, T. C. Oppe, and D. M. Young, “Itpackv 2d user’s guide,” Texas Univ., Austin, TX (USA). Center for Numerical Analysis, Tech. Rep., 1989.
[37] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, “Gunrock: A high-performance graph processing library on the gpu,” in ACM SIGPLAN Notices, vol. 51, no. 8. ACM, 2016.
[38] X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning,” in USENIX ATC, 2015.
[39] F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan, “Cusha: vertex-centric graph processing on gpus,” in HPDC. ACM, 2014.