
Research Article
A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs

Guixia He¹ and Jiaquan Gao²

¹Zhijiang College, Zhejiang University of Technology, Hangzhou 310024, China
²College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

Correspondence should be addressed to Jiaquan Gao; springf12@163.com

Received 4 January 2016; Accepted 27 March 2016

Academic Editor: Veljko Milutinovic

Copyright © 2016 G. He and J. Gao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2016, Article ID 8471283, 12 pages, http://dx.doi.org/10.1155/2016/8471283

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMVs on graphics processing units (GPUs), for example, CSR-scalar and CSR-vector, usually have poor performance due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU that is called PCSR. PCSR involves two kernels and accesses CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR fully outperforms CSR-scalar, CSR-vector, and CSRMV and HYBMV in the vendor-tuned CUSPARSE library and is comparable with a most recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR on a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that PCSR on multiple GPUs achieves good performance and high parallel efficiency whether or not the communication between GPUs is considered.

1. Introduction

Sparse matrix-vector multiplication (SpMV) has proven to be an important operation in scientific computing. It needs to be accelerated because SpMV represents the dominant cost in many iterative methods for solving large-sized linear systems and eigenvalue problems that arise in a wide variety of scientific and engineering applications [1]. Initial work on accelerating SpMV on CUDA-enabled GPUs is presented by Bell and Garland [2, 3]. The corresponding implementations in the CUSPARSE [4] and CUSP [5] libraries include optimized codes for the well-known compressed sparse row (CSR), coordinate list (COO), ELLPACK (ELL), hybrid (HYB), and diagonal (DIA) formats. Experimental results show speedups between 1.56 and 12.30 compared to an optimized CPU implementation for a range of sparse matrices.

SpMV is a largely memory bandwidth-bound operation. Reported results indicate that different access patterns to the matrix and vectors on the GPU influence the SpMV performance [2, 3]. The COO, ELL, DIA, and HYB kernels benefit from full coalescing. However, the scalar CSR kernel (CSR-scalar) shows poor performance because of its rarely coalesced memory accesses [3]. The vector CSR kernel (CSR-vector) improves the performance of CSR-scalar by using warps to access the CSR structure in a contiguous but not generally aligned fashion [3], which implies partial coalescing. Since then, researchers have developed many highly efficient CSR-based SpMV implementations on the GPU by optimizing the memory access pattern of the CSR structure. Lu et al. [6] optimize CSR-scalar by padding CSR arrays and achieve a 30% improvement of the memory access performance. Dehnavi et al. [7] propose a prefetch-CSR method that partitions the matrix nonzeros into blocks of the same size and distributes them amongst GPU resources. This method obtains a slightly better behavior than CSR-vector by padding rows with zeros to increase data regularity, using parallel reduction techniques, and prefetching data to hide global memory accesses. Furthermore, Dehnavi et al. enhance the performance of the prefetch-CSR method by replacing it with three subkernels [8]. Greathouse and Daga suggest a CSR-Adaptive algorithm that keeps the CSR format intact



and maps well to GPUs [9]. Their implementation efficiently accesses DRAM by streaming data into the local scratchpad memory and dynamically assigns different numbers of rows to each parallel GPU compute unit. In addition, numerous works have been proposed for GPUs using variants of the CSR storage format, such as compressed sparse eXtended [10], bit-representation-optimized compression [11], block CSR [12, 13], and row-grouped CSR [14].

Besides using variants of CSR, many highly efficient SpMVs on GPUs have been proposed by utilizing variants of the ELL and COO storage formats, such as ELLPACK-R [15], ELLR-T [16], sliced ELL [13, 17], SELL-C-σ [18], sliced COO [19], and blocked compressed COO [20]. Specialized storage formats provide definitive advantages. However, as many programs use CSR, the conversion from CSR to other storage formats presents a large engineering hurdle, can incur large runtime overheads, and requires extra storage space. Moreover, CSR-based algorithms generally have a lower memory usage than those that are based on other storage formats such as ELL, DIA, and HYB.

All the above observations motivate us to further investigate how to construct efficient SpMVs on GPUs while keeping CSR intact. In this study, we propose a perfect CSR algorithm, called PCSR, on GPUs. PCSR is composed of two kernels and accesses CSR arrays in a fully coalesced manner. Experimental results on C2050 GPUs show that PCSR outperforms CSR-scalar and CSR-vector and has a better behavior compared to CSRMV and HYBMV in the vendor-tuned CUSPARSE library [4] and a most recently proposed CSR-based algorithm, CSR-Adaptive.

The main contributions of this paper are summarized as follows:

(i) A novel SpMV implementation on a GPU, which keeps CSR intact, is proposed. The proposed algorithm consists of two kernels and alleviates the deficiencies of many existing CSR algorithms that access CSR arrays in a rarely or only partially coalesced manner.

(ii) Our proposed SpMV algorithm on a GPU is extended to multiple GPUs. Moreover, we suggest two methods to balance the workload among multiple GPUs.

The rest of this paper is organized as follows. Following this introduction, the matrix storage, CUDA architecture, and SpMV are described in Section 2. In Section 3, a new SpMV implementation on a GPU is proposed. Section 4 discusses how to extend the proposed SpMV algorithm on a GPU to multiple GPUs. Experimental results are presented in Section 5. Section 6 contains our conclusions and points to our future research directions.

2. Related Techniques

2.1. Matrix Storage. To take advantage of the large number of zeros in sparse matrices, special storage formats are required. In this study, only the compressed sparse row (CSR) format is considered, although there are many varieties of sparse matrix storage formats, such as ELLPACK (or ITPACK) [21], COO [22], DIA [1], and HYB [3]. Using CSR, an n × n sparse matrix A with N nonzero elements is stored via three arrays: (1) the array data contains all the nonzero entries of A; (2) the array indices contains the column indices of the nonzero entries stored in data; and (3) the entries of the array ptr point to the first entry of each subsequent row of A in the arrays data and indices.

For example, the matrix

A = [ 4 1 0 1 0 0
      1 4 1 0 1 0
      0 1 4 0 0 1
      1 0 0 4 1 0
      0 1 0 1 4 1
      0 0 1 0 1 4 ]                                            (1)

is stored in the CSR format by

data    = [4 1 1 1 4 1 1 1 4 1 1 4 1 1 1 4 1 1 1 4]
indices = [0 1 3 0 1 2 4 1 2 5 0 3 4 1 3 4 5 2 4 5]
ptr     = [0 3 7 10 13 17 20]                                   (2)
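Written out as plain C/CUDA host arrays (zero-based, exactly the arrays in (2)), the CSR representation of matrix (1) looks as follows; the identifier names simply mirror the paper's notation:

// CSR arrays of the 6 x 6 example matrix A in (1); it has 20 nonzero entries.
static const double data[20]    = {4,1,1, 1,4,1,1, 1,4,1, 1,4,1, 1,1,4,1, 1,1,4};
static const int    indices[20] = {0,1,3, 0,1,2,4, 1,2,5, 0,3,4, 1,3,4,5, 2,4,5};
static const int    ptr[7]      = {0, 3, 7, 10, 13, 17, 20};  // ptr[i+1] - ptr[i] = nonzeros in row i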

2.2. CUDA Architecture. The compute unified device architecture (CUDA) is a heterogeneous computing model that involves both the CPU and the GPU [23]. Executing a parallel program on the GPU using CUDA involves the following: (1) transferring the required data to the GPU global memory, (2) launching the GPU kernel, and (3) transferring the results back to the host memory. The threads of a kernel are grouped into a grid of thread blocks. The GPU schedules blocks over the multiprocessors according to their available execution capacity. When a block is given to a multiprocessor, it is split into warps that are composed of 32 threads. In the best case, all 32 threads have the same execution path and the instruction is executed concurrently; if not, the execution paths are executed sequentially, which greatly reduces the efficiency. The threads in a block communicate via the fast shared memory, but the threads in different blocks communicate through high-latency global memory. Major challenges in optimizing an application on GPUs are global memory access latency, different execution paths in each warp, communication and synchronization between threads in different blocks, and resource utilization.
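A minimal host-side sketch of the three-step pattern above is given below; the kernel, its body, and the launch configuration are placeholders of ours, not taken from the paper:

// A trivial placeholder kernel: out[i] = 2 * in[i].
__global__ void my_kernel(const float *in, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

void run_on_gpu(const float *h_in, float *h_out, int n)
{
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // (1) transfer required data to GPU global memory
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // (2) launch the kernel as a grid of thread blocks (256 threads per block here)
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    my_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, n);

    // (3) transfer results back to host memory
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}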

2.3. Sparse Matrix-Vector Multiplication. Assume that A is an n × n sparse matrix and x is a vector of size n; a sequential version of CSR-based SpMV is described in Algorithm 1. Obviously, the order in which the elements of data, indices, ptr, and x are accessed has an important impact on the SpMV performance on GPUs, where memory access patterns are crucial.


Input: data, indices, ptr, x, n
Output: y
(01) for i ← 0 to n − 1 do
(02)   row_start ← ptr[i]
(03)   row_end ← ptr[i + 1]
(04)   sum ← 0
(05)   for j ← row_start to row_end − 1 do
(06)     sum += data[j] · x[indices[j]]
(07)   done
(08)   y[i] ← sum
(09) done

Algorithm 1: Sequential SpMV.
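For reference, Algorithm 1 can be transcribed directly into runnable C/CUDA host code (double precision assumed):

// Sequential CSR SpMV (Algorithm 1): y = A * x with A stored in the arrays data, indices, ptr.
void spmv_csr_sequential(const double *data, const int *indices, const int *ptr,
                         const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        int row_start = ptr[i];
        int row_end   = ptr[i + 1];
        double sum = 0.0;
        for (int j = row_start; j < row_end; j++)
            sum += data[j] * x[indices[j]];   // accumulate the nonzeros of row i
        y[i] = sum;
    }
}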

3. SpMV on a GPU

In this section, we present a perfect implementation of CSR-based SpMV on the GPU. Different from other related work, the proposed algorithm involves the following two kernels:

(i) Kernel 1: calculate the array v = [v_1, v_2, ..., v_N], where v_i = data[i] · x[indices[i]], i = 1, 2, ..., N, and then save it to global memory.

(ii) Kernel 2: accumulate the element values of v according to the formula y_j = Σ_{ptr[j] ≤ i < ptr[j+1]} v_i, j = 0, 1, ..., n − 1, and store the results in an array y in global memory.

We call the proposed SpMV algorithm PCSR. For simplicity, the symbols used in this study are listed in Table 1.

3.1. Kernel 1. The detailed procedure of Kernel 1 is shown in Algorithm 2. We observe that the accesses to the two arrays data and indices in global memory are fully coalesced. However, the vector x in global memory is randomly accessed, which decreases the performance of Kernel 1. On the basis of the evaluations in [24], texture memory is the best memory space in which to place randomly accessed data. Therefore, texture memory is utilized here to hold the vector instead of global memory. For the single-precision floating-point texture, the fourth step in Algorithm 2 is rewritten as

v[tid] ← data[tid] · tex1Dfetch(floatTexRef, indices[tid])        (3)

Because textures do not support double values, the following function fetch_double() is suggested to convert the fetched int2 value into a double value:

(01) __device__ double fetch_double(texture<int2, 1> t, int i) {
(02)   int2 v = tex1Dfetch(t, i);
(03)   return __hiloint2double(v.y, v.x);
(04) }

Furthermore, for the double-precision floating-point texture, based on the function fetch_double(), we rewrite the fourth step in Algorithm 2 as

v[tid] ← data[tid] · fetch_double(doubleTexRef, indices[tid])     (4)
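As a host-side sketch (the helper names are ours, and the legacy texture-reference API of CUDA 6.5 is assumed), the input vector x is bound to the textures referenced in (3) and (4) roughly as follows:

// File-scope texture references through which x is read in Kernel 1.
texture<float, 1, cudaReadModeElementType> floatTexRef;   // single precision
texture<int2,  1, cudaReadModeElementType> doubleTexRef;  // double precision (fetched as int2)

// Bind the device copy of x before launching Kernel 1.
void bind_x_single(const float *d_x, int n)
{
    cudaBindTexture(NULL, floatTexRef, d_x, n * sizeof(float));   // tex1Dfetch(floatTexRef, i) == x[i]
}

void bind_x_double(const double *d_x, int n)
{
    // Each double is fetched as an int2 and reassembled by fetch_double() above.
    cudaBindTexture(NULL, doubleTexRef, d_x, n * sizeof(double));
}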

3.2. Kernel 2. Kernel 2 accumulates the element values of v that are obtained by Kernel 1; its detailed procedure is shown in Algorithm 3. This kernel is mainly composed of the following three stages:

(i) In the first stage, the array ptr in global memory is piecewise assembled into the shared-memory array ptr_s of each thread block in parallel. Each thread of a thread block is responsible for loading only one element value of ptr into ptr_s, except for thread 0, which additionally loads the closing boundary entry (see lines (05)-(06) in Algorithm 3; a CUDA transcription of this stage is sketched after this list). The detailed procedure is illustrated in Figure 1. We can see that the accesses to ptr are aligned.

(ii) The second stage loads the element values of v in global memory from position ptr_s[0] to position ptr_s[TB] into the shared-memory array v_s for each thread block. The assembling procedure is illustrated in Figure 2. In this case, the access to v is fully coalesced.

(iii) The third stage accumulates the element values of v_s, as shown in Figure 3. The accumulation is highly efficient due to the utilization of the two shared-memory arrays ptr_s and v_s.
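As promised above, stage (i) corresponds directly to lines (03)-(07) of Algorithm 3. A minimal CUDA transcription, wrapped in a stand-alone kernel so that it compiles on its own, might look as follows (THREADS_PER_BLOCK is a compile-time constant we introduce, and, as in Algorithm 3, the grid is assumed to cover ptr exactly):

#define THREADS_PER_BLOCK 256   // launch with exactly this many threads per block

// Stage (i) of Kernel 2 in isolation: stage one block's slice of ptr into shared memory.
__global__ void load_ptr_stage(const int *ptr, int *ptr_s_out)
{
    __shared__ int ptr_s[THREADS_PER_BLOCK + 1];
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;

    ptr_s[tid] = ptr[gid];                          // each thread loads one entry of ptr
    if (tid == 0)
        ptr_s[blockDim.x] = ptr[gid + blockDim.x];  // thread 0 also loads the closing boundary entry
    __syncthreads();                                // ptr_s[0..TB] is now visible to the whole block

    ptr_s_out[gid] = ptr_s[tid];                    // written back only so the fragment is a complete kernel
}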

Obviously, Kernel 2 benefits from shared memory. Using the shared memory, not only is the data accessed quickly, but the accesses to the data are also coalesced.

From the above procedures for PCSR, we observe that PCSR needs additional global memory space to store the middle array v besides storing the CSR arrays data, indices, and ptr. Saving data into v in Kernel 1 and loading data from v in Kernel 2 decrease the performance of PCSR to a degree. However, PCSR benefits from the middle array v because introducing v makes it possible to access the CSR arrays data, indices, and ptr in a fully coalesced manner. This greatly improves the speed of accessing the CSR arrays and alleviates the principal deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing).


[Figure 1: First stage of Kernel 2. The threads of block i cooperatively load ptr[i·TB + 0], ..., ptr[i·TB + TB] from global memory into the shared-memory array ptr_s[0], ..., ptr_s[TB]; thread 0 loads both ptr_s[0] and the closing entry ptr_s[TB].]

Table 1: Symbols used in this study.

Symbol                  Description
A                       Sparse matrix
x                       Input vector
y                       Output vector
n                       Size of the input and output vectors
N                       Number of nonzero elements in A
threadsPerBlock (TB)    Number of threads per block
blocksPerGrid (BG)      Number of blocks per grid
elementsPerThread       Number of elements calculated by each thread
sizeSharedMemory        Size of shared memory
M                       Number of GPUs

Input: data, indices, x, N
CUDA-specific variables:
  (i) threadIdx.x: a thread
  (ii) blockIdx.x: a block
  (iii) blockDim.x: number of threads per block
  (iv) gridDim.x: number of blocks per grid
Output: v
(01) tid ← threadIdx.x + blockIdx.x · blockDim.x
(02) icr ← blockDim.x · gridDim.x
(03) while tid < N do
(04)   v[tid] ← data[tid] · x[indices[tid]]
(05)   tid += icr
(06) end while

Algorithm 2: Kernel 1.
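A minimal single-precision CUDA rendering of Algorithm 2, with x read through the texture of (3), is sketched below; this is our illustration under the stated assumptions, not the authors' exact code, and floatTexRef is assumed to be bound to x as in the earlier host sketch:

// Kernel 1 (Algorithm 2): v[i] = data[i] * x[indices[i]] for all N nonzeros,
// processed with a grid-stride loop so that any grid size covers the whole array.
texture<float, 1, cudaReadModeElementType> floatTexRef;   // bound to x on the host

__global__ void pcsr_kernel1(const float *data, const int *indices, float *v, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int icr = blockDim.x * gridDim.x;                      // grid-stride increment
    while (tid < N) {
        v[tid] = data[tid] * tex1Dfetch(floatTexRef, indices[tid]);
        tid += icr;
    }
}

Because consecutive threads read consecutive entries of data and indices, both CSR arrays are accessed in a fully coalesced manner; only the gathered reads of x are irregular, which is why they go through the texture cache.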

4. SpMV on Multiple GPUs

In this section, we present how to extend PCSR from a single GPU to multiple GPUs. Note that only the case of multiple GPUs in a single node (a single PC) is discussed, because of its good extensibility (e.g., it can also be used in a multi-CPU and multi-GPU heterogeneous platform). To balance the workload among multiple GPUs, the following two methods can be applied.

(1) For the first method, the matrix is equally partitioned into M (the number of GPUs) submatrices according to the matrix rows. Each submatrix is assigned to one GPU, and each GPU is only responsible for computing the multiplication of the assigned submatrix with the complete input vector.

(2) For the second method, the matrix is equally partitioned into M submatrices according to the number of nonzero elements. Each GPU only calculates the multiplication of a submatrix with the complete input vector.

In most cases, the two partitioning methods mentioned above behave similarly. However, in some exceptional cases, for example, when most nonzero elements of a matrix are concentrated in a few rows, the submatrices obtained by the first method differ markedly in their numbers of nonzero elements, while those obtained by the second method differ in their numbers of rows. Which method is the preferred one for PCSR?
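For concreteness, a host-side sketch of the second (nonzero-balanced) partitioning is given below; the helper name and its greedy cutting rule are our own illustration, since the paper does not list this code. It walks the CSR ptr array and cuts the rows so that each of the M GPUs receives roughly N/M nonzeros:

// Compute row ranges [row_start[g], row_start[g+1]) so that each of the M GPUs
// owns about N/M nonzeros; ptr has n+1 entries, ptr[n] == N, and row_start has M+1 slots.
void split_rows_by_nnz(const int *ptr, int n, int N, int M, int *row_start)
{
    row_start[0] = 0;
    int gpu = 1;
    for (int i = 1; i < n && gpu < M; i++) {
        // cut before row i once the accumulated nonzeros reach gpu * N / M
        if (ptr[i] >= (long long)gpu * N / M)
            row_start[gpu++] = i;
    }
    while (gpu < M)          // degenerate case: fewer cut points than GPUs
        row_start[gpu++] = n;
    row_start[M] = n;        // sentinel: one past the last row
}

The first method corresponds to simply setting row_start[g] = g · n / M, which ignores how the nonzeros are distributed over the rows.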

If each GPU has the complete input vector, PCSR on multiple GPUs does not need to communicate between GPUs. In fact, SpMV is often applied inside iterative methods, where the sparse matrix is iteratively multiplied by the input and output vectors. Therefore, if each GPU only holds a part of the input vector before SpMV, communication between GPUs is required in order to execute PCSR. Here, PCSR implements the communication between GPUs using NVIDIA GPUDirect.

5. Experimental Results

5.1. Experimental Setup. In this section, we test the performance of PCSR. All test matrices come from the University of Florida Sparse Matrix Collection [25]. Their properties are summarized in Table 2.

All algorithms are executed on one machine, which is equipped with an Intel Xeon Quad-Core CPU and four NVIDIA Tesla C2050 GPUs. Our source codes are compiled and executed using the CUDA Toolkit 6.5 under GNU/Linux Ubuntu v10.04.1. The performance is measured in terms of GFlops (GFlop/s) or memory bandwidth (GByte/s).

5.2. Single GPU. We compare PCSR with CSR-scalar, CSR-vector, CSRMV, HYBMV, and CSR-Adaptive.


Input: v, ptr
CUDA-specific variables:
  (i) threadIdx.x: a thread
  (ii) blockIdx.x: a block
  (iii) blockDim.x: number of threads per block
  (iv) gridDim.x: number of blocks per grid
Output: y
(01) define shared memory v_s with size sizeSharedMemory
(02) define shared memory ptr_s with size (threadsPerBlock + 1)
(03) gid ← threadIdx.x + blockIdx.x · blockDim.x
(04) tid ← threadIdx.x
     /* Load ptr into the shared memory ptr_s */
(05) ptr_s[tid] ← ptr[gid]
(06) if tid == 0 then ptr_s[threadsPerBlock] ← ptr[gid + threadsPerBlock]
(07) __syncthreads()
(08) temp ← (ptr_s[threadsPerBlock] − ptr_s[0])/threadsPerBlock + 1
(09) nlen ← min(temp · threadsPerBlock, sizeSharedMemory)
(10) sum ← 0.0; maxlen ← ptr_s[threadsPerBlock]
(11) for i ← ptr_s[0] to maxlen − 1 with i += nlen do
(12)   index ← i + tid
(13)   __syncthreads()
       /* Load v into the shared memory v_s */
(14)   for j ← 0 to nlen/threadsPerBlock − 1 do
(15)     if index < nlen then
(16)       v_s[tid + j · threadsPerBlock] ← v[index]
(17)       index += threadsPerBlock
(18)     end
(19)   done
(20)   __syncthreads()
       /* Perform a scalar-style reduction */
(21)   if (ptr_s[tid + 1] ⩽ i or ptr_s[tid] > i + nlen − 1) is false then
(22)     row_s ← max(ptr_s[tid] − i, 0)
(23)     row_e ← min(ptr_s[tid + 1] − i, nlen)
(24)     for j ← row_s to row_e − 1 do
(25)       sum += v_s[j]
(26)     done
(27)   end
(28) done
(29) y[gid] ← sum

Algorithm 3: Kernel 2.

[Figure 2: Second stage of Kernel 2. The threads of block i load v[PT + 0], ..., v[PT + RS − 1] from global memory into the shared-memory array v_s[0], ..., v_s[RS − 1] in chunks of TB consecutive elements, where RS = ptr_s[TB] − ptr_s[0], m = ⌊RS/TB⌋, RT = RS − m·TB, and PT = ptr_s[0].]


[Figure 3: Third stage of Kernel 2. Thread j of each block computes the sum of v_s[i] for ptr_s[j] ≤ i < ptr_s[j+1], that is, the partial result of its assigned row, from the shared-memory array v_s.]

Table 2: Properties of test matrices.

Name              Rows      Nonzeros (nz)  nz/row  Description
epb2              25228     175027         6.94    Thermal problem
ecl32             51993     380415         7.32    Semiconductor device
bayer01           57735     277774         4.81    Chemical process
g7jac200sc        59310     837936         14.13   Economic problem
finan512          74752     335872         4.49    Economic problem
2cubes_sphere     101492    1647264        16.23   Electromagnetics
torso2            115967    1033473        8.91    2D/3D problem
FEM_3D_thermal2   147900    3489300        23.59   Nonlinear thermal
scircuit          170998    958936         5.61    Circuit simulation
cont-300          180895    988195         5.46    Optimization problem
Ga41As41H72       268096    18488476       68.96   Pseudopotential method
F1                343791    26837113       78.06   Stiffness matrix
rajat24           358172    1948235        5.44    Circuit simulation
language          399130    1216334        3.05    Directed graph
af_shell9         504855    17588845       34.84   Sheet metal forming
ASIC_680ks        682712    2329176        3.41    Circuit simulation
ecology2          999999    4995991        5.00    Circuit theory
Hamrle3           1447360   5514242        3.81    Circuit simulation
thermal2          1228045   8580313        6.99    Unstructured FEM
cage14            1505785   27130349       18.01   DNA electrophoresis
Transport         1602111   23500731       14.67   Structural problem
G3_circuit        1585478   7660826        4.83    Circuit simulation
kkt_power         2063494   12771361       6.19    Optimization problem
CurlCurl_4        2380515   26515867       11.14   Model reduction
memchip           2707524   14810202       5.47    Circuit simulation
Freescale1        3428755   18920347       5.52    Circuit simulation

CSR-scalar and CSR-vector in the CUSP library [5] are chosen in order to show the effects of accessing CSR arrays in a fully coalesced manner in PCSR. CSRMV in the CUSPARSE library [4] is a representative of CSR-based SpMV algorithms on the GPU. HYBMV in the CUSPARSE library [4] is a finely tuned HYB-based SpMV algorithm on the GPU and usually has a better behavior than many existing SpMV algorithms. CSR-Adaptive is a most recently proposed CSR-based algorithm [9].

We select 15 sparse matrices with distinct sizes, ranging from 25228 to 2063494 rows, as our test matrices. Figure 4 shows the single-precision and double-precision performance results, in terms of GFlops, of CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, and PCSR on a Tesla C2050. GFlops values in Figure 4 are calculated on the basis of the assumption of two Flops per nonzero entry of a matrix [3, 13]. In Figure 5, the measured memory bandwidth results for single precision and double precision are reported.
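Written out explicitly (our formulation of the metric just described), the reported performance for a matrix with nz nonzeros and a measured SpMV time of T seconds is

GFlops = (2 · nz) / (10^9 · T).

If, for illustration, the 3.8257 ms single-GPU double-precision time of memchip in Table 3 is taken as T, this gives 2 × 14810202 / (10^9 × 3.8257 × 10^-3) ≈ 7.7 GFlops.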

5.2.1. Single Precision. From Figure 4(a), we observe that PCSR achieves high performance for all the matrices in the single-precision mode. In most cases, a performance of over 9 GFlops can be obtained. Moreover, PCSR outperforms CSR-scalar, CSR-vector, and CSRMV for all test cases, and average speedups of 4.24x, 2.18x, and 1.62x compared to CSR-scalar, CSR-vector, and CSRMV can be obtained, respectively. Furthermore, PCSR has a slightly better behavior than HYBMV for all the matrices except for af_shell9 and cont-300.


[Figure 4: Performance of all algorithms (CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, and PCSR) on a Tesla C2050 for the 15 test matrices. (a) Single-precision performance (GFlops). (b) Double-precision performance (GFlops).]

[Figure 5: Effective bandwidth results of all algorithms on a Tesla C2050 for the 15 test matrices. (a) Single-precision bandwidth (GBytes/s). (b) Double-precision bandwidth (GBytes/s).]

The average speedup over HYBMV is 1.22x. Figure 6 shows the visualization of af_shell9 and cont-300. We can find that af_shell9 and cont-300 have a similar structure and that each row of the two matrices has a very similar number of nonzero elements, which is suitable for storage in the ELL section of the HYB format. Particularly, PCSR and CSR-Adaptive have close performance; the average performance of PCSR is nearly 1.05 times that of CSR-Adaptive.

Furthermore, PCSR has almost the best memory bandwidth utilization among all algorithms for all the matrices except for af_shell9 and cont-300 (Figure 5(a)). The maximum memory bandwidth of PCSR exceeds 128 GBytes/s, which is about 90 percent of the peak theoretical memory bandwidth of the Tesla C2050. Based on the performance metrics [26], we can conclude that PCSR achieves good performance and has high parallelism.

5.2.2. Double Precision. From Figures 4(b) and 5(b), we see that, for all algorithms, both the double-precision performance and the memory bandwidth utilization are smaller than the corresponding single-precision values due to the slow software-based operation. PCSR is still better than CSR-scalar, CSR-vector, and CSRMV and slightly outperforms HYBMV and CSR-Adaptive for all the matrices. The average speedup of PCSR is 3.33x compared to CSR-scalar, 1.98x compared to CSR-vector, 1.57x compared to CSRMV, 1.15x compared to HYBMV, and 1.03x compared to CSR-Adaptive. The maximum memory bandwidth of PCSR exceeds 108 GBytes/s, which is about 75 percent of the peak theoretical memory bandwidth of the Tesla C2050.

5.3. Multiple GPUs

5.3.1. PCSR Performance without Communication. Here we take the double-precision mode as an example to test the PCSR performance on multiple GPUs without considering communication.


[Figure 6: Visualization of the cont-300 (a) and af_shell9 (b) matrices.]

Table 3: Comparison of PCSRI and PCSRII without communication on two GPUs.

                           PCSRI (2 GPUs)               PCSRII (2 GPUs)
Matrix          ET (GPU)   ET       SD        PE (%)    ET       SD        PE (%)
2cubes_sphere   0.4444     0.2670   0.0178    83.21     0.2640   0.0156    84.17
scircuit        0.3484     0.2413   0.0322    72.20     0.2250   0.0207    77.41
Ga41As41H72     4.2387     2.3084   0.0446    91.81     2.3018   0.0432    92.07
F1              6.5544     3.8865   0.7012    84.32     3.5710   0.2484    91.77
ASIC_680ks      0.8196     0.4567   0.0126    89.72     0.4566   0.0021    89.74
ecology2        1.2321     0.6665   0.0140    92.42     0.6654   0.0152    92.58
Hamrle3         1.7684     0.9651   0.0478    91.61     0.9208   5.00E-05  96.02
thermal2        2.0708     1.0559   0.0056    98.06     1.0558   0.0045    98.06
cage14          5.9177     3.4757   0.5417    85.13     3.1548   0.0458    93.78
Transport       4.7305     2.4665   0.0391    95.89     2.4655   0.0407    95.93
G3_circuit      1.9731     1.0485   0.0364    94.08     1.1061   0.1148    89.18
kkt_power       4.3465     2.7916   0.7454    77.85     2.2252   0.0439    97.66
CurlCurl_4      5.1605     2.7107   0.0347    95.18     2.7075   0.0244    95.30
memchip         3.8257     2.1905   0.3393    87.32     2.0975   0.2175    91.19
Freescale1      5.0524     3.0235   0.5719    83.55     2.8175   0.2811    89.66

We call PCSR with the first partitioning method and PCSR with the second method PCSRI and PCSRII, respectively. Some large-sized test matrices from Table 2 are used. The execution time comparison of PCSRI and PCSRII on two and four Tesla C2050 GPUs is listed in Tables 3 and 4, respectively. In Tables 3 and 4, ET, SD, and PE stand for the execution time, standard deviation, and parallel efficiency, respectively. The time unit is millisecond (ms). Figures 7 and 8 show the parallel efficiency of PCSRI and PCSRII on two and four GPUs, respectively.
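The parallel efficiency values reported in Tables 3-6 are consistent with the usual definition (here T_1 is the single-GPU time in the ET (GPU) column and T_M the time measured on M GPUs):

PE = T_1 / (M · T_M) × 100%.

For example, for 2cubes_sphere on two GPUs, 0.4444 / (2 × 0.2670) ≈ 83.2%, matching the tabulated 83.21%.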

On two GPUs, we observe from Table 3 and Figure 7 that PCSRII has better parallel efficiency than PCSRI for all the matrices except for G3_circuit. The maximum, average, and minimum parallel efficiency of PCSRII are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiency of PCSRI: 98.06%, 88.16%, and 72.20%. Moreover, PCSRII has a smaller standard deviation than PCSRI for all the matrices except for ecology2, Transport, and G3_circuit. This implies that the workload balance on two GPUs for the second method is advantageous over that for the first method.

On four GPUs, for the parallel efficiency and standard deviation, PCSRII outperforms PCSRI for all the matrices except for G3_circuit (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiency of PCSRII for all the matrices are 96.35%, 85.14%, and 64.17%, and they are advantageous over the corresponding maximum, average, and minimum parallel efficiency of PCSRI: 96.21%, 78.89%, and 59.94%.


Table 4: Comparison of PCSRI and PCSRII without communication on four GPUs.

                           PCSRI (4 GPUs)               PCSRII (4 GPUs)
Matrix          ET (GPU)   ET       SD        PE (%)    ET       SD        PE (%)
2cubes_sphere   0.4444     0.1560   0.0132    71.23     0.1527   0.0111    72.78
scircuit        0.3484     0.1453   0.0262    59.94     0.1357   0.0130    64.17
Ga41As41H72     4.2387     1.6123   0.7268    65.72     1.3410   0.1846    79.02
F1              6.5544     2.5240   0.6827    64.92     1.9121   0.1900    85.69
ASIC_680ks      0.8196     0.2944   0.0298    69.59     0.2887   0.0264    70.98
ecology2        1.2321     0.3593   0.0160    85.72     0.3554   0.0141    86.67
Hamrle3         1.7684     0.5114   0.0307    86.45     0.4775   0.0125    92.59
thermal2        2.0708     0.5553   0.0271    93.22     0.5546   0.0255    93.33
cage14          5.9177     1.8126   0.3334    81.62     1.5386   0.0188    96.15
Transport       4.7305     1.2292   0.0270    96.21     1.2275   0.0158    96.35
G3_circuit      1.9731     0.5804   0.0489    84.99     0.6195   0.0790    79.63
kkt_power       4.3465     1.4974   0.5147    72.57     1.1584   0.0418    93.80
CurlCurl_4      5.1605     1.3554   0.0153    95.18     1.3501   0.0111    95.56
memchip         3.8257     1.1439   0.1741    83.61     1.1175   0.1223    85.59
Freescale1      5.0524     1.7588   0.4039    71.81     1.4806   0.1843    85.31

[Figure 7: Parallel efficiency of PCSRI and PCSRII without communication on two GPUs.]

Particularly, for Ga41As41H72, F1, cage14, kkt_power, and Freescale1, the parallel efficiency of PCSRII is almost 1.2 times that obtained by PCSRI.

On the basis of the above observations, we conclude that PCSRII has high performance and is on the whole better than PCSRI. For PCSR on multiple GPUs, the second method is our preferred one.

5.3.2. PCSR Performance with Communication. We still take the double-precision mode as an example to test the PCSR performance on multiple GPUs with communication considered. PCSR with the first method and PCSR with the second method are still called PCSRI and PCSRII, respectively. The same test matrices as in the above experiment are utilized. The execution time comparison of PCSRI and PCSRII on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively. The time unit is ms. ET, SD, and PE in Tables 5 and 6 have the same meanings as in Tables 3 and 4.

[Figure 8: Parallel efficiency of PCSRI and PCSRII without communication on four GPUs.]

Figures 9 and 10 show the parallel efficiency of PCSRI and PCSRII on two and four GPUs, respectively.

On two GPUs, PCSRI and PCSRII have almost the same parallel efficiency for most matrices (Figure 9 and Table 5). As a comparison, PCSRII slightly outperforms PCSRI. The maximum, average, and minimum parallel efficiency of PCSRII for all the matrices are 96.34%, 88.51%, and 80.44%, and they are advantageous over the corresponding maximum, average, and minimum parallel efficiency of PCSRI: 96.05%, 86.03%, and 73.57%.

On four GPUs, for the parallel efficiency and standard deviation, PCSRII is better than PCSRI for all the matrices, except that PCSRI has a slightly better parallel efficiency for thermal2, G3_circuit, and Hamrle3 and a slightly smaller standard deviation for thermal2, G3_circuit, ecology2, and CurlCurl_4 (Figure 10 and Table 6).


Table 5: Comparison of PCSRI and PCSRII with communication on two GPUs.

                           PCSRI (2 GPUs)               PCSRII (2 GPUs)
Matrix          ET (GPU)   ET       SD        PE (%)    ET       SD        PE (%)
2cubes_sphere   0.4444     0.2494   6.00E-04  89.09     0.2503   5.00E-04  88.75
scircuit        0.3484     0.2234   0.0154    77.95     0.2165   0.0070    80.44
Ga41As41H72     4.2387     2.3516   0.0030    90.12     2.3795   0.0521    89.07
F1              6.5544     3.9252   0.6948    83.49     3.6076   0.2392    90.84
ASIC_680ks      0.8196     0.4890   0.0113    83.80     0.4998   0.0178    81.99
ecology2        1.2321     0.6865   3.00E-04  89.74     0.6863   8.00E-04  89.76
Hamrle3         1.7684     1.0221   0.0209    86.50     1.0066   0.0170    87.84
thermal2        2.0708     1.1403   0.0230    90.80     1.1402   0.0203    90.81
cage14          5.9177     3.5756   0.5644    82.75     3.2244   0.0196    91.76
Transport       4.7305     2.4623   0.0203    96.05     2.4550   0.0183    96.34
G3_circuit      1.9731     1.1215   0.0189    87.96     1.1766   0.0896    83.84
kkt_power       4.3465     2.9539   0.6973    73.57     2.4459   0.0356    88.85
CurlCurl_4      5.1605     2.7064   0.0092    95.34     2.7049   1.00E-03  95.39
memchip         3.8257     2.3218   0.3467    82.39     2.2243   0.1973    85.99
Freescale1      5.0524     3.1216   0.5868    80.92     2.9367   0.3199    86.02

Table 6: Comparison of PCSRI and PCSRII with communication on four GPUs.

                           PCSRI (4 GPUs)               PCSRII (4 GPUs)
Matrix          ET (GPU)   ET       SD        PE (%)    ET       SD        PE (%)
2cubes_sphere   0.4444     0.1567   0.0052    70.89     0.1531   0.0028    72.54
scircuit        0.3484     0.1544   0.0204    56.39     0.1495   0.0073    58.27
Ga41As41H72     4.2387     1.7157   0.7909    61.76     1.4154   0.2178    74.87
F1              6.5544     2.1149   0.3833    77.48     2.0022   0.1941    81.84
ASIC_680ks      0.8196     0.3449   0.0187    59.39     0.3423   0.0147    59.87
ecology2        1.2321     0.4257   0.0048    72.35     0.4257   0.0056    72.35
Hamrle3         1.7684     0.6231   0.0087    70.95     0.6297   0.0085    70.21
thermal2        2.0708     0.6922   0.0267    74.78     0.6959   0.0269    74.39
cage14          5.9177     1.9339   0.3442    76.50     1.6417   0.0067    90.12
Transport       4.7305     1.3323   0.0279    88.77     1.3217   0.0070    89.48
G3_circuit      1.9731     0.7234   0.0408    68.19     0.7458   0.0620    66.14
kkt_power       4.3465     1.7277   0.5495    62.89     1.3791   0.0305    78.79
CurlCurl_4      5.1605     1.5065   0.0253    85.63     1.5004   0.8789    85.99
memchip         3.8257     1.3804   0.1768    69.29     1.3051   0.1029    73.28
Freescale1      5.0524     2.0711   0.4342    60.98     1.8193   0.2262    69.43

The maximum, average, and minimum parallel efficiency of PCSRII for all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSRI: 88.77%, 65.69%, and 56.39%.

Therefore, compared to PCSRI and PCSRII without communication, although the performance of PCSRI and PCSRII with communication decreases due to the influence of communication, they still achieve significant performance. Because PCSRII overall outperforms PCSRI for all test matrices, the second method in this case is still our preferred one for PCSR on multiple GPUs.

6. Conclusion

In this study, we propose a novel CSR-based SpMV on GPUs (PCSR). Experimental results show that our proposed PCSR on a GPU is better than CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and a most recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance on multiple GPUs for PCSR, we present two matrix-partitioning methods to balance the workload among multiple GPUs. We observe that PCSR shows good performance with either matrix-partitioning method, both with and without considering communication.


[Figure 9: Parallel efficiency of PCSRI and PCSRII with communication on two GPUs.]

[Figure 10: Parallel efficiency of PCSRI and PCSRII with communication on four GPUs.]

As a comparison, the second method is our preferred one.

Next, we will continue our research in this area and develop other novel SpMVs on GPUs. In particular, future work will apply PCSR to some well-known iterative methods and thus solve scientific and engineering problems.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.

References

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14-19, Portland, Ore, USA, November 2009.
[4] NVIDIA, CUSPARSE Library 6.5, 2015, https://developer.nvidia.com/cusparse.
[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172-1181, 2012.
[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, 2010.
[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162-1165, 2011.
[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769-780, New Orleans, La, USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930-1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1-12, Denver, Colo, USA, November 2013.
[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552-575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115-126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244-252, 2014.
[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815-826, 2011.
[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408-420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010, Proceedings, pp. 111-125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401-C423, 2014.
[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing: Systems & Applications, vol. 39, no. 11, pp. 737-750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "yaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107-118, February 2014.
[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121-127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40-53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105-118, 2011.
[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1-25, 2011.
[26] NVIDIA, "CUDA C Programming Guide 6.5," 2015, http://docs.nvidia.com/cuda/cuda-c-programming-guide.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 2: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

2 Mathematical Problems in Engineering

and maps well to GPUs [9] Their implementation efficientlyaccesses DRAM by streaming data into the local scratchpadmemory and dynamically assigns different numbers of rowsto each parallel GPU compute unit In addition numerousworks have proposed for GPUs using the variants of the CSRstorage format such as the compressed sparse eXtended [10]bit-representation-optimized compression [11] block CSR[12 13] and row-grouped CSR [14]

Besides using the variants of CSR many highly efficientSpMVs onGPUs have been proposed by utilizing the variantsof the ELL and COO storage formats such as the ELLPACK-R [15] ELLR-T [16] sliced ELL [13 17] SELL-C-120590 [18] slicedCOO [19] and blocked compressed COO [20] Specializedstorage formats provide definitive advantages However asmany programs use CSR the conversion from CSR to otherstorage formats will present a large engineering hurdle andcan incur large runtime overheads and require extra storagespace Moreover CSR-based algorithms generally have alower memory usage than those that are based on otherstorage formats such as ELL DIA and HYB

All the above observations motivate us to further inves-tigate how to construct efficient SpMVs on GPUs whilekeeping CSR intact In this study we propose a perfect CSRalgorithm called PCSR on GPUs PCSR is composed oftwo kernels and accesses CSR arrays in a fully coalescedmanner Experimental results on C2050 GPUs show thatPCSR outperforms CSR-scalar and CSR-vector and has abetter behavior compared to CSRMV and HYBMV in thevendor-tuned CUSPARSE library [4] and a most recentlyproposed CSR-based algorithm CSR-Adaptive

The main contributions in this paper are summarized asfollows

(i) A novel SpMV implementation on a GPU whichkeeps CSR intact is proposed The proposed algo-rithm consists of two kernels and alleviates the defi-ciencies of many existing CSR algorithms that accessCSR arrays in a rare or partial coalesced manner

(ii) Our proposed SpMV algorithm on aGPU is extendedtomultiple GPUs Moreover we suggest twomethodsto balance the workload among multiple GPUs

The rest of this paper is organized as follows Followingthis introduction the matrix storage CUDA architectureand SpMV are described in Section 2 In Section 3 a newSpMV implementation on a GPU is proposed Section 4discusses how to extend the proposed SpMV algorithm ona GPU to multiple GPUs Experimental results are presentedin Section 5 Section 6 contains our conclusions and points toour future research directions

2 Related Techniques

21 Matrix Storage To take advantage of the large number ofzeros in sparse matrices special storage formats are requiredIn this study the compressed sparse row (CSR) format is onlyconsidered although there aremany varieties of sparsematrixstorage formats such as the ELLPACK (or ITPACK) [21]COO [22] DIA [1] and HYB [3] Using CSR an 119899 times 119899 sparse

matrix 119860 with119873 nonzero elements is stored via three arrays(1) the array 119889119886119905119886 contains all the nonzero entries of 119860 (2)the array 119894119899119889119894119888119890119904 contains column indices of nonzero entriesthat are stored in 119889119886119905119886 and (3) entries of the array 119901119905119903 pointto the first entry of subsequence rows of 119860 in the arrays 119889119886119905119886and 119894119899119889119894119888119890119904

For example the following matrix

119860 =

[[[[[[[[[[[

[

4 1 0 1 0 0

1 4 1 0 1 0

0 1 4 0 0 1

1 0 0 4 1 0

0 1 0 1 4 1

0 0 1 0 1 4

]]]]]]]]]]]

]

(1)

is stored in the CSR format by

119889119886119905119886

[4 1 1 1 4 1 1 1 4 1 1 4 1 1 1 4 1 1 1 4]

119894119899119889119894119888119890119904

[0 1 3 0 1 2 4 1 2 5 0 3 4 1 3 4 5 2 4 5]

119901119905119903 [0 3 7 10 13 17 20]

(2)

22 CUDA Architecture The compute unified device archi-tecture (CUDA) is a heterogenous computing model thatinvolves both the CPU and theGPU [23] Executing a parallelprogram on the GPU using CUDA involves the following(1) transferring required data to the GPU global memory(2) launching the GPU kernel and (3) transferring resultsback to the host memoryThe threads of a kernel are groupedinto a grid of thread blocks The GPU schedules blocks overthe multiprocessors according to their available executioncapacity When a block is given to a multiprocessor it issplit in warps that are composed of 32 threads In the bestcase all 32 threads have the same execution path and theinstruction is executed concurrently If not the executionpaths are executed sequentially which greatly reduces theefficiency The threads in a block communicate via thefast shared memory but the threads in different blockscommunicate through high-latency global memory Majorchallenges in optimizing an application on GPUs are globalmemory access latency different execution paths in eachwarp communication and synchronization between threadsin different blocks and resource utilization

23 Sparse Matrix-Vector Multiplication Assume that119860 is an119899times119899 sparse matrix and 119909 is a vector of size 119899 and a sequentialversion of CSR-based SpMV is described in Algorithm 1Obviously the order in which elements of 119889119886119905119886 119894119899119889119894119888119890119904 119901119905119903and 119909 are accessed has an important impact on the SpMVperformance on GPUs where memory access patterns arecrucial

Mathematical Problems in Engineering 3

Input: data, indices, ptr, x, n
Output: y
(01) for i ← 0 to n − 1 do
(02)   row_start ← ptr[i]
(03)   row_end ← ptr[i + 1]
(04)   sum ← 0
(05)   for j ← row_start to row_end − 1 do
(06)     sum += data[j] · x[indices[j]]
(07)   done
(08)   y[i] ← sum
(09) done

Algorithm 1: Sequential SpMV.
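For reference, a minimal C rendering of Algorithm 1 is given below; the function name csr_spmv_serial is ours, not from the paper.

void csr_spmv_serial(const double *data, const int *indices,
                     const int *ptr, const double *x, double *y, int n)
{
    for (int i = 0; i < n; ++i) {                /* one output row at a time */
        double sum = 0.0;
        for (int j = ptr[i]; j < ptr[i + 1]; ++j)
            sum += data[j] * x[indices[j]];      /* gather from x via indices */
        y[i] = sum;
    }
}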

3. SpMV on a GPU

In this section, we present a perfect implementation of CSR-based SpMV on the GPU. Different from other related work, the proposed algorithm involves the following two kernels:

(i) Kernel 1: calculate the array v = [v_1, v_2, ..., v_N], where v_i = data[i] · x[indices[i]], i = 1, 2, ..., N, and then save it to global memory.

(ii) Kernel 2: accumulate the element values of v according to the formula

$$y_j = \sum_{ptr[j] \le i < ptr[j+1]} v_i, \qquad j = 0, 1, \ldots, n-1,$$

and store the results in the array y in global memory.

We call the proposed SpMV algorithm PCSR. For simplicity, the symbols used in this study are listed in Table 1.

3.1. Kernel 1. The detailed procedure of Kernel 1 is shown in Algorithm 2. We observe that the accesses to the two arrays data and indices in global memory are fully coalesced. However, the vector x in global memory is accessed randomly, which decreases the performance of Kernel 1. On the basis of the evaluations in [24], texture memory is the best memory space in which to place randomly accessed data. Therefore, texture memory is utilized here to hold the vector instead of global memory. For the single-precision floating-point texture, the fourth step in Algorithm 2 is rewritten as

v[tid] ← data[tid] · tex1Dfetch(floatTexRef, indices[tid]).    (3)

Because textures do not support double values, the following function fetch_double() is suggested to convert an int2 value into a double value:

(01) __device__ double fetch_double(texture<int2, 1> t, int i) {
(02)   int2 v = tex1Dfetch(t, i);
(03)   return __hiloint2double(v.y, v.x);
(04) }

Furthermore, for the double-precision floating-point texture, based on the function fetch_double(), we rewrite the fourth step in Algorithm 2 as

v[tid] ← data[tid] · fetch_double(doubleTexRef, indices[tid]).    (4)

3.2. Kernel 2. Kernel 2 accumulates the element values of v produced by Kernel 1; its detailed procedure is shown in Algorithm 3. This kernel is mainly composed of the following three stages:

(i) In the first stage, the array ptr in global memory is piecewise assembled into the shared memory array ptr_s of each thread block in parallel. Each thread of a thread block is responsible for loading one element of ptr into ptr_s, except thread 0, which additionally loads the closing entry ptr_s[TB] (see lines (05)-(06) in Algorithm 3). The detailed procedure is illustrated in Figure 1. We can see that the accesses to ptr are aligned.

(ii) The second stage loads the element values of v in global memory, from position ptr_s[0] to position ptr_s[TB], into the shared memory array v_s of each thread block. The assembling procedure is illustrated in Figure 2. In this case, the access to v is fully coalesced.

(iii) The third stage accumulates the element values of v_s, as shown in Figure 3. The accumulation is highly efficient due to the utilization of the two shared memory arrays ptr_s and v_s.

Obviously, Kernel 2 benefits from shared memory: not only are the data accessed quickly, but the accesses to the data are also coalesced.
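As an illustration of the first stage only, the cooperative assembly of ptr into shared memory can be sketched in CUDA as follows. This is our sketch, not the authors' code: the array name ptr_s mirrors Algorithm 3, the shared memory size is assumed to be blockDim.x + 1 integers, and bounds checks for the last block are omitted.

__global__ void kernel2_stage1_sketch(const int *ptr)
{
    extern __shared__ int ptr_s[];                    /* blockDim.x + 1 entries */
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    ptr_s[tid] = ptr[gid];                            /* one aligned load per thread */
    if (tid == 0)
        ptr_s[blockDim.x] = ptr[gid + blockDim.x];    /* thread 0 loads the closing entry */
    __syncthreads();
    /* stages 2 and 3 (loading v into v_s and the per-row reduction) follow here */
}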

From the above procedures, we observe that PCSR needs additional global memory space to store the middle array v besides the CSR arrays data, indices, and ptr. Saving data into v in Kernel 1 and loading data from v in Kernel 2 decrease the performance of PCSR to a degree. However, PCSR benefits from the middle array v, because introducing v allows the CSR arrays data, indices, and ptr to be accessed in a fully coalesced manner. This greatly improves the speed of accessing the CSR arrays and alleviates the principal deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing).


Figure 1: First stage of Kernel 2 (thread tid of block i loads ptr[i * TB + tid] from global memory into ptr_s[tid] in shared memory; thread 0 also loads ptr[i * TB + TB] into ptr_s[TB]).

Table 1: Symbols used in this study.

Symbol                   Description
A                        Sparse matrix
x                        Input vector
y                        Output vector
n                        Size of the input and output vectors
N                        Number of nonzero elements in A
threadsPerBlock (TB)     Number of threads per block
blocksPerGrid (BG)       Number of blocks per grid
elementsPerThread        Number of elements calculated by each thread
sizeSharedMemory         Size of shared memory
M                        Number of GPUs

Input: data, indices, x, N
CUDA-specific variables:
(i) threadIdx.x: thread index within a block
(ii) blockIdx.x: block index within the grid
(iii) blockDim.x: number of threads per block
(iv) gridDim.x: number of blocks per grid
Output: v
(01) tid ← threadIdx.x + blockIdx.x · blockDim.x
(02) icr ← blockDim.x · gridDim.x
(03) while tid < N
(04)   v[tid] ← data[tid] · x[indices[tid]]
(05)   tid += icr
(06) end while

Algorithm 2: Kernel 1.
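A direct CUDA rendering of Algorithm 2 is sketched below. This is our version using plain global-memory loads of x; in the paper, that load is replaced by a texture fetch as in (3) and (4), and error checking is omitted.

__global__ void pcsr_kernel1(const double *data, const int *indices,
                             const double *x, double *v, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int icr = blockDim.x * gridDim.x;             /* grid-stride increment */
    while (tid < N) {
        v[tid] = data[tid] * x[indices[tid]];     /* data and indices reads are coalesced */
        tid += icr;
    }
}

/* Example launch (the grid and block sizes here are illustrative):
   pcsr_kernel1<<<blocksPerGrid, threadsPerBlock>>>(d_data, d_indices, d_x, d_v, N); */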

4. SpMV on Multiple GPUs

In this section, we present how to extend PCSR from a single GPU to multiple GPUs. Note that only the case of multiple GPUs in a single node (a single PC) is discussed, because of its good extensibility (e.g., it can also be used on multi-CPU and multi-GPU heterogeneous platforms). To balance the workload among multiple GPUs, the following two methods can be applied:

(1) In the first method, the matrix is equally partitioned into M (the number of GPUs) submatrices according to the matrix rows. Each submatrix is assigned to one GPU, and each GPU is only responsible for computing the product of its assigned submatrix with the complete input vector.

(2) In the second method, the matrix is equally partitioned into M submatrices according to the number of nonzero elements. Each GPU again only computes the product of a submatrix with the complete input vector.

In most cases, the two partitioning methods mentioned above behave similarly. However, in some exceptional cases, for example, when most nonzero elements are concentrated in a few rows of the matrix, the submatrices obtained by the first method have distinctly different numbers of nonzero elements, and those obtained by the second method have different numbers of rows. Which method is the preferred one for PCSR?
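As an illustration of the second method, the row ranges assigned to the M GPUs can be chosen so that each range holds roughly N/M nonzeros. The host-side function below is our sketch (the greedy splitting rule and the name partition_by_nnz are not taken from the paper).

/* Split rows 0..n-1 into M contiguous ranges with roughly equal nonzero counts.
   row_split has M+1 entries; GPU g is assigned rows [row_split[g], row_split[g+1]). */
void partition_by_nnz(const int *ptr, int n, int M, int *row_split)
{
    int N = ptr[n];                         /* total number of nonzeros */
    row_split[0] = 0;
    int g = 1;
    for (int i = 1; i <= n && g < M; ++i) {
        /* place the next split once the g-th share of nonzeros has been reached */
        if (ptr[i] >= (long long)g * N / M) {
            row_split[g] = i;
            ++g;
        }
    }
    while (g < M) row_split[g++] = n;       /* degenerate case: fewer rows than GPUs */
    row_split[M] = n;
}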

If each GPU holds the complete input vector, PCSR on multiple GPUs does not need to communicate between GPUs. In fact, SpMV is often applied within iterative methods, where the sparse matrix is iteratively multiplied and the output vector of one iteration becomes the input vector of the next. Therefore, if each GPU only holds a part of the input vector before an SpMV, communication between GPUs is required in order to execute PCSR. Here, PCSR implements the communication between GPUs using NVIDIA GPUDirect.
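The paper does not list the communication code. As a hedged sketch of what a peer-to-peer exchange of the output-vector pieces could look like for two GPUs (device IDs, pointer names, and the row-offset bookkeeping below are ours; error checking is omitted), the CUDA runtime offers peer access and peer copies:

#include <cuda_runtime.h>

/* Exchange the locally computed rows of y between two GPUs after the local SpMV.
   d_y0/d_y1 point to each GPU's full-length y buffer; (offsetK, rowsK) describes
   the rows owned by GPU K. */
void exchange_y_two_gpus(double *d_y0, double *d_y1,
                         size_t offset0, size_t rows0,
                         size_t offset1, size_t rows1)
{
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);   /* enable P2P once per pair */
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    /* GPU 1's rows into GPU 0's buffer, and GPU 0's rows into GPU 1's buffer */
    cudaMemcpyPeer(d_y0 + offset1, 0, d_y1 + offset1, 1, rows1 * sizeof(double));
    cudaMemcpyPeer(d_y1 + offset0, 1, d_y0 + offset0, 0, rows0 * sizeof(double));
}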

5. Experimental Results

5.1. Experimental Setup. In this section, we test the performance of PCSR. All test matrices come from the University of Florida Sparse Matrix Collection [25]. Their properties are summarized in Table 2.

All algorithms are executed on one machine, which is equipped with an Intel Xeon Quad-Core CPU and four NVIDIA Tesla C2050 GPUs. Our source codes are compiled and executed using the CUDA toolkit 6.5 under GNU/Linux Ubuntu v10.04.1. The performance is measured in terms of GFlop/s or GByte/s.

5.2. Single GPU. We compare PCSR with CSR-scalar, CSR-vector, CSRMV, HYBMV, and CSR-Adaptive.


Input: v, ptr
CUDA-specific variables:
(i) threadIdx.x: thread index within a block
(ii) blockIdx.x: block index within the grid
(iii) blockDim.x: number of threads per block
(iv) gridDim.x: number of blocks per grid
Output: y
(01) define shared memory v_s with size sizeSharedMemory
(02) define shared memory ptr_s with size (threadsPerBlock + 1)
(03) gid ← threadIdx.x + blockIdx.x × blockDim.x
(04) tid ← threadIdx.x
/* Load ptr into the shared memory ptr_s */
(05) ptr_s[tid] ← ptr[gid]
(06) if tid == 0 then ptr_s[threadsPerBlock] ← ptr[gid + threadsPerBlock]
(07) __syncthreads()
(08) temp ← (ptr_s[threadsPerBlock] − ptr_s[0]) / threadsPerBlock + 1
(09) nlen ← min(temp · threadsPerBlock, sizeSharedMemory)
(10) sum ← 0.0, maxlen ← ptr_s[threadsPerBlock]
(11) for i ← ptr_s[0] to maxlen − 1 with i += nlen do
(12)   index ← i + tid
(13)   __syncthreads()
/* Load v into the shared memory v_s */
(14)   for j ← 0 to nlen/threadsPerBlock − 1 do
(15)     if index < nlen then
(16)       v_s[tid + j · threadsPerBlock] ← v[index]
(17)       index += threadsPerBlock
(18)     end
(19)   done
(20)   __syncthreads()
/* Perform a scalar-style reduction */
(21)   if (ptr_s[tid + 1] ≤ i or ptr_s[tid] > i + nlen − 1) is false then
(22)     row_s ← max(ptr_s[tid] − i, 0)
(23)     row_e ← min(ptr_s[tid + 1] − i, nlen)
(24)     for j ← row_s to row_e − 1 do
(25)       sum += v_s[j]
(26)     done
(27)   end
(28) done
(29) y[gid] ← sum

Algorithm 3: Kernel 2.

Figure 2: Second stage of Kernel 2 (each block loads v[PT], ..., v[PT + RS − 1] from global memory into the shared memory array v_s, where PT = ptr_s[0], RS = ptr_s[TB] − ptr_s[0], m = ⌊RS/TB⌋, and RT = RS − m · TB).


Figure 3: Third stage of Kernel 2 (thread j of each block accumulates v_s[i] over ptr_s[j] ≤ i < ptr_s[j + 1]).

Table 2: Properties of the test matrices.

Name               Rows       Nonzeros (nz)   nz/row   Description
epb2               25228      175027          6.94     Thermal problem
ecl32              51993      380415          7.32     Semiconductor device
bayer01            57735      277774          4.81     Chemical process
g7jac200sc         59310      837936          14.13    Economic problem
finan512           74752      335872          4.49     Economic problem
2cubes_sphere      101492     1647264         16.23    Electromagnetics
torso2             115967     1033473         8.91     2D/3D problem
FEM_3D_thermal2    147900     3489300         23.59    Nonlinear thermal
scircuit           170998     958936          5.61     Circuit simulation
cont-300           180895     988195          5.46     Optimization problem
Ga41As41H72        268096     18488476        68.96    Pseudopotential method
F1                 343791     26837113        78.06    Stiffness matrix
rajat24            358172     1948235         5.44     Circuit simulation
language           399130     1216334         3.05     Directed graph
af_shell9          504855     17588845        34.84    Sheet metal forming
ASIC_680ks         682712     2329176         3.41     Circuit simulation
ecology2           999999     4995991         5.00     Circuit theory
Hamrle3            1447360    5514242         3.81     Circuit simulation
thermal2           1228045    8580313         6.99     Unstructured FEM
cage14             1505785    27130349        18.01    DNA electrophoresis
Transport          1602111    23500731        14.67    Structural problem
G3_circuit         1585478    7660826         4.83     Circuit simulation
kkt_power          2063494    12771361        6.19     Optimization problem
CurlCurl_4         2380515    26515867        11.14    Model reduction
memchip            2707524    14810202        5.47     Circuit simulation
Freescale1         3428755    18920347        5.52     Circuit simulation

CSR-scalar and CSR-vector in the CUSP library [5] are chosen in order to show the effect of accessing the CSR arrays in a fully coalesced manner in PCSR. CSRMV in the CUSPARSE library [4] is a representative CSR-based SpMV algorithm on the GPU. HYBMV in the CUSPARSE library [4] is a finely tuned HYB-based SpMV algorithm on the GPU and usually behaves better than many existing SpMV algorithms. CSR-Adaptive is a most recently proposed CSR-based algorithm [9].

We select 15 sparse matrices with distinct sizes, ranging from 25228 to 2063494 rows, as our test matrices. Figure 4 shows the single-precision and double-precision performance results, in terms of GFlop/s, of CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, and PCSR on a Tesla C2050. The GFlop/s values in Figure 4 are calculated on the basis of the assumption of two Flops per nonzero entry of a matrix [3, 13]. In Figure 5, the measured memory bandwidth results for single precision and double precision are reported.
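Concretely, under the two-Flops-per-nonzero convention, the reported rate can be obtained as below (our sketch; the variable names are illustrative).

/* time_ms: measured SpMV execution time in milliseconds; nnz: number of nonzeros N */
double gflops(double time_ms, long long nnz)
{
    return 2.0 * (double)nnz / (time_ms * 1.0e6);   /* 2*N Flops / seconds, in units of 1e9 */
}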

5.2.1. Single Precision. From Figure 4(a), we observe that PCSR achieves high performance for all the matrices in the single-precision mode. In most cases, a performance of over 9 GFlop/s is obtained. Moreover, PCSR outperforms CSR-scalar, CSR-vector, and CSRMV for all test cases, with average speedups of 4.24x, 2.18x, and 1.62x over CSR-scalar, CSR-vector, and CSRMV, respectively. Furthermore, PCSR behaves slightly better than HYBMV for all the matrices except for af_shell9 and cont-300.


Figure 4: Performance (GFlop/s) of CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, and PCSR on a Tesla C2050 for 15 test matrices. (a) Single precision. (b) Double precision.

Figure 5: Effective memory bandwidth (GByte/s) of all algorithms on a Tesla C2050 for 15 test matrices. (a) Single precision. (b) Double precision.

The average speedup over HYBMV is 1.22x. Figure 6 shows visualizations of af_shell9 and cont-300. We can see that af_shell9 and cont-300 have a similar structure and that each row of the two matrices has a very similar number of nonzero elements, which makes them well suited to the ELL section of the HYB format. Notably, PCSR and CSR-Adaptive have close performance; the average performance of PCSR is nearly 1.05 times that of CSR-Adaptive.

Furthermore, PCSR has almost the best memory bandwidth utilization among all algorithms for all the matrices except for af_shell9 and cont-300 (Figure 5(a)). The maximum memory bandwidth of PCSR exceeds 128 GByte/s, which is about 90 percent of the peak theoretical memory bandwidth of the Tesla C2050. Based on the performance metrics [26], we can conclude that PCSR achieves good performance and has high parallelism.

5.2.2. Double Precision. From Figures 4(b) and 5(b), we see that, for all algorithms, both the double-precision performance and the memory bandwidth utilization are lower than the corresponding single-precision values, due to the slower double-precision operations. PCSR is still better than CSR-scalar, CSR-vector, and CSRMV and slightly outperforms HYBMV and CSR-Adaptive for all the matrices. The average speedup of PCSR is 3.33x compared to CSR-scalar, 1.98x compared to CSR-vector, 1.57x compared to CSRMV, 1.15x compared to HYBMV, and 1.03x compared to CSR-Adaptive. The maximum memory bandwidth of PCSR exceeds 108 GByte/s, which is about 75 percent of the peak theoretical memory bandwidth of the Tesla C2050.

5.3. Multiple GPUs

5.3.1. PCSR Performance without Communication. Here we take the double-precision mode as an example to test the PCSR performance on multiple GPUs without considering communication.


Figure 6: Visualization of the cont-300 and af_shell9 matrices. (a) cont-300. (b) af_shell9.

Table 3: Comparison of PCSR-I and PCSR-II without communication on two GPUs (times in ms).

Matrix            ET (1 GPU)   PCSR-I (2 GPUs)               PCSR-II (2 GPUs)
                               ET       SD        PE (%)     ET       SD        PE (%)
2cubes_sphere     0.4444       0.2670   0.0178    83.21      0.2640   0.0156    84.17
scircuit          0.3484       0.2413   0.0322    72.20      0.2250   0.0207    77.41
Ga41As41H72       4.2387       2.3084   0.0446    91.81      2.3018   0.0432    92.07
F1                6.5544       3.8865   0.7012    84.32      3.5710   0.2484    91.77
ASIC_680ks        0.8196       0.4567   0.0126    89.72      0.4566   0.0021    89.74
ecology2          1.2321       0.6665   0.0140    92.42      0.6654   0.0152    92.58
Hamrle3           1.7684       0.9651   0.0478    91.61      0.9208   5.00E-05  96.02
thermal2          2.0708       1.0559   0.0056    98.06      1.0558   0.0045    98.06
cage14            5.9177       3.4757   0.5417    85.13      3.1548   0.0458    93.78
Transport         4.7305       2.4665   0.0391    95.89      2.4655   0.0407    95.93
G3_circuit        1.9731       1.0485   0.0364    94.08      1.1061   0.1148    89.18
kkt_power         4.3465       2.7916   0.7454    77.85      2.2252   0.0439    97.66
CurlCurl_4        5.1605       2.7107   0.0347    95.18      2.7075   0.0244    95.30
memchip           3.8257       2.1905   0.3393    87.32      2.0975   0.2175    91.19
Freescale1        5.0524       3.0235   0.5719    83.55      2.8175   0.2811    89.66

We call PCSR with the first method PCSR-I and PCSR with the second method PCSR-II. Some large-sized test matrices from Table 2 are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 3 and 4, respectively. In Tables 3 and 4, ET, SD, and PE stand for the execution time, standard deviation, and parallel efficiency, respectively. The time unit is milliseconds (ms). Figures 7 and 8 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.
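For reference, the parallel efficiency reported below follows the usual definition (our notation): with $T_1$ the single-GPU execution time and $T_M$ the execution time on $M$ GPUs,

$$\mathrm{PE} = \frac{T_1}{M \cdot T_M} \times 100\%.$$

For example, for 2cubes_sphere on two GPUs with PCSR-II, PE = 0.4444 / (2 × 0.2640) ≈ 84.17%, matching Table 3.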

On two GPUs, we observe from Table 3 and Figure 7 that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except G3_circuit. The maximum, average, and minimum parallel efficiency of PCSR-II are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 98.06%, 88.16%, and 72.20%. Moreover, PCSR-II has a smaller standard deviation than PCSR-I for all the matrices except ecology2, Transport, and G3_circuit. This implies that the workload balance on two GPUs achieved by the second method is better than that of the first method.

On four GPUs, in terms of parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except G3_circuit (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.35%, 85.14%, and 64.17%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.21%, 78.89%, and 59.94%.


Table 4: Comparison of PCSR-I and PCSR-II without communication on four GPUs (times in ms).

Matrix            ET (1 GPU)   PCSR-I (4 GPUs)               PCSR-II (4 GPUs)
                               ET       SD        PE (%)     ET       SD        PE (%)
2cubes_sphere     0.4444       0.1560   0.0132    71.23      0.1527   0.0111    72.78
scircuit          0.3484       0.1453   0.0262    59.94      0.1357   0.0130    64.17
Ga41As41H72       4.2387       1.6123   0.7268    65.72      1.3410   0.1846    79.02
F1                6.5544       2.5240   0.6827    64.92      1.9121   0.1900    85.69
ASIC_680ks        0.8196       0.2944   0.0298    69.59      0.2887   0.0264    70.98
ecology2          1.2321       0.3593   0.0160    85.72      0.3554   0.0141    86.67
Hamrle3           1.7684       0.5114   0.0307    86.45      0.4775   0.0125    92.59
thermal2          2.0708       0.5553   0.0271    93.22      0.5546   0.0255    93.33
cage14            5.9177       1.8126   0.3334    81.62      1.5386   0.0188    96.15
Transport         4.7305       1.2292   0.0270    96.21      1.2275   0.0158    96.35
G3_circuit        1.9731       0.5804   0.0489    84.99      0.6195   0.0790    79.63
kkt_power         4.3465       1.4974   0.5147    72.57      1.1584   0.0418    93.80
CurlCurl_4        5.1605       1.3554   0.0153    95.18      1.3501   0.0111    95.56
memchip           3.8257       1.1439   0.1741    83.61      1.1175   0.1223    85.59
Freescale1        5.0524       1.7588   0.4039    71.81      1.4806   0.1843    85.31

Figure 7: Parallel efficiency of PCSR-I and PCSR-II without communication on two GPUs.

In particular, for Ga41As41H72, F1, cage14, kkt_power, and Freescale1, the parallel efficiency of PCSR-II is almost 1.2 times that obtained by PCSR-I.

On the basis of the above observations, we conclude that PCSR-II achieves high performance and is on the whole better than PCSR-I. For PCSR on multiple GPUs, the second method is therefore our preferred one.

5.3.2. PCSR Performance with Communication. We still take the double-precision mode as an example to test the PCSR performance on multiple GPUs when communication is considered. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively. The same test matrices as in the previous experiment are utilized. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively. The time unit is ms. ET, SD, and PE in Tables 5 and 6 have the same meanings as in Tables 3 and 4.

Figure 8: Parallel efficiency of PCSR-I and PCSR-II without communication on four GPUs.

Figures 9 and 10 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.

On two GPUs, PCSR-I and PCSR-II have almost the same parallel efficiency for most matrices (Figure 9 and Table 5). In comparison, PCSR-II slightly outperforms PCSR-I. The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.34%, 88.51%, and 80.44%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.05%, 86.03%, and 73.57%.

On four GPUs, in terms of parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly better parallel efficiency for thermal2, G3_circuit, and Hamrle3 and a slightly smaller standard deviation for thermal2, G3_circuit, ecology2, and CurlCurl_4 (Figure 10 and Table 6).


Table 5: Comparison of PCSR-I and PCSR-II with communication on two GPUs (times in ms).

Matrix            ET (1 GPU)   PCSR-I (2 GPUs)               PCSR-II (2 GPUs)
                               ET       SD        PE (%)     ET       SD        PE (%)
2cubes_sphere     0.4444       0.2494   6.00E-04  89.09      0.2503   5.00E-04  88.75
scircuit          0.3484       0.2234   0.0154    77.95      0.2165   0.0070    80.44
Ga41As41H72       4.2387       2.3516   0.0030    90.12      2.3795   0.0521    89.07
F1                6.5544       3.9252   0.6948    83.49      3.6076   0.2392    90.84
ASIC_680ks        0.8196       0.4890   0.0113    83.80      0.4998   0.0178    81.99
ecology2          1.2321       0.6865   3.00E-04  89.74      0.6863   8.00E-04  89.76
Hamrle3           1.7684       1.0221   0.0209    86.50      1.0066   0.0170    87.84
thermal2          2.0708       1.1403   0.0230    90.80      1.1402   0.0203    90.81
cage14            5.9177       3.5756   0.5644    82.75      3.2244   0.0196    91.76
Transport         4.7305       2.4623   0.0203    96.05      2.4550   0.0183    96.34
G3_circuit        1.9731       1.1215   0.0189    87.96      1.1766   0.0896    83.84
kkt_power         4.3465       2.9539   0.6973    73.57      2.4459   0.0356    88.85
CurlCurl_4        5.1605       2.7064   0.0092    95.34      2.7049   1.00E-03  95.39
memchip           3.8257       2.3218   0.3467    82.39      2.2243   0.1973    85.99
Freescale1        5.0524       3.1216   0.5868    80.92      2.9367   0.3199    86.02

Table 6: Comparison of PCSR-I and PCSR-II with communication on four GPUs (times in ms).

Matrix            ET (1 GPU)   PCSR-I (4 GPUs)               PCSR-II (4 GPUs)
                               ET       SD        PE (%)     ET       SD        PE (%)
2cubes_sphere     0.4444       0.1567   0.0052    70.89      0.1531   0.0028    72.54
scircuit          0.3484       0.1544   0.0204    56.39      0.1495   0.0073    58.27
Ga41As41H72       4.2387       1.7157   0.7909    61.76      1.4154   0.2178    74.87
F1                6.5544       2.1149   0.3833    77.48      2.0022   0.1941    81.84
ASIC_680ks        0.8196       0.3449   0.0187    59.39      0.3423   0.0147    59.87
ecology2          1.2321       0.4257   0.0048    72.35      0.4257   0.0056    72.35
Hamrle3           1.7684       0.6231   0.0087    70.95      0.6297   0.0085    70.21
thermal2          2.0708       0.6922   0.0267    74.78      0.6959   0.0269    74.39
cage14            5.9177       1.9339   0.3442    76.50      1.6417   0.0067    90.12
Transport         4.7305       1.3323   0.0279    88.77      1.3217   0.0070    89.48
G3_circuit        1.9731       0.7234   0.0408    68.19      0.7458   0.0620    66.14
kkt_power         4.3465       1.7277   0.5495    62.89      1.3791   0.0305    78.79
CurlCurl_4        5.1605       1.5065   0.0253    85.63      1.5004   0.8789    85.99
memchip           3.8257       1.3804   0.1768    69.29      1.3051   0.1029    73.28
Freescale1        5.0524       2.0711   0.4342    60.98      1.8193   0.2262    69.43

The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 88.77%, 65.69%, and 56.39%.

Therefore, compared with PCSR-I and PCSR-II without communication, although the performance of PCSR-I and PCSR-II with communication decreases due to the communication overhead, they still achieve significant performance. Because PCSR-II overall outperforms PCSR-I for all test matrices, the second method in this case is still our preferred one for PCSR on multiple GPUs.

6. Conclusion

In this study, we propose a novel CSR-based SpMV on GPUs (PCSR). Experimental results show that our proposed PCSR on a single GPU is better than CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and a most recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance on multiple GPUs with PCSR, we present two matrix partitioning methods to balance the workload among multiple GPUs. We observe that PCSR shows good performance both with and


Figure 9: Parallel efficiency of PCSR-I and PCSR-II with communication on two GPUs.

Figure 10: Parallel efficiency of PCSR-I and PCSR-II with communication on four GPUs.

without communication when using the two matrix partitioning methods. In comparison, the second method is our preferred one.

Next, we will continue our research in this area and develop other novel SpMV algorithms on GPUs. In particular, future work will apply PCSR to some well-known iterative methods and thus to the solution of scientific and engineering problems.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.

References

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.

[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.

[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14–19, Portland, Ore, USA, November 2009.

[4] NVIDIA, CUSPARSE Library 6.5, 2015, https://developer.nvidia.com/cusparse.

[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.

[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172–1181, 2012.

[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982–2985, 2010.

[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162–1165, 2011.

[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, New Orleans, La, USA, November 2014.

[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.

[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1–12, Denver, Colo, USA, November 2013.

[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012.

[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115–126, ACM, Bangalore, India, January 2010.

[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244–252, 2014.

[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.

[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012.

[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and


Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010, Proceedings, pp. 111–125, Springer, Berlin, Germany, 2010.

[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.

[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing: Systems & Applications, vol. 39, no. 11, pp. 737–750, 2013.

[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "yaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107–118, February 2014.

[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121–127, 1988.

[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.

[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.

[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105–118, 2011.

[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.

[26] NVIDIA, "CUDA C Programming Guide 6.5," 2015, http://docs.nvidia.com/cuda/cuda-c-programming-guide.


Page 3: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

Mathematical Problems in Engineering 3

Input 119889119886119905119886 119894119899119889119894119888119890119904 119901119905119903 119909 119899Output 119910(01) for 119894 larr 0 to 119899 minus 1 do(02) 119903119900119908 119904119905119886119903119905 larr 119901119905119903[119894](03) 119903119900119908 119890119899119889 larr 119901119905119903[119894 + 1](04) 119904119906119898 larr 0(05) for 119895 larr 119903119900119908 119904119905119886119903119905 to 119903119900119908 119890119899119889 minus 1 do(06) 119904119906119898 += 119889119886119905119886[119895] sdot 119909[119894119899119889119894119888119890119904[119895]](07) done(08) 119910[119894] larr 119904119906119898(09) done

Algorithm 1 Sequential SpMV

3 SpMV on a GPU

In this section we present a perfect implementation of CSR-based SpMV on the GPU Different with other related workthe proposed algorithm involves the following two kernels

(i) Kernel 1 calculate the array V = [V1 V2 V

119873] where

V119894= 119889119886119905119886[119894] sdot 119909[119894119899119889119894119888119890119904[119894]] 119894 = 1 2 119873 and then

save it to global memory(ii) Kernel 2 accumulate element values of V according

to the following formula sum119901119905119903[119895]⩽119894lt119901119905119903[119895+1]

V119894 119895 =

0 1 119899 minus 1 and store them to an array 119910 in globalmemory

We call the proposed SpMV algorithm PCSR For sim-plicity the symbols used in this study are listed in Table 1

31 Kernel 1 For Kernel 1 its detailed procedure is shown inAlgorithm 2 We observe that the accesses to two arrays 119889119886119905119886and 119894119899119889119894119888119890119904 in global memory are fully coalesced Howeverthe vector 119909 in global memory is randomly accessed whichresults in decreasing the performance ofKernel 1 On the basisof evaluations in [24] the best memory space to place datais the texture memory when randomly accessing the arrayTherefore here texture memory is utilized to place the vectorinstead of global memory For the single-precision floatingpoint texture the fourth step in Algorithm 2 is rewritten as

V [119905119894119889]

larr997888 119889119886119905119886 [119905119894119889]

sdot 1199051198901199091199051119863119891119890119905119888ℎ (119891119897119900119886119905119879119890119909119877119890119891 119894119899119889119894119888119890119904 [119905119894119889])

(3)

Because the texture does not support double values thefollowing function119891119890119905119888ℎ 119889119900119906119887119897119890() is suggested to transfer theint2 value to the double value

(01) dekice double 119891119890119905119888ℎ 119889119900119906119887119897119890(texture⟨int2 1⟩119905 int 119894)(02) int2 V = tex1Dfetch(119905 119894)(03) return hiloint2double(V sdot 119910 V sdot 119909)(04)

Furthermore for the double-precision floating point texturebased on the function 119891119890119905119888ℎ 119889119900119906119887119897119890() we rewrite the fourthstep in Algorithm 2 as

V [119905119894119889]

larr997888 119889119886119905119886 [119905119894119889]

sdot 119891119890119905119888ℎ 119889119900119906119887119897119890 (119889119900119906119887119897119890119879119890119909119877119890119891 119894119899119889119894119888119890119904 [119905119894119889])

(4)

32 Kernel 2 Kernel 2 accumulates element values of V thatis obtained by Kernel 1 and its detailed procedure is shown inAlgorithm 3This kernel is mainly composed of the followingthree stages

(i) In the first stage the array 119901119905119903 in global memoryis piecewise assembled into shared memory 119901119905119903 119904 ofeach thread block in parallel Each thread for a threadblock is only responsible for loading an elementvalue of 119901119905119903 into 119901119905119903 119904 except for thread 0 (see lines(05)-(06) in Algorithm 3) The detailed procedure isillustrated in Figure 1 We can see that the accesses to119901119905119903 are aligned

(ii) The second stage loads element values of V in globalmemory from the position 119901119905119903 119904[0] to the position

119901119905119903 119904[TB] into shared memory V 119904 for each threadblock The assembling procedure is illustrated inFigure 2 In this case the access to V is fully coa-lesced

(iii) The third stage accumulates element values of V 119904as shown in Figure 3 The accumulation is highlyefficient due to the utilization of two shared memoryarrays 119901119905119903 119904 and V 119904

Obviously Kernel 2 benefits from shared memory Usingthe shared memory not only are the data accessed fast butalso the accesses to data are coalesced

From the above procedures for PCSR we observe thatPCSR needs additional global memory spaces to store amiddle array V besides storing CSR arrays 119889119886119905119886 119894119899119889119894119888119890119904 and119901119905119903 Saving data into V in Kernel 1 and loading data from Vin Kernel 2 to a degree decrease the performance of PCSRHowever PCSR benefits from the middle array V becauseintroducing V makes it access CSR arrays 119889119886119905119886 119894119899119889119894119888119890119904 and119901119905119903 in a fully coalesced manner This greatly improves thespeed of accessing CSR arrays and alleviates the principaldeficiencies of CSR-scalar (rare coalescing) and CSR-vector(partial coalescing)

4 Mathematical Problems in Engineering

Block gridBlock 0

Block 1

middot middot middotmiddot middot middot middot middot middot

middot middot middot

middot middot middot

Block i

Block BG

Threads in the ith block

Shared memory

ptr_s[0]ptr_s[1]

ptr_s[2]

ptr_s[TB minus 1]

ptr_s[TB]

Thread 0

Thread 1

Thread 2

Thread TB minus 1

Thread 0

Global memory

ptr[i lowast TB + 0]

ptr[i lowast TB + 1]ptr[i lowast TB + 2]

ptr[i lowast TB + TB minus 1]ptr[i lowast TB + TB]

Figure 1 First stage of Kernel 2

Table 1 Symbols used in this study

Symbol Description119860 Sparse matrix119909 Input vector119910 Output vector119899 Size of the input and output vectors119873 Number of nonzero elements in 119860threadsPerBlock (TB) Number of threads per blockblocksPerGrid (BG) Number of blocks per grid

elementsPerThread Number of elements calculated by eachthread

sizeSharedMemory Size of shared memory119872 Number of GPUs

Input 119889119886119905119886 119894119899119889119894119888119890119904 119909119873CUDA-specific variables(i) threadIdx a thread(ii) blockIdx a block(iii) blockDimx number of threads per block(iv) gridDimx number of blocks per grid

Output V(01) 119905119894119889 larr threadIdx + blockIdx sdot blockDimx(02) 119894119888119903 larr blockDimx sdot gridDimx(03) while 119905119894119889 lt 119873(04) V[119905119894119889] larr 119889119886119905119886[119905119894119889] sdot 119909[119894119899119889119894119888119890119904[119905119894119889]](05) 119905119894119889 += 119894119888119903(06) end while

Algorithm 2 Kernel 1

4 SpMV on Multiple GPUs

In this section we will present how to extend PCSR on asingle GPU to multiple GPUs Note that the case of multipleGPUs in a single node (single PC) is only discussed becauseof its good expansibility (eg also used in the multi-CPUand multi-GPU heterogeneous platform) To balance theworkload among multiple GPUs the following two methodscan be applied

(1) For the first method the matrix is equally partitionedinto119872 (number of GPUs) submatrices according tothe matrix rows Each submatrix is assigned to oneGPU and each GPU is only responsible for comput-ing the assigned submatrix multiplication with thecomplete input vector

(2) For the second method the matrix is equally parti-tioned into119872 submatrices according to the numberof nonzero elements Each GPU only calculates asubmatrix multiplication with the complete inputvector

In most cases two partitionedmethods mentioned aboveare similar However for some exceptional cases for examplemost nonzero elements are involved in a few rows for amatrix the partitioned submatrices that are obtained by thefirstmethodhave distinct difference of nonzero elements andthose that are obtained by the second method have differentrows Which method is the preferred one for PCSR

If each GPU has the complete input vector PCSR onmultiple GPUs will not need to communicate between GPUsIn fact SpMV is often applied to a large number of iterativemethods where the sparse matrix is iteratively multipliedby the input and output vectors Therefore if each GPUonly includes a part of the input vector before SpMV thecommunication between GPUs will be required in order toexecute PCSR Here PCSR implements the communicationbetween GPUs using NVIDIA GPUDirect

5 Experimental Results

51 Experimental Setup In this section we test the perfor-mance of PCSR All test matrices come from the Universityof Florida Sparse Matrix Collection [25]Their properties aresummarized in Table 2

All algorithms are executed on one machine which isequipped with an Intel Xeon Quad-Core CPU and fourNVIDIA Tesla C2050 GPUs Our source codes are compiledand executed using the CUDA toolkit 65 under GNULinuxUbuntu v10041 The performance is measured in terms ofGFlops (second) or GBytes (second)

52 Single GPU We compare PCSR with CSR-scalar CSR-vector CSRMV HYBMV and CSR-Adaptive CSR-scalar and

Mathematical Problems in Engineering 5

Input V 119901119905119903CUDA-specific variables(i) threadIdx a thread(ii) blockIdx a block(iii) blockDimx number of threads per block(iv) gridDimx number of blocks per grid

Output 119910(01) define shared memory V 119904 with size 119904119894119911119890119878ℎ119886119903119890119889119872119890119898119900119903119910(02) define shared memory 119901119905119903 119904 with size (119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 + 1)(03) 119892119894119889 larr threadIdxx + blockIdxx times blockDimx(04) 119905119894119889 larr threadIdxx

lowastLoad ptr into the shared memory ptr slowast(05) 119901119905119903 119904[119905119894119889]larr 119901119905119903[119892119894119889](06) if 119905119894119889 == 0 then119901119905119903 s[119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896]larr 119901119905119903[119892119894119889 + 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896](07) syncthreads()(08) 119905119890119898119901 larr (119901119905119903 119904[119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896] minus119901119905119903 119904[0])119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 + 1(09) 119899119897119890119899 larr min(119905119890119898119901 sdot 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 119904119894119911119890119878ℎ119886119903119890119889119872119890119898119900119903119910)(10) 119904119906119898 larr 00119898119886119909119897119890119899 larr 119901119905119903 119904[119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896](11) for 119894 larr 119901119905119903 119904[0] to119898119886119909119897119890119899 minus 1with 119894 += 119899119897119890119899 do(12) indexlarr 119894 + 119905119894119889(13) syncthreads()

lowastLoad V into the shared memory V 119904lowast(14) for 119895 larr 0 to 119899119897119890119899119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 minus 1 do(15) if 119894119899119889119890119909 lt 119899119897119890119899 then(16) V 119904[119905119894119889 + 119895 sdot 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896]larr V[119894119899119889119890119909](17) 119894119899119889119890119909 += 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896(18) end(19) done(20) syncthreads()

lowastPerform a scalar-style reductionlowast(21) if (119901119905119903 119904[119905119894119889 + 1] ⩽ 119894 or119901119905119903 119904[119905119894119889] gt 119894 + 119899119897119890119899 minus 1) is false then(22) 119903119900119908 119904 larr max(119901119905119903 119904[119905119894119889] minus119894 0)(23) 119903119900119908 119890 larr min(119901119905119903 119904[119905119894119889 + 1] minus 119894 119899119897119890119899)(24) for 119895 larr 119903119900119908 119904 to 119903119900119908 119890 minus 1 do(25) 119904119906119898 += V 119904[119895](26) done(27) end(28) done(29) 119910[gid] larr 119904119906119898

Algorithm 3 Kernel 2

Block grid

Block 0

Block 1

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middotmiddot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Block i

Block BG

Threads in the ith block

Shared memory Global memoryThread 0

Thread 0

Thread 0

Thread RT

Thread TB minus 1

Thread TB minus 1

v_s[0]

v_s[TB minus 1]

v_s[j lowast TB + 0]

v_s[m lowast TB + 0]

Note thatRS = ptr_s[TB] minus ptr_s[0]m = [RSTB]RT = RS minus m lowast TBPT = ptr_s[0]

v[PT + 0]

v[PT + TB minus 1]

v[PT + j lowast TB]

v[PT + j lowast TB + TB minus 1]

v[PT + m lowast TB]

v[PT + RS minus 1]

v_s[j lowast TB + TB minus 1]

v_s[RS minus 1]

Figure 2 Second stage of Kernel 2

6 Mathematical Problems in Engineering

Block gridBlock 0

Block 1

middot middot middot

middot middot middotmiddot middot middot

middot middot middot

Block i

Block BG

ThreadsThread 0

Thread 1

Thread j

Thread TB minus 1

sum ptr_s[0]leileptr_s[1]v_s[i]

sum ptr_s[1]leileptr_s[2]v_s[i]

sum ptr_s[j]leileptr_s[j+1]v_s[i]

sum ptr_s[TBminus1]leileptr_s[TB]v_s[i]

Figure 3 Third stage of Kernel 2

Table 2 Properties of test matrices

Name Rows Nonzeros (nz) nzrow Descriptionepb2 25228 175027 694 Thermal problemecl32 51993 380415 732 Semiconductor devicebayer01 57735 277774 481 Chemical processg7jac200sc 59310 837936 1413 Economic problemfinan512 74752 335872 449 Economic problem2cubes sphere 101492 1647264 1623 Electromagneticstorso2 115967 1033473 891 2D3D problemFEM 3D thermal2 147900 3489300 2359 Nonlinear thermalscircuit 170998 958936 561 Circuit simulationcont-300 180895 988195 546 Optimization problemGa41As41H72 268096 18488476 6896 Pseudopotential methodF1 343791 26837113 7806 Stiffness matrixrajat24 358172 1948235 544 Circuit simulationlanguage 399130 1216334 305 Directed graphaf shell9 504855 17588845 3484 Sheet metal formingASIC 680ks 682712 2329176 341 Circuit simulationecology2 999999 4995991 500 Circuit theoryHamrle3 1447360 5514242 381 Circuit simulationthermal2 1228045 8580313 699 Unstructured FEMcage14 1505785 27130349 1801 DNA electrophoresisTransport 1602111 23500731 1467 Structural problemG3 circuit 1585478 7660826 483 Circuit simulationkkt power 2063494 12771361 619 Optimization problemCurlCurl 4 2380515 26515867 1114 Model reductionmemchip 2707524 14810202 547 Circuit simulationFreescale1 3428755 18920347 552 Circuit simulation

CSR-vector in the CUSP library [5] are chosen in order toshow the effects of accessing CSR arrays in a fully coalescedmanner in PCSR CSRMV in the CUSPARSE library [4]is a representative of CSR-based SpMV algorithms on theGPU HYBMV in the CUSPARSE library [4] is a finely tunedHYB-based SpMV algorithm on the GPU and usually has abetter behavior than many existing SpMV algorithms CSR-Adaptive is a most recently proposed CSR-based algorithm[9]

We select 15 sparse matrices with distinct sizes rangingfrom 25228 to 2063494 as our test matrices Figure 4shows the single-precision and double-precision perfor-mance results in terms of GFlops of CSR-scalar CSR-vector CSRMVHYBMVCSR-Adaptive andPCSRon aTesla

C2050 GFlops values in Figure 4 are calculated on the basisof the assumption of two Flops per nonzero entry for amatrix[3 13] In Figure 5 the measured memory bandwidth resultsfor single precision and double precision are reported

521 Single Precision From Figure 4(a) we observe thatPCSR achieves high performance for all the matrices in thesingle-precision mode In most cases the performance ofover 9GFlopss can be obtained Moreover PCSR outper-forms CSR-scalar CSR-vector and CSRMV for all test casesand average speedups of 424x 218x and 162x comparedto CSR-scalar CSR-vector and CSRMV can be obtainedrespectively Furthermore PCSRhas a slightly better behaviorthan HYBMV for all the matrices except for af shell9 and

Mathematical Problems in Engineering 7

Sing

le-p

reci

sion

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(a) Single precision

Dou

ble-

prec

ision

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(b) Double precision

Figure 4 Performance of all algorithms on a Tesla C2050

Sing

le-p

reci

sion

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(a) Single precision

Dou

ble-

prec

ision

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(b) Double precision

Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050

cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive

Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism

522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050

53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR

8 Mathematical Problems in Engineering

(a) cont-300 (b) af shell9

Figure 6 Visualization of the af shell9 and cont-300 matrix

Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02670 00178 8321 02640 00156 8417scircuit 03484 02413 00322 7220 02250 00207 7741Ga41As41H72 42387 23084 00446 9181 23018 00432 9207F1 65544 38865 07012 8432 35710 02484 9177ASIC 680ks 08196 04567 00126 8972 04566 00021 8974ecology2 12321 06665 00140 9242 06654 00152 9258Hamrle3 17684 09651 00478 9161 09208 500E minus 05 9602thermal2 20708 10559 00056 9806 10558 00045 9806cage14 59177 34757 05417 8513 31548 00458 9378Transport 47305 24665 00391 9589 24655 00407 9593G3 circuit 19731 10485 00364 9408 11061 01148 8918kkt power 43465 27916 07454 7785 22252 00439 9766CurlCurl 4 51605 27107 00347 9518 27075 00244 9530memchip 38257 21905 03393 8732 20975 02175 9119Freescale1 50524 30235 05719 8355 28175 02811 8966

We call PCSR with the first partitioning method PCSR-I and PCSR with the second partitioning method PCSR-II. Some of the large-sized test matrices in Table 2 are used. The execution times of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs are listed in Tables 3 and 4, respectively. In Tables 3 and 4, ET, SD, and PE stand for the execution time, the standard deviation, and the parallel efficiency, respectively; the time unit is milliseconds (ms). Figures 7 and 8 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.
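The reported PE values are consistent with the usual definition of parallel efficiency (our reading; the formula is not spelled out in the text): with $T_1$ the single-GPU execution time and $T_M$ the execution time on $M$ GPUs,

$$\mathrm{PE} = \frac{T_1}{M\,T_M}.$$

For example, for 2cubes_sphere with PCSR-I on two GPUs (Table 3), $0.4444 / (2 \times 0.2670) \approx 0.832$, matching the reported 83.21%.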

On two GPUs, we observe from Table 3 and Figure 7 that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except G3_circuit. The maximum, average, and minimum parallel efficiencies of PCSR-II are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiencies of PCSR-I: 98.06%, 88.16%, and 72.20%. Moreover, PCSR-II has a smaller standard deviation than PCSR-I for all the matrices except ecology2, Transport, and G3_circuit. This implies that the second method balances the workload on two GPUs better than the first method.
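For reference, a minimal host-side sketch of the two row-block partitioning strategies being compared, assuming the matrix is stored in CSR with a row-pointer array ptr of length n+1; the function names and boundary handling are ours, not the authors' code. The second method scans ptr so that each GPU's row block holds roughly nnz/M nonzeros, which is what keeps the per-GPU work balanced for matrices such as kkt_power whose nonzeros are concentrated in a small fraction of the rows.

#include <vector>

// Method 1: give every GPU (almost) the same number of rows.
// Returns boundaries: GPU g owns rows [start[g], start[g+1]).
std::vector<int> partitionByRows(int n, int numGpus) {
    std::vector<int> start(numGpus + 1);
    for (int g = 0; g <= numGpus; ++g)
        start[g] = static_cast<int>(static_cast<long long>(n) * g / numGpus);
    return start;
}

// Method 2: give every GPU (almost) the same number of nonzeros,
// found by scanning the CSR row pointer for the g * nnz / numGpus thresholds.
std::vector<int> partitionByNnz(const std::vector<int>& ptr, int numGpus) {
    const int n = static_cast<int>(ptr.size()) - 1;
    const long long nnz = ptr[n];
    std::vector<int> start(numGpus + 1, 0);
    start[numGpus] = n;
    int row = 0;
    for (int g = 1; g < numGpus; ++g) {
        const long long target = nnz * g / numGpus;
        while (row < n && ptr[row] < target)   // first row whose prefix reaches the target
            ++row;
        start[g] = row;
    }
    return start;
}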

On four GPUs, in terms of both parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except G3_circuit (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiencies of PCSR-II over all the matrices are 96.35%, 85.14%, and 64.17%, which are better than the corresponding maximum, average, and minimum parallel efficiencies of PCSR-I: 96.21%, 78.89%, and 59.94%.


Table 4: Comparison of PCSR-I and PCSR-II without communication on four GPUs (ET in ms; PE in %).

Matrix | ET (1 GPU) | PCSR-I (4 GPUs): ET, SD, PE | PCSR-II (4 GPUs): ET, SD, PE
2cubes_sphere | 0.4444 | 0.1560, 0.0132, 71.23 | 0.1527, 0.0111, 72.78
scircuit | 0.3484 | 0.1453, 0.0262, 59.94 | 0.1357, 0.0130, 64.17
Ga41As41H72 | 4.2387 | 1.6123, 0.7268, 65.72 | 1.3410, 0.1846, 79.02
F1 | 6.5544 | 2.5240, 0.6827, 64.92 | 1.9121, 0.1900, 85.69
ASIC_680ks | 0.8196 | 0.2944, 0.0298, 69.59 | 0.2887, 0.0264, 70.98
ecology2 | 1.2321 | 0.3593, 0.0160, 85.72 | 0.3554, 0.0141, 86.67
Hamrle3 | 1.7684 | 0.5114, 0.0307, 86.45 | 0.4775, 0.0125, 92.59
thermal2 | 2.0708 | 0.5553, 0.0271, 93.22 | 0.5546, 0.0255, 93.33
cage14 | 5.9177 | 1.8126, 0.3334, 81.62 | 1.5386, 0.0188, 96.15
Transport | 4.7305 | 1.2292, 0.0270, 96.21 | 1.2275, 0.0158, 96.35
G3_circuit | 1.9731 | 0.5804, 0.0489, 84.99 | 0.6195, 0.0790, 79.63
kkt_power | 4.3465 | 1.4974, 0.5147, 72.57 | 1.1584, 0.0418, 93.80
CurlCurl_4 | 5.1605 | 1.3554, 0.0153, 95.18 | 1.3501, 0.0111, 95.56
memchip | 3.8257 | 1.1439, 0.1741, 83.61 | 1.1175, 0.1223, 85.59
Freescale1 | 5.0524 | 1.7588, 0.4039, 71.81 | 1.4806, 0.1843, 85.31

Figure 7: Parallel efficiency of PCSR-I and PCSR-II without communication on two GPUs.

In particular, for Ga41As41H72, F1, cage14, kkt_power, and Freescale1, the parallel efficiency of PCSR-II is almost 1.2 times that obtained by PCSR-I.

On the basis of the above observations, we conclude that PCSR-II has high performance and is, on the whole, better than PCSR-I. For PCSR on multiple GPUs, the second partitioning method is therefore our preferred one.

5.3.2. PCSR Performance with Communication. We again take the double-precision mode as an example, this time to test the PCSR performance on multiple GPUs when the communication between GPUs is taken into account. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively, and the same test matrices as in the previous experiment are used. The execution times of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs are listed in Tables 5 and 6, respectively. The time unit is ms, and ET, SD, and PE in Tables 5 and 6 have the same meaning as in Tables 3 and 4.
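For context, a minimal sketch of the kind of inter-GPU traffic being timed here: after each GPU finishes its SpMV, its slice of the output vector must be replicated on the other GPUs before the next iteration can start. The sketch below expresses that exchange with cudaMemcpyPeerAsync; it is only an illustration of the data movement (the paper states that PCSR implements the communication between GPUs using NVIDIA GPUDirect), and all names are ours.

#include <cuda_runtime.h>
#include <vector>

// d_y[g] is a device pointer on GPU g holding the FULL output vector;
// GPU g has just written its own row block [rowStart[g], rowStart[g+1]).
// Broadcast every block to the other GPUs so that each GPU ends up with
// the complete vector for the next SpMV.
void exchangeSlices(const std::vector<double*>& d_y,
                    const std::vector<int>& rowStart,
                    const std::vector<cudaStream_t>& stream) {
    const int numGpus = static_cast<int>(d_y.size());
    for (int src = 0; src < numGpus; ++src) {
        const size_t offset = rowStart[src];
        const size_t bytes =
            static_cast<size_t>(rowStart[src + 1] - rowStart[src]) * sizeof(double);
        for (int dst = 0; dst < numGpus; ++dst) {
            if (dst == src) continue;
            // Peer-to-peer copy of GPU src's slice into the same offset on GPU dst.
            cudaMemcpyPeerAsync(d_y[dst] + offset, dst,
                                d_y[src] + offset, src,
                                bytes, stream[src]);
        }
    }
    for (int g = 0; g < numGpus; ++g) {   // wait until all copies have completed
        cudaSetDevice(g);
        cudaDeviceSynchronize();
    }
}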

Figure 8: Parallel efficiency of PCSR-I and PCSR-II without communication on four GPUs.

Figures 9 and 10 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.

On two GPUs, PCSR-I and PCSR-II have close parallel efficiency for most matrices (Figure 9 and Table 5), with PCSR-II slightly outperforming PCSR-I. The maximum, average, and minimum parallel efficiencies of PCSR-II over all the matrices are 96.34%, 88.51%, and 80.44%, which are better than the corresponding maximum, average, and minimum parallel efficiencies of PCSR-I: 96.05%, 86.03%, and 73.57%.

On four GPUs, in terms of both parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly higher parallel efficiency for thermal2, G3_circuit, and Hamrle3 and a slightly smaller standard deviation for thermal2, G3_circuit, ecology2, and CurlCurl_4 (Figure 10 and Table 6).


Table 5: Comparison of PCSR-I and PCSR-II with communication on two GPUs (ET in ms; PE in %).

Matrix | ET (1 GPU) | PCSR-I (2 GPUs): ET, SD, PE | PCSR-II (2 GPUs): ET, SD, PE
2cubes_sphere | 0.4444 | 0.2494, 6.00E-04, 89.09 | 0.2503, 5.00E-04, 88.75
scircuit | 0.3484 | 0.2234, 0.0154, 77.95 | 0.2165, 0.0070, 80.44
Ga41As41H72 | 4.2387 | 2.3516, 0.0030, 90.12 | 2.3795, 0.0521, 89.07
F1 | 6.5544 | 3.9252, 0.6948, 83.49 | 3.6076, 0.2392, 90.84
ASIC_680ks | 0.8196 | 0.4890, 0.0113, 83.80 | 0.4998, 0.0178, 81.99
ecology2 | 1.2321 | 0.6865, 3.00E-04, 89.74 | 0.6863, 8.00E-04, 89.76
Hamrle3 | 1.7684 | 1.0221, 0.0209, 86.50 | 1.0066, 0.0170, 87.84
thermal2 | 2.0708 | 1.1403, 0.0230, 90.80 | 1.1402, 0.0203, 90.81
cage14 | 5.9177 | 3.5756, 0.5644, 82.75 | 3.2244, 0.0196, 91.76
Transport | 4.7305 | 2.4623, 0.0203, 96.05 | 2.4550, 0.0183, 96.34
G3_circuit | 1.9731 | 1.1215, 0.0189, 87.96 | 1.1766, 0.0896, 83.84
kkt_power | 4.3465 | 2.9539, 0.6973, 73.57 | 2.4459, 0.0356, 88.85
CurlCurl_4 | 5.1605 | 2.7064, 0.0092, 95.34 | 2.7049, 1.00E-03, 95.39
memchip | 3.8257 | 2.3218, 0.3467, 82.39 | 2.2243, 0.1973, 85.99
Freescale1 | 5.0524 | 3.1216, 0.5868, 80.92 | 2.9367, 0.3199, 86.02

Table 6: Comparison of PCSR-I and PCSR-II with communication on four GPUs (ET in ms; PE in %).

Matrix | ET (1 GPU) | PCSR-I (4 GPUs): ET, SD, PE | PCSR-II (4 GPUs): ET, SD, PE
2cubes_sphere | 0.4444 | 0.1567, 0.0052, 70.89 | 0.1531, 0.0028, 72.54
scircuit | 0.3484 | 0.1544, 0.0204, 56.39 | 0.1495, 0.0073, 58.27
Ga41As41H72 | 4.2387 | 1.7157, 0.7909, 61.76 | 1.4154, 0.2178, 74.87
F1 | 6.5544 | 2.1149, 0.3833, 77.48 | 2.0022, 0.1941, 81.84
ASIC_680ks | 0.8196 | 0.3449, 0.0187, 59.39 | 0.3423, 0.0147, 59.87
ecology2 | 1.2321 | 0.4257, 0.0048, 72.35 | 0.4257, 0.0056, 72.35
Hamrle3 | 1.7684 | 0.6231, 0.0087, 70.95 | 0.6297, 0.0085, 70.21
thermal2 | 2.0708 | 0.6922, 0.0267, 74.78 | 0.6959, 0.0269, 74.39
cage14 | 5.9177 | 1.9339, 0.3442, 76.50 | 1.6417, 0.0067, 90.12
Transport | 4.7305 | 1.3323, 0.0279, 88.77 | 1.3217, 0.0070, 89.48
G3_circuit | 1.9731 | 0.7234, 0.0408, 68.19 | 0.7458, 0.0620, 66.14
kkt_power | 4.3465 | 1.7277, 0.5495, 62.89 | 1.3791, 0.0305, 78.79
CurlCurl_4 | 5.1605 | 1.5065, 0.0253, 85.63 | 1.5004, 0.8789, 85.99
memchip | 3.8257 | 1.3804, 0.1768, 69.29 | 1.3051, 0.1029, 73.28
Freescale1 | 5.0524 | 2.0711, 0.4342, 60.98 | 1.8193, 0.2262, 69.43

The maximum, average, and minimum parallel efficiencies of PCSR-II over all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiencies of PCSR-I: 88.77%, 65.69%, and 56.39%.

Therefore, although the performance of PCSR-I and PCSR-II with communication is lower than without communication because of the communication overhead, both variants still achieve good performance. Because PCSR-II overall outperforms PCSR-I on the test matrices, the second partitioning method remains our preferred one for PCSR on multiple GPUs in this case as well.

6. Conclusion

In this study, we propose a novel CSR-based SpMV on GPUs called PCSR. Experimental results show that PCSR on a single GPU outperforms CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and a recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance on multiple GPUs, we present two matrix-partitioning methods to balance the workload among the GPUs. We observe that PCSR achieves good performance with both partitioning methods, whether or not the communication between GPUs is considered; of the two, the second method is our preferred one.


Figure 9: Parallel efficiency of PCSR-I and PCSR-II with communication on two GPUs.

Figure 10: Parallel efficiency of PCSR-I and PCSR-II with communication on four GPUs.

In future work, we will continue research in this area and develop other novel SpMV implementations on GPUs. In particular, we plan to apply PCSR to well-known iterative methods to solve scientific and engineering problems.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.

References

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.

[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.

[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14-19, Portland, Ore, USA, November 2009.

[4] NVIDIA, CUSPARSE Library 6.5, 2015, https://developer.nvidia.com/cusparse.

[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.

[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172-1181, 2012.

[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, 2010.

[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162-1165, 2011.

[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769-780, New Orleans, La, USA, November 2014.

[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930-1940, 2013.

[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1-12, Denver, Colo, USA, November 2013.

[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552-575, 2012.

[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115-126, ACM, Bangalore, India, January 2010.

[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244-252, 2014.

[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815-826, 2011.

[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408-420, 2012.

[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010, Proceedings, pp. 111-125, Springer, Berlin, Germany, 2010.

[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401-C423, 2014.

[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing: Systems & Applications, vol. 39, no. 11, pp. 737-750, 2013.

[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "YaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107-118, February 2014.

[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121-127, 1988.

[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.

[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40-53, 2008.

[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105-118, 2011.

[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1-25, 2011.

[26] NVIDIA, "CUDA C Programming Guide 6.5," 2015, http://docs.nvidia.com/cuda/cuda-c-programming-guide.



Figure 4 Performance of all algorithms on a Tesla C2050

Sing

le-p

reci

sion

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(a) Single precision

Dou

ble-

prec

ision

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(b) Double precision

Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050

cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive

Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism

522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050

53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR

8 Mathematical Problems in Engineering

(a) cont-300 (b) af shell9

Figure 6 Visualization of the af shell9 and cont-300 matrix

Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02670 00178 8321 02640 00156 8417scircuit 03484 02413 00322 7220 02250 00207 7741Ga41As41H72 42387 23084 00446 9181 23018 00432 9207F1 65544 38865 07012 8432 35710 02484 9177ASIC 680ks 08196 04567 00126 8972 04566 00021 8974ecology2 12321 06665 00140 9242 06654 00152 9258Hamrle3 17684 09651 00478 9161 09208 500E minus 05 9602thermal2 20708 10559 00056 9806 10558 00045 9806cage14 59177 34757 05417 8513 31548 00458 9378Transport 47305 24665 00391 9589 24655 00407 9593G3 circuit 19731 10485 00364 9408 11061 01148 8918kkt power 43465 27916 07454 7785 22252 00439 9766CurlCurl 4 51605 27107 00347 9518 27075 00244 9530memchip 38257 21905 03393 8732 20975 02175 9119Freescale1 50524 30235 05719 8355 28175 02811 8966

performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively

On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are

9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod

On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889

Mathematical Problems in Engineering 9

Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01560 00132 7123 01527 00111 7278scircuit 03484 01453 00262 5994 01357 00130 6417Ga41As41H72 42387 16123 07268 6572 13410 01846 7902F1 65544 25240 06827 6492 19121 01900 8569ASIC 680ks 08196 02944 00298 6959 02887 00264 7098ecology2 12321 03593 00160 8572 03554 00141 8667Hamrle3 17684 05114 00307 8645 04775 00125 9259thermal2 20708 05553 00271 9322 05546 00255 9333cage14 59177 18126 03334 8162 15386 00188 9615Transport 47305 12292 00270 9621 12275 00158 9635G3 circuit 19731 05804 00489 8499 06195 00790 7963kkt power 43465 14974 05147 7257 11584 00418 9380CurlCurl 4 51605 13554 00153 9518 13501 00111 9556memchip 38257 11439 01741 8361 11175 01223 8559Freescale1 50524 17588 04039 7181 14806 01843 8531

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs

and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI

On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one

532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs

the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively

On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357

On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average

10 Mathematical Problems in Engineering

Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02494 600E minus 04 8909 02503 500119864 minus 04 8875scircuit 03484 02234 00154 7795 02165 00070 8044Ga41As41H72 42387 23516 00030 9012 23795 00521 8907F1 65544 39252 06948 8349 36076 02392 9084ASIC 680ks 08196 04890 00113 8380 04998 00178 8199ecology2 12321 06865 300119864 minus 04 8974 06863 800E minus 04 8976Hamrle3 17684 10221 00209 8650 10066 00170 8784thermal2 20708 11403 00230 9080 11402 00203 9081cage14 59177 35756 05644 8275 32244 00196 9176Transport 47305 24623 00203 9605 24550 00183 9634G3 circuit 19731 11215 00189 8796 11766 00896 8384kkt power 43465 29539 06973 7357 24459 00356 8885CurlCurl 4 51605 27064 00092 9534 27049 100E minus 03 9539memchip 38257 23218 03467 8239 22243 01973 8599Freescale1 50524 31216 05868 8092 29367 03199 8602

Table 6 Comparison of PCSRI and PCSRII with communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01567 00052 7089 01531 00028 7254scircuit 03484 01544 00204 5639 01495 00073 5827Ga41As41H72 42387 17157 07909 6176 14154 02178 7487F1 65544 21149 03833 7748 20022 01941 8184ASIC 680ks 08196 03449 00187 5939 03423 00147 5987ecology2 12321 04257 00048 7235 04257 00056 7235Hamrle3 17684 06231 00087 7095 06297 00085 7021thermal2 20708 06922 00267 7478 06959 00269 7439cage14 59177 19339 03442 7650 16417 00067 9012Transport 47305 13323 00279 8877 13217 00070 8948G3 circuit 19731 07234 00408 6819 07458 00620 6614kkt power 43465 17277 05495 6289 13791 00305 7879CurlCurl 4 51605 15065 00253 8563 15004 08789 8599memchip 38257 13804 01768 6929 13051 01029 7328Freescale1 50524 20711 04342 6098 18193 02262 6943

and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639

Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs

6 Conclusion

In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and

Mathematical Problems in Engineering 11

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs

without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one

Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017

References

[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003

[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008

[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009

[4] NVIDIA CUSPARSE Library 65 2015 httpsdevelopernvidiacomcusparse

[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom

[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012

[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010

[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011

[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014

[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013

[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013

[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012

[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010

[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014

[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011

[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012

[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and

12 Mathematical Problems in Engineering

Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010

[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014

[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013

[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014

[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988

[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993

[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008

[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011

[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011

[26] NVIDIA ldquoCUDAC Programming Guide 65rdquo 2015 httpdocsnvidiacomcudacuda-c-programming-guide

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

Mathematical Problems in Engineering 5

Input V 119901119905119903CUDA-specific variables(i) threadIdx a thread(ii) blockIdx a block(iii) blockDimx number of threads per block(iv) gridDimx number of blocks per grid

Output 119910(01) define shared memory V 119904 with size 119904119894119911119890119878ℎ119886119903119890119889119872119890119898119900119903119910(02) define shared memory 119901119905119903 119904 with size (119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 + 1)(03) 119892119894119889 larr threadIdxx + blockIdxx times blockDimx(04) 119905119894119889 larr threadIdxx

lowastLoad ptr into the shared memory ptr slowast(05) 119901119905119903 119904[119905119894119889]larr 119901119905119903[119892119894119889](06) if 119905119894119889 == 0 then119901119905119903 s[119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896]larr 119901119905119903[119892119894119889 + 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896](07) syncthreads()(08) 119905119890119898119901 larr (119901119905119903 119904[119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896] minus119901119905119903 119904[0])119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 + 1(09) 119899119897119890119899 larr min(119905119890119898119901 sdot 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 119904119894119911119890119878ℎ119886119903119890119889119872119890119898119900119903119910)(10) 119904119906119898 larr 00119898119886119909119897119890119899 larr 119901119905119903 119904[119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896](11) for 119894 larr 119901119905119903 119904[0] to119898119886119909119897119890119899 minus 1with 119894 += 119899119897119890119899 do(12) indexlarr 119894 + 119905119894119889(13) syncthreads()

lowastLoad V into the shared memory V 119904lowast(14) for 119895 larr 0 to 119899119897119890119899119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 minus 1 do(15) if 119894119899119889119890119909 lt 119899119897119890119899 then(16) V 119904[119905119894119889 + 119895 sdot 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896]larr V[119894119899119889119890119909](17) 119894119899119889119890119909 += 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896(18) end(19) done(20) syncthreads()

lowastPerform a scalar-style reductionlowast(21) if (119901119905119903 119904[119905119894119889 + 1] ⩽ 119894 or119901119905119903 119904[119905119894119889] gt 119894 + 119899119897119890119899 minus 1) is false then(22) 119903119900119908 119904 larr max(119901119905119903 119904[119905119894119889] minus119894 0)(23) 119903119900119908 119890 larr min(119901119905119903 119904[119905119894119889 + 1] minus 119894 119899119897119890119899)(24) for 119895 larr 119903119900119908 119904 to 119903119900119908 119890 minus 1 do(25) 119904119906119898 += V 119904[119895](26) done(27) end(28) done(29) 119910[gid] larr 119904119906119898

Algorithm 3 Kernel 2

Block grid

Block 0

Block 1

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middotmiddot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Block i

Block BG

Threads in the ith block

Shared memory Global memoryThread 0

Thread 0

Thread 0

Thread RT

Thread TB minus 1

Thread TB minus 1

v_s[0]

v_s[TB minus 1]

v_s[j lowast TB + 0]

v_s[m lowast TB + 0]

Note thatRS = ptr_s[TB] minus ptr_s[0]m = [RSTB]RT = RS minus m lowast TBPT = ptr_s[0]

v[PT + 0]

v[PT + TB minus 1]

v[PT + j lowast TB]

v[PT + j lowast TB + TB minus 1]

v[PT + m lowast TB]

v[PT + RS minus 1]

v_s[j lowast TB + TB minus 1]

v_s[RS minus 1]

Figure 2 Second stage of Kernel 2

6 Mathematical Problems in Engineering

Block gridBlock 0

Block 1

middot middot middot

middot middot middotmiddot middot middot

middot middot middot

Block i

Block BG

ThreadsThread 0

Thread 1

Thread j

Thread TB minus 1

sum ptr_s[0]leileptr_s[1]v_s[i]

sum ptr_s[1]leileptr_s[2]v_s[i]

sum ptr_s[j]leileptr_s[j+1]v_s[i]

sum ptr_s[TBminus1]leileptr_s[TB]v_s[i]

Figure 3 Third stage of Kernel 2

Table 2 Properties of test matrices

Name Rows Nonzeros (nz) nzrow Descriptionepb2 25228 175027 694 Thermal problemecl32 51993 380415 732 Semiconductor devicebayer01 57735 277774 481 Chemical processg7jac200sc 59310 837936 1413 Economic problemfinan512 74752 335872 449 Economic problem2cubes sphere 101492 1647264 1623 Electromagneticstorso2 115967 1033473 891 2D3D problemFEM 3D thermal2 147900 3489300 2359 Nonlinear thermalscircuit 170998 958936 561 Circuit simulationcont-300 180895 988195 546 Optimization problemGa41As41H72 268096 18488476 6896 Pseudopotential methodF1 343791 26837113 7806 Stiffness matrixrajat24 358172 1948235 544 Circuit simulationlanguage 399130 1216334 305 Directed graphaf shell9 504855 17588845 3484 Sheet metal formingASIC 680ks 682712 2329176 341 Circuit simulationecology2 999999 4995991 500 Circuit theoryHamrle3 1447360 5514242 381 Circuit simulationthermal2 1228045 8580313 699 Unstructured FEMcage14 1505785 27130349 1801 DNA electrophoresisTransport 1602111 23500731 1467 Structural problemG3 circuit 1585478 7660826 483 Circuit simulationkkt power 2063494 12771361 619 Optimization problemCurlCurl 4 2380515 26515867 1114 Model reductionmemchip 2707524 14810202 547 Circuit simulationFreescale1 3428755 18920347 552 Circuit simulation

CSR-vector in the CUSP library [5] are chosen in order toshow the effects of accessing CSR arrays in a fully coalescedmanner in PCSR CSRMV in the CUSPARSE library [4]is a representative of CSR-based SpMV algorithms on theGPU HYBMV in the CUSPARSE library [4] is a finely tunedHYB-based SpMV algorithm on the GPU and usually has abetter behavior than many existing SpMV algorithms CSR-Adaptive is a most recently proposed CSR-based algorithm[9]

We select 15 sparse matrices with distinct sizes rangingfrom 25228 to 2063494 as our test matrices Figure 4shows the single-precision and double-precision perfor-mance results in terms of GFlops of CSR-scalar CSR-vector CSRMVHYBMVCSR-Adaptive andPCSRon aTesla

C2050 GFlops values in Figure 4 are calculated on the basisof the assumption of two Flops per nonzero entry for amatrix[3 13] In Figure 5 the measured memory bandwidth resultsfor single precision and double precision are reported

521 Single Precision From Figure 4(a) we observe thatPCSR achieves high performance for all the matrices in thesingle-precision mode In most cases the performance ofover 9GFlopss can be obtained Moreover PCSR outper-forms CSR-scalar CSR-vector and CSRMV for all test casesand average speedups of 424x 218x and 162x comparedto CSR-scalar CSR-vector and CSRMV can be obtainedrespectively Furthermore PCSRhas a slightly better behaviorthan HYBMV for all the matrices except for af shell9 and

Mathematical Problems in Engineering 7

Sing

le-p

reci

sion

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(a) Single precision

Dou

ble-

prec

ision

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(b) Double precision

Figure 4 Performance of all algorithms on a Tesla C2050

Sing

le-p

reci

sion

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(a) Single precision

Dou

ble-

prec

ision

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(b) Double precision

Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050

cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive

Furthermore, PCSR has almost the best memory bandwidth utilization among all algorithms for all the matrices except for af_shell9 and cont-300 (Figure 5(a)). The maximum memory bandwidth of PCSR exceeds 128 GByte/s, which is about 90 percent of the peak theoretical memory bandwidth of the Tesla C2050. Based on the performance metrics [26], we can conclude that PCSR achieves good performance and has high parallelism.
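For readers who want to reproduce such bandwidth figures, the sketch below shows one common way of counting the bytes a CSR SpMV moves and converting a measured time into an effective bandwidth. This is our assumption about a typical accounting (the function name and byte counts are illustrative), not the exact definition used for Figure 5.

    #include <stddef.h>

    /* Hypothetical sketch: estimate the effective bandwidth (GByte/s) of one CSR SpMV. */
    double effective_bandwidth_gbs(size_t rows, size_t nnz, size_t val_bytes, double time_s)
    {
        /* val_bytes: 4 for single precision, 8 for double precision */
        const size_t idx_bytes = 4;                      /* 32-bit column indices and row pointers */
        size_t bytes = nnz * (val_bytes + idx_bytes)     /* CSR values and column indices */
                     + (rows + 1) * idx_bytes            /* CSR row-pointer array */
                     + nnz * val_bytes                    /* gathered entries of the input vector x */
                     + rows * val_bytes;                  /* written entries of the output vector y */
        return (double)bytes / (time_s * 1e9);            /* time_s is the measured time in seconds */
    }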

5.2.2. Double Precision. From Figures 4(b) and 5(b), we see that, for all algorithms, both the double-precision performance and the memory bandwidth utilization are lower than the corresponding single-precision values due to the slow software-based operations. PCSR is still better than CSR-scalar, CSR-vector, and CSRMV and slightly outperforms HYBMV and CSR-Adaptive for all the matrices. The average speedup of PCSR is 3.33x compared to CSR-scalar, 1.98x compared to CSR-vector, 1.57x compared to CSRMV, 1.15x compared to HYBMV, and 1.03x compared to CSR-Adaptive. The maximum memory bandwidth of PCSR exceeds 108 GByte/s, which is about 75 percent of the peak theoretical memory bandwidth of the Tesla C2050.

5.3. Multiple GPUs

5.3.1. PCSR Performance without Communication. Here we take the double-precision mode, for example, to test the PCSR performance on multiple GPUs without considering communication.

Figure 6: Visualization of the (a) cont-300 and (b) af_shell9 matrices.

Table 3: Comparison of PCSR-I and PCSR-II without communication on two GPUs (ET in ms; PE in %).

Matrix | ET (1 GPU) | PCSR-I ET | PCSR-I SD | PCSR-I PE | PCSR-II ET | PCSR-II SD | PCSR-II PE
2cubes_sphere | 0.4444 | 0.2670 | 0.0178 | 83.21 | 0.2640 | 0.0156 | 84.17
scircuit | 0.3484 | 0.2413 | 0.0322 | 72.20 | 0.2250 | 0.0207 | 77.41
Ga41As41H72 | 4.2387 | 2.3084 | 0.0446 | 91.81 | 2.3018 | 0.0432 | 92.07
F1 | 6.5544 | 3.8865 | 0.7012 | 84.32 | 3.5710 | 0.2484 | 91.77
ASIC_680ks | 0.8196 | 0.4567 | 0.0126 | 89.72 | 0.4566 | 0.0021 | 89.74
ecology2 | 1.2321 | 0.6665 | 0.0140 | 92.42 | 0.6654 | 0.0152 | 92.58
Hamrle3 | 1.7684 | 0.9651 | 0.0478 | 91.61 | 0.9208 | 5.00E-05 | 96.02
thermal2 | 2.0708 | 1.0559 | 0.0056 | 98.06 | 1.0558 | 0.0045 | 98.06
cage14 | 5.9177 | 3.4757 | 0.5417 | 85.13 | 3.1548 | 0.0458 | 93.78
Transport | 4.7305 | 2.4665 | 0.0391 | 95.89 | 2.4655 | 0.0407 | 95.93
G3_circuit | 1.9731 | 1.0485 | 0.0364 | 94.08 | 1.1061 | 0.1148 | 89.18
kkt_power | 4.3465 | 2.7916 | 0.7454 | 77.85 | 2.2252 | 0.0439 | 97.66
CurlCurl_4 | 5.1605 | 2.7107 | 0.0347 | 95.18 | 2.7075 | 0.0244 | 95.30
memchip | 3.8257 | 2.1905 | 0.3393 | 87.32 | 2.0975 | 0.2175 | 91.19
Freescale1 | 5.0524 | 3.0235 | 0.5719 | 83.55 | 2.8175 | 0.2811 | 89.66

We call PCSR with the first method and PCSR with the second method PCSR-I and PCSR-II, respectively. Some large-sized test matrices in Table 2 are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 3 and 4, respectively. In Tables 3 and 4, ET, SD, and PE stand for the execution time, standard deviation, and parallel efficiency, respectively; the time unit is millisecond (ms). Figures 7 and 8 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.
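The parallel efficiency values reported below are consistent with the standard definition (our restatement; the formula is not spelled out in the text):

    PE = ET(1 GPU) / (p * ET(p GPUs)),

where p is the number of GPUs. For example, for 2cubes_sphere with PCSR-II on two GPUs, 0.4444 / (2 * 0.2640) ≈ 84.2%, matching the entry in Table 3.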

On two GPUs, we observe from Table 3 and Figure 7 that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except for G3_circuit. The maximum, average, and minimum parallel efficiency of PCSR-II are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 98.06%, 88.16%, and 72.20%. Moreover, PCSR-II has a smaller standard deviation than PCSR-I for all the matrices except for ecology2, Transport, and G3_circuit. This implies that the workload balance on two GPUs achieved by the second method is better than that achieved by the first method.

On four GPUs, for both parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except for G3_circuit (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.35%, 85.14%, and 64.17%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.21%, 78.89%, and 59.94%.


Table 4: Comparison of PCSR-I and PCSR-II without communication on four GPUs (ET in ms; PE in %).

Matrix | ET (1 GPU) | PCSR-I ET | PCSR-I SD | PCSR-I PE | PCSR-II ET | PCSR-II SD | PCSR-II PE
2cubes_sphere | 0.4444 | 0.1560 | 0.0132 | 71.23 | 0.1527 | 0.0111 | 72.78
scircuit | 0.3484 | 0.1453 | 0.0262 | 59.94 | 0.1357 | 0.0130 | 64.17
Ga41As41H72 | 4.2387 | 1.6123 | 0.7268 | 65.72 | 1.3410 | 0.1846 | 79.02
F1 | 6.5544 | 2.5240 | 0.6827 | 64.92 | 1.9121 | 0.1900 | 85.69
ASIC_680ks | 0.8196 | 0.2944 | 0.0298 | 69.59 | 0.2887 | 0.0264 | 70.98
ecology2 | 1.2321 | 0.3593 | 0.0160 | 85.72 | 0.3554 | 0.0141 | 86.67
Hamrle3 | 1.7684 | 0.5114 | 0.0307 | 86.45 | 0.4775 | 0.0125 | 92.59
thermal2 | 2.0708 | 0.5553 | 0.0271 | 93.22 | 0.5546 | 0.0255 | 93.33
cage14 | 5.9177 | 1.8126 | 0.3334 | 81.62 | 1.5386 | 0.0188 | 96.15
Transport | 4.7305 | 1.2292 | 0.0270 | 96.21 | 1.2275 | 0.0158 | 96.35
G3_circuit | 1.9731 | 0.5804 | 0.0489 | 84.99 | 0.6195 | 0.0790 | 79.63
kkt_power | 4.3465 | 1.4974 | 0.5147 | 72.57 | 1.1584 | 0.0418 | 93.80
CurlCurl_4 | 5.1605 | 1.3554 | 0.0153 | 95.18 | 1.3501 | 0.0111 | 95.56
memchip | 3.8257 | 1.1439 | 0.1741 | 83.61 | 1.1175 | 0.1223 | 85.59
Freescale1 | 5.0524 | 1.7588 | 0.4039 | 71.81 | 1.4806 | 0.1843 | 85.31

Figure 7: Parallel efficiency of PCSR-I and PCSR-II without communication on two GPUs.

Particularly, for Ga41As41H72, F1, cage14, kkt_power, and Freescale1, the parallel efficiency of PCSR-II is almost 1.2 times that obtained by PCSR-I.

On the basis of the above observations, we conclude that PCSR-II achieves high performance and is, on the whole, better than PCSR-I. For PCSR on multiple GPUs, the second method is therefore our preferred one.

5.3.2. PCSR Performance with Communication. We still take the double-precision mode, for example, to test the PCSR performance on multiple GPUs when communication is considered. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively, and the same test matrices as in the above experiment are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively; the time unit is ms, and ET, SD, and PE in Tables 5 and 6 have the same meanings as in Tables 3 and 4.

Figure 8: Parallel efficiency of PCSR-I and PCSR-II without communication on four GPUs.

Figures 9 and 10 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.

On two GPUs, PCSR-I and PCSR-II have close parallel efficiency for most matrices (Figure 9 and Table 5), although PCSR-II slightly outperforms PCSR-I. The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.34%, 88.51%, and 80.44%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.05%, 86.03%, and 73.57%.

On four GPUs, for both parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly better parallel efficiency for thermal2, G3_circuit, and Hamrle3 and a slightly smaller standard deviation for thermal2, G3_circuit, ecology2, and CurlCurl_4 (Figure 10 and Table 6).


Table 5: Comparison of PCSR-I and PCSR-II with communication on two GPUs (ET in ms; PE in %).

Matrix | ET (1 GPU) | PCSR-I ET | PCSR-I SD | PCSR-I PE | PCSR-II ET | PCSR-II SD | PCSR-II PE
2cubes_sphere | 0.4444 | 0.2494 | 6.00E-04 | 89.09 | 0.2503 | 5.00E-04 | 88.75
scircuit | 0.3484 | 0.2234 | 0.0154 | 77.95 | 0.2165 | 0.0070 | 80.44
Ga41As41H72 | 4.2387 | 2.3516 | 0.0030 | 90.12 | 2.3795 | 0.0521 | 89.07
F1 | 6.5544 | 3.9252 | 0.6948 | 83.49 | 3.6076 | 0.2392 | 90.84
ASIC_680ks | 0.8196 | 0.4890 | 0.0113 | 83.80 | 0.4998 | 0.0178 | 81.99
ecology2 | 1.2321 | 0.6865 | 3.00E-04 | 89.74 | 0.6863 | 8.00E-04 | 89.76
Hamrle3 | 1.7684 | 1.0221 | 0.0209 | 86.50 | 1.0066 | 0.0170 | 87.84
thermal2 | 2.0708 | 1.1403 | 0.0230 | 90.80 | 1.1402 | 0.0203 | 90.81
cage14 | 5.9177 | 3.5756 | 0.5644 | 82.75 | 3.2244 | 0.0196 | 91.76
Transport | 4.7305 | 2.4623 | 0.0203 | 96.05 | 2.4550 | 0.0183 | 96.34
G3_circuit | 1.9731 | 1.1215 | 0.0189 | 87.96 | 1.1766 | 0.0896 | 83.84
kkt_power | 4.3465 | 2.9539 | 0.6973 | 73.57 | 2.4459 | 0.0356 | 88.85
CurlCurl_4 | 5.1605 | 2.7064 | 0.0092 | 95.34 | 2.7049 | 1.00E-03 | 95.39
memchip | 3.8257 | 2.3218 | 0.3467 | 82.39 | 2.2243 | 0.1973 | 85.99
Freescale1 | 5.0524 | 3.1216 | 0.5868 | 80.92 | 2.9367 | 0.3199 | 86.02

Table 6: Comparison of PCSR-I and PCSR-II with communication on four GPUs (ET in ms; PE in %).

Matrix | ET (1 GPU) | PCSR-I ET | PCSR-I SD | PCSR-I PE | PCSR-II ET | PCSR-II SD | PCSR-II PE
2cubes_sphere | 0.4444 | 0.1567 | 0.0052 | 70.89 | 0.1531 | 0.0028 | 72.54
scircuit | 0.3484 | 0.1544 | 0.0204 | 56.39 | 0.1495 | 0.0073 | 58.27
Ga41As41H72 | 4.2387 | 1.7157 | 0.7909 | 61.76 | 1.4154 | 0.2178 | 74.87
F1 | 6.5544 | 2.1149 | 0.3833 | 77.48 | 2.0022 | 0.1941 | 81.84
ASIC_680ks | 0.8196 | 0.3449 | 0.0187 | 59.39 | 0.3423 | 0.0147 | 59.87
ecology2 | 1.2321 | 0.4257 | 0.0048 | 72.35 | 0.4257 | 0.0056 | 72.35
Hamrle3 | 1.7684 | 0.6231 | 0.0087 | 70.95 | 0.6297 | 0.0085 | 70.21
thermal2 | 2.0708 | 0.6922 | 0.0267 | 74.78 | 0.6959 | 0.0269 | 74.39
cage14 | 5.9177 | 1.9339 | 0.3442 | 76.50 | 1.6417 | 0.0067 | 90.12
Transport | 4.7305 | 1.3323 | 0.0279 | 88.77 | 1.3217 | 0.0070 | 89.48
G3_circuit | 1.9731 | 0.7234 | 0.0408 | 68.19 | 0.7458 | 0.0620 | 66.14
kkt_power | 4.3465 | 1.7277 | 0.5495 | 62.89 | 1.3791 | 0.0305 | 78.79
CurlCurl_4 | 5.1605 | 1.5065 | 0.0253 | 85.63 | 1.5004 | 0.8789 | 85.99
memchip | 3.8257 | 1.3804 | 0.1768 | 69.29 | 1.3051 | 0.1029 | 73.28
Freescale1 | 5.0524 | 2.0711 | 0.4342 | 60.98 | 1.8193 | 0.2262 | 69.43

The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 88.77%, 65.69%, and 56.39%.

Therefore, compared with PCSR-I and PCSR-II without communication, although the performance of PCSR-I and PCSR-II with communication decreases due to the communication overhead, both still achieve good performance. Because PCSR-II overall outperforms PCSR-I for all test matrices, the second method is, in this case as well, our preferred one for PCSR on multiple GPUs.

6. Conclusion

In this study, we propose a novel CSR-based SpMV on GPUs (PCSR). Experimental results show that our proposed PCSR on a single GPU is better than CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and a most recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance on multiple GPUs for PCSR, we present two matrix-partitioning methods to balance the workload among multiple GPUs. We observe that PCSR shows good performance with and without considering communication using the two matrix-partitioning methods; by comparison, the second method is our preferred one.

Figure 9: Parallel efficiency of PCSR-I and PCSR-II with communication on two GPUs.

Figure 10: Parallel efficiency of PCSR-I and PCSR-II with communication on four GPUs.


Next, we will further research this area and develop other novel SpMVs on GPUs. In particular, future work will apply PCSR to some well-known iterative methods and thus solve scientific and engineering problems.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.

References

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14–19, Portland, Ore, USA, November 2009.
[4] NVIDIA, CUSPARSE Library 6.5, 2015, https://developer.nvidia.com/cusparse.
[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172–1181, 2012.
[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982–2985, 2010.
[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162–1165, 2011.
[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, New Orleans, La, USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1–12, Denver, Colo, USA, November 2013.
[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115–126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244–252, 2014.
[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.
[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010, Proceedings, pp. 111–125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing: Systems & Applications, vol. 39, no. 11, pp. 737–750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "YaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107–118, February 2014.
[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121–127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessor," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105–118, 2011.
[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.
[26] NVIDIA, "CUDA C Programming Guide 6.5," 2015, http://docs.nvidia.com/cuda/cuda-c-programming-guide.



CSR-vector in the CUSP library [5] are chosen in order toshow the effects of accessing CSR arrays in a fully coalescedmanner in PCSR CSRMV in the CUSPARSE library [4]is a representative of CSR-based SpMV algorithms on theGPU HYBMV in the CUSPARSE library [4] is a finely tunedHYB-based SpMV algorithm on the GPU and usually has abetter behavior than many existing SpMV algorithms CSR-Adaptive is a most recently proposed CSR-based algorithm[9]

We select 15 sparse matrices with distinct sizes rangingfrom 25228 to 2063494 as our test matrices Figure 4shows the single-precision and double-precision perfor-mance results in terms of GFlops of CSR-scalar CSR-vector CSRMVHYBMVCSR-Adaptive andPCSRon aTesla

C2050 GFlops values in Figure 4 are calculated on the basisof the assumption of two Flops per nonzero entry for amatrix[3 13] In Figure 5 the measured memory bandwidth resultsfor single precision and double precision are reported

521 Single Precision From Figure 4(a) we observe thatPCSR achieves high performance for all the matrices in thesingle-precision mode In most cases the performance ofover 9GFlopss can be obtained Moreover PCSR outper-forms CSR-scalar CSR-vector and CSRMV for all test casesand average speedups of 424x 218x and 162x comparedto CSR-scalar CSR-vector and CSRMV can be obtainedrespectively Furthermore PCSRhas a slightly better behaviorthan HYBMV for all the matrices except for af shell9 and

Mathematical Problems in Engineering 7

Sing

le-p

reci

sion

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(a) Single precision

Dou

ble-

prec

ision

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(b) Double precision

Figure 4 Performance of all algorithms on a Tesla C2050

Sing

le-p

reci

sion

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(a) Single precision

Dou

ble-

prec

ision

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(b) Double precision

Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050

cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive

Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism

522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050

53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR

8 Mathematical Problems in Engineering

(a) cont-300 (b) af shell9

Figure 6 Visualization of the af shell9 and cont-300 matrix

Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02670 00178 8321 02640 00156 8417scircuit 03484 02413 00322 7220 02250 00207 7741Ga41As41H72 42387 23084 00446 9181 23018 00432 9207F1 65544 38865 07012 8432 35710 02484 9177ASIC 680ks 08196 04567 00126 8972 04566 00021 8974ecology2 12321 06665 00140 9242 06654 00152 9258Hamrle3 17684 09651 00478 9161 09208 500E minus 05 9602thermal2 20708 10559 00056 9806 10558 00045 9806cage14 59177 34757 05417 8513 31548 00458 9378Transport 47305 24665 00391 9589 24655 00407 9593G3 circuit 19731 10485 00364 9408 11061 01148 8918kkt power 43465 27916 07454 7785 22252 00439 9766CurlCurl 4 51605 27107 00347 9518 27075 00244 9530memchip 38257 21905 03393 8732 20975 02175 9119Freescale1 50524 30235 05719 8355 28175 02811 8966

performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively

On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are

9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod

On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889

Mathematical Problems in Engineering 9

Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01560 00132 7123 01527 00111 7278scircuit 03484 01453 00262 5994 01357 00130 6417Ga41As41H72 42387 16123 07268 6572 13410 01846 7902F1 65544 25240 06827 6492 19121 01900 8569ASIC 680ks 08196 02944 00298 6959 02887 00264 7098ecology2 12321 03593 00160 8572 03554 00141 8667Hamrle3 17684 05114 00307 8645 04775 00125 9259thermal2 20708 05553 00271 9322 05546 00255 9333cage14 59177 18126 03334 8162 15386 00188 9615Transport 47305 12292 00270 9621 12275 00158 9635G3 circuit 19731 05804 00489 8499 06195 00790 7963kkt power 43465 14974 05147 7257 11584 00418 9380CurlCurl 4 51605 13554 00153 9518 13501 00111 9556memchip 38257 11439 01741 8361 11175 01223 8559Freescale1 50524 17588 04039 7181 14806 01843 8531

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs

and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI

On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one

532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs

the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively

On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357

On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average

10 Mathematical Problems in Engineering

Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02494 600E minus 04 8909 02503 500119864 minus 04 8875scircuit 03484 02234 00154 7795 02165 00070 8044Ga41As41H72 42387 23516 00030 9012 23795 00521 8907F1 65544 39252 06948 8349 36076 02392 9084ASIC 680ks 08196 04890 00113 8380 04998 00178 8199ecology2 12321 06865 300119864 minus 04 8974 06863 800E minus 04 8976Hamrle3 17684 10221 00209 8650 10066 00170 8784thermal2 20708 11403 00230 9080 11402 00203 9081cage14 59177 35756 05644 8275 32244 00196 9176Transport 47305 24623 00203 9605 24550 00183 9634G3 circuit 19731 11215 00189 8796 11766 00896 8384kkt power 43465 29539 06973 7357 24459 00356 8885CurlCurl 4 51605 27064 00092 9534 27049 100E minus 03 9539memchip 38257 23218 03467 8239 22243 01973 8599Freescale1 50524 31216 05868 8092 29367 03199 8602

Table 6 Comparison of PCSRI and PCSRII with communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01567 00052 7089 01531 00028 7254scircuit 03484 01544 00204 5639 01495 00073 5827Ga41As41H72 42387 17157 07909 6176 14154 02178 7487F1 65544 21149 03833 7748 20022 01941 8184ASIC 680ks 08196 03449 00187 5939 03423 00147 5987ecology2 12321 04257 00048 7235 04257 00056 7235Hamrle3 17684 06231 00087 7095 06297 00085 7021thermal2 20708 06922 00267 7478 06959 00269 7439cage14 59177 19339 03442 7650 16417 00067 9012Transport 47305 13323 00279 8877 13217 00070 8948G3 circuit 19731 07234 00408 6819 07458 00620 6614kkt power 43465 17277 05495 6289 13791 00305 7879CurlCurl 4 51605 15065 00253 8563 15004 08789 8599memchip 38257 13804 01768 6929 13051 01029 7328Freescale1 50524 20711 04342 6098 18193 02262 6943

and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639

Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs

6 Conclusion

In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and

Mathematical Problems in Engineering 11

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs

without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one

Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017

References

[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003

[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008

[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009

[4] NVIDIA CUSPARSE Library 65 2015 httpsdevelopernvidiacomcusparse

[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom

[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012

[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010

[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011

[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014

[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013

[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013

[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012

[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010

[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014

[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011

[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012

[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and

12 Mathematical Problems in Engineering

Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010

[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014

[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013

[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014

[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988

[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993

[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008

[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011

[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011

[26] NVIDIA ldquoCUDAC Programming Guide 65rdquo 2015 httpdocsnvidiacomcudacuda-c-programming-guide

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

Mathematical Problems in Engineering 7

Sing

le-p

reci

sion

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(a) Single precision

Dou

ble-

prec

ision

perfo

rman

ce (G

Flop

s)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

181614121086420

(b) Double precision

Figure 4 Performance of all algorithms on a Tesla C2050

Sing

le-p

reci

sion

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(a) Single precision

Dou

ble-

prec

ision

band

wid

th (G

Byte

ss)

CSR-scalarCSRMVCSR-Adaptive

CSR-vectorHYBMVPCSR

epb2

ecl32

baye

r01

g7ja

c200

scfin

an512

tors

o2FE

M_3

D_t

herm

al2

cont

-300

raja

t24

lang

uage

af_s

hell9

ASI

C_680

ksH

amrle

3

ther

mal2

kkt_

pow

er

140

120

100

80

60

40

20

0

(b) Double precision

Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050

cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive

Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism

522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050

53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR

8 Mathematical Problems in Engineering

(a) cont-300 (b) af shell9

Figure 6 Visualization of the af shell9 and cont-300 matrix

Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02670 00178 8321 02640 00156 8417scircuit 03484 02413 00322 7220 02250 00207 7741Ga41As41H72 42387 23084 00446 9181 23018 00432 9207F1 65544 38865 07012 8432 35710 02484 9177ASIC 680ks 08196 04567 00126 8972 04566 00021 8974ecology2 12321 06665 00140 9242 06654 00152 9258Hamrle3 17684 09651 00478 9161 09208 500E minus 05 9602thermal2 20708 10559 00056 9806 10558 00045 9806cage14 59177 34757 05417 8513 31548 00458 9378Transport 47305 24665 00391 9589 24655 00407 9593G3 circuit 19731 10485 00364 9408 11061 01148 8918kkt power 43465 27916 07454 7785 22252 00439 9766CurlCurl 4 51605 27107 00347 9518 27075 00244 9530memchip 38257 21905 03393 8732 20975 02175 9119Freescale1 50524 30235 05719 8355 28175 02811 8966

performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively

On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are

9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod

On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889

Mathematical Problems in Engineering 9

Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01560 00132 7123 01527 00111 7278scircuit 03484 01453 00262 5994 01357 00130 6417Ga41As41H72 42387 16123 07268 6572 13410 01846 7902F1 65544 25240 06827 6492 19121 01900 8569ASIC 680ks 08196 02944 00298 6959 02887 00264 7098ecology2 12321 03593 00160 8572 03554 00141 8667Hamrle3 17684 05114 00307 8645 04775 00125 9259thermal2 20708 05553 00271 9322 05546 00255 9333cage14 59177 18126 03334 8162 15386 00188 9615Transport 47305 12292 00270 9621 12275 00158 9635G3 circuit 19731 05804 00489 8499 06195 00790 7963kkt power 43465 14974 05147 7257 11584 00418 9380CurlCurl 4 51605 13554 00153 9518 13501 00111 9556memchip 38257 11439 01741 8361 11175 01223 8559Freescale1 50524 17588 04039 7181 14806 01843 8531

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs

and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI

On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one

532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs

the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively

On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357

On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average

10 Mathematical Problems in Engineering

Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02494 600E minus 04 8909 02503 500119864 minus 04 8875scircuit 03484 02234 00154 7795 02165 00070 8044Ga41As41H72 42387 23516 00030 9012 23795 00521 8907F1 65544 39252 06948 8349 36076 02392 9084ASIC 680ks 08196 04890 00113 8380 04998 00178 8199ecology2 12321 06865 300119864 minus 04 8974 06863 800E minus 04 8976Hamrle3 17684 10221 00209 8650 10066 00170 8784thermal2 20708 11403 00230 9080 11402 00203 9081cage14 59177 35756 05644 8275 32244 00196 9176Transport 47305 24623 00203 9605 24550 00183 9634G3 circuit 19731 11215 00189 8796 11766 00896 8384kkt power 43465 29539 06973 7357 24459 00356 8885CurlCurl 4 51605 27064 00092 9534 27049 100E minus 03 9539memchip 38257 23218 03467 8239 22243 01973 8599Freescale1 50524 31216 05868 8092 29367 03199 8602

Table 6 Comparison of PCSRI and PCSRII with communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01567 00052 7089 01531 00028 7254scircuit 03484 01544 00204 5639 01495 00073 5827Ga41As41H72 42387 17157 07909 6176 14154 02178 7487F1 65544 21149 03833 7748 20022 01941 8184ASIC 680ks 08196 03449 00187 5939 03423 00147 5987ecology2 12321 04257 00048 7235 04257 00056 7235Hamrle3 17684 06231 00087 7095 06297 00085 7021thermal2 20708 06922 00267 7478 06959 00269 7439cage14 59177 19339 03442 7650 16417 00067 9012Transport 47305 13323 00279 8877 13217 00070 8948G3 circuit 19731 07234 00408 6819 07458 00620 6614kkt power 43465 17277 05495 6289 13791 00305 7879CurlCurl 4 51605 15065 00253 8563 15004 08789 8599memchip 38257 13804 01768 6929 13051 01029 7328Freescale1 50524 20711 04342 6098 18193 02262 6943

and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639

Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs

6 Conclusion

In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and

Mathematical Problems in Engineering 11

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs

without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one

Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017

References

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14–19, Portland, Ore., USA, November 2009.
[4] NVIDIA, CUSPARSE Library 6.5, 2015, https://developer.nvidia.com/cusparse.
[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172–1181, 2012.
[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982–2985, 2010.
[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162–1165, 2011.
[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, New Orleans, La., USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1–12, Denver, Colo., USA, November 2013.
[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115–126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244–252, 2014.
[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.
[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010, Proceedings, pp. 111–125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing, vol. 39, no. 11, pp. 737–750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "YaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107–118, February 2014.
[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121–127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105–118, 2011.
[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.
[26] NVIDIA, "CUDA C Programming Guide 6.5," 2015, http://docs.nvidia.com/cuda/cuda-c-programming-guide.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

Mathematical Problems in Engineering 9

Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01560 00132 7123 01527 00111 7278scircuit 03484 01453 00262 5994 01357 00130 6417Ga41As41H72 42387 16123 07268 6572 13410 01846 7902F1 65544 25240 06827 6492 19121 01900 8569ASIC 680ks 08196 02944 00298 6959 02887 00264 7098ecology2 12321 03593 00160 8572 03554 00141 8667Hamrle3 17684 05114 00307 8645 04775 00125 9259thermal2 20708 05553 00271 9322 05546 00255 9333cage14 59177 18126 03334 8162 15386 00188 9615Transport 47305 12292 00270 9621 12275 00158 9635G3 circuit 19731 05804 00489 8499 06195 00790 7963kkt power 43465 14974 05147 7257 11584 00418 9380CurlCurl 4 51605 13554 00153 9518 13501 00111 9556memchip 38257 11439 01741 8361 11175 01223 8559Freescale1 50524 17588 04039 7181 14806 01843 8531

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs

and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI

On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one

532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs

the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively

On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357

On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average

10 Mathematical Problems in Engineering

Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02494 600E minus 04 8909 02503 500119864 minus 04 8875scircuit 03484 02234 00154 7795 02165 00070 8044Ga41As41H72 42387 23516 00030 9012 23795 00521 8907F1 65544 39252 06948 8349 36076 02392 9084ASIC 680ks 08196 04890 00113 8380 04998 00178 8199ecology2 12321 06865 300119864 minus 04 8974 06863 800E minus 04 8976Hamrle3 17684 10221 00209 8650 10066 00170 8784thermal2 20708 11403 00230 9080 11402 00203 9081cage14 59177 35756 05644 8275 32244 00196 9176Transport 47305 24623 00203 9605 24550 00183 9634G3 circuit 19731 11215 00189 8796 11766 00896 8384kkt power 43465 29539 06973 7357 24459 00356 8885CurlCurl 4 51605 27064 00092 9534 27049 100E minus 03 9539memchip 38257 23218 03467 8239 22243 01973 8599Freescale1 50524 31216 05868 8092 29367 03199 8602

Table 6 Comparison of PCSRI and PCSRII with communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01567 00052 7089 01531 00028 7254scircuit 03484 01544 00204 5639 01495 00073 5827Ga41As41H72 42387 17157 07909 6176 14154 02178 7487F1 65544 21149 03833 7748 20022 01941 8184ASIC 680ks 08196 03449 00187 5939 03423 00147 5987ecology2 12321 04257 00048 7235 04257 00056 7235Hamrle3 17684 06231 00087 7095 06297 00085 7021thermal2 20708 06922 00267 7478 06959 00269 7439cage14 59177 19339 03442 7650 16417 00067 9012Transport 47305 13323 00279 8877 13217 00070 8948G3 circuit 19731 07234 00408 6819 07458 00620 6614kkt power 43465 17277 05495 6289 13791 00305 7879CurlCurl 4 51605 15065 00253 8563 15004 08789 8599memchip 38257 13804 01768 6929 13051 01029 7328Freescale1 50524 20711 04342 6098 18193 02262 6943

and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639

Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs

6 Conclusion

In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and

Mathematical Problems in Engineering 11

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs

without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one

Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017

References

[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003

[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008

[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009

[4] NVIDIA CUSPARSE Library 65 2015 httpsdevelopernvidiacomcusparse

[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom

[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012

[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010

[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011

[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014

[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013

[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013

[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012

[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010

[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014

[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011

[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012

[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and

12 Mathematical Problems in Engineering

Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010

[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014

[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013

[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014

[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988

[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993

[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008

[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011

[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011

[26] NVIDIA ldquoCUDAC Programming Guide 65rdquo 2015 httpdocsnvidiacomcudacuda-c-programming-guide

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

10 Mathematical Problems in Engineering

Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs

Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 02494 600E minus 04 8909 02503 500119864 minus 04 8875scircuit 03484 02234 00154 7795 02165 00070 8044Ga41As41H72 42387 23516 00030 9012 23795 00521 8907F1 65544 39252 06948 8349 36076 02392 9084ASIC 680ks 08196 04890 00113 8380 04998 00178 8199ecology2 12321 06865 300119864 minus 04 8974 06863 800E minus 04 8976Hamrle3 17684 10221 00209 8650 10066 00170 8784thermal2 20708 11403 00230 9080 11402 00203 9081cage14 59177 35756 05644 8275 32244 00196 9176Transport 47305 24623 00203 9605 24550 00183 9634G3 circuit 19731 11215 00189 8796 11766 00896 8384kkt power 43465 29539 06973 7357 24459 00356 8885CurlCurl 4 51605 27064 00092 9534 27049 100E minus 03 9539memchip 38257 23218 03467 8239 22243 01973 8599Freescale1 50524 31216 05868 8092 29367 03199 8602

Table 6 Comparison of PCSRI and PCSRII with communication on four GPUs

Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE

2cubes sphere 04444 01567 00052 7089 01531 00028 7254scircuit 03484 01544 00204 5639 01495 00073 5827Ga41As41H72 42387 17157 07909 6176 14154 02178 7487F1 65544 21149 03833 7748 20022 01941 8184ASIC 680ks 08196 03449 00187 5939 03423 00147 5987ecology2 12321 04257 00048 7235 04257 00056 7235Hamrle3 17684 06231 00087 7095 06297 00085 7021thermal2 20708 06922 00267 7478 06959 00269 7439cage14 59177 19339 03442 7650 16417 00067 9012Transport 47305 13323 00279 8877 13217 00070 8948G3 circuit 19731 07234 00408 6819 07458 00620 6614kkt power 43465 17277 05495 6289 13791 00305 7879CurlCurl 4 51605 15065 00253 8563 15004 08789 8599memchip 38257 13804 01768 6929 13051 01029 7328Freescale1 50524 20711 04342 6098 18193 02262 6943

and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639

Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs

6 Conclusion

In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and

Mathematical Problems in Engineering 11

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs

without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one

Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017

References

[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003

[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008

[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009

[4] NVIDIA CUSPARSE Library 65 2015 httpsdevelopernvidiacomcusparse

[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom

[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012

[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010

[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011

[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014

[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013

[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013

[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012

[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010

[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014

[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011

[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012

[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and

12 Mathematical Problems in Engineering

Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010

[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014

[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013

[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014

[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988

[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993

[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008

[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011

[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011

[26] NVIDIA ldquoCUDAC Programming Guide 65rdquo 2015 httpdocsnvidiacomcudacuda-c-programming-guide

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

Mathematical Problems in Engineering 11

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs

2cu

bes_

sphe

resc

ircui

tG

a41

As41

H72 F1

ASI

C_680

ksec

olog

y2H

amrle

3

ther

mal2

cage14

Tran

spor

tG3

_circ

uit

kkt_

pow

erCu

rlCur

l_4

mem

chip

Free

scal

e1

PCSRIPCSRII

Para

llel e

ffici

ency

10

08

06

04

02

00

Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs

without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one

Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017

References

[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003

[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008

[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009

[4] NVIDIA CUSPARSE Library 65 2015 httpsdevelopernvidiacomcusparse

[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom

[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012

[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010

[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011

[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014

[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013

[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013

[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012

[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010

[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014

[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011

[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012

[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and

12 Mathematical Problems in Engineering

Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010

[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014

[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013

[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014

[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988

[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993

[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008

[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011

[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011

[26] NVIDIA ldquoCUDAC Programming Guide 65rdquo 2015 httpdocsnvidiacomcudacuda-c-programming-guide

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 12: Research Article A Novel CSR-Based Sparse Matrix-Vector ...

12 Mathematical Problems in Engineering

