OpenSPARSE: An Open Platform for Sparse Basic Linear Algebra Subprograms
Weifeng Liu, Norwegian University of Science and Technology
Guangming Tan, Institute of Computing Technology, Chinese Academy of Sciences
Wei Xue, Tsinghua University
Hao Wang, Ohio State University
Sparse Days Meeting 2018
September 27th–28th, 2018, Toulouse, France
Outline
• A brief history of BLAS, Sparse BLAS, CombBLAS and GraphBLAS
• Recent work on optimizing sparse kernels
• Observations on performance and usage of sparse kernels
• OpenSPARSE: objective, design and preliminary results
A brief history of BLAS, Sparse BLAS, CombBLAS and GraphBLAS
Some milestones of BLAS - 1973
R. J. Hanson, F. T. Krogh, C. L. Lawson. 1973. A Proposal for Standard Linear Algebra Subprograms. Technical Report. NASA.
Some milestones of BLAS - 1988
J. J. Dongarra, J. Du Croz, S. Hammarling, R. J. Hanson. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Softw.
Some milestones of BLAS - 1990
J. J. Dongarra, J. Du Croz, S. Hammarling, I. S. Duff. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw.
Some milestones of Sparse BLAS - 1991
D. S. Dodson, R. G. Grimes, J. G. Lewis. 1991. Sparse extensions to the FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Softw.
Some milestones of Sparse BLAS - 1992/1996
S. Carney, M. A. Heroux, G. Li, K. Wu. 1996. A Revised Proposal for a Sparse BLAS Toolkit. Technical Report. SPARKER Working Note 3.
M. A. Heroux. 1992. A Proposal for a Sparse BLAS Toolkit. Technical Report. SPARKER Working Note 2.
Some milestones of Sparse BLAS - 1997
I. S. Duff, M. Marrone, G. Radicati, C. Vittoli. 1997. Level 3 basic linear algebra subprograms for sparse matrices: a user-level interface. ACM Trans. Math. Softw.
Some milestones of Sparse BLAS - 2002
I. S. Duff, M. A. Heroux, R. Pozo. 2002. An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum. ACM Trans. Math. Softw.
Some implementations of Sparse BLAS - 1994
J. Dongarra, A. Lumsdaine, X. Niu, R. Pozo, K. Remington. 1994. LAPACK Working Note 74: A Sparse Matrix Library in C++ for High Performance Architectures. Technical Report.
Some implementations of Sparse BLAS - 2000
S. Filippone, M. Colajanni. 2000. PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Trans. Math. Softw.
Some implementations of Sparse BLAS - 2002
I. S. Duff, C. Vömel. 2002. Algorithm 818: A reference model implementation of the sparse BLAS in Fortran 95. ACM Trans. Math. Softw.
Some implementations of Sparse BLAS - 2012
S. Filippone, A. Buttari. 2012. Object-Oriented Techniques for Sparse Matrix Computations in Fortran 2003. ACM Trans. Math. Softw.
Combinatorial BLAS - 2011
A. Buluç, J. R. Gilbert. 2011. The Combinatorial BLAS: design, implementation, and applications. Int. J. High Perform. Comput. Appl.
GraphBLAS - 2017
A. Buluç, T. Mattson, S. McMillan, J. Moreira, C. Yang. Design of the GraphBLAS API for C. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
SuiteSparse:GraphBLAS - 2018
T. Davis. Algorithm 9xx: SuiteSparse:GraphBLAS: graph algorithms in the language of sparse linear algebra. ACM Trans. Math. Softw. Under review.
Recent work on optimizing sparse kernels
Sparse kernels have received much attention
• Sparse matrix-vector multiplication (SpMV): A x = y
• Sparse transposition (SpTRANS): A -> A^T
• Sparse matrix-matrix multiplication (SpGEMM): A x B = C
• Sparse triangular solve (SpTRSV): solve a sparse triangular system for x
[Figure: a small worked example of each kernel]
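To make the SpMV kernel concrete, here is a minimal row-by-row sketch over the CSR format. The `csr_matrix` struct and the `csr_spmv` name are illustrative assumptions, not OpenSPARSE's interface, and the loop is the plain textbook form rather than any of the optimized variants cited in this talk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR container, used only for illustration. */
typedef struct {
    int nrows;
    int ncols;
    const int *row_ptr;    /* nrows + 1 entries; row i owns [row_ptr[i], row_ptr[i+1]) */
    const int *col_idx;    /* column index of each stored nonzero */
    const double *val;     /* value of each stored nonzero */
} csr_matrix;

/* y = A * x: one sparse dot product per row. Empty rows produce 0. */
void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (int i = 0; i < A->nrows; ++i) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k) {
            sum += A->val[k] * x[A->col_idx[k]];
        }
        y[i] = sum;
    }
}
```

The outer loop is the natural unit of parallelism, but rows of very different lengths then give unbalanced work, which is the issue the load-balanced schemes later in the talk address.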
Some recent sparse kernels - 2014
• [SpMV] J. L. Greathouse, M. Daga. Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format. SC ’14.
• [SpMV] A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, P. Sadayappan. Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications. SC ’14.
• [SpMV] A. Ashari, N. Sedaghati, J. Eisenlohr, P. Sadayappan. An Efficient Two-Dimensional Blocking Strategy for Sparse Matrix-vector Multiplication on GPUs. ICS ’14.
• [SpMV] S. Yan, C. Li, Y. Zhang, H. Zhou. yaSpMV: Yet Another SpMV Framework on GPUs. PPoPP ’14.
• [SpMV] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. Bishop. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SISC.
• [SpGEMM] W. Liu, B. Vinter. An efficient GPU general sparse matrix-matrix multiplication for irregular data. IPDPS ’14.
• [SpTRSV] J. Park, M. Smelyanskiy, N. Sundaram, P. Dubey. Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver. ISC ’14.
Some recent sparse kernels - 2015
• [SpMV] W. Liu, B. Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ’15.
• [SpMV] N. Sedaghati, T. Mu, L. N. Pouchet, et al. Automatic selection of sparse matrix representation on GPUs. ICS ’15.
• [SpMV] M. Daga, J. L. Greathouse. Structural agnostic SpMV: Adapting CSR-adaptive for irregular matrices. HiPC ’15.
• [SpMV, SpGEMM] S. Dalton, S. Baxter, D. Merrill, L. Olson. Optimizing Sparse Matrix Operations on GPUs Using Merge Path. IPDPS ’15.
• [SpGEMM] F. Gremse, A. Hofter, L. O. Schwen, F. Kiessling, U. Naumann. GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SISC.
• [SpGEMM] M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park. Parallel efficient sparse matrix-matrix multiplication on multicore platforms. ISC ’15.
• [SpGEMM] S. Dalton, L. Olson, N. Bell. Optimizing Sparse Matrix-Matrix Multiplication for the GPU. TOMS.
• [SpTRSV] H. Kabir, J. D. Booth, G. Aupy, A. Benoit, Y. Robert, P. Raghavan. STSk: A Multilevel Sparse Triangular Solution Scheme for NUMA Multicores. SC ’15.
Some recent sparse kernels - 2016
• [SpMV] Y. Zhang, S. Li, S. Yan, H. Zhou. A cross-platform SpMV framework on many-core architectures. TACO.
• [SpMV] D. Merrill, M. Garland. Merge-based parallel sparse matrix-vector multiplication. SC ’16.
• [SpGEMM] A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SISC.
• [SpGEMM] P. N. Q. Anh, R. Fan, Y. Wen. Balanced hashing and efficient gpu sparse general matrix-matrix multiplication. ICS ’16.
• [SpTRSV] W. Liu, A. Li, J. D. Hogg, I. S. Duff, B. Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. Euro-Par ’16.
• [SpTRSV] A. M. Bradley. A Hybrid Multithreaded Direct Sparse Triangular Solver. CSC ’16.
• [SpTRANS] H. Wang, W. Liu, K. Hou, W. Feng. Parallel Transposition of Sparse Data Structures. ICS ’16.
Some recent sparse kernels - 2017
• [SpMV] M. Steinberger, R. Zayer, H. P. Seidel. Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU. ICS ’17.
• [SpMV] A. Elafrou, G. Goumas, N. Koziris. Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors. ICPP ’17.
• [SpMV] J. P. Ecker, R. Berrendorf, F. Mannuss. New Efficient General Sparse Matrix Formats for Parallel SpMV Operations. Euro-Par ’17.
• [SpMV] G. Flegar, E. S. Quintana-Ortí. Balanced CSR Sparse Matrix-Vector Product on Graphics Processors. Euro-Par ’17.
• [SpMSpV] A. Azad, A. Buluç. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. IPDPS ’17.
• [SpGEMM] K. Akbudak, C. Aykanat. Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. TPDS.
• [SpGEMM] Y. Nagasaka, A. Nukada, S. Matsuoka. High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU. ICPP ’17.
• [SpGEMM] R. Kunchum, A. Chaudhry, A. Sukumaran-Rajam, Q. Niu, I. Nisa, P. Sadayappan. On improving performance of sparse matrix-matrix multiplication on GPUs. ICS ’17.
Some recent sparse kernels - 2018
• [SpMV] Y. Zhao, W. Zhou, X. Shen, G. Yiu. Overhead-Conscious Format Selection for SpMV-Based Applications. IPDPS ’18.
• [SpMV] C. Liu, B. Xie, X. Liu, W. Xue, H. Yang, X. Liu. Towards Efficient SpMV on Sunway Manycore Architectures. ICS ’18.
• [SpMV] B. Xie, J. Zhan, X. Liu, W. Gao, Z. Jia, X. He. CVR: efficient vectorization of SpMV on x86 processors. CGO ’18.
• [SpMV] A. Elafrou, V. Karakasis, T. Gkountouvas. SparseX: A Library for High-Performance Sparse Matrix-Vector Multiplication on Multicore Platforms. TOMS.
• [SpMV] Q. Sun, C. Zhang, C. Wu, J. Zhang, L. Li. Bandwidth Reduced Parallel SpMV on the SW26010 Many-Core Platform. ICPP ’18.
• [SpMV] G. Tan, J. Liu, J. Li. Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture. TOMS.
• [SpMM] C. Yang, A. Buluç, J. D. Owens. Design Principles for Sparse Matrix Multiplication on the GPU. Euro-Par ’18.
• [SpMM] C. Hong, A. Sukumaran-Rajam. Efficient sparse-matrix multi-vector product on GPUs. HPDC ’18.
Some recent sparse kernels - 2018 (cont.)
• [SpGEMM] M. Deveci, C. Trott, S. Rajamanickam. Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures. PARCO.
• [SpGEMM] J. Liu, X. He, W. Liu, G. Tan. Register-Aware Optimizations for Parallel Sparse Matrix-Matrix Multiplication. IJPP.
• [SpGEMM] F. Gremse, K. Küpper, U. Naumann. Memory-Efficient Sparse Matrix-Matrix Multiplication by Row Merging on Many-Core Architectures. SISC.
• [SpGEMM] Y. Nagasaka, S. Matsuoka, A. Azad, A. Buluç. High-performance sparse matrix-matrix products on Intel KNL and multicore architectures. ICPPW ’18.
• [SpTRSV] X. Wang, W. Liu, W. Xue, L. Wu. swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures. PPoPP ’18.
• [SpTRSV] E. Dufrechou, P. Ezzatti. A New GPU Algorithm to Compute a Level Set-Based Analysis for the Parallel Solution of Sparse Triangular Systems. IPDPS ’18.
• [SpTRSV] X. Wang, P. Xu, W. Xue, Y. Ao, C. Yang, H. Fu. A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010. ICPP ’18.
Some observations
1. Diverse performance
CSR5-based SpMV (our work)
• Organize nonzeros in tiles of identical size. The design objectives are load balancing, SIMD friendliness, low preprocessing cost, and reduced storage space.
W. Liu, B. Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ’15.
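One way to picture fixed-size tiles: cut the CSR nonzero array into chunks of equal length and record, for each chunk, the row that holds its first nonzero. The sketch below does only that step. The names `row_of_nonzero` and `build_tile_rows` and the `tile_size` parameter are illustrative assumptions. This is a simplification of the idea, not the CSR5 implementation, which stores further per-tile information that is omitted here.

```c
#include <assert.h>

/* Row that owns nonzero number `pos`: the largest r with row_ptr[r] <= pos.
   Requires 0 <= pos < row_ptr[nrows]. Empty rows are skipped correctly. */
int row_of_nonzero(const int *row_ptr, int nrows, int pos)
{
    int lo = 0, hi = nrows;  /* invariant: row_ptr[lo] <= pos < row_ptr[hi] */
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (row_ptr[mid] <= pos) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* Split the nonzeros into tiles of `tile_size` entries (the last tile may be
   shorter) and store the starting row of each tile. Returns the tile count. */
int build_tile_rows(const int *row_ptr, int nrows, int tile_size,
                    int *tile_row, int max_tiles)
{
    assert(tile_size > 0);
    int nnz = row_ptr[nrows];
    int ntiles = (nnz + tile_size - 1) / tile_size;
    assert(ntiles <= max_tiles);
    for (int t = 0; t < ntiles; ++t) {
        tile_row[t] = row_of_nonzero(row_ptr, nrows, t * tile_size);
    }
    return ntiles;
}
```

Because every tile holds the same number of nonzeros, the work per tile does not depend on how long individual rows are; the cost is one binary search per tile during preprocessing.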
Merge-based SpMV
• Both the nonzeros and the output vector entries are assigned to CTAs/processes in a balanced way.
D. Merrill, M. Garland. Merge-based Parallel Sparse Matrix-Vector Multiplication. SC ’16.
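The balanced assignment can be sketched as a merge-path decomposition: the list of row-end offsets and the list of nonzero indices are treated as two sorted sequences to be merged, and each worker gets an equal-length slice of the merge path. The code below is a sequential simulation under stated assumptions: `merge_coord`, `merge_path_search`, `merge_spmv`, and `nparts` are names chosen here, and partial row sums are accumulated directly into `y` instead of the carry-out fix-up pass a parallel version would need.

```c
#include <assert.h>

typedef struct { int row; int nz; } merge_coord;

/* Find where diagonal `diagonal` crosses the merge path between
   row_end[0..nrows) and the nonzero indices 0..nnz-1. */
merge_coord merge_path_search(int diagonal, const int *row_end, int nrows, int nnz)
{
    int lo = diagonal > nnz ? diagonal - nnz : 0;
    int hi = diagonal < nrows ? diagonal : nrows;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (row_end[mid] <= diagonal - mid - 1) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    merge_coord c = { lo, diagonal - lo };
    return c;
}

/* y = A * x with the (rows + nonzeros) work split into nparts equal slices. */
void merge_spmv(int nrows, const int *row_ptr, const int *col_idx,
                const double *val, const double *x, double *y, int nparts)
{
    assert(nparts > 0);
    const int *row_end = row_ptr + 1;
    int nnz = row_ptr[nrows];
    int total = nrows + nnz;
    int per_part = (total + nparts - 1) / nparts;

    for (int i = 0; i < nrows; ++i) y[i] = 0.0;

    for (int p = 0; p < nparts; ++p) {      /* one iteration = one worker */
        int d0 = p * per_part < total ? p * per_part : total;
        int d1 = d0 + per_part < total ? d0 + per_part : total;
        merge_coord c = merge_path_search(d0, row_end, nrows, nnz);
        merge_coord e = merge_path_search(d1, row_end, nrows, nnz);
        double sum = 0.0;
        for (; c.row < e.row; ++c.row) {    /* rows that end inside this slice */
            for (; c.nz < row_end[c.row]; ++c.nz)
                sum += val[c.nz] * x[col_idx[c.nz]];
            y[c.row] += sum;
            sum = 0.0;
        }
        for (; c.nz < e.nz; ++c.nz)         /* leading part of an unfinished row */
            sum += val[c.nz] * x[col_idx[c.nz]];
        if (c.row < nrows) y[c.row] += sum; /* sequential stand-in for carry-out */
    }
}
```

Each slice consumes the same number of path steps, where a step is either one nonzero or one row ending, so neither a very long row nor a long run of empty rows can overload a single worker.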
Diverse performance - SpMV
• CSR5 outperforms merge-spmv in double precision, but merge-spmv outperforms CSR5 in single precision.
Running 956 matrices on an NVIDIA Titan X Pascal.
[Figure panels: FP64, FP32]
Diverse performance - SpGEMM
W. Liu, B. Vinter. A Framework for General Sparse Matrix-Matrix Multiplication …
Y. Nagasaka, A. Nukada, S. Matsuoka. High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU. ICPP ’17.
M. Deveci, C. Trott, S. Rajamanickam. Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures. PARCO. 2018.
Diverse performance - SpTRSV
W. Liu, A. Li, J. D. Hogg, I. S. Duff, B. Vinter. Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides. CCPE. 2017.
Some observations
2. Libraries benefit from only a few of these kernels
Libraries benefit from only a few of these kernels
• [MAGMA-SpMV] W. Liu, B. Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ’15.
• [MAGMA-SpTRSV] W. Liu, A. Li, J. D. Hogg, I. S. Duff, B. Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. Euro-Par ’16.
• [Trilinos-SpGEMM] M. Deveci, C. Trott, S. Rajamanickam. Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures. PARCO. 2018.
• [Trilinos-SpTRSV] A. M. Bradley. A Hybrid Multithreaded Direct Sparse Triangular Solver. CSC ’16.
• [CombBLAS-SpMSpV] A. Azad, A. Buluç. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. IPDPS ’17.
• [CombBLAS-SpGEMM] A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SISC. 2016.
• [clSPARSE-SpGEMM] W. Liu, B. Vinter. An efficient GPU general sparse matrix-matrix multiplication for irregular data. IPDPS ’14.
• [GHOST-SpMV] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. Bishop. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SISC.
• [ViennaCL-SpGEMM] F. Gremse, A. Hofter, L. O. Schwen, F. Kiessling, U. Naumann. GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SISC. 2015.
• [cuSPARSE-SpMV] D. Merrill, M. Garland. Merge-based parallel sparse matrix-vector multiplication. SC ’16.
OpenSPARSE: An open platform for Sparse BLAS - objective, design and preliminary results
OpenSPARSE: Objective
Mathematical libraries: MAGMA, Trilinos, CombBLAS, GraphBLAS, clSPARSE, GHOST, ViennaCL, ……
Real-world applications
A large number of optimized sparse kernels
OpenSPARSE: To build an open platform that bridges the gap between optimized sparse kernels and mathematical libraries.
OpenSPARSE: Design
• Language: C11
• Environments: OpenMP, CUDA, OpenCL, etc.
• Kernels: defined in Sparse BLAS with sparse/dense inputs/outputs.
• Basic matrix formats: DIA, COO, ELL, CSR, CSC, etc.
• Data types: BOOL, INT8/16/32/64, FP16/32/64, COMPLEX16/32/64, etc.
• Operators: multiplication/addition and other semirings in GraphBLAS.
• Code generator: Python scripts
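The "Operators" item can be illustrated by parameterizing a kernel over a semiring, so that the same loop serves both ordinary arithmetic and graph-style operators. This is a sketch under stated assumptions: the `semiring` struct, the operator functions, and `csr_spmv_semiring` are hypothetical names and do not reflect OpenSPARSE's or GraphBLAS's actual API. A generated kernel would more likely inline the operators than call through function pointers.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical semiring descriptor: an "add", a "multiply",
   and the identity element of "add". */
typedef struct {
    double (*add)(double, double);
    double (*mul)(double, double);
    double zero;
} semiring;

double plus_op(double a, double b)  { return a + b; }
double times_op(double a, double b) { return a * b; }
double min_op(double a, double b)   { return a < b ? a : b; }

/* Conventional arithmetic, and min-plus (used for shortest-path relaxations). */
const semiring PLUS_TIMES = { plus_op, times_op, 0.0 };
const semiring MIN_PLUS   = { min_op,  plus_op,  INFINITY };

/* y = A (+.x) x over semiring s, with A stored in CSR. */
void csr_spmv_semiring(int nrows, const int *row_ptr, const int *col_idx,
                       const double *val, const double *x, double *y,
                       const semiring *s)
{
    assert(s != 0 && s->add != 0 && s->mul != 0);
    for (int i = 0; i < nrows; ++i) {
        double acc = s->zero;               /* an empty row yields the identity */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            acc = s->add(acc, s->mul(val[k], x[col_idx[k]]));
        }
        y[i] = acc;
    }
}
```

Calling it with `&PLUS_TIMES` gives the usual SpMV, while `&MIN_PLUS` computes, for each row, the minimum of `val[k] + x[col_idx[k]]` over that row's stored entries.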
OpenSPARSE: Matrix data structure
OpenSPARSE: An SpMV function
…
…
y = αAx + βy
OpenSPARSE: A complete SpMV program
…
…
OpenSPARSE: Add a new format
OpenSPARSE: Preliminary performance
• CSR5-SpMV performance in OpenSPARSE
Running 956 matrices on an NVIDIA Titan X Pascal.
Thank you!
Any Questions?
We welcome your cooperation!