Optimal Algorithm Selection of Parallel Sparse Matrix-Vector
Multiplication Is Important
Makoto Kudoh*1, Hisayasu Kuroda*1,
Takahiro Katagiri*2, Yasumasa Kanada*1
*1 The University of Tokyo
*2 PRESTO, Japan Science and Technology Corporation
Introduction
Sparse Matrix-Vector Multiplication (SpMxV)
(A is a sparse matrix, x is a dense vector)
Basic computational kernel used in scientific computations
- e.g., iterative solvers for linear systems, eigenvalue problems
Ax  (A ∈ R^(n×n), x ∈ R^n)
Large-scale SpMxV problems require parallel processing
Parallel Sparse Matrix-Vector Multiplication
Calculation of Parallel Sparse Matrix-Vector Multiplication
Two-phase computation: data communication and local computation
[Figure: row block distribution of y = Ax across PE0-PE3; each PE's block is stored in compressed sparse row (CSR) format with rowptr, colind, and value arrays.]
[Figure: phase 1, vector data communication among PE0-PE3; phase 2, local computation of each PE's row block times x.]
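To make the local computation phase concrete, the following is a minimal CSR SpMxV kernel in C (the document's implementation language). The array names follow the figure above, but the routine itself is an illustrative sketch, not the authors' actual code.

/* Local computation phase: multiply this PE's row block (CSR format)
 * by the gathered vector x. Illustrative sketch only. */
void spmxv_local(int nrows, const int *rowptr, const int *colind,
                 const double *value, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        /* Indirect access: colind[j] picks scattered elements of x. */
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += value[j] * x[colind[j]];
        y[i] = sum;
    }
}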
Optimization of Parallel SpMxV
Many optimization algorithms for SpMxV have been proposed
BUT: the effect depends strongly on the non-zero structure of the matrix and the machine's architecture
Optimal algorithm selection is important
Poor performance compared with dense matrix kernels:
- Increased memory references to matrix data, caused by indirect access
- Irregular memory access pattern to vector x
Related Work
Library approach: PSPARSLIB, PETSc, ILIB, etc.
- Fixed optimization algorithm
- Works on parallel systems
Compiler approach: SPARSITY, sparse compiler, etc.
- Generates code optimized for the matrix and machine
- Does not work on parallel systems
The purpose of our work
- Include several algorithms for local computation and data communication
- Measure the performance of each algorithm exhaustively and select the best one for the matrix and machine
- Algorithm selection time is not a concern
Compare the performance of the best algorithm for each matrix and machine (our program) with that of a fixed algorithm used for all matrices and machines
Optimization algorithms of our program
Algorithms implemented in our routine:
Local computation
- Register Blocking
- Diagonal Blocking
- Unrolling
Data communication
- Allgather Communication
- Range-Limited Communication
- Minimum Data Size Communication
Register Blocking (Local Computation 1/3)
Extract small dense blocks and make a blocked matrix
- Reduces the number of load instructions
- Increases temporal locality of access to the source vector
Abbreviate m×n register blocking as Rmxn: R1x2, R1x3, R1x4, R2x1, R2x2, R2x3, R2x4, R3x1, R3x2, R3x3, R3x4, R4x1, R4x2, R4x3, R4x4
[Figure: original matrix = blocked matrix + remaining matrix]
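As an illustration, an R2x2 kernel might look like the sketch below. It assumes a hypothetical blocked layout (browptr, bcolind, bvalue) in which each 2x2 block stores four values under a single column index; it is not the authors' implementation.

/* Sketch of an R2x2 kernel over the blocked part of the matrix.
 * One column index per block; x[col], x[col+1] stay in registers. */
void spmxv_r2x2(int nblockrows, const int *browptr, const int *bcolind,
                const double *bvalue, const double *x, double *y)
{
    for (int bi = 0; bi < nblockrows; bi++) {
        double y0 = 0.0, y1 = 0.0;
        for (int b = browptr[bi]; b < browptr[bi + 1]; b++) {
            const double *v = &bvalue[4 * b];   /* 2x2 block, row-major */
            double x0 = x[bcolind[b]], x1 = x[bcolind[b] + 1];
            y0 += v[0] * x0 + v[1] * x1;
            y1 += v[2] * x0 + v[3] * x1;
        }
        y[2 * bi]     += y0;   /* the remaining matrix is added separately */
        y[2 * bi + 1] += y1;
    }
}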
Diagonal Blocking (Local Computation 2/3)
For matrices with a dense non-zero structure around the diagonal
Block the diagonal part and treat it as a dense band matrix
- Reduces the number of load instructions
- Optimizes register and cache access
Abbreviate size-n diagonal blocking as Dn: D3, D5, D7, D9, D11, D13, D15, D17, D19
[Figure: original matrix = blocked matrix + remaining matrix]
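A Dn band kernel might look like the following sketch, assuming a hypothetical dense row-major band array (explicit zeros included) and leaving boundary rows and out-of-band entries to the remaining matrix.

/* Sketch of a Dn kernel, n = 2*hw + 1: the band around the diagonal is
 * stored densely, row by row. Interior rows only; boundary rows are
 * handled by the remaining matrix in this sketch. */
void spmxv_diag_band(int nrows, int hw, const double *band,
                     const double *x, double *y)
{
    int w = 2 * hw + 1;                       /* band width n */
    for (int i = hw; i < nrows - hw; i++) {
        double sum = 0.0;
        for (int k = 0; k < w; k++)           /* contiguous access to x */
            sum += band[i * w + k] * x[i - hw + k];
        y[i] += sum;
    }
}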
Unrolling (Local Computation 3/3)
Just unroll the inner loop
- Reduces loop overhead
- Exploits instruction-level parallelism
Abbreviate unrolling level n as Un: U1, U2, U3, U4, U5, U6, U7, U8
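For example, a U4 variant of the CSR kernel might look like this illustrative sketch, with a scalar cleanup loop for the leftover non-zeros of each row.

/* Sketch of the U4 variant: CSR inner loop unrolled four times. */
void spmxv_u4(int nrows, const int *rowptr, const int *colind,
              const double *value, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        int j = rowptr[i], end = rowptr[i + 1];
        for (; j + 4 <= end; j += 4)          /* unrolled body */
            sum += value[j]     * x[colind[j]]
                 + value[j + 1] * x[colind[j + 1]]
                 + value[j + 2] * x[colind[j + 2]]
                 + value[j + 3] * x[colind[j + 3]];
        for (; j < end; j++)                  /* remainder */
            sum += value[j] * x[colind[j]];
        y[i] = sum;
    }
}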
Allgather Communication (data communication 1/3)
Each processor sends all vector data to all other processors
Easy to implement (with MPI_Allgather)
Drawback: the communication data size is very large
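A minimal sketch of this scheme using MPI_Allgather, assuming every PE holds an equal-sized block of x (the general case with uneven blocks would use MPI_Allgatherv):

#include <mpi.h>

/* Each PE contributes its local block of x and receives the whole vector. */
void gather_vector(const double *x_local, int n_local, double *x_full)
{
    MPI_Allgather((void *)x_local, n_local, MPI_DOUBLE,
                  x_full, n_local, MPI_DOUBLE, MPI_COMM_WORLD);
}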
Range-limited Communication (data communication 2/3)
- Send only the minimum contiguous required block; unnecessary processor pairs do not communicate
- Small CPU-time overhead, since data rearrangement is unnecessary
- Drawback: the communication data size is not minimal for most matrices
[Figure: PE0 sends a contiguous block of its vector directly to PE1]
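A sketch of one pairwise exchange in this scheme, sending the contiguous range directly from the vector with no packing; the names (send_lo, recv_cnt, etc.) are hypothetical.

#include <mpi.h>

/* Exchange contiguous vector ranges with one peer (Irecv posted first). */
void exchange_range(const double *x, int send_lo, int send_cnt,
                    double *x_full, int recv_lo, int recv_cnt, int peer)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Irecv(x_full + recv_lo, recv_cnt, MPI_DOUBLE, peer, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend((void *)(x + send_lo), send_cnt, MPI_DOUBLE, peer, 0,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);
}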
Minimum Data Size Communication (data communication 3/3)
- Communicate only the required elements; needs 'pack' and 'unpack' operations before and after communication
- The communication data size is minimal, but the 'pack' and 'unpack' operations add a little CPU-time overhead
[Figure: the sender packs the required vector elements into a send buffer; the receiver unpacks the received buffer into its vector copy]
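A sketch of one pairwise exchange with pack and unpack, in the IrecvIsend order; the index arrays send_idx and recv_idx are hypothetical names for the lists of elements the peer needs and this PE needs.

#include <mpi.h>

/* Exchange only the required elements with one peer. */
void exchange_min(const double *x, double *x_recv,
                  const int *send_idx, int nsend, double *sendbuf,
                  const int *recv_idx, int nrecv, double *recvbuf,
                  int peer)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Irecv(recvbuf, nrecv, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    for (int k = 0; k < nsend; k++)          /* pack required elements */
        sendbuf[k] = x[send_idx[k]];
    MPI_Isend(sendbuf, nsend, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);
    for (int k = 0; k < nrecv; k++)          /* unpack into local copy */
        x_recv[recv_idx[k]] = recvbuf[k];
}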
Implementation of Communication
Use the MPI library
Three implementations of one-to-one communication: Send-Recv, Isend-Irecv, Irecv-Isend
This gives three implementations each of range-limited and minimum data size communication, for seven data communication algorithms in total:
- Allgather
- SendRecv-range, IsendIrecv-range, IrecvIsend-range
- SendRecv-min, IsendIrecv-min, IrecvIsend-min
Methodology of Selecting Optimal Algorithm
Measure the times of local computation and data communication independently
- When combined, the independently fastest pair is not necessarily the fastest overall
1. Measure the time of each data communication algorithm and select the best one
2. Combine each local computation algorithm with the best data communication algorithm, measure the time, and select the best combination
Selection is done at runtime, since the characteristics of the matrix cannot be detected until runtime
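A sketch of how this two-step selection might be coded, using MPI_Wtime for timing. The function-pointer tables are hypothetical, and a real implementation would repeat each measurement and reduce the times across PEs.

#include <mpi.h>

typedef void (*comm_fn)(void);    /* candidate communication routines */
typedef void (*local_fn)(void);   /* candidate local computation routines */

void select_best(comm_fn comm[], int nc, local_fn local[], int nl,
                 int *best_c, int *best_l)
{
    double t, tmin;
    /* Step 1: time each data communication algorithm alone. */
    tmin = 1e30;
    for (int c = 0; c < nc; c++) {
        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        comm[c]();
        t = MPI_Wtime() - t;
        if (t < tmin) { tmin = t; *best_c = c; }
    }
    /* Step 2: combine each local computation with the best communication. */
    tmin = 1e30;
    for (int l = 0; l < nl; l++) {
        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        comm[*best_c]();
        local[l]();
        t = MPI_Wtime() - t;
        if (t < tmin) { tmin = t; *best_l = l; }
    }
}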
Default Fixed Algorithms
No. | Local computation | Data communication
1   | U1                | Allgather
2   | R2x2              | Allgather
3   | U1                | IrecvIsend-min
4   | R2x2              | IrecvIsend-min

Local computation: U1 and R2x2
Data communication: Allgather and IrecvIsend-min
Experimental Environment
Name | Processor | # of PEs | Network | Compiler | Option
PC-Cluster | Pentium III 800 MHz | 8 | 100base-T Ethernet | GCC 2.95.2 | -O3
Sun Enterprise 3500 | UltraSPARC II 336 MHz | 8 | SMP | WorkShop Compilers 5.0 | -xO5
COMPAQ AlphaServer GS80 | Alpha 21264 731 MHz | 8 | SMP | Compaq C 6.3-027 | -fast
SGI 2100 | MIPS R12000 350 MHz | 8 | DSM | MIPSpro C 7.30 | -64 -O3
HITACHI HA8000-ex880 | Intel Itanium 800 MHz | 8 | SMP | Intel Itanium Compiler 5.0.1 | -O3

Language: C
Communication library: MPI (MPICH 1.2.1)
Test Matrices
From Tim Davis' sparse matrix collection:

No. | Name     | Explanation                               | Dimension | Non-zeros
1   | 3dtube   | 3-D pressure tube                         | 45,330    | 3,213,618
2   | cfd1     | Symmetric pressure matrix                 | 70,656    | 1,828,364
3   | crystk03 | FEM crystal vibration                     | 24,696    | 1,751,178
4   | venkat01 | Unstructured 2-D Euler solver             | 62,424    | 1,717,792
5   | bcsstk35 | Automobile seat frame and body attachment | 30,237    | 1,450,163
6   | cfd2     | Symmetric pressure matrix                 | 123,440   | 3,087,898
7   | ct20stif | Stiffness matrix                          | 52,329    | 2,698,463
8   | nasasrb  | Shuttle rocket booster                    | 54,870    | 2,677,324
9   | raefsky3 | Fluid-structure interaction turbulence    | 21,200    | 1,488,768
10  | pwtk     | Pressurized wind tunnel                   | 217,918   | 11,634,424
11  | gearbox  | Aircraft flap actuator                    | 153,746   | 9,080,404

[Figure: non-zero patterns of cfd1, ct20stif, and gearbox]
Result of Matrix No.2
[Figure: local computation time and communication time (msec) of def1-def4 and opt on the four platforms. Selected communication algorithms: PentiumIII-Ethernet, IrecvIsend-min; Alpha-SMP, IrecvIsend-range; MIPS-DSM, IrecvIsend-range; Itanium-SMP, IsendIrecv-range. Bars are labeled with the selected local computation algorithms (R2x4, U2, R1x3, D7, etc.).]
Result of Matrix No.7
[Figure: local computation time and communication time (msec) of def1-def4 and opt on the four platforms. Selected communication algorithms: PentiumIII-Ethernet, IsendIrecv-min; Alpha-SMP, SendRecv-min; MIPS-DSM, IrecvIsend-min; Itanium-SMP, SendRecv-min. Bars are labeled with the selected local computation algorithms (R2x3, R3x3, U1, D9, etc.).]
Result of Matrix No.11
[Figure: local computation time and communication time (msec) of def1-def4 and opt on the four platforms. Selected communication algorithms: PentiumIII-Ethernet, IsendIrecv-min; Alpha-SMP, SendRecv-min; MIPS-DSM, SendRecv-min; Itanium-SMP, SendRecv-min. Bars are labeled with the selected local computation algorithms (mostly R3x3, with D5-D15 variants).]
Summary of Experiment
Summary of speed-up (best selected algorithm vs. each fixed default):

Machine             | def1 | def2 | def3 | def4
PC-cluster          | 8.16 | 7.90 | 1.32 | 1.05
Sun Enterprise 3500 | 2.82 | 3.07 | 1.35 | 1.58
COMPAQ              | 3.56 | 3.10 | 1.59 | 1.44
SGI                 | 3.73 | 3.33 | 1.61 | 1.36
Hitachi             | 2.51 | 1.81 | 2.03 | 1.39
- The best algorithm depends strongly on the characteristics of the matrix and the machine
- Obtained at least a 1.05x speed-up compared with the fixed default algorithms
Conclusion and Future Work
Compared the performance of the best algorithm with that of typical fixed algorithms
Obtained meaningful speed-ups by selecting the best algorithm
Selecting the optimal algorithm according to the characteristics of the matrix and machine is important
Future work: create a lightweight method of selecting the algorithm; currently, selection takes time equivalent to hundreds of SpMxV executions