Optimizing Sparse Matrix-Vector Multiplication Using Index and Value Compression
Kornilios Kourtis
National Technical University of Athens
Computing Systems Laboratory
Outline
  Introduction and Motivation
  Index Compression (CSR-DU)
  Value Compression (CSR-VI)
  Performance Evaluation
  Conclusions
CF 08: Optimizing Sparse Matrix-Vector Multiplication Using Index and Value Compression – p.1
SpMxV
Sparse Matrices:
  large portion of elements are 0's
  efficient representation (storage and computation):
    non-zero values (nnz)
    indexing information – structure
  formats: CSR, CSC, COO, BCSR, JD, CDS, Ellpack-Itpack
Sparse Matrix-Vector Multiplication (SpMxV):
  y = A · x, where A is sparse
  important, used in a variety of applications (e.g., PDE solvers – CG, GMRES)
Compressed Sparse Row (CSR)
A =
( 5.4  1.1  0    0    0    0   )
( 0    6.3  0    7.7  0    8.8 )
( 0    0    1.1  0    0    0   )
( 0    0    2.9  0    3.7  2.9 )
( 9.0  0    0    1.1  4.5  0   )
( 1.1  0    2.9  3.7  0    1.1 )
row_ptr : ( 0 2 5 6 9 12 16 )
col_ind : ( 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5 )
values : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )
CSR SpMxV

for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++)
        y[i] += values[j] * x[col_ind[j]];
row_ptr : ( 0 2 5 6 9 12 16 )
col_ind : ( 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5 )
x : ( x0 x1 x2 x3 x4 x5 x6 )
values : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )
y : ( y0 y1 y2 y3 y4 y5 y6 )
CSR SpMxV performance
Memory bandwidth is the main bottleneck (Goumas et al., PDP '08).

SpMxV accesses (N × N sparse matrix, nnz ≫ N):

array     size  accesses  pattern     type
row_ptr   N     N         sequential  read
values    nnz   nnz       sequential  read
col_ind   nnz   nnz       sequential  read
x         N     nnz       random      read
y         N     N         sequential  write

Thus, we target working set (ws) reduction:
  allows better scaling for shared memory architectures
  values and col_ind dominate the working set
CSR SpMxV working set

  ws ≈ nnz · value_size (values) + nnz · index_size (col_ind)

32-bit indices, 64-bit values (common case)
64-bit indices, 64-bit values (∼ 1T ws size)
Objective
Explore the design space for accelerating SpMxV using working set reduction techniques
Propose two methods (index / value compression)
Evaluate on a rich matrix set
Investigate issues, identify trade-offs
Explore future directions
Compression Methods
Methods overview
Compression ⇒ trade computation for data size
  reducing data size alone is not enough; SpMxV run-time must also improve
Index Compression: CSR-DU
  general
  coarse-grain delta encoding for column indices
Value Compression: CSR-VI
  specialized
  exploits the large number of common values
Index Compression
Blocking methods (BCSR, VBR):
  per-block indexing ⇒ index data reduction
Delta encoding for column indices (Willcock and Lumsdaine: DCSR, RPCSR – ICS '06):

  col_ind : 61311 61336 61390 61400 61428
  deltas  :   ...    25    54    10    28

DCSR:
  byte-oriented
  6 sub-operations for implementing SpMxV
  decoding overhead → performance degradation (branches)
  patterns of frequently used groups of sub-ops
  complex, non-portable, matrix-specific
CSR-DU (CSR Delta Units)
Exploit dense areas using delta encoding.
Coarse-grain approach:
  matrix is partitioned into variable-length units
  each unit has a delta size
  lower compression ratio
  innermost loops without branches
Compared to DCSR:
  comparable performance
  portable, easier to implement
  suitable for matrices with large variation
CSR-DU storage format
A ctl byte array replaces row_ptr and col_ind.

Unit contents:

field   description                   size
usize   size                          1 byte
uflags  flags (new row, delta_size)   1 byte
ujmp    initial delta                 variable length
ucis    subsequent deltas             usize · delta_size

Example: elements (7,1) (7,127) (7,250) (7,255) (8,10) (8,1021)

unit 1: [usize=4, uflags=NR|U8,  ujmp=1,  ucis=(126, 123, 5)]
unit 2: [usize=2, uflags=NR|U16, ujmp=10, ucis=(1011)]
CSR-DU SpMxV

Unit size trade-off:
  small units: loop overhead (small rows)
  large units: fewer chances for compression
Value Compression
Values:
  typically the largest part of the ws (32i-64v)
  (more) difficult to compress:
    FP arithmetic produces rounded results
    FP format (64-bit): sign (bit 63), exponent (bits 62-52, 11 bits), fraction (bits 51-0, 52 bits)
A significant number of matrices in our set have a small number of unique values.
Feasibility metric: total-to-unique ratio (ttu = nnz / unique values)
CSR-VI
Indirect access for values:

values      : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )
val_ind     : ( 0 1 2 3 4 1 5 6 5 7 1 8 1 5 6 1 )
vals_unique : ( 5.4 1.1 6.3 7.7 8.8 2.9 3.7 9.0 4.5 )
format   values size
CSR      nnz · size_v
CSR-VI   nnz · size_vi + uvals · size_v

size_vi → smallest integer type that can address uvals elements
(e.g., uvals ≤ 256 ⇒ size_vi = 1 byte)
CSR-VI SpMxV

for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++) {
        val = vals_unique[val_ind[j]];
        y[i] += val * x[col_ind[j]];
    }
one memory access added (indirect)
access to vals_unique is random
Experimental Evaluation
Experimental Setup
System:
  Intel Core 2 Xeon (Woodcrest) @ 2.6 GHz, 4 MB L2
  64-bit Linux, gcc-4.2 -O3
SpMxV benchmark:
  32-bit indices, 64-bit values
  128 iterations
Matrix set:
  start: 100 matrices (Tim Davis, SPARSITY, ...)
  memory-bound set M0: ws > (3/4)·L2 (77 matrices)
CSR-DU Performance
Reject small-row matrices (85% nnz in rows with ≤ 6 elements): 59 remaining matrices

Summary:

        matrices        speedup (%)
total   sp > 1   avg.   min    max    dense
64      59       8.1    −8.1   18.9   35

64-bit indices: +36%
detailed results
CSR-VI Performance
Reject matrices with low ttu (ttu < 5): 30 remaining matrices

Summary:

        matrices        speedup (%)
total   sp > 1   avg.   min     max
30      26       21.5   −31.1   74.1
detailed results
Conclusions and Future Directions
Index compression:
  limited performance gain for the 32i-64v case
  "pure" computation (no hard-to-predict branches)
  more aggressive compression (global)
  expand the "unit" concept to support more types of regularities
  matrix-specific code generation
Value compression:
  common case: values are the largest part of the ws
  difficult (constrained regularity, nature of FP)
  specialized schemes
Shared memory architectures
Working set reduction for other applications
EOF
CSR-DU Performance (2)
[per-matrix speedup bar charts, matrix ids 2-60 and 61-100; y-axis: speedup (0.8-1.2); bars annotated with per-matrix percentages]
summarized results
CSR-VI Performance (2)
[per-matrix speedup bar charts, matrix ids 9-61 and 63-99; y-axis: speedup (0.6-1.8); bars annotated with per-matrix percentages]
summarized results
CSR-VI Performance (3 – ttu)
[scatter plot: speedup (0.6-1.8) vs. total-to-unique values ratio (log scale, 1 to 10^7), one point per matrix id]
summarized results