
Optimizing Sparse Matrix-Vector Multiplication Using Index and Value Compression

Kornilios Kourtis

[email protected]

National Technical University of Athens

Computing Systems Laboratory


Outline

Introduction and Motivation

Index Compression (CSR-DU)

Value Compression (CSR-VI)

Performance Evaluation

Conclusions

CF 08: Optimizing Sparse Matrix-Vector Multiplication Using Index and Value Compression – p.1


SpMxV

Sparse Matrices:

  Larger portion of elements are 0’s
  Efficient representation (storage and computation):
    non-zero values (nnz)
    indexing information – structure
  Formats:
    CSR, CSC, COO
    BCSR
    JD, CDS, Ellpack-Itpack

Sparse Matrix-Vector Multiplication (SpMxV):

  y = A · x, A is sparse
  important, used in a variety of applications (e.g., PDE solvers – CG, GMRES)


Compressed Sparse Row (CSR)

( 5.4  1.1  0    0    0    0   )
( 0    6.3  0    7.7  0    8.8 )
( 0    0    1.1  0    0    0   )
( 0    0    2.9  0    3.7  2.9 )
( 9.0  0    0    1.1  4.5  0   )
( 1.1  0    2.9  3.7  0    1.1 )

row_ptr : ( 0 2 5 6 9 12 16 )

col_ind : ( 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5 )

values : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )
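The three arrays above can be derived mechanically from the dense matrix. The following is a minimal sketch of that conversion, not code from the talk: `dense_to_csr` is a hypothetical helper name, the matrix is assumed to be square and stored row-major, and indices use `size_t` rather than the 32-bit indices of the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): convert a dense n x n
 * row-major matrix to CSR. The caller provides row_ptr with n + 1
 * entries and col_ind / values with room for every non-zero.
 * Returns the number of non-zeros (nnz). */
size_t dense_to_csr(const double *dense, size_t n,
                    size_t *row_ptr, size_t *col_ind, double *values)
{
    assert(dense && row_ptr && col_ind && values);

    size_t nnz = 0;
    for (size_t i = 0; i < n; i++) {
        row_ptr[i] = nnz;                 /* position where row i starts */
        for (size_t j = 0; j < n; j++) {
            double v = dense[i * n + j];
            if (v != 0.0) {               /* keep only non-zero elements */
                col_ind[nnz] = j;
                values[nnz] = v;
                nnz++;
            }
        }
    }
    row_ptr[n] = nnz;                     /* one-past-the-end sentinel */
    return nnz;
}
```

Applied to the 6 × 6 matrix above, this yields nnz = 16 and the same row_ptr, col_ind and values arrays as shown on the slide.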


CSR SpMxV

for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++)
        y[i] += values[j] * x[col_ind[j]];

row_ptr : ( 0 2 5 6 9 12 16 )

col_ind : ( 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5 )

x : ( x0 x1 x2 x3 x4 x5 )

values : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )

y : ( y0 y1 y2 y3 y4 y5 )


i = 3: row_ptr (row limits), col_ind into x (indirect access), values · x (∗), accumulation into y (Σ)
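For row i = 3 the row limits are row_ptr[3] = 6 and row_ptr[4] = 9, so the kernel reads col_ind[6..8] = (2, 4, 5) and values[6..8] = (2.9, 3.7, 2.9), giving y3 = 2.9·x2 + 3.7·x4 + 2.9·x5. A self-contained sketch of the kernel is given below. The function name `spmv_csr` and the `size_t` index type are assumptions for illustration, and unlike the slide's loop (which accumulates into y) this version overwrites y[i] with the row sum.

```c
#include <assert.h>
#include <stddef.h>

/* CSR sparse matrix-vector product y = A * x for an n x n matrix.
 * Sketch only: name and index type are illustrative choices. */
void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col_ind,
              const double *values, const double *x, double *y)
{
    assert(row_ptr && col_ind && values && x && y);

    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        /* row limits: elements of row i live in [row_ptr[i], row_ptr[i+1]) */
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += values[j] * x[col_ind[j]];   /* indirect access to x */
        y[i] = sum;
    }
}
```

Calling it with x set to the unit vector e2 (x2 = 1, all other entries 0) returns column 2 of the example matrix: (0, 0, 1.1, 2.9, 0, 2.9).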


CSR SpMxV performance

memory bandwidth is the main bottleneck (Goumas et al. PDP ’08)

spmv accesses: (N × N sparse matrix, nnz ≫ N)

Array     size   accesses   pattern      type
row_ptr   N      N          sequential   read
values    nnz    nnz        sequential   read
col_ind   nnz    nnz        sequential   read
x         N      nnz        random, ↑    read
y         N      N          sequential   write

Thus, we target working set (ws) reduction

allows better scaling for shared memory architectures

values, col_ind dominate working set


CSR SpMxV working set

$$ ws \approx \underbrace{nnz \cdot value\_size}_{\text{values}} + \underbrace{nnz \cdot index\_size}_{\text{col\_ind}} $$

32-bit indices, 64-bit values (common case)

64-bit indices, 64-bit values (∼ 1T ws size)
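As a worked instance of the working set approximation, substituting the byte sizes for the two index widths (4 or 8 bytes per index, 8 bytes per value) gives the share of each array. The percentages below are plain arithmetic on that formula, not figures taken from the slides.

```latex
\begin{align*}
ws_{32i\text{-}64v} &\approx nnz \cdot 8 + nnz \cdot 4 = 12 \cdot nnz \ \text{bytes}
  && \text{values: } 8/12 \approx 67\%,\ \text{col\_ind: } 4/12 \approx 33\% \\
ws_{64i\text{-}64v} &\approx nnz \cdot 8 + nnz \cdot 8 = 16 \cdot nnz \ \text{bytes}
  && \text{values: } 8/16 = 50\%,\ \text{col\_ind: } 8/16 = 50\%
\end{align*}
```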


Objective

Explore the design space for accelerating SpMxV using working set reduction techniques

Propose two methods (index / value compression)

Evaluate on a rich matrix set

Investigate issues, identify trade-offs

Explore future directions


Compression Methods


Methods overview

Compression ⇒ trade computation for data size

data size reduction is not enough (SpMxV run-time)

Index Compression: CSR-DU
  general
  coarse-grain delta encoding for column indices

Value Compression: CSR-VI
  specialized
  exploits large number of common values


Index Compression

Blocking methods (BCSR, VBR)
  per block indexing ⇒ index data reduction

Delta encoding for column indices (Willcock and Lumsdaine: DCSR, RPCSR – ICS 06)

col_ind : 61311 61336 61390 61400 61428

deltas : ... 25 54 10 28

DCSR:
  byte-oriented
  6 sub-operations for implementing SpMxV
  decoding overhead → performance degradation (branches)
  patterns of frequently used groups of sub-ops
  complex, non-portable, matrix-specific
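The col_ind / deltas example above can be reproduced with a few lines of C. This is a generic delta-encoding sketch, not DCSR itself: the helper names `delta_encode` and `delta_bytes` are hypothetical, and storing the first index unchanged in deltas[0] is an assumption (the slide elides that entry).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delta-encode a sorted column-index array.
 * Assumption: deltas[0] holds the first index itself;
 * deltas[k] = col_ind[k] - col_ind[k-1] for k > 0. */
void delta_encode(const uint32_t *col_ind, size_t n, uint32_t *deltas)
{
    uint32_t prev = 0;
    for (size_t k = 0; k < n; k++) {
        assert(col_ind[k] >= prev);       /* indices must be non-decreasing */
        deltas[k] = col_ind[k] - prev;
        prev = col_ind[k];
    }
}

/* Smallest width in bytes (1, 2 or 4) that can hold every delta. */
unsigned delta_bytes(const uint32_t *deltas, size_t n)
{
    uint32_t max = 0;
    for (size_t k = 0; k < n; k++)
        if (deltas[k] > max)
            max = deltas[k];
    return max <= UINT8_MAX ? 1u : (max <= UINT16_MAX ? 2u : 4u);
}
```

For the five indices on the slide, the deltas after the first entry are 25, 54, 10 and 28, all of which fit in one byte, whereas the absolute indices need at least two bytes each.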


CSR-DU (CSR Delta Units)

Exploit dense areas using delta encoding

Coarse-grain approach:
  matrix is partitioned into variable-length units
  each unit has a delta size
  lower compression ratio
  innermost loops without branches

Compared to DCSR:
  comparable performance
  portable, easier to implement
  suitable for matrices with large variation


CSR-DU storage format

ctl byte array replaces row_ptr, col_ind

unit contents:

field    description                    size
usize    size                           1 byte
uflags   flags (new row, delta_size)    1 byte
ujmp     initial delta                  variable length
ucis     subsequent deltas              usize · delta_size

Example: (7, 1) (7, 127) (7, 250) (7, 255) (8, 10) (8, 1021)

[4, NR|U8, 1, (126, 123, 5)]     (unit; uflags = NR|U8, ucis = (126, 123, 5))

[2, NR|U16, 10, (1011)]
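To make the unit layout concrete, here is a small sketch that encodes one row's column indices as a single unit and reproduces the two units of the example. Several details are assumptions chosen for illustration, since the slides do not specify them: the flag values `DU_NR`, `DU_U8` and `DU_U16`, a one-byte ujmp (the real field is variable length), little-endian two-byte deltas, usize counting all elements of the unit (as in the example), and the policy of one unit per row. The function name `du_encode_row` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DU_NR  0x80u  /* assumed flag bit: unit starts a new row */
#define DU_U8  0x00u  /* assumed size code: 1-byte deltas */
#define DU_U16 0x01u  /* assumed size code: 2-byte deltas, little-endian */

/* Encode the n strictly increasing column indices of one row as a
 * single unit [usize, uflags, ujmp, ucis...]. Returns bytes written.
 * Simplifications: ujmp is one byte, deltas fit in 16 bits. */
size_t du_encode_row(const uint32_t *cols, uint8_t n, uint8_t *ctl)
{
    assert(n > 0 && cols[0] <= UINT8_MAX);

    /* the largest delta decides the delta size for the whole unit */
    uint32_t max_delta = 0;
    for (uint8_t k = 1; k < n; k++) {
        assert(cols[k] > cols[k - 1]);
        uint32_t d = cols[k] - cols[k - 1];
        if (d > max_delta)
            max_delta = d;
    }
    assert(max_delta <= UINT16_MAX);
    uint8_t code = (max_delta <= UINT8_MAX) ? DU_U8 : DU_U16;

    size_t pos = 0;
    ctl[pos++] = n;                         /* usize */
    ctl[pos++] = (uint8_t)(DU_NR | code);   /* uflags */
    ctl[pos++] = (uint8_t)cols[0];          /* ujmp: initial delta */
    for (uint8_t k = 1; k < n; k++) {       /* ucis: subsequent deltas */
        uint32_t d = cols[k] - cols[k - 1];
        ctl[pos++] = (uint8_t)(d & 0xFFu);
        if (code == DU_U16)
            ctl[pos++] = (uint8_t)(d >> 8);
    }
    return pos;
}
```

Under these assumptions, row 7 of the example (columns 1, 127, 250, 255) takes 6 bytes and row 8 (columns 10, 1021) takes 5 bytes, compared with 4 bytes per element for 32-bit col_ind entries.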


CSR-DU SpMxV

Unit size trade-off:

small units: loop overhead (small rows)

large units: fewer chances for compression
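The kernel shown on this slide did not survive extraction. The following is a minimal sketch of what a CSR-DU style SpMxV loop could look like, not the author's implementation. It assumes the same illustrative encoding as before: flag values `DU_NR`, `DU_U8`, `DU_U16`, a one-byte ujmp, little-endian two-byte deltas, usize counting every element in the unit, a first unit that carries the new-row flag, and no empty rows. The function name `spmv_csr_du` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DU_NR  0x80u  /* assumed flag bit: unit starts a new row */
#define DU_U8  0x00u  /* assumed size code: 1-byte deltas */
#define DU_U16 0x01u  /* assumed size code: 2-byte deltas, little-endian */

/* y += A * x, with A given as a ctl byte array plus a values array.
 * The caller zeroes y beforehand. */
void spmv_csr_du(const uint8_t *ctl, size_t ctl_len,
                 const double *values, const double *x, double *y)
{
    size_t pos = 0;      /* read position in ctl */
    size_t v = 0;        /* next unread entry of values */
    size_t col = 0;      /* current column index */
    ptrdiff_t row = -1;  /* current row; the first DU_NR moves it to 0 */

    while (pos < ctl_len) {
        uint8_t usize = ctl[pos++];
        uint8_t uflags = ctl[pos++];
        uint8_t ujmp = ctl[pos++];

        if (uflags & DU_NR) {   /* new row: advance row, reset column */
            row++;
            col = 0;
        }
        assert(row >= 0 && usize > 0);

        col += ujmp;            /* initial delta positions the first element */
        double sum = values[v++] * x[col];

        /* One branch per unit selects the delta width; each inner loop
         * then runs without data-dependent branches. */
        if ((uflags & 0x03u) == DU_U8) {
            for (uint8_t k = 1; k < usize; k++) {
                col += ctl[pos++];
                sum += values[v++] * x[col];
            }
        } else {
            for (uint8_t k = 1; k < usize; k++) {
                col += (size_t)ctl[pos] | ((size_t)ctl[pos + 1] << 8);
                pos += 2;
                sum += values[v++] * x[col];
            }
        }
        y[row] += sum;
    }
}
```

The per-unit header handling is the loop overhead mentioned above: with very small units (short rows) it is paid for only a few multiply-adds, while very large units force a single delta size on more elements.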


Value Compression

Values:
  Typically the largest part of the ws (32i-64v)
  (more) difficult to compress:
    FP arithmetic produces rounded results
    FP format, bit 63 down to bit 0: sign, exponent (11 bit, ending at bit 52), fraction (52 bit)

significant number of matrices in our set with a small number of unique values.

feasibility metric: total-to-unique ratio (ttu = nnz / unique values)
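The ttu metric can be computed directly from the values array. A minimal sketch follows, with hypothetical helper names `count_unique` and `ttu`; it compares doubles for exact equality, which matches the idea of counting bit-identical values, and it sorts a temporary copy so the input is left untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles (ascending) */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Number of distinct values in values[0..nnz). Returns 0 on allocation failure. */
size_t count_unique(const double *values, size_t nnz)
{
    assert(values && nnz > 0);

    double *tmp = malloc(nnz * sizeof *tmp);
    if (!tmp)
        return 0;
    memcpy(tmp, values, nnz * sizeof *tmp);
    qsort(tmp, nnz, sizeof *tmp, cmp_double);

    size_t uniq = 1;
    for (size_t k = 1; k < nnz; k++)
        if (tmp[k] != tmp[k - 1])   /* a new value starts here */
            uniq++;
    free(tmp);
    return uniq;
}

/* total-to-unique ratio: ttu = nnz / unique values */
double ttu(const double *values, size_t nnz)
{
    size_t uniq = count_unique(values, nnz);
    return uniq ? (double)nnz / (double)uniq : 0.0;
}
```

For the 16 non-zeros of the small CSR example matrix there are 9 distinct values, so ttu = 16/9 ≈ 1.78.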


CSR-VI

Indirect access for values:

values: ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )

val_ind: ( 0 1 2 3 4 1 5 6 5 7 1 8 1 5 6 1 )

vals_unique: ( 5.4 1.1 6.3 7.7 8.8 2.9 3.7 9.0 4.5 )


format    values size
CSR       nnz · size_v
CSR-VI    nnz · size_vi + uvals · size_v

size_vi → smallest integer that can address uvals elements (e.g. uvals ≤ 256 ⇒ size_vi = 1 byte)
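The val_ind and vals_unique arrays can be built in one pass over values. The sketch below is an illustration under stated assumptions rather than the talk's implementation: it fixes size_vi to 1 byte (so it requires uvals ≤ 256), uses a simple linear search over the unique values found so far, and the names `csr_vi_build` and `csr_vi_values_bytes` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build val_ind (nnz entries) and vals_unique (first-occurrence order)
 * from values. Assumes at most 256 distinct values, i.e. size_vi = 1.
 * vals_unique must have room for min(nnz, 256) entries.
 * Returns uvals, the number of unique values. */
size_t csr_vi_build(const double *values, size_t nnz,
                    uint8_t *val_ind, double *vals_unique)
{
    size_t uvals = 0;
    for (size_t j = 0; j < nnz; j++) {
        /* linear search among the unique values seen so far */
        size_t u = 0;
        while (u < uvals && vals_unique[u] != values[j])
            u++;
        if (u == uvals) {                 /* first occurrence of this value */
            assert(uvals < 256);
            vals_unique[uvals++] = values[j];
        }
        val_ind[j] = (uint8_t)u;
    }
    return uvals;
}

/* Size of the value data in bytes: nnz * size_vi + uvals * size_v */
size_t csr_vi_values_bytes(size_t nnz, size_t uvals,
                           size_t size_vi, size_t size_v)
{
    return nnz * size_vi + uvals * size_v;
}
```

For the example arrays (nnz = 16, uvals = 9, size_v = 8), plain CSR stores 16 · 8 = 128 bytes of values, while CSR-VI stores 16 · 1 + 9 · 8 = 88 bytes.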


CSR-VI SpMxV

for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++) {
        val = vals_unique[val_ind[j]];
        y[i] += val * x[col_ind[j]];
    }

one memory access added (indirect)

access to vals_unique is random


Experimental Evaluation


Experimental Setup

System

Intel Core 2 Xeon (Woodcrest) @2.6 GHz, 4MB L2

64-bit linux, gcc-4.2 -O3

SpMxV Benchmark

32-bit indices, 64-bit values

128 iterations

Matrix set

start: 100 matrices (Tim Davis, SPARSITY, ...)

memory bound set M0: ws > 3/4 · L2 (77 matrices)


CSR-DU Performance

Reject small row matrices: 59 remaining matrices (85% nnz in rows with ≤ 6 elements)

Summary:

matrices            speedup (%)
total    sp > 1     avg.    min     max     dense
64       59         8.1     −8.1    18.9    35

64-bit indices: +36%


CSR-VI Performance

Reject matrices with low ttu (ttu < 5): 30 remaining matrices

Summary:

matrices            speedup (%)
total    sp > 1     avg.    min      max
30       26         21.5    −31.1    74.1


Conclusions and Future Directions

Index compression:

limited performance gain for the 32i-64v case

“pure” computation (not hard-to-predict branches)

more aggressive compression (global)

expand the “unit” concept to support more types of regularities

matrix-specific code generation

Value compression

common case: values largest part of ws

difficult (constrained regularity, nature of FP)

specialized schemes

shared memory architectures

working set reduction for other applications


EOF


CSR-DU Performance (2)

[Figure: CSR-DU speedup per matrix id, in two panels (ids 2 to 60 and 61 to 100); y-axis: speedup, 0.8 to 1.2; bars carry numeric labels]


CSR-VI Performance (2)

[Figure: CSR-VI speedup per matrix id, in two panels (ids 9 to 61 and 63 to 99); y-axis: speedup, 0.6 to 1.8; bars carry numeric labels]


CSR-VI Performance (3 – ttu)

[Figure: CSR-VI speedup (y-axis, 0.6 to 1.8) against total to unique values ratio (x-axis, logarithmic, 1 to 10000000), one point per matrix id]
