Optimizing Sparse Matrix-Vector Multiplication Using Index and Value Compression
Kornilios Kourtis
National Technical University of Athens
Computing Systems Laboratory
Outline
  Introduction and Motivation
  Index Compression (CSR-DU)
  Value Compression (CSR-VI)
  Performance Evaluation
  Conclusions
CF 08: Optimizing Sparse Matrix-Vector Multiplication Using Index and Value Compression – p.1
SpMxV
Sparse Matrices:
  large portion of elements are 0's
  efficient representation (storage and computation):
    non-zero values (nnz)
    indexing information – structure
  formats: CSR, CSC, COO, BCSR, JD, CDS, Ellpack-Itpack
Sparse Matrix-Vector Multiplication (SpMxV):
  y = A · x, where A is sparse
  important, used in a variety of applications (e.g., PDE solvers – CG, GMRES)
Compressed Sparse Row (CSR)
A =
( 5.4  1.1  0    0    0    0   )
( 0    6.3  0    7.7  0    8.8 )
( 0    0    1.1  0    0    0   )
( 0    0    2.9  0    3.7  2.9 )
( 9.0  0    0    1.1  4.5  0   )
( 1.1  0    2.9  3.7  0    1.1 )
row_ptr : ( 0 2 5 6 9 12 16 )
col_ind : ( 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5 )
values : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )
CSR SpMxV

for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++)
        y[i] += values[j] * x[col_ind[j]];
row_ptr : ( 0 2 5 6 9 12 16 )
col_ind : ( 0 1 1 3 5 2 2 4 5 0 3 4 0 2 3 5 )
x : ( x0 x1 x2 x3 x4 x5 x6 )
values : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )
y : ( y0 y1 y2 y3 y4 y5 y6 )
CSR SpMxV performance
Memory bandwidth is the main bottleneck (Goumas et al., PDP '08).

SpMxV accesses (N × N sparse matrix, nnz ≫ N):

array     size  accesses  pattern     type
row_ptr   N     N         sequential  read
values    nnz   nnz       sequential  read
col_ind   nnz   nnz       sequential  read
x         N     nnz       random      read
y         N     N         sequential  write

Thus, we target working set (ws) reduction:
  allows better scaling for shared memory architectures
  values and col_ind dominate the working set
CSR SpMxV working set

  ws ≈ nnz · value_size (values) + nnz · index_size (col_ind)

32-bit indices, 64-bit values (common case)
64-bit indices, 64-bit values (∼ 1T ws size)
Objective
Explore the design space for accelerating SpMxV using working set reduction techniques
Propose two methods (index / value compression)
Evaluate on a rich matrix set
Investigate issues, identify trade-offs
Explore future directions
Compression Methods
Methods overview
Compression ⇒ trade computation for data size
  reducing data size alone is not enough; SpMxV run-time must also improve
Index Compression: CSR-DU
  general
  coarse-grain delta encoding for column indices
Value Compression: CSR-VI
  specialized
  exploits the large number of common values
Index Compression
Blocking methods (BCSR, VBR):
  per-block indexing ⇒ index data reduction
Delta encoding for column indices (Willcock and Lumsdaine: DCSR, RPCSR – ICS '06):

  col_ind : 61311 61336 61390 61400 61428
  deltas  :   ...    25    54    10    28

DCSR:
  byte-oriented
  6 sub-operations for implementing SpMxV
  decoding overhead → performance degradation (branches)
  patterns of frequently used groups of sub-ops
  complex, non-portable, matrix-specific
CSR-DU (CSR Delta Units)
Exploit dense areas using delta encoding.
Coarse-grain approach:
  matrix is partitioned into variable-length units
  each unit has a delta size
  lower compression ratio
  innermost loops without branches
Compared to DCSR:
  comparable performance
  portable, easier to implement
  suitable for matrices with large variation
CSR-DU storage format
A ctl byte array replaces row_ptr and col_ind.

Unit contents:

field   description                   size
usize   size                          1 byte
uflags  flags (new row, delta_size)   1 byte
ujmp    initial delta                 variable length
ucis    subsequent deltas             usize · delta_size

Example: elements (7,1) (7,127) (7,250) (7,255) (8,10) (8,1021)

unit 1: [usize=4, uflags=NR|U8,  ujmp=1,  ucis=(126, 123, 5)]
unit 2: [usize=2, uflags=NR|U16, ujmp=10, ucis=(1011)]
CSR-DU SpMxV

Unit size trade-off:
  small units: loop overhead (small rows)
  large units: fewer chances for compression
Value Compression
Values:
  typically the largest part of the ws (32i-64v)
  (more) difficult to compress:
    FP arithmetic produces rounded results
    FP format (64-bit): sign (bit 63), exponent (bits 62-52, 11 bits), fraction (bits 51-0, 52 bits)
A significant number of matrices in our set have a small number of unique values.
Feasibility metric: total-to-unique ratio (ttu = nnz / unique values)
CSR-VI
Indirect access for values:

values      : ( 5.4 1.1 6.3 7.7 8.8 1.1 2.9 3.7 2.9 9.0 1.1 4.5 1.1 2.9 3.7 1.1 )
val_ind     : ( 0 1 2 3 4 1 5 6 5 7 1 8 1 5 6 1 )
vals_unique : ( 5.4 1.1 6.3 7.7 8.8 2.9 3.7 9.0 4.5 )
format   values size
CSR      nnz · size_v
CSR-VI   nnz · size_vi + uvals · size_v

size_vi → smallest integer type that can address uvals elements
(e.g., uvals ≤ 256 ⇒ size_vi = 1 byte)
CSR-VI SpMxV

for (i = 0; i < N; i++)
    for (j = row_ptr[i]; j < row_ptr[i+1]; j++) {
        val = vals_unique[val_ind[j]];
        y[i] += val * x[col_ind[j]];
    }
one memory access added (indirect)
access to vals_unique is random
Experimental Evaluation
Experimental Setup
System:
  Intel Core 2 Xeon (Woodcrest) @ 2.6 GHz, 4 MB L2
  64-bit Linux, gcc-4.2 -O3
SpMxV benchmark:
  32-bit indices, 64-bit values
  128 iterations
Matrix set:
  start: 100 matrices (Tim Davis, SPARSITY, ...)
  memory-bound set M0: ws > (3/4)·L2 (77 matrices)
CSR-DU Performance
Reject small-row matrices (85% nnz in rows with ≤ 6 elements): 59 remaining matrices

Summary:

        matrices        speedup (%)
total   sp > 1   avg.   min    max    dense
64      59       8.1    −8.1   18.9   35

64-bit indices: +36%
detailed results
CSR-VI Performance
Reject matrices with low ttu (ttu < 5): 30 remaining matrices

Summary:

        matrices        speedup (%)
total   sp > 1   avg.   min     max
30      26       21.5   −31.1   74.1
detailed results
Conclusions and Future Directions
Index compression:
  limited performance gain for the 32i-64v case
  "pure" computation (no hard-to-predict branches)
  more aggressive compression (global)
  expand the "unit" concept to support more types of regularities
  matrix-specific code generation
Value compression:
  common case: values are the largest part of the ws
  difficult (constrained regularity, nature of FP)
  specialized schemes
Shared memory architectures
Working set reduction for other applications
EOF
CSR-DU Performance (2)
[per-matrix speedup bar charts, matrix ids 2-60 and 61-100; y-axis: speedup (0.8-1.2); bars annotated with per-matrix percentages]
summarized results
CSR-VI Performance (2)
[per-matrix speedup bar charts, matrix ids 9-61 and 63-99; y-axis: speedup (0.6-1.8); bars annotated with per-matrix percentages]
summarized results
CSR-VI Performance (3 – ttu)
[scatter plot: speedup (0.6-1.8) vs. total-to-unique values ratio (log scale, 1 to 10^7), one point per matrix id]
summarized results