Page 1:

When Sparse Matrices Met Heterogeneous Processors: Opportunities and Challenges

Weifeng Liu Norwegian University of Science and Technology, Norway

March 10th, 2018, Tokyo, Japan

Page 2:

Background

Sparse Matrix & Heterogeneous Processor (Tightly-coupled CPU-GPU Chip)

Page 3:

Sparse matrix
•  If the majority of entries in a matrix are zeros, we can store the matrix using a sparse representation that records only the nonzero entries.
•  The Compressed Sparse Row (CSR) format is the most widely used.

A (4x4), dense view:

    0 0 1 0
    2 3 0 0
    0 0 0 0
    4 0 5 6

A (4x4), sparse, 6 nonzeros.

A (in CSR), 4x4, 6 nonzeros:

    value        = [ 1 2 3 4 5 6 ]
    column index = [ 2 0 1 0 2 3 ]
    row pointer  = [ 0 1 3 3 6 ]
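To make the CSR layout concrete, here is a minimal sketch in C (my own illustration, not code from the talk; the csr_matrix type and its field names are invented for this example):

    #include <stdio.h>

    /* A minimal CSR container; row i's nonzeros occupy the index
       range [row_ptr[i], row_ptr[i+1]) of val and col_idx. */
    typedef struct {
        int nrows, ncols, nnz;
        const double *val;      /* nonzero values, length nnz   */
        const int    *col_idx;  /* column index of each nonzero */
        const int    *row_ptr;  /* length nrows + 1              */
    } csr_matrix;

    int main(void) {
        /* The 4x4 example matrix from this slide. */
        const double val[]     = { 1, 2, 3, 4, 5, 6 };
        const int    col_idx[] = { 2, 0, 1, 0, 2, 3 };
        const int    row_ptr[] = { 0, 1, 3, 3, 6 };
        csr_matrix A = { 4, 4, 6, val, col_idx, row_ptr };

        /* Print each nonzero with its (row, column) position. */
        for (int i = 0; i < A.nrows; i++)
            for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; p++)
                printf("A(%d,%d) = %g\n", i, A.col_idx[p], A.val[p]);
        return 0;
    }

Note that row 2 of the example stores nothing at all: row_ptr[2] == row_ptr[3] marks it as empty.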

Page 4:

CPU chip

Page 5:

GPU chip

Page 6:

Loosely-coupled CPU-GPU platform

Page 7:

Tightly-coupled CPU-GPU chip

Page 8:

Experimental Setup

One APU (AMD R5-2400G) &

Four Sparse Matrix Kernels (OMP/OCL) &

956 Sparse Matrices (from SuiteSparse)

Page 9:

Testbed

AMD Ryzen 5 2400G APU (Zen CPU + Vega GPU): quad-core CPU with 4 MB L3 cache, 704-core GCN GPU, dual-channel 16 GB DDR4-2933, 46.9 GB/s bandwidth. MS Win10 :O

NVIDIA GeForce Titan X: Pascal GP102, 3584 CUDA cores, 3 MB L2 cache, 12 GB GDDR5X, 480 GB/s bandwidth. (only used for some performance comparisons)

Page 10:

Four sparse kernels

•  Sparse matrix-vector multiplication (SpMV): y = A x

•  Sparse transposition (SpTRANS): A -> A^T

•  Sparse matrix-matrix multiplication (SpGEMM): C = A x B

•  Sparse triangular solve (SpTRSV): solve L x = b for x

[figures: 4x4 worked examples of each kernel; they reappear in full on pages 13, 32, 25, and 38]

Page 11:

956 matrices from the SuiteSparse Matrix Collection

square matrices, 100K <= number of nonzeros <= 200M

956 matrices selected from 2757 matrices in total

Page 12:

Kernel 1. Sparse Matrix-Vector Multiplication

(SpMV)

Page 13:

SpMV
•  Multiply a sparse matrix A by a dense vector x, and obtain a resulting dense vector y.

    A (4x4), sparse, 6 nonzeros (the example above)
    x = (a, b, c, d), dense (4x1)

    y = A x = (1c, 2a+3b, 0, 4a+5c+6d), dense (4x1)
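Written out in C, the sequential CSR SpMV kernel is only a few lines; this is a generic sketch of mine, not the talk's benchmark code:

    /* y = A*x for A in CSR; a minimal sequential sketch. */
    void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y) {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            /* Nonzeros of row i are stored contiguously. */
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
                sum += val[p] * x[col_idx[p]];
            y[i] = sum;
        }
    }

On the example it produces y = (1c, 2a+3b, 0, 4a+5c+6d); row 2 contributes nothing because row_ptr[2] == row_ptr[3].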

Page 14:

Question, Experiment, Opportunity and Challenge

•  Sparse matrix-vector multiplication (SpMV): y = A x

Question 1: Who (CPU side or GPU side) gives the best performance?
Experiment 1: Test four SpMV methods (CSR-omp, CSR5-omp, CSR-adaptive-opencl, CSR5-opencl) on the CPU side and the GPU side.

Opportunity 1 and Challenge 1

Page 15:

Parallel SpMV (using the CSR format)
•  Assign a row block to one core of a many-core processor.

[figure: the four rows of the example matrix assigned one-to-one to cores 0-3]

Probably imbalanced computation, degraded performance on many-core processors.

Nathan Bell, Michael Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. SC '09.
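The row-block strategy is essentially what a statically scheduled OpenMP loop over the rows gives you; a sketch of mine, not necessarily the talk's CSR-omp implementation:

    #include <omp.h>

    /* Row-parallel CSR SpMV: each thread takes a contiguous block of
       rows. If a few rows hold most of the nonzeros, the threads that
       own them become stragglers -- the imbalance this slide points at. */
    void spmv_csr_omp(int nrows, const int *row_ptr, const int *col_idx,
                      const double *val, const double *x, double *y) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
                sum += val[p] * x[col_idx[p]];
            y[i] = sum;
        }
    }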

Page 16:

Parallel SpMV (using CSR-Adaptive)
•  Adaptively assign row blocks to cores for better load balancing.

[figure: rows grouped into two blocks of similar nonzero count, one per core]

But when several rows are much longer (power-law distribution), possible imbalance still exists.

Joseph L. Greathouse, Mayank Daga. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. SC '14.
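The assignment step behind this idea (grow a block of consecutive rows until it reaches a nonzero budget) can be sketched as follows; this is my own simplification, not the CSR-Adaptive code from the paper:

    /* Partition rows into blocks of at most nnz_budget nonzeros each.
       Returns the number of blocks; block b covers the row range
       [block_ptr[b], block_ptr[b+1]). A row that alone exceeds the
       budget still forms its own oversized block -- exactly the
       power-law case where imbalance survives. */
    int make_row_blocks(int nrows, const int *row_ptr,
                        int nnz_budget, int *block_ptr) {
        int nblocks = 0;
        block_ptr[0] = 0;
        for (int i = 0; i < nrows; ) {
            int start = i;
            /* Greedily add rows while the block stays under budget. */
            while (i < nrows &&
                   row_ptr[i + 1] - row_ptr[start] <= nnz_budget)
                i++;
            if (i == start) i++;  /* single row exceeding the budget */
            block_ptr[++nblocks] = i;
        }
        return nblocks;
    }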

Page 17:

Parallel SpMV (using CSR5) (our work)
•  Organize nonzeros in tiles of identical size. The design objectives include load balancing, SIMD friendliness, low preprocessing cost, and reduced storage space.

Weifeng Liu, Brian Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS '15.

[figure: nonzeros partitioned into equal-size tiles across cores 0 and 1]
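CSR5 itself is too involved to reproduce here, but the principle it builds on (partitioning nonzeros, not rows, evenly) can be sketched: split the nnz range into equal chunks and binary-search row_ptr for the row containing each chunk boundary. A simplification of mine, not the CSR5 code:

    /* Return the row containing nonzero index k, via binary search
       on row_ptr. Invariant: row_ptr[lo] <= k < row_ptr[hi]. */
    int row_of_nnz(int nrows, const int *row_ptr, int k) {
        int lo = 0, hi = nrows;
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (row_ptr[mid] <= k) lo = mid; else hi = mid;
        }
        return lo;
    }

Each of T cores then takes nonzeros [t*nnz/T, (t+1)*nnz/T), so even a single very long row is split across cores; rows that straddle a boundary produce partial sums that are combined afterwards, roughly the bookkeeping that CSR5's tile descriptors encode.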

Page 18:

SpMV on CPU side

[figure: SpMV performance of CSR-omp and CSR5-omp over the 956 matrices, ordered from uniform dist. to power-law dist.]

best `top' performance: CSR-omp wins
best `load-balanced' performance: CSR5-omp wins

Page 19:

SpMV on GPU side

[figure: SpMV performance of CSR-adaptive-ocl and CSR5-ocl over the 956 matrices, ordered from uniform dist. to power-law dist.]

best `top' performance: CSR-adaptive-ocl wins a bit
best `load-balanced' performance: CSR5-ocl wins

Page 20:

CSR5-SpMV (GPU vs. CPU)

[figure: CSR5 SpMV performance, GPU side vs. CPU side, over the 956 matrices, ordered from uniform dist. to power-law dist.]

best `overall' performance: CSR5-ocl wins

Page 21:

Best SpMV (GPU vs. CPU)

[figure: best SpMV performance, GPU side vs. CPU side, over the 956 matrices, ordered from uniform dist. to power-law dist.]

best `top' performance: CPU-best wins
best `overall' performance: GPU-best wins

matrix `dc1': CPU-best: 35 GB/s, GPU-best: 42 GB/s, Titan X CSR: 1.1 GB/s

Page 22:

SpMV (Opportunity and Challenge)

Opportunities:
•  Load balancing is still a problem on such `small' processors
•  Load-balanced methods on the APU can largely outperform naive code on a GPU
•  Load-balanced methods can be slower in many cases
•  The CPU achieves higher `top' performance
•  The GPU achieves higher `overall' performance

Challenges:
•  Understanding correlations between performance on real hardware and the sparsity structures of sparse matrices
•  Utilizing both the CPU side and the GPU side more efficiently

Page 23:

Kernel 2. Sparse Matrix-Matrix Multiplication

(SpGEMM)

Page 24:

Question, Experiment, Opportunity and Challenge

•  Sparse matrix-matrix multiplication (SpGEMM): C = A x B

Question 2: Is the unified memory (a.k.a. shared virtual memory) helpful?
Experiment 2: SpGEMM running on the GPU side but using the unified memory.

Opportunity 2 and Challenge 2

Page 25:

SpGEMM
•  Multiply a sparse matrix A by a sparse matrix B, and obtain a resulting sparse matrix C.

    A (4x4), sparse, 6 nonzeros (values 1..6, as before)
    B (4x4), sparse, 6 nonzeros (values a..f)

    C = A x B (4x4), sparse, 8 nonzeros:
        row 0: 1d, 1e
        row 1: 2a, 3b, 3c
        row 3: 4a+5e, 5d, 6f
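A row-wise (Gustavson-style) SpGEMM with a dense accumulator shows the basic data flow; this is a sketch of mine, not the talk's framework, and it assumes c_col/c_val are preallocated to an upper bound on nnz(C):

    #include <stdlib.h>

    /* C = A*B, all in CSR, all n x n. Row i of C is the sum of rows
       B(k,:) scaled by A(i,k). The dense accumulator acc makes the
       insertion positions predictable; without it, partial products
       must be inserted/fused one by one, as the next slide shows. */
    void spgemm_gustavson(int n,
            const int *a_ptr, const int *a_col, const double *a_val,
            const int *b_ptr, const int *b_col, const double *b_val,
            int *c_ptr, int *c_col, double *c_val) {
        double *acc = calloc(n, sizeof *acc);  /* dense accumulator    */
        char *occupied = calloc(n, 1);         /* which columns are set */
        int nnz = 0;
        for (int i = 0; i < n; i++) {
            c_ptr[i] = nnz;
            for (int p = a_ptr[i]; p < a_ptr[i + 1]; p++) {
                int k = a_col[p];
                for (int q = b_ptr[k]; q < b_ptr[k + 1]; q++) {
                    int j = b_col[q];
                    if (!occupied[j]) { occupied[j] = 1; c_col[nnz++] = j; }
                    acc[j] += a_val[p] * b_val[q];
                }
            }
            /* Drain the accumulator into row i of C (columns unsorted). */
            for (int p = c_ptr[i]; p < nnz; p++) {
                c_val[p] = acc[c_col[p]];
                acc[c_col[p]] = 0.0;
                occupied[c_col[p]] = 0;
            }
        }
        c_ptr[n] = nnz;
        free(acc); free(occupied);
    }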

Page 26:

SpGEMM - parallel insertion
•  Because of the compressed sparse format, the resulting nonzeros are "inserted" into C rather than "added" at predictable locations of C.

[figure: building row 3 of C = row 3 of A times B. Partial products arrive one at a time: 4a first, then 5d is inserted as a new entry, 5e is fused into the existing 4a (giving 4a+5e), and 6f is inserted, yielding row 3 of C = (4a+5e, 5d, 6f)]

Page 27:

An SpGEMM framework (our work)

Weifeng Liu, Brian Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. JPDC, 2015. (extended from IPDPS '14)
Weifeng Liu, Brian Vinter. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data. IPDPS '14.

Page 28:

An SpGEMM framework (our work)

Page 29:

SpGEMM on GPU (Unified vs. Non-unified mem)

[figure: SpGEMM performance with unified vs. non-unified memory, over matrices ordered from more "insertion" to more "fusion"]

Obvious performance gain

Page 30:

SpGEMM (Opportunity and Challenge)

Opportunity:
•  For GPU-friendly irregular problems with output of unknown size, using unified memory can bring higher performance.

Challenge:
•  Various memory configurations and a deeper memory hierarchy need more careful tuning.

Page 31:

Kernel 3. Sparse Transposition

(SpTRANS)

Page 32:

SpTRANS
•  Transpose a sparse matrix A (in CSR) to a sparse matrix B (i.e., A^T, in CSR). This is equivalent to a conversion from CSR to CSC, or vice versa.

    A (in CSR):
    value        = [ 1 2 3 4 5 6 ]
    column index = [ 2 0 1 0 2 3 ]
    row pointer  = [ 0 1 3 3 6 ]

    A^T (in CSR):
    value        = [ 2 4 3 1 5 6 ]
    column index = [ 1 3 1 0 3 3 ]
    row pointer  = [ 0 2 3 5 6 ]
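Sequentially, the conversion is the classic two-pass histogram-and-scatter; a sketch of mine, not the algorithms of the Wang et al. ICS '16 paper cited on page 34:

    #include <string.h>

    /* Convert A (CSR, m x n) to A^T (CSR), i.e., CSR -> CSC of A.
       Pass 1 counts nonzeros per column; pass 2 scatters each nonzero
       into place. The scatter writes are what a parallel version must
       protect, e.g. with atomic fetch-and-add on the cursors. */
    void sptrans(int m, int n,
                 const int *a_ptr, const int *a_col, const double *a_val,
                 int *t_ptr, int *t_col, double *t_val) {
        memset(t_ptr, 0, (n + 1) * sizeof *t_ptr);
        for (int p = 0; p < a_ptr[m]; p++)   /* pass 1: column histogram */
            t_ptr[a_col[p] + 1]++;
        for (int j = 0; j < n; j++)          /* exclusive prefix sum */
            t_ptr[j + 1] += t_ptr[j];
        for (int i = 0; i < m; i++)          /* pass 2: scatter */
            for (int p = a_ptr[i]; p < a_ptr[i + 1]; p++) {
                int dst = t_ptr[a_col[p]]++;
                t_col[dst] = i;
                t_val[dst] = a_val[p];
            }
        /* Restore t_ptr (the cursors advanced during the scatter). */
        for (int j = n; j > 0; j--) t_ptr[j] = t_ptr[j - 1];
        t_ptr[0] = 0;
    }

On the example matrix this reproduces exactly the A^T arrays above: value = (2 4 3 1 5 6), column index = (1 3 1 0 3 3), row pointer = (0 2 3 5 6).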

Page 33:

Question, Experiment, Opportunity and Challenge

•  Sparse transposition (SpTRANS): A -> A^T
•  Sparse triangular solve (SpTRSV): solve L x = b for x

Question 3: Can atomic operations work well?
Experiment 3: SpTRANS and SpTRSV on the GPU side.

Opportunity 3 and Challenge 3

Page 34:

SpTRANS
•  Transpose a sparse matrix A (in CSR) to a sparse matrix B (i.e., A^T, in CSR): the same CSR-to-CSC conversion example as on page 32, here implemented with atomic operations.

Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng. Parallel Transposition of Sparse Data Structures. ICS '16.
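One natural place for atomics is the per-column cursor in the scatter pass of the transpose sketched on page 32; a hedged OpenMP version of mine (the paper's GPU algorithms differ):

    #include <omp.h>

    /* Parallel scatter pass of the CSR -> CSC transpose: each thread
       claims destination slots with an atomic fetch-and-add on the
       per-column cursor cur (initialized to the column start offsets).
       Note: within a column, nonzeros may now land in any row order. */
    void sptrans_scatter_atomic(int m, const int *a_ptr, const int *a_col,
                                const double *a_val, int *cur,
                                int *t_col, double *t_val) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < m; i++)
            for (int p = a_ptr[i]; p < a_ptr[i + 1]; p++) {
                int dst;
                #pragma omp atomic capture
                dst = cur[a_col[p]]++;
                t_col[dst] = i;
                t_val[dst] = a_val[p];
            }
    }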

Page 35:

Performance on GPU (dGPU vs. iGPU)

dGPU has much higher performance

[figure: SpTRANS performance over matrices ordered from more "dense" to more "sparse"]

Page 36:

B/W utilization on GPU (iGPU vs. dGPU)

iGPU has much higher bandwidth utilization

[figure: SpTRANS bandwidth utilization over matrices ordered from more "dense" to more "sparse"]

Page 37:

Kernel 4. Sparse Triangular Solve

(SpTRSV)

Page 38:

SpTRSV
•  Compute a dense solution vector x from a sparse linear system Lx=b, where L is a square lower triangular sparse matrix, and b is a dense vector.

    L (4x4), nnz(L) = 6, known:

        1 0 0 0
        0 1 0 0
        3 2 1 0
        0 0 0 1

    x = (x0, x1, x2, x3), dense (4x1), unknown
    b = (a, b, c, d), dense (4x1), known

    system:
        1*x0 = a
        1*x1 = b
        3*x0 + 2*x1 + 1*x2 = c
        1*x3 = d

    solution:
        x0 = a
        x1 = b
        x2 = c - 3a - 2b
        x3 = d
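Sequentially this is plain forward substitution over the CSR rows; a minimal sketch of mine, assuming each row stores its diagonal entry last (as in the example):

    /* Solve L*x = b by forward substitution, L lower triangular in CSR.
       The loop-carried dependence on earlier x entries is what makes
       SpTRSV much harder to parallelize than SpMV. */
    void sptrsv_csr(int n, const int *row_ptr, const int *col_idx,
                    const double *val, const double *b, double *x) {
        for (int i = 0; i < n; i++) {
            double sum = b[i];
            int p = row_ptr[i];
            for (; p < row_ptr[i + 1] - 1; p++)  /* off-diagonal entries */
                sum -= val[p] * x[col_idx[p]];
            x[i] = sum / val[p];                 /* divide by the diagonal */
        }
    }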

Page 39:

Question, Experiment, Opportunity and Challenge (repeated from page 33)

•  Sparse transposition (SpTRANS): A -> A^T
•  Sparse triangular solve (SpTRSV): solve L x = b for x

Question 3: Can atomic operations work well?
Experiment 3: SpTRANS and SpTRSV on the GPU side.

Opportunity 3 and Challenge 3

Page 40:

Sync-free method for SpTRSV (our work)
•  All cores (one per column, each column treated as a node) are activated: some compute the solution, while the others busy-wait on spin-locks until they are unlocked to start.

    Nodes 0-3:
    In-degree:  1 1 3 1
    Out-degree: 2 2 1 1

    CSR row_ptr:    [ 0 1 2 5 6 ]
    CSC column_ptr: [ 0 2 4 5 6 ]

Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, Brian Vinter. Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides. CCPE, 2017. (extended from Euro-Par '16)
Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, Brian Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. Euro-Par '16.

atomic operations
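The sync-free scheme can be approximated on a CPU with per-row dependence counters and OpenMP 5.0 seq_cst atomics; a rough sketch of mine, not the papers' GPU implementations:

    #include <omp.h>

    /* dep[i] counts the unresolved off-diagonal entries of row i.
       Each thread spins until its row is ready, solves it, then
       atomically retires that column from the dependent rows. L is
       given in both CSR (rows end with the diagonal) and CSC (columns
       start with the diagonal). With schedule(static,1) each thread
       runs its rows in ascending order, so the spins cannot deadlock. */
    void sptrsv_syncfree(int n, const int *row_ptr, const int *col_idx,
                         const double *val, const int *csc_ptr,
                         const int *csc_row, const double *b,
                         double *x, int *dep) {
        for (int i = 0; i < n; i++)
            dep[i] = row_ptr[i + 1] - row_ptr[i] - 1;
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < n; i++) {
            int pending;
            do {                               /* spin on the in-degree */
                #pragma omp atomic read seq_cst
                pending = dep[i];
            } while (pending != 0);
            double sum = b[i];
            int p = row_ptr[i];
            for (; p < row_ptr[i + 1] - 1; p++)
                sum -= val[p] * x[col_idx[p]];
            x[i] = sum / val[p];               /* divide by the diagonal */
            for (int q = csc_ptr[i] + 1; q < csc_ptr[i + 1]; q++) {
                #pragma omp atomic seq_cst     /* unlock dependent rows */
                dep[csc_row[q]]--;
            }
        }
    }

On the example, nodes 0, 1, and 3 start with dep == 0 and solve immediately; node 2 spins until its counter drops from 3 to 0.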

Page 41:

Performance on GPU (dGPU vs. iGPU)

dGPU has much higher performance

[figure: SpTRSV performance over matrices ordered from more "levels" / fewer "nodes per level" to fewer "levels" / more "nodes per level"]

Page 42:

B/W utilization on GPU (iGPU vs. dGPU)

iGPU has much higher bandwidth utilization

[figure: SpTRSV bandwidth utilization over matrices ordered from more "levels" / fewer "nodes per level" to fewer "levels" / more "nodes per level"]

Page 43:

SpTRANS and SpTRSV (Opportunity and Challenge)

Opportunity:
•  In terms of utilization, using more atomic operations looks more promising on APUs, though overall performance is not yet competitive with dGPUs.

Challenge:
•  Possibly converting/eliminating more (global) synchronizations into atomic operations for overall higher performance.

Page 44:

Conclusion

•  The emergence of heterogeneous processors brings a larger algorithm design and tuning space.

•  Based on a large set of experimental results, some opportunities are observed.

•  New exciting challenges are coming.

Page 45:

Thank you!

Any questions?