Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication

Penporn Koanantakool1,2, Ariful Azad2, Aydin Buluç2, Dmitriy Morozov2,
Sang-Yun Oh2,3, Leonid Oliker2, Katherine Yelick1,2

1Computer Science Division, University of California, Berkeley
2Lawrence Berkeley National Laboratory
3University of California, Santa Barbara

May 26, 2016


Outline

• Background and Motivation

• Algorithms

– 2D SUMMA

– 3D SUMMA

– 1.5D variants

• Experimental Results: 100⨉ speedup

• Iterative Multiplication

• Conclusions


Parallel Algorithm Cost Model

• p distributed, homogeneous processors connected through a network

• Per-processor costs along critical path

• time_per_flop, time_per_message, time_per_word are machine-specific constants.

• Goal: minimize #messages (S) and #words (W)


time = #flops · time_per_flop + #messages · time_per_message + #words · time_per_word

[Figure: four CPU+DRAM nodes connected by a network, distinguishing computation (within a node) from communication (between nodes).]

Image courtesy of: James Demmel
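The model can be computed directly. A minimal sketch; the constants below are hypothetical placeholders, not measured values for any machine.

```python
def model_time(n_flops, n_messages, n_words,
               time_per_flop, time_per_message, time_per_word):
    """Per-processor time along the critical path:
    computation term + latency term + bandwidth term."""
    return (n_flops * time_per_flop
            + n_messages * time_per_message
            + n_words * time_per_word)

# Hypothetical machine constants (seconds per flop / message / word).
t = model_time(1e9, 1e3, 1e6, 1e-10, 1e-6, 1e-9)
print(t)
```

Minimizing #messages (the latency term) and #words (the bandwidth term) is what "communication avoiding" refers to throughout.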


Sparse ⨉ Dense Matmul

[Figure: sparse A ⨉ dense B = dense C; row Ai,: of A and column B:,j of B produce element Ci,j.]

Ci,j += Ai,k ⨉ Bk,j
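The update Ci,j += Ai,k ⨉ Bk,j, with A in CSR and B, C in row-major order as in the paper's experiments, can be sketched as a minimal serial kernel (an illustration, not the MKL routine used in the paper):

```python
def csr_dense_matmul(indptr, indices, data, B, n_rows, n_cols_B):
    """C = A @ B where A is sparse in CSR (indptr, indices, data)
    and B, C are dense row-major lists of lists."""
    C = [[0.0] * n_cols_B for _ in range(n_rows)]
    for i in range(n_rows):                        # row i of A
        for idx in range(indptr[i], indptr[i + 1]):
            k, a_ik = indices[idx], data[idx]      # nonzero A[i][k]
            row_Bk = B[k]
            for j in range(n_cols_B):
                C[i][j] += a_ik * row_Bk[j]        # C[i][j] += A[i][k] * B[k][j]
    return C

# A = [[2, 0], [0, 3]] in CSR; B = [[1, 1], [1, 0]]
C = csr_dense_matmul([0, 1, 2], [0, 1], [2.0, 3.0],
                     [[1.0, 1.0], [1.0, 0.0]], 2, 2)
print(C)  # [[2.0, 2.0], [3.0, 0.0]]
```

Only the nonzeroes of A are touched, which is why the local work scales with nnz(A) rather than with the full n².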


Matrix Multiplication Algorithms


[Figure: block layouts of A, B, and C for the 1D, 2D, and 3D algorithm families, with 2.5D and 1.5D in between; 1D and 2D (SUMMA) distributions of the sparse A and the dense B and C are shown for processor p0.]

2.5D/3D algorithms are optimal for dense-dense/sparse-sparse matmul [Solomonik and Demmel 2011], [Ballard et al. 2013]

*2D and 3D images courtesy of Grey Ballard


Existing Algorithms


1D Ring Algorithm

• p=4

• 1D layout

• Shifts the sparse matrix A around the ring cyclically

• #msgs: p

• msg size: nnz(A)/p

• #words: nnz(A)

[Figure: four processors 0–3 in a ring; the block of A held by each processor shifts one position per round while B and C stay in place.]
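The ring schedule can be simulated serially. This is an illustrative reconstruction in plain Python, not the paper's C++/MPI code, assuming A's block rows rotate while each processor keeps one block column of B and computes the matching block column of C: one layout consistent with the stated costs (p messages of size nnz(A)/p).

```python
def matmul(X, Y):
    # Plain triple-loop matmul on lists of lists, skipping zeros of X.
    n, m, q = len(X), len(Y), len(Y[0])
    Z = [[0.0] * q for _ in range(n)]
    for i in range(n):
        for k in range(m):
            if X[i][k]:
                for j in range(q):
                    Z[i][j] += X[i][k] * Y[k][j]
    return Z

def ring_1d(A_blocks, B_cols, p):
    """A_blocks[r] = block row r of A (these rotate around the ring);
    B_cols[i] = full-height block column i of B, fixed on processor i.
    Returns each processor's finished block column of C."""
    held = list(range(p))                 # held[i] = A block row on processor i
    C_parts = [[None] * p for _ in range(p)]
    for _ in range(p):                    # p rounds, one shift each
        for i in range(p):
            r = held[i]
            C_parts[i][r] = matmul(A_blocks[r], B_cols[i])  # C[r-block, i-block]
        held = held[1:] + held[:1]        # cyclic shift of A's block rows
    # Stack the row blocks to form each processor's block column of C.
    return [[row for part in C_parts[i] for row in part] for i in range(p)]
```

After p shifts every processor has seen all of A, so only nnz(A) words ever move; the dense B and C never leave their owners.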

2D SUMMA Algorithm

• p=16

• 2D layout, √p ⨉ √p grid

• outer products

• A broadcasts row-wise

• B broadcasts column-wise

• √p broadcasts of size nnz(A)/p

• √p broadcasts of size nnz(B)/p

[Figure: a 4 ⨉ 4 processor grid (0–f) holding blocks of A, B, and C; at each of the √p stages, one block column of A is broadcast along processor rows and one block row of B along processor columns.]
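The stage loop can be sketched serially (an illustrative simulation, not a distributed implementation); "broadcasting" a block here is simply reading it from the global matrix.

```python
def summa_2d(A, B, n, q):
    """C = A @ B on a q x q processor grid, n divisible by q.
    Block (i, j) of each matrix conceptually lives on processor (i, j)."""
    b = n // q                                     # block size
    blk = lambda M, i, j: [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    C = [[0.0] * n for _ in range(n)]
    for k in range(q):                             # SUMMA stage k (q = sqrt(p) stages)
        for i in range(q):                         # processor row i
            A_ik = blk(A, i, k)                    # A block broadcast along row i
            for j in range(q):                     # processor column j
                B_kj = blk(B, k, j)                # B block broadcast along column j
                for ii in range(b):                # local rank-b update: C_ij += A_ik @ B_kj
                    for kk in range(b):
                        a = A_ik[ii][kk]
                        if a:
                            for jj in range(b):
                                C[i*b+ii][j*b+jj] += a * B_kj[kk][jj]
    return C
```

Each stage contributes one outer product of a block column of A with a block row of B, matching the bullets above.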

3D SUMMA Algorithm

• p=32, c=2 (c is #copies)

• √(p/c) ⨉ √(p/c) ⨉ c grid (teams of size c)

• Each team member calculates 1/c of all outer products (the first member the first half, the second member the latter half)

• broadcasts of size nnz(A)c/p

• broadcasts of size nnz(B)c/p

[Figure: two 4 ⨉ 4 processor grids stacked in layers; each layer holds a copy of A, B, and C and performs its half of the outer-product stages.]
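The layering can be sketched by giving each of the c layers its own copy of C, a disjoint 1/c of the SUMMA stages, and a final reduction (the "Reduce C" collection phase). Illustrative serial code, assuming q is divisible by c:

```python
def summa_3d(A, B, n, q, c):
    """C = A @ B with c replicated layers over a q x q grid; layer l runs
    SUMMA stages l*q//c .. (l+1)*q//c - 1, then the C copies are summed."""
    b = n // q
    blk = lambda M, i, j: [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    layer_C = []
    for l in range(c):                     # replication: each layer owns a C copy
        C = [[0.0] * n for _ in range(n)]
        for k in range(l * q // c, (l + 1) * q // c):  # this layer's 1/c of stages
            for i in range(q):
                A_ik = blk(A, i, k)
                for j in range(q):
                    B_kj = blk(B, k, j)
                    for ii in range(b):
                        for kk in range(b):
                            for jj in range(b):
                                C[i*b+ii][j*b+jj] += A_ik[ii][kk] * B_kj[kk][jj]
        layer_C.append(C)
    # collection: reduce (sum) the c partial copies of C
    return [[sum(layer_C[l][i][j] for l in range(c)) for j in range(n)]
            for i in range(n)]
```

Replication trades extra memory (c copies) and a reduction for fewer, larger broadcasts per layer.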

3D SUMMA: Costs (#messages and #words per phase)

• Replication
– Replicate A
– Replicate B

• Propagation
– Broadcast A
– Broadcast B

• Collection
– Reduce C


1.5D Algorithms


1.5D Col A

• p=4, c=2

• A has p/c block columns

• #msgs: p/c

• msg size: nnz(A)c/p

• #words: nnz(A)

[Figure: processors grouped into teams {0, 1} and {2, 3}; over p/c rounds, the block columns of A shift between teams while B stays in place.]

1.5D Col ABC

• p=8, c=2

• A and B both have p/c block columns

• Each team member does 1/c of work

• #msgs: p/c²

• msg size:

• #words:

[Figure: eight processors in teams of two; assignments of the block columns of A and B and the pieces of C are shown for rounds 1 and 2.]

1.5D Inner ABC

• p=8, c=2

• A has p/c block rows

• B has p/c block columns

• Each team member does 1/c of work

• #msgs: p/c²

• msg size:

• #words:

[Figure: eight processors in teams of two; assignments of the block rows of A, the block columns of B, and the blocks of C are shown for rounds 1 and 2.]


Cost Comparison

• Latency cost

• Append the corresponding per-word factors for the bandwidth cost

• Even replication and collection can become the bottleneck if nnz(A) ≪ nnz(B), nnz(C)
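Using the per-variant message counts stated on the algorithm slides (p for the 1D ring, p/c for Col A, p/c² for Col ABC and Inner ABC), the latency savings at the experiments' scale can be tabulated:

```python
def msg_counts(p, c):
    """Latency (#messages) per algorithm, as stated on the slides:
    1D ring sends p messages; 1.5D Col A sends p/c; Col ABC and
    Inner ABC send p/c^2."""
    return {"1D ring": p, "1.5D Col A": p // c, "1.5D Col/Inner ABC": p // c**2}

# At the experiments' scale (p = 3,072), larger c sharply reduces latency cost.
for c in (2, 4, 8):
    print(c, msg_counts(3072, c))
```

This is only the latency side; the bandwidth terms (and the replication/collection overheads above) decide whether a larger c actually pays off.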



Experimental Results



Test environments

• Comparison
– 1.5D variants: Col A, Col ABC, and Inner ABC
– Baseline: best of 2D/2.5D/3D SUMMA ABC

• C++ MPI, hybrid (12 threads per MPI process)

• MKL's multithreaded routine for local computation (mkl_dcsrmm, double precision)

• CSR + row-major storage

• Edison @ NERSC
– 2.57 PFLOPS Cray XC30
– 2 sockets per node, 12 cores each
– Cray Aries, Dragonfly topology



Cost Breakdown

• p=3,072, n=64k x 64k, 41 nnz per row, Cray XC30


[Chart: time breakdown of the existing (SUMMA) vs. new (1.5D) algorithms.]


Moving Dense Matrix is Costly

• p=3,072, n=64k x 64k, 41 nnz per row, Cray XC30


[Chart: existing vs. new algorithms; moving the dense matrix dominates communication time.]


Replication might not pay off

• p=3,072, n=64k x 64k, 41 nnz per row, Cray XC30


[Chart: existing vs. new algorithms; a 17⨉ difference is annotated.]


Different Computation Efficiency

• p=3,072, n=64k x 64k, 41 nnz per row, Cray XC30


[Chart: existing vs. new algorithms; computation time differs across variants, with the 17⨉ gap annotated.]


100x Improvement

• A: 66k ⨉ 172k, B: 172k ⨉ 66k, 0.0038% nonzeroes, Cray XC30


[Chart: strong scaling (up is good); 1.5D achieves up to 100⨉ speedup.]


When is 1.5D better?

• Regions where each algorithm has the best theoretical bandwidth cost


Most sparse matrices have < 5% nonzeroes

Assuming nnz(C) = nnz(B)

Also applicable to dense-dense and sparse-sparse


Iterative Multiplication



Sparse ICM Estimation

• ICM = inverse covariance matrix
– Direct pairwise dependency

• Data could be
– Brain fMRI scan
– Gene expression

• S is usually rank-deficient or too expensive to invert directly

• Estimate by setting an objective function and optimizing it


[Figure: observed data X (r observations ⨉ n variables/features) and the sample covariance S = XᵀX, which is rank-deficient.]

*[S. Oh et al 2014]


CONCORD-ISTA

• CONCORD objective function

• Iteratively compute S ⨉ Ω [or Ω ⨉ XT]

• Replication (and sometimes collection) costs can be amortized.


[Figure: the products S ⨉ Ω or Ω ⨉ XT; Ω is the estimated sparse inverse of S, with its diagonal and off-diagonal elements entering the objective separately.]


Conclusions

• 1.5D is beneficial when the matrices' nonzero counts are imbalanced.

– Also applicable to dense-dense/sparse-sparse.

• Observed up to 100⨉ speedup.

• 1.5D can handle moderately imbalanced load with greedy partitioning.

• Iterative multiplication gains even more benefit.

• Future work

– Parallel CONCORD-ISTA

– New communication lower bounds

– Try recursive approach

– Consider different computation efficiency



References I

• P. Koanantakool, A. Azad, A. Buluç, D. Morozov, S. Oh, L. Oliker, K. Yelick. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In IPDPS 2016.

• E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par, 2011, vol. 6853, pp. 90–109.

• G. Ballard, A. Buluç, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz, and S. Toledo. Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the twenty-fifth annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 2013, pp. 222–231.



References II

• M. Driscoll, E. Georganas, P. Koanantakool, E. Solomonik, K. Yelick. A communication-optimal n-body algorithm for direct interactions. In IPDPS 2013.

• S. Oh, O. Dalal, K. Khare, and B. Rajaratnam. Optimization methods for sparse pseudo-likelihood graphical model selection. In NIPS 27, 2014, pp. 667–675.



Thank you! Questions? Suggestions/Applications?



Communication Costs

• 3 phases

• Non-replicating version only has propagation cost.

• Replication and collection happen between team members, i.e., only c processors, so they are usually cheaper than propagation.


• Phases:
– Replication: to have c copies
– Propagation: to multiply
– Collection: of matrix C


CONCORD-ISTA

• CONCORD objective function

• Proximal gradient method

– Smooth: Gradient descent

– Non-smooth: Soft-thresholding (ISTA: Iterative Soft-Thresholding Algorithm)

• Does not assume Gaussian distribution


[Figure: the soft-thresholding operator, flat on [−𝜆, 𝜆] (the non-smooth part); the objective has separate terms for the diagonal and off-diagonal elements of Ω, the estimated sparse inverse of S.]


Pseudocode

• Distributed operations
– S = XᵀX/r: dense-dense matrix multiplication
– SΩ + ΩS: sparse-dense matrix multiplication


S = XᵀX/r
Ω = I
do
    Ω0 = Ω
    compute gradient G
    try 𝜏 = 1, 1/2, 1/4, …               // find appropriate step size
        Ω = soft_thresholding(Ω0 − 𝜏G, 𝜏𝜆)
    until descent condition satisfied
while max(|Ω0 − Ω|) ≥ 𝜀                   // until no more progress

Ω: estimated sparse inverse of S
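The inner update of the pseudocode is a gradient step followed by soft-thresholding of the off-diagonal entries (the ℓ1 penalty acts only on off-diagonals, per the objective). A minimal elementwise sketch, with the gradient G taken as a given input rather than computed from the CONCORD objective:

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; exactly zero inside [-t, t]."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def ista_step(Omega, G, tau, lam):
    """One ISTA update: Omega <- soft_threshold(Omega - tau*G, tau*lam),
    applied to off-diagonal entries only (diagonal is not penalized)."""
    n = len(Omega)
    new = [[Omega[i][j] - tau * G[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:                    # penalty is on off-diagonals only
                new[i][j] = soft_threshold(new[i][j], tau * lam)
    return new

# One step from the identity with a symmetric (hypothetical) gradient.
print(ista_step([[1.0, 0.0], [0.0, 1.0]],
                [[0.1, 0.5], [0.5, 0.1]], 1.0, 0.2))
```

Repeating this step with the backtracking line search on 𝜏 and the stopping test on max(|Ω0 − Ω|) reproduces the loop above.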