Transcript of "Parallel Sparse Matrix Indexing and Assignment" (CS240A, Spring 2011 slides)

1  

Parallel Sparse Matrix Indexing and Assignment

Aydın Buluç, Lawrence Berkeley National Laboratory; John R. Gilbert, University of California, Santa Barbara

Sparse adjacency matrix and graph

•  Every graph is a sparse matrix and vice versa
•  Adjacency matrix: sparse array w/ nonzeros for graph edges
•  Storage-efficient implementation from sparse data structures

[Figure: 7-vertex example graph and its sparse adjacency matrix A^T]
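As a small illustration of the last bullet, here is a minimal MATLAB sketch that builds a sparse adjacency matrix from an edge list; the edge list is made up for illustration and is not the graph drawn on the slide.

i = [1 1 2 4 4 5 6 7];          % hypothetical edge tails
j = [2 3 4 2 7 1 3 6];          % hypothetical edge heads
n = 7;                          % number of vertices
A  = sparse(i, j, 1, n, n);     % adjacency matrix: one nonzero per edge
At = A';                        % its transpose, A^T, as drawn on the slide
nnz(A)                          % storage proportional to the number of edges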

3

Linear-algebraic primitives for graphs

•  Sparse matrix-matrix multiplication (SpGEMM)
•  Element-wise operations
•  Sparse matrix-sparse vector multiplication
•  Sparse matrix indexing
•  Matrices on semirings, e.g. (×, +), (and, or), (+, min)
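In MATLAB notation (ordinary (×, +) semiring only; the other semirings require a library such as Combinatorial BLAS), the four primitives look like the following sketch; the matrix sizes and densities are arbitrary.

A = sprand(1000, 1000, 0.001);   % random sparse test matrices
B = sprand(1000, 1000, 0.001);
x = sprand(1000, 1, 0.01);       % sparse vector

C = A * B;          % sparse matrix-matrix multiplication (SpGEMM)
E = A .* B;         % element-wise operation
y = A * x;          % sparse matrix-sparse vector multiplication
S = A(1:10, 1:10);  % sparse matrix indexing (SpRef)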

4  

Indexed reference and assignment

Matlab internal names: subsref, subsasgn. For the sparse special case, we use SpRef and SpAsgn.

SpRef:  B = A(I,J)
SpAsgn: B(I,J) = A

A, B: sparse matrices;  I, J: vectors of indices

SpRef via mixed-mode sparse matrix-matrix multiplication (SpGEMM). Ex: B = A([2,4], [1,2,3])

[Figure: the source matrix A and the extracted 2-by-3 submatrix B]
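A minimal MATLAB sketch of this example, spelling out the two selection matrices (this anticipates the sequential spref code shown later; the test matrix is arbitrary):

A = sprand(5, 5, 0.5);                                 % arbitrary sparse test matrix
I = [2 4];  J = [1 2 3];
R = sparse(1:length(I), I, 1, length(I), size(A,1));   % 2-by-5 row selector
Q = sparse(J, 1:length(J), 1, size(A,2), length(J));   % 5-by-3 column selector
B = R * A * Q;                                         % SpRef via two SpGEMMs
isequal(B, A(I,J))                                     % agrees with direct indexing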

5  

Why are SpRef/SpAsgn important?

Subscripting and colon notation:
⇒  Batched and vectorized operations
⇒  High performance and parallelism

[Figure: spy plots of three orderings]
A = rmat(15):             −  Load balance hard   ±  Some locality
A(r,r), r random:         +  Load balance easy   −  No locality
A(r,r), r = symrcm(A):    −  Load balance hard   +  Good locality
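A sketch of the two reorderings in MATLAB; rmat is the speaker's generator and not a built-in, so a random symmetric matrix stands in for it here:

n = 2^10;
A = sprandsym(n, 8/n);          % stand-in for the RMAT matrix on the slide

r = randperm(n);                % random relabeling: easy load balance, no locality
Arand = A(r, r);

r = symrcm(A);                  % reverse Cuthill-McKee: good locality, uneven load
Arcm = A(r, r);

subplot(1,3,1); spy(A);      title('A');
subplot(1,3,2); spy(Arand);  title('A(r,r), r random');
subplot(1,3,3); spy(Arcm);   title('A(r,r), r = symrcm(A)');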

6  

More applications

Prune isolated vertices, in a plug-and-play way (Graph 500)

[Spy plots: adjacency matrix before and after pruning isolated vertices; nz = 36 in both]

sa = sum(A);                 % A is symmetric (undirected graph)
nonisov = find(sa > 0);      % vertices with at least one incident edge
A = A(nonisov, nonisov);     % keep only non-isolated vertices

7  

More applications

Extracting (induced) subgraphs

[Figure: the 7-vertex example graph split into Area A and Area B; the symmetric permutation P A P^T reorders the vertices (1, 5, 2, 4, 7, 3, 6) so that each area's vertices form a contiguous diagonal block]

•  Per-area analysis on power grids
•  Subroutine for recursive algorithms on graphs
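Per-area analysis, for example, is one SpRef per area. A minimal MATLAB sketch with made-up vertex sets (the slide does not spell out which vertices belong to which area):

A = sprandsym(7, 0.4);          % stand-in for the grid's adjacency matrix
areaA = [1 2 4 5 7];            % hypothetical vertex set for Area A
areaB = [3 6];                  % hypothetical vertex set for Area B

GA = A(areaA, areaA);           % induced subgraph on Area A (SpRef)
GB = A(areaB, areaB);           % induced subgraph on Area B (SpRef)

p = [areaA areaB];              % concatenating the areas gives the permutation P
PAP = A(p, p);                  % P*A*P': each area becomes a diagonal block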

8  

Sequential algorithms

function B = spref(A,I,J)
% Row selector R and column selector Q, one nonzero per selected index
R = sparse(1:length(I), I, 1, length(I), size(A,1));
Q = sparse(J, 1:length(J), 1, size(A,2), length(J));
B = R*A*Q;

$T_{spref} = \mathrm{flops}(R \cdot A) + \mathrm{flops}(RA \cdot Q) = \mathrm{nnz}(R \cdot A) + \mathrm{nnz}(RA \cdot Q) = O(\mathrm{nnz}(A))$

function C = spasgn(A,I,J,B)
[ma,na] = size(A);
[mb,nb] = size(B);
R = sparse(I, 1:mb, 1, ma, mb);    % embeds the rows of B into rows I
Q = sparse(1:nb, J, 1, nb, na);    % embeds the columns of B into columns J
S = sparse(I, I, 1, ma, ma);       % masks out the old A(I,J) block
T = sparse(J, J, 1, na, na);
C = A + R*B*Q - S*A*T;

$C = A + \begin{pmatrix} 0 & 0 & 0 \\ 0 & B & 0 \\ 0 & 0 & 0 \end{pmatrix} - \begin{pmatrix} 0 & 0 & 0 \\ 0 & A(I,J) & 0 \\ 0 & 0 & 0 \end{pmatrix}$

$T_{spasgn} = O(\mathrm{nnz}(A))$
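A quick sanity check, assuming the two functions above are on the MATLAB path, that they agree with built-in indexing (the spasgn comparison uses a norm because the add-then-subtract in C = A + R*B*Q - S*A*T can leave roundoff-level differences):

A = sprand(100, 100, 0.05);
I = [2 4 7 11];  J = [1 2 3 50 99];
B = sprand(length(I), length(J), 0.5);

isequal(spref(A, I, J), A(I, J))    % SpRef matches B = A(I,J)

C = spasgn(A, I, J, B);
Aref = A;  Aref(I, J) = B;
norm(C - Aref, 1) < 1e-12           % SpAsgn matches A(I,J) = B up to roundoff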

9  

Parallel algorithm for SpRef

1.  Forming R from I in parallel, on a 3x3 processor grid

[Figure: forming R from the index vector I on a 3x3 grid; the entries of I are held by the diagonal processors P(0,0), P(1,1), P(2,2) and are SCATTERed to the processors that own the corresponding nonzeros of R]

•  Vector distributed only on diagonal processors, for illustration.
•  Full (2D) vector distribution: SCATTER → ALLTOALLV
•  Forming Q^T from J is identical, followed by Q = QT.Transpose() (see the sketch below)
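The last bullet in sequential MATLAB form (the parallel code scatters the same triples across the grid; the index vectors and dimensions below are illustrative):

m = 9;  n = 9;                        % dimensions of A (arbitrary here)
I = [7 2 5 8 1 3];  J = [4 1 6];      % example index vectors

R  = sparse(1:length(I), I, 1, length(I), m);   % R(k, I(k)) = 1
QT = sparse(1:length(J), J, 1, length(J), n);   % Q^T is formed the same way from J ...
Q  = QT';                                       % ... then transposed, as in Q = QT.Transpose()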

10  

Parallel algorithm for SpRef

2.  SpGEMM using memory-efficient Sparse SUMMA.

Minimize temporaries by:
•  Splitting the local matrix and broadcasting it multiple times
•  Deleting P (and A, if in-place) immediately after forming C = P*A (see the sketch below the figure)

[Figure: one Sparse SUMMA stage of the triple product, C_ij += P_ik * A_kj]
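The same idea in sequential MATLAB terms: form one intermediate of the triple product at a time and free it as soon as it has been consumed (clear stands in for the explicit deletes in the C++ code; the setup values are illustrative):

A = sprand(8, 8, 0.3);
I = [2 4];  J = [1 2 3];
R = sparse(1:length(I), I, 1, length(I), size(A,1));
Q = sparse(J, 1:length(J), 1, size(A,2), length(J));

P = R * A;        % first SpGEMM of the triple product
clear R;          % R is no longer needed
% clear A;        % if the operation may run in place, A can be freed here too
B = P * Q;        % second SpGEMM
clear P Q;        % intermediates freed before B is returned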

Matrix/vector distributions, interleaved with each other.

2D vector distribution

[Figure: on a 3x3 grid, the vector x is split into blocks x_{1,1} ... x_{3,3} and the matrix A into blocks A_{1,1} ... A_{3,3}; each processor holds an (n/pr)-by-(n/pc) block of A]

-  Performance change is marginal (dominated by SpGEMM)
-  Scalable with increasing number of processes
-  No significant load imbalance

Default distribution in Combinatorial BLAS.

Complexity analysis

Assumptions:
-  The triple product is evaluated from left to right: B = (R*A)*Q
-  Nonzeros are uniformly distributed over processors (a chicken-and-egg assumption?)

Dominated by SpGEMM. Bottleneck: bandwidth costs. Speedup: $\Theta(\sqrt{p})$

SpGEMM:
$T_{comp} = \Theta\!\left(\frac{\mathrm{nnz}(A)}{p}\,\log\frac{\mathrm{length}(I)}{\sqrt{p}} + \frac{\mathrm{length}(J)}{\sqrt{p}} + \sqrt{p}\right)$
$T_{comm} = \Theta\!\left(\alpha\,\sqrt{p} + \beta\,\frac{\mathrm{nnz}(A)}{\sqrt{p}}\right)$

Matrix formation:
$\Theta\!\left(\alpha\,\log p + \beta\,\frac{\mathrm{length}(I) + \mathrm{length}(J)}{p}\right)$
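One way to read off the $\Theta(\sqrt{p})$ speedup, assuming (as the slide argues) that the $\beta$ bandwidth term dominates: the sequential work is $\Theta(\mathrm{nnz}(A))$, so

$S(p) = \frac{T_1}{T_p} = \Theta\!\left(\frac{\mathrm{nnz}(A)}{\mathrm{nnz}(A)/\sqrt{p}}\right) = \Theta(\sqrt{p})$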

Strong scaling of SpRef

Random symmetric permutation ⇔ relabeling the graph vertices
•  RMAT scale 22; edge factor = 8; a = 0.6, b = c = d = 0.4/3
•  Franklin/NERSC; each node is a quad-core AMD Budapest

[Plot: running time in seconds (left axis) and speedup (right axis) vs. number of cores, from 1 to 1024]

Strong scaling of SpRef

Extracts 10 random (induced) subgraphs, each with |V|/10 vertices.
Higher span → decreased parallelism → lower speedup

[Plot: running time in seconds (left axis) and speedup (right axis) vs. number of cores, from 1 to 1024]

Conclusions

•  Parallel algorithms for SpRef and SpAsgn
•  Systematic algorithm structure imposed by SpGEMM
•  Analysis made possible for the general case
•  Good strong scaling at 1000-way concurrency
•  Many applications in the sparse matrix and graph world

Caveat: avoid load imbalance by indexing non-monotonically

$I = (1, 3, 4, 6, 7, 8)^{\mathsf{T}}$   versus   $I = (7, 3, 8, 1, 6, 4)^{\mathsf{T}}$