Lecture 16, CSE 260 – Parallel Computation (Fall 2015), Scott B. Baden
Parallel matrix multiplication · Communication lower bounds · Communication-avoiding algorithms

Page 1

Lecture 16
CSE 260 – Parallel Computation (Fall 2015)
Scott B. Baden

Parallel matrix multiplication
Communication lower bounds
Communication-avoiding algorithms

Page 2

Announcements
• Today's office hours are delayed by 30 minutes
• A3 is due on Monday 11/23; I will hold office hours by appointment that day


Page 3

Today’s lecture

• Parallel matrix multiplication: Cannon's algorithm
• Working with communicators
• Communication lower bounds
• The Communication-Avoiding (CA) "2.5D" algorithm


Page 4

Recall matrix multiplication
• Given two conforming matrices A and B, form the matrix product A × B, where A is m × n and B is n × p
• Operation count: O(n³) multiply-adds for an n × n square matrix
• Different variants, e.g. ijk, etc.


Page 5

ijk variant

    for i := 0 to n-1
        for j := 0 to n-1
            for k := 0 to n-1
                C[i,j] += A[i,k] * B[k,j]

Figure: C[i,j] += A[i,:] * B[:,j] (row i of A times column j of B).
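As a concrete rendering, here is the ijk variant as a small C routine (a sketch; the slide gives only pseudocode, and the row-major layout is an assumption):

    /* ijk variant of square matrix multiply: C += A * B.
       A, B, C are n x n matrices stored in row-major order. */
    void matmul_ijk(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }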


Page 6

Parallel matrix multiplication
• Organize processors into rows and columns
  – Process rank is an ordered pair of integers
  – Assume p is a perfect square
• Each processor gets an n/√p × n/√p chunk of data
• Assume that we have an efficient serial matrix multiply (dgemm, sgemm)

    p(0,0)  p(0,1)  p(0,2)
    p(1,0)  p(1,1)  p(1,2)
    p(2,0)  p(2,1)  p(2,2)
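A minimal sketch of the rank-to-grid mapping this layout implies (row-major ordering is an assumption; the communicator slides later use the same convention):

    #include <math.h>

    /* Map a linear MPI rank onto a sqrt(p) x sqrt(p) process grid,
       row-major: rank r lives at row r / sqrt(p), column r % sqrt(p)
       (p assumed to be a perfect square). */
    void grid_coords(int rank, int p, int *row, int *col)
    {
        int sp = (int) sqrt((double) p);
        *row = rank / sp;
        *col = rank % sp;
    }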


Page 7

A simple parallel algorithm
• Conceptually, like the blocked serial algorithm

Figure: block decomposition of C = A × B.


Page 8

Cost
• Each processor performs n³/p multiply-adds
• Multiplies a wide and short matrix by a tall skinny matrix
• Needs to collect these matrices via collective communication
• High memory overhead

Figure: block row of A times block column of B.


Page 9

A more efficient algorithm
• We can form the same product by computing √p separate matrix multiplies involving n/√p × n/√p submatrices and accumulating the partial results:

    for k := 0 to n-1
        C[i,j] += A[i,k] * B[k,j];

• Move data incrementally in √p phases within a row or column
• In effect, a linear-time ring broadcast algorithm
• Modest buffering requirements

Figure: block decomposition of C = A × B.


Page 10

Cannon's algorithm
• Implements this strategy
• In effect we are using a ring broadcast algorithm
• Consider block C[1,2]:

C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]

Figure: the 3 × 3 block layouts of A, B, and C in C = A × B. Image: Jim Demmel


Page 11

Skewing the matrices
• Before we start, we preskew the matrices so everything lines up
• Shift row i of A by i columns to the left using sends and receives
  – Do the same for each column of B, shifting column j up by j rows
  – Communication wraps around
• Ensures that each partial product is computed on the same processor that owns C[i,j], using only shifts

C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]

Figure: the 3 × 3 block layouts of A and B before and after the preskew.


Page 12

Shift and multiply
• √p steps
• Circularly shift rows of A 1 column to the left and columns of B 1 row up
• Each processor forms the partial product of its local A and B blocks and adds it into the accumulated sum in C

C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]

Figure: the A and B block layouts after the preskew and after each of the two subsequent shift steps.


Page 13

Cost of Cannon's algorithm

    forall i = 0 to √p - 1
        CShift-left A[i,:] by i                  // T = α + βn²/p
    forall j = 0 to √p - 1
        CShift-up B[:,j] by j                    // T = α + βn²/p
    for k = 0 to √p - 1
        forall i = 0 to √p - 1 and j = 0 to √p - 1
            C[i,j] += A[i,j] * B[i,j]            // T = 2n³/p^(3/2)
            CShift-left A[i,:] by 1              // T = α + βn²/p
            CShift-up B[:,j] by 1                // T = α + βn²/p
        end forall
    end for

T_P = 2(n³/p)γ + 2(α(1+√p) + βn²(1+√p)/p)
E_P = T_1/(pT_P) = (1 + αp^(3/2)/n³ + β√p/n)⁻¹ ≈ (1 + O(√p/n))⁻¹

E_P → 1 as n/√p (the square root of the data per processor) grows
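A compilable sketch of the loop above using MPI's Cartesian topology routines (local_mm, the block size nb, and the overall structure are illustrative assumptions, not code from the lecture):

    #include <math.h>
    #include <mpi.h>

    /* Hypothetical local multiply-accumulate: c += a * b on nb x nb blocks. */
    void local_mm(int nb, const double *a, const double *b, double *c);

    /* Cannon's algorithm on a periodic sqrt(p) x sqrt(p) process grid.
       A, B, C are this process's local nb x nb blocks. */
    void cannon(int nb, double *A, double *B, double *C, MPI_Comm comm)
    {
        int p, rank, coords[2], left, right, up, down;
        MPI_Comm_size(comm, &p);
        int sp = (int) sqrt((double) p);
        int dims[2] = { sp, sp }, periods[2] = { 1, 1 };
        MPI_Comm grid;
        MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* Preskew: shift row i of A left by i, column j of B up by j. */
        MPI_Cart_shift(grid, 1, -coords[0], &right, &left);
        MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -coords[1], &down, &up);
        MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);

        /* sqrt(p) multiply-and-shift steps: multiply local blocks, then
           shift A one step left and B one step up (with wraparound). */
        MPI_Cart_shift(grid, 1, -1, &right, &left);
        MPI_Cart_shift(grid, 0, -1, &down, &up);
        for (int k = 0; k < sp; k++) {
            local_mm(nb, A, B, C);
            MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, left, 0, right, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up, 0, down, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
        MPI_Comm_free(&grid);
    }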

Page 14

Today’s lecture

• Parallel matrix multiplication: Cannon's algorithm
• Working with communicators
• The Communication-Avoiding (CA) "2.5D" algorithm


Page 15

Communication domains
• Cannon's algorithm shifts data along rows and columns of processors
• MPI provides communicators for grouping processors, reflecting the communication structure of the algorithm
• An MPI communicator is a name space, a subset of processes that communicate
• Messages remain within their communicator
• A process may be a member of more than one communicator

Figure: 16 processes P0..P15 arranged as a 4 × 4 grid with coordinates (0,0) through (3,3); X0..X3 label the columns and Y0..Y3 the rows.


Page 16

Creating the communicators
• Create a communicator for each row: key = myRank div √P

    MPI_Comm rowComm;
    MPI_Comm_split(MPI_COMM_WORLD, myRank / √P, myRank, &rowComm);
    MPI_Comm_rank(rowComm, &myRow);

• Column?

Figure: the 4 × 4 process grid again; every process in row i receives split key i, so each row (0, 1, 2, 3) becomes one communicator.

• Each process obtains a new communicator according to the key
• Process rank is relative to the new communicator
• Rank applies to the respective communicator only
• Ordered according to myRank
• Column communicators are built the same way; see the sketch below
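A minimal sketch of both splits, assumed to run after MPI_Init on P processes with P a perfect square (the integer sp plays the role of √P; variable names are illustrative, not from the slides):

    int P, myRank, myRow, myCol;
    MPI_Comm rowComm, colComm;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    int sp = (int) sqrt((double) P);    /* sp = sqrt(P) */

    /* Row communicator: all processes with the same row index myRank / sp.
       A process's rank within it is its column index. */
    MPI_Comm_split(MPI_COMM_WORLD, myRank / sp, myRank, &rowComm);
    MPI_Comm_rank(rowComm, &myCol);

    /* Column communicator: all processes with the same column index
       myRank % sp. A process's rank within it is its row index. */
    MPI_Comm_split(MPI_COMM_WORLD, myRank % sp, myRank, &colComm);
    MPI_Comm_rank(colComm, &myRow);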


Page 17

More on Comm_split

    MPI_Comm_split(MPI_Comm comm, int splitKey,
                   int rankKey, MPI_Comm* newComm)

• Within each new communicator, ranks are ordered by rankKey; ties among processes sharing the same rankKey value are broken by their rank in the old communicator
• A process may be excluded by passing the constant MPI_UNDEFINED as the splitKey; such a process receives the special communicator MPI_COMM_NULL (see the sketch below)
• If a process is a member of several communicators, it will have a rank within each one
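For instance, a short sketch (not from the slides) that keeps only the even-ranked processes in a new communicator; the excluded processes get MPI_COMM_NULL back:

    MPI_Comm evenComm;
    int color = (myRank % 2 == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, myRank, &evenComm);
    if (evenComm == MPI_COMM_NULL) {
        /* This process was excluded and must not use evenComm. */
    }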


Page 18

Circular shift
• Communication within rows (and columns):

    MPI_Comm_rank(rowComm, &myidRing);
    MPI_Comm_size(rowComm, &nodesRing);
    int next = (myidRing + 1) % nodesRing;
    MPI_Send(&X, 1, MPI_INT, next, 0, rowComm);
    MPI_Recv(&XR, 1, MPI_INT, MPI_ANY_SOURCE, 0, rowComm, &status);

    p(0,0)  p(0,1)  p(0,2)
    p(1,0)  p(1,1)  p(1,2)
    p(2,0)  p(2,1)  p(2,2)

• Processes 0, 1, 2 are in one communicator because they share the same key value (0)
• Processes 3, 4, 5 are in another (key = 1), and so on
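One caveat worth noting: the MPI_Send above blocks once the message exceeds the implementation's eager limit, so if every process sends before receiving, the ring can deadlock. A deadlock-free variant of the same circular shift (a sketch, not from the slides) pairs the send and receive in one call:

    int myidRing, nodesRing, XR;
    MPI_Status status;
    MPI_Comm_rank(rowComm, &myidRing);
    MPI_Comm_size(rowComm, &nodesRing);
    int next = (myidRing + 1) % nodesRing;
    int prev = (myidRing - 1 + nodesRing) % nodesRing;
    /* Send X to the right neighbor while receiving XR from the left one;
       MPI_Sendrecv cannot deadlock regardless of message size. */
    MPI_Sendrecv(&X,  1, MPI_INT, next, 0,
                 &XR, 1, MPI_INT, prev, 0, rowComm, &status);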


Page 19

Today’s lecture

• Parallel matrix multiplication: Cannon's algorithm
• Working with communicators
• The Communication-Avoiding (CA) "2.5D" algorithm


Page 20

Recalling Cannon's algorithm
• √p shift and multiply-add steps
• Each processor forms the partial products of its local A and B blocks
• T_P = 2(n³/p)γ + 2(α(1+√p) + βn²(1+√p)/p)

C[1,2] = A[1,0]*B[0,2] + A[1,1]*B[1,2] + A[1,2]*B[2,2]

Figure: the A and B block layouts across the √p shift steps, as on the earlier slides.


Page 21

Can we improve on Cannon's algorithm?
• Relative to arithmetic speeds, communication is becoming more costly over time
• Communication can be on-chip or off-chip, and across address spaces
• We seek an algorithm that increases the amount of work (flops) relative to the data it moves

Figure: a CPU with cache and DRAM connected to several CPU+DRAM nodes. Image: Jim Demmel


Page 22

Why it is important to reduce communication
• Running time has 3 terms:
  – # flops × time per flop (γ)
  – # words moved × time per word, i.e. reciprocal bandwidth (β)
  – # messages × latency (α)
• γ << β << α (John Shalf)


Annual improvements:

                 Bandwidth    Latency
    Network      26%          15%
    DRAM         23%          5%

    Time/flop: 59%

Page 23

Communication lower bounds for matrix multiplication and other direct linear algebra ("O(n³)-like")
• Let M = size of fast memory per processor, e.g. cache
• # words moved per processor = Ω(#flops per processor / M^(1/2))
• # messages sent per processor = Ω(#flops per processor / M^(3/2))
• # messages sent per processor ≥ # words moved per processor / largest message size
• Holds not only for matrix multiply but for many other "direct" algorithms in linear algebra, sparse matrices, and some graph-theoretic algorithms (SIAM SIAG/Linear Algebra Prize, 2012: Demmel, Ballard, Holtz, Schwartz)
• We can realize these bounds in practice for many algorithms, though mostly for dense matrices


Page 24

Cannon's algorithm – optimality
• General result
  – If each processor has M words of local memory ...
  – ... at least 1 processor must transmit Ω(# flops / M^(1/2)) words of data
• If local memory M = O(n²/p) ...
  – at least 1 processor performs f ≥ n³/p flops
  – ... giving a lower bound on the number of words transmitted by at least 1 processor:

    Ω((n³/p) / √(n²/p)) = Ω((n³/p) / √M) = Ω(n² / √p)


Page 25

Johnson's 3D algorithm
• 3D processor grid: p^(1/3) × p^(1/3) × p^(1/3)
  – The matrices initially reside in just 1 plane of processors
  – Broadcast A (B) in the j (i) direction (p^(1/3) redundant copies) to the other planes
  – Local multiplications
  – Accumulate (reduce) in the k direction
• Communication costs (optimal)
  – Volume = O(n²/p^(2/3))
  – Messages = O(log p)
• Assumes space for p^(1/3) redundant copies
• Trades memory for communication

Figure: a p^(1/3)-sided cube of processors with i, j, k axes; the "A face" and "C face" are marked, and the highlighted cell represents C(1,1) += A(1,3)*B(3,1). Source: Edgar Solomonik


Page 26

2.5D algorithm
• What if we have space for only 1 ≤ c ≤ p^(1/3) copies?
• p processors on a (p/c)^(1/2) × (p/c)^(1/2) × c mesh, M = Ω(c·n²/p)
• Communication costs: lower bounds
  – Volume = Ω(n²/(cp)^(1/2));  set M = c·n²/p in Ω(# flops / M^(1/2))
  – Messages = Ω(p^(1/2)/c^(3/2));  set M = c·n²/p in Ω(# flops / M^(3/2))
  – Sends c^(1/2) times fewer words and c^(3/2) times fewer messages
• The 2.5D algorithm "interpolates" between the 2D and 3D algorithms

Figure: the 3D and 2.5D processor meshes. Source: Edgar Solomonik


Page 27

2.5D algorithm
• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) × (P/c)^(1/2) × c grid

Figure: the processor grid, c layers of (P/c)^(1/2) × (P/c)^(1/2) each. Source: Jim Demmel

Example: P = 32, c = 2
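Working the slide's example through the cost ratios from the previous slide (a quick check):

    P = 32,\ c = 2 \;\Rightarrow\; (P/c)^{1/2} \times (P/c)^{1/2} \times c = 4 \times 4 \times 2,
    \text{words: } c^{1/2} = \sqrt{2} \approx 1.4\times \text{ fewer than 2D}, \qquad
    \text{messages: } c^{3/2} = 2\sqrt{2} \approx 2.8\times \text{ fewer than 2D}.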


Page 28

2.5D algorithm
• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) × (P/c)^(1/2) × c grid
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) × n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)*B(m,j)
(3) Sum-reduce the partial sums Σ_m A(i,m)*B(m,j) along the k-axis so that P(i,j,0) owns C(i,j)

Source: Jim Demmel
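A structural sketch of the three phases in MPI, assuming a communicator kComm that groups the c processes sharing the same (i,j), with layer index myK equal to the rank within it; summa_step, nb, and nsteps are illustrative names (one SUMMA broadcast-and-multiply step, the local block size, and the total number of SUMMA steps, assumed divisible by c):

    #include <mpi.h>

    /* Hypothetical: performs SUMMA step s within this layer's 2D grid. */
    void summa_step(int s, int nb, double *A, double *B, double *Cpart);

    void mm25d(int nb, int nsteps, double *A, double *B, double *Cpart,
               int c, int myK, MPI_Comm kComm)
    {
        /* (1) Replicate: P(i,j,0) broadcasts its A and B blocks along k. */
        MPI_Bcast(A, nb*nb, MPI_DOUBLE, 0, kComm);
        MPI_Bcast(B, nb*nb, MPI_DOUBLE, 0, kComm);

        /* (2) Layer k performs its 1/c-th share of the SUMMA steps. */
        for (int s = myK * (nsteps / c); s < (myK + 1) * (nsteps / c); s++)
            summa_step(s, nb, A, B, Cpart);

        /* (3) Sum-reduce the partial C blocks along the k-axis so that
               P(i,j,0) ends up owning C(i,j). */
        if (myK == 0)
            MPI_Reduce(MPI_IN_PLACE, Cpart, nb*nb, MPI_DOUBLE, MPI_SUM,
                       0, kComm);
        else
            MPI_Reduce(Cpart, NULL, nb*nb, MPI_DOUBLE, MPI_SUM, 0, kComm);
    }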


Page 29

Performance on Blue Gene/P

Figure: matrix multiplication on 16,384 nodes (64K cores) of BG/P, plotting percentage of machine peak for n = 8192 and n = 131072 with the 2D and 2.5D algorithms, using c = 16 matrix copies; 2.5D MM is 12× faster at n = 8192 and 2.7× faster at n = 131072. Source: Jim Demmel

Page 30

Communication reduction

Figure: matrix multiplication on 16,384 nodes (64K cores) of BG/P; execution time normalized by 2D and broken into computation, idle, and communication for n = 8192 and n = 131072 under the 2D and 2.5D algorithms. 2.5D achieves a 95% reduction in communication.

EuroPar'11 (Solomonik and Demmel); SC'11 (Solomonik, Bhatele, Demmel)

Page 31

Implications for scaling (parallel case)
• To ensure that communication is not the bottleneck, we must balance the relationships among the various performance attributes
  – γM^(1/2) ≳ β: the time to add two rows of a locally stored square matrix exceeds the reciprocal bandwidth
  – γM^(3/2) ≳ α: the time to multiply 2 locally stored square matrices exceeds the latency
• Machine parameters:
  – γ = seconds per flop (multiply or add)
  – β = reciprocal bandwidth (time per word)
  – α = latency (time per message)
  – M = local (fast) memory size
  – P = number of processors
• Time = γ·#flops + β·#flops/M^(1/2) + α·#flops/M^(3/2)
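The two balance conditions fall out of this time model by requiring the flop term to dominate each communication term:

    \gamma \cdot \#\mathrm{flops} \ \ge\ \beta \cdot \frac{\#\mathrm{flops}}{M^{1/2}}
    \iff \gamma M^{1/2} \ge \beta,
    \qquad
    \gamma \cdot \#\mathrm{flops} \ \ge\ \alpha \cdot \frac{\#\mathrm{flops}}{M^{3/2}}
    \iff \gamma M^{3/2} \ge \alpha.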


Page 32

2.5D algorithm – summary
• Interpolates between 2D (Cannon) and 3D
  – c copies of A and B
  – Perform p^(1/2)/c^(3/2) Cannon steps on each copy of A and B
  – Sum contributions to C over all c layers
• Communication costs (not quite optimal, but not far off)
  – Volume: O(n²/(cp)^(1/2))   [lower bound Ω(n²/(cp)^(1/2))]
  – Messages: O(p^(1/2)/c^(3/2) + log c)   [lower bound Ω(p^(1/2)/c^(3/2))]

Source: Edgar Solomonik


Page 33

Lower bounds results – in perspective
• Let M = size of fast memory per processor, e.g. cache
• # words moved per processor = Ω(#flops per processor / M^(1/2))
• # messages sent per processor = Ω(#flops per processor / M^(3/2))
• We identified 3 values of M, 3 different cases
  – 2D (Cannon's algorithm)
  – 3D (Johnson's algorithm)
  – 2.5D (Ballard and Demmel)
• Cannon's algorithm realizes the lower bound with 1 copy of the data, M ≈ n²/P
  – The lower bounds are Ω(n²/√P) words and Ω(√P) messages
  – The 2.5D algorithm effectively utilizes a higher ratio of flops/M