Advanced MPI programming


Page 1: Advanced MPI programming

Advanced MPI programming

Julien Langou

George Bosilca

Page 2: Advanced MPI programming

Outline

• Point-to-point communications, group and communicators, data-type, collective communications

• 4 real life applications to play with

Page 3: Advanced MPI programming

Point-to-point Communications

Page 4: Advanced MPI programming

Point-to-point Communications

• Data transfer from one process to another
– The communication requires a sender and a receiver, i.e. there must be a way to identify processes
– There must be a way to describe the data
– There must be a way to identify messages

Page 5: Advanced MPI programming

Who are you ?

• Processes are identified by a pair:
– Communicator: a “stream-like” safe place to pass messages
– Rank: the relative rank of the remote process in the group of processes attached to the corresponding communicator

• Messages are identified by a tag: an integer that allows you to differentiate messages within the same communicator

Page 6: Advanced MPI programming

What data to exchange ?

• The data is described by a triple:
– The address of the buffer where the data is located in the memory of the current process
– A count: how many elements make up the message
– A data-type: a description of the memory layout involved in the message

Page 7: Advanced MPI programming

The basic send

• Once this function returns, the data has been copied out of the user memory and the buffer can be reused

• This operation may require system buffers, in which case it will block until enough space is available

• The completion of the send does not say anything about the reception state of the message

MPI_Send(buf, count, type, dest, tag, communicator)

Page 8: Advanced MPI programming

The basic receive

• When this call returns, the data is located in the user memory

• Receiving fewer than count elements is permitted, but receiving more is an error

• The status contains information about the received message (useful with MPI_ANY_SOURCE and MPI_ANY_TAG)

MPI_Recv(buf, count, type, source, tag, communicator, status)

Page 9: Advanced MPI programming

The MPI_Status

• Structure in C (MPI_Status), array in Fortran (integer[MPI_STATUS_SIZE])

• Contains at least 3 fields (integers):
– MPI_TAG: the tag of the received message
– MPI_SOURCE: the rank of the source process in the corresponding communicator
– MPI_ERROR: the error raised by the receive (usually MPI_SUCCESS)
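
As a small illustration (not taken from the course codes), a minimal C program exchanging one message between ranks 0 and 1 could look as follows; the count and the tag are arbitrary choices:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf[4] = { 0, 1, 2, 3 };
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with at least 2 processes */

    if (rank == 0) {
        /* send 4 integers to rank 1 with tag 42 */
        MPI_Send(buf, 4, MPI_INT, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive them; the status carries the actual source and tag */
        MPI_Recv(buf, 4, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
        printf("got a message from %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}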

Page 10: Advanced MPI programming

Non Blocking Versions

• The memory pointed to by buf (and described by count and type) should not be touched until the communication is completed

• req is a handle to an MPI_Request which can be used to check the status of the communication …

MPI_Isend(buf, count, type, dest, tag, communicator, req)
MPI_Irecv(buf, count, type, source, tag, communicator, req)

Page 11: Advanced MPI programming

Testing for Completion

• Test or wait on the completion status of the request

• If the request is completed:
– Fill the status with the related information
– Release the request and replace it with MPI_REQUEST_NULL

Blocking: MPI_Wait(req, status)
Non-blocking: MPI_Test(req, flag, status)
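
A minimal sketch of this pattern (illustrative only: the buffer size, tag and placement of the computation are assumptions):

#include <mpi.h>

/* a sketch: post a nonblocking receive, do other work, then complete it */
void recv_with_overlap(MPI_Comm comm)
{
    double      buf[100];
    MPI_Request req;
    MPI_Status  status;
    int         flag = 0;

    MPI_Irecv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, 7, comm, &req);

    /* ... computation that does not touch buf ... */

    MPI_Test(&req, &flag, &status);   /* poll once, never blocks */
    if (!flag)
        MPI_Wait(&req, &status);      /* block until done; req is set to MPI_REQUEST_NULL */
}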

Page 12: Advanced MPI programming

Multiple Completions

• MPI_Waitany( count, array_of_requests, index, status )
• MPI_Testany( count, array_of_requests, index, flag, status )
• MPI_Waitall( count, array_of_requests, array_of_statuses )
• MPI_Testall( count, array_of_requests, flag, array_of_statuses )
• MPI_Waitsome( count, array_of_requests, outcount, array_of_indices, array_of_statuses )

Page 13: Advanced MPI programming

Synchronous Communications

• MPI_Ssend, MPI_Issend
– No restrictions …
– Does not complete until the corresponding receive has been posted, and the operation has been started
• It doesn’t mean the operation is completed
– Can be used instead of sending an ack
– Provides a simple way to avoid unexpected messages (i.e. no buffering on the receiver side)

Page 14: Advanced MPI programming

Buffered Communications

• MPI_Bsend, MPI_Ibsend
• MPI_Buffer_attach, MPI_Buffer_detach
– Buffered sends do not rely on system buffers
– The buffer is managed by MPI
– The buffer should be large enough to contain the largest message sent by one send
– The user cannot use the buffer as long as it is attached to the MPI library
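
A hedged sketch of the attach/send/detach sequence (the helper name and message size are made up for the example):

#include <mpi.h>
#include <stdlib.h>

/* a sketch of a buffered send: the user provides the buffer MPI copies into */
void buffered_send(const double *msg, int n, int dest, MPI_Comm comm)
{
    int   size;
    void *buffer;

    /* room for one message of n doubles plus the bsend overhead */
    MPI_Pack_size(n, MPI_DOUBLE, comm, &size);
    size += MPI_BSEND_OVERHEAD;
    buffer = malloc(size);

    MPI_Buffer_attach(buffer, size);
    MPI_Bsend((void *)msg, n, MPI_DOUBLE, dest, 0, comm); /* returns once copied into the buffer */
    MPI_Buffer_detach(&buffer, &size);                    /* blocks until the buffered messages are sent */
    free(buffer);
}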

Page 15: Advanced MPI programming

Ready Communications

• MPI_Rsend, MPI_Irsend
– Can ONLY be used if the matching receive has already been posted
– It’s the user’s responsibility to ensure the previous condition, otherwise the outcome is undefined
– Can be efficient in some situations as it avoids the rendezvous handshake
– Should be used with extreme caution

Page 16: Advanced MPI programming

Persistent Communications

• MPI_{B,R,S}send_init, MPI_Recv_init

• MPI_Start, MPI_Startall
– Reuse the same request for multiple communications
– Provide only a half-binding: the send is not connected to the receive
– A persistent send can be matched by any kind of receive
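
A hedged sketch of the set-up-once, start-many-times pattern (the tag, count and iteration structure are assumptions):

#include <mpi.h>

/* a sketch: set up a persistent send once, restart it every iteration */
void iterate_with_persistent_send(double *buf, int n, int dest, int niter, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Status  status;
    int         it;

    MPI_Send_init(buf, n, MPI_DOUBLE, dest, 99, comm, &req);  /* half-bound: only the send side is fixed */

    for (it = 0; it < niter; it++) {
        /* ... fill buf for this iteration ... */
        MPI_Start(&req);            /* activate the communication */
        MPI_Wait(&req, &status);    /* the request stays allocated and can be restarted */
    }
    MPI_Request_free(&req);         /* release the persistent request */
}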

Page 17: Advanced MPI programming

• Point-to-point communications
– 01-gs (simple)
– 02-nanopse (tuning)
– 03-summa (tuning)

• Groups and communicators
– 03-summa

• Data-types
– 04-lila

• Collectives
– MPI_OP: 04-lila

Page 18: Advanced MPI programming

01-gs

Page 19: Advanced MPI programming

Gram-Schmidt algorithm


The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications.

A = [A1; A2; A3],  Q = [Q1; Q2; Q3],  A = Q R

Input: A is block distributed by rows
Output: Q is block distributed by rows; R is global

Page 20: Advanced MPI programming

Example of applications: iterative methods.

a) in iterative methods with multiple right-hand sides (block iterative methods):

1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk).

2) BlockGMRES, BlockGCR, BlockCG, BlockQMR, …

b) in iterative methods with a single right-hand side

1) s-step methods for linear systems of equations (e.g. A. Chronopoulos),

2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc,

3) Recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).

c) in iterative eigenvalue solvers,

1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),

2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,

3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),

4) PRIMME (A. Stathopoulos, Coll. William & Mary ),

5) And also TRLAN, BLZPACK, IRBLEIGS.

Page 21: Advanced MPI programming

Example of applications:

a) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers),

b) in the QR factorization of large, dense, more nearly square matrices, where they are used as the panel factorization step, or more simply

c) in linear least squares problems in which the number of equations is much larger than the number of unknowns.

Page 22: Advanced MPI programming


The main characteristics of those three examples are that:

a) there is only one column of processors involved but several processor rows,

b) all the data is known from the beginning,

c) and the matrix is dense.

Page 23: Advanced MPI programming


Various methods already exist to perform the QR factorization of such matrices:

a) Gram-Schmidt (mgs(row), cgs),

b) Householder (qr2, qrf),

c) or CholeskyQR.

We present a new method:

Allreduce Householder (rhh_qr3, rhh_qrf).

Page 24: Advanced MPI programming

An example with modified Gram-Schmidt on a nonsingular m x 3 matrix: A = QR with Q^T Q = I3.

Q = A

r11 = || Q1 ||2 ;   Q1 = Q1 / r11

r12 = Q1^T Q2 ;   Q2 = Q2 – Q1 r12
r22 = || Q2 ||2 ;   Q2 = Q2 / r22

r13 = Q1^T Q3 ;   Q3 = Q3 – Q1 r13
r23 = Q2^T Q3 ;   Q3 = Q3 – Q2 r23
r33 = || Q3 ||2 ;   Q3 = Q3 / r33
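
When A is distributed by rows, every dot product and norm above is computed from local partial sums combined with an allreduce across the processors. A hedged C sketch of one column step of that idea (the variable names, the column-major layout and one MPI_Allreduce per inner product are assumptions, not the actual 01-gs code):

#include <mpi.h>
#include <math.h>

/* a sketch: orthogonalize column j of the local block Q (mloc x n, column-major,
   leading dimension lda) against the previous columns, MGS style */
void mgs_step(double *Q, int mloc, int lda, int j, double *R, int ldr, MPI_Comm comm)
{
    int i, k;

    for (k = 0; k < j; k++) {
        double rkj = 0.0, rkj_global;
        for (i = 0; i < mloc; i++)                 /* local part of Qk^T Qj */
            rkj += Q[i + k*lda] * Q[i + j*lda];
        MPI_Allreduce(&rkj, &rkj_global, 1, MPI_DOUBLE, MPI_SUM, comm);
        R[k + j*ldr] = rkj_global;
        for (i = 0; i < mloc; i++)                 /* Qj = Qj - Qk * rkj */
            Q[i + j*lda] -= Q[i + k*lda] * rkj_global;
    }

    double nrm2 = 0.0, nrm2_global;
    for (i = 0; i < mloc; i++)                     /* local part of || Qj ||^2 */
        nrm2 += Q[i + j*lda] * Q[i + j*lda];
    MPI_Allreduce(&nrm2, &nrm2_global, 1, MPI_DOUBLE, MPI_SUM, comm);
    R[j + j*ldr] = sqrt(nrm2_global);
    for (i = 0; i < mloc; i++)                     /* Qj = Qj / rjj */
        Q[i + j*lda] /= R[j + j*ldr];
}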

Page 25: Advanced MPI programming

Look at the codes

Page 26: Advanced MPI programming

The CholeskyQR Algorithm

SYRK:  C := A^T A        (mn^2)
CHOL:  R := chol( C )    (n^3/3)
TRSM:  Q := A \ R        (mn^2)

Page 27: Advanced MPI programming

Bibliography

• A. Stathopoulos and K. Wu, A block orthogonalization procedure with constant synchronization requirements, SIAM Journal on Scientific Computing, 23(6):2165-2182, 2002.

• Popularized by iterative eigensolver libraries:

1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),

2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,

3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),

4) PRIMME (A. Stathopoulos, Coll. William & Mary ).

Page 28: Advanced MPI programming

Parallel distributed CholeskyQR

The CholeskyQR method in the parallel distributed context can be described as follows:


1: SYRK:        C := A^T A        (mn^2)
2: MPI_Reduce:  C := sum of the Ci over the processors (on proc 0)
3: CHOL:        R := chol( C )    (n^3/3)
4: MPI_Bcast:   broadcast the R factor from proc 0 to all the other processors
5: TRSM:        Q := A \ R        (mn^2)

This method is extremely fast, for two reasons: first, there are only one or two communication phases; second, the local computations are performed with fast operations. Another advantage of this method is that the resulting code is exactly four lines, so the method is simple and relies heavily on other libraries. Despite all those advantages, this method is highly unstable.

Page 29: Advanced MPI programming

CholeskyQR - code

int choleskyqr_A_v0(int mloc, int n, double *A, int lda, double *R, int ldr,
                    MPI_Comm mpi_comm)
{
    int info;

    /* local Gram matrix, then global sum, Cholesky, and triangular solve */
    cblas_dsyrk( CblasColMajor, CblasUpper, CblasTrans, n, mloc,
                 1.0e+00, A, lda, 0e+00, R, ldr );
    MPI_Allreduce( MPI_IN_PLACE, R, n*n, MPI_DOUBLE, MPI_SUM, mpi_comm );
    lapack_dpotrf( lapack_upper, n, R, ldr, &info );
    cblas_dtrsm( CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                 CblasNonUnit, mloc, n, 1.0e+00, R, ldr, A, lda );

    return 0;
}

(And OK, you might want to add an MPI user defined datatype to send only the upper part of R)


Page 30: Advanced MPI programming

Let us look at the codes in 01-gs

Page 31: Advanced MPI programming

Efficient enough?

In this experiment, we fix the problem size: m = 100,000 and n = 50.

Performance in MFLOP/sec/proc (time in sec):

# of procs   cholqr          cgs             mgs(row)        qrf             mgs
 1           489.2 (1.02)    134.1 (3.73)    73.5 (6.81)     39.1 (12.78)    56.18 (8.90)
 2           467.3 (0.54)     78.9 (3.17)    39.0 (6.41)     22.3 (11.21)    31.21 (8.01)
 4           466.4 (0.27)     71.3 (1.75)    38.7 (3.23)     22.2 (5.63)     29.58 (4.23)
 8           434.0 (0.14)     67.4 (0.93)    36.7 (1.70)     20.8 (3.01)     21.15 (2.96)
16           359.2 (0.09)     54.2 (0.58)    31.6 (0.99)     18.3 (1.71)     14.44 (2.16)
32           197.8 (0.08)     41.9 (0.37)    29.0 (0.54)     15.8 (0.99)      8.38 (1.87)

[Plots: performance (MFLOP/sec/proc) and time (sec) versus # of procs for cholqr, cgs, mgs(row), qrf and mgs.]

Page 32: Advanced MPI programming

Stability

[Plots: || A – QR ||2 / || A ||2 and || I – Q^T Q ||2 versus κ2(A), for m = 100, n = 50.]

Page 33: Advanced MPI programming

02-nanopse

Page 34: Advanced MPI programming

Performance analysis tool in PESCAN

Optimization of codes through performance tools

Portability of codes while keeping good performance

Page 35: Advanced MPI programming

      ! pack the send buffer
      idum = 1
      do i = 1, nnodes
        do j = 1, ivunpn2(i)
          combuf1(idum) = psi(ivunp2(idum))
          idum = idum + 1
        enddo
      enddo
---------------------------------------------------------------------------------------
      ! nonblocking sends and receives, completed with a single waitall
      idum1 = 1
      idum2 = 1
      do i = 1, nnodes
        call mpi_isend(combuf1(idum1), ivunpn2(i), mpi_double_complex, &
                       i-1, inode, mpi_comm_world, ireq(2*i-1), ierr)
        idum1 = idum1 + ivunpn2(i)
        call mpi_irecv(combuf2(idum2), ivpacn2(i), mpi_double_complex, &
                       i-1, i, mpi_comm_world, ireq(2*i), ierr)
        idum2 = idum2 + ivpacn2(i)
      enddo

      call mpi_waitall( 2*nnodes, ireq, MPI_STATUSES_IGNORE, ierr )
---------------------------------------------------------------------------------------
      ! unpack the receive buffer
      idum = 1
      do i = 1, nnodes
        do j = 1, ivpacn2(i)
          psiy(ivpac2(idum)) = combuf2(idum)
          idum = idum + 1
        enddo
      enddo
--------------------------------------------------------------------------------------
      call mpi_barrier(mpi_comm_world, ierr)

      ! nonblocking sends, blocking receives, surrounded by barriers
      idum = 1
      do i = 1, nnodes
        call mpi_isend(combuf1(idum), ivunpn2(i), mpi_double_complex, i-1, &
                       inode, mpi_comm_world, ireq(i), ierr)
        idum = idum + ivunpn2(i)
      enddo

      idum = 1
      do i = 1, nnodes
        call mpi_recv(combuf2(idum), ivpacn2(i), mpi_double_complex, i-1, i, &
                      mpi_comm_world, mpistatus, ierr)
        idum = idum + ivpacn2(i)
      enddo

      do i = 1, nnodes
        call mpi_wait(ireq(i), mpistatus, ierr)
      end do

      call mpi_barrier(mpi_comm_world, ierr)
-------------------------------------------------------------------------------------

before (the second listing, with barriers and blocking receives) / after (the first listing, with nonblocking receives completed by a single waitall)

Page 36: Advanced MPI programming
Page 37: Advanced MPI programming

Original code: 13.2% of the overall time is spent at barriers.

Modified code: removing most of the barriers shifts part of the problem onto other synchronization points, but we observe a 6% improvement.

Page 38: Advanced MPI programming

• Lots of self-communication. After investigation, some copies can be avoided.

• Try different variants of asynchronous communication, removing barrier,….

Page 39: Advanced MPI programming

Cd 83 – Se 81 on 16 processors

             Original   Without me2me    Asynchronous receive,   Both
             code       communications   without barrier         modifications
Time (s)     18.23      15.42            15.29                   13.12
Ratio        1.00       0.85             0.83                    0.72

• In this example, we see that both modifications are useful and that they work well together for a small number of processors

Page 40: Advanced MPI programming

Comparison of different matrix-vector product implementations

ZB4096_Ecut30 (order = 2,156,241) – time for 50 matrix-vector products

[Plot: time (0 to 700 sec) on 16, 32, 64 and 128 processors for four variants: original, me2me, without barrier + asynchronous, and all three modifications.]

- me2me: 10% improvement on the matrix-vector products, and less buffer space used
- asynchronous receives are not a good idea here (this is why the original code does not use them)
- barriers are useful for large matrices (this is why the original code has them)

Page 41: Advanced MPI programming

General results

                                                Original (s)   Modified (s)   Improvement
83 Cd – 81 Se (~34K), 16 processors             18.23          13.23          28%
232 Cd – 235 Se (~75K), 16 processors           36.00          26.16          27%
534 Cd – 527 Se, 2x16 processors                50.50          38.78          23%
83 Cd – 81 Se – Ecut 30, 16 processors          156.33         123.89         21%
83 Cd – 81 Se – Ecut 30, 2x16 processors        103.50         69.22          34%

Page 42: Advanced MPI programming

(Same before/after code as shown on Page 35.)

This is just one way of doing it. A better solution: use (and have) tuned MPI global communications, here MPI_Alltoallv.
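
A hedged C sketch of that idea; the count and displacement arrays are assumed to play the roles of ivunpn2/ivpacn2 and their prefix sums in the Fortran code above:

#include <mpi.h>
#include <complex.h>

/* a sketch: the whole pack/isend/recv/unpack phase collapses into one tuned collective.
   sendcounts/recvcounts correspond to ivunpn2/ivpacn2; sdispls/rdispls are their
   prefix sums (assumed precomputed). */
void exchange_with_alltoallv(double complex *sendbuf, int *sendcounts, int *sdispls,
                             double complex *recvbuf, int *recvcounts, int *rdispls,
                             MPI_Comm comm)
{
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_C_DOUBLE_COMPLEX,
                  recvbuf, recvcounts, rdispls, MPI_C_DOUBLE_COMPLEX, comm);
}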

Page 43: Advanced MPI programming

03-summa

Page 44: Advanced MPI programming

First rule in linear algebra: have an efficient DGEMM

• All the dense linear algebra operations rely on an efficient DGEMM (matrix-matrix multiply)

• This is by far the easiest O(n^3) operation in dense linear algebra.
– So if we cannot implement DGEMM correctly (at peak performance), we will not be able to do much for the other operations.

Page 45: Advanced MPI programming

Blocked LU and QR algorithms (LAPACK)

LAPACK block LU (right-looking): dgetrf
  Panel factorization:               dgetf2
  Update of the remaining submatrix: dtrsm (+ dswp), dgemm

LAPACK block QR (right-looking): dgeqrf
  Panel factorization:               dgeqf2 + dlarft
  Update of the remaining submatrix: dlarfb

Page 46: Advanced MPI programming

Blocked LU and QR algorithms (LAPACK)

LAPACK block LU (right-looking): dgetrf

  Panel factorization (dgetf2): latency bound, more than nb AllReduces for n*nb^2 ops

  Update of the remaining submatrix (dtrsm + dswp, dgemm): CPU and bandwidth bound, the bulk of the computation (n*n*nb ops), highly parallelizable, efficient and scalable.

Page 47: Advanced MPI programming

Parallel Distributed MM algorithms

[Figure: C = α A·B + β C with A, B and C distributed on a 3x3 process grid; process (i,j) owns the blocks Aij, Bij and Cij.]

Page 48: Advanced MPI programming

Parallel Distributed MM algorithms

• Use the outer-product version of the matrix-matrix multiply algorithm:

C = C + A·B, computed as a sum of outer products of block columns of A with the corresponding block rows of B.

Page 49: Advanced MPI programming

Parallel Distributed MM algorithms

PDGEMM:

For k = 1:nb:n,
    C = C + A(:, k:k+nb-1) · B(k:k+nb-1, :)
End For

Page 50: Advanced MPI programming

Parallel Distributed MM algorithms

[Figure: the 3x3 process grid with the blocks Aij, Bij and Cij.]

Page 51: Advanced MPI programming

Parallel Distributed MM algorithms

[Figure: the 3x3 process grid, highlighting the blocks involved in the current step.]

Page 52: Advanced MPI programming

Parallel Distributed MM algorithms

1. Broadcast of size nb*nloc along the columns, root is active_row.

Page 53: Advanced MPI programming

Parallel Distributed MM algorithms

1. Broadcast of size nb*nloc along the columns, root is active_row.

2. Broadcast of size nloc*nb along the rows, root is active_col.

Page 54: Advanced MPI programming

Parallel Distributed MM algorithms

1. Broadcast of size nb*nloc along the columns, root is active_row.

2. Broadcast of size nloc*nb along the rows, root is active_col.

3. Perform the local matrix-matrix multiply; the number of FLOPS is nloc*nloc*nb.
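
A hedged C sketch of one such step, assuming row_comm/col_comm are the row and column communicators of the process grid (for instance built with MPI_Comm_split, see the Groups and Communicators part), that active_row/active_col are ranks within those communicators, and that the panels are stored column-major; this is illustrative, not the 03-summa code:

#include <mpi.h>
#include <cblas.h>

/* a sketch of one SUMMA step: the owners' panels are broadcast along the process
   columns/rows, then every process does a local rank-nb update of its C block.
   Apanel is nloc x nb, Bpanel is nb x nloc, C is nloc x nloc; on non-root
   processes the panels are simply receive buffers. */
void summa_step(int nloc, int nb, int active_row, int active_col,
                double *Apanel, double *Bpanel, double *C,
                MPI_Comm row_comm, MPI_Comm col_comm)
{
    /* 1. broadcast of size nb*nloc along the columns, root is active_row */
    MPI_Bcast(Bpanel, nb * nloc, MPI_DOUBLE, active_row, col_comm);

    /* 2. broadcast of size nloc*nb along the rows, root is active_col */
    MPI_Bcast(Apanel, nloc * nb, MPI_DOUBLE, active_col, row_comm);

    /* 3. local multiply: C += Apanel * Bpanel, nloc*nloc*nb flops */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                nloc, nloc, nb, 1.0, Apanel, nloc, Bpanel, nb, 1.0, C, nloc);
}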

Page 55: Advanced MPI programming

Model.

• γ is the time for one operation,
• β is the time to send one entry,
• various algorithms/models depending on the broadcast algorithm used (pipeline = SUMMA, tree = PUMMA, etc.).

Time for computation = (# of operations) * (time for one operation, γ) / (# of processors)

Time for communication = 2 * (broadcast among sqrt(p) processors of n^2/sqrt(p) entries)

Time with blocking communication = time for computation + time for communication

Time with nonblocking communication = max( time for computation , time for communication )

Page 56: Advanced MPI programming

Model.

• γ is the time for one operation,
• β is the time to send one entry,
• various algorithms/models depending on the broadcast algorithm used (pipeline = SUMMA, tree = PUMMA, etc.).

• For the TN cluster of PS3s:
– Bandwidth = 600 Mb/sec (with Gigabit Ethernet)
– Flop rate = 149.85 GFLOPs/sec

Time with blocking communication = time for computation + time for communication

Time with nonblocking communication = max( time for computation , time for communication )

Page 57: Advanced MPI programming

jacquard.nersc.gov

• Processor type: Opteron 2.2 GHz
• Processor theoretical peak: 4.4 GFlops/sec
• Number of application processors: 712
• System theoretical peak (computational nodes): 3.13 TFlops/sec
• Number of shared-memory application nodes: 356
• Processors per node: 2
• Physical memory per node: 6 GBytes
• Usable memory per node: 3-5 GBytes
• Switch interconnect: InfiniBand
• Switch MPI unidirectional latency: 4.5 μsec
• Switch MPI unidirectional bandwidth (peak): 620 MB/s
• Global shared disk: GPFS, usable disk space 30 TBytes
• Batch system: PBS Pro

Page 58: Advanced MPI programming

Mvapich vs FTMPI

[Plots: GFLOPs/sec/proc versus # of processors, comparing Mvapich and FTMPI.]

Page 59: Advanced MPI programming

Modeling.

Page 60: Advanced MPI programming

Performance model.

To get 90% efficiency with nonblocking operations, one needs to work on matrices of size 14,848. To get 90% efficiency with blocking operations, one needs to work on matrices of size 146,949.

It is really worth using nonblocking communication.

Take for instance the TN cluster of four PS3s:
Bandwidth = 600 Mb/sec (with Gigabit Ethernet)
Flop rate = 149.85 GFLOPs/sec (theoretical peak is 153.6 GFLOPs/sec)

Alfredo Buttari, Jakub Kurzak, and Jack Dongarra. Limitations of the PlayStation 3 for high performance cluster computing. Technical Report UT-CS-07-597, Innovative Computing Laboratory, University of Tennessee Knoxville, April 2007.

Page 61: Advanced MPI programming

Oops… the memory of a PS3 is 256 MB.

There is no way to hide the n^2 term (communication) behind the n^3 term (computation): n cannot get big enough. Actually, it is the computation that is hidden by the communication. Dense linear algebra is stuck; there is nothing to be done.

Alfredo Buttari, Jakub Kurzak, and Jack Dongarra. Limitations of the PlayStation 3 for high performance cluster computing. Technical Report UT-CS-07-597, Innovative Computing Laboratory, University of Tennessee Knoxville, April 2007.

Page 62: Advanced MPI programming

Three Solutions

Instead of 600 Mb/sec – 258 MB – 6 SPEs:

• Increase the memory of the nodes:               600 Mb/sec – 3.3 GB – 6 SPEs
• Increase the bandwidth of the network:          1.39 Gb/sec – 258 MB – 6 SPEs
• Decrease the computational power of the nodes:  600 Mb/sec – 258 MB – 2 SPEs

Three ways to complain:
• The network is too slow (complain to GigE)
• There is not enough memory on the nodes (complain to Sony)
• The nodes are too fast (complain to IBM)

Page 63: Advanced MPI programming

Groups and Communicators

Page 64: Advanced MPI programming

Groups and Communicators

• Break MPI_COMM_WORLD into smaller sets of processes that have a specific relationship

• Each communicator has a group of processes attached, which are all the processes that can be contacted using this communicator
– The processes are indexed in the group by their rank, which is contiguous and starts from 0

Page 65: Advanced MPI programming

Groups vs. Communicators

• The BIG difference
– Groups are local entities while communicators are global

[Figure: MPI_COMM_WORLD with ranks 0–7, divided into Communicator_1 and Communicator_2.]

Page 66: Advanced MPI programming

Operations on Groups

• Retrieve the rank and the size
• Translate the ranks from one group to another, compare 2 groups
• Constructors: create one group from another based on a defined relationship
– Creating a group is a local operation
– Once created, the group is not attached to any communicator (i.e. no communication is possible with it yet)

Page 67: Advanced MPI programming

Group Constructors

• Extract the group from a communicator (MPI_Comm_group)

• Union, intersection or difference of two other groups:
– Union: returns a group with all processes from group1 followed by the processes of group2 that are not in group1
– Intersection: contains all processes that are in both groups, ordered as in group1
– Difference: contains all processes that are in group1 but not in group2, ordered as in group1

Page 68: Advanced MPI programming

Example

• Let group1={a,b,c,d,e,f,g} and group2={d,g,a,c,h,i}

• Union(group1,group2): newgroup={a,b,c,d,e,f,g,h,i}

• Intersection(group1,group2): newgroup={a,c,d,g}

• Difference(group1,group2):newgroup={b,e,f}

• Union(group2,group1): newgroup={d,g,a,c,h,i,b,e,f}

• Intersection(group2,group1): newgroup={d,g,a,c}

• Difference(group2,group1):newgroup={h,i}

Page 69: Advanced MPI programming

Group Constructors

• Inclusion and exclusion, by element or by range: MPI_Group_*(group, n, ranks, newgroup)

• When by element, ranks is a simple array of integers indicating which ranks of the old group go into the new group

• When by range, ranks is an array of triples (int ranks[][3]) containing the start rank, the end rank and the stride

• For exclusion, the order of the processes in the resulting group is the same as in the original group; for inclusion, it follows the order given in ranks
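
A hedged sketch combining a range inclusion with MPI_Comm_group and MPI_Comm_create from the next slides (the even/odd split is just an example):

#include <mpi.h>

/* a sketch: build a communicator containing only the even-ranked processes */
void make_even_comm(MPI_Comm comm, MPI_Comm *evencomm)
{
    MPI_Group world_group, even_group;
    int size;
    int ranges[1][3];

    MPI_Comm_size(comm, &size);
    MPI_Comm_group(comm, &world_group);      /* extract the group (local operation) */

    ranges[0][0] = 0;                        /* first rank  */
    ranges[0][1] = size - 1;                 /* last rank   */
    ranges[0][2] = 2;                        /* stride: keep 0, 2, 4, ... */
    MPI_Group_range_incl(world_group, 1, ranges, &even_group);

    /* collective over comm: the odd ranks get MPI_COMM_NULL */
    MPI_Comm_create(comm, even_group, evencomm);

    MPI_Group_free(&world_group);
    MPI_Group_free(&even_group);
}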

Page 70: Advanced MPI programming

Operations on Communicators

• Retrieve the size and the rank
• Compare 2 communicators:
– MPI_IDENT: they are handles to the same object
– MPI_CONGRUENT: they have the same group (same processes in the same order)
– MPI_SIMILAR: they contain the same processes in a different order
– MPI_UNEQUAL: in all other cases

Page 71: Advanced MPI programming

Communicator Constructors

• MPI_Comm_dup(oldcomm, newcomm)
– the basic communicator duplication
– the newcomm has the same attributes as oldcomm

• MPI_Comm_create(oldcomm, group, newcomm)
– the newcomm contains only the processes in the group
– MPI_COMM_NULL is returned to all other processes
– 2 requirements: the group must be a subset of the processes of oldcomm, and all processes must use the same group

Page 72: Advanced MPI programming

Communicator Constructors

• MPI_Comm_split(oldcomm, color, key, newcomm)
– Creates as many groups and communicators as there are distinct values of color
– The rank in the new group is determined by the value of key; ties are broken according to the rank in oldcomm
– MPI_UNDEFINED can be used as the color for processes that should not be included in any of the new communicators

Page 73: Advanced MPI programming

Example

• MPI_Comm_split

rank    0  1  2  3  4  5  6  7  8  9  10
proc    a  b  c  d  e  f  g  h  i  j  k
color   U  3  1  1  3  7  3  3  1  U  3
key     0  1  2  3  1  9  3  8  1  0  0

Page 74: Advanced MPI programming

Example

• 3 new communicators are created
– Color = 3 : {b:1, e:1, g:3, h:8, k:0}
– Color = 1 : {c:2, d:3, i:1}
– Color = 7 : {f:9}

rank    0  1  2  3  4  5  6  7  8  9  10
proc    a  b  c  d  e  f  g  h  i  j  k
color   U  3  1  1  3  7  3  3  1  U  3
key     0  1  2  3  1  9  3  8  1  0  0

Page 75: Advanced MPI programming

Example

• 3 new communicators are created
– Color = 3 : {b:1, e:1, g:3, h:8, k:0} -> {k, b, e, g, h}
– Color = 1 : {c:2, d:3, i:1} -> {i, c, d}
– Color = 7 : {f:9} -> {f}

rank    0  1  2  3  4  5  6  7  8  9  10
proc    a  b  c  d  e  f  g  h  i  j  k
color   U  3  1  1  3  7  3  3  1  U  3
key     0  1  2  3  1  9  3  8  1  0  0
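
In practice, MPI_Comm_split is a natural way to build the row and column communicators used by an algorithm like SUMMA (03-summa). A hedged sketch, assuming the number of processes is q*q:

#include <mpi.h>

/* a sketch: split MPI_COMM_WORLD into row and column communicators
   for a q x q process grid */
void build_grid_comms(int q, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank, myrow, mycol;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    myrow = rank / q;
    mycol = rank % q;

    /* same color -> same new communicator; the key orders the ranks inside it */
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm);
}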

Page 76: Advanced MPI programming

03 - summa (pumma)

Page 77: Advanced MPI programming

Data-types

Page 78: Advanced MPI programming

Data Representation

• Different across different machines
– Length: 32 vs. 64 bits (vs. …?)
– Endianness: big vs. little

• Problems
– No standard for the data lengths in the programming languages (C/C++)
– No standard floating point data representation
• IEEE Standard 754 floating point numbers
– Subnormals, infinities, NaNs …
• Same representation but different lengths

Page 79: Advanced MPI programming

MPI Datatypes

• MPI uses “datatypes” to:
– Efficiently represent and transfer data
– Minimize memory usage
• Even between heterogeneous systems
– Used in most communication functions (MPI_SEND, MPI_RECV, etc.)
– And in file operations

• MPI contains a large number of pre-defined datatypes

Page 80: Advanced MPI programming

Some of MPI’s Pre-Defined Datatypes

MPI_Datatype          C datatype          Fortran datatype
MPI_CHAR              signed char         CHARACTER
MPI_SHORT             signed short int    INTEGER*2
MPI_INT               signed int          INTEGER
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float               REAL
MPI_DOUBLE            double              DOUBLE PRECISION
MPI_LONG_DOUBLE       long double         DOUBLE PRECISION*8

Page 81: Advanced MPI programming

Datatype Conversion

• “Data sent = data received”

• 2 types of conversions:
– Representation conversion: change the binary representation (e.g., hex floating point to IEEE floating point)
– Type conversion: convert between different types (e.g., int to float)

Only representation conversion is allowed

Page 82: Advanced MPI programming

Datatype Conversion

/* matching type signatures: correct */
if (my_rank == root)
    MPI_Send(msg, 1, MPI_INT, …)
else
    MPI_Recv(msg, 1, MPI_INT, …)

/* int sent but float received: a type conversion, not allowed */
if (my_rank == root)
    MPI_Send(msg, 1, MPI_INT, …)
else
    MPI_Recv(msg, 1, MPI_FLOAT, …)

Page 83: Advanced MPI programming

Datatype Specifications

• Type signature (used for message matching):
{ type0, type1, …, typen }

• Type map (used for local operations):
{ (type0, disp0), (type1, disp1), …, (typen, dispn) }

It’s all about the memory layout

Page 84: Advanced MPI programming

User-Defined Datatypes

• Applications can define unique datatypes
– Composition of other datatypes
– MPI functions provided for common patterns

• Contiguous

• Vector

• Indexed

• …

Always reduces to a type map of pre-defined datatypes

Page 85: Advanced MPI programming

Contiguous Blocks

• Replication of the datatype into contiguous locations.

MPI_Type_contiguous( 3, oldtype, newtype )

MPI_TYPE_CONTIGUOUS( count, oldtype, newtype )
  IN  count    replication count (positive integer)
  IN  oldtype  old datatype (MPI_Datatype handle)
  OUT newtype  new datatype (MPI_Datatype handle)

Page 86: Advanced MPI programming

Vectors

• Replication of a datatype into locations that consist of equally spaced blocks

MPI_Type_vector( 7, 2, 3, oldtype, newtype )

MPI_TYPE_VECTOR( count, blocklength, stride, oldtype, newtype )
  IN  count        number of blocks (positive integer)
  IN  blocklength  number of elements in each block (positive integer)
  IN  stride       number of elements between the start of each block (integer)
  IN  oldtype      old datatype (MPI_Datatype handle)
  OUT newtype      new datatype (MPI_Datatype handle)
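
For example, a vector type is the natural way to describe one column of a matrix stored row-major in C; a hedged sketch (M, N and the row-major storage are assumptions made for the example):

#include <mpi.h>

/* a sketch: one column of an M x N row-major matrix of doubles,
   i.e. M blocks of 1 element separated by a stride of N elements */
void make_column_type(int M, int N, MPI_Datatype *coltype)
{
    MPI_Type_vector(M, 1, N, MPI_DOUBLE, coltype);
    MPI_Type_commit(coltype);   /* required before the type is used in a communication */
}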

Page 87: Advanced MPI programming

Indexed Blocks

• Replication of an old datatype into a sequence of blocks, where each block can contain a different number of copies and have a different displacement

MPI_TYPE_INDEXED( count, array_of_blocklengths, array_of_displs, oldtype, newtype )
  IN  count    number of blocks (positive integer)
  IN  a_of_b   number of elements per block (array of positive integers)
  IN  a_of_d   displacement of each block from the beginning, in multiples of oldtype (array of integers)
  IN  oldtype  old datatype (MPI_Datatype handle)
  OUT newtype  new datatype (MPI_Datatype handle)

Page 88: Advanced MPI programming

Indexed Blocks

array_of_blocklengths[] = { 2, 3, 1, 2, 2, 2 }
array_of_displs[]       = { 0, 3, 10, 13, 16, 19 }

MPI_Type_indexed( 6, array_of_blocklengths, array_of_displs, oldtype, newtype )

[Figure: the resulting layout, blocks B[0]..B[5] of the given lengths placed at displacements D[0]..D[5].]

Page 89: Advanced MPI programming

Datatype Composition

• Each of these functions is a superset of the previous one

CONTIGUOUS < VECTOR < INDEXED

• They extend the description of the datatype by allowing more complex memory layouts
– Not all data structures fit the common patterns
– Not all data structures can be described as compositions of others

Page 90: Advanced MPI programming

“H” Functions

• The displacement is no longer a multiple of another datatype

• Instead, the displacement is in bytes
– MPI_TYPE_HVECTOR
– MPI_TYPE_HINDEXED

• Otherwise, similar to their non-“H” counterparts

Page 91: Advanced MPI programming

Arbitrary Structures

• The most general datatype constructor

• Allows each block to consist of replications of a different datatype

MPI_TYPE_CREATE_STRUCT( count, array_of_blocklengths, array_of_displs, array_of_types, newtype )
  IN  count    number of entries in each array (positive integer)
  IN  a_of_b   number of elements in each block (array of integers)
  IN  a_of_d   byte displacement of each block (array of Aint)
  IN  a_of_t   type of the elements in each block (array of MPI_Datatype handles)
  OUT newtype  new datatype (MPI_Datatype handle)

Page 92: Advanced MPI programming

Arbitrary Structures

struct {
    int   i[3];
    float f[2];
} array[100];

array_of_lengths[] = { 2, 1 };
array_of_displs[]  = { 0, 3*sizeof(int) };
array_of_types[]   = { MPI_INT, MPI_FLOAT };

MPI_Type_struct( 2, array_of_lengths, array_of_displs, array_of_types, newtype );

Memory layout of one element:  int int int float float
  length[0] = 2 ints starting at displs[0] = 0
  length[1] = 1 float starting at displs[1] = 3*sizeof(int)

Page 93: Advanced MPI programming

Arbitrary Structures

array_of_lengths[] = { 2, 1 };
array_of_displs[]  = { 0, 3*sizeof(int) };
array_of_types[]   = { MPI_INT, MPI_FLOAT };

Memory layout of one element:  int int int float float

Page 94: Advanced MPI programming

Arbitrary Structures

array_of_lengths[] = { 2, 1 };
array_of_displs[]  = { 0, 3*sizeof(int) };
array_of_types[]   = { MPI_INT, MPI_FLOAT };

Real data description: 2 ints at displacement 0 and 1 float at displacement 3*sizeof(int) in each struct element.

Page 95: Advanced MPI programming

MPI_GET_ADDRESS

• Allows all languages to compute displacements
– Necessary in Fortran
– Usually unnecessary in C (e.g., “&foo”)

MPI_GET_ADDRESS( location, address )
  IN  location  location in the caller memory (choice)
  OUT address   address of the location (address-sized integer)

Page 96: Advanced MPI programming

And Now the Dark Side…

• Sometimes a more complex memory layout has to be expressed

[Figure: a buffer of elements where only the “interesting part” of each element is described by the datatype ddt; the extent of each element controls where the next one starts, so MPI_Send( buf, 3, ddt, … ) walks over three of them.]

Page 97: Advanced MPI programming

Lower-Bound and Upper-Bound Markers

• Define datatypes with “holes” at the beginning or at the end

• 2 pseudo-types: MPI_LB and MPI_UB
– Used with MPI_TYPE_STRUCT

Typemap = { (type0, disp0), …, (typen, dispn) }

lb(Typemap) = min_j disp_j                               if no entry has type lb
            = min_j { disp_j such that type_j = lb }     otherwise

ub(Typemap) = max_j ( disp_j + sizeof(type_j) ) + align  if no entry has type ub
            = max_j { disp_j such that type_j = ub }     otherwise

Page 98: Advanced MPI programming

MPI_LB and MPI_UB

displs       = ( -3, 0, 6 )
blocklengths = ( 1, 1, 1 )
types        = ( MPI_LB, MPI_INT, MPI_UB )

MPI_Type_struct( 3, blocklengths, displs, types, type1 )
  Typemap = { (lb, -3), (int, 0), (ub, 6) }

MPI_Type_contiguous( 3, type1, type2 )
  Typemap = { (lb, -3), (int, 0), (int, 9), (int, 18), (ub, 24) }

Page 99: Advanced MPI programming

True Lower-Bound and True Upper-Bound Markers

• Define the real extent of the datatype: the amount of memory needed to copy the datatype

• TRUE_LB defines the lower bound ignoring all the MPI_LB markers (and TRUE_UB the upper bound ignoring all the MPI_UB markers)

Typemap = { (type0, disp0), …, (typen, dispn) }

true_lb(Typemap) = min_j { disp_j : type_j != lb }
true_ub(Typemap) = max_j { disp_j + sizeof(type_j) : type_j != ub }

Page 100: Advanced MPI programming

Information About Datatypes

MPI_TYPE_GET_{TRUE_}EXTENT( datatype, {true_}lb, {true_}extent )
  IN  datatype        the datatype (MPI_Datatype handle)
  OUT {true_}lb       {true} lower bound of the datatype (MPI_Aint)
  OUT {true_}extent   {true} extent of the datatype (MPI_Aint)

MPI_TYPE_SIZE( datatype, size )
  IN  datatype  the datatype (MPI_Datatype handle)
  OUT size      datatype size (integer)

[Figure: the size, true extent and extent of a datatype.]

Page 101: Advanced MPI programming

Test your Data-type skills

• Imagine the following architecture:
– Integer size is 4 bytes
– Cache line is 16 bytes

• We want to create a datatype containing the second integer from each cache line, repeated three times

• How many ways are there?

Page 102: Advanced MPI programming

Solution 1

MPI_Datatype array_of_types[] = { MPI_INT, MPI_INT, MPI_INT, MPI_UB };
MPI_Aint start, array_of_displs[] = { 0, 0, 0, 0 };
int array_of_lengths[] = { 1, 1, 1, 1 };
struct one_by_cacheline c[4];

MPI_Get_address( &c[0], &(start) );
MPI_Get_address( &c[0].int[1], &(array_of_displs[0]) );
MPI_Get_address( &c[1].int[1], &(array_of_displs[1]) );
MPI_Get_address( &c[2].int[1], &(array_of_displs[2]) );
MPI_Get_address( &c[3], &(array_of_displs[3]) );

for( i = 0; i < 4; i++ )
    array_of_displs[i] -= start;

MPI_Type_create_struct( 4, array_of_lengths, array_of_displs, array_of_types, newtype )

Page 103: Advanced MPI programming

Solution 2

MPI_Datatype array_of_types[] = { MPI_INT, MPI_UB };
MPI_Aint start, array_of_displs[] = { 4, 16 };
int array_of_lengths[] = { 1, 1 };
struct one_by_cacheline c[2];

MPI_Get_address( &c[0], &(start) );
MPI_Get_address( &c[0].int[1], &(array_of_displs[0]) );
MPI_Get_address( &c[1], &(array_of_displs[1]) );

array_of_displs[0] -= start;
array_of_displs[1] -= start;

MPI_Type_create_struct( 2, array_of_lengths, array_of_displs, array_of_types, temp_type )
MPI_Type_contiguous( 3, temp_type, newtype )

Page 104: Advanced MPI programming

Data-type for triangular matrices

Application Cholesky QR

Page 105: Advanced MPI programming

EXAMPLE OF CONSTRUCTION OF A DATATYPE FOR TRIANGULAR MATRICES, EXAMPLE OF AN MPI_OP ON TRIANGULAR MATRICES

• See:
– choleskyqr_A_v1.c
– choleskyqr_B_v1.c
– LILA_mpiop_sum_upper.c

• Starting from choleskyqr_A_v0.c, this
– shows how to construct a datatype for a triangular matrix
– shows how to use an MPI_OP on that datatype for an Allreduce operation
– here we simply want to sum the upper triangular matrices together

Page 106: Advanced MPI programming

TRICK FOR TRIANGULAR MATRICES DATATYPES

• See:
– check_orthogonality_RFP.c
– choleskyqr_A_v2.c
– choleskyqr_A_v3.c
– choleskyqr_B_v2.c
– choleskyqr_B_v3.c

• A trick that uses the RFP format to do a fast allreduce on P triangular matrices without datatypes. The trick is at the user level.

Page 107: Advanced MPI programming

Idea behind RFP

• Rectangular full packed format

• Just be careful with the odd and even matrix dimension cases

Page 108: Advanced MPI programming

Collective Communications

Page 109: Advanced MPI programming

Collective Communications

• Involves all processes in a communicator
– All processes must participate
– May be a subset of all running processes
– May be more than what you started with

• Blocking (logical) semantics only
– BUT: some processes may not block
– Except barrier: all processes block until each process has reached the barrier

Page 110: Advanced MPI programming

Operations

• MPI defines several collective operations
– Some are rooted (e.g., broadcast)
– Others are rootless (e.g., barrier)

• “Collectives” generally refers to data-passing collective operations
– Although technically it also refers to any action in MPI where all processes in a communicator must participate
– Example: communicator maintenance

Page 111: Advanced MPI programming

Barrier Synchronization

• Logical operation
– All processes block until each has reached the barrier
– The official MPI synchronization call

MPI_Barrier( comm )

Page 112: Advanced MPI programming

Broadcast

• Logical operation
– Send data from one process (the “root”) to all the others

• A broadcast is not a synchronization (!!!)

MPI_Bcast(buffer, cnt, type, root, comm )

Page 113: Advanced MPI programming

Gather

• Logical operation
– Obtain data from each process and assemble it at a root process
– The receive arguments are only meaningful at the root
– Each process sends the same amount of data
– The root can use MPI_IN_PLACE

MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm)

Page 114: Advanced MPI programming

Gatherv

• The vector variant of the gather operation

• Each process participates with a different amount of data

• Allows the root to specify where the data goes

• No overwrite is allowed (!!!)

MPI_Gatherv(sendbuf, sendcnt, sendtype, recvbuf, recvcnts, displs, recvtype, root, comm)
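
A hedged sketch of the bookkeeping on the root (the per-rank contribution of rank+1 integers is just an example):

#include <mpi.h>
#include <stdlib.h>

/* a sketch: every rank contributes (rank+1) integers; the root lays them out
   contiguously using recvcounts and displs */
void gather_variable(MPI_Comm comm, int root)
{
    int rank, size, i;
    int *sendbuf, *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int mycount = rank + 1;
    sendbuf = malloc(mycount * sizeof(int));
    for (i = 0; i < mycount; i++) sendbuf[i] = rank;

    if (rank == root) {
        recvcounts = malloc(size * sizeof(int));
        displs     = malloc(size * sizeof(int));
        int total = 0;
        for (i = 0; i < size; i++) {
            recvcounts[i] = i + 1;     /* how much each rank sends      */
            displs[i]     = total;     /* where it lands in recvbuf     */
            total        += recvcounts[i];
        }
        recvbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, mycount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, root, comm);

    free(sendbuf);
    if (rank == root) { free(recvcounts); free(displs); free(recvbuf); }
}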

Page 115: Advanced MPI programming

Scatter

• Logical operation
– Opposite of gather
– Send a portion of the root’s buffer to each process
– The root can use MPI_IN_PLACE for the receive buffer

MPI_Scatter(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm)

Page 116: Advanced MPI programming

Scatterv

• The logical extension of the scatter
– No portion of sendbuf can be sent more than once
– The root can use MPI_IN_PLACE for the receive buffer

MPI_Scatterv(sendbuf, sendcnts, displs, sendtype, recvbuf, recvcnt, recvtype, root, comm)

Page 117: Advanced MPI programming

All Gather

• Same as gather except that all processes get the full result

• MPI_IN_PLACE can be used on all processes instead of the sendbuf

• Equivalent to a gather followed by a broadcast

• There is a v version (MPI_Allgatherv)

MPI_Allgather(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, comm)

Page 118: Advanced MPI programming

All to All

• Logical operation
– Combined scatter and gather
– Not an all-broadcast

• Uniform and vector versions defined

MPI_Alltoall(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, comm)

Page 119: Advanced MPI programming

Global Reduction

• Logical operation
– Mathematical reduction

• Pre-defined MPI operations
– Min, max, sum, …
– Always commutative and associative

• User-defined operations

• User-defined operations

MPI_Reduce(sendbuf, recvbuf, count, type, op, root, comm)

Page 120: Advanced MPI programming

All Reduce

• Logical operation
– Reduce where all processes get the result
– Similar to a reduce followed by a broadcast

MPI_Allreduce(sendbuf, recvbuf, count, type, op, comm)

Page 121: Advanced MPI programming

Reduce and Scatter

• Logical operation
– Global reduction
– Scatterv of the result

MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, type, op, comm)

Page 122: Advanced MPI programming

Scan

• Logical operation
– Mathematical (prefix) scan
– Both inclusive (MPI_Scan) and exclusive (MPI_Exscan) versions exist

• All processes get a result
– Except process 0 in an exclusive scan

MPI_Scan(sendbuf, recvbuf, count, type, op, comm)

Page 123: Advanced MPI programming

User defined MPI_Op

• MPI_Op_create( function, commute, op )
• MPI_Op_free( op )
• If commute is true, the operation is assumed to be commutative
• function is a user-defined function with 4 arguments:
– invec: the input vector
– inoutvec: the input and output vector
– count: the number of elements
– datatype: the data-type description
• Result:
– inoutvec[i] = invec[i] op inoutvec[i] for i in [0..count-1]

Page 124: Advanced MPI programming

User defined MPI_Op

04 - lila

MPI_OP to compute || x || (for Gram-Schmidt)

Weirdest MPI_OP ever: motivation & results

Weirdest MPI_OP ever: how to attach attributes to a datatype