Advanced MPI programming


Page 1: Advanced MPI programming

Advanced MPI programming

Julien Langou

George Bosilca

Page 2: Advanced MPI programming

Outline

• Point-to-point communications, group and communicators, data-type, collective communications

• 4 real life applications to play with

Page 3: Advanced MPI programming

Point-to-point Communications

Page 4: Advanced MPI programming

Point-to-point Communications

• Data transfer from one process to another
– The communication requires a sender and a receiver, i.e. there must be a way to identify processes
– There must be a way to describe the data
– There must be a way to identify messages

Page 5: Advanced MPI programming

Who are you ?

• Processes are identified by a pair:
– Communicator: a “stream-like” safe place to pass messages
– Rank: the relative rank of the remote process in the group of processes attached to the corresponding communicator

• Messages are identified by a tag: an integer that allows you to differentiate messages within the same communicator

Page 6: Advanced MPI programming

What data to exchange ?

• The data is described by a triple:
– The address of the buffer where the data is located in the memory of the current process
– A count: how many elements make up the message
– A data-type: a description of the memory layout involved in the message

Page 7: Advanced MPI programming

The basic send

• Once this function returns, the data has been copied out of the user memory and the buffer can be reused

• This operation may require system buffers, in which case it will block until enough space is available

• The completion of the send does not say anything about the reception state of the message

MPI_Send(buf, count, type, dest, tag, communicator)

Page 8: Advanced MPI programming

The basic receive

• When this call returns, the data is located in the user memory

• Receiving fewer than count elements is permitted, but receiving more is an error

• The status contains information about the received message (useful with MPI_ANY_SOURCE and MPI_ANY_TAG)

MPI_Recv(buf, count, type, source, tag, communicator, status)

Page 9: Advanced MPI programming

The MPI_Status

• Structure in C (MPI_Status), array in Fortran (integer[MPI_STATUS_SIZE])

• Contains at least 3 fields (integers):
– MPI_TAG: the tag of the received message
– MPI_SOURCE: the rank of the source process in the corresponding communicator
– MPI_ERROR: the error raised by the receive (usually MPI_SUCCESS)
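
As a small illustration (not taken from the course codes), a minimal C program exchanging one message between ranks 0 and 1 could look as follows; the count and the tag are arbitrary choices:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf[4] = { 0, 1, 2, 3 };
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with at least 2 processes */

    if (rank == 0) {
        /* send 4 integers to rank 1 with tag 42 */
        MPI_Send(buf, 4, MPI_INT, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive them; the status carries the actual source and tag */
        MPI_Recv(buf, 4, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
        printf("got a message from %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}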

Page 10: Advanced MPI programming

Non Blocking Versions

• The memory pointed to by buf (and described by count and type) should not be touched until the communication is completed

• req is a handle to an MPI_Request which can be used to check the status of the communication …

MPI_Isend(buf, count, type, dest, tag, communicator, req)
MPI_Irecv(buf, count, type, source, tag, communicator, req)

Page 11: Advanced MPI programming

Testing for Completion

• Test or wait on the completion status of the request

• If the request is completed:
– Fill the status with the related information
– Release the request and replace it with MPI_REQUEST_NULL

Blocking: MPI_Wait(req, status)
Non-blocking: MPI_Test(req, flag, status)
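
A minimal sketch of this pattern (illustrative only: the buffer size, tag and placement of the computation are assumptions):

#include <mpi.h>

/* a sketch: post a nonblocking receive, do other work, then complete it */
void recv_with_overlap(MPI_Comm comm)
{
    double      buf[100];
    MPI_Request req;
    MPI_Status  status;
    int         flag = 0;

    MPI_Irecv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, 7, comm, &req);

    /* ... computation that does not touch buf ... */

    MPI_Test(&req, &flag, &status);   /* poll once, never blocks */
    if (!flag)
        MPI_Wait(&req, &status);      /* block until done; req is set to MPI_REQUEST_NULL */
}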

Page 12: Advanced MPI programming

Multiple Completions

• MPI_Waitany( count, array_of_requests, index, status )
• MPI_Testany( count, array_of_requests, index, flag, status )
• MPI_Waitall( count, array_of_requests, array_of_statuses )
• MPI_Testall( count, array_of_requests, flag, array_of_statuses )
• MPI_Waitsome( count, array_of_requests, outcount, array_of_indices, array_of_statuses )

Page 13: Advanced MPI programming

Synchronous Communications

• MPI_Ssend, MPI_Issend
– No restrictions …
– Does not complete until the corresponding receive has been posted, and the operation has been started
• It doesn’t mean the operation is completed
– Can be used instead of sending an ack
– Provides a simple way to avoid unexpected messages (i.e. no buffering on the receiver side)

Page 14: Advanced MPI programming

Buffered Communications

• MPI_Bsend, MPI_Ibsend
• MPI_Buffer_attach, MPI_Buffer_detach
– Buffered sends do not rely on system buffers
– The buffer is managed by MPI
– The buffer should be large enough to contain the largest message sent by one send
– The user cannot use the buffer as long as it is attached to the MPI library
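
A hedged sketch of the attach/send/detach sequence (the helper name and message size are made up for the example):

#include <mpi.h>
#include <stdlib.h>

/* a sketch of a buffered send: the user provides the buffer MPI copies into */
void buffered_send(const double *msg, int n, int dest, MPI_Comm comm)
{
    int   size;
    void *buffer;

    /* room for one message of n doubles plus the bsend overhead */
    MPI_Pack_size(n, MPI_DOUBLE, comm, &size);
    size += MPI_BSEND_OVERHEAD;
    buffer = malloc(size);

    MPI_Buffer_attach(buffer, size);
    MPI_Bsend((void *)msg, n, MPI_DOUBLE, dest, 0, comm); /* returns once copied into the buffer */
    MPI_Buffer_detach(&buffer, &size);                    /* blocks until the buffered messages are sent */
    free(buffer);
}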

Page 15: Advanced MPI programming

Ready Communications

• MPI_Rsend, MPI_Irsend
– Can ONLY be used if the matching receive has already been posted
– It’s the user’s responsibility to ensure the previous condition, otherwise the outcome is undefined
– Can be efficient in some situations as it avoids the rendezvous handshake
– Should be used with extreme caution

Page 16: Advanced MPI programming

Persistent Communications

• MPI_{B,R,S}send_init, MPI_Recv_init

• MPI_Start, MPI_Startall
– Reuse the same request for multiple communications
– Provide only a half-binding: the send is not connected to the receive
– A persistent send can be matched by any kind of receive
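
A hedged sketch of the set-up-once, start-many-times pattern (the tag, count and iteration structure are assumptions):

#include <mpi.h>

/* a sketch: set up a persistent send once, restart it every iteration */
void iterate_with_persistent_send(double *buf, int n, int dest, int niter, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Status  status;
    int         it;

    MPI_Send_init(buf, n, MPI_DOUBLE, dest, 99, comm, &req);  /* half-bound: only the send side is fixed */

    for (it = 0; it < niter; it++) {
        /* ... fill buf for this iteration ... */
        MPI_Start(&req);            /* activate the communication */
        MPI_Wait(&req, &status);    /* the request stays allocated and can be restarted */
    }
    MPI_Request_free(&req);         /* release the persistent request */
}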

Page 17: Advanced MPI programming

• Point-to-point communications
– 01-gs (simple)
– 02-nanopse (tuning)
– 03-summa (tuning)

• Groups and communicators
– 03-summa

• Data-types
– 04-lila

• Collectives
– MPI_OP: 04-lila

Page 18: Advanced MPI programming

01-gs

Page 19: Advanced MPI programming

Gram-Schmidt algorithm


The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications.

A = [A1; A2; A3],  Q = [Q1; Q2; Q3],  A = Q R

Input: A is block distributed by rows
Output: Q is block distributed by rows; R is global

Page 20: Advanced MPI programming

Example of applications: iterative methods.

a) in iterative methods with multiple right-hand sides (block iterative methods):

1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk).

2) BlockGMRES, BlockGCR, BlockCG, BlockQMR, …

b) in iterative methods with a single right-hand side

1) s-step methods for linear systems of equations (e.g. A. Chronopoulos),

2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc,

3) Recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).

c) in iterative eigenvalue solvers,

1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),

2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,

3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),

4) PRIMME (A. Stathopoulos, Coll. William & Mary ),

5) And also TRLAN, BLZPACK, IRBLEIGS.

Page 21: Advanced MPI programming

Example of applications:

a) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers),

b) in the QR factorization of large, dense, more nearly square matrices, where they are used as the panel factorization step, or more simply

c) in linear least squares problems in which the number of equations is much larger than the number of unknowns.

Page 22: Advanced MPI programming


The main characteristics of those three examples are that:

a) there is only one column of processors involved but several processor rows,

b) all the data is known from the beginning,

c) and the matrix is dense.

Page 23: Advanced MPI programming


Various methods already exist to perform the QR factorization of such matrices:

a) Gram-Schmidt (mgs(row), cgs),

b) Householder (qr2, qrf),

c) or CholeskyQR.

We present a new method:

Allreduce Householder (rhh_qr3, rhh_qrf).

Page 24: Advanced MPI programming

An example with modified Gram-Schmidt on a nonsingular m x 3 matrix: A = QR with Q^T Q = I3.

Q = A

r11 = || Q1 ||2 ;   Q1 = Q1 / r11

r12 = Q1^T Q2 ;   Q2 = Q2 – Q1 r12
r22 = || Q2 ||2 ;   Q2 = Q2 / r22

r13 = Q1^T Q3 ;   Q3 = Q3 – Q1 r13
r23 = Q2^T Q3 ;   Q3 = Q3 – Q2 r23
r33 = || Q3 ||2 ;   Q3 = Q3 / r33
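
When A is distributed by rows, every dot product and norm above is computed from local partial sums combined with an allreduce across the processors. A hedged C sketch of one column step of that idea (the variable names, the column-major layout and one MPI_Allreduce per inner product are assumptions, not the actual 01-gs code):

#include <mpi.h>
#include <math.h>

/* a sketch: orthogonalize column j of the local block Q (mloc x n, column-major,
   leading dimension lda) against the previous columns, MGS style */
void mgs_step(double *Q, int mloc, int lda, int j, double *R, int ldr, MPI_Comm comm)
{
    int i, k;

    for (k = 0; k < j; k++) {
        double rkj = 0.0, rkj_global;
        for (i = 0; i < mloc; i++)                 /* local part of Qk^T Qj */
            rkj += Q[i + k*lda] * Q[i + j*lda];
        MPI_Allreduce(&rkj, &rkj_global, 1, MPI_DOUBLE, MPI_SUM, comm);
        R[k + j*ldr] = rkj_global;
        for (i = 0; i < mloc; i++)                 /* Qj = Qj - Qk * rkj */
            Q[i + j*lda] -= Q[i + k*lda] * rkj_global;
    }

    double nrm2 = 0.0, nrm2_global;
    for (i = 0; i < mloc; i++)                     /* local part of || Qj ||^2 */
        nrm2 += Q[i + j*lda] * Q[i + j*lda];
    MPI_Allreduce(&nrm2, &nrm2_global, 1, MPI_DOUBLE, MPI_SUM, comm);
    R[j + j*ldr] = sqrt(nrm2_global);
    for (i = 0; i < mloc; i++)                     /* Qj = Qj / rjj */
        Q[i + j*lda] /= R[j + j*ldr];
}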

Page 25: Advanced MPI programming

Look at the codes

Page 26: Advanced MPI programming

The CholeskyQR Algorithm

SYRK:  C := A^T A        (mn^2)
CHOL:  R := chol( C )    (n^3/3)
TRSM:  Q := A \ R        (mn^2)

Page 27: Advanced MPI programming

Bibliography

• A. Stathopoulos and K. Wu, A block orthogonalization procedure with constant synchronization requirements, SIAM Journal on Scientific Computing, 23(6):2165-2182, 2002.

• Popularized by iterative eigensolver libraries:

1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),

2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,

3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),

4) PRIMME (A. Stathopoulos, Coll. William & Mary ).

Page 28: Advanced MPI programming

Parallel distributed CholeskyQR

The CholeskyQR method in the parallel distributed context can be described as follows:


1: SYRK:        C := A^T A        (mn^2)
2: MPI_Reduce:  C := sum of the Ci over the processors (on proc 0)
3: CHOL:        R := chol( C )    (n^3/3)
4: MPI_Bcast:   broadcast the R factor from proc 0 to all the other processors
5: TRSM:        Q := A \ R        (mn^2)

This method is extremely fast, for two reasons: first, there are only one or two communication phases; second, the local computations are performed with fast operations. Another advantage of this method is that the resulting code is exactly four lines, so the method is simple and relies heavily on other libraries. Despite all those advantages, this method is highly unstable.

Page 29: Advanced MPI programming

CholeskyQR - code

int choleskyqr_A_v0(int mloc, int n, double *A, int lda, double *R, int ldr,
                    MPI_Comm mpi_comm)
{
    int info;

    /* local Gram matrix, then global sum, Cholesky, and triangular solve */
    cblas_dsyrk( CblasColMajor, CblasUpper, CblasTrans, n, mloc,
                 1.0e+00, A, lda, 0e+00, R, ldr );
    MPI_Allreduce( MPI_IN_PLACE, R, n*n, MPI_DOUBLE, MPI_SUM, mpi_comm );
    lapack_dpotrf( lapack_upper, n, R, ldr, &info );
    cblas_dtrsm( CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                 CblasNonUnit, mloc, n, 1.0e+00, R, ldr, A, lda );

    return 0;
}

(And OK, you might want to add an MPI user defined datatype to send only the upper part of R)


Page 30: Advanced MPI programming

Let us look at the codes in 01-gs

Page 31: Advanced MPI programming

Efficient enough?

In this experiment, we fix the problem size: m = 100,000 and n = 50.

Performance in MFLOP/sec/proc (time in sec):

# of procs   cholqr          cgs             mgs(row)        qrf             mgs
 1           489.2 (1.02)    134.1 (3.73)    73.5 (6.81)     39.1 (12.78)    56.18 (8.90)
 2           467.3 (0.54)     78.9 (3.17)    39.0 (6.41)     22.3 (11.21)    31.21 (8.01)
 4           466.4 (0.27)     71.3 (1.75)    38.7 (3.23)     22.2 (5.63)     29.58 (4.23)
 8           434.0 (0.14)     67.4 (0.93)    36.7 (1.70)     20.8 (3.01)     21.15 (2.96)
16           359.2 (0.09)     54.2 (0.58)    31.6 (0.99)     18.3 (1.71)     14.44 (2.16)
32           197.8 (0.08)     41.9 (0.37)    29.0 (0.54)     15.8 (0.99)      8.38 (1.87)

[Plots: performance (MFLOP/sec/proc) and time (sec) versus # of procs for cholqr, cgs, mgs(row), qrf and mgs.]

Page 32: Advanced MPI programming

Stability

[Plots: || A – QR ||2 / || A ||2 and || I – Q^T Q ||2 versus κ2(A), for m = 100, n = 50.]

Page 33: Advanced MPI programming

02-nanopse

Page 34: Advanced MPI programming

Performance analysis tool in PESCAN

Optimization of codes through performance tools

Portability of codes while keeping good performance

Page 35: Advanced MPI programming

      ! pack the send buffer
      idum = 1
      do i = 1, nnodes
        do j = 1, ivunpn2(i)
          combuf1(idum) = psi(ivunp2(idum))
          idum = idum + 1
        enddo
      enddo
---------------------------------------------------------------------------------------
      ! nonblocking sends and receives, completed with a single waitall
      idum1 = 1
      idum2 = 1
      do i = 1, nnodes
        call mpi_isend(combuf1(idum1), ivunpn2(i), mpi_double_complex, &
                       i-1, inode, mpi_comm_world, ireq(2*i-1), ierr)
        idum1 = idum1 + ivunpn2(i)
        call mpi_irecv(combuf2(idum2), ivpacn2(i), mpi_double_complex, &
                       i-1, i, mpi_comm_world, ireq(2*i), ierr)
        idum2 = idum2 + ivpacn2(i)
      enddo

      call mpi_waitall( 2*nnodes, ireq, MPI_STATUSES_IGNORE, ierr )
---------------------------------------------------------------------------------------
      ! unpack the receive buffer
      idum = 1
      do i = 1, nnodes
        do j = 1, ivpacn2(i)
          psiy(ivpac2(idum)) = combuf2(idum)
          idum = idum + 1
        enddo
      enddo
--------------------------------------------------------------------------------------
      call mpi_barrier(mpi_comm_world, ierr)

      ! nonblocking sends, blocking receives, surrounded by barriers
      idum = 1
      do i = 1, nnodes
        call mpi_isend(combuf1(idum), ivunpn2(i), mpi_double_complex, i-1, &
                       inode, mpi_comm_world, ireq(i), ierr)
        idum = idum + ivunpn2(i)
      enddo

      idum = 1
      do i = 1, nnodes
        call mpi_recv(combuf2(idum), ivpacn2(i), mpi_double_complex, i-1, i, &
                      mpi_comm_world, mpistatus, ierr)
        idum = idum + ivpacn2(i)
      enddo

      do i = 1, nnodes
        call mpi_wait(ireq(i), mpistatus, ierr)
      end do

      call mpi_barrier(mpi_comm_world, ierr)
-------------------------------------------------------------------------------------

before (the second listing, with barriers and blocking receives) / after (the first listing, with nonblocking receives completed by a single waitall)

Page 36: Advanced MPI programming
Page 37: Advanced MPI programming

Original code: 13.2% of the overall time is spent at barriers.

Modified code: removing most of the barriers shifts part of the problem onto other synchronization points, but we observe a 6% improvement.

Page 38: Advanced MPI programming

• Lots of self-communication. After investigation, some copies can be avoided.

• Try different variants of asynchronous communication, removing barrier,….

Page 39: Advanced MPI programming

Cd 83 – Se 81 on 16 processors

             Original   Without me2me    Asynchronous receive,   Both
             code       communications   without barrier         modifications
Time (s)     18.23      15.42            15.29                   13.12
Ratio        1.00       0.85             0.83                    0.72

• In this example, we see that both modifications are useful and that they work well together for a small number of processors

Page 40: Advanced MPI programming

Comparison of different matrix-vector product implementations

ZB4096_Ecut30 (order = 2,156,241) – time for 50 matrix-vector products

[Plot: time (0 to 700 sec) on 16, 32, 64 and 128 processors for four variants: original, me2me, without barrier + asynchronous, and all three modifications.]

- me2me: 10% improvement on the matrix-vector products, and less buffer space used
- asynchronous receives are not a good idea here (this is why the original code does not use them)
- barriers are useful for large matrices (this is why the original code has them)

Page 41: Advanced MPI programming

General results

                                                Original (s)   Modified (s)   Improvement
83 Cd – 81 Se (~34K), 16 processors             18.23          13.23          28%
232 Cd – 235 Se (~75K), 16 processors           36.00          26.16          27%
534 Cd – 527 Se, 2x16 processors                50.50          38.78          23%
83 Cd – 81 Se – Ecut 30, 16 processors          156.33         123.89         21%
83 Cd – 81 Se – Ecut 30, 2x16 processors        103.50         69.22          34%

Page 42: Advanced MPI programming

(Same before/after code as shown on Page 35.)

This is just one way of doing it. A better solution: use (and have) tuned MPI global communications, here MPI_Alltoallv.
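
A hedged C sketch of that idea; the count and displacement arrays are assumed to play the roles of ivunpn2/ivpacn2 and their prefix sums in the Fortran code above:

#include <mpi.h>
#include <complex.h>

/* a sketch: the whole pack/isend/recv/unpack phase collapses into one tuned collective.
   sendcounts/recvcounts correspond to ivunpn2/ivpacn2; sdispls/rdispls are their
   prefix sums (assumed precomputed). */
void exchange_with_alltoallv(double complex *sendbuf, int *sendcounts, int *sdispls,
                             double complex *recvbuf, int *recvcounts, int *rdispls,
                             MPI_Comm comm)
{
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_C_DOUBLE_COMPLEX,
                  recvbuf, recvcounts, rdispls, MPI_C_DOUBLE_COMPLEX, comm);
}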

Page 43: Advanced MPI programming

03-summa

Page 44: Advanced MPI programming

First rule in linear algebra: have an efficient DGEMM

• All the dense linear algebra operations rely on an efficient DGEMM (matrix-matrix multiply)

• This is by far the easiest O(n^3) operation in dense linear algebra.
– So if we cannot implement DGEMM correctly (at peak performance), we will not be able to do much for the other operations.

Page 45: Advanced MPI programming

Blocked LU and QR algorithms (LAPACK)

LAPACK block LU (right-looking): dgetrf
  Panel factorization:               dgetf2
  Update of the remaining submatrix: dtrsm (+ dswp), dgemm

LAPACK block QR (right-looking): dgeqrf
  Panel factorization:               dgeqf2 + dlarft
  Update of the remaining submatrix: dlarfb

Page 46: Advanced MPI programming

Blocked LU and QR algorithms (LAPACK)

LAPACK block LU (right-looking): dgetrf

  Panel factorization (dgetf2): latency bound, more than nb AllReduces for n*nb^2 ops

  Update of the remaining submatrix (dtrsm + dswp, dgemm): CPU and bandwidth bound, the bulk of the computation (n*n*nb ops), highly parallelizable, efficient and scalable.

Page 47: Advanced MPI programming

Parallel Distributed MM algorithms

[Figure: C = α A·B + β C with A, B and C distributed on a 3x3 process grid; process (i,j) owns the blocks Aij, Bij and Cij.]

Page 48: Advanced MPI programming

Parallel Distributed MM algorithms

• Use the outer-product version of the matrix-matrix multiply algorithm:

C = C + A·B, computed as a sum of outer products of block columns of A with the corresponding block rows of B.

Page 49: Advanced MPI programming

Parallel Distributed MM algorithms

PDGEMM:

For k = 1:nb:n,
    C = C + A(:, k:k+nb-1) · B(k:k+nb-1, :)
End For

Page 50: Advanced MPI programming

Parallel Distributed MM algorithms

[Figure: the 3x3 process grid with the blocks Aij, Bij and Cij.]

Page 51: Advanced MPI programming

Parallel Distributed MM algorithms

[Figure: the 3x3 process grid, highlighting the blocks involved in the current step.]

Page 52: Advanced MPI programming

Parallel Distributed MM algorithms

1. Broadcast of size nb*nloc along the columns, root is active_row.

Page 53: Advanced MPI programming

Parallel Distributed MM algorithms

1. Broadcast of size nb*nloc along the columns, root is active_row.

2. Broadcast of size nloc*nb along the rows, root is active_col.

Page 54: Advanced MPI programming

Parallel Distributed MM algorithms

1. Broadcast of size nb*nloc along the columns, root is active_row.

2. Broadcast of size nloc*nb along the rows, root is active_col.

3. Perform the local matrix-matrix multiply; the number of FLOPS is nloc*nloc*nb.
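
A hedged C sketch of one such step, assuming row_comm/col_comm are the row and column communicators of the process grid (for instance built with MPI_Comm_split, see the Groups and Communicators part), that active_row/active_col are ranks within those communicators, and that the panels are stored column-major; this is illustrative, not the 03-summa code:

#include <mpi.h>
#include <cblas.h>

/* a sketch of one SUMMA step: the owners' panels are broadcast along the process
   columns/rows, then every process does a local rank-nb update of its C block.
   Apanel is nloc x nb, Bpanel is nb x nloc, C is nloc x nloc; on non-root
   processes the panels are simply receive buffers. */
void summa_step(int nloc, int nb, int active_row, int active_col,
                double *Apanel, double *Bpanel, double *C,
                MPI_Comm row_comm, MPI_Comm col_comm)
{
    /* 1. broadcast of size nb*nloc along the columns, root is active_row */
    MPI_Bcast(Bpanel, nb * nloc, MPI_DOUBLE, active_row, col_comm);

    /* 2. broadcast of size nloc*nb along the rows, root is active_col */
    MPI_Bcast(Apanel, nloc * nb, MPI_DOUBLE, active_col, row_comm);

    /* 3. local multiply: C += Apanel * Bpanel, nloc*nloc*nb flops */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                nloc, nloc, nb, 1.0, Apanel, nloc, Bpanel, nb, 1.0, C, nloc);
}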

Page 55: Advanced MPI programming

Model.

• γ is the time for one operation,
• β is the time to send one entry,
• various algorithms/models depending on the broadcast algorithm used (pipeline = SUMMA, tree = PUMMA, etc.).

Time for computation = (# of operations) * (time for one operation, γ) / (# of processors)

Time for communication = 2 * (broadcast among sqrt(p) processors of n^2/sqrt(p) entries)

Time with blocking communication = time for computation + time for communication

Time with nonblocking communication = max( time for computation , time for communication )

Page 56: Advanced MPI programming

Model.

• γ is the time for one operation,
• β is the time to send one entry,
• various algorithms/models depending on the broadcast algorithm used (pipeline = SUMMA, tree = PUMMA, etc.).

• For the TN cluster of PS3s:
– Bandwidth = 600 Mb/sec (with Gigabit Ethernet)
– Flop rate = 149.85 GFLOPs/sec

Time with blocking communication = time for computation + time for communication

Time with nonblocking communication = max( time for computation , time for communication )

Page 57: Advanced MPI programming

jacquard.nersc.gov

• Processor type: Opteron 2.2 GHz
• Processor theoretical peak: 4.4 GFlops/sec
• Number of application processors: 712
• System theoretical peak (computational nodes): 3.13 TFlops/sec
• Number of shared-memory application nodes: 356
• Processors per node: 2
• Physical memory per node: 6 GBytes
• Usable memory per node: 3-5 GBytes
• Switch interconnect: InfiniBand
• Switch MPI unidirectional latency: 4.5 μsec
• Switch MPI unidirectional bandwidth (peak): 620 MB/s
• Global shared disk: GPFS, usable disk space 30 TBytes
• Batch system: PBS Pro

Page 58: Advanced MPI programming

Mvapich vs FTMPI

[Plots: GFLOPs/sec/proc versus # of processors, comparing Mvapich and FTMPI.]

Page 59: Advanced MPI programming

Modeling.

Page 60: Advanced MPI programming

Performance model.

To get 90% efficiency with nonblocking operations, one needs to work on matrices of size 14,848. To get 90% efficiency with blocking operations, one needs to work on matrices of size 146,949.

It is really worth using nonblocking communication.

Take for instance the TN cluster of four PS3s:
Bandwidth = 600 Mb/sec (with Gigabit Ethernet)
Flop rate = 149.85 GFLOPs/sec (theoretical peak is 153.6 GFLOPs/sec)

Alfredo Buttari, Jakub Kurzak, and Jack Dongarra. Limitations of the PlayStation 3 for high performance cluster computing. Technical Report UT-CS-07-597, Innovative Computing Laboratory, University of Tennessee Knoxville, April 2007.

Page 61: Advanced MPI programming

Oops… the memory of a PS3 is 256 MB.

There is no way to hide the n^2 term (communication) behind the n^3 term (computation): n cannot get big enough. Actually, it is the computation that is hidden by the communication. Dense linear algebra is stuck; there is nothing to be done.

Alfredo Buttari, Jakub Kurzak, and Jack Dongarra. Limitations of the PlayStation 3 for high performance cluster computing. Technical Report UT-CS-07-597, Innovative Computing Laboratory, University of Tennessee Knoxville, April 2007.

Page 62: Advanced MPI programming

Three Solutions

Instead of 600 Mb/sec – 258 MB – 6 SPEs:

• Increase the memory of the nodes:               600 Mb/sec – 3.3 GB – 6 SPEs
• Increase the bandwidth of the network:          1.39 Gb/sec – 258 MB – 6 SPEs
• Decrease the computational power of the nodes:  600 Mb/sec – 258 MB – 2 SPEs

Three ways to complain:
• The network is too slow (complain to GigE)
• There is not enough memory on the nodes (complain to Sony)
• The nodes are too fast (complain to IBM)

Page 63: Advanced MPI programming

Groups and Communicators

Page 64: Advanced MPI programming

Groups and Communicators

• Break MPI_COMM_WORLD into smaller sets of processes that have a specific relationship

• Each communicator has a group of processes attached, which are all the processes that can be contacted using this communicator
– The processes are indexed in the group by their rank, which is contiguous and starts from 0

Page 65: Advanced MPI programming

Groups vs. Communicators

• The BIG difference
– Groups are local entities while communicators are global

[Figure: MPI_COMM_WORLD with ranks 0–7, divided into Communicator_1 and Communicator_2.]

Page 66: Advanced MPI programming

Operations on Groups

• Retrieve the rank and the size
• Translate the ranks from one group to another, compare 2 groups
• Constructors: create one group from another based on a defined relationship
– Creating a group is a local operation
– Once created, the group is not attached to any communicator (i.e. no communication is possible with it yet)

Page 67: Advanced MPI programming

Group Constructors

• Extract the group from a communicator (MPI_Comm_group)

• Union, intersection or difference of two other groups:
– Union: returns a group with all processes from group1 followed by the processes of group2 that are not in group1
– Intersection: contains all processes that are in both groups, ordered as in group1
– Difference: contains all processes that are in group1 but not in group2, ordered as in group1

Page 68: Advanced MPI programming

Example

• Let group1={a,b,c,d,e,f,g} and group2={d,g,a,c,h,i}

• Union(group1,group2): newgroup={a,b,c,d,e,f,g,h,i}

• Intersection(group1,group2): newgroup={a,c,d,g}

• Difference(group1,group2):newgroup={b,e,f}

• Union(group2,group1): newgroup={d,g,a,c,h,i,b,e,f}

• Intersection(group2,group1): newgroup={d,g,a,c}

• Difference(group2,group1):newgroup={h,i}

Page 69: Advanced MPI programming

Group Constructors

• Inclusion and exclusion, by element or by range: MPI_Group_*(group, n, ranks, newgroup)

• When by element, ranks is a simple array of integers indicating which ranks of the old group go into the new group

• When by range, ranks is an array of triples (int ranks[][3]) containing the start rank, the end rank and the stride

• For exclusion, the order of the processes in the resulting group is the same as in the original group; for inclusion, it follows the order given in ranks
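
A hedged sketch combining a range inclusion with MPI_Comm_group and MPI_Comm_create from the next slides (the even/odd split is just an example):

#include <mpi.h>

/* a sketch: build a communicator containing only the even-ranked processes */
void make_even_comm(MPI_Comm comm, MPI_Comm *evencomm)
{
    MPI_Group world_group, even_group;
    int size;
    int ranges[1][3];

    MPI_Comm_size(comm, &size);
    MPI_Comm_group(comm, &world_group);      /* extract the group (local operation) */

    ranges[0][0] = 0;                        /* first rank  */
    ranges[0][1] = size - 1;                 /* last rank   */
    ranges[0][2] = 2;                        /* stride: keep 0, 2, 4, ... */
    MPI_Group_range_incl(world_group, 1, ranges, &even_group);

    /* collective over comm: the odd ranks get MPI_COMM_NULL */
    MPI_Comm_create(comm, even_group, evencomm);

    MPI_Group_free(&world_group);
    MPI_Group_free(&even_group);
}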

Page 70: Advanced MPI programming

Operations on Communicators

• Retrieve the size and the rank
• Compare 2 communicators:
– MPI_IDENT: they are handles to the same object
– MPI_CONGRUENT: they have the same group (same processes in the same order)
– MPI_SIMILAR: they contain the same processes in a different order
– MPI_UNEQUAL: in all other cases

Page 71: Advanced MPI programming

Communicator Constructors

• MPI_Comm_dup(oldcomm, newcomm)
– the basic communicator duplication
– the newcomm has the same attributes as oldcomm

• MPI_Comm_create(oldcomm, group, newcomm)
– the newcomm contains only the processes in the group
– MPI_COMM_NULL is returned to all other processes
– 2 requirements: the group must be a subset of the processes of oldcomm, and all processes must use the same group

Page 72: Advanced MPI programming

Communicator Constructors

• MPI_Comm_split(oldcomm, color, key, newcomm)
– Creates as many groups and communicators as there are distinct values of color
– The rank in the new group is determined by the value of key; ties are broken according to the rank in oldcomm
– MPI_UNDEFINED can be used as the color for processes that should not be included in any of the new communicators

Page 73: Advanced MPI programming

Example

• MPI_Comm_split

rank    0  1  2  3  4  5  6  7  8  9  10
proc    a  b  c  d  e  f  g  h  i  j  k
color   U  3  1  1  3  7  3  3  1  U  3
key     0  1  2  3  1  9  3  8  1  0  0

Page 74: Advanced MPI programming

Example

• 3 new communicators are created
– Color = 3 : {b:1, e:1, g:3, h:8, k:0}
– Color = 1 : {c:2, d:3, i:1}
– Color = 7 : {f:9}

rank    0  1  2  3  4  5  6  7  8  9  10
proc    a  b  c  d  e  f  g  h  i  j  k
color   U  3  1  1  3  7  3  3  1  U  3
key     0  1  2  3  1  9  3  8  1  0  0

Page 75: Advanced MPI programming

Example

• 3 new communicators are created
– Color = 3 : {b:1, e:1, g:3, h:8, k:0} -> {k, b, e, g, h}
– Color = 1 : {c:2, d:3, i:1} -> {i, c, d}
– Color = 7 : {f:9} -> {f}

rank    0  1  2  3  4  5  6  7  8  9  10
proc    a  b  c  d  e  f  g  h  i  j  k
color   U  3  1  1  3  7  3  3  1  U  3
key     0  1  2  3  1  9  3  8  1  0  0
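
In practice, MPI_Comm_split is a natural way to build the row and column communicators used by an algorithm like SUMMA (03-summa). A hedged sketch, assuming the number of processes is q*q:

#include <mpi.h>

/* a sketch: split MPI_COMM_WORLD into row and column communicators
   for a q x q process grid */
void build_grid_comms(int q, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank, myrow, mycol;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    myrow = rank / q;
    mycol = rank % q;

    /* same color -> same new communicator; the key orders the ranks inside it */
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm);
}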

Page 76: Advanced MPI programming

03 - summa (pumma)

Page 77: Advanced MPI programming

Data-types

Page 78: Advanced MPI programming

Data Representation

• Different across different machines
– Length: 32 vs. 64 bits (vs. …?)
– Endianness: big vs. little

• Problems
– No standard for the data lengths in the programming languages (C/C++)
– No standard floating point data representation
• IEEE Standard 754 floating point numbers
– Subnormals, infinities, NaNs …
• Same representation but different lengths

Page 79: Advanced MPI programming

MPI Datatypes

• MPI uses “datatypes” to:
– Efficiently represent and transfer data
– Minimize memory usage
• Even between heterogeneous systems
– Used in most communication functions (MPI_SEND, MPI_RECV, etc.)
– And in file operations

• MPI contains a large number of pre-defined datatypes

Page 80: Advanced MPI programming

Some of MPI’s Pre-Defined Datatypes

MPI_Datatype          C datatype          Fortran datatype
MPI_CHAR              signed char         CHARACTER
MPI_SHORT             signed short int    INTEGER*2
MPI_INT               signed int          INTEGER
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float               REAL
MPI_DOUBLE            double              DOUBLE PRECISION
MPI_LONG_DOUBLE       long double         DOUBLE PRECISION*8

Page 81: Advanced MPI programming

Datatype Conversion

• “Data sent = data received”

• 2 types of conversions:
– Representation conversion: change the binary representation (e.g., hex floating point to IEEE floating point)
– Type conversion: convert between different types (e.g., int to float)

Only representation conversion is allowed

Page 82: Advanced MPI programming

Datatype Conversion

/* matching type signatures: correct */
if (my_rank == root)
    MPI_Send(msg, 1, MPI_INT, …)
else
    MPI_Recv(msg, 1, MPI_INT, …)

/* int sent but float received: a type conversion, not allowed */
if (my_rank == root)
    MPI_Send(msg, 1, MPI_INT, …)
else
    MPI_Recv(msg, 1, MPI_FLOAT, …)

Page 83: Advanced MPI programming

Datatype Specifications

• Type signature (used for message matching):
{ type0, type1, …, typen }

• Type map (used for local operations):
{ (type0, disp0), (type1, disp1), …, (typen, dispn) }

It’s all about the memory layout

Page 84: Advanced MPI programming

User-Defined Datatypes

• Applications can define unique datatypes
– Composition of other datatypes
– MPI functions provided for common patterns

• Contiguous

• Vector

• Indexed

• …

Always reduces to a type map of pre-defined datatypes

Page 85: Advanced MPI programming

Contiguous Blocks

• Replication of the datatype into contiguous locations.

MPI_Type_contiguous( 3, oldtype, newtype )

MPI_TYPE_CONTIGUOUS( count, oldtype, newtype )
  IN  count    replication count (positive integer)
  IN  oldtype  old datatype (MPI_Datatype handle)
  OUT newtype  new datatype (MPI_Datatype handle)

Page 86: Advanced MPI programming

Vectors

• Replication of a datatype into locations that consist of equally spaced blocks

MPI_Type_vector( 7, 2, 3, oldtype, newtype )

MPI_TYPE_VECTOR( count, blocklength, stride, oldtype, newtype )
  IN  count        number of blocks (positive integer)
  IN  blocklength  number of elements in each block (positive integer)
  IN  stride       number of elements between the start of each block (integer)
  IN  oldtype      old datatype (MPI_Datatype handle)
  OUT newtype      new datatype (MPI_Datatype handle)
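
For example, a vector type is the natural way to describe one column of a matrix stored row-major in C; a hedged sketch (M, N and the row-major storage are assumptions made for the example):

#include <mpi.h>

/* a sketch: one column of an M x N row-major matrix of doubles,
   i.e. M blocks of 1 element separated by a stride of N elements */
void make_column_type(int M, int N, MPI_Datatype *coltype)
{
    MPI_Type_vector(M, 1, N, MPI_DOUBLE, coltype);
    MPI_Type_commit(coltype);   /* required before the type is used in a communication */
}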

Page 87: Advanced MPI programming

Indexed Blocks

• Replication of an old datatype into a sequence of blocks, where each block can contain a different number of copies and have a different displacement

MPI_TYPE_INDEXED( count, array_of_blocklengths, array_of_displs, oldtype, newtype )
  IN  count    number of blocks (positive integer)
  IN  a_of_b   number of elements per block (array of positive integers)
  IN  a_of_d   displacement of each block from the beginning, in multiples of oldtype (array of integers)
  IN  oldtype  old datatype (MPI_Datatype handle)
  OUT newtype  new datatype (MPI_Datatype handle)

Page 88: Advanced MPI programming

Indexed Blocks

array_of_blocklengths[] = { 2, 3, 1, 2, 2, 2 }
array_of_displs[]       = { 0, 3, 10, 13, 16, 19 }

MPI_Type_indexed( 6, array_of_blocklengths, array_of_displs, oldtype, newtype )

[Figure: the resulting layout, blocks B[0]..B[5] of the given lengths placed at displacements D[0]..D[5].]

Page 89: Advanced MPI programming

Datatype Composition

• Each of these functions is a superset of the previous one

CONTIGUOUS < VECTOR < INDEXED

• They extend the description of the datatype by allowing more complex memory layouts
– Not all data structures fit the common patterns
– Not all data structures can be described as compositions of others

Page 90: Advanced MPI programming

“H” Functions

• The displacement is no longer a multiple of another datatype

• Instead, the displacement is in bytes
– MPI_TYPE_HVECTOR
– MPI_TYPE_HINDEXED

• Otherwise, similar to their non-“H” counterparts

Page 91: Advanced MPI programming

Arbitrary Structures

• The most general datatype constructor

• Allows each block to consist of replications of a different datatype

MPI_TYPE_CREATE_STRUCT( count, array_of_blocklengths, array_of_displs, array_of_types, newtype )
  IN  count    number of entries in each array (positive integer)
  IN  a_of_b   number of elements in each block (array of integers)
  IN  a_of_d   byte displacement of each block (array of Aint)
  IN  a_of_t   type of the elements in each block (array of MPI_Datatype handles)
  OUT newtype  new datatype (MPI_Datatype handle)

Page 92: Advanced MPI programming

Arbitrary Structures

struct {
    int   i[3];
    float f[2];
} array[100];

array_of_lengths[] = { 2, 1 };
array_of_displs[]  = { 0, 3*sizeof(int) };
array_of_types[]   = { MPI_INT, MPI_FLOAT };

MPI_Type_struct( 2, array_of_lengths, array_of_displs, array_of_types, newtype );

Memory layout of one element:  int int int float float
  length[0] = 2 ints starting at displs[0] = 0
  length[1] = 1 float starting at displs[1] = 3*sizeof(int)

Page 93: Advanced MPI programming

Arbitrary Structures

array_of_lengths[] = { 2, 1 };
array_of_displs[]  = { 0, 3*sizeof(int) };
array_of_types[]   = { MPI_INT, MPI_FLOAT };

Memory layout of one element:  int int int float float

Page 94: Advanced MPI programming

Arbitrary Structures

array_of_lengths[] = { 2, 1 };
array_of_displs[]  = { 0, 3*sizeof(int) };
array_of_types[]   = { MPI_INT, MPI_FLOAT };

Real data description: 2 ints at displacement 0 and 1 float at displacement 3*sizeof(int) in each struct element.

Page 95: Advanced MPI programming

MPI_GET_ADDRESS

• Allows all languages to compute displacements
– Necessary in Fortran
– Usually unnecessary in C (e.g., “&foo”)

MPI_GET_ADDRESS( location, address )
  IN  location  location in the caller memory (choice)
  OUT address   address of the location (address-sized integer)

Page 96: Advanced MPI programming

And Now the Dark Side…

• Sometimes a more complex memory layout has to be expressed

[Figure: a buffer of elements where only the “interesting part” of each element is described by the datatype ddt; the extent of each element controls where the next one starts, so MPI_Send( buf, 3, ddt, … ) walks over three of them.]

Page 97: Advanced MPI programming

Lower-Bound and Upper-Bound Markers

• Define datatypes with “holes” at the beginning or at the end

• 2 pseudo-types: MPI_LB and MPI_UB
– Used with MPI_TYPE_STRUCT

Typemap = { (type0, disp0), …, (typen, dispn) }

lb(Typemap) = min_j disp_j                               if no entry has type lb
            = min_j { disp_j such that type_j = lb }     otherwise

ub(Typemap) = max_j ( disp_j + sizeof(type_j) ) + align  if no entry has type ub
            = max_j { disp_j such that type_j = ub }     otherwise

Page 98: Advanced MPI programming

MPI_LB and MPI_UB

displs       = ( -3, 0, 6 )
blocklengths = ( 1, 1, 1 )
types        = ( MPI_LB, MPI_INT, MPI_UB )

MPI_Type_struct( 3, blocklengths, displs, types, type1 )
  Typemap = { (lb, -3), (int, 0), (ub, 6) }

MPI_Type_contiguous( 3, type1, type2 )
  Typemap = { (lb, -3), (int, 0), (int, 9), (int, 18), (ub, 24) }

Page 99: Advanced MPI programming

True Lower-Bound and True Upper-Bound Markers

• Define the real extent of the datatype: the amount of memory needed to copy the datatype

• TRUE_LB defines the lower bound ignoring all the MPI_LB markers (and TRUE_UB the upper bound ignoring all the MPI_UB markers)

Typemap = { (type0, disp0), …, (typen, dispn) }

true_lb(Typemap) = min_j { disp_j : type_j != lb }
true_ub(Typemap) = max_j { disp_j + sizeof(type_j) : type_j != ub }

Page 100: Advanced MPI programming

Information About Datatypes

MPI_TYPE_GET_{TRUE_}EXTENT( datatype, {true_}lb, {true_}extent )
  IN  datatype        the datatype (MPI_Datatype handle)
  OUT {true_}lb       {true} lower bound of the datatype (MPI_Aint)
  OUT {true_}extent   {true} extent of the datatype (MPI_Aint)

MPI_TYPE_SIZE( datatype, size )
  IN  datatype  the datatype (MPI_Datatype handle)
  OUT size      datatype size (integer)

[Figure: the size, true extent and extent of a datatype.]

Page 101: Advanced MPI programming

Test your Data-type skills

• Imagine the following architecture:
– Integer size is 4 bytes
– Cache line is 16 bytes

• We want to create a datatype containing the second integer from each cache line, repeated three times

• How many ways are there?

Page 102: Advanced MPI programming

Solution 1

MPI_Datatype array_of_types[] = { MPI_INT, MPI_INT, MPI_INT, MPI_UB };
MPI_Aint start, array_of_displs[] = { 0, 0, 0, 0 };
int array_of_lengths[] = { 1, 1, 1, 1 };
struct one_by_cacheline c[4];

MPI_Get_address( &c[0], &(start) );
MPI_Get_address( &c[0].int[1], &(array_of_displs[0]) );
MPI_Get_address( &c[1].int[1], &(array_of_displs[1]) );
MPI_Get_address( &c[2].int[1], &(array_of_displs[2]) );
MPI_Get_address( &c[3], &(array_of_displs[3]) );

for( i = 0; i < 4; i++ )
    array_of_displs[i] -= start;

MPI_Type_create_struct( 4, array_of_lengths, array_of_displs, array_of_types, newtype )

Page 103: Advanced MPI programming

Solution 2

MPI_Datatype array_of_types[] = { MPI_INT, MPI_UB };
MPI_Aint start, array_of_displs[] = { 4, 16 };
int array_of_lengths[] = { 1, 1 };
struct one_by_cacheline c[2];

MPI_Get_address( &c[0], &(start) );
MPI_Get_address( &c[0].int[1], &(array_of_displs[0]) );
MPI_Get_address( &c[1], &(array_of_displs[1]) );

array_of_displs[0] -= start;
array_of_displs[1] -= start;

MPI_Type_create_struct( 2, array_of_lengths, array_of_displs, array_of_types, temp_type )
MPI_Type_contiguous( 3, temp_type, newtype )

Page 104: Advanced MPI programming

Data-type for triangular matrices

Application Cholesky QR

Page 105: Advanced MPI programming

EXAMPLE OF CONSTRUCTION OF A DATATYPE FOR TRIANGULAR MATRICES, EXAMPLE OF AN MPI_OP ON TRIANGULAR MATRICES

• See:
– choleskyqr_A_v1.c
– choleskyqr_B_v1.c
– LILA_mpiop_sum_upper.c

• Starting from choleskyqr_A_v0.c, this
– shows how to construct a datatype for a triangular matrix
– shows how to use an MPI_OP on that datatype for an Allreduce operation
– here we simply want to sum the upper triangular matrices together

Page 106: Advanced MPI programming

TRICK FOR TRIANGULAR MATRICES DATATYPES

• See:
– check_orthogonality_RFP.c
– choleskyqr_A_v2.c
– choleskyqr_A_v3.c
– choleskyqr_B_v2.c
– choleskyqr_B_v3.c

• A trick that uses the RFP format to do a fast allreduce on P triangular matrices without datatypes. The trick is at the user level.

Page 107: Advanced MPI programming

Idea behind RFP

• Rectangular full packed format

• Just be careful with the odd and even matrix dimension cases

Page 108: Advanced MPI programming

Collective Communications

Page 109: Advanced MPI programming

Collective Communications

• Involves all processes in a communicator
– All processes must participate
– May be a subset of all running processes
– May be more than what you started with

• Blocking (logical) semantics only
– BUT: some processes may not block
– Except barrier: all processes block until each process has reached the barrier

Page 110: Advanced MPI programming

Operations

• MPI defines several collective operations
– Some are rooted (e.g., broadcast)
– Others are rootless (e.g., barrier)

• “Collectives” generally refers to data-passing collective operations
– Although technically it also refers to any action in MPI where all processes in a communicator must participate
– Example: communicator maintenance

Page 111: Advanced MPI programming

Barrier Synchronization

• Logical operation
– All processes block until each has reached the barrier
– The official MPI synchronization call

MPI_Barrier( comm )

Page 112: Advanced MPI programming

Broadcast

• Logical operation
– Send data from one process (the “root”) to all the others

• A broadcast is not a synchronization (!!!)

MPI_Bcast(buffer, cnt, type, root, comm )

Page 113: Advanced MPI programming

Gather

• Logical operation
– Obtain data from each process and assemble it at a root process
– The receive arguments are only meaningful at the root
– Each process sends the same amount of data
– The root can use MPI_IN_PLACE

MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm)

Page 114: Advanced MPI programming

Gatherv

• The vector variant of the gather operation

• Each process participates with a different amount of data

• Allows the root to specify where the data goes

• No overwrite is allowed (!!!)

MPI_Gatherv(sendbuf, sendcnt, sendtype, recvbuf, recvcnts, displs, recvtype, root, comm)
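
A hedged sketch of the bookkeeping on the root (the per-rank contribution of rank+1 integers is just an example):

#include <mpi.h>
#include <stdlib.h>

/* a sketch: every rank contributes (rank+1) integers; the root lays them out
   contiguously using recvcounts and displs */
void gather_variable(MPI_Comm comm, int root)
{
    int rank, size, i;
    int *sendbuf, *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int mycount = rank + 1;
    sendbuf = malloc(mycount * sizeof(int));
    for (i = 0; i < mycount; i++) sendbuf[i] = rank;

    if (rank == root) {
        recvcounts = malloc(size * sizeof(int));
        displs     = malloc(size * sizeof(int));
        int total = 0;
        for (i = 0; i < size; i++) {
            recvcounts[i] = i + 1;     /* how much each rank sends      */
            displs[i]     = total;     /* where it lands in recvbuf     */
            total        += recvcounts[i];
        }
        recvbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, mycount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, root, comm);

    free(sendbuf);
    if (rank == root) { free(recvcounts); free(displs); free(recvbuf); }
}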

Page 115: Advanced MPI programming

Scatter

• Logical operation
– Opposite of gather
– Send a portion of the root’s buffer to each process
– The root can use MPI_IN_PLACE for the receive buffer

MPI_Scatter(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm)

Page 116: Advanced MPI programming

Scatterv

• The logical extension of the scatter
– No portion of sendbuf can be sent more than once
– The root can use MPI_IN_PLACE for the receive buffer

MPI_Scatterv(sendbuf, sendcnts, displs, sendtype, recvbuf, recvcnt, recvtype, root, comm)

Page 117: Advanced MPI programming

All Gather

• Same as gather except that all processes get the full result

• MPI_IN_PLACE can be used on all processes instead of the sendbuf

• Equivalent to a gather followed by a broadcast

• There is a v version (MPI_Allgatherv)

MPI_Allgather(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, comm)

Page 118: Advanced MPI programming

All to All

• Logical operation
– Combined scatter and gather
– Not an all-broadcast

• Uniform and vector versions defined

MPI_Alltoall(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, comm)

Page 119: Advanced MPI programming

Global Reduction

• Logical operation
– Mathematical reduction

• Pre-defined MPI operations
– Min, max, sum, …
– Always commutative and associative

• User-defined operations

• User-defined operations

MPI_Reduce(sendbuf, recvbuf, count, type, op, root, comm)

Page 120: Advanced MPI programming

All Reduce

• Logical operation
– Reduce where all processes get the result
– Similar to a reduce followed by a broadcast

MPI_Allreduce(sendbuf, recvbuf, count, type, op, comm)

Page 121: Advanced MPI programming

Reduce and Scatter

• Logical operation
– Global reduction
– Scatterv of the result

MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, type, op, comm)

Page 122: Advanced MPI programming

Scan

• Logical operation
– Mathematical (prefix) scan
– Both inclusive (MPI_Scan) and exclusive (MPI_Exscan) versions exist

• All processes get a result
– Except process 0 in an exclusive scan

MPI_Scan(sendbuf, recvbuf, count, type, op, comm)

Page 123: Advanced MPI programming

User defined MPI_Op

• MPI_Op_create( function, commute, op )
• MPI_Op_free( op )
• If commute is true, the operation is assumed to be commutative
• function is a user-defined function with 4 arguments:
– invec: the input vector
– inoutvec: the input and output vector
– count: the number of elements
– datatype: the data-type description
• Result:
– inoutvec[i] = invec[i] op inoutvec[i] for i in [0..count-1]

Page 124: Advanced MPI programming

User defined MPI_Op

04 - lila

MPI_OP to compute || x || (for Gram-Schmidt)

Weirdest MPI_OP ever: motivation & results

Weirdest MPI_OP ever: how to attach attributes to a datatype