Introduction to MPI Programming
(Part II)
Michael Griffiths, Deniz Savas & Alan Real
January 2006
Overview
Review of point-to-point communications
Data types
Data packing
Collective communication:
Broadcast, scatter & gather of data
Reduction operations
Barrier synchronisation
Blocking operations
Relate to when the operation has completed:
Only return from the subroutine call when the operation has completed.

Non-blocking operations
Return straight away and allow the sub-program to continue performing other work.
At some later time the sub-program should test or wait for the completion of the non-blocking operation.
A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation.
Non-blocking operations are not the same as sequential subroutine calls, as the operation continues after the call has returned.
Sender modes

Sender mode                 MPI Call (F/C)   Completion status
Synchronous send            MPI_Ssend        Only completes when the receive has completed.
Buffered send               MPI_Bsend        Always completes (unless an error occurs), irrespective of receiver.
Standard send               MPI_Send         Can be synchronous or buffered (often implementation dependent).
Ready send                  MPI_Rsend        Always completes (unless an error occurs), irrespective of whether the receive has completed.
Non-blocking send           MPI_Isend        Begins a non-blocking send; always completes (unless an error occurs), irrespective of receiver.
Blocking send and receive   MPI_Sendrecv     Completes when a message has arrived and been received by the paired process.
Non-blocking communication
Separate communication into three phases:
1. Initiate non-blocking communication.
2. Do some work, perhaps involving other communications.
3. Wait for the non-blocking communication to complete.
Non-blocking send
Send is initiated and returns straight away.
Sending process can do other things.
Can test later whether the operation has completed.
[Figure: five processes (ranks 0–4) in MPI_COMM_WORLD; one rank posts a send request, the receiving rank posts a receive, and the sender later waits for completion.]
Non-blocking receive
Receive is initiated and returns straight away.
Receiving process can do other things.
Can test later whether the operation has completed.
[Figure: processes in MPI_COMM_WORLD; one rank posts a receive request, the sending rank sends, and the receiver later waits for completion.]
The Request Handle
Non-blocking calls take the same arguments as their blocking counterparts, plus an additional request handle:
In C/C++ it is of type MPI_Request / MPI::Request; in Fortran it is an INTEGER.
The request handle is allocated when a communication is initiated.
It can be queried to test whether the non-blocking operation has completed.
Non-blocking synchronous send
Fortran:
CALL MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, error)
CALL MPI_WAIT(request, status, error)

C:
MPI_Issend(&buf, count, datatype, dest, tag, comm, &request);
MPI_Wait(&request, &status);

C++:
request = comm.Issend(&buf, count, datatype, dest, tag);
request.Wait();
Non-blocking synchronous receive
Fortran:
CALL MPI_IRECV(buf, count, datatype, src, tag, comm, request, error)
CALL MPI_WAIT(request, status, error)

C:
MPI_Irecv(&buf, count, datatype, src, tag, comm, &request);
MPI_Wait(&request, &status);

C++:
request = comm.Irecv(&buf, count, datatype, src, tag);
request.Wait(status);
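Putting the two together, here is a minimal sketch in C (the ranks, tag and message length are illustrative, not from the course): rank 0 starts a non-blocking synchronous send to rank 1, both ranks are free to do other work while the transfer is in flight, and both then wait for completion. Note the buffer must not be touched between initiation and the wait.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, buf[10];
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 10; i++) buf[i] = i;
        MPI_Issend(buf, 10, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* ... other work, as long as it does not modify buf ... */
        MPI_Wait(&request, &status);      /* send now complete */
    } else if (rank == 1) {
        MPI_Irecv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... other work, as long as it does not read buf ... */
        MPI_Wait(&request, &status);      /* receive now complete */
        printf("rank 1 received %d..%d\n", buf[0], buf[9]);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g. mpirun -np 2).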
Blocking v Non-blocking
Send and receive can be blocking or non-blocking.
A blocking send can be used with a non-blocking receive, and vice versa.
Non-blocking sends can use any mode:
Synchronous mode affects completion, not initiation.
A non-blocking call followed by an explicit wait is identical to the corresponding blocking communication.
Operation          MPI call (Fortran/C)   C++
Standard send      MPI_Send(…)            Comm.Send(…)
Synchronous send   MPI_Ssend(…)           Comm.Ssend(…)
Buffered send      MPI_Bsend(…)           Comm.Bsend(…)
Ready send         MPI_Rsend(…)           Comm.Rsend(…)
Receive            MPI_Recv(…)            Comm.Recv(…)
Completion
Can either wait or test for completion:

Fortran (LOGICAL flag):
CALL MPI_WAIT(request, status, ierror)
CALL MPI_TEST(request, flag, status, ierror)

C (int flag):
MPI_Wait(&request, &status);
MPI_Test(&request, &flag, &status);

C++ (bool flag):
request.Wait();                 (for sends)
flag = request.Test();
request.Wait(status);           (for receives)
flag = request.Test(status);
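To illustrate the test variant, a small sketch in C (an assumed scenario, not from the course): rank 1 polls its receive with MPI_Test, counting polls as a stand-in for useful work done while the message is in flight.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, flag = 0, polls = 0, value = 42;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        while (!flag) {
            MPI_Test(&request, &flag, &status);  /* non-blocking check */
            if (!flag)
                polls++;                         /* stand-in for real work */
        }
        printf("rank 1 received %d after %d polls\n", value, polls);
    }

    MPI_Finalize();
    return 0;
}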
Other related wait and test routines
If multiple non-blocking calls are issued:
MPI_TESTANY: tests whether any one of a list of requests (they can be send or receive requests) has completed.
MPI_WAITANY: waits until any one of the list of requests has completed.
MPI_TESTALL: tests whether all the requests in a list have completed.
MPI_WAITALL: waits until all the requests in a list have completed.
MPI_PROBE, MPI_IPROBE: allow incoming messages to be checked for without actually receiving them. Note that MPI_PROBE is blocking: it waits until there is something to probe for.
MPI_CANCEL: cancels pending communication. A last-resort clean-up operation!
These routines take an array of requests and can return an array of statuses.
The 'any' routines return the index of the completed operation.
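As an illustration of MPI_WAITALL, a sketch in C (message contents and the MAXP bound are illustrative): rank 0 posts one non-blocking receive per other rank, then completes them all in a single call.

#include <mpi.h>
#include <stdio.h>

#define MAXP 64   /* assumed upper bound on process count */

int main(int argc, char *argv[])
{
    int rank, size, msg, bufs[MAXP];
    MPI_Request reqs[MAXP];
    MPI_Status stats[MAXP];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int i = 1; i < size; i++)   /* one receive per other rank */
            MPI_Irecv(&bufs[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[i - 1]);
        MPI_Waitall(size - 1, reqs, stats);   /* block until all complete */
        for (int i = 1; i < size; i++)
            printf("got %d from rank %d\n", bufs[i], i);
    } else {
        msg = rank * 10;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}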
Merging send and receive operations into a single unit
The following is the syntax of the MPI_Sendrecv call:

In C:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

In Fortran:
<sendtype> sendbuf(:)
<recvtype> recvbuf(:)
INTEGER sendcount, sendtype, dest, sendtag, recvcount, recvtype
INTEGER source, recvtag, comm, status(MPI_STATUS_SIZE), ierror
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror)
Important Notes about MPI_Sendrecv
Beware! A message sent by MPI_Sendrecv is receivable by a regular receive operation if the destination and tag match.
MPI_PROC_NULL can be specified for the destination or source to allow one-directional working (useful for the end nodes in non-circular communication). Any communication with MPI_PROC_NULL returns immediately, with no effect, as if the operation had been successful. This can make programming easier.
The send and receive buffers must not overlap; they must be separate memory locations. This restriction can be avoided by using the MPI_Sendrecv_replace routine.
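A sketch of the one-directional pattern described above (C, with illustrative values): each rank passes its rank number to the right-hand neighbour while receiving from the left; the end ranks use MPI_PROC_NULL so the same MPI_Sendrecv call works on every process.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, sendval, recvval = -1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* end ranks talk to MPI_PROC_NULL: those transfers complete
       immediately with no effect */
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    sendval = rank;
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &status);

    printf("rank %d received %d\n", rank, recvval);  /* stays -1 on rank 0 */
    MPI_Finalize();
    return 0;
}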
Data Packing
Up until now we have only seen contiguous data of pre-defined datatypes being communicated by MPI calls. This can be rather restricting if what we intend to transfer is a structure made up of a mixture of primitive datatypes, such as an integer count followed by a sequence of real numbers. One solution to this problem is to use the MPI_PACK and MPI_UNPACK routines. The philosophy is similar to Fortran READ/WRITE to and from internal buffers, and to the sscanf function in C.

MPI_PACK can be called consecutively to compress data into a send buffer; the resulting buffer of data can then be sent using MPI_SEND (or equivalent) with the datatype set to MPI_PACKED.

At the receiving end it can be received using MPI_RECV with the datatype MPI_PACKED. The received data can then be unpacked using MPI_UNPACK to recover the originally packed data. This way of working can also improve communications efficiency by reducing the number of send/receive calls: there are usually fixed overheads in setting up communications, which cause inefficiencies if the individual messages are too small.
MPI_Pack
Fortran:
<type> INBUF(:), OUTBUF(:)
INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)

C:
int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm)

Packs the message in inbuf (of type datatype, length incount) and stores it in outbuf. outsize is the maximum length of outbuf in bytes, rather than its actual size.

On entry, position indicates the starting location in outbuf where data will be written. On exit, position points to the first free position in outbuf following the packed message. It can then be used directly as the position parameter for the next MPI_PACK call.
MPI_Unpack
Fortran:
<type> INBUF(:), OUTBUF(:)
INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)

C:
int MPI_Unpack(void *inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)

Unpacks the message in inbuf as data of type datatype and length outcount, and stores it in outbuf.

On entry, position indicates the starting location in inbuf where data will be read from. On exit, position points to the first position of the next set of data in inbuf. It can then be used directly as the position parameter for the next MPI_UNPACK call.
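Putting MPI_Pack and MPI_Unpack together, a sketch in C (the mixed int-plus-doubles payload and the 100-byte buffer are illustrative): rank 0 packs an integer count and three doubles into one buffer, sends it as MPI_PACKED, and rank 1 unpacks it in the same order.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, n = 3, position = 0;
    double vals[3] = {1.0, 2.0, 3.0};
    char buffer[100];                /* assumed large enough */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* consecutive packs advance position automatically */
        MPI_Pack(&n, 1, MPI_INT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(vals, 3, MPI_DOUBLE, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer, 100, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* unpack in the same order as packed */
        MPI_Unpack(buffer, 100, &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, vals, 3, MPI_DOUBLE, MPI_COMM_WORLD);
        printf("rank 1 unpacked n=%d, vals = %g %g %g\n", n, vals[0], vals[1], vals[2]);
    }

    MPI_Finalize();
    return 0;
}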
Collective Communication
Overview
Introduction & characteristics
Barrier synchronisation
Global reduction operations
Predefined operations
Broadcast
Scatter
Gather
Partial sums
Exercise
Collective communications
Are higher-level routines involving several processes at a time.
Can be built out of point-to-point communications.
Examples are:
Barriers
Broadcast
Reduction operations
Collective Communication
Communications involving a group of processes.
Called by all processes in a communicator.
Examples:
Broadcast, scatter, gather (data distribution)
Global sum, global maximum, etc. (reduction operations)
Barrier synchronisation

Characteristics:
Collective communication will not interfere with point-to-point communication, and vice versa.
All processes must call the collective routine.
Synchronisation is not guaranteed (except for barrier).
No non-blocking collective communication.
No tags.
Receive buffers must be exactly the right size.
Collective Communications (one for all, all for one!)
Collective communication is defined as communication that involves all the processes in a group. Collective communication routines can be divided into the following broad categories:
Barrier synchronisation
Broadcast from one to all
Scatter from one to all
Gather from all to one
Scatter/gather from all to all
Global reduction (distributed elementary operations)
IMPORTANT NOTE: collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other. This is important to avoid deadlocks due to interference.
Barrier Synchronization
[Figure: seven processes progress along a time axis until they reach a BARRIER statement.]
Here, seven processes are running and three of them are waiting idle at the barrier statement for the other four to catch up.
Graphic Representations of Collective Communication Types
[Figure: schematic of collective communication types on a processes-versus-data grid, showing BROADCAST, SCATTER, GATHER, ALLGATHER and ALLTOALL with data items A–E distributed across five processes.]
Barrier Synchronisation
Each process in the communicator waits at the barrier until all processes have reached the barrier.

Fortran:
INTEGER comm, error
CALL MPI_BARRIER(comm, error)

C:
MPI_Barrier(MPI_Comm comm);

C++:
Comm.Barrier();
E.g.:
MPI::COMM_WORLD.Barrier();
Global reduction operations
Used to compute a result involving data distributed over a group of processes:
Global sum or product
Global maximum or minimum
Global user-defined operation
Predefined operations
Function               MPI name (F/C)   MPI name (C++)
Maximum                MPI_MAX          MPI::MAX
Minimum                MPI_MIN          MPI::MIN
Sum                    MPI_SUM          MPI::SUM
Product                MPI_PROD         MPI::PROD
Logical AND            MPI_LAND         MPI::LAND
Bitwise AND            MPI_BAND         MPI::BAND
Logical OR             MPI_LOR          MPI::LOR
Bitwise OR             MPI_BOR          MPI::BOR
Logical exclusive OR   MPI_LXOR         MPI::LXOR
Bitwise exclusive OR   MPI_BXOR         MPI::BXOR
Maximum and location   MPI_MAXLOC       MPI::MAXLOC
Minimum and location   MPI_MINLOC       MPI::MINLOC
MPI_Reduce
Performs count operations (o) on the individual elements of sendbuf across processes; the results are left on the root process.
[Figure: ranks 0–2 hold (A B C), (D E F), (G H I); after MPI_REDUCE the root holds the element-wise results (AoDoG, BoEoH, CoFoI).]
MPI_Reduce syntax
Fortran:
INTEGER count, rtype, op, root, comm, error
CALL MPI_REDUCE(sbuf, rbuf, count, rtype, op, root, comm, error)

C:
MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

C++:
Comm.Reduce(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op, int root);
MPI_Reduce example
Integer global sum:

Fortran:
INTEGER x, result, error
CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, error)

C:
int x, result;
MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

C++:
int x, result;
MPI::COMM_WORLD.Reduce(&x, &result, 1, MPI::INT, MPI::SUM, 0);
MPI_Allreduce
No root process.
All processes get the results of the reduction operation.
[Figure: ranks 0–2 hold (A B C), (D E F), (G H I); after MPI_ALLREDUCE every rank holds the element-wise results (AoDoG, BoEoH, CoFoI).]
MPI_Allreduce syntax

Fortran:
INTEGER count, rtype, op, comm, error
CALL MPI_ALLREDUCE(sbuf, rbuf, count, rtype, op, comm, error)

C:
MPI_Allreduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

C++:
Comm.Allreduce(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op);
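Mirroring the MPI_Reduce example, a sketch in C where every rank needs the result (each rank's contribution is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, x, gmax;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    x = rank;   /* each rank contributes its own rank number */
    MPI_Allreduce(&x, &gmax, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    /* unlike MPI_Reduce, every rank now holds the result */
    printf("rank %d sees global max %d (expected %d)\n", rank, gmax, size - 1);

    MPI_Finalize();
    return 0;
}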
Practice Session 3
Using reduction operations.
This example uses the continued-fraction method of calculating pi and makes each processor calculate a different portion of the expansion series.
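A sketch of the exercise's overall pattern in C; this is not the course's continued-fraction solution (the Leibniz series for pi/4 stands in here, and nterms is an arbitrary choice). Each rank sums a different portion of the series and MPI_Reduce combines the partial sums on rank 0.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    const long nterms = 1000000L;    /* illustrative series length */
    double partial = 0.0, pi4 = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank takes terms rank, rank+size, rank+2*size, ... */
    for (long k = rank; k < nterms; k += size)
        partial += (k % 2 == 0 ? 1.0 : -1.0) / (2.0 * k + 1.0);

    MPI_Reduce(&partial, &pi4, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.10f\n", 4.0 * pi4);

    MPI_Finalize();
    return 0;
}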
Broadcast
Duplicates data from root process to other processes in communicator
[Figure: rank 0 holds A; after the broadcast, ranks 0–3 all hold A.]
Broadcast syntax
Fortran:
INTEGER count, datatype, root, comm, error
CALL MPI_BCAST(buffer, count, datatype, root, comm, error)

C:
MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);

C++:
Comm.Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root);

E.g. broadcasting 10 integers from rank 0:
int tenints[10];
MPI::COMM_WORLD.Bcast(tenints, 10, MPI::INT, 0);
Scatter
Distributes data from root process amongst processors within communicator.
[Figure: rank 0 holds A B C D; after the scatter, ranks 0–3 hold A, B, C and D respectively.]
Scatter syntax
scount (and rcount) is the number of elements each process is sent (i.e. equals the number received).

Fortran:
INTEGER scount, stype, rcount, rtype, root, comm, error
CALL MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)

C:
MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);

C++:
Comm.Scatter(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
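A usage sketch in C (values and the MAXP bound are illustrative): the root fills one integer per process and scatters them, so each rank receives exactly one element.

#include <mpi.h>
#include <stdio.h>

#define MAXP 64   /* assumed upper bound on process count */

int main(int argc, char *argv[])
{
    int rank, size, sbuf[MAXP], mine;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)                        /* only the root fills sbuf */
        for (int i = 0; i < size; i++)
            sbuf[i] = 100 + i;

    /* one int per rank: scount == rcount == 1 */
    MPI_Scatter(sbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d\n", rank, mine);

    MPI_Finalize();
    return 0;
}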
Gather
Collects data distributed amongst processes in the communicator onto the root process (collection is done in rank order).
[Figure: ranks 0–3 hold A, B, C, D; after the gather, rank 0 holds A B C D.]
Gather syntax
Takes the same arguments as the Scatter operation.

Fortran:
INTEGER scount, stype, rcount, rtype, root, comm, error
CALL MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)

C:
MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);

C++:
Comm.Gather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
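The reverse of the scatter sketch above (contributed values are illustrative): each rank supplies one value and rank 0 collects them in rank order.

#include <mpi.h>
#include <stdio.h>

#define MAXP 64   /* assumed upper bound on process count */

int main(int argc, char *argv[])
{
    int rank, size, mine, rbuf[MAXP];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = rank * rank;   /* each rank contributes one value */

    /* rbuf only needs to be meaningful on the root */
    MPI_Gather(&mine, 1, MPI_INT, rbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("rbuf[%d] = %d\n", i, rbuf[i]);

    MPI_Finalize();
    return 0;
}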
All Gather
Collects all data on all processes in communicator
[Figure: ranks 0–3 hold A, B, C, D; after the allgather, every rank holds A B C D.]
All Gather syntax
As Gather, but no root is defined.

Fortran:
INTEGER scount, stype, rcount, rtype, comm, error
CALL MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, error)

C:
MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm);

C++:
Comm.Allgather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype);