2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.
![Page 1: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/1.jpg)
2.1
Message-Passing Computing
Cluster Computing, UNC B. Wilkinson, 2007.
![Page 2: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/2.jpg)
2.2
Message-Passing Programming using User-level Message-Passing Libraries
Programming by using a normal high-level language such as C, augmented with message-passing library calls that perform direct process-to-process message passing. Two mechanisms are needed:
1. A method of creating separate processes for execution on different computers
2. A method of sending and receiving messages
![Page 3: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/3.jpg)
2.3
Multiple program, multiple data (MPMD) model
[Figure: separate source files, each compiled to suit its processor, give the executables for Processor 0 through Processor p - 1.]
![Page 4: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/4.jpg)
2.4
Multiple Program Multiple Data (MPMD) Model with dynamic process creation
[Figure: over time, process 1 executes spawn(), which starts execution of process 2.]
Processes started from within master process - dynamic process creation.
Potentially very flexible, but incurs the overhead of dynamically starting processes.
PVM used this form. MPI-2 has dynamic process creation, although we will not use it.
![Page 5: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/5.jpg)
2.5
Single Program Multiple Data (SPMD) model.
[Figure: one source file, compiled to suit each processor, gives the executables for Processor 0 through Processor p - 1.]
Different processes merged into one program.
Control (IF) statements select different parts for each processor to execute.
All executables started together - static process creation.
![Page 6: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/6.jpg)
2.6
Single Program Multiple Data (SPMD) model.
[Figure: one source file, compiled to suit each processor, gives the executables for Processor 0 through Processor p - 1.]
if (processID == 0) {
… // do this code
} else if (processID == 1) {
… //do this code
} else …
Typically coded for one master and a set of identical slave processes rather than each process different.
MPI uses this form
![Page 7: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/7.jpg)
2.7
Advantages/disadvantages of MPMD and SPMD
• MPMD with dynamic process creation
– Flexible: can start up processes on demand during execution, for example when searching through a search space.
– Has process start-up overhead
• SPMD (with static process creation)
– Easier to code
– Just one program to write
– Collective message-passing routines are the same in each process (see later)
– Efficient process execution
![Page 8: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/8.jpg)
2.8
Basic “point-to-point” Send and Receive Routines
Passing a message between processes using send() and recv() library calls (generic syntax, actual formats later):
Process 1: send(&x, 2);
Process 2: recv(&y, 1);
[Figure: movement of data from x in process 1 to y in process 2.]
![Page 9: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/9.jpg)
2.9
Synchronous Message Passing
Routines that return only when the message transfer has been completed.
Synchronous send routine
• Waits until the complete message has been accepted by the receiving process before returning.
Synchronous receive routine
• Waits until the message it is expecting arrives.
Synchronous routines intrinsically perform two actions: They transfer data and they synchronize processes. Neither can proceed until the message has been passed from the source to the destination. So no message buffer storage is needed.
![Page 10: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/10.jpg)
2.10
Synchronous send() and recv() using 3-way protocol
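[Figure (a), when send() occurs before recv(): process 1 reaches send(), issues a request to send, and is suspended. When process 2 reaches recv(), it returns an acknowledgment, the message is transferred, and both processes continue.]
[Figure (b), when recv() occurs before send(): process 2 reaches recv() and is suspended. When process 1 reaches send(), it issues a request to send, process 2 returns an acknowledgment, the message is transferred, and both processes continue.]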
![Page 11: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/11.jpg)
2.11
Asynchronous Message Passing
• Blocking - has been used to describe routines that do not return until the transfer is completed.
– The routines are “blocked” from continuing.
• Non-blocking - has been used to describe routines that return whether or not the message has been received. Usually require local storage for messages.
– In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care.
• In that sense, the terms synchronous and blocking were synonymous.
![Page 12: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/12.jpg)
2.12
MPI Definitions of Blocking and Non-Blocking
• Locally blocking - returns after its local actions complete, though the message transfer may not have been completed.
• Non-blocking - returns immediately.
Non-blocking routines assume that the data storage used for the transfer is not modified by subsequent statements before it has been used for the transfer; it is left to the programmer to ensure this.
These terms may have different interpretations in other systems.
![Page 13: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/13.jpg)
2.13
How message-passing routines return before message transfer completed
Message buffer needed between source and destination to hold message:
[Figure: over time, process 1 calls send(), which places the message in the message buffer, and process 1 continues. Later, process 2 calls recv() and reads the message buffer.]
![Page 14: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/14.jpg)
Message Buffer
• For a receive routine, the message has to have been received if we want the message.
• If recv() is reached before send(), the message buffer will be empty and recv() waits for the message.
• For a send routine, once the local actions have been completed and the message is safely on its way, the process can continue with subsequent work.
• In this way, using such send routines can decrease the overall execution time.
• In practice, buffers can only be of finite length and a point could be reached when the send routine is held up because all the available buffer space has been exhausted.
• It may be necessary to know at some point if the message has actually been received, which will require additional message passing.
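The finite-buffer behaviour described in the bullets above can be modelled in plain C, without any message-passing library. This is only a sketch under stated assumptions: the names msg_buffer, buf_send and buf_recv and the capacity BUF_SLOTS are hypothetical, and a return value of 0 stands in for the point where a real send or receive routine would be held up.

```c
#include <assert.h>

#define BUF_SLOTS 4            /* hypothetical capacity of the message buffer */

typedef struct {
    int slots[BUF_SLOTS];      /* messages waiting to be read */
    int head;                  /* index of the oldest message */
    int count;                 /* number of messages currently held */
} msg_buffer;

/* Model of a locally blocking send: returns 1 if the message was
   buffered (the sender may continue), or 0 if the buffer is full
   (a real send routine would be held up here). */
int buf_send(msg_buffer *b, int msg)
{
    assert(b != 0);
    if (b->count == BUF_SLOTS)
        return 0;
    b->slots[(b->head + b->count) % BUF_SLOTS] = msg;
    b->count++;
    return 1;
}

/* Model of a receive: returns 1 and stores the oldest message in *msg,
   or 0 if the buffer is empty (a real recv() would wait here). */
int buf_recv(msg_buffer *b, int *msg)
{
    assert(b != 0 && msg != 0);
    if (b->count == 0)
        return 0;
    *msg = b->slots[b->head];
    b->head = (b->head + 1) % BUF_SLOTS;
    b->count--;
    return 1;
}
```

A buffer initialised with `msg_buffer b = {{0}, 0, 0};` accepts four sends, refuses the fifth, and accepts another only after a receive has freed a slot.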
2.14
![Page 15: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/15.jpg)
2.15
Message Tag
• Used to differentiate between different types of messages being sent.
• Message tag is carried within message.
• If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().
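The matching rule in the bullets above can be written as a small predicate. This is a sketch in plain C, not the code of any particular library: recv_matches and the wildcard constants ANY_TAG and ANY_SOURCE are hypothetical names, and the source wildcard is an added assumption (the slide mentions only a tag wildcard; MPI provides both, as MPI_ANY_TAG and MPI_ANY_SOURCE).

```c
#include <assert.h>

#define ANY_TAG    (-1)   /* hypothetical wildcard: match any message tag */
#define ANY_SOURCE (-1)   /* hypothetical wildcard: match any source process */

/* Returns 1 if a message sent by process msg_source with tag msg_tag
   satisfies a recv() posted for (want_source, want_tag), otherwise 0. */
int recv_matches(int want_source, int want_tag, int msg_source, int msg_tag)
{
    /* A real message always carries a concrete source and tag. */
    assert(msg_source >= 0 && msg_tag >= 0);
    int source_ok = (want_source == ANY_SOURCE) || (want_source == msg_source);
    int tag_ok    = (want_tag == ANY_TAG) || (want_tag == msg_tag);
    return source_ok && tag_ok;
}
```

For example, a receive posted for source 1 and tag 5 accepts a message from process 1 with tag 5 but not one with tag 6, while the same receive posted with ANY_TAG accepts both.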
![Page 16: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/16.jpg)
2.16
Message Tag Example
To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign to y:
Process 1: send(&x, 2, 5);
Process 2: recv(&y, 1, 5); (waits for a message from process 1 with a tag of 5)
[Figure: movement of data from x in process 1 to y in process 2.]
![Page 17: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/17.jpg)
2.17
“Group” message passing routines
Have routines that send message(s) to a group of processes or receive message(s) from a group of processes
Higher efficiency than separate point-to-point routines although not absolutely necessary.
![Page 18: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/18.jpg)
2.18
Broadcast
Sending same message to all processes concerned with problem.
Multicast - sending same message to defined group of processes.
[Figure, SPMD (MPI) form: every process, 0 through p - 1, executes bcast(); the contents of buf in process 0 are copied into data in each process.]
In this model, the broadcast action does not occur until all the processes have executed their broadcast routine, so the broadcast operation has the effect of synchronizing the processes. (The MPI standard itself does not guarantee that MPI_Bcast() synchronizes.)
![Page 19: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/19.jpg)
2.19
Scatter
Sending each element of an array in root process to a separate process. Contents of ith location of array sent to ith process. Can send more than one element.
[Figure, SPMD (MPI) form: every process, 0 through p - 1, executes scatter(); successive elements of buf in process 0 are delivered to data in process 0, process 1, ..., process p - 1.]
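The index arithmetic behind scatter can be shown with a sequential model in plain C. This is a sketch only: scatter_part is a hypothetical helper that copies the part of the root's array that one process would receive, assuming every process receives the same number of elements.

```c
#include <assert.h>
#include <string.h>

/* Copy into recvbuf the part of the root's sendbuf that process `rank`
   would receive when each process is sent sendcount elements. */
void scatter_part(const int *sendbuf, int sendcount, int rank, int *recvbuf)
{
    assert(sendbuf != 0 && recvbuf != 0 && sendcount > 0 && rank >= 0);
    memcpy(recvbuf, sendbuf + (size_t)rank * sendcount,
           (size_t)sendcount * sizeof(int));
}
```

With a six-element root array and sendcount of 2, process 1 receives the third and fourth elements.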
![Page 20: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/20.jpg)
2.20
Gather
Having one process collect individual values from set of processes.
[Figure, SPMD (MPI) form: every process, 0 through p - 1, executes gather(); the contents of data in each process are collected into successive locations of buf in process 0.]
![Page 21: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/21.jpg)
2.21
Reduce
Gather operation combined with specified arithmetic/logical operation to produce a single value.
Example: Values could be gathered and then added together by root.
[Figure, SPMD (MPI) form: every process, 0 through p - 1, executes reduce(); the contents of data in each process are combined with + and the result is placed in buf in process 0.]
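What the root ends up with after a reduce can be modelled sequentially in plain C. The sketch below is an illustration, not a message-passing implementation: reduce_values, reduce_op, op_sum and op_max are hypothetical names, and values[i] stands for the single value contributed by process i.

```c
#include <assert.h>

typedef int (*reduce_op)(int, int);   /* hypothetical: combines two values */

int op_sum(int a, int b) { return a + b; }
int op_max(int a, int b) { return a > b ? a : b; }

/* Combine the values contributed by p processes into a single result,
   applying op from process 0 up to process p - 1. */
int reduce_values(const int *values, int p, reduce_op op)
{
    assert(values != 0 && p > 0 && op != 0);
    int result = values[0];
    for (int i = 1; i < p; i++)
        result = op(result, values[i]);
    return result;
}
```

Passing op_sum gives the "gathered and then added together" example; passing op_max gives a maximum instead.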
![Page 22: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/22.jpg)
2.22
Software Tools for Clusters
Late 1980s - Parallel Virtual Machine (PVM) developed. Became very popular.
Mid 1990’s - Message-Passing Interface (MPI) - standard defined.
Based upon Message Passing Parallel Programming model.
Both provide a set of user-level libraries for message passing. Use with sequential programming languages (C, C++, ...).
![Page 23: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/23.jpg)
2.23
PVM (Parallel Virtual Machine)
Perhaps the first widely adopted attempt at using a workstation cluster as a multicomputer platform, developed by Oak Ridge National Laboratory. Available at no charge.
Programmer decomposes problem into separate programs (usually master and group of identical slave programs).
Programs compiled to execute on specific types of computers.
Set of computers used on a problem first must be defined prior to executing the programs (in a hostfile).
![Page 24: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/24.jpg)
2.24
Message routing between computers done by PVM daemon processes installed by PVM on computers that form the virtual machine.
[Figure: three workstations connected by a network. Each runs a PVM daemon and one or more application programs (executables); messages are sent through the network.]
MPI implementation we use is similar.
Can have more than one process running on each computer.
![Page 25: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/25.jpg)
2.25
MPI (Message Passing Interface)
• Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability.
• Defines routines, not implementation.
• Several free implementations exist.
![Page 26: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/26.jpg)
2.26
MPI Process Creation and Execution
• Purposely not defined - will depend upon implementation.
• Only static process creation supported in MPI version 1. All processes must be defined prior to execution and started together.
• Originally SPMD model of computation.
• MPMD also possible with static creation.
![Page 27: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/27.jpg)
2.27
Using SPMD Computational Model
int main(int argc, char *argv[])
{
    int myrank;
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank); /* find process rank */
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
    return 0;
}
where master() and slave() are to be executed by master process and slave process, respectively.
![Page 28: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/28.jpg)
2.28
Communicators
• Defines scope of a communication operation.
• Processes have ranks associated with communicator.
• Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes.
• Other communicators can be established for groups of processes.
• Two types of communicator
– Intracommunicators for communication within a defined group
– Intercommunicators for communication between defined groups
![Page 29: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/29.jpg)
2.29
Reasoning for Communicators
• Provides a solution to unsafe message passing.
– Message tags alone are not sufficient.
• Enables basic error checking of message passing code by allowing programmer to define communication domains.
– Messages cannot be sent to destinations outside defined communication domain.
![Page 30: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/30.jpg)
2.30
Unsafe message passing - Example
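[Figure: process 0 and process 1 each call a library routine lib() and also issue their own message-passing calls; process 0 executes send(…,1,…) and process 1 executes recv(…,0,…), both in user code and inside lib(). (a) Intended behavior: the user's send is matched by the user's recv, and the send inside lib() by the recv inside lib(). (b) Possible behavior: a send from one context is matched by a recv in the other, so a message reaches the wrong destination from the wrong source.]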
![Page 31: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/31.jpg)
2.31
MPI Blocking Routines
• Return when “locally complete” - when location used to hold message can be used again or altered without affecting message being sent.
• Blocking send will send message and return - does not mean that message has been received, just that process free to move on without adversely affecting message.
![Page 32: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/32.jpg)
2.32
Parameters of blocking send
MPI_Send(buf, count, datatype, dest, tag, comm)
Parameters:
buf address of send buffer
count number of items to send
datatype datatype of each item
dest rank of destination process
tag message tag
comm communicator
![Page 33: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/33.jpg)
2.33
Parameters of blocking receive
MPI_Recv(buf, count, datatype, src, tag, comm, status)
Parameters:
buf address of receive buffer
count maximum number of items to receive
datatype datatype of each item
src rank of source process
tag message tag
comm communicator
status status after operation
![Page 34: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/34.jpg)
2.34
Example
To send an integer x from process 0 to process 1:
int myrank;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); /* find rank */
if (myrank == 0) {
    int x;
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}
![Page 35: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/35.jpg)
2.35
MPI Nonblocking Routines
• Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered.
• Nonblocking receive - MPI_Irecv() - will return even if no message to accept.
![Page 36: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/36.jpg)
2.36
Nonblocking Routine Formats
MPI_Isend(buf,count,datatype,dest,tag,comm,request)
MPI_Irecv(buf,count,datatype,source,tag,comm, request)
Completion detected by MPI_Wait() and MPI_Test().
MPI_Wait() waits until operation completed and returns then.
MPI_Test() returns with flag set indicating whether operation completed at that time.
Need to know whether particular operation completed.
Determined by accessing request parameter.
![Page 37: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/37.jpg)
2.37
Example
To send an integer x from process 0 to process 1 and allow process 0 to continue,
MPI_Request req1;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); /* find rank */
if (myrank == 0) {
    int x;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    compute(); /* must not alter x until the send has completed */
    MPI_Wait(&req1, &status);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}
![Page 38: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/38.jpg)
2.38
Send Communication Modes
• Standard Mode Send - Not assumed that corresponding receive routine has started. Amount of buffering not defined by MPI. If buffering provided, send could complete before receive reached.
• Buffered Mode - Send may start and return before a matching receive. Necessary to specify buffer space via routine MPI_Buffer_attach().
• Synchronous Mode - Send and receive can start before each other but can only complete together.
• Ready Mode - Send can only start if matching receive already reached, otherwise error. Use with care.
![Page 39: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/39.jpg)
2.39
Parameters of synchronous send (same as blocking send)
MPI_Ssend(buf, count, datatype, dest, tag, comm)
Parameters:
buf address of send buffer
count number of items to send
datatype datatype of each item
dest rank of destination process
tag message tag
comm communicator
![Page 40: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/40.jpg)
2.40
Collective Communication
Involves set of processes, defined by an intra-communicator. Message tags not present. Principal collective operations:
• MPI_Bcast() - Broadcast from root to all other processes
• MPI_Gather() - Gather values for group of processes
• MPI_Scatter() - Scatters buffer in parts to group of processes
• MPI_Alltoall() - Sends data from all processes to all processes
• MPI_Reduce() - Combine values on all processes to single value
• MPI_Reduce_scatter() - Combine values and scatter results
• MPI_Scan() - Compute prefix reductions of data on processes
• MPI_Barrier() - A means of synchronizing processes by stopping each one until they all have reached a specific “barrier” call.
![Page 41: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/41.jpg)
2.41
Barrier: Block process until all processes have called it
MPI_Barrier(comm)
Parameters:
comm communicator
![Page 42: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/42.jpg)
2.42
Broadcast message from root process to all processes in comm, including itself.
MPI_Bcast(*buf, count, datatype, root, comm)
Parameters:
*buf message buffer (loaded)
count number of entries in buffer
datatype data type of buffer
root rank of root
comm communicator
![Page 43: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/43.jpg)
2.43
Gather values for group of processes
MPI_Gather(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)
Parameters:
*sendbuf send buffer
sendcount number of send buffer elements
sendtype data type of send elements
*recvbuf receive buffer (loaded)
recvcount number of elements for each receive
recvtype data type of receive elements
root rank of receiving process
comm communicator
![Page 44: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/44.jpg)
2.44
Example
To gather items from group of processes into process 0, using dynamically allocated memory in root process:
int data[10]; /* data to be gathered from processes */
int myrank, grp_size;
int *buf = NULL; /* receive buffer, used only in root */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); /* find rank */
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size); /* find group size */
    buf = (int *)malloc(grp_size * 10 * sizeof(int)); /* allocate memory */
}
/* recvcount is the number of items received from each process (10), not the total */
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Gather() gathers from all processes, including root.
![Page 45: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/45.jpg)
2.45
Scatter a buffer from root in parts to group of processes
MPI_Scatter(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)
Parameters:
*sendbuf send buffer
sendcount number of elements sent (each process)
sendtype data type of elements
*recvbuf receive buffer (loaded)
recvcount number of recv buffer elements
recvtype type of recv elements
root root process rank
comm communicator
![Page 46: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/46.jpg)
2.46
Combine values on all processes to single value
MPI_Reduce(*sendbuf,*recvbuf,count,datatype,op,root,comm)
Parameters:
*sendbuf send buffer address
*recvbuf receive buffer address
count number of send buffer elements
datatype data type of send elements
op reduce operation. Several operations, including
    MPI_MAX Maximum
    MPI_MIN Minimum
    MPI_SUM Sum
    MPI_PROD Product
root root process rank for result
comm communicator
![Page 47: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/47.jpg)
Hello World
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello MPI! Process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
2.47
![Page 48: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/48.jpg)
2.48
Sample MPI program
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000
int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) { /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
        fclose(fp);
    }
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD); /* broadcast data */
    x = MAXSIZE / numprocs; /* Add my portion of data */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);
    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);
    MPI_Finalize();
    return 0;
}
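The sample program splits the data with an integer division, so when the number of processes does not divide 1000 evenly the last few elements belong to no process and are left out of the sum. One possible way around this, sketched below in plain C, is to give one extra element to the first few processes. The helper block_range is a hypothetical name and not part of MPI or of the original program.

```c
#include <assert.h>

/* Compute the half-open index range [*low, *high) owned by process myid
   when n items are divided among numprocs processes. The first
   n % numprocs processes each receive one extra item. */
void block_range(int n, int numprocs, int myid, int *low, int *high)
{
    assert(n >= 0 && numprocs > 0 && myid >= 0 && myid < numprocs);
    assert(low != 0 && high != 0);
    int base  = n / numprocs;   /* items every process receives */
    int extra = n % numprocs;   /* leftover items to hand out */
    *low  = myid * base + (myid < extra ? myid : extra);
    *high = *low + base + (myid < extra ? 1 : 0);
}
```

With 1000 items and 3 processes this gives the ranges [0, 334), [334, 667) and [667, 1000), so every element is summed exactly once.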
![Page 49: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/49.jpg)
MPI Groups, Communicators
2.49
• Collective communication may need to be performed by subsets of the processes in the computation
• MPI provides routines for
– defining new process groups from subsets of existing process groups, such as the group of MPI_COMM_WORLD
– creating a new communicator for a new process group
– performing collective communication within that process group
![Page 50: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/50.jpg)
Groups and communicators
2.50
![Page 51: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/51.jpg)
Facts about groups and communicators
2.51
• Group:
– ordered set of processes
– each process in group has a unique integer id called its rank within that group
– process can belong to more than one group
  • rank is always relative to a group
– groups are “opaque objects”
  • use only MPI provided routines for manipulating groups
• Communicators:
– all communication must specify a communicator
– from the programming viewpoint, groups and communicators are equivalent
– communicators are also “opaque objects”
• Groups and communicators are dynamic objects and can be created and destroyed during the execution of the program
![Page 52: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/52.jpg)
Typical usage
2.52
1. Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group
2. Form new group as a subset of global group using MPI_Group_incl or MPI_Group_excl
3. Create new communicator for new group using MPI_Comm_create
4. Determine new rank in new communicator using MPI_Comm_rank
5. Conduct communications using any MPI message passing routine
6. When finished, free up new communicator and group (optional) using MPI_Comm_free and MPI_Group_free
![Page 53: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/53.jpg)
int main(int argc, char **argv)
{
    int me, count, count2;
    /* buffers and counts must be set up before use (omitted here) */
    void *send_buf, *recv_buf, *send_buf2, *recv_buf2;
    MPI_Group MPI_GROUP_WORLD, grprem;
    MPI_Comm commslave;
    static int ranks[] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Group_excl(MPI_GROUP_WORLD, 1, ranks, &grprem); /* everyone except rank 0 */
    MPI_Comm_create(MPI_COMM_WORLD, grprem, &commslave);
    if (me != 0) {
        /* compute on slave */
        MPI_Reduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, 1, commslave);
    }
    /* zero falls through immediately to this reduce, others do later... */
    MPI_Reduce(send_buf2, recv_buf2, count2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    /* rank 0 is not in grprem, so its commslave is MPI_COMM_NULL and must not be freed */
    if (commslave != MPI_COMM_NULL)
        MPI_Comm_free(&commslave);
    MPI_Group_free(&MPI_GROUP_WORLD);
    MPI_Group_free(&grprem);
    MPI_Finalize();
    return 0;
}
![Page 54: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/54.jpg)
2.54
#include "mpi.h"
#include <stdio.h>
#define NPROCS 8
int main(int argc, char *argv[])
{
    int rank, new_rank, sendbuf, recvbuf, numtasks,
        ranks1[4] = {0, 1, 2, 3}, ranks2[4] = {4, 5, 6, 7};
    MPI_Group orig_group, new_group;
    MPI_Comm new_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    sendbuf = rank;
    /* Extract the original group handle */
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
    /* Divide tasks into two distinct groups based upon rank */
    if (rank < NPROCS/2) {
        MPI_Group_incl(orig_group, NPROCS/2, ranks1, &new_group);
    } else {
        MPI_Group_incl(orig_group, NPROCS/2, ranks2, &new_group);
    }
    /* Create new communicator and then perform collective communications */
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
    MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);
    MPI_Group_rank(new_group, &new_rank);
    printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);
    MPI_Finalize();
    return 0;
}
Sample output (the program must be run with 8 processes):
rank= 7 newrank= 3 recvbuf= 22
rank= 0 newrank= 0 recvbuf= 6
rank= 1 newrank= 1 recvbuf= 6
rank= 2 newrank= 2 recvbuf= 6
rank= 6 newrank= 2 recvbuf= 22
rank= 3 newrank= 3 recvbuf= 6
rank= 4 newrank= 0 recvbuf= 22
rank= 5 newrank= 1 recvbuf= 22
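The numbers in the sample output can be checked by hand: each process contributes its world rank, so the sum over one group is the sum of the ranks listed for that group, and a process's new rank is its position in that list. The plain C sketch below reproduces that arithmetic sequentially; group_rank_sum and rank_in_group are hypothetical helper names, not MPI routines.

```c
#include <assert.h>

/* Sum of the world ranks in one group: the value a sum reduction over the
   group leaves in recvbuf when every member contributes its own rank. */
int group_rank_sum(const int *ranks, int n)
{
    assert(ranks != 0 && n > 0);
    int total = 0;
    for (int i = 0; i < n; i++)
        total += ranks[i];
    return total;
}

/* Position of world_rank in the ranks list, which is its rank within
   the new group, or -1 if the process is not a member. */
int rank_in_group(const int *ranks, int n, int world_rank)
{
    assert(ranks != 0 && n > 0);
    for (int i = 0; i < n; i++)
        if (ranks[i] == world_rank)
            return i;
    return -1;
}
```

For the two lists used in the program, {0, 1, 2, 3} sums to 6 and {4, 5, 6, 7} sums to 22, and world rank 7 sits at position 3 of the second list, matching the line "rank= 7 newrank= 3 recvbuf= 22".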
![Page 55: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/55.jpg)
2.55
Evaluating Programs Empirically
Measuring Execution Time
To measure the execution time between point L1 and point L2 in the code, we might have a construction such as:
    .
L1: time(&t1); /* start timer */
    .
    .
L2: time(&t2); /* stop timer */
    .
elapsed_time = difftime(t2, t1); /* elapsed_time = t2 - t1 */
printf("Elapsed time = %5.2f seconds", elapsed_time);
![Page 56: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/56.jpg)
2.56
MPI provides the routine MPI_Wtime() for returning time (in seconds):
double start_time, end_time, exe_time;
start_time = MPI_Wtime();
    .
    .
end_time = MPI_Wtime();
exe_time = end_time - start_time;
![Page 57: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/57.jpg)
Average/least/most execution time spent by an individual process
int myrank, numprocs;
double mytime, maxtime, mintime, avgtime; /* variables used for gathering timing statistics */

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

MPI_Barrier(MPI_COMM_WORLD); /* synchronize all processes */
mytime = MPI_Wtime(); /* get time just before work section */
work();
mytime = MPI_Wtime() - mytime; /* elapsed time for work section */

/* compute max, min, and average timing statistics */
MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &mintime, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &avgtime, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myrank == 0) {
    avgtime /= numprocs;
    printf("Min: %lf Max: %lf Avg: %lf\n", mintime, maxtime, avgtime);
}
2.57
![Page 58: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/58.jpg)
2.58
Compiling/Executing MPI Programs
• Minor differences in the command lines required depending upon MPI implementation.
• For the assignments, we will use MPICH-2.
• Generally, a file needs to be present that lists all the computers to be used. MPI then uses the computers listed.
![Page 59: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/59.jpg)
Set MPICH2 Path
• Add the following at the end of your ~/.bashrc file and source it with the command source ~/.bashrc (or log in again).
#----------------------------------------------------------------------
# MPICH2 setup
export PATH=/opt/MPICH2/bin:$PATH
export MANPATH=/opt/MPICH2/bin:$MANPATH
#----------------------------------------------------------------------
• Some logging and visualization help:
• You can link with the libraries -llmpe -lmpe to enable logging and the MPE environment. Then run the program as usual and a log file will be produced. The log file can be visualized using the jumpshot program that comes bundled with MPICH2.
2.59
![Page 60: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/60.jpg)
2.60
Defining the Computers to Use
Generally, you need to create a file containing the list of machines to be used.
Sample machines file (or hostfile)
athena.cs.siu.edu
oscarnode1.cs.siu.edu
……….
oscarnode8.cs.siu.edu
In MPICH, this file is not needed if only one computer is used.
![Page 61: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/61.jpg)
2.61
MPICH Commands
Two basic commands:
• mpicc, a script to compile MPI programs
• mpiexec (or mpirun), the command to execute an MPI program.
![Page 62: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/62.jpg)
2.62
Compiling/executing (SPMD) MPI program
For MPICH. At a command line:
To start MPI: Nothing special.
To compile MPI programs:
for C mpicc -o prog prog.c
for C++ mpiCC -o prog prog.cpp
To execute MPI program:
mpiexec -n no_procs prog
where no_procs is a positive integer (the number of processes to start)
![Page 63: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/63.jpg)
2.63
Executing MPICH program on multiple computers
Create a file called, say, “machines” containing the list of machines:
athena.cs.siu.edu
oscarnode1.cs.siu.edu
……….
oscarnode8.cs.siu.edu
Establish network environment:
mpdboot -n 9 -f machines
mpdtrace
mpdallexit
![Page 64: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/64.jpg)
2.64
mpirun -machinefile machines -np 4 prog
would run prog with four processes.
Each process would execute on one of the machines in the list. MPI would cycle through the list of machines, giving processes to machines.
You can also specify the number of processes on a particular machine by adding that number after the machine name.
The “MPI standard” command mpiexec is now the replacement for mpirun, although mpirun still exists.
![Page 65: 2.1 Message-Passing Computing Cluster Computing, UNC B. Wilkinson, 2007.](https://reader036.fdocuments.us/reader036/viewer/2022062722/56649f345503460f94c51c21/html5/thumbnails/65.jpg)
Reference Tutorial Materials
• http://www-unix.mcs.anl.gov/mpi/tutorial/index.html
• http://www.cs.utexas.edu/users/pingali/CS378/2008sp/lectureschedule.html
2.65