MPI & Distributed Computing
Eric Borisch, M.S., Mayo Clinic
Topics
Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
Shared vs. Distributed Memory
Shared memory: all memory within a system is directly addressable (ignoring access restrictions) by each process [or thread]
  Single- and multi-CPU desktops & laptops
  Multi-threaded apps
  GPGPU *
  MPI *
Distributed memory: memory available to a given node within a system is unique and distinct from its peers
  MPI
  Google MapReduce / Hadoop
Why bother?
[Figure: STREAM benchmark (http://www.cs.virginia.edu/stream/) relative performance vs. number of processes (1-8) for Copy, Scale, Add, and Triad. System: CentOS 5.2; dual quad-core 3GHz P4 [E5472]; DDR2 800MHz.]
But what about Nehalem?
[Figure: STREAM benchmark (http://www.cs.virginia.edu/stream/) OpenMP performance; relative performance (0-400%) vs. number of threads (0-16, on 8 physical cores + HT) for Add, Copy, Scale, and Triad. System: 2x X5570 (2.93GHz; quad-core; 6.4GT/s QPI); 12x4G 1033 DDR3.]
Memory Limitations
Bandwidth (FSB, HT, Nehalem, CUDA, …)
  Frequently run into with high-level languages (MATLAB)
Capacity: cost & availability
  High-density chips are $$$ (if even available)
  Memory limits on individual systems
Distributed computing addresses both bandwidth and capacity with multiple systems
MPI is the glue used to connect multiple distributed processes together
Memory Requirements [Example]
Custom iterative SENSE reconstruction
3 x 8 coils x 400 x 320 x 176 x 8 [complex float]
  Profile data (img space)
  Estimate (img <-> k space)
  Acquired data (k space)
> 4GB data touched during each iteration
16- and 32-channel data here or on the way…
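The >4 GB figure follows directly from the dimensions above (reading the leading 3 as the three arrays listed and 8 bytes per complex float):
3 x 8 x 400 x 320 x 176 x 8 bytes ≈ 4.3 x 10^9 bytes ≈ 4.0 GiB touched per iteration, before any scratch buffers.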
Trzasko, Josh. "Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms." ISBI 2009 #1349. M01.R4: MRI Reconstruction Algorithms, Monday @ 10:45 in Stanbro.
Real-time SENSE unfolding
[Figure: processing pipeline. Real-time DATA → FTx → place view into correct x-Ky-Kz space (AP & LP) → FTyz (AP & LP) → "traditional" 2D SENSE unfold (AP & LP, using pre-loaded CAL data) → homodyne correction → GW correction (Y, Z) → GW correction (X) → MIP → store / DICOM → RESULT. Key: root node, worker nodes, real-time data, pre-loaded data, MPI communication.]
MRI Reconstruction Cluster
[Figure: cluster hardware. Root node: 2x 3.6GHz P4, 16GB RAM, 80GB HDD, 1Gb Ethernet, 2x 8Gb Infiniband. Worker nodes (x7): 2x 3.6GHz P4, 16GB RAM, 80GB HDD, 2x 8Gb Infiniband. 24-port Infiniband switch: x2 MPI interconnects per node, 16Gb/s bandwidth per node. 16-port Gigabit Ethernet switch: x7 file-system connections. 500GB HDD and 1Gb Ethernet connections to the MRI system and the site intranet. Key: cluster hardware vs. external hardware; 2x8Gig Infiniband connection; 1Gig Ethernet connection; 8Gb/s connection.]
Many Approaches to "Distributed"
Loosely coupled: SETI / BOINC; "grid computing"
BIOS-level abstraction: ScaleMP
Tightly coupled: MPI; "cluster computing"
Hybrid: Folding@Home, gpugrid.net
Grid vs. Cluster
[Figure: a grid: a master farming work out to independent workers, vs. a cluster: a head node and worker nodes joined by a common interconnect.]
Shared vs. Distributed
[Figure: shared memory: a single host and OS running Process A (threads 1 … N) and Process B, exchanging data through memory transfers. Distributed: Host I / OS I, Host II / OS II, … Host N / OS N each run their own processes (A, B, … C), exchanging data through network transfers. A second view adds more processes (D, E, F) on the hosts, showing that each host can mix local memory transfers with network transfers to its peers.]
Topics
Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
MPI
Message Passing Interface is…
  "a library specification for message-passing" [1]
Available in many implementations on multiple platforms *
A set of functions for moving messages between different processes without a shared memory environment
Low-level*; no concept of overall computing tasks to be performed
[1] http://www.mcs.anl.gov/research/projects/mpi/
MPI history
MPI-1
  Version 1.0 draft standard, 1994
  Version 1.1 in 1995
  Version 1.2 in 1997
  Version 1.3 in 2008
MPI-2
  Added: 1-sided communication; dynamic "world" sizes (spawn / join)
  Version 2.0 in 1997
  Version 2.1 in 2008
MPI-3
  In process
  Enhanced fault handling
Forward compatibility preserved
MPI Status
MPI is the de-facto standard for distributed computing
  Freely available
  Open source implementations exist
  Portable
  Mature
From a discussion of why MPI is dominant [1]:
  "[…] 100s of languages have come and gone. Good stuff must have been created [… yet] it is broadly accepted in the field that they're not used. MPI has a lock. OpenMP is accepted, but a distant second. There are substantial barriers to the introduction of new languages and language constructs. Economic, ecosystem related, psychological, a catch-22 of widespread use, etc. Any parallel language proposal must come equipped with reasons why it will overcome those barriers."
[1] http://www.ieeetcsc.org/newsletters/2006-01/why_all_mpi_discussion.html
MPI Distributions
MPI itself is just a specification. We want an implementation:
MPICH, MPICH2
  Widely portable
MVAPICH, MVAPICH2
  Infiniband-centric; MPICH/MPICH2 based
OpenMPI
  Plug-in architecture; many run-time options
And more: Intel MPI, HP-MPI, MPI for IBM Blue Gene, MPI for Cray, Microsoft MPI, MPI for SiCortex, MPI for Myrinet Express (MX), MPICH2 over SCTP
Implementing a distributed system
Without MPI:
  Start all of the processes across a bank of machines (shell scripting + ssh)
  socket(), bind(), listen(), accept() or connect() for each link
  send(), read() on individual links
  Raw byte interfaces; no discrete messages
Implementing a distributed system
With MPI:
  mpiexec -np <n> app
  MPI_Init()
  MPI_Send() / MPI_Recv()
  MPI_Finalize()
MPI: manages the connections, packages messages, provides the launching mechanism
MPI (the document) [1]
Provides definitions for:
  Communication functions: MPI_Send(), MPI_Recv(), MPI_Bcast(), etc.
  Datatype management functions: MPI_Type_create_hvector(), …
  C, C++, and Fortran bindings
Also recommends process startup: mpiexec -np <nproc> <program> <args>
[1] http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html
MPI Functions
[Slide: several hundred function names, the full MPI API from MPI_Abort through MPI_Wtime: point-to-point and collective communication, communicators and groups, derived datatypes, one-sided windows (MPI_Win_*), parallel file I/O (MPI_File_*), process topologies, info objects, and error handling.]
The message passing mindset
Each process owns its data; there is no "our"
  Makes many things simpler: no mutexes, condition variables, semaphores, etc.; memory access order race conditions go away
Every message is an explicit copy
  I have the memory I sent from; you have the memory you received into
  Even when running in a "shared memory" environment
Synchronization comes along for free
  I won't get your message (or data) until you choose to send it
Programming to MPI first can make it easier to scale out later
Topics
Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
Getting started with MPI
Download / decompress MPICH2 source: http://www.mcs.anl.gov/research/projects/mpich2/
  Supports: C / C++ / Fortran
  Requires Python >= 2.2
./configure
make
make install
  Installs into /usr/local by default, or use --prefix=<chosen path>
Make sure <prefix>/bin is in PATH
Make sure <prefix>/share/man is in MANPATH
MPI Installation
[Slide: annotated listing of the installed binaries: the C compiler wrapper (mpicc), the C++ compiler wrapper (mpicxx), the MPI job launcher (mpiexec), and the MPD launcher (mpdboot).]
MPD launch
Set up passwordless ssh to the workers
Start the daemons with mpdboot -n <N>
  Requires ~/.mpd.conf to exist on each host
    Contains (the same on each host): MPD_SECRETWORD=<some gibberish string>
    Permissions set to 600 (r/w access for owner only)
  Requires ./mpd.hosts to list the other host names
    Unless run as mpdboot -n 1 (run on current host only)
    Will not accept the current host in the list (implicit)
Check for running daemons with mpdtrace
For details: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
MPI Compile & launch
Use mpicc / mpicxx as the C / C++ compiler
  Wrapper script around the C/C++ compilers detected during install
  $ mpicc --show
  gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt
  $ mpicc -o hello hello.c
Use mpiexec -np <nproc> <app> <args> to launch
  $ mpiexec -np 4 ./hello
  Hello, Hello, Hello, world world world
/* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i, rank, nodes;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < nodes; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}
$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
Threads vs. MPI startup
[Figure: a threaded app: ./threaded_app runs main(); pthread_create(func()) spawns threads that do work against the process's shared memory; pthread_exit() / pthread_join(); exit(). Everything lives inside one process. An MPI app: mpiexec -np 4 ./mpi_app has mpd launch a separate mpi_app process per rank (rank 0, 1, … 3); each runs main(), MPI_Init(), MPI_Bcast(), does work on local memory, MPI_Allreduce(), MPI_Finalize(), exit(); the ranks interact only through MPI communication.]
Hello, world: unique to ranks
/* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i;
    int rank;
    int nodes;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < nodes; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}
MPE (Multi-Process Environment)
MPICH2 comes with MPE by default (unless disabled during configure)
Multiple tracing / logging options to track MPI traffic
Enabled through -mpe=<option> at compile time
MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
MacPro:code$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
Writing logfile....
Enabling the Default clock synchronization...
Finished writing logfile ./hello.clog2.
[Figure: jumpshot view of the log]
Output with -mpe=mpitrace
MacPro:code$ mpicc -mpe=mpitrace -o hello hello.c
MacPro:code$ mpiexec -np 2 ./hello > trace

MacPro:code$ grep 0 trace
[0] Ending MPI_Init
[0] Starting MPI_Comm_size...
[0] Ending MPI_Comm_size
[0] Starting MPI_Comm_rank...
[0] Ending MPI_Comm_rank
[0] Starting MPI_Barrier...
[0] Ending MPI_Barrier
Hello from 0 of 2!
[0] Starting MPI_Barrier...
[0] Ending MPI_Barrier
[0] Starting MPI_Finalize...
[0] Ending MPI_Finalize

MacPro:code$ grep 1 trace
[1] Ending MPI_Init
[1] Starting MPI_Comm_size...
[1] Ending MPI_Comm_size
[1] Starting MPI_Comm_rank...
[1] Ending MPI_Comm_rank
[1] Starting MPI_Barrier...
[1] Ending MPI_Barrier
[1] Starting MPI_Barrier...
[1] Ending MPI_Barrier
Hello from 1 of 2!
[1] Starting MPI_Finalize...
[1] Ending MPI_Finalize
A more interesting log…
[Figure: jumpshot view of an MPI log from a 3D-sinc interpolation run.]
MPI_Send (Blocking)
int MPI_Send(
  void *buf,               memory location to send from
  int count,               number of elements (of type datatype) at buf
  MPI_Datatype datatype,   MPI_INT, MPI_FLOAT, etc., or custom datatypes: strided vectors, structures, etc.
  int dest,                rank (within the communicator comm) of the destination for this message
  int tag,                 used to distinguish this message from other messages
  MPI_Comm comm )          communicator for this transfer; often MPI_COMM_WORLD
MPI_Recv (Blocking)
int MPI_Recv(
  void *buf,               memory location to receive data into
  int count,               number of elements (of type datatype) available to receive into at buf
  MPI_Datatype datatype,   MPI_INT, MPI_FLOAT, etc., or custom datatypes: strided vectors, structures, etc.; typically matches the sending datatype, but doesn't have to
  int source,              rank (within the communicator comm) of the source for this message; can also be MPI_ANY_SOURCE
  int tag,                 used to distinguish this message from other messages; can also be MPI_ANY_TAG
  MPI_Comm comm,           communicator for this transfer; often MPI_COMM_WORLD
  MPI_Status *status )     structure describing the received message, including: actual count (can be smaller than the passed count), source (useful with source = MPI_ANY_SOURCE), tag (useful with tag = MPI_ANY_TAG)
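As a quick illustration of those status fields, here is a minimal sketch, not from the talk (the message sizes are made up; run with a modest -np, e.g. 4): the root receives from MPI_ANY_SOURCE and reads the actual count, source, and tag back out of the status.

/* status_demo.c -- hypothetical example, not part of the original deck */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int buf[16], count, i, n, rank, nodes;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        n = rank < 16 ? rank : 16;              /* each worker sends a different count */
        for (i = 0; i < n; i++) buf[i] = rank;
        MPI_Send(buf, n, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
    } else {
        for (i = 1; i < nodes; i++) {           /* one message per worker, in any order */
            MPI_Recv(buf, 16, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);   /* actual count received */
            printf("got %i ints from rank %i (tag %i)\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}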
Another example
/* sr.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int main (int argc, char * argv[])
{
    int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
    MPI_Status sendStatus;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    myData[0] = rank;
    MPI_Send(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
    MPI_Recv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0,
             MPI_COMM_WORLD, &sendStatus);

    printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

    MPI_Finalize();
    return 0;
}
Does it run?
$ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
Log output (-np 4)
May != Will
$ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 2 ./sr
^C

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
What the standard has to say… (3.4 Communication Modes)
The send call described in Section Blocking send is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.
Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol.
The send call described in Section Blocking send used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.
Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.
http://www.mpi-forum.org/docs/mpi-11-html/node40.html#Node40
Rendezvous vs. eager (simplified)
[Figure: message timelines between Process 1 and Process 2. Eager: a "small" message is sent and the send returns immediately; the receiver later requests and receives the already-buffered message. Rendezvous: a "large" message first posts a rendezvous request; the receiver matches the request, the rendezvous send transfers the data, and the receiver gets the large message; the send blocks until completion. Shading distinguishes user activity from MPI activity.]
MPI communication modes
MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)
  Sends are "local": they return independent of any remote activity
  Message buffer can be touched immediately after the call returns
  Requires a user-provided buffer, provided via MPI_Buffer_attach()
  Forces an "eager"-like message transfer from the sender's perspective
  User can wait for completion by calling MPI_Buffer_detach()
MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)
  Won't return until the matching receive is posted
  Forces a "rendezvous"-like message transfer
  Can be used to guarantee synchronization without additional MPI_Barrier() calls
MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)
  Erroneous if the matching receive has not been posted
  Performance tweak (on some systems) when the user can guarantee the matching receive is posted
MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)
  Non-blocking; immediate return once the send/receive request is posted
  Requires an MPI_[Test|Wait][|all|any|some] call to guarantee completion
  Send/receive buffers should not be touched until completed
  MPI_Request * argument used for eventual completion
The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to receive any send mode.
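For example, buffered mode would be one alternative way to make the earlier ring exchange safe regardless of message size (the deck's actual fix, using MPI_Isend/MPI_Irecv, follows on the next slide). A minimal sketch, not from the talk, with a single-int payload:

/* bsend_ring.c -- hypothetical example, not part of the original deck */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, nodes, myData, theirData, bufsize;
    char * buffer;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* User-provided buffer: the payload plus MPI's per-message overhead. */
    bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    buffer = malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);

    myData = rank;
    /* Local completion: returns once the message is copied into the buffer. */
    MPI_Bsend(&myData, 1, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
    MPI_Recv(&theirData, 1, MPI_INT, (rank + nodes - 1) % nodes, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Detach blocks until the buffered send has actually completed. */
    MPI_Buffer_detach(&buffer, &bufsize);
    printf("%i sent %i; received %i\n", rank, myData, theirData);

    free(buffer);
    MPI_Finalize();
    return 0;
}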
Fixing the code
/* sr2.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int main (int argc, char * argv[])
{
    int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
    MPI_Status xferStatus[2];
    MPI_Request xferRequest[2];

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    myData[0] = rank;
    MPI_Isend(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0,
              MPI_COMM_WORLD, &xferRequest[0]);
    MPI_Irecv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0,
              MPI_COMM_WORLD, &xferRequest[1]);

    MPI_Waitall(2, xferRequest, xferStatus);

    printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

    MPI_Finalize();
    return 0;
}
Fixed with MPI_I[send|recv]()
$ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 4 ./sr2
0 sent 0; received 3
2 sent 2; received 1
1 sent 1; received 0
3 sent 3; received 2
Topics
Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
Types of parallelism [1/2]
Task parallelism
  Each process handles a unique kind of task
    Example: multi-image uploader (with resize/recompress)
    Thread 1: GUI / user interaction
    Thread 2: file reader & decompression
    Thread 3: resize & recompression
    Thread 4: network communication
  Can be used in a grid with a pipeline of separable tasks to be performed on each data set
    Resample / warp volume
    Segment volume
    Calculate metrics on segmented volume
Types of parallelism [2/2]
Data parallelism
  Each process handles a portion of the entire data
  Often used with large data sets
    [task 0… | … task 1 … | … | … task n]
  Frequently used in MPI programming
  Each process is "doing the same thing," just on a different subset of the whole
Data layout
[Figure: a volume (x, y, z axes) divided into slabs along z across Node 0 through Node 7.]
Layout is crucial in high-performance computing
  BW efficiency; cache efficiency
Even more important in distributed computing
  Poor layout means extra communication
Shown is an example of "block" data distribution
  x is the contiguous dimension
  z is the slowest dimension
  Each node has a contiguous portion of z
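A minimal sketch of that block distribution in code, not from the talk (NZ and the round-up division are illustrative choices): each rank computes which contiguous range of z slices it owns.

/* block_layout.c -- hypothetical example, not part of the original deck */
#include <stdio.h>
#include <mpi.h>

#define NZ 176   /* hypothetical number of z slices */

int main (int argc, char * argv[])
{
    int rank, nodes, per, z0, z1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    per = (NZ + nodes - 1) / nodes;          /* slices per node, rounded up    */
    z0  = rank * per;                        /* first slice owned by this rank */
    if (z0 > NZ) z0 = NZ;                    /* trailing ranks may own nothing */
    z1  = (z0 + per < NZ) ? z0 + per : NZ;   /* one past the last owned slice  */

    printf("rank %i owns z slices [%i, %i)\n", rank, z0, z1);

    MPI_Finalize();
    return 0;
}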
Real-time SENSE unfolding (revisited)
[Figure: the same SENSE unfolding pipeline, with the key indicating which steps run on the root node vs. the worker nodes, which data are real-time vs. pre-loaded, and where MPI communication occurs: real-time DATA → FTx → place view into correct x-Ky-Kz space (AP & LP) → FTyz (AP & LP) → "traditional" 2D SENSE unfold (AP & LP, with pre-loaded CAL data) → homodyne correction → GW correction (Y, Z) → GW correction (X) → MIP → display / DICOM → RESULT.]
Separability
Completely separable problems:
  Add 1 to everyone
  Multiply each a[i] * b[i]
Inseparable problems: [?]
  Max of a vector
  Sort a vector
  MIP of a volume
  1D FFT of a volume
  2D FFT of a volume
  3D FFT of a volume
[Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
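The "max of a vector" case shows why some of these are only mildly inseparable: each rank reduces its own block locally and a single collective combines the partial results. A minimal sketch, not from the talk (the local data is made up):

/* global_max.c -- hypothetical example, not part of the original deck */
#include <stdio.h>
#include <mpi.h>

#define LOCAL_N 4   /* hypothetical size of each rank's block */

int main (int argc, char * argv[])
{
    float local[LOCAL_N], localMax, globalMax;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < LOCAL_N; i++)            /* stand-in for this rank's block */
        local[i] = (float)(rank * LOCAL_N + i);

    localMax = local[0];                     /* purely local reduction         */
    for (i = 1; i < LOCAL_N; i++)
        if (local[i] > localMax) localMax = local[i];

    /* One collective combines the per-rank maxima; every rank gets the answer. */
    MPI_Allreduce(&localMax, &globalMax, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0) printf("global max = %g\n", globalMax);

    MPI_Finalize();
    return 0;
}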
[Figure: 3D-sinc interpolation example.]
Next steps
Dynamic datatypes
  MPI_Type_vector()
  Enables communication of sub-sets without packing (see the sketch after this list)
  Combined with DMA, permits zero-copy transposes, etc.
Other collectives
  MPI_Reduce, MPI_Scatter, MPI_Gather
MPI-2 (MPICH2, MVAPICH2)
  One-sided (DMA) communication: MPI_Put(), MPI_Get()
  Dynamic world size: ability to spawn new processes during a run
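A minimal sketch of MPI_Type_vector(), not from the talk: sending one column of a row-major array without packing it first. The array sizes and the two-rank setup are just for illustration (run with -np 2).

/* column_send.c -- hypothetical example, not part of the original deck */
#include <stdio.h>
#include <mpi.h>

#define NX 4   /* rows    (illustrative sizes) */
#define NY 8   /* columns                      */

int main (int argc, char * argv[])
{
    float a[NX][NY], col[NX];
    int i, j, rank;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* NX blocks of 1 float, a stride of NY floats apart = one column of a. */
    MPI_Type_vector(NX, 1, NY, MPI_FLOAT, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (i = 0; i < NX; i++)
            for (j = 0; j < NY; j++)
                a[i][j] = (float)(i * NY + j);
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);   /* send column 2, unpacked */
    } else if (rank == 1) {
        MPI_Recv(col, NX, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (i = 0; i < NX; i++) printf("col[%i] = %g\n", i, col[i]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}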
Topics
Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
Optimizing MPI code
Take time on the algorithm & data layout
  Minimize traffic between nodes / separate the problem (FTx into xKyKz in the SENSE example)
  Cache-friendly (linear, efficient) access patterns
Overlap processing and communication
  MPI_Isend() / MPI_Irecv() with multiple work buffers
  While actively transferring one, process the other
  Larger messages will hit a higher BW (in general)
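A minimal sketch of that double-buffering pattern, not from the talk: rank 0 streams blocks to rank 1, and rank 1 posts the receive for block b+1 before working on block b, so the transfer and the computation overlap. process_block() and the sizes are hypothetical stand-ins; run with -np 2.

/* overlap.c -- hypothetical example, not part of the original deck */
#include <stdio.h>
#include <mpi.h>

#define BLOCK   4096   /* hypothetical block size (floats) */
#define NBLOCKS 8      /* hypothetical number of blocks    */

/* Hypothetical stand-in for the real per-block processing. */
static float process_block(const float *buf, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++) sum += buf[i];
    return sum;
}

int main (int argc, char * argv[])
{
    static float bufA[BLOCK], bufB[BLOCK], out[BLOCK];
    float *work = bufA, *next = bufB, *tmp, total = 0.0f;
    int b, i, rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* producer: send NBLOCKS blocks to rank 1 */
        for (b = 0; b < NBLOCKS; b++) {
            for (i = 0; i < BLOCK; i++) out[i] = (float)(b + i);
            MPI_Send(out, BLOCK, MPI_FLOAT, 1, b, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {                 /* consumer: overlap receive of b+1 with work on b */
        MPI_Recv(work, BLOCK, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (b = 0; b < NBLOCKS; b++) {
            if (b + 1 < NBLOCKS)            /* post the next receive before computing */
                MPI_Irecv(next, BLOCK, MPI_FLOAT, 0, b + 1, MPI_COMM_WORLD, &req);
            total += process_block(work, BLOCK);   /* compute while the transfer proceeds */
            if (b + 1 < NBLOCKS) {
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                tmp = work; work = next; next = tmp;   /* swap work buffers */
            }
        }
        printf("rank 1 processed %i blocks; total = %g\n", NBLOCKS, total);
    }

    MPI_Finalize();
    return 0;
}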
Other MPI / performance thoughts
Profile
  VTune (Intel; Linux / Windows)
  Shark (Mac)
  MPI profiling with -mpe=mpilog
Avoid "premature optimization" (Knuth)
  Implementation time & effort vs. runtime performance
Use derived datatypes rather than packing
Using a debugger with MPI is hard
  Build in your own debugging messages from the start
Conclusion
If you might need MPI, build to MPI.
  Works well in shared memory environments
    It's getting better all the time
  Encourages memory locality in NUMA architectures (Nehalem, AMD)
  Portable, reusable, open-source
Can be used in conjunction with threads / OpenMP / TBB / CUDA / OpenCL
  The "hybrid model of parallel programming" (a small sketch follows)
The messaging paradigm can create "less obfuscated" code than threads / OpenMP
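A minimal sketch of that hybrid model, not from the talk: one MPI rank per node with OpenMP threads inside it (compile with something like mpicc -fopenmp; MPI_THREAD_FUNNELED is sufficient when only the main thread makes MPI calls).

/* hybrid.c -- hypothetical example, not part of the original deck */
#include <stdio.h>
#include <omp.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Threaded work on this rank's local data would go here. */
        printf("rank %i, thread %i of %i\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Only the main thread communicates (funneled model). */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}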
Building a cluster
Homogeneous nodes
Private network
  Shared filesystem; ssh communication; password-less SSH
High-bandwidth private interconnect
  MPI communication exclusively
  GbE, 10GbE, Infiniband
Consider using Rocks
  CentOS / RHEL based
  Built for building clusters
  Rapid network-boot based install/reinstall of nodes
  http://www.rocksclusters.org/
References
MPI documents: http://www.mpi-forum.org/docs/
MPICH2: http://www.mcs.anl.gov/research/projects/mpich2 ; list: http://lists.mcs.anl.gov/pipermail/mpich-discuss/
OpenMPI: http://www.open-mpi.org/ ; list: http://www.open-mpi.org/community/lists/ompi.php
MVAPICH[1|2] (Infiniband-tuned distribution): http://mvapich.cse.ohio-state.edu/ ; list: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/
Rocks: http://www.rocksclusters.org/ ; list: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/
Books:
  Pacheco, Peter S., Parallel Programming with MPI
  Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
  Gropp, W., Using MPI-2
Questions?
SIMD Example (Transparency painting)
Transparency painting
This is the painting operation for one RGBA pixel (in) onto another (out)
We can do red and blue together, as we know they won’t collide, and we can mask out the unwanted results.
Post-multiply masks are applied in the shifted position to minimize the number of shift operations
Note: we’re using pre-multiplied colors & painting onto an opaque background
#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

inline void blendPreToStatic(const uint32_t & in, uint32_t & out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x00000080u) ++alpha;
    out = A | RGB & (in + ( ( (alpha * (out & RB) & RB_8OFF)
                            | (alpha * (out & G)  & G_8OFF) ) >> 8 ) );
}
Operation    0x[00][RR][00][BB] path    0x[00][00][GG][00] path
Load         in  = 0x7F008080           out = 0xFF405060
Mask         0x00400060                 0x00005000
Multiply     0x1FC02FA0                 0x0027B000
Mask         0x1F002F00                 0x00270000
OR           0x1F272F00
SHIFT        0x001F272F
ADD          0x7F1FA7AF
Mask         0x001FA7AF
OR           0xFF1FA7AF
Code Detail
Pixel layout: [R×A][G×A][B×A][1−A]
C_out = C′_2 + C′_1 · α′_2
(primes denote premultiplied colors; 2 is the incoming pixel, 1 is the existing opaque pixel, and α′_2 = 1 − A_2 is the transparency stored in the incoming pixel's alpha byte)
OUT = A | RGB & (IN + ( ( (ALPHA * (OUT & RB) & RB_8OFF) | (ALPHA * (OUT & G) & G_8OFF) ) >> 8 ) );
Vectorizing
For cases where there is no overlap between the four output pixels for four input pixels, we can use vectorized (SSE2) code
128-bit wide registers; load four 32-bit RGBA values, use the same approach as previously (R|B and G) in two registers to perform four paints at once
Vectorizing
inline void blend4PreToStatic(uint32_t ** in, uint32_t * out)   // Paints in (quad-word) onto out
{
    __m128i rb, g, a, a_, o, mask_reg;                  // Registers

    rb = _mm_loadu_si128((__m128i *) out);              // Load destination (unaligned -- may not be on a 128-bit boundary)
    a_ = _mm_load_si128((__m128i *) *in);               // We make sure the input is on a 128-bit boundary before this call
    *in += 4;
    _mm_prefetch((char*) (*in + 28), _MM_HINT_T0);      // Fetch the two-cache-line-out memory

    mask_reg = _mm_set1_epi32(0x0000FF00);              // Set green mask (x4)
    g  = _mm_and_si128(rb, mask_reg);                   // Mask to greens (x4)
    mask_reg = _mm_set1_epi32(0x00FF00FF);              // Set red and blue mask (x4)
    rb = _mm_and_si128(rb, mask_reg);                   // Mask to red and blue
    rb = _mm_slli_epi32(rb, 8);                         // << 8 ; g is already biased by 256 in 16-bit spacing

    a = _mm_srli_epi32(a_, 24);                         // >> 24 ; the four alpha values, shifted to the lower 8 bits of each word
    mask_reg = _mm_slli_epi32(a, 16);                   // << 16 ; a copy of the four alpha values, shifted to bits [16-23] of each word
    a = _mm_or_si128(a, mask_reg);                      // We now have the alpha value at both bits [0-7] and [16-23] of each word

    // These steps add one to transparency values >= 0x80
    o = _mm_srli_epi16(a, 7);                           // Now the high bit is the low bit
    a = _mm_add_epi16(a, o);

    // We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
    // to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
    // storing the upper 16 of the 32-bit result. (This is the operation that is available, so that's why
    // we're doing it in this fashion!)
    rb = _mm_mulhi_epu16(rb, a);
    g  = _mm_mulhi_epu16(g, a);
    g  = _mm_slli_epi32(g, 8);                          // Move green into the correct location.
    // R and B, both the lower 8 bits of their 16 bits, don't need to be shifted

    o = _mm_set1_epi32(0xFF000000);                     // Opaque alpha value
    o = _mm_or_si128(o, g);
    o = _mm_or_si128(o, rb);                            // o now has the background's contribution to the output color

    mask_reg = _mm_set1_epi32(0x00FFFFFF);
    g = _mm_and_si128(mask_reg, a_);                    // Removes alpha from foreground color
    o = _mm_add_epi32(o, g);                            // Add foreground and background contributions together

    _mm_storeu_si128((__m128i *) out, o);               // Unaligned store
}
Vectorizing
Vectorizing this code achieves a 3-4x speedup on the cluster
  8x 2x(3.4|3.2GHz) Xeon, 800MHz FSB
  Render a 512x512x409 (400MB) volume in:
    ~22ms (45fps) with SIMD code
    ~92ms (11fps) non-vectorized
  ~18GB/s memory throughput
  ~11 cycles / voxel vs. ~45 cycles non-vectorized
Results
MAN PAGES
MPI_Init()
MPI_Init(3)    MPI    MPI_Init(3)
NAME MPI_Init - Initialize the MPI execution environment
SYNOPSIS int MPI_Init( int *argc, char ***argv )
INPUT PARAMETERS
  argc - Pointer to the number of arguments
  argv - Pointer to the argument vector
THREAD AND SIGNAL SAFETY This routine must be called by one thread only. That thread is called the main thread and must be the thread that calls MPI_Finalize .
NOTES The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE . In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
MPI_Barrier()
MPI_Barrier(3) MPI MPI_Barrier(3)
NAME MPI_Barrier - Blocks until all processes in the communicator have reached this routine.
SYNOPSIS int MPI_Barrier( MPI_Comm comm )
INPUT PARAMETER comm - communicator (handle)
NOTES Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call.
MPI_Finalize()
MPI_Finalize(3) MPI MPI_Finalize(3)
NAME MPI_Finalize - Terminates MPI execution environment
SYNOPSIS int MPI_Finalize( void )
NOTES All processes must call this routine before exiting. The number of processes running after this routine is called is undefined; it is best not to perform much more than a return rc after calling MPI_Finalize .
MPI_Comm_size()
MPI_Comm_size(3) MPI MPI_Comm_size(3)
NAME MPI_Comm_size - Determines the size of the group associated with a communicator
SYNOPSIS int MPI_Comm_size( MPI_Comm comm, int *size )
INPUT PARAMETER comm - communicator (handle)
OUTPUT PARAMETER size - number of processes in the group of comm (integer)
MPI_Comm_rank()
MPI_Comm_rank(3) MPI MPI_Comm_rank(3)
NAME MPI_Comm_rank - Determines the rank of the calling process in the communicator
SYNOPSIS int MPI_Comm_rank( MPI_Comm comm, int *rank )
INPUT ARGUMENT comm - communicator (handle)
OUTPUT ARGUMENT rank - rank of the calling process in the group of comm (integer)
MPI_Send()
MPI_Send(3) MPI MPI_Send(3)
NAME MPI_Send - Performs a blocking send
SYNOPSIS int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
INPUT PARAMETERS
  buf - initial address of send buffer (choice)
  count - number of elements in send buffer (nonnegative integer)
  datatype - datatype of each send buffer element (handle)
  dest - rank of destination (integer)
  tag - message tag (integer)
  comm - communicator (handle)
NOTES This routine may block until the message is received by the destination process.
MPI_Recv()
MPI_Recv(3)    MPI    MPI_Recv(3)
NAME MPI_Recv - Blocking receive for a message
SYNOPSIS int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
OUTPUT PARAMETERS
  buf - initial address of receive buffer (choice)
  status - status object (Status)

INPUT PARAMETERS
  count - maximum number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element (handle)
  source - rank of source (integer)
  tag - message tag (integer)
  comm - communicator (handle)
NOTES The count argument indicates the maximum length of a message; the actual length of the message can be determined with MPI_Get_count .
MPI_Isend()
MPI_Isend(3) MPI MPI_Isend(3)
NAME MPI_Isend - Begins a nonblocking send
SYNOPSIS int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
  buf - initial address of send buffer (choice)
  count - number of elements in send buffer (integer)
  datatype - datatype of each send buffer element (handle)
  dest - rank of destination (integer)
  tag - message tag (integer)
  comm - communicator (handle)
OUTPUT PARAMETER request - communication request (handle)
MPI_Irecv()
MPI_Irecv(3) MPI MPI_Irecv(3)
NAME MPI_Irecv - Begins a nonblocking receive
SYNOPSIS int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
  buf - initial address of receive buffer (choice)
  count - number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element (handle)
  source - rank of source (integer)
  tag - message tag (integer)
  comm - communicator (handle)
OUTPUT PARAMETER request - communication request (handle)
MPI_Bcast()
MPI_Bcast(3) MPI MPI_Bcast(3)
NAME MPI_Bcast - Broadcasts a message from the process with rank "root" to all other processes of the communicator
SYNOPSIS int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
INPUT/OUTPUT PARAMETER buffer - starting address of buffer (choice)
INPUT PARAMETERS
  count - number of entries in buffer (integer)
  datatype - data type of buffer (handle)
  root - rank of broadcast root (integer)
  comm - communicator (handle)
MPI_Allreduce()
MPI_Allreduce(3) MPI MPI_Allreduce(3)
NAME MPI_Allreduce - Combines values from all processes and distributes the result back to all processes
SYNOPSIS int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
INPUT PARAMETERS
  sendbuf - starting address of send buffer (choice)
  count - number of elements in send buffer (integer)
  datatype - data type of elements of send buffer (handle)
  op - operation (handle)
  comm - communicator (handle)
OUTPUT PARAMETER recvbuf - starting address of receive buffer (choice)
MPI_Type_create_hvector()
MPI_Type_create_hvector(3) MPI MPI_Type_create_hvector(3)
NAME MPI_Type_create_hvector - Create a datatype with a constant stride given in bytes
SYNOPSIS int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
INPUT PARAMETERS
  count - number of blocks (nonnegative integer)
  blocklength - number of elements in each block (nonnegative integer)
  stride - number of bytes between start of each block (address integer)
  oldtype - old datatype (handle)
OUTPUT PARAMETER newtype - new datatype (handle)
mpicc
mpicc(1)    MPI    mpicc(1)
NAME mpicc - Compiles and links MPI programs written in C
DESCRIPTION This command can be used to compile and link MPI programs written in C. It provides the options and any special libraries that are needed to compile and link MPI programs.
It is important to use this command, particularly when linking programs, as it provides the necessary libraries.
COMMAND LINE ARGUMENTS
  -show - Show the commands that would be used without running them
  -help - Give short help
  -cc=name - Use compiler name instead of the default choice. Use this only if the compiler is compatible with the MPICH library (see below)
  -config=name - Load a configuration file for a particular compiler. This allows a single mpicc command to be used with multiple compilers.
[…]
mpiexec
mpiexec(1)    MPI    mpiexec(1)
NAME mpiexec - Run an MPI program
SYNOPSIS mpiexec args executable pgmargs [ : args executable pgmargs ... ]
where args are command line arguments for mpiexec (see below), executable is the name of an executable MPI program, and pgmargs are command line arguments for the executable. Multiple executables can be specified by using the colon notation (for MPMD - Multiple Program Multiple Data applications). For example, the following command will run the MPI program a.out on 4 processes: mpiexec -n 4 a.out
The MPI standard specifies the following arguments and their meanings:
  -n <np> - Specify the number of processes to use
  -host <hostname> - Name of host on which to run processes
  -arch <architecture name> - Pick hosts with this architecture type
[…]