Post on 23-Sep-2015
Context and Background
- IIT-Madras has recently added a good deal of compute power.
Why
- Further R&D in sciences and engineering
- Provide computing services to the region
- Create new opportunities in education and skills
Why this course
- Update skills to program modern cluster computers
Length: 2 theory and 2 practice sessions, 4 hrs each
Audience Check
Contents
This is not a function-by-function tour of MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Bcast, MPI_Comm_create, MPI_Sendrecv, MPI_Scatter, MPI_Gather, ... Instead we:
- Understand issues
- Understand concepts
- Learn enough to pick up the rest from the manual
- Go by motivating examples
- Try out some of the examples
Outline
- Sequential vs parallel programming
- Shared vs distributed memory
- Parallel work breakdown models
- Communication vs computation
- MPI examples
- MPI concepts
- The role of IO
Sequential vs Parallel
We are used to sequential programming: C, Java, C++, etc. E.g., bubble sort, binary search, Strassen multiplication, FFT, BLAST.
- Main idea: specify the steps in perfect order.
- Reality: we are used to parallelism a lot more than we think, as a concept, though not for programming.
- Methodology: launch a set of tasks; communicate to make progress. E.g., sort 500 answer papers by making 5 equal piles, having them sorted by 5 people, and merging the piles together.
Shared vs Distributed Memory Programming
- Shared memory: all tasks access the same memory, hence the same data. E.g., pthreads.
- Distributed memory: all memory is local. Data sharing is by explicitly transporting data from one task to another (send-receive pairs in MPI, e.g.).
HW-programming model relationship: tasks vs CPUs; SMPs vs clusters.
[Diagram: programs with their memories connected by a communications channel]
Designing Parallel Programs
Simple Parallel Program: sorting numbers in a large array A
- Notionally divide A into 5 pieces: [0..99], [100..199], [200..299], [300..399], [400..499].
- Each piece is sorted by an independent sequential algorithm and left within its region.
- The resultant parts are merged by simply reordering among adjacent parts.
What is different? Think about:
- How many people are doing the work (degree of parallelism)
- What is needed to begin the work (initialization)
- Who does what (work distribution)
- Access to each work part (data/IO access)
- Whether they need info from each other to finish their own job (communication)
- When they are all done (synchronization)
- What needs to be done to collate the result
Work Break-down
- Prefer simple, intuitive breakdowns into a parallel algorithm.
- Highly optimized sequential algorithms are usually not easily parallelizable.
- Breaking up work often involves some pre- or post-processing (much like divide and conquer).
- Fine vs large grain parallelism, and its relationship to communication.
Digression: let's get a simple MPI program to work.

#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int total_size, my_rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("\n Total number of processes = %d, out of which rank of this process is %d\n",
           total_size, my_rank);
    MPI_Finalize();
    return 0;
}
Getting it to work
Compile it:
    mpicc -o simple simple.c
    # If you want HP-MPI, set your path to include /opt/hpmpi/bin
Run it (this depends a bit on the system):
    mpirun -np 2 simple
    qsub -l ncpus=2 -o simple.out /opt/hpmpi/bin/mpirun ./simple
[Fun: qsub -l ncpus=2 -I hostname]
Results are in the output file.
What is mpirun? What does qsub have to do with MPI? ... More about qsub in a separate talk.
What goes on
- The same program is run at the same time on 2 different CPUs.
- Each instance differs slightly, in that some simple calls like MPI_Comm_rank return different values in each.
- This gives each instance its identity.
- We can make different instances run different pieces of code based on this identity difference.
- Typically this is an SPMD (single program, multiple data) model of computation.
Continuing work breakdown
Simple example: find shortest path distances.
[Figure: a directed graph on 5 nodes with weighted edges]
- Let nodes be numbered 0, 1, ..., n-1.
- Let us put all of this in a matrix: A[i][j] is the distance from i to j ('..' means no direct edge).

       0   2   1  ..   6
       7   0  ..  ..  ..
       1   5   0   2   3
      ..  ..   2   0   2
      ..  ..  ..  ..   0

PROBLEM: find the shortest path distances between every pair of nodes.
Dividing the work
Each task gets about n/p rows, with the last possibly getting a little more.
[Figure: the matrix divided into row blocks T0 .. Tq, each of about n/p rows, with the i-th row and the k-th row highlighted. Remember the observation: in iteration k, updating any row i needs row k.]
The MPI model
- Recall: MPI tasks are typically created when the job is launched, not inside the MPI program (no forking).
- mpirun usually creates the task set: mpirun -np 2 a.out runs a.out on all nodes, and a communication channel is set up between them.
- Functions allow tasks to find out the size of the task group and their own position within the group.
MPI Notions [taking from the example]
- Communicator: a group of tasks in a program
- Rank: each task's ID in the group; MPI_Comm_rank() /* use this to set id */
- Size: of the group; MPI_Comm_size() /* use this to set p */
- Notion of send/receive/broadcast; MPI_Bcast() /* use this to broadcast rowk[] */
For actual syntax use a good MPI book or manualOnline resource: http://www-unix.mcs.anl.gov/mpi/www/
MPI Prologue to our Floyd example

int a[MAX][MAX];
int n = 20;    /* real size of the matrix; can be read in */
int id, p;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
...
/* This is where all the real work happens */
...
MPI_Finalize();    /* Epilogue */
This is the time to try out several simple MPI programs using the few functions we have seen.
- use mpicc
- use mpirun
Visualizing the execution
- The job is launched; tasks are placed on CPUs. Multiple tasks may be on the same node; the scheduler ensures 1 task per CPU.
- MPI_Init, MPI_Comm_rank, MPI_Comm_size, etc.
- Other initializations, like reading in the array.
- For initial values of k, the task with rank 0 broadcasts row k; the others receive.
- For each value of k, tasks do their computation with the correct rowk.
- Loop the above for all values of k.
- Task 0 receives all blocks of the final array and prints them out.
- MPI_Finalize.
Communication vs Computation
- Often communication is needed between iterations to complete the work.
- Often, the more tasks there are, the more communication there is. In Floyd, a bigger p means rowk must be sent to a larger number of tasks.
- If each iteration depends on more data, the network can get very busy. This may mean network contention, i.e., delays.
- Try counting the number of 'a's in a string; plot time vs p.
- This is why, for a fixed problem size, increasing the number of CPUs does not continually increase performance.
- This needs experimentation; it is problem specific.
Communication primitives
MPI_Send(sendbuffer, senddatalength, datatype, destination, tag, communicator);
    e.g., MPI_Send("Hello", strlen("Hello"), MPI_CHAR, 2, 100, MPI_COMM_WORLD);
MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
Sends and receives happen in pairs.
Collectives
Broadcast is one-to-all communication. Both the receivers and the sender call the same function. All MUST call it; all end up with the SAME result.
    MPI_Bcast(buffer, count, type, root, comm);
Examples:
    MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
        Task 0 sends its integer k and all others receive it.
    MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
        current_owner_task sends rowk to all others.
Try out a simple MPI program with send-recvs and broadcasts.
Try out Floyd's algorithm.
What if you have to read a file to initialize Floyd's algorithm?
A bit more on Broadcast
All tasks execute the same call:
    MPI_Bcast(&x, 1, .., 0, ..);
    Ranks:     0  1  2
    x before:  0  1  2
    x after:   0  0  0    (everyone ends up with root 0's value)
Other useful collectives
MPI_Reduce(&values, &results, count, type, operator, root, comm);
    MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);
Task number 9 gets, in the variable res, the sum of whatever was in x in all of the tasks (including itself). Must be called by ALL tasks.
Scattering as opposed to broadcasting
MPI_Scatterv(sndbuf, sndcounts[], send_disp[], type, recvbuf, recvcount, recvtype, root, comm);
All nodes MUST call it.
[Figure: rank 0's send buffer split into pieces, one delivered to each of ranks 0-3]
Common communication pitfalls!
- Make sure that communication primitives are called by the right number of tasks.
- Make sure they are called in the right sequence.
- Make sure that you use the proper tags.
If not, you can easily get into deadlock ("My program seems to be hung").
More on work breakdown
- Finding the right work breakdown can be challenging.
- Sometimes dynamic work breakdown is good: a master (usually task 0) decides who will do what and collects the results.
- E.g., you have a huge number of 5x5 matrices to multiply (chained matrix multiplication).
- E.g., search for a substring in a huge collection of strings.
Master-slave dynamic work assignment
[Figure: master (rank 0) exchanging work and results with slaves 1-4]
Master slave example: reverse strings

Slave()
{
    char work[MAX];
    int n;
    MPI_Status stat;
    do {
        MPI_Recv(work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        n = strlen(work);
        if (n == 0) break;   /* an empty string detects the end */
        reverse(work);
        MPI_Send(work, n + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } while (1);
    MPI_Finalize();
}
Master slave example: reverse strings

Master()    /* rank 0 task */
{
    initialize_work_items();
    /* hand one item to each slave to start */
    for (i = 1; i < np; i++)
        MPI_Send(item[i-1], strlen(item[i-1]) + 1, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    /* collect a result; send the next item to whichever slave finished */
    while (items_remain()) {
        MPI_Recv(result, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &stat);
        MPI_Send(next_item(), ..., MPI_CHAR, stat.MPI_SOURCE, 0, MPI_COMM_WORLD);
    }
    /* a zero-length string tells each slave to stop */
    for (i = 1; i < np; i++)
        MPI_Send("", 1, MPI_CHAR, i, 0, MPI_COMM_WORLD);
}
Master slave example

main()
{
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    if (id == 0)
        Master();
    else
        Slave();
    ...
}
Matrix Multiply and Communication Patterns
Block Distribution of Matrices
Matrix multiply: Cij = Σk (Aik * Bkj)
BMR algorithm:
- Each task owns a block: its own part of A, B, and C. Each block is a smaller submatrix.
- The old formula holds for blocks!
Example: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
Block Distribution of Matrices (continued)
C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
- A22 is broadcast along its row.
- A22*B21 is added into C21.
- Column B_1 is rolled up one slot; our task now has B31.
- Now repeat the above, except the item to broadcast is A23.
Attempt doing this with just Send-Recv and Broadcast
Communicators and Topologies
- The BMR example shows the limitations of broadcast, although there is a pattern.
- Communicators can be created on subgroups of processes.
- Communicators can be created that have a topology:
  - Makes programming natural
  - Might improve performance by matching to hardware
for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
        T = my_A;
    } else {
        MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
    }
    my_C = my_C + T * my_B;   /* block multiply-accumulate */
    MPI_Sendrecv_replace(my_B, m*m, MPI_INT, dest, 0, source, 0, col_comm, &status);
}
Creating topologies and communicators
Creating a grid:
    int dim_sizes[2], istorus[2], canreorder;
    MPI_Comm grid_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
Dividing a grid into rows, each with its own communicator:
    MPI_Comm rowcom;
    int free[2];   /* free[i] nonzero if dimension i varies within each sub-communicator */
    MPI_Cart_sub(grid_comm, free, &rowcom);
Try implementing the BMR algorithm with communicators
A brief on other MPI topics (the last leg)
- MPI + multi-threading / OpenMP
- One-sided communication
- MPI and IO
MPI and OpenMP
- Grain; communication
- Where does the interesting "pragma omp for" fit in our MPI Floyd?
- How do I assign exactly one MPI task per CPU?
One-Sided Communication
- Has no corresponding send-recv pairs!
- RDMA: Get, Put
IO in Parallel Programs
- Typically a root task does the IO: simpler to program, and natural because some post-processing (e.g., sorting) is occasionally needed.
- All nodes generating IO requests might overwhelm the fileserver, essentially sequentializing it.
- Performance is not the limitation for Lustre/SFS.
- Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.
[Figure: MPI-BLAST exec time vs other time [4]]
[Figure: how IO/Comm optimizations help MPI-BLAST [4]]
What did we learn?
- Distributed memory programming model
- Parallel algorithm basics
- Work breakdown
- Topologies in communication
- Communication overhead vs computation
- Impact of parallel IO
What MPI calls did we see here?
MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Sendrecv_replace, MPI_Bcast, MPI_Reduce, MPI_Cart_create, MPI_Cart_sub, MPI_Scatter
References
1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn, TMH. An excellent practical book; it motivated much of the material here, specifically Floyd's algorithm.
2. The BMR algorithm for matrix multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix_mult.html
3. MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access for Parallel BLAST, IPDPS 2005.