Parallel Programming and MPI

© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. A course for IIT-M, September 2008. R Badrinath, STSD Bangalore ([email protected])


  • Context and Background: IIT-Madras has recently added a good deal of compute power. Why? Further R&D in science and engineering, computing services to the region, and new opportunities in education and skills. Why this course? To update skills for programming modern cluster computers. Length: 2 theory and 2 practice sessions, 4 hrs each.

  • Audience Check

  • Contents: This is not a function-by-function walk through MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Bcast, MPI_Comm_create, MPI_Sendrecv, MPI_Scatter, MPI_Gather. Instead we understand the issues, understand the concepts, learn enough to pick up the rest from the manual, go by motivating examples, and try out some of the examples.

  • Outline: Sequential vs parallel programming; shared vs distributed memory; parallel work breakdown models; communication vs computation; MPI examples; MPI concepts; the role of IO.

  • Sequential vs Parallel: We are used to sequential programming in C, Java, C++, etc.; e.g., bubble sort, binary search, Strassen multiplication, FFT, BLAST. The main idea: specify the steps in perfect order. In reality we are used to parallelism a lot more than we think, but as a concept, not for programming. The methodology: launch a set of tasks and communicate to make progress. E.g., sort 500 answer papers by making 5 equal piles, having them sorted by 5 people, and merging the results.

  • Shared vs Distributed Memory Programming: In shared memory, all tasks access the same memory, hence the same data (e.g., pthreads). In distributed memory, all memory is local; data sharing is by explicitly transporting data from one task to another (send-receive pairs in MPI, for example).

    HW / programming-model relationship: tasks vs CPUs; SMPs vs clusters. [Figure: programs with their own memories connected by a communications channel.]

  • Designing Parallel Programs

  • Simple parallel program: sorting numbers in a large array A. Notionally divide A into 5 pieces [0..99; 100..199; 200..299; 300..399; 400..499]. Each part is sorted by an independent sequential algorithm and left within its region.

    The resultant parts are merged by simply reordering among adjacent parts.

  • What is different? Think about: how many people are doing the work (degree of parallelism); what is needed to begin the work (initialization); who does what (work distribution); access to each work part (data/IO access); whether they need information from each other to finish their own job (communication); when they are all done (synchronization); what needs to be done to collate the result.

  • Work break-down: the parallel algorithm. Prefer simple, intuitive breakdowns. Usually highly optimized sequential algorithms are not easily parallelizable. Breaking work often involves some pre- or post-processing (much like divide and conquer). Fine vs large grain parallelism, and its relationship to communication.

  • Digression: Let's get a simple MPI program to work.

    #include <stdio.h>
    #include <mpi.h>

    int main(void)
    {
        int total_size, my_rank;

        MPI_Init(NULL, NULL);
        MPI_Comm_size(MPI_COMM_WORLD, &total_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        printf("\n Total number of programs = %d, out of which rank of this process is %d\n",
               total_size, my_rank);
        MPI_Finalize();
        return 0;
    }

  • Getting it to work. Compile it:

    mpicc -o simple simple.c   # If you want HP-MPI, set your path to /opt/hpmpi/bin

    Run it; this depends a bit on the system:

    mpirun -np 2 ./simple
    qsub -l ncpus=2 -o simple.out /opt/hpmpi/bin/mpirun ./simple

    [Fun: qsub -l ncpus=2 -I, then run hostname]

    Results are in the output file. What is mpirun? What does qsub have to do with MPI? ... More about qsub in a separate talk.

  • What goes on: The same program is run at the same time on 2 different CPUs. Each instance is slightly different in that it gets different values from some simple calls like MPI_Comm_rank. This gives each instance its identity, and we can make different instances run different pieces of code based on this identity difference. Typically this is the SPMD model of computation.

  • Continuing work breakdown. Simple example: find shortest distances in a weighted graph. Let the nodes be numbered 0, 1, ..., n-1, and let us put the edge weights in a matrix: A[i][j] is the distance from i to j (".." marks a missing edge). [Figure: a 5-node example graph and its distance matrix.] PROBLEM: find the shortest path distances between all pairs of nodes.

  • Floyd's (sequential) algorithm: a triple loop over k, i, j that relaxes every pair (i, j) through every intermediate node k, with the k loop outermost, as sketched below.
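
    A minimal sketch of the loop, assuming a[][] already holds the direct distances (with a large value for missing edges):

    /* Sequential Floyd-Warshall on an n x n distance matrix a[][]. */
    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                if (a[i][k] + a[k][j] < a[i][j])
                    a[i][j] = a[i][k] + a[k][j];
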
  • Parallelizing Floyd: Actually we just need n² tasks, with each task iterating n times (once for each value of k). After each iteration we need to make sure everyone sees the updated matrix. Ideal for shared-memory programming. What if we have fewer than n² tasks? Say p of them.
  • Dividing the work: Each task gets [n/p] rows, with the last possibly getting a little more.

    [Figure: the matrix split row-wise among tasks T0..Tq; task q starts at row q x [n/p]. Remember the observation: to update its i-th row, a task needs the k-th row.]

  • Parallel Floyd, per task: id is the TASK NUMBER and each task holds only the part of A that it owns. For each k, the owner of row k broadcasts that row; every task then updates its own rows. Approximate code is sketched below.
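
    A minimal sketch, assuming for simplicity that p divides n, that each task owns local_rows = n/p consecutive rows stored in a local array a[][], and that rowk[] is a scratch buffer; these names are illustrative, not from the slides:

    for (k = 0; k < n; k++) {
        int owner = k / (n / p);              /* task that owns global row k   */
        if (id == owner) {
            int local_k = k - owner * (n / p);
            for (j = 0; j < n; j++)
                rowk[j] = a[local_k][j];      /* copy my row k into the buffer */
        }
        MPI_Bcast(rowk, n, MPI_INT, owner, MPI_COMM_WORLD);
        for (i = 0; i < local_rows; i++)      /* relax my rows through node k  */
            for (j = 0; j < n; j++)
                if (a[i][k] + rowk[j] < a[i][j])
                    a[i][j] = a[i][k] + rowk[j];
    }
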
  • The MPI model [taking from the example]: Recall that MPI tasks are typically created when the job is launched, not inside the MPI program (no forking). mpirun usually creates the task set: mpirun -np 2 a.out runs a.out on all the nodes and sets up a communication channel between them. Functions allow tasks to find out the size of the task group and their own position within the group.

  • MPI notions [taken from the example]. Communicator: a group of tasks in a program. Rank: each task's ID in the group; MPI_Comm_rank() /* use this to set id */. Size: of the group; MPI_Comm_size() /* use this to set p */. Notions of send/receive/broadcast; MPI_Bcast() /* use this to broadcast rowk[] */.

    For the actual syntax use a good MPI book or the manual. Online resource: http://www-unix.mcs.anl.gov/mpi/www/

  • MPI prologue to our Floyd example:

    int a[MAX][MAX];
    int n = 20;              /* real size of the matrix, can be read in */
    int id, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    ...
    /* This is where all the real work happens */
    ...
    MPI_Finalize();          /* Epilogue */

  • This is the time to try out several simple MPI programs using the few functions we have seen. Use mpicc; use mpirun.

  • Visualizing the execution: The job is launched; tasks land on CPUs (multiple tasks may be on the same node; the scheduler ensures 1 task per CPU). MPI_Init, MPI_Comm_rank, MPI_Comm_size, etc. Other initializations, like reading in the array. For the initial values of k, the task with rank 0 broadcasts row k and the others receive it; for each value of k, the tasks do their computation with the correct rowk; loop over all values of k. Task 0 receives all blocks of the final array and prints them out. MPI_Finalize.

  • Communication vs Computation: Often communication is needed between iterations to complete the work, and often the more tasks there are, the more communication there is. In Floyd, a bigger p means rowk is sent to a larger number of tasks. If each iteration depends on more data, the network can get very busy; this may mean contention, i.e., delays. Try counting the number of 'a's in a string: time vs p. This is why, for a fixed problem size, increasing the number of CPUs does not continually increase performance. This needs experimentation; it is problem specific.

  • Communication primitives:

    MPI_Send(sendbuffer, senddatalength, datatype, destination, tag, communicator);
    MPI_Send("Hello", strlen("Hello")+1, MPI_CHAR, 2, 100, MPI_COMM_WORLD);
    MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);

    Sends and receives happen in pairs.
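
    A minimal sketch of a matching pair, assuming <string.h> is included, at least two tasks are running, and my_rank was set with MPI_Comm_rank (the buffer size and tag are illustrative):

    char msg[64];
    MPI_Status status;
    if (my_rank == 0) {
        strcpy(msg, "Hello");
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Recv(msg, 64, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
        printf("rank 1 got: %s\n", msg);
    }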

  • Collectives: Broadcast is one-to-all communication. Both the receivers and the sender call the same function; all MUST call it, and all end up with the SAME result.

    MPI_Bcast(buffer, count, type, root, comm);

    Examples: MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD); task 0 sends its integer k and all others receive it. MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD); current_owner_task sends rowk to all others.

  • Try out a simple MPI program with send-recvs and broadcasts.

    Try out Floyd's algorithm. What if you have to read a file to initialize Floyd's algorithm?

  • A bit more on broadcast: every rank makes the same call, MPI_Bcast(&x, 1, .., 0, ..). [Figure: ranks 0, 1, 2 start with x = 0, 1, 2 respectively; after the broadcast from root 0, x is 0 on every rank.]

  • Other useful collectives:

    MPI_Reduce(&values, &results, count, type, operator, root, comm);
    MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 9, MPI_COMM_WORLD);

    Task number 9 gets, in the variable res, the sum of whatever was in x in all of the tasks (including itself). Must be called by ALL tasks.
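
    A minimal usage sketch, reducing to root 0 instead of 9 so it also works with just two tasks (the values are illustrative; my_rank and p come from MPI_Comm_rank and MPI_Comm_size):

    int x = my_rank;          /* each task contributes its own rank */
    int res = 0;
    MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("sum of ranks = %d\n", res);   /* 0 + 1 + ... + (p-1) */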

  • Scattering, as opposed to broadcasting:

    MPI_Scatterv(sndbuf, sndcounts[], send_disp[], sendtype, recvbuf, recvcount, recvtype, root, comm);

    All tasks MUST call it. [Figure: rank 0 holds the whole buffer, and each of ranks 0-3 receives its own piece.]
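
    Where every piece is the same size, the simpler MPI_Scatter is enough. A minimal sketch that hands each task one block of rows of a matrix; ROWS_PER_TASK and the array names are illustrative, not from the slides:

    int sendbuf[MAX][MAX];                 /* full matrix, meaningful on root only */
    int myrows[ROWS_PER_TASK][MAX];        /* each task's block of rows            */
    MPI_Scatter(sendbuf, ROWS_PER_TASK * MAX, MPI_INT,
                myrows,  ROWS_PER_TASK * MAX, MPI_INT,
                0, MPI_COMM_WORLD);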

  • Common communication pitfalls!! Make sure that communication primitives are called by the right number of tasks; make sure they are called in the right sequence; make sure that you use the proper tags. If not, you can easily get into deadlock ("My program seems to be hung").

  • More on work breakdown: Finding the right work breakdown can be challenging. Sometimes a dynamic work breakdown is good: a master (usually task 0) decides who will do what and collects the results. E.g., you have a huge number of 5x5 matrices to multiply (chained matrix multiplication), or you search for a substring in a huge collection of strings.

  • Master-slave dynamic work assignment. [Figure: master (task 0) handing out work to slaves 1-4 and collecting results.]

  • Master-slave example (reverse strings), the slave side. Each slave repeatedly receives a string from the master (rank 0), reverses it, and sends it back:

    Slave()
    {
        char work[MAX];
        MPI_Status stat;
        int n;
        do {
            MPI_Recv(work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
            n = strlen(work);
            if (n == 0) break;                 /* detecting the end */
            reverse(work);
            MPI_Send(work, n + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } while (1);
        MPI_Finalize();
    }

  • Master-slave example (reverse strings), the master side: rank 0 initializes the work items, sends one item to each slave, then repeatedly receives a result and hands the next item to whichever slave just answered (stat.MPI_SOURCE), finally sending empty strings to shut the slaves down. A sketch follows below.
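
    A minimal sketch of that loop, assuming work_items[] is an array of num_items char[MAX] buffers, np is the task count, and there are at least as many items as slaves; these names are illustrative, not from the slides:

    Master()                                   /* rank 0 task */
    {
        char result[MAX];
        MPI_Status stat;
        int i, next = 0;
        initialize_work_items();
        for (i = 1; i < np; i++)                             /* prime each slave  */
            MPI_Send(work_items[next++], MAX, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        while (next < num_items) {                           /* farm out the rest */
            MPI_Recv(result, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &stat);
            MPI_Send(work_items[next++], MAX, MPI_CHAR, stat.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
        for (i = 1; i < np; i++) {                           /* collect and stop  */
            MPI_Recv(result, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &stat);
            MPI_Send("", 1, MPI_CHAR, stat.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    }
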

  • Master-slave example, the dispatch in main():

    main()
    {
        ...
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        if (id == 0)
            Master();
        else
            Slave();
        ...
    }

  • Matrix Multiply and Communication Patterns

  • Block distribution of matrices. Matrix multiply: Cij = Σk (Aik * Bkj). BMR algorithm:

    Each task owns a block (its own part) of A, B and C. The old formula holds for blocks! Example: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31.

    Each term is a smaller block, a submatrix.

  • Block distribution of matrices. Matrix multiply: Cij = Σk (Aik * Bkj). BMR algorithm, one step for C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31:

    A22 is broadcast along its row of tasks; A22*B21 is added into C21; the B blocks in that column are rolled up one slot, so our task now has B31. Now repeat the above, except that the item to broadcast is A23.

  • Attempt doing this with just Send-Recv and Broadcast

  • Communicators and topologies: The BMR example shows the limitations of plain broadcast, although there is a pattern. Communicators can be created on subgroups of processes, and communicators can be created that have a topology. This makes programming more natural and might improve performance by matching the communication pattern to the hardware.

  • The BMR loop with row and column communicators (pseudocode; the block copy and block multiply are written as plain assignments):

    for (k = 0; k < s; k++) {
        sender = (my_row + k) % s;
        if (sender == my_col) {
            MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
            T = my_A;                        /* my block is the one broadcast  */
        } else {
            MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
        }
        my_C = my_C + T x my_B;              /* block multiply-accumulate      */
        MPI_Sendrecv_replace(my_B, m*m, MPI_INT, dest, 0,
                             source, 0, col_comm, &status);   /* roll B up the column */
    }

  • Creating topologies and communicators. Creating a grid:

    int dim_sizes[2], istorus[2], canreorder;
    MPI_Comm grid_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);

    Divide the grid into rows, each with its own communicator:

    MPI_Comm row_comm;
    int free_coords[2];
    MPI_Cart_sub(grid_comm, free_coords, &row_comm);
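
    A minimal sketch of how those calls might be filled in for an s x s torus of tasks, as BMR needs; the concrete values and the MPI_Comm_rank/MPI_Cart_coords calls are assumptions, not from the slides:

    int dim_sizes[2]  = { s, s };        /* s x s grid of tasks             */
    int istorus[2]    = { 1, 1 };        /* wrap around in both dimensions  */
    int canreorder    = 1;               /* let MPI reorder ranks           */
    int grid_rank, my_coords[2], free_coords[2];
    MPI_Comm grid_comm, row_comm, col_comm;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, my_coords);    /* my (row, col) */

    free_coords[0] = 0; free_coords[1] = 1;   /* vary the column: a row communicator    */
    MPI_Cart_sub(grid_comm, free_coords, &row_comm);

    free_coords[0] = 1; free_coords[1] = 0;   /* vary the row: a column communicator    */
    MPI_Cart_sub(grid_comm, free_coords, &col_comm);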

  • Try implementing the BMR algorithm with communicators

  • A brief look at other MPI topics, the last leg: MPI + multi-threading / OpenMP; one-sided communication; MPI and IO.

  • MPI and OpenMP: grain size and communication. Where does the interesting "pragma omp for" fit in our MPI Floyd? How do I assign exactly one MPI task per CPU?
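
    One plausible answer, sketched here as an assumption rather than taken from the slides: the local update in the parallel Floyd code is independent across rows, so it can be split among OpenMP threads inside each MPI task.

    /* Inside the k-loop of the MPI Floyd sketch, after the MPI_Bcast of rowk[]: */
    #pragma omp parallel for private(j)
    for (i = 0; i < local_rows; i++)
        for (j = 0; j < n; j++)
            if (a[i][k] + rowk[j] < a[i][j])
                a[i][j] = a[i][k] + rowk[j];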

  • One-sided communication: there are no corresponding send-recv pairs! RDMA; Get; Put.
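
    A minimal sketch of the idea using MPI-2 windows; this code is an illustration, not from the course. Each task exposes an integer, and rank 0 puts a value directly into rank 1's memory with no matching receive:

    int buf = my_rank;                    /* memory exposed to other tasks   */
    int val = 42;                         /* value rank 0 will put           */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (my_rank == 0)
        MPI_Put(&val, 1, MPI_INT, 1 /* target rank */, 0 /* displacement */,
                1, MPI_INT, win);
    MPI_Win_fence(0, win);                /* after this, rank 1's buf == 42  */
    MPI_Win_free(&win);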

  • IO in parallel programs: Typically a root task does the IO; this is simpler to program and natural because some post-processing (e.g., sorting) is occasionally needed anyway. All nodes generating IO requests might overwhelm the fileserver, essentially serializing the IO. Performance is not the limitation for Lustre/SFS. Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.
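
    A minimal MPI-IO sketch, assuming each task writes its own block of rows at a rank-dependent offset into one shared file; the filename and the variables (my_rank, local_rows, n, a) are illustrative:

    MPI_File fh;
    MPI_Offset offset = (MPI_Offset)my_rank * local_rows * n * sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "result.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, offset, a, local_rows * n, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);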

  • MPI-BLAST execution time vs. other time [4].

  • How IO/communication optimizations help MPI-BLAST [4].

  • What did we learn? The distributed-memory programming model; parallel algorithm basics; work breakdown; topologies in communication; communication overhead vs computation; the impact of parallel IO.

  • What MPI calls did we see here? MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Sendrecv_replace, MPI_Bcast, MPI_Reduce, MPI_Cart_create, MPI_Cart_sub, MPI_Scatter, MPI_Scatterv.

  • References:
    [1] Parallel Programming in C with MPI and OpenMP, M. J. Quinn, TMH. An excellent practical book; it motivated much of the material here, specifically Floyd's algorithm.
    [2] The BMR algorithm for matrix multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix_mult.html
    [3] MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
    [4] Efficient Data Access for Parallel BLAST, IPDPS 2005.