An Introduction to Parallel Programming with MPI
March 22, 24, 29, 31, 2005
David Adams
Outline
Disclaimers
Overview of basic parallel programming on a cluster with the goals of MPI
Batch system interaction
Startup procedures
Blocking message passing
Non-blocking message passing
Collective communications
Disclaimers
I do not have all the answers.
Completion of this short course will give you enough tools to begin making use of MPI. It will not “automagically” allow your code to run on a parallel machine simply by logging in.
Some codes are easier to parallelize than others.
The Goals of MPI
Design an application programming interface.
Allow efficient communication.
Allow for implementations that can be used in a heterogeneous environment.
Allow convenient C and Fortran 77 bindings.
Provide a reliable communication interface.
Portable.
Thread safe.
Message Passing Paradigm
[Diagram: six processors (P1–P6) communicating through a network]
Message Passing Paradigm
Conceptually, all processors communicate through messages (even though some may share memory space).
Low-level details of message transport are handled by MPI and are invisible to the user.
Every processor is running the same program but will take different logical paths determined by self processor identification (Who am I?).
Programs are written, in general, for an arbitrary number of processors, though they may be more efficient on specific numbers (powers of 2?).
Distributed Memory and I/O Systems
The cluster machines available at Virginia Tech are distributed-memory, distributed-I/O systems.
Each node (processor pair) has its own memory and local hard disk.
Allows asynchronous execution of multiple instruction streams.
Heavy disk I/O should be delegated to the local disk instead of across the network, and minimized as much as possible.
While getting your program running, another goal to keep in mind is to see that it makes good use of the hardware available to you.
What does “good use” mean?
Speedup
The speedup achieved by a parallel algorithm running on p processors is the ratio between the time taken by that parallel computer executing the fastest serial algorithm and the time taken by the same parallel computer executing the parallel algorithm using p processors.
- Designing Efficient Algorithms for Parallel Computers, Michael J. Quinn
Speedup
Sometimes a “fastest serial version” of the code is unavailable.
The speedup of a parallel algorithm can be measured based on the speed of the parallel algorithm run serially, but this gives an unfair advantage to the parallel code, as the inefficiencies of making the code parallel will also appear in the serial version.
Speedup Example
Our really_big_code01 executes on a single processor in 100 hours.
The same code on 10 processors takes 10 hours.
100 hrs. / 10 hrs. = 10 = speedup.
When speedup = p it is called ideal (or perfect) speedup.
Speedup by itself is not very meaningful. A speedup of 10 may sound good (We are solving the problem 10 times as fast!), but what if we were using 1000 processors to get that number?
Efficiency
The efficiency of a parallel algorithm running on p processors is the speedup divided by p.
- Designing Efficient Algorithms for Parallel Computers, Michael J. Quinn
From our last example:
when p = 10 the efficiency is 10/10 = 1 (great!),
when p = 1000 the efficiency is 10/1000 = 0.01 (bad!).
Speedup and efficiency give us an idea of how well our parallel code is making use of the available resources.
Concurrency
The first step in parallelizing any code is to identify the types of concurrency found in the problem itself (not necessarily the serial algorithm).
Many parallel algorithms show few resemblances to the (fastest known) serial version they are compared to and sometimes require an unusual perspective on the problem.
Concurrency
Consider the problem of finding the sum of n integer values. A sequential algorithm may look something like this:
BEGIN
  sum = A0
  FOR i = 1 TO n - 1 DO
    sum = sum + Ai
  ENDFOR
END
Concurrency
Suppose n = 4. Then the additions would be done in a precise order as follows:
[(A0 + A1) + A2] + A3
Without any insight into the problem itself we might assume that the process is completely sequential and cannot be parallelized.
Of course, we know that addition is associative (mostly). The same expression could be written as:
(A0 + A1) + (A2 + A3)
By using our insight into the problem of addition we can exploit the inherent concurrency of the problem and not the algorithm.
Communication is Slow
Continuing our example of adding n integers, we may want to parallelize the process to exploit as much concurrency as possible. We call on the services of Clovus the Parallel Guru.
Let n = 128.
Clovus divides the integers into pairs and distributes them to 64 processors, maximizing the concurrency inherent in the problem.
The solutions to the 64 sub-problems are distributed to 32 processors, those 32 to 16, etc.
Communication Overhead
Suppose it takes t units of time to perform a floating-point addition.
Suppose it takes 100t units of time to pass a floating-point number from one processor to another.
The entire calculation on a single processor would take 127t time units.
Using the maximum number of processors possible (64), Clovus finds the sum of the first set of pairs in 101t time units. Further steps for 32, 16, 8, 4, and 2 follow to obtain the final solution.
(64)    (32)    (16)    (8)     (4)     (2)
101t + 101t + 101t + 101t + 101t + 101t = 606t total time units
Parallelism and Pipelining to Achieve Concurrency
There are two primary ways to achieve concurrency in an algorithm.
Parallelism
The use of multiple resources to increase concurrency.
Partitioning.
Example: Our summation problem.
Pipelining
Dividing the computation into a number of steps that are repeated throughout the algorithm.
An ordered set of segments in which the output of each segment is the input of its successor.
Example: Automobile assembly line.
Examples (Jacobi style update)
Imagine we have a cellular automaton that we want to parallelize.
[Diagram: grid of site cells; each cell's neighbors are labeled North, NorthEast, East, SouthEast, South, SouthWest, West, NorthWest]
Examples
We try to distribute the rows evenly between two processors.
[Diagram: the same grid, divided by rows between two processors]
Examples
Columns seem to work better for this problem.
[Diagram: the same grid, divided by columns]
Examples
Minimizing communication.
[Diagram: the same grid, partitioned to minimize boundary communication]
Examples (Gauss-Seidel style update)
Emulating a serial Gauss-Seidel update style with a pipe.
[Diagram sequence: successive slides step the update through the grid, pipeline-style]
Batch System Interaction
Both Anantham (400 processors) and System “X” (2200 processors) will normally operate in batch mode.
Jobs are not interactive.
Multi-user etiquette is enforced by a job scheduler and queuing system.
Users will submit jobs using a script file built by the administrator and modified by the user.
PBS (Portable Batch Scheduler) Submission Script
#!/bin/bash
#!
#! Example of job file to submit parallel MPI applications.
#! Lines starting with #PBS are options for the qsub command.
#! Lines starting with #! are comments

#! Set queue (production queue --- the only one right now) and
#! the number of nodes.
#! In this case we require 10 nodes from the entire set ("all").
#PBS -q prod_q
#PBS -l nodes=10:all
PBS Submission Script
#! Set time limit.
#! The default is 30 minutes of cpu time.
#! Here we ask for up to 1 hour.
#! (Note that this is *total* cpu time, e.g., 10 minutes on
#! each of 4 processors is 40 minutes)
#! Hours:minutes:seconds
#PBS -l cput=01:00:00

#! Name of output files for std output and error;
#! Defaults are <job-name>.o<job-number> and <job-name>.e<job-number>
#!PBS -e ZCA.err
#!PBS -o ZCA.log
PBS Submission Script
#! Mail to user when job terminates or aborts
#! #PBS -m ae

#! Change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Run the parallel MPI executable (change the default a.out)
#! (Note: omit "-kill" if you are running a 1 node job)
/usr/local/bin/mpiexec -kill a.out
Common Scheduler Commands
qsub <script file name>
Submits your script file for scheduling. It is immediately checked for validity, and if it passes the check you will get a message that your job has been added to the queue.
qstat
Displays information on jobs waiting in the queue and jobs that are running: how much time they have left and how many processors they are using.
Each job acquires a unique job_id that can be used to communicate with a job that is already running (perhaps to kill it).
qdel <job_id>
If for some reason you have a job that you need to remove from the queue, this command will do it. It will also kill a job in progress.
You, of course, only have access to delete your own jobs.
MPI Data Types
MPI thinks of every message as a starting point in memory and some measure of length, along with a possible interpretation of the data.
The direct measure of length (number of bytes) is hidden from the user through the use of MPI data types.
Each language binding (C and Fortran 77) has its own list of MPI types that are intended to increase portability, as the length of these types can change from machine to machine.
Interpretations of data can change from machine to machine in heterogeneous clusters (Macs and PCs in the same cluster, for example).
MPI Types in C
MPI_CHAR – signed char
MPI_SHORT – signed short int
MPI_INT – signed int
MPI_LONG – signed long int
MPI_UNSIGNED_CHAR – unsigned char
MPI_UNSIGNED_SHORT – unsigned short int
MPI_UNSIGNED – unsigned int
MPI_UNSIGNED_LONG – unsigned long int
MPI_FLOAT – float
MPI_DOUBLE – double
MPI_LONG_DOUBLE – long double
MPI_BYTE
MPI_PACKED
MPI Types in Fortran 77
MPI_INTEGER – INTEGER
MPI_REAL – REAL
MPI_DOUBLE_PRECISION – DOUBLE PRECISION
MPI_COMPLEX – COMPLEX
MPI_LOGICAL – LOGICAL
MPI_CHARACTER – CHARACTER(1)
MPI_BYTE
MPI_PACKED
Caution: Fortran 90 does not always store arrays contiguously.
Functions Appearing in all MPI Programs (Fortran 77)
MPI_INIT(IERROR)
INTEGER IERROR
Must be called before any other MPI routine.
Can be visualized as the point in the code where every processor obtains its own copy of the program and continues to execute, though this may happen earlier.
Functions Appearing in all MPI Programs (Fortran 77)
MPI_FINALIZE(IERROR)
INTEGER IERROR
This routine cleans up all MPI state.
Once this routine is called, no MPI routine may be called.
It is the user's responsibility to ensure that ALL pending communications involving a process complete before the process calls MPI_FINALIZE.
Typical Startup Functions
MPI_COMM_SIZE(COMM, SIZE, IERROR)
IN INTEGER COMM
OUT INTEGER SIZE, IERROR
Returns the size of the group associated with the communicator COMM.
…What's a communicator?
Communicators
A communicator is an integer that tells MPI what communication domain it is in.
There is a special communicator that exists in every MPI program called MPI_COMM_WORLD.
MPI_COMM_WORLD can be thought of as the superset of all communication domains.
Every processor requested by your initial script is a member of MPI_COMM_WORLD.
Typical Startup Functions
MPI_COMM_SIZE(COMM, SIZE, IERROR)
IN INTEGER COMM
OUT INTEGER SIZE, IERROR
Returns the size of the group associated with the communicator COMM.
A typical program contains the following command as one of the very first MPI calls, to provide the code with the number of processors it has available for this execution. (Step one of self identification.)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)
Typical Startup Functions
MPI_COMM_RANK(COMM, RANK, IERROR)
IN INTEGER COMM
OUT INTEGER RANK, IERROR
Indicates the rank of the process that calls it, in the range 0..size-1, where size is the return value of MPI_COMM_SIZE.
This rank is relative to the communication domain specified by the communicator COMM.
For MPI_COMM_WORLD, this function will return the absolute rank of the process, a unique identifier. (Step two of self identification.)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)
Startup Variables
SIZE
INTEGER size_p
RANK
INTEGER rank_p
STATUS (more on this guy later)
INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p
IERROR (Fortran 77)
INTEGER ierr_p
Hello World
Fortran 90

PROGRAM Hello_World

IMPLICIT NONE
INCLUDE 'mpif.h'

INTEGER :: ierr_p, rank_p, size_p
INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p

CALL MPI_INIT(ierr_p)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)

IF (rank_p==0) THEN
  WRITE(*,*) 'Hello world! I am process 0 and I am special!'
ELSE
  WRITE(*,*) 'Hello world! I am process ', rank_p
END IF

CALL MPI_FINALIZE(ierr_p)

END PROGRAM Hello_World
Hello World
C

#include <stdio.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  int node;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &node);

  if (node == 0) {
    printf("Hello world! I am C process 0 and I am special!\n");
  } else {
    printf("Hello world! I am C process %d\n", node);
  }

  MPI_Finalize();
  return 0;
}