MPI Parallel Programming


Description

MPI Parallel Programming. Information Technology Services, The University of Hong Kong. By HPC/Grid Team, Email: [email protected].

Transcript of MPI Parallel Programming

Page 1: MPI Parallel Programming

Information Technology Services, HKU

MPI Parallel Programming
Information Technology Services, The University of Hong Kong

By HPC/Grid Team, Email: [email protected]

Page 2: MPI Parallel Programming


Must Know Before Programming…

• HKU HPC Facilities

– hpcpower2: 64-bit Linux cluster consisting of 24 nodes
  • each node has TWO 64-bit quad-core Intel Xeon CPUs running at 3 GHz

– hpcpower: 32-bit Linux cluster consisting of 178 nodes
  • 128 nodes of dual 2.8 GHz Xeon processors, and
  • 50 nodes of dual 3.06 GHz Xeon processors

Page 3: MPI Parallel Programming


How to get an account

• Download the form at http://www.hku.hk/cc/home/services/forms.htm 

CF-137e : High Performance Computing Cluster (hpcpower) Account Application (for Non-TOSI Staff/Students)

• Return the application form to CC office.

Page 4: MPI Parallel Programming


How to login HPCPOWER

• From any PC within the HKU campus network: ssh hpcpower (ssh is bundled with Linux). On Windows, download PuTTY to use SSH: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

• From a PC outside the HKU campus network: you must use HKUVPN to get a HKU campus network IP first.

Page 5: MPI Parallel Programming


File Transfer

• From/to any PC within the HKU campus network: scp/sftp hpcpower (bundled with Linux). On Windows, download WinSCP to use SSH: http://winscp.net/eng/download.php

• From a PC outside the HKU campus network: you must use HKUVPN to get a HKU campus network IP first.

Page 6: MPI Parallel Programming


Program Editing

• You can use the command vi, emacs or pico to edit programs. Please refer to the UNIX user's guide for details: http://www.hku.hk/cc/handbook/unix/unix_toc.html

• Microsoft Windows users: do not use a standard Microsoft Windows editor such as Notepad/Wordpad to edit files that will be used on the Linux system.

• To convert a Windows file to a UNIX file, enter:

dos2unix file.txt

Page 7: MPI Parallel Programming


Resource Management System

• After logging in to the cluster, the user is on the master node. When a program is run, it is also immediately run on the master. This is the "interactive mode", which is convenient for running simple commands like ls, vi, etc. or for editing/compiling a program.

• Any interactive program running on the master node will be killed without further notice.

Page 8: MPI Parallel Programming


PBS (Portable Batch System)

• Long computing jobs should be submitted through the batch system.

• The submitted job will be in a queue waiting for its turn, then will be sent to one or more compute node(s), which the job will have dedicated access to until it finishes.

• With PBS, the job will run faster and the cluster will be more efficiently utilized.

Page 9: MPI Parallel Programming


Basic PBS Commands

Command        Description
qsub           Submit a job to the queuing system
qdel           Delete a job that has been submitted to the queuing system
qstat / showq  List all information about queues and jobs

Page 10: MPI Parallel Programming


Sample PBS job script

• Sample PBS job script (pbs.cmd):

#!/bin/sh
#PBS -N test
#PBS -m ea
#PBS -q qdev
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=2
### Define number of processors used
NP=`wc -l < $PBS_NODEFILE`
cd $PBS_O_WORKDIR
mpirun -np $NP ./a.out

Page 11: MPI Parallel Programming


Sample PBS job script

• A line beginning with # is a comment.
• A line beginning with #PBS is a PBS directive.
• This submit script pbs.cmd specifies the name of the job (test), which queue to use (qdev), that it needs both processors on a single node (nodes=1:ppn=2), that it will run for at most 2 hours (walltime), and that TORQUE should email the user when the job exits or aborts (ea). Additionally, the user specifies where and what to execute (./a.out).

Page 12: MPI Parallel Programming


Submit a PBS job

$ qsub pbs.cmd

2234.hpcpower

• PBS returns a job identifier of the form jobid.hpcpower where jobid (2234) is an integer number assigned by PBS.

• The job identifier is needed for any actions involving the job, such as checking job status or deleting the job.

Page 13: MPI Parallel Programming


qsub options

• Resources requested on the command line take precedence over the directive lines in the script file.

• e.g. submitting the job with the command

qsub -l nodes=2:ppn=8 pbs.cmd

the job will run on 2 compute nodes with 8 processors each, instead of what is stated in the script file pbs.cmd.

Option            Action
qsub -l list      Set job resource list
qsub -N jobname   Set job name to jobname
qsub -q dest      Submit to queue dest

Page 14: MPI Parallel Programming


List all submitted jobs' status

$ qstat -a   (or $ qa)

hpcpower:
Job ID         Username  Queue   Jobname  SessID  NDS  TSK  Req'd Memory  Req'd Time  S  Elap Time
-------------  --------  ------  -------  ------  ---  ---  ------------  ----------  -  ---------
2230.hpcpower  ycleung   qdev    Test     5597    8    --   --            02:00       R  01:02
2231.hpcpower  ycleung   qprod   Job_1    11048   16   --   --            10:00       R  09:17
2233.hpcpower  chliu     oneday  huge     --      16   --   --            15:00       Q  --

Job information provided:
• Username: Job owner
• NDS: Number of nodes requested
• Req'd Time: Requested amount of wallclock time
• Elap Time: Elapsed time in the current job state
• S: Job state (E-Exit; R-Running; Q-Queuing)

Page 15: MPI Parallel Programming


qstat options

Option             Action
qstat -a           List all jobs' status
qstat -q           List all queues on the system
qstat -n           List all jobs' allocated node(s)
qstat -u username  List all jobs owned by user username
qstat -r           List all running jobs
qstat -f jobid     List all information known about the specified job (jobid)

Page 16: MPI Parallel Programming


Delete a Job

$ qdel 2234

where 2234 is the job id.

Page 17: MPI Parallel Programming

Information Technology Services, HKU

MPI (Message Passing Interface)

Page 18: MPI Parallel Programming


MPI (Message Passing Interface)

• Is a communication protocol for parallel programming
• Is a library of functions, not a language
• MPI software is available for free
• MPICH and Open MPI are the most common free versions
• Many vendors have their own optimized versions of MPI - IBM, Sun, SGI, Myrinet…

Page 19: MPI Parallel Programming


What is MPI

[Diagram: processes P0-P3 connected through a communication network via the Message Passing Interface]

Process: A process is a set of executable instructions (a program) which runs on a processor. For maximum performance, each CPU (or core in a multicore machine) is assigned just a single process.

Message Passing: The method by which data from one processor's memory is copied to the memory of another processor.

Page 20: MPI Parallel Programming


What is MPI

• Same program running on many processors at the same time.
• Single-Program-Multiple-Data (SPMD) style:

if (my_rank == 0) then
    call Master(....)
else
    call Worker(....)
endif

Page 21: MPI Parallel Programming


What is MPI

• MPI codes can run on parallel machines with distributed memory and shared memory architectures.
• High portability.
• All variables are private to each process.
• Processes communicate via specific communication calls.

Page 22: MPI Parallel Programming


General MPI Program Structure

#include <mpi.h>                          /* MPI include file */
void main (int argc, char *argv[])
{
    int np, rank, ierr;                   /* variable declarations */

    ierr = MPI_Init(&argc, &argv);        /* initialize MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* do work and make message passing calls */

    ierr = MPI_Finalize();                /* terminate MPI environment */
}

Page 23: MPI Parallel Programming


MPI Header File

• C : #include "mpi.h"
• FORTRAN : include 'mpif.h'

Page 24: MPI Parallel Programming


MPI Function Format

• All MPI functions have names that begin with the prefix MPI_ to avoid name collisions.
• C function names are case sensitive but Fortran function names are not.

C :       ierror = MPI_Xxxx(parameter, ...);
FORTRAN : call MPI_Xxxx(argument, ..., ierror)

Page 25: MPI Parallel Programming


Initialising MPI environment

• C :
int MPI_Init(int *argc, char ***argv);

• FORTRAN :
MPI_Init(ierror)
INTEGER ierror

Page 26: MPI Parallel Programming


Exit MPI environment

• C :
int MPI_Finalize()

• FORTRAN :
MPI_Finalize(ierror)
INTEGER ierror

Page 27: MPI Parallel Programming


Communicator

• A communicator is a collection of processes that can send messages to each other.
• MPI_COMM_WORLD is predefined and consists of all the processes.

[Diagram: six processes, ranks 0-5, inside the communicator MPI_COMM_WORLD]

Page 28: MPI Parallel Programming


Size

• How many processes are contained within a communicator?

• C :
int MPI_Comm_size(MPI_Comm comm, int *size)

• FORTRAN :
MPI_Comm_size(comm, size, ierror)
INTEGER comm, size, ierror

Page 29: MPI Parallel Programming


Rank

• How do we identify different processes?

• C :
int MPI_Comm_rank(MPI_Comm comm, int *rank);

• FORTRAN :
MPI_Comm_rank(comm, rank, ierror)
INTEGER comm, rank, ierror

Rank values range from 0 to N-1, where N is the number of processes in the communicator.

Page 30: MPI Parallel Programming


A First MPI Program : Helloworld

• FORTRAN

PROGRAM Helloworld
INCLUDE 'mpif.h'
INTEGER nproc, myrank, ierr
CALL MPI_Init(ierr)
CALL MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
PRINT *, "Hello World! I'm process ", myrank, " of ", nproc
CALL MPI_Finalize(ierr)
END PROGRAM Helloworld

• C

#include <stdio.h>
#include "mpi.h"
void main(int argc, char **argv)
{
    int nproc, myrank, ierr;
    ierr = MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("Hello World! I'm process %d of %d\n", myrank, nproc);
    ierr = MPI_Finalize();
}

Page 31: MPI Parallel Programming


Compile MPI in HPCPOWER

• C :
pgcc -Mmpi program.c

• FORTRAN (FIXED FORM) :
pgf90 -Mmpi program.f

• FORTRAN (FREE FORM) :
pgf90 -Mmpi -Mfree program.f90

Page 32: MPI Parallel Programming


Execute MPI in HPCPOWER

• Running 4 processes on the interactive node:

mpirun -np 4 ./a.out

Hello World! I'm process 0 of 4
Hello World! I'm process 1 of 4
Hello World! I'm process 2 of 4
Hello World! I'm process 3 of 4

Page 33: MPI Parallel Programming


What is in a message

• An MPI message is an array of elements of a particular MPI datatype.

Page 34: MPI Parallel Programming


MPI Datatype

MPI Fortran Datatype   Fortran Datatype   | MPI C Datatype      C Datatype
MPI_CHARACTER          character(1)       | MPI_CHAR            signed char
                                          | MPI_UNSIGNED_CHAR   unsigned char
MPI_INTEGER            integer            | MPI_INT             signed int
                                          | MPI_SHORT           signed short int
                                          | MPI_LONG            signed long int
                                          | MPI_UNSIGNED_SHORT  unsigned short int
                                          | MPI_UNSIGNED        unsigned int
                                          | MPI_UNSIGNED_LONG   unsigned long int
MPI_REAL               real               | MPI_FLOAT           float
MPI_DOUBLE_PRECISION   double precision   | MPI_DOUBLE          double
                                          | MPI_LONG_DOUBLE     long double
MPI_COMPLEX            complex            |
MPI_LOGICAL            logical            |

Page 35: MPI Parallel Programming

Information Technology Services, HKU

Point-to-point communication

Page 36: MPI Parallel Programming


Point-to-point communication

A point-to-point communication always involves exactly two processes. One process acts as the sender and the other acts as the receiver.

[Diagram: six processes, ranks 0-5, in a communicator; a message travels from the source process to the dest process]

Page 37: MPI Parallel Programming


Point-to-point communication

• Communication between 2 processes.
• The source process sends a message to the destination process.
• Communication takes place within a communicator.
• The destination process is identified by its rank in the communicator.

Page 38: MPI Parallel Programming


Sending Messages

• C :
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

• FORTRAN :
MPI_Send(buf, count, datatype, dest, tag, comm, ierror)
<type> BUF(*)
INTEGER count, datatype, dest, tag, comm
INTEGER ierror

Page 39: MPI Parallel Programming


Receiving Messages

• C :
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

• FORTRAN :
MPI_Recv(buf, count, datatype, source, tag, comm, status, ierror)
<type> BUF(*)
INTEGER count, datatype, source, tag, comm
INTEGER status(MPI_STATUS_SIZE), ierror

Page 40: MPI Parallel Programming


Messages

• Messages = envelope + message body
• Envelope
  1. Source – the sending process
  2. Destination – the receiving process
  3. Communicator – specifies the group of processes to which both source and destination belong
  4. Tag – used to classify messages
• Message Body
  1. Buffer – the message data
  2. Datatype – type of the message data
  3. Count – number of items of type datatype in the buffer

Page 41: MPI Parallel Programming


Tag

• The tag is a number specified by the programmer for each message. It is attached to the message being sent.
• When the destination process receives the message, it can check the tag and act accordingly.

Page 42: MPI Parallel Programming


Message Matching Rules

• Sender must specify a valid destination RANK.
• Receiver must specify a valid source RANK.
• Sender and receiver must use the same COMMUNICATOR.
• TAG must match.
• DATATYPE must match.
• Receiver's buffer must be large enough.

Page 43: MPI Parallel Programming


Example : Greetings (Fortran)

PROGRAM greetings
include 'mpif.h'
integer my_rank
integer p
integer source
integer dest
integer tag
character*100 message
character*10 digit_string
integer size
integer status(MPI_STATUS_SIZE)
integer ierr

call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

Page 44: MPI Parallel Programming


Example : Greetings (Fortran)

if (my_rank /= 0) then
    write(digit_string, FMT="(I3)") my_rank
    message = 'Greetings from process ' // trim(digit_string) // '!'
    dest = 0
    tag = 0
    call MPI_Send(message, len_trim(message), &
         MPI_CHARACTER, dest, tag, MPI_COMM_WORLD, ierr)
else ! my_rank == 0
    do source = 1, p-1
        tag = 0
        call MPI_Recv(message, len(message), &
             MPI_CHARACTER, source, tag, &
             MPI_COMM_WORLD, status, ierr)
        write(6, FMT="(A)") message
    enddo
endif
call MPI_Finalize(ierr)
END PROGRAM greetings

Page 45: MPI Parallel Programming


Example : Greetings (C)

#include <stdio.h>
#include <string.h>
#include "mpi.h"
main(int argc, char* argv[]) {
    int my_rank;          /* rank of process       */
    int p;                /* number of processes   */
    int source;           /* rank of sender        */
    int dest;             /* rank of receiver      */
    int tag = 0;          /* tag for messages      */
    char message[100];    /* storage for message   */
    MPI_Status status;    /* return status for receive */

    /* Start up MPI */
    MPI_Init(&argc, &argv);
    /* Find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    /* Find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

Page 46: MPI Parallel Programming


Example : Greetings (C)

    if (my_rank != 0) {
        /* Create message */
        sprintf(message, "Greetings from process %d!", my_rank);
        dest = 0;
        /* Use strlen+1 so that '\0' gets transmitted */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    } else { /* my_rank == 0 */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    /* Shut down MPI */
    MPI_Finalize();
} /* main */

Page 47: MPI Parallel Programming


Example : Greetings

% mpirun -np 4 ./a.out

Greetings from process 1!
Greetings from process 2!
Greetings from process 3!

Page 48: MPI Parallel Programming


MPI_Sendrecv

MPI_Sendrecv(*sendbuf, sendcount, sendtype, dest, sendtag,
             *recvbuf, recvcount, recvtype, source, recvtag,
             comm, *status)
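As an illustration (not part of the original slides), a minimal C sketch of a two-process exchange with MPI_Sendrecv; the variable names are assumptions, and rank is assumed to come from MPI_Comm_rank:

/* Each rank sends its own rank number to the partner and receives
   the partner's in a single call. MPI pairs the send and receive
   internally, so this cannot deadlock the way back-to-back
   MPI_Send / MPI_Recv calls can. Assumes exactly 2 processes. */
int partner = (rank == 0) ? 1 : 0;
int sendmsg = rank, recvmsg;
MPI_Status status;
MPI_Sendrecv(&sendmsg, 1, MPI_INT, partner, 0,
             &recvmsg, 1, MPI_INT, partner, 0,
             MPI_COMM_WORLD, &status);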

Page 49: MPI Parallel Programming


Wildcard

• MPI_Recv can use MPI_ANY_SOURCE and MPI_ANY_TAG as arguments.
• There is no wildcard for the communicator.

call MPI_Recv(message, 1, MPI_INTEGER, &
     MPI_ANY_SOURCE, MPI_ANY_TAG, &
     MPI_COMM_WORLD, status, ierror)

• MPI_Send cannot use wildcards.

Page 50: MPI Parallel Programming


Communication Envelope

Page 51: MPI Parallel Programming


status

• Envelope information is returned in the status argument of MPI_Recv.
• In Fortran, status has 3 accessible elements:
  – status(MPI_SOURCE)
  – status(MPI_TAG)
  – status(MPI_ERROR)
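In C, status is an MPI_Status structure with fields status.MPI_SOURCE, status.MPI_TAG and status.MPI_ERROR. A hedged C sketch (variable names are assumptions) showing how a wildcard receive recovers the actual envelope:

int value;
MPI_Status status;
/* Accept a message from any sender with any tag ... */
MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
/* ... then inspect the envelope to see who actually sent it. */
printf("got %d from rank %d, tag %d\n",
       value, status.MPI_SOURCE, status.MPI_TAG);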

Page 52: MPI Parallel Programming


Received Message Count

• status also returns information on the size of the message (i.e. number of elements) received. However, this is not directly accessible.
• To determine the size of the message received, we can call:

call MPI_Get_count(status, datatype, count, ierror)
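For reference, a small C sketch of the same idea (buffer size and peer rank are assumptions): the receiver posts a buffer larger than needed, then asks how many elements actually arrived:

double buf[100];
MPI_Status status;
int count;
/* Receive up to 100 doubles from rank 0, tag 0. */
MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
/* count is set to the number of doubles actually received (<= 100). */
MPI_Get_count(&status, MPI_DOUBLE, &count);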

Page 53: MPI Parallel Programming


Message Order Preservation

• Messages do not overtake each other: two messages sent from the same source to the same destination are received in the order they were sent.

[Diagram: six processes, ranks 0-5, in a communicator; two messages from source to dest keep their order]

Page 54: MPI Parallel Programming


System Buffer

[Diagram: a message travels from the sender's send buffer, possibly via a system buffer, across the network into the receiver's receive buffer]

Page 55: MPI Parallel Programming


Blocking Communication

• MPI_Recv blocks until the message has been received into the buffer specified by the buffer argument.
• If the message is not available, the process remains blocked until a message becomes available.

Page 56: MPI Parallel Programming


Blocking Communication

• MPI_Send blocks until
  – the whole message has arrived at the receiver, OR
  – the whole message has been copied into a system buffer.
• After MPI_Send returns, the send buffer is safe for reuse.

Page 57: MPI Parallel Programming


Blocking Communication

• Correct version - succeeds even if no system buffer is available.

IF (rank == 0) THEN
    call MPI_SEND(sendbuf, .... to P(1))
    call MPI_RECV(recvbuf, .... from P(1))
ELSE ! rank == 1
    call MPI_RECV(recvbuf, .... from P(0))
    call MPI_SEND(sendbuf, .... to P(0))
ENDIF

Page 58: MPI Parallel Programming


Blocking Communication

• Common bug - always deadlocks: both processes wait in MPI_RECV and neither reaches its send.

IF (rank == 0) THEN
    call MPI_RECV(recvbuf, .... from P(1))
    call MPI_SEND(sendbuf, .... to P(1))
ELSE ! rank == 1
    call MPI_RECV(recvbuf, .... from P(0))
    call MPI_SEND(sendbuf, .... to P(0))
ENDIF

Page 59: MPI Parallel Programming


Blocking Communication

• Unsafe - at least one of the two messages sent must be buffered.
• Deadlocks if the system buffer is not big enough.

IF (rank == 0) THEN
    call MPI_SEND(sendbuf, .... to P(1))
    call MPI_RECV(recvbuf, .... from P(1))
ELSE ! rank == 1
    call MPI_SEND(sendbuf, .... to P(0))
    call MPI_RECV(recvbuf, .... from P(0))
ENDIF

Page 60: MPI Parallel Programming


Non-Blocking Communication

• One can improve performance by overlapping communication and computation.

Page 61: MPI Parallel Programming


Non-Blocking Send

• A non-blocking send start call initiates the send operation, but does not complete it.
• The send start call returns before the message is copied out of the send buffer.
• Communication may proceed concurrently with computation.
• A separate send complete operation is needed to complete the communication, i.e. to verify that the data has been copied out of the send buffer.

Page 62: MPI Parallel Programming


Non-Blocking Send Example

call MPI_Isend(x, 1, MPI_INTEGER, &
     1, 0, MPI_COMM_WORLD, request1, ierror)
.....
! Do sth here while message being sent.
call MPI_Wait(request1, status, ierror)

Page 63: MPI Parallel Programming


MPI_Isend Syntax

MPI_Isend(buf, count, datatype, dest, &
     tag, comm, request, ierror)
<type> buf(*)
INTEGER, intent(in) :: count, datatype, dest, tag, comm
INTEGER, intent(out) :: request, ierror

Page 64: MPI Parallel Programming


MPI_Isend

• The sender should not access any part of the send buffer after MPI_Isend is called, until the send completes.

Page 65: MPI Parallel Programming


Request

• request is a handle, i.e. the object referenced by request is system defined and cannot be directly accessed by the user.
• In FORTRAN, every request should be declared as an integer.

Page 66: MPI Parallel Programming


Request

• Its purpose is to identify the operation started by the non-blocking call.
• It matches the operation that initiates the communication with the operation that terminates it.
• It stores information about the status of the pending communication operation.

Page 67: MPI Parallel Programming


Communication Completion

• Completion of a send indicates that the whole message has been copied out and the sender is now free to update the locations in the send buffer.
• It doesn't indicate that the message has been received; rather, the whole message may only have been put in a system buffer.

Page 68: MPI Parallel Programming


MPI_Wait

• MPI_Wait blocks until the operation identified by request completes.
• When MPI_Wait returns, request is set to MPI_REQUEST_NULL. This means there is no pending operation associated with request. The opaque object associated with it is also deallocated.

Page 69: MPI Parallel Programming


MPI_Wait Syntax

MPI_Wait(request, status, ierror)
INTEGER, intent(inout) :: request, ierror
INTEGER, intent(out) :: status(MPI_STATUS_SIZE)

Page 70: MPI Parallel Programming


MPI_Test

• MPI_Test is the non-blocking version of MPI_Wait.

MPI_Test(request, flag, status, ierror)
LOGICAL, intent(out) :: flag
INTEGER, intent(inout) :: request, ierror
INTEGER, intent(out) :: status(MPI_STATUS_SIZE)

Page 71: MPI Parallel Programming


MPI_Test

• MPI_Test returns immediately. It does not wait for the completion of the communication operation.
• It sets flag == .TRUE. if the operation identified by request is complete. It also sets request to MPI_REQUEST_NULL.
• It sets flag == .FALSE. if the operation identified by request is not complete.

Page 72: MPI Parallel Programming


Nonblocking Receive Example

• Method 1: MPI_Wait

MPI_Irecv(buf, …, req);
… do work not using buf
MPI_Wait(req, status);
… do work using buf

• Method 2: MPI_Test

MPI_Irecv(buf, …, req);
MPI_Test(req, flag, status);
while (flag == 0) {      /* loop while the receive is still incomplete */
    … do work not using buf
    MPI_Test(req, flag, status);
}
… do work using buf

Page 73: MPI Parallel Programming


MPI_Irecv

• The receiver should not access any part of the receive buffer after MPI_Irecv is called, until the receive completes.

Page 74: MPI Parallel Programming


Multiple Completion

• MPI provides routines which test multiple communications at once:
  – MPI_WAITANY, MPI_TESTANY
  – MPI_WAITALL, MPI_TESTALL
  – MPI_WAITSOME, MPI_TESTSOME

Page 75: MPI Parallel Programming


Probe

• The MPI_Probe and MPI_Iprobe operations allow incoming messages to be checked for, without actually receiving them.
• The user can then decide how to receive them, based on the information returned by the probe.
• e.g. allocate memory for the receive buffer according to the length of the probed message.
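A hedged C sketch of that pattern (variable names are assumptions; assumes <stdlib.h> is included for malloc): probe first, size the buffer from the probed message, then receive:

MPI_Status status;
int count;
/* Block until some message is available, without receiving it. */
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
/* Ask how many doubles the pending message contains. */
MPI_Get_count(&status, MPI_DOUBLE, &count);
double *buf = malloc(count * sizeof(double));
/* Now receive exactly that message into the right-sized buffer. */
MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, &status);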

Page 76: MPI Parallel Programming


Cancel

• The MPI_Cancel operation allows pending communications to be cancelled.

Page 77: MPI Parallel Programming

Information Technology Services, HKU

Collective communication

Page 78: MPI Parallel Programming


Collective communication

A collective communication involves communication among all processes in a process group (communicator).

[Diagram: six processes, ranks 0-5, in a communicator; the source process sends to all the others, e.g. MPI_Bcast]

Page 79: MPI Parallel Programming


Collective communication

• All collective communication events are blocking.
• The call to the collective routine typically replaces several point-to-point calls.
• The source code is much more readable.
• Optimized forms of the collective routines are often faster than the equivalent operation expressed in terms of point-to-point routines.

Page 80: MPI Parallel Programming


Collective communication

[Diagram: broadcast, scatter/gather, and reduction patterns; e.g. a reduction sums the values 1, 3, 5 and 7 held by four processes into 16]

Page 81: MPI Parallel Programming


MPI_Barrier

• MPI_Barrier(comm)
• Creates a barrier synchronization in a group (communicator): each process blocks until all processes in the group have reached the barrier.

Page 82: MPI Parallel Programming


MPI_Bcast

Page 83: MPI Parallel Programming


MPI_Bcast

• Correct - call MPI_Bcast on all nodes:

MPI_Bcast (message, … , 0, MPI_COMM_WORLD)

• Common mistake:

if (I am master) then
    MPI_Bcast (message, … , 0, MPI_COMM_WORLD)
else
    MPI_Recv(message, … , 0, MPI_COMM_WORLD)
endif
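A minimal C sketch of the correct pattern (the variable n and its value are assumptions): every rank, root included, makes the identical call:

int n;
if (rank == 0) n = 100;     /* only the root has the value initially */
/* Collective: ALL ranks call MPI_Bcast with root 0. */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* After the call, every process has n == 100. */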

Page 84: MPI Parallel Programming


MPI_Scatter
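The original slide here is a diagram. As a substitute illustration, a hedged C sketch (chunk size and buffer names are assumptions) of MPI_Scatter handing one slice of a root-owned array to each process; MPI_Gather reverses the operation with the same argument order:

#define CHUNK 4
double sendbuf[64];    /* significant on the root only; np*CHUNK elements */
double recvbuf[CHUNK];
/* Rank 0 sends CHUNK doubles to each rank (itself included);
   rank k receives elements k*CHUNK .. k*CHUNK+CHUNK-1. */
MPI_Scatter(sendbuf, CHUNK, MPI_DOUBLE,
            recvbuf, CHUNK, MPI_DOUBLE,
            0, MPI_COMM_WORLD);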

Page 85: MPI Parallel Programming


MPI_Gather

Page 86: MPI Parallel Programming


MPI_Allgather

Page 87: MPI Parallel Programming


MPI_Reduce
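The original slide here is a diagram. As a substitute illustration, a hedged C sketch (variable names are assumptions): each rank contributes a local value and rank 0 receives the combined result; MPI_Allreduce would instead leave the result on every rank:

double local = rank + 1.0;   /* each rank's contribution       */
double global;               /* significant on rank 0 only     */
MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
/* On rank 0, global == 1 + 2 + ... + np. */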

Page 88: MPI Parallel Programming


MPI_Allreduce

Page 89: MPI Parallel Programming


MPI_Scan

Page 90: MPI Parallel Programming


MPI Reduce and Scan predefined operations

MPI OP       Operation               C types
MPI_MAX      Maximum                 integer, float
MPI_MIN      Minimum                 integer, float
MPI_SUM      Sum                     integer, float
MPI_PROD     Product                 integer, float
MPI_LAND     Logical AND             integer
MPI_BAND     Bitwise AND             integer, MPI_BYTE
MPI_LOR      Logical OR              integer
MPI_BOR      Bitwise OR              integer, MPI_BYTE
MPI_LXOR     Logical exclusive OR    integer
MPI_BXOR     Bitwise exclusive OR    integer, MPI_BYTE
MPI_MAXLOC   Max value and location  float, double, long double
MPI_MINLOC   Min value and location  float, double, long double

Page 91: MPI Parallel Programming


Example : Matrix-vector Multiplication

Schematic of parallel decomposition for matrix-vector multiplication, A = B*C. The vector A is depicted in yellow. The matrix B and vector C are depicted in multiple colors representing the portions (columns and elements, respectively) assigned to each processor.

Page 92: MPI Parallel Programming


Example : Matrix-vector Multiplication

A = B*C

[Diagram: a 4x4 example. Each processor Pk (k = 0..3) holds column k of B and element c(k) of C, and forms the partial products b(i,k)*c(k) for i = 0..3. A reduction (SUM) across P0-P3 then yields a(i) = b(i,0)*c(0) + b(i,1)*c(1) + b(i,2)*c(2) + b(i,3)*c(3).]
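A hedged C sketch of this decomposition (assumes np == 4 and that each rank's column of B and element of C are already filled in; names are assumptions):

#define N 4
double bcol[N];     /* this process's column of B: b(0,k) .. b(N-1,k) */
double c_k;         /* this process's element of C                    */
double partial[N], a[N];
int i;
/* Local products: partial(i) = b(i,k) * c(k). */
for (i = 0; i < N; i++)
    partial[i] = bcol[i] * c_k;
/* Element-wise sum across all 4 processes; rank 0 receives A. */
MPI_Reduce(partial, a, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);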

Page 93: MPI Parallel Programming

Information Technology Services, HKU

Further Reading

Page 94: MPI Parallel Programming

Information Technology Services, HKU

Group and Topology

Page 95: MPI Parallel Programming


Group and Communicator Management Routines

• Purpose
  – Allows you to organize tasks, based upon function, into task groups.
  – Enables collective communications operations across a subset of related tasks.
  – Provides the basis for implementing virtual topologies.

Page 96: MPI Parallel Programming


Group and Communicator Management Routines

• MPI_Group_difference
• MPI_Group_excl
• MPI_Group_incl
• MPI_Group_intersection
• MPI_Group_union
• MPI_Group_compare
• MPI_Group_rank
• MPI_Group_size
• MPI_Group_free
• MPI_Comm_group
• MPI_Comm_create
• MPI_Comm_dup
• MPI_Comm_compare
• MPI_Comm_free
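A hedged C sketch of how several of these routines combine (the even-rank split is an invented example): derive a group from MPI_COMM_WORLD, keep only the even ranks, and build a new communicator from it:

MPI_Group world_group, even_group;
MPI_Comm even_comm;
int np, nEven, ranks[64];             /* assumes np <= 128 */
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
for (nEven = 0; 2 * nEven < np; nEven++)
    ranks[nEven] = 2 * nEven;         /* 0, 2, 4, ... */
MPI_Group_incl(world_group, nEven, ranks, &even_group);
/* Collective over MPI_COMM_WORLD; odd ranks get MPI_COMM_NULL. */
MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);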

Page 97: MPI Parallel Programming


Virtual Topologies

• Cartesian, tree or graph
• Cartesian example:

[Diagram: processes arranged on a 2-D Cartesian grid]

Page 98: MPI Parallel Programming


Example

call MPI_Sendrecv( &
     sendmsg, 1, MPI_INTEGER, northprocess, sendtag, &
     recvmsg, 1, MPI_INTEGER, southprocess, recvtag, &
     comm, status, ierror)
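As a hedged C sketch (grid size, periodicity, and the north/south direction convention are assumptions), the neighbour ranks used above would typically come from a Cartesian topology via MPI_Cart_create and MPI_Cart_shift:

MPI_Comm cart;
int dims[2] = {4, 4}, periods[2] = {0, 0};   /* 4x4 grid, non-periodic */
int northprocess, southprocess;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
/* Shift by 1 along dimension 0; off-grid neighbours come back as
   MPI_PROC_NULL, and communication with MPI_PROC_NULL completes
   immediately with no effect. */
MPI_Cart_shift(cart, 0, 1, &northprocess, &southprocess);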

Page 99: MPI Parallel Programming


Virtual Topologies

• MPI_Cart_coords
• MPI_Cart_create
• MPI_Cart_get
• MPI_Cart_map
• MPI_Cart_rank
• MPI_Cart_shift
• MPI_Cart_sub
• MPI_Cartdim_get
• MPI_Dims_create
• MPI_Graph_create
• MPI_Graph_get
• MPI_Graph_map
• MPI_Graph_neighbors
• MPI_Graph_neighbors_count
• MPI_Graphdims_get
• MPI_Topo_test

Page 100: MPI Parallel Programming

Information Technology Services, HKU

Derived Datatypes

Page 101: MPI Parallel Programming


The count argument

• MPI_Send and MPI_Recv have a count argument. E.g.,

real, dimension(100) :: vector

if ( my_rank == 0 ) then
    call MPI_Send(vector, 50, MPI_REAL, 1, 0, &
         MPI_COMM_WORLD, ierror)
else
    call MPI_Recv(vector, 50, MPI_REAL, 0, 0, &
         MPI_COMM_WORLD, status, ierror)
endif

Page 102: MPI Parallel Programming


The count argument

• Many data items can be sent together. However, they must be:
  – contiguous, and
  – of the same FORTRAN datatype.

Page 103: MPI Parallel Programming


Why Derived Datatypes

• Consider

double precision results(IMAX, JMAX)

where we want to send results(5,1), results(5,2), ..., results(5,JMAX). The data to be sent does not lie in one contiguous area of memory.

Page 104: MPI Parallel Programming


Why Derived Datatypes

• One solution is to use a FORTRAN 90 array section. However:
  – This method is not efficient, because it involves a large amount of memory copying.
  – Since a FORTRAN 90 array section is implemented by memory copying, it cannot be used in non-blocking communication.

Page 105: MPI Parallel Programming


Why Derived Datatypes

• Consider

INTEGER nResult, n, m
DOUBLE PRECISION result(RMAX)

where it is required to send nResult followed by result. The data is of mixed type.

Page 106: MPI Parallel Programming


Discussion of examples

• The programmer can make consecutive MPI calls to send and receive each data element in turn, which is slow and clumsy.
• The programmer may copy the data to a buffer before sending it, but this wastes memory and is long-winded. A derived-datatype sketch follows.
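Derived datatypes avoid both problems. A hedged C sketch for the mixed nResult/result case above (assumes the MPI-2 routine MPI_Type_create_struct and a #define RMAX 100; the MPI-1 equivalent is MPI_Type_struct, and dest/tag are as in the earlier examples):

#define RMAX 100
int nResult;
double result[RMAX];
MPI_Datatype pair_type;
int          blocklens[2] = {1, RMAX};
MPI_Aint     displs[2];
MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
/* Displacements measured relative to &nResult. */
MPI_Get_address(&nResult, &displs[0]);
MPI_Get_address(result,  &displs[1]);
displs[1] -= displs[0];
displs[0]  = 0;
MPI_Type_create_struct(2, blocklens, displs, types, &pair_type);
MPI_Type_commit(&pair_type);
/* One message carries the integer and the whole array together. */
MPI_Send(&nResult, 1, pair_type, dest, tag, MPI_COMM_WORLD);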

Page 107: MPI Parallel Programming


MPI_TYPE_VECTOR

[Diagram: a vector type with the parameters below, selecting two blocks of three elements whose starts are five element positions apart]

• count = 2
• blocklength = 3
• stride = 5

Page 108: MPI Parallel Programming


MPI_TYPE_VECTOR Syntax

MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype, ierror)
INTEGER, intent(in) :: count, blocklength, stride, oldtype
INTEGER, intent(out) :: newtype, ierror

Page 109: MPI Parallel Programming


Example : Sending Sub-matrix

[Diagram: a 10 x 8 array with numbered rows and columns; the elements results(5,1) ... results(5,8) of row 5 are highlighted - one element per column, 10 apart in memory]

Page 110: MPI Parallel Programming


Example : Sending Sub-matrix

double precision :: results(10,8)
INTEGER :: newtype
....
call MPI_TYPE_VECTOR(8, 1, 10, MPI_DOUBLE_PRECISION, newtype, ierror)
call MPI_TYPE_COMMIT(newtype, ierror)
call MPI_SEND(results(5,1), 1, newtype, dest, tag, comm, ierror)

Page 111: MPI Parallel Programming


MPI_TYPE_COMMIT

• The new datatype is "committed" with a call to MPI_TYPE_COMMIT. It can then be used in any number of communications. The form of MPI_TYPE_COMMIT is:

call MPI_TYPE_COMMIT(newtype, ierror)

Page 112: MPI Parallel Programming


MPI_TYPE_FREE

• MPI_TYPE_FREE deallocates the datatype:

call MPI_TYPE_FREE(datatype, ierror)

• datatype is returned as MPI_DATATYPE_NULL.
• After calling MPI_TYPE_FREE, the datatype can be redefined.

Page 113: MPI Parallel Programming


Communicator

• Parallel libraries also send messages. A library message may accidentally use the same tag as the user's own messages, so we need to distinguish messages sent by library routines from messages sent by our own code.
• The solution adopted by MPI is to add a further piece of information: a communicator.
• Two processes using distinct communicators cannot receive messages from each other.