Advanced Message-Passing Interface (MPI)

Bart Oldeman, Calcul Quebec – McGill HPC

Bart.Oldeman@mcgill.ca

1

Outline of the workshop

Morning: Advanced MPI

• Revision

• More on Collectives

• More on Point-to-Point

• Data types and Packing

• Communicators and Groups

• Topologies

• Exercises

2

Outline of the workshop

Afternoon: Hybrid MPI/OpenMP

• Theory and benchmarking

• Examples

3

What is MPI?

• MPI is a specification for a standardized library:

• http://www.mpi-forum.org
• You use its subroutines
• You link it with your code

• History: MPI-1 (1994), MPI-2 (1997), MPI-3 (2012).

• Different implementations:

• MPICH(2), MVAPICH(2), OpenMPI, HP-MPI, ...

4

Overview of New Features in MPI-3

• Major new features

• Non-blocking collectives
• Neighbourhood collectives
• Improved one-sided communication interface
• Tools interface
• Fortran 2008 bindings (“use mpi_f08”), less error-prone

• Other new features

• Matching Probe and Recv for thread-safe probe/receive
• Non-collective communicator creation function
• “const”-correct C bindings
• Comm_split_type function
• Non-blocking Comm_dup
• Type_create_hindexed_block function

• C++ bindings and deprecated functions removed
• New MPI libraries support it (e.g. OpenMPI 1.8)

5

Review: MPI routines we know ...

• Start up and exit:

• MPI_Init, MPI_Finalize

• Information on the processes:

• MPI_Comm_rank, MPI_Comm_size

• Point-to-point communications:

• MPI_Send, MPI_Recv, MPI_Irecv, MPI_Isend, MPI_Wait

• Collective communications:

• MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Gather

6

Example: “Hello from N cores”

Fortran:

PROGRAM hello
  USE mpi_f08
  INTEGER rank, size
  CALL MPI_Init()
  CALL MPI_Comm_rank(MPI_COMM_WORLD, rank)
  CALL MPI_Comm_size(MPI_COMM_WORLD, size)
  WRITE(*,*) 'Hello from processor ', rank, ' of ', size
  CALL MPI_Finalize()
END PROGRAM hello

C:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from processor %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

7

More on Collectives

8

More on Collectives

• “All” functions

• MPI_Allgather, MPI_Allreduce: combined MPI_Gather/MPI_Reduce with MPI_Bcast; all ranks receive the resulting data.

• MPI_Alltoall: everybody gathers subsequent blocks. Works like a matrix transpose.

• “v” functions

• MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv, MPI_Alltoallv: instead of a “count” argument, use “counts” and “displs” arrays that specify the counts and array displacements for every rank involved.

• MPI_Barrier: synchronization.

• MPI_Abort: abort with an error code.

9
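To make the “transpose” behaviour of MPI_Alltoall concrete, here is a minimal C sketch (not part of the workshop files): block i of each rank's send buffer goes to rank i, and block i of the receive buffer comes from rank i, so if every rank holds one row of a size x size matrix beforehand, it holds one column afterwards.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * size + i;   /* this rank's "row" of a size x size matrix */

    /* block i of sendbuf goes to rank i; block i of recvbuf comes from rank i */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    for (int i = 0; i < size; i++)
        printf("rank %d received %d from rank %d\n", rank, recvbuf[i], i);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}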

Exercise 1: MPI_Alltoall

Log in and compile the file alltoall.f90 or alltoall.c:

ssh -X user@guillimin.hpc.mcgill.ca

cp -a /software/workshop/advancedmpi/* ./

module add ifort icc openmpi/1.8.3-intel

mpicc alltoall.c -o alltoall

mpifort alltoall.f90 -o alltoall

There are errors. Can you fix them? Hint: type man MPI_Alltoall to obtain the syntax for the MPI function. To submit the job, use

qsub -q class alltoall.pbs

10

Exercise 2: Matrix-vector multiplication

Complete the multiplication in mv1.f90 or mv1.c using MPI_Allgatherv. Rows of the matrix are distributed among processors. Example: rows 1 and 2 in rank 0, row 3 in rank 1:

\[
v = Ax =
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,1} & a_{2,2} & a_{2,3} \\
a_{3,1} & a_{3,2} & a_{3,3}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix}
a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 \\
a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 \\
a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3
\end{pmatrix}
=
\begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}
\]
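One possible shape of the solution is sketched below (a hedged sketch, not the contents of mv1.c; the way rows are spread over counts/displs is an assumption). Each rank computes the entries of v for its own rows, then a single MPI_Allgatherv assembles the full vector on every rank.

/* Hypothetical sketch (not mv1.c): row-distributed matrix-vector product,
   assembled with MPI_Allgatherv. Rank r owns rows displs[r] .. displs[r]+counts[r]-1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, p, n = 7;                      /* small n, deliberately not divisible by p */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *counts = malloc(p * sizeof(int)), *displs = malloc(p * sizeof(int));
    for (int r = 0, row = 0; r < p; r++) {
        counts[r] = n / p + (r < n % p);     /* spread the remainder over the first ranks */
        displs[r] = row;
        row += counts[r];
    }
    int nloc = counts[rank], first = displs[rank];

    double *aloc = malloc(nloc * n * sizeof(double));        /* my rows of A */
    double *x = malloc(n * sizeof(double));
    double *v = malloc(n * sizeof(double));
    double *vloc = malloc(nloc * sizeof(double));
    for (int j = 0; j < n; j++) x[j] = 1.0;
    for (int i = 0; i < nloc; i++)
        for (int j = 0; j < n; j++)
            aloc[i*n + j] = (double)((first + i) * n + j);   /* arbitrary test matrix */

    for (int i = 0; i < nloc; i++) {                         /* my local rows of v = A x */
        vloc[i] = 0.0;
        for (int j = 0; j < n; j++)
            vloc[i] += aloc[i*n + j] * x[j];
    }

    /* gather everybody's rows so that all ranks end up with the complete v */
    MPI_Allgatherv(vloc, nloc, MPI_DOUBLE, v, counts, displs, MPI_DOUBLE,
                   MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < n; i++)
            printf("v[%d] = %g\n", i, v[i]);
    MPI_Finalize();
    return 0;
}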

11

Exercise 3: Matrix-vector multiplication

Complete the multiplication in mv2.f90 or mv2.c using MPI_Reduce_scatter. Columns of the matrix and input vector are distributed among processors. Example: columns 1 and 2 in rank 0, column 3 in rank 1:

\[
v = Ax =
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,1} & a_{2,2} & a_{2,3} \\
a_{3,1} & a_{3,2} & a_{3,3}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix}
a_{1,1}x_1 + a_{1,2}x_2 \\
a_{2,1}x_1 + a_{2,2}x_2 \\
a_{3,1}x_1 + a_{3,2}x_2
\end{pmatrix}
+
\begin{pmatrix}
a_{1,3}x_3 \\
a_{2,3}x_3 \\
a_{3,3}x_3
\end{pmatrix}
= \cdots
\]

(continued on the next slide)

12

Exercise 3: Matrix-vector multiplication

\[
\begin{pmatrix}
a_{1,1}x_1 + a_{1,2}x_2 \\
a_{2,1}x_1 + a_{2,2}x_2 \\
a_{3,1}x_1 + a_{3,2}x_2
\end{pmatrix}
+
\begin{pmatrix}
a_{1,3}x_3 \\
a_{2,3}x_3 \\
a_{3,3}x_3
\end{pmatrix}
=
\begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}
\quad \text{(after MPI\_Reduce\_scatter)}
\]

Note: could also use MPI_Reduce, MPI_Allreduce, or MPI_Alltoallv here.
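A possible shape of the MPI_Reduce_scatter solution (a hedged sketch, not the contents of mv2.c; the column distribution via counts/displs is an assumption): each rank forms a full-length vector of partial sums from its own columns, and MPI_Reduce_scatter sums those vectors over all ranks while scattering block r of the result to rank r.

/* Hypothetical sketch (not mv2.c): column-distributed matrix-vector product.
   Rank r owns columns displs[r] .. displs[r]+counts[r]-1 of A and the matching
   block of x; MPI_Reduce_scatter adds the partial vectors and leaves block r
   of v on rank r. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, p, n = 7;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *counts = malloc(p * sizeof(int)), *displs = malloc(p * sizeof(int));
    for (int r = 0, col = 0; r < p; r++) {
        counts[r] = n / p + (r < n % p);
        displs[r] = col;
        col += counts[r];
    }
    int nloc = counts[rank], first = displs[rank];

    double *aloc = malloc(n * nloc * sizeof(double));   /* n rows, my nloc columns */
    double *xloc = malloc(nloc * sizeof(double));
    for (int j = 0; j < nloc; j++) xloc[j] = 1.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < nloc; j++)
            aloc[i*nloc + j] = (double)(i * n + first + j);

    /* full-length vector of partial sums over this rank's columns */
    double *partial = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < nloc; j++)
            partial[i] += aloc[i*nloc + j] * xloc[j];
    }

    /* sum the partial vectors over all ranks and scatter v in blocks of counts[] */
    double *vloc = malloc(nloc * sizeof(double));
    MPI_Reduce_scatter(partial, vloc, counts, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d holds v[%d..%d]\n", rank, first, first + nloc - 1);
    MPI_Finalize();
    return 0;
}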

13

More on point-to-point

• MPI_Ssend: synchronous send; forced to complete only when the matching receive has been posted.

• MPI_Bsend: buffered send using a user-provided buffer.

• MPI_Rsend: ready send; must come after the matching receive was posted. Rarely used.

• MPI_Issend, MPI_Ibsend, MPI_Irsend: asynchronous versions.

• MPI_Sendrecv[_replace]: sends and receives, avoiding deadlock (like MPI_Irecv, MPI_Isend, MPI_Wait).

Note: generally plain MPI_Recv and MPI_Send are best.

14
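As a small illustration (a sketch, not one of the exercise files), the ring shift below does with one MPI_Sendrecv per rank what would otherwise take an MPI_Irecv/MPI_Send/MPI_Wait combination, with no risk of deadlock:

/* Hypothetical sketch: shift a value around a ring with one MPI_Sendrecv per rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, sendval, recvval;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* destination */
    int left  = (rank - 1 + size) % size;   /* source */
    sendval = rank;

    /* send to the right neighbour and receive from the left one in a single call */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}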

Example: Poisson Problem

• See http://www.mcs.anl.gov/~thakur/sc13-mpi-tutorial/

• To approximate the solution of the Poisson problem ∇²u = f on the unit square, with u = 0 on the boundaries of the domain (Dirichlet boundary conditions), this simple 2nd-order difference scheme is often used:

• (U(x+h, y) − 2U(x, y) + U(x−h, y))/h² + (U(x, y+h) − 2U(x, y) + U(x, y−h))/h² = f(x, y)

• The solution U is approximated on a discrete grid of points x = 0, h, 2h, 3h, ..., (1/h)h = 1 and y = 0, h, 2h, 3h, ..., 1.

• To simplify the notation, U(ih, jh) is denoted Uij.

• This is defined on a discrete mesh of points (x, y) = (ih, jh), for a mesh spacing h.
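Solving the difference scheme for U(x, y) gives the Jacobi-style update used in the stencil codes. Below is a hedged C sketch of one sweep; the (bx+2) x (by+2) halo layout follows the next slide, but the array and variable names are assumptions, not necessarily those of stencil.cpp.

/* Hypothetical sketch: one Jacobi sweep for the discretized Poisson problem.
   Solving the 5-point formula for U(i,j) gives
   Unew(i,j) = ( U(i-1,j) + U(i+1,j) + U(i,j-1) + U(i,j+1) - h*h*F(i,j) ) / 4.
   u, unew, f are (bx+2)*(by+2) arrays with one layer of ghost cells. */
static void jacobi_sweep(int bx, int by, double h,
                         const double *u, const double *f, double *unew)
{
    for (int j = 1; j <= by; j++)
        for (int i = 1; i <= bx; i++)
            unew[i + j*(bx+2)] =
                0.25 * ( u[(i-1) + j*(bx+2)] + u[(i+1) + j*(bx+2)]
                       + u[i + (j-1)*(bx+2)] + u[i + (j+1)*(bx+2)]
                       - h*h*f[i + j*(bx+2)] );
}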

15

2D grid with halos of ghost cells

Example: 4 processes in a 2x2 grid, n = 8, bx = by = 4. Heat sources at (2, 2), (4, 4), (6, 7).

[Figure: each process holds a (bx+2) x (by+2) block of grid cells including ghost cells; the 5-point stencil, the MPI north-south and east-west halo exchanges, and the heat sources are marked.]

16

Exercise 4: Poisson Problem

• Serial code: stencil.cpp, stencil.f90

• Use mpicxx to compile C++ code and mpifort or mpif90 to compile Fortran code.

• MPI code using the non-blocking functions MPI_Irecv et al.: stencil_mpi.{cpp,f90}. Note also the use of MPI_PROC_NULL and MPI_Waitall.

• Exercise: use MPI_Sendrecv instead of non-blocking communications.

17

Packing and Datatypes

These functions create new data types:

• MPI_Type_contiguous, MPI_Type_vector, MPI_Type_indexed: transfer parts of a matrix directly.

• MPI_Type_struct: transfer a struct.

• MPI_Pack, MPI_Unpack: pack and send heterogeneous data.

• Note: generally data types are recommended instead of MPI_Pack.

18

Exercise 5: Poisson with datatypes

MPI_Datatype north_south_type;
MPI_Type_contiguous(bx, MPI_DOUBLE, &north_south_type);
MPI_Type_commit(&north_south_type);
MPI_Datatype east_west_type;
MPI_Type_vector(by, 1, bx+2, MPI_DOUBLE, &east_west_type);
MPI_Type_commit(&east_west_type);
...
MPI_Type_free(&north_south_type);
MPI_Type_free(&east_west_type);

Note: for MPI_Type_vector(count, length, stride, ...):

• count=by: number of vertical elements
• length=1: number of contiguous elements in a range
• stride=bx+2: offset between each range

Call MPI_Sendrecv with such types, scount=rcount=1.

19
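The halo exchange then reduces to four MPI_Sendrecv calls, one per direction, each moving a single element of the appropriate derived type. The sketch below is hedged: the buffer pointers (first/last interior row or column, the ghost rows/columns) and the neighbour ranks north/south/east/west are placeholders for whatever the stencil skeleton actually uses; sends to MPI_PROC_NULL neighbours are simply no-ops.

/* Hypothetical sketch: one MPI_Sendrecv per direction, count 1 of a derived type. */
MPI_Sendrecv(first_interior_row, 1, north_south_type, north, 0,
             south_ghost_row,    1, north_south_type, south, 0,
             comm, MPI_STATUS_IGNORE);
MPI_Sendrecv(last_interior_row,  1, north_south_type, south, 1,
             north_ghost_row,    1, north_south_type, north, 1,
             comm, MPI_STATUS_IGNORE);
MPI_Sendrecv(first_interior_col, 1, east_west_type,   west,  2,
             east_ghost_col,     1, east_west_type,   east,  2,
             comm, MPI_STATUS_IGNORE);
MPI_Sendrecv(last_interior_col,  1, east_west_type,   east,  3,
             west_ghost_col,     1, east_west_type,   west,  3,
             comm, MPI_STATUS_IGNORE);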

Communicators

• So far we have only used MPI_COMM_WORLD.

• We can split this communicator into subsets, to allow collective operations on a subset of ranks.

• Easiest to use: MPI_Comm_split(comm, color, key, newcomm[, ierror]) (see the sketch below):

• comm: old communicator
• color: all processes with the same color go into the same communicator
• key: controls the rank ordering within the new communicator (can be 0 for automatic determination)
• newcomm: resulting new communicator
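A minimal sketch of MPI_Comm_split (the even/odd split is only an example, not from the workshop files):

/* Hypothetical sketch: split MPI_COMM_WORLD into two communicators,
   one for the even ranks and one for the odd ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, newrank;
    MPI_Comm newcomm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int color = rank % 2;          /* ranks with the same color end up together */
    MPI_Comm_split(MPI_COMM_WORLD, color, 0, &newcomm);   /* key 0: keep old order */
    MPI_Comm_rank(newcomm, &newrank);

    printf("world rank %d -> rank %d in communicator %d\n", rank, newrank, color);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}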

20

Topologies

• Topologies group processes in an n-dimensional grid (Cartesian) or graph. Here we restrict ourselves to a Cartesian 2D grid.

• Helps the programmer and (sometimes) the hardware.

• MPI_Dims_create(p, n, dims): create a balanced n-dimensional grid for p processes in the n-dimensional array dims.

21

Topologies

• MPI_Cart_create(oldcomm, n, dims, periodic, reorder, newcomm): creates a new communicator for the grid with n dimensions in dims, with implied periodicity in the array periodic. reorder specifies whether the ranks may change for the new communicator.

• MPI_Cart_rank(comm, coords, rank): given n-dimensional coordinates, return the rank.

• MPI_Cart_coords(comm, rank, n, coords): given the rank, return the n coordinates.

• MPI_Cart_shift(comm, dim, disp, source, dest): given the dimension number and displacement shift, return the previous and next rank.

22

Exercise 6: Poisson with Cartesian topology

Use a Cartesian grid in the Poisson example:

int pdims[2] = {0,0};
// compute good (rectangular) domain decomposition
MPI_Dims_create(p, 2, pdims);
int px = pdims[0], py = pdims[1];
int periods[2] = {0,0};
MPI_Comm topocomm; // create Cartesian topology
MPI_Cart_create(comm, 2, pdims, periods, 0, &topocomm);
// get my local x,y coordinates
int coords[2];
MPI_Cart_coords(topocomm, r, 2, coords);
int rx = coords[0], ry = coords[1];
int source, north, south, east, west;
MPI_Cart_shift(topocomm, 0, 1, &west, &east);
MPI_Cart_shift(topocomm, 1, 1, &north, &south);

Then use topocomm instead of comm in MPI communications.

23

MPI-3: Neighborhood collectives

• Communicate with direct neighbors in a Cartesian topology
• Corresponds to cart_shift with disp=1
• Collective (all processes in comm must call it, including processes without neighbors)
• Buffers are laid out as a sequence of neighbors:

• Defined by the order of dimensions, first negative, then positive
• 2*ndims sources and destinations
• Processes at borders (MPI_PROC_NULL) leave holes in the buffers (these will not be updated or communicated)!

24

MPI 3.0: Neighborhood collectives

• MPI_Neighbor_alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)

• Like MPI_Sendrecv, but sends to and receives from all direct neighbours in the Cartesian grid.

• A non-blocking variant MPI_Ineighbor_alltoall exists, as do ...alltoallv and ...alltoallw variants.

25

Exercise 8: Poisson/MPI_Neighbor_alltoallv

Use one call to MPI_Neighbor_alltoallv instead of 4 MPI_Sendrecv calls. Use an explicit buffer where you pack and unpack data.
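One way the call could look (a hedged sketch; bx, by, the packed buffers and the use of topocomm from exercise 6 are assumptions): with the MPI_Cart_shift calls of exercise 6, the neighbour order in the buffers is west, east, north, south (dimension 0 first, negative direction before positive).

/* Hypothetical sketch for exercise 8: one MPI_Neighbor_alltoallv on the
   2D Cartesian communicator replaces the four MPI_Sendrecv calls. */
#include <mpi.h>

static void exchange_halos(double *sendbuf, double *recvbuf,
                           int bx, int by, MPI_Comm topocomm)
{
    /* neighbour order: dim 0 negative (west), dim 0 positive (east),
       dim 1 negative (north), dim 1 positive (south) */
    int counts[4]  = { by, by, bx, bx };           /* edge lengths per neighbour */
    int sdispls[4] = { 0, by, 2*by, 2*by + bx };   /* offsets into the packed buffers */
    int rdispls[4] = { 0, by, 2*by, 2*by + bx };

    /* sendbuf must already contain the west, east, north, south edges packed in
       that order; recvbuf receives the ghost data in the same neighbour order */
    MPI_Neighbor_alltoallv(sendbuf, counts, sdispls, MPI_DOUBLE,
                           recvbuf, counts, rdispls, MPI_DOUBLE, topocomm);
}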

26

Exercise 9: Poisson/MPI_Neighbor_alltoallw

(Bonus) Use one call to MPI_Neighbor_alltoallw using derived data types. Hint: use MPI_Get_address to compute the byte offsets for the array elements.

27

Exercise 10: Poisson/MPI_Ineighbor_alltoallw

(Bonus) Use one call to MPI_Ineighbor_alltoallw using derived data types, where you do the computation of the inner cells (those that do not need the ghost cells) before the MPI_Wait call (overlapping communication with computation).

28

Exercise 11: Matrix-vector multiplication

(Bonus) Complete the multiplication in mv3.f90 or mv3.c using a Cartesian topology. Blocks of the matrix are distributed among processors. Example:

rows 1–2, columns 1–2 in rank 0 (0,0)
rows 1–2, column 3 in rank 1 (0,1)
row 3, columns 1–2 in rank 2 (1,0)
row 3, column 3 in rank 3 (1,1)

\[
v = Ax =
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,1} & a_{2,2} & a_{2,3} \\
a_{3,1} & a_{3,2} & a_{3,3}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \cdots
\]

(continued on the next slide)

29

Exercise 11: Matrix-vector multiplication

\[
v = Ax =
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,1} & a_{2,2} & a_{2,3} \\
a_{3,1} & a_{3,2} & a_{3,3}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix}
a_{1,1}x_1 + a_{1,2}x_2 \\
a_{2,1}x_1 + a_{2,2}x_2 \\
a_{3,1}x_1 + a_{3,2}x_2
\end{pmatrix}
+
\begin{pmatrix}
a_{1,3}x_3 \\
a_{2,3}x_3 \\
a_{3,3}x_3
\end{pmatrix}
=
\begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}
\]

Use an MPI_Reduce call to obtain v.

Advantage: both the vectors and the matrix can be distributed in memory. For an n × n matrix on p processors, O(n/√p) bytes are transferred instead of O(n).

30

Hybrid MPI and OpenMP

• Most clusters, including Guillimin, contain multicore nodes.

• For Guillimin, 12 or 16 cores per node.

• Idea: use hybrid MPI and OpenMP: MPI for internode communication, OpenMP intranode, eliminating intranode communication.

• May or may not run faster than pure MPI code.

31

Considerations for performance

• latency: minimal time to send a message (overhead)
• bandwidth: bytes per second sent across

• MPI (on Guillimin!)

• inter-node (network) latency: around 1.8 µs
• intra-node (shared memory) latency: around 0.4 µs
• inter-node (network) bandwidth: around 3–5 GB/s
• intra-node (shared memory) bandwidth: around 8–16 GB/s (similar to memory-bound computation bandwidth)

• OpenMP

• pragma omp barrier overhead: around 0.4 µs

32

First step: measure efficiency

• Insert MPI_Wtime calls to measure wall clock time.

• Run for various values of p to determine scaling.
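For example (a small sketch; rank is assumed to hold the result of MPI_Comm_rank):

double t0 = MPI_Wtime();
/* ... the part of the computation being measured ... */
double t1 = MPI_Wtime();
if (rank == 0)
    printf("elapsed: %f s (timer resolution %g s)\n", t1 - t0, MPI_Wtick());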

33

Amdahl's law

• Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup ψ achievable by a parallel computer with p processors performing the computation is

\[
\psi \le \frac{1}{f + (1-f)/p}
\]

• Example: if f = 0.0035, then the maximum speedup is 285 for p → ∞, and for p = 1024, ψ = 223.
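Written out with the numbers from the example:

\[
\frac{1}{f} = \frac{1}{0.0035} \approx 285.7,
\qquad
\frac{1}{0.0035 + 0.9965/1024} \approx 223.6 .
\]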

34

Karp-Flatt metric

• We can also determine the experimentally determined serial fraction e, given the measured speedup ψ:

\[
e = \frac{1/\psi - 1/p}{1 - 1/p}
\]

• Example: p = 2, ψ = 1.95, e = 0.026.
• Example: p = 1024, ψ = 200, e = 0.0040.

35

When to consider hybrid?

• If the serial portion is too expensive to parallelize using MPI but it can be done using threads: definitely! Also think of load balancing and memory.

• If the problem does not scale well due to excessive communication (e increases significantly as p increases): maybe. Perhaps the MPI performance can be improved:

• Fewer messages (less latency).
• Shorter messages.
• Replace communication by computation where possible.
• Example: for broadcasts, tree-like communication is much more efficient than sending from the master process directly to all other processes (fewer messages in the master process).

• Analysts are here to help you optimize your code!

36

When to consider hybrid?

• Otherwise pure MPI can be just as fast.

• Also, watch out for OpenMP pitfalls: caching, false sharing, synchronization overhead, races.

• MPI-3 supports another alternative: shared-memory windows (see the stencil_mpi_shmem.cpp Poisson example).

37

MPI's Four Levels of Thread Safety

• MPI defines four levels of thread safety – these are commitments the application makes to the MPI library:

• MPI_THREAD_SINGLE: only one thread exists in the application
• MPI_THREAD_FUNNELED: multithreaded, but only the main thread makes MPI calls (the one that called MPI_Init_thread)
• MPI_THREAD_SERIALIZED: multithreaded, but only one thread at a time makes MPI calls
• MPI_THREAD_MULTIPLE: multithreaded and any thread can make MPI calls at any time (with some restrictions to avoid races: complex!)

• MPI defines an alternative to MPI_Init:

• MPI_Init_thread(requested, provided)

• The application gives the level it needs; the MPI implementation gives the level it supports.

38

MPI_THREAD_FUNNELED

• All MPI calls are made by the master thread

• Outside the OpenMP parallel regions
• In OpenMP master regions

• Most common construct, but limited scalability: during MPI communications all other threads are sleeping!

#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank, buf[100], provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel for
    for (i = 0; i < 100; i++)
        compute(buf[i]);
    /* Do MPI stuff */
    MPI_Finalize();
    return 0;
}

39

Example job script for Guillimin

For 48 CPU cores on 4 nodes with 12 cores each:

#!/bin/bash
#PBS -l nodes=4:ppn=12
#PBS -V
#PBS -N jobname
cd $PBS_O_WORKDIR
export IPATH_NO_CPUAFFINITY=1
export OMP_NUM_THREADS=12
mpiexec -n 4 --map-by node:PE=12 ./yourcode

For 48 CPU cores on 8 sockets with 6 cores each:

...
export OMP_NUM_THREADS=6
mpiexec -n 8 --map-by socket:PE=6 ./yourcode
# use mpiexec -n 8 ... --report-bindings to see affinity bindings

40

Example job script for Guillimin

The particular features of this submission script are:

• export IPATH_NO_CPUAFFINITY=1: tells the underlying software not to pin each process to one CPU core, which would effectively disable OpenMP parallelism.

• export OMP_NUM_THREADS=12: specifies the number of threads used for OpenMP for all 4 processes.

• mpiexec -n 4 --map-by node:PE=12 ./yourcode: starts the program yourcode, compiled with MPI, in parallel on 4 nodes, with 12 processors bound to each MPI process (you may use -npernode 1 or -ppn 1 with other MPI implementations).

41

OpenMP example: parallel for (C)

• Example:

void addvectors(const int *a, const int *b, int *c, const int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

• Here i is automatically made private because it is the loop variable. All other variables are shared.

• The loop is split between threads; for example, for n=10, thread 0 does indices 0 to 4 and thread 1 does indices 5 to 9.

42

OpenMP example: parallel do (Fortran)

• Example:

subroutine addvectors(a, b, c, n)
  integer n, a(n), b(n), c(n)
  integer i
  !$OMP PARALLEL DO
  do i = 1, n
     c(i) = a(i) + b(i)
  enddo
  !$OMP END PARALLEL DO
end subroutine

• Here i is automatically made private because it is the loop variable. All other variables are shared.

• The loop is split between threads; for example, for n=10, thread 0 does indices 1 to 5 and thread 1 does indices 6 to 10.

43

Exercise 12: Hybrid Poisson

• Use the solution to exercise 5 (Poisson with datatypes).

• Use MPI_Init_thread(..., MPI_THREAD_FUNNELED, ...) and #pragma omp parallel for or !$OMP PARALLEL DO to parallelize the outer loop involving by (see the sketch below).

• (Optional) Remove all north-south MPI communication in the y direction, assuming that py=1.
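A hedged sketch of what the change looks like (the loop bounds and array names follow the generic halo layout used earlier, not necessarily the exact skeleton):

/* Hypothetical sketch for exercise 12: request funneled threading and let
   OpenMP split the outer (by) loop of the stencil update.  MPI calls stay
   on the master thread, outside the parallel region. */
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
/* ... MPI setup and halo exchanges by the master thread only ... */

#pragma omp parallel for
for (int j = 1; j <= by; j++)            /* outer loop over the by local rows */
    for (int i = 1; i <= bx; i++)
        unew[i + j*(bx+2)] =
            0.25 * ( u[(i-1) + j*(bx+2)] + u[(i+1) + j*(bx+2)]
                   + u[i + (j-1)*(bx+2)] + u[i + (j+1)*(bx+2)]
                   - h*h*f[i + j*(bx+2)] );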

44

Exercise 13: Matrix-vector multiplication

• Consider again mv1.c and mv1.f90.

• Add a parallel for or parallel do pragma to the outer for/do loop to obtain a hybrid code.

• Measure the performance (>mv1, <mv3).

• Optional: do the same for the other two matrix-vector multiplication codes.

• To improve, consider: overlapping communication and computation, one MPI process per socket, and http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4912964

Questions? Write the Guillimin support team at guillimin@calculquebec.ca

45