Programming using MPI - ccds.iitkgp.ac.in

Programming using MPI

Preeti MalakarIndian Institute of Technology Kanpur

National Supercomputing

Mission

Centre for Development of

Advanced Computing

SKAIndia

IITKharagpur

High Performance Computing for Astronomy and Astrophysics

FLOPS

2

Programming Supercomputers

PARAM Shakti, IIT Kharagpur3

Top 500 Supercomputers

4

Parallel Programming Models

Shared memory programming – OpenMP, Pthreads

Thread

Core

5

Cache

Message Passing Interface

• Standard for message passing in a distributed memory environment (most widely used programming model in supercomputers)

• Efforts began in 1991 by Jack Dongarra, Tony Hey, and David W. Walker

• MPI Forum formed in 1993• Version 1.0: 1994

• Version 2.0: 1997

• Version 3.0: 2012

•

6

Parallel Programming Models

Distributed memory programming – MPI (Message Passing Interface)

Memory

Process

Distributed memory (each process has its own address space)

Core

7

8Adapted from Neha Karanjkar’s slides

What is MPI?

Enables parallel programming using message passing

9

Process 0 Process 1

Distinct Process Address Space

x = 1, y = 2. . .

x+=2. . .

print x, y

Program order

x = 1, y = 2. . .y++. . .

print x, y

3, 2 1, 3

Process 0 Process 1

x = 1, y = 2. . .

x+=1. . .

print x, y

x = 10, y = 20. . .x++. . .

print x, y

2, 2 11, 20

Process 0 Process 1

10

Parallel Execution – Single Node

11

Cache

Process

Core

Instruction 1

Instruction 2

…

Instruction 1

Instruction 2

…

Instruction 1

Instruction 2

…

Instruction 1

Instruction 2

…

parallel.x parallel.x parallel.x parallel.x

parallel.x

Parallel Execution – Multiple Nodes

12

Memory

Process

Core

Instruction 1

Instruction 2

…

Instruction 1

Instruction 2

…

Instruction 1

Instruction 2

…

Instruction 1

Instruction 2

…

parallel.x parallel.x parallel.x parallel.x

parallel.x

Message Passing

13

Process 0 Process 1 Process 2 Process 3

Getting Started

14

Initialization

Finalization

• gather information about the parallel job

• set up internal library state• prepare for communication

Process Identification

15

Total number of processes

Rank of a process

• MPI_Comm_size: get the total number of processes

• MPI_Comm_rank: get my rank

MPI Code Execution Steps (On cluster)

• Compile• mpicc -o program.x program.c

• Execute• mpirun -np 1 ./program.x (or mpiexec in place of mpirun)

• Runs 1 process on the launch node

• mpirun -np 6 ./program.x• Runs 6 processes on the launch node

• mpirun -np 6 -f hostfile ./program.x• Runs 6 processes on the nodes specified in the hostfile

16

<hostfile>

Host1:3Host2:2Host3:1…

Multiple Processes on Multiple Cores and Nodes

mpiexec –np 8 –f hostfile ./program.x

17

Memory

Process

Core

Node 1 Node 2 Node 3 Node 4

0

1

2

3

4

5

6

7

Node1:2Node2:2Node3:2Node4:2

Hostfile

How?

Multiple Processes on Multiple Cores and Nodes

mpiexec –np 8 –f hostfile ./program.x 18

Memory

Process

Core

Node 1 Node 2 Node 3 Node 4

0

1

2

3

4

5

6

7

Node1:2Node2:2Node3:2Node4:2

Hostfile

Launch node

Proxy Proxy ProxyProxy

19

MPI Process Spawning

The child process and the parent process run in separate memory spaces. At the time of fork() both memory spaces have the same content. The entire virtual address space of the parent is replicated in the child.

https://man7.org/linux/man-pages/man2/fork.2.html

https://linux.die.net/man/3/execvp

The exec() family of functions replaces the current process image with a new process image.





SIMD Parallelism (Recap)

Memory

Process

Core

Instruction 1

Instruction 2

…

parallel.x

20

NO centralized server/master (typically)

Sum of Squares of N numbers

21

Serial

for i = 1 to N

sum += a[i] * a[i]

Parallel

for i = 1 to N/P

sum += a[i] * a[i]

P

Memory

Process

Core

0 1 2 3

for i = 1 to N/P

sum += a[i] * a[i]

for i = 1 to N/P

sum += a[i] * a[i]

for i = 1 to N/P

sum += a[i] * a[i]

for i = 1 to N/P

sum += a[i] * a[i]

for i = N/P * rank ; i < N/P * (rank+1) ; i++

localsum += a[i] * a[i]

Collect localsum, add up at one of the ranks

MPI Installation

22

1. sudo apt-get install mpich

OR

1. Download source code from https://www.mpich.org/downloads/2. ./configure --prefix=<INSTALL_PATH>3. make4. make install

https://www.mpich.org/downloads/

https://www.mpich.org/downloads/

MPI Code

23

Global communicator

Global communicator

MPI_COMM_WORLD

Required in every MPI communication

24

Communication Scope

25

Communicator (communication handle)

• Defines the scope

• Specifies communication context

Process

• Belongs to a group

• Identified by a rank within a group

Identification

• MPI_Comm_size – total number of processes in communicator

• MPI_Comm_rank

MPI_COMM_WORLD

2

4

3

1

0

26

Process identified by rank/id

MPI Message

• Data and header/envelope

• Typically, MPI communications send/receive messages

27

Source: Origin of message

Destination: Receiver of message

Communicator

Tag (0:MPI_TAG_UB)

Message Envelope

MPI Communication Types

Point-to-pointCollective

Point-to-point Communication

• MPI_Send

• MPI_Recv

SENDER

RECEIVER

int MPI_Send (const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Tags should match

Blocking send and receive

MPI_Datatype

• MPI_BYTE

• MPI_CHAR

• MPI_INT

• MPI_FLOAT

• MPI_DOUBLE

30

Example 1

31

MPI_Comm_rank (MPI_COMM_WORLD, &myrank);

// Sender processif (myrank == 0) /* code for process 0 */{strcpy (message,"Hello, there");MPI_Send (message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);

}

// Receiver processelse if (myrank == 1) /* code for process 1 */{MPI_Recv (message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);printf ("received :%s\n", message);

}

Message tag

Message tag

MPI_Status

• Source rank

• Message tag

• Message length• MPI_Get_count

32

MPI_ANY_*

• MPI_ANY_SOURCE• Receiver may specify wildcard value for source

• MPI_ANY_TAG• Receiver may specify wildcard value for tag

33

Example 2

34

MPI_Comm_rank (MPI_COMM_WORLD, &myrank);

// Sender processif (myrank == 0) /* code for process 0 */{strcpy (message,"Hello, there");MPI_Send (message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);

}

// Receiver processelse if (myrank == 1) /* code for process 1 */{MPI_Recv (message, 20, MPI_CHAR, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD,

&status);printf ("received :%s\n", message);

}

MPI_Send (Blocking)

• Does not return until buffer can be reused

• Message buffering

• Implementation-dependent

• Standard communication mode

SENDER

RECEIVER

35

Safety

36

0 1

MPI_Send

MPI_Send

MPI_Recv

MPI_RecvSafe

MPI_Send

MPI_Recv

MPI_Send

MPI_RecvUnsafe

MPI_Send

MPI_Recv

MPI_Recv

MPI_SendSafe

MPI_Recv

MPI_Send

MPI_Recv

MPI_SendUnsafe

Other Send Modes

• MPI_Bsend Buffered• May complete before matching receive is posted

• MPI_Ssend Synchronous• Completes only if a matching receive is posted

• MPI_Rsend Ready•

37

Non-blocking Point-to-Point

• MPI_Isend (buf, count, datatype, dest, tag, comm, request)

• MPI_Irecv (buf, count, datatype, source, tag, comm, request)

• MPI_Wait

38

0 1

MPI_Isend

MPI_Recv

MPI_Isend

MPI_RecvSafe

Computation Communication Overlap

39

0 1

MPI_Isend

MPI_Recv

MPI_Wait

compute

compute

compute

compute

compute

Time

Collective Communications

• Must be called by all processes that are part of the communicator

Types

• Synchronization (MPI_Barrier)

• Global communication (MPI_Bcast, MPI_Gather, …)

• Global reduction (MPI_Reduce

40

Barrier

• Synchronization across all group members

• Collective call

• Blocks until all processes have entered the call

• MPI_Barrier (comm)

41

Broadcast

• Root process sends message to all processes

• Any process can be root process but has to be the same in all processes

• int MPI_Bcast (buffer, count, datatype, root, comm)

• Number of elements in buffer – count

• buffer – Input or output

42

X X

X

X

X

X

X

X

X

Example 3

43

int rank, size, color;MPI_Status status;

MPI_Init (&argc, &argv);MPI_Comm_rank (MPI_COMM_WORLD, &rank);MPI_Comm_size (MPI_COMM_WORLD, &size);

color = rank + 2;int oldcolor = color;MPI_Bcast (&color, 1, MPI_INT, 0, MPI_COMM_WORLD);

printf ("%d: %d color changed to %d\n", rank, oldcolor, color);

Gather

• Gathers values from all processes to a root process

• int MPI_Gather (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)

• Arguments recv* not relevant on non-root processes

44

A0 A1 A2

A0

A1

A2

DATA

PR

OC

ESSE

S

Example 4

45

MPI_Comm_rank (MPI_COMM_WORLD, &rank);MPI_Comm_size (MPI_COMM_WORLD, &size);

color = rank + 2;

int colors[size];MPI_Gather (&color, 1, MPI_INT, colors, 1, MPI_INT, 0, MPI_COMM_WORLD);

if (rank == 0)for (i=0; i<size; i++)printf ("color from %d = %d\n", i, colors[i]);

Scatter

• Scatters values to all processes from a root process

• int MPI_Scatter (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)

• Arguments send* not relevant on non-root processes

• Output parameter – recvbuf

46

A0 A1 A2

A0

A1

A2

DATA

PR

OC

ESSE

S

Allgather

• All processes gather values from all processes

• int MPI_Allgather (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)

47

A0 A1 A2

A0 A1 A2

A0 A1 A2

A0

A1

A2

DATA

PR

OC

ESSE

S

Reduce

• MPI_Reduce (inbuf, outbuf, count, datatype, op, root, comm)

• Combines element in inbuf of each process

• Combined value in outbuf of root

• op: MIN, MAX, SUM, PROD, …

48

0 1 2 3 4 5 6 2 1 2 3 2 5 2 0 1 1 0 1 1 0

2 1 2 3 4 5 6 MAX at root

Allreduce

• MPI_Allreduce (inbuf, outbuf, count, datatype, op, comm)

• op: MIN, MAX, SUM, PROD, …

• Combines element in inbuf of each process

• Combined value in outbuf of each process

49

0 1 2 3 4 5 6 2 1 2 3 2 5 2 0 1 1 0 1 1 0

2 1 2 3 4 5 6

MAX

2 1 2 3 4 5 62 1 2 3 4 5 6

DEMO

50

[email protected]

51

Programming using MPI - ccds.iitkgp.ac.in

Documents

Transcript of Programming using MPI - ccds.iitkgp.ac.in