Programming using MPI - ccds.iitkgp.ac.in
Transcript of Programming using MPI - ccds.iitkgp.ac.in
Programming using MPI
Preeti MalakarIndian Institute of Technology Kanpur
National Supercomputing
Mission
Centre for Development of
Advanced Computing
SKAIndia
IITKharagpur
High Performance Computing for Astronomy and Astrophysics
FLOPS
2
Programming Supercomputers
PARAM Shakti, IIT Kharagpur3
Top 500 Supercomputers
4
Parallel Programming Models
Shared memory programming – OpenMP, Pthreads
Thread
Core
5
Cache
Message Passing Interface
• Standard for message passing in a distributed memory environment (most widely used programming model in supercomputers)
• Efforts began in 1991 by Jack Dongarra, Tony Hey, and David W. Walker
• MPI Forum formed in 1993• Version 1.0: 1994
• Version 2.0: 1997
• Version 3.0: 2012
•
6
Parallel Programming Models
Distributed memory programming – MPI (Message Passing Interface)
Memory
Process
Distributed memory (each process has its own address space)
Core
7
8Adapted from Neha Karanjkar’s slides
What is MPI?
Enables parallel programming using message passing
9
Process 0 Process 1
Distinct Process Address Space
x = 1, y = 2. . .
x+=2. . .
print x, y
Program order
x = 1, y = 2. . .y++. . .
print x, y
3, 2 1, 3
Process 0 Process 1
x = 1, y = 2. . .
x+=1. . .
print x, y
x = 10, y = 20. . .x++. . .
print x, y
2, 2 11, 20
Process 0 Process 1
10
Parallel Execution – Single Node
11
Cache
Process
Core
Instruction 1
Instruction 2
…
Instruction 1
Instruction 2
…
Instruction 1
Instruction 2
…
Instruction 1
Instruction 2
…
parallel.x parallel.x parallel.x parallel.x
parallel.x
Parallel Execution – Multiple Nodes
12
Memory
Process
Core
Instruction 1
Instruction 2
…
Instruction 1
Instruction 2
…
Instruction 1
Instruction 2
…
Instruction 1
Instruction 2
…
parallel.x parallel.x parallel.x parallel.x
parallel.x
Message Passing
13
Process 0 Process 1 Process 2 Process 3
Getting Started
14
Initialization
Finalization
• gather information about the parallel job
• set up internal library state• prepare for communication
Process Identification
15
Total number of processes
Rank of a process
• MPI_Comm_size: get the total number of processes
• MPI_Comm_rank: get my rank
MPI Code Execution Steps (On cluster)
• Compile• mpicc -o program.x program.c
• Execute• mpirun -np 1 ./program.x (or mpiexec in place of mpirun)
• Runs 1 process on the launch node
• mpirun -np 6 ./program.x• Runs 6 processes on the launch node
• mpirun -np 6 -f hostfile ./program.x• Runs 6 processes on the nodes specified in the hostfile
16
<hostfile>
Host1:3Host2:2Host3:1…
Multiple Processes on Multiple Cores and Nodes
mpiexec –np 8 –f hostfile ./program.x
17
Memory
Process
Core
Node 1 Node 2 Node 3 Node 4
0
1
2
3
4
5
6
7
Node1:2Node2:2Node3:2Node4:2
Hostfile
How?
Multiple Processes on Multiple Cores and Nodes
mpiexec –np 8 –f hostfile ./program.x 18
Memory
Process
Core
Node 1 Node 2 Node 3 Node 4
0
1
2
3
4
5
6
7
Node1:2Node2:2Node3:2Node4:2
Hostfile
Launch node
Proxy Proxy ProxyProxy
19
MPI Process Spawning
The child process and the parent process run in separate memory spaces. At the time of fork() both memory spaces have the same content. The entire virtual address space of the parent is replicated in the child.
https://man7.org/linux/man-pages/man2/fork.2.html
https://linux.die.net/man/3/execvp
The exec() family of functions replaces the current process image with a new process image.
SIMD Parallelism (Recap)
Memory
Process
Core
Instruction 1
Instruction 2
…
parallel.x
20
NO centralized server/master (typically)
Sum of Squares of N numbers
21
Serial
for i = 1 to N
sum += a[i] * a[i]
Parallel
for i = 1 to N/P
sum += a[i] * a[i]
P
Memory
Process
Core
0 1 2 3
for i = 1 to N/P
sum += a[i] * a[i]
for i = 1 to N/P
sum += a[i] * a[i]
for i = 1 to N/P
sum += a[i] * a[i]
for i = 1 to N/P
sum += a[i] * a[i]
for i = N/P * rank ; i < N/P * (rank+1) ; i++
localsum += a[i] * a[i]
Collect localsum, add up at one of the ranks
MPI Installation
22
1. sudo apt-get install mpich
OR
1. Download source code from https://www.mpich.org/downloads/2. ./configure --prefix=<INSTALL_PATH>3. make4. make install
MPI Code
23
Global communicator
Global communicator
MPI_COMM_WORLD
Required in every MPI communication
24
Communication Scope
25
Communicator (communication handle)
• Defines the scope
• Specifies communication context
Process
• Belongs to a group
• Identified by a rank within a group
Identification
• MPI_Comm_size – total number of processes in communicator
• MPI_Comm_rank
MPI_COMM_WORLD
2
4
3
1
0
26
Process identified by rank/id
MPI Message
• Data and header/envelope
• Typically, MPI communications send/receive messages
27
Source: Origin of message
Destination: Receiver of message
Communicator
Tag (0:MPI_TAG_UB)
Message Envelope
MPI Communication Types
Point-to-pointCollective
Point-to-point Communication
• MPI_Send
• MPI_Recv
SENDER
RECEIVER
int MPI_Send (const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Tags should match
Blocking send and receive
MPI_Datatype
• MPI_BYTE
• MPI_CHAR
• MPI_INT
• MPI_FLOAT
• MPI_DOUBLE
30
Example 1
31
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
// Sender processif (myrank == 0) /* code for process 0 */{strcpy (message,"Hello, there");MPI_Send (message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
}
// Receiver processelse if (myrank == 1) /* code for process 1 */{MPI_Recv (message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);printf ("received :%s\n", message);
}
Message tag
Message tag
MPI_Status
• Source rank
• Message tag
• Message length• MPI_Get_count
32
MPI_ANY_*
• MPI_ANY_SOURCE• Receiver may specify wildcard value for source
• MPI_ANY_TAG• Receiver may specify wildcard value for tag
33
Example 2
34
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
// Sender processif (myrank == 0) /* code for process 0 */{strcpy (message,"Hello, there");MPI_Send (message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
}
// Receiver processelse if (myrank == 1) /* code for process 1 */{MPI_Recv (message, 20, MPI_CHAR, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD,
&status);printf ("received :%s\n", message);
}
MPI_Send (Blocking)
• Does not return until buffer can be reused
• Message buffering
• Implementation-dependent
• Standard communication mode
SENDER
RECEIVER
35
Safety
36
0 1
MPI_Send
MPI_Send
MPI_Recv
MPI_RecvSafe
MPI_Send
MPI_Recv
MPI_Send
MPI_RecvUnsafe
MPI_Send
MPI_Recv
MPI_Recv
MPI_SendSafe
MPI_Recv
MPI_Send
MPI_Recv
MPI_SendUnsafe
Other Send Modes
• MPI_Bsend Buffered• May complete before matching receive is posted
• MPI_Ssend Synchronous• Completes only if a matching receive is posted
• MPI_Rsend Ready•
37
Non-blocking Point-to-Point
• MPI_Isend (buf, count, datatype, dest, tag, comm, request)
• MPI_Irecv (buf, count, datatype, source, tag, comm, request)
• MPI_Wait
38
0 1
MPI_Isend
MPI_Recv
MPI_Isend
MPI_RecvSafe
Computation Communication Overlap
39
0 1
MPI_Isend
MPI_Recv
MPI_Wait
compute
compute
compute
compute
compute
Time
Collective Communications
• Must be called by all processes that are part of the communicator
Types
• Synchronization (MPI_Barrier)
• Global communication (MPI_Bcast, MPI_Gather, …)
• Global reduction (MPI_Reduce
40
Barrier
• Synchronization across all group members
• Collective call
• Blocks until all processes have entered the call
• MPI_Barrier (comm)
41
Broadcast
• Root process sends message to all processes
• Any process can be root process but has to be the same in all processes
• int MPI_Bcast (buffer, count, datatype, root, comm)
• Number of elements in buffer – count
• buffer – Input or output
42
X X
X
X
X
X
X
X
X
Example 3
43
int rank, size, color;MPI_Status status;
MPI_Init (&argc, &argv);MPI_Comm_rank (MPI_COMM_WORLD, &rank);MPI_Comm_size (MPI_COMM_WORLD, &size);
color = rank + 2;int oldcolor = color;MPI_Bcast (&color, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf ("%d: %d color changed to %d\n", rank, oldcolor, color);
Gather
• Gathers values from all processes to a root process
• int MPI_Gather (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• Arguments recv* not relevant on non-root processes
44
A0 A1 A2
A0
A1
A2
DATA
PR
OC
ESSE
S
Example 4
45
MPI_Comm_rank (MPI_COMM_WORLD, &rank);MPI_Comm_size (MPI_COMM_WORLD, &size);
color = rank + 2;
int colors[size];MPI_Gather (&color, 1, MPI_INT, colors, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank == 0)for (i=0; i<size; i++)printf ("color from %d = %d\n", i, colors[i]);
Scatter
• Scatters values to all processes from a root process
• int MPI_Scatter (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• Arguments send* not relevant on non-root processes
• Output parameter – recvbuf
46
A0 A1 A2
A0
A1
A2
DATA
PR
OC
ESSE
S
Allgather
• All processes gather values from all processes
• int MPI_Allgather (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
47
A0 A1 A2
A0 A1 A2
A0 A1 A2
A0
A1
A2
DATA
PR
OC
ESSE
S
Reduce
• MPI_Reduce (inbuf, outbuf, count, datatype, op, root, comm)
• Combines element in inbuf of each process
• Combined value in outbuf of root
• op: MIN, MAX, SUM, PROD, …
48
0 1 2 3 4 5 6 2 1 2 3 2 5 2 0 1 1 0 1 1 0
2 1 2 3 4 5 6 MAX at root
Allreduce
• MPI_Allreduce (inbuf, outbuf, count, datatype, op, comm)
• op: MIN, MAX, SUM, PROD, …
• Combines element in inbuf of each process
• Combined value in outbuf of each process
49
0 1 2 3 4 5 6 2 1 2 3 2 5 2 0 1 1 0 1 1 0
2 1 2 3 4 5 6
MAX
2 1 2 3 4 5 62 1 2 3 4 5 6
DEMO
50