ADVANCED SCIENTIFIC COMPUTING
Dr.-Ing. Morris Riedel, Adjunct Associated Professor
School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany
Parallel Programming with MPI
August 31, 2017, Room Ág-303
High Performance Computing
LECTURE 3
Review of Lecture 2 – Parallelization Fundamentals
Strategies for Parallelization
Terminology
Modified from [1] Introduction to High Performance Computing for Scientists and Engineers
(SPMD Example)
Modified from [2] Caterham F1 team races past competition with HPC
(MPMD Example)
2 / 40
Outline of the Course
1. High Performance Computing
2. Parallelization Fundamentals
3. Parallel Programming with MPI
4. Advanced MPI Techniques
5. Parallel Algorithms & Data Structures
6. Parallel Programming with OpenMP
7. Hybrid Programming & Patterns
8. Debugging & Profiling Techniques
9. Performance Optimization & Tools
10. Scalable HPC Infrastructures & GPUs
11. Scientific Visualization & Steering
12. Terrestrial Systems & Climate
13. Systems Biology & Bioinformatics
14. Molecular Systems & Libraries
15. Computational Fluid Dynamics
16. Finite Elements Method
17. Machine Learning & Data Mining
18. Epilogue
+ additional practical lectures for our hands-on exercises in context
3 / 40
Outline
Message Passing Interface (MPI)
  Review of Distributed-Memory Systems
  Point-to-Point Message Passing Functions
  Understanding MPI Collectives
  MPI Rank & Communicators
  Standardization & Portability
MPI Parallel Programming Basics
  Environment with Libraries & Modules
  Thinking Parallel
  Basic Building Blocks of a Parallel Program
  Code Compilation & Parallel Execution
  Simple PingPong Application Example
Promises from previous lecture(s):
Lecture 1: Lecture 3 & 4 will give in-depth details on the distributed-memory programming model with MPI
4 / 40
Message Passing Interface (MPI)
5 / 40
Distributed-Memory Computers Reviewed
Processors communicate via Network Interfaces (NI)
  The NI mediates the connection to a communication network
  This setup is rarely used as a programming model view today
A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly
Modified from [1] Introduction to High Performance Computing for Scientists and Engineers
Programming Model:
Message Passing
6 / 40
Programming with Distributed Memory using MPI
No remote memory access on distributed-memory systems
  Requires ‘sending messages’ back and forth between processes PX
Many free Message Passing Interface (MPI) libraries are available
  Programming is tedious & complicated, but it is the most flexible method
(Figure: processes P1 … P5 exchanging messages over the network)
Distributed-memory programming enables explicit message passing as communication between processors
MPI is dominant distributed-memory programming standard today (v3.1)
[3] MPI Standard
7 / 40
What is MPI?
‘Communication library’ abstracting from the low-level network view
  Offers 500+ available functions to communicate between computing nodes
  Practice reveals: parallel applications often require just ~12 (!) functions
  Includes routines for efficient ‘parallel I/O’ (using underlying hardware)
Supports ‘different ways of communication’
  ‘Point-to-point communication’ between two computing nodes (P to P)
  Collective functions involve ‘N computing nodes in useful communication’
Deployment on Supercomputers
  Installed on (almost) all parallel computers
  Different languages: C, Fortran, Python, R, etc.
  Careful: different versions might be installed
Recall that ‘computing nodes’ are independent computing processors (each of which may have N cores) that are all part of one big parallel computer
8 / 40
Message Passing: Exchanging Data with Send/Receive
Each processor has its own data in its memory that can not be seen/accessed by other processors
(Figure: an HPC machine with compute nodes P1 … P6; each processor P has a local memory M holding data such as DATA: 17, 06, 19, 80; point-to-point communications copy values to other processors, e.g. NEW: 17 and NEW: 06)
9 / 40
Collective Functions: Broadcast (one-to-many)
Broadcast distributes the same data to many or even all other processors
(Figure: one processor holds DATA: 17 and broadcasts it, so the other processors all end up with NEW: 17)
Lecture 5 will provide some more detailed examples of how collective communication is used
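A minimal C sketch of such a broadcast (an illustration, not taken from the slides): rank 0 owns the value and, after MPI_Bcast, every rank holds a copy.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) data = 17;                        /* only the root holds the value initially */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD); /* called by every rank; root is rank 0 */
    printf("Rank %d now has data %d\n", rank, data); /* every rank prints 17 */
    MPI_Finalize();
    return 0;
}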
10 / 40
Collective Functions: Scatter (one-to-many)
Scatter distributes different data to many or even all other processors
(Figure: one processor holds DATA: 10, 20, 30 and scatters them; each other processor, still holding its own data such as DATA: 06, 19, 80, receives a different piece: NEW: 10, NEW: 20, NEW: 30)
Lecture 5 will provide some more detailed examples of how collective communication is used
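A minimal scatter sketch in C (an illustration only; it assumes the program is started with exactly 4 processes): rank 0 holds one value per process and MPI_Scatter hands each rank a different element.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, mine = 0;
    int data[4] = {10, 20, 30, 40};                  /* meaningful on the root (rank 0) only */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* each rank receives exactly one int: rank 0 gets 10, rank 1 gets 20, ... */
    MPI_Scatter(data, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d received %d\n", rank, mine);
    MPI_Finalize();
    return 0;
}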
11 / 40
Collective Functions: Gather (many-to-one)
Gather collects data from many or even all other processors to one specific processor
(Figure: processors hold DATA: 06, 19, 80; one processor gathers them as NEW: 06, NEW: 19, NEW: 80)
Lecture 5 will provide some more detailed examples of how collective communication is used
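The opposite direction as a hedged sketch (illustrative only; it assumes at most 4 processes because of the fixed-size receive buffer): every rank contributes one value and rank 0 collects them in rank order.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, size, i;
    int all[4] = {0};                                /* filled on the root (rank 0) only */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int mine = 10 * rank;                            /* each rank contributes its own value */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("Value from rank %d: %d\n", i, all[i]);
    MPI_Finalize();
    return 0;
}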
12 / 40
Collective Functions: Reduce (many-to-one)
Reduce combines collection with computation based on data from many or even all other processors
Usage of reduce includes finding a global minimum or maximum, sum, or product of the different data located at different processors
(Figure: DATA: 17, 06, 19, 80 are combined with ‘+’ into NEW: 122, a global sum as example)
Lecture 5 will provide some more detailed examples of how collective communication is used
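A global-sum sketch with MPI_Reduce (illustrative only; any number of processes works here). MPI also offers other predefined reduction operations such as MPI_MAX, MPI_MIN, and MPI_PROD.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, local, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;                                   /* some local data on every rank */
    /* combine all local values with '+' and deliver the result to rank 0 */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Global sum: %d\n", sum);
    MPI_Finalize();
    return 0;
}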
13 / 40
MPI Communicators & MPI Rank
Each MPI activity specifies the context in which a corresponding function is performed
  MPI_COMM_WORLD (region/context of all processes)
  Create (sub-)groups of the processes / virtual groups of processes
  Perform communications only within these sub-groups easily with well-defined processes
  Using communicators wisely in collective functions can reduce the number of affected processors
MPI rank is a unique number for each processor (numbers reflect the unique identity of a processor, called the ‘MPI rank’)
[4] LLNL MPI Tutorial
Lecture 4 will provide more information about the often-used MPI Cartesian communicator
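A small illustrative sketch, not taken from the lecture, of how such sub-groups can be created with MPI_Comm_split: processes with the same ‘color’ end up in the same sub-communicator, each with its own local rank there.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int world_rank, sub_rank;
    MPI_Comm subcomm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* split MPI_COMM_WORLD into two sub-groups: even and odd world ranks */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);
    printf("World rank %d has rank %d in its sub-communicator\n", world_rank, sub_rank);
    /* collectives called on subcomm now only involve that sub-group */
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}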
14 / 40
Is MPI yet another network library?
  TCP/IP and socket programming libraries are plentifully available
  Do we need a dedicated communication & network protocol library?
  Goal: simplify parallel programming, focus on applications
Selected reasons
  Designed for performance within large parallel computers (e.g. no security)
  Supports various interconnects between ‘computing nodes’ (hardware)
  Offers various benefits like ‘reliable messages’ or ‘in-order arrival’
MPI is not designed to handle arbitrary communication in computer networks and is thus very specialized
  Not good for clients that constantly establish/close connections again and again (e.g. would have very slow performance in MPI)
  Not good for Internet chat clients or Web service servers in the Internet (e.g. no security beyond firewalls, no message encryption directly available, etc.)
15 / 40
Parallel Applications & Simulations use MPI
Pro: Network communication is relatively hidden and supported
Contra: Programming with MPI still requires using ‘parallelization methods’
Not easy: writing ‘technical code’ well integrated with ‘problem-domain code’
Example: Race Car Simulation (cf. Lecture 2)
  Apply a good parallelization method (e.g. domain decomposition)
  Manually write good MPI code for the (technical) communication between processors (e.g. across 1024 cores)
  Integrate the technical code well with the problem-domain code (e.g. computational fluid dynamics & airflow)
Modified from [2] Caterham F1 team
16 / 40
MPI as Open Standard
Many vendors have provided supercomputers/clusters in the past and still do today
  Libraries in addition to the OS are required to support message passing
  Proprietary and vendor-specific libraries existed up until the early ~1990s
The MPI ‘Joint Standardization’ Forum
  Members from many organizations define/maintain the MPI standard
  MPI 1.0 (1994); MPI 2.0 (1997); MPI 2.2 (2009); MPI 3.0 increasingly used
[3] MPI Standard
17 / 40
MPI Standard enables Portability
Key reasons for requiring a standard programming library
  Technical advancement in supercomputers is extremely fast
  Parallel computing experts switch organizations and face another system
  Applications using proprietary libraries were not portable
  Whole applications had to be created from scratch, or needed time-consuming code updates
MPI changed this & is the dominant parallel programming model
MPI is an open standard that significantly supports the portability of parallel applications
(Figure: porting a parallel MPI application from HPC Machine A to HPC Machine B, both providing an MPI library)
18 / 40
[Video] Introducing MPI – Summary
[5] Introducing MPI, YouTube Video
19 / 40
MPI Parallel Programming Basics
20 / 40
Starting Parallel Programming
Check access to the cluster machine
  Check the MPI standard implementation and its version
  Often SSH is used to remotely access clusters
OpenMPI ‘Open Source High Performance Computing’
  E.g. Openmpi-x86_64; openmpi/1.3.6
Other implementations exist
  E.g. the MPICH implementation ‘High-Performance Portable MPI’ (we don't use this in this course)
[6] OpenMPI
[7] MPICH
Practical Lecture 3.1 will provide more insights on how to use MPI within a cluster environment
21 / 40
HPC Machine Environment – OS & Hostname
Most parallel computers/supercomputers have UNIX OSs
  Exceptions exist: HPC Windows server machines
Often UNIX commands are necessary to work productively
  Tools exist to abstract from underlying UNIX and OS technical aspects
  Examples: Parallel Tools Platform (PTP), the UNICORE middleware, etc.
Example: ‘hostname -A’ command on the JOTUNN cluster
[8] Parallel Tools Platform [9] UNICORE
Practical Lecture 3.1 consists of details on MPI with batch systems and concept of login nodes
22 / 40
HPC Machine Environment – Compiler & Modules
Knowledge of the installed compilers is essential (e.g. C, Fortran90, etc.)
  Different versions and types of compilers exist (Intel, GNU, MPI, etc.)
  E.g. mpicc pingpong.c -o pingpong
Module environment tool
  Avoids manually setting up environment information for every application
  Simplifies shell initialization and lets users easily modify their environment
  Modules can be loaded and unloaded
module avail
  Lists all available modules on the HPC system (e.g. compilers, MPI, etc.)
module load
  Loads particular modules into the current work environment
  E.g. module load gnu openmpi
Practical Lecture 3.1 consists of details on MPI with using compilers & available system modules
23 / 40
Start ‘Thinking’ Parallel
Parallel MPI programs know about the existence of the other processes of the program and what their own role is in the bigger picture
MPI programs are written in a sequential programming language, but executed in parallel
  The same MPI program runs on all processes (SPMD)
Data exchange is key for the design of applications
  Sending/receiving data at specific times in the program
  No shared memory for sharing variables with other remote processes
  Messages can be simple variables (e.g. a word) or complex structures
Start with the basic building blocks using MPI
  Building up the ‘parallel computing environment’
Recall that SPMD stands for Single Program Multiple Data
24 / 40
(MPI) Basic Building Blocks: A main() function
The main() function is automatically started when launching a C program
Normally the ‘return code’ denotes whether the program exit was ok (0) or problematic (-1)
Practice view: resiliency is not part of MPI (e.g. no automatic restart and error handling), therefore the return code is rarely used in practice
‘standard C programming…’
25 / 40
(MPI) Basic Building Blocks: Variables & Output
Libraries can be used by including C header files, here for example the library for screen outputs
Two integer variables that are later useful for working with specific data obtained from the MPI library
Output with printf using the stdio library: ‘Hello World’ and which process out of all n processes is printing
‘standard C programming…’
26 / 40
MPI Basic Building Blocks: Header & Init/Finalize
‘standard C programming including MPI library use…’
Libraries can be used by including C header files, here the MPI library is included
The MPI_Init() function initializes the MPI environment and can take inputs via the main() function arguments
MPI_Finalize() shuts down the MPI environment (after this statement no parallel execution of the code can take place)
27 / 40
MPI Basic Building Blocks: Rank & Size Variables
‘standard C programming including MPI library use…’
The MPI_COMM_WORLD communicator constant denotes the ‘region of communication’, here all processes
The MPI_Comm_size() function determines the overall number n of processes in the parallel program and stores it in the variable size
The MPI_Comm_rank() function determines the unique identifier for each processor and stores it in the variable rank with values (0 … n-1)
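Putting these building blocks together, a minimal reconstruction of the hello-world program described on the last few slides could look as follows (a sketch, not the original slide code):

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                      /* initialize the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* overall number of processes n */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* unique identifier 0 ... n-1 */
    printf("Hello World, I am process %d of %d\n", rank, size);
    MPI_Finalize();                              /* shut down the MPI environment */
    return 0;
}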
28 / 40
Compiling & Executing an MPI program
Compilers and linkers need various information about where include files and libraries can be found
  E.g. C header files like ‘mpi.h’, or Fortran modules via ‘use MPI’
  Compiling is different for each programming language
Executing the MPI program on 4 processors
  Normally batch system allocations (cf. SLURM on JOTUNN cluster)
  Manual start-up example:
  $> mpirun -np 4 ./hello
  creates 4 processes that produce output in parallel
Output of the program
  Order of outputs can vary because the I/O screen is a ‘serial resource’
(Figure: 4 processes, each a processor P with memory M, each printing ‘hello’)
29 / 40
Practice: Our 4 CPU Program alongside many other Programs
[10] LLView Tool
Maybe our program!
30 / 40
(Blocking) Point-to-Point Communication
MPI messages are defined as an array of elements of a particular MPI data type
  Basic data types (MPI_INT, MPI_LONG, …)
  Derived types (can be specifically defined)
The data types on sender and receiver sides must match
  Otherwise the message passing step/transfer will not succeed
Point-to-point communication takes place among exactly one sender and exactly one receiver.
Both ends are identified uniquely by their ranks.
(Figure: two processors P with memories M, rank 0 and rank 1)
[3] MPI Standard
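As an illustrative sketch of a derived type (an assumed example, not taken from the slide; it needs at least two processes): MPI_Type_contiguous builds a new datatype from existing ones, which both sender and receiver then use so that the types match.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank;
    double triple[3] = {1.0, 2.0, 3.0};
    MPI_Datatype triple_t;
    MPI_Status stat;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_contiguous(3, MPI_DOUBLE, &triple_t);   /* derived type: 3 contiguous doubles */
    MPI_Type_commit(&triple_t);
    if (rank == 0)
        MPI_Send(triple, 1, triple_t, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(triple, 1, triple_t, 0, 0, MPI_COMM_WORLD, &stat);
        printf("Rank 1 received %.1f %.1f %.1f\n", triple[0], triple[1], triple[2]);
    }
    MPI_Type_free(&triple_t);
    MPI_Finalize();
    return 0;
}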
31 / 40
Detailed View: The Role of the System Buffer
[4] LLNL MPI Tutorial
32 / 40
Sending an MPI Message: MPI_Send
(Figure: rank 0, shown with …MPI_Send()…, sends a message to rank 1)
MPI_Send() performs a blocking send: it blocks until the message data has been copied out of the send buffer so that the buffer can safely be reused; this does not necessarily mean the message has already been received by the destination process
MPI_Send(buf, count, datatype, dest, tag, comm)
  buf: initial address of send buffer (choice)
  count: number of elements in send buffer (nonnegative integer)
  datatype: datatype of each send buffer element (handle)
  dest: rank of destination (integer)
  tag: message tag (integer)
  comm: communicator (handle)
[3] MPI Standard
33 / 40
Receiving an MPI Message: MPI_Recv
(Figure: rank 0 calls …MPI_Send()…, rank 1 calls …MPI_Recv()…, with time running downwards)
MPI_Recv() performs a blocking receive for a message (it blocks until the message has arrived)
MPI_Recv(buf, count, datatype, source, tag, comm, status)
  buf: initial address of receive buffer (choice)
  count: maximum number of elements in receive buffer (integer)
  datatype: datatype of each receive buffer element (handle)
  source: rank of source (integer)
  tag: message tag (integer)
  comm: communicator (handle)
  status: status object (Status)
[3] MPI Standard
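An illustrative sketch (not from the slides) of how the source, tag, and status arguments can work together: rank 0 receives with the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG and reads the actual sender and tag from the status object.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, size, i, value;
    MPI_Status stat;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        /* receive one message from every other rank, in whatever order they arrive */
        for (i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &stat);
            printf("Got %d from rank %d (tag %d)\n", value, stat.MPI_SOURCE, stat.MPI_TAG);
        }
    } else {
        value = rank * 100;
        MPI_Send(&value, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);  /* tag = own rank here */
    }
    MPI_Finalize();
    return 0;
}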
34 / 40
Summary of the Parallel Environment & Message Passing
Modified from [4] LLNL MPI Tutorial
(Figure: several processes, each a processor P with local memory M, exchanging messages within the parallel environment)
35 / 40
MPI PingPong Program with Message Passing

#include "mpi.h"
#include <stdio.h>

/* intended to be run with exactly 2 MPI tasks */
int main(int argc, char *argv[]) {
    int numtasks, rank, dest, source, rc, count, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* rank 0 sends first, then waits for the return ping */
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        /* rank 1 receives first, then sends the ping back */
        dest = 0; source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) from task %d with tag %d \n",
           rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

    MPI_Finalize();
    return 0;
}
Lecture 5 consists of further advanced examples of using several MPI functions in applications
Simple PingPong Parallel Program: MPI Rank 0 pings Rank 1 and awaits the return ping (check the order in which the Send/Recv functions are called here…)
Function MPI_Get_count() counts the number of received elements
36 / 40
[Video] Open MPI
[11] YouTube Video, What is Open MPI
37 / 40
Lecture Bibliography
38 / 40
Lecture Bibliography
[1] Introduction to High Performance Computing for Scientists and Engineers, Georg Hager & Gerhard Wellein, Chapman & Hall/CRC Computational Science, ISBN 143981192X
[2] Caterham F1 Team Races Past Competition with HPC, Online: http://insidehpc.com/2013/08/15/caterham-f1-team-races-past-competition-with-hpc
[3] The MPI Standard, Online: http://www.mpi-forum.org/docs/
[4] LLNL MPI Tutorial, Online: https://computing.llnl.gov/tutorials/mpi/
[5] HPC – Introducing MPI, YouTube Video, Online: http://www.youtube.com/watch?v=kHV6wmG35po
[6] OpenMPI, ‘Open Source High Performance Computing’, Online: http://www.open-mpi.org/
[7] MPICH, ‘High-Performance Portable MPI’, Online: http://www.mpich.org/
[8] Parallel Tools Platform, Online: http://www.eclipse.org/ptp/
[9] UNICORE Middleware, Online: https://www.unicore.eu/
[10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html
[11] YouTube Video, What is Open MPI, Online: http://www.youtube.com/watch?v=D0-xSWBGNAw
39 / 40
40 / 40