Introduction to Programming by MPI for Parallel FEM
Report S1 & S2 in Fortran (1/2)
Kengo Nakajima, Information Technology Center, The University of Tokyo
Motivation for Parallel Computing (and this class)
• Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models. Computational science develops new frontiers of science and engineering.
• Why parallel computing?
– faster & larger
– "larger" is more important from the viewpoint of "new frontiers of science & engineering", but "faster" is also important.
– + more complicated
– Ideal: Scalable
• Weak Scaling, Strong Scaling
Scalable, Scaling, Scalability
• Solving an Nx-scale problem using Nx computational resources in the same computation time
– for large-scale problems: Weak Scaling, Weak Scalability
– e.g. CG solver: more iterations are needed for larger problems
• Solving a fixed-size problem using Nx computational resources in 1/N of the computation time
– for faster computation: Strong Scaling, Strong Scalability
Overview
• What is MPI ?
• Your First MPI Program: Hello World
• Collective Communication
• Point-to-Point Communication
What is MPI ? (1/2)
• Message Passing Interface
• "Specification" of a message passing API for distributed memory environments
– Not a program, not a library
• http://www.mcs.anl.gov/mpi/www/
• https://www.mpi-forum.org/docs/
• History
– 1992 MPI Forum
• https://www.mpi-forum.org/
– 1994 MPI-1
– 1997 MPI-2: MPI I/O
– 2012 MPI-3: Fault Resilience, Asynchronous Collective
• Implementations
– mpich: ANL (Argonne National Laboratory), OpenMPI, MVAPICH
– H/W vendors
– C/C++, Fortran, Java; Unix, Linux, Windows, Mac OS
What is MPI ? (2/2)
• "mpich" (free) is widely used
– supports the MPI-2 spec. (partially)
– MPICH2 after Nov. 2005
– http://www.mcs.anl.gov/mpi/
• Why is MPI so widely used as the de facto standard?
– Uniform interface through the MPI Forum
• Portable, can work on any type of computer
• Can be called from Fortran, C, etc.
– mpich
• free, supports every architecture
• PVM (Parallel Virtual Machine) was also proposed in the early 90's but is not as widely used as MPI
References
• W. Gropp et al., Using MPI (second edition), MIT Press, 1999.
• M.J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2003.
• W. Gropp et al., MPI: The Complete Reference Vol. I, II, MIT Press, 1998.
• http://www.mcs.anl.gov/mpi/www/
– API (Application Programming Interface) of MPI
How to learn MPI (1/2)
• Grammar
– 10-20 functions of MPI-1 will be taught in the class
• although there are many convenient capabilities in MPI-2
– If you need further information, you can find it on the web, in books, and from MPI experts.
• Practice is important
– Programming
– "Running the codes" is the most important
• Be familiar with or "grab" the idea of SPMD/SIMD operations
– Single Program/Instruction Multiple Data
– Each process does the same operation on different data
• Large-scale data is decomposed, and each part is computed by each process
– Global/Local Data, Global/Local Numbering
SPMD
[Figure: SPMD. PE #0, PE #1, PE #2, ..., PE #M-1 each run the same Program on their own Data #0, Data #1, Data #2, ..., Data #M-1; launched by: mpirun -np M <Program>]
You understand 90% of MPI if you understand this figure.
PE: Processing Element (Processor, Domain, Process)
Each process does the same operation on different data. Large-scale data is decomposed, and each part is computed by each process. Ideally, a parallel program is no different from a serial one except for communication.
Some Technical Terms
• Processor, Core
– Processing unit (H/W); processor = core for single-core processors
• Process
– Unit of MPI computation, nearly equal to "core"
– Each core (or processor) can host multiple processes (but this is not efficient)
• PE (Processing Element)
– PE originally means "processor", but it is sometimes used as "process" in this class. Moreover it can mean "domain" (next).
• In multicore processors, PE generally means "core"
• Domain
– domain = process (= PE), each of "MD" in "SPMD", each data set
• Process ID of MPI (ID of PE, ID of domain) starts from "0"
– if you have 8 processes (PE's, domains), the IDs are 0-7
How to learn MPI (2/2)
• NOT so difficult.
• Therefore, 5-6 hours of lectures are enough for just learning the grammar of MPI.
• Grab the idea of SPMD!
Schedule
• MPI
– Basic Functions
– Collective Communication
– Point-to-Point (or Peer-to-Peer) Communication
• 105 min. x 3-4 lectures
– Collective Communication
• Report S1
– Point-to-Point Communication
• Report S2: Parallelization of 1D code
– At this point, you are almost an expert of MPI programming
• What is MPI ?
• Your First MPI Program: Hello World
• Collective Communication
• Point-to-Point Communication
Login to Oakbridge-CX (OBCX)
ssh t37**@obcx.cc.u-tokyo.ac.jp   (login from your PC to OBCX)
Create directory:
>$ cd /work/gt37/t37XXX
>$ mkdir pFEM
>$ cd pFEM
In this class this top directory is called <$O-TOP>. Files are copied to this directory.
Under this directory, S1, S2, S1-ref are created:
<$O-S1> = <$O-TOP>/mpi/S1
<$O-S2> = <$O-TOP>/mpi/S2
Copying files on OBCX
Fortran:
>$ cd /work/gt37/t37XXX/pFEM
>$ cp /work/gt00/z30088/pFEM/F/s1-f.tar .
>$ tar xvf s1-f.tar
C:
>$ cd /work/gt37/t37XXX/pFEM
>$ cp /work/gt00/z30088/pFEM/C/s1-c.tar .
>$ tar xvf s1-c.tar
Confirmation:
>$ ls
mpi
>$ cd mpi/S1
This directory is called <$O-S1>.
<$O-S1> = <$O-TOP>/mpi/S1
First Example

hello.f:
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end

hello.c:
#include "mpi.h"
#include <stdio.h>
int main(int argc, char **argv)
{
  int n, myid, numprocs, i;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  printf("Hello World %d\n", myid);

  MPI_Finalize();
}
Compiling hello.f/c
FORTRAN: $> mpiifort
– required compiler & libraries are included for Fortran 90 + MPI
C: $> mpiicc
– required compiler & libraries are included for C + MPI

>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ mpiifort -align array64byte -O3 -axCORE-AVX512 hello.f
>$ mpiicc -align -O3 -axCORE-AVX512 hello.c
Running Jobs
• Batch Jobs
– Only batch jobs are allowed.
– Interactive execution of jobs is not allowed.
• How to run
– write a job script
– submit the job
– check job status
– check results
• Utilization of computational resources
– 1 node (56 cores) is occupied by each job.
– Your node is not shared with other jobs.
Job Script
• <$O-S1>/hello.sh
• Scheduling + Shell Script

#!/bin/sh
#PJM -N "hello"                Job Name
#PJM -L rscgrp=lecture7        Name of "QUEUE"
#PJM -L node=1                 Node #
#PJM --mpi proc=4              Total MPI Process #
#PJM -L elapse=00:15:00        Computation Time
#PJM -g gt37                   Group Name (Wallet)
#PJM -j
#PJM -e err                    Standard Error
#PJM -o hello.lst              Standard Output
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out

mpiexec.hydra                  Command for running the MPI job
-n ${PJM_MPI_PROC}             = (--mpi proc=XX), in this case = 4
./a.out                        name of the executable file
Process Number
[Figure: one OBCX node has two Intel Xeon Platinum 8280 (Cascade Lake, CLX) sockets, 2.7 GHz, 28 cores each, 2.419 TFLOPS per socket. Socket #0: cores 0-27, Socket #1: cores 28-55. Each socket has 96 GB memory, DDR4 2933 MHz x 6ch = 140.8 GB/sec. The sockets are connected by the Ultra Path Interconnect (UPI), 10.4 GT/sec x 3 = 124.8 GB/sec.]

#PJM -L node=1; #PJM --mpi proc=1      1-node,   1-proc,  1-proc/node
#PJM -L node=1; #PJM --mpi proc=4      1-node,   4-proc,  4-proc/node
#PJM -L node=1; #PJM --mpi proc=16     1-node,  16-proc, 16-proc/node
#PJM -L node=1; #PJM --mpi proc=28     1-node,  28-proc, 28-proc/node
#PJM -L node=1; #PJM --mpi proc=56     1-node,  56-proc, 56-proc/node
#PJM -L node=4; #PJM --mpi proc=128    4-node, 128-proc, 32-proc/node
#PJM -L node=8; #PJM --mpi proc=256    8-node, 256-proc, 32-proc/node
#PJM -L node=8; #PJM --mpi proc=448    8-node, 448-proc, 56-proc/node
Job Submission
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
(modify hello.sh)
>$ pjsub hello.sh
>$ cat hello.lst
Hello World 0
Hello World 3
Hello World 2
Hello World 1
Available QUEUE's
• The following 2 queues are available.
• 8 nodes can be used
– lecture
• 8 nodes (448 cores), 15 min., valid until the end of March 2020
• Shared by all "educational" users
– lecture7
• 8 nodes (448 cores), 15 min., active during class time
• More jobs (compared to lecture) can be processed depending on availability
Submitting & Checking Jobs
• Submitting jobs:            pjsub SCRIPT NAME
• Checking status of jobs:    pjstat
• Deleting/aborting:          pjdel JOB ID
• Checking status of queues:  pjstat --rsc
• Detailed info. of queues:   pjstat --rsc -x
• Info. of running jobs:      pjstat -a
• History of submission:      pjstat -H
• Limitation of submission:   pjstat --limit
Basic/Essential Functions
(hello.f and hello.c as shown above)
• 'mpif.h', "mpi.h": essential include files; "use mpi" is possible in Fortran 90
• MPI_Init: initialization
• MPI_Comm_size: number of MPI processes (mpirun -np XX <prog>)
• MPI_Comm_rank: process ID, starting from 0
• MPI_Finalize: termination of MPI processes
Difference between Fortran/C
• (Basically) the same interface
– In C, UPPER/lower cases are considered different
• e.g.: MPI_Comm_size
– MPI: UPPER case
– The first character of the function name after "MPI_" is in UPPER case
– Other characters are in lower case
• In Fortran, the return value ierr has to be added at the end of the argument list.
• C needs special types for variables:
– MPI_Comm, MPI_Datatype, MPI_Op, etc.
• MPI_INIT is different:
– call MPI_INIT (ierr)
– MPI_Init (int *argc, char ***argv)
What is going on?
(hello.f and hello.sh as shown above, with #PJM --mpi proc=4)
• mpiexec.hydra starts up 4 MPI processes ("proc=4")
– A single program runs on four processes.
– Each process writes a value of myid (my_rank).
• The four processes do the same operations, but the values of myid are different.
• The output of each process is different.
• That is SPMD!
mpi.h, mpif.h
(included at the top of hello.c / hello.f as shown above)
• Various types of parameters and variables for MPI, and their initial values, are defined in these include files.
• The name of each variable starts with "MPI_".
• The values of these parameters and variables cannot be changed by users.
• Users must not declare variables whose names start with "MPI_" in their own programs.
MPI_INIT
• Initializes the MPI execution environment (required)
• It is recommended to put this BEFORE all statements in the program.
• call MPI_INIT (ierr)
– ierr    I    O    completion code
(Fortran example: hello.f as shown above)
MPI_FINALIZE
• Terminates the MPI execution environment (required)
• It is recommended to put this AFTER all statements in the program.
• Please do not forget this.
• call MPI_FINALIZE (ierr)
– ierr    I    O    completion code
(Fortran example: hello.f as shown above)
MPI_COMM_SIZE
• Determines the size of the group associated with a communicator
• Not required, but a very convenient function
• call MPI_COMM_SIZE (comm, size, ierr)
– comm    I    I    communicator
– size    I    O    number of processes in the group of the communicator
– ierr    I    O    completion code
(Fortran example: hello.f as shown above)
What is a Communicator?
• Group of processes for communication
• A communicator must be specified in an MPI program as the unit of communication
• All processes belong to a default group named "MPI_COMM_WORLD"
• Multiple communicators can be created, and complicated operations are possible
– e.g. computation, visualization
• Only "MPI_COMM_WORLD" is needed in this class.

MPI_Comm_Size (MPI_COMM_WORLD, PETOT)
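Multiple communicators are not used in this class, but as an illustration only, a minimal Fortran sketch using the standard MPI_COMM_SPLIT routine; the even/odd split and all variable names are illustrative assumptions, not part of the handout:

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, PETOTn, my_rank, new_rank, new_comm, color, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

! illustrative: split MPI_COMM_WORLD into two groups (even ranks / odd ranks)
color= mod(my_rank, 2)
call MPI_COMM_SPLIT (MPI_COMM_WORLD, color, my_rank, new_comm, ierr)

! size and rank inside the new, smaller communicator
call MPI_COMM_SIZE (new_comm, PETOTn, ierr)
call MPI_COMM_RANK (new_comm, new_rank, ierr)

call MPI_FINALIZE (ierr)
stop
end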
MPI_COMM_WORLD
[Figure: communicators in MPI. One process can belong to multiple communicators, e.g. COMM_MANTLE, COMM_CRUST, and COMM_VIS inside MPI_COMM_WORLD.]
Coupling between “Ground Motion” and “Sloshing of Tanks for Oil-Storage”
Target Application
• Coupling between "Ground Motion" and "Sloshing of Tanks for Oil-Storage"
– "One-way" coupling from "Ground Motion" to "Tanks"
– Displacement of the ground surface is given as forced displacement of the bottom surface of the tanks
– 1 Tank = 1 PE (serial)
• Deformation of the ground surface is given as boundary conditions at the bottom of the tanks.
2003 Tokachi Earthquake (M8.0)
Fire accident of oil tanks due to long-period ground motion (surface waves) developed in the basin of Tomakomai
Seismic Wave Propagation, Underground Structure
Simulation Codes
• Ground Motion (Ichimura): Fortran
– Parallel FEM, 3D Elastic/Dynamic
• Explicit forward Euler scheme
– Each element: 2m x 2m x 2m cube
– 240m x 240m x 100m region
• Sloshing of Tanks (Nagashima): C
– Serial FEM (Embarrassingly Parallel)
• Implicit backward Euler, Skyline method
• Shell elements + inviscid potential flow
– D: 42.7m, H: 24.9m, T: 20mm
– Natural period: 7.6 sec.
– 80 elements in circumference, 0.6m mesh in height
– Tank-to-Tank distance: 60m, 4 x 4 tanks
• Total number of unknowns: 2,918,169
Three Communicators
[Figure: meshGLOBAL%MPI_COMM contains all processes (meshGLOBAL%my_rank = 0-12). The 4 basement processes (basement #0-#3) form meshBASE%MPI_COMM (meshBASE%my_rank = 0-3, meshGLOBAL%my_rank = 0-3, meshTANK%my_rank = -1). The 9 tank processes (tank #0-#8) form meshTANK%MPI_COMM (meshTANK%my_rank = 0-8, meshGLOBAL%my_rank = 4-12, meshBASE%my_rank = -1).]
MPI_COMM_RANK
• Determines the rank of the calling process in the communicator
– The "ID of an MPI process" is sometimes called the "rank"
• call MPI_COMM_RANK (comm, rank, ierr)
– comm    I    I    communicator
– rank    I    O    rank of the calling process in the group of comm, starting from "0"
– ierr    I    O    completion code
(Fortran example: hello.f as shown above)
MPI_ABORT
• Aborts the MPI execution environment
• call MPI_ABORT (comm, errcode, ierr)
– comm      I    I    communicator
– errcode   I    I    error code
– ierr      I    O    completion code
Fortran
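The handout shows no code for MPI_ABORT; a minimal Fortran sketch of a typical use, where the file name, the error code 9, and the check itself are illustrative assumptions:

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

! illustrative: abort all processes if an input file cannot be opened
open (21, file='input.dat', status='old', iostat=ios)
if (ios.ne.0) then
  write (*,*) 'input file not found on rank', my_rank
  call MPI_ABORT (MPI_COMM_WORLD, 9, ierr)
endif
close (21)

call MPI_FINALIZE (ierr)
stop
end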
MPI_WTIME
• Returns an elapsed time on the calling processor
• time= MPI_WTIME ()
– time    R8    O    time in seconds since an arbitrary time in the past

Fortran example:
real(kind=8):: Stime, Etime

Stime= MPI_WTIME ()
do i= 1, 100000000
  a= 1.d0
enddo
Etime= MPI_WTIME ()

write (*,'(i5,1pe16.6)') my_rank, Etime-Stime
Example of MPI_Wtime
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiicc -O1 time.c
$> mpiifort -O1 time.f
(modify go4.sh, 4 processes)
$> pjsub go4.sh

Process ID   Time
0   1.113281E+00
3   1.113281E+00
2   1.117188E+00
1   1.117188E+00
go4.sh
#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture7
#PJM -L node=1
#PJM --mpi proc=4
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
MPI_Wtick
• Returns the resolution of MPI_Wtime
• Depends on hardware and compiler
• time= MPI_Wtick ()
– time    R8    O    time in seconds of the resolution of MPI_Wtime

Fortran:
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
...
TM= MPI_WTICK ()
write (*,*) TM
...

C:
double Time;
...
Time = MPI_Wtick();
printf("%5d%16.6E\n", MyRank, Time);
...
Example of MPI_Wtick
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiicc -O1 wtick.c
$> mpiifort -O1 wtick.f
(modify go1.sh, 1 process)
$> pjsub go1.sh
go1.sh
#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture7
#PJM -L node=1
#PJM --mpi proc=1
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
MPI_BARRIER
• Blocks until all processes in the communicator have reached this routine.
• Mainly for debugging; huge overhead; not recommended for real code.
• call MPI_BARRIER (comm, ierr)
– comm    I    I    communicator
– ierr    I    O    completion code
Fortran
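No example accompanies MPI_BARRIER in the handout; a minimal Fortran sketch of the debugging pattern mentioned above (the loop and the message are illustrative; note that the ordering of merged stdout is still not strictly guaranteed):

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

! illustrative: each process waits at the barrier until all have arrived
do ip= 0, PETOT-1
  if (ip.eq.my_rank) write (*,*) 'checkpoint reached on rank', my_rank
  call MPI_BARRIER (MPI_COMM_WORLD, ierr)
enddo

call MPI_FINALIZE (ierr)
stop
end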
• What is MPI ?
• Your First MPI Program: Hello World
• Collective Communication
• Point-to-Point Communication
What is Collective Communication? (group communication)
• Collective communication is the process of exchanging information among multiple MPI processes in a communicator: one-to-all or all-to-all communication.
• Examples
– Broadcasting control data
– Max, Min
– Summation
– Dot products of vectors
– Transformation of dense matrices
Example of Collective Communications (1/4)
[Figure: Broadcast copies A0, B0, C0, D0 from P#0 to all of P#0-P#3. Scatter distributes A0 to P#0, B0 to P#1, C0 to P#2, D0 to P#3; Gather is the reverse operation.]
Example of Collective Communications (2/4)
[Figure: Allgather collects A0, B0, C0, D0 from P#0-P#3 onto every process. All-to-All exchanges data so that row blocks (A0-A3 on P#0, B0-B3 on P#1, C0-C3 on P#2, D0-D3 on P#3) become column blocks (A0, B0, C0, D0 on P#0; A1, B1, C1, D1 on P#1; A2, B2, C2, D2 on P#2; A3, B3, C3, D3 on P#3).]
Example of Collective Communications (3/4)
[Figure: Reduce applies an operation across processes (op.A0-A3, op.B0-B3, op.C0-C3, op.D0-D3) and stores the result on one process (P#0). Allreduce stores the same reduced result on every process.]
Example of Collective Communications (4/4)
[Figure: Reduce scatter applies the reduction and then scatters the results: op.A0-A3 to P#0, op.B0-B3 to P#1, op.C0-C3 to P#2, op.D0-D3 to P#3.]
Examples by Collective Comm.
• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
Global/Local Data
• Data structure of parallel computing based on SPMD, where large-scale "global data" is decomposed into small pieces of "local data".
Domain Decomposition/Partitioning
[Figure: large-scale data is split by domain decomposition into local data sets, which communicate with each other.]
• A PC with 1 GB RAM can execute an FEM application with up to 10^6 meshes
– 10^3 km x 10^3 km x 10^2 km (SW Japan): 10^8 meshes by 1 km cubes
• Large-scale Data: domain decomposition, parallel & local operations
• Global Computation: communication among domains is needed
Local Data Structure
• It is important to define a proper local data structure for the target computation (and its algorithm)
– Algorithms = Data Structures
• This is the main objective of this class!
Global/Local Data
• Data structure of parallel computing based on SPMD, where large-scale "global data" is decomposed into small pieces of "local data".
• Consider the dot product of the following VECp and VECs with length = 20, computed in parallel using 4 processors:

C:       VECp[0..19]= 2,  VECs[0..19]= 3
Fortran: VECp(1:20) = 2,  VECs(1:20) = 3
<$O-S1>/dot.f, dot.c

dot.f:
implicit REAL*8 (A-H,O-Z)
real(kind=8),dimension(20):: VECp, VECs

do i= 1, 20
  VECp(i)= 2.0d0
  VECs(i)= 3.0d0
enddo

sum= 0.d0
do ii= 1, 20
  sum= sum + VECp(ii)*VECs(ii)
enddo

stop
end

dot.c:
#include <stdio.h>
int main(){
  int i;
  double VECp[20], VECs[20];
  double sum;

  for(i=0;i<20;i++){
    VECp[i]= 2.0;
    VECs[i]= 3.0;
  }

  sum = 0.0;
  for(i=0;i<20;i++){
    sum += VECp[i] * VECs[i];
  }
  return 0;
}
<$O-S1>/dot.f, dot.c (do it on PC/ECCS 2016)
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ icc dot.c
>$ ifort dot.f
>$ ./a.out

 1  2.  3.
 2  2.  3.
 3  2.  3.
 ...
18  2.  3.
19  2.  3.
20  2.  3.
dot product  120.
MPI_REDUCE
• Reduces values on all processes to a single value
– Summation, Product, Max, Min, etc.
• call MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
– sendbuf   choice  I  starting address of send buffer
– recvbuf   choice  O  starting address of receive buffer; type is defined by "datatype"
– count     I       I  number of elements in send/receive buffer
– datatype  I       I  data type of elements of send/receive buffer
    FORTRAN: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc.
    C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc.
– op        I       I  reduce operation: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND etc.
    Users can define their own operations by MPI_OP_CREATE
– root      I       I  rank of root process
– comm      I       I  communicator
– ierr      I       O  completion code
[Figure: Reduce, as in Example of Collective Communications (3/4)]
Fortran
Send/Receive Buffer (Sending/Receiving)
• Arrays of “send (sending) buffer” and “receive (receiving) buffer” often appear in MPI.
• Addresses of “send (sending) buffer” and “receive (receiving) buffer” must be different.
Send/Receive Buffer (1/3): A is a scalar
call MPI_REDUCE (A, recvbuf, 1, datatype, op, root, comm, ierr)
MPI_Reduce(A, recvbuf, 1, datatype, op, root, comm)
Send/Receive Buffer (2/3): A is an array
call MPI_REDUCE (A, recvbuf, 3, datatype, op, root, comm, ierr)
MPI_Reduce(A, recvbuf, 3, datatype, op, root, comm)
• Starting address of the send buffer
– A(1): Fortran, A[0]: C
– 3 (continuous) components of A (A(1)-A(3), A[0]-A[2]) are sent
[Figure: array A(:) indexed 1-10 in Fortran, A[:] indexed 0-9 in C]
Send/Receive Buffer (3/3): A is an array
call MPI_REDUCE (A(4), recvbuf, 3, datatype, op, root, comm, ierr)
MPI_Reduce(A[3], recvbuf, 3, datatype, op, root, comm)
• Starting address of the send buffer
– A(4): Fortran, A[3]: C
– 3 (continuous) components of A (A(4)-A(6), A[3]-A[5]) are sent
[Figure: array A(:) indexed 1-10 in Fortran, A[:] indexed 0-9 in C]
Example of MPI_Reduce (1/2)
call MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm, ierr)

real(kind=8):: X0, X1
call MPI_REDUCE (X0, X1, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, <comm>, ierr)

real(kind=8):: X0(4), XMAX(4)
call MPI_REDUCE (X0, XMAX, 4, MPI_DOUBLE_PRECISION, MPI_MAX, 0, <comm>, ierr)

Global max. values of X0(i) go to XMAX(i) on the #0 process (i=1-4).
Fortran
Example of MPI_Reduce (2/2)
call MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm, ierr)

real(kind=8):: X0, XSUM
call MPI_REDUCE (X0, XSUM, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, <comm>, ierr)
Global summation of X0 goes to XSUM on the #0 process.

real(kind=8):: X0(4)
call MPI_REDUCE (X0(1), X0(3), 2, MPI_DOUBLE_PRECISION, MPI_SUM, 0, <comm>, ierr)
• Global summation of X0(1) goes to X0(3) on the #0 process.
• Global summation of X0(2) goes to X0(4) on the #0 process.
[Figure: sendbuf = X0(1:2), recvbuf = X0(3:4)]
MPI_BCAST
• Broadcasts a message from the process with rank "root" to all other processes of the communicator
• call MPI_BCAST (buffer, count, datatype, root, comm, ierr)
– buffer    choice  I/O  starting address of buffer; type is defined by "datatype"
– count     I       I    number of elements in send/receive buffer
– datatype  I       I    data type of elements of send/receive buffer
    FORTRAN: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc.
    C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc.
– root      I       I    rank of root process
– comm      I       I    communicator
– ierr      I       O    completion code
[Figure: Broadcast copies A0, B0, C0, D0 from P#0 to all processes]
Fortran
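A minimal Fortran sketch of MPI_BCAST; the array X and its length are illustrative assumptions, and a full example appears later in allreduce.f:

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: my_rank, ierr
real(kind=8):: X(4)

call MPI_INIT (ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

! only the root (#0 process) sets the data
if (my_rank.eq.0) X(1:4)= (/1.d0, 2.d0, 3.d0, 4.d0/)

! after the call, X(1:4) holds the same values on every process
call MPI_BCAST (X, 4, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

call MPI_FINALIZE (ierr)
stop
end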
MPI_ALLREDUCE
• MPI_Reduce + MPI_Bcast
• Summations (of dot products) and MAX/MIN values are likely to be utilized in each process
• call MPI_ALLREDUCE (sendbuf, recvbuf, count, datatype, op, comm, ierr)
– sendbuf   choice  I  starting address of send buffer
– recvbuf   choice  O  starting address of receive buffer; type is defined by "datatype"
– count     I       I  number of elements in send/receive buffer
– datatype  I       I  data type of elements in send/receive buffer
– op        I       I  reduce operation
– comm      I       I  communicator
– ierr      I       O  completion code
[Figure: Allreduce, as in Example of Collective Communications (3/4)]
Fortran
"op" of MPI_Reduce/Allreduce
call MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
• MPI_MAX, MPI_MIN    Max, Min
• MPI_SUM, MPI_PROD   Summation, Product
• MPI_LAND            Logical AND
Fortran
Local Data (1/2)
• Decompose the vector with length = 20 into 4 domains (processes)
• Each process handles a vector with length = 5

VECp(1:20)= 2,  VECs(1:20)= 3
Fortran
Local Data (2/2)
• The 1st-5th components of the original global vector go to the 1st-5th components of PE#0, the 6th-10th to PE#1, the 11th-15th to PE#2, and the 16th-20th to PE#3.

PE#0: local VECp(1:5)= 2, VECs(1:5)= 3   (global VECp(1)-VECp(5),   VECs(1)-VECs(5))
PE#1: local VECp(1:5)= 2, VECs(1:5)= 3   (global VECp(6)-VECp(10),  VECs(6)-VECs(10))
PE#2: local VECp(1:5)= 2, VECs(1:5)= 3   (global VECp(11)-VECp(15), VECs(11)-VECs(15))
PE#3: local VECp(1:5)= 2, VECs(1:5)= 3   (global VECp(16)-VECp(20), VECs(16)-VECs(20))
Fortran
But ...
• It is too easy !! Just decomposing and renumbering from 1 (or 0).
• Of course, this is not enough. Further examples will be shown in the latter part.
[Figure: the global vector VG(1)-VG(20) is split into local vectors VL(1)-VL(5): PE#0 holds VG(1)-VG(5), PE#1 holds VG(6)-VG(10), PE#2 holds VG(11)-VG(15), PE#3 holds VG(16)-VG(20).]
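For the block distribution in this figure, the relation between local and global numbering is simple. A sketch, illustrative only, and assuming the global vector VG were actually available on the process (which it is not in a real distributed program):

do i= 1, 5
  ig= my_rank*5 + i        ! global index corresponding to local index i
  VL(i)= VG(ig)
enddo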
Example: Dot Product (1/3)
<$O-S1>/allreduce.f

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr
real(kind=8), dimension(5) :: VECp, VECs

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

sumA= 0.d0
sumR= 0.d0
do i= 1, 5
  VECp(i)= 2.d0
  VECs(i)= 3.d0
enddo

sum0= 0.d0
do i= 1, 5
  sum0= sum0 + VECp(i) * VECs(i)
enddo

if (my_rank.eq.0) then
  write (*,'(a)') '(my_rank, sumALLREDUCE, sumREDUCE)'
endif

The local vector is generated on each local process.
Example: Dot Product (2/3)
<$O-S1>/allreduce.f

!C
!C-- REDUCE
      call MPI_REDUCE (sum0, sumR, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                       MPI_COMM_WORLD, ierr)
!C
!C-- ALL-REDUCE
      call MPI_Allreduce (sum0, sumA, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                          MPI_COMM_WORLD, ierr)

      write (*,'(a,i5, 2(1pe16.6))') 'before BCAST', my_rank, sumA, sumR

Dot product: summation of the results of each process (sum0). "sumR" has a value only on PE#0; "sumA" has a value on all processes after MPI_Allreduce.
Example: Dot Product (3/3)
<$O-S1>/allreduce.f

!C
!C-- BCAST
      call MPI_BCAST (sumR, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, &
                      ierr)

      write (*,'(a,i5, 2(1pe16.6))') 'after BCAST', my_rank, sumA, sumR

      call MPI_FINALIZE (ierr)

      stop
      end

"sumR" has a value on PE#1-#3 after MPI_Bcast.
Execute <$O-S1>/allreduce.f/c
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiifort -align array64byte -O3 -axCORE-AVX512 allreduce.f
$> mpiicc -align -O3 -axCORE-AVX512 allreduce.c
(modify go4.sh, 4 processes)
$> pjsub go4.sh

(my_rank, sumALLREDUCE, sumREDUCE)
before BCAST   0   1.200000E+02   1.200000E+02
after  BCAST   0   1.200000E+02   1.200000E+02
before BCAST   1   1.200000E+02   0.000000E+00
after  BCAST   1   1.200000E+02   1.200000E+02
before BCAST   3   1.200000E+02   0.000000E+00
after  BCAST   3   1.200000E+02   1.200000E+02
before BCAST   2   1.200000E+02   0.000000E+00
after  BCAST   2   1.200000E+02   1.200000E+02
Examples by Collective Comm.
• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
Global/Local Data (1/3)
• Parallelization of an easy process where a real number ALPHA is added to each component of a real vector VECg:

Fortran:
do i= 1, NG
  VECg(i)= VECg(i) + ALPHA
enddo

C:
for (i=0; i<NG; i++){
  VECg[i]= VECg[i] + ALPHA;
}
Global/Local Data (2/3)
• Configuration
– NG= 32 (length of the vector)
– ALPHA= 1000.
– Number of MPI processes = 4
• Vector VECg has the following 32 components (<$O-S1>/a1x.all):
(101.0, 103.0, 105.0, 106.0, 109.0, 111.0, 121.0, 151.0,
 201.0, 203.0, 205.0, 206.0, 209.0, 211.0, 221.0, 251.0,
 301.0, 303.0, 305.0, 306.0, 309.0, 311.0, 321.0, 351.0,
 401.0, 403.0, 405.0, 406.0, 409.0, 411.0, 421.0, 451.0)
Global/Local Data (3/3)
• Procedure (a hedged sketch of these four steps follows after this list)
① Read vector VECg with length = 32 on one process (e.g. the 0th process)
– Global Data
② Distribute the vector components to the 4 MPI processes equally (i.e. length = 8 for each process)
– Local Data, Local ID/Numbering
③ Add ALPHA to each component of the local vector (with length = 8) on each process
④ Merge the results into the global vector with length = 32
• Actually, we do not need parallel computers for such a small computation.
MPI Programming
![Page 84: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/84.jpg)
83
Operations of Scatter/Gather (1/8)
Reading VECg (length= 32) on one process (e.g. #0)
• Reading global data on process #0

   include 'mpif.h'
   integer, parameter :: NG= 32
   real(kind=8), dimension(NG):: VECg

   call MPI_INIT (ierr)
   call MPI_COMM_SIZE (<comm>, PETOT , ierr)
   call MPI_COMM_RANK (<comm>, my_rank, ierr)

   if (my_rank.eq.0) then
     open (21, file= 'a1x.all', status= 'unknown')
     do i= 1, NG
       read (21,*) VECg(i)
     enddo
     close (21)
   endif
#include <mpi.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>

int main(int argc, char **argv){
    int i, NG=32;
    int PeTot, MyRank;
    MPI_Comm SolverComm;
    double VECg[32];
    char filename[80];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(<comm>, &PeTot);
    MPI_Comm_rank(<comm>, &MyRank);

    fp = fopen("a1x.all", "r");
    if(!MyRank) for(i=0;i<NG;i++){ fscanf(fp, "%lf", &VECg[i]); }
MPI Programming
![Page 85: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/85.jpg)
84
Operations of Scatter/Gather (2/8)
Distributing global data to 4 processes equally (i.e. length= 8 for each process)
• MPI_Scatter
MPI Programming
![Page 86: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/86.jpg)
85
MPI_SCATTER
• Sends data from one process to all other processes in a communicator
  – scount-size messages are sent to each process

• call MPI_SCATTER (sendbuf, scount, sendtype, recvbuf,
                    rcount, recvtype, root, comm, ierr)
  – sendbuf    choice  I   starting address of sending buffer;
                           type is defined by "sendtype"
  – scount     I       I   number of elements sent to each process
  – sendtype   I       I   data type of elements of sending buffer
                           FORTRAN: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc.
                           C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc.
  – recvbuf    choice  O   starting address of receiving buffer
  – rcount     I       I   number of elements received from the root process
  – recvtype   I       I   data type of elements of receiving buffer
  – root       I       I   rank of root process
  – comm       I       I   communicator
  – ierr       I       O   completion code
[Diagram: Scatter distributes the blocks A0, B0, C0, D0 held on P#0 to P#0-P#3, one block per process; Gather is the inverse operation.]
Fortran
MPI Programming
![Page 87: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/87.jpg)
86
MPI_SCATTER (cont.)
• call MPI_SCATTER (sendbuf, scount, sendtype, recvbuf,
                    rcount, recvtype, root, comm, ierr)
  – sendbuf    choice  I   starting address of sending buffer;
                           type is defined by "sendtype"
  – scount     I       I   number of elements sent to each process
  – sendtype   I       I   data type of elements of sending buffer
  – recvbuf    choice  O   starting address of receiving buffer
  – rcount     I       I   number of elements received from the root process
  – recvtype   I       I   data type of elements of receiving buffer
  – root       I       I   rank of root process
  – comm       I       I   communicator
  – ierr       I       O   completion code
• Usually
  – scount = rcount
– sendtype= recvtype
• This function sends scount components starting from sendbuf (sending buffer) at process #root to each process in comm. Each process receives rcount components starting from recvbuf (receiving buffer).
[Diagram: the same Scatter/Gather illustration as on the previous page.]
Fortran
MPI Programming
![Page 88: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/88.jpg)
87
Operations of Scatter/Gather (3/8)
Distributing global data to 4 processes equally
• Allocating receiving buffer VEC (length= 8) on each process.
• The 8 components sent from sending buffer VECg of process #0 are received on each of processes #0-#3 as the 1st-8th components of receiving buffer VEC.
call MPI_SCATTER (sendbuf, scount, sendtype, recvbuf, rcount, recvtype, root, comm, ierr)
MPI Programming
integer, parameter :: N = 8
real(kind=8), dimension(N) :: VEC
...
call MPI_Scatter                        &
     (VECg, N, MPI_DOUBLE_PRECISION,    &
      VEC , N, MPI_DOUBLE_PRECISION,    &
      0, <comm>, ierr)
int N=8;
double VEC[8];
...
MPI_Scatter (VECg, N, MPI_DOUBLE, VEC, N, MPI_DOUBLE, 0, <comm>);
![Page 89: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/89.jpg)
88
Operations of Scatter/Gather (4/8)
Distributing global data to 4 processes equally
• 8 components are scattered to each process from the root (#0)
• The 1st-8th components of VECg are stored as the 1st-8th components of VEC at #0, the 9th-16th components of VECg are stored as the 1st-8th components of VEC at #1, etc.
  – VECg: Global Data, VEC: Local Data
[Diagram: the 32 components of VECg (sendbuf, global data) on the root PE#0 are scattered in blocks of 8 to VEC (recvbuf, local data, length 8) on PE#0-PE#3.]
MPI Programming
![Page 90: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/90.jpg)
89
Operations of Scatter/Gather (5/8)
Distributing global data to 4 processes equally
• Global Data: 1st-32nd components of VECg at #0
• Local Data: 1st-8th components of VEC at each process
• Each component of VEC can be written from each process in the following way:
do i= 1, N
  write (*,'(a, 2i8,f10.0)') 'before', my_rank, i, VEC(i)
enddo
for(i=0;i<N;i++){
  printf("before %5d %5d %10.0F\n", MyRank, i+1, VEC[i]);
}
MPI Programming
![Page 91: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/91.jpg)
90
Operations of Scatter/Gather (5/8)
Distributing global data to 4 processes equally
• Global Data: 1st-32nd components of VECg at #0
• Local Data: 1st-8th components of VEC at each process
• Each component of VEC can be written from each process in the following way:
MPI Programming
PE#0 before 0 1 101. before 0 2 103. before 0 3 105. before 0 4 106. before 0 5 109. before 0 6 111. before 0 7 121. before 0 8 151.
PE#1 before 1 1 201. before 1 2 203. before 1 3 205. before 1 4 206. before 1 5 209. before 1 6 211. before 1 7 221. before 1 8 251.
PE#2 before 2 1 301. before 2 2 303. before 2 3 305. before 2 4 306. before 2 5 309. before 2 6 311. before 2 7 321. before 2 8 351.
PE#3 before 3 1 401. before 3 2 403. before 3 3 405. before 3 4 406. before 3 5 409. before 3 6 411. before 3 7 421. before 3 8 451.
![Page 92: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/92.jpg)
91
Operations of Scatter/Gather (6/8)
On each process, ALPHA is added to each of the 8 components of VEC
• On each process, the computation is done in the following way:
real(kind=8), parameter :: ALPHA= 1000.
do i= 1, N
  VEC(i)= VEC(i) + ALPHA
enddo
double ALPHA=1000.;
...
for(i=0;i<N;i++){
  VEC[i]= VEC[i] + ALPHA;
}
• Results:
PE#0 after 0 1 1101. after 0 2 1103. after 0 3 1105. after 0 4 1106. after 0 5 1109. after 0 6 1111. after 0 7 1121. after 0 8 1151.
PE#1 after 1 1 1201. after 1 2 1203. after 1 3 1205. after 1 4 1206. after 1 5 1209. after 1 6 1211. after 1 7 1221. after 1 8 1251.
PE#2 after 2 1 1301. after 2 2 1303. after 2 3 1305. after 2 4 1306. after 2 5 1309. after 2 6 1311. after 2 7 1321. after 2 8 1351.
PE#3 after 3 1 1401. after 3 2 1403. after 3 3 1405. after 3 4 1406. after 3 5 1409. after 3 6 1411. after 3 7 1421. after 3 8 1451.
MPI Programming
![Page 93: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/93.jpg)
92
Operations of Scatter/Gather (7/8)
Merging the results into the global vector with length= 32
• Using MPI_Gather (inverse operation to MPI_Scatter)
MPI Programming
![Page 94: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/94.jpg)
93
MPI_GATHER
• Gathers together values from a group of processes, inverse operation to MPI_Scatter
• call MPI_GATHER (sendbuf, scount, sendtype, recvbuf,
                   rcount, recvtype, root, comm, ierr)
  – sendbuf    choice  I   starting address of sending buffer
  – scount     I       I   number of elements sent by each process
  – sendtype   I       I   data type of elements of sending buffer
  – recvbuf    choice  O   starting address of receiving buffer
  – rcount     I       I   number of elements received from each process (significant only at root)
  – recvtype   I       I   data type of elements of receiving buffer
  – root       I       I   rank of root process
  – comm       I       I   communicator
  – ierr       I       O   completion code
• recvbuf is significant only on the root process.
[Diagram: the same Scatter/Gather illustration as on the previous pages.]
Fortran
MPI Programming
![Page 95: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/95.jpg)
94
Operations of Scatter/Gather (8/8)
Merging the results into the global vector with length= 32
• Each process sends its components of VEC to VECg on the root process (#0 in this case).
[Diagram: the 8 components of VEC (sendbuf, local data) on each of PE#0-PE#3 are gathered into VECg (recvbuf, global data, length 32) on the root PE#0.]
• 8 components are gathered from each process to the root process.
MPI Programming
call MPI_Gather                         &
     (VEC , N, MPI_DOUBLE_PRECISION,    &
      VECg, N, MPI_DOUBLE_PRECISION,    &
      0, <comm>, ierr)
MPI_Gather (VEC, N, MPI_DOUBLE, VECg, N, MPI_DOUBLE, 0, <comm>);
![Page 96: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/96.jpg)
95
<$O-S1>/scatter-gather.f/c
PE#0 before 0 1 101. before 0 2 103. before 0 3 105. before 0 4 106. before 0 5 109. before 0 6 111. before 0 7 121. before 0 8 151.
PE#1 before 1 1 201. before 1 2 203. before 1 3 205. before 1 4 206. before 1 5 209. before 1 6 211. before 1 7 221. before 1 8 251.
PE#2 before 2 1 301. before 2 2 303. before 2 3 305. before 2 4 306. before 2 5 309. before 2 6 311. before 2 7 321. before 2 8 351.
PE#3 before 3 1 401. before 3 2 403. before 3 3 405. before 3 4 406. before 3 5 409. before 3 6 411. before 3 7 421. before 3 8 451.
PE#0 after 0 1 1101. after 0 2 1103. after 0 3 1105. after 0 4 1106. after 0 5 1109. after 0 6 1111. after 0 7 1121. after 0 8 1151.
PE#1 after 1 1 1201. after 1 2 1203. after 1 3 1205. after 1 4 1206. after 1 5 1209. after 1 6 1211. after 1 7 1221. after 1 8 1251.
PE#2 after 2 1 1301. after 2 2 1303. after 2 3 1305. after 2 4 1306. after 2 5 1309. after 2 6 1311. after 2 7 1321. after 2 8 1351.
PE#3 after 3 1 1401. after 3 2 1403. after 3 3 1405. after 3 4 1406. after 3 5 1409. after 3 6 1411. after 3 7 1421. after 3 8 1451.
MPI Programming
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiifort -align array64byte -O3 -axCORE-AVX512 scatter-gather.f
$> mpiicc -align -O3 -axCORE-AVX512 scatter-gather.c
(modify go4.sh, 4 processes)
$> pjsub go4.sh
![Page 97: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/97.jpg)
96
MPI_REDUCE_SCATTER
• MPI_REDUCE + MPI_SCATTER
• call MPI_REDUCE_SCATTER (sendbuf, recvbuf, rcount,
                           datatype, op, comm, ierr)
  – sendbuf    choice  I   starting address of sending buffer
  – recvbuf    choice  O   starting address of receiving buffer
  – rcount     I       I   integer array specifying the number of elements in the result
                           distributed to each process; the array must be identical on all calling processes
  – datatype   I       I   data type of elements of sending/receiving buffer
  – op         I       I   reduce operation
  – comm       I       I   communicator
  – ierr       I       O   completion code
[Diagram: Reduce_scatter. The vectors on P#0-P#3 are first reduced element-wise (op.A0-A3, op.B0-B3, op.C0-C3, op.D0-D3), and the reduced blocks are then scattered, one block to each process.]
Fortran
MPI Programming
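Since the course codes do not include an MPI_REDUCE_SCATTER program, the following is a minimal hypothetical sketch (4 processes assumed; the names x, y and rcount are illustrative and are not part of <$O-S1>): each process contributes a length-4 vector, the element-wise sum is formed, and each process then receives one element of the summed vector.

      implicit REAL*8 (A-H,O-Z)
      include 'mpif.h'
      integer :: PETOT, my_rank, ierr
      real(kind=8), dimension(4) :: x       ! local contribution (length= PETOT)
      real(kind=8), dimension(1) :: y       ! this process' piece of the reduced result
      integer,      dimension(4) :: rcount  ! # of result elements for each process

      call MPI_INIT (ierr)
      call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )     ! assumed to be 4
      call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

      x     = dble(my_rank + 1)     ! rank 0 contributes 1.0, rank 1 contributes 2.0, ...
      rcount= 1                     ! identical on all calling processes

!C-- element-wise sum over ranks = 1+2+3+4 = 10; each process receives one element
      call MPI_REDUCE_SCATTER (x, y, rcount, MPI_DOUBLE_PRECISION, MPI_SUM,  &
     &                         MPI_COMM_WORLD, ierr)

      write (*,'(a,i5,1pe16.6)') 'reduce_scatter', my_rank, y(1)
      call MPI_FINALIZE (ierr)
      stop
      end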
![Page 98: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/98.jpg)
97
MPI_ALLGATHER
• MPI_GATHER + MPI_BCAST
  – Gathers data from all tasks and distributes the combined data to all tasks
• call MPI_ALLGATHER (sendbuf, scount, sendtype, recvbuf,
rcount, recvtype, comm, ierr)
  – sendbuf    choice  I   starting address of sending buffer
  – scount     I       I   number of elements sent to each process
  – sendtype   I       I   data type of elements of sending buffer
  – recvbuf    choice  O   starting address of receiving buffer
  – rcount     I       I   number of elements received from each process
  – recvtype   I       I   data type of elements of receiving buffer
  – comm       I       I   communicator
  – ierr       I       O   completion code
[Diagram: Allgather. The single blocks A0, B0, C0, D0 on P#0-P#3 are gathered so that every process ends up with the full sequence A0 B0 C0 D0.]
Fortran
MPI Programming
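For reference, a minimal hypothetical MPI_ALLGATHER sketch (not one of the course codes; buf and tbl are illustrative names): every process contributes one value and every process ends up with the complete table. The same pattern is used later in agv.f to gather the local vector lengths into rcounts.

      implicit REAL*8 (A-H,O-Z)
      include 'mpif.h'
      integer :: PETOT, my_rank, ierr
      real(kind=8) :: buf
      real(kind=8), dimension(:), allocatable :: tbl

      call MPI_INIT (ierr)
      call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )
      call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

      allocate (tbl(PETOT))
      buf= 100.d0 + dble(my_rank)          ! each process contributes one value

!C-- after the call, tbl(1:PETOT)= (100., 101., 102., ...) on EVERY process
      call MPI_ALLGATHER (buf, 1, MPI_DOUBLE_PRECISION,   &
     &                    tbl, 1, MPI_DOUBLE_PRECISION,   &
     &                    MPI_COMM_WORLD, ierr)

      write (*,'(a,i5,8f8.1)') 'allgather', my_rank, tbl
      call MPI_FINALIZE (ierr)
      stop
      end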
![Page 99: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/99.jpg)
98
MPI_ALLTOALL
• Sends data from all to all processes: transposition of a dense matrix
• call MPI_ALLTOALL (sendbuf, scount, sendtype, recvbuf,
                     rcount, recvtype, comm, ierr)
  – sendbuf    choice  I   starting address of sending buffer
  – scount     I       I   number of elements sent to each process
  – sendtype   I       I   data type of elements of sending buffer
  – recvbuf    choice  O   starting address of receiving buffer
  – rcount     I       I   number of elements received from each process
  – recvtype   I       I   data type of elements of receiving buffer
  – comm       I       I   communicator
  – ierr       I       O   completion code
[Diagram: All-to-All. Before the call, P#i holds its own blocks (e.g. P#0: A0 A1 A2 A3); after the call, P#i holds block i of every process (e.g. P#0: A0 B0 C0 D0), i.e. a block-wise transpose.]
Fortran
MPI Programming
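The course codes do not include an MPI_ALLTOALL program either; the following minimal hypothetical sketch (4 processes assumed; sbuf and rbuf are illustrative names) shows the block-transpose pattern of the figure: the block sent by process i to process j becomes the block received by process j from process i.

      implicit REAL*8 (A-H,O-Z)
      include 'mpif.h'
      integer :: PETOT, my_rank, ierr
      real(kind=8), dimension(4) :: sbuf, rbuf

      call MPI_INIT (ierr)
      call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )     ! assumed to be 4
      call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

!C-- sbuf(j) is the single element this process sends to process j-1
      do j= 1, PETOT
        sbuf(j)= dble(10*my_rank + j)
      enddo

!C-- after the call, rbuf(j) holds the element received from process j-1,
!C-- i.e. rbuf(j)= 10*(j-1) + my_rank + 1
      call MPI_ALLTOALL (sbuf, 1, MPI_DOUBLE_PRECISION,   &
     &                   rbuf, 1, MPI_DOUBLE_PRECISION,   &
     &                   MPI_COMM_WORLD, ierr)

      write (*,'(a,i5,4f8.1)') 'alltoall', my_rank, rbuf
      call MPI_FINALIZE (ierr)
      stop
      end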
![Page 100: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/100.jpg)
99
Examples by Collective Comm.
• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
MPI Programming
![Page 101: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/101.jpg)
100
Operations of Distributed Local Files
• In the Scatter/Gather example, PE#0 reads the global data, which is then scattered to each process, and parallel operations are done.
• If the problem size is very large, a single processor may not be able to read the entire global data.
  – If the entire global data is decomposed into distributed local data sets, each process can read its own local data.
  – If global operations are needed for a certain set of vectors, MPI functions such as MPI_Gather are available.
MPI Programming
![Page 102: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/102.jpg)
101
Reading Distributed Local Files: Uniform Vec. Length (1/2)
MPI Programming
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ ls a1.*
a1.0 a1.1 a1.2 a1.3        "a1x.all" is decomposed into 4 files.

>$ mpiicc –O3 file.c
>$ mpiifort –O3 file.f
(modify go4.sh for 4 processes)
>$ pjsub go4.sh
a1.0:  101.0 103.0 105.0 106.0 109.0 111.0 121.0 151.0
a1.1:  201.0 203.0 205.0 206.0 209.0 211.0 221.0 251.0
a1.2:  301.0 303.0 305.0 306.0 309.0 311.0 321.0 351.0
a1.3:  401.0 403.0 405.0 406.0 409.0 411.0 421.0 451.0
![Page 103: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/103.jpg)
102
go4.sh
#!/bin/sh
#PJM -N "test"                Job Name
#PJM -L rscgrp=lecture7 Name of “QUEUE”
#PJM -L node=1 Node#
#PJM --mpi proc=4 Total MPI Process#
#PJM -L elapse=00:15:00 Computation Time
#PJM -g gt37 Group Name (Wallet)
#PJM -j
#PJM -e err Standard Error
#PJM -o test.lst Standard Output
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
![Page 104: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/104.jpg)
103
Reading Distributed Local Files: Uniform Vec. Length (2/2)
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr
real(kind=8), dimension(8) :: VEC
character(len=80) :: filename

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

if (my_rank.eq.0) filename= 'a1.0'
if (my_rank.eq.1) filename= 'a1.1'
if (my_rank.eq.2) filename= 'a1.2'
if (my_rank.eq.3) filename= 'a1.3'

open (21, file= filename, status= 'unknown')
do i= 1, 8
  read (21,*) VEC(i)
enddo
close (21)

call MPI_FINALIZE (ierr)

stop
end
<$O-S1>/file.f
MPI Programming
Similar to “Hello”
Local ID is 1-8
![Page 105: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/105.jpg)
104
Typical SPMD Operation
[Diagram: mpirun -np 4 a.out. The same program "a.out" runs on PE #0-PE #3, and each PE reads its own local file "a1.0"-"a1.3".]
MPI Programming
![Page 106: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/106.jpg)
105
Non-Uniform Vector Length (1/2)
MPI Programming
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ ls a2.*
a2.0 a2.1 a2.2 a2.3
>$ cat a2.1
5        Number of components at each process
201.0    Components
203.0
205.0
206.0
209.0
>$ mpiicc –O3 file2.c
>$ mpiifort –O3 file2.f
(modify go4.sh for 4 processes)
>$ pjsub go4.sh
![Page 107: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/107.jpg)
106
a2.0~a2.3
PE#0 (a2.0):  8   101.0 103.0 105.0 106.0 109.0 111.0 121.0 151.0
PE#1 (a2.1):  5   201.0 203.0 205.0 206.0 209.0
PE#2 (a2.2):  7   301.0 303.0 305.0 306.0 311.0 321.0 351.0
PE#3 (a2.3):  3   401.0 403.0 405.0
![Page 108: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/108.jpg)
107
Non-Uniform Vector Length (2/2)
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr
real(kind=8), dimension(:), allocatable :: VEC
character(len=80) :: filename

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

if (my_rank.eq.0) filename= 'a2.0'
if (my_rank.eq.1) filename= 'a2.1'
if (my_rank.eq.2) filename= 'a2.2'
if (my_rank.eq.3) filename= 'a2.3'

open (21, file= filename, status= 'unknown')
read (21,*) N
allocate (VEC(N))
do i= 1, N
  read (21,*) VEC(i)
enddo
close (21)

call MPI_FINALIZE (ierr)
stop
end
<$O-S1>/file2.f
MPI Programming
“N” is different at each process
![Page 109: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/109.jpg)
108
How to generate local data
• Reading global data (N=NG)
  – Scattering to each process
  – Parallel processing on each process
  – (If needed) reconstruction of global data by gathering local data
• Generating local data (N=NL), or reading distributed local data
  – Generating or reading local data on each process
  – Parallel processing on each process
  – (If needed) reconstruction of global data by gathering local data
• In the future, the latter case will be more important, but the former is also introduced in this class to help you understand operations on global/local data.
MPI Programming
![Page 110: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/110.jpg)
109
Examples by Collective Comm.
• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
MPI Programming
![Page 111: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/111.jpg)
110
MPI_GATHERV,MPI_SCATTERV
• MPI_Gather, MPI_Scatter
  – Length of message from/to each process is uniform
• MPI_XXXv extends the functionality of MPI_XXX by allowing a varying count of data from each process (see the MPI_Scatterv sketch below):
  – MPI_Gatherv
  – MPI_Scatterv
  – MPI_Allgatherv
  – MPI_Alltoallv
MPI Programming
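As a small illustration of the varying-count interface, here is a hypothetical MPI_Scatterv sketch; it is not one of the course codes (those use MPI_Allgatherv on the following pages), and the names scounts, displs, VECg and VEC are illustrative. It reuses the 8/5/7/3 block lengths of a2.0-a2.3: the root distributes blocks of different sizes from one global vector.

      implicit REAL*8 (A-H,O-Z)
      include 'mpif.h'
      integer :: PETOT, my_rank, ierr
      integer, dimension(:), allocatable :: scounts, displs
      real(kind=8), dimension(:), allocatable :: VECg, VEC

      call MPI_INIT (ierr)
      call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )     ! assumed to be 4
      call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

      allocate (scounts(PETOT), displs(PETOT+1))
      scounts(1:4)= (/8, 5, 7, 3/)                 ! identical on every process
      displs(1)= 0
      do ip= 1, PETOT
        displs(ip+1)= displs(ip) + scounts(ip)
      enddo

      N= scounts(my_rank+1)                        ! local length: 8, 5, 7 or 3
      allocate (VEC(N), VECg(displs(PETOT+1)))
      if (my_rank.eq.0) then                       ! VECg is significant only on the root
        do i= 1, displs(PETOT+1)
          VECg(i)= dble(i)
        enddo
      endif

!C-- the root (#0) sends scounts(i) elements starting at offset displs(i) to process i-1
      call MPI_SCATTERV (VECg, scounts, displs, MPI_DOUBLE_PRECISION,   &
     &                   VEC , N,               MPI_DOUBLE_PRECISION,   &
     &                   0, MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE (ierr)
      stop
      end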
![Page 112: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/112.jpg)
111
MPI_ALLGATHERV
• Variable count version of MPI_Allgather
  – creates "global data" from "local data"
• call MPI_ALLGATHERV (sendbuf, scount, sendtype, recvbuf,
rcounts, displs, recvtype, comm, ierr)
  – sendbuf    choice  I   starting address of sending buffer
  – scount     I       I   number of elements sent to each process
  – sendtype   I       I   data type of elements of sending buffer
  – recvbuf    choice  O   starting address of receiving buffer
  – rcounts    I       I   integer array (of length group size) containing the number of
                           elements that are to be received from each process (array: size= PETOT)
  – displs     I       I   integer array (of length group size); entry i specifies the displacement
                           (relative to recvbuf) at which to place the incoming data from process i
                           (array: size= PETOT+1)
  – recvtype   I       I   data type of elements of receiving buffer
  – comm       I       I   communicator
  – ierr       I       O   completion code
Fortran
MPI Programming
![Page 113: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/113.jpg)
112
MPI_ALLGATHERV (cont.)• call MPI_ALLGATHERV (sendbuf, scount, sendtype, recvbuf,
rcounts, displs, recvtype, comm, ierr)
– rcounts I I integer array (of length groupsize) containing the number of elements that are to be received from each process (array: size= PETOT)
– displs I I integer array (of length groupsize). Entry i specifies the displacement (relative to recvbuf ) at which to place the incoming data from process i (array: size= PETOT+1)
  – These two arrays are related to the size of the final "global data"; therefore each process needs the information in both arrays (rcounts, displs)
• Each process must have the same values for all components of both arrays
  – Usually, stride(i)= rcounts(i)
[Diagram: recvbuf divided into per-PE strides. rcounts(i) elements from PE#(i-1) are placed at offset displs(i); displs(1)= 0, displs(i+1)= displs(i) + stride(i), size(recvbuf)= displs(PETOT+1)= sum(stride).]
Fortran
MPI Programming
![Page 114: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/114.jpg)
113
What MPI_Allgatherv is doing
[Diagram: the local vector (length N) on each of PE#0-PE#3 is copied into the global vector; the block from PE#i has length rcounts(i+1) (= stride(i+1)) and starts at offset displs(i+1).]
Generating global data from local data
MPI Programming
Local Data: sendbuf Global Data: recvbuf
![Page 115: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/115.jpg)
114
What MPI_Allgatherv is doing
[Diagram: same as the previous page, with stride(i)= rcounts(i).]
Generating global data from local data
MPI Programming
Local Data: sendbuf Global Data: recvbuf
![Page 116: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/116.jpg)
115
MPI_Allgatherv in detail (1/2)
• call MPI_ALLGATHERV (sendbuf, scount, sendtype, recvbuf,
rcounts, displs, recvtype, comm, ierr)
• rcounts
  – Size of message from each PE: Size of Local Data (Length of Local Vector)
• displs
  – Address/index of each local data block in the vector of global data
  – displs(PETOT+1)= Size of Entire Global Data (Global Vector)
[Diagram: per-PE strides of recvbuf: displs(1)= 0, displs(i+1)= displs(i) + stride(i), size(recvbuf)= displs(PETOT+1)= sum(stride).]
Fortran
MPI Programming
![Page 117: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/117.jpg)
116
MPI_Allgatherv in detail (2/2)
• Each process needs the information in rcounts & displs
  – "rcounts" can be created by gathering the local vector length "N" from each process.
  – "displs" can then be generated from "rcounts" on each process.
    • stride[i]= rcounts[i]
  – The size of "recvbuf" is obtained by summing up "rcounts".
[Diagram: per-PE strides of recvbuf: displs(1)= 0, displs(i+1)= displs(i) + stride(i), size(recvbuf)= displs(PETOT+1)= sum(stride).]
Fortran
MPI Programming
![Page 118: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/118.jpg)
117
Preparation for MPI_Allgatherv
<$O-S1>/agv.f
• Generating a global vector from "a2.0"~"a2.3".
• The length of each local vector is 8, 5, 7, and 3, respectively. Therefore, the size of the final global vector is 23 (= 8+5+7+3).
MPI Programming
![Page 119: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/119.jpg)
118
a2.0~a2.3
PE#0 (a2.0):  8   101.0 103.0 105.0 106.0 109.0 111.0 121.0 151.0
PE#1 (a2.1):  5   201.0 203.0 205.0 206.0 209.0
PE#2 (a2.2):  7   301.0 303.0 305.0 306.0 311.0 321.0 351.0
PE#3 (a2.3):  3   401.0 403.0 405.0
MPI Programming
![Page 120: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/120.jpg)
119
Preparation: MPI_Allgatherv (1/4)
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'

integer :: PETOT, my_rank, SOLVER_COMM, ierr
real(kind=8), dimension(:), allocatable :: VEC
real(kind=8), dimension(:), allocatable :: VEC2
real(kind=8), dimension(:), allocatable :: VECg
integer(kind=4), dimension(:), allocatable :: rcounts
integer(kind=4), dimension(:), allocatable :: displs
character(len=80) :: filename

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

if (my_rank.eq.0) filename= 'a2.0'
if (my_rank.eq.1) filename= 'a2.1'
if (my_rank.eq.2) filename= 'a2.2'
if (my_rank.eq.3) filename= 'a2.3'

open (21, file= filename, status= 'unknown')
read (21,*) N
allocate (VEC(N))
do i= 1, N
  read (21,*) VEC(i)
enddo
<$O-S1>/agv.f
MPI Programming
N (NL) is different at each process
Fortran
![Page 121: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/121.jpg)
120
Preparation: MPI_Allgatherv (2/4)
allocate (rcounts(PETOT), displs(PETOT+1))
rcounts= 0
write (*,'(a,10i8)') "before", my_rank, N, rcounts

call MPI_allGATHER ( N      , 1, MPI_INTEGER,   &
     &               rcounts, 1, MPI_INTEGER,   &
     &               MPI_COMM_WORLD, ierr)

write (*,'(a,10i8)') "after ", my_rank, N, rcounts
displs(1)= 0
[Diagram: MPI_Allgather of N. PE#0 (N=8), PE#1 (N=5), PE#2 (N=7), PE#3 (N=3); after the call, rcounts(1:4)= {8, 5, 7, 3} on every PE.]
<$O-S1>/agv.f
MPI Programming
Fortran
Rcounts on each PE
![Page 122: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/122.jpg)
121
Preparation: MPI_Allgatherv (2/4)
allocate (rcounts(PETOT), displs(PETOT+1))
rcounts= 0
write (*,'(a,10i8)') "before", my_rank, N, rcounts

call MPI_allGATHER ( N      , 1, MPI_INTEGER,   &
     &               rcounts, 1, MPI_INTEGER,   &
     &               MPI_COMM_WORLD, ierr)

write (*,'(a,10i8)') "after ", my_rank, N, rcounts
displs(1)= 0

do ip= 1, PETOT
  displs(ip+1)= displs(ip) + rcounts(ip)
enddo
write (*,'(a,10i8)') "displs", my_rank, displs
call MPI_FINALIZE (ierr)
stop
end
<$O-S1>/agv.f
MPI Programming
Fortran
Rcounts on each PE
Displs on each PE
![Page 123: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/123.jpg)
122
Preparation: MPI_Allgatherv (3/4)
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ mpiifort –O3 agv.f
(modify go4.sh for 4 processes)
>$ pjsub go4.sh
before       0       8       0       0       0       0
after        0       8       8       5       7       3
displs       0       0       8      13      20      23
FORTRAN STOP

before       1       5       0       0       0       0
after        1       5       8       5       7       3
displs       1       0       8      13      20      23
FORTRAN STOP

before       3       3       0       0       0       0
after        3       3       8       5       7       3
displs       3       0       8      13      20      23
FORTRAN STOP

before       2       7       0       0       0       0
after        2       7       8       5       7       3
displs       2       0       8      13      20      23
FORTRAN STOP
write (*,'(a,10i8)') "before", my_rank, N, rcounts
write (*,'(a,10i8)') "after ", my_rank, N, rcounts
write (*,'(a,10i8)') "displs", my_rank, displs
MPI Programming
![Page 124: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/124.jpg)
123
Preparation: MPI_Allgatherv (4/4)
• Only "recvbuf" is not defined yet.
• Size of "recvbuf" = "displs(PETOT+1)"
call MPI_allGATHERv                                            &
     ( VEC    , N, MPI_DOUBLE_PRECISION,                       &
       recvbuf, rcounts, displs, MPI_DOUBLE_PRECISION,         &
       MPI_COMM_WORLD, ierr)
MPI Programming
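Putting the last pieces together, a minimal sketch of the remaining step, assuming the variables of agv.f shown on the previous pages (VECg is the allocatable receiving buffer declared there):

!C-- size of the global vector= displs(PETOT+1) (= 23 for a2.0-a2.3)
      allocate (VECg(displs(PETOT+1)))

      call MPI_allGATHERv ( VEC , N, MPI_DOUBLE_PRECISION,                &
     &                      VECg, rcounts, displs, MPI_DOUBLE_PRECISION,  &
     &                      MPI_COMM_WORLD, ierr)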
![Page 125: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/125.jpg)
124
Report S1 (1/2)
• Deadline: January 28th (Tue), 2020, 17:00
  – Send files via e-mail to nakajima(at)cc.u-tokyo.ac.jp
• Problem S1-1
  – Read the local files <$O-S1>/a1.0~a1.3 and <$O-S1>/a2.0~a2.3.
  – Develop codes which calculate the norm ||x||2 of the global vector for each case.
    • <$O-S1>/file.f, <$O-S1>/file2.f
• Problem S1-2
  – Read the local files <$O-S1>/a2.0~a2.3.
  – Develop a code which constructs the "global vector" using MPI_Allgatherv.
MPI Programming
![Page 126: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/126.jpg)
125
Report S1 (2/2)
• Problem S1-3
  – Develop a parallel program which calculates the following numerical integral using the "trapezoidal rule" with MPI_Reduce, MPI_Bcast, etc.
  – Measure the computation time and the parallel performance
∫₀¹ 4/(1+x²) dx
• Report
  – Cover Page: Name, ID, and Problem ID (S1) must be written.
  – Less than two pages including figures and tables (A4) for each of three sub-problems
    • Strategy, Structure of the Program, Remarks
  – Source list of the program (if you have bugs)
  – Output list (as small as possible)
MPI Programming
![Page 127: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/127.jpg)
126
Options for Optimization
General Option (-O3)
$ mpiifort -O3 test.f
$ mpiicc -O3 test.c

Special Options for AVX512, NOT Necessarily Fast …
$ mpiifort -align array64byte -O3 -axCORE-AVX512 test.f
$ mpiicc -align -O3 -axCORE-AVX512 test.c
![Page 128: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/128.jpg)
127
Process Number
[Node diagram: two sockets of Intel® Xeon® Platinum 8280 (Cascade Lake, CLX), 2.7 GHz, 28 cores and 2.419 TFLOPS each; Socket #0 = cores 0-27, Socket #1 = cores 28-55; 96 GB memory per socket via DDR4-2933 ×6ch (140.8 GB/sec); sockets connected by Ultra Path Interconnect (UPI), 10.4 GT/sec ×3 = 124.8 GB/sec.]
#PJM -L node=1;#PJM --mpi proc= 1 1-node, 1-proc, 1-proc/n
#PJM -L node=1;#PJM --mpi proc= 4 1-node, 4-proc, 4-proc/n
#PJM -L node=1;#PJM --mpi proc=16 1-node,16-proc,16-proc/n
#PJM -L node=1;#PJM --mpi proc=28 1-node,28-proc,28-proc/n
#PJM -L node=1;#PJM --mpi proc=56 1-node,56-proc,56-proc/n
#PJM -L node=4;#PJM --mpi proc=128 4-node,128-proc,32-proc/n
#PJM -L node=8;#PJM --mpi proc=256 8-node,256-proc,32-proc/n
#PJM -L node=8;#PJM --mpi proc=448 8-node,448-proc,56-proc/n
![Page 129: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/129.jpg)
128
a01.sh: Use 1 core (0th)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=1
#PJM --mpi proc=1
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 130: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/130.jpg)
129
a24.sh: Use 24 cores (0th-23rd)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=1
#PJM --mpi proc=24
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 131: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/131.jpg)
130
a48.sh: Use 48 cores (0th-23rd, 28th-51st)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=1
#PJM --mpi proc=48
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 132: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/132.jpg)
131
b48.sh: Use 8x48 cores (0th-23rd, 28th-51st)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=384        384÷8= 48 cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 133: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/133.jpg)
132
b48b.sh: Use 8x48 cores (0th-23rd, 28th-51st)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=384        384÷8= 48 cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} numactl -l ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 134: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/134.jpg)
133
NUMA Architecture
• Each node of Oakbridge-CX (OBCX)
  – 2 sockets (CPUs) of Intel CLX
  – Each socket has 28 cores
• Each core of a socket can access the memory on the other socket: NUMA (Non-Uniform Memory Access)
  – numactl –l : only local memory is used
  – Sometimes (not always), runs are more stable with this option
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 135: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/135.jpg)
134
Use 8x48 cores; the 48 cores are randomly selected from the 56 cores

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=384        384÷8= 48 cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 136: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/136.jpg)
135
Use 8x56 cores

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=448        448÷8= 56 cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
[Node diagram as on the "Process Number" page.]
![Page 142: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/142.jpg)
55
What is MPI ? (2/2)
• “mpich” (free) is widely used– supports MPI-2 spec. (partially)– MPICH2 after Nov. 2005.– http://www.mcs.anl.gov/mpi/
• Why MPI is widely used as de facto standard ?– Uniform interface through MPI forum
• Portable, can work on any types of computers• Can be called from Fortran, C, etc.
– mpich• free, supports every architecture
• PVM (Parallel Virtual Machine) was also proposed in early 90’s but not so widely used as MPI
MPI Programming
![Page 143: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/143.jpg)
66
References
• W.Gropp et al., Using MPI second edition, MIT Press, 1999. • M.J.Quinn, Parallel Programming in C with MPI and OpenMP,
McGrawhill, 2003.• W.Gropp et al., MPI:The Complete Reference Vol.I, II, MIT Press,
1998. • http://www.mcs.anl.gov/mpi/www/
– API (Application Interface) of MPI
MPI Programming
![Page 144: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/144.jpg)
77
How to learn MPI (1/2)• Grammar
– 10-20 functions of MPI-1 will be taught in the class• although there are many convenient capabilities in MPI-2
– If you need further information, you can find information from web, books, and MPI experts.
• Practice is important– Programming– “Running the codes” is the most important
• Be familiar with or “grab” the idea of SPMD/SIMD op’s– Single Program/Instruction Multiple Data– Each process does same operation for different data
• Large-scale data is decomposed, and each part is computed by each process
– Global/Local Data, Global/Local Numbering
MPI Programming
![Page 145: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/145.jpg)
88
SPMD
PE #0
Program
Data #0
PE #1
Program
Data #1
PE #2
Program
Data #2
PE #M-1
Program
Data #M-1
mpirun -np M <Program>
You understand 90% MPI, if you understand this figure.
PE: Processing ElementProcessor, Domain, Process
Each process does same operation for different dataLarge-scale data is decomposed, and each part is computed by each processIt is ideal that parallel program is not different from serial one except communication.
MPI Programming
![Page 146: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/146.jpg)
99
Some Technical Terms• Processor, Core
– Processing Unit (H/W), Processor=Core for single-core proc’s
• Process– Unit for MPI computation, nearly equal to “core”– Each core (or processor) can host multiple processes (but not
efficient)
• PE (Processing Element)– PE originally mean “processor”, but it is sometimes used as
“process” in this class. Moreover it means “domain” (next)• In multicore proc’s: PE generally means “core”
• Domain– domain=process (=PE), each of “MD” in “SPMD”, each data set
• Process ID of MPI (ID of PE, ID of domain) starts from “0”– if you have 8 processes (PE’s, domains), ID is 0~7
MPI Programming
![Page 147: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/147.jpg)
1010
SPMD
PE #0
Program
Data #0
PE #1
Program
Data #1
PE #2
Program
Data #2
PE #M-1
Program
Data #M-1
mpirun -np M <Program>
You understand 90% MPI, if you understand this figure.
PE: Processing ElementProcessor, Domain, Process
Each process does same operation for different dataLarge-scale data is decomposed, and each part is computed by each processIt is ideal that parallel program is not different from serial one except communication.
MPI Programming
![Page 148: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/148.jpg)
1111
How to learn MPI (2/2)
• NOT so difficult.• Therefore, 5-6-hour lectures are enough for just learning
grammar of MPI.
• Grab the idea of SPMD !
MPI Programming
![Page 149: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/149.jpg)
1212
Schedule
• MPI– Basic Functions– Collective Communication – Point-to-Point (or Peer-to-Peer) Communication
• 105 min. x 3-4 lectures– Collective Communication
• Report S1
– Point-to-Point Communication• Report S2: Parallelization of 1D code
– At this point, you are almost an expert of MPI programming
MPI Programming
![Page 150: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/150.jpg)
1313
• What is MPI ?
• Your First MPI Program: Hello World
• Collective Communication• Point-to-Point Communication
MPI Programming
![Page 151: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/151.jpg)
1414
Login to Oakbridge -CX (OBCX)MPI Programming
PCOBCX
ssh t37**@obcx.cc.u-tokyo.ac.jp
Create directory>$ cd /work/gt37/t37XXX>$ mkdir pFEM>$ cd pFEM
In this class this top-directory is called <$O-TOP>.Files are copied to this directory.
Under this directory, S1, S2, S1-ref are created:<$O-S1> = <$O-TOP>/mpi/S1
<$O-S2> = <$O-TOP>/mpi/S2
Copying files on OBCX
Fortran
>$ cd /work/gt37/t37XXX/pFEM
>$ cp /work/gt00/z30088/pFEM/F/s1-f.tar .
>$ tar xvf s1-f.tar
C
>$ cd /work/gt37/t37XXX/pFEM
>$ cp /work/gt00/z30088/pFEM/C/s1-c.tar .
>$ tar xvf s1-c.tar
Confirmation
>$ ls
mpi
>$ cd mpi/S1
This directory is called <$O-S1>.
<$O-S1> = <$O-TOP>/mpi/S1
First Example

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end
#include "mpi.h"
#include <stdio.h>
int main(int argc, char **argv)
{
int n, myid, numprocs, i;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
printf ("Hello World %d¥n", myid);
MPI_Finalize();
}
hello.f
hello.c
Compiling hello.f/c
FORTRAN
$> "mpiifort":
required compiler & libraries are included for FORTRAN90+MPI
C
$> "mpiicc":
required compiler & libraries are included for C+MPI
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ mpiifort -align array64byte -O3 -axCORE-AVX512 hello.f
>$ mpiicc -align -O3 -axCORE-AVX512 hello.c
Running Job
• Batch Jobs
  – Only batch jobs are allowed; interactive execution of jobs is not allowed.
• How to run
  – write a job script
  – submit the job
  – check the job status
  – check the results
• Utilization of computational resources
  – 1 node (56 cores) is occupied by each job.
  – Your node is not shared with other jobs.
Job Script
• <$O-S1>/hello.sh
• Scheduling + Shell Script
#!/bin/sh
#PJM -N "hello"              Job Name
#PJM -L rscgrp=lecture7 Name of “QUEUE”
#PJM -L node=1 Node#
#PJM --mpi proc=4 Total MPI Process#
#PJM -L elapse=00:15:00 Computation Time
#PJM -g gt37 Group Name (Wallet)
#PJM -j
#PJM -e err Standard Error
#PJM -o hello.lst Standard Output
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
mpiexec.hydra                Command for running an MPI job
-n ${PJM_MPI_PROC} = (--mpi proc=XX), in this case =4
./a.out name of executable file
Process Number
[Figure: one OBCX node. Two sockets of Intel® Xeon® Platinum 8280 (Cascade Lake, CLX), 2.7 GHz, 28 cores and 2.419 TFLOPS each; Socket #0 holds cores 0-27, Socket #1 holds cores 28-55. Each socket has 96 GB of memory over 6 channels of DDR4-2933 (140.8 GB/sec). The sockets are connected by Ultra Path Interconnect (UPI), 10.4 GT/sec x 3 = 124.8 GB/sec.]
#PJM -L node=1;#PJM --mpi proc= 1 1-node, 1-proc, 1-proc/n
#PJM -L node=1;#PJM --mpi proc= 4 1-node, 4-proc, 4-proc/n
#PJM -L node=1;#PJM --mpi proc=16 1-node,16-proc,16-proc/n
#PJM -L node=1;#PJM --mpi proc=28 1-node,28-proc,28-proc/n
#PJM -L node=1;#PJM --mpi proc=56 1-node,56-proc,56-proc/n
#PJM -L node=4;#PJM --mpi proc=128 4-node,128-proc,32-proc/n
#PJM -L node=8;#PJM --mpi proc=256 8-node,256-proc,32-proc/n
#PJM -L node=8;#PJM --mpi proc=448 8-node,448-proc,56-proc/n
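As a hedged sketch (the 2-node / 112-process combination is an assumption, not one of the cases listed above), only the node and proc directives change for a multi-node run; the rest is the same as hello.sh:

#!/bin/sh
#PJM -N "hello"
#PJM -L rscgrp=lecture7
#PJM -L node=2
#PJM --mpi proc=112
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o hello.lst

mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out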
Job Submission
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
(modify hello.sh)
>$ pjsub hello.sh
>$ cat hello.lst
Hello World 0
Hello World 3
Hello World 2
Hello World 1
Available QUEUE’s
• The following 2 queues are available.
• 8 nodes can be used.
  – lecture
    • 8 nodes (448 cores), 15 min., valid until the end of March 2020
    • Shared by all "educational" users
  – lecture7
    • 8 nodes (448 cores), 15 min., active during class time
    • More jobs (compared to lecture) can be processed, depending on availability.
Submitting & Checking Jobs
• Submitting Jobs pjsub SCRIPT NAME
• Checking status of jobs pjstat
• Deleting/aborting pjdel JOB ID
• Checking status of queues pjstat --rsc
• Detailed info. of queues      pjstat --rsc -x
• Info of running jobs          pjstat -a
• History of submission         pjstat -H
• Limitation of submission pjstat --limit
Basic/Essential Functions

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end
#include "mpi.h"
#include <stdio.h>
int main(int argc, char **argv)
{
int n, myid, numprocs, i;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
printf ("Hello World %d¥n", myid);
MPI_Finalize();
}
‘mpif.h’, “mpi.h”Essential Include file“use mpi” is possible in F90
MPI_InitInitialization
MPI_Comm_sizeNumber of MPI Processesmpirun -np XX <prog>
MPI_Comm_rankProcess ID starting from 0
MPI_FinalizeTermination of MPI processes
Difference between FORTRAN/C
• (Basically) the same interface
  – In C, UPPER and lower case letters are treated as different
    • e.g. MPI_Comm_size
      – "MPI" is in UPPER case
      – The first character of the function name after "MPI_" is in UPPER case
      – All other characters are in lower case
• In Fortran, the return value ierr has to be added at the end of the argument list.
• C needs special types for variables:
  – MPI_Comm, MPI_Datatype, MPI_Op, etc.
• MPI_INIT is different:
  – Fortran: call MPI_INIT (ierr)
  – C: MPI_Init (int *argc, char ***argv)
What’s are going on ?implicit REAL*8 (A-H,O-Z)include 'mpif.h‘integer :: PETOT, my_rank, ierr
call MPI_INIT (ierr)call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )
write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT
call MPI_FINALIZE (ierr)
stopend
• mpiexec.hydra starts up 4 MPI processes ("proc=4")
  – A single program runs on four processes.
  – Each process writes its value of myid.
• The four processes do the same operations, but their values of myid are different.
• The output of each process is different.
• That is SPMD !
#!/bin/sh
#PJM -N "hello"              Job Name
#PJM -L rscgrp=lecture7 Name of “QUEUE”
#PJM -L node=1 Node#
#PJM --mpi proc=4 Total MPI Process#
#PJM -L elapse=00:15:00 Computation Time
#PJM -g gt37 Group Name (Wallet)
#PJM -j
#PJM -e err Standard Error
#PJM -o hello.lst Standard Output
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
mpi.h, mpif.h

implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end
#include "mpi.h"
#include <stdio.h>
int main(int argc, char **argv)
{
int n, myid, numprocs, i;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
printf ("Hello World %d¥n", myid);
MPI_Finalize();
}
• Various types of parameters and variables for MPI & their initial values.
• The name of each variable starts with "MPI_".
• The values of these parameters and variables cannot be changed by users.
• Users must not declare variables whose names start with "MPI_" in their own programs.
MPI_INIT
• Initializes the MPI execution environment (required)
• It is recommended to put this BEFORE all other executable statements in the program.
• call MPI_INIT (ierr)
– ierr I O Completion Code
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end
Fortran
MPI_FINALIZE
• Terminates the MPI execution environment (required)
• It is recommended to put this AFTER all other executable statements in the program.
• Please do not forget this.
• call MPI_FINALIZE (ierr)
– ierr I O completion code
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end
Fortran
MPI_COMM_SIZE
• Determines the size of the group associated with a communicator
• Not required, but a very convenient function
• call MPI_COMM_SIZE (comm, size, ierr)
– comm    I   I   communicator
– size    I   O   number of processes in the group of the communicator
– ierr    I   O   completion code
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end
Fortran
What is Communicator ?
• A group of processes for communication.
• A communicator must be specified in an MPI program as the unit of communication.
• All processes belong to one default group, named "MPI_COMM_WORLD".
• Multiple communicators can be created, and complicated operations are possible.
  – e.g. computation, visualization
MPI_Comm_Size (MPI_COMM_WORLD, PETOT)
MPI_COMM_WORLD
Communicator in MPIOne process can belong to multiple communicators
COMM_MANTLE COMM_CRUST
COMM_VIS
MPI Programming
![Page 170: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/170.jpg)
333333MPI Programming
Coupling between “Ground Motion” and “Sloshing of Tanks for Oil-Storage”
![Page 171: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/171.jpg)
![Page 172: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/172.jpg)
35
Target Application• Coupling between “Ground Motion” and “Sloshing of
Tanks for Oil-Storage”– “One-way” coupling from “Ground Motion” to “Tanks”.– Displacement of ground surface is given as forced
displacement of bottom surface of tanks.– 1 Tank = 1 PE (serial)
Deformation of surface will be given as boundary conditionsat bottom of tanks.
Deformation of surface will be given as boundary conditionsat bottom of tanks.
MPI Programming
![Page 173: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/173.jpg)
36
2003 Tokachi Earthquake (M8.0)Fire accident of oil tanks due to long period
ground motion (surface waves) developed in the basin of Tomakomai
MPI Programming
![Page 174: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/174.jpg)
37
Seismic Wave Propagation, Underground Structure
MPI Programming
![Page 175: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/175.jpg)
38
Simulation Codes• Ground Motion (Ichimura): Fortran
– Parallel FEM, 3D Elastic/Dynamic• Explicit forward Euler scheme
– Each element: 2m×2m×2m cube– 240m×240m×100m region
• Sloshing of Tanks (Nagashima): C– Serial FEM (Embarrassingly Parallel)
• Implicit backward Euler, Skyline method• Shell elements + Inviscid potential flow
– D: 42.7m, H: 24.9m, T: 20mm,– Frequency: 7.6sec.– 80 elements in circ., 0.6m mesh in height– Tank-to-Tank: 60m, 4×4
• Total number of unknowns: 2,918,169
MPI Programming
![Page 176: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/176.jpg)
39
3939
Three Communicators
[Figure: three communicators.
 meshGLOBAL%MPI_COMM contains all processes: basement #0-#3 (meshGLOBAL%my_rank = 0-3) and tank #0-#8 (meshGLOBAL%my_rank = 4-12).
 meshBASE%MPI_COMM contains basement #0-#3 (meshBASE%my_rank = 0-3); on tank processes meshBASE%my_rank = -1.
 meshTANK%MPI_COMM contains tank #0-#8 (meshTANK%my_rank = 0-8); on basement processes meshTANK%my_rank = -1.]
MPI_COMM_RANK
• Determines the rank of the calling process in the communicator
  – The "ID of an MPI process" is often called its "rank"

• MPI_COMM_RANK (comm, rank, ierr)
  – comm    I   I   communicator
  – rank    I   O   rank of the calling process in the group of comm, starting from "0"
  – ierr    I   O   completion code
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

write (*,'(a,2i8)') 'Hello World FORTRAN', my_rank, PETOT

call MPI_FINALIZE (ierr)

stop
end
Fortran
MPI_ABORT
• Aborts MPI execution environment
• call MPI_ABORT (comm, errcode, ierr)
– comm      I   I   communicator
– errcode   I   I   error code
– ierr      I   O   completion code
Fortran
MPI_WTIME
• Returns an elapsed time on the calling processor
• time= MPI_WTIME ()
– time R8 O Time in seconds since an arbitrary time in the past.
…
real(kind=8):: Stime, Etime

Stime= MPI_WTIME ()
do i= 1, 100000000
  a= 1.d0
enddo
Etime= MPI_WTIME ()

write (*,'(i5,1pe16.6)') my_rank, Etime-Stime
Fortran
Example of MPI_Wtime
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiicc -O1 time.c
$> mpiifort -O1 time.f

(modify go4.sh, 4 processes)
$> pjsub go4.sh

Process ID      Time
     0      1.113281E+00
     3      1.113281E+00
     2      1.117188E+00
     1      1.117188E+00
go4.sh
#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture7
#PJM -L node=1
#PJM --mpi proc=4
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
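The slide stops at the directives; presumably go4.sh ends with the same launch line as hello.sh:

mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out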
MPI_Wtick
• Returns the resolution of MPI_Wtime
• Depends on the hardware and the compiler
• time= MPI_Wtick ()
– time R8 O Time in seconds of resolution of MPI_Wtime
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
…
TM= MPI_WTICK ()
write (*,*) TM
…
double Time;
…
Time = MPI_Wtick();
printf("%5d%16.6E\n", MyRank, Time);
…
Example of MPI_Wtick
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiicc -O1 wtick.c
$> mpiifort -O1 wtick.f

(modify go1.sh, 1 process)
$> pjsub go1.sh
go1.sh
#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture7
#PJM -L node=1
#PJM --mpi proc=1
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
MPI_BARRIER
• Blocks until all processes in the communicator have reached this routine.
• Mainly for debugging; it has a huge overhead and is not recommended for production code.
• call MPI_BARRIER (comm, ierr)
– comm    I   I   communicator
– ierr    I   O   completion code
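A typical (hedged) use is to synchronize all processes before and after a section timed with MPI_WTIME; Stime and Etime follow the earlier example:

      call MPI_BARRIER (MPI_COMM_WORLD, ierr)   ! all processes start timing together
      Stime= MPI_WTIME ()
      ! ... computation to be measured ...
      call MPI_BARRIER (MPI_COMM_WORLD, ierr)   ! wait for the slowest process
      Etime= MPI_WTIME ()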
Fortran
• What is MPI ?
• Your First MPI Program: Hello World
• Collective Communication• Point-to-Point Communication
What is Collective Communication ? (collective / group communication)
• Collective communication is the process of exchanging information between multiple MPI processes in the communicator: one-to-all or all-to-all communications.
• Examples
  – Broadcasting control data
  – Max, Min
  – Summation
  – Dot products of vectors
  – Transformation of dense matrices
Example of Collective Communications (1/4)
[Figure: Broadcast copies A0 B0 C0 D0 from P#0 to every process.
 Scatter distributes A0, B0, C0, D0 from P#0 to P#0, P#1, P#2, P#3 respectively; Gather is the inverse operation.]
Example of Collective Communications (2/4)
[Figure: Allgather collects A0, B0, C0, D0 from all processes so that every process ends up with A0 B0 C0 D0.
 All-to-All: process P#i sends its j-th block to P#j, so the rows A0-A3, B0-B3, C0-C3, D0-D3 are transposed across processes.]
Example of Collective Communications (3/4)
[Figure: Reduce applies an operation (e.g. sum, max) to A0-A3, B0-B3, C0-C3, D0-D3 across processes and stores the results on P#0 only.
 Allreduce does the same, but every process receives the reduced results.]
Example of Collective Communications (4/4)
[Figure: Reduce-scatter reduces A0-A3, B0-B3, C0-C3, D0-D3 across processes and scatters the results: op.A0-A3 to P#0, op.B0-B3 to P#1, op.C0-C3 to P#2, op.D0-D3 to P#3.]
Examples by Collective Comm.
• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
Global/Local Data
• Data structure of parallel computing based on SPMD, where large scale “global data” is decomposed to small pieces of “local data”.
Domain Decomposition/Partitioning
[Figure: large-scale data is decomposed into local data sets, one per domain; the domains communicate with each other.]
• A PC with 1 GB of RAM can execute an FEM application with up to 10^6 meshes.
  – 10^3 km x 10^3 km x 10^2 km (SW Japan): 10^8 meshes with 1-km cubes
• Large-scale Data: Domain decomposition, parallel & local operations
• Global Computation: Comm. among domains needed
Local Data Structure
• It is important to define a proper local data structure for the target computation (and its algorithm).
  – Algorithms = Data Structures
• This is the main objective of this class !
Global/Local Data
• Data structure of parallel computing based on SPMD, where large scale “global data” is decomposed to small pieces of “local data”.
• Consider the dot product of following VECp and VECs with length=20 by parallel computation using 4 processors
Fortran: VECp( 1) ... VECp(20) = 2,  VECs( 1) ... VECs(20) = 3
C:       VECp[ 0] ... VECp[19] = 2,  VECs[ 0] ... VECs[19] = 3
<$O-S1>/dot.f, dot.c

implicit REAL*8 (A-H,O-Z)
real(kind=8),dimension(20):: VECp, VECs

do i= 1, 20
  VECp(i)= 2.0d0
  VECs(i)= 3.0d0
enddo

sum= 0.d0
do ii= 1, 20
  sum= sum + VECp(ii)*VECs(ii)
enddo

stop
end
#include <stdio.h>
int main(){
  int i;
  double VECp[20], VECs[20];
  double sum;

  for(i=0;i<20;i++){
    VECp[i]= 2.0;
    VECs[i]= 3.0;
  }

  sum = 0.0;
  for(i=0;i<20;i++){
    sum += VECp[i] * VECs[i];
  }
  return 0;
}
<$O-S1>/dot.f, dot.c
(do it on PC/ECCS 2016)
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ icc dot.c
>$ ifort dot.f
>$ ./a.out
     1   2.   3.
     2   2.   3.
     3   2.   3.
   ...
    18   2.   3.
    19   2.   3.
    20   2.   3.

 dot product   120.
MPI_REDUCE
• Reduces values on all processes to a single value
  – Summation, Product, Max, Min, etc.

• call MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
  – sendbuf    choice  I   starting address of the send buffer
  – recvbuf    choice  O   starting address of the receive buffer; the type is defined by "datatype"
  – count      I       I   number of elements in the send/receive buffer
  – datatype   I       I   data type of elements of the send/receive buffer
                           FORTRAN: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc.
                           C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc.
  – op         I       I   reduce operation: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND etc.
                           Users can define their own operations by MPI_OP_CREATE
  – root       I       I   rank of the root process
  – comm       I       I   communicator
  – ierr       I       O   completion code
P#1
P#2
P#3
A0P#0 B0 C0 D0
A1P#1 B1 C1 D1
A2P#2 B2 C2 D2
A3P#3 B3 C3 D3
A0P#0 B0 C0 D0
A1P#1 B1 C1 D1
A2P#2 B2 C2 D2
A3P#3 B3 C3 D3
op.A0-A3 op.B0-B3 op.C0-C3 op.D0-D3op.A0-A3 op.B0-B3 op.C0-C3 op.D0-D3
Fortran
MPI Programming
![Page 200: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/200.jpg)
636363
Send/Receive Buffer(Sending/Receiving)
• Arrays of “send (sending) buffer” and “receive (receiving) buffer” often appear in MPI.
• Addresses of “send (sending) buffer” and “receive (receiving) buffer” must be different.
MPI Programming
![Page 201: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/201.jpg)
646464
Send/Receive Buffer (1/3)A: Scalar
MPI Programming
call MPI_REDUCE (A,recvbuf, 1,datatype,op,root,comm,ierr)
MPI_Reduce(A,recvbuf,1,datatype,op,root,comm)
![Page 202: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/202.jpg)
656565
Send/Receive Buffer (2/3)A: Array
MPI Programming
call MPI_REDUCE (A,recvbuf, 3,datatype,op,root,comm,ierr)
MPI_Reduce(A,recvbuf,3,datatype,op,root,comm)
• Starting Address of Send Buffer– A(1): Fortran, A[0]: C– 3 (continuous) components of A (A(1)-A(3), A[0]-A[2])
are sent
1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9
A(:)
A[:]
![Page 203: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/203.jpg)
666666
Send/Receive Buffer (3/3)A: Array
MPI Programming
call MPI_REDUCE (A(4),recvbuf, 3,datatype,op,root,comm,ierr)
MPI_Reduce(A[3],recvbuf,3,datatype,op,root,comm)
• Starting Address of Send Buffer– A(4): Fortran, A[3]: C– 3 (continuous) components of A (A(4)-A(6), A[3]-A[5])
are sent
1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9
A(:)
A[:]
![Page 204: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/204.jpg)
6767
Example of MPI_Reduce (1/2)call MPI_REDUCE
(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
real(kind=8):: X0, X1
call MPI_REDUCE(X0, X1, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, <comm>, ierr)
real(kind=8):: X0(4), XMAX(4)
call MPI_REDUCE(X0, XMAX, 4, MPI_DOUBLE_PRECISION, MPI_MAX, 0, <comm>, ierr)
Global Max. values of X0(i) go to XMAX(i) on #0 process (i=1-4)
Fortran
MPI Programming
![Page 205: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/205.jpg)
6868
Example of MPI_Reduce (2/2)call MPI_REDUCE
(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
real(kind=8):: X0, XSUM
call MPI_REDUCE(X0, XSUM, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, <comm>, ierr)
real(kind=8):: X0(4)
call MPI_REDUCE(X0(1), X0(3), 2, MPI_DOUBLE_PRECISION, MPI_SUM, 0, <comm>, ierr)
Global summation of X0 goes to XSUM on #0 process.
Fortran
MPI Programming
・ Global summation of X0(1) goes to X0(3) on #0 process.・ Global summation of X0(2) goes to X0(4) on #0 process.
1 2 3 4
1 2 3 4
sendbuf
recvbuf
![Page 206: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/206.jpg)
6969
MPI_BCAST
• Broadcasts a message from the process with rank "root" to all other processes of the communicator

• call MPI_BCAST (buffer, count, datatype, root, comm, ierr)
– buffer     choice  I/O  starting address of the buffer; the type is defined by "datatype"
– count      I       I    number of elements in the send/receive buffer
– datatype   I       I    data type of elements of the send/receive buffer
                          FORTRAN: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc.
                          C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc.
– root       I       I    rank of the root process
– comm       I       I    communicator
– ierr       I       O    completion code
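For instance, a control parameter known only on the root can be broadcast to all processes (a minimal sketch; the variable NMAX is an assumption of this example):

      integer :: NMAX
      if (my_rank.eq.0) NMAX= 1000                          ! e.g. read from input on rank 0
      call MPI_BCAST (NMAX, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      ! now every process holds NMAX= 1000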
[Figure: Broadcast. The data A0 B0 C0 D0 on P#0 is copied to all processes P#0-P#3.]
Fortran
MPI_ALLREDUCE
• MPI_Reduce + MPI_Bcast
• Summation (e.g. of dot products) and MAX/MIN values are often needed on every process.

• call MPI_ALLREDUCE (sendbuf, recvbuf, count, datatype, op, comm, ierr)
  – sendbuf    choice  I   starting address of the send buffer
  – recvbuf    choice  O   starting address of the receive buffer; the type is defined by "datatype"
  – count      I       I   number of elements in the send/receive buffer
  – datatype   I       I   data type of elements in the send/receive buffer
  – op         I       I   reduce operation
  – comm       I       I   communicator
  – ierr       I       O   completion code
[Figure: Allreduce. op. is applied to A0-A3, B0-B3, C0-C3, D0-D3 across P#0-P#3, and every process receives all of the reduced results.]
Fortran
“op” of MPI_Reduce/Allreduce
• MPI_MAX, MPI_MIN       Max, Min
• MPI_SUM, MPI_PROD      Summation, Product
• MPI_LAND               Logical AND
call MPI_REDUCE
(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
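For example, the global maximum of a local value VAL can be obtained on every process (a sketch; VAL and VMAX are assumptions of this example):

      real(kind=8) :: VAL, VMAX
      call MPI_ALLREDUCE (VAL, VMAX, 1, MPI_DOUBLE_PRECISION, MPI_MAX, MPI_COMM_WORLD, ierr)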
Fortran
Local Data (1/2)
• Decompose a vector with length=20 into 4 domains (processes)
• Each process handles a vector with length= 5
VECp( 1)= 2( 2)= 2( 3)= 2
…(18)= 2(19)= 2(20)= 2
VECs( 1)= 3( 2)= 3( 3)= 3
…(18)= 3(19)= 3(20)= 3
Fortran
MPI Programming
![Page 210: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/210.jpg)
7373
Local Data (2/2)
• The 1st-5th components of the original global vector go to the 1st-5th components of PE#0, the 6th-10th to PE#1, the 11th-15th to PE#2, and the 16th-20th to PE#3.
[Figure: each of PE#0-PE#3 holds local vectors VECp(1:5)= 2 and VECs(1:5)= 3, corresponding to global components VECp/VECs(1-5) on PE#0, (6-10) on PE#1, (11-15) on PE#2, and (16-20) on PE#3.]
Fortran
But ...
• It is too easy !! Just decompose and renumber from 1 (or 0).
• Of course, this is not enough. Further examples will be shown in the latter part.
[Figure: each of PE#0-PE#3 holds a local vector VL(1)-VL(5); together they correspond to the global vector VG(1)-VG(20).]
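With this uniform block distribution (length 5 per process), the local-to-global relation implied by the figure can be written as a short sketch (ig is an assumed name):

      do i= 1, 5
        ig= my_rank*5 + i       ! VL(i) on this process corresponds to VG(ig)
      enddo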
Example: Dot Product (1/3)
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr
real(kind=8), dimension(5) :: VECp, VECs

call MPI_INIT (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

sumA= 0.d0
sumR= 0.d0
do i= 1, 5
  VECp(i)= 2.d0
  VECs(i)= 3.d0
enddo

sum0= 0.d0
do i= 1, 5
  sum0= sum0 + VECp(i) * VECs(i)
enddo

if (my_rank.eq.0) then
  write (*,'(a)') '(my_rank, sumALLREDUCE, sumREDUCE)'
endif
<$O-S1>/allreduce.f
MPI Programming
Local vector is generatedat each local process.
![Page 213: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/213.jpg)
7676
!C!C-- REDUCE
call MPI_REDUCE (sum0, sumR, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &MPI_COMM_WORLD, ierr)
!C!C-- ALL-REDUCE
call MPI_Allreduce (sum0, sumA, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &MPI_COMM_WORLD, ierr)
write (*,'(a,i5, 2(1pe16.6))') 'before BCAST', my_rank, sumA, sumR
Example: Dot Product (2/3)<$O-S1>/allreduce.f
MPI Programming
Dot ProductSummation of results of each process (sum0)“sumR” has value only on PE#0.
“sumA” has value on all processes by MPI_Allreduce
![Page 214: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/214.jpg)
7777
Example: Dot Product (3/3)
!C!C-- BCAST
call MPI_BCAST (sumR, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, &ierr)
write (*,'(a,i5, 2(1pe16.6))') 'after BCAST', my_rank, sumA, sumR
call MPI_FINALIZE (ierr)
stopend
<$O-S1>/allreduce.f
MPI Programming
“sumR” has value on PE#1-#3 by MPI_Bcast
![Page 215: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/215.jpg)
787878
Execute <$O-S1>/allreduce.f/c
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiifort -align array64byte -O3 -axCORE-AVX512 allreduce.f
$> mpiicc -align -O3 -axCORE-AVX512 allreduce.c
(modify go4.sh, 4 processes)
$> pjsub go4.sh
(my_rank, sumALLREDUCE,sumREDUCE)
before BCAST 0 1.200000E+02 1.200000E+02
after BCAST 0 1.200000E+02 1.200000E+02
before BCAST 1 1.200000E+02 0.000000E+00
after BCAST 1 1.200000E+02 1.200000E+02
before BCAST 3 1.200000E+02 0.000000E+00
after BCAST 3 1.200000E+02 1.200000E+02
before BCAST 2 1.200000E+02 0.000000E+00
after BCAST 2 1.200000E+02 1.200000E+02
Examples by Collective Comm.
• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
Global/Local Data (1/3)
• Parallelization of an easy process in which a real number α is added to each component of a real vector VECg:
do i= 1, NG VECg(i)= VECg(i) + ALPHA enddo
for (i=0; i<NG; i++{ VECg[i]= VECg[i] + ALPHA }
Global/Local Data (2/3)
• Configuration
  – NG= 32 (length of the vector)
  – ALPHA= 1000.
  – Number of MPI processes = 4
• Vector VECg has following 32 components (<$O-S1>/a1x.all):
(101.0, 103.0, 105.0, 106.0, 109.0, 111.0, 121.0, 151.0, 201.0, 203.0, 205.0, 206.0, 209.0, 211.0, 221.0, 251.0, 301.0, 303.0, 305.0, 306.0, 309.0, 311.0, 321.0, 351.0, 401.0, 403.0, 405.0, 406.0, 409.0, 411.0, 421.0, 451.0)
Global/Local Data (3/3)
• Procedure
① Reading vector VECg with length=32 on one process (e.g. the 0th process)
– Global Data
② Distributing vector components to 4 MPI processes equally (i.e. length= 8 for each processes)
– Local Data, Local ID/Numbering
③ Adding ALPHA to each component of the local vector (with length= 8) on each process.
④ Merging the results to global vector with length= 32.
• Actually, we do not need parallel computers for such a kind of small computation.
MPI Programming
![Page 220: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/220.jpg)
8383
Operations of Scatter/Gather (1/8)Reading VECg (length=32) from a process (e.g. #0)
• Reading global data from #0 process include 'mpif.h' integer, parameter :: NG= 32 real(kind=8), dimension(NG):: VECg call MPI_INIT (ierr) call MPI_COMM_SIZE (<comm>, PETOT , ierr) call MPI_COMM_RANK (<comm>, my_rank, ierr) if (my_rank.eq.0) then open (21, file= 'a1x.all', status= 'unknown') do i= 1, NG read (21,*) VECg(i)
enddo close (21)
endif
#include <mpi.h> #include <stdio.h> #include <math.h> #include <assert.h> int main(int argc, char **argv){ int i, NG=32; int PeTot, MyRank, MPI_Comm; double VECg[32]; char filename[80]; FILE *fp; MPI_Init(&argc, &argv); MPI_Comm_size(<comm>, &PeTot); MPI_Comm_rank(<comm>, &MyRank); fp = fopen("a1x.all", "r"); if(!MyRank) for(i=0;i<NG;i++){ fscanf(fp, "%lf", &VECg[i]); }
MPI Programming
![Page 221: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/221.jpg)
8484
Operations of Scatter/Gather (2/8)Distributing global data to 4 process equally (i.e. length=8 for
each process)
• MPI_Scatter
MPI Programming
![Page 222: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/222.jpg)
8585
MPI_SCATTER
• Sends data from one process to all other processes in a communicator – scount-size messages are sent to each process
• call MPI_SCATTER (sendbuf, scount, sendtype, recvbuf,
rcount, recvtype, root, comm, ierr)– sendbuf choice I starting address of sending buffer
type is defined by ”datatype”– scount I I number of elements sent to each process– sendtype I I data type of elements of sending buffer
FORTRAN MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc.
C MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc.
– recvbuf choice O starting address of receiving buffer– rcount I I number of elements received from the root process – recvtype I I data type of elements of receiving buffer– root I I rank of root process – comm I I communicator– ierr I O completion code
A0P#0 B0 C0 D0
P#1
P#2
P#3
A0P#0 B0 C0 D0
P#1
P#2
P#3
ScatterA0P#0
B0P#1
C0P#2
D0P#3
A0P#0
B0P#1
C0P#2
D0P#3Gather
Fortran
MPI Programming
![Page 223: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/223.jpg)
8686
MPI_SCATTER(cont.)• call MPI_SCATTER (sendbuf, scount, sendtype, recvbuf,
rcount, recvtype, root, comm, ierr)– sendbuf choice I starting address of sending buffer
type is defined by ”datatype”– scount I I number of elements sent to each process– sendtype I I data type of elements of sending buffer– recvbuf choice O starting address of receiving buffer– rcount I I number of elements received from the root process – recvtype I I data type of elements of receiving buffer– root I I rank of root process – comm I I communicator– ierr I O completion code
• Usually– scount = rcount
– sendtype= recvtype
• This function sends scount components starting from sendbuf (sending buffer) at process #root to each process in comm. Each process receives rcount components starting from recvbuf (receiving buffer).
A0P#0 B0 C0 D0
P#1
P#2
P#3
A0P#0 B0 C0 D0
P#1
P#2
P#3
ScatterA0P#0
B0P#1
C0P#2
D0P#3
A0P#0
B0P#1
C0P#2
D0P#3Gather
Fortran
MPI Programming
![Page 224: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/224.jpg)
8787
Operations of Scatter/Gather (3/8)Distributing global data to 4 processes equally
• Allocating receiving buffer VEC (length=8) at each process.• 8 components sent from sending buffer VECg of process #0 are
received at each process #0-#3 as 1st-8th components of receiving buffer VEC.
call MPI_SCATTER (sendbuf, scount, sendtype, recvbuf, rcount, recvtype, root, comm, ierr)
MPI Programming
integer, parameter :: N = 8 real(kind=8), dimension(N ) :: VEC ... call MPI_Scatter & (VECg, N, MPI_DOUBLE_PRECISION, & VEC , N, MPI_DOUBLE_PRECISION, & 0, <comm>, ierr)
int N=8; double VEC [8]; ... MPI_Scatter (VECg, N, MPI_DOUBLE, VEC, N, MPI_DOUBLE, 0, <comm>);
![Page 225: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/225.jpg)
8888
Operations of Scatter/Gather (4/8)Distributing global data to 4 processes equally
• 8 components are scattered to each process from root (#0)• 1st-8th components of VECg are stored as 1st-8th ones of VEC at #0,
9th-16th components of VECg are stored as 1st-8th ones of VEC at #1, etc.– VECg: Global Data, VEC: Local Data
VECg
sendbuf
VEC
recvbuf PE#0
8 8 8 8
8
root
PE#1
8
PE#2
8
PE#3
8
VECg
sendbuf
VEC
recvbuf PE#0
8 8 8 8
8
root
PE#1
8
PE#2
8
PE#3
8local data
global data
MPI Programming
![Page 226: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/226.jpg)
8989
Operations of Scatter/Gather (5/8)Distributing global data to 4 processes equally
• Global Data: 1st-32nd components of VECg at #0• Local Data: 1st-8th components of VEC at each process• Each component of VEC can be written from each process in the
following way:
do i= 1, N write (*,'(a, 2i8,f10.0)') 'before', my_rank, i, VEC(i) enddo
for(i=0;i<N;i++){ printf("before %5d %5d %10.0F\n", MyRank, i+1, VEC[i]);}
MPI Programming
![Page 227: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/227.jpg)
9090
Operations of Scatter/Gather (5/8)Distributing global data to 4 processes equally
• Global Data: 1st-32nd components of VECg at #0• Local Data: 1st-8th components of VEC at each process• Each component of VEC can be written from each process in the
following way:
MPI Programming
PE#0 before 0 1 101. before 0 2 103. before 0 3 105. before 0 4 106. before 0 5 109. before 0 6 111. before 0 7 121. before 0 8 151.
PE#1 before 1 1 201. before 1 2 203. before 1 3 205. before 1 4 206. before 1 5 209. before 1 6 211. before 1 7 221. before 1 8 251.
PE#2 before 2 1 301. before 2 2 303. before 2 3 305. before 2 4 306. before 2 5 309. before 2 6 311. before 2 7 321. before 2 8 351.
PE#3 before 3 1 401. before 3 2 403. before 3 3 405. before 3 4 406. before 3 5 409. before 3 6 411. before 3 7 421. before 3 8 451.
![Page 228: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/228.jpg)
9191
Operations of Scatter/Gather (6/8)On each process, ALPHA is added to each of 8 components
of VEC
• On each process, computation is in the following way
real(kind=8), parameter :: ALPHA= 1000. do i= 1, N VEC(i)= VEC(i) + ALPHA enddo
double ALPHA=1000.; ... for(i=0;i<N;i++){ VEC[i]= VEC[i] + ALPHA;}
• Results:
PE#0 after 0 1 1101. after 0 2 1103. after 0 3 1105. after 0 4 1106. after 0 5 1109. after 0 6 1111. after 0 7 1121. after 0 8 1151.
PE#1 after 1 1 1201. after 1 2 1203. after 1 3 1205. after 1 4 1206. after 1 5 1209. after 1 6 1211. after 1 7 1221. after 1 8 1251.
PE#2 after 2 1 1301. after 2 2 1303. after 2 3 1305. after 2 4 1306. after 2 5 1309. after 2 6 1311. after 2 7 1321. after 2 8 1351.
PE#3 after 3 1 1401. after 3 2 1403. after 3 3 1405. after 3 4 1406. after 3 5 1409. after 3 6 1411. after 3 7 1421. after 3 8 1451.
MPI Programming
![Page 229: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/229.jpg)
9292
Operations of Scatter/Gather (7/8)Merging the results to global vector with length= 32
• Using MPI_Gather (inverse operation to MPI_Scatter)
MPI Programming
![Page 230: nkl.cc.u-tokyo.ac.jpnkl.cc.u-tokyo.ac.jp/19w/03-MPI/MPIprog-1F.pdf · 1 Motivation for Parallel Computing (and this class) • Large-scale parallel computer enables fast computing](https://reader035.fdocuments.us/reader035/viewer/2022070716/5eda8b40ca56d138585eb2eb/html5/thumbnails/230.jpg)
9393
MPI_GATHER
• Gathers together values from a group of processes, inverse operation to MPI_Scatter
• call MPI_GATHER (sendbuf, scount, sendtype, recvbuf, rcount, recvtype, root, comm, ierr)
  – sendbuf   choice  I  starting address of sending buffer
  – scount    I       I  number of elements sent to each process
  – sendtype  I       I  data type of elements of sending buffer
  – recvbuf   choice  O  starting address of receiving buffer
  – rcount    I       I  number of elements received from each process (significant only on root)
  – recvtype  I       I  data type of elements of receiving buffer
  – root      I       I  rank of root process
  – comm      I       I  communicator
  – ierr      I       O  completion code

• recvbuf is significant only on the root process.
[Diagram: Scatter distributes A0, B0, C0, D0 from P#0 to P#0-P#3 (one block each); Gather is the inverse, collecting the blocks from P#0-P#3 back onto P#0.]
Fortran
MPI Programming
94
Operations of Scatter/Gather (8/8)
Merging the results into the global vector with length= 32

• Each process sends its components of VEC to VECg on the root (#0 in this case).
[Diagram: VEC (local data, sendbuf, 8 components) on each of PE#0-PE#3 is gathered into VECg (global data, recvbuf, 32 components) on the root, PE#0.]

• 8 components are gathered from each process to the root process.
MPI Programming
call MPI_Gather                              &
     (VEC , N, MPI_DOUBLE_PRECISION,         &
      VECg, N, MPI_DOUBLE_PRECISION,         &
      0, <comm>, ierr)
MPI_Gather (VEC, N, MPI_DOUBLE, VECg, N, MPI_DOUBLE, 0, <comm>);
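Putting pages (5/8)-(8/8) together, a minimal end-to-end sketch of the scatter / local computation / gather flow might look as follows. This is a hedged illustration, not the distributed <$O-S1>/scatter-gather.f itself: the initialization of VECg, the fixed sizes NG=32 and N=8, the use of MPI_COMM_WORLD, and the assumption of exactly 4 processes are choices made here for brevity.

      implicit REAL*8 (A-H,O-Z)
      include 'mpif.h'
      integer :: PETOT, my_rank, ierr
      integer, parameter :: NG= 32, N= 8
      real(kind=8) :: VECg(NG), VEC(N)
      real(kind=8), parameter :: ALPHA= 1000.

      call MPI_INIT      (ierr)
      call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr)
      call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr)

!     the root (#0) holds the global vector VECg (read from a file in
!     scatter-gather.f; filled with dummy values here)
      if (my_rank.eq.0) then
        do i= 1, NG
          VECg(i)= dble(i)
        enddo
      endif

!     distribute N (=8) components of VECg to VEC on each of the 4 processes
      call MPI_SCATTER ( VECg, N, MPI_DOUBLE_PRECISION,                &
     &                   VEC , N, MPI_DOUBLE_PRECISION,                &
     &                   0, MPI_COMM_WORLD, ierr)

!     local computation on each process
      do i= 1, N
        VEC(i)= VEC(i) + ALPHA
      enddo

!     merge the local results back into VECg on the root
      call MPI_GATHER  ( VEC , N, MPI_DOUBLE_PRECISION,                &
     &                   VECg, N, MPI_DOUBLE_PRECISION,                &
     &                   0, MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE (ierr)
      stop
      end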
95
<$O-S1>/scatter-gather.f/c
PE#0 before 0 1 101. before 0 2 103. before 0 3 105. before 0 4 106. before 0 5 109. before 0 6 111. before 0 7 121. before 0 8 151.
PE#1 before 1 1 201. before 1 2 203. before 1 3 205. before 1 4 206. before 1 5 209. before 1 6 211. before 1 7 221. before 1 8 251.
PE#2 before 2 1 301. before 2 2 303. before 2 3 305. before 2 4 306. before 2 5 309. before 2 6 311. before 2 7 321. before 2 8 351.
PE#3 before 3 1 401. before 3 2 403. before 3 3 405. before 3 4 406. before 3 5 409. before 3 6 411. before 3 7 421. before 3 8 451.
PE#0 after 0 1 1101. after 0 2 1103. after 0 3 1105. after 0 4 1106. after 0 5 1109. after 0 6 1111. after 0 7 1121. after 0 8 1151.
PE#1 after 1 1 1201. after 1 2 1203. after 1 3 1205. after 1 4 1206. after 1 5 1209. after 1 6 1211. after 1 7 1221. after 1 8 1251.
PE#2 after 2 1 1301. after 2 2 1303. after 2 3 1305. after 2 4 1306. after 2 5 1309. after 2 6 1311. after 2 7 1321. after 2 8 1351.
PE#3 after 3 1 1401. after 3 2 1403. after 3 3 1405. after 3 4 1406. after 3 5 1409. after 3 6 1411. after 3 7 1421. after 3 8 1451.
MPI Programming
$> cd /work/gt37/t37XXX/pFEM/mpi/S1
$> mpiifort -align array64byte -O3 -axCORE-AVX512 scatter-gather.f
$> mpiicc -align -O3 -axCORE-AVX512 scatter-gather.c
(modify go4.sh, 4-processes)
$> pjsub go4.sh
96
MPI_REDUCE_SCATTER
• MPI_REDUCE + MPI_SCATTER
• call MPI_REDUCE_SCATTER (sendbuf, recvbuf, rcount, datatype, op, comm, ierr)
  – sendbuf   choice  I  starting address of sending buffer
  – recvbuf   choice  O  starting address of receiving buffer
  – rcount    I       I  integer array specifying the number of elements of the result
                         distributed to each process; the array must be identical on all
                         calling processes
  – datatype  I       I  data type of elements of sending/receiving buffer
  – op        I       I  reduce operation
  – comm      I       I  communicator
  – ierr      I       O  completion code
[Diagram: Reduce-scatter — the vectors on P#0-P#3 are first reduced element-wise (op.A0-A3, op.B0-B3, op.C0-C3, op.D0-D3), then the result is scattered so that P#0 receives op.A0-A3, P#1 receives op.B0-B3, P#2 receives op.C0-C3, and P#3 receives op.D0-D3.]
Fortran
MPI Programming
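As a concrete, hedged illustration of the Fortran interface: suppose every process holds SENDBUF(32) and the four 8-element blocks are to be summed element-wise and distributed, one block per process. The buffer names and the choice of MPI_SUM are assumptions made here for illustration, not taken from the course codes (MPI_INIT etc. are assumed to have been called already).

      integer      :: rcounts(4)
      real(kind=8) :: SENDBUF(32), RECVBUF(8)

      rcounts(1:4)= 8
!     element-wise MPI_SUM over all processes; block k of the 32-element result
!     (8 elements) is returned in RECVBUF on process k-1
      call MPI_REDUCE_SCATTER ( SENDBUF, RECVBUF, rcounts,             &
     &      MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)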
97
MPI_ALLGATHER
• MPI_GATHER + MPI_BCAST
  – Gathers data from all tasks and distributes the combined data to all tasks

• call MPI_ALLGATHER (sendbuf, scount, sendtype, recvbuf, rcount, recvtype, comm, ierr)
  – sendbuf   choice  I  starting address of sending buffer
  – scount    I       I  number of elements sent to each process
  – sendtype  I       I  data type of elements of sending buffer
  – recvbuf   choice  O  starting address of receiving buffer
  – rcount    I       I  number of elements received from each process
  – recvtype  I       I  data type of elements of receiving buffer
  – comm      I       I  communicator
  – ierr      I       O  completion code
[Diagram: Allgather — the blocks A0, B0, C0, D0 held by P#0-P#3 are gathered, and every process ends up with the full sequence A0 B0 C0 D0.]
Fortran
MPI Programming
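A minimal, hedged usage sketch: each process contributes its local VEC(1:N), and every process ends up with the concatenation of all local vectors in rank order. VECg must be allocated with length N*PETOT; the names follow the earlier Scatter/Gather example and are an assumption here, not a specific course file.

!     every process receives all local vectors, concatenated in rank order
      call MPI_ALLGATHER ( VEC , N, MPI_DOUBLE_PRECISION,              &
     &                     VECg, N, MPI_DOUBLE_PRECISION,              &
     &                     MPI_COMM_WORLD, ierr)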
98
MPI_ALLTOALL
• Sends data from all processes to all processes: corresponds to the transposition of a dense (block) matrix

• call MPI_ALLTOALL (sendbuf, scount, sendtype, recvbuf, rcount, recvtype, comm, ierr)
  – sendbuf   choice  I  starting address of sending buffer
  – scount    I       I  number of elements sent to each process
  – sendtype  I       I  data type of elements of sending buffer
  – recvbuf   choice  O  starting address of receiving buffer
  – rcount    I       I  number of elements received from each process
  – recvtype  I       I  data type of elements of receiving buffer
  – comm      I       I  communicator
  – ierr      I       O  completion code
[Diagram: All-to-all — process i sends its j-th block to process j; the block rows A0-A3, B0-B3, C0-C3, D0-D3 on P#0-P#3 are transposed so that P#0 ends up with A0 B0 C0 D0, P#1 with A1 B1 C1 D1, and so on.]
Fortran
MPI Programming
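A minimal, hedged usage sketch: SBUF on each process is viewed as PETOT blocks of length N; block k is sent to process k-1, and RBUF collects the blocks received from every process, which produces the "transposed" picture above. The buffer names SBUF/RBUF are illustrative assumptions.

!     block k of SBUF goes to process k-1; the k-th block of RBUF comes from process k-1
      call MPI_ALLTOALL ( SBUF, N, MPI_DOUBLE_PRECISION,               &
     &                    RBUF, N, MPI_DOUBLE_PRECISION,               &
     &                    MPI_COMM_WORLD, ierr)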
99
Examples by Collective Comm.

• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
MPI Programming
100
Operations of Distributed Local Files
• In the Scatter/Gather example, PE#0 reads the global data, which is scattered to each processor; then parallel operations are done.
• If the problem size is very large, a single processor may not be able to read the entire global data.
  – If the entire global data is decomposed into distributed local data sets, each process can read its own local data.
  – If global operations are needed on certain sets of vectors, MPI functions such as MPI_Gather are available.
MPI Programming
101
Reading Distributed Local Files: Uniform Vec. Length (1/2)
MPI Programming
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ ls a1.*
   a1.0 a1.1 a1.2 a1.3        "a1x.all" is decomposed to 4 files.

>$ mpiicc –O3 file.c
>$ mpiifort –O3 file.f
(modify go4.sh for 4 processes)
>$ pjsub go4.sh
a1.0:  101.0  103.0  105.0  106.0  109.0  111.0  121.0  151.0
a1.1:  201.0  203.0  205.0  206.0  209.0  211.0  221.0  251.0
a1.2:  301.0  303.0  305.0  306.0  309.0  311.0  321.0  351.0
a1.3:  401.0  403.0  405.0  406.0  409.0  411.0  421.0  451.0
(each file stores one value per line)
102
go4.sh
#!/bin/sh
#PJM -N “test“ Job Name
#PJM -L rscgrp=lecture7 Name of “QUEUE”
#PJM -L node=1 Node#
#PJM --mpi proc=4 Total MPI Process#
#PJM -L elapse=00:15:00 Computation Time
#PJM -g gt37 Group Name (Wallet)
#PJM -j
#PJM -e err Standard Error
#PJM -o test.lst Standard Output
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
103
Reading Distributed Local Files: Uniform Vec. Length (2/2)
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr
real(kind=8), dimension(8) :: VEC
character(len=80) :: filename

call MPI_INIT      (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

if (my_rank.eq.0) filename= 'a1.0'
if (my_rank.eq.1) filename= 'a1.1'
if (my_rank.eq.2) filename= 'a1.2'
if (my_rank.eq.3) filename= 'a1.3'

open (21, file= filename, status= 'unknown')
do i= 1, 8
  read (21,*) VEC(i)
enddo
close (21)

call MPI_FINALIZE (ierr)

stop
end
<$O-S1>/file.f
MPI Programming
Similar to “Hello”
Local ID is 1-8
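Once each process has read its local VEC, a global quantity can be formed with a collective call. For example, a minimal sketch of a global sum over all 32 components; the names sum_local/sum_global and the use of MPI_Allreduce are this editor's illustration and are not part of file.f:

      sum_local= 0.d0
      do i= 1, 8
        sum_local= sum_local + VEC(i)
      enddo
!     every process obtains the sum over all processes
      call MPI_ALLREDUCE ( sum_local, sum_global, 1,                   &
     &      MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
      if (my_rank.eq.0) write (*,*) 'global sum= ', sum_global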
104
Typical SPMD Operation
[Diagram: mpirun -np 4 a.out — the same executable "a.out" runs on PE #0-#3, and each PE opens its own local file ("a1.0" on PE #0, "a1.1" on PE #1, "a1.2" on PE #2, "a1.3" on PE #3).]
MPI Programming
105
Non-Uniform Vector Length (1/2)
MPI Programming
>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ ls a2.*
   a2.0 a2.1 a2.2 a2.3
>$ cat a2.1
5        Number of Components at each Process
201.0    Components
203.0
205.0
206.0
209.0
>$ mpiicc –O3 file2.c
>$ mpiifort –O3 file2.f
(modify go4.sh for 4 processes)
>$ pjsub go4.sh
106
a2.0~a2.3
PE#0 (a2.0):  8    101.0 103.0 105.0 106.0 109.0 111.0 121.0 151.0
PE#1 (a2.1):  5    201.0 203.0 205.0 206.0 209.0
PE#2 (a2.2):  7    301.0 303.0 305.0 306.0 311.0 321.0 351.0
PE#3 (a2.3):  3    401.0 403.0 405.0
(in each file, the first line is the number of components, followed by one value per line)
107
Non-Uniform Vector Length (2/2)
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'
integer :: PETOT, my_rank, ierr
real(kind=8), dimension(:), allocatable :: VEC
character(len=80) :: filename

call MPI_INIT      (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

if (my_rank.eq.0) filename= 'a2.0'
if (my_rank.eq.1) filename= 'a2.1'
if (my_rank.eq.2) filename= 'a2.2'
if (my_rank.eq.3) filename= 'a2.3'

open (21, file= filename, status= 'unknown')
read (21,*) N
allocate (VEC(N))
do i= 1, N
  read (21,*) VEC(i)
enddo
close (21)

call MPI_FINALIZE (ierr)
stop
end
<$O-S1>/file2.f
MPI Programming
“N” is different at each process
108
How to generate local data

• Reading global data (N=NG)
  – Scattering to each process
  – Parallel processing on each process
  – (If needed) reconstruction of global data by gathering local data

• Generating local data (N=NL), or reading distributed local data
  – Generating or reading local data on each process
  – Parallel processing on each process
  – (If needed) reconstruction of global data by gathering local data

• In the future, the latter case will be more important; the former case is also introduced in this class for understanding operations on global/local data.
MPI Programming
109
Examples by Collective Comm.

• Dot Products of Vectors
• Scatter/Gather
• Reading Distributed Files
• MPI_Allgatherv
MPI Programming
110
MPI_GATHERV, MPI_SCATTERV

• MPI_Gather, MPI_Scatter
  – Length of message from/to each process is uniform

• MPI_XXXv extends functionality of MPI_XXX by allowing a varying count of data from each process:
  – MPI_Gatherv
  – MPI_Scatterv
  – MPI_Allgatherv
  – MPI_Alltoallv
MPI Programming
111
MPI_ALLGATHERV
• Variable count version of MPI_Allgather
  – creates "global data" from "local data"

• call MPI_ALLGATHERV (sendbuf, scount, sendtype, recvbuf, rcounts, displs, recvtype, comm, ierr)
  – sendbuf   choice  I  starting address of sending buffer
  – scount    I       I  number of elements sent to each process
  – sendtype  I       I  data type of elements of sending buffer
  – recvbuf   choice  O  starting address of receiving buffer
  – rcounts   I       I  integer array (of length group size) containing the number of elements
                         that are to be received from each process (array: size= PETOT)
  – displs    I       I  integer array (of length group size); entry i specifies the displacement
                         (relative to recvbuf) at which to place the incoming data from process i
                         (array: size= PETOT+1)
  – recvtype  I       I  data type of elements of receiving buffer
  – comm      I       I  communicator
  – ierr      I       O  completion code
Fortran
MPI Programming
112
MPI_ALLGATHERV (cont.)

• call MPI_ALLGATHERV (sendbuf, scount, sendtype, recvbuf, rcounts, displs, recvtype, comm, ierr)
  – rcounts   I  I  integer array (of length group size) containing the number of elements that
                    are to be received from each process (array: size= PETOT)
  – displs    I  I  integer array (of length group size); entry i specifies the displacement
                    (relative to recvbuf) at which to place the incoming data from process i
                    (array: size= PETOT+1)
  – These two arrays determine the size of the final "global data"; therefore each process needs the information in both arrays (rcounts, displs).
• Each process must have the same values for all components of both arrays.
  – Usually, stride(i)= rcounts(i)
[Layout of recvbuf: the block from PE#k has length rcounts(k)= stride(k);
 displs(1)= 0, displs(k+1)= displs(k) + stride(k),
 size(recvbuf)= displs(PETOT+1)= sum(stride)]
Fortran
MPI Programming
113
What MPI_Allgatherv is doing
[Diagram: the local vectors of PE#0-PE#3 (each of length N= rcounts(i)) are placed one after another in recvbuf; the block from PE#i starts at offset displs(i) and has length stride(i), and displs(5) marks the end of the global vector.]
Generating global data from local data
MPI Programming
Local Data: sendbuf Global Data: recvbuf
114
What MPI_Allgatherv is doing
[Diagram: same layout as on the previous page, emphasizing that stride(i)= rcounts(i) for each PE.]
Generating global data from local data
MPI Programming
Local Data: sendbuf Global Data: recvbuf
115
MPI_Allgatherv in detail (1/2)
• call MPI_ALLGATHERV (sendbuf, scount, sendtype, recvbuf, rcounts, displs, recvtype, comm, ierr)

• rcounts
  – Size of message from each PE: Size of Local Data (Length of Local Vector)
• displs
  – Address/index of each local data in the vector of global data
  – displs(PETOT+1)= Size of Entire Global Data (Global Vector)
[Layout diagram as on p.112: displs(1)= 0, displs(k+1)= displs(k) + stride(k), size(recvbuf)= displs(PETOT+1)= sum(stride)]
Fortran
MPI Programming
116
MPI_Allgatherv in detail (2/2)

• Each process needs the information of rcounts & displs
  – "rcounts" can be created by gathering the local vector length "N" from each process.
  – On each process, "displs" can be generated from "rcounts": stride(i)= rcounts(i).
  – The size of "recvbuf" is calculated by summation of "rcounts".
Fortran
MPI Programming
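Condensing the bullets above into code: this is essentially what agv.f does on the following pages, shown here only as a compact sketch (rcounts and displs are the allocatable integer arrays declared in agv.f):

      allocate (rcounts(PETOT), displs(PETOT+1))

!     gather the local length N from every process into rcounts
      call MPI_allGATHER ( N      , 1, MPI_INTEGER,                    &
     &                     rcounts, 1, MPI_INTEGER,                    &
     &                     MPI_COMM_WORLD, ierr)

!     displs by cumulative sum: displs(PETOT+1) = size of the global vector
      displs(1)= 0
      do ip= 1, PETOT
        displs(ip+1)= displs(ip) + rcounts(ip)
      enddo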
117
Preparation for MPI_Allgatherv
<$O-S1>/agv.f

• Generating the global vector from "a2.0"~"a2.3".
• The length of each vector is 8, 5, 7, and 3, respectively; therefore the size of the final global vector is 23 (= 8+5+7+3).
MPI Programming
118
a2.0~a2.3
PE#0 (a2.0):  8    101.0 103.0 105.0 106.0 109.0 111.0 121.0 151.0
PE#1 (a2.1):  5    201.0 203.0 205.0 206.0 209.0
PE#2 (a2.2):  7    301.0 303.0 305.0 306.0 311.0 321.0 351.0
PE#3 (a2.3):  3    401.0 403.0 405.0
(in each file, the first line is the number of components, followed by one value per line)
MPI Programming
119
Preparation: MPI_Allgatherv (1/4)
implicit REAL*8 (A-H,O-Z)
include 'mpif.h'

integer :: PETOT, my_rank, SOLVER_COMM, ierr
real(kind=8), dimension(:), allocatable :: VEC
real(kind=8), dimension(:), allocatable :: VEC2
real(kind=8), dimension(:), allocatable :: VECg
integer(kind=4), dimension(:), allocatable :: rcounts
integer(kind=4), dimension(:), allocatable :: displs
character(len=80) :: filename

call MPI_INIT      (ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, PETOT, ierr )
call MPI_COMM_RANK (MPI_COMM_WORLD, my_rank, ierr )

if (my_rank.eq.0) filename= 'a2.0'
if (my_rank.eq.1) filename= 'a2.1'
if (my_rank.eq.2) filename= 'a2.2'
if (my_rank.eq.3) filename= 'a2.3'

open (21, file= filename, status= 'unknown')
read (21,*) N
allocate (VEC(N))
do i= 1, N
  read (21,*) VEC(i)
enddo
<$O-S1>/agv.f
MPI Programming
N (NL) is different at each process
Fortran
120
Preparation: MPI_Allgatherv (2/4)
allocate (rcounts(PETOT), displs(PETOT+1))
rcounts= 0
write (*,'(a,10i8)') "before", my_rank, N, rcounts

call MPI_allGATHER ( N      , 1, MPI_INTEGER,                    &
     &               rcounts, 1, MPI_INTEGER,                    &
     &               MPI_COMM_WORLD, ierr)

write (*,'(a,10i8)') "after ", my_rank, N, rcounts
displs(1)= 0
[Diagram: PE#0 has N=8, PE#1 has N=5, PE#2 has N=7, PE#3 has N=3; after MPI_Allgather, every PE holds rcounts(1:4)= {8, 5, 7, 3}.]
<$O-S1>/agv.f
MPI Programming
Fortran
Rcounts on each PE
121
Preparation: MPI_Allgatherv (2/4)
allocate (rcounts(PETOT), displs(PETOT+1))
rcounts= 0
write (*,'(a,10i8)') "before", my_rank, N, rcounts

call MPI_allGATHER ( N      , 1, MPI_INTEGER,                    &
     &               rcounts, 1, MPI_INTEGER,                    &
     &               MPI_COMM_WORLD, ierr)

write (*,'(a,10i8)') "after ", my_rank, N, rcounts

displs(1)= 0
do ip= 1, PETOT
  displs(ip+1)= displs(ip) + rcounts(ip)
enddo

write (*,'(a,10i8)') "displs", my_rank, displs

call MPI_FINALIZE (ierr)

stop
end
<$O-S1>/agv.f
MPI Programming
Fortran
Rcounts on each PE
Displs on each PE
122
Preparation: MPI_Allgatherv (3/4)

>$ cd /work/gt37/t37XXX/pFEM/mpi/S1
>$ mpiifort –O3 agv.f
(modify go4.sh for 4 processes)
>$ pjsub go4.sh
before       0       8       0       0       0       0
after        0       8       8       5       7       3
displs       0       0       8      13      20      23
FORTRAN STOP

before       1       5       0       0       0       0
after        1       5       8       5       7       3
displs       1       0       8      13      20      23
FORTRAN STOP

before       3       3       0       0       0       0
after        3       3       8       5       7       3
displs       3       0       8      13      20      23
FORTRAN STOP

before       2       7       0       0       0       0
after        2       7       8       5       7       3
displs       2       0       8      13      20      23
FORTRAN STOP
write (*,'(a,10i8)') "before", my_rank, N, rcounts
write (*,'(a,10i8)') "after ", my_rank, N, rcounts
write (*,'(a,10i8)') "displs", my_rank, displs
MPI Programming
123
Preparation: MPI_Allgatherv (4/4)
• Only "recvbuf" is not defined yet.
• Size of "recvbuf" = "displs(PETOT+1)"

call MPI_allGATHERv                                              &
     ( VEC    , N, MPI_DOUBLE_PRECISION,                         &
       recvbuf, rcounts, displs, MPI_DOUBLE_PRECISION,           &
       MPI_COMM_WORLD, ierr)
MPI Programming
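A hedged completion of this last step: the receiving buffer can be allocated with displs(PETOT+1) entries and passed to the call. The VECg array declared in agv.f (1/4) is used here as recvbuf; the final write statement is only illustrative.

      allocate (VECg(displs(PETOT+1)))

      call MPI_allGATHERv ( VEC , N, MPI_DOUBLE_PRECISION,             &
     &                      VECg, rcounts, displs,                     &
     &                      MPI_DOUBLE_PRECISION,                      &
     &                      MPI_COMM_WORLD, ierr)

!     every process now holds the global vector (length 23 in this example)
      if (my_rank.eq.0) write (*,'(10f8.1)') VECg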
124
Report S1 (1/2)
• Deadline: January 28th (Tue), 2020, 17:00
  – Send files via e-mail at nakajima(at)cc.u-tokyo.ac.jp

• Problem S1-1
  – Read the local files <$O-S1>/a1.0~a1.3 and <$O-S1>/a2.0~a2.3.
  – Develop codes which calculate the norm ||x||2 of the global vector for each case.
    • <$O-S1>/file.f, <$O-S1>/file2.f

• Problem S1-2
  – Read the local files <$O-S1>/a2.0~a2.3.
  – Develop a code which constructs the "global vector" using MPI_Allgatherv.
MPI Programming
125
Report S1 (2/2)

• Problem S1-3
  – Develop a parallel program which calculates the following numerical integration by the "trapezoidal rule", using MPI_Reduce, MPI_Bcast, etc.
  – Measure the computation time and the parallel performance.
$$\int_0^1 \frac{4}{1+x^2}\, dx$$
• Report
  – Cover Page: Name, ID, and Problem ID (S1) must be written.
  – Less than two pages including figures and tables (A4) for each of the three sub-problems
    • Strategy, Structure of the Program, Remarks
  – Source list of the program (if you have bugs)
  – Output list (as small as possible)
MPI Programming
126
Options for Optimization
General Option (-O3)
$ mpiifort -O3 test.f
$ mpiicc   -O3 test.c

Special Options for AVX512, NOT Necessarily Fast …
$ mpiifort -align array64byte -O3 -axCORE-AVX512 test.f
$ mpiicc   -align -O3 -axCORE-AVX512 test.c
127
Process Number
[Node diagram: two sockets of Intel® Xeon® Platinum 8280 (Cascade Lake, CLX), 2.7 GHz, 28 cores and 2.419 TFLOPS each; Socket #0 holds the 0th-27th cores, Socket #1 the 28th-55th cores; each socket has 96 GB of DDR4-2933 memory on 6 channels (140.8 GB/sec); the two sockets are connected by Ultra Path Interconnect (UPI), 10.4 GT/sec ×3 = 124.8 GB/sec.]
#PJM -L node=1;#PJM --mpi proc= 1 1-node, 1-proc, 1-proc/n
#PJM -L node=1;#PJM --mpi proc= 4 1-node, 4-proc, 4-proc/n
#PJM -L node=1;#PJM --mpi proc=16 1-node,16-proc,16-proc/n
#PJM -L node=1;#PJM --mpi proc=28 1-node,28-proc,28-proc/n
#PJM -L node=1;#PJM --mpi proc=56 1-node,56-proc,56-proc/n
#PJM -L node=4;#PJM --mpi proc=128 4-node,128-proc,32-proc/n
#PJM -L node=8;#PJM --mpi proc=256 8-node,256-proc,32-proc/n
#PJM -L node=8;#PJM --mpi proc=448 8-node,448-proc,56-proc/n
128
a01.sh: Use 1-core (0th)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=1
#PJM --mpi proc=1
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
129
a24.sh: Use 24-cores (0th-23rd)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=1
#PJM --mpi proc=24
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
130
a48.sh: Use 48-cores (0th-23rd, 28th-51st)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=1
#PJM --mpi proc=48
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
131
b48.sh: Use 8x48-cores (0th-23rd, 28th-51st)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=384        384÷8= 48-cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
132
b48b.sh: Use 8x48-cores (0th-23rd, 28th-51st)

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=384        384÷8= 48-cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} numactl -l ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
133
NUMA Architecture
• Each Node of Oakbridge-CX (OBCX)
  – 2 Sockets (CPUs) of Intel CLX
  – Each socket has 28 cores
• Each core of a socket can access the memory on the other socket: NUMA (Non-Uniform Memory Access)
  – numactl –l : only local memory is used
  – Sometimes (not always), execution is more stable with this option
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
134
Use 8x48-cores, 48-cores are randomly selected from 56-cores

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=384        384÷8= 48-cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
export I_MPI_PIN_PROCESSOR_LIST=0-23,28-51
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores
135
Use 8x56-cores

#!/bin/sh
#PJM -N "test"
#PJM -L rscgrp=lecture
#PJM -L node=8
#PJM --mpi proc=448        448÷8= 56-cores/node
#PJM -L elapse=00:15:00
#PJM -g gt37
#PJM -j
#PJM -e err
#PJM -o test.lst
mpiexec.hydra -n ${PJM_MPI_PROC} ./a.out
Socket #0: 0th-27th Cores / Socket #1: 28th-55th Cores