NORA/Clusters AMANO, Hideharu Textbook pp. 140-147.
NORA (No Remote Access Memory Model)

- No hardware shared memory.
- Data exchange is done by messages (or packets).
- A dedicated synchronization mechanism is provided.
- High peak performance.
- Message passing libraries (MPI, PVM) are provided.
Message passing (blocking: rendezvous)

[Figure: the sender's Send and the receiver's Receive both block until the two sides meet; the transfer happens at the rendezvous.]
Message passing (with buffer)

[Figure: Send copies the message into a buffer and returns; Receive takes it from the buffer, so the sender does not wait for the receiver.]
Message passing (non-blocking)

[Figure: Send returns immediately and the sender can do another job while the message is delivered to Receive.]
PVM (Parallel Virtual Machine)

- A buffer is provided for the sender.
- Both blocking and non-blocking receive are provided.
- Barrier synchronization.
MPI (Message Passing Interface)

- A superset of PVM for point-to-point communication.
- Group communication.
- Various communication patterns are supported.
- Error checking with communication tags.
- Details will be introduced later.
Programming style using MPI

- SPMD (Single Program Multiple Data streams)
  - Multiple processes execute the same program.
  - Each process does independent work based on its process number.
- Program execution using MPI
  - The specified number of processes are generated.
  - They are distributed to the nodes of the NORA machine or PC cluster.
Communication methods

- Point-to-point communication
  - A sender and a receiver execute functions for sending and receiving.
  - Each send must be strictly matched with a receive.
- Collective communication
  - Communication between multiple processes.
  - The same function is executed by all participating processes.
  - Can be replaced with a sequence of point-to-point communications, but is sometimes more efficient.
Fundamental MPI functions

Most programs can be described using six fundamental functions:

- MPI_Init() … MPI initialization
- MPI_Comm_rank() … get the process number (rank)
- MPI_Comm_size() … get the total number of processes
- MPI_Send() … send a message
- MPI_Recv() … receive a message
- MPI_Finalize() … MPI termination
Other MPI functions

- Functions for measurement
  - MPI_Barrier() … barrier synchronization
  - MPI_Wtime() … get the wall-clock time
- Non-blocking functions
  - Consist of a communication request and a completion check.
  - Other calculation can be executed while waiting.
An Example

```c
#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    }
    else {
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}
```
Initialize and Terminate

```c
int MPI_Init(
    int  *argc,   /* pointer to argc */
    char ***argv  /* pointer to argv */
);
```

```fortran
mpi_init(ierr)
    integer ierr   ! return code
```

The command-line arguments must be passed directly to argc and argv.

```c
int MPI_Finalize();
```

```fortran
mpi_finalize(ierr)
    integer ierr   ! return code
```
Communicator functions

MPI_Comm_rank returns the rank (process ID) of the calling process in the communicator comm:

```c
int MPI_Comm_rank(
    MPI_Comm comm,  /* communicator */
    int *rank       /* process ID (output) */
);
```

```fortran
mpi_comm_rank(comm, rank, ierr)
    integer comm, rank
    integer ierr   ! return code
```

MPI_Comm_size returns the total number of processes in the communicator comm:

```c
int MPI_Comm_size(
    MPI_Comm comm,  /* communicator */
    int *size       /* number of processes (output) */
);
```

```fortran
mpi_comm_size(comm, size, ierr)
    integer comm, size
    integer ierr   ! return code
```

Communicators are used for sharing a communication space among a subset of processes. MPI_COMM_WORLD is the pre-defined communicator containing all processes.
MPI_Send

It sends data to process dest:

```c
int MPI_Send(
    void *buf,              /* send buffer */
    int count,              /* # of elements to send */
    MPI_Datatype datatype,  /* datatype of elements */
    int dest,               /* destination (receiver) process ID */
    int tag,                /* tag */
    MPI_Comm comm           /* communicator */
);
```

```fortran
mpi_send(buf, count, datatype, dest, tag, comm, ierr)
    <type> buf(*)
    integer count, datatype, dest, tag, comm
    integer ierr   ! return code
```

Tags are used for identification of messages.
MPI_Recv

```c
int MPI_Recv(
    void *buf,              /* receive buffer */
    int count,              /* # of elements to receive */
    MPI_Datatype datatype,  /* datatype of elements */
    int source,             /* source (sender) process ID */
    int tag,                /* tag */
    MPI_Comm comm,          /* communicator */
    MPI_Status *status      /* status (output) */
);
```

```fortran
mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
    <type> buf(*)
    integer count, datatype, source, tag, comm, status(mpi_status_size)
    integer ierr   ! return code
```

The same tag as the sender's must be passed to MPI_Recv. Pass a pointer to an MPI_Status variable; it is a structure with three members, MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the process ID of the sender, the tag, and the error code.
datatype and count

The size of the message is determined by count and datatype:

- MPI_CHAR … char
- MPI_INT … int
- MPI_FLOAT … float
- MPI_DOUBLE … double
- … etc.
Compile and Execution

```
% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)
```
Shared memory model vs. message passing model

- Shared memory: benefits
  - A distributed OS is easy to implement.
  - Automatic parallelizing compilers.
  - OpenMP.
- Message passing: benefits
  - Formal verification is easy (blocking).
  - No side effects (a shared variable is a side effect in itself).
  - Small cost.
OpenMP

```c
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}
```

- Multiple threads are generated by the pragma.
- Variables declared globally can be shared.
A convenient pragma for parallel execution:

```c
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The assignment of iterations to threads is automatically adjusted so that the load of each thread becomes even.
Automatic parallelizing compilers

- Automatically translate code written for uniprocessors into code for multiprocessors.
- Loop-level parallelism is the main target of parallelization.
- Fortran codes have been the main targets:
  - No pointers.
  - The array structure is simple.
- Recently, restricted C has become a target language: OSCAR compiler (Waseda Univ.), COINS.
PC Clusters

- Multicomputers
  - Dedicated hardware (CPU, network).
  - High performance but expensive.
  - Hitachi's SR8000, Cray T3E, etc.
- WS/PC clusters
  - Use standard CPU boards.
  - High performance/cost.
  - The standard bus often forms a bottleneck.
- Beowulf clusters
  - Standard CPU boards, standard components.
  - LAN + TCP/IP.
  - Free software.
  - A cluster with a standard System Area Network (SAN) like Myrinet is often called a Beowulf cluster.
PC clusters in supercomputing

Clusters occupied more than 80% of the Top 500 supercomputers in 2008/11. Let's check http://www.top500.org.
SAN (System Area Network) for PC clusters

- Virtual cut-through routing.
- High throughput / low latency.
- Out of the cabinet, but within the floor.
- Also used for connecting disk subsystems (in that role the acronym also stands for Storage Area Network).
- Examples: Infiniband, Myrinet, Quadrics.
- In contrast, GbEthernet / 10Gb Ethernet use store-and-forward switching and tree-based topologies.
SAN vs. Ethernet
Infiniband

- Point-to-point direct serial interconnection.
- Uses the 8b/10b code.
- Various types of topologies can be supported.
- Multicasting and atomic transactions are supported.
- Maximum throughput:

| Link | SDR | DDR | QDR |
|------|-----|-----|-----|
| 1X   | 2 Gbit/s  | 4 Gbit/s  | 8 Gbit/s  |
| 4X   | 8 Gbit/s  | 16 Gbit/s | 32 Gbit/s |
| 12X  | 24 Gbit/s | 48 Gbit/s | 96 Gbit/s |
Remote DMA (user level)

[Figure: on the local node, a conventional transfer goes from the user's buffer through a system call to a kernel agent and the host interface; with user-level RDMA, the network interface's protocol engine moves data directly from the sender's buffer (data source) to the buffer on the remote node's network interface (data sink), bypassing the kernel.]
PC Clusters with Myrinet (RWCP)
RHiNET Cluster

- Node
  - CPU: Pentium III 933MHz
  - Memory: 1GByte
  - PCI bus: 64bit/66MHz
  - OS: Linux kernel 2.4.18
  - SCore: version 5.0.1
- RHiNET-2 with 64 nodes
- Network: optical → Gigabit Ethernet
Cluster of supercomputers

- The recent trend in supercomputers is to connect powerful components with high-speed SANs such as Infiniband.
- Roadrunner by IBM: general-purpose CPUs + Cell BE.
Grid Computing

- Supercomputers connected to the Internet can be treated virtually as a single big supercomputer.
- Middleware or toolkits are used to manage it: the Globus Toolkit.
- GEO (Global Earth Observation Grid), AIST, Japan.
Contest/Homework

Assume that there is an array x[100]. Write MPI code that computes the sum of all elements of x with 8 processes.