
Lecture 2: Part II
Message Passing Programming: MPI

Outline: Introduction to MPI, MPI programming, Running an MPI program, Architecture of MPICH

Message Passing Interface (MPI)

What is MPI?

A message passing library specification:
– a message-passing model
– not a compiler specification
– not a specific product

For parallel computers, clusters and heterogeneous networks.

Full-featured

Why use MPI? (1)

Message passing is now mature as a programming paradigm:
– well understood
– efficient match to hardware
– many applications

Why use MPI? (2)

Full range of desired features:
– modularity
– access to peak performance
– portability
– heterogeneity
– subgroups
– topologies
– performance measurement tools

Who Designed MPI?

Vendors:
– IBM, Intel, TMC, SGI, Meiko, Cray, Convex, Ncube, …

Library writers:
– PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda, DP (HKU), PM (Japan), AM (Berkeley), FM (HPVM at Illinois)

Application specialists and consultants


Vendor-Supported MPI

HP-MPI       Hewlett Packard; Convex SPP
MPI-F        IBM SP1/SP2
Hitachi/MPI  Hitachi
SGI/MPI      SGI PowerChallenge series
MPI/DE       NEC
INTEL/MPI    Intel Paragon (iCC lib)
T.MPI        Telmat Multinode
Fujitsu/MPI  Fujitsu AP1000
EPCC/MPI     Cray & EPCC, T3D/T3E


Public-Domain MPI

MPICH     Argonne National Lab. & Mississippi State Univ.
LAM       Ohio Supercomputer Center
MPICH/NT  Mississippi State University
MPI-FM    Illinois (Myrinet)
MPI-AM    UC Berkeley (Myrinet)
MPI-PM    RWCP, Japan (Myrinet)
MPI-CCL   California Institute of Technology

Public-Domain MPI

CRI/EPCC MPI  Cray Research and Edinburgh Parallel Computing Centre (Cray T3D/E)
MPI-AP        Australian National University, CAP Research Program (AP1000)
W32MPI        Illinois, Concurrent Systems
RACE-MPI      Hughes Aircraft Co.
MPI-BIP       INRIA, France (Myrinet)

Communicator Concept in MPI

Identifies the process group and context with respect to which an operation is to be performed.

[Figure: a collection of processes grouped into communicators]

Communicator (2)

– Four communicators
– Processes in different communicators cannot communicate

[Figure: processes partitioned into four disjoint communicators]

Communicator within Communicator

– The same process can exist in different communicators

[Figure: a communicator nested inside another; shared processes belong to both]

Features of MPI (1)

General:
– Communicators combine context and group for message security

Features of MPI (2)

Point-to-point communication:
– Structured buffers and derived data types; heterogeneity
– Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered

Features of MPI (3)

Collective communication:
– Both built-in and user-defined collective operations
– Large number of data movement routines
– Subgroups defined directly or by topology
– E.g., broadcast, barrier, reduce, scatter, gather, all-to-all, …

MPI Programming

Writing MPI programs

– MPI comprises 125 functions
– Many parallel programs can be written with just 6 basic functions

Six basic functions (1)

MPI_INIT: initiate an MPI computation

MPI_FINALIZE: terminate a computation

Six basic functions (2)

MPI_COMM_SIZE: determine the number of processes in a communicator

MPI_COMM_RANK: determine the identifier of a process in a specific communicator

Six basic functions (3)

MPI_SEND: send a message from one process to another

MPI_RECV: receive a message sent by another process

program main
begin
  MPI_INIT()
  MPI_COMM_SIZE(MPI_COMM_WORLD, count)
  MPI_COMM_RANK(MPI_COMM_WORLD, myid)
  print("I am ", myid, " of ", count)
  MPI_FINALIZE()
end

A simple program

MPI_INIT(): initiate the computation
MPI_COMM_SIZE(MPI_COMM_WORLD, count): find the number of processes
MPI_COMM_RANK(MPI_COMM_WORLD, myid): find the process ID of the current process
print("I am ", myid, " of ", count): each process prints its output
MPI_FINALIZE(): shut down

Result

I am 3 of 4
I am 0 of 4
I am 1 of 4
I am 2 of 4

(One line per process; the output order is arbitrary.)
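For reference, the same program in actual C (a minimal sketch; the print format string is our own choice, since MPI does not prescribe one):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int count, myid;

    MPI_Init(&argc, &argv);                 /* initiate the computation */
    MPI_Comm_size(MPI_COMM_WORLD, &count);  /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* rank of this process */
    printf("I am %d of %d\n", myid, count);
    MPI_Finalize();                         /* shut down */
    return 0;
}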

Point-to-Point Communication

The basic point-to-point communication operators are send and receive.

[Figure: the sender's buffer is carried by a transmission into the receiver's buffer: Send → Transmission → Receive]

Another simple program (2 nodes)

……
MPI_COMM_RANK(MPI_COMM_WORLD, myid)
if myid = 0
  MPI_SEND("Zero", …, …, 1, …, …)
  MPI_RECV(words, …, …, 1, …, …, …)
else
  MPI_RECV(words, …, …, 0, …, …, …)
  MPI_SEND("One", …, …, 0, …, …)
END IF
print("Received from ", words)
……

I'm process 0!
  MPI_SEND("Zero", …, …, 1, …, …)
  MPI_RECV(words, …, …, 1, …, …, …)

I'm process 1!
  MPI_RECV(words, …, …, 0, …, …, …)
  MPI_SEND("One", …, …, 0, …, …)

[Animation: Process 0 sends "Zero" to process 1, which has set up its buffer (words) and is waiting for the message from process 0. Once "Zero" has been received, the roles reverse: process 1 sends "One" back while process 0 waits in its own MPI_RECV. After the exchange completes, each process executes print("Received from ", words).]

Result

Process 0 prints: Received from One
Process 1 prints: Received from Zero
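A C sketch of this exchange, filling the slide's elided arguments with plausible values (the message tag 0 and the 16-byte buffer are assumptions, not part of the original slide):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid;
    char words[16];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        /* process 0: send first, then wait for the reply */
        MPI_Send("Zero", 5, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(words, 16, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
    } else {
        /* process 1: receive first, then send back */
        MPI_Recv(words, 16, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send("One", 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    printf("Received from %s\n", words);
    MPI_Finalize();
    return 0;
}

Run with mpirun -np 2 a.out; the pairing assumes exactly two processes, since any rank above 1 would block forever in MPI_Recv.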

Collective Communication (1)

Communication that involves a group of processes

[Figure: the sender's buffer is transmitted to the buffers of several receivers]

Collective Communication (2)

Three types:
– Barrier: MPI_BARRIER
– Data movement: MPI_BCAST, MPI_GATHER, MPI_SCATTER
– Reduction operations: MPI_REDUCE

Barrier

MPI_BARRIER is used to synchronize the execution of a group of processes.

[Cartoon: processes that reach the barrier early must wait ("We can't go on!"); once every process has arrived ("We're together!") the barrier disappears and all continue ("Let's go!").]
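In C the call takes only a communicator:

  MPI_Barrier(MPI_COMM_WORLD);  /* returns only after every process in the communicator has entered the barrier */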

Data movement (1)

MPI_BCAST: a single process sends the same data to all other processes, itself included.

[Figure: before the call, only process 0 holds "FACE"; after BCAST, processes 0 through 3 all hold "FACE".]

Data movement (2)

MPI_GATHER: all processes (including the root process) send data to one process, which stores it in rank order.

[Figure: processes 0 to 3 hold "F", "A", "C", "E" respectively; after GATHER, the root holds "FACE".]

Data movement (3)

MPI_SCATTER: a process sends out a message, which is split into several equal parts, and the ith portion is sent to the ith process.

[Figure: the root holds "FACE"; after SCATTER, processes 0 to 3 hold "F", "A", "C", "E" respectively.]
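A C sketch combining the two data-movement slides above, using the figure's "FACE" example (the root rank 0 and the one-character pieces come from the figure; the program assumes exactly four processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char word[4] = {'F', 'A', 'C', 'E'};  /* significant only at the root */
    char mine;                            /* this process's one-character share */
    char gathered[4];                     /* filled in rank order at the root */
    int myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* root splits word into equal parts; the ith part goes to process i */
    MPI_Scatter(word, 1, MPI_CHAR, &mine, 1, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* every process (root included) sends its part back; the root stores
       the pieces in rank order, reassembling "FACE" */
    MPI_Gather(&mine, 1, MPI_CHAR, gathered, 1, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("%.4s\n", gathered);       /* prints FACE when run with -np 4 */

    MPI_Finalize();
    return 0;
}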

Data movement (4)

MPI_REDUCE (e.g., find the maximum value): combines the values of each process, using a specified operation, and returns the combined value to a process.

[Figure: processes 0 to 3 hold 8, 9, 3, 7; after REDUCE with the max operation, the root receives 9.]
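The figure's maximum example in C might look like this (the per-process values 8, 9, 3, 7 come from the figure; MPI_MAX is one of MPI's predefined reduction operations):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int values[4] = {8, 9, 3, 7};  /* one value per process, as in the figure */
    int myid, mine, best;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    mine = values[myid % 4];

    /* combine every process's value with MPI_MAX; the result lands on rank 0 */
    MPI_Reduce(&mine, &best, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("max = %d\n", best);  /* prints 9 when run with -np 4 */

    MPI_Finalize();
    return 0;
}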

Example program (1)

Calculating the value of π by:

  π = ∫₀¹ 4 / (1 + x²) dx

Example program (2)

……
MPI_BCAST(numprocs, …, …, 0, …)
for (i = myid + 1; i <= n; i += numprocs)
  compute the area for each interval
  accumulate the result in the process's program data (sum)
MPI_REDUCE(&sum, …, …, …, MPI_SUM, 0, …)
if (myid == 0)
  output result
……

Broadcast the number of processes:
  MPI_BCAST(numprocs, …, …, 0, …)

Each process calculates its assigned areas:
  for (i = myid + 1; i <= n; i += numprocs)
    compute the area for each interval
    accumulate the result in the process's program data (sum)

Sum up all the areas:
  MPI_REDUCE(&sum, …, …, …, MPI_SUM, 0, …)

Print the result:
  if (myid == 0)
    output result

[Figure: the intervals under the curve are assigned cyclically, so consecutive intervals are calculated by processes 0, 1, 2, 3 in turn; once every process reports its partial sum, the root prints π = 3.141…]
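Filling in the elided arguments gives something close to the classic cpi example distributed with MPICH. The following is a hedged reconstruction: the interval count n = 10000 and the midpoint evaluation are our assumptions, and where the slide broadcasts the process count, the stock cpi program broadcasts the interval count n, which we follow here.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int n = 10000, myid, numprocs, i;
    double h, sum, x, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* root broadcasts the number of intervals to every process */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h = 1.0 / (double)n;                        /* width of one interval */
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) { /* cyclic distribution */
        x = h * ((double)i - 0.5);              /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);             /* height at that midpoint */
    }
    mypi = h * sum;                             /* this process's partial area */

    /* sum the partial results onto process 0 */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}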

MPICH - A Portable Implementation of MPI

Argonne National Laboratory

What is MPICH?

– The first complete and portable implementation of the full MPI standard.
– “CH” stands for “Chameleon”, symbol of adaptability and portability.
– It contains a programming environment for working with MPI programs.
– It includes a portable startup mechanism and libraries.

How can I install it?

– Unpack the package mpich.tar.gz into a directory.
– Use ./configure and make >& make.log to choose the appropriate architecture and device and to compile the files.
  Syntax: ./configure -device=DEVICE -arch=ARCH_TYPE
  • ARCH_TYPE: specifies the type of machine to be configured
  • DEVICE: specifies what kind of communication device the system will use, e.g. ch_p4 (TCP/IP)
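For example, a build for a Sun workstation cluster communicating over TCP/IP (using only the architecture and device names that appear in these slides) might look like:

  ./configure -device=ch_p4 -arch=sun4
  make >& make.log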

How to run an MPI Program

1 Edit mpich/util/machines/machines.XXXX to contain the names of machines of architecture xxxx. For example, with the computers mercury, venus, mars and earth, the file should be in the format:

mercury
venus
earth
mars
earth
mars

How to run an MPI Program

2 Include “mpi.h” in the source program.
3 Compile the program by using the command ‘mpicc’, e.g. mpicc -c foo.c
4 Use ‘mpirun’ to run the MPI program; mpirun will determine the environment for the program to run.

How to run an MPI Program

mpirun -np 4 a.out
– runs a.out as four processes (e.g. on a massively parallel processor)

mpirun -arch sun4 -np 2 -arch rs6000 -np 3 program
– runs the program on 2 sun4s and 3 rs6000s, with the local machine being a sun4 (multiple architectures)

MPIRUN (1)

How do you start an MPI program? Use mpirun. Example:
– #mpirun -np 4 cpi
– starts four processes of cpi

MPIRUN (2) What does mpirun do?

– 1. Reads the arguments to specify the environment of the MPI program:
  i) how many processes should be started
  ii) on which machines the MPI program will be started
  iii) what device will be used (e.g. ch_p4)
– 2. Distributes the processes among the machines on which they will run
– 3. Records the distribution in the PI???? file

MPIRUN(3)

Example

Suppose the ch_p4 device is used:
– #mpirun -np 4 cpi

1. mpirun knows 4 processes need to be started
2. mpirun reads the machines file to find which machines can be used
3. The ch_p4 device will be used if no device argument is given in the command

MPIRUN (4)

4. Splits the tasks and saves the assignment in the PI???? file

File format:

<hostname> <no. of proc.> <program>

genius.cs.hku.hk 0 cpi

eagle.cs.hku.hk 1 cpi

dragon.cs.hku.hk 1 cpi

virtue.cs.hku.hk 1 cpi

5. Starts the processes on remote machines by using “rsh”

Architecture of MPICH

[Figure: Structure of MPICH. At the top is the MPI portable API library; below it, the MPICH abstract device interface; below that, the MPICH channel interface; at the bottom, the low-level layers: sockets/TCP/IP, shared memory, and vendor designs.]

MPICH - Abstract Device Interface

Interface between high-level MPI and low-level device.

Manages message packaging and buffering policies, and handles heterogeneous communication.

4 sets of functions:
– 1. Specify send or receive of a message.
– 2. Data movement between API and hardware.
– 3. Manage lists of pending messages.
– 4. Provide information about the execution environment.

MPICH - The Channel Interface (1)

The interface transfers data from one process's address space to another's.

Information is divided into two parts: the message envelope and the data.

It includes five functions:
• MPID_SendControl, MPID_RecvAnyControl, MPID_ControlMsgAvail: envelope information
• MPID_SendChannel, MPID_RecvFromChannel: data information

MPICH - The Channel Interface (2)

The channel interface adopts a data exchange mechanism according to the size of the message.

Data exchange mechanisms implemented:
– Short, Eager, Rendezvous, Get

Protocol - Short

This mechanism handles the smallest messages.

The data is delivered within the message envelope.

[Figure: Short protocol data transfer. Each control message carries its data inside the envelope; arriving messages are stored in a buffer until a matching MPI_Recv is posted.]

Protocol - Eager

– Data is sent to the destination immediately.
– The receiver must allocate some space to store the data locally.
– It is the default choice in MPICH.
– It is not suitable for large amounts of data transfer.

[Figure: Eager protocol data transfer. Each MPI_Control message arrives together with its data, which is saved in a receiver-side buffer until MPI_Recv is called; a stream of large messages can exhaust the buffer ("Buffer Full!!!").]

Protocol - Rendezvous

– Data is sent to the destination only when requested.
– To use it, add -use_rndv to the ‘./configure’ command.
– No buffering is required.

Rendezvous Protocol Data Transfer

[Figure: Rendezvous protocol data transfer. The sender's MPI_Control message waits at the receiver until a matching MPI_Recv is posted; the receiver then replies with an MPI_Request, and only after this match is the data itself transferred.]

Protocol - Get

– In this protocol, data is read directly by the receiver.
– Data is transferred directly from one process's memory to another.
– Highest performance, but it requires shared memory or remote memory operations.

Get Protocol Data Transfer

[Figure: Get protocol data transfer. The receiver directly accesses the sender's shared memory and copies the data from it into its own memory.]

Conclusion

MPI-1.1 (June 1995) does not provide:
– process management
– remote memory transfers
– active messages
– threads
– virtual shared memory

MPI-2 (July 1997) extends MPI with:
– process creation and management
– one-sided communications
– extended collective operations
– external interfaces
– I/O
– additional language bindings