Parallel Computing/Programming using MPI


Page 1: Parallel Computing/Programming using MPI

Parallel Computing/Programming using MPI

by

R Phillip Bording
CSC, Inc

Center of Excellence in High Performance Computing

February 3, 2004

Page 2: Parallel Computing/Programming using MPI

Table of Contents

• Introduction

• Program Structure

• Defining Parallel Computing

• Domain Decomposition

Page 3: Parallel Computing/Programming using MPI

Program Structure

• How do we define parallel computing?
– Running a program using more than one processor

• Two interesting choices exist, among others

– Single Program using Multiple machines, different Data -> SPMD

– Multiple Programs using Multiple machines, different Data -> MPMD

Page 4: Parallel Computing/Programming using MPI

MPMD Parallel Computing

• The MPMD model has multiple source codes

• In this model each computer has a different code and different data

• The user is responsible for the program structure – which behaves differently on each machine

• Typically each machine passes data to the next one, a daisy chain

Page 5: Parallel Computing/Programming using MPI

MPMD Parallel Computing

[Figure: Program 0, Program 1, and Program 2, each a different executable on a different machine, passing data along a chain.]

Multiple Programs - Multiple Data

Page 6: Parallel Computing/Programming using MPI

SPMD Parallel Computing

• The SPMD model has a single source code

• In the cluster model each computer has a copy of this code; every copy is identical

• The user is responsible for the program structure – which can behave differently on each machine

Page 7: Parallel Computing/Programming using MPI

SPMD Parallel Computing

• Other versions of the SPMD model exist

• In the cluster model each processor has a unique id number, called rank.

• The programmer can test for rank and modify the program function as needed.

Single Program – Multiple Data
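To make the rank test above concrete, here is a minimal hedged sketch (not from the original slides; the messages are illustrative) of one source code behaving differently per rank:

      Include 'mpif.h'
      Integer rank, ierr
      Call MPI_INIT(ierr)
C     Every copy of the program asks for its own rank
      Call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      If (rank .eq. 0) Then
C        Only the rank-0 copy takes the coordinator role
         Write(6,*) 'Rank 0: coordinating'
      Else
C        Every other copy does worker computation
         Write(6,*) 'Rank ', rank, ': computing'
      Endif
      Call MPI_FINALIZE(ierr)
      End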

Page 8: Parallel Computing/Programming using MPI

Rank and Structure

• Each processor has a name or rank
– Rank = name = number
» identification

• Processor organization
– Structure = Communicator
» The communication chain

How the problem communicates defines the needed structure!

Page 9: Parallel Computing/Programming using MPI

Processor Rank
• 16 Processors – Rank 0 to 15

[Figure: the 16 processors drawn as a row of boxes labeled 0, 1, ..., 5, 6, ..., 15.]

Realize that the rank position is relative and could be structured differently as needed.

Page 10: Parallel Computing/Programming using MPI

Processor Rank with Structure
• 16 Processors – Rank 0 to 15

[Figure: the same 16 processors, now arranged as a 4 x 4 grid:]

12 13 14 15
 8  9 10 11
 4  5  6  7
 0  1  2  3

Realize that the rank position is relative and could be structured differently as needed.

Page 11: Parallel Computing/Programming using MPI

SPMD Parallel Computing

• Simple Code Example - almost correct

      Include 'mpif.h'
      Integer Rank, return_code
      Dimension Psi(0:100)
      Call MPI_INIT(return_code)
C     The slide's MPI_Rank is really MPI_COMM_RANK on MPI_COMM_WORLD
      Call MPI_COMM_RANK(MPI_COMM_WORLD, Rank, return_code)
      Write(6,*) Rank
      Do i = 0, Rank
         Psi(i) = i
      Enddo
      Write(6,*) (Psi(i), i = 0, Rank)
C     The slide's MPI_finish is really MPI_FINALIZE
      Call MPI_FINALIZE(return_code)
      End

Page 12: Parallel Computing/Programming using MPI

SPMD Parallel Computing

• Simple Code Example - almost correct
– Assuming four parallel processes
– The output looks like this:

0
0.0
2
0.0,1.0,2.0
3
0.0,1.0,2.0,3.0
1
0.0,1.0

MPI has no standard for the sequence of appearance in output streams

Page 13: Parallel Computing/Programming using MPI

SPMD Parallel Computing

We’ll get back to MPI coding after we figure out how we are going to do the domain decomposition.

The Omega Domain

[Figure: the domain Ω divided into subdomains Ω0, Ω1, and Ω2.]

Page 14: Parallel Computing/Programming using MPI

Domain Decomposition

• Subdivision of problem domain into parallel regions

• Example using 2 dimensional data arrays

• Linear One Dimension versus

• Grid of Two Dimensions

Page 15: Parallel Computing/Programming using MPI

Single Processor Memory Arrays, Nx by Ny

Dimension Array (Nx,Ny)

Page 16: Parallel Computing/Programming using MPI

Multiple Processor Memory Arrays, Nx/2 by Ny/2
4 Processors

Two way decomposition

Page 17: Parallel Computing/Programming using MPI

Multiple Processor Memory Arrays, Nx by Ny/3
3 Processors

One way decomposition

Page 18: Parallel Computing/Programming using MPI

Multiple Processor Memory Arrays, Nx/3 by Ny
3 Processors

One way decomposition – the other way

Page 19: Parallel Computing/Programming using MPI

So which one is better? Or does it make a difference?
One way decomposition – one way or the other?

Page 20: Parallel Computing/Programming using MPI

Dimension Array (Nx,Ny) becomes
Dimension Array (Nx/3,Ny) or
Dimension Array (Nx,Ny/3)

The Nx/3 version in Fortran has shorter do-loop lengths in the fastest-moving (first) index, which could limit performance. Further, the sharing of data via message passing will have non-unit-stride data access patterns.
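As a hedged illustration (the array sizes and names are made up for this sketch): in Fortran the first index varies fastest in memory, so copying a boundary strip at fixed i touches memory with stride Nx, while a strip at fixed j is contiguous.

      Integer Nx, Ny
      Parameter (Nx = 300, Ny = 300)
      Real u(Nx, Ny), col(Ny), row(Nx)
C     Boundary strip at fixed i (the Nx/3 split): consecutive j are
C     Nx reals apart in memory - non-unit stride
      Do j = 1, Ny
         col(j) = u(Nx/3, j)
      Enddo
C     Boundary strip at fixed j (the Ny/3 split): consecutive i are
C     adjacent in memory - unit stride
      Do i = 1, Nx
         row(i) = u(i, Ny/3)
      Enddo
      End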

Page 21: Parallel Computing/Programming using MPI

So the design issue becomes one of choice for the programming language:

decide which language you need to use, and then

create the decomposition plan.

Page 22: Parallel Computing/Programming using MPI

Realize of course that a one-dimensional decomposition runs on Np_one processors, and a two-dimensional decomposition could have Np_two x Np_two.

So in the design of your parallel code you would have to be aware of your resources.

Further, few if any programs scale well, and being realistic about the number of processors to be used is important in deciding how hard you want to work at the parallelization effort.

Page 23: Parallel Computing/Programming using MPI

Processor Interconnections

• Communication hardware connects the processors

• These wires carry data and address information

• The best interconnection is the most expensive -- all machines have a direct connection to all other machines

• Because of cost we have to compromise

Page 24: Parallel Computing/Programming using MPI

Processor Interconnections

• The slowest is also the cheapest

• We could just let each machine connect to some other machine in a daisy chain fashion.

• Messages would bump along until they reach their destination.

• What other schemes are possible?

Page 25: Parallel Computing/Programming using MPI

Processor Interconnections

• The Linear Daisy Chain
• The Binary Tree
• The Fat Tree
• The FLAT network
• The Hypercube
• The Torus
• The Ring
• The Cross Bar
• And many, many others

Page 26: Parallel Computing/Programming using MPI

The Linear Daisy Chain

[Figure: Processor 0 – Processor 1 – Processor 2 connected in a line.]

Page 27: Parallel Computing/Programming using MPI

The Cross Bar

[Figure: a crossbar switch connecting Processor 0, Processor 1, and Processor 2 to each other, with a switch point at every crossing.]

O(1) hops, but the switch is O(n^2)
The fastest and most expensive

Page 28: Parallel Computing/Programming using MPI

Let's Look at the Binary Tree: O(log N)

Page 29: Parallel Computing/Programming using MPI

Let's Look at the Fat Tree: O(log N)+

Page 30: Parallel Computing/Programming using MPI

Let's Look at the Hypercube

[Figure: hypercubes of order 1 (a line), order 2 (a square), and order 3 (a cube).]

Duplicate and connect the edges together

Page 31: Parallel Computing/Programming using MPI

Let's Look at the Binary Tree
1) Every node can reach every other node
2) Has log N connections; 32 nodes have 5 levels
3) Some neighbors are far apart
4) A little more expensive
5) Root is a bottleneck

Page 32: Parallel Computing/Programming using MPI

Let's Look at the Fat Tree
1) Every node can reach every other node
2) Has log N connections; 32 nodes have 5 levels
3) Some neighbors are far apart
4) Even more expensive
5) Root bottleneck is better managed
6) Each level has multiple connections

Page 33: Parallel Computing/Programming using MPI

Message Passing
• Broadcast

[Figure: processor I sending the same message to every other processor J, K, K+1, ..., L.]

Broadcast from the Ith processor to all other processors
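In MPI Fortran the broadcast above is a single collective call; a minimal hedged sketch (the buffer name, count, and root are illustrative):

      Include 'mpif.h'
      Integer ierr
      Real data(100)
      Call MPI_INIT(ierr)
C     Every process makes the same call; the root (here rank 0, the
C     "Ith" processor) supplies the data, all others receive it
      Call MPI_BCAST(data, 100, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      Call MPI_FINALIZE(ierr)
      End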

Page 34: Parallel Computing/Programming using MPI

Message Passing
• Gather

[Figure: processors J, K, K+1, ..., L each sending their data to processor I.]

Gather from all other processors to the Ith processor
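Again a hedged sketch, assuming 16 processes each contributing 10 reals (the sizes are illustrative):

      Include 'mpif.h'
      Integer ierr
      Real mine(10), all(160)
      Call MPI_INIT(ierr)
C     Each process sends its 10 reals; rank 0 (the "Ith" processor)
C     receives them concatenated in rank order
      Call MPI_GATHER(mine, 10, MPI_REAL, all, 10, MPI_REAL,
     &                0, MPI_COMM_WORLD, ierr)
      Call MPI_FINALIZE(ierr)
      End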

Page 35: Parallel Computing/Programming using MPI

Message Passing
• Exchange – Based on User Topology

[Figure: processors I, J, K, K+1, ..., L exchanging data pairwise with their neighbors.]

Based on connection topology processors exchange information

Ring or Linear
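A hedged sketch of a ring exchange (buffer names illustrative); MPI_SENDRECV pairs the send and the receive so neighbors do not deadlock:

      Include 'mpif.h'
      Integer rank, nprocs, left, right, ierr
      Integer status(MPI_STATUS_SIZE)
      Real outbuf(10), inbuf(10)
      Call MPI_INIT(ierr)
      Call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      Call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
C     Neighbors on a ring wrap around at the ends
      right = mod(rank + 1, nprocs)
      left  = mod(rank - 1 + nprocs, nprocs)
C     Send to the right while receiving from the left
      Call MPI_SENDRECV(outbuf, 10, MPI_REAL, right, 0,
     &                  inbuf, 10, MPI_REAL, left, 0,
     &                  MPI_COMM_WORLD, status, ierr)
      Call MPI_FINALIZE(ierr)
      End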

Page 37: Parallel Computing/Programming using MPI

Just what is a message?

To: You@Address

Message Content

From: Me@Address

Page 38: Parallel Computing/Programming using MPI

Just what is a message?

To: You@Address:Attn Payroll

Message Content

From: Me@Address:Attn Payroll

Page 39: Parallel Computing/Programming using MPI

Message Structure

• To: Address (Rank)
• Content (starting array/vector/word address and length)
• Tag
• Data Type
• Error Flag
• Communicator

We know who we are, so From: Address (Rank) is implicit!
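These fields map almost directly onto the arguments of the Fortran send call; a hedged annotation of the call itself (the buffer, count, rank, and tag values are illustrative):

      Include 'mpif.h'
      Integer ierr
      Real Psi(100)
C     Content:      Psi (starting address) and 100 (length)
C     Data Type:    MPI_REAL
C     To:           rank 3
C     Tag:          17
C     Communicator: MPI_COMM_WORLD
C     Error Flag:   ierr, returned by the call
      Call MPI_SEND(Psi, 100, MPI_REAL, 3, 17, MPI_COMM_WORLD, ierr)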

Page 40: Parallel Computing/Programming using MPI

Messaging

• For every SEND we must have a RECEIVE!

• The transmission is one-sided: the receiver agrees to allow the sender to put the data into a memory location in the receiver process.
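A hedged sketch of one matched pair, rank 0 sending to rank 1 (buffer and tag are illustrative); the count, type, and tag must agree on both sides:

      Include 'mpif.h'
      Integer rank, ierr
      Integer status(MPI_STATUS_SIZE)
      Real buf(100)
      Call MPI_INIT(ierr)
      Call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      If (rank .eq. 0) Then
C        The SEND ...
         Call MPI_SEND(buf, 100, MPI_REAL, 1, 99,
     &                 MPI_COMM_WORLD, ierr)
      Else If (rank .eq. 1) Then
C        ... and its matching RECEIVE
         Call MPI_RECV(buf, 100, MPI_REAL, 0, 99,
     &                 MPI_COMM_WORLD, status, ierr)
      Endif
      Call MPI_FINALIZE(ierr)
      End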

Page 41: Parallel Computing/Programming using MPI

Message Passing

The interconnection topology is called a communicator – predefined at startup.

However the user can define his own topology – and should, as needed.

A problem-dependent communicator – actually more than one – can be defined as needed.
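One hedged example of defining your own topology: a 4 x 4 Cartesian grid communicator built on top of the predefined MPI_COMM_WORLD (the dimensions are illustrative):

      Include 'mpif.h'
      Integer dims(2), comm_grid, ierr
      Logical periods(2), reorder
      dims(1) = 4
      dims(2) = 4
C     Non-periodic in both directions; let MPI reorder the ranks
      periods(1) = .false.
      periods(2) = .false.
      reorder = .true.
      Call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods,
     &                     reorder, comm_grid, ierr)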

Page 42: Parallel Computing/Programming using MPI

Program Structure

[Figure: three processors, Rank 0, Rank 1, and Rank 2, each running the same Input / Loops / Output sequence, with sync-barriers separating the phases.]

Sync-Barriers
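A hedged sketch of the sync-barriers in the diagram (the phase routines are hypothetical placeholders, not MPI calls):

      Include 'mpif.h'
      Integer ierr
      Call input_phase()
C     No process starts its loops until every process finished input
      Call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      Call loop_phase()
C     ... and no process starts output until all loops are done
      Call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      Call output_phase()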

Page 43: Parallel Computing/Programming using MPI

MPI Send – Receive

[Figure: processor K sends Count cells to processor L, whose receive buffer has Length ≥ Count. Each cell holds one MPI_Data_Type, and the MPI_Data_Type must be the same on both sides!]
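On the receiving side the buffer only has to be at least Count long; a hedged fragment (sizes and ranks illustrative) that also asks how much actually arrived:

      Include 'mpif.h'
      Integer status(MPI_STATUS_SIZE), ierr, ncount
      Real buf(200)
C     The declared Length (200) must be >= the sender's Count
      Call MPI_RECV(buf, 200, MPI_REAL, 0, 0,
     &              MPI_COMM_WORLD, status, ierr)
C     Recover the actual number of MPI_REALs received
      Call MPI_GET_COUNT(status, MPI_REAL, ncount, ierr)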

Page 44: Parallel Computing/Programming using MPI

MPI Data_Types

Type       Number of bytes
Float      4
Double     8
Integer    4?
Boolean    4
Character  1?

A bit of care is needed between Fortran and C data types

Page 45: Parallel Computing/Programming using MPI

• #define MPI_BYTE          ...
  #define MPI_PACKED        ...
  #define MPI_CHAR          ...
  #define MPI_SHORT         ...
  #define MPI_INT           ...
  #define MPI_LONG          ...
  #define MPI_FLOAT         ...
  #define MPI_DOUBLE        ...
  #define MPI_LONG_DOUBLE   ...
  #define MPI_UNSIGNED_CHAR ...

Page 46: Parallel Computing/Programming using MPI

FORTRAN                            C

Fortran Data Type  MPI Data Type   C Data Type  MPI Data Type
CHARACTER          MPI_CHARACTER   signed char  MPI_CHAR
COMPLEX            MPI_COMPLEX     -            -
INTEGER            MPI_INTEGER     int          MPI_INT
LOGICAL            MPI_LOGICAL     -            -
REAL               MPI_REAL        float        MPI_FLOAT

Page 47: Parallel Computing/Programming using MPI

MPI Data_TYPE Issues

• Just what is a data type?

• How many bits?

• Big Endian versus Little Endian?

• Whatever is used must be consistent!

• Could type conversions be automatic or transparent??

Page 48: Parallel Computing/Programming using MPI
Page 49: Parallel Computing/Programming using MPI

[Figure: an Nx by Ny array split two ways among processors P0, P1, P2, P3, with the split lines at i = Nx/2 and j = Ny/2 (index ranges 1 to Nx/2-1, Nx/2 to Nx and 1 to Ny/2-1, Ny/2 to Ny). The update stencil is u(i,j) = u(i-1,j) + u(i,j), so computing u(i,j) just past a split needs u(i-1,j) from the neighboring processor.]

How do the processors see the variables they don’t have?

Page 50: Parallel Computing/Programming using MPI

The address spaces are distinct and separated.

Page 51: Parallel Computing/Programming using MPI

We can duplicate the data for other processors into our processor.

We will copy into these ghost vectors.

We can message pass between processors.

Actually we will send/receive between neighbors, as sketched below.
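A hedged sketch of filling one ghost vector for the u(i,j) = u(i-1,j) + u(i,j) stencil with a split along i (the local sizes and the pack/unpack buffers are illustrative choices):

      Include 'mpif.h'
      Integer NXL, NY
      Parameter (NXL = 50, NY = 100)
      Integer rank, nprocs, left, right, ierr
      Integer status(MPI_STATUS_SIZE)
      Real u(0:NXL, NY), sendbuf(NY), recvbuf(NY)
      Call MPI_INIT(ierr)
      Call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      Call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
C     End processors talk to MPI_PROC_NULL (a no-op partner)
      left = rank - 1
      If (rank .eq. 0) left = MPI_PROC_NULL
      right = rank + 1
      If (rank .eq. nprocs - 1) right = MPI_PROC_NULL
C     Pack the last owned strip i = NXL (non-unit stride in Fortran)
      Do j = 1, NY
         sendbuf(j) = u(NXL, j)
      Enddo
C     Send it right while receiving the left neighbor's strip
      Call MPI_SENDRECV(sendbuf, NY, MPI_REAL, right, 1,
     &                  recvbuf, NY, MPI_REAL, left, 1,
     &                  MPI_COMM_WORLD, status, ierr)
C     Unpack into the ghost vector i = 0
      Do j = 1, NY
         u(0, j) = recvbuf(j)
      Enddo
      Call MPI_FINALIZE(ierr)
      End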


Page 58: Parallel Computing/Programming using MPI

G77/Linux Fortran Memory Limitations

• g77 used to reject the following program on 32-bit targets:

      PROGRAM PROG
      DIMENSION A(140 000 000)
      END

with the message:

   prog.f: In program `prog':
   prog.f:2:
      DIMENSION A(140 000 000)
                ^
   Array `a' at (^) is too large to handle

because 140 000 000 REALs is larger than the largest bit-extent that can be expressed in 32 bits. However, bit-sizes never play a role after offsets have been converted to byte addresses. Therefore this check has been removed, and the limit is now 2 Gbyte of memory (around 530 000 000 REALs). Note: On GNU/Linux systems one has to compile and link programs that occupy more than 1 Gbyte statically, i.e. g77 -static ...

• Based on work done by Juergen Pfeifer ([email protected]) libf2c is now a shared library. One can still link in all objects with the program by specifying the -static option.

• Robert Anderson ([email protected]) thought up a two line change that enables g77 to compile such code as:

      SUBROUTINE SUB(A, N)
      DIMENSION N(2)
      DIMENSION A(N(1),N(2))
      A(1,1) = 1.
      END

Note the use of array elements in the bounds of the adjustable array A.

Page 59: Parallel Computing/Programming using MPI

• Uninitialized Variables at Run Time
• g77 needs an option to initialize everything (not otherwise explicitly initialized) to "weird" (machine-dependent) values, e.g. NaNs, bad (non-NULL) pointers, and largest-magnitude integers; this would help track down references to some kinds of uninitialized variables at run time.

• Note that use of the options -O -Wuninitialized can catch many such bugs at compile time.

Page 60: Parallel Computing/Programming using MPI

• Portable Unformatted Files
• g77 has no facility for exchanging unformatted files with systems using different number formats -- even differing only in endianness (byte order) -- or written by other compilers. Some compilers provide facilities at least for doing byte-swapping during unformatted I/O.
• It is unrealistic to expect to cope with exchanging unformatted files with arbitrary other compiler runtimes, but the g77 runtime should at least be able to read files written by g77 on systems with different number formats, particularly if they differ only in byte order.
• In case you do need to write a program to translate to or from g77 (libf2c) unformatted files, they are written as follows:
• Sequential
– Unformatted sequential records consist of:
– a number giving the length of the record contents;
– the record contents themselves;
– the length of the record contents again (for backspace).
– The record length is of C type long; this means that it is 8 bytes on 64-bit systems such as Alpha GNU/Linux and 4 bytes on other systems, such as x86 GNU/Linux. Consequently such files cannot be exchanged between 64-bit and 32-bit systems, even with the same basic number format.
• Direct access
– Unformatted direct access files form a byte stream of length records*recl bytes, where records is the maximum record number (REC=records) written and recl is the record length in bytes specified in the OPEN statement (RECL=recl). Data appear in the records as determined by the relevant WRITE statement. Dummy records with arbitrary contents appear in the file in place of records which haven't been written.
• Thus for exchanging a sequential or direct access unformatted file between big- and little-endian 32-bit systems using IEEE 754 floating point it would be sufficient to reverse the bytes in consecutive words in the file if, and only if, only REAL*4, COMPLEX, INTEGER*4 and/or LOGICAL*4 data have been written to it by g77.
• If necessary, it is possible to do byte-oriented I/O with g77's FGETC and FPUTC intrinsics. Byte-swapping can be done in Fortran by equivalencing larger-sized variables to an INTEGER*1 array or a set of scalars, as in the sketch after this list.
• If you need to exchange binary data between arbitrary system and compiler variations, we recommend using a portable binary format with Fortran bindings, such as NCSA's HDF (http://hdf.ncsa.uiuc.edu/) or PACT's PDB1 (http://www.llnl.gov/def_sci/pact/pact_homepage.html). (Unlike, say, CDF or XDR, HDF-like systems write in the native number formats and only incur overhead when they are read on a system with a different format.) A future g77 runtime library should use such techniques.
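A hedged sketch of the equivalence trick for one 4-byte word (a minimal illustration, not libf2c's own code):

      INTEGER*4 WORD
      INTEGER*1 BYTES(4), TMP
C     WORD and BYTES(1..4) occupy the same storage
      EQUIVALENCE (WORD, BYTES)
      WORD = 305419896
C     Reverse the byte order in place: swap 1<->4 and 2<->3
      TMP = BYTES(1)
      BYTES(1) = BYTES(4)
      BYTES(4) = TMP
      TMP = BYTES(2)
      BYTES(2) = BYTES(3)
      BYTES(3) = TMP
      WRITE(6,*) WORD
      END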


Page 62: Parallel Computing/Programming using MPI

GNU/Linux in Science and Engineering

Co-sponsored by the Vancouver Linux Users Group

Please consider joining the IEEE/VanLUG Science and Engineering SIG and participating in both our mailing list and other events that we may schedule from time to time. In the meantime, we hope that you'll find this list of Frequently Asked Questions about GNU/Linux in Science and Engineering to be useful. While the target audience for this FAQ are scientists and engineers who are in the process of migrating to Linux, hopefully even those with lots of experience will find a few useful nuggets.

1. How do I convince our senior management to let the technical staff run Linux rather than a widely used (but deeply flawed) alternative?
2. What is the purpose of the IEEE/VanLUG Science and Engineering SIG?
3. How does one join the IEEE/VanLUG Science and Engineering SIG?
4. What are some free alternatives to proprietary software?
5. Which commercial Science and Engineering software has been ported to GNU/Linux?
6. Which commercial Science and Engineering software hasn't yet been ported to GNU/Linux?
7. Can I compile and run my existing Fortran code under Linux?
8. Is there a good mechanical CAD or drafting package for Linux?
9. How can one keep up to date with science and engineering software for GNU/Linux? (journals, mailing lists, websites)
10. How is the Scientific Applications on Linux (SAL) website structured?
11. What is Scilab?
12. What is the Intel/ASCI Red Software?
13. What is IBM techexplorer?
14. What is Python?
15. What is PERL and why is it of interest to scientific programmers?
16. What are ten good practices for scientific programming?

FAQ Contributors: Dave Michelson, Petar Knezevich, Andrew Daviel, Xavier Calbet, ...

Page 63: Parallel Computing/Programming using MPI

• http://www.comsoc.org/vancouver/scieng.html#1

Page 64: Parallel Computing/Programming using MPI

xmaple A highly sophisticated package for doing symbolic or numeric mathematical operations. This package has nice built-in graphics that can be exported as a PostScript file. The option to save your output as a LaTeX file also exists.

xmgrace An excellent interactive 2-d plotting package. Import files to be plotted as a set of (x,y) or (x,y1,y2,...,yn) points.

gnuplot Another good plotting package for 2-d and surface plots.

konqueror The best browser on the market.

mozilla Another excellent browser.

kghostview A way to preview a PostScript file--saves printing ones with mistakes.

lyx Graphical LaTeX-based word processor.

openoffice This is the GNU/Linux version of the popular office suite.

Page 65: Parallel Computing/Programming using MPI

• Editors

• joe [filename] Joe's Own Editor. An easy text editor, similar to WordStar. Filenames for text files need not end with ".txt", or any extension for that matter, and "ctrl+k" pops up the menu for the editor. "ctrl+k" followed by an "h" brings up help.

• vi [filename] The most common text editor on Unix systems.

• xedit An X-based editor, mouse driven etc.

• xemacs [filename] Still more editors. This one is a highly developed, graphically based editor, with lots of built-in functions. It is popular among the Unix community.

• xjed [filename] Another text editor that is quite easy to use. "ctrl+h" brings up the menu bar at the bottom which tells you all you need to know.

Page 66: Parallel Computing/Programming using MPI

• Command Line System Commands:

• apropos [command or program] Command to search man pages for any item of text.
• cd [/path or ..] Change directory; path is the directory you want to go to, and ".." moves you up one directory. Equivalent to the similar "cd" command in DOS, or "set default" in VMS.
• cp filename newfilename Copies a file.
• latex filename.tex Invokes the latex2e text formatting package.
• ls Gives a listing of the files in a directory.
• man [command or program] Searches the on-line manual pages for the command or program named. Sometimes it is still difficult to find help on some subjects. Also try typing the command or program name followed by "--help", i.e. "man ls" or "ls --help". TRY USING MAN TO SOLVE YOUR PROBLEM BEFORE YOU ASK ANYONE FOR HELP.
• mkdir [name] Creates a directory.
• mv [path/]filename [path/]filename Moves a file. You can use this to rename a file also.
• rm [path/]filename Removes a file. REMEMBER: there is no undelete in GNU/Linux (or any other UNIX I know of). Please note that the switches for rm given in the man page are EXTREMELY DANGEROUS. WILDCARDS such as * or ? in rm instructions can cause MASSIVE PROBLEMS, even the RECURSIVE DELETION of all the files to which you have write privilege. Think before you rm!
• rmdir [name] Removes an empty directory.
• nohup [program name] & Runs a job in the background. You can log off after you "run the job nohup", and the program will continue to execute. The output that would have been printed to the screen will be placed in a file "nohup.out", in the directory in which you executed the command. If you want to specify a different output file, consult the man page on nohup.
• mcopy a:/[path]filename . Copies a file from a DOS floppy to the hard drive, or the reverse is possible.
• mdir a:[/path] Reads the directory listing off of a DOS floppy.