Transcript of "AMPI, Charisma and MSA"

Page 1:

AMPI, Charisma and MSA
Celso Mendes, Pritish Jetley
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
Charm++ Workshop 2008, 05/02/08

Page 2:

Motivation

Challenges:
- New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
- Typical MPI implementations are not naturally suitable for dynamic applications
- The set of available processors may not match the natural expression of the algorithm

AMPI: Adaptive MPI, i.e. MPI with virtualization through VPs ("Virtual Processors")

Page 3:

AMPI Status

- Compliant with the MPI-1.1 standard
  Missing: error handling, profiling interface
- Partial MPI-2 support
  One-sided communication
  ROMIO integrated for parallel I/O
  Missing: dynamic process management, language bindings

Page 4:

AMPI: MPI with Virtualization

Each virtual process is implemented as a user-level thread embedded in a Charm++ object.

[Figure: MPI "processes", implemented as virtual processes (user-level migratable threads), mapped onto the real processors]

Page 5:

Comparison with Native MPI

Performance:
- Slightly worse without optimization
- Can be improved via Charm++ runtime parameters for better performance

Page 6:

Comparison with Native MPI

Flexibility:
- Big runs on any number of processors
- Fits the nature of the algorithm

Problem setup: 3D stencil calculation of size 240^3, run on Lemieux.
AMPI runs on any number of PEs (e.g. 19, 33, 105); native MPI needs P = K^3.

[Plot: execution time [sec] vs. number of processors, log-log scale; measured times of 29.44, 14.16, 9.12, 8.07, and 5.52 seconds]

Page 7:

Building Charm++ / AMPI

Download website: http://charm.cs.uiuc.edu/download/
  Please register for better support

Build Charm++/AMPI:
  > ./build <target> <version> <options> [charmc-options]
  See the README file for details

To build AMPI:
  > ./build AMPI net-linux -g (-O3)

Page 8:

How to Write AMPI Programs

Avoid using global variables:
- Global variables are dangerous in multithreaded programs
- Global variables are shared by all the threads on a processor and can be changed by any of the threads

Example (time flows downward; two AMPI threads share one processor):

  Thread 1                    Thread 2
  count = myid (1)
  MPI_Recv() blocks ...
                              count = myid (2)
                              MPI_Recv() blocks ...
  b = count

If count is a global variable, an incorrect value is read!
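The same hazard in real code: a minimal C sketch, not from the slides, that compiles as an ordinary MPI program. Under AMPI with more virtual than physical processors, the global may be overwritten by a co-located rank while this rank is blocked; the local variable is safe.

  #include <mpi.h>
  #include <stdio.h>

  int count;   /* UNSAFE under AMPI: shared by all ranks (threads) on a processor */

  int main(int argc, char **argv) {
      int myid;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);

      count = myid;            /* a co-located rank may overwrite this value  */
      /* ... a blocking call such as MPI_Recv() would yield the CPU here ...  */
      int local_count = myid;  /* safe: per-rank state kept in a local        */

      printf("rank %d: global count = %d, local count = %d\n",
             myid, count, local_count);
      MPI_Finalize();
      return 0;
  }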

Page 9:

How to Run AMPI Programs (1)

We can run multithreaded on each processor. Running with many virtual processors:
- +p command-line option: number of physical processors
- +vp command-line option: number of virtual processors

Multiple initial processor mappings are possible:
  > charmrun hello +p3 +vp6 +mapping <map>

Available mappings at program initialization:
- RR_MAP: Round-Robin (cyclic) {(0,3)(1,4)(2,5)}
- BLOCK_MAP: Block (default) {(0,1)(2,3)(4,5)}
- PROP_MAP: Proportional to processors' speeds {(0,1,2,3)(4)(5)}

Page 10:

How to Run AMPI Programs (2)

Specify the stack size for each thread:
- Smaller or larger stack sizes can be set as needed
- Note that a thread's stack space is unique across processors
- Use the +tcharm_stacksize command-line option:
  > charmrun hello +p2 +vp8 +tcharm_stacksize 8000000
- The default stack size is 1 MByte per thread

Page 11:

How to Convert an MPI Program

- Remove global variables if possible
- If not possible, privatize global variables:
  Pack them into a struct/TYPE or class
  Allocate the struct/TYPE on the heap or stack

Original code:

  MODULE shareddata
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END MODULE

AMPI code:

  MODULE shareddata
    TYPE chunk
      INTEGER :: myrank
      DOUBLE PRECISION :: xyz(100)
    END TYPE
  END MODULE
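For C codes, the analogous transformation can be sketched as below; this is not from the slides, and the struct and helper names are illustrative. Former globals become fields of a per-rank structure that each AMPI thread allocates and passes to the routines that used them.

  #include <stdio.h>
  #include <stdlib.h>

  /* Formerly globals (unsafe under AMPI): int myrank; double xyz[100]; */
  typedef struct {
      int myrank;
      double xyz[100];
  } SharedData;

  /* Each AMPI thread allocates its own copy and passes it around explicitly. */
  static SharedData *make_shareddata(int myrank) {
      SharedData *d = calloc(1, sizeof *d);
      d->myrank = myrank;
      return d;
  }

  int main(void) {
      SharedData *d = make_shareddata(0);   /* in AMPI code, pass the MPI rank */
      printf("rank %d, xyz[0] = %f\n", d->myrank, d->xyz[0]);
      free(d);
      return 0;
  }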

Page 12:

How to Convert an MPI Program

Fortran program entry point: MPI_Main

  Original:              AMPI:
  program pgm            subroutine MPI_Main
    ...                    ...
  end program            end subroutine

The C program entry point is handled automatically, via mpi.h.

Page 13:

AMPI Extensions

- Automatic load balancing
- Non-blocking collectives
- Checkpoint/restart mechanism
- Multi-module programming
- Handling global and static variables

Page 14:

Automatic Load Balancing

Automatic load balancing: MPI_Migrate()
- Collective call informing the load balancer that the thread is ready to be migrated, if needed (see the sketch below)

The link-time flag -memory isomalloc makes heap-data migration transparent:
- Special memory allocation mode, giving allocated memory the same virtual address on all processors
- Ideal on 64-bit machines
- But: granularity is 16 KB per allocation
- Alternative: a PUPer
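A minimal sketch, not from the slides, of how an iterative AMPI code typically exposes migration points with the MPI_Migrate() call named above; the work routine and the migration interval are illustrative.

  #include <mpi.h>

  static void do_timestep(int step) { (void)step; }   /* hypothetical work */

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      for (int step = 0; step < 1000; step++) {
          do_timestep(step);

          /* Every 100 steps, let the load balancer migrate this thread if useful.
           * With -memory isomalloc at link time, heap data moves transparently;
           * otherwise a PUP routine must pack and unpack the thread's state. */
          if (step % 100 == 0)
              MPI_Migrate();
      }

      MPI_Finalize();
      return 0;
  }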

Page 15:

Asynchronous Collectives

Our implementation is asynchronous:
- The collective operation is posted
- Test/wait for its completion
- Meanwhile, useful computation can utilize the CPU

  MPI_Ialltoall( … , &req);
  /* other computation */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
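A slightly fuller sketch of the overlap pattern, not from the slides; the argument list follows the MPI-3-style non-blocking all-to-all signature, which may differ in detail from AMPI's 2008 extension.

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      int nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int *sendbuf = calloc(nprocs, sizeof(int));
      int *recvbuf = calloc(nprocs, sizeof(int));
      MPI_Request req;

      /* Post the collective; the call returns immediately. */
      MPI_Ialltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT,
                    MPI_COMM_WORLD, &req);

      /* ... useful computation that does not touch the buffers ... */

      /* Block only when the exchanged data is actually needed. */
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }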

Page 16:

Checkpoint/Restart Mechanism

Large-scale machines may suffer from failures. Checkpoint/restart mechanism:
- The state of the application is checkpointed to disk files
- Capable of restarting on a different number of PEs
- Facilitates future efforts on fault tolerance
- On disk: MPI_Checkpoint(DIRNAME)
- In memory: MPI_MemCheckpoint(void)

Checkpoint/restart and load balancing have recently been applied in CSAR's Rocstar.
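A hedged sketch, not from the slides, of periodic checkpointing in an iterative code using the on-disk call named above; the directory name, interval, and work routine are illustrative.

  #include <mpi.h>

  static void do_timestep(int step) { (void)step; }   /* hypothetical work */

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      for (int step = 1; step <= 10000; step++) {
          do_timestep(step);

          /* Periodically write a restartable checkpoint to disk; the
           * in-memory variant would be MPI_MemCheckpoint(). */
          if (step % 500 == 0)
              MPI_Checkpoint("ckpt_dir");
      }

      MPI_Finalize();
      return 0;
  }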

Page 17:

Interoperability with Charm++

Charm++ has a collection of support libraries. We can make use of them by running Charm++ code in AMPI programs, and we can also run AMPI code in Charm++ programs.

Page 18:

Handling Global & Static Variables

Global and static variables are not thread-safe. Can we switch those variables when we switch threads?

Globals: Executable and Linkable Format (ELF)
- The executable has a Global Offset Table (GOT) containing global data
- The GOT pointer is stored in the %ebx register
- Switch this pointer when switching between threads
- Supported on Linux, Solaris 2.x, and more
- Integrated in Charm++/AMPI; invoked by the compile-time option -swapglobals

Statics in C codes: __thread privatizes them (see the sketch below)
- Requires linking with the pthreads library
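A small sketch, not from the slides, of the __thread approach for statics; the variable and function names are illustrative, and the build requires linking with the pthreads library as the slide notes.

  #include <stdio.h>

  /* Before: static int call_count = 0;      (shared, not thread-safe)          */
  static __thread int call_count = 0;     /* after: one private copy per thread */

  static void record_call(void) {
      call_count++;
  }

  int main(void) {
      record_call();
      record_call();
      printf("calls seen by this thread: %d\n", call_count);
      return 0;
  }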

Page 19:

Programmer Productivity

- Take away generality
- Program in small, uncomplicated languages
- Incomplete: fit for specific types of programs
- Interoperable modules
- Common ARTS (adaptive run-time system)

Page 20:

Charisma

Static data flow:
- Suffices for a number of applications: molecular dynamics, FEM, PDEs, etc.

Global data and control flow are explicit, unlike in Charm++.

Page 21:

The Charisma Programming Model

- Arrays of objects
- Global parameter space (PS): objects read from and write into the PS
- Clean division between parallel (orchestration) code and sequential methods

[Figure: worker objects communicating through buffers in the parameter space (PS)]

Page 22:

Example: Stencil Computation

  foreach x,y in workers
    (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
  end-foreach

  foreach x,y in workers
    (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y+1], db[x,y-1]);
  end-foreach

  while (err > epsilon)
    foreach x,y in workers
      (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
      (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y-1], db[x,y+1]);
    end-foreach
  end-while

Note: (+err) performs a reduction on the variable err.

Page 23:

Language Features

Communication patterns: P2P, Multicast, Scatter, Gather

Determinism: methods are invoked on objects in program order

Support for libraries: use external libraries, or create your own

Point-to-point:
  foreach w in workers
    (p[w]) <- workers[w].produce();
    workers[w].consume(p[w-1]);
  end-foreach

Multicast:
  foreach r in workers
    (p[r]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r]);
  end-foreach

Scatter:
  foreach r in workers
    (p[r,*]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r,c]);
  end-foreach

[Figure: workers[r] multicasting p[r] to consumers[r,1], consumers[r,2], consumers[r,3], and scattering p[r,c] to the corresponding consumers]

Page 24:

However...

Charisma is no panacea:
- Only static dataflow
- Data-dependent dataflow cannot be expressed

Page 25:

Shared Address Spaces

Shared-memory programming:
- Easier to program in (sometimes)
- Cache coherence (hurts performance)
- Consistency, data races (productivity suffers)

MSA: shared-memory programs access data differently in different phases.

Page 26:

Matrix-Matrix Multiply

[Figure: C = A x B; matrices A and B are accessed in Read mode, while the product C is accessed in Write / Accumulate mode]

Page 27:

MSA Access Modes

- The shared address space (SAS) is part of the programming model
- Support some access modes efficiently: Read-only, Write, Accumulate

Page 28:

MSA MMM Code

  typedef MSA2D<double, MSA_NullA<double>, MSA_ROW_MAJOR> MSA2DRowMjr;
  typedef MSA2D<double, MSA_SumA<double>,  MSA_COL_MAJOR> MSA2DColMjr;

  // Initialization
  MSA2DRowMjr arr1(ROWS1, COLS1, NUMWORKERS, cacheSize1);   // row major
  MSA2DColMjr arr2(ROWS2, COLS2, NUMWORKERS, cacheSize2);   // column major
  MSA2DRowMjr prod(ROWS1, COLS2, NUMWORKERS, cacheSize3);   // product matrix

  for (unsigned int c = 0; c < COLS2; c++) {
    // Each thread computes a subset of rows of the product matrix
    for (unsigned int r = rowStart; r <= rowEnd; r++) {
      double result = 0.0;                   // thread-local accumulation
      for (unsigned int k = 0; k < COLS1; k++)
        result += arr1[r][k] * arr2[k][c];
      prod[r][c] = result;                   // global memory access
    }
  }

  prod.sync();                               // new phase for matrix prod

Page 29:

Future Work

- Compiler-directed optimizations
- Performance and productivity analyses
- Cross-module data exchange