Transcript of "AMPI, Charisma and MSA"

Page 1:

AMPI, Charisma and MSA
Celso Mendes, Pritish Jetley
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
Charm++ Workshop 2008, 05/02/08

Page 2:

Motivation

Challenges:
- New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
- Typical MPI implementations are not naturally suitable for dynamic applications
- The set of available processors may not match the natural expression of the algorithm

AMPI: Adaptive MPI, i.e. MPI with virtualization through VPs ("Virtual Processors")

Page 3:

AMPI Status

- Compliant with the MPI-1.1 standard
  Missing: error handling, profiling interface
- Partial MPI-2 support
  One-sided communication
  ROMIO integrated for parallel I/O
  Missing: dynamic process management, language bindings

Page 4:

AMPI: MPI with Virtualization

Each virtual process is implemented as a user-level thread embedded in a Charm++ object.

[Figure: MPI "processes", implemented as virtual processes (user-level migratable threads), mapped onto the real processors]

Page 5:

Comparison with Native MPI

Performance:
- Slightly worse without optimization
- Can be improved via Charm++ runtime parameters for better performance

Page 6:

Comparison with Native MPI

Flexibility:
- Big runs on any number of processors
- Fits the nature of the algorithm

Problem setup: 3D stencil calculation of size 240^3, run on Lemieux.
AMPI runs on any number of PEs (e.g. 19, 33, 105); native MPI needs P = K^3.

[Plot: execution time [sec] vs. number of processors, log-log scale; measured times of 29.44, 14.16, 9.12, 8.07, and 5.52 seconds]

Page 7:

Building Charm++ / AMPI

Download website: http://charm.cs.uiuc.edu/download/
  Please register for better support

Build Charm++/AMPI:
  > ./build <target> <version> <options> [charmc-options]
  See the README file for details

To build AMPI:
  > ./build AMPI net-linux -g (-O3)

Page 8:

How to Write AMPI Programs

Avoid using global variables:
- Global variables are dangerous in multithreaded programs
- Global variables are shared by all the threads on a processor and can be changed by any of the threads

Example (time flows downward; two AMPI threads share one processor):

  Thread 1                    Thread 2
  count = myid (1)
  MPI_Recv() blocks ...
                              count = myid (2)
                              MPI_Recv() blocks ...
  b = count

If count is a global variable, an incorrect value is read!
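The same hazard in real code: a minimal C sketch, not from the slides, that compiles as an ordinary MPI program. Under AMPI with more virtual than physical processors, the global may be overwritten by a co-located rank while this rank is blocked; the local variable is safe.

  #include <mpi.h>
  #include <stdio.h>

  int count;   /* UNSAFE under AMPI: shared by all ranks (threads) on a processor */

  int main(int argc, char **argv) {
      int myid;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);

      count = myid;            /* a co-located rank may overwrite this value  */
      /* ... a blocking call such as MPI_Recv() would yield the CPU here ...  */
      int local_count = myid;  /* safe: per-rank state kept in a local        */

      printf("rank %d: global count = %d, local count = %d\n",
             myid, count, local_count);
      MPI_Finalize();
      return 0;
  }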

Page 9:

How to Run AMPI Programs (1)

We can run multithreaded on each processor. Running with many virtual processors:
- +p command-line option: number of physical processors
- +vp command-line option: number of virtual processors

Multiple initial processor mappings are possible:
  > charmrun hello +p3 +vp6 +mapping <map>

Available mappings at program initialization:
- RR_MAP: Round-Robin (cyclic) {(0,3)(1,4)(2,5)}
- BLOCK_MAP: Block (default) {(0,1)(2,3)(4,5)}
- PROP_MAP: Proportional to processors' speeds {(0,1,2,3)(4)(5)}

Page 10:

How to Run AMPI Programs (2)

Specify the stack size for each thread:
- Smaller or larger stack sizes can be set as needed
- Note that a thread's stack space is unique across processors
- Use the +tcharm_stacksize command-line option:
  > charmrun hello +p2 +vp8 +tcharm_stacksize 8000000
- The default stack size is 1 MByte per thread

Page 11:

How to Convert an MPI Program

- Remove global variables if possible
- If not possible, privatize global variables:
  Pack them into a struct/TYPE or class
  Allocate the struct/TYPE on the heap or stack

Original code:

  MODULE shareddata
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END MODULE

AMPI code:

  MODULE shareddata
    TYPE chunk
      INTEGER :: myrank
      DOUBLE PRECISION :: xyz(100)
    END TYPE
  END MODULE
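For C codes, the analogous transformation can be sketched as below; this is not from the slides, and the struct and helper names are illustrative. Former globals become fields of a per-rank structure that each AMPI thread allocates and passes to the routines that used them.

  #include <stdio.h>
  #include <stdlib.h>

  /* Formerly globals (unsafe under AMPI): int myrank; double xyz[100]; */
  typedef struct {
      int myrank;
      double xyz[100];
  } SharedData;

  /* Each AMPI thread allocates its own copy and passes it around explicitly. */
  static SharedData *make_shareddata(int myrank) {
      SharedData *d = calloc(1, sizeof *d);
      d->myrank = myrank;
      return d;
  }

  int main(void) {
      SharedData *d = make_shareddata(0);   /* in AMPI code, pass the MPI rank */
      printf("rank %d, xyz[0] = %f\n", d->myrank, d->xyz[0]);
      free(d);
      return 0;
  }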

Page 12:

How to Convert an MPI Program

Fortran program entry point: MPI_Main

  Original:              AMPI:
  program pgm            subroutine MPI_Main
    ...                    ...
  end program            end subroutine

The C program entry point is handled automatically, via mpi.h.

Page 13:

AMPI Extensions

- Automatic load balancing
- Non-blocking collectives
- Checkpoint/restart mechanism
- Multi-module programming
- Handling global and static variables

Page 14:

Automatic Load Balancing

Automatic load balancing: MPI_Migrate()
- Collective call informing the load balancer that the thread is ready to be migrated, if needed (see the sketch below)

The link-time flag -memory isomalloc makes heap-data migration transparent:
- Special memory allocation mode, giving allocated memory the same virtual address on all processors
- Ideal on 64-bit machines
- But: granularity is 16 KB per allocation
- Alternative: a PUPer
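A minimal sketch, not from the slides, of how an iterative AMPI code typically exposes migration points with the MPI_Migrate() call named above; the work routine and the migration interval are illustrative.

  #include <mpi.h>

  static void do_timestep(int step) { (void)step; }   /* hypothetical work */

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      for (int step = 0; step < 1000; step++) {
          do_timestep(step);

          /* Every 100 steps, let the load balancer migrate this thread if useful.
           * With -memory isomalloc at link time, heap data moves transparently;
           * otherwise a PUP routine must pack and unpack the thread's state. */
          if (step % 100 == 0)
              MPI_Migrate();
      }

      MPI_Finalize();
      return 0;
  }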

Page 15:

Asynchronous Collectives

Our implementation is asynchronous:
- The collective operation is posted
- Test/wait for its completion
- Meanwhile, useful computation can utilize the CPU

  MPI_Ialltoall( … , &req);
  /* other computation */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
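A slightly fuller sketch of the overlap pattern, not from the slides; the argument list follows the MPI-3-style non-blocking all-to-all signature, which may differ in detail from AMPI's 2008 extension.

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      int nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int *sendbuf = calloc(nprocs, sizeof(int));
      int *recvbuf = calloc(nprocs, sizeof(int));
      MPI_Request req;

      /* Post the collective; the call returns immediately. */
      MPI_Ialltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT,
                    MPI_COMM_WORLD, &req);

      /* ... useful computation that does not touch the buffers ... */

      /* Block only when the exchanged data is actually needed. */
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }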

Page 16:

Checkpoint/Restart Mechanism

Large-scale machines may suffer from failures. Checkpoint/restart mechanism:
- The state of the application is checkpointed to disk files
- Capable of restarting on a different number of PEs
- Facilitates future efforts on fault tolerance
- On disk: MPI_Checkpoint(DIRNAME)
- In memory: MPI_MemCheckpoint(void)

Checkpoint/restart and load balancing have recently been applied in CSAR's Rocstar.
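A hedged sketch, not from the slides, of periodic checkpointing in an iterative code using the on-disk call named above; the directory name, interval, and work routine are illustrative.

  #include <mpi.h>

  static void do_timestep(int step) { (void)step; }   /* hypothetical work */

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      for (int step = 1; step <= 10000; step++) {
          do_timestep(step);

          /* Periodically write a restartable checkpoint to disk; the
           * in-memory variant would be MPI_MemCheckpoint(). */
          if (step % 500 == 0)
              MPI_Checkpoint("ckpt_dir");
      }

      MPI_Finalize();
      return 0;
  }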

Page 17:

Interoperability with Charm++

Charm++ has a collection of support libraries. We can make use of them by running Charm++ code in AMPI programs, and we can also run AMPI code in Charm++ programs.

Page 18:

Handling Global & Static Variables

Global and static variables are not thread-safe. Can we switch those variables when we switch threads?

Globals: Executable and Linkable Format (ELF)
- The executable has a Global Offset Table (GOT) containing global data
- The GOT pointer is stored in the %ebx register
- Switch this pointer when switching between threads
- Supported on Linux, Solaris 2.x, and more
- Integrated in Charm++/AMPI; invoked by the compile-time option -swapglobals

Statics in C codes: __thread privatizes them (see the sketch below)
- Requires linking with the pthreads library
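A small sketch, not from the slides, of the __thread approach for statics; the variable and function names are illustrative, and the build requires linking with the pthreads library as the slide notes.

  #include <stdio.h>

  /* Before: static int call_count = 0;      (shared, not thread-safe)          */
  static __thread int call_count = 0;     /* after: one private copy per thread */

  static void record_call(void) {
      call_count++;
  }

  int main(void) {
      record_call();
      record_call();
      printf("calls seen by this thread: %d\n", call_count);
      return 0;
  }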

Page 19:

Programmer Productivity

- Take away generality
- Program in small, uncomplicated languages
- Incomplete: fit for specific types of programs
- Interoperable modules
- Common ARTS (adaptive run-time system)

Page 20:

Charisma

Static data flow:
- Suffices for a number of applications: molecular dynamics, FEM, PDEs, etc.

Global data and control flow are explicit, unlike in Charm++.

Page 21:

The Charisma Programming Model

- Arrays of objects
- Global parameter space (PS): objects read from and write into the PS
- Clean division between parallel (orchestration) code and sequential methods

[Figure: worker objects communicating through buffers in the parameter space (PS)]

Page 22:

Example: Stencil Computation

  foreach x,y in workers
    (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
  end-foreach

  foreach x,y in workers
    (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y+1], db[x,y-1]);
  end-foreach

  while (err > epsilon)
    foreach x,y in workers
      (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
      (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y-1], db[x,y+1]);
    end-foreach
  end-while

Note: (+err) performs a reduction on the variable err.

Page 23:

Language Features

Communication patterns: P2P, Multicast, Scatter, Gather

Determinism: methods are invoked on objects in program order

Support for libraries: use external libraries, or create your own

Point-to-point:
  foreach w in workers
    (p[w]) <- workers[w].produce();
    workers[w].consume(p[w-1]);
  end-foreach

Multicast:
  foreach r in workers
    (p[r]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r]);
  end-foreach

Scatter:
  foreach r in workers
    (p[r,*]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r,c]);
  end-foreach

[Figure: workers[r] multicasting p[r] to consumers[r,1], consumers[r,2], consumers[r,3], and scattering p[r,c] to the corresponding consumers]

Page 24:

However...

Charisma is no panacea:
- Only static dataflow
- Data-dependent dataflow cannot be expressed

Page 25:

Shared Address Spaces

Shared-memory programming:
- Easier to program in (sometimes)
- Cache coherence (hurts performance)
- Consistency, data races (productivity suffers)

MSA: shared-memory programs access data differently in different phases.

Page 26:

Matrix-Matrix Multiply

[Figure: C = A x B; matrices A and B are accessed in Read mode, while the product C is accessed in Write / Accumulate mode]

Page 27:

MSA Access Modes

- The shared address space (SAS) is part of the programming model
- Support some access modes efficiently: Read-only, Write, Accumulate

Page 28:

MSA MMM Code

  typedef MSA2D<double, MSA_NullA<double>, MSA_ROW_MAJOR> MSA2DRowMjr;
  typedef MSA2D<double, MSA_SumA<double>,  MSA_COL_MAJOR> MSA2DColMjr;

  // Initialization
  MSA2DRowMjr arr1(ROWS1, COLS1, NUMWORKERS, cacheSize1);   // row major
  MSA2DColMjr arr2(ROWS2, COLS2, NUMWORKERS, cacheSize2);   // column major
  MSA2DRowMjr prod(ROWS1, COLS2, NUMWORKERS, cacheSize3);   // product matrix

  for (unsigned int c = 0; c < COLS2; c++) {
    // Each thread computes a subset of rows of the product matrix
    for (unsigned int r = rowStart; r <= rowEnd; r++) {
      double result = 0.0;                   // thread-local accumulation
      for (unsigned int k = 0; k < COLS1; k++)
        result += arr1[r][k] * arr2[k][c];
      prod[r][c] = result;                   // global memory access
    }
  }

  prod.sync();                               // new phase for matrix prod

Page 29:

Future Work

- Compiler-directed optimizations
- Performance and productivity analyses
- Cross-module data exchange