AMPI, Charisma and MSA
Celso Mendes, Pritish Jetley
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
Motivation
Challenges:
  New-generation parallel applications are dynamically varying: load shifting, adaptive refinement.
  Typical MPI implementations are not naturally suitable for dynamic applications.
  The set of available processors may not match the natural expression of the algorithm.

AMPI (Adaptive MPI): MPI with virtualization into VPs ("Virtual Processors").
AMPI Status
Compliance with the MPI-1.1 standard
  Missing: error handling, profiling interface
Partial MPI-2 support
  One-sided communication
  ROMIO integrated for parallel I/O
  Missing: dynamic process management, language bindings
AMPI: MPI with Virtualization

Each virtual process is implemented as a user-level thread embedded in a Charm++ object.

[Figure: MPI "processes", implemented as virtual processes (user-level migratable threads), mapped onto the real processors]
Comparison with Native MPI
Performance:
  Slightly worse without optimization.
  Can be improved via Charm++ run-time parameters for better performance.
Problem setup: 3D stencil calculation of size 240^3, run on Lemieux.
AMPI runs on any number of PEs (e.g. 19, 33, 105); native MPI needs P = K^3.
[Plot: execution time (sec) vs. number of processors, log-log scale]
Comparison with Native MPI
Flexibility:
  Big runs on any number of processors.
  Fits the nature of the algorithms.
Building Charm++ / AMPI
Download website: http://charm.cs.uiuc.edu/download/
  Please register, for better support.
Build Charm++/AMPI:
  > ./build <target> <version> <options> [charmc-options]
  See the README file for details.
To build AMPI:
  > ./build AMPI net-linux -g (-O3)
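To compile and launch an AMPI application against the resulting build, the distribution provides AMPI compiler wrappers under bin/; the wrapper name used below (ampicc) is an assumption for illustration, so check the bin/ directory of your build:

  > bin/ampicc -o hello hello.c
  > charmrun hello +p2 +vp8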
How to write AMPI programs

Avoid using global variables.
Global variables are dangerous in multithreaded programs: they are shared by all the threads on a processor and can be changed by any of the threads.
Example:
  time    Thread 1                    Thread 2
    |     count = myid  (1)
    |     MPI_Recv()  blocks...
    |                                 count = myid  (2)
    |                                 MPI_Recv()  blocks...
    v     b = count

If count is a global variable, an incorrect value is read!
How to run AMPI programs (1)
We can run multithreaded on each processor.
Running with many virtual processors:
  +p command-line option: number of physical processors
  +vp command-line option: number of virtual processors
Multiple initial processor mappings are possible:
  > charmrun hello +p3 +vp6 +mapping <map>
Available mappings at program initialization:
  RR_MAP: Round-Robin (cyclic)                  {(0,3) (1,4) (2,5)}
  BLOCK_MAP: Block (default)                    {(0,1) (2,3) (4,5)}
  PROP_MAP: Proportional to processors' speeds  {(0,1,2,3) (4) (5)}
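For example, a cyclic placement of the six virtual processors over the three physical processors would be requested with:

  > charmrun hello +p3 +vp6 +mapping RR_MAP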
How to run AMPI programs (2)
Specify the stack size for each thread:
  Set smaller or larger stack sizes as needed.
  Note that a thread's stack space is unique across processors.
  Use the +tcharm_stacksize command-line option:
  > charmrun hello +p2 +vp8 +tcharm_stacksize 8000000
The default stack size is 1 MByte per thread.
How to convert an MPI program
Remove global variables if possible.
If not possible, privatize the global variables:
  Pack them into a struct/TYPE or class.
  Allocate the struct/TYPE on the heap or the stack.
Original code:

  MODULE shareddata
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END MODULE

AMPI code:

  MODULE shareddata
    TYPE chunk
      INTEGER :: myrank
      DOUBLE PRECISION :: xyz(100)
    END TYPE
  END MODULE
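The same transformation applies in C. Below is a minimal sketch (all names are illustrative) that packs the former globals into a struct allocated on the stack, as suggested above:

  #include <mpi.h>

  /* Former globals, packed into a struct as in the Fortran example above */
  struct shareddata {
    int myrank;
    double xyz[100];
  };

  int main(int argc, char **argv) {
    struct shareddata d;                       /* on the stack; heap allocation also works */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &d.myrank);
    /* ... pass &d to the routines that previously read the globals ... */
    MPI_Finalize();
    return 0;
  }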
How to convert an MPI program
The Fortran program entry point becomes MPI_Main:

  Original:                AMPI:
  program pgm              subroutine MPI_Main
    ...                      ...
  end program              end subroutine

The C program entry point is handled automatically, via mpi.h.
AMPI Extensions
Automatic load balancing
Non-blocking collectives
Checkpoint/restart mechanism
Multi-module programming
Handling of global and static variables
Automatic Load Balancing
Automatic load balancing: MPI_Migrate()
  Collective call informing the load balancer that the thread is ready to be migrated, if needed.
The link-time flag -memory isomalloc makes heap-data migration transparent:
  Special memory allocation mode, giving allocated memory the same virtual address on all processors.
  Ideal on 64-bit machines.
  But: granularity is 16 KB per allocation.
  Alternative: a PUPer.
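A typical usage pattern, sketched here with hypothetical application routines (compute_one_step, exchange_boundaries) and a hypothetical LB_PERIOD constant, is to call MPI_Migrate() periodically inside the time-step loop so the load balancer gets a chance to move threads between iterations:

  for (int step = 0; step < nsteps; step++) {
    compute_one_step();              /* hypothetical application work */
    exchange_boundaries();           /* hypothetical communication phase */
    if (step % LB_PERIOD == 0)
      MPI_Migrate();                 /* collective; the thread may be migrated here if needed */
  }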
Asynchronous Collectives

Our implementation is asynchronous:
  The collective operation is posted; the program tests/waits for its completion.
  Meanwhile, useful computation can utilize the CPU.

  MPI_Ialltoall( ... , &req);
  /* other computation */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
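A slightly fuller sketch of the overlap pattern; the buffers and the local-work routine are placeholders, and the argument list shown follows the usual nonblocking all-to-all form:

  MPI_Request req;
  MPI_Ialltoall(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE,
                MPI_COMM_WORLD, &req);       /* post the collective */
  do_local_work();                           /* useful computation overlaps communication */
  MPI_Wait(&req, MPI_STATUS_IGNORE);         /* block only when the result is needed */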
Checkpoint/Restart Mechanism

Large-scale machines may suffer from failures.
Checkpoint/restart mechanism:
  The state of the application is checkpointed to disk files.
  Capable of restarting on a different number of PEs.
  Facilitates future efforts on fault tolerance.
  In-disk: MPI_Checkpoint(DIRNAME)
  In-memory: MPI_MemCheckpoint(void)
Checkpoint/restart and load balancing have recently been applied in CSAR's Rocstar.
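For example, a time-step loop might checkpoint every CKPT_PERIOD iterations; the routine names, the period, and the directory name below are placeholders:

  for (int step = 0; step < nsteps; step++) {
    compute_one_step();                      /* hypothetical application work */
    if (step % CKPT_PERIOD == 0)
      MPI_Checkpoint("ckpt");                /* in-disk checkpoint into the "ckpt" directory */
    /* MPI_MemCheckpoint();                     in-memory alternative */
  }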
Interoperability with Charm++

Charm++ has a collection of support libraries.
We can make use of them by running Charm++ code in AMPI programs.
We can also run AMPI code in Charm++ programs.
Handling Global & Static Variables

Global and static variables are not thread-safe.
Can we switch those variables when we switch threads?
Globals: Executable and Linkable Format (ELF)
  The executable has a Global Offset Table (GOT) containing the global data.
  The GOT pointer is stored in the %ebx register.
  Switch this pointer when switching between threads.
  Supported on Linux, Solaris 2.x, and more.
  Integrated in Charm++/AMPI; invoked by the compile-time option -swapglobals.
Statics in C code: __thread privatizes them.
  Requires linking against the pthreads library.
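A minimal sketch of the C case (the variable and function names are illustrative); as noted above, the program must be linked against the pthreads library:

  /* Before: shared by all AMPI threads on a processor */
  /* static int call_count = 0;                         */

  static __thread int call_count = 0;   /* each thread now gets its own copy */

  void record_call(void) {
    call_count++;
  }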
Programmer Productivity
Take away generality
  Program in small, uncomplicated languages.
Incomplete
  Each fits specific types of programs.
Interoperable modules
  Common ARTS (adaptive run-time system).
Charisma
Static data flow
  Suffices for a number of applications: molecular dynamics, FEM, PDEs, etc.
Global data and control flow are explicit
  Unlike Charm++.
The Charisma Programming Model
Arrays of objects
Global parameter space (PS)
  Objects read from and write into the PS.
Clean division between:
  Parallel (orchestration) code
  Sequential methods

[Figure: worker objects communicating through buffers in the parameter space (PS)]
Example: Stencil Computation
  foreach x,y in workers
    (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
  end-foreach

  foreach x,y in workers
    (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y+1], db[x,y-1]);
  end-foreach

  while (err > epsilon)
    foreach x,y in workers
      (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
      (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y-1], db[x,y+1]);
    end-foreach
  end-while

Reduction on variable err.
Language Features
Communication patterns
  P2P, Multicast, Scatter, Gather
Determinism
  Methods are invoked on objects in program order.
Support for libraries
  Use external libraries, or create your own.

Point-to-point:
  foreach w in workers
    (p[w]) <- workers[w].produce();
    workers[w].consume(p[w-1]);
  end-foreach

Multicast:
  foreach r in workers
    (p[r]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r]);
  end-foreach

Scatter:
  foreach r in workers
    (p[r,*]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r,c]);
  end-foreach

[Figures: point-to-point among workers[w]; multicast and scatter from workers[r] to consumers[r,1], consumers[r,2], consumers[r,3]]
However...
Charisma is no panacea:
  Only static dataflow.
  Data-dependent dataflow cannot be expressed.
Shared Address Spaces
Shared-memory programming:
  Easier to program in (sometimes)
  Cache coherence (hurts performance)
  Consistency, data races (productivity suffers)
MSA builds on the observation that shared-memory programs access data differently in various phases.
Matrix-Matrix Multiply
[Figure: matrix-matrix multiply C = A x B; A and B are read, C is written/accumulated]
MSA Access modes
The shared address space (SAS) is part of the programming model.
Support some access modes efficiently:
  Read-only
  Write
  Accumulate
MSA MMM Code

  // Initialization
  typedef MSA2D<double, MSA_NullA<double>, MSA_ROW_MAJOR> MSA2DRowMjr;
  typedef MSA2D<double, MSA_SumA<double>, MSA_COL_MAJOR> MSA2DColMjr;

  MSA2DRowMjr arr1(ROWS1, COLS1, NUMWORKERS, cacheSize1);  // row major
  MSA2DColMjr arr2(ROWS2, COLS2, NUMWORKERS, cacheSize2);  // column major
  MSA2DRowMjr prod(ROWS1, COLS2, NUMWORKERS, cacheSize3);  // product matrix

  for (unsigned int c = 0; c < COLS2; c++) {
    // Each thread computes a subset of rows of the product matrix
    for (unsigned int r = rowStart; r <= rowEnd; r++) {
      double result = 0.0;                      // thread-local accumulation
      for (unsigned int k = 0; k < COLS1; k++)
        result += arr1[r][k] * arr2[k][c];
      prod[r][c] = result;                      // global memory access
    }
  }
  prod.sync();                                  // new phase for matrix prod
Future Work
Compiler-directed optimizations
Performance and productivity analyses
Cross-module data exchange