Unified Parallel C (UPC)

Transcript of Unified Parallel C (UPC)

Page 1: Unified Parallel C (UPC)

Unified Parallel C (UPC)
Costin Iancu

The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell,

P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome, K. Yelick

http://upc.lbl.gov

Slides edited by K. Yelick, T. El-Ghazawi, P. Husbands, C. Iancu

Page 2: Unified Parallel C (UPC)

Context

• Most parallel programs are written using either:
  – Message passing with a SPMD model (MPI)
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  – Shared memory with threads in OpenMP, Threads+C/C++/Fortran, or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance

• Partitioned Global Address Space (PGAS) languages take the best of both:
  – SPMD parallelism like MPI (performance)
  – Local/global distinction, i.e., layout matters (performance)
  – Global address space like threads (programmability)

• 3 Current languages: UPC (C), CAF (Fortran), and Titanium (Java)

• 3 New languages: Chapel, Fortress, X10

Page 3: Unified Parallel C (UPC)

Partitioned Global Address Space

• Shared memory is logically partitioned by processors
• Remote memory may stay remote: no automatic caching implied
• One-sided communication: reads/writes of shared variables
• Both individual and bulk memory copies
• Some models have a separate private memory area
• Distributed array generality and how they are constructed

[Figure: global address space across Thread0 … Threadn. The shared area holds X[0], X[1], …, X[P]; each thread's private area holds its own ptr:, which may point anywhere in the shared space.]

Page 4: Unified Parallel C (UPC)

Partitioned Global Address Space Languages

• Explicitly-parallel programming model with SPMD parallelism
  – Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
  – Allows the programmer to directly represent distributed data structures
• Address space is logically partitioned
  – Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  – Data layout and communication
• Performance transparency and tunability are goals
  – Initial implementation can use fine-grained shared memory

Page 5: Unified Parallel C (UPC)

Current Implementations

• A successful language/library must run everywhere
• UPC
  – Commercial compilers: Cray, SGI, HP, IBM
  – Open source compilers: LBNL/UCB (source-to-source), Intrepid (gcc)
• CAF
  – Commercial compilers: Cray
  – Open source compilers: Rice (source-to-source)
• Titanium
  – Open source compilers: UCB (source-to-source)
• Common tools
  – Open64 open source research compiler infrastructure
  – ARMCI, GASNet for distributed memory implementations
  – Pthreads, System V shared memory

Page 6: Unified Parallel C (UPC)

Talk Overview

• UPC Language Design
  – Data Distribution (layout, memory management)
  – Work Distribution (data parallelism)
  – Communication (implicit, explicit, collective operations)
  – Synchronization (memory model, locks)
• Programming in UPC
  – Performance (one-sided communication)
  – Application examples: FFT, PC
  – Productivity (compiler support)
  – Performance tuning and modeling

Page 7: Unified Parallel C (UPC)

UPC Overview and Design

• Unified Parallel C (UPC) is:
  – An explicit parallel extension of ANSI C, with a common and familiar syntax and semantics for parallel C and simple extensions to ANSI C
  – A partitioned global address space (PGAS) language
  – Based on ideas in Split-C, AC, and PCP
• Similar to the C language philosophy
  – Programmers are clever and careful, and may need to get close to the hardware
    • to get performance, but
    • can get in trouble
• SPMD execution model: (THREADS, MYTHREAD), static vs. dynamic threads
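As a minimal illustration of the SPMD model (a standard introductory example, not from these slides), every thread executes the same main() and distinguishes itself with MYTHREAD:

    #include <upc.h>
    #include <stdio.h>

    int main(void) {
        /* All THREADS copies of the program run this; MYTHREAD names this copy. */
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;               /* wait for everyone before exiting */
        return 0;
    }

In the static THREADS environment the thread count is fixed at compile time; in the dynamic environment it is fixed at program start-up.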

Page 8: Unified Parallel C (UPC)

Data Distribution

Page 9: Unified Parallel C (UPC)

Data Distribution

[Figure: same global address space picture as before — the shared area holds the scalar ours: (with affinity to thread 0) and the array elements X[0], X[1], …, X[P]; each thread's private area holds its own mine: and its own copy of ptr:.]

• Distinction between memory spaces through extensions of the type system (shared qualifier)

    shared int ours;
    shared int X[THREADS];
    shared int *ptr;
    int mine;

• Data in shared address space:
  – Static: scalars (T0), distributed arrays
  – Dynamic: dynamic memory management (upc_alloc, upc_global_alloc, upc_all_alloc)
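A small sketch (not from the slides; N and the variable names are arbitrary) of the dynamic shared-memory routines named above:

    #include <upc.h>

    #define N 1024

    shared int *global_buf;   /* private pointer to shared, one copy per thread */

    int main(void) {
        /* Collective: all threads call it, all receive the same shared pointer.
           Allocates THREADS blocks of N ints, one block with affinity to each thread. */
        global_buf = (shared int *) upc_all_alloc(THREADS, N * sizeof(int));

        /* Non-collective: allocates N ints in the calling thread's shared space. */
        shared int *my_buf = (shared int *) upc_alloc(N * sizeof(int));

        /* upc_global_alloc is the non-collective analogue of upc_all_alloc:
           one thread allocates a distributed region and may hand out the pointer. */

        upc_barrier;
        upc_free(my_buf);
        if (MYTHREAD == 0) upc_free(global_buf);
        return 0;
    }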

Page 10: Unified Parallel C (UPC)

Data Layout

• Data layout is controlled through extensions of the type system (layout specifiers):
  – [0] or [] (indefinite layout, all on 1 thread): shared [] int *p;
  – Empty (cyclic layout): shared int array[THREADS*M];
  – [*] (blocked layout): shared [*] int array[THREADS*M];
  – [b] or [b1][b2]…[bn] = [b1*b2*…*bn] (block-cyclic): shared [B] int array[THREADS*M];
• Element array[i] has affinity with thread (i / B) % THREADS
• Layout determines pointer arithmetic rules
• Introspection: upc_threadof, upc_phaseof, upc_blocksizeof
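A sketch (not from the slides; M, B, and the array names are arbitrary) connecting the layout specifiers to the affinity formula and the introspection routines:

    #include <upc.h>
    #include <stdio.h>

    #define M 4
    #define B 3

    shared     int cyc[THREADS*M];   /* cyclic: element i on thread i % THREADS           */
    shared [*] int blk[THREADS*M];   /* blocked: M consecutive elements per thread        */
    shared [B] int bc [THREADS*M];   /* block-cyclic: element i on thread (i/B) % THREADS */

    int main(void) {
        if (MYTHREAD == 0) {
            int i;
            for (i = 0; i < THREADS * M; i++)
                printf("bc[%d] lives on thread %d at phase %d\n",
                       i, (int) upc_threadof(&bc[i]), (int) upc_phaseof(&bc[i]));
        }
        return 0;
    }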

Page 11: Unified Parallel C (UPC)

UPC Pointers Implementation

• In UPC, pointers to shared objects have three fields:
  – thread number
  – local address of block
  – phase (specifies position in the block)

• Example: Cray T3E implementation

• Pointer arithmetic can be expensive in UPC

[Figure: Cray T3E pointer format — phase in bits 63–49, thread in bits 48–38, virtual address in bits 37–0.]

Page 12: Unified Parallel C (UPC)

UPC Pointers

                        Points to local    Points to shared
  Resides in private    PP (p1)            PS (p2)
  Resides in shared     SP (p3)            SS (p4)

    int *p1;               /* private pointer to local memory */
    shared int *p2;        /* private pointer to shared space */
    int *shared p3;        /* shared pointer to local memory  */
    shared int *shared p4; /* shared pointer to shared space  */

Page 13: Unified Parallel C (UPC)

UPC Pointers

    int *p1;               /* private pointer to local memory */
    shared int *p2;        /* private pointer to shared space */
    int *shared p3;        /* shared pointer to local memory  */
    shared int *shared p4; /* shared pointer to shared space  */

[Figure: global address space across Thread0 … Threadn. Each thread's private area holds its own p1: and p2:; p3: and p4: live in the shared area. p1 points into private memory, while p2, p3, and p4 point into shared memory, possibly on another thread.]

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.

Page 14: Unified Parallel C (UPC)

Common Uses for UPC Pointer Types

int *p1;
  • These pointers are fast (just like C pointers)
  • Use to access local data in code performing local work
  • Often cast a pointer-to-shared to one of these to get faster access to shared data that is local

shared int *p2;
  • Use to refer to remote data
  • Larger and slower due to test-for-local plus possible communication

int *shared p3;
  • Not recommended

shared int *shared p4;
  • Use to build shared linked structures, e.g., a linked list

• typedef is the UPC programmer’s best friend

Page 15: Unified Parallel C (UPC)

UPC Pointers Usage Rules

• Pointer arithmetic supports blocked and non-blocked array distributions

• Casting of shared to private pointers is allowed, but not vice versa!

• When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer to shared may be lost

• Casting of shared to local is well defined only if the object pointed to by the pointer to shared has affinity with the thread performing the cast
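A sketch of the cast idiom these rules permit (array name and shape chosen for illustration): each thread casts its own block of a shared array to an ordinary C pointer and works on it at full speed.

    #include <upc.h>

    #define N 64

    shared [N] int data[THREADS][N];   /* row t has affinity with thread t */

    int main(void) {
        int i;
        /* Legal: data[MYTHREAD][0] has affinity with the casting thread.
           The thread component of the pointer-to-shared is dropped. */
        int *local = (int *) &data[MYTHREAD][0];

        for (i = 0; i < N; i++)
            local[i] = MYTHREAD;       /* plain C accesses, no shared-pointer overhead */

        upc_barrier;
        return 0;
    }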

Page 16: Unified Parallel C (UPC)

Work Distribution

Page 17: Unified Parallel C (UPC)

Work Distribution: upc_forall()

• Owner-computes rule: loop over all, work on those owned by you
• UPC adds a special type of loop:

    upc_forall(init; test; step; affinity) statement;

• Programmer indicates the iterations are independent
  – Undefined if there are dependencies across threads
• Affinity expression indicates which iterations to run on each thread. It may have one of two types:
  – Integer: affinity%THREADS == MYTHREAD
  – Pointer: upc_threadof(affinity) == MYTHREAD
• Syntactic sugar for:
    for (i=MYTHREAD; i<N; i+=THREADS) …
  or for:
    for (i=0; i<N; i++) if (MYTHREAD == i%THREADS) …
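For concreteness, a textbook-style vector addition (not on this slide; array names and size are illustrative) using the pointer form of the affinity expression, so the owner of sum[i] executes iteration i:

    #include <upc.h>

    #define N 1024

    shared int v1[N], v2[N], sum[N];   /* default cyclic layout */

    int main(void) {
        int i;
        /* Only the thread with affinity to &sum[i] runs iteration i;
           iterations must be independent across threads. */
        upc_forall (i = 0; i < N; i++; &sum[i])
            sum[i] = v1[i] + v2[i];
        return 0;
    }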

Page 18: Unified Parallel C (UPC)

Inter-Processor Communication

Page 19: Unified Parallel C (UPC)

Data Communication

• Implicit (assignments):
    shared int *p;
    …
    *p = 7;

• Explicit (bulk synchronous) point-to-point (upc_memget, upc_memput, upc_memcpy, upc_memset)

• Collective operations http://www.gwu.edu/~upc/docs/

  – Data movement: broadcast, scatter, gather, …
  – Computational: reduce, prefix, …
  – Interface has synchronization modes: (??)

• Avoid over-synchronizing (barrier before/after is simplest semantics, but may be unnecessary)

• Data being collected may be read/written by any thread simultaneously
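A sketch contrasting the implicit and explicit bulk styles (a neighbor exchange; the array names and sizes are illustrative, not from the slides):

    #include <upc.h>

    #define N 256

    shared [N] double grid[THREADS][N];   /* one block per thread */
    double copy[N];

    int main(void) {
        int right = (MYTHREAD + 1) % THREADS;

        /* Implicit communication: an ordinary assignment through shared types. */
        grid[right][0] = (double) MYTHREAD;
        upc_barrier;

        /* Explicit bulk communication: fetch the neighbor's whole block at once. */
        upc_memget(copy, &grid[right][0], N * sizeof(double));

        upc_barrier;
        return 0;
    }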

Page 20: Unified Parallel C (UPC)

Data Communication

• The UPC Language Specification V 1.2 does not contain non-blocking communication primitives

• Extensions for non-blocking communication available in the BUPC implementation

• UPC V1.2 does not have higher level communication primitives for point-to-point communication.

• See BUPC extensions for:
  – scatter, gather
  – VIS (vector, indexed, strided transfers)

• Should non-blocking communication be a first class language citizen?

Page 21: Unified Parallel C (UPC)

Synchronization

Page 22: Unified Parallel C (UPC)

Synchronization

• Point-to-point synchronization: locks
  – Opaque type: upc_lock_t*
  – Dynamically managed: upc_all_lock_alloc, upc_global_lock_alloc

• Global synchronization:
  – Barriers (unaligned): upc_barrier
  – Split-phase barriers:
      upc_notify;   // this thread is ready for the barrier
      … do computation unrelated to the barrier …
      upc_wait;     // wait for the others to be ready
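A short sketch (not from the slides) exercising both mechanisms: a collectively allocated lock protecting a shared counter, and a split-phase barrier hiding unrelated work:

    #include <upc.h>

    shared int counter;        /* has affinity with thread 0 */
    upc_lock_t *lock;          /* same lock handle on every thread */

    int main(void) {
        lock = upc_all_lock_alloc();   /* collective allocation */

        upc_lock(lock);                /* mutual exclusion around the update */
        counter += 1;
        upc_unlock(lock);

        upc_notify;                    /* signal: this thread has reached the barrier */
        /* ... computation unrelated to the barrier ... */
        upc_wait;                      /* wait for all the others */

        if (MYTHREAD == 0) upc_lock_free(lock);
        return 0;
    }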

Page 23: Unified Parallel C (UPC)

Memory Consistency in UPC

• The consistency model defines the order in which one thread may see another thread's accesses to memory
  – If you write a program with un-synchronized accesses, what happens?
  – Does this work?
      Thread 1:  data = …;              Thread 2:  while (!flag) { };
                 flag = 1;                         … = data;   // use the data
• UPC has two types of accesses:
  – Strict: will always appear in order
  – Relaxed: may appear out of order to other threads
• Consistency is associated either with a program scope (file, statement):
      { #pragma upc strict
        flag = 1; }
  or with a type:
      shared strict int flag;
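A sketch of the flag idiom made safe with a strict type (the thread numbers are illustrative): the strict accesses order the surrounding relaxed ones.

    #include <upc_relaxed.h>    /* relaxed is the default for unqualified shared accesses */

    shared int data;
    shared strict int flag;     /* all accesses to flag are strict */

    int main(void) {
        if (MYTHREAD == 0) {
            data = 42;          /* relaxed write */
            flag = 1;           /* strict write: cannot appear to precede the write of data */
        } else if (MYTHREAD == 1) {
            while (!flag) ;     /* strict reads: not hoisted, poll until the producer signals */
            int x = data;       /* guaranteed to see 42 */
            (void) x;
        }
        return 0;
    }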

Page 24: Unified Parallel C (UPC)

Sample UPC Code

Page 25: Unified Parallel C (UPC)

Matrix Multiplication in UPC

• Given two integer matrices A (N×P) and B (P×M), we want to compute C = A × B.

• Entries c_ij in C are computed by the formula:

    c_ij = Σ_{l=1}^{P} a_il · b_lj

Page 26: Unified Parallel C (UPC)

Serial C code

#define N 4
#define P 4
#define M 4

int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

void main (void) {
  int i, j, l;
  for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];
    }
  }
}

Page 27: Unified Parallel C (UPC)

Domain Decomposition

• A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS as shown below:
• B (P × M) is decomposed column-wise into M/THREADS blocks as shown below:

[Figure: rows of A — Thread 0 gets elements 0 .. (N*P/THREADS)-1, Thread 1 gets (N*P/THREADS) .. (2*N*P/THREADS)-1, …, Thread THREADS-1 gets ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS)-1. Columns of B — Thread 0 gets columns 0 .. (M/THREADS)-1, …, Thread THREADS-1 gets columns ((THREADS-1)*M)/THREADS .. M-1.]

• Note: N and M are assumed to be multiples of THREADS
• Exploits locality in matrix multiplication

Page 28: Unified Parallel C (UPC)

UPC Matrix Multiplication Code

#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4

// data distribution: a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P] = {1,..,16}, c[N][M];
shared [M/THREADS] int b[P][M] = {0,1,0,1,…,0,1};

void main (void) {
  int i, j, l;                                  // private variables
  upc_forall (i = 0; i < N; i++; &c[i][0]) {    // work distribution
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];             // implicit communication
    }
  }
}

Page 29: Unified Parallel C (UPC)

UPC Matrix Multiplication With Block Copy

#include <upc_relaxed.h>

// a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P], c[N][M];
shared [M/THREADS] int b[P][M];
int b_local[P][M];

void main (void) {
  int i, j, l;                                  // private variables

  // explicit bulk communication
  upc_memget(b_local, b, P*M*sizeof(int));

  // work distribution (c aligned with a??)
  upc_forall (i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b_local[l][j];
    }
  }
}

Page 30: Unified Parallel C (UPC)

Programming in UPC

Don’t ask yourself what can my compiler do for me, ask yourself what can I do for my compiler!

Page 31: Unified Parallel C (UPC)

Principles of Performance Software

To minimize the cost of communication:
• Use the best available communication mechanism on a given machine
• Hide communication by overlapping (programmer, compiler, or runtime)
• Avoid synchronization using data-driven execution (programmer or runtime)
• Tune communication using performance models when they work (??); search when they don't (programmer or compiler/runtime)

Page 32: Unified Parallel C (UPC)

Best Available Communication Mechanism

• Performance is determined by overhead, latency and bandwidth

• Data transfer (one-sided communication) is often faster than (two-sided) message passing

• Semantics limit performance:
  – In-order message delivery
  – Message and tag matching
  – Need to acquire information from the remote host processor
  – Synchronization (message receipt) tied to data transfer

Page 33: Unified Parallel C (UPC)

One-Sided vs Two-Sided: Theory

• A two-sided message needs to be matched with a receive to identify the memory address where the data is put
  – Offloaded to the network interface in networks like Quadrics
  – Need to download match tables to the interface (from the host)
• A one-sided put/get message can be handled directly by a network interface with RDMA support
  – Avoids interrupting the CPU or storing data from the CPU (preposts)

[Figure: a two-sided message carries a message id plus the data payload and must be matched on the receiving host; a one-sided put message carries the destination address plus the data payload, so the network interface can deposit it directly into host memory.]

Page 34: Unified Parallel C (UPC)

GASNet: Portability and High-Performance

[Chart: 8-byte roundtrip latency in microseconds (down is good) on Elan3, Elan4, Myrinet, G5/IB, Opteron/IB, SP/Fed, and XT3, comparing MPI ping-pong with GASNet put+sync.]

GASNet is better for overhead and latency across machines.

UPC Group; GASNet design by Dan Bonachea

Page 35: Unified Parallel C (UPC)

GASNet: Portability and High-Performance

[Chart: flood bandwidth for 4 KB messages as a percentage of hardware peak (up is good) on Elan3, Elan4, Myrinet, G5/IB, Opteron/IB, SP/Fed, and XT3, MPI vs. GASNet.]

GASNet excels at mid-range sizes: important for overlap.

[Chart: flood bandwidth for 2 MB messages as a percentage of hardware peak (up is good) on the same platforms, MPI vs. GASNet.]

GASNet is at least as high (comparable) for large messages.

Joint work with UPC Group; GASNet design by Dan Bonachea

Page 36: Unified Parallel C (UPC)

One-Sided vs. Two-Sided: Practice

[Chart: bandwidth in MB/s (up is good) for message sizes from 16 bytes to 256 KB, comparing GASNet non-blocking put with MPI flood bandwidth, with the GASNet/MPI speedup (0–2.5x) on a secondary axis. Measured on the NERSC Jacquard machine with Opteron processors.]

• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!

Yelick, Hargrove, Bonachea

Page 37: Unified Parallel C (UPC)

Overlap

Page 38: Unified Parallel C (UPC)

Hide Communication by Overlapping

• A programming model that decouples data transfer and synchronization (init, sync)

• BUPC has several extensions: (programmer)
  – explicit handle based (see the sketch after this list)
  – region based
  – implicit handle based
• Examples:
  – 3D FFT (programmer)
  – split-phase optimizations (compiler)
  – automatic overlap (runtime)
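A sketch of the explicit-handle style; the upc_memget_nb/upc_sync names follow the vector-add example later in these slides and the optional non-blocking transfer library, and the array names and strip size B are illustrative (exact spellings differ between BUPC extension versions):

    #include <upc.h>
    #include <upc_nb.h>                /* optional non-blocking transfer library; availability varies */

    #define N 4096
    #define B 512                      /* strip size, chosen for illustration */

    shared [N] double remote[THREADS][N];
    double buf[N];

    int main(void) {
        upc_handle_t h[N / B];
        int peer = (MYTHREAD + 1) % THREADS;
        int i;

        /* Initiate all strips up front (decoupled init)... */
        for (i = 0; i < N / B; i++)
            h[i] = upc_memget_nb(buf + i * B, &remote[peer][i * B], B * sizeof(double));

        /* ...and sync each strip only when it is needed, overlapping the
           remaining transfers with computation on the strips already here. */
        for (i = 0; i < N / B; i++) {
            upc_sync(h[i]);
            /* compute on buf[i*B .. (i+1)*B - 1] */
        }
        return 0;
    }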

Page 39: Unified Parallel C (UPC)

Performing a 3D FFT

• NX × NY × NZ elements spread across P processors
• Will use a 1-dimensional layout in the Z dimension
  – Each processor gets NZ / P "planes" of NX × NY elements per plane

[Figure: 1D partition, example with P = 4 — the NX × NY × NZ cube is sliced along Z into slabs of thickness NZ/P owned by p0, p1, p2, p3.]

Bell, Nishtala, Bonachea, Yelick

Page 40: Unified Parallel C (UPC)

Performing a 3D FFT (part 2)

• Perform an FFT in all three dimensions
• With the 1D layout, 2 out of the 3 dimensions are local while the last (Z) dimension is distributed

Step 1: FFTs on the columns (all elements local)
Step 2: FFTs on the rows (all elements local)
Step 3: FFTs in the Z dimension (requires communication)

Bell, Nishtala, Bonachea, Yelick

Page 41: Unified Parallel C (UPC)

Performing the 3D FFT (part 3)

• Can perform Steps 1 and 2 since all the data is available without communication

• Perform a global transpose of the cube
  – Allows step 3 to continue


Bell, Nishtala, Bonachea, Yelick

Page 42: Unified Parallel C (UPC)

Communication Strategies for 3D FFT

• Three approaches:
  – Chunk:
    • Wait for 2nd-dimension FFTs to finish
    • Minimize # messages
  – Slab:
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  – Pencil:
    • Send each row as it completes
    • Maximize overlap and match the natural layout

[Figure labels: chunk = all rows with the same destination; slab = all rows in a single plane with the same destination; pencil = 1 row.]

Bell, Nishtala, Bonachea, Yelick

Page 43: Unified Parallel C (UPC)

NAS FT Variants Performance Summary

• Slab is always best for MPI; small-message cost too high
• Pencil is always best for UPC; more overlap

[Chart: best MFlop rates per thread for all NAS FT benchmark versions (up is good) on Myrinet (64 procs), InfiniBand (256), Elan3 (256 and 512), and Elan4 (256 and 512), comparing Best NAS Fortran/MPI (Chunk, NAS FT with FFTW), Best MPI (always slabs), and Best UPC (always pencils); a 0.5 Tflop/s annotation marks the top UPC result.]

Page 44: Unified Parallel C (UPC)

Bisection Bandwidth Limits

• Full bisection bandwidth is (too) expensive
• During an all-to-all communication phase:
  – Effective (per-thread) bandwidth is a fractional share
  – Significantly lower than link bandwidth
  – Use smaller messages mixed with computation to avoid swamping the network

Bell, Nishtala, Bonachea, Yelick

Page 45: Unified Parallel C (UPC)

Compiler Optimizations

• Naïve scheme (blocking call for each load/store) not good enough

• PRE on shared expressions
  – Reduce the amount of unnecessary communication
  – Applies also to UPC shared pointer arithmetic
• Split-phase communication
  – Hide communication latency through overlapping
• Message coalescing
  – Reduce the number of messages to save startup overhead and achieve better bandwidth

Chen, Iancu, Yelick

Page 46: Unified Parallel C (UPC)

Benchmarks

• Gups
  – Random access (read/modify/write) to a distributed array
• Mcop
  – Parallel dynamic programming algorithm
• Sobel
  – Image filter
• Psearch
  – Dynamic load balancing/work stealing
• Barnes-Hut
  – Shared-memory-style code from SPLASH2
• NAS FT/IS
  – Bulk communication

Page 47: Unified Parallel C (UPC)

Performance Improvements

[Chart: percent improvement over the unoptimized code for the benchmarks above.]

Chen, Iancu, Yelick

Page 48: Unified Parallel C (UPC)

Data Driven Execution

Page 49: Unified Parallel C (UPC)

Data-Driven Execution

• Many algorithms require synchronization with a remote processor
  – Mechanisms: (BUPC extensions)

• Signaling store: Raise a semaphore upon transfer

• Remote enqueue: Put a task in a remote queue

• Remote execution: Floating functions (X10 activities)

• Many algorithms have irregular data dependencies (LU)
  – Mechanisms (BUPC extensions)

• Cooperative multithreading

Page 50: Unified Parallel C (UPC)

Matrix Factorization

[Figure: the matrix blocks are 2D block-cyclic distributed. The completed parts of L and U border the trailing matrix still to be updated; the panel being factored involves communication for pivoting, and the trailing-matrix update (blocks A(i,j), A(i,k), A(j,i), A(j,k)) is matrix-matrix multiplication that can be coalesced.]

Husbands,Yelick

Page 51: Unified Parallel C (UPC)

Three Strategies for LU Factorization

Husbands,Yelick

• Organize in bulk-synchronous phases (ScaLAPACK)
  • Factor a block column, then perform updates
  • Relatively easy to understand/debug, but extra synchronization
• Overlapping phases (HPL):
  • Work associated with one block-column factorization can be overlapped
  • Parameter determines how many (need temp space accordingly)
• Event-driven multithreaded (UPC Linpack):
  • Each thread runs an event-handler loop
  • Tasks: factorization (w/ pivoting), update trailing, update upper
  • Tasks may suspend (voluntarily) to wait for data, synchronization, etc.
  • Data moved with remote gets (synchronization built in)
  • Must "gang" together for factorizations
  • Scheduling priorities are key to performance and deadlock avoidance

Page 52: Unified Parallel C (UPC)

UPC-HP Linpack Performance

[Charts: GFlop/s for UPC vs. MPI/HPL on a Cray X1 (64 and 128 processors), an Opteron cluster (64 processors), and an Altix (32 processors), and for UPC vs. ScaLAPACK on 2x4 and 4x4 processor grids.]

• Comparable to HPL (numbers from the HPCC database)
• Faster than ScaLAPACK due to less synchronization
• Large scaling of the UPC code on Itanium/Quadrics (Thunder):
  – 2.2 TFlops on 512p and 4.4 TFlops on 1024p

Husbands, Yelick

Page 53: Unified Parallel C (UPC)

Performance Tuning

Iancu, Strohmaier

Page 54: Unified Parallel C (UPC)

Efficient Use of One-Sided

• Implementations: need to be efficient and have scalable performance

• Application-level use of non-blocking communication benefits from new design techniques: finer-grained decompositions and overlap

• Overlap exercises the system in "unexpected" ways

• Prototyping implementations for large-scale systems is a hard problem: networks behave non-linearly, and communication scheduling is NP-hard

• Need a methodology for fast prototyping:
  – understand the network/CPU interaction at large scale

Page 55: Unified Parallel C (UPC)

Performance Tuning

• Performance is determined by overhead, latency and bandwidth, computational characteristics and communication topology

• It’s all relative: Performance characteristics are determined by system load

• Basic principles
  – Minimize communication overhead
  – Avoid congestion:
    • control the injection rate (end-point)
    • avoid hotspots (end-points, network routes)

• Have to use models. What kinds of questions can a model answer?

Page 56: Unified Parallel C (UPC)

Example: Vector-Add

// Version 1: one blocking bulk transfer
shared double *rdata;
double *ldata, *buf;

upc_memget(buf, rdata, N);
for (i = 0; i < N; i++)
  ldata[i] += buf[i];

// Version 2: strip-mined non-blocking transfers, overlapped with computation
for (i = 0; i < N/B; i++)
  h[i] = upc_memget_nb(buf + i*B, rdata + i*B, B);

for (i = 0; i < N/B; i++) {
  sync(h[i]);
  for (j = 0; j < B; j++)
    ldata[i*B + j] += buf[i*B + j];
}

// Pipelined communication schedule (b transfers kept in flight at a time):
//   GET_nb(B0) … GET_nb(Bb)
//   GET_nb(Bb+1) … GET_nb(B2b)
//   sync(B0); compute(B0) … sync(Bb); compute(Bb)
//   GET_nb(B2b+1) … GET_nb(B3b)
//   …
//   sync(BN); compute(BN)

Which implementation is faster? What are B and b?

Page 57: Unified Parallel C (UPC)

Prototyping

• Usual approach: use a time-accurate performance model (applications, automatically tuned collectives)
  – Models (LogP, …) don't capture important behavior (parallelism, congestion, resource constraints, non-linear behavior)
  – Exhaustive search of the optimization space
  – Validated only at low concurrency (~tens of procs); might break at high concurrency, might break for torus networks

• Our approach:
  – Use a performance model for the ideal implementation
  – Understand hardware resource constraints and the variation of performance parameters (understand trends, not absolute values)
  – Derive implementation constraints to satisfy both the optimal implementation and the hardware constraints
  – Force implementation parameters to converge towards the optimal

Page 58: Unified Parallel C (UPC)

Performance

• Network: bandwidth and overhead

• Application: communication pattern and schedule (congestion), computation

Overhead is determined by message size, communication schedule, and hardware flow of control.

Bandwidth is determined by message size, communication schedule, and fairness of allocation.

Iancu, Strohmaier

Page 59: Unified Parallel C (UPC)

Validation

• Understand network behavior in the presence of non-blocking communication (InfiniBand, Elan)

• Develop a performance model for scenarios widely encountered in applications (p2p, scatter, gather, all-to-all) and a variety of aggressive optimization techniques (strip mining, pipelining, communication schedule skewing)

• Use both micro-benchmarks and application kernels to validate the approach

Iancu, Strohmaier

Page 60: Unified Parallel C (UPC)

Findings

• Can choose optimal values for implementation parameters

• A time-accurate model for an implementation is hard to develop and inaccurate at high concurrency

• The methodology does not require exhaustive search of the optimization space (only p2p and the qualitative behavior of gather)

• In practice one can produce templatized implementations for an algorithm and use our approach to determine optimal values: code generation (UPC), automatic tuning of collective operations, application development

• Need to further understand the mathematical and statistical properties

Page 61: Unified Parallel C (UPC)

End

Page 62: Unified Parallel C (UPC)

Avoid synchronization: Data-driven Execution

• Many algorithms require synchronization with remote processor

• What is the right mechanism in a PGAS model for doing this?

• Is it still one-sided?

Part 3: Event-Driven Execution Models

Page 63: Unified Parallel C (UPC)

Mechanisms for Event-Driven Execution

• A "put" operation does a send-side notification
  – Needed for memory consistency model ordering
• Need to signal the remote side on completion
• Strategies:
  – Have the remote side do a "get" (works in some algorithms)
  – Put + strict flag write: do a put, wait for completion, then do another (strict) put (see the sketch after this list)
  – Pipelined put + put-flag: works only on ordered networks
  – Signaling put: add a new "store" operation that embeds the signal (a 2nd remote address) into a single message
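A sketch of the second strategy (put + strict flag write) in plain UPC; the buffer names and sizes are illustrative. Each thread pushes a payload to its right neighbor and then spins on its own strict flag:

    #include <upc_relaxed.h>

    #define N 1024

    shared [N] double box[THREADS][N];   /* landing buffer, one block per thread          */
    shared strict int ready[THREADS];    /* one strict flag per thread, zero-initialized  */

    double payload[N];

    int main(void) {
        int peer = (MYTHREAD + 1) % THREADS;

        /* Producer side: relaxed bulk put, then a strict store as the signal.
           The strict store cannot appear to any thread to precede the put. */
        upc_memput(&box[peer][0], payload, N * sizeof(double));
        ready[peer] = 1;

        /* Consumer side: spin on the local strict flag, then use the data. */
        while (!ready[MYTHREAD]) ;
        /* box[MYTHREAD][...] now holds the left neighbor's payload */

        upc_barrier;
        return 0;
    }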

Page 64: Unified Parallel C (UPC)

Mechanisms for Event-Driven Execution

Signalling Store

[Chart: time in microseconds vs. payload size from 1 byte to 100,000 bytes, comparing memput signal, upc memput + strict flag, pipelined memput (unsafe), MPI n-byte ping/pong, and MPI send/recv.]

Preliminary results on Opteron + Infiniband