2003 Michigan Technological University March 19, 2003 1 Steven Seidel Department of Computer Science...
Steven Seidel
Department of Computer Science
Michigan Technological University
[email protected]
March 19, 2003

Overview
Background
Collective operations in the UPC language
The V1.0 UPC collectives specification
Relocalization operations
Computational operations
Performance and implementation issues
Extensions
Other work

Background
UPC is an extension of C that provides a partitioned shared memory programming model.
The V1.1 UPC spec was adopted on March 25.
Processes in UPC are called threads.
Each thread has a private (local) address space.
All threads share a global address space that is partitioned among the threads.
A shared object that resides in thread i’s partition is said to have affinity to thread i.
If thread i has affinity to a shared object x, it is expected that accesses to x take less time than accesses to shared objects to which thread i does not have affinity.
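
UPC programming model
[Figure: the shared and local address spaces of threads th0, th1, th2. Blocks of A start at indices 0 and 15 on th0, 5 and 20 on th1, 10 and 25 on th2. Each thread has its own private i.]
    shared [5] int A[10*THREADS];
    int i;
    A[0]=7;         // A[0] now holds 7
    i=3;            // the private i now holds 3
    A[i]=A[0]+2;    // A[3] now holds 9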
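The block layout in the figure can be checked with a small plain-C sketch. It is not UPC code: `owner_thread` is a hypothetical helper name, and it assumes the usual round-robin rule that element i of `shared [block] T A[...]` has affinity to thread (i / block) mod THREADS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of UPC): thread that has affinity to
   element i of an array declared "shared [block] T A[...]" when the
   program runs with `threads` threads. Blocks are dealt round-robin. */
size_t owner_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}
```

With block 5 and three threads, indices 0 and 15 map to th0, 5 and 20 to th1, and 10 and 25 to th2, which agrees with the labels in the figure.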

Collective operations in UPC
If any thread calls a collective function, then all threads must also call that function.
Collective arguments are single-valued: corresponding function arguments have the same value on every thread.
V1.1 UPC contains several collective functions:
  upc_notify and upc_wait
  upc_barrier
  upc_all_alloc
  upc_all_lock_alloc
These collectives provide synchronization and memory allocation across all threads.

shared void *upc_all_alloc(nblocks, nbytes);
This function allocates shared [nbytes] char[nblocks*nbytes].
[Figure: every thread calls upc_all_alloc(4,5) and each thread's private p points to the same shared region. Blocks start at offsets 0 and 15 on th0, 5 on th1, and 10 on th2.]
    shared [5] char *p;
    p=upc_all_alloc(4,5);

The V1.0 UPC Collectives Spec
First draft by Wiebel and Greenberg, March 2002.
Spec discussed at the May 2002 and SC’02 UPC workshops.
Many helpful comments from Dan Bonachea and Brian Wibecan.
V1.0 will be released shortly.

Collective functions
Initialization
  upc_all_init
“Relocalization” collectives change data affinity.
  upc_all_broadcast
  upc_all_scatter
  upc_all_gather
  upc_all_gather_all
  upc_all_exchange
  upc_all_permute
“Computational” collectives for reduction and sorting.
  upc_all_reduce
  upc_all_prefix_reduce
  upc_all_sort

void upc_all_broadcast(dst, src, blk);
Thread 0 sends the same block of data to each thread.
[Figure: the src block of blk bytes on th0 is copied into the dst block on each of th0, th1, th2.]
    shared [] char src[blk];
    shared [blk] char dst[blk*THREADS];
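The data movement of the broadcast can be modeled in plain C. This is only a sequential sketch, not the UPC library: `sim_broadcast` is a hypothetical name, and the flat `dst` buffer stands in for `shared [blk] char dst[blk*THREADS]`, where thread t owns bytes t*blk through (t+1)*blk-1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the broadcast (hypothetical helper).
   src: the blk bytes held by thread 0.
   dst: blk*threads bytes; the t-th block belongs to thread t. */
void sim_broadcast(char *dst, const char *src, size_t blk, size_t threads)
{
    assert(dst != NULL && src != NULL);
    for (size_t t = 0; t < threads; ++t)
        memcpy(dst + t * blk, src, blk);   /* same block to every thread */
}
```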

void upc_all_scatter(dst, src, blk);
Thread 0 sends a unique block of data to each thread.
[Figure: src on th0 holds one block per thread; block t is copied into the dst block on thread t.]
    shared [] char src[blk*THREADS];
    shared [blk] char dst[blk*THREADS];
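A sequential plain-C model of the scatter, under the same caveats: `sim_scatter` and the `dst_on_thread` array are illustrative names, with one destination buffer per simulated thread so that ownership is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the scatter (hypothetical helper).
   src: blk*threads bytes, all held by thread 0.
   dst_on_thread[t]: the blk bytes of dst owned by thread t. */
void sim_scatter(char **dst_on_thread, const char *src,
                 size_t blk, size_t threads)
{
    assert(dst_on_thread != NULL && src != NULL);
    for (size_t t = 0; t < threads; ++t)
        memcpy(dst_on_thread[t], src + t * blk, blk);  /* block t -> thread t */
}
```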

void upc_all_gather(dst, src, blk);
Each thread sends a block of data to thread 0.
[Figure: the src block on each of th0, th1, th2 is copied into consecutive blocks of dst on th0.]
    shared [blk] char src[blk*THREADS];
    shared [] char dst[blk*THREADS];
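The gather is the inverse movement. In this sequential plain-C sketch, `sim_gather` and `src_on_thread` are hypothetical names; each simulated thread contributes one buffer and thread 0's `dst` receives them in thread order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the gather (hypothetical helper).
   src_on_thread[t]: the blk bytes of src owned by thread t.
   dst: blk*threads bytes, all held by thread 0. */
void sim_gather(char *dst, const char **src_on_thread,
                size_t blk, size_t threads)
{
    assert(dst != NULL && src_on_thread != NULL);
    for (size_t t = 0; t < threads; ++t)
        memcpy(dst + t * blk, src_on_thread[t], blk);  /* thread t -> block t */
}
```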

void upc_all_gather_all(dst, src, blk);
Each thread sends one block of data to all threads.
[Figure: src and dst blocks in the shared space of th0, th1, th2.]
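A sequential plain-C sketch of this pattern follows. The helper name and the per-thread buffers are assumptions for illustration; the sketch also assumes that each thread ends up with all blocks in thread order, which the slide does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of gather_all (hypothetical helper).
   src_on_thread[s]: the blk bytes contributed by thread s.
   dst_on_thread[t]: blk*threads bytes owned by thread t. */
void sim_gather_all(char **dst_on_thread, const char **src_on_thread,
                    size_t blk, size_t threads)
{
    assert(dst_on_thread != NULL && src_on_thread != NULL);
    for (size_t t = 0; t < threads; ++t)        /* every receiver */
        for (size_t s = 0; s < threads; ++s)    /* every sender   */
            memcpy(dst_on_thread[t] + s * blk, src_on_thread[s], blk);
}
```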

void upc_all_exchange(dst, src, blk);
Each thread sends a unique block of data to each thread.
[Figure: src and dst blocks in the shared space of th0, th1, th2.]
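The exchange is an all-to-all. The plain-C sketch below assumes the transpose-style placement (block j of thread i's src lands in block i of thread j's dst); the slide only says that every thread sends a distinct block to every thread, so the placement and the helper name `sim_exchange` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the exchange (hypothetical helper).
   src_on_thread[i] and dst_on_thread[j] each hold blk*threads bytes.
   Assumed placement: src block j on thread i -> dst block i on thread j. */
void sim_exchange(char **dst_on_thread, const char **src_on_thread,
                  size_t blk, size_t threads)
{
    assert(dst_on_thread != NULL && src_on_thread != NULL);
    for (size_t i = 0; i < threads; ++i)        /* sender   */
        for (size_t j = 0; j < threads; ++j)    /* receiver */
            memcpy(dst_on_thread[j] + i * blk,
                   src_on_thread[i] + j * blk, blk);
}
```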

void upc_all_permute(dst, src, perm, blk);
Thread i sends a block of data to thread perm(i).
[Figure: src, dst, and perm in the shared space of th0, th1, th2, with perm = 1, 2, 0.]
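The permutation rule can be sketched sequentially in plain C. `sim_permute` and the per-thread buffers are hypothetical names; `perm` is treated as an ordinary array with one entry per thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the permute (hypothetical helper).
   Thread i's blk-byte src block is copied to thread perm[i]'s dst block. */
void sim_permute(char **dst_on_thread, const char **src_on_thread,
                 const int *perm, size_t blk, size_t threads)
{
    assert(dst_on_thread != NULL && src_on_thread != NULL && perm != NULL);
    for (size_t i = 0; i < threads; ++i) {
        assert(perm[i] >= 0 && (size_t)perm[i] < threads);
        memcpy(dst_on_thread[perm[i]], src_on_thread[i], blk);
    }
}
```

With the figure's perm = 1, 2, 0, thread 0's block goes to thread 1, thread 1's to thread 2, and thread 2's to thread 0.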

Computational collectives
Reduce and prefix reduce
  One function for each C scalar type, e.g., upc_all_reduceI(…) returns an integer
  Operations
    +, *, &, |, XOR, &&, ||, min, max
    user-defined binary function
Sort
  User-defined comparison function
    void upc_all_sort(shared void *A, size_t size, size_t n, size_t blk,
                      int (*func)(shared void *, shared void *));

int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);
Thread 0 receives src[0] UPC_OP src[1] UPC_OP ... UPC_OP src[n-1].
[Figure: src holds 1, 2, 4, ..., 2048 in blocks of 3; blocks start at indices 0 and 9 on th0, 3 on th1, and 6 on th2. The per-thread partial sums are 3591 on th0, 56 on th1, and 448 on th2. All threads make the call and i on thread 0 receives 4095.]
    shared [3] int src[4*THREADS];
    int i;
    i=upc_all_reduceI(src,UPC_ADD,12,3,NULL);
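The numbers in the figure can be reproduced with a sequential plain-C sketch. `sim_reduce_add` is a hypothetical helper, not the UPC function; it forms one partial sum per simulated thread using the round-robin block rule and then combines them, which is one plausible strategy rather than the one any implementation is known to use.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of an additive reduce (hypothetical helper).
   partial[t] receives the sum of the elements owned by thread t under
   "shared [blk] int src[...]"; the return value is the overall sum. */
long sim_reduce_add(const int *src, size_t n, size_t blk,
                    size_t threads, long *partial)
{
    assert(blk > 0 && threads > 0 && partial != NULL);
    for (size_t t = 0; t < threads; ++t)
        partial[t] = 0;
    for (size_t i = 0; i < n; ++i)
        partial[(i / blk) % threads] += src[i];  /* owner accumulates */
    long total = 0;
    for (size_t t = 0; t < threads; ++t)
        total += partial[t];
    return total;
}
```

For the twelve powers of two with blk 3 and three threads, the partial sums are 3591, 56, and 448, and the total is 4095.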

void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
Element dst[k] receives src[0] UPC_OP src[1] UPC_OP ... UPC_OP src[k].
[Figure: src holds 1, 2, 4, 8, 16, 32, 64, 128, 256 and dst receives 1, 3, 7, 15, 31, 63, 127, 255, 511. Blocks of both arrays start at index 0 on th0, 3 on th1, and 6 on th2.]
    shared [*] int src[3*THREADS], dst[3*THREADS];
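The prefix sums in the figure follow from a running total. The plain-C sketch below is sequential and ignores distribution; `sim_prefix_add` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of an additive prefix reduce (hypothetical helper):
   dst[k] = src[0] + src[1] + ... + src[k]. */
void sim_prefix_add(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    int running = 0;
    for (size_t k = 0; k < n; ++k) {
        running += src[k];
        dst[k] = running;
    }
}
```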

Performance and implementation issues
“Push” or “pull”?
Synchronization semantics
Effects of data distribution

A “pull” implementation of upc_all_broadcast
[Figure: each of th0, th1, th2 copies the src block from th0 into its own dst block.]
    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
    }

A “push” implementation of upc_all_broadcast
[Figure: th0 steps i through 0, 1, 2 and copies its src block into the dst block on thread i.]
    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        int i;
        upc_forall( i=0; i<THREADS; ++i; 0 )    // Thread 0 only
            upc_memcpy( (shared char *)dst + i, (shared char *)src, blk );
    }

Synchronization semantics
When are function arguments ready?
When are function results available?

Synchronization semantics
Arguments with affinity to thread i are ready when thread i calls the function; results with affinity to thread i are ready when thread i returns.
This is appealing but it is incorrect: In a broadcast, thread 1 does not know when thread 0 is ready.
[Figure: src and dst blocks in the shared space of threads 0, 1, 2.]

Synchronization semantics
Require the implementation to provide barriers at function entry and exit.
This is convenient for the programmer but it is likely to adversely affect performance.
    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        upc_barrier;
        // pull
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
        upc_barrier;
    }

Synchronization semantics
V1.0 spec: Synchronization is a user responsibility.
    #define numelems 10
    shared [] int A[numelems];
    shared [numelems] int B[numelems*THREADS];

    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
    }
    ...
    // Initialize A.
    ...
    upc_barrier;
    upc_all_broadcast( B, A, sizeof(int)*numelems );
    upc_barrier;

Performance and implementation issues
Data distribution affects both performance and implementation.

void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
Thread k receives src[0] UPC_OP src[1] UPC_OP ... UPC_OP src[k].
[Figure: the same prefix reduce with the default cyclic distribution. Elements 0, 1, 2 are the first elements on th0, th1, th2. src holds 1, 8, 64 on th0; 2, 16, 128 on th1; 4, 32, 256 on th2. dst receives 1, 15, 127 on th0; 3, 31, 255 on th1; 7, 63, 511 on th2.]
    shared int src[3*THREADS], dst[3*THREADS];
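To see how the declaration changes which elements a thread holds, a plain-C sketch can list the global indices with affinity to one thread. `held_indices` is a hypothetical helper, and it assumes the round-robin block rule, with block size 1 for the default cyclic layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes to out[] the global indices (below n) that
   have affinity to `thread` under "shared [blk] T A[n]" and returns how
   many there are. out must have room for n entries. */
size_t held_indices(size_t *out, size_t n, size_t blk,
                    size_t threads, size_t thread)
{
    assert(out != NULL && blk > 0 && threads > 0);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if ((i / blk) % threads == thread)
            out[count++] = i;
    return count;
}
```

For nine elements on three threads, thread 0 holds indices 0, 3, 6 in the cyclic layout and 0, 1, 2 with blocks of three, so consecutive prefix results are neighbors on one thread only in the blocked case.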

Extensions
Strided copying
Vectors of offsets for src and dst arrays
Variable-sized blocks
Reblocking (cf. preceding example of prefix reduce)
    shared int src[3*THREADS];
    shared [3] int dst[3*THREADS];
    upc_forall(i=0; i<3*THREADS; i++; ?)
        dst[i] = src[i];
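The cost of reblocking can be estimated by counting the elements whose owner differs between the two layouts. This plain-C sketch does not answer the open affinity question marked "?" in the loop; `count_moved` is a hypothetical helper that assumes the round-robin block rule for both arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of indices i < n whose owning thread under
   block size src_blk differs from the owner under block size dst_blk. */
size_t count_moved(size_t n, size_t src_blk, size_t dst_blk, size_t threads)
{
    assert(src_blk > 0 && dst_blk > 0 && threads > 0);
    size_t moved = 0;
    for (size_t i = 0; i < n; ++i)
        if ((i / src_blk) % threads != (i / dst_blk) % threads)
            ++moved;
    return moved;
}
```

For nine elements on three threads, going from the cyclic src to the blocked dst moves six of the nine elements between threads, whichever thread performs each assignment.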

More sophisticated synchronization semantics
Consider the “pull” implementation of broadcast.
There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other.
Each thread does a pairwise synchronization with thread 0.
Thread i will not have to wait if it reaches its synchronization point after thread 0.
Thread 0 returns from the call after it has sync’d with each thread.

What’s next?
The V1.0 collective spec will be adopted in the next few weeks.
A reference implementation will be available from MTU immediately afterwards.

MuPC run-time system for UPC
UPC memory model (Chuck Wallace)
UPC programmability (Phil Merkey)
UPC test suite (Phil Merkey)
http://www.upc.mtu.edu