
UPC Workshop

George Washington University

May 6-7, 2003

May 6, 2003 | 2003 Michigan Technological University

The V1.0 UPC Collectives Spec

First draft by Wiebel and Greenberg, March 2002.

Spec discussed at the May 2002 and SC’02 UPC workshops.

Many helpful comments from Dan Bonachea and Brian Wibecan.

pre4V1.0, dated April 2, is now on the table.

Collective functions

Initialization

upc_all_init

5.3 “Relocalization” collectives change data affinity. These are byte-oriented operations.

upc_all_broadcast, upc_all_scatter, upc_all_gather, upc_all_gather_all, upc_all_exchange, upc_all_permute

5.4 “Computational” collectives for reduction and sorting. These operations respect data type and blocksize.

upc_all_reduce, upc_all_prefix_reduce, upc_all_sort

Remaining collectives spec issues (large and small)

Wording used to specify the affinity of certain arguments

{signed} option for types supported by reduce and prefix reduce operations

What requirements are made of the phase of function arguments?

Associativity of reduce and prefix reduce operations

Commutativity of reduce and prefix reduce operations

Can nbytes be 0 in 5.3 functions?

What are the synchronization semantics?

Wording used to specify the affinity of certain arguments

Resolved: The target of the src/dst pointer must have affinity to thread 0.

This applies to distributed arrays, such as the targets of a broadcast and scatter, and the source of a gather.

{signed} option for types supported by reduce and prefix reduce operations

“signed char” and “char” are separate and incompatible types.

Resolved: Remove the brackets around all signed keywords for all the types. Arguments of type “char” are treated in an implementation-dependent manner.

Resolved: Remove references to “ASCII values” since these equivalents are already specified by ANSI C.

What requirements are made of the phase of function arguments?

Resolved: Remove the “common” statement regarding phase.

Resolved: To the 5.3 functions add: “The src and dst arguments are treated as if they have zero phase.”

Resolved: To the 5.4 functions add: “The phase field for the X argument is respected when referencing array elements.”

Associativity and commutativity of reduce and prefix reduce operations

All provided reduction operators are assumed to be associative and commutative. All reduction operators (except those provided using the UPC_NONCOMM_FUNC) are assumed to be commutative.

The operation op is always assumed to be associative. All predefined operations are also assumed to be commutative. Users may define operations that are assumed to be associative, but not commutative. The “canonical” evaluation order of a reduction is in the order of array indices. However, the implementation may take advantage of associativity, or associativity and commutativity, in order to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition.

Advice to implementors. It is strongly recommended that the function be implemented so that the same result is obtained whenever the function is applied to the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of processors.
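The remark about floating point addition can be checked with a small plain-C sketch (ordinary C, not UPC). The function names are hypothetical and only illustrate two evaluation orders of the same sum: the canonical index order, and a reversed order that an implementation relying on associativity and commutativity might happen to use.

```c
#include <stddef.h>

/* Sum in index order: the "canonical" evaluation order. */
double sum_in_index_order(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Sum from the last index down to the first: one reordering that is
   only legal if the operator is associative and commutative. */
double sum_in_reverse_order(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = n; i > 0; --i)
        acc += x[i - 1];
    return acc;
}
```

For the input {1e20, -1e20, 1.0} the index-order sum is 1.0, while the reversed sum is 0.0 because the 1.0 is absorbed when it is added to -1e20 first.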

Alternative synchronization semantics

1a) The collective function may begin to read or write data when any thread enters the collective function.

1b) The collective function may begin to read or write data with affinity to a thread when that thread enters the collective function.

1c) The collective function may begin to read or write data when all threads have entered the collective function.

2a) The collective function may exit before the operation is complete. The operation is guaranteed to be complete at the beginning of the next synchronization phase.

2b) The collective function may return in a thread when all reads and writes with affinity to the thread are complete.

2c) The operation is complete when any thread exits the collective function.

3) Each collective function implements any pair (1x, 2y) of synchronization requirements based on the argument UPC_SYNC_SEM.

Synch semantic naming ideas

UPC_BEGIN_ON_{ANY, MINE, ALL}_COMPLETE_{LATER, MINE, ALL}
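One way to picture the nine combinations is as a pair of small enumerations packed into a single integer argument. The sketch below is purely illustrative: the enumerator names, the numeric values, and the helper functions are assumptions, not part of the draft spec. It reads ANY, MINE, ALL as entry options 1a, 1b, 1c and LATER, MINE, ALL as exit options 2a, 2b, 2c.

```c
/* Hypothetical encoding of the (entry, exit) synchronization pair. */
enum begin_sem    { BEGIN_ON_ANY = 0, BEGIN_ON_MINE = 1, BEGIN_ON_ALL = 2 };
enum complete_sem { COMPLETE_LATER = 0, COMPLETE_MINE = 1, COMPLETE_ALL = 2 };

/* Pack one entry option and one exit option into a single value 0..8. */
int sync_sem(enum begin_sem b, enum complete_sem c)
{
    return (int)b * 3 + (int)c;
}

/* Recover the entry option from a packed value. */
enum begin_sem sync_begin(int sem)
{
    return (enum begin_sem)(sem / 3);
}

/* Recover the exit option from a packed value. */
enum complete_sem sync_complete(int sem)
{
    return (enum complete_sem)(sem % 3);
}
```

With this packing, a strict call with entry and exit barriers would correspond to sync_sem(BEGIN_ON_ALL, COMPLETE_ALL), and the loosest combination to sync_sem(BEGIN_ON_ANY, COMPLETE_LATER).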

Can nbytes be 0 in 5.3 functions?

Resolved: Yes. Use the variable name numbytes to distinguish it from nbytes in the allocation functions. Add a statement that if numbytes is 0 then the function is a no-op.

1. Synchronization phase

“Arguments to each call to a collective function must be ready at the beginning of the synchronization phase in which the call is made. Results of each call to a collective function are not ready until the beginning of the next synchronization phase.”

This is a policy that can be relaxed as implementations demonstrate that fewer constraints lead to better performance.

This is an easy-to-remember semantic.

2. Bill’s strict semantic

On input, no data will be accessed until all threads enter the collective function. On exit, all output will be written before any thread exits the collective function.

3. Affinity-based semantics

Source data with affinity to a thread must be ready when that thread calls the collective function.

Destination data with affinity to a thread will be ready when that thread returns from the collective function.

Version A: Provide two versions of each collective. Provide distinct function names:

“strict”: guarantee Bill’s strict semantics;

“relaxed”: affinity-based semantics

Version B: Only the “relaxed” affinity-based version is provided; the user provides explicit barriers to guarantee safety.

4. “Split-phase” semantics

Split-phase collectives. How can the split-phase concept be extended to describe the synchronization semantics of the collective functions?

What are the synchronization semantics?

Resolution A: Provide two versions of each collective.

By distinct function names:

“strict”: guaranteed entry and exit barriers;

“relaxed”: affinity-based semantics applies

Resolution B: Only the “relaxed” affinity-based version is provided; the user provides explicit barriers to guarantee safety.

void upc_all_broadcast(dst, src, blk);

[Figure: shared and local memory of threads th0, th1, th2. The src block of size blk on th0 is copied into the dst block of every thread.]

Thread 0 sends the same block of data to each thread.

shared [] char src[blk];

shared [blk] char dst[blk*THREADS];
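The data movement can be mimicked in plain C (not UPC) by modelling each thread's portion of dst as one row of a two-dimensional array. THREADS, MAXBLK, and sim_broadcast are names assumed for this sketch only; it shows where the bytes end up and says nothing about synchronization.

```c
#include <stddef.h>
#include <string.h>

#define THREADS 3  /* assumed thread count, matching th0..th2 in the figure */
#define MAXBLK  8  /* assumed upper bound on blk for this simulation */

/* dst[t] stands for the blk bytes of dst with affinity to thread t;
   src stands for the single block that lives on thread 0. */
void sim_broadcast(char dst[THREADS][MAXBLK], const char *src, size_t blk)
{
    for (int t = 0; t < THREADS; ++t)
        memcpy(dst[t], src, blk);  /* every thread receives the same block */
}
```

A call with blk equal to 0 copies nothing, which matches the no-op behaviour resolved for numbytes.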

void upc_all_scatter(dst, src, blk);

[Figure: shared and local memory of threads th0, th1, th2. The src array on th0 is split into blocks, and block t is copied into the dst block of thread t.]

Thread 0 sends a unique block of data to each thread.

shared [] char src[blk*THREADS];

shared [blk] char dst[blk*THREADS];
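A plain-C sketch of the same data movement, under the same modelling assumptions as before (one array row per thread; THREADS, MAXBLK, and sim_scatter are hypothetical names):

```c
#include <stddef.h>
#include <string.h>

#define THREADS 3  /* assumed thread count */
#define MAXBLK  8  /* assumed upper bound on blk */

/* src holds THREADS consecutive blocks on thread 0;
   dst[t] stands for the block of dst with affinity to thread t. */
void sim_scatter(char dst[THREADS][MAXBLK], const char *src, size_t blk)
{
    for (int t = 0; t < THREADS; ++t)
        memcpy(dst[t], src + (size_t)t * blk, blk);  /* block t goes to thread t */
}
```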

void upc_all_gather(dst, src, blk);

[Figure: shared and local memory of threads th0, th1, th2. The src block of each thread t is copied into block t of the dst array on th0.]

Each thread sends a block of data to thread 0.

shared [blk] char src[blk*THREADS];

shared [] char dst[blk*THREADS];
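Gather is the mirror image of scatter. The sketch below uses the same plain-C model (hypothetical names, one row per thread) and only illustrates the final placement of the blocks.

```c
#include <stddef.h>
#include <string.h>

#define THREADS 3  /* assumed thread count */
#define MAXBLK  8  /* assumed upper bound on blk */

/* src[t] stands for the block of src with affinity to thread t;
   dst is the contiguous array of THREADS blocks on thread 0. */
void sim_gather(char *dst, char src[THREADS][MAXBLK], size_t blk)
{
    for (int t = 0; t < THREADS; ++t)
        memcpy(dst + (size_t)t * blk, src[t], blk);  /* thread t fills block t */
}
```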

void upc_all_gather_all(dst, src, blk);

[Figure: shared and local memory of threads th0, th1, th2. The src block of each thread is copied into the dst area of every thread.]

Each thread sends one block of data to all threads.
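The slide gives no declarations for this function, so the layout in the following plain-C sketch is an assumption: each thread owns one src block and a dst area large enough for THREADS blocks. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define THREADS 3  /* assumed thread count */
#define MAXBLK  8  /* assumed upper bound on blk */

/* src[s] is the block owned by thread s; dst[t] is the area owned by
   thread t, which ends up holding the blocks of all threads in order. */
void sim_gather_all(char dst[THREADS][THREADS * MAXBLK],
                    char src[THREADS][MAXBLK], size_t blk)
{
    for (int t = 0; t < THREADS; ++t)          /* receiving thread */
        for (int s = 0; s < THREADS; ++s)      /* sending thread */
            memcpy(dst[t] + (size_t)s * blk, src[s], blk);
}
```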

void upc_all_exchange(dst, src, blk);

[Figure: shared and local memory of threads th0, th1, th2. Each thread's src area holds one block per thread, and each block is copied to a different thread's dst area.]

Each thread sends a unique block of data to each thread.
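Exchange amounts to a block transpose. The plain-C sketch below assumes that block r of thread s's src lands in block s of thread r's dst; the slide does not spell out the block ordering, so treat that convention and all names as assumptions.

```c
#include <stddef.h>
#include <string.h>

#define THREADS 3  /* assumed thread count */
#define MAXBLK  8  /* assumed upper bound on blk */

/* src[s] and dst[r] each hold THREADS blocks owned by one thread. */
void sim_exchange(char dst[THREADS][THREADS * MAXBLK],
                  char src[THREADS][THREADS * MAXBLK], size_t blk)
{
    for (int s = 0; s < THREADS; ++s)          /* sending thread */
        for (int r = 0; r < THREADS; ++r)      /* receiving thread */
            memcpy(dst[r] + (size_t)s * blk,   /* block s on the receiver */
                   src[s] + (size_t)r * blk,   /* block r on the sender */
                   blk);
}
```

With blk equal to 1 and src rows "abc", "def", "ghi", the dst rows become "adg", "beh", "cfi".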

void upc_all_permute(dst, src, perm, blk);

[Figure: shared and local memory of threads th0, th1, th2, with perm values 1, 2, 0 on th0, th1, th2. The src block of thread i is copied into the dst block of thread perm(i).]

Thread i sends a block of data to thread perm(i).
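A plain-C sketch of the permutation, using the perm values 1, 2, 0 from the figure in the usage note. The names are hypothetical, and perm is assumed to be a true permutation of 0..THREADS-1 so that no two threads write the same destination.

```c
#include <stddef.h>
#include <string.h>

#define THREADS 3  /* assumed thread count */
#define MAXBLK  8  /* assumed upper bound on blk */

/* src[i] and dst[i] stand for the blocks with affinity to thread i. */
void sim_permute(char dst[THREADS][MAXBLK], char src[THREADS][MAXBLK],
                 const int perm[THREADS], size_t blk)
{
    for (int i = 0; i < THREADS; ++i)
        memcpy(dst[perm[i]], src[i], blk);  /* thread i sends to perm[i] */
}
```

With perm = {1, 2, 0}, the block of th0 lands on th1, the block of th1 on th2, and the block of th2 on th0.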

int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);

shared [3] int src[4*THREADS];

int i;

i=upc_all_reduceI(src,UPC_ADD,12,3,NULL);

[Figure: src is distributed in blocks of 3 across th0, th1, th2 and holds 1, 2, 4, ..., 2048. Every thread makes the same call. The per-thread partial sums are 3591 (th0), 56 (th1), and 448 (th2); the result 4095 is stored in i on thread 0.]

Thread 0 receives $\mathrm{UPC\_OP}_{i=0}^{n-1}\ \mathrm{src}[i]$.
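The numbers in the figure can be reproduced with a plain-C sketch that assigns element i to thread (i / blk) % THREADS, forms one partial sum per thread, and then combines the partial sums. The function name and the two-step structure are assumptions for illustration; an actual implementation may combine values in a different order.

```c
#define THREADS 3  /* assumed thread count, as in the figure */

/* Adds n ints laid out with block size blk. partial[t] receives the sum
   of the elements with affinity to thread t; the total is returned. */
long sim_reduce_add(const int *src, int n, int blk, long partial[THREADS])
{
    for (int t = 0; t < THREADS; ++t)
        partial[t] = 0;
    for (int i = 0; i < n; ++i)
        partial[(i / blk) % THREADS] += src[i];  /* owner of element i */

    long total = 0;
    for (int t = 0; t < THREADS; ++t)
        total += partial[t];
    return total;
}
```

For src = 1, 2, 4, ..., 2048 with n = 12 and blk = 3, the partial sums are 3591, 56, and 448, and the total is 4095.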

void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

shared [*] int src[3*THREADS], dst[3*THREADS];

[Figure: src and dst are distributed in blocks of 3 across th0, th1, th2, so the blocks start at indices 0, 3, 6. src holds 1, 2, 4, 8, 16, 32, 64, 128, 256; dst holds the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511.]

Thread k receives $\mathrm{UPC\_OP}_{i=0}^{k}\ \mathrm{src}[i]$.
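One common way to compute a blocked prefix sum is in two passes: each thread scans its own block, then adds the totals of all lower-numbered threads. The plain-C sketch below follows that scheme for the layout on this slide (3 elements per thread). It is an illustrative assumption, not the method prescribed by the spec, and the names are hypothetical.

```c
#define THREADS 3  /* assumed thread count */
#define PER     3  /* elements per thread, as in shared [*] int src[3*THREADS] */

/* Inclusive prefix sum: dst[k] = src[0] + ... + src[k]. */
void sim_prefix_add(int dst[THREADS * PER], const int src[THREADS * PER])
{
    int block_total[THREADS];

    /* Pass 1: each thread scans its own block independently. */
    for (int t = 0; t < THREADS; ++t) {
        int run = 0;
        for (int j = 0; j < PER; ++j) {
            run += src[t * PER + j];
            dst[t * PER + j] = run;
        }
        block_total[t] = run;
    }

    /* Pass 2: add the totals of all lower-numbered threads. */
    int offset = 0;
    for (int t = 0; t < THREADS; ++t) {
        for (int j = 0; j < PER; ++j)
            dst[t * PER + j] += offset;
        offset += block_total[t];
    }
}
```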

A “pull” implementation of upc_all_broadcast

    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        // Every thread copies the src block into its own block of dst.
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
    }

[Figure: shared and local memory of threads th0, th1, th2. Each of the threads 0, 1, 2 copies the src block on th0 into its own dst block.]

A “push” implementation of upc_all_broadcast

    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        int i;
        upc_forall( i=0; i<THREADS; ++i; 0 ) // Thread 0 only
            // Thread 0 copies the src block into the dst block of thread i.
            upc_memcpy( (shared char *)dst + i, (shared char *)src, blk );
    }

[Figure: shared and local memory of threads th0, th1, th2. Thread 0 loops over i = 0, 1, 2 and copies its src block into the dst block of thread i.]

void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

shared int src[3*THREADS], dst[3*THREADS];

[Figure: src and dst are distributed cyclically across th0, th1, th2, so elements 0, 1, 2 are the first elements on th0, th1, th2. src holds 1, 2, 4, 8, 16, 32, 64, 128, 256; dst holds the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511.]

Thread k receives $\mathrm{UPC\_OP}_{i=0}^{k}\ \mathrm{src}[i]$.

Extensions

Strided copying

Vectors of offsets for src and dst arrays

Variable-sized blocks

Reblocking (cf: preceding example of prefix reduce)

    shared int src[3*THREADS];

    shared [3] int dst[3*THREADS];

    upc_forall(i=0; i<3*THREADS; i++; ?)

        dst[i] = src[i];
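What the reblocking loop does to the data can be seen by writing out the owner and local offset of element i under each layout. In the plain-C sketch below, row t of each array stands for the elements with affinity to thread t; the names are hypothetical, and the sketch leaves the affinity expression of the upc_forall (the "?" above) open.

```c
#define THREADS 3  /* assumed thread count */
#define PER     3  /* elements per thread in both arrays */

/* src is cyclic (block size 1); dst has block size 3. */
void sim_reblock(int dst_local[THREADS][PER], int src_local[THREADS][PER])
{
    for (int i = 0; i < PER * THREADS; ++i) {
        int s_thr = i % THREADS;          /* cyclic: owner of src[i] */
        int s_off = i / THREADS;          /* position within that thread */
        int d_thr = (i / PER) % THREADS;  /* blocked: owner of dst[i] */
        int d_off = i % PER;              /* position within the block */
        dst_local[d_thr][d_off] = src_local[s_thr][s_off];
    }
}
```

Only elements whose source and destination owners coincide (for example i = 0, 4, and 8 with three threads) stay on the same thread; every other assignment moves data between threads.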

More sophisticated synchronization semantics

Consider the “pull” implementation of broadcast.

There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other.

Each thread does a pairwise synchronization with thread 0.

Thread i will not have to wait if it reaches its synchronization point after thread 0.

Thread 0 returns from the call after it has sync’d with each thread.

What requirements are made of the phase of function arguments?

Resolved: Remove the “common” statement regarding phase.

Resolved: To the 5.3 functions add: “The src and dst arguments are treated as if they have zero phase.”

Resolved: To the 5.4 functions add: “The phase field for the X argument is respected when referencing array elements.” Suitably define “respected”.

Note that “respecting” the phase requires over 20 integer operations to compute the address of an arbitrary array element given:

a shared void * array address of arbitrary phase

an element index (offset)

the blocksize and element size
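To give a feel for where those integer operations come from, here is a plain-C sketch of advancing a position in a blocked shared array by i elements. The struct and function names are hypothetical, the index i is assumed to be non-negative, and the thread and phase formulas follow the usual UPC rule for pointer-to-shared arithmetic. A real implementation would additionally scale by the element size and add a base address.

```c
#include <stddef.h>

#define THREADS 3  /* assumed thread count */

/* Position of one element: owning thread, phase within its block, and
   element index within the owning thread's local portion. */
struct shared_pos {
    size_t thread;
    size_t phase;
    size_t local_elem;
};

/* Advance p by i elements in an array with block size blk. */
struct shared_pos shared_advance(struct shared_pos p, size_t i, size_t blk)
{
    size_t blocks = (p.phase + i) / blk;  /* block boundaries crossed */
    size_t hops   = p.thread + blocks;    /* thread steps before wrapping */
    struct shared_pos q;

    q.phase  = (p.phase + i) % blk;
    q.thread = hops % THREADS;
    /* Back up to the start of the current block, skip one local block
       per full trip around the threads, then add the new phase. */
    q.local_elem = p.local_elem - p.phase + (hops / THREADS) * blk + q.phase;
    return q;
}
```

Starting from thread 0, phase 0 with blk = 3, advancing by 10 elements lands on thread 0 at phase 1, which is local element 4 (the second local block).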

Commutativity of reduce and prefix reduce operations

All reduction operators (except those provided using the UPC_NONCOMM_FUNC) are assumed to be commutative. A commutative reduction operator whose result is dependent on a particular order of execution has undefined results.