
Lecture 6. OpenMP

Prof. Taeweon Suh, Computer Science Education

Korea University

COM503 Parallel Computer Architecture & Programming


Clauses

• The OpenMP directives support a number of clauses, optional additions that control the behavior of the construct:

  shared()

  private()

  firstprivate()

  lastprivate()

  default()

  nowait

  schedule()


#pragma omp parallel [clause[[,] clause]…]

structured block
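As a sketch (not from the slides), several of these clauses can be combined on a single directive; the variable names below are illustrative only.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, offset = 10;
    int a[100];

    /* shared, private, firstprivate, and schedule clauses on one construct */
    #pragma omp parallel for shared(a) private(i) firstprivate(offset) schedule(static)
    for (i = 0; i < 100; i++)
        a[i] = i + offset;

    printf("a[99] = %d\n", a[99]);
    return 0;
}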


Shared Clause

• The shared clause is used to specify which data are shared among the threads. In many cases, variables are shared by default in OpenMP.

• An important implication is that multiple threads might attempt to simultaneously update the same memory location, or that one thread might try to read from a location that another thread is updating. Special care has to be taken to ensure that accesses to shared data are ordered.

  OpenMP places the synchronization responsibility on the user.

  OpenMP provides several synchronization constructs for this purpose.


#pragma omp parallel for shared(a)
for (i=0; i<n; i++)
    a[i] += i;          // all threads can read from and write to array a


Data Race Condition


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 128

int main(int argc, char *argv[])
{
    int Xshared;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel
    {
        int Xlocal = omp_get_thread_num();

        Xshared = omp_get_thread_num();     // data race: every thread writes Xshared

        if (Xlocal != Xshared)
            printf("Xlocal is %d; Xshared %d\n", Xlocal, Xshared);
    }
}


Data Race Condition


int compute(int n)
{
    int i;
    double h, x, sum;

    sum = 0.0;

    #pragma omp for reduction(+:sum) shared(h)
    for (i=0; i <= n; i++) {
        x = h * ((double)i - 0.5);      // data race: x is shared among the threads
        sum += (1.0 / (1.0 + x*x));
    }
    return(sum);
}
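One way to remove the race, as a sketch: make x private, using a combined parallel for so the fragment stands alone. The initialization of h below is an assumption for illustration; the slide's fragment leaves h undefined.

int compute(int n)
{
    int i;
    double h, x, sum;

    h   = 1.0 / (double) n;             /* assumed initialization; not in the slide */
    sum = 0.0;

    #pragma omp parallel for reduction(+:sum) private(x)
    for (i=0; i <= n; i++) {
        x = h * ((double)i - 0.5);      /* x is private here: no data race */
        sum += (1.0 / (1.0 + x*x));
    }
    return(sum);
}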


Private Clause

• Each variable in the private list is replicated in each thread, so each thread has exclusive access to a local copy of it. Changes made to the data by one thread are not visible to other threads.

• By default, the iteration variable of a parallel for is given the private attribute. However, it is recommended that programmers not rely on the OpenMP default rules.


#pragma omp parallel for shared(a) private(i)
for (i=0; i<n; i++)
    a[i] += i;          // all threads can read from and write to array a


OpenMP Program in Memory

• Each thread has its own stack for storing its private data

• Shared data are passed as arguments and referenced by their address in the threads

• threadprivate data can be stored either on the heap or on the local stack, depending on the implementation


Loop Iteration Variable


int i, j;

#pragma omp parallel for
for (i=0; i <= n; i++) {
    for (j=0; j <= m; j++) {
        a[i][j] = compute(i, j);
    }
}

• In C, the index variable of a parallel for is private

• But this does not extend to index variables at deeper nesting levels. Loop variable i is private by default, but this is not the case for j (j is shared by default).

  This results in undefined runtime behavior; the usual fixes are sketched below.
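A sketch of the two common fixes, reusing the declarations above: either list j in a private clause, or declare the inner index inside the loop so that it is automatically private.

/* Fix 1: make j explicitly private */
#pragma omp parallel for private(j)
for (i=0; i <= n; i++) {
    for (j=0; j <= m; j++) {
        a[i][j] = compute(i, j);
    }
}

/* Fix 2: declare the inner index inside the loop body */
#pragma omp parallel for
for (i=0; i <= n; i++) {
    int j;                      /* local to each iteration, hence private */
    for (j=0; j <= m; j++) {
        a[i][j] = compute(i, j);
    }
}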


Example


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define M 1024
#define N 1024*1024

int main()
{
    int i, j;
    int sum, local_sum;
    int *a;

    a = (int *) malloc(N*sizeof(int));
    sum = 0;

    for (i=0; i<N; i++) a[i] = i;

    #pragma omp parallel private(local_sum) num_threads(4)
    {
        local_sum = 0;

        //#pragma omp for private(j)
        #pragma omp for
        for (i=0; i<M; i++)
            for (j=0; j<N; j++)
                local_sum += a[j];

        #pragma omp critical
        {
            sum += local_sum;
        }
    }

    printf("\n sum is %d\n", sum);
    free(a);
}


Private Clause

• The values of private variables are undefined upon entry to and exit from the specific construct


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel private(n)
    {
        printf("n is %d; Thread %d\n", n, omp_get_thread_num());
    }
}


Firstprivate Clause

• Private data is undefined on entry to the construct where it is specified. This can be a problem if you want to pre-initialize private variables.

• OpenMP provides the firstprivate clause for such cases. Variables declared to be firstprivate are private variables, but they are pre-initialized with the value the variable has in the serial part of the code, just before the parallel construct.



Firstprivate Example


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define n 8

int main(void)
{
    int offset, i;
    int a[n];

    for (i=0; i<n; i++) a[i] = 1;
    for (i=0; i<n; i++) printf("Before: a[%d] = %d\n", i, a[i]);

    offset = 10;

    #pragma omp parallel for firstprivate(offset)
    for (i=0; i<n; i++) {
        if (i == 0 || i == n-1)
            offset = offset + 2;
        a[i] = a[i] + offset;
    }

    for (i=0; i<n; i++) printf("After: a[%d] = %d\n", i, a[i]);
}


Lastprivate Clause

• The lastprivate(list) clause ensures that the last value of a data object listed is accessible after the corresponding construct has completed execution

• In a parallel program, what does 'last' mean? In the case of a work-sharing loop, the object has the value from the loop iteration that would be the last in a sequential execution.

  If the lastprivate clause is used on a sections construct, the object gets assigned the value that it has at the end of the lexically last section.



Lastprivate Example


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;
    int i, a;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel for private(i) lastprivate(a)
    for (i=0; i<n; i++) {
        a = i + 1;
        printf("Thread %d has a value of a = %d for i = %d\n",
               omp_get_thread_num(), a, i);
    }

    printf("Value of a after parallel for: a = %d\n", a);
}


Lastprivate Clause

• A performance penalty is likely to be associated with the use of lastprivate, because the OpenMP library needs to keep track of which thread executes the last iteration

• In fact, all this clause really does is provide some extra convenience, since the same functionality can be implemented with an additional shared variable and some simple logic, as below


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;
    int i, a, a_shared;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel for private(i, a) shared(a_shared)
    for (i=0; i<n; i++) {
        a = i + 1;
        printf("Thread %d has a value of a = %d for i = %d\n",
               omp_get_thread_num(), a, i);
        if (i == n-1) a_shared = a;   // manually capture the sequentially last value
    }

    printf("Value of a after parallel for: a = %d\n", a_shared);
}


Default Clause

• The default clause is used to give variables a default data-sharing attribute. For example, default(shared) assigns the shared attribute to all variables referenced in the construct.

  In C/C++, the syntax is default(none) or default(shared); default(private) is not supported in C/C++.

• This clause is most often used to define the data-sharing attribute of the majority of the variables in a parallel region. Only the exceptions then need to be explicitly listed, as sketched below.

  For example: #pragma omp parallel for default(shared) private(a,b,c)
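A sketch of default(none), which is often the safer choice: the compiler then rejects any variable referenced in the region whose data-sharing attribute is not listed explicitly.

int i, n = 100;
int a[100];

/* with default(none), every variable used inside the construct must be listed */
#pragma omp parallel for default(none) shared(a, n) private(i)
for (i = 0; i < n; i++)
    a[i] = i;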



nowait Clause

• In the work-sharing constructs, there is an implicit barrier at the end. The nowait clause overrides that feature of OpenMP: if it is added to a construct, the barrier at the end of the associated construct is suppressed.

  The nowait clause allows the programmer to fine-tune a program's performance.

• In the following example, when a thread is finished with the work associated with the parallelized for loop, it continues and no longer waits for the other threads to finish as well.


#pragma omp for nowait
for (i=0; i<n; i++) {
    ………
}


Data Race Condition

#pragma omp parallel
{
    #pragma omp for nowait
    for (i=1; i<n; i++) {       // i starts at 1 so that a[i-1] stays in bounds
        b[i] = a[i] + a[i-1];
    }

    #pragma omp for nowait
    for (i=1; i<n; i++) {
        z[i] = sqrt(b[i]);
    }
}

• If n is not a multiple of the number of threads, even static scheduling could introduce a race condition

• OpenMP 2.5 spec: there are several algorithms for distributing the remaining iterations

  There is no guarantee that the same algorithm is used for different loops, so a thread may read a b[i] that another thread has not yet written; a safe variant is sketched below
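A sketch of the safe variant: keeping the implicit barrier on the first loop guarantees that all of b[] has been written before any thread reads it; only the last loop may then skip its barrier.

#pragma omp parallel
{
    #pragma omp for             /* implicit barrier: all of b[] is written... */
    for (i=1; i<n; i++) {
        b[i] = a[i] + a[i-1];
    }

    #pragma omp for nowait      /* ...so this loop may safely skip its barrier */
    for (i=1; i<n; i++) {
        z[i] = sqrt(b[i]);
    }
}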


Nested Parallelism

Correct: a nested parallel region provides a new team for the inner loop.

#pragma omp parallel shared(n,a,b)
{
    #pragma omp for
    for (i=0; i<n; i++) {
        a[i] = i + 1;

        #pragma omp parallel for
        for (j=0; j<n; j++)
            b[i][j] = a[i];
    }
} /*-- End of parallel region --*/

Incorrect: a work-sharing directive may not be nested inside another one without a new parallel region.

#pragma omp parallel shared(n,a,b)
{
    #pragma omp for
    for (i=0; i<n; i++) {
        a[i] = i + 1;

        #pragma omp for
        for (j=0; j<n; j++)
            b[i][j] = a[i];
    }
} /*-- End of parallel region --*/

• Work-sharing directives may be nested only by providing a new parallel region, as in the first version above

• To enable nested parallelism:

  Set the OMP_NESTED environment variable to TRUE (the default is FALSE)

  Or call the runtime library routine omp_set_nested() with a true (nonzero) or false (zero) argument


schedule Clause

• The schedule clause is supported on the loop construct only. It is used to control the manner in which loop iterations are distributed over the threads.

  Syntax: schedule(kind [, chunk_size])

• There are 4 kinds of schedule: static, dynamic, guided, and runtime. A sketch of the syntax follows.
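As a sketch of the syntax (work() and n are placeholders, not from the slides):

/* hand out iterations in chunks of 4, on demand */
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
    work(i);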



static

• It is the default on many OpenMP implementations

• Iterations are divided into chunks of size chunk_size. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size.

  The last chunk to be assigned may have a smaller number of iterations.

• The chunks are assigned to the threads statically, in the order of the thread number



dynamic

• The iterations are assigned to threads as the threads request them

• A thread executes a chunk of iterations (controlled through the chunk_size parameter), then requests another chunk until there are no more chunks to work on. The last chunk may have fewer iterations than chunk_size.

• When no chunk_size is specified, it defaults to 1



guided

• Similar to the dynamic schedule

• The difference between dynamic and guided is that with guided the size of the chunk (of iterations) decreases over time. The rationale behind this scheme is that larger chunks are desirable initially, because they reduce the overhead.

  Load balancing is often more of an issue toward the end of the computation

• The system then uses relatively small chunks to fill in the gaps in the schedule



guided

• For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations, divided by the number of threads, decreasing to 1

• For a chunk_size of k (k>1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (with a possible exception for the last chunk to be assigned)

• When no chunk_size is specified, it defaults to 1
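As a toy model (not from the slides) of the rule above: with chunk = ceil(remaining/nthreads), 100 iterations on 4 threads yield a steadily decreasing chunk sequence. Actual chunk sizes are implementation-defined.

#include <stdio.h>

int main(void)
{
    int remaining = 100, nthreads = 4;

    while (remaining > 0) {
        /* size of the next chunk: ceiling of remaining / nthreads */
        int chunk = (remaining + nthreads - 1) / nthreads;
        printf("chunk of %2d iterations (%3d remaining before)\n", chunk, remaining);
        remaining -= chunk;
    }
    return 0;   /* prints 25, 19, 14, 11, 8, ... down to 1 */
}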



runtime

• It is not always easy to select the appropriate schedule and value for chunk_size up front. The choice may depend not only on the code in the loop, but also on the specific problem size and on the number of threads used.

• The runtime clause is convenient: the decision regarding the scheduling kind is then made at run time. Instead of making a compile-time decision, the OMP_SCHEDULE environment variable can be used to choose the schedule and (optional) chunk_size at run time:

  export OMP_SCHEDULE="GUIDED,4"
  export OMP_SCHEDULE="DYNAMIC"



Graphical Illustration of Schedules

(figure illustrating the schedules is not reproduced in this transcript)


Synchronization Constructs

• OpenMP provides constructs that help to organize accesses to shared data by multiple threads. These can be used when the implicit barrier provided with the work-sharing constructs does not suffice to specify the required interactions, or would be inefficient.

• Synchronization constructs: barrier, ordered, critical, atomic, lock, master



barrier Construct

• Many OpenMP constructs imply a barrier. The compiler automatically inserts a barrier at the end of the construct.

  All threads wait there until all of the work associated with the construct has been completed.

  Thus, it is often unnecessary for the programmer to explicitly add a barrier to a code.

• In case one is required, OpenMP provides an explicit barrier construct:

  #pragma omp barrier

• The most common use of a barrier is to avoid a race condition: inserting a barrier between the writes to and the reads from a shared variable guarantees that the accesses are appropriately ordered.



barrier Example


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 9;
    int i, TID, a[10];

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel private(TID)
    {
        TID = omp_get_thread_num();

        if (TID < omp_get_num_threads()/2)
            system("sleep 3");

        printf("Before: %d\n", omp_get_thread_num());
        system("date");

        #pragma omp barrier

        printf("After: %d\n", omp_get_thread_num());
        system("date");
    }
}


Illegal Use of barrier


#pragma omp parallel
{
    if (omp_get_thread_num() == 0) {
        .......
        #pragma omp barrier
    } else {
        .......
        #pragma omp barrier
    }
}

• Each barrier region must be encountered by all threads in a team or by none at all

• Otherwise, the program deadlocks: here, thread 0 waits at one barrier while the other threads wait at a different one


barrier Implementation

• A straightforward way to implement a barrier is to have a shared counter that is initialized to reflect the number of threads in a team

• When a thread reaches the barrier, it decrements the counter atomically and waits until the counter reaches 0 (a sketch follows the example below)

#pragma omp parallel private(TID)
{
    TID = omp_get_thread_num();
    printf("Before: %d\n", omp_get_thread_num());
    system("date");

    #pragma omp barrier

    printf("After: %d\n", omp_get_thread_num());
    system("date");
}


ordered Construct

• The ordered construct allows for executing a structured block within a parallel loop in sequential order: #pragma omp ordered

  An ordered clause has to be added to the parallel loop directive to which this construct binds; it informs the compiler that the construct occurs.

  It is, for example, used to enforce an ordering on the printing of data computed by different threads.

• Note that the ordered clause and construct come with a performance penalty. The OpenMP implementation needs to perform additional bookkeeping to keep track of the order in which threads should execute the corresponding region.

  Moreover, if threads finish out of order, there may be an additional performance penalty, because some threads might have to wait.



ordered Example


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 9;
    int i, TID, a[10];

    omp_set_num_threads(NUM_THREADS);

    /* the ordered clause on the loop directive... */
    #pragma omp parallel for default(none) ordered schedule(runtime) \
            private(i, TID) shared(n, a)
    for (i=0; i<n; i++) {
        TID = omp_get_thread_num();

        printf("Thread %d updates a[%d]\n", TID, i);

        a[i] = i;

        /* ...and the ordered construct inside the loop body */
        #pragma omp ordered
        {
            printf("Thread %d prints value of a[%d] = %d\n", TID, i, a[i]);
        }
    }
}


critical Construct

• The critical construct provides a means to ensure that multiple threads do not attempt to update the same shared data simultaneously: #pragma omp critical [(name)]

  The associated code is referred to as a critical region (or a critical section)

  An optional name can be given to a critical construct



Critical Example


#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int n = 9;
    int i, TID;
    int sum, sumlocal;

    sum = 0;

    #pragma omp parallel private(i, TID, sumlocal) shared(n, sum)
    {
        TID = omp_get_thread_num();
        sumlocal = 0;

        #pragma omp for
        for (i=0; i<n; i++)
            sumlocal += i;

        #pragma omp critical
        {
            sum += sumlocal;
            printf("TID=%d: sumlocal = %d, sum = %d\n", TID, sumlocal, sum);
        }
    }
    printf("Value of sum after parallel region: %d\n", sum);
}


atomic Construct

• Similar to the critical construct, but it applies only to the (single) assignment statement that immediately follows it

• Examples:

  #pragma omp atomic
  ic += 1;

  #pragma omp atomic
  ic += bigfunc();

  // The atomic construct does not prevent multiple threads from executing the
  // function bigfunc() at the same time. Only the update to the memory
  // location of the variable ic occurs atomically.



Locks

• The OpenMP API provides a set of low-level, general-purpose locking runtime library routines. These routines provide greater flexibility for synchronization than the critical or atomic constructs do.

  Syntax:

  void omp_func_lock (omp_lock_t *lck)

  where func is one of: init, destroy, set, unset, test (omp_test_lock() actually returns an int)

• These routines operate on special-purpose lock variables, which should be accessed via the lock routines only. There are 2 types of locks: simple locks and nestable locks.

  Simple lock variables are declared with omp_lock_t

  Nestable lock variables are declared with omp_nest_lock_t



Locks

• The general procedure to use locks:

  Define the (simple or nestable) lock variables

  Initialize the lock: omp_init_lock()

  Set the lock: omp_set_lock(), omp_test_lock()

    omp_test_lock() checks whether the lock is actually available before attempting to set it

  Unset the lock: omp_unset_lock()

  Remove the lock: omp_destroy_lock()

• Special care has to be taken when the programmer synchronizes the actions of threads using these routines. If they are used improperly, a number of programming errors are possible (for example, a code may deadlock). A sketch of a nestable lock follows.



lock Example


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

omp_lock_t my_lock;

int main()
{
    omp_init_lock(&my_lock);

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        int i;

        for (i = 0; i < 2; ++i) {
            omp_set_lock(&my_lock);
            printf("Thread %d - starting locked region\n", tid);
            printf("Thread %d - ending locked region\n", tid);
            omp_unset_lock(&my_lock);
        }
    }

    omp_destroy_lock(&my_lock);
}


lock Example


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

omp_lock_t simple_lock;

int main()
{
    omp_init_lock(&simple_lock);

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();

        while (!omp_test_lock(&simple_lock))
            printf("Thread %d - failed to acquire simple_lock\n", tid);

        printf("Thread %d - acquired simple_lock\n", tid);
        printf("Thread %d - released simple_lock\n", tid);
        omp_unset_lock(&simple_lock);
    }

    omp_destroy_lock(&simple_lock);
}

http://msdn.microsoft.com/en-us/library/6e1yztt8(v=vs.110).aspx


master Construct

• The master construct defines a block of code that is guaranteed to be executed by the master thread only

#pragma omp master

structured block

It is similar to the single construct, but it does not have an implied barrier on entry or exit

The lack of a barrier may lead to problems

If the master construct is used to initialize data, for example, care should be taken that this initialization is completed before the other threads in the team use the data

The typical solution is either to rely on an implied barrier further down the execution stream or to use an explicit barrier construct



Incorrect Use of master


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int Xinit, Xlocal;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel
    {
        #pragma omp master
        {
            Xinit = 10;
        }

        Xlocal = Xinit;   // race: may read Xinit before the master has set it

        printf("TID %d, Xlocal = %d\n", omp_get_thread_num(), Xlocal);
    }
}
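A sketch of the fix suggested on the master Construct slide: an explicit barrier after the master block guarantees that Xinit is written before any thread reads it.

#pragma omp parallel
{
    #pragma omp master
    {
        Xinit = 10;
    }

    #pragma omp barrier     // all threads wait here until the master has set Xinit

    Xlocal = Xinit;         // now safe to read
    printf("TID %d, Xlocal = %d\n", omp_get_thread_num(), Xlocal);
}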


if Clause

• The if clause is supported on the parallel construct only, where it is used to specify conditional execution

• Since some overhead is inevitably incurred with the creation and termination of a parallel region, it is sometimes useful to test whether there is enough work in the region to warrant parallelizing it

• Syntax: if (scalar-logical-expression)

  If the logical expression evaluates to true, the parallel region is executed by a team of threads; if it evaluates to false, the region is executed by a single thread only. A sketch follows.
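A sketch (the threshold 5000 is arbitrary): parallelize only when the trip count makes the thread-management overhead worthwhile.

/* the region runs in parallel only for large n; otherwise one thread executes it */
#pragma omp parallel for if(n > 5000)
for (i = 0; i < n; i++)
    a[i] = a[i] + 1;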



reduction Clause

• In the critical example earlier, we used the critical construct to parallelize the summation operation

• There is a much easier way: OpenMP provides the reduction clause for specifying some forms of recurrence calculations, so that they can be performed in parallel without code modification.

  The programmer only has to identify the operation and the variables that will hold the results; the rest of the work is left to the compiler.

  The result variable is shared, and it is not necessary to specify it explicitly as shared.

  Syntax: reduction(operator: list)



Reduction Example


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 10;
    int i;
    int sum;

    omp_set_num_threads(NUM_THREADS);

    sum = 0;

    #pragma omp parallel for private(i) shared(n) reduction(+:sum)
    for (i=0; i<=n; i++)
        sum += i;

    printf("Value of sum after parallel region: %d\n", sum);
}


Supported Operators for reduction


Operator             Initial Value
+                    0
*                    1
-                    0
&  (bitwise AND)     ~0
|  (bitwise OR)      0
^  (bitwise XOR)     0
&& (logical AND)     1
|| (logical OR)      0


Reduction Example


#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int n = 8;
    int i;
    float multout, f[n];

    multout = 1.0;

    for (i=0; i<n; i++)                 /* i < n keeps the accesses within f[n] */
        f[i] = 0.1 + ((float) i) * 0.1;

    #pragma omp parallel for private(i) shared(n) reduction(*:multout)
    for (i=0; i<n; i++)
        multout *= f[i];

    printf("Multiplication output = %.16f\n", multout);
}


reduction Clause

• The order in which thread-specific values are combined is unspecified. For floating-point operations, there may therefore be numerical differences between the results of a sequential and a parallel run, or even between two parallel runs using the same number of threads.

  This is a result of the limited precision with which computers represent floating-point numbers.

• Results may vary slightly, depending on the order in which the operations are performed

• But it is not a cause for concern if the values are all of roughly the same magnitude

• Keep this in mind when using the reduction clause



flush Clause

• The OpenMP standard specifies that, at synchronization points in the program, all modifications are written back to main memory and are thus available to all threads

• Between these synchronization points, threads are permitted to keep new values for shared variables in their local memory rather than in the global shared memory

• Sometimes, updated values of shared variables must become visible to other threads in between synchronization points

• For this, the OpenMP API provides the flush directive: #pragma omp flush [(list)]

  The flush operation applies to all variables specified in the list

  If no list is provided, it applies to all thread-visible shared data



flush Clause

• If the flush operation is invoked by a thread that has updated the variables, their new values will be flushed to memory and therefore be accessible to all other threads

• If it is invoked by a thread that has not updated a value, it ensures that any local copies of the data are replaced by the latest value from main memory

• Implicit flush operations with no list occur at the following locations:

  All explicit and implicit barriers

    E.g., at the end of a parallel region or work-sharing construct

  Entry to and exit from critical regions

  Entry to and exit from lock routines


flush Example


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int new_data, local_data, TID, signal;

    omp_set_num_threads(NUM_THREADS);

    signal = 0;
    new_data = 0;

    #pragma omp parallel default(none) shared(signal, new_data) private(TID, local_data)
    {
        TID = omp_get_thread_num();

        if (TID == 3) {
            new_data = 10;
            signal = 1;
            #pragma omp flush(new_data, signal)
        } else {
            while (signal == 0) {
                #pragma omp flush(signal)   /* added: re-read signal each iteration */
            }
            #pragma omp flush(new_data)     /* added: make the producer's new_data visible */
            local_data = new_data;
            printf("Thread %d has %d\n", TID, local_data);
        }
    }
}


threadprivate Clause

• Global data is shared by default

• In some situations, we may need or prefer to have private data that persists throughout the computation. This is where the threadprivate directive comes in handy:

  #pragma omp threadprivate (list)

• The effect of the threadprivate directive is that the named global-lifetime objects are replicated, so that each thread has its own copy: each thread gets a private, local copy of the specified global variables



Serial Code


#include <stdio.h>
#include <stdlib.h>

int calculate_sum(int length);

int *pglobal;

int main()
{
    int i, j, sum, n = 5;
    int length[n];

    for (i=0; i<n; i++) length[i] = 10 * (i+1);

    for (i=0; i<n; i++) {
        if ( (pglobal = (int *) malloc(length[i]*sizeof(int))) != NULL ) {
            for (j=0; j<length[i]; j++) pglobal[j] = j+1;

            sum = calculate_sum(length[i]);

            printf("Value of sum for i = %d is %8d\n", i, sum);

            free(pglobal);
        }
    }
    return(0);
}

int calculate_sum(int length)
{
    int sum = 0;
    int j;

    for (j=0; j<length; j++)
        sum += pglobal[j];
    return(sum);
}


threadprivate Example


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define TRUE  1
#define FALSE 0

int calculate_sum(int length);

int *pglobal;

#pragma omp threadprivate(pglobal)

int main()
{
    int i, j, sum, TID, n = 5;
    int length[n];

    for (i=0; i<n; i++) length[i] = 10 * (i+1);

    #pragma omp parallel for shared(n,length) private(TID,i,j,sum)
    for (i=0; i<n; i++) {
        TID = omp_get_thread_num();

        if ( (pglobal = (int *) malloc(length[i]*sizeof(int))) != NULL ) {
            for (j=0; j<length[i]; j++) pglobal[j] = j+1;

            sum = calculate_sum(length[i]);

            printf("TID %d: value of sum for i = %d is %8d\n", TID, i, sum);

            free(pglobal);
        }
    } /*-- End of parallel for --*/
    return(0);
}

int calculate_sum(int length)
{
    int sum = 0;
    int j;

    for (j=0; j<length; j++)
        sum += pglobal[j];
    return(sum);
}


copyin Clause

• threadprivate variables are private variables (made private to each thread). Each thread has its own set of these variables. Just as with regular private data, their initial values are undefined.

• The copyin clause provides a means to copy the value of the master thread's threadprivate variables to the corresponding threadprivate variables of the other threads



copyin Example


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int x;

#pragma omp threadprivate(x)

int main(int argc, char *argv[])
{
    int tid;

    x = 33;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel private(tid) copyin(x)
    {
        tid = omp_get_thread_num();
        printf("Thread %d, x = %d\n", tid, x);
    }
}


copyprivate Clause

• The copyprivate clause is supported on the single directive only: #pragma omp single copyprivate(a, b, c)

• It provides a mechanism for broadcasting the value of a private variable from one thread to the other threads in the team

• The typical use is to have one thread initialize private data that is subsequently used by the other threads as well

• After the single construct has ended, but before the threads have left the associated barrier, the values of variables specified in the associated list are copied to the other threads



copyprivate Example

http://msdn.microsoft.com/en-us/library/bc1k0739.aspx

#include <stdio.h>
#include <omp.h>

float x, y, fGlobal = 1.0;

float get_float()
{
    fGlobal += 0.1;
    return fGlobal;
}

#pragma omp threadprivate(x, y)

int main()
{
    float a, b;

    #pragma omp parallel
    {
        #pragma omp single copyprivate(a, b, x, y)
        {
            a = get_float();
            b = get_float();
            x = get_float();
            y = get_float();
        }

        printf("Value = %f, thread = %d\n", a, omp_get_thread_num());
        printf("Value = %f, thread = %d\n", b, omp_get_thread_num());
        printf("Value = %f, thread = %d\n", x, omp_get_thread_num());
        printf("Value = %f, thread = %d\n", y, omp_get_thread_num());
    }
}


OpenMP Environment Variables

• The OpenMP standard provides several means with which the programmer can interact with the execution environment:

  OMP_NUM_THREADS
    export OMP_NUM_THREADS=16

  OMP_DYNAMIC controls dynamic adjustment of the number of threads used to execute future parallel regions
    export OMP_DYNAMIC="TRUE"

  OMP_NESTED
    export OMP_NESTED="TRUE"

  OMP_SCHEDULE
    export OMP_SCHEDULE="GUIDED,4"
    export OMP_SCHEDULE="DYNAMIC"



Backup




Firstprivate Clause Example

• Assume that each thread in a parallel region needs access to a thread-specific section of a vector, and that access starts at a certain (nonzero) offset


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define vlen 10

int main(int argc, char *argv[])
{
    int n = 2;
    int i;
    int TID, indx;
    int a[vlen];

    omp_set_num_threads(NUM_THREADS);

    for (i=0; i<vlen; i++) a[i] = -1;

    indx = 2;

    #pragma omp parallel firstprivate(indx) private(i, TID) shared(n, a)
    {
        TID = omp_get_thread_num();
        indx += n*TID;                  /* each thread offsets into its own section */
        for (i=indx; i<indx+n; i++)
            a[i] = TID + 1;
    }

    printf("After the parallel region:\n");
    for (i=0; i<vlen; i++)
        printf("a[%d] = %d\n", i, a[i]);
}


Alternative Solution

• If the variable indx is not updated any further, the following simpler and better solution is preferred


#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define vlen 10

int main(int argc, char *argv[])
{
    int n = 2;
    int i, offset;
    int TID, indx;
    int a[vlen];

    omp_set_num_threads(NUM_THREADS);

    for (i=0; i<vlen; i++) a[i] = -1;

    offset = 2;

    #pragma omp parallel private(i, TID, indx) shared(n, offset, a)
    {
        TID = omp_get_thread_num();
        indx = offset + n*TID;          /* indx is private and derived from the shared offset */
        for (i=indx; i<indx+n; i++)
            a[i] = TID + 1;
    }

    printf("After the parallel region:\n");
    for (i=0; i<vlen; i++)
        printf("a[%d] = %d\n", i, a[i]);
}